Overview

kprobe is a kernel-level incident forensics platform for production systems. It records what the Linux kernel did during an incident, correlates those events with application traces, and explains how low-level runtime behavior caused a user-facing failure.

When a payment, query, job, or request times out, conventional observability usually shows only the symptom:

the request crossed its latency budget
a dependency was slow
a service returned an error
a retry storm began

kprobe answers the next question:

What happened underneath the application that made it fail?

What kprobe does

kprobe installs a recorder on each Linux host or Kubernetes node. The recorder uses eBPF to capture kernel events such as:

TCP send, receive, and retransmit activity
read and write syscalls
scheduler switches
page faults and memory pressure
block I/O issue and completion events

kprobe then attaches service context to those events. A low-level kernel event becomes part of a real incident:

payment_id=pay_8f21
service=ledger-service
operation=database_write
kernel_event=block_io
duration=842ms

The result is a causal graph that connects the application symptom to the kernel-level cause.
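
The enrichment step above can be sketched in a few lines of Python. This is a minimal illustration, not kprobe's actual correlation logic: it assumes a kernel event can be matched to a trace span by process ID and time window, and the names KernelEvent, Span, find_span, and enrich are hypothetical. The PID and timestamps are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelEvent:
    kind: str          # e.g. "block_io"
    pid: int           # process that triggered the event
    start_ns: int      # when the event started
    duration_ms: int   # how long it took


@dataclass(frozen=True)
class Span:
    service: str
    operation: str
    pid: int
    start_ns: int
    end_ns: int
    attributes: dict = field(default_factory=dict)  # e.g. request IDs


def find_span(event: KernelEvent, spans: Sequence[Span]) -> Optional[Span]:
    """Return the narrowest span from the same process that contains the event start."""
    candidates = [
        s for s in spans
        if s.pid == event.pid and s.start_ns <= event.start_ns <= s.end_ns
    ]
    return min(candidates, key=lambda s: s.end_ns - s.start_ns, default=None)


def enrich(event: KernelEvent, spans: Sequence[Span]) -> dict:
    """Attach service context to a kernel event (assumed join: PID + time window)."""
    record: dict = {}
    span = find_span(event, spans)
    if span is not None:
        record.update(span.attributes)
        record["service"] = span.service
        record["operation"] = span.operation
    record["kernel_event"] = event.kind
    record["duration"] = f"{event.duration_ms}ms"
    return record


def format_record(record: dict) -> str:
    """Render a record as key=value lines, one field per line."""
    return "\n".join(f"{key}={value}" for key, value in record.items())


spans = [
    Span("ledger-service", "database_write", pid=4121,
         start_ns=1_000, end_ns=2_000_000_000,
         attributes={"payment_id": "pay_8f21"}),
]
event = KernelEvent("block_io", pid=4121, start_ns=5_000, duration_ms=842)
print(format_record(enrich(event, spans)))
```

With these inputs the sketch prints the same five fields as the example event above. An event with no matching span keeps only its kernel-level fields.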

Product model

kprobe has five main layers:

Layer	Purpose
Recorder	Captures kernel events from each node with eBPF.
Pipeline	Streams, validates, enriches, and stores events.
Correlation	Joins kernel events with traces, services, pods, and request IDs.
Causal engine	Builds a graph of what caused what across the incident.
Console and API	Lets engineers search, inspect, replay, and integrate incident data.
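
As a rough illustration of what the Pipeline row describes, the following Python sketch chains validate, enrich, and store stages with generators. It is an assumption-laden toy, not the kprobe pipeline: the required field names, the node label, and the in-memory sink are all hypothetical.

```python
from typing import Iterable, Iterator

# Hypothetical minimum schema; the real event schema may differ.
REQUIRED_FIELDS = ("kernel_event", "pid", "timestamp_ns")


def validate(events: Iterable[dict]) -> Iterator[dict]:
    """Pass through only events that carry every required field."""
    for event in events:
        if all(name in event for name in REQUIRED_FIELDS):
            yield event


def enrich(events: Iterable[dict], node: str) -> Iterator[dict]:
    """Add node context to each event without mutating the input."""
    for event in events:
        yield {**event, "node": node}


def store(events: Iterable[dict], sink: list) -> int:
    """Append events to the sink and return how many were stored."""
    count = 0
    for event in events:
        sink.append(event)
        count += 1
    return count


raw = [
    {"kernel_event": "block_io", "pid": 4121, "timestamp_ns": 1_000},
    {"kernel_event": "tcp_retransmit"},  # malformed: no pid or timestamp
]
sink: list = []
stored = store(enrich(validate(raw), node="node-a"), sink)
print(stored, sink)
```

Because each stage is a generator, events flow through one at a time and a malformed event is dropped before it reaches enrichment or storage.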

A simple incident

Suppose a customer payment fails because the ledger service times out.

Application logs show:

payment pay_123 failed: ledger timeout

Tracing shows:

payment-api -> ledger-service -> database write: 1.2s

kprobe shows:

payment-api received pay_123
  -> ledger-service started write syscall
  -> block device queued the write behind compaction I/O
  -> write syscall completed after 842ms
  -> ledger-service missed its timeout
  -> payment failed

The difference is not more telemetry. The difference is a root-cause path.
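
One way to picture such a path is as a walk over cause edges, starting at the symptom and moving backward. The Python sketch below is a simplification under stated assumptions: each effect has exactly one cause, the node labels are copied from the incident above, and root_cause_path is a hypothetical helper rather than part of the kprobe API. A real causal graph would likely allow several causes per node.

```python
# Maps each effect to the event that caused it (single-cause simplification).
CAUSED_BY = {
    "payment failed": "ledger-service missed its timeout",
    "ledger-service missed its timeout": "write syscall completed after 842ms",
    "write syscall completed after 842ms":
        "block device queued the write behind compaction I/O",
    "block device queued the write behind compaction I/O":
        "ledger-service started write syscall",
    "ledger-service started write syscall": "payment-api received pay_123",
}


def root_cause_path(symptom: str, caused_by: dict) -> list:
    """Walk backward from the symptom and return the chain, earliest event first."""
    path = [symptom]
    seen = {symptom}
    while path[-1] in caused_by:
        cause = caused_by[path[-1]]
        if cause in seen:
            # Guard against malformed input that loops back on itself.
            raise ValueError(f"cycle detected at {cause!r}")
        seen.add(cause)
        path.append(cause)
    path.reverse()
    return path


path = root_cause_path("payment failed", CAUSED_BY)
print(path[0])
for step in path[1:]:
    print(f"  -> {step}")
```

Starting from "payment failed", the walk reproduces the six-step chain shown above, in the same order.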

Who uses kprobe

kprobe is built for teams that operate production systems where milliseconds matter:

payment gateways
distributed databases
trading, ledger, and settlement systems
high-throughput APIs
queue and stream processing platforms
latency-sensitive microservice systems

The primary users are SREs, platform engineers, backend engineers, and incident responders.

How to read these docs

If you are new to kprobe, start with Quickstart, then read Core Concepts.

If you are deploying kprobe, start with Installation, then follow the guide for your platform: Amazon EKS, Amazon ECS on EC2, or Bare Linux.

If you are operating kprobe in production, read Production Deployment, Monitor kprobe, and Security.

If you are integrating kprobe into another product or internal platform, start with API Overview and Event Schema.