Quickstart

This guide walks you through your first causal trace, from verifying the deployment to investigating a real incident.

Assumptions

kprobe is installed and all pods are running (see Installation)
The dashboard is accessible at http://localhost:3000
You have at least one application running in the cluster that emits traffic or OpenTelemetry traces

Step 1 — Verify kprobe is recording

Open the dashboard and navigate to the Timeline view. You should see a live stream of kernel events flowing in from all nodes. If the stream is empty, check the probe pods:

kubectl logs -l app=kprobe-probe -n monitoring --tail=50

The probe logs should show events being published to the kernel.raw Kafka topic.
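
If you want to script this check, a minimal Python sketch can wrap the same kubectl command and look for the topic name in its output. It assumes kubectl is on your PATH and that probe log lines mention the topic name literally, which this guide does not guarantee. The function names are hypothetical.

```python
import subprocess

TOPIC = "kernel.raw"  # topic name taken from this guide


def fetch_probe_logs(tail: int = 50) -> str:
    """Run the kubectl command from this step and return its stdout."""
    result = subprocess.run(
        ["kubectl", "logs", "-l", "app=kprobe-probe",
         "-n", "monitoring", f"--tail={tail}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def lines_mentioning_topic(log_text: str, topic: str = TOPIC) -> list:
    """Return log lines containing the topic name (assumed to appear literally)."""
    return [line for line in log_text.splitlines() if topic in line]
```

Under that assumption, an empty result from `lines_mentioning_topic(fetch_probe_logs())` would suggest the probes are not publishing.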

Step 2 — Find a transaction

kprobe indexes events by transaction ID, request ID, trace ID, service name, and timestamp. In the dashboard search bar, enter a transaction, request, or trace ID from your system. If your event stream already carries trace context, kprobe shows it alongside the kernel events. Native OpenTelemetry span correlation is planned but not yet available.
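
As a mental model for this lookup, the sketch below files each event under every ID it carries, so a search by any one of them returns the same events. The `Event` fields and the `EventIndex` class are hypothetical illustrations, not kprobe's storage schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical fields mirroring the keys listed above.
    name: str
    service: str
    timestamp_ns: int
    transaction_id: Optional[str] = None
    request_id: Optional[str] = None
    trace_id: Optional[str] = None


class EventIndex:
    """Toy in-memory index: one lookup table shared by all ID kinds."""

    def __init__(self) -> None:
        self._by_id = defaultdict(list)

    def add(self, event: Event) -> None:
        # File the event under each ID it actually carries.
        for key in (event.transaction_id, event.request_id, event.trace_id):
            if key is not None:
                self._by_id[key].append(event)

    def search(self, any_id: str) -> list:
        # Return matching events in timestamp order; unknown IDs give [].
        return sorted(self._by_id.get(any_id, []), key=lambda e: e.timestamp_ns)
```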

If you don’t have a specific request to search for, trigger a test request through your system and note the request or trace ID.

Step 3 — Open the causal graph

Search for the request, transaction, or trace ID in the dashboard. kprobe returns the full causal graph — a directed graph from the service-level event at the top down to every kernel-level event that touched it.

Each node in the graph represents an event. Each edge represents a causal relationship — event A caused event B. The graph is colour-coded by latency impact:

Neutral — events within normal latency bounds
Amber — events that contributed to latency
Root cause node — the kernel event identified as the primary cause, marked with a badge

Click any node to see the full event details: timestamp, PID, CPU core, duration, and the application context (which request or transaction, which service, which operation).
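
The structure described in this step can be modelled as a small directed graph. The sketch below is illustrative only: the node names, latency figures, and the 1000 µs threshold are made up, and picking the slowest kernel event as the root cause is a stand-in heuristic, not kprobe's attribution logic.

```python
from collections import deque

# Edges run from the service-level event down toward kernel-level events.
EDGES = {
    "checkout.request": ["db.write", "cache.read"],
    "db.write": ["sched_switch", "mm_page_fault"],
    "cache.read": [],
    "sched_switch": [],
    "mm_page_fault": [],
}
KERNEL_EVENTS = {"sched_switch", "mm_page_fault"}

# Made-up latency contribution per event, in microseconds.
LATENCY_US = {
    "checkout.request": 5200,
    "db.write": 4900,
    "cache.read": 120,
    "sched_switch": 4300,
    "mm_page_fault": 300,
}
AMBER_THRESHOLD_US = 1000  # assumed cut-off for "contributed to latency"


def events_under(root):
    """Breadth-first walk: every event reachable from root, root included."""
    seen = [root]
    queue = deque([root])
    while queue:
        for child in EDGES[queue.popleft()]:
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def colour_nodes(root):
    """Label each reachable event as neutral, amber, or root cause."""
    nodes = events_under(root)
    kernel = [n for n in nodes if n in KERNEL_EVENTS]
    root_cause = max(kernel, key=LATENCY_US.get) if kernel else None
    labels = {}
    for node in nodes:
        if node == root_cause:
            labels[node] = "root cause"
        elif LATENCY_US[node] >= AMBER_THRESHOLD_US:
            labels[node] = "amber"
        else:
            labels[node] = "neutral"
    return labels
```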

Step 4 — Inspect the kernel event

If your transaction shows unexpected latency, the root cause node will identify the kernel-level trigger. Common patterns:

sched_switch with high delay: the process was preempted by the kernel scheduler. Check which PID ran in its place; a background batch job is a common culprit.

mm_page_fault spike: memory pressure coincided with a critical write. Correlate with pod memory metrics to identify the source.

tcp_retransmit: a network retransmit added latency. Check whether it was a one-off or recurs on a specific node.
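
One way to answer the transient-versus-recurring question is to count retransmit events per node over a time window. The sketch assumes you have exported events as (node, event name) pairs. The export format and the default threshold of 3 are hypothetical.

```python
from collections import Counter


def recurring_nodes(events, event_name="tcp_retransmit", min_count=3):
    """Return {node: count} for nodes where event_name occurred at least
    min_count times. `events` is an iterable of (node, event_name) pairs."""
    counts = Counter(node for node, name in events if name == event_name)
    return {node: n for node, n in counts.items() if n >= min_count}
```

A node that appears in the result for several consecutive windows points to a recurring problem on that node rather than a one-off.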

Step 5 — Replay the incident

Once you have identified the root cause, open the Replay Panel. Select the transaction and click Replay. kprobe will:

Load the full event log for that transaction from ClickHouse
Start a sandboxed process via ptrace
Serve all system calls from the recorded event log instead of the real kernel
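
The idea behind the last step, answering system calls from a recording rather than the live kernel, can be illustrated in a few lines. This is a conceptual model only: it does not use ptrace, and `RecordedCall`, `ReplayLog`, and `ReplayDivergence` are hypothetical names, not part of kprobe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCall:
    syscall: str    # name of the call, e.g. "read"
    result: object  # value the kernel returned when it was recorded


class ReplayDivergence(Exception):
    """Raised when the replayed process makes a call the log does not expect."""


class ReplayLog:
    def __init__(self, calls):
        self._calls = list(calls)
        self._pos = 0

    def serve(self, syscall: str):
        """Return the recorded result for the next call, in recorded order."""
        if self._pos >= len(self._calls):
            raise ReplayDivergence(f"log exhausted at {syscall!r}")
        expected = self._calls[self._pos]
        if expected.syscall != syscall:
            raise ReplayDivergence(
                f"expected {expected.syscall!r}, got {syscall!r}")
        self._pos += 1
        return expected.result
```

Raising on a mismatch, rather than falling back to the real kernel, keeps the replay deterministic in this model.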

The application behaves exactly as it did in production. You can now:

Modify a timeout threshold and replay to verify the fix
Add artificial delay to specific syscalls to surface race conditions
Test the same incident with a different kernel scheduler configuration
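
Delay injection, the second option above, amounts to wrapping whatever serves a syscall during replay. The following is a hedged sketch with hypothetical names; it makes no claim about how the Replay Panel implements this.

```python
import time


def with_injected_delay(serve, delays_s, sleep=time.sleep):
    """Wrap a syscall-serving function so that chosen syscalls are delayed.

    serve    -- callable taking a syscall name and returning its result
    delays_s -- mapping of syscall name to extra delay in seconds
    sleep    -- injectable for testing; defaults to time.sleep
    """
    def delayed(syscall):
        extra = delays_s.get(syscall, 0.0)
        if extra > 0:
            sleep(extra)  # stall before answering to widen race windows
        return serve(syscall)

    return delayed
```

For example, `with_injected_delay(serve, {"write": 0.05})` would stall every replayed `write` by 50 ms and leave other calls untouched.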

When the replay succeeds with your proposed fix, you can ship with confidence.

Next steps

Causal Graph View: full guide to navigating the graph
Timeline View: nanosecond-precision event timeline
Replay Panel: full replay and fix verification workflow