about

Built for the failures nobody else can explain.

kprobe exists because there is a class of failure in distributed financial systems that the entire observability industry cannot see. Not because the tools are bad — they are excellent at what they do. But because they sit in the wrong place in the stack.

Where this comes from

The pattern is familiar to anyone who has operated a financial system at scale. A payment fails at 3am. The on-call engineer wakes up, opens Datadog, sees a latency spike. Opens Jaeger, finds a slow span. Spends four hours correlating logs across six microservices, working backwards from an error to something that might have caused it.

At the end of the investigation, nobody can fully confirm the root cause. A hypothesis is formed. A fix is shipped. The fix might work. It might not. If it does not, the incident repeats.

This is not a failure of the tools. Datadog recorded exactly what it saw. Jaeger traced exactly what it was told to trace. The problem is structural — the relevant events happened below where any of these tools can see. In the kernel. In the scheduler. In the memory subsystem.

kprobe was built to close that gap. Not to replace the existing stack, but to extend it downward into the layer where the hardest financial system failures actually originate.

A flight recorder and a debugger. Nothing else.

kprobe is
  • A kernel-level event recorder deployed passively alongside your services
  • A causal inference engine that constructs cause-and-effect graphs from raw kernel events
  • A deterministic replay engine for reproducing production incidents on development machines
  • A tool for financial systems engineering teams investigating complex, low-level failures
  • Complementary to your existing observability stack
kprobe is not
  • A monitoring tool — it does not alert on thresholds or track SLOs
  • A replacement for Datadog, Jaeger, or OpenTelemetry
  • An APM platform — it does not aggregate application performance metrics
  • A general-purpose observability tool — it is purpose-built for kernel-level incident investigation
  • Suitable for non-Linux systems or for Linux kernels older than 5.15

Every decision has a reason.

Passive by default
kprobe attaches to the kernel and records. It does not require application code changes, library imports, framework integrations, or redeployment. The cost of adoption is a single Helm install. This is not a convenience feature — it is a correctness feature. Any instrumentation you add to your application changes its behaviour, however slightly. kprobe's kernel-level approach sees the system as it actually runs, not as it runs under instrumentation.
Causal graphs over logs
Logs tell you what happened. Causal graphs tell you why. The difference matters enormously in incident investigation. A log entry saying "write took 800ms" is not actionable. A causal graph showing that a kernel scheduler preemption caused the delay, triggered by a batch job on the same CPU core, with an edge connecting the preemption event to the write delay, is actionable in seconds. kprobe does not store more data — it structures data differently.
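To make the idea concrete, here is a minimal sketch of walking a causal graph backwards from an effect to its root cause. It is an illustration only, written in Python for brevity. The `CausalGraph` class, its methods, and the event labels are hypothetical and are not kprobe's actual data model or API.

```python
from collections import defaultdict


class CausalGraph:
    """Directed graph whose edges point from a cause to its effect (hypothetical model)."""

    def __init__(self):
        # Stored reversed (effect -> direct causes) so backward walks are cheap.
        self._causes = defaultdict(list)

    def add_edge(self, cause, effect):
        self._causes[effect].append(cause)

    def root_causes(self, effect):
        """Return upstream events that have no recorded cause of their own."""
        roots, seen, stack = [], {effect}, [effect]
        while stack:
            node = stack.pop()
            parents = self._causes.get(node, [])
            if not parents and node != effect:
                roots.append(node)
            for parent in parents:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return roots


# The example from the text: a batch job causes a preemption, which delays a write.
graph = CausalGraph()
graph.add_edge("batch job scheduled on same CPU core", "scheduler preemption")
graph.add_edge("scheduler preemption", "write took 800ms")

print(graph.root_causes("write took 800ms"))
```

The log line alone is the last node in this chain. The edges are what turn it into an answer.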
Replay as first-class feature
Most observability tools treat incidents as read-only artifacts. You can look at what happened, but you cannot interact with it. kprobe's replay engine changes that. An incident becomes a test case. You can verify a fix works against the exact production conditions that caused the failure before deploying it. This changes the confidence level of fixes from "this should work" to "this did work, against the actual incident."
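The "incident becomes a test case" workflow can be sketched as follows. This is a toy model, assuming a recorded incident can be reduced to an ordered list of events fed to a handler. A real replay engine would have to reproduce kernel-level conditions, which this sketch does not attempt. `RecordedEvent`, `replay`, and both handlers are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedEvent:
    seq: int        # position in the recorded stream
    kind: str       # hypothetical event kinds: "write", "preempt"
    delay_ms: int   # delay observed in production


def replay(events, handler):
    """Feed events to the handler in recorded order and collect its results."""
    return [handler(e) for e in sorted(events, key=lambda e: e.seq)]


def handler_before_fix(event):
    # Hypothetical bug: any write slower than 500 ms is reported as failed.
    if event.kind == "write" and event.delay_ms > 500:
        return "failed"
    return "ok"


def handler_after_fix(event):
    # Hypothetical fix: slow writes are retried instead of failing.
    if event.kind == "write" and event.delay_ms > 500:
        return "retried"
    return "ok"


# The recorded incident, stored out of order on purpose.
incident = [
    RecordedEvent(seq=2, kind="write", delay_ms=800),
    RecordedEvent(seq=1, kind="preempt", delay_ms=0),
]


def test_fix_against_incident():
    assert "failed" in replay(incident, handler_before_fix)     # incident reproduces
    assert "failed" not in replay(incident, handler_after_fix)  # fix holds
```

The same recorded input drives both runs, so the only variable between them is the fix.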
Financial domain as first-class concept
Generic observability tools produce generic outputs. kprobe is built for financial systems. Settlement boundaries, clearing windows, ledger writes, order book operations — these are native concepts in kprobe's domain model. A kernel event is not just "PID 2841 made a write syscall." It is "settlement #4821 ledger write, triggered by payment #98721, delayed by kernel memory pressure." The financial context is preserved end-to-end.
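As a rough sketch of what carrying financial context alongside a kernel event could look like, the snippet below pairs a raw event with an optional domain record. The types and field names are assumptions made for illustration. The source does not describe how kprobe attaches this context internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelEvent:
    pid: int
    syscall: str


@dataclass(frozen=True)
class FinancialContext:
    # Hypothetical fields, chosen to match the example in the text.
    settlement_id: int
    payment_id: int
    operation: str


def describe(event: KernelEvent,
             context: Optional[FinancialContext] = None,
             cause: Optional[str] = None) -> str:
    """Render an event generically, or in domain terms when context is known."""
    if context is None:
        return f"PID {event.pid} made a {event.syscall} syscall"
    text = (f"settlement #{context.settlement_id} {context.operation}, "
            f"triggered by payment #{context.payment_id}")
    if cause:
        text += f", delayed by {cause}"
    return text


event = KernelEvent(pid=2841, syscall="write")
context = FinancialContext(settlement_id=4821, payment_id=98721,
                           operation="ledger write")

print(describe(event))
print(describe(event, context, cause="kernel memory pressure"))
```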

The choices that make kprobe possible.

Rust
Rust + Aya for eBPF
The eBPF verifier rejects programs that could crash the kernel. Writing eBPF programs in C that pass the verifier is difficult, and C offers no memory-safety guarantees of its own. With the Aya framework, probes are written in Rust and compiled to eBPF bytecode, giving us the verifier's safety guarantees plus Rust's memory safety at the language level. The entire probe stack, kernel-side programs and userspace loader alike, is memory-safe. There is no C anywhere in the codebase.
Kafka
Kafka for event transport
Kernel events can arrive at rates of millions per second. Kafka provides back-pressure handling, durability, and replayability. If the causal engine falls behind during a complex incident, events are not lost: they queue in Kafka and are processed when the engine catches up. Kafka's log structure also makes the raw event stream inherently replayable, which is foundational to the replay engine.
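The property being relied on here is that of an append-only log read by offset. The sketch below models it in plain Python, without any Kafka client library. `EventLog` stands in for a single partition and `Consumer` for the causal engine. Both are simplified, hypothetical stand-ins and not Kafka's API.

```python
class EventLog:
    """Append-only log: a simplified stand-in for one Kafka partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records):
        return self._records[offset:offset + max_records]

    def end_offset(self):
        return len(self._records)


class Consumer:
    """Reads at its own pace and remembers only its position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def lag(self):
        return self.log.end_offset() - self.offset

    def seek(self, offset):
        self.offset = offset


log = EventLog()
for i in range(5):
    log.append(f"event-{i}")

engine = Consumer(log)
first = engine.poll(max_records=2)    # a slow engine takes only part of the burst
backlog = engine.lag()                # the rest waits in the log, nothing is dropped
rest = engine.poll(max_records=10)    # the engine catches up later

engine.seek(0)                        # replay is just rewinding the offset
replayed = engine.poll(max_records=10)
```

Because reading never removes records, falling behind and replaying are the same operation seen from different offsets.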
Neo4j
Neo4j for causal graphs
Causal relationships are inherently graph-shaped. Traversing from a failed payment to its root kernel cause in a relational database requires recursive CTEs across multiple joins — slow and complex. In Neo4j, the same traversal is a short Cypher query that runs in milliseconds on graphs with millions of edges. The choice of storage model is not incidental — it is what makes sub-second root cause queries possible.
ClickHouse
ClickHouse for raw events
The replay engine needs to retrieve all events for a specific transaction in time order, from a table containing billions of rows. ClickHouse is purpose-built for this — columnar storage, vectorized execution, bloom filter indexes on transaction ID and PID. The same query that would take minutes in Postgres takes under a second in ClickHouse at this data volume.
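The role of a Bloom filter index in this kind of lookup can be sketched in a few lines. The code below is a conceptual illustration, not ClickHouse's implementation. It splits rows into blocks, keeps a small Bloom filter of transaction IDs per block, and skips any block whose filter rules the ID out. The row layout (`txn_id`, `ts_ns`) and all function names are assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_blocks(rows, block_size):
    """Split rows into blocks and attach a Bloom filter over txn_id to each."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        bloom = BloomFilter()
        for row in chunk:
            bloom.add(row["txn_id"])
        blocks.append((bloom, chunk))
    return blocks


def events_for_transaction(blocks, txn_id):
    """Return the transaction's events in time order, plus the number of blocks read."""
    hits, scanned = [], 0
    for bloom, chunk in blocks:
        if not bloom.might_contain(txn_id):
            continue  # whole block skipped without reading its rows
        scanned += 1
        hits.extend(row for row in chunk if row["txn_id"] == txn_id)
    return sorted(hits, key=lambda row: row["ts_ns"]), scanned


rows = [
    {"txn_id": "txn-a", "ts_ns": 30}, {"txn_id": "txn-a", "ts_ns": 10},
    {"txn_id": "txn-b", "ts_ns": 20}, {"txn_id": "txn-b", "ts_ns": 25},
    {"txn_id": "txn-a", "ts_ns": 5},  {"txn_id": "txn-b", "ts_ns": 40},
]
blocks = build_blocks(rows, block_size=2)
events, scanned = events_for_transaction(blocks, "txn-a")
# The middle block holds only txn-b, so it is skipped unless the filter
# reports a false positive.
```

At billions of rows, most blocks do not contain a given transaction, so most of the table is never read.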
Go
Go for the engine layer
The causal engine, replay engine, and API server are all Go. Go's goroutine model handles the concurrent event processing required for real-time causal inference without the complexity of async Rust or the overhead of a JVM. The standard library's `syscall` package includes ptrace bindings (for example `syscall.PtraceAttach`) that are stable and well-understood, which is critical for the replay engine's correctness.
Kubernetes
DaemonSet deployment model
The probe runs as a Kubernetes DaemonSet: one instance per node. This is the only deployment model that guarantees every kernel event on every node is captured. A Deployment or sidecar model would miss events on nodes where no instrumented pod happened to be running. The DaemonSet is the architectural foundation of zero-gap coverage.

kprobe is open source.

kprobe is in active early development. The core pipeline — eBPF probe, Kafka transport, causal engine, Neo4j graph model, gRPC API — is complete. The dashboard and replay panel are in progress.

If you work on financial infrastructure, observability tooling, or low-level systems and want to contribute or share feedback, the repository is open.

Built with Rust · Go · TypeScript · React · Kafka · ClickHouse · Neo4j · Kubernetes