The observability stack has a blind spot.
Every tool in the modern observability stack — Datadog, Jaeger, Honeycomb, OpenTelemetry — operates at the application layer. They see what your code tells them. The failures that cost the most in financial systems happen where none of them can see.
Why application-layer tools miss the hardest failures
When you instrument a service with OpenTelemetry, you are telling the system: record a span when this function is called, record its duration, record its attributes. The trace shows you what your code did. It does not show you what the operating system did while your code was running.
This is not a limitation of any specific tool. It is a fundamental constraint of where these tools sit in the stack. They operate in userspace, above the kernel, and see only the events that application code and its libraries report: function calls, HTTP requests, database queries. They are blind to everything below.
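The gap can be illustrated without any tracing SDK. The sketch below is a hypothetical hand-rolled "span" (the names `traced` and `ledger_write` are assumptions for illustration, not part of any tool discussed here). It records wall-clock duration the way an application-level span does, and compares it with the CPU time the process actually consumed. A `time.sleep` call stands in for time spent off-CPU.

```python
import time

def traced(fn):
    """Hypothetical span: wall-clock and CPU time measured around fn."""
    wall_start = time.perf_counter()   # wall-clock, what a span duration reflects
    cpu_start = time.process_time()    # CPU time actually consumed by this process
    fn()
    return {
        "wall_ms": (time.perf_counter() - wall_start) * 1000,
        "cpu_ms": (time.process_time() - cpu_start) * 1000,
    }

def ledger_write():
    # Stand-in for a write that spends most of its time off-CPU.
    time.sleep(0.05)

span = traced(ledger_write)
off_cpu_ms = span["wall_ms"] - span["cpu_ms"]
print(f"wall={span['wall_ms']:.1f}ms cpu={span['cpu_ms']:.1f}ms off-cpu={off_cpu_ms:.1f}ms")
```

Even with both clocks, userspace can only tell that time was spent off-CPU. It cannot tell whether the cause was preemption, I/O wait, or something else; that information lives in the kernel.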
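Below the application layer, the kernel is making decisions that directly affect your financial system's behaviour: which process gets the CPU and when, how memory pressure is handled, and when network packets are retransmitted.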
These are not edge cases. In distributed financial systems running on shared Kubernetes infrastructure, kernel-level interference between workloads is a routine cause of latency spikes, failed settlements, and ambiguous transaction states.
What each tool actually sees
Datadog
Strengths
- Best-in-class APM dashboards and alerting
- Broad integrations across languages and frameworks
- Strong log aggregation and correlation
- Excellent UI for operational visibility
Limitations
- Operates entirely at the application layer; sees only what you instrument
- No visibility into CPU scheduling, memory pressure, or network retransmits
- Cannot construct causal chains across kernel boundaries
- No incident replay capability
- Root cause analysis is manual and incomplete for OS-level failures
Jaeger
Strengths
- Industry standard for distributed tracing
- Excellent cross-service span visualization
- Open source, self-hosted option
- Strong OpenTelemetry compatibility
Limitations
- Requires explicit instrumentation; sees only what you instrument
- Spans represent application-level timing, not OS-level timing
- A slow span could be a slow service or a kernel preemption; Jaeger cannot distinguish between them
- No kernel event visibility whatsoever
- No causal inference: shows what happened, not why
OpenTelemetry
Strengths
- Vendor-neutral instrumentation standard
- Covers traces, metrics, and logs in a unified SDK
- Growing auto-instrumentation support
- Exports to any backend
Limitations
- Still requires code changes; auto-instrumentation covers common libraries, not custom code
- Application-layer only by design
- Traces represent what the SDK recorded, not ground truth timing
- No kernel visibility, no causal inference, no replay
Prometheus
Strengths
- De facto standard for Kubernetes metrics
- Powerful query language (PromQL)
- Excellent for capacity planning and trend analysis
- Strong alerting via Alertmanager
Limitations
- Metrics are aggregates; individual events and their causes are invisible
- A CPU spike metric tells you CPU was high, not which process caused it or why
- No tracing, no causal analysis, no replay
- Designed for operational visibility, not incident investigation
Honeycomb
Strengths
- Excellent high-cardinality event exploration
- Strong query capabilities for trace analysis
- Good for understanding user-facing behaviour patterns
Limitations
- Application-layer only, with the same fundamental constraint as other APM tools
- Requires instrumentation of every service
- No kernel visibility, no causal inference, no replay
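The point that metrics are aggregates can be made concrete with a small sketch. The 200ms and 812ms latencies are taken from the incident described in this article. The window of 1,000 requests is an assumption chosen only for illustration.

```python
import statistics

# Assumption: 999 normal requests in the window, each at the usual 200ms.
normal = [200.0] * 999
# One slow request at 812ms lands in the same window.
window = normal + [812.0]

mean_before = statistics.mean(normal)
mean_after = statistics.mean(window)
median_after = statistics.median(window)
worst = max(window)

print(f"mean before: {mean_before:.3f}ms")
print(f"mean after:  {mean_after:.3f}ms")
print(f"median:      {median_after:.3f}ms")
print(f"worst case:  {worst:.3f}ms")
```

Under these assumed numbers the mean moves by well under a millisecond and the median does not move at all, while one request took four times as long as usual. The aggregate records that something was slightly off, not which request was affected or why.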
The same incident through every tool
A payment of ₹50,000 fails to settle at 03:47. The money is in limbo. Here is what each tool shows the on-call engineer.
Datadog
Shows a latency spike in settlement-svc at 03:47. The p99 latency metric crossed the alert threshold and an alert fired. The APM trace shows settlement-svc took 812ms on a request that normally takes 200ms.
No further information is available about why it was slow.
Jaeger
Shows the distributed trace for the payment. The span tree shows payment-handler called settlement-svc, which made a database write. The write span took 802ms. The overall trace took 812ms and timed out.
The write span is the slow span. There is no information on what made it slow.
Prometheus
Shows CPU utilisation on the node spiked to 94% at 03:47. Memory usage was elevated. Metrics alone cannot connect the node-level figures to the specific payment failure.
kprobe
Shows the full causal graph for payment #98721. The root cause node is a sched_switch event at 03:47:12.822: the kernel preempted PID 2841 (settlement-svc) and scheduled PID 4721 (a batch job) on CPU 3.
PID 2841 was mid-write to the ledger when it was preempted. It waited 388ms in the scheduler's run queue before resuming. The write completed at 03:47:13.621, 802ms in total and 52ms past the payment handler's 750ms timeout.
The fix: increase the payment handler timeout to 1500ms. The engineer replays the incident with the modified timeout. The payment succeeds. The fix ships.
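The timeline arithmetic in this incident can be checked with a short sketch. All timestamps and durations are the figures quoted above. The variable names and the `ms` helper are assumptions for illustration, and the derived values are simple subtraction, not output from any tool.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

def ms(delta: timedelta) -> int:
    """Convert a timedelta to whole milliseconds."""
    return round(delta / timedelta(milliseconds=1))

# Figures quoted in the incident narrative.
preempted_at = datetime.strptime("03:47:12.822", FMT)
completed_at = datetime.strptime("03:47:13.621", FMT)
write_total = timedelta(milliseconds=802)
queue_wait = timedelta(milliseconds=388)
old_timeout = timedelta(milliseconds=750)
new_timeout = timedelta(milliseconds=1500)

# Derived values.
write_started_at = completed_at - write_total   # when the write span began
resumed_at = preempted_at + queue_wait          # when PID 2841 got the CPU back
outside_queue = write_total - queue_wait        # write time not spent on the run queue
overrun = write_total - old_timeout             # how far past the old timeout

print("write started:", write_started_at.strftime(FMT)[:-3])
print("resumed:", resumed_at.strftime(FMT)[:-3])
print("write time outside the run queue:", ms(outside_queue), "ms")
print("overrun past old timeout:", ms(overrun), "ms")
print("within old timeout without the wait:", outside_queue <= old_timeout)
print("within new timeout with the wait:", write_total <= new_timeout)
```

On these figures the write would have taken 414ms without the 388ms queue wait, comfortably inside the original 750ms timeout, and the full 802ms fits inside the new 1500ms timeout.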
Full comparison
| Capability | Datadog | Jaeger | OTel | Prometheus | Honeycomb | kprobe |
|---|---|---|---|---|---|---|
| Application traces | Yes | Yes | Yes | — | Yes | Yes |
| Zero instrumentation | — | — | — | — | — | Yes |
| CPU scheduling visibility | — | — | — | — | — | Yes |
| Memory pressure events | — | — | — | Metrics only | — | Yes |
| Network packet timing | — | — | — | — | — | Yes |
| Syscall-level visibility | — | — | — | — | — | Yes |
| Causal graph construction | — | — | — | — | — | Yes |
| Root cause to kernel level | — | — | — | — | — | Yes |
| Deterministic replay | — | — | — | — | — | Yes |
| Fix verification before deploy | — | — | — | — | — | Yes |
| Financial domain primitives | — | — | — | — | — | Yes |
| Nanosecond event precision | — | Millisecond | Millisecond | Millisecond | Millisecond | Yes |
| Self-hosted option | — | Yes | Yes | Yes | — | Yes |
| Open source | — | Yes | Yes | Yes | — | Yes |
kprobe does not replace your existing stack.
Datadog, Jaeger, OpenTelemetry, and Prometheus are all doing their jobs correctly. They are excellent tools for what they are designed to do — monitor application behaviour, track distributed traces, aggregate metrics, alert on thresholds.
kprobe fills the gap below them. It sees what they cannot see by design. Used together, you get complete visibility: