
The observability stack has a blind spot.

The tools most teams rely on for observability, including Datadog, Jaeger, Honeycomb, and OpenTelemetry, are built around the application layer. They see what your code tells them. Many of the costliest failures in financial systems happen below that layer, where these tools cannot see.

Why application-layer tools miss the hardest failures

When you instrument a service with OpenTelemetry, you are telling the system: record a span when this function is called, record its duration, record its attributes. The trace shows you what your code did. It does not show you what the operating system did while your code was running.
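
To make this concrete, here is a minimal sketch using only the Python standard library. `timed_span` is a hypothetical stand-in for a tracing span, not the OpenTelemetry API, and the `time.sleep` call is an assumption standing in for any period when the process is not running. The span's wall-clock duration is what a trace would report. Comparing it with CPU time shows that the process spent time off the CPU, but nothing in userspace says whether it was sleeping, blocked on I/O, or preempted.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_span(name, sink):
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time used by this process
    try:
        yield
    finally:
        wall_s = time.perf_counter() - wall_start
        cpu_s = time.process_time() - cpu_start
        sink.append({"name": name, "wall_s": wall_s, "cpu_s": cpu_s,
                     "off_cpu_s": wall_s - cpu_s})

spans = []
with timed_span("settlement_write", spans):
    time.sleep(0.05)  # stands in for time spent blocked or descheduled

record = spans[0]
print(f"{record['name']}: wall={record['wall_s'] * 1000:.0f}ms "
      f"cpu={record['cpu_s'] * 1000:.0f}ms "
      f"off_cpu={record['off_cpu_s'] * 1000:.0f}ms")
```

The off-CPU gap is visible here only because the sketch measures CPU time as well. A typical span records wall-clock duration alone, and neither number explains the cause.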

This is not a limitation of any specific tool. It is a consequence of where these tools sit in the stack. They run in userspace, above the kernel, and see only the events that application code and its libraries report: function calls, HTTP requests, database queries. They are blind to what the kernel does underneath.

Below the application layer, the kernel is making decisions that directly affect your financial system's behaviour:

CPU scheduler preemption
The kernel can preempt your settlement write process mid-operation and schedule a background batch job on the same CPU core. Your application sees a slow write. OpenTelemetry records a slow span. Neither tells you a batch job caused it.
Memory pressure cascade
A service on the same node consuming excess memory triggers kernel memory pressure. The kernel begins reclaiming pages from other processes. Your ledger write, which normally takes 10ms, takes 800ms. Datadog sees a latency spike. It cannot tell you why.
TCP retransmission
A network packet is dropped between your payment handler and settlement service. The kernel retransmits it. The retransmit adds 200ms. That 200ms pushes your transaction past its clearing window. Jaeger shows a slow span. It cannot distinguish a slow service from a retransmit.
File descriptor contention
Two processes writing to the same file contend for a kernel-level lock. One process blocks. Your application logs show nothing unusual. From the application's perspective, it made a write call and got a response. It just took longer than expected.
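
Linux already exposes coarse counters for some of these behaviours under `/proc`. The sketch below parses three of them: `/proc/<pid>/status` for context switches, `/proc/net/snmp` for TCP retransmits, and `/proc/pressure/memory` for memory pressure (on kernels with PSI enabled). The sample text is made up for illustration and the function names are hypothetical. These counters only show that interference happened on the node or in the process. They do not tie it to a specific transaction.

```python
def parse_ctxt_switches(status_text):
    """Extract context-switch counters from /proc/<pid>/status text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counters[key] = int(value.strip())
    return counters

def parse_tcp_retrans(snmp_text):
    """Extract RetransSegs from /proc/net/snmp (a header line, then a value line)."""
    tcp_lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

def parse_memory_pressure(psi_text):
    """Extract the avg10 stall percentage from each line of /proc/pressure/memory."""
    result = {}
    for line in psi_text.splitlines():
        fields = line.split()
        if fields:
            pairs = dict(f.split("=") for f in fields[1:])
            result[fields[0]] = float(pairs["avg10"])
    return result

# Made-up sample contents, in the format the kernel uses for these files.
SAMPLE_STATUS = ("Name:\tsettlement-svc\n"
                 "voluntary_ctxt_switches:\t150\n"
                 "nonvoluntary_ctxt_switches:\t42\n")
SAMPLE_SNMP = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
               "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
               "InErrs OutRsts InCsumErrors\n"
               "Tcp: 1 200 120000 -1 10 5 0 0 3 1000 900 7 0 0 0\n")
SAMPLE_PSI = ("some avg10=1.50 avg60=0.80 avg300=0.20 total=123456\n"
              "full avg10=0.25 avg60=0.10 avg300=0.05 total=6543\n")

switches = parse_ctxt_switches(SAMPLE_STATUS)
retrans = parse_tcp_retrans(SAMPLE_SNMP)
pressure = parse_memory_pressure(SAMPLE_PSI)
print(switches, retrans, pressure)
```

On a Linux host the same functions can be fed the real files, for example `parse_ctxt_switches(open("/proc/self/status").read())`. A rising `nonvoluntary_ctxt_switches` count means the process was preempted, but not by what or during which request.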

These are not edge cases. In distributed financial systems running on shared Kubernetes infrastructure, kernel-level interference between workloads is a routine cause of latency spikes, failed settlements, and ambiguous transaction states.

What each tool actually sees

Datadog
Application layer · SaaS APM
  • Best-in-class APM dashboards and alerting
  • Broad integrations across languages and frameworks
  • Strong log aggregation and correlation
  • Excellent UI for operational visibility
  • Operates entirely above application code — sees only what you instrument
  • No visibility into CPU scheduling, memory pressure, or network retransmits
  • Cannot construct causal chains across kernel boundaries
  • No incident replay capability
  • Root cause analysis is manual and incomplete for OS-level failures
Datadog is excellent for operational monitoring and alerting. It tells you something is wrong. For financial systems experiencing kernel-level failures, it cannot tell you why.
Jaeger
Application layer · Distributed tracing
  • Industry standard for distributed tracing
  • Excellent cross-service span visualization
  • Open source, self-hosted option
  • Strong OpenTelemetry compatibility
  • Requires explicit instrumentation — only sees what you instrument
  • Spans represent application-level timing, not OS-level timing
  • A slow span could be a slow service or a kernel preemption — Jaeger cannot distinguish
  • No kernel event visibility whatsoever
  • No causal inference — shows what happened, not why
Jaeger shows you the distributed call graph. kprobe extends that graph downward into the kernel, where the actual cause of slow spans often lives.
OpenTelemetry
Application layer · Instrumentation standard
  • Vendor-neutral instrumentation standard
  • Covers traces, metrics, and logs in a unified SDK
  • Growing auto-instrumentation support
  • Exports to any backend
  • Still requires code changes — auto-instrumentation covers common libraries, not custom code
  • Application-layer only by design
  • Traces represent what the SDK recorded, not ground truth timing
  • No kernel visibility, no causal inference, no replay
kprobe is complementary to OpenTelemetry, not a replacement. It uses your OTel traces as context — correlating them with kernel events to give both layers full meaning.
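
As an illustration of what correlating traces with kernel events can mean, the sketch below joins a span to kernel events by process ID and time window. The `Span` and `KernelEvent` types, their fields, and the sample values are assumptions made for this example. kprobe's actual data model and correlation logic are not described here. The sketch also assumes both timestamps come from the same clock.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    name: str
    pid: int
    start_ns: int
    end_ns: int

@dataclass(frozen=True)
class KernelEvent:
    kind: str
    pid: int
    ts_ns: int

def events_during(span, events):
    """Return kernel events for the span's PID that fall inside its time window."""
    return [e for e in events
            if e.pid == span.pid and span.start_ns <= e.ts_ns <= span.end_ns]

# Hypothetical data: one slow write span and three kernel events.
write_span = Span("ledger_write", pid=2841, start_ns=0, end_ns=802_000_000)
kernel_events = [
    KernelEvent("sched_switch", pid=2841, ts_ns=3_000_000),      # inside the span
    KernelEvent("sched_switch", pid=2841, ts_ns=900_000_000),    # after the span ended
    KernelEvent("tcp_retransmit", pid=999, ts_ns=100_000_000),   # different process
]

matched = events_during(write_span, kernel_events)
print([(e.kind, e.ts_ns) for e in matched])
```

Only the first event is attributed to the span. The other two are excluded because they fall outside the window or belong to another process.
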
Prometheus + Grafana
Application layer · Metrics and dashboards
  • De facto standard for Kubernetes metrics
  • Powerful query language (PromQL)
  • Excellent for capacity planning and trend analysis
  • Strong alerting via Alertmanager
  • Metrics are aggregates — individual events and their causes are invisible
  • A CPU spike metric tells you CPU was high, not which process caused it or why
  • No tracing, no causal analysis, no replay
  • Not designed for incident investigation — designed for operational visibility
Prometheus tells you that latency increased at 03:47. kprobe tells you that a batch job caused a kernel scheduler preemption at 03:47:12.822 that stretched your settlement write to about 800ms.
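
A short sketch of why aggregates hide individual events. The numbers are assumptions chosen for illustration: 599 requests at a typical 200ms and a single 812ms outlier in the same window.

```python
# Hypothetical one-minute window: 599 normal requests and one slow one.
latencies_ms = [200] * 599 + [812]

window_mean_ms = sum(latencies_ms) / len(latencies_ms)
worst_ms = max(latencies_ms)

print(f"mean={window_mean_ms:.2f}ms worst={worst_ms}ms")
```

The mean moves by about one millisecond, so the one request that matters all but disappears from an averaged metric. Percentile metrics surface the outlier better, but they still cannot say which request it was or what delayed it.
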
Honeycomb
Application layer · Observability platform
  • Excellent high-cardinality event exploration
  • Strong query capabilities for trace analysis
  • Good for understanding user-facing behaviour patterns
  • Application-layer only — same fundamental constraint as other APM tools
  • Requires instrumentation of every service
  • No kernel visibility, no causal inference, no replay
Honeycomb is strong for analysing patterns across many traces. kprobe is focused on deep investigation of individual incidents — specifically those caused by OS-level behaviour.

The same incident through every tool

A payment of ₹50,000 fails to settle at 03:47. Money is in limbo. Here is what each tool shows the on-call engineer.

Datadog

Shows a latency spike in settlement-svc at 03:47. The p99 latency metric crossed the alert threshold. An alert fired. The APM trace shows settlement-svc took 812ms on a request that normally takes 200ms.

No further information is available about why it was slow.

Tells you it was slow. Not why.
Jaeger

Shows the distributed trace for the payment. The span tree shows payment-handler called settlement-svc, which made a database write. The write span took 802ms. The overall trace took 812ms and timed out.

The write span is the slow span. No information on what made it slow.

Identifies the slow span. Not the cause.
Prometheus

Shows CPU utilisation on the node spiked to 94% at 03:47. Memory usage was elevated. Metrics alone cannot connect these node-level readings to the specific payment failure.

Shows system stress. No connection to the incident.
kprobe

Shows the full causal graph for payment #98721. The root cause node is a sched_switch event at 03:47:12.822 — the kernel preempted PID 2841 (settlement-svc) and scheduled PID 4721 (a batch job) on CPU 3.

PID 2841 was mid-write to the ledger when it was preempted. It waited 388ms on the scheduler's run queue before resuming. The write completed at 03:47:13.621, 802ms after it started and 52ms past the payment handler's 750ms timeout.
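
The timeline arithmetic can be checked directly. This sketch uses only the figures quoted in the incident; the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"
ONE_MS = timedelta(milliseconds=1)

def ms(delta):
    """Convert a timedelta to whole milliseconds."""
    return delta // ONE_MS

# Figures quoted in the incident description.
preempted_at = datetime.strptime("03:47:12.822", FMT)
completed_at = datetime.strptime("03:47:13.621", FMT)
write_total = timedelta(milliseconds=802)
run_queue_wait = timedelta(milliseconds=388)
handler_timeout = timedelta(milliseconds=750)

# Values derived from those figures.
started_at = completed_at - write_total           # when the write began
ran_before_preempt = preempted_at - started_at    # progress before preemption
resumed_at = preempted_at + run_queue_wait        # when the process ran again
overrun = write_total - handler_timeout           # how far past the timeout
without_wait = write_total - run_queue_wait       # duration had it not waited

print("started  ", started_at.strftime(FMT)[:-3])
print("resumed  ", resumed_at.strftime(FMT)[:-3])
print("ran before preemption:", ms(ran_before_preempt), "ms")
print("overrun past timeout: ", ms(overrun), "ms")
print("write without the wait:", ms(without_wait), "ms")
```

By these figures the write began at 03:47:12.819, was preempted 3ms in, and would have taken 414ms without the run-queue wait, which is inside the 750ms timeout.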

The fix: increase the payment handler timeout to 1500ms. The engineer replays the incident with the modified timeout. The payment succeeds. The fix ships.
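
The decision being verified can be sketched in a few lines. `replay_outcome` is a hypothetical helper that models only the timeout comparison, using the 802ms write recorded in this incident. A real replay would re-run the recorded kernel events rather than compare two numbers.

```python
RECORDED_WRITE_MS = 802  # write duration recorded for this incident

def replay_outcome(write_ms, timeout_ms):
    """Return the outcome the payment handler would see for a given timeout."""
    return "settled" if write_ms <= timeout_ms else "timed_out"

# Compare the original timeout with the proposed one.
outcomes = {t: replay_outcome(RECORDED_WRITE_MS, t) for t in (750, 1500)}

for timeout_ms, outcome in outcomes.items():
    print(f"timeout={timeout_ms}ms -> {outcome}")
```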

Root cause identified. Fix verified. Total time: 5 minutes.

Full comparison

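Capability                     | Datadog     | Jaeger      | OTel        | Prometheus   | Honeycomb   | kprobe
Application traces             | Yes         | Yes         | Yes         | -            | Yes         | Yes
Zero instrumentation           | -           | -           | -           | -            | -           | Yes
CPU scheduling visibility      | -           | -           | -           | -            | -           | Yes
Memory pressure events         | -           | -           | -           | Metrics only | -           | Yes
Network packet timing          | -           | -           | -           | -            | -           | Yes
Syscall-level visibility       | -           | -           | -           | -            | -           | Yes
Causal graph construction      | -           | -           | -           | -            | -           | Yes
Root cause to kernel level     | -           | -           | -           | -            | -           | Yes
Deterministic replay           | -           | -           | -           | -            | -           | Yes
Fix verification before deploy | -           | -           | -           | -            | -           | Yes
Financial domain primitives    | -           | -           | -           | -            | -           | Yes
Nanosecond event precision     | Millisecond | Millisecond | Millisecond | -            | Millisecond | Yes
Self-hosted option             | -           | Yes         | Yes         | Yes          | -           | Yes
Open source                    | -           | Yes         | Yes         | Yes          | -           | Yes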

kprobe does not replace your existing stack.

Datadog, Jaeger, Honeycomb, OpenTelemetry, and Prometheus are all doing their jobs correctly. They are excellent at what they are designed to do: monitoring application behaviour, tracking distributed traces, aggregating metrics, and alerting on thresholds.

kprobe fills the gap below them. It sees what they cannot see by design. Used together, you get complete visibility:

Alerting
Prometheus · Alertmanager · Grafana
Know when something is wrong
Application observability
Datadog · Jaeger · Honeycomb · OpenTelemetry
Understand what your application did
Kernel observability
kprobe
Understand what the OS did while your application was running
Linux kernel
Scheduler · Memory · Network · Filesystem
Where the hardest failures actually happen