compare

The observability stack has a blind spot.

Every tool in the modern observability stack — Datadog, Jaeger, Honeycomb, OpenTelemetry — operates at the application layer. They see what your code tells them. The failures that cost the most in production systems often happen where none of them can see.

the structural gap

Why application-layer tools miss the hardest failures

When you instrument a service with OpenTelemetry, you are telling the system: record a span when this function is called, record its duration, record its attributes. The trace shows you what your code did. It does not show you what the operating system did while your code was running.
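A rough way to see this gap from inside a process is to record CPU time alongside wall-clock time. The sketch below is an illustration in Python, not the OpenTelemetry API: `timed_span` is a hypothetical helper, and the `time.sleep` call merely stands in for a write that was blocked or descheduled.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_span(name, results):
    """Record wall-clock and on-CPU time for a block (hypothetical helper)."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time consumed by this thread only
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.thread_time() - cpu_start
        results[name] = {
            "wall_s": wall,
            "cpu_s": cpu,
            "off_cpu_s": max(wall - cpu, 0.0),  # time spent not running
        }


results = {}
with timed_span("db_write", results):
    time.sleep(0.05)  # stands in for a blocked or descheduled write
```

Wall-clock time minus CPU time is the time the thread spent off the CPU. The number alone does not say whether the thread was blocked on I/O or preempted, which is the information an application-level span lacks.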

This is not a limitation of any specific tool. It is a fundamental constraint of where these tools sit in the stack. They operate in userspace, above the kernel, and see only the events that your code and its libraries report: function calls, HTTP requests, database queries. They are blind to everything below.

Below the application layer, the kernel is making decisions that directly affect your production system's behaviour:

CPU scheduler preemption

The kernel can preempt your database write process mid-operation and schedule a background batch job on the same CPU core. Your application sees a slow write. OpenTelemetry records a slow span. Neither tells you a batch job caused it.
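On Linux and other Unix systems a process can at least count how often it was preempted. A minimal sketch, assuming Python's standard `resource` module; `involuntary_switches_during` is a hypothetical helper name.

```python
import resource


def involuntary_switches_during(fn):
    """Run fn and return (result, involuntary context switches in the meantime)."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nivcsw
    result = fn()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nivcsw
    return result, after - before


# A CPU-bound stand-in for real work; the count depends on how busy the host is.
result, preemptions = involuntary_switches_during(lambda: sum(range(1_000_000)))
```

`ru_nivcsw` reports how many involuntary context switches occurred, not which process took the CPU or for how long.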

Memory pressure cascade

A service on the same node consuming excess memory triggers kernel memory pressure. The kernel begins reclaiming pages from other processes. Your database write, which normally takes 10ms, takes 800ms. Datadog sees a latency spike. It cannot tell you why.
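Linux exposes memory stall information through pressure stall information (PSI) files such as /proc/pressure/memory, on kernels built with PSI support. The sketch below parses that file format; `parse_psi` is a hypothetical helper and the sample values are invented for illustration.

```python
def parse_psi(text):
    """Parse PSI text (the format of /proc/pressure/memory) into nested dicts."""
    stats = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        values = dict(field.split("=", 1) for field in fields)
        stats[kind] = {
            "avg10": float(values["avg10"]),    # % of time stalled, last 10 s
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            "total_us": int(values["total"]),   # cumulative stall time, microseconds
        }
    return stats


# Invented sample; on a real node, read the text from /proc/pressure/memory.
sample = (
    "some avg10=12.50 avg60=4.10 avg300=1.02 total=8123456\n"
    "full avg10=3.25 avg60=1.00 avg300=0.20 total=2100000\n"
)
psi = parse_psi(sample)
```

The figures are node-wide (or per cgroup), so they indicate that tasks stalled on memory, not which request was affected.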

TCP retransmission

A network packet is dropped between your API worker and checkout service. The kernel retransmits it. The retransmit adds 200ms. That 200ms pushes your request past its timeout budget. Jaeger shows a slow span. It cannot distinguish a slow service from a retransmit.
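The kernel keeps system-wide TCP counters, including retransmitted segments, in /proc/net/snmp. A minimal sketch of reading them follows; `tcp_counters` is a hypothetical helper and the sample numbers are invented.

```python
def tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict of ints."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]  # names line, then values line
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


# Invented sample; on a real host, read the text from /proc/net/snmp.
sample = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails "
    "EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 5000 4000 10 20 30 900000 850000 1700 0 50 0\n"
)
counters = tcp_counters(sample)
retrans_ratio = counters["RetransSegs"] / counters["OutSegs"]
```

These counters are aggregated across every connection on the host, so a rising `RetransSegs` value confirms that retransmits occurred but does not tie one to a particular request.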

File descriptor contention

Two processes writing to the same file contend for a kernel-level lock. One process blocks. Your application logs show nothing unusual: from the application's perspective, it made a write call and got a response. It just took longer than expected.

These are not edge cases. In distributed systems running on shared Kubernetes infrastructure, kernel-level interference between workloads is a routine cause of latency spikes, failed requests, cascading retries, and ambiguous system states.

tool by tool

What each tool actually sees

Datadog

Application layer · SaaS APM

Strengths

Best-in-class APM dashboards and alerting
Broad integrations across languages and frameworks
Strong log aggregation and correlation
Excellent UI for operational visibility

Blind spots

Operates entirely at the application layer and sees only what you instrument
No visibility into CPU scheduling, memory pressure, or network retransmits
Cannot construct causal chains across kernel boundaries
No incident replay capability
Root cause analysis is manual and incomplete for OS-level failures

Datadog is excellent for operational monitoring and alerting. It tells you something is wrong. For production systems experiencing kernel-level failures, it cannot tell you why.

Jaeger

Application layer · Distributed tracing

Strengths

Industry standard for distributed tracing
Excellent cross-service span visualization
Open source, self-hosted option
Strong OpenTelemetry compatibility

Blind spots

Requires explicit instrumentation — only sees what you instrument
Spans represent application-level timing, not OS-level timing
A slow span could be a slow service or a kernel preemption — Jaeger cannot distinguish
No kernel event visibility whatsoever
No causal inference — shows what happened, not why

Jaeger shows you the distributed call graph. kprobe extends that graph downward into the kernel, where the actual cause of slow spans often lives.

OpenTelemetry

Application layer · Instrumentation standard

Strengths

Vendor-neutral instrumentation standard
Covers traces, metrics, and logs in a unified SDK
Growing auto-instrumentation support
Exports to any backend

Blind spots

Still requires code changes — auto-instrumentation covers common libraries, not custom code
Application-layer only by design
Traces represent what the SDK recorded, not ground-truth timing
No kernel visibility, no causal inference, no replay

kprobe is complementary to OpenTelemetry, not a replacement. It uses your OTel traces as context — correlating them with kernel events to give both layers full meaning.
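One way to picture this kind of correlation is a join between span time windows and kernel scheduler events. The sketch below is purely illustrative and is not kprobe's implementation: the `Span` and `SchedSwitch` records, their fields, and the sample values are assumptions, although `prev_pid` and `next_pid` mirror fields of the Linux `sched_switch` tracepoint.

```python
from typing import NamedTuple


class Span(NamedTuple):          # hypothetical, simplified trace span
    name: str
    pid: int
    start_ns: int
    end_ns: int


class SchedSwitch(NamedTuple):   # hypothetical record of a sched_switch event
    ts_ns: int
    prev_pid: int                # task switched out
    next_pid: int                # task switched in


def preemptions_during(span, events):
    """Return events that switched the span's process out while the span was open."""
    return [e for e in events
            if e.prev_pid == span.pid and span.start_ns <= e.ts_ns <= span.end_ns]


span = Span("db_write", pid=2841, start_ns=1_000_000, end_ns=803_000_000)
events = [
    SchedSwitch(ts_ns=4_000_000, prev_pid=2841, next_pid=4721),    # inside the span
    SchedSwitch(ts_ns=900_000_000, prev_pid=2841, next_pid=4721),  # after the span
    SchedSwitch(ts_ns=5_000_000, prev_pid=1234, next_pid=5678),    # other process
]
matches = preemptions_during(span, events)
```

A real correlation would also need both timestamp sources on a common clock.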

Prometheus + Grafana

Application layer · Metrics and dashboards

Strengths

De facto standard for Kubernetes metrics
Powerful query language (PromQL)
Excellent for capacity planning and trend analysis
Strong alerting via Alertmanager

Blind spots

Metrics are aggregates — individual events and their causes are invisible
A CPU spike metric tells you CPU was high, not which process caused it or why
No tracing, no causal analysis, no replay
Not designed for incident investigation — designed for operational visibility

Prometheus tells you that latency increased at 03:47. kprobe tells you that a batch job caused a kernel scheduler preemption at 03:47:12.822 and that your database write took about 800ms as a result.
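The effect of aggregation can be shown with a few lines of arithmetic. The sketch below uses invented sample data: 99 writes at 10ms and a single write stalled at 800ms.

```python
import statistics

# Illustrative numbers: 99 writes at the usual 10 ms and one stalled at 800 ms.
latencies_ms = [10] * 99 + [800]

mean_ms = statistics.mean(latencies_ms)      # what an average-latency panel shows
median_ms = statistics.median(latencies_ms)  # what a p50 panel shows
worst_ms = max(latencies_ms)                 # the single event that timed out
```

The mean moves from 10ms to 17.9ms and the median stays at 10ms, so a dashboard built on either would barely register the one request that timed out.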

Honeycomb

Application layer · Observability platform

Strengths

Excellent high-cardinality event exploration
Strong query capabilities for trace analysis
Good for understanding user-facing behaviour patterns

Blind spots

Application-layer only — same fundamental constraint as other APM tools
Requires instrumentation of every service
No kernel visibility, no causal inference, no replay

Honeycomb is strong for analysing patterns across many traces. kprobe is focused on deep investigation of individual incidents — specifically those caused by OS-level behaviour.

real scenario

The same incident through every tool

A checkout request fails at 03:47 after a database write exceeds its timeout budget. Retries begin to pile up. Here is what each tool shows the on-call engineer.

Datadog

Shows a latency spike in checkout-service at 03:47. The p99 latency metric crossed the alert threshold. An alert fired. The APM trace shows checkout-service took 812ms on a request that normally takes 200ms.

No further information is available about why it was slow.

Tells you it was slow. Not why.

Jaeger

Shows the distributed trace for the request. The span tree shows api-worker called checkout-service, which made a database write. The write span took 802ms. The overall trace took 812ms and timed out.

The write span is the slow span. No information on what made it slow.

Identifies the slow span. Not the cause.

Prometheus

Shows CPU utilisation on the node spiked to 94% at 03:47. Memory usage was elevated. Metrics alone cannot connect this node-level stress to the specific request failure.

Shows system stress. No connection to the incident.

kprobe

Shows the full causal graph for request req-9f21. The root cause node is a sched_switch event at 03:47:12.822 — the kernel preempted PID 2841 (checkout-service) and scheduled PID 4721 (a batch job) on CPU 3.

PID 2841 was mid-write to the database when it was preempted. It waited 388ms on the run queue before resuming. The write completed at 03:47:13.621, taking 802ms in total, 52ms past the API worker's 750ms timeout.

The fix: increase the API worker timeout to 1500ms. The engineer replays the incident with the modified timeout. The request succeeds. The fix ships.

Root cause identified. Fix verified. Total time: 5 minutes.
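The timings in this scenario can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

MS = timedelta(milliseconds=1)

# Timestamps and durations as quoted in the scenario.
preempted_at = datetime.strptime("03:47:12.822", "%H:%M:%S.%f")
write_done_at = datetime.strptime("03:47:13.621", "%H:%M:%S.%f")
write_total_ms = 802
timeout_ms = 750
proposed_timeout_ms = 1500

after_preemption_ms = (write_done_at - preempted_at) // MS    # elapsed after preemption
before_preemption_ms = write_total_ms - after_preemption_ms   # elapsed before preemption
overrun_ms = write_total_ms - timeout_ms                      # how far past the timeout
headroom_ms = proposed_timeout_ms - write_total_ms            # slack with the new timeout
```

The write completed 799ms after the preemption, so about 3ms of the 802ms elapsed before it, and the proposed 1500ms timeout leaves 698ms of headroom over the observed write.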

capability matrix

Full comparison

Capability	Datadog	Jaeger	OTel	Prometheus	Honeycomb	kprobe
Application traces	Yes	Yes	Yes	—	Yes	Yes
Zero instrumentation	—	—	—	—	—	Yes
CPU scheduling visibility	—	—	—	—	—	Yes
Memory pressure events	—	—	—	Metrics only	—	Yes
Network packet timing	—	—	—	—	—	Yes
Syscall-level visibility	—	—	—	—	—	Yes
Causal graph construction	—	—	—	—	—	Yes
Root cause to kernel level	—	—	—	—	—	Yes
Deterministic replay	—	—	—	—	—	Yes
Fix verification before deploy	—	—	—	—	—	Yes
Application/domain context	—	—	—	—	—	Yes
Nanosecond event precision	—	Microsecond	Nanosecond (span timestamps only)	Millisecond	Millisecond	Yes
Self-hosted option	—	Yes	Yes	Yes	—	Yes
Open source	—	Yes	Yes	Yes	—	Yes

positioning

kprobe does not replace your existing stack.

Datadog, Jaeger, OpenTelemetry, and Prometheus are all doing their jobs correctly. They are excellent tools for what they are designed to do — monitor application behaviour, track distributed traces, aggregate metrics, alert on thresholds.

kprobe fills the gap below them. It sees what they cannot see by design. Used together, you get complete visibility:

Alerting

Prometheus · Alertmanager · Grafana

Know when something is wrong

Application observability

Datadog · Jaeger · Honeycomb · OpenTelemetry

Understand what your application did

Kernel observability

kprobe

Understand what the OS did while your application was running

Linux kernel

Scheduler · Memory · Network · Filesystem

Where the hardest failures actually happen