eBPF Performance Monitoring with Aya
Performance problems in production systems are usually a puzzle with missing pieces. You know the symptom (high latency, unexpected CPU usage, I/O spikes), but the signals you have don’t point to the cause. The missing pieces are often things like cache misses starving the pipeline, cross-node NUMA traffic slowing down memory access, or runnable tasks waiting too long on a scheduler runqueue.
This tutorial is about filling in those pieces. We’re going to instrument a Linux system to capture a detailed performance picture across three layers:
Hardware — Performance Monitoring Counters (PMCs) that live on the CPU chip itself. These count events like cache line loads, branch mispredictions, and TLB walks. They’re the closest thing to a direct line into what the CPU is actually doing.
Kernel — eBPF programs attached to kprobes and tracepoints let us observe kernel behavior with very little overhead. The overhead is small rather than zero: each probe runs a short verified program on the hot path. We can watch how tasks get scheduled, how I/O requests enter the block layer, and how vhost/virtio rings get driven.
Procfs and sysfs — The kernel exposes a huge amount of state through virtual filesystems. NUMA hit/miss statistics, thermal zone temperatures, uncore IMC counters, per-CPU stats — all readable from userspace with no eBPF required.
The tool that ties it all together is Aya, a Rust library for building eBPF programs without depending on libbpf or BCC. We’ll write eBPF programs in Rust, and use the same Rust codebase to read userspace sources and aggregate everything into a coherent output.
The Metrics
Here’s what we’ll capture and why:
CPU Internals
- L1/L2/L3 cache miss rates — When a core can’t find data in cache, it stalls. High L3 miss rates often point to poor data locality or memory bandwidth saturation.
- Branch mispredict rate — Modern CPUs execute speculatively. A mispredict throws away the speculated work and restarts the pipeline. High mispredict rates point to unpredictable, data-dependent control flow, which hurts most when it sits inside a hot loop.
- IPC and stall ratio — Instructions per cycle (IPC) measures actual throughput. Stall ratio is the fraction of cycles the core was waiting, not executing.
- dTLB/iTLB miss — The Translation Lookaside Buffer caches virtual-to-physical address translations. A miss means a costly page table walk.
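The ratios above all reduce to dividing one counter delta by another. Here is a minimal sketch of that arithmetic, assuming the raw counts have already been read for one sampling interval. The `CounterDeltas` struct and its field names are hypothetical and are not tied to any particular PMU event list.

```rust
/// Counter deltas collected over one sampling interval.
/// Struct and field names are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
pub struct CounterDeltas {
    pub cycles: u64,
    pub instructions: u64,
    pub stalled_cycles: u64,
    pub cache_references: u64,
    pub cache_misses: u64,
    pub branches: u64,
    pub branch_misses: u64,
}

/// Divide two counters, returning None when the denominator is zero
/// (for example, when the counter was not scheduled during the interval).
fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}

impl CounterDeltas {
    /// Instructions retired per cycle.
    pub fn ipc(&self) -> Option<f64> {
        ratio(self.instructions, self.cycles)
    }

    /// Fraction of cycles in which the core was stalled.
    pub fn stall_ratio(&self) -> Option<f64> {
        ratio(self.stalled_cycles, self.cycles)
    }

    /// Fraction of cache references that missed.
    pub fn cache_miss_rate(&self) -> Option<f64> {
        ratio(self.cache_misses, self.cache_references)
    }

    /// Fraction of branches that were mispredicted.
    pub fn branch_mispredict_rate(&self) -> Option<f64> {
        ratio(self.branch_misses, self.branches)
    }
}
```

Returning `Option<f64>` keeps an idle or unscheduled interval from showing up as a misleading 0.0 or NaN in the output.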
Memory and NUMA
- Uncore IMC bandwidth — The Integrated Memory Controller sits on the CPU package but outside the cores, in the uncore. It has its own performance counters. Saturation means memory bandwidth is the bottleneck.
- NUMA remote ratio — On a multi-socket system, accessing memory attached to a remote socket is slower than accessing local memory. The remote ratio tells you what fraction of memory accesses are remote.
- Page migration rate — NUMA balancing moves pages between nodes. High migration rates mean the system is fighting itself.
- Hugepage utilization — Hugepages reduce TLB pressure. If the pool is exhausted or transparent hugepages aren’t coalescing, applications aren’t getting the benefit they should.
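As a rough illustration of the NUMA remote ratio, the sketch below parses the text of a per-node `numastat` file (as found under `/sys/devices/system/node/node*/numastat`) and compares two snapshots. It assumes the `local_node` and `other_node` counters are a good enough proxy. Those counters track page allocations rather than individual memory accesses, so this is one possible definition of the ratio, not the only one. The function names are hypothetical.

```rust
use std::collections::HashMap;

/// Parse "key value" lines, e.g. the contents of a node's numastat file.
/// Lines that do not match the expected shape are skipped.
pub fn parse_numastat(text: &str) -> HashMap<String, u64> {
    text.lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

/// Fraction of allocations on this node, between two snapshots, that were
/// made by tasks running on a different node. Returns None when a counter
/// is missing or nothing was allocated in the interval.
pub fn remote_ratio(
    prev: &HashMap<String, u64>,
    curr: &HashMap<String, u64>,
) -> Option<f64> {
    let delta = |key: &str| -> Option<u64> {
        Some(curr.get(key)?.saturating_sub(*prev.get(key)?))
    };
    let local = delta("local_node")?;
    let other = delta("other_node")?;
    let total = local + other;
    if total == 0 {
        None
    } else {
        Some(other as f64 / total as f64)
    }
}
```

Working from deltas between snapshots matters here because the counters are cumulative since boot. A ratio computed from the raw values would be dominated by old history.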
I/O
- IOPS pattern entropy — Predictable I/O patterns are easier for the kernel to batch. High entropy means the I/O is random and may saturate queues.
- vhost queue depth p50/p99 — How many requests are buffered in the vhost/virtio ring. High p99 means occasional stalls.
- virtio ring stalls — How often the virtio ring can’t proceed because it’s waiting for descriptors.
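Two small pieces of arithmetic sit behind the I/O metrics above: Shannon entropy over a histogram of access patterns, and percentiles read from a power-of-two histogram. The sketch below shows both under stated assumptions: the entropy input is a slice of bucket counts (how I/O is bucketed is left open), and histogram bucket `i` is assumed to hold values in the range `[2^i, 2^(i+1))`. The function names are hypothetical.

```rust
/// Shannon entropy, in bits, of a histogram of bucket counts.
/// 0.0 means every observation fell into one bucket (fully predictable);
/// log2(number of buckets) means observations are spread uniformly.
pub fn shannon_entropy_bits(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Estimate a percentile from a log2 histogram where bucket `i` counts
/// values in [2^i, 2^(i+1)). Returns the exclusive upper bound of the
/// bucket containing the requested rank, so the result is an upper
/// estimate with power-of-two resolution.
pub fn percentile_upper_bound(buckets: &[u64], pct: u64) -> Option<u64> {
    let total: u64 = buckets.iter().sum();
    if total == 0 || pct > 100 {
        return None;
    }
    // Ceiling of total * pct / 100 in integer math, at least rank 1.
    let rank = ((total as u128 * pct as u128 + 99) / 100).max(1) as u64;
    let mut seen = 0u64;
    for (i, &count) in buckets.iter().enumerate() {
        seen += count;
        if seen >= rank {
            // checked_shl yields None if the bound does not fit in u64.
            return 1u64.checked_shl(i as u32 + 1);
        }
    }
    None
}
```

Integer percentiles avoid floating-point rounding at the rank boundary, which matters when p99 decides between two adjacent buckets.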
Scheduler
- runqueue wait p50/p99 — How long a task sits waiting on the runqueue before it gets a CPU. High p99 means scheduling latency spikes.
- vCPU steal time — On virtualized systems, “steal” is time the guest wanted to run but the hypervisor didn’t give it. High steal means the host is oversubscribed.
- involuntary ctxsw rate — How often tasks get kicked off the CPU involuntarily (preempted, time slice expired). High rates can mean too many CPU-bound tasks.
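Steal time is the one scheduler metric here that needs no eBPF at all: the aggregate `cpu` line in `/proc/stat` carries it as the eighth numeric field (after user, nice, system, idle, iowait, irq, and softirq). The following is a minimal sketch of turning two samples of that line into a steal fraction. The `CpuTimes` type and function names are hypothetical, and the total deliberately excludes the guest fields because the kernel already accounts guest time under user and nice.

```rust
/// Cumulative CPU time, in clock ticks, taken from one /proc/stat sample.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuTimes {
    pub steal: u64,
    pub total: u64,
}

/// Parse the aggregate "cpu" line of /proc/stat.
/// Returns None for per-CPU lines ("cpu0", ...) or malformed input.
pub fn parse_cpu_line(line: &str) -> Option<CpuTimes> {
    let mut fields = line.split_whitespace();
    if fields.next()? != "cpu" {
        return None;
    }
    let values = fields
        .map(|f| f.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    if values.len() < 8 {
        return None;
    }
    Some(CpuTimes {
        steal: values[7],
        // user + nice + system + idle + iowait + irq + softirq + steal
        total: values[..8].iter().sum(),
    })
}

/// Fraction of elapsed CPU time between two samples that was stolen.
pub fn steal_fraction(prev: CpuTimes, curr: CpuTimes) -> Option<f64> {
    let total = curr.total.checked_sub(prev.total)?;
    if total == 0 {
        return None;
    }
    let steal = curr.steal.saturating_sub(prev.steal);
    Some(steal as f64 / total as f64)
}
```

Sampling the line once per second and feeding consecutive pairs into `steal_fraction` gives a per-interval view instead of an average since boot.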
Thermal
- per-core thermal headroom — The gap between current temperature and the thermal throttle point. Ample headroom means performance isn’t thermally constrained. Headroom near zero means the core is about to throttle, or already has.
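Thermal headroom is a subtraction once both values are in hand. The sketch below assumes the current temperature and the throttle limit arrive as the raw strings the kernel exposes in sysfs thermal files, which report millidegrees Celsius (for example `62000` for 62 °C). Which files supply the per-core reading and the limit depends on the platform and is not covered here. The function names are hypothetical.

```rust
/// Parse a sysfs temperature string such as "62000\n" (millidegrees Celsius).
pub fn parse_millidegrees(raw: &str) -> Option<i64> {
    raw.trim().parse::<i64>().ok()
}

/// Headroom in degrees Celsius between the current temperature and the
/// throttle limit. A negative result means the limit has been exceeded.
pub fn headroom_celsius(current_raw: &str, limit_raw: &str) -> Option<f64> {
    let current = parse_millidegrees(current_raw)?;
    let limit = parse_millidegrees(limit_raw)?;
    Some((limit - current) as f64 / 1000.0)
}
```

Keeping the result signed makes an over-limit reading visible instead of clamping it to zero.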
The tutorial is organized as follows:
- Parts 1–2: Architecture and project setup
- Parts 3–4: Hardware PMCs with perf_event_open
- Parts 5–6: Cache and TLB metrics; scheduler tracing with eBPF
- Parts 7–8: NUMA and memory metrics; uncore IMC bandwidth
- Parts 9–10: Thermal monitoring; block I/O tracing and entropy
- Parts 11–12: vhost and virtio ring instrumentation; histograms for queue depth
Next: Part 1 — Three Sources of Signal — Where our data comes from: hardware PMCs, kernel tracepoints, and virtual filesystems.