eBPF Performance Monitoring with Aya

Performance problems in production systems are usually a puzzle with missing pieces. You know the symptom — high latency, unexpected CPU usage, I/O spikes — but the signals you have don’t point to the cause. The missing pieces are often things like cache misses starving the pipeline, NUMA traffic slowing down memory access, or a scheduler that can’t find a runnable task fast enough.

This tutorial is about filling in those pieces. We’re going to instrument a Linux system to capture a detailed performance picture across three layers:

Hardware — Performance Monitoring Counters (PMCs) that live on the CPU chip itself. These count events like cache line loads, branch mispredictions, and TLB walks. They’re the closest thing to a direct line into what the CPU is actually doing.

Kernel — eBPF programs attached to kprobes and tracepoints let us observe kernel behavior with very little overhead. We can watch how tasks get scheduled, how I/O requests enter the block layer, and how vhost/virtio rings get driven.

Procfs and sysfs — The kernel exposes a huge amount of state through virtual filesystems. NUMA hit/miss statistics, thermal zone temperatures, uncore IMC counters, per-CPU stats — all readable from userspace with no eBPF required.
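To make the third layer concrete, here is a minimal sketch of pulling a per-CPU stat out of `/proc/stat` with nothing but the standard library. The field order after the `cpuN` label (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is the documented procfs layout. The names `CpuTicks`, `parse_cpu_line`, and `steal_fraction` are hypothetical helpers made up for this illustration, not code from later parts.

```rust
/// Cumulative tick counters taken from one "cpuN ..." line of /proc/stat.
/// (Hypothetical helper type for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuTicks {
    pub total: u64,
    pub steal: u64,
}

/// Parse a line such as "cpu0 100 0 50 800 10 5 5 30 0 0".
/// Returns None for lines that are not CPU lines or that are malformed.
pub fn parse_cpu_line(line: &str) -> Option<(String, CpuTicks)> {
    let mut parts = line.split_whitespace();
    let label = parts.next()?;
    if !label.starts_with("cpu") {
        return None;
    }
    let fields = parts
        .map(|f| f.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    if fields.len() < 8 {
        return None;
    }
    // guest and guest_nice are already accounted for in user and nice,
    // so only the first eight fields are summed.
    let total: u64 = fields[..8].iter().sum();
    Some((label.to_string(), CpuTicks { total, steal: fields[7] }))
}

/// Fraction of ticks spent in steal between two samples of the same CPU.
/// Returns None if no time passed or a counter went backwards.
pub fn steal_fraction(prev: CpuTicks, curr: CpuTicks) -> Option<f64> {
    let dt = curr.total.checked_sub(prev.total)?;
    let ds = curr.steal.checked_sub(prev.steal)?;
    if dt == 0 {
        None
    } else {
        Some(ds as f64 / dt as f64)
    }
}
```

The counters are cumulative since boot, so a rate always comes from the difference between two samples. In practice the input would be each line of `std::fs::read_to_string("/proc/stat")`.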

The tool that ties it all together is Aya, a Rust library for building and loading eBPF programs without depending on libbpf or BCC. We’ll write eBPF programs in Rust, and use the same Rust codebase to read userspace sources and aggregate everything into a coherent output.

The Metrics

Here’s what we’ll capture and why:

CPU Internals

  • L1/L2/L3 cache miss rates — When a core can’t find data in cache, it stalls. High L3 miss rates often point to poor data locality or memory bandwidth saturation.
  • Branch mispredict rate — Modern CPUs speculatively execute. A mispredict throws away work. High mispredict rates point to unpredictable, data-dependent control flow.
  • IPC and stall ratio — Instructions per cycle (IPC) measures actual throughput. Stall ratio is the fraction of cycles the core was waiting, not executing.
  • dTLB/iTLB miss — The Translation Lookaside Buffer caches virtual-to-physical address translations. A miss means a costly page table walk.
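All of the CPU metrics above are ratios of two raw counters sampled over the same interval. As a rough sketch of that arithmetic, assuming we already have the counter deltas in hand: the struct and its field names below are illustrative only and are not tied to specific PMU event names.

```rust
/// Raw counter deltas over one sampling interval.
/// Field names are hypothetical; they do not map to specific PMU events.
#[derive(Debug, Clone, Copy)]
pub struct CounterDelta {
    pub cycles: u64,
    pub instructions: u64,
    pub stalled_cycles: u64,
    pub cache_refs: u64,
    pub cache_misses: u64,
    pub branches: u64,
    pub branch_misses: u64,
}

/// Divide two counters, returning None when the denominator is zero
/// (for example, when the counter was not scheduled during the interval).
fn ratio(num: u64, den: u64) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(num as f64 / den as f64)
    }
}

impl CounterDelta {
    /// Instructions retired per cycle.
    pub fn ipc(&self) -> Option<f64> {
        ratio(self.instructions, self.cycles)
    }

    /// Fraction of cycles in which the core was stalled.
    pub fn stall_ratio(&self) -> Option<f64> {
        ratio(self.stalled_cycles, self.cycles)
    }

    /// Cache misses per cache reference.
    pub fn cache_miss_rate(&self) -> Option<f64> {
        ratio(self.cache_misses, self.cache_refs)
    }

    /// Mispredicted branches per executed branch.
    pub fn branch_mispredict_rate(&self) -> Option<f64> {
        ratio(self.branch_misses, self.branches)
    }
}
```

Returning `Option` instead of `0.0` keeps "no data" distinct from "no misses", which matters once these values feed into alerts or dashboards.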

Memory and NUMA

  • Uncore IMC bandwidth — The Integrated Memory Controller sits on the CPU die but outside the cores, in the “uncore”. It has its own performance counters. Saturation means memory bandwidth is the bottleneck.
  • NUMA remote ratio — On a multi-socket system, accessing memory attached to a remote socket is slower. The remote ratio tells you what fraction of memory accesses are remote.
  • Page migration rate — NUMA balancing moves pages between nodes. A sustained high migration rate suggests tasks and their memory keep ending up on different nodes, so the kernel spends time copying pages instead of settling.
  • Hugepage utilization — Hugepages reduce TLB pressure. If the pool is exhausted or transparent hugepages aren’t coalescing, applications aren’t getting the benefit they should.
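One way to approximate the NUMA remote ratio from userspace is the per-node `numastat` file under `/sys/devices/system/node/nodeN/`, which lists counters such as `local_node` and `other_node` as `key value` lines. The sketch below is an assumption about how the ratio could be derived, not necessarily the definition used later. These counters track page allocations rather than individual memory accesses, so the result is a proxy. `parse_numastat` and `remote_ratio` are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Parse "key value" lines as found in a per-node numastat file.
/// Lines that do not match that shape are skipped.
pub fn parse_numastat(text: &str) -> HashMap<String, u64> {
    text.lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

/// Share of allocations on this node that were made for tasks running on
/// another node: other_node / (local_node + other_node).
/// Returns None if either counter is missing or both are zero.
pub fn remote_ratio(stats: &HashMap<String, u64>) -> Option<f64> {
    let local = *stats.get("local_node")?;
    let other = *stats.get("other_node")?;
    let total = local.checked_add(other)?;
    if total == 0 {
        None
    } else {
        Some(other as f64 / total as f64)
    }
}
```

The values are cumulative since boot, so for a live view the same calculation would be applied to the difference between two reads of the file.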

I/O

  • IOPS pattern entropy — Predictable I/O patterns are easier for the kernel to batch. High entropy means the access pattern is close to random, which defeats merging and readahead and may saturate queues.
  • vhost queue depth p50/p99 — How many requests are buffered in the vhost/virtio ring. High p99 means occasional stalls.
  • virtio ring stalls — How often the virtio ring can’t proceed because it’s waiting for descriptors.
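“Pattern entropy” can be defined in several ways. As one possible sketch, assume we record the starting sector of each request, bucket the distance between consecutive requests by powers of two, and compute the Shannon entropy of the bucket counts. The bucketing scheme and the function names here are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits, of a discrete distribution given as counts.
pub fn shannon_entropy(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let total = total as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Entropy of the log2-bucketed distance between consecutive request sectors.
/// A purely sequential stream lands in a single bucket and scores 0 bits.
pub fn sector_delta_entropy(sectors: &[u64]) -> f64 {
    let mut buckets: HashMap<u32, u64> = HashMap::new();
    for pair in sectors.windows(2) {
        let delta = pair[1].abs_diff(pair[0]);
        // Bucket 0 holds delta == 0; bucket i >= 1 holds [2^(i-1), 2^i).
        let bucket = u64::BITS - delta.leading_zeros();
        *buckets.entry(bucket).or_insert(0) += 1;
    }
    let counts: Vec<u64> = buckets.values().copied().collect();
    shannon_entropy(&counts)
}
```

Bucketing by powers of two keeps the number of distinct symbols small (at most 65 for a `u64`), so the entropy stays comparable across devices of very different sizes.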

Scheduler

  • runqueue wait p50/p99 — How long a task sits waiting on the runqueue before it gets a CPU. High p99 means scheduling latency spikes.
  • vCPU steal time — On virtualized systems, “steal” is time the guest wanted to run but the hypervisor didn’t give it. High steal means the host is oversubscribed.
  • involuntary ctxsw rate — How often tasks get kicked off the CPU involuntarily (preempted, time slice expired). High rates can mean too many CPU-bound tasks.
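The p50/p99 figures for runqueue wait (and for queue depth in the I/O section) are typically read off a histogram rather than computed from every raw sample. Here is a minimal sketch, assuming a power-of-two bucketed histogram of the kind commonly kept in eBPF maps. The bucket layout and the function names are assumptions, and the result is only an upper bound for the bucket that contains the requested percentile.

```rust
/// Bucket index for a value: bucket 0 holds 0, bucket i >= 1 holds [2^(i-1), 2^i).
pub fn log2_bucket(value: u64) -> usize {
    (u64::BITS - value.leading_zeros()) as usize
}

/// Largest value that can fall into the bucket containing the `pct`-th
/// percentile (nearest-rank method). `pct` is expected to be in 0..=100.
/// Returns None for an empty histogram or if `pct` exceeds 100.
pub fn percentile_upper_bound(hist: &[u64], pct: u32) -> Option<u64> {
    let total: u128 = hist.iter().map(|&c| c as u128).sum();
    if total == 0 {
        return None;
    }
    // 1-based rank of the sample we are looking for, rounded up.
    let rank = ((pct as u128 * total + 99) / 100).max(1);
    let mut seen: u128 = 0;
    for (i, &count) in hist.iter().enumerate() {
        seen += count as u128;
        if seen >= rank {
            return Some(match i {
                0 => 0,
                1..=63 => (1u64 << i) - 1,
                _ => u64::MAX,
            });
        }
    }
    None
}
```

Because each bucket spans a factor of two, the reported p99 can overstate the true value by up to 2x. That is usually an acceptable trade for constant memory, since the histogram needs only 65 counters regardless of how many samples arrive.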

Thermal

  • per-core thermal headroom — The gap between current temperature and the thermal throttle point. Ample headroom means performance isn’t thermally constrained; shrinking headroom warns that throttling is close.
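Headroom itself is a simple subtraction once both temperatures are known. The sketch below assumes the values come from sysfs files that report millidegrees Celsius as plain integers, as the hwmon `temp*_input` and `temp*_crit` files do. Which file best represents the throttle point on a given platform is an assumption left to the reader, and the function names are hypothetical.

```rust
/// Parse a sysfs temperature reading in millidegrees Celsius, e.g. "45000\n".
pub fn parse_millicelsius(text: &str) -> Option<i64> {
    text.trim().parse::<i64>().ok()
}

/// Headroom in degrees Celsius between the current temperature and a limit.
/// A negative result means the reading is already above the limit.
pub fn headroom_celsius(current_mc: i64, limit_mc: i64) -> f64 {
    (limit_mc - current_mc) as f64 / 1000.0
}
```

For example, a reading of `45000` against a limit of `100000` gives 55 degrees of headroom.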

The tutorial is organized in twelve parts:

  • Parts 1–2: Architecture and project setup
  • Parts 3–4: Hardware PMCs with perf_event_open
  • Parts 5–6: Cache and TLB metrics; scheduler tracing with eBPF
  • Parts 7–8: NUMA and memory metrics; uncore IMC bandwidth
  • Parts 9–10: Thermal monitoring; block I/O tracing and entropy
  • Parts 11–12: vhost and virtio ring instrumentation; histograms for queue depth

Next: Part 1 — Three Sources of Signal — Where our data comes from: hardware PMCs, kernel tracepoints, and virtual filesystems.