eBPF Performance Monitoring with Aya

Performance problems in production systems are usually a puzzle with missing pieces. You know the symptom — high latency, unexpected CPU usage, I/O spikes — but the signals you have don’t point to the cause. The missing pieces are often things like cache misses starving the pipeline, NUMA traffic slowing down memory access, or a scheduler that can’t find a runnable task fast enough.

This tutorial is about filling in those pieces. We’re going to instrument a Linux system to capture a detailed performance picture across three layers:

Hardware: Performance Monitoring Counters (PMCs) that live on the CPU die itself. These count events like cache line loads, branch mispredictions, and TLB walks. They’re the closest thing to a direct line into what the CPU is actually doing.

Kernel: eBPF programs attached to kprobes and tracepoints let us observe kernel behavior with very little added overhead. We can watch how tasks get scheduled, how I/O requests enter the block layer, and how vhost/virtio rings get driven.

Procfs and sysfs: The kernel exposes a huge amount of state through virtual filesystems. NUMA hit/miss statistics, thermal zone temperatures, uncore IMC counters, and per-CPU stats are all readable from userspace with no eBPF required.

The tool that ties it all together is Aya, a Rust library for building eBPF programs without relying on libbpf or BCC. We’ll write eBPF programs in Rust, and use the same Rust codebase to read userspace sources and aggregate everything into a coherent output.

The Metrics

Here’s what we’ll capture and why:

CPU Internals

L1/L2/L3 cache miss rates: When a core can’t find data in cache, it stalls. High L3 miss rates often point to poor data locality or memory bandwidth saturation.
Branch mispredict rate: Modern CPUs execute speculatively. A mispredict throws away work. High mispredict rates point to unpredictable, data-dependent control flow in hot code.
IPC and stall ratio: Instructions per cycle (IPC) measures actual throughput. Stall ratio is the fraction of cycles the core was waiting, not executing.
dTLB/iTLB miss: The Translation Lookaside Buffer caches virtual-to-physical address translations. A miss means a costly page table walk.
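
Every rate in this list is a ratio of two counter deltas taken over the same interval. As a minimal sketch of that arithmetic (the CounterSample struct and its field names are hypothetical, not something the later parts define), the miss and mispredict rates could be computed like this:

```rust
/// Raw counter values captured at one instant.
/// Field names are illustrative; they are not part of the tutorial's code.
#[derive(Clone, Copy)]
pub struct CounterSample {
    pub cache_refs: u64,
    pub cache_misses: u64,
    pub branches: u64,
    pub branch_misses: u64,
}

/// Ratio of two deltas; 0.0 when the denominator did not advance.
fn ratio(num: u64, den: u64) -> f64 {
    if den == 0 {
        0.0
    } else {
        num as f64 / den as f64
    }
}

/// Returns (cache miss rate, branch mispredict rate) for the interval
/// between two samples. wrapping_sub keeps the delta correct if a
/// counter wraps around between reads.
pub fn interval_rates(prev: CounterSample, cur: CounterSample) -> (f64, f64) {
    let miss_rate = ratio(
        cur.cache_misses.wrapping_sub(prev.cache_misses),
        cur.cache_refs.wrapping_sub(prev.cache_refs),
    );
    let mispredict_rate = ratio(
        cur.branch_misses.wrapping_sub(prev.branch_misses),
        cur.branches.wrapping_sub(prev.branches),
    );
    (miss_rate, mispredict_rate)
}
```

Working from deltas rather than absolute values means the counters never have to be reset between intervals.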

Memory and NUMA

Uncore IMC bandwidth: The Integrated Memory Controller sits on the die but outside the cores, in the “uncore”. It has its own performance counters. Saturation means memory bandwidth is the bottleneck.
NUMA remote ratio: On a multi-socket system, accessing memory attached to a remote socket is slower. The remote ratio tells you what fraction of memory access is remote.
Page migration rate: NUMA balancing moves pages between nodes. High migration rates mean the system is fighting itself.
Hugepage utilization: Hugepages reduce TLB pressure. If the pool is exhausted or transparent hugepages aren’t coalescing, applications aren’t getting the benefit they should.
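
One hedged way to approximate the remote ratio is from the numa_local and numa_other counters in /proc/vmstat. Note the assumption: these count page allocations, not individual memory accesses, so the result is a proxy. The function names below are illustrative.

```rust
/// Approximate NUMA remote ratio from deltas of the numa_local and
/// numa_other counters in /proc/vmstat. These counters track page
/// allocations (local vs. on behalf of a task running on another node),
/// so this is a proxy for remote access, not a direct measurement.
pub fn numa_remote_ratio(numa_local: u64, numa_other: u64) -> f64 {
    let total = numa_local as f64 + numa_other as f64;
    if total == 0.0 {
        0.0
    } else {
        numa_other as f64 / total
    }
}

/// Events per second over an interval, e.g. for a delta of the
/// numa_pages_migrated counter. Returns 0.0 for an empty interval.
pub fn per_second(delta: u64, interval_secs: f64) -> f64 {
    if interval_secs <= 0.0 {
        0.0
    } else {
        delta as f64 / interval_secs
    }
}
```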

I/O

IOPS pattern entropy — Predictable I/O patterns are easier for the kernel to batch. High entropy means the I/O is random and may saturate queues.
vhost queue depth p50/p99 — How many requests are buffered in the vhost/virtio ring. High p99 means occasional stalls.
virtio ring stalls — How often the virtio ring can’t proceed because it’s waiting for descriptors.
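
To make “pattern entropy” concrete, here is a small sketch under one assumption: each request is bucketed by the log2 of its distance from the previous request’s sector, and the entropy is the Shannon entropy of that bucket distribution. The bucketing scheme is an illustrative choice, not the only reasonable one.

```rust
use std::collections::HashMap;

/// Bucket each request by log2 of its sector distance from the previous
/// request. Sequential I/O lands in one bucket; random I/O spreads out.
pub fn seek_buckets(sectors: &[u64]) -> Vec<u64> {
    sectors
        .windows(2)
        .map(|w| {
            let distance = w[0].abs_diff(w[1]);
            if distance == 0 {
                0
            } else {
                64 - distance.leading_zeros() as u64
            }
        })
        .collect()
}

/// Shannon entropy, in bits, of the distribution of bucket ids.
pub fn shannon_entropy(buckets: &[u64]) -> f64 {
    if buckets.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u64, u64> = HashMap::new();
    for &b in buckets {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = buckets.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

A purely sequential stream yields an entropy of zero; four equally likely seek distances yield two bits.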

Scheduler

runqueue wait p50/p99 — How long a task sits waiting on the runqueue before it gets a CPU. High p99 means scheduling latency spikes.
vCPU steal time — On virtualized systems, “steal” is time the guest wanted to run but the hypervisor didn’t give it. High steal means the host is oversubscribed.
involuntary ctxsw rate — How often tasks get kicked off the CPU involuntarily (preempted, time slice expired). High rates can mean too many CPU-bound tasks.
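
Steal time is the one scheduler metric here that needs no eBPF at all: it is the eighth value on the aggregate cpu line of /proc/stat. A minimal sketch of turning two readings into a steal fraction (function names are illustrative):

```rust
/// Parse the aggregate "cpu" line of /proc/stat into (steal, total) ticks.
/// Field order: user nice system idle iowait irq softirq steal guest guest_nice.
pub fn parse_cpu_line(line: &str) -> Option<(u64, u64)> {
    let mut parts = line.split_whitespace();
    if parts.next()? != "cpu" {
        return None;
    }
    let vals: Vec<u64> = parts.map(|v| v.parse().ok()).collect::<Option<_>>()?;
    let steal = *vals.get(7)?;
    // guest and guest_nice are already included in user and nice,
    // so only the first eight fields go into the total.
    let total: u64 = vals.iter().take(8).sum();
    Some((steal, total))
}

/// Fraction of CPU time stolen between two (steal, total) readings.
pub fn steal_fraction(prev: (u64, u64), cur: (u64, u64)) -> f64 {
    let total_delta = cur.1.saturating_sub(prev.1);
    if total_delta == 0 {
        0.0
    } else {
        cur.0.saturating_sub(prev.0) as f64 / total_delta as f64
    }
}
```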

Thermal

per-core thermal headroom — The gap between current temperature and the thermal throttle point. Headroom means performance isn’t thermally constrained.
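
The headroom arithmetic is simple once the units are right: sysfs thermal files report millidegrees Celsius. A sketch, assuming you have already decided which trip point represents throttling on your platform (that choice is platform-specific and not shown here):

```rust
/// Parse the contents of a sysfs temperature file such as
/// /sys/class/thermal/thermal_zone0/temp (millidegrees Celsius).
pub fn parse_millidegrees(raw: &str) -> Option<f64> {
    raw.trim().parse::<i64>().ok().map(|m| m as f64 / 1000.0)
}

/// Degrees Celsius left before the throttle point.
/// A negative value means the zone is already past it.
pub fn thermal_headroom(current_c: f64, throttle_c: f64) -> f64 {
    throttle_c - current_c
}
```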

The tutorial:

Parts 1–2: Architecture and project setup
Parts 3–4: Hardware PMCs with perf_event_open
Parts 5–6: Cache and TLB metrics; scheduler tracing with eBPF
Parts 7–8: NUMA and memory metrics; uncore IMC bandwidth
Parts 9–10: Thermal monitoring; block I/O tracing and entropy
Parts 11–12: vhost and virtio ring instrumentation; histograms for queue depth

Next: Part 1 — Three Sources of Signal — Where our data comes from: hardware PMCs, kernel tracepoints, and virtual filesystems.

Part 1 — Three Sources of Signal

Performance data comes from three places on Linux. Understanding which source a metric comes from tells you a lot about how to capture it, what it means, and where its limits are.

Source 1: Hardware Performance Counters (PMCs)

Modern CPUs have dedicated hardware counters on-die. They’re called PMCs — Performance Monitoring Counters. They count things like:

How many times a cache line was requested and missed
How many branches executed and how many were mispredicted
How many instructions retired and how many cycles were stalled
How many TLB walks happened and how many missed the hardware TLB

PMCs are read through the perf_event_open syscall. On x86, the underlying hardware is the Performance Counters MSRs (Model-Specific Registers). On ARM, there’s the ARM PMU (Performance Monitoring Unit). The syscall abstracts this, but the available events depend on your CPU microarchitecture.

The kernel exposes a curated list of named events through /sys/bus/event_source/devices/ and through perf list. (/dev/cpu/*/msr is a different interface: it gives root raw access to the model-specific registers themselves, bypassing the perf subsystem.) If you only need the universal hardware events (instructions, cycles, cache references), the perf-event crate wraps perf_event_open with a safe Rust interface. We don’t use it here because we need raw PMC events (cache misses, TLB walks, branch mispredicts) that vary by CPU microarchitecture, and the crate doesn’t expose PERF_TYPE_RAW on all platforms. Part 3 shows the hand-rolled struct and the direct syscall.

Constraints:

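Most PMCs require elevated privileges: root, CAP_PERFMON (Linux 5.8+), or CAP_SYS_ADMIN, depending on the /proc/sys/kernel/perf_event_paranoid setting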
Not all events are available on all CPUs
On hypervisors, some events may not reflect guest-observed behavior accurately
Counting across multiple CPUs requires either per-CPU file descriptors or multiplexing
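
When the kernel multiplexes more events than the hardware has counters, each event only runs part of the time. If the counter is opened with PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING in read_format, each read returns the raw value plus both times, and the usual correction is to scale by their ratio. A sketch of that correction (the struct name is illustrative):

```rust
/// The three u64 values returned by read() when read_format is
/// PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING.
pub struct ScaledRead {
    pub value: u64,
    pub time_enabled: u64,
    pub time_running: u64,
}

/// Estimate the count the event would have reached had it been on the
/// hardware for the whole enabled window. Returns None if the event was
/// never scheduled, because no estimate is possible then.
pub fn scaled_count(r: &ScaledRead) -> Option<u64> {
    if r.time_running == 0 {
        return None;
    }
    // u128 intermediate avoids overflow for large counts and long windows.
    let scaled = (r.value as u128 * r.time_enabled as u128) / r.time_running as u128;
    Some(scaled as u64)
}
```

The scaled value is an estimate: it assumes the event rate was the same while the counter was descheduled.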

Source 2: Kernel Tracepoints and Kprobes

The Linux kernel emits events at interesting points in its execution. These come in two forms:

Tracepoints are stable hooks placed in the kernel by developers. They have stable names and argument formats. Examples:

sched:sched_waking — a task is about to be woken
sched:sched_switch — the scheduler switched from one task to another
block:block_bio_queue — a block I/O request was submitted
irq:softirq_entry — a softirq started executing

Kprobes are dynamic probes that can be placed at almost any kernel function entry or return. They’re less stable (function names change between kernel versions) but much more powerful — you can probe any function, not just the ones with tracepoints.

Both tracepoints and kprobes are programmable via eBPF. This is the core of what Aya lets us do in Rust — write eBPF programs that read data from these hooks and push it into maps that userspace can read.

Key maps:

BPF_MAP_TYPE_PERF_EVENT_ARRAY: per-CPU perf buffers for sending structured events to userspace
BPF_MAP_TYPE_HASH: key-value store for counters and state
BPF_MAP_TYPE_ARRAY: indexed array, good for histograms
BPF_MAP_TYPE_RINGBUF: a single ring buffer shared by all CPUs (Linux 5.8+), newer and usually more efficient than perf event arrays
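
To illustrate why an array map suits histograms, here is a plain-Rust sketch of log2 bucketing and a percentile lookup. It assumes a 32-slot histogram and is not the eBPF code itself; in a real program the in-kernel side would increment slot log2_bucket(value) of the array map, and userspace would run the percentile scan on a copy of the slots.

```rust
/// Number of histogram slots (an illustrative choice).
pub const BUCKETS: usize = 32;

/// Map a value to its log2 bucket: 0 -> 0, 1 -> 1, 2..=3 -> 2, 4..=7 -> 3, ...
/// Uses leading_zeros instead of a loop, which keeps the work bounded.
pub fn log2_bucket(value: u64) -> usize {
    let bucket = (64 - value.leading_zeros()) as usize;
    bucket.min(BUCKETS - 1)
}

/// Index of the bucket containing the given percentile (0..=100),
/// or None if the histogram is empty.
pub fn percentile_bucket(hist: &[u64; BUCKETS], pct: u64) -> Option<usize> {
    let total: u64 = hist.iter().sum();
    if total == 0 {
        return None;
    }
    // Smallest rank that covers pct percent of the samples (at least 1).
    let target = ((total * pct + 99) / 100).max(1);
    let mut seen = 0u64;
    for (i, &count) in hist.iter().enumerate() {
        seen += count;
        if seen >= target {
            return Some(i);
        }
    }
    Some(BUCKETS - 1)
}
```

The answer is a bucket, not an exact value: bucket b covers values from 2^(b-1) up to 2^b - 1, so a p99 in bucket 11 means “somewhere between 1024 and 2047”.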

Constraints:

Kprobe function names are kernel-version specific
eBPF programs are verified — you can’t write to arbitrary memory or loop unboundedly
The eBPF VM has a 512-byte stack limit (no heap)
CO-RE (Compile Once, Run Everywhere) with BTF makes kprobes portable; without it, you need kernel headers for each target version

Source 3: Procfs and Sysfs

The kernel exposes a huge amount of state through two virtual filesystems that don’t exist on disk:

/proc/ — process and system information. Relevant files:

/proc/cpuinfo — CPU model, microarchitecture, flags
/proc/vmstat — virtual memory statistics, including NUMA page stats
/proc/schedstat — scheduler statistics per CPU
/proc/interrupts — interrupt counts per CPU
/proc/loadavg — load average

/sys/ — kernel data structures organized as a tree. Relevant paths:

/sys/devices/system/cpu/ — per-CPU attributes
/sys/class/thermal/thermal_zone*/ — thermal zones with current temperature
/sys/bus/event_source/devices/ — available perf events
/sys/kernel/mm/ — hugepages and transparent hugepage settings
/sys/class/block/ — per-block-device statistics
/sys/devices/system/node/ — NUMA node memory statistics

These are readable with standard file I/O — no root required for most files, and no eBPF required.
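
Many of these files share a simple “name value” line format, /proc/vmstat among them. A minimal sketch of a reader for that format follows; the function names are illustrative, and files with other layouts (such as /proc/schedstat) need their own parsers.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;

/// Parse "name value" lines into a map, skipping lines that don't fit.
pub fn parse_kv(text: &str) -> HashMap<String, u64> {
    text.lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let key = fields.next()?;
            let value = fields.next()?.parse().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

/// Read and parse /proc/vmstat with ordinary file I/O.
pub fn read_vmstat() -> io::Result<HashMap<String, u64>> {
    Ok(parse_kv(&fs::read_to_string("/proc/vmstat")?))
}
```

Since these are cumulative counters, a monitor would call read_vmstat once per interval and work with the differences between consecutive snapshots.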

The Hybrid Architecture

Our monitoring system uses all three sources together. The architecture looks like this:

┌──────────────────────────────────────────────────┐
│                Userspace (monitor)              │
│                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────┐ │
│  │ perf_event  │  │ ring buffer  │  │procfs/ │ │
│  │ open poll   │  │ reader       │  │sysfs   │ │
│  └──────┬──────┘  └──────┬───────┘  └───┬────┘ │
│         │                │              │       │
└─────────┼────────────────┼──────────────┼───────┘
          │                │              │
   perf_event_open()   eBPF maps     file I/O
          ▲                ▲              ▲
          │                │              │
┌─────────┴────────────────┼──────────────┼───────┐
│    Linux Kernel          │              │        │
│                          │              │        │
│  ┌─────────────────┐    │              │        │
│  │ PMC counters    │    │              │        │
│  │ (perf_event_open│    │              │        │
│  │  file desc.)    │    │              │        │
│  └─────────────────┘    │              │        │
│         read via fd     │              │        │
│                          │              │        │
│  ┌─────────────────────────────────┐  │        │
│  │       eBPF programs            │  │        │
│  │  ┌──────────────────────────┐  │  │        │
│  │  │ scheduler tracepoints   │  │  │        │
│  │  │ (waking, switch)        │  │  │        │
│  │  └──────────────────────────┘  │  │        │
│  │  ┌──────────────────────────┐  │  │        │
│  │  │ block I/O, vhost, etc.  │  │  │        │
│  │  └──────────────────────────┘  │  │        │
│  │  writes to eBPF maps ──────────┘  │        │
│  └─────────────────────────────────┘          │
│                                                 │
│  ┌─────────────┐  ┌────────────────┐           │
│  │ /proc/vmstat│  │/sys/class/     │           │
│  │ /proc/stat  │  │thermal/        │           │
│  └─────────────┘  └────────────────┘           │
└─────────────────────────────────────────────────┘

PMC counters are not eBPF programs: they’re file descriptors opened via perf_event_open and read directly. eBPF programs are the ones attached to kernel tracepoints and kprobes. They write into eBPF maps that userspace reads: aggregated data (histograms, counters) goes into array and hash maps, and individual events go through the ring buffer. Procfs and sysfs are plain file reads, with no special API and no eBPF.

The user-space program polls all three sources in a single event loop. Where the event rate is high, the eBPF programs aggregate in the kernel, so userspace receives summaries rather than a firehose of raw events.

What Aya Provides

Aya is the Rust library that ties the eBPF and userspace halves together. It handles:

Compiling eBPF programs via aya-build (runs automatically in build.rs — no separate build step)
Loading and attaching programs from Rust
Creating and populating maps (ring buffers, hashes, histograms)
Reading from maps in the user-space half

Aya is unusual among eBPF libraries because it doesn’t depend on libbpf, BCC, or a C toolchain. Everything is Rust, end to end.

Next: Part 2 — Project Setup and Minimal eBPF — Scaffold the Aya project, write a tracepoint handler, and read scheduler events in userspace.

Part 2 — Project Setup and Minimal eBPF

Before we write any instrumentation, we need a working project that compiles and runs. Aya projects have two halves: the user-space Rust program and the eBPF programs. Getting the build tooling right is the first thing people get stuck on, so let’s do it carefully.

Prerequisites

You’ll need:

# Rust stable and nightly (needed for eBPF compilation)
rustup install stable
rustup toolchain install nightly --component rust-src

# The eBPF target (bpfel-unknown-none) has no prebuilt standard library,
# so there is nothing to install with `rustup target add`. The rust-src
# component installed above lets cargo build core for that target from source.

# bpf-linker: links the compiled Rust (LLVM bitcode) into a BPF object file
cargo install bpf-linker

# bpftool: dumps kernel BTF, which aya-tool uses to generate Rust bindings
# On Ubuntu, install from your package manager first, or build from source:
# https://github.com/libbpf/bpftool
sudo apt install linux-tools-$(uname -r)

# cargo-generate: scaffolds the Aya template
cargo install cargo-generate

Check your kernel version — eBPF is generally well-supported on kernels 5.8+, but some features (ringbuf, BTF) work better on 5.10+:

uname -r

⚠️ One Version Trap to Watch For

Aya has two separate crates with independent version tracks: aya (user-space) and aya-ebpf (eBPF kernel programs). They don’t share a version number. When you see aya = "0.13", the companion eBPF crate might be 0.1, 0.2, or something else entirely — check crates.io to confirm the current version.

The tutorial uses the latest compatible versions. If cargo update pulls in mismatched versions, pin them explicitly in Cargo.toml.

Scaffolding the Project

The Aya team provides a template. Use it with the program type and tracepoint details pre-filled — otherwise cargo generate will prompt you interactively, which breaks the copy-paste flow:

cargo generate --name perf-monitor \
  -d program_type=tracepoint \
  -d tracepoint_category=sched \
  -d tracepoint_name=sched_switch \
  https://github.com/aya-rs/aya-template

What each argument does:

--name perf-monitor — the name of the directory and the Rust workspace. This becomes the workspace root, and the three crates get named accordingly.
-d program_type=tracepoint — tells the template to generate a tracepoint program. The template supports many types (xdp, kprobe, uprobe, tracepoint, etc.). Each type changes the generated code: a tracepoint program reads from a TracePointContext, an XDP program reads from an XdpContext, and so on. We pick tracepoint because our first instrument targets a kernel tracepoint.
-d tracepoint_category=sched — the tracepoint category (also called the subsystem). Kernel tracepoints are organized as category:name. The sched category contains scheduler events: sched_switch, sched_wakeup, sched_waking, etc.
-d tracepoint_name=sched_switch — the specific tracepoint event. sched_switch fires every time the kernel switches from one task to another. It’s the most fundamental scheduler tracepoint — it tells you what ran and when.
https://github.com/aya-rs/aya-template — the template repository. cargo generate downloads this, replaces placeholders with the values you passed (-d flags), and writes the result to a new directory.

The -d flags are how you answer the template’s questions ahead of time. Without them, cargo generate prompts you interactively.

Important: The template uses git dependencies by default. After generating, switch them to crates.io versions for stability. Edit Cargo.toml in the workspace root and perf-monitor-ebpf/Cargo.toml:

# Before (git — moves, may break)
aya = { git = "https://github.com/aya-rs/aya" }

# After (crates.io — stable, tested)
aya = "0.13"       # resolves to 0.13.1 (0.13.2 was yanked)
aya-build = "0.1"
aya-ebpf = "0.1"
aya-log = "0.2"
aya-log-ebpf = "0.1"

Then run cargo update to resolve.

This creates a workspace with three crates:

perf-monitor/                  ← workspace root (Cargo.toml at root)
├── perf-monitor/              ← user-space program (what we write)
│   ├── Cargo.toml
│   ├── build.rs
│   └── src/main.rs
├── perf-monitor-ebpf/         ← eBPF programs (compiled to BPF bytecode)
│   ├── Cargo.toml
│   ├── build.rs
│   └── src/main.rs
└── perf-monitor-common/       ← code shared between userspace and eBPF
    ├── Cargo.toml
    └── src/lib.rs

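The eBPF build is handled automatically: the user-space crate’s build script (perf-monitor/build.rs) calls aya-build, which compiles the perf-monitor-ebpf crate to BPF bytecode and writes the object into OUT_DIR, where the user-space program embeds it. The build.rs inside perf-monitor-ebpf/ only tracks the bpf-linker binary as a build dependency. One cargo build compiles both halves.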

The tracepoint template scaffolds a working program already wired to sched:sched_switch. We’ll replace the generated body with our own code, but the structure — the workspace, the three crates, the build configuration — is what we need.

The Two Halves

eBPF Programs (perf-monitor-ebpf/)

The eBPF programs are written in Rust but compiled to BPF bytecode by the build.rs script. The Aya aya-ebpf crate provides the Rust API for eBPF maps, programs, and context objects — no standard library, no heap, strict verifier.

A simple tracepoint program looks like this:

// perf-monitor-ebpf/src/main.rs

#![no_std]
#![no_main]

use aya_ebpf::helpers::{bpf_get_smp_processor_id, bpf_ktime_get_ns};
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::RingBuf;
use aya_ebpf::programs::TracePointContext;

// Keep the #[panic_handler] (and the license static, if your template
// version generated one) at the bottom of this file. A no_std program
// does not build without a panic handler.

// Events we emit to userspace
#[derive(Clone, Copy)]
#[repr(C)]
struct SchedulerEvent {
    cpu_id: u32,
    prev_pid: u32,
    next_pid: u32,
    timestamp: u64,
}

// Declare the ring buffer as a static: this is how maps work in eBPF.
// The map is named after the static, so userspace looks it up as "EVENTS".
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

// Attach to sched:sched_switch
#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    // Record layout for sched_switch on a typical Linux 5.x x86_64 kernel.
    // Offsets count from the start of the record, including the 8-byte
    // common header, exactly as the format file prints them:
    // offset 8:  prev_comm  char[16]  (TASK_COMM_LEN)
    // offset 24: prev_pid   pid_t
    // offset 28: prev_prio  int
    // offset 32: prev_state long      (TASK_* state mask)
    // offset 40: next_comm  char[16]
    // offset 56: next_pid   pid_t
    // offset 60: next_prio  int
    //
    // Verify on your system: cat /sys/kernel/tracing/events/sched/sched_switch/format
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };
    let cpu_id = unsafe { bpf_get_smp_processor_id() };
    let timestamp = unsafe { bpf_ktime_get_ns() };

    let event = SchedulerEvent {
        cpu_id,
        prev_pid,
        next_pid,
        timestamp,
    };

    // Send to the ring buffer; userspace reads from the EVENTS map.
    // output() returns an error when the buffer is full, in which case
    // the event is dropped.
    let _ = EVENTS.output(&event, 0);

    0
}

A few things to notice here:

#![no_std] and #![no_main]: eBPF programs don’t use the standard library (no heap, no I/O) and don’t have a main function. The entry point is the function marked with #[tracepoint].

#[map]: The #[map] attribute registers the static as an eBPF map. The ring buffer is declared at the top of the file and lives for the lifetime of the program. You don’t access it through a context object.

#[tracepoint]: The eBPF macro marks the function as a tracepoint program. It takes no arguments here: the category and name are provided from userspace via program.attach(). The eBPF macros are lowercase; userspace program types are PascalCase.

ctx: TracePointContext: The context is passed by value (not &mut). The TracePointContext gives you access to the tracepoint record via read_at::<T>(offset).

unsafe { ctx.read_at::<T>(offset) }: read_at is an unsafe function because Rust can’t check that the offset and type you pass match the record layout. The read itself goes through a BPF helper, so a wrong offset gives you garbage or an error rather than a crash.

bpf_get_smp_processor_id() and bpf_ktime_get_ns(): These are BPF helpers exposed through aya_ebpf::helpers. They’re available in every eBPF program.

Verifying tracepoint offsets on your kernel. The sched_switch layout above matches typical Linux 5.x kernels, but kernel versions can change field sizes, add new fields, or reorder them. Before you trust any hardcoded offset, check the format file for your running kernel:
cat /sys/kernel/tracing/events/sched/sched_switch/format
You’ll see output like this (your exact offsets may differ):
name: sched_switch
ID: 314
format:
    field:unsigned short common_type;       offset:0;   size:2;  signed:0;
    field:unsigned char  common_flags;       offset:2;   size:1;  signed:0;
    field:unsigned char  common_preempt_count; offset:3; size:1; signed:0;
    field:int            common_pad;         offset:4;   size:4;  signed:1;

    field:char     prev_comm[16];          offset:8;   size:16; signed:0;
    field:pid_t    prev_pid;               offset:24;  size:4;  signed:1;
    field:int      prev_prio;              offset:28;  size:4;  signed:1;
    field:long     prev_state;             offset:32;  size:8;  signed:1;
    field:char     next_comm[16];          offset:40;  size:16; signed:0;
    field:pid_t    next_pid;               offset:56;  size:4;  signed:1;
    field:int      next_prio;              offset:60;  size:4;  signed:1;
The first four fields (common_*) are the tracepoint header: 8 bytes of metadata present in every tracepoint record. The context pointer a tracepoint program receives points at the start of the record, header included, so read_at takes the offsets exactly as the format file prints them. For example, the format file shows prev_pid at offset 24, and the corresponding read_at call uses offset 24.

The key cross-check: find the field you want in the format output and compare its offset value with the one in the read_at call. If they don’t match, the field layout differs on your kernel, and you’ll read garbage unless you update the offset. This applies to every tracepoint in every part of this tutorial. When in doubt, check the format file.
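
The cross-check can also be automated. The sketch below is a hypothetical helper, not part of the tutorial’s code: it pulls the offset column for a named field out of the format text, returning the number exactly as the file prints it. A monitor could call it at startup and refuse to attach when the value disagrees with what the eBPF program was built for.

```rust
/// Return the `offset:` value for `field` from a tracepoint format file,
/// exactly as printed. Returns None if the field is not listed.
pub fn field_offset(format: &str, field: &str) -> Option<usize> {
    for line in format.lines() {
        let rest = match line.trim().strip_prefix("field:") {
            Some(rest) => rest,
            None => continue,
        };
        let mut parts = rest.split(';');
        // First segment is the C declaration, e.g. "char prev_comm[16]".
        let decl = parts.next().unwrap_or("");
        // The field name is the last token, minus any array suffix.
        let name = decl
            .split_whitespace()
            .last()
            .map(|n| n.split('[').next().unwrap_or(n));
        if name != Some(field) {
            continue;
        }
        // Remaining segments look like " offset:24", " size:4", " signed:1".
        return parts
            .filter_map(|p| p.trim().strip_prefix("offset:"))
            .next()
            .and_then(|v| v.trim().parse().ok());
    }
    None
}
```

In use, the format text would come from std::fs::read_to_string on the path shown above; reading it typically requires root.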

User-Space Program (perf-monitor/)

The user-space program loads the compiled eBPF object, creates maps, attaches programs, and reads data from ring buffers. It runs as a normal Rust binary:

// perf-monitor/src/main.rs

use aya::programs::TracePoint;
use aya::maps::RingBuf;
use aya::Ebpf;
use std::convert::TryFrom;

#[derive(Clone, Copy, Debug)]
#[repr(C)]
struct SchedulerEvent {
    cpu_id: u32,
    prev_pid: u32,
    next_pid: u32,
    timestamp: u64,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // The eBPF object is embedded at compile time from OUT_DIR,
    // so Ebpf::load() needs no file path at runtime.
    let mut ebpf = Ebpf::load(aya::include_bytes_aligned!(
        concat!(env!("OUT_DIR"), "/perf-monitor")
    ))?;

    // Programs are looked up by the name of the eBPF function.
    let program: &mut TracePoint = ebpf
        .program_mut("sched_switch")
        .ok_or_else(|| anyhow::anyhow!("program sched_switch not found"))?
        .try_into()?;
    program.load()?;
    program.attach("sched", "sched_switch")?;

    // Maps are looked up by the name of the eBPF static.
    // map_mut returns an Option, so convert it to an error before using `?`.
    let map = ebpf
        .map_mut("EVENTS")
        .ok_or_else(|| anyhow::anyhow!("map EVENTS not found"))?;
    let mut ring_buf = RingBuf::try_from(map)?;

    // Poll the ring buffer
    loop {
        tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;

        while let Some(item) = ring_buf.next() {
            // item derefs to &[u8]; skip anything too short to be an event
            if item.len() < std::mem::size_of::<SchedulerEvent>() {
                continue;
            }
            // Safe to read: the length was checked above, and read_unaligned
            // makes no assumption about the buffer's alignment.
            let event = unsafe {
                std::ptr::read_unaligned(item.as_ptr() as *const SchedulerEvent)
            };
            println!(
                "cpu={} prev_pid={} next_pid={} ts={}",
                event.cpu_id, event.prev_pid, event.next_pid, event.timestamp
            );
        }
    }
}

Note: the program name is sched_switch, the name of the #[tracepoint] function on the eBPF side (the unmodified template names it perf_monitor, after the crate). The map name EVENTS is the name of the #[map] static in perf-monitor-ebpf/src/main.rs. If either lookup fails, check those two names first.
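
If you would rather avoid the unsafe cast in userspace altogether, the event can be decoded field by field. This is a sketch of one alternative, not what the tutorial’s program does. It assumes the layout the BPF target produces for SchedulerEvent: three u32 fields at offsets 0, 4 and 8, four bytes of padding, then the u64 at offset 16, for 24 bytes total.

```rust
/// Mirror of the eBPF-side event struct.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SchedulerEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub next_pid: u32,
    pub timestamp: u64,
}

/// Size of the #[repr(C)] event as laid out by the BPF target (assumed).
pub const EVENT_SIZE: usize = 24;

fn read_u32(bytes: &[u8], offset: usize) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&bytes[offset..offset + 4]);
    u32::from_ne_bytes(raw)
}

fn read_u64(bytes: &[u8], offset: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_ne_bytes(raw)
}

/// Decode one ring buffer item, or None if it is too short.
pub fn decode_event(bytes: &[u8]) -> Option<SchedulerEvent> {
    if bytes.len() < EVENT_SIZE {
        return None;
    }
    Some(SchedulerEvent {
        cpu_id: read_u32(bytes, 0),
        prev_pid: read_u32(bytes, 4),
        next_pid: read_u32(bytes, 8),
        // Bytes 12..16 are padding inserted to align the u64.
        timestamp: read_u64(bytes, 16),
    })
}
```

Native-endian decoding is correct here because the eBPF program and the reader run on the same machine.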

Running Locally

You’ll need a Linux machine with eBPF support (kernel 5.8+). eBPF is a kernel feature, so tracepoint and kprobe programs behave the same in a VM as on bare metal. What varies under virtualization is the hardware side: the PMCs from Part 3 are only visible to a guest if the hypervisor exposes a virtual PMU.

# Build both userspace and eBPF — build.rs compiles eBPF and embeds it
cargo build

# Run as root (eBPF programs require elevated privileges)
sudo ./target/debug/perf-monitor

The eBPF programs are compiled automatically by the build.rs script in the perf-monitor/ crate, which invokes aya-build on perf-monitor-ebpf/. cargo build handles everything in one step.

Troubleshooting

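Build fails with exit status around 25856 → The nightly toolchain can’t build for bpfel-unknown-none. The target has no prebuilt standard library, so the build compiles core from source and needs the rust-src component on nightly. Run:

rustup toolchain install nightly --component rust-src

Build fails with “target not found” for bpfel-unknown-none → Same fix. This target can’t be installed with rustup target add; the nightly toolchain with rust-src is what provides it. Also check that bpf-linker is installed and on your PATH.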

Permission denied when running → eBPF programs require root. Use sudo.

Project Structure for This Tutorial

For the full performance monitor, we’ll extend the scaffold with multiple eBPF programs and multiple data sources. The structure we’ll build:

perf-monitor/                  ← workspace root
├── perf-monitor/              ← user-space program
│   └── src/
│       ├── main.rs           ← event loop, loads programs, aggregates data
│       ├── pmc.rs            ← perf_event_open wrapper, PMC event reading
│       ├── numa.rs           ← procfs/sysfs readers for NUMA stats
│       ├── thermal.rs        ← sysfs thermal zone reader
│       └── types.rs          ← shared event structs
├── perf-monitor-ebpf/         ← eBPF programs
│   └── src/
│       ├── scheduler.rs      ← sched tracepoints: switch, waking, stat_wait
│       ├── blockio.rs        ← block I/O tracepoints
│       ├── vhost.rs          ← kprobes on vhost/virtio ring functions
│       └── lib.rs            ← map definitions, program registration
└── perf-monitor-common/       ← code shared between userspace and eBPF
    └── src/lib.rs

Next: Part 3 — Hardware PMCs with perf_event_open — Open a counter, read it, and compute instructions per cycle.

Part 3 — Hardware PMCs with perf_event_open

The CPU has hardware counters on-die. The standard way to read them on Linux is the perf_event_open syscall.

What PMCs Are

Performance Monitoring Counters — PMCs — are tiny registers inside the CPU chip. They count specific microarchitectural events: a cache line loaded from L1, a branch instruction resolved, a TLB walk performed. Every modern x86 and ARM processor has them.

The counter is an accumulation of incremental events. Each hardware event increments it. You open a file descriptor for a specific event, and then you read the counter value by reading from that file descriptor. That’s the whole interface.

The events are defined by a type and a config. On x86, type is usually PERF_TYPE_HARDWARE (0) or PERF_TYPE_RAW (4). The hardware type holds the generic events that the kernel maps onto each supported CPU: instructions retired, CPU cycles, cache references, and coarse last-level cache-miss and branch-miss counts. The microarchitecture-specific events (per-level cache misses, TLB walks, detailed mispredict breakdowns) live in the raw type, and their event numbers vary by CPU microarchitecture. That’s why Part 4 matters: before you can open one of these raw counters, you need to know what CPU you’re on.
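
As a hedged illustration of what a raw config looks like: on x86 core PMUs the basic encoding puts the event select in bits 0-7 and the unit mask in bits 8-15. The two constants below are Intel’s architectural event encodings; other events, AMD parts, and ARM use different numbers, and the helper name is illustrative.

```rust
/// perf_event_attr.type value for raw, CPU-specific events.
pub const PERF_TYPE_RAW: u32 = 4;

/// Basic x86 raw encoding: event select in bits 0-7, unit mask in bits 8-15.
/// Some events need further bits (cmask, edge, inv) that this sketch omits.
pub const fn x86_raw_config(event_select: u8, umask: u8) -> u64 {
    ((umask as u64) << 8) | event_select as u64
}

/// Intel architectural event: last-level cache misses (event 0x2E, umask 0x41).
pub const INTEL_LLC_MISSES: u64 = x86_raw_config(0x2E, 0x41);

/// Intel architectural event: mispredicted branches retired (event 0xC5, umask 0x00).
pub const INTEL_BRANCH_MISSES_RETIRED: u64 = x86_raw_config(0xC5, 0x00);
```

The resulting number is the same hex value that perf accepts in its raw form, for example r412e for the first constant.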

The syscall

perf_event_open is a Linux-specific syscall. Here’s its signature from the kernel headers:

int perf_event_open(
    struct perf_event_attr *attr,  // what to count
    pid_t pid,                     // what to measure (0 = calling process, -1 = every process)
    int cpu,                       // which CPU (-1 = any CPU)
    int group_fd,                  // group leader fd (-1 = new group)
    unsigned long flags            // PERF_FLAG_FD_CLOEXEC etc.
);

Returns a file descriptor on success, or -1 on error.

The perf_event_attr struct is the interesting part. Here’s the relevant subset from the kernel headers (include/uapi/linux/perf_event.h):

struct perf_event_attr {
    __u32 type;              // PERF_TYPE_HARDWARE, PERF_TYPE_RAW, etc.
    __u32 size;              // sizeof(struct perf_event_attr)
    __u64 config;            // which specific event
    union {
        __u64 sample_period; // sample every N events (for sampling mode)
        __u64 sample_freq;   // target sample frequency (for sampling mode)
    };
    __u64 sample_type;       // what gets written to the sample buffer
    __u64 read_format;       // format for reading counter values
    __u64 flags;             // disabled, pinned, inherit, etc.
};

For counting mode (what we’ll use), you set the union to 0 — both sample_period and sample_freq are zero, meaning no sampling. The kernel won’t generate PMIs (Performance Monitoring Interrupts), and you read the counter value directly from the file descriptor. For profiling mode, you set sample_freq to a target sample rate (e.g., 1000 Hz) and the kernel samples periodically.

The hand-rolled Rust struct later in this part uses a single sample_period field in place of the union: in counting mode the value is always 0, so one field is sufficient. It must still set size, which the kernel requires. The listing above is only a subset of the C struct; the working code at the end of this part includes everything you need.

Rust bindings

There’s a perf-event crate that wraps perf_event_open with a safe interface, but it doesn’t support all the raw PMC event types we need, and libc::perf_event_attr isn’t available on all platforms. So we hand-roll the struct and call the syscall directly via libc.

On Linux, perf_event_open is syscall number 298 on x86_64. The number varies by architecture: 241 on ARM64 and RISC-V, 319 on PowerPC. The code passes libc::SYS_perf_event_open, which resolves to the right number per architecture, to libc::syscall. There is no libc::perf_event_open function to call instead, because glibc does not provide a wrapper for this syscall.

The full open_pmc function is in the minimal working example at the end of this part. The next few sections show the API surface — what the events are, how to read counters, how to compute useful metrics — and then we bring it all together.

Opening a counter for instructions and cycles

The hardware events live in PERF_TYPE_HARDWARE. The two universal hardware events — every CPU supports them — are PERF_COUNT_HW_INSTRUCTIONS and PERF_COUNT_HW_CPU_CYCLES:

// open_pmc(type, config, pid, cpu) is defined in the working example
// at the end of this part.
let instr_fd = open_pmc(
    libc::PERF_TYPE_HARDWARE as u32,
    libc::PERF_COUNT_HW_INSTRUCTIONS as u64,
    0,  // measure the calling process
    -1, // on whichever CPU it runs
)?;

let cycles_fd = open_pmc(
    libc::PERF_TYPE_HARDWARE as u32,
    libc::PERF_COUNT_HW_CPU_CYCLES as u64,
    0,
    -1,
)?;

Reading and resetting

Reading from the file descriptor returns the current counter value. The counter starts disabled — you enable it with an ioctl call:

fn read_counter(fd: i32) -> std::io::Result<u64> {
    let mut val: u64 = 0;
    let n = unsafe {
        libc::read(fd, &mut val as *mut u64 as *mut libc::c_void, 8)
    };
    if n < 0 {
        Err(std::io::Error::last_os_error())
    } else if n != 8 {
        // With the default read_format the kernel returns exactly one u64.
        Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            "short read from perf fd",
        ))
    } else {
        Ok(val)
    }
}

fn enable_counter(fd: i32) -> std::io::Result<()> {
    // arg=0 enables this counter. PERF_IOC_FLAG_GROUP would enable
    // all counters in the same group, but we aren't using groups here.
    let ret = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0) };
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(())
    }
}

To reset the counter to zero, use ioctl with PERF_EVENT_IOC_RESET:

unsafe {
    libc::ioctl(fd, libc::PERF_EVENT_IOC_RESET, 0);
}

Computing IPC and stall ratio

Instructions per cycle (IPC) tells you how much useful work the CPU is doing per clock tick. A healthy compute-bound workload might hit 3-4 IPC on a modern out-of-order core. A memory-bound workload stalls frequently and might hit 0.5 IPC.

fn compute_ipc(instructions: u64, cycles: u64) -> f64 {
    if cycles == 0 {
        return 0.0;
    }
    instructions as f64 / cycles as f64
}

Stall ratio is the fraction of cycles where the core wasn’t retiring instructions. This happens when the core is waiting on memory, a branch mispredict, or any other pipeline stall.

The key insight: when the pipeline stalls, the instruction counter slows down even though cycles keep ticking. Here’s the catch — modern out-of-order cores can retire multiple instructions per cycle (4-6 on recent x86). So cycles - instructions doesn’t directly give you stalled cycles. What it gives you is a lower bound on stalls — cycles where the core couldn’t even manage 1 retirement. The max(0) clamp hides the IPC > 1 case entirely.

#![allow(unused)]
fn main() {
fn compute_stall_ratio(instructions: u64, cycles: u64) -> f64 {
    if instructions == 0 || cycles == 0 {
        return 0.0;
    }
    // When IPC >= 1, cycles - instructions is zero or negative, so the clamp
    // reports 0. That means "no evidence of stalls", not "no stalls".
    // When IPC < 1, the gap is a lower bound on stalled cycles.
    // This underestimates stalls when the core retires multiple instructions
    // in some cycles and zero in others (the average IPC hides the stall bursts).
    let stalled = (cycles as i64 - instructions as i64).max(0) as f64;
    stalled / cycles as f64
}
}

This is a simplification in two ways. First, it can’t detect stall bursts hidden by a high average IPC. Second, some cycles are legitimately empty (no instructions ready, no work to do). For a more accurate stall ratio, you’d read the dedicated stall-cycle PMCs that modern CPUs expose (e.g., CYCLE_ACTIVITY.STALLS_TOTAL on Intel Skylake-class cores; INT_MISC.RECOVERY_CYCLES covers only the recovery window after a mispredict or machine clear). But as a quick health check from two counters alone, it’s a useful signal.
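To see how far the two-counter estimate can drift, here is a small sketch that compares it with a ratio taken from a dedicated stall-cycle counter. The function names and the sample numbers are illustrative assumptions, not measurements; stall_cycles stands for whatever stall event your CPU exposes.

```rust
/// Lower bound from instructions and cycles only (the same clamp-at-zero
/// idea as compute_stall_ratio).
fn stall_lower_bound(instructions: u64, cycles: u64) -> f64 {
    if cycles == 0 {
        return 0.0;
    }
    cycles.saturating_sub(instructions) as f64 / cycles as f64
}

/// Ratio from a dedicated stall-cycle counter (an assumed third reading).
fn stall_ratio_measured(stall_cycles: u64, cycles: u64) -> f64 {
    if cycles == 0 {
        return 0.0;
    }
    // Clamp in case the two counters were read at slightly different times.
    stall_cycles.min(cycles) as f64 / cycles as f64
}

fn main() {
    // Made-up bursty interval: 1000 cycles, 600 of them stalled, and the
    // remaining 400 retiring 4 instructions each.
    let cycles = 1_000u64;
    let stall_cycles = 600u64;
    let instructions = 400u64 * 4;

    println!("lower bound: {:.2}", stall_lower_bound(instructions, cycles));
    println!("measured:    {:.2}", stall_ratio_measured(stall_cycles, cycles));
}
```

With an average IPC of 1.6, the lower bound collapses to zero even though 60% of the cycles in this made-up interval retired nothing.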

Per-process vs. system-wide monitoring

The pid parameter controls what gets counted. The cpu parameter controls where. Getting these wrong gives you the right numbers for the wrong thing.

Per-process (pid=0, cpu=-1): Open the counter with pid=0 and you measure the calling process as it runs on any CPU. The kernel follows the process across CPU migrations and accumulates the count.

#![allow(unused)]
fn main() {
let cycles_fd = open_pmc(PERF_TYPE_HARDWARE, libc::PERF_COUNT_HW_CPU_CYCLES as u64, 0, -1)?;
}

This is what the example program uses. It works for measuring the IPC of a workload you launch from your monitoring tool. It does not measure the entire system — if your process is mostly sleeping (waiting for the next read interval), the counter values will be near zero.

System-wide (pid=-1, per-CPU): Open a counter for each CPU with pid=-1 to measure all processes on that CPU. The kernel requires a specific cpu number when pid=-1 (passing cpu=-1 with pid=-1 returns EINVAL):

#![allow(unused)]
fn main() {
fn open_all_cpus(type_: u32, config: u64) -> std::io::Result<Vec<(i32, i32)>> {
    let mut fds = Vec::new();
    // num_cpus() is assumed to be a helper returning the number of online CPUs.
    for cpu in 0..num_cpus() as i32 {
        // pid=-1: system-wide. Measures all processes on this CPU.
        let fd = open_pmc(type_, config, -1, cpu)?;
        fds.push((cpu, fd));
    }
    Ok(fds)
}
}

Read each fd and aggregate in userspace for a total. Or keep them separate for per-CPU granularity — a CPU with unusually high cache misses might be running a memory-bound workload pinned to that core.
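As a rough sketch of the aggregation step, the helper below takes per-CPU deltas as (cpu, count) pairs, which is an assumed shape for whatever your read loop produces, and returns the system total plus the CPU holding the largest share.

```rust
/// Sum per-CPU counter deltas and report which CPU contributed the most,
/// as (cpu, fraction of total). Returns None for the hottest CPU when
/// there is nothing to compare (empty input or all-zero counts).
fn aggregate(per_cpu: &[(i32, u64)]) -> (u64, Option<(i32, f64)>) {
    let total: u64 = per_cpu.iter().map(|&(_, count)| count).sum();
    let hottest = per_cpu
        .iter()
        .max_by_key(|&&(_, count)| count)
        .filter(|_| total > 0)
        .map(|&(cpu, count)| (cpu, count as f64 / total as f64));
    (total, hottest)
}

fn main() {
    // Made-up cache-miss deltas for three CPUs.
    let deltas = [(0, 100u64), (1, 700), (2, 200)];
    let (total, hottest) = aggregate(&deltas);
    println!("total={total}");
    if let Some((cpu, share)) = hottest {
        println!("cpu {cpu} accounts for {:.0}% of the count", share * 100.0);
    }
}
```

Keeping the per-CPU share next to the total lets one report serve both views: the total for dashboards, the share for spotting a pinned memory-bound workload.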

For a system monitoring tool, pid=-1 is almost always what you want. The example program uses pid=0 for simplicity — a single fd, no aggregation loop — but a real deployment should switch to pid=-1 with per-CPU fds. Part 8 uses pid=-1 for uncore IMC counters, which are inherently system-wide.

Error handling

Two errors are common:

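EPERM: The syscall returns EPERM (or EACCES, depending on the kernel path) when the calling process lacks the right privileges. Unrestricted PMC access requires CAP_SYS_ADMIN (root) or CAP_PERFMON (Linux 5.8+, a narrower capability); what an unprivileged process may do is governed by /proc/sys/kernel/perf_event_paranoid. If you hit this in a container, the host may need to grant the capability.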

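EINVAL: The kernel rejected the event configuration for this CPU. Some unsupported events surface as ENOENT instead, and a raw encoding the hardware doesn’t define may open successfully and count nothing. Raw PMC events in particular vary by microarchitecture: the same event number might mean “cache miss” on Skylake and “not defined” on Ice Lake. This is why Part 4 exists: detect the CPU first, then select events.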
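One way to act on these errors is to sort them into "fix the privileges", "skip this event", and "something else". The sketch below does that with std only. The OpenFailure enum and classify_open_error are hypothetical names, and the errno values are the Linux ones, hardcoded so the example needs no external crate. It also treats EACCES like EPERM and ENOENT like EINVAL, since perf_event_open can report either.

```rust
use std::io;

// Linux errno values, hardcoded so this sketch has no external dependencies.
const EPERM: i32 = 1;
const ENOENT: i32 = 2;
const EACCES: i32 = 13;
const EINVAL: i32 = 22;

#[derive(Debug, PartialEq, Eq)]
enum OpenFailure {
    /// Missing capability or a restrictive perf_event_paranoid setting.
    Permission,
    /// The kernel rejected the event for this CPU.
    Unsupported,
    /// Anything else: propagate it to the caller.
    Other,
}

fn classify_open_error(err: &io::Error) -> OpenFailure {
    match err.raw_os_error() {
        Some(EPERM) | Some(EACCES) => OpenFailure::Permission,
        Some(EINVAL) | Some(ENOENT) => OpenFailure::Unsupported,
        _ => OpenFailure::Other,
    }
}

fn main() {
    // Simulate the error a failed perf_event_open call would leave behind.
    let err = io::Error::from_raw_os_error(EACCES);
    match classify_open_error(&err) {
        OpenFailure::Permission => eprintln!("need more privileges: {err}"),
        OpenFailure::Unsupported => eprintln!("event unavailable, skipping: {err}"),
        OpenFailure::Other => eprintln!("unexpected failure: {err}"),
    }
}
```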

A minimal working example

Here’s a complete program that opens an instructions counter and a cycles counter, enables them, reads them once per second, and prints per-second IPC. This brings together the open_pmc, read_counter, and enable_counter functions from the sections above:

use std::io;
use std::thread;
use std::time::Duration;

fn open_pmc(type_: u32, config: u64, pid: i32, cpu: i32) -> io::Result<i32> {
    // Hand-rolled perf_event_attr, for platforms where libc::perf_event_attr
    // is not available. The field ordering must match the kernel struct
    // exactly, and the struct must be at least 64 bytes (PERF_ATTR_SIZE_VER0):
    // the kernel rejects a smaller size with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64, // 0 for counting mode
        sample_type: u64,
        read_format: u64,
        flags: u64, // bit 0: disabled, bit 2: pinned
        wakeup_events: u32, // unused in counting mode
        bp_type: u32,       // unused (breakpoint events only)
        config1: u64,       // unused (extra config / breakpoint address)
    }

    // The flags field is a C bitfield packed into a u64. On x86-64,
    // GCC/Clang allocate bits from LSB to MSB, so:
    //   bit 0 = disabled (start counter in disabled state)
    //   bit 1 = inherit  (children inherit the counter; not set here)
    //   bit 2 = pinned   (counter must stay on the PMU, which prevents multiplexing)
    // We set disabled so we can enable the counter explicitly via ioctl.
    // We set pinned because counting mode needs the counter scheduled
    // at all times. Without pinned, the kernel may multiplex the counter
    // on busy systems, producing scaled values instead of exact counts.
    let attr = PerfEventAttr {
        type_,
        size: std::mem::size_of::<PerfEventAttr>() as u32,
        config,
        sample_period: 0, // counting mode: no sampling, read counter directly
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };

    let fd = unsafe {
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const _,
            pid,
            cpu,
            -1, // no group leader
            0,  // no flags
        )
    };

    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(fd as i32)
}

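fn read_counter(fd: i32) -> io::Result<u64> {
    let mut val: u64 = 0;
    let n = unsafe {
        libc::read(fd, &mut val as *mut u64 as *mut libc::c_void, 8)
    };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else if n != 8 {
        // A short read (typically 0 bytes) means the pinned counter could
        // not be scheduled and is in an error state.
        Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "short read from perf counter",
        ))
    } else {
        Ok(val)
    }
}

fn enable_counter(fd: i32) -> io::Result<()> {
    // arg=0: enable this counter (not a group)
    let rc = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0) };
    if rc < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}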

fn main() -> io::Result<()> {
    let instr_fd = open_pmc(
        libc::PERF_TYPE_HARDWARE as u32,
        libc::PERF_COUNT_HW_INSTRUCTIONS as u64,
        0,  // pid=0: measure this process (not system-wide; see "Per-process vs. system-wide monitoring" above)
        -1, // follow this process across all CPUs
    )?;
    let cycles_fd = open_pmc(
        libc::PERF_TYPE_HARDWARE as u32,
        libc::PERF_COUNT_HW_CPU_CYCLES as u64,
        0,  // pid=0: measure this process
        -1, // follow this process across all CPUs
    )?;

    enable_counter(instr_fd)?;
    enable_counter(cycles_fd)?;

    let mut prev_instr = 0u64;
    let mut prev_cycles = 0u64;

    loop {
        thread::sleep(Duration::from_secs(1));

        let instr = read_counter(instr_fd)?;
        let cycles = read_counter(cycles_fd)?;

        let instr_delta = instr - prev_instr;
        let cycles_delta = cycles - prev_cycles;

        let ipc = if cycles_delta > 0 {
            instr_delta as f64 / cycles_delta as f64
        } else {
            0.0
        };

        println!("instructions={instr_delta} cycles={cycles_delta} ipc={ipc:.3}");

        prev_instr = instr;
        prev_cycles = cycles;
    }
}

Run it with sudo cargo run — it needs root. The counter values are cumulative, so the program computes deltas between reads to get per-second IPC.
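If you drop the pinned bit, the kernel is free to time-share the counter with other events. A common way to handle that is to set read_format to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING (bits 0 and 1), which makes read() return three u64 values instead of one: the count, the time the event was enabled, and the time it was actually on the PMU. The sketch below covers only the decoding and scaling arithmetic. decode_scaled is a hypothetical helper, and the 24-byte buffer stands in for what read() would fill.

```rust
/// Decode a 24-byte read() result laid out as { value, time_enabled,
/// time_running } and extrapolate the count to the full enabled time.
/// Returns None if the event never ran, since there is nothing to scale.
fn decode_scaled(buf: &[u8; 24]) -> Option<u64> {
    let field = |i: usize| {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&buf[i * 8..i * 8 + 8]);
        u64::from_ne_bytes(bytes)
    };
    let (value, enabled, running) = (field(0), field(1), field(2));
    if running == 0 {
        return None;
    }
    // Widen to u128 so value * enabled cannot overflow.
    Some((value as u128 * enabled as u128 / running as u128) as u64)
}

fn main() {
    // Made-up reading: 500 events counted while on the PMU for half the time.
    let mut buf = [0u8; 24];
    buf[0..8].copy_from_slice(&500u64.to_ne_bytes());
    buf[8..16].copy_from_slice(&1_000u64.to_ne_bytes());
    buf[16..24].copy_from_slice(&500u64.to_ne_bytes());

    match decode_scaled(&buf) {
        Some(estimate) => println!("estimated count: {estimate}"),
        None => println!("counter never ran"),
    }
}
```

The scaled value is an estimate, which is why the example program keeps the pinned bit and reads exact counts instead.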

Next: Part 4 — CPU Microarchitecture Detection — Before you open any raw PMC, figure out what CPU you’re running on and pick the right event numbers.

Part 4 — CPU Microarchitecture Detection

The same PMC event number means different things on different CPUs. On Intel Skylake, event 0xD1 with umask 0x08 counts L1 data cache load misses. On AMD Zen 2, that same event number might not even exist. On ARM, the encoding scheme is completely different. If you hardcode event numbers, your code breaks on every machine that isn’t yours.

The fix isn’t complicated: before you open any PMC, figure out what CPU you’re running on, then pick the right event numbers for that chip. This part builds the detection layer that Parts 5 through 12 will rely on.

Reading /proc/cpuinfo

The file has one block per CPU. For our purposes, the relevant fields are the same across all cores of the same physical CPU, so we’ll read the first block:

#![allow(unused)]
fn main() {
use std::fs;
use std::io;

fn read_cpuinfo() -> io::Result<String> {
    fs::read_to_string("/proc/cpuinfo")
}
}

Here’s what the output looks like on an Intel Skylake desktop:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping        : 3
microcode       : 0x96
cpu MHz         : 4000.000
cache size      : 8192 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov

The fields we care about:

vendor_id: "GenuineIntel" or "AuthenticAMD" (the strings Linux uses)
cpu family: a decimal number, part of the CPUID family field
model: decimal, the CPUID model field
stepping: decimal, the CPUID stepping field
flags: a space-separated list of CPU feature flags

The family and model numbers are how we identify the microarchitecture. For Intel, family 6 means “P6 family or later” (which covers everything from Pentium Pro through modern Skylake/Ice Lake). The model field then distinguishes the specific generation.

Parsing vendor and family/model

#![allow(unused)]
fn main() {
use std::io::{self, BufRead};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Vendor {
    Intel,
    Amd,
    Unknown,
}

#[derive(Debug, Clone)]
pub struct CpuInfo {
    pub vendor: Vendor,
    pub family: u32,
    pub model: u32,
    pub stepping: u32,
    pub flags: Vec<String>,
}

fn parse_cpuinfo(raw: &str) -> Option<CpuInfo> {
    let mut vendor = Vendor::Unknown;
    let mut family: u32 = 0;
    let mut model: u32 = 0;
    let mut stepping: u32 = 0;
    let mut flags: Vec<String> = Vec::new();

    for line in raw.lines() {
        // A blank line ends a processor block. Vendor, family, model,
        // stepping, and flags are the same across all cores of the same CPU,
        // so stop after the first block that identified a vendor.
        if line.trim().is_empty() {
            if vendor != Vendor::Unknown {
                break;
            }
            continue;
        }

        // Skip lines that aren't "key : value" instead of bailing out of the
        // whole function (an early `?` here would return None on the blank
        // separator lines).
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let key = key.trim();
        let value = value.trim();

        match key {
            "vendor_id" | "vendor" => {
                vendor = match value {
                    "GenuineIntel" => Vendor::Intel,
                    "AuthenticAMD" => Vendor::Amd,
                    _ => Vendor::Unknown,
                };
            }
            "cpu family" => {
                family = value.parse().ok()?;
            }
            "model" => {
                model = value.parse().ok()?;
            }
            "stepping" => {
                stepping = value.parse().ok()?;
            }
            "flags" => {
                flags = value.split_whitespace().map(String::from).collect();
            }
            _ => {}
        }
    }

    if vendor == Vendor::Unknown {
        return None;
    }

    Some(CpuInfo { vendor, family, model, stepping, flags })
}
}

Mapping to microarchitecture names

The family and model numbers combine to produce a “microarchitecture identifier.” Here’s how to build that mapping for Intel:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Microarch {
    // Intel
    Skylake,
    SkylakeSp,
    KabyLake,        // includes Coffee Lake, Comet Lake (same PMC event encodings)
    IceLake,
    IceLakeSp,
    RocketLake,
    AlderLake,
    SapphireRapids,
    // AMD
    Zen,
    ZenPlus,
    Zen2,
    Zen3,
    Zen4,
    // Generic
    Unknown,
}

pub fn detect_intel_microarch(family: u32, model: u32, stepping: u32) -> Microarch {
    // The "model" field in /proc/cpuinfo is the CPUID model value with the
    // extended model already folded in (family 6 Intel: model = ext_model << 4 | base_model).
    //
    // Why do some very different CPUs share a match arm? Because they share PMC events.
    // Kaby Lake, Coffee Lake, and Comet Lake all use Skylake-core PMCs. The perfmon
    // repo (github.com/intel/perfmon) puts them all in the /SKL/ event directory.
    // For a monitoring tool, the PMC mapping is what matters, not the marketing name.
    match (family, model) {
        // Skylake family — all use the same PMC event encodings
        (6, 85) => Microarch::SkylakeSp,                              // Skylake SP / Cascade Lake SP
        (6, 94) => Microarch::Skylake,                              // Desktop Skylake (i7-6700K)
        (6, 142 | 158) => Microarch::KabyLake,                      // Kaby Lake / Coffee Lake (same PMCs)
        (6, 165 | 166) => Microarch::KabyLake,                      // Comet Lake H/S (same PMCs as KBL)

        // Ice Lake — some events moved, L3 miss not available on all SKUs
        (6, 125 | 126) => Microarch::IceLake,                       // Ice Lake client (0x7D, 0x7E)
        (6, 106 | 108) => Microarch::IceLakeSp,                     // Ice Lake server (0x6A, 0x6C)

        // Tiger Lake — Ice Lake PMC events, plus some uncore changes
        (6, 140 | 141) => Microarch::IceLake,                      // Tiger Lake (0x8C, 0x8D; ICL PMCs)

        // Rocket Lake
        (6, 167) => Microarch::RocketLake,                          // Rocket Lake (0xA7)

        // Alder Lake — hybrid P-core (Golden Cove) + E-core (Gracemont)
        (6, 151 | 154) => Microarch::AlderLake,                     // Alder Lake desktop (0x97) / mobile (0x9A)

        // Sapphire Rapids
        (6, 143) => Microarch::SapphireRapids,                      // Sapphire Rapids (0x8F)

        _ => Microarch::Unknown,
    }
}

pub fn detect_amd_microarch(family: u32, model: u32) -> Microarch {
    // AMD family encoding: when base_family is 0xF, the reported family is
    // base_family + extended_family (a sum, not a shift).
    // For Zen, base_family is 15 (0xF) and extended_family is 8 → family 23 (0x17).
    // For Zen 2, AMD kept family 23 (0x17) for Matisse/Rome desktop and server parts.
    // Zen 3 and Zen 4 use family 25 (0x19): base 0xF + extended 0xA.
    //
    // Model numbers come from the AMD Processor Programming Reference (PPR) and
    // the /proc/cpuinfo "model" field. Server and desktop parts within the same
    // Zen generation use different model numbers but share PMC event encodings.
    match (family, model) {
        // Zen 1
        (23, 1) => Microarch::Zen,             // Naples (EPYC 1st gen)
        (23, 17) => Microarch::Zen,             // Raven Ridge (Zen 1 APU)

        // Zen+
        (23, 8) => Microarch::ZenPlus,          // Pinnacle Ridge (Ryzen 2000 desktop) and Colfax (Threadripper 2000)
        (23, 24) => Microarch::ZenPlus,         // Picasso (Zen+ APU, model 0x18)

        // Zen 2
        (23, 49) => Microarch::Zen2,            // Rome (EPYC 2nd gen, model 0x31)
        (23, 113) => Microarch::Zen2,           // Matisse (Ryzen 3000 desktop, model 0x71)

        // Zen 3
        (25, 1) => Microarch::Zen3,             // Milan (EPYC 3rd gen)
        (25, 33) => Microarch::Zen3,            // Vermeer (Ryzen 5000 desktop, model 0x21)

        // Zen 4
        (25, 17) => Microarch::Zen4,            // Genoa (EPYC 4th gen, model 0x11)
        (25, 97) => Microarch::Zen4,            // Raphael (Ryzen 7000 desktop, model 0x61)

        _ => Microarch::Unknown,
    }
}

pub fn detect_microarch(info: &CpuInfo) -> Microarch {
    match info.vendor {
        Vendor::Intel => detect_intel_microarch(info.family, info.model, info.stepping),
        Vendor::Amd => detect_amd_microarch(info.family, info.model),
        Vendor::Unknown => Microarch::Unknown,
    }
}
}

This mapping covers the major Intel and AMD microarchitectures in production. Two caveats for production use:

Intel hybrid cores (Alder Lake and later): P-cores use Golden Cove PMC encodings; E-cores use Gracemont encodings. The detect_intel_microarch function returns AlderLake for both, but a production tool would need to detect which core type it’s running on (via the hybrid CPUID leaf, leaf 0x1A) and select events per-core. The Intel SDM documents both encoding sets in Volume 3B, Chapter 19.
Extended model numbers: The model field in /proc/cpuinfo already folds in the extended model bits for family 6 Intel CPUs (the kernel does this for you). If you’re reading CPUID directly, you need to combine ext_model << 4 | base_model yourself.
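If you do read CPUID directly, the folding looks roughly like the sketch below. decode_cpuid_eax is a hypothetical helper that takes the EAX value returned by CPUID leaf 1. The two signatures in main are example inputs chosen to line up with the match arms above (a Skylake desktop part and a Zen 2 Matisse part); treat them as illustrations rather than a lookup table.

```rust
/// Decode (family, model, stepping) from CPUID leaf 1 EAX, producing the
/// same numbers /proc/cpuinfo reports.
fn decode_cpuid_eax(eax: u32) -> (u32, u32, u32) {
    let stepping = eax & 0xF;
    let base_model = (eax >> 4) & 0xF;
    let base_family = (eax >> 8) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;

    // The extended family is added (not shifted in), and only when the
    // base family is 0xF.
    let family = if base_family == 0xF {
        base_family + ext_family
    } else {
        base_family
    };
    // The extended model is shifted in for base family 6 and 0xF.
    let model = if base_family == 0x6 || base_family == 0xF {
        (ext_model << 4) | base_model
    } else {
        base_model
    };
    (family, model, stepping)
}

fn main() {
    for eax in [0x0005_06E3u32, 0x0087_0F10] {
        let (family, model, stepping) = decode_cpuid_eax(eax);
        println!("eax={eax:#010x} family={family} model={model} stepping={stepping}");
    }
}
```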

For production code, cross-reference against the Intel SDM (Software Developer’s Manual, Volume 3B, Chapter 19) or use a maintained lookup table from a project like the Intel perfmon repository (github.com/intel/perfmon). But for a monitoring tool, the key is having some mapping, not a complete one.

Why this matters for PMC events

Here’s a concrete example. On Intel Skylake, L1 data cache load miss is:

Type: PERF_TYPE_RAW (4)
Event: 0xD1
Umask: 0x08

On Ice Lake, the same counter exists but some events moved. The safe approach is to enumerate available events on the target machine rather than hardcoding.

Enumerating available events with perf list

Before opening any raw event, you can ask the kernel what events are available:

perf list

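This prints a categorized list of events: generic hardware and software events, cache events, tracepoints, and, if your perf build ships event tables for this CPU, the vendor’s named events. To narrow it down:

perf list | grep -i cache | head -20

Named events hide their hex encodings by default. On recent perf versions, --details prints how each named event resolves to event= and umask= terms:

perf list --details | grep -i -A 3 mem_load_retired | head -20

Combine the event= and umask= numbers from that output into the config value to use with PERF_TYPE_RAW.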

From a Rust program, you can call perf list via std::process::Command and parse the output. Or use the sysfs path directly:

#![allow(unused)]
fn main() {
use std::fs;
use std::io;

// Lists the named events the kernel exports for the core PMU as
// "name: encoding" strings. This directory holds only a handful of
// generic events, not the full vendor event list.
fn list_raw_events() -> io::Result<Vec<String>> {
    let path = "/sys/bus/event_source/devices/cpu/events";
    let mut events = Vec::new();

    for entry in fs::read_dir(path)?.flatten() {
        if let Ok(content) = fs::read_to_string(entry.path()) {
            // Each file holds one line of comma-separated terms, e.g.:
            // event=0x2e,umask=0x41
            let name = entry.file_name().to_string_lossy().into_owned();
            events.push(format!("{}: {}", name, content.trim()));
        }
    }

    Ok(events)
}
}
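Each sysfs event file holds a one-line description such as event=0x2e,umask=0x41. A minimal sketch of turning that into a raw config value follows. config_from_sysfs is a hypothetical helper that assumes the Intel layout (8-bit event in bits 0-7, 8-bit umask in bits 8-15) and deliberately refuses descriptions with terms it does not understand, such as cmask or edge, rather than mis-encoding them.

```rust
/// Parse a hex term value such as "0x2e".
fn parse_hex(value: &str) -> Option<u64> {
    let digits = value.trim().strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

/// Turn "event=0x2e,umask=0x41" into a PERF_TYPE_RAW config value.
/// Returns None if the event term is missing, a value does not fit in
/// 8 bits, or the description contains any other term.
fn config_from_sysfs(desc: &str) -> Option<u64> {
    let mut event = None;
    let mut umask = 0u64;
    for term in desc.trim().split(',') {
        let (key, value) = term.split_once('=')?;
        match key.trim() {
            "event" => event = Some(parse_hex(value)?),
            "umask" => umask = parse_hex(value)?,
            _ => return None,
        }
    }
    let event = event?;
    if event > 0xFF || umask > 0xFF {
        return None;
    }
    Some((umask << 8) | event)
}

fn main() {
    let desc = "event=0x2e,umask=0x41\n";
    match config_from_sysfs(desc) {
        Some(config) => println!("config = {config:#x}"),
        None => println!("unsupported event description"),
    }
}
```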

Putting it together

A helper that returns everything we need:

#![allow(unused)]
fn main() {
pub struct CpuMicroarch {
    pub cpuinfo: CpuInfo,
    pub microarch: Microarch,
    pub vendor: Vendor,
}

pub fn detect() -> io::Result<CpuMicroarch> {
    let raw = fs::read_to_string("/proc/cpuinfo")?;
    let cpuinfo = parse_cpuinfo(&raw).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "could not parse cpuinfo")
    })?;
    let microarch = detect_microarch(&cpuinfo);
    let vendor = cpuinfo.vendor.clone();

    Ok(CpuMicroarch { cpuinfo, microarch, vendor })
}
}

Next: Part 5 — Cache and TLB Metrics from PMC — Open cache miss and TLB walk counters, compute miss rates, and select the right events for your CPU.

Part 5 — Cache and TLB Metrics from PMC

L1 cache misses cost a few cycles. L3 misses cost a few hundred.

That’s not hyperbole. Each step down the hierarchy costs roughly an order of magnitude more than the one above it. An L1 miss that hits in L2 might cost 10 cycles. An L3 miss that goes to main memory costs 100-300 cycles depending on the system. Once you see those numbers in your metrics, memory-bound workloads become obvious.

The cache hierarchy in plain language

Modern CPUs have several levels of cache. Each core has its own private L1 — a small, fast cache split into L1 data and L1 instructions. L2 is also private to each core but larger and slower. L3 (or LLC — Last Level Cache) is typically shared across cores on the same chip and slower still.

When the core needs a piece of data, it checks L1 first. If it finds the cache line, that’s a hit and the data is available within a few cycles (about 4-5 on recent x86 cores). If L1 misses, it checks L2. If L2 misses, it checks L3. If L3 misses, it goes to main memory, a round trip of 100-300 cycles (tens of nanoseconds).

Each level has its own PMC events. Counting L3 misses tells you how often the CPU is going to main memory.

PMC events for each cache level

On Intel x86, raw events are identified by a type and a pair of hex numbers: the event selector and the unit mask (umask). The format for raw events in perf:

event=0x<EventHex>,umask=0x<UmaskHex>

When you use perf_event_open with PERF_TYPE_RAW (4), you set config to (umask << 8) | event.
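As a quick worked example of that encoding, the sketch below packs an event selector and umask into a config value and unpacks it again. raw_config and split_config are illustrative helper names. The sketch covers only the 8-bit event and umask fields, leaving out other bits such as cmask, edge, and inv.

```rust
/// Pack an Intel event selector and unit mask into a PERF_TYPE_RAW config.
fn raw_config(event: u8, umask: u8) -> u64 {
    ((umask as u64) << 8) | event as u64
}

/// Split a config back into (event, umask), ignoring any higher bits.
fn split_config(config: u64) -> (u8, u8) {
    ((config & 0xFF) as u8, ((config >> 8) & 0xFF) as u8)
}

fn main() {
    // Event 0xD1 with umask 0x08.
    let config = raw_config(0xD1, 0x08);
    // Also print the value the way perf's rNNN raw descriptor writes it.
    println!("config = {config:#06x}, perf descriptor = r{config:04x}");

    let (event, umask) = split_config(config);
    println!("event = {event:#04x}, umask = {umask:#04x}");
}
```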

Here are the Intel Skylake cache events. These come from the MEM_LOAD_RETIRED event family (event 0xD1) and the TLB event families (0x08 for dTLB, 0x85 for iTLB). The event numbers are from the Intel SDM (Software Developer’s Manual, Volume 3B, Chapter 19). Verify these on your target system with perf list — some events are SKU-dependent.

Metric	Event	Umask	Intel SDM Name
L1 dcache load miss	0xD1	0x08	MEM_LOAD_RETIRED.L1_MISS
L2 cache miss	0xD1	0x10	MEM_LOAD_RETIRED.L2_MISS
L3 cache miss	0xD1	0x20	MEM_LOAD_RETIRED.L3_MISS
L1 dcache load hit	0xD1	0x01	MEM_LOAD_RETIRED.L1_HIT

The L3 miss counter requires SKU verification — some Intel parts don’t expose it. Check perf list on your target system before relying on it.

A note on event accuracy. MEM_LOAD_RETIRED counts retired load instructions: loads that completed. This means it doesn’t count speculative loads that were issued but discarded (e.g., on a mispredicted branch). For most monitoring use cases, retired loads are what you want: they reflect the work the program actually did, not work it speculated about and threw away. Note that MEM_INST_RETIRED is retired-only as well, so switching to it does not add speculative loads. If you need speculative traffic, look at cache-side events such as L1D.REPLACEMENT, which counts lines brought into L1D regardless of whether the triggering load retired. The distinction usually only matters for profiling, not metrics.

The TLB structure

The Translation Lookaside Buffer (TLB) is a hardware cache for virtual-to-physical address translations. Every memory access needs an address translation: virtual address → physical address. The TLB caches these translations so the CPU doesn’t have to walk the page table every time.

There are two TLBs:

dTLB (data TLB): caches translations for memory reads and writes
iTLB (instruction TLB): caches translations for instruction fetches

A TLB miss means the translation wasn’t cached. The CPU then has to walk the page table, which is a multi-level lookup and takes a few dozen cycles. On a TLB miss, the CPU stalls until the translation is available.

PMC events for TLB misses

Metric	Event	Umask	Intel SDM Name
dTLB load miss (causes walk)	0x08	0x01	DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
dTLB load walk completed	0x08	0x0E	DTLB_LOAD_MISSES.WALK_COMPLETED
iTLB miss (causes walk)	0x85	0x01	ITLB_MISSES.MISS_CAUSES_A_WALK
iTLB walk completed	0x85	0x0E	ITLB_MISSES.WALK_COMPLETED

Notice that cache and TLB events are in different event families. Cache events are 0xD1 (MEM_LOAD_RETIRED). dTLB events are 0x08 (DTLB_LOAD_MISSES). iTLB events are 0x85 (ITLB_MISSES). The “causes a walk” events are the useful ones — a TLB miss that doesn’t trigger a page table walk (e.g., a hit in the second-level TLB) isn’t as costly, so we count the ones that actually stall the core waiting for a translation.

A function to open cache counters

#![allow(unused)]
fn main() {
// PERF_TYPE_RAW = 4: raw PMC events (event numbers vary by CPU)
const PERF_TYPE_RAW: u32 = 4;

fn open_raw_pmc(
    event: u16,   // event number from Intel SDM
    umask: u8,    // unit mask from Intel SDM
    pid: libc::pid_t,
    cpu: libc::c_int,
) -> std::io::Result<libc::c_int> {
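    // Hand-rolled perf_event_attr, for platforms where libc::perf_event_attr
    // isn't available. The field ordering must match the kernel struct
    // exactly, and the struct must be at least 64 bytes (PERF_ATTR_SIZE_VER0):
    // the kernel rejects a smaller size with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64,
        sample_type: u64,
        read_format: u64,
        flags: u64, // bit 0: disabled, bit 2: pinned
        wakeup_events: u32, // unused in counting mode
        bp_type: u32,       // unused (breakpoint events only)
        config1: u64,       // unused (extra config / breakpoint address)
    }

    let attr = PerfEventAttr {
        type_: PERF_TYPE_RAW,
        size: std::mem::size_of::<PerfEventAttr>() as u32,
        config: ((umask as u64) << 8) | (event as u64),
        sample_period: 0, // counting mode: no sampling, read counter directly
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };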

    let fd = unsafe {
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const _,
            pid,
            cpu,
            -1,
            0,
        )
    };

    if fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(fd as libc::c_int)
}

fn enable_counter(fd: libc::c_int) {
    unsafe {
        // arg=0 enables this counter (not a group)
        libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0);
    }
}
}

Now opening specific counters:

#![allow(unused)]
fn main() {
// Open a counter for L1 dcache load miss (event 0xD1, umask 0x08)
// pid=0: measure this process; cpu=-1: follow across all CPUs
let l1d_miss = open_raw_pmc(0xD1, 0x08, 0, -1)?;

// Open a counter for dTLB load miss (event 0x08, umask 0x01)
let dtlb_miss = open_raw_pmc(0x08, 0x01, 0, -1)?;

// Open a counter for L3 cache miss (event 0xD1, umask 0x20)
let l3_miss = open_raw_pmc(0xD1, 0x20, 0, -1)?;
}

If a counter returns EINVAL, that event isn’t available on this CPU. Catch it and fall back.
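One way to structure the fallback is to try every candidate event, keep the ones that open, skip the ones the kernel rejects as unsupported, and propagate everything else. The sketch below takes the opener as a closure so it can run without touching the PMU. open_available is a hypothetical helper; in real code the closure would wrap open_raw_pmc. The errno values are the Linux ones, hardcoded to avoid an external crate.

```rust
use std::io;

// Linux errno values treated as "event not available on this CPU".
const ENOENT: i32 = 2;
const EINVAL: i32 = 22;

/// Try each (name, event, umask) candidate with the supplied opener.
fn open_available<F>(
    candidates: &[(&'static str, u16, u8)],
    mut open: F,
) -> io::Result<Vec<(&'static str, i32)>>
where
    F: FnMut(u16, u8) -> io::Result<i32>,
{
    let mut opened = Vec::new();
    for &(name, event, umask) in candidates {
        match open(event, umask) {
            Ok(fd) => opened.push((name, fd)),
            Err(e) if matches!(e.raw_os_error(), Some(EINVAL) | Some(ENOENT)) => {
                eprintln!("skipping {name}: not available on this CPU");
            }
            // Permission problems and anything unexpected stop the setup.
            Err(e) => return Err(e),
        }
    }
    Ok(opened)
}

fn main() -> io::Result<()> {
    // Stand-in opener: pretends the umask 0x20 event is unavailable.
    let mut next_fd = 3;
    let fake_open = |_event: u16, umask: u8| -> io::Result<i32> {
        if umask == 0x20 {
            return Err(io::Error::from_raw_os_error(EINVAL));
        }
        next_fd += 1;
        Ok(next_fd)
    };

    let candidates = [
        ("L1 dcache miss", 0xD1, 0x08),
        ("L3 cache miss", 0xD1, 0x20),
        ("dTLB load miss", 0x08, 0x01),
    ];
    for (name, fd) in open_available(&candidates, fake_open)? {
        println!("{name}: fd {fd}");
    }
    Ok(())
}
```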

System-wide monitoring: The examples above use pid=0 (measure the calling process) for simplicity. A production monitoring tool should use pid=-1 with per-CPU file descriptors to measure all processes on the system. See Part 3’s “Per-process vs. system-wide monitoring” section for the full explanation.

Computing miss rates

Raw counts don’t mean much in isolation. A workload might report 50 million L3 misses in an interval. Is that a lot? It depends on how much work it did in the same interval. The useful metric is a normalized rate: misses per thousand instructions (MPKI), or misses as a fraction of memory operations.

#![allow(unused)]
fn main() {
fn compute_miss_rate(misses: u64, instructions: u64) -> f64 {
    if instructions == 0 {
        return 0.0;
    }
    (misses as f64 / instructions as f64) * 1000.0
}
}

A few reference points for context:

L1 dcache miss rate: 2-5 misses per 1000 instructions is typical for well-optimized code
L3 miss rate: 0.5-2 per 1000 instructions means memory locality is decent; above 5 suggests poor spatial locality
dTLB miss rate: 0.1-1 per 1000 is typical; above 5 suggests TLB-unfriendly access patterns (e.g., walking a large array with a stride of a page or more, so that nearly every access lands on a different page)
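To make both normalizations concrete, here is a small sketch with made-up numbers. mpki is the per-thousand-instructions rate described above. miss_ratio divides misses by hits plus misses, which approximates the fraction of retired loads that missed when you read a matching hit/miss event pair such as L1_HIT and L1_MISS. Both function names are illustrative.

```rust
/// Misses per thousand instructions.
fn mpki(misses: u64, instructions: u64) -> f64 {
    if instructions == 0 {
        return 0.0;
    }
    misses as f64 * 1000.0 / instructions as f64
}

/// Fraction of accesses that missed, from a matching hit/miss counter pair.
fn miss_ratio(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        return 0.0;
    }
    misses as f64 / total as f64
}

fn main() {
    // Made-up interval: 50 million L3 misses over 10 billion instructions.
    let l3_mpki = mpki(50_000_000, 10_000_000_000);
    // Made-up L1 pair: 950 million hits, 50 million misses.
    let l1_ratio = miss_ratio(950_000_000, 50_000_000);

    println!("L3 MPKI: {l3_mpki:.2}");
    println!("L1 miss ratio: {:.1}%", l1_ratio * 100.0);
}
```

Under the reference points above, the made-up L3 figure of 5 MPKI sits right at the threshold where locality starts to look poor.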

Cross-microarchitecture event selection

Here’s a wrapper that selects the right event numbers based on the detected microarchitecture:

#![allow(unused)]
fn main() {
use crate::cpu::{detect, CpuMicroarch, Microarch};

#[derive(Clone, Copy)]
pub struct CacheEvent {
    pub name: &'static str,
    pub event: u16,   // event number from Intel SDM
    pub umask: u8,    // unit mask from Intel SDM
}

fn cache_events_for(microarch: &Microarch) -> Vec<CacheEvent> {
    match microarch {
        Microarch::Skylake | Microarch::SkylakeSp | Microarch::KabyLake => vec![
            CacheEvent { name: "L1 dcache miss",    event: 0xD1, umask: 0x08 },
            CacheEvent { name: "L2 cache miss",     event: 0xD1, umask: 0x10 },
            CacheEvent { name: "L3 cache miss",    event: 0xD1, umask: 0x20 },
            CacheEvent { name: "dTLB load miss",   event: 0x08, umask: 0x01 },
            CacheEvent { name: "iTLB miss",        event: 0x85, umask: 0x01 },
        ],
        Microarch::IceLake | Microarch::IceLakeSp | Microarch::RocketLake => vec![
            // Ice Lake changed some event encodings.
            // L3 miss (0xD1, umask 0x20) is not reliably available on all Ice Lake SKUs.
            // Check `perf list` on your system; if MEM_LOAD_RETIRED.L3_MISS is listed,
            // add it with event 0xD1, umask 0x20.
            CacheEvent { name: "L1 dcache miss",   event: 0xD1, umask: 0x08 },
            CacheEvent { name: "L2 cache miss",    event: 0xD1, umask: 0x10 },
            CacheEvent { name: "dTLB load miss",   event: 0x08, umask: 0x01 },
            CacheEvent { name: "iTLB miss",        event: 0x85, umask: 0x01 },
        ],
        _ => {
            // No raw event table for this microarchitecture (AMD Zen, hybrid
            // Alder Lake, Sapphire Rapids, or anything undetected). Raw
            // encodings are not portable across vendors or generations, so
            // return nothing rather than guess. Callers can fall back to the
            // kernel's generic PERF_TYPE_HW_CACHE events.
            Vec::new()
        }
    }
}
}

Counting mode vs. sampling mode

We’ve been using counting mode: open a counter, enable it, and read the cumulative count every second. This gives you a metric — a number that describes what’s happening.

Sampling mode is different: you set the counter to generate a sample (a trace event) every N events. The kernel writes each sample to a ring buffer that userspace reads. This gives you a profile — a stream of individual events that lets you see where the misses are happening.

For a monitoring dashboard, counting mode is what you want. You’re interested in “how many L3 misses per second” — not “which function is causing L3 misses.”

In short: counting mode gives metrics, sampling mode gives profiles. We’re building a metrics collector, so counting is the right tool.

Next: Part 6 — Scheduler Tracing with eBPF — Instrument the scheduler with tracepoints and measure runqueue wait times.

Part 6 — Scheduler Tracing with eBPF

The scheduler decides which task runs next. Every scheduling decision is a data point.

You can’t see this from PMCs. The CPU doesn’t know whether it’s executing a task that’s been waiting for 50 milliseconds or one that recently got scheduled. The kernel knows. It records every scheduling decision in tracepoints, and we can read those tracepoints with eBPF.

The scheduler tracepoints

The kernel exposes several scheduler tracepoints. The ones we care about:

Tracepoint	When it fires
sched:sched_switch	The scheduler switches from one task to another
sched:sched_waking	A task is about to be woken (pre-wakeup)
sched:sched_wakeup	A sleeping task has been woken
sched:sched_stat_wait	Time a task spent waiting on a runqueue
sched:sched_migrate_task	A task was migrated to another CPU

sched_switch is the most informative. It fires whenever the scheduler replaces the running task with a different one.
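On the userspace side, one common way to turn these events into a runqueue wait time is to remember when each task was woken and subtract that from the timestamp of the sched_switch that puts it on a CPU. The sketch below shows only that bookkeeping. SchedEvent and RunqueueLatency are hypothetical types, not part of Aya, and the approach covers wakeup-to-run latency only (it does not account for tasks that were preempted while still runnable).

```rust
use std::collections::HashMap;

/// Simplified events, as a userspace reader might decode them.
enum SchedEvent {
    Wakeup { pid: u32, ts_ns: u64 },
    Switch { next_pid: u32, ts_ns: u64 },
}

#[derive(Default)]
struct RunqueueLatency {
    /// pid -> timestamp of the most recent wakeup not yet matched to a switch.
    woken_at: HashMap<u32, u64>,
}

impl RunqueueLatency {
    /// Feed one event. Returns (pid, wait_ns) when a switch completes a
    /// wakeup that was seen earlier.
    fn observe(&mut self, ev: &SchedEvent) -> Option<(u32, u64)> {
        match *ev {
            SchedEvent::Wakeup { pid, ts_ns } => {
                self.woken_at.insert(pid, ts_ns);
                None
            }
            SchedEvent::Switch { next_pid, ts_ns } => {
                let woke = self.woken_at.remove(&next_pid)?;
                // saturating_sub guards against out-of-order delivery.
                Some((next_pid, ts_ns.saturating_sub(woke)))
            }
        }
    }
}

fn main() {
    let mut tracker = RunqueueLatency::default();
    let events = [
        SchedEvent::Wakeup { pid: 42, ts_ns: 1_000 },
        SchedEvent::Switch { next_pid: 42, ts_ns: 1_750 },
    ];
    for ev in &events {
        if let Some((pid, wait_ns)) = tracker.observe(ev) {
            println!("pid {pid} waited {wait_ns} ns on the runqueue");
        }
    }
}
```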

The map declaration pattern

This is the most important pattern in eBPF programming with Aya: maps are static globals.

You declare a map as a static variable with a constructor call. The #[map] macro places the map definition in the compiled object file, and the loader creates the map in the kernel when the program is loaded. You reference it by name from userspace.

#![allow(unused)]
fn main() {
use aya_ebpf::maps::RingBuf;
use aya_ebpf::macros::map;

// 32 KiB ring buffer (8 pages). The size must be a power-of-two multiple
// of page_size (4096). The userspace reader looks the map up by name; with
// a bare #[map], that name is the static's identifier, "EVENTS".
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[map]
static COUNTERS: aya_ebpf::maps::HashMap<u32, u64> =
    aya_ebpf::maps::HashMap::with_max_entries(256, 0);
}

Note that aya-ebpf maps don’t have builder methods. They’re constructed with with_max_entries() and with_byte_size() (for ring buffers). This is different from Rust’s standard library conventions, but it means the full map definition, size included, is a compile-time constant that the loader can read from the object file.
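Because the size is a compile-time constant, the power-of-two rule can be checked at compile time too. The sketch below is plain Rust with no Aya dependency. PAGE_SIZE is assumed to be 4096, and the helper names are illustrative.

```rust
// Assumed page size; some architectures use larger pages.
const PAGE_SIZE: u32 = 4096;

/// True if `bytes` is a usable ring buffer size: a power of two that is
/// also a whole number of pages.
const fn is_valid_ringbuf_size(bytes: u32) -> bool {
    bytes >= PAGE_SIZE && bytes % PAGE_SIZE == 0 && bytes.is_power_of_two()
}

/// Round a minimum size up to the next valid ring buffer size.
/// Returns None if the result would overflow u32.
fn round_up_ringbuf_size(min_bytes: u32) -> Option<u32> {
    let at_least_a_page = if min_bytes < PAGE_SIZE { PAGE_SIZE } else { min_bytes };
    at_least_a_page.checked_next_power_of_two()
}

// Same size as the EVENTS map above. The build fails if it is invalid.
const EVENTS_BYTES: u32 = 8 * 4096;
const _: () = assert!(is_valid_ringbuf_size(EVENTS_BYTES));

fn main() {
    println!("{} bytes valid: {}", EVENTS_BYTES, is_valid_ringbuf_size(EVENTS_BYTES));
    // 3 pages is not a power of two; the next valid size is 4 pages.
    println!("3 pages rounds up to {:?}", round_up_ringbuf_size(3 * PAGE_SIZE));
}
```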

Reading tracepoint arguments

A tracepoint has a fixed payload — a chunk of memory that contains the arguments. The kernel defines the format. You read from the tracepoint context at specific byte offsets using read_at().
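The kernel publishes each tracepoint's layout as text in a format file under /sys/kernel/tracing/events/, one field per line with its offset and size. Rather than copying offsets by hand, you can extract them with a few lines of parsing. The sketch below is a minimal, std-only parser. parse_format_fields is a hypothetical helper, and SAMPLE is an abbreviated example of what sched_switch looks like on x86-64, so check the real file on your kernel.

```rust
/// Extract (field name, offset, size) from tracepoint format text.
fn parse_format_fields(format: &str) -> Vec<(String, usize, usize)> {
    let mut fields = Vec::new();
    for line in format.lines() {
        let Some(rest) = line.trim().strip_prefix("field:") else {
            continue;
        };
        let mut parts = rest.split(';').map(str::trim);
        // First part is the C declaration, e.g. "char prev_comm[16]".
        let Some(decl) = parts.next() else { continue };
        let Some(last) = decl.rsplit(' ').next() else { continue };
        let name = last.split('[').next().unwrap_or(last).trim_start_matches('*');

        let mut offset = None;
        let mut size = None;
        for part in parts {
            if let Some(v) = part.strip_prefix("offset:") {
                offset = v.trim().parse::<usize>().ok();
            } else if let Some(v) = part.strip_prefix("size:") {
                size = v.trim().parse::<usize>().ok();
            }
        }
        if let (Some(offset), Some(size)) = (offset, size) {
            fields.push((name.to_string(), offset, size));
        }
    }
    fields
}

// Abbreviated sample; the real file lists every field.
const SAMPLE: &str = "\
name: sched_switch
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:char prev_comm[16];\toffset:8;\tsize:16;\tsigned:0;
\tfield:pid_t prev_pid;\toffset:24;\tsize:4;\tsigned:1;
\tfield:long prev_state;\toffset:32;\tsize:8;\tsigned:1;
\tfield:pid_t next_pid;\toffset:56;\tsize:4;\tsigned:1;
";

fn main() {
    for (name, offset, size) in parse_format_fields(SAMPLE) {
        println!("{name:<12} offset={offset:<3} size={size}");
    }
}
```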

Here’s the layout for sched:sched_switch on Linux 5.x:

Offset   Type     Field
------   ----   -----
0        char[16]  prev_comm        (TASK_COMM_LEN)
16       u32       prev_pid
20       u32       prev_prio
24       u64       prev_state       (the TASK_* state mask)
32       char[16]  next_comm        (TASK_COMM_LEN)
48       u32       next_pid
52       u32       next_prio

The prev_comm and next_comm fields are 16-byte character arrays containing the process name (TASK_COMM_LEN is 16 in the kernel). They take up space in the payload even though we don’t read them — every offset after them is shifted.

Always verify on your system:

cat /sys/kernel/tracing/events/sched/sched_switch/format

This prints the exact field layout including offsets. The format file measures offsets from the start of the record, common header included, and read_at() measures from the same place, so use the printed offsets as they are. See the detailed cross-check procedure in Part 2. Kernel versions can and do change tracepoint layouts.
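Checking the format file by eye works, but the comparison can also be automated at startup. The sketch below parses the text of a format file into a list of fields. FormatField and parse_format are hypothetical names, and the parser assumes the usual `field:<decl>; offset:<n>; size:<n>;` line shape; it returns the offsets exactly as printed.

```rust
/// One `field:` line from a tracepoint format file.
#[derive(Debug, PartialEq)]
struct FormatField {
    name: String,
    offset: usize,
    size: usize,
}

/// Extracts name, offset and size from every line shaped like
/// `field:pid_t prev_pid;  offset:24;  size:4;  signed:1;`.
/// Lines that do not match are skipped.
fn parse_format(text: &str) -> Vec<FormatField> {
    let mut fields = Vec::new();
    for line in text.lines() {
        let rest = match line.trim().strip_prefix("field:") {
            Some(rest) => rest,
            None => continue,
        };
        let mut parts = rest.split(';').map(str::trim);
        // The first part is the C declaration, e.g. "char prev_comm[16]".
        let decl = parts.next().unwrap_or("");
        // The field name is the last token, minus any array suffix.
        let name = decl
            .split_whitespace()
            .last()
            .unwrap_or("")
            .split('[')
            .next()
            .unwrap_or("")
            .to_string();
        let mut offset: Option<usize> = None;
        let mut size: Option<usize> = None;
        for part in parts {
            if let Some(v) = part.strip_prefix("offset:") {
                offset = v.parse().ok();
            } else if let Some(v) = part.strip_prefix("size:") {
                size = v.parse().ok();
            }
        }
        if let (Some(offset), Some(size)) = (offset, size) {
            if !name.is_empty() {
                fields.push(FormatField { name, offset, size });
            }
        }
    }
    fields
}
```

Feed it the contents of the sched_switch format file and compare the result against the offsets hard-coded in the handler. Refusing to attach on a mismatch is safer than reading the wrong bytes.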

The struct in our eBPF program:

// ebpf-programs/src/scheduler.rs

use aya_ebpf::helpers::{bpf_get_smp_processor_id, bpf_ktime_get_ns};
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::RingBuf;
use aya_ebpf::programs::TracePointContext;

// Byte offsets into the sched_switch record, as printed in the tracepoint's
// format file. They count from the start of the record, common header included.
const PREV_PID_OFFSET: usize = 24;
const PREV_STATE_OFFSET: usize = 32;
const NEXT_PID_OFFSET: usize = 56;

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub timestamp: u64,
}

// Declare the ring buffer as a static
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    let prev_pid = unsafe { ctx.read_at::<u32>(PREV_PID_OFFSET).unwrap_or(0) };
    let prev_state = unsafe { ctx.read_at::<u64>(PREV_STATE_OFFSET).unwrap_or(0) };
    let next_pid = unsafe { ctx.read_at::<u32>(NEXT_PID_OFFSET).unwrap_or(0) };
    let cpu_id = unsafe { bpf_get_smp_processor_id() };
    let timestamp = unsafe { bpf_ktime_get_ns() };

    let event = SchedSwitchEvent {
        cpu_id,
        prev_pid,
        prev_state,
        next_pid,
        timestamp,
    };

    // output() copies the event into the ring buffer. It returns an error when
    // the buffer is full; we drop the event in that case.
    let _ = EVENTS.output(&event, 0);

    0
}

A few things to notice here:

unsafe around read_at: read_at wraps bpf_probe_read, which reads kernel memory at an address computed from a raw pointer. The Rust compiler can’t prove that the offset and type match what is really stored there, so the call is unsafe. In practice you’re reading from a record the kernel has just written, so it’s safe as long as the offset and type agree with the tracepoint format.

bpf_get_smp_processor_id() and bpf_ktime_get_ns(): These are raw BPF helpers. Aya wraps many helpers, but these two are so fundamental that they’re exposed directly as unsafe extern calls through the bindings. They’re available in every eBPF program.

EVENTS.output(): The ring buffer is a static. We call .output() directly on it — no ctx involved. The ring buffer is declared at the top of the file and lives for the lifetime of the program.

Building a runqueue wait histogram

The wait time is the time between when a task was woken and when it actually starts running on a CPU. To measure it, we need to correlate events from two tracepoints: sched:sched_waking (when a task is about to wake) and sched:sched_switch (when a task starts running). When sched_switch fires and the next_pid matches a task we saw in sched_waking, we know how long that task waited.

The approach: a hash map keyed by PID. When a task is woken (via sched_waking), record the timestamp. When a task starts running (via sched_switch where next_pid matches), look up the timestamp, compute the wait, and delete the entry.

Why sched_waking instead of sched_wakeup? sched_waking fires when the wake signal is about to be sent — slightly earlier and more reliable for measuring the full wait. sched_wakeup fires after the target task has been added to the runqueue. For wait time measurement, the difference is negligible, but sched_waking is the more common choice in production tools.

The sched:sched_waking payload layout:

Offset   Type      Field
------   ----      -----
0        8 bytes   common header
8        char[16]  comm            (TASK_COMM_LEN)
24       u32       pid
28       u32       prio
32       u32       target_cpu

Verify on your system: cat /sys/kernel/tracing/events/sched/sched_waking/format (read_at offsets are the offsets printed in the format file; see Part 2 for the cross-check procedure)

#![allow(unused)]
fn main() {
// ebpf-programs/src/scheduler.rs

use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

// Map: PID → waking timestamp (nanoseconds)
#[map]
static WAKE_TS: HashMap<i32, u64> = HashMap::with_max_entries(1024, 0);

// Ring buffer for wait events
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct WaitEvent {
    pub pid: u32,
    pub wait_ns: u64,
    pub cpu_id: u32,
}

// sched:sched_waking: record when a task is about to be woken
// Record layout: common header (8 bytes), comm (char[16]) at 8, pid (u32) at 24,
//                prio (u32) at 28, target_cpu (u32) at 32
#[tracepoint]
pub fn sched_waking(ctx: TracePointContext) -> u32 {
    let pid = unsafe { ctx.read_at::<i32>(24).unwrap_or(0) };
    let ts = unsafe { bpf_ktime_get_ns() };

    // Skip PID 0, the per-CPU idle task. Kernel threads have ordinary
    // non-zero PIDs and are recorded like any other task.
    if pid > 0 {
        // insert() fails when the map is full; losing one sample is acceptable.
        let _ = WAKE_TS.insert(&pid, &ts, 0);
    }

    0
}

// sched:sched_switch: check if the incoming task was waiting
// Record layout: common header (8 bytes), prev_comm (char[16]) at 8, prev_pid (u32) at 24,
//                prev_prio (u32) at 28, prev_state (u64) at 32, next_comm (char[16]) at 40,
//                next_pid (u32) at 56
#[tracepoint]
pub fn sched_switch_wait(ctx: TracePointContext) -> u32 {
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };

    if next_pid > 0 {
        let ts = unsafe { bpf_ktime_get_ns() };
        // WAKE_TS key type is i32 (kernel PIDs are pid_t). Cast for lookup.
        // Safe because we only reach this branch when next_pid > 0,
        // and all real PIDs fit in both u32 and i32.
        let pid_key = next_pid as i32;
        // SAFETY: WAKE_TS.get() is unsafe because the kernel doesn't
        // guarantee atomicity without BPF_F_NO_PREALLOC. For our purposes —
        // measuring scheduler wait time — occasional corruption is acceptable
        // since it means one lost measurement at worst.
        unsafe {
            if let Some(&wake_ts) = WAKE_TS.get(&pid_key) {
                let wait_ns = ts.saturating_sub(wake_ts);
                let cpu_id = bpf_get_smp_processor_id();
                let event = WaitEvent { pid: next_pid, wait_ns, cpu_id };
                // output() returns an error when the ring buffer is full; drop the event then.
                let _ = EVENTS.output(&event, 0);
                let _ = WAKE_TS.remove(&pid_key);
            }
        }
    }

    0
}
}

Two tracepoints, one hash map. sched_waking writes the timestamp when a task is about to wake. sched_switch reads it back when the task starts running.

The next_pid in sched_switch matches the pid from sched_waking; that’s how we correlate the two events. The wait time is the difference between the two timestamps.

Merging the handlers: This chapter shows three separate sched_switch handlers: one for basic tracing, one for wait times, one for involuntary switches. They’re split here so each concept is clear on its own. In a real program you’d combine them into a single handler that reads prev_pid, prev_state, and next_pid once and then does all three operations (ring buffer output, wait lookup, involuntary counting) in the same function. The “Merging the handlers” section later in this chapter shows the combined version.

This is harder than a single-tracepoint approach — you need to correlate events from two sources. But it’s the only correct way to measure runqueue wait time with the tracepoints the kernel actually provides. The scheduler’s internal enqueue/dequeue operations aren’t exposed as tracepoints.
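The handlers above emit one wait_ns value per wakeup, and something still has to turn that stream into a histogram. A minimal sketch of the userspace side, assuming power-of-two buckets like those used by common latency tools; WaitHistogram and its methods are hypothetical names, not part of Aya.

```rust
/// Number of power-of-two buckets. Bucket i counts waits in
/// [2^i, 2^(i+1)) nanoseconds; a wait of 0 ns also lands in bucket 0.
const BUCKETS: usize = 64;

/// Hypothetical log2 histogram of runqueue wait times.
struct WaitHistogram {
    counts: [u64; BUCKETS],
}

impl WaitHistogram {
    fn new() -> Self {
        Self { counts: [0; BUCKETS] }
    }

    /// floor(log2(wait_ns)), with 0 mapped to bucket 0.
    fn bucket_index(wait_ns: u64) -> usize {
        if wait_ns == 0 {
            0
        } else {
            (63 - wait_ns.leading_zeros()) as usize
        }
    }

    /// Adds one observation to the matching bucket.
    fn record(&mut self, wait_ns: u64) {
        self.counts[Self::bucket_index(wait_ns)] += 1;
    }
}
```

Call record(event.wait_ns) for each WaitEvent read from the ring buffer. A fixed array of 64 counters keeps memory constant no matter how many events arrive, at the cost of resolution within each bucket.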

insert(), get(), remove(): HashMap in Aya eBPF has insert() (returns Result<(), c_long>), get() (unsafe, returns Option<&V>), and remove() (returns Result<(), c_long>). There’s no get_mut() — eBPF maps are accessed by reference.

get() returns Option<&V>: When the key isn’t found, get() returns None. Two patterns for handling this:

if let Some(&val) = map.get(&key) — when the “not found” case means “skip this event.” Used in sched_switch_wait above: if the PID isn’t in WAKE_TS, there’s nothing to compute.
map.get(&key).copied().unwrap_or(0) — when the “not found” case means “start from zero.” Used for counter increments: if the CPU isn’t in the map, the count is zero.

Both patterns are idiomatic. Pick based on the semantics of the “not found” case.

Counting context switches

Every sched_switch is a context switch. Counting them is straightforward: we’ll extend the sched_switch handler from the previous section with a hash map keyed by CPU ID:

#![allow(unused)]
fn main() {
// ebpf-programs/src/scheduler.rs

#[map]
static CTXSW_COUNTERS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    let cpu_id = unsafe { bpf_get_smp_processor_id() };

    unsafe {
        let count = CTXSW_COUNTERS.get(&cpu_id).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = CTXSW_COUNTERS.insert(&cpu_id, &new_count, 0);
    }

    0
}
}

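Note that get() returns a reference into the map, so .copied() turns the Option<&u64> into an Option<u64> before we add to it. insert() takes references to the key and the value and copies the value into the map. Because each CPU only ever updates its own key, this read-modify-write sequence doesn’t contend with other CPUs. This is a common pattern for counter increments in eBPF.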

Involuntary context switches

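An involuntary context switch is one where the running task didn’t voluntarily give up the CPU: it got preempted or its time slice expired. We can detect this from the same sched_switch tracepoint by checking whether the outgoing task (prev_pid) was still runnable when it got switched out.

In the kernel, TASK_RUNNING is state 0. If prev_state in sched_switch is 0, the task was still runnable and did not block, so the switch was involuntary. One caveat: many kernels flag a preempted task with a dedicated bit in prev_state (shown as R+ in the print fmt line of the format file) instead of reporting plain 0. Check that line on your kernel and extend the test if it applies: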

#![allow(unused)]
fn main() {
// TASK_RUNNING = 0 in the kernel
const TASK_RUNNING: u64 = 0;

#[map]
static INVOL_CTXSW: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    // Offsets as printed in the format file: prev_pid at 24, prev_state at 32
    let prev_state = unsafe { ctx.read_at::<u64>(32).unwrap_or(0) };
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };

    // prev_state == 0 means the task was still runnable (TASK_RUNNING)
    // prev_pid > 0 means it wasn't the idle task, so the switch was involuntary
    let is_involuntary = prev_state == TASK_RUNNING && prev_pid > 0;

    if is_involuntary {
        let cpu = unsafe { bpf_get_smp_processor_id() };
        unsafe {
            let count = INVOL_CTXSW.get(&cpu).copied().unwrap_or(0u64);
            let new_count = count + 1;
            let _ = INVOL_CTXSW.insert(&cpu, &new_count, 0);
        }
    }

    0
}
}

Steal time

On virtualized systems, the hypervisor sometimes doesn’t give a vCPU any time to run even though it was runnable. The kernel records this as steal time. Unlike other metrics in this tutorial, steal time isn’t available through a scheduler tracepoint — it’s reported in /proc/stat:

cpu  2255 34 2290 22625563 6290 0 236 0 0 0
cpu0 1132 17 1145 11312781 3145 0 118 0 0 0
cpu1 1123 17 1145 11312782 3145 0 118 0 0 0

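The fields are: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. The steal field is the 8th value after the cpu label, which is index 8 once the line is split on whitespace because the label itself is index 0. It counts the time, in USER_HZ ticks (typically 1/100 of a second), during which the vCPU was runnable but the hypervisor ran something else.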

#![allow(unused)]
fn main() {
use std::fs;

struct CpuStat {
    pub user: u64,
    pub nice: u64,
    pub system: u64,
    pub idle: u64,
    pub iowait: u64,
    pub irq: u64,
    pub softirq: u64,
    pub steal: u64,
}

fn read_proc_stat() -> std::io::Result<Vec<CpuStat>> {
    let content = fs::read_to_string("/proc/stat")?;
    let mut cpus = Vec::new();

    for line in content.lines() {
        let parts: Vec<&str> = line.split_whitespace().collect();
        if parts.is_empty() || !parts[0].starts_with("cpu") {
            continue;
        }
        // Skip the aggregate "cpu " line — we want per-CPU (cpu0, cpu1, ...)
        if parts[0] == "cpu" {
            continue;
        }

        cpus.push(CpuStat {
            user:   parts.get(1).and_then(|v| v.parse().ok()).unwrap_or(0),
            nice:   parts.get(2).and_then(|v| v.parse().ok()).unwrap_or(0),
            system: parts.get(3).and_then(|v| v.parse().ok()).unwrap_or(0),
            idle:   parts.get(4).and_then(|v| v.parse().ok()).unwrap_or(0),
            iowait: parts.get(5).and_then(|v| v.parse().ok()).unwrap_or(0),
            irq:    parts.get(6).and_then(|v| v.parse().ok()).unwrap_or(0),
            softirq:parts.get(7).and_then(|v| v.parse().ok()).unwrap_or(0),
            steal:  parts.get(8).and_then(|v| v.parse().ok()).unwrap_or(0),
        });
    }

    Ok(cpus)
}
}

Compute the steal ratio (fraction of total CPU time spent stolen):

#![allow(unused)]
fn main() {
fn steal_ratio(stat: &CpuStat) -> f64 {
    let total = stat.user + stat.nice + stat.system + stat.idle
        + stat.iowait + stat.irq + stat.softirq + stat.steal;
    if total == 0 {
        return 0.0;
    }
    stat.steal as f64 / total as f64
}
}

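Steal time above 5-10% usually means the host is oversubscribed: there are more vCPUs competing for physical CPUs than physical CPUs available. Your workload is spending real time waiting for the hypervisor, not doing useful work. Keep in mind that the /proc/stat counters are cumulative since boot, so a ratio computed from a single read is an average over the whole uptime. To see what is happening now, compute the ratio over the difference between two samples.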
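Since /proc/stat reports cumulative ticks, a ratio over one polling interval needs two samples. A minimal sketch under that assumption: CpuSample and steal_ratio_between are hypothetical names, and total is assumed to be the same eight-field sum that steal_ratio uses.

```rust
/// Cumulative tick counts for one CPU, taken from one /proc/stat read.
#[derive(Clone, Copy, Debug)]
struct CpuSample {
    /// Sum of user, nice, system, idle, iowait, irq, softirq and steal.
    total: u64,
    /// The steal column on its own.
    steal: u64,
}

/// Fraction of the interval between two samples that was stolen.
/// Returns None if no ticks elapsed or a counter went backwards.
fn steal_ratio_between(prev: CpuSample, curr: CpuSample) -> Option<f64> {
    let total = curr.total.checked_sub(prev.total)?;
    let steal = curr.steal.checked_sub(prev.steal)?;
    if total == 0 {
        return None;
    }
    Some(steal as f64 / total as f64)
}
```

Returning Option rather than 0.0 lets the caller tell "no steal" apart from "no data for this interval".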

This is a procfs metric, not an eBPF metric — no tracepoint or kprobe required. The kernel already tracks it. We read it alongside the scheduler tracepoint data in the same polling loop.

Checking tracepoint availability. Scheduler tracepoints are available on virtually every Linux kernel, but their names and arguments can change between versions. Before your monitoring tool starts, verify that the tracepoints it needs actually exist:

# List all scheduler tracepoints on this kernel
ls /sys/kernel/tracing/events/sched/

# Check a specific tracepoint
cat /sys/kernel/tracing/events/sched/sched_switch/id

If cat .../id prints a number, the tracepoint exists and can be attached. If you get “No such file or directory,” the tracepoint isn’t available on this kernel; your program should skip attaching it rather than crashing. In the userspace code below, you’d guard the program.attach() call with a check like this:

// Check that the tracepoint exists before attaching
let tp_id = std::fs::read_to_string(
    "/sys/kernel/tracing/events/sched/sched_switch/id"
);
if tp_id.is_ok() {
    program.attach("sched", "sched_switch")?;
} else {
    eprintln!("sched:sched_switch not available on this kernel, skipping");
}
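The check above hard-codes /sys/kernel/tracing. On older systems tracefs is often reachable only under /sys/kernel/debug/tracing, so a slightly more tolerant sketch tries a list of roots. find_tracepoint_id and TRACEFS_ROOTS are hypothetical names; the roots are passed in as a parameter so the function can be tested against a temporary directory.

```rust
use std::path::{Path, PathBuf};

/// Usual tracefs mount points: the current one first, then the legacy
/// debugfs location.
const TRACEFS_ROOTS: [&str; 2] = ["/sys/kernel/tracing", "/sys/kernel/debug/tracing"];

/// Returns the path of the first `events/<category>/<name>/id` file that
/// exists under any of the given roots, or None if the tracepoint is missing.
fn find_tracepoint_id(roots: &[&Path], category: &str, name: &str) -> Option<PathBuf> {
    roots
        .iter()
        .map(|root| root.join("events").join(category).join(name).join("id"))
        .find(|path| path.is_file())
}
```

In the monitor you would build the roots from TRACEFS_ROOTS with Path::new and attach only when the function returns Some.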

Merging the handlers

The three sched_switch handlers above each do one thing. In a real program, you’d combine them into a single handler that reads the tracepoint payload once and does all three operations. Here’s what that looks like:

#![allow(unused)]
fn main() {
// ebpf-programs/src/scheduler.rs

use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

const TASK_RUNNING: u64 = 0;

// Shared maps (declared once, used by both sched_switch and sched_waking)
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);
#[map]
static WAKE_TS: HashMap<i32, u64> = HashMap::with_max_entries(1024, 0);
#[map]
static CTXSW_COUNTERS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static INVOL_CTXSW: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub wait_ns: u64,    // 0 if no waking timestamp was found
    pub timestamp: u64,
}

// Combined sched_switch handler: tracing + wait lookup + involuntary counting
//
// Record layout (Linux 5.x; verify with
//   cat /sys/kernel/tracing/events/sched/sched_switch/format).
// Offsets count from the start of the record, common header included:
//   offset 0:  common header (8 bytes)
//   offset 8:  prev_comm  char[16]
//   offset 24: prev_pid   u32
//   offset 28: prev_prio  u32
//   offset 32: prev_state u64
//   offset 40: next_comm  char[16]
//   offset 56: next_pid   u32
//   offset 60: next_prio  u32
#[tracepoint]
pub fn sched_switch_combined(ctx: TracePointContext) -> u32 {
    // Read the payload once; the values are shared across all three operations
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };
    let prev_state = unsafe { ctx.read_at::<u64>(32).unwrap_or(0) };
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };
    let cpu_id = unsafe { bpf_get_smp_processor_id() };
    let timestamp = unsafe { bpf_ktime_get_ns() };

    // --- Operation 1: Ring buffer event ---
    // Look up the waking timestamp for next_pid to compute wait time
    let mut wait_ns = 0u64;
    if next_pid > 0 {
        let pid_key = next_pid as i32;
        unsafe {
            if let Some(&wake_ts) = WAKE_TS.get(&pid_key) {
                wait_ns = timestamp.saturating_sub(wake_ts);
                let _ = WAKE_TS.remove(&pid_key);
            }
        }
    }

    let event = SchedSwitchEvent {
        cpu_id,
        prev_pid,
        prev_state,
        next_pid,
        wait_ns,
        timestamp,
    };
    // output() returns an error when the ring buffer is full; drop the event then.
    let _ = EVENTS.output(&event, 0);

    // --- Operation 2: Context switch counter ---
    unsafe {
        let count = CTXSW_COUNTERS.get(&cpu_id).copied().unwrap_or(0u64);
        let _ = CTXSW_COUNTERS.insert(&cpu_id, &(count + 1), 0);
    }

    // --- Operation 3: Involuntary context switch counter ---
    if prev_state == TASK_RUNNING && prev_pid > 0 {
        unsafe {
            let count = INVOL_CTXSW.get(&cpu_id).copied().unwrap_or(0u64);
            let _ = INVOL_CTXSW.insert(&cpu_id, &(count + 1), 0);
        }
    }

    0
}
}

Key differences from the separate handlers:

One read_at pass. The combined handler reads prev_pid, prev_state, and next_pid once. The separate handlers each read independently — that’s three times the work per context switch.
Wait time is embedded in the event. The SchedSwitchEvent struct now includes wait_ns. If the incoming task (next_pid) doesn’t have a waking timestamp, wait_ns is 0 — the userspace reader checks for this.
One sched_waking handler still needed. The combined sched_switch handler only replaces the three separate sched_switch variants. The sched_waking handler (which records the wake timestamp) is unchanged.

The lib.rs registration for the combined approach needs only two entry points instead of three:

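// ebpf-programs/src/lib.rs

#![no_std]

use aya_ebpf::macros::tracepoint;
use aya_ebpf::programs::TracePointContext;

mod scheduler;

// This delegation compiles only if the functions in scheduler.rs are plain
// `pub fn`s. The #[tracepoint] macro rewrites the annotated function to take
// the raw context pointer, so apply it to the entry points here and not also
// to the functions they call.
#[tracepoint]
pub fn trace_sched_switch(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch_combined(ctx)
}

#[tracepoint]
pub fn trace_sched_waking(ctx: TracePointContext) -> u32 {
    scheduler::sched_waking(ctx)
}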

Reading in userspace

The userspace side loads the eBPF programs, creates maps, attaches tracepoints, and reads from the ring buffer. The snippet below covers the last step: draining whatever is currently in the ring buffer.

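// monitor/src/main.rs

use anyhow::anyhow;
use aya::maps::RingBuf;
use aya::programs::TracePoint; // used when loading and attaching (not shown here)
use aya::Ebpf;

// Must match the layout of the struct the eBPF program writes. This is the
// struct from the basic handler; the combined handler's struct also carries
// wait_ns, so mirror that one instead if you load the combined program.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub timestamp: u64,
}

async fn poll_scheduler(ebpf: &mut Ebpf) -> anyhow::Result<()> {
    // map_mut() returns an Option. The map is registered under the name of the
    // static in the eBPF program ("EVENTS") unless #[map(name = "...")] overrides it.
    let map = ebpf
        .map_mut("EVENTS")
        .ok_or_else(|| anyhow!("map EVENTS not found"))?;
    let mut ring_buf = RingBuf::try_from(map)?;

    let mut ctxsw_total = 0u64;
    let mut involuntary = 0u64;

    // next() returns None once the buffer is empty, so this drains it once.
    while let Some(item) = ring_buf.next() {
        // item derefs to &[u8]; skip records too short to hold our struct
        if item.len() < std::mem::size_of::<SchedSwitchEvent>() {
            continue;
        }
        let event = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const SchedSwitchEvent)
        };

        ctxsw_total += 1;
        // Involuntary: was running (prev_state == 0) and wasn't idle (prev_pid > 0)
        if event.prev_state == 0 && event.prev_pid > 0 {
            involuntary += 1;
        }
    }

    if ctxsw_total > 0 {
        let inv_rate = involuntary as f64 / ctxsw_total as f64;
        println!("ctxsw={ctxsw_total}  involuntary_rate={inv_rate:.3}");
    }

    Ok(())
}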

Key points about the userspace code:

RingBuf::try_from(map_data): You create the ring buffer from a map reference. ebpf.map_mut(name) looks the map up by the name the eBPF program registered and returns an Option, so handle the None case before converting. RingBuf::try_from() then borrows the map mutably for as long as the reader is in use.

ring_buf.next() returns Option<RingBufItem<'_>>: RingBufItem derefs to &[u8] — a byte slice containing the event data. You cast it back to your struct type. read_unaligned is important because the data may not be aligned to the struct’s alignment requirements.

TracePoint (capital P): The userspace program type is TracePoint, not Tracepoint. This is the Aya convention — program types are PascalCase.
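The byte-slice-to-struct step is the part most likely to go wrong, so it helps to isolate it in a function that checks the length first. A minimal sketch in plain Rust, assuming the five-field SchedSwitchEvent layout from the basic handler; decode_event is a hypothetical helper.

```rust
use std::mem::size_of;

/// Mirror of the event written by the basic sched_switch handler.
/// With repr(C) there are 4 padding bytes after next_pid, so the
/// struct is 32 bytes long.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct SchedSwitchEvent {
    cpu_id: u32,
    prev_pid: u32,
    prev_state: u64,
    next_pid: u32,
    timestamp: u64,
}

/// Decodes one ring buffer record, or returns None if it is too short.
fn decode_event(bytes: &[u8]) -> Option<SchedSwitchEvent> {
    if bytes.len() < size_of::<SchedSwitchEvent>() {
        return None;
    }
    // SAFETY: the length was checked above, every field is a plain integer so
    // any bit pattern is valid, and read_unaligned accepts any alignment.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const SchedSwitchEvent) })
}
```

In the polling loop, decode_event(&item) would replace the inline unsafe block, and a None result signals a layout mismatch between the eBPF and userspace structs.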

Reading counter maps

For counter maps (like CTXSW_COUNTERS and INVOL_CTXSW), you read them directly from userspace without going through the ring buffer:

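use anyhow::anyhow;
use aya::maps::HashMap;
use aya::Ebpf;

async fn read_counters(ebpf: &mut Ebpf) -> anyhow::Result<()> {
    // map_mut() returns an Option. The map name is the name of the static in
    // the eBPF program unless #[map(name = "...")] overrides it.
    let map = ebpf
        .map_mut("CTXSW_COUNTERS")
        .ok_or_else(|| anyhow!("map CTXSW_COUNTERS not found"))?;
    let counters: HashMap<_, u32, u64> = HashMap::try_from(map)?;

    // iter() yields Result<(key, value), MapError>, so unwrap each entry with `?`
    for entry in counters.iter() {
        let (cpu, count) = entry?;
        println!("cpu={cpu} ctxsw={count}");
    }

    Ok(())
}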

The counter values are cumulative since the program was loaded. To get per-second rates, save the previous reading and compute the delta.
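The delta computation can be sketched without touching Aya at all by copying each reading into an ordinary std HashMap first. per_cpu_rates is a hypothetical helper. A CPU that is missing from the previous reading is treated as starting from zero, which matches how the eBPF side creates entries on first use.

```rust
use std::collections::HashMap;

/// Converts two cumulative per-CPU readings into events per second.
/// Returns an empty map if no time has elapsed.
fn per_cpu_rates(
    prev: &HashMap<u32, u64>,
    curr: &HashMap<u32, u64>,
    elapsed_secs: f64,
) -> HashMap<u32, f64> {
    let mut rates = HashMap::new();
    if elapsed_secs <= 0.0 {
        return rates;
    }
    for (&cpu, &now) in curr {
        let before = prev.get(&cpu).copied().unwrap_or(0);
        // saturating_sub keeps the rate at 0 if the counter was reset.
        let delta = now.saturating_sub(before);
        rates.insert(cpu, delta as f64 / elapsed_secs);
    }
    rates
}
```

Keep the current reading around after each poll so it becomes prev on the next one.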

Registering programs in lib.rs

The eBPF programs are defined in separate .rs files and registered in lib.rs:

// ebpf-programs/src/lib.rs

#![no_std]

use aya_ebpf::macros::tracepoint;
use aya_ebpf::programs::TracePointContext;

mod scheduler;

// These wrappers compile only if the functions in scheduler.rs are plain
// `pub fn`s. The #[tracepoint] macro rewrites the annotated function to take
// the raw context pointer, so apply it here or in scheduler.rs, not in both.
#[tracepoint]
pub fn trace_sched_switch(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch(ctx)
}

#[tracepoint]
pub fn trace_sched_waking(ctx: TracePointContext) -> u32 {
    scheduler::sched_waking(ctx)
}

#[tracepoint]
pub fn trace_sched_switch_wait(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch_wait(ctx)
}

Each entry point uses the #[tracepoint] form with no arguments. The macro marks the function as an eBPF tracepoint program; the category and name are provided from userspace via program.attach(). The function body delegates to the module implementation, which therefore has to be a plain function: apply #[tracepoint] either to these wrappers or directly to the functions in scheduler.rs, but not to both. In a real project, you might define the programs directly in main.rs or split them into modules. Either way, the macro goes on the function that becomes the program’s entry point.

Next: Part 7 — NUMA and Memory Metrics — Track page migration rates and remote memory access ratios on multi-socket systems.

Part 7 — NUMA and Memory Metrics

On a NUMA system, memory access has a cost that depends on which socket you’re on.

NUMA stands for Non-Uniform Memory Access. On a multi-socket server, each CPU socket has its own local memory. When a task running on socket A accesses memory attached to socket B, the data has to travel across the inter-socket interconnect. That takes longer than accessing local memory.

If your workload is hitting 80% remote memory access, you’re paying the interconnect tax on most of your memory traffic. That’s a NUMA problem.

The basics in plain language

A node in Linux’s NUMA vocabulary is a group of CPUs and memory that are physically close. On a dual-socket system, you typically have node 0 and node 1. Each node has its own local memory. Memory attached to node 0 is “local” to socket 0’s CPUs, and “remote” to socket 1’s CPUs.

Linux has a NUMA balancer that moves pages between nodes at runtime to try to keep tasks running close to their data. When the balancer kicks in, it migrates pages. Too much migration is a sign that tasks are bouncing between sockets.

/proc/vmstat — the key file

/proc/vmstat is a flat list of virtual memory statistics. Most of them are for kernel internals, but a few are relevant for NUMA:

numa_hit       12345678   // allocations intended for a node that were satisfied on that node
numa_miss      234567     // allocations satisfied on a node other than the preferred one
numa_foreign   12345      // allocations intended for a node but satisfied elsewhere
pgmigrate_success  98765  // pages successfully migrated
pgmigrate_fail     123    // migration attempts that failed

The key fields:

numa_hit: a process wanted memory from a node and got it there (the good case)
numa_miss: a process wanted memory from one node but ended up with memory from another (the bad case)
pgmigrate_success: how many pages were successfully migrated. This counts every page migration, including those done by memory compaction, not only NUMA balancing

These are allocation counters, not access counters. They tell you where pages were placed, which is a proxy for how much remote traffic to expect.

The NUMA remote ratio is:

remote_ratio = numa_miss / (numa_hit + numa_miss)

A remote ratio above 0.2 (20%) means more than a fifth of page allocations ended up on a node other than the preferred one, so a good share of later accesses will cross the interconnect. This is worth optimizing.
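The formula has one edge case worth handling explicitly: a freshly booted or single-node system can report zero for both counters. A minimal sketch, with remote_ratio as a hypothetical helper name:

```rust
/// numa_miss / (numa_hit + numa_miss), or None when both counters are zero
/// (or their sum would overflow), so callers never divide by zero.
fn remote_ratio(numa_hit: u64, numa_miss: u64) -> Option<f64> {
    let total = numa_hit.checked_add(numa_miss)?;
    if total == 0 {
        return None;
    }
    Some(numa_miss as f64 / total as f64)
}
```

For interval-based monitoring, pass in the deltas between two reads rather than the raw cumulative values, so the ratio reflects current behavior.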

/sys/devices/system/node/nodeN/meminfo — per-node memory

Per-node memory breakdown lives in sysfs, not procfs. Each NUMA node has a meminfo file:

cat /sys/devices/system/node/node0/meminfo

The output looks like:

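Node 0 AnonPages:       12345678 kB
Node 0 FilePages:        2345678 kB
Node 0 Shmem:              12345 kB
Node 0 HugePages_Total:        2

This is an excerpt; the real file has many more lines. The ones we use:

AnonPages: anonymous memory (heap, stack, not backed by a file)
FilePages: page cache (file-backed memory)
Shmem: shared memory (including tmpfs)
HugePages_Total: number of hugepages in this node’s pool (a page count, not a size)

The numbers are in KiB, except for the HugePages_* lines, which are page counts. To get bytes from KiB: multiply by 1024.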

Reading vmstat

#![allow(unused)]
fn main() {
use std::fs;

#[derive(Default)]
pub struct NumaStats {
    pub numa_hit: u64,
    pub numa_miss: u64,
    pub numa_foreign: u64,
    pub pgmigrate_success: u64,
    pub pgmigrate_fail: u64,
}

fn read_vmstat() -> std::io::Result<NumaStats> {
    let content = fs::read_to_string("/proc/vmstat")?;

    let mut numa_hit = 0u64;
    let mut numa_miss = 0u64;
    let mut numa_foreign = 0u64;
    let mut pgmigrate_success = 0u64;
    let mut pgmigrate_fail = 0u64;

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "numa_hit" => numa_hit = value,
            "numa_miss" => numa_miss = value,
            "numa_foreign" => numa_foreign = value,
            "pgmigrate_success" => pgmigrate_success = value,
            "pgmigrate_fail" => pgmigrate_fail = value,
            _ => {}
        }
    }

    Ok(NumaStats {
        numa_hit,
        numa_miss,
        numa_foreign,
        pgmigrate_success,
        pgmigrate_fail,
    })
}
}

Rate computation

/proc/vmstat returns cumulative counters since boot. To get per-second rates, poll the file and compute the delta:

struct NumaRate {
    pub numa_remote_rate: f64,      // numa_miss allocations per second
    pub migration_rate: f64,        // pages migrated per second
    pub migration_fail_rate: f64,   // failed migrations per second
}

fn compute_rates(prev: &NumaStats, curr: &NumaStats, elapsed_secs: f64) -> NumaRate {
    if elapsed_secs <= 0.0 {
        return NumaRate {
            numa_remote_rate: 0.0,
            migration_rate: 0.0,
            migration_fail_rate: 0.0,
        };
    }
    // The counters only grow, but saturating_sub avoids an underflow panic
    // if two samples are ever passed in the wrong order.
    let remote = curr.numa_miss.saturating_sub(prev.numa_miss);
    let migrations = curr.pgmigrate_success.saturating_sub(prev.pgmigrate_success);
    let failed = curr.pgmigrate_fail.saturating_sub(prev.pgmigrate_fail);

    NumaRate {
        numa_remote_rate: remote as f64 / elapsed_secs,
        migration_rate: migrations as f64 / elapsed_secs,
        migration_fail_rate: failed as f64 / elapsed_secs,
    }
}

Per-node memory from sysfs

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub struct NodeMem {
    pub node: u32,
    pub anon_pages: u64,
    pub file_pages: u64,
    pub huge_pages: u64,
    pub shmem: u64,
}

fn parse_node_meminfo(content: &str) -> Vec<NodeMem> {
    // The kernel meminfo format is:
    //   Node 0 AnonPages:     12345678 kB
    //   Node 0 FilePages:      2345678 kB
    //   Node 0 HugePages_Total:      2      (a count, no unit)
    //   ...
    // We accumulate fields per node into a HashMap.
    use std::collections::HashMap;
    let mut node_data: HashMap<u32, NodeMem> = HashMap::new();

    for line in content.lines() {
        // Lines look like: "Node 0 AnonPages:     12345678 kB"
        let parts: Vec<&str> = line.split_whitespace().collect();
        if parts.len() < 4 {
            continue;
        }

        // parts[0] = "Node", parts[1] = node_id, parts[2] = "AnonPages:", parts[3] = value
        if parts[0] != "Node" {
            continue;
        }

        let node_id: u32 = parts[1].parse().unwrap_or(0);
        let field_name = parts[2].trim_end_matches(':');
        let value: u64 = parts[3].parse().unwrap_or(0);
        // kB values are converted to pages (4 KiB each) for consistency
        let value_pages = value / 4;

        let entry = node_data.entry(node_id).or_insert_with(|| NodeMem {
            node: node_id,
            anon_pages: 0,
            file_pages: 0,
            huge_pages: 0,
            shmem: 0,
        });

        match field_name {
            "AnonPages" => entry.anon_pages = value_pages,
            "FilePages" => entry.file_pages = value_pages,
            "Shmem" => entry.shmem = value_pages,
            // HugePages_Total is already a count of hugepages, not kB
            "HugePages_Total" => entry.huge_pages = value,
            _ => {}
        }
    }

    let mut nodes: Vec<NodeMem> = node_data.into_values().collect();
    nodes.sort_by_key(|n| n.node);
    nodes
}

/// Read per-node memory info from sysfs (all nodes)
fn read_all_node_meminfo() -> std::io::Result<Vec<NodeMem>> {
    let node_dir = std::path::Path::new("/sys/devices/system/node");
    let mut all_nodes = Vec::new();

    for entry in std::fs::read_dir(node_dir)?.flatten() {
        let name = entry.file_name();
        let name_str = name.to_string_lossy();
        if !name_str.starts_with("node") {
            continue;
        }

        let meminfo_path = entry.path().join("meminfo");
        if !meminfo_path.exists() {
            continue;
        }

        let content = std::fs::read_to_string(&meminfo_path)?;
        let mut parsed = parse_node_meminfo(&content);
        all_nodes.append(&mut parsed);
    }

    all_nodes.sort_by_key(|n| n.node);
    Ok(all_nodes)
}
}

The parse_node_meminfo function converts KiB to pages internally (dividing by 4, since 1 page = 4 KiB). To get bytes from pages:

#![allow(unused)]
fn main() {
fn page_bytes(pages: u64) -> u64 {
    // Hard-coding 4 KiB pages is a simplification. ARM64 can use 16 KiB or 64 KiB
    // pages. For a portable version, use libc::sysconf(libc::_SC_PAGESIZE) to get
    // the actual page size at runtime.
    pages * 4096
}
}

/sys/devices/system/node/ — cross-node stats

Linux exposes per-node information through sysfs. The directory structure:

/sys/devices/system/node/
├── node0/
│   ├── cpulist          # CPUs on this node
│   ├── distance          # NUMA distances to other nodes
│   ├── meminfo           # memory stats for this node
│   └── numastat          # NUMA hit/miss for this node
├── node1/
│   └── ...

numastat is particularly useful:

numa_hit  12345678
numa_miss  234567
numa_foreign 12345
interleave_hit  1234
other_node  5678

This is per-node data, which is more useful than the system-wide vmstat when you’re debugging a specific NUMA imbalance.

#![allow(unused)]
fn main() {
use std::fs;

fn read_node_numastat(node: u32) -> std::io::Result<NumaStats> {
    let path = format!("/sys/devices/system/node/node{}/numastat", node);
    let content = fs::read_to_string(&path)?;

    let mut stats = NumaStats::default();

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "numa_hit" => stats.numa_hit = value,
            "numa_miss" => stats.numa_miss = value,
            "numa_foreign" => stats.numa_foreign = value,
            _ => {}
        }
    }

    Ok(stats)
}
}
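The counters are easier to compare across nodes as a ratio. The helper below is a hypothetical addition (numa_miss_ratio is not part of the tutorial's code) that works on the raw numastat text, so it can be tried without a NUMA machine. It assumes the two-column format shown above. The counters are cumulative since boot, so for a rate you would apply the same arithmetic to the difference between two readings.

```rust
/// Share of the pages placed on this node that were intended for another
/// node: numa_miss / (numa_hit + numa_miss).
/// Returns None if either counter is missing or both are zero.
fn numa_miss_ratio(numastat: &str) -> Option<f64> {
    let mut hit: Option<u64> = None;
    let mut miss: Option<u64> = None;

    for line in numastat.lines() {
        let mut parts = line.split_whitespace();
        let name = parts.next();
        let value = parts.next().and_then(|v| v.parse::<u64>().ok());
        match (name, value) {
            (Some("numa_hit"), Some(v)) => hit = Some(v),
            (Some("numa_miss"), Some(v)) => miss = Some(v),
            _ => {}
        }
    }

    let (hit, miss) = (hit?, miss?);
    let total = hit.checked_add(miss)?;
    if total == 0 {
        return None;
    }
    Some(miss as f64 / total as f64)
}
```

For the sample output above this gives roughly 0.019: about 1.9% of the pages placed on the node were intended for a different node.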

Hugepage utilization

Hugepages (typically 2 MiB or 1 GiB on x86-64) reduce TLB pressure because one TLB entry covers a much larger region. A regular 4 KiB page needs one TLB entry per 4 KiB of virtual address space. A 2 MiB hugepage needs one entry per 2 MiB, which is 512x more coverage per entry. For workloads that scan large arrays or chase pointers across a large heap, this can cut TLB miss rates substantially; how much depends on the access pattern and on the TLB sizes of the CPU.
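The 512x figure is easiest to see as TLB reach, the amount of memory a full TLB can map. A minimal sketch, assuming a hypothetical 1,536-entry TLB (real CPUs have several TLB levels, often with different entry counts per page size, so treat the entry count as a placeholder):

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Memory a TLB can map when all `entries` hold pages of `page_bytes` bytes.
fn tlb_reach_bytes(entries: u64, page_bytes: u64) -> u64 {
    entries * page_bytes
}

/// Reach with 4 KiB pages and with 2 MiB pages for the same entry count.
fn reach_4k_and_2m(entries: u64) -> (u64, u64) {
    (
        tlb_reach_bytes(entries, 4 * KIB),
        tlb_reach_bytes(entries, 2 * MIB),
    )
}
```

With 1,536 entries the reach is 6 MiB with 4 KiB pages and 3 GiB with 2 MiB pages. A working set between those two sizes is where hugepages tend to matter most.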

The data comes from /proc/meminfo:

HugePages_Total:    1024
HugePages_Free:     512
HugePages_Rsvd:     128
HugePages_Surp:     0
AnonHugePages:      2048 kB

The useful metrics:

Hugepage pool utilization: (HugePages_Total - HugePages_Free) / HugePages_Total. If this is near 100%, the pool is exhausted and new hugepage allocations will fail. If it’s near 0%, the pool is overprovisioned and wasting memory.
Transparent hugepage usage: AnonHugePages (in KiB) tells you how much transparent hugepage memory is in use. Compare it to the total anonymous memory (AnonPages in /proc/meminfo) to get a ratio: AnonHugePages / AnonPages. If this ratio is low for a memory-intensive workload, the kernel isn’t coalescing regular pages into hugepages effectively.

Reading it in Rust:

#![allow(unused)]
fn main() {
use std::fs;

#[derive(Default)]
pub struct HugepageStats {
    pub total: u64,       // HugePages_Total
    pub free: u64,        // HugePages_Free
    pub reserved: u64,    // HugePages_Rsvd
    pub anon_hugepages: u64,  // AnonHugePages (in KiB)
    pub anon_pages: u64,      // AnonPages (in KiB)
}

fn read_hugepage_stats() -> std::io::Result<HugepageStats> {
    let content = fs::read_to_string("/proc/meminfo")?;
    let mut stats = HugepageStats::default();

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "HugePages_Total:" => stats.total = value,
            "HugePages_Free:" => stats.free = value,
            "HugePages_Rsvd:" => stats.reserved = value,
            "AnonHugePages:" => stats.anon_hugepages = value,
            "AnonPages:" => stats.anon_pages = value,
            _ => {}
        }
    }

    Ok(stats)
}

fn hugepage_pool_utilization(stats: &HugepageStats) -> f64 {
    if stats.total == 0 {
        return 0.0;
    }
    (stats.total - stats.free) as f64 / stats.total as f64
}

fn transparent_hugepage_ratio(stats: &HugepageStats) -> f64 {
    if stats.anon_pages == 0 {
        return 0.0;
    }
    stats.anon_hugepages as f64 / stats.anon_pages as f64
}
}

A pool utilization above 90% means you should grow the pool: raise vm.nr_hugepages at runtime (via sysctl or /proc/sys/vm/nr_hugepages) or set hugepages=N on the kernel command line. A transparent hugepage ratio below 10% for a memory-intensive workload suggests that THP is disabled or limited to madvise regions, or that the kernel’s khugepaged daemon isn’t coalescing pages fast enough. Check /sys/kernel/mm/transparent_hugepage/enabled for the current policy (always, madvise, or never).
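One refinement worth considering: HugePages_Free still includes pages counted in HugePages_Rsvd, which are already promised to a mapping but not yet faulted in. The sketch below is a hypothetical variant of the utilization calculation (committed_utilization is not part of the tutorial's code) that treats reserved pages as used.

```rust
/// Pool utilization that counts reserved-but-not-yet-faulted pages as used.
/// Pages available for new reservations are `free - reserved`.
fn committed_utilization(total: u64, free: u64, reserved: u64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    // Clamp so inconsistent readings never underflow.
    let available = free.saturating_sub(reserved).min(total);
    (total - available) as f64 / total as f64
}
```

For the sample /proc/meminfo values above (1024 total, 512 free, 128 reserved) this reports 62.5% rather than 50%, because only 384 pages remain for new reservations.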

Next: Part 8 — Uncore IMC Bandwidth — Measure memory bandwidth through the Integrated Memory Controller.

Part 8 — Uncore IMC Bandwidth

The CPU cores get all the attention. The memory controller sits quietly in the corner of the same chip, and nobody thinks about it until memory bandwidth is saturated and everything stalls.

Modern server CPUs put the Integrated Memory Controller (IMC) on the same package as the cores, but outside the cores themselves. Intel calls this the uncore — the silicon on the chip that isn’t a CPU core. It includes the memory controller, the last-level cache, and the mesh interconnect that ties everything together. The uncore has its own performance counters, and they measure memory bandwidth: how many bytes per second are flowing through the memory controller to DRAM.

If memory bandwidth is saturated, adding more CPU cores won’t help. The cores will stall waiting for memory. Monitoring IMC bandwidth tells you whether you’re approaching that ceiling.

What “uncore” means

On Intel, the chip is divided into two parts:

Core: the CPU cores — instruction execution, cache, registers
Uncore: everything else — the memory controller, the last-level cache (L3), the mesh interconnect that links the cores to the uncore

The uncore is shared across all cores on the socket. Its counters are accessible via perf_event_open but they live in a separate PMU (Performance Monitoring Unit) from the core PMU. You open them differently.

The IMC counters on Intel

Intel’s uncore IMC events are in the uncore_imc PMUs. The kernel assigns each PMU a numeric type when its driver registers, so the value differs between machines and kernel versions:

#![allow(unused)]
fn main() {
// For Intel Skylake and later server CPUs.
// The type value is assigned dynamically by the kernel, so it varies between
// machines and kernel versions.
// Don't hardcode it; read it from sysfs (see find_uncore_imc_type below).
const UNCORE_IMC_TYPE_EXAMPLE: u32 = 15; // example only; always read from sysfs
}

The reliable way to find the type is through sysfs:

ls /sys/bus/event_source/devices/

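Look for something like uncore_imc_0 or uncore_cha_0 (the Caching and Home Agent, which is also useful). On Intel server parts the kernel typically registers one uncore_imc_<N> PMU per memory channel rather than one per socket, and each PMU has its own type value.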

cat /sys/bus/event_source/devices/uncore_imc_0/type

This prints the integer type value you need for perf_event_open.

The key IMC events (from the Intel perfmon repo, iMC unit):

CAS count (all): event 0x04, umask 0x0F — all CAS (Column Address Strobe) operations, the memory commands that transfer data. This is the most useful bandwidth proxy.
CAS read: event 0x04, umask 0x03 — DRAM read CAS commands (includes underfill reads)
CAS write: event 0x04, umask 0x0C — DRAM write CAS commands
DRAM activate count: event 0x01, umask 0x02 for DRAM ACT (activate) commands issued for writes; umask 0x01 for reads. This counts row activations, not cycles, so a high ACT-to-CAS ratio points to poor row-buffer locality.

When you open a raw uncore event with perf_event_open, the config field encodes both the event and umask: config = (umask << 8) | event. So CAS count all (event 0x04, umask 0x0F) becomes config = 0x0F04.
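The encoding can be written down as a small helper. This is a minimal sketch, assuming the event selector sits in config bits 0-7 and the umask in bits 8-15 as described above; uncore_config and the constant names are illustrative and not part of any library. The PMU's format directory in sysfs (for example /sys/bus/event_source/devices/uncore_imc_0/format/) states the bit ranges your kernel actually uses.

```rust
/// Pack an uncore event selector and unit mask into perf_event_attr.config.
const fn uncore_config(event: u8, umask: u8) -> u64 {
    ((umask as u64) << 8) | event as u64
}

// Encodings for the CAS events listed above.
const CAS_COUNT_ALL: u64 = uncore_config(0x04, 0x0F);
const CAS_COUNT_RD: u64 = uncore_config(0x04, 0x03);
const CAS_COUNT_WR: u64 = uncore_config(0x04, 0x0C);
```

Opening separate read and write counters with CAS_COUNT_RD and CAS_COUNT_WR gives a read/write bandwidth split using the same 64-bytes-per-CAS arithmetic.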

The CAS (Column Address Strobe) counter is the standard way to compute memory bandwidth. Each CAS operation transfers 64 bytes (one cache line). So:

bandwidth_bytes_per_sec = CAS_count * 64 / elapsed_seconds
bandwidth_gb_per_sec = CAS_count * 64 / 1_000_000_000 / elapsed_seconds

Opening uncore IMC counters

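Uncore counters have restrictions. They can only be opened:

With pid = -1 and an explicit cpu (uncore events are not tied to a task, and pid = -1 combined with cpu = -1 is rejected)
By root or a process with CAP_PERFMON (CAP_SYS_ADMIN on kernels before 5.8), unless perf_event_paranoid is relaxed enough to allow system-wide counting
On a CPU that belongs to the socket whose uncore you want to read; the PMU's cpumask file in sysfs lists the CPU the kernel designates for each socket
In counting mode only, since uncore PMUs do not support sampling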

On a multi-socket system, you open one fd per socket:

#![allow(unused)]
fn main() {
use std::fs;

fn find_uncore_imc_type() -> std::io::Result<u32> {
    // Find the PMU type value of the first uncore_imc* device in sysfs.
    // On Intel server parts the kernel typically exposes one PMU per memory
    // channel (uncore_imc_0, uncore_imc_1, ...), each with its own type
    // value, so this covers a single channel. For total bandwidth, repeat
    // the lookup for every uncore_imc_* entry and sum the counts.
    let entries = std::fs::read_dir("/sys/bus/event_source/devices/")?;
    for entry in entries.flatten() {
        let name = entry.file_name().into_string().unwrap_or_default();
        if name.starts_with("uncore_imc") {
            let type_path = entry.path().join("type");
            let type_str = fs::read_to_string(&type_path)?.trim().to_owned();
            return type_str.parse::<u32>()
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e));
        }
    }
    Err(std::io::Error::new(
        std::io::ErrorKind::NotFound,
        "uncore_imc PMU not found",
    ))
}

fn open_imc_counter(
    pmu_type: u32,
    config: u64,
    cpu: libc::c_int,
) -> std::io::Result<libc::c_int> {
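    // Hand-rolled perf_event_attr (libc::perf_event_attr isn't available
    // on all platforms). The field order must match the kernel struct
    // exactly, and the struct must cover at least the first 64 bytes
    // (PERF_ATTR_SIZE_VER0): the kernel rejects a smaller `size` with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64,
        sample_type: u64,
        read_format: u64,
        flags: u64, // bit 0: disabled, bit 2: pinned
        wakeup_events: u32,
        bp_type: u32,
        config1: u64, // brings the struct to 64 bytes
    }

    let attr = PerfEventAttr {
        type_: pmu_type,
        size: std::mem::size_of::<PerfEventAttr>() as u32, // 64
        config,
        sample_period: 0,
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };

    let fd = unsafe {
        // pid=-1: uncore PMUs are system-wide, they don't monitor a specific
        // process. The kernel requires pid=-1 for uncore events.
        // Explicit casts keep the variadic arguments at the widths the
        // syscall expects.
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const PerfEventAttr,
            -1 as libc::pid_t,  // pid
            cpu,                // cpu
            -1 as libc::c_int,  // group_fd
            0 as libc::c_ulong, // flags
        )
    };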

    if fd < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(fd as libc::c_int)
    }
}
}
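The cpu argument is easiest to obtain from the PMU's own cpumask file (for example /sys/bus/event_source/devices/uncore_imc_0/cpumask), which uses the kernel's CPU list syntax such as "0,18" or "0-3,8". The parser below is a hypothetical helper, not part of the tutorial's code, and it silently skips malformed entries rather than reporting them.

```rust
/// Parse a kernel CPU list such as "0,18" or "0-3,8" into CPU ids.
/// Malformed entries are skipped.
fn parse_cpu_list(list: &str) -> Vec<u32> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            // A range like "0-3" expands to every CPU in it, inclusive.
            Some((lo, hi)) => {
                if let (Ok(lo), Ok(hi)) = (lo.parse::<u32>(), hi.parse::<u32>()) {
                    cpus.extend(lo..=hi);
                }
            }
            // A single CPU id.
            None => {
                if let Ok(cpu) = part.parse::<u32>() {
                    cpus.push(cpu);
                }
            }
        }
    }
    cpus
}
```

Feeding the contents of the cpumask file to parse_cpu_list yields one CPU per socket, which is the list open_imc_counter needs.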

Computing memory bandwidth

#![allow(unused)]
fn main() {
fn compute_bandwidth_gbps(cas_count_delta: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    // 64 bytes per CAS operation (one cache line)
    (cas_count_delta as f64 * 64.0) / elapsed_secs / 1e9
}
}

Each CAS operation transfers one 64-byte cache line, so the math is straightforward.

The enable_counter and read_counter helper functions are the same ones from Part 3 — they call ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) and read(fd, &mut value) respectively.

The full polling loop

#![allow(unused)]
fn main() {
use std::time::Instant;

struct ImcBandwidth {
    pub bandwidth_gbps: f64,
    pub cas_count: u64,
    pub elapsed_secs: f64,
}

fn open_socket_imc_counters(
    pmu_type: u32,
    socket_cpus: &[i32],  // CPUs local to each socket
) -> std::io::Result<Vec<(i32, i32)>> {
    // Open one CAS counter fd per socket. Keep these fds open for the
    // lifetime of the monitor — don't open/close on every poll.
    //
    // socket_cpus: one CPU id per socket (any CPU local to that socket's
    // uncore works). On a single-socket system, this is [0]. On multi-socket
    // systems, the PMU's cpumask file lists one designated CPU per socket:
    //   cat /sys/bus/event_source/devices/uncore_imc_0/cpumask
    // The node*/cpulist files under /sys/devices/system/node/ also work when
    // each socket is exactly one NUMA node.
    let mut fds = Vec::new();

    for (socket, &cpu) in socket_cpus.iter().enumerate() {
        let fd = open_imc_counter(pmu_type, 0x0F04, cpu)?; // CAS count all (event 0x04, umask 0x0F)
        enable_counter(fd)?;
        fds.push((socket as i32, fd));
    }

    Ok(fds)
}

fn read_imc_counts(socket_fds: &[(i32, i32)]) -> std::io::Result<Vec<(i32, u64)>> {
    let mut results = Vec::new();

    for &(socket, fd) in socket_fds {
        let cas = read_counter(fd)?;
        results.push((socket, cas));
    }

    Ok(results)
}

fn poll_imc(
    socket_fds: &[(i32, i32)],
    prev_counts: &[(i32, u64)],
    interval_secs: f64,
) -> std::io::Result<Vec<ImcBandwidth>> {
    let current_counts = read_imc_counts(socket_fds)?;

    let mut result = Vec::new();
    for ((socket, prev_cas), (_, curr_cas)) in prev_counts.iter().zip(&current_counts) {
        let delta = curr_cas.saturating_sub(*prev_cas);
        result.push(ImcBandwidth {
            bandwidth_gbps: compute_bandwidth_gbps(delta, interval_secs),
            cas_count: delta,
            elapsed_secs: interval_secs,
        });
    }

    Ok(result)
}
}
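On Intel server parts the kernel typically exposes one uncore_imc PMU per memory channel, so a socket's total bandwidth is the sum over its channels. A minimal sketch of that aggregation, assuming you have already collected one CAS reading per channel at the start and end of the interval (socket_bandwidth_gbps is a hypothetical helper):

```rust
/// Sum CAS deltas across all memory channels of one socket and convert
/// to GB/s, at 64 bytes per CAS. `prev` and `curr` hold one reading per
/// channel, in the same order.
fn socket_bandwidth_gbps(prev: &[u64], curr: &[u64], elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    let cas_total: u64 = prev
        .iter()
        .zip(curr)
        .map(|(p, c)| c.saturating_sub(*p))
        .sum();
    cas_total as f64 * 64.0 / elapsed_secs / 1e9
}
```

Comparing the per-channel deltas before summing is also useful: a large imbalance between channels can indicate unevenly populated DIMM slots.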

Virtualization caveat

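You can’t read uncore counters inside a virtual machine (in most cases). The guest doesn’t have direct access to the IMC hardware, so the uncore PMUs are usually not registered at all: nothing named uncore_imc* appears under /sys/bus/event_source/devices/, and perf_event_open with a hard-coded type fails (typically with ENOENT).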

If you’re building a monitoring agent that runs inside VMs, skip uncore IMC reading — it’s not available. For bare-metal hosts, it’s one of the most useful performance signals.

AMD alternative: Data Fabric counters

AMD’s equivalent of the Intel IMC counters are the Data Fabric (DF) performance counters. The Data Fabric is the interconnect that links AMD CPU cores, memory controllers, and I/O — analogous to Intel’s mesh + uncore. DF counters track memory bandwidth through the same perf_event_open interface, but the PMU names and event encodings are different.

Look for uncore_data_fabric or amd_df PMUs in sysfs:

ls /sys/bus/event_source/devices/
# Look for uncore_data_fabric or amd_df on AMD systems

AMD also has Instruction-Based Sampling (IBS) — a different approach that periodically samples the instruction stream and reports details about memory operations. IBS is more useful for profiling (“where are the loads that miss in L3?”) than for bandwidth monitoring. If you need per-instruction latency data, IBS is the tool. If you need aggregate bandwidth, use the DF counters.

IBS events are in the ibs_op and ibs_fetch PMUs on AMD. These are not available on Intel.

Next: Part 9 — Thermal Monitoring — Read thermal zones from sysfs and compute headroom before throttling.

Part 9 — Thermal Monitoring

A CPU running hot doesn’t jump straight to throttling — there are graduated stages, and each one leaves a signal you can read.

Thermal throttling is the last resort. Before the CPU hits its critical temperature and starts skipping cycles, the kernel has already entered passive cooling — reducing the clock speed to lower heat output. If you’re watching CPU utilization stay flat while your benchmark scores drop, passive cooling is probably the cause. The good news: you can see it happening in real time.

Linux exposes thermal zone readings through sysfs. Each thermal zone corresponds to a physical temperature sensor somewhere in the system.

What thermal zones are

The kernel abstracts temperature sensors as thermal zones. A thermal zone has a type (what it measures), a temp (current temperature in milli-degrees Celsius), and a set of trip_point_* thresholds.

Common zone types on x86:

x86_pkg_temp: the CPU package (whole-chip temperature)
acpitz: ACPI thermal zone (usually near the CPU)
coretemp: per-core temperature (Intel); this driver usually reports through hwmon (/sys/class/hwmon) rather than as a thermal zone
nvme: NVMe drive temperature
tztsx: various other sensors

On ARM servers:

soc-thermal — the SoC temperature
cpu-thermal — the CPU cluster temperature

The zone naming and quantity vary by hardware. The kernel creates whatever zones the hardware exposes.

Reading thermal zones from sysfs

The sysfs layout:

/sys/class/thermal/
├── thermal_zone0/
│   ├── type            # "x86_pkg_temp", "acpitz", etc.
│   ├── temp            # current temperature in millidegrees Celsius (e.g., 72000 = 72.0°C)
│   ├── trip_point_0_temp  # threshold in m°C
│   ├── trip_point_0_type # "active", "passive", "hot", "critical"
│   └── ...
├── thermal_zone1/
│   └── ...

Each zone has multiple trip points — temperature thresholds that trigger different cooling responses. The types:

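active: an active cooling device such as a fan is switched on or sped up (no performance impact)
passive: the kernel reduces heat output by lowering the clock speed (performance drops)
hot: the zone is close to critical; the kernel notifies the driver and userspace
critical: emergency shutdown if reached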

Parsing a thermal zone

#![allow(unused)]
fn main() {
use std::fs;

#[derive(Debug, Clone)]
pub struct ThermalZone {
    pub name: String,          // zone name in sysfs (e.g., "thermal_zone0")
    pub zone_type: String,    // what this zone measures (e.g., "x86_pkg_temp")
    pub temp_millicelsius: i64,
    pub trip_points: Vec<TripPoint>,
}

#[derive(Debug, Clone)]
pub struct TripPoint {
    pub temp_millicelsius: i64,
    pub trip_type: String,    // "passive", "active", "hot", "critical"
}

fn read_thermal_zone(zone_path: &std::path::Path) -> std::io::Result<ThermalZone> {
    let name = zone_path
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("unknown")
        .to_string();

    let zone_type = fs::read_to_string(zone_path.join("type"))?.trim().to_string();
    let temp_str = fs::read_to_string(zone_path.join("temp"))?.trim().to_string();
    let temp_millicelsius: i64 = temp_str.parse().unwrap_or(0);

    let mut trip_points = Vec::new();

    // Trip points are numbered starting at 0
    let mut idx = 0;
    loop {
        let trip_type_path = zone_path.join(format!("trip_point_{}_type", idx));
        let trip_temp_path = zone_path.join(format!("trip_point_{}_temp", idx));

        if !trip_type_path.exists() {
            break;
        }

        let trip_type = fs::read_to_string(&trip_type_path)?.trim().to_string();
        let trip_temp_str = fs::read_to_string(&trip_temp_path)?.trim().to_string();
        let trip_temp: i64 = trip_temp_str.parse().unwrap_or(0);

        trip_points.push(TripPoint {
            temp_millicelsius: trip_temp,
            trip_type,
        });

        idx += 1;
    }

    Ok(ThermalZone {
        name,
        zone_type,
        temp_millicelsius,
        trip_points,
    })
}

fn read_all_thermal_zones() -> std::io::Result<Vec<ThermalZone>> {
    let thermal_path = std::path::Path::new("/sys/class/thermal");
    let entries = fs::read_dir(thermal_path)?;
    let mut zones = Vec::new();

    for entry in entries.flatten() {
        let path = entry.path();
        if path.file_name().and_then(|n| n.to_str())
            .map(|n| n.starts_with("thermal_zone"))
            .unwrap_or(false)
        {
            if let Ok(zone) = read_thermal_zone(&path) {
                zones.push(zone);
            }
        }
    }

    Ok(zones)
}
}

Computing thermal headroom

Thermal headroom is the gap between the current temperature and the critical threshold:

#![allow(unused)]
fn main() {
fn thermal_headroom(zone: &ThermalZone) -> Option<i64> {
    let critical = zone.trip_points.iter()
        .filter(|tp| tp.trip_type == "critical")
        .map(|tp| tp.temp_millicelsius)
        .min()?;

    // headroom = critical - current (both in m°C)
    Some(critical - zone.temp_millicelsius)
}
}

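Headroom is your safety margin. If a workload pushes the CPU toward critical temperature, the headroom shrinks. Throttling starts well before headroom reaches zero, at the passive trip point; by the time headroom to critical hits zero, the kernel initiates an emergency shutdown.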
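Because clock-speed reduction is tied to the passive trip point rather than the critical one, the gap to the passive threshold is often the more actionable number. The sketch below generalizes the headroom calculation to any trip type. It redefines a minimal TripPoint so it stands alone, and headroom_to is a hypothetical helper.

```rust
#[derive(Debug, Clone)]
struct TripPoint {
    temp_millicelsius: i64,
    trip_type: String, // "passive", "active", "hot", "critical"
}

/// Gap in m°C between the current temperature and the lowest trip point of
/// the given type. Returns None if the zone has no trip point of that type.
/// A negative result means the trip point has already been crossed.
fn headroom_to(trips: &[TripPoint], kind: &str, current_millicelsius: i64) -> Option<i64> {
    trips
        .iter()
        .filter(|tp| tp.trip_type == kind)
        .map(|tp| tp.temp_millicelsius)
        .min()
        .map(|limit| limit - current_millicelsius)
}
```

Calling headroom_to(&zone.trip_points, "passive", zone.temp_millicelsius) reports how far the zone is from the point where clock speed starts to drop.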

Thermal headroom in degrees Celsius:

#![allow(unused)]
fn main() {
fn headroom_celsius(zone: &ThermalZone) -> Option<f64> {
    thermal_headroom(zone).map(|hm| hm as f64 / 1000.0)
}
}

Identifying the package sensor vs. per-core sensors

The most useful zone for CPU performance is the package-level zone. It’s typically x86_pkg_temp (Intel) or the ACPI zone near the CPU. Per-core readings (from the coretemp driver, usually under /sys/class/hwmon) are more granular, but package-level is what you watch for overall thermal throttle risk.

#![allow(unused)]
fn main() {
fn find_package_zone(zones: &[ThermalZone]) -> Option<&ThermalZone> {
    // x86_pkg_temp is the canonical package-level sensor on Intel
    zones.iter()
        .find(|z| z.zone_type == "x86_pkg_temp")
        .or_else(|| zones.iter().find(|z| z.zone_type.contains("pkg")))
        .or_else(|| zones.iter().find(|z| z.zone_type == "acpitz"))
}
}

Polling interval

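Package-level thermal trends are slow. On-die sensors can move by several degrees almost immediately when load changes, but a sustained climb from 60°C to 90°C takes seconds to minutes because the thermal mass of the package and heatsink is large. A 1-second polling interval is more than enough for trend monitoring, and even 5 seconds is fine.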

The important thing is to watch for the trend, not individual readings. If the package temperature is creeping up over a 30-second window, something is building heat.

#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};

async fn poll_thermal(interval: Duration) -> anyhow::Result<()> {
    loop {
        let zones = read_all_thermal_zones()?;
        let package = find_package_zone(&zones);

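        if let Some(pkg) = package {
            let current = pkg.temp_millicelsius as f64 / 1000.0;
            // Zones without a critical trip point have no defined headroom,
            // so print "n/a" instead of a sentinel number.
            let headroom = headroom_celsius(pkg)
                .map(|h| format!("{:.1}°C", h))
                .unwrap_or_else(|| "n/a".to_string());

            println!(
                "package_temp={:.1}°C  headroom={}  type={}",
                current, headroom, pkg.zone_type,
            );
        }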

        tokio::time::sleep(interval).await;
    }
}
}
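To act on the trend rather than on individual readings, one option is a least-squares slope over the recent window. This is a sketch under the assumption that you keep (elapsed seconds, temperature in °C) pairs from the polling loop; temp_slope is a hypothetical helper and the 30-second window is only an example.

```rust
/// Least-squares slope of temperature over time, in °C per second.
/// `samples` holds (seconds since the window started, temperature in °C).
/// Returns None with fewer than two samples or when all timestamps match.
fn temp_slope(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_t = samples.iter().map(|s| s.0).sum::<f64>() / n;
    let mean_c = samples.iter().map(|s| s.1).sum::<f64>() / n;

    // slope = covariance(time, temp) / variance(time)
    let cov: f64 = samples.iter().map(|s| (s.0 - mean_t) * (s.1 - mean_c)).sum();
    let var: f64 = samples.iter().map(|s| (s.0 - mean_t).powi(2)).sum();

    if var == 0.0 {
        None
    } else {
        Some(cov / var)
    }
}
```

A slope of 0.1 °C/s sustained over a 30-second window means the package gained about 3°C in that time; dividing the remaining headroom by the slope gives a rough time-to-threshold estimate.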

Cross-architecture differences

On AMD EPYC, the thermal zones may be named differently. Use ls /sys/class/thermal/ on the target system to see what’s available. The ACPI thermal zones (acpitz) are the most portable fallback on x86, but they exist only where the firmware defines them, so treat a missing zone as a normal case.

On ARM servers, the sensor landscape is more fragmented. soc-thermal and cpu-thermal are common names. Some ARM platforms expose only one thermal zone for the whole SoC.

Thresholds at a glance

For quick reference, the thermal throttle scale (Intel desktop/server):

Temperature	What it means
< 70°C	Normal operation, no throttling
70-85°C	Active cooling engaged, performance nominal
85-95°C	Passive cooling: clock speed reduced
95-100°C	Hot: aggressive throttling
> 100°C	Critical: emergency shutdown

These thresholds are approximate and vary by SKU. The trip points from sysfs are the authoritative source for your specific hardware.

Next: Part 10 — Block I/O Tracing — Trace block I/O requests and compute IOPS, throughput, and access pattern entropy.

Part 10 — Block I/O Tracing

I/O patterns tell you whether your storage is being used well.

High IOPS with low throughput means many small operations — small random reads, metadata-heavy workloads, or many tiny writes. Low IOPS with high throughput means a few large sequential operations — streaming reads, copy operations. Both are normal, but both can become bottlenecks.

The block layer tracepoints let us observe every I/O request as it enters and completes.

The block tracepoints

Linux has several block I/O tracepoints. The two we care about:

Tracepoint	When it fires
`block:block_bio_queue`	A request is submitted to the block layer
`block:block_bio_complete`	A request completes

The arguments for block:block_bio_queue, at the offsets the tracepoint's format file reports (verify on your kernel):

Offset   Type      Field
------   ----      -----
0        (8 bytes) common tracepoint header
8        u32       dev             // dev_t: major<<20 | minor
12       (4 bytes padding)
16       u64       sector          // starting sector number
24       u32       nr_sector       // number of sectors
28       char[N]   rwbs            // R/W/S flag string ("R", "W", etc.), N = RWBS_LEN
28+N     char[16]  comm            // process name (TASK_COMM_LEN)

The rwbs field is a short character array (RWBS_LEN bytes; 8 on many kernels, so check the format file) that encodes the I/O direction: "R" for read, "W" for write, "S" for sync, "F" for flush, "D" for discard, "N" for none. It follows nr_sector directly, because a char array needs no alignment padding. The comm field is the process name that submitted the I/O.

Important: block_bio_queue is a block_bio-class tracepoint. It operates on struct bio, not struct request. The block_rq-class tracepoints (block_rq_insert, block_rq_issue) have a bytes field and op_flags field that block_bio lacks. If you need the transfer size in bytes, compute it from nr_sector * 512 (the block layer counts in 512-byte sectors regardless of the device's physical sector size). This is what the eBPF code below does.

Verifying offsets. Aya's TracePointContext::read_at adds its offset to the raw context pointer, and the context begins with the 8-byte common header, so the offsets to pass to read_at are the ones printed in the format file. Read /sys/kernel/tracing/events/block/block_bio_queue/format on the target kernel to confirm them. See the verification note in Part 2 for the full procedure.
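If you copy the rwbs bytes out to userspace, they can be reduced to a direction with a few byte checks. This is a sketch assuming the flag letters listed above; IoKind and classify_rwbs are hypothetical names. Write is checked before Flush because a write carrying flush semantics includes both letters.

```rust
#[derive(Debug, PartialEq, Eq)]
enum IoKind {
    Read,
    Write,
    Discard,
    Flush,
    Other,
}

/// Classify an I/O from the raw, NUL-terminated rwbs bytes.
fn classify_rwbs(rwbs: &[u8]) -> IoKind {
    // Ignore everything after the first NUL terminator.
    let end = rwbs.iter().position(|&b| b == 0).unwrap_or(rwbs.len());
    let flags = &rwbs[..end];

    if flags.contains(&b'D') {
        IoKind::Discard
    } else if flags.contains(&b'W') {
        IoKind::Write
    } else if flags.contains(&b'R') {
        IoKind::Read
    } else if flags.contains(&b'F') {
        IoKind::Flush
    } else {
        IoKind::Other
    }
}
```

Splitting the per-device counters by IoKind would give separate read and write IOPS, at the cost of reading one more field in the eBPF program.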

The device number dev_t encodes major and minor into a single 32-bit integer. To get a readable device name, map it through /sys/dev/block/:

#![allow(unused)]
fn main() {
use std::fs;

// Convert dev_t (major<<20 | minor) to a readable device name like "sda" or "nvme0n1"
fn dev_t_to_name(dev: u32) -> Option<String> {
    let major = (dev >> 20) as u32;
    let minor = (dev & 0xFFFFF) as u32;
    let path = format!("/sys/dev/block/{}:{}", major, minor);
    // read_link returns a relative path like "../../devices/.../block/sda"
    // The last component after "/block/" is the device name.
    let target = fs::read_link(&path).ok()?;
    let target_str = target.into_os_string().into_string().ok()?;
    // Extract the device name: everything after the last "/block/" component
    if let Some(idx) = target_str.rfind("/block/") {
        Some(target_str[idx + 7..].to_owned())
    } else {
        // Fallback: use the last path component
        target_str.split('/').next_back().map(|s| s.to_owned())
    }
}
}

The /sys/dev/block/MAJOR:MINOR symlink points into the device tree — something like ../../devices/pci0000:00/.../block/sda. We extract the device name (sda, nvme0n1) from the block/ component rather than returning the raw symlink target, which would be a long relative path.

For our metrics, we’ll track per-device IOPS and throughput.

The eBPF program

#![allow(unused)]
fn main() {
// ebpf-programs/src/blockio.rs

use aya_ebpf::helpers::bpf_ktime_get_ns;
use aya_ebpf::macros::{map, tracepoint}; // `map` is needed for the #[map] statics
use aya_ebpf::maps::{HashMap, PerCpuArray};
use aya_ebpf::programs::TracePointContext;

#[derive(Clone, Copy)]
#[repr(C)]
pub struct BioQueueEvent {
    pub dev: u32,
    pub sector: u64,
    pub nr_sector: u32,
    pub timestamp: u64,
}

// Per-device counter: dev → (ops_count, total_bytes)
// HashMap is shared across CPUs, so concurrent updates from different CPUs
// can lose counts (read-modify-write race). For monitoring where approximate
// counts are acceptable, this is fine. For exact counts, use PerCpuArray
// and sum in userspace (like the histogram in Part 12).
#[map]
static IO_COUNTERS: HashMap<u32, (u64, u64)> = HashMap::with_max_entries(64, 0);

// Per-CPU sampling counter — avoids static mut by using a map
#[map]
static SAMPLE_COUNTER: PerCpuArray<u64> = PerCpuArray::with_max_entries(1, 0);

// Ring buffer for sector samples (sent every 100th event)
#[map]
static SECTOR_SAMPLES: aya_ebpf::maps::RingBuf =
    aya_ebpf::maps::RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SectorSample {
    pub dev: u32,
    pub sector: u64,
}

#[tracepoint]
pub fn block_bio_queue(ctx: TracePointContext) -> u32 {
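    // read_at offsets count from the start of the tracepoint context, which
    // includes the 8-byte common header, so they match the format file:
    // dev @ 8, sector @ 16, nr_sector @ 24. Verify on the target kernel.
    let dev = unsafe { ctx.read_at::<u32>(8).unwrap_or(0) };
    let sector = unsafe { ctx.read_at::<u64>(16).unwrap_or(0) };
    let nr_sector = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };

    // Update per-device counters. The HashMap is shared across CPUs, so this
    // read-modify-write can lose an update when two CPUs race on one device.
    unsafe {
        let (ops, bytes) = IO_COUNTERS.get(&dev).copied().unwrap_or((0u64, 0u64));
        let new_ops = ops + 1;
        let new_bytes = bytes + (nr_sector as u64 * 512);
        let _ = IO_COUNTERS.insert(&dev, &(new_ops, new_bytes), 0);
    }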

    // Sampling: emit every 100th operation seen on this CPU.
    unsafe {
        if let Some(ptr) = SAMPLE_COUNTER.get_ptr_mut(0) {
            *ptr += 1;
            if *ptr % 100 == 0 {
                let sample = SectorSample { dev, sector };
                // If the ring buffer is full the sample is simply dropped.
                let _ = SECTOR_SAMPLES.output(&sample, 0);
            }
        }
    }

    0
}
}

A few things to notice:

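Tuple values in HashMap: HashMap<u32, (u64, u64)> stores a tuple as the value, which is a convenient way to keep multiple counters per key. Rust does not guarantee tuple layout, though, and the userspace side must read the value with an identical layout, so a #[repr(C)] struct or a [u64; 2] array is the more robust choice if you extend this.

PerCpuArray<u64> for the sampling counter: A static mut counter would need unsafe at every access and would race when the program runs on several CPUs at once. A PerCpuArray with one entry gives each CPU its own counter, so no atomics are needed.

unsafe on get(): HashMap::get() is unsafe because, unless the map is created with BPF_F_NO_PREALLOC, the kernel may reuse an element's storage after a concurrent insert or remove, so the returned reference can end up pointing at another entry's data. Copying the value out immediately, as the code does, keeps that window small. For metrics aggregation, occasional lost updates are acceptable.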

Computing IOPS and throughput

In userspace, read the counters and compute rates:

#![allow(unused)]
fn main() {
use aya::maps::HashMap as AyaHashMap;

pub struct IoStats {
    pub device: String,
    pub iops: f64,
    pub throughput_mbps: f64,
    pub ops_count: u64,
    pub bytes_count: u64,
}

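// Aya's HashMap takes the map handle type as its first type parameter.
fn read_io_stats<T: std::borrow::Borrow<aya::maps::MapData>>(
    counters: &AyaHashMap<T, u32, (u64, u64)>,
    prev: &std::collections::HashMap<u32, (u64, u64)>,
    elapsed_secs: f64,
) -> Vec<IoStats> {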
    let safe_elapsed = if elapsed_secs > 0.0 { elapsed_secs } else { 1.0 };
    counters.iter().filter_map(|item| {
        let (dev, (ops, bytes)) = item.ok()?;
        let prev_data = prev.get(&dev).copied().unwrap_or((0, 0));
        let ops_delta = ops.saturating_sub(prev_data.0);
        let bytes_delta = bytes.saturating_sub(prev_data.1);

        Some(IoStats {
            device: dev_t_to_name(dev).unwrap_or_else(|| format!("{:08x}", dev)),
            iops: ops_delta as f64 / safe_elapsed,
            throughput_mbps: (bytes_delta as f64 / safe_elapsed) / 1e6,
            ops_count: ops_delta,
            bytes_count: bytes_delta,
        })
    }).collect()
}
}
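IOPS and throughput together give the mean request size, which is the quickest way to tell the "many small operations" case from the "few large operations" case. A minimal sketch, with avg_request_bytes as a hypothetical helper that takes the same per-interval deltas computed above:

```rust
/// Mean bytes per I/O over one polling interval.
/// Returns None when no I/O was queued in the interval.
fn avg_request_bytes(ops_delta: u64, bytes_delta: u64) -> Option<f64> {
    if ops_delta == 0 {
        None
    } else {
        Some(bytes_delta as f64 / ops_delta as f64)
    }
}
```

A mean near 4 KiB suggests small, page-sized requests, while means in the hundreds of KiB suggest large, streaming-style requests.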

IOPS entropy — measuring randomness

High IOPS can mean two different things:

Sequential I/O: reading a large file in order, one big read per access — predictable, batchable
Random I/O: reading many small blocks at scattered addresses — unpredictable, hard to batch

Both have high IOPS. The difference is in the sector number distribution. Shannon entropy of the sector numbers measures how predictable or random the access pattern is.

Here’s how to compute it:

#![allow(unused)]
fn main() {
fn compute_entropy(sectors: &[u64], num_buckets: usize) -> f64 {
    if sectors.is_empty() {
        return 0.0;
    }

    // Bucket sectors into ranges to build a histogram
    let min = *sectors.iter().min().unwrap_or(&0);
    let max = *sectors.iter().max().unwrap_or(&0);

    if min == max {
        return 0.0; // all accesses in one bucket = completely predictable
    }

    let range = max - min + 1;
    let bucket_size = (range / num_buckets as u64).max(1);

    let mut counts = vec![0usize; num_buckets];
    for &sector in sectors {
        let bucket = ((sector - min) / bucket_size) as usize;
        let bucket = bucket.min(num_buckets - 1);
        counts[bucket] += 1;
    }

    let total = counts.iter().sum::<usize>() as f64;
    if total == 0.0 {
        return 0.0;
    }

    // H = -sum(p * log2(p)) for each bucket
    let mut entropy = 0.0;
    for &count in &counts {
        if count == 0 {
            continue;
        }
        let p = count as f64 / total;
        entropy -= p * p.log2();
    }

    entropy
}
}

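Entropy is measured in bits. A value near 0 means the accesses are concentrated in one bucket, so the workload keeps hitting one region of the device. A value near log2(num_buckets) means accesses are spread evenly across all buckets. Because the buckets span the range between the lowest and highest sampled sector, the metric captures spatial concentration rather than ordering: one long sequential scan that sweeps the whole window also fills every bucket evenly, so check request size alongside entropy before calling a workload random.

For num_buckets = 16:

0-2 bits: concentrated (always reading from one area)
2-3 bits: some locality (reading from a few areas)
3-4 bits: spread out (reading from across the address space)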
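The bit thresholds above only hold for 16 buckets. Dividing by log2(num_buckets) gives a 0-to-1 score that stays comparable if you change the bucket count. The sketch below works on bucket counts directly; entropy_bits and normalized_entropy are hypothetical helpers, separate from compute_entropy.

```rust
/// Shannon entropy in bits of a histogram given as per-bucket counts.
fn entropy_bits(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            // p * log2(1/p) equals -p * log2(p)
            p * (1.0 / p).log2()
        })
        .sum()
}

/// Entropy scaled to 0.0..=1.0 by dividing by the maximum, log2(bucket count).
fn normalized_entropy(counts: &[usize]) -> f64 {
    if counts.len() < 2 {
        return 0.0;
    }
    entropy_bits(counts) / (counts.len() as f64).log2()
}
```

Sixteen equally filled buckets score 4 bits, or 1.0 normalized; a single hot bucket scores 0 in both forms.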

Ring buffer reader for sector sampling

For entropy calculation, you need the actual sector numbers, not just counts. The eBPF program sends every 100th sector number to the ring buffer:

#![allow(unused)]
fn main() {
use std::collections::VecDeque;

use aya::maps::RingBuf;
use aya::Ebpf;

#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct SectorSample {
    pub dev: u32,
    pub sector: u64,
}

async fn poll_sector_samples(
    ebpf: &mut Ebpf,
    sector_window: &mut VecDeque<u64>,
) -> anyhow::Result<()> {
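    // Maps are registered under the name of the static in the eBPF program,
    // and map_mut returns an Option, so convert a miss into an error.
    let map = ebpf
        .map_mut("SECTOR_SAMPLES")
        .ok_or_else(|| anyhow::anyhow!("SECTOR_SAMPLES map not found"))?;
    let mut ring_buf = RingBuf::try_from(map)?;

    while let Some(item) = ring_buf.next() {
        // Skip anything too short to hold a SectorSample.
        if item.len() < std::mem::size_of::<SectorSample>() {
            continue;
        }
        let sample = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const SectorSample)
        };
        sector_window.push_back(sample.sector);
    }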

    // Keep the last 10000 samples
    while sector_window.len() > 10000 {
        sector_window.pop_front();
    }

    Ok(())
}
}

The ring buffer stores raw bytes. Each item dereferences to a byte slice ([u8]), and you copy it back into the struct type with read_unaligned. This is the standard pattern for receiving structured events from the ring buffer.

Userspace aggregation

#![allow(unused)]
fn main() {
use std::collections::{HashMap, VecDeque};
use std::time::Duration;

use aya::Ebpf;

async fn poll_block_io(
    ebpf: &mut Ebpf,
    prev: &mut std::collections::HashMap<u32, (u64, u64)>,
    sector_window: &mut VecDeque<u64>,
    elapsed_secs: f64,
) -> anyhow::Result<()> {
    // Drain ring buffer samples for entropy calculation
    poll_sector_samples(ebpf, sector_window).await?;

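    // Read the eBPF counters map (registered under the static's name)
    let map = ebpf
        .map_mut("IO_COUNTERS")
        .ok_or_else(|| anyhow::anyhow!("IO_COUNTERS map not found"))?;
    let counters: aya::maps::HashMap<_, u32, (u64, u64)> =
        aya::maps::HashMap::try_from(map)?;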

    let safe_elapsed = if elapsed_secs > 0.0 { elapsed_secs } else { 1.0 };
    // Compute IOPS and throughput by iterating the eBPF map and comparing to prev
    let mut stats = Vec::new();
    for item in counters.iter() {
        let (dev, (ops, bytes)) = item?;
        let (prev_ops, prev_bytes) = prev.get(&dev).copied().unwrap_or((0, 0));
        let ops_delta = ops.saturating_sub(prev_ops);
        let bytes_delta = bytes.saturating_sub(prev_bytes);

        stats.push(IoStats {
            device: dev_t_to_name(dev).unwrap_or_else(|| format!("{:08x}", dev)),
            iops: ops_delta as f64 / safe_elapsed,
            throughput_mbps: (bytes_delta as f64 / safe_elapsed) / 1e6,
            ops_count: ops_delta,
            bytes_count: bytes_delta,
        });

        prev.insert(dev, (ops, bytes));
    }

    let entropy = compute_entropy(&sector_window.iter().copied().collect::<Vec<_>>(), 16);

    for stat in stats {
        println!(
            "{}: iops={:.0}  mbps={:.1}  entropy={:.2}",
            stat.device, stat.iops, stat.throughput_mbps, entropy
        );
    }

    Ok(())
}
}
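
The rate arithmetic inside poll_block_io can be checked in isolation. This is a small sketch of the same calculation; rate_per_sec is a hypothetical helper name, and the fallback to a one-second interval mirrors safe_elapsed above.

```rust
/// Hypothetical helper: turn two cumulative counter readings into a
/// per-second rate.
pub fn rate_per_sec(prev: u64, curr: u64, elapsed_secs: f64) -> f64 {
    // saturating_sub yields 0 instead of a huge wrapped value when the
    // counter went backwards (for example after the map entry was recreated).
    let delta = curr.saturating_sub(prev);
    // Guard against a zero or negative interval on the first poll.
    let secs = if elapsed_secs > 0.0 { elapsed_secs } else { 1.0 };
    delta as f64 / secs
}
```

For example, a counter that moves from 1500 to 4500 operations over 2 seconds gives 1500 IOPS, and a counter that appears to move backwards reports 0 rather than a spike.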

Next: Part 11 — vhost and Virtio Ring Instrumentation — Instrument the virtio ring with kprobes to measure I/O latency at the virtualization boundary.

Part 11 — vhost and Virtio Ring Instrumentation

If you run virtual machines on Linux, the guest and the host need to move data back and forth for disk I/O and networking. That data path has a bottleneck: the shared memory ring where work items are posted. When that ring fills up, the guest waits. When the host can’t process descriptors fast enough, the guest waits longer.

The ring is called the virtqueue — a circular buffer in shared memory. The guest writes descriptors (“here’s a network packet to send” or “read 4KB from this disk offset”) into the ring. The host-side vhost kernel module picks them up, does the work, and signals completion. If you can measure how full the ring is and how long descriptors sit before being processed, you can see I/O latency building up before it shows up in the guest’s metrics.

Understanding the ring means understanding the latency and throughput of your VM’s I/O path.

Virtio ring basics

The virtio ring has three parts:

Descriptor table: an array of data buffer descriptors — each entry points to a buffer and its length
Available ring: an array of descriptor indices the guest has made available to the host
Used ring: an array of descriptor indices the host has processed and returned to the guest

The host consumer reads from the available ring and writes completions to the used ring. The guest producer writes to the available ring and reads from the used ring.

The key performance questions:

How many descriptors are being added per second?
Is the ring getting full (stall condition)?
What’s the latency from add to completion?
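
For a split virtqueue, the second question reduces to index arithmetic: the available and used rings each carry a free-running 16-bit index, and the difference between them is the number of buffers the host still owes the guest. The sketch below assumes you have already obtained both index values somehow (this tutorial does not read them); in_flight and occupancy are hypothetical names.

```rust
/// Buffers the guest has posted that the host has not yet returned.
/// Both indices are free-running u16 counters, so subtract with wraparound.
pub fn in_flight(avail_idx: u16, used_idx: u16) -> u16 {
    avail_idx.wrapping_sub(used_idx)
}

/// Fraction of the ring currently occupied, for a ring of `queue_size` entries.
pub fn occupancy(avail_idx: u16, used_idx: u16, queue_size: u16) -> f64 {
    in_flight(avail_idx, used_idx) as f64 / queue_size as f64
}
```

An occupancy that stays near 1.0 is the stall condition: the guest has no free ring slots until the host returns some.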

KProbes vs tracepoints

Tracepoints are stable hooks placed by kernel developers. KProbes are dynamic — you can attach to any kernel function. They’re more powerful but less portable.

vhost and virtio don’t have tracepoints for the fast path operations. To instrument the ring, we need kprobes.

The tradeoff: kprobe function names change between kernel versions. A probe that works on kernel 5.15 might not exist on 6.1. This is the cost of deep kernel instrumentation.

Finding the right symbols

/proc/kallsyms lists the kernel’s symbols, including non-exported functions and symbols from loaded modules (shown with a [module] suffix). vhost is usually built as a module, so load it first and keep the module lines in the output:

# Find vhost functions (module symbols carry a [vhost...] suffix)
grep -i vhost /proc/kallsyms | head -20

# Find virtqueue functions
grep -i virtqueue /proc/kallsyms | head -20

Common probe targets:

Function	What it does
`handle_tx_kick`	TX kick from the guest (vhost-net)
`handle_rx_kick`	RX kick from the guest (vhost-net)
`vhost_scsi_handle_vq`	SCSI I/O submission (vhost-scsi)
`vhost_worker`	The main vhost worker loop

⚠️ Why not __virtqueue_add? __virtqueue_add and virtqueue_add are declared static inline in the kernel source (drivers/virtio/virtio_ring.c). The compiler expands them at the call site, so they don’t appear in /proc/kallsyms and can’t be probed with kprobes. A guide that uses __virtqueue_add as a kprobe target won’t work on a standard kernel build. The functions in the table above are not exported either, but they are normally emitted as real functions (most are registered as callbacks), so they appear in kallsyms once the module is loaded. Kprobes need a symbol in kallsyms, not an exported one.

The exact names vary by kernel version and configuration. Build discovery into your program.
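
One way to build that discovery step is to parse the kallsyms text and compare the symbol-name column exactly. The sketch below is a pure function over the file contents so it can be tested without root; find_symbols is a hypothetical name.

```rust
use std::collections::HashSet;

/// Hypothetical helper: return the candidates that appear as exact symbol
/// names in kallsyms text, preserving the order of `candidates`.
pub fn find_symbols<'a>(kallsyms: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    // Each line is "<address> <type> <name>", optionally followed by
    // "[module]". The name is the third whitespace-separated field.
    let present: HashSet<&str> = kallsyms
        .lines()
        .filter_map(|line| line.split_whitespace().nth(2))
        .collect();

    candidates
        .iter()
        .copied()
        .filter(|c| present.contains(*c))
        .collect()
}
```

Matching the whole field avoids false positives from compiler-generated variants such as handle_tx_kick.cold, which a substring search would accept.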

A note on struct access

KProbes give you access to function arguments, not named struct fields. You access arguments through ctx.arg::<T>(n), where n is the 0-indexed argument position. To read struct fields through a pointer argument, use bpf_probe_read_kernel (the kernel-memory variant of bpf_probe_read).

This is fundamentally different from tracepoints, where the struct layout is documented. With kprobes, you need to know the kernel struct layout. For production use, this means either:

Using BTF with CO-RE (Compile Once, Run Everywhere) to access struct fields by name
Hardcoding struct field offsets (fragile but works)

For this tutorial, we’ll use the argument-based approach, which works without knowing struct layouts.

Counting descriptor additions

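The most fundamental metric: how many kicks the host handles per second per queue. Each handler invocation can drain several descriptors, so this counts kick events rather than individual descriptors. We can’t easily get the queue index from a kprobe without CO-RE and BTF, so we’ll count per-CPU as a proxy: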

#![allow(unused)]
fn main() {
// ebpf-programs/src/vhost.rs

use aya_ebpf::programs::ProbeContext;
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

#[map]
static VQ_COUNTER: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

// The #[kprobe] macro marks this function as a kprobe entry point.
// The kernel symbol to probe is specified at attach time from userspace.
#[kprobe]
pub fn vhost_kick(ctx: ProbeContext) -> u32 {
    // We're probing handle_tx_kick or handle_rx_kick, the vhost-net handlers
    // that process virtqueue entries. We can't easily get the queue index
    // without CO-RE. For now, count per CPU as a proxy.
    let cpu = unsafe { bpf_get_smp_processor_id() };

    unsafe {
        let count = VQ_COUNTER.get(&cpu).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = VQ_COUNTER.insert(&cpu, &new_count, 0);
    }

    0
}
}

ctx.arg::<T>(n) returns Option<T>; it is None when the requested argument can’t be fetched. For a pointer argument, request a raw pointer type such as *const T and check it for null before reading through it.

Getting queue index with CO-RE

If your kernel has BTF enabled (most modern kernels do), you can access struct fields by name. This lets you get the queue index from the vhost_virtqueue struct and track per-queue counters instead of per-CPU.

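CO-RE (Compile Once, Run Everywhere) is the BPF ecosystem’s mechanism for portable BTF-based access; Aya’s loader applies CO-RE relocations when it loads an object. It works by resolving struct field offsets at load time using BTF data from the target kernel. With CO-RE, you’d take the kick handler’s argument (a vhost_work pointer embedded in the vhost_virtqueue), recover the enclosing vhost_virtqueue, and read the field that identifies the queue by name. In C, this is what libbpf’s bpf_core_read does.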

A complete CO-RE example is beyond the scope of this tutorial — it requires kernel BTF data to be available (/sys/kernel/btf/vmlinux must exist), Aya’s aya-obj BTF parsing, and careful struct definitions that match the kernel’s layout. The Aya project has CO-RE examples in their repository that show the full pattern.

If BTF isn’t available on your target kernel, fall back to the per-CPU counting approach shown earlier. Per-CPU counters give you some visibility without the BTF dependency.

Stall detection

A ring stall happens when the available ring is full: the guest is trying to add descriptors but the host hasn’t consumed the previous ones. A proxy for stalls: measure the time between consecutive kick-handler invocations on the same CPU. If the gap spikes, something was waiting.

#![allow(unused)]
fn main() {
// ebpf-programs/src/vhost.rs

use aya_ebpf::programs::ProbeContext;
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

#[map]
static LAST_ADD_TS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static STALL_EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

const STALL_THRESHOLD_NS: u64 = 1_000_000; // 1ms

#[kprobe]
pub fn vhost_kick_stall(ctx: ProbeContext) -> u32 {
    let cpu = unsafe { bpf_get_smp_processor_id() };
    let now = unsafe { bpf_ktime_get_ns() };

    unsafe {
        if let Some(&prev_ts) = LAST_ADD_TS.get(&cpu) {
            let gap = now.saturating_sub(prev_ts);
            if gap > STALL_THRESHOLD_NS {
                let event = StallEvent {
                    cpu_id: cpu,
                    gap_ns: gap,
                };
                STALL_EVENTS.output(&event, 0);
            }
        }
        let _ = LAST_ADD_TS.insert(&cpu, &now, 0);
    }

    0
}
}

The full vhost eBPF program

Here’s the complete program for vhost/virtio instrumentation:

#![allow(unused)]
fn main() {
// ebpf-programs/src/vhost.rs

use aya_ebpf::programs::ProbeContext;
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

#[map]
static VQ_COUNTER: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static LAST_ADD_TS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static STALL_EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

const STALL_THRESHOLD_NS: u64 = 1_000_000;

#[kprobe]
pub fn vhost_kick(ctx: ProbeContext) -> u32 {
    let cpu = unsafe { bpf_get_smp_processor_id() };
    let now = unsafe { bpf_ktime_get_ns() };

    // Per-CPU descriptor counter
    unsafe {
        let count = VQ_COUNTER.get(&cpu).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = VQ_COUNTER.insert(&cpu, &new_count, 0);
    }

    // Stall detection: if the gap since the last add exceeds the
    // threshold, emit a stall event. A gap means something was
    // waiting — either the ring was full or the host was slow.
    // Note: a long gap can also mean the ring was idle (no I/O
    // submitted). This proxy can't distinguish stalls from idle
    // periods. For production use, track the available ring's
    // size directly via CO-RE or correlate with the vhost worker's
    // CPU usage.
    unsafe {
        if let Some(&prev_ts) = LAST_ADD_TS.get(&cpu) {
            let gap = now.saturating_sub(prev_ts);
            if gap > STALL_THRESHOLD_NS {
                let event = StallEvent {
                    cpu_id: cpu,
                    gap_ns: gap,
                };
                STALL_EVENTS.output(&event, 0);
            }
        }
        let _ = LAST_ADD_TS.insert(&cpu, &now, 0);
    }

    0
}
}

Attaching probes dynamically from userspace

The challenge with kprobes: function names change between kernel versions. The solution is to look up the symbol at runtime and attach to whatever exists:

#![allow(unused)]
fn main() {
// monitor/src/vhost.rs

use aya::programs::KProbe;
use std::process::Command;

pub fn attach_vhost_probes(ebpf: &mut aya::Ebpf) -> anyhow::Result<()> {
    // The eBPF program is named "vhost_kick" (matching the Rust function).
    // At attach time, we specify the kernel symbol to probe.
    // Look up available symbols and attach to whichever exists.
    //
    // Note: __virtqueue_add and virtqueue_add are static inline, so they won't
    // appear in kallsyms. We probe the vhost handler functions instead.
    let candidates = [
        "handle_tx_kick",   // vhost-net TX (most common)
        "handle_rx_kick",   // vhost-net RX
        "vhost_scsi_handle_vq",  // vhost-scsi
    ];

    let kallsyms = std::fs::read_to_string("/proc/kallsyms")?;

    // Find the eBPF program by its Rust name, not the kernel symbol
    let program = ebpf.program_mut("vhost_kick")
        .ok_or_else(|| anyhow::anyhow!("vhost_kick program not found in eBPF object"))?;
    let kprobe: &mut KProbe = program.try_into()?;
    kprobe.load()?;

    // Try each candidate kernel symbol
    let mut attached = false;
    for sym in candidates {
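        // Match the symbol-name field exactly; a substring test would also
        // accept names such as "handle_tx_kick.cold".
        if kallsyms.lines().any(|l| l.split_whitespace().nth(2) == Some(sym)) {
            kprobe.attach(sym, 0)?; // 0 = byte offset into the function (its entry)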
            println!("attached kprobe: {sym}");
            attached = true;
            break;
        }
    }

    if !attached {
        println!("warning: no vhost/virtio symbols found in /proc/kallsyms — skipping vhost probes");
    }

    Ok(())
}
}

attach(name, offset): The second parameter is a byte offset from the start of the function, so 0 attaches at the entry point. Whether the probe fires on entry or on return is decided by the program type: #[kprobe] for entry, #[kretprobe] for return.

Reading in userspace

#![allow(unused)]
fn main() {
// monitor/src/main.rs

use aya::maps::{HashMap, RingBuf};
use aya::programs::KProbe;

#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

async fn poll_vhost(ebpf: &mut aya::Ebpf) -> anyhow::Result<()> {
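    // A plain #[map] static is exported under its Rust identifier, and
    // map_mut returns an Option rather than a Result.
    let map = ebpf
        .map_mut("STALL_EVENTS")
        .ok_or_else(|| anyhow::anyhow!("STALL_EVENTS map not found"))?;
    let mut stall_buf = RingBuf::try_from(map)?;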

    while let Some(item) = stall_buf.next() {
        let event = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const StallEvent)
        };
        println!(
            "vhost stall: cpu={} gap={}µs",
            event.cpu_id,
            event.gap_ns / 1000
        );
    }

    Ok(())
}
}
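
Printing every event is fine for debugging, but a dashboard usually wants a per-CPU summary for each polling interval. The sketch below aggregates already-decoded events; StallSummary and summarize are hypothetical names, and StallEvent is repeated only to keep the example self-contained.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

/// Hypothetical per-CPU summary of the stall events seen in one interval.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct StallSummary {
    pub count: u64,
    pub max_gap_ns: u64,
    pub total_gap_ns: u64,
}

/// Group events by CPU, tracking how many occurred and how long they were.
pub fn summarize(events: &[StallEvent]) -> BTreeMap<u32, StallSummary> {
    let mut out: BTreeMap<u32, StallSummary> = BTreeMap::new();
    for e in events {
        let s = out.entry(e.cpu_id).or_default();
        s.count += 1;
        s.max_gap_ns = s.max_gap_ns.max(e.gap_ns);
        s.total_gap_ns = s.total_gap_ns.saturating_add(e.gap_ns);
    }
    out
}
```

A BTreeMap keeps the CPUs in numeric order, which makes the output stable from one interval to the next.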

Version compatibility

This is the hardest part of kprobe-based instrumentation. Function names change, struct layouts change, and some functions are added or removed between kernel versions.

Practical strategies:

Probe what exists: Look up symbols in /proc/kallsyms at runtime and only attach to what’s there
Graceful degradation: If the vhost handler functions aren’t available, fall back to counting at a higher-level tracepoint (like sched:sched_switch or irq:softirq_entry)
Per-version testing: Test your probes on multiple kernel versions

For a monitoring tool that needs broad compatibility, tracepoints are always preferable when available. KProbes are for the deep instrumentation that tracepoints can’t reach.

Next: Part 12 — Queue Depth Histograms — Build per-CPU histograms in eBPF and compute p50 and p99 percentiles.

Part 12 — Queue Depth Histograms

Raw counts are useful. But p50 and p99 tell you what the experience actually feels like.

Here’s the problem with raw counts and averages: they don’t distinguish between a workload that’s consistently mildly queued and one that has extreme spikes. You might measure an average queue depth of 4, but if every measurement is either 0 or 100, the average is misleading. Percentiles reveal the shape of the distribution.

The eBPF program maintains a histogram in a per-CPU array. Each CPU increments its own copy of the bucket counter. Userspace reads all the per-CPU arrays, sums them, and computes p50 and p99 from the cumulative distribution.

The key insight: histogram IS the data structure

Statistical sampling is the usual approach: sample every Nth event, store the samples, compute percentiles from the sample set. This introduces sampling error, and extreme events may not be captured at all.

The alternative: increment a counter for every event, but put it in the right bucket. The histogram IS the data. You never store individual samples.

Bucket 0: [0]        → count of times the value was exactly 0
Bucket 1: [1]        → count of times the value was exactly 1
Bucket 2: [2-4]      → count of times the value was 2, 3, or 4
Bucket 3: [5-8]      → count of times the value was 5 through 8
Bucket 4: [9-16]     → count of times the value was 9 through 16
Bucket 5: [17-32]    → count of times the value was 17 through 32
Bucket 6: [33-64]    → count of times the value was 33 through 64
Bucket 7: [65+]      → count of times the value was 65 or more

This is a logarithmic bucket scheme — wide buckets at high values. The resolution at the low end is higher because that’s where most queues spend most of their time.

Per-CPU arrays: no lost updates

The critical challenge with histogram counters: multiple CPUs can fire the same eBPF program simultaneously. If you use a regular Array<u64>, two CPUs incrementing the same bucket at the same time would both read the same value, increment it, and write it back — lost updates.

PerCpuArray solves this. Each CPU has its own independent copy of the full array. CPU 0’s copy of buckets[3] is completely separate from CPU 1’s copy. You increment your own CPU’s copy without any contention. When userspace reads the histogram, it sums all the per-CPU values for each bucket.

#![allow(unused)]
fn main() {
// ebpf-programs/src/histogram.rs

use aya_ebpf::maps::PerCpuArray;
use aya_ebpf::macros::map;

#[map]
static QUEUE_HIST: PerCpuArray<u64> = PerCpuArray::with_max_entries(8, 0);

// Bucket boundaries (queue depth)
// Bucket 0: depth=0, Bucket 1: depth=1, etc.
const BUCKET_BOUNDARIES: &[u32] = &[
    1,           // [0]       bucket 0
    2,           // [1]       bucket 1
    5,           // [2-4]     bucket 2
    9,           // [5-8]     bucket 3
    17,          // [9-16]    bucket 4
    33,          // [17-32]   bucket 5
    65,          // [33-64]   bucket 6
    u32::MAX,    // [65+]     bucket 7
];

fn bucket_for_depth(depth: u32) -> usize {
    for (i, &boundary) in BUCKET_BOUNDARIES.iter().enumerate() {
        if depth < boundary {
            return i;
        }
    }
    BUCKET_BOUNDARIES.len() - 1
}

#[inline(always)]
fn increment_bucket(bucket: usize) {
    // get_ptr_mut returns a pointer to the current CPU's copy
    if let Some(ptr) = QUEUE_HIST.get_ptr_mut(bucket as u32) {
        // SAFETY: each CPU has its own independent array copy.
        // No other CPU can write to this location.
        unsafe {
            *ptr += 1;
        }
    }
}
}

Each CPU has its own copy of QUEUE_HIST. When CPU 0 calls increment_bucket(3), it writes to CPU 0’s copy. CPU 1’s call writes to CPU 1’s copy. No contention, no lost updates.
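
The same idea can be modelled in ordinary userspace Rust, with threads standing in for CPUs. This is only an analogy under that assumption, not how the kernel implements per-CPU maps; run_workers and merge are hypothetical names.

```rust
use std::thread;

const BUCKETS: usize = 8;

/// Each worker owns a private histogram, like one CPU's slot in a PerCpuArray.
pub fn run_workers(workers: usize, events_per_worker: usize) -> Vec<[u64; BUCKETS]> {
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            thread::spawn(move || {
                let mut local = [0u64; BUCKETS];
                for i in 0..events_per_worker {
                    // Private copy: a plain increment is safe, no atomics needed.
                    local[(w + i) % BUCKETS] += 1;
                }
                local
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

/// What the userspace reader does: sum each bucket across all copies.
pub fn merge(per_cpu: &[[u64; BUCKETS]]) -> [u64; BUCKETS] {
    let mut total = [0u64; BUCKETS];
    for copy in per_cpu {
        for (t, c) in total.iter_mut().zip(copy.iter()) {
            *t += c;
        }
    }
    total
}
```

No increment is lost because no two workers ever write the same memory; the cost is moved to the reader, which has to sum one copy per worker.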

Integrating into the scheduler tracepoint

Attach the histogram increment to a tracepoint or kprobe that provides the metric you care about:

#![allow(unused)]
fn main() {
// ebpf-programs/src/scheduler.rs
// (continued from Part 6)

use aya_ebpf::maps::PerCpuArray;
use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};

#[map]
static QUEUE_HIST: PerCpuArray<u64> = PerCpuArray::with_max_entries(8, 0);

const BUCKET_BOUNDARIES: &[u32] = &[1, 2, 5, 9, 17, 33, 65, u32::MAX];

fn bucket_for_depth(depth: u32) -> usize {
    for (i, &boundary) in BUCKET_BOUNDARIES.iter().enumerate() {
        if depth < boundary {
            return i;
        }
    }
    BUCKET_BOUNDARIES.len() - 1
}

fn increment_bucket(bucket: usize) {
    if let Some(ptr) = QUEUE_HIST.get_ptr_mut(bucket as u32) {
        unsafe { *ptr += 1; }
    }
}

// Runqueue length is tracked in a hash map keyed by CPU.
// We update it on sched_waking (task about to wake) and sched_switch (task starts
// running), then emit the queue depth on each context switch.

#[map]
static RUNQUEUE_DEPTH: aya_ebpf::maps::HashMap<u32, u32> =
    aya_ebpf::maps::HashMap::with_max_entries(256, 0);

// sched:sched_waking: task is about to be woken (increment runqueue depth).
// read_at offsets are from the start of the tracepoint record, which begins
// with an 8-byte common header: comm (char[16]) at 8, pid (i32) at 24,
// prio (i32) at 28, target_cpu (i32) at 32. Some older kernels place a
// `success` field before target_cpu; confirm against
// /sys/kernel/tracing/events/sched/sched_waking/format on the target kernel.
#[tracepoint]
pub fn sched_waking_depth(ctx: TracePointContext) -> u32 {
    let pid = unsafe { ctx.read_at::<i32>(24).unwrap_or(0) };
    if pid > 0 {
        // target_cpu tells us which CPU the task will run on
        let target_cpu = unsafe { ctx.read_at::<u32>(32).unwrap_or(0) };
        unsafe {
            let depth = RUNQUEUE_DEPTH.get(&target_cpu).copied().unwrap_or(0u32);
            let new_depth = depth + 1;
            let _ = RUNQUEUE_DEPTH.insert(&target_cpu, &new_depth, 0);
        }
    }
    0
}

// sched:sched_switch: task starts running (record histogram, decrement depth).
// Record layout: prev_comm (char[16]) at 8, prev_pid (i32) at 24,
// prev_prio (i32) at 28, prev_state (long) at 32, next_comm (char[16]) at 40,
// next_pid (i32) at 56.
#[tracepoint]
pub fn sched_switch_depth(ctx: TracePointContext) -> u32 {
    let next_pid = unsafe { ctx.read_at::<i32>(56).unwrap_or(0) };
    if next_pid > 0 {
        let cpu = unsafe { aya_ebpf::helpers::bpf_get_smp_processor_id() };
        unsafe {
            let depth = RUNQUEUE_DEPTH.get(&cpu).copied().unwrap_or(0u32);
            // Record every switch, including depth 0, so bucket 0 is populated.
            increment_bucket(bucket_for_depth(depth));
            if depth > 0 {
                let new_depth = depth - 1;
                let _ = RUNQUEUE_DEPTH.insert(&cpu, &new_depth, 0);
            }
        }
    }
    0
}
}
}

When a task is dequeued from the runqueue, we read the current runqueue depth, find the right bucket, and increment it. The PerCpuArray ensures no updates are lost even under heavy scheduler activity.

This is an approximation, not a precise count. The scheduler’s actual runqueue depth is maintained in the kernel’s internal per-CPU rq struct, which we can’t read from eBPF without a kprobe on an internal function. Our approach uses sched_waking (increment) and sched_switch (decrement) as proxies for enqueue and dequeue. This can diverge from the kernel’s count in several cases: a task migrated to a different CPU between waking and running (our depth for the original CPU overcounts); a task woken on a CPU that’s already running that same task (the sched_switch fires normally, but the depth was never incremented because the task was already on the runqueue); and a preempted task, which stays runnable and is later switched back in without a new sched_waking event (our depth undercounts). For histogram purposes, where the goal is the shape of the distribution, these divergences are usually tolerable. For an exact count, you’d need to read rq->nr_running directly (via kprobe on a function that holds the runqueue lock).

Userspace reader

Userspace reads the per-CPU arrays and sums them:

#![allow(unused)]
fn main() {
use aya::maps::{PerCpuArray, PerCpuValues};

pub struct Bucket {
    pub range_start: u32,
    pub range_end: u32,
    pub count: u64,
}

pub struct Histogram {
    pub buckets: Vec<Bucket>,
    pub total: u64,
    pub p50: f64,
    pub p99: f64,
}

const BOUNDARIES: [u32; 8] = [1, 2, 5, 9, 17, 33, 65, u32::MAX];

fn read_histogram(ebpf: &mut aya::Ebpf) -> anyhow::Result<Histogram> {
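    // Userspace PerCpuArray takes two type parameters (map handle, value type).
    // A plain #[map] static is exported under its Rust identifier, QUEUE_HIST.
    let hist: PerCpuArray<_, u64> = PerCpuArray::try_from(
        ebpf.map_mut("QUEUE_HIST")
            .ok_or_else(|| anyhow::anyhow!("QUEUE_HIST map not found"))?,
    )?;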

    let mut bucket_counts = vec![0u64; 8];

    // Read all 8 buckets, summing across all CPUs
    for idx in 0..8u32 {
        let per_cpu_values: PerCpuValues<u64> = hist.get(&idx, 0)?;
        bucket_counts[idx as usize] = per_cpu_values.iter().sum();
    }

    let mut buckets = Vec::new();
    for i in 0..8 {
        buckets.push(Bucket {
            range_start: if i == 0 { 0 } else { BOUNDARIES[i - 1] },
            range_end: BOUNDARIES[i],
            count: bucket_counts[i],
        });
    }

    let total: u64 = bucket_counts.iter().sum();
    let (p50, p99) = compute_percentiles(&bucket_counts, &BOUNDARIES, total);

    Ok(Histogram {
        buckets,
        total,
        p50,
        p99,
    })
}
}

A note on reading PerCpuArray from userspace. The PerCpuArray::get(&index, flags) method returns PerCpuValues<V> — one value per CPU, all in one call. PerCpuValues derefs to Box<[V]>, so you can call .iter() on it directly and sum across CPUs, as the read_histogram function above does.

Computing p50 and p99

The p-th percentile is the value at or below which a fraction p of all events fall. From a histogram, we walk the buckets cumulatively:

#![allow(unused)]
fn main() {
fn compute_percentiles(
    counts: &[u64],
    boundaries: &[u32; 8],
    total: u64,
) -> (f64, f64) {
    if total == 0 {
        return (0.0, 0.0);
    }

    let p50_target = (total as f64 * 0.50) as u64;
    let p99_target = (total as f64 * 0.99) as u64;

    let mut cumsum = 0u64;
    let mut prev_cumsum = 0u64;
    let mut p50 = f64::NAN;
    let mut p99 = f64::NAN;

    for (i, &count) in counts.iter().enumerate() {
        prev_cumsum = cumsum;
        cumsum += count;

        if p50.is_nan() && cumsum >= p50_target {
            p50 = interpolate(&boundaries, i, prev_cumsum, cumsum, p50_target);
        }

        if p99.is_nan() && cumsum >= p99_target {
            p99 = interpolate(&boundaries, i, prev_cumsum, cumsum, p99_target);
        }

        if !p50.is_nan() && !p99.is_nan() {
            break;
        }
    }

    (p50, p99)
}

fn interpolate(
    boundaries: &[u32; 8],
    bucket_idx: usize,
    prev_cumsum: u64,  // cumulative count before this bucket
    cumsum: u64,       // cumulative count including this bucket
    target: u64,       // the percentile target value
) -> f64 {
    // Linear interpolation: where within this bucket does the target fall?
    let count_in_bucket = cumsum - prev_cumsum;
    let offset_in_bucket = target.saturating_sub(prev_cumsum);
    let fraction = if count_in_bucket > 0 {
        offset_in_bucket as f64 / count_in_bucket as f64
    } else {
        0.0
    };

    let bucket_start = if bucket_idx == 0 { 0 } else { boundaries[bucket_idx - 1] } as f64;
    let bucket_end = if boundaries[bucket_idx] == u32::MAX {
        bucket_start + 1.0 // open-ended bucket: approximate as one past start
    } else {
        boundaries[bucket_idx] as f64
    };

    bucket_start + fraction * (bucket_end - bucket_start)
}
}
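
To see the interpolation at work, here is a compact standalone version of the same walk, applied to the bucket counts used in the runqueue depth example (1200, 980, 1400, 800, 400, 30, 10, 1). It is a sketch that folds both functions into one hypothetical helper named percentile; it is not a replacement for the code above.

```rust
const BOUNDARIES: [u32; 8] = [1, 2, 5, 9, 17, 33, 65, u32::MAX];

/// Walk the cumulative distribution and linearly interpolate inside the
/// bucket where the running total first reaches `quantile * total`.
pub fn percentile(counts: &[u64; 8], quantile: f64) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let target = (total as f64 * quantile) as u64;

    let mut cumsum = 0u64;
    for (i, &count) in counts.iter().enumerate() {
        let prev = cumsum;
        cumsum += count;
        // Skip empty buckets; the target must land in a bucket with data.
        if count > 0 && cumsum >= target {
            let start = if i == 0 { 0.0 } else { BOUNDARIES[i - 1] as f64 };
            let end = if BOUNDARIES[i] == u32::MAX {
                start + 1.0 // open-ended bucket: approximate as one past start
            } else {
                BOUNDARIES[i] as f64
            };
            let fraction = target.saturating_sub(prev) as f64 / count as f64;
            return start + fraction * (end - start);
        }
    }
    // Not reached when total > 0, because cumsum ends equal to total.
    0.0
}
```

For those counts the total is 4821. The p50 target is 2410, which lands in the 2-4 bucket (cumulative 2180 before it), giving 2 + (230 / 1400) × 3 ≈ 2.49. The p99 target is 4772, which lands in the 9-16 bucket (cumulative 4380 before it), giving 9 + (392 / 400) × 8 = 16.84.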

The p50 of 2.5 and p99 of 16.8 in the example output below tell you: most observations are low, but occasionally there are spikes. Without percentiles, you’d only know the average, which would hide that information.

Formatted output

#![allow(unused)]
fn main() {
fn format_histogram(hist: &Histogram) -> String {
    let mut lines = Vec::new();
    lines.push(format!(
        "total={}  p50={:.1}  p99={:.1}",
        hist.total, hist.p50, hist.p99
    ));

    for bucket in &hist.buckets {
        let pct = if hist.total > 0 {
            (bucket.count as f64 / hist.total as f64 * 100.0).round() as i32
        } else {
            0
        };
        // range_end is an exclusive upper bound, so print range_end - 1.
        let range = if bucket.range_end == u32::MAX {
            format!("{}+", bucket.range_start)
        } else if bucket.range_end - bucket.range_start == 1 {
            format!("{}", bucket.range_start)
        } else {
            format!("{}-{}", bucket.range_start, bucket.range_end - 1)
        };
        lines.push(format!("  {}: {:>8}  {:>4}%", range, bucket.count, pct));
    }

    lines.join("\n")
}
}

Example output for runqueue depth:

total=4821  p50=2.5  p99=16.8
  0:     1200    25%
  1:      980    20%
  2-4:     1400    29%
  5-8:      800    17%
  9-16:      400     8%
  17-32:       30     1%
  33-64:       10     0%
  65+:        1     0%

A note on atomic operations

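For a regular Array<u64> (not per-CPU), you’d need atomic operations to avoid lost updates. In C, the __sync_fetch_and_add compiler builtin lowers to a BPF atomic add instruction; it is not a BPF helper call. The plain atomic add has been in the instruction set for a long time, while the fetch, exchange, and compare-exchange forms need Linux 5.12 or later. In Rust, core::sync::atomic can express the same operation, assuming your toolchain lowers it for the BPF target:

#![allow(unused)]
fn main() {
// For a shared (non-per-CPU) Array<u64>, use an atomic add.
// SHARED_HIST is a hypothetical Array<u64> map, not defined in this tutorial.
use core::sync::atomic::{AtomicU64, Ordering};

if let Some(ptr) = SHARED_HIST.get_ptr_mut(bucket as u32) {
    // SAFETY: the map value is a valid, 8-byte-aligned u64 shared by all CPUs.
    unsafe { (*(ptr as *const AtomicU64)).fetch_add(1, Ordering::Relaxed) };
}
// PerCpuArray avoids the atomic entirely, which is simpler and faster.
}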

PerCPU arrays are the preferred approach in Aya. They’re simpler, faster (no atomic operations needed), and the pattern works everywhere.

Summary

Here’s the full histogram pattern:

eBPF: declare a PerCpuArray<u64> with N entries (one per bucket)
eBPF: on each event, increment the correct bucket using get_ptr_mut() + dereference
Userspace: read all N entries from the per-CPU array
Userspace: sum per-CPU values for each bucket
Userspace: compute cumulative distribution, interpolate p50 and p99

No statistical sampling. Every event is counted. The histogram IS the complete dataset.

This completes the detailed instrumentation chapters. Parts 1–2 covered the architecture and setup. Parts 3–9 covered each data source. Parts 10–12 covered block I/O, vhost rings, and histograms.

The three sources — hardware PMCs, kernel tracepoints, and procfs/sysfs — each have a different rhythm. PMCs are per-cycle counters that you poll at whatever interval suits your dashboard. eBPF programs push events into ring buffers the moment they happen. Procfs and sysfs are snapshots you read on a timer. The remaining work — wiring everything into a single polling loop, adding configuration, and shipping structured output — is engineering integration. The data sources are in place. The instrumentation works. What’s left is assembly.
