Part 5 — Cache and TLB Metrics from PMC

L1 cache misses cost a few cycles. L3 misses cost a few hundred.

That’s not hyperbole: each step down the hierarchy costs roughly an order of magnitude more than the one above it. An L1 miss that hits in L2 might cost 10 cycles. An L3 miss that goes to main memory costs 100-300 cycles depending on the system. Once you see those numbers in your metrics, memory-bound workloads become obvious.

The cache hierarchy in plain language

Modern CPUs have several levels of cache. Each core has its own private L1 — a small, fast cache split into L1 data and L1 instructions. L2 is also private to each core but larger and slower. L3 (or LLC — Last Level Cache) is typically shared across cores on the same chip and slower still.

When the core needs a piece of data, it checks L1 first. If it finds the cache line, that’s a hit and the data is available in a few cycles (about 4-5 on recent Intel cores). If L1 misses, it checks L2. If L2 misses, it checks L3. If L3 misses, it goes to main memory, a round trip of roughly 100-300 cycles (tens of nanoseconds up to about 100 ns at typical clock speeds).

Each level has its own hardware events, which the CPU’s performance monitoring counters (PMCs) can be programmed to count. Counting L3 misses tells you how often the CPU is going to main memory.

PMC events for each cache level

On Intel x86, raw events are identified by a type and a pair of hex numbers: the event selector and the unit mask (umask). The format for raw events in perf:

event=0x<EventHex>,umask=0x<UmaskHex>

When you use perf_event_open with PERF_TYPE_RAW (4), you set config to (umask << 8) | event.
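As a quick check of that encoding, here is a minimal sketch. The helper name `raw_config` is hypothetical, not part of any library, and it assumes the basic Intel layout where the event select sits in bits 0-7 and the umask in bits 8-15. Other fields of the raw encoding (such as cmask, inv and edge) are left at zero.

```rust
/// Pack an Intel event select and unit mask into a PERF_TYPE_RAW config value.
/// Hypothetical helper for illustration: it only fills the event select
/// (bits 0-7) and the umask (bits 8-15); all other config bits stay zero.
fn raw_config(event: u8, umask: u8) -> u64 {
    ((umask as u64) << 8) | (event as u64)
}
```

For example, `raw_config(0xD1, 0x20)` yields `0x20D1`, the same number you would hand to perf as the raw event `r20d1`.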

Here are the Intel Skylake cache events. These come from the MEM_LOAD_RETIRED event family (event 0xD1) and the TLB event families (0x08 for dTLB, 0x85 for iTLB). The event numbers are from the Intel SDM (Software Developer’s Manual, Volume 3B, Chapter 19). Verify these on your target system with perf list — some events are SKU-dependent.

Metric	Event	Umask	Intel SDM Name
L1 dcache load miss	0xD1	0x08	MEM_LOAD_RETIRED.L1_MISS
L2 cache miss	0xD1	0x10	MEM_LOAD_RETIRED.L2_MISS
L3 cache miss	0xD1	0x20	MEM_LOAD_RETIRED.L3_MISS
L1 dcache load hit	0xD1	0x01	MEM_LOAD_RETIRED.L1_HIT

The L3 miss counter requires SKU verification — some Intel parts don’t expose it. Check perf list on your target system before relying on it.
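The miss events nest: a load that misses L3 has also missed L2 and L1, so it is counted by all three events. Subtracting adjacent levels gives a rough picture of where loads were actually served. The sketch below assumes that nesting and ignores loads that hit the line fill buffer, which are reported separately. `LoadBreakdown` and `breakdown` are hypothetical names, and saturating subtraction is used because counters read at slightly different moments may not be perfectly consistent.

```rust
/// Where retired loads that missed L1 were served from,
/// derived from the nested MEM_LOAD_RETIRED miss counts.
#[derive(Debug, PartialEq)]
struct LoadBreakdown {
    served_by_l2: u64, // missed L1, hit L2
    served_by_l3: u64, // missed L2, hit L3
    beyond_l3: u64,    // missed L3 (main memory or another socket)
}

/// Hypothetical helper: split nested miss counts into per-level buckets.
fn breakdown(l1_miss: u64, l2_miss: u64, l3_miss: u64) -> LoadBreakdown {
    LoadBreakdown {
        // saturating_sub avoids underflow if the reads were slightly skewed
        served_by_l2: l1_miss.saturating_sub(l2_miss),
        served_by_l3: l2_miss.saturating_sub(l3_miss),
        beyond_l3: l3_miss,
    }
}
```

With 1000 L1 misses, 300 L2 misses and 50 L3 misses, 700 loads were served by L2, 250 by L3, and 50 went past L3.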

A note on event accuracy. MEM_LOAD_RETIRED counts retired load instructions: loads that completed and were committed. It doesn’t count speculative loads that were issued but discarded (e.g., on a mispredicted branch), and it doesn’t count stores or hardware prefetches. For most monitoring use cases, retired loads are what you want: they reflect the work the program actually did, not work it speculated about and threw away. Note that MEM_INST_RETIRED is also a retired-instruction family, so switching to it does not add speculative accesses. To see all traffic at a cache level, including speculative and prefetch requests, look for cache-side events such as L1D.REPLACEMENT or the L2_RQSTS family in perf list. The distinction usually only matters for profiling, not metrics.

The TLB structure

The Translation Lookaside Buffer (TLB) is a hardware cache for virtual-to-physical address translations. Every memory access needs an address translation: virtual address → physical address. The TLB caches these translations so the CPU doesn’t have to walk the page table every time.

There are two first-level TLBs, which on most modern cores are backed by a shared second-level TLB (STLB):

dTLB (data TLB): caches translations for memory reads and writes
iTLB (instruction TLB): caches translations for instruction fetches

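A TLB miss means the translation wasn’t cached. The CPU then has to walk the page table, which is a multi-level lookup. A walk takes a few dozen cycles when the page-table entries are themselves in cache, and considerably longer when they are not. The access that missed cannot complete until the translation is available.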
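One way to see why large working sets miss the TLB is to compute its reach: the number of entries times the page size. The sketch below is plain arithmetic. `tlb_reach_bytes` is a hypothetical helper, and the entry count used in the note after it is an assumption for illustration, so check your CPU’s documentation for the real figure.

```rust
/// Bytes of address space a TLB can map at once: entries * page size.
/// Hypothetical helper; the caller supplies both numbers.
fn tlb_reach_bytes(entries: u64, page_size_bytes: u64) -> u64 {
    entries * page_size_bytes
}
```

With an assumed 64 entries, 4 KiB pages cover 64 × 4096 = 262,144 bytes (256 KiB), while 2 MiB pages cover 128 MiB. A working set larger than the reach cannot have all of its translations cached at once.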

PMC events for TLB misses

Metric	Event	Umask	Intel SDM Name
dTLB load miss (causes walk)	0x08	0x01	DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
dTLB load walk completed	0x08	0x0E	DTLB_LOAD_MISSES.WALK_COMPLETED
iTLB miss (causes walk)	0x85	0x01	ITLB_MISSES.MISS_CAUSES_A_WALK
iTLB walk completed	0x85	0x0E	ITLB_MISSES.WALK_COMPLETED

Notice that cache and TLB events are in different event families. Cache events are 0xD1 (MEM_LOAD_RETIRED). dTLB events are 0x08 (DTLB_LOAD_MISSES). iTLB events are 0x85 (ITLB_MISSES). The “causes a walk” events are the useful ones — a TLB miss that doesn’t trigger a page table walk (e.g., a hit in the second-level TLB) isn’t as costly, so we count the ones that actually stall the core waiting for a translation.

A function to open cache counters

// PERF_TYPE_RAW = 4: raw PMC events (event numbers vary by CPU)
const PERF_TYPE_RAW: u32 = 4;

fn open_raw_pmc(
    event: u16,   // event number from Intel SDM (only the low 8 bits are the event select)
    umask: u8,    // unit mask from Intel SDM
    pid: libc::pid_t,
    cpu: libc::c_int,
) -> std::io::Result<libc::c_int> {
    // Hand-rolled perf_event_attr: libc::perf_event_attr isn't available
    // on all platforms. The field ordering must match the kernel struct exactly.
    // This is the original 64-byte layout (PERF_ATTR_SIZE_VER0), the smallest
    // size the kernel accepts; anything shorter is rejected with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64,
        sample_type: u64,
        read_format: u64,
        flags: u64, // bit 0: disabled, bit 2: pinned
        wakeup_events: u32,
        bp_type: u32,
        config1: u64,
    }

    let attr = PerfEventAttr {
        type_: PERF_TYPE_RAW,
        size: std::mem::size_of::<PerfEventAttr>() as u32,
        config: ((umask as u64) << 8) | (event as u64),
        sample_period: 0, // counting mode: no sampling, read counter directly
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };

    let fd = unsafe {
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const PerfEventAttr,
            pid,
            cpu,
            -1 as libc::c_int,  // group_fd: no group leader
            0 as libc::c_ulong, // flags
        )
    };

    if fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(fd as libc::c_int)
}

fn enable_counter(fd: libc::c_int) -> std::io::Result<()> {
    // arg=0 enables this counter (not a group)
    let ret = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0) };
    if ret < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

Now opening specific counters:

// Open a counter for L1 dcache load miss (event 0xD1, umask 0x08)
// pid=0: measure this process; cpu=-1: follow across all CPUs
let l1d_miss = open_raw_pmc(0xD1, 0x08, 0, -1)?;

// Open a counter for dTLB load miss (event 0x08, umask 0x01)
let dtlb_miss = open_raw_pmc(0x08, 0x01, 0, -1)?;

// Open a counter for L3 cache miss (event 0xD1, umask 0x20)
let l3_miss = open_raw_pmc(0xD1, 0x20, 0, -1)?;

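If opening a counter fails, catch the error and fall back. EINVAL or ENOENT usually means the event or attribute isn’t supported on this CPU; EACCES or EPERM means the perf_event_paranoid setting or missing privileges blocked the call. Be aware that on x86 the kernel generally does not validate raw encodings, so an event the CPU doesn’t implement may open successfully and simply count zero. That is another reason to check perf list first.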
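The fallback can be sketched as a loop that tries each candidate and keeps only those that open. To keep the sketch self-contained, the opener is passed in as a closure; in this article’s code it would wrap `open_raw_pmc`, for example `|event, umask| open_raw_pmc(event, umask, 0, -1)`. The name `open_available` and the tuple layout are hypothetical.

```rust
use std::io;

/// Try each (name, event, umask) candidate and keep the ones that open.
/// Failures are logged to stderr and skipped rather than aborting the collector.
fn open_available<F>(
    candidates: &[(&'static str, u16, u8)],
    mut open: F,
) -> Vec<(&'static str, i32)>
where
    F: FnMut(u16, u8) -> io::Result<i32>,
{
    let mut opened = Vec::new();
    for &(name, event, umask) in candidates {
        match open(event, umask) {
            Ok(fd) => opened.push((name, fd)),
            Err(e) => eprintln!("skipping {}: {}", name, e),
        }
    }
    opened
}
```

The caller ends up with a list of named file descriptors and can report only the metrics that are actually available on the machine.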

System-wide monitoring: The examples above use pid=0 (measure the calling process) for simplicity. A production monitoring tool should use pid=-1 with per-CPU file descriptors to measure all processes on the system. See Part 3’s “Per-process vs. system-wide monitoring” section for the full explanation.
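Opening per-CPU descriptors needs the list of online CPUs. On Linux this is exposed in `/sys/devices/system/cpu/online` as a range list such as `0-3,8-11`. Below is a minimal sketch of a parser for that format. `parse_cpu_list` is a hypothetical helper, and error handling is reduced to returning `None`.

```rust
/// Parse a Linux CPU range list such as "0-3,8-11" into individual CPU ids.
/// Returns None if any part is not a number or a well-formed range.
fn parse_cpu_list(s: &str) -> Option<Vec<u32>> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            // A range like "0-3" expands to 0, 1, 2, 3.
            Some((lo, hi)) => {
                let lo = lo.parse::<u32>().ok()?;
                let hi = hi.parse::<u32>().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            // A single id like "8".
            None => cpus.push(part.parse::<u32>().ok()?),
        }
    }
    Some(cpus)
}
```

Each id would then be passed as the `cpu` argument with `pid = -1`, giving one file descriptor per CPU for each event.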

Computing miss rates

Raw counts don’t mean much in isolation. A workload doing a billion memory accesses might have 50 million L3 misses. Is that a lot? It depends on how much work the program did over the same interval. The useful metric is a normalized rate: misses per thousand instructions (often abbreviated MPKI) or misses per million memory operations.

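/// Misses per thousand instructions (MPKI).
/// `instructions` is the retired-instruction count over the same interval as `misses`.
fn compute_miss_rate(misses: u64, instructions: u64) -> f64 {
    if instructions == 0 {
        return 0.0;
    }
    (misses as f64 / instructions as f64) * 1000.0
}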

A few reference points for context:

L1 dcache miss rate: 2-5 misses per 1000 instructions is typical for well-optimized code
L3 miss rate: 0.5-2 per 1000 instructions means memory locality is decent; above 5 suggests poor spatial locality
dTLB miss rate: 0.1-1 per 1000 is typical; above 5 suggests TLB-unfriendly access patterns (e.g., striding through large arrays so that nearly every access lands on a different page)
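These reference points can be turned into a coarse label for a dashboard. The sketch below uses the L3 thresholds from the list above. The intermediate `Borderline` bucket for rates between 2 and 5, and the names `Locality` and `classify_l3`, are assumptions added for illustration.

```rust
/// Coarse label for an L3 miss rate.
#[derive(Debug, PartialEq)]
enum Locality {
    Decent,     // up to 2 misses per 1000 instructions
    Borderline, // between 2 and 5 (assumed bucket, not from the reference points)
    Poor,       // above 5
}

/// Label an L3 miss rate given in misses per 1000 instructions.
fn classify_l3(mpki: f64) -> Locality {
    if mpki > 5.0 {
        Locality::Poor
    } else if mpki > 2.0 {
        Locality::Borderline
    } else {
        Locality::Decent
    }
}
```

As a worked example, 50 million L3 misses over 20 billion instructions is 2.5 misses per 1000 instructions, which lands in the borderline bucket. The same 50 million misses over 100 billion instructions is 0.5, which is decent.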

Cross-microarchitecture event selection

Here’s a wrapper that selects the right event numbers based on the detected microarchitecture:

use crate::cpu::{detect, CpuMicroarch, Microarch};

#[derive(Clone, Copy)]
pub struct CacheEvent {
    pub name: &'static str,
    pub event: u16,   // event number from Intel SDM
    pub umask: u8,    // unit mask from Intel SDM
}

fn cache_events_for(microarch: &Microarch) -> Vec<CacheEvent> {
    match microarch {
        Microarch::Skylake | Microarch::SkylakeSp | Microarch::KabyLake => vec![
            CacheEvent { name: "L1 dcache miss", event: 0xD1, umask: 0x08 },
            CacheEvent { name: "L2 cache miss",  event: 0xD1, umask: 0x10 },
            CacheEvent { name: "L3 cache miss",  event: 0xD1, umask: 0x20 },
            CacheEvent { name: "dTLB load miss", event: 0x08, umask: 0x01 },
            CacheEvent { name: "iTLB miss",      event: 0x85, umask: 0x01 },
        ],
        Microarch::IceLake | Microarch::IceLakeSp | Microarch::RocketLake => vec![
            // Ice Lake changed some event encodings.
            // L3 miss (0xD1, umask 0x20) is not reliably available on all Ice Lake SKUs.
            // Check `perf list` on your system; if MEM_LOAD_RETIRED.L3_MISS is listed,
            // add it with event 0xD1, umask 0x20.
            CacheEvent { name: "L1 dcache miss", event: 0xD1, umask: 0x08 },
            CacheEvent { name: "L2 cache miss",  event: 0xD1, umask: 0x10 },
            CacheEvent { name: "dTLB load miss", event: 0x08, umask: 0x01 },
            CacheEvent { name: "iTLB miss",      event: 0x85, umask: 0x01 },
        ],
        _ => {
            // Fall back to the events that both lists above have in common.
            // These are model-specific encodings, not architectural events,
            // so verify them with `perf list` before trusting the numbers.
            vec![
                CacheEvent { name: "L1 dcache miss", event: 0xD1, umask: 0x08 },
                CacheEvent { name: "dTLB load miss", event: 0x08, umask: 0x01 },
                CacheEvent { name: "iTLB miss",      event: 0x85, umask: 0x01 },
            ]
        }
    }
}

Counting mode vs. sampling mode

We’ve been using counting mode: open a counter, enable it, and read the cumulative count every second. This gives you a metric — a number that describes what’s happening.
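Because the counter is cumulative, the per-second metric is the difference between two successive reads divided by the elapsed time. Here is a minimal sketch using only the standard library. `read_counter` assumes the counter was opened with `read_format = 0` as in the code earlier, in which case a read returns a single native-endian `u64`. `RateTracker` is a hypothetical helper name.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::mem::ManuallyDrop;
use std::os::unix::io::{FromRawFd, RawFd};
use std::time::Duration;

/// Read the current cumulative value of a counter opened with read_format = 0.
fn read_counter(fd: RawFd) -> io::Result<u64> {
    // Borrow the fd as a File without taking ownership: ManuallyDrop keeps
    // the File destructor from closing a descriptor this function does not own.
    let mut file = ManuallyDrop::new(unsafe { File::from_raw_fd(fd) });
    let mut buf = [0u8; 8];
    file.read_exact(&mut buf)?;
    Ok(u64::from_ne_bytes(buf))
}

/// Turns cumulative readings into per-second rates.
struct RateTracker {
    last: Option<u64>,
}

impl RateTracker {
    fn new() -> Self {
        RateTracker { last: None }
    }

    /// Events per second since the previous call; None on the first call
    /// (there is no earlier reading to diff against) or if no time elapsed.
    fn update(&mut self, current: u64, elapsed: Duration) -> Option<f64> {
        let prev = self.last.replace(current)?;
        let secs = elapsed.as_secs_f64();
        if secs <= 0.0 {
            return None;
        }
        Some(current.wrapping_sub(prev) as f64 / secs)
    }
}
```

One detail to watch with pinned counters: if the kernel cannot schedule a pinned counter onto the hardware, the counter enters an error state and reads return end-of-file. In this sketch that surfaces as an `UnexpectedEof` error from `read_counter`, which a collector should treat as "metric unavailable" and not as zero.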

Sampling mode is different: you set the counter to generate a sample (a trace event) every N events. The kernel writes each sample to a ring buffer that userspace reads. This gives you a profile — a stream of individual events that lets you see where the misses are happening.

For a monitoring dashboard, counting mode is what you want. You’re interested in “how many L3 misses per second” — not “which function is causing L3 misses.”

In short: counting mode gives metrics, sampling mode gives profiles. We’re building a metrics collector, so counting is the right tool.

Next: Part 6 — Scheduler Tracing with eBPF — Instrument the scheduler with tracepoints and measure runqueue wait times.

eBPF Performance Monitoring with Aya