Part 3 — Hardware PMCs with perf_event_open

The CPU has hardware counters on-die. The standard way to read them on Linux is perf_event_open.

What PMCs Are

Performance Monitoring Counters — PMCs — are tiny registers inside the CPU chip. They count specific microarchitectural events: a cache line loaded from L1, a branch instruction resolved, a TLB walk performed. Every modern x86 and ARM processor has them.

Each counter simply accumulates: every occurrence of its event increments it by one. You open a file descriptor for a specific event, then read the counter value from that file descriptor. That’s the whole interface.

The events are defined by a type and a config. On x86, type is usually PERF_TYPE_HARDWARE (0) or PERF_TYPE_RAW (4). The generic hardware events (instructions retired, CPU cycles, cache references and misses, branch instructions and mispredicts) live in the hardware type and keep the same config value on every CPU. The raw events (misses at a specific cache level, TLB walks, stall breakdowns) live in the raw type, and their event numbers vary by CPU microarchitecture. That’s why Part 4 matters: before you can open a raw counter, you need to know what CPU you’re on.
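As a rough illustration of the type/config pairing, here is a minimal sketch. The constant values are the ones defined in linux/perf_event.h; the EventSpec type and the generic and x86_raw helpers are hypothetical names introduced only for this example. The raw encoding shown (event select in bits 0-7, unit mask in bits 8-15) is the x86 layout and does not carry over to other architectures.

```rust
/// Values of `perf_event_attr.type` used in this part (linux/perf_event.h).
pub const PERF_TYPE_HARDWARE: u32 = 0;
pub const PERF_TYPE_RAW: u32 = 4;

/// Generic hardware event ids, used as `config` with PERF_TYPE_HARDWARE.
pub const PERF_COUNT_HW_CPU_CYCLES: u64 = 0;
pub const PERF_COUNT_HW_INSTRUCTIONS: u64 = 1;

/// Hypothetical helper type: one (type, config) pair identifying an event.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EventSpec {
    pub type_: u32,
    pub config: u64,
}

/// A generic event: the same config value works on every CPU that supports it.
pub fn generic(config: u64) -> EventSpec {
    EventSpec { type_: PERF_TYPE_HARDWARE, config }
}

/// An x86 raw event: event select in bits 0-7, unit mask in bits 8-15.
/// The meaning of a given pair depends on the microarchitecture.
pub fn x86_raw(event_select: u8, umask: u8) -> EventSpec {
    EventSpec {
        type_: PERF_TYPE_RAW,
        config: (u64::from(umask) << 8) | u64::from(event_select),
    }
}
```

For instance, x86_raw(0x2E, 0x41) yields config 0x412E, which is Intel’s architectural last-level-cache-miss event; most other raw encodings have to be looked up per microarchitecture.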

The syscall

perf_event_open is a Linux-specific syscall. Here’s its signature as documented in the perf_event_open(2) man page:

int perf_event_open(
    struct perf_event_attr *attr,  // what to count
    pid_t pid,                     // which task to measure (0 = calling thread, -1 = every task on `cpu`)
    int cpu,                       // which CPU (-1 = any CPU the task runs on)
    int group_fd,                  // group leader fd (-1 = new group)
    unsigned long flags            // PERF_FLAG_FD_CLOEXEC etc.
);

Returns a file descriptor on success, or -1 on error.

The perf_event_attr struct is the interesting part. Here’s the relevant subset from the kernel headers (include/uapi/linux/perf_event.h):

struct perf_event_attr {
    __u32 type;              // PERF_TYPE_HARDWARE, PERF_TYPE_RAW, etc.
    __u32 size;              // sizeof(struct perf_event_attr)
    __u64 config;            // which specific event
    union {
        __u64 sample_period; // sample every N events (for sampling mode)
        __u64 sample_freq;   // target sample frequency (for sampling mode)
    };
    __u64 sample_type;       // what gets written to the sample buffer
    __u64 read_format;       // format for reading counter values
    __u64 flags;             // disabled, pinned, inherit, etc.
};

For counting mode (what we’ll use), you set the union to 0: both sample_period and sample_freq are zero, meaning no sampling. The kernel won’t generate PMIs (Performance Monitoring Interrupts), and you read the counter value directly from the file descriptor. For profiling mode, you either set sample_period to take a sample every N events, or set sample_freq to a target sample rate (e.g., 1000 Hz) and also set the freq flag bit so the kernel interprets the union as a frequency.
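The three modes can be sketched as a small mapping from intent to field values. This is an illustration only: the Mode enum and sampling_fields function are hypothetical names, and FLAG_FREQ assumes the LSB-first bitfield layout of perf_event_attr on little-endian targets, where freq is the eleventh flag bit (bit 10).

```rust
/// Bit position of the `freq` flag in the perf_event_attr bitfield,
/// assuming LSB-first allocation (little-endian targets such as x86-64).
pub const FLAG_FREQ: u64 = 1 << 10;

/// Hypothetical description of how a counter should be used.
#[derive(Debug, Clone, Copy)]
pub enum Mode {
    /// Plain counting: no samples, read the value from the fd.
    Counting,
    /// Take one sample every `n` events.
    EveryNEvents(u64),
    /// Ask the kernel to aim for roughly `hz` samples per second.
    TargetHz(u64),
}

/// Returns (value for the sample_period/sample_freq union, flag bits to OR in).
pub fn sampling_fields(mode: Mode) -> (u64, u64) {
    match mode {
        Mode::Counting => (0, 0),
        Mode::EveryNEvents(n) => (n, 0),
        // The union holds a frequency only when the freq bit is set.
        Mode::TargetHz(hz) => (hz, FLAG_FREQ),
    }
}
```

Everything in this part uses Mode::Counting, which leaves both the union and the freq bit at zero.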

The hand-rolled Rust struct in the working code at the end of this part uses a single sample_period field instead of the union. Since we’re in counting mode, the value is always 0 and one field is enough. The struct also sets size, which the kernel requires, and packs the flag bits into one integer.

Rust bindings

There’s a perf-event crate that wraps perf_event_open with a safe interface, but it doesn’t support all the raw PMC event types we need, and libc::perf_event_attr isn’t available on all platforms. So we hand-roll the struct and call the syscall directly via libc.

On Linux, perf_event_open is syscall number 298 on x86_64. The number varies by architecture: 241 on ARM64 and RISC-V, 319 on PowerPC. The code uses libc::SYS_perf_event_open, which resolves to the right number per architecture. There is no libc::perf_event_open to call, because glibc provides no wrapper for this syscall; it has to go through the generic syscall function.

The full open_pmc function is in the minimal working example at the end of this part. The next few sections show the API surface — what the events are, how to read counters, how to compute useful metrics — and then we bring it all together.

Opening a counter for instructions and cycles

The hardware events live in PERF_TYPE_HARDWARE. The two most widely supported ones are PERF_COUNT_HW_INSTRUCTIONS and PERF_COUNT_HW_CPU_CYCLES. Virtually every CPU with a PMU exposes them, though a virtual machine without PMU passthrough may not:

// Both calls use open_pmc from the working example at the end of this part.
let instr_fd = open_pmc(
    libc::PERF_TYPE_HARDWARE as u32,
    libc::PERF_COUNT_HW_INSTRUCTIONS as u64,
    0,  // attach to the calling thread
    -1, // on whichever CPU it runs
)?;

let cycles_fd = open_pmc(
    libc::PERF_TYPE_HARDWARE as u32,
    libc::PERF_COUNT_HW_CPU_CYCLES as u64,
    0,
    -1,
)?;

Reading and resetting

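Reading from the file descriptor returns the current counter value. Because open_pmc sets the disabled flag, the counter starts disabled; you enable it with an ioctl call:

// Reads the count. With read_format = 0, a successful read is exactly one u64.
fn read_counter(fd: i32) -> std::io::Result<u64> {
    let mut val: u64 = 0;
    let n = unsafe {
        libc::read(fd, &mut val as *mut _ as *mut libc::c_void, 8)
    };
    if n < 0 {
        Err(std::io::Error::last_os_error())
    } else if n != 8 {
        // A short read (including EOF) means the counter is unusable, for
        // example a pinned event that could not be scheduled on the PMU.
        Err(std::io::Error::new(
            std::io::ErrorKind::UnexpectedEof,
            "short read from perf counter",
        ))
    } else {
        Ok(val)
    }
}

fn enable_counter(fd: i32) -> std::io::Result<()> {
    // arg=0 enables this counter only. PERF_IOC_FLAG_GROUP would enable
    // all counters in the same group, but we aren't using groups here.
    let rc = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0) };
    if rc < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}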

To reset the counter to zero, use ioctl with PERF_EVENT_IOC_RESET:

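// fd is a counter opened with open_pmc. Only the count is zeroed.
let rc = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_RESET, 0) };
if rc < 0 {
    return Err(std::io::Error::last_os_error());
}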
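The read path above assumes read_format = 0, so each read returns a single u64. If a counter is not pinned and the kernel multiplexes it, the usual remedy is to request the enabled and running times in read_format and scale the raw count. The sketch below shows only the parsing and scaling arithmetic; TimedRead, parse_timed_read, and scaled_count are hypothetical names, and the 24-byte layout (value, time_enabled, time_running) is the one the perf_event_open man page describes for a non-group read with both time flags set.

```rust
/// read_format bits from linux/perf_event.h.
pub const PERF_FORMAT_TOTAL_TIME_ENABLED: u64 = 1 << 0;
pub const PERF_FORMAT_TOTAL_TIME_RUNNING: u64 = 1 << 1;

/// Hypothetical holder for one read with both time flags set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimedRead {
    pub value: u64,
    pub time_enabled: u64, // ns the event was enabled
    pub time_running: u64, // ns the event was actually on the PMU
}

/// Splits the 24 bytes returned by read() into three native-endian u64 words.
pub fn parse_timed_read(buf: &[u8; 24]) -> TimedRead {
    let word = |i: usize| {
        let mut w = [0u8; 8];
        w.copy_from_slice(&buf[i * 8..i * 8 + 8]);
        u64::from_ne_bytes(w)
    };
    TimedRead {
        value: word(0),
        time_enabled: word(1),
        time_running: word(2),
    }
}

/// Estimates the full count as value * enabled / running.
/// Returns None if the event never ran, since no estimate is possible.
pub fn scaled_count(r: &TimedRead) -> Option<u64> {
    if r.time_running == 0 {
        return None;
    }
    // Widen to u128 so the multiplication cannot overflow.
    let scaled = u128::from(r.value) * u128::from(r.time_enabled) / u128::from(r.time_running);
    Some(scaled as u64)
}
```

A counter that was on the PMU for half of the enabled time has its raw count doubled. The result is an estimate, which is why the working example pins its counters instead.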

Computing IPC and stall ratio

Instructions per cycle (IPC) tells you how much useful work the CPU is doing per clock tick. A healthy compute-bound workload might hit 3-4 IPC on a modern out-of-order core. A memory-bound workload stalls frequently and might hit 0.5 IPC.

fn compute_ipc(instructions: u64, cycles: u64) -> f64 {
    if cycles == 0 {
        return 0.0;
    }
    instructions as f64 / cycles as f64
}

Stall ratio is the fraction of cycles where the core wasn’t retiring instructions. This happens when the core is waiting on memory, a branch mispredict, or any other pipeline stall.

The key insight: when the pipeline stalls, the instruction counter slows down even though cycles keep ticking. Here’s the catch: modern out-of-order cores can retire multiple instructions per cycle (4-6 on recent x86), so cycles - instructions doesn’t directly give you stalled cycles. What it gives you is a lower bound on stalls: if fewer instructions than cycles were counted, at least that many cycles retired nothing. When IPC exceeds 1 the difference is negative, and clamping it to zero hides that case entirely.

fn compute_stall_ratio(instructions: u64, cycles: u64) -> f64 {
    if cycles == 0 {
        return 0.0; // no cycles counted, nothing to report
    }
    // When IPC > 1, the core retires more than 1 instruction per cycle on
    // average, and this estimate reports zero stalls.
    // When IPC < 1, the gap is a lower bound on stalled cycles.
    // This underestimates stalls when the core retires multiple instructions
    // in some cycles and zero in others (the average IPC hides the stall bursts).
    // saturating_sub clamps the difference at zero instead of underflowing.
    let stalled = cycles.saturating_sub(instructions) as f64;
    stalled / cycles as f64
}

This is a simplification in two ways. First, it can’t detect stall bursts hidden by a high average IPC. Second, some cycles are legitimately empty (no instructions ready, no work to do). For a more accurate stall ratio, you’d read dedicated stall counters: the generic PERF_COUNT_HW_STALLED_CYCLES_FRONTEND and PERF_COUNT_HW_STALLED_CYCLES_BACKEND events where the PMU supports them, or vendor-specific raw events (e.g., INT_MISC.RECOVERY_CYCLES on Intel, which covers one class of stall). But as a quick health check from two counters alone, it’s a useful signal.
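The first limitation is easy to see with numbers. The sketch below is illustrative only (the Interval alias and both function names are made up for this example): it compares the lower bound computed from totals with the one computed per sampling interval and then summed.

```rust
/// One sampling interval: (instructions retired, cycles elapsed).
pub type Interval = (u64, u64);

/// Lower bound on stalled cycles for one pair of counts.
pub fn stall_lower_bound(instructions: u64, cycles: u64) -> u64 {
    cycles.saturating_sub(instructions)
}

/// Stall ratio computed from the totals over all intervals.
pub fn aggregate_ratio(intervals: &[Interval]) -> f64 {
    let instr: u64 = intervals.iter().map(|&(i, _)| i).sum();
    let cycles: u64 = intervals.iter().map(|&(_, c)| c).sum();
    if cycles == 0 {
        return 0.0;
    }
    stall_lower_bound(instr, cycles) as f64 / cycles as f64
}

/// Stall ratio computed per interval, then summed. Shorter intervals can
/// only raise the bound, because a burst no longer averages out a stall.
pub fn per_interval_ratio(intervals: &[Interval]) -> f64 {
    let cycles: u64 = intervals.iter().map(|&(_, c)| c).sum();
    if cycles == 0 {
        return 0.0;
    }
    let stalled: u64 = intervals
        .iter()
        .map(|&(i, c)| stall_lower_bound(i, c))
        .sum();
    stalled as f64 / cycles as f64
}
```

Take two intervals of 1000 cycles each, one retiring 4000 instructions and one retiring none. The totals give an IPC of 2 and a stall ratio of 0, while the per-interval bound reports 0.5. Sampling more often tightens the estimate but never makes it exact.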

Per-process vs. system-wide monitoring

The pid parameter controls what gets counted. The cpu parameter controls where. Getting these wrong gives you the right numbers for the wrong thing.

Per-process (pid=0, cpu=-1): Open the counter with pid=0 and you measure the calling thread as it runs on any CPU. The kernel follows it across CPU migrations and accumulates the count. Other threads and child processes are not counted unless the inherit flag is set before they are created.

let cycles_fd = open_pmc(libc::PERF_TYPE_HARDWARE as u32, libc::PERF_COUNT_HW_CPU_CYCLES as u64, 0, -1)?;

This is what the example program uses. It works for measuring the IPC of a workload that runs inside the monitoring tool itself (or, with inherit set, in children it spawns afterwards). It does not measure the entire system: if your process is mostly sleeping (waiting for the next read interval), the counter values will be near zero.

System-wide (pid=-1, per-CPU): Open a counter for each CPU with pid=-1 to measure all processes on that CPU. The kernel requires a specific cpu number when pid=-1 (passing cpu=-1 with pid=-1 returns EINVAL):

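// Opens one system-wide counter per online CPU and returns (cpu, fd) pairs.
fn open_all_cpus(type_: u32, config: u64) -> std::io::Result<Vec<(i32, i32)>> {
    // Assumes CPU ids are contiguous from 0. If some CPUs are offline,
    // parse /sys/devices/system/cpu/online instead.
    let n = unsafe { libc::sysconf(libc::_SC_NPROCESSORS_ONLN) };
    if n < 1 {
        return Err(std::io::Error::last_os_error());
    }
    let mut fds = Vec::with_capacity(n as usize);
    for cpu in 0..n as i32 {
        // pid=-1: system-wide. Measures all processes on this CPU.
        let fd = open_pmc(type_, config, -1, cpu)?;
        fds.push((cpu, fd));
    }
    Ok(fds)
}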

Read each fd and aggregate in userspace for a total. Or keep them separate for per-CPU granularity — a CPU with unusually high cache misses might be running a memory-bound workload pinned to that core.
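A minimal sketch of that aggregation step, assuming the per-CPU values have already been read into (cpu, count) pairs. The summarize function is a hypothetical helper, not part of the example program.

```rust
/// Takes per-CPU readings as (cpu id, counter value) pairs and returns the
/// system-wide total plus the CPU with the highest count, if any.
pub fn summarize(per_cpu: &[(i32, u64)]) -> (u64, Option<i32>) {
    let total: u64 = per_cpu.iter().map(|&(_, v)| v).sum();
    let busiest = per_cpu
        .iter()
        .max_by_key(|&&(_, v)| v)
        .map(|&(cpu, _)| cpu);
    (total, busiest)
}
```

Feeding it deltas rather than cumulative values gives a per-interval picture, which is usually what a monitoring tool wants to display.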

For a system monitoring tool, pid=-1 is almost always what you want. The example program uses pid=0 for simplicity — a single fd, no aggregation loop — but a real deployment should switch to pid=-1 with per-CPU fds. Part 8 uses pid=-1 for uncore IMC counters, which are inherently system-wide.
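The pid/cpu combinations in this section can be summarized in a small classifier. This is a sketch with hypothetical names (Scope, classify) that mirrors the combinations listed in the perf_event_open man page. It only describes what a given pair would measure and does not call the kernel.

```rust
/// What a (pid, cpu) pair passed to perf_event_open would measure.
#[derive(Debug, PartialEq, Eq)]
pub enum Scope {
    /// One task, followed across every CPU it runs on (pid = 0 means the caller).
    TaskAnyCpu { pid: i32 },
    /// One task, counted only while it runs on the given CPU.
    TaskOnCpu { pid: i32, cpu: i32 },
    /// Every task on the given CPU (system-wide; needs elevated privileges).
    AllTasksOnCpu { cpu: i32 },
}

pub fn classify(pid: i32, cpu: i32) -> Result<Scope, &'static str> {
    match (pid, cpu) {
        (-1, -1) => Err("pid=-1 with cpu=-1 is invalid; open one fd per CPU"),
        (p, c) if p < -1 || c < -1 => Err("pid and cpu must be -1 or non-negative"),
        (-1, c) => Ok(Scope::AllTasksOnCpu { cpu: c }),
        (p, -1) => Ok(Scope::TaskAnyCpu { pid: p }),
        (p, c) => Ok(Scope::TaskOnCpu { pid: p, cpu: c }),
    }
}
```

The example program corresponds to classify(0, -1), and open_all_cpus to classify(-1, cpu) for each CPU.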

Error handling

Two errors are common:

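EPERM or EACCES: The syscall returns one of these when the calling process lacks the required privileges. What an unprivileged process may do depends on /proc/sys/kernel/perf_event_paranoid; system-wide counting and kernel-mode counting generally require CAP_SYS_ADMIN (root) or CAP_PERFMON (Linux 5.8+, a narrower capability). If you hit this in a container, the host may need to grant the capability.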

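EINVAL or ENOENT: The event you asked for isn’t available on this CPU, or a field in the attr struct is invalid. ENOENT is typical for an unsupported generic event, for example inside a VM without a virtual PMU. Raw PMC events in particular vary by microarchitecture: the same event number might mean “cache miss” on Skylake and be undefined on Ice Lake, and the kernel does not always reject an undefined raw encoding. This is why Part 4 exists: detect the CPU first, then select events.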
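One way to surface these cases to a user is to map the errno from a failed open to a short hint. The sketch below is illustrative: open_error_hint is a hypothetical helper, the errno values are the Linux ones hard-coded so the example needs no external crate, and the EACCES and ENOENT arms are additions based on the perf_event_open man page.

```rust
use std::io;

// Linux errno values (identical on x86_64 and ARM64).
const EPERM: i32 = 1;
const ENOENT: i32 = 2;
const EACCES: i32 = 13;
const EINVAL: i32 = 22;

/// Returns a human-readable hint for an error from perf_event_open.
pub fn open_error_hint(err: &io::Error) -> &'static str {
    match err.raw_os_error() {
        Some(EPERM) | Some(EACCES) => {
            "insufficient privileges: run as root, grant CAP_PERFMON, \
             or lower /proc/sys/kernel/perf_event_paranoid"
        }
        Some(ENOENT) => "event not supported by this PMU (common inside VMs)",
        Some(EINVAL) => "invalid attr or event encoding for this CPU",
        Some(_) => "unexpected errno; see the perf_event_open man page",
        None => "not an OS error",
    }
}
```

A caller would typically log the hint alongside the original error, for example when open_pmc returns Err(e).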

A minimal working example

Here’s a complete program that opens an instructions counter and a cycles counter, enables them, reads them once per second, and prints per-second IPC. This brings together the open_pmc, read_counter, and enable_counter functions from the sections above:

use std::io;
use std::thread;
use std::time::Duration;

fn open_pmc(type_: u32, config: u64, pid: i32, cpu: i32) -> io::Result<i32> {
    // Hand-rolled perf_event_attr (libc::perf_event_attr is not available
    // on all platforms). The field ordering must match the kernel struct
    // exactly, and the struct must be at least 64 bytes (PERF_ATTR_SIZE_VER0):
    // the kernel rejects a smaller `size` with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64, // 0 for counting mode
        sample_type: u64,
        read_format: u64,
        flags: u64,         // bit 0: disabled, bit 2: pinned
        wakeup_events: u32, // unused in counting mode
        bp_type: u32,       // unused here
        config1: u64,       // unused here (bp_addr/config1 union)
    }

    // The flags field is a C bitfield packed into a u64. On x86-64,
    // GCC/Clang allocate bits from LSB to MSB, so:
    //   bit 0 = disabled (start counter in disabled state)
    //   bit 1 = inherit  (children inherit the counter; not set here)
    //   bit 2 = pinned   (counter must stay on the PMU; prevents multiplexing)
    // We set disabled so we can enable the counter explicitly via ioctl.
    // We set pinned because counting mode needs the counter scheduled
    // at all times. Without pinned, the kernel may multiplex the counter
    // on busy systems, and the raw count would then need scaling.
    // If a pinned counter cannot be scheduled, reads return EOF instead.
    let attr = PerfEventAttr {
        type_,
        size: std::mem::size_of::<PerfEventAttr>() as u32, // 64
        config,
        sample_period: 0, // counting mode: no sampling, read counter directly
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };

    let fd = unsafe {
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const _,
            pid,
            cpu,
            -1, // no group leader
            0 as libc::c_ulong, // no flags (the kernel expects an unsigned long)
        )
    };

    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(fd as i32)
}

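fn read_counter(fd: i32) -> io::Result<u64> {
    let mut val: u64 = 0;
    let n = unsafe {
        libc::read(fd, &mut val as *mut _ as *mut libc::c_void, 8)
    };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else if n != 8 {
        // EOF or a short read: e.g. a pinned event that could not be scheduled.
        Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "short read from perf counter",
        ))
    } else {
        Ok(val)
    }
}

fn enable_counter(fd: i32) -> io::Result<()> {
    // arg=0: enable this counter (not a group)
    let rc = unsafe { libc::ioctl(fd, libc::PERF_EVENT_IOC_ENABLE, 0) };
    if rc < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}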

fn main() -> io::Result<()> {
    let instr_fd = open_pmc(
        libc::PERF_TYPE_HARDWARE as u32,
        libc::PERF_COUNT_HW_INSTRUCTIONS as u64,
        0,  // pid=0: measure this process (not system-wide; see "Per-process vs. system-wide" above)
        -1, // follow this process across all CPUs
    )?;
    let cycles_fd = open_pmc(
        libc::PERF_TYPE_HARDWARE as u32,
        libc::PERF_COUNT_HW_CPU_CYCLES as u64,
        0,  // pid=0: measure this process
        -1, // follow this process across all CPUs
    )?;

    enable_counter(instr_fd)?;
    enable_counter(cycles_fd)?;

    let mut prev_instr = 0u64;
    let mut prev_cycles = 0u64;

    loop {
        thread::sleep(Duration::from_secs(1));

        let instr = read_counter(instr_fd)?;
        let cycles = read_counter(cycles_fd)?;

        let instr_delta = instr - prev_instr;
        let cycles_delta = cycles - prev_cycles;

        let ipc = if cycles_delta > 0 {
            instr_delta as f64 / cycles_delta as f64
        } else {
            0.0
        };

        println!("instructions={instr_delta} cycles={cycles_delta} ipc={ipc:.3}");

        prev_instr = instr;
        prev_cycles = cycles;
    }
}

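Run it as root, for example with sudo cargo run. It can also run unprivileged if /proc/sys/kernel/perf_event_paranoid is set to 1 or lower; at the default level of 2 the kernel rejects it, because the attr does not exclude kernel-mode counting. The counter values are cumulative, so the program computes deltas between reads to get per-second IPC.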
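The delta bookkeeping in the loop can be factored into a small helper. This is a sketch with hypothetical names (DeltaTracker, ipc), not a change to the program above. It uses wrapping_sub so that a counter reset between reads produces a large but well-defined delta instead of a panic in debug builds.

```rust
/// Remembers the previous cumulative value of one counter.
#[derive(Debug, Default)]
pub struct DeltaTracker {
    prev: u64,
}

impl DeltaTracker {
    pub fn new() -> Self {
        Self { prev: 0 }
    }

    /// Records `now` and returns how much the counter advanced since the
    /// previous call (or since zero on the first call).
    pub fn update(&mut self, now: u64) -> u64 {
        let delta = now.wrapping_sub(self.prev);
        self.prev = now;
        delta
    }
}

/// Instructions per cycle for one interval; 0.0 if no cycles were counted.
pub fn ipc(instr_delta: u64, cycles_delta: u64) -> f64 {
    if cycles_delta == 0 {
        0.0
    } else {
        instr_delta as f64 / cycles_delta as f64
    }
}
```

With one tracker per fd, the loop body reduces to two update calls and one ipc call per interval.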

Next: Part 4 — CPU Microarchitecture Detection — Before you open any raw PMC, figure out what CPU you’re running on and pick the right event numbers.


eBPF Performance Monitoring with Aya