Part 8 — Uncore IMC Bandwidth

The CPU cores get all the attention. The memory controller sits quietly in the corner of the same chip, and nobody thinks about it until memory bandwidth is saturated and everything stalls.

Modern server CPUs put the Integrated Memory Controller (IMC) on the same package as the cores, but outside the cores themselves. Intel calls this the uncore: the silicon on the chip that isn’t a CPU core. It includes the memory controller, the last-level cache, and the mesh interconnect that ties everything together. The uncore has its own performance counters. They count memory transactions, and from those counts you can derive memory bandwidth: how many bytes per second are flowing through the memory controller to DRAM.

If memory bandwidth is saturated, adding more CPU cores won’t help. The cores will stall waiting for memory. Monitoring IMC bandwidth tells you whether you’re approaching that ceiling.

What “uncore” means

On Intel, the chip is divided into two parts:

Core: the CPU cores (instruction execution, registers, and the private L1/L2 caches)
Uncore: everything else (the memory controller, the last-level cache (L3), and the mesh interconnect that links the cores to each other and to the rest of the uncore)

The uncore is shared across all cores on the socket. Its counters are accessible via perf_event_open but they live in a separate PMU (Performance Monitoring Unit) from the core PMU. You open them differently.

The IMC counters on Intel

Intel’s uncore IMC events are in the uncore_imc PMU. The type value is not fixed: the kernel assigns it when the PMU registers, so it varies between machines and kernel versions:

// Illustrative value only. The kernel assigns the PMU type dynamically,
// so it varies between machines and kernel versions.
// Don't hardcode it; read it from sysfs (see find_uncore_imc_type below).
const UNCORE_IMC_TYPE_EXAMPLE: u32 = 15; // example only; always read from sysfs

The reliable way to find the type is through sysfs:

ls /sys/bus/event_source/devices/

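Look for something like uncore_imc_0 or uncore_cha_0 (CHA is the Caching and Home Agent, also useful). On server parts the IMC PMUs typically appear as uncore_imc_<N>. The suffix N identifies a memory controller channel, not a socket, and each entry is a separate PMU with its own type value.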

cat /sys/bus/event_source/devices/uncore_imc_0/type

This prints the integer type value you need for perf_event_open.

The key IMC events (from the Intel perfmon repo, iMC unit). The encodings below match Skylake-SP and Cascade Lake; later generations such as Ice Lake use different umasks, so check the event list for your CPU:

CAS count (all): event 0x04, umask 0x0F. All CAS (Column Address Strobe) operations, the memory commands that transfer data. This is the most useful bandwidth proxy.
CAS read: event 0x04, umask 0x03. DRAM read CAS commands (includes underfill reads)
CAS write: event 0x04, umask 0x0C. DRAM write CAS commands
DRAM activate count: event 0x01, umask 0x01 for reads, umask 0x02 for writes. DRAM ACT (activate) commands, which open a row. This counts commands, not the cycles a rank was active.

When you open a raw uncore event with perf_event_open, the config field encodes both the event and umask: config = (umask << 8) | event. So CAS count all (event 0x04, umask 0x0F) becomes config = 0x0F04.
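The encoding can be sketched in a few lines. The helper names raw_config and parse_event_spec below are hypothetical. The second helper assumes the kernel publishes named events under the PMU's sysfs events directory as strings of the form event=0x04,umask=0x03, and it deliberately ignores any field other than event and umask.

```rust
/// Pack an uncore event code and unit mask into a perf_event_attr.config value.
/// Layout assumed here: event in bits 0-7, umask in bits 8-15.
fn raw_config(event: u8, umask: u8) -> u64 {
    ((umask as u64) << 8) | event as u64
}

/// Parse a sysfs-style event spec such as "event=0x04,umask=0x03".
/// Returns None if the event field is missing or a number fails to parse.
/// Fields other than event/umask are ignored (an assumption of this sketch).
fn parse_event_spec(spec: &str) -> Option<u64> {
    let mut event: Option<u8> = None;
    let mut umask: u8 = 0;
    for field in spec.trim().split(',') {
        let (key, value) = match field.split_once('=') {
            Some(kv) => kv,
            None => continue, // flag-style field without a value
        };
        // Values are treated as hexadecimal, with or without a 0x prefix.
        let digits = value.trim().trim_start_matches("0x");
        let parsed = u8::from_str_radix(digits, 16).ok();
        match key.trim() {
            "event" => event = Some(parsed?),
            "umask" => umask = parsed?,
            _ => {}
        }
    }
    event.map(|e| raw_config(e, umask))
}
```

Reading the spec from sysfs instead of hardcoding 0x0F04 keeps the monitor working on CPU generations whose umasks differ, as long as the kernel names the event you need.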

The CAS (Column Address Strobe) counter is the standard way to compute memory bandwidth. Each CAS operation transfers 64 bytes (one cache line). So:

bandwidth_bytes_per_sec = CAS_count * 64 / elapsed_seconds
bandwidth_gb_per_sec = CAS_count * 64 / 1_000_000_000 / elapsed_seconds

Opening uncore IMC counters

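Uncore counters have restrictions. They can only be opened:

With pid=-1 and an explicit cpu (0 or higher). Uncore events are system-wide and can't be attached to a task, and the combination pid=-1, cpu=-1 is rejected
By root, by a process with CAP_PERFMON (Linux 5.8 and later) or CAP_SYS_ADMIN, or by any user when kernel.perf_event_paranoid is 0 or lower
On a CPU that belongs to the socket you want to measure (socket-local CPUs only)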

On a multi-socket system, you open one fd per socket for each IMC PMU you want to read:

use std::fs;

fn find_uncore_imc_type() -> std::io::Result<u32> {
    // Find an uncore_imc PMU type value from sysfs.
    // Server CPUs expose one PMU per memory channel (uncore_imc_0,
    // uncore_imc_1, etc.), and each PMU has its own type value. This returns
    // the first one found, so it covers a single channel; for total
    // bandwidth, repeat the lookup for every uncore_imc_* entry.
    let entries = fs::read_dir("/sys/bus/event_source/devices/")?;
    for entry in entries.flatten() {
        let name = entry.file_name().into_string().unwrap_or_default();
        if name.starts_with("uncore_imc") {
            let type_path = entry.path().join("type");
            let type_str = fs::read_to_string(&type_path)?.trim().to_owned();
            return type_str.parse::<u32>()
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e));
        }
    }
    Err(std::io::Error::new(
        std::io::ErrorKind::NotFound,
        "uncore_imc PMU not found",
    ))
}

fn open_imc_counter(
    pmu_type: u32,
    config: u64,
    cpu: libc::c_int,
) -> std::io::Result<libc::c_int> {
    // Hand-rolled prefix of perf_event_attr; libc::perf_event_attr isn't
    // available on all platforms. The field ordering must match the kernel
    // struct exactly, and the struct must be at least PERF_ATTR_SIZE_VER0
    // (64 bytes) or the kernel rejects the call with E2BIG.
    #[repr(C)]
    struct PerfEventAttr {
        type_: u32,
        size: u32,
        config: u64,
        sample_period: u64,
        sample_type: u64,
        read_format: u64,
        flags: u64, // bit 0: disabled, bit 2: pinned
        wakeup_events: u32,
        bp_type: u32,
        config1: u64, // shares storage with bp_addr in the kernel struct
    }

    let attr = PerfEventAttr {
        type_: pmu_type,
        size: std::mem::size_of::<PerfEventAttr>() as u32, // 64
        config,
        sample_period: 0,
        sample_type: 0,
        read_format: 0,
        flags: 0b101, // disabled=1 (bit 0), pinned=1 (bit 2)
        wakeup_events: 0,
        bp_type: 0,
        config1: 0,
    };

    let fd = unsafe {
        // pid=-1: uncore PMUs are system-wide, they don't monitor a specific
        // process. The kernel requires pid=-1 for uncore events.
        // Arguments: attr, pid, cpu, group_fd, flags.
        libc::syscall(
            libc::SYS_perf_event_open,
            &attr as *const PerfEventAttr,
            -1 as libc::pid_t,
            cpu,
            -1 as libc::c_int,
            0 as libc::c_ulong,
        )
    };

    if fd < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(fd as libc::c_int)
    }
}
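Because server CPUs register a separate uncore_imc_N PMU per memory channel, a total-bandwidth monitor needs every PMU's type plus one CPU per socket. The following is a hedged sketch, not part of the original series: list_imc_pmus and parse_cpu_list are hypothetical helpers, the devices directory is a parameter so the function can be pointed at a test directory, and the PMU's cpumask file is assumed to use the kernel's usual list format (for example 0,24 or 0-3).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse a kernel CPU list such as "0,24" or "0-3,8" into CPU ids.
/// Malformed pieces are skipped rather than treated as errors.
fn parse_cpu_list(list: &str) -> Vec<i32> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',') {
        let part = part.trim();
        if let Some((lo, hi)) = part.split_once('-') {
            if let (Ok(lo), Ok(hi)) = (lo.parse::<i32>(), hi.parse::<i32>()) {
                cpus.extend(lo..=hi);
            }
        } else if let Ok(cpu) = part.parse::<i32>() {
            cpus.push(cpu);
        }
    }
    cpus
}

/// Collect (name, type) for every uncore_imc* PMU under `devices_dir`,
/// sorted by name. On a real system pass "/sys/bus/event_source/devices".
fn list_imc_pmus(devices_dir: &Path) -> io::Result<Vec<(String, u32)>> {
    let mut pmus = Vec::new();
    for entry in fs::read_dir(devices_dir)?.flatten() {
        let name = entry.file_name().to_string_lossy().into_owned();
        // Assumption: free-running IMC PMUs, where present, use a different
        // event set, so this sketch leaves them out.
        if !name.starts_with("uncore_imc") || name.contains("free_running") {
            continue;
        }
        let raw = fs::read_to_string(entry.path().join("type"))?;
        let pmu_type = raw
            .trim()
            .parse::<u32>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        pmus.push((name, pmu_type));
    }
    pmus.sort();
    Ok(pmus)
}
```

A monitor would then call open_imc_counter once for each (PMU type, socket CPU) pair and sum the CAS counts per socket.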

Computing memory bandwidth

fn compute_bandwidth_gbps(cas_count_delta: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    // 64 bytes per CAS operation (one cache line)
    (cas_count_delta as f64 * 64.0) / elapsed_secs / 1e9
}

Each CAS operation transfers one 64-byte cache line, so the math is straightforward.

The enable_counter and read_counter helper functions are the same ones from Part 3 — they call ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) and read(fd, &mut value) respectively.
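For readers who don't have Part 3 at hand, here is a minimal sketch of what those two helpers could look like. It is a reconstruction under assumptions, not the code from Part 3: ioctl and read are declared directly so the sketch has no crate dependency (with the libc crate you would call libc::ioctl and libc::read instead), the declarations assume Linux with glibc-style signatures, and the counter is assumed to be opened with read_format = 0 so a read returns a single 8-byte value.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong, c_void};

extern "C" {
    fn ioctl(fd: c_int, request: c_ulong, ...) -> c_int;
    fn read(fd: c_int, buf: *mut c_void, count: usize) -> isize;
}

// _IO('$', 0) from linux/perf_event.h: ('$' << 8) | 0.
const PERF_EVENT_IOC_ENABLE: c_ulong = 0x2400;

/// Start counting on a perf event fd that was opened with disabled=1.
fn enable_counter(fd: c_int) -> io::Result<()> {
    let rc = unsafe { ioctl(fd, PERF_EVENT_IOC_ENABLE, 0 as c_ulong) };
    if rc < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}

/// Read the current 64-bit counter value from a perf event fd.
fn read_counter(fd: c_int) -> io::Result<u64> {
    let mut value: u64 = 0;
    let n = unsafe { read(fd, &mut value as *mut u64 as *mut c_void, 8) };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n != 8 {
        // A short read means the fd did not deliver a full counter value.
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "short read from perf event fd",
        ));
    }
    Ok(value)
}
```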

The full polling loop

struct ImcBandwidth {
    pub socket: i32,
    pub bandwidth_gbps: f64,
    pub cas_count: u64,
    pub elapsed_secs: f64,
}

fn open_socket_imc_counters(
    pmu_type: u32,
    socket_cpus: &[i32],  // one CPU id per socket
) -> std::io::Result<Vec<(i32, i32)>> {
    // Open one CAS counter fd per socket for the given IMC PMU. Keep these
    // fds open for the lifetime of the monitor; don't open/close on every poll.
    //
    // socket_cpus: one CPU id per socket (any CPU local to that socket's
    // uncore works). On a single-socket system, this is [0]. On multi-socket
    // systems the PMU's cpumask file lists one CPU per socket:
    //   cat /sys/bus/event_source/devices/uncore_imc_0/cpumask
    // The NUMA node lists also work when each socket is a single node:
    //   cat /sys/devices/system/node/node0/cpulist  → socket 0 CPUs
    //   cat /sys/devices/system/node/node1/cpulist  → socket 1 CPUs
    let mut fds = Vec::new();

    for (socket, &cpu) in socket_cpus.iter().enumerate() {
        let fd = open_imc_counter(pmu_type, 0x0F04, cpu)?; // CAS count all (event 0x04, umask 0x0F)
        enable_counter(fd)?;
        fds.push((socket as i32, fd));
    }

    Ok(fds)
}

fn read_imc_counts(socket_fds: &[(i32, i32)]) -> std::io::Result<Vec<(i32, u64)>> {
    let mut results = Vec::new();

    for &(socket, fd) in socket_fds {
        let cas = read_counter(fd)?;
        results.push((socket, cas));
    }

    Ok(results)
}

fn poll_imc(
    socket_fds: &[(i32, i32)],
    prev_counts: &[(i32, u64)],
    interval_secs: f64, // time elapsed since prev_counts was read
) -> std::io::Result<Vec<ImcBandwidth>> {
    let current_counts = read_imc_counts(socket_fds)?;

    let mut result = Vec::new();
    // prev_counts and current_counts are in the same socket order.
    for (&(socket, prev_cas), &(_, curr_cas)) in prev_counts.iter().zip(&current_counts) {
        let delta = curr_cas.saturating_sub(prev_cas);
        result.push(ImcBandwidth {
            socket,
            bandwidth_gbps: compute_bandwidth_gbps(delta, interval_secs),
            cas_count: delta,
            elapsed_secs: interval_secs,
        });
    }

    Ok(result)
}
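A polling loop rarely wakes up exactly on schedule, so dividing by the nominal interval can skew the result. One way to handle that, sketched here under assumptions, is to timestamp each sample and divide by the measured gap. CasRateTracker is a hypothetical type, not part of the series; it tracks a single cumulative CAS counter and assumes 64 bytes per CAS as above.

```rust
use std::time::Instant;

/// Remembers the previous cumulative CAS count and when it was sampled.
struct CasRateTracker {
    last: Option<(u64, Instant)>,
}

impl CasRateTracker {
    fn new() -> Self {
        Self { last: None }
    }

    /// Feed the latest cumulative CAS count and its sample time. Returns the
    /// bandwidth in GB/s since the previous sample, or None on the first
    /// call or when no time has elapsed.
    fn update(&mut self, cas_count: u64, now: Instant) -> Option<f64> {
        // Store the new sample and take the old one in a single step.
        let prev = self.last.replace((cas_count, now));
        let (prev_count, prev_time) = prev?;
        let elapsed = now.duration_since(prev_time).as_secs_f64();
        if elapsed <= 0.0 {
            return None;
        }
        let delta = cas_count.saturating_sub(prev_count);
        // 64 bytes per CAS operation (one cache line).
        Some(delta as f64 * 64.0 / elapsed / 1e9)
    }
}
```

In a monitor you would keep one tracker per open fd and call update(read_counter(fd)?, Instant::now()) on each poll.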

Virtualization caveat

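You can’t read uncore counters inside a virtual machine (in most cases). The guest doesn’t have direct access to the IMC hardware. In many guests the uncore_imc PMUs never show up in sysfs at all, so the type lookup fails first. Where a PMU does appear and you try to open an uncore event from inside a guest, you typically get EPERM or EINVAL.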

If you’re building a monitoring agent that runs inside VMs, skip uncore IMC reading — it’s not available. For bare-metal hosts, it’s one of the most useful performance signals.
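An agent that runs on both bare metal and VMs can treat these failures as "collector not supported" instead of a fatal error. The sketch below is one possible classification, not an established convention: is_uncore_unavailable is a hypothetical helper, the errno values are the Linux ones, and the choice of which errors count as "unavailable" is an assumption you may want to tighten.

```rust
use std::io;

// Linux errno values (errno.h).
const EPERM: i32 = 1;
const ENOENT: i32 = 2;
const EACCES: i32 = 13;
const ENODEV: i32 = 19;
const EINVAL: i32 = 22;

/// Returns true when an error from the sysfs lookup or from perf_event_open
/// suggests the uncore IMC counters can't be used on this machine.
/// Note: EPERM and EACCES can also mean missing privileges on bare metal,
/// so log the original error before disabling the collector.
fn is_uncore_unavailable(err: &io::Error) -> bool {
    matches!(
        err.raw_os_error(),
        Some(EPERM | ENOENT | EACCES | ENODEV | EINVAL)
    ) || err.kind() == io::ErrorKind::NotFound
}
```

The NotFound check also covers the "uncore_imc PMU not found" error that find_uncore_imc_type returns when sysfs has no matching entry.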

AMD alternative: Data Fabric counters

AMD’s equivalent of the Intel IMC counters is the set of Data Fabric (DF) performance counters. The Data Fabric is the interconnect that links AMD CPU cores, memory controllers, and I/O, analogous to Intel’s mesh + uncore. DF counters track memory bandwidth through the same perf_event_open interface, but the PMU names and event encodings are different.

Look for uncore_data_fabric or amd_df PMUs in sysfs:

ls /sys/bus/event_source/devices/
# Look for uncore_data_fabric or amd_df on AMD systems

AMD also has Instruction-Based Sampling (IBS) — a different approach that periodically samples the instruction stream and reports details about memory operations. IBS is more useful for profiling (“where are the loads that miss in L3?”) than for bandwidth monitoring. If you need per-instruction latency data, IBS is the tool. If you need aggregate bandwidth, use the DF counters.

IBS events are in the ibs_op and ibs_fetch PMUs on AMD. These are not available on Intel.

Next: Part 9 — Thermal Monitoring — Read thermal zones from sysfs and compute headroom before throttling.

eBPF Performance Monitoring with Aya