Part 7 — NUMA and Memory Metrics

On a NUMA system, memory access has a cost that depends on which socket you’re on.

NUMA stands for Non-Uniform Memory Access. On a multi-socket server, each CPU socket has its own local memory. When a task running on socket A accesses memory attached to socket B, the data has to travel across the inter-socket interconnect. That takes longer than accessing local memory.

If your workload is hitting 80% remote memory access, you’re paying the interconnect tax on most of your memory traffic. That’s a NUMA problem.

The basics in plain language

A node in Linux’s NUMA vocabulary is a group of CPUs and memory that are physically close. On a dual-socket system, you typically have node 0 and node 1. Each node has its own local memory. Memory attached to node 0 is “local” to socket 0’s CPUs, and “remote” to socket 1’s CPUs.

Linux has automatic NUMA balancing (controlled by the kernel.numa_balancing sysctl) that migrates pages between nodes at runtime, and can move tasks as well, to keep tasks running close to their data. Every page it moves shows up in the migration counters. A persistently high migration rate is a sign that tasks are bouncing between sockets.
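Before reading much into the migration counters, it helps to know whether automatic NUMA balancing is switched on at all. The sketch below assumes the kernel exposes the setting at /proc/sys/kernel/numa_balancing (it is absent on kernels built without NUMA balancing) and treats any non-zero value as "enabled in some mode". The function names are illustrative, not part of the original chapter.

```rust
use std::fs;

/// Interpret the text of /proc/sys/kernel/numa_balancing.
/// 0 means disabled; any other number is treated as enabled.
/// Returns None if the text is not a number.
fn parse_numa_balancing(content: &str) -> Option<bool> {
    content.trim().parse::<u32>().ok().map(|v| v != 0)
}

/// Read the live setting. Returns None if the file is missing or unreadable,
/// which is expected on kernels without NUMA balancing support.
fn numa_balancing_enabled() -> Option<bool> {
    let content = fs::read_to_string("/proc/sys/kernel/numa_balancing").ok()?;
    parse_numa_balancing(&content)
}
```

If numa_balancing_enabled() returns Some(false) or None, page migrations in /proc/vmstat are coming from something other than the balancer.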

/proc/vmstat — the key file

/proc/vmstat is a flat list of virtual memory statistics. Most of them are for kernel internals, but a few are relevant for NUMA:

numa_hit           12345678   // allocations that succeeded on the node they were intended for
numa_miss          234567     // allocations served by a node other than the intended one
numa_foreign       12345      // allocations intended for a node but served by another node
pgmigrate_success  98765      // pages successfully migrated
pgmigrate_fail     123        // pages that failed to migrate

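The key fields:

numa_hit: pages allocated on the node the allocation was intended for (the good case)
numa_miss: pages allocated on a node other than the intended one, typically because the intended node was short on free memory (the bad case)
pgmigrate_success: how many pages were successfully migrated. This counts migrations for any reason, including compaction, so it is an upper bound on NUMA-driven migration.

The NUMA remote ratio is:

remote_ratio = numa_miss / (numa_hit + numa_miss)

These counters track page allocations, not individual loads and stores, so the ratio is a proxy for remote access rather than a direct measurement. A remote ratio above 0.2 (20%) means more than a fifth of page allocations landed on a node other than the intended one, and that memory is likely to be accessed across the interconnect. This is worth optimizing.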
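The ratio is more useful over a sampling interval than over the whole uptime, because the counters are cumulative since boot. A minimal sketch, assuming two readings of numa_hit and numa_miss taken some time apart (the function name and the choice to return None for an idle interval are illustrative):

```rust
/// Fraction of page allocations in one interval that were counted as
/// numa_miss. Returns None when nothing was allocated in the interval,
/// so callers can tell "no data" apart from "0% remote".
fn remote_ratio(prev_hit: u64, prev_miss: u64, curr_hit: u64, curr_miss: u64) -> Option<f64> {
    // saturating_sub guards against a counter that went backwards.
    let hit = curr_hit.saturating_sub(prev_hit);
    let miss = curr_miss.saturating_sub(prev_miss);
    let total = hit.saturating_add(miss);
    if total == 0 {
        return None;
    }
    Some(miss as f64 / total as f64)
}
```

For example, 80 new hits and 20 new misses in an interval give a ratio of 0.2, right at the threshold mentioned above.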

/sys/devices/system/node/nodeN/meminfo — per-node memory

Per-node memory breakdown lives in sysfs, not procfs. Each NUMA node has a meminfo file:

cat /sys/devices/system/node/node0/meminfo

The output has one line per field, each prefixed with the node number. The lines relevant here look like:

Node 0 AnonPages:       12345678 kB
Node 0 FilePages:        2345678 kB
Node 0 Shmem:              12345 kB
Node 0 HugePages_Total:       16

Each line breaks down memory by type:

AnonPages: anonymous memory (heap, stack, not backed by a file)
FilePages: page cache (file-backed memory)
Shmem: shared memory (including tmpfs)
HugePages_Total: number of hugetlb pages in this node's pool (a page count, not a size)

Values with a kB suffix are in KiB. To get bytes: multiply by 1024.

Reading vmstat

use std::fs;

/// Cumulative NUMA counters, as reported since boot.
#[derive(Debug, Default, Clone)]
pub struct NumaStats {
    pub numa_hit: u64,
    pub numa_miss: u64,
    pub numa_foreign: u64,
    pub pgmigrate_success: u64,
    pub pgmigrate_fail: u64,
}

/// Read the NUMA-related counters from /proc/vmstat.
fn read_vmstat() -> std::io::Result<NumaStats> {
    let content = fs::read_to_string("/proc/vmstat")?;
    let mut stats = NumaStats::default();

    for line in content.lines() {
        // Each line is "<name> <value>"; missing or malformed values read as 0.
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "numa_hit" => stats.numa_hit = value,
            "numa_miss" => stats.numa_miss = value,
            "numa_foreign" => stats.numa_foreign = value,
            "pgmigrate_success" => stats.pgmigrate_success = value,
            "pgmigrate_fail" => stats.pgmigrate_fail = value,
            _ => {} // not a counter we track
        }
    }

    Ok(stats)
}

Rate computation

/proc/vmstat returns cumulative counters since boot. To get per-second rates, poll the file and compute the delta:

/// Per-second rates derived from two NumaStats snapshots.
struct NumaRate {
    pub numa_remote_rate: f64,      // numa_miss allocations per second
    pub migration_rate: f64,        // pages migrated per second
    pub migration_fail_rate: f64,   // failed migrations per second
}

fn compute_rates(prev: &NumaStats, curr: &NumaStats, elapsed_secs: f64) -> NumaRate {
    // Without a positive interval there is no meaningful rate.
    if elapsed_secs <= 0.0 {
        return NumaRate {
            numa_remote_rate: 0.0,
            migration_rate: 0.0,
            migration_fail_rate: 0.0,
        };
    }

    // saturating_sub avoids an underflow if a counter ever goes backwards.
    let remote = curr.numa_miss.saturating_sub(prev.numa_miss);
    let migrations = curr.pgmigrate_success.saturating_sub(prev.pgmigrate_success);
    let failed = curr.pgmigrate_fail.saturating_sub(prev.pgmigrate_fail);

    NumaRate {
        numa_remote_rate: remote as f64 / elapsed_secs,
        migration_rate: migrations as f64 / elapsed_secs,
        migration_fail_rate: failed as f64 / elapsed_secs,
    }
}
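The rate function needs an elapsed time, which in a polling loop comes from timestamps taken alongside each reading. One way to package that, sketched here with a hypothetical RateTracker type for a single counter (the name and shape are assumptions, not something the chapter defines):

```rust
use std::time::Instant;

/// Remembers the previous reading of one cumulative counter and the
/// moment it was taken, so each new reading can be turned into a rate.
struct RateTracker {
    last_value: u64,
    last_seen: Instant,
}

impl RateTracker {
    fn new(value: u64, now: Instant) -> Self {
        RateTracker { last_value: value, last_seen: now }
    }

    /// Record a new reading and return events per second since the
    /// previous one. Returns 0.0 if no time has passed.
    fn update(&mut self, value: u64, now: Instant) -> f64 {
        let elapsed = now.duration_since(self.last_seen).as_secs_f64();
        // A counter that went backwards is treated as "no change".
        let delta = value.saturating_sub(self.last_value);
        self.last_value = value;
        self.last_seen = now;
        if elapsed <= 0.0 {
            0.0
        } else {
            delta as f64 / elapsed
        }
    }
}
```

Passing the timestamp in, rather than calling Instant::now() inside update, keeps the arithmetic testable without sleeping. In a real loop you would read the counter, call update with Instant::now(), and sleep for the sampling interval.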

Per-node memory from sysfs

use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct NodeMem {
    pub node: u32,
    pub anon_pages: u64, // AnonPages, in 4 KiB pages
    pub file_pages: u64, // FilePages, in 4 KiB pages
    pub huge_pages: u64, // HugePages_Total, a count of hugetlb pages
    pub shmem: u64,      // Shmem, in 4 KiB pages
}

fn parse_node_meminfo(content: &str) -> Vec<NodeMem> {
    // The kernel's per-node meminfo format is:
    //   Node 0 AnonPages:       12345678 kB
    //   Node 0 FilePages:        2345678 kB
    //   Node 0 HugePages_Total:       16
    //   ...
    // We accumulate fields per node into a HashMap.
    let mut node_data: HashMap<u32, NodeMem> = HashMap::new();

    for line in content.lines() {
        let parts: Vec<&str> = line.split_whitespace().collect();
        // parts[0] = "Node", parts[1] = node id, parts[2] = "AnonPages:", parts[3] = value
        if parts.len() < 4 || parts[0] != "Node" {
            continue;
        }

        // Skip unparseable lines rather than attributing them to node 0.
        let node_id: u32 = match parts[1].parse() {
            Ok(id) => id,
            Err(_) => continue,
        };
        let value: u64 = match parts[3].parse() {
            Ok(v) => v,
            Err(_) => continue,
        };
        let field_name = parts[2].trim_end_matches(':');
        // kB values are converted to pages (assuming 4 KiB pages) for consistency.
        let value_pages = value / 4;

        let entry = node_data.entry(node_id).or_insert_with(|| NodeMem {
            node: node_id,
            anon_pages: 0,
            file_pages: 0,
            huge_pages: 0,
            shmem: 0,
        });

        match field_name {
            "AnonPages" => entry.anon_pages = value_pages,
            "FilePages" => entry.file_pages = value_pages,
            "Shmem" => entry.shmem = value_pages,
            // HugePages_Total is already a page count and has no kB suffix.
            "HugePages_Total" => entry.huge_pages = value,
            _ => {}
        }
    }

    let mut nodes: Vec<NodeMem> = node_data.into_values().collect();
    nodes.sort_by_key(|n| n.node);
    nodes
}

/// Read per-node memory info from sysfs (all nodes)
fn read_all_node_meminfo() -> std::io::Result<Vec<NodeMem>> {
    let node_dir = std::path::Path::new("/sys/devices/system/node");
    let mut all_nodes = Vec::new();

    // Directory entries that cannot be read are skipped.
    for entry in std::fs::read_dir(node_dir)?.flatten() {
        let name = entry.file_name();
        let name_str = name.to_string_lossy();
        if !name_str.starts_with("node") {
            continue;
        }

        let meminfo_path = entry.path().join("meminfo");
        if !meminfo_path.exists() {
            continue;
        }

        let content = std::fs::read_to_string(&meminfo_path)?;
        let mut parsed = parse_node_meminfo(&content);
        all_nodes.append(&mut parsed);
    }

    all_nodes.sort_by_key(|n| n.node);
    Ok(all_nodes)
}

The parse_node_meminfo function converts KiB to pages internally (dividing by 4, since 1 page = 4 KiB). To get bytes from pages:

fn page_bytes(pages: u64) -> u64 {
    // Hard-coding 4 KiB pages is a simplification. ARM64 can use 16 KiB or 64 KiB
    // pages. For a portable version, use libc::sysconf(libc::_SC_PAGESIZE) to get
    // the actual page size at runtime.
    pages * 4096
}

/sys/devices/system/node/ — cross-node stats

Linux exposes per-node information through sysfs. The directory structure:

/sys/devices/system/node/
├── node0/
│   ├── cpulist          # CPUs on this node
│   ├── distance          # NUMA distances to other nodes
│   ├── meminfo           # memory stats for this node
│   └── numastat          # NUMA hit/miss for this node
├── node1/
│   └── ...

numastat is particularly useful:

numa_hit        12345678
numa_miss       234567
numa_foreign    12345
interleave_hit  1234
local_node      12340000
other_node      5678

This is per-node data, which is more useful than the system-wide vmstat when you’re debugging a specific NUMA imbalance.

use std::fs;

// Reuses the NumaStats struct defined for /proc/vmstat. The per-node file has
// no migration counters, so those fields stay at 0.
fn read_node_numastat(node: u32) -> std::io::Result<NumaStats> {
    let path = format!("/sys/devices/system/node/node{}/numastat", node);
    let content = fs::read_to_string(&path)?;

    let mut stats = NumaStats::default();

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "numa_hit" => stats.numa_hit = value,
            "numa_miss" => stats.numa_miss = value,
            "numa_foreign" => stats.numa_foreign = value,
            _ => {}
        }
    }

    Ok(stats)
}
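The cpulist and distance files in the same directory are just as easy to parse, and together they tell you which CPUs belong to a node and how far each other node is. A minimal sketch, assuming the usual formats: cpulist is a comma-separated list of ids and inclusive ranges such as "0-15,32-47", and distance is a whitespace-separated row with one value per node, where 10 conventionally means local. The function names are illustrative.

```rust
/// Parse a cpulist string such as "0-3,8,10-11" into individual CPU ids.
/// Malformed pieces are skipped rather than treated as errors.
fn parse_cpulist(s: &str) -> Vec<u32> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',') {
        if part.is_empty() {
            continue; // an empty file yields one empty piece
        }
        match part.split_once('-') {
            // "lo-hi" is an inclusive range
            Some((lo, hi)) => {
                if let (Ok(lo), Ok(hi)) = (lo.parse::<u32>(), hi.parse::<u32>()) {
                    cpus.extend(lo..=hi);
                }
            }
            // a bare number is a single CPU
            None => {
                if let Ok(cpu) = part.parse::<u32>() {
                    cpus.push(cpu);
                }
            }
        }
    }
    cpus
}

/// Parse a distance row such as "10 21" into one value per node,
/// indexed by node number.
fn parse_distance(s: &str) -> Vec<u32> {
    s.split_whitespace().filter_map(|v| v.parse().ok()).collect()
}
```

With these, a task's current CPU can be mapped to a node, and the distance row for that node gives a rough relative cost of reaching memory on each other node.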

Hugepage utilization

Hugepages (typically 2 MiB or 1 GiB) reduce TLB pressure because one TLB entry covers a much larger region. A regular 4 KiB page needs one TLB entry per 4 KiB of virtual address space. A 2 MiB hugepage needs one TLB entry per 2 MiB, which is 512x more coverage per entry. For workloads that scan large arrays or chase pointers across a large heap, this can substantially reduce TLB miss rates.
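The coverage argument is simple arithmetic: the address range a TLB can map at once is the number of entries times the page size. A small sketch, using a hypothetical TLB with 1536 entries (real TLBs often have different entry counts for different page sizes, so treat the numbers as illustrative):

```rust
/// Bytes of virtual address space covered by `entries` TLB entries
/// when every entry maps a page of `page_size` bytes.
fn tlb_reach_bytes(entries: u64, page_size: u64) -> u64 {
    entries.saturating_mul(page_size)
}

const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Reach of the hypothetical 1536-entry TLB with 4 KiB pages.
fn reach_small_pages() -> u64 {
    tlb_reach_bytes(1536, 4 * KIB)
}

/// Reach of the same TLB with 2 MiB hugepages.
fn reach_huge_pages() -> u64 {
    tlb_reach_bytes(1536, 2 * MIB)
}
```

Under that assumption the reach goes from 6 MiB with 4 KiB pages to 3 GiB with 2 MiB pages, the same 512x factor as above.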

The data comes from /proc/meminfo:

HugePages_Total:    1024
HugePages_Free:     512
HugePages_Rsvd:     128
HugePages_Surp:     0
AnonHugePages:      2048 kB

The useful metrics:

Hugepage pool utilization: (HugePages_Total - HugePages_Free) / HugePages_Total. If this is near 100%, the pool is exhausted and new hugepage allocations will fail. If it’s near 0%, the pool is overprovisioned and wasting memory.
Transparent hugepage usage: AnonHugePages (in KiB) tells you how much transparent hugepage memory is in use. Compare it to the total anonymous memory (AnonPages in /proc/meminfo) to get a ratio: AnonHugePages / AnonPages. If this ratio is low for a memory-intensive workload, the kernel isn’t coalescing regular pages into hugepages effectively.

Reading it in Rust:

use std::fs;

#[derive(Default)]
pub struct HugepageStats {
    pub total: u64,       // HugePages_Total
    pub free: u64,        // HugePages_Free
    pub reserved: u64,    // HugePages_Rsvd
    pub anon_hugepages: u64,  // AnonHugePages (in KiB)
    pub anon_pages: u64,      // AnonPages (in KiB)
}

fn read_hugepage_stats() -> std::io::Result<HugepageStats> {
    let content = fs::read_to_string("/proc/meminfo")?;
    let mut stats = HugepageStats::default();

    for line in content.lines() {
        // Lines look like "AnonPages:  123456 kB"; the name keeps its colon.
        let mut parts = line.split_whitespace();
        let name = parts.next().unwrap_or("");
        let value: u64 = parts.next().unwrap_or("0").parse().unwrap_or(0);

        match name {
            "HugePages_Total:" => stats.total = value,
            "HugePages_Free:" => stats.free = value,
            "HugePages_Rsvd:" => stats.reserved = value,
            "AnonHugePages:" => stats.anon_hugepages = value,
            "AnonPages:" => stats.anon_pages = value,
            _ => {}
        }
    }

    Ok(stats)
}

/// Fraction of the hugepage pool in use (0.0 when no pool is configured).
fn hugepage_pool_utilization(stats: &HugepageStats) -> f64 {
    if stats.total == 0 {
        return 0.0;
    }
    stats.total.saturating_sub(stats.free) as f64 / stats.total as f64
}

/// Share of anonymous memory backed by transparent hugepages.
fn transparent_hugepage_ratio(stats: &HugepageStats) -> f64 {
    if stats.anon_pages == 0 {
        return 0.0;
    }
    stats.anon_hugepages as f64 / stats.anon_pages as f64
}

A pool utilization above 90% suggests the pool is close to exhaustion, so consider growing it with the hugepages= kernel boot parameter or the vm.nr_hugepages sysctl. A transparent hugepage ratio below 10% for a memory-intensive workload suggests the kernel’s khugepaged daemon isn’t coalescing pages fast enough, or that transparent hugepages are restricted by policy. Check /sys/kernel/mm/transparent_hugepage/enabled for the current policy (always, madvise, or never).
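The policy file lists every option on one line and marks the active one with square brackets, for example "always [madvise] never". A minimal sketch of pulling out the active value, assuming that bracket convention (the function names are illustrative):

```rust
use std::fs;

/// Extract the bracketed (active) option from a sysfs choice string
/// such as "always [madvise] never". Returns None if nothing is bracketed.
fn active_choice(content: &str) -> Option<&str> {
    content
        .split_whitespace()
        .find_map(|tok| tok.strip_prefix('[')?.strip_suffix(']'))
}

/// Read the current transparent hugepage policy. Returns None if the
/// file is missing (for example, on a kernel built without THP support).
fn thp_policy() -> Option<String> {
    let content = fs::read_to_string("/sys/kernel/mm/transparent_hugepage/enabled").ok()?;
    active_choice(&content).map(str::to_owned)
}
```

If thp_policy() returns "madvise", a low transparent hugepage ratio may simply mean the workload never asks for hugepages, and "never" means the ratio will stay at zero regardless of what khugepaged does.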

Next: Part 8 — Uncore IMC Bandwidth — Measure memory bandwidth through the Integrated Memory Controller.


eBPF Performance Monitoring with Aya