Part 10 — Block I/O Tracing

I/O patterns tell you whether your storage is being used well.

High IOPS with low throughput means many small operations — small random reads, metadata-heavy workloads, or many tiny writes. Low IOPS with high throughput means a few large sequential operations — streaming reads, copy operations. Both are normal, but both can become bottlenecks.
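A quick way to tell the two regimes apart is the average request size, which is throughput divided by IOPS. The sketch below is only an illustration; the function name `avg_request_bytes` is made up for this example and is not part of the project code.

```rust
/// Average bytes moved per I/O over one sampling interval.
/// Returns None when no operations were observed, to avoid dividing by zero.
fn avg_request_bytes(bytes_delta: u64, ops_delta: u64) -> Option<f64> {
    if ops_delta == 0 {
        None
    } else {
        Some(bytes_delta as f64 / ops_delta as f64)
    }
}
```

For instance, 10,000 operations that move 40,960,000 bytes average 4096 bytes each (the small-I/O regime), while 100 operations that move 104,857,600 bytes average 1 MiB each (the streaming regime).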

The block layer tracepoints let us observe every I/O request as it enters and completes.

The block tracepoints

Linux has several block I/O tracepoints. The two we care about:

Tracepoint	When it fires
`block:block_bio_queue`	A request is submitted to the block layer
`block:block_bio_complete`	A request completes

The arguments for block:block_bio_queue:

Offset   Type      Field
------   ----      -----
0        u32       dev             // dev_t: major<<20 | minor
4        (4 bytes padding)
8        u64       sector          // starting sector number
16       u32       nr_sector       // number of sectors
20       char[]    rwbs            // op/flag string ("R", "W", etc.), RWBS_LEN bytes
20+len   char[16]  comm            // process name (TASK_COMM_LEN); len = RWBS_LEN

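The rwbs field is a fixed-size, NUL-padded character array (RWBS_LEN bytes; the exact length depends on the kernel version, so confirm it in the format file). It encodes the operation and its flags: "R" for read, "W" for write, "D" for discard, "F" for flush, "S" for sync, "N" for none. Letters can be combined, so a synchronous write appears as "WS". The comm field is the name of the process that submitted the I/O.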
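If you later copy rwbs to userspace, decoding it is a matter of cutting the buffer at the first NUL byte and looking at the letters. The following is a minimal sketch under that assumption; `Direction`, `rwbs_as_str`, and `direction` are hypothetical names for this example, and the mapping is deliberately coarse (it ignores flag letters such as "S").

```rust
/// Coarse I/O direction derived from an rwbs string.
#[derive(Debug, PartialEq, Eq)]
enum Direction {
    Read,
    Write,
    Discard,
    Other,
}

/// Text of a NUL-padded rwbs buffer, e.g. b"WS\0\0\0\0\0\0" -> "WS".
/// Falls back to an empty string if the bytes are not valid UTF-8.
fn rwbs_as_str(raw: &[u8]) -> &str {
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    std::str::from_utf8(&raw[..end]).unwrap_or("")
}

/// Classify by the operation letter; anything else (flush-only, none) is Other.
fn direction(rwbs: &str) -> Direction {
    if rwbs.contains('R') {
        Direction::Read
    } else if rwbs.contains('W') {
        Direction::Write
    } else if rwbs.contains('D') {
        Direction::Discard
    } else {
        Direction::Other
    }
}
```

Searching for the letter rather than checking only the first byte keeps the classification independent of where flag letters appear in the string.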

Important: block_bio_queue is a block_bio-class tracepoint: it operates on struct bio, not struct request. The block_rq-class tracepoints (block_rq_insert, block_rq_issue) carry a bytes field that the block_bio class lacks. To get the transfer size in bytes, multiply nr_sector by 512: the block layer counts sectors in fixed 512-byte units regardless of the device's logical block size. This is what the eBPF code below does.

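Verifying offsets. The offsets above use the read_at convention: they start from the first byte after the 8-byte common tracepoint header. The format file (/sys/kernel/tracing/events/block/block_bio_queue/format) includes this header, so its offsets are 8 bytes larger. Field layouts can change between kernel versions, so check the format file on the target kernel, and confirm that your read_at calls really count from the end of the header; if they count from the start of the record, use the format file's offsets unchanged. See the verification note in Part 2 for the full procedure.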
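The comparison with the format file can be automated. The sketch below parses the `field:...; offset:N; size:M;` lines of a format file and converts an offset to the header-relative convention used in this part. It is an illustration only: `FieldLayout`, `parse_format`, and `payload_offset` are names invented here, and the parser assumes the usual semicolon-separated line shape.

```rust
/// One `field:` entry from a tracepoint format file.
#[derive(Debug, PartialEq, Eq)]
struct FieldLayout {
    decl: String,  // declaration text, e.g. "sector_t sector"
    offset: usize, // offset from the start of the record, header included
    size: usize,
}

/// Parse every line that has field, offset, and size parts; skip the rest.
fn parse_format(text: &str) -> Vec<FieldLayout> {
    text.lines().filter_map(parse_field_line).collect()
}

fn parse_field_line(line: &str) -> Option<FieldLayout> {
    let (mut decl, mut offset, mut size) = (None, None, None);
    for part in line.split(';') {
        let part = part.trim();
        if let Some(v) = part.strip_prefix("field:") {
            decl = Some(v.trim().to_owned());
        } else if let Some(v) = part.strip_prefix("offset:") {
            offset = v.trim().parse::<usize>().ok();
        } else if let Some(v) = part.strip_prefix("size:") {
            size = v.trim().parse::<usize>().ok();
        }
    }
    Some(FieldLayout { decl: decl?, offset: offset?, size: size? })
}

/// Offset relative to the end of the 8-byte common header.
/// Returns None for fields that live inside the header itself.
fn payload_offset(field: &FieldLayout) -> Option<usize> {
    field.offset.checked_sub(8)
}
```

To run it against a live system, pass the result of `std::fs::read_to_string` on the format file path to `parse_format`; reading that file usually requires root.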

The device number dev_t encodes major and minor into a single 32-bit integer. To get a readable device name, map it through /sys/dev/block/:

use std::fs;

// Convert a kernel dev_t (major << 20 | minor) to a device name like "sda" or "nvme0n1".
fn dev_t_to_name(dev: u32) -> Option<String> {
    let major = dev >> 20;
    let minor = dev & 0xFFFFF;
    let path = format!("/sys/dev/block/{}:{}", major, minor);
    // read_link returns a relative path like "../../devices/.../block/sda".
    let target = fs::read_link(&path).ok()?;
    let target_str = target.into_os_string().into_string().ok()?;
    // The device name is everything after the last "/block/" component.
    // For a partition the remainder includes the parent disk, e.g. "sda/sda1".
    if let Some(idx) = target_str.rfind("/block/") {
        Some(target_str[idx + "/block/".len()..].to_owned())
    } else {
        // Fallback: use the last path component
        target_str.split('/').next_back().map(|s| s.to_owned())
    }
}

The /sys/dev/block/MAJOR:MINOR symlink points into the device tree — something like ../../devices/pci0000:00/.../block/sda. We extract the device name (sda, nvme0n1) from the block/ component rather than returning the raw symlink target, which would be a long relative path.
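The bit split can be checked on its own, without touching /sys. This is a small sketch assuming the in-kernel encoding described above (major << 20 | minor); `split_dev` and `make_dev` are illustrative helpers, not project functions. Note that the dev_t values userspace gets from stat(2) use a different encoding, so they should not be compared with the tracepoint value directly.

```rust
/// Split a kernel-internal dev_t into (major, minor).
/// Assumes the encoding used by the tracepoint: major << 20 | minor.
fn split_dev(dev: u32) -> (u32, u32) {
    (dev >> 20, dev & 0xF_FFFF)
}

/// Inverse of split_dev, handy for building test values.
fn make_dev(major: u32, minor: u32) -> u32 {
    (major << 20) | (minor & 0xF_FFFF)
}
```

For example, the value 8388609 splits into major 8 and minor 1, which corresponds to the path /sys/dev/block/8:1.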

For our metrics, we’ll track per-device IOPS and throughput.

The eBPF program

// ebpf-programs/src/blockio.rs

use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::{HashMap, PerCpuArray, RingBuf};
use aya_ebpf::programs::TracePointContext;

// Full event record. The handler below does not emit it; it only
// aggregates counters and samples sector numbers.
#[derive(Clone, Copy)]
#[repr(C)]
pub struct BioQueueEvent {
    pub dev: u32,
    pub sector: u64,
    pub nr_sector: u32,
    pub timestamp: u64,
}

// Per-device counter: dev → (ops_count, total_bytes)
// HashMap is shared across CPUs, so concurrent updates from different CPUs
// can lose counts (read-modify-write race). For monitoring where approximate
// counts are acceptable, this is fine. For exact counts, use PerCpuArray
// and sum in userspace (like the histogram in Part 12).
#[map]
static IO_COUNTERS: HashMap<u32, (u64, u64)> = HashMap::with_max_entries(64, 0);

// Per-CPU sampling counter — avoids static mut by using a map
#[map]
static SAMPLE_COUNTER: PerCpuArray<u64> = PerCpuArray::with_max_entries(1, 0);

// Ring buffer for sector samples (sent every 100th event)
#[map]
static SECTOR_SAMPLES: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SectorSample {
    pub dev: u32,
    pub sector: u64,
}

#[tracepoint]
pub fn block_bio_queue(ctx: TracePointContext) -> u32 {
    // Offsets follow the table above; a failed read falls back to 0.
    let dev = unsafe { ctx.read_at::<u32>(0).unwrap_or(0) };
    let sector = unsafe { ctx.read_at::<u64>(8).unwrap_or(0) };
    let nr_sector = unsafe { ctx.read_at::<u32>(16).unwrap_or(0) };

    // Update the per-device counters. The map is shared by all CPUs, so this
    // read-modify-write can occasionally lose an update (see the note above).
    unsafe {
        let (ops, bytes) = IO_COUNTERS.get(&dev).copied().unwrap_or((0u64, 0u64));
        let new_ops = ops + 1;
        let new_bytes = bytes + (nr_sector as u64 * 512);
        let _ = IO_COUNTERS.insert(&dev, &(new_ops, new_bytes), 0);
    }

    // Sampling: every 100th operation per CPU — avoid static mut
    unsafe {
        if let Some(ptr) = SAMPLE_COUNTER.get_ptr_mut(0) {
            *ptr += 1;
            if *ptr % 100 == 0 {
                let sample = SectorSample { dev, sector };
                // If the ring buffer is full the sample is dropped.
                let _ = SECTOR_SAMPLES.output(&sample, 0);
            }
        }
    }

    0
}

A few things to notice:

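Tuple values in HashMap: HashMap<u32, (u64, u64)> stores a tuple as the value, so the value size is the size of the tuple (16 bytes here). It's a convenient way to keep several counters per key. Rust does not guarantee the field order of a tuple, though, so if userspace decodes the value, a #[repr(C)] struct or a [u64; 2] array is the more robust choice.

PerCpuArray<u64> for the sampling counter: A static mut COUNTER: u64 would need unsafe on every access and would be a single counter shared by all CPUs. A PerCpuArray with one entry gives each CPU its own counter, so no atomics are needed.

unsafe on get(): HashMap::get() is unsafe because the returned reference points directly into the map. Unless the map is created with BPF_F_NO_PREALLOC, the kernel does not guarantee that insert and remove are atomic, and the slot of a removed element can be reused while the reference is still live. For metrics aggregation, occasional lost updates are acceptable.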

Computing IOPS and throughput

In userspace, read the counters and compute rates:

use std::borrow::Borrow;

use aya::maps::{HashMap as AyaHashMap, MapData};

pub struct IoStats {
    pub device: String,
    pub iops: f64,
    pub throughput_mbps: f64, // megabytes per second (1 MB = 1e6 bytes)
    pub ops_count: u64,
    pub bytes_count: u64,
}

// Userspace map values must implement aya::Pod. This sketch reads the
// (ops, bytes) pair as [u64; 2], on the assumption that Pod covers arrays
// but not tuples; both occupy the same 16 bytes.
fn read_io_stats<T: Borrow<MapData>>(
    counters: &AyaHashMap<T, u32, [u64; 2]>,
    prev: &std::collections::HashMap<u32, (u64, u64)>,
    elapsed_secs: f64,
) -> Vec<IoStats> {
    // Guard against a zero or negative interval so the rates stay finite.
    let safe_elapsed = if elapsed_secs > 0.0 { elapsed_secs } else { 1.0 };
    counters.iter().filter_map(|item| {
        let (dev, [ops, bytes]) = item.ok()?;
        let prev_data = prev.get(&dev).copied().unwrap_or((0, 0));
        // saturating_sub keeps the delta at 0 if a lost update made a counter go backwards.
        let ops_delta = ops.saturating_sub(prev_data.0);
        let bytes_delta = bytes.saturating_sub(prev_data.1);

        Some(IoStats {
            device: dev_t_to_name(dev).unwrap_or_else(|| format!("{:08x}", dev)),
            iops: ops_delta as f64 / safe_elapsed,
            throughput_mbps: (bytes_delta as f64 / safe_elapsed) / 1e6,
            ops_count: ops_delta,
            bytes_count: bytes_delta,
        })
    }).collect()
}

IOPS entropy — measuring randomness

High IOPS can mean two different things:

Sequential I/O: reading a large file in order, one big read per access — predictable, batchable
Random I/O: reading many small blocks at scattered addresses — unpredictable, hard to batch

Both have high IOPS. The difference is in the sector number distribution. Shannon entropy of the sector numbers measures how predictable or random the access pattern is.

Here’s how to compute it:

fn compute_entropy(sectors: &[u64], num_buckets: usize) -> f64 {
    if sectors.is_empty() || num_buckets == 0 {
        return 0.0;
    }

    // Bucket sectors into equal-width ranges between the smallest and largest sample
    let min = *sectors.iter().min().unwrap_or(&0);
    let max = *sectors.iter().max().unwrap_or(&0);

    if min == max {
        return 0.0; // all accesses in one bucket = completely predictable
    }

    // Round the bucket width up so the last bucket is not wider than the others
    let range = (max - min).saturating_add(1);
    let bucket_size = range.div_ceil(num_buckets as u64);

    let mut counts = vec![0usize; num_buckets];
    for &sector in sectors {
        let bucket = ((sector - min) / bucket_size) as usize;
        let bucket = bucket.min(num_buckets - 1); // defensive clamp
        counts[bucket] += 1;
    }

    let total = counts.iter().sum::<usize>() as f64;
    if total == 0.0 {
        return 0.0;
    }

    // H = -sum(p * log2(p)) for each non-empty bucket
    let mut entropy = 0.0;
    for &count in &counts {
        if count == 0 {
            continue;
        }
        let p = count as f64 / total;
        entropy -= p * p.log2();
    }

    entropy
}

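Entropy is measured in bits. A value near 0 means the accesses are concentrated in one bucket, so the workload keeps hitting the same region. A value near log2(num_buckets) means accesses are spread evenly across all buckets. One caveat: the buckets span only the range of sectors seen in the sample window, so a long sequential scan that sweeps that whole range also fills every bucket evenly. Treat a high value as "widely spread" and cross-check it against request sizes before calling the workload random.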

For num_buckets = 16:

0-2 bits: sequential (always reading from one area)
2-3 bits: some locality (reading from a few areas)
3-4 bits: random (reading from across the address space)
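To make the scale concrete, the sketch below computes the entropy of a few hand-made histograms and maps the result to the bands listed above. It is an illustration only: `entropy_bits` and `classify` are hypothetical helpers, and the choice to make each band half-open (2.0 counts as "some locality", 3.0 as "random") is an assumption, since the list above leaves the boundaries open.

```rust
/// Shannon entropy in bits of a histogram of bucket counts.
fn entropy_bits(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Map an entropy value to the bands used for num_buckets = 16.
fn classify(bits: f64) -> &'static str {
    if bits < 2.0 {
        "sequential"
    } else if bits < 3.0 {
        "some locality"
    } else {
        "random"
    }
}
```

With all samples in one bucket the entropy is 0 bits; with samples split evenly between two buckets it is 1 bit; with samples spread evenly over all 16 buckets it reaches the maximum of log2(16) = 4 bits.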

Ring buffer reader for sector sampling

For entropy calculation, you need the actual sector numbers, not just counts. The eBPF program sends every 100th sector number to the ring buffer:

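use std::collections::VecDeque;

use anyhow::anyhow;
use aya::maps::RingBuf;
use aya::Ebpf;

// Must match the layout of SectorSample in the eBPF program.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct SectorSample {
    pub dev: u32,
    pub sector: u64,
}

async fn poll_sector_samples(
    ebpf: &mut Ebpf,
    sector_window: &mut VecDeque<u64>,
) -> anyhow::Result<()> {
    // map_mut returns an Option. The map is looked up by the name of its
    // static in the eBPF program.
    let map = ebpf
        .map_mut("SECTOR_SAMPLES")
        .ok_or_else(|| anyhow!("SECTOR_SAMPLES map not found"))?;
    let mut ring_buf = RingBuf::try_from(map)?;

    while let Some(item) = ring_buf.next() {
        // Skip records too short to hold a SectorSample
        if item.len() < std::mem::size_of::<SectorSample>() {
            continue;
        }
        let sample = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const SectorSample)
        };
        sector_window.push_back(sample.sector);
    }

    // Keep the last 10000 samples
    while sector_window.len() > 10000 {
        sector_window.pop_front();
    }

    Ok(())
}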

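The ring buffer stores raw bytes. Each item dereferences to a &[u8], which you cast back to the struct type with read_unaligned. This is the standard pattern for receiving structured events from the ring buffer; it is only sound if the slice is at least as long as the struct and the struct layout matches the eBPF side.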
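The same record can also be decoded without unsafe by reading each field from its byte range. This is a sketch of that alternative, not the method the program above uses; `decode_sector_sample` is a hypothetical helper, and the byte ranges assume the #[repr(C)] layout of SectorSample (dev at bytes 0..4, four bytes of padding, sector at bytes 8..16).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SectorSample {
    dev: u32,
    sector: u64,
}

/// Decode one ring buffer record field by field.
/// Returns None if the slice is too short to contain both fields.
fn decode_sector_sample(bytes: &[u8]) -> Option<SectorSample> {
    let mut dev = [0u8; 4];
    dev.copy_from_slice(bytes.get(0..4)?);

    // Bytes 4..8 are alignment padding and are ignored.
    let mut sector = [0u8; 8];
    sector.copy_from_slice(bytes.get(8..16)?);

    Some(SectorSample {
        dev: u32::from_ne_bytes(dev),       // native endianness: same host as the kernel
        sector: u64::from_ne_bytes(sector),
    })
}
```

The bounds checks come for free from `get`, so a truncated record yields None instead of undefined behavior.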

Userspace aggregation

use std::collections::VecDeque;

use anyhow::anyhow;
use aya::Ebpf;

async fn poll_block_io(
    ebpf: &mut Ebpf,
    prev: &mut std::collections::HashMap<u32, (u64, u64)>,
    sector_window: &mut VecDeque<u64>,
    elapsed_secs: f64,
) -> anyhow::Result<()> {
    // Drain ring buffer samples for entropy calculation
    poll_sector_samples(ebpf, sector_window).await?;

    // Read the eBPF counters map. map_mut returns an Option, and the map is
    // looked up by the name of its static in the eBPF program.
    let map = ebpf
        .map_mut("IO_COUNTERS")
        .ok_or_else(|| anyhow!("IO_COUNTERS map not found"))?;
    // The (ops, bytes) pair is read as [u64; 2]: same 16 bytes, and arrays of
    // integers are assumed to satisfy aya::Pod where tuples may not.
    let counters: aya::maps::HashMap<_, u32, [u64; 2]> =
        aya::maps::HashMap::try_from(map)?;

    let safe_elapsed = if elapsed_secs > 0.0 { elapsed_secs } else { 1.0 };
    // Compute IOPS and throughput by iterating the eBPF map and comparing to prev
    let mut stats = Vec::new();
    for item in counters.iter() {
        let (dev, [ops, bytes]) = item?;
        let (prev_ops, prev_bytes) = prev.get(&dev).copied().unwrap_or((0, 0));
        let ops_delta = ops.saturating_sub(prev_ops);
        let bytes_delta = bytes.saturating_sub(prev_bytes);

        stats.push(IoStats {
            device: dev_t_to_name(dev).unwrap_or_else(|| format!("{:08x}", dev)),
            iops: ops_delta as f64 / safe_elapsed,
            throughput_mbps: (bytes_delta as f64 / safe_elapsed) / 1e6,
            ops_count: ops_delta,
            bytes_count: bytes_delta,
        });

        // Remember the raw counters for the next interval
        prev.insert(dev, (ops, bytes));
    }

    // sector_window mixes samples from every device, so the same entropy
    // value is printed on each line.
    let entropy = compute_entropy(&sector_window.iter().copied().collect::<Vec<_>>(), 16);

    for stat in stats {
        println!(
            "{}: iops={:.0}  mbps={:.1}  entropy={:.2}",
            stat.device, stat.iops, stat.throughput_mbps, entropy
        );
    }

    Ok(())
}
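The aggregation above feeds one shared window with samples from every device, so the entropy it prints describes the whole machine rather than any one disk. One possible refinement, sketched here as an assumption rather than as part of the project, is to keep a bounded window per device, keyed by the dev value each sample already carries. The name `push_sample` and the `cap` parameter are made up for this example.

```rust
use std::collections::{HashMap, VecDeque};

/// Append a sector sample to the window of its device, keeping at most
/// `cap` samples per device (oldest samples are dropped first).
fn push_sample(
    windows: &mut HashMap<u32, VecDeque<u64>>,
    dev: u32,
    sector: u64,
    cap: usize,
) {
    let window = windows.entry(dev).or_default();
    window.push_back(sector);
    while window.len() > cap {
        window.pop_front();
    }
}
```

Each device's window can then be passed to the entropy function separately, so a random workload on one disk does not mask a sequential one on another.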

Next: Part 11 — vhost and Virtio Ring Instrumentation — Instrument the virtio ring with kprobes to measure I/O latency at the virtualization boundary.
