Part 12 — Queue Depth Histograms

Raw counts are useful. But p50 and p99 tell you what the experience actually feels like.

Here’s the problem with raw counts: they don’t distinguish between a workload that’s consistently mildly queued and one that has extreme spikes. You might measure an average queue depth of 4 — but if every measurement is either 0 or 100, the average is misleading. Percentiles reveal the shape of the distribution.

The eBPF program maintains a histogram in a per-CPU array. Each CPU increments its own copy of the bucket counter. Userspace reads all the per-CPU arrays, sums them, and computes p50 and p99 from the cumulative distribution.

The key insight: histogram IS the data structure

Statistical sampling is the usual approach: sample every Nth event, store the samples, compute percentiles from the sample set. This introduces sampling error, and extreme events may not be captured at all.

The alternative: increment a counter for every event, but put it in the right bucket. The histogram IS the data. You never store individual samples.

Bucket 0: [0]        → count of times the value was exactly 0
Bucket 1: [1]        → count of times the value was exactly 1
Bucket 2: [2-4]      → count of times the value was 2, 3, or 4
Bucket 3: [5-8]      → count of times the value was 5 through 8
Bucket 4: [9-16]     → count of times the value was 9 through 16
Bucket 5: [17-32]    → count of times the value was 17 through 32
Bucket 6: [33-64]    → count of times the value was 33 through 64
Bucket 7: [65+]      → count of times the value was 65 or more

This is a logarithmic bucket scheme — wide buckets at high values. The resolution at the low end is higher because that’s where most queues spend most of their time.

Per-CPU arrays: no lost updates

The critical challenge with histogram counters: multiple CPUs can fire the same eBPF program simultaneously. If you use a regular Array<u64>, two CPUs incrementing the same bucket at the same time would both read the same value, increment it, and write it back — lost updates.

PerCpuArray solves this. Each CPU has its own independent copy of the full array. CPU 0’s copy of buckets[3] is completely separate from CPU 1’s copy. You increment your own CPU’s copy without any contention. When userspace reads the histogram, it sums all the per-CPU values for each bucket.

#![allow(unused)]
fn main() {
// ebpf-programs/src/histogram.rs

use aya_ebpf::maps::PerCpuArray;
use aya_ebpf::macros::map;

#[map]
static QUEUE_HIST: PerCpuArray<u64> = PerCpuArray::with_max_entries(8, 0);

// Bucket boundaries (queue depth)
// Bucket 0: depth=0, Bucket 1: depth=1, etc.
const BUCKET_BOUNDARIES: &[u32] = &[
    1,           // [0]       bucket 0
    2,           // [1]       bucket 1
    5,           // [2-4]     bucket 2
    9,           // [5-8]     bucket 3
    17,          // [9-16]    bucket 4
    33,          // [17-32]   bucket 5
    65,          // [33-64]   bucket 6
    u32::MAX,    // [65+]     bucket 7
];

fn bucket_for_depth(depth: u32) -> usize {
    for (i, &boundary) in BUCKET_BOUNDARIES.iter().enumerate() {
        if depth < boundary {
            return i;
        }
    }
    BUCKET_BOUNDARIES.len() - 1
}

#[inline(always)]
fn increment_bucket(bucket: usize) {
    // get_ptr_mut returns a pointer to the current CPU's copy
    if let Some(ptr) = QUEUE_HIST.get_ptr_mut(bucket as u32) {
        // SAFETY: each CPU has its own independent array copy.
        // No other CPU can write to this location.
        unsafe {
            *ptr += 1;
        }
    }
}
}

Each CPU has its own copy of QUEUE_HIST. When CPU 0 calls increment_bucket(3), it writes to CPU 0’s copy. CPU 1’s call writes to CPU 1’s copy. No contention, no lost updates.

Integrating into the scheduler tracepoint

Attach the histogram increment to a tracepoint or kprobe that provides the metric you care about:

#![allow(unused)]
fn main() {
// ebpf-programs/src/scheduler.rs
// (continued from Part 6)

use aya_ebpf::maps::PerCpuArray;
use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};

#[map]
static QUEUE_HIST: PerCpuArray<u64> = PerCpuArray::with_max_entries(8, 0);

const BUCKET_BOUNDARIES: &[u32] = &[1, 2, 5, 9, 17, 33, 65, u32::MAX];

fn bucket_for_depth(depth: u32) -> usize {
    for (i, &boundary) in BUCKET_BOUNDARIES.iter().enumerate() {
        if depth < boundary {
            return i;
        }
    }
    BUCKET_BOUNDARIES.len() - 1
}

fn increment_bucket(bucket: usize) {
    if let Some(ptr) = QUEUE_HIST.get_ptr_mut(bucket as u32) {
        unsafe { *ptr += 1; }
    }
}

// Runqueue length is tracked in a hash map keyed by CPU.
// We update it on sched_waking (task about to wake) and sched_switch (task starts
// running), then emit the queue depth on each context switch.

#[map]
static RUNQUEUE_DEPTH: aya_ebpf::maps::HashMap<u32, u32> =
    aya_ebpf::maps::HashMap::with_max_entries(256, 0);

// sched:sched_waking — task is about to be woken (increment runqueue depth)
// Payload: comm (char[16]) at 0, pid (u32) at 16, prio (u32) at 20, target_cpu (u32) at 24
#[tracepoint]
pub fn sched_waking_depth(ctx: TracePointContext) -> u32 {
    let pid = unsafe { ctx.read_at::<i32>(16).unwrap_or(0) };
    if pid > 0 {
        // target_cpu tells us which CPU the task will run on
        let target_cpu = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };
        unsafe {
            let depth = RUNQUEUE_DEPTH.get(&target_cpu).copied().unwrap_or(0u32);
            let new_depth = depth + 1;
            let _ = RUNQUEUE_DEPTH.insert(&target_cpu, &new_depth, 0);
        }
    }
    0
}

// sched:sched_switch — task starts running (decrement runqueue depth, record histogram)
// Payload: prev_comm (char[16]) at 0, prev_pid (u32) at 16, prev_prio (u32) at 20,
//          prev_state (u64) at 24, next_comm (char[16]) at 32, next_pid (u32) at 48
#[tracepoint]
pub fn sched_switch_depth(ctx: TracePointContext) -> u32 {
    let next_pid = unsafe { ctx.read_at::<i32>(48).unwrap_or(0) };
    if next_pid > 0 {
        let cpu = unsafe { aya_ebpf::helpers::bpf_get_smp_processor_id() };
        unsafe {
            let depth = RUNQUEUE_DEPTH.get(&cpu).copied().unwrap_or(0u32);
            if depth > 0 {
                increment_bucket(bucket_for_depth(depth));
                let new_depth = depth - 1;
                let _ = RUNQUEUE_DEPTH.insert(&cpu, &new_depth, 0);
            }
        }
    }
    0
}
}

When a task is dequeued from the runqueue, we read the current runqueue depth, find the right bucket, and increment it. The PerCpuArray ensures no updates are lost even under heavy scheduler activity.

This is an approximation, not a precise count. The scheduler’s actual runqueue depth is maintained by the kernel’s internal per-CPU rq struct — we can’t read that from eBPF without a kprobe on an internal function. Our approach uses sched_waking (increment) and sched_switch (decrement) as proxies for enqueue and dequeue. This can diverge from the kernel’s count in two cases: a task migrated to a different CPU between waking and running (our depth for the original CPU overcounts), or a task woken on a CPU that’s already running that same task (the sched_switch fires normally, but the depth was never incremented because the task was already on the runqueue). For histogram purposes — seeing the shape of the distribution — these small divergences don’t matter. For an exact count, you’d need to read rq->nr_running directly (via kprobe on a function that holds the runqueue lock).

Userspace reader

Userspace reads the per-CPU arrays and sums them:

#![allow(unused)]
fn main() {
use aya::maps::{PerCpuArray, PerCpuValues};

pub struct Bucket {
    pub range_start: u32,
    pub range_end: u32,
    pub count: u64,
}

pub struct Histogram {
    pub buckets: Vec<Bucket>,
    pub total: u64,
    pub p50: f64,
    pub p99: f64,
}

const BOUNDARIES: [u32; 8] = [1, 2, 5, 9, 17, 33, 65, u32::MAX];

fn read_histogram(ebpf: &mut aya::Ebpf) -> anyhow::Result<Histogram> {
    let hist: PerCpuArray<u64> = PerCpuArray::try_from(ebpf.map_mut("queue_hist")?)?;

    let mut bucket_counts = vec![0u64; 8];

    // Read all 8 buckets, summing across all CPUs
    for idx in 0..8u32 {
        let per_cpu_values: PerCpuValues<u64> = hist.get(&idx, 0)?;
        bucket_counts[idx as usize] = per_cpu_values.iter().sum();
    }

    let mut buckets = Vec::new();
    for i in 0..8 {
        buckets.push(Bucket {
            range_start: if i == 0 { 0 } else { BOUNDARIES[i - 1] },
            range_end: BOUNDARIES[i],
            count: bucket_counts[i],
        });
    }

    let total: u64 = bucket_counts.iter().sum();
    let (p50, p99) = compute_percentiles(&bucket_counts, &BOUNDARIES, total);

    Ok(Histogram {
        buckets,
        total,
        p50,
        p99,
    })
}
}

A note on reading PerCpuArray from userspace. The PerCpuArray::get(&index, flags) method returns PerCpuValues<V> — one value per CPU, all in one call. PerCpuValues derefs to Box<[V]>, so you can call .iter() on it directly and sum across CPUs, as the read_histogram function above does.

Computing p50 and p99

The percentile is the value below which p * total events fall. From a histogram, we walk the buckets cumulatively:

#![allow(unused)]
fn main() {
fn compute_percentiles(
    counts: &[u64],
    boundaries: &[u32; 8],
    total: u64,
) -> (f64, f64) {
    if total == 0 {
        return (0.0, 0.0);
    }

    let p50_target = (total as f64 * 0.50) as u64;
    let p99_target = (total as f64 * 0.99) as u64;

    let mut cumsum = 0u64;
    let mut prev_cumsum = 0u64;
    let mut p50 = f64::NAN;
    let mut p99 = f64::NAN;

    for (i, &count) in counts.iter().enumerate() {
        prev_cumsum = cumsum;
        cumsum += count;

        if p50.is_nan() && cumsum >= p50_target {
            p50 = interpolate(&boundaries, i, prev_cumsum, cumsum, p50_target);
        }

        if p99.is_nan() && cumsum >= p99_target {
            p99 = interpolate(&boundaries, i, prev_cumsum, cumsum, p99_target);
        }

        if !p50.is_nan() && !p99.is_nan() {
            break;
        }
    }

    (p50, p99)
}

fn interpolate(
    boundaries: &[u32; 8],
    bucket_idx: usize,
    prev_cumsum: u64,  // cumulative count before this bucket
    cumsum: u64,       // cumulative count including this bucket
    target: u64,       // the percentile target value
) -> f64 {
    // Linear interpolation: where within this bucket does the target fall?
    let count_in_bucket = cumsum - prev_cumsum;
    let offset_in_bucket = target.saturating_sub(prev_cumsum);
    let fraction = if count_in_bucket > 0 {
        offset_in_bucket as f64 / count_in_bucket as f64
    } else {
        0.0
    };

    let bucket_start = if bucket_idx == 0 { 0 } else { boundaries[bucket_idx - 1] } as f64;
    let bucket_end = if boundaries[bucket_idx] == u32::MAX {
        bucket_start + 1.0 // open-ended bucket: approximate as one past start
    } else {
        boundaries[bucket_idx] as f64
    };

    bucket_start + fraction * (bucket_end - bucket_start)
}
}

The p50 of 2.5 and p99 of 18.3 from the example above tell you: most observations are low, but occasionally there are spikes. Without percentiles, you’d only know the average, which would hide that information.

Formatted output

#![allow(unused)]
fn main() {
fn format_histogram(hist: &Histogram) -> String {
    let mut lines = Vec::new();
    lines.push(format!(
        "total={}  p50={:.1}  p99={:.1}",
        hist.total, hist.p50, hist.p99
    ));

    for bucket in &hist.buckets {
        let pct = if hist.total > 0 {
            (bucket.count as f64 / hist.total as f64 * 100.0).round() as i32
        } else {
            0
        };
        let range = if bucket.range_end == u32::MAX {
            format!("{}+", bucket.range_start)
        } else {
            format!("{}-{}", bucket.range_start, bucket.range_end)
        };
        lines.push(format!("  {}: {:>8}  {:>4}%", range, bucket.count, pct));
    }

    lines.join("\n")
}
}

Example output for runqueue depth:

total=4821  p50=2.5  p99=18.3
  0:     1200   25%
  1:      980   20%
  2-4:   1400   29%
  5-8:    800   17%
  9-16:   400    8%
  17-32:   30    1%
  33-64:   10    0%
  65+:      1    0%

A note on atomic operations

For a regular Array<u64> (not per-CPU), you’d need atomic operations to avoid lost updates. The BPF helper for this is __atomic_fetch_add — available in Linux 5.8+:

#![allow(unused)]
fn main() {
// For non-per-CPU arrays, use BPF atomic operations
unsafe {
    let ptr = QUEUE_HIST.get_ptr_mut(bucket as u32)?;
    // __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED);
    // Note: Aya doesn't wrap this helper directly.
    // Use PerCpuArray instead — it's simpler and faster.
}
}

PerCPU arrays are the preferred approach in Aya. They’re simpler, faster (no atomic operations needed), and the pattern works everywhere.

Summary

Here’s the full histogram pattern:

eBPF: declare a PerCpuArray<u64> with N entries (one per bucket)
eBPF: on each event, increment the correct bucket using get_ptr_mut() + dereference
Userspace: read all N entries from the per-CPU array
Userspace: sum per-CPU values for each bucket
Userspace: compute cumulative distribution, interpolate p50 and p99

No statistical sampling. Every event is counted. The histogram IS the complete dataset.

This completes the detailed instrumentation chapters. Parts 1–2 covered the architecture and setup. Parts 3–9 covered each data source. Parts 10–12 covered block I/O, vhost rings, and histograms.

The three sources — hardware PMCs, kernel tracepoints, and procfs/sysfs — each have a different rhythm. PMCs are per-cycle counters that you poll at whatever interval suits your dashboard. eBPF programs push events into ring buffers the moment they happen. Procfs and sysfs are snapshots you read on a timer. The remaining work — wiring everything into a single polling loop, adding configuration, and shipping structured output — is engineering integration. The data sources are in place. The instrumentation works. What’s left is assembly.

Keyboard shortcuts

eBPF Performance Monitoring with Aya