Part 11 — vhost and Virtio Ring Instrumentation

If you run virtual machines on Linux, the guest and the host need to move data back and forth for disk I/O and networking. That data path has a bottleneck: the shared memory ring where work items are posted. When that ring fills up, the guest waits. When the host can’t process descriptors fast enough, the guest waits longer.

The ring is called the virtqueue — a circular buffer in shared memory. The guest writes descriptors (“here’s a network packet to send” or “read 4KB from this disk offset”) into the ring. The host-side vhost kernel module picks them up, does the work, and signals completion. If you can measure how full the ring is and how long descriptors sit before being processed, you can see I/O latency building up before it shows up in the guest’s metrics.

Understanding the ring means understanding the latency and throughput of your VM’s I/O path.

Virtio ring basics

The virtio ring has three parts:

  • Descriptor table: an array of data buffer descriptors — each entry points to a buffer and its length
  • Available ring: an array of descriptor indices the guest has made available to the host
  • Used ring: an array of descriptor indices the host has processed and returned to the guest

The host consumer reads from the available ring and writes completions to the used ring. The guest producer writes to the available ring and reads from the used ring.
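The producer/consumer relationship can be sketched as a toy model in plain Rust. This is not the kernel's data structure: `ToyRing`, `guest_add`, and `host_complete` are hypothetical names, the descriptor table is omitted, and the model assumes a slot is free again as soon as the host completes it. It only shows how free-running 16-bit indices give you "in flight" and "full".

```rust
/// Toy model of a virtqueue's index bookkeeping (illustrative names only).
pub struct ToyRing {
    size: u16,      // number of slots in the ring
    avail_idx: u16, // free-running count of entries the guest has posted
    used_idx: u16,  // free-running count of entries the host has completed
}

impl ToyRing {
    pub fn new(size: u16) -> Self {
        ToyRing { size, avail_idx: 0, used_idx: 0 }
    }

    /// Entries posted by the guest that the host has not completed yet.
    /// Wrapping subtraction keeps this correct after the indices overflow.
    pub fn in_flight(&self) -> u16 {
        self.avail_idx.wrapping_sub(self.used_idx)
    }

    /// Guest side: post one entry. Returns false when the ring is full,
    /// which is the stall condition discussed in this chapter.
    pub fn guest_add(&mut self) -> bool {
        if self.in_flight() == self.size {
            return false;
        }
        self.avail_idx = self.avail_idx.wrapping_add(1);
        true
    }

    /// Host side: complete one entry. Returns false when nothing is pending.
    pub fn host_complete(&mut self) -> bool {
        if self.in_flight() == 0 {
            return false;
        }
        self.used_idx = self.used_idx.wrapping_add(1);
        true
    }
}
```

With a ring of size 2, two calls to `guest_add` succeed and the third returns false until `host_complete` frees a slot. That refusal is what the guest experiences as waiting.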

The key performance questions:

  • How many descriptors are being added per second?
  • Is the ring getting full (stall condition)?
  • What’s the latency from add to completion?

KProbes vs tracepoints

Tracepoints are stable hooks placed by kernel developers. KProbes are dynamic: you can attach them to almost any kernel function that the compiler did not inline. That makes kprobes more powerful than tracepoints, but less portable.

vhost and virtio don’t have tracepoints for the fast path operations. To instrument the ring, we need kprobes.

The tradeoff: the functions you probe are internal, so their names change between kernel versions. A function you probe on kernel 5.15 might not exist on 6.1. This is the cost of deep kernel instrumentation.

Finding the right symbols

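/proc/kallsyms lists the kernel's function symbols, including static functions that were not inlined and symbols from loaded modules. Look for vhost and virtqueue functions:

# Find vhost functions. vhost is often built as a module, so keep the
# lines that end in a [module] column instead of filtering them out.
grep -i vhost /proc/kallsyms | head -20

# Find virtqueue functions
grep -i virtqueue /proc/kallsyms | head -20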

Common probe targets:

Function                What it does
handle_tx_kick          TX kick from the guest (vhost-net)
handle_rx_kick          RX kick from the guest (vhost-net)
vhost_scsi_handle_vq    SCSI I/O submission (vhost-scsi)
vhost_worker            The main vhost worker loop

These usually exist as separate functions and therefore show up in /proc/kallsyms, but check on your own kernel before relying on any of them.

⚠️ Why not __virtqueue_add? __virtqueue_add and virtqueue_add are declared static inline in the kernel source (drivers/virtio/virtio_ring.c). The compiler expands them at the call site, so they don’t appear in /proc/kallsyms and can’t be probed with kprobes. A kprobe on __virtqueue_add, as some guides show, fails to attach on a standard kernel build. The functions in the table above are generally static functions inside the vhost modules rather than EXPORT_SYMBOL exports, but that doesn’t matter for kprobes: a function only has to exist as its own symbol in kallsyms.

The exact names vary by kernel version and configuration. Build discovery into your program.
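A minimal sketch of that discovery step, using only the standard library. It assumes the usual kallsyms line format of address, type, name, and an optional `[module]` column, and it treats only `t`/`T` (text) symbols as probe targets. The function name `find_symbols` is hypothetical.

```rust
/// Returns the candidates that appear as an exact symbol name in the given
/// kallsyms text. Matching the name column exactly avoids false positives
/// from longer names that merely contain a candidate as a substring.
pub fn find_symbols<'a>(kallsyms: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|cand| {
            kallsyms.lines().any(|line| {
                // Assumed format: "<address> <type> <name> [module]"
                let mut cols = line.split_whitespace();
                let _addr = cols.next();
                let ty = cols.next();
                let name = cols.next();
                // Keep only code symbols: 't' (local) or 'T' (global).
                matches!(ty, Some("t") | Some("T")) && name == Some(*cand)
            })
        })
        .collect()
}
```

In a real monitor you would pass the contents of /proc/kallsyms (read with `std::fs::read_to_string`) and the candidate list from the table above, then attach only to the names that come back.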

A note on struct access

KProbes give you access to function arguments, not named struct fields. You access arguments through ctx.arg::<T>(n), where n is the 0-indexed argument position. A pointer argument can’t be dereferenced directly; to read struct fields behind it, use bpf_probe_read_kernel (or the older bpf_probe_read).

This is fundamentally different from tracepoints, where the struct layout is documented. With kprobes, you need to know the kernel struct layout. For production use, this means either:

  1. Using CO-RE (Compile Once, Run Everywhere) with BTF to access struct fields by name
  2. Hardcoding struct field offsets (fragile but works)

For this tutorial, we’ll stay with what a probe can observe without knowing any struct layout: that the function was called, on which CPU, and when.

Counting descriptor additions

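The most fundamental metric is throughput per queue. A single call to a kick handler can drain several descriptors, so counting calls measures kicks rather than individual descriptors, but it is a workable proxy for activity. We can’t easily get the queue index from a kprobe without CO-RE and BTF, so we’ll count per CPU instead of per queue: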

// ebpf-programs/src/vhost.rs

use aya_ebpf::helpers::bpf_get_smp_processor_id;
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::HashMap;
use aya_ebpf::programs::ProbeContext;

// Kick count keyed by CPU id.
#[map]
static VQ_COUNTER: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

// The #[kprobe] macro marks this function as a kprobe entry point.
// The kernel symbol to probe is specified at attach time from userspace.
#[kprobe]
pub fn vhost_kick(_ctx: ProbeContext) -> u32 {
    // We're probing handle_tx_kick or handle_rx_kick, the vhost-net handlers
    // that process virtqueue entries. We can't easily get the queue index
    // without CO-RE. For now, count per CPU as a proxy.
    let cpu = unsafe { bpf_get_smp_processor_id() };

    unsafe {
        // Read the current count for this CPU (0 if absent) and store count + 1.
        let count = VQ_COUNTER.get(&cpu).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = VQ_COUNTER.insert(&cpu, &new_count, 0);
    }

    0
}

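ctx.arg::<T>(n) returns Option<T>. For a pointer argument, ask for a raw pointer type such as *const T and handle the None case. A Some value is only the address: it still has to be read through a probe-read helper, not dereferenced.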

Getting queue index with CO-RE

If your kernel has BTF enabled (most modern kernels do), you can access struct fields by name. This lets you get the queue index from the vhost_virtqueue struct and track per-queue counters instead of per-CPU.

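CO-RE (Compile Once, Run Everywhere) is a general eBPF mechanism for portable BTF-based access; Aya’s loader applies the relocations it relies on. It works by resolving struct field offsets at load time using BTF data from the target kernel. The kick handlers receive a pointer to the vhost_work embedded in the queue’s poll structure, so with CO-RE you’d step back from that argument to the enclosing vhost_virtqueue, then read the field that identifies the queue by name (in C with libbpf, this is what bpf_core_read does).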

A complete CO-RE example is beyond the scope of this tutorial — it requires kernel BTF data to be available (/sys/kernel/btf/vmlinux must exist), Aya’s aya-obj BTF parsing, and careful struct definitions that match the kernel’s layout. The Aya project has CO-RE examples in their repository that show the full pattern.

If BTF isn’t available on your target kernel, fall back to the per-CPU counting approach shown earlier. Per-CPU counters give you some visibility without the BTF dependency.
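To turn those per-CPU counters into the "per second" figure, userspace can take two snapshots of the map and divide the difference by the elapsed time. The sketch below assumes the snapshots have already been copied into ordinary `std::collections::HashMap` values keyed by CPU id; `kicks_per_second` is a hypothetical helper, not part of Aya.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Sums the per-CPU increase between two counter snapshots and converts it
/// to a rate. CPUs missing from `prev` are treated as starting at zero, and
/// a counter that went backwards (for example after a reload) contributes 0.
pub fn kicks_per_second(
    prev: &HashMap<u32, u64>,
    curr: &HashMap<u32, u64>,
    elapsed: Duration,
) -> f64 {
    let delta: u64 = curr
        .iter()
        .map(|(cpu, &now)| {
            let before = prev.get(cpu).copied().unwrap_or(0);
            now.saturating_sub(before)
        })
        .sum();

    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        delta as f64 / secs
    } else {
        0.0
    }
}
```

Summing across CPUs gives a host-wide rate. Keeping the per-CPU deltas separate instead would show which cores the vhost workers are running on.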

Stall detection

A ring stall happens when the available ring is full: the guest is trying to add descriptors but the host hasn’t consumed the previous ones. A proxy for stalls: measure the time between consecutive kicks handled on the same CPU. If the gap spikes, something was waiting.

// ebpf-programs/src/vhost.rs

use aya_ebpf::helpers::{bpf_get_smp_processor_id, bpf_ktime_get_ns};
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::programs::ProbeContext;

// Timestamp of the previous kick, keyed by CPU id.
#[map]
static LAST_ADD_TS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static STALL_EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    // Explicit padding (same layout as the implicit repr(C) padding) so that
    // no uninitialized stack bytes are handed to the ring buffer helper,
    // which the verifier can reject.
    pub _pad: u32,
    pub gap_ns: u64,
}

const STALL_THRESHOLD_NS: u64 = 1_000_000; // 1ms

#[kprobe]
pub fn vhost_kick_stall(_ctx: ProbeContext) -> u32 {
    let cpu = unsafe { bpf_get_smp_processor_id() };
    let now = unsafe { bpf_ktime_get_ns() };

    unsafe {
        // Compare against the previous kick seen on this CPU, if any.
        if let Some(&prev_ts) = LAST_ADD_TS.get(&cpu) {
            let gap = now.saturating_sub(prev_ts);
            if gap > STALL_THRESHOLD_NS {
                let event = StallEvent {
                    cpu_id: cpu,
                    _pad: 0,
                    gap_ns: gap,
                };
                // If the ring buffer is full, the event is dropped.
                let _ = STALL_EVENTS.output(&event, 0);
            }
        }
        let _ = LAST_ADD_TS.insert(&cpu, &now, 0);
    }

    0
}
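The gap check is simple enough to exercise outside the kernel. The following plain-Rust sketch mirrors the same logic for a single CPU so the threshold behavior can be tested offline; `GapDetector` and `observe` are hypothetical names and the timestamps are supplied by the caller.

```rust
/// Userspace model of the per-CPU gap check (illustrative, not eBPF code).
pub struct GapDetector {
    threshold_ns: u64,
    last_ts: Option<u64>, // timestamp of the previous observation, if any
}

impl GapDetector {
    pub fn new(threshold_ns: u64) -> Self {
        GapDetector { threshold_ns, last_ts: None }
    }

    /// Records a kick at `now_ns` and returns the gap since the previous
    /// kick when that gap is strictly greater than the threshold.
    pub fn observe(&mut self, now_ns: u64) -> Option<u64> {
        let gap = self.last_ts.map(|prev| now_ns.saturating_sub(prev));
        self.last_ts = Some(now_ns);
        gap.filter(|&g| g > self.threshold_ns)
    }
}
```

Feeding it timestamps of 0, 500_000, and 2_000_000 ns with a 1 ms threshold reports only the last gap (1_500_000 ns). The model also makes the limitation obvious: a queue that was simply idle for 1.5 ms produces exactly the same report as one that was stalled.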

The full vhost eBPF program

Here’s the complete program for vhost/virtio instrumentation:

// ebpf-programs/src/vhost.rs

use aya_ebpf::helpers::{bpf_get_smp_processor_id, bpf_ktime_get_ns};
use aya_ebpf::macros::{kprobe, map};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::programs::ProbeContext;

#[map]
static VQ_COUNTER: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static LAST_ADD_TS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static STALL_EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

const STALL_THRESHOLD_NS: u64 = 1_000_000; // 1ms

#[kprobe]
pub fn vhost_kick(_ctx: ProbeContext) -> u32 {
    let cpu = unsafe { bpf_get_smp_processor_id() };
    let now = unsafe { bpf_ktime_get_ns() };

    // Per-CPU kick counter
    unsafe {
        let count = VQ_COUNTER.get(&cpu).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = VQ_COUNTER.insert(&cpu, &new_count, 0);
    }

    // Stall detection: if the gap since the last kick exceeds the
    // threshold, emit a stall event. A gap means something was
    // waiting: either the ring was full or the host was slow.
    // Note: a long gap can also mean the ring was idle (no I/O
    // submitted). This proxy can't distinguish stalls from idle
    // periods. For production use, track the available ring's
    // size directly via CO-RE or correlate with the vhost worker's
    // CPU usage.
    unsafe {
        if let Some(&prev_ts) = LAST_ADD_TS.get(&cpu) {
            let gap = now.saturating_sub(prev_ts);
            if gap > STALL_THRESHOLD_NS {
                let event = StallEvent {
                    cpu_id: cpu,
                    gap_ns: gap,
                };
                // If the ring buffer is full, the event is dropped.
                let _ = STALL_EVENTS.output(&event, 0);
            }
        }
        let _ = LAST_ADD_TS.insert(&cpu, &now, 0);
    }

    0
}

Attaching probes dynamically from userspace

The challenge with kprobes: function names change between kernel versions. The solution is to look up the symbol at runtime and attach to whatever exists:

// monitor/src/vhost.rs

use aya::programs::KProbe;

pub fn attach_vhost_probes(ebpf: &mut aya::Ebpf) -> anyhow::Result<()> {
    // The eBPF program is named "vhost_kick" (matching the Rust function).
    // At attach time, we specify the kernel symbol to probe.
    // Look up available symbols and attach to whichever exists.
    //
    // Note: __virtqueue_add and virtqueue_add are static inline, so they won't
    // appear in kallsyms. We probe the vhost handler functions instead.
    let candidates = [
        "handle_tx_kick",   // vhost-net TX (most common)
        "handle_rx_kick",   // vhost-net RX
        "vhost_scsi_handle_vq",  // vhost-scsi
    ];

    let kallsyms = std::fs::read_to_string("/proc/kallsyms")?;

    // Find the eBPF program by its Rust name, not the kernel symbol
    let program = ebpf.program_mut("vhost_kick")
        .ok_or_else(|| anyhow::anyhow!("vhost_kick program not found in eBPF object"))?;
    let kprobe: &mut KProbe = program.try_into()?;
    kprobe.load()?;

    // Try each candidate kernel symbol. Compare against the name column
    // (third field) so that longer names containing the candidate don't match.
    let mut attached = false;
    for sym in candidates {
        if kallsyms.lines().any(|l| l.split_whitespace().nth(2) == Some(sym)) {
            kprobe.attach(sym, 0)?; // 0 = offset into the function (its entry)
            println!("attached kprobe: {sym}");
            attached = true;
            break;
        }
    }

    if !attached {
        println!("warning: no vhost/virtio symbols found in /proc/kallsyms; skipping vhost probes");
    }

    Ok(())
}

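attach(name, offset): The second parameter is a byte offset into the function; 0 places the probe at the function entry. Whether the program fires on entry or on return is decided by the macro used in the eBPF code (#[kprobe] or #[kretprobe]), not by this argument.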

Reading in userspace

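// monitor/src/main.rs

use aya::maps::RingBuf;

#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

// Drains the events that are currently queued, then returns.
// Call it repeatedly from a polling loop.
async fn poll_vhost(ebpf: &mut aya::Ebpf) -> anyhow::Result<()> {
    // Maps are looked up by the name of the static in the eBPF program.
    let map = ebpf
        .map_mut("STALL_EVENTS")
        .ok_or_else(|| anyhow::anyhow!("STALL_EVENTS map not found in eBPF object"))?;
    let mut stall_buf = RingBuf::try_from(map)?;

    while let Some(item) = stall_buf.next() {
        // Skip records too short to hold a StallEvent before reading one.
        if item.len() < std::mem::size_of::<StallEvent>() {
            continue;
        }
        let event = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const StallEvent)
        };
        println!(
            "vhost stall: cpu={} gap={}µs",
            event.cpu_id,
            event.gap_ns / 1000
        );
    }

    Ok(())
}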
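If you would rather avoid the unsafe pointer read, the record can be decoded field by field. This sketch assumes the `repr(C)` layout of `StallEvent` shown in this chapter (a u32 at offset 0, four bytes of padding, a u64 at offset 8, 16 bytes in total) and native byte order, since producer and consumer run on the same machine. `decode_stall_event` is a hypothetical helper.

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
pub struct StallEvent {
    pub cpu_id: u32,
    pub gap_ns: u64,
}

/// Decodes one ring buffer record without unsafe code.
/// Returns None when the record is shorter than the 16-byte layout.
pub fn decode_stall_event(bytes: &[u8]) -> Option<StallEvent> {
    if bytes.len() < 16 {
        return None;
    }
    // Bytes 0..4 hold cpu_id; bytes 4..8 are padding and are ignored.
    let cpu_id = u32::from_ne_bytes(bytes[0..4].try_into().ok()?);
    // Bytes 8..16 hold gap_ns.
    let gap_ns = u64::from_ne_bytes(bytes[8..16].try_into().ok()?);
    Some(StallEvent { cpu_id, gap_ns })
}
```

The offsets are hardcoded here for clarity. If the event struct grows, deriving them from the shared struct definition would be less error-prone.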

Version compatibility

This is the hardest part of kprobe-based instrumentation. Function names change, struct layouts change, and some functions are added or removed between kernel versions.

Practical strategies:

  1. Probe what exists: Look up symbols in /proc/kallsyms at runtime and only attach to what’s there
  2. Graceful degradation: If the vhost handler functions aren’t available, fall back to counting at a higher-level tracepoint (like sched:sched_switch or irq:softirq_entry)
  3. Per-version testing: Test your probes on multiple kernel versions

For a monitoring tool that needs broad compatibility, tracepoints are always preferable when available. KProbes are for the deep instrumentation that tracepoints can’t reach.

Next: Part 12 — Queue Depth Histograms — Build per-CPU histograms in eBPF and compute p50 and p99 percentiles.