Part 6 — Scheduler Tracing with eBPF

The scheduler decides which task runs next. Every scheduling decision is a data point.

You can’t see this from PMCs. The CPU doesn’t know whether it’s executing a task that’s been waiting for 50 milliseconds or one that recently got scheduled. The kernel knows. It records every scheduling decision in tracepoints, and we can read those tracepoints with eBPF.

The scheduler tracepoints

The kernel exposes several scheduler tracepoints. The ones we care about:

Tracepoint                 When it fires
----------                 -------------
sched:sched_switch         The scheduler switches from one task to another
sched:sched_waking         A task is about to be woken (pre-wakeup)
sched:sched_wakeup         A sleeping task has been woken
sched:sched_stat_wait      Time a task spent waiting on a runqueue
sched:sched_migrate_task   A task was migrated to another CPU

sched_switch is the most informative. It fires whenever the scheduler replaces the running task with a different one.

The map declaration pattern

This is the most important pattern in eBPF programming with Aya: maps are static globals.

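You declare a map as a static variable with a constructor call. The #[map] macro places the map definition in the compiled object file; at load time the userspace loader asks the kernel to create the map, and the verifier checks every access the program makes to it. You reference it by name from userspace.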

use aya_ebpf::macros::map;
use aya_ebpf::maps::{HashMap, RingBuf};

// 32 KiB ring buffer (8 pages of 4096 bytes). The size must be a power of two
// and a multiple of the page size.
// Userspace looks the map up by name. Unless the macro is given an explicit
// name, that name is the static's identifier: "EVENTS".
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[map]
static COUNTERS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

Note that aya-ebpf maps don’t have builder methods: they’re constructed with with_max_entries(), or with_byte_size() for ring buffers. Both are const functions, which is what lets them initialize a static. This is different from Rust’s standard library conventions, but the map’s type and size have to be fixed at compile time because the definition is embedded in the object file that the loader reads.
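The size rule in the comment above (a power of two that is also a whole number of pages) is easy to get wrong when the size comes from configuration. A minimal sketch of a check you might run in userspace or in a build script; the function names are hypothetical, and the page size is passed in rather than assumed to be 4096:

```rust
/// Returns true when `bytes` is usable as a BPF ring buffer size:
/// non-zero, a power of two, and a whole number of pages.
fn is_valid_ringbuf_size(bytes: u32, page_size: u32) -> bool {
    bytes != 0 && bytes.is_power_of_two() && bytes % page_size == 0
}

/// Rounds a requested size up to the next valid ring buffer size.
/// Assumes `page_size` is itself a power of two.
fn round_up_ringbuf_size(requested: u32, page_size: u32) -> u32 {
    requested.max(page_size).next_power_of_two()
}
```

With a 4096-byte page, 8 * 4096 passes the check, while 3 * 4096 (page-aligned but not a power of two) and 2048 (smaller than a page) do not.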

Reading tracepoint arguments

A tracepoint has a fixed payload — a chunk of memory that contains the arguments. The kernel defines the format. You read from the tracepoint context at specific byte offsets using read_at().

Here’s the layout for sched:sched_switch on a 64-bit Linux 5.x kernel. Offsets are counted from the start of the record, which opens with an 8-byte common header:

Offset   Type      Field
------   ----      -----
0        u8[8]     common header    (common_type, common_flags, common_preempt_count, common_pid)
8        char[16]  prev_comm        (TASK_COMM_LEN)
24       u32       prev_pid
28       u32       prev_prio
32       u64       prev_state       (the TASK_* state mask)
40       char[16]  next_comm        (TASK_COMM_LEN)
56       u32       next_pid
60       u32       next_prio

The prev_comm and next_comm fields are 16-byte character arrays containing the process name (TASK_COMM_LEN is 16 in the kernel). They take up space in the record even though we don’t read them, so every offset after them is shifted.

Always verify on your system:

cat /sys/kernel/tracing/events/sched/sched_switch/format

This prints the exact field layout including offsets. The offsets in the format file are counted from the start of the record, common header included, and the context pointer that read_at() adds its offset to points at that same start, so the format-file numbers are the read_at offsets. See the detailed cross-check procedure in Part 2. Kernel versions can and do change tracepoint layouts.
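Checking the format file by eye works for one tracepoint, but a monitoring tool may want to do it at startup. A minimal sketch of a parser for the field: lines of a format file, written against the text layout that cat prints; the type and function names are hypothetical. It reports each offset exactly as the file lists it, counted from the start of the record:

```rust
/// One `field:` line from a tracepoint format file.
#[derive(Debug, PartialEq)]
struct Field {
    decl: String,  // C declaration, e.g. "pid_t prev_pid"
    offset: usize, // byte offset as printed in the file
    size: usize,   // size in bytes
}

/// Extracts every `field:...; offset:N; size:M;` line from the file text.
fn parse_format(text: &str) -> Vec<Field> {
    let mut fields = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if !line.starts_with("field:") {
            continue;
        }
        let mut decl: Option<String> = None;
        let mut offset: Option<usize> = None;
        let mut size: Option<usize> = None;
        for part in line.split(';') {
            let part = part.trim();
            if let Some(v) = part.strip_prefix("field:") {
                decl = Some(v.trim().to_string());
            } else if let Some(v) = part.strip_prefix("offset:") {
                offset = v.trim().parse().ok();
            } else if let Some(v) = part.strip_prefix("size:") {
                size = v.trim().parse().ok();
            }
        }
        if let (Some(decl), Some(offset), Some(size)) = (decl, offset, size) {
            fields.push(Field { decl, offset, size });
        }
    }
    fields
}

/// The field name is the last token of the declaration, minus any
/// array suffix ("prev_comm[16]") or pointer star ("*filename").
fn field_name(decl: &str) -> &str {
    let last = decl.split_whitespace().last().unwrap_or("");
    last.split('[').next().unwrap_or(last).trim_start_matches('*')
}

/// Looks up the offset of a named field, if present.
fn field_offset(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().find(|f| field_name(&f.decl) == name).map(|f| f.offset)
}
```

A tool could read the format file with std::fs::read_to_string, call field_offset(&fields, "prev_pid"), and refuse to attach if the result differs from the offset compiled into the eBPF program.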

The struct in our eBPF program:

// ebpf-programs/src/scheduler.rs

use aya_ebpf::helpers::{bpf_get_smp_processor_id, bpf_ktime_get_ns};
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::RingBuf;
use aya_ebpf::programs::TracePointContext;

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub timestamp: u64,
}

// Declare the ring buffer as a static
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    // Offsets are the ones listed in the sched_switch format file
    // (they include the 8-byte common header).
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };
    let prev_state = unsafe { ctx.read_at::<u64>(32).unwrap_or(0) };
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };
    let cpu_id = unsafe { bpf_get_smp_processor_id() };
    let timestamp = unsafe { bpf_ktime_get_ns() };

    let event = SchedSwitchEvent {
        cpu_id,
        prev_pid,
        prev_state,
        next_pid,
        timestamp,
    };

    // output() copies the event into the ring buffer. It returns an error
    // when the buffer is full; in that case the event is dropped.
    let _ = EVENTS.output(&event, 0);

    0
}

A few things to notice here:

unsafe around read_at: read_at wraps bpf_probe_read, which copies bytes from kernel memory at the address you compute. It is declared unsafe because Rust can’t check that the offset and type you pass match the real record layout; a wrong offset gives you garbage rather than a compile error. In practice you’re reading a record that the kernel has just filled in, so the read is sound as long as the offsets are right.

bpf_get_smp_processor_id() and bpf_ktime_get_ns(): These are raw BPF helpers. Aya wraps many helpers, but these two are so fundamental that they’re exposed directly as unsafe extern calls through the bindings. They’re available in every eBPF program.

EVENTS.output(): The ring buffer is a static. We call .output() directly on it — no ctx involved. The ring buffer is declared at the top of the file and lives for the lifetime of the program.

Building a runqueue wait histogram

The wait time is the time between when a task was woken and when it actually starts running on a CPU. To measure it, we need to correlate events from two tracepoints: sched:sched_waking (when a task is about to wake) and sched:sched_switch (when a task starts running). When sched_switch fires and the next_pid matches a task we saw in sched_waking, we know how long that task waited.

The approach: a hash map keyed by PID. When a task is woken (via sched_waking), record the timestamp. When a task starts running (via sched_switch where next_pid matches), look up the timestamp, compute the wait, and delete the entry.

Why sched_waking instead of sched_wakeup? sched_waking fires when the wake signal is about to be sent — slightly earlier and more reliable for measuring the full wait. sched_wakeup fires after the target task has been added to the runqueue. For wait time measurement, the difference is negligible, but sched_waking is the more common choice in production tools.
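Before looking at the eBPF version, it can help to see the correlation logic on its own. The following is a plain userspace model of the approach described above, using std's HashMap in place of a BPF map; WaitTracker and its method names are hypothetical and exist only for illustration:

```rust
use std::collections::HashMap;

/// Userspace model of the wake-to-run correlation: one timestamp per PID.
struct WaitTracker {
    wake_ts: HashMap<u32, u64>,
}

impl WaitTracker {
    fn new() -> Self {
        WaitTracker { wake_ts: HashMap::new() }
    }

    /// Called for each sched_waking event: remember when the task was woken.
    fn on_waking(&mut self, pid: u32, now_ns: u64) {
        if pid != 0 {
            self.wake_ts.insert(pid, now_ns);
        }
    }

    /// Called for each sched_switch event with the incoming task's PID.
    /// Returns the wait in nanoseconds if a wake timestamp was recorded,
    /// and forgets the timestamp so it is only counted once.
    fn on_switch(&mut self, next_pid: u32, now_ns: u64) -> Option<u64> {
        if next_pid == 0 {
            return None;
        }
        self.wake_ts
            .remove(&next_pid)
            .map(|woke_at| now_ns.saturating_sub(woke_at))
    }
}
```

Feeding it on_waking(42, 1_000) followed by on_switch(42, 1_750) yields Some(750); a second on_switch for the same PID yields None because the entry was consumed.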

The sched:sched_waking record layout (offsets from the start of the record, common header included):

Offset   Type      Field
------   ----      -----
0        u8[8]     common header
8        char[16]  comm            (TASK_COMM_LEN)
24       u32       pid
28       u32       prio
32       u32       target_cpu

Verify on your system: cat /sys/kernel/tracing/events/sched/sched_waking/format (the offsets printed there include the common header and are the values to pass to read_at; see Part 2 for details)

// ebpf-programs/src/scheduler.rs

use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

// Map: PID → waking timestamp (nanoseconds)
#[map]
static WAKE_TS: HashMap<i32, u64> = HashMap::with_max_entries(1024, 0);

// Ring buffer for wait events
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct WaitEvent {
    pub pid: u32,
    pub wait_ns: u64,
    pub cpu_id: u32,
}

// sched:sched_waking: record when a task is about to be woken.
// Format-file offsets (common header included):
//   comm (char[16]) at 8, pid (u32) at 24, prio (u32) at 28, target_cpu (u32) at 32
#[tracepoint]
pub fn sched_waking(ctx: TracePointContext) -> u32 {
    let pid = unsafe { ctx.read_at::<i32>(24).unwrap_or(0) };
    let ts = unsafe { bpf_ktime_get_ns() };

    // Skip PID 0: it belongs to the per-CPU idle task, and it is also the
    // fallback value when the read fails. Kernel threads have ordinary PIDs.
    if pid > 0 {
        let _ = WAKE_TS.insert(&pid, &ts, 0);
    }

    0
}

// sched:sched_switch: check if the incoming task was waiting.
// Format-file offsets (common header included):
//   prev_comm (char[16]) at 8, prev_pid (u32) at 24, prev_prio (u32) at 28,
//   prev_state (u64) at 32, next_comm (char[16]) at 40, next_pid (u32) at 56
#[tracepoint]
pub fn sched_switch_wait(ctx: TracePointContext) -> u32 {
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };

    if next_pid > 0 {
        let ts = unsafe { bpf_ktime_get_ns() };
        // WAKE_TS key type is i32 (kernel PIDs are pid_t). Cast for lookup.
        // Safe because we only reach this branch when next_pid > 0,
        // and all real PIDs fit in both u32 and i32.
        let pid_key = next_pid as i32;
        // SAFETY: WAKE_TS.get() is unsafe because the kernel doesn't
        // guarantee atomicity without BPF_F_NO_PREALLOC. For our purposes
        // (measuring scheduler wait time) occasional corruption is acceptable
        // since it means one lost measurement at worst.
        unsafe {
            if let Some(&wake_ts) = WAKE_TS.get(&pid_key) {
                let wait_ns = ts.saturating_sub(wake_ts);
                let cpu_id = bpf_get_smp_processor_id();
                let event = WaitEvent { pid: next_pid, wait_ns, cpu_id };
                // Dropped silently if the ring buffer is full.
                let _ = EVENTS.output(&event, 0);
                let _ = WAKE_TS.remove(&pid_key);
            }
        }
    }

    0
}

Two tracepoints, one hash map. sched_waking writes the timestamp when a task is about to wake. sched_switch reads it back when the task starts running. The next_pid in sched_switch matches the pid from sched_waking, and that is how we correlate the two events; the wait time is the difference between the two timestamps.

Merging the handlers: This chapter shows three separate sched_switch handlers: one for basic tracing, one for the wait measurement, one for involuntary switches. They’re split so that each concept is clear on its own. In a real program you’d combine them into a single handler that reads prev_pid, prev_state, and next_pid once and then does all three operations (ring buffer output, wait lookup, involuntary counting). The “Merging the handlers” section later in this chapter shows the result.

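This is harder than a single-tracepoint approach, because you need to correlate events from two sources. It also covers only tasks that were woken from sleep: a task that is preempted stays runnable and goes back on the runqueue without a sched_waking event, so its wait is not measured here. The scheduler’s internal enqueue/dequeue operations aren’t exposed as tracepoints, so correlating wakeups with switches is the usual way to get at runqueue wait time.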
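The handlers above emit one WaitEvent per wakeup; turning those into a histogram is left to userspace. A minimal sketch, assuming power-of-two buckets in microseconds (a common choice for latency histograms, but an assumption here, as are the type and function names):

```rust
const BUCKETS: usize = 32;

/// Maps a wait in nanoseconds to a bucket index.
/// Bucket 0 holds waits under 1 µs; bucket k (k >= 1) holds waits in
/// [2^(k-1), 2^k) µs. Anything beyond the last bucket is clamped into it.
fn bucket_index(wait_ns: u64) -> usize {
    let us = wait_ns / 1_000;
    if us == 0 {
        return 0;
    }
    // floor(log2(us)) computed from the number of leading zero bits
    let log2 = 63 - us.leading_zeros() as usize;
    (log2 + 1).min(BUCKETS - 1)
}

/// Fixed-size histogram of runqueue waits.
struct WaitHistogram {
    counts: [u64; BUCKETS],
}

impl WaitHistogram {
    fn new() -> Self {
        WaitHistogram { counts: [0; BUCKETS] }
    }

    /// Records one wait_ns value taken from a WaitEvent.
    fn record(&mut self, wait_ns: u64) {
        self.counts[bucket_index(wait_ns)] += 1;
    }
}
```

Bucketing in userspace keeps the eBPF side simple. The alternative is to compute the bucket in the eBPF program and increment a counter map, which avoids sending every event through the ring buffer at the cost of losing the individual samples.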

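insert(), get(), remove(): HashMap in Aya eBPF has insert() (returns Result<(), c_long>), get() (unsafe, returns Option<&V>), and remove() (returns Result<(), c_long>). There’s no get_mut(): get() hands out a shared reference, so an update is a get() followed by an insert(). Recent aya-ebpf releases also offer get_ptr_mut(), which returns a raw pointer for in-place updates.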

get() returns Option<&V>: When the key isn’t found, get() returns None. Two patterns for handling this:

  1. if let Some(&val) = map.get(&key) — when the “not found” case means “skip this event.” Used in sched_switch_wait above: if the PID isn’t in WAKE_TS, there’s nothing to compute.
  2. map.get(&key).copied().unwrap_or(0) — when the “not found” case means “start from zero.” Used for counter increments: if the CPU isn’t in the map, the count is zero.

Both patterns are idiomatic. Pick based on the semantics of the “not found” case.

Counting context switches

Every sched_switch is a context switch. Counting them is straightforward: we’ll extend the sched_switch handler from the previous section with a hash map keyed by CPU ID (an ordinary HashMap, not a per-CPU map type):

// ebpf-programs/src/scheduler.rs

#[map]
static CTXSW_COUNTERS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    let cpu_id = unsafe { bpf_get_smp_processor_id() };

    unsafe {
        let count = CTXSW_COUNTERS.get(&cpu_id).copied().unwrap_or(0u64);
        let new_count = count + 1;
        let _ = CTXSW_COUNTERS.insert(&cpu_id, &new_count, 0);
    }

    0
}

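Note that insert() takes the key and value by reference and copies the value into the map. get() returns a reference into the map, so .copied() turns it into a plain u64 first; we then add one and insert the new value. This read-then-insert sequence is the simplest pattern for counter increments in eBPF, and it is safe here because each CPU only ever updates its own key.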

Involuntary context switches

An involuntary context switch is one where the running task didn’t voluntarily give up the CPU — it got preempted or its time slice expired. We can detect this from the same sched_switch tracepoint by checking whether prev_pid was actually running when it got switched out.

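In the kernel, TASK_RUNNING is state 0. If prev_state in sched_switch is 0, the task was still runnable when it was switched out, so it did not block voluntarily. Kernels since roughly 4.14 may instead report a preempted task with an additional high bit in prev_state (printed as R+ in trace output); the bit’s value depends on the kernel version, so check the print fmt line of the format file if you want to count those switches as well. The basic check looks like this: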

// TASK_RUNNING = 0 in the kernel
const TASK_RUNNING: u64 = 0;

#[map]
static INVOL_CTXSW: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[tracepoint]
pub fn sched_switch(ctx: TracePointContext) -> u32 {
    // Format-file offsets (common header included): prev_pid at 24, prev_state at 32
    let prev_state = unsafe { ctx.read_at::<u64>(32).unwrap_or(0) };
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };

    // prev_state == 0 means the task was in TASK_RUNNING
    // prev_pid > 0 means it wasn't the idle task: involuntary switch
    let is_involuntary = prev_state == TASK_RUNNING && prev_pid > 0;

    if is_involuntary {
        let cpu = unsafe { bpf_get_smp_processor_id() };
        unsafe {
            let count = INVOL_CTXSW.get(&cpu).copied().unwrap_or(0u64);
            let new_count = count + 1;
            let _ = INVOL_CTXSW.insert(&cpu, &new_count, 0);
        }
    }

    0
}
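The classification rule is small enough to pull out and test separately from the eBPF program. The sketch below is a plain-Rust model of it. The preempt_bit parameter is an assumption added for illustration: some kernels flag a preempted task with an extra bit in prev_state, and the value of that bit is kernel-specific, so it is passed in rather than hard-coded. Passing 0 gives exactly the check used in the handler above.

```rust
#[derive(Debug, PartialEq)]
enum SwitchKind {
    /// The outgoing task was the idle task (PID 0).
    Idle,
    /// The outgoing task was still runnable when it lost the CPU.
    Involuntary,
    /// The outgoing task blocked or exited.
    Voluntary,
}

/// Classifies one sched_switch event from its prev_pid and prev_state.
/// `preempt_bit` is a hypothetical, kernel-specific mask for the
/// "preempted" flag; pass 0 to ignore it.
fn classify_switch(prev_pid: u32, prev_state: u64, preempt_bit: u64) -> SwitchKind {
    if prev_pid == 0 {
        return SwitchKind::Idle;
    }
    if prev_state == 0 || (prev_state & preempt_bit) != 0 {
        SwitchKind::Involuntary
    } else {
        SwitchKind::Voluntary
    }
}
```

Keeping the rule in one function means userspace and any tests agree on what "involuntary" means, even if the eBPF side has to inline the same comparison.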

Steal time

On virtualized systems, the hypervisor sometimes doesn’t give a vCPU any time to run even though it was runnable. The kernel records this as steal time. Unlike other metrics in this tutorial, steal time isn’t available through a scheduler tracepoint — it’s reported in /proc/stat:

cpu  2255 34 2290 22625563 6290 0 236 0 0 0
cpu0 1132 17 1145 11312781 3145 0 118 0 0 0
cpu1 1123 17 1145 11312782 3145 0 118 0 0 0

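The fields are: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. The steal field is the 8th value after the cpu label (index 8 once the line is split on whitespace, because index 0 is the label). It counts the time, in USER_HZ ticks (typically 1/100 of a second), during which the vCPU was runnable but the hypervisor was running something else.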

use std::fs;

struct CpuStat {
    pub user: u64,
    pub nice: u64,
    pub system: u64,
    pub idle: u64,
    pub iowait: u64,
    pub irq: u64,
    pub softirq: u64,
    pub steal: u64,
}

fn read_proc_stat() -> std::io::Result<Vec<CpuStat>> {
    let content = fs::read_to_string("/proc/stat")?;
    let mut cpus = Vec::new();

    for line in content.lines() {
        let parts: Vec<&str> = line.split_whitespace().collect();
        if parts.is_empty() || !parts[0].starts_with("cpu") {
            continue;
        }
        // Skip the aggregate "cpu" line: we want per-CPU (cpu0, cpu1, ...)
        if parts[0] == "cpu" {
            continue;
        }

        // parts[0] is the label, so the values start at index 1
        cpus.push(CpuStat {
            user:   parts.get(1).and_then(|v| v.parse().ok()).unwrap_or(0),
            nice:   parts.get(2).and_then(|v| v.parse().ok()).unwrap_or(0),
            system: parts.get(3).and_then(|v| v.parse().ok()).unwrap_or(0),
            idle:   parts.get(4).and_then(|v| v.parse().ok()).unwrap_or(0),
            iowait: parts.get(5).and_then(|v| v.parse().ok()).unwrap_or(0),
            irq:    parts.get(6).and_then(|v| v.parse().ok()).unwrap_or(0),
            softirq:parts.get(7).and_then(|v| v.parse().ok()).unwrap_or(0),
            steal:  parts.get(8).and_then(|v| v.parse().ok()).unwrap_or(0),
        });
    }

    Ok(cpus)
}

Compute the steal ratio (fraction of total CPU time spent stolen):

fn steal_ratio(stat: &CpuStat) -> f64 {
    let total = stat.user + stat.nice + stat.system + stat.idle
        + stat.iowait + stat.irq + stat.softirq + stat.steal;
    if total == 0 {
        return 0.0;
    }
    stat.steal as f64 / total as f64
}

Sustained steal time above roughly 5-10% usually means the host is oversubscribed: more vCPUs are competing for physical CPUs than the host can serve. Your workload is spending real time waiting for the hypervisor, not doing useful work.
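The /proc/stat counters are cumulative since boot, so a ratio over the raw values averages over the machine's whole uptime. For monitoring you usually want the ratio over the last polling interval, which means subtracting two samples. A minimal sketch; CpuTimes is a hypothetical stand-in with the same fields as CpuStat, redefined here so the example is self-contained:

```rust
/// Same fields as CpuStat above, redefined so this example stands alone.
#[derive(Clone, Copy, Default)]
struct CpuTimes {
    user: u64,
    nice: u64,
    system: u64,
    idle: u64,
    iowait: u64,
    irq: u64,
    softirq: u64,
    steal: u64,
}

impl CpuTimes {
    /// Sum of all accounted ticks for this CPU.
    fn total(&self) -> u64 {
        self.user + self.nice + self.system + self.idle
            + self.iowait + self.irq + self.softirq + self.steal
    }
}

/// Fraction of the interval between two samples that was stolen.
/// Returns 0.0 when no time has passed or the counters went backwards.
fn steal_ratio_between(prev: &CpuTimes, curr: &CpuTimes) -> f64 {
    let total = curr.total().saturating_sub(prev.total());
    if total == 0 {
        return 0.0;
    }
    let steal = curr.steal.saturating_sub(prev.steal);
    steal as f64 / total as f64
}
```

In a polling loop you would keep the previous Vec of samples, read a new one, and call steal_ratio_between for each CPU index present in both.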

This is a procfs metric, not an eBPF metric — no tracepoint or kprobe required. The kernel already tracks it. We read it alongside the scheduler tracepoint data in the same polling loop.

Checking tracepoint availability. Scheduler tracepoints are available on virtually every Linux kernel, but their names and arguments can change between versions. Before your monitoring tool starts, verify that the tracepoints it needs actually exist:

# List all scheduler tracepoints on this kernel
ls /sys/kernel/tracing/events/sched/

# Check a specific tracepoint
cat /sys/kernel/tracing/events/sched/sched_switch/id

If cat .../id prints a number, the tracepoint exists and can be attached. If you get “No such file or directory,” the tracepoint isn’t available on this kernel — your program should skip attaching it rather than crashing. In the userspace code below, you’d guard the program.attach() call with a check like this:

// Check that the tracepoint exists before attaching
let tp_id = std::fs::read_to_string(
    "/sys/kernel/tracing/events/sched/sched_switch/id"
);
if tp_id.is_ok() {
    program.attach("sched", "sched_switch")?;
} else {
    eprintln!("sched:sched_switch not available on this kernel, skipping");
}
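The check above hard-codes one path. On older systems tracefs is often reachable only under /sys/kernel/debug/tracing, so a tool may want to try more than one root. A minimal sketch with hypothetical helper names; the caller supplies the list of roots:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Builds <root>/events/<category>/<name>/id.
fn tracepoint_id_path(tracefs_root: &Path, category: &str, name: &str) -> PathBuf {
    tracefs_root.join("events").join(category).join(name).join("id")
}

/// Returns the tracepoint ID from the first root where the id file exists
/// and parses as a number, or None if no root has it.
fn find_tracepoint_id(roots: &[&Path], category: &str, name: &str) -> Option<u32> {
    roots.iter().find_map(|root| {
        let text = fs::read_to_string(tracepoint_id_path(root, category, name)).ok()?;
        text.trim().parse().ok()
    })
}
```

A caller might pass Path::new("/sys/kernel/tracing") and Path::new("/sys/kernel/debug/tracing") as the roots and attach only when the result is Some.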

Merging the handlers

The three sched_switch handlers above each do one thing. In a real program, you’d combine them into a single handler that reads the tracepoint payload once and does all three operations. Here’s what that looks like:

// ebpf-programs/src/scheduler.rs

use aya_ebpf::programs::TracePointContext;
use aya_ebpf::macros::{map, tracepoint};
use aya_ebpf::maps::{HashMap, RingBuf};
use aya_ebpf::helpers::{bpf_ktime_get_ns, bpf_get_smp_processor_id};

const TASK_RUNNING: u64 = 0;

// Shared maps (declared once, used by both sched_switch and sched_waking)
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(8 * 4096, 0);
#[map]
static WAKE_TS: HashMap<i32, u64> = HashMap::with_max_entries(1024, 0);
#[map]
static CTXSW_COUNTERS: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);
#[map]
static INVOL_CTXSW: HashMap<u32, u64> = HashMap::with_max_entries(256, 0);

#[derive(Clone, Copy)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub wait_ns: u64,    // 0 if no waking timestamp was found
    pub timestamp: u64,
}

// Combined sched_switch handler: tracing + wait lookup + involuntary counting
//
// Record layout (64-bit Linux 5.x, offsets as printed by
//   cat /sys/kernel/tracing/events/sched/sched_switch/format,
//   common header included):
//   offset 8:  prev_comm  char[16]
//   offset 24: prev_pid   u32
//   offset 28: prev_prio  u32
//   offset 32: prev_state u64
//   offset 40: next_comm  char[16]
//   offset 56: next_pid   u32
//   offset 60: next_prio  u32
#[tracepoint]
pub fn sched_switch_combined(ctx: TracePointContext) -> u32 {
    // Read the record once, shared across all three operations
    let prev_pid = unsafe { ctx.read_at::<u32>(24).unwrap_or(0) };
    let prev_state = unsafe { ctx.read_at::<u64>(32).unwrap_or(0) };
    let next_pid = unsafe { ctx.read_at::<u32>(56).unwrap_or(0) };
    let cpu_id = unsafe { bpf_get_smp_processor_id() };
    let timestamp = unsafe { bpf_ktime_get_ns() };

    // --- Operation 1: Ring buffer event ---
    // Look up the waking timestamp for next_pid to compute wait time
    let mut wait_ns = 0u64;
    if next_pid > 0 {
        let pid_key = next_pid as i32;
        unsafe {
            if let Some(&wake_ts) = WAKE_TS.get(&pid_key) {
                wait_ns = timestamp.saturating_sub(wake_ts);
                let _ = WAKE_TS.remove(&pid_key);
            }
        }
    }

    let event = SchedSwitchEvent {
        cpu_id,
        prev_pid,
        prev_state,
        next_pid,
        wait_ns,
        timestamp,
    };
    // Dropped silently if the ring buffer is full.
    let _ = EVENTS.output(&event, 0);

    // --- Operation 2: Context switch counter ---
    unsafe {
        let count = CTXSW_COUNTERS.get(&cpu_id).copied().unwrap_or(0u64);
        let _ = CTXSW_COUNTERS.insert(&cpu_id, &(count + 1), 0);
    }

    // --- Operation 3: Involuntary context switch counter ---
    if prev_state == TASK_RUNNING && prev_pid > 0 {
        unsafe {
            let count = INVOL_CTXSW.get(&cpu_id).copied().unwrap_or(0u64);
            let _ = INVOL_CTXSW.insert(&cpu_id, &(count + 1), 0);
        }
    }

    0
}

Key differences from the separate handlers:

  • One read_at pass. The combined handler reads prev_pid, prev_state, and next_pid once. The separate handlers each read independently — that’s three times the work per context switch.
  • Wait time is embedded in the event. The SchedSwitchEvent struct now includes wait_ns. If the incoming task (next_pid) doesn’t have a waking timestamp, wait_ns is 0 — the userspace reader checks for this.
  • One sched_waking handler still needed. The combined sched_switch handler only replaces the three separate sched_switch variants. The sched_waking handler (which records the wake timestamp) is unchanged.

The lib.rs registration for the combined approach needs only two entry points instead of three:

// ebpf-programs/src/lib.rs

#![no_std]
use aya_ebpf::macros::tracepoint;
use aya_ebpf::programs::TracePointContext;
mod scheduler;

#[tracepoint]
pub fn trace_sched_switch(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch_combined(ctx)
}

#[tracepoint]
pub fn trace_sched_waking(ctx: TracePointContext) -> u32 {
    scheduler::sched_waking(ctx)
}

Reading in userspace

The userspace side loads the eBPF programs, creates maps, attaches tracepoints, and reads from the ring buffer.

// monitor/src/main.rs

use anyhow::Context;
use aya::maps::RingBuf;
use aya::programs::TracePoint;
use aya::Ebpf;
use std::convert::TryFrom;

// Must match the layout of the struct the eBPF program emits
// (add wait_ns here if you load the combined handler).
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct SchedSwitchEvent {
    pub cpu_id: u32,
    pub prev_pid: u32,
    pub prev_state: u64,
    pub next_pid: u32,
    pub timestamp: u64,
}

async fn poll_scheduler(ebpf: &mut Ebpf) -> anyhow::Result<()> {
    // map_mut() returns an Option. The name is the eBPF static's identifier
    // unless the #[map] macro was given an explicit name.
    let map = ebpf.map_mut("EVENTS").context("map EVENTS not found")?;
    let mut ring_buf = RingBuf::try_from(map)?;

    let mut ctxsw_total = 0u64;
    let mut involuntary = 0u64;

    while let Some(item) = ring_buf.next() {
        // item derefs to &[u8]; skip anything too short to hold one event
        if item.len() < std::mem::size_of::<SchedSwitchEvent>() {
            continue;
        }
        // Cast the bytes back to our event type
        let event = unsafe {
            std::ptr::read_unaligned(item.as_ptr() as *const SchedSwitchEvent)
        };

        ctxsw_total += 1;
        // Involuntary: was running (prev_state == 0) and wasn't idle (prev_pid > 0)
        if event.prev_state == 0 && event.prev_pid > 0 {
            involuntary += 1;
        }
    }

    if ctxsw_total > 0 {
        let inv_rate = involuntary as f64 / ctxsw_total as f64;
        println!("ctxsw={ctxsw_total}  involuntary_rate={inv_rate:.3}");
    }

    Ok(())
}

Key points about the userspace code:

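RingBuf::try_from(map_data): You create the ring buffer from a map reference. ebpf.map_mut() looks the map up by the name the eBPF program declared (by default the static’s identifier) and returns an Option, so a missing map has to be handled before the conversion. RingBuf::try_from() then holds that map reference for as long as the ring buffer lives.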

ring_buf.next() returns Option<RingBufItem<'_>>: RingBufItem derefs to &[u8] — a byte slice containing the event data. You cast it back to your struct type. read_unaligned is important because the data may not be aligned to the struct’s alignment requirements.

TracePoint (capital P): The userspace program type is TracePoint, not Tracepoint. This is the Aya convention — program types are PascalCase.
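If you would rather avoid the unsafe pointer cast, the event can be decoded field by field from the byte slice. A minimal sketch, assuming the 32-byte #[repr(C)] layout of the five-field SchedSwitchEvent shown above (4 bytes of padding sit between next_pid and timestamp); decode_event is a hypothetical helper:

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
struct SchedSwitchEvent {
    cpu_id: u32,
    prev_pid: u32,
    prev_state: u64,
    next_pid: u32,
    timestamp: u64,
}

/// Size of the #[repr(C)] struct on the eBPF side, padding included.
const EVENT_SIZE: usize = 32;

/// Decodes one event from raw ring buffer bytes in native byte order.
/// Returns None if the slice is too short.
fn decode_event(bytes: &[u8]) -> Option<SchedSwitchEvent> {
    if bytes.len() < EVENT_SIZE {
        return None;
    }
    let u32_at = |off: usize| u32::from_ne_bytes(bytes[off..off + 4].try_into().unwrap());
    let u64_at = |off: usize| u64::from_ne_bytes(bytes[off..off + 8].try_into().unwrap());
    Some(SchedSwitchEvent {
        cpu_id: u32_at(0),
        prev_pid: u32_at(4),
        prev_state: u64_at(8),
        next_pid: u32_at(16),
        // bytes 20..24 are padding inserted by #[repr(C)]
        timestamp: u64_at(24),
    })
}
```

This costs a few more lines than read_unaligned, but the length check and the offsets are explicit, which makes a layout mismatch between the eBPF and userspace structs easier to spot.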

Reading counter maps

For counter maps (like CTXSW_COUNTERS and INVOL_CTXSW), you read them directly from userspace without going through the ring buffer:

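use anyhow::Context;
use aya::maps::HashMap;
use aya::Ebpf;

async fn read_counters(ebpf: &mut Ebpf) -> anyhow::Result<()> {
    // map_mut() returns an Option; the name is the eBPF static's identifier
    let map = ebpf
        .map_mut("CTXSW_COUNTERS")
        .context("map CTXSW_COUNTERS not found")?;
    let counters: HashMap<_, u32, u64> = HashMap::try_from(map)?;

    // iter() yields Result<(key, value), MapError> for each entry
    for entry in counters.iter() {
        let (cpu, count) = entry?;
        println!("cpu={cpu} ctxsw={count}");
    }

    Ok(())
}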

The counter values are cumulative since the program was loaded. To get per-second rates, save the previous reading and compute the delta.
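A minimal sketch of that delta computation, using plain std HashMaps to stand in for two successive readings of the counter map; rates_per_sec is a hypothetical helper:

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Converts two cumulative readings (CPU ID -> count) into per-second rates.
/// CPUs missing from `prev` are treated as starting from zero; a counter
/// that went backwards (for example after a reload) yields a rate of 0.
fn rates_per_sec(
    prev: &HashMap<u32, u64>,
    curr: &HashMap<u32, u64>,
    elapsed: Duration,
) -> HashMap<u32, f64> {
    let secs = elapsed.as_secs_f64();
    let mut rates = HashMap::new();
    if secs <= 0.0 {
        return rates;
    }
    for (&cpu, &now) in curr {
        let before = prev.get(&cpu).copied().unwrap_or(0);
        let delta = now.saturating_sub(before);
        rates.insert(cpu, delta as f64 / secs);
    }
    rates
}
```

In the polling loop you would collect the map's entries into a HashMap on each tick, call rates_per_sec with the previous snapshot and the measured interval, then keep the new snapshot for the next round.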

Registering programs in lib.rs

The eBPF programs are defined in separate .rs files and registered in lib.rs:

// ebpf-programs/src/lib.rs

#![no_std]

use aya_ebpf::macros::tracepoint;
use aya_ebpf::programs::TracePointContext;

mod scheduler;

#[tracepoint]
pub fn trace_sched_switch(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch(ctx)
}

#[tracepoint]
pub fn trace_sched_waking(ctx: TracePointContext) -> u32 {
    scheduler::sched_waking(ctx)
}

#[tracepoint]
pub fn trace_sched_switch_wait(ctx: TracePointContext) -> u32 {
    scheduler::sched_switch_wait(ctx)
}

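Each entry point uses the bare #[tracepoint] form, with no arguments. The macro marks the function as an eBPF tracepoint program; the category and name are provided from userspace via program.attach(). The function body delegates to the module implementation. In a real project, you might define the programs directly in main.rs or split them into modules. Either way, the macro goes on the function that is loaded as the program. Because the macro rewrites that function into the raw entry point the kernel calls, the scheduler.rs functions should drop their own #[tracepoint] attribute if you call them through wrappers like these.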

Next: Part 7 — NUMA and Memory Metrics — Track page migration rates and remote memory access ratios on multi-socket systems.