Introduction

You’ve got a service running. It works. But now you need more — TLS termination, load balancing, rate limiting, caching. You could add this to your application directly, but now your business logic is tangled with infrastructure. Every new feature requires touching both.

Or you could use a proxy. A layer in front of your service that handles the infrastructure: routing, encryption, caching, rate limiting. Your application stays focused. The proxy handles the rest.

Pingora is a Rust framework for building exactly these proxies. Not a config file you fight against — a library you build on. Cloudflare uses it to handle over 40 million requests per second. This tutorial teaches you how by building a real load-balancing proxy, piece by piece.

Eight parts, each building on the last:

  1. Your First Proxy — the basic request flow, the ProxyHttp trait, minimal working code
  2. Load Balancing — selecting backends, health checks, failover
  3. Filters and Middleware — intercepting and modifying traffic
  4. TLS — encrypting the client-facing connection
  5. Running in Production — logging, metrics, graceful restarts
  6. HTTP Caching — reducing backend load
  7. Rate Limiting — protecting your services
  8. The Complete Proxy — wiring everything together

Each piece is useful on its own, and together they form a proxy that can handle real traffic. Read the parts in order or jump to the one you need; the complete code is in Part 8 if you want to see the full picture first.

Why Pingora?

If you’ve used Nginx or HAProxy, you know the drill: config files, directives, a mini-language for each feature. It works, but customization has limits. When you need logic that doesn’t fit the config DSL, you’re writing Lua plugins or patching C modules.

Pingora is different. Your proxy is a Rust program. You implement a trait. The framework handles everything else — HTTP parsing, connection pooling, keep-alive, TLS, async I/O. You decide where requests go and what happens along the way.

The tradeoff: more code than a config file. The upside: you can do anything.

Part 1: Your First Proxy

You need a proxy. Not a config file — an actual program you can modify, extend, and ship. Let’s build one.

About 40 lines of code. That’s all it takes to build a working reverse proxy that forwards HTTP requests to an upstream server. But those 40 lines contain the entire architecture of every Pingora service you’ll ever write.

The Request’s Journey

Before we write any code, let’s trace what happens when a client sends a request through a Pingora proxy:

1. Client sends: GET / HTTP/1.1 → :6188
2. Pingora accepts the connection
3. Pingora reads the request header
4. upstream_peer() is called → "send it to 1.1.1.1:443"
5. Pingora connects to 1.1.1.1:443 (or reuses an existing connection)
6. upstream_request_filter() is called → "add Host: one.one.one.one"
7. Pingora sends the request to the upstream
8. Pingora reads the upstream response
9. Pingora forwards the response to the client
10. Connection is recycled for reuse

Steps 2, 3, 5, 7, 8, 9, and 10 are handled by the framework. You write steps 4 and 6. That's the deal: Pingora handles the plumbing, you handle the logic.
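To make that split concrete, here is a std-only model of it. The Hooks trait and the proxy_once function are hypothetical stand-ins, not Pingora's API. proxy_once plays the framework, and the only code "you" write is the two hook methods it calls at steps 4 and 6. Networking is left out entirely: the function just returns where it would connect and which headers it would send.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two hooks you write (not Pingora's real trait).
trait Hooks {
    /// Step 4: decide where the request goes.
    fn upstream_peer(&self) -> (String, u16);
    /// Step 6: adjust the request before it is sent.
    fn upstream_request_filter(&self, headers: &mut HashMap<String, String>);
}

struct MyProxy;

impl Hooks for MyProxy {
    fn upstream_peer(&self) -> (String, u16) {
        ("1.1.1.1".to_string(), 443)
    }

    fn upstream_request_filter(&self, headers: &mut HashMap<String, String>) {
        headers.insert("Host".to_string(), "one.one.one.one".to_string());
    }
}

/// Plays the framework's role: everything except steps 4 and 6 is plumbing.
/// Returns the address it would connect to and the headers it would send.
fn proxy_once(
    hooks: &impl Hooks,
    mut headers: HashMap<String, String>,
) -> ((String, u16), HashMap<String, String>) {
    // Steps 2-3 (accept, parse) are assumed done: `headers` is the parsed request.
    let peer = hooks.upstream_peer(); // step 4: your code
    // Step 5 (connect, or reuse a pooled connection) would happen here.
    hooks.upstream_request_filter(&mut headers); // step 6: your code
    // Steps 7-10 (send, read, forward, recycle) would follow.
    (peer, headers)
}

fn main() {
    let mut client_headers = HashMap::new();
    client_headers.insert("Host".to_string(), "127.0.0.1:6188".to_string());
    let (peer, sent) = proxy_once(&MyProxy, client_headers);
    println!("connect to {}:{}, Host = {}", peer.0, peer.1, sent["Host"]);
}
```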

The Smallest Possible Proxy

Every Pingora proxy has three pieces:

  1. A struct that implements ProxyHttp — this is where your logic lives
  2. A service that wraps your struct and listens on a port
  3. A server that hosts the service and manages the process

Here’s the complete code:

use async_trait::async_trait;
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;

pub struct MyProxy;

#[async_trait]
impl ProxyHttp for MyProxy {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let peer = Box::new(HttpPeer::new(
            ("1.1.1.1", 443),
            true,
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }
}

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    let mut service = http_proxy_service(&server.configuration, MyProxy);
    service.add_tcp("0.0.0.0:6188");
    server.add_service(service);

    server.run_forever();
}

That’s it. 40 lines including imports. Let’s break down what each piece does.

upstream_peer() — Where Does the Request Go?

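This is one of only two methods ProxyHttp requires you to implement; the other is new_ctx(), covered under CTX below. It answers one question: where should this request be sent?

The return type is Box<HttpPeer> (inside Pingora's Result), and HttpPeer describes the upstream connection: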

HttpPeer::new(
    ("1.1.1.1", 443),     // The upstream address and port
    true,                  // Use TLS?
    "one.one.one.one".to_string(), // SNI hostname for TLS
)

The three parameters:

  1. Address — a hostname or IP with a port. This is where Pingora will connect.
  2. TLS — whether to encrypt the upstream connection. Since we’re connecting to an HTTPS server, this is true.
  3. SNI — Server Name Indication. During the TLS handshake, the client tells the server which hostname it’s looking for. Without the right SNI, the server won’t know which certificate to present.

Right now, every request goes to the same upstream. In Part 2, we’ll use this method to implement load balancing — selecting different backends for different requests.

upstream_request_filter() — Modifying the Request

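This method is optional. We need it here because Pingora forwards the client's request headers, so without this filter the upstream would see the client's Host header (127.0.0.1:6188 in our test). 1.1.1.1 needs Host: one.one.one.one to serve the right website; with the wrong Host, the server returns a 403.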

async fn upstream_request_filter(
    &self,
    _session: &mut Session,
    upstream_request: &mut pingora::http::RequestHeader,
    _ctx: &mut Self::CTX,
) -> Result<()> {
    upstream_request.insert_header("Host", "one.one.one.one")?;
    Ok(())
}

The upstream_request parameter is mutable — you can add, remove, or modify any header. This is where you’d add authentication headers, strip internal headers, or rewrite the request path.

The method runs after the connection to the upstream is established but before the request is sent. Because it runs after upstream_peer(), you can tailor the request to the chosen backend. For example, you could record which backend was picked in CTX (below) and set backend-specific headers here.
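The kinds of edits mentioned above (adding credentials, stripping internal headers, rewriting the path) can be sketched with a toy request type. Everything here is hypothetical: UpstreamRequest, the x-internal- header prefix, the /api path prefix and the token value. In a real filter you would make the equivalent changes on the RequestHeader that Pingora passes in, using methods such as insert_header and remove_header.

```rust
/// Toy model of an outgoing request (hypothetical; not Pingora's RequestHeader).
struct UpstreamRequest {
    path: String,
    headers: Vec<(String, String)>,
}

impl UpstreamRequest {
    /// Replace any header with the same (case-insensitive) name, then add it.
    fn insert_header(&mut self, name: &str, value: &str) {
        self.headers.retain(|(n, _)| !n.eq_ignore_ascii_case(name));
        self.headers.push((name.to_string(), value.to_string()));
    }
}

fn rewrite_for_upstream(req: &mut UpstreamRequest) {
    // 1. Strip headers that should never leave the proxy.
    req.headers
        .retain(|(name, _)| !name.to_ascii_lowercase().starts_with("x-internal-"));

    // 2. Add credentials the backend expects (hypothetical token).
    req.insert_header("Authorization", "Bearer backend-token");

    // 3. Rewrite the path: public "/api/..." becomes the backend's "/...".
    if let Some(rest) = req.path.strip_prefix("/api") {
        if rest.is_empty() {
            req.path = "/".to_string();
        } else if rest.starts_with('/') {
            req.path = rest.to_string();
        }
        // Anything else (e.g. "/apiary") is left alone.
    }
}

fn main() {
    let mut req = UpstreamRequest {
        path: "/api/users/42".to_string(),
        headers: vec![
            ("X-Internal-Trace".to_string(), "debug".to_string()),
            ("Accept".to_string(), "application/json".to_string()),
        ],
    };
    rewrite_for_upstream(&mut req);
    println!("{} {:?}", req.path, req.headers);
}
```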

CTX — Per-Request State

The type CTX = () line looks odd. What’s it for?

Each request gets its own CTX instance. This is how you share state between the different phases of a single request. For example, you might parse a JWT in request_filter(), store the user ID in CTX, and use it in upstream_request_filter() to add an X-User-Id header.

Right now we don’t need per-request state, so we use () (the unit type — Rust’s way of saying “no data”). We’ll add real state in Part 3.
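Here is a std-only sketch of the JWT example above. RequestCtx, the two phase functions and the user:<id> token format are hypothetical; real JWT validation needs a proper library. What the sketch shows is the shape of the pattern:

- each request gets a fresh context from new_ctx();
- one phase writes to that context;
- a later phase reads from it.

```rust
/// Hypothetical per-request state, analogous to a custom `type CTX`.
#[derive(Default)]
struct RequestCtx {
    user_id: Option<u64>,
}

fn new_ctx() -> RequestCtx {
    RequestCtx::default()
}

/// Stand-in for request_filter: "parse" a token of the form `user:<id>`.
/// A real proxy would verify a JWT here instead.
fn request_filter(auth_header: Option<&str>, ctx: &mut RequestCtx) {
    ctx.user_id = auth_header
        .and_then(|h| h.strip_prefix("user:"))
        .and_then(|id| id.parse().ok());
}

/// Stand-in for upstream_request_filter: read what the earlier phase stored.
fn upstream_request_filter(headers: &mut Vec<(String, String)>, ctx: &RequestCtx) {
    if let Some(id) = ctx.user_id {
        headers.push(("X-User-Id".to_string(), id.to_string()));
    }
}

fn main() {
    // Two requests, two independent contexts.
    for auth in [Some("user:42"), None] {
        let mut ctx = new_ctx();
        let mut headers = Vec::new();
        request_filter(auth, &mut ctx);
        upstream_request_filter(&mut headers, &ctx);
        println!("{:?} -> {:?}", auth, headers);
    }
}
```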

The Server

The main() function sets up three things:

fn main() {
    // 1. Create the server
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    // 2. Create a proxy service
    let mut service = http_proxy_service(&server.configuration, MyProxy);
    service.add_tcp("0.0.0.0:6188");

    // 3. Register and run
    server.add_service(service);
    server.run_forever();
}

Server::new(None): creates a Pingora server with default configuration. The argument is an optional set of command-line options (Opt); None means no options, so no configuration file is loaded and defaults apply. In Part 5, we'll pass a real config.

server.bootstrap(): prepares the server to start. Among other things, during a zero-downtime upgrade it takes over the listening sockets of the older instance that is still running. Call it before adding services, as this example does.

http_proxy_service(): creates a service that uses our MyProxy handler. The first argument is the server configuration (shared settings such as the number of worker threads). The second is our proxy struct.

service.add_tcp() — tells the service to listen on port 6188. You can add multiple listeners (TCP, Unix sockets, TLS).

server.run_forever() — starts the event loop. This spawns worker threads and blocks the main thread. The server will run until it receives a shutdown signal.

Running It

cargo run

Then in another terminal:

curl http://127.0.0.1:6188 -sv

You should see the 1.1.1.1 website, served through your proxy. The -v flag prints the request and response headers; -s hides curl's progress meter.

What Just Happened

Let’s look at what Pingora did for you, because it’s a lot:

  1. HTTP parsing — Pingora parsed the incoming HTTP/1.1 request from the client. No hand-written parser needed.
  2. Connection pooling — When you send a second request, Pingora reuses the existing connection to 1.1.1.1 instead of opening a new one. This is automatic.
  3. TLS — Pingora handled the entire TLS handshake with the upstream server. All you specified was true for “use TLS.”
  4. Keep-alive — Both the downstream (client) and upstream connections are kept alive. No Connection: close needed.
  5. Error handling — If the upstream is unreachable, Pingora returns a 502 Bad Gateway to the client. You didn’t have to write that code.
  6. Concurrency — Multiple requests are handled concurrently using async I/O. No thread per connection.

That’s the value of a framework. You wrote two methods that answer “where” and “what headers.” Pingora handled the rest.

What’s Next

This proxy works, but it only talks to one backend. If that backend goes down, every request fails. In Part 2: Load Balancing, we’ll add multiple backends, select between them, and detect when one is unhealthy.

Part 2: Load Balancing — Picking the Right Backend When There Are Many

One backend isn’t enough. If it goes down, everything stops. If it’s slow, everyone waits. You need multiple backends, and you need to pick the right one for each request. That’s load balancing.

Your proxy will distribute requests across multiple backends using round-robin selection and automatically skip backends that are unhealthy.

The Problem with One Backend

In Part 1, every request went to 1.1.1.1. That works fine — until it doesn’t. What if 1.1.1.1 is down for maintenance? What if traffic spikes and one server can’t handle it?

We need two things:

  1. Multiple backends — so traffic can be spread across them
  2. Health checks — so we stop sending traffic to backends that are down

Round-Robin Selection

The simplest load balancing algorithm is round-robin: rotate through the backends in order. First request goes to backend A, second to B, third to A again, and so on.

Request 1 → 1.1.1.1:443
Request 2 → 1.0.0.1:443
Request 3 → 1.1.1.1:443
Request 4 → 1.0.0.1:443
...

Pingora provides this via the LoadBalancer type with a RoundRobin selection algorithm:

use pingora::lb::{LoadBalancer, selection::RoundRobin};

// The type annotation chooses the selection algorithm
let upstreams: LoadBalancer<RoundRobin> = LoadBalancer::try_from_iter([
    "1.1.1.1:443",
    "1.0.0.1:443",
]).unwrap();

try_from_iter creates a load balancer from a list of backend addresses. It resolves DNS names and validates the addresses. The RoundRobin type parameter tells it which selection algorithm to use.
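Under the hood, round-robin needs little more than a shared counter. This sketch is not Pingora's implementation. It assumes an atomic counter so that many worker threads can call select() through &self without a lock. RoundRobinSketch is our own name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal round-robin selector (illustrative; not Pingora's RoundRobin).
struct RoundRobinSketch {
    backends: Vec<&'static str>,
    next: AtomicUsize,
}

impl RoundRobinSketch {
    fn new(backends: Vec<&'static str>) -> Self {
        Self { backends, next: AtomicUsize::new(0) }
    }

    /// Returns the next backend in order, wrapping around at the end.
    /// `fetch_add` makes this safe to call from many threads via `&self`.
    fn select(&self) -> Option<&'static str> {
        if self.backends.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        Some(self.backends[i % self.backends.len()])
    }
}

fn main() {
    let rr = RoundRobinSketch::new(vec!["1.1.1.1:443", "1.0.0.1:443"]);
    for n in 1..=4 {
        println!("Request {} -> {}", n, rr.select().unwrap());
    }
}
```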

The Proxy with Load Balancing

Our proxy struct wraps the LoadBalancer in an Arc:

pub struct LB(Arc<LoadBalancer<RoundRobin>>);

Why Arc? Because the load balancer is shared between two things:

  1. The proxy handler (which calls select() to pick a backend)
  2. The health check service (which marks backends healthy/unhealthy)

Both need to see the same state. Arc gives us shared ownership without copying.

The upstream_peer method is almost the same as Part 1, except now we select from the pool:

async fn upstream_peer(
    &self,
    _session: &mut Session,
    _ctx: &mut Self::CTX,
) -> Result<Box<HttpPeer>> {
    let upstream = self.0
        .select(b"", 256)
        .ok_or_else(|| Error::new_str("no healthy upstream available"))?;

    let peer = Box::new(HttpPeer::new(
        upstream,
        true,
        "one.one.one.one".to_string(),
    ));
    Ok(peer)
}

Two differences from Part 1:

  1. self.0.select(b"", 256) instead of a hardcoded address. The select method picks a backend using the selection algorithm. The first argument (b"") is a hash key — it’s used for consistent hashing but ignored by round-robin. The second argument is the maximum number of iterations (in case many backends are unhealthy).

  2. .ok_or_else(|| ...) rather than .unwrap(). select() returns Option<Backend>, which is None when it can't find a healthy backend (for example, when all of them are unhealthy). We convert that to an error. Pingora then answers the client with an error response instead of proxying the request.
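Here is a sketch of health-aware round-robin that shows why select() takes an iteration limit and returns an Option. It tries candidates in order, skips unhealthy ones and gives up after max_iterations attempts. The Backend and Pool types and this exact skipping strategy are assumptions for illustration, not Pingora's internals.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

struct Backend {
    addr: &'static str,
    healthy: AtomicBool,
}

struct Pool {
    backends: Vec<Backend>,
    next: AtomicUsize,
}

impl Pool {
    /// Try up to `max_iterations` candidates; None if none of them is healthy.
    fn select(&self, max_iterations: usize) -> Option<&'static str> {
        if self.backends.is_empty() {
            return None;
        }
        for _ in 0..max_iterations {
            let i = self.next.fetch_add(1, Ordering::Relaxed) % self.backends.len();
            let b = &self.backends[i];
            if b.healthy.load(Ordering::Relaxed) {
                return Some(b.addr);
            }
        }
        None // the caller turns this into an error response
    }
}

fn main() {
    let pool = Pool {
        backends: vec![
            Backend { addr: "1.1.1.1:443", healthy: AtomicBool::new(true) },
            Backend { addr: "127.0.0.1:343", healthy: AtomicBool::new(false) },
        ],
        next: AtomicUsize::new(0),
    };
    // The unhealthy backend is skipped, so every pick is the healthy one.
    for _ in 0..3 {
        println!("{:?}", pool.select(256));
    }
}
```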

Health Checks: Stopping Traffic to Dead Backends

Here's the problem: suppose we add a third backend that is down. Round-robin will still send it its share of traffic, so every third request fails.

We need health checks. Pingora provides TcpHealthCheck, which periodically tries to connect to each backend:

use pingora::lb::health_check::TcpHealthCheck;

let mut upstreams: LoadBalancer<RoundRobin> = LoadBalancer::try_from_iter([
    "1.1.1.1:443",
    "1.0.0.1:443",
    "127.0.0.1:343",  // broken: nothing listens here
]).unwrap();

let hc = TcpHealthCheck::new();
upstreams.set_health_check(hc);
upstreams.health_check_frequency = Some(std::time::Duration::from_secs(1));

Three things happen here:

  1. Create the health check. TcpHealthCheck::new() creates a check that attempts a TCP connection. If the connection succeeds, the backend is healthy. If it’s refused, the backend is unhealthy.

  2. Attach it to the load balancer. set_health_check(hc) tells the load balancer to use this check. After this, select() will skip unhealthy backends.

  3. Set the frequency. We check every second. If you leave health_check_frequency unset (None), the check runs only once, at startup. Faster checks mean faster failover but more overhead.
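At its core, a TCP health check asks a single question: does a connection open? Here is a std-only sketch of one probe, with a timeout of our choosing. It leaves out everything else a real checker handles, such as scheduling and any extra settings Pingora's TcpHealthCheck exposes.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// One health probe: healthy if a TCP connection opens within `timeout`.
/// The stream is dropped right away; we only care that it connected.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

fn main() {
    // The deliberately broken backend from the example above.
    let addr: SocketAddr = "127.0.0.1:343".parse().unwrap();
    let healthy = tcp_probe(addr, Duration::from_secs(1));
    println!("127.0.0.1:343 healthy? {}", healthy);
}
```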

Running Health Checks in the Background

Health checks need to run continuously, independently of request handling. Pingora handles this with the background service pattern:

let background = pingora::services::background::background_service(
    "health_check",
    upstreams,
);
let upstreams = background.task();  // shadows the earlier `upstreams` variable

Note: the second let upstreams shadows the first. The owned LoadBalancer has moved into the background service, and we replace it with the shared Arc<LoadBalancer> returned by task(). It has the same name, but it is now a shared handle the proxy can hold.

This is subtle, so let’s break it down:

  1. background_service("health_check", upstreams): takes ownership of the LoadBalancer and wraps it in a background service. The string is the service's name, which identifies it (for example, in logs).

  2. background.task() — returns an Arc<LoadBalancer> that points to the same instance the background service is managing. This is how we share state: the health checker updates the load balancer’s internal state, and our proxy reads it, both through the same Arc.

  3. Register both services:

server.add_service(background);  // health check runner
server.add_service(proxy_service);  // our proxy

The server runs the background service alongside the proxy service, so health checks don't block request handling.

The Complete Code

use async_trait::async_trait;
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use pingora::lb::{LoadBalancer, selection::RoundRobin, health_check::TcpHealthCheck};
use std::sync::Arc;

pub struct LB(Arc<LoadBalancer<RoundRobin>>);

#[async_trait]
impl ProxyHttp for LB {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.0
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;

        let peer = Box::new(HttpPeer::new(
            upstream,
            true,
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }
}

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    let mut upstreams = LoadBalancer::try_from_iter([
        "1.1.1.1:443",
        "1.0.0.1:443",
        "127.0.0.1:343",
    ]).unwrap();

    let hc = TcpHealthCheck::new();
    upstreams.set_health_check(hc);
    upstreams.health_check_frequency = Some(std::time::Duration::from_secs(1));

    let background = pingora::services::background::background_service(
        "health_check",
        upstreams,
    );
    let upstreams = background.task();

    let mut service = http_proxy_service(&server.configuration, LB(upstreams));
    service.add_tcp("0.0.0.0:6188");

    server.add_service(background);
    server.add_service(service);

    server.run_forever();
}

Running It

cargo run

Then test with multiple requests:

for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:6188; done

Before health checks kick in, you might see some 502s (the broken backend). After a second or two, all requests should return 200 — the health checker has detected that 127.0.0.1:343 is down and removed it from rotation.

The Background Service Pattern

This pattern — a background task that manages shared state — is worth understanding because it shows up everywhere in Pingora:

┌─────────────────────────────────────┐
│          Arc<LoadBalancer>          │
│                                     │
│  ┌──────────────┐ ┌──────────────┐  │
│  │ Health Check │ │ Proxy (LB)   │  │
│  │ Background   │ │ reads via    │  │
│  │ Service      │ │ select()     │  │
│  │ writes via   │ │              │  │
│  │ health_check │ │              │  │
│  └──────────────┘ └──────────────┘  │
│       task() returns the Arc        │
└─────────────────────────────────────┘

The background_service() function takes ownership of the state, runs it in the background, and gives you back an Arc via .task(). Your proxy uses that Arc. This is Pingora’s way of sharing state between services without mutex hell.
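The same shape can be sketched with std alone:

- a background thread writes per-backend health flags;
- the request path reads those flags;
- both sides hold one Arc.

This is not Pingora's code. SharedPool, run_one_health_round and the fake probe are hypothetical, and using atomics for the flags is our choice. A real checker would loop with a sleep between rounds; running a single round keeps the example deterministic.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Shared state: one health flag per backend. No Mutex needed for a bool.
struct SharedPool {
    backends: Vec<&'static str>,
    healthy: Vec<AtomicBool>,
}

impl SharedPool {
    fn new(backends: Vec<&'static str>) -> Self {
        let healthy = backends.iter().map(|_| AtomicBool::new(true)).collect();
        Self { backends, healthy }
    }

    /// Reader side (the proxy): backends currently marked healthy.
    fn healthy_backends(&self) -> Vec<&'static str> {
        self.backends
            .iter()
            .zip(&self.healthy)
            .filter(|(_, h)| h.load(Ordering::Acquire))
            .map(|(b, _)| *b)
            .collect()
    }
}

/// Writer side (the background task): record the latest probe result.
fn run_one_health_round(pool: Arc<SharedPool>, probe: fn(&str) -> bool) {
    for (b, h) in pool.backends.iter().zip(&pool.healthy) {
        h.store(probe(b), Ordering::Release);
    }
}

fn main() {
    let pool = Arc::new(SharedPool::new(vec!["1.1.1.1:443", "127.0.0.1:343"]));
    // Hypothetical probe result: pretend only the local backend is down.
    let probe: fn(&str) -> bool = |addr| addr != "127.0.0.1:343";

    let writer = Arc::clone(&pool);
    thread::spawn(move || run_one_health_round(writer, probe))
        .join()
        .unwrap();

    println!("healthy: {:?}", pool.healthy_backends());
}
```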

Other Selection Algorithms

Round-robin is the simplest algorithm. Pingora also provides:

Algorithm     When to use                                 How it selects
RoundRobin    Equal backends, no session affinity         Rotate through in order
Consistent    Session affinity needed (sticky sessions)   Hash the key to a backend
Random        Simple distribution, no order guarantees    Pick randomly

For consistent hashing, the key parameter in select(key, max) matters — it determines which backend a given key maps to. Same key always maps to the same backend (until backends are added or removed).
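A minimal hash ring shows both properties. This is a simplified sketch, not Pingora's Consistent selector (which uses Ketama hashing). Ring, the 100 points per backend and the use of std's DefaultHasher are our assumptions. DefaultHasher is fine for a demo, but its output isn't guaranteed to stay the same across Rust releases, so don't persist it. When a backend is added, only the keys that now land on one of its points move.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// Minimal consistent-hash ring: each backend owns several points on a u64 circle.
struct Ring {
    points: BTreeMap<u64, &'static str>,
}

impl Ring {
    fn new(backends: &[&'static str], points_per_backend: u32) -> Self {
        let mut points = BTreeMap::new();
        for &b in backends {
            for i in 0..points_per_backend {
                points.insert(hash_of(&(b, i)), b);
            }
        }
        Ring { points }
    }

    /// A key maps to the first point at or after its hash, wrapping around.
    fn select(&self, key: &[u8]) -> Option<&'static str> {
        let h = hash_of(key);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, b)| *b)
    }
}

fn main() {
    let ring = Ring::new(&["1.1.1.1:443", "1.0.0.1:443"], 100);
    // Same key, same backend, every time.
    println!("{:?}", ring.select(b"user-42"));
    println!("{:?}", ring.select(b"user-42"));
}
```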

What’s Next

This proxy can balance across multiple backends and skip unhealthy ones. But every request gets the same treatment — no matter who’s asking or what they’re asking for. In Part 3: Filters and Middleware, we’ll intercept requests, modify headers, and add custom logic that runs before and after each request.

Part 3: Filters and Middleware — Intercepting Traffic Without Tangling Your Logic

Parts 1 and 2 gave us a working proxy: requests come in, we pick a backend, the response goes out. But every request gets the same treatment. No validation. No modification. No way to say “this request isn’t allowed” or “add this header before forwarding.”

Real proxies need to intercept traffic. Block requests that fail authentication. Rewrite headers. Rate-limit abusive clients. Hide internal details from responses. These aren’t afterthoughts — they’re half the reason you’d write a proxy instead of using a config file.

This part introduces the full request lifecycle and the filter methods that let you control it.

The Request Lifecycle

Here’s what happens inside Pingora for every request:

  Client sends request
         │
         ▼
  early_request_filter()       ← First chance to inspect/modify
         │
         ▼
  request_filter()             ← Validate, rate-limit, or reject
         │
         ├── Ok(true) or Err ──► response sent; skip to logging()
         │
         ▼ Ok(false)
  upstream_peer()              ← Pick the backend
         │
         ▼
  Connect to upstream
         │
         ▼
  upstream_request_filter()    ← Modify request before sending
         │
         ▼
  Send request to upstream
         │
         ▼
  Receive response from upstream
         │
         ▼
  upstream_response_filter()   ← Modify response (before caching)
         │
         ▼
  [Cache layer]                ← Cached here if caching is enabled
         │
         ▼
  response_filter()            ← Modify response (after caching, before sending)
         │
         ▼
  Send response to client
         │
         ▼
  logging()                    ← Always runs, even on errors

You’ve already seen two of these: upstream_peer() (Part 1) and upstream_request_filter() (Parts 1–2). Now we’ll use the rest.

The key insight: each phase has a specific job. Don't put authentication logic in upstream_request_filter. Put it in request_filter, where it runs before any upstream connection is made. Likewise, don't rewrite client-facing response headers in upstream_response_filter unless you want the change stored in the cache too. response_filter is the place for edits only the client should see.

Wait — why are there two response filter phases? upstream_response_filter runs before Pingora’s cache layer, so changes you make there affect what gets cached. response_filter runs after caching, so changes only affect what the client sees. For now, we’ll use response_filter — it’s what you want most of the time. If you’re working with Pingora’s cache, upstream_response_filter is where you’d normalize headers for cacheability.

request_filter() — The Gatekeeper

This is where you decide whether a request should be allowed at all. The method returns Result<bool>:

  • Ok(false) — the request is fine, keep going
  • Ok(true) — I already sent a response to the client, stop here
  • Err(...) — something went wrong, turn this into an error response
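One way to keep the three outcomes straight is to model what happens next for each. The Next enum and the after_request_filter function are hypothetical, and the error is simplified to a status code. This sketches the contract described above; it is not Pingora's source.

```rust
/// What happens after request_filter returns (simplified model).
#[derive(Debug, PartialEq)]
enum Next {
    /// Ok(false): continue to upstream_peer() and proxy the request.
    Proxy,
    /// Ok(true): the filter already wrote a response; skip to logging().
    AlreadyResponded,
    /// Err(_): the framework writes an error response, then logging().
    ErrorResponse(u16),
}

/// `Err` carries a status code here purely for illustration.
fn after_request_filter(result: Result<bool, u16>) -> Next {
    match result {
        Ok(false) => Next::Proxy,
        Ok(true) => Next::AlreadyResponded,
        Err(status) => Next::ErrorResponse(status),
    }
}

fn main() {
    for r in [Ok(false), Ok(true), Err(500)] {
        println!("{:?} -> {:?}", r, after_request_filter(r));
    }
}
```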

Let’s add API key authentication. If a request doesn’t have the right X-API-Key header, it gets a 401:

async fn request_filter(
    &self,
    session: &mut Session,
    _ctx: &mut Self::CTX,
) -> Result<bool> {
    // Compare as byte slices: Option<&[u8]> == Option<&[u8]>
    let has_valid_key = session
        .req_header()
        .headers
        .get("X-API-Key")
        .map(|v| v.as_bytes())
        == Some(b"secret123".as_slice());

    if !has_valid_key {
        let _ = session.respond_error(401).await;
        return Ok(true); // we handled it; stop the pipeline
    }

    Ok(false) // request is OK, continue
}

The respond_error(401) call sends a 401 response to the client. The Ok(true) tells Pingora "I already sent a response, don't try to proxy this request upstream." If you returned Ok(false) instead, Pingora would go on to proxy the request and try to send a second response on a connection that already has one. That's a protocol violation.

This pattern — check a condition, send an error, return Ok(true) — is how you build gatekeepers. Rate limiting, IP blocking, authentication, authorization — they all use request_filter this way.

CTX — Per-Request State

In Parts 1 and 2, we used type CTX = () — no per-request state. Now we need it.

Here’s the problem: request_filter parses the API key, but later phases might need to know who authenticated. Or: request_filter extracts the request path, and upstream_peer uses it for routing. These phases can’t talk to each other directly — they’re separate method calls on the trait.

Before we solve it: what wouldn’t work?

// ❌ Global state: one value shared by every in-flight request,
// so concurrent requests overwrite each other's data
static CURRENT_USER: Mutex<Option<String>> = Mutex::new(None);

// ❌ Passing Arc<Mutex<RequestState>> through every phase:
// ProxyHttp's method signatures are fixed, and one shared lock
// would make every request wait for every other request
async fn request_filter(
    &self,
    session: &mut Session,
    state: Arc<Mutex<RequestState>>,
) -> Result<bool> { /* ... */ }

Both are approaches you might reach for in other Rust code, and both break down in a proxy. A global is shared by every request being handled concurrently, so it can't hold data that belongs to just one of them. And even if the trait let you add a parameter (it doesn't), a single Arc<Mutex> shared across requests would serialize them: every request would wait for every other request to read or write the shared state.

CTX is the answer. It’s a struct you define. Each request gets its own instance. Every phase can read and write it.

pub struct GatewayCtx {
    api_key: Option<String>,
    request_start: std::time::Instant,
}

#[async_trait]
impl ProxyHttp for Gateway {
    type CTX = GatewayCtx;
    fn new_ctx(&self) -> Self::CTX {
        GatewayCtx {
            api_key: None,
            request_start: std::time::Instant::now(),
        }
    }
    // ...
}

Now request_filter can store the API key, and logging can report how long the request took — both through the same ctx object.

Let's rewrite the authentication so it still checks the key and also saves it in the context:

async fn request_filter(
    &self,
    session: &mut Session,
    ctx: &mut Self::CTX,
) -> Result<bool> {
    // to_str() fails on values that aren't visible ASCII; treat those as missing
    let key = session
        .req_header()
        .headers
        .get("X-API-Key")
        .and_then(|v| v.to_str().ok())
        .map(|v| v.to_string());

    if key.as_deref() == Some("secret123") {
        ctx.api_key = key; // later phases (like logging) can read it
        Ok(false)
    } else {
        let _ = session.respond_error(401).await;
        Ok(true)
    }
}

The ctx is available in every phase. No global state, no thread-local hacks, no mutex for per-request data. It’s a struct that lives as long as the request — created when the request arrives, dropped when it completes.

response_filter() — Cleaning Up Responses

Upstream servers leak information. Server: nginx/1.18.0. X-Powered-By: PHP/7.4.3. X-Request-Id: abc123-internal. These headers tell attackers what you’re running and how your infrastructure is wired.

response_filter lets you strip or replace headers before the client sees them:

async fn response_filter(
    &self,
    _session: &mut Session,
    upstream_response: &mut pingora::http::ResponseHeader,
    _ctx: &mut Self::CTX,
) -> Result<()> {
    // Replace the Server header: don't advertise what the upstream runs
    upstream_response.insert_header("Server", "Gateway")?;

    // Remove headers that leak internal details
    upstream_response.remove_header("X-Powered-By");
    upstream_response.remove_header("X-Request-Id");

    // Add our own tracking header
    upstream_response.insert_header("X-Served-By", "pingora-tutorial")?;

    Ok(())
}

The upstream_response parameter is mutable — you can add, remove, or rewrite any header. This runs for every response, including error responses from the upstream.
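Header names are case-insensitive, so a cleanup pass has to catch x-powered-by as well as X-Powered-By. Pingora's header types are built on the http crate's HeaderMap, which already compares names case-insensitively. The std-only sketch below makes that explicit over a plain list of headers; the denylist and the helper names are ours.

```rust
/// Headers we never want clients to see (illustrative denylist).
const LEAKY: &[&str] = &["x-powered-by", "x-request-id"];

/// Replace any header with this name (case-insensitive), then add the new value.
fn set_header(headers: &mut Vec<(String, String)>, name: &str, value: &str) {
    headers.retain(|(n, _)| !n.eq_ignore_ascii_case(name));
    headers.push((name.to_string(), value.to_string()));
}

fn scrub_response_headers(headers: &mut Vec<(String, String)>) {
    // "X-Powered-By" and "x-powered-by" name the same header.
    headers.retain(|(n, _)| !LEAKY.iter().any(|l| n.eq_ignore_ascii_case(l)));
    set_header(headers, "Server", "Gateway");
    set_header(headers, "X-Served-By", "pingora-tutorial");
}

fn main() {
    let mut headers = vec![
        ("server".to_string(), "nginx/1.18.0".to_string()),
        ("X-Powered-By".to_string(), "PHP/7.4.3".to_string()),
        ("Content-Type".to_string(), "text/html".to_string()),
    ];
    scrub_response_headers(&mut headers);
    for (name, value) in &headers {
        println!("{}: {}", name, value);
    }
}
```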

logging() — The Phase That Always Runs

Every request ends up in logging(). Successful requests. Failed requests. Requests where request_filter rejected the client. All of them.

This makes it the right place for:

  • Access logs
  • Metrics
  • Request timing
  • Cleanup

async fn logging(
    &self,
    session: &mut Session,
    _e: Option<&pingora::Error>,
    ctx: &mut Self::CTX,
) {
    let status = session
        .response_written()
        .map_or(0, |resp| resp.status.as_u16());

    let elapsed = ctx.request_start.elapsed();

    println!(
        "status={} key={} elapsed={:?}",
        status,
        ctx.api_key.as_deref().unwrap_or("none"),
        elapsed,
    );
}

The _e parameter is Some(error) if the request failed and None if it succeeded. You can use it to log errors differently from successes, for example by sending errors to Sentry and only successes to your access log.
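Here is a sketch of that split. It picks a destination based on whether an error is present and formats the same fields the logging() example prints. Sink and format_log are hypothetical, and a plain &str stands in for Pingora's error type.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Sink {
    AccessLog,
    ErrorTracker,
}

/// Decide where a finished request's log line goes, and format it.
/// `error` stands in for logging()'s `Option<&pingora::Error>` parameter;
/// a status of 0 means no response header was written.
fn format_log(
    status: u16,
    api_key: Option<&str>,
    elapsed: Duration,
    error: Option<&str>,
) -> (Sink, String) {
    let line = format!(
        "status={} key={} elapsed={:?}",
        status,
        api_key.unwrap_or("none"),
        elapsed
    );
    match error {
        None => (Sink::AccessLog, line),
        Some(e) => (Sink::ErrorTracker, format!("{} error={}", line, e)),
    }
}

fn main() {
    let (sink, line) = format_log(200, Some("secret123"), Duration::from_millis(123), None);
    println!("{:?}: {}", sink, line);
}
```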

The Complete Proxy

Here’s our gateway with authentication, response cleanup, and logging:

use async_trait::async_trait;
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use pingora::lb::{LoadBalancer, selection::RoundRobin, health_check::TcpHealthCheck};
use std::sync::Arc;

pub struct Gateway(Arc<LoadBalancer<RoundRobin>>);

pub struct GatewayCtx {
    api_key: Option<String>,
    request_start: std::time::Instant,
}

#[async_trait]
impl ProxyHttp for Gateway {
    type CTX = GatewayCtx;
    fn new_ctx(&self) -> Self::CTX {
        GatewayCtx {
            api_key: None,
            request_start: std::time::Instant::now(),
        }
    }

    async fn request_filter(
        &self,
        session: &mut Session,
        ctx: &mut Self::CTX,
    ) -> Result<bool> {
        let key = session
            .req_header()
            .headers
            .get("X-API-Key")
            .and_then(|v| v.to_str().ok())
            .map(|v| v.to_string());

        if key.as_deref() == Some("secret123") {
            ctx.api_key = key;
            Ok(false)
        } else {
            let _ = session.respond_error(401).await;
            Ok(true)
        }
    }

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.0
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;

        let peer = Box::new(HttpPeer::new(
            upstream,
            true,
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }

    async fn response_filter(
        &self,
        _session: &mut Session,
        upstream_response: &mut pingora::http::ResponseHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_response.insert_header("Server", "Gateway")?;
        upstream_response.remove_header("X-Powered-By");
        upstream_response.remove_header("X-Request-Id");
        upstream_response.insert_header("X-Served-By", "pingora-tutorial")?;
        Ok(())
    }

    async fn logging(
        &self,
        session: &mut Session,
        _e: Option<&pingora::Error>,
        ctx: &mut Self::CTX,
    ) {
        let status = session
            .response_written()
            .map_or(0, |resp| resp.status.as_u16());
        let elapsed = ctx.request_start.elapsed();
        println!(
            "status={} key={} elapsed={:?}",
            status,
            ctx.api_key.as_deref().unwrap_or("none"),
            elapsed,
        );
    }
}

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    let mut upstreams = LoadBalancer::try_from_iter([
        "1.1.1.1:443",
        "1.0.0.1:443",
    ]).unwrap();

    let hc = TcpHealthCheck::new();
    upstreams.set_health_check(hc);
    upstreams.health_check_frequency = Some(std::time::Duration::from_secs(1));

    let background = pingora::services::background::background_service(
        "health_check",
        upstreams,
    );
    let upstreams = background.task();

    let mut service = http_proxy_service(&server.configuration, Gateway(upstreams));
    service.add_tcp("0.0.0.0:6188");

    server.add_service(background);
    server.add_service(service);

    server.run_forever();
}

Running It

cargo run

Without an API key — expect a 401:

curl http://127.0.0.1:6188 -sv
# < HTTP/1.1 401 Unauthorized

With an API key — the request goes through:

curl http://127.0.0.1:6188 -H "X-API-Key: secret123" -sv
# < HTTP/1.1 200 OK
# < Server: Gateway
# < X-Served-By: pingora-tutorial

Check the terminal where the proxy is running — you’ll see the logging output:

status=200 key=secret123 elapsed=123.456ms

Phase Reference

| Phase | When It Runs | What It’s For |
|---|---|---|
| early_request_filter | Before everything else | Module configuration (advanced) |
| request_filter | After request header is read | Validation, auth, rate limiting, rejection |
| upstream_peer | After request passes filters | Route selection, load balancing |
| upstream_request_filter | After upstream connection, before sending | Modify request headers for upstream |
| upstream_response_filter | After upstream responds | Modify response before caching |
| response_filter | Before sending to client | Modify response visible to client |
| response_body_filter | For each response body chunk | Transform response body |
| logging | After request completes | Access logs, metrics, timing |

Rule of thumb: “should this request be allowed?” → request_filter. “what should the upstream see?” → upstream_request_filter. “what should the client see?” → response_filter.
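The table’s ordering can be made concrete with a toy model in plain Rust (no Pingora types involved). The Phase enum and both functions are hypothetical illustrations that mirror the rows above; the real framework drives your ProxyHttp methods itself and has more hooks than this model lists.

```rust
/// Hypothetical mirror of the hooks in the phase table (not Pingora's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    EarlyRequestFilter,
    RequestFilter,
    UpstreamPeer,
    UpstreamRequestFilter,
    UpstreamResponseFilter,
    ResponseFilter,
    ResponseBodyFilter,
    Logging,
}

/// Hook order for a request that reaches the upstream and whose response
/// body arrives in `body_chunks` pieces.
fn proxied_request(body_chunks: usize) -> Vec<Phase> {
    use Phase::*;
    let mut phases = vec![
        EarlyRequestFilter,
        RequestFilter,
        UpstreamPeer,
        UpstreamRequestFilter,
        UpstreamResponseFilter,
        ResponseFilter,
    ];
    // response_body_filter runs once per body chunk
    phases.extend(std::iter::repeat(ResponseBodyFilter).take(body_chunks));
    // logging runs after the request completes
    phases.push(Logging);
    phases
}

/// Hook order when request_filter answers the client itself (like the 401 case):
/// no upstream is selected or contacted, but logging still runs.
/// Simplification: this model lists no response hooks on this path.
fn rejected_in_request_filter() -> Vec<Phase> {
    vec![Phase::EarlyRequestFilter, Phase::RequestFilter, Phase::Logging]
}

fn main() {
    println!("{:?}", proxied_request(2));
    println!("{:?}", rejected_in_request_filter());
}
```

The second function reflects the missing-API-key example from this chapter: once request_filter has answered the client, no upstream is chosen, yet the logging phase still sees the request.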

What’s Next

This proxy can authenticate requests, clean up responses, and log everything. But all traffic is still unencrypted between the client and the proxy. In Part 4: TLS, we’ll add TLS termination so clients can connect over HTTPS.

Part 4: TLS — Encrypting Both Sides of the Proxy

Your proxy works. It forwards requests. But right now, clients connect over plain HTTP — anyone on the network can read the traffic. In production, you need encryption. Clients should connect to your proxy over HTTPS, and your proxy should validate the certificates it sees when connecting to upstreams.

TLS is one of those things that sounds simple — “encrypt the connection” — but has a hundred details: which certificates to present, whether to verify the upstream’s identity, how to handle certificate rotation, what to do when the client presents its own certificate. Pingora handles the mechanics — the handshake, the encryption, the protocol negotiation. Your job is to decide what to trust.

This part covers two directions of TLS:

  • Downstream TLS — clients connect to your proxy over HTTPS. You present a certificate.
  • Upstream TLS — your proxy connects to backends over HTTPS. You verify their certificate.

They’re separate concerns with separate APIs. Let’s walk through both.

Downstream TLS: Accepting HTTPS

In Parts 1–3, we used service.add_tcp() to listen for plain HTTP connections. For HTTPS, you swap in add_tls():

// Before: plain HTTP
service.add_tcp("0.0.0.0:6188");

// After: HTTPS with a certificate
service.add_tls("0.0.0.0:6443", "cert.pem", "key.pem")?;

That’s it. Three arguments:

  1. The address — which port to listen on
  2. Certificate path — the PEM file containing your server certificate (and any intermediate certs)
  3. Key path — the PEM file containing your private key

Pingora reads the files, configures the TLS acceptor, and from that point on every connection on that port goes through a TLS handshake before any HTTP bytes are exchanged.

Where Do You Get a Certificate?

For development, generate a self-signed one:

openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
    -days 365 -nodes -subj '/CN=localhost'

This creates cert.pem and key.pem in the current directory. The -nodes flag skips passphrase encryption (fine for dev, bad for production). The CN=localhost means the certificate is valid for the hostname localhost.

For production, you’d use Let’s Encrypt, your organization’s PKI, or a certificate manager like cert-manager.

What About HTTP/2 and Custom TLS Settings?

The add_tls() API uses sensible defaults. If you need more control — HTTP/2 support, mutual TLS, or custom cipher suites — Pingora provides add_tls_with_settings() which takes a TlsSettings object. TlsSettings lives in pingora-core (not re-exported through the pingora convenience crate), so you’d add pingora-core as a direct dependency:

use pingora_core::listeners::TlsSettings;

let mut tls = TlsSettings::intermediate("cert.pem", "key.pem")?;
tls.enable_h2(); // advertise HTTP/2 during ALPN negotiation
service.add_tls_with_settings("0.0.0.0:6443", None, tls);

TlsSettings::intermediate() follows Mozilla’s intermediate TLS recommendations — a reasonable default for most deployments. The enable_h2() method sets ALPN to prefer HTTP/2 while still allowing HTTP/1.1.

For the tutorial, we’ll stick with the simple add_tls() API. The important thing is understanding the concepts — the API details vary by TLS backend (BoringSSL, rustls, s2n) and Pingora version.

Upstream TLS: Connecting to Backends

We’ve been using upstream TLS since Part 1: HttpPeer::new(("1.1.1.1", 443), true, "one.one.one.one".to_string()). The true means “connect over TLS.” The SNI string tells the upstream server which certificate to present.

But there’s more to upstream TLS than turning it on. The PeerOptions on an HttpPeer give you fine-grained control over certificate verification:

| Option | Default | What It Controls |
|---|---|---|
| verify_cert | true | Check that the upstream’s certificate is signed by a trusted CA |
| verify_hostname | true | Check that the certificate’s CN matches the SNI |
| use_system_certs | true | Load the OS trust store for verification |
| ca | None | Custom CA certificates (instead of system trust store) |
| alternative_cn | None | Accept a different hostname in the cert |

The defaults are secure. But there are legitimate reasons to change them:

verify_cert: false — useful in development when the upstream has a self-signed cert. Never do this in production.

Custom CA — if your organization runs its own certificate authority (common in internal networks), you’d load the CA cert and pass it via ca:

let mut peer = Box::new(HttpPeer::new(("internal.api", 443), true, "internal.api".to_string()));
// my_ca_certs is a placeholder: your CA certificates, loaded in the form
// your TLS backend expects
peer.options.ca = Some(my_ca_certs);

alternative_cn — if the upstream’s certificate has a different hostname than what you’re connecting to (e.g., connecting by IP but the cert has a domain name), you can specify what CN to accept:

peer.options.alternative_cn = Some("api.example.com".to_string());

Mutual TLS: When the Client Presents a Certificate

So far, TLS is one-way: the server proves its identity to the client. But some setups require the client to prove its identity too — this is mutual TLS (mTLS). It’s common in zero-trust networks and service mesh architectures.

In mTLS, the proxy (as a TLS server) asks the connecting client for a certificate. If the client can’t present a valid one, the connection is rejected.

With Pingora’s TlsSettings, you configure mTLS through a client certificate verifier:

// This is a simplified example — the actual verifier implementation
// depends on your PKI setup and TLS backend (rustls, BoringSSL, s2n).
let mut tls = TlsSettings::intermediate("cert.pem", "key.pem")?;
tls.set_client_cert_verifier(my_verifier);

The verifier is responsible for checking that the client’s certificate is valid — signed by a trusted CA, not expired, not revoked. This is where your specific trust model lives.

mTLS on the Upstream Side

If your proxy needs to present a client certificate when connecting to an upstream (i.e., the upstream requires mTLS), you set client_cert_key on the HttpPeer:

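use pingora::utils::CertKey;
use std::sync::Arc;

// How a CertKey is built (and which module exports it) depends on your Pingora
// version and TLS backend; from_pem_file stands in for that step here.
let client_cert = Arc::new(CertKey::from_pem_file("client-cert.pem", "client-key.pem")?);
let mut peer = Box::new(HttpPeer::new(("api.internal", 443), true, "api.internal".to_string()));
// client_cert_key is a field of HttpPeer itself, not of peer.options
peer.client_cert_key = Some(client_cert);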

This is common when your proxy is one service in a service mesh, and the upstream requires all callers to authenticate.

The Full Picture

Here’s what TLS looks like in both directions:

Client                     Your Proxy                    Upstream
  │                           │                            │
  │──── TLS handshake ──────►│                            │
  │  (client validates       │                            │
  │   your proxy's cert)     │                            │
  │                           │──── TLS handshake ──────►│
  │                           │  (proxy validates         │
  │                           │   upstream's cert)        │
  │                           │                            │
  │  [optional: mTLS]        │  [optional: mTLS]         │
  │  client presents cert    │  proxy presents cert       │
  │  to proxy                │  to upstream               │
  │                           │                            │
  │◄──── encrypted data ────►│◄──── encrypted data ─────►│

The two TLS sessions are independent. The client might use TLS 1.3 with a modern cipher suite while the upstream connection uses TLS 1.2 with a different suite. Pingora terminates the client’s TLS session, reads the HTTP request, then opens a separate TLS session to the upstream.

This is called TLS termination — the proxy is the endpoint for both TLS sessions, with unencrypted HTTP in the middle. This is what lets you inspect and modify requests in filters (Part 3). If you needed end-to-end encryption without the proxy seeing the plaintext, you’d use TLS passthrough — but that means no filters, no routing, no modification. It’s a tradeoff.

The Code

Let’s put it together. We’ll take the load balancer from Part 2, add downstream TLS (accept HTTPS on port 6443), and keep the existing HTTP listener on port 6188 for comparison.

use async_trait::async_trait;
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use pingora::lb::{LoadBalancer, selection::RoundRobin};
use std::sync::Arc;

pub struct LB(Arc<LoadBalancer<RoundRobin>>);

#[async_trait]
impl ProxyHttp for LB {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.0
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;
        let peer = Box::new(HttpPeer::new(
            upstream,
            true, // upstream TLS
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }
}

fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();

    let upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();
    let lb = LB(Arc::new(upstreams));

    let mut service = http_proxy_service(&server.configuration, lb);

    // Plain HTTP (for comparison/testing)
    service.add_tcp("0.0.0.0:6188");

    // HTTPS with a certificate
    service.add_tls("0.0.0.0:6443", "cert.pem", "key.pem").unwrap();

    server.add_service(service);
    server.run_forever();
}

The key changes from Part 2:

  1. add_tls() adds a second listener, so the proxy now accepts HTTPS on port 6443 while the add_tcp() listener keeps serving plain HTTP on port 6188
  2. The ProxyHttp implementation is unchanged — TLS is a listener concern, not a proxy logic concern
  3. The upstream connection (HttpPeer::new(..., true, ...)) still uses TLS as before

Running It

First, generate the self-signed certificate:

openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
    -days 365 -nodes -subj '/CN=localhost'

Then run the proxy:

cargo run

Test the HTTPS endpoint (note the --insecure flag — curl won’t trust our self-signed cert):

curl https://localhost:6443 --insecure -sv

You should see the connection use TLS, and the response from 1.1.1.1.

For comparison, the HTTP endpoint still works:

curl http://localhost:6188 -sv

What Just Happened

Let’s be clear about what Pingora handled:

  1. TLS handshake — Pingora performed the full TLS handshake with the client, presenting your certificate and negotiating cipher suites. You didn’t write any of that code.

  2. Certificate loading: add_tls() read the PEM files and configured the TLS stack. If the files were missing or malformed, it would fail at startup, not at request time.

  3. Separate listeners — The same proxy serves both HTTP and HTTPS. This is useful during migration (HTTP → HTTPS) and for health checks that don’t need encryption.

What you’re responsible for:

  1. Certificate management — Getting valid certificates, rotating them before they expire, protecting the private key. Pingora reads the files at startup; it doesn’t manage the certificates for you.

  2. Trust decisions — Whether to verify upstream certs, which CAs to trust, whether to accept self-signed certs. The defaults are secure, but you need to understand what you’re changing when you override them.

  3. Key protection — The private key file must be readable by the process but protected from everyone else. In production, use filesystem permissions, secrets managers, or HSMs.

Common Mistakes

Forgetting the SNI. The third argument to HttpPeer::new() is the SNI hostname. If you pass an IP address instead of a hostname, the upstream server may not be able to pick the right certificate, and the handshake or certificate verification can fail. This is why our example uses "one.one.one.one" even though we connect to 1.1.1.1.

Skipping verification in production. Setting verify_cert to false makes the handshake succeed but defeats the purpose of TLS: anyone who can intercept the connection can present a fake certificate. Use proper certificates from a trusted CA, or, for an internal CA, trust it explicitly through the ca option.

Certificate expiry. Pingora reads certificates at startup, so replacing the files on disk does not affect a running process. If your certificate expires while the proxy is running, TLS connections will fail. Automate renewal (for example, Let’s Encrypt with short-lived certs) and make sure the proxy picks up the new files, for instance through the graceful upgrade covered in Part 5.

What’s Next

Your proxy now accepts encrypted connections and verifies upstream certificates. But it’s still a single process: if it crashes or you need to update the code, every connection drops. In Part 5: Running in Production, we’ll cover graceful restarts, configuration files, and zero-downtime deployment.

Part 5: Running in Production — Logging, Metrics, and Not Breaking at 3 AM

You’ve built a proxy that load-balances, filters requests, and handles TLS. It works great on your laptop. But your laptop isn’t production. In production, things are different: the process needs to run in the background, it needs to survive machine restarts, and — the hard one — it needs to update without dropping connections.

Let’s talk about the operations side of running a Pingora proxy.

The bridge from Part 4 to here is one line in main():

// Before (Parts 1-4)
let mut server = Server::new(None).unwrap();

// After (this part)
let opt = Some(Opt::parse_args());          // add: CLI + config file support
let mut server = Server::new(opt).unwrap(); // change: pass opt instead of None

One line added, one line changed. That single change unlocks config files, daemon mode, CLI flags, and zero-downtime upgrades. Everything else in this chapter uses what that one line enables.

Configuration Files

So far, we’ve hardcoded everything: listen addresses, upstream backends, TLS certificate paths. That works for a tutorial. It doesn’t work when you need different settings per environment (dev, staging, production) or when you want to change settings without recompiling.

Pingora uses YAML configuration files. Create a file called conf.yaml:

---
version: 1
threads: 4
pid_file: /tmp/load_balancer.pid
upgrade_sock: /tmp/load_balancer.sock
error_log: /tmp/load_balancer_err.log

Then pass it to your server:

cargo run -- -c conf.yaml

The version: 1 is required — it tells Pingora which config format to expect. The other settings:

| Setting | What It Does |
|---|---|
| threads | Number of worker threads per service. Default is 1. Matching the number of CPU cores is a common starting point; benchmark with your own traffic. |
| pid_file | Where to write the process ID. Essential for scripting and monitoring. |
| upgrade_sock | Unix socket for graceful upgrades (we’ll get to this). |
| error_log | Where to write errors. If not set, goes to stderr. |
| daemon | Run in the background. Default: false. |
| user / group | User and group to switch to after daemonizing. Start as root to bind port 443, then drop to an unprivileged user. |

Any setting you don’t include uses its default. And here’s a nice detail: unknown settings are ignored, not rejected. This means you can add your own custom settings to the same file and read them in your code. Pingora won’t complain.

Reading Custom Settings

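Want to put your upstream backends in the config file instead of hardcoding them? Pingora won’t read those keys for you. server.configuration holds Pingora’s own parsed settings, and keys it doesn’t recognize are dropped during parsing. What you can do is keep the path passed with -c and parse that same file yourself:

use pingora::server::Server;
use pingora::server::configuration::Opt;

let opt = Opt::parse_args();
// Opt stores the -c/--conf value as an Option<String>
let conf_path = opt.conf.clone();

let mut server = Server::new(Some(opt)).unwrap();
server.bootstrap();

// server.configuration: Pingora's settings (threads, pid_file, ...)
// conf_path: the same YAML file, for your code to parse its own keys

The exact types differ between Pingora versions, but the key insight holds: the config file is your config file too. Pingora uses what it understands and ignores the rest.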
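As a minimal sketch of that pattern, the function below pulls one custom, comma-separated key out of the same YAML text Pingora reads (Pingora ignores keys it doesn’t know). The key name lb_upstreams and the helper read_custom_list are hypothetical, and the line-based parsing only handles a flat `key: a, b` entry; for anything richer you would deserialize the file into your own struct with a real YAML library such as serde_yaml. In a real main(), the text would come from std::fs::read_to_string on the path given with -c, which Pingora’s Opt keeps in its conf field.

```rust
/// Toy reader for a top-level `key: a, b, c` line in a YAML config.
/// Only handles a flat one-line value; use a real YAML parser for anything else.
fn read_custom_list(yaml: &str, key: &str) -> Option<Vec<String>> {
    let prefix = format!("{key}:");
    yaml.lines()
        // top-level keys start at column 0, so indented (nested) keys are not matched
        .find(|line| line.starts_with(&prefix))
        .map(|line| {
            line[prefix.len()..]
                .split(',')
                .map(|item| item.trim().to_string())
                .filter(|item| !item.is_empty())
                .collect()
        })
}

fn main() {
    // Same kind of settings as conf.yaml, plus one hypothetical custom key
    let conf = "---\n\
                version: 1\n\
                threads: 4\n\
                pid_file: /tmp/load_balancer.pid\n\
                lb_upstreams: 1.1.1.1:443, 1.0.0.1:443\n";

    match read_custom_list(conf, "lb_upstreams") {
        Some(backends) => println!("backends from config: {backends:?}"),
        None => println!("lb_upstreams not set; falling back to defaults"),
    }
}
```

The resulting list could then feed LoadBalancer::try_from_iter in place of the hardcoded addresses.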

Command-Line Arguments

Even without a config file, Pingora’s Server gives you command-line argument parsing for free. Change your main():

// Before: no CLI args
let mut server = Server::new(None).unwrap();

// After: Pingora handles CLI parsing
let mut server = Server::new(Some(Opt::parse_args())).unwrap();

Now your binary supports these flags:

| Flag | Effect |
|---|---|
| -d / --daemon | Run in the background |
| -c / --conf | Path to config file |
| -u / --upgrade | Graceful upgrade mode (more on this below) |
| -t / --test | Test the config and exit |

This is free functionality. You don’t write the arg parser, you don’t handle the flags. Pingora does it.

Running as a Daemon

With --daemon (or daemon: true in the config), the process forks into the background. A few things to know:

  1. The pid_file becomes essential. You need to know the PID to send signals. Check it with cat /tmp/load_balancer.pid.

  2. Privilege dropping happens automatically. If you set user and group in the config and start the process as root, Pingora can bind privileged ports like 443 and load certificates and keys first, then drop to the unprivileged user before accepting connections. This is the correct pattern: do privileged things early, then run unprivileged.

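  3. Forking means threads don’t survive. The daemon fork happens inside run_forever(), so any thread you spawn before that call is lost in the fork, even if you spawned it after bootstrap(). Do your setup before run_forever(), but run ongoing work as services (like the health-check background_service) that start after the fork, rather than spawning threads yourself.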

Signals: How to Stop and Restart

Pingora listens for three signals, each with different behavior:

SIGINT (Ctrl+C): Fast Shutdown

The process exits immediately. All in-flight requests are dropped. This is the “something is very wrong, kill it now” option.

kill -INT $(cat /tmp/load_balancer.pid)

SIGTERM: Graceful Shutdown

The process stops accepting new connections, waits for in-flight requests to finish, then exits. This is the “I want to stop, but I don’t want to break anything” option.

kill -TERM $(cat /tmp/load_balancer.pid)

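How long does it wait? That’s configurable. Pingora’s config file accepts grace_period_seconds (how long to wait after the signal before the final shutdown step) and graceful_shutdown_timeout_seconds (how long that final step may take). Check the defaults for your Pingora version before relying on them.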
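The “stop accepting, wait for in-flight requests, but not forever” step can be sketched with the standard library alone. This is a toy model of the idea, not Pingora’s shutdown code: wait_for_drain is a hypothetical helper that polls a counter of in-flight requests until it reaches zero or the grace period runs out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Poll the in-flight counter until it hits zero (everything finished)
/// or the grace period expires. Returns true if all requests drained in time.
fn wait_for_drain(in_flight: &AtomicUsize, grace: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + grace;
    loop {
        if in_flight.load(Ordering::SeqCst) == 0 {
            return true;
        }
        if Instant::now() >= deadline {
            return false; // grace period over: remaining requests would be cut off
        }
        thread::sleep(poll);
    }
}

fn main() {
    let in_flight = Arc::new(AtomicUsize::new(0));

    // Three simulated requests already in progress when SIGTERM arrives.
    let mut workers = Vec::new();
    for ms in [50u64, 100, 150] {
        in_flight.fetch_add(1, Ordering::SeqCst); // count it before the worker starts
        let counter = Arc::clone(&in_flight);
        workers.push(thread::spawn(move || {
            thread::sleep(Duration::from_millis(ms)); // simulated request work
            counter.fetch_sub(1, Ordering::SeqCst);
        }));
    }

    // A real server would stop accepting new connections here, then wait.
    let drained = wait_for_drain(&in_flight, Duration::from_secs(1), Duration::from_millis(10));
    println!("all requests finished within the grace period: {drained}");

    for worker in workers {
        worker.join().unwrap();
    }
}
```

Requests still running when wait_for_drain returns false are the ones a real server would cut off, which is why the grace period should comfortably exceed your slowest legitimate request.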

SIGQUIT: Graceful Upgrade

This is the interesting one. SIGQUIT triggers a graceful shutdown and transfers the listening sockets to a new instance. We’ll cover this in detail next.

Graceful Upgrades: Zero-Downtime Deployment

Here’s the problem: you found a bug in your proxy code. You fixed it, recompiled, and now you want to deploy the new binary. The naive approach:

  1. Stop the old binary → connections drop → errors
  2. Start the new binary → it binds the port → traffic resumes

During step 1, any request in flight gets an error. Clients see 502s or connection refused. For a proxy handling millions of requests, even a few seconds of errors is unacceptable.

Pingora solves this with graceful upgrades. The mechanism works like this:

New Instance (PID 5678)           Old Instance (PID 1234)
        │                               │
        │  Start with -u flag            │
        │  → create upgrade socket        │
        │  → wait for FDs                 │
        │                               │
        │                       SIGQUIT received
        │                       → connect to upgrade socket
        │◄──────────────────────────────│
        │                               │
        │  (receives listening FDs,       │  (finishes in-flight
        │   accepts new connections       │   requests, then exits)
        │   immediately)                  │
        │                               │
        │  handles all traffic            ✗ exits

How to Do It

Step by step:

1. Configure the upgrade socket. Both instances need to agree on where to transfer the sockets. This goes in conf.yaml:

upgrade_sock: /tmp/load_balancer.sock

2. Start the new instance in upgrade mode:

cargo run -- -c conf.yaml -d -u

The -u flag tells the new instance: “don’t try to bind the ports yourself. Instead, wait to acquire the listening sockets from the old instance.” The new process creates the upgrade socket and listens on it, waiting for the old process to connect.

3. Send SIGQUIT to the old instance:

kill -QUIT $(cat /tmp/load_balancer.pid)

4. What happens next:

  • The old instance receives SIGQUIT and connects to the upgrade socket on the new instance
  • It transfers its listening sockets to the new instance
  • The new instance receives the sockets and starts accepting connections immediately
  • The old instance enters graceful shutdown: it finishes in-flight requests, then exits
  • The new instance handles all new traffic

From a client’s perspective, the proxy never stopped. The listening socket was never closed. There was no gap where connections would be refused.

The Guarantee

Pingora’s graceful upgrade guarantees two things:

  1. No connection refused. Every request is handled by either the old instance or the new one. The listening socket transfers atomically.

  2. No terminated requests. Any request that can finish within the grace period is allowed to complete. The old instance doesn’t kill in-flight work.

These are strong guarantees. They’re why Cloudflare can deploy new versions of their proxy infrastructure without affecting the 40M+ requests per second flowing through it.

One-Liner Upgrade

In practice, the new instance needs to be running before the old one sends its sockets. The order matters:

# Start the new instance first — it listens on the upgrade socket
RUST_LOG=INFO cargo run -- -c conf.yaml -d -u && \
kill -QUIT $(cat /tmp/load_balancer.pid)

With -d (daemon mode), the process forks into the background and the command returns. Then we send SIGQUIT to the old process, which connects to the upgrade socket and transfers its listening FDs. This command only works in daemon mode — without -d, cargo run blocks the terminal and the kill -QUIT never runs. Without daemon mode, you’d need two terminal sessions: one running the new instance, one sending the signal.

Why this order? The new process creates the upgrade socket and listens on it. When the old process receives SIGQUIT, it connects to that socket and sends its file descriptors. If you signal the old process first, it tries to connect to the upgrade socket before the new process has created it — the old process will retry for a few seconds (Pingora has built-in retry logic), but starting the new process first is more reliable.
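The “retry until the other side is listening” behavior can be sketched with std’s Unix domain sockets. This is a toy model of the ordering only, not Pingora’s upgrade code: it sends a text message where Pingora would pass listening file descriptors, and connect_with_retry, the attempt count, and the socket path are all hypothetical. It runs on Unix-like systems only.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Connect to a Unix socket, retrying while the peer has not created it yet.
fn connect_with_retry(path: &Path, attempts: u32, delay: Duration) -> io::Result<UnixStream> {
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "attempts must be at least 1");
    for attempt in 1..=attempts {
        match UnixStream::connect(path) {
            Ok(stream) => return Ok(stream),
            Err(e) => {
                last_err = e;
                if attempt < attempts {
                    thread::sleep(delay); // socket not there yet: wait and try again
                }
            }
        }
    }
    Err(last_err)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("upgrade_demo_{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path);

    // "New instance": creates the upgrade socket a little late, then waits.
    let listen_path = path.clone();
    let new_instance = thread::spawn(move || -> io::Result<String> {
        thread::sleep(Duration::from_millis(300));
        let listener = UnixListener::bind(&listen_path)?;
        let (mut conn, _) = listener.accept()?;
        let mut received = String::new();
        conn.read_to_string(&mut received)?;
        Ok(received)
    });

    // "Old instance": signalled first, so it retries until the socket exists.
    let mut stream = connect_with_retry(&path, 10, Duration::from_millis(100))?;
    stream.write_all(b"listening sockets would be handed over here")?;
    drop(stream); // close so the reader sees end-of-stream

    let received = new_instance.join().expect("new instance thread panicked")?;
    println!("new instance received: {received}");
    std::fs::remove_file(&path)?;
    Ok(())
}
```

If the listener thread bound the socket before the connect started, the first attempt would succeed with no retries at all, which is the reliability argument for starting the new process first.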

The Code

The code changes for production are minimal — mostly it’s about using the APIs we’ve been ignoring. Here’s our load balancer with config file support, CLI args, and graceful upgrade readiness:

use async_trait::async_trait;
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use pingora::lb::{LoadBalancer, selection::RoundRobin, health_check::TcpHealthCheck};
use pingora::server::configuration::Opt;
use std::sync::Arc;

pub struct LB(Arc<LoadBalancer<RoundRobin>>);

#[async_trait]
impl ProxyHttp for LB {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.0
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;
        let peer = Box::new(HttpPeer::new(
            upstream,
            true,
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }
}

fn main() {
    // Parse CLI args — gives us -c, -d, -u, -t for free
    let opt = Some(Opt::parse_args());
    let mut server = Server::new(opt).unwrap();
    server.bootstrap();

    let mut upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();

    // Health checks — detect and skip broken backends
    let hc = TcpHealthCheck::new();
    upstreams.set_health_check(hc);
    upstreams.health_check_frequency = Some(std::time::Duration::from_secs(10));

    let background = background_service("health check", upstreams);
    let upstreams = background.task();

    let lb = LB(upstreams);
    let mut service = http_proxy_service(&server.configuration, lb);
    service.add_tcp("0.0.0.0:6188");

    server.add_service(background);
    server.add_service(service);

    // run_forever() handles:
    // - daemonization (if -d or daemon: true)
    // - signal handling (SIGINT, SIGTERM, SIGQUIT)
    // - graceful upgrade socket transfer (if -u)
    server.run_forever();
}

The key change from earlier parts: Server::new(Some(Opt::parse_args())). That one change gives you config file support, daemonization, CLI args, and graceful upgrade capability. Everything else — the proxy logic, the load balancing, the health checks — is the same.

Running in Production

Here’s a typical production workflow:

Start the proxy as a daemon:

RUST_LOG=INFO cargo run --release -- -c conf.yaml -d

Check it’s running:

cat /tmp/load_balancer.pid
curl http://localhost:6188 -svo /dev/null

Deploy a new version (zero downtime):

# Rebuild with your changes
cargo build --release

# Start the new process FIRST — it creates the upgrade socket
# Then signal the old process to transfer its sockets
RUST_LOG=INFO ./target/release/part-05-production -c conf.yaml -d -u && \
kill -QUIT $(cat /tmp/load_balancer.pid)

Stop it gracefully (no new connections, finish in-flight):

kill -TERM $(cat /tmp/load_balancer.pid)

Emergency stop (drop everything):

kill -INT $(cat /tmp/load_balancer.pid)

Systemd Integration

For production, you’ll likely run under systemd. Here’s a minimal service file:

[Unit]
Description=Pingora Load Balancer
After=network.target

[Service]
Type=forking
PIDFile=/tmp/load_balancer.pid
ExecStart=/usr/local/bin/load_balancer -c /etc/load_balancer/conf.yaml -d
ExecReload=/bin/sh -c '/usr/local/bin/load_balancer -c /etc/load_balancer/conf.yaml -d -u && kill -QUIT $(cat /tmp/load_balancer.pid)'
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

The ExecReload line does exactly what the one-liner above does: starts the new process in upgrade mode (which creates the upgrade socket and waits), then sends SIGQUIT to the old process. The old process connects to the upgrade socket, transfers its listening file descriptors, and enters graceful shutdown.

This works because ExecReload runs in a shell — we can chain the new process startup with the signal in one command. With -d (daemon mode), the new process forks to the background immediately, so kill -QUIT runs right after. The old process receives SIGQUIT, transfers its sockets, and exits. The new process starts accepting connections with no gap.

The one-liner from the section above and this systemd reload do the same thing. The unit file automates the same operation: systemctl reload load_balancer.

⚠️ systemctl restart is NOT a graceful upgrade. systemctl restart sends SIGTERM (stop) then starts a new process — there’s a gap between the old process stopping and the new process binding the port. During that gap, connections are refused. Use systemctl reload for zero-downtime deployment. Use systemctl restart only when you want a full stop-and-start (e.g., after a config change that can’t be picked up by reload).

The Type=forking tells systemd that the process will daemonize. The PIDFile lets systemd track the daemon’s PID.

What We’re Simplifying

A few things this part doesn’t cover in depth:

Observability. Pingora can run a Prometheus service alongside your proxy that exposes metrics registered with the prometheus crate, but you define those metrics yourself: register counters and histograms, then update them from your hooks. The logging phase from Part 3, which already has the status code and elapsed time, is a natural place to do that. For production, you want dashboards and alerts.

Hot config reload. Pingora reads the config file at startup. Changing the config requires a restart (graceful or otherwise). For dynamic configuration — like adding backends without restarting — you’d maintain an in-memory data structure and update it through your own mechanism (a config service, a file watcher, etc.).

Multiple services. A single Pingora Server can host multiple Service instances — different proxies on different ports, a metrics endpoint, an admin API. Each service has its own listeners and proxy logic.

What You’ve Built

Across these five parts, you’ve built a production-ready reverse proxy:

| Part | What You Added |
|---|---|
| 1 | A working reverse proxy |
| 2 | Load balancing and health checks |
| 3 | Request filtering, response modification, per-request state |
| 4 | TLS termination and certificate verification |
| 5 | Config files, daemonization, zero-downtime upgrades |

That’s a real proxy. Not a toy — the same framework powers 40M+ requests per second at Cloudflare. The APIs you’ve learned are the ones they use.

Where to Go Next

The Pingora ecosystem has more to explore:

  • Caching: pingora-cache provides HTTP caching with Cache-Control, Vary, and purge support
  • Rate limiting: pingora-limits provides rate and in-flight request estimators you can build limiters on
  • Custom protocols — Pingora does more than HTTP. You can build TCP proxies, tunneling services, or custom protocols on the same framework
  • Connection pooling — Pingora reuses upstream connections automatically. The pooling behavior is configurable per-peer.

The Pingora GitHub repository has examples for all of these. The user guide covers the internals in more depth than we did here.

The hardest part of building a proxy isn’t the code — it’s the operational concerns. Handling slow clients, backpressure, connection limits, retry storms, and the long tail of edge cases that only show up at scale. Pingora handles most of these for you. Your job is to configure it correctly and write the proxy logic that makes sense for your use case.

Part 6: HTTP Caching — Making Your Proxy Remember So Your Backends Don’t Have To

Your proxy sits between clients and backends. Every request goes upstream, every response comes back. That’s fine when your backends are fast and your traffic is low. But at scale, most requests are for the same content — the same images, the same API responses, the same static assets. Why fetch them from the upstream every time?

Caching is the answer, and it’s also the problem. HTTP caching is deceptively complex. It’s not “store the response, serve it again.” You need to decide what to cache, how long to cache it, who gets which version, and when to throw it away. The HTTP spec dedicates an entire RFC to caching semantics (RFC 9111, building on decades of earlier RFCs). Pingora’s caching layer implements these semantics; your job is to configure and customize them.

Let’s start with the concepts, then wire up a real cache.

Why Cache at the Proxy?

You could cache at the application layer — your backend could use Redis, or serve cached responses from memory. So why cache at the proxy?

Speed. The proxy is closer to the client than the backend. A cache hit at the proxy means zero network hops to the backend. For a globally distributed setup, that can be the difference between 10ms and 200ms.

Load reduction. Every cached response is a request your backend doesn’t see. During traffic spikes, the cache absorbs the load. Your backend stays healthy because it only handles cache misses.

Independence. The proxy cache works regardless of your backend technology. Whether you’re proxying to a Go service, a Python app, or a legacy Java monolith, the caching logic is the same. You don’t need every backend to implement its own caching.

Correctness. The proxy understands HTTP caching semantics — Cache-Control, Vary, ETag, conditional requests. It can serve stale content while revalidating in the background (stale-while-revalidate). These are the same semantics browsers use, enforced consistently at the proxy layer.

How HTTP Caching Works (The Short Version)

Before we get to Pingora’s API, let’s make sure we’re on the same page about HTTP caching.

Cache-Control

The Cache-Control header is the primary mechanism. The backend sends it with responses:

Cache-Control: max-age=300, public

This means: “cache this for 300 seconds, and any cache (browser, CDN, proxy) can store it.”

Common directives:

| Directive | Meaning |
|---|---|
| max-age=N | The response is fresh for N seconds |
| s-maxage=N | Same, but only for shared caches (like your proxy), where it overrides max-age |
| public | Any cache can store this |
| private | Only a private cache (the browser) can store this; shared caches like the proxy must not |
| no-cache | A cache can store it, but must revalidate before using it |
| no-store | Don’t cache at all |
| stale-while-revalidate=N | Serve stale content for up to N seconds while fetching fresh content in the background |
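The response_cache_filter in Step 3 checks directives with a plain substring match. To make the table concrete, here is a minimal std-only sketch of how a shared cache might turn a Cache-Control value into a freshness lifetime. Directives, parse_cache_control, and shared_cache_ttl are hypothetical names for illustration, not Pingora’s API. The sketch ignores Expires and heuristic freshness, and it treats the qualified forms of private and no-cache as unqualified.

```rust
/// The subset of Cache-Control directives this sketch models.
#[derive(Debug, Default, PartialEq)]
pub struct Directives {
    pub no_store: bool,
    pub private: bool,
    pub no_cache: bool,
    pub max_age: Option<u64>,
    pub s_maxage: Option<u64>,
    pub stale_while_revalidate: Option<u64>,
}

/// Splits a Cache-Control header value into directives.
/// Directive names are case-insensitive; unknown directives are ignored.
pub fn parse_cache_control(value: &str) -> Directives {
    let mut d = Directives::default();
    for part in value.split(',') {
        let part = part.trim();
        let (name, arg) = match part.split_once('=') {
            Some((n, v)) => (n.trim(), Some(v.trim().trim_matches('"'))),
            None => (part, None),
        };
        let secs = arg.and_then(|v| v.parse::<u64>().ok());
        match name.to_ascii_lowercase().as_str() {
            "no-store" => d.no_store = true,
            "private" => d.private = true,
            "no-cache" => d.no_cache = true,
            "max-age" => d.max_age = secs,
            "s-maxage" => d.s_maxage = secs,
            "stale-while-revalidate" => d.stale_while_revalidate = secs,
            _ => {}
        }
    }
    d
}

/// Freshness lifetime in seconds for a shared cache such as a proxy.
/// Returns None when the response must not be stored, or when it carries
/// no explicit lifetime (heuristic freshness is out of scope here).
pub fn shared_cache_ttl(value: &str) -> Option<u64> {
    let d = parse_cache_control(value);
    if d.no_store || d.private {
        return None;
    }
    if d.no_cache {
        // Storable, but stale immediately: every use needs revalidation.
        return Some(0);
    }
    // For shared caches, s-maxage takes precedence over max-age.
    d.s_maxage.or(d.max_age)
}
```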

Vary

The Vary header tells the cache which request headers affect the response:

Vary: Accept-Encoding

This means: “I serve different content based on Accept-Encoding. Store separate cache entries for gzip vs. brotli vs. uncompressed.”

If Vary is missing, the cache assumes the response is the same regardless of request headers. Getting Vary wrong is a common source of cache bugs — serving compressed content to clients that can’t handle it, or serving the wrong language variant.
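To make “separate cache entries” concrete, here is a minimal sketch of how a cache could derive the variant part of a key from a Vary header and the request’s headers. vary_key is a hypothetical helper, not Pingora’s API (Pingora exposes this through its cache_vary_filter hook). It assumes the header names in the map are already lowercase.

```rust
use std::collections::HashMap;

/// Builds the request-dependent part of a cache key from a response's Vary
/// header. `request_headers` must use lowercase header names.
/// Returns None for `Vary: *`, which means no stored response can ever match.
pub fn vary_key(vary: &str, request_headers: &HashMap<String, String>) -> Option<String> {
    let mut names: Vec<String> = vary
        .split(',')
        .map(|n| n.trim().to_ascii_lowercase())
        .filter(|n| !n.is_empty())
        .collect();
    if names.iter().any(|n| n == "*") {
        return None;
    }
    // Header names are case-insensitive and order doesn't matter, so sort and
    // dedupe to make "Accept-Encoding, Accept-Language" equal to the reverse.
    names.sort();
    names.dedup();
    let parts: Vec<String> = names
        .iter()
        .map(|name| {
            // A missing header is treated like an empty value.
            let value = request_headers.get(name).map(String::as_str).unwrap_or("");
            format!("{name}={value}")
        })
        .collect();
    // Newlines can't appear in header values, so they make a safe separator.
    Some(parts.join("\n"))
}
```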

Conditional Requests

When a cached response is stale (past its max-age), the proxy doesn’t throw it away. It can send a conditional request to check if the content has changed:

  • If-None-Match: "abc123" — “I have version abc123. Only send the full response if it’s changed.” If the backend responds with 304 Not Modified, the proxy refreshes the cache metadata and serves the cached body.

  • If-Modified-Since: Tue, 21 Oct 2025 07:28:00 GMT (same idea, but based on the last modification time).

Conditional requests save bandwidth — the backend doesn’t need to send the full response body if nothing changed.
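Here is a minimal sketch of the proxy’s side of revalidation. It assumes the cache kept the ETag and Last-Modified values from the original response. CachedValidators, revalidation_headers, RevalidationOutcome, and on_revalidation_response are hypothetical names for illustration, not Pingora’s API.

```rust
/// Validators remembered from the original cached response.
pub struct CachedValidators {
    pub etag: Option<String>,
    pub last_modified: Option<String>,
}

/// Conditional headers to add to the upstream request for a stale entry.
pub fn revalidation_headers(v: &CachedValidators) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &v.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(date) = &v.last_modified {
        headers.push(("If-Modified-Since", date.clone()));
    }
    headers
}

#[derive(Debug, PartialEq)]
pub enum RevalidationOutcome {
    /// 304: keep the cached body, refresh headers and freshness.
    RefreshMetadata,
    /// 200: the content changed; replace the cached entry.
    ReplaceEntry,
    /// Anything else: pass the response through without touching the cache.
    PassThrough,
}

/// Maps the upstream's answer to a conditional request onto a cache action.
pub fn on_revalidation_response(status: u16) -> RevalidationOutcome {
    match status {
        304 => RevalidationOutcome::RefreshMetadata,
        200 => RevalidationOutcome::ReplaceEntry,
        _ => RevalidationOutcome::PassThrough,
    }
}
```

Sending both validators is allowed. When both are present, the server evaluates If-None-Match and ignores If-Modified-Since.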

The Three Hooks

Pingora’s cache integrates with the same ProxyHttp trait you’ve been using since Part 1. There are three methods you implement to make caching work:

Request arrives
     │
     ▼
┌────────────────────────┐
│  request_cache_filter  │  ← Should this request use the cache?
│  (enable caching)      │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  cache_key_callback    │  ← What’s the cache key for this request?
│  (generate the key)    │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  Pingora looks up      │  ← Internal: check storage for this key
│  the key in storage    │
└───────────┬────────────┘
            │
      ┌─────┴──────┐
      │            │
  Cache Hit   Cache Miss
      │            │
      ▼            ▼
┌──────────┐ ┌─────────────────────────┐
│ Serve it │ │  response_cache_filter  │ ← Is the upstream response cacheable?
│ from     │ │  (decide what to store) │
│ cache    │ └────────────┬────────────┘
└──────────┘              │
                          ▼
                   ┌──────────────┐
                   │  Store it    │
                   │  in cache    │
                   └──────────────┘

Each hook corresponds to a method on ProxyHttp. Let’s walk through them.

Step 1: Enable Caching with request_cache_filter

By default, caching is off. You have to opt in. request_cache_filter is where you decide: “should this request even use the cache?”

use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};

// A method inside `impl ProxyHttp for LB` (the full listing appears later in this part).
fn request_cache_filter(
    &self,
    session: &mut Session,
    _ctx: &mut Self::CTX,
) -> Result<()> {
    // Only cache GET and HEAD requests.
    // POST, PUT, DELETE are mutations: caching them is almost always wrong.
    let method = &session.req_header().method;
    if method != "GET" && method != "HEAD" {
        return Ok(());
    }

    // Enable the cache with our storage backend.
    session.cache.enable(
        self.storage,  // the Storage implementation
        None,          // eviction manager (entries live until process restart)
        None,          // predictor (every eligible request goes through cache logic)
        None,          // cache lock (concurrent misses fetch independently)
        None,          // option overrides
    );

    Ok(())
}

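The key call is session.cache.enable(). It takes a storage backend and four optional parameters. Here’s what the storage backend and the first three options do (the fourth, option overrides, stays None throughout this part):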

  • storage — Where cached data lives. We’re using MemCache, an in-memory hashtable. The pingora-cache crate docs mark it “for testing only,” but it’s the right tool for learning the API.
  • eviction — How to decide what to throw away when the cache is full. Without one, nothing gets evicted — entries live until the process restarts.
  • predictor — A hint about whether a request is likely cacheable. Without one, every eligible request goes through the full cache lookup.
  • cache_lock — What to do when multiple requests miss the cache for the same key. Without one, they all fetch independently. With one, the first request fetches and the rest wait for its result.

For a learning setup, None for everything except storage is fine. Production deployments would configure eviction and cache locking.

One thing to notice: storage is &'static (dyn Storage + Sync). The cache outlives any single request, so Pingora needs a reference that’s valid for the program’s lifetime. In our main(), we get this with Box::leak:

let storage: &'static MemCache = Box::leak(Box::new(MemCache::new()));
A tidier alternative is a static LazyLock, which gives you the same &'static reference without leaking anything by hand:

use pingora::cache::MemCache;
use std::sync::LazyLock;

static STORAGE: LazyLock<MemCache> = LazyLock::new(MemCache::new);

// `&*STORAGE` is a `&'static MemCache`, ready to pass to session.cache.enable().

LazyLock builds the value on first use and hands out references to that same value for the rest of the program. The Box::leak approach works fine for a tutorial: the memory is allocated once and lives until the process exits.

Why &'static? The cache storage must outlive every individual request. Pingora can’t hold a regular reference — it would need a lifetime parameter on every struct that touches the cache, which would propagate through the entire API. Box::leak gives us a static reference by intentionally “leaking” the allocation. The memory lives until the process exits, which is exactly how long we need it. This isn’t a memory leak in the traditional sense — it’s a deliberate, one-time allocation that we never intend to free.
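To see why the lifetime matters, here is a std-only sketch. The Storage trait and ToyMemCache type are hypothetical stand-ins for Pingora’s. spawn_writer takes the same &'static (dyn Storage + Sync) shape as the storage argument of session.cache.enable() and moves the reference into a new thread. The compiler only allows that because the reference is 'static. Both a Box::leak reference and a LazyLock static satisfy it.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a cache storage trait.
pub trait Storage {
    fn put(&self, key: &str, body: &str);
    fn get(&self, key: &str) -> Option<String>;
}

/// A toy in-memory store; the RwLock makes it Sync so it can be shared.
#[derive(Default)]
pub struct ToyMemCache {
    map: RwLock<HashMap<String, String>>,
}

impl Storage for ToyMemCache {
    fn put(&self, key: &str, body: &str) {
        self.map.write().unwrap().insert(key.to_string(), body.to_string());
    }
    fn get(&self, key: &str) -> Option<String> {
        self.map.read().unwrap().get(key).cloned()
    }
}

/// Same parameter shape as the storage argument to `enable()`.
/// `thread::spawn` requires a 'static closure, so a shorter-lived
/// reference would be rejected at compile time.
pub fn spawn_writer(storage: &'static (dyn Storage + Sync), key: &'static str) -> JoinHandle<()> {
    thread::spawn(move || storage.put(key, "cached body"))
}

/// The LazyLock alternative: built on first access, lives for the whole program.
pub static LAZY_STORAGE: LazyLock<ToyMemCache> = LazyLock::new(ToyMemCache::default);
```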

Step 2: Generate the Cache Key with cache_key_callback

The cache key determines whether two requests map to the same cached response. This is the most security-sensitive part of caching. Get the key wrong and you can serve one user’s data to another, or let an attacker plant a response that every other user then receives (cache poisoning).

One critical detail before we build the key: the upstream decides whether a response may be cached at all. If one.one.one.one sends Cache-Control: no-store, the response_cache_filter in Step 3 refuses to store it, even if your key construction is perfect. The contract is: the proxy decides how to store and retrieve; the origin decides whether to allow it. Our example works because one.one.one.one explicitly sends Cache-Control: public, max-age=60.

A simple starting point: use the request path as the key. This means /path/a and /path/b are cached separately.

use pingora::cache::CacheKey;
use pingora::prelude::*;
use pingora::proxy::Session;

static CACHE_NAMESPACE: &[u8] = b"pingora-tutorial";

// A method inside `impl ProxyHttp for LB`.
fn cache_key_callback(
    &self,
    session: &Session,
    _ctx: &mut Self::CTX,
) -> Result<CacheKey> {
    let uri = session.req_header().uri.path().as_bytes();
    Ok(CacheKey::new(
        CACHE_NAMESPACE,     // namespace: distinguishes this proxy's cache
        uri,                 // primary: the request path
        "tutorial-cache",    // user_tag: a fixed label, not part of the hash
    ))
}

The CacheKey has three parts:

  • namespace: Separates cache entries from different proxies sharing the same storage. Like a table prefix in a shared database.
  • primary: The main identifier. Usually the request URI. Both namespace and primary are hashed together to form the lookup key.
  • user_tag: An extra tag that is not part of the hash. pingora-cache documents it as a way to identify users, for example so a storage backend can enforce per-user quotas. We just pass a fixed label.

Why is the key security-sensitive? Consider what happens if you forget the namespace: two proxies sharing the same MemCache would overwrite each other’s entries. Or if you include only the host but not the path: every path on the same host would share one cache entry. The key has to capture everything that makes one response different from another.

Our example uses .uri.path(), which excludes the query string. A request to /api/users?page=1 and one to /api/users?page=2 would share the same cache entry, returning the first page’s response for the second page. For an API with query parameters, include the query in the key too; .uri.path_and_query() returns both, wrapped in an Option. We use .path() here because one.one.one.one doesn’t vary responses by query string, but a real API almost certainly would. The key also leaves out the Host header, which is only safe because this proxy fronts a single site.
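Here is a sketch of a primary key that covers the host, path, and query together. primary_key is a hypothetical helper. In the proxy you would feed it the Host header plus uri.path() and uri.query() (both methods on the http crate’s Uri), then pass the result to CacheKey::new as the primary.

```rust
/// Builds the primary part of a cache key from everything that changes the
/// response for this (hypothetical) proxy: host, path, and query string.
pub fn primary_key(host: &str, path: &str, query: Option<&str>) -> String {
    // Host names are case-insensitive; normalize so EXAMPLE.com and example.com share entries.
    let host = host.to_ascii_lowercase();
    match query {
        Some(q) if !q.is_empty() => format!("{host}{path}?{q}"),
        // Treat "/a?" and "/a" as the same resource.
        _ => format!("{host}{path}"),
    }
}
```

This keeps ?a=1&b=2 and ?b=2&a=1 as separate entries. Sorting the query parameters would merge them, but only if you’re sure your backend treats them identically.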

The default implementation of cache_key_callback panics — you must override it when caching is enabled. This is by design. There’s no safe default because the correct key depends on your application.

Step 3: Decide What’s Cacheable with response_cache_filter

After the upstream responds, Pingora asks: “should we store this?” The default answer is no — you have to opt in.

use pingora::cache::{CacheMeta, NoCacheReason, RespCacheable::*};
use pingora::prelude::*;
use pingora::proxy::Session;
use std::time::{Duration, SystemTime};

// A method inside `impl ProxyHttp for LB`.
fn response_cache_filter(
    &self,
    _session: &Session,
    resp: &pingora::http::ResponseHeader,
    _ctx: &mut Self::CTX,
) -> Result<pingora::cache::RespCacheable> {
    // Cache 200 responses for 60 seconds.
    // Skip everything else: 404s, 500s, redirects.
    if resp.status != 200 {
        return Ok(Uncacheable(NoCacheReason::Custom("non-200")));
    }

    // Respect Cache-Control from the upstream.
    if let Some(cc) = resp.headers.get("cache-control") {
        let cc_str = cc.to_str().unwrap_or("");
        if cc_str.contains("no-store") || cc_str.contains("private") {
            return Ok(Uncacheable(NoCacheReason::OriginNotCache));
        }
    }

    // Build the cache metadata.
    let now = SystemTime::now();
    let meta = CacheMeta::new(
        now + Duration::from_secs(60),  // fresh_until: when the entry becomes stale
        now,                             // created: when it was cached
        0,                               // stale-while-revalidate seconds
        0,                               // stale-if-error seconds
        resp.clone(),                    // the response header to cache
    );

    Ok(Cacheable(meta))
}

The return type is RespCacheable — either Cacheable(CacheMeta) or Uncacheable(NoCacheReason). The CacheMeta tells Pingora how long to keep the entry and when to revalidate.

Our implementation is conservative:

  • Only 200s get cached. A 404 might be worth caching briefly (to avoid hammering the backend for a known-missing resource), but it depends on the use case.
  • We check Cache-Control from the upstream. If the origin says no-store or private, we respect it. This is important — the proxy shouldn’t cache things the origin didn’t intend to be cached.
  • The TTL is 60 seconds. Short enough that stale data doesn’t linger, long enough to absorb traffic spikes.

This is a simplification. A production cache would also check Vary, handle stale-while-revalidate, and respect s-maxage. Pingora’s cache internals handle most of this automatically when you enable the full pipeline. Our example manually creates CacheMeta because we’re demonstrating the API at a level where you can see every piece.

The Full Picture

Here’s what the caching proxy looks like when you put it all together:

use async_trait::async_trait;
use pingora::cache::{CacheKey, CacheMeta, MemCache, NoCacheReason, RespCacheable::*};
use pingora::lb::{LoadBalancer, selection::RoundRobin};
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use std::sync::Arc;
use std::time::{Duration, SystemTime};

static CACHE_NAMESPACE: &[u8] = b"pingora-tutorial";

pub struct LB {
    upstreams: Arc<LoadBalancer<RoundRobin>>,
    storage: &'static MemCache,
}

#[async_trait]
impl ProxyHttp for LB {
    type CTX = ();
    fn new_ctx(&self) -> Self::CTX {}

    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.upstreams
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;
        Ok(Box::new(HttpPeer::new(
            upstream, true, "one.one.one.one".to_string(),
        )))
    }

    fn request_cache_filter(
        &self,
        session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        let method = &session.req_header().method;
        if method != "GET" && method != "HEAD" {
            return Ok(());
        }
        session.cache.enable(self.storage, None, None, None, None);
        Ok(())
    }

    fn cache_key_callback(
        &self,
        session: &Session,
        _ctx: &mut Self::CTX,
    ) -> Result<CacheKey> {
        let uri = session.req_header().uri.path().as_bytes();
        Ok(CacheKey::new(CACHE_NAMESPACE, uri, "tutorial-cache"))
    }

    fn response_cache_filter(
        &self,
        _session: &Session,
        resp: &pingora::http::ResponseHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<pingora::cache::RespCacheable> {
        if resp.status != 200 {
            return Ok(Uncacheable(NoCacheReason::Custom("non-200")));
        }
        if let Some(cc) = resp.headers.get("cache-control") {
            let cc_str = cc.to_str().unwrap_or("");
            if cc_str.contains("no-store") || cc_str.contains("private") {
                return Ok(Uncacheable(NoCacheReason::OriginNotCache));
            }
        }
        let now = SystemTime::now();
        let meta = CacheMeta::new(
            now + Duration::from_secs(60), now, 0, 0, resp.clone(),
        );
        Ok(Cacheable(meta))
    }
}

fn main() {
    let opt = Some(Opt::parse_args());
    let mut server = Server::new(opt).unwrap();
    server.bootstrap();

    let storage: &'static MemCache = Box::leak(Box::new(MemCache::new()));
    let upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();
    let lb = LB { upstreams: Arc::new(upstreams), storage };

    let mut service = http_proxy_service(&server.configuration, lb);
    service.add_tcp("0.0.0.0:6188");

    server.add_service(service);
    server.run_forever();
}

This is a working caching proxy. The first GET to any path goes upstream. The second serves from cache. After 60 seconds, the entry goes stale and the next request fetches fresh content.

Cache Strategies

The technical setup is one thing. The harder question is: what should you cache?

Cache Static Assets Aggressively

Images, CSS, JavaScript — these change infrequently. Cache them with a long max-age. Let the upstream set the headers, or override them in response_cache_filter:

Cache-Control: public, max-age=86400

Cache API Responses Carefully

API responses might be user-specific (private) or change frequently. Cache with short TTLs and revalidation:

Cache-Control: public, max-age=10, stale-while-revalidate=60

This serves cached content as fresh for up to 10 seconds, then serves stale content for up to 60 more seconds while fetching a fresh copy in the background. Within that 70-second window, the client always gets a fast response from cache, and the data it sees is never more than about 70 seconds old.
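The timeline above reduces to a small classification. Here is a sketch using whole-second ages, with hypothetical names. In Pingora’s CacheMeta, the same two numbers appear as fresh_until and the stale-while-revalidate seconds.

```rust
/// Where a cached entry sits on the max-age / stale-while-revalidate timeline.
#[derive(Debug, PartialEq)]
pub enum Freshness {
    /// Serve from cache, no upstream request.
    Fresh,
    /// Serve from cache now and refresh in the background.
    StaleWhileRevalidate,
    /// Too old to serve without going upstream first.
    Expired,
}

/// Classifies an entry by its age, given max-age and stale-while-revalidate in seconds.
pub fn classify(age_secs: u64, max_age: u64, swr: u64) -> Freshness {
    if age_secs < max_age {
        Freshness::Fresh
    } else if age_secs < max_age.saturating_add(swr) {
        Freshness::StaleWhileRevalidate
    } else {
        Freshness::Expired
    }
}
```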

Don’t Cache Authenticated Content

If the response depends on who’s asking (e.g., a user profile), either:

  • Set Cache-Control: private so the proxy doesn’t cache it
  • Vary on the Authorization header (but this means every unique token gets a separate cache entry — usually wrong)
  • Skip caching in request_cache_filter for authenticated requests
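The third option can be a simple check in request_cache_filter before calling session.cache.enable(). Here is a minimal sketch with a hypothetical should_bypass_cache helper that reads a lowercase header map. Treating any Cookie header as authentication is a deliberately conservative assumption. For Authorization, RFC 9111 already forbids a shared cache from reusing a response to an authorized request unless the response explicitly allows it (for example with public or s-maxage).

```rust
use std::collections::HashMap;

/// Returns true if a request should skip the cache entirely.
/// `headers` must use lowercase names.
pub fn should_bypass_cache(method: &str, headers: &HashMap<String, String>) -> bool {
    let cacheable_method = method == "GET" || method == "HEAD";
    // Credentials usually mean a per-user response; don't risk sharing it.
    let has_credentials = headers.contains_key("authorization") || headers.contains_key("cookie");
    !cacheable_method || has_credentials
}
```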

Cache 404s Briefly

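Caching a 404 for a non-existent resource keeps repeated requests for it from hammering the backend. But cache it briefly, since the resource might be created later. (Our response_cache_filter only caches 200s, so you’d need to extend it to do this.)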

Cache-Control: public, max-age=30

Cacheability Checklist

Before you wire in caching, run through this:

  • ✅ The origin sends Cache-Control (not no-store or private)
  • ✅ The request method is cacheable (GET or HEAD)
  • ✅ The cache key uniquely identifies the response (if the upstream varies on headers, include them in the key)

Cache Invalidation

The hardest part of caching isn’t storing things — it’s knowing when to throw them away.

Time-based expiration. The simplest approach. Set a max-age, and the entry becomes stale when it expires; the next request for it revalidates or refetches. Works well for content that changes predictably.

Purge. Explicitly remove a cached entry. Useful when you know content has changed (e.g., after a deploy). Pingora supports purge via the Storage::purge() method.

Revalidation. When a cached entry is stale, the proxy checks with the upstream before serving it. Conditional requests (ETag, If-Modified-Since) make this efficient — the upstream only sends the full response if the content actually changed.

Stale-while-revalidate. Serve stale content immediately while fetching fresh content in the background. The client gets a fast response (from cache), and the cache is updated for the next request. This is the best default for content that should be fresh but where a few seconds of staleness is acceptable.

What We’re Simplifying

Our cache works, but it’s a teaching implementation. A production cache would add several things we’re skipping.

Eviction. Without an eviction manager, cached entries live until the process restarts. Under load, MemCache grows without bound. A production cache would evict stale entries and enforce a size limit.
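Here is what “enforce a size limit” means in its simplest form: a least-recently-used cache capped by entry count. This is a std-only illustration with hypothetical names. Pingora’s eviction managers are separate components passed to session.cache.enable(), and a real limit would count bytes, not entries.

```rust
use std::collections::{HashMap, VecDeque};

/// A tiny LRU cache bounded by entry count.
/// `order` runs from least recently used (front) to most recently used (back).
/// Lookups scan `order`, which is fine for a sketch but O(n) per access.
pub struct LruCache {
    capacity: usize,
    map: HashMap<String, String>,
    order: VecDeque<String>,
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        LruCache { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Moves `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }

    pub fn get(&mut self, key: &str) -> Option<&String> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    pub fn put(&mut self, key: &str, value: &str) {
        if self.map.contains_key(key) {
            self.touch(key);
        } else {
            if self.map.len() == self.capacity {
                // Full: evict the least recently used entry.
                if let Some(oldest) = self.order.pop_front() {
                    self.map.remove(&oldest);
                }
            }
            self.order.push_back(key.to_string());
        }
        self.map.insert(key.to_string(), value.to_string());
    }
}
```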

Cache locking. Without a cache lock, multiple concurrent requests for the same uncached resource all fetch from upstream independently. A cache lock makes the first request the “writer” and the rest wait for its result — avoiding the “thundering herd” problem.
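Here is a std-only sketch of the idea behind a cache lock. The first miss for a key becomes the writer, concurrent misses wait on a condition variable, and everyone receives the writer’s result. CoalescingCache is a hypothetical name, and this is not Pingora’s implementation. It has no timeouts and no handling for a fetch that panics, so waiters would block forever in that case.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// One entry: empty while the writer fetches, filled once it finishes.
type Slot = Arc<(Mutex<Option<String>>, Condvar)>;

#[derive(Default)]
pub struct CoalescingCache {
    slots: Mutex<HashMap<String, Slot>>,
}

impl CoalescingCache {
    pub fn get_or_fetch(&self, key: &str, fetch: impl FnOnce() -> String) -> String {
        // Decide writer vs. waiter while holding the map lock, so exactly one
        // caller creates the slot for a given key.
        let (slot, is_writer) = {
            let mut slots = self.slots.lock().unwrap();
            match slots.get(key) {
                Some(slot) => (Arc::clone(slot), false),
                None => {
                    let slot: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                    slots.insert(key.to_string(), Arc::clone(&slot));
                    (slot, true)
                }
            }
        };
        let (value, ready) = &*slot;
        if is_writer {
            // Only the writer goes upstream; the map lock is not held here.
            let result = fetch();
            *value.lock().unwrap() = Some(result.clone());
            ready.notify_all();
            result
        } else {
            // Waiters (and later hits) block until the writer has stored a value.
            let mut guard = value.lock().unwrap();
            while guard.is_none() {
                guard = ready.wait(guard).unwrap();
            }
            (*guard).clone().unwrap()
        }
    }
}
```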

Full Vary support. Our cache_key_callback uses the URI path. A production cache would also vary on headers like Accept-Encoding when the upstream sends a Vary header. Pingora handles this via the cache_vary_filter method.

Predictive caching. A cache predictor learns which request patterns are likely cacheable and short-circuits uncacheable requests before they hit the storage layer. This reduces latency for requests that would miss anyway.

Conditional requests and hit handling. We don’t implement cache_hit_filter, which runs after a successful cache lookup and before the cached asset is used. It lets you log the hit or force the asset to be treated as invalid. The default behavior serves cached responses directly, which is correct for our use case.

These features are all available in pingora-cache. Our example opts out of them to keep the API visible. In a real deployment, you’d configure them through the parameters we passed as None in session.cache.enable().

Where to Learn More

  • pingora-cache source code: The crate is the authoritative reference. Start with lib.rs for the HttpCache state machine.
  • RFC 9111: “HTTP Caching.” The current spec for HTTP caching.
  • RFC 9211: “The Cache-Status HTTP Response Header Field.” How caches report what they did with a request.
  • RFC 7234: The previous HTTP caching spec, obsoleted by RFC 9111. Dense, and still widely cited.
  • Cloudflare’s caching documentation: Practical advice from the team that built Pingora.

What You’ve Built

You now have a working caching proxy:

| Part | What You Added |
|---|---|
| 1 | A working reverse proxy |
| 2 | Load balancing and health checks |
| 3 | Request filtering, response modification, per-request state |
| 4 | TLS termination and certificate verification |
| 5 | Config files, daemonization, zero-downtime upgrades |
| 6 | HTTP caching, with real cache hits and misses |

The framework is Pingora. The logic is yours. Go build something.

Part 7: Rate Limiting — Turning Away Requests Before They Reach Your Backends

Your proxy handles traffic from many clients. Most of them are well-behaved. But some aren’t. A misconfigured client hammering your API. A scraper hitting every URL every second. A DDoS attack flooding your service with requests. Without rate limiting, all of that traffic goes straight to your backends.

Rate limiting is the valve. It decides: this client has sent enough requests in the last second. The next one gets a 429 (Too Many Requests) instead of being proxied upstream.

The counting is easy. The hard part is doing it correctly across concurrent requests without becoming a bottleneck yourself.

Why Rate Limit at the Proxy?

You could rate-limit at the application layer. Your backend could check a Redis counter before processing each request. But rate limiting at the proxy has three advantages:

Speed. The proxy rejects overloaded requests before they reach the backend. That’s zero CPU time on your application server, zero database queries, and a fast 429 for the rejected request.

Fairness. The proxy sees all clients. Your backend might see requests through a load balancer that hides the original client IP. The proxy, being the first hop, has accurate client information.

Protection. Rate limiting at the proxy means your backends never see the excess traffic. Even if the backend is slow or crashing, the proxy absorbs the spike.

The Token Bucket Algorithm

Most rate limiters use a variant of the token bucket algorithm. Here’s how it works:

  1. The bucket starts with N tokens (the “burst” capacity)
  2. Each request consumes one token
  3. Tokens are replenished at a steady rate (the “rate”)
  4. If the bucket is empty, the request is rejected

Time:    0s    1s    2s    3s    ...
Tokens:  20 → 20 → 20 → 20 → ...  (no traffic, bucket stays full)

Time:    0s    0.5s  1s    1.5s  2s    ...
Request: ✓     ✓     ✓     ✓     ✓     ...  (2 req/s, rate=10: the bucket refills faster than it drains)

Time:    0s              0.1s       0.2s       ...
Request: ✓ ✓ ... ✓ ✗ ✗   ✓ ✗ ✗      ✓ ✗ ✗      ...  (burst of 20 succeeds, then about one request per 0.1s gets through)
Tokens:  20 → 0          ~0         ~0         ...  (each token refilled at 10/s is spent immediately)

This gives you two knobs:

  • Rate — how many requests per second are allowed (sustained throughput)
  • Burst — how many requests can arrive at once (spike capacity)

A rate of 10 with a burst of 20 means: a client can send 10 requests per second sustainably, or up to 20 in a single burst, but then must wait for the bucket to refill.

The burst matters because real traffic isn’t perfectly smooth. A browser loading a page might send 15 requests in rapid succession (images, scripts, stylesheets). That’s a burst, not an attack. The burst parameter lets legitimate spikes through while still capping sustained abuse.
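The registry later in this part calls TokenBucket::new(rate, burst) and try_consume() without showing the type. Here is a minimal sketch of what such a bucket could look like. The field layout and the try_consume_at testing hook are assumptions, and the tutorial’s own implementation may differ. The bucket refills lazily: instead of running a timer, each call tops it up for the time elapsed since the previous call.

```rust
use std::time::Instant;

/// Holds up to `burst` tokens and refills at `rate` tokens per second.
pub struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// A new bucket starts full, so a fresh client can burst immediately.
    pub fn new(rate: usize, burst: usize) -> Self {
        Self::new_at(rate, burst, Instant::now())
    }

    pub fn new_at(rate: usize, burst: usize, now: Instant) -> Self {
        TokenBucket {
            rate: rate as f64,
            burst: burst as f64,
            tokens: burst as f64,
            last_refill: now,
        }
    }

    /// Takes one token if available.
    pub fn try_consume(&mut self) -> bool {
        self.try_consume_at(Instant::now())
    }

    /// Same as `try_consume`, with the clock passed in so tests are deterministic.
    pub fn try_consume_at(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            // Add the tokens earned since the last call, capped at the burst size.
            let elapsed = (now - self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```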

Where Rate Limiting Fits in the Proxy

Rate limiting happens in the request_filter phase, one of the earliest phases where you can reject a request. This is important: rejecting a request before connecting to the upstream means zero cost on your backend. The rejection happens before upstream_peer is ever called, so the request never touches your upstream at all.

new request
     │
     ▼
┌──────────────────────────────────────────┐
│ request_filter                           │  ← Rate limit check happens here
│                                          │
│  Allowed?                                │
│  ├─ Yes → continue to upstream_peer()    │
│  └─ No  → 429, stop here                 │
└──────────────────────────────────────────┘

This is the same pattern we used for authentication in Part 3. Return Ok(true) from request_filter to tell Pingora “I already wrote a response, don’t proxy this request.” The only difference is what triggers the rejection — bad credentials vs. too many requests.

Per-Client Buckets

Useful rate limiting is almost always per-client. Limiting all traffic to 100 requests per second protects your backend but doesn’t prevent one client from monopolizing that budget.

“Per client” usually means per IP address. In our implementation, we extract the client address from the session:

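// Key on the client's IP alone. The full peer address includes the ephemeral
// source port, which changes with every new TCP connection.
let key = session
    .client_addr()
    .and_then(|addr| addr.as_inet())
    .map(|inet| inet.ip().to_string())
    .unwrap_or_else(|| "unknown".to_string());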

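In a real deployment behind another load balancer, the client address Pingora sees is the load balancer’s, not the original client’s, so you’d read X-Forwarded-For or X-Real-IP instead. Only trust those headers when the connection actually comes from your own load balancer. Any client can send them, and a spoofed value would let an attacker choose their own bucket. Your rate limiter should use the right identifier for your architecture.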
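Here is a sketch of that trust check. It assumes you know the addresses of your own load balancers; client_ip and its parameters are hypothetical. X-Forwarded-For lists one address per hop, and each proxy appends the address it received the request from. The sketch therefore walks the list from the right and stops at the first address that isn’t one of yours.

```rust
use std::net::IpAddr;

/// Chooses the address to rate-limit on.
/// `peer`: the TCP peer that connected to us.
/// `xff`: the raw X-Forwarded-For header, if present.
/// `trusted`: addresses of our own load balancers.
pub fn client_ip(peer: IpAddr, xff: Option<&str>, trusted: &[IpAddr]) -> IpAddr {
    // Anyone can send X-Forwarded-For; only believe it from our own proxies.
    if !trusted.contains(&peer) {
        return peer;
    }
    let Some(xff) = xff else {
        return peer;
    };
    // Walk right to left: the rightmost entries were appended by our proxies;
    // anything further left could have been supplied by the client.
    for hop in xff.rsplit(',') {
        match hop.trim().parse::<IpAddr>() {
            Ok(ip) if trusted.contains(&ip) => continue,
            Ok(ip) => return ip,
            // A malformed entry: stop rather than trust anything beyond it.
            Err(_) => break,
        }
    }
    peer
}
```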

The Registry

One bucket per client means you need a registry — a map from client identifiers to their buckets. When a request arrives, you look up (or create) the bucket for that client and try to consume a token.

use std::collections::HashMap;
use std::sync::Mutex;

// TokenBucket is the per-client bucket type from the part-07 example (not shown here).
struct RateLimiterRegistry {
    buckets: Mutex<HashMap<String, TokenBucket>>,
    rate: usize,
    burst: usize,
}

impl RateLimiterRegistry {
    fn is_allowed(&self, key: &str) -> bool {
        let mut buckets = self.buckets.lock().unwrap();
        // Look up this client's bucket, creating a full one on first sight.
        let bucket = buckets
            .entry(key.to_string())
            .or_insert_with(|| TokenBucket::new(self.rate, self.burst));
        bucket.try_consume()
    }
}

The registry is shared across all requests via Arc<RateLimiterRegistry>. This works for a tutorial. For production with millions of concurrent connections, the Mutex becomes a bottleneck — every request contends on the same lock:

// Mutex: every request waits for every other request
let mut buckets = self.buckets.lock().unwrap();
// the lock is held while we look up and update the bucket;
// while it's held, other requests are blocked

// AtomicU64: no lock, so no thread ever waits for another
counter.fetch_add(1, Ordering::SeqCst);  // a single atomic instruction

With a Mutex, every rate-limit check serializes behind every other one. At high concurrency, the lock itself becomes the bottleneck. AtomicU64 operations are lock-free: the CPU performs the read-modify-write as a single atomic instruction, so no thread ever sleeps waiting for another. Heavily contended atomics still pay for cache-line traffic between cores, but usually far less than a lock costs. This is why pingora-limits uses atomics: at Cloudflare’s scale, even a fast mutex is too slow. Solutions include:

  • Sharded registries. Hash the client key to one of N registries, each with its own lock. Contention drops roughly N-fold.
  • Concurrent maps. DashMap and similar crates shard their locks internally, so you get the same benefit without writing the sharding yourself.
  • pingora-limits. Pingora’s own rate-limiting crate counts requests in a fixed-size, Count-Min Sketch style estimator built on atomic counters. It keeps no per-client entries, so memory stays bounded however many clients you see. The trade-off is that hash collisions can make an estimate slightly high.
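Here is a minimal sketch of the first option: a registry split into independently locked shards. ShardedMap is a hypothetical name. It is generic over the value, so the same structure can hold token buckets or plain counters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// A map split across independently locked shards: requests whose keys land
/// in different shards never wait for each other.
pub struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V> ShardedMap<V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect();
        ShardedMap { shards }
    }

    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    /// Runs `f` on the entry for `key`, creating it with `init` on first use.
    /// Only the key's shard is locked while `f` runs.
    pub fn with_entry<R>(&self, key: &str, init: impl FnOnce() -> V, f: impl FnOnce(&mut V) -> R) -> R {
        let mut shard = self.shard_for(key).lock().unwrap();
        f(shard.entry(key.to_string()).or_insert_with(init))
    }
}
```

With buckets stored this way, is_allowed would reduce to something like self.buckets.with_entry(key, || TokenBucket::new(self.rate, self.burst), |b| b.try_consume()).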

The Code

Our proxy combines rate limiting with the load balancing from Part 2:

pub struct LB {
    upstreams: Arc<LoadBalancer<RoundRobin>>,
    limiter: Arc<RateLimiterRegistry>,
}

The rate limiter is checked in request_filter. If the client has exceeded their limit, we return 429 and short-circuit the request. Otherwise, the request flows through to upstream_peer as usual.

The RateLimitCtx tracks whether the request was rate-limited, so the logging phase can record it:

struct RateLimitCtx {
    rate_limited: bool,
    client_key: String,
}

This is the CTX pattern from Part 3 — per-request state shared across phases.

Testing It

Start the proxy:

cargo run --bin part-07-rate-limiting

Send 25 requests in quick succession:

for i in $(seq 1 25); do curl -s -o /dev/null -w '%{http_code} ' http://127.0.0.1:6188; done

You’ll see something like:

200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 429 429 429 429 429

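The first 20 requests succeed (the burst capacity), and the rest get 429. The exact split depends on how long each round trip takes, since the bucket keeps refilling at 10 tokens per second while the loop runs. Wait a second and about 10 tokens are back; after two seconds the bucket is full again.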

The Production Version: pingora-limits

Our token bucket implementation is for learning. For production, use pingora-limits:

[dependencies]
pingora-limits = "0.8"

The pingora-limits crate takes a different approach from our token bucket. Its rate module provides a Rate type that counts events per key over a fixed time interval. The counts live in the crate’s estimator, a Count-Min Sketch: a small, fixed table of counters indexed by several hashes of the key. Compared with our HashMap of buckets, that buys you three things:

Bounded memory. There’s no entry per client. The estimator’s size is fixed when you create it, so a flood of new client IPs can’t grow it. The price is a little accuracy: when keys collide, an estimate can come out high, but never low.

No lock contention. The estimator uses atomic operations, not mutexes. Thousands of concurrent requests can record and check their counts without waiting for each other.

A simpler model. Rate has no separate “rate” and “burst” knobs. You pick an interval, record each request, and compare the key’s count for the current interval against your limit.

Usage looks roughly like this (it mirrors the rate-limiter example in the Pingora repository):

use pingora_limits::rate::Rate;
use std::sync::LazyLock;
use std::time::Duration;

// One shared estimator that counts events per 1-second interval.
static RATE: LazyLock<Rate> = LazyLock::new(|| Rate::new(Duration::from_secs(1)));
const MAX_REQ_PER_SEC: isize = 10;

// In request_filter: record this request, then compare the client's count
// for the current interval against the limit.
let curr_window_requests = RATE.observe(&client_ip, 1);
let allowed = curr_window_requests <= MAX_REQ_PER_SEC;

Note that Rate::observe is synchronous and purely local. Each proxy process keeps its own estimator and does not coordinate with other instances, so it shares our implementation’s single-process limitation, just with bounded memory and no locks.

The pingora-limits crate comes out of Cloudflare’s Pingora project and is designed for the same scale Pingora operates at.
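To see why a Count-Min Sketch can only overestimate, here is a minimal single-threaded version of the data structure. It is an illustration, not pingora-limits’ code: the real estimator uses atomic counters so many threads can update it at once, and the names here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// `depth` rows of `width` counters. Each key bumps one counter per row, and
/// its estimate is the smallest of those counters. Other keys that collide
/// can only add to a counter, so the estimate is never below the true count.
pub struct CountMinSketch {
    width: usize,
    rows: Vec<Vec<u64>>,
}

impl CountMinSketch {
    pub fn new(depth: usize, width: usize) -> Self {
        assert!(depth > 0 && width > 0);
        CountMinSketch { width, rows: vec![vec![0; width]; depth] }
    }

    /// Column for `key` in `row`; hashing the row number first gives each
    /// row a different hash function.
    fn column<T: Hash + ?Sized>(&self, row: usize, key: &T) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.width
    }

    /// Records `n` events for `key` and returns its new estimate.
    pub fn incr<T: Hash + ?Sized>(&mut self, key: &T, n: u64) -> u64 {
        let mut estimate = u64::MAX;
        for row in 0..self.rows.len() {
            let col = self.column(row, key);
            self.rows[row][col] += n;
            estimate = estimate.min(self.rows[row][col]);
        }
        estimate
    }

    pub fn estimate<T: Hash + ?Sized>(&self, key: &T) -> u64 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.column(row, key)])
            .min()
            .unwrap_or(0)
    }
}
```

Memory is depth × width counters no matter how many distinct keys arrive; wider rows mean fewer collisions.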

Rate Limiting vs. Throttling

A subtle distinction: rate limiting rejects excess requests (429), while throttling delays them. Our implementation does rate limiting — if you’re over the limit, you get rejected immediately.

Throttling (also called traffic shaping) queues excess requests and processes them later at the allowed rate. This is useful when you want to smooth out bursts without rejecting anything — the request takes longer, but it eventually gets through.

For a proxy, rate limiting is usually the right choice. Your job is to protect the backend, not to queue work on behalf of the client. If the client is sending too much, telling them to slow down (429 with a Retry-After header) is honest and efficient. The client can retry later; you don’t need to hold their request in memory.

What About Distributed Rate Limiting?

Our rate limiter runs in a single process. If you have multiple proxy instances (which you should, for availability), each instance has its own counters. With 10 instances, a client limited to 10 requests per second could send 10 to each and effectively get 10× the allowed rate.

Solutions:

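Sticky sessions. Route the same client to the same proxy instance (via source IP hash). Simple, but load now follows clients rather than requests. A few heavy clients can overload one instance, and adding or removing an instance reshuffles clients onto instances that have no counters for them.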

Shared state. Use Redis or a similar distributed store for the counters. Accurate, but adds latency to every request (a Redis round-trip per rate limit check).

Approximate counting. Accept that distributed rate limiting is slightly imprecise. Set the per-instance limit to total_limit / instance_count and accept that a perfectly coordinated burst might exceed the limit briefly. For most use cases, this is good enough.

The Pingora approach. Cloudflare runs Pingora at massive scale, and a common pattern for this problem combines local counting (fast, no coordination) with periodic aggregation (accurate over time). The pingora-limits crate covers the local half: it estimates rates per instance and leaves any cross-instance aggregation to you.

What We’re Simplifying

No eviction of stale buckets. Our HashMap grows forever — clients that made one request and never returned still have an entry. A production implementation would evict entries that haven’t been accessed recently (LRU, TTL-based, or periodic scanning).
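A TTL sweep is the simplest fix. Here’s a minimal sketch, assuming a hypothetical `Entry` that stores a bucket’s balance alongside the time it was last used, and a sweep that runs periodically (for example from a background task) rather than on every request. There’s a neat property that makes this safe: once a bucket has been idle for at least burst / rate seconds, it has refilled to capacity, so dropping it and later recreating it full changes nothing. Any TTL at or above that value loses no information.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-client entry: bucket balance plus last access time.
struct Entry {
    tokens: f64,
    last_seen: Instant,
}

/// Remove entries idle for longer than `ttl`; returns how many were evicted.
fn evict_idle(buckets: &mut HashMap<String, Entry>, now: Instant, ttl: Duration) -> usize {
    let before = buckets.len();
    buckets.retain(|_, e| now.duration_since(e.last_seen) <= ttl);
    before - buckets.len()
}
```

With the capstone’s 10 req/s and burst of 20, a bucket is full again after 2 seconds idle, so a TTL of a minute is comfortably safe.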

No Retry-After header. When we return 429, we should include a Retry-After header telling the client when to try again. RFC 6585 allows it (a 429 response MAY include Retry-After), and well-behaved clients use it to back off instead of retrying immediately.
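The token bucket already knows the answer: the time until one whole token is back. A minimal sketch, assuming the bucket keeps a fractional balance; both helper names are hypothetical. Retry-After in its delay-seconds form takes whole seconds, so the wait is rounded up, and never reported as 0 for a request we just rejected.

```rust
use std::time::Duration;

/// Time until a bucket holding `tokens` regains one whole token,
/// refilling at `rate` tokens per second (rate must be > 0).
fn time_to_next_token(tokens: f64, rate: f64) -> Duration {
    let missing = (1.0 - tokens).max(0.0);
    Duration::from_secs_f64(missing / rate)
}

/// Retry-After value in whole seconds: round up, minimum 1.
fn retry_after_value(wait: Duration) -> String {
    let secs = wait.as_secs() + u64::from(wait.subsec_nanos() > 0);
    secs.max(1).to_string()
}
```

In the proxy, you would attach this string to the 429 response instead of calling respond_error(429), for example by building a ResponseHeader with status 429, inserting Retry-After, and writing it to the session. That Pingora wiring is not shown here.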

No differentiation by endpoint. Our rate limiter treats all requests the same. A real proxy might allow 100 requests/second for GET /api/data but only 5/second for POST /api/upload. The key would be client_ip + path instead of client_ip alone.
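One detail matters when you build that key: key on the matched route, not the raw path. Otherwise /api/data/1 and /api/data/2 each get a fresh bucket, and a client can sidestep the limit by varying the URL. A minimal sketch with a hypothetical route table (method, path prefix, requests per second), where the first match wins and everything else falls back to a default:

```rust
/// Hypothetical route table: (method, path prefix, requests per second).
const ROUTES: &[(&str, &str, u32)] = &[
    ("POST", "/api/upload", 5),
    ("GET", "/api/data", 100),
];

/// Pick the limit for a request and the bucket key it should use.
/// The key names the route, so all paths under a prefix share a bucket.
fn limit_and_key(client_ip: &str, method: &str, path: &str, default_rate: u32) -> (u32, String) {
    for &(m, prefix, rate) in ROUTES {
        if m == method && path.starts_with(prefix) {
            return (rate, format!("{client_ip}|{m} {prefix}"));
        }
    }
    (default_rate, format!("{client_ip}|*"))
}
```

The registry would then need one rate per key rather than a single global rate, for example by storing the limit alongside each bucket when it is created.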

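No sliding window. Our token bucket isn’t window-based at all: it refills continuously and lets a client burst up to its capacity. Window-based limiters instead enforce a count over a trailing interval, either with a sliding-window counter (cheap, approximate) or a sliding log that stores a timestamp per request (exact, but memory grows with traffic). For most use cases, the token bucket’s behavior is what you want.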
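For comparison, here is one common variant of the sliding-window counter, sketched under the assumption of a monotonic clock in seconds that starts at 0. `SlidingWindow` is a hypothetical name. It keeps only two counts, the current fixed window and the previous one, and weights the previous count by how much of it still overlaps the trailing window.

```rust
/// Sliding-window counter: approximates "at most `limit` requests in any
/// trailing `window` seconds" using just two counters.
struct SlidingWindow {
    window: f64,        // window length in seconds
    limit: f64,         // max requests per trailing window
    current_start: f64, // start of the current fixed window
    current: f64,       // requests counted in the current window
    previous: f64,      // requests counted in the previous window
}

impl SlidingWindow {
    fn new(window: f64, limit: f64) -> Self {
        SlidingWindow { window, limit, current_start: 0.0, current: 0.0, previous: 0.0 }
    }

    fn allow(&mut self, now: f64) -> bool {
        // Roll the fixed windows forward if `now` is past the current one.
        let elapsed_windows = ((now - self.current_start) / self.window).floor();
        if elapsed_windows >= 1.0 {
            // Only the window immediately before the current one still matters.
            self.previous = if elapsed_windows == 1.0 { self.current } else { 0.0 };
            self.current = 0.0;
            self.current_start += elapsed_windows * self.window;
        }
        // Fraction of the previous window still inside [now - window, now].
        let overlap = 1.0 - (now - self.current_start) / self.window;
        let estimate = self.previous * overlap + self.current;
        if estimate < self.limit {
            self.current += 1.0;
            true
        } else {
            false
        }
    }
}
```

With a 1-second window and a limit of 10, a client that spends all 10 at t = 0.5 gets only 5 more at t = 1.5, because half of the previous window still counts against it. A token bucket refilling at 10/s would behave differently here, which is the tradeoff in a nutshell: windows bound counts over an interval, buckets bound the sustained rate plus a burst.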

Key Takeaways

Rate limiting belongs at the proxy. It’s faster, fairer, and more protective than rate limiting at the application layer.

The token bucket is the standard algorithm. Two knobs: rate (sustained) and burst (spike). Simple, effective, and good enough for most use cases.

Rate limiting is a request_filter. Rejecting before the upstream connection saves backend resources. The pattern is the same as authentication: respond_error() + Ok(true).

Per-client state needs care at scale. A Mutex<HashMap> works for thousands of clients. For millions, you need sharding, lock-free structures, or pingora-limits.
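Sharding is the smallest step up from a single Mutex<HashMap>. A minimal sketch, assuming a generic per-client value `V` (a token bucket in practice); `Sharded` and its methods are hypothetical names. Each key hashes to one of N independently locked maps, so requests from different clients rarely wait on the same lock.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// N independently locked maps instead of one big Mutex<HashMap>.
struct Sharded<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V> Sharded<V> {
    fn new(n: usize) -> Self {
        assert!(n > 0, "need at least one shard");
        Sharded { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    /// Which shard owns `key`.
    fn shard_for(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.shards.len() as u64) as usize
    }

    /// Lock only the key's shard, create the value on first use, run `f` on it.
    fn with<R>(&self, key: &str, init: impl FnOnce() -> V, f: impl FnOnce(&mut V) -> R) -> R {
        let mut shard = self.shards[self.shard_for(key)].lock().unwrap();
        let v = shard.entry(key.to_string()).or_insert_with(init);
        f(v)
    }
}
```

In the capstone’s registry, the body of is_allowed would become something like shards.with(key, || TokenBucket::new(rate, burst), |b| b.try_consume()). Sharding reduces contention but not memory, so it still needs the eviction discussed above.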

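Use pingora-limits in production. It estimates per-key rates with a lock-free count-min sketch built on atomic counters, so memory stays fixed no matter how many clients you see, and it comes from the same codebase Cloudflare runs in production.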

What You’ve Built

Across seven parts, you’ve built every piece of a production-grade reverse proxy:

Part  What You Added
1     A working reverse proxy
2     Load balancing and health checks
3     Request filtering, response modification, per-request state
4     TLS termination and certificate verification
5     Config files, daemonization, zero-downtime upgrades
6     HTTP caching concepts and strategies
7     Rate limiting per client with token buckets

This is real infrastructure: the same framework, patterns, and tradeoffs that Cloudflare navigates at 40M+ requests per second. Swap the Mutex<HashMap> limiter for pingora-limits and address the simplifications above, and this code could handle serious traffic.

But a real proxy doesn’t run in pieces. In Part 8: The Complete Proxy, we’ll combine load balancing, authentication, header cleanup, and rate limiting into one coherent service, and show where TLS and caching plug in.

Part 8: The Complete Proxy — Wiring Seven Parts Into One Service That Actually Works

Seven parts. Seven pieces. Each one works on its own. But a real proxy doesn’t run in pieces — it runs as one program, with load balancing and TLS and caching and rate limiting and filters, all wired together.

This part builds the complete proxy. No new concepts — the work of combining everything you’ve already learned into a single, coherent service.

What We’re Building

A production-grade reverse proxy with:

  • Load balancing across multiple backends with health checks (Part 2)
  • API key authentication as a request filter (Part 3)
  • Response header cleanup to hide internal details (Part 3)
  • Rate limiting per client IP (Part 7)
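  • TLS on the downstream connection (Part 4), shown in the architecture but left out of the runnable code; see note below
  • Pingora’s config file and CLI flags (Part 5); our own settings stay hardcoded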
  • Caching for GET requests (Part 6) — omitted from this capstone; see note below

Each piece you’ve already built. The skill here is composition — making them work together without tangling the logic.

Why no caching? Part 6’s cache uses Pingora’s built-in session.cache API, which integrates differently from a simple struct-based approach. The cache hooks (request_cache_filter, cache_key_callback, response_cache_filter) are methods on the ProxyHttp trait, not custom structs. Adding Part 6’s caching on top of this capstone is a good exercise — start from Part 6’s working code and add the other features on top.

The Architecture

Here’s how the pieces fit into Pingora’s request lifecycle:

  Client (HTTPS) → :6188
         │
         ▼
  TLS termination              ← Part 4 (not wired in this capstone)
         │
         ▼
  request_filter()
    ├─ Rate limit check        ← Part 7 (reject → 429)
    └─ API key authentication  ← Part 3 (reject → 401)
         │
         ▼
  upstream_peer()
    └─ Round-robin selection   ← Part 2
         │
         ▼
  upstream_request_filter()
    └─ Add Host header         ← Part 1
         │
         ▼
  Connect to upstream (TLS)    ← Part 4
         │
         ▼
  upstream_response_filter()
    └─ (cache goes here)       ← Part 6 (not wired in this capstone)
         │
         ▼
  response_filter()
    └─ Replace Server header,  ← Part 3
       strip X-Powered-By
         │
         ▼
  logging()
    └─ Log rate-limited
       requests                ← Part 7
         │
         ▼
  Client gets response

The key insight: each concern lives in exactly one phase. Rate limiting happens in request_filter because you want to reject before connecting upstream. Header cleanup happens in response_filter because you want to strip headers the client shouldn’t see, and Pingora runs response_filter on every response, including ones served from cache. Caching would hook in around upstream_response_filter, which Pingora runs before a response is stored, so the cached copy keeps the upstream’s headers and response_filter cleans them up on the way out. (We omit caching here; see the note above.) The phase determines the order. The method determines the job.

Per-Request State

In Part 3, we introduced CTX — per-request state shared between phases. Now we need it for real:

pub struct ProxyCtx {
    // Rate limiting (Part 7)
    rate_limited: bool,
    client_key: String,

    // Authentication (Part 3)
    authenticated: bool,
}

Each field is set in an early phase so that a later phase can read it. rate_limited and client_key are set in request_filter and read in logging. authenticated is also set in request_filter; our example never reads it, but a real proxy might use it in upstream_request_filter to add user-specific headers.

This is what CTX is for — passing data between phases without global mutable state.

The Proxy Struct

Our proxy holds two pieces of shared state:

pub struct FullProxy {
    upstreams: Arc<LoadBalancer<RoundRobin>>,  // Part 2
    limiter: Arc<RateLimiterRegistry>,          // Part 7
}

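The load balancer is wrapped in Arc because it’s shared between the proxy and the background health-check service, which updates backend health while requests are being served. The limiter is used only by the proxy, so Arc isn’t strictly required there, but it keeps the door open for another service to share the same counters.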

The Request Filter: Two Gates, One Method

Both rate limiting and authentication are gatekeepers — they decide whether a request should proceed. They both belong in request_filter, and the order matters:

async fn request_filter(
    &self,
    session: &mut Session,
    ctx: &mut Self::CTX,
) -> Result<bool> {
    // Gate 1: Rate limiting (cheap: an in-memory lookup, no key store)
    // Key on the client IP only. The full socket address also contains the
    // ephemeral source port, which changes with every new connection.
    let key = session
        .client_addr()
        .and_then(|addr| addr.as_inet())
        .map(|inet| inet.ip().to_string())
        .unwrap_or_else(|| "unknown".to_string());
    ctx.client_key = key.clone();

    if !self.limiter.is_allowed(&key) {
        ctx.rate_limited = true;
        let _ = session.respond_error(429).await;
        return Ok(true);
    }

    // Gate 2: API key authentication (a real deployment may query a key store)
    let has_valid_key = session
        .req_header()
        .headers
        .get("X-API-Key")
        .map(|v| v.as_bytes())
        == Some(b"secret123");

    if !has_valid_key {
        let _ = session.respond_error(401).await;
        return Ok(true);
    }

    ctx.authenticated = true;
    Ok(false)
}

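Rate limiting comes first for two reasons. It’s cheap: an in-memory map lookup, whereas validating an API key in a real deployment often means a database or auth-service call. More importantly, it makes every request spend from the client’s budget, including requests with a wrong key. If authentication ran first, a client could keep guessing API keys and never hit the rate limit.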

This ordering — cheaper rejection first — is a general principle. Put your fastest, most common rejections at the top.
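One consequence of putting the rate limit first is easy to miss: failed authentication attempts are rate-limited too. The sketch below models the two gates with a hypothetical `Gates` type and a simplified per-client budget that never refills (enough to show the ordering, not a real limiter). A client guessing keys gets 401s until its budget runs out, then 429s. With the gates swapped, the 401s would return before the budget is ever touched, and guessing would be unlimited.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two gates in request_filter.
struct Gates {
    budget: u32,                 // requests allowed per client (no refill)
    used: HashMap<String, u32>,  // requests seen per client
    valid_key: &'static str,
}

impl Gates {
    /// Rate limit first, then authenticate. Returns the status the client
    /// would see; 200 means the request is forwarded upstream.
    fn check(&mut self, client: &str, api_key: Option<&str>) -> u16 {
        let used = self.used.entry(client.to_string()).or_insert(0);
        if *used >= self.budget {
            return 429;
        }
        *used += 1; // every request spends budget, including bad-key attempts
        if api_key != Some(self.valid_key) {
            return 401;
        }
        200
    }
}
```

After three wrong guesses from the same address, even the correct key gets a 429, while other clients are unaffected.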

The Complete Code

use async_trait::async_trait;
use pingora::lb::{LoadBalancer, selection::RoundRobin, health_check::TcpHealthCheck};
use pingora::prelude::*;
use pingora::proxy::{ProxyHttp, Session};
use pingora::upstreams::peer::HttpPeer;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Instant;

// ─── Per-Request Context ────────────────────────────────────────────

pub struct ProxyCtx {
    rate_limited: bool,
    client_key: String,
    authenticated: bool,
}

// ─── Token Bucket Rate Limiter (Part 7) ─────────────────────────────

struct TokenBucket {
    max: f64,             // burst capacity
    rate: f64,            // tokens added per second
    tokens: f64,          // current balance; fractional tokens carry over
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate: usize, burst: usize) -> Self {
        TokenBucket {
            max: burst as f64,
            rate: rate as f64,
            tokens: burst as f64,
            last_refill: Instant::now(),
        }
    }

    fn try_consume(&mut self) -> bool {
        self.refill();
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    // Add tokens for the time elapsed since the last refill. Keeping the
    // fractional part matters: truncating to whole tokens and then resetting
    // the clock would drop partial tokens and undercount the configured rate.
    fn refill(&mut self) {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.max);
        self.last_refill = now;
    }
}

struct RateLimiterRegistry {
    buckets: Mutex<HashMap<String, TokenBucket>>,
    rate: usize,
    burst: usize,
}

impl RateLimiterRegistry {
    fn new(rate: usize, burst: usize) -> Self {
        RateLimiterRegistry {
            buckets: Mutex::new(HashMap::new()),
            rate,
            burst,
        }
    }

    fn is_allowed(&self, key: &str) -> bool {
        let mut buckets = self.buckets.lock().unwrap();
        let bucket = buckets
            .entry(key.to_string())
            .or_insert_with(|| TokenBucket::new(self.rate, self.burst));
        bucket.try_consume()
    }
}

// ─── The Proxy ──────────────────────────────────────────────────────

pub struct FullProxy {
    upstreams: Arc<LoadBalancer<RoundRobin>>,
    limiter: Arc<RateLimiterRegistry>,
}

#[async_trait]
impl ProxyHttp for FullProxy {
    type CTX = ProxyCtx;
    fn new_ctx(&self) -> Self::CTX {
        ProxyCtx {
            rate_limited: false,
            client_key: String::new(),
            authenticated: false,
        }
    }

    // Gate 1: Rate limiting (cheap, do first)
    // Gate 2: API key authentication
    async fn request_filter(
        &self,
        session: &mut Session,
        ctx: &mut Self::CTX,
    ) -> Result<bool> {
        // ── Rate limiting ──
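        // Key on the client IP only: the full socket address includes the
        // ephemeral source port, which changes with every new connection.
        let key = session
            .client_addr()
            .and_then(|addr| addr.as_inet())
            .map(|inet| inet.ip().to_string())
            .unwrap_or_else(|| "unknown".to_string());
        ctx.client_key = key.clone();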

        if !self.limiter.is_allowed(&key) {
            ctx.rate_limited = true;
            let _ = session.respond_error(429).await;
            return Ok(true);
        }

        // ── Authentication ──
        let has_valid_key = session
            .req_header()
            .headers
            .get("X-API-Key")
            .map(|v| v.as_bytes())
            == Some(b"secret123");

        if !has_valid_key {
            let _ = session.respond_error(401).await;
            return Ok(true);
        }

        ctx.authenticated = true;
        Ok(false)
    }

    // Round-robin across healthy backends
    async fn upstream_peer(
        &self,
        _session: &mut Session,
        _ctx: &mut Self::CTX,
    ) -> Result<Box<HttpPeer>> {
        let upstream = self.upstreams
            .select(b"", 256)
            .ok_or_else(|| Error::new_str("no healthy upstream available"))?;

        let peer = Box::new(HttpPeer::new(
            upstream,
            true,
            "one.one.one.one".to_string(),
        ));
        Ok(peer)
    }

    // Add required headers before sending upstream
    async fn upstream_request_filter(
        &self,
        _session: &mut Session,
        upstream_request: &mut pingora::http::RequestHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        upstream_request.insert_header("Host", "one.one.one.one")?;
        Ok(())
    }

    // Replace the Server header and strip internal headers
    // before sending to the client. Same pattern as Part 3.
    async fn response_filter(
        &self,
        _session: &mut Session,
        upstream_response: &mut pingora::http::ResponseHeader,
        _ctx: &mut Self::CTX,
    ) -> Result<()> {
        // Remove headers that expose internal infrastructure
        upstream_response.remove_header("X-Powered-By");
        // Replace the Server header to identify our proxy
        upstream_response.insert_header("Server", "Gateway")?;
        upstream_response.insert_header("X-Proxy", "pingora-tutorial")?;
        Ok(())
    }

    // Log rate-limited requests
    async fn logging(
        &self,
        session: &mut Session,
        _e: Option<&pingora::Error>,
        ctx: &mut Self::CTX,
    ) {
        if ctx.rate_limited {
            eprintln!(
                "[rate-limited] client={} path={}",
                ctx.client_key,
                session.req_header().uri.path()
            );
        }
    }
}

fn main() {
    let opt = Some(Opt::parse_args());
    let mut server = Server::new(opt).unwrap();
    server.bootstrap();

    // ── Load balancer with health checks (Part 2) ──
    let mut upstreams = LoadBalancer::try_from_iter([
        "1.1.1.1:443",
        "1.0.0.1:443",
    ]).unwrap();

    let hc = TcpHealthCheck::new();
    upstreams.set_health_check(hc);
    upstreams.health_check_frequency = Some(std::time::Duration::from_secs(1));

    let background = pingora::services::background::background_service(
        "health_check",
        upstreams,
    );
    let upstreams = background.task();

    // ── Rate limiter: 10 req/s per client, burst of 20 (Part 7) ──
    let limiter = Arc::new(RateLimiterRegistry::new(10, 20));

    let proxy = FullProxy {
        upstreams,
        limiter,
    };

    let mut service = http_proxy_service(&server.configuration, proxy);
    service.add_tcp("0.0.0.0:6188");

    server.add_service(background);
    server.add_service(service);

    println!("Full proxy starting on :6188");
    println!("  - Load balancing: round-robin across 2 backends");
    println!("  - Authentication: X-API-Key header required");
    println!("  - Rate limiting: 10 req/s per client, burst 20");
    println!("  - Response cleanup: internal headers stripped");
    println!();
    println!("Test: curl -H 'X-API-Key: secret123' http://127.0.0.1:6188 -sv");

    server.run_forever();
}

What Changed from the Individual Parts

The capstone code looks similar to each part’s standalone version, but there are a few differences worth noting:

request_filter now has two gates. In Part 3, authentication was the only gate. In Part 7, rate limiting was the only gate. Now they’re both in request_filter, with rate limiting first (cheaper check). This is the composition pattern: when multiple concerns share a phase, order them by cost — cheapest rejection first.

CTX carries data for multiple concerns. In Part 3, CTX held authenticated. In Part 7, it held rate_limited and client_key. Now it holds all three. The CTX struct is the communication bus between phases: request_filter sets the fields, and later phases such as logging read the ones they need.

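logging is the single reporting point. Each part’s logging was specific to that part. The capstone’s logging lives in one place with access to every CTX field, which is where observability comes together. As written it only prints rate-limited requests, but because request_filter records the outcome of both gates, extending it so that one log line says whether a request was rate-limited, rejected for a bad key, or proxied is a small change.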
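Here’s a minimal sketch of that extension as a plain function, so it can be tested on its own. It mirrors the capstone’s ProxyCtx fields; `log_line` is a hypothetical helper, and the status code is passed in. Inside Pingora’s logging() you would obtain it from the session (for example from the response header that was written, if any) and print the returned string.

```rust
/// Mirror of the capstone's per-request context.
struct ProxyCtx {
    rate_limited: bool,
    client_key: String,
    authenticated: bool,
}

/// Classify the request from what request_filter recorded, so every
/// request produces one line, not just the rate-limited ones.
fn log_line(ctx: &ProxyCtx, path: &str, status: u16) -> String {
    let outcome = if ctx.rate_limited {
        "rate-limited"
    } else if !ctx.authenticated {
        "unauthenticated"
    } else {
        "proxied"
    };
    format!("[{outcome}] client={} path={path} status={status}", ctx.client_key)
}
```

The ordering of the checks follows request_filter: a rate-limited request never reached authentication, and any request that passed the rate limit but isn’t authenticated was rejected with 401.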

The main() function wires everything together. Each part’s main() was simple. The capstone’s main() creates the load balancer, attaches health checks, creates the rate limiter, and passes them to FullProxy. This is where the architecture becomes visible — main() is the composition root.

What We’re Simplifying

No TLS in this code. Adding downstream TLS requires certificate files. The Part 4 code works as-is: in main() you’d replace service.add_tcp() with service.add_tls() (certificate and key paths) or service.add_tls_with_settings() (a TlsSettings value). The upstream side already uses TLS, since HttpPeer::new(upstream, true, ...) enables it. We omit the TLS listener so the code runs without setup.

Custom settings are hardcoded. The capstone uses Opt::parse_args() (Part 5), so Pingora’s built-in config file and CLI flags work: you can pass -c conf.yaml or -d and they’ll take effect. But our custom values (upstream backends, rate limiter thresholds) are hardcoded in main() rather than read from the config. server.configuration holds Pingora’s own parsed server settings, not your fields, so to make them configurable you’d load them yourself, for example from a separate YAML file deserialized with serde.

No caching. As explained at the start of this part, Part 6’s cache plugs in through ProxyHttp’s cache hooks rather than a custom struct, so it needs a different integration pattern from the other features. Start from Part 6’s working code and add the other features on top; it’s a good exercise.

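No graceful restart. Part 5’s zero-downtime upgrade means starting the new process with the --upgrade (-u) command-line flag while the old one is still running, then signaling the old process (SIGQUIT) so it hands over its listening sockets. We don’t walk through that here; the capstone simply runs under the standard run_forever() loop.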

What Comes Next

This proxy is production-capable for many use cases. If you need more, here’s where to look:

Need                        Approach
Higher throughput           More worker threads (the threads setting in the YAML config), connection tuning
Distributed rate limiting   Redis-backed counters, or per-instance pingora-limits with the limit divided across instances
Request body filtering      request_body_filter(), which sees body chunks before they are forwarded
Websocket proxying          Pingora handles it; your upstream_peer selects the backend
HTTP/2 upstream             Call peer.options.set_http_version(2, 2) (max, min) on the HttpPeer
Custom metrics              logging() phase → Prometheus or StatsD
Plugin system               Dynamic ProxyHttp dispatch based on route rules

The framework is Pingora. The logic is yours. You’ve seen every major piece. Go build something real.