Part 7: Rate Limiting — Turning Away Requests Before They Reach Your Backends

Your proxy handles traffic from many clients. Most of them are well-behaved. But some aren’t. A misconfigured client hammering your API. A scraper hitting every URL every second. A DDoS attack flooding your service with requests. Without rate limiting, all of that traffic goes straight to your backends.

Rate limiting is the valve. It decides: this client has sent enough requests in the last second. The next one gets a 429 (Too Many Requests) instead of being proxied upstream.

The counting is easy. The hard part is doing it correctly across concurrent requests without becoming a bottleneck yourself.

Why Rate Limit at the Proxy?

You could rate-limit at the application layer. Your backend could check a Redis counter before processing each request. But rate limiting at the proxy has three advantages:

Speed. The proxy rejects excess requests before they reach the backend. That’s zero CPU time on your application server, zero database queries, and a fast 429 for the rejected request instead of a slow response from an overloaded backend.

Fairness. The proxy sees all clients. Your backend might see requests through a load balancer that hides the original client IP. The proxy, when it is the first hop, has accurate client information.

Protection. Rate limiting at the proxy means your backends never see the excess traffic. Even if the backend is slow or crashing, the proxy absorbs the spike.

The Token Bucket Algorithm

Most rate limiters use a variant of the token bucket algorithm. Here’s how it works:

The bucket starts with N tokens (the “burst” capacity)
Each request consumes one token
Tokens are replenished at a steady rate (the “rate”)
If the bucket is empty, the request is rejected

Time:    0s    1s    2s    3s    ...
Tokens:  20 → 20 → 20 → 20 → ...  (no traffic, bucket stays full)

Time:    0s    0.5s  1s    1.5s  2s    ...
Request: ✓     ✓     ✓     ✓     ✓     ...  (2 req/s, rate=10, bucket refills faster than it drains)

Time:    0s         0.1s       0.2s       ...
Request: ✓ ✓ ✓ ... ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗   (burst of 20, then rejected until bucket refills)
Tokens:  20→0       still 0   still 0    ... (bucket empties at 0s, starts refilling at 10/s)

This gives you two knobs:

Rate — how many requests per second are allowed (sustained throughput)
Burst — how many requests can arrive at once (spike capacity)

A rate of 10 with a burst of 20 means: a client can send 10 requests per second sustainably, or up to 20 in a single burst, but then must wait for the bucket to refill.

The burst matters because real traffic isn’t perfectly smooth. A browser loading a page might send 15 requests in rapid succession (images, scripts, stylesheets). That’s a burst, not an attack. The burst parameter lets legitimate spikes through while still capping sustained abuse.
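The registry later in this part calls TokenBucket::new and try_consume without showing them. The following is a minimal sketch of what such a bucket could look like, not necessarily the one in the companion code. It assumes a lazy refill (tokens are credited when a request arrives rather than by a background timer), and the new_at and try_consume_at variants are hypothetical helpers that take the current time explicitly so the logic can be tested without sleeping.

```rust
use std::time::Instant;

/// One client's bucket. `rate` is tokens added per second,
/// `burst` is the most tokens the bucket can hold.
pub struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(rate: usize, burst: usize) -> Self {
        Self::new_at(rate, burst, Instant::now())
    }

    /// Same as `new`, with an explicit start time (hypothetical test helper).
    pub fn new_at(rate: usize, burst: usize, now: Instant) -> Self {
        TokenBucket {
            rate: rate as f64,
            burst: burst as f64,
            tokens: burst as f64, // the bucket starts full
            last_refill: now,
        }
    }

    pub fn try_consume(&mut self) -> bool {
        self.try_consume_at(Instant::now())
    }

    /// Credit the tokens earned since the last call, then try to take one.
    pub fn try_consume_at(&mut self, now: Instant) -> bool {
        let elapsed = now
            .saturating_duration_since(self.last_refill)
            .as_secs_f64();
        // Refill, but never beyond the burst capacity.
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With rate 10 and burst 20, 25 back-to-back calls at the same instant let 20 through and reject 5. One second later, 10 more tokens are available.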

Where Rate Limiting Fits in the Proxy

Rate limiting happens in the request_filter phase, which runs before Pingora selects an upstream. This is important: rejecting a request before connecting to the upstream means zero cost on your backend. The rejection happens before upstream_peer is ever called, so the request never touches your upstream at all.

new request
     │
     ▼
┌──────────────────┐
│ request_filter    │  ← Rate limit check happens here
│                   │
│  Allowed?         │
│  ├─ Yes → continue to upstream_peer()
│  └─ No  → 429, stop here
└──────────────────┘

This is the same pattern we used for authentication in Part 3. Return Ok(true) from request_filter to tell Pingora “I already wrote a response, don’t proxy this request.” The only difference is what triggers the rejection — bad credentials vs. too many requests.

Per-Client Buckets

Rate limiting works best per client. Limiting all traffic to 100 requests per second protects your backend but doesn’t prevent one client from monopolizing that budget.

“Per client” usually means per IP address. In our implementation, we extract the client address from the session:

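// Key on the IP only. The full socket address includes the ephemeral
// source port, which would give every new connection its own bucket.
// (as_inet() is assumed from Pingora's SocketAddr type; check your version.)
let key = session
    .client_addr()
    .and_then(|addr| addr.as_inet())
    .map(|inet| inet.ip().to_string())
    .unwrap_or_else(|| "unknown".to_string());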

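In a real deployment behind another load balancer, the client address Pingora sees is the load balancer’s address, not the original client’s, so you’d key on X-Forwarded-For or X-Real-IP instead. Only trust those headers when the connection comes from a load balancer you control; otherwise a client can forge the header and claim a fresh bucket on every request. Your rate limiter should use the right identifier for your architecture.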
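One way to choose the key behind a load balancer is sketched below. It is a standalone function rather than Pingora code. The names client_key, peer, xff, and trusted are all hypothetical, and the policy (walk X-Forwarded-For from right to left and stop at the first address that is not one of your own proxies) is one common convention, not the only reasonable one.

```rust
use std::net::IpAddr;

/// Pick the rate-limit key for a request.
///
/// `peer` is the address of the TCP connection, `xff` is the raw
/// X-Forwarded-For value if the header is present, and `trusted` lists
/// the load balancers whose headers we are willing to believe.
pub fn client_key(peer: IpAddr, xff: Option<&str>, trusted: &[IpAddr]) -> String {
    // Only honor the header when the connection comes from a trusted proxy.
    // Otherwise any client could choose its own bucket by forging it.
    if trusted.contains(&peer) {
        if let Some(header) = xff {
            // Walk the list right to left and take the first address
            // that is not one of our own proxies.
            for hop in header.rsplit(',') {
                match hop.trim().parse::<IpAddr>() {
                    Ok(ip) if trusted.contains(&ip) => continue,
                    Ok(ip) => return ip.to_string(),
                    Err(_) => break, // malformed entry: stop trusting the header
                }
            }
        }
    }
    // Fall back to the connection's own address.
    peer.to_string()
}
```

Reading from the right matters: entries further left were supplied by hops you don’t control, so a client can put anything there.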

The Registry

One bucket per client means you need a registry — a map from client identifiers to their buckets. When a request arrives, you look up (or create) the bucket for that client and try to consume a token.

use std::collections::HashMap;
use std::sync::Mutex;

struct RateLimiterRegistry {
    // One bucket per client key, all behind a single lock.
    buckets: Mutex<HashMap<String, TokenBucket>>,
    rate: usize,  // tokens added per second
    burst: usize, // bucket capacity
}

impl RateLimiterRegistry {
    fn is_allowed(&self, key: &str) -> bool {
        let mut buckets = self.buckets.lock().unwrap();
        // Look up the client's bucket, creating a full one on first sight.
        let bucket = buckets
            .entry(key.to_string())
            .or_insert_with(|| TokenBucket::new(self.rate, self.burst));
        bucket.try_consume()
    }
}

The registry is shared across all requests via Arc<RateLimiterRegistry>. This works for a tutorial. For production with millions of concurrent connections, the Mutex becomes a bottleneck — every request contends on the same lock:

// Mutex: every request waits for every other request
let mut buckets = self.buckets.lock().unwrap();
// the lock is held while we look up and update the bucket,
// so other requests are blocked

// AtomicU64: no lock, no contention
counter.fetch_add(1, Ordering::SeqCst);  // one atomic operation, no waiting

With a single Mutex, rate-limit checks run one at a time, no matter which client they belong to. At high concurrency, the lock itself becomes the bottleneck. AtomicU64 operations are lock-free: the CPU performs the read-modify-write as one atomic operation, so no thread ever waits on a lock holder. This is why pingora-limits uses atomics: at Cloudflare’s scale, even a fast mutex is too slow. Solutions include:

Sharded registries. Hash the client key to one of N registries, each with its own lock. With evenly distributed keys, contention drops by roughly N×.
Concurrent maps. DashMap and similar concurrent hashmaps shard internally, so you get fine-grained locking without managing the shards yourself.
pingora-limits. Pingora’s own rate limiting crate avoids a per-client map altogether. It counts events in fixed-size tables of atomic counters (a count-min-sketch-style estimator), trading exact per-client counts for bounded memory and no locks.
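The sharding idea from the list above can be sketched with nothing but the standard library. The type ShardedRegistry and its methods are hypothetical names. The value type is generic so the sketch stays short; in our proxy it would be the TokenBucket.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// A map split into independently locked shards.
pub struct ShardedRegistry<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V> ShardedRegistry<V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        ShardedRegistry { shards }
    }

    /// Index of the shard that owns `key`. The same key always maps
    /// to the same shard.
    pub fn shard_index(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Run `f` on the entry for `key`, creating it with `init` if missing.
    /// Only the owning shard is locked, so requests whose keys live in
    /// other shards proceed in parallel.
    pub fn with_entry<R>(
        &self,
        key: &str,
        init: impl FnOnce() -> V,
        f: impl FnOnce(&mut V) -> R,
    ) -> R {
        let mut shard = self.shards[self.shard_index(key)].lock().unwrap();
        let value = shard.entry(key.to_string()).or_insert_with(init);
        f(value)
    }
}
```

A rate-limit check would then look something like registry.with_entry(key, || TokenBucket::new(rate, burst), |b| b.try_consume()). Two requests only contend when their keys hash to the same shard.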

The Code

Our proxy combines rate limiting with the load balancing from Part 2:

pub struct LB {
    upstreams: Arc<LoadBalancer<RoundRobin>>,
    limiter: Arc<RateLimiterRegistry>,
}

The rate limiter is checked in request_filter. If the client has exceeded their limit, we return 429 and short-circuit the request. Otherwise, the request flows through to upstream_peer as usual.

The RateLimitCtx tracks whether the request was rate-limited, so the logging phase can record it:

struct RateLimitCtx {
    rate_limited: bool,
    client_key: String,
}

This is the CTX pattern from Part 3 — per-request state shared across phases.

Testing It

Start the proxy:

cargo run --bin part-07-rate-limiting

Send 25 requests in quick succession:

for i in $(seq 1 25); do curl -s -o /dev/null -w '%{http_code} ' http://127.0.0.1:6188; done

You’ll see something like:

200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 429 429 429 429 429

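The first 20 requests succeed (the burst capacity). The next 5 get 429. If the loop runs slowly enough for a token or two to refill along the way, you may see a few extra 200s. Tokens come back at the configured rate: with the rate of 10 from the earlier example, about 10 requests are allowed again after one second, and the bucket is full after two.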

The Production Version: pingora-limits

Our token bucket implementation is for learning. For production, use pingora-limits:

[dependencies]
pingora-limits = "0.8"

The pingora-limits crate provides a Rate type: a windowed rate estimator built on a count-min-sketch-style structure rather than on per-client token buckets. Compared with our hand-rolled registry, it has several advantages:

Bounded memory. The estimator doesn’t allocate an entry per client. Keys are hashed into fixed-size tables of counters, so memory stays constant no matter how many clients show up. The cost is that hash collisions can make a count read slightly high.

No lock contention. The counters are atomics, not mutex-protected fields. Thousands of concurrent requests can check their limits without waiting for each other.

Simple windowing. Rate counts events per fixed interval and keeps the previous interval around, so you can ask either how many events a key has produced in the current window or what its rate was over the last full window. There is no separate “burst” knob; the limit applies per window.

The API (a sketch based on the crate’s documented usage; check the docs for the version you pin):

use std::time::Duration;
use pingora_limits::rate::Rate;

// One estimator shared by all requests, counting in 1-second windows.
let limiter = Rate::new(Duration::from_secs(1));
const MAX_REQ_PER_SEC: isize = 10;

// Record one event for this client and read back its count
// for the current window.
let seen = limiter.observe(&client_ip, 1);
let allowed = seen <= MAX_REQ_PER_SEC;

Note that this API is synchronous: observe is a handful of atomic operations and never awaits. The estimator is also local to the process. Like our implementation, it does no coordination with other proxy instances.

The pingora-limits crate is what Cloudflare uses in production for their own rate limiting. It’s designed for the same scale Pingora operates at.

Rate Limiting vs. Throttling

A subtle distinction: rate limiting rejects excess requests (429), while throttling delays them. Our implementation does rate limiting — if you’re over the limit, you get rejected immediately.

Throttling (also called traffic shaping) queues excess requests and processes them later at the allowed rate. This is useful when you want to smooth out bursts without rejecting anything — the request takes longer, but it eventually gets through.

For a proxy, rate limiting is usually the right choice. Your job is to protect the backend, not to queue work on behalf of the client. If the client is sending too much, telling them to slow down (429 with a Retry-After header) is honest and efficient. The client can retry later; you don’t need to hold their request in memory.
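To fill in a Retry-After value for a token bucket, you need the time until the next whole token is available. A minimal sketch, assuming the bucket tracks a fractional token balance and a positive refill rate; the function name and parameters are hypothetical:

```rust
/// Whole seconds a client should wait before a token is available again.
///
/// `tokens` is the bucket's current (possibly fractional) balance and
/// `rate` is the refill rate in tokens per second. Assumes `rate > 0`.
pub fn retry_after_secs(tokens: f64, rate: f64) -> u64 {
    if tokens >= 1.0 {
        return 0; // a token is already available, no need to wait
    }
    // Time to accumulate the missing part of one token, rounded up
    // because Retry-After carries whole seconds.
    let wait = ((1.0 - tokens) / rate).ceil() as u64;
    // A 429 that says "retry after 0 seconds" invites an immediate retry,
    // so advertise at least one second.
    wait.max(1)
}
```

At high refill rates the answer is almost always 1, which is fine: the header’s job is to tell the client to back off, not to schedule the retry to the millisecond.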

What About Distributed Rate Limiting?

Our rate limiter runs in a single process. If you have multiple proxy instances (which you should, for availability), each instance has its own counters. With N instances, a client limited to 10 requests per second can send 10 to each instance and effectively get N× the allowed rate.

Solutions:

Sticky sessions. Route the same client to the same proxy instance (via source IP hash). Simple, but loses the benefit of load distribution.

Shared state. Use Redis or a similar distributed store for the counters. Accurate, but adds latency to every request (a Redis round-trip per rate limit check).

Approximate counting. Accept that distributed rate limiting is slightly imprecise. Set the per-instance limit to total_limit / instance_count and accept that a perfectly coordinated burst might exceed the limit briefly. For most use cases, this is good enough.

The Pingora approach. Cloudflare runs Pingora at massive scale. Their rate limiting uses a combination of local counting (fast, no coordination) and periodic aggregation (accurate over time). The pingora-limits crate reflects this philosophy: it’s designed for per-instance rate estimation that works correctly in aggregate.
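The arithmetic behind approximate counting is small, but the rounding direction is a real decision. A sketch with a hypothetical helper name:

```rust
/// Split a global limit evenly across proxy instances, rounding up so a
/// nonzero global limit never becomes a per-instance limit of zero.
/// Rounding up means the combined limit can exceed `total_limit` by up to
/// `instance_count - 1`.
pub fn per_instance_limit(total_limit: u32, instance_count: u32) -> u32 {
    assert!(instance_count > 0, "need at least one instance");
    let base = total_limit / instance_count;
    let has_remainder = total_limit % instance_count != 0;
    base + u32::from(has_remainder)
}
```

The split is only fair when traffic is spread evenly. If the load balancer in front sends all of one client’s requests to a single instance, that client is held to roughly 1/N of the global limit.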

What We’re Simplifying

No eviction of stale buckets. Our HashMap grows forever — clients that made one request and never returned still have an entry. A production implementation would evict entries that haven’t been accessed recently (LRU, TTL-based, or periodic scanning).

No Retry-After header. When we return 429, we should include a Retry-After header telling the client when to try again. RFC 6585 makes the header optional (the response MAY include it), but it’s polite, and well-behaved clients use it to back off.

No differentiation by endpoint. Our rate limiter treats all requests the same. A real proxy might allow 100 requests/second for GET /api/data but only 5/second for POST /api/upload. The key would be client_ip + path instead of client_ip alone.

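No sliding window. A token bucket doesn’t count requests in a window at all: it bounds traffic over any interval of t seconds to burst + rate × t requests. If you need a strict guarantee such as “at most N requests in any 60 seconds”, you want a sliding-window counter (more expensive) or a sliding log (much more expensive). For most use cases, the token bucket’s bound is sufficient.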
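Of the simplifications above, eviction is the one most likely to bite first, since the map grows with every new client. A minimal TTL-based sketch follows. The Entry type and evict_stale function are hypothetical, and the sketch assumes each entry records when its client was last seen.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-client entry: bucket state plus a last-seen stamp.
pub struct Entry {
    pub tokens: f64,
    pub last_seen: Instant,
}

/// Drop every entry that has been idle for longer than `ttl`.
/// Returns how many entries were removed.
pub fn evict_stale(
    map: &mut HashMap<String, Entry>,
    now: Instant,
    ttl: Duration,
) -> usize {
    let before = map.len();
    // Keep only entries seen within the last `ttl`.
    map.retain(|_, entry| now.saturating_duration_since(entry.last_seen) <= ttl);
    before - map.len()
}
```

You could call this from a periodic background task or every few thousand requests. If ttl is at least burst / rate seconds, an evicted bucket would have refilled completely anyway, so a returning client gets the same full bucket it would have had without eviction.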

Key Takeaways

Rate limiting belongs at the proxy. It’s faster, fairer, and more protective than rate limiting at the application layer.

The token bucket is the standard algorithm. Two knobs: rate (sustained) and burst (spike). Simple, effective, and good enough for most use cases.

Rate limiting is a request_filter. Rejecting before the upstream connection saves backend resources. The pattern is the same as authentication: respond_error() + Ok(true).

Per-client state needs care at scale. A Mutex<HashMap> works for thousands of clients. For millions, you need sharding, lock-free structures, or pingora-limits.

Use pingora-limits in production. It uses lock-free atomic counters with bounded memory, and is battle-tested at Cloudflare’s scale.

What You’ve Built

Across all seven parts, you’ve built the core pieces of a production-style reverse proxy:

Part	What You Added
1	A working reverse proxy
2	Load balancing and health checks
3	Request filtering, response modification, per-request state
4	TLS termination and certificate verification
5	Config files, daemonization, zero-downtime upgrades
6	HTTP caching concepts and strategies
7	Rate limiting per client with token buckets

This is real infrastructure. The same framework, the same patterns, the same tradeoffs that Cloudflare navigates at 40M+ requests per second. With pingora-limits swapped in for our teaching token bucket, the code you wrote here could handle serious traffic.

But a real proxy doesn’t run in pieces. In Part 8: Putting It All Together, we’ll combine load balancing, TLS, caching, rate limiting, and filters into one coherent service.
