We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.
This post covers why we evaluated all three, what the actual performance differences are at proxy-relevant scale, and the one factor that made the decision obvious.
1. Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.
2. Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.
3. Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.
The Three Contenders
When we started building Preto's proxy layer, we had three options on the table. Each one had a strong case.
Python was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.
Rust was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.
Go was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.
The Benchmark That Settled the Python Question
We ruled Python out first, not because it's slow in theory, but because it's slow in practice at our target scale.
LiteLLM's own published benchmarks tell the story:
- At 500 RPS: Stable. ~40ms overhead. Acceptable.
- At 1,000 RPS: Memory climbs to 4GB+. Latency variance increases.
- At 2,000 RPS: Timeouts start. Memory hits 8GB+. Requests fail.
The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's async primitives (asyncio) help, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.
LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.
If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.
Go vs. Rust: Where the Decision Gets Interesting
With Python out, the real comparison begins. Here's what we measured and researched:
| Dimension | Go | Rust |
|---|---|---|
| Proxy overhead | ~11μs at 5K RPS | <1ms P99 at 10K QPS |
| Max throughput (single instance) | 5,000+ RPS | 10,000+ QPS |
| Memory under load | ~200MB at 5K RPS | ~50MB at 10K QPS |
| Concurrency model | Goroutines (lightweight) | async/await (Tokio) |
| Streaming HTTP support | stdlib net/http | hyper/axum (good, more code) |
| Time to implement proxy MVP | ~2 weeks | ~5-6 weeks |
| Hiring pool | Large (DevOps, backend) | Small (systems specialists) |
| Compile times | ~5 seconds | ~2-5 minutes |
| Binary size | ~15MB | ~8MB |
| Ecosystem for LLM tooling | Growing | Growing |
The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.
The Factor That Made It Obvious: Goroutines and Streaming
An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.
In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:
```go
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	// Forward to upstream LLM provider (upstreamReq is built from r)
	resp, err := http.DefaultClient.Do(upstreamReq)
	if err != nil {
		handleFallback(w, r) // try next provider
		return
	}
	defer resp.Body.Close()

	// Stream tokens back as they arrive
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	buf := make([]byte, 4096)
	for {
		n, err := resp.Body.Read(buf)
		if n > 0 {
			w.Write(buf[:n])
			flusher.Flush()      // send immediately
			trackTokens(buf[:n]) // async cost tracking
		}
		if err != nil {
			break
		}
	}
}
```
That's the core loop. In Rust, the equivalent code involves async/await, Pin<Box<dyn Stream>>, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.
When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.
See how our Go proxy tracks your LLM spend
Preto captures cost per request, per feature, per team — with under 20ms of overhead. One URL change to set up.
See What Your LLM Spend Looks Like. Free forever for up to 10K requests. No credit card.
What We'd Choose Rust For
This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if:
- We needed to handle 10,000+ QPS on a single instance. At that scale, Rust's zero-cost abstractions and lack of garbage collection pauses become meaningful.
- Memory was a hard constraint. Rust's 50MB footprint vs. Go's 200MB matters if you're running on edge nodes or embedded devices.
- The proxy was the entire product. If our company was an LLM proxy company (like Bifrost or TensorZero), spending 3x longer on the core engine is justified. Our proxy is infrastructure — the product is cost intelligence built on top.
TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.
For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool. If you're evaluating proxy-based cost tools rather than building your own, see our comparisons with Helicone (Rust, Cloudflare Workers) and LangSmith (SDK-based, no proxy).
The Real-World Request Lifecycle in Our Go Proxy
Here's how a request flows through our stack, with timing at each stage:
1. TLS termination + HTTP parse — handled by Go's net/http server. ~1ms.
2. API key lookup + team resolution — in-memory map with Redis sync every 10ms. ~0.5ms.
3. Rate limit check — token-bucket algorithm in a goroutine-safe map. ~0.1ms.
4. Budget enforcement — check the team's monthly spend against its cap. ~0.2ms.
5. Cache probe — SHA-256 hash of prompt + model + params, checked against local cache with Redis fallback. ~1-3ms.
6. Route selection — match model to upstream endpoint, apply load-balancing weights. ~0.1ms.
7. Upstream call + streaming — goroutine holds the connection, pipes `data:` chunks back. 500ms-5,000ms (the LLM).
8. Async logging — cost calculation and log entry shipped to ClickHouse via a buffered channel. ~0ms on the request path (fires in a background goroutine).
Total proxy overhead: ~5-8ms. The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.
Lessons From 6 Months in Production
Three things surprised us after shipping:
1. Garbage collection pauses are a non-issue. Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune GOGC — we never needed to.
2. The standard library HTTP server is production-ready. We started with Go's net/http and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.
3. Goroutine leaks are the real danger. Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. Go's runtime.NumGoroutine() metric caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. We now monitor goroutine count as a first-class metric. If you build a Go proxy, do this from day one.
Frequently Asked Questions
Why use Go for an LLM proxy instead of Rust? Rust is faster on paper, but below roughly 10,000 QPS the difference doesn't show up in end-to-end latency, while Go's development velocity, larger hiring pool, and goroutine-based streaming pay off on every feature.
How fast is a Go-based LLM proxy? Ours adds ~5-8ms of total overhead per request — under 1% of a typical 500-5,000ms LLM call. At the raw routing layer, Go proxies handle 5,000+ RPS with ~11μs of overhead.
Why not use Python for an LLM proxy? The GIL serializes CPU-bound work like JSON parsing and token counting, so Python proxies such as LiteLLM degrade past ~1,000 RPS. Below that scale, Python remains a fine choice.
See what your LLM traffic looks like through a proxy.
Preto's Go-powered proxy captures cost, latency, and usage data per request — with under 20ms overhead. One URL change. Full visibility in 5 minutes.
See What Your LLM Spend Looks Like. Free forever for up to 10K requests. No credit card.