We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.

This post covers why we evaluated all three, what the actual performance differences are at proxy-relevant scale, and the one factor that made the decision obvious.

TL;DR

1. Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.
2. Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.
3. Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.


The Three Contenders

When we started building Preto's proxy layer, we had three options on the table. Each one had a strong case.

Python was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.

Rust was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.

Go was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.

The Benchmark That Settled the Python Question

We ruled Python out first. Not because it's slow in theory, but because it's slow in practice at our target scale.

LiteLLM's own published benchmarks tell the story: throughput tops out around 1,000 QPS, memory climbs past 8GB under sustained load, and timeout rates rise with it.

The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's async primitives (asyncio) help, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.

LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.

Python isn't wrong — it's wrong for this

If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.

Go vs. Rust: Where the Decision Gets Interesting

With Python out, the real comparison begins. Here's what we measured and researched:

| Dimension | Go | Rust |
| --- | --- | --- |
| Proxy overhead | ~11μs at 5K RPS | <1ms P99 at 10K QPS |
| Max throughput (single instance) | 5,000+ RPS | 10,000+ QPS |
| Memory under load | ~200MB at 5K RPS | ~50MB at 10K QPS |
| Concurrency model | Goroutines (lightweight) | async/await (Tokio) |
| Streaming HTTP support | stdlib net/http | hyper/axum (good, more code) |
| Time to implement proxy MVP | ~2 weeks | ~5-6 weeks |
| Hiring pool | Large (DevOps, backend) | Small (systems specialists) |
| Compile times | ~5 seconds | ~2-5 minutes |
| Binary size | ~15MB | ~8MB |
| Ecosystem for LLM tooling | Growing | Growing |

The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.

The Factor That Made It Obvious: Goroutines and Streaming

An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.

In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:

func proxyHandler(w http.ResponseWriter, r *http.Request) {
    // Forward to upstream LLM provider (upstreamReq is built from r; elided)
    resp, err := http.DefaultClient.Do(upstreamReq)
    if err != nil {
        handleFallback(w, r) // try next provider
        return
    }
    defer resp.Body.Close()

    // Stream tokens back as they arrive
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    buf := make([]byte, 4096)
    for {
        n, err := resp.Body.Read(buf)
        if n > 0 {
            w.Write(buf[:n])
            flusher.Flush() // send immediately
            chunk := append([]byte(nil), buf[:n]...) // copy: buf is reused next iteration
            go trackTokens(chunk)                    // async cost tracking
        }
        if err != nil {
            break // io.EOF ends the stream; any other error also stops here
        }
    }
}

That's the core loop. In Rust, the equivalent code involves async/await, Pin<Box<dyn Stream>>, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.

When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.


What We'd Choose Rust For

This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if the proxy itself were the product, if we needed wire-speed processing at 10,000+ QPS on a single instance, or if we were deploying to memory-constrained edge environments.

TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.

For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool. If you're evaluating proxy-based cost tools rather than building your own, see our comparisons with Helicone (Rust, Cloudflare Workers) and LangSmith (SDK-based, no proxy).

The Real-World Request Lifecycle in Our Go Proxy

Here's how a request flows through our stack, with timing at each stage:

  1. TLS termination + HTTP parse — handled by Go's net/http server. ~1ms.
  2. API key lookup + team resolution — in-memory map with Redis sync every 10ms. ~0.5ms.
  3. Rate limit check — token-bucket algorithm in goroutine-safe map. ~0.1ms.
  4. Budget enforcement — check team's monthly spend against cap. ~0.2ms.
  5. Cache probe — SHA-256 hash of prompt + model + params, checked against local cache with Redis fallback. ~1-3ms.
  6. Route selection — match model to upstream endpoint, apply load balancing weights. ~0.1ms.
  7. Upstream call + streaming — goroutine holds the connection, pipes data chunks back. 500ms-5,000ms (the LLM).
  8. Async logging — cost calculation and log entry shipped to ClickHouse via buffered channel. ~0ms on the request path (fires in background goroutine).

Total proxy overhead: ~5-8ms. The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.
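The async-logging step keeps cost work off the request path by handing each entry to a buffered channel drained by a background goroutine. A simplified sketch, with the ClickHouse batch sink stubbed out as a callback:

```go
package main

import (
	"fmt"
	"sync"
)

// logEntry is a simplified per-request record; real entries would
// carry model, tokens, latency, team, and more.
type logEntry struct {
	RequestID string
	CostUSD   float64
}

// asyncLogger buffers entries on a channel so the request path never
// blocks on the logging backend.
type asyncLogger struct {
	ch   chan logEntry
	wg   sync.WaitGroup
	ship func([]logEntry) // batch sink (ClickHouse in our stack), stubbed here
}

func newAsyncLogger(buf int, ship func([]logEntry)) *asyncLogger {
	l := &asyncLogger{ch: make(chan logEntry, buf), ship: ship}
	l.wg.Add(1)
	go func() { // background drain goroutine
		defer l.wg.Done()
		batch := make([]logEntry, 0, 64)
		for e := range l.ch {
			batch = append(batch, e)
			if len(batch) == cap(batch) {
				l.ship(batch)
				batch = batch[:0]
			}
		}
		if len(batch) > 0 {
			l.ship(batch) // flush remainder on close
		}
	}()
	return l
}

// Log enqueues without blocking; if the buffer is full we drop the
// entry rather than stall the request path.
func (l *asyncLogger) Log(e logEntry) {
	select {
	case l.ch <- e:
	default: // buffer full: drop (and bump a dropped-logs metric)
	}
}

// Close flushes remaining entries and waits for the drain goroutine.
func (l *asyncLogger) Close() {
	close(l.ch)
	l.wg.Wait()
}

func main() {
	var shipped int
	l := newAsyncLogger(1024, func(batch []logEntry) { shipped += len(batch) })
	for i := 0; i < 100; i++ {
		l.Log(logEntry{RequestID: fmt.Sprint(i), CostUSD: 0.002})
	}
	l.Close()
	fmt.Println(shipped) // 100
}
```

The non-blocking send is the key design choice: under backpressure we'd rather lose a log line than add latency to a user request.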

Lessons From 6 Months in Production

Three things surprised us after shipping:

1. Garbage collection pauses are a non-issue. Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune GOGC — we never needed to.

2. The standard library HTTP server is production-ready. We started with Go's net/http and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.

3. Goroutine leaks are the real danger. Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. Go's runtime.NumGoroutine() metric caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. We now monitor goroutine count as a first-class metric. If you build a Go proxy, do this from day one.

Frequently Asked Questions

Why use Go for an LLM proxy instead of Rust?
Go offers the best balance of performance and development velocity. While Rust is faster in raw benchmarks, Go's goroutine model handles thousands of concurrent streaming connections with minimal code. For most teams under 5,000 RPS, Go's performance is equivalent — and development speed is 2-3x faster.
How fast is a Go-based LLM proxy?
Benchmarks from Bifrost show 11 microseconds of overhead at 5,000 RPS, with 54x faster P99 latency than Python-based alternatives. Our own production proxy runs at ~5-8ms total overhead including auth, caching, routing, and logging.
Why not use Python for an LLM proxy?
Python's GIL limits true parallelism for concurrent I/O workloads. LiteLLM handles ~1,000 QPS before hitting performance walls — memory climbs to 8GB+ and timeouts increase. Python is excellent for prototyping and has the largest provider ecosystem, but it struggles at production-scale proxy throughput.

See what your LLM traffic looks like through a proxy.

Preto's Go-powered proxy captures cost, latency, and usage data per request — with under 20ms overhead. One URL change. Full visibility in 5 minutes.

See What Your LLM Spend Looks Like

Free forever for up to 10K requests. No credit card.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter