You know you're overspending on GPT-4o. You just don't know which requests are the problem — or how much fixing them would save.
Here's the clearest way to see it: "Classify this review as positive or negative" costs roughly $0.0002 on GPT-4.1 at 100 tokens. The same request costs $0.00001 on GPT-4.1-nano — 20x cheaper. At 10,000 classification requests per day, that's $1.90/day you're burning for no improvement in output quality. Over a year: nearly $700, for one endpoint. Most production apps have a dozen like it.
1. Most apps route 100% of traffic to the same model. Simple tasks — classification, extraction, boolean checks — cost 50–100x more than they should on frontier models.
2. A complexity estimator + routing layer at the proxy fixes this without touching application code. The router adds <1ms — it's pure CPU, no I/O.
3. Teams see 20–40% cost reduction within the first week of enabling routing. The biggest wins come from apps with high-volume, low-complexity endpoints.
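The arithmetic behind the opening example can be checked directly. This is a minimal sketch; `costPerRequest` is a name invented for illustration, and only input-token cost is counted:

```go
import "fmt"

// costPerRequest returns the USD cost of one request's input tokens,
// given the price per 1M input tokens (output tokens omitted for brevity).
func costPerRequest(tokens int, pricePer1M float64) float64 {
	return float64(tokens) / 1e6 * pricePer1M
}

func main() {
	gpt41 := costPerRequest(100, 2.00) // $2.00 per 1M input tokens
	nano := costPerRequest(100, 0.10)  // $0.10 per 1M input tokens
	daily := (gpt41 - nano) * 10000    // 10,000 requests per day
	fmt.Printf("per request: $%.5f vs $%.5f\n", gpt41, nano)
	fmt.Printf("daily waste: $%.2f, yearly: $%.2f\n", daily, daily*365)
}
```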
The 20x Price Gap You're Ignoring
The price difference between model tiers is not incremental — it's an order of magnitude. At current pricing:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Right for |
|---|---|---|---|
| gpt-4.1-nano | $0.10 | $0.40 | Classification, extraction, Q&A, booleans |
| gpt-4.1 | $2.00 | $8.00 | Summarization, structured output, medium reasoning |
| claude-opus-4-6 | $5.00 | $25.00 | Multi-step reasoning, long-context, complex code |
The pattern that causes waste: GPT-4o (or Opus) gets hardcoded at integration time because it's the safe default. The app ships. Traffic grows. Nobody goes back to ask which endpoints actually need it.
Routing fixes this at the proxy layer — between your application and the provider — without requiring changes to application code.
The Routing Decision Tree
Every request passes through a complexity estimator that produces a score from 0.0 to 1.0. The router maps that score to a model tier.
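Concretely, with the 0.3 and 0.7 thresholds used by the `Route` function later in this post, the mapping is:

```
estimateComplexity(req) → score in [0.0, 1.0]

score < 0.3          → simple tier    → gpt-4.1-nano
0.3 ≤ score < 0.7    → standard tier  → gpt-4.1
score ≥ 0.7          → complex tier   → claude-opus-4-6
```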
Building the Complexity Estimator
The estimator scores each request using four signals. Each signal contributes independently — the final score is capped at 1.0:
```go
func (r *Router) estimateComplexity(req *ChatRequest) float64 {
	score := 0.0

	// Signal 1: total prompt tokens.
	// Longer prompts usually require more capable models to maintain coherence.
	tokens := estimateTokenCount(req.Messages)
	switch {
	case tokens > 2000:
		score += 0.4
	case tokens > 500:
		score += 0.2
		// under 500 tokens: no addition
	}

	// Signal 2: conversation depth.
	// Multi-turn conversations require tracking context across exchanges.
	if len(req.Messages) > 4 {
		score += 0.2
	}

	// Signal 3: reasoning keywords in the last user message.
	// These tasks reliably benefit from frontier-model reasoning.
	lastMsg := strings.ToLower(getLastUserMessage(req.Messages))
	reasoningTerms := []string{
		"analyze", "compare", "explain why", "write code",
		"generate", "step by step", "reason through", "refactor",
		"debug", "critique", "pros and cons",
	}
	for _, term := range reasoningTerms {
		if strings.Contains(lastMsg, term) {
			score += 0.3
			break // one match is enough
		}
	}

	// Signal 4: explicit complexity hint from the application.
	// Your app can set this header to bypass inference entirely.
	if req.Headers.Get("X-Complexity-Hint") == "high" {
		score = 1.0
	}

	if score > 1.0 {
		score = 1.0
	}
	return score
}
```
The X-Complexity-Hint header is the escape hatch. When your app knows a request is complex (a code review, a multi-document analysis), it can signal this directly — skipping inference entirely and routing straight to the right model.
The thresholds (0.3, 0.7) are starting points, not constants. Tune them against a sample of your own traffic: pull a day of requests, score them, manually label a subset, and adjust until the misclassification rate is acceptable for your use case.
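The tuning loop above can be sketched as a small calibration function. This is a hypothetical example: `labeled`, `tierFor`, and `misclassRate` are names invented here, not part of the router:

```go
// labeled pairs an estimator score with a human tier label.
type labeled struct {
	score float64 // output of estimateComplexity
	tier  string  // human label: "simple", "standard", or "complex"
}

// tierFor applies candidate thresholds (lo, hi) to a score.
func tierFor(score, lo, hi float64) string {
	switch {
	case score < lo:
		return "simple"
	case score < hi:
		return "standard"
	default:
		return "complex"
	}
}

// misclassRate returns the fraction of the sample that candidate
// thresholds would route to a tier other than its human label.
func misclassRate(sample []labeled, lo, hi float64) float64 {
	wrong := 0
	for _, s := range sample {
		if tierFor(s.score, lo, hi) != s.tier {
			wrong++
		}
	}
	return float64(wrong) / float64(len(sample))
}
```

Sweep (lo, hi) over a grid against your labeled sample and pick the pair with the lowest rate; rerun the sweep whenever you change models.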
The Router: Putting It Together
```go
type RoutingDecision struct {
	Model    string
	Provider string
}

func (r *Router) Route(req *ChatRequest) RoutingDecision {
	// Check for explicit tag routing first (highest priority).
	if decision, ok := r.routeByFeatureTag(req); ok {
		return decision
	}

	// Fall back to complexity-based routing.
	complexity := r.estimateComplexity(req)
	switch {
	case complexity < 0.3:
		return RoutingDecision{Model: "gpt-4.1-nano", Provider: "openai"}
	case complexity < 0.7:
		return RoutingDecision{Model: "gpt-4.1", Provider: "openai"}
	default:
		return RoutingDecision{Model: "claude-opus-4-6", Provider: "anthropic"}
	}
}

func (r *Router) routeByFeatureTag(req *ChatRequest) (RoutingDecision, bool) {
	tag := req.Headers.Get("X-Feature")
	switch tag {
	case "support-bot", "faq", "intent-classify":
		// Known-simple features: always route cheap.
		return RoutingDecision{Model: "gpt-4.1-nano", Provider: "openai"}, true
	case "code-review", "architecture-analysis":
		// Known-complex features: always route to frontier.
		return RoutingDecision{Model: "claude-opus-4-6", Provider: "anthropic"}, true
	default:
		return RoutingDecision{}, false
	}
}
```
Tag routing takes priority over complexity inference. If your app tags requests by feature (via the X-Feature header — see the gateway architecture post for the full pattern), the router never needs to guess. Inference fills the gap for untagged requests.
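On the application side, tagging is just a request header. A hypothetical client-side sketch (`gatewayURL` and `newTaggedRequest` are illustration names, not a real SDK):

```go
import (
	"io"
	"net/http"
)

// newTaggedRequest builds a chat request routed through the proxy,
// tagged by feature so the router never has to guess.
func newTaggedRequest(gatewayURL, feature string, body io.Reader) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, gatewayURL+"/v1/chat/completions", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Feature", feature) // tag routing, e.g. "intent-classify"
	// req.Header.Set("X-Complexity-Hint", "high") // or: force the frontier tier
	return req, nil
}
```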
Want to find which of your requests need a cheaper model?
Preto analyzes your traffic and surfaces routing recommendations with projected savings per endpoint.
Failover: What Happens When Your Primary Model Is Down
Routing and failover are separate concerns. The router decides the intended model; a failover layer handles unavailability. Keep them decoupled — a router that tries to do failover becomes impossible to reason about.
```go
// fallbackChains defines the fallback order per tier.
var fallbackChains = map[string][]RoutingDecision{
	"simple": {
		{Model: "gpt-4.1-nano", Provider: "openai"},
		{Model: "claude-haiku-4-5", Provider: "anthropic"},
	},
	"standard": {
		{Model: "gpt-4.1", Provider: "openai"},
		{Model: "claude-sonnet-4-6", Provider: "anthropic"},
	},
	"complex": {
		{Model: "claude-opus-4-6", Provider: "anthropic"},
		{Model: "gpt-4.1", Provider: "openai"},
	},
}

func (r *Router) RouteWithFailover(req *ChatRequest) RoutingDecision {
	primary := r.Route(req)
	tier := r.tier(primary.Model)
	for _, candidate := range fallbackChains[tier] {
		if r.circuit.IsAvailable(candidate.Provider) {
			return candidate
		}
	}
	// Last resort: return the primary and let it fail visibly.
	return primary
}
```
The circuit breaker (r.circuit.IsAvailable) tracks provider health separately. When OpenAI returns a run of 5xx errors, the circuit opens and the router automatically fails over to the next provider in the chain, without changing the routing logic.
Where Routing Falls Short
Complexity-based routing is a heuristic. It works well for workloads with a clear mix of simple and complex tasks. It breaks down at the extremes:
Open-ended chat. A short message ("What do you think?") can require deep contextual reasoning. Token count is a poor proxy here. For chat-heavy apps, tag routing (by session type or user tier) is more reliable than inference.
Creative and long-form generation. Quality differences between models are most visible here. Routing a creative brief to GPT-4o-mini to save $0.002 risks a visibly worse output. Set explicit tags for these features and don't try to infer them.
Calibration drift. New models change the quality-cost frontier. GPT-4o-mini today handles tasks that required GPT-4 a year ago. Review your routing thresholds when models update — the optimal decision tree shifts.
At Preto, we surface the routing decisions and their outcomes alongside cost data — so you can see where the estimator is misrouting and tune thresholds against real production traffic, not guesses.
Frequently Asked Questions
What is LLM model routing?
Sending each request to the cheapest model that can handle it, decided at the proxy layer between your application and the provider, instead of hardcoding one model for all traffic.
How do you estimate request complexity automatically?
Score each request from 0.0 to 1.0 using cheap signals: prompt token count, conversation depth, and reasoning keywords in the last user message. Applications can set the X-Complexity-Hint header to override inference.
How much can model routing reduce LLM costs?
Teams typically see 20–40% reductions within the first week; the biggest wins come from high-volume, low-complexity endpoints.
Does model routing affect response quality?
Not for simple tasks like classification and extraction, where cheap models match frontier output. For creative and long-form generation, use explicit feature tags rather than inference to avoid visibly worse results.
How does model routing interact with failover?
They stay decoupled: the router picks the intended model, and a separate failover layer with per-provider circuit breakers handles unavailability.
Find which of your requests need a cheaper model.
Preto analyzes your LLM traffic and surfaces per-endpoint routing recommendations with projected cost savings. One URL change — no code refactor required.