You know you're overspending on GPT-4o. You just don't know which requests are the problem — or how much fixing them would save.

Here's the clearest way to see it: "Classify this review as positive or negative" costs roughly $0.0002 on GPT-4.1 at 100 tokens. The same request costs $0.00001 on GPT-4.1-nano — 20x cheaper. At 10,000 classification requests per day, that's $1.90/day you're burning for no improvement in output quality. Over a year: nearly $700, for one endpoint. Most production apps have a dozen like it.
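The arithmetic above, as a quick sanity check (prices per 1M input tokens from the table below; the 100-token request size and 10,000 requests/day volume are the example's assumptions):

```go
package main

import "fmt"

func main() {
    const (
        tokensPerReq = 100.0
        reqsPerDay   = 10000.0
        gpt41PerTok  = 2.00 / 1e6 // $2.00 per 1M input tokens
        nanoPerTok   = 0.10 / 1e6 // $0.10 per 1M input tokens
    )
    perReqGap := tokensPerReq * (gpt41PerTok - nanoPerTok) // ≈ $0.00019 per request
    daily := perReqGap * reqsPerDay
    fmt.Printf("daily: $%.2f, yearly: $%.2f\n", daily, daily*365)
    // → daily: $1.90, yearly: $693.50
}
```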

TL;DR

1. Most apps route 100% of traffic to the same model. Simple tasks — classification, extraction, boolean checks — cost 50–100x more than they should on frontier models.
2. A complexity estimator + routing layer at the proxy fixes this without touching application code. The router adds <1ms — it's pure CPU, no I/O.
3. Teams typically see 20–40% cost reduction within the first week of enabling routing. The biggest wins come from apps with high-volume, low-complexity endpoints.

The 20x Price Gap You're Ignoring

The price difference between model tiers is not incremental — it's an order of magnitude. At current pricing:

Model            Input (per 1M tokens)   Output (per 1M tokens)   Right for
gpt-4.1-nano     $0.10                   $0.40                    Classification, extraction, Q&A, booleans
gpt-4.1          $2.00                   $8.00                    Summarization, structured output, medium reasoning
claude-opus-4-6  $5.00                   $25.00                   Multi-step reasoning, long-context, complex code

The pattern that causes waste: GPT-4o (or Opus) gets hardcoded at integration time because it's the safe default. The app ships. Traffic grows. Nobody goes back to ask which endpoints actually need it.

Routing fixes this at the proxy layer — between your application and the provider — without requiring changes to application code.

The Routing Decision Tree

Every request passes through a complexity estimator that produces a score from 0.0 to 1.0. The router maps that score to a model tier:

Incoming LLM Request
  └─ complexity_score = estimateComplexity(req)
       ├─ score < 0.3 → Simple   → gpt-4.1-nano    ($0.10 / 1M in)
       │    Classification · extraction · booleans · short Q&A
       ├─ 0.3 – 0.7   → Standard → gpt-4.1         ($2.00 / 1M in)
       │    Summarization · structured output · moderate reasoning
       └─ score > 0.7 → Complex  → claude-opus-4-6 ($5.00 / 1M in)
            Multi-step reasoning · long context · complex code

Building the Complexity Estimator

The estimator scores each request using four signals. Each signal contributes independently — the final score is capped at 1.0:

func (r *Router) estimateComplexity(req *ChatRequest) float64 {
    score := 0.0

    // Signal 1: total prompt tokens
    // Longer prompts usually require more capable models to maintain coherence
    tokens := estimateTokenCount(req.Messages)
    switch {
    case tokens > 2000:
        score += 0.4
    case tokens > 500:
        score += 0.2
    // under 500 tokens: no addition
    }

    // Signal 2: conversation depth
    // Multi-turn conversations require tracking context across exchanges
    if len(req.Messages) > 4 {
        score += 0.2
    }

    // Signal 3: reasoning keywords in the last user message
    // These tasks reliably benefit from frontier model reasoning
    lastMsg := strings.ToLower(getLastUserMessage(req.Messages))
    reasoningTerms := []string{
        "analyze", "compare", "explain why", "write code",
        "generate", "step by step", "reason through", "refactor",
        "debug", "critique", "pros and cons",
    }
    for _, term := range reasoningTerms {
        if strings.Contains(lastMsg, term) {
            score += 0.3
            break // one match is enough
        }
    }

    // Signal 4: explicit complexity hint from the application
    // Your app can set this header to bypass inference entirely
    if req.Headers.Get("X-Complexity-Hint") == "high" {
        score = 1.0
    }

    if score > 1.0 {
        score = 1.0
    }
    return score
}

The X-Complexity-Hint header is the escape hatch. When your app knows a request is complex (a code review, a multi-document analysis), it can signal this directly — skipping inference entirely and routing straight to the right model.

The thresholds (0.3, 0.7) are starting points, not constants. Tune them against a sample of your own traffic: pull a day of requests, score them, manually label a subset, and adjust until the misclassification rate is acceptable for your use case.
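One way to run that calibration, sketched below: score a hand-labeled sample and count disagreements at a candidate threshold pair. The sample data is illustrative, and misclassificationRate is a helper invented for this sketch, not part of the router above:

```go
package main

import "fmt"

// labeled pairs a complexity score with a hand-assigned tier:
// 0 = simple, 1 = standard, 2 = complex.
type labeled struct {
    score float64
    tier  int
}

// misclassificationRate maps each score to a tier using the candidate
// thresholds (lo, hi) and counts disagreements with the hand labels.
func misclassificationRate(samples []labeled, lo, hi float64) float64 {
    wrong := 0
    for _, s := range samples {
        predicted := 0
        switch {
        case s.score >= hi:
            predicted = 2
        case s.score >= lo:
            predicted = 1
        }
        if predicted != s.tier {
            wrong++
        }
    }
    return float64(wrong) / float64(len(samples))
}

func main() {
    // Illustrative hand-labeled sample, not real traffic.
    samples := []labeled{
        {0.1, 0}, {0.2, 0}, {0.4, 1}, {0.6, 1}, {0.8, 2}, {0.9, 2},
    }
    fmt.Printf("rate at (0.3, 0.7): %.2f\n", misclassificationRate(samples, 0.3, 0.7))
}
```

Sweep (lo, hi) over a grid and keep the pair with the lowest rate on your labeled subset.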

The Router: Putting It Together

type RoutingDecision struct {
    Model    string
    Provider string
}

func (r *Router) Route(req *ChatRequest) RoutingDecision {
    // Check for explicit tag routing first (highest priority)
    if decision, ok := r.routeByFeatureTag(req); ok {
        return decision
    }

    // Fall back to complexity-based routing
    complexity := r.estimateComplexity(req)
    switch {
    case complexity < 0.3:
        return RoutingDecision{Model: "gpt-4.1-nano", Provider: "openai"}
    case complexity < 0.7:
        return RoutingDecision{Model: "gpt-4.1", Provider: "openai"}
    default:
        return RoutingDecision{Model: "claude-opus-4-6", Provider: "anthropic"}
    }
}

func (r *Router) routeByFeatureTag(req *ChatRequest) (RoutingDecision, bool) {
    tag := req.Headers.Get("X-Feature")
    switch tag {
    case "support-bot", "faq", "intent-classify":
        // Known-simple features — always route cheap
        return RoutingDecision{Model: "gpt-4.1-nano", Provider: "openai"}, true
    case "code-review", "architecture-analysis":
        // Known-complex features — always route to frontier
        return RoutingDecision{Model: "claude-opus-4-6", Provider: "anthropic"}, true
    default:
        return RoutingDecision{}, false
    }
}

Tag routing takes priority over complexity inference. If your app tags requests by feature (via the X-Feature header — see the gateway architecture post for the full pattern), the router never needs to guess. Inference fills the gap for untagged requests.

Want to find which of your requests need a cheaper model? Preto analyzes your traffic and surfaces routing recommendations with projected savings per endpoint. Free to start.

Failover: What Happens When Your Primary Model Is Down

Routing and failover are separate concerns. The router decides the intended model; a failover layer handles unavailability. Keep them decoupled — a router that tries to do failover becomes impossible to reason about.

// fallbackChains defines the fallback order per tier
var fallbackChains = map[string][]RoutingDecision{
    "simple": {
        {Model: "gpt-4.1-nano", Provider: "openai"},
        {Model: "claude-haiku-4-5", Provider: "anthropic"},
    },
    "standard": {
        {Model: "gpt-4.1", Provider: "openai"},
        {Model: "claude-sonnet-4-6", Provider: "anthropic"},
    },
    "complex": {
        {Model: "claude-opus-4-6", Provider: "anthropic"},
        {Model: "gpt-4.1", Provider: "openai"},
    },
}

func (r *Router) RouteWithFailover(req *ChatRequest) RoutingDecision {
    primary := r.Route(req)
    tier := r.tier(primary.Model)

    for _, candidate := range fallbackChains[tier] {
        if r.circuit.IsAvailable(candidate.Provider) {
            return candidate
        }
    }
    // Last resort: return the primary and let it fail visibly
    return primary
}

The circuit breaker (r.circuit.IsAvailable) tracks provider health separately. When OpenAI returns a run of 5xx errors, the circuit opens and the router automatically falls back to the next provider in the chain — without changing the routing logic.
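A minimal version of that breaker, assuming a simple consecutive-failure policy (the threshold of 5 and the 30-second cooldown are assumptions for this sketch; production breakers usually add half-open probing and sliding windows):

```go
package main

import (
    "sync"
    "time"
)

// CircuitBreaker opens after `threshold` consecutive failures per provider
// and allows traffic again once `cooldown` has elapsed since the last failure.
type CircuitBreaker struct {
    mu        sync.Mutex
    failures  map[string]int
    lastFail  map[string]time.Time
    threshold int
    cooldown  time.Duration
}

func NewCircuitBreaker() *CircuitBreaker {
    return &CircuitBreaker{
        failures:  make(map[string]int),
        lastFail:  make(map[string]time.Time),
        threshold: 5,
        cooldown:  30 * time.Second,
    }
}

func (c *CircuitBreaker) RecordFailure(provider string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.failures[provider]++
    c.lastFail[provider] = time.Now()
}

func (c *CircuitBreaker) RecordSuccess(provider string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.failures[provider] = 0
}

func (c *CircuitBreaker) IsAvailable(provider string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.failures[provider] < c.threshold {
        return true
    }
    // Open circuit: allow retries only after the cooldown.
    return time.Since(c.lastFail[provider]) > c.cooldown
}

func main() {
    cb := NewCircuitBreaker()
    for i := 0; i < 5; i++ {
        cb.RecordFailure("openai")
    }
    // Circuit is now open; RouteWithFailover would pick the next
    // provider in the tier's fallback chain.
    _ = cb.IsAvailable("openai")
}
```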

Where Routing Falls Short

Complexity-based routing is a heuristic. It works well for workloads with a clear mix of simple and complex tasks. It breaks down at the extremes:

Open-ended chat. A short message ("What do you think?") can require deep contextual reasoning. Token count is a poor proxy here. For chat-heavy apps, tag routing (by session type or user tier) is more reliable than inference.

Creative and long-form generation. Quality differences between models are most visible here. Routing a creative brief to gpt-4.1-nano to save $0.002 risks a visibly worse output. Set explicit tags for these features and don't try to infer them.

Calibration drift. New models change the quality-cost frontier. GPT-4o-mini today handles tasks that required GPT-4 a year ago. Review your routing thresholds when models update — the optimal decision tree shifts.

At Preto, we surface the routing decisions and their outcomes alongside cost data — so you can see where the estimator is misrouting and tune thresholds against real production traffic, not guesses.

Frequently Asked Questions

What is LLM model routing?
LLM model routing is the practice of automatically selecting which model handles each request based on the request's characteristics — typically complexity, token count, and task type. Instead of hardcoding every call to GPT-4o, a router sends simple requests (classification, extraction, booleans) to cheaper models and reserves frontier models for tasks that genuinely need them.
How do you estimate request complexity automatically?
A complexity estimator scores each request from 0.0 to 1.0 using signals like total prompt token count, conversation depth (number of turns), and the presence of reasoning keywords (analyze, compare, explain why, generate, step by step). Short prompts asking for a boolean or category score low; long multi-turn conversations asking for code or analysis score high. Applications can also pass an explicit X-Complexity-Hint header to override inference.
How much can model routing reduce LLM costs?
Teams typically see 20–40% cost reduction within the first week of enabling routing. The variance depends on your workload mix: apps with many classification, extraction, or FAQ tasks benefit most. Open-ended chat and code generation apps benefit less, since those tasks genuinely need capable models.
Does model routing affect response quality?
For tasks that score low on complexity — classification, short-answer extraction, sentiment analysis, boolean checks — routing to a smaller model produces equivalent output. Quality degrades only when tasks are misclassified as simple. Tune complexity thresholds against a sample of your own traffic before rolling routing out broadly.
How does model routing interact with failover?
Routing and failover are separate concerns and should be kept decoupled. The router decides the intended model; a failover layer handles provider unavailability. If the primary provider for a routing decision is down, the failover layer picks the next available provider in the chain — preserving the routing intent while ensuring availability.

Find which of your requests need a cheaper model.

Preto analyzes your LLM traffic and surfaces per-endpoint routing recommendations with projected cost savings. One URL change — no code refactor required.

Find Which Requests Need a Cheaper Model

Free forever up to 10K requests. No credit card required.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter