You know your monthly LLM bill. You probably know your blended token cost. Here's the question your finance team is going to ask you next: what does one user-facing request actually cost?

Most AI teams cannot answer this within an order of magnitude. The token bill is visible — it shows up in OpenAI's dashboard. Everything else is invisible: the silent retries that triple the call count under load, the multi-step agent loops that fan one user action out to fifteen LLM calls, the RAG infrastructure that doubles the per-query cost before you ever generate a response. Your provider doesn't tell you any of this, because from their side it's all just billable tokens.

TL;DR

1. Token cost is typically 40–60% of total cost-per-request in production. Retries, multi-call workflows, RAG infrastructure, and failure regeneration make up the rest.
2. Agentic workflows blow up cost-per-request 10–50x compared to single-call inference. A documented April 2026 incident burned $4,200 in 63 hours when a "retry until it works" loop hit a rate limit.
3. The metric that actually matters is cost per unit of delivered value (per resolved ticket, per qualified lead, per generated document) — not raw cost-per-request and definitely not total monthly spend.

The Anatomy of a Cost-Per-Request Calculation

What people think a request costs:

```python
# What most teams measure
cost_per_request = (input_tokens * input_price) + (output_tokens * output_price)
```

What it actually costs in production:

```python
# What teams should be measuring
cost_per_request = (
    direct_llm_token_cost          # original call
    + retry_token_cost             # 429s, network errors, validation failures
    + multi_call_token_cost        # tool calls, judge/eval calls, RAG synthesis
    + failed_output_token_cost     # regenerations from quality fails
    + embedding_cost               # RAG retrieval embeddings
    + vector_db_query_cost         # pgvector / Pinecone / Turbopuffer ops
    + proxy_compute_cost           # gateway, logging, observability
    + amortized_engineering_cost   # ongoing prompt/eval maintenance
)
```

The first equation gives you a number. The second gives you the truth. The gap between them is where your gross margin disappears.

Where the Hidden 40% Lives

Five components make up the true cost per request; the last four are the ones most teams under-attribute:

| Component | Share of true cost |
| --- | --- |
| Visible token cost | ~55% |
| Retries (silent) | ~15% |
| Multi-call fan-out | ~14% |
| RAG infra (vector DB + embeddings) | ~9% |
| Regenerations on quality fail | ~7% |
| **True cost-per-request** | **100%** |

The percentages are a working composite — exact splits vary by use case. RAG-heavy products skew higher on infrastructure and embeddings. Pure chat products skew higher on retries because TCP failures concentrate on streaming responses. Agentic products skew dramatically higher on multi-call fan-out, sometimes to the point where the original visible call is less than 5% of the total bill.

The single biggest under-attribution is retries. Network errors trigger a full inference run. Rate-limit (429) retries replay the entire prompt. Tool-call validation failures regenerate. Production observability data from Portkey shows silent retries multiplying tokens 2–5x in failure modes — and almost no team's dashboard surfaces this until they go looking.
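The retry multiplier is easy to make visible at the call site. A sketch, with `call_llm` as a stand-in for any provider call and every name and price illustrative: ordinary exponential backoff, but each replay is charged to an explicit ledger instead of vanishing into the monthly bill.

```python
# Sketch: exponential backoff that charges each replay to an explicit ledger.
# call_llm is a stand-in for any provider call; prices and names are illustrative.
import random
import time

COST_LEDGER = {"attempts": 0, "retry_cost_usd": 0.0}

def call_llm(prompt):
    # Stand-in: fails like a flaky provider about half the time.
    if random.random() < 0.5:
        raise RuntimeError("429: rate limited")
    return "ok"

def call_with_visible_retries(prompt, price_per_call=0.02, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        COST_LEDGER["attempts"] += 1
        try:
            return call_llm(prompt)
        except RuntimeError:
            # A retry replays the entire prompt, so it costs a full call.
            # Recording it here is what makes the 2-5x multiplier visible.
            COST_LEDGER["retry_cost_usd"] += price_per_call
            time.sleep(0.001 * 2 ** attempt)  # backoff, shortened for the sketch
    raise RuntimeError("retries exhausted")
```

In production each ledger entry would be keyed by the request_id so retry cost rolls up into cost-per-request instead of disappearing into the aggregate bill.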

The $4,200 Agent: Why the Cost-Per-Request Math Has Tails

An April 2026 production incident: an autonomous agent was instructed to "keep trying until it works." It hit a persistent 429 rate limit, retried in a tight loop for 63 hours, and burned $4,200 against an expected $50 budget, an 84x overrun on a single failure mode and nearly two orders of magnitude above the original cost-per-request estimate. Source: Sattyam Jain, Medium.

This is not an outlier. It's the recurring failure pattern of agentic systems shipped without per-request budget enforcement.

The math: a 50-turn Claude Sonnet 4.5 session benchmarks around $0.90 per session, so 100 sessions per hour is roughly $2,100 per day. A separate documented case showed a single runaway execution costing $4.80 against a normal $0.31, a 15x blow-up that persisted over a two-week window because of one missing step limit.

The reason this keeps happening: agentic workflows fan out one user-facing action into 5–50 internal LLM calls. Most of those calls are evaluation, judging, planning, and tool-handler invocations — invisible to the user, fully visible to your bill. Fiddler's analysis calls this the "trust tax" — every invisible eval call is a hedge against the visible call going wrong.

If your product ships agents and your cost-per-request estimate doesn't include a worst-case tail (typically 10–20x the median), you're not measuring cost-per-request. You're measuring cost-per-happy-path.
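A per-request budget cap is a few dozen lines at the orchestration layer. A sketch, with illustrative limits and names; the invariant is that spend is checked before every internal call, not after the invoice arrives:

```python
# Sketch: per-request budget enforcement at the orchestration layer.
# Limits and names are illustrative; the invariant is that spend is checked
# before every internal call runs, not after the invoice arrives.
class BudgetExceeded(Exception):
    pass

class RequestBudget:
    def __init__(self, limit_usd, step_limit=50):
        self.limit_usd = limit_usd
        self.step_limit = step_limit
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, estimated_cost_usd):
        """Call before each internal LLM call; raises instead of overspending."""
        if self.steps + 1 > self.step_limit:
            raise BudgetExceeded(f"step limit {self.step_limit} reached")
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"${self.spent_usd + estimated_cost_usd:.2f} would exceed "
                f"${self.limit_usd:.2f} cap"
            )
        self.steps += 1
        self.spent_usd += estimated_cost_usd
```

A "retry until it works" loop wrapped in this dies at the cap, sized to the worst-case tail (for example p99 times a safety factor), not at $4,200.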

Industry Benchmarks: What Real Cost-Per-Request Looks Like

| Use case | Reported cost | Reference |
| --- | --- | --- |
| RAG lookup (simple) | $0.001 – $0.005 | AlphaCorp 2026 RAG benchmarks |
| Complex agent workflow | $0.01 – $0.05 | AlphaCorp 2026 RAG benchmarks |
| AI-resolved support ticket (Intercom Fin) | $0.99 / resolution | fin.ai pricing page |
| AI conversation (Salesforce Agentforce) | $2.00 / conversation | Salesforce Agentforce pricing |
| Gorgias AI ticket | $0.60 – $1.27 | Gorgias tiered pricing |
| AI invoice processing | $2.36 / invoice | Parseur 2026 benchmarks |
| Manual invoice processing (comparison) | $22.75 / invoice | Parseur 2026 benchmarks |
| GitHub Copilot premium request overage | $0.04 / request | github.com/features/copilot/plans |

The most-quoted number on this list is Intercom Fin at $0.99 per resolved ticket. That number is not just a price; it's a unit-economics frame. It's why Fin scaled from $1M ARR to over $100M while Intercom's classic seat-based revenue model would have struggled to keep pace. Cost per resolution is the denominator that maps directly to customer value: a ticket resolved is a support cost saved. Everything else is internal accounting.

The contrast is the cautionary tale on the other end of the spectrum. Investor commentary on Cursor reportedly described the company as "spending 100% of its revenue on Anthropic" — a phrase that is not audited financials but captures a real category problem. When cost-per-request scales linearly with engagement and price doesn't, gross margin collapses at the moment your product is succeeding.

Want to see your true cost-per-request?

Preto attributes every LLM call to a request_id, tracks retries and multi-call fan-out automatically, and surfaces per-feature and per-user cost without you instrumenting anything.

See Your Cost Per Request — Free

Not just tokens. Preto tracks full-stack cost per request automatically.

How to Actually Track It (Without Building Observability From Scratch)

The implementation pattern that holds up in production:

1. Propagate a single request_id through every internal call. One user action gets one ID. Every LLM call, vector DB query, and embedding lookup downstream of that action carries the same ID in metadata. OpenAI's user and metadata fields are the recommended attribution primitives — user for the actor, metadata for trace_id, feature_id, tenant_id. OpenAI's production best-practices guide documents the convention.
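A minimal sketch of that propagation: `user` and `metadata` are the OpenAI fields named above, while the helper itself and the field names inside `metadata` are illustrative.

```python
# Sketch: stamping one request_id onto every downstream call's kwargs.
# `user` and `metadata` follow the OpenAI attribution convention; the helper
# and the specific metadata field names are illustrative.
import uuid

def new_request_context(user_id, feature_id, tenant_id):
    return {
        "user": user_id,                    # the actor
        "metadata": {
            "trace_id": str(uuid.uuid4()),  # one ID per user-facing action
            "feature_id": feature_id,
            "tenant_id": tenant_id,
        },
    }

def stamp(call_kwargs, ctx):
    """Merge attribution fields into any downstream call's kwargs."""
    stamped = dict(call_kwargs)
    stamped["user"] = ctx["user"]
    stamped["metadata"] = {**ctx["metadata"], **call_kwargs.get("metadata", {})}
    return stamped

ctx = new_request_context("user-42", "summarize", "acme")
llm_kwargs = stamp({"model": "gpt-4o-mini", "messages": []}, ctx)
embed_kwargs = stamp({"model": "text-embedding-3-small"}, ctx)
```

Because both calls are stamped from the same context, the LLM call and the embedding lookup share one trace_id and roll up into one cost-per-request.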

2. Adopt the OpenTelemetry GenAI semantic conventions. The spec defines gen_ai.usage.input_tokens and gen_ai.usage.output_tokens at the metric level, reporting billable tokens (not just tokens consumed) when the provider exposes both. Datadog already emits these natively for LLM spans. The advantage of using a standard: aggregation across multi-vendor stacks (OpenAI + Anthropic + self-hosted) without writing custom collectors.
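For illustration, here is the shape of the attributes. Only gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are the spec names cited above; the other attribute name and the plain-dict form are simplifications (a real setup would set these as span attributes via the opentelemetry SDK).

```python
# Sketch: token usage recorded under the OTel GenAI semantic-convention names.
# Only gen_ai.usage.input_tokens / output_tokens are the spec names cited in
# the text; the dict-instead-of-span shape is a simplification for the sketch.
def genai_usage_attributes(model, input_tokens, output_tokens):
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = genai_usage_attributes("claude-sonnet-4-5", 1200, 340)
```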

3. Pick observability based on the workflow shape. Helicone is the lowest-friction path for proxy-based attribution — drop in an alternate base URL, get per-request cost out of the box. Langfuse is tracing-first and better suited to multi-step agent workflows where the parent-child call graph matters more than the per-call cost. The trade-off is real but neither is wrong.

4. Compute the four metrics that actually matter, weekly:
   - Cost per request: median and p99, broken out per feature.
   - Cost per active user: per-DAU, not per-MAU.
   - Cost as a percentage of revenue: target below 20%; above 30% needs intervention.
   - Cost per unit of delivered value: per resolved ticket, per qualified lead, per generated document.
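The weekly roll-up is a few lines once per-request costs are queryable. A sketch, with illustrative inputs; p50 and p99 together count as the cost-per-request metric:

```python
# Sketch: the weekly roll-up. Inputs are illustrative aggregates pulled from
# whatever store holds per-request costs; names are not a real schema.
import statistics

def weekly_metrics(request_costs_usd, dau, revenue_usd, units_delivered):
    total = sum(request_costs_usd)
    percentiles = statistics.quantiles(request_costs_usd, n=100)
    return {
        "cost_per_request_p50": statistics.median(request_costs_usd),
        "cost_per_request_p99": percentiles[98],        # the agentic tail lives here
        "cost_per_dau": total / dau,                    # per-DAU, not per-MAU
        "cost_pct_of_revenue": 100 * total / revenue_usd,   # target < 20%
        "cost_per_unit_of_value": total / units_delivered,  # e.g. resolved tickets
    }
```

Run against a week where 99 requests cost a cent each and one runaway costs a dollar, the p50 stays at $0.01 while the p99 captures the tail, which is exactly the split a median-only dashboard hides.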

The Number Most Founders Are Missing

Total monthly LLM spend is a lagging indicator of margin pressure. It tells you what already happened. The four metrics above are leading indicators — they tell you what will happen to your gross margin in the next 30 days at your current trajectory.

The teams that build durable AI products track them. The teams that don't, eventually present a 70% gross margin in a board deck and discover during diligence that the real number was 38%. That's the gap covered in our breakdown of the unit economics nobody shows on AI SaaS pitch decks. The reason that gap exists is that the cost-per-request math is wrong by default — and stays wrong until someone instruments for it.

The metadata tag fix is the smallest, fastest version of this. One feature flag tagged on every request can reveal four-figure waste in a single afternoon. The cost-per-request work is the bigger version of the same idea: instrument the dimension that matters, then act on what you find.

Frequently Asked Questions

What is the difference between cost per token and cost per request?
Cost per token is the unit price your provider bills you. Cost per request is the all-in cost to serve one user-facing interaction — the visible LLM call plus retries, multi-call agentic loops, RAG infrastructure (vector DB queries, embeddings), and any function-calling chain. Token cost is typically 40–60% of total request cost in production. Teams that budget on token cost alone consistently under-forecast by 1.5–3x.
How do you calculate cost per request including retries and tool calls?
Track three cost components per user-facing request: direct token cost across every LLM call triggered (original plus all internal eval/judge/retry/tool-handler calls), infrastructure cost (vector DB queries, embedding generation, proxy compute, log writes), and failure cost (full token cost of any unusable output that was regenerated). Sum and divide by user-facing request count. The cleanest implementation propagates a single request_id through every internal call.
What is a typical cost per request for a production AI application?
Reported ranges: simple RAG lookups $0.001–$0.005, complex agent workflows $0.01–$0.05, AI-resolved support tickets ~$0.99 (Intercom Fin), AI conversations ~$2.00 (Salesforce Agentforce), Gorgias AI tickets $0.60–$1.27, AI invoice processing ~$2.36 versus ~$22.75 manual. The number you actually care about is your own per-request cost relative to the revenue it generates, not an industry average.
Why do agentic workflows blow up cost per request?
An agent loop fans out 5–50 LLM calls per user-facing action — most are internal eval, judge, and tool-handler calls rather than visible response. A 50-turn Claude Sonnet 4.5 session benchmarks around $0.90; 100 per hour is $2,100/day. The amplification gets worse with retry logic: one April 2026 incident saw an agent loop on a 429 rate limit for 63 hours, burning $4,200 against an expected $50 budget. Without per-request budget caps at the proxy or orchestration layer, worst-case cost is unbounded.
What metric should I track instead of total monthly LLM spend?
Three metrics, weekly: cost per active user (per-DAU, not per-MAU), cost as percentage of revenue (target below 20%; above 30% needs intervention), and cost per unit of delivered value relevant to your product (cost per resolved ticket, per qualified lead, per code suggestion accepted, per generated document). Total monthly spend is a lagging indicator. The three above are leading indicators of margin pressure.

Stop estimating. Start measuring cost per request.

Preto attributes every LLM call — original, retry, tool, judge, eval — to a single request_id. You see the median, the p99, and the per-feature breakdown without instrumenting any of it yourself.

See Your Cost Per Request — Free

Works with OpenAI, Anthropic, Bedrock, Vertex. Free forever up to 10K requests.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter