You know your monthly LLM bill. You probably know your blended token cost. Here's the question your finance team is going to ask you next: what does one user-facing request actually cost?
Most AI teams cannot answer this within an order of magnitude. The token bill is visible — it shows up in OpenAI's dashboard. Everything else is invisible: the silent retries that triple the call count under load, the multi-step agent loops that fan one user action out to fifteen LLM calls, the RAG infrastructure that doubles the per-query cost before you ever generate a response. Your provider doesn't tell you any of this, because from their side it's all just billable tokens.
1. Token cost is typically 40–60% of total cost-per-request in production. Retries, multi-call workflows, RAG infrastructure, and failure regeneration make up the rest.
2. Agentic workflows blow up cost-per-request 10–50x compared to single-call inference. A documented April 2026 incident burned $4,200 in 63 hours when a "retry until it works" loop hit a rate limit.
3. The metric that actually matters is cost per unit of delivered value (per resolved ticket, per qualified lead, per generated document) — not raw cost-per-request and definitely not total monthly spend.
The Anatomy of a Cost-Per-Request Calculation
What people think a request costs:

cost per request = (input tokens × input price) + (output tokens × output price)

What it actually costs in production:

cost per request = (visible call cost × retry multiplier × internal calls per action) + retrieval and embedding infrastructure + failure regeneration
The first equation gives you a number. The second gives you the truth. The gap between them is where your gross margin disappears.
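The gap can be sketched numerically. In the toy calculation below, the token prices, retry multiplier, fan-out count, and infrastructure figure are all illustrative assumptions, not benchmarks:

```python
# Sketch: naive vs. loaded cost-per-request.
# All rates and multipliers below are illustrative, not benchmarks.

INPUT_PRICE = 3.00 / 1_000_000    # $/input token (assumed)
OUTPUT_PRICE = 15.00 / 1_000_000  # $/output token (assumed)

def naive_cost(input_tokens: int, output_tokens: int) -> float:
    """What people think a request costs: tokens times list price."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

def loaded_cost(input_tokens: int, output_tokens: int, *,
                retry_multiplier: float = 1.5,      # silent retries replay the prompt
                fanout_calls: int = 3,              # internal eval/judge/tool calls
                infra_per_request: float = 0.0008   # vector DB + embeddings (assumed)
                ) -> float:
    """What it actually costs: every call attributed back to the request."""
    one_call = naive_cost(input_tokens, output_tokens)
    return one_call * retry_multiplier * (1 + fanout_calls) + infra_per_request

visible = naive_cost(2_000, 500)
actual = loaded_cost(2_000, 500)
print(f"naive:  ${visible:.4f}")
print(f"loaded: ${actual:.4f}  ({actual / visible:.1f}x the visible number)")
```

Even with modest assumptions, the loaded figure lands several multiples above the visible one, which is the shape of the gap described above.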
Where the Hidden 40% Lives
Five components consistently account for the cost most teams under-attribute: silent retries, multi-call fan-out from agentic workflows, retrieval infrastructure (vector DB queries and reranking), embedding generation, and failure regeneration.
The exact split varies by use case. RAG-heavy products skew higher on infrastructure and embeddings. Pure chat products skew higher on retries, because connection failures concentrate on long streaming responses. Agentic products skew dramatically higher on multi-call fan-out, sometimes to the point where the original visible call is less than 5% of the total bill.
The single biggest under-attribution is retries. Network errors trigger a full inference run. Rate-limit (429) retries replay the entire prompt. Tool-call validation failures regenerate. Production observability data from Portkey shows silent retries multiplying tokens 2–5x in failure modes — and almost no team's dashboard surfaces this until they go looking.
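The replay effect is easy to make concrete: billed tokens scale with attempts, not with requests. A minimal sketch, assuming a conservative billing model in which every failed attempt bills the full prompt and a regenerated completion:

```python
def billed_tokens(prompt_tokens: int, completion_tokens: int, attempts: int) -> int:
    """Every attempt, including 429s and tool-call validation failures,
    replays the full prompt. Only the last attempt's completion reaches
    the user, but the model here assumes each failed attempt still bills
    a full regenerated completion (real partial-failure billing varies)."""
    return attempts * (prompt_tokens + completion_tokens)

clean = billed_tokens(4_000, 800, attempts=1)
flaky = billed_tokens(4_000, 800, attempts=4)   # 3 silent retries under load
print(f"multiplier under 3 silent retries: {flaky / clean:.0f}x")
```

Three silent retries already put you at 4x billed tokens, inside the 2–5x failure-mode range the Portkey data reports.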
The $4,200 Agent: Why the Cost-Per-Request Math Has Tails
April 2026 production incident. An autonomous agent was instructed to "keep trying until it works." It hit a persistent 429 rate-limit, retried in a tight loop for 63 hours, and burned $4,200 against an expected $50 budget — an 84x overrun on a single failure mode. The original cost-per-request estimate was three orders of magnitude lower than what actually shipped. Source: Sattyam Jain, Medium.
This is not an outlier. It's the recurring failure pattern of agentic systems shipped without per-request budget enforcement.
The math: a 50-turn Claude Sonnet 4.5 session benchmarks around $0.90 per session, so 100 sessions per hour compounds to roughly $2,160 per day. A separate documented case showed a single runaway execution costing $4.80 against a $0.31 baseline: a 15x blow-up from one missing step limit, sustained over a two-week window.
The reason this keeps happening: agentic workflows fan out one user-facing action into 5–50 internal LLM calls. Most of those calls are evaluation, judging, planning, and tool-handler invocations — invisible to the user, fully visible to your bill. Fiddler's analysis calls this the "trust tax" — every invisible eval call is a hedge against the visible call going wrong.
If your product ships agents and your cost-per-request estimate doesn't include a worst-case tail (typically 10–20x the median), you're not measuring cost-per-request. You're measuring cost-per-happy-path.
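The cheapest defense against this tail is a hard per-request budget. A minimal sketch; the dollar limit, step cap, and per-call cost are illustrative, and the bare loop stands in for a hypothetical agent runtime:

```python
class BudgetExceeded(Exception):
    pass

class RequestBudget:
    """Hard caps on spend and steps for one user-facing request.
    Turns an unbounded 'retry until it works' loop into a bounded failure."""
    def __init__(self, max_usd: float = 0.50, max_steps: int = 25):
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent = 0.0
        self.steps = 0

    def charge(self, usd: float) -> None:
        self.steps += 1
        self.spent += usd
        if self.spent > self.max_usd or self.steps > self.max_steps:
            raise BudgetExceeded(
                f"request halted at ${self.spent:.2f}, step {self.steps}")

budget = RequestBudget(max_usd=0.50, max_steps=25)
try:
    while True:               # the '63-hour loop', now bounded
        budget.charge(0.03)   # assumed per-call cost; read the usage field in practice
except BudgetExceeded as e:
    print(e)                  # worst case is now ~$0.50, not $4,200
```

The enforcement point matters more than the exact numbers: the check runs on every charge, so the worst case is one call past the cap rather than 63 hours past it.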
Industry Benchmarks: What Real Cost-Per-Request Looks Like
| Use case | Reported cost | Reference |
|---|---|---|
| RAG lookup (simple) | $0.001 – $0.005 | AlphaCorp 2026 RAG benchmarks |
| Complex agent workflow | $0.01 – $0.05 | AlphaCorp 2026 RAG benchmarks |
| AI-resolved support ticket (Intercom Fin) | $0.99 / resolution | fin.ai pricing page |
| AI conversation (Salesforce Agentforce) | $2.00 / conversation | Salesforce Agentforce pricing |
| Gorgias AI ticket | $0.60 – $1.27 | Gorgias tiered pricing |
| AI invoice processing | $2.36 / invoice | Parseur 2026 benchmarks |
| Manual invoice processing (comparison) | $22.75 / invoice | Parseur 2026 benchmarks |
| GitHub Copilot premium request overage | $0.04 / request | github.com/features/copilot/plans |
The most-quoted number on this list is Intercom Fin at $0.99 per resolved ticket. That number is not just a price — it's a unit-economics frame. It's why Fin scaled from $1M ARR to over $100M while Intercom's classic seat-based revenue model would have struggled to match that growth. Cost per resolution is the denominator that maps directly to customer value: a ticket resolved is a support cost saved. Everything else is internal accounting.
The cautionary tale sits at the other end of the spectrum. Investor commentary on Cursor reportedly described the company as "spending 100% of its revenue on Anthropic" — a phrase that is not audited financials but captures a real category problem. When cost-per-request scales linearly with engagement and price doesn't, gross margin collapses at the moment your product is succeeding.
Want to see your true cost-per-request?
Preto attributes every LLM call to a request_id, tracks retries and multi-call fan-out automatically, and surfaces per-feature and per-user cost without you instrumenting anything.
See Your Cost Per Request — Free
Not just tokens. Preto tracks full-stack cost per request automatically.
How to Actually Track It (Without Building Observability From Scratch)
The implementation pattern that holds up in production:
1. Propagate a single request_id through every internal call. One user action gets one ID. Every LLM call, vector DB query, and embedding lookup downstream of that action carries the same ID in metadata. OpenAI's user and metadata fields are the recommended attribution primitives — user for the actor, metadata for trace_id, feature_id, tenant_id. OpenAI's production best-practices guide documents the convention.
2. Adopt the OpenTelemetry GenAI semantic conventions. The spec defines gen_ai.usage.input_tokens and gen_ai.usage.output_tokens at the metric level; when a provider reports both billable and consumed tokens, the convention is to record the billable figure. Datadog already emits these attributes natively for LLM spans. The advantage of using a standard: aggregation across multi-vendor stacks (OpenAI + Anthropic + self-hosted) without writing custom collectors.
3. Pick observability based on the workflow shape. Helicone is the lowest-friction path for proxy-based attribution — drop in an alternate base URL, get per-request cost out of the box. Langfuse is tracing-first and better suited to multi-step agent workflows where the parent-child call graph matters more than the per-call cost. The trade-off is real but neither is wrong.
4. Compute the four metrics that actually matter, weekly:
- Cost per active user (DAU, not MAU) — power users distort blended figures. A user opening your product once per month costs near zero. A daily user running 50 queries can cost $10–30/month.
- Cost as % of revenue — target below 20%; above 30% needs active intervention. Drivetrain's CFO guide for AI SaaS uses this framing explicitly.
- Cost per unit of delivered value — per resolved ticket, per qualified lead, per generated document, per accepted code completion. The denominator that ties to revenue.
- p99 cost-per-request — the agentic tail. If p50 is $0.04 and p99 is $4.00, you have a budget enforcement problem masked by a healthy median.
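Step 1 can be sketched as a thin wrapper that stamps attribution fields onto an outgoing call payload before it reaches the provider SDK. The `user` and `metadata` fields follow the convention above; the `feature_id` and `tenant_id` values here, and the wrapper itself, are hypothetical:

```python
import uuid

def tag_payload(payload: dict, *, request_id: str, feature_id: str,
                tenant_id: str, user_id: str) -> dict:
    """Attach attribution fields to an outgoing LLM call payload.
    Follows the convention from OpenAI's best-practices guidance:
    `user` for the actor, `metadata` for trace/feature/tenant.
    The metadata key names are our own convention, not an API requirement."""
    payload = dict(payload)  # copy so the caller's dict is untouched
    payload["user"] = user_id
    payload["metadata"] = {
        "trace_id": request_id,
        "feature_id": feature_id,
        "tenant_id": tenant_id,
    }
    return payload

request_id = str(uuid.uuid4())  # minted once per user action, reused downstream
call = tag_payload({"model": "gpt-4o-mini", "messages": []},
                   request_id=request_id, feature_id="search_answers",
                   tenant_id="acct_123", user_id="u_456")
```

Every downstream call (retries included) reuses the same request_id, which is what makes the later aggregation possible.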
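Steps 1 and 4 meet in the aggregation: group every tagged call by its request_id, then read the distribution. A minimal sketch with hypothetical per-call costs:

```python
from collections import defaultdict

# One row per LLM / vector / embedding call, tagged at emit time (step 1).
calls = [
    {"request_id": "r1", "usd": 0.020},
    {"request_id": "r1", "usd": 0.015},   # retry: same request_id
    {"request_id": "r2", "usd": 0.030},
    {"request_id": "r3", "usd": 3.90},    # runaway agent tail
]

per_request = defaultdict(float)
for c in calls:
    per_request[c["request_id"]] += c["usd"]

costs = sorted(per_request.values())
p50 = costs[len(costs) // 2]
p99 = costs[-1]  # with real volume, use statistics.quantiles(costs, n=100)[98]
print(f"p50=${p50:.3f}  p99=${p99:.2f}")
```

Even this toy dataset shows the pattern from the fourth metric: a healthy-looking median sitting next to a tail two orders of magnitude larger.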
The Number Most Founders Are Missing
Total monthly LLM spend is a lagging indicator of margin pressure. It tells you what already happened. The four metrics above are leading indicators — they tell you what will happen to your gross margin in the next 30 days at your current trajectory.
The teams that build durable AI products track them. The teams that don't, eventually present a 70% gross margin in a board deck and discover during diligence that the real number was 38%. That's the gap covered in our breakdown of the unit economics nobody shows on AI SaaS pitch decks. The reason that gap exists is the cost-per-request math is wrong by default — and stays wrong until someone instruments for it.
The metadata tag fix is the smallest, fastest version of this. One feature flag tagged on every request can reveal four-figure waste in a single afternoon. The cost-per-request work is the bigger version of the same idea: instrument the dimension that matters, then act on what you find.
Frequently Asked Questions
What is the difference between cost per token and cost per request?
Cost per token is the provider's unit price. Cost per request is everything a single user action triggers: the original inference call, silent retries, internal agent calls, retrieval queries, and embedding lookups, plus the infrastructure around them. The two can differ by an order of magnitude.

How do you calculate cost per request including retries and tool calls?
Propagate a single request_id through every internal call, attribute each call's token and infrastructure cost to that ID, then sum per ID. Anything that doesn't carry the ID is unattributed spend you aren't seeing.

What is a typical cost per request for a production AI application?
Published benchmarks range from $0.001–0.005 for a simple RAG lookup to $0.01–0.05 for a complex agent workflow, and roughly $0.60–2.00 per resolved conversation for commercial support agents.

Why do agentic workflows blow up cost per request?
One user-facing action fans out into 5–50 internal LLM calls for planning, evaluation, judging, and tool handling, and failure modes like unbounded retry loops add a long tail that typically runs 10–20x the median.

What metric should I track instead of total monthly LLM spend?
Cost per unit of delivered value (per resolved ticket, per qualified lead, per generated document), alongside cost per daily active user, cost as a percentage of revenue, and p99 cost-per-request.
Stop estimating. Start measuring cost per request.
Preto attributes every LLM call — original, retry, tool, judge, eval — to a single request_id. You see the median, the p99, and the per-feature breakdown without instrumenting any of it yourself.
See Your Cost Per Request — Free
Works with OpenAI, Anthropic, Bedrock, Vertex. Free forever up to 10K requests.