Here's the number that should bother you: LLM API prices fell roughly 80% last year. The cost per million tokens dropped across every major provider. And yet — most SaaS companies saw their AI bills go up.
This is the trap. Cheaper tokens mean more AI. More AI means more features. More features mean more calls, and more users mean still more calls hitting those features. The math compounds in the wrong direction even when the unit price is falling. AI spending more than doubled for the average production SaaS in 2025 — and 2026 looks worse, not better, because the usage curve hasn't flattened.
1. LLM prices are falling roughly 30–50% per year. AI bills are rising anyway, because usage grows faster than prices fall. The compound math runs against you.
2. The average production app has 30–40% recoverable waste — duplicate requests, simple tasks on expensive models, oversized prompts. This is money you're paying without benefit.
3. The survival plan has five steps: observe, measure, cache, route, cap. In that order. Skipping observation is why most optimization attempts fail.
Why Bills Triple When Prices Fall
The compound math is straightforward once you see it. Consider a typical SaaS that added AI features 18 months ago:
Month 1: One AI feature — a support bot. 5,000 requests/day. Bill: $400/month.
Month 6: The support bot works. You add AI to search, document summarization, and email drafting. Four features now. Volume: 25,000 requests/day. Bill: $1,800/month. Token prices dropped 30% since launch — so the bill "only" grew 4.5x instead of 5x.
Month 18: You doubled users. Every feature compounds. An agentic workflow you shipped fires 8 LLM calls per user action. Volume: 200,000 requests/day. Bill: $8,500/month. Token prices dropped another 40%. Still 4.7x growth on the bill from month 6.
This is the pattern. Prices fall 30–50% per year. Usage grows 3–5x per year in a healthy AI-native product. The net is a bill that doubles or triples annually regardless of what happens to per-token costs.
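The compound math above can be sketched in a few lines. The growth and price-drop rates below are illustrative assumptions, not measurements; plug in your own numbers.

```python
# Net year-over-year bill multiplier: usage growth vs. falling token prices.

def annual_bill_multiplier(usage_growth: float, price_drop: float) -> float:
    """usage_growth: 4.0 means usage grows 4x in a year.
    price_drop:   0.40 means per-token prices fall 40% in that year."""
    return usage_growth * (1 - price_drop)

# Usage grows 4x while prices fall 40%: the bill still 2.4x's.
print(annual_bill_multiplier(4.0, 0.40))  # 2.4

# Even a 50% price drop loses to 3x usage growth.
print(annual_bill_multiplier(3.0, 0.50))  # 1.5
```

The takeaway is that the bill only shrinks when the price drop outpaces usage growth, which rarely happens in a product that is actively shipping AI features.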
The companies that survive this aren't the ones who got lucky with cheap models. They're the ones who built cost discipline into how they ship AI.
The 4 Amplifiers Compounding Your Costs Right Now
1. Agentic workflows multiplying calls. A single user action in an agentic system can trigger 5–15 LLM calls: planner, subagents, validator, summarizer. If you built that workflow when tokens were cheap and never revisited the call count, you're paying for architecture decisions made under different economics.
2. Feature sprawl without visibility. You have 8 AI features. You know your total bill. You don't know which feature caused last month's spike. Without per-feature cost attribution, you can't prioritize optimization — so you optimize nothing. Every feature continues burning at whatever rate it was built with.
3. Model tier creep. The first engineer to integrate an LLM hardcodes the best available model — it's the safe default. That model stays in the code for 18 months. New models come out that cost 10–20x less and handle the same task. Nobody goes back. The expensive model keeps running.
4. No caching at the proxy layer. Support bots see the same 50 questions asked thousands of times per day. Scheduled jobs run the same prompt with the same data on every run. Without a caching layer, every duplicate hits the LLM fresh. The average production app sends 15–20% duplicate requests. You're paying for each one.
Want to see your own breakdown before reading further?
Preto shows you cost by feature, model, and endpoint — the data you need to find where your bill is actually going.
Get the LLM Cost Estimation Spreadsheet
Plug in your models and request volume. See your projected monthly bill in 10 minutes.
What AI Actually Costs by Product Category
These are representative monthly ranges for a mid-size SaaS (approximately 50,000 MAU, $500K ARR) in production. The wide ranges reflect the difference between unoptimized and optimized implementations of the same use case.
The categories with the widest ranges — Voice AI, Content Generation — are also the ones where optimization has the most leverage. A voice AI app paying $15K/month unoptimized typically has 40–60% recoverable waste from redundant transcription calls, over-provisioned models on classification steps, and missing caching on repeated intents.
If you're in one of these categories and have never done a systematic cost audit, there is almost certainly 30–50% of your bill sitting in waste you could eliminate this quarter.
Where the Waste Actually Hides
The OpenAI dashboard shows you one number: total spend. It tells you nothing about which portion of that spend delivered value and which portion was waste. Here's how that spend typically breaks down in an unoptimized production app:
The bottom line: roughly 60% of the average production LLM bill is either waste or overhead that could be reduced. That number will surprise you if you've never measured it. It doesn't surprise the teams that have.
The 5-Part Survival Plan
The order here is not arbitrary. Every step depends on the one before it.
Observe: Get per-feature visibility
Tag every LLM request with the feature or endpoint that triggered it. Without this, everything else is guesswork. You can't optimize what you can't see. This is a one-line change: add an X-Feature header to every LLM call. A proxy layer captures it and attributes cost automatically.
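A minimal sketch of that tagging step, assuming a generic HTTP-based LLM gateway. The header name `X-Feature`, the proxy URL, and the request body shape are illustrative; adapt them to whatever your gateway expects.

```python
import json
import urllib.request

PROXY_URL = "https://llm-proxy.internal/v1/chat"  # hypothetical proxy endpoint


def tagged_request(prompt: str, feature: str,
                   url: str = PROXY_URL) -> urllib.request.Request:
    """Build an LLM request tagged with the feature that triggered it."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Feature": feature,  # the one-line attribution change
        },
    )

# Sending is unchanged, e.g.:
#   urllib.request.urlopen(tagged_request(question, "support-bot"))
```

Once every call site passes its feature name, the proxy can group spend by header value and the per-feature breakdown falls out for free.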
Measure: Calculate cost per unit of value
Raw token counts don't tell you if you have a problem. Cost-per-ticket-resolved, cost-per-document-processed, cost-per-code-review does. Set a budget for what each AI feature is allowed to cost per user action. Anything above that threshold needs investigation.
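As a sketch of what "cost per unit of value" means in practice, the function below turns raw token totals into a cost-per-ticket number. All prices and volumes are illustrative assumptions; substitute your own.

```python
# Cost per resolved ticket from token totals and per-million-token prices.

def cost_per_resolution(tokens_in: float, tokens_out: float,
                        price_in_per_m: float, price_out_per_m: float,
                        tickets_resolved: int) -> float:
    spend = (tokens_in / 1e6) * price_in_per_m + (tokens_out / 1e6) * price_out_per_m
    return spend / tickets_resolved

# 40M input tokens at $0.50/1M, 8M output tokens at $1.50/1M, 2,000 tickets:
print(round(cost_per_resolution(40e6, 8e6, 0.50, 1.50, 2000), 4))  # 0.016
```

A number like $0.016 per resolved ticket is something you can set a budget against; a raw monthly token count is not.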
Cache: Eliminate exact and semantic duplicates
Start with exact-match caching (SHA-256 prompt hashing). Zero false positives, under 1ms overhead, immediate results. Once you've measured your remaining duplicate rate, add semantic caching for near-matches. Typical result: 15–25% cost reduction with no change to application code.
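Exact-match caching is simple enough to sketch in full. The in-memory dict below stands in for whatever store your proxy uses, and `llm_call` is a placeholder for your real client; both are assumptions for illustration.

```python
import hashlib

_cache: dict[str, str] = {}  # key -> cached completion (stand-in for a real store)


def cached_completion(model: str, prompt: str, llm_call) -> str:
    """Return a cached answer for an exact (model, prompt) duplicate,
    otherwise call the LLM and cache the result."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key in _cache:          # exact duplicate: skip the API entirely
        return _cache[key]
    result = llm_call(model, prompt)
    _cache[key] = result
    return result
```

Hashing `model` together with `prompt` keeps responses from different model tiers from colliding; the SHA-256 key means zero false positives, since only byte-identical requests ever share a cache entry.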
Route: Match model to task complexity
Audit your top 10 endpoints by cost. For each one, ask: does this task actually need a frontier model? Classification, extraction, sentiment, boolean answers — these belong on cheap models ($0.10–0.40/1M tokens). Reasoning, code generation, complex summarization — keep those on capable models. Teams that implement routing typically see 20–40% cost reduction in the first week.
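A routing layer can start as nothing more than a lookup keyed on task type. The task labels and model names below are illustrative assumptions; map your own endpoints and your current model IDs onto them.

```python
# Complexity-based routing: simple tasks go to a cheap tier,
# everything else stays on a capable model.

CHEAP_TASKS = {"classification", "extraction", "sentiment", "boolean"}


def pick_model(task_type: str) -> str:
    if task_type in CHEAP_TASKS:
        return "small-fast-model"   # illustrative: the $0.10-0.40/1M token tier
    return "frontier-model"         # reasoning, codegen, complex summarization

print(pick_model("sentiment"))  # small-fast-model
print(pick_model("codegen"))    # frontier-model
```

Even this static table captures most of the win; per-request complexity scoring can come later, once the top endpoints are routed.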
Cap: Enforce budgets before surprises hit
Set hard spending limits per feature and per team. Not alerts — limits. An alert fires after the damage. A limit stops it. Budget enforcement at the proxy layer means a runaway feature or a new agentic workflow doesn't turn into a finance conversation at the end of the month.
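The difference between an alert and a limit is where the check runs: before the call, not after the invoice. A minimal sketch of pre-call enforcement, assuming in-memory counters (a real proxy would persist spend durably):

```python
# Hard per-feature budget enforced before each call.

class BudgetExceeded(Exception):
    pass


class BudgetGate:
    def __init__(self, limits: dict):
        self.limits = limits                       # feature -> monthly USD cap
        self.spent = {f: 0.0 for f in limits}

    def charge(self, feature: str, cost: float) -> None:
        """Record spend, or refuse the call if it would breach the cap."""
        if self.spent[feature] + cost > self.limits[feature]:
            raise BudgetExceeded(f"{feature} would exceed its cap")  # block, not alert
        self.spent[feature] += cost
```

Calling `charge()` before dispatching the LLM request means a runaway feature fails fast at its cap instead of quietly accumulating spend until month end.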
Are You Already Behind? A 5-Minute Diagnostic
If you answer yes to three or more of these, your AI costs are already compounding and you are operating without the data to stop them:
Risk signals
These signals don't mean you're doomed — they mean you're in the position most teams are in before they get serious about AI cost management. The work to fix them is specific, sequential, and faster than most teams expect. Most of it doesn't require changes to application code.
By Industry: What This Looks Like in Your Specific Context
The cost math plays out differently depending on your product category. The compounding drivers — agentic workflows, user growth, feature sprawl — manifest in distinct ways across industries:
- Fintech: Document analysis and compliance workflows are input-token-heavy. The risk is prompt size, not request count.
- Healthcare SaaS: HIPAA requirements constrain caching options. Every optimization must be evaluated against data residency rules.
- EdTech: Tutoring bots have high per-session token counts — the unit economics question is cost-per-session, not cost-per-request.
- Developer tools: Code review and code generation are output-heavy. The waste pattern is over-routing — simple linting tasks on frontier models.
- Customer support: The highest duplicate rates of any category. Support bots see the same questions constantly. Caching ROI is immediate.
- Voice AI and call automation: Cascaded architecture (STT → LLM → TTS) means 3 billable calls per user utterance. Cost compounds at every layer.
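The cascaded voice case is worth a quick worked example, since cost accrues at all three layers per utterance. Every price and duration below is an illustrative assumption, not a quote from any provider.

```python
# Cost per utterance in a cascaded STT -> LLM -> TTS pipeline.

def utterance_cost(stt_per_min: float, audio_min: float,
                   llm_in_tok: int, llm_out_tok: int,
                   llm_in_per_m: float, llm_out_per_m: float,
                   tts_per_1k_chars: float, reply_chars: int) -> float:
    stt = stt_per_min * audio_min
    llm = llm_in_tok / 1e6 * llm_in_per_m + llm_out_tok / 1e6 * llm_out_per_m
    tts = reply_chars / 1000 * tts_per_1k_chars
    return stt + llm + tts

# Assumed: $0.006/min STT, 12s of audio, 1,500 in / 300 out tokens at
# $0.50/$1.50 per 1M, $0.015 per 1K TTS characters, 400-char reply.
cost = utterance_cost(0.006, 0.2, 1500, 300, 0.50, 1.50, 0.015, 400)
print(round(cost, 4))  # ~0.0084 per utterance
```

Under these assumptions, a fraction of a cent per utterance multiplied across thousands of daily calls is exactly why the per-layer breakdown matters before optimizing.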
Industry-specific cost breakdowns, benchmarks, and optimization patterns for each of these categories are covered in the guides in this series.
Frequently Asked Questions
Why do AI costs triple even when LLM prices are falling?
What percentage of revenue should a SaaS company spend on LLM APIs?
What are the biggest sources of LLM API waste?
How much can companies actually save by optimizing LLM costs?
What is the first step to controlling LLM costs?
Start with the numbers — then fix them.
The LLM Cost Estimation Spreadsheet lets you plug in your models and request volume and see your projected monthly bill in 10 minutes. It's the fastest way to find out if you have a cost problem before the next invoice does.
Get the LLM Cost Estimation Spreadsheet
Free. No signup required. Or connect Preto to see your actual costs in real time.