Here's the number that should bother you: LLM API prices fell roughly 80% last year. The cost per million tokens dropped across every major provider. And yet — most SaaS companies saw their AI bills go up.
This is the trap. Cheaper tokens mean more AI. More AI means more features. More features mean more calls, and more users mean still more calls hitting those features. The math compounds in the wrong direction even when the unit price is falling. AI spending more than doubled for the average production SaaS in 2025 — and 2026 looks worse, not better, because the usage curve hasn't flattened.
1. LLM prices are falling roughly 30–50% per year. AI bills are rising anyway, because usage grows faster than prices fall. The compound math runs against you.
2. The average production app has 30–40% recoverable waste — duplicate requests, simple tasks on expensive models, oversized prompts. This is money you're paying without benefit.
3. The survival plan has five steps: observe, measure, cache, route, cap. In that order. Skipping observation is why most optimization attempts fail.
Why Bills Triple When Prices Fall
The compound math is straightforward once you see it. Consider a typical SaaS that added AI features 18 months ago:
Month 1: One AI feature — a support bot. 5,000 requests/day. Bill: $400/month.
Month 6: The support bot works. You add AI to search, document summarization, and email drafting. Four features now. Volume: 25,000 requests/day. Bill: $1,800/month. Token prices dropped 30% since launch — so the bill "only" grew 4.5x instead of 5x.
Month 18: You doubled users. Every feature compounds. An agentic workflow you shipped fires 8 LLM calls per user action. Volume: 200,000 requests/day. Bill: $8,500/month. Token prices dropped another 40%. Still 4.7x growth on the bill from month 6.
This is the pattern. Prices fall 30–50% per year. Usage grows 3–5x per year in a healthy AI-native product. The net is a bill that doubles or triples annually regardless of what happens to per-token costs.
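The compound math above can be sketched in a few lines. The growth and price-drop rates below are illustrative assumptions, not measurements; plug in your own numbers.

```python
# Net year-over-year bill multiplier: usage growth vs. falling token prices.

def annual_bill_multiplier(usage_growth: float, price_drop: float) -> float:
    """usage_growth: 4.0 means usage grows 4x in a year.
    price_drop:   0.40 means per-token prices fall 40% in that year."""
    return usage_growth * (1 - price_drop)

# Usage grows 4x while prices fall 40%: the bill still 2.4x's.
print(annual_bill_multiplier(4.0, 0.40))  # 2.4

# Even a 50% price drop loses to 3x usage growth.
print(annual_bill_multiplier(3.0, 0.50))  # 1.5
```

The takeaway is that the bill only shrinks when the price drop outpaces usage growth, which rarely happens in a product that is actively shipping AI features.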
The companies that survive this aren't the ones who got lucky with cheap models. They're the ones who built cost discipline into how they ship AI.
The 4 Amplifiers Compounding Your Costs Right Now
1. Agentic workflows multiplying calls. A single user action in an agentic system can trigger 5–15 LLM calls: planner, subagents, validator, summarizer. If you built that workflow when tokens were cheap and never revisited the call count, you're paying for architecture decisions made under different economics.
2. Feature sprawl without visibility. You have 8 AI features. You know your total bill. You don't know which feature caused last month's spike. Without per-feature cost attribution, you can't prioritize optimization — so you optimize nothing. Every feature continues burning at whatever rate it was built with.
3. Model tier creep. The first engineer to integrate an LLM hardcodes the best available model — it's the safe default. That model stays in the code for 18 months. New models come out that cost 10–20x less and handle the same task. Nobody goes back. The expensive model keeps running.
4. No caching at the proxy layer. Support bots see the same 50 questions asked thousands of times per day. Scheduled jobs run the same prompt with the same data on every run. Without a caching layer, every duplicate hits the LLM fresh. The average production app sends 15–20% duplicate requests. You're paying for each one.
Want to see your own breakdown before reading further?
Preto shows you cost by feature, model, and endpoint — the data you need to find where your bill is actually going.
Get the LLM Cost Estimation Spreadsheet
Plug in your models and request volume. See your projected monthly bill in 10 minutes.
What AI Actually Costs by Product Category
These are representative monthly ranges for a mid-size SaaS (approximately 50,000 MAU, $500K ARR) in production. The wide ranges reflect the difference between unoptimized and optimized implementations of the same use case.
The categories with the widest ranges — Voice AI, Content Generation — are also the ones where optimization has the most leverage. A voice AI app paying $15K/month unoptimized typically has 40–60% recoverable waste from redundant transcription calls, over-provisioned models on classification steps, and missing caching on repeated intents.
If you're in one of these categories and have never done a systematic cost audit, there is almost certainly 30–50% of your bill sitting in waste you could eliminate this quarter.
Where the Waste Actually Hides
The OpenAI dashboard shows you one number: total spend. It tells you nothing about which portion of that spend delivered value and which portion was waste. Here's how that spend typically breaks down in an unoptimized production app:
The bottom line: roughly 60% of the average production LLM bill is either waste or overhead that could be reduced. That number will surprise you if you've never measured it. It doesn't surprise the teams that have.
The 5-Part Survival Plan
The order here is not arbitrary. Every step depends on the one before it.
Observe: Get per-feature visibility
Tag every LLM request with the feature or endpoint that triggered it. Without this, everything else is guesswork. You can't optimize what you can't see. This is a one-line change: add an X-Feature header to every LLM call. A proxy layer captures it and attributes cost automatically.
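A minimal sketch of that tagging step, assuming a generic HTTP-based LLM gateway. The header name `X-Feature`, the proxy URL, and the request body shape are illustrative; adapt them to whatever your gateway expects.

```python
import json
import urllib.request

PROXY_URL = "https://llm-proxy.internal/v1/chat"  # hypothetical proxy endpoint


def tagged_request(prompt: str, feature: str,
                   url: str = PROXY_URL) -> urllib.request.Request:
    """Build an LLM request tagged with the feature that triggered it."""
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Feature": feature,  # the one-line attribution change
        },
    )

# Sending is unchanged, e.g.:
#   urllib.request.urlopen(tagged_request(question, "support-bot"))
```

Once every call site passes its feature name, the proxy can group spend by header value and the per-feature breakdown falls out for free.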
Measure: Calculate cost per unit of value
Raw token counts don't tell you if you have a problem. Cost-per-ticket-resolved, cost-per-document-processed, cost-per-code-review does. Set a budget for what each AI feature is allowed to cost per user action. Anything above that threshold needs investigation.
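As a sketch of what "cost per unit of value" means in practice, the function below turns raw token totals into a cost-per-ticket number. All prices and volumes are illustrative assumptions; substitute your own.

```python
# Cost per resolved ticket from token totals and per-million-token prices.

def cost_per_resolution(tokens_in: float, tokens_out: float,
                        price_in_per_m: float, price_out_per_m: float,
                        tickets_resolved: int) -> float:
    spend = (tokens_in / 1e6) * price_in_per_m + (tokens_out / 1e6) * price_out_per_m
    return spend / tickets_resolved

# 40M input tokens at $0.50/1M, 8M output tokens at $1.50/1M, 2,000 tickets:
print(round(cost_per_resolution(40e6, 8e6, 0.50, 1.50, 2000), 4))  # 0.016
```

A number like $0.016 per resolved ticket is something you can set a budget against; a raw monthly token count is not.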
Cache: Eliminate exact and semantic duplicates
Start with exact-match caching (SHA-256 prompt hashing). Zero false positives, under 1ms overhead, immediate results. Once you've measured your remaining duplicate rate, add semantic caching for near-matches. Typical result: 15–25% cost reduction with no change to application code.
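Exact-match caching is simple enough to sketch in full. The in-memory dict below stands in for whatever store your proxy uses, and `llm_call` is a placeholder for your real client; both are assumptions for illustration.

```python
import hashlib

_cache: dict[str, str] = {}  # key -> cached completion (stand-in for a real store)


def cached_completion(model: str, prompt: str, llm_call) -> str:
    """Return a cached answer for an exact (model, prompt) duplicate,
    otherwise call the LLM and cache the result."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key in _cache:          # exact duplicate: skip the API entirely
        return _cache[key]
    result = llm_call(model, prompt)
    _cache[key] = result
    return result
```

Hashing `model` together with `prompt` keeps responses from different model tiers from colliding; the SHA-256 key means zero false positives, since only byte-identical requests ever share a cache entry.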
Route: Match model to task complexity
Audit your top 10 endpoints by cost. For each one, ask: does this task actually need a frontier model? Classification, extraction, sentiment, boolean answers — these belong on cheap models ($0.10–0.40/1M tokens). Reasoning, code generation, complex summarization — keep those on capable models. Teams that implement routing typically see 20–40% cost reduction in the first week.
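A routing layer can start as nothing more than a lookup keyed on task type. The task labels and model names below are illustrative assumptions; map your own endpoints and your current model IDs onto them.

```python
# Complexity-based routing: simple tasks go to a cheap tier,
# everything else stays on a capable model.

CHEAP_TASKS = {"classification", "extraction", "sentiment", "boolean"}


def pick_model(task_type: str) -> str:
    if task_type in CHEAP_TASKS:
        return "small-fast-model"   # illustrative: the $0.10-0.40/1M token tier
    return "frontier-model"         # reasoning, codegen, complex summarization

print(pick_model("sentiment"))  # small-fast-model
print(pick_model("codegen"))    # frontier-model
```

Even this static table captures most of the win; per-request complexity scoring can come later, once the top endpoints are routed.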
Cap: Enforce budgets before surprises hit
Set hard spending limits per feature and per team. Not alerts — limits. An alert fires after the damage. A limit stops it. Budget enforcement at the proxy layer means a runaway feature or a new agentic workflow doesn't turn into a finance conversation at the end of the month.
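The difference between an alert and a limit is where the check runs: before the call, not after the invoice. A minimal sketch of pre-call enforcement, assuming in-memory counters (a real proxy would persist spend durably):

```python
# Hard per-feature budget enforced before each call.

class BudgetExceeded(Exception):
    pass


class BudgetGate:
    def __init__(self, limits: dict):
        self.limits = limits                       # feature -> monthly USD cap
        self.spent = {f: 0.0 for f in limits}

    def charge(self, feature: str, cost: float) -> None:
        """Record spend, or refuse the call if it would breach the cap."""
        if self.spent[feature] + cost > self.limits[feature]:
            raise BudgetExceeded(f"{feature} would exceed its cap")  # block, not alert
        self.spent[feature] += cost
```

Calling `charge()` before dispatching the LLM request means a runaway feature fails fast at its cap instead of quietly accumulating spend until month end.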
Are You Already Behind? A 5-Minute Diagnostic
If you answer yes to three or more of these, your AI costs are already compounding and you are operating without the data to stop them:
Risk signals
These signals don't mean you're doomed — they mean you're in the position most teams are in before they get serious about AI cost management. The work to fix them is specific, sequential, and faster than most teams expect. Most of it doesn't require changes to application code.
By Industry: What This Looks Like in Your Specific Context
The cost math plays out differently depending on your product category. The compounding drivers — agentic workflows, user growth, feature sprawl — manifest in distinct ways across industries:
- Fintech: Document analysis and compliance workflows are input-token-heavy. The risk is prompt size, not request count.
- Healthcare SaaS: HIPAA requirements constrain caching options. Every optimization must be evaluated against data residency rules.
- EdTech: Tutoring bots have high per-session token counts — the unit economics question is cost-per-session, not cost-per-request.
- Developer tools: Code review and code generation are output-heavy. The waste pattern is over-routing — simple linting tasks on frontier models.
- Customer support: The highest duplicate rates of any category. Support bots see the same questions constantly. Caching ROI is immediate.
- Voice AI and call automation: Cascaded architecture (STT → LLM → TTS) means 3 billable calls per user utterance. Cost compounds at every layer.
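The cascaded voice case is worth a quick worked example, since cost accrues at all three layers per utterance. Every price and duration below is an illustrative assumption, not a quote from any provider.

```python
# Cost per utterance in a cascaded STT -> LLM -> TTS pipeline.

def utterance_cost(stt_per_min: float, audio_min: float,
                   llm_in_tok: int, llm_out_tok: int,
                   llm_in_per_m: float, llm_out_per_m: float,
                   tts_per_1k_chars: float, reply_chars: int) -> float:
    stt = stt_per_min * audio_min
    llm = llm_in_tok / 1e6 * llm_in_per_m + llm_out_tok / 1e6 * llm_out_per_m
    tts = reply_chars / 1000 * tts_per_1k_chars
    return stt + llm + tts

# Assumed: $0.006/min STT, 12s of audio, 1,500 in / 300 out tokens at
# $0.50/$1.50 per 1M, $0.015 per 1K TTS characters, 400-char reply.
cost = utterance_cost(0.006, 0.2, 1500, 300, 0.50, 1.50, 0.015, 400)
print(round(cost, 4))  # ~0.0084 per utterance
```

Under these assumptions, a fraction of a cent per utterance multiplied across thousands of daily calls is exactly why the per-layer breakdown matters before optimizing.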
Industry-specific cost breakdowns, benchmarks, and optimization patterns for each of these categories are covered in the guides in this series.
Frequently Asked Questions
Why do AI costs triple even when LLM prices are falling?
What percentage of revenue should a SaaS company spend on LLM APIs?
What are the biggest sources of LLM API waste?
How much can companies actually save by optimizing LLM costs?
What is the first step to controlling LLM costs?
Start with the numbers — then fix them.
The LLM Cost Estimation Spreadsheet lets you plug in your models and request volume and see your projected monthly bill in 10 minutes. It's the fastest way to find out if you have a cost problem before the next invoice does.
Get the LLM Cost Estimation Spreadsheet
Free. No signup required. Or connect Preto to see your actual costs in real time.