The slide was titled "Unit Economics." It said gross margin: 72%. I'd built the spreadsheet myself, copied the formula from the SaaS metrics deck Lenny published, and felt good about how the number compared to the traditional SaaS benchmarks I'd grown up reading.

Then a friend who runs an AI infra fund asked me where LLM costs sat in the model. I'd put them under "operating expenses" — the same bucket as Vercel hosting and the Postgres bill. He asked me to move them into COGS.

The 72% became 31%.

The product hadn't changed. The customers hadn't changed. The bill from OpenAI hadn't changed. What changed was that I stopped pretending the API spend was infrastructure and started counting it as what it actually is: variable cost of goods sold that scales with how much my users use the product.

TL;DR

1. If you're an AI founder and your LLM API spend isn't in COGS, your gross margin is wrong by 20–40 points — and your investors are going to find this during diligence.
2. The fix is real: model routing, prompt caching, and usage tier enforcement typically recover 20–40 points of gross margin without raising prices.
3. The teams that get this right early raise rounds. The teams that don't end up explaining a 30-point margin compression to existing investors who feel misled.

The Spreadsheet I'd Been Lying to Myself With

I run a $30/month AI writing tool. About 1,200 paying users. Most of them are light — they paste in three blog drafts a week and walk away. A handful of power users grind on it daily. I had been calculating margin like this:

MRR per user: $30.00
Stripe processing: −$0.90
Hosting (Vercel + DB), allocated: −$1.50
Customer support time, allocated: −$1.20
Email + analytics SaaS, allocated: −$0.80
Gross profit per user: $25.60 (85%)

Looked great. Felt great. Was wrong.

The OpenAI bill was sitting in a separate line on my P&L called "Software & Tools," next to my Notion subscription. About $13,500/month across 1,200 users — so an average of $11.25 per user, per month, in API spend. Nobody had told me to put that in COGS, so I hadn't.

Add it where it belongs:

MRR per user: $30.00
Stripe processing: −$0.90
Hosting + support + tools, allocated: −$3.50
LLM API spend (variable COGS): −$11.25
Gross profit per user: $14.35 (47.8%)

That's the blended number. The blended number is also a lie — because the average hides what's happening at the tails.

Where the Margin Actually Goes

I segmented the OpenAI spend by user. The top decile — about 120 power users — accounted for roughly 55% of total API spend. Their per-user math:

Top 10% by usage: −$32/mo. $30 in revenue, $62 in API spend, and that's before counting the $4.40/user of Stripe and allocated overhead. Each of these users costs me money every month they stay subscribed.
Bottom 50% by usage: +$24/mo. $30 in revenue, ~$2 in API spend, ~$4.40 in other COGS. These users carry the company. Without them, I'm out of business.

The light users were subsidizing the power users. That's not a unit-economics model. That's a hope strategy.
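The blended-versus-segmented arithmetic above fits in a few lines. The totals are the ones from my bill; the 55% decile share is the measured figure quoted above:

```python
# Reproduce the blended vs. top-decile math from the figures above.
TOTAL_API_SPEND = 13_500.0   # monthly OpenAI bill, whole company
USERS = 1_200
PRICE = 30.0                 # monthly subscription
OTHER_COGS = 0.90 + 3.50     # Stripe + allocated hosting/support/tools

# Blended view: average API spend per user hides the tails.
blended_api = TOTAL_API_SPEND / USERS              # $11.25/user
blended_profit = PRICE - OTHER_COGS - blended_api  # $14.35 (47.8%)

# Top decile: ~55% of total spend concentrated in 120 power users.
top_users = USERS // 10
top_api = 0.55 * TOTAL_API_SPEND / top_users       # ~$61.88/user
top_net_of_api = PRICE - top_api                   # ~-$32/user, before other COGS

print(f"blended:    ${blended_profit:.2f}/user ({blended_profit / PRICE:.1%})")
print(f"top decile: ${top_net_of_api:.2f}/user net of API spend alone")
```

Run this against your own bill with your own decile share and the shape of the problem falls out immediately.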

I'm not the first person to hit this. Sam Altman publicly said in January 2025 that ChatGPT Pro at $200/month was losing money. Cursor's transition from "unlimited" to 500 fast requests/month on Pro was an explicit response to power-user economics. Anthropic added weekly usage limits to Claude Pro, citing "a small number of users consuming disproportionate resources." GitHub Copilot introduced "premium requests" tiering. Replit Agent moved to checkpoint-based pricing. Vercel v0 went from free to credit-based.

The pattern is the same across all of them: flat subscription pricing breaks at the power user. Either you fix the math at the cost side or you fix it at the price side. Most companies eventually do both.

The Investor Diligence Conversation Is Coming

Here's what an investor sees when they look at AI SaaS metrics in 2026:

Sequoia's "$600B Question" framed the gap between AI infrastructure spend and AI revenue. a16z's "Navigating the High Cost of AI Compute" made the case that compute is the dominant COGS line for AI products. Bessemer's State of the Cloud 2024 explicitly wrote about inference costs moving from R&D into COGS as products mature. Jamin Ball at Altimeter has been writing about public AI-exposed SaaS gross margins compressing into the 60s versus traditional SaaS 75–85%.

The investor reading your deck has internalized all of that. When they see 80% gross margin on an AI product, the first question is: is LLM cost in there? If the answer is no, the rest of the conversation is about whether you understand your own business.

I learned this the easy way — a friend who happens to invest in this space asked the question over coffee, not over a pitch. Most founders learn it the hard way.

Want to know which of your users are actually unprofitable?

Preto attributes API spend to user IDs and pricing tiers automatically. You see your real per-user margin — and which power users are eating the rest.

See Which Users Are Costing You Money

Your LLM costs are variable COGS. Preto shows you per-user profitability.

The Four Levers I Pulled (Without Raising Prices)

I gave myself a quarter to fix the margin without changing the public price. Here's what worked.

Lever 1: Model routing. About 60% of my API spend was on Sonnet 4.6 doing tasks that Haiku 4.5 handles fine — outline expansion, formatting cleanup, simple rewrites. Haiku 4.5 is roughly 12x cheaper. I built a routing layer that classified the task type and sent the easy ones to Haiku, escalating only structurally complex generation to Sonnet. Tools like OpenRouter and NotDiamond can do this without you building the classifier yourself; both report 40–85% cost reduction on routable workloads.

Result: 28% reduction in monthly API spend.
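The routing layer can start as something as small as a lookup from task type to model. This is a toy sketch, not my production classifier; the task labels and model id strings are placeholders:

```python
# Toy model router: send tasks the cheap model handles well to the cheap
# model, escalate everything else. Labels and model ids are placeholders.
CHEAP_MODEL = "claude-haiku-4-5"
STRONG_MODEL = "claude-sonnet-4-6"

# Task types that scored acceptably on the eval set with the cheap model.
ROUTABLE_TO_CHEAP = {"outline_expansion", "formatting_cleanup", "simple_rewrite"}

def pick_model(task_type: str) -> str:
    """Return the cheapest model that is good enough for this task type."""
    return CHEAP_MODEL if task_type in ROUTABLE_TO_CHEAP else STRONG_MODEL
```

The real work is producing task_type reliably; hosted routers like OpenRouter and NotDiamond replace this static table with a learned classifier.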

Lever 2: Prompt caching. My system prompt was 1,800 tokens — full of writing-style instructions, brand voice rules, output formatting. Same prompt on every call. Anthropic's prompt caching launched in August 2024 with the official claim of up to 90% cost reduction on long shared prompts. Du'An Lightfoot publicly documented going from $720/mo to $72/mo on the same pattern.

I added the cache_control markers, made sure no dynamic content was in the prefix, and watched the input cost line drop.

Result: another 22% reduction on what was left after lever 1.
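The caching change itself is a one-block edit to the request. A minimal sketch, assuming the Anthropic Messages API shape (the model id is a placeholder); the constraint that matters is that everything before the cache marker must be byte-identical across calls:

```python
def build_request(static_system_prompt: str, user_message: str) -> dict:
    """Build Messages API kwargs with the static system prompt marked
    cacheable. Any dynamic content must come after the cached prefix,
    or the cache never hits."""
    return {
        "model": "claude-sonnet-4-6",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": static_system_prompt,  # style guide, voice rules: identical every call
                "cache_control": {"type": "ephemeral"},  # Anthropic's cache marker
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

# Pass the dict straight to client.messages.create(**build_request(...)).
```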

Lever 3: System prompt audit. I went line by line through my system prompt and removed anything that wasn't either (a) measurably improving output quality on my eval set or (b) actively preventing a known failure mode. Three of my five "few-shot examples" hadn't moved quality numbers in months. Half my formatting instructions were redundant after I switched to Structured Outputs. The prompt went from 1,800 tokens to 950.

Microsoft's LLMLingua research demonstrates up to 20x compression with about 1.5% performance loss. I didn't go that far. A manual audit got me almost 2x.

Result: another 12% reduction.

Lever 4: Usage tier enforcement. The hardest one to ship and the one I'd been avoiding. I added a soft cap at 100 generations per month on the $30 tier and offered a $79 "power" tier with 500 generations. Of the 120 power users, about half upgraded, most of the rest hit the cap and stayed, and 14 cancelled outright. The churn was real, but the math on the remaining 106 went from −$32/user/month to either +$15 (cap users) or +$23 (upgraded users).

Result: stopped bleeding on the long tail.
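The enforcement itself is a small check in front of every generation. The tier names and caps below mirror the numbers above; the upgrade flow and billing plumbing are omitted:

```python
# Soft-cap check run before every generation. Caps mirror the tiers above.
TIER_CAPS = {"standard": 100, "power": 500}   # generations per month

def can_generate(tier: str, used_this_month: int) -> bool:
    """True if the user has quota left on their tier this month."""
    return used_this_month < TIER_CAPS[tier]

def remaining(tier: str, used_this_month: int) -> int:
    """Generations left; drives the 'upgrade to power' prompt at zero."""
    return max(TIER_CAPS[tier] - used_this_month, 0)
```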

Where the Margin Landed

After all four levers, the new per-user math:

Blended MRR per user (after tier mix shift): $33.20
Stripe processing: −$1.00
Hosting + support + tools, allocated: −$3.50
LLM API spend (post-optimization): −$5.40
Gross profit per user: $23.30 (70.2%)

From the real 47.8% to a real 70.2%. Not the fake 85% I started with. Not the panic-induced 31% I got after I correctly classified COGS without doing any optimization. The actual number, with LLM costs in the right place and the four levers actually pulled.

The thing I didn't appreciate until I'd done all the work: the optimization isn't a one-time project. The day I shipped the routing layer, OpenAI released a new model. The day I added prompt caching, my product team wanted to add a feature that broke the cache prefix. Margin is a moving target on AI products in a way it isn't on traditional SaaS, because the cost side is volatile. Pricing pages don't change every month. Model pricing does. New features land in the prompt prefix and silently break caching. Power users find new ways to grind on the product.

You don't fix this once. You instrument for it once and then keep watching.

What I Wish I'd Done at $5K MRR

If you're earlier-stage than I was when I learned this, here's the cheap version of the lesson:

Put LLM costs in COGS from the day you charge your first dollar. It doesn't matter that formal accounting guidance hasn't caught up — your investors and your future CFO already think this way. The discipline of looking at the right number every month is worth far more than the small amount of extra modeling work it takes.

Track per-user cost from day one, not from the day you raise. Tag every API request with a user ID. The infrastructure is one URL change — a proxy that does this for you exists, and the cost of setting it up before you have a problem is less than 1% of the cost of discovering you have a problem after you've raised on a misleading metric.
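The minimum viable version of that instrumentation is an in-process ledger keyed by user ID. A sketch, with placeholder per-token prices rather than current list rates:

```python
from collections import defaultdict

# Placeholder per-1K-token prices; substitute your providers' current rates.
PRICE_PER_1K_TOKENS = {"cheap-model": 0.001, "strong-model": 0.015}

spend_by_user: defaultdict = defaultdict(float)

def record_usage(user_id: str, model: str, total_tokens: int) -> None:
    """Attribute the cost of one API call to the user who triggered it."""
    spend_by_user[user_id] += PRICE_PER_1K_TOKENS[model] * total_tokens / 1000

# Call record_usage(...) wherever you already log the API response;
# a monthly per-user margin report is then one pass over spend_by_user.
```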

This is the work I cover in more depth in "why cost-per-request is the number you should be losing sleep over" and "the unit economics nobody shows on AI SaaS pitch decks." Both go deeper on the implementation than the founder narrative I've kept this post to.

The TL;DR for everyone who hasn't done it yet: your gross margin number is wrong, and the fix is straightforward. Move LLM costs into COGS where they belong. Then start pulling the four levers above, in roughly that order. You'll recover most of the gap. The teams that don't are going to be the ones explaining a 30-point margin surprise during their Series A diligence.

Frequently Asked Questions

Should LLM API costs be classified as COGS or operating expenses?
COGS. LLM API spend scales directly with feature usage — it is variable cost of goods sold, not fixed infrastructure. Bessemer's State of the Cloud 2024 explicitly addressed the transition from R&D-funded experimentation to revenue-generating production. Classifying LLM spend as 'infrastructure' or 'hosting' artificially inflates gross margin and hides margin pressure that will surface during diligence. The analytical norm in the SaaS investor community has converged on COGS even where formal GAAP standards have not.
What is a realistic gross margin for an AI startup with LLM costs in COGS?
Reported ranges: Jamin Ball (Clouded Judgement) has noted public AI-exposed SaaS gross margins compressing into the 60s vs traditional SaaS 75–85%. Private AI-native companies with unoptimized stacks regularly land in the 30–55% range when LLM costs are properly classified — Sequoia and a16z have both written extensively about this gap. Optimized stacks with model routing, prompt caching, and tier-based usage limits typically move from those baselines to 55–75%, recovering most of the gap to traditional SaaS.
Why are AI startups losing money on power users?
Because LLM cost-per-user is unbounded above the price of the subscription. A user paying $20/month who uses your product 200 times can cost $50 in API spend — net negative on that user. Cursor's transition to 500 fast requests/month on Pro was a direct response. Anthropic added weekly usage limits citing 'a small number of users consuming disproportionate resources.' Sam Altman publicly stated in January 2025 that ChatGPT Pro at $200/month was losing money. The pattern across Cursor, Replit Agent, GitHub Copilot, and Vercel v0 is the same: subscription pricing breaks at the power user.
How do you fix gross margin without raising prices?
Four levers: (1) Model routing — Haiku 4.5 is ~12x cheaper than Sonnet 4.6; OpenRouter, Martian, and NotDiamond report 40–85% cost reduction. (2) Prompt caching — Anthropic's official claim is up to 90% reduction; OpenAI is 50% off cached input on most models. (3) System prompt audit — LLMLingua research demonstrates up to 20x compression with ~1.5% performance loss. (4) Usage caps + tier enforcement — prevents the negative-margin power user from existing at scale. Most teams get 20–40 points of recovery from levers 1–3 alone.
What is the difference between gross margin and contribution margin for AI startups?
Gross margin = revenue minus COGS (hosting, support, LLM API). Contribution margin subtracts variable costs that scale with customer count, including CAC amortization, customer success time, and any usage-based costs. For AI startups, gross margin is the right metric for benchmarking against other SaaS, but contribution margin by cohort tells you whether growth is creating value. Both should include LLM API costs as variable COGS. Many AI startups present gross margin numbers that look healthy because LLM costs were excluded; contribution margin by cohort, properly calculated, often reveals the heaviest users are the most unprofitable.
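As a worked example of the distinction, using the pre-optimization per-user figures from earlier in the post; the CAC and customer-success allocations are hypothetical numbers for illustration:

```python
# Gross vs. contribution margin per user per month. COGS matches the
# mid-post table; CAC and success allocations are hypothetical.
revenue = 30.00
cogs = 0.90 + 3.50 + 11.25          # Stripe + hosting/support/tools + LLM API

gross_margin = (revenue - cogs) / revenue            # ~47.8%

cac_amortized = 4.00   # hypothetical: CAC spread over expected lifetime
success_time = 1.50    # hypothetical: per-user customer-success allocation
contribution_margin = (revenue - cogs - cac_amortized - success_time) / revenue
```

Contribution margin is always the lower number; if it goes negative for a cohort, growth in that cohort is destroying value.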

Find your real gross margin in 5 minutes.

Preto attributes LLM spend to users, tiers, and features automatically. You see per-user profitability — and which of your power users are silently eating the rest of your margin. One URL change. No code refactor.

See Which Users Are Costing You Money

Free forever up to 10K requests. No credit card required.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter