You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.
The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic, and the reality looked very different.
This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.
1. Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to accuracy of cache matches, not frequency of hits.
2. Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.
3. Preto detects cacheable prompts automatically and shows your duplicate rate before you change anything.
Exact Caching vs. Semantic Caching: Two Different Problems
Before diving into architecture, the distinction matters because most teams should start with exact caching and only add semantic caching if exact caching alone doesn't cover enough.
Exact caching
Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.
```python
def cached_completion(model, prompt, temperature, max_tokens):
    cache_key = sha256(model + prompt + str(temperature) + str(max_tokens))
    cached = redis.get(cache_key)
    if cached:
        return cached  # <5ms, zero LLM cost
    response = call_llm(prompt)
    redis.set(cache_key, response, ttl=3600)
    return response
```
Pros: Zero false positives. Sub-millisecond lookup. Trivial to implement.
Cons: Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.
Exact caching alone catches more traffic than you'd expect. The average production app sends 15-30% identical requests — automated pipelines, retries, and users asking the same FAQ.
Semantic caching
Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.
```python
def semantically_cached_completion(prompt):
    embedding = embed_model.encode(prompt)  # ~2-5ms
    matches = vector_db.search(embedding, threshold=0.92)
    if matches:
        return matches[0].response  # <5ms total
    response = call_llm(prompt)
    vector_db.upsert(embedding, response, ttl=3600)
    return response
```
Pros: Catches semantically similar requests with different wording.
Cons: Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.
The 95% Myth: What the Numbers Actually Say
The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:
| Source | Hit Rate | Context | Type |
|---|---|---|---|
| Portkey (production) | ~20% | RAG use cases, 99% match accuracy | Vendor data |
| EdTech platform (production) | ~45% | Student Q&A — high repetition | Case study |
| GPT Semantic Cache (academic) | 61-69% | Controlled benchmark, curated dataset | Research paper |
| General production estimate | 30-40% | Mixed traffic across use cases | Industry average |
| Open-ended chat (production) | 10-20% | Unique conversations, low repetition | Observed range |
The 95% number, when you trace it back, almost always refers to match accuracy — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.
The honest range for production semantic caching: 20-45% hit rate, depending heavily on use case.
Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.
Hit Rates by Use Case: Where Caching Works (and Doesn't)
| Use Case | Expected Hit Rate | Why |
|---|---|---|
| FAQ / customer support | 40-60% | Users ask the same questions in slightly different ways. High repetition, bounded answer space. |
| Classification / labeling | 50-70% | Automated pipelines often send identical or near-identical inputs. Exact caching alone captures most of this. |
| Internal knowledge base Q&A | 30-45% | Employees ask similar questions about policies, processes, docs. Moderate repetition. |
| RAG with document retrieval | 15-25% | Context varies per query even if questions are similar. The retrieved documents change the prompt. |
| Open-ended chat | 10-20% | Conversations are unique. Multi-turn context makes each request different even if the user message is similar. |
| Code generation | 5-15% | High specificity per request. Users want varied outputs. Caching risks returning wrong code for a different context. |
The pattern: bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.
The Threshold Problem: 0.85 vs. 0.92 vs. 0.98
The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.
- Threshold 0.85 (aggressive): More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.
- Threshold 0.92 (balanced): The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries. This is where most teams start.
- Threshold 0.98 (conservative): Almost-exact matching. Very few false positives, but you're catching only the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.
There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.
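To make the trade-off concrete, here's a minimal sketch of a threshold-gated lookup using plain cosine similarity. The 3-d vectors and the `lookup` helper are illustrative stand-ins, not output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_emb, cache, threshold=0.92):
    """Return the best cached response above the threshold, else None."""
    best_score, best_response = 0.0, None
    for cached_emb, response in cache:
        score = cosine_similarity(query_emb, cached_emb)
        if score >= threshold and score > best_score:
            best_score, best_response = score, response
    return best_response

# Toy cache: two stored embeddings (illustrative 3-d vectors).
cache = [
    ([0.9, 0.1, 0.0], "To reset your password, go to Settings."),
    ([0.0, 0.2, 0.9], "To change your email, contact support."),
]

query = [0.88, 0.15, 0.05]   # close to the first entry
print(lookup(query, cache, threshold=0.92))    # hits the password answer
print(lookup(query, cache, threshold=0.999))   # stricter threshold: a miss
```

The same query flips from hit to miss as the threshold rises, which is exactly the knob the list above describes.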
What's your duplicate rate?
Preto detects cacheable prompts automatically — exact matches and semantic duplicates — and shows your cache potential before you enable caching.
Find Your Duplicate Requests — Free. Preto detects cacheable prompts automatically. See yours in 5 minutes.
Five Failure Modes Nobody Warns You About
1. Context-dependent queries that look identical
"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.
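One way to scope keys is to fold the context into the hash itself. This is a sketch; the `user_id` and `session_id` fields are hypothetical names for whatever context your requests actually carry:

```python
import hashlib

def scoped_cache_key(prompt: str, user_id: str, session_id: str) -> str:
    """Cache key that includes the context the prompt depends on.

    Without user/session scoping, "What's the status?" from two
    different users would collide on the same cache entry.
    """
    raw = f"{user_id}|{session_id}|{prompt}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

key_a = scoped_cache_key("What's the status?", user_id="user_a", session_id="s1")
key_b = scoped_cache_key("What's the status?", user_id="user_b", session_id="s2")
assert key_a != key_b  # same prompt, different context, different entries
```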
2. Time-sensitive queries returning stale answers
"What's the latest pricing for GPT-5?" cached last week is wrong this week if OpenAI changed prices. TTL (time-to-live) helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.
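A per-category TTL table is one way to avoid the one-size-fits-all trap. The categories and durations below are illustrative starting points, not recommendations for any specific traffic:

```python
# Illustrative per-category TTLs in seconds; tune against your own traffic.
TTL_BY_CATEGORY = {
    "pricing": 60 * 60,          # 1 hour: time-sensitive answers
    "faq": 60 * 60 * 24 * 3,     # 3 days: stable answers
    "default": 60 * 60 * 6,      # 6 hours: middle ground
}

def ttl_for(category: str) -> int:
    """TTL for a query category, falling back to the default."""
    return TTL_BY_CATEGORY.get(category, TTL_BY_CATEGORY["default"])
```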
3. Embedding model drift
If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.
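A simple invalidation strategy is to namespace the vector index by embedding model version, so old entries are never searched after an upgrade. The version string below is a made-up example:

```python
# Bump this string whenever the embedding model changes.
EMBEDDING_MODEL_VERSION = "text-embedding-3-small/2024-01"  # illustrative

def versioned_namespace(base: str) -> str:
    """Vector-index namespace tied to the embedding model version.

    After an upgrade, lookups go to a fresh namespace; stale
    embeddings from the old model are simply never compared against.
    """
    return f"{base}:{EMBEDDING_MODEL_VERSION}"
```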
4. Cache poisoning from bad responses
If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.
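A sketch of what a pre-cache quality gate might look like. These heuristics are illustrative, not a hallucination detector; they only catch empty, truncated, or error-shaped responses:

```python
def safe_to_cache(response: str) -> bool:
    """Cheap sanity checks before writing a response into the cache."""
    if not response or len(response) < 20:
        return False                      # empty or suspiciously short
    if response.rstrip().endswith(("...", "…")):
        return False                      # likely truncated mid-answer
    error_markers = ("as an ai", "i cannot", "error:")
    if any(m in response.lower() for m in error_markers):
        return False                      # refusal or error text
    return True
```

Anything that fails the gate is still returned to the user; it just never enters the cache, so a bad answer isn't amplified to every similar future query.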
5. Streaming response caching complexity
Most LLM calls use streaming (stream: true). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.
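Simulated streaming from a cache hit can be as simple as a generator that chunks the stored text, optionally with a small delay if the UI expects token-by-token pacing. A minimal sketch:

```python
import time
from typing import Iterator

def stream_from_cache(cached: str, chunk_size: int = 24,
                      delay_s: float = 0.0) -> Iterator[str]:
    """Replay a cached response as a stream of chunks.

    Keeps the client's streaming contract intact; a nonzero delay_s
    (e.g. 0.01) mimics generation pacing if the UI depends on it.
    """
    for i in range(0, len(cached), chunk_size):
        if delay_s:
            time.sleep(delay_s)
        yield cached[i:i + chunk_size]

chunks = list(stream_from_cache("This answer came from the cache.", chunk_size=8))
assert "".join(chunks) == "This answer came from the cache."
```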
The Dollar Math: What Caching Actually Saves
For a team spending $5,000/month on LLM APIs, even a 20% hit rate avoids $1,000/month of API calls. The savings come from two places: avoided LLM calls (the obvious one) and reduced latency (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.
The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.
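The back-of-envelope math above, as a quick sketch. The $5,000 spend and hit rates are the figures used in this post; the $150/month infrastructure cost is a mid-range assumption from the $50-200 estimate:

```python
def monthly_savings(llm_spend: float, hit_rate: float,
                    infra_cost: float = 0.0) -> float:
    """Net monthly savings: avoided LLM spend minus cache infrastructure."""
    return llm_spend * hit_rate - infra_cost

# $5,000/month spend at the hit rates discussed in this post,
# with an assumed $150/month of cache infrastructure.
for rate in (0.10, 0.20, 0.45):
    net = monthly_savings(5000, rate, infra_cost=150)
    print(f"{rate:.0%} hit rate -> ${net:,.0f}/month net")
```

Even the pessimistic 10% case clears the infrastructure cost by a wide margin, which is the point of the "under 5% of the savings" claim above.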
The Right Architecture: Layer Exact and Semantic Caching
The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:
```python
def two_layer_completion(model, prompt, params):
    # Layer 1: Exact cache (sub-ms, zero false positives)
    exact_key = sha256(model + prompt + params)
    if exact_hit := cache.get(exact_key):
        return exact_hit

    # Layer 2: Semantic cache (2-5ms, threshold-gated)
    embedding = embed(prompt)
    semantic_hit = vector_db.search(embedding, threshold=0.92)
    if semantic_hit:
        return semantic_hit.response

    # Cache miss: call the LLM
    response = call_llm(prompt)

    # Write to both layers
    cache.set(exact_key, response, ttl=3600)
    vector_db.upsert(embedding, response, ttl=3600)
    return response
```
At Preto, we use this two-layer approach with SHA-256 hashing for the exact layer and configurable cosine similarity thresholds for the semantic layer. The average app we onboard discovers that 18% of requests are exact duplicates on day one — before semantic matching even kicks in.
Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.
Start With Measurement, Not Implementation
The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.
Measure first:
- Log all prompts for a week. Hash them. Count exact duplicates. That's your floor.
- Sample 1,000 requests. Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.
- Estimate savings. Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.
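The first step above can be sketched in a few lines. The toy prompt log is a stand-in for your own request logs:

```python
import hashlib
from collections import Counter

def exact_duplicate_rate(prompts: list[str]) -> float:
    """Fraction of requests that are exact repeats of an earlier prompt."""
    if not prompts:
        return 0.0
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(prompts)

# Toy log: 5 requests, 2 of which repeat an earlier prompt.
log = [
    "How do I reset my password?",
    "Summarize this ticket",
    "How do I reset my password?",
    "What's your refund policy?",
    "How do I reset my password?",
]
print(f"Exact-duplicate floor: {exact_duplicate_rate(log):.0%}")  # 40%
```

The ceiling measurement is the same idea with embeddings and a similarity threshold instead of hashes.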
If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.
Before you build, it's worth knowing what your current duplicate rate actually is. Preto's proxy detects exact and near-duplicate requests automatically — use the LLM cost calculator to estimate savings, or see how Preto compares to tools like Helicone and Langfuse that don't include caching detection.
Frequently Asked Questions
What is semantic caching for LLM APIs?
Semantic caching stores LLM responses keyed by a vector embedding of the prompt. When a new prompt's embedding is similar enough to a stored one (above a cosine similarity threshold), the cached response is returned instead of calling the LLM, catching rephrased duplicates that exact hashing misses.

What are realistic cache hit rates for LLM APIs?
Published production numbers range from 20-45%, depending on use case: FAQ-style support sees 40-60%, RAG pipelines 15-25%, and open-ended chat 10-20%. The widely quoted 95% figure refers to match accuracy, not hit frequency.

How much does semantic caching save on LLM costs?
Even a 20% hit rate saves $1,000/month on a $5,000/month LLM bill, and the infrastructure cost (embedding generation plus vector storage) typically runs under 5% of the savings.

What's the difference between exact and semantic caching?
Exact caching hashes the full prompt and returns a hit only on identical requests: zero false positives, but it misses rephrasings. Semantic caching matches by embedding similarity, catching rephrased duplicates at the cost of possible false positives and threshold tuning.

When does semantic caching fail?
The main failure modes are context-dependent queries that look identical, time-sensitive queries returning stale answers, embedding model drift invalidating stored vectors, cache poisoning from bad responses, and the added complexity of caching streaming responses.
Find out how much of your LLM traffic is cacheable.
Preto detects exact duplicates and semantically similar requests across your LLM traffic. See your cache potential — and projected savings — before you build anything.