You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.
The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic, and the reality looked very different.
This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.
1. Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to accuracy of cache matches, not frequency of hits.
2. Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.
3. Preto detects cacheable prompts automatically and shows your duplicate rate before you change anything.
Exact Caching vs. Semantic Caching: Two Different Problems
Before diving into architecture, the distinction matters because most teams should start with exact caching and only add semantic caching if exact caching alone doesn't cover enough.
Exact caching
Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.
```python
def cached_completion(model, prompt, temperature, max_tokens):
    cache_key = sha256(model + prompt + str(temperature) + str(max_tokens))
    cached = redis.get(cache_key)
    if cached:
        return cached  # <5ms, zero LLM cost
    response = call_llm(prompt)
    redis.set(cache_key, response, ttl=3600)
    return response
```
Pros: Zero false positives. Sub-millisecond lookup. Trivial to implement.
Cons: Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.
Exact caching alone catches more traffic than you'd expect. The average production app sends 15-30% identical requests — automated pipelines, retries, and users asking the same FAQ.
Semantic caching
Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.
```python
def semantically_cached_completion(prompt):
    embedding = embed_model.encode(prompt)  # ~2-5ms
    matches = vector_db.search(embedding, threshold=0.92)
    if matches:
        return matches[0].response  # <5ms total
    response = call_llm(prompt)
    vector_db.upsert(embedding, response, ttl=3600)
    return response
```
Pros: Catches semantically similar requests with different wording.
Cons: Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.
The 95% Myth: What the Numbers Actually Say
The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:
| Source | Hit Rate | Context | Type |
|---|---|---|---|
| Portkey (production) | ~20% | RAG use cases, 99% match accuracy | Vendor data |
| EdTech platform (production) | ~45% | Student Q&A — high repetition | Case study |
| GPT Semantic Cache (academic) | 61-69% | Controlled benchmark, curated dataset | Research paper |
| General production estimate | 30-40% | Mixed traffic across use cases | Industry average |
| Open-ended chat (production) | 10-20% | Unique conversations, low repetition | Observed range |
The 95% number, when you trace it back, almost always refers to match accuracy — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.
The honest range for production semantic caching: 20-45% hit rate, depending heavily on use case.
Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.
Hit Rates by Use Case: Where Caching Works (and Doesn't)
| Use Case | Expected Hit Rate | Why |
|---|---|---|
| FAQ / customer support | 40-60% | Users ask the same questions in slightly different ways. High repetition, bounded answer space. |
| Classification / labeling | 50-70% | Automated pipelines often send identical or near-identical inputs. Exact caching alone captures most of this. |
| Internal knowledge base Q&A | 30-45% | Employees ask similar questions about policies, processes, docs. Moderate repetition. |
| RAG with document retrieval | 15-25% | Context varies per query even if questions are similar. The retrieved documents change the prompt. |
| Open-ended chat | 10-20% | Conversations are unique. Multi-turn context makes each request different even if the user message is similar. |
| Code generation | 5-15% | High specificity per request. Users want varied outputs. Caching risks returning wrong code for a different context. |
The pattern: bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.
The Threshold Problem: 0.85 vs. 0.92 vs. 0.98
The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.
- Threshold 0.85 (aggressive): More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.
- Threshold 0.92 (balanced): The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries. This is where most teams start.
- Threshold 0.98 (conservative): Almost-exact matching. Very few false positives, but you're catching only the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.
There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.
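To make the trade-off concrete, here's a minimal sketch of a threshold-gated lookup using plain cosine similarity. The 3-d vectors and the `lookup` helper are illustrative stand-ins, not output from a real embedding model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_emb, cache, threshold=0.92):
    """Return the best cached response above the threshold, else None."""
    best_score, best_response = 0.0, None
    for cached_emb, response in cache:
        score = cosine_similarity(query_emb, cached_emb)
        if score >= threshold and score > best_score:
            best_score, best_response = score, response
    return best_response

# Toy cache: two stored embeddings (illustrative 3-d vectors).
cache = [
    ([0.9, 0.1, 0.0], "To reset your password, go to Settings."),
    ([0.0, 0.2, 0.9], "To change your email, contact support."),
]

query = [0.88, 0.15, 0.05]   # close to the first entry
print(lookup(query, cache, threshold=0.92))    # hits the password answer
print(lookup(query, cache, threshold=0.999))   # stricter threshold: a miss
```

The same query flips from hit to miss as the threshold rises, which is exactly the knob the list above describes.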
What's your duplicate rate?
Preto detects cacheable prompts automatically — exact matches and semantic duplicates — and shows your cache potential before you enable caching.
Find Your Duplicate Requests — Free. Preto detects cacheable prompts automatically. See yours in 5 minutes.
Five Failure Modes Nobody Warns You About
1. Context-dependent queries that look identical
"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.
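One way to scope keys is to fold the context into the hash itself. This is a sketch; the `user_id` and `session_id` fields are hypothetical names for whatever context your requests actually carry:

```python
import hashlib

def scoped_cache_key(prompt: str, user_id: str, session_id: str) -> str:
    """Cache key that includes the context the prompt depends on.

    Without user/session scoping, "What's the status?" from two
    different users would collide on the same cache entry.
    """
    raw = f"{user_id}|{session_id}|{prompt}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

key_a = scoped_cache_key("What's the status?", user_id="user_a", session_id="s1")
key_b = scoped_cache_key("What's the status?", user_id="user_b", session_id="s2")
assert key_a != key_b  # same prompt, different context, different entries
```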
2. Time-sensitive queries returning stale answers
"What's the latest pricing for GPT-5?" cached last week is wrong this week if OpenAI changed prices. TTL (time-to-live) helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.
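A per-category TTL table is one way to avoid the one-size-fits-all trap. The categories and durations below are illustrative starting points, not recommendations for any specific traffic:

```python
# Illustrative per-category TTLs in seconds; tune against your own traffic.
TTL_BY_CATEGORY = {
    "pricing": 60 * 60,          # 1 hour: time-sensitive answers
    "faq": 60 * 60 * 24 * 3,     # 3 days: stable answers
    "default": 60 * 60 * 6,      # 6 hours: middle ground
}

def ttl_for(category: str) -> int:
    """TTL for a query category, falling back to the default."""
    return TTL_BY_CATEGORY.get(category, TTL_BY_CATEGORY["default"])
```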
3. Embedding model drift
If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.
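A simple invalidation strategy is to namespace the vector index by embedding model version, so old entries are never searched after an upgrade. The version string below is a made-up example:

```python
# Bump this string whenever the embedding model changes.
EMBEDDING_MODEL_VERSION = "text-embedding-3-small/2024-01"  # illustrative

def versioned_namespace(base: str) -> str:
    """Vector-index namespace tied to the embedding model version.

    After an upgrade, lookups go to a fresh namespace; stale
    embeddings from the old model are simply never compared against.
    """
    return f"{base}:{EMBEDDING_MODEL_VERSION}"
```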
4. Cache poisoning from bad responses
If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.
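A sketch of what a pre-cache quality gate might look like. These heuristics are illustrative, not a hallucination detector; they only catch empty, truncated, or error-shaped responses:

```python
def safe_to_cache(response: str) -> bool:
    """Cheap sanity checks before writing a response into the cache."""
    if not response or len(response) < 20:
        return False                      # empty or suspiciously short
    if response.rstrip().endswith(("...", "…")):
        return False                      # likely truncated mid-answer
    error_markers = ("as an ai", "i cannot", "error:")
    if any(m in response.lower() for m in error_markers):
        return False                      # refusal or error text
    return True
```

Anything that fails the gate is still returned to the user; it just never enters the cache, so a bad answer isn't amplified to every similar future query.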
5. Streaming response caching complexity
Most LLM calls use streaming (stream: true). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.
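Simulated streaming from a cache hit can be as simple as a generator that chunks the stored text, optionally with a small delay if the UI expects token-by-token pacing. A minimal sketch:

```python
import time
from typing import Iterator

def stream_from_cache(cached: str, chunk_size: int = 24,
                      delay_s: float = 0.0) -> Iterator[str]:
    """Replay a cached response as a stream of chunks.

    Keeps the client's streaming contract intact; a nonzero delay_s
    (e.g. 0.01) mimics generation pacing if the UI depends on it.
    """
    for i in range(0, len(cached), chunk_size):
        if delay_s:
            time.sleep(delay_s)
        yield cached[i:i + chunk_size]

chunks = list(stream_from_cache("This answer came from the cache.", chunk_size=8))
assert "".join(chunks) == "This answer came from the cache."
```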
The Dollar Math: What Caching Actually Saves
For a team spending $5,000/month on LLM APIs, even a 20% hit rate avoids $1,000/month of API calls. The savings come from two places: avoided LLM calls (the obvious one) and reduced latency (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.
The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.
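The back-of-envelope math above, as a quick sketch. The $5,000 spend and hit rates are the figures used in this post; the $150/month infrastructure cost is a mid-range assumption from the $50-200 estimate:

```python
def monthly_savings(llm_spend: float, hit_rate: float,
                    infra_cost: float = 0.0) -> float:
    """Net monthly savings: avoided LLM spend minus cache infrastructure."""
    return llm_spend * hit_rate - infra_cost

# $5,000/month spend at the hit rates discussed in this post,
# with an assumed $150/month of cache infrastructure.
for rate in (0.10, 0.20, 0.45):
    net = monthly_savings(5000, rate, infra_cost=150)
    print(f"{rate:.0%} hit rate -> ${net:,.0f}/month net")
```

Even the pessimistic 10% case clears the infrastructure cost by a wide margin, which is the point of the "under 5% of the savings" claim above.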
The Right Architecture: Layer Exact and Semantic Caching
The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:
```python
def two_layer_completion(model, prompt, params):
    # Layer 1: Exact cache (sub-ms, zero false positives)
    exact_key = sha256(model + prompt + params)
    if exact_hit := cache.get(exact_key):
        return exact_hit

    # Layer 2: Semantic cache (2-5ms, threshold-gated)
    embedding = embed(prompt)
    semantic_hit = vector_db.search(embedding, threshold=0.92)
    if semantic_hit:
        return semantic_hit.response

    # Cache miss: call the LLM
    response = call_llm(prompt)

    # Write to both layers
    cache.set(exact_key, response, ttl=3600)
    vector_db.upsert(embedding, response, ttl=3600)
    return response
```
At Preto, we use this two-layer approach with SHA-256 hashing for the exact layer and configurable cosine similarity thresholds for the semantic layer. The average app we onboard discovers that 18% of requests are exact duplicates on day one — before semantic matching even kicks in.
Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.
Start With Measurement, Not Implementation
The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.
Measure first:
- Log all prompts for a week. Hash them. Count exact duplicates. That's your floor.
- Sample 1,000 requests. Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.
- Estimate savings. Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.
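The first step above can be sketched in a few lines. The toy prompt log is a stand-in for your own request logs:

```python
import hashlib
from collections import Counter

def exact_duplicate_rate(prompts: list[str]) -> float:
    """Fraction of requests that are exact repeats of an earlier prompt."""
    if not prompts:
        return 0.0
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(prompts)

# Toy log: 5 requests, 2 of which repeat an earlier prompt.
log = [
    "How do I reset my password?",
    "Summarize this ticket",
    "How do I reset my password?",
    "What's your refund policy?",
    "How do I reset my password?",
]
print(f"Exact-duplicate floor: {exact_duplicate_rate(log):.0%}")  # 40%
```

The ceiling measurement is the same idea with embeddings and a similarity threshold instead of hashes.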
If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.
Before you build, it's worth knowing what your current duplicate rate actually is. Preto's proxy detects exact and near-duplicate requests automatically — use the LLM cost calculator to estimate savings, or see how Preto compares to tools like Helicone and Langfuse that don't include caching detection.
Frequently Asked Questions
What is semantic caching for LLM APIs?
Semantic caching stores LLM responses keyed by a vector embedding of the prompt. When a new prompt's embedding is similar enough to a stored one (above a cosine similarity threshold), the cached response is returned instead of calling the LLM, catching rephrased duplicates that exact hashing misses.

What are realistic cache hit rates for LLM APIs?
Published production numbers range from 20-45%, depending on use case: FAQ-style support sees 40-60%, RAG pipelines 15-25%, and open-ended chat 10-20%. The widely quoted 95% figure refers to match accuracy, not hit frequency.

How much does semantic caching save on LLM costs?
Even a 20% hit rate saves $1,000/month on a $5,000/month LLM bill, and the infrastructure cost (embedding generation plus vector storage) typically runs under 5% of the savings.

What's the difference between exact and semantic caching?
Exact caching hashes the full prompt and returns a hit only on identical requests: zero false positives, but it misses rephrasings. Semantic caching matches by embedding similarity, catching rephrased duplicates at the cost of possible false positives and threshold tuning.

When does semantic caching fail?
The main failure modes are context-dependent queries that look identical, time-sensitive queries returning stale answers, embedding model drift invalidating stored vectors, cache poisoning from bad responses, and the added complexity of caching streaming responses.
Find out how much of your LLM traffic is cacheable.
Preto detects exact duplicates and semantically similar requests across your LLM traffic. See your cache potential — and projected savings — before you build anything.