You know your LLM bill is higher than it should be. You've looked at the OpenAI dashboard. You can see total spend, total tokens, cost by model. What you can't see: how many of those requests are asking the exact same question that was answered five minutes ago.
Prompt hashing is the cheapest optimization available — no model changes, no prompt rewrites, under 1ms added to the request path, no false positives. You hash the request, check the cache, skip the LLM entirely on a hit — and don't pay for that call. Here's the right way to do it, and what the actual duplicate rates look like in production.
1. The average production app sends 15-30% duplicate LLM requests. SHA-256 exact hashing catches all of them with zero false positives.
2. The hash key must cover model + full messages array + normalized generation params — not just the prompt text. One wrong field and your cache corrupts.
3. When teams first connect Preto, the average exact duplicate rate discovered on day one is 18% — pure recoverable waste, visible immediately.
Why Duplicate Requests Are More Common Than You Think
The first reaction: "our requests aren't duplicates — every user's query is different." That's true for open-ended chat. It's not true for the majority of production LLM use cases.
Support and FAQ bots. "What are your business hours?" "How do I reset my password?" "What's your return policy?" These questions arrive hundreds of times per day, word for word. Every one hits the LLM fresh.
Internal tooling. Weekly report generation, nightly summaries, scheduled content pipelines. If a cron job calls the LLM with the same prompt template and the same data, it's a duplicate. If a deploy fails and retries, it's a duplicate. If two engineers run the same analysis script before syncing, it's a duplicate.
Application-layer cache misses. The most common root cause: someone added LLM calls to a hot path without adding a cache layer. Every page load, every API call, every webhook hits the LLM even when nothing has changed. The cache was on the roadmap. It never shipped.
SHA-256 hashing at the proxy layer catches all of these before they reach the provider — even if the application has no caching at all. You stop paying for duplicates without touching a line of app code.
What to Hash (Most Implementations Get This Wrong)
The naive approach: hash the last user message. Fast to implement, wrong in production.
Consider: "Summarize this document" sent to GPT-4o with temperature 0.2 should cache differently from the same string sent to GPT-4o-mini with temperature 0.8. Same prompt text, different request, different response.
The hash key must include everything that deterministically affects the output:
- Model name — gpt-4.1 and gpt-4.1-nano are different requests
- Full messages array — role and content for every message, including system prompt
- Temperature — technically, any temperature > 0 makes responses non-deterministic. In practice, temperatures below 0.3 are stable enough to cache safely.
- max_tokens, top_p, frequency_penalty, presence_penalty — if you set them, include them
Exclude anything that changes per-request without affecting the response: user field, request IDs, stream flag, stream_options.
Want to see your exact duplicate rate today?
Preto shows cache hit rates, duplicate counts, and recoverable waste per endpoint — free up to 10K requests.
See What Your LLM Spend Looks Like
Free forever for up to 10K requests. No credit card.
The Implementation in Go
The critical requirement is a canonical serialization — the same logical request must always produce the same byte sequence before hashing. JSON marshaling is not canonical by default (map key order is undefined). We fix this explicitly:
import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "math"
    "strings"
)

type CacheKey struct {
    Model    string    `json:"model"`
    Messages []Message `json:"messages"`
    Params   KeyParams `json:"params"`
}

type Message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

// Only include params that affect output. No omitempty here: an explicit
// zero (e.g. temperature 0) is a different request from the provider
// default, so zero values must stay in the key.
type KeyParams struct {
    Temperature      float64 `json:"temperature"`
    MaxTokens        int     `json:"max_tokens"`
    TopP             float64 `json:"top_p"`
    FrequencyPenalty float64 `json:"frequency_penalty"`
    PresencePenalty  float64 `json:"presence_penalty"`
}

func ComputePromptHash(req *openai.ChatCompletionRequest) string {
    key := CacheKey{
        Model:    req.Model,
        Messages: normalizeMessages(req.Messages),
        Params: KeyParams{
            Temperature:      math.Round(req.Temperature*1000) / 1000, // 3 decimal places
            MaxTokens:        req.MaxTokens,
            TopP:             math.Round(req.TopP*1000) / 1000,
            FrequencyPenalty: math.Round(req.FrequencyPenalty*1000) / 1000,
            PresencePenalty:  math.Round(req.PresencePenalty*1000) / 1000,
        },
    }
    // encoding/json marshals struct fields in declaration order — deterministic
    b, _ := json.Marshal(key)
    h := sha256.Sum256(b)
    return hex.EncodeToString(h[:])
}

func normalizeMessages(msgs []openai.ChatCompletionMessage) []Message {
    out := make([]Message, len(msgs))
    for i, m := range msgs {
        out[i] = Message{
            Role:    strings.TrimSpace(m.Role),
            Content: strings.TrimSpace(m.Content),
        }
    }
    return out
}
The float normalization (math.Round(x*1000)/1000) matters. Different SDK versions or serialization paths can produce 0.7 vs 0.6999999999999998 for the same value. Without normalization, these produce different hashes and your cache never hits.
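To make that failure mode concrete, here's a minimal, runnable illustration of the rounding step. round3 is a stand-in helper mirroring the normalization above:

```go
package main

import (
	"fmt"
	"math"
)

// round3 rounds a float to 3 decimal places, matching the
// normalization used in the hash key.
func round3(x float64) float64 {
	return math.Round(x*1000) / 1000
}

func main() {
	a := 0.7
	b := 0.6999999999999998 // same logical value after a lossy round-trip

	fmt.Println(a == b)                 // false — distinct bit patterns, distinct hashes
	fmt.Println(round3(a) == round3(b)) // true — identical after normalization
}
```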
Cache lookup and write in Redis:
import (
    "context"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
)

const cacheTTL = 4 * time.Hour

type Cache struct {
    redis *redis.Client
}

func (c *Cache) Get(hash string) (*CachedResponse, bool) {
    val, err := c.redis.Get(context.Background(), "ph:"+hash).Bytes()
    if err != nil {
        return nil, false // miss (or Redis error — fail open to the provider)
    }
    var resp CachedResponse
    if err := json.Unmarshal(val, &resp); err != nil {
        return nil, false
    }
    return &resp, true
}

func (c *Cache) Set(hash string, resp *CachedResponse) {
    b, _ := json.Marshal(resp)
    c.redis.Set(context.Background(), "ph:"+hash, b, cacheTTL)
}
The ph: prefix namespaces prompt hash keys in Redis, keeping them separate from other application data in the same instance.
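Putting the pieces together, the request path looks like this. This is a sketch, not the production proxy: an in-memory map stands in for Redis, fakeProvider stands in for the upstream LLM call, and the hash key is reduced to the raw prompt string for brevity.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memCache is an in-memory stand-in for the Redis cache above.
type memCache struct{ store map[string]string }

func (c *memCache) get(hash string) (string, bool) { v, ok := c.store[hash]; return v, ok }
func (c *memCache) set(hash, resp string)          { c.store[hash] = resp }

var providerCalls int

// fakeProvider stands in for the billed upstream LLM call.
func fakeProvider(prompt string) string {
	providerCalls++
	return "answer to: " + prompt
}

func handle(c *memCache, prompt string) string {
	h := sha256.Sum256([]byte(prompt))
	key := hex.EncodeToString(h[:])
	if resp, ok := c.get(key); ok {
		return resp // hit: provider skipped, nothing billed
	}
	resp := fakeProvider(prompt)
	c.set(key, resp)
	return resp
}

func main() {
	c := &memCache{store: map[string]string{}}
	handle(c, "What are your business hours?")
	handle(c, "What are your business hours?") // exact duplicate
	fmt.Println(providerCalls)                 // 1 — the second request never reached the provider
}
```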
Real Duplicate Rates by Use Case
Anonymized rates from apps running on Preto, across 90+ days of production traffic. The range reflects variance in app type — apps with narrower use cases (one feature type per bot) sit at the high end.
The weighted average across all use cases: 18% of requests are exact cache hits on day one. That's not an edge case — it's the first thing visible when a team connects a proxy with prompt hashing. Some teams see 40%.
What to Do With the Hash Beyond Caching
Caching isn't the only use for the hash.
Duplicate rate reporting. Track what percentage of requests per endpoint are cache hits vs. misses. This surfaces which parts of your application are sending duplicate traffic so you can fix the root cause — usually a missing application-layer cache.
Cost projection. If a hash has been seen 200 times this month, and the average cost per request is $0.008 (roughly GPT-4o-mini at 500 tokens), you have $1.60 in recoverable waste for that single prompt. Multiply across all duplicate hashes and you have a projected monthly saving from caching alone — a concrete number to put in a cost review.
Abuse detection. A single hash appearing 10,000 times in an hour from the same user is a different pattern from organic duplicates. Rate limit by hash to catch prompt injection loops and runaway retry logic before they hit your bill.
At Preto, the prompt hash is computed at the proxy layer for every request. The hash appears in the log entry, which means the ClickHouse dashboard can show duplicate counts, cache hit rates, and recoverable waste per feature and per endpoint — without any changes to application code. See how we store and query those logs.
Where Hashing Falls Short
SHA-256 catches exact duplicates. It misses semantic duplicates.
"What are your business hours?" and "When do you open?" hash to completely different values. They're semantically identical — the same question, different phrasing — but exact hashing won't help you.
Semantic caching handles this by generating an embedding of the request, querying a vector store for similar past requests, and returning a cached response if similarity exceeds a threshold. It can catch 2–3x more waste than exact hashing. It also requires an embedding model, a vector store, latency budget for the lookup, and careful threshold tuning to avoid false positives.
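The decision step itself is just a similarity comparison. Below is a toy sketch of that step only: the vectors are hand-made stand-ins for real embedding-model output, and the 0.9 threshold is illustrative, not a recommendation.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	const threshold = 0.9 // tuned carefully in practice to avoid false positives

	hours := []float64{0.9, 0.1, 0.2}   // "What are your business hours?"
	open := []float64{0.85, 0.15, 0.25} // "When do you open?"
	refund := []float64{0.1, 0.9, 0.1}  // "What's your return policy?"

	fmt.Println(cosine(hours, open) > threshold)   // similar phrasing → serve cached answer
	fmt.Println(cosine(hours, refund) > threshold) // different question → miss, call the LLM
}
```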
The right order: implement exact hashing first. It's zero-risk, one afternoon of work, and delivers immediate savings. Add semantic caching once exact hashing is running and you've measured the remaining duplicate rate.
See your exact duplicate rate — today, not after a refactor.
Preto computes prompt hashes at the proxy layer and surfaces duplicate rates, cache hit percentages, and recoverable waste per feature. One URL change, no code refactor.
See What Your LLM Spend Looks Like
Cache TTLs are configurable per endpoint.
Free forever for up to 10K requests. No credit card.