You know your LLM bill is higher than it should be. You've looked at the OpenAI dashboard. You can see total spend, total tokens, cost by model. What you can't see: how many of those requests are asking the exact same question that was answered five minutes ago.
Prompt hashing is the cheapest optimization available — no model changes, no prompt rewrites, under 1ms added to the request path, no false positives. You hash the request, check the cache, skip the LLM entirely on a hit — and don't pay for that call. Here's the right way to do it, and what the actual duplicate rates look like in production.
1. The average production app sends 15-30% duplicate LLM requests. SHA-256 exact hashing catches all of them with zero false positives.
2. The hash key must cover model + full messages array + normalized generation params — not just the prompt text. One wrong field and your cache corrupts.
3. When teams first connect Preto, the average exact duplicate rate discovered on day one is 18% — pure recoverable waste, visible immediately.
Why Duplicate Requests Are More Common Than You Think
The first reaction: "our requests aren't duplicates — every user's query is different." That's true for open-ended chat. It's not true for the majority of production LLM use cases.
Support and FAQ bots. "What are your business hours?" "How do I reset my password?" "What's your return policy?" These questions arrive hundreds of times per day, word for word. Every one hits the LLM fresh.
Internal tooling. Weekly report generation, nightly summaries, scheduled content pipelines. If a cron job calls the LLM with the same prompt template and the same data, it's a duplicate. If a deploy fails and retries, it's a duplicate. If two engineers run the same analysis script before syncing, it's a duplicate.
Application-layer cache misses. The most common root cause: someone added LLM calls to a hot path without adding a cache layer. Every page load, every API call, every webhook hits the LLM even when nothing has changed. The cache was on the roadmap. It never shipped.
SHA-256 hashing at the proxy layer catches all of these before they reach the provider — even if the application has no caching at all. You stop paying for duplicates without touching a line of app code.
What to Hash (Most Implementations Get This Wrong)
The naive approach: hash the last user message. Fast to implement, wrong in production.
Consider: "Summarize this document" sent to GPT-4o with temperature 0.2 should cache differently from the same string sent to GPT-4o-mini with temperature 0.8. Same prompt text, different request, different response.
The hash key must include everything that deterministically affects the output:
- Model name — gpt-4.1 and gpt-4.1-nano are different requests
- Full messages array — role and content for every message, including system prompt
- Temperature — technically, any temperature > 0 makes responses non-deterministic. In practice, temperatures below 0.3 are stable enough to cache safely.
- max_tokens, top_p, frequency_penalty, presence_penalty — if you set them, include them
Exclude anything that changes per-request without affecting the response: user field, request IDs, stream flag, stream_options.
Want to see your exact duplicate rate today?
Preto shows cache hit rates, duplicate counts, and recoverable waste per endpoint — free up to 10K requests.
See What Your LLM Spend Looks Like
Free forever for up to 10K requests. No credit card.
The Implementation in Go
The critical requirement is a canonical serialization — the same logical request must always produce the same byte sequence before hashing. JSON marshaling is not canonical by default (map key order is undefined). We fix this explicitly:
import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "math"
    "strings"
)

type CacheKey struct {
    Model    string    `json:"model"`
    Messages []Message `json:"messages"`
    Params   KeyParams `json:"params"`
}

type Message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

// Only include params that affect output. No omitempty here: an explicit
// zero (e.g. temperature 0) is a different request from the provider
// default, so zero values must stay in the key.
type KeyParams struct {
    Temperature      float64 `json:"temperature"`
    MaxTokens        int     `json:"max_tokens"`
    TopP             float64 `json:"top_p"`
    FrequencyPenalty float64 `json:"frequency_penalty"`
    PresencePenalty  float64 `json:"presence_penalty"`
}

func ComputePromptHash(req *openai.ChatCompletionRequest) string {
    key := CacheKey{
        Model:    req.Model,
        Messages: normalizeMessages(req.Messages),
        Params: KeyParams{
            Temperature:      math.Round(req.Temperature*1000) / 1000, // 3 decimal places
            MaxTokens:        req.MaxTokens,
            TopP:             math.Round(req.TopP*1000) / 1000,
            FrequencyPenalty: math.Round(req.FrequencyPenalty*1000) / 1000,
            PresencePenalty:  math.Round(req.PresencePenalty*1000) / 1000,
        },
    }
    // encoding/json marshals struct fields in declaration order — deterministic
    b, _ := json.Marshal(key)
    h := sha256.Sum256(b)
    return hex.EncodeToString(h[:])
}

func normalizeMessages(msgs []openai.ChatCompletionMessage) []Message {
    out := make([]Message, len(msgs))
    for i, m := range msgs {
        out[i] = Message{
            Role:    strings.TrimSpace(m.Role),
            Content: strings.TrimSpace(m.Content),
        }
    }
    return out
}
The float normalization (math.Round(x*1000)/1000) matters. Different SDK versions or serialization paths can produce 0.7 vs 0.6999999999999998 for the same value. Without normalization, these produce different hashes and your cache never hits.
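To make that failure mode concrete, here's a minimal, runnable illustration of the rounding step. round3 is a stand-in helper mirroring the normalization above:

```go
package main

import (
	"fmt"
	"math"
)

// round3 rounds a float to 3 decimal places, matching the
// normalization used in the hash key.
func round3(x float64) float64 {
	return math.Round(x*1000) / 1000
}

func main() {
	a := 0.7
	b := 0.6999999999999998 // same logical value after a lossy round-trip

	fmt.Println(a == b)                 // false — distinct bit patterns, distinct hashes
	fmt.Println(round3(a) == round3(b)) // true — identical after normalization
}
```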
Cache lookup and write in Redis:
import (
    "context"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
)

const cacheTTL = 4 * time.Hour

type Cache struct {
    redis *redis.Client
}

func (c *Cache) Get(hash string) (*CachedResponse, bool) {
    val, err := c.redis.Get(context.Background(), "ph:"+hash).Bytes()
    if err != nil {
        return nil, false // miss (or Redis error — fail open to the provider)
    }
    var resp CachedResponse
    if err := json.Unmarshal(val, &resp); err != nil {
        return nil, false
    }
    return &resp, true
}

func (c *Cache) Set(hash string, resp *CachedResponse) {
    b, _ := json.Marshal(resp)
    c.redis.Set(context.Background(), "ph:"+hash, b, cacheTTL)
}
The ph: prefix namespaces prompt hash keys in Redis, keeping them separate from other application data in the same instance.
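Putting the pieces together, the request path looks like this. This is a sketch, not the production proxy: an in-memory map stands in for Redis, fakeProvider stands in for the upstream LLM call, and the hash key is reduced to the raw prompt string for brevity.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memCache is an in-memory stand-in for the Redis cache above.
type memCache struct{ store map[string]string }

func (c *memCache) get(hash string) (string, bool) { v, ok := c.store[hash]; return v, ok }
func (c *memCache) set(hash, resp string)          { c.store[hash] = resp }

var providerCalls int

// fakeProvider stands in for the billed upstream LLM call.
func fakeProvider(prompt string) string {
	providerCalls++
	return "answer to: " + prompt
}

func handle(c *memCache, prompt string) string {
	h := sha256.Sum256([]byte(prompt))
	key := hex.EncodeToString(h[:])
	if resp, ok := c.get(key); ok {
		return resp // hit: provider skipped, nothing billed
	}
	resp := fakeProvider(prompt)
	c.set(key, resp)
	return resp
}

func main() {
	c := &memCache{store: map[string]string{}}
	handle(c, "What are your business hours?")
	handle(c, "What are your business hours?") // exact duplicate
	fmt.Println(providerCalls)                 // 1 — the second request never reached the provider
}
```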
Real Duplicate Rates by Use Case
Anonymized rates from apps running on Preto, across 90+ days of production traffic. The range reflects variance in app type — apps with narrower use cases (one feature type per bot) sit at the high end.
The weighted average across all use cases: 18% of requests are exact cache hits on day one. That's not an edge case — it's the first thing visible when a team connects a proxy with prompt hashing. Some teams see 40%.
What to Do With the Hash Beyond Caching
Caching isn't the only use for the hash.
Duplicate rate reporting. Track what percentage of requests per endpoint are cache hits vs. misses. This surfaces which parts of your application are sending duplicate traffic so you can fix the root cause — usually a missing application-layer cache.
Cost projection. If a hash has been seen 200 times this month, and the average cost per request is $0.008 (roughly GPT-4o-mini at 500 tokens), you have $1.60 in recoverable waste for that single prompt. Multiply across all duplicate hashes and you have a projected monthly saving from caching alone — a concrete number to put in a cost review.
Abuse detection. A single hash appearing 10,000 times in an hour from the same user is a different pattern from organic duplicates. Rate limit by hash to catch prompt injection loops and runaway retry logic before they hit your bill.
At Preto, the prompt hash is computed at the proxy layer for every request. The hash appears in the log entry, which means the ClickHouse dashboard can show duplicate counts, cache hit rates, and recoverable waste per feature and per endpoint — without any changes to application code. See how we store and query those logs.
Where Hashing Falls Short
SHA-256 catches exact duplicates. It misses semantic duplicates.
"What are your business hours?" and "When do you open?" hash to completely different values. They're semantically identical — the same question, different phrasing — but exact hashing won't help you.
Semantic caching handles this by generating an embedding of the request, querying a vector store for similar past requests, and returning a cached response if similarity exceeds a threshold. It can catch 2–3x more waste than exact hashing. It also requires an embedding model, a vector store, latency budget for the lookup, and careful threshold tuning to avoid false positives.
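The decision step itself is just a similarity comparison. Below is a toy sketch of that step only: the vectors are hand-made stand-ins for real embedding-model output, and the 0.9 threshold is illustrative, not a recommendation.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	const threshold = 0.9 // tuned carefully in practice to avoid false positives

	hours := []float64{0.9, 0.1, 0.2}   // "What are your business hours?"
	open := []float64{0.85, 0.15, 0.25} // "When do you open?"
	refund := []float64{0.1, 0.9, 0.1}  // "What's your return policy?"

	fmt.Println(cosine(hours, open) > threshold)   // similar phrasing → serve cached answer
	fmt.Println(cosine(hours, refund) > threshold) // different question → miss, call the LLM
}
```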
The right order: implement exact hashing first. It's zero-risk, one afternoon of work, and delivers immediate savings. Add semantic caching once exact hashing is running and you've measured the remaining duplicate rate.
See your exact duplicate rate — today, not after a refactor.
Preto computes prompt hashes at the proxy layer and surfaces duplicate rates, cache hit percentages, and recoverable waste per feature. One URL change, no code refactor.
See What Your LLM Spend Looks Like
Cache TTLs are configurable per endpoint.
Free forever for up to 10K requests. No credit card.