Legal tech teams scope their LLM costs carefully. They count the contracts to review, estimate average document length, multiply by token price, add a buffer. Then contract analysis goes live and the bill is five to seven times the estimate.

The surprise isn't volume — it's token count. Legal documents are an order of magnitude longer than the inputs most LLM cost estimates assume. An NDA is 3,000 tokens. A commercial lease is 20,000. An M&A agreement can exceed 100,000. Teams that calibrate their estimates on short documents and then process complex agreements face per-document costs that bear no resemblance to the budget.

TL;DR

1. Document length — not document count — is the primary cost driver in legal tech LLM stacks. Contract analysis budgets built on short-document assumptions run 5–7x over when complex agreements enter the pipeline.
2. Attorney-client privilege does not prevent caching — it constrains what you cache. Analytical outputs (clause types, risk signals) are cacheable; document content is not.
3. Cascaded document review — cheap model for relevance screening, expensive model for substantive analysis — cuts eDiscovery LLM costs 50–70% with no accuracy loss on the overall review.

Legal Tech AI Use Cases and What They Actually Cost

Monthly estimates for a mid-size legal tech SaaS — approximately $5M ARR, 50 law firm clients — running production AI workloads across active matters:

Monthly LLM API spend by use case — 50-client legal tech SaaS, unoptimized:

- eDiscovery / document review: $5,000 – $18,000/mo
- Contract analysis: $3,000 – $12,000/mo
- Legal research: $2,500 – $9,000/mo
- Contract drafting: $1,500 – $6,000/mo
- Due diligence review: $1,000 – $4,000/mo
- Client intake / matter management: $600 – $2,500/mo

Client intake sits at the bottom despite being client-facing. The reason: intake questions follow predictable patterns — the same matter type, jurisdiction, and practice area questions asked by many clients — making it the most cacheable legal workflow. eDiscovery sits at the top because document volume is tied to litigation activity, which is unpredictable and can spike 10x overnight with a new matter.

For the full picture of how these numbers fit into broader AI cost growth, see why AI costs compound even when LLM prices fall.

Why Contract Analysis Costs 5–7x the Estimate

The budgeting error follows a consistent pattern. The legal team demos the contract analysis feature on NDAs — 3,000–5,000 tokens each, the most common contract type. The cost looks reasonable: roughly $0.01 per document at gpt-4.1 pricing. A volume estimate is built. The feature ships.

Then real production traffic arrives. The NDA is not the most common contract by revenue — it's the most common by count. The work that actually drives billing is commercial agreements: MSAs, software license agreements, commercial leases, supplier contracts. These run 15,000–40,000 tokens each. A single M&A purchase agreement exceeds 100,000 tokens.

The per-document math shift: an NDA at 4,000 tokens × $2.00/1M = $0.008/document. A commercial lease at 25,000 tokens = $0.05/document. At 5,000 contracts/month, a budget built on NDA-length documents comes to $40/month. Reality with a typical commercial contract mix averaging ~28,000 tokens: roughly $280/month. The contract count estimate was accurate. The token count per contract was off by 7x.
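A minimal sketch of that per-document arithmetic, assuming input-token cost only at the gpt-4.1 rate above; the 5,000-contract volume and the 28,000-token commercial mix average are illustrative:

```python
# Rough per-document and monthly LLM cost estimator.
# Assumes input-token cost only, at the $2.00/1M gpt-4.1 input rate.
INPUT_PRICE_PER_TOKEN = 2.00 / 1_000_000

def doc_cost(tokens: int) -> float:
    """Input-token cost for a single document."""
    return tokens * INPUT_PRICE_PER_TOKEN

def monthly_cost(docs_per_month: int, avg_tokens: int) -> float:
    """Monthly spend at a given volume and average document length."""
    return docs_per_month * doc_cost(avg_tokens)

nda = doc_cost(4_000)      # $0.008 per NDA
lease = doc_cost(25_000)   # $0.05 per commercial lease

# Budget built on NDA-length documents vs. a commercial mix
# averaging ~28,000 tokens (illustrative):
budget = monthly_cost(5_000, 4_000)    # $40/month
reality = monthly_cost(5_000, 28_000)  # $280/month, a 7x gap
```

Same document count in both lines; only the tokens-per-document assumption changes.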

LLM contract analysis earns its cost — it finds issues that manual review misses and speeds review by 60–80%. The problem is discovering the real cost after the feature is live, not before.

Want to see what contract analysis is actually costing per document?

Preto attributes LLM cost by feature and endpoint — so you see the per-document economics broken down by contract type before they surprise you.

See Your Cost Breakdown by Feature

One URL change. See which features cost the most. Free to start.

The Privilege Caching Problem (and the Workaround)

Most legal tech engineering teams conclude early that attorney-client privilege prevents caching. The reasoning: you can't store privileged communications in a cache without risking inadvertent disclosure. So caching gets ruled out — and the team overpays on every request.

The constraint is real, but it's narrower than it appears. Privilege protects client communications and attorney work product. It does not apply to the analytical output your system generated about a document type.

For prompt caching: Regulatory context, statute summaries, clause classification schemas, and standard form definitions repeat across every document in a matter. Cache these as system prompt prefixes. A 2,000-token legal context block served at gpt-4.1's cached input rate ($0.50/1M versus $2.00/1M uncached) saves roughly $9,000/month at 100,000 requests per day.

For document-level results: Cache the extraction result — clause types identified, risk signals found, jurisdiction confirmed — against a hash of the document ID and version. The cache stores what your system learned about the document structure, not the document content itself. No privilege exposure.

For matter closeout: Tag cached entries with matter IDs. When a matter closes or a client departs, invalidate those entries. This satisfies both privilege management and any applicable data retention policies — and it's a few extra lines in the cache layer, not a reason to avoid caching entirely.

Three Optimizations That Work in Legal Environments

1. Cascaded document review for eDiscovery

The highest-ROI change in any high-volume legal LLM workflow. Run every document through a cheap classifier first — relevance screening, basic responsiveness determination, obvious non-responsive filtering:

All documents enter a first-pass screen on gpt-4.1-nano ($0.10/1M):

- Non-responsive / irrelevant (~65% of documents): the cheap model's determination is final.
- Responsive / potentially privileged (~35% of documents): escalated to gpt-4.1 ($2.00/1M) for substantive analysis.

In a typical eDiscovery matter, 60–75% of documents are non-responsive and need only a relevance determination. The cheap model handles those entirely. Only documents that pass first-pass screening go to the capable model for privilege review, key passage extraction, and substantive analysis. Blended cost drops 50–70% with no loss in review accuracy — the cases that require reasoning still get full model capability.
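The blended economics can be checked directly. Rates are the figures from the text; the 35% responsive share and 2,000-token document are illustrative:

```python
# Blended-cost sketch for the two-stage cascade: every document gets a
# cheap first pass; only documents flagged responsive go to the capable model.
CHEAP_RATE = 0.10 / 1_000_000    # gpt-4.1-nano input, $/token
CAPABLE_RATE = 2.00 / 1_000_000  # gpt-4.1 input, $/token

def blended_cost_per_doc(tokens: int, responsive_share: float) -> float:
    screen = tokens * CHEAP_RATE                         # every document
    analyze = tokens * CAPABLE_RATE * responsive_share   # survivors only
    return screen + analyze

def savings_vs_flat(tokens: int, responsive_share: float) -> float:
    """Fractional saving versus sending everything to the capable model."""
    flat = tokens * CAPABLE_RATE
    return 1 - blended_cost_per_doc(tokens, responsive_share) / flat

# 2,000-token documents, 35% passing the screen:
#   blended = $0.0002 (screen) + $0.0014 (analysis) = $0.0016/doc
#   flat    = $0.0040/doc  ->  60% reduction
```

Note the saving is independent of document length: it depends only on the responsive share and the price ratio between the two models, which is why it holds across matters with very different document mixes.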

2. Boilerplate prefix caching for contract analysis

Standard contract forms are 60–80% boilerplate. The governing law clause, limitation of liability structure, IP assignment mechanics — these repeat with minor variation across thousands of contracts of the same type. Cache the system prompt with full clause-type definitions and risk criteria for each contract category (NDA, MSA, lease, SOW). The cache hit rate for same-category contracts typically exceeds 85%, turning a 2,000-token system prompt from a cost center into a near-zero line item.
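A minimal sketch of the prefix-stability idea, assuming an OpenAI-style messages array. Provider-side prompt caching matches on the leading bytes of the request, so the per-category system prompt must be byte-identical across calls; the schema strings here are placeholder stand-ins:

```python
# Keep the per-category system prompt stable so provider-side prompt
# caching can hit on the shared prefix. Only the document text varies.
CLAUSE_SCHEMAS = {
    "nda": "Clause types: confidentiality, term, remedies. Risk criteria: ...",
    "msa": "Clause types: governing_law, liability_cap, ip_assignment. Risk criteria: ...",
}

def build_messages(category: str, document_text: str) -> list[dict]:
    # Identical system prompt for every contract in the same category:
    # the long prefix is served from cache on repeat requests.
    system = f"You are a contract analyst.\n{CLAUSE_SCHEMAS[category]}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": document_text},
    ]
```

The failure mode to avoid is interpolating anything request-specific (timestamps, matter IDs, document names) into the prefix, which silently drops the hit rate to zero.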

3. Matter-level cost attribution for billing passthrough

Legal is the one vertical where LLM costs can often be recovered directly from clients — treated as a disbursement, similar to court filing fees or expert costs. But only if you can produce per-matter cost breakdowns. A proxy layer that tags every request with the matter ID and generates monthly cost reports by matter turns AI cost visibility into a billing asset. Teams that implement this typically recover 30–50% of their LLM spend through client billing.
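A toy version of that attribution layer is below. The `X-Matter-Id` header name, the in-memory ledger, and the default rates (gpt-4.1 input/output pricing from this article) are all assumptions; a real proxy would persist every record to a durable store:

```python
from collections import defaultdict

class MatterCostLedger:
    """Accumulates LLM cost per matter ID, read from a request header,
    for month-end disbursement reports. Minimal in-memory sketch."""

    def __init__(self) -> None:
        self._spend: defaultdict[str, float] = defaultdict(float)

    def record(self, headers: dict, input_tokens: int, output_tokens: int,
               in_rate: float = 2.00 / 1e6, out_rate: float = 8.00 / 1e6) -> float:
        # "X-Matter-Id" is an assumed header name, not a standard.
        matter_id = headers.get("X-Matter-Id", "untagged")
        cost = input_tokens * in_rate + output_tokens * out_rate
        self._spend[matter_id] += cost
        return cost

    def monthly_report(self) -> dict[str, float]:
        """Per-matter spend, highest first, for the billing export."""
        return dict(sorted(self._spend.items(), key=lambda kv: -kv[1]))
```

An "untagged" bucket is worth keeping: requests that arrive without a matter ID are exactly the spend you cannot pass through, so making them visible is part of the point.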

Frequently Asked Questions

Why does contract analysis cost more than legal tech teams expect?
The budgeting error is consistent: teams estimate based on contract count, not token count. An NDA is roughly 3,000 tokens. A commercial MSA or lease is 15,000–30,000 tokens. Teams that scope costs on NDA-equivalent length and then process complex commercial agreements face a 5–10x per-document cost overrun. The count was right; the tokens-per-document estimate was wrong.
Can legal tech companies cache LLM responses given attorney-client privilege?
Yes — privilege applies to client communications and attorney work product, not to analytical outputs. Cache extraction results (clause types, risk signals) against a document hash, not the document text. Cache regulatory context and clause schemas in system prompt prefixes. Tag entries with matter IDs for invalidation on matter close. The constraint is narrower than most teams assume.
What is cascaded document review and how much does it save?
Cascaded review uses a cheap model for first-pass relevance screening and a capable model only for substantive analysis and privilege review. Roughly 65% of documents in a typical matter are non-responsive — the cheap model handles those entirely. Blended cost reduction: 50–70%, with no loss in overall review accuracy.
How do you attribute LLM costs per matter for client billing passthrough?
Tag every LLM request with a matter ID before it reaches the provider. A proxy layer is the natural enforcement point — it reads the matter tag from the request header, logs cost attribution, and produces per-matter reports at month-end. Law firms treating LLM costs as a disbursement can recover them from clients, but only with per-matter cost breakdowns.
What is the typical per-document LLM cost for eDiscovery?
Using gpt-4.1 at $2.00/1M input tokens: a 1,500-token document costs roughly $0.003; a 10,000-token document costs $0.02. Output tokens and system prompts add 40–60% on top. A matter with 50,000 documents at 2,000 tokens average runs approximately $280–$320 in LLM costs, but large matters with long documents can reach $2,000–$5,000 per matter.
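That estimate reduces to a one-line formula; the 50% overhead factor below is an assumed midpoint of the 40–60% range for output tokens and system prompts:

```python
def matter_llm_cost(n_docs: int, avg_tokens: int,
                    input_rate: float = 2.00 / 1e6,
                    overhead: float = 0.5) -> float:
    """Input-token cost for a matter, grossed up by `overhead` to cover
    output tokens and system prompts (assumed 50% here)."""
    return n_docs * avg_tokens * input_rate * (1 + overhead)

# 50,000 docs x 2,000 tokens -> $200 input + 50% overhead = $300
```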

See which legal features are driving your LLM bill.

Preto attributes cost by feature, model, and endpoint in real time — so document review, contract analysis, and legal research each show their own cost line. One URL change, no code refactor.

See Your Cost Breakdown by Feature

Free forever up to 10K requests. No credit card required.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter