You opened your AWS console this morning and the Comprehend Medical bill is up 40% month-over-month. Your Anthropic spend crept up again because the BAA-eligible models don't include Haiku 4.5. The compliance team flagged that prompt caching wasn't in your last SOC 2 review. And the per-encounter cost on your ambient scribing product is still drifting up despite tighter prompts.

Healthcare AI carries a cost structure that consumer AI products do not. PHI redaction runs as an extra inference pass before the model ever sees the request. ZDR endpoints disqualify the most aggressive prompt-caching tier. Per-patient context cannot be shared across customers, so the cross-request cache hit rates that subsidize consumer chatbots simply do not exist for you. None of this shows up in OpenAI's pricing page.

TL;DR

1. HIPAA constraints add 60–120% to the equivalent unconstrained LLM workload — split across PHI redaction, restricted caching, BAA-tier model selection, and audit logging overhead.
2. The five BAA paths (OpenAI ZDR, Anthropic Enterprise, AWS Bedrock, Azure OpenAI, Google Vertex) each restrict different optimization levers. Knowing which one your stack is locked into determines which cost levers are even available.
3. The highest-ROI optimization in healthcare AI is application-layer caching of de-identified extraction schemas — not provider-side prompt caching, which collapses under per-patient context.

The Four Layers HIPAA Adds to Your Token Bill

The base inference cost is the smallest line item in a HIPAA-bound stack. Four additional layers compound on top.

1. PHI redaction pass (+15–25%): AWS Comprehend Medical at ~$10 per 1M characters, run before the LLM call to strip patient identifiers, MRNs, and account references.
2. Lost prompt caching (+20–40%): OpenAI's extended (disk-backed) prompt caching is not ZDR-eligible, so the 50–90% input-token discount on long shared system prompts is unavailable on BAA endpoints.
3. Tier-restricted models (+10–20%): The cheapest cross-region routing is blocked by US-only HIPAA region pinning, and the 12x price differential between Haiku 4.5 and Opus 4.6 only matters if both are BAA-eligible in your account tier.
4. Audit logging overhead (+5–15%): Every prompt, response, recipient, and downstream clinical action is logged for HIPAA accountability; storage and metadata-tracking infrastructure compound on every call.

The 60–120% HIPAA premium on equivalent workloads is a synthesis number — the exact multiplier depends on how aggressively your consumer competitors are using prompt caching and cross-region routing. Teams with cleanly architected ZDR pipelines and disciplined PHI handling tend toward the low end. Teams that bolted compliance on after the architecture was set tend toward the high end.
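As a back-of-envelope check, the four layers can be stacked onto a base per-encounter cost. The sketch below treats the layers as additive percentages; the rates are illustrative midpoints of the ranges above, and the base cost is hypothetical:

```python
# Sketch: stack the four HIPAA cost layers onto a base inference cost.
# All rates are illustrative midpoints of the ranges above, not quotes.

def hipaa_encounter_cost(base_usd: float,
                         redaction_pct: float = 0.20,   # PHI redaction pass
                         lost_cache_pct: float = 0.30,  # forfeited caching discount
                         tier_pct: float = 0.15,        # BAA-tier model restrictions
                         audit_pct: float = 0.10) -> float:
    """Return the HIPAA-loaded cost for one encounter.

    Layers are modeled as additive percentages of the base token bill;
    depending on how aggressively each layer bites, the combined premium
    lands inside the 60-120% range cited above.
    """
    premium = redaction_pct + lost_cache_pct + tier_pct + audit_pct
    return base_usd * (1.0 + premium)

base = 0.12  # hypothetical base inference cost per scribed encounter
loaded = hipaa_encounter_cost(base)
print(f"base ${base:.2f} -> loaded ${loaded:.2f} (+{(loaded / base - 1) * 100:.0f}%)")
```

With the midpoint rates, the premium comes out to +75%, squarely inside the synthesis range; pushing every parameter to the top of its band is what drives teams toward the high end.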

Per-Encounter Token Economics: What the Three Big Use Cases Actually Cost

Headline monthly spend for a mid-size healthcare AI product — approximately 200,000 patient encounters per month across roughly 800 active clinicians:

Monthly LLM API + compliance spend by use case (200K encounters / 800 clinicians):

Ambient scribing (encounter transcription): $18,000–$42,000/mo
Prior authorization generation: $8,000–$25,000/mo
Clinical decision support: $5,000–$18,000/mo
Medical necessity + appeals letters: $3,000–$12,000/mo
Patient intake + triage chat: $1,500–$7,000/mo
FAQ / patient support chatbot: $600–$3,500/mo

Ambient scribing dominates because it runs on every encounter — and encounters are continuous, not episodic. The Permanente Medical Group's deployment processed roughly 2.5 million encounters across 7,260 physicians over a 14-month window. At even modest per-encounter token counts, the inference bill alone is the largest single line in a healthcare AI P&L.

FAQ chatbots sit at the bottom because patient support questions ("how do I refill my prescription," "what does my deductible mean") are highly repetitive. With correct PHI stripping in the cache key, hit rates land near consumer-SaaS norms. Without it, the cache silently leaks PHI and you have a regulatory problem rather than a cost win.
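What "correct PHI stripping in the cache key" can look like in practice, as a minimal sketch. The regexes are illustrative stand-ins for a real de-identification pass, and safe_cache_key is a hypothetical helper, not a library function:

```python
import hashlib
import re

# Sketch: compute a patient-safe cache key for FAQ-style queries.
# These patterns are illustrative stand-ins for a full PHI scrubber
# (a production system needs a dedicated de-identification pass).
PHI_PATTERNS = [
    (re.compile(r"\bMRN[-\s]?\d+\b", re.I), "<MRN>"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "<DATE>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def safe_cache_key(prompt: str) -> str:
    """Strip obvious identifiers, then hash the sanitized prompt."""
    for pattern, token in PHI_PATTERNS:
        prompt = pattern.sub(token, prompt)
    return hashlib.sha256(prompt.encode()).hexdigest()

# Two patients asking the same refill question share one cache entry:
a = safe_cache_key("MRN 4412: how do I refill my prescription?")
b = safe_cache_key("MRN 9903: how do I refill my prescription?")
assert a == b
```

The key property is that the hash is computed after sanitization, so identical questions from different patients collide on purpose, while the raw prompt never enters the cache.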

For the broader picture of how these numbers fit into your full AI cost structure, see why AI costs compound even when LLM prices fall and the unit economics breakdown for AI SaaS.

The BAA Map: Which Provider Lets You Do What

Not all BAAs are equivalent. The constraints baked into each provider's compliance posture determine which optimization tactics are even available to you.

OpenAI
BAA path: API via ZDR endpoints; request through baa@openai.com. ChatGPT BAAs only via Enterprise/Edu sales.
Prompt caching under BAA: Partial; in-memory caching only. Extended (disk-backed) caching is not ZDR-eligible.
Watch-out: The 50–90% input discount on long static prompts is unavailable under ZDR.

Anthropic
BAA path: First-party API plus the HIPAA-ready Enterprise plan (introduced Dec 2, 2025). Pre-Dec-2025 BAAs cover the API only.
Prompt caching under BAA: Available on the API; Enterprise plan coverage requires written confirmation from Anthropic sales.
Watch-out: Old BAAs do not auto-extend to the new Enterprise tier; confirm scope before relying on it.

AWS Bedrock
BAA path: HIPAA-eligible under the AWS BAA, which covers Claude, Llama, and Titan models without separate provider BAAs.
Prompt caching under BAA: Yes, under the same AWS data-handling guarantees.
Watch-out: A single-vendor BAA simplifies legal but locks you into Bedrock's regional availability and version cadence.

Azure OpenAI
BAA path: HIPAA-eligible under the Microsoft Online Services DPA (auto-included with EA/CSP).
Prompt caching under BAA: Yes, but the Realtime audio API is in preview and not in HIPAA scope.
Watch-out: Voice-AI products that need Realtime have to fall back to text-only or self-host until Realtime hits GA with HIPAA coverage.

Google Vertex AI
BAA path: Requires both the Google Cloud BAA and a project-level regulated-data flag for Vertex/Gemini API workloads.
Prompt caching under BAA: Yes, once the regulated-data flag is set on the project.
Watch-out: Consumer Gemini is explicitly out of scope; confirm the project flag is set before any production traffic.

The single most common architecture mistake we see in healthcare AI is teams assuming an old API BAA covers a new product surface (Realtime voice, agent loops, the Anthropic Enterprise plan) without re-confirming scope. The cost of finding out you were out of compliance is much higher than the cost of a single email to your provider's legal team.

Want to see what HIPAA overhead is actually costing per encounter?

Preto attributes LLM cost by feature, model, and endpoint — so PHI redaction, BAA-tier model selection, and per-encounter token counts each surface as their own line. One URL change; no PHI ever leaves your infrastructure.

See Your Cost Breakdown by Feature

One URL change. See which features cost the most. Free to start.

Why the Consumer-AI Caching Math Breaks for You

Consumer AI products subsidize their token bills with two sources of cache hits: provider-side prompt caching on long static system prompts (50–90% off the input portion), and cross-customer response caching for any prompt that's been seen before. Healthcare workloads lose access to both.

Provider prompt caching: OpenAI's most aggressive cached-input discount lives in extended (disk-backed) caching, which is not ZDR-eligible. In-memory caching still works under ZDR — but the hit rate depends on consecutive requests against the same prefix within seconds. For ambient scribing where each encounter has a unique patient context, the in-memory hit rate is near zero.

Cross-customer response caching: Even on consumer SaaS, sharing a cached response across customers is fine because the prompt was usually content the user typed, not data about another user. In healthcare, every prompt contains protected information about a specific patient. The "two patients with similar symptoms" case still cannot share a cache entry because the input contains identifying context.

The architectural workaround is to push caching one layer up — into your application — and to cache de-identified analytical schemas rather than raw responses.

Three Patterns That Work Inside the Constraints

1. Schema-level result caching for repetitive document workflows

Prior authorization letters follow templates. So do appeal letters, medical necessity justifications, and structured clinical notes. The variable content is patient-specific; the schema (sections, required clinical evidence types, payer-specific phrasing) is not. Cache the structured plan that a model produces — "for ICD-10 code X with this evidence pattern, the letter needs sections A, B, C with these supporting citations" — and then have a cheaper model fill in the patient-specific blanks. Typical reduction: 40–60% on the per-document cost without changing the output quality.

2. Cascade routing for patient triage

Patient intake and triage workflows have a long tail of low-acuity questions ("appointment scheduling," "prescription refill timing") and a small head of high-acuity cases that need careful reasoning. Run a fast classifier first — Haiku 4.5 or GPT-5.4 nano — to detect acuity. Route low-acuity to the cheap model with FAQ-pattern responses; escalate the genuinely clinical questions to Sonnet 4.6 or Opus 4.7. Typical blended cost reduction: 50–70%, with no measurable change in clinical accuracy because the escalation path catches the cases that need it.
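A minimal sketch of the cascade. Here classify_acuity is a keyword stand-in for the fast-classifier model call, and the model names are placeholders for whichever BAA-covered models you route to:

```python
# Sketch: acuity-based cascade routing for patient intake.
# classify_acuity is a stand-in for a fast classifier-model call;
# the model names are placeholders, not specific recommendations.
CHEAP_MODEL = "fast-classifier-model"
CAPABLE_MODEL = "capable-reasoning-model"

LOW_ACUITY_HINTS = ("refill", "appointment", "billing", "deductible")

def classify_acuity(message: str) -> str:
    # Placeholder heuristic; in production this is itself a cheap model call.
    text = message.lower()
    return "low" if any(hint in text for hint in LOW_ACUITY_HINTS) else "high"

def route(message: str) -> str:
    """Return which model should answer this intake message."""
    return CHEAP_MODEL if classify_acuity(message) == "low" else CAPABLE_MODEL

assert route("When will my refill be ready?") == CHEAP_MODEL
assert route("Crushing chest pain radiating to my left arm") == CAPABLE_MODEL
```

The design choice that matters is the default: anything the classifier cannot confidently mark low-acuity escalates, so the cheap path only absorbs the long tail it is safe to absorb.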

3. PHI tokenization at the proxy layer for cache safety

The single architectural change that unlocks safe caching is tokenizing PHI at the edge before it reaches the LLM or the cache. Replace patient identifiers, MRNs, names, and dates with consistent tokens (PT_4F2A, MRN_9C1B, DOS_2026Q2). The model still produces clinically correct output. The cache key is computed against the tokenized prompt, so two encounters with similar clinical context but different patients will share a cache entry safely. The detokenization map stays inside your VPC and never enters the cache or the LLM provider's logs.
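A sketch of the tokenization step, assuming a per-tenant salt and an in-VPC detokenization map; the token formats follow the PT_/MRN_ examples above, and PHITokenizer is a hypothetical class, not a library API:

```python
import hashlib

class PHITokenizer:
    """Proxy-layer PHI tokenizer sketch; the detok map never leaves the VPC."""

    def __init__(self, salt: bytes):
        self.salt = salt
        self.detok = {}  # token -> original value, held inside your perimeter

    def tokenize(self, kind: str, value: str) -> str:
        # Deterministic: the same patient always maps to the same token,
        # so cache keys computed over tokenized prompts stay stable.
        digest = hashlib.sha256(self.salt + value.encode()).hexdigest()[:4].upper()
        token = f"{kind}_{digest}"
        self.detok[token] = value
        return token

    def detokenize(self, text: str) -> str:
        # Applied to model output inside the VPC, after the LLM call returns.
        for token, value in self.detok.items():
            text = text.replace(token, value)
        return text

tok = PHITokenizer(salt=b"per-tenant-secret")
prompt = (f"Summarize visit for {tok.tokenize('PT', 'Jane Doe')}, "
          f"{tok.tokenize('MRN', '884412')}")
# prompt now carries PT_xxxx / MRN_xxxx tokens only; the real identifiers
# are restored inside the VPC before the output reaches the clinician.
restored = tok.detokenize(prompt)
assert "Jane Doe" in restored and "884412" in restored
```

A production version needs collision handling (a 4-hex-character digest is for readability here) and a durable, access-controlled store for the detokenization map, but the flow is the same: tokenize at the edge, cache and call on tokens, detokenize on the way back.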

This is the same pattern fintech teams use for PCI-DSS compliance — the framework we covered for cardholder data generalizes cleanly to PHI.

The Compliance Trade-Off Most Teams Get Wrong

The instinct under HIPAA is to default to the most capable model on every call ("we can't afford a hallucination on clinical content"). The math says otherwise: a tiered architecture with cheap models for classification and capable models for reasoning has lower hallucination rates in production than a single-capable-model architecture, because the cheap model handles the cases where the capable model would have over-reasoned and the capable model handles the cases where the cheap model would have under-reasoned. Cost goes down. Quality goes up. The constraint to optimize against is not "how do we pay less" — it's "where in the workflow does each model belong."

Frequently Asked Questions

Which LLM API providers offer HIPAA Business Associate Agreements in 2026?
OpenAI signs BAAs against the API but only via Zero Data Retention endpoints — request through baa@openai.com. Anthropic offers BAAs on the first-party API and on the HIPAA-ready Enterprise plan introduced Dec 2, 2025; pre-existing BAAs cover only the API tier. AWS Bedrock is HIPAA-eligible under the AWS BAA, which extends to Claude, Llama, and Titan models. Azure OpenAI is HIPAA-eligible under the Microsoft Online Services DPA, but the Realtime audio API is still in preview and not in scope. Google Vertex AI / Gemini requires both the Google Cloud BAA and a project-level regulated-data flag.
Why does HIPAA increase LLM API costs?
Four cost layers stack on top of the base token bill: PHI redaction adds an extra inference pass (~$10/M characters on Comprehend Medical), ZDR endpoints disqualify extended prompt caching where the 50–90% input discount lives, per-patient context cannot be shared across customers (collapsing cross-request cache hit rates), and audit logging requirements inflate storage and tracking infrastructure. Combined, these layers commonly run 60–120% above the equivalent unconstrained workload.
What does ambient medical scribing actually cost per encounter?
Vendors do not publish per-encounter token counts. Public pricing puts enterprise scribing in the $600–800 per provider per month range. The Permanente Medical Group processed roughly 2.5 million encounters across 7,260 physicians (October 2023 to December 2024), saving an estimated 16,000 documentation hours. Suki AI customers including Rush, McLeod, and FMOL report an average $1,223 per provider per month in incremental revenue, with Rush specifically seeing a 5.5 percentage point lift in same-day chart closure.
Can healthcare AI use prompt caching while staying HIPAA-compliant?
Partially. OpenAI's in-memory prompt caching is compatible with ZDR, but extended disk-backed caching — where the deepest discount lives — is not ZDR-eligible. Anthropic prompt caching is available on the API; Enterprise plan coverage should be confirmed in writing with sales. The architectural workaround is to cache de-identified extraction schemas at the application layer and tokenize PHI at the proxy layer so cache keys are computed against patient-safe tokens.
What is the largest LLM cost driver in healthcare SaaS?
Volume-driven workloads dominate. Ambient scribing runs on every encounter, prior authorization scales with payer interactions, and compliance review fans out per claim. Clinical decision support is mid-tier on monthly spend because it triggers per-query rather than per-encounter, but it requires the highest-capability models. Patient-facing chatbots are typically lowest cost because FAQ-style queries cache well — provided PHI is correctly stripped from cache keys.

See which healthcare AI features are driving your LLM bill.

Preto attributes cost by feature, model, and endpoint in real time — so ambient scribing, prior auth, and CDS each show their own cost line. Deploy in your own VPC for ZDR-equivalent guarantees. One URL change, no PHI leaves your perimeter.

See Your Cost Breakdown by Feature

Self-hostable. BAA-friendly architecture. Free forever up to 10K requests.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter