You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. The pricing pages give you per-million-token numbers that are easy to compare. The benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.

This post is that comparison. Real per-task cost math across five common workloads. Head-to-head benchmark results from each model's launch window. Production gotchas you should know before signing a commitment. And the deprecation timeline that changes which Anthropic model you're actually choosing between.

Deprecation note before we go further. Claude Sonnet 4 (API ID claude-sonnet-4-20250514, launched May 22, 2025) is deprecated and retires on June 15, 2026. The migration target Anthropic recommends is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6, because the original Sonnet 4 won't be in production a quarter from now. We cover both below — Sonnet 4 for context on what shipped in May 2025, Sonnet 4.6 for the decision that matters for any workload still running after June 15, 2026.

TL;DR

1. GPT-5 is roughly 2x cheaper per task than Sonnet 4 / Sonnet 4.6 across most workload mixes ($1.25/$10 vs $3/$15 per MTok). At high volume the price gap dominates.
2. GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet 4 / 4.6 wins on agentic tool use and tends to be the safer choice for software engineering agents.
3. The most cost-effective production architecture is usually neither alone — it's a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and escalates to GPT-5 or Sonnet 4.6 only for requests that need the capability.

The Pricing Reality (April 2026)

| Dimension | GPT-5 (Aug 2025) | Sonnet 4 (May 2025, deprecating) | Sonnet 4.6 (current) |
|---|---|---|---|
| Input price | $1.25 / MTok | $3.00 / MTok | $3.00 / MTok |
| Output price | $10.00 / MTok | $15.00 / MTok | $15.00 / MTok |
| Cached input | ~$0.125 / MTok (~90% off) | $0.30 / MTok cache read; write 1.25x base | $0.30 / MTok cache read; write 1.25x base |
| Context window | 400K (2x rate above 272K on GPT-5.4) | 200K | 1M (flat pricing) |
| Max output | 128K | 64K | 64K |
| Reasoning model? | Yes — reasoning tokens billed as output; reasoning_effort: none/low/medium/high/xhigh | Extended thinking; thinking tokens billed as output | Extended + adaptive thinking; same billing |
| Batch API discount | 50% off both directions | 50% off both directions | 50% off both directions |
| Knowledge cutoff | ~Sep/Oct 2024 (verify on model card) | ~Mar 2025 | Aug 2025 |

Sources: OpenAI GPT-5 model page, Anthropic pricing, Anthropic models overview.

The headline pricing gap: GPT-5 is 2.4x cheaper on input and 1.5x cheaper on output than Sonnet 4 / 4.6. On a typical 4:1 input-to-output workload, that blends to a 1.6–2.0x cost advantage. The picture changes when caching enters: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), which front-loads the cost on the first call.
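The blended-rate arithmetic behind that 1.6–2.0x range is a one-liner. A minimal sketch, using the list prices quoted above on an assumed 4:1 input-to-output mix:

```python
# List prices in dollars per million tokens, from the table above.
GPT5 = {"in": 1.25, "out": 10.00}
SONNET = {"in": 3.00, "out": 15.00}

def blended_rate(rates: dict, in_parts: float, out_parts: float) -> float:
    """Weighted per-MTok rate for a given input:output token mix."""
    total = in_parts + out_parts
    return (rates["in"] * in_parts + rates["out"] * out_parts) / total

gpt5 = blended_rate(GPT5, 4, 1)      # $3.00 / MTok blended
sonnet = blended_rate(SONNET, 4, 1)  # $5.40 / MTok blended
print(f"Sonnet / GPT-5 blended: {sonnet / gpt5:.2f}x")
```

A 4:1 mix lands at exactly 1.8x; output-heavier mixes drift toward the 1.5x output ratio, input-heavier mixes toward the 2.4x input ratio.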

The long-context line is where Sonnet 4.6 quietly wins. GPT-5.4 charges 2x the standard rate above 272K input tokens; Sonnet 4.6 has flat pricing across its full 1M token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), Sonnet 4.6 is often cheaper per request despite the higher per-token rate.

Head-to-Head Benchmarks (Launch Window)

| Benchmark | GPT-5 | Sonnet 4 | Margin |
|---|---|---|---|
| SWE-bench Verified (coding) | 74.9% | 72.7% (80.2% high-compute) | Tight; high-compute Sonnet leads |
| AIME 2025 math (no tools) | 94.6% | 70.5% | GPT-5 +24.1pp |
| GPQA Diamond (graduate science) | 88.4% (GPT-5 Pro) | 75.4% | GPT-5 +13.0pp |
| MMMU (multimodal) | 84.2% | 74.4% | GPT-5 +9.8pp |
| Aider Polyglot (coding) | 88% | n/a (not published) | — |
| Tau-Bench Retail (agentic) | n/a (not published) | 80.5% | Sonnet |
| Tau-Bench Airline (agentic) | n/a (not published) | 60.0% | Sonnet |

Sources: OpenAI GPT-5 launch, Anthropic Claude 4 launch.

The pattern that emerges: GPT-5 dominates pure reasoning benchmarks (math, science, multimodal). Sonnet 4 holds its own on agentic tool-use benchmarks where reliability matters more than peak intelligence — Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.

On the current generation, the gap narrows significantly. SWE-bench Verified results for the current models: Sonnet 4.6 hits 79.6%, GPT-5.4 lands around 80%, Opus 4.5/4.6 reaches 80.8–80.9%. The difference between picking GPT-5.4 and Sonnet 4.6 for software engineering work is much smaller than the GPT-5-vs-Sonnet-4 launch numbers suggest.

Cost Per Real Task: The Math That Matters

Per-million-token pricing is hard to internalize. Per-task cost is what you'll actually pay. Five workloads at typical input/output sizes:

| Workload | Tokens (in / out) | GPT-5 | Sonnet 4 / 4.6 | GPT-5 advantage |
|---|---|---|---|---|
| Customer support reply | 200 / 150 | $0.00175 | $0.00285 | 1.6x |
| Code review of a 500-line PR | 4,000 / 800 | $0.0130 | $0.0240 | 1.85x |
| Document summarization | 3,000 / 400 | $0.00775 | $0.0150 | 1.94x |
| RAG-enabled Q&A | 2,500 / 250 | $0.005625 | $0.01125 | 2.0x |
| Agentic task with 5 tool calls | ~8,000 / ~3,500 (incl. reasoning) | $0.0450 | $0.0765 | 1.7x |

Customer support reply: at 100K replies/month, that's $175 vs $285 — a $110/month difference. Both models handle this workload well; pick on cost.

Code review: Sonnet 4.6's SWE-bench Verified parity and stronger agentic tool-use track record make it the conventional choice for code-review agents despite the 1.85x premium. The premium buys reliability on the tool-use chain, not raw intelligence.

Document summarization: quality is comparable for sub-200K-token documents. For long documents (above 272K tokens), Sonnet 4.6's flat 1M-context pricing flips the cost picture — it's cheaper per long-document call than GPT-5.4.

RAG-enabled Q&A: both handle RAG patterns well. Pick on cost unless you've benchmarked Sonnet 4.6 outperforming on your specific retrieval-grounded answer quality.

Agentic task: Sonnet 4 / 4.6's reliability advantage on Tau-Bench means production agents often pay the premium for fewer retries and tool-use failures. The cost-per-successful-task ratio narrows considerably once retry math is included.
Reasoning token caveat: add ~30–60% to GPT-5 output cost when running with reasoning_effort >= medium. Reasoning tokens are invisible in the response but consume your output budget. Sonnet 4.6 extended thinking has the same dynamic — thinking tokens bill as output, with a configurable budget. The per-task numbers above assume low reasoning effort; high-effort runs change the cost picture.
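The per-task figures in this section are plain list-price arithmetic, easy to rerun against your own token sizes. A minimal sketch (the dict keys are labels for this post, not API model IDs):

```python
# List prices in dollars per million tokens; keys are labels, not API IDs.
PRICES = {
    "gpt-5": {"in": 1.25, "out": 10.00},
    "sonnet-4.6": {"in": 3.00, "out": 15.00},
}

def task_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call at list prices (no caching, no batch discount)."""
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# Code review of a 500-line PR: 4,000 in / 800 out
print(task_cost("gpt-5", 4_000, 800))       # 0.013
print(task_cost("sonnet-4.6", 4_000, 800))  # 0.024
```

Swap in your measured token sizes per endpoint and the comparison stops being hypothetical.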

For the broader question of how to think about per-call cost beyond tokens, see why cost-per-request is the number you should be losing sleep over.

Don't decide on benchmarks. Decide on your traffic.

Preto runs your last 30 days of API traffic against both GPT-5 and Sonnet 4.6, surfaces per-task cost on your actual workload, and recommends which model belongs on which endpoint. No assumptions, no industry-average estimates.

See Which Model You Should Actually Use

Preto analyzes your traffic and recommends the right model. Free to start.

Production Gotchas Neither Pricing Page Mentions

GPT-5 reasoning inflation. The reasoning_effort parameter has five levels (none, low, medium, high, xhigh). xhigh runs roughly 3–5x the cost of low because of hidden reasoning token volume. max_completion_tokens ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see. Set reasoning_effort explicitly on every production call. "Be concise" in the prompt does not control reasoning verbosity. Source: BSWEN GPT-5.4 reasoning effort guide.
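One way to enforce that rule is to never build a GPT-5 request without an explicit effort level and output ceiling. The sketch below assumes the parameter names the guide above describes (reasoning_effort, max_completion_tokens); verify them against your SDK version before shipping:

```python
# Hypothetical helper: pin reasoning effort on every call so nothing falls
# back to an expensive default. Parameter names follow the guide cited above.
ALLOWED_EFFORT = {"none", "low", "medium", "high", "xhigh"}

def gpt5_request(prompt: str, effort: str = "low",
                 max_completion_tokens: int = 2_000) -> dict:
    """Build request kwargs with reasoning effort set explicitly.

    Note: max_completion_tokens is the total budget, including the hidden
    reasoning tokens you are billed for but never see.
    """
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORT)}")
    return {
        "model": "gpt-5",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
        "max_completion_tokens": max_completion_tokens,
    }
```

Centralizing request construction this way makes the effort level auditable in code review instead of implicit in a default.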

Sonnet output verbosity. Anthropic's own model documentation notes Sonnet's "engaging responses" and recommends prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5 outputs. The cost implication is direct: more output tokens at $15/M. The mitigation is the same as the GPT-5 reasoning-effort fix — explicit length constraints in the system prompt, plus a max_tokens ceiling.
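A sketch of that two-layer mitigation, assuming the Messages API kwarg shape (a system string plus a max_tokens ceiling); the model ID and the 150-word limit are placeholders to adapt:

```python
# Hypothetical helper: constrain Sonnet output at the prompt layer and the
# token layer. Model ID and word limit are illustrative placeholders.
CONCISION = "Answer in at most 150 words. No preamble, no recap."

def sonnet_request(prompt: str, system: str = "",
                   max_tokens: int = 1_024) -> dict:
    return {
        "model": "claude-sonnet-4-6",             # placeholder ID
        "system": f"{system}\n\n{CONCISION}".strip(),
        "max_tokens": max_tokens,                 # hard ceiling on $15/MTok output
        "messages": [{"role": "user", "content": prompt}],
    }
```

The prompt constraint trims typical responses; the max_tokens ceiling caps the worst case.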

Long-context pricing flip. Above 272K input tokens, GPT-5.4 charges 2x the standard input rate. Sonnet 4.6's 1M context window has flat pricing throughout. For workloads that regularly use long context — codebases above ~50K lines, multi-document research synthesis, RAG with large retrieval windows — Sonnet 4.6 can be cheaper per request despite the higher per-token rate.

Rate limits. OpenAI GPT-5 (after a September 2025 increase): Tier 1 500K TPM, Tier 2 1M, Tier 3 2M, Tier 4 4M, Tier 5 40M. Source: OpenAI Devs on X. Anthropic: Tier 1 (after $5 deposit) 50 RPM, Tier 2 1K, Tier 3 2K, Tier 4 4K. The structures aren't directly comparable (TPM vs RPM), but production teams hitting bursty traffic on Anthropic regularly need to plan for tier escalation earlier than equivalent OpenAI workloads.

Reliability (December 2025 reference month). IsDown's LLM provider report: Anthropic 20 incidents (7 major), 184.5 hours total downtime. OpenAI 22 incidents (1 major), 182.7 hours. Anthropic had fewer total incidents but more severe ones; OpenAI had more frequent minor incidents. Both providers clear the reliability bar for most workloads, so uptime alone is rarely the deciding factor.

Recent security incidents to be aware of. November 2025 OpenAI Mixpanel breach exposed API portal customer profiles; Anthropic had a Claude Code internal-files exposure incident. Both are public. Source: AI Incident Database. Neither is a reason to switch providers, but both inform the BAA / DPA conversation if you're in a regulated industry.

Geographic / data residency premiums. Anthropic charges a 1.1x multiplier for US-only inference_geo on Opus 4.6+; Bedrock and Vertex regional endpoints add a ~10% premium for Sonnet 4.5+. Source: Anthropic pricing. If your compliance posture requires US-only inference, factor this into the per-task cost.

The Decision Framework

Choose GPT-5 (or GPT-5 mini)
When cost-per-task wins
  • Math-heavy reasoning, science Q&A, technical analysis
  • Structured-output generation at lower cost
  • High-volume workloads where the 1.6–2.0x price gap compounds
  • RAG and document summarization on documents under 272K tokens
  • Customer support replies where per-reply cost matters more than peak quality
Choose Claude Sonnet 4.6
When reliability or long context wins
  • Agentic workflows with tool-use reliability requirements
  • Software engineering agents (code review, refactoring, multi-file patches)
  • Long-context workloads — flat 1M pricing beats GPT-5's 2x above 272K
  • Writing-heavy tasks where coherent verbose output is preferred
  • Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice

Choose neither alone for production. The architecture that minimizes cost-per-task across a real product is a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, and escalates to GPT-5 or Sonnet 4.6 only for the requests that need the capability. Berkeley's RouteLLM benchmarks demonstrate ~85% cost reduction at 95% quality on routable workloads. The model routing setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.
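A deliberately crude sketch of that routing decision. Production routers use a trained classifier (RouteLLM's approach), but even a keyword-and-length heuristic shows the shape; the trigger list and threshold below are invented for illustration:

```python
# Toy router: escalate only when the request looks like it needs a frontier
# model. Triggers and the 200-word threshold are illustrative, not tuned.
ESCALATE_TRIGGERS = ("refactor", "prove", "multi-file", "debug", "architect")

def pick_model(request: str) -> str:
    needs_capability = (
        len(request.split()) > 200
        or any(t in request.lower() for t in ESCALATE_TRIGGERS)
    )
    # Cheap tier handles FAQ-pattern traffic; frontier tier gets the rest.
    return "gpt-5" if needs_capability else "gpt-5-nano"

print(pick_model("What are your support hours?"))         # gpt-5-nano
print(pick_model("Refactor this module to use asyncio"))  # gpt-5
```

The same shape works with Haiku 4.5 as the cheap tier and Sonnet 4.6 as the escalation target; the savings come from the fraction of traffic that never reaches the expensive model.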

What the Pricing Comparison Doesn't Capture

The 2x price gap between GPT-5 and Sonnet 4.6 is real, but it's not the most consequential variable in your bill. The variables that matter more — in roughly this order:

  1. Whether you're routing at all. A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.
  2. Whether prompt caching is active. Up to 90% off cached input on both providers. A cache-breaking bug (a timestamp in the prefix, dynamic content at the top of the system prompt) costs more than picking the pricier model would have.
  3. Whether reasoning_effort is set. Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently. xhigh on agentic loops is the most common cause of unexpected GPT-5 cost spikes.
  4. Whether you're on the Batch API for batchable work. Flat 50% off both providers — invisible if your traffic is all real-time, enormous if any non-realtime work is in the mix.
  5. Then, finally, the per-token rate of the model you picked.
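Item 2 is worth a concrete sketch: keep the cacheable prefix byte-identical across calls and push everything dynamic after it. The prompt content below is invented; the ordering is the point:

```python
# Static prefix first (byte-identical every call, so it can cache); dynamic
# content last. Prompt text is invented for illustration.
STATIC_SYSTEM = (
    "You are a support agent for ExampleCo.\n"
    "Product policy (v12, stable, worth caching):\n"
    "- Refunds within 30 days.\n"
    "- Escalate billing disputes to a human.\n"
)

def build_messages(dynamic_context: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # cacheable prefix
        # Dynamic content (user plan, ticket history, timestamps) goes after
        # the prefix; putting it first invalidates the cache on every call.
        {"role": "user", "content": f"{dynamic_context}\n\n{question}"},
    ]
```

If the first bytes of the request change per call, the cache hit rate is zero and you pay full input price every time.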

Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more. The full audit is the seven-tactic playbook we covered separately.

Frequently Asked Questions

Is Claude Sonnet 4 still available in 2026?
Claude Sonnet 4 (claude-sonnet-4-20250514, launched May 22, 2025) is deprecated and retires June 15, 2026. Anthropic recommends Claude Sonnet 4.6 as the migration target — same pricing as Sonnet 4 ($3 input / $15 output per MTok), 1M token context window (vs 200K for Sonnet 4), knowledge cutoff August 2025, with extended and adaptive thinking modes. Any production workload comparing GPT-5 to Sonnet 4 today should be planned against Sonnet 4.6.
What is the cost difference between GPT-5 and Claude Sonnet 4 per million tokens?
GPT-5 is $1.25 input / $10.00 output per million tokens. Claude Sonnet 4 (and 4.6) is $3.00 input / $15.00 output. GPT-5 is roughly 2.4x cheaper on input and 1.5x cheaper on output, blending to a typical 1.6–2.0x cost advantage on most workload mixes. Cached input changes the comparison: GPT-5 caches at ~90% off (~$0.125/M); Anthropic cache reads run at 0.1x base ($0.30/M). Both batch APIs offer flat 50% discounts.
Which model wins on production benchmarks?
GPT-5 leads on AIME 2025 math (94.6% vs 70.5%), GPQA Diamond science (88.4% vs 75.4%), and Aider Polyglot (88%). Sonnet 4 is competitive on SWE-bench Verified (72.7% standard, 80.2% high-compute) and Tau-Bench agentic workflows (80.5% retail, 60.0% airline). Current generation narrows further — Sonnet 4.6 hits 79.6% SWE-bench Verified vs GPT-5.4 at ~80%. GPT-5 wins on math and science reasoning at lower cost; Sonnet 4 / 4.6 wins on agentic tool-use reliability.
What are the production gotchas with GPT-5 reasoning tokens?
GPT-5 has no separate SKU for reasoning tokens — they bill as output at $10/M. The reasoning_effort parameter has five levels (none/low/medium/high/xhigh); xhigh runs roughly 3–5x the cost of low. Set reasoning_effort explicitly. The 128K output budget includes reasoning tokens, so 'be concise' instructions don't apply. Above 272K input tokens, GPT-5.4 charges 2x rate; Sonnet 4.6 has flat pricing across its 1M context, which is a real cost advantage for long-context workloads.
Which model should I use for my production workload?
Choose GPT-5 (or GPT-5 mini) for math-heavy reasoning, science Q&A, structured-output generation at lower cost, and high-volume workloads where the price gap compounds. Choose Sonnet 4.6 for agentic workflows with tool-use reliability requirements, software engineering agents, long-context workloads (1M flat pricing vs GPT-5's 2x above 272K), and writing-heavy tasks. The most cost-effective production architecture is usually a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and escalates to GPT-5 or Sonnet 4.6 only for requests that need the capability.

Stop guessing. Run the comparison on your own traffic.

Preto pulls your last 30 days of API traffic, runs each request through cost models for GPT-5, GPT-5 mini, Sonnet 4.6, Haiku 4.5, and Opus 4.7, and surfaces per-task cost and quality projection — so you pick on data, not benchmark averages.

See Which Model You Should Actually Use

Free forever up to 10K requests. No credit card required.

Gaurav Dagade

Founder of Preto.ai. 11 years engineering leadership. Previously Engineering Manager at Bynry. Building the cost intelligence layer for AI infrastructure.

LinkedIn · Twitter