You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. The pricing pages give you per-million-token numbers that are easy to compare. The benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.
This post is that comparison. Real per-task cost math across five common workloads. Head-to-head benchmark results from each model's launch window. Production gotchas you should know before signing a commitment. And the deprecation timeline that changes which Anthropic model you're actually choosing between.
Deprecation note before we go further. Claude Sonnet 4 (API ID claude-sonnet-4-20250514, launched May 22, 2025) is deprecated and retires on June 15, 2026. The migration target Anthropic recommends is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6, because the original Sonnet 4 won't be in production a quarter from now. We cover both below: Sonnet 4 for context on what shipped in May 2025, and Sonnet 4.6 for the decision you'll actually be living with after June 15.
1. GPT-5 is roughly 2x cheaper per task than Sonnet 4 / Sonnet 4.6 across most workload mixes ($1.25/$10 vs $3/$15 per MTok). At high volume the price gap dominates.
2. GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet 4 / 4.6 wins on agentic tool use and tends to be the safer choice for software engineering agents.
3. The most cost-effective production architecture is usually neither alone — it's a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and escalates to GPT-5 or Sonnet 4.6 only for requests that need the capability.
The Pricing Reality (April 2026)
| Dimension | GPT-5 (Aug 2025) | Sonnet 4 (May 2025, deprecating) | Sonnet 4.6 (current) |
|---|---|---|---|
| Input price | $1.25 / MTok | $3.00 / MTok | $3.00 / MTok |
| Output price | $10.00 / MTok | $15.00 / MTok | $15.00 / MTok |
| Cached input | ~$0.125 / MTok (~90% off) | $0.30 / MTok (cache read; write 1.25x base) | $0.30 / MTok (cache read; write 1.25x base) |
| Context window | 400K (2x rate above 272K on GPT-5.4) | 200K | 1M (flat pricing) |
| Max output | 128K | 64K | 64K |
| Reasoning model? | Yes — reasoning tokens billed as output; reasoning_effort: none/low/medium/high/xhigh | Extended thinking; thinking tokens billed as output | Extended + adaptive thinking; same billing |
| Batch API discount | 50% off both directions | 50% off both directions | 50% off both directions |
| Knowledge cutoff | ~Sep/Oct 2024 (verify on model card) | ~Mar 2025 | Aug 2025 |
Sources: OpenAI GPT-5 model page, Anthropic pricing, Anthropic models overview.
The headline pricing gap: GPT-5 is 2.4x cheaper on input and 1.5x cheaper on output than Sonnet 4 / 4.6. On a typical 4:1 input-to-output workload, that blends to a 1.6–2.0x cost advantage. The picture changes when caching enters: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), which front-loads the cost on the first call.
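The blend math above takes only a few lines to reproduce; a minimal sketch using the list prices from the table, where the 4:1 default ratio is the assumption to swap for your own measured mix:

```python
# Blended per-token rate at a given input:output ratio, using the list
# prices quoted above (USD per million tokens). The 4:1 default ratio
# is an assumption; substitute your own logged input/output mix.
def blended_rate(input_price, output_price, ratio=4.0):
    # Weighted average price for `ratio` input tokens per output token.
    return (ratio * input_price + output_price) / (ratio + 1)

gpt5 = blended_rate(1.25, 10.00)     # GPT-5
sonnet = blended_rate(3.00, 15.00)   # Sonnet 4 / 4.6

print(round(sonnet / gpt5, 2))  # prints 1.8, inside the 1.6-2.0x range
```

Re-run it with your real ratio: output-heavy workloads (code generation, long summaries) push the blend toward the 1.5x output gap, input-heavy ones toward the 2.4x input gap.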
The long-context line is where Sonnet 4.6 quietly wins. GPT-5.4 charges 2x the standard rate above 272K input tokens, and GPT-5's window stops at 400K; Sonnet 4.6 has flat pricing across its full 1M-token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), the surcharge narrows GPT-5's per-token edge, and anything past 400K tokens fits only in Sonnet 4.6 at all.
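A quick sanity check of that flip, using the rates quoted above. One assumption to flag: the 2x surcharge is applied here to input tokens beyond 272K only, with output pricing and caching ignored for simplicity; verify both against the current rate cards.

```python
# Per-request INPUT cost at long context. Assumption: GPT-5.4's 2x
# surcharge applies only to input tokens beyond 272K; output pricing
# and caching are ignored here for simplicity.
def gpt54_input_cost(tokens, base=1.25):
    if tokens > 400_000:               # past GPT-5's context window
        return None
    surcharged = max(0, tokens - 272_000)
    return ((tokens - surcharged) * base + surcharged * 2 * base) / 1e6

def sonnet46_input_cost(tokens, base=3.00):
    if tokens > 1_000_000:             # past Sonnet 4.6's window
        return None
    return tokens * base / 1e6         # flat across the full window

for n in (200_000, 400_000, 800_000):
    print(n, gpt54_input_cost(n), sonnet46_input_cost(n))
```

Note what the arithmetic shows: on input alone, the surcharged GPT-5.4 rate ($2.50/MTok) still sits under Sonnet's flat $3.00, so the per-request flip comes from the window cap and from output-heavy or cache-heavy mixes rather than raw input rates.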
Head-to-Head Benchmarks (Launch Window)
| Benchmark | GPT-5 | Sonnet 4 | Margin |
|---|---|---|---|
| SWE-bench Verified (coding) | 74.9% | 72.7% (80.2% high-compute) | Tight; high-compute Sonnet leads |
| AIME 2025 math (no tools) | 94.6% | 70.5% | GPT-5 +24.1pp |
| GPQA Diamond (graduate science) | 88.4% (GPT-5 Pro) | 75.4% | GPT-5 +13.0pp |
| MMMU (multimodal) | 84.2% | 74.4% | GPT-5 +9.8pp |
| Aider Polyglot (coding) | 88% | n/a published | — |
| Tau-Bench Retail (agentic) | n/a published | 80.5% | Sonnet |
| Tau-Bench Airline (agentic) | n/a published | 60.0% | Sonnet |
Sources: OpenAI GPT-5 launch, Anthropic Claude 4 launch.
The pattern that emerges: GPT-5 dominates pure reasoning benchmarks (math, science, multimodal). Sonnet 4 holds its own on agentic tool-use benchmarks where reliability matters more than peak intelligence — Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.
On the current generation, the gap narrows significantly. SWE-bench Verified results for the current models: Sonnet 4.6 hits 79.6%, GPT-5.4 lands around 80%, Opus 4.5/4.6 reaches 80.8–80.9%. The difference between picking GPT-5.4 and Sonnet 4.6 for software engineering work is much smaller than the GPT-5-vs-Sonnet-4 launch numbers suggest.
Cost Per Real Task: The Math That Matters
Token-per-million pricing is hard to internalize. Per-task cost is what you'll actually pay. Five workloads at typical input/output sizes:
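Since the exact mix varies by product, here is the per-task arithmetic sketched over five hypothetical workload shapes. Every token count below is an assumption for illustration, not a measurement; substitute your own logged averages.

```python
# Illustrative per-task cost for hypothetical workload shapes.
# Token counts are ASSUMED for illustration; plug in your own averages.
# Prices are USD per million tokens from the table above.
PRICES = {"gpt-5": (1.25, 10.00), "sonnet-4.6": (3.00, 15.00)}

WORKLOADS = {                      # (input tokens, output tokens), assumed
    "support reply":   (1_500, 300),
    "rag answer":      (6_000, 500),
    "doc summary":     (20_000, 1_000),
    "code review":     (12_000, 2_000),
    "agent loop step": (8_000, 1_500),
}

def task_cost(model, tokens_in, tokens_out):
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1e6

for name, (t_in, t_out) in WORKLOADS.items():
    g = task_cost("gpt-5", t_in, t_out)
    s = task_cost("sonnet-4.6", t_in, t_out)
    print(f"{name}: GPT-5 ${g:.4f} vs Sonnet 4.6 ${s:.4f} ({s / g:.1f}x)")
```

The shape of the result holds across mixes: fractions of a cent per task, with the Sonnet-to-GPT-5 ratio hovering in the 1.6-2.0x band the blended rate predicts.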
One caveat on all per-task numbers: GPT-5's output bill inflates once reasoning_effort reaches medium or above, because reasoning tokens are silent and consume your output budget. Sonnet 4.6 extended thinking has the same dynamic: thinking tokens bill as output, with a configurable budget. Both pricing tables above assume low reasoning effort; high-effort runs change the cost picture.
For the broader question of how to think about per-call cost beyond tokens, see why cost-per-request is the number you should be losing sleep over.
Don't decide on benchmarks. Decide on your traffic.
Preto runs your last 30 days of API traffic against both GPT-5 and Sonnet 4.6, surfaces per-task cost on your actual workload, and recommends which model belongs on which endpoint. No assumptions, no industry-average estimates.
See Which Model You Should Actually Use
Preto analyzes your traffic and recommends the right model. Free to start.
Production Gotchas Neither Pricing Page Mentions
GPT-5 reasoning inflation. The reasoning_effort parameter has five levels (none, low, medium, high, xhigh). xhigh runs roughly 3–5x the cost of low because of hidden reasoning token volume. max_completion_tokens ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see. Set reasoning_effort explicitly on every production call. "Be concise" in the prompt does not control reasoning verbosity. Source: BSWEN GPT-5.4 reasoning effort guide.
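One way to budget for this before it bites. The per-effort ratios of hidden reasoning tokens to visible output below are assumed illustrative values, not published figures; measure your own from the usage field in real responses.

```python
# Estimator for billed output under hidden reasoning. The ratios of
# reasoning tokens to visible tokens per effort level are ASSUMED
# illustrative values, not published figures.
REASONING_RATIO = {"none": 0.0, "low": 0.5, "medium": 2.0, "high": 4.0, "xhigh": 6.0}

def billed_output_tokens(visible_tokens, effort):
    # Visible output plus estimated invisible reasoning tokens;
    # both bill at the output rate.
    return int(visible_tokens * (1 + REASONING_RATIO[effort]))

low = billed_output_tokens(500, "low")      # 750 tokens billed for 500 visible
xhigh = billed_output_tokens(500, "xhigh")  # 3500 tokens billed for 500 visible
print(round(xhigh / low, 2))
```

Under these assumed ratios, the same 500-token visible answer costs roughly 4.7x more output spend at xhigh than at low, which is why an unset effort level is the most common source of surprise bills.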
Sonnet output verbosity. Anthropic's own model documentation notes Sonnet's "engaging responses" and recommends prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5 outputs. The cost implication is direct: more output tokens at $15/M. The mitigation is the same as the GPT-5 reasoning-effort fix — explicit length constraints in the system prompt, plus a max_tokens ceiling.
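A sketch of that mitigation as a raw Messages API payload. The model ID and the prompt wording here are assumptions; adapt both to your stack and SDK.

```python
# One way to cap Sonnet's output spend: a hard max_tokens ceiling plus
# an explicit length constraint in the system prompt. Sketched as a raw
# Anthropic Messages API payload; the model ID is an assumed example.
def build_request(user_message, max_out=1024):
    return {
        "model": "claude-sonnet-4-6",   # assumed ID; check the models list
        "max_tokens": max_out,          # hard ceiling on billed output tokens
        "system": (
            "Answer in at most three short paragraphs. "
            "No preamble, no recap of the question."
        ),
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Summarize the attached incident report.")
print(req["max_tokens"])
```

The ceiling is the safety net, not the fix: a truncated answer is still billed, so the system-prompt constraint does the real work and max_tokens catches the outliers.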
Long-context pricing flip. Above 272K input tokens, GPT-5.4 charges 2x the standard input rate, and GPT-5's window caps out at 400K; Sonnet 4.6's 1M context window has flat pricing throughout. For workloads that regularly use long context — codebases above ~50K lines, multi-document research synthesis, RAG with large retrieval windows — the surcharge erodes GPT-5's per-token advantage, and requests beyond 400K tokens can only run on Sonnet 4.6.
Rate limits. OpenAI GPT-5 (after a September 2025 increase): Tier 1 500K TPM, Tier 2 1M, Tier 3 2M, Tier 4 4M, Tier 5 40M. Source: OpenAI Devs on X. Anthropic: Tier 1 (after $5 deposit) 50 RPM, Tier 2 1K, Tier 3 2K, Tier 4 4K. The structures aren't directly comparable (TPM vs RPM), but production teams hitting bursty traffic on Anthropic regularly need to plan for tier escalation earlier than equivalent OpenAI workloads.
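Comparing the two tier structures at all requires a bridge between tokens-per-minute and requests-per-minute. One rough conversion, where the 2,000 tokens-per-request average is an assumption to replace with your own:

```python
# Rough bridge between Anthropic's RPM caps and OpenAI's TPM caps:
# multiply an RPM cap by average tokens per request. The 2,000-token
# average is an assumption; use your own measured figure.
def rpm_to_tpm(rpm, avg_tokens_per_request=2_000):
    return rpm * avg_tokens_per_request

openai_tier1_tpm = 500_000               # from the figures above
anthropic_tier1_tpm = rpm_to_tpm(50)     # 50 RPM -> ~100K tokens/min
print(openai_tier1_tpm / anthropic_tier1_tpm)
```

Under that assumption, entry-tier OpenAI headroom is several times larger, which is the mechanical reason bursty Anthropic workloads need tier escalation earlier.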
Reliability (December 2025 reference month). IsDown's LLM provider report: Anthropic 20 incidents (7 major), 184.5 hours total downtime. OpenAI 22 incidents (1 major), 182.7 hours. Anthropic had fewer total incidents but more severe ones; OpenAI had more frequent minor incidents. Both providers clear the reliability bar for most workloads, so uptime alone shouldn't decide this choice.
Recent security incidents to be aware of. November 2025 OpenAI Mixpanel breach exposed API portal customer profiles; Anthropic had a Claude Code internal-files exposure incident. Both are public. Source: AI Incident Database. Neither is a reason to switch providers, but both inform the BAA / DPA conversation if you're in a regulated industry.
Geographic / data residency premiums. Anthropic charges a 1.1x multiplier for US-only inference_geo on Opus 4.6+; Bedrock and Vertex regional endpoints add a ~10% premium for Sonnet 4.5+. Source: Anthropic pricing. If your compliance posture requires US-only inference, factor this into the per-task cost.
The Decision Framework
Choose GPT-5 for:
- Math-heavy reasoning, science Q&A, technical analysis
- Structured-output generation at lower cost
- High-volume workloads where the 1.6–2.0x price gap compounds
- RAG and document summarization on documents under 272K tokens
- Customer support replies where per-reply cost matters more than peak quality
Choose Sonnet 4.6 for:
- Agentic workflows with tool-use reliability requirements
- Software engineering agents (code review, refactoring, multi-file patches)
- Long-context workloads — flat pricing across a 1M window, versus GPT-5's 400K cap and 2x surcharge above 272K
- Writing-heavy tasks where coherent verbose output is preferred
- Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice
Choose neither alone for production. The architecture that minimizes cost-per-task across a real product is a routing layer that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, and escalates to GPT-5 or Sonnet 4.6 only for the requests that need the capability. Berkeley's RouteLLM benchmarks demonstrate ~85% cost reduction at 95% quality on routable workloads. The model routing setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.
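A router doesn't need to be sophisticated to capture most of the gain. A minimal sketch, where the model names, keyword list, and length threshold are all illustrative assumptions rather than a recommendation:

```python
# Minimal routing sketch: a cheap heuristic (or a small classifier)
# decides whether a request needs the frontier model. Model names,
# keywords, and the length threshold are illustrative assumptions.
CHEAP, FRONTIER = "gpt-5-nano", "gpt-5"   # or haiku-4.5 / sonnet-4.6

HARD_SIGNALS = ("refactor", "prove", "debug", "multi-step", "analyze")

def route(request_text):
    text = request_text.lower()
    hard = len(text) > 500 or any(s in text for s in HARD_SIGNALS)
    return FRONTIER if hard else CHEAP

print(route("What are your support hours?"))                       # cheap tier
print(route("Refactor this module to remove the global state."))   # frontier
```

In production the heuristic usually becomes a nano-model classification call, with the frontier model as a fallback when the cheap tier's answer fails a confidence check.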
What the Pricing Comparison Doesn't Capture
The 2x price gap between GPT-5 and Sonnet 4.6 is real but it's not the most consequential variable in your bill. The variables that matter more — in roughly this order:
- Whether you're routing at all. A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.
- Whether prompt caching is active. Up to 90% off cached input on both providers. The bug that breaks the cache (timestamps in the prefix, dynamic content at the top of the system prompt) is more expensive than picking the more expensive model.
- Whether reasoning_effort is set. Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently. xhigh on agentic loops is the most common cause of unexpected GPT-5 cost spikes.
- Whether you're on the Batch API for batchable work. Flat 50% off both providers — invisible if your traffic is all real-time, enormous if any non-realtime work is in the mix.
- Then, finally, the per-token rate of the model you picked.
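Of that list, the caching item is the easiest to get wrong in code. A sketch of cache-safe prompt structure; the system prompt content is a placeholder assumption, and the principle is simply that dynamic values never enter the shared prefix:

```python
# Cache hygiene sketch: both providers match cached prefixes
# byte-for-byte, so anything dynamic (timestamps, request IDs, user
# names) must sit AFTER the stable prefix, never inside it.
# Illustrative structure, not an SDK call; the prompt is a placeholder.
import datetime

STATIC_SYSTEM = "You are a support assistant for Acme. Follow the policy below."

def build_messages(user_query):
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        {"role": "system", "content": STATIC_SYSTEM},          # identical every call
        {"role": "user", "content": f"[{now}] {user_query}"},  # dynamic goes last
    ]

a = build_messages("reset my password")[0]["content"]
b = build_messages("cancel my order")[0]["content"]
print(a == b)  # stable prefix across calls, so the cache can hit
```

Move the timestamp one line up into the system prompt and every call becomes a cache miss: full-price input on 100% of traffic, which dwarfs the GPT-5-vs-Sonnet rate gap.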
Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more. The full audit is the seven-tactic playbook we covered separately.
Frequently Asked Questions
Is Claude Sonnet 4 still available in 2026?
Yes, but it is deprecated: the API ID claude-sonnet-4-20250514 retires on June 15, 2026. Anthropic's recommended migration target is Sonnet 4.6, at the same pricing with a larger context window.
What is the cost difference between GPT-5 and Claude Sonnet 4 per million tokens?
GPT-5 is $1.25 input / $10 output per million tokens versus $3 / $15 for Sonnet 4 and 4.6 — 2.4x cheaper on input, 1.5x on output, and roughly 1.6–2.0x cheaper blended on a typical 4:1 input-to-output mix.
Which model wins on production benchmarks?
GPT-5 leads on math and science reasoning (AIME 2025, GPQA Diamond); Sonnet leads on agentic tool-use benchmarks like Tau-Bench. On current-generation SWE-bench Verified the gap is small: 79.6% for Sonnet 4.6 versus around 80% for GPT-5.4.
What are the production gotchas with GPT-5 reasoning tokens?
Reasoning tokens are billed as output but never shown, and they count against max_completion_tokens. xhigh reasoning_effort can run 3–5x the cost of low, so set reasoning_effort explicitly on every production call.
Which model should I use for my production workload?
Usually neither alone: route simple requests to GPT-5 nano or Haiku 4.5 and escalate only the hard ones to GPT-5 or Sonnet 4.6. Decide from per-task cost on your own traffic, not benchmark averages.
Stop guessing. Run the comparison on your own traffic.
Preto pulls your last 30 days of API traffic, runs each request through cost models for GPT-5, GPT-5 mini, Sonnet 4.6, Haiku 4.5, and Opus 4.7, and surfaces per-task cost and quality projection — so you pick on data, not benchmark averages.
See Which Model You Should Actually Use
Free forever up to 10K requests. No credit card required.