DeepSeek V4 Just Dropped — And It's the Cheapest Frontier Model on the Market
A developer's guide to what changed, what it costs, and whether you should actually switch
One year after DeepSeek R1 wiped $600 billion from Nvidia’s market cap in a single day — what Marc Andreessen called AI’s Sputnik moment — the Chinese lab is back.
DeepSeek V4 dropped on April 24, 2026. Same day OpenAI released GPT-5.5. The timing is almost certainly deliberate, but that’s not what makes this interesting.
What makes this interesting is the price tag.
Two Models, One Mission
V4 comes in two variants.
DeepSeek-V4-Pro is the flagship — 1.6 trillion total parameters with 49 billion active per token. The one you reach for on hard tasks.
DeepSeek-V4-Flash is the fast workhorse — 284 billion total parameters, 13 billion active. Surprisingly close to Pro on most benchmarks, at a fraction of the cost.
Both use a Mixture of Experts (MoE) architecture, which is why those numbers aren’t as scary as they sound. Only a small fraction of parameters actually fire for each token — so inference costs stay low even at massive parameter counts. Both support a 1 million token context window. Both are MIT-licensed and available as open weights on Hugging Face.
The headline technical upgrade is the Hybrid Attention Architecture — a new mechanism combining Compressed Sparse Attention and Heavily Compressed Attention. The practical outcome: at 1M tokens, V4-Pro requires only 27% of the inference compute and 10% of the KV cache that V3.2 needed. That’s the difference between a 1M context window being theoretically available and actually being affordable to use in production.
The Pricing. This Is the Part You Need to See.
Let’s skip straight to what matters for anyone shipping products.
DeepSeek V4-Flash costs 0.28 per million output tokens. That makes it the cheapest model at its capability tier — less than every Flash, Mini, and Nano offering from every major Western provider.
DeepSeek V4-Pro costs 3.48 per million output tokens. Less than GPT-5.4’s input price alone. About one-ninth of Claude Opus 4.7’s output cost.
For context, here’s where the major models land on output pricing per million tokens: DeepSeek V4-Flash at 0.50, Gemini 3 Flash at 3.48, GPT-5.4 at 12.00, GPT-5.5 at 75.00.
To make it concrete, imagine a pipeline generating 100 million output tokens per month:
Running on GPT-5.5: $3,000/month. Running on Claude Opus 4.7: $7,500/month. Running on DeepSeek V4-Pro: $348/month.
That’s a 9 to 22x cost difference. Not a rounding error. A different category of economics.
How Does It Actually Perform?
DeepSeek published something unusual alongside the release: a candid self-assessment. They state directly that V4-Pro “trails state-of-the-art frontier models by approximately 3 to 6 months.” You don’t often see AI labs publish their own gap estimates — it’s either genuine intellectual honesty or very smart expectation management. Either way, it’s worth noting.
Here’s where V4 stands on the benchmarks developers actually care about.
Coding. V4-Pro reaches a 3,206 Codeforces rating, ranking 23rd among human competitors worldwide. On SWE-bench and agentic coding tasks, DeepSeek’s internal evaluation places V4-Pro above Claude Sonnet 4.5 and approaching Claude Opus 4.5. V4-Flash alone leads all open-source models in coding benchmarks. Independent verification is still coming, but DeepSeek’s R1 numbers held up almost perfectly under external testing — which gives these claims more weight than the average model-launch hype.
Reasoning and Math. V4-Flash-Max scores 81.0 on Putnam-200 Pass@8, against 35.5 for Seed-2.0-Pro and 26.5 for Gemini 3-Pro. On Putnam-2025 formal math, V4 achieves a proof-perfect 120/120. These aren’t close races.
World Knowledge. V4-Pro leads every open-source model, trailing only Gemini 3.1 Pro among all models — open or closed.
Long Context. The 1M token window is real and usable, not a marketing number. The new attention architecture maintains quality as context grows — a problem that seriously degraded previous long-context models. You can load an entire codebase or a book-length document in a single prompt and it holds together.
The honest summary: V4-Pro isn’t the best model alive. But at its price point, it doesn’t need to be.
The Developer Experience
Migration is one line of code. If you’re already on the DeepSeek API, you change the model string and nothing else. The base URL stays identical. The API supports both OpenAI ChatCompletions and Anthropic API formats — so your existing client code works without modification.
# Before
response = client.chat.completions.create(
model="deepseek-chat",
messages=[...]
)
# After — literally just this
response = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[...]
)Three reasoning modes. Both models support Non-Thinking (fast, direct responses), Thinking (standard chain-of-thought), and Think Max (maximum reasoning budget — set your context window to at least 384K tokens for best results). Flash-Max is surprisingly competitive with Pro-Max on reasoning benchmarks when given enough thinking budget, making Flash a great cost-efficient default for complex pipelines that don’t need Pro’s deeper world knowledge.
⚠️ Deprecation deadline. The old deepseek-chat and deepseek-reasoner endpoints are being retired on July 24, 2026 at 15:59 UTC. They currently route to V4-Flash equivalents, but after that date they stop responding. Put this on your calendar now.
Running it locally. V4-Flash weighs 160GB on Hugging Face — potentially runnable on a 128GB M5 MacBook Pro with light quantization. V4-Pro is 865GB, so you’ll need a multi-GPU setup or cloud infrastructure. Unsloth is already working on quantized versions. For local inference, vLLM 0.9+ with --tensor-parallel-size 2 gives you an OpenAI-compatible endpoint at localhost:8000/v1 — and any existing OpenAI client points straight at it with a one-line URL change.
The Geopolitics. You Can’t Skip This Part.
DeepSeek V4 runs on Huawei Ascend 950 chips — not Nvidia GPUs. Huawei announced that its Ascend supernode fully supports V4 out of the box, making the model deployable at scale on Chinese domestic hardware, entirely outside US export controls. This is the first frontier-class open model that doesn’t need American chips to run at scale.
For developers, this has a very practical consequence: data residency and regulatory risk. Multiple US states, Australia, South Korea, and several EU countries introduced restrictions on DeepSeek R1 citing national security and data privacy concerns. V4 is subject to the same landscape.
If you’re building for regulated industries — government, healthcare, defense, fintech — check your compliance requirements before routing production traffic through DeepSeek’s hosted API. If you’re self-hosting V4, the model weights are MIT-licensed and those concerns largely evaporate. The weights are just weights.
So — Should You Switch?
Reach for V4-Flash when you’re running high-volume pipelines where cost dominates, you need fast responses on tasks that don’t require deep world knowledge, or you want a cheap first-pass model in a tiered routing setup. At 0.28 per million tokens, it’s the new default for anything that doesn’t need frontier performance.
Reach for V4-Pro when you need frontier-level coding and agentic reasoning but the 9–22x price difference over GPT-5.5 or Claude Opus 4.7 is eating your margins, or when you’re working with large codebases where the efficient 1M context window actually matters.
Stay on closed models when you’re in a regulated environment where Chinese-hosted infrastructure is a compliance issue, you need the absolute ceiling of performance (GPT-5.5 leads on terminal agents, Claude Opus 4.7 leads on reasoning and writing quality), or you need deep ecosystem integration — Grok 4 for real-time X data, Gemini for Google Workspace.
The Bottom Line
DeepSeek V4 is the clearest proof yet that frontier AI performance and frontier AI pricing are being permanently decoupled.
A year ago, R1 proved you didn’t need $100M in compute to build a world-class reasoning model. V4 extends that thesis: you don’t need to pay OpenAI or Anthropic prices to get within 3–6 months of the frontier.
Independent benchmarks will close the story over the next week or two. But given DeepSeek’s track record — every self-reported R1 number held up under external scrutiny — V4’s claims deserve to be taken seriously.
If you’re not running V4 in a test environment by end of this month, you’re leaving a significant cost optimization on the table. At these prices, the experiment costs you almost nothing.
Links
DeepSeek V4 on Hugging Face — open weights, MIT license
DeepSeek API Docs — migration guide, thinking mode reference, deprecation timeline
Full Technical Report (PDF) — DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
More from AI For Developers
This newsletter is part of AI For Developers — a growing directory of AI developer tools, APIs, frameworks, and resources. If you’re evaluating tools for your stack or just want to stay on top of what’s out there, check it out:
🔗 AI For Developers — Browse the directory
📬 AI For Developers newsletter — Subscribe to the newsletter
Every issue covers one topic in depth — no fluff, no hype, just the stuff you need to build with AI. Subscribe if you haven’t already, and I’ll see you in the next one.







(2) For all models, the input cache hit price has been reduced to 1/10 of the launch price. This price adjustment takes effect from 2026/4/26 12:15 UTC.
(3) The deepseek-v4-pro model is currently offered at a 75% discount, extended until 2026/05/31 15:59 UTC
Nice
Another week, another whale surfaces to nuke everyone's profit margins and completely reset the hype clock. It’s honestly exhausting watching the world panic over having a million token context window when most of us barely have enough genuinely interesting thoughts to fill a single postcard