DeepSeek V4 Arrives: Trillion-Parameter MoE, Chinese LLMs Finally Deliver

After nearly a month of waiting, DeepSeek V4 is finally here.

Honestly, this release is even more hardcore than I expected: a trillion-parameter MoE architecture, a 40% training efficiency improvement, and it is still open-source.

More importantly, its performance numbers are impressive: across multiple benchmarks, DeepSeek V4 can now go toe-to-toe with GPT-5.4 and Claude Opus 4.6.

This reminds me of last month’s “false alarm,” when the internet thought DeepSeek V4 had been released but it turned out to be Xiaomi’s MiMo-V2.

This time it’s real.

What Is MoE Architecture?

MoE (Mixture of Experts) isn’t a new concept, but DeepSeek V4 has taken it to the extreme.

Simply put, traditional LLMs use “full parameter activation”—every inference engages all parameters, requiring massive computation.

MoE uses “sparse activation”—while total parameters are in the trillion range, each inference only activates a small portion (estimated 10%-20%), dramatically reducing computational cost.

An analogy: traditional LLMs are like a “generalist employee”—knows everything but must deploy all knowledge every time; MoE is like an “expert team”—different problems go to different experts, much more efficient.

DeepSeek V4’s MoE architecture, according to official disclosure, has 128 “expert networks,” with 16-20 activated per inference.

What does this mean?

A trillion-parameter model whose inference only pays for the compute of the activated fraction: roughly 100 to 200 billion parameters, going by the 10%-20% estimate above. That is extremely cost-effective.
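A quick back-of-the-envelope check of these numbers may help. The sketch below uses only the figures quoted above (a trillion total parameters, 128 experts, 16-20 active) and makes one loud simplifying assumption: that every parameter sits in one of the equally sized experts. The function names are mine, purely for illustration.

```python
# Back-of-the-envelope sketch; all inputs are figures quoted in the article,
# not measured values.
TOTAL_PARAMS = 1_000_000_000_000   # "trillion-parameter", assumed to be exactly 1e12
NUM_EXPERTS = 128                  # expert count quoted in the article


def active_fraction(active_experts: int, num_experts: int = NUM_EXPERTS) -> float:
    """Share of experts used per inference step."""
    return active_experts / num_experts


def active_params(active_experts: int, total_params: int = TOTAL_PARAMS) -> float:
    # Simplifying assumption: every parameter lives in one of the equally
    # sized experts. Real MoE models also have shared, always-on layers.
    return total_params * active_fraction(active_experts)


for k in (16, 20):
    print(f"{k} of {NUM_EXPERTS} experts -> {active_fraction(k):.1%} active, "
          f"about {active_params(k) / 1e9:.0f}B parameters")
```

Under this simplification, both ends of the 16-20 range fall inside the 10%-20% activation estimate. The real active count depends on how many parameters sit in shared layers, which the official disclosure quoted here does not break down.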

Performance Data: Chinese LLMs Finally Deliver

The official numbers are solid:

  • Coding Ability: In SWE-bench Pro, DeepSeek V4 scores 78.2%, close to GPT-5.4’s 79.1%.

  • Mathematical Reasoning: In MATH benchmark, DeepSeek V4 scores 92.1%, surpassing Claude Opus 4.6’s 89.7%.

  • Long Context Processing: Supports a 128K context window. That trails GPT-5.4’s 200K, but it is sufficient for most uses.

  • Training Efficiency: 40% improvement over V3—this metric is crucial, showing domestic compute utilization efficiency is rising.

What surprised me most is its cost control: the official statement says DeepSeek V4’s inference cost is 60% lower than GPT-5.4’s and 50% lower than Claude Opus 4.6’s.

How did they achieve this?

The answer is MoE architecture—sparse activation dramatically reduces computation, naturally lowering costs.
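To put the quoted discounts on a common scale, it can help to flip them around and ask how many times more each competitor would cost. This is plain arithmetic on the officially claimed percentages, which I have not verified independently; the helper name is hypothetical.

```python
def relative_cost(discount: float) -> float:
    """How many times DeepSeek V4's cost the baseline is, if V4 is `discount` cheaper."""
    return 1.0 / (1.0 - discount)


# Discounts as quoted in the article (official claims, not independently verified).
quoted = {"GPT-5.4": 0.60, "Claude Opus 4.6": 0.50}

for model, discount in quoted.items():
    print(f"{model}: about {relative_cost(discount):.1f}x DeepSeek V4's inference cost")
```

In other words, a 60% discount means the baseline costs about 2.5 times as much, and a 50% discount means about twice as much.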

Ray’s Take: MoE Is Chinese LLMs’ “Overtaking Opportunity”

Why was DeepSeek able to build a trillion-parameter MoE?

I think there are three key factors:

  1. Innovation Forced by Compute Constraints: Domestic compute chips (Huawei Ascend, Cambricon) still lag Nvidia in single-card performance, but have advantages in large-scale cluster scheduling. MoE architecture naturally suits distributed training, playing to domestic compute strengths.

  2. Engineering Breakthroughs: MoE architecture’s biggest challenge is “expert routing,” meaning how the model knows which expert to consult for which problem. DeepSeek has deep engineering experience here, having worked on MoE since V2.

  3. Long-term Value of Open-Source Strategy: DeepSeek has been open-source from day one, attracting many developers who contribute code and report issues. For this V4 launch, the community has already contributed 200+ optimization patches.
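The “expert routing” in point 2 can be made concrete with a toy example. What follows is a minimal sketch of generic top-k gating as commonly described for MoE models, not DeepSeek’s actual router, whose details are not given here. The toy experts, the random gate scores, and all function names are assumptions for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def moe_forward(x, gate_logits, experts, k):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


# Hypothetical toy setup: 128 "experts" that just scale their input,
# with random gate scores standing in for a learned router.
random.seed(0)
experts = [lambda x, s=i: x * (s + 1) for i in range(128)]
gate_logits = [random.gauss(0.0, 1.0) for _ in range(128)]

chosen = route_top_k(gate_logits, k=16)
print(len(chosen), "experts selected out of", len(experts))
print("output:", moe_forward(1.0, gate_logits, experts, k=16))
```

The point of the sketch is that the other 112 experts are never called, which is where the compute savings come from. In a real model the gate scores come from a small learned layer, and keeping the load balanced across experts is a large part of the engineering difficulty.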

This reminds me of Huawei’s chip strategy: when you can’t catch up on an individual technology, compensate through system-level optimization.

DeepSeek V4 follows this thinking: when single-card performance isn’t enough, boost overall efficiency through architectural innovation.

A Small Detail: DeepSeek V4’s “Domestic Compute” Ratio

Official numbers aren’t disclosed, but from various sources, I estimate domestic compute (Huawei Ascend, Cambricon) accounts for 30%-40% of DeepSeek V4’s training.

This is a significant increase from V3’s 20%.

Why does this matter?

Because the availability of domestic compute directly determines how “self-controllable” Chinese LLMs are.

If DeepSeek V4’s training relied entirely on Nvidia GPUs, it would still be exposed to “stranglehold” risks, no matter how open-source it is.

But now, with a rising share of domestic compute, Chinese LLMs are gradually reducing their dependence on overseas chips.

This is a long-term trend. The difference may not show in the short term, but if US-China tech competition intensifies further, this “domestic compute reserve” will prove its value.

The Controversy: Are Trillion Parameters Just “Bloat”?

Some ask: a trillion parameters sounds impressive, but if actual inference only activates a small fraction of them, is this just “bloat”?

I think this criticism misses the point.

MoE architecture’s essence is “large parameter count + sparse activation”—this isn’t “bloat,” it’s “lean organization.”

An analogy: you have a 1000-person company, but only send 100 people per project. Is this “bloat”? No, it’s “specialized division of labor.”

The key metric isn’t “total parameters,” but “cost-effectiveness”—for the same performance, who has lower cost?

DeepSeek V4 has proven itself here: its inference cost is 60% lower than GPT-5.4. That’s real capability.

Ray’s Prediction: Chinese LLMs Enter “Architectural Innovation” Phase

2024 was Chinese LLMs’ “catch-up year”—desperately scaling parameters to close the gap with GPT-4.

2025 was the “application year”—focusing on practical deployment, not just benchmark scores.

2026, I think, is the “architectural innovation year”: achieving performance breakthroughs under compute constraints through MoE, mixed-precision training, and distributed inference.

DeepSeek V4 is a typical representative of this trend.

It proves: Chinese LLMs don’t need to compete head-on with overseas giants on “parameter scale”—they can find their own path through architectural innovation.

The significance of this is no less than that of Huawei making 5G chips on a 7nm process.

Final Thoughts

DeepSeek V4’s release gives me a bit more confidence in China’s AI industry.

Not just because of its strong performance, but because it demonstrates a possibility: achieving breakthroughs through technical innovation even under compute constraints.

This kind of “innovation born from constraints” is often more valuable than just throwing money at the problem.

I give DeepSeek V4 90 points, docking 10 because its long-context capability still has room for improvement.

But this starting point is already high enough.

(The End)