Stanford's AI Index: China's Models Now Just 2.7% Behind the US
On April 13th, Stanford HAI (Institute for Human-Centered AI) released its annual AI Index Report. It runs to hundreds of pages and is widely treated as the industry’s annual physical exam.
The headline finding: As of March 2026, Anthropic’s top models lead Chinese competitors like ByteDance by just 2.7% on comprehensive benchmarks covering language understanding, math reasoning, coding ability, and common sense.
2.7%. That’s a headline number.
Here’s how they got it. Stanford uses HELM (Holistic Evaluation of Language Models) across multiple dimensions. They ran every major model through the full suite and produced composite scores.
Results: Anthropic’s Claude 3.7 Sonnet scored around 0.89 and ByteDance’s Doubao scored 0.87. That is the 2.7% gap. But once you factor in confidence intervals, the two scores overlap: statistically speaking, “America’s best” and “China’s best” are not significantly different.
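To make the arithmetic concrete, here is a minimal sketch of how a relative gap and an interval-overlap check could be computed. The function names are my own, and the interval half-width is a purely hypothetical placeholder, since the actual confidence intervals are not quoted here. Note that the rounded scores above give a gap closer to 2.3%, so the 2.7% headline presumably rests on unrounded values or a slightly different calculation.

```python
def relative_gap(leader: float, follower: float) -> float:
    """Gap between two scores, expressed as a fraction of the follower's score."""
    return (leader - follower) / follower


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if the closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Rounded composite scores as quoted in the text.
leader, follower = 0.89, 0.87
gap = relative_gap(leader, follower)
print(f"relative gap: {gap:.1%}")

# HYPOTHETICAL half-width, chosen only for illustration; not from the report.
half_width = 0.015
leader_ci = (leader - half_width, leader + half_width)
follower_ci = (follower - half_width, follower + half_width)
print("intervals overlap:", intervals_overlap(leader_ci, follower_ci))
```

Overlapping intervals are a rough heuristic rather than a formal significance test, but they are enough to show why a gap this small should be read with caution.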
This contrasts sharply with where things stood a year ago. April 2025’s report showed roughly a 15% gap. That’s quite a closing speed.
Why the rapid catch-up?
I see three factors. First, open-source models. Much of this catch-up comes from open-source models such as DeepSeek, Qwen, and GLM. They’re not building in isolation; they iterate fast on top of the open-source community, which lowers the barrier to entry and lets them draw on global code, data, and methodology.
Second, richer application scenarios. China’s market is enormous — from e-commerce customer service to short video recommendation, autonomous vehicles to medical imaging. AI deployment scenarios are dense and numerous. That volume of real-world data feedback is hard to replicate in lab environments.
Third, talent mobility. I’m not an expert here, but it’s hard to deny that Chinese researchers make up an outsized proportion of top AI minds globally. Many who trained in the US returned home, bringing methodology and engineering experience with them.
But — and this is a big but — benchmark gap narrowing doesn’t equal actual capability gap narrowing.
Why? Because benchmarks test performance on specific tasks, and a given suite may over-weight certain capabilities. A model that is strong in coding will score disproportionately well on coding-heavy benchmarks. But if real-world applications demand creative writing or true multimodal understanding, rankings shift.
There’s a deeper issue: benchmarks can be overfitted to. A model that’s indirectly trained on benchmark data — through data contamination, task leakage, or similar — will inflate its scores. This isn’t conspiracy theory; there are precedents.
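One common way to look for this kind of leakage is to check whether benchmark items share long word sequences with the training corpus. The sketch below is only an illustration of that n-gram overlap heuristic, not the method used by Stanford or any particular lab. The function names, the toy data, and the choice of n are all assumptions of mine.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: list, corpus_docs: list, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0


# Toy, made-up data purely for illustration.
benchmark = [
    "What is the capital of France and its population",
    "Solve for x in two x plus three equals seven",
]
corpus = [
    "trivia dump: what is the capital of france and its population today",
    "unrelated text about cooking pasta at home tonight",
]
rate = contamination_rate(benchmark, corpus, n=4)
print(f"flagged items: {rate:.0%}")
```

A check like this only catches near-verbatim leakage. Paraphrased or translated benchmark items would slip through, which is part of why contamination is hard to rule out.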
My take: 2.7% is a signal, not a conclusion. It tells me Chinese models have genuinely entered “comparable” territory, no longer “two generations behind.” But whether the gap in complex reasoning, long-horizon planning, and true multimodal understanding has closed similarly — I have reservations.
What’s actually more interesting: this competition has stopped being a simple “US vs China” binary. Looking at recent releases, Anthropic and OpenAI are both American companies, yet OpenAI’s investors include SoftBank, which is Japanese capital. Europe is pushing its own trustworthy AI framework. AI competition is shifting from “national competition” to “ecosystem competition.”
Do you think China’s models are really just 2.7% behind the US? Or is there something more complicated going on with this number?