LLM Rankings April 2026: GPT-6 Has Arrived — But Can Claude Keep Its Crown?

On April 14, OpenAI released GPT-6.

My social feeds blew up all day: “Finally here,” “40% performance boost,” “AGI has arrived.”

Honestly, I’m immune to marketing speak by now. What I really care about is: Is GPT-6 actually strong? How does it compare to Claude and Gemini?

So I spent two weeks testing all the major models: GPT-6, Claude Opus 4.7, Gemini 3.1 Pro, DeepSeek V4, Kimi K2.5, GLM-5, and more.

I kept the testing dimensions simple: coding, reasoning, and multimodal. All of them are real scenarios I encounter daily.

Here’s my verdict.

Coding: Claude Still the King

For the coding tests, I used three tasks: writing a complex React component, fixing a bug, and refactoring legacy code.

The result was surprising: Claude Opus 4.7 performed best.

GPT-6’s code quality is also good, but it has one problem: it’s too “by the book.” When a decision involves tradeoffs, it always gives the “standard answer” rather than the “answer that fits this project.”

For example, I asked, “Should this API have caching?” It answered, “Recommend adding a cache for performance.” That sounds fine, but it never asked “What’s your QPS?” or “How often does your data update?”

Claude asks those background questions first, then gives targeted advice. That’s the difference.
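To make those two background questions concrete, here is a minimal sketch of how QPS and update frequency could feed a caching decision. The `EndpointProfile` type, the `should_cache` function, and both thresholds are hypothetical and chosen only for illustration; they come from neither model's answer nor any benchmark.

```python
from dataclasses import dataclass


@dataclass
class EndpointProfile:
    qps: float                # requests per second hitting the endpoint
    update_interval_s: float  # average seconds between changes to the data


# Illustrative thresholds only (assumptions, not measured values).
MIN_QPS_FOR_CACHE = 50.0
MIN_READS_PER_UPDATE = 100.0


def should_cache(profile: EndpointProfile) -> bool:
    """Return True when a cache looks likely to pay off for this endpoint."""
    # Low traffic: the extra moving part is probably not worth it.
    if profile.qps < MIN_QPS_FOR_CACHE:
        return False
    # Estimate how many reads a cached value serves before it goes stale.
    reads_per_update = profile.qps * profile.update_interval_s
    return reads_per_update >= MIN_READS_PER_UPDATE


if __name__ == "__main__":
    busy_and_stable = EndpointProfile(qps=500, update_interval_s=60)
    quiet = EndpointProfile(qps=2, update_interval_s=3600)
    busy_but_volatile = EndpointProfile(qps=500, update_interval_s=0.1)
    for profile in (busy_and_stable, quiet, busy_but_volatile):
        print(profile, should_cache(profile))
```

The point is not the specific numbers but that the same question gets different answers depending on the project, which is why a blanket “add a cache” reply falls short.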

DeepSeek V4 also surprised me. For a Chinese model to reach this level is genuinely impressive. There’s still a gap between it and Claude and GPT-6 overall, but in Chinese-language scenarios it actually performs better.

Reasoning: GPT-6 Takes the Lead

For reasoning tests, I used math problems, logic puzzles, and a multi-step real-world problem.

GPT-6 is indeed strong in this dimension. On complex multi-step reasoning, it clearly shows each thinking step and is highly accurate.

Claude follows closely, with a small gap. But Claude has an advantage: its reasoning process is more “human-like,” easier to understand.

Gemini 3.1 Pro performs solidly. Its strength is broad knowledge, but deep reasoning falls short of GPT-6 and Claude.

One interesting finding: Kimi K2.5 performs well on long-text reasoning. Give it a 50-page technical document, and it accurately extracts key information and makes inferences. Other models often “lose memory” on this.

Multimodal: Each Has Its Strengths

Multimodal tests included image understanding, image generation, and video understanding.

Gemini 3.1 Pro is strongest at image understanding. Give it a screenshot, and it not only identifies the content but also understands the underlying intent.

GPT-6 performs best at image generation. Its images match expectations better and have more refined details.

Claude is relatively weak on multimodal. Anthropic clearly lags behind Google and OpenAI in this direction.

Pricing: Who’s More Cost-Effective

Capabilities aren’t enough. You’ve gotta consider the wallet too.

GPT-6 is the most expensive at $15 per million tokens. Claude Opus 4.7 is slightly lower at $12.

DeepSeek V4 is the cheapest at just $3. Best bang for the buck.

Gemini and GLM-5 fall in the middle.

Honestly, if budget allows, you can’t go wrong with Claude or GPT-6. But if cost-effectiveness matters, DeepSeek V4 and GLM-5 are solid choices, especially for Chinese-language scenarios.
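To see what the price gap means in practice, here is a small back-of-the-envelope calculator. It uses only the three per-million-token prices quoted above and treats each as a single blended rate, which is an assumption, since real pricing usually distinguishes input and output tokens. The 200 million tokens per month figure and the `monthly_cost_usd` helper are hypothetical.

```python
# Prices per million tokens as quoted in this article (treated as blended rates).
PRICE_PER_MILLION_USD = {
    "GPT-6": 15.0,
    "Claude Opus 4.7": 12.0,
    "DeepSeek V4": 3.0,
}


def monthly_cost_usd(model: str, tokens_per_month: int) -> float:
    """Estimate monthly spend for a model at a given token volume."""
    return PRICE_PER_MILLION_USD[model] * tokens_per_month / 1_000_000


if __name__ == "__main__":
    volume = 200_000_000  # hypothetical workload: 200M tokens per month
    for model in PRICE_PER_MILLION_USD:
        print(f"{model}: ${monthly_cost_usd(model, volume):,.2f}")
    # GPT-6: $3,000.00
    # Claude Opus 4.7: $2,400.00
    # DeepSeek V4: $600.00
```

At that volume the spread between the most and least expensive option is $2,400 a month, which is the kind of number worth weighing against the quality gap.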

My Rankings

Overall, my rankings are:

Coding: Claude Opus 4.7 > GPT-6 > DeepSeek V4

Reasoning: GPT-6 > Claude Opus 4.7 > Gemini 3.1 Pro

Multimodal: Gemini 3.1 Pro > GPT-6 > Claude Opus 4.7

Cost-effectiveness: DeepSeek V4 > GLM-5 > Kimi K2.5

But don’t treat rankings as gospel. Different scenarios will have different optimal choices.

My personal usage pattern: Claude for coding, GPT-6 for reasoning, Gemini for images, and Chinese models for Chinese content, where they genuinely work better.
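That usage pattern can be written down as a tiny routing table. This is only a sketch: the `pick_model` function and the task labels are hypothetical, picking DeepSeek V4 as the stand-in for "Chinese models" is my assumption, and so is letting the Chinese-content flag take priority over the task type.

```python
# Task-to-model routing that mirrors the usage pattern described above.
ROUTES = {
    "coding": "Claude Opus 4.7",
    "reasoning": "GPT-6",
    "images": "Gemini 3.1 Pro",
}

# Assumption: one representative Chinese model; swap in whichever you prefer.
CHINESE_CONTENT_MODEL = "DeepSeek V4"


def pick_model(task: str, chinese_content: bool = False) -> str:
    """Return the model name to use for a task."""
    # Assumption: Chinese content overrides the task-based choice.
    if chinese_content:
        return CHINESE_CONTENT_MODEL
    if task not in ROUTES:
        raise ValueError(f"no route defined for task: {task!r}")
    return ROUTES[task]


if __name__ == "__main__":
    print(pick_model("coding"))
    print(pick_model("reasoning"))
    print(pick_model("images"))
    print(pick_model("coding", chinese_content=True))
```

Keeping the mapping in one place also makes it cheap to update when the rankings shift.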

One last thing: LLM technology iterates incredibly fast. These rankings might be obsolete in two months.

So don’t obsess over who’s first or second. Finding the tool that fits your scenario is what matters most.