Qwen 3.6-Plus Ranks #2 Globally in Coding: Chinese Models Are Finally Competing on the Right Track
Last Wednesday I was debugging a RAG pipeline and, on a whim, gave the same code snippet to three models: Claude Opus 4.5, GPT-5.4, and the newly launched Qwen 3.6-Plus.
The results surprised me.
Qwen not only pinpointed the bug but also refactored a piece of async logic that I myself thought was poorly written. Honestly, my first reaction was “this doesn’t look like the level of a Chinese model.”
On April 2nd, Alibaba officially released Qwen 3.6-Plus. This isn’t another routine “bigger parameters, higher benchmarks” upgrade: it ranked second in Code Arena’s global blind test for programming. What does second place mean? Only the Claude series is ahead; the models from OpenAI and Google are behind.
This ranking comes from a blind test, not from Alibaba’s own benchmark. Code Arena works much like Chatbot Arena: real developers submit programming tasks, models generate code anonymously, and human reviewers vote. Want to game the score? Sorry, you don’t even know who your opponent is.
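To make the blind-vote mechanism concrete, here is a minimal sketch of how pairwise votes could be turned into a ranking. It assumes an Elo-style update, which is one common choice for arena leaderboards; how Code Arena actually scores models isn't described here, and the model names, starting rating, and K-factor are all placeholders.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings: Dict[str, float], winner: str, loser: str,
                   k: float = 32.0) -> None:
    """Shift rating points from loser to winner after one blind vote."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_from_votes(votes: Iterable[Tuple[str, str]],
                    initial: float = 1000.0, k: float = 32.0) -> List[str]:
    """Replay (winner, loser) votes in order and return models, best first."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_ratings(ratings, winner, loser, k)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
print(rank_from_votes(votes))
```

The point of the sketch is that a rating only moves when a reviewer prefers one anonymous answer over another, so the leaderboard reflects accumulated human preference rather than a fixed test set.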
Let’s talk specifically about where 3.6 is strong.
Programming isn’t about writing Hello World; it’s about engineering capability. Qwen 3.6 performs strongly on the SWE-bench series, Terminal-Bench 2.0, and NL2Repo, all of them “repository-level” evaluations. What does repository-level mean? The model isn’t asked to write a single function; it is handed a project with tens of thousands of lines and told to “fix issue #347.” It has to find the files, understand the context, modify the code, and run the tests on its own.
This is where AI programming is truly useful. Asking a model to write a sorting algorithm is an interview question, not productivity.
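The repository-level idea can be sketched as a tiny harness: copy a project, let a candidate fix modify it, then run the project's tests and see whether they pass. This is only an illustration of the general shape of such an evaluation, not how SWE-bench or any other benchmark is implemented; `evaluate_patch` and its parameters are hypothetical names.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def evaluate_patch(repo_dir: Path,
                   apply_patch: Callable[[Path], None],
                   test_cmd: Sequence[str],
                   timeout: float = 60.0) -> bool:
    """Return True if the tests pass after the candidate fix is applied.

    repo_dir    -- the original project (left untouched)
    apply_patch -- callable that edits files in the working copy; in a real
                   evaluation this is where the model's changes would go
    test_cmd    -- command that exits with status 0 when the tests pass
    """
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(repo_dir, work)   # work on a throwaway copy
        apply_patch(work)                 # the candidate fix
        result = subprocess.run(list(test_cmd), cwd=work,
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
```

The verdict comes from the project's own tests rather than from how plausible the code looks, which is what separates this kind of evaluation from a write-a-function quiz.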
Another interesting point is visual agent programming. A designer hands over a UI screenshot, and 3.6 can generate frontend code directly from it. Claude has this capability too, but Qwen claims to be more accurate at understanding Chinese-language interfaces. I haven’t tested it yet, so I won’t draw conclusions, but if true, it addresses a real pain point for frontend developers in China.
Um… but I need to pour some cold water on this.
Strong benchmarks don’t equal good usability. I’ve been using Claude Code for half a year. Its advantage isn’t just that the code is correct; it’s the entire interaction experience: context understanding, coherence across multi-turn conversations, and self-correction when errors occur. None of these show up in benchmarks.
Then there is price. Qwen 3.6’s API pricing is 2 RMB per million input tokens. Claude Opus 4.5? About 20-30 times that. With a gap that large, if Qwen really delivers 80-90% of the capability, its cost-effectiveness advantage for budget-sensitive teams is crushing.
Another signal worth watching: Alibaba explicitly stated this time that 3.6-Plus is just the appetizer, and that the flagship Qwen 3.6-Max will be released soon. In other words, the current global #2 isn’t even their full-power version.
My personal feeling is that Chinese large models are finally starting to compete in the right direction. Previously they competed on parameter scale, token price, and Chinese benchmarks, which was honestly all “involution.” Now they are going head-to-head with Claude in global blind tests, and that is truly meaningful competition.
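To put the price gap in concrete terms, here is a rough back-of-the-envelope calculation. It uses only the two figures quoted above (2 RMB per million input tokens, and a 20-30x multiplier for Claude Opus 4.5); the monthly token volume is a made-up workload, and output-token pricing is left out because it isn't given here.

```python
QWEN_INPUT_RMB_PER_MTOK = 2.0        # figure quoted in this article
CLAUDE_MULTIPLIER_RANGE = (20, 30)   # "about 20-30 times", also from this article


def monthly_input_cost(tokens_per_month: int, rmb_per_mtok: float) -> float:
    """Input-token cost in RMB for a given monthly volume."""
    return tokens_per_month / 1_000_000 * rmb_per_mtok


def compare(tokens_per_month: int) -> dict:
    """Estimate monthly input cost for Qwen and a low/high range for Claude."""
    qwen = monthly_input_cost(tokens_per_month, QWEN_INPUT_RMB_PER_MTOK)
    low, high = (qwen * m for m in CLAUDE_MULTIPLIER_RANGE)
    return {"qwen_rmb": qwen, "claude_rmb_low": low, "claude_rmb_high": high}


# Hypothetical team sending 500 million input tokens per month.
print(compare(500_000_000))
```

For this assumed workload, the estimate comes to about 1,000 RMB a month against 20,000 to 30,000 RMB, which is the kind of difference that a budget-sensitive team would notice.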
Of course, one benchmark doesn’t tell the whole story. The real test is: three months from now, how many developers will switch their daily programming tools from Claude Code or Copilot to Qwen?
The answer to that question is more honest than any benchmark.