Alibaba Qwen 3.6-Max-Preview Hands-On: China's LLM "Top Student" Submits Its Test

Alibaba quietly released Qwen 3.6-Max-Preview yesterday, and I immediately applied for beta access.

Honestly, I’ve always been skeptical of Alibaba’s models. Not because of technical shortcomings, but because the marketing is always so over-the-top: every release claims to be “the strongest,” “number one,” or “surpassing” its competitors, yet the actual experience usually falls slightly short.

But after testing this one, I have to admit: there’s something here.

Let’s start with the most obvious strength: Chinese comprehension. I threw several “hell-level” test cases at it: idiom chains, classical poetry continuation, and internet slang interpretation. Qwen 3.6 performed more consistently than GPT-5.4, and it was noticeably more accurate on those “you just have to feel it” Chinese expressions.

For example, I asked it to explain “绝绝子” (juejuezi) in different contexts. It not only gave the literal meaning but also traced how the term’s emotional tone has evolved: from purely positive in early usage, to mixed connotations, to today’s slightly mocking tone. GPT-5.4 doesn’t handle this kind of semantic drift as delicately.

Coding capabilities also surprised me. I had it refactor a Python project, and the spaghetti code came out remarkably clean. More surprisingly, it proactively identified several potential concurrency issues and suggested specific fixes. This wasn’t just “code formatting”; it genuinely understood the business logic.

But weaknesses are equally apparent.

First, multimodal capabilities. It claims to support image-and-text understanding, but in practice the accuracy of its image descriptions is mediocre. Sometimes it “hallucinates,” looking at a cat and calling it a dog, even if it gets the color right.

Second, reasoning depth. On multi-step mathematical derivations, Qwen 3.6 tends to “drift” partway through. I tested it on a high school geometry problem: it was correct for the first few steps, then suddenly applied the wrong theorem at the final step. This kind of “fumbling at the finish line” rarely happens with Claude.

Another amusing point: it seems tuned to be “too humble.” Every answer ends with “The above information is for reference only; please verify it against your actual circumstances.” I understand this is for safety, but at that frequency it disrupts the experience, like chatting with an overly cautious customer service agent.

According to the Stanford AI Index report, Alibaba ranks third globally and first in China for AI contributions, so there is genuine skill behind this. Qwen 3.6-Max at least proves that Chinese models can compete with top international players in specific scenarios.

But let me pour some cold water: a good model doesn’t equal a good product. Alibaba’s execution on consumer products has always been puzzling; leading technology paired with a poor user experience happens too often. Whether Qwen 3.6 truly benefits ordinary users depends on how well Alibaba turns it into actual products.

One final detail. During testing, I noticed response speeds improved significantly over the previous generation, especially for long-form content. These “engineering optimizations” often matter more than raw model capabilities for user experience. Alibaba clearly invested effort here.

Would you use Tongyi Qianwen (Qwen) in your daily workflow, or stick with ChatGPT/Claude?