2026-06-16

An open-weights Chinese model beat Claude, GPT-5.5, and Gemini in a coding contest

If you build with Claude or Claude Code, this is the kind of result that forces a reset. The interesting part isn’t just that an open-weights Chinese model won a programming challenge — it’s that the winner came from a messy, real-time, server-connected task where code quality, strategy, and failure modes all mattered.

Key Points

In Rohana Rezel’s ongoing AI Coding Contest, Day 12 used the Word Gem Puzzle: a sliding-tile word game with a TCP server, a 10-second limit per round, and objective scoring.
Kimi K2.6, an open-weights model from Moonshot AI, won with 22 match points and a 7-1-0 record.
MiMo V2-Pro from Xiaomi came second; GPT-5.5 finished third; Claude Opus 4.7 finished fifth; Gemini Pro 3.1 finished sixth.
The contest had 10 models entered, but Nvidia’s Nemotron Super 3 failed to connect because its code had a syntax error, so 9 models actually competed.
The puzzle rewards longer words and penalizes short ones. Words under seven letters lose points, while seven-letter-plus words score positively.
The boards range from 10×10 to 30×30. On larger boards, the scramble destroys most seeded words, making active tile movement much more important.
Kimi won by sliding aggressively and using a greedy strategy: evaluate moves by the new positive-value words they unlock, then execute the best one.
MiMo mostly didn’t slide at all; it scanned for long words in the initial grid and claimed them in one burst. That was brittle, but good enough to finish second.
Claude also didn’t slide, and the article says that hurt it badly on the 30×30 grids where movement was necessary.
GPT-5.5 used a more conservative sliding strategy and did relatively well on larger boards.
GLM 5.1 was extremely aggressive with sliding, but stalled when it ran out of positive moves.
DeepSeek sent malformed output every round.
Muse claimed every valid word it could find, including short ones, and ended with an enormous negative score of -15,309.
The author argues the biggest lesson is that the 30×30 boards sharply separated static scanners from active solvers.
The article also notes that this is not a simple “China beats the West” story: the winners were two specific models, not a blanket regional sweep.
The broader takeaway is that open-weights models are now close enough to frontier models that they can win real-world coding tasks, not just benchmark-style prompts.

My Take

What strikes me is how unglamorous the winning behavior was. Kimi didn’t “reason” its way to a magical solution; it used a greedy loop, slid a lot, and kept making progress when the static approaches ran out of road. That’s actually useful to see, because a lot of coding-agent marketing talks as if elegance wins by default. In practice, persistence and task fit often beat sophistication.

I think the most interesting detail here is that several strong models didn’t bother to move the board much, or at all. That sounds absurd on a sliding puzzle, but it’s a good reminder that model competence is highly shaped by the exact environment. A model can be excellent at code generation and still be awkward when the task demands stateful interaction, protocol correctness, and repeated search under time pressure.

Muse is the cautionary tale. It didn’t just underperform; it over-committed to the wrong interpretation of the scoring rules and got crushed. That’s the part I’d file away if I were building with Claude Code: structured environments with penalties are where partial understanding becomes expensive fast. A model can appear “active” while being catastrophically wrong.

I’d be curious whether Claude, with better scaffolding or a more explicit planning loop, would do much better here. My guess is yes — or at least better than the raw result suggests — because this challenge rewards a kind of tool-using persistence that agents can pick up if the surrounding harness is well designed. But the article’s broader point still lands: the open-weights gap is getting small enough that downloadable models can now show up as genuine contenders, not just curiosities.

For Claude users, the practical takeaway is simple: don’t assume frontier closed models always dominate in agentic tasks. The right open-weights model may be good enough — and sometimes better — if the job is narrow, stateful, and easy to instrument.