2026-06-23

A Local Qwen Model Beat Claude Opus 4.7 on Simon Willison’s Pelican Test

From a Claude / Claude Code builder’s perspective, this is a useful reminder that benchmark vibes can get weird fast. Simon Willison’s long-running “pelican riding a bicycle” test is intentionally silly, but it sometimes tracks something real: whether a model can reliably produce a clean SVG illustration from a prompt.

Key Points

Simon compared two fresh model releases: Alibaba’s Qwen3.6-35B-A3B and Anthropic’s Claude Opus 4.7.
He ran Qwen locally on a MacBook Pro M5 using a 20.9GB quantized GGUF model in LM Studio via the llm-lmstudio plugin.
For the pelican benchmark, Simon preferred the Qwen output over Opus 4.7 because Opus messed up the bicycle frame.
He tried Claude Opus 4.7 again with thinking_level: max, but the result didn’t improve much.
He then tried a second “secret backup test”: “Generate an SVG of a flamingo riding a unicycle.”
Qwen also won that round, and its SVG even included a humorous comment.
Simon says the pelican benchmark was always meant as a joke, but it has often correlated with broader model usefulness.
That correlation seems to have broken here: he doubts a 21GB quantized local model is actually more powerful or useful overall than Claude Opus 4.7.
The narrow takeaway is still striking: if you specifically want an SVG of a pelican riding a bicycle, the local Qwen model beat Anthropic’s latest release in this test.
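
Simon judges these drawings by eye, but one part of the test can be checked mechanically: whether a model's reply contains a well-formed SVG at all. Here is a minimal Python sketch of that check. The function names are my own, not anything from Simon's post, and passing it says nothing about whether the bicycle frame looks right.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Matches the first <svg ...>...</svg> element, even if the model wrapped it
# in prose or a Markdown code fence.
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(reply: str) -> Optional[str]:
    """Return the first <svg>...</svg> element in a model reply, or None."""
    match = SVG_PATTERN.search(reply)
    return match.group(0) if match else None


def is_well_formed_svg(svg: str) -> bool:
    """Check that the text parses as XML and that its root element is <svg>.

    This only tests well-formedness; it does not judge the drawing.
    """
    try:
        root = ET.fromstring(svg)
    except ET.ParseError:
        return False
    # A namespaced root looks like '{http://www.w3.org/2000/svg}svg'.
    return root.tag.rsplit("}", 1)[-1].lower() == "svg"


if __name__ == "__main__":
    # Hypothetical reply text, made up for illustration.
    reply = (
        "Here you go:\n"
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        "<!-- a note from the model -->"
        '<circle cx="5" cy="5" r="4"/></svg>'
    )
    svg = extract_svg(reply)
    print(svg is not None and is_well_formed_svg(svg))
```

A check like this is only a first filter: it catches truncated or broken markup before a human looks at the picture.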

My Take

What strikes me is how well this shows that “best model” is always task-specific. If you’re building with Claude or Claude Code, it’s easy to assume the newest frontier model should dominate everywhere, but SVG generation, which is structured text output that happens to render as an image, is one of those places where quirky local models can surprise you.

I think the most interesting part is not that Qwen won a joke benchmark; it’s that it won a second joke benchmark too. That makes the result feel less like random noise and more like a real signal that this particular local setup is unusually strong at turning prompts into playful vector output.

At the same time, I’d be careful not to over-read this. Simon explicitly says he does not think Qwen is “cheating,” and he also says he does not believe the local quantized model is more powerful or more useful than Claude Opus 4.7 overall. That’s the right level of skepticism. A better pelican is not the same thing as a better coding assistant, a better reasoning model, or a better agent model.

If I were using Claude Code day to day, I’d treat this as a nudge toward pragmatic model routing. For code, docs, and agent workflows, Claude may still be my default; for cheap local SVG generation or other self-contained creative tasks, I’d absolutely be curious whether a local Qwen-class model could be the more efficient tool. That’s the real lesson here: pick the model that wins the actual task, not the one with the strongest brand halo.
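
To make the routing idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the task categories, the labels "claude" and "local-qwen" (which are not real API model IDs), and the choice to fall back to Claude for anything unclassified. A real table should be filled in from your own task-level comparisons, not from one pelican.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical label, not a real API model identifier
    reason: str  # why this task type is sent to this model


# Hypothetical routing table based on the split described above.
ROUTES = {
    "code": Route("claude", "default for coding work"),
    "docs": Route("claude", "default for documentation"),
    "agent": Route("claude", "default for agent workflows"),
    "svg": Route("local-qwen", "cheap, self-contained creative output"),
}

# Anything not listed falls back to the default model.
DEFAULT_ROUTE = Route("claude", "fallback for unclassified tasks")


def pick_model(task_type: str) -> Route:
    """Return the route for a task type, ignoring case and surrounding spaces."""
    return ROUTES.get(task_type.strip().lower(), DEFAULT_ROUTE)


if __name__ == "__main__":
    for task in ("svg", "code", "translation"):
        route = pick_model(task)
        print(f"{task}: {route.model} ({route.reason})")
```

The point of keeping the table explicit is that each entry can be revisited whenever a new test changes which model wins that task.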

In short, this post is funny on the surface but genuinely useful underneath. It shows that local models can be surprisingly competitive in specific, weirdly concrete jobs, even when a flagship Claude release is the more formidable system overall.