PaPoo

GLM 5.2 Just Made the Claude Benchmarks Look Less Safe

If you build with Claude or Claude Code, this Semgrep write-up is the kind of benchmark result that should make you pause. Not because open-weight models suddenly “won the future,” but because the gap between a bare prompt and a purpose-built security harness looks bigger than many people wanted to admit.


Key Points


My Take


What strikes me is not “open models beat Claude,” because that’s too shallow a reading. The more interesting part is that Semgrep is basically showing how much security work is a systems problem, not just a raw-model problem. A model with no scaffolding can look surprisingly strong, but a purpose-built harness still wins by a wide margin. That feels right to me.


I think this is especially relevant for Claude Code users. A lot of people treat the model as the product, when in practice the workflow around it decides whether you get a clever autocomplete toy or something that can actually reason about a codebase. If you’re doing security work, the endpoint discovery, context selection, and loop design may matter more than whether the model scored a few points higher on some coding leaderboard.
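To make "endpoint discovery" and "context selection" a bit more concrete, here is a minimal sketch of what that part of a harness could look like. It is not Semgrep's pipeline and not anything Claude Code does internally. It assumes Flask-style `@app.route(...)` decorators, and the names `discover_endpoints`, `build_task`, and `SAMPLE` are hypothetical, made up for illustration.

```python
import ast

# Hypothetical sample application source, used only for illustration.
SAMPLE = '''
from flask import Flask
app = Flask(__name__)

@app.route("/users/<int:user_id>")
def get_user(user_id):
    return load_user(user_id)

def load_user(user_id):
    return {"id": user_id}
'''


def discover_endpoints(source: str) -> list[dict]:
    """Find functions decorated with a call like @<obj>.route("literal path")."""
    tree = ast.parse(source)
    endpoints = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            is_route_call = (
                isinstance(dec, ast.Call)
                and isinstance(dec.func, ast.Attribute)
                and dec.func.attr == "route"
                and dec.args
                and isinstance(dec.args[0], ast.Constant)
                and isinstance(dec.args[0].value, str)
            )
            if is_route_call:
                endpoints.append({
                    "path": dec.args[0].value,
                    "handler": node.name,
                    # Context selection: keep only the handler's own source,
                    # not the whole file.
                    "source": ast.get_source_segment(source, node),
                })
    return endpoints


def build_task(endpoint: dict) -> str:
    """Turn one discovered endpoint into a narrow, reviewable prompt."""
    return (
        f"Endpoint {endpoint['path']} is handled by {endpoint['handler']}.\n"
        "Check whether the handler enforces authorization before touching data.\n\n"
        f"{endpoint['source']}"
    )


if __name__ == "__main__":
    for ep in discover_endpoints(SAMPLE):
        print(build_task(ep))
```

The point is only that the model gets one endpoint and its handler at a time instead of a whole repository. A real harness would also need to follow calls such as `load_user` to pull in the code that actually performs the access check.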


The GLM 5.2 numbers are still genuinely interesting, though. An open-weight model getting into the same conversation as Claude on a security benchmark is not nothing. For teams with data sensitivity, self-hosting needs, or cost pressure, that’s the part I’d actually test. I’d be curious whether GLM 5.2 holds up on real internal repos the way it does in Semgrep’s benchmark, especially once the tasks get messy and the auth logic stops being tidy.


At the same time, I think people may overread the “beats Claude” headline. This wasn’t a fully equal product-to-product bakeoff. Semgrep is clear that its own multimodal pipeline has extra structure, and the open models got a much thinner harness. That makes the result useful, but also very specific. It tells you where the leverage is, not who has magically solved security agents.


If I were building with Claude Code, I’d take this as a reason to invest more in retrieval, task shaping, and static-analysis hooks rather than obsessing over model swaps alone. The model matters. The wrapper matters more than most people want to believe.
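As a rough illustration of a "static-analysis hook" feeding "task shaping", the sketch below turns scanner findings into small, ranked review tasks that could be handed to a model one at a time. It assumes a JSON report with a top-level `results` list whose entries carry `check_id`, `path`, `start.line`, `end.line`, `extra.message`, and `extra.severity`, which is the shape I understand Semgrep's `--json` output to have; verify that against your own version. The sample report, the rule IDs in it, and the names `findings_to_tasks` and `SEVERITY_RANK` are invented for this example.

```python
import json

# Hypothetical report in the assumed shape; rule IDs and paths are made up.
SAMPLE_REPORT = json.dumps({
    "results": [
        {
            "check_id": "example.debug-logging",
            "path": "app/util.py",
            "start": {"line": 3}, "end": {"line": 3},
            "extra": {"message": "Debug logging left enabled.", "severity": "INFO"},
        },
        {
            "check_id": "example.missing-auth-check",
            "path": "app/views.py",
            "start": {"line": 12}, "end": {"line": 14},
            "extra": {
                "message": "Route handler has no authorization check.",
                "severity": "WARNING",
            },
        },
    ]
})

# Lower rank means "look at this first". Unknown severities sort last.
SEVERITY_RANK = {"ERROR": 0, "WARNING": 1, "INFO": 2}


def findings_to_tasks(report_json: str, max_tasks: int = 5) -> list[str]:
    """Convert scanner findings into a short list of focused review prompts."""
    results = json.loads(report_json).get("results", [])
    ranked = sorted(
        results,
        key=lambda r: SEVERITY_RANK.get(r.get("extra", {}).get("severity"), 3),
    )
    tasks = []
    for r in ranked[:max_tasks]:
        extra = r.get("extra", {})
        tasks.append(
            f"Review {r['path']} lines {r['start']['line']}-{r['end']['line']} "
            f"(rule {r['check_id']}, severity {extra.get('severity', 'UNKNOWN')}): "
            f"{extra.get('message', '').strip()} "
            "Decide whether this is exploitable and explain why."
        )
    return tasks


if __name__ == "__main__":
    for task in findings_to_tasks(SAMPLE_REPORT):
        print(task)
```

The design choice here is that the scanner decides where to look and the model only decides whether a flagged spot is a real problem, which keeps each prompt small regardless of which model sits behind it.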


Reference: We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

