2026-07-04

GLM 5.2 Just Made the Claude Benchmarks Look Less Safe

If you build with Claude or Claude Code, this Semgrep write-up is the kind of benchmark result that should make you pause. Not because open-weight models suddenly “won the future,” but because the gap between a naked prompt and a purpose-built security harness looks bigger than a lot of people wanted to admit.

Key Points

Semgrep tested popular open-source models on its IDOR benchmark using the same dataset and the same prompt it uses for frontier coding agents.
GLM 5.2, an open-weight model from Zhipu AI (Z.ai), scored 39% F1 on IDOR detection.
Claude Code scored 32% F1 in that setup.
Semgrep says GLM 5.2's cost worked out to roughly $0.17 per vulnerability found.
Semgrep’s own multimodal pipeline still did better, at 53–61% F1, but that system runs in a purpose-built harness.
The point of the experiment was not “which model is best,” but how much security performance comes from the model versus the harness around it.
The harness matters a lot: Semgrep’s internal setup enumerates endpoints, filters context, and points the model directly at likely problem areas.
The open-weight models in this test ran in a simpler Pydantic AI harness with the same IDOR prompt, plus only light guidance on search strategy and what IDORs look like.
GLM 5.2 is open weight under an MIT license, so it can be downloaded and run on your own hardware.
Semgrep notes that open weight is not the same as open source; weights are released, but training data and the full pipeline generally are not.
GLM 5.2 is a Mixture-of-Experts model with roughly 750 billion total parameters, of which about 40 billion (around 5%) are active per token.
Z.ai says the model supports up to 1M tokens of context, with claims that the context remains useful over long agent trajectories.
Semgrep highlights strong coding benchmark numbers for GLM 5.2, including 81.0 on Terminal-Bench 2.1 and 62.1 on SWE-bench Pro.
Semgrep also frames GLM 5.2 as attractive on cost, saying its pricing is around one-sixth of comparable frontier models.
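
For anyone who doesn't live in benchmark tables: F1 is the harmonic mean of precision and recall, so a 39% F1 means the model is both missing real IDORs and flagging things that aren't. Here's a minimal sketch of the calculation. The counts in the example are made up for illustration and are not Semgrep's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    predicted = true_positives + false_positives  # everything the model flagged
    actual = true_positives + false_negatives     # everything that was really vulnerable
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical run: 30 real IDORs found, 70 false alarms, 30 real IDORs missed.
# Precision is 0.30, recall is 0.50.
example = f1_score(true_positives=30, false_positives=70, false_negatives=30)
```

Because it is a harmonic mean, F1 gets dragged toward the weaker of the two numbers, which is why a noisy scanner can't buy a good score with recall alone.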

My Take

What strikes me is not “open models beat Claude,” because that’s too shallow a reading. The more interesting part is that Semgrep is basically showing how much security work is a systems problem, not just a raw-model problem. A model with only thin scaffolding can look surprisingly strong, but a purpose-built harness still wins by a wide margin (53–61% F1 versus 39%). That feels right to me.

I think this is especially relevant for Claude Code users. A lot of people treat the model as the product, when in practice the workflow around it decides whether you get a clever autocomplete toy or something that can actually reason about a codebase. If you’re doing security work, the endpoint discovery, context selection, and loop design may matter more than whether the model scored a few points higher on some coding leaderboard.
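
To make “endpoint discovery and context selection” concrete, here is a toy sketch of what that scaffolding might look like. To be clear, this is not Semgrep's pipeline, which the write-up only describes at a high level. It assumes Flask-style route decorators, and every name in it (`ROUTE_RE`, `Endpoint`, `enumerate_endpoints`, `takes_object_id`, `build_prompt`) is hypothetical. A real harness would parse the code properly instead of leaning on a regex.

```python
import re
from dataclasses import dataclass

# Assumption: routes are declared with Flask-style decorators,
# e.g. @app.route("/orders/<int:order_id>").
ROUTE_RE = re.compile(r"""@\w+\.(?:route|get|post|put|delete)\(\s*["']([^"']+)["']""")


@dataclass
class Endpoint:
    path: str
    handler_source: str


def enumerate_endpoints(source: str) -> list[Endpoint]:
    """Endpoint discovery: split a module into (route path, handler source) pairs."""
    matches = list(ROUTE_RE.finditer(source))
    endpoints = []
    for i, match in enumerate(matches):
        # A handler's source runs until the next route decorator (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        endpoints.append(Endpoint(match.group(1), source[match.start():end].strip()))
    return endpoints


def takes_object_id(endpoint: Endpoint) -> bool:
    """Context selection: keep only routes with a path parameter."""
    return "<" in endpoint.path or "{" in endpoint.path


def build_prompt(endpoint: Endpoint) -> str:
    """Task shaping: point the model at one handler and one question."""
    return (
        "Check this handler for IDOR: does it verify that the caller "
        f"may access the object named in {endpoint.path}?\n\n"
        f"{endpoint.handler_source}"
    )


SAMPLE = '''
@app.route("/health")
def health():
    return "ok"

@app.route("/orders/<int:order_id>")
def get_order(order_id):
    return db.get(order_id)
'''

candidates = [e for e in enumerate_endpoints(SAMPLE) if takes_object_id(e)]
prompts = [build_prompt(e) for e in candidates]
```

Even in this toy version, the model never sees `/health`. It gets one handler and one narrow question, instead of a whole repo and a vague instruction to “find vulnerabilities.”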

The GLM 5.2 numbers are still genuinely interesting, though. An open-weight model getting into the same conversation as Claude on a security benchmark is not nothing. For teams with data sensitivity, self-hosting needs, or cost pressure, that’s the part I’d actually test. I’d be curious whether GLM 5.2 holds up on real internal repos the way it does in Semgrep’s benchmark, especially once the tasks get messy and the auth logic stops being tidy.

At the same time, I think people may overread the “beats Claude” headline. This wasn’t a fully equal product-to-product bakeoff. Semgrep is clear that its own multimodal pipeline has extra structure, and the open models got a much thinner harness. That makes the result useful, but also very specific. It tells you where the leverage is, not who has magically solved security agents.

If I were building with Claude Code, I’d take this as a reason to invest more in retrieval, task shaping, and static-analysis hooks rather than obsessing over model swaps alone. The model matters. The wrapper matters more than most people want to believe.