PaPoo
Why Structured APIs Beat Vision-Based Computer Use for Claude Agents

For anyone building with Claude or Claude Code, this article addresses a very practical question: when should an agent click around a UI, and when should it call tools directly? Reflex’s benchmark makes the tradeoff much less abstract by showing that “computer use” can be dramatically more expensive, slower, and less reliable than structured API access on the same app.

Key Points

[Figures: key-point summary images]

My Take

What strikes me is how unglamorous the conclusion is: the expensive part isn’t always model intelligence; often it’s interface design. I think a lot of people want to frame agent reliability as “use a bigger model,” but this benchmark argues something more annoying and more useful: if the app exposes structure, the agent’s job gets radically easier.

The part I find most convincing is the pagination failure. That’s exactly the kind of bug you get with UI-only agents: they can be “reasoning” correctly over an incomplete view of the world. The API agent didn’t need to infer that there might be more below the fold; it got the full structured result set. That feels like the real lesson for Claude users building internal tools: if you control the software, don’t make the model read pixels just to do database-shaped work.
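The pagination point can be made concrete with a small sketch. This is a hypothetical illustration, not Reflex's benchmark code: `RECORDS`, `visible_rows`, `fetch_page`, and `fetch_all` are invented names, the viewport size is assumed, and an in-memory list stands in for a real database. It shows how an agent that only sees the first screenful undercounts, while a client that follows the pagination cursor gets every row.

```python
from typing import Optional

# Hypothetical data set standing in for an app's database table.
RECORDS = [
    {"id": i, "status": "open" if i % 3 == 0 else "closed"}
    for i in range(1, 26)
]

VIEWPORT_ROWS = 10  # assumed number of rows visible without scrolling


def visible_rows() -> list[dict]:
    """What a screenshot-driven agent sees if it never scrolls."""
    return RECORDS[:VIEWPORT_ROWS]


def fetch_page(cursor: Optional[int] = None, page_size: int = 10) -> dict:
    """Hypothetical paginated endpoint: returns rows plus the next cursor."""
    start = cursor or 0
    end = start + page_size
    next_cursor = end if end < len(RECORDS) else None
    return {"rows": RECORDS[start:end], "next_cursor": next_cursor}


def fetch_all() -> list[dict]:
    """Follow cursors until the endpoint reports there is nothing left."""
    rows: list[dict] = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        rows.extend(page["rows"])
        cursor = page["next_cursor"]
        if cursor is None:
            return rows


def count_open(rows: list[dict]) -> int:
    """Count rows whose status is 'open'."""
    return sum(1 for r in rows if r["status"] == "open")


ui_count = count_open(visible_rows())   # reasoning over a partial view
api_count = count_open(fetch_all())     # reasoning over the full result set
print(ui_count, api_count)
```

Both counts come from the same counting logic; the only difference is how much of the data each side was given.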

I’d be curious whether other vision agents do materially better here, but I don’t think that changes the core economics. Better perception might reduce some errors, yet the agent still has to inspect screen after screen. That step count is the tax. And once you see the numbers (8 calls versus 53 steps, 12k tokens versus 551k), the “just use browser automation” default starts looking less like convenience and more like technical debt.
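To put those figures side by side, here is a quick back-of-the-envelope calculation. It uses only the rounded numbers quoted above, so the ratios are approximate; the variable names are mine.

```python
# Figures as quoted in this post (rounded): API agent vs. vision agent.
api_calls, ui_steps = 8, 53
api_tokens, ui_tokens = 12_000, 551_000

step_ratio = ui_steps / api_calls      # how many more steps the UI agent took
token_ratio = ui_tokens / api_tokens   # how many more tokens it consumed

print(f"steps:  {step_ratio:.1f}x")
print(f"tokens: {token_ratio:.1f}x")
```

With these rounded inputs the vision agent takes between six and seven times as many steps and roughly 46 times as many tokens, which is the same order as the 45x in the referenced article's title.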

What I’d actually do with this as a Claude Code user is pretty simple: expose structured tools whenever I can, especially for internal workflows, admin panels, and anything involving pagination, filtering, or cross-entity updates. I’d reserve computer use for the places where I truly can’t touch the app. That seems like the honest boundary line, not a hype-friendly one.
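As a rough sketch of what "expose structured tools" could look like: the dictionary below follows the `name` / `description` / `input_schema` shape used for tool definitions in Anthropic's Messages API, but everything else is hypothetical. The `list_tickets` tool, its `status` parameter, the `TICKETS` data, and the `handle_tool_call` dispatcher are invented for illustration and stand in for whatever your internal app actually exposes.

```python
import json

# Hypothetical in-memory data standing in for an admin panel's backend.
TICKETS = [
    {"id": 1, "status": "open", "assignee": "ana"},
    {"id": 2, "status": "closed", "assignee": "ben"},
    {"id": 3, "status": "open", "assignee": "ben"},
]

# Tool definition: a JSON Schema describing the inputs the model may supply.
# The tool name and its single parameter are assumptions for this example.
LIST_TICKETS_TOOL = {
    "name": "list_tickets",
    "description": "Return every ticket matching the filter, across all pages.",
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["open", "closed"]},
        },
        "required": ["status"],
    },
}


def list_tickets(status: str) -> list[dict]:
    """Filter on the server side so the model never has to scroll or paginate."""
    return [t for t in TICKETS if t["status"] == status]


def handle_tool_call(name: str, tool_input: dict) -> str:
    """Dispatch a tool call by name and return the result as a JSON string."""
    if name == "list_tickets":
        return json.dumps(list_tickets(**tool_input))
    raise ValueError(f"unknown tool: {name}")


# Simulate the agent asking for all open tickets in a single call.
result = handle_tool_call("list_tickets", {"status": "open"})
print(result)
```

The design choice worth noting is that filtering and pagination happen inside the tool, so one call returns the complete, structured answer instead of a sequence of screens to inspect.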

The takeaway is blunt: vision agents are useful when you’re trapped outside the system, but for apps you own, structured APIs are not a nice-to-have; they’re the difference between a manageable agent and a costly one.

Reference: Computer use is 45x More Expensive Than Structured APIs
