Why Structured APIs Beat Vision-Based Computer Use for Claude Agents

For anyone building with Claude or Claude Code, this article lands on a very practical question: when should an agent click around a UI, and when should it call tools directly? Reflex’s benchmark makes the tradeoff feel much less abstract by showing that “computer use” can be dramatically more expensive, slower, and less reliable than structured API access on the same app.

Key Points

Reflex benchmarked two ways of letting Claude Sonnet operate the same admin panel:
- a vision agent using browser screenshots and clicks
- an API agent calling the same app logic through structured tool calls
The task was non-trivial: find the customer named “Smith” with the most orders, locate the most recent pending order, accept all pending reviews, and mark the order delivered.
The vision agent failed on the first attempt because it only saw one of four pending reviews and never paginated.
With a 14-step hand-holding walkthrough, the vision agent succeeded, but it took about 14 minutes and roughly half a million input tokens.
The API agent completed the task in 8 calls by reading structured responses directly from the handlers the UI already uses.
In the reported results, the vision path averaged:
- 53 steps
- 550,976 input tokens
- about 1003 seconds wall-clock time
The API path with Sonnet averaged:
- 8 calls
- 12,151 input tokens
- about 19.7 seconds wall-clock time
The cheapest configuration tested was API + Haiku, which finished in under 8 seconds with under 10k input tokens.
The article’s central argument is architectural: if an agent has to “see” to act, it pays for every render, and better models don’t remove the need for those renders.
Reflex notes that its auto-generated HTTP endpoint plugin made the API path cheap to test, but the broader conclusion is not Reflex-specific.
The authors still say vision agents are the right choice for third-party or legacy systems you cannot modify.
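
To put the reported averages side by side, here is a small sketch that computes how many times larger each vision figure is than its API + Sonnet counterpart, plus a rough tokens-per-step average. The numbers are the ones listed above; the dictionary layout and function names are my own, and the per-step figure is a plain average that says nothing about how tokens were actually distributed across steps.

```python
# Reported averages from the benchmark summary above.
vision = {"steps": 53, "input_tokens": 550_976, "seconds": 1003.0}
api = {"steps": 8, "input_tokens": 12_151, "seconds": 19.7}


def ratio(metric: str) -> float:
    """How many times larger the vision figure is than the API figure."""
    return vision[metric] / api[metric]


def tokens_per_step(run: dict) -> float:
    """Average input tokens per step (vision) or per call (API)."""
    return run["input_tokens"] / run["steps"]


for metric in ("steps", "input_tokens", "seconds"):
    print(f"{metric}: {ratio(metric):.1f}x")

print(f"vision tokens/step: {tokens_per_step(vision):,.0f}")
print(f"api tokens/call: {tokens_per_step(api):,.0f}")
```

On these figures the vision path uses roughly 6.6 times the steps, about 45 times the input tokens, and about 51 times the wall-clock time, with each vision step averaging around 10,400 input tokens against roughly 1,500 per API call.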

My Take

What strikes me is how unglamorous the conclusion is: the expensive part isn’t always model intelligence; sometimes it’s interface design. I think a lot of people want to frame agent reliability as “use a bigger model,” but this benchmark argues something more annoying and more useful: if the app exposes structure, the agent’s job gets radically easier.

The part I find most convincing is the pagination failure. That’s exactly the kind of bug you get with UI-only agents: they can be “reasoning” correctly over an incomplete view of the world. The API agent didn’t need to infer that there might be more below the fold; it got the full structured result set. That feels like the real lesson for Claude users building internal tools: if you control the software, don’t make the model read pixels just to do database-shaped work.
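
A minimal sketch of why the pagination failure is structural, not a reasoning error. Everything here is hypothetical: the review IDs, the one-row page size, and the `Page` shape with a `total` field are assumptions for illustration, since the article does not describe the actual response format. The point is only that a structured response can carry the size of the full result set, while a screenshot carries whatever happened to render.

```python
from dataclasses import dataclass


@dataclass
class Page:
    items: list[str]
    total: int  # size of the full result set, not just this page


# Hypothetical data: four pending reviews, as in the benchmark task.
PENDING_REVIEWS = ["review-1", "review-2", "review-3", "review-4"]


def ui_first_screen(page_size: int = 1) -> list[str]:
    """What a screenshot shows: only the rows rendered on the first page."""
    return PENDING_REVIEWS[:page_size]


def api_list_pending(offset: int = 0, limit: int = 1) -> Page:
    """What a structured handler can return: a slice plus the total count."""
    return Page(
        items=PENDING_REVIEWS[offset:offset + limit],
        total=len(PENDING_REVIEWS),
    )


def fetch_all(limit: int = 1) -> list[str]:
    """Keep requesting pages until the reported total has been collected."""
    collected: list[str] = []
    while True:
        page = api_list_pending(offset=len(collected), limit=limit)
        collected.extend(page.items)
        if len(collected) >= page.total or not page.items:
            return collected
```

An agent acting on `ui_first_screen()` can reason perfectly and still accept only one review. An agent reading `Page.total` knows it is not done until it has seen all four.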

I’d be curious whether other vision agents do materially better here, but I don’t think that changes the core economics. Better perception might reduce some errors, yet the agent still has to inspect screen after screen. That step count is the tax. And once you see the numbers (8 calls versus 53 steps, roughly 12k input tokens versus 551k), the “just use browser automation” default starts looking less like convenience and more like technical debt.

What I’d actually do with this as a Claude Code user is pretty simple: expose structured tools whenever I can, especially for internal workflows, admin panels, and anything involving pagination, filtering, or cross-entity updates. I’d reserve computer use for the places where I truly can’t touch the app. That seems like the honest boundary line, not a hype-friendly one.
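
For what “expose structured tools” might look like in practice, here is a rough sketch. The tool description follows the name / description / `input_schema` (JSON Schema) shape used for Claude tool use. The tool itself, the in-memory `ORDERS` store, and the `dispatch` helper are all hypothetical stand-ins, not anything from Reflex’s benchmark. In a real app the handler would call the same logic the UI already uses.

```python
import json

# Hypothetical in-memory store standing in for the app's real handlers.
ORDERS = {"order-42": {"status": "pending"}}

# Tool description in the name / description / input_schema shape.
# The tool name and fields are assumptions made for this example.
MARK_DELIVERED_TOOL = {
    "name": "mark_order_delivered",
    "description": "Set an existing order's status to delivered.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "ID of the order"},
        },
        "required": ["order_id"],
    },
}


def mark_order_delivered(order_id: str) -> dict:
    """Update the order and return a structured result the model can read."""
    order = ORDERS.get(order_id)
    if order is None:
        return {"ok": False, "error": f"unknown order {order_id}"}
    order["status"] = "delivered"
    return {"ok": True, "order_id": order_id, "status": order["status"]}


HANDLERS = {"mark_order_delivered": mark_order_delivered}


def dispatch(tool_name: str, tool_input: dict) -> str:
    """Route a tool call from the model to its handler; reply as JSON text."""
    handler = HANDLERS.get(tool_name)
    if handler is None:
        return json.dumps({"ok": False, "error": f"unknown tool {tool_name}"})
    return json.dumps(handler(**tool_input))
```

Calling `dispatch("mark_order_delivered", {"order_id": "order-42"})` returns a short JSON string the agent can act on directly, with no screenshot to interpret and no button to locate.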

The takeaway is blunt: vision agents are useful when you’re trapped outside the system, but for apps you own, structured APIs are not a nice-to-have. They’re the difference between a manageable agent and a costly one.