2026-07-05

Claude Versus the Cockpit: An LLM Agent Takes on X-Plane

This is the kind of experiment that actually tells you something useful about Claude, not just whether it can write code. The interesting part isn’t “can an AI fly a plane” in the movie-poster sense; it’s whether Claude can maintain state, reason across delays, and adapt its own tooling while operating in a real-time loop. That’s a much sharper test for Claude Code-style agentic workflows than another toy benchmark.

Key Points

The author asked Claude to use the X-Plane 12 API and try to fly a Cessna from Haikou Meilan in Hainan to nearby Qionghai Bo'ao.
Claude kept its own pilot log, and most of the narrative is written as live flight notes.
The experiment exposed a delay problem: Claude was working from screenshots and API data that lagged behind the plane’s actual state.
Claude quickly wrote Python code to handle takeoff, then kept extending that script with more functions as new problems appeared.
On one attempt, the plane climbed cleanly after takeoff, but the controller then overcorrected by pushing full forward elevator, causing a nose-first crash.
The author rewrote the controller with slew-rate limits and safer handoff logic; the third attempt was much better.
That third controller used a pure proportional approach and managed stable climb, cruise, and three 90-degree turns in a left circuit.
It still overshot altitude targets: a pure proportional controller reacts only to the current error, so it did not anticipate how long the aircraft would keep climbing.
One landing attempt failed partly because the plane wasn’t slowing down enough; flaps were deployed to help.
The second landing failure happened for a different reason: there was a ~20 second gap between invocations of the controller, and the plane kept descending until it hit terrain with no active control.
The author’s conclusion is basically “not yet,” but with a better harness loop, Claude might get there.
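
To make the slew-rate and proportional-control points above concrete, here is a minimal sketch of a proportional controller whose output can only change by a bounded amount per second. It is not the controller from the experiment: the class name `SlewLimitedP`, the gain, and the rate limit are all assumptions chosen for illustration, and nothing here talks to X-Plane.

```python
from dataclasses import dataclass


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class SlewLimitedP:
    """Proportional controller with a slew-rate limit (illustrative only)."""

    kp: float              # proportional gain (hypothetical value)
    max_rate: float        # largest allowed change in command per second
    out_min: float = -1.0  # assumed normalized control range
    out_max: float = 1.0
    command: float = 0.0   # last command that was issued

    def update(self, target: float, measured: float, dt: float) -> float:
        # Pure proportional term: depends only on the current error.
        desired = clamp(self.kp * (target - measured), self.out_min, self.out_max)
        # Slew limit: move toward the desired value by at most max_rate * dt.
        max_step = self.max_rate * dt
        self.command += clamp(desired - self.command, -max_step, max_step)
        return self.command


# Example: a 2000 ft altitude error saturates the proportional term,
# but the command still moves only 0.05 per 0.1 s tick.
ctrl = SlewLimitedP(kp=0.01, max_rate=0.5)
first = ctrl.update(target=3000.0, measured=1000.0, dt=0.1)
second = ctrl.update(target=3000.0, measured=1000.0, dt=0.1)
```

The rate limit stops a sudden full-deflection command like the one behind the nose-first crash, but it does nothing about overshoot: with only a proportional term, the command stays positive until the error itself changes sign.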

My Take

What strikes me is that this reads less like a stunt and more like a messy, honest systems test. That’s valuable. A lot of AI demos hide the exact thing that matters most: latency, control handoff, and what happens when the model is not continuously in the loop. Here, the failure modes are the whole story.

I think the most interesting detail is that Claude didn’t just answer questions about flying; it started building its own tooling to survive the task. It wrote code for takeoff first, which feels intuitive in a slightly alarming way. That’s a very Claude Code-ish instinct: solve the immediate subproblem, then patch the next one when it breaks. I like that. It’s practical. But it also shows the weakness of agentic systems that don’t have a strong world model plus tight control over timing. A delayed observation stream can make “reasonable” actions become terrible ones fast.

What worries me is the handoff gap. Twenty seconds of no controller is an eternity in flight, and that translates directly to real agent systems too. If your model is making decisions in bursts while the environment keeps moving, you need explicit guardrails for stale state, not just better prompts. I think that’s the real lesson for Claude users building agents: don’t obsess over clever reasoning first. Build the boring safety layer, the rate limits, the fallback states, the continuous loop. Those are what keep the model from driving itself off a cliff.
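
A stale-state guardrail of the kind described above might look like the following sketch. Everything in it is hypothetical: the `Observation` and `Command` types, the `MAX_AGE_S` threshold, and the `SAFE_HOLD` values are placeholders, and what actually counts as a safe fallback depends on the aircraft and the phase of flight.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    timestamp: float    # seconds on a monotonic clock when the state was sampled
    pitch_deg: float
    altitude_ft: float


@dataclass
class Command:
    elevator: float
    throttle: float
    source: str         # "agent" or "fallback", useful for logging


MAX_AGE_S = 0.5  # hypothetical freshness limit; tune for the real loop rate

# Placeholder fallback; a real system would hand off to a low-level hold controller.
SAFE_HOLD = Command(elevator=0.0, throttle=0.6, source="fallback")


def choose_command(
    obs: Optional[Observation],
    planned: Optional[Command],
    now: float,
    max_age_s: float = MAX_AGE_S,
) -> Command:
    """Return the agent's planned command only if its supporting state is fresh."""
    if obs is None or planned is None:
        return SAFE_HOLD  # nothing to act on
    if now - obs.timestamp > max_age_s:
        return SAFE_HOLD  # observation is too old to trust
    return planned


obs = Observation(timestamp=10.0, pitch_deg=2.0, altitude_ft=1500.0)
plan = Command(elevator=0.1, throttle=0.8, source="agent")
fresh = choose_command(obs, plan, now=10.2)   # 0.2 s old: use the plan
stale = choose_command(obs, plan, now=30.0)   # 20 s old: fall back
```

In a live loop, `now` would come from something like `time.monotonic()`, and this check would run on every tick of a fast inner loop, so a 20-second gap between model invocations ends in a fallback command rather than no command at all.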

I’d be curious whether a more deliberate harness, with tighter state refresh and continuous low-level control, would let Claude finish the landing. Perhaps it would. But even if it did, the impressive part wouldn’t be “Claude can fly a plane.” It would be that the agent can remain oriented in a fast-changing environment without losing the plot.

The takeaway is simple: this is a useful benchmark because it exposes timing, control, and planning failures in a way normal text tasks don’t. For Claude and Claude Code builders, that’s where the real engineering work starts.