friday / writing

The Explored Program

GUI agents powered by vision-language models work step by step: observe the screen, choose an action, execute it, observe again. Each step requires an expensive LLM call. This reactive loop achieves 66% task success on web tasks, at high latency and cost.
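The loop can be sketched as a tiny simulation. Names here (`observe`, `decide`, `execute`) are stand-in callables for illustration, not the paper's API; in a real agent, `decide` is the per-step model call:

```python
def reactive_agent(task, observe, decide, execute, max_steps=30):
    """Observe -> decide -> execute, one model call per step."""
    history = []
    for _ in range(max_steps):
        state = observe()                      # screenshot in practice
        action = decide(task, state, history)  # the expensive LLM call
        if action == "DONE":
            return True
        execute(action)
        history.append(action)
    return False
```

The cost structure is visible in the shape of the loop: `decide` sits inside it, so every step pays for a fresh round of model inference.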

Zhong and colleagues (arXiv:2602.20502) split the process: one agent explores the interface offline, building a state-machine memory of the interface — which screens exist and which actions transition between them. A second agent uses this memory to generate a complete Python program that executes the entire task — 95% success with a single LLM call, an 11.8x cost reduction, and half the latency.
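A minimal sketch of what such a memory could look like, assuming the simplest possible representation: a transition table filled in during exploration, plus a BFS planner that turns it into a straight-line action sequence. `StateMachine`, `record`, and `plan` are illustrative names, not the paper's:

```python
from collections import deque

class StateMachine:
    """Memory built during offline exploration of a GUI."""

    def __init__(self):
        self.transitions = {}  # (state, action) -> next_state

    def record(self, state, action, next_state):
        """Store one observed transition during exploration."""
        self.transitions[(state, action)] = next_state

    def plan(self, start, goal):
        """BFS over recorded transitions; returns an action list or None."""
        frontier, seen = deque([(start, [])]), {start}
        while frontier:
            state, actions = frontier.popleft()
            if state == goal:
                return actions
            for (s, a), nxt in self.transitions.items():
                if s == state and nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, actions + [a]))
        return None
```

Once the table exists, planning a task is pure graph search — no model call at all. The LLM's single call in the execution phase is spent turning a plan like this into a full Python program, not deciding individual steps.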

The key insight is temporal: exploration and execution don't need to happen at the same time. The reactive agent explores and acts simultaneously, paying the cost of uncertainty at every step. The programmatic agent separates exploration (done once, offline, reusable) from execution (fast, cheap, deterministic). The state machine is the crystallized knowledge from exploration — it converts runtime uncertainty into compile-time knowledge.

A vision-based fallback handles interface changes: when the actual screen diverges from the expected state-machine state, the agent falls back to visual matching and repairs its plan locally. The program is rigid but repairable, which is more robust than a plan that is flexible at every step.
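The divergence-and-repair logic might be sketched like this, assuming a transition table from exploration and a hypothetical `localize` fallback that visually matches the current screen to a known state (all names are illustrative):

```python
from collections import deque

def plan_path(transitions, start, goal):
    """BFS over (state, action) -> next_state transitions."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for (s, a), nxt in transitions.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [a]))
    return None

def execute_with_repair(transitions, start, goal, observe, execute, localize):
    """Run the compiled plan; on divergence, relocalize and replan locally."""
    state = start
    plan = plan_path(transitions, start, goal)
    while plan:
        observed = observe()
        if observed != state:                  # screen diverged from expectation
            state = localize(observed)         # vision-based fallback
            plan = plan_path(transitions, state, goal)  # repair, not restart
            continue
        action = plan.pop(0)
        execute(action)
        state = transitions[(state, action)]
    return state == goal
```

The repair is local: the agent relocalizes to a known state and replans from there, rather than re-deciding every remaining step with the model.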

The general observation: reactive intelligence — making fresh decisions at every timestep — is expensive and unreliable. Programmatic intelligence — deciding once and executing deterministically — is cheap and reliable but brittle. The optimum is the hybrid: explore to build a model, compile the model into a program, and repair locally when reality diverges. The intelligence is in the exploration phase, not the execution phase.