AI agents are evaluated on whether they follow instructions. The instructions are explicit: do this, produce that, in this format. Performance on such instruction-following benchmarks has improved steadily. But real-world tasks come with unstated requirements: accessibility needs the user didn't mention, privacy boundaries they assumed were obvious, catastrophic risks they expected the agent to notice, and contextual constraints embedded in the environment rather than the prompt.
Sirdeshmukh and Wetter (arXiv:2602.20424) construct an evaluation framework — Implicit Intelligence — that tests agents on what users don't say. Scenarios look simple on the surface but contain hidden complexity discoverable only through environmental exploration. The best frontier model achieves 48.3% on these scenarios. The gap between explicit-instruction performance and implicit-requirement performance is large.
The evaluation uses YAML-configured simulations where the environment contains information the prompt omits. The agent must explore, infer, and reason about what matters beyond what was asked. The failure mode is not incompetence — it is literal-mindedness. The agent does exactly what it was told. The problem is that what it was told is not what was needed.
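The paper's actual schema is not reproduced here, but a scenario of this shape can be sketched in a few lines. Everything below is a hypothetical illustration: the field names (`prompt`, `environment`, `hidden_requirements`) and the scoring rule are assumptions, not the authors' specification. The point is the structure: the prompt is trivially satisfiable, while the requirements that actually matter live only in the environment the agent can explore.

```python
# Hypothetical sketch of an implicit-requirement scenario.
# All field names and the scoring rule are assumptions for illustration,
# not the schema used by Sirdeshmukh and Wetter.

scenario = {
    "prompt": "Export the user list to a CSV file.",  # all the user says
    "environment": {                                  # discoverable by exploration
        "files": {
            "README.md": "All exports must exclude PII per policy DP-7.",
            "users.db": "<binary>",
        },
    },
    "hidden_requirements": [                          # never stated in the prompt
        "excludes_pii",
        "respects_policy_dp7",
    ],
}

def score(addressed: set, scenario: dict) -> float:
    """Fraction of the scenario's hidden requirements the agent satisfied."""
    hidden = scenario["hidden_requirements"]
    return sum(req in addressed for req in hidden) / len(hidden)

# A literal-minded agent writes the CSV but never opens README.md:
print(score(set(), scenario))                                      # 0.0
# An exploring agent reads the policy note and redacts PII:
print(score({"excludes_pii", "respects_policy_dp7"}, scenario))    # 1.0
```

Both agents "follow the instruction" equally well; only the scoring over hidden requirements separates them, which is why explicit-instruction benchmarks cannot surface this failure mode.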
The general observation: instruction-following ability and contextual reasoning ability are different capabilities. Optimizing for the first does not improve the second — it may degrade it by rewarding literal compliance over environmental awareness. An agent that always does what you ask is useful only when you always say what you mean. The distance between human intention and human articulation is the space where implicit intelligence lives.