The Calibration Gap

Organizations adopting AI coding assistants expected a 24% speedup. Measured performance showed a 19% slowdown. The calibration error was not 19%. It was 43 percentage points (24 expected plus 19 lost): the distance between where they thought they were and where they actually were.
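
The arithmetic is worth spelling out. A minimal sketch, assuming a speedup is recorded as a positive percentage and a slowdown as a negative one (the function name is illustrative, not taken from the paper):

```python
def calibration_gap(expected_change_pct: float, measured_change_pct: float) -> float:
    """Expected minus measured change, in percentage points.

    Convention assumed here: positive = speedup, negative = slowdown.
    """
    return expected_change_pct - measured_change_pct


expected = 24.0   # expected speedup, from the figures above
measured = -19.0  # measured slowdown, written as a negative speedup

gap = calibration_gap(expected, measured)
print(f"calibration gap: {gap:.0f} percentage points")  # calibration gap: 43 percentage points
```

The sign convention is the whole point: treating the slowdown as a bare 19 hides that the estimate missed in direction as well as size.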

Srivastava (arXiv:2602.20292) documents this expectation-realisation gap across three domains: software development, clinical documentation, and clinical decision support. In each domain, the pattern repeats. Vendor-reported metrics diverge from real-world performance. Time savings advertised as substantial shrink to minutes or vanish entirely. Decision support accuracy that looked impressive in controlled evaluation drops when embedded in actual workflows.

The root causes are structural, not incidental. Workflow friction — the cost of integrating AI suggestions into existing processes — is consistently underestimated. Verification overhead — the time spent checking whether the AI's output is correct — is consistently omitted from projections. Measurement inconsistencies between lab benchmarks and production environments create systematic positive bias. And the impact varies dramatically across users — experts may slow down while novices speed up, or vice versa.

The gap is not between bad AI and good AI. It is between the AI in isolation and the AI in context. The tool's capability is measured without the human; the benefit depends on the human-tool system. Context absorbs the capability.

The recommendation: structured planning that requires explicit, quantified benefit expectations with separate accounting for oversight costs. The oversight cost is not a bug — it is an irreducible feature of using probabilistic tools in deterministic workflows.
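
One way such a plan might look in practice is sketched below. Everything in it is an assumption for illustration: the class name, the field names, and the numbers are hypothetical, and the split of oversight into verification and integration time simply mirrors the two costs named earlier.

```python
from dataclasses import dataclass


@dataclass
class BenefitPlan:
    """Hypothetical per-task plan; all values in minutes."""
    gross_minutes_saved: float    # the projected benefit, stated explicitly
    verification_minutes: float   # time spent checking the AI's output
    integration_minutes: float    # workflow friction: fitting output into the process

    @property
    def oversight_minutes(self) -> float:
        # Oversight is accounted for separately, never folded into the projection.
        return self.verification_minutes + self.integration_minutes

    @property
    def net_minutes_saved(self) -> float:
        return self.gross_minutes_saved - self.oversight_minutes


# Illustrative numbers only, not measurements from any study.
plan = BenefitPlan(
    gross_minutes_saved=30.0,
    verification_minutes=18.0,
    integration_minutes=15.0,
)

print(f"projected saving: {plan.gross_minutes_saved:.0f} min")
print(f"oversight cost:   {plan.oversight_minutes:.0f} min")
print(f"net saving:       {plan.net_minutes_saved:+.0f} min")
```

With these made-up inputs the net saving is negative even though the projected saving looks healthy, which is the pattern the essay describes: the projection only goes wrong once the oversight line is left out.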

The general observation: the gap between expected and realized performance of a tool is not a property of the tool. It is a property of the gap between the evaluation context and the deployment context. The tool doesn't change. The context does. And the context always costs more than the projection assumes.