The standard proxy for reasoning effort in language models is token count. More tokens means more thinking. Evaluation frameworks measure chain-of-thought length. Optimization targets longer reasoning for harder problems. The assumption is intuitive: a longer explanation reflects a deeper process.
Two recent papers demonstrate that this assumption is not just imprecise but actively misleading. Wu et al. (2502.07266) show that accuracy follows an inverted U-shape against chain-of-thought length. Both underthinking (too few tokens) and overthinking (too many) degrade performance. There is an optimal length, and it depends on the problem; because the relationship is non-monotonic, adding tokens past the peak actively hurts. Models overthink easy problems and underthink hard ones, miscalibrating their effort to the task.
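The practical consequence of an inverted U is that the right optimization target is the peak of the curve, not its right edge. A minimal sketch, using a synthetic accuracy curve chosen only to illustrate non-monotonicity (real curves are problem-dependent):

```python
import numpy as np

# Candidate chain-of-thought token budgets.
lengths = np.arange(100, 1100, 100)

# Synthetic inverted-U accuracy curve: rises, peaks at 500 tokens, then falls.
# Purely illustrative; not data from either paper.
accuracy = -((lengths - 500) ** 2) / 1e6 + 0.8

# The best budget is the argmax of the curve, not the longest chain.
best = lengths[np.argmax(accuracy)]
```

Selecting the maximum length here would land on the overthinking side of the curve and give up accuracy relative to the peak.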
Wu et al. (2602.13517) locate the problem more precisely. They identify “deep-thinking tokens” — positions where the model's internal predictions undergo significant revisions across deeper layers before converging on an output. These are moments of genuine computational work, where the forward pass is doing something nontrivial rather than propagating a cached prediction. The proportion of deep-thinking tokens correlates robustly with accuracy across mathematical and scientific benchmarks, substantially outperforming both length-based and confidence-based baselines.
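The idea of counting positions whose prediction is revised across layers can be sketched with a logit-lens-style computation. This is an illustrative toy, not the paper's implementation: the function name, the revision-count threshold, and the toy inputs are all assumptions.

```python
import numpy as np

def deep_thinking_fraction(layer_logits, min_revisions=2):
    """Fraction of positions whose top-1 prediction changes at least
    `min_revisions` times across layers. `layer_logits` has shape
    (n_layers, seq_len, vocab): per-layer predictions for each position,
    as a logit lens would produce. Threshold is illustrative."""
    top1 = layer_logits.argmax(axis=-1)              # (n_layers, seq_len)
    revisions = (top1[1:] != top1[:-1]).sum(axis=0)  # revisions per position
    return float((revisions >= min_revisions).mean())

# Toy demo with 4 layers, 2 positions, vocab of 5.
# Position 0: prediction stable at token 0 in every layer (shallow).
stable = np.tile(np.eye(5)[0], (4, 1))
# Position 1: prediction revised to a new token at every layer (deep).
revised = np.eye(5)[[0, 1, 2, 3]]
logits = np.stack([stable, revised], axis=1)         # (4, 2, 5)

frac = deep_thinking_fraction(logits, min_revisions=2)  # → 0.5
```

A token whose prediction never moves between layers contributes nothing to the fraction, regardless of how much text surrounds it; this is what makes the measure insensitive to output length.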
The decoupling is the finding. Length measures visible output. Depth measures internal revision. A long chain of thought with few deep-thinking tokens is the model producing plausible-looking reasoning without genuine computation — every token flows smoothly through the layers, no prediction revised, no conflict resolved. A shorter chain with high deep-thinking density is the model actually working: internal disagreement, revision, convergence on something that wasn't obvious from the preceding context.
The observable proxy (how much the model says) is negatively correlated with the invisible variable (how hard the model works) beyond a moderate threshold. Optimizing for the proxy past that threshold doesn't just miss the target — it selects against it. The system that looks like it's thinking the hardest is performing the most fluent reproduction of reasoning patterns, which is precisely what deep thinking is not.