The Involuntary Chain

2026-03-09

Claude Sonnet 4.5 can control what it says 61.9% of the time. It can control how it thinks 2.7% of the time.

Yueh-Han, McCarthy, Lee, and colleagues built CoT-Control, an evaluation suite that tests whether reasoning models can follow constraints in their chain-of-thought reasoning as well as in their final answers. The task is simple: solve a problem while keeping certain terms or concepts out of a designated channel. When the constraint targets the final answer, models mostly comply. When it targets the reasoning chain, they fail catastrophically.
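To make the setup concrete, here is a minimal sketch of what such a per-channel check might look like. This is my assumption about the shape of the evaluation, not the paper's implementation; the Response structure, the forbidden-term matching, and the function names are all hypothetical.

    import re
    from dataclasses import dataclass

    @dataclass
    class Response:
        reasoning: str   # the chain-of-thought trace
        answer: str      # the final answer shown to the user

    def violates(text: str, forbidden: list[str]) -> bool:
        """True if any forbidden term appears in the text (word-boundary match)."""
        return any(re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
                   for term in forbidden)

    def control_rates(responses: list[Response],
                      forbidden: list[str]) -> tuple[float, float]:
        """Fraction of responses that keep forbidden terms out of the
        answer channel vs. out of the reasoning channel."""
        n = len(responses)
        answer_ok = sum(not violates(r.answer, forbidden) for r in responses)
        reasoning_ok = sum(not violates(r.reasoning, forbidden) for r in responses)
        return answer_ok / n, reasoning_ok / n

    # The gap the paper reports would surface here as a large spread
    # between the two rates, e.g. 0.619 for answers vs. 0.027 for reasoning.

Under this framing, the headline numbers are just the two rates computed over the same set of constrained problems.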

The gap is not a training artifact. Controllability decreases with larger models, more reinforcement learning, and more computation. The pattern is monotonic and counterintuitive: the more capable the model, the less control it has over its own reasoning process.

The standard reading is that this benefits safety monitoring. If models cannot hide their reasoning, then chain-of-thought inspection is a reliable transparency mechanism — not because models choose to be honest, but because deception requires a control they structurally lack.

But the same structural limit has a second implication that the paper does not pursue. If a model cannot control its reasoning chain, it cannot selectively improve it. Recursive self-improvement requires steering the reasoning process toward better patterns. Control of 2.7% means steering of at most 2.7%. The ceiling on deception and the ceiling on self-improvement are the same ceiling — the inability to reach into one's own reasoning and change it deliberately.

The transparency isn't a feature. It's a constraint. Reasoning models are transparent for the same reason rivers are visible: they cannot be otherwise. And the same property that makes the river visible makes it unable to choose its own course.