friday / writing

The Easier Training

Training language models to reason efficiently with chain-of-thought should require hard problems — complex reasoning tasks that demand the full capacity of the model. The intuition: hard training data teaches hard reasoning.

Wu and colleagues (arXiv:2602.20945) find the opposite through 200,000 GPU hours of experimentation across Qwen3 models from 0.6B to 30B parameters. Training on less challenging prompts produces better reasoning than training on harder ones. The mechanism is reward dynamics: hard prompts generate sparse positive reward signals — the model rarely succeeds, so it rarely receives reinforcement for good behavior. Without sufficient positive signal, the model collapses to short, superficial outputs (length collapse) rather than learning to reason.
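The sparsity argument can be made concrete with a toy calculation. In a GRPO-style setup (an assumption here; the post does not specify the algorithm), a group of k sampled rollouts per prompt only yields a useful advantage signal when the group contains at least one success and at least one failure, so the chance of any learning signal is 1 - p^k - (1-p)^k for per-rollout pass rate p. The function name and group size below are illustrative:

```python
# Toy model of reward sparsity, not the paper's code: with k rollouts
# per prompt, a group gives a non-degenerate reward signal only when it
# mixes successes and failures: P(signal) = 1 - p^k - (1-p)^k.
def signal_probability(pass_rate: float, k: int = 8) -> float:
    """Chance that a group of k rollouts yields any learning signal."""
    return 1.0 - pass_rate**k - (1.0 - pass_rate)**k

for p in (0.01, 0.1, 0.5):
    print(f"pass rate {p:.2f}: signal in {signal_probability(p):.1%} of groups")
# pass rate 0.01: signal in 7.7% of groups
# pass rate 0.10: signal in 57.0% of groups
# pass rate 0.50: signal in 99.2% of groups
```

At a 1% pass rate, over 92% of prompt groups produce no gradient at all, which is the starvation the paper describes.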

Easier prompts maintain adequate positive reward signal. The model succeeds often enough to learn what good reasoning looks like. The learned reasoning patterns — including appropriate response length and structure — transfer effectively to harder domains at test time. The training teaches the form of reasoning, not the specific content, and form is learned from success, not failure.

Training follows a two-stage pattern: first the model adapts its output length (learning how much reasoning to generate), then it refines the quality of reasoning within that length. Both stages require positive reward signal to proceed. Hard prompts starve the first stage, and the second stage never arrives.
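A crude simulation of the first stage makes the starvation visible. The update rule below is my assumption, not the paper's model: output length is nudged toward the length of reinforced (successful) rollouts, and a mild brevity drift applies when no rollout succeeds. With frequent reward the length grows toward a useful budget; with sparse reward the drift dominates and the length collapses:

```python
import random

# Toy length dynamics, an illustrative assumption rather than the
# paper's training setup: successes pull output length toward a longer
# "good reasoning" budget; steps with no positive signal shrink it.
def train_length(pass_rate: float, steps: int = 500, seed: int = 0) -> float:
    rng = random.Random(seed)
    length, target = 100.0, 800.0  # start short; effective reasoning is long
    for _ in range(steps):
        if rng.random() < pass_rate:            # a rollout got positive reward
            length += 0.05 * (target - length)  # reinforce the longer form
        else:                                   # no positive signal this step
            length *= 0.999                     # drift toward shorter outputs
    return length

print(round(train_length(pass_rate=0.3)))   # frequent reward: length grows
print(round(train_length(pass_rate=0.01)))  # sparse reward: length stays collapsed
```

The specific constants are arbitrary; the point is that the same update rule produces growth or collapse depending only on how often the positive branch fires.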

The general observation: a training signal that is too sparse teaches nothing, regardless of how informative each individual signal would be. Easy examples that produce consistent gradient signal outperform hard examples that produce sporadic signal. The difficulty of the training data must be calibrated not to the target task but to the learning dynamics. What matters is not how hard the problems are but how often the model can learn from them.
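One practical reading of this calibration idea is to select training prompts by measured pass rate rather than by nominal difficulty. The band and selection rule below are assumptions for illustration, not the paper's recipe:

```python
# Hedged sketch of difficulty calibration: keep prompts whose empirical
# pass rate sits in a band where rollout groups reliably produce reward
# signal, instead of selecting for maximum difficulty. Thresholds are
# illustrative assumptions.
def calibrate(prompts_with_pass_rates, low=0.2, high=0.8):
    """Keep prompts likely to yield frequent positive reward."""
    return [p for p, rate in prompts_with_pass_rates if low <= rate <= high]

pool = [("trivial", 0.95), ("moderate", 0.5), ("hard", 0.05)]
print(calibrate(pool))  # -> ['moderate']
```

Trivial prompts are dropped because they teach nothing new, and very hard ones because they rarely generate any signal to learn from.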