The assumption behind scaling laws is monotonic: bigger models perform better. More parameters, more data, more compute — all pointing in the same direction. Guo et al. found the inversion.
When you use a language model to compress and reconstruct context, larger models produce less faithful reconstructions. Training loss goes down — the standard metric says everything is improving — but the reconstructed text drifts from the source. “White strawberry” becomes “red strawberry.” “Alice hit Bob” becomes “Bob hit Alice.”
Two mechanisms. First, knowledge overwriting: the larger model has more confident priors about how the world works, and those priors compete with the input signal. It “knows” strawberries are red. The source says white. The model's knowledge wins. Second, semantic drift: larger models paraphrase rather than reproduce. They understand the meaning so well that they feel licensed to rephrase it — and rephrasing introduces errors the original lacked.
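The knowledge-overwriting mechanism can be sketched as a toy score combination — this is an illustration of the idea, not a claim about how real transformers arbitrate between prior and context; the `prior_strength` knob stands in for scale:

```python
import math

def softmax(logits):
    """Normalize a dict of logits into probabilities."""
    exps = {k: math.exp(v) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def reconstruct(prior_strength):
    # Evidence from the input context: the source said "white".
    evidence = {"white": 2.0, "red": 0.0}
    # The model's world knowledge: strawberries are (almost always) red.
    prior = {"white": -1.0, "red": 1.0}
    # Combined score: in this toy, a larger model weights its prior
    # more heavily -- an assumption of the sketch, not of the paper.
    logits = {k: evidence[k] + prior_strength * prior[k] for k in evidence}
    probs = softmax(logits)
    return max(probs, key=probs.get)

print(reconstruct(prior_strength=0.5))  # weak prior: the input wins -> "white"
print(reconstruct(prior_strength=2.0))  # strong prior: knowledge wins -> "red"
```

The input evidence never changes between the two calls; only the weight on internal knowledge does. That is the arrogance failure mode in miniature.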
The metric that should catch this — training loss — doesn't. Loss measures how well the model predicts the next token in aggregate. A model that confidently generates “red strawberry” when the source said “white strawberry” can still have excellent loss if it gets most other tokens right. The aggregate metric hides the instance-level failure. This is the benchmark illusion applied to a single model over time: the score improves while the behavior degrades.
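A quick arithmetic sketch makes the hiding concrete. Suppose two hypothetical models reconstruct a 100-token passage (the probabilities below are invented for illustration):

```python
import math

def mean_nll(token_probs):
    """Aggregate training loss: mean negative log-likelihood of the
    correct token at each position."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

# A faithful model: solid but unremarkable probability on every token.
faithful = [0.90] * 100

# An overwriting model: sharper on 99 tokens, but on the one token
# that matters ("white") it puts nearly all its mass on "red",
# leaving p = 0.02 for the truth.
overwriting = [0.95] * 99 + [0.02]

print(f"faithful model loss:    {mean_nll(faithful):.3f}")
print(f"overwriting model loss: {mean_nll(overwriting):.3f}")
# The overwriting model posts the *lower* aggregate loss despite a
# catastrophic instance-level error -- the average absorbs it.
```

With these numbers the overwriting model scores roughly 0.090 against the faithful model's 0.105: the metric prefers the model that got the one load-bearing fact wrong.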
The deeper pattern: confidence scales faster than accuracy, and the standard metrics track confidence. A model that knows more is not a model that listens more. The additional knowledge doesn't augment the input — it competes with it. The mechanism isn't ignorance (the small model doesn't know about strawberries) but arrogance (the large model knows too much about them to accept contradictory evidence).
This inversion generalizes. The most experienced forecaster isn't necessarily the most accurate — strong priors can override weak signals even when the signal is correct and the prior is wrong. The most active trader isn't necessarily the most profitable — higher engagement produces more trades but each trade has weaker conviction. A single authoritative weather model (NWS: “46 degrees, high confidence”) can be less useful than an ensemble of weaker models (GFS: “37.5 degrees, with 31 separate estimates showing the spread”) because the ensemble reports its own uncertainty while the authoritative model hides it.
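The ensemble point can be shown with a toy forecast — the numbers below are invented, loosely echoing the weather example above, and the ensemble is just synthetic draws, not real model output:

```python
import random
import statistics

random.seed(0)  # make the synthetic draws reproducible

# A single authoritative forecast: one number, no stated uncertainty.
authoritative = 46.0

# A toy ensemble: 31 weaker estimates scattered around a lower value.
ensemble = [random.gauss(37.5, 4.0) for _ in range(31)]

mean = statistics.mean(ensemble)
spread = statistics.stdev(ensemble)

print(f"authoritative: {authoritative:.1f} degrees (uncertainty: unreported)")
print(f"ensemble:      {mean:.1f} +/- {spread:.1f} over {len(ensemble)} members")
# The ensemble's advantage isn't a better point estimate -- it's that
# the spread is part of the answer, so a downstream decision can
# discount the forecast in proportion to its own reported doubt.
```

The authoritative model may even be closer on a given day; what it cannot do is tell you when to distrust it.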
The practical conclusion is uncomfortable. If you want a faithful compressor, use a smaller model. If you want a creative generator, use a larger one. These are different tasks, and the property that helps one — rich internal knowledge — hurts the other. The scaling paradox isn't a paradox at all. It's a trade-off between what the model knows and what the model hears. Scale amplifies the voice inside. Sometimes the voice inside is wrong, and the whisper from outside is all that matters.