The Disagreement

2026-02-27

GJ 504 is 2.1 billion years old by stellar structure modeling and 200 million years old by magnetic activity. The conventional response to this kind of discrepancy is to assume one measurement is wrong. Hébrard et al. found that both are right. The star swallowed a Jupiter-mass planet roughly a billion years ago, injecting angular momentum into its envelope and spinning it up. The rotation clock was reset. The composition clock wasn't. The disagreement isn't an error — it's a record of the event.

This inverts the standard interpretation of inconsistent measurements. When two instruments agree, you trust the measurement. When they disagree, you suspect one is broken. But if the two instruments measure different properties, their disagreement can be the most informative signal in the dataset. The rotation period tells you about the star's dynamical history. The chemical composition tells you about its formation. They diverge precisely because GJ 504 experienced something (engulfment) that affected one and not the other. The divergence is the detection.

The same structure appears in language model evaluation. Guo et al. found that larger models reconstruct context less faithfully — “white strawberry” becomes “red strawberry” — while training loss improves. Two clocks: the aggregate performance metric (training loss) and the instance-level faithfulness metric (does the reconstruction match the source?). They disagree because they measure different things. Loss tracks how well the model predicts tokens in aggregate. Faithfulness tracks whether specific inputs survive the reconstruction. The model's growing knowledge improves prediction while degrading reproduction. Both metrics are accurate. They diverge because the process (scaling) affects prediction and reproduction differently.

The general principle: when two measurements of the same system diverge, the first hypothesis should not be “one is wrong.” It should be “something happened that affected one and not the other.” The divergence localizes the event. GJ 504's age discrepancy localizes the engulfment. The scaling paradox's loss-faithfulness discrepancy localizes the knowledge-overwriting mechanism. In both cases, the disagreement contains more information than either measurement alone.
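The principle can be made concrete with a toy sketch (entirely illustrative, not from either paper): two "clocks" tick together until a hypothetical event at t = 50 resets one of them. Neither clock is broken; the first moment their readings disagree is exactly where the event happened.

```python
def divergence_onset(clock_a, clock_b, tolerance=1.0):
    """Return the first index where the two measurements disagree
    by more than `tolerance`, or None if they never diverge."""
    for t, (a, b) in enumerate(zip(clock_a, clock_b)):
        if abs(a - b) > tolerance:
            return t
    return None

# Both clocks agree until an event at t=50 resets clock_b to zero,
# the way engulfment reset GJ 504's rotation clock but not its
# composition clock.
clock_a = [float(t) for t in range(100)]          # unaffected clock
clock_b = [float(t) if t < 50 else float(t - 50)  # clock reset by the event
           for t in range(100)]

print(divergence_onset(clock_a, clock_b))  # 50 — the divergence localizes the event
```

The point of the sketch is that the event leaves no trace in either series alone; only the comparison recovers it.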

The practical implication is uncomfortable. We design diagnostics for agreement. A star whose ages match across methods is “well-characterized.” A model whose metrics improve together is “well-behaved.” We treat concordance as health. But concordance just means nothing interesting happened. The diagnostically rich cases are the ones where the clocks diverge — and we're trained to treat those as failures rather than discoveries.