The Resolution

2026-02-27

Yang and Wang test what happens when you compare language models that score equally on benchmarks. Models with comparable accuracy disagree on 16 to 66 percent of individual items. In one political science study, the disagreement was large enough to reverse the scientific conclusion. High-accuracy models with systematic errors introduce larger biases than lower-accuracy models with random errors. The aggregate metric — accuracy — averages away the structure that matters.
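The arithmetic behind this is easy to see in a toy simulation (not the paper's data — the benchmark size and error rates below are made up for illustration): two models with identical accuracy can still disagree on a large fraction of items when their errors fall on different questions.

```python
import random

random.seed(0)

N = 1000  # hypothetical benchmark size
truth = [random.randint(0, 1) for _ in range(N)]

# Two toy "models" with the same 80% accuracy but errors on
# different items: each errs on its own random 20% of the items.
err_a = set(random.sample(range(N), 200))
err_b = set(random.sample(range(N), 200))

pred_a = [1 - t if i in err_a else t for i, t in enumerate(truth)]
pred_b = [1 - t if i in err_b else t for i, t in enumerate(truth)]

acc_a = sum(p == t for p, t in zip(pred_a, truth)) / N
acc_b = sum(p == t for p, t in zip(pred_b, truth)) / N
disagree = sum(a != b for a, b in zip(pred_a, pred_b)) / N

print(acc_a, acc_b)  # both exactly 0.8 by construction
print(disagree)      # roughly 0.32: items where exactly one model errs
```

Both models score 0.8, yet around a third of the items get different labels, because the aggregate score says nothing about where the errors land.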

Kadkhodaie, Li, and Simoncelli train diffusion models on completely different datasets and find they produce nearly identical outputs from the same noise seed. The convergence can't come from shared data; it follows from shared Gaussian statistics in the input distribution. Different training, same behavior — because the statistical structure does the work, not the specific content.
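The simplest version of this effect can be sketched with linear Gaussian denoising rather than the paper's actual diffusion models (the dimensions, sample sizes, and covariance below are invented for illustration). For Gaussian data, the optimal denoiser is the Wiener filter, which depends only on the data covariance — so two denoisers fit to disjoint training sets drawn from the same distribution give nearly identical outputs on the same noisy input.

```python
import numpy as np

rng = np.random.default_rng(0)

# One underlying Gaussian "image" distribution (d pixels) with a
# fixed covariance — a stand-in for shared input statistics.
d = 16
A = rng.normal(size=(d, d))
cov = A @ A.T / d + np.eye(d)

# Two disjoint training sets drawn from the SAME distribution.
train1 = rng.multivariate_normal(np.zeros(d), cov, size=5000)
train2 = rng.multivariate_normal(np.zeros(d), cov, size=5000)

sigma2 = 1.0  # noise variance

def wiener(train, y):
    """Empirical Wiener denoiser: x_hat = C (C + sigma^2 I)^{-1} y,
    where C is the covariance estimated from the training set."""
    C = np.cov(train, rowvar=False)
    return C @ np.linalg.solve(C + sigma2 * np.eye(d), y)

# The same noisy input fed to both "models".
y = rng.multivariate_normal(np.zeros(d), cov) + rng.normal(0, 1, d)
out1, out2 = wiener(train1, y), wiener(train2, y)

rel_diff = np.linalg.norm(out1 - out2) / np.linalg.norm(out1)
print(rel_diff)  # small: both estimators approach the same Wiener filter
```

Neither denoiser has seen a single training example the other has, yet their outputs agree to within sampling error of the covariance estimate. The shared statistics, not the shared data, determine the behavior.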

These results are structural inverses. In the benchmark case, agreement at the aggregate level (equal scores) conceals disagreement at the item level (different error patterns). In the diffusion case, disagreement at the training level (different data) conceals agreement at the output level (shared statistical structure). One is false agreement. The other is false disagreement.

The variable that flips between them is the resolution of measurement. Measure language models by their accuracy score and they look interchangeable. Measure them item by item and they are not. Measure diffusion models by their training data and they look unrelated. Measure them by their outputs and they converge. The appearance of sameness or difference is determined by the scale at which you look.

This is not a statistical curiosity. Yang and Wang show that treating equal-scoring models as interchangeable in scientific annotation changes treatment effect estimates by 84 percent. Kadkhodaie et al. show that treating different-data models as different misattributes statistical regularity to learned understanding. In both cases, the error is the same: assuming that one level of description transfers to another. It does not. What you see depends on where you stand.