Language model interpretability relies heavily on geometric properties of weight matrices — particularly the unembedding matrix that maps internal representations to vocabulary predictions. Its effective rank, spectral structure, and dimensionality have been proposed as indicators of model quality. High effective rank is associated with better performance; low rank with degradation.
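As a concrete reference point, a common definition of effective rank (the exponential of the entropy of the normalized singular-value distribution, due to Roy and Vetterli; not necessarily the exact variant the paper uses) can be sketched in a few lines of NumPy. The matrix shapes and values here are illustrative, not taken from the paper:

```python
import numpy as np

def effective_rank(W: np.ndarray) -> float:
    """Effective rank as exp(H(p)), where p is the singular value
    spectrum of W normalized to sum to 1. A full-rank matrix with a
    flat spectrum scores near min(W.shape); a matrix dominated by a
    few directions scores much lower."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so the 0*log(0) terms vanish
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Full-rank Gaussian matrix vs. an explicitly rank-8 product.
full = rng.normal(size=(512, 256))
low = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 256))
print(effective_rank(full))  # large: spectrum spread over ~256 directions
print(effective_rank(low))   # small: at most 8 (up to numerical noise)
```

The entropy form is why the metric is sensitive to the shape of the whole spectrum, not just a hard cutoff: shrinking the tail singular values smoothly lowers the score even when the numerical rank is unchanged.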
Stolfo and colleagues (arXiv:2602.20433) test this systematically with 108 OLMo-style models trained under controlled variation. The finding: geometric metrics primarily reflect training choices, not performance. Batch size and weight decay strongly influence the effective rank, which in turn correlates with performance — but the geometry is downstream of the training configuration, not upstream of the capability.
The causation runs backward from the intuitive direction: low effective rank does not cause performance degradation; both co-occur as consequences of the same training hyperparameters. The researchers construct adversarial cases in which low-rank models show no performance saturation — once training conditions are controlled, the correlation breaks. The best-performing models often have high effective rank, but this does not hold universally across tasks.
Extending the analysis to other geometric metrics and to final-layer representations, the authors find the metrics largely aligned with one another, but none reliably predictive of performance. The geometry of the model tells you how it was trained more than how well it works.
The general observation: when a correlation between an internal property and an external outcome holds observationally, the temptation is to treat the internal property as explanatory. But the correlation can be mediated by a third variable — in this case, training hyperparameters — that causes both. The geometry is a mirror of the training process, not a window into capability.
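The mediation argument can be made concrete with a toy simulation: a single "hyperparameter" variable drives both a geometric property and a performance score, producing a strong marginal correlation between the two that collapses once the hyperparameter is regressed out. All names, coefficients, and functional forms below are illustrative, not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# A confounder standing in for a training choice (e.g. weight decay).
h = rng.normal(size=n)
# Both the internal property and the outcome depend on h, plus
# independent noise; neither causes the other.
rank = 2.0 * h + rng.normal(size=n)
score = 1.5 * h + rng.normal(size=n)

def corr(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

def residual(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Regress x out of y (least squares), returning the residuals."""
    beta = np.dot(x, y) / np.dot(x, x)
    return y - beta * x

print(corr(rank, score))  # strong marginal correlation (~0.7 here)
# Partial correlation given h: regress h out of both, then correlate.
print(corr(residual(rank, h), residual(score, h)))  # near zero
```

The observational correlation is real, but it carries no information about whether intervening on the internal property would move the outcome — which is exactly the distinction the controlled-training experiments are designed to test.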