Train a transformer twice on the same task with different random seeds. The weights are different — the loss landscape has many minima, and each training run finds a different one. The internal representations are different — neurons activate for different features, attention patterns vary. By every conventional measure of internal structure, the two networks are distinct objects.
Schiffman (arXiv 2602.22600, February 2026) shows that despite these differences, independently trained transformers converge to the same algorithmic cores — compact, low-dimensional subspaces that are both necessary and sufficient for task performance. The weights differ. The representations differ. The algorithm is the same.
The evidence spans three domains. Markov-chain transformers trained independently embed their transition-matrix computations in nearly orthogonal subspaces of the weight space — different locations — but recover identical transition spectra. The spectral structure is invariant; only its embedding in the larger space varies. Modular-addition transformers, which undergo grokking (a sharp transition from memorization to generalization during training), discover compact cyclic operators during the transition. The operators are the same across training runs. GPT-2 models of different sizes control subject-verb number agreement through a single axis in activation space. Inverting that axis flips grammatical number throughout generation. The axis is the core. Everything else is implementation.
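A toy sketch can make the Markov-chain result concrete. The construction below is illustrative, not from the paper: it embeds the same small transition operator into a larger weight space through two different random orthonormal maps. The two embeddings occupy nearly orthogonal subspaces, yet the nonzero spectrum recovered from each is identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small row-stochastic "transition" operator (eigenvalues 1.0, 0.6, 0.4).
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

def random_orthonormal(d, k, rng):
    """k orthonormal columns in R^d, i.e. a random k-dim subspace."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

d = 64                                # ambient "weight space" dimension
U1 = random_orthonormal(d, 3, rng)    # run 1 embeds T here
U2 = random_orthonormal(d, 3, rng)    # run 2 embeds T elsewhere
W1 = U1 @ T @ U1.T                    # two d x d "weight matrices"
W2 = U2 @ T @ U2.T

# The subspaces barely overlap: singular values of U1^T U2 are the
# cosines of the principal angles between them (1.0 would mean shared).
overlap = np.linalg.svd(U1.T @ U2, compute_uv=False)
print("subspace overlap (cosines):", np.round(overlap, 2))

# But the nonzero spectrum, the proposed invariant, matches exactly.
top1 = np.sort(np.abs(np.linalg.eigvals(W1)))[-3:]
top2 = np.sort(np.abs(np.linalg.eigvals(W2)))[-3:]
print("spectra agree:", np.allclose(top1, top2))
```

The same operator, embedded in different places: the cosines are far from 1, so the subspaces differ, but the spectral signature is shared.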
The diagnostic for whether something is an algorithmic core rather than an implementation detail is invariance: does the structure persist across independent instantiations that agree on performance but disagree on everything else? If the structure appears in every successful training run regardless of initialization, architecture details, or training trajectory, it belongs to the algorithm, not to the accident of how any one run found it.
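The diagnostic can be phrased as a small reusable check. The sketch below uses illustrative names and a stand-in for "independent runs": the cyclic shift operator of mod-5 addition, conjugated into a different random basis per run. The raw weights fail the invariance test; the spectrum passes.

```python
import numpy as np

def is_invariant(runs, signature, tol=1e-3):
    """Invariance diagnostic sketch: a candidate structure counts as
    an algorithmic core only if its signature agrees across runs."""
    sigs = [np.asarray(signature(r)) for r in runs]
    return all(np.allclose(sigs[0], s, atol=tol) for s in sigs[1:])

# Stand-ins for independent training runs: the cyclic "add 1 mod 5"
# shift operator, conjugated into a random basis per run.
rng = np.random.default_rng(1)
core = np.roll(np.eye(5), 1, axis=0)

def fake_run():
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    return Q @ core @ Q.T   # same algorithm, different implementation

runs = [fake_run() for _ in range(4)]

raw      = lambda W: W.ravel()                           # implementation
spectrum = lambda W: np.sort(np.linalg.eigvals(W).real)  # candidate core

print("raw weights invariant?", is_invariant(runs, raw))       # False
print("spectrum invariant?   ", is_invariant(runs, spectrum))  # True
```

Any structure that survives this kind of cross-run comparison is a candidate core; anything that varies with the basis is implementation.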
The practical implication concerns interpretability. Current mechanistic interpretability work identifies circuits — specific neurons, attention heads, and their connections — that implement behaviors. But circuits are implementation-specific: the same algorithm can be implemented by different circuits in different training runs. Focusing on circuits risks mistaking the implementation for the computation. The algorithmic core is what the circuits converge to, abstracted away from how any particular network instantiates it.
The grokking result adds temporal depth. During memorization, the network uses a large, distributed, non-invariant representation. During grokking, a compact cyclic operator crystallizes out of this distributed representation. After grokking, the operator expands again — but the expansion preserves the core's structure. The core forms during the transition and persists through subsequent changes. The algorithm arrives as an event, not a gradient.
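One way to watch for this kind of crystallization is to log a compactness statistic over training checkpoints. The sketch below uses synthetic matrices rather than real snapshots, and the participation ratio of singular values is an illustrative choice, not the paper's metric: it is high for a diffuse memorizing representation and drops once a compact cyclic operator dominates.

```python
import numpy as np

def effective_rank(W):
    """Participation ratio of singular values: a soft count of how
    many directions a matrix actually uses."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return 1.0 / np.square(p).sum()

rng = np.random.default_rng(2)
d = 32

# Pre-grokking stand-in: a diffuse, high-rank memorizing matrix.
diffuse = rng.standard_normal((d, d))

# Post-grokking stand-in: a compact cyclic operator (the core) now
# dominates, with residual diffuse structure around it.
core = np.zeros((d, d))
core[:5, :5] = np.roll(np.eye(5), 1, axis=0)
crystallized = 10.0 * core + 0.1 * diffuse

print(f"memorizing snapshot:   eff. rank = {effective_rank(diffuse):.1f}")
print(f"crystallized snapshot: eff. rank = {effective_rank(crystallized):.1f}")
```

Logged per checkpoint during a real run, a statistic like this would render the grokking transition as a sharp drop rather than a gradual slope.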