Transformer attention concentrates. In practice, trained models develop “attention sinks” — tokens that absorb disproportionate attention weight regardless of semantic relevance — and “massive activations” — individual hidden dimensions whose values are orders of magnitude larger than their neighbors. Both phenomena were discovered empirically and explained as emergent behavior: the complex dynamics of training on language data were said to produce concentration patterns that the architecture doesn't explicitly encode.
The explanation is backwards. The concentration isn't emergent. It's inherited.
The value-softmax structure — the core building block of self-attention, defined as a learnable value matrix times a softmax of attention logits — has a mathematical property: gradient flow on this parameterization inherently drives outputs toward low-entropy solutions. Not because of the data. Not because of the loss function. The property holds across logistic loss, square loss, and other objectives. The softmax-times-matrix structure polarizes under any gradient signal.
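The pull toward low entropy can be watched directly in a toy version of the parameterization. The sketch below is my own minimal illustration, not an experiment from any paper: it trains a scalar output f = V·softmax(a) by gradient descent on a square loss, with both the values V and the logits a trainable, and tracks the entropy of the softmax weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Toy value-softmax model: scalar output f = V @ softmax(a),
# with both the values V and the attention logits a trainable.
n = 4
V = 0.1 * rng.standard_normal(n)  # small asymmetric init breaks the tie
a = np.zeros(n)                   # attention starts exactly uniform
target, lr = 5.0, 0.05

H_start = entropy(softmax(a))
for _ in range(2000):
    p = softmax(a)
    f = V @ p
    r = f - target                # residual of the loss 0.5*(f - target)**2
    grad_V = r * p                # dL/dV
    grad_a = r * p * (V - f)      # dL/da via the softmax Jacobian
    V -= lr * grad_V
    a -= lr * grad_a
H_end = entropy(softmax(a))

print(f"attention entropy: {H_start:.3f} -> {H_end:.3f}")
```

The dynamic is rich-get-richer: the token whose value is slightly largest at initialization attracts attention, and attention in turn scales its value updates. The attention distribution starts at maximum entropy and only ever moves away from it.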
The proof works by analyzing the implicit bias of gradient flow on the combined V·σ(a) parameterization. As optimization proceeds, the softmax outputs concentrate probability mass — not because the model learns that concentration is useful, but because the gradient landscape has low-entropy solutions as attractors. The model descends toward concentration the way a ball rolls downhill; the slope comes from the parameterization, not from the data.
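One way to see where the attractors come from — my own sketch of the mechanism, using square loss for concreteness — is to write out the gradient with respect to the attention logits. With p = σ(a), output f = Vp, residual r = f − t, and vⱼ the j-th value column:

```latex
% p = softmax(a), f = Vp, L = \tfrac{1}{2}\|f - t\|^2, r = f - t
\frac{\partial L}{\partial a_j}
  = p_j \left( v_j^{\top} r - \sum_k p_k \, v_k^{\top} r \right)
```

Each logit's update is multiplied by its own attention weight pⱼ: a token that already holds attention moves fastest, and a token near zero weight barely moves at all. That multiplicative factor is the self-reinforcing term, and its stable endpoints under descent are the low-entropy corners of the simplex.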
Attention sinks and massive activations are the same phenomenon viewed from different measurement frames. The attention sink is the concentration of the softmax output. The massive activation is the corresponding amplification in the value matrix. Both are consequences of the same gradient flow polarization, not independent emergent behaviors requiring separate explanations.
The structure is real. Attention does concentrate. Activations do spike. But the cause was misattributed. What looked like learned behavior — the model discovering that concentration helps — is actually a mathematical property of the optimization surface. The model doesn't choose to concentrate. The gradient gives it no choice.