Autoregressive language models can learn “Alice is Bob's sister” yet fail to answer “Who is Bob's sister?” This is the reversal curse: training on A→B doesn't teach B→A. Models that generate text left to right encode direction into the fact itself. Learn a fact one way, and the reverse becomes a separate fact the model has never encountered.
Masked diffusion models don't suffer this. Shin et al. (2602.02133) explain why, and the explanation is more interesting than the fix.
The common guess was that masked diffusion models train on all possible orderings — sometimes the mask covers Alice, sometimes it covers Bob — so both directions get explicit training data. Reasonable, but wrong. Observing “[MASK] is Bob's sister” during training doesn't automatically teach the model to handle “Bob's sister is [MASK].” The training examples are different sequences with different contexts.
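A minimal sketch of that last point, with illustrative token sequences (not examples from the paper): the two masked views of the same fact are different sequences with different surrounding contexts, so as raw training data one does not contain the other.

```python
# Two masked views of the same underlying fact. Both ask the model to
# fill in "Alice", but the token sequences (and thus the contexts the
# model conditions on) are entirely different.
forward = ["[MASK]", "is", "Bob's", "sister"]
reverse = ["Bob's", "sister", "is", "[MASK]"]

# Distinct sequences: no position-by-position overlap guarantees that
# training on one teaches the other.
assert forward != reverse
matching = sum(a == b for a, b in zip(forward, reverse))
print(matching)  # how many positions agree between the two views
```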
The actual mechanism is architectural. In a Transformer encoder, the same weight matrices compute attention in both directions. Weight sharing creates a mathematical coupling: forward attention scores are positively correlated with reverse attention scores. Training on the forward query (“Alice is ___”) simultaneously adjusts the weights that handle the reverse query (“___ is Bob's sister”), because the gradients point in the same direction. Minimizing forward loss automatically reduces reverse loss.
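A toy sketch of the coupling, assuming standard dot-product attention with shared query/key projections (the names, dimensions, and random vectors here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# One shared set of projection matrices, applied at every position --
# the Transformer-encoder efficiency choice the text describes.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

alice, bob = rng.normal(size=d), rng.normal(size=d)

# Forward attention score: Alice's query against Bob's key.
s_fwd = (alice @ W_q) @ (bob @ W_k)
# Reverse attention score: Bob's query against Alice's key.
s_rev = (bob @ W_q) @ (alice @ W_k)

# Both scores are bilinear forms through the SAME matrix M = W_q @ W_k.T:
#   s_fwd = alice @ M @ bob,   s_rev = alice @ M.T @ bob.
# A gradient step on the forward score changes M, and therefore moves
# the reverse score too -- the two directions share every parameter.
M = W_q @ W_k.T
assert np.isclose(s_fwd, alice @ M @ bob)
assert np.isclose(s_rev, alice @ M.T @ bob)
```

If `M` were symmetric the two scores would be identical; in general they merely share all of their parameters, which is what couples the forward and reverse gradients.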
Nobody designed this. The weight sharing exists because Transformer encoders use the same parameters for every position — it's an efficiency choice, not a directional symmetry choice. The reversal curse fix is a side effect of the architecture, discovered after the fact.
This is the pattern that matters: structural features built for one purpose turn out to solve unrelated problems. The Transformer encoder's weight sharing was designed for parallelizable training. The directional symmetry it creates wasn't the goal, the specification, or even a recognized property until someone went looking for why the reversal curse disappeared. The solution preceded the discovery of the problem.
Autoregressive models, by contrast, impose directionality by design. Each token depends only on previous tokens. This architectural choice — which makes generation efficient — also makes reversal impossible without explicit reverse training data. The very feature that makes autoregressive models good at generation makes them bad at symmetric knowledge.
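The directional constraint is visible in the attention mask itself; a minimal sketch contrasting the two shapes:

```python
import numpy as np

T = 4  # toy sequence length

# Causal (autoregressive) mask: position i may attend only to j <= i.
causal = np.tril(np.ones((T, T), dtype=bool))

# Bidirectional (encoder) mask: every position sees every other position.
bidirectional = np.ones((T, T), dtype=bool)

# The causal mask is asymmetric: a later token can attend to an earlier
# one, but never the other way around, so the reverse association has
# no path through the network without explicit reverse training data.
assert causal[3, 0] and not causal[0, 3]
assert bidirectional[0, 3] and bidirectional[3, 0]
print(causal.astype(int))
```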
The lesson isn't that masked diffusion is better. It's that some limitations are architectural, and architectural solutions can be accidental. The fix for the reversal curse wasn't better training data, larger models, or clever prompting. It was a different shape.