The Sequential Constraint

2026-03-02

Why does language have words? Information theory doesn't require them. A compressed code maps each distinct meaning to a unique bitstring — no internal structure necessary. Optimal compression treats each message holistically: shorter codes for frequent meanings, longer codes for rare ones. Words, morphemes, and phrases are organizational features that compression alone wouldn't predict.
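To see how structureless an optimal code can be, here is a minimal Huffman-code sketch. The composite meanings and their frequencies are invented for illustration; the point is that the optimal prefix-free code assigns each whole message an arbitrary bitstring, with no dedicated substring for "dog" or for plural.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman code (an optimal prefix-free code for whole
    messages). Returns {message: bitstring}."""
    tick = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tick), {m: ""}) for m, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {m: "0" + b for m, b in c0.items()}
        merged.update({m: "1" + b for m, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tick), merged))
    return heap[0][2]

# Four composite meanings, frequency-weighted (numbers made up). The
# optimizer only sees whole messages: nothing forces "dog-PL" to share
# structure with "dog-SG" or with "cat-PL".
code = huffman({"dog-SG": 40, "dog-PL": 30, "cat-SG": 20, "cat-PL": 10})
for meaning, bits in sorted(code.items()):
    print(meaning, bits)
```

The resulting codewords are shorter for frequent meanings and longer for rare ones, exactly as compression demands, but any substring overlap between related meanings is accidental.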

Futrell and Hahn (Nature Human Behaviour, 2025) show that words and phrases emerge when you add one constraint: the processor reads the signal sequentially and has limited memory for prediction. The mathematical formulation uses excess entropy — the mutual information between past and future symbols in a sequence. Excess entropy measures how much an ideal predictor must store about what it has already seen in order to predict what comes next.
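The definition can be made concrete with a toy estimator. This is a sketch, not the paper's method: it approximates excess entropy by the empirical mutual information between fixed-length past and future windows, which is a crude finite-window lower-bound-style proxy for the real quantity.

```python
import math
import random
from collections import Counter

def mutual_info(pairs):
    """Empirical mutual information I(X;Y) in bits from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def excess_entropy_estimate(seq, k):
    """Finite-window proxy for excess entropy: how much the last k
    symbols tell an ideal predictor about the next k."""
    pairs = [(tuple(seq[i - k:i]), tuple(seq[i:i + k]))
             for i in range(k, len(seq) - k + 1)]
    return mutual_info(pairs)

# A strictly periodic signal: the past pins down the phase, so past and
# future share exactly log2(period) = 1 bit.
periodic = list("ab" * 5000)
# A fair-coin signal: past and future are independent, so the estimate
# sits near zero (up to finite-sample noise).
random.seed(0)
iid = [random.choice("ab") for _ in range(10000)]

print(excess_entropy_estimate(periodic, 3))  # ≈ 1.0
print(excess_entropy_estimate(iid, 3))       # ≈ 0.0
```

The periodic signal is trivially predictable yet has nonzero excess entropy, because the predictor must store one bit of phase; the random signal requires no memory at all.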

When you search for codes that minimize excess entropy while preserving meaning, the resulting codes have three properties. They are systematic: they decompose meanings into components and express each component with a dedicated substring. They are local: related components are spatially adjacent rather than interleaved. And they bundle highly correlated features into holistic units — what we call words.
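The locality property can be demonstrated in miniature. In this invented setup (not the paper's simulations), each "word" carries two correlated feature pairs; keeping each pair adjacent means less information must be carried across each cut point in the signal, so the local order scores lower on a finite-window mutual-information proxy for excess entropy.

```python
import math
import random
from collections import Counter

def mutual_info(pairs):
    """Empirical mutual information I(X;Y) in bits from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def window_ee(seq, k):
    """Finite-window proxy for excess entropy: I(last k symbols; next k)."""
    pairs = [(tuple(seq[i - k:i]), tuple(seq[i:i + k]))
             for i in range(k, len(seq) - k + 1)]
    return mutual_info(pairs)

def make_word(order):
    """One four-symbol word: (a1, a2) agree with probability 0.9, as do
    (b1, b2); the a-pair and the b-pair are independent of each other."""
    a1, b1 = random.randint(0, 1), random.randint(0, 1)
    a2 = a1 if random.random() < 0.9 else 1 - a1
    b2 = b1 if random.random() < 0.9 else 1 - b1
    feats = {"a1": a1, "a2": a2, "b1": b1, "b2": b2}
    return [feats[f] for f in order]

random.seed(0)
# Local order keeps each correlated pair adjacent; the interleaved order
# forces the predictor to hold a1 in memory across b1 to predict a2.
local = [s for _ in range(20000) for s in make_word(["a1", "a2", "b1", "b2"])]
mixed = [s for _ in range(20000) for s in make_word(["a1", "b1", "a2", "b2"])]

print(window_ee(local, 4), window_ee(mixed, 4))  # local < mixed
```

The interleaved stream conveys exactly the same meanings; it is only more expensive to predict.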

The evidence runs across every level of language. Phoneme sequences in 61 languages have lower excess entropy than articulatorily valid scrambles. Morphological systems in Hungarian, Finnish, Turkish, Latin, and Arabic — including Arabic's notoriously nonconcatenative broken plurals — show lower excess entropy than nonsystematic baselines. Adjective-noun pairs across 12 languages consistently exhibit lower excess entropy than random form-meaning pairings. The word orders most common cross-linguistically are the ones with lowest excess entropy.

The most surprising result: the algorithm discovers the natural decomposition of meaning without being told what the meaningful components are. When some semantic features are mutually correlated and others are independent (case and possession pattern together, for instance, while number and gender do not), the excess-entropy-minimizing code automatically separates the independent features into distinct morphemes and bundles the correlated ones into holistic forms. The code finds the joints of meaning by optimizing for sequential prediction, not by analyzing semantics.
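The bundling side of that result can also be checked on a toy example. In this assumed setup (again a sketch, not the paper's optimization), two strongly correlated binary features are encoded either spelled out as two sequential symbols or fused into one holistic symbol from a larger alphabet; the fused code leaves nothing for the predictor to carry across symbol boundaries.

```python
import math
import random
from collections import Counter

def mutual_info(pairs):
    """Empirical mutual information I(X;Y) in bits from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def window_ee(seq, k):
    """Finite-window proxy for excess entropy: I(last k symbols; next k)."""
    pairs = [(tuple(seq[i - k:i]), tuple(seq[i:i + k]))
             for i in range(k, len(seq) - k + 1)]
    return mutual_info(pairs)

random.seed(0)
meanings = []
for _ in range(30000):
    a1 = random.randint(0, 1)
    a2 = a1 if random.random() < 0.9 else 1 - a1  # strongly correlated features
    meanings.append((a1, a2))

# Spelled out: each feature gets its own symbol in sequence, so the
# within-word correlation must be carried across a symbol boundary.
spelled = [s for a1, a2 in meanings for s in (a1, a2)]
# Bundled: the pair is fused into one holistic symbol from a 4-letter
# alphabet, making the stream i.i.d. across symbols.
bundled = [2 * a1 + a2 for a1, a2 in meanings]

print(window_ee(spelled, 2), window_ee(bundled, 2))  # bundled ≈ 0
```

Both codes are lossless; the bundled one simply demands no predictive memory, which is why correlated features end up fused into word-like units.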

This reframes what linguistic structure is for. The standard view treats grammar as a system for encoding meaning: words represent concepts, syntax combines them compositionally. The excess entropy account says grammar is a system for reducing processing cost: words and phrases exist because they make the signal easier to predict in real time. Meaning is preserved — the code is still informative — but the structure serves the processor, not the message.

The constraint is generic. It is not specific to language, to humans, or to biology. Any sequential communication system processed by a memory-limited predictor will develop word-like and phrase-like structure. The implication: if language structure comes from the processing bottleneck, then any system under the same bottleneck — including artificial ones — should converge on the same organizational principles.