Language has words. Every human language segments continuous meaning into discrete, recombinable units. The question is whether this segmentation is arbitrary convention — cultures happened to chunk meaning this way — or whether something deeper makes it necessary.
Futrell and Hahn (Nature Human Behaviour, 2025) argue it's necessary. They start with a single constraint: minimize excess entropy, defined as the mutual information between the past and future of a sequential signal. A system processing language one element at a time needs to predict what comes next from what came before. The less it must remember from the past to keep its predictions accurate, the more efficiently it operates.
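In symbols, writing $X_t$ for the signal at time $t$ (this is the standard excess-entropy notation from the information-theory literature; the paper's own formalism may differ in detail):

$$
E \;=\; I(\text{past};\,\text{future}) \;=\; \lim_{T \to \infty} I\big(X_{-T:-1};\; X_{0:T-1}\big) \;=\; \lim_{T \to \infty} \Big[ H(X_{0:T-1}) - H(X_{0:T-1} \mid X_{-T:-1}) \Big].
$$

Low $E$ means a predictor can track the process with a short memory; high $E$ means arbitrarily old history stays relevant.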
When they optimize synthetic communication systems under this constraint, something specific emerges. The codes don't just segment; they factorize meaning distributions into approximately independent components, then express each component locally and systematically. This is a sequential generalization of Independent Component Analysis (ICA): find the dimensions of meaning that carry information independently, and dedicate contiguous signal regions to each one.
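Here's a toy version of the factorization point in Python (my construction, for illustration; not the paper's simulations). Meanings have two independent features; a systematic code spells feature one in signal position one and feature two in position two, while a hypothetical holistic code maps meanings to two-symbol strings arbitrarily. The systematic code leaves zero mutual information between the positions; the holistic code couples them:

```python
import itertools, random
from math import log2

# Two meaning features with non-uniform marginals, drawn independently,
# so the meaning distribution factorizes exactly: P(a, b) = P(a) * P(b).
p_a = [0.50, 0.25, 0.15, 0.10]
p_b = [0.40, 0.30, 0.20, 0.10]
meanings = list(itertools.product(range(4), range(4)))
p_meaning = {(a, b): p_a[a] * p_b[b] for (a, b) in meanings}

def mutual_information(joint):
    """I(S1; S2) in bits for a joint distribution {(s1, s2): prob}."""
    m1, m2 = {}, {}
    for (s1, s2), p in joint.items():
        m1[s1] = m1.get(s1, 0.0) + p
        m2[s2] = m2.get(s2, 0.0) + p
    mi = sum(p * log2(p / (m1[s1] * m2[s2]))
             for (s1, s2), p in joint.items() if p > 0)
    return max(0.0, mi)  # clamp tiny negative floating-point error

# Systematic code: position 1 spells feature a, position 2 spells feature b,
# so the signal distribution inherits the meaning factorization.
systematic = dict(p_meaning)

# Holistic code: an arbitrary bijection from meanings to two-symbol strings,
# ignoring the factorization entirely.
codewords = meanings[:]
random.Random(0).shuffle(codewords)
holistic = {cw: p_meaning[m] for cw, m in zip(codewords, meanings)}

print(f"I(pos1; pos2), systematic: {mutual_information(systematic):.3f} bits")  # 0.000
print(f"I(pos1; pos2), holistic:   {mutual_information(holistic):.3f} bits")    # well above 0
```

The systematic code is sequential ICA in miniature: each independent meaning dimension owns its own stretch of signal, so nothing about position one needs to be remembered to predict position two.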
The result is words. Not arbitrary chunks, but units whose boundaries coincide with drops in mutual information — points where knowing the past stops helping you predict the future. Cross-linguistic corpus analysis confirms the prediction: at the levels of phonology, morphology, syntax, and lexical semantics, natural languages show significantly lower excess entropy than scrambled baselines. Semantic features correlate more strongly within words than across word boundaries, exactly as the theory predicts.
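The boundary signature is easy to reproduce on synthetic text. A sketch under simple assumptions (a hypothetical four-word lexicon, words drawn independently; nothing here is from the paper's corpora): concatenate the words into a character stream and ask how predictable the next character is, inside words versus at boundaries:

```python
import random
from collections import Counter, defaultdict
from math import log2

rng = random.Random(1)
lexicon = ["ba", "dipo", "kuta", "golabu"]  # hypothetical mini-lexicon
words = [rng.choice(lexicon) for _ in range(20000)]
stream = "".join(words)

# Positions where a new word starts (the gaps we call boundaries).
boundary, pos = set(), 0
for w in words:
    pos += len(w)
    boundary.add(pos)

# Empirical next-character distribution given the preceding two characters.
ctx = defaultdict(Counter)
for i in range(2, len(stream)):
    ctx[stream[i - 2:i]][stream[i]] += 1

def entropy(counter):
    n = sum(counter.values())
    return -sum(c / n * log2(c / n) for c in counter.values())

h_ctx = {c: entropy(cnt) for c, cnt in ctx.items()}

# Average conditional entropy of the next character, split by gap type.
totals = {True: [0.0, 0], False: [0.0, 0]}  # True = word boundary
for i in range(2, len(stream)):
    t = totals[i in boundary]
    t[0] += h_ctx[stream[i - 2:i]]
    t[1] += 1

print(f"H(next char | context) inside words:  {totals[False][0] / totals[False][1]:.2f} bits")
print(f"H(next char | context) at boundaries: {totals[True][0] / totals[True][1]:.2f} bits")
```

Inside a word the next character is close to determined; at a boundary the conditional entropy jumps to roughly the entropy of the word choice itself. The context has, in effect, expired.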
The distinction matters: pure compression doesn't produce this structure. A minimal-length binary encoding shortens the signal but entangles its positions: meaning components get smeared across the whole string, so every bit can depend on every other bit, and an incremental processor must carry everything it has seen. The predictive information bottleneck yields a different optimum: modularity. The cheapest code for a sequential processor is one that lets the processor periodically forget without losing predictive power. Word boundaries are the forgetting points.
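The forgetting-point claim can be checked on the same kind of toy stream (again my sketch, same hypothetical lexicon): estimate the mutual information between a short past window and a short future window, separately at word-internal gaps and at boundaries. Because the word choices are independent, the past carries essentially no information across a boundary, so a predictor that flushes its memory there loses nothing:

```python
import random
from collections import Counter
from math import log2

rng = random.Random(1)
lexicon = ["ba", "dipo", "kuta", "golabu"]  # same hypothetical lexicon
words = [rng.choice(lexicon) for _ in range(20000)]
stream = "".join(words)
boundary, pos = set(), 0
for w in words:
    pos += len(w)
    boundary.add(pos)

def mi_bits(pairs):
    """Empirical I(past; future) in bits over (past, future) window pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(p for p, _ in pairs)
    right = Counter(f for _, f in pairs)
    return sum(c / n * log2((c / n) / ((left[p] / n) * (right[f] / n)))
               for (p, f), c in joint.items())

# Collect (two chars before, two chars after) around every gap.
gaps = {True: [], False: []}  # True = gap is a word boundary
for i in range(2, len(stream) - 2):
    gaps[i in boundary].append((stream[i - 2:i], stream[i:i + 2]))

print(f"I(past; future) inside words:  {mi_bits(gaps[False]):.3f} bits")  # substantial
print(f"I(past; future) at boundaries: {mi_bits(gaps[True]):.3f} bits")   # ~0, sampling noise
```

That is the modularity the bottleneck buys: the listener's memory load is bounded by the current word, not by the whole discourse.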
The constraint creates this specific structure, not just any structure.