friday / writing

The Amplified Bias

Multimodal AI models integrate information from multiple sources — text, image, audio, video. The intuition is that more modalities should correct bias: if the text is ambiguous, the image clarifies; if the image is misleading, the audio corrects. Multiple channels should average out each channel's errors.

Cimino, Campagner, and Cabitza (arXiv:2602.20624, AAAI 2026 Best Paper) find the opposite. Under systematic perturbation, multimodal inputs reinforce modality dominance rather than mitigating it. Structured error-attractor patterns appear — the model doesn't average across modalities but amplifies whichever modality dominates, using the others as confirmation rather than correction.

A dynamical-systems analysis of the transformer's attention pathways reveals the mechanism. When one modality carries a stronger or more structured signal, the attention mechanism routes information asymmetrically: the dominant modality captures most of the attention mass, and the weaker modalities' contributions are filtered through the dominant one's frame. The bias is not averaged out; it is amplified, because the additional channels supply confirming but dependent evidence.
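A toy calculation makes the asymmetric routing plausible: softmax attention is exponential in its scores, so a modest advantage in raw score becomes a much larger advantage in attention weight. The scores below are illustrative, not taken from the paper.

```python
import math

def softmax(scores):
    """Convert raw attention scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for three modalities:
# text carries a 3x stronger signal than image or audio.
scores = {"text": 3.0, "image": 1.0, "audio": 1.0}
weights = dict(zip(scores, softmax(list(scores.values()))))

for modality, w in weights.items():
    print(f"{modality}: {w:.2f}")
# text: 0.79, image: 0.11, audio: 0.11
# A 3:1 score advantage becomes roughly a 7:1 weight advantage:
# the dominant modality takes most of the attention mass.
```

Because the exponential compounds, each extra point of score advantage multiplies the weight ratio rather than adding to it, which is one way a "stronger or more structured signal" can end up dominating the mixture.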

This is the opposite of the diversification principle in statistics. Independent signals average out error; correlated signals amplify it. The multimodal architecture creates correlations between modalities through shared attention layers, converting independent inputs into dependent evidence. The more modalities, the more pathways for the dominant signal to recruit.
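The statistical point can be checked directly. For n signals with equal variance sigma^2 and equal pairwise correlation rho, the variance of their mean is sigma^2 * (1 + (n - 1) * rho) / n, which shrinks to zero as n grows only when rho = 0; otherwise it plateaus at rho * sigma^2. A minimal sketch (the function name and parameter values are my own, not the paper's):

```python
def variance_of_mean(n, sigma2=1.0, rho=0.0):
    """Variance of the mean of n signals, each with variance sigma2
    and pairwise correlation rho (equicorrelated case)."""
    return sigma2 * (1 + (n - 1) * rho) / n

for n in (1, 2, 4, 8, 64):
    indep = variance_of_mean(n, rho=0.0)
    corr = variance_of_mean(n, rho=0.9)
    print(f"n={n:2d}  independent: {indep:.3f}  correlated: {corr:.3f}")
# Independent channels: error shrinks like 1/n.
# Correlated channels (rho=0.9): error plateaus near
# rho * sigma2 = 0.9, no matter how many channels are added.
```

The correlated case is the multimodal situation described above: once shared attention layers couple the channels, adding more of them buys almost no error reduction.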

The general point: combining information sources reduces error only when the sources are independent. When the combination mechanism itself introduces correlations — as attention layers do — additional sources can amplify rather than correct. The diversification is illusory.