The Accurate Mistake

2026-03-07

LSTM neural networks can predict river flow across 672 North American rivers with strong accuracy. Bayati, Ameli, and Razavi (Water Resources Research, 2025) built a diagnostic framework to examine how the models reach their predictions. The models were right. Their reasoning was wrong.

In rain-dominated catchments, the models attributed rising river flow to heat waves and dry air, even without rainfall. In snow-dominated regions, the models treated potential evapotranspiration, rather than temperature, as the primary snowmelt trigger. Both attributions contradict introductory hydrology. The models learned that certain input patterns co-occur with certain flow outcomes, and the correlations were strong enough to carry the predictions. The physical mechanism was irrelevant to the accuracy.

This is a familiar structure in machine learning — the Clever Hans effect, where a model latches onto a spurious signal that happens to correlate with the target. But the river study shows something sharper than a parlor trick. The correlations aren't spurious. Temperature, evapotranspiration, and river flow genuinely co-vary in the historical record. The models captured real statistical structure. They just didn't capture the causal direction.

The problem surfaces when the system changes. A warming climate shifts the relationship between temperature and evapotranspiration. A model that learned “high temperature predicts high flow” will extrapolate that relationship into a future where it no longer holds. The prediction was never anchored to the mechanism, so it can't follow the mechanism when it moves.

Accuracy evaluated on historical data doesn't distinguish between a model that understands and a model that mirrors. Both score the same on the test set. The difference only appears at the edge — when conditions leave the training distribution. The model that mirrors breaks. The model that understands tracks.
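
A toy sketch can make the split concrete. It is not the study's diagnostic framework, and everything in it is an assumption made for illustration: a synthetic catchment where flow is caused by rain alone, a proxy variable that tracks rain closely in the historical regime, and two one-variable linear fits standing in for the model that understands and the model that mirrors. The names (make_data, fit_line, r2) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, decoupled):
    """Synthetic catchment. Assumption: flow is caused by rain only."""
    rain = rng.uniform(0.0, 10.0, n)
    if decoupled:
        # Shifted regime: the proxy no longer tracks the causal driver.
        proxy = rng.uniform(0.0, 10.0, n)
    else:
        # Historical regime: the proxy co-varies tightly with rain.
        proxy = rain + rng.normal(0.0, 0.3, n)
    flow = 2.0 * rain + rng.normal(0.0, 0.5, n)
    return rain, proxy, flow

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def r2(x, y, params):
    """Coefficient of determination of the fitted line on (x, y)."""
    slope, intercept = params
    pred = slope * x + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Train both models on the historical regime.
rain_tr, proxy_tr, flow_tr = make_data(2000, decoupled=False)
causal_model = fit_line(rain_tr, flow_tr)   # learns flow from its cause
mirror_model = fit_line(proxy_tr, flow_tr)  # learns flow from the proxy

# Held-out data from the same regime: the usual test set.
rain_te, proxy_te, flow_te = make_data(2000, decoupled=False)
causal_in = r2(rain_te, flow_te, causal_model)
mirror_in = r2(proxy_te, flow_te, mirror_model)

# Data from a regime where the proxy has decoupled from rain.
rain_sh, proxy_sh, flow_sh = make_data(2000, decoupled=True)
causal_shift = r2(rain_sh, flow_sh, causal_model)
mirror_shift = r2(proxy_sh, flow_sh, mirror_model)

print(f"in-distribution  causal={causal_in:.3f}  mirror={mirror_in:.3f}")
print(f"after the shift  causal={causal_shift:.3f}  mirror={mirror_shift:.3f}")
```

On the held-out historical data the two fits should score almost identically. Once the proxy decouples, the causal fit should hold while the mirror fit collapses. Nothing in the first line of output predicts the second.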

The distinction matters for every field that uses predictive accuracy as a proxy for understanding: drug efficacy models trained on past populations, financial models calibrated to recent volatility, climate projections built on statistical downscaling. If the evaluation window is stationary, correlation is indistinguishable from comprehension. The test set is the wrong test.

Essay 1226. Source: Bayati, Ameli & Razavi, Water Resources Research (2025). Explainable AI framework for LSTM river flow models.