The Semantic Prior

Causal discovery from observational data is hard — conditional independence tests can distinguish correlation from causation only under strong assumptions, and the number of possible directed acyclic graphs grows superexponentially with the number of variables. Pure statistical methods struggle with small samples, with many variables, and when the faithfulness assumption is violated.
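The superexponential growth is concrete. Robinson's recurrence counts the labeled DAGs on n nodes (OEIS A003024), and a few lines of Python make the explosion visible:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Count the labeled DAGs on n nodes via Robinson's recurrence (OEIS A003024)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

# num_dags(2) == 3, num_dags(3) == 25, num_dags(4) == 543 -- already far
# too many graphs to score exhaustively by a couple dozen variables.
```

Exhaustive search over this space is hopeless almost immediately, which is why any prior that prunes it is valuable.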

KaPatel and colleagues (arXiv:2602.20333) add a semantic stage. Before running any statistical test, a language model reads the variable names, descriptions, and metadata to generate a preliminary causal graph. This graph is sparse and approximate — it encodes domain knowledge that the variable names already carry. “Temperature” causes “evaporation” not because the data says so but because the words say so.
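The post doesn't reproduce the paper's prompting details, but the shape of the semantic stage is easy to sketch. Here the LLM call is stubbed by a lookup table; `propose_edge` and `KNOWN` are hypothetical names, not the paper's API:

```python
from itertools import permutations

def semantic_prior(variables, propose_edge):
    """Return the sparse set of directed edges the language model endorses.

    `propose_edge(cause, effect)` stands in for an LLM query of the form
    "given only these names, does <cause> plausibly cause <effect>?" --
    no data is consulted at this stage.
    """
    return {(a, b) for a, b in permutations(variables, 2) if propose_edge(a, b)}

# Stub in place of a real LLM: a tiny table of name-level domain knowledge.
KNOWN = {("temperature", "evaporation"), ("pressure", "volume")}
edges = semantic_prior(["temperature", "evaporation", "humidity"],
                       lambda a, b: (a, b) in KNOWN)
# edges == {("temperature", "evaporation")}
```

The point of the stub is that the prior is a function of the names alone, which is exactly what makes it cheap — and exactly why it needs the statistical stage behind it.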

The statistical stage then verifies: conditional independence tests check the semantic graph's edges, and violations trigger structural revisions. The final graph has been proposed by language and corrected by statistics.
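A minimal sketch of that verification loop, assuming roughly Gaussian variables and a Fisher-z partial-correlation test — a common choice for conditional independence, though the post doesn't say which test the paper uses, and `verify_edges` is an illustrative name:

```python
import numpy as np

def fisher_z_independent(x, y, z=None):
    """Fisher-z test of (conditional) independence for roughly Gaussian data."""
    if z is None:
        r = np.corrcoef(x, y)[0, 1]
        dof = len(x) - 3
    else:
        # Partial correlation: residualize x and y on the conditioning set z.
        zc = np.column_stack([np.ones(len(x)), z])
        rx = x - zc @ np.linalg.lstsq(zc, x, rcond=None)[0]
        ry = y - zc @ np.linalg.lstsq(zc, y, rcond=None)[0]
        r = np.corrcoef(rx, ry)[0, 1]
        dof = len(x) - z.shape[1] - 3
    zstat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(dof)
    return abs(zstat) < 1.96  # fail to reject independence at alpha ~ 0.05

def verify_edges(data, edges):
    """Keep only the semantic edges the data supports (dependence holds)."""
    kept = set()
    for a, b in edges:
        rest = [v for v in data if v not in (a, b)]
        z = np.column_stack([data[v] for v in rest]) if rest else None
        if not fisher_z_independent(data[a], data[b], z):
            kept.add((a, b))
    return kept
```

Edges the data contradicts are pruned; what survives is the language model's proposal filtered through the sample — proposed by language, corrected by statistics.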

The performance gains come from the semantic stage, not from overfitting to benchmarks. Across industrial engineering, environmental monitoring, and IT systems, the improvements in recall and F1 persist. The language model contributes genuine causal priors — the names of variables are not arbitrary labels but descriptions of physical processes that constrain the causal structure.

The general observation: variable names are data. In standard causal discovery, they are ignored — the algorithms treat variables as abstract nodes with arbitrary labels. But naming conventions carry centuries of accumulated causal knowledge. “Pressure” and “volume” already encode the direction of physical law. Using this knowledge is not cheating — it is recognizing that the metadata is part of the dataset.