Machine learning models for materials property prediction achieve impressive benchmark scores. They predict band gaps, gas adsorption capacities, battery voltages, and emission wavelengths from molecular descriptors — chemical fingerprints that encode composition, structure, and topology. The benchmarks are standardized, the metrics are clear, and the models improve with each generation of architecture. The assumption is that high benchmark performance means the model has learned chemistry.
Jablonka (arXiv 2602.17730, February 2026) tests this assumption by asking a simple question: can the same models predict who published the data?
Across five materials domains — metal-organic frameworks, perovskites, batteries, TADF emitters, and general molecular properties — models trained on standard chemical descriptors can predict the author, journal, and publication year of each data point well above chance. The descriptors that encode chemical structure also encode bibliographic metadata, because different research groups study different classes of materials using different methods, and these preferences imprint on the descriptor space.
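A minimal synthetic sketch of this effect (illustrative only; the lab clusters, descriptor dimensions, and classifier are assumptions, not the paper's setup): if each group's materials cluster in descriptor space, an off-the-shelf classifier recovers group identity from the descriptors alone.

```python
# Synthetic sketch: research groups as clusters in descriptor space.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_labs, per_lab, n_desc = 4, 100, 16

# Hypothetical labs: each studies a different material family, modeled
# here as a lab-specific mean shift in an otherwise generic descriptor space.
lab_centers = rng.normal(0.0, 1.0, size=(n_labs, n_desc))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(per_lab, n_desc))
               for c in lab_centers])
y = np.repeat(np.arange(n_labs), per_lab)  # target: who "published" each point

clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"lab-prediction accuracy: {acc:.2f} (chance: {1 / n_labs:.2f})")
```

Nothing bibliographic appears in the inputs; the classifier succeeds purely because group membership shapes where materials sit in descriptor space.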
The test that makes the argument sharp: when bibliographic features alone — publication year, author identity, journal — are used as inputs instead of chemical descriptors, they sometimes match or approach the prediction accuracy of chemistry-based models. A model that knows nothing about chemistry, only about who measured what and when, can predict material properties because the properties are correlated with the measurement context.
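The bibliography-only baseline can be sketched the same way (again synthetic; the lab offsets and year drift are invented for illustration): the inputs are one-hot lab identity and publication year, and the model never sees a descriptor.

```python
# Synthetic sketch: predicting a property from bibliographic features only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
n_labs, per_lab = 5, 80
lab = np.repeat(np.arange(n_labs), per_lab)
year = rng.integers(2015, 2025, size=lab.size)

# Invented property: a lab-specific systematic offset plus a small drift
# with publication year plus noise -- no chemistry anywhere in the setup.
lab_offset = np.linspace(-3.0, 3.0, n_labs)
prop = (lab_offset[lab] + 0.1 * (year - 2015)
        + rng.normal(0.0, 0.5, size=lab.size))

# Features: one-hot lab identity and year. That is the entire model.
X = np.column_stack([np.eye(n_labs)[lab], year])
r2 = cross_val_score(LinearRegression(), X, prop,
                     cv=KFold(5, shuffle=True, random_state=0)).mean()
print(f"R^2 from bibliographic features alone: {r2:.2f}")
```

To the extent measured properties carry systematic lab offsets, a linear model on provenance alone explains a large share of the variance, with no chemical input at all.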
This doesn't mean the chemistry-based models have learned zero chemistry. It means they've learned chemistry entangled with bibliographic bias, and the benchmarks can't distinguish between the two. A model that predicts high CO2 adsorption for metal-organic frameworks may be learning that frameworks with certain linker geometries create favorable pore environments — or it may be learning that a particular research group consistently reports high adsorption values for the class of materials they specialize in, and those materials share common descriptors because the group has synthetic preferences.
The problem is structural, not a failure of any particular model. Materials databases are not random samples of materials space. They are the accumulated output of research programs, each with its own priorities, expertise, and systematic biases. A group that studies a specific material family will populate the database with structurally similar entries, all measured with the same equipment, reported in the same journals, with systematic offsets reflecting their specific experimental protocols. The chemical descriptors carry this provenance information because chemistry and research sociology are not independent.
Standard cross-validation doesn't catch this. Randomly splitting a dataset into train and test sets distributes each group's data across both partitions, so the model sees examples from every provenance during training. It learns the systematic offsets and applies them correctly at test time, achieving high accuracy that looks like chemistry but may partly be bibliography.
The fix is straightforward in principle: stratified cross-validation that holds out entire research groups, or entirely new material classes, rather than random data points. Some of the benchmarked models lose substantial accuracy under this test. Others maintain it — suggesting genuine chemical learning. But the field doesn't routinely perform these stratified evaluations, so the fraction of benchmark performance attributable to real chemistry versus bibliographic correlation is unknown for most published models.
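The contrast between the two evaluations can be sketched directly (assumed setup, not the paper's code): the synthetic property below mixes a real chemical signal with a lab-specific offset, and the labs cluster in descriptor space. Random K-fold sees every lab during training; scikit-learn's `GroupKFold` holds entire labs out.

```python
# Synthetic sketch: random CV vs. leave-labs-out CV on entangled data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(2)
n_labs, per_lab, n_desc = 6, 60, 12
labs = np.repeat(np.arange(n_labs), per_lab)

# Each lab's materials cluster in descriptor space (synthetic preference)...
centers = rng.normal(0.0, 1.0, size=(n_labs, n_desc))
X = centers[labs] + rng.normal(0.0, 0.3, size=(labs.size, n_desc))

# ...and the measured property mixes real chemistry (the first descriptor)
# with a lab-specific systematic offset.
offsets = np.linspace(-3.0, 3.0, n_labs)
y = X[:, 0] + offsets[labs] + rng.normal(0.0, 0.2, size=labs.size)

model = RandomForestRegressor(n_estimators=100, random_state=0)
random_r2 = cross_val_score(model, X, y,
                            cv=KFold(5, shuffle=True, random_state=0)).mean()
group_r2 = cross_val_score(model, X, y,
                           cv=GroupKFold(n_splits=3), groups=labs).mean()
print(f"random-split R^2: {random_r2:.2f}  leave-labs-out R^2: {group_r2:.2f}")
```

The random split scores well partly because the model can infer each lab's offset from cluster membership; the group split removes that shortcut, and the size of the drop estimates how much of the score was provenance rather than chemistry.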
The models are clever. They find whatever signal predicts the target, and bibliographic fingerprints are signal. The question is whether that signal generalizes to materials no research group has studied yet — the regime where materials discovery actually matters.