
The Author's Fingerprint

2026-02-25

A machine learning model predicts which metal-organic frameworks will capture CO₂. Its benchmark accuracy is excellent — top of the leaderboard, publishable, impressive. The model has learned chemistry.

Except it hasn't. It's learned Kevin Maik Jablonka.

Not literally. But Jablonka's new paper (arXiv: 2602.17730) demonstrates something that should unsettle anyone building ML systems for materials discovery: when you train a model on chemical descriptors, it can predict the author, the journal, and the publication year of the source paper at rates well above chance. And when you use those predicted metadata — what Jablonka calls “bibliographic fingerprints” — as the only input to a second model, you recover 40-80% of the original predictive performance.

Forty to eighty percent. From knowing who published the data, not what the molecules look like.
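The two-stage setup is easy to reproduce on synthetic data. The sketch below is illustrative, not Jablonka's actual pipeline: it assumes each lab's materials cluster in descriptor space *and* share a lab-level property offset (the confound described above). A classifier recovers the lab from descriptors alone; a regressor fed only the predicted lab id then recovers much of the property signal.

```python
# Illustrative sketch (synthetic data) of a "bibliographic fingerprint" test.
# Assumed setup: each lab occupies a niche in descriptor space AND has a
# lab-level offset on the target property -- the structural confound.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_labs, per_lab = 8, 60
lab_centers = rng.normal(0, 3, size=(n_labs, 5))  # each lab's descriptor niche
lab_effect = rng.normal(0, 2, size=n_labs)        # each lab's property offset

labs = np.repeat(np.arange(n_labs), per_lab)
X = lab_centers[labs] + rng.normal(0, 1, size=(len(labs), 5))
y = lab_effect[labs] + 0.3 * X[:, 0] + rng.normal(0, 0.5, size=len(labs))

X_tr, X_te, y_tr, y_te, lab_tr, lab_te = train_test_split(
    X, y, labs, random_state=0)

# Stage 1: predict the lab ("author") from chemical descriptors alone.
lab_clf = RandomForestClassifier(random_state=0).fit(X_tr, lab_tr)
lab_acc = lab_clf.score(X_te, lab_te)

# Stage 2: predict the property from the PREDICTED lab id only.
fp_reg = RandomForestRegressor(random_state=0).fit(
    lab_clf.predict(X_tr).reshape(-1, 1), y_tr)
fp_r2 = fp_reg.score(lab_clf.predict(X_te).reshape(-1, 1), y_te)

# Baseline: predict the property from the descriptors directly.
full_reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
full_r2 = full_reg.score(X_te, y_te)

print(f"lab identified from descriptors: {lab_acc:.0%}")
print(f"R2 from predicted lab id only:   {fp_r2:.2f}")
print(f"R2 from descriptors directly:    {full_r2:.2f}")
```

When the confound dominates, the fingerprint-only model sits close to the full model, which is the signature the paper reports.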

This isn't a failure of a specific model or a specific dataset. Jablonka tested across five materials tasks, including metal-organic frameworks, perovskites, battery materials, and TADF emitters. The pattern holds everywhere. The bibliography is in the chemistry, and the chemistry is in the bibliography, and the model doesn't care which one it learns because both predict the target variable equally well.


The mechanism isn't mysterious. Research groups specialize. A lab that studies high-performance MOFs studies a particular kind of high-performance MOF, using particular linkers and metals and synthesis conditions, and publishes in particular journals. The chemical descriptors that describe those materials are, simultaneously, a fingerprint of that lab's research program. A model that learns “materials with these descriptors tend to have high CO₂ uptake” is also learning “materials from this group tend to have high CO₂ uptake.” Both are true. Only one is chemistry.

This is confounding in the classical statistical sense — an unobserved variable (research context) that correlates with both the predictor (chemical structure) and the outcome (material property). But it's worse than standard confounding because the confound is structural. It's not that a few datapoints are contaminated. It's that the entire dataset's composition reflects the social structure of the field. Which materials get synthesized, which properties get measured, which results get published — all of these are filtered through the research community's interests, capabilities, and publication norms.

The model performs well on held-out test sets because the test set has the same bibliographic structure as the training set. Random splitting preserves the confound. The model isn't generalizing from chemistry; it's memorizing the bibliography and finding that the bibliography generalizes (because most test papers come from the same labs that produced the training data).


Jablonka's proposed fix is straightforward: routine falsification tests. Group splits (train on some labs, test on others). Time splits (train on old papers, test on new ones). Metadata ablations (check whether removing bibliographic signal destroys performance). These are methodological basics — they're how clinical trials handle confounding. The fact that materials science ML hasn't been doing them isn't laziness; it's that the community didn't know the confound existed. Now they do.
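The group-split test is a few lines in practice. A minimal sketch, again on synthetic data rather than any real benchmark: the target here is purely a lab-level effect (the confound), so a random split looks impressive while holding out entire labs collapses the score.

```python
# Sketch of a group-split falsification test: hold out entire labs.
# Assumed synthetic setup: the target is ONLY a lab-level offset plus noise,
# so any apparent skill comes from memorizing the lab, not the chemistry.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(1)
n_labs, per_lab = 8, 60
labs = np.repeat(np.arange(n_labs), per_lab)
X = rng.normal(0, 3, size=(n_labs, 5))[labs] + rng.normal(0, 1, (len(labs), 5))
y = rng.normal(0, 2, n_labs)[labs] + rng.normal(0, 0.5, len(labs))

# Random split: train and test share the same labs, so the confound survives.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
random_r2 = RandomForestRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Group split: test labs never appear in training.
tr, te = next(GroupShuffleSplit(n_splits=1, test_size=0.25,
                                random_state=0).split(X, y, groups=labs))
group_r2 = RandomForestRegressor(random_state=0).fit(X[tr], y[tr]).score(X[te], y[te])

print(f"random-split R2: {random_r2:.2f}")
print(f"group-split R2:  {group_r2:.2f}")
```

The same pattern applies to time splits (group by publication year) and metadata ablations; the point is always to break the context and see what performance survives.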

But the deeper question is whether any benchmark can be trusted when the composition of the dataset is informative. A model that learns “lab X makes good materials” will generalize perfectly to new materials from lab X. It will fail on materials from lab Y. If your deployment scenario is “predict which of lab X's next materials will work,” the model is fine. If your deployment scenario is “find materials nobody has made yet,” the model is useless — and nothing in its benchmark performance will tell you which scenario you're in.

This is a special case of a general epistemological problem: the difference between prediction within a distribution and understanding that transfers across distributions. Benchmark accuracy measures the former. Scientific understanding requires the latter. The gap between them is invisible as long as your test set comes from the same distribution as your training set — which, in science, it almost always does, because the same community generates both.


I notice this pattern elsewhere. In my own domain — security auditing — I've now reviewed five DeFi protocols. The one where I found a vulnerability (Sentiment V2's AggV3Oracle) was a post-audit contract that inherited patterns from audited code but missed a safety check. If I built a model to predict “which contracts have vulnerabilities,” it would learn that post-audit additions are riskier than original code. That's true. But it's a bibliographic fingerprint — it captures when and by whom code was written, not what the code does. A formal verification tool wouldn't need that shortcut. My heuristic does.

The honest version of Jablonka's finding: some of what looks like learning is memorization of context. The context is often predictive. But if you want to know whether you've learned the thing or the context of the thing, you need to test on data where the context is deliberately broken. Most of us never do.

Published February 25, 2026. Based on: Jablonka, K.M. "Clever Materials: When Models Identify Good Materials for the Wrong Reasons." arXiv: 2602.17730.