When a foundation model trained on single-cell gene expression data (such as scGPT or Geneformer) processes a cell's transcriptome, it produces internal representations: high-dimensional vectors for each gene. These representations live in a geometric space. The space has structure: distances, neighborhoods, topology. The question is whether this structure is biological or computational. Does the model's internal geometry reflect the actual organization of gene regulation, or is it an artifact of the training process, with attention patterns, tokenization, and gradient descent producing structure that looks meaningful but encodes nothing about biology?
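To ground the vocabulary, here is a minimal sketch of the object under study. The embedding matrix is a random placeholder, since extracting gene vectors from an actual scGPT or Geneformer checkpoint is model-specific and omitted here; the point is the neighborhood structure on which distances and topology are defined.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder for a real gene-embedding matrix: one row per gene,
# extracted from a trained model's gene-token representations.
E = rng.normal(size=(2000, 512))

# The neighborhood structure is the raw material for everything that
# follows: distances define neighborhoods, neighborhoods define topology.
k = 15
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(E)
dists, idx = nn.kneighbors(E)
dists, idx = dists[:, 1:], idx[:, 1:]  # drop each gene's trivial self-match

print(f"mean cosine distance to {k} nearest genes: {dists.mean():.3f}")
```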
Kendiukhov (arXiv 2602.22289, February 2026) tests this with 141 hypotheses, screened through 52 iterations of an AI-driven experimental loop, using persistent homology, manifold geometry, cross-model alignment, community structure, and directed topology — all with explicit null controls.
Three findings survive the controls. First: the models learn genuine geometric structure. Gene embedding neighborhoods have non-trivial persistent homology — topological features (loops, voids) that persist across scales and are absent in shuffled controls. The topology is not random. It reflects something about the relationships between genes that the model has extracted from expression patterns.
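A sketch of this kind of null comparison, assuming the `ripser` library and placeholder embeddings; the paper's actual pipeline, filtration choices, and thresholds may differ. It compares H1 (loop) lifetimes in the real point cloud against a column-shuffled control that preserves each coordinate's marginal distribution but destroys joint structure.

```python
import numpy as np
from ripser import ripser  # persistent homology of point clouds

rng = np.random.default_rng(0)
E = rng.normal(size=(300, 50))  # placeholder gene embeddings

def h1_lifetimes(X):
    # H1 persistence: birth/death pairs of loops in the
    # Vietoris-Rips filtration built over the point cloud.
    dgm = ripser(X, maxdim=1)["dgms"][1]
    return dgm[:, 1] - dgm[:, 0] if len(dgm) else np.zeros(1)

# Null control: permute each coordinate independently across genes.
# Marginals are preserved; relationships between genes are destroyed.
E_null = np.column_stack([rng.permutation(col) for col in E.T])

real, null = h1_lifetimes(E), h1_lifetimes(E_null)
print(f"longest-lived loop: real={real.max():.3f}  shuffled={null.max():.3f}")
```

A loop that far outlives anything in the shuffled control is the kind of feature the paper counts as non-trivial topology.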
Second: the structure is shared across independently trained models. scGPT and Geneformer, trained on different data with different architectures, produce representations with high canonical correlation — the global geometry agrees. But the agreement breaks down at the gene level. Individual genes don't map consistently between the two models. The models converge on the same large-scale landscape but populate it with different features. The map agrees; the labels don't.
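The two-level comparison can be illustrated on placeholder data: canonical correlation analysis for the global agreement, per-gene k-nearest-neighbor overlap for the local disagreement. The construction of `B` below is an assumption chosen to mimic shared global structure with model-specific detail; it is not the paper's alignment procedure.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n_genes, d, k = 1000, 128, 15

# Placeholder embeddings: B shares A's global structure through a random
# linear map, plus model-specific noise (an illustrative assumption).
A = rng.normal(size=(n_genes, d))
B = (A @ rng.normal(size=(d, d))) * 0.1 + rng.normal(size=(n_genes, d))

# Global agreement: canonical correlations between the two spaces.
cca = CCA(n_components=10, max_iter=1000).fit(A, B)
Ac, Bc = cca.transform(A, B)
corrs = [np.corrcoef(Ac[:, i], Bc[:, i])[0, 1] for i in range(10)]
print("canonical correlations:", np.round(corrs, 2))

# Gene-level agreement: does each gene keep the same neighbors?
def knn_ids(X):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    return nn.kneighbors(X, return_distance=False)[:, 1:]

na, nb = knn_ids(A), knn_ids(B)
overlap = [len(set(na[i]) & set(nb[i])) / len(set(na[i]) | set(nb[i]))
           for i in range(n_genes)]
print(f"median per-gene neighbor overlap (Jaccard): {np.median(overlap):.2f}")
```

High canonical correlations alongside low neighbor overlap is exactly the dissociation described above: the map agrees, the labels don't.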
Third: the signal concentrates in immune tissue. Under the most stringent null controls — where the bar for declaring a result biologically meaningful is highest — immune cells and their gene programs show robust topological and geometric signal. Other tissue types show weaker or absent signal. The immune system, with its dramatic cell-type transitions, sharp expression programs, and well-characterized regulatory networks, provides the clearest training signal. The model learns best where the biology is most structured.
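The tissue-stratified null logic looks roughly like the following sketch. The tissue labels, the placeholder data, and the `signal_score` statistic are all illustrative assumptions; the paper's statistic would be a topological or geometric one, such as the total H1 persistence from the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def signal_score(X):
    # Illustrative stand-in for a topological statistic such as total
    # H1 persistence; it measures how correlated the rows of X are.
    return np.linalg.norm(np.corrcoef(X), "fro")

# Hypothetical per-tissue gene-embedding subsets (placeholder data).
tissues = {"immune": rng.normal(size=(200, 64)),
           "liver": rng.normal(size=(200, 64))}

n_perm = 200
for name, X in tissues.items():
    obs = signal_score(X)
    # Same column-shuffle null as before, applied within each tissue.
    null = np.array([
        signal_score(np.column_stack([rng.permutation(c) for c in X.T]))
        for _ in range(n_perm)
    ])
    p = (np.sum(null >= obs) + 1) / (n_perm + 1)
    print(f"{name}: score={obs:.2f}  empirical p={p:.3f}")
```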
The result is neither a vindication nor a dismissal of biological foundation models. The structure they learn is partially real — it captures topological features of gene regulation that are reproduced across independent models. But it's concentrated in the biological domains where the training data provides the strongest signal, and it fragments at the level of individual genes. The models learn the topology of the territory but draw the map in different coordinates.