Large language models encode truthfulness as a direction in representational space. A linear probe trained on true-false statements can identify a direction along which true and false representations separate. The natural question is whether this direction is universal — the same for all kinds of truth — or domain-specific.
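The probing setup can be sketched with synthetic data. Everything below is illustrative: the activations are random vectors with a planted truth direction, and the probe is a simple difference-of-means estimator, not necessarily the probe the paper uses.

```python
import numpy as np

# Toy sketch: recover a planted "truth direction" from activations with a
# difference-of-means probe. All data is synthetic; the paper's actual
# models, layers, and probe family are not reproduced here.
rng = np.random.default_rng(0)
d = 64                                   # toy hidden dimension
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)   # planted ground-truth direction

n = 1000
y = rng.integers(0, 2, size=n)           # 1 = true statement, 0 = false
# Representations: noise plus a shift of +/- truth_dir by label.
X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, truth_dir)

# Probe direction = difference of class means, normalized.
w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
w /= np.linalg.norm(w)

cos = abs(w @ truth_dir)                 # alignment with the planted direction
print(round(cos, 2))
```

With enough statements the recovered direction aligns closely with the planted one, which is the sense in which "truth is a direction" is an empirically testable claim.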
Collins, Dao, and their collaborators (arXiv:2602.20273) find both. The representational space contains a spectrum of truth directions: some domain-general (shared across definitional, empirical, logical, and fictional truths), some domain-specific, some subset-specific. The geometry of truth is not a single axis but a structured subspace.
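One way to see "a structured subspace" rather than a single axis: stack probe directions from several domains and factor them. In this synthetic sketch each domain direction is a shared component plus a small domain-specific residual (a construction I am assuming for illustration, not the paper's); the SVD then shows one dominant shared axis alongside smaller domain-specific ones.

```python
import numpy as np

# Synthetic sketch: four domain probes = shared "general truth" axis
# + small domain-specific residuals. The SVD spectrum of the stacked
# directions separates the general component from the specific ones.
rng = np.random.default_rng(3)
d = 32
general = rng.normal(size=d)
general /= np.linalg.norm(general)

domains = []
for _ in range(4):                       # definitional, empirical, logical, fictional
    specific = 0.05 * rng.normal(size=d) # small domain-specific component
    v = general + specific
    domains.append(v / np.linalg.norm(v))

P = np.vstack(domains)                   # 4 probe directions, one per domain
s = np.linalg.svd(P, compute_uv=False)
print([round(x, 2) for x in s])          # one large singular value, then small ones
```

A dominant first singular value corresponds to the domain-general direction; the tail carries the domain-specific structure.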
The quantitative finding: Mahalanobis cosine similarity between probes trained on different domains predicts cross-domain generalization with R² = 0.98. The geometry tells you in advance whether a truth detector trained on one domain will work on another. Most domains share enough structure for transfer. But sycophantic and inverted-expectation deception live in different parts of the space — probes trained on factual truth fail on sycophantic lies.
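The similarity-predicts-transfer idea can be sketched directly. The Mahalanobis cosine whitens by an activation covariance before measuring the angle; here the covariance and all directions are synthetic stand-ins, with a random unrelated direction playing the role of the sycophancy probe.

```python
import numpy as np

# Sketch: probes that share a component have high Mahalanobis cosine
# (transfer expected); an unrelated direction (standing in for the
# sycophancy probe) has low similarity (transfer fails). Sigma is an
# assumed diagonal activation covariance, not taken from the paper.
rng = np.random.default_rng(1)
d = 32
Sigma_inv = np.linalg.inv(np.diag(rng.uniform(0.5, 2.0, size=d)))

def mcos(u, v):
    """Cosine similarity in the Sigma^-1 (whitened) inner product."""
    num = u @ Sigma_inv @ v
    return num / np.sqrt((u @ Sigma_inv @ u) * (v @ Sigma_inv @ v))

shared = rng.normal(size=d)                 # common truth component
factual = shared + 0.3 * rng.normal(size=d) # empirical-truth probe
logical = shared + 0.3 * rng.normal(size=d) # logical-truth probe
syco = rng.normal(size=d)                   # unrelated direction

sim_transfer = mcos(factual, logical)       # high: shared structure
sim_syco = mcos(factual, syco)              # near zero: no shared structure
print(round(sim_transfer, 2), round(sim_syco, 2))
```

Under this picture, measuring the angle between two probes is a cheap pre-check for whether a detector will generalize, which is the operational content of the R² = 0.98 result.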
The causal result is even more interesting: domain-specific truth directions steer model behavior more effectively than domain-general ones. The general direction tells you what truth looks like; the specific direction tells you how to change the model's mind. And post-training — RLHF, instruction tuning — reshapes the geometry, pushing sycophantic lying further from other truth types. The chat model's tendency toward sycophancy is geometrically explainable: its representation of “agreeing with the user” has been moved away from its representation of “being truthful.”
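The steering operation itself is simple to state: add a scaled truth direction to a hidden state during the forward pass. This is a generic activation-steering sketch under that assumed mechanism; the scale `alpha` and the injection point are hypothetical, and the hidden state here is random rather than a real model activation.

```python
import numpy as np

# Sketch of activation steering: shift a hidden state along a unit-norm
# truth direction. alpha and the layer/site of injection are assumptions;
# the paper's finding is that domain-specific directions steer better.
def steer(hidden, direction, alpha=4.0):
    """Return the hidden state shifted by alpha along a unit direction."""
    return hidden + alpha * direction

rng = np.random.default_rng(2)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)   # unit-norm steering direction

h = rng.normal(size=d)                   # stand-in hidden state
h_steered = steer(h, direction)

# The projection onto the direction increases by exactly alpha.
delta = h_steered @ direction - h @ direction
print(round(delta, 2))  # 4.0
```

The paper's causal claim, in these terms, is that the shift changes downstream behavior more reliably when `direction` is the domain-specific probe rather than the domain-general one.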
The general point: when a concept has both universal and specific aspects, the representation can encode both in a structured subspace. The universal component enables transfer; the specific component enables control. Training reshapes the geometry, and that reshaping reveals what the training optimized for.