Unsupervised elicitation techniques steer language models toward truthfulness without human labels. The evaluations look promising: methods achieve high accuracy on standard benchmarks. But the benchmarks share three properties that the real problem does not.
Canavan and colleagues (arXiv:2602.20400) identify the gaps. Standard evaluation datasets lack features that compete with truthfulness for importance — the model has nothing to trade off against being correct. The training distributions are balanced — equal examples of true and false statements. The data points are unambiguous — each has a clear ground truth. None of these hold in deployment.
The authors build stress-test datasets addressing each gap. With competing features (e.g., social desirability opposing truth), no technique reliably performs well. With imbalanced distributions, calibration fails. With ambiguous data points, methods that appeared robust collapse.
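The imbalance failure can be sketched with a toy scorer. The score values (0.7 for true items, 0.3 for false) and the base rates below are illustrative assumptions, not numbers from the paper; the point is only that a scorer tuned to look calibrated on a balanced benchmark is miscalibrated the moment the base rate shifts.

```python
def mean_score(base_rate, score_true=0.7, score_false=0.3):
    # Expected average confidence of a scorer tuned on balanced data:
    # true items score 0.7, false items 0.3 (illustrative values).
    return base_rate * score_true + (1 - base_rate) * score_false

def calibration_gap(base_rate):
    # A calibrated scorer's mean confidence equals the base rate;
    # the gap is how far off this scorer is at a given base rate.
    return abs(mean_score(base_rate) - base_rate)

# Balanced benchmark: gap is zero, the scorer looks calibrated.
print(round(calibration_gap(0.5), 2))  # 0.0
# Skewed deployment (10% true): the same scorer is off by 0.24.
print(round(calibration_gap(0.1), 2))  # 0.24
```

The scorer itself never changes; only the distribution does, which is why the failure does not show up on the balanced benchmark.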
Ensembling across methods partially recovers performance, but degradation persists. The fundamental issue is not that the techniques are weak; it is that the standard evaluations were easy in ways that matched the techniques' assumptions. The techniques worked because the tests were designed in a space they could handle. The real problem lives in a harder space.
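A minimal sketch of why ensembling recovers only some of the loss, assuming a simple majority vote over hypothetical per-method verdicts (the paper's actual ensembling scheme may differ):

```python
def majority_vote(verdicts):
    # Label an item True iff more than half of the methods say True.
    # `verdicts` is a list of booleans, one per elicitation method.
    return sum(verdicts) > len(verdicts) / 2

# One method fooled by a competing feature: the ensemble recovers.
print(majority_vote([True, True, False]))   # True

# Two methods fooled the same way: the ensemble fails with them.
print(majority_vote([True, False, False]))  # False
```

Voting helps only when the methods' failures are uncorrelated; on stress tests where a competing feature misleads most methods at once, the majority is wrong and degradation persists.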
The general observation: an evaluation measures the intersection of the technique's capability and the test's difficulty. When the test is easy along the same axes where the technique is strong, performance is high, but the score reflects the test, not the technique. The gap between evaluation and deployment is not noise; it is a structural mismatch between what the test requires and what the world requires.