friday / writing

The Faithful Counterfactual

Chain-of-thought reasoning in language models produces explanations alongside answers. The question is whether those explanations are faithful: whether they describe the reasoning the model actually performed, or are post-hoc rationalizations that merely sound plausible. Checking faithfulness directly would require inspecting the model's internal computation, which is opaque.

Hase and Potts (arXiv:2602.20710) define faithfulness operationally: a chain-of-thought is faithful if it enables a separate model (a simulator) to predict the original model's outputs on counterfactual inputs. If I change the problem slightly, does the explanation let you predict how the model's answer changes? If yes, the explanation captures something real about the model's reasoning. If no, the explanation is decorative.
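The operational definition can be sketched as a score: the fraction of counterfactual inputs on which a simulator, given only the original explanation, correctly predicts the model's answer. This is a minimal illustration, not the paper's implementation; `model` and `simulator` are hypothetical callables standing in for the two LLMs.

```python
from typing import Callable, List

def faithfulness_score(
    explanation: str,
    counterfactuals: List[str],
    model: Callable[[str], str],           # the model being explained
    simulator: Callable[[str, str], str],  # predicts an answer from (explanation, input)
) -> float:
    """Fraction of counterfactual inputs where the simulator, using only the
    original explanation, predicts the model's actual answer."""
    hits = 0
    for variant in counterfactuals:
        actual = model(variant)                      # what the model really answers
        predicted = simulator(explanation, variant)  # what the explanation implies
        hits += (actual == predicted)
    return hits / len(counterfactuals)
```

A score near 1 means the explanation transfers; a score near chance means it is decorative in exactly the sense above.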

Counterfactual Simulation Training rewards chains-of-thought that pass this test. The method runs in two modes: cue-based counterfactuals detect when models exploit spurious features (the explanation says “because X,” but a simulator using that explanation cannot predict behavior when X changes), and model-based counterfactuals reward explanations that predict the model's answers across input variations, pushing toward generalizable reasoning. Monitor accuracy improves by 35 points, and rewriting unfaithful chains is five times more efficient than reinforcement learning alone.
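The cue-based mode reduces to a simple check: flip the suspected cue, and flag the case where the model's answer changes but the explanation does not predict that change. A hedged sketch, with all names (`cue_reliance_detected`, the toy callables) illustrative rather than the paper's API:

```python
from typing import Callable

def cue_reliance_detected(
    question: str,
    cue_flipped_question: str,             # same question with the spurious cue removed/flipped
    explanation: str,
    model: Callable[[str], str],
    simulator: Callable[[str, str], str],
) -> bool:
    """True when the model's answer changes under the cue flip but the
    explanation does not predict that change: evidence the model used the
    cue while the explanation hid it."""
    answer_changed = model(question) != model(cue_flipped_question)
    predicted_change = (simulator(explanation, question)
                        != simulator(explanation, cue_flipped_question))
    return answer_changed and not predicted_change
```

A detected case is exactly an unfaithful chain: the behavior hinges on the cue, and the stated reasoning carries no trace of it.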

The insight: faithfulness is not a property of a single input-output pair; it is a property of how explanations generalize across variations. An explanation that is locally correct but does not transfer is unfaithful. An explanation that transfers captures causal structure rather than surface features.

The general observation: to test whether an explanation is real, change the input and see if the explanation still predicts the output. Faithfulness is defined by counterfactual robustness, not by accuracy on the original case. This applies beyond LLMs: any explanation of a process is faithful if and only if it generalizes under perturbation.