
The Recipe

A strategy that appears in correct solutions should be useful as guidance for producing new correct solutions. This seems tautological. If a mathematical approach works — if it appears reliably in solutions that get the right answer — then telling a solver to use that approach should help. The strategy is demonstrated to work. The guidance should transfer.

Liang, Sun, Nan, Li, Song, and Kawaguchi (arXiv 2602.22583, February 2026) show that this transfer systematically fails. Strategies that appear in successful solutions (high usage) can be ineffective or counterproductive as guidance (low executability), and the mismatch is structured, not random.

The distinction separates two things that are usually conflated. Strategy usage measures whether a particular approach appears in solutions that happen to be correct. Strategy executability measures whether a solver, given that approach as explicit guidance, can follow it to produce a correct solution. Usage is retrospective — it describes what the solver did when it succeeded. Executability is prospective — it predicts what the solver will do when told to follow a specific path.
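The two measurements can be sketched with toy data. In the sketch below, the records and numbers are invented for illustration, not drawn from the paper: usage is computed over unguided solutions that happened to be correct, while executability is computed over runs where the solver was explicitly instructed to follow the strategy.

```python
# Toy trial records, invented for illustration.
# Unguided trials note whether the answer was correct and whether the
# strategy appeared in the solution; guided trials note whether the
# solver, told to follow the strategy, got the answer right.
unguided_trials = [
    {"correct": True, "strategy_present": True},
    {"correct": True, "strategy_present": True},
    {"correct": True, "strategy_present": False},
    {"correct": False, "strategy_present": True},
]
guided_trials = [
    {"correct": False},
    {"correct": True},
    {"correct": False},
    {"correct": False},
]

def usage(trials):
    """Retrospective: fraction of correct solutions containing the strategy."""
    correct = [t for t in trials if t["correct"]]
    return sum(t["strategy_present"] for t in correct) / len(correct)

def executability(trials):
    """Prospective: success rate when the solver is told to follow the strategy."""
    return sum(t["correct"] for t in trials) / len(trials)

print(usage(unguided_trials))       # high: the strategy shows up in wins
print(executability(guided_trials)) # low: the guidance does not transfer
```

The gap between the two printed numbers is exactly the usage/executability mismatch: the same strategy scores well on the first measure and poorly on the second.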

The gap between them exists because a strategy can be correlated with success without causing it. The solver may have arrived at the correct answer through a complex computation that happened to include the strategy as a byproduct. The strategy may only work when combined with other steps that the guidance does not specify. Or the solver's internal process for following explicit guidance may differ from its process for generating solutions spontaneously.
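The confounding can be made concrete with a toy simulation. Here a hidden "effort" variable drives both correctness and the strategy's appearance, so the strategy co-occurs with success even though forcing it changes nothing. This is an illustrative sketch, not the paper's experimental setup:

```python
import random

random.seed(1)

def solve(force_strategy=False):
    # Hidden confounder: how thoroughly the solver works the problem.
    effort = random.random()
    # Thorough solutions tend to include the strategy as a byproduct...
    strategy_present = force_strategy or (effort > 0.5 and random.random() < 0.9)
    # ...but correctness is driven by effort, not by the strategy itself.
    correct = random.random() < effort
    return strategy_present, correct

# Observational runs: the strategy co-occurs with success.
runs = [solve() for _ in range(10000)]
correct_runs = [s for s, c in runs if c]
print(sum(correct_runs) / len(correct_runs))  # usage among correct: well above chance

# Interventional runs: forcing the strategy does not raise accuracy.
baseline = sum(c for _, c in (solve() for _ in range(10000)))
forced = sum(c for _, c in (solve(force_strategy=True) for _ in range(10000)))
print(baseline, forced)  # roughly equal: correlation, not causation
```

The observational measure looks like an endorsement of the strategy; the interventional one shows the endorsement is empty, because the variable doing the causal work was never the strategy.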

The mismatch is source-dependent. Strategies written by humans and strategies extracted from model-generated solutions differ in structured, domain-dependent ways when applied as guidance. Human strategies that work well as guidance for humans may fail when given to models, and model-derived strategies may fail when given to different models. Executability depends on the specific solver, not just the strategy's mathematical validity. A correct recipe that a particular cook cannot follow is not useful guidance for that cook, regardless of how many other cooks have used it successfully.

The authors find consistent source-dependent reversals: in some mathematical domains, human-written strategies outperform model-generated ones as guidance, while in other domains the reverse holds. The reversals are not noise — they reproduce across problems within each domain. The structure of the mismatch reveals something about how different solvers represent and follow strategic guidance, something that usage statistics alone cannot show.
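One way to see what "not noise" means in this setting: if per-problem accuracy deltas (human-guided minus model-guided) keep the same sign across the problems within a domain, the reversal is systematic rather than an artifact of averaging. The deltas below are invented for illustration, not the paper's numbers:

```python
from statistics import mean

# Hypothetical per-problem accuracy deltas, human-guided minus model-guided.
# A domain-level reversal is "structured" when the sign is consistent
# across problems, not merely nonzero on average.
deltas = {
    "geometry":      [0.12, 0.09, 0.15, 0.11, 0.08],       # human guidance wins
    "number_theory": [-0.10, -0.14, -0.07, -0.12, -0.09],  # model guidance wins
}

for domain, ds in deltas.items():
    consistent = all(d > 0 for d in ds) or all(d < 0 for d in ds)
    print(domain, round(mean(ds), 3), "consistent" if consistent else "mixed")
```

A noisy mismatch would show mixed signs within each domain; a structured one, like the toy data here, flips direction only at the domain boundary.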

The practical consequence is that the common practice of in-context learning — showing a model examples of successful solutions and hoping it will extract useful strategies — has an inherent limitation. The model may extract the wrong lesson: a strategy that was present in the example but is not the causally relevant part, or a strategy that the example-solver could execute but the current solver cannot. Seeing someone else's correct work is not the same as being able to reproduce it by following the same plan.