Fine-tuning a language model on specialized data erodes its general knowledge. This is catastrophic forgetting: the model becomes expert at the new task while losing its old capabilities. The standard remedies are external: replay buffers (store old training examples), parameter freezing (lock certain weights), or data augmentation (mix in outside data to maintain breadth).
Huang and colleagues (arXiv:2602.20162) propose a simpler approach: before fine-tuning, the model talks to itself. It generates self-dialogues — conversations with itself about the topics it currently understands. These self-generated conversations are then mixed with the fine-tuning data. No external datasets. No architectural changes. No additional compute during fine-tuning.
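The paper's pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`generate_self_dialogues`, `mix_datasets`), the dialogue format, and the mixing ratio are all assumptions made for the example, and `generate` stands in for whatever sampling interface the model exposes.

```python
import random

def generate_self_dialogues(generate, topics, turns=3):
    """Have the model converse with itself about topics it already knows.

    `generate` is any callable mapping a prompt string to a response string
    (a stand-in for the model's sampling interface)."""
    dialogues = []
    for topic in topics:
        prompt = f"Discuss: {topic}"
        convo = []
        for _ in range(turns):
            reply = generate(prompt)
            convo.append(reply)
            prompt = reply  # the model responds to its own previous output
        dialogues.append({"topic": topic, "turns": convo})
    return dialogues

def mix_datasets(finetune_data, self_dialogues, ratio=0.5, seed=0):
    """Blend self-generated conversations into the fine-tuning set.

    `ratio` controls how many anchor examples are added relative to the
    size of the fine-tuning data (a hypothetical knob, not the paper's)."""
    rng = random.Random(seed)
    n_anchor = int(len(finetune_data) * ratio)
    anchors = [rng.choice(self_dialogues) for _ in range(n_anchor)]
    mixed = finetune_data + anchors
    rng.shuffle(mixed)
    return mixed
```

The key property is that everything on the anchor side comes from the model itself before any weights change; fine-tuning then proceeds normally on the mixed set.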
The result: in 40 of 50 scenarios, this self-dialogue approach preserves the original model's capabilities while achieving superior in-domain performance, outperforming parameter freezing, external data augmentation, and other established techniques. The model preserves itself by articulating itself before being changed.
The theoretical insight: forgetting partly stems from style-induced parameter drift — the fine-tuning data's stylistic patterns pull the weights away from the regions encoding general knowledge. The self-generated data acts as an anchor, keeping the style close to the model's natural voice while the content adapts. The model's own voice, explicitly recorded, resists the stylistic drift that narrow data would otherwise impose.
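The anchoring intuition can be shown with a toy one-dimensional weight model (my illustration, not the paper's analysis): the narrow fine-tuning data pulls the weight toward a distant optimum, while gradients from self-generated data pull it back toward its starting point. Both loss terms, the learning rate, and the mixing fraction are invented for the sketch.

```python
def drift_after_training(mix, steps=200, lr=0.1):
    """Toy gradient descent on a single weight.

    theta starts at 0.0 (the pretrained weight). The task gradient pulls
    theta toward a 'narrow-style' optimum at 3.0; the anchor gradient
    (from self-generated data) pulls back toward 0.0. `mix` is the
    fraction of anchor data in each batch."""
    task_optimum = 3.0
    theta = 0.0
    for _ in range(steps):
        g_task = theta - task_optimum   # gradient of 0.5*(theta - task_optimum)**2
        g_anchor = theta                # gradient of 0.5*theta**2 (stay near origin)
        theta -= lr * ((1 - mix) * g_task + mix * g_anchor)
    return abs(theta)                   # drift from the original weight

drift_after_training(0.0)   # ≈ 3.0: with no anchor data, full drift
drift_after_training(0.5)   # ≈ 1.5: mixing anchor data halves the drift
```

In this caricature the anchor term is an explicit regularizer toward the old weights; the paper's claim is subtler — the self-generated data keeps the *style* of the training distribution close to the model's natural voice, which has a similar stabilizing effect on the parameters.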
The general observation: a system about to undergo transformation can prepare by articulating its current state. The self-description becomes a stabilizing force during the change. This is different from backup (storing the old state externally) — it is the system actively generating the regularization data from its own current understanding. The conversation with itself is the anchor.