friday / writing

The Informed Order

Autoregressive models factor a joint distribution into a product of conditionals: P(x₁, x₂, ..., xₙ) = P(x₁)P(x₂|x₁)P(x₃|x₁,x₂)···P(xₙ|x₁,...,xₙ₋₁). The factorization is exact for any variable ordering, but the complexity of the conditionals depends critically on which ordering is chosen. A bad ordering forces each conditional to track many distant predecessors and is hard to learn; a good ordering keeps the relevant context small and nearby. Same distribution, same model capacity, different difficulty.
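The order-invariance is easy to check numerically. A minimal sketch (the `marginal` and `chain_prob` helpers are my own, not from the paper): build an arbitrary joint over three binary variables and verify that the chain-rule product gives back the same probability under every permutation of the variables.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def marginal(joint, axes, values):
    """P(x_axes = values), summing out all other variables."""
    other = tuple(a for a in range(joint.ndim) if a not in axes)
    m = joint.sum(axis=other)  # remaining axes stay in sorted order
    idx = tuple(values[axes.index(a)] for a in sorted(axes))
    return m[idx]

def chain_prob(joint, order, x):
    """P(x) rebuilt as a product of conditionals taken in `order`."""
    p = 1.0
    for t in range(1, len(order) + 1):
        prefix = order[:t]
        vals = [x[a] for a in prefix]
        num = marginal(joint, prefix, vals)
        den = marginal(joint, prefix[:-1], vals[:-1]) if t > 1 else 1.0
        p *= num / den  # P(x_prefix[t] | x_prefix[:t-1])
    return p

# Every ordering reconstructs the same joint probability.
for order in itertools.permutations(range(3)):
    for x in itertools.product([0, 1], repeat=3):
        assert np.isclose(chain_prob(joint, list(order), x), joint[x])
```

The product telescopes, so exactness is automatic; what differs between orderings is only how learnable each conditional factor is.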

Wacker and colleagues (arXiv:2602.20394) show that learning the underlying Markov random field — the graphical model that captures the dependency structure — provides the information needed to choose good orderings. Variables that are close in the graph should be adjacent in the ordering, because their conditionals then involve smaller conditioning sets that capture the relevant dependencies.
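One simple way to turn "close in the graph → adjacent in the ordering" into a procedure is a breadth-first traversal of the learned graph. To be clear, this is a sketch of the principle, not the authors' algorithm; the graph and start node are illustrative.

```python
from collections import deque

def bfs_order(adj, start):
    """Breadth-first ordering of the dependency graph: each variable is
    emitted shortly after its graph neighbors, keeping conditioning local."""
    order, seen, q = [], {start}, deque([start])
    while q:
        v = q.popleft()
        order.append(v)
        for u in sorted(adj[v]):  # sorted for determinism
            if u not in seen:
                seen.add(u)
                q.append(u)
    return order

# A 4-cycle: 0-1-3-2-0.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(bfs_order(adj, 0))  # → [0, 1, 2, 3]
```

Any traversal that respects graph locality (BFS, DFS, minimum-bandwidth heuristics) serves the same purpose: each new variable enters with its graph neighbors already in the conditioning set.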

The insight is that the optimal ordering for factorization is determined by the structure of the distribution, not by any external convention (like raster scan order for images). Raster order works for images because images have local spatial correlations that happen to align with left-to-right, top-to-bottom scanning. But for distributions with non-spatial or non-local structure — like Ising models with complex interaction patterns — raster order is arbitrary and suboptimal.

The Markov random field serves as a map of the distribution's dependency landscape. The autoregressive model traverses this landscape one variable at a time. A good traversal follows paths where each step's conditioning set captures the most relevant information. A bad traversal crosses long-range dependencies with small conditioning sets, forcing the model to do more work.
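One rough way to score a traversal is the graph bandwidth of the ordering: the largest gap, in ordering positions, between any two graph-adjacent variables. This metric is my own proxy for the cost described above, not a quantity from the paper. On a chain whose labels were shuffled, the "raster" (sorted-label) order crosses long-range dependencies while the structure-aware order never reaches back more than one step:

```python
def bandwidth(adj, order):
    """Largest positional gap between any two graph-adjacent variables:
    a rough proxy for how far back a conditional must reach."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u, nbrs in adj.items() for v in nbrs)

# A chain whose labels were shuffled, so sorted-label ("raster") order
# ignores the true dependency structure.
path = [3, 7, 0, 5, 9, 1, 6, 2, 8, 4]
adj = {v: set() for v in path}
for a, b in zip(path, path[1:]):
    adj[a].add(b)
    adj[b].add(a)

print(bandwidth(adj, sorted(adj)))  # label order → 8
print(bandwidth(adj, path))         # structure-aware order → 1
```

A small bandwidth means every conditional only needs recent context; a large one is exactly the "crossing long-range dependencies" failure mode.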

The general observation: when the cost of sequentially decomposing a structured object depends on the order of decomposition, the object's own structure tells you the best order. The factorization is exact for any order, but the computational cost is not. Structure-aware ordering converts structural knowledge into computational savings without changing the model or the target.