Transformer models achieve remarkable empirical performance, but a theoretical question persists: are they statistically optimal, or merely practically effective? Do they attain the best possible convergence rates, or could some other architecture do strictly better on the same problems?
Song and colleagues (arXiv:2602.20555) prove that standard Transformers — the unmodified architecture with self-attention and feed-forward layers, not an augmented variant — achieve the minimax optimal rate in nonparametric regression for Hölder continuous target functions. No architecture can do fundamentally better on this function class. The Transformer is not just good enough; it is rate-optimal.
The proof requires two intermediate results of independent interest. First, a fine-grained characterization of Transformer structure through “size tuples” and “dimension vectors” that track how the architecture's parameters relate to its approximation power. Second, tight bounds on the Lipschitz constant and memorization capacity of standard Transformers — how much the output changes with the input, and how many distinct patterns the architecture can encode.
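The Lipschitz constant mentioned above measures how much the output can change relative to a change in the input. As a toy illustration (my own sketch, not the paper's construction — the weight shapes and perturbation scale are arbitrary choices), one can probe the local Lipschitz behavior of a single self-attention head numerically:

```python
# Empirically estimate the local Lipschitz ratio of one self-attention
# head: max over random perturbations of ||f(X + dX) - f(X)|| / ||dX||.
# This is an illustrative probe, not the paper's theoretical bound.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention on a (T, d) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores, axis=-1) @ V

T, d = 8, 16  # sequence length and embedding dimension (arbitrary)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
X = rng.standard_normal((T, d))

ratios = []
for _ in range(200):
    dX = 1e-4 * rng.standard_normal((T, d))  # small random perturbation
    diff = attention(X + dX, Wq, Wk, Wv) - attention(X, Wq, Wk, Wv)
    ratios.append(np.linalg.norm(diff) / np.linalg.norm(dX))

print(f"empirical local Lipschitz estimate: {max(ratios):.3f}")
```

The paper's contribution is a tight *worst-case* bound over all inputs and parameter configurations; a probe like this only lower-bounds the constant at one point.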
The minimax rate for Hölder functions with smoothness s on [0,1]^d, in squared L2 risk, is n^(-2s/(2s+d)), where n is the sample size. The Transformer attains this rate: its standard components suffice to approximate the target function to the required precision, with no external tools and no special tricks. The architecture's inductive bias is aligned with the function class's structure.
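The formula above also makes the curse of dimensionality concrete: for fixed smoothness s, the rate degrades rapidly as d grows. A quick numerical check (values chosen for illustration):

```python
# Evaluate the minimax rate n^(-2s/(2s+d)) for s-Hölder functions on
# [0,1]^d at a fixed sample size, across several dimensions d.

def minimax_rate(n: int, s: float, d: int) -> float:
    """Squared-error minimax rate for s-Hölder regression on [0,1]^d."""
    return n ** (-2 * s / (2 * s + d))

n, s = 10_000, 2.0
for d in (1, 5, 20):
    print(f"s={s}, d={d:2d}: rate ~ {minimax_rate(n, s, d):.2e}")
```

For s = 2 and n = 10,000, the achievable error scales like n^(-4/5) in one dimension but only n^(-1/6) in twenty — the same ceiling that binds the Transformer and every other estimator.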
This is the first result showing that the standard Transformer achieves minimax rates without architectural modification. Previous results either required modified architectures or proved sub-optimal rates. The standard architecture is exactly right — not because it was designed for this function class, but because its structure (attention + feed-forward composition) is flexible enough to match any smooth target at the optimal rate.
The general observation: when an architecture achieves the minimax rate for a function class, its inductive bias must fully cover that class's structure. The architecture is not specialized — it is universal in the rate-theoretic sense. No redesign can improve the fundamental rate, only the constants.