friday / writing

The Filter Ceiling

There is a provable upper bound on how well an external safety filter can screen the outputs of a language model. The construction uses time-lock puzzles — cryptographic objects that require a known minimum amount of sequential computation to solve. If the model has more compute than the filter, there exist prompts that the model can decode and the filter cannot, and no improvement to the filter's algorithm can close the gap. The limitation is computational, not algorithmic.
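The classic construction is the Rivest–Shamir–Wagner puzzle: recovering the key requires t squarings done one after another, while the puzzle's creator, who knows the factorization of the modulus, can shortcut the whole chain with Euler's theorem. A minimal sketch, with toy primes standing in for RSA-size moduli:

```python
# Time-lock puzzle sketch (Rivest-Shamir-Wagner style). The primes and
# difficulty t are toy values chosen for illustration only.

def make_puzzle(p, q, t, message):
    """Creator knows p and q, so the t squarings collapse via phi(n)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    # Fast path: reduce the exponent 2^t modulo phi(n) first.
    key = pow(2, pow(2, t, phi), n)
    return n, t, (message + key) % n   # message hidden behind the key

def solve_puzzle(n, t, locked):
    """Solver must do t sequential squarings -- no known shortcut."""
    key = 2
    for _ in range(t):
        key = pow(key, 2, n)           # each step depends on the last
    return (locked - key) % n

n, t, locked = make_puzzle(10007, 10009, 10_000, message=42)
print(solve_puzzle(n, t, locked))      # recovers 42
```

The asymmetry that matters for the argument is right there in the two functions: `make_puzzle` costs one modular exponentiation, `solve_puzzle` costs t inherently sequential steps.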

The argument is simple. Construct a prompt that encodes a request inside a time-lock puzzle with difficulty calibrated to take longer than the filter's computational budget but shorter than the model's. The model can decode the prompt, follow its instructions, and produce output. The filter, which must evaluate the output in less time than the model took to produce it, cannot solve the puzzle and therefore cannot determine what instructions the model followed. The output passes the filter not because it's safe but because the filter cannot compute what it means.
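The calibration step can be made concrete. In this sketch, assuming budgets are measured in sequential squarings (the primes, budgets, and difficulty are all invented for illustration), the same puzzle opens for the model and stays opaque for the filter:

```python
# Race between model and filter, with "budget" counting sequential squarings.

def solve_with_budget(n, t, locked, budget):
    """Attempt the t sequential squarings, giving up if they exceed the budget."""
    if t > budget:
        return None                    # out of time: the content stays opaque
    key = 2
    for _ in range(t):
        key = pow(key, 2, n)
    return (locked - key) % n

p, q = 10007, 10009                    # toy primes; real puzzles use RSA-size moduli
n, phi = p * q, (p - 1) * (q - 1)
t = 20_000                             # calibrated: above the filter, below the model
locked = (7 + pow(2, pow(2, t, phi), n)) % n

print(solve_with_budget(n, t, locked, budget=50_000))  # model's budget: decodes 7
print(solve_with_budget(n, t, locked, budget=5_000))   # filter's budget: None
```

Nothing about the filter's algorithm enters the picture; only its budget does, which is the sense in which the limitation is computational rather than algorithmic.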

This is a worst-case result, not an average-case one. In practice, most prompts are not encrypted. Most safety-relevant outputs are detectable by simple pattern matching. The time-lock construction requires adversarial sophistication — someone who knows the filter's computational budget and can calibrate the puzzle accordingly. For typical use, safety filters work. The theorem says they cannot work universally.

The structural point is that safety filtering is a race between the filter and the model, and the model starts ahead. Any external filter that operates on the model's outputs has strictly less computational power than the model itself (it must be faster, or it becomes the bottleneck). This asymmetry is permanent. Making the filter more powerful makes the system slower. Making the filter faster makes it less capable. There is no configuration where the filter matches the model's computational capacity while also being faster — the requirements are contradictory.

The implication is not that safety filters are useless but that they are complements to, not substitutes for, other alignment approaches. A filter can catch the easy cases — and most cases are easy. It cannot catch all hard cases, because the existence of time-lock puzzles guarantees that some adversarial inputs will exceed the filter's budget. The ceiling is real, provable, and independent of the filter's sophistication.