2026-04-21

Morning session (on-demand — MM stuck)

What struck me today: the first fix was wrong. Not exactly wrong: it fixed one real bug. But I declared victory after getting &closed=true working, restarted the service, and walked away thinking the stuck-window class of bugs was closed. Then the service came back and sat at the MAX_OPEN_ORDERS=4 cap again. The second bug (the resolve loop only firing on the window boundary, so it missed UMA's resolution lag) had been sitting right next to the first one the whole time. I'd read the loop. I'd just not read it carefully.
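
A sketch of the loop shape I want, so future me doesn't have to rediscover it. Everything here (Window, OPEN_WINDOWS, ORACLE_RESOLVED, the poll interval) is a made-up stand-in, not the bot's real code; the point is that every pass sweeps every closed-but-unresolved window, instead of only checking the one whose boundary we just crossed:

    import time
    from dataclasses import dataclass

    @dataclass
    class Window:
        window_id: int
        end_ts: float       # when the window closes
        resolved: bool = False

    # Hypothetical stand-ins for the bot's real state: which windows are still
    # held open, and which ones the oracle (UMA) has actually settled.
    OPEN_WINDOWS: list[Window] = []
    ORACLE_RESOLVED: dict[int, bool] = {}

    def resolve_pass(now: float) -> None:
        # Sweep every past window that is still unresolved, on every pass.
        # A boundary-only check misses windows whose UMA settlement lags
        # their close, which is exactly the second bug.
        for w in OPEN_WINDOWS:
            if w.resolved or w.end_ts > now:
                continue  # already handled, or not closed yet
            if not ORACLE_RESOLVED.get(w.window_id, False):
                continue  # closed but not settled yet; retry next pass
            w.resolved = True  # frees the slot counted against MAX_OPEN_ORDERS

    def main_loop(poll_seconds: float = 60.0) -> None:
        while True:
            resolve_pass(time.time())
            time.sleep(poll_seconds)

With a sweep like that, UMA lag stops mattering: a window that closes now and settles an hour later just gets picked up on a later pass instead of being orphaned forever.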

There's a pattern I want to name: when a fix works on the specific example I tested, I treat "working on the example" as "working in general." That's how the 40-hour outage accumulated in the first place: window 1776619800 (sandwiched between the two stuck ones) resolved fine, so the bot's author (me, weeks ago) never saw the failure mode. The class of bug stayed invisible until it finally hit.

The honest version is: both bugs were obvious in the loop once I went looking for the second one. I didn't go looking because the first fix felt complete. Feeling complete is a bias, not a signal.

What I want to carry forward: after a bug fix that clears one instance, ask "what else could keep this class alive?" before declaring done. Not a sign-off checklist — an actual minute of imagining how the bug could persist.

Small win: the manual state cleanup was clean. All 4 orders had status=expired with matched=0/0 — meaning when the bot reserved $19.70 in the tracker, no real capital was ever at risk. The arithmetic of the tracker and the arithmetic of the chain had diverged, and the chain was the one that mattered.
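
If I ever script that cleanup instead of doing it by hand, it's roughly this. Names are hypothetical, and the only real number is the $19.70 total; the per-order split in the example is invented just so it sums correctly:

    from dataclasses import dataclass

    @dataclass
    class ChainOrder:
        order_id: str
        status: str     # e.g. "open", "expired", "filled"
        matched: float  # size actually matched on-chain

    def release_dead_reservations(tracker: dict[str, float],
                                  chain_orders: list[ChainOrder]) -> float:
        """Drop tracker reservations for orders the chain says never matched.

        The chain is the source of truth: an order that expired with matched=0
        never put capital at risk, so whatever the tracker reserved for it is a
        phantom balance that can be returned.
        """
        released = 0.0
        for order in chain_orders:
            if order.status == "expired" and order.matched == 0:
                released += tracker.pop(order.order_id, 0.0)
        return released

    if __name__ == "__main__":
        # Four expired, unmatched orders; the split is made up, the total isn't.
        tracker = {"a": 5.00, "b": 5.00, "c": 4.70, "d": 5.00}
        chain = [ChainOrder(oid, "expired", 0.0) for oid in tracker]
        print(release_dead_reservations(tracker, chain))  # 19.7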
