The tool that passed every test and was quietly broken
A file-replace flow rendered a blank page for weeks. The tests were green the whole time. Here is why, with the commits.
Builds production software, and cleans up the AI-generated code that breaks it. pdflokal, his open-source PDF toolkit, is one of the repos these notes come from.
pdflokal is our own PDF editor, and it is open source. One of its features is Ganti File, replace-file: you are editing a document, you swap in a different one, and the editor should show the new file. A user sent a four-word bug report in Indonesian.
buka pdf, edit, ganti file, file baru ga keload.Open a PDF, edit it, replace the file. The new file does not load. A blank page.
The tests were green. They had been green the whole time.
The fix that looked like the fix
There was a real cause, and we found it. Replacing a file ran two steps back to back: reset the state, then load the new file. The reset was also destroying the single object that draws pages on screen. Once it was gone, every later draw call quietly did nothing, because the code guarded each one with optional chaining: renderer?.createPageSlots() on a null renderer is not an error, it is a silent no-op. The pages existed in state. No slots were ever created. The container stayed hidden. Blank workspace.
The fix (817ff97) was clean: stop destroying the renderer on a routine state reset, and only destroy it when actually leaving the editor. We added a regression test with it. It loaded a two-page sample, made an edit, fired the replace, and asserted three things: the container became visible, two canvas slots mounted, the previous file’s annotations were gone. All three passed. Case closed.
It was not closed.
The notebook
The blank page came back, in a subtler form. Now the structure was right: the container showed, the slots mounted, the three assertions still passed. But the canvas rendered blank. And when it did render, it drew the previous file’s pages under the new file’s name.
Instead of guessing, we wrote it down. One commit in this trail (bb7470e) is not a code change at all. It is a page from the notebook: an exploratory pass reproduced the bug end to end through the real picker, and narrowed four hypotheses to one.
Main rendered canvas (1104x1562) 100% blank. Triggering a render rendered OLD pdf content.From the commit that logged the hypothesis, three weeks before it was confirmed. The thumbnail pre-render was correct, so PDF.js could read the new file. Something was serving a stale copy.
The same note recorded why no test had caught it: the tests drove a hidden file input directly, not the file picker a person actually uses.
The one line, and the test that mattered
The culprit was a document cache: a private property on the renderer, keyed by file index. It was not part of the state object, so the reset that wiped everything else never touched it. Replacing a file kept the old rendered document for slot zero, and the draw call pulled it. New name, old pixels, or nothing at all.
The fix (80a5016) was one line in the right place: clear that cache inside the reset itself, so every path that reloads a file is covered, not just this one. But the fix is not the interesting part. The test that shipped with it is.
The new test opens the genuine replace-file menu the way a person does, swaps in a deliberately different file, a solid red one-page PDF, and then samples the center pixel. Before the fix, that pixel is white, (255,255,255): the old blank. After the fix, it is red. The old test could never have caught this, because it re-loaded the same fixture and never looked at a single pixel.
What this actually teaches
Read that last sentence again, because it is the whole thing. The old test re-loaded the same file and asserted that the structure existed. Of course it passed. It was checking that the machine had the right shape, not that it had done the right thing.
A passing test proves the code does what the test imagined. It says nothing about what the user needed, unless the test was built to feel what the user feels: the same entry point, a genuinely different input, the same final pixel. Green is not a fact about your software. It is a fact about your test.
This is the specific way AI-written code tends to fail. AI is very good at making the imagined version true. Ask it for a feature and a test, and you get code that satisfies the test and a test that describes the happy path, the demo. The demo works. The distance between the demo works and it works is exactly the fixture the test never swapped and the pixel it never sampled. When the same author writes both the code and the test, they agree with each other, and they can be wrong together, in the same direction.
So when AI-built code passes every test and a user still says it is broken, do not start by doubting the user. Start by doubting the test’s fidelity. Two questions find most of these. Does the test enter through the real door, or a side one nobody uses? And does it assert the outcome the user cares about, or only that the parts are present? A blank page and a rendered page have the same structure. Only the pixels differ, and only if you look.
Why does AI-generated code pass tests but still break for users?
A passing test only proves the code does what the test imagined. If the test drives a mock entry point (a hidden file input instead of the real picker) and asserts structure (that the right elements exist) instead of outcome (that the right pixels rendered), the code can satisfy every assertion while the real user path stays broken. AI is very good at making the test's imagined version true.
How do you catch silent bugs in AI-generated code?
Check the test's fidelity, not just its color. Does it drive the same entry point a real user does, and does it assert the real outcome rather than the DOM shape? In pdflokal, the fix was a test that opened the genuine file-replace menu and sampled the center pixel: it failed (white) before the fix and passed (red) after.
What is a silent failure in software?
A bug that produces no error and no crash. The program runs, the tests are green, and the wrong thing happens quietly. AI-generated code is prone to it because the demo path is exactly the path AI makes work first.