Can you trust AI-generated code?

Fluent is not the same as correct, and they look identical on screen. Here is where to point the distrust, drawn from the bugs we found in our own AI-built code.

Ojan Lubis·2026-07-02

Builds production software, and cleans up the AI-generated code that breaks it. pdflokal, his open-source PDF toolkit, is one of the repos these notes come from.

You can trust AI-generated code to be fluent. You cannot trust it to be correct. The trouble is that fluent and correct look identical on the screen: the code reads well, it runs, the demo works. Telling the two apart is the whole job, and it is not a matter of trusting AI more or less. Trust is not a property the code has. It is one you add, by looking in the specific places where fluent code hides failure.

We build software and we clean up the AI-generated code that breaks it, most often our own, in the open. What follows is not a warning about AI. It is a map of where the distrust should point, drawn from real bugs with public commits. Every one of them ran. None of them errored. That is the pattern.

Where fluent code hides failure

In the test that proves the wrong thing. A passing test is a fact about the test, not the software. In our own editor a file-replace flow rendered a blank page for weeks while every test stayed green, because the test drove a hidden input instead of the real file picker and checked that elements existed rather than what they rendered. AI is very good at making the test’s imagined version true. Distrust green until you know the test enters through the real door.
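One way to make a test check what was rendered rather than whether an element exists is to assert on the pixels. The sketch below is a minimal illustration, not the test from our editor: `countInkPixels` is a hypothetical helper, and it assumes the test can obtain the page's RGBA bytes (in a browser, for example, from the `data` array that the canvas `getImageData` call returns).

```typescript
// Hypothetical helper: counts pixels that are neither fully transparent
// nor pure white in a flat RGBA buffer (4 bytes per pixel).
function countInkPixels(rgba: ArrayLike<number>): number {
  let ink = 0;
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    const r = rgba[i] ?? 0;
    const g = rgba[i + 1] ?? 0;
    const b = rgba[i + 2] ?? 0;
    const a = rgba[i + 3] ?? 0;
    const isBlank = a === 0 || (r === 255 && g === 255 && b === 255);
    if (!isBlank) {
      ink += 1;
    }
  }
  return ink;
}

// A weak check only asks "is there a canvas?". A stronger one asks
// "did anything get drawn on it?".
function pageLooksRendered(rgba: ArrayLike<number>): boolean {
  return countInkPixels(rgba) > 0;
}
```

A blank white page passes an existence check and fails `pageLooksRendered`, which is the distinction that matters here. The check still says nothing about whether the test reached the page through the real file picker; that part has to be verified separately.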

In the config that runs wrong. Running and running correctly are different facts, and only the first one errors. One helpful-looking line made our PDF engine skip its background worker and render on the main thread for two months, with no crash and no failed test. Setup code that looks configured and quietly defeats itself is the most common shape of all, because nothing asks you to look.
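A check for this shape compares what the configuration asked for with what actually happened at runtime. The following is a sketch under stated assumptions: `RenderReport` and `findWorkerMismatch` are hypothetical names, and how a given PDF engine reports where it rendered is engine-specific and not shown.

```typescript
// Hypothetical shape: what the app asked for versus what a render actually did.
interface RenderReport {
  workerRequested: boolean; // the config said "use a background worker"
  ranInWorker: boolean; // observed at runtime, reported by the render itself
}

// Returns a description of the mismatch, or null when config and behavior agree.
function findWorkerMismatch(report: RenderReport): string | null {
  if (report.workerRequested && !report.ranInWorker) {
    return "worker was requested, but rendering ran on the main thread";
  }
  return null;
}
```

Run at startup or in a smoke test, a check like this turns "it runs" into "it runs the way the config claims", which is the fact that otherwise never errors.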

In the crash you keep patching. When several crashes share one cause, a guard for each stops its own stack trace and fixes none of them. We shipped three guards in a week; all three held, and none of them touched the actual bug. Fluent reasoning aimed at the wrong frame produces more convincing wrong fixes, faster. Distrust a fix that only silences the symptom you can see.
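Before adding a guard at each crash site, it can help to ask what the crashes have in common. The sketch below intersects stack traces to surface frames that appear in all of them. It is an illustration, not the method we used: `sharedFrames` is a hypothetical helper, the frame names in the usage note are invented, and a shared frame is only a candidate for the common cause, not proof of it.

```typescript
// Hypothetical helper: returns the frames that appear in every stack trace,
// in the order (and with any duplicates) of the first trace.
function sharedFrames(stacks: string[][]): string[] {
  const [first, ...rest] = stacks;
  if (first === undefined) {
    return [];
  }
  return first.filter((frame) =>
    rest.every((stack) => stack.indexOf(frame) !== -1)
  );
}
```

Given three traces such as `["renderPage", "getViewport", "onResize"]`, `["drawThumbnail", "getViewport", "onResize"]`, and `["exportPage", "onResize"]`, the only frame common to all three is `onResize`, which is a better place to start reading than any of the three crash sites.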

In the file that only grows. AI adds at the speed of typing and never proposes a deletion. One file in our product reached 3,634 lines because every feature was appended to the path of least resistance and nothing was ever taken back out. It runs fine until someone has to change it. The line count is a liability, not an asset.

In the boring helper nobody reviews. The risky bug hides where review stops. Our one cross-site scripting hole was not in the auth flow or the file parser; it was in a toast message that rendered a filename with innerHTML. Reviewers scrutinize the obvious surfaces and skip the mundane ones, which is exactly where an unsafe default survives.
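The fix for this class of bug is small. In the DOM, assigning the filename to the element's `textContent` instead of `innerHTML` avoids parsing it as markup at all. Where a helper really does have to build an HTML string, the untrusted part needs escaping first. The sketch below shows that second case; `escapeHtml` and `toastMarkup` are hypothetical names, not the helpers from our codebase.

```typescript
// Characters that must be escaped before untrusted text is placed
// inside HTML text content or a quoted attribute value.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replaces each special character with its entity.
function escapeHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// Hypothetical toast builder: only the filename is untrusted, so only
// the filename is escaped.
function toastMarkup(filename: string): string {
  return `<div class="toast">Uploaded ${escapeHtml(filename)}</div>`;
}
```

A file named `<img src=x onerror=alert(1)>.pdf` then shows up in the toast as literal text instead of becoming an element with an event handler.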

So, can you trust it?

Trust it the way you would trust a fast, confident junior who never says “I am not sure.” The output is often good and always plausible, which is the problem: plausibility is the one thing you cannot use to tell right from wrong. So you do not extend trust or withhold it wholesale. You verify the specific things fluency hides. Distrust the green test until it drives the real path. Distrust the config that merely runs. Distrust the fix that only guards the symptom. Watch the file that only grows. Read the boring helper.

Do that and most of the risk is gone. Skip it and broken code will look exactly as trustworthy as working code, right up until the day it doesn’t. None of this is an argument against building with AI. We build with it every day. It is an argument for knowing which of its outputs to believe, and that knowing is the work that does not get faster.

Can you trust AI-generated code?

You can trust it to be fluent, not to be correct, and the two look identical on the screen. AI reliably produces code that reads right and usually runs. Whether it does the right thing is a separate question the fluency does not answer. Trust is not something the code hands you; it is something you add by checking the specific places where fluent code hides failure: the test that proves the wrong thing, the config that runs wrong, the helper nobody reviews.

Is AI-generated code safe to use in production?

It can be, once someone has verified the parts the demo never exercised. The failures that matter are quiet: no crash, no red test. In our own code a cross-site scripting hole lived in a toast message, a performance bug lived in one config line, and a blank page passed every test for weeks. None of them errored. AI-generated code is safe in production when a human has checked the boring places and the real user path, not just run the demo.

How do you know if you can trust a piece of AI-written code?

Ask what would have to be true for it to be wrong without anyone noticing, then check that. Distrust a green test until you know it drives the real entry point. Distrust config that merely runs, because running and running correctly are different facts. Distrust the small helpers, because that is where review stops. If those three hold, most of the risk is gone.