
Quick answer
Developers ship AI-generated code they do not trust: 96% are not fully confident it is functionally correct, and fewer than half always check it before committing. The fix is architectural, not behavioral. In Bodega One Code, the Quality Enforcement Layer (QEL) verifies every change, and for server tasks it boots the generated app in a sandbox and sends a real request to the route the task asked for. Evidence, not confidence.
What is the verification gap?
Sonar surveyed more than 1,100 professional developers and published the results in January 2026: 96% do not fully trust that AI-generated code is functionally correct. The same survey found only 48% always check AI-assisted code before committing it. Sonar named the distance between those two numbers the verification gap, and it is the most honest description of how AI-assisted teams actually ship today: code almost nobody fully trusts, always checked by fewer than half.
The trend line is getting worse, not better. Stack Overflow's 2025 survey of 33,244 developers found 84% use or plan to use AI tools, while trust in their accuracy fell from 40% in 2024 to 29% in 2025. Only 3.1% highly trust the output. Adoption up, trust down.
And the distrust is earned. Veracode's 2025 GenAI code security report tested more than 100 LLMs across four languages and found AI-generated code introduced OWASP Top 10 vulnerabilities in 45% of tests. The models write code that looks finished. Looking finished is the problem.
Why does AI code feel done when it is not?
Because generation and verification are different jobs, and most tools only do the first one. A model produces code with the same fluent confidence whether the code works or not. It does not know the import path is stale, the route returns a 500, or the test only passes because it never ran. When the same system that wrote the code is the only thing vouching for it, "done" means "the model stopped generating."
Human review was supposed to close that gap. At AI speed, it cannot keep up, which is how you get more than half of developers not always checking AI code before they commit it. A third-party take on where this is heading matches what we see: Heimdall calls 2026 the year AI checks its own work, because human-in-the-loop verification collapses under its own weight.
Can an AI coding agent verify its own code?
Yes, with one architectural rule: the verifier must not trust the generator. In Bodega One Code that layer is QEL, the Quality Enforcement Layer. It extracts machine-checkable deliverables from your prompt before the agent starts, checks every file write as it lands, requires the work to compile before the task can be called complete, and hands the agent line-level repair instructions when something fails instead of a vague "try again." The step-by-step mechanics have their own write-up: how QEL works.
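The contract-first rule can be sketched in a few lines. This is a minimal illustration, not QEL's implementation: `Finding`, `Deliverable`, `compiles`, `mentions_route`, and `verify` are hypothetical names, the substring check is a stand-in for real route analysis, and Python's built-in `compile` stands in for whatever compiler the project actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One line-level repair instruction handed back to the agent."""
    path: str
    line: int
    message: str


@dataclass(frozen=True)
class Deliverable:
    """A machine-checkable requirement, fixed before generation starts."""
    description: str
    check: Callable[[dict[str, str]], list[Finding]]


def compiles(path: str) -> Deliverable:
    """Hypothetical gate: the file must exist and parse."""
    def check(files: dict[str, str]) -> list[Finding]:
        source = files.get(path)
        if source is None:
            return [Finding(path, 0, "file was never written")]
        try:
            compile(source, path, "exec")
        except SyntaxError as err:
            return [Finding(path, err.lineno or 0, f"syntax error: {err.msg}")]
        return []
    return Deliverable(f"{path} compiles", check)


def mentions_route(path: str, route: str) -> Deliverable:
    """Hypothetical gate: a crude substring check for the requested route."""
    def check(files: dict[str, str]) -> list[Finding]:
        if route in files.get(path, ""):
            return []
        return [Finding(path, 1, f"no handler found for route {route}")]
    return Deliverable(f"{path} serves {route}", check)


def verify(files: dict[str, str], contract: list[Deliverable]) -> list[Finding]:
    """Check the written files against the contract; never ask the generator."""
    findings: list[Finding] = []
    for deliverable in contract:
        findings.extend(deliverable.check(files))
    return findings
```

The design point is that `verify` only ever sees the files and the contract. An empty list means the contract is met; anything else is a concrete path, line, and message the agent can act on instead of a vague "try again."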
That pipeline catches code that does not compile or contradicts the contract. But compiling is a low bar. The interesting question is the next one.
What counts as proof that code runs?
A real request to a real route. Since beta.27, when the task is server-shaped, QEL runs an execution proof: it boots the generated app inside the sandbox, forces an ephemeral port, and sends one request to the route the task asked for. A response with a status code under 500 is the strongest pass evidence QEL has. A boot crash or a 5xx is a real failure. Anything environmental stays neutral and never counts against the work. The whole probe is loopback-only, runs in a secrets-free environment, is capped at 12 seconds, and always kills the process tree afterward.
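A boot-and-probe of this shape might look like the sketch below. It is an illustration under stated assumptions, not QEL's code: `probe`, `kill_tree`, `free_loopback_port`, the `{port}` placeholder, the `PORT` variable, and the environment allowlist are all hypothetical, and treating a launch failure or a timeout as "neutral" is our guess at what "environmental" covers. Only the under-500 rule, the loopback restriction, the 12-second cap, and the process-tree kill come from the description above.

```python
from __future__ import annotations

import os
import signal
import socket
import subprocess
import time
import urllib.error
import urllib.request

PROBE_BUDGET_SECONDS = 12.0  # the cap described above
# Assumption: an allowlist is one way to build a secrets-free environment.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG", "TMPDIR", "SYSTEMROOT", "LD_LIBRARY_PATH")


def free_loopback_port() -> int:
    """Ask the OS for an unused port on 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def kill_tree(proc: subprocess.Popen) -> None:
    """Kill the server and, on POSIX, every child in its process group."""
    if hasattr(os, "killpg"):
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except OSError:
            pass  # the group is already gone
    else:
        proc.kill()
    proc.wait()


def probe(command: list[str], route: str,
          budget: float = PROBE_BUDGET_SECONDS) -> str:
    """Boot the app, get one answer from `route`, return pass/fail/neutral."""
    port = free_loopback_port()
    env = {key: os.environ[key] for key in SAFE_ENV_KEYS if key in os.environ}
    env["PORT"] = str(port)
    argv = [arg.replace("{port}", str(port)) for arg in command]
    try:
        proc = subprocess.Popen(argv, env=env, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                start_new_session=True)
    except OSError:
        return "neutral"  # could not launch at all: treated as environmental
    # An empty ProxyHandler keeps the request on loopback, ignoring any proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://127.0.0.1:{port}{route}"
    deadline = time.monotonic() + budget
    try:
        while time.monotonic() < deadline:
            if proc.poll() is not None:
                return "fail"  # exited before answering: boot crash
            try:
                with opener.open(url, timeout=1.0) as response:
                    status = response.status
            except urllib.error.HTTPError as err:
                status = err.code  # 4xx and 5xx arrive as exceptions
            except OSError:
                time.sleep(0.1)  # not listening yet; retry until the deadline
                continue
            return "pass" if status < 500 else "fail"
        return "neutral"  # never answered in time: assumed environmental
    finally:
        kill_tree(proc)
```

With Python's standard `http.server` as a stand-in app, `probe([sys.executable, "-m", "http.server", "{port}", "--bind", "127.0.0.1"], "/")` exercises the whole path. Note that a 404 also counts as a pass under the under-500 rule: the probe shows that the server boots and answers, not that the answer is the right one.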
Test gates run separately: JS and TS test files run under vitest the way Python tests run under pytest, Ruby and PHP get syntax gates, and SQL and Dockerfiles get structural lints so a truncated CREATE TABLE fails instead of sailing through. The point of the probe is narrower and blunter than tests: the server the agent claims to have built actually boots and actually answers. We do not know of another IDE that makes the agent produce that evidence, an actual request to an actual route, before a task can call itself done.
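To make the gate routing and the truncated CREATE TABLE case concrete, here is a small sketch. The gate names come from the paragraph above; routing by file suffix, the names `gate_for` and `sql_looks_complete`, and the lint itself are assumptions for illustration. The lint is deliberately naive: it ignores parentheses inside string literals and comments.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Gate names follow the post; routing by suffix is an assumption.
# vitest and pytest apply to test files in those languages.
GATES_BY_SUFFIX = {
    ".js": "vitest",
    ".ts": "vitest",
    ".py": "pytest",
    ".rb": "syntax",
    ".php": "syntax",
    ".sql": "structural-lint",
}


def gate_for(path: str) -> str | None:
    """Pick the gate for a file, or None if no gate applies."""
    file = PurePosixPath(path)
    if file.name == "Dockerfile":
        return "structural-lint"
    return GATES_BY_SUFFIX.get(file.suffix)


def sql_looks_complete(sql: str) -> bool:
    """Naive structural lint: balanced parentheses and a closing semicolon."""
    depth = 0
    for char in sql:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" with no matching "("
    return depth == 0 and sql.rstrip().endswith(";")
```

A statement cut off mid-column-list, such as `CREATE TABLE users (id INT, name TEXT`, leaves the parenthesis depth at 1 and has no terminator, so the lint rejects it rather than letting it through.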
How do you verify the verifier?
A verifier that nobody checks is just another model being confident. So QEL is measured against a calibration harness: 43 labeled scenarios, known-good and known-bad, run through the real verifier at every model tier, with CI floors that fail our build if broken work scores a pass. The first sweep caught two real scoring holes, including one that had been failing correct TypeScript in workspaces without a tsconfig and making local models look far worse than they are. Calibration is also how the semantic judge earns its keep: it runs air-gapped when the judge model is served locally, and when it suspects a marginal pass is wrong, it parks the work for your review instead of waving it through.
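The idea of a calibration harness with a CI floor can be sketched as follows. Everything here is hypothetical: `Scenario`, `calibrate`, `enforce_floor`, and `CalibrationError` are illustrative names, the two scenarios are toy data rather than any of the 43 real ones, and a floor of zero false passes is our reading of "fail our build if broken work scores a pass."

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


class CalibrationError(Exception):
    """Raised to fail the build when the verifier passes known-bad work."""


@dataclass(frozen=True)
class Scenario:
    name: str
    source: str
    known_good: bool  # the human-assigned label


def calibrate(scenarios: list[Scenario],
              verifier: Callable[[str], bool]) -> dict[str, list[str]]:
    """Run every labeled scenario through the verifier and record mismatches."""
    report: dict[str, list[str]] = {"false_pass": [], "false_fail": []}
    for scenario in scenarios:
        passed = verifier(scenario.source)
        if passed and not scenario.known_good:
            report["false_pass"].append(scenario.name)
        elif not passed and scenario.known_good:
            report["false_fail"].append(scenario.name)
    return report


def enforce_floor(report: dict[str, list[str]],
                  max_false_passes: int = 0) -> None:
    """Fail loudly if broken work scored a pass more often than allowed."""
    if len(report["false_pass"]) > max_false_passes:
        names = ", ".join(report["false_pass"])
        raise CalibrationError(f"known-bad scenarios passed: {names}")


def compiles_as_python(source: str) -> bool:
    """Toy verifier used only to exercise the harness."""
    try:
        compile(source, "<scenario>", "exec")
    except SyntaxError:
        return False
    return True


SCENARIOS = [
    Scenario("working function", "def add(a, b):\n    return a + b\n", True),
    Scenario("truncated function", "def add(a, b:\n", False),
]
```

The `false_fail` list matters as much as the floor: a verifier that fails correct work, like the tsconfig bug described above, shows up there even though it never breaches the false-pass floor.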
Where verification shows up when you use it
The score is not a hidden number. Fleet Parallel fans one task out to multiple models and shows each attempt's QEL score side by side, so picking the winner is reading a scoreboard. Scheduled Loops park every run as a reviewable diff with its score and full trace, and auto-apply only ever happens when the score clears the bar you set. Finished chat sessions carry their score in the sidebar. And since beta.27, verification runs once per turn instead of up to four times, in roughly half the wall-clock time, so the checking no longer taxes the working.
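As a rough sketch of how a score drives those two decisions: the function names, the model names, the numeric scale, and the choice of "greater than or equal" for "clears the bar" are all assumptions made for illustration.

```python
from __future__ import annotations


def pick_winner(scores: dict[str, float]) -> str:
    """Return the attempt with the highest score (first one wins a tie)."""
    return max(scores, key=scores.get)


def disposition(score: float, bar: float) -> str:
    """Auto-apply only when the score clears the user's bar; otherwise park."""
    return "auto-apply" if score >= bar else "park-for-review"


# Hypothetical scoreboard for one task fanned out to three models.
scoreboard = {"model-a": 72.0, "model-b": 91.0, "model-c": 88.5}
winner = pick_winner(scoreboard)
```

Parking is the default branch on purpose: a run has to earn auto-apply, and anything that falls short still arrives as a diff a human can read.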
What QEL does not catch
Credibility requires the limits, so here they are plainly.
- Execution proofs are server-task-scoped. A library, a CLI, or a pure frontend change gets the compile and structural gates, not a boot-and-probe.
- A probe is one request, not a test suite. One route answering with a status code under 500 is strong evidence the thing boots and serves. It is not proof that every endpoint and edge case behaves.
- The semantic judge can be wrong. That is exactly why marginal passes park for human review instead of auto-applying. Parking is the hedge, not the flex.
- Verification proves the code runs, not that it does what you meant. Code can compile, pass gates, answer its route, and still solve the wrong problem. No verifier closes that gap; a human reading a clean, scored diff does.
The bar to hold your tools to
The verification gap is not a discipline problem, and another "always review AI code" policy will not close it. It closes when the agent has to bring evidence: compiles, gates, scores, and, for a server task, a real response from the thing it built. Ask your current tool what evidence it produces before it says "done." If the answer is confidence, that is the gap.
Bodega One Code is free for everyone in the open beta, commercial use included, and the verification layer runs locally like everything else. Download it, give the agent a server task, and watch the probe land in the QEL card. The QEL docs cover every gate in detail.
Common questions
- Can an AI coding agent verify its own code?
- Yes, if verification is a separate layer that does not trust the generator. In Bodega One Code, the Quality Enforcement Layer (QEL) checks every file write, compiles the work, runs structural checks, and for server tasks boots the generated app in a sandbox and sends a real request to the route the task asked for. The agent cannot mark the task complete on confidence alone.
- What is the verification gap in AI coding?
- The distance between how much AI-generated code ships and how much of it anyone actually checks. Sonar's January 2026 developer survey found 96% of developers do not fully trust that AI-generated code is functionally correct, yet only 48% always check AI-assisted code before committing it.
- Is an execution proof the same as running the test suite?
- No. An execution proof is one real request to the route a server task asked for: strong evidence the code boots and serves, not proof that every endpoint or edge case is correct. QEL runs test gates separately (vitest for JS/TS, pytest for Python), and a probe never replaces tests.
- Does QEL verification work with local models?
- Yes. The verification pipeline runs locally, and the semantic judge runs air-gapped when the judge model is served locally. beta.27 also fixed a scoring bug that unfairly failed correct TypeScript in workspaces without a tsconfig, which had made local models look worse than they are.
- What does QEL not catch?
- It proves code compiles, passes its gates, and (for server tasks) boots and answers one real request. It does not prove business intent: code can run perfectly and still do the wrong thing. That is why marginal passes park for human review instead of auto-applying.
Written by the Bodega One team. We build Bodega One Code, the local-first AI IDE, and we write here about local models, AI costs, and what we learn shipping it. More about the team and why we build local-first on the about page.