
QEL Verification

QEL (Quality Enforcement Layer) is a fully automatic post-generation system built into Bodega One's agentic loop. It activates only on code-creation tasks, runs without any user input, and surfaces a scored pass/fail result directly in Chat and Code mode.

What QEL does - and when it runs

QEL activates when Bodega's agentic loop finishes a creation task: any request that contains a creation verb (create, write, generate, build, implement, scaffold) paired with an artifact noun (files, scripts, components, endpoints, etc.). It does not run on read-intent messages - questions that start with "what", "how many", "show me", or similar phrases are not treated as creation tasks.
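
The verb-plus-noun rule above can be pictured with a minimal sketch. The verb and noun lists come from this page; the function name `is_creation_task`, the read-intent prefix list, and the exact matching rules are assumptions for illustration, not Bodega's actual classifier.

```python
import re

# Verb and noun lists are taken from the paragraph above. The prefix list and
# the matching rules are illustrative assumptions, not the real implementation.
CREATION_VERBS = r"\b(create|write|generate|build|implement|scaffold)\b"
ARTIFACT_NOUNS = r"\b(files?|scripts?|components?|endpoints?)\b"
READ_PREFIXES = ("what", "how many", "show me")


def is_creation_task(message: str) -> bool:
    """Return True when the message pairs a creation verb with an artifact noun."""
    text = message.strip().lower()
    if text.startswith(READ_PREFIXES):
        return False  # read-intent questions are never treated as creation tasks
    has_verb = re.search(CREATION_VERBS, text) is not None
    has_noun = re.search(ARTIFACT_NOUNS, text) is not None
    return has_verb and has_noun
```

Under these assumptions, "Create a script that parses logs" counts as a creation task, while "What files does this project create?" does not, because the read-intent prefix is checked first.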

The system does four things in sequence:

Extracts a machine-checkable contract from your request before the first LLM call - which files must be created, which framework is required, which patterns must appear in each file, and which shell commands should verify the output.
Injects that contract into the model's system prompt as a TASK CONTRACT block so the model knows exactly what it must produce.
Scores the output across five dimensions after the agent finishes writing files.
Runs repair nudges if the score falls below the pass threshold - injecting targeted correction instructions back into the agentic loop for up to 2 attempts (3 if the score is improving) before surfacing a failure note.

QEL was introduced in beta.19. The structured trace card, score breakdown object, and expanded framework database (Next.js, NestJS, SvelteKit, Hono, Elysia, Prisma) were added in beta.25.

Where QEL results appear

Results show up in three places.

Chat sidebar - QEL score badge

Sessions that completed a QEL-verified creation task show a score badge (0–100) next to the session title in the Chat sidebar. Green = pass, red = fail. This lets you compare QEL outcomes across sessions at a glance without opening each one.

Chat mode - QEL trace card

A collapsible card appears directly below the last assistant message after every creation task. The card header shows:

The label QEL
The numeric score, color-coded: green ≥ 80, yellow ≥ 60, red < 60
A PASS or FAIL badge
A repair #N tag if the response is the result of a repair attempt
The deliverable filenames
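
The color bands for the numeric score in the card header can be written as a small lookup. The function name `score_color` is hypothetical; only the thresholds come from the list above.

```python
def score_color(score: int) -> str:
    """Map a 0-100 QEL score to the header color band described above."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"
```

Note that these bands are independent of the PASS or FAIL badge, which depends on the model-tier threshold rather than on fixed cut-offs.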

Click the header to expand the card body. Inside you'll find:

Score breakdown bars for Files / Patterns / Framework / Complete / Proof (value / max)
An Issues section - failed files (red X), framework violations (red X), missing required patterns (orange triangle), failed proof gates (red X)
An Optional patterns missing section for soft-pattern misses
Proof gate output with truncated stderr from failed gates
A collapsible Show full report section with the complete text report

The card is not persisted to the database. Closing and reopening the session clears it.

Code mode - QEL tab in the Debug panel

Open the Debug panel (Ctrl+Shift+E or click Debug in the panel sidebar), then click the QEL tab. The QEL tab shows the same data as the Chat mode card but with more vertical space and adds a repair trajectory section - a reverse-chronological list of all repair attempts for the current creation task (up to 10 iterations). A purple dot badge appears on the tab label when a trace is available and the tab is not currently active.

When no creation task has run, the QEL tab shows: No verification data. Runs automatically after creation tasks.

ACP agents (Cursor, Claude Code, Gemini CLI, Codex) do not go through QEL. Their Fleet cards show an "external - not QEL-verified" badge instead.

How the five scoring dimensions work

Dimension	Max pts	What it checks
Files	5	Did each expected deliverable file get created? Scored per-file proportionally.
Patterns	35	Did the written files include expected content? Hard patterns (2× weight) are required patterns like framework imports and route paths. Soft patterns (1× weight) are optional but expected. Both use exact substring match first, then fuzzy match (≤ 15% edit distance) as a fallback. Comments are stripped before matching - commented-out code cannot satisfy a pattern.
Framework	15	Were the written files using the correct framework? If you asked for Flask but the model wrote FastAPI imports, this is a hard fail that overrides all other scores and forces an immediate rewrite.
Complete	15	Are the files substantively filled in, not trivially small? Uses structural element count (functions, classes, route handlers) against expected field counts; falls back to byte-size when no field count is available.
Proof	30	Did compile/test commands pass? A compile failure hard-caps the total score at 50 - broken code cannot pass QEL regardless of pattern coverage.
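
The Patterns row describes a three-step check: strip comments, try an exact substring match, then fall back to a fuzzy match within 15% edit distance. The sketch below is one way to read that description. The comment stripper is deliberately naive (it ignores string literals), and the sliding-window fuzzy search, like every function name here, is an assumption rather than QEL's actual matcher.

```python
import re


def strip_comments(source: str) -> str:
    """Drop /* ... */ blocks and # or // line comments (naive: ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"(#|//).*", "", source)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pattern_matches(pattern: str, source: str, max_ratio: float = 0.15) -> bool:
    """Exact substring match first, then a fuzzy match over pattern-sized windows."""
    code = strip_comments(source)
    if pattern in code:
        return True
    window = len(pattern)
    budget = int(window * max_ratio)  # edits allowed under the 15% rule
    if budget == 0:
        return False
    for start in range(max(1, len(code) - window + 1)):
        if edit_distance(pattern, code[start:start + window]) <= budget:
            return True
    return False
```

Because comments are removed before either match runs, a commented-out `from flask import Flask` does not satisfy the pattern in this sketch, which mirrors the rule in the table.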

Final score = sum of all five dimensions, clamped to the range 0–100.
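
As a worked illustration of the arithmetic, the sketch below sums the five dimensions using the maxima from the table and applies the compile-failure cap of 50. The `Breakdown` type and `final_score` function are hypothetical names, and the framework hard fail is left out because this page does not say how it changes the number.

```python
from dataclasses import dataclass

# Per-dimension maxima from the table above (they add up to 100).
MAX_POINTS = {"files": 5, "patterns": 35, "framework": 15, "complete": 15, "proof": 30}
COMPILE_FAILURE_CAP = 50


@dataclass
class Breakdown:
    files: float
    patterns: float
    framework: float
    complete: float
    proof: float
    compile_failed: bool = False


def final_score(b: Breakdown) -> int:
    """Sum the five dimensions, apply the compile-failure cap, clamp to 0-100."""
    total = sum(min(max(getattr(b, name), 0), cap) for name, cap in MAX_POINTS.items())
    if b.compile_failed:
        total = min(total, COMPILE_FAILURE_CAP)
    return round(max(0, min(100, total)))
```

For example, full marks on everything except Proof would sum to 70, but with a compile failure the cap brings the result down to 50.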

Structural integrity - anti-stub multiplier

Before the pattern score is finalized, StructuralVerifier scans the written code for stub functions - empty bodies, TODO comments, pass, raise NotImplementedError. The share of real (non-stub) elements among all elements becomes a multiplier on the pattern score, so a file that is 50% stubs gets half the pattern credit. This runs on .ts, .tsx, .js, .jsx, .py, .go, and .rs files only - not on HTML, CSS, or JSON.
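
A minimal sketch of the anti-stub idea for Python files only, using the standard `ast` module. It counts functions whose body is empty apart from a docstring, `pass`, `...`, or `raise NotImplementedError`; TODO-comment detection is omitted. The function names are hypothetical and this is not StructuralVerifier's actual logic.

```python
import ast


def is_stub(func: ast.AST) -> bool:
    """True when a function body is only a docstring, pass, ..., or raise NotImplementedError."""
    body = list(func.body)
    first = body[0] if body else None
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        body = body[1:]  # ignore a leading docstring
    if not body:
        return True
    if len(body) > 1:
        return False
    stmt = body[0]
    if isinstance(stmt, ast.Pass):
        return True
    if (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)
            and stmt.value.value is Ellipsis):
        return True
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        target = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(target, ast.Name) and target.id == "NotImplementedError"
    return False


def stub_multiplier(source: str) -> float:
    """Share of non-stub functions; 1.0 when there are no functions to judge (assumption)."""
    funcs = [n for n in ast.walk(ast.parse(source))
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        return 1.0
    real = sum(1 for f in funcs if not is_stub(f))
    return real / len(funcs)
```

With one real function and one `pass` stub the multiplier is 0.5, so a raw pattern score of 30 would become 15.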

Contract extraction - how Bodega reads your request

The contract extractor runs in under 5 ms using pure regex - no LLM call. It parses your message to produce a machine-checkable ExecutionContract that specifies:

Deliverables: which files must be created. If you name files explicitly (create server.py and models.py), those are used directly. If you only name a framework (create a Flask API), the extractor infers the entry-point file (app.py).
Framework: detected by keyword matching against the FRAMEWORK_DB. Recognized frameworks include: Flask, FastAPI, Django, pytest, unittest, Express, Fastify, React, Vue, Next.js, NestJS, SvelteKit, Hono, Elysia, Prisma, Gin, Echo, Fiber, Actix, Rocket, Axum, Spring, Rails.
Hard patterns (MUST appear): framework imports, route paths, class names. Weighted 2× in scoring.
Soft patterns (SHOULD appear): helper function names, field names. Weighted 1×.
Proof-gate commands: inferred from file extensions.

The contract is injected into the model's system prompt as a TASK CONTRACT block before the first LLM call. It is also visible as the Deliverables section inside the expanded QEL trace card.
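
To make the extraction step concrete, here is a small regex-only sketch that produces a contract-like object from a request. The proof-gate commands and the Flask to app.py inference come from this page. The field names, the regular expressions, and the single-entry framework table are assumptions, and the real ExecutionContract may be shaped differently.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-in for FRAMEWORK_DB: only the Flask -> app.py inference is stated in the docs.
FRAMEWORK_ENTRY_POINTS = {"flask": "app.py"}
# Proof-gate commands copied from the proof-gate table, keyed by file extension.
PROOF_GATES = {
    "py": "python -m compileall -q . 2>&1",
    "ts": "npx tsc --noEmit --skipLibCheck 2>&1",
    "rs": "cargo check 2>&1",
}
FILENAME_RE = re.compile(r"\b[\w./-]+\.(?:py|ts|tsx|js|jsx|go|rs)\b")
ROUTE_RE = re.compile(r"(?<!\S)/[\w/-]+")  # a path that starts a whitespace-separated token


@dataclass
class ExecutionContract:
    deliverables: List[str]
    framework: Optional[str]
    hard_patterns: List[str] = field(default_factory=list)
    proof_gates: List[str] = field(default_factory=list)


def extract_contract(message: str) -> ExecutionContract:
    """Build a contract from explicit filenames, a framework keyword, and route paths."""
    lowered = message.lower()
    framework = next(
        (name for name in FRAMEWORK_ENTRY_POINTS
         if re.search(rf"\b{re.escape(name)}\b", lowered)),
        None,
    )
    deliverables = FILENAME_RE.findall(message)
    if not deliverables and framework:
        deliverables = [FRAMEWORK_ENTRY_POINTS[framework]]  # inferred entry point
    routes = ROUTE_RE.findall(message)
    extensions = {name.rsplit(".", 1)[-1] for name in deliverables}
    gates = sorted({PROOF_GATES[ext] for ext in extensions if ext in PROOF_GATES})
    return ExecutionContract(deliverables, framework, routes, gates)
```

In this sketch, "Create a Flask API" yields the single inferred deliverable app.py, while naming server.py and models.py explicitly yields exactly those two files, which is why explicit filenames give a tighter contract.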

Extraction confidence can be low or medium when requests are phrased ambiguously - in those cases, fewer patterns are extracted and the score may be higher than expected for incomplete output. If you want more rigorous verification, name your files explicitly and include the framework name in the request.

Contract preview - see what QEL will check before you send

As you type a creation or modification task in the chat composer, a small non-blocking "Will verify (creation)" or "Will verify (modification)" card appears above the input, listing the deliverable files, framework, and route paths QEL will check against.

It is purely observational - debounced, read-only, and it never gates or alters your send. It stays hidden for plain chat and read-intent questions. Under the hood it runs the same ExecutionContract extractor (regex + static maps, under 5 ms; no LLM, DB, file-system, or network access), so the preview is the same contract you'll see post-run in the trace card's Deliverables section - just surfaced before the run.

This lets you tighten your request before generating: if the preview shows fewer files or the wrong framework than you intended, name the files explicitly or add the framework name, then re-read the updated card. What you see in the preview is exactly what QEL will verify against.

Proof gates - compile and test verification

Proof gates run real compiler and test-runner commands against the written files. Commands are inferred from the file extensions in the contract:

Language	Proof gate command
TypeScript / JavaScript	`npx tsc --noEmit --skipLibCheck 2>&1`
Python	`python -m compileall -q . 2>&1`
Go	`go build ./...` and `go vet ./...`
Rust	`cargo check 2>&1` (skips codegen - faster than `cargo build`)
Java	`find . -name '*.java' -maxdepth 3 -exec javac {} + 2>&1`
C#	`dotnet build 2>&1`
Ruby	`ruby -c <file>` (syntax only)
PHP	`php -l <file>` (lint only)
SQL	built-in structural parse (`lint:sql` - no database, no install)
Dockerfile	built-in structural check (`lint:dockerfile`)
Tests (if requested)	`pytest`; `vitest` is inferred automatically when a JS/TS test file is a deliverable

Gates run in parallel, so a compile gate and a test gate cost one round, not two.

If the toolchain is not installed or not on PATH, the exit matches an environment-unavailable pattern (command not found, ENOENT, etc.) and the gate is scored neutral - no positive or negative points, no repair nudge. The trace card shows the stderr so you can diagnose toolchain issues.
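
The three-way outcome of a proof gate (pass, fail, neutral) can be sketched as a classifier over the exit code and captured output. Only "command not found" and "ENOENT" are named on this page; the real environment-unavailable pattern list is longer, and the function name is hypothetical.

```python
import re

# Only these two markers are named in the docs; the real list covers more cases.
ENV_UNAVAILABLE = re.compile(r"command not found|ENOENT", re.IGNORECASE)


def classify_gate(exit_code: int, output: str) -> str:
    """Return 'pass', 'neutral' (toolchain missing), or 'fail' for one gate run."""
    if exit_code == 0:
        return "pass"
    if ENV_UNAVAILABLE.search(output):
        return "neutral"  # no points either way and no repair nudge
    return "fail"
```

A missing `tsc` therefore reads as neutral rather than as a compile error, so the model is not asked to repair code that was never actually compiled.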

For languages with no proof gate (Swift, Kotlin, and others), the trace card shows "No proof gate available for that extension. Code was not verified." A code task with no verification evidence at all parks for review rather than auto-passing.

Mid-loop micro-proof gates also run after every Nth file write during a multi-file task, using the same commands with a tighter timeout. These catch compile errors while the model is still in its tool-calling loop - before the full post-loop verification.

Execution proofs (new in beta.27): for a server task, QEL goes one step past compiling - it starts the generated app on a private local port and sends one request to the first route you asked for. A response with a status code below 500 is the strongest pass evidence; a crash on boot or a 5xx response is a real failure; a missing interpreter or an undetectable start command is neutral. The whole proof is capped at ~12 seconds, runs only against 127.0.0.1, and the spawned process tree is always killed when it ends. Turn it off with the qel.execution_proofs setting. The trace card shows the probe (GET /api/tasks → 200).
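
The verdict rules for an execution proof reduce to a small decision function. The sketch below takes the outcome of the probe as plain inputs instead of actually spawning a server; the parameter names are hypothetical, and a crash on boot is modelled here as "no HTTP status was received".

```python
from typing import Optional


def classify_execution_proof(
    start_command: Optional[str],
    interpreter_available: bool,
    http_status: Optional[int],
) -> str:
    """Return 'pass', 'fail', or 'neutral' following the rules described above."""
    if start_command is None or not interpreter_available:
        return "neutral"  # nothing could be started, so there is no evidence either way
    if http_status is None:
        return "fail"  # assumption: no status means the app crashed on boot
    return "pass" if http_status < 500 else "fail"
```

A 404 still counts as a pass under this rule, because the check is that the server boots and answers, not that the response is the right one.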

Model-tier pass thresholds

QEL automatically adjusts its pass threshold and timeouts by the model's size class. You don't configure this - the threshold is resolved from the model ID before the agentic loop runs.

Size class	Parameters	Pass threshold	Compile timeout	Test timeout
tiny	≤ 3B	55 / 100	40 s	180 s
small	4B – 13B	65 / 100	40 s	180 s
medium	13B – 34B	75 / 100	25 s	120 s
large	34B – 70B	80 / 100	20 s	90 s
xlarge	70B+, Claude, GPT-4-class	85 / 100	15 s	60 s

Smaller models get longer timeouts because they generate code more slowly. A PASS badge always reflects the threshold for the model that ran - a small-tier PASS at 65 / 100 is not the same bar as an xlarge PASS at 85 / 100. Factor in the tier when comparing outputs across models.
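
The tier table can be read as a lookup keyed by size class. The values below are copied from the table; the `Tier` type and `is_pass` helper are hypothetical names, and how a model ID is resolved to a size class is not covered here.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    pass_threshold: int
    compile_timeout_s: int
    test_timeout_s: int


# Values copied from the size-class table above.
TIERS = {
    "tiny": Tier(55, 40, 180),
    "small": Tier(65, 40, 180),
    "medium": Tier(75, 25, 120),
    "large": Tier(80, 20, 90),
    "xlarge": Tier(85, 15, 60),
}


def is_pass(score: int, size_class: str) -> bool:
    """A run passes when its score is not below the tier's threshold."""
    return score >= TIERS[size_class].pass_threshold
```

The same raw score of 70 passes for a small model and fails for an xlarge one, which is the reason to compare raw scores rather than badges across tiers.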

Repair nudges - how QEL fixes its own failures

When the score falls below the pass threshold, CompletionRepairManager injects a structured repair nudge back into the agentic loop as an additional user message. The nudge is not free-form - it contains numbered REWRITE/PATCH instructions:

REWRITE server.py: Missing REQUIRED patterns: /api/tasks. Use Flask, not FastAPI.
PATCH models.py: Add: Task class, User class

The default repair budget is 2 attempts. A 3rd bonus attempt is granted if the score improved between iterations (a converging model). If the score stagnates or regresses, the bonus is withheld.
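
The budget rule can be sketched as a function over the score history. It assumes that "improved between iterations" means the latest score is strictly higher than the one before it; the function name and the list-based history are illustrative, not CompletionRepairManager's actual interface.

```python
from typing import List

BASE_BUDGET = 2  # default repair attempts, per the paragraph above


def should_repair(scores: List[int], pass_threshold: int) -> bool:
    """scores[0] is the initial run; each later entry is the score after one repair."""
    if scores[-1] >= pass_threshold:
        return False  # already passing, nothing to repair
    attempts_used = len(scores) - 1
    if attempts_used < BASE_BUDGET:
        return True
    if attempts_used == BASE_BUDGET:
        # Bonus 3rd attempt only for a converging model (assumed: last score > previous).
        return scores[-1] > scores[-2]
    return False
```

With a threshold of 65, the history 50, 55, 60 earns the bonus attempt, while 50, 55, 55 does not.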

After the budget is exhausted, a summary note is appended to the response:

Repair budget exhausted (2 attempts). Score: 62/100. Missing: server.py. Details: ...

After 3 consecutive repair failures, the nudge is prefixed with a re-plan instruction - STOP - your repairs are not working. Step back and re-plan... - to break the model out of a stuck loop.

Test-first repair (beta.29): at the 2nd consecutive failure - earlier than the re-plan - a capable model whose project has a test runner (Vitest/Jest/pytest/go test) gets a stronger move instead of another pattern-nudge: write one failing test that pins the missing behavior, prove it fails for the right reason, then fix until it's green. A passing test is concrete proof the behavior works. It's deliberately gated to strong/medium tool-calling models - weaker local models tend to spiral writing tests they can't satisfy - and fires once, not in a loop. Toggle: agent.test_first_repair (Settings → Agent, on by default).

Repair is skipped for single-file deliverables where the file was successfully written. Small local models tend to re-write the same file endlessly when given a verify nudge - skipping repair prevents that oscillation.

Truncated file detection: if a file write was cut off mid-stream (detected by the marker "[Truncated" or "max_tokens" in the tool result), a file-continuation nudge fires first: "The file was truncated. Use file_system append from exactly where it stopped."
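
A minimal sketch of the truncation check, assuming the tool result is available as a plain string. The two markers and the nudge text are quoted from the paragraph above; the function name is hypothetical.

```python
from typing import Optional

TRUNCATION_MARKERS = ("[Truncated", "max_tokens")
CONTINUATION_NUDGE = (
    "The file was truncated. Use file_system append from exactly where it stopped."
)


def continuation_nudge(tool_result: str) -> Optional[str]:
    """Return the continuation nudge when the tool result shows a cut-off write."""
    if any(marker in tool_result for marker in TRUNCATION_MARKERS):
        return CONTINUATION_NUDGE
    return None
```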

In Chat mode you see each repair iteration as a new agent response, with a repair #N tag on the QEL card header. In Code mode the VerificationReportPanel QEL tab shows the full repair trajectory under Show repair trajectory (N iterations).

Post-loop code review - a quality pass on the changed files

After a creation task passes QEL, Bodega can run an optional second pass that reviews the files the agent actually changed - looking for single-responsibility violations, awkward naming, and obvious bugs - and surfaces what it finds as a collapsible Code Quality section inside the verification card, grouped by file with a severity dot.

It is non-blocking: it never gates completion and never changes the QEL pass/fail verdict. It is deliberately bounded - it reviews creation tasks only, skips any file over 50 KB, caps the list at five findings, and bails to an empty result after a 5-second budget so a slow review can never hold up your run. Every finding is sanitized (paths relativized, secrets redacted) before it lands in the trace.

It is off by default. Turn it on in Settings:

qel.code_review - enable the post-loop review.
qel.code_review_model - the model to review with. Leave empty to reuse the model that generated the code.

Under air-gap the review only runs against a locally-served model; if none is available it simply returns nothing. The findings ride the existing QEL trace, so old sessions (and headless/CLI runs that have no card to render) are unaffected - the section just doesn't appear.

Rubric grader - score the output against your own quality bar

Sometimes "it compiles and matches the contract" isn't the bar you care about. The rubric grader lets you attach a free-text quality rubric to a task; after QEL passes, a one-shot grader reads the rubric and the QEL result and pins a verdict (pass / fail / inconclusive), an optional score, and a short justification to the verification card.

It is opt-in per request - the rubric travels with the message, so you decide task-by-task whether to grade. It runs on a dedicated small grader model when you configure one (qel.grader_model), falling back to the model that generated the code when you don't. Using a different model to grade is recommended: a model judging its own output shares its own blind spots.

The grader is built to never get in the way. It short-circuits to inconclusive under air-gap (grading is an LLM call), on an empty rubric, or on any error - it never breaks the stream, and the rubric is truncated to keep the prompt bounded. Like the code-review section, the result rides the existing QEL trace as an optional field, so cards without a rubric render exactly as before.

Goals - durable objectives with an adversarial completion gate

A TODO list lives for one message. A goal lives for the whole task. Type "/goal API passes all auth tests with rate limiting" in any AI panel and the goal becomes a real object, not a line in the chat:

It survives every message. Its task list picks up exactly where the last run stopped - a run that hits its iteration cap reports what's left and resumes on your next message.
An approved plan's steps become goal tasks automatically - plans stop being throwaway text.
The task panel shows the goal with live progress (3/7 done), and the goal title rides into the agent's context each turn so it never loses the thread.

The agent can't just declare victory. A goal only completes when the work actually verifies (QEL pass), and right before it does, a second model is sent in to attack the result - "find what breaks against this goal." Anything it finds becomes new tasks and the agent keeps working. The gate runs one attack per run and never blocks a finish on a reviewer error. You pick the reviewer model in Settings → Agent (goals.reviewer_model); two different models don't share blind spots. Turn the gate off with goals.adversarial_review (on by default).

The trained companion is the /decompose skill, which turns a big objective into a goal with 3–7 verifiable tasks (each task naming its own evidence - the test that passes, the file that exists).

Per-model QEL performance stats

Every QEL result (pass/fail + score) is logged to the model_performance_log table with a rolling window of 100 results per model. To see aggregate stats:

Open Settings → Models.
Click My Models.
Click the Performance tab.

The table shows each model's QEL pass rate and average score across all sessions. Stats persist until you uninstall the app; clearing the local SQLite database also resets the tracker.

If a model accumulates 3 consecutive QEL failures within a session, a boost_suggestion debug event fires suggesting you enable Cloud Boost for that task. This also appears in the trace card's debug section.
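
The rolling window and the three-failure trigger can be sketched in memory. The real data lives in the model_performance_log table in SQLite; the `PerformanceLog` class and `should_suggest_boost` helper below are hypothetical stand-ins that only illustrate the window and streak logic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

WINDOW = 100  # rolling window of results kept per model
BOOST_AFTER_FAILURES = 3


class PerformanceLog:
    """In-memory stand-in for the per-model rolling result log."""

    def __init__(self) -> None:
        self._results: Dict[str, Deque[Tuple[bool, int]]] = defaultdict(
            lambda: deque(maxlen=WINDOW)
        )

    def record(self, model_id: str, passed: bool, score: int) -> None:
        self._results[model_id].append((passed, score))  # oldest entry drops off at 100

    def stats(self, model_id: str) -> Tuple[float, float]:
        """Return (pass rate, average score) over the current window."""
        rows = self._results[model_id]
        if not rows:
            return (0.0, 0.0)
        pass_rate = sum(1 for passed, _ in rows if passed) / len(rows)
        average = sum(score for _, score in rows) / len(rows)
        return (pass_rate, average)


def should_suggest_boost(session_outcomes: List[bool]) -> bool:
    """True when the last three QEL results in the session were all failures."""
    recent = session_outcomes[-BOOST_AFTER_FAILURES:]
    return len(recent) == BOOST_AFTER_FAILURES and not any(recent)
```

Because the window holds only the most recent 100 results, a model that failed its first 100 runs and passed the next 100 would show a 100% pass rate in this sketch.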

What QEL does not check

A QEL PASS badge means the code is well-formed and structurally complete for what Bodega extracted from your request. It does not mean the code is correct for your use case.

QEL does not verify:

Business logic correctness - code can compile and pass all pattern checks while doing the wrong thing
Algorithm correctness or efficiency
Security vulnerabilities beyond known structural patterns
Race conditions or concurrency bugs
Whether the code matches your actual intent - only the explicit patterns extracted from the request
Database query correctness
API endpoint response correctness - the execution proof verifies the server boots and the route answers (status < 500), not that the response body is right
Test quality - tests can be trivial or tautological and still pass the proof gate

Always review, run, and test generated code before using it.

Note on the Semantic judge: the LLM-judge (qel.semantic_judge, off by default; configured via settings, not yet in the Settings UI) runs a separate judge model on the 65–85 uncertain score band for large/xlarge models, adding up to 15 bonus points and a Semantic score bar. Since beta.27 it also works in air-gap mode when the judge model is served locally (Ollama, llama.cpp, LM Studio) - only cloud judges stay blocked - and it can veto: a 0/3 verdict on a barely-passing score pulls the run just below the bar so it parks for review instead of auto-applying (qel.judge_can_veto, on by default). It never touches confident scores.

Getting the most out of QEL

Name your files explicitly. Contract extraction is regex-based. "Create server.py and models.py with a Flask API" produces a tighter contract than "Create a Flask API" - the extractor infers app.py as a fallback when no filenames are specified, but your actual request may expect different filenames.

Include the framework name. QEL's framework scoring and mutual-exclusion detection require a keyword match against the FRAMEWORK_DB. A request like "Create a NestJS API with controllers and services" gives QEL everything it needs to flag an Express import as a hard violation.

Install your toolchain. Proof gates require tsc, python, go, cargo, javac, or dotnet to be on PATH. If the command is not found, the gate is scored neutral - QEL gets no signal from the compile step, so the run earns no Proof points from it.

Don't compare PASS badges across model tiers without accounting for the threshold. A small-tier PASS at 65 / 100 reflects a different bar than an xlarge PASS at 85 / 100. The trace card always shows the raw score - use that for apples-to-apples comparison.

Keyboard shortcuts

Keys	Action
`Ctrl+Shift+E`	Open Debug panel (Code mode) - where the QEL tab lives

This page mirrors the in-app docs hub for app version 1.0.0-beta.32.1.