Intelligence & verification

QEL Verification

QEL (Quality Enforcement Layer) is a fully automatic post-generation system built into Bodega One's agentic loop. It activates only on code-creation tasks, runs without any user input, and surfaces a scored pass/fail result directly in Chat and Code mode.

What QEL does - and when it runs

QEL activates when Bodega's agentic loop finishes a creation task: any request that contains a creation verb (create, write, generate, build, implement, scaffold) paired with an artifact noun (files, scripts, components, endpoints, etc.). It does not run on read-intent messages - questions that start with "what", "how many", "show me", or similar phrases are not treated as creation tasks.
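The trigger rule above can be sketched as a pair of regular expressions. This is a minimal illustration, not Bodega's actual detector: the verb list comes from the text, while the noun list is only the examples named there (the real list is longer) and the read-intent prefixes are the three quoted phrases. The names `is_creation_task`, `CREATION`, and `READ_INTENT` are hypothetical.

```python
import re

# Verbs are taken from the description above. The noun list covers only the
# examples named there, and the read-intent prefixes are the quoted phrases.
CREATION_VERBS = r"(?:create|write|generate|build|implement|scaffold)"
ARTIFACT_NOUNS = r"(?:files?|scripts?|components?|endpoints?)"

READ_INTENT = re.compile(r"^\s*(?:what|how many|show me)\b", re.IGNORECASE)
CREATION = re.compile(
    rf"\b{CREATION_VERBS}\b.*\b{ARTIFACT_NOUNS}\b",
    re.IGNORECASE | re.DOTALL,
)


def is_creation_task(message: str) -> bool:
    """Return True when a creation verb is followed by an artifact noun."""
    if READ_INTENT.match(message):
        # Questions are read-intent even if they mention a verb and a noun.
        return False
    return CREATION.search(message) is not None
```

Under these assumptions, "Create a script that parses logs" qualifies, while "What files does the build script create?" does not, because the read-intent check runs first.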

The system does four things in sequence:

  1. Extracts a machine-checkable contract from your request before the first LLM call - which files must be created, which framework is required, which patterns must appear in each file, and which shell commands should verify the output.
  2. Injects that contract into the model's system prompt as a TASK CONTRACT block so the model knows exactly what it must produce.
  3. Scores the output across five dimensions after the agent finishes writing files.
  4. Runs repair nudges if the score falls below the pass threshold - injecting targeted correction instructions back into the agentic loop for up to 2 attempts (3 if the score is improving) before surfacing a failure note.

QEL was introduced in beta.19. The structured trace card, score breakdown object, and expanded framework database (Next.js, NestJS, SvelteKit, Hono, Elysia, Prisma) were added in beta.25.

Where QEL results appear

Results show up in three places.

Chat sidebar - QEL score badge

Sessions that completed a QEL-verified creation task show a score badge (0–100) next to the session title in the Chat sidebar. Green = pass, red = fail. This lets you compare QEL outcomes across sessions at a glance without opening each one.

Chat mode - QEL trace card

A collapsible card appears directly below the last assistant message after every creation task. The card header shows:

  • The label QEL
  • The numeric score, color-coded: green ≥ 80, yellow ≥ 60, red < 60
  • A PASS or FAIL badge
  • A repair #N tag if the response is the result of a repair attempt
  • The deliverable filenames

Click the header to expand the card body. Inside you'll find:

  • Score breakdown bars for Files / Patterns / Framework / Complete / Proof (value / max)
  • An Issues section - failed files (red X), framework violations (red X), missing required patterns (orange triangle), failed proof gates (red X)
  • An Optional patterns missing section for soft-pattern misses
  • Proof gate output with truncated stderr from failed gates
  • A collapsible Show full report section with the complete text report

The card is not persisted to the database. Closing and reopening the session clears it.

Code mode - QEL tab in the Debug panel

Open the Debug panel (Ctrl+Shift+E or click Debug in the panel sidebar), then click the QEL tab. The QEL tab shows the same data as the Chat mode card with more vertical space, and adds a repair trajectory section: a reverse-chronological list of all repair attempts for the current creation task (up to 10 iterations). A purple dot badge appears on the tab label when a trace is available and the tab is not currently active.

When no creation task has run, the QEL tab shows: No verification data. Runs automatically after creation tasks.

ACP agents (Cursor, Claude Code, Gemini CLI, Codex) do not go through QEL. Their Fleet cards show an external - not QEL-verified badge instead.

How the five scoring dimensions work

| Dimension | Max pts | What it checks |
| --- | --- | --- |
| Files | 5 | Did each expected deliverable file get created? Scored per-file proportionally. |
| Patterns | 35 | Did the written files include expected content? Hard patterns (2× weight) are required patterns like framework imports and route paths. Soft patterns (1× weight) are optional but expected. Both use exact substring match first, then fuzzy match (≤ 15% edit distance) as a fallback. Comments are stripped before matching - commented-out code cannot satisfy a pattern. |
| Framework | 15 | Were the written files using the correct framework? If you asked for Flask but the model wrote FastAPI imports, this is a hard fail that overrides all other scores and forces an immediate rewrite. |
| Complete | 15 | Are the files substantively filled in, not trivially small? Uses structural element count (functions, classes, route handlers) against expected field counts; falls back to byte-size when no field count is available. |
| Proof | 30 | Did compile/test commands pass? A compile failure hard-caps the total score at 50 - broken code cannot pass QEL regardless of pattern coverage. |
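The Patterns row describes a two-stage match: exact substring first, then a fuzzy fallback within 15% edit distance, both applied after comments are stripped. Here is a rough sketch of that idea. The comment stripper is deliberately naive (it only handles `#` and `//` line comments and ignores string literals), and the fuzzy stage compares the pattern against same-length windows of the source; both choices are assumptions, since the text does not say how QEL does either step. All function names are hypothetical.

```python
import re


def strip_line_comments(source: str) -> str:
    """Drop '#' and '//' line comments. Naive: does not respect string literals."""
    return "\n".join(re.sub(r"(#|//).*$", "", line) for line in source.splitlines())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pattern_matches(pattern: str, source: str, max_ratio: float = 0.15) -> bool:
    """Exact substring match first, then a windowed fuzzy match as a fallback."""
    code = strip_line_comments(source)
    if pattern in code:
        return True
    n = len(pattern)
    budget = int(max_ratio * n)  # allowed edits, e.g. 1 for a 10-character pattern
    if n == 0 or budget == 0 or len(code) < n:
        return False
    return any(
        edit_distance(pattern, code[i:i + n]) <= budget
        for i in range(len(code) - n + 1)
    )
```

With these assumptions, a commented-out `from flask import Flask` does not satisfy the pattern, while `/api/task/` is close enough to satisfy `/api/tasks`.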

Final score = sum of all five dimensions, capped 0–100.
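As a worked illustration of the arithmetic, the sketch below sums the five dimensions, clamps the total to 0–100, and applies the compile-failure cap of 50. The pattern formula (weighted hits over weighted total, scaled to 35 points) is an assumption about how the 2×/1× weights are normalized, as is giving full pattern credit when no patterns were extracted. The framework hard-fail override is not modeled here. `pattern_points` and `final_score` are hypothetical names.

```python
# Maximum points per dimension, from the table above.
MAX_POINTS = {"files": 5, "patterns": 35, "framework": 15, "complete": 15, "proof": 30}
COMPILE_FAILURE_CAP = 50


def pattern_points(hard_hit: int, hard_total: int, soft_hit: int, soft_total: int) -> float:
    """Weighted pattern credit: hard patterns count 2x, soft patterns 1x (assumed formula)."""
    weight_total = 2 * hard_total + soft_total
    if weight_total == 0:
        # Assumption: nothing to check means nothing can be missed.
        return float(MAX_POINTS["patterns"])
    return MAX_POINTS["patterns"] * (2 * hard_hit + soft_hit) / weight_total


def final_score(points: dict[str, float], compile_failed: bool) -> float:
    """Sum the five dimensions, clamp to 0-100, then apply the compile cap."""
    total = sum(
        min(max(points.get(name, 0.0), 0.0), cap) for name, cap in MAX_POINTS.items()
    )
    total = min(max(total, 0.0), 100.0)
    if compile_failed:
        total = min(total, COMPILE_FAILURE_CAP)
    return total
```

For example, output that earns full marks everywhere except Proof would total 70, but a compile failure caps it at 50, which is below every tier's pass threshold.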

Structural integrity - anti-stub multiplier

Before the pattern score is finalized, StructuralVerifier scans the written code for stub functions - empty bodies, TODO comments, pass, raise NotImplementedError. The ratio of real to stub elements becomes a multiplier on the pattern score. 50% stubs halves the pattern credit. This runs on .ts, .tsx, .js, .jsx, .py, .go, and .rs files only - not on HTML, CSS, or JSON.
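To make the multiplier concrete, here is a small sketch that handles Python sources only, using the standard `ast` module. It treats a function as a stub when its body (ignoring a docstring) contains nothing but `pass`, `...`, or `raise NotImplementedError`. TODO comments are not visible in the AST, so this sketch ignores them, and the other languages StructuralVerifier covers would need their own parsers. `stub_multiplier` is a hypothetical name, not the verifier's API.

```python
import ast


def _raises_not_implemented(stmt: ast.Raise) -> bool:
    exc = stmt.exc
    if isinstance(exc, ast.Call):  # raise NotImplementedError("...")
        exc = exc.func
    return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"


def _is_stub(func) -> bool:
    body = list(func.body)
    # Ignore a leading docstring when judging whether the body does real work.
    if (body and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)):
        body = body[1:]
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        if (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)
                and stmt.value.value is Ellipsis):
            continue
        if isinstance(stmt, ast.Raise) and _raises_not_implemented(stmt):
            continue
        return False  # found a statement that is not a placeholder
    return True


def stub_multiplier(source: str) -> float:
    """Ratio of real functions to all functions; 1.0 when there are none."""
    funcs = [
        node for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return 1.0
    return sum(not _is_stub(f) for f in funcs) / len(funcs)
```

A file with one real function and one `pass` stub yields 0.5, so a raw pattern score of 30 would become 15 under this sketch.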

Contract extraction - how Bodega reads your request

The contract extractor runs in under 5 ms using pure regex - no LLM call. It parses your message to produce a machine-checkable ExecutionContract that specifies:

  • Deliverables: which files must be created. If you name files explicitly (create server.py and models.py), those are used directly. If you only name a framework (create a Flask API), the extractor infers the entry-point file (app.py).
  • Framework: detected by keyword matching against the FRAMEWORK_DB. Recognized frameworks include: Flask, FastAPI, Django, pytest, unittest, Express, Fastify, React, Vue, Next.js, NestJS, SvelteKit, Hono, Elysia, Prisma, Gin, Echo, Fiber, Actix, Rocket, Axum, Spring, Rails.
  • Hard patterns (MUST appear): framework imports, route paths, class names. Weighted 2× in scoring.
  • Soft patterns (SHOULD appear): helper function names, field names. Weighted 1×.
  • Proof-gate commands: inferred from file extensions.
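The fields above map naturally onto a small data structure. The sketch below is an assumption-heavy illustration of regex-only extraction: the `FRAMEWORK_DB` here is a hypothetical two-entry stand-in, only the Flask entry point (app.py) comes from the text, the filename regex recognizes just a handful of extensions, and the field names on `ExecutionContract` are guesses rather than the real schema.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical two-entry stand-in for FRAMEWORK_DB. Only the Flask entry point
# (app.py) comes from the text above; the rest is assumed for illustration.
FRAMEWORK_DB = {
    "flask": {"entry_point": "app.py", "hard_import": "from flask import"},
    "fastapi": {"entry_point": None, "hard_import": "from fastapi import"},
}

# Explicit filenames such as server.py or src/models.ts (assumed extension list).
FILENAME_RE = re.compile(r"\b[\w/-]+\.(?:py|ts|tsx|js|jsx|go|rs)\b")


@dataclass
class ExecutionContract:
    deliverables: list[str]
    framework: Optional[str] = None
    hard_patterns: list[str] = field(default_factory=list)
    soft_patterns: list[str] = field(default_factory=list)
    proof_commands: list[str] = field(default_factory=list)  # left empty in this sketch


def extract_contract(message: str) -> ExecutionContract:
    """Build a contract from the request text using regex matching only."""
    lowered = message.lower()
    framework = next(
        (name for name in FRAMEWORK_DB if re.search(rf"\b{re.escape(name)}\b", lowered)),
        None,
    )
    deliverables = FILENAME_RE.findall(message)
    hard_patterns = []
    if framework:
        entry = FRAMEWORK_DB[framework]
        hard_patterns.append(entry["hard_import"])
        # Fall back to the framework's entry point when no file was named.
        if not deliverables and entry["entry_point"]:
            deliverables = [entry["entry_point"]]
    return ExecutionContract(deliverables, framework, hard_patterns)
```

Under these assumptions, "Create server.py and models.py with a Flask API" yields two explicit deliverables, while "Create a Flask API" falls back to app.py.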

The contract is injected into the model's system prompt as a TASK CONTRACT block before the first LLM call. It is also visible as the Deliverables section inside the expanded QEL trace card.

Extraction confidence can be low or medium when a request is phrased ambiguously. In those cases fewer patterns are extracted, so there is less for incomplete output to fail against and the score may come out higher than the output deserves. For more rigorous verification, name your files explicitly and include the framework name in the request.

Proof gates - compile and test verification

Proof gates run real compiler and test-runner commands against the written files. Commands are inferred from the file extensions in the contract:

| Language | Proof gate command |
| --- | --- |
| TypeScript / JavaScript | npx tsc --noEmit --skipLibCheck 2>&1 |
| Python | python -m compileall -q . 2>&1 |
| Go | go build ./... and go vet ./... |
| Rust | cargo check 2>&1 (skips codegen - faster than cargo build) |
| Java | find . -name '*.java' -maxdepth 3 -exec javac {} + 2>&1 |
| C# | dotnet build 2>&1 |
| Tests (if requested) | pytest, jest, vitest |

If the toolchain is not installed or not on PATH, the exit matches an environment-unavailable pattern (command not found, ENOENT, etc.) and the gate is scored neutral - no positive or negative points, no repair nudge. The trace card shows the stderr so you can diagnose toolchain issues.
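The lookup and the neutral-scoring rule can be sketched as follows. The commands are copied from the table; the file extensions attached to each language, the two-entry environment-unavailable pattern (the text says "etc.", so the real list is longer), and the choice to treat a timeout as a failure are all assumptions. `gate_commands`, `classify`, and `run_gate` are hypothetical names.

```python
import re
import subprocess
from pathlib import PurePath

TSC = "npx tsc --noEmit --skipLibCheck 2>&1"
# Commands come from the table above; the extension keys are assumed.
PROOF_GATES = {
    ".ts": [TSC], ".tsx": [TSC], ".js": [TSC], ".jsx": [TSC],
    ".py": ["python -m compileall -q . 2>&1"],
    ".go": ["go build ./...", "go vet ./..."],
    ".rs": ["cargo check 2>&1"],
    ".java": ["find . -name '*.java' -maxdepth 3 -exec javac {} + 2>&1"],
    ".cs": ["dotnet build 2>&1"],
}

# Only the two markers named in the text; the real pattern covers more cases.
ENV_UNAVAILABLE = re.compile(r"command not found|ENOENT", re.IGNORECASE)


def gate_commands(files: list[str]) -> list[str]:
    """Collect proof-gate commands for the deliverables, without duplicates."""
    commands: list[str] = []
    for name in files:
        for command in PROOF_GATES.get(PurePath(name).suffix, []):
            if command not in commands:
                commands.append(command)
    return commands


def classify(exit_code: int, output: str) -> str:
    """Map a finished command to 'pass', 'neutral' (toolchain missing), or 'fail'."""
    if exit_code == 0:
        return "pass"
    if ENV_UNAVAILABLE.search(output):
        return "neutral"
    return "fail"


def run_gate(command: str, cwd: str, timeout_s: int) -> str:
    """Run one gate in a shell. Treating a timeout as 'fail' is an assumption."""
    try:
        done = subprocess.run(
            command, shell=True, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "fail"
    return classify(done.returncode, done.stdout + done.stderr)
```

A file such as app.rb returns no commands at all, which corresponds to the "no proof gate available" case rather than a neutral or failed gate.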

For languages with no proof gate (Ruby, PHP, Swift, Kotlin, and others), the trace card shows: No proof gate available for .rb. Code was not verified.

Mid-loop micro-proof gates also run after every Nth file write during a multi-file task, using the same commands with a tighter timeout. These catch compile errors while the model is still in its tool-calling loop - before the full post-loop verification.

Model-tier pass thresholds

QEL automatically adjusts its pass threshold and timeouts by the model's size class. You don't configure this - the threshold is resolved from the model ID before the agentic loop runs.

| Size class | Parameters | Pass threshold | Compile timeout | Test timeout |
| --- | --- | --- | --- | --- |
| tiny | ≤ 3B | 55 / 100 | 40 s | 180 s |
| small | 4B – 13B | 65 / 100 | 40 s | 180 s |
| medium | 13B – 34B | 75 / 100 | 25 s | 120 s |
| large | 34B – 70B | 80 / 100 | 20 s | 90 s |
| xlarge | 70B+, Claude, GPT-4-class | 85 / 100 | 15 s | 60 s |

Smaller models get longer timeouts because they generate code more slowly. A PASS badge always reflects the threshold for the model that ran - a small-tier PASS at 65 / 100 is not the same bar as an xlarge PASS at 85 / 100. Factor in the tier when comparing outputs across models.
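The table can be read as a lookup keyed on parameter count. The sketch below encodes the thresholds and timeouts exactly as listed, but how QEL parses a model ID is not documented here, so the `7b`-style suffix regex, the `claude` / `gpt-4` substring checks, and the handling of boundary values (13B, 34B, 70B fall into the lower tier) are all assumptions. `Tier` and `resolve_tier` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_params_b: float  # inclusive upper bound in billions (assumed boundary rule)
    pass_threshold: int
    compile_timeout_s: int
    test_timeout_s: int


# Values from the table above, ordered from smallest to largest.
TIERS = [
    Tier("tiny", 3, 55, 40, 180),
    Tier("small", 13, 65, 40, 180),
    Tier("medium", 34, 75, 25, 120),
    Tier("large", 70, 80, 20, 90),
    Tier("xlarge", float("inf"), 85, 15, 60),
]


def resolve_tier(model_id: str) -> Optional[Tier]:
    """Guess the size class from a model ID; None when no size can be found."""
    lowered = model_id.lower()
    if "claude" in lowered or "gpt-4" in lowered:
        return TIERS[-1]
    match = re.search(r"(\d+(?:\.\d+)?)b\b", lowered)  # e.g. the "7b" in "coder-7b"
    if match is None:
        return None
    params_b = float(match.group(1))
    return next(tier for tier in TIERS if params_b <= tier.max_params_b)
```

Under these assumptions, an ID ending in `-7b` resolves to the small tier with a pass threshold of 65, and an ID with no recognizable size returns `None` so the caller has to decide on a default.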

Repair nudges - how QEL fixes its own failures

When the score falls below the pass threshold, CompletionRepairManager injects a structured repair nudge back into the agentic loop as an additional user message. The nudge is not free-form - it contains numbered REWRITE/PATCH instructions:

REWRITE server.py: Missing REQUIRED patterns: /api/tasks. Use Flask, not FastAPI.
PATCH models.py: Add: Task class, User class

The default repair budget is 2 attempts. A 3rd bonus attempt is granted if the score improved between iterations (a converging model). If the score stagnates or regresses, the bonus is withheld.
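The budget rule can be written as a single decision function. This is a sketch under one interpretation: "improved between iterations" is taken to mean the latest score is strictly higher than the one before it. `may_attempt_repair` is a hypothetical name, not a method of CompletionRepairManager.

```python
def may_attempt_repair(
    scores: list[float],
    threshold: float,
    base_budget: int = 2,
    bonus: int = 1,
) -> bool:
    """Decide whether another repair nudge is allowed.

    scores[0] is the initial score; each later entry follows one repair attempt.
    """
    if not scores or scores[-1] >= threshold:
        return False  # nothing scored yet, or the output already passes
    attempts_used = len(scores) - 1
    if attempts_used < base_budget:
        return True
    if attempts_used < base_budget + bonus:
        # Bonus attempt only for a converging model (assumed: strict improvement).
        return scores[-1] > scores[-2]
    return False
```

With a threshold of 65, the history 50, 55, 62 earns the bonus third attempt, while 50, 55, 55 does not because the score stagnated.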

After the budget is exhausted, a summary note is appended to the response:

Repair budget exhausted (2 attempts). Score: 62/100. Missing: server.py. Details: ...

After 3 consecutive repair failures, the nudge is prefixed with a re-plan instruction - STOP - your repairs are not working. Step back and re-plan... - to break the model out of a stuck loop.

Repair is skipped for single-file deliverables where the file was successfully written. Small local models tend to re-write the same file endlessly when given a verify nudge - skipping repair prevents that oscillation.

Truncated file detection: if a file write was cut off mid-stream (detected by the marker "[Truncated" or max_tokens appearing in the tool result), a file-continuation nudge fires first: The file was truncated. Use file_system append from exactly where it stopped.

In Chat mode you see each repair iteration as a new agent response, with a repair #N tag on the QEL card header. In Code mode the VerificationReportPanel QEL tab shows the full repair trajectory under Show repair trajectory (N iterations).

Per-model QEL performance stats

Every QEL result (pass/fail + score) is logged to the model_performance_log table with a rolling window of 100 results per model. To see aggregate stats:

  1. Open Settings → Models.
  2. Click My Models.
  3. Click the Performance tab.

The table shows each model's QEL pass rate and average score across all sessions. Stats persist until you uninstall the app (clearing the local SQLite database resets the tracker).

If a model accumulates 3 consecutive QEL failures within a session, a boost_suggestion debug event fires suggesting you enable Cloud Boost for that task. This also appears in the trace card's debug section.
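A rolling window of 100 results and a three-in-a-row failure check are both easy to picture with a bounded deque. The sketch below keeps everything in memory and ignores session boundaries, whereas the text describes a SQLite table and a per-session failure streak, so treat it as an illustration of the bookkeeping only. `PerformanceLog` and its methods are hypothetical names.

```python
from collections import defaultdict, deque

WINDOW = 100      # results kept per model, from the text above
BOOST_AFTER = 3   # consecutive failures before a boost suggestion


class PerformanceLog:
    """In-memory stand-in for the per-model rolling QEL log."""

    def __init__(self) -> None:
        # deque(maxlen=...) silently drops the oldest result once full.
        self._results = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, model_id: str, score: float, passed: bool) -> None:
        self._results[model_id].append((score, passed))

    def pass_rate(self, model_id: str) -> float:
        results = self._results[model_id]
        return sum(passed for _, passed in results) / len(results) if results else 0.0

    def average_score(self, model_id: str) -> float:
        results = self._results[model_id]
        return sum(score for score, _ in results) / len(results) if results else 0.0

    def should_suggest_boost(self, model_id: str) -> bool:
        """True when the most recent BOOST_AFTER results are all failures."""
        recent = list(self._results[model_id])[-BOOST_AFTER:]
        return len(recent) == BOOST_AFTER and not any(passed for _, passed in recent)
```

Because the window is bounded, a model's early failures stop affecting its pass rate once 100 newer results have been logged.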

What QEL does not check

A QEL PASS badge means the code is well-formed and structurally complete for what Bodega extracted from your request. It does not mean the code is correct for your use case.

QEL does not verify:

  • Business logic correctness - code can compile and pass all pattern checks while doing the wrong thing
  • Algorithm correctness or efficiency
  • Security vulnerabilities beyond known structural patterns
  • Race conditions or concurrency bugs
  • Whether the code matches your actual intent - only the explicit patterns extracted from the request
  • Database query correctness
  • API endpoint behavior - proof gates verify that tsc passes, not that the HTTP server responds correctly
  • Test quality - tests can be trivial or tautological and still pass the proof gate

Always review, run, and test generated code before using it.

Note on the Semantic judge: a semantic LLM-judge feature (qel.semantic_judge) is implemented in the backend but the setting is not yet exposed in the Settings UI. When enabled it runs a separate judge model to evaluate semantic correctness in the 65–85 score band, adding up to 15 bonus points and a Semantic score bar. It is off by default, unavailable in air-gap mode, and only runs for large/xlarge models.

Getting the most out of QEL

Name your files explicitly. Contract extraction is regex-based. Create server.py and models.py with a Flask API produces a tighter contract than Create a Flask API - the extractor infers app.py as a fallback when no filenames are specified, but your actual request may expect different filenames.

Include the framework name. QEL's framework scoring and mutual-exclusion detection require a keyword match against the FRAMEWORK_DB. Requests like Create a NestJS API with controllers and services give QEL everything it needs to flag an Express import as a hard violation.

Install your toolchain. Proof gates require tsc, python, go, cargo, javac, or dotnet to be on PATH. If the command is not found, the gate exits neutral - QEL gets no signal from the compile step and the proof score stays at 0.

Don't compare PASS badges across model tiers without accounting for the threshold. A small-tier PASS at 65 / 100 reflects a different bar than an xlarge PASS at 85 / 100. The trace card always shows the raw score - use that for apples-to-apples comparison.

Keyboard shortcuts

| Keys | Action |
| --- | --- |
| Ctrl+Shift+E | Open Debug panel (Code mode) - where the QEL tab lives |

This page mirrors the in-app docs hub for app version 1.0.0-beta.26.1.