You deploy an autonomous agent. It picks up a task, attempts a fix, and fails. Then it picks up the same task, attempts the exact same fix, and fails again. Third attempt: same approach, same failure. It marks the task as blocked, and you are left debugging something the agent should have solved twenty minutes ago.
This is not an edge case. It is the default behavior of every agent framework that treats dispatch as stateless. The LLM has no access to what it tried before, no mechanism to avoid repeating mistakes, and no signal about whether this category of task is even worth attempting.
The fix is not more retries. It is three architectural patterns: trust gating, outcome history injection, and supervised dispatch.
The Problem: Stateless Retry Is Expensive Failure
Most agent loops look roughly like this: acquire a task from a queue, build a prompt, call the LLM, parse the result. If it fails, increment a retry counter and try again. The retry counter is the only state the system carries between attempts.
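For contrast with the patterns that follow, here is a minimal sketch of such a stateless loop. The `Task` shape, the `callModel` parameter, and `MAX_RETRIES` are hypothetical names chosen for illustration, not taken from any particular framework. Note that the prompt is rebuilt identically on every attempt.

```typescript
// Hypothetical types for illustration only.
interface Task {
  id: string;
  taskType: string;
  description: string;
}

type ModelCall = (prompt: string) => Promise<{ ok: boolean; output: string }>;

const MAX_RETRIES = 3; // assumed retry budget

// A stateless retry loop: the loop counter is the only state carried
// between attempts, so every attempt sends exactly the same prompt.
async function naiveDispatch(
  task: Task,
  callModel: ModelCall,
): Promise<"done" | "blocked"> {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const prompt = `Task (${task.taskType}): ${task.description}`;
    const result = await callModel(prompt);
    if (result.ok) return "done";
    // Nothing about the failure is kept for the next iteration.
  }
  return "blocked";
}
```

Because nothing from a failed attempt feeds into the next prompt, a model that fails deterministically on this input fails the same way on every retry.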
This design has three critical flaws:
- No attempt memory. The LLM does not know what it tried last time. It cannot avoid a previously failed approach because nobody told it about the previous approach. Each retry is a fresh coin flip with the same bad odds.
- No category-level learning. If your agent has failed the last 8 out of 10 css_fix tasks, it still picks up the 11th css_fix task at full confidence. There is no mechanism to say "this task type is failing systemically, stop burning tokens on it."
- No staleness check. The task might already be resolved by a human or a different process. Without verification, the agent works on a problem that no longer exists.
The cost adds up. Three retries on a task that was never going to succeed, multiplied across dozens of task types and hundreds of daily dispatches, can easily waste 30-40% of your LLM spend on guaranteed failures.
Pattern 1: Trust Gating
Trust gating is a pre-dispatch check. Before the agent spends a single token on a task, you compute a trust score for that task type based on historical outcomes. If the score is below a threshold, the task is blocked with a "needs human review" status instead of being dispatched.
The score formula weights recent results more heavily than older ones:
    // Score = weighted blend of overall and recent success rates
    // Window: last 50 outcomes for this task type
    // Minimum sample: 10 outcomes (no gating before that)
    const score = overallSuccessRate * 0.6 + recencyWeightedRate * 0.4;
    if (score < TRUST_GATE_THRESHOLD) {
      return `Blocked: ${taskType} trust score ${score.toFixed(2)} below ${TRUST_GATE_THRESHOLD}`;
    }
The default threshold is 0.15. That sounds low, but the gate exists to catch catastrophic failure patterns, not marginal ones. Because the score blends two success rates, a task type scoring below 0.15 is failing roughly 85% of the time or more. There is no point dispatching more of those.
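One way the two rates in the formula could be computed is sketched below. The window size, minimum sample, blend weights, and threshold come from the text above. The exponential decay used for the recency-weighted rate (and its `DECAY` value) is an assumption, since the article does not say how recency is weighted, and the function names are hypothetical.

```typescript
const WINDOW_SIZE = 50; // last 50 outcomes for this task type
const MIN_SAMPLE = 10; // no gating before 10 outcomes
const TRUST_GATE_THRESHOLD = 0.15;
const DECAY = 0.9; // assumption: each step back in time counts 10% less

// outcomes: true = success, ordered oldest -> newest.
// Returns null when there is too little data to gate on.
function computeTrustScore(outcomes: boolean[]): number | null {
  const recent = outcomes.slice(-WINDOW_SIZE);
  if (recent.length < MIN_SAMPLE) return null;

  const successes = recent.filter((ok) => ok).length;
  const overallSuccessRate = successes / recent.length;

  // Exponentially decayed success rate: the newest outcome has weight 1.
  let weightedSuccesses = 0;
  let totalWeight = 0;
  for (let i = 0; i < recent.length; i++) {
    const age = recent.length - 1 - i;
    const weight = Math.pow(DECAY, age);
    totalWeight += weight;
    if (recent[i]) weightedSuccesses += weight;
  }
  const recencyWeightedRate = weightedSuccesses / totalWeight;

  return overallSuccessRate * 0.6 + recencyWeightedRate * 0.4;
}

// A null score (not enough data) never blocks.
function isBlockedByScore(score: number | null): boolean {
  return score !== null && score < TRUST_GATE_THRESHOLD;
}
```

With this weighting, a task type whose failures are old and whose successes are recent scores higher than one with the same overall rate but the opposite ordering.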
Two important refinements prevent the gate from becoming a permanent block:
- Exempt types. Self-repair task types like triage_fix, security_fix, and service_restart bypass gating entirely. You never want the system to refuse to fix itself.
- Exploration probes. After 5 consecutive blocks on a task type, one task is allowed through to re-evaluate. This prevents permanent lockout when underlying conditions change (new model, better prompts, fixed data).
    import { checkTrustGate, TRUST_GATE_THRESHOLD } from 'agent-framework';

    // Before dispatching any task:
    const blockReason = await checkTrustGate(task.task_type);
    if (blockReason) {
      // Mark task as needs_human_review, not failed
      await updateTaskStatus(task.id, 'needs_human_review', blockReason);
      return;
    }
    // Task type is above threshold -- proceed with dispatch
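The exemption and exploration-probe refinements could be layered on top of the score check roughly as follows. This is a sketch under assumptions, not the framework's internals: `decideGate`, `GateDecision`, and the in-memory `consecutiveBlocks` counter are hypothetical, and a real system would likely persist the counter alongside the outcomes.

```typescript
const TRUST_GATE_THRESHOLD = 0.15;
const PROBE_AFTER_BLOCKS = 5; // one probe after 5 consecutive blocks
const EXEMPT_TYPES = new Set(["triage_fix", "security_fix", "service_restart"]);

// Hypothetical in-memory counter of consecutive blocks per task type.
const consecutiveBlocks = new Map<string, number>();

interface GateDecision {
  allow: boolean;
  reason: string;
}

// score is the trust score for the task type, or null if there are
// fewer outcomes than the minimum sample.
function decideGate(taskType: string, score: number | null): GateDecision {
  if (EXEMPT_TYPES.has(taskType)) return { allow: true, reason: "exempt" };
  if (score === null) return { allow: true, reason: "insufficient_data" };

  if (score >= TRUST_GATE_THRESHOLD) {
    consecutiveBlocks.delete(taskType);
    return { allow: true, reason: "above_threshold" };
  }

  const blocks = consecutiveBlocks.get(taskType) ?? 0;
  if (blocks >= PROBE_AFTER_BLOCKS) {
    // Let one task through to re-evaluate, then start counting again.
    consecutiveBlocks.set(taskType, 0);
    return { allow: true, reason: "exploration_probe" };
  }

  consecutiveBlocks.set(taskType, blocks + 1);
  return {
    allow: false,
    reason: `Blocked: ${taskType} trust score ${score.toFixed(2)} below ${TRUST_GATE_THRESHOLD}`,
  };
}
```

The probe's result is recorded like any other outcome, so a task type that has recovered can climb back above the threshold instead of staying locked out.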
Pattern 2: Outcome History Injection
The second pattern gives the LLM memory of its own previous attempts. Before building the prompt, you query the outcome history for this specific file and task type, then inject the results as structured XML that the model can reason about.
    <previous_attempts>
      <attempt n="1" outcome="failure" at="2026-05-18T14:22:00Z">
        Tried adding null check in handleResponse() -- wrong location,
        the null originates in parsePayload() upstream
      </attempt>
      <attempt n="2" outcome="failure" at="2026-05-19T09:10:00Z">
        Modified parsePayload() but missed the async path where
        response.body can be undefined before stream completes
      </attempt>
    </previous_attempts>
This gives the LLM explicit knowledge of what failed and why. Instead of a blind retry, the third attempt can reason: "Attempts 1 and 2 both targeted the wrong stage of the pipeline. The null appears during streaming, so I need to guard the async path in parsePayload() specifically."
The implementation is a query plus a formatter:
    import {
      getRecentOutcomes,
      formatOutcomesBlock,
      buildPrompt,
    } from 'agent-framework';

    // Fetch last 5 outcomes for this file + task type
    const outcomes = await getRecentOutcomes(
      task.task_type,
      ctx.file_path,
      5
    );

    // Format as XML block (empty string if no history)
    const historyBlock = formatOutcomesBlock(outcomes);

    // buildPrompt injects historyBlock into the template
    const prompt = await buildPrompt(task, ctx, historyBlock);
The prompt template receives the XML block and includes it in its context section. The LLM sees what happened before, what the failure modes were, and can steer around them.
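If you are writing the formatter yourself, it can be a small pure function. The sketch below is a stand-in, not the framework's `formatOutcomesBlock`: the `Outcome` shape and the name `renderPreviousAttempts` are assumptions, and the XML escaping is an addition so that code snippets inside a failure summary cannot break the block's structure.

```typescript
// Hypothetical outcome record; field names are assumptions.
interface Outcome {
  outcome: "success" | "failure";
  at: string; // ISO 8601 timestamp
  summary: string; // what was tried and why it failed or worked
}

// Escape characters that would otherwise be read as XML markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// outcomes are expected oldest -> newest.
// Returns an empty string when there is no history.
function renderPreviousAttempts(outcomes: Outcome[]): string {
  if (outcomes.length === 0) return "";
  const attempts = outcomes.map(
    (o, i) =>
      `  <attempt n="${i + 1}" outcome="${o.outcome}" at="${escapeXml(o.at)}">\n` +
      `    ${escapeXml(o.summary)}\n` +
      `  </attempt>`,
  );
  return `<previous_attempts>\n${attempts.join("\n")}\n</previous_attempts>`;
}
```

Returning an empty string for an empty history lets the prompt template include the block unconditionally on a first attempt.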
Pattern 3: Supervised Dispatch
The third pattern ties everything together into a single dispatch pipeline. Rather than a bare retry loop, you get a supervised pipeline with verification at every stage:
- Acquire. Lock a pending task using SELECT ... FOR UPDATE SKIP LOCKED so concurrent workers never collide.
- Verify staleness. Run a registered verifier to check if the underlying problem still exists. If it is resolved, cancel the task without calling the LLM.
- Check trust gate. Compute the trust score for this task type. Block if below threshold.
- Build prompt with history. Load the template, inject context and previous attempt outcomes.
- Execute. Call the LLM with appropriate model routing for the task type.
- Handle result. Parse the output, update the task status, and record the outcome for future trust scoring and history injection.
    import {
      acquireTask,
      processTask,
      recoverStaleTasks,
    } from 'agent-framework';

    // Recover any tasks stuck in 'running' from crashed workers
    await recoverStaleTasks();

    // Main dispatch
    const task = await acquireTask();
    if (task) {
      // processTask handles the full pipeline:
      // staleness -> trust gate -> history -> prompt -> execute -> result
      await processTask(task);
    }
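To make the ordering of the stages concrete, here is a rough sketch of what a pipeline function along these lines might look like with its dependencies injected. It is not the framework's `processTask`: `runPipeline`, `PipelineDeps`, and every method on it are hypothetical, and the acquire step is left out because it depends on your database.

```typescript
interface Task {
  id: string;
  taskType: string;
  filePath: string;
}

type Status = "cancelled" | "needs_human_review" | "done" | "failed";

// Every dependency here is a hypothetical stand-in for your own code.
interface PipelineDeps {
  isStillNeeded(task: Task): Promise<boolean>; // staleness verifier
  checkTrustGate(taskType: string): Promise<string | null>; // block reason or null
  loadHistory(task: Task): Promise<string>; // formatted previous attempts
  buildPrompt(task: Task, history: string): Promise<string>;
  callModel(prompt: string): Promise<{ ok: boolean; summary: string }>;
  recordOutcome(task: Task, ok: boolean, summary: string): Promise<void>;
}

async function runPipeline(task: Task, deps: PipelineDeps): Promise<Status> {
  // 1. Staleness: skip the LLM entirely if the problem is already gone.
  if (!(await deps.isStillNeeded(task))) return "cancelled";

  // 2. Trust gate: hand systemically failing task types to a human.
  const blockReason = await deps.checkTrustGate(task.taskType);
  if (blockReason) return "needs_human_review";

  // 3. Prompt with history, then execute.
  const history = await deps.loadHistory(task);
  const prompt = await deps.buildPrompt(task, history);
  const result = await deps.callModel(prompt);

  // 4. Record the outcome so later scoring and history can use it.
  await deps.recordOutcome(task, result.ok, result.summary);
  return result.ok ? "done" : "failed";
}
```

The two cheap checks run before any tokens are spent, which is where most of the savings from supervision would come from.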
Every outcome is recorded. Every dispatch checks history. Every task type earns or loses trust based on results. The system gets better over time, not worse.
Results in Practice
This architecture came from operating a production system handling 300K+ daily LLM API calls. Before trust gating, roughly 35% of LLM spend went to task types with sub-20% success rates. After implementing the three patterns described above:
- Wasted LLM spend dropped by 40%. Trust gating alone stopped the bleed on systemically failing task types.
- Retry success rate increased from 12% to 41%. Outcome history injection meant retries were informed, not random.
- Human escalations became actionable. Instead of "task failed 3 times," engineers got a trust score, a history of what the agent tried, and a clear signal about what the agent could not solve.
Key Takeaway
The retry loop is not a model problem. It is a systems problem. Your LLM is perfectly capable of fixing the issue on the second attempt -- it just needs to know what happened on the first one. Give it memory, gate the lost causes, and supervise the pipeline.
Implementing This Yourself
The three patterns above are framework-agnostic. You can implement trust gating with a SQL query on an outcomes table, outcome history with XML prompt injection, and supervised dispatch with a pipeline function. The core logic is maybe 400 lines of well-tested code.
If you want a production-ready implementation rather than building from scratch, the patterns above are exactly what ships in the Agent Framework pack -- a supervised task dispatcher with trust gating, outcome history, staleness verification, and automatic retry/recovery. It was extracted from the production system described in this article.