Why Your AI Agent Keeps Retrying Failed Tasks (And How to Fix It)

You deploy an autonomous agent. It picks up a task, attempts a fix, and fails. Then it picks up the same task, attempts the exact same fix, and fails again. Third attempt: same approach, same failure. It marks the task as blocked, and you are left debugging something the agent should have solved twenty minutes ago.

This is not an edge case. It is the default behavior of every agent framework that treats dispatch as stateless. The LLM has no access to what it tried before, no mechanism to avoid repeating mistakes, and no signal about whether this category of task is even worth attempting.

The fix is not more retries. It is three architectural patterns: trust gating, outcome history injection, and supervised dispatch.

The Problem: Stateless Retry Is Expensive Failure

Most agent loops look roughly like this: acquire a task from a queue, build a prompt, call the LLM, parse the result. If it fails, increment a retry counter and try again. The retry counter is the only state the system carries between attempts.
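
For concreteness, here is a minimal sketch of such a loop. The names runStatelessTask, buildNaivePrompt, and MAX_RETRIES, along with the shape of the callLLM result, are illustrative assumptions and not any particular framework's API. The thing to notice is that the prompt is rebuilt from the task alone on every attempt.

```javascript
// Hypothetical sketch of a stateless retry loop (the anti-pattern).
// `callLLM` is any async function (prompt) => { ok: boolean } supplied by the caller.
const MAX_RETRIES = 3;

function buildNaivePrompt(task) {
  // The prompt depends only on the task, never on earlier attempts.
  return `Fix the following issue:\n${task.description}`;
}

async function runStatelessTask(task, callLLM) {
  const prompts = [];
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const prompt = buildNaivePrompt(task);
    prompts.push(prompt);
    const result = await callLLM(prompt);
    if (result.ok) {
      return { status: 'done', attempts: attempt, prompts };
    }
    // Nothing about the failure is kept: the next prompt is identical.
  }
  return { status: 'blocked', attempts: MAX_RETRIES, prompts };
}
```

The loop counter is the only value that changes between iterations, so every entry in prompts is the same string.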

This design has three critical flaws:

No attempt memory. The LLM does not know what it tried last time. It cannot avoid a previously failed approach because nobody told it about the previous approach. Each retry is a fresh coin flip with the same bad odds.
No category-level learning. If your agent has failed 8 of its last 10 css_fix tasks, it still picks up the 11th css_fix task at full confidence. There is no mechanism to say "this task type is failing systemically, stop burning tokens on it."
No staleness check. The task might already be resolved by a human or a different process. Without verification, the agent works on a problem that no longer exists.

The cost adds up. Three retries on a task that was never going to succeed, multiplied across dozens of task types and hundreds of daily dispatches, can easily waste 30-40% of your LLM spend on guaranteed failures.

Pattern 1: Trust Gating

Trust gating is a pre-dispatch check. Before the agent spends a single token on a task, you compute a trust score for that task type based on historical outcomes. If the score is below a threshold, the task is blocked with a "needs human review" status instead of being dispatched.

The score blends the overall success rate with a recency-weighted rate, so recent results count for more than older ones:

Trust Score Calculation

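// Score = weighted blend of overall and recent success rates
// Window: last 50 outcomes for this task type
// Minimum sample: 10 outcomes (no gating before that)
const TRUST_GATE_THRESHOLD = 0.15; // default threshold

// Illustrative wrapper (hypothetical name). Both rates are fractions in [0, 1].
// Returns a block reason, or null when the task type may be dispatched.
function trustGateReason(taskType, overallSuccessRate, recencyWeightedRate) {
  const score = overallSuccessRate * 0.6
              + recencyWeightedRate * 0.4;

  if (score < TRUST_GATE_THRESHOLD) {
    return `Blocked: ${taskType} trust score ${score.toFixed(2)} below ${TRUST_GATE_THRESHOLD}`;
  }
  return null;
}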

The default threshold is 0.15. That sounds low, but remember: this catches catastrophic failure patterns, not marginal ones. Because the score is a blend of two success rates, a task type scoring below 0.15 is failing roughly 85% of the time or more. There is no point dispatching more of those.
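
The two rates in the formula have to come from somewhere. The following is a minimal sketch of one way to derive them from a list of outcomes. The window size, minimum sample, 0.6/0.4 blend, and threshold come from the description above; the linear recency weighting and the names computeTrustScore and isTypeBlocked are assumptions made for illustration, since the exact weighting scheme is not specified here.

```javascript
// Hypothetical sketch: compute a trust score from a list of outcomes.
// `outcomes` is an array of booleans (true = success), ordered oldest to newest.
const WINDOW_SIZE = 50;       // last 50 outcomes for the task type
const MIN_SAMPLES = 10;       // no gating below 10 outcomes
const SCORE_THRESHOLD = 0.15; // default threshold

function computeTrustScore(outcomes) {
  const recent = outcomes.slice(-WINDOW_SIZE);
  if (recent.length < MIN_SAMPLES) return null; // not enough data to gate

  const successes = recent.filter(Boolean).length;
  const overallRate = successes / recent.length;

  // Assumed weighting: outcome i (0 = oldest) gets weight i + 1,
  // so the newest outcome counts the most.
  let weightedSuccesses = 0;
  let totalWeight = 0;
  recent.forEach((ok, i) => {
    const weight = i + 1;
    totalWeight += weight;
    if (ok) weightedSuccesses += weight;
  });
  const recencyRate = weightedSuccesses / totalWeight;

  return overallRate * 0.6 + recencyRate * 0.4;
}

function isTypeBlocked(outcomes) {
  const score = computeTrustScore(outcomes);
  return score !== null && score < SCORE_THRESHOLD;
}
```

Under this weighting, two task types with the same 20% overall success rate can land on opposite sides of the gate. Ten outcomes with the two successes at the end score about 0.26 and pass, while ten outcomes with the two successes at the start score about 0.14 and are blocked.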

Two important refinements prevent the gate from becoming a permanent block:

Exempt types. Self-repair task types like triage_fix, security_fix, and service_restart bypass gating entirely. You never want the system to refuse to fix itself.
Exploration probes. After 5 consecutive blocks on a task type, one task is allowed through to re-evaluate. This prevents permanent lockout when underlying conditions change (new model, better prompts, fixed data).
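
A minimal sketch of how these two refinements might sit on top of the raw score check follows. The exempt type names and the five-block probe rule come from the text above; the function name gateDecision, the in-memory counter, and the three return values are assumptions for illustration. A real system would presumably persist the counters.

```javascript
// Hypothetical sketch of exempt types and exploration probes.
const EXEMPT_TYPES = new Set(['triage_fix', 'security_fix', 'service_restart']);
const PROBE_AFTER_BLOCKS = 5; // allow one task through after 5 consecutive blocks
const GATE_THRESHOLD = 0.15;

// Consecutive-block counters per task type (kept in memory for this sketch).
const consecutiveBlocks = new Map();

// `score` is the trust score for the task type, or null when there is
// not yet enough history to gate. Returns 'dispatch', 'probe', or 'block'.
function gateDecision(taskType, score) {
  if (EXEMPT_TYPES.has(taskType)) return 'dispatch';

  if (score === null || score >= GATE_THRESHOLD) {
    consecutiveBlocks.set(taskType, 0);
    return 'dispatch';
  }

  const blocks = consecutiveBlocks.get(taskType) ?? 0;
  if (blocks >= PROBE_AFTER_BLOCKS) {
    consecutiveBlocks.set(taskType, 0); // reset after letting one task through
    return 'probe';
  }

  consecutiveBlocks.set(taskType, blocks + 1);
  return 'block';
}
```

Returning a distinct 'probe' value lets the caller log exploration dispatches separately from ordinary ones, which makes it easier to see whether a probe actually changed the score.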

Configuring the Trust Gate

import { checkTrustGate, TRUST_GATE_THRESHOLD } from 'agent-framework';

// Before dispatching any task. The wrapper name is illustrative, and
// updateTaskStatus is assumed to be your own persistence helper.
async function dispatch(task) {
  const blockReason = await checkTrustGate(task.task_type);

  if (blockReason) {
    // Mark task as needs_human_review, not failed
    await updateTaskStatus(task.id, 'needs_human_review', blockReason);
    return;
  }

  // Task type is above threshold -- proceed with dispatch
}

Pattern 2: Outcome History Injection

The second pattern gives the LLM memory of its own previous attempts. Before building the prompt, you query the outcome history for this specific file and task type, then inject the results as structured XML that the model can reason about.

Outcome History XML

<previous_attempts>
  <attempt n="1" outcome="failure" at="2026-05-18T14:22:00Z">
    Tried adding null check in handleResponse() -- wrong location,
    the null originates in parsePayload() upstream
  </attempt>
  <attempt n="2" outcome="failure" at="2026-05-19T09:10:00Z">
    Modified parsePayload() but missed the async path where
    response.body can be undefined before stream completes
  </attempt>
</previous_attempts>

This gives the LLM explicit knowledge of what failed and why. Instead of a blind retry, the third attempt can reason: "Attempts 1 and 2 both targeted the wrong stage of the pipeline. The null appears during streaming, so I need to guard the async path in parsePayload() specifically."

The implementation is a query plus a formatter:

Injecting History Into the Prompt

import {
  getRecentOutcomes,
  formatOutcomesBlock,
  buildPrompt,
} from 'agent-framework';

// Fetch last 5 outcomes for this file + task type
const outcomes = await getRecentOutcomes(
  task.task_type,
  ctx.file_path,
  5
);

// Format as XML block (empty string if no history)
const historyBlock = formatOutcomesBlock(outcomes);

// buildPrompt injects historyBlock into the template
const prompt = await buildPrompt(task, ctx, historyBlock);

The prompt template receives the XML block and includes it in its context section. The LLM sees what happened before, what the failure modes were, and can steer around them.
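
If you are building the formatter yourself, it can be quite small. The sketch below is one possible version and is not the framework's formatOutcomesBlock. The function names and the record shape ({ outcome, at, summary }) are assumptions chosen to match the XML shown earlier. Escaping matters because attempt summaries often contain code fragments with angle brackets or ampersands.

```javascript
// Hypothetical formatter for the <previous_attempts> block.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// `outcomes` is an array of { outcome, at, summary }, ordered oldest to newest.
function formatAttemptsXml(outcomes) {
  if (outcomes.length === 0) return ''; // no history: inject nothing

  const attempts = outcomes.map((o, i) => [
    `  <attempt n="${i + 1}" outcome="${escapeXml(o.outcome)}" at="${escapeXml(o.at)}">`,
    `    ${escapeXml(o.summary)}`,
    '  </attempt>',
  ].join('\n'));

  return ['<previous_attempts>', ...attempts, '</previous_attempts>'].join('\n');
}
```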

Pattern 3: Supervised Dispatch

The third pattern ties everything together into a single dispatch pipeline. Rather than a bare retry loop, you get a supervised pipeline with verification at every stage:

Acquire. Lock a pending task using SELECT ... FOR UPDATE SKIP LOCKED so concurrent workers never collide.
Verify staleness. Run a registered verifier to check if the underlying problem still exists. If it is resolved, cancel the task without calling the LLM.
Check trust gate. Compute the trust score for this task type. Block if below threshold.
Build prompt with history. Load the template, inject context and previous attempt outcomes.
Execute. Call the LLM with appropriate model routing for the task type.
Handle result. Parse the output, update the task status, and record the outcome for future trust scoring and history injection.
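
The acquire step is the one most tied to the database, so it is worth a sketch. The following assumes PostgreSQL and a hypothetical tasks table with id, status, created_at, and started_at columns; none of these names come from the framework. The db argument is any client whose query(sql) resolves to an object with a rows array, which is the result shape node-postgres uses.

```javascript
// Hypothetical acquire step using FOR UPDATE SKIP LOCKED (PostgreSQL).
// Table and column names are assumptions for illustration.
const ACQUIRE_SQL = `
  UPDATE tasks
  SET status = 'running', started_at = now()
  WHERE id = (
    SELECT id FROM tasks
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING *`;

async function acquirePendingTask(db) {
  const { rows } = await db.query(ACQUIRE_SQL);
  // No row comes back when the queue is empty or every pending row
  // is currently locked by another worker.
  return rows[0] ?? null;
}
```

Because locked rows are skipped instead of waited on, two workers running this at the same moment pick different tasks or get null, and neither blocks the other.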

Full Dispatch Loop

import {
  acquireTask,
  processTask,
  recoverStaleTasks,
} from 'agent-framework';

// Recover any tasks stuck in 'running' from crashed workers
await recoverStaleTasks();

// Main dispatch: acquire at most one pending task per pass
const task = await acquireTask();

if (task) {
  // processTask handles the full pipeline:
  // staleness -> trust gate -> history -> prompt -> execute -> result
  await processTask(task);
}

Every outcome is recorded. Every dispatch checks history. Every task type earns or loses trust based on results. The system gets better over time, not worse.
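
To make the control flow concrete, the stages that run after a task is acquired can be sketched as one function with every dependency injected. This is not the framework's processTask. The name superviseTask, the deps object, and the status strings other than needs_human_review are assumptions for illustration.

```javascript
// Hypothetical sketch of the supervised pipeline after a task is acquired.
// Every dependency is injected so the ordering of the stages is visible.
async function superviseTask(task, deps) {
  // Verify staleness: skip the LLM entirely if the problem is gone.
  if (!(await deps.problemStillExists(task))) {
    await deps.setStatus(task, 'cancelled', 'already resolved');
    return 'cancelled';
  }

  // Check trust gate: a non-empty reason means the task type is blocked.
  const blockReason = await deps.checkGate(task.task_type);
  if (blockReason) {
    await deps.setStatus(task, 'needs_human_review', blockReason);
    return 'needs_human_review';
  }

  // Build prompt with history, then execute.
  const history = await deps.loadHistory(task);
  const result = await deps.callLLM(deps.buildPrompt(task, history));

  // Handle result: record the outcome so later dispatches can use it.
  const status = result.ok ? 'done' : 'failed';
  await deps.recordOutcome(task, status, result.summary);
  await deps.setStatus(task, status, result.summary);
  return status;
}
```

The two early returns are what save tokens: a stale or gated task never reaches callLLM. Injecting the dependencies also means the ordering can be unit tested with stubs, without a database or a model.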

Results in Practice

This architecture came from operating a production system handling 300K+ daily LLM API calls. Before trust gating, roughly 35% of LLM spend went to task types with sub-20% success rates. After implementing the three patterns described above:

Wasted LLM spend dropped by 40%. Trust gating alone stopped the bleed on systemically failing task types.
Retry success rate increased from 12% to 41%. Outcome history injection meant retries were informed, not random.
Human escalations became actionable. Instead of "task failed 3 times," engineers got a trust score, a history of what the agent tried, and a clear signal about what the agent could not solve.

Key Takeaway

The retry loop is not a model problem. It is a systems problem. Your LLM is often capable of fixing the issue on the second attempt -- it just needs to know what happened on the first one. Give it memory, gate the lost causes, and supervise the pipeline.

Implementing This Yourself

The three patterns above are framework-agnostic. You can implement trust gating with a SQL query on an outcomes table, outcome history with XML prompt injection, and supervised dispatch with a pipeline function. The core logic is maybe 400 lines of well-tested code.

If you want a production-ready implementation rather than building from scratch, the patterns above are exactly what ships in the Agent Framework pack -- a supervised task dispatcher with trust gating, outcome history, staleness verification, and automatic retry/recovery. It was extracted from the production system described in this article.

Agent Framework

Supervised agent dispatch with trust gating and outcome history. Not a toy framework -- this is the code running a real multi-project platform. Pluggable database adapters, configurable thresholds, exploration probes, staleness verifiers, structured logging, and a full API for extending task types and verifiers.

199 tests passing

$49 one-time

Node.js standalone pack

The Full Stack bundle includes all 9 packs (2,015 total tests) for $149.