LLM Output Quality: Why You Need Two Layers of Evaluation

Your LLM output looks fine. The JSON parses. The tone feels right. You ship it to production and move on to the next feature.

Then a customer gets a proposal addressed to "Office" instead of a human name. A cold email includes john.doe@example.com as the contact. A sales message leaks your pricing in the first touch. An SMS comes in at 280 characters when the channel limit is 160.

These are not hallucinations in the traditional sense. The LLM followed instructions and produced structurally valid output. The problem is that structurally valid is not the same as production-ready, and the gap between those two things is where your reputation lives.

The Problem: LLM Output Fails Silently

Most teams treat LLM quality as a binary: either the output parsed or it did not. Maybe you check that the JSON schema matches. Maybe you eyeball a few responses in your logs. But LLM output has a long tail of subtle failures that only surface when they reach a real user.

Here are the categories we have seen in production across 300K+ daily LLM API calls:

Demo data leaking through — example.com, Acme Corp, 555-5555, lorem ipsum appearing in customer-facing output
Banned phrases — "free audit", "I hope this finds you well", "kind regards" in outreach that is supposed to be conversational
Pricing in early-funnel messages — the LLM helpfully includes your rate card in a cold first-touch message
Channel violations — SMS over 160 characters, email subject lines over 72 characters, DMs over 100 words
Role labels as names — "Dear Sales" or "Hi Reception" instead of an actual contact name
Spam trigger words — "guaranteed", "act now", "limited time" that tank deliverability
Generic proposals — "I noticed your website could be improved" instead of something specific to the business

No single check catches all of these. You need two layers.

Layer 1: Fast Programmatic Checks

Layer 1 is synchronous, zero-dependency, and runs on every scored response. These are regex patterns, keyword matching, structural validation, and domain-specific rules. They are deterministic, fast, and catch roughly 80% of issues instantly.

Here is a simplified programmatic scorer for outbound text:

Layer 1 — Programmatic Scorer

const BANNED_PHRASES = [
  'free audit', 'free consultation',
  'i hope this finds you well',
  'kind regards', 'best regards',
];

const DEMO_PATTERNS = [
  /\bexample\.com\b/i, /\bjohn\.doe\b/i,
  /\bacme\s+(corp|inc|ltd)\b/i,
  /\blorem\s+ipsum\b/i, /\b555-\d{4}\b/,
];

const PRICING_PATTERNS = [
  /\$\d+/, /per month/i, /monthly fee/i,
];

function scoreOutboundText(text, opts = {}) {
  const flags = [];

  // Banned phrases
  const lower = text.toLowerCase();
  const banned = BANNED_PHRASES.find(p => lower.includes(p));
  if (banned) flags.push(`banned_phrase:${banned}`);

  // Demo / placeholder data
  const demo = DEMO_PATTERNS.find(p => p.test(text));
  if (demo) flags.push('demo_data:detected');

  // Pricing leak in early funnel
  if (opts.funnelStep <= 3) {
    const pricing = PRICING_PATTERNS.find(p => p.test(text));
    if (pricing) flags.push('pricing_in_early_step');
  }

  // Channel-specific length
  if (opts.channel === 'sms' && text.length > 160)
    flags.push(`sms_over_160:${text.length}`);

  return flags;
}

This runs in microseconds. No LLM call, no network request, no latency. You wire it into your callLLM() response path and it fires every time (or on a configurable sample rate for high-volume workloads).
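One way to get the configurable sample rate is to wrap the scorer rather than change it. The sketch below is an assumption about how that wiring could look, not the product's actual code: withSampleRate, smsLengthScorer, and the scorer_error flag are hypothetical names, and the random source is injectable only so the behaviour can be tested.

```javascript
// Hypothetical helper: wraps any Layer 1 scorer with a sample rate.
// `random` is injectable so sampling can be made deterministic in tests.
function withSampleRate(scorer, sampleRate, random = Math.random) {
  return function sampledScorer(text, opts = {}) {
    // Calls that fall outside the sample are skipped entirely.
    if (random() >= sampleRate) return null;
    try {
      return scorer(text, opts);
    } catch (err) {
      // A scorer bug should never break the response path.
      return ['scorer_error'];
    }
  };
}

// Stand-in scorer for the example: flags SMS text over 160 characters.
const smsLengthScorer = (text, opts) =>
  opts.channel === 'sms' && text.length > 160
    ? [`sms_over_160:${text.length}`]
    : [];

const scoreAlways = withSampleRate(smsLengthScorer, 1.0); // high-value workload
const scoreSome = withSampleRate(smsLengthScorer, 0.1);   // commodity workload
```

A skipped call returns null rather than an empty array, so downstream logging can tell "not scored" apart from "scored and clean".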

The key insight: each workload gets its own scorer. A proposal scorer checks for role labels used as names, unresolved template variables, and broken spintax prefixes. An enrichment scorer cross-validates extracted emails against the source HTML. A classification scorer validates enum values and checks for missing fields. One size does not fit all.
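As an illustration of a workload-specific scorer, here is a minimal sketch of two of the proposal checks mentioned above: role labels used as names and unresolved template variables. The ROLE_LABELS list, the greeting pattern, and the {{variable}} placeholder syntax are all assumptions chosen for the example; adapt them to whatever your templates and data actually contain.

```javascript
// Hypothetical role labels; extend with what shows up in your own data.
const ROLE_LABELS = ['office', 'sales', 'reception', 'admin', 'info', 'support'];

// Matches a greeting such as "Dear Sales" or "Hi Reception" at the start of a line.
const GREETING = /^\s*(?:dear|hi|hello)\s+([A-Za-z]+)\b/im;

// Matches unresolved placeholders such as {{first_name}} or {{ company.name }}.
const TEMPLATE_VAR = /\{\{\s*[\w.]+\s*\}\}/;

function scoreProposalText(text) {
  const flags = [];

  // Role label used where a contact name should be
  const greeting = text.match(GREETING);
  if (greeting && ROLE_LABELS.includes(greeting[1].toLowerCase())) {
    flags.push(`role_label_as_name:${greeting[1]}`);
  }

  // Template variable that was never filled in
  const unresolved = text.match(TEMPLATE_VAR);
  if (unresolved) flags.push(`unresolved_template_var:${unresolved[0]}`);

  return flags;
}
```

It returns the same kind of flag array as scoreOutboundText, so both scorers can feed the same logging and alerting path.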

Layer 2: LLM-as-Judge

Layer 1 catches everything a regex can catch. But some quality issues require comprehension. Is the proposal actually personalized, or did the LLM just insert a name and write something generic? Does the follow-up add new value, or is it rehashing the same points? Does the reply actually address what the customer asked?

This is where Layer 2 comes in. You build a prompt, send the original output to a cheap model for evaluation, and parse a YES/NO verdict.

Layer 2 — buildJudgePrompt()

function buildJudgePrompt(workload, output, context) {
  const questions = JUDGE_QUESTIONS[workload];
  if (!questions) return null;

  const parts = [
    `You are a quality evaluator.`,
    `Below is the output of an LLM completing`,
    `a "${workload}" task.`,
    `Answer each question YES or NO only.`,
    `Format: A1:YES, A2:NO, etc.`,
    '',
  ];

  // Include input context for relevance check
  if (context?.messages?.length) {
    parts.push(
      '=== INPUT ===',
      context.messages.map(
        m => `[${m.role}]: ${m.content}`
      ).join('\n').slice(0, 1500),
      '=== END INPUT ===',
    );
  }

  parts.push(
    '=== LLM OUTPUT ===',
    output.slice(0, 3000),
    '=== END OUTPUT ===',
    '',
    'Questions:',
    questions.map((q, i) =>
      `Q${i + 1}: ${q}`
    ).join('\n'),
  );

  return parts.join('\n');
}

The judge prompt is kept under 6,500 characters total, which means it runs cheaply on free-tier providers (Cerebras, Groq) for public data or Haiku-class models for private data. You are not sending output to GPT-4 for grading, because that would defeat the economics.
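buildJudgePrompt() reads its questions from a JUDGE_QUESTIONS table that the snippet does not show. The sketch below gives a hypothetical version of that table and a rough worst-case size check against the 6,500-character budget. The question wording and the 300-character allowance for the fixed scaffolding lines are assumptions; only the 1,500 and 3,000 caps come from the slices in buildJudgePrompt().

```javascript
// Hypothetical judge questions; the real per-workload questions are not shown in this post.
const JUDGE_QUESTIONS = {
  proposal: [
    'Does the output reference a detail specific to this business?',
    'Is the recipient addressed by a personal name rather than a role?',
  ],
  reply: [
    'Does the reply address the specific question the customer asked?',
  ],
};

const INPUT_CAP = 1500;      // matches .slice(0, 1500) in buildJudgePrompt
const OUTPUT_CAP = 3000;     // matches .slice(0, 3000) in buildJudgePrompt
const PROMPT_BUDGET = 6500;  // total character budget quoted above

// Upper bound on prompt size: both caps fully used, plus a rough allowance
// for the fixed instruction and delimiter lines, plus the question block.
function worstCasePromptChars(questions, scaffoldingChars = 300) {
  const questionChars = questions
    .map((q, i) => `Q${i + 1}: ${q}`)
    .join('\n').length;
  return INPUT_CAP + OUTPUT_CAP + scaffoldingChars + questionChars;
}

// True when a workload's questions still fit inside the budget.
function fitsPromptBudget(workload) {
  const questions = JUDGE_QUESTIONS[workload];
  return Boolean(questions) && worstCasePromptChars(questions) <= PROMPT_BUDGET;
}
```

Under these assumptions the caps and scaffolding use 4,800 characters, which leaves about 1,700 characters for the questions themselves.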

Parsing the response is equally straightforward:

Layer 2 — parseJudgeScore()

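function parseJudgeScore(judgeOutput) {
  const text = String(judgeOutput ?? '');

  // Labeled answers can share one line ("A1:YES, A2:NO"), which is the
  // format the prompt asks for, so scan the whole output for them.
  let answers = [...text.matchAll(/\bA\d+\s*:\s*(YES|NO)\b/gi)]
    .map(m => m[1].toUpperCase() === 'YES');

  // Fall back to bare YES/NO answers, one per line.
  if (answers.length === 0) {
    answers = text.split('\n')
      .map(l => l.trim().match(/^(YES|NO)\b/i))
      .filter(Boolean)
      .map(m => m[1].toUpperCase() === 'YES');
  }

  // Nothing recognizable: return a neutral score and flag it.
  if (answers.length === 0) {
    return { score: 0.5, flags: ['judge_unparseable'] };
  }

  const passing = answers.filter(Boolean).length;
  const flags = answers
    .map((a, i) => a ? null : `judge_q${i+1}_fail`)
    .filter(Boolean);

  return { score: passing / answers.length, flags };
}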

Each workload has its own set of judge questions. Proposals get questions about personalization depth and whether the prospect is a legitimate small-business target. Reply scorers check whether the response addresses the specific customer enquiry. Outreach scorers verify that follow-ups add new value instead of repeating earlier points.

Why Two Layers, Not One?

Either layer alone is insufficient:

Layer 1 alone catches structural and pattern-matching issues but misses semantic quality. It cannot tell if a proposal is genuinely personalized or just cleverly templated.
Layer 2 alone is too slow and expensive to run on every call. At 300K daily API calls, even a cheap judge model at 0.1 cents per call comes to about $300 per day, or roughly $9,000 per month. LLM judges also have their own failure modes: they sometimes say YES to obviously bad output.

The two-layer approach gives you the best of both worlds. Layer 1 runs on 100% of high-value workloads and 5-20% of commodity workloads. Layer 2 samples a fraction of Layer 1 passes for deeper inspection. Together, they create a quality funnel where the cheap, fast check handles volume and the expensive, smart check handles nuance.
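The economics of the funnel are easy to sanity-check with a few multiplications. The sketch below is a back-of-the-envelope estimator, not a billing model. The function name and parameters are hypothetical, and the 80% Layer 1 pass rate and 10% judge sample rate in the usage example are illustrative assumptions.

```javascript
// Hypothetical estimator: how many calls reach the judge, and what that costs per day.
function estimateDailyJudgeCost({
  dailyCalls,           // total LLM calls per day
  layer1Rate,           // fraction of calls scored by Layer 1 (0..1)
  layer1PassRate,       // fraction of scored calls that pass Layer 1 (0..1)
  judgeSampleRate,      // fraction of Layer 1 passes sent to the judge (0..1)
  costPerJudgeCallUsd,  // judge cost per call in dollars
}) {
  const scoredCalls = dailyCalls * layer1Rate;
  const passedCalls = scoredCalls * layer1PassRate;
  const judgedCalls = passedCalls * judgeSampleRate;
  return { judgedCalls, costUsd: judgedCalls * costPerJudgeCallUsd };
}

// Judging every call at 0.1 cents each: roughly $300 per day.
const judgeEverything = estimateDailyJudgeCost({
  dailyCalls: 300000, layer1Rate: 1, layer1PassRate: 1,
  judgeSampleRate: 1, costPerJudgeCallUsd: 0.001,
});

// Judging 10% of Layer 1 passes (assumed 80% pass rate): roughly $24 per day.
const judgeSample = estimateDailyJudgeCost({
  dailyCalls: 300000, layer1Rate: 1, layer1PassRate: 0.8,
  judgeSampleRate: 0.1, costPerJudgeCallUsd: 0.001,
});
```

Plugging in your own pass rate and sample rate shows quickly how far the judge budget stretches before you commit to a sampling policy.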

16 Workload-Specific Scorers

Generic quality checks are a starting point, but real coverage requires workload awareness. The scorers we run in production include:

proposals — role labels as names, unresolved template variables, pricing leaks, subject line length, grade mismatches, broken spintax
cold email / outreach — spam trigger words, meeting requests in first touch, word count limits, channel-specific length
SMS — 160-character limit, US/CA STOP compliance, opt-out language detection, all-caps words
classification — valid intent enums, sentiment validation, missing field detection
enrichment — email format validation, cross-validation against source HTML, demo data detection in extracted contacts
audit reports — jargon detection (CTA, bounce rate, hero section), mockup change validation, summary completeness
video scripts — scene count, timecode validation, overlay word limits, URL/pricing/phone detection
inbound replies — skip reason validation, reply length, funnel-stage pricing rules, readability scoring

Each scorer returns a { score, flags, scorer } result. The score is a 0-1 float. The flags array tells you exactly what went wrong: banned_phrase:free audit, sms_over_160:187, role_label_as_name:Reception. No ambiguity, no interpretation required.
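To make the result shape concrete, here is a minimal sketch of an SMS scorer that returns { score, flags, scorer }. How flags map to a score is not specified in this post, so the rule of 0.25 off per flag is an assumption, as are the missing_opt_out and all_caps flag names and the use of a plain STOP keyword check as a stand-in for opt-out detection.

```javascript
// Assumed scoring rule: each flag costs 0.25, floored at 0.
const PENALTY_PER_FLAG = 0.25;

function scoreSms(text) {
  const flags = [];

  // Channel limit
  if (text.length > 160) flags.push(`sms_over_160:${text.length}`);

  // Opt-out language: look for a STOP instruction (assumed wording)
  if (!/\bstop\b/i.test(text)) flags.push('missing_opt_out');

  // All-caps words of three or more letters, ignoring the STOP keyword itself
  const shouting = (text.match(/\b[A-Z]{3,}\b/g) || [])
    .filter(word => word !== 'STOP');
  if (shouting.length) flags.push(`all_caps:${shouting[0]}`);

  const score = Math.max(0, 1 - PENALTY_PER_FLAG * flags.length);
  return { score, flags, scorer: 'sms' };
}
```

A clean message scores 1 with an empty flags array; each problem lowers the score and names itself in flags, which is what makes the result easy to aggregate and to debug.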

Implementing This in Your Stack

The pattern is the same regardless of your LLM provider or framework:

Define your workload types — each call to your LLM should carry a workload identifier
Write Layer 1 scorers per workload — start with the output you have seen fail, then expand
Define judge questions per workload — only semantic questions that regex cannot answer
Wire Layer 1 into your response path — fire-and-forget async, with configurable sample rates
Wire Layer 2 after Layer 1 passes — sample a fraction, route to a cheap model, log the results
Alert on score drops — aggregate scores by workload and flag regressions after prompt changes or model updates
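The last step, alerting on score drops, can start as a comparison of mean scores per workload before and after a change. The sketch below assumes you log records of the form { workload, score }. The function names and the 0.1 drop threshold are hypothetical, and a production version would likely also want minimum sample sizes.

```javascript
// Groups { workload, score } records and returns the mean score per workload.
function meanScoreByWorkload(records) {
  const totals = new Map();
  for (const { workload, score } of records) {
    const t = totals.get(workload) || { sum: 0, count: 0 };
    t.sum += score;
    t.count += 1;
    totals.set(workload, t);
  }
  const means = {};
  for (const [workload, { sum, count }] of totals) {
    means[workload] = sum / count;
  }
  return means;
}

// Returns workloads whose mean score dropped by more than `maxDrop` versus the baseline.
function findRegressions(baselineMeans, currentMeans, maxDrop = 0.1) {
  return Object.keys(baselineMeans).filter(workload =>
    workload in currentMeans &&
    baselineMeans[workload] - currentMeans[workload] > maxDrop);
}

// Example: scores logged before and after a prompt change.
const before = meanScoreByWorkload([
  { workload: 'proposal', score: 1 },
  { workload: 'proposal', score: 1 },
  { workload: 'sms', score: 1 },
]);
const after = meanScoreByWorkload([
  { workload: 'proposal', score: 0.5 },
  { workload: 'proposal', score: 0.5 },
  { workload: 'sms', score: 1 },
]);
const regressed = findRegressions(before, after);
```

Here regressed contains only the proposal workload, which points you at the prompt or model change that affected it.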

The biggest mistake teams make is building a generic quality layer and calling it done. Workload-specific scoring is what turns quality monitoring from a checkbox into a safety net.

LLM Quality Monitor

Everything described in this post, production-ready. 16 workload-specific scorers, Layer 1 + Layer 2 infrastructure, configurable sample rates, zero dependencies.

$39 one-time

16 programmatic scorers (Layer 1)
LLM-as-judge infrastructure (Layer 2)
Per-workload judge questions
Configurable sample rates
361 tests, zero runtime dependencies
Full source code, MIT licensed

Get LLM Quality Monitor ($39)
LLM Ops Toolkit bundle ($69)

The LLM Ops Toolkit bundles Quality Monitor + Cost Router + Format Bridge at a discount.