The Problem: One Provider, All Your Eggs
If your application makes any significant number of LLM calls, you have almost certainly experienced the moment: your primary provider returns a 429, your queue backs up, and your users wait. Or you open your monthly bill and realize that one runaway summarization task burned through your entire budget in a weekend.
The single-provider approach has three failure modes that compound as you scale:
- Availability risk. When your one provider goes down or rate-limits you, every LLM-dependent feature stops working.
- Cost inefficiency. You pay per-token for workloads that could run on free or already-paid-for capacity (the Cerebras free tier, the Gemini free tier, the Claude CLI under a Max subscription) if you had the routing logic to try them first.
- PII exposure. Not all providers are equal. Some train on inputs; others operate in jurisdictions with different data sovereignty rules. Without enforcement at the routing layer, sensitive data flows wherever the code sends it.
The fix is not to pick a "better" provider. It is to route intelligently based on the workload, the cost, and the data classification of each call.
The Architecture: Workload-Based Routing
The core idea is simple: every LLM call gets tagged with a workload identifier. That identifier maps to metadata -- task type, quality requirement, PII classification, and token budget. The router uses that metadata to pick the right provider at runtime, not at compile time.
Here is what a workload registry looks like in code:
import { registerWorkload, TASK, QUALITY, PII } from 'llm-cost-router';

// High-value task: needs quality, contains private data
registerWorkload('audit_report', {
  task: TASK.GENERATE,
  quality: QUALITY.HIGH,
  pii: PII.PRIVATE,
  maxTokens: 8000,
});

// Low-stakes task: can use cheap/free models
registerWorkload('content_summary', {
  task: TASK.SUMMARISE,
  quality: QUALITY.LOW,
  pii: PII.PUBLIC,
  maxTokens: 2000,
});
The key insight is that not every LLM call deserves the same model. A content summary does not need GPT-4 or Claude Opus. A one-line classification definitely does not. By declaring the workload upfront, you let the router make cost-optimal decisions automatically.
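As a rough illustration of how declared workload metadata can drive a cost-optimal choice, here is a minimal sketch. It is not the library's implementation: the catalog shape, the `maxQuality` and `costPerMTok` fields, the provider names, and the prices are all hypothetical placeholders.

```javascript
// Hypothetical provider catalog. Names, quality ceilings, and prices are
// illustrative placeholders, not real pricing or the llm-cost-router schema.
const QUALITY_RANK = { low: 0, medium: 1, high: 2 };

const providers = [
  { name: 'free-small', maxQuality: 'low', costPerMTok: 0 },
  { name: 'paid-mid', maxQuality: 'medium', costPerMTok: 1 },
  { name: 'paid-large', maxQuality: 'high', costPerMTok: 15 },
];

// Return the cheapest provider whose quality ceiling meets the workload's need.
function pickProvider(workload, catalog = providers) {
  const needed = QUALITY_RANK[workload.quality];
  const viable = catalog
    .filter((p) => QUALITY_RANK[p.maxQuality] >= needed) // filter() copies, so sort() does not mutate the catalog
    .sort((a, b) => a.costPerMTok - b.costPerMTok);
  if (viable.length === 0) {
    throw new Error(`No provider satisfies quality "${workload.quality}"`);
  }
  return viable[0];
}

console.log(pickProvider({ quality: 'low' }).name);  // free-small
console.log(pickProvider({ quality: 'high' }).name); // paid-large
```

An unknown quality level yields no viable provider and throws, so a typo in a workload declaration surfaces as an error instead of silently selecting an expensive model.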
Free-Tier-First: Stop Paying for Calls That Could Be Free
Several providers now offer genuinely useful free tiers. Cerebras gives you access to Qwen 3 235B with sub-400ms latency. Google Gemini has a free tier for moderate traffic. If you run Claude Max, you already have a subscription that covers CLI calls at no extra per-token cost, within the plan's usage limits.
The strategy is straightforward: before calling a paid API, try the free options. If a free provider succeeds, you never hit the paid endpoint. If it returns a rate-limit error (429), fall through to the next option.
callLLM({ workload: 'content_summary' }) tries each provider in order:
- [1] Cerebras (free tier, Qwen 3 235B): on success, return the result; on 429, go to [2].
- [2] Claude CLI (Max sub): on success, return the result; on 429, go to [3].
- [3] Anthropic API (paid): on success, return the result; on 429/503, go to [4].
- [4] OpenRouter (fallback): on success, return the result.
In practice, this moves a large share of traffic off paid endpoints. On a production system handling 300K+ daily LLM calls, free-tier-first routing cuts costs by 30-50%, depending on the workload mix. The calls that do hit paid providers are the ones that actually need them -- high-quality generation, private data that requires specific provider guarantees, or peak traffic that exhausts free quotas.
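The fall-through chain described above can be sketched in a few lines. This is a minimal illustration, not the library's code: `ProviderUnavailable`, `callWithFallback`, and the stub adapters are hypothetical names, and the sketch assumes each adapter throws `ProviderUnavailable` when it receives a 429 or 503.

```javascript
// Hypothetical error type: adapters are assumed to throw it on HTTP 429/503.
class ProviderUnavailable extends Error {
  constructor(provider, status) {
    super(`${provider} responded with ${status}`);
    this.name = 'ProviderUnavailable';
    this.status = status;
  }
}

// Try each provider in order. Fall through only on ProviderUnavailable;
// any other error (bad request, auth failure) is rethrown immediately.
async function callWithFallback(chain, prompt) {
  const attempts = [];
  for (const provider of chain) {
    try {
      const text = await provider.call(prompt);
      return { provider: provider.name, text, attempts };
    } catch (err) {
      if (!(err instanceof ProviderUnavailable)) throw err;
      attempts.push({ provider: provider.name, status: err.status });
    }
  }
  throw new Error(`All providers unavailable: ${JSON.stringify(attempts)}`);
}

// Stub adapters standing in for real HTTP clients (free options first).
const chain = [
  { name: 'cerebras-free', call: async () => { throw new ProviderUnavailable('cerebras-free', 429); } },
  { name: 'claude-cli', call: async () => { throw new ProviderUnavailable('claude-cli', 429); } },
  { name: 'anthropic-api', call: async (prompt) => `summary of: ${prompt}` },
  { name: 'openrouter', call: async (prompt) => `summary of: ${prompt}` },
];

callWithFallback(chain, 'quarterly report').then((result) => {
  console.log(result.provider); // anthropic-api
});
```

Rethrowing anything other than `ProviderUnavailable` is a deliberate choice in this sketch: a malformed request would fail on every provider, so falling through would only multiply the failure.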
Budget Enforcement: Preventing Runaway Costs
Even with free-tier-first routing, you need hard limits. A single misconfigured loop can generate thousands of LLM calls in minutes. Budget enforcement operates at the workload level with daily token caps:
import { enforceBudget, recordTokens, getBudgetStatus } from 'llm-cost-router';

// Before every call: throws LLMBudgetExceeded if over cap
await enforceBudget('content_summary');

// After a successful call: track usage
recordTokens('content_summary', inputTokens, outputTokens);

// Dashboard: check all workloads
const statuses = await getBudgetStatus();
// [{ workload: 'content_summary', cap: 5000000,
//    tokensToday: 1230000, pct: 24.6, halted: false }]
When a workload hits its daily cap, the router throws an LLMBudgetExceeded error instead of silently burning through money. You can unlock a halted workload manually or bump the cap temporarily -- but the default is fail-closed. No silent cost overruns.
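For readers who want the same fail-closed behavior without the library, a daily cap can be approximated with a small counter keyed by UTC day. The sketch below is an assumption-laden illustration: `DailyTokenBudget` and `BudgetExceeded` are hypothetical names, and state is kept in memory, whereas a real deployment would more likely persist it in Redis or SQLite.

```javascript
// Hypothetical in-memory tracker, not the llm-cost-router implementation.
class BudgetExceeded extends Error {}

class DailyTokenBudget {
  constructor(caps, now = () => new Date()) {
    this.caps = caps;       // { workload: dailyTokenCap }
    this.now = now;         // injectable clock, useful for tests
    this.usage = new Map(); // workload -> { day, tokens }
  }

  // UTC calendar day such as "2025-01-31"; a new day implicitly resets usage.
  today() {
    return this.now().toISOString().slice(0, 10);
  }

  tokensToday(workload) {
    const entry = this.usage.get(workload);
    return entry && entry.day === this.today() ? entry.tokens : 0;
  }

  // Fail closed: throw once the cap has been reached.
  enforce(workload) {
    const cap = this.caps[workload];
    if (cap === undefined) throw new Error(`No budget configured for ${workload}`);
    if (this.tokensToday(workload) >= cap) {
      throw new BudgetExceeded(`${workload} reached its daily cap of ${cap} tokens`);
    }
  }

  record(workload, inputTokens, outputTokens) {
    const tokens = this.tokensToday(workload) + inputTokens + outputTokens;
    this.usage.set(workload, { day: this.today(), tokens });
  }
}

const budget = new DailyTokenBudget({ content_summary: 1000 });
budget.enforce('content_summary');          // under cap: no error
budget.record('content_summary', 700, 300); // 1000 tokens used today
try {
  budget.enforce('content_summary');
} catch (err) {
  console.log(err instanceof BudgetExceeded); // true
}
```

Because the check runs before the call and the count is recorded after it, a single call can overshoot the cap by its own size. That is usually acceptable for a daily limit, but it is worth knowing when setting the number.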
PII-Safe Provider Selection
Not every provider should handle every type of data. Some operate infrastructure in jurisdictions with different privacy regulations. Some free tiers explicitly train on inputs. The routing layer needs to enforce this automatically.
| Provider | Tier | Jurisdiction | PII Safe |
|---|---|---|---|
| Anthropic | Paid | US | Yes |
| OpenRouter | Paid | US | Yes (with X-No-Store) |
| Cerebras | Free | US | Public data only |
| Gemini | Free | US | Public data only |
| xAI (Grok) | Paid | US | Yes |
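| Claude CLI | Free* | US (Anthropic-hosted; the CLI itself runs locally) | Yes |
| Z.AI | Paid | CN | No |

Free*: covered by a Claude Max subscription, so there is no per-token charge, subject to the plan's usage limits.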
When a workload is classified as PII.PRIVATE, the router will refuse to dispatch it to providers marked as PII-unsafe. This is not a soft warning -- it is a hard rejection. The call fails rather than leaking data to a provider or jurisdiction that is not cleared for it.
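The hard-rejection rule can be illustrated with a small filter that runs before any dispatch. The policy table, the `allowsPrivate` flag, and the `PIIViolation` error below are hypothetical names chosen for this sketch. They loosely mirror the table above and are not the library's schema.

```javascript
// Hypothetical policy table; `allowsPrivate` is an assumed field name.
const PROVIDER_POLICY = {
  anthropic: { allowsPrivate: true },
  cerebras: { allowsPrivate: false },
  gemini: { allowsPrivate: false },
};

class PIIViolation extends Error {}

// Keep only providers allowed to see this data class. Unknown providers are
// treated as unsafe, and an empty result is an error rather than a fallback.
function eligibleProviders(chain, pii, policy = PROVIDER_POLICY) {
  if (pii !== 'private') return [...chain];
  const allowed = chain.filter((name) => policy[name]?.allowsPrivate === true);
  if (allowed.length === 0) {
    throw new PIIViolation('No PII-safe provider in chain; refusing to dispatch');
  }
  return allowed;
}

// Private data skips the free tiers and goes straight to the cleared provider.
const privateChain = eligibleProviders(['cerebras', 'gemini', 'anthropic'], 'private');
console.log(privateChain.join(',')); // anthropic
```

Treating providers that are missing from the policy table as unsafe keeps the default fail-closed: adding a new adapter cannot expose private data until someone explicitly marks it as cleared.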
Practical Advice You Can Use Today
Even without a routing library, you can apply these principles to any LLM integration:
- Classify your calls. Not by API shape, but by business purpose. Which calls need high quality? Which are throwaway? Which touch personal data? Write it down.
- Try free tiers first. Before calling Anthropic or OpenAI, try Cerebras or Gemini free for low-stakes workloads. Wrap each call in a try/catch and fall through on rate-limit errors.
- Set daily token budgets. Even a simple counter in Redis or SQLite that resets at midnight UTC will save you from runaway loops. Fail loudly when the cap is hit.
- Tag PII at the call site. Before you send a message array to any LLM, decide: does this contain personal data? If yes, restrict which providers can handle it.
- Use fallback chains. When your primary returns a 429, do not retry the same provider. Fall through to an alternative. Anthropic down? Try the same model through OpenRouter. OpenRouter down? Try xAI.
Key takeaway
The difference between a $200/month LLM bill and a $2,000/month bill is usually not the model you pick. It is whether you have routing logic that uses the cheapest viable option for each call, hard stops when budgets are hit, and automatic fallbacks when providers fail.
A/B Benchmarking Across Providers
One benefit of workload-based routing is that you can split traffic between providers to compare quality and cost in production. Route 80% of a workload to your current provider and 20% to a challenger. Bucket assignment can be deterministic -- hash the workload and a stable identifier so the same user always gets the same arm within a test window. After a week, compare output quality, latency, and cost per token. Data-driven provider decisions instead of vibes.
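Deterministic bucket assignment can be sketched with any stable hash. The example below uses a dependency-free FNV-1a style hash; the function names, the `testId` parameter (standing in for the test window), and the 20% default are assumptions made for illustration.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across processes and restarts.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash;
}

// The same (testId, workload, userId) always maps to the same arm, so a user
// stays in one arm for the whole test window. Changing testId reshuffles.
function assignArm(workload, userId, testId, challengerPct = 20) {
  const bucket = fnv1a(`${testId}:${workload}:${userId}`) % 100; // 0..99
  return bucket < challengerPct ? 'challenger' : 'control';
}

const first = assignArm('content_summary', 'user-42', 'exp-1');
const second = assignArm('content_summary', 'user-42', 'exp-1');
console.log(first === second); // true
```

Hashing is preferable to a random draw per request here because it needs no stored assignment table: any process can recompute the arm from the identifiers alone.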
LLM Cost Router
Everything described in this post -- workload registry, free-tier-first routing, budget enforcement, PII-safe provider selection, fallback chains, A/B benchmarking, and 7 provider adapters -- is available as a production-tested Node.js package. 258 tests. Extracted from a system handling 300K+ daily LLM calls. One-time purchase, MIT licensed source code.
The LLM Ops Toolkit bundles Cost Router + Quality Monitor + Quota Manager at 36% off.