AI agents in production: the cost controls most teams build too late
Token budgets, circuit breakers, context summarisation, and kill switches — five operational controls that prevent silent cost runaway
The demo worked. You spent three weeks building an AI agent that reads customer tickets, looks up order history, drafts responses, and escalates the ones it cannot handle. On staging it handled 200 simulated tickets without a fault. You pushed to production.
Three days later, your LLM bill is four times what you expected. A bug in your retry logic caused the agent to call the same tool 47 times on a single ticket. The tool kept timing out. The agent kept retrying. Nobody noticed until the invoice came.
This specific failure has happened to enough engineering teams that it has a name: cost runaway. It is almost never caused by the AI doing something dramatically wrong. It is caused by the infrastructure around the AI doing something mundanely wrong: retry logic, context accumulation, and uncapped tool calls. All of it invisible without the right controls in place.
There are five cost controls for AI agents that catch this class of failure before it reaches the invoice. Most teams build them after the first incident. You can build them before.
The token budget as a first-class system resource
Your database has connection limits. Your HTTP server has request timeouts. Your queues have max depth. These limits are not defensive pessimism. They are the thing that turns "it worked in staging" into "it works in production."
LLM calls have no limits by default. Every API call you make to Claude or GPT-4 is uncapped. If your agent sends a 200,000-token context by accident, you pay for a 200,000-token inference. If your agent retries 20 times on a failed tool call, you pay for 20 inferences.
The fix: before every LLM call, count the tokens you are about to send. If the count exceeds a configured budget, short-circuit to a fallback: a simpler prompt, a cached response, or an escalation to a human. The budget is a system parameter, not a magic number. Set it per workflow type, not globally.
A reasonable first-pass budget for a ticket-handling agent with access to five tools is around 8,000 input tokens per turn. If you are regularly hitting 20,000 tokens per turn, something is wrong with your context management, not with the budget. The budget makes that visible.
# Count tokens before every LLM call
import anthropic
client = anthropic.Anthropic()
BUDGET_BY_WORKFLOW = {
"ticket_handler": 8_000,
"report_generator": 30_000,
"one_shot_classifier": 2_000,
}
def call_with_budget(workflow_type, messages, tools, system):
budget = BUDGET_BY_WORKFLOW[workflow_type]
token_count = client.beta.messages.count_tokens(
model="claude-sonnet-4-6",
system=system,
messages=messages,
tools=tools,
).input_tokens
if token_count > budget:
raise TokenBudgetExceeded(
f"{workflow_type}: {token_count} tokens > budget of {budget}"
)
return client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=system,
messages=messages,
tools=tools,
)Set budgets in configuration, not hardcoded in the call site. You will want to tune them after the first week of production data. The anthropic SDK's count_tokens method makes this a preflight check with no inference cost.
The retry multiplier: why your error handling is the expensive part
Most engineering teams write good retry logic for database calls and HTTP requests: exponential backoff with jitter, a maximum retry count, a circuit breaker that opens after sustained failures. This is well-understood.
The mistake: applying the same retry patterns to LLM calls and tool calls without accounting for the cost difference. A failed database query costs essentially nothing to retry. A failed LLM call that sends 5,000 tokens costs the same as a successful one.
Now add tool use. A tool call times out. The agent decides to retry. The tool times out again. The agent tries a different approach, calling a different tool, also times out. The agent writes a summary of the partial results it has, which itself triggers another LLM call. You have paid for four LLM inferences and three tool timeouts on a single ticket, and the ticket is still unresolved.
Three specific controls stop this pattern:
- Max tool calls per turn. Set a hard limit (not a soft suggestion to the model) on how many tool calls can happen in a single conversation turn. Enforce it in your orchestration layer, not in the system prompt.
- Tool call deduplication. Before executing a tool call, hash the tool name plus its arguments and check whether you have made the same call in this conversation. An agent that calls get_order_status with the same order ID three times in a single turn is stuck in a loop. The deduplication check catches this before you pay for the third inference.
- Per-conversation cost ceiling. Keep a running total of tokens consumed across the conversation. When it reaches the ceiling, stop the agent and return whatever partial result it has, flagged for human review. The ceiling is a safety valve, not an expected condition.
The circuit breaker for LLM calls
Database connection pools have circuit breakers. HTTP clients have circuit breakers. LLM calls need them too, but they are almost never implemented.
What a circuit breaker does: after N consecutive failures within a time window, it stops making the call and returns a fallback immediately, without attempting the call. It opens the circuit. After a recovery timeout, it allows one probe request. If the probe succeeds, the circuit closes and normal operation resumes.
LLM circuit breakers differ from HTTP circuit breakers in three ways that matter:
First, failure is expensive, not just slow. An HTTP call that fails costs latency. An LLM call that fails costs latency and tokens. You paid for the input even if the output was unusable.
Second, partial failures are common. An LLM call can succeed at the transport level but return output that is unusable: malformed JSON, an unexpected refusal, or a truncated response. These register as successes to an HTTP-level circuit breaker.
Third, failure modes cascade. A tool that is down causes the agent to retry, consuming tokens, potentially hitting rate limits, causing more failures. Standard circuit breakers do not track this compound behaviour.
A minimal LLM circuit breaker tracks three signal types: hard failures (5xx errors, timeouts, connection errors), soft failures (responses that fail schema validation or are empty), and cost events (calls that consumed more than twice the expected token count). When hard and soft failures cross a threshold in a two-minute window, open the circuit and return a fallback. Log cost events separately. They are signals for tuning rather than circuit triggers, but worth watching.
For a ticket-handling agent, the fallback when the circuit opens might be: mark the ticket for human review and send a template acknowledgement. That costs nothing and keeps the customer informed. Better than 20 more LLM calls while the circuit is struggling.
Context budget management: spending your window deliberately
Most agent frameworks accumulate context until the conversation ends or the context window overflows. The model sees everything that happened: every tool call result, every intermediate draft, every prior assistant turn.
Two problems emerge as conversations grow longer:
Cost scales faster than linearly. Each turn sends the full accumulated history. A conversation that goes ten turns instead of five does not cost twice as much. It costs five to ten times more, because every turn is longer than the last.
Attention degrades in long contexts. Models attend more reliably to content at the start and end of their context. Information in the middle gets deprioritised. A conversation history that has grown to 80,000 tokens is not just expensive. It also works worse than a well-managed 15,000-token context.
| Strategy | Token cost / 10-turn conversation | Implementation effort | Risk |
|---|---|---|---|
| Raw accumulation | 50,000–200,000 | None | Cost runaway; quality degradation from turn 6+ |
| Rolling window (last N turns) | 20,000–80,000 | Low | Loses important early context |
| Summarisation at threshold | 15,000–40,000 | Medium | Summary quality matters |
| Selective tool result trimming | 10,000–30,000 | Medium | Requires per-tool trim logic |
Summarisation at a 3–4 turn threshold is the best general-purpose approach: after the third turn, summarise the conversation history into a compact paragraph and replace the raw history with the summary. The model loses some granularity but gains reliability, and you save 60–80% of context costs.
Tool result trimming compounds the benefit. Tool call results are often verbose. An order history API might return 3KB of JSON when the agent needs only the order status and estimated delivery date. Trimming results to relevant fields before adding them to context is mechanical work with a large cost payoff.
Set a hard token ceiling for context depth, decided before deployment. When you hit it, summarise. This turns context overflow from an emergency into a handled condition.
The kill switch and the async eval ping
Two controls you should build before shipping, but almost never do.
The kill switch is a feature flag that stops all agent processing immediately, for a specific agent type or globally. This sounds obvious. The implementation detail that matters: it must be checked at the start of every agent turn, not only at startup. An agent mid-conversation should check the kill switch before every LLM call. If the flag is set, it stops immediately, saves state if you have a queue, and makes no further API calls.
Implement it with your existing feature flag system or a Redis key. Choose whatever you can change in under ten seconds. The time-to-change matters. When you discover that 500 customers are stuck in a loop at 3am, you want to stop the bleeding in seconds, not minutes.
The async eval ping is a lightweight quality check that runs out-of-band after the agent finishes a conversation. It calls a small, cheap model: Claude Haiku or Gemini Flash. The call is one question: did the agent accomplish the goal? It returns pass or fail.
This is not a replacement for your main eval suite. It is a production signal. When your eval ping pass rate drops below 80%, something has changed: a tool schema updated, a prompt regressed, a new edge case appeared in real traffic. You catch it in hours, not days.
Cost of the async eval ping: roughly £0.002 per conversation at Haiku prices. The value: catching regressions before they become a support incident.
What to instrument in the first week
You have built the controls. Now you need dashboards that tell you when they fire.
Five signals, in priority order:
- Token cost per conversation, segmented by agent type. This is your primary cost metric. Alert when the rolling average crosses 150% of the week-one baseline.
- Circuit breaker open events per hour. Under normal load, your circuit breaker should not open at all. If it opens more than twice per hour, you have a systemic problem, usually a degraded downstream tool.
- Tool call count per turn, at the 95th percentile. Your p95 tells you whether the agent is looping. If p95 is more than three times your median, something is wrong with your loop detection or your tool descriptions.
- Eval ping pass rate, by day. A drop of more than ten percentage points in a day warrants investigation. A drop sustained over three days usually means a prompt or tool schema has regressed.
- Kill switch activation count. This should be zero. If it is not, something serious is wrong.
These five dashboards do not require a specialised LLM observability platform. You can build them with your existing logging pipeline. The instrumentation is a matter of logging the right values at the right places: input token counts before calls, circuit state transitions, tool call events, eval results.
“The five signals — cost per conversation, circuit open events, p95 tool calls, eval pass rate, kill switch activations — are all the monitoring an AI agent needs before you add anything else.”
Before you ship, not after
Building an AI agent that works is the first engineering problem. Building one that stays working: at predictable cost, with degradable behaviour when something goes wrong. That is the second problem. Most teams discover the second problem after the first incident.
The five controls described here: token budget, retry ceiling, circuit breaker, context management, kill switch plus eval ping. None will prevent all failures. They will prevent the class of failures that announce themselves via an unexpected invoice or a user who has been talking to a looping agent for 20 turns.
Build them before you ship. Tune the numbers in week one using real production data. The structure does not change — the specific thresholds do.
Frequently asked questions
Related reading
Prompt caching in production: why the hit rate depends on prompt structure, not the API setting
Prompt caching keys on the leading token prefix. One dynamic field early in the prompt invalidates the cache for everything after it. Here is what that means for how you structure production prompts.
Feature flags in production: the lifecycle teams skip
Most teams have a system for adding feature flags. Almost none have a system for retiring them. Here is the full lifecycle: flag types, staleness detection, and the cleanup playbook.
When per-seat pricing breaks: what GitHub Copilot's billing shift signals for AI-powered SaaS
AI agents consume compute in ways that don't map to user count — and Copilot's June 2026 billing shift is the clearest signal yet. Here's what the transition reveals about pricing for AI-powered products.