What model should I use for the async eval ping?

Claude Haiku or Gemini Flash — both cost under $0.003 per conversation at typical ticket lengths. The eval is a single yes/no question, so the model does not need to be powerful. The goal is a quick signal, not a detailed critique.

How do I choose the right token budget per workflow type?

Start with a week of production data before you set limits. Log the actual input token count for every LLM call, then set the budget at the 95th percentile plus a 20% buffer. A first-pass estimate for a ticket-handling agent with five tools is around 8,000 tokens per turn; for a long-document analysis workflow, 30,000 is more realistic.

Does context summarisation hurt the agent's ability to do its job?

It depends on the task. For multi-turn customer support or triage workflows, summarisation at turn 3–4 loses granularity but rarely loses the information the agent needs. For tasks where exact earlier wording matters — contract review, legal analysis — use a rolling window or selective retention instead.

What should the kill switch actually do — drop in-flight conversations or queue them?

Queue them if you have a queue; drop them if you do not. The important thing is that no new LLM API calls are made after the switch fires. If you queue, include the conversation state so a human or restarted agent can pick up where it left off. Do not try to resume mid-turn — save state at turn boundaries.

AI & LLMsMay 17, 20268 min readReviewed May 17, 2026

AI agents in production: the cost controls most teams build too late

Token budgets, circuit breakers, context summarisation, and kill switches — five operational controls that prevent silent cost runaway

By FlowVerify Editorial Team

The demo worked. You spent three weeks building an AI agent that reads customer tickets, looks up order history, drafts responses, and escalates the ones it cannot handle. On staging it handled 200 simulated tickets without a fault. You pushed to production.

Three days later, your LLM bill is four times what you expected. A bug in your retry logic caused the agent to call the same tool 47 times on a single ticket. The tool kept timing out. The agent kept retrying. Nobody noticed until the invoice came.

This specific failure has happened to enough engineering teams that it has a name: cost runaway. It is almost never caused by the AI doing something dramatically wrong. It is caused by the infrastructure around the AI doing something mundanely wrong: retry logic, context accumulation, and uncapped tool calls. All of it invisible without the right controls in place.

There are five cost controls for AI agents that catch this class of failure before it reaches the invoice. Most teams build them after the first incident. You can build them before.

The token budget as a first-class system resource

Your database has connection limits. Your HTTP server has request timeouts. Your queues have max depth. These limits are not defensive pessimism. They are the thing that turns "it worked in staging" into "it works in production."

LLM calls have no limits by default. Every API call you make to Claude or GPT-4 is uncapped. If your agent sends a 200,000-token context by accident, you pay for a 200,000-token inference. If your agent retries 20 times on a failed tool call, you pay for 20 inferences.

The fix: before every LLM call, count the tokens you are about to send. If the count exceeds a configured budget, short-circuit to a fallback: a simpler prompt, a cached response, or an escalation to a human. The budget is a system parameter, not a magic number. Set it per workflow type, not globally.

A reasonable first-pass budget for a ticket-handling agent with access to five tools is around 8,000 input tokens per turn. If you are regularly hitting 20,000 tokens per turn, something is wrong with your context management, not with the budget. The budget makes that visible.

token_budget.py

# Count tokens before every LLM call
import anthropic

client = anthropic.Anthropic()

BUDGET_BY_WORKFLOW = {
    "ticket_handler":        8_000,
    "report_generator":     30_000,
    "one_shot_classifier":   2_000,
}

def call_with_budget(workflow_type, messages, tools, system):
    budget = BUDGET_BY_WORKFLOW[workflow_type]

    token_count = client.beta.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=system,
        messages=messages,
        tools=tools,
    ).input_tokens

    if token_count > budget:
        raise TokenBudgetExceeded(
            f"{workflow_type}: {token_count} tokens > budget of {budget}"
        )

    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=system,
        messages=messages,
        tools=tools,
    )

Set budgets in configuration, not hardcoded in the call site. You will want to tune them after the first week of production data. The anthropic SDK's count_tokens method makes this a preflight check with no inference cost.

The retry multiplier: why your error handling is the expensive part

Most engineering teams write good retry logic for database calls and HTTP requests: exponential backoff with jitter, a maximum retry count, a circuit breaker that opens after sustained failures. This is well-understood.

The mistake: applying the same retry patterns to LLM calls and tool calls without accounting for the cost difference. A failed database query costs essentially nothing to retry. A failed LLM call that sends 5,000 tokens costs the same as a successful one.

Now add tool use. A tool call times out. The agent decides to retry. The tool times out again. The agent tries a different approach, calling a different tool, also times out. The agent writes a summary of the partial results it has, which itself triggers another LLM call. You have paid for four LLM inferences and three tool timeouts on a single ticket, and the ticket is still unresolved.

Three specific controls stop this pattern:

Max tool calls per turn. Set a hard limit (not a soft suggestion to the model) on how many tool calls can happen in a single conversation turn. Enforce it in your orchestration layer, not in the system prompt.
Tool call deduplication. Before executing a tool call, hash the tool name plus its arguments and check whether you have made the same call in this conversation. An agent that calls get_order_status with the same order ID three times in a single turn is stuck in a loop. The deduplication check catches this before you pay for the third inference.
Per-conversation cost ceiling. Keep a running total of tokens consumed across the conversation. When it reaches the ceiling, stop the agent and return whatever partial result it has, flagged for human review. The ceiling is a safety valve, not an expected condition.

The circuit breaker for LLM calls

Database connection pools have circuit breakers. HTTP clients have circuit breakers. LLM calls need them too, but they are almost never implemented.

What a circuit breaker does: after N consecutive failures within a time window, it stops making the call and returns a fallback immediately, without attempting the call. It opens the circuit. After a recovery timeout, it allows one probe request. If the probe succeeds, the circuit closes and normal operation resumes.

LLM circuit breakers differ from HTTP circuit breakers in three ways that matter:

First, failure is expensive, not just slow. An HTTP call that fails costs latency. An LLM call that fails costs latency and tokens. You paid for the input even if the output was unusable.

Second, partial failures are common. An LLM call can succeed at the transport level but return output that is unusable: malformed JSON, an unexpected refusal, or a truncated response. These register as successes to an HTTP-level circuit breaker.

Third, failure modes cascade. A tool that is down causes the agent to retry, consuming tokens, potentially hitting rate limits, causing more failures. Standard circuit breakers do not track this compound behaviour.

A minimal LLM circuit breaker tracks three signal types: hard failures (5xx errors, timeouts, connection errors), soft failures (responses that fail schema validation or are empty), and cost events (calls that consumed more than twice the expected token count). When hard and soft failures cross a threshold in a two-minute window, open the circuit and return a fallback. Log cost events separately. They are signals for tuning rather than circuit triggers, but worth watching.

For a ticket-handling agent, the fallback when the circuit opens might be: mark the ticket for human review and send a template acknowledgement. That costs nothing and keeps the customer informed. Better than 20 more LLM calls while the circuit is struggling.

Context budget management: spending your window deliberately

Most agent frameworks accumulate context until the conversation ends or the context window overflows. The model sees everything that happened: every tool call result, every intermediate draft, every prior assistant turn.

Two problems emerge as conversations grow longer:

Cost scales faster than linearly. Each turn sends the full accumulated history. A conversation that goes ten turns instead of five does not cost twice as much. It costs five to ten times more, because every turn is longer than the last.

Attention degrades in long contexts. Models attend more reliably to content at the start and end of their context. Information in the middle gets deprioritised. A conversation history that has grown to 80,000 tokens is not just expensive. It also works worse than a well-managed 15,000-token context.

Strategy	Token cost / 10-turn conversation	Implementation effort	Risk
Raw accumulation	50,000–200,000	None	Cost runaway; quality degradation from turn 6+
Rolling window (last N turns)	20,000–80,000	Low	Loses important early context
Summarisation at threshold	15,000–40,000	Medium	Summary quality matters
Selective tool result trimming	10,000–30,000	Medium	Requires per-tool trim logic

Context management strategies compared

Summarisation at a 3–4 turn threshold is the best general-purpose approach: after the third turn, summarise the conversation history into a compact paragraph and replace the raw history with the summary. The model loses some granularity but gains reliability, and you save 60–80% of context costs.

Tool result trimming compounds the benefit. Tool call results are often verbose. An order history API might return 3KB of JSON when the agent needs only the order status and estimated delivery date. Trimming results to relevant fields before adding them to context is mechanical work with a large cost payoff.

Set a hard token ceiling for context depth, decided before deployment. When you hit it, summarise. This turns context overflow from an emergency into a handled condition.

The kill switch and the async eval ping

Two controls you should build before shipping, but almost never do.

The kill switch is a feature flag that stops all agent processing immediately, for a specific agent type or globally. This sounds obvious. The implementation detail that matters: it must be checked at the start of every agent turn, not only at startup. An agent mid-conversation should check the kill switch before every LLM call. If the flag is set, it stops immediately, saves state if you have a queue, and makes no further API calls.

Implement it with your existing feature flag system or a Redis key. Choose whatever you can change in under ten seconds. The time-to-change matters. When you discover that 500 customers are stuck in a loop at 3am, you want to stop the bleeding in seconds, not minutes.

The async eval ping is a lightweight quality check that runs out-of-band after the agent finishes a conversation. It calls a small, cheap model: Claude Haiku or Gemini Flash. The call is one question: did the agent accomplish the goal? It returns pass or fail.

This is not a replacement for your main eval suite. It is a production signal. When your eval ping pass rate drops below 80%, something has changed: a tool schema updated, a prompt regressed, a new edge case appeared in real traffic. You catch it in hours, not days.

Cost of the async eval ping: roughly £0.002 per conversation at Haiku prices. The value: catching regressions before they become a support incident.

What to instrument in the first week

You have built the controls. Now you need dashboards that tell you when they fire.

Five signals, in priority order:

Token cost per conversation, segmented by agent type. This is your primary cost metric. Alert when the rolling average crosses 150% of the week-one baseline.
Circuit breaker open events per hour. Under normal load, your circuit breaker should not open at all. If it opens more than twice per hour, you have a systemic problem, usually a degraded downstream tool.
Tool call count per turn, at the 95th percentile. Your p95 tells you whether the agent is looping. If p95 is more than three times your median, something is wrong with your loop detection or your tool descriptions.
Eval ping pass rate, by day. A drop of more than ten percentage points in a day warrants investigation. A drop sustained over three days usually means a prompt or tool schema has regressed.
Kill switch activation count. This should be zero. If it is not, something serious is wrong.

These five dashboards do not require a specialised LLM observability platform. You can build them with your existing logging pipeline. The instrumentation is a matter of logging the right values at the right places: input token counts before calls, circuit state transitions, tool call events, eval results.

“The five signals — cost per conversation, circuit open events, p95 tool calls, eval pass rate, kill switch activations — are all the monitoring an AI agent needs before you add anything else.”

— FlowVerify Engineering

Before you ship, not after

Building an AI agent that works is the first engineering problem. Building one that stays working: at predictable cost, with degradable behaviour when something goes wrong. That is the second problem. Most teams discover the second problem after the first incident.

The five controls described here: token budget, retry ceiling, circuit breaker, context management, kill switch plus eval ping. None will prevent all failures. They will prevent the class of failures that announce themselves via an unexpected invoice or a user who has been talking to a looping agent for 20 turns.

Build them before you ship. Tune the numbers in week one using real production data. The structure does not change — the specific thresholds do.

Frequently asked questions

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

Jul 2, 2026Read full article →

AI & LLMsMay 17, 20268 min readReviewed May 17, 2026

AI agents in production: the cost controls most teams build too late

Token budgets, circuit breakers, context summarisation, and kill switches — five operational controls that prevent silent cost runaway

By FlowVerify Editorial Team

There are five cost controls for AI agents that catch this class of failure before it reaches the invoice. Most teams build them after the first incident. You can build them before.

The token budget as a first-class system resource

token_budget.py

# Count tokens before every LLM call
import anthropic

client = anthropic.Anthropic()

BUDGET_BY_WORKFLOW = {
    "ticket_handler":        8_000,
    "report_generator":     30_000,
    "one_shot_classifier":   2_000,
}

def call_with_budget(workflow_type, messages, tools, system):
    budget = BUDGET_BY_WORKFLOW[workflow_type]

    token_count = client.beta.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=system,
        messages=messages,
        tools=tools,
    ).input_tokens

    if token_count > budget:
        raise TokenBudgetExceeded(
            f"{workflow_type}: {token_count} tokens > budget of {budget}"
        )

    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=system,
        messages=messages,
        tools=tools,
    )

The retry multiplier: why your error handling is the expensive part

Three specific controls stop this pattern:

Max tool calls per turn. Set a hard limit (not a soft suggestion to the model) on how many tool calls can happen in a single conversation turn. Enforce it in your orchestration layer, not in the system prompt.
Tool call deduplication. Before executing a tool call, hash the tool name plus its arguments and check whether you have made the same call in this conversation. An agent that calls get_order_status with the same order ID three times in a single turn is stuck in a loop. The deduplication check catches this before you pay for the third inference.
Per-conversation cost ceiling. Keep a running total of tokens consumed across the conversation. When it reaches the ceiling, stop the agent and return whatever partial result it has, flagged for human review. The ceiling is a safety valve, not an expected condition.

The circuit breaker for LLM calls

Database connection pools have circuit breakers. HTTP clients have circuit breakers. LLM calls need them too, but they are almost never implemented.

LLM circuit breakers differ from HTTP circuit breakers in three ways that matter:

First, failure is expensive, not just slow. An HTTP call that fails costs latency. An LLM call that fails costs latency and tokens. You paid for the input even if the output was unusable.

Context budget management: spending your window deliberately

Two problems emerge as conversations grow longer:

Strategy	Token cost / 10-turn conversation	Implementation effort	Risk
Raw accumulation	50,000–200,000	None	Cost runaway; quality degradation from turn 6+
Rolling window (last N turns)	20,000–80,000	Low	Loses important early context
Summarisation at threshold	15,000–40,000	Medium	Summary quality matters
Selective tool result trimming	10,000–30,000	Medium	Requires per-tool trim logic

Context management strategies compared

Set a hard token ceiling for context depth, decided before deployment. When you hit it, summarise. This turns context overflow from an emergency into a handled condition.

The kill switch and the async eval ping

Two controls you should build before shipping, but almost never do.

Cost of the async eval ping: roughly £0.002 per conversation at Haiku prices. The value: catching regressions before they become a support incident.

What to instrument in the first week

You have built the controls. Now you need dashboards that tell you when they fire.

Five signals, in priority order:

Token cost per conversation, segmented by agent type. This is your primary cost metric. Alert when the rolling average crosses 150% of the week-one baseline.
Circuit breaker open events per hour. Under normal load, your circuit breaker should not open at all. If it opens more than twice per hour, you have a systemic problem, usually a degraded downstream tool.
Tool call count per turn, at the 95th percentile. Your p95 tells you whether the agent is looping. If p95 is more than three times your median, something is wrong with your loop detection or your tool descriptions.
Eval ping pass rate, by day. A drop of more than ten percentage points in a day warrants investigation. A drop sustained over three days usually means a prompt or tool schema has regressed.
Kill switch activation count. This should be zero. If it is not, something serious is wrong.

“The five signals — cost per conversation, circuit open events, p95 tool calls, eval pass rate, kill switch activations — are all the monitoring an AI agent needs before you add anything else.”

— FlowVerify Engineering

Before you ship, not after

Build them before you ship. Tune the numbers in week one using real production data. The structure does not change — the specific thresholds do.

AI agents in production: the cost controls most teams build too late

The token budget as a first-class system resource

The retry multiplier: why your error handling is the expensive part

The circuit breaker for LLM calls

Context budget management: spending your window deliberately

The kill switch and the async eval ping

What to instrument in the first week

Before you ship, not after

Frequently asked questions

Related reading

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

An AI agent deleted PocketOS's production database in 9 seconds. Credential scoping was the real failure.

The AI memory shortage just rewrote the cloud cost-optimisation playbook

Stay ahead on eSignatures, compliance, and document workflows

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

AI agents in production: the cost controls most teams build too late

The token budget as a first-class system resource

The retry multiplier: why your error handling is the expensive part

The circuit breaker for LLM calls

Context budget management: spending your window deliberately

The kill switch and the async eval ping

What to instrument in the first week

Before you ship, not after

Frequently asked questions

Related reading

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

An AI agent deleted PocketOS's production database in 9 seconds. Credential scoping was the real failure.

The AI memory shortage just rewrote the cloud cost-optimisation playbook

Stay ahead on eSignatures, compliance, and document workflows

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.