75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.
A 2026 survey of 2,500+ decision-makers surfaces a governance paradox that stops looking like a paradox once you see what 'rollback' actually measures.
The number that’s been misread
Three in four enterprises have rolled back or shut down a customer-facing AI agent after putting it into production. That figure comes from a Sinch survey of more than 2,500 senior decision-makers across ten countries, and it has been circulating for weeks as a verdict on AI agent governance: the technology isn't ready, the pilots don't survive contact with real customers, take your pick of headline.
The number that gets quoted less, and matters more, sits right next to it. Among organisations with the most mature AI agent governance frameworks, the rollback rate is 81%: higher, not lower. If governance worked the way the headline implies, more oversight should produce fewer rollbacks. The data says the opposite.
That’s not a contradiction in the survey. It’s a contradiction in how most people define "rollback."
What "rollback" actually means in the data
The Sinch report breaks down why agents got pulled, and the top three reasons aren’t about the model being bad at its job:
- Customer data exposure — 31%
- Hallucination or brand risk — 22%
- Inability to diagnose what went wrong — 16%
Read that last one again. Sixteen percent of rollbacks happened not because the agent did something visibly wrong, but because nobody could tell whether it had. That’s not a model-quality problem. That’s an observability problem wearing a model-quality costume.
“The most advanced organisations aren't failing less. They're seeing failures sooner.”
A team running an agent with no audit log, no session replay, and no owner watching a dashboard doesn’t have zero incidents. It has zero visibility into its incidents. The agent can leak a customer record or hallucinate a refund policy for months without a single rollback being logged, because nobody is positioned to notice.
Why mature AI agent governance produces more rollbacks, not fewer
Rollback rate isn’t a proxy for how often AI agents misbehave. It’s a proxy for how often an organisation can tell that one did.
A team with an incident channel, a review cadence, and someone whose job includes owning an agent’s output will catch and roll back a bad deployment inside days. A team without any of that infrastructure will run the same bad deployment for a quarter, because the failure mode that would trigger a rollback (a complaint reaching the right person, a data-exposure alert firing, a hallucinated answer getting caught before it ships) never reaches anyone with the authority to pull the plug.
Put the two organisations side by side and the one with governance looks worse on a single metric. It isn't. It's the only one of the two that actually knows what its agents are doing. Gartner analyst Greg Carlucci frames the finding the same way: these rollbacks are what real-world deployment looks like once someone is actually watching, not evidence that the deployment failed. The alternative to a visible rollback isn't a stable agent — it's an invisible one.
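A minimal sketch of that arithmetic, using made-up numbers: assume two organisations whose agents fail at the same underlying rate and that differ only in how likely they are to notice a failure. The 60% failure share and both detection probabilities are hypothetical illustrations, not figures from the Sinch survey.

```python
# Hypothetical illustration: none of these numbers come from the survey.
def observed_rollback_rate(failure_rate: float, detection_prob: float) -> float:
    """Share of deployments rolled back = share that fail AND get noticed."""
    return failure_rate * detection_prob


FAILURE_RATE = 0.60  # assumed identical for both organisations

mature = observed_rollback_rate(FAILURE_RATE, detection_prob=0.90)
immature = observed_rollback_rate(FAILURE_RATE, detection_prob=0.30)

# Failures that keep running because nobody noticed them.
mature_invisible = FAILURE_RATE - mature
immature_invisible = FAILURE_RATE - immature

print(f"mature:   {mature:.0%} rolled back, {mature_invisible:.0%} failing unseen")
print(f"immature: {immature:.0%} rolled back, {immature_invisible:.0%} failing unseen")
```

Under these assumptions the organisation with better detection reports three times as many rollbacks while leaving far fewer failures running unnoticed, which is the pattern the survey's headline metric cannot distinguish from "governance makes things worse."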
The binary-governance trap
If governance is the detection layer, the obvious next question is why it produces this many detections in the first place. Gartner's answer, from senior director analyst Shiva Varma, is that most organisations run AI agent governance as a binary switch: locked down, or fully trusted. There isn't a middle setting.
That binary breaks down because agents aren’t one thing. A read-only agent that summarises support tickets and a fully autonomous agent that can issue refunds or push a config change to production have almost nothing in common operationally, but a binary governance model reviews them with the same checklist, or skips review on both equally. Neither outcome is right. The summariser gets buried in access reviews it doesn’t need, and someone eventually routes around the process to ship faster. The refund agent gets the same lightweight sign-off as the summariser, and it’s the one that can actually move money.
Gartner’s projection for where this leads: 40% of enterprises will demote or decommission an autonomous AI agent by 2027, after a governance gap surfaces in production rather than in review. The rollback, again, isn’t the failure. It’s the moment the gap stopped being invisible.
This is a different number from "AI projects fail"
It’s worth separating this from the broader "80% of AI projects fail" claim that gets attached to almost any AI story this year, traced back to RAND Corporation research on AI project outcomes in general. That number describes projects that never reached production at all: model selection, data pipeline, budget, or internal buy-in that stalled before an agent ever touched a live customer. A related figure from the same period puts the share of agent pilots that fail to graduate to production at 88%, with evaluation gaps and model reliability named as the leading blockers there.
The 75% rollback figure is a downstream number. It only applies to agents that already cleared that earlier bar: built, evaluated, shipped, and running against real traffic. Conflating the two makes AI agents look uniformly unready, which overstates the problem. Getting an agent into production is largely a model and evaluation problem. Keeping it there, and knowing when to pull it, is a governance problem, and it’s the one this piece is about.
Four tiers, four different failure modes
Gartner’s proposed fix replaces the binary switch with four autonomy tiers, each carrying its own minimum control set. The point isn’t the specific label on each tier. It’s that AI agent governance stops being one policy and becomes four, sized to what the agent can actually do.
| Tier | What the agent does | Minimum control | What breaks without it |
|---|---|---|---|
| Observe | Reads and summarises only | Scoped data access | Surfaces data it was never meant to see |
| Advise | Recommends; a human executes | Accuracy and citation checks | A confidently wrong answer ships as if verified |
| Act with Approval | Executes only after sign-off | A real review step, not a rubber stamp | Approval becomes theatre; nobody reads the diff |
| Act Autonomously | Executes inside guardrails | Monitoring, spend caps, named owner | Runs for days before anyone notices |
An agent that only reads and summarises doesn’t need a human-in-the-loop approval gate; it needs its data access scoped so it can’t surface something it was never meant to see. An agent that executes transactions needs the opposite problem solved: not whether it can see too much, but whether it can do too much before someone notices.
Most of the organisations in the 81% weren’t running four tiers of anything. They were running one policy, applied unevenly, with the highest-risk agents often getting the least scrutiny, because the person requesting them was senior enough to skip the queue.
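One way to make the tier table enforceable rather than aspirational is to encode it as data. The sketch below is an assumption about how a team might do that, not something Gartner prescribes: the `Tier` enum, the snake_case control names, and the `missing_controls` helper are all hypothetical. It also treats each tier's controls exactly as listed, since the table doesn't say whether higher tiers inherit the lower tiers' controls.

```python
from enum import Enum


class Tier(Enum):
    OBSERVE = "Observe"
    ADVISE = "Advise"
    ACT_WITH_APPROVAL = "Act with Approval"
    ACT_AUTONOMOUSLY = "Act Autonomously"


# Minimum control set per tier, transcribed from the table above.
# The snake_case control names are hypothetical labels for this sketch.
MINIMUM_CONTROLS: dict[Tier, frozenset[str]] = {
    Tier.OBSERVE: frozenset({"scoped_data_access"}),
    Tier.ADVISE: frozenset({"accuracy_checks", "citation_checks"}),
    Tier.ACT_WITH_APPROVAL: frozenset({"human_review_step"}),
    Tier.ACT_AUTONOMOUSLY: frozenset({"monitoring", "spend_cap", "named_owner"}),
}


def missing_controls(tier: Tier, controls_in_place: set[str]) -> set[str]:
    """Return the minimum controls for `tier` that are not yet in place."""
    return set(MINIMUM_CONTROLS[tier] - controls_in_place)


# A refund agent reviewed with the summariser's checklist:
gaps = missing_controls(Tier.ACT_AUTONOMOUSLY, {"scoped_data_access"})
print(sorted(gaps))
```

The refund agent passes the summariser's check and still comes back with every one of its own tier's controls missing, which is the one-policy-applied-unevenly gap in miniature.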
What "undiagnosable" actually looks like
The 16% of rollbacks attributed to an undiagnosable problem sounds abstract until you look at a specific instance of it. Reports circulated this year of a multi-agent system, built with no step cap configured, in which one agent asked a second for clarification, the second asked the first for clarification in return, and neither had the shared state to recognise that the conversation was going nowhere.
The loop ran for eleven days. The bill came to roughly $47,000. It produced zero useful output the entire time.
Nobody rolled that system back on day one, because nothing about it looked like an incident from the outside. The API meter ran. The agents kept calling each other. It looked like work.
That’s the shape of an undiagnosable failure: not a crash, not an error message, just a system that keeps doing something that resembles its job while producing nothing. The fix isn’t a smarter model. It’s a boundary the system can’t cross without a human finding out:
```python
MAX_STEPS = 25
MAX_SPEND_USD = 20

class AgentHalted(RuntimeError):
    """Raised when a run crosses a hard limit and needs a human."""

def guard(step_count, spend_so_far):
    # Call before every agent step; halt loudly instead of running on.
    if step_count > MAX_STEPS:
        raise AgentHalted(f"step cap hit: {step_count} > {MAX_STEPS}")
    if spend_so_far > MAX_SPEND_USD:
        raise AgentHalted(f"budget cap hit: ${spend_so_far:.2f}")
```

Eleven days and $47,000 is what happens when neither of those two caps exists anywhere in the system. A cap that low would have surfaced the loop in minutes, at a cost of a few dollars, as a routine alert instead of a viral postmortem.
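To see how quickly a cap like that trips, here is a small self-contained simulation of the clarification loop. It is a sketch under stated assumptions: the two agents are reduced to a toggling name, `run_clarification_loop` is a hypothetical stand-in for the real system, and the $0.15 per call is an invented price, not a figure from the reported incident.

```python
MAX_STEPS = 25
MAX_SPEND_USD = 20
COST_PER_CALL_USD = 0.15  # hypothetical price of one agent-to-agent call


class AgentHalted(RuntimeError):
    """Raised when a run crosses a hard limit and needs human attention."""


def guard(step_count, spend_so_far):
    if step_count > MAX_STEPS:
        raise AgentHalted(f"step cap hit: {step_count} > {MAX_STEPS}")
    if spend_so_far > MAX_SPEND_USD:
        raise AgentHalted(f"budget cap hit: ${spend_so_far:.2f}")


def run_clarification_loop():
    """Two agents that only ever ask each other for clarification."""
    step_count = 0
    speaker = "agent_a"
    while True:  # no natural exit: the guard is the only way out
        step_count += 1
        spend = step_count * COST_PER_CALL_USD
        guard(step_count, spend)
        # Hand the "question" back to the other agent.
        speaker = "agent_b" if speaker == "agent_a" else "agent_a"


try:
    run_clarification_loop()
except AgentHalted as halt:
    halt_reason = str(halt)
    print(f"halted: {halt_reason}")
```

With the assumed per-call price, the step cap fires on the 26th call, after less than four dollars of spend. The budget cap never gets a chance to trip, which is the point of having both: whichever limit the runaway behaviour reaches first ends the run.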
Building a tiered rollout instead of a blanket policy
None of this argues for less governance. It argues for governance that’s sized to what each agent can actually do, decided before the agent ships rather than after it causes the incident that forces the conversation.
In practice, that means three things a team can do before writing another line of agent code. First, assign every agent a tier (Observe, Advise, Act with Approval, or Act Autonomously) at design time, not retroactively after a review flags it as a concern. Second, treat "no governance policy yet" as a default assignment to Observe, not a default assignment to Act Autonomously by omission. Third, reserve named human ownership, spend caps, and rollback mechanisms specifically for anything in the top tier: that’s where the eleven-day loops live, and it’s the one tier where "we’ll notice eventually" isn’t good enough.
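Those three rules can be checked mechanically at registration time. The following is a minimal sketch, assuming a team keeps some kind of agent registry: `AgentSpec`, `validate`, the field names, and the example agents are all hypothetical, and the checks cover only the top tier because that is the only tier the paragraph above attaches hard requirements to.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    OBSERVE = 1
    ADVISE = 2
    ACT_WITH_APPROVAL = 3
    ACT_AUTONOMOUSLY = 4


@dataclass(frozen=True)
class AgentSpec:
    name: str
    # Rule 2: no policy yet means Observe, never autonomy by omission.
    tier: Tier = Tier.OBSERVE
    owner: Optional[str] = None            # named human owner
    spend_cap_usd: Optional[float] = None  # hard budget limit
    has_rollback: bool = False             # a known way to pull the agent


def validate(spec: AgentSpec) -> list[str]:
    """Return the reasons this agent should not ship; empty means OK."""
    problems = []
    # Rule 3: the top tier needs an owner, a spend cap, and a rollback path.
    if spec.tier is Tier.ACT_AUTONOMOUSLY:
        if not spec.owner:
            problems.append("no named owner")
        if spec.spend_cap_usd is None:
            problems.append("no spend cap")
        if not spec.has_rollback:
            problems.append("no rollback mechanism")
    return problems


# Rule 1: the tier is part of the spec, so it is decided at design time.
summariser = AgentSpec(name="ticket-summariser")
refund_bot = AgentSpec(name="refund-bot", tier=Tier.ACT_AUTONOMOUSLY, owner="j.doe")

print(summariser.tier.name, validate(summariser))
print(refund_bot.tier.name, validate(refund_bot))
```

The summariser lands in Observe without anyone having to decide anything, while the refund agent is blocked until it has a spend cap and a rollback path. Making the safe tier the default is the design choice that matters here: an agent nobody classified cannot end up autonomous by accident.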
The 75% figure will keep getting quoted as a verdict on whether AI agents work. The more useful question is how many of an organisation’s own agents would show up in a rollback count at all, or whether they’re running, unmeasured, in the 25% that never got the chance to be counted.