Does a 75% AI agent rollback rate mean the technology doesn't work?

Not on its own. Sinch's 2026 survey found the highest rollback rates at organisations with the most mature AI agent governance. The number tracks how well a company can detect a failing agent, not how often agents fail overall. Teams without monitoring or a named owner for each agent can run a broken deployment for months without ever logging it as a rollback.

What is Gartner's four-tier AI agent governance framework?

It assigns each AI agent to one of four autonomy levels: Observe (read-only), Advise (recommends, a human executes), Act with Approval (executes only after sign-off), and Act Autonomously (executes inside guardrails). Each tier gets a different minimum control set instead of one governance policy applied to every agent regardless of what it can do.

Why do organisations with mature AI governance have higher rollback rates?

Because governance is largely a detection mechanism. An organisation with audit logs, review cadences, and named owners catches a misbehaving agent and rolls it back. An organisation without that infrastructure often keeps running the same agent, because nothing in its process would ever flag the failure as a rollback in the first place.

What's the most common reason enterprises roll back an AI agent?

In the Sinch survey, customer data exposure was the single largest cause, cited by 31% of respondents, followed by hallucination or brand risk at 22% and an inability to diagnose the underlying problem at 16%.

AI & LLMs · Jul 2, 2026 · 6 min read · Reviewed Jul 2, 2026

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

A 2026 survey of 2,500+ decision-makers surfaces a governance paradox that stops looking like a paradox once you see what 'rollback' actually measures.

By FlowVerify Editorial Team

The number that’s been misread

Three in four enterprises have rolled back or shut down a customer-facing AI agent after putting it into production. That figure comes from a Sinch survey of more than 2,500 senior decision-makers across ten countries, and it has been circulating for weeks as a verdict on AI agent governance: the technology isn't ready, the pilots don't survive contact with real customers, take your pick of headline.

The number that gets quoted less, and matters more, sits right next to it. Among organisations with the most mature AI agent governance frameworks, the rollback rate is higher, not lower: 81%. If governance worked the way the headline implies, more oversight should produce fewer rollbacks. The data says the opposite.

That’s not a contradiction in the survey. It’s a contradiction in how most people define "rollback."

What "rollback" actually means in the data

The Sinch report breaks down why agents got pulled, and the top three reasons aren’t about the model being bad at its job:

Customer data exposure — 31%
Hallucination or brand risk — 22%
Inability to diagnose what went wrong — 16%

Read that last one again. Sixteen percent of rollbacks happened not because the agent did something visibly wrong, but because nobody could tell whether it had. That’s not a model-quality problem. That’s an observability problem wearing a model-quality costume.

“The most advanced organisations aren't failing less. They're seeing failures sooner.”

— Daniel Morris, Chief Product Officer, Sinch

A team running an agent with no audit log, no session replay, and no owner watching a dashboard doesn’t have zero incidents. It has zero visibility into its incidents. The agent can leak a customer record or hallucinate a refund policy for months without a single rollback being logged, because nobody is positioned to notice.

Why mature AI agent governance produces more rollbacks, not fewer

Rollback rate isn’t a proxy for how often AI agents misbehave. It’s a proxy for how often an organisation can tell that one did.

A team with an incident channel, a review cadence, and someone whose job includes owning an agent’s output will catch and roll back a bad deployment inside days. A team without any of that infrastructure will run the same bad deployment for a quarter, because the failure mode that would trigger a rollback (a complaint reaching the right person, a data-exposure alert firing, a hallucinated answer getting caught before it ships) never reaches anyone with the authority to pull the plug.
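The arithmetic behind that claim can be sketched in a few lines. The numbers below are illustrative assumptions, not figures from the Sinch survey: both hypothetical organisations have the same share of misbehaving agents, and only the chance of detecting a failure differs.

```python
def observed_rollback_rate(failure_rate: float, detection_rate: float) -> float:
    """Share of deployments that end up logged as a rollback.

    A failure only becomes a rollback if somebody detects it first.
    """
    if not (0.0 <= failure_rate <= 1.0 and 0.0 <= detection_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return failure_rate * detection_rate


# Hypothetical inputs: identical agents, different monitoring.
FAILURE_RATE = 0.5  # assumed share of agents that misbehave

mature = observed_rollback_rate(FAILURE_RATE, detection_rate=0.9)
unmonitored = observed_rollback_rate(FAILURE_RATE, detection_rate=0.2)

print(f"mature governance: {mature:.0%} rolled back")
print(f"no monitoring:     {unmonitored:.0%} rolled back")
```

With these made-up inputs the monitored organisation reports several times as many rollbacks while running exactly the same agents, which is the pattern the survey numbers are consistent with.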

Put the two organisations side by side and the one with governance looks worse on a single metric. It isn't. It's the only one of the two that actually knows what its agents are doing. Gartner analyst Greg Carlucci frames the finding the same way: these rollbacks are what real-world deployment looks like once someone is actually watching, not evidence that the deployment failed. The alternative to a visible rollback isn't a stable agent — it's an invisible one.

The binary-governance trap

If governance is the detection layer, the obvious next question is why it produces this many detections in the first place. Gartner's answer, from senior director analyst Shiva Varma, is that most organisations run AI agent governance as a binary switch: locked down, or fully trusted. There isn't a middle setting.

That binary breaks down because agents aren’t one thing. A read-only agent that summarises support tickets and a fully autonomous agent that can issue refunds or push a config change to production have almost nothing in common operationally, but a binary governance model reviews them with the same checklist, or skips review on both equally. Neither outcome is right. The summariser gets buried in access reviews it doesn’t need, and someone eventually routes around the process to ship faster. The refund agent gets the same lightweight sign-off as the summariser, and it’s the one that can actually move money.

Gartner’s projection for where this leads: 40% of enterprises will demote or decommission an autonomous AI agent by 2027, after a governance gap surfaces in production rather than in review. The rollback, again, isn’t the failure. It’s the moment the gap stopped being invisible.

This is a different number from "AI projects fail"

It’s worth separating this from the broader "80% of AI projects fail" claim that gets attached to almost any AI story this year, traced back to RAND Corporation research on AI project outcomes in general. That number describes projects that never reached production at all: model selection, data pipeline, budget, or internal buy-in that stalled before an agent ever touched a live customer. A related figure from the same period puts the share of agent pilots that fail to graduate to production at 88%, with evaluation gaps and model reliability named as the leading blockers there.

The 75% rollback figure is a downstream number. It only applies to agents that already cleared that earlier bar: built, evaluated, shipped, and running against real traffic. Conflating the two makes AI agents look uniformly unready, which overstates the problem. Getting an agent into production is largely a model and evaluation problem. Keeping it there, and knowing when to pull it, is a governance problem, and it’s the one this piece is about.

Four tiers, four different failure modes

Gartner’s proposed fix replaces the binary switch with four autonomy tiers, each carrying its own minimum control set. The point isn’t the specific label on each tier. It’s that AI agent governance stops being one policy and becomes four, sized to what the agent can actually do.

Tier	What the agent does	Minimum control	What breaks without it
Observe	Reads and summarises only	Scoped data access	Surfaces data it was never meant to see
Advise	Recommends; a human executes	Accuracy and citation checks	A confidently wrong answer ships as if verified
Act with Approval	Executes only after sign-off	A real review step, not a rubber stamp	Approval becomes theatre; nobody reads the diff
Act Autonomously	Executes inside guardrails	Monitoring, spend caps, named owner	Runs for days before anyone notices

Gartner's four AI agent autonomy tiers

An agent that only reads and summarises doesn’t need a human-in-the-loop approval gate; it needs its data access scoped so it can’t surface something it was never meant to see. An agent that executes transactions needs the opposite problem solved: not whether it can see too much, but whether it can do too much before someone notices.
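One way to make the table checkable is to encode it as data. The sketch below is an assumption about how a team might do that, not something Gartner publishes: the snake_case control names and the `missing_controls` helper are invented for illustration, and each row is treated as a stand-alone minimum rather than as cumulative with the tiers below it.

```python
from enum import Enum


class Tier(Enum):
    OBSERVE = "Observe"
    ADVISE = "Advise"
    ACT_WITH_APPROVAL = "Act with Approval"
    ACT_AUTONOMOUSLY = "Act Autonomously"


# Minimum controls per tier, taken from the table above.
# The snake_case control names are labels invented for this sketch.
MINIMUM_CONTROLS = {
    Tier.OBSERVE: {"scoped_data_access"},
    Tier.ADVISE: {"accuracy_and_citation_checks"},
    Tier.ACT_WITH_APPROVAL: {"human_review_step"},
    Tier.ACT_AUTONOMOUSLY: {"monitoring", "spend_caps", "named_owner"},
}


def missing_controls(tier: Tier, controls_in_place: set[str]) -> set[str]:
    """Return the minimum controls for this tier that are not yet in place."""
    return MINIMUM_CONTROLS[tier] - controls_in_place


# A refund agent governed like a summariser: it has scoped data access,
# which is not what its tier asks for.
gaps = missing_controls(Tier.ACT_AUTONOMOUSLY, {"scoped_data_access", "monitoring"})
print(sorted(gaps))  # ['named_owner', 'spend_caps']
```

Keeping the mapping as data rather than prose means a review can fail automatically when an agent's tier and its controls disagree.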

Most of the organisations in the 81% weren’t running four tiers of anything. They were running one policy, applied unevenly, with the highest-risk agents often getting the least scrutiny, because the person requesting them was senior enough to skip the queue.

What "undiagnosable" actually looks like

The 16% of rollbacks attributed to an undiagnosable problem sounds abstract until you look at a specific instance of it. Reports circulated this year of a multi-agent system, built with no step cap configured, where one agent asked a second agent for clarification, the second agent asked the first for clarification back, and neither had the shared state to recognise the conversation was going nowhere.

The loop ran for eleven days. The bill came to roughly $47,000. It produced zero useful output the entire time.

Nobody rolled that system back on day one, because nothing about it looked like an incident from the outside. The API meter ran. The agents kept calling each other. It looked like work.

That’s the shape of an undiagnosable failure: not a crash, not an error message, just a system that keeps doing something that resembles its job while producing nothing. The fix isn’t a smarter model. It’s a boundary the system can’t cross without a human finding out:

guardrail.py

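MAX_STEPS = 25       # hard ceiling on agent steps per run
MAX_SPEND_USD = 20   # hard ceiling on spend per run, in US dollars

class AgentHalted(Exception):
    """Raised when a run crosses a cap, so a human finds out."""

def guard(step_count, spend_so_far):
    # Call before every agent step; raising stops the run.
    if step_count > MAX_STEPS:
        raise AgentHalted(f"step cap hit: {step_count} > {MAX_STEPS}")
    if spend_so_far > MAX_SPEND_USD:
        raise AgentHalted(f"budget cap hit: ${spend_so_far:.2f}")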

Eleven days and $47,000 is what happens when neither of those two lines exists anywhere in the system. A cap that low would have surfaced the loop in minutes, at a cost of a few dollars, as a routine alert instead of a viral postmortem.
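To see the cap fire, here is a minimal simulation of the clarification loop described above. It is a sketch under stated assumptions: `ask_for_clarification` is a stand-in for both agents, `COST_PER_CALL_USD` is an invented price, and the guard is repeated so the example runs on its own.

```python
MAX_STEPS = 25
MAX_SPEND_USD = 20.0
COST_PER_CALL_USD = 0.15  # assumed price per agent call, for illustration only


class AgentHalted(Exception):
    """Raised when a run crosses a step or spend cap."""


def guard(step_count, spend_so_far):
    # Same two checks as the guardrail shown earlier.
    if step_count > MAX_STEPS:
        raise AgentHalted(f"step cap hit: {step_count} > {MAX_STEPS}")
    if spend_so_far > MAX_SPEND_USD:
        raise AgentHalted(f"budget cap hit: ${spend_so_far:.2f}")


def ask_for_clarification(message):
    # Stand-in for either agent: it never answers, it only asks back.
    return f"Can you clarify: {message[:40]}"


def run_conversation(opening_message):
    """Bounce a message between two agents until a cap stops the run."""
    message, calls, spend = opening_message, 0, 0.0
    try:
        while True:
            guard(calls + 1, spend)  # would the next call cross a cap?
            message = ask_for_clarification(message)
            calls += 1
            spend += COST_PER_CALL_USD
    except AgentHalted as halt:
        return calls, spend, str(halt)


calls, spend, reason = run_conversation("Which refund policy applies?")
print(f"{calls} calls, ${spend:.2f} spent, halted: {reason}")
```

The loop itself is unchanged; the only difference from the eleven-day version is that every call passes through a check that can raise. In a real system the exception would also need to page or notify the agent's owner, which this sketch leaves out.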

Building a tiered rollout instead of a blanket policy

None of this argues for less governance. It argues for governance that’s sized to what each agent can actually do, decided before the agent ships rather than after it causes the incident that forces the conversation.

In practice, that means three things a team can do before writing another line of agent code. First, assign every agent a tier (Observe, Advise, Act with Approval, or Act Autonomously) at design time, not retroactively after a review flags it as a concern. Second, treat "no governance policy yet" as a default assignment to Observe, not a default assignment to Act Autonomously by omission. Third, reserve named human ownership, spend caps, and rollback mechanisms specifically for anything in the top tier: that’s where the eleven-day loops live, and it’s the one tier where "we’ll notice eventually" isn’t good enough.
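Those three rules are simple enough to enforce in code at registration time. The sketch below is hypothetical: `AgentSpec`, `resolve_tier`, and `launch_blockers` are names invented here, and a real registry would hold far more than these fields.

```python
from dataclasses import dataclass
from typing import Optional

TIERS = ("Observe", "Advise", "Act with Approval", "Act Autonomously")
DEFAULT_TIER = "Observe"  # no policy yet means read-only
TOP_TIER = "Act Autonomously"


@dataclass
class AgentSpec:
    name: str
    tier: Optional[str] = None  # None = nobody assigned a tier
    owner: Optional[str] = None  # named human owner
    spend_cap_usd: Optional[float] = None
    has_rollback: bool = False


def resolve_tier(spec: AgentSpec) -> str:
    """Fall back to Observe when no tier was assigned at design time."""
    tier = spec.tier or DEFAULT_TIER
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return tier


def launch_blockers(spec: AgentSpec) -> list[str]:
    """List what a top-tier agent is still missing before it may ship."""
    if resolve_tier(spec) != TOP_TIER:
        return []
    blockers = []
    if not spec.owner:
        blockers.append("named owner")
    if spec.spend_cap_usd is None:
        blockers.append("spend cap")
    if not spec.has_rollback:
        blockers.append("rollback mechanism")
    return blockers


untiered = AgentSpec(name="ticket-summariser")
refunds = AgentSpec(name="refund-agent", tier=TOP_TIER, owner="j.doe")

print(resolve_tier(untiered))    # Observe
print(launch_blockers(refunds))  # ['spend cap', 'rollback mechanism']
```

The design choice that matters is the default: an agent nobody classified resolves to the most restricted tier, so autonomy has to be granted explicitly rather than acquired by omission.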

The 75% figure will keep getting quoted as a verdict on whether AI agents work. The more useful question is how many of an organisation’s own agents would show up in a rollback count at all, or whether they’re running, unmeasured, in the 25% that never got the chance to be counted.
