Did Klarna's AI actually save money?

Yes. The roughly $40 million annual saving figure was real and was not retracted. The issue was that the deployment model, which aimed to replace rather than augment human agents, produced quality degradation in high-stakes interactions that offset some of those savings in customer retention terms. The economics of AI-augmented support still work, and work best with a hybrid architecture.

What is Klarna's customer service model in 2026?

AI handles first contact across all interaction types. Human agents take over when interaction complexity crosses a defined threshold, typically in multi-turn disputes, emotionally charged escalations, or cases involving significant transaction values. The human workforce is smaller than before the AI deployment but has different responsibilities: less scripted answering, more judgment-intensive resolution.

What went wrong with measuring Klarna's AI performance?

The main measurement issue was using a blended CSAT metric across all interaction types. Blended CSAT masks the quality difference between routine interactions, where AI performs well, and complex, high-stakes interactions, where it performs worse. CSAT needs to be tracked per tier of interaction complexity to give an accurate picture of AI deployment quality.

Is this story evidence that AI cannot replace humans in support roles?

Not precisely. The Klarna experiment shows that AI cannot replace humans across the full range of support interactions at current capability levels, particularly for complex, high-stakes, emotionally variable cases. It can handle a large fraction of routine, scripted interactions well. The implication is not 'don't use AI in support' but 'use it selectively on the interactions where it is reliable, and measure each tier separately'.

Industry Analysis · May 19, 2026 · 6 min read · Reviewed May 19, 2026

Klarna replaced 700 support agents with AI, then reversed. What the sequence actually proves.

A closer look at what the experiment succeeded at, where it failed, and what it means for anyone deploying AI in a customer-facing role.

By FlowVerify Editorial Team

In February 2024, Klarna published results from a month of live AI deployment that achieved something uncommon: it became the reference point for an entire category of business decision. The fintech, which processes payments for roughly 150 million users, announced that its AI assistant was handling two-thirds of all customer service chats. The headline figure was the equivalent of 700 full-time agents, at first-contact resolution rates matching human performance and with an average resolution time 25% faster.

The figures were precise, they came from a company you had heard of, and they landed in the middle of a period when every board was asking its executive team what they were doing about AI. Within weeks, those numbers had made it into board decks at companies selling fintech software, logistics platforms, insurance, and consumer goods. The Klarna press release became a reference point for what AI-first customer service could look like.

By May 2025, the CEO of Klarna was giving a notably different interview. By early 2026, the company had been hiring human agents back for the better part of a year. What happened between those two moments is more instructive than either headline.

What the press release did not carry

The metrics Klarna published in February 2024 were real. They were also selective.

The announcement showed volume handled, resolution speed, and a headline cost saving of approximately $40 million annually. What it did not show was customer satisfaction scores broken down by interaction type, escalation rates for conversations running longer than three turns, or resolution quality for disputes above a certain value threshold.

These omissions were not unusual for a corporate announcement. It is standard practice to lead with strong metrics. The problem was that the missing metrics were precisely the ones most relevant to anyone considering a similar deployment.

By late 2024, internal data and customer feedback told a different story on one specific dimension. The quality of AI handling for complex interactions, including multi-step disputes, refunds requiring explanation of policy decisions, and cases with some emotional charge, was measurably lower than equivalent human handling. The shortfall was not in volume handled but in the interactions that carry the most weight in customer retention.

Metric	Feb 2024	Later status
Share of chats handled by AI	Published: two-thirds	Confirmed accurate
Full-time equivalent agents replaced	Published: 700	Confirmed accurate
Resolution time improvement	Published: 25% faster	Confirmed accurate
Annual cost saving	Published: ~$40M	Confirmed accurate
CSAT for complex interactions	Not reported	Confirmed lower than human
Escalation rate on multi-turn cases	Not reported	Not publicly disclosed
CEO admission: quality was sacrificed	Not stated	Confirmed, May 2025

Klarna's February 2024 announcement versus what later reporting confirmed

The task taxonomy problem

Customer support is not one thing. The common mistake in most arguments for AI-first support is treating the function as though its interactions have uniform value. They do not.

A useful split puts support interactions into three tiers:

Klarna's 2024 success was real, and it was concentrated in Tier 1. That tier is where AI-to-human quality parity is easiest to achieve, because the interaction pattern is narrow and predictable.

The difficulty is that Tier 1 interactions, while high in volume, are low in brand value. A customer who gets a fast, correct answer to a simple question is satisfied, but they were already going to be satisfied. The customers who decide whether to stay or leave based on a support interaction are almost always in Tier 3.

A single CSAT score that blends all three tiers can hold steady or even improve while Tier 3 quality falls. The aggregate masks the specific, and the specific is what matters. That is the structural trap Klarna walked into, not through bad faith, but through choosing the wrong measurement granularity.
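The masking effect is easy to see with a little arithmetic. The sketch below uses entirely hypothetical tier shares and scores (none of these figures come from Klarna) to show how a volume-weighted average can rise while the Tier 3 score drops sharply. The names `TierStats` and `blended_csat` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierStats:
    share: float  # fraction of total interactions that fall in this tier
    csat: float   # mean satisfaction score for the tier, on a 0-100 scale


def blended_csat(tiers: dict[str, TierStats]) -> float:
    """Volume-weighted average CSAT across all tiers."""
    total_share = sum(t.share for t in tiers.values())
    return sum(t.share * t.csat for t in tiers.values()) / total_share


# Hypothetical figures: Tier 1 dominates volume, Tier 3 is a small tail.
before = {
    "tier1": TierStats(share=0.70, csat=80.0),
    "tier2": TierStats(share=0.20, csat=78.0),
    "tier3": TierStats(share=0.10, csat=75.0),
}
after = {
    "tier1": TierStats(share=0.70, csat=84.0),  # routine cases improve
    "tier2": TierStats(share=0.20, csat=78.0),  # unchanged
    "tier3": TierStats(share=0.10, csat=55.0),  # high-stakes cases degrade
}

print(f"blended before: {blended_csat(before):.1f}")
print(f"blended after:  {blended_csat(after):.1f}")
print(f"tier 3 change:  {after['tier3'].csat - before['tier3'].csat:+.1f}")
```

Under these assumed numbers the blended score moves from about 79.1 to about 79.9, while Tier 3 falls by 20 points. A dashboard that reports only the blended figure would show an improvement.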

What the CEO said in 2025

In a May 2025 interview, Sebastian Siemiatkowski described the situation with a level of specificity that corporate communications rarely reach.

“We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable.”

— Sebastian Siemiatkowski, Klarna CEO, May 2025

He was careful not to say AI had failed. What he described was a deployment model that had optimised for the wrong metric: aggregate volume handled rather than quality on the interactions where quality determines customer lifetime value.

This is worth reading precisely. 'AI failed at customer service' is a simpler and more shareable story. What actually happened was more specific: AI succeeded at a measurable subset of customer service (the subset that was reported) and fell short at the subset that was not. The deployment model assumed the subset was the whole.

By mid-2025, Klarna had begun a structured hiring process for human agents again. The workforce it is rebuilding is smaller than the one it dismantled; reporting from early 2026 points to roughly 30 to 40% of the previous human headcount. The job specifications are different: less scripted answering, more judgment-intensive resolution. AI handles first contact; humans take over when interaction complexity crosses a defined threshold.

What the experiment actually proves

Stated with precision:

AI in 2024 and 2025 could absorb significant volume of routine, low-complexity interactions without customers noticing a quality difference. This is a real finding and it remains true.
At the same level of sophistication, it could not match human-quality handling of high-stakes, multi-step, emotionally variable interactions. Also a real finding.
A deployment that optimises for volume metrics without separately tracking quality on high-stakes interactions will show favourable results longer than it should, then show unfavourable ones once the higher-stakes data accumulates at scale.
The timing of the 2024 announcement, before quality data on complex interactions was collected at scale, amplified the gap between the initial claim and the eventual correction. The result was a public reversal that drew attention precisely because the original claim was so categorical.

What the experiment did not prove: that AI is unsuitable for customer service roles. What it did prove: that 'AI handles X% of chats at the same CSAT' is an incomplete success metric, because CSAT is not uniform across interaction types, and the interactions that matter most to retention are the ones least well served by current AI.

The model that is actually emerging

The Klarna case is not an outlier. Several large consumer-facing companies that deployed AI-first support between 2023 and 2024 have since moved to versions of the same hybrid architecture: AI on first contact, human escalation on complexity. The details vary; some set the threshold by turn count, some by dispute value, some by sentiment signals. But the structural shape is consistent.

For anyone facing that decision now, the Klarna sequence points toward four practical steps:

Map your support volume by interaction tier, not by ticket count. The question is not how many tickets you receive but what fraction of tickets, by type, determine customer lifetime value. That fraction sets the ceiling for safe automation.

Measure CSAT separately per tier. A blended score is misleading and will stay misleading until the Tier 3 data accumulates to the point where it pulls the average down. By then the damage is done.

Define the handoff threshold before going live. The threshold needs to be specific about which ticket attributes trigger human handoff, and it needs empirical validation. The edge cases requiring human judgment are not always visible in advance; plan for iteration.

Don't publish results before you have Tier 3 data at scale. Klarna's communication problem was primarily one of timing. A month of deployment is long enough to measure volume; it is not long enough to measure the quality impact on complex interactions at the tail of the distribution.
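The handoff-threshold step above can be made concrete as an explicit rule over ticket attributes. The following is a minimal sketch, not a description of Klarna's system: the `Ticket` fields, the `HandoffPolicy` defaults, and the threshold values are all assumptions chosen for illustration, using the three signal types mentioned earlier (turn count, dispute value, sentiment).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    turn_count: int       # customer/assistant exchanges so far
    dispute_value: float  # amount at stake, in the account currency
    sentiment: float      # -1.0 (very negative) to 1.0 (very positive)


@dataclass(frozen=True)
class HandoffPolicy:
    # Hypothetical starting values; each needs empirical validation.
    max_turns: int = 3
    max_dispute_value: float = 200.0
    min_sentiment: float = -0.4


def handoff_reasons(ticket: Ticket, policy: HandoffPolicy) -> list[str]:
    """Return every attribute that crosses its threshold (empty if none)."""
    reasons = []
    if ticket.turn_count > policy.max_turns:
        reasons.append("turn_count")
    if ticket.dispute_value > policy.max_dispute_value:
        reasons.append("dispute_value")
    if ticket.sentiment < policy.min_sentiment:
        reasons.append("sentiment")
    return reasons


def needs_human(ticket: Ticket, policy: HandoffPolicy = HandoffPolicy()) -> bool:
    """A ticket is handed to a human if any single threshold is crossed."""
    return bool(handoff_reasons(ticket, policy))


routine = Ticket(turn_count=2, dispute_value=0.0, sentiment=0.3)
dispute = Ticket(turn_count=5, dispute_value=450.0, sentiment=-0.1)
print(needs_human(routine), handoff_reasons(routine, HandoffPolicy()))
print(needs_human(dispute), handoff_reasons(dispute, HandoffPolicy()))
```

Recording which attribute triggered each handoff, rather than only the yes/no decision, is one way to support the iteration the step calls for: it shows which thresholds fire too often and which edge cases slip past all of them.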

The economics of a hybrid model remain compelling. AI-handled interactions cost materially less than human-handled ones, and even a 40 to 50% automation rate produces significant cost reduction. The difference from the full-replacement model is that you are measuring the right outcomes from the start, which means the correction is incremental rather than categorical.
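As a rough illustration of that claim, the sketch below computes the direct handling-cost reduction at a given automation rate. The per-interaction costs are placeholders, not reported figures, and the calculation deliberately ignores retention effects, which is exactly the cost the per-tier measurement is meant to surface.

```python
def blended_cost(automation_rate: float, human_cost: float, ai_cost: float) -> float:
    """Average cost per interaction when a share of volume is AI-handled."""
    if not 0.0 <= automation_rate <= 1.0:
        raise ValueError("automation_rate must be between 0 and 1")
    return automation_rate * ai_cost + (1.0 - automation_rate) * human_cost


def cost_reduction(automation_rate: float, human_cost: float, ai_cost: float) -> float:
    """Fractional saving relative to an all-human baseline."""
    return 1.0 - blended_cost(automation_rate, human_cost, ai_cost) / human_cost


# Hypothetical unit costs: 5.00 per human-handled chat, 0.50 per AI-handled chat.
HUMAN_COST = 5.00
AI_COST = 0.50

for rate in (0.40, 0.50):
    saving = cost_reduction(rate, HUMAN_COST, AI_COST)
    print(f"automation {rate:.0%}: handling cost down {saving:.0%}")
```

With these assumed unit costs, automating 40% of volume cuts direct handling cost by about 36%, and 50% by about 45%. The saving scales with the automation rate times the relative cost gap, so the result is only as reliable as the unit costs fed into it.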

The 2024 press release and the 2026 model

The 2024 press release was written to demonstrate what AI could do. The 2026 operating model is structured around what AI currently cannot do. Both of these are true simultaneously, and the second does not cancel the first.

Companies that do well with AI in customer operations tend to start with the harder question: where does this break, at what input complexity, and what is the brand cost when it breaks? That question requires thinking about the value distribution across your support interactions, not just the volume distribution.

Klarna's sequence of events is worth studying precisely because the cost of finding out by public reversal is visible. Most companies that have made similar mistakes have corrected quietly. The ones that announced loudly have had to correct loudly, and that contrast is useful data for anyone about to make the same announcement.
