Klarna replaced 700 support agents with AI, then reversed. What the sequence actually proves.
A closer look at what the experiment succeeded at, where it failed, and what it means for anyone deploying AI in a customer-facing role.
In February 2024, Klarna published results from a month of live AI deployment that achieved something uncommon: it became the reference point for an entire category of business decision. The fintech, which processes payments for roughly 150 million users, announced that its AI assistant was handling two-thirds of all customer service chats. The headline figure was the equivalent of 700 full-time agents, at first-contact resolution rates matching human performance and with an average resolution time 25% faster.
The figures were precise, they came from a company you had heard of, and they landed in the middle of a period when every board was asking its executive team what it was doing about AI. Within weeks, those numbers had made it into board decks at companies selling fintech software, logistics platforms, insurance, and consumer goods, as shorthand for what AI-first customer service could look like.
By May 2025, the CEO of Klarna was giving a notably different interview. By early 2026, the company had been hiring human agents back for the better part of a year. What happened between those two moments is more instructive than either headline.
What the press release did not carry
The metrics Klarna published in February 2024 were real. They were also selective.
The announcement showed volume handled, resolution speed, and a headline cost saving of approximately $40 million annually. What it did not show was customer satisfaction scores broken down by interaction type, escalation rates for conversations running longer than three turns, or resolution quality for disputes above a certain value threshold.
These omissions were not unusual for a corporate announcement. It is standard practice to lead with strong metrics. The problem was that the missing metrics were precisely the ones most relevant to anyone considering a similar deployment.
By late 2024, internal data and customer feedback told a different story on one specific dimension. The quality of AI handling for complex interactions, including multi-step disputes, refunds requiring explanation of policy decisions, and cases with some emotional charge, was measurably lower than equivalent human handling. The gap did not show up in volume terms; it showed up in the interactions that carry the most weight in customer retention.
| Metric | February 2024 announcement | Later status |
|---|---|---|
| Share of chats handled by AI | Published: two-thirds | Confirmed accurate |
| Full-time equivalent agents replaced | Published: 700 | Confirmed accurate |
| Resolution time improvement | Published: 25% faster | Confirmed accurate |
| Annual cost saving | Published: ~$40M | Confirmed accurate |
| CSAT for complex interactions | Not reported | Confirmed lower than human |
| Escalation rate on multi-turn cases | Not reported | Not publicly disclosed |
| CEO admission: quality was sacrificed | Not stated | Confirmed, May 2025 |
The task taxonomy problem
Customer support is not one thing. The common mistake in most arguments for AI-first support is treating the function as though its interactions have uniform value. They do not.
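A useful split puts support interactions into three tiers, running from routine, predictable requests (Tier 1) up to the high-stakes, multi-step, emotionally charged cases that decide whether a customer stays (Tier 3).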
Klarna's 2024 success was real, and it was concentrated in Tier 1. That tier is where AI-to-human quality parity is easiest to achieve, because the interaction pattern is narrow and predictable.
The difficulty is that Tier 1 interactions, while high in volume, are low in brand value. A customer who gets a fast, correct answer to a simple question is satisfied, but they were already going to be satisfied. The customers who decide whether to stay or leave based on a support interaction are almost always in Tier 3.
A single CSAT score that blends all three tiers can hold steady or even improve while Tier 3 quality falls. The aggregate masks the specific, and the specific is what matters. That is the structural trap Klarna walked into, not through bad faith, but through choosing the wrong measurement granularity.
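The masking effect is easy to reproduce with arithmetic. The sketch below is illustrative only: the tier volumes and satisfaction rates are invented for the example, not Klarna data, and `TierStats` and `blended_csat` are hypothetical names. It shows a volume-weighted CSAT rising while the Tier 3 score drops by 20 points.

```python
from dataclasses import dataclass


@dataclass
class TierStats:
    volume: int   # surveyed interactions in this tier (hypothetical)
    csat: float   # share of satisfied customers, 0.0 to 1.0 (hypothetical)


def blended_csat(tiers: dict[str, TierStats]) -> float:
    """Volume-weighted average CSAT across all tiers."""
    total = sum(t.volume for t in tiers.values())
    return sum(t.volume * t.csat for t in tiers.values()) / total


# Invented numbers: Tier 1 dominates volume, Tier 3 is a small tail.
before = {
    "tier1": TierStats(volume=8000, csat=0.85),
    "tier2": TierStats(volume=1500, csat=0.80),
    "tier3": TierStats(volume=500, csat=0.80),
}
after = {
    "tier1": TierStats(volume=8000, csat=0.88),  # faster answers, slightly happier
    "tier2": TierStats(volume=1500, csat=0.80),
    "tier3": TierStats(volume=500, csat=0.60),   # complex cases handled worse
}

print(f"blended before: {blended_csat(before):.3f}")
print(f"blended after:  {blended_csat(after):.3f}")
print(f"tier 3 change:  {after['tier3'].csat - before['tier3'].csat:+.2f}")
```

With these assumed inputs the blended score moves from 0.840 to 0.854, an apparent improvement, because Tier 3 contributes only 5% of the weight.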
What the CEO said in 2025
In a May 2025 interview, Sebastian Siemiatkowski described the situation with a level of specificity that corporate communications rarely reach.
“We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable.”
He was careful not to say AI had failed. What he described was a deployment model that had optimised for the wrong metric: aggregate volume handled rather than quality on the interactions where quality determines customer lifetime value.
This is worth reading precisely. 'AI failed at customer service' is a simpler and more shareable story. What actually happened was more specific: AI succeeded at a measurable subset of customer service (the subset that was reported) and fell short at the subset that was not. The deployment model assumed the subset was the whole.
By mid-2025, Klarna had begun a structured hiring process for human agents again. The workforce it is rebuilding is smaller than the one it dismantled; reporting from early 2026 points to roughly 30 to 40% of the previous human headcount. The job specifications are different: less scripted answering, more judgment-intensive resolution. AI handles first contact; humans take over when interaction complexity crosses a defined threshold.
What the experiment actually proves
Stated with precision:
- AI in 2024 and 2025 could absorb significant volume of routine, low-complexity interactions without customers noticing a quality difference. This is a real finding and it remains true.
- At the same level of sophistication, it could not match human-quality handling of high-stakes, multi-step, emotionally variable interactions. Also a real finding.
- A deployment that optimises for volume metrics without separately tracking quality on high-stakes interactions will show favourable results longer than it should, then show unfavourable ones once the higher-stakes data accumulates at scale.
- The timing of the 2024 announcement, before quality data on complex interactions was collected at scale, amplified the gap between the initial claim and the eventual correction. The result was a public reversal that drew attention precisely because the original claim was so categorical.
What the experiment did not prove: that AI is unsuitable for customer service roles. What it did prove: that 'AI handles X% of chats at the same CSAT' is an incomplete success metric, because CSAT is not uniform across interaction types, and the interactions that matter most to retention are the ones least well served by current AI.
The model that is actually emerging
The Klarna case is not an outlier. Several large consumer-facing companies that deployed AI-first support between 2023 and 2024 have since moved to versions of the same hybrid architecture: AI on first contact, human escalation on complexity. The details vary; some set the threshold by turn count, some by dispute value, some by sentiment signals. But the structural shape is consistent.
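A minimal sketch of what such a routing rule might look like, combining the three kinds of trigger mentioned above. Everything here is an assumption for illustration: the `Conversation` fields, the `HandoffPolicy` defaults, and the threshold values are hypothetical and do not describe Klarna's or any other company's actual system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    turn_count: int        # customer/assistant exchanges so far
    dispute_value: float   # amount in dispute; 0.0 if there is none
    sentiment: float       # -1.0 (very negative) to 1.0 (very positive)


@dataclass
class HandoffPolicy:
    # Hypothetical thresholds; real values need empirical validation.
    max_turns: int = 3
    max_dispute_value: float = 200.0
    min_sentiment: float = -0.4


def handoff_reasons(conv: Conversation, policy: HandoffPolicy) -> list[str]:
    """Return every trigger the conversation has crossed (empty if none)."""
    reasons = []
    if conv.turn_count > policy.max_turns:
        reasons.append("turn_count")
    if conv.dispute_value > policy.max_dispute_value:
        reasons.append("dispute_value")
    if conv.sentiment < policy.min_sentiment:
        reasons.append("sentiment")
    return reasons


def route(conv: Conversation, policy: Optional[HandoffPolicy] = None) -> str:
    """Send the conversation to a human if any trigger fires, else keep it with the AI."""
    policy = policy or HandoffPolicy()
    return "human" if handoff_reasons(conv, policy) else "ai"


print(route(Conversation(turn_count=2, dispute_value=0.0, sentiment=0.3)))
print(route(Conversation(turn_count=5, dispute_value=0.0, sentiment=0.1)))
```

Returning the list of reasons, rather than a bare yes or no, makes it possible to log which trigger fired and tune each threshold separately as edge cases surface.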
For anyone weighing a similar deployment, the Klarna sequence points toward four practical steps:
Map your support volume by interaction tier, not by ticket count. The question is not how many tickets you receive but what fraction of tickets, by type, determines customer lifetime value. That fraction sets the ceiling for safe automation.
Measure CSAT separately per tier. A blended score is misleading and will stay misleading until the Tier 3 data accumulates to the point where it pulls the average down. By then the damage is done.
Define the handoff threshold before going live. The threshold needs to be specific about which ticket attributes trigger human handoff, and it needs empirical validation. The edge cases requiring human judgment are not always visible in advance; plan for iteration.
Don't publish results before you have Tier 3 data at scale. Klarna's communication problem was primarily one of timing. A month of deployment is long enough to measure volume; it is not long enough to measure the quality impact on complex interactions at the tail of the distribution.
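One way to make "at scale" concrete is to look at how wide the uncertainty is on each tier's CSAT given the number of survey responses collected. The sketch below uses the Wilson score interval for a proportion, which is a choice made for this example rather than anything the Klarna reporting describes. The response counts are hypothetical.

```python
import math


def wilson_interval(satisfied: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a satisfaction rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = satisfied / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical first-month survey counts: plenty of Tier 1, very little Tier 3.
samples = {
    "tier1": (6800, 8000),  # (satisfied, responses)
    "tier3": (42, 60),
}

for tier, (satisfied, n) in samples.items():
    low, high = wilson_interval(satisfied, n)
    print(f"{tier}: n={n}, CSAT between {low:.3f} and {high:.3f} (width {high - low:.3f})")
```

With these assumed counts the Tier 1 interval is under two points wide, while the Tier 3 interval spans more than twenty. A Tier 3 result that uncertain cannot distinguish parity with human agents from a serious decline, which is the case for waiting before publishing.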
The economics of a hybrid model remain compelling. AI-handled interactions cost materially less than human-handled ones, and even a 40 to 50% automation rate produces significant cost reduction. The difference from the full-replacement model is that you are measuring the right outcomes from the start, which means the correction is incremental rather than categorical.
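As a rough worked example of that arithmetic, the sketch below computes the saving relative to an all-human baseline at partial automation rates. The per-interaction costs are placeholders chosen for illustration, not figures from Klarna or any vendor, and the function names are hypothetical.

```python
def blended_cost(automation_rate: float, ai_cost: float, human_cost: float) -> float:
    """Average cost per interaction when a share of volume is AI-handled."""
    if not 0.0 <= automation_rate <= 1.0:
        raise ValueError("automation_rate must be between 0 and 1")
    return automation_rate * ai_cost + (1 - automation_rate) * human_cost


def saving_vs_all_human(automation_rate: float, ai_cost: float, human_cost: float) -> float:
    """Fractional cost reduction compared with handling everything by humans."""
    return 1 - blended_cost(automation_rate, ai_cost, human_cost) / human_cost


# Placeholder unit costs (assumptions): 0.50 per AI interaction, 5.00 per human one.
AI_COST, HUMAN_COST = 0.50, 5.00

for rate in (0.40, 0.50):
    saving = saving_vs_all_human(rate, AI_COST, HUMAN_COST)
    print(f"automation {rate:.0%}: saving {saving:.0%}")
```

Under these assumed unit costs, automating 40% of interactions cuts total cost by 36% and automating 50% cuts it by 45%. The saving scales with the cost gap, so the real figure depends entirely on your own per-interaction costs.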
The 2024 press release and the 2026 model
The 2024 press release was written to demonstrate what AI could do. The 2026 operating model is structured around what AI currently cannot do. Both of these are true simultaneously, and the second does not cancel the first.
Companies that do well with AI in customer operations tend to start with the harder question: where does this break, at what input complexity, and what is the brand cost when it breaks? That question requires thinking about the value distribution across your support interactions, not just the volume distribution.
Klarna's sequence of events is worth studying precisely because the cost of finding out by public reversal is visible. Most companies that have made similar mistakes have corrected quietly. The ones that announced loudly have had to correct loudly, and that contrast is useful data for anyone about to make the same announcement.