Klarna replaced 700 agents with AI in customer service. Here is what the metrics missed.
Two years on from Klarna's headline AI launch, the company is hiring humans again. What the efficiency numbers missed — and what a working hybrid looks like.
In early 2024, Klarna deployed AI in customer service at scale, announcing that its assistant had effectively replaced 700 agents and was handling two-thirds of all customer chats in the first month of full deployment. The CEO called it human-equivalent quality. Analysts cited it as proof the AI transition in service functions was real. Vendors included it in every sales deck for the next 18 months.
By 2026, Klarna was quietly hiring humans again.
The story of what happened between those two moments is more instructive than either the initial claim or the reversal. It is not a story about AI being incapable of handling support. It is a story about which metrics you watch when you automate customer service, and what those metrics cannot tell you.
The headline numbers were real
Klarna's AI handled a large volume of customer interactions. The efficiency gains were not fabricated: faster average handle times, no queuing, around-the-clock availability, lower cost per closed ticket. The system handled the interactions it was designed for well.
The CEO's claim about human-equivalent performance was accurate for a specific reading of the data. Volume metrics looked clean. First-response time improved. Ticket-close rate went up. The cost story was compelling enough that competitors started running similar numbers against their own support headcount.
None of that was wrong. The problem was that those numbers measure throughput. They do not measure what happens to a customer after the ticket closes.
What the volume metrics could not see
The metric that triggered the reversal was customer satisfaction on complex interactions: CSAT and NPS data that came not from any single ticket but from months of aggregated post-interaction surveys.
The problem was not that the AI failed on all interactions. It failed on a subset. That subset turned out to include the interactions that matter most to customer retention: disputed transactions, payment issues where a customer was confused or distressed, edge cases outside standard policy, and anything requiring a judgment call rather than a policy lookup.
Volume metrics masked this for a long time. If 80% of your interactions are routine and your AI handles them cleanly, your aggregate CSAT looks fine even when complex-case satisfaction sits two points below where it should be. The 20% where things went wrong was also the 20% that correlates most with churn.
By the time the NPS signal was unambiguous, tens of millions of interactions had already gone through the broken path. That is the cost of optimising for volume without building quality signals at the interaction-type level.
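The masking effect is plain weighted-average arithmetic. A minimal sketch, assuming a 1-to-5 CSAT scale and illustrative scores: only the 80/20 split and the two-point gap come from the example above, while `aggregate_csat` and every score below are hypothetical.

```python
def aggregate_csat(segments):
    """Volume-weighted mean CSAT over (share_of_volume, mean_score) pairs."""
    total_share = sum(share for share, _ in segments)
    return sum(share * score for share, score in segments) / total_share

# Assumed scores on a 1-to-5 scale; these are not Klarna data.
ROUTINE_CSAT = 4.6
COMPLEX_TARGET = 4.4                   # where complex-case CSAT "should be"
COMPLEX_ACTUAL = COMPLEX_TARGET - 2.0  # two points below target

healthy = aggregate_csat([(0.8, ROUTINE_CSAT), (0.2, COMPLEX_TARGET)])
masked = aggregate_csat([(0.8, ROUTINE_CSAT), (0.2, COMPLEX_ACTUAL)])

print(f"aggregate, complex cases on target: {healthy:.2f}")
print(f"aggregate, complex cases failing:   {masked:.2f}")
print(f"visible drop in the aggregate:      {healthy - masked:.2f}")
```

Under these assumptions a two-point collapse in the complex segment moves the aggregate by only 0.2 × 2.0 = 0.4 points, which is easy to read as noise on a dashboard that never splits by interaction type.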
| Interaction type | AI fit | Quality risk | Reason |
|---|---|---|---|
| Account status, order lookup | High | Low | Policy-bounded; no judgment required |
| Standard refunds, simple account changes | High | Low | Clear rules, predictable flow |
| Billing disputes with emotional charge | Low | High | Needs discretion, pacing, customer trust |
| Policy-adjacent edge cases | Medium | Medium-high | Literal policy may not be the right answer |
| Novel product or feature questions | Low | High | Outside training distribution; degrades over time |
| Fraud and security concerns | Low | High | Trust-critical; errors amplify customer distress |
The three categories that AI customer service breaks on
Looking across the Klarna case and the broader pattern in financial services and e-commerce, three interaction categories consistently underperform when handed to AI without careful design.
High-stakes emotional interactions
A customer in financial distress, a disputed charge that has cascaded into overdrafts, a fraud case with real-world consequences. These require recognising the emotional register of the conversation and adjusting pacing, tone, and resolution accordingly. Current models can detect distress in text, but they struggle with the sustained judgment those situations need: when to apologise versus when to explain, when to offer a concession versus when to hold the line.
Policy-adjacent edge cases
Every company's policies have ambiguities. 'Is this covered under the return policy?' Sometimes the answer is technically no, but the context (long-tenure customer, small value, unusual circumstances) makes yes the right business decision. Human agents in those situations apply discretion, and that discretion, applied consistently, is what builds customer trust over time. AI systems, when uncertain, revert to the literal policy text.
Interactions outside the training distribution
A new product feature, a regulatory change, a promotional edge case, a payment method that just launched in one market — the distribution of what customers ask shifts constantly. Human agents adapt through training and common sense. AI systems degrade silently until retrained. An AI support deployment that is not being continuously updated on new interaction types is not a stable system; it is one that is gradually becoming less reliable without any visible signal.
This is not an 'AI doesn't work' story
It would be easy to read the Klarna reversal as a vindication for skeptics who argued AI could not replace human support agents. That reading is wrong, and it is not a useful frame for anyone building support automation right now.
AI in customer service works, at scale, for specific interaction types. Klarna's assistant handled, and continues to handle, a large share of customer interactions. The company did not abandon AI. It rebalanced: AI for routine queries, human agents for the cases that broke.
“We went too far.”
The mistake was not deploying AI. The mistake was deploying it without the quality signals needed to detect subset failures early. By the time the aggregate data was unambiguous, the damage was already visible in churn cohorts.
The companies that have navigated this well started with a narrower deployment scope and expanded it, rather than deploying broadly and contracting under pressure. Starting narrow is slower. It is also how you find the boundary between what AI handles reliably and what it doesn't, before that boundary costs you customers.
What the working hybrid looks like in 2026
The model that has emerged from companies that built AI-human hybrids more deliberately shares a few structural properties.
Classify by complexity and risk before routing
A billing question that is a simple status lookup is a different class of interaction from a billing question involving a distressed customer and a disputed charge. Topic-based routing gets the first step wrong. The routing decision needs to factor in what failure looks like — what the customer does if this interaction goes badly — not just what the question is about.
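One way to picture routing on risk rather than topic is sketched below. This is an illustration under stated assumptions, not a description of Klarna's system: the `Interaction` fields, the 1-to-3 `failure_cost` scale, and the three `Route` values are all hypothetical, and the distress flag is assumed to come from an upstream classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AI = "ai"
    AI_WITH_FAST_ESCALATION = "ai_with_fast_escalation"
    HUMAN = "human"


@dataclass(frozen=True)
class Interaction:
    topic: str               # what the question is about
    needs_judgment: bool     # answer is not derivable from written policy alone
    distress_detected: bool  # assumed output of an upstream sentiment classifier
    failure_cost: int        # 1 = mild annoyance, 2 = repeat contact, 3 = churn or dispute


def route(interaction: Interaction) -> Route:
    """Route on what failure would cost, not on the topic label."""
    if interaction.distress_detected or interaction.failure_cost >= 3:
        return Route.HUMAN
    if interaction.needs_judgment or interaction.failure_cost == 2:
        return Route.AI_WITH_FAST_ESCALATION
    return Route.AI


# Same topic, different risk profile, different route.
status_lookup = Interaction("billing", False, False, 1)
disputed_charge = Interaction("billing", True, True, 3)
print(route(status_lookup).value)
print(route(disputed_charge).value)
```

Both example interactions carry the topic "billing", yet they land on different routes because the decision never reads `topic` at all.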
Build quality signals at the interaction level
Aggregate CSAT is a trailing indicator that smooths over the signal. Per-interaction CSAT — even a 1-to-5 rating after close — and in-conversation sentiment signals give you both leading and lagging data. The leading signal lets you catch an interaction going badly before it closes. The lagging signal tells you which interaction types are underperforming so you can reroute them. Most teams build aggregate dashboards and then wonder why they did not see the cliff coming.
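The lagging half of that signal can be as simple as grouping post-close ratings by interaction type before averaging. A minimal sketch, assuming ratings arrive as `(interaction_type, score)` pairs on a 1-to-5 scale; the function name, the threshold, the minimum sample size, and the sample data are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def underperforming_types(ratings, threshold=4.0, min_samples=3):
    """Return {interaction_type: mean_score} for types averaging below threshold.

    Types with fewer than min_samples ratings are skipped to avoid
    reacting to a handful of surveys.
    """
    by_type = defaultdict(list)
    for interaction_type, score in ratings:
        by_type[interaction_type].append(score)
    return {
        interaction_type: round(mean(scores), 2)
        for interaction_type, scores in by_type.items()
        if len(scores) >= min_samples and mean(scores) < threshold
    }


# Hypothetical sample: nine clean routine lookups, three bad dispute outcomes.
ratings = [("order_lookup", 5)] * 9 + [
    ("billing_dispute", 2),
    ("billing_dispute", 3),
    ("billing_dispute", 1),
]

overall = mean(score for _, score in ratings)
flagged = underperforming_types(ratings)
print(f"aggregate CSAT: {overall:.2f}")
print(f"flagged types:  {flagged}")
```

In this sample the aggregate clears the 4.0 threshold while the dispute segment does not, which is the gap an aggregate-only dashboard hides.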
Make escalation cheap for the customer
One reason Klarna's situation compounded: customers who could not get the AI to handle their edge case correctly did not always have a fast, clear path to a human. When the wait was long enough, some gave up. That shows up in churn data months later, not in support ticket metrics. An AI that escalates efficiently is a force multiplier for human agents. An AI that traps customers in loops is a liability.
Treat the training data cycle as infrastructure
AI support systems that are not being retrained on recent failure cases accumulate drift. New policies, new product features, new customer segments — the distribution shifts, and the system needs to shift with it. 'Deploy and monitor' without 'deploy and retrain' is how a working system quietly becomes a broken one over 18 months.
Three questions before you automate a support interaction type
Before handing an interaction category to AI, ask these.
What does failure look like, and how would you know?
Not what would the AI get wrong in the abstract. Specifically: if the AI mishandles this interaction, what does the customer do next? Do they escalate? Do they churn? Do they dispute the charge? If you cannot trace the failure to a measurable downstream outcome, you will not see the problem until it is large enough to show up in NPS.
Is this interaction type bounded by policy, or does it require judgment?
If the right answer is always derivable from written policy without contextual discretion, AI handles it reliably. If the right answer sometimes requires reading the situation and applying goodwill — a concession, an exception, a tone adjustment — AI will get it wrong in the cases that matter most to retention.
What is the escalation path, and is it fast enough?
An escalation that takes two minutes may be fine for a routine account query and intolerable for a fraud case. The escalation design needs to be calibrated to the interaction type, not set uniformly. If the answer is 'users can click a button and wait in a queue', that is not a fast path for the interactions where speed matters.
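Calibrating escalation per interaction type can start as a lookup table of wait limits checked against the current queue estimate. A sketch under assumptions: the 120-second limit for account queries echoes the two-minute example above, while every other value, type name, and the function itself are hypothetical.

```python
# Maximum acceptable wait before a human picks up, in seconds.
# Only the 120 s account-query figure comes from the text; the rest is assumed.
MAX_WAIT_SECONDS = {
    "account_query": 120,
    "billing_dispute": 45,
    "fraud": 15,
}

# Unknown interaction types get the strictest limit rather than the loosest.
STRICTEST_LIMIT = min(MAX_WAIT_SECONDS.values())


def escalation_is_fast_enough(interaction_type, expected_wait_seconds):
    """Compare the queue's expected wait with the limit for this interaction type."""
    limit = MAX_WAIT_SECONDS.get(interaction_type, STRICTEST_LIMIT)
    return expected_wait_seconds <= limit


# The same two-minute queue passes for one type and fails for another.
print(escalation_is_fast_enough("account_query", 120))
print(escalation_is_fast_enough("fraud", 120))
```

Defaulting unknown types to the strictest limit is a deliberate choice in this sketch: a newly launched interaction type is treated as high-stakes until someone decides otherwise.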
The Klarna case is not remarkable because a company tried AI in customer service and had to adjust. That is normal product work. What makes it instructive is the length of the gap between when the efficiency metrics looked fine and when the problem was obvious — and the fact that this gap is compressible.
Better leading indicators, interaction-level quality signals, and deliberate escalation design would have surfaced the failures months earlier. The companies getting AI in support right in 2026 are treating 'it is working' as a hypothesis to test against the right metrics, not a conclusion that follows from volume numbers going up.