Frequently asked questions

Did Klarna abandon AI in customer service?

No. Klarna rebalanced to a hybrid model rather than abandoning AI. The assistant continues to handle a large share of routine, policy-bounded interactions where it performs reliably. What changed is that human agents were added back for complex, high-stakes, and edge-case interactions where quality had deteriorated.

What specifically triggered Klarna's reversal?

CSAT and NPS data on complex interactions (disputed transactions, emotionally charged billing issues, fraud cases, and edge cases outside the AI's training distribution) deteriorated enough that the CEO publicly acknowledged the company had gone too far. Volume and throughput metrics had masked the problem because routine interactions dominate aggregate numbers.

How should companies measure AI support quality without being misled by volume metrics?

Aggregate CSAT alone is insufficient. The better approach combines per-interaction CSAT ratings with in-conversation sentiment signals, and tracks quality separately across interaction-type categories rather than in aggregate. This surfaces failures in specific categories before they compound into NPS decline.

Industry Analysis | Jun 3, 2026 | 7 min read | Reviewed Jun 3, 2026

Klarna replaced 700 agents with AI in customer service. Here is what the metrics missed.

Two years on from Klarna's headline AI launch, the company is hiring humans again. What the efficiency numbers missed — and what a working hybrid looks like.

By FlowVerify Editorial Team

In early 2024, Klarna deployed AI in customer service at scale, announcing that its assistant was doing the equivalent work of 700 full-time agents and had handled two-thirds of all customer service chats in its first month. The CEO called it human-equivalent quality. Analysts cited it as proof the AI transition in service functions was real. Vendors included it in every sales deck for the next 18 months.

By 2026, Klarna was quietly hiring humans again.

The story of what happened between those two moments is more instructive than either the initial claim or the reversal. It is not a story about AI being incapable of handling support. It is a story about which metrics you watch when you automate customer service, and what those metrics cannot tell you.

The headline numbers were real

Klarna's AI handled a large volume of customer interactions. The efficiency gains were not fabricated: faster average handle times, no queuing, around-the-clock availability, lower cost per closed ticket. For the interactions the system was designed to handle, it handled them well.

The CEO's claim about human-equivalent performance was accurate for a specific reading of the data. Volume metrics looked clean. First-response time improved. Ticket-close rate went up. The cost story was compelling enough that competitors started running similar numbers against their own support headcount.

None of that was wrong. The problem was that those numbers measure throughput. They do not measure what happens to a customer after the ticket closes.

What the volume metrics could not see

The metric that triggered the reversal was customer satisfaction on complex interactions: CSAT and NPS data that emerged not from any single ticket but from months of aggregated post-interaction surveys.

The problem was not that the AI failed on all interactions. It failed on a subset. That subset turned out to include interactions that matter most to customer retention: disputed transactions, payment issues where a customer was confused or distressed, edge cases outside standard policy, and anything requiring a judgment call rather than a policy lookup.

Volume metrics masked this for a long time. If 80% of your interactions are routine and your AI handles them cleanly, your aggregate CSAT looks fine even if complex-case satisfaction is two points below where it should be. The 20% where things went wrong were also the 20% that correlate most with churn.
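The masking effect is plain weighted-average arithmetic. The sketch below assumes a 1-to-5 CSAT scale and made-up scores (4.6 for healthy interactions, 2.6 for degraded complex cases); none of these figures come from Klarna, and the function name is only illustrative.

```python
# Illustrative figures on an assumed 1-to-5 CSAT scale; none come from Klarna.
ROUTINE_SHARE = 0.80   # share of interactions that are routine
COMPLEX_SHARE = 0.20   # share of interactions that are complex


def aggregate_csat(routine_csat: float, complex_csat: float) -> float:
    """Volume-weighted mean CSAT across routine and complex interactions."""
    return ROUTINE_SHARE * routine_csat + COMPLEX_SHARE * complex_csat


# Complex cases performing as well as routine ones.
healthy = aggregate_csat(routine_csat=4.6, complex_csat=4.6)

# Complex cases two points lower; routine cases unchanged.
degraded = aggregate_csat(routine_csat=4.6, complex_csat=2.6)

# The aggregate only moves by COMPLEX_SHARE * 2.0.
drop = healthy - degraded

print(round(healthy, 2), round(degraded, 2), round(drop, 2))
```

Under these assumptions a two-point collapse on complex cases moves the aggregate from 4.6 to 4.2, a 0.4-point change that is easy to read as noise on a dashboard.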

By the time the NPS signal was unambiguous, tens of millions of interactions had already gone through the broken path. That is the cost of optimising for volume without building quality signals at the interaction-type level.

Interaction type	AI fit	Quality risk	Reason
Account status, order lookup	High	Low	Policy-bounded; no judgment required
Standard refunds, simple account changes	High	Low	Clear rules, predictable flow
Billing disputes with emotional charge	Low	High	Needs discretion, pacing, customer trust
Policy-adjacent edge cases	Medium	Medium-high	Literal policy may not be the right answer
Novel product or feature questions	Low	High	Outside training distribution; degrades over time
Fraud and security concerns	Low	High	Trust-critical; errors amplify customer distress

Interaction types: AI fit versus quality risk

The three categories that AI customer service breaks on

Looking across the Klarna case and the broader pattern in financial services and e-commerce, three interaction categories consistently underperform when handed to AI without careful design.

High-stakes emotional interactions

A customer in financial distress, a disputed charge that has cascaded into overdrafts, a fraud case with real-world consequences. These require recognising the emotional register of the conversation and adjusting pacing, tone, and resolution accordingly. Current models can detect distress in text, but they struggle with the sustained judgment those situations need: when to apologise versus when to explain, when to offer a concession versus when to hold the line.

Policy-adjacent edge cases

Every company's policies have ambiguities. 'Is this covered under the return policy?' Sometimes the answer is technically no, but the context (long-tenure customer, small value, unusual circumstances) makes yes the right business decision. Human agents in those situations apply discretion, and that discretion, applied consistently, is what builds customer trust over time. AI systems, when uncertain, revert to the literal policy text.

Interactions outside the training distribution

A new product feature, a regulatory change, a promotional edge case, a payment method that just launched in one market — the distribution of what customers ask shifts constantly. Human agents adapt through training and common sense. AI systems degrade silently until retrained. An AI support deployment that is not being continuously updated on new interaction types is not a stable system; it is one that is gradually becoming less reliable without any visible signal.

This is not an 'AI doesn't work' story

It would be easy to read the Klarna reversal as a vindication for skeptics who argued AI could not replace human support agents. That reading is wrong, and it is not a useful frame for anyone building support automation right now.

AI in customer service works, at scale, for specific interaction types. Klarna's assistant handled, and continues to handle, a large share of customer interactions. The company did not abandon AI. It rebalanced: AI for routine queries, human agents for the cases that broke.

“We went too far.”

— Sebastian Siemiatkowski, Klarna CEO

The mistake was not deploying AI. The mistake was deploying it without the quality signals needed to detect subset failures early. By the time the aggregate data was unambiguous, the damage was already visible in churn cohorts.

The companies that have navigated this well started with a narrower deployment scope and expanded it, rather than deploying broadly and contracting under pressure. Starting narrow is slower. It is also how you find the boundary between what AI handles reliably and what it doesn't, before that boundary costs you customers.

What the working hybrid looks like in 2026

The model that has emerged from companies that built AI-human hybrids more deliberately shares a few structural properties.

Classify by complexity and risk before routing

A billing question that is a simple status lookup is a different class of interaction from a billing question involving a distressed customer and a disputed charge. Topic-based routing gets the first step wrong. The routing decision needs to factor in what failure looks like — what the customer does if this interaction goes badly — not just what the question is about.
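A minimal sketch of risk-aware routing follows. The `Interaction` fields and the `route` function are hypothetical names chosen for illustration, and the boolean flags are assumed to come from upstream classifiers that are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # All fields are hypothetical; real systems would derive them upstream.
    topic: str               # what the question is about, e.g. "billing"
    disputed: bool           # the customer contests a charge
    distress_detected: bool  # flagged by an assumed sentiment signal
    fraud_suspected: bool    # flagged by an assumed fraud signal


def route(interaction: Interaction) -> str:
    """Return "human" or "ai" based on risk signals rather than topic alone."""
    if interaction.fraud_suspected:
        return "human"  # trust-critical: errors are costly
    if interaction.disputed or interaction.distress_detected:
        return "human"  # needs discretion and pacing
    return "ai"         # policy-bounded, low cost of failure


# Same topic, different risk profile, different destination.
status_lookup = Interaction("billing", disputed=False,
                            distress_detected=False, fraud_suspected=False)
distressed_dispute = Interaction("billing", disputed=True,
                                 distress_detected=True, fraud_suspected=False)

print(route(status_lookup), route(distressed_dispute))
```

Note that `topic` never appears in the routing decision here: both examples are "billing", and only the risk flags separate them.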

Build quality signals at the interaction level

Aggregate CSAT is a trailing indicator that smooths over the signal. Per-interaction CSAT — even a 1-to-5 rating after close — and in-conversation sentiment signals give you both leading and lagging data. The leading signal lets you catch an interaction going badly before it closes. The lagging signal tells you which interaction types are underperforming so you can reroute them. Most teams build aggregate dashboards and then wonder why they did not see the cliff coming.
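One way to build the lagging half of that signal is to group post-close ratings by interaction type before averaging. The sketch below assumes 1-to-5 ratings, invented category names, and an arbitrary alert threshold of 3.5; all of these are placeholders rather than recommended values.

```python
from collections import defaultdict
from statistics import mean


def csat_by_category(ratings):
    """Group (category, rating) pairs and return the mean rating per category."""
    grouped = defaultdict(list)
    for category, rating in ratings:
        grouped[category].append(rating)
    return {category: mean(values) for category, values in grouped.items()}


def underperforming(per_category, threshold=3.5):
    """Return the categories whose mean rating falls below the threshold."""
    return sorted(c for c, score in per_category.items() if score < threshold)


# Made-up sample: routine lookups dominate the volume.
ratings = [
    ("order_lookup", 5), ("order_lookup", 4), ("order_lookup", 5),
    ("order_lookup", 5), ("order_lookup", 4),
    ("billing_dispute", 2), ("billing_dispute", 3),
    ("fraud", 2),
]

overall = mean(rating for _, rating in ratings)
per_category = csat_by_category(ratings)
flagged = underperforming(per_category)

print(overall, flagged)
```

In this sample the overall mean stays above the threshold while two categories fall below it, which is the pattern an aggregate dashboard would hide.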

Make escalation cheap for the customer

One reason Klarna's situation compounded: customers who could not get the AI to handle their edge case correctly did not always have a fast, clear path to a human. When the wait was long enough, some gave up. That shows up in churn data months later, not in support ticket metrics. An AI that escalates efficiently is a force multiplier for human agents. An AI that traps customers in loops is a liability.
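A simple guard against loops is to hand off as soon as the customer asks for a person or the assistant fails a set number of consecutive turns. The sketch below is an assumption about how such a rule could look: the turn keys (`asked_for_human`, `resolved_intent`) and the default limit of two failed turns are hypothetical.

```python
def should_escalate(turns, max_failed_turns=2):
    """Decide whether a conversation should be handed to a human.

    Each turn is a dict with two hypothetical boolean keys:
      "asked_for_human": the customer explicitly requested a person
      "resolved_intent": the assistant addressed what the customer asked
    """
    failed_in_a_row = 0
    for turn in turns:
        if turn["asked_for_human"]:
            return True               # never argue with an explicit request
        if turn["resolved_intent"]:
            failed_in_a_row = 0       # progress resets the counter
        else:
            failed_in_a_row += 1
        if failed_in_a_row >= max_failed_turns:
            return True               # the conversation is looping
    return False


looping = [
    {"asked_for_human": False, "resolved_intent": False},
    {"asked_for_human": False, "resolved_intent": False},
]
progressing = [
    {"asked_for_human": False, "resolved_intent": False},
    {"asked_for_human": False, "resolved_intent": True},
]

print(should_escalate(looping), should_escalate(progressing))
```

The limit is a tuning choice: a lower value escalates more often and costs more agent time, while a higher value keeps more customers in the loop for longer.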

Treat the training data cycle as infrastructure

AI support systems that are not being retrained on recent failure cases accumulate drift. New policies, new product features, new customer segments — the distribution shifts, and the system needs to shift with it. 'Deploy and monitor' without 'deploy and retrain' is how a working system quietly becomes a broken one over 18 months.
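Drift in what customers ask can be watched directly rather than inferred from later failures. The sketch below uses the population stability index (PSI), a general-purpose distribution-shift measure swapped in here for illustration, not something the Klarna case describes. The category names, shares, and the 0.25 alert level (a common rule of thumb) are assumptions.

```python
import math


def population_stability_index(baseline, recent, floor=1e-6):
    """PSI between two distributions given as {category: share} dicts.

    Shares of zero are raised to `floor` so the logarithm stays defined
    when a category appears in only one of the two periods.
    """
    psi = 0.0
    for category in set(baseline) | set(recent):
        b = max(baseline.get(category, 0.0), floor)
        r = max(recent.get(category, 0.0), floor)
        psi += (r - b) * math.log(r / b)
    return psi


# Made-up intent mix at training time versus last week.
baseline = {"order_lookup": 0.6, "refund": 0.3, "dispute": 0.1}
recent = {"order_lookup": 0.4, "refund": 0.3, "dispute": 0.1,
          "new_payment_method": 0.2}  # a category the model never saw

ALERT_LEVEL = 0.25  # rule-of-thumb threshold, an assumption to tune

psi = population_stability_index(baseline, recent)
needs_review = psi > ALERT_LEVEL

print(needs_review)
```

A PSI alert does not say the assistant is failing, only that the question mix has moved away from what it was trained on, which is the cue to sample recent conversations and schedule retraining.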

Three questions before you automate a support interaction type

Before handing an interaction category to AI, ask these.

What does failure look like, and how would you know?

Not what would the AI get wrong in the abstract. Specifically: if the AI mishandles this interaction, what does the customer do next? Do they escalate? Do they churn? Do they dispute the charge? If you cannot trace the failure to a measurable downstream outcome, you will not see the problem until it is large enough to show up in NPS.

Is this interaction type bounded by policy, or does it require judgment?

If the right answer is always derivable from written policy without contextual discretion, AI handles it reliably. If the right answer sometimes requires reading the situation and applying goodwill — a concession, an exception, a tone adjustment — AI will get it wrong in the cases that matter most to retention.

What is the escalation path, and is it fast enough?

An escalation that takes two minutes may be fine for a routine account query and intolerable for a fraud case. The escalation design needs to be calibrated to the interaction type, not set uniformly. If the answer is 'users can click a button and wait in a queue', that is not a fast path for the interactions where speed matters.
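Calibrating escalation to interaction type can be as simple as a per-type wait budget. The values below are placeholders chosen only to mirror the two-minute example above; the type names, the default, and the `breaches_wait_budget` helper are all hypothetical.

```python
# Maximum acceptable wait for a human, in seconds, per interaction type.
# Every value here is an illustrative placeholder, not a recommendation.
MAX_WAIT_SECONDS = {
    "account_query": 120,
    "billing_dispute": 60,
    "fraud": 15,
}
DEFAULT_MAX_WAIT_SECONDS = 60  # used for types without an explicit budget


def breaches_wait_budget(interaction_type: str, wait_seconds: float) -> bool:
    """Return True when the wait exceeds the budget for this interaction type."""
    budget = MAX_WAIT_SECONDS.get(interaction_type, DEFAULT_MAX_WAIT_SECONDS)
    return wait_seconds > budget


# The same two-minute wait is acceptable for one type and not the other.
print(breaches_wait_budget("account_query", 120),
      breaches_wait_budget("fraud", 120))
```

Tracking the breach rate per type, rather than one average wait time, shows whether the slow escalations are landing on the interactions that can least afford them.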

The Klarna case is not remarkable because a company tried AI in customer service and had to adjust. That is normal product work. What makes it instructive is the length of the gap between when the efficiency metrics looked fine and when the problem was obvious — and the fact that this gap is compressible.

Better leading indicators, interaction-level quality signals, and deliberate escalation design would have surfaced the failures months earlier. The companies getting AI in support right in 2026 are treating 'it is working' as a hypothesis to test against the right metrics, not a conclusion that follows from volume numbers going up.

