Frequently asked questions

Was this outage Coinbase’s fault or AWS’s fault?

Both contributed, but assigning blame misses the more useful lesson. AWS’s cooling failure was the trigger and stayed contained to one availability zone, which is the scenario multi-AZ infrastructure is supposed to survive. The length and depth of the outage came from two things one layer up: a latency-driven placement decision Coinbase made for its matching engine, and a control-plane bug in a managed service that no customer could have directly prevented. The second kind of failure is the one every team running on managed cloud services should actually worry about, because picking a different provider wouldn’t have fixed it.

What is a cluster placement group, and why would a low-latency system use one?

A cluster placement group is an AWS feature that packs EC2 instances physically close together, on the same low-latency network fabric, usually within a single availability zone. Systems where network round-trip time sits on the critical path, like a consensus protocol coordinating trades, use placement groups to keep that round trip as short as possible. The trade-off is that "physically close together" and "spread across failure domains" pull in opposite directions, so the same placement choice that buys the latency also concentrates risk in one zone.

How is a Kafka control-plane failure different from a broker outage?

A broker outage is a node-health problem: a machine running Kafka goes down, and standard monitoring (CPU, disk, replication lag) usually catches it quickly. A control-plane failure affects leader election and partition assignment, the logic that decides which broker is allowed to accept writes for a given partition. In this incident, AWS’s managed Kafka service had a defect in exactly that logic, so producers could connect to a cluster that looked structurally fine and still be unable to write, because no broker had been assigned as leader for their partition.

Does this mean multi-AZ architecture doesn’t work?

No. It means multi-AZ architecture has to be checked at every layer, not just the compute layer. The parts of Coinbase’s infrastructure that were genuinely spread across availability zones, mostly stateless services, behaved as designed during this incident. Both failures happened in the two places where a deliberate engineering decision, made for good reasons, had quietly reintroduced a single-AZ dependency inside an otherwise multi-AZ system.

Engineering | Jun 23, 2026 | 8 min read | Reviewed Jun 23, 2026

Coinbase's AWS outage lasted 18 hours. The postmortem shows why multi-AZ didn't help.

A cooling failure took out one zone. Getting the rest of the system back took until the next afternoon.

By FlowVerify Editorial Team

On the evening of May 7, 2026, a bank of chiller units failed inside a single data hall in AWS's us-east-1 region. The affected racks went into thermal shutdown. Twenty-eight minutes later, nearly all trading on Coinbase had stopped. Engineers spent the rest of the night and most of the next morning bringing it back in stages, and the last queues didn't fully clear until 2 PM the following day. Eighteen hours, give or take, for a failure that AWS itself contained to one availability zone out of six in that region.

That gap between “one zone went down” and “eighteen hours of degraded trading” is the part worth sitting with. Coinbase designs for multi-AZ resilience, the same way most serious cloud-native companies do. A single zone failing is precisely the scenario that design exists to absorb without anyone outside the incident channel noticing. Coinbase's own postmortem, published in early June and covered in more technical detail by InfoQ, describes two specific places where the system was multi-AZ on the infrastructure layer and quietly wasn't, one layer up. Neither failure mode is specific to exchanges or to crypto. Both are worth fifteen minutes of checking against whatever you've built.

What actually failed, in what order

At 7:20 PM ET, multiple chiller units in a single AWS data hall failed at roughly the same time. The cooling loss forced a thermal-safety shutdown across the racks in that hall, which took their EC2 instances and EBS volumes offline. AWS's own failure domain held: the outage stayed contained to one availability zone, use1-az4, inside us-east-1. Under AWS's stated redundancy model, a region is supposed to keep running on its remaining zones when this happens.

For Coinbase, the region did not keep running without customer impact. By 7:48 PM, error rates had spiked across multiple services and nearly all trading had halted. Coinbase restored trading gradually: first in a cancel-only mode, then through periodic auctions, while two of its Kafka clusters stayed stuck in what its postmortem calls a “healing” state. Engineers performed manual partition reassignments at 3 AM to move topics off the impaired brokers. Priority-zero and priority-one topics reached full availability by 9:30 AM. The rest cleared by 2 PM, more than eighteen hours after the chillers first failed.

Two systems explain most of that gap: the exchange's matching engine, and its messaging layer. They failed for different reasons, and both reasons generalise well past Coinbase.

Where the multi-AZ design quietly became single-AZ

A matching engine pairs buy and sell orders, and on a major exchange its latency budget is measured in single-digit milliseconds. Coinbase runs its matching engine as a Raft consensus cluster, meaning a majority of nodes have to agree before any state change counts as durable. A network hop between AWS availability zones typically costs a few extra milliseconds round trip. For most services, that's nothing. For a consensus protocol sitting on the path of every single trade, it's overhead a trading system is built to avoid.

So the nodes sat close together: inside a single AWS cluster placement group, a primitive that exists specifically to put instances physically near each other for low, predictable network latency. It's a reasonable choice, and a common one for any system where consensus sits on the hot path. The cost of that choice shows up exactly once, when the availability zone the placement group lives in goes down. Three of the matching engine's five Raft nodes failed along with the AZ. A five-node Raft cluster needs three nodes to hold quorum. Losing exactly three meant losing quorum, and losing quorum meant the cluster couldn't safely process anything until engineers rebuilt it by hand.
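The quorum arithmetic in that paragraph is easy to check by hand, and just as easy to check in code. The following is a minimal sketch, not anything taken from Coinbase's systems; the function names `quorum_size` and `has_quorum` are invented for illustration.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, failed: int) -> bool:
    """True if the surviving nodes can still form a majority."""
    return cluster_size - failed >= quorum_size(cluster_size)


# A five-node cluster tolerates two failures and not three.
FIVE_NODE_OUTCOMES = {failed: has_quorum(5, failed) for failed in range(6)}
```

For a five-node cluster, `FIVE_NODE_OUTCOMES` is true for zero, one, or two failed nodes and false from three upward, which is the boundary the incident crossed.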

This is a general shape, not a crypto-specific one. Any leader-elected or quorum-based system built for low latency runs into the identical tension: a primary-replica database failover group, a distributed lock service, an in-memory cache cluster sitting in front of a hot read path. Collocating for latency and distributing for resilience pull in opposite directions. Most teams resolve that tension implicitly, by accepting whatever default their orchestration tooling happens to favour, rather than deciding it on purpose and writing the decision down.

Why the obvious fix isn't free

The obvious response, spreading the Raft group across more availability zones, is a real engineering trade-off, not a free upgrade. Stretching a five-node cluster across three zones in something like a 2-2-1 pattern means losing one zone costs you at most two nodes, which preserves quorum. It also means every write now has to clear a cross-zone round trip on the consensus path, every time, not just during an incident. Some systems split the difference with a lighter-weight arbiter or witness node in a third zone, one that participates in quorum decisions without holding a full replica, to get AZ-level resilience without paying full replication latency on every write.
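One way to compare layouts like these is to ask whether the voters left after the worst single-zone loss still form a majority. The sketch below does only that; it says nothing about latency, which is the other half of the trade-off. The zone names and the helper `survives_any_single_zone_loss` are hypothetical, and a witness is counted simply as one more voter.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def survives_any_single_zone_loss(layout: dict[str, int]) -> bool:
    """True if quorum holds after losing whichever zone has the most voters.

    `layout` maps a zone name to the number of voting members placed there.
    """
    total = sum(layout.values())
    worst_case_survivors = total - max(layout.values())
    return worst_case_survivors >= quorum_size(total)


# Illustrative layouts only; zone names are placeholders.
LAYOUTS = {
    "three voters in one zone": {"zone-a": 3, "zone-b": 2},
    "2-2 with no third zone": {"zone-a": 2, "zone-b": 2},
    "2-2-1 (the 1 may be a witness)": {"zone-a": 2, "zone-b": 2, "zone-c": 1},
}

RESULTS = {name: survives_any_single_zone_loss(zones) for name, zones in LAYOUTS.items()}
```

Only the 2-2-1 layout passes. The 2-2 case fails because four voters need three for a majority, which is why a single extra vote in a third zone changes the outcome.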

None of these options is strictly better. They're different points on the same latency-versus-resilience curve, and the right point depends on what a few milliseconds of added latency actually costs your product against what an AZ-level outage costs it. The mistake isn't picking the low-latency end of that curve. It's picking it without writing down, anywhere a postmortem author could find it later, what's being traded away to get there.

A managed service that looked healthy and wasn’t

The second failure has less to do with Coinbase's design choices and more to do with a blind spot that comes bundled with any managed service. Two of Coinbase's Amazon MSK clusters, AWS's managed Kafka offering, got stuck in a “healing” state during the incident. A defect in MSK's control plane stopped partition leaders from being reelected after the AZ outage took some brokers offline. Producers could still connect to the cluster. They just couldn't write to it.

That distinction matters more than it sounds like it should. A broker going down is a node-health problem, and node health is exactly what every standard Kafka dashboard watches: CPU, disk, under-replicated partitions, consumer lag. Leader election is a control-plane function, a layer above node health, and in a managed service that's the layer AWS operates on your behalf. When that layer breaks, your brokers can report green while producers sit there unable to write a single message, and the usual dashboards won't tell you why, because they were never pointed at that specific transition.

Kafka and MSK do expose the right signal, if anyone is alerting on it specifically. ActiveControllerCount tells you whether the cluster has a controller at all; it should read 1, and a sustained 0 means leader election has stalled. OfflinePartitionsCount tells you when partitions have no leader assigned, which is the precise failure mode at issue here. Neither metric tracks broker CPU or disk closely, which is exactly why a dashboard built around resource usage can stay green straight through this kind of failure. The same pattern shows up outside Kafka: an etcd cluster tracks leader changes separately from node health, a Postgres streaming replica tracks replication lag separately from instance load, and in both cases the control-plane signal is the one that actually tells you whether the system is doing its job. Every managed streaming or consensus service exposes some version of this signal if you go looking for it. The discipline is making sure someone is looking, and that the alert fires before a customer notices, not after.
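As a rough illustration of alerting on those two metrics rather than on resource usage, the sketch below evaluates a series of already-collected samples and fires only when a breach is sustained. It makes no calls to any monitoring API; how the samples are fetched is left out. The `Sample` type, the function name, and the default of three consecutive samples are assumptions for the example, and `active_controller_count` is assumed to be the cluster-wide value.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    """One cluster-wide reading of the two control-plane metrics."""
    active_controller_count: int
    offline_partitions_count: int


def control_plane_alerts(samples: Sequence[Sample], sustained: int = 3) -> List[str]:
    """Return alert messages when the last `sustained` samples all breach."""
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    if len(samples) < sustained:
        return []  # not enough data to call anything sustained
    recent = samples[-sustained:]
    alerts = []
    if all(s.active_controller_count != 1 for s in recent):
        alerts.append("ActiveControllerCount is not 1: controller missing or duplicated")
    if all(s.offline_partitions_count > 0 for s in recent):
        alerts.append("OfflinePartitionsCount is above 0: partitions have no leader")
    return alerts
```

Note that nothing in the check looks at CPU or disk, so it can fire while every broker-level dashboard is still green.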

The blocked writes cascaded in a straight line. Coinbase's fee service depended on those Kafka topics. Quoting depended on the fee service. Quoting failing is what most customers actually experienced: stuck trades and missing prices, not an error message that mentioned Kafka anywhere.
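A cascade like this can be traced mechanically once the dependencies are written down. The sketch below walks a small dependency map to list everything downstream of a failed component. The service names mirror the chain described above, but the map itself and the function `impacted_by` are illustrative assumptions, not Coinbase's actual service graph.

```python
from collections import deque
from typing import Dict, List

# service -> the things it directly depends on (illustrative only)
DEPENDS_ON: Dict[str, List[str]] = {
    "quoting": ["fee-service"],
    "fee-service": ["kafka-topics"],
    "kafka-topics": [],
}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Every service that depends on `failed`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failed}
    order: List[str] = []
    queue = deque([failed])
    while queue:  # breadth-first, so nearer dependents are listed first
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order
```

Calling `impacted_by("kafka-topics", DEPENDS_ON)` returns the fee service first and quoting second, the same order in which the failure reached customers.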

“Redundant infrastructure and redundant coordination are not the same property. Most architecture diagrams only draw the first one.”

Two recovery times, eighteen hours apart

Lay the timeline out in order and a second pattern shows up, separate from the two root causes.

Time (ET)	What happened
7:20 PM, May 7	Chiller failure triggers thermal shutdown in AZ use1-az4; EC2 and EBS in that zone go offline
7:48 PM, May 7	Error rates spike across services; nearly all trading halts
Overnight	Trading returns in cancel-only and auction modes while two MSK clusters stay stuck in a “healing” state
3:00 AM, May 8	Engineers manually reassign Kafka partitions off the impaired brokers
9:30 AM, May 8	Priority-zero and priority-one topics reach full availability
2:00 PM, May 8	Remaining topics clear; full recovery

What happened, and when

Two different numbers come out of that timeline, and postmortems tend to blur them into one. The customer-visible outage, the stretch between “trading stopped” and “trading is back in some form,” ran a few hours. Full recovery, the time until every queue had drained and every system was back to its normal operating state, took until early afternoon the next day. Both numbers are real and they answer different questions. The first is what you put on a status page and in a regulatory filing. The second is what your on-call rotation actually has to staff for, and it's usually the number nobody plans capacity around, because the dashboards turn green long before it's reached.
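The intervals that the timeline does pin down can be computed directly. The sketch below uses only the timestamps from the table, as naive Eastern-time values. The table gives no clock time for when cancel-only trading resumed, so the customer-visible figure is deliberately not computed. The dictionary keys and helper names are invented for the example.

```python
from datetime import datetime, timedelta

# Timestamps from the incident timeline, all Eastern Time.
TIMELINE = {
    "chillers_fail": datetime(2026, 5, 7, 19, 20),
    "trading_halts": datetime(2026, 5, 7, 19, 48),
    "manual_reassignment": datetime(2026, 5, 8, 3, 0),
    "priority_topics_restored": datetime(2026, 5, 8, 9, 30),
    "full_recovery": datetime(2026, 5, 8, 14, 0),
}


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named events in the timeline."""
    return TIMELINE[end] - TIMELINE[start]


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    hours, minutes = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours}h {minutes:02d}m"


TRIGGER_TO_HALT = fmt(elapsed("chillers_fail", "trading_halts"))
HALT_TO_PRIORITY_TOPICS = fmt(elapsed("trading_halts", "priority_topics_restored"))
TRIGGER_TO_FULL_RECOVERY = fmt(elapsed("chillers_fail", "full_recovery"))
```

The three results are 28 minutes, 13 hours 42 minutes, and 18 hours 40 minutes. Keeping them as separate named values is the point: each one answers a different question.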

What Coinbase says it's fixing

The remediation list in Coinbase's postmortem reads like a direct response to the two failures above: a warm, cross-zone standby for the matching engine, so a future AZ loss doesn't require rebuilding a Raft cluster by hand under pressure; faster and more automated quorum restoration; messaging infrastructure designed to tolerate the same control-plane failure mode; and disaster-recovery testing specifically against AZ-level failures, not just node-level ones.

All four are reasonable fixes for this incident. None of them generalises automatically to the next one, and that's really the point. Coinbase's architecture wasn't obviously wrong on May 6. It was a defensible latency trade-off that happened to line up badly with one specific failure mode. The fix that does generalise isn't on the remediation list, because it isn't a system change. It's a question, asked on a schedule, about every consensus-based or control-plane-dependent component a team operates, asked before the AZ that fails is theirs.

Four questions to ask about your own multi-AZ system

None of these require reading the Coinbase postmortem twice. They require fifteen minutes with whoever owns your highest-stakes consensus cluster.

Does your lowest-latency cluster collocate nodes in one AZ or placement group on purpose? If so, that decision already determines what happens when that AZ fails, whether or not anyone wrote it down.
Do you monitor control-plane behaviour (leader election, partition assignment) separately from node health for every managed service on your critical path? “The brokers are healthy” and “the cluster is doing its job” are different claims, and only one of them is usually instrumented.
Do you know your full recovery time under a backlog twice the size of a normal day, not just your time to first customer-visible recovery? They're rarely the same number, and only one of them tends to make it onto a status page.
Do two of your “independent” services share a hard dependency, like one Kafka topic or one database, that would turn a single logical failure into multiple customer-visible outages? Mapping this on a whiteboard usually takes less time than people expect, and the answer is usually yes somewhere.
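The third question above can be turned into rough arithmetic: a backlog drains only as fast as the gap between what consumers can process and what producers keep adding. The sketch below is a back-of-the-envelope estimate under the assumption of constant rates. Every number in it is hypothetical, and `drain_time_hours` is a name made up for the example.

```python
import math


def drain_time_hours(backlog_messages: float,
                     consume_rate_per_s: float,
                     produce_rate_per_s: float) -> float:
    """Hours to clear a backlog while new messages keep arriving.

    Returns infinity when consumers have no headroom over producers.
    """
    headroom = consume_rate_per_s - produce_rate_per_s
    if headroom <= 0:
        return math.inf
    return backlog_messages / headroom / 3600


# Hypothetical system: 1,000 messages/s arrive on a normal day,
# so one day of traffic is 86,400,000 messages.
NORMAL_DAY_MESSAGES = 1_000 * 86_400
BACKLOG = 2 * NORMAL_DAY_MESSAGES  # "twice the size of a normal day"

HOURS_TO_DRAIN = drain_time_hours(BACKLOG, consume_rate_per_s=5_000, produce_rate_per_s=1_000)
```

With these made-up rates the backlog takes 12 hours to clear, even though consumers run at five times the arrival rate. That figure is the full-recovery number, and it is independent of when the first customer-visible recovery happens.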

The next outage your team writes up probably won't involve a matching engine or a failed chiller. It will likely still have the same shape: a real, defensible latency decision, made by someone reasonable, that quietly moved the actual failure boundary to somewhere other than where the architecture diagram says it is.
