Coinbase's AWS outage lasted 18 hours. The postmortem shows why multi-AZ didn't help.
A cooling failure took out one zone. Getting the rest of the system back took until the next afternoon.
On the evening of May 7, 2026, a bank of chiller units failed inside a single data hall in AWS's us-east-1 region. The affected racks went into thermal shutdown. Twenty-eight minutes later, nearly all trading on Coinbase had stopped. Engineers spent the rest of the night and most of the next morning bringing it back in stages, and the last queues didn't fully clear until 2 PM the following day. Eighteen hours, give or take, for a failure that AWS itself contained to one availability zone out of six in that region.
That gap between “one zone went down” and “eighteen hours of degraded trading” is the part worth sitting with. Coinbase designs for multi-AZ resilience, the same way most serious cloud-native companies do. A single zone failing is precisely the scenario that design exists to absorb without anyone outside the incident channel noticing. Coinbase's own postmortem, published in early June and covered in more technical detail by InfoQ, describes two specific places where the system was multi-AZ on the infrastructure layer and quietly wasn't, one layer up. Neither failure mode is specific to exchanges or to crypto. Both are worth fifteen minutes of checking against whatever you've built.
What actually failed, in what order
At 7:20 PM ET, multiple chiller units in a single AWS data hall failed at roughly the same time. The cooling loss forced a thermal-safety shutdown across the racks in that hall, which took their EC2 instances and EBS volumes offline. AWS's own failure domain held: the outage stayed contained to one availability zone, use1-az4, inside us-east-1. Under AWS's stated redundancy model, a region is supposed to keep running on its remaining zones when this happens.
For Coinbase, it didn't, at least not without customer impact. By 7:48 PM, error rates had spiked across multiple services and nearly all trading had halted. Coinbase restored trading gradually: first in a cancel-only mode, then through periodic auctions, while two of its Kafka clusters stayed stuck in what its postmortem calls a “healing” state. Engineers performed manual partition reassignments at 3 AM to move topics off the impaired brokers. Priority-zero and priority-one topics reached full availability by 9:30 AM. The rest cleared by 2 PM, more than eighteen hours after the chillers first failed.
Two systems explain most of that gap: the exchange's matching engine, and its messaging layer. They failed for different reasons, and both reasons generalise well past Coinbase.
Where the multi-AZ design quietly became single-AZ
A matching engine pairs buy and sell orders, and on a major exchange its latency budget is measured in single-digit milliseconds. Coinbase runs its matching engine as a Raft consensus cluster, meaning a majority of nodes have to agree before any state change counts as durable. A network hop between AWS availability zones typically costs a few extra milliseconds round trip. For most services, that's nothing. For a consensus protocol sitting on the path of every single trade, it's overhead a trading system is built to avoid.
So the nodes sat close together: inside a single AWS cluster placement group, a primitive that exists specifically to put instances physically near each other for low, predictable network latency. It's a reasonable choice, and a common one for any system where consensus sits on the hot path. The cost of that choice shows up exactly once, when the availability zone the placement group lives in goes down. Three of the matching engine's five Raft nodes failed along with the AZ. A five-node Raft cluster needs three nodes to hold quorum. Losing exactly three meant losing quorum, and losing quorum meant the cluster couldn't safely process anything until engineers rebuilt it by hand.
This is a general shape, not a crypto-specific one. Any leader-elected or quorum-based system built for low latency runs into the identical tension: a primary-replica database failover group, a distributed lock service, an in-memory cache cluster sitting in front of a hot read path. Collocating for latency and distributing for resilience pull in opposite directions. Most teams resolve that tension implicitly, by accepting whatever default their orchestration tooling happens to favour, rather than deciding it on purpose and writing the decision down.
Why the obvious fix isn't free
The obvious response is to spread the Raft group across more availability zones, but that is a real engineering trade-off, not a free upgrade. Stretching a five-node cluster across three zones in something like a 2-2-1 pattern means losing one zone costs you at most two nodes, which preserves quorum. It also means every write now has to clear a cross-zone round trip on the consensus path, every time, not just during an incident. Some systems split the difference with a lighter-weight arbiter or witness node in a third zone, one that participates in quorum decisions without holding a full replica, to get AZ-level resilience without paying full replication latency on every write.
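The trade-off can be put into a few lines of arithmetic. The sketch below is illustrative only: the zone names, the 3-1-1 layout (the postmortem says three of five nodes were lost, not where the other two sat) and both round-trip figures are assumptions, not numbers from Coinbase or AWS. It checks whether a layout keeps a majority after losing any one zone, and estimates the round trip a leader waits on before a write can commit.

```python
from typing import Dict


def quorum(total_nodes: int) -> int:
    """Smallest majority of a cluster with total_nodes voting members."""
    return total_nodes // 2 + 1


def survives_any_single_zone_loss(layout: Dict[str, int]) -> bool:
    """True if a majority remains after losing any one zone in the layout."""
    total = sum(layout.values())
    return all(total - count >= quorum(total) for count in layout.values())


def commit_round_trip_ms(layout: Dict[str, int], leader_zone: str,
                         intra_ms: float, cross_ms: float) -> float:
    """Round trip to the slowest follower the leader must wait for.

    The leader counts toward the majority, so it needs quorum - 1 follower
    acknowledgements and can take the fastest ones available.
    """
    total = sum(layout.values())
    acks_needed = quorum(total) - 1
    if acks_needed == 0:
        return 0.0
    follower_rtts = []
    for zone, count in layout.items():
        followers = count - 1 if zone == leader_zone else count
        rtt = intra_ms if zone == leader_zone else cross_ms
        follower_rtts.extend([rtt] * followers)
    follower_rtts.sort()
    return follower_rtts[acks_needed - 1]


# Hypothetical layouts and placeholder latencies, for comparison only.
INTRA_MS, CROSS_MS = 0.2, 2.0
LAYOUTS = {
    "3-1-1 (majority in one zone)": {"zone-a": 3, "zone-b": 1, "zone-c": 1},
    "2-2-1 (stretched)": {"zone-a": 2, "zone-b": 2, "zone-c": 1},
}

for name, layout in LAYOUTS.items():
    safe = survives_any_single_zone_loss(layout)
    rtt = commit_round_trip_ms(layout, "zone-a", INTRA_MS, CROSS_MS)
    print(f"{name}: survives one zone loss={safe}, commit waits ~{rtt} ms")
```

Under these assumed numbers, the layout with three nodes in one zone commits at same-zone latency but cannot survive that zone failing, while the 2-2-1 layout survives any single zone loss and pays the cross-zone round trip on every write.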
None of these options is strictly better. They're different points on the same latency-versus-resilience curve, and the right point depends on what a few milliseconds of added latency actually costs your product against what an AZ-level outage costs it. The mistake isn't picking the low-latency end of that curve. It's picking it without writing down, anywhere a postmortem author could find it later, what's being traded away to get there.
A managed service that looked healthy and wasn't
The second failure has less to do with Coinbase's design choices and more to do with a blind spot that comes bundled with any managed service. Two of Coinbase's clusters on Amazon MSK, AWS's managed Kafka offering, got stuck in a “healing” state during the incident. A defect in MSK's control plane stopped partition leaders from being re-elected after the AZ outage took some brokers offline. Producers could still connect to the cluster. They just couldn't write to it.
That distinction matters more than it sounds like it should. A broker going down is a node-health problem, and node health is exactly what every standard Kafka dashboard watches: CPU, disk, under-replicated partitions, consumer lag. Leader election is a control-plane function, a layer above node health, and in a managed service that's the layer AWS operates on your behalf. When that layer breaks, your brokers can report green while producers sit there unable to write a single message, and the usual dashboards won't tell you why, because they were never pointed at that specific transition.
Kafka and MSK do expose the right signal, if anyone is alerting on it specifically. ActiveControllerCount tells you whether the cluster has a controller at all; it should read 1, and a sustained 0 means leader election has stalled. OfflinePartitionsCount tells you when partitions have no leader assigned, which is the precise failure mode at issue here. Neither metric tracks broker CPU or disk closely, which is exactly why a dashboard built around resource usage can stay green straight through this kind of failure. The same pattern shows up outside Kafka: an etcd cluster tracks leader changes separately from node health, a Postgres streaming replica tracks replication lag separately from instance load, and in both cases the control-plane signal is the one that actually tells you whether the system is doing its job. Every managed streaming or consensus service exposes some version of this signal if you go looking for it. The discipline is making sure someone is looking, and that the alert fires before a customer notices, not after.
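As a rough illustration of what alerting on that signal could look like, here is a minimal sketch of the decision logic alone. It assumes the two metrics have already been fetched as cluster-wide values, one sample per polling interval; the `ClusterSample` type, the function name and the three-sample threshold are hypothetical choices, and nothing here reflects how Coinbase or AWS actually monitors MSK.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ClusterSample:
    """One cluster-wide reading of the two control-plane metrics."""
    active_controller_count: int
    offline_partitions_count: int


def control_plane_alerts(samples: Sequence[ClusterSample],
                         sustained: int = 3) -> List[str]:
    """Return alert messages for the most recent samples (oldest first).

    A single zero controller reading can be a brief handover, so the
    controller alert fires only after `sustained` consecutive zeros.
    Offline partitions alert on the latest reading alone.
    """
    alerts: List[str] = []
    if not samples:
        return alerts  # a real check would likely treat missing data as a signal too

    recent = samples[-sustained:]
    if len(recent) == sustained and all(
        s.active_controller_count == 0 for s in recent
    ):
        alerts.append(f"no active controller for {sustained} consecutive samples")

    offline = samples[-1].offline_partitions_count
    if offline > 0:
        alerts.append(f"{offline} partitions have no leader")
    return alerts


healthy = [ClusterSample(1, 0)] * 5
stalled = [ClusterSample(1, 0)] + [ClusterSample(0, 4)] * 3
print(control_plane_alerts(healthy))
print(control_plane_alerts(stalled))
```

Note that neither input is a CPU, disk or broker-count reading: the check can fire while every resource graph is flat, which is the property a node-health dashboard lacks.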
The blocked writes cascaded in a straight line. Coinbase's fee service depended on those Kafka topics. Quoting depended on the fee service. Quoting failing is what most customers actually experienced: stuck trades and missing prices, not an error message that mentioned Kafka anywhere.
“Redundant infrastructure and redundant coordination are not the same property. Most architecture diagrams only draw the first one.”
Two recovery times, many hours apart
Lay the timeline out in order and a second pattern shows up, separate from the two root causes.
| Time (ET) | What happened |
|---|---|
| 7:20 PM, May 7 | Chiller failure triggers thermal shutdown in AZ use1-az4; EC2 and EBS in that zone go offline |
| 7:48 PM, May 7 | Error rates spike across services; nearly all trading halts |
| Overnight | Trading returns in cancel-only and auction modes while two MSK clusters stay stuck in a “healing” state |
| 3:00 AM, May 8 | Engineers manually reassign Kafka partitions off the impaired brokers |
| 9:30 AM, May 8 | Priority-zero and priority-one topics reach full availability |
| 2:00 PM, May 8 | Remaining topics clear; full recovery |
Two different numbers come out of that timeline, and postmortems tend to blur them into one. The customer-visible outage, the stretch between “trading stopped” and “trading is back in some form,” ran a few hours. Full recovery, the time until every queue had drained and every system was back to its normal operating state, took until early afternoon the next day. Both numbers are real and they answer different questions. The first is what you put on a status page and in a regulatory filing. The second is what your on-call rotation actually has to staff for, and it's usually the number nobody plans capacity around, because the dashboards turn green long before it's reached.
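The elapsed times behind those numbers follow directly from the table. This small sketch computes them from the published timestamps, all in ET; the milestone labels are paraphrases. The moment trading first returned in cancel-only mode is listed only as "overnight", so the customer-visible figure can't be computed from the timeline and is left out.

```python
from datetime import datetime

# Timestamps from the timeline table, all Eastern Time.
CHILLER_FAILURE = datetime(2026, 5, 7, 19, 20)
MILESTONES = {
    "trading halted": datetime(2026, 5, 7, 19, 48),
    "manual partition reassignment": datetime(2026, 5, 8, 3, 0),
    "P0/P1 topics fully available": datetime(2026, 5, 8, 9, 30),
    "full recovery": datetime(2026, 5, 8, 14, 0),
}


def elapsed(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as hours and minutes."""
    minutes = int((end - start).total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


for label, when in MILESTONES.items():
    print(f"{label}: {elapsed(CHILLER_FAILURE, when)} after the chiller failure")
```

Measured from the chiller failure, the halt came at 28 minutes, the priority topics were back at 14 hours 10 minutes, and full recovery landed at 18 hours 40 minutes.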
What Coinbase says it's fixing
The remediation list in Coinbase's postmortem reads like a direct response to the two failures above: a warm, cross-zone standby for the matching engine, so a future AZ loss doesn't require rebuilding a Raft cluster by hand under pressure; faster and more automated quorum restoration; messaging infrastructure designed to tolerate the same control-plane failure mode; and disaster-recovery testing specifically against AZ-level failures, not just node-level ones.
All four are reasonable fixes for this incident. None of them generalises automatically to the next one, and that's really the point. Coinbase's architecture wasn't obviously wrong on May 6. It was a defensible latency trade-off that happened to line up badly with one specific failure mode. The fix that does generalise isn't on the remediation list, because it isn't a system change. It's a question, asked on a schedule, about every consensus-based or control-plane-dependent component a team operates, asked before the AZ that fails is theirs.
Four questions to ask about your own multi-AZ system
None of these require reading the Coinbase postmortem twice. They require fifteen minutes with whoever owns your highest-stakes consensus cluster.
- Does your lowest-latency cluster collocate nodes in one AZ or placement group on purpose? If so, that decision already determines what happens when that AZ fails, whether or not anyone wrote it down.
- Do you monitor control-plane behaviour (leader election, partition assignment) separately from node health for every managed service on your critical path? “The brokers are healthy” and “the cluster is doing its job” are different claims, and only one of them is usually instrumented.
- Do you know your full recovery time under a backlog twice the size of a normal day, not just your time to first customer-visible recovery? They're rarely the same number, and only one of them tends to make it onto a status page.
- Do two of your “independent” services share a hard dependency, like one Kafka topic or one database, that would turn a single logical failure into multiple customer-visible outages? Mapping this on a whiteboard usually takes less time than people expect, and the answer is usually yes somewhere.
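The last question, the shared-dependency one, is the easiest to turn from a whiteboard exercise into a script. A minimal sketch, assuming you can write your hard dependencies down as a simple map: the quoting, fee service and Kafka chain mirrors the cascade described above, while `order-history` and all the identifiers are hypothetical, added only to show a shared dependency.

```python
from typing import Dict, List, Set

# service -> the things it cannot function without (hard dependencies).
# "order-history" is a made-up service; the names are illustrative only.
HARD_DEPS: Dict[str, List[str]] = {
    "quoting": ["fee-service"],
    "fee-service": ["kafka-fee-topics"],
    "order-history": ["kafka-fee-topics"],
}


def transitive_deps(service: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Everything a service depends on, directly or through other services."""
    seen: Set[str] = set()
    stack = list(deps.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # also guards against dependency cycles
            seen.add(dep)
            stack.extend(deps.get(dep, []))
    return seen


def shared_hard_deps(customer_facing: List[str],
                     deps: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Dependencies whose failure would hit more than one customer-facing service."""
    users: Dict[str, List[str]] = {}
    for service in customer_facing:
        for dep in transitive_deps(service, deps):
            users.setdefault(dep, []).append(service)
    return {dep: sorted(svcs) for dep, svcs in users.items() if len(svcs) > 1}


print(shared_hard_deps(["quoting", "order-history"], HARD_DEPS))
```

Anything this returns is a single logical failure that would surface as more than one customer-visible outage.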
The next outage your team writes up probably won't involve a matching engine or a failed chiller. It will likely still have the same shape: a real, defensible latency decision, made by someone reasonable, that quietly moved the actual failure boundary to somewhere other than where the architecture diagram says it is.