Railway disconnected a carrier to contain an outage. It cut its last route instead.
Inside the July 2 incident report: a containment call, a routine Linux networking default, and two hours of storage running at a third of normal speed.
At 07:44 UTC on July 2, 2026, Railway's monitoring caught packet loss on a carrier serving one of its US East data centres, a first-generation site older than the regions the company built after it started running its own bare-metal fleet. The response looked routine: identify the bad carrier, cut over to a second one, wait for the network to settle. A little over four hours later, at 12:01, the incident was resolved. But the Railway outage that mattered most wasn't the one everyone could see. It was the nearly two hours of storage traffic that followed, running at roughly a third of normal speed, caused by a containment decision and a well-known Linux networking default that most teams never get around to checking.
Railway published a detailed incident report the next day. It's worth reading in full, because it's an unusually honest breakdown of how a fix can end up worse than the fault it was fixing.
The 20 minutes with no route were the easy part of the Railway outage
The initial problem was simple. An upstream ISP was dropping packets on Railway's primary carrier at the US border. Engineers disconnected it at 07:44, which is the standard move: get the bad path out of the routing table before it does more damage.
Then, at 08:39, they disconnected the secondary carrier too.
That second disconnection was meant to be routine cleanup. Instead, it removed the site's only remaining default route. For roughly 20 minutes, the data centre had no stable path to the internet at all. Every request to a workload hosted there, application traffic, dashboard calls, deploy triggers, had nowhere to go.
The team caught it fast. The secondary carrier was reconnected at 08:59 and routing stabilised within minutes. If the story ended there, this would be a fairly ordinary carrier-flap incident: bad ISP, quick fix, twenty minutes of pain, done.
A containment decision that removed the last route out
The part worth sitting with is why the second carrier got disconnected in the first place. Nothing in the public report suggests it was still causing problems at 08:39. The disconnection reads as a precaution taken while the team was still working through the primary carrier's failure, not a response to a second, independent fault. In other words, the fix for the first problem is what created the second one.
This is a familiar shape for anyone who has run on-call. Under pressure, disconnecting a suspect path feels safe. It's reversible, it's contained, and it buys time to think. What it doesn't do automatically is check whether that path was secretly the only thing keeping the site reachable. A five-second sanity check, "if I cut this, what's left?", would have caught it. Most incident runbooks don't ask that question explicitly, because most of the time the answer is obvious. This was the time it wasn't.
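That "what's left?" question can be made mechanical. The following is a minimal sketch, not anything taken from Railway's report: it assumes the exit paths show up as `default` entries in `ip route show default` output on a Linux router, and the function name `remaining_default_routes` and the interface names are hypothetical. It does not handle multipath routes that list several `nexthop` entries under a single default.

```shell
#!/bin/sh
# Hypothetical pre-flight check: count the default routes that would remain
# if the interface named in $1 were disconnected.
# Reads `ip route show default` style output on stdin.
remaining_default_routes() {
    cut_dev="$1"
    awk -v dev="$cut_dev" '
        $1 == "default" {
            keep = 1
            # Drop this route from the count if it leaves via the interface
            # that is about to be cut.
            for (i = 2; i < NF; i++)
                if ($i == "dev" && $(i + 1) == dev) keep = 0
            if (keep) n++
        }
        END { print n + 0 }
    '
}

# Example usage on a live host (interface name is an assumption):
#   ip route show default | remaining_default_routes carrier1
```

A runbook step could refuse to continue when the function prints 0, which turns the five-second sanity check into a gate rather than a habit.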
The slower failure that came after
Routing stabilised at 08:59. On paper, the incident was mostly over. In practice, the data centre's storage layer stayed degraded until 10:45, almost two hours, running at roughly a third of its normal throughput. Two-thirds of the servers in that availability zone were stuck with elevated disk I/O wait.
The report's explanation is specific. During the roughly 20 minutes without a stable route, systems in the affected zone captured incorrect network paths as they scrambled to find any way to keep talking to each other. Once the real route came back, those stale paths didn't correct themselves. Storage traffic kept flowing over whatever path it had locked onto during the chaos, which for a meaningful chunk of the fleet turned out to be the slow management network instead of the fast storage network.
That's the part that should worry anyone running multi-homed Linux infrastructure, because it isn't specific to Railway's stack. It's a known category of bug.
Why this looks like a textbook Linux networking default
Linux boxes with more than one network interface, say one for storage, one for management, one for the public internet, inherit a default behaviour sometimes called the weak host model. By default, the kernel will answer an ARP request for any IP address configured on the box, on any interface, regardless of which interface that address is supposed to live on. Under normal conditions this is invisible. The right traffic goes to the right interface because the right routes exist and nothing forces a detour.
The trouble starts when a preferred path degrades or disappears, as Railway's storage network briefly did. If a slower, unrelated network, like a management network, can also reach the same host, the weak host model lets that host keep answering for its storage address there too. Traffic that should have failed loudly instead limps along at whatever speed the fallback network can manage. From the outside, nothing looks broken. Health checks pass. Connections succeed. They're just running at a third of the speed they should be, and nothing is designed to alert on "technically still up, but slow for no visible reason."
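One way to look for this symptom on a host is to check whether any storage-network address has been learned on an interface other than the storage one. The sketch below illustrates the general mechanism only; it is not Railway's diagnosis. It assumes the neighbour table is printed by `ip neigh show`, and the function name, the address prefix `10.20.` and the interface names `stor0` and `mgmt0` are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: list neighbour entries for storage addresses that were
# learned on any interface other than the expected storage interface.
# Reads `ip neigh show` style output on stdin.
#   $1 = storage address prefix, matched as a plain string (e.g. "10.20.")
#   $2 = expected storage interface name (e.g. "stor0")
misplaced_storage_neighbours() {
    awk -v prefix="$1" -v want="$2" '
        index($1, prefix) == 1 && $2 == "dev" && $3 != want { print $1, $3 }
    '
}

# Example usage on a live host (prefix and interface are assumptions):
#   ip neigh show | misplaced_storage_neighbours 10.20. stor0
```

Any line it prints is a storage peer that this host currently resolves over a different network, which is exactly the state that passes a reachability check while running slowly.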
Railway's report doesn't use the term "weak host model," but the mechanism it describes, servers continuing to answer for storage-network addresses over a different, slower network once the preferred path was disrupted, is a close match for this well-documented class of failure. The standard mitigations are not exotic:
```shell
# Prevent a multi-homed host from answering ARP requests
# for an address on the "wrong" interface
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
# Enforce strict reverse-path filtering so a host drops packets
# whose source address would not be routed back out the
# interface they arrived on
sysctl -w net.ipv4.conf.all.rp_filter=1
```

None of these settings is a complete fix on its own. The deeper answer is usually to keep storage and management traffic on genuinely separate L2 segments or VRFs so the ambiguity can't exist in the first place. But the sysctl settings above are the cheapest first check, and on most fleets, nobody has ever gone looking to see whether they're set.
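Before changing anything, it helps to see what a host is currently running with. The following is a read-only sketch under stated assumptions: the function name `audit_arp_sysctls` is hypothetical, and it relies on the kernel exposing per-interface settings as files under `/proc/sys/net/ipv4/conf/<interface>/`. The optional argument exists only so the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# Hypothetical audit helper: print arp_ignore, arp_announce and rp_filter
# for every interface directory under a proc-style tree.
#   $1 = directory to scan (defaults to the live kernel tree)
audit_arp_sysctls() {
    root="${1:-/proc/sys/net/ipv4/conf}"
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        iface=$(basename "$dir")
        for key in arp_ignore arp_announce rp_filter; do
            # Skip keys that are missing or unreadable instead of failing.
            if [ -r "$dir$key" ]; then
                printf '%s %s=%s\n' "$iface" "$key" "$(cat "$dir$key")"
            fi
        done
    done
}

# Example usage on a live host:
#   audit_arp_sysctls
```

When reading the output, keep in mind that for all three settings the kernel applies the higher of the `all` value and the per-interface value, so a non-zero `all` entry covers every interface even where the interface's own entry still reads 0.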
20,000 blackholed links, and why the silent failure outlasted the loud one
The scale here matters. At peak, roughly 20,000 host-to-host private network links across the affected zone were blackholed. This wasn't one unlucky server quietly misrouting its own traffic. It was a mesh-wide effect touching a large share of the zone at once, which is consistent with a shared default, not a single misconfigured box, being the root cause.
| Time | Event | Impact |
|---|---|---|
| 07:44 | Packet loss detected on primary carrier; incident declared | Primary path degraded |
| 07:44–08:32 | Primary carrier disconnected at US border | Traffic rerouted to secondary carrier |
| 08:39 | Secondary carrier disconnected | Last default route removed |
| 08:39–08:59 | No stable internet route to the site | ~20 minutes of severe, visible impact |
| 08:59 | Secondary carrier reconnected | Routing stabilised |
| 09:00–10:45 | Stale paths persist from the reroute window | Storage throughput at ~33% of normal |
| 10:45 | Root cause identified: stuck connections on management network | Fix applied |
| 11:04–12:01 | Disk I/O wait normalises; private mesh recovers | Incident resolved |
The instructive part of that table is the gap between the two failure types. The hard failure, no route at all, lasted 20 minutes and was impossible to miss. The soft failure, storage at a third of capacity, lasted more than five times as long and, by the report's own account, took until 10:45 to even get diagnosed. A monitoring stack tuned to catch "is this host up" will sail straight past "this host is up but running at 33% of its normal throughput." That gap is where the real cost of this incident lived.
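The distinction between "down" and "slow" can be expressed as a three-way classification rather than a boolean health check. This is a minimal sketch, not a monitoring design: the function name `classify_throughput`, the 70% default threshold and the idea of a stored per-host baseline are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical classifier: compare a measured throughput against a baseline.
#   $1 = current throughput (any unit, as long as it matches the baseline)
#   $2 = baseline throughput for this host
#   $3 = minimum acceptable percentage of baseline (assumed default: 70)
# Prints one of: ok, slow, down, unknown
classify_throughput() {
    awk -v cur="$1" -v base="$2" -v pct="${3:-70}" 'BEGIN {
        if (base <= 0) { print "unknown"; exit }
        if (cur <= 0)  { print "down"; exit }
        if (cur * 100 < base * pct) print "slow"; else print "ok"
    }'
}

# Example usage (numbers are illustrative):
#   classify_throughput 330 1000    # a host at roughly a third of baseline
```

With the assumed threshold, a host at roughly a third of its baseline is reported as `slow` instead of passing as healthy, which is the state a reachability-only check never surfaces.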
What Railway committed to fix
The report lists four concrete remediation steps: migrate first-generation sites so they generate their own default routes instead of depending on external carrier state; make production traffic fail cleanly and immediately rather than silently falling back onto the management network; add alerting specifically on management-network load and on blackholed private links; and only reconnect a degraded carrier once its backbone recovery is independently verified, rather than on a fixed timer or gut feel.
Every one of those is aimed at removing a silent failure mode and replacing it with a loud one. That's the correct instinct. A system that fails hard when its assumptions break is easier to operate than one that degrades gracefully into a state nobody is watching for. Graceful degradation is only a virtue if someone built a way to notice it's happening.
The checklist this hands every team running multi-homed Linux
None of this requires running your own data centres to be relevant. Three checks translate directly to almost any fleet with more than one network per host.
- Check arp_ignore, arp_announce, and rp_filter on any multi-homed box where a fast network (storage, internal RPC) sits alongside a slower one (management, out-of-band). If they're at their defaults, you have the same latent exposure Railway did.
- Add a rule to your incident runbooks that any "disconnect the degraded path" step is followed by an explicit statement of what path remains. It costs five seconds and it's the single change that would have shortened this incident the most.
- Audit your alerting for the difference between "down" and "slow." If your health checks only test reachability, a fallback path that works but crawls will never page anyone until a customer notices first.
Railway's postmortem is a reminder that redundancy isn't just about having a second carrier, a second cloud, or a second network. It's about knowing, with certainty, what happens to traffic in the seconds after the first path stops being the one carrying it.