What actually caused Railway's July 2, 2026 outage?

An upstream ISP was dropping packets on Railway's primary carrier at a US East data centre. Engineers disconnected the primary carrier, then disconnected the secondary one as a precaution, which removed the site's only remaining default route for about 20 minutes. Once routing was restored, stale network paths captured during that window kept a large share of storage traffic on the slow management network instead of the fast storage network for close to two more hours.

Is this the same failure as Railway's May 19, 2026 GCP outage?

No. The May 19 incident was caused by Google Cloud suspending Railway's production account, which took down a control plane that Railway's edge proxies depended on for routing. The July 2 incident happened entirely within Railway's own bare-metal infrastructure and involved carrier connectivity and a Linux networking default, not any cloud provider account action.

What is the weak host model, and why does it cause a silent slowdown instead of a hard failure?

On Linux, the weak host model is a default behaviour where a multi-homed server will answer an ARP request for any of its configured IP addresses on any of its network interfaces, not only the interface that address is meant to belong to. If a fast network path degrades and a slower network can still reach the same host, traffic can keep flowing over that slower path without any connection actually failing, so health checks that only test reachability won't catch it.

How can a team check if its own infrastructure is exposed to this failure mode?

Check the arp_ignore, arp_announce, and rp_filter sysctl settings on any multi-homed host where a fast network sits alongside a slower one. If they are at their Linux defaults, the same exposure exists. Longer term, keep storage and management traffic on separate L2 segments or VRFs, and add alerting that distinguishes a slow path from a down one.

Engineering · Jul 5, 2026 · 6 min read · Reviewed Jul 5, 2026

Railway disconnected a carrier to contain an outage. It cut its last route instead.

Inside the July 2 incident report: a containment call, a routine Linux networking default, and two hours of storage running at a third of speed.

By FlowVerify Editorial Team

At 07:44 UTC on July 2, 2026, Railway's monitoring caught packet loss on a carrier serving one of its US East data centres, a first-generation site older than the regions the company built after it started running its own bare-metal fleet. The response looked routine: identify the bad carrier, cut over to a second one, wait for the network to settle. A little over four hours later, at 12:01, the incident was resolved. But the Railway outage that mattered most wasn't the one everyone could see. It was the two hours of storage traffic that followed, running at roughly a third of normal speed, caused by a containment decision and a well-known Linux networking default that most teams never get around to checking.

Railway published a detailed incident report the next day. It's worth reading in full, because it's an unusually honest breakdown of how a fix can end up worse than the fault it was fixing.

The first 20 minutes of the Railway outage were the easy part

The initial problem was simple. An upstream ISP was dropping packets on Railway's primary carrier at the US border. Engineers disconnected it at 07:44, which is the standard move: get the bad path out of the routing table before it does more damage.

Then, at 08:39, they disconnected the secondary carrier too.

That second disconnection was meant to be routine cleanup. Instead, it removed the site's only remaining default route. For roughly 20 minutes, the data centre had no stable path to the internet at all. Every request to a workload hosted there, application traffic, dashboard calls, deploy triggers, had nowhere to go.

The team caught it fast. The secondary carrier was reconnected at 08:59 and routing stabilised within minutes. If the story ended there, this would be a fairly ordinary carrier-flap incident: bad ISP, quick fix, twenty minutes of pain, done.

A containment decision that removed the last route out

The part worth sitting with is why the second carrier got disconnected in the first place. Nothing in the public report suggests it was still causing problems at 08:39. The disconnection reads as a precaution taken while the team was still working through the primary carrier's failure, not a response to a second, independent fault. In other words, the fix for the first problem is what created the second one.

This is a familiar shape for anyone who has run on-call. Under pressure, disconnecting a suspect path feels safe. It's reversible, it's contained, and it buys time to think. What it doesn't do automatically is check whether that path was secretly the only thing keeping the site reachable. A five-second sanity check, "if I cut this, what's left?", would have caught it. Most incident runbooks don't ask that question explicitly, because most of the time the answer is obvious. This was the time it wasn't.
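The "if I cut this, what's left?" question is easy to script. What follows is a minimal sketch, not anything from Railway's tooling: it assumes the single-line format that iproute2's `ip route show default` prints, and the function names and the interface names (`carrier-a`, `carrier-b`) are hypothetical. It does not handle multipath (ECMP) default routes, whose next hops are printed on continuation lines.

```shell
#!/bin/sh
# Sketch: list the default routes that would survive if one interface were cut.
# Reads `ip route show default` style lines on stdin.
#   $1 = interface about to be disconnected (hypothetical name)
remaining_defaults() {
    cut_dev="$1"
    awk -v cut="$cut_dev" '
        $1 == "default" {
            dev = ""
            # Find the token that follows the word "dev" on this line
            for (i = 2; i < NF; i++) if ($i == "dev") dev = $(i + 1)
            if (dev != cut) print
        }'
}

# Refuse the disconnect when no default route would be left.
safe_to_cut() {
    left=$(remaining_defaults "$1")
    if [ -z "$left" ]; then
        echo "REFUSE: cutting $1 removes the last default route"
        return 1
    fi
    echo "OK: default routes remaining after cutting $1:"
    echo "$left"
}
```

On a live host this would be run as `ip route show default | safe_to_cut carrier-b` before the disconnect step. A non-zero exit status means nothing would remain.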

The slower failure that came after

Routing stabilised at 08:59. On paper, the incident was mostly over. In practice, the data centre's storage layer stayed degraded until 10:45, almost two hours, running at roughly a third of its normal throughput. Two-thirds of the servers in that availability zone were stuck with elevated disk I/O wait.

The report's explanation is specific. During the roughly 20 minutes without a stable route, systems in the affected zone captured incorrect network paths as they scrambled to find any way to keep talking to each other. Once the real route came back, those stale paths didn't correct themselves. Storage traffic kept flowing over whatever path it had locked onto during the chaos, which for a meaningful chunk of the fleet turned out to be the slow management network instead of the fast storage network.

That's the part that should worry anyone running multi-homed Linux infrastructure, because it isn't specific to Railway's stack. It's a known category of bug.

Why this looks like a textbook Linux networking default

Linux boxes with more than one network interface, say one for storage, one for management, one for the public internet, inherit a default behaviour sometimes called the weak host model. By default, the kernel will answer an ARP request for any IP address configured on the box, on any interface, regardless of which interface that address is supposed to live on. Under normal conditions this is invisible. The right traffic goes to the right interface because the right routes exist and nothing forces a detour.

The trouble starts when a preferred path degrades or disappears, as Railway's storage network briefly did. If a slower, unrelated network, like a management network, can also reach the same host, the weak host model lets that host keep answering for its storage address there too. Traffic that should have failed loudly instead limps along at whatever speed the fallback network can manage. From the outside, nothing looks broken. Health checks pass. Connections succeed. They're just running at a third of the speed they should be, and nothing is designed to alert on "technically still up, but slow for no visible reason."
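One way to look for this from a peer host is to check the neighbour (ARP) table for storage addresses that were resolved on the wrong interface. The sketch below assumes the `<ip> dev <iface> ...` line format printed by `ip -4 neigh show`. The prefix `10.20.` and the interface `eth1` are made-up examples, and the match is a plain string prefix, not a real CIDR test. It illustrates the idea and is not a reconstruction of how Railway diagnosed the problem.

```shell
#!/bin/sh
# Sketch: flag neighbour entries for storage-network addresses that were
# learned on some other interface. Reads `ip -4 neigh show` lines on stdin.
#   $1 = storage address prefix as a plain string, e.g. "10.20." (assumed)
#   $2 = interface that should carry storage traffic, e.g. "eth1" (assumed)
misplaced_neighbours() {
    awk -v prefix="$1" -v want="$2" '
        index($1, prefix) == 1 && $2 == "dev" && $3 != want {
            print $1 " learned on " $3 " (expected " want ")"
        }'
}
```

Run as `ip -4 neigh show | misplaced_neighbours "10.20." eth1`. Any output means a storage peer is currently being reached through an interface other than the storage one.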

Railway's report doesn't use the term "weak host model," but the mechanism it describes, servers continuing to answer for storage-network addresses over a different, slower network once the preferred path was disrupted, is a close match for this well-documented class of failure. The standard mitigations are not exotic:

multi-homed-host-sysctl.sh

# Prevent a multi-homed host from answering ARP requests
# for an address on the "wrong" interface
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2

# Enforce strict reverse-path filtering so a host rejects
# traffic that arrives on an interface it shouldn't
sysctl -w net.ipv4.conf.all.rp_filter=1

None of these settings is a complete fix on its own. The deeper answer is usually to keep storage and management traffic on genuinely separate L2 segments or VRFs so the ambiguity can't exist in the first place. But the sysctl settings above are the cheapest first check, and on most fleets, nobody has ever gone looking to see whether they're set.
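For the VRF route, here is a rough sketch of what the setup looks like with iproute2. The function only prints the commands so they can be reviewed before anyone runs them as root. The VRF name, table number, and interface name are all hypothetical, and a real migration also has to account for how services bind to addresses inside the VRF.

```shell
#!/bin/sh
# Sketch: print (not run) the iproute2 commands that would place one
# interface in its own VRF. All names and the table id are hypothetical.
#   $1 = VRF device name, $2 = routing table id, $3 = interface to enslave
vrf_plan() {
    echo "ip link add $1 type vrf table $2"
    echo "ip link set dev $1 up"
    echo "ip link set dev $3 master $1"
}
```

For example, `vrf_plan vrf-storage 100 eth1` prints the three commands that would give the storage interface its own routing table, separate from the one the management interface uses.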

20,000 blackholed links, and why silent beats loud

The scale here matters. At peak, roughly 20,000 host-to-host private network links across the affected zone were blackholed. This wasn't one unlucky server quietly misrouting its own traffic. It was a mesh-wide effect touching a large share of the zone at once, which is consistent with a shared default, not a single misconfigured box, being the root cause.

Time	Event	Impact
07:44	Packet loss detected on primary carrier; incident declared	Primary path degraded
07:44–08:32	Primary carrier disconnected at US border	Traffic rerouted to secondary carrier
08:39	Secondary carrier disconnected	Last default route removed
08:39–08:59	No stable internet route to the site	~20 minutes of severe, visible impact
08:59	Secondary carrier reconnected	Routing stabilised
09:00–10:45	Stale paths persist from the reroute window	Storage throughput at ~33% of normal
10:45	Root cause identified: stuck connections on management network	Fix applied
11:04–12:01	Disk I/O wait normalises; private mesh recovers	Incident resolved

Railway's July 2, 2026 incident timeline (UTC)

The instructive part of that table is the gap between the two failure types. The hard failure, no route at all, lasted 20 minutes and was impossible to miss. The soft failure, storage at a third of capacity, lasted more than five times as long and, by the report's own account, took until 10:45 to even get diagnosed. A monitoring stack tuned to catch "is this host up" will sail straight past "this host is up but running at 33% of its normal throughput." That gap is where the real cost of this incident lived.
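A check that separates "slow" from "down" does not need to be elaborate. The sketch below assumes a throughput figure is already available from somewhere, such as a periodic transfer test or a storage metric, and compares it against a known baseline. The function name and the 70% default threshold are illustrative choices, not values from Railway's report.

```shell
#!/bin/sh
# Sketch: classify a path as ok / slow / down from a measured throughput.
#   $1 = measured throughput, $2 = baseline (same unit, whole numbers)
#   $3 = minimum acceptable percentage of baseline (assumed default: 70)
classify_path() {
    measured="$1"; baseline="$2"; min_pct="${3:-70}"
    if [ "$measured" -le 0 ]; then
        echo down
    elif [ $((measured * 100)) -lt $((baseline * min_pct)) ]; then
        echo slow
    else
        echo ok
    fi
}
```

With a baseline of 1000 and a measurement of 330, roughly the ratio described in this incident, `classify_path 330 1000` reports `slow`. A reachability-only check would have reported the same host as healthy.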

What Railway committed to fix

The report lists four concrete remediation steps: migrate first-generation sites so they generate their own default routes instead of depending on external carrier state; make production traffic fail cleanly and immediately rather than silently falling back onto the management network; add alerting specifically on management-network load and on blackholed private links; and only reconnect a degraded carrier once its backbone recovery is independently verified, rather than on a fixed timer or gut feel.
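The management-network alerting item is the easiest of the four to approximate on an existing fleet. Here is a minimal sketch using the byte counters Linux exposes under `/sys/class/net/<iface>/statistics/`. The interface name and the sampling window are assumptions, and a real deployment would more likely feed these counters into an existing metrics pipeline than poll them from a script.

```shell
#!/bin/sh
# Sketch: estimate the receive rate on an interface from sysfs byte counters.

# Convert two byte-counter samples into whole megabits per second.
#   $1 = earlier byte count, $2 = later byte count, $3 = seconds between them
rate_mbit() {
    echo $(( ($2 - $1) * 8 / $3 / 1000000 ))
}

# Sample rx_bytes twice and print the rate.
#   $1 = interface name (hypothetical, e.g. mgmt0), $2 = window in seconds
mgmt_rx_mbit() {
    a=$(cat "/sys/class/net/$1/statistics/rx_bytes") || return 1
    sleep "$2"
    b=$(cat "/sys/class/net/$1/statistics/rx_bytes") || return 1
    rate_mbit "$a" "$b" "$2"
}
```

A management interface that normally idles at a few megabits and suddenly carries hundreds is a strong hint that production traffic has fallen back onto it.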

Every one of those is aimed at removing a silent failure mode and replacing it with a loud one. That's the correct instinct. A system that fails hard when its assumptions break is easier to operate than one that degrades gracefully into a state nobody is watching for. Graceful degradation is only a virtue if someone built a way to notice it's happening.

The checklist this hands every team running multi-homed Linux

None of this requires running your own data centres to be relevant. Three checks translate directly to almost any fleet with more than one network per host.

Check arp_ignore, arp_announce, and rp_filter on any multi-homed box where a fast network (storage, internal RPC) sits alongside a slower one (management, out-of-band). If they're at their defaults, you have the same latent exposure Railway did.
Add a rule to your incident runbooks that any "disconnect the degraded path" step is followed by an explicit statement of what path remains. It costs five seconds and it's the single change that would have shortened this incident the most.
Audit your alerting for the difference between "down" and "slow." If your health checks only test reachability, a fallback path that works but crawls will never page anyone until a customer notices first.
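The first check can be run in a few seconds. The sketch below reads the three settings for every interface straight from `/proc/sys/net/ipv4/conf`, which is where `sysctl` itself reads them. The function name is hypothetical, and the optional argument exists only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Sketch: report arp_ignore, arp_announce and rp_filter for every interface.
#   $1 = conf root; defaults to the real path, overridable for testing
audit_conf() {
    root="${1:-/proc/sys/net/ipv4/conf}"
    for dir in "$root"/*/; do
        iface=$(basename "$dir")
        for key in arp_ignore arp_announce rp_filter; do
            if [ -r "$dir$key" ]; then
                echo "$iface $key=$(cat "$dir$key")"
            fi
        done
    done
}
```

When reading the output, keep in mind that for all three settings the kernel applies the higher of the `all` value and the per-interface value, so an interface showing `0` is still covered if `all` is set.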

Railway's postmortem is a reminder that redundancy isn't just about having a second carrier, a second cloud, or a second network. It's about knowing, with certainty, what happens to traffic in the seconds after the first path stops being the one carrying it.

