A Plain-English Technical Breakdown of the October 2025 AWS Outage
🚨 Introduction
On October 22, 2025, the internet collectively took a deep breath and then held it. From major websites like Snapchat, Reddit, Signal, and Slack to smaller SaaS tools, everything seemed slower or outright broken.
When half the internet blinks, the usual suspect is AWS (Amazon Web Services), the backbone of thousands of applications worldwide. And once again, AWS’s us-east-1 region, its largest and most critical hub, stumbled for over six hours.
So what actually happened this time? Let’s unpack it.
🕒 Timeline of Events
- Start: Around 9:15 AM ET, October 22, 2025
- Peak Impact: 10:00 AM – 2:00 PM ET
- Duration: ~6 hours of partial to full service disruption
- Affected Region: Primarily us-east-1, with ripple effects across North America and Europe
- Key Services Impacted: Route 53 (DNS), Lambda, DynamoDB, EC2, and internal AWS management tools
Developers around the world quickly noticed cloud dashboards were timing out, CI/CD pipelines were hanging, and users couldn’t even log in to AWS consoles.

⚙️ What Went Wrong
The root cause was a race condition triggered by an automation bug in AWS’s DNS management system.
To understand that, let’s first break down two key terms:
- DNS (Domain Name System): Think of DNS as the phonebook of the internet. When you type amazon.com, DNS tells your browser where that server actually lives. If DNS fails, the internet feels like it’s “down” because no one can find anything. (A short lookup sketch follows this list.)
- DynamoDB: A fast, highly scalable NoSQL database AWS uses internally (and offers externally). Many internal AWS services store and retrieve configuration data through DynamoDB.
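To see the “phonebook” in action, a few lines of Python are enough. This is only an illustration of ordinary name resolution, not anything specific to AWS’s internal systems:

```python
import socket

def where_is(hostname):
    """Ask DNS for the address behind a hostname, roughly the way a browser would."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror as err:
        # This is what "the internet is down" looks like to an application:
        # the remote servers may be perfectly healthy, but nobody can find them.
        return f"lookup failed: {err}"

print(where_is("amazon.com"))
```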
Here’s the short version: an automated DNS configuration process made conflicting updates to internal DNS tables stored in DynamoDB. Those updates collided, causing a race condition that corrupted routing data.
When the DNS system got confused, it started sending bad routing info. That broke communication between core AWS systems, and the chaos spread fast.
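To make “race condition” concrete, here is a toy version of the failure mode. This is not AWS’s code; the hostnames and addresses are made up. It just shows how two automated writers doing read-modify-write on shared state can silently overwrite each other:

```python
import threading
import time

# Toy routing table: hostname -> IP address (purely illustrative).
routing_table = {"internal.service.example": "10.0.0.1"}

def automated_update(new_ip, work_time):
    snapshot = dict(routing_table)        # 1. read the current state
    time.sleep(work_time)                 # 2. spend some time "planning"
    snapshot["internal.service.example"] = new_ip
    routing_table.clear()
    routing_table.update(snapshot)        # 3. write the whole snapshot back

# Two updaters run at nearly the same instant; whichever writes last wins,
# even though its snapshot is already stale by then.
t1 = threading.Thread(target=automated_update, args=("10.0.0.2", 0.2))
t2 = threading.Thread(target=automated_update, args=("10.0.0.3", 0.1))
t1.start(); t2.start()
t1.join(); t2.join()

print(routing_table)  # the slower writer's stale snapshot clobbers the faster one's update
```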
🔍 Technical Breakdown
AWS runs its DNS systems through two key internal automation services:
- DNS Planner: Prepares changes to routing tables and plans updates.
- DNS Enactor: Actually applies those changes across data centers.
During a routine internal update, a flaw in DNS Planner’s automation allowed multiple processes to push overlapping updates. Normally, these updates happen in a controlled queue, but a timing bug let two planners modify the same records at nearly the same instant.
This race condition left stale and conflicting entries in DynamoDB. Since DNS Enactor depends on DynamoDB’s integrity, it started pulling bad configurations and pushing them live, effectively breaking internal name resolution.
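A common defence against exactly this failure mode is optimistic locking: every plan carries a version number, and a writer may only apply its plan if nothing newer has landed. The sketch below uses DynamoDB conditional writes via boto3; the table and attribute names are hypothetical, and this is a general pattern, not a description of AWS’s internal tooling:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical table holding routing plans; "pk" and "version" are made-up attributes.
table = boto3.resource("dynamodb").Table("dns_routing_plans")

def apply_plan(record_key: str, plan: dict, plan_version: int) -> bool:
    """Write a plan only if it is newer than whatever is already stored."""
    try:
        table.put_item(
            Item={"pk": record_key, "plan": plan, "version": plan_version},
            ConditionExpression="attribute_not_exists(version) OR version < :v",
            ExpressionAttributeValues={":v": plan_version},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # A newer (or equal) plan is already live; drop this stale update
            # instead of clobbering good routing data.
            return False
        raise
```

With a guard like this, a delayed or duplicate writer fails loudly instead of silently corrupting shared state. In the outage, the corrupted data did go live, and the effects spread quickly.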
What followed was a cascade failure:
- Lambda functions couldn’t connect to DynamoDB.
- EC2 instances couldn’t resolve DNS queries.
- Route 53 struggled to update routes.
- Monitoring and control planes within AWS itself became partially blind.
To put it simply, one bad DNS update propagated like a domino effect across multiple AWS layers.
🧩 Recovery Process
AWS engineers quickly isolated the faulty DNS Planner processes and rolled back the last batch of changes. They then purged corrupted DynamoDB entries and re-synced DNS records using backups from unaffected regions.
Full recovery was gradual: service interdependencies meant some systems only came back online hours later. By around 3:30 PM ET, most customer-facing services were restored, though monitoring tools and some Lambda triggers lagged into the evening.
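The “re-sync from an unaffected region” step is conceptually simple, even if the real operation was anything but. Here is a rough sketch of the idea, with hypothetical table names and no claim to match AWS’s actual runbook:

```python
import boto3

# Hypothetical: copy known-good records from a healthy region into the damaged one.
source = boto3.resource("dynamodb", region_name="us-west-2").Table("dns_records")
target = boto3.resource("dynamodb", region_name="us-east-1").Table("dns_records")

def resync():
    """Scan the healthy copy and overwrite the corrupted one, page by page."""
    scan_kwargs = {}
    with target.batch_writer() as writer:
        while True:
            page = source.scan(**scan_kwargs)
            for item in page["Items"]:
                writer.put_item(Item=item)
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```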
🌍 Real-World Impact
- Affected Platforms: Snapchat, Reddit, Signal, Slack, and many smaller SaaS apps experienced downtime or slow performance.
- Developer Tools: GitHub Actions, Vercel deployments, and CI/CD pipelines using AWS-backed storage also saw failures.
- Estimated Financial Loss: Around $581 million in total global impact due to service interruptions, ad revenue loss, and downtime.
For users, it felt like “the internet broke.” For engineers, it was a live demonstration of how interconnected and fragile distributed systems can be.
💡 Lessons for Engineers
Redundancy is non-negotiable. Relying solely on a single region (like us-east-1) is a recipe for downtime. Multi-region architectures save more than they cost.
Graceful degradation matters. Design systems so that partial failures don’t cascade. Losing DNS shouldn’t mean losing all functionality.
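For example, a client that remembers the last address it successfully resolved can keep talking to a healthy backend even while DNS itself is misbehaving. A minimal sketch, assuming the cached address is still valid (which is its own trade-off):

```python
import socket

_last_good = {}  # hostname -> last IP address that resolved successfully

def resolve_with_fallback(hostname: str) -> str:
    """Resolve a hostname, degrading to the last known good address if DNS fails."""
    try:
        ip = socket.gethostbyname(hostname)
        _last_good[hostname] = ip
        return ip
    except socket.gaierror:
        if hostname in _last_good:
            # DNS is down, but the service behind it may be fine: degrade, don't die.
            return _last_good[hostname]
        raise
```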
Test your automation. The same scripts that make deployments faster can also magnify bugs at scale. Always validate assumptions in production-like environments.
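Concretely, the guard that protects you from concurrent updates should itself be under test. A tiny pytest-style example against an in-memory stand-in (not real AWS infrastructure) that asserts a stale write gets rejected:

```python
import threading

class VersionedStore:
    """In-memory stand-in for a versioned config table."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def put_if_newer(self, key, version, value):
        with self._lock:
            current_version, _ = self._data.get(key, (-1, None))
            if version <= current_version:
                return False  # stale write: reject instead of overwriting
            self._data[key] = (version, value)
            return True

def test_stale_update_is_rejected():
    store = VersionedStore()
    assert store.put_if_newer("route", 2, "10.0.0.2")       # newer plan lands first
    assert not store.put_if_newer("route", 1, "10.0.0.1")   # delayed older plan is dropped
    assert store._data["route"] == (2, "10.0.0.2")
```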
Monitor internal dependencies. Your system might depend on services you take for granted, like DNS or an internal database. Visibility into these layers is critical.
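A lightweight way to get that visibility is a probe that checks the dependencies you normally never think about, such as name resolution for the endpoints you rely on. The hostnames below are placeholders:

```python
import socket
import time

# Placeholder dependency list; in practice this would come from your own config.
DEPENDENCIES = {
    "internal-api": "internal-api.example.com",
    "dynamodb-endpoint": "dynamodb.us-east-1.amazonaws.com",
}

def probe_dns(hostname):
    """Return (ok, detail) for a single name-resolution check."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return True, f"{(time.monotonic() - start) * 1000:.0f} ms"
    except socket.gaierror as err:
        return False, str(err)

if __name__ == "__main__":
    for name, host in DEPENDENCIES.items():
        ok, detail = probe_dns(host)
        print(f"{'OK  ' if ok else 'FAIL'} {name:<18} {detail}")
```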
🧠 Conclusion
Even the most advanced systems fail not because of a lack of resources, but because humans write the automation that keeps them running.
AWS’s October 2025 outage is a reminder that resilience isn’t about avoiding failures; it’s about preparing for them.
So the next time you’re designing a system, think about what happens when the phonebook of your architecture suddenly forgets a few names.
Because in distributed systems, the smallest bug can have the loudest echo.
Written by a fellow engineer who’s deployed one too many “harmless” automation scripts.