
An outage in Amazon Web Services Inc.’s Northern Virginia data center cluster disrupted numerous online services this morning.

AWS disclosed the issue shortly after midnight PDT. Around the same time, users started losing access to ChatGPT, Disney+, Snapchat, Venmo, Perplexity and a long list of other online services. Some of AWS parent Amazon.com Inc.’s services, including Alexa+, were affected as well.

“With more and more vibe coding and AI adoption, infrastructure is getting more complex — and more fragile,” said FluidCloud Ltd. co-founder and Chief Technology Officer Harshit Omar. “More disruptions are coming underway. Vendor lock-in is the new downtime.”

The outage was caused by a series of technical issues in AWS’ US-EAST-1 cloud region. A cloud region is a collection of availability zones, which are data center campuses that each use separate power infrastructure. US-EAST-1 contains six availability zones, twice as many as most other AWS cloud regions.

The Amazon unit first confirmed the malfunction on its status page at 12:11 a.m. PDT. Its engineers stated in a memo that the problem disrupted multiple cloud services hosted in US-EAST-1. Additionally, some customers had trouble submitting support tickets.

In a notification published about an hour later, AWS disclosed that the issue affected its Amazon DynamoDB managed NoSQL database. The error was in the application programming interface that customer workloads use to interact with DynamoDB. The API’s DNS mechanism, which translates domain names into the IP addresses of the relevant servers, malfunctioned. (A simplified example of such a lookup appears at the end of this story.)

Three hours after AWS first confirmed the outage, its engineers announced that they had “fully mitigated” the DNS failure. However, they discovered a second issue along the way: users were struggling to launch Amazon EC2 instances. The new malfunction kicked off an hours-long troubleshooting effort.

While AWS’ engineers were working on a fix for EC2, two more issues cropped up. The first affected the AWS Lambda serverless compute service, which developers use to host code. The service couldn’t read data delivered to it by another AWS service called Amazon SQS. (A minimal sketch of that integration pattern appears below.) The second issue took the form of networking disruptions in US-EAST-1.

The updates AWS posted over the next few hours revealed that the three malfunctions were connected to a certain extent. At 8:43 a.m., the cloud giant stated that it had throttled EC2 instance launches to expedite recovery from the networking issues. Shortly after 10 a.m., AWS disclosed that those networking issues caused some of the errors in Lambda.

US-EAST-1’s network malfunctioned because of an issue in a system tasked with monitoring the health of its load balancers. A load balancer is a device that ensures network traffic is evenly spread among servers, which avoids situations where an overwhelming amount of data is sent to a single machine. (A simplified sketch of that routing logic appears below.)

“Networking is certainly a foundational component of AWS services,” said Corey Beck, director of cloud technologies at data services provider DataStrike LLC. “When it stumbles in a region like US-EAST-1, the effects go way beyond; it ripples through EC2, S3, DynamoDB, RDS, and pretty much every service that depends on them.

“You have to design with failure in mind, because it’s going to happen.”

AWS identified the root cause of the network issues at 8:43 a.m. It started rolling out mitigations shortly thereafter.
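To make the load-balancer description above concrete, the sketch below shows, in deliberately simplified form, the two ideas at play: rotating requests across a pool of servers and honoring health-check results so traffic stops flowing to machines that fail their probes. It is a generic illustration, not AWS’ implementation; the class name, method names and server addresses are invented for this example.

```python
# Hypothetical illustration of round-robin load balancing with health checks.
# This is NOT AWS's implementation; names, addresses and logic are invented.
from itertools import cycle


class TinyLoadBalancer:
    def __init__(self, servers):
        self.healthy = set(servers)        # servers currently passing their health probes
        self._rotation = cycle(servers)    # round-robin order over the full pool

    def record_health_check(self, server, passed):
        """A monitoring system reports whether a server answered its health probe."""
        if passed:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)   # stop sending traffic to a failing machine

    def pick_server(self):
        """Return the next healthy server in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        while True:
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate


lb = TinyLoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.record_health_check("10.0.0.2", passed=False)     # the health monitor flags one server
print([lb.pick_server() for _ in range(4)])          # requests rotate over the two healthy ones
```

The sketch also hints at why the health-monitoring layer matters: if health checks misreport, traffic can be steered away from, or toward, the wrong machines even when the servers themselves are fine.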
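For context on the Lambda issue described above, this is roughly what the common Lambda-reads-from-SQS pattern looks like from a developer’s perspective: the queue triggers the function, which receives a batch of messages in an event payload. The handler below is a generic, hypothetical example (the JSON message body is an assumption), not code involved in the incident.

```python
# Minimal sketch of the common Lambda + SQS pattern: an SQS queue triggers a
# Lambda function, which receives a batch of messages in the "Records" field.
import json


def handler(event, context):
    """Process a batch of SQS messages delivered to Lambda by an event source mapping."""
    for record in event.get("Records", []):
        # Assumes the producer sent JSON; SQS delivers the message body as a string.
        message = json.loads(record["body"])
        print(f"processing message {record['messageId']}: {message}")
    # During the outage described above, Lambda reportedly could not read from SQS,
    # so handlers like this one would simply not receive events on time.
```

When the managed plumbing between the queue and the function breaks, the problem can show up to customers as missing or delayed processing rather than as an error in their own code.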
In a noon update, the company stated that its engineers were observing “recovery across all AWS services,” but Lambda was still experiencing intermittent errors.

The outage comes four years after an hours-long malfunction in US-EAST-1 that likewise took numerous third-party services offline. Like today’s disruption, that incident began with DNS errors. AWS engineers later determined that the outage was caused by the autoscaling engine of one of its cloud services.
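Both today’s outage and the 2021 incident began, according to AWS, at the DNS layer. The snippet below is a generic illustration of that lookup step: before any client can call DynamoDB in US-EAST-1, the region’s public endpoint hostname has to resolve to IP addresses, and when that resolution fails, requests never reach the service even if the database behind it is healthy. The endpoint shown is DynamoDB’s standard public hostname for the region; the script is otherwise a hypothetical example, not an AWS diagnostic tool.

```python
# Illustration of the DNS step that precedes every API call.
# Generic example; not a diagnostic tool used by AWS.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"   # DynamoDB's public regional endpoint

try:
    # Ask the resolver for the addresses behind the endpoint, as any SDK or HTTP
    # client must do before it can open a connection on port 443.
    results = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({entry[4][0] for entry in results})
    print(f"{ENDPOINT} resolves to: {addresses}")
except socket.gaierror as err:
    # When resolution fails, as AWS described, requests never reach the service,
    # regardless of whether the database fleet itself is healthy.
    print(f"DNS resolution failed for {ENDPOINT}: {err}")
```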