Amazon apologises for massive AWS outage and reveals cause

Amazon has revealed the cause of Monday's massive outage that sparked global turmoil among thousands of sites, including some of the web's most popular apps such as Snapchat and Reddit.

The disruption knocked workers from London to Tokyo offline, and stopped others from conducting everyday tasks such as paying hairdressers or changing their airline tickets.

Amazon Web Services (AWS), which hosts applications and computer processes for companies around the world, revealed in a lengthy statement that a string of events triggered the outage.

It said the problems stemmed from a "latent defect" in what is known as the Domain Name System, or DNS. This prevented applications from finding the correct address for AWS's DynamoDB API, a cloud database relied upon to store user information and other critical data.

"We apologise for the impact this event caused our customers," the statement read.

"We know how critical our services are to our customers, their applications and end users, and their businesses."

The AWS cloud service returned to normal operations on Monday afternoon, local time.

Calls for better fault tolerance

The outage was the largest internet disruption since last year's CrowdStrike malfunction hobbled technology systems in hospitals, banks and airports, highlighting the vulnerability of the world's interconnected technologies.

It was at least the third time in five years that AWS's northern Virginia cluster, known as US-EAST-1, contributed to a major internet meltdown. Amazon did not address a request for more clarity about why that particular data centre keeps being affected.

Earlier, AWS said the root cause of the outage was an underlying subsystem that monitored the health of its network load balancers, which are used to distribute traffic across several servers.

The issue, AWS said, originated from within the "EC2 internal network". EC2, or Elastic Compute Cloud, is the Amazon service that provides on-demand cloud capacity within AWS.

Ken Birman, a computer science professor at Cornell University, said software developers needed to build better fault tolerance.

He said AWS provided tools developers could use to protect themselves in the event of a problem at any one of its sprawling network of data centres, and developers could also create backups with other cloud providers.

"When people cut costs and cut corners to try to get an application up, and then forget that they skipped that last step and didn't really protect against an outage, those companies are the ones who really ought to be scrutinised later," Mr Birman told Reuters.

ABC/Reuters
