Copyright newsday

BitDepth 1535
MARK LYNDERSAY

ON OCTOBER 19, Amazon Web Services (AWS) experienced a failure that took down many websites and web services globally. Soon after resolving the issue some 15 hours later, the company posted a deeply technical explanation of the meltdown (aws.amazon.com/message/101925/).

"The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS (Domain Name System) record for the service's regional endpoint that the automation failed to repair," Amazon stated.

Two programs competed to write the same DNS entry at the same time, producing an empty DNS record and accidentally deleting all the IP addresses for the database service's regional endpoint.

Imagine two students trying to write an answer to the same problem on a whiteboard, each erasing the other's efforts in the process. Finally, imagine the resulting scramble wiping the whiteboard completely and you have a sense of what happened.

DNS records match the text name of a website address with the numbers of its actual digital address. Without them, websites and services cannot be found.

Reports of the number of vendors affected by the cloud outage at the company's first data centre in northern Virginia vary between 2,000 and 70,000. Direct customers will be covered by Amazon's service level agreements (SLAs) through compensatory service credits according to each agreement. These are usually based on a small fraction of the customer's monthly bill. Customers of vendors reselling Amazon cloud capacity and services will have to negotiate according to SLAs with their providers. Estimates of the insurable loss resulting from the outage have run as high as US$581 million over the 15 hours of downtime.

Why were so many affected by a DNS failure at a single cloud provider?
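The lost-update race the whiteboard analogy describes can be sketched in a few lines of Python. This is a toy model only, not AWS's actual automation: two "planners" each snapshot a DNS record before either writes, and the one working from a stale snapshot then "cleans up" addresses it no longer recognises, emptying the record. All names and addresses here are invented for illustration.

```python
# Toy model of the race: two automation "planners" read the same DNS
# record, then write without coordinating. Everything here is invented.

record = ["10.0.0.1", "10.0.0.2"]   # IPs behind a regional endpoint

# Both planners snapshot the record before either writes (the race window).
snapshot_a = list(record)
snapshot_b = list(record)

# Planner B writes a fresh plan with a new set of addresses.
record = ["10.0.0.3", "10.0.0.4"]

# Planner A, still working from its stale snapshot, runs a "cleanup" that
# deletes every address it does not recognise -- which is now all of them.
record = [ip for ip in record if ip in snapshot_a]

print(record)  # -> [] : the endpoint now resolves to nothing
```

The fix for this class of bug is well known: make the write conditional on the record being unchanged since it was read (a compare-and-swap), so the stale planner's write is rejected instead of applied.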
Amazon is one of three major data hyperscalers: providers of hardware capacity and infrastructure on a global scale that are the backbone of the commercial internet. AWS commands a leading share of that market at 32 per cent, followed by Microsoft Azure with 23 per cent and Google Cloud with 13 per cent. As a business sector, AWS accounts for 18 per cent of the company's total revenue, but its operating profit is significantly higher than that of the e-commerce business.

That reliance on a single vendor for internet presence was addressed in posts on Bluesky by Meredith Whittaker, president of Signal.

"The extent of the concentration of power in the hands of a few hyperscalers is way less widely understood than I'd assumed," Whittaker wrote. "Which bodes poorly for our ability to craft reality-based strategies capable of contesting this concentration and solving the real problem. This isn't 'renting a server.' It's leasing access to a whole sprawling, capital-intensive, technically-capable system that must be just as available in Cairo as in Cape Town, just as functional in Bangkok as Berlin.

"Infrastructure like AWS is not something that Signal, or almost anyone else, could afford to just 'spin up.' Which is why nearly everyone that manages a real-time service – from Signal, to X, to Palantir, to Mastodon – rely at least in part on services provisioned by these companies."

On October 29, Azure experienced an eight-hour outage of some of its services, the result of an "internal configuration change."

Companies requiring resilience at that scale can't easily implement standard ICT practices like redundant hardware without investing in staggering plant costs, and may not have the margins to keep a warm standby system on another cloud provider's service. Hyperscalers typically operate networks of hundreds of data centres with millions of servers distributed globally.
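At its simplest, the warm-standby idea mentioned above is health-check-plus-failover logic: if the primary provider's endpoint stops answering, route traffic to a standby hosted elsewhere. The sketch below uses invented endpoints and a caller-supplied health check; it is an illustration of the pattern, not any vendor's product.

```python
# Minimal warm-standby sketch: pick the primary endpoint while it is
# healthy, fall over to a standby on another provider when it is not.
# Endpoints and the health-check function are invented placeholders.

PRIMARY = "https://api.primary-cloud.example"
STANDBY = "https://api.standby-cloud.example"

def choose_endpoint(health_check):
    """Return the primary endpoint if it passes its health check,
    otherwise the standby on the other provider."""
    return PRIMARY if health_check(PRIMARY) else STANDBY

# Normal operation: the primary answers its health check.
print(choose_endpoint(lambda url: True))   # -> primary endpoint
# During an outage, traffic shifts to the standby.
print(choose_endpoint(lambda url: False))  # -> standby endpoint
```

The catch, and part of why the October 19 failure was so widespread, is that failover routing is often itself implemented in DNS: when the DNS layer is what breaks, switchover machinery that depends on the same provider's DNS can fail along with it.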
Their marketing pitch has always been customer flexibility: buy the capacity needed, with the option to scale up or down on demand.

The hyperscaler service model evolved to respond to internal business needs rather than external market demand. AWS began as an internal project to provide uniform backend services for Amazon's e-commerce business as it grew in both size and geographic scope. Having built a world-scale data management system, it was a short jump from there to selling the service to other companies, which admittedly required significant persuasion in the early days.

Ideally, network infrastructure should be distributed and redundant, and technically, hyperscalers tick those boxes, but they also gather shared services in fewer real-world locations than the internet's creators originally envisioned. Compared to that widely scattered design, today's hyperscalers bundle an uncomfortable number of eggs in far fewer baskets.

Mark Lyndersay is the editor of technewstt.com. An expanded version of this column can be found there.