Hiltzik: Lessons of the AWS meltdown

🕒︎ 2025-10-23

Copyright Los Angeles Times

On Monday, millions of internet users got a painful answer to a question few even knew existed.

The question was: What do Snapchat, Roblox, Fortnite, Signal, United and Delta airlines and countless other web-based sites and services have in common?

The answer is: They were all brought down by a cascading glitch at a data center in northern Virginia owned and operated by Amazon Web Services, an arm of the giant e-commerce company.

AWS is one of the top three cloud platforms, meaning that it holds its clients’ data on its own servers and manages the transfer and transmission of that data within the client companies and between them and end users.

When AWS’ northern Virginia data hub went down a few minutes before midnight Sunday, Pacific Daylight Time, 141 AWS services went dark, along with client firms reliant on its hub, producing a cascade of outages affecting users around the world. Users of Amazon’s own Ring home security devices, such as video-enabled doorbells, were affected.

Amazon didn’t declare that the problem had been fixed until 3:53 p.m. PDT Monday, although some clients were still reporting problems as late as Tuesday.

The damage done to AWS clients and their millions of users is incalculable. As my colleague Queenie Wong reported, web users couldn’t access their services or accounts. Customers of some banks, as well as the web brokerage Robinhood, couldn’t complete transactions.

Delta and United passengers were unable to track reservations, check in online or retrieve their seat assignments; airline employees were forced to resort to manual alternatives, as in prehistoric (i.e., pre-internet) times.

Owners of Eight Sleep mattress covers, which cost thousands of dollars, require an annual fee of $300 or $400 and use a web app to adjust temperature and incline, reported being stuck in uncomfortable positions and sweltering under uncontrollable heat.
The company’s chief executive issued an online apology and said Eight Sleep would roll out a feature allowing owners to connect with their beds via Bluetooth if the internet connection failed.

The outage is certain to raise questions about whether Amazon, and its fellows in Big Tech, supervise their systems with the rigor appropriate to crucial services with a global footprint. As lawyers put it, “res ipsa loquitur,” or “the thing speaks for itself.” The answer it gives is “no.”

In the old days when “plain old telephone service,” or POTS, was entirely under the control of a single company, AT&T, the company’s commitment was to “five nines” reliability, meaning that it worked 99.999% of the time, or tolerated no more than about 5.26 minutes of downtime per year. Since AWS systems were down this week for at least 15 hours, or 900 minutes, it effectively tossed that standard in the trash.

The five nines standard reflected the conviction that phone service was too important not to be, in effect, always on. Today’s high-tech service providers often seem to take the attitude that just-good-enough should be good enough for anyone.

As I noted last year, some of today’s richest companies pocket billions of dollars in profits but don’t spend enough to protect their customers’ private personal data from hackers. For example, AT&T, which booked a pretax profit of $16.7 billion last year, was so sloppy about protecting its customers’ private information that the data of nearly all those customers (110 million users) ended up in the hands of “financially motivated” hackers.

Amazon has stated, so far convincingly, that its outage wasn’t caused by hackers or other hostile actors. It came entirely from inside the house, so to speak.

To keep the technical gibberish at a minimum, let’s just say that something failed in its Domain Name System, which translates the web address you type into your browser into the numeric address needed to communicate with the website itself.
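
For readers who want the Domain Name System point made concrete, here is a minimal Python sketch. It is a toy, not a description of how AWS is built or of what actually failed there: a dictionary stands in for DNS, and the hostname, the address and the helper names `resolve` and `fetch` are all invented for illustration. The point is only that a server can be perfectly healthy and still unreachable once the lookup that finds it breaks.

```python
from typing import Optional

# Assumption: a plain dict stands in for DNS. The hostname and address are
# made up (example.com and 192.0.2.10 are reserved for documentation use).
dns_records = {"api.example.com": "192.0.2.10"}  # name -> numeric address
healthy_servers = {"192.0.2.10"}                 # servers that are up and running


def resolve(hostname: str) -> Optional[str]:
    """Translate a hostname into an address, or None if no record exists."""
    return dns_records.get(hostname)


def fetch(hostname: str) -> str:
    """Pretend to contact a site: look up its address first, then connect."""
    address = resolve(hostname)
    if address is None:
        return "error: could not resolve " + hostname
    if address not in healthy_servers:
        return "error: server at " + address + " is down"
    return "ok: connected to " + address


before = fetch("api.example.com")
del dns_records["api.example.com"]  # the lookup breaks; the server is untouched
after = fetch("api.example.com")

print(before)
print(after)
```

In this sketch the second request fails even though the server never went down, which is roughly why a fault in name lookup can look, from the outside, like everything failing at once.
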
The technological confusion rippled throughout the AWS structure, resulting in pain at the website and user ends. Amazon says it will eventually provide a “post-event summary” identifying the cause of the outage.

Amazon plainly deserves most of the blame for the fiasco. Some Amazon-watchers have conjectured that the glitch may be connected to mass layoffs the company implemented in the summer in its cloud computing unit, with the jobs purportedly replaced by artificial intelligence. The company confirmed the layoffs but didn’t say how many jobs were cut; Reuters reported that it was in the hundreds.

Amazon dismisses speculation that the outage was connected to the layoffs. A spokesman pointed me to an interview in which AWS CEO Matt Garman disdained the idea of replacing entry-level staff with AI bots, calling it “one of the dumbest things I’ve ever heard.” That said, it’s unclear who in the cloud unit was laid off.

Some tech experts have issued warnings for years about website operators failing to have a Plan B at hand for exactly the sort of outage that struck this week.

AWS isn’t the only cloud platform in existence. Microsoft and Google are the other members of the top three. Nor are AWS users bound to rely on the company’s northern Virginia data hub. AWS has data hubs all around the country, and it advised users to switch to any of the others. But with the Virginia hub out of service, that left users out of luck if they hadn’t implemented a workaround before this glitch.

IT departments should “design for failure (because it will happen),” Lydia Leong of the tech consulting firm Gartner advised this week.

“Modern cloud-native apps should distribute workloads across multiple availability zones and be ready to fail over quickly to another region when needed,” Leong wrote; in other words, they should be set up to automatically shift their data away from trouble spots.
“It’s not about eliminating risk; it’s about reducing blast radius and recovery time.”

This problem may be an artifact of internet history, as Jorg Dekker of the internet backbone company Arelion pointed out. The internet was designed as a neutral system that trusts all data flowing through its connected networks to be, well, trustworthy. “This means that it assumes all updates are valid, a network can announce anything it likes, and the resources available cannot be checked,” he noted.

The net’s original designers dealt with that imperfection by providing for the network to steer data away from blockages or other problems. “The internet routes around damage” is the mantra, but that doesn’t always work, especially when the damage is in a core functionality.

And sometimes trusted updates shouldn’t be trusted. That was the case with last year’s CrowdStrike outage. An ineptly designed update to a program rolled out by the cybersecurity company and installed automatically on users’ machines instantly crashed millions of computers running Microsoft programs and left them disabled until manual fixes could be undertaken.

The errant CrowdStrike application was burrowed so deep within the Microsoft operating system (as it’s designed to be) that every time a machine restarted, it ran into the same glitch and went dead again in an infinite doom loop. As I wrote then: “Thousands of flights were canceled. Doctors couldn’t perform surgeries. Banking transactions were frozen. Emergency 911 lines went silent.”

There are benefits, to be sure, in placing the crucial backbones of the internet under the control of three of the richest technology companies in the world. After all, they have the financial resources to maintain quality and reliability. The downside is that their systems work absolutely perfectly right up until the moment when they stop working; that’s when a global reliance on a few big operators turns into a global meltdown.

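
The “design for failure” advice quoted earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anyone’s production setup: `call_with_failover`, `AllRegionsFailed` and `fake_request` are hypothetical names, the region labels are used purely as examples, and the simulated outage stands in for a real network call.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def call_with_failover(regions: Sequence[str],
                       request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors = []
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # Remember the failure and move on to the next region.
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a network call; the first region is 'down'."""
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)
```

The catch, as this week showed, is that the second region has to be set up and tested before the first one fails; a list with only one entry has nothing to fail over to.
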
An inescapable feature of modern life is that, to an ever-increasing extent, there’s nowhere to hide from web service screwups.

It’s not merely that our voice and data phone calls, emailing and video entertainment come via the web; some appliances require an internet connection to operate at all. I can’t adjust the noise cancellation mode on my Bose headphones except through a phone app; the same goes for my ultra-fancy automated pour-over coffee maker and self-heating coffee mug.

The other day, when I was trying to add a line to my family T-Mobile account, T-Mobile insisted that I load a T-Mobile app onto my (non-T-Mobile) iPhone to complete the deal, and I was sitting in a T-Mobile store with a T-Mobile rep at the time.

More and more appliances, however, are being marketed with unnecessary internet capability, reflecting the internet-of-things nirvana pitched by web promoters and appliance makers. A good rule of thumb may be that if your refrigerator or cooktop doesn’t need an internet connection to work, don’t connect it. That way, it won’t turn into a brained brick because of a human error somewhere in northern Virginia.

Web connectivity has brought us benefits unimaginable even at the turn of the most recent century. But as with anything, with the boons come burdens. A few lines of renegade code can dial back our 21st century lives to the world of the 1950s or ’60s.

Back then, when our household appliances were mechanical or electric, not electronic, a breakdown was easy to diagnose and fix: switch out a vacuum tube or tighten a screw. Today, if your television goes dark and you can’t get HBO Max, you can have no idea where the problem lies: inside the TV, with your cable box, or over at HBO Max. You just have to wait for someone to make a fix, hoping all the while that the problem isn’t just at your house or in your neighborhood, but widely dispersed enough for the service providers to notice and roll a truck.

We all live in a balancing act: Today’s technology is great, when it works. When it doesn’t, we’re on our own. There’s a lesson there somewhere.
