When a large portion of the World Wide Web suddenly ceased to function recently, it was a wake-up call. Fifty-four of the top 100 sites were down or suffering from degraded service, according to Internet performance monitoring firm Apica, including such diverse operations as (take a deep breath) Adobe's services, Amazon's Twitch, Atlassian's Bitbucket and HipChat, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Disney Store, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft's HockeyApp, the MIT Technology Review, MuckRock, Netflix, New Relic, News Corp, Nike, Nordstrom, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage, Target, Tinder, Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), Vermont Public Radio, VSCO, Zendesk, and others.
The cause: an outage of the Amazon Web Services (AWS) Simple Storage Service (S3).
Amazon's message explaining the outage was brutally honest, in a geeky sort of way. It said, in part:
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."
For those who don't speak geek, the gist of the explanation was, "oops." Someone entered a command incorrectly and took down a ton of resources, which crippled the service. Amazon also explained the length of the outage:
"S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover."
Again translating from the geek, it tells us that the systems needed a full restart to recover, and some of them hadn't been restarted for years. That meant a lot of electronic housekeeping was required to ensure the integrity of the systems, like rebuilding indexes (and anyone who's suffered through Outlook's usually badly-timed index rebuilds knows how long THAT can take).
The result was around five hours of downtime, including three hours in which even Amazon's own dashboard system that's supposed to tell users about this sort of thing was also non-functional.
The post concluded by saying that safeguards were being put in place to prevent a recurrence. It also apologized for the inconvenience.
That "inconvenience" cost a bundle, according to Business Insider, which cited an analysis by Cyence estimating that S&P 500 companies had lost about $150 million, and the outage had hit financial services for about $160 million. A big ouch for a small oops.
There are lessons to be learned from the AWS customers that were not affected by this outage, including Amazon itself (the e-commerce site, amazon.com, was unaffected). It all comes down to an ancient concept in IT called the "single point of failure." Granny would have called it putting all of your eggs in one basket. While AWS provides what it calls Availability Zones within regions, in this case the whole region went toes-up. Customers who survived unscathed had spread their usage across multiple AWS regions (which costs more, and presents other challenges with security, since inter-region data movement is over the public Internet), so an outage in one region was compensated for by increased usage in the others. Depending on the other regions' capacity, that may have slightly degraded service, but it wouldn't take them completely offline. In other words, all of their eggs were not in one basket, so when a basket went splat, there was another set of eggs available.
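As a rough illustration of that multi-region idea, here's a minimal sketch of read failover, assuming the same objects are replicated to several regions. The region names and the `fetch` callable are illustrative, not a real AWS API; in practice `fetch` might wrap a per-region S3 client.

```python
class AllRegionsDown(Exception):
    """Raised when no region could serve the request."""

def fetch_with_failover(key, regions, fetch):
    """Try each region in order; return the first successful result.

    `fetch(region, key)` is any callable that raises on failure.
    """
    errors = {}
    for region in regions:
        try:
            return fetch(region, key)
        except Exception as exc:  # a real tool would catch narrower errors
            errors[region] = exc
    raise AllRegionsDown(f"all regions failed for {key!r}: {errors}")

# Usage: simulate us-east-1 being down while us-west-2 still serves the object.
def fake_fetch(region, key):
    if region == "us-east-1":
        raise ConnectionError("region outage")
    return f"{key} from {region}"

print(fetch_with_failover("report.csv",
                          ["us-east-1", "us-west-2"], fake_fetch))
# -> report.csv from us-west-2
```

The point of the pattern is that the client, not the region, decides where the eggs come from: when one basket goes splat, the loop simply moves on to the next.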
Amazon is probably also cringing over the error, which illustrated a shortcoming in its processes. Companies can learn from this too. A playbook (a predefined set of instructions) is only as good as the process underlying it. Amazon's playbook allowed too much scope: it was possible for a typo to take down too many systems. Amazon is now changing its tools to limit the number of resources that can be taken offline at once.
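A minimal sketch of the kind of guard Amazon described adding might look like this: a removal tool that refuses to take more than a fixed fraction of a fleet offline in one command. The fleet names and the 10% threshold here are made up for illustration.

```python
def plan_removal(fleet, requested, max_fraction=0.10):
    """Return the set of servers to remove, or raise if the request
    would take too much of the fleet offline at once."""
    requested = set(requested) & set(fleet)
    limit = int(len(fleet) * max_fraction)
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} of {len(fleet)} servers; "
            f"limit is {limit} ({max_fraction:.0%}) per command")
    return requested

# Usage: a small removal passes, a fat-fingered large one is rejected.
fleet = [f"srv-{i}" for i in range(100)]
print(sorted(plan_removal(fleet, ["srv-1", "srv-2"])))   # 2 of 100: allowed
# plan_removal(fleet, [f"srv-{i}" for i in range(40)])   # raises ValueError
```

A typo that expands the request past the threshold fails loudly instead of silently removing the servers, which is exactly the property the revised playbook tooling is meant to have.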
Corporate IT could learn from this. First, automate anything possible, and be conservative in what's allowed by default. Fat-finger syndrome has been around as long as there have been systems and administrators. In fact, historically, most AWS outages have been caused not by component failures but by an admin typing an incorrect command. For the protection of both the systems and the admins, scripts and admin tools need to contain enough smarts to prevent those inadvertent errors without inhibiting the ability to do what's necessary to recover from a problem.
It's also a reminder that, contrary to marketing hype, the cloud is not the panacea IT had hoped for. Companies still need business continuity and disaster recovery plans in place.
If that five-hour outage only teaches us that, it was worth the grief.