A major outage which struck Amazon’s US-EAST-1 region on Tuesday, rendering large swathes of the internet inaccessible, was caused by a simple input error on the part of an engineering team, AWS has revealed.
The cloud giant explained in a lengthy online post that a Simple Storage Service (S3) team had been debugging an issue which was causing the S3 billing system to run more slowly than expected.
It continued:
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
The bad news continued when it turned out that the servers inadvertently removed were supporting two other S3 subsystems: the index subsystem, which manages the metadata and location information of all S3 objects in the region, and the placement subsystem, which “manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”
Amazon was forced to restart both subsystems, in turn rendering a range of other services reliant on S3 for storage unavailable, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes when data was needed from an S3 snapshot, and AWS Lambda.
AWS said it had not had to restart the index or placement subsystems for several years, during which time S3 has experienced massive growth that made the whole process, including safety checks on the integrity of the metadata, take longer than expected.
The cloud giant said it is making changes to prevent a similar incident happening in the future, including modifying its tooling to remove capacity more slowly and to block removals that would take a subsystem below its minimum required capacity. For many, though, the episode is a reminder of what can go wrong even in organizations with the resources of Amazon Web Services.
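AWS has not published the playbook tooling itself, but the class of safeguard it describes is straightforward to sketch. The snippet below is purely illustrative: the fleet, the command and the MAX_REMOVAL_FRACTION cap are all invented, and it simply shows how a removal command can refuse an input, mistyped or otherwise, that would take out more than a small slice of capacity in one invocation.

```python
import argparse
import sys

# Hypothetical fleet; the real S3 subsystem inventories are not public.
FLEET = [f"server-{i:03d}" for i in range(500)]

# Refuse to remove more than 5% of the fleet in a single invocation.
MAX_REMOVAL_FRACTION = 0.05

def plan_removal(count: int) -> list[str]:
    """Return the servers a removal command would take out of service."""
    if count <= 0:
        raise ValueError("count must be a positive integer")
    limit = int(len(FLEET) * MAX_REMOVAL_FRACTION)
    if count > limit:
        # The guard rail: a mistyped input (say, 200 instead of 20)
        # fails loudly instead of widening the blast radius.
        raise ValueError(
            f"refusing to remove {count} of {len(FLEET)} servers; "
            f"cap is {limit} per invocation"
        )
    return FLEET[:count]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="remove servers from a fleet")
    parser.add_argument("count", type=int, help="number of servers to remove")
    args = parser.parse_args()
    try:
        for server in plan_removal(args.count):
            print(f"would remove {server}")
    except ValueError as err:
        sys.exit(f"aborted: {err}")
```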
To compound the embarrassment, the outage coincided with Amazon’s AWSome Day, an event designed to encourage UK start-ups to migrate to the cloud, and reports suggest websites and services including Quora, Imgur, GitHub, Zendesk and Yahoo Mail went down or were patchy for several hours.
Gavin Millard, EMEA technical director of Tenable Network Security, argued that cloud services are usually less prone to downtime than on-premises set-ups, but can cause a domino effect when they do hit trouble.
“When migrating critical infrastructure to a cloud provider, it’s important to remember that whilst they have robust strategies for dealing with outages to core services, single points of failure can still impact availability," he added. "Spreading the workloads across multiple regions and having a plan in place to deal with catastrophic issues like S3 going down would be wise.”
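Millard’s suggestion translates into surprisingly little client code. As a rough sketch, assuming two hypothetical buckets kept in sync by S3 cross-region replication, an application could fall back to a replica region when the primary misbehaves:

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical buckets; assumes cross-region replication already copies
# objects from the primary bucket to the replica.
REGIONS = [
    ("us-east-1", "example-assets"),          # primary
    ("eu-west-1", "example-assets-replica"),  # replica in a second region
]

def fetch_object(key: str) -> bytes:
    """Read an object from the primary region, falling back to the replica."""
    last_error = None
    for region, bucket in REGIONS:
        s3 = boto3.client("s3", region_name=region)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (BotoCoreError, ClientError) as err:
            last_error = err  # try the next region before giving up
    raise RuntimeError(f"all regions failed for {key!r}") from last_error
```

Fallback reads are the easy half, of course; a plan for the kind of catastrophic issue Millard describes also has to cover writes, and needs rehearsing before it is ever needed in anger.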