A major outage which struck Amazon’s US-EAST-1 region on Tuesday, rendering large swathes of the internet inaccessible, was caused by a simple input error on the part of an engineering team, AWS has revealed.
The cloud giant explained in a lengthy online post that a Simple Storage Service (S3) team had been debugging an issue which was causing the S3 billing system to run more slowly than expected.
It continued:
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
The bad news continued when it turned out that the servers inadvertently removed were supporting two other S3 subsystems: the index subsystem, which manages the metadata and location information of all S3 objects in the region, and the placement subsystem, which “manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”
Amazon was forced to restart both subsystems, in turn rendering a range of other services reliant on S3 for storage unavailable, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes when data was needed from an S3 snapshot, and AWS Lambda.
AWS said it had not had to restart the index or placement subsystems for several years, during which time S3 has experienced massive growth that made the whole process, including safety checks on the integrity of the metadata, take longer than expected.
The cloud giant said it is making changes to prevent a similar incident happening in the future, including modifying its tooling to remove capacity more slowly and to block removals that would take a subsystem below its minimum required capacity. For many, though, the episode is a reminder of what can go wrong even in organizations with the resources of Amazon Web Services.
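AWS has not published the playbook tooling itself, but the class of safeguard it describes is straightforward to sketch. The snippet below is purely illustrative: the fleet, the command and the MAX_REMOVAL_FRACTION cap are all invented, and it simply shows how a removal command can refuse an input, mistyped or otherwise, that would take out more than a small slice of capacity in one invocation.

```python
import argparse
import sys

# Hypothetical fleet; the real S3 subsystem inventories are not public.
FLEET = [f"server-{i:03d}" for i in range(500)]

# Refuse to remove more than 5% of the fleet in a single invocation.
MAX_REMOVAL_FRACTION = 0.05

def plan_removal(count: int) -> list[str]:
    """Return the servers a removal command would take out of service."""
    if count <= 0:
        raise ValueError("count must be a positive integer")
    limit = int(len(FLEET) * MAX_REMOVAL_FRACTION)
    if count > limit:
        # The guard rail: a mistyped input (say, 200 instead of 20)
        # fails loudly instead of widening the blast radius.
        raise ValueError(
            f"refusing to remove {count} of {len(FLEET)} servers; "
            f"cap is {limit} per invocation"
        )
    return FLEET[:count]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="remove servers from a fleet")
    parser.add_argument("count", type=int, help="number of servers to remove")
    args = parser.parse_args()
    try:
        for server in plan_removal(args.count):
            print(f"would remove {server}")
    except ValueError as err:
        sys.exit(f"aborted: {err}")
```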
To compound the embarrassment, the outage coincided with Amazon’s AWSome Day, an event designed to encourage UK start-ups to migrate to the cloud, and reports suggest websites and services including Quora, Imgur, GitHub, Zendesk and Yahoo Mail went down or were patchy for several hours.
Gavin Millard, EMEA technical director of Tenable Network Security, argued that cloud services are usually less prone to downtime than on-premises set-ups, but can cause a domino effect when they do hit trouble.
“When migrating critical infrastructure to a cloud provider, it’s important to remember that whilst they have robust strategies for dealing with outages to core services, single points of failure can still impact availability," he added. "Spreading the workloads across multiple regions and having a plan in place to deal with catastrophic issues like S3 going down would be wise.”
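Millard’s suggestion translates into surprisingly little client code. As a rough sketch, assuming two hypothetical buckets kept in sync by S3 cross-region replication, an application could fall back to a replica region when the primary misbehaves:

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical buckets; assumes cross-region replication already copies
# objects from the primary bucket to the replica.
REGIONS = [
    ("us-east-1", "example-assets"),          # primary
    ("eu-west-1", "example-assets-replica"),  # replica in a second region
]

def fetch_object(key: str) -> bytes:
    """Read an object from the primary region, falling back to the replica."""
    last_error = None
    for region, bucket in REGIONS:
        s3 = boto3.client("s3", region_name=region)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (BotoCoreError, ClientError) as err:
            last_error = err  # try the next region before giving up
    raise RuntimeError(f"all regions failed for {key!r}") from last_error
```

Fallback reads are the easy half, of course; a plan for the kind of catastrophic issue Millard describes also has to cover writes, and needs rehearsing before it is ever needed in anger.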