Amazon Web Services’ (AWS) Simple Storage Service (S3) suffered an outage last Tuesday that temporarily took multiple popular websites and services offline. AWS later attributed the problem to human error.
An outage at a major cloud provider such as AWS can have a broad impact on the availability of websites and services. Among the sites and services affected by the recent outage were Trello, Quora, IFTTT, Business Insider, Giphy, Slack, Adobe services, Buffer, Flipboard, GitHub, Imgur, Medium, The Verge and Zendesk.
In its postmortem, AWS traced the outage to a mistyped command: “At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
AWS has promised to take steps to prevent a recurrence: “We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems.”
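The safeguard AWS describes can be illustrated with a short sketch. This is purely hypothetical code, not AWS’s actual tooling: the constants and function names are assumptions, but the logic matches the two guarantees mentioned in the postmortem: reject any removal that would take a subsystem below its minimum required capacity, and remove capacity slowly rather than all at once.

```python
# Hypothetical sketch of the described safeguards; MIN_REQUIRED and
# MAX_REMOVAL_PER_STEP are invented values for illustration only.
MIN_REQUIRED = 10         # assumed minimum required capacity for the subsystem
MAX_REMOVAL_PER_STEP = 2  # assumed rate limit: remove capacity slowly

def safe_remove(current_capacity: int, requested: int) -> int:
    """Return how many servers may actually be removed in this step.

    Raises ValueError if the request would breach the minimum capacity;
    otherwise clamps the removal to the per-step limit.
    """
    if current_capacity - requested < MIN_REQUIRED:
        raise ValueError(
            f"removing {requested} servers would leave "
            f"{current_capacity - requested}, below the minimum of {MIN_REQUIRED}"
        )
    # Remove capacity slowly: never more than the per-step cap at once.
    return min(requested, MAX_REMOVAL_PER_STEP)
```

With these assumed values, `safe_remove(15, 3)` clamps the removal to 2 servers per step, while `safe_remove(12, 5)` raises an error because only 7 servers would remain, so a mistyped input cannot trigger a large removal in one go.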
Quick recovery from any kind of failure
“We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”
“Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further”, AWS said.