The AWS S3 Outage – Lessons in Performance

I’d like to start by making it clear this is not another blog having a go at AWS for their recent outage. So many public cloud detractors rushed to the internet to rejoice in the news that a part of AWS had gone down. We at Capacitas are huge fans of AWS (and to be clear, we have no vested interest: we’re not a partner or reseller). We recognise the engineering brilliance at AWS that has built a world-leading cloud platform.

But we do see the outage, and the reasons behind it, as a cautionary tale for any organisation with complex infrastructure. If it can happen to AWS, you can be absolutely sure it can happen to you.

So let’s look at what happened:

On the 28th February 2017 Amazon suffered a severe outage to its core S3 platform in one region. This resulted in highly publicised outages on several major websites, and lots of questions about what could cause such significant disruption.

AWS provided a message summarising what happened here.

In short, someone made a mistake, which brought down the system. AWS have made it clear that they recognise that there need to be constraints in place and better design around how the system is structured. But the reason for the severity of the issue was, fundamentally, a performance problem.

At Capacitas, we have 7 Pillars of Performance that provide a framework for classifying performance risk and for all performance engineering. In Amazon’s S3 outage, three of these crumbled under extreme circumstances and together contributed to the problems AWS experienced.

The pillars that failed for Amazon were Resilience, Scalability and Throughput/Response time:

Resilience – how the system copes with and responds to component or third-party outages while reducing user impact.
Scalability – how the system behaves as demand increases up to and beyond what is expected.
Throughput and Response Time – how long operations take to execute at the required throughput.

The reason behind the severity of the S3 outage was the combined impact of failures in all three of these areas. The crunch came when the team were required to restart the index subsystem, which had been stopped incorrectly. Initially there was the unplanned component outage, and the appropriate recovery processes were put into operation. But the Amazon announcement then tells us:

“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”

S3 has experienced huge growth (scalability), but the recovery procedures on which they relied had not been tested or modelled at the required demand (scalability + resilience). This then led to unexpectedly long execution times for restoring this now much larger service (scalability + resilience + response time).
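As a rough illustration of the kind of modelling that was missing, the sketch below scales a previously measured restart time up to the current data volume and compares it with a recovery objective. Everything in it is hypothetical: the function names, the figures, and above all the assumption that restart time follows a simple power law in the number of stored objects. It is not based on how AWS operates S3; a real model would be calibrated against measurements of your own system.

```python
def projected_restart_minutes(measured_minutes, measured_objects,
                              current_objects, exponent=1.0):
    """Scale a measured restart time to the current data volume.

    ASSUMPTION: restart time grows as (object count) ** exponent.
    exponent=1.0 means linear growth; a larger value models a restart
    that degrades faster than the data grows.
    """
    if measured_minutes <= 0 or measured_objects <= 0 or current_objects <= 0:
        raise ValueError("all inputs must be positive")
    growth = current_objects / measured_objects
    return measured_minutes * growth ** exponent


def breaches_objective(projected_minutes, objective_minutes):
    """Return True when the projected restart exceeds the recovery objective."""
    return projected_minutes > objective_minutes


if __name__ == "__main__":
    # Hypothetical figures: the last full restart took 10 minutes when the
    # service held 1 million objects; it now holds 8 million, and the
    # business expects recovery within 60 minutes.
    projected = projected_restart_minutes(10, 1_000_000, 8_000_000)
    print(f"Projected restart: {projected:.0f} minutes")
    print("Breaches objective:", breaches_objective(projected, 60))
```

Re-running a projection like this whenever demand grows materially, and checking it against an actual restart test, is one way to notice that a recovery procedure has quietly outgrown its objective.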

As performance and capacity experts we know how hard it is to get this right, and we have seen very similar situations occur on numerous occasions over the years, often with major disruption and business impact.

So, if this can happen to AWS, what can you do to avoid it?

Anticipate your risks by undertaking a comprehensive Risk Modelling activity, looking at all 7 Pillars of Performance, and then implement an appropriate strategy to manage that performance risk. Too often performance engineering focusses on only a subset of the 7 Pillars – often only on response times and throughput – and typically only for “happy day scenarios”. There is an assumption that as long as it’s working today it will keep working in future. The AWS S3 outage is a timely reminder for everyone that all these factors need to be considered together – and that you need to plan for the unexpected.

If you're interested in finding out more about the 7 Pillars of Performance, download our whitepaper, "Agile Performance: How to move fast and not break things", or sign up for our webinar with a guest speaker from easyJet.
