Estimated read time: 3 Minutes
Author: Dr. Manzoor Mohammed
Global cloud outages are hard to deal with since the root cause is often outside your control, but how you plan for and react to these events is what makes the difference.
Last Tuesday, 8th June 2021, Fastly (one of the top CDN providers worldwide) suffered a systems fault that affected websites all over the internet for more than an hour, during which users were unable to access their favourite sites. The issue was apparently caused by an undiscovered software bug triggered by a valid customer configuration change.
Fastly was able to recover quickly (49 minutes), according to Nick Rockwell, SVP of Engineering and Infrastructure. Unfortunately, their customers took a little longer to fully recover.
PayPal, for example, took a further 40 minutes to recover after Fastly restored the service. This extra recovery time would have cost approximately $106M. The BBC, by contrast, recovered more quickly, and not just because it had an alternative CDN provider.
Here’s what they probably did and what the other affected parts of the internet could have done to prepare:
- Detect the problem quickly
- Have an alternative cloud provider (if possible)
- Prepare a resilience plan and execute it
- Provision sufficient capacity to deal with the recovery (e.g. a flood of non-cached content)
1. Detect the problem quickly
This is where real-time monitoring and alerting is critical. While the outage would have been obvious in this case, in many others the early warning signs before an outage are missed by threshold-based alerting.
One client I worked with had 250,000+ dashboards. Despite this broad coverage, they often missed the early warning signs of incidents. Many tools now provide more sophisticated alerting systems to mitigate this. But even with these more advanced tools, there's a risk that they are not tuned appropriately, leading to false alerts, missed early warning signs or both.
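One reason static thresholds miss early warning signs is that they only fire once a metric crosses a fixed line. A rolling baseline can flag deviation from recent behaviour earlier. Here is a minimal sketch of that idea; the window size and sigma multiplier are illustrative, not tuned values:

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_detector(window=30, sigma=3.0):
    """Flag a sample that deviates from its recent rolling baseline.

    Instead of a static threshold (e.g. "error rate > 5%"), each new
    sample is compared against the mean +/- sigma standard deviations
    of the last `window` samples.
    """
    history = deque(maxlen=window)

    def check(sample):
        anomalous = False
        if len(history) >= window:
            mu, sd = mean(history), stdev(history)
            anomalous = sd > 0 and abs(sample - mu) > sigma * sd
        history.append(sample)
        return anomalous

    return check

check = make_anomaly_detector()
# Feed a gently varying baseline error rate, then a sudden jump.
for i in range(30):
    check(0.01 + 0.001 * (i % 3))
print(check(0.5))  # True: the jump stands out against the baseline
```

The same sample can sit well under a fixed "red line" yet still be a clear anomaly relative to its own history, which is the kind of early warning a static threshold misses.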
2. Have an alternative cloud provider
Don't put all your eggs in one basket.
- Old proverb
This proverb is especially true when you are running business-critical systems. Companies such as the BBC had an alternative CDN provider to whom they could switch.
Having said this, it is not the most practical or cost-effective method for everyone. It must be a critical business decision to add this level of complexity to your tech stack.
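At the application level, a multi-CDN setup often boils down to an ordered preference list with health probes. The sketch below is illustrative only: the provider names, health URLs and the `is_healthy` probe are hypothetical, not any real provider's API:

```python
import urllib.request

# Hypothetical ordered preference list: primary first, backup second.
PROVIDERS = [
    {"name": "primary-cdn", "health_url": "https://status.primary-cdn.example/health"},
    {"name": "backup-cdn", "health_url": "https://status.backup-cdn.example/health"},
]

def is_healthy(provider, timeout=2):
    """Probe a provider's health endpoint; treat any failure as unhealthy."""
    try:
        with urllib.request.urlopen(provider["health_url"], timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def select_provider(providers, probe=is_healthy):
    """Return the first healthy provider's name, or None if all are down."""
    for provider in providers:
        if probe(provider):
            return provider["name"]
    return None  # all CDNs down: fall back to origin or a static page

# Simulate the primary being down (no network call needed for the demo):
print(select_provider(PROVIDERS, probe=lambda p: p["name"] == "backup-cdn"))
```

In practice the switch is usually done at the DNS or traffic-management layer rather than in application code, but the ordered-fallback logic is the same.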
3. Prepare a resilience plan and execute it
Simply having an alternative cloud provider doesn't mean you will automatically switch to it in the event of the primary provider's failure. If it is not automatic, you need a tried and tested plan for how to switch to your alternative provider.
Even if it is automatic, you still need to test it and make sure it will work as expected. There is nothing like a bit of chaos engineering to give you confidence.
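A failover plan is only trustworthy once it has been exercised. The sketch below is a toy "game day" drill in that spirit: deliberately fail the primary path and assert traffic goes where the plan says it should. The `route` function is a stand-in for whatever actually steers your traffic, not a real controller:

```python
def route(primary_up, backup_up):
    """Toy routing decision: prefer primary, fail over to backup, else origin."""
    if primary_up:
        return "primary"
    if backup_up:
        return "backup"
    return "origin"

def failover_drill():
    """Inject failures (the chaos-engineering part) and verify the routing.

    In a real drill you would fail actual infrastructure in a controlled
    environment; here we just exercise the decision logic.
    """
    assert route(primary_up=True, backup_up=True) == "primary"
    assert route(primary_up=False, backup_up=True) == "backup"
    assert route(primary_up=False, backup_up=False) == "origin"
    return "drill passed"

print(failover_drill())  # drill passed
```

Running this kind of drill regularly, rather than once at launch, is what turns a written plan into one you can trust during a real outage.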
4. Leave sufficient capacity to deal with the recovery
Whether or not you have an alternative provider, you still need to consider your platform's capacity in the event of CDN failures, because demand on the origin increases both during the outage and after recovery. While the CDN is down (and while its caches refill afterwards), requests that would have been served at the edge go to the origin instead.
Sending that traffic to the origin only works if there is sufficient capacity there to support it. That may be one explanation for why companies such as PayPal had recovery times lasting far longer than the outage window itself.