Use of the internet for everyday services has grown rapidly during the COVID-19 pandemic, and organizations have become increasingly reliant on their websites to function effectively.
Against this backdrop, the widespread website outages on June 8, 2021 became a highly newsworthy incident, with numerous high-profile organizations’ websites forced offline for 30 to 60 minutes. These included Amazon, Reddit and Twitch; major news sites such as the Financial Times, The Guardian and The New York Times; and the UK government website, gov.uk. While some websites were taken offline entirely, only specific parts of other services were affected, notably the servers that host Twitter’s emojis.
The problem was quickly traced to a bug in the content delivery network (CDN) of cloud services provider (CSP) Fastly, which manages around 10% of the world’s internet traffic. The company published a blog post the following day revealing that the problem occurred when a single Fastly customer changed their settings, exposing a bug in a software update issued by the CDN provider around a month earlier. As Nick Rockwell, senior vice president of engineering and infrastructure at Fastly, explains, “We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.”
Although the outage was short-lived, it still proved very disruptive for many organizations in such a digitized era. For instance, e-commerce giant Amazon is thought to have missed out on $34m in sales during the incident, as calculated by The Independent. In addition, such incidents can damage organizations’ reputations and leave people, including the vulnerable, unable to access vital online services.
Therefore, it is vital to ask whether organizations have become too reliant on CDNs and CSPs for running their websites and what, if anything, can be done to improve the situation.
Keeping Things in Perspective
It is first important to retain a sense of perspective when analyzing this topic. Speaking to Infosecurity, Brian Honan, CEO of BH Consulting, points out, “Thankfully, these types of outages are relatively rare; otherwise, if they were more common, we wouldn’t be having this conversation.”
Additionally, Fastly’s impressive response to the incident should be acknowledged. Diana Kelley, CTO and co-founder of SecurityCurve, notes, “They identified the outage in a minute (so their monitoring was working and responsive), and they posted a status update at the 11-minute mark, indicating that they have a tested set of standard operating procedures to follow during outages. Services started to come back up within an hour, and at the 1-hour 13-minute mark, almost all services were restored. That’s pretty good.”
Honan was particularly impressed with Fastly’s early and frequent communication. “Fastly were open and transparent and revealed it was a configuration issue. They knew what the problem was and they were fixing it, meaning that organizations could implement their contingencies in an appropriate way,” he outlines.
It is also important to recognize the benefits that CSPs and CDNs provide to websites regarding performance and security. Kelley recalls her previous experience running web servers for a company located at a physical site. This meant the websites were far more vulnerable to outages, such as those caused by a power cut. She explains, “When you go to a cloud service provider for all of this, it’s wonderful because Microsoft, Google and Amazon know how to have a hardware backup and high availability and failover, and they know how to get energy back up; all of that gets taken off your plate.”
CDNs have become invaluable over recent years. This is highlighted by the fact that Amazon uses Fastly to manage its content distribution, despite having its own cloud computing services, Amazon Web Services (AWS). Justin Wray, director of operations at Core BTS’ Security Practice, says, “Cloud platforms allow for seamless scalability and rapid development. CDNs are a near-necessity for content-driven platforms (like e-commerce or video on-demand).”
Lessons to Be Learned?
Given the benefits provided by CSPs and CDNs, should organizations simply accept that these kinds of incidents will occasionally occur, since no system is 100% foolproof?
The answer to this is a resounding no. Fastly’s Rockwell himself acknowledges that “even though there were specific conditions that triggered this outage, we should have anticipated it.” It is also good practice for any organization to conduct a thorough audit of incidents. This should be “not just from a technical viewpoint, but also a procedural and policy viewpoint,” comments Honan.
Kelley notes that in this particular case, it is troubling “that a software update had a vulnerability in it that could take down so many systems for an hour.” The fact that the bug was triggered by a customer making a valid configuration change, as they were entitled to do, suggests that Fastly needs to improve its application security and quality assurance testing processes. Kelley believes that to be most effective, this testing should account for all scenarios, including customers misusing the software.
Another major concern is that a configuration from a single organization disrupted around 85% of Fastly customers, indicating its system is too interconnected. Kelley explains: “I’d guess there was a lack of segmentation or isolation, which meant that the one failure cascaded to all those other customers who hadn’t made that change. They should have kept going if they had proper segmentation/isolation.”
Encouragingly, Fastly appears to have recognized the need to make improvements in this respect. “We have been — and will continue to be — innovating and investing in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up,” says Rockwell.
It is to be hoped that other CDN providers conduct similar reviews of their systems. While rare, this certainly wasn’t the first, and won’t be the last, outage of its kind. Indeed, just a week or so later, a number of Australian websites suffered outages due to an issue at CDN provider Akamai.
What Can Website Owners Do?
Despite the many benefits, the Fastly case highlights drawbacks to relying on cloud and CDN services. As Honan puts it, “It’s a double-edged sword, you’re using these services to provide you with more resilience and better security, but if they have an issue, then you have an issue.”
It is also possible that cyber-threat actors have taken note of what happened, as it demonstrates the extent to which cloud providers have become a crucial link in the website supply chain. Sergio Loureiro, cloud security director of Outpost24, says, “It is very worrisome that a bug triggered by a customer can bring down a supposedly reliable service. Customers are paying for reliability and performance. Imagine if a state-sponsored hacking group tried to take down such a provider in a similar fashion?”
“If they hadn’t understood how the dependencies worked with Fastly for these websites, they just got to see it live for an hour on that morning,” adds Kelley.
Amid this backdrop, can organizations better protect their websites from issues experienced by their cloud and CDN providers?
According to Honan, first and foremost organizations need to engage regularly with their providers, pressuring them to enhance the resiliency of their systems and making sure the providers “contact them if there are any issues within their network that may impact their clients.”
Additionally, organizations should have a contingency plan to enable their websites to keep running in the event of a problem at their CSP or CDN. “The most important action an organization can take when addressing the third-party risk associated with a CDN or cloud-provided service is planning. Be prepared, have a plan, test and train on your plan,” says Wray.
This includes ensuring backups are in place, for example, by “storing local copies of CDN artifacts (e.g. images, objects) that could be used if the CDN is unavailable. You can also have it set to fall back to local copies if the CDN is unavailable,” advises Kelley.
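Kelley's suggestion of falling back to local copies can be illustrated with a minimal Python sketch. It is not taken from Fastly or any of the organizations mentioned; the function names, the CDN base URL and the local directory are all hypothetical, and a production setup would more likely implement this at the web server or load balancer level.

```python
from pathlib import Path
from typing import Callable, Tuple
import urllib.request


def fetch_from_cdn(url: str, timeout: float = 2.0) -> bytes:
    """Download an asset from the CDN; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def load_asset(
    name: str,
    cdn_base: str,      # hypothetical, e.g. "https://cdn.example.com/static"
    local_dir: Path,    # hypothetical directory holding backup copies
    fetch: Callable[[str], bytes] = fetch_from_cdn,
) -> Tuple[bytes, str]:
    """Return the asset bytes and the source that served them.

    The CDN is tried first. If it is unreachable, times out, or returns
    an HTTP error (all of which surface as OSError subclasses in urllib),
    the locally stored copy is used instead.
    """
    url = f"{cdn_base.rstrip('/')}/{name}"
    try:
        return fetch(url), "cdn"
    except OSError:
        return (local_dir / name).read_bytes(), "local"
```

The `fetch` parameter is injectable so the fallback path can be exercised in a test without waiting for a real outage, which fits Wray's advice to test the plan in advance.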
Another option, which is most relevant to organizations that are particularly impacted by website outages, is to have on-site non-CDN servers. Wray points out that “for many organizations, temporarily having a slower website is better than no website.”
Overall, Honan believes organizations should be asking more searching questions regarding the extent to which their websites are reliant on third parties. “I would hope that companies would take this as a reminder to review what other services they are relying on, and build in appropriate business continuity and resilience controls around them.”
He points out that many companies have “rushed from on-site or self-hosted platforms into the cloud in response to COVID-19 to facilitate keeping their businesses going.” This means many will be using larger cloud providers for services, “but have they thought what the risks with those providers are now?”
One way to enhance resiliency is to use multiple cloud providers. This provides a ready failover if one provider has a problem. Jake Madders, director at Hyve Managed Hosting, says, “The key is to not put all your eggs in one basket by using multiple vendors; using one vendor for production and another for disaster recovery greatly decreases the risks of downtime or data loss.”
However, Kelley warns that a thorough plan of how failover is implemented is required for those that go down this route: “This can help to spread the risk if one goes down, but I would caution companies that are using a multi-cloud approach to focus on the strategy carefully.” She also believes this option will not suit all organizations, recognizing that “going with a single cloud provider for site hosting could lead to an outage, but it is cost-effective and most of the time delivers on redundancy and uptime.”
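As a rough illustration of the production and disaster recovery arrangement Madders describes, the Python sketch below picks the first provider that passes a health check. It is a simplification under stated assumptions: the endpoint URLs and function names are hypothetical, and real failover is usually handled through DNS or a load balancer rather than application code.

```python
from typing import Callable, Optional, Sequence
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any 2xx response from the health endpoint as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers connection failures, timeouts and HTTP error responses.
        return False


def choose_origin(
    origins: Sequence[str],
    check: Callable[[str], bool] = is_healthy,
) -> Optional[str]:
    """Return the first healthy origin in priority order, or None.

    `origins` lists the production vendor first and the disaster
    recovery vendor after it, so traffic moves to the second vendor
    only when the first fails its check.
    """
    for origin in origins:
        if check(origin):
            return origin
    return None
```

Even a sketch this small raises the planning questions Kelley warns about: how often to run the check, what counts as healthy, and what happens when `choose_origin` returns `None` because every provider is down.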
Time to Reduce the Tech Monopoly?
Looking at the issue more broadly, the reliance on a small number of CSPs and CDNs to manage websites and their content is problematic. “As technology has become a mainstay across industries, there has been a large amount of centralization. A small number of platforms and services have become the de facto leaders. While this centralization can have significant performant benefits, it does have downsides as we’ve seen with the Fastly outage,” Wray notes.
Kelley concurs, adding that the growing influence of a handful of tech companies for essential services means there will have “to be a reckoning” at some stage. This will involve looking closely at “what it means to the rest of society that there are three or four companies that have so much control over how data moves and where it goes.”
The recent incident at Fastly has raised several questions about the resiliency of websites. Moving away from CSPs and CDNs is probably not a realistic option for many organizations; however, some steps can be taken to mitigate the risk of such widespread outages occurring in the future, on the part of both providers and customers.
The episode also further highlights the downside of relying on a small number of large tech firms for digital services, an area that will surely need to be looked at more closely in the future.