Cloudflare Left Red Faced Following Network Outage

Visitors to Cloudflare sites faced 502 errors on July 2, 2019, according to a blog post by the company.

A post written by John Graham-Cumming, CTO of Cloudflare, was published after a 30-minute outage affected Cloudflare's network, resulting in downtime on its sites. The issues were caused by a massive spike in CPU utilization on the company's network, which was a result of a “bad software deploy.” According to Graham-Cumming, once the deployment was rolled back, service returned to normal. 

“This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred,” writes Graham-Cumming. “Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.”

Starting at 13:42 UTC, Cloudflare experienced a global outage across its network, which meant visitors to its proxied domains faced “Bad Gateway errors.” The cause was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), pushed out during a routine deployment of new Cloudflare WAF Managed rules. According to the company's blog post, the intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. However, one of the rules “contained a regular expression that caused CPU to spike to 100% on its machines worldwide,” causing traffic to drop by 82%.

“We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents,” wrote Graham-Cumming. “Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.

“We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”
