Resilience is “the capacity to recover quickly from difficulties; toughness.” With the rise in both natural disasters and cyber threats, today’s businesses must ensure not only their physical resilience, but also the resilience of their IT systems, so they can continually provide a great customer experience.
How do you know if you’re prepared for the worst? It’s all about testing. One method, known as “chaos engineering,” is defined as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
The goal of chaos testing is to expose weaknesses in your systems before they surface as the crash or unavailability of an end-user service. By breaking things on purpose, a business and its systems get better at handling unforeseen failures.
Typically, we don’t look only for a service’s complete failure; we also look for degraded behavior, such as high latency in a service’s responses. Couple this with the fact that almost all modern IT systems are highly distributed, and you get further problems, such as cascading failures, that are very hard to foresee from a test team’s perspective.
Planning your approach to chaos experiments
The initial stages of introducing failures and chaos into your organization play a vital role in ensuring success. There are a few key areas to consider before starting the journey:
- Know your application’s architecture and its steady-state metrics
- Work with non-critical services that have a well-defined steady state
- Apply either an opt-out or opt-in (less aggressive) model for the delivery teams
- Provide ways to evangelize your experiments to teams in QA environments
- Have the necessary fallbacks in place (circuit breakers, for example) and verify that they trigger as expected
- After the experiment, measure against the known steady state and confirm you are improving (for example, aiming for a lower mean time to recover, or MTTR); then run the tests again to verify (a minimal sketch of such a steady-state check follows this list)
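To make that last point concrete, here is a minimal sketch of a steady-state check and MTTR measurement, assuming a hypothetical QA metrics endpoint that reports the service’s current error rate. The endpoint, the error-rate field, and the 1% threshold are illustrative assumptions, not a prescribed implementation.

```python
# Hypothetical steady-state check and MTTR measurement for a chaos experiment.
# The metrics endpoint, error-rate field, and 1% threshold are assumptions.
import time
import requests

STEADY_STATE_MAX_ERROR_RATE = 0.01  # assumed steady state: at most 1% errors


def error_rate(metrics_url: str) -> float:
    """Fetch the service's current error rate from a (hypothetical) metrics endpoint."""
    resp = requests.get(metrics_url, timeout=5)
    resp.raise_for_status()
    return float(resp.json()["error_rate"])


def measure_mttr(metrics_url: str, poll_seconds: int = 10) -> float:
    """Poll until the service returns to steady state; return the recovery time in seconds."""
    start = time.monotonic()
    while error_rate(metrics_url) > STEADY_STATE_MAX_ERROR_RATE:
        time.sleep(poll_seconds)
    return time.monotonic() - start


if __name__ == "__main__":
    url = "https://qa.example.com/metrics"  # hypothetical QA metrics endpoint

    # Abort unless the service is already in its known steady state.
    assert error_rate(url) <= STEADY_STATE_MAX_ERROR_RATE, "Not in steady state; abort the experiment"

    # ... inject the failure here (kill an instance, drop a dependency, etc.) ...

    print(f"MTTR for this run: {measure_mttr(url):.0f} seconds")
```

Comparing the measured MTTR across repeated runs gives you the “are we getting better?” signal described above.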
Your goal is to move slowly toward automated chaos for the service in question. From there, you can move into more specific experiments. For example, if you are doing failovers, create experiments where a specific business-critical platform comes back up with a key piece missing.
Consider a situation where a messaging or streaming platform fails over, but comes back with a topic missing, or with just half its intended capacity. Determine whether the system can handle this, or whether it fails.
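As a rough sketch of how that check could be automated, the snippet below assumes a Kafka-style platform and the kafka-python client, and verifies after a failover that every business-critical topic still exists and still reports its partitions. The topic names and broker address are purely illustrative.

```python
# Post-failover check: did any business-critical topics disappear, and what
# capacity (partition count) did they come back with? Topic names and the
# broker address are illustrative assumptions.
from kafka import KafkaConsumer

REQUIRED_TOPICS = {"loan-applications", "payments", "checkout-events"}  # hypothetical


def verify_topics_after_failover(bootstrap_servers: str) -> None:
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        surviving = consumer.topics()  # topics visible on the failed-over cluster
        missing = REQUIRED_TOPICS - surviving
        if missing:
            raise AssertionError(f"Failover lost topics: {sorted(missing)}")
        for topic in sorted(REQUIRED_TOPICS):
            partitions = consumer.partitions_for_topic(topic) or set()
            print(f"{topic}: {len(partitions)} partitions after failover")
    finally:
        consumer.close()


if __name__ == "__main__":
    verify_topics_after_failover("qa-kafka.example.com:9092")  # hypothetical QA cluster
```

A failed assertion here is exactly the kind of weakness you want to find in QA rather than in production.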
You can take this one step further by looking for any cascading impacts your failure might have. In the messaging example, maybe this breaks your loan-application intake process, your payment processing, or your checkout process. None of this can be clearly predicted until the experimentation phase.
One key thing to remember: to successfully test for cascading failures and address them in QA, you need representatives from the relevant service teams participating in these experiments.
Get your testing regime up and running
A simple way to begin your testing regime is by looking at recent production issues and discovering whether you could have caught any of those problems by experimenting earlier on. Many traditional enterprises have a problem management group that can help to spearhead this discussion, or you can check with your DevOps/Service team(s).
Some IT organizations introduce system degradation using a tool like Chaos Monkey, which was invented by Netflix in 2011 to gauge the resilience of its IT infrastructure.
Remember that your goal is not to cause problems, but to reveal them. Be careful not to overlook the type and amount of traffic your tests create. Tools like the Chaos Automation Platform (ChAP, another test bed built within Netflix) route a small percentage of production traffic to the experiment and thereby help ‘increase the safety, cadence, and breadth of experimentation.’
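ChAP itself is internal to Netflix, but the underlying idea, deterministically sending only a small, configurable slice of traffic into the experiment so a bad outcome stays contained, can be sketched in a few lines. The bucketing scheme and the 1% budget below are assumptions for illustration, not how ChAP is implemented.

```python
# Illustrative traffic splitter (not ChAP): deterministically bucket each request
# so only a small, fixed percentage ever enters the experiment group.
import hashlib

EXPERIMENT_TRAFFIC_PERCENT = 1.0  # assumed budget: 1% of requests


def assign_group(request_id: str, percent: float = EXPERIMENT_TRAFFIC_PERCENT) -> str:
    """Return 'experiment' or 'control' for a request, consistently per request_id."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "experiment" if bucket < percent * 100 else "control"


if __name__ == "__main__":
    for request_id in ["req-001", "req-002", "req-003"]:
        print(request_id, "->", assign_group(request_id))
```

Hashing the request ID (rather than picking at random) means the same caller always lands in the same group, which keeps the experiment’s blast radius stable and its results easier to compare against the control group.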
Resilience maturity
While chaos experiments are very useful, one current limitation is the amount of up-front time spent meeting and planning with different teams to find good use cases and good faults to inject into services.
The industry and its best practices are maturing: new algorithms are being tested that automate the identification of the right services to run experiments on. This can reduce, or even eliminate, the up-front meeting time and surface critical flaws earlier, before they show up as a production issue or a customer complaint.
Werner Vogels, Amazon’s CTO, is famous for saying “everything fails, all the time.” This is even more true in elastic cloud environments, where applications are architected on immutable infrastructure. So the culture of asking “What happens if this fails?” needs to shift to “What happens when this fails?”