Adopting Chaos Engineering

Michael Tobin
Apr 25, 2024
3 min read

Updated: Apr 26, 2024

Around a year ago I gave a presentation on Chaos Engineering at the Yorkshire Azure User Group, and I thought I'd share my findings in a post. So what is Chaos Engineering, and is it the key to improving system resilience?

Chaos engineering is a practice that aims to improve system resilience and reliability by intentionally introducing controlled disruptions and failures into your production environment.

Chaos engineering can help organizations to build more resilient, reliable systems that can better withstand unexpected events and disruptions.
Adopting a chaos engineering approach requires a cultural shift towards embracing failure as an opportunity for learning and improvement.

The Problem

In 2010, Netflix decided to move their systems to AWS. In this new environment, hosts could be terminated and replaced at any time, which meant their services needed to be prepared for this constraint.

The Solution

Netflix created Chaos Monkey to test system stability by forcing failures via the pseudo-random termination of instances and services within their architecture.
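To make the idea concrete, here is a minimal sketch of pseudo-random instance termination. It is not Netflix's implementation: the Instance class, the fleet and the function names are all hypothetical, and the "termination" only flips a flag in memory where a real tool would call the platform's API.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a running host; not a real cloud API object."""
    instance_id: str
    service: str
    running: bool = True


def terminate_random_instance(instances, rng, probability=1.0):
    """Pick one running instance pseudo-randomly and mark it terminated.

    Returns the terminated instance, or None if nothing was terminated.
    """
    candidates = [i for i in instances if i.running]
    if not candidates or rng.random() >= probability:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # a real tool would call the platform's terminate API here
    return victim


def service_is_available(instances, service):
    """A service counts as available while at least one of its instances is running."""
    return any(i.running for i in instances if i.service == service)


# Hypothetical fleet: "web" has a spare instance, "api" does not.
fleet = [
    Instance("web-1", "web"),
    Instance("web-2", "web"),
    Instance("api-1", "api"),
]
rng = random.Random(42)  # seeded so the experiment can be repeated
victim = terminate_random_instance(fleet, rng)

for service in ("web", "api"):
    print(service, "available:", service_is_available(fleet, service))
```

In this made-up fleet the web service survives whichever instance is picked, while the api service has no spare. That is the kind of weakness random termination is meant to surface.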

The Result

Netflix estimated that having a team of Chaos Engineers had a 53% ROI, by comparing the revenue lost to incidents that could have been detected against the cost of chaos experiments and a team to run them.
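For anyone wanting to run a similar comparison, ROI is commonly expressed as the net benefit divided by the cost. The sketch below uses that general formula, which is an assumption on my part rather than Netflix's published method, and the figures are invented purely to show the arithmetic.

```python
def roi(avoided_loss, programme_cost):
    """Return on investment as a fraction: net benefit divided by cost."""
    if programme_cost <= 0:
        raise ValueError("programme_cost must be positive")
    return (avoided_loss - programme_cost) / programme_cost


# Hypothetical figures, not Netflix's: revenue that would have been lost to
# detectable incidents versus the cost of experiments plus the team.
example_roi = roi(avoided_loss=300_000, programme_cost=200_000)
print(f"ROI: {example_roi:.0%}")  # prints "ROI: 50%"
```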

Isn't this just testing?

It’s important to make a distinction here: testing is important and will always remain so, but chaos engineering is about testing against the unknown. In both Development and Infrastructure we tend to test against what we already “know”. If a company implements a DR plan with automatic failover configured, then they’d be confident in their resilience, right? But what happens when a condition of the failover isn’t met? What happens when they start injecting small faults which they’ve not considered in their DR plan? This is what Chaos Engineering attempts to explore.
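The failover question can be illustrated with a small simulation. Everything in it is hypothetical: the Region class, the 30-second replication lag limit and the fault functions are assumptions made for the example and do not describe any particular DR product.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical model of a deployment region."""
    name: str
    healthy: bool = True
    replication_lag_s: float = 0.0


MAX_LAG_S = 30.0  # hypothetical precondition for automatic failover


def pick_serving_region(primary, secondary):
    """Return the region that should serve traffic, or None for a full outage."""
    if primary.healthy:
        return primary
    # Automatic failover only fires if its preconditions are met.
    if secondary.healthy and secondary.replication_lag_s <= MAX_LAG_S:
        return secondary
    return None


def run_experiment(inject_fault):
    """Build fresh regions, inject a fault, fail the primary and report the outcome."""
    primary, secondary = Region("primary"), Region("secondary")
    inject_fault(secondary)
    primary.healthy = False  # the failure the DR plan was written for
    serving = pick_serving_region(primary, secondary)
    return serving.name if serving else "outage"


def no_fault(region):
    """The scenario a conventional DR test covers: nothing else is wrong."""


def lagging_replica(region):
    """A small fault the DR plan did not consider."""
    region.replication_lag_s = 120.0


results = {
    "known scenario": run_experiment(no_fault),
    "lagging replica": run_experiment(lagging_replica),
}
print(results)
```

The known scenario fails over cleanly, while the lagging replica breaks a failover precondition and turns the same primary failure into an outage. A conventional DR test would only have exercised the first case.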

Timeline

Honourable Mentions:

2014 - Netflix coined the job role “Chaos Engineer”.

2016 - Kolton Andrus and Matthew Fornaciari founded Gremlin, the world's first managed enterprise Chaos Engineering solution.

2023 - Azure Chaos Studio exited preview and went into GA.

What tools are available to leverage Chaos Engineering?

Azure

Azure Chaos Studio is a cloud-based service that enables engineers to practice Chaos Engineering by running experiments on their Azure infrastructure. Chaos Studio provides pre-set templates for users to choose from such as network disruptions, virtual machine failures, or application crashes, and allows users to customize these scenarios to suit their specific needs.

AWS

AWS Fault Injection Service (FIS), previously named Fault Injection Simulator, is a fully managed service for running fault injection experiments to improve an application’s performance, observability, and resiliency.

FIS simplifies the process of setting up and running controlled fault injection experiments across a range of AWS services, so teams can build confidence in their application behaviour.

GCP

Currently, GCP does not offer a managed Chaos Engineering solution. There are open-source, third-party tools such as Chaos Toolkit; however, it’s worth noting that this is a command-line tool and is not supported by Google.

On-premise

Chaos Engineering doesn't have to be restricted to the cloud. There are multiple tools out there which cover a wide range of scenarios; for a list, check out the GitHub repository dastergon/awesome-chaos-engineering, a curated list of Chaos Engineering resources (github.com).

Dark Debt

Dark debt is the potential loss of value from hidden vulnerabilities in a stable IT environment that will eventually fail. The term can cover anything that happens that wasn’t planned for or defended against. As we build larger, more complicated systems, the risk of small anomalies causing total system failures increases exponentially. These failures aren’t accounted for in design, development or QA, for either software or infrastructure.

Chaos Engineering is a key practice for reducing ‘dark debt’ as much as possible: injecting failures and learning from them saves us from having to learn the same lessons in a real-world outage.

Conclusion

The biggest barrier to entry for organisations looking to adopt Chaos Engineering is the cultural one: getting comfortable with breaking things in a production environment, getting key stakeholders involved, responding quickly to potential outages caused by a chaos experiment, and bridging the gap between internal IT teams. The resources are out there to plan a chaos experiment, and the tools are available; if you're in the cloud, especially AWS and Azure, I'd argue the tools are easy to use. I'm looking to release a post on using Chaos Studio soon, so keep an eye out for that.