June 13, 2022
Let’s start with the definition from the chaos engineering community:
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Which Chaos Engineering experiments to perform first?
The diagram below illustrates this concept:
Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow 5 phases:
Measure experiments carefully, ensuring they are low-risk: involve few users, limit user flows, limit the number of live devices, etc.
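One way to keep an experiment low-risk is to cap its blast radius in code. The sketch below is a minimal illustration, not taken from any particular chaos tool: it places a small, deterministic percentage of users in the experiment and aborts when the observed error rate crosses a threshold. The names `in_blast_radius` and `should_abort`, the salt value, and the thresholds are all hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float, salt: str = "experiment-1") -> bool:
    """Deterministically place roughly `percent`% of users in the experiment.

    Hashing the user ID (rather than sampling randomly per request) keeps
    each user consistently inside or outside the experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # e.g. percent=1.0 selects buckets 0..99


def should_abort(errors: int, requests: int,
                 max_error_rate: float = 0.05, min_requests: int = 20) -> bool:
    """Abort once enough traffic has been seen and the error rate is too high."""
    if requests < min_requests:
        return False  # too little data to judge yet
    return errors / requests > max_error_rate
```

A fault would then be injected only for requests where `in_blast_radius(user_id, 1.0)` is true, with `should_abort` checked periodically so the experiment can be halted before it affects more users.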
The following is a list of tools to get you started:
Chaos Monkey: The OG of chaos engineering. The tool is still maintained and currently integrated into Spinnaker, a continuous delivery platform developed initially by Netflix to release software changes rapidly and reliably.
Platform: Spinnaker
Release year: 2012
Creator: Netflix
Language: Go
Mangle: Runs chaos engineering experiments against applications and infrastructure components so that resiliency and fault tolerance can be assessed quickly. It is designed to introduce faults with minimal pre-configuration and supports a wide range of targets, including Kubernetes, Docker, vCenter, and any remote machine with SSH enabled.
Platforms: Kubernetes, Docker, vCenter, remote machines over SSH
Release year: 2019
Creator: VMware
Language: Java
AWS Fault Injection Simulator: A fully managed service for running fault injection experiments on AWS, intended to make it easier to improve an application’s performance, observability, and resiliency.
Works with: Amazon Relational Database Service (RDS), Elastic Compute Cloud (EC2), Elastic Container Service (ECS), and Elastic Kubernetes Service (EKS)
Release year: 2021
Creator: Amazon Web Services
ChaosBlade: Built on nearly ten years of failure testing at Alibaba. It supports a wide range of platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, including arbitrary code injection, delayed code execution, and modification of in-memory values.
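To make the idea of application-level fault injection more concrete, here is a minimal sketch of delayed code execution written as a plain Python decorator. It is not ChaosBlade's API or implementation; the `inject_delay` name and its parameters are hypothetical and only illustrate the concept of adding latency to a call without changing its result.

```python
import functools
import random
import time


def inject_delay(delay_seconds: float, probability: float = 1.0,
                 enabled=lambda: True, sleep=time.sleep, rng=random.random):
    """Wrap a function so that calls are delayed with the given probability.

    `enabled` acts as a kill switch; `sleep` and `rng` are injectable so the
    behavior can be verified without actually waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Inject the delay only while the experiment is switched on.
            if enabled() and rng() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)  # the result itself is unchanged
        return wrapper
    return decorator
```

Decorating a hypothetical downstream call with `@inject_delay(0.5, probability=0.1)` would slow roughly one call in ten by half a second, which is enough to observe how timeouts and retries in the caller behave.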
Because chaos engineering is an experimental approach, it gives us a holistic view of the system’s behavior and of how all the moving parts interact in a given set of circumstances, allowing us to derive insights into both the technical and the soft aspects of the system (that is, the human factor). Chaos engineering can also help organizations find security vulnerabilities that are hard to detect with traditional methods because of the complex nature of distributed systems. This may include losses caused by human factors, poor design, or a lack of resiliency.