Definition: Discipline introducing controlled failures into production systems to discover weaknesses before they cause real incidents, improving system resilience.
— Source: NERVICO, Product Development Consultancy
What Is Chaos Engineering
Chaos engineering is a discipline that improves system resilience through the deliberate introduction of controlled failures in production or pre-production environments. Popularized by Netflix with their Chaos Monkey tool, it operates on the premise that complex distributed systems fail in unpredictable ways, and the only way to prepare is to proactively experiment with those failures under controlled conditions.
How It Works
The process follows four steps: (1) define the system’s steady state (normal metrics), (2) formulate a hypothesis about how the system will handle a specific failure, (3) introduce the failure in a controlled manner (terminate instances, inject latency, simulate network outages), and (4) observe whether the system maintains its steady state or degrades. If the hypothesis fails, a real weakness has been discovered that can be corrected before causing a production incident. AWS Fault Injection Simulator (FIS) enables running these experiments in a managed way.
Key Use Cases
- Validating that auto-healing and failover mechanisms work correctly when instances or availability zones fail
- Discovering hidden dependencies between services not documented in architecture documentation
- Verifying system behavior under degraded network conditions such as high latency or packet loss
- Training incident response teams with realistic failure scenarios in controlled environments
Advantages and Considerations
Chaos engineering transforms production failures from unexpected events into rehearsed situations, significantly reducing resolution time when real incidents occur. It builds confidence in system resilience based on empirical evidence. The main consideration is that it requires mature observability systems and the ability to stop experiments quickly if impact exceeds expectations. It should be adopted gradually, starting with low-impact experiments.