Chaos Engineering Overview
Principle of Chaos Engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production.
Modern large-scale software systems are complex, with many components and services operating together as a distributed system. Interactions between services can cause unpredictable outcomes that affect production environments.
Systems need to be tested for weaknesses such as improper fallback settings, unavailable services, outages from traffic overload, cascading failures from a single point of failure, and many more. Rigorous testing measures the stability of a complex system in its production deployment and reveals areas to improve so that the system can deal with potential chaos.
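As an illustration of the "improper fallback settings" weakness, here is a minimal sketch in Python. Every name in it (`DependencyUnavailable`, `make_dependency`, `handle_request`) is hypothetical and invented for this example; it is not taken from any real chaos engineering tool. It assumes a service that calls a single dependency and injects an outage into that dependency to see whether the failure is absorbed or cascades to the caller.

```python
class DependencyUnavailable(Exception):
    """Raised by the hypothetical dependency when an outage is injected."""


def make_dependency(fail):
    """Build a stand-in dependency; fail=True simulates an injected outage."""
    def get_recommendations(user_id):
        if fail:
            raise DependencyUnavailable("injected outage")
        return [f"item-for-{user_id}"]
    return get_recommendations


def handle_request(user_id, dependency, fallback=None):
    """Serve a request, using the fallback only if the dependency is down."""
    try:
        return dependency(user_id)
    except DependencyUnavailable:
        if fallback is None:
            # No fallback configured: the failure cascades to the caller.
            raise
        # Fallback configured: degrade gracefully instead of failing.
        return fallback
```

Calling `handle_request("u1", make_dependency(fail=True), fallback=["popular-item"])` returns the fallback list, while the same call without a fallback raises `DependencyUnavailable`. A chaos experiment is meant to surface exactly this kind of gap before a real outage does.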
Practising Chaos
Define the 'steady state' as a measurable output of the system that indicates normal behaviour
Hypothesise that the steady state will continue in both the control group and the experiment group
Introduce variables that reflect real-world events, such as service failures and network overload
Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experiment group
The more difficult it is to disrupt the steady state, the more confidence there is in the system's resilience.
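The four steps above can be sketched as a small simulation. This is a toy model under stated assumptions, not a real chaos tool: the steady state is taken to be the request success rate, the "system" is a hypothetical service that retries a flaky dependency, and the function names, the `tolerance` threshold, and the fixed random seed are all invented for illustration.

```python
import random


def serve_request(dependency_failure_rate, attempts, rng):
    """Hypothetical service: succeeds if any attempt on its dependency succeeds."""
    for _ in range(attempts):
        if rng.random() >= dependency_failure_rate:
            return True
    return False


def success_rate(requests, dependency_failure_rate, attempts, rng):
    """Steady-state metric: fraction of requests served successfully."""
    served = sum(
        serve_request(dependency_failure_rate, attempts, rng)
        for _ in range(requests)
    )
    return served / requests


def run_experiment(injected_failure_rate, attempts,
                   tolerance=0.05, requests=10_000, seed=42):
    """Compare the steady state of a control group and an experiment group."""
    rng = random.Random(seed)
    # Control group: no fault injected.
    control = success_rate(requests, 0.0, attempts, rng)
    # Experiment group: dependency failures injected at the given rate.
    experiment = success_rate(requests, injected_failure_rate, attempts, rng)
    deviation = abs(control - experiment)
    return {
        "control": control,
        "experiment": experiment,
        "deviation": deviation,
        # The hypothesis holds if the steady state stays within tolerance.
        "hypothesis_holds": deviation <= tolerance,
    }
```

With these assumptions, `run_experiment(injected_failure_rate=0.2, attempts=3)` should keep the steady state within tolerance, because three attempts absorb most of the injected failures. With `attempts=1` the same fault should disprove the hypothesis. In a real experiment the metric would come from monitoring data rather than from a random number generator.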
Benefits
Improve system resilience and reliability
Reduce revenue loss
Develop an in-depth understanding of the system
Improve failure recovery
Challenges
Risk of causing outages during experiments
Resource limitations
Need for robust monitoring systems
Tools
Interview Questions
How would you test for resilience in a system?
What is the difference between fault tolerance and resiliency?
What are the differences between load testing and chaos engineering?
Author
References
Last updated