You might have heard about chaos engineering in the context of Netflix and Amazon, and how they kill EC2 servers in production at random to verify that their systems can stay up in the face of infrastructure failures. But did you know that the same ideas can be applied to serverless applications? Yes, despite not having access to the underlying servers, we can still apply principles of chaos engineering to uncover failure modes in our system (and there are plenty!) so we can build a defence against them and make our serverless applications more robust and more resilient!
4. @theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
31. @theburningmonk theburningmonk.com
chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an AWS
Availability Zone
chaos kong kills an entire
AWS region
56. @theburningmonk theburningmonk.com
TIL: most HTTP client libraries have default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to timeout of 3-6s.
Don’t forget about the cold starts!
78. @theburningmonk theburningmonk.com
TIL: the js DynamoDB client defaults to 10 retries
with base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s
fav formula!