Session slides from Future Insights Live, Vegas 2015:
https://futureinsightslive.com/las-vegas-2015/
Reliability and uptime are critical to any product. It doesn't matter how beautiful the user interface is, how amazing a feature is, or how stunning a cutting-edge product is if it's down. We live in a world where our users expect the products we make to work anytime, every time, all the time. Chaos-driven development is the discipline of starting with failure scenarios and designing our products with failure in mind. It forces us to understand our users, our minimum viable product, and what drives our businesses, so that we can architect the systems on which our products are built. Innovation is about navigating tradeoffs; chaos-driven development helps us both understand and be intentional about the tradeoffs we make. Bruce takes a look at the radical strategies Netflix applies to ensure a reliable customer experience: why chaos, how chaos, what chaos, and the results.
2. A LITTLE ABOUT ME
• Founder of Chaos Engineering @ Netflix
• Computer Science Background
• Multiple roles scaling Netflix from 8m to 60m+ subs
• Currently Taking a Break
@bruce_m_wong
3. Most enterprises hire people to fix things. Netflix hires people to break things….
…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
15. FAULT-TOLERANT DESIGN PRINCIPLES
• Eliminate Single Points of Failure
• Allow parts of the system to fail independently (Failure Isolation)
• Prevent propagation (Failure Containment)
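One common way to contain a failure is to give each dependency call a fallback, so an error in one component degrades the response instead of propagating to the caller. The sketch below is only an illustration of that idea, not something taken from the slides: `with_fallback`, `fetch_recommendations` and `popular_titles` are hypothetical names.

```python
def with_fallback(call, fallback):
    """Run call(); on any exception, return fallback() instead of propagating."""
    try:
        return call()
    except Exception:
        # Contain the failure here so the caller still gets a usable answer.
        return fallback()


def fetch_recommendations():
    # Hypothetical dependency that is currently failing.
    raise ConnectionError("recommendations service unavailable")


def popular_titles():
    # Hypothetical static default that needs no remote call.
    return ["Title A", "Title B"]


rows = with_fallback(fetch_recommendations, popular_titles)
print(rows)
```

The caller receives the static default rather than an exception, which keeps the failure confined to the one feature that depends on the broken service.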
18. PRIORITIZE
• Many aspects and features are important
• Each has different consequences when it stops working
• A product’s value proposition is what drives your business
30. WHAT DOES CHAOS LOOK LIKE?
• Types - errors, latency
• Duration - how long?
• Intensity - how much?
31. WHAT DOES CHAOS LOOK LIKE?
• Return errors for a % of requests
• e.g. return HTTP 500 for 1% of requests for 1 minute
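A minimal sketch of this kind of error injection, assuming a service where you can intercept the status code before it is returned. `ErrorInjector` and its parameters are hypothetical names for illustration, not Netflix tooling; the clock and random source are injectable so the behavior can be checked without waiting a real minute.

```python
import random
import time


class ErrorInjector:
    """Fails a fraction of requests for a fixed window (hypothetical helper)."""

    def __init__(self, rate, duration_s, clock=time.monotonic, rng=None):
        self.rate = rate                      # fraction of requests to fail, e.g. 0.01
        self.clock = clock
        self.rng = rng or random.Random()
        self.ends_at = clock() + duration_s   # window opens when the injector is created

    def active(self):
        return self.clock() < self.ends_at

    def status_for(self, real_status=200):
        """Return 500 for roughly `rate` of requests while the window is open."""
        if self.active() and self.rng.random() < self.rate:
            return 500
        return real_status


# The slide's example: HTTP 500 for 1% of requests for 1 minute.
injector = ErrorInjector(rate=0.01, duration_s=60)
statuses = [injector.status_for() for _ in range(10_000)]
print(statuses.count(500), "of", len(statuses), "requests failed")
```

Because the window closes on its own, the experiment ends even if nobody remembers to switch it off.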
32. WHAT DOES CHAOS LOOK LIKE?
• Make it slow(er) - Introduce Latency
• e.g. sleep for 10 ms on every request for 1 minute
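Latency injection can be sketched as a wrapper around a request handler. This is an illustration under assumptions: `with_latency` and `handle_request` are hypothetical names, and the one-minute window is assumed to start when the wrapper is applied.

```python
import functools
import time


def with_latency(delay_s, duration_s, clock=time.monotonic, sleep=time.sleep):
    """Decorator: delay every call by delay_s until duration_s has elapsed."""
    ends_at = clock() + duration_s  # window starts when the decorator is applied

    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if clock() < ends_at:
                sleep(delay_s)      # the injected latency
            return handler(*args, **kwargs)
        return wrapper
    return decorate


# The slide's example: 10 ms on every request for 1 minute.
@with_latency(delay_s=0.010, duration_s=60)
def handle_request(path):
    return f"200 OK {path}"


started = time.perf_counter()
response = handle_request("/titles")
elapsed_ms = (time.perf_counter() - started) * 1000
print(response, f"took {elapsed_ms:.1f} ms")
```

The handler's result is unchanged; only its response time moves, which is what lets you observe how callers cope with a slower dependency.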
33. WHAT DOES CHAOS LOOK LIKE?
Gradually increase
• e.g. sleep for 10 ms on every request for 1 minute
• then sleep for 100 ms on every request for 3 minutes
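The gradual increase can be written down as a schedule of stages. The sketch below encodes the two stages from the slide; `Stage`, `STAGES` and `delay_at` are hypothetical names, and returning to zero delay once the last stage ends is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    delay_ms: int     # latency injected on every request during this stage
    duration_s: int   # how long the stage lasts


# The two stages from the slide: 10 ms for 1 minute, then 100 ms for 3 minutes.
STAGES = [Stage(delay_ms=10, duration_s=60), Stage(delay_ms=100, duration_s=180)]


def delay_at(elapsed_s, stages=STAGES):
    """Return the injected delay (ms) at elapsed_s seconds in, or 0 once all stages end."""
    stage_start = 0
    for stage in stages:
        if elapsed_s < stage_start + stage.duration_s:
            return stage.delay_ms
        stage_start += stage.duration_s
    return 0  # assumption: the exercise stops injecting after the last stage


for t in (0, 59, 60, 239, 240):
    print(f"t={t:>3}s -> inject {delay_at(t)} ms")
```

Keeping the schedule as data makes it easy to review the planned intensity and duration before running the exercise, and to add a gentler first stage if needed.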
34. WHAT DOES CHAOS LOOK LIKE?
The design/implementation worked!
• microscopic impact, high confidence
What if it didn’t work?
• smaller impact than an outage
• proactively fix it and try again
35. WHAT DOES AN OUTAGE LOOK LIKE?
• Detection takes time (TTD)
• Analysis takes time
• Resolution takes time (TTR)
• Outages happen at inconvenient times
37. WHAT ABOUT TESTING?
• Testing is good - do it, automate it
• While great testing disciplines can find most functional bugs…
• Scale, traffic and capacity
• System misconfiguration and design limitations
38. LESSONS LEARNED
• You learn more from chaos exercises than from outages
• Fixing a failure mode will uncover new ones
• Configuration is often overlooked
• Tools can break
42. WHAT MAKES CHAOS HARD?
In addition to the technical challenges
• Organizations rarely incentivize people to try to break production
• Misconceptions about complex systems and scale
43. TAKEAWAYS
• What are the consequences?
• Start small, start early
• Work together - share context
• Validate, don’t assume