Chaos Driven Development (Bruce Wong)

CHAOS DRIVEN DEVELOPMENT
Future Insights Live 2015, Las Vegas
Bruce Wong

A LITTLE ABOUT ME
• Founder of Chaos Engineering @ Netflix
• Computer Science background
• Multiple roles scaling Netflix from 8M to 60M+ subscribers
• Currently taking a break
@bruce_m_wong

Most enterprises hire people to fix things. Netflix hires
people to break things….
…we should embrace Netflix's culture of "chaos engineering"
throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/
https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/

http://www.cnbc.com/id/102394893

CHAOS DEFINED
“If it ain’t broke, don’t fix it”
-Bert Lance, Nation’s Business 1977
If it ain’t broke, try harder
-chaos philosophy

CHAOS DEFINED
Intentionally introducing failure into a system
with the purpose of validating resilience design.

WHY CHAOS?
Failure happens.

WHY CHAOS?
•Hardware fails
•Power outages
•Software has bugs
•Human error
•Natural disasters

http://money.cnn.com/2012/10/30/technology/netflix-hurricane-sandy/

http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html
https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/

BLUE MOONS
“Once in a blue moon” will eventually happen

FAULT-TOLERANT DESIGN PRINCIPLES
• Eliminate Single Points of Failure
• Allow parts of the system to fail independently (Failure Isolation)
• Prevent propagation (Failure Containment)
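
A minimal sketch of what isolation and containment might look like in code. Everything named here is an assumption for illustration, not from the talk: `fetch_recommendations` stands in for a failing non-critical dependency, and `DEFAULT_TITLES` is a hypothetical degraded result. The failure stays inside the wrapper and the page still renders.

```python
from typing import Callable, List

# Hypothetical static fallback content; an assumption for this sketch.
DEFAULT_TITLES: List[str] = ["Popular title A", "Popular title B"]


def fetch_recommendations(user_id: int) -> List[str]:
    """Stand-in for a non-critical dependency that is currently failing."""
    raise ConnectionError("recommendation service unavailable")


def with_fallback(call: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Contain a dependency failure: return a degraded result instead of raising."""
    try:
        return call()
    except Exception:
        # The error stops here and does not propagate to the caller.
        return list(fallback)


def render_home_page(user_id: int) -> dict:
    """Build the page even when the recommendation call fails."""
    titles = with_fallback(lambda: fetch_recommendations(user_id), DEFAULT_TITLES)
    return {"user": user_id, "titles": titles}


page = render_home_page(42)
```

Catching every exception is a deliberate simplification; a real service would likely add timeouts and catch narrower error types.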

START WITH CONSEQUENCES
Chaos Driven Development

MINIMUM VIABLE PRODUCT
• Understand your users
• Understand your value proposition
• Understand your business

PRIORITIZE
• Many aspects and features are important
• Each has different consequences for not working
• A product’s value proposition is what drives your business

DESIGN FOR FAILURE
What failure isolation might look like

APPLYING CHAOS
Validation of fault-tolerant design

BREAKING THE CONNECTION
How confident are you?
- Next week?
- Next month?
- After that “quick patch”?

WHAT DOES CHAOS LOOK LIKE?
• Types - errors, latency
• Duration - how long?
• Intensity - how much?
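
The three dimensions above can be captured in a small data structure. This is a sketch only: the class and field names are assumptions for illustration, and intensity is modelled here as the fraction of requests affected, which is one possible reading of "how much".

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    """Type: the kind of failure to introduce."""
    ERROR = "error"
    LATENCY = "latency"


@dataclass(frozen=True)
class ChaosExperiment:
    failure_type: FailureType  # Type: errors or latency
    duration_s: float          # Duration: how long, in seconds
    intensity: float           # Intensity: fraction of requests affected, 0.0 to 1.0

    def __post_init__(self) -> None:
        # Reject experiments that cannot be run as described.
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")


# A deliberately small first experiment: errors on 1% of requests for 1 minute.
small = ChaosExperiment(FailureType.ERROR, duration_s=60, intensity=0.01)
```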

WHAT DOES CHAOS LOOK LIKE?
• Return errors for a % of requests
• e.g. return HTTP 500 for 1% of requests for 1 minute
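
A minimal sketch of this kind of error injection, assuming a request handler that returns a `(status, body)` tuple. `ErrorInjector` and its parameters are hypothetical names for illustration, not a Netflix tool. The clock and random source are injectable so the behaviour can be checked without waiting a minute.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]


class ErrorInjector:
    """Return HTTP 500 for a fraction of requests during a fixed time window."""

    def __init__(
        self,
        error_rate: float,
        duration_s: float,
        rng: Optional[random.Random] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.error_rate = error_rate
        self.clock = clock
        self.deadline = clock() + duration_s  # injection stops after this time
        self.rng = rng if rng is not None else random.Random()

    def active(self) -> bool:
        """True while the exercise window is still open."""
        return self.clock() < self.deadline

    def handle(self, handler: Callable[[], Response]) -> Response:
        """Fail the request with probability error_rate, else run the real handler."""
        if self.active() and self.rng.random() < self.error_rate:
            return 500, "injected failure"
        return handler()


# 1% of requests fail for 1 minute; afterwards every request passes through.
injector = ErrorInjector(error_rate=0.01, duration_s=60)
status, body = injector.handle(lambda: (200, "ok"))
```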

WHAT DOES CHAOS LOOK LIKE?
• Make it slow(er) - introduce latency
• e.g. sleep for 10ms on every request for 1 minute
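
Latency injection can be sketched the same way. `LatencyInjector` is a hypothetical name, and the injectable `clock` and `sleep` parameters are assumptions added so the logic can be exercised without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class LatencyInjector:
    """Delay every request by a fixed amount during a fixed time window."""

    def __init__(
        self,
        delay_s: float,
        duration_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay_s = delay_s
        self.clock = clock
        self.sleep = sleep
        self.deadline = clock() + duration_s  # injection stops after this time

    def handle(self, handler: Callable[[], T]) -> T:
        """Sleep before the real handler while the window is open."""
        if self.clock() < self.deadline:
            self.sleep(self.delay_s)
        return handler()


# Sleep for 10ms on every request for 1 minute.
injector = LatencyInjector(delay_s=0.010, duration_s=60)
result = injector.handle(lambda: "ok")
```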

WHAT DOES CHAOS LOOK LIKE?
Gradually increase
• e.g. sleep for 10ms on every request for 1 minute
• then sleep for 100ms on every request for 3 minutes
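
The gradual ramp can be written down as a schedule of stages. The sketch below assumes the stages run back to back; `Stage`, `RAMP` and `delay_at` are hypothetical names for illustration.

```python
from typing import NamedTuple, Sequence


class Stage(NamedTuple):
    delay_ms: int     # latency to inject on every request
    duration_s: int   # how long this stage lasts


# The two stages from the slide: 10ms for 1 minute, then 100ms for 3 minutes.
RAMP = [Stage(delay_ms=10, duration_s=60), Stage(delay_ms=100, duration_s=180)]


def delay_at(elapsed_s: float, stages: Sequence[Stage] = RAMP) -> int:
    """Return the delay in ms to inject at elapsed_s seconds into the exercise.

    Returns 0 before the exercise starts and after the last stage ends.
    """
    if elapsed_s < 0:
        return 0
    stage_start = 0.0
    for stage in stages:
        if elapsed_s < stage_start + stage.duration_s:
            return stage.delay_ms
        stage_start += stage.duration_s
    return 0


first_minute = delay_at(30)    # 10
third_minute = delay_at(150)   # 100
after_ramp = delay_at(240)     # 0
```

Keeping the schedule as data makes it easy to start with a smaller first stage or to stop early if the first stage already shows a problem.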

WHAT DOES CHAOS LOOK LIKE?
The design/implementation worked!
• microscopic impact, high confidence
What if it didn’t work?
• smaller impact than an outage
• proactively fix it and try again

WHAT DOES AN OUTAGE LOOK LIKE?
• Detection takes time (TTD)
• Analysis takes time
• Resolution takes time (TTR)
• Inconvenient times

CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact

WHAT ABOUT TESTING?
• Testing is good - do it, automate it
• Great testing disciplines can find most functional bugs, but tend to miss:
- Scale, traffic and capacity problems
- System misconfiguration and design limitations

LESSONS LEARNED
• You learn more from chaos exercises than from outages
• Fixing a failure mode will uncover new ones
• Configuration is often overlooked
• Tools can break

WHY IS THIS HARD?

WHAT MAKES RESILIENCE DESIGN HARD?
• Product and engineering decision
• Tradeoffs are difficult
• Organizational Silos

ORGANIZATIONAL SILOS
• Services by Domain
• Dev/Ops/Product
• Incomplete context

WHAT MAKES CHAOS HARD?
In addition to the technical challenges:
• Organizations rarely incentivize people to try to break production
• Misconceptions about complex systems and scale

TAKEAWAYS
• What are the consequences?
• Start small, start early
• Work together - share context
• Validate, don’t assume
