QConSF 2016 Abstract:
By embracing the tension between order and chaos and applying a healthy mix of discipline and surrender, Netflix reliably operates microservices in the cloud at scale. But every lesson learned and solution developed over the last seven years was born out of pain for us and our customers. Even today we remain vigilant as we evolve our service architecture. For those just starting the microservices journey, these lessons and solutions provide a blueprint for success.
In this talk we’ll explore the chaotic and vibrant world of microservices at Netflix. We’ll start with the basics: the anatomy of a microservice, the challenges around distributed systems, and the benefits realized when operational practices and technical solutions are properly integrated. Then we’ll build on that foundation, exploring the cultural, architectural, and operational methods that lead to microservice mastery.
10. Leader in subscription internet TV service
Hollywood, indie, local
Growing slate of original content
86 million members
~190 countries, 10s of languages
1000s of device types
Microservices on AWS
12. Netflix DVD Data Center - 2000
What microservices are not
[Diagram: a load balancer sends HTTP/S traffic to a Linux host running Apache, Tomcat, and a Java web app, which talks over HTTP and JDBC to STORE and BILLING databases joined by a DB link]
Monolithic code base
Monolithic database
Tightly coupled architecture
14. …the microservice architectural style is an
approach to developing a single application as a
suite of small services, each running in its own
process and communicating with lightweight
mechanisms, often an HTTP resource API.
- Martin Fowler
15. Separation of concerns
Modularity, encapsulation
Scalability
Horizontal scaling
Workload partitioning
Virtualization & elasticity
Automated operations
On demand provisioning
An Evolutionary Response
25. Crossing the Chasm
[Diagram: Service A and Service B, each a cluster of Linux hosts running Apache Tomcat, calling each other over the network]
Network latency, congestion, failure
Logical or scaling failure
30. Fault Injection Testing (FIT)
[Diagram: a device calls over the internet through an ELB to the edge (Zuul), which fans out to Services A, B, and C; FIT injects failures at the edge]
Synthetic transactions
Override by device or account
% of live traffic, up to 100%
31. Fault Injection Testing (FIT)
[Diagram: the same call path, with the injected failure context propagated to every service]
Enforced throughout the call path
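As a rough illustration of how a FIT-style failure can be enforced throughout the call path, here is a minimal sketch. The names (`RequestContext`, `call_downstream`, the scenario dictionary) are hypothetical, not Netflix's actual FIT API; the point is that the failure scenario travels with the request and is checked at every hop.

```python
class RequestContext:
    """Carries the fault-injection scenario along with the request."""
    def __init__(self, fit_scenario=None):
        # e.g. {"fail": "ServiceC"} injected for a % of live traffic
        self.fit_scenario = fit_scenario or {}

class InjectedFault(Exception):
    pass

def call_downstream(ctx, service_name, real_call):
    """Enforce the injected fault before calling a downstream dependency."""
    if ctx.fit_scenario.get("fail") == service_name:
        raise InjectedFault(f"FIT: simulated failure of {service_name}")
    return real_call()

# Usage: Service A calls Service C with a fault injected,
# exercising its fallback path under test.
ctx = RequestContext(fit_scenario={"fail": "ServiceC"})
try:
    result = call_downstream(ctx, "ServiceC", lambda: "metadata")
except InjectedFault:
    result = "fallback"
```

Because every service checks the same context, a scenario scoped to a device or account exercises the full call path rather than a single hop.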
42. CAP Theorem
In the presence of a network partition, you must choose between consistency and availability.
[Diagram: a service on Network A writes to DB replicas over Networks B, C, and D; one network link is partitioned (X)]
43. Eventual Consistency
[Diagram: a client writes to replicas in Zones A, B, and C using Local Quorum (typical); replication to the remote zones completes later (~100ms), giving eventual consistency]
52. Not a cache or a database
Frequently accessed metadata
No instance affinity
Loss of a node is a non-event
What is a stateless service?
54. Minimum size
Desired capacity
Maximum size
Scale out as needed
AMI retrieved on demand from S3
Compute efficiency
Node failure
Traffic spikes
Performance bugs
Auto Scaling Groups
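The sizing rule behind the bullets above (minimum size, desired capacity, maximum size, scale out as needed) can be sketched as a simple target-tracking calculation. This is an illustrative model, not the actual AWS Auto Scaling algorithm: desired capacity tracks load but is always clamped to the configured bounds.

```python
import math

def desired_capacity(current, min_size, max_size, cpu_util, target_util=0.5):
    """Scale toward a target CPU utilization, clamped to [min_size, max_size].

    Covers the listed triggers: traffic spikes and performance bugs raise
    cpu_util (scale out); node failures are healed back to desired capacity.
    """
    desired = math.ceil(current * cpu_util / target_util)
    return max(min_size, min(max_size, desired))

# Traffic spike: 10 nodes at 90% CPU, targeting 50% -> scale out to 18.
print(desired_capacity(current=10, min_size=4, max_size=20, cpu_util=0.9))  # 18
```

The clamp is the important part: the maximum size bounds cost during a runaway performance bug, while the minimum size keeps headroom for sudden spikes.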
55. Surviving Instance Failure
[Diagram: an edge cluster fanning out to Clusters A, B, C, and D]
57. Databases & caches
Custom apps which hold large amounts of data
Loss of a node is a notable event
What is a stateful service?
58. Dedicated Shards – An Antipattern
[Diagram: a client application's subscriber client library, via its cache client and service client, routes through HAProxy to dedicated Squid shards (Squid 1…n), each bound to its own service-and-DB set (Set 1…n); losing one shard (X) takes its whole set offline]
64. It’s easy to take EVCache for granted
30 million requests/sec
2 trillion requests per day globally
Hundreds of billions of objects
Tens of thousands of memcached instances
Milliseconds of latency per request
65. Excessive Load
[Diagram: many member-path and batch clients all calling a shared caching tier (S…) that fronts the services and DBs]
Called by many services
Online & offline clients
Called many times per request
800k – 1M RPS
Fallback to service/DB
66. Excessive Load
[Diagram: the same call paths with cache nodes failed (X X), shifting fallback load onto the services and DBs]
67. Solutions
[Diagram: online and offline workloads partitioned onto separate caching tiers]
Workload partitioning
Request-level caching
Secure token fallback
Chaos under load
93. Netflix API – from public to private
Netflix Data Center - 2009
[Diagram: a load balancer fronting the API application, which serves content metadata]
General REST API
JSON schema
HTTP response codes
OAuth security model
94. Hybrid Architecture
Netflix Data Center – 2010
[Diagram: a customer device running the Netflix app (UI on the NRDP platform) reaches two separate stacks through their own load balancers: the API (content metadata) and NCCP (security, activation, playback)]
Distinct:
• Services
• Protocols
• Schemas
• Security
95. Josh: What is the right long-term architecture?
Peter: Do you care about the organizational implications?
96. Conway’s Law
Organizations which design systems are constrained to
produce designs which are copies of the
communication structures of these organizations.
Any piece of software reflects the organizational
structure that produced it.
97. Conway’s Law
If you have four teams working on a compiler, you will end up with a four-pass compiler
100. Outcomes
Productivity & new capabilities
Refactored organization
Lessons
Solutions first, team second
Reconfigure teams to best support your architecture
Outcomes & Lessons
Even the simple act of breathing is complex, requiring many systems to cooperate and carrying the risk of inhaling dangerous gases or pathogens.
Pause – so you’re probably wondering why I’m talking about biology and disease in a talk about microservices?
And just as we human beings thrive in a world filled with threats so can your microservice architecture
And just as my stepmother Barbara’s own body attacked itself in response to some unknown pathogen, our own services can do the same thing.
Poorly tuned timeouts, retries, and fallbacks can wreak havoc and take your entire customer-facing service down
There are big challenges, but every challenge has a solution
But, just as for all of us, it requires discipline to stay fit. You must embrace the chaos and accept that it is impossible for any one individual to fully understand the whole distributed system.
This is why we’re here today – to talk about the Netflix microservice journey. How we walk the razor’s edge between discipline and chaos. And how you can benefit from the lessons we’ve learned over the last 7 years.
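On the point about poorly tuned retries: a common mitigation (one widely used approach, not necessarily Netflix's exact policy) is capped exponential backoff with full jitter, which keeps a burst of retries from synchronizing into a retry storm against an already-sick dependency.

```python
import random

def backoff_delays(max_retries=3, base=0.1, cap=1.0):
    """Capped exponential backoff with full jitter: each retry sleeps a
    random amount in [0, min(cap, base * 2**attempt)] seconds, so retries
    from many clients spread out instead of hammering in lockstep."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

delays = backoff_delays()
```

Bounding `max_retries` is as important as the jitter: each retry multiplies the load seen downstream, which is exactly how a minor blip becomes a cascading failure.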
Read from cache
On cache miss call service
Service calls DB & responds
Service updates cache
External trigger, internal response
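The read path in the notes above is the cache-aside pattern, which can be sketched in a few lines. The dictionaries standing in for the cache, service, and DB are illustrative only.

```python
cache = {}
db = {"user:1": {"plan": "standard"}}

def service_get(key):
    """The service calls the DB and responds."""
    return db[key]

def read(key):
    if key in cache:              # read from cache
        return cache[key]
    value = service_get(key)      # on cache miss, call service
    cache[key] = value            # cache is updated for the next reader
    return value
```

The first `read("user:1")` misses and populates the cache; subsequent reads are served from the cache without touching the service or DB.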
As soon as you go out of process and/or off box – you have a distributed system
Combinatorial math on nines of availability
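The combinatorial math is simply that per-service availabilities multiply along a call path: a request touching many services is only as available as the product of their availabilities. A quick worked example (the service count here is illustrative):

```python
# Thirty services at "three nines" (99.9%) each, all in one call path:
combined = 0.999 ** 30
print(f"{combined:.3f}")  # 0.970 -> roughly 97% end-to-end, not 99.9%
```

This is why isolation and fallbacks matter: without them, every additional dependency erodes end-to-end availability.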
Adrian Cockcroft suggested Netflix in a box as a thought experiment early on – to address connectivity concerns
* If you do not defend against failure at each level then you have what is essentially a distributed monolith – if any microservice fails then they all fail
* Calls start failing, retries make it worse, thread pools become saturated, lack of isolation leads to full cascading failure
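The cascading-failure sequence above is what circuit breakers are designed to interrupt. Here is a deliberately minimal sketch in the spirit of that idea (far simpler than a real library such as Hystrix, which also adds thread-pool isolation, timeouts, and half-open probing): after a threshold of consecutive failures, stop calling the sick dependency and serve the fallback directly.

```python
class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures and serve
    the fallback instead of hammering a failing dependency."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.threshold:   # circuit open: short-circuit
            return fallback()
        try:
            result = fn()
            self.failures = 0                 # success resets the count
            return result
        except Exception:
            self.failures += 1                # count the failure
            return fallback()

def broken():
    raise RuntimeError("downstream timeout")

cb = CircuitBreaker(threshold=3)
results = [cb.call(broken, lambda: "fallback") for _ in range(5)]
```

After the third failure the circuit opens, so calls four and five never touch the dependency; that back-pressure is what prevents saturated thread pools from cascading.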
This nasty looking creature comes right out of your favorite horror movie
The good news is that it’s a very tiny creature – not something that would destroy Tokyo
The bad news is that it’s a vampire – a hookworm that attaches itself to the wall of the intestine, puncturing blood vessels and feeding on blood.
This can lead to severe anemia, affecting the health of the whole organism
And – just like the hookworm, client libraries can consume resources of your microservice application
cache, service, backfill
Request level caching
Client writes to any node
Coordinator replicates to nodes
Nodes ack to coordinator
Coordinator acks to client
Write to commit log
Hinted handoff to offline nodes
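The coordinator side of the write path above (Cassandra-style, as the notes describe it) can be sketched as: replicate to every replica, ack the client once a quorum of replicas has acked, and store hints for offline nodes to hand off later. The data structures are illustrative, not Cassandra's actual internals.

```python
def coordinator_write(replicas, value, quorum):
    """Replicate to all nodes; ack the client once `quorum` replicas ack.
    Writes destined for offline nodes are kept as hints for handoff."""
    acks, hints = 0, []
    for node in replicas:
        if node["online"]:
            node["data"] = value        # node writes and acks coordinator
            acks += 1
        else:
            hints.append(node["name"])  # hinted handoff when it returns
    return acks >= quorum, hints        # coordinator acks the client

replicas = [{"name": "A", "online": True},
            {"name": "B", "online": True},
            {"name": "C", "online": False}]
# Local quorum for 3 replicas is 2, so one offline node doesn't block the ack.
ok, hints = coordinator_write(replicas, "v1", quorum=2)
```

This is the availability/consistency trade in miniature: the client gets its ack before node C has the data, and the hint closes the gap eventually.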
On Christmas Eve 2012, Netflix experienced a region-wide outage due to an accidental ELB configuration change
Many engineers were on call, missing time with their families
They spent much of the night and into the morning trying to mitigate the impact of the outage on our customers but to no avail
We ultimately had to wait for Amazon to address the root cause
And our members, many of them new to Netflix, were unable to stream
Their responses varied in intensity from…
Early on we had two competing approaches to caching. The Subscriber service team leaned on Squid caches, applying a dedicated shard model
This model proved problematic, involving long outages for members when a shard went down.
In addition, the lack of proper thread pool isolation meant that the entire Netflix service might become unavailable when one shard became unavailable
I was on a conference call several years ago where it took four hours to recover from such an outage
Now let’s look at scale from the perspective of a complex microservice architecture
One in which there is a caching tier fronting the microservice tier
In this case the subscriber team heavily relied on the caching tier
Taking traffic north of 800k rps
There are several solutions that address this anti-pattern…
Story: a global configuration change bricked the test environment for 6 hours
Staging necessary for deployments & configuration changes
Architecture first, organizational structure second
Blameless incident reviews
Commitment to continuous improvement
Different endpoints, protocols, and security made life difficult for client teams
Especially when we wanted to integrate UI and playback functionality