5. www.flipkart.com
• Started in 2007
• Current Architecture from mid 2010
• Evolution of the architecture presented as…
Issue [1] → RCA [2] → Actions → Learnings
• [1] Issue: Website is “slow”
• [2] RCA = Root Cause Analysis
8. RCA
• Why?
– MySQL queries taking too long
• Why?
– Too many queries
– Many slow queries
– Queries locking tables
• Why?
– Capacity
• Hmm…
9. Fixing it
• Get beefier servers (the obvious)
• Separate master_db, slave_db
– Writes go to master_db
– Reads from slave_db
– Critical reads from master_db
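[Diagram: before, a single MySQL server takes both reads and writes; after, writes go to the MySQL master and reads go to the MySQL slave, which is fed from the master by replication]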
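A minimal sketch of the read/write split described above. The names master_db and slave_db come from the slide; the ReplicatedRouter class and its pick() method are hypothetical, written only for illustration. It assumes that "critical reads" are reads that cannot tolerate replication lag, which is why they are sent to the master.

```python
import random


class ReplicatedRouter:
    """Hypothetical router: picks a database for each query.

    Writes and critical reads use the master; all other reads are
    spread across the slaves.
    """

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def pick(self, is_write, critical=False):
        # Fall back to the master if no slave is configured.
        if is_write or critical or not self.slaves:
            return self.master
        return random.choice(self.slaves)


router = ReplicatedRouter("master_db", ["slave_db"])
```

With this in place, router.pick(is_write=False) returns "slave_db", while writes and reads flagged critical=True return "master_db".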
10. Learning from it
• Scale out database reads by distributing load
across systems
• Isolate database writes from reads
– Writes are (usually) more critical
12. RCA
• Why?
– MySQL queries taking too long (on slave_db)
• Why?
– Too many queries
– Many slow queries
• Why?
– Queries from analytics / reporting and other
backend jobs
• Urm…
13. Fixing it
• Analytics / reporting DB (archival_db)
– Use MyISAM — optimized for reads
– Additional indexes for quicker reporting
[Diagram: before, website writes go to the MySQL master and website reads go to a single replicated slave; after, the master replicates to Slave 1, which serves website reads, and to Slave 2, which serves analytics reads]
14. Learning from it
• Isolate the databases being used for serving
website traffic from those being used for
analytics/reporting
• Isolate systems being used by production
website from those being used for background
processing
19. RCA - 2
• Why?
– Service Oriented Architecture (SOA)
– Too many calls to remote services per request
• Creating fresh connection for each call
• All the calls are made in serial order
Receive request → Connect to Service1 → Request Service1 → Connect to Service2 → Request Service2 → Send response
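A rough cost model for the call chain above. All timings are made-up numbers, and both function names are hypothetical. The second function assumes that connections are reused and that the calls are independent of each other, which the slide does not state; it is shown only to illustrate what the two bullets cost.

```python
def serial_latency_ms(request_times_ms, connect_ms):
    """Each call opens a fresh connection and waits for the previous call."""
    return sum(connect_ms + request_ms for request_ms in request_times_ms)


def pooled_parallel_latency_ms(request_times_ms):
    """Connections are reused and independent calls run concurrently,
    so the slowest call dominates."""
    return max(request_times_ms, default=0)


# Hypothetical request times for Service1 and Service2.
calls = [30, 50]
serial = serial_latency_ms(calls, connect_ms=10)
parallel = pooled_parallel_latency_ms(calls)
```

Under these assumed numbers the serial chain spends (10 + 30) + (10 + 50) ms on remote calls, while the pooled, concurrent variant is bounded by the slowest single call.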
20. RCA - 3
• Why?
– Configurability
– Fetch a lot of “config” from database for serving
each request
Receive request → Fetch Config1 → Fetch Config2 → Fetch Config3 → Fetch Config4 → Send response
21. RCA – 1,2,3
• Why?
– Logging a lot
– SOA
– Configurability
• Why?
– PHP’s process model
• Argh!
22. Fixing it
• fk-w3-agent
– Simple Java “middleware” daemon
– Deployed on each web server
– PHP communicates to it through local socket
– Hosts pluggable “handlers”
26. Learning from it
• PHP — good for frontend and templating
– Gives a lot of agility
– Limiting process model
• Hurdle for high performance
• Java — stability and performance
• Horses for courses
28. RCA
• Why?
– PHP processes taking up too much time
– PHP processes taking up too much CPU
• Why?
– Product info deserialization taking up time/CPU
– View construction taking up time/CPU
29. Fixing it
• Caching!
• Cache fully constructed pages
– For a few minutes
– Only for highly trafficked pages (Homepage)
• Cache PHP serialized Product objects
– ~20 million objects
– Memcache
• Yeah! But…
– Add caching => add complexity
30. Caching: Complications (1)
• “Caching fully constructed pages”
• But parts of pages still need to be dynamic
• Example: Logged-in user’s name
• Impossible to do effective bucket testing
• Or at least makes it prohibitively complex
31. Caching: Complications (2)
• “Caching PHP serialized Product objects”
• Without caching:
getProductInfo() → Fetch from CMS
• With caching, cache hit:
getProductInfo() → Fetch from Cache
• With caching, cache miss:
getProductInfo() → Fetch from Cache → Fetch from CMS → Set in Cache
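The three flows above can be sketched as a single lookup function. This is an illustrative sketch, not the site's code: ProductCache is a hypothetical name, a plain dict stands in for Memcache, and fetch_from_cms is a placeholder for whatever call retrieves a product from the CMS.

```python
class ProductCache:
    """Hypothetical wrapper implementing the hit/miss flows."""

    def __init__(self, cache, fetch_from_cms):
        self.cache = cache                    # dict-like stand-in for Memcache
        self.fetch_from_cms = fetch_from_cms  # placeholder CMS lookup

    def get_product_info(self, product_id):
        product = self.cache.get(product_id)
        if product is not None:
            return product                    # cache hit
        product = self.fetch_from_cms(product_id)  # cache miss: go to the CMS
        self.cache[product_id] = product           # set in cache for next time
        return product
```

The extra branch is the added complexity the slides refer to: every caller now depends on the cache and the CMS agreeing with each other.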
32. Caching: Complications (3)
• TTL: ∞ (i.e. no invalidation)
• Pro-actively repopulate products in the cache
– Receive “notifications” about product updates
• Notification Server — pushes notifications raised by
CMS
• Use a persistent, distributed cache
– Memcache => Membase, Couchbase
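A minimal sketch of proactive repopulation, assuming the notification carries a product ID. The handler name on_product_updated and the fetch_from_cms callable are hypothetical, and a plain dict stands in for the distributed cache.

```python
def on_product_updated(cache, fetch_from_cms, product_id):
    """Hypothetical handler for a "product updated" notification.

    Overwrites the cached entry with fresh data, so entries never need
    to expire on their own (TTL: infinite).
    """
    cache[product_id] = fetch_from_cms(product_id)
```

Because entries are never invalidated, a lost cache would have to be refilled from scratch, which is one plausible reason for preferring a persistent cache here.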
33. Learning from it
• Caching is a powerful tool for performance
optimization
• Caching adds complexities
– Reduced by keeping cache close to data source
– Think deeply about TTL, invalidation
• Use caching to go from “acceptable
performance” to “awesome performance”
– Don’t rely on it to get to “acceptable
performance”
36. RCA
• Why?
– Search-service is slow (or Reviews-service is slow
or Recommendations-service is slow)
• But why is the rest of the website slow?
– Requests to the slow service are blocking
processing threads
• Eh?!
37. Let’s do some math
• Let’s say
– Mean (or median) response time: 100 ms
– 8-core server
– All requests are CPU bound
• Throughput: 80 requests per second (rps)
• Let’s also say
– 95th Percentile response time: 1000 ms
• Call them “bad requests”
• 4 bad requests in a second
– Throughput down to 44 rps
• 8 bad requests in a second?
– Throughput down to 8 rps
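The arithmetic above can be reproduced with a small model. It assumes, as the slide does, that every request is CPU bound and holds one core for its whole duration; the function name throughput_rps is made up for this sketch.

```python
def throughput_rps(cores, normal_ms, bad_ms, bad_per_second):
    """Requests completed per second when `bad_per_second` slow requests
    each hold a core for `bad_ms`, and the remaining core time serves
    normal requests of `normal_ms` each."""
    total_core_ms = cores * 1000
    bad_core_ms = min(bad_per_second * bad_ms, total_core_ms)
    bad_done = bad_core_ms // bad_ms
    normal_done = (total_core_ms - bad_core_ms) // normal_ms
    return bad_done + normal_done


baseline = throughput_rps(8, 100, 1000, 0)    # no bad requests
four_bad = throughput_rps(8, 100, 1000, 4)    # 4 cores tied up for the second
eight_bad = throughput_rps(8, 100, 1000, 8)   # every core tied up
```

Four bad requests occupy four cores for the full second, so only the other four cores serve normal traffic (40 requests), plus the 4 bad requests themselves.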
38. Fixing it
• Aggressive timeouts for all service calls
– Isolate impact of a slow service
• only to pages that depend on it
• Very aggressive timeouts for non-critical
services
– Example: Recommendations
• On a Product page, Search results page etc.
• Not on My Recommendations page
• Load non-critical parts of pages through AJAX
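A sketch of an aggressive timeout with a fallback, using Python's standard concurrent.futures module. The helper call_with_timeout, the slow_recommendations stub, and the timeout values are all hypothetical; the slides do not say how the timeouts were implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Shared worker pool for outgoing service calls (size chosen arbitrarily).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(service_call, timeout_s, fallback):
    """Run service_call in a worker thread; give up after timeout_s
    seconds and return fallback so the page can still be rendered."""
    future = _pool.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()  # has no effect if the call is already running
        return fallback


def slow_recommendations():
    time.sleep(0.2)  # stand-in for a slow Recommendations service
    return ["book-1", "book-2"]


# Non-critical widget on a product page: render without it if it is slow.
widget = call_with_timeout(slow_recommendations, timeout_s=0.05, fallback=[])
```

Note that the worker thread stays busy until the slow call returns; a real client would also set connect and read timeouts on the underlying socket so the call itself is abandoned.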
39. Learning from it
• Isolate the impact of poorly performing
services / systems
• Isolate the required from the good-to-have
41. RCA
• Why?
– Load average of web servers has spiked
• Why?
– Requests per second have spiked
• From 1000 rps to 1500 rps
• Why?
– Large number of notifications of product
information updates
42. Fixing it
• Separate cluster for receiving product info
update notifications from the cluster that
serves users
• Admission control: Don’t let a system receive
more requests than it can handle
– Throttling
• Batch the notifications
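A sketch of throttling and batching. The slides do not name a throttling algorithm; a token bucket is used here only as one common choice. The class TokenBucket, the function batch, and all rates and sizes are assumptions made for illustration. The clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Admits at most `rate_per_s` requests per second on average,
    with bursts of up to `burst` requests."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds)
        is admitted, False if it should be rejected or delayed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def batch(notifications, size):
    """Group notifications so one request carries up to `size` updates."""
    return [notifications[i:i + size]
            for i in range(0, len(notifications), size)]
```

The bucket protects the receiver regardless of how the client behaves, while batching reduces the number of requests the client needs to send in the first place.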
43. Learning from it
• Isolate the systems serving internal requests
from those serving production traffic
• Admission control to ensure that a system is
isolated from the over-enthusiasm of a client
• Look at the granularity at which we’re working
47. Mistake?
• Sub-optimal decision
– Not all information/scenarios considered
– Insufficient information
– Built for a different scenario
• Due to focus on “functional” aspects
• A mistake is a mistake
– … in retrospect
Editor's notes
“This has basically given us lots of opportunities to make mistakes. And make mistakes we did.”