Updated version of Reliability & Scale in AWS while letting you sleep through the night
===============================================================
More and more startups and companies are deploying their infrastructure directly and exclusively in EC2 or a similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.
This talk covers how to combine the infrastructure building blocks of AWS with great (affordable) third-party services and solid open source software into an operations environment that scales with you, is as reliable as it can be, and gives you and your peers all the data you need to make good decisions and support (rapid) change, all while letting you sleep through the night. And all of that with a tiny operations team.
It may make you coffee in the morning too.
Devoxx UK: Reliability & Scale in AWS while letting you sleep through the night
1. ONE MAN OPS
Reliability & Scale in AWS while letting you sleep through the night
Jos Boumans - @jiboumans
http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
Tuesday 26 March 13
2. RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db
3. CANONICAL
Engineering manager for Ubuntu Server 10.04 & 10.10
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
4. KRUX
VP of Operations & Infrastructure
http://www.krux.com/
17. AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
18. THE RULES HAVE CHANGED
You're not in Kansas anymore
http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
19. NETWORK WILL PARTITION
And it will happen often
http://thevinylvillain.blogspot.com/2010_04_01_archive.html
20. DISK IO WILL FLUCTUATE
On a good day, it's mediocre
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
21. IP ADDRESSES WILL CHANGE
IP lease is 8 hours
DNS TTL is 60 seconds
www.fantom-xp.com
22. INSTANCES WILL DIE
And it will always be your Database Master
http://room57.deviantart.com/art/Hangman-188353196
24. EMBRACE FAILURE
Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
26. ADJUST YOUR STRATEGY
Don't bring a knife to a gun fight
http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
27. DATA STORES
Some work better than others
http://gustavhoiland.com/2010/03/10/stacked-boxes/
28. RDBMS
CouchDB
BigTable Based
Dynamo Based
Master / Slave based
CAP THEOREM
Your choice: sacrifice availability or consistency.
Orange is a lie.
29. MYSQL / ORACLE VS RDS
See: Network partitioning & instances dying
30. AMAZON REDSHIFT
Great for analytics/reports, bad for OLTP
Unburden your RDS instances
http://www.flitemedia.com/music.php http://aws.amazon.com/redshift
31. BIGTABLE BASED STORES
HBase, Accumulo, Hypertable
Still suffer when network partitioning happens
http://www.cloudera.com/cdh4/
32. DYNAMO BASED STORES
Cassandra, Riak, DynamoDB
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
33. GO HOSTED?
CouchDB, MongoDB, Riak, Cassandra, HBase
Your Latency May Vary
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
34. CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
35. FILE STORES
EBS vs Instance Store ...
... vs RamFS
http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
36. SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
37. TRAFFIC SHAPING
Control every part of the request
http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
38. STAY LOCAL IF YOU CAN
Going off box exposes you to risks you need to mitigate
http://southshorewoman.com/issue/june-2010/article/local-character
39. CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://theoatmeal.com/blog/charity_money
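As one illustration of caching DB queries, here is a minimal in-process cache with a time-to-live. The class name `TTLCache` and the `get_or_compute` method are hypothetical names for this sketch, not something from the talk; a real deployment would more likely use a shared cache such as memcached.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

Usage would look like `cache.get_or_compute("top_users", run_expensive_query)`, where `run_expensive_query` stands in for whatever call goes off box.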
40. USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
41. USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
42. SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, etc
43. USE A CDN
Critical items should always be available
http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
44. MEASURE EVERYTHING
Find outliers, deviants & trends before they cause trouble
http://www.themoviedb.org/movie/629-the-usual-suspects
45. GRAPHITE, STATSD & COLLECTD
Use StatsD & collectd for application/system metrics
Use Graphite to store, aggregate & visualize
http://hostedgraphite.com/
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
46. GRAPH EVENTS
Deployments, outages, CDN reconfigurations, failed builds, etc
Anything that's important to the health of your ecosystem
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
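A rough sketch of how a deploy script might record such an event. It assumes a Graphite install that exposes the events API at `/events/` and accepts a JSON body with `what`, `tags`, `data` and `when` fields; check this against your Graphite version. The helper names and the example URL are hypothetical.

```python
import json
import time
import urllib.request


def build_event(what, tags, data="", when=None):
    """Build the JSON-serialisable body for a Graphite event."""
    timestamp = int(when if when is not None else time.time())
    return {"what": what, "tags": tags, "data": data, "when": timestamp}


def post_event(graphite_url, event, timeout=2.0):
    """POST the event to <graphite_url>/events/ (assumed endpoint)."""
    request = urllib.request.Request(
        graphite_url.rstrip("/") + "/events/",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A deploy hook could then call `post_event("http://graphite.example.com", build_event("deploy api", "deploy"))` and overlay the result on any graph.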
47. COMPARE WEEK TO WEEK
Overlay week to week graphs using timeShift()
Quickly identifies trends and deviations from trends
http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
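To make the overlay concrete, this sketch builds a Graphite render URL that plots a metric next to its `timeShift()`-ed copies. The metric name and host used below are made up for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def week_over_week_targets(metric, weeks=1):
    """Return Graphite targets: the live series plus one per prior week."""
    targets = [f'alias({metric},"this week")']
    for n in range(1, weeks + 1):
        shifted = f'timeShift({metric},"{7 * n}d")'
        targets.append(f'alias({shifted},"{n}w ago")')
    return targets


def week_over_week_url(base_url, metric, weeks=1, window="-24h"):
    """Build a /render URL that overlays the targets on one graph."""
    params = [("target", t) for t in week_over_week_targets(metric, weeks)]
    params += [("from", window), ("format", "png")]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"
```

For example, `week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean", weeks=2)` would overlay today on the same day one and two weeks back.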
48. FORECASTING
Use Holt-Winters confidence bands
Verify that your metrics are within normal tolerance
https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
49. FIND INDIVIDUAL OUTLIERS
Absolute numbers mean very little
Use mean & standard deviation
http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
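A minimal sketch of the mean & standard deviation idea: score each host by how many standard deviations it sits from the fleet average, rather than comparing against an absolute number. The function name, the host names and the default threshold of 2.0 are assumptions for this example.

```python
from statistics import mean, pstdev


def find_outliers(values_by_host, threshold=2.0):
    """Return {host: z_score} for hosts further than threshold sigmas from the mean."""
    values = list(values_by_host.values())
    if len(values) < 2:
        return {}
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # every host reports the same value, nothing stands out
    scores = {host: (value - mu) / sigma for host, value in values_by_host.items()}
    return {host: z for host, z in scores.items() if abs(z) > threshold}


latency_ms = {"web1": 100, "web2": 102, "web3": 98,
              "web4": 101, "web5": 99, "web6": 180}
print(sorted(find_outliers(latency_ms)))  # ['web6']
```

With only a handful of hosts a single outlier inflates the standard deviation itself, so small fleets may need a lower threshold.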
50. ALERT ON TRENDS
Once you go over a threshold, it's too late
Alert on unwanted trends and preemptively fix
http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
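One simple way to alert on a trend rather than a threshold is to fit a straight line through recent samples and estimate when it will cross the limit. This is a sketch using ordinary least squares, not the method from the talk; the function name and the sample data are hypothetical.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until the fitted line hits threshold.

    samples is a list of (hour, value) pairs. Returns None when there is
    too little data or the metric is not trending upward.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_hour = (threshold - intercept) / slope
    return max(0.0, crossing_hour - samples[-1][0])


disk_used_pct = [(0, 50), (1, 52), (2, 54), (3, 56)]
print(hours_until_threshold(disk_used_pct, threshold=90))  # 17.0
```

An alert could then fire when the estimate drops below, say, a working day, which leaves time to fix things preemptively.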
51. MEASURE WITHOUT RETROFIT
LogFormat "http.beacon:%D|ms" stats
CustomLog "|nc -u localhost 8125" stats
http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
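The Apache snippet above works because a StatsD timer is just a `name:value|ms` datagram sent over UDP to port 8125. The same metric can be emitted from application code, as in this sketch; the function names are hypothetical. Note that Apache's `%D` reports microseconds, so divide by 1000 if you want true milliseconds.

```python
import socket


def format_timing(metric, millis):
    """Build a StatsD timer datagram such as b'http.beacon:12|ms'."""
    return f"{metric}:{millis}|ms".encode("ascii")


def send_timing(metric, millis, host="localhost", port=8125):
    """Fire-and-forget a timer to StatsD over UDP, like `nc -u` does."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_timing(metric, millis), (host, port))
    finally:
        sock.close()
```

Because UDP is connectionless, `send_timing("http.beacon", 12)` does not block or fail the request when StatsD is down.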
52. SHOUT OUT: NEW RELIC
Java, but also Python, Ruby, .NET, PHP & Node.js support
In-depth profiling of your app for performance & errors.
53. CONFIGURATION MANAGEMENT
Unique snowflakes are bad
http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
54. PUPPET VS CHEF
Yes.
http://puppetlabs.com/
http://www.opscode.com/chef
55. INFRASTRUCTURE AS CODE
Use different environments
Measure and report on it
http://americansingercanary.com/green.htm
56. SHOUT OUT: UBUNTU
Ubuntu + cloud-init + boto = awesome*
*I am biased
http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
57. AWS OPSWORKS
Hosted Chef, No extra charge, Ubuntu 12.04 or Amazon Linux
Still rough around the edges.
http://thebrandbuilder.files.wordpress.com/2011/08/gordon-01.jpg http://aws.amazon.com/opsworks/
58. DEV = PRODUCTION
"I dunno, it worked on my laptop"
Instead, use Vagrant
http://vagrantup.com/
59. ROLL YOUR OWN AMIS
Instantly boot up new deployments
Reduce Time to Respond
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
60. CONFIDENT DEPLOYS
That human error could be yours
http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
61. CONTINUOUS INTEGRATION
Ours: GitHub + Jenkins + FPM + apt::s3
From commit to deployable in one command
http://github.com/
http://jenkins-ci.org/
https://github.com/thekad/apt-s3
https://github.com/jordansissel/fpm/wiki/
62. ONE CLICK DEPLOYMENTS
Deployments should not be exciting.
Don't create a checklist; automate & track
https://checkmarkable.com
http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://github.com/jib/aws-analysis-tools/
63. DARK LAUNCHES
Exercise the code without impacting the user experience
http://www.kissmetrics.com/
http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
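One possible shape for a dark launch on the server side: always serve the current code path, run the candidate alongside it, and only log what the candidate did. This is a sketch under that assumption; `dark_launch` and its `enabled` switch are hypothetical names, not something from the talk.

```python
import logging

log = logging.getLogger("dark_launch")


def dark_launch(current, candidate, enabled=lambda: True):
    """Wrap current so candidate is exercised too, without affecting callers."""
    def wrapper(*args, **kwargs):
        result = current(*args, **kwargs)
        if enabled():
            try:
                shadow = candidate(*args, **kwargs)
                if shadow != result:
                    log.warning("mismatch: current=%r candidate=%r", result, shadow)
            except Exception:
                # A failing candidate must never reach the user.
                log.exception("candidate raised during dark launch")
        return result
    return wrapper
```

The `enabled` callable is where a feature flag or sampling rate would plug in, so the new path can be dialed up gradually. In this simple form the candidate runs inline and adds latency; a production version would likely run it asynchronously.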
64. SHADOW TRAFFIC
Test new code against live traffic
http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323