B. Durrett, "The Challenges of Continuous Deployment," Social Developer Summit
1. Scaling with Continuous Deployment
Social Developer Summit, San Francisco, CA, June 29, 2010
Brett G. Durrett (@bdurrett), Vice President of Engineering & Operations, IMVU, Inc.
4. In a Nutshell: What is Continuous Deployment?
- Engineer commits code
- 20 minutes later it is live in production
- Repeat about 50 times per day
5. Does This Really Work?
- "Maybe this is just viable for a single developer … your site will be down. A lot."
- "It seems like the author either has no customers or very understanding customers"
- Responses to a February 2009 post by Timothy Fitz about Continuous Deployment at IMVU (at the time, IMVU had a $12 million run rate)
6. Benefits
- Regressions are easy to find and correct
- Releases have zero overhead
- Rapid iteration using real customer metrics
7. Finding and Fixing Problems
- Each release has few changes (1-3 commits)
- Production issues correlate with check-in timestamps
- No overhead to produce a new release that corrects an issue
- Identifying the cause takes minutes
8. CD at IMVU: Simple Overview
- Local tests pass; engineer commits code
- Lots and lots of tests run
- All tests pass? No: revert commit (blocks). Yes: continue
- Code deployed to a percentage of servers
- Metrics good? No: rollback (blocks). Yes: continue
- Code deployed to all servers
- Metrics still good? No: rollback (blocks). Yes: win!
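The flow on this slide can be sketched as a single gate function. This is a minimal sketch, not IMVU's actual deploy code: `run_tests`, `deploy`, `rollback`, and `metrics_healthy` are hypothetical stand-ins for the real build cluster, deploy system, and monitoring, and the canary percentage is illustrative.

```python
def continuous_deploy(commit, run_tests, deploy, rollback, metrics_healthy):
    """Gate a commit through tests, a canary deploy, and metrics checks.

    All four callables are injected stand-ins for real systems, so the
    control flow on the slide can be read (and tested) in isolation.
    """
    if not run_tests(commit):
        return "revert commit (blocks)"
    deploy(commit, percent=5)        # canary: a percentage of servers
    if not metrics_healthy():
        rollback(commit)
        return "rollback (blocks)"
    deploy(commit, percent=100)      # all servers
    if not metrics_healthy():
        rollback(commit)
        return "rollback (blocks)"
    return "win"
```

The key property is that every "No" branch blocks further deploys until the offending commit is gone, which is what keeps a 50-deploys-a-day pipeline safe.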
10. Getting Started: Extreme Basics
- Continuous integration system
- Production monitoring and alerting: system performance and business metrics (trending is nice too)
- Simple deploy / roll-back system
11. Commit to Making Forward Progress
- Require coverage for all new code
- Add coverage for bugs / regressions
- Understand and fix the root cause of failures
12. Expect Some Hurdles
- Production outages
- New overhead: tests, build systems
- Production outages
- Frustration
- Production outages (but well worth it)
13. Dealing with SQL Problems
- Schema changes are difficult to roll back
- ALTER statements lock tables and impact customers
- Solutions:
  - New schemas get a formal review process
  - No ALTER on large tables; create a new table instead
  - Copy on read
  - Complete the migration with a background job
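The "create a new table, copy on read, finish with a background job" pattern can be sketched with an in-memory model. This is hypothetical illustration only: the real version runs against MySQL, and the `version=2` field stands in for whatever the schema change actually adds.

```python
class CopyOnReadStore:
    """Sketch of copy-on-read migration: reads of un-migrated rows copy
    them into the new table, so no blocking ALTER is ever needed."""

    def __init__(self, old_table):
        self.old_table = old_table   # legacy rows, e.g. {row_id: row_dict}
        self.new_table = {}          # rows in the new schema live here

    def _migrate_row(self, row):
        # Hypothetical schema change: add a 'version' column with a default.
        return dict(row, version=2)

    def read(self, row_id):
        if row_id in self.new_table:
            return self.new_table[row_id]
        row = self.old_table.pop(row_id)          # copy on read
        self.new_table[row_id] = self._migrate_row(row)
        return self.new_table[row_id]

    def background_migrate(self, batch_size=100):
        """Background job: migrate rows nobody has read yet, in batches."""
        for row_id in list(self.old_table)[:batch_size]:
            self.read(row_id)
```

Hot rows migrate themselves on first access; the background job sweeps up cold rows, and once `old_table` is empty the legacy table can be dropped.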
14. Big Features
- Developed on trunk, not a branch
- "Hidden" from customers by an A/B experiment
- 100% control group; add QA to the experiment
- Deployed daily during development
- Slow roll-out by increasing the experiment percentage
- Experiment closed = feature fully launched
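Gating a trunk-developed feature behind an experiment can be sketched like this. The helper and its hashing scheme are hypothetical (IMVU's real experiment framework is not shown in the talk); the point is the lifecycle: 0% hides the feature from everyone but QA, raising the percentage rolls it out, and 100% is a full launch.

```python
import hashlib

def in_experiment(user_id, experiment, rollout_percent, qa_users=frozenset()):
    """Deterministically bucket a user into an experiment.

    Hashing (experiment, user) gives each user a stable 0-99 bucket, so the
    same user sees the same variant on every request as the rollout grows.
    """
    if user_id in qa_users:
        return True          # QA is always in, even at 0% rollout
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# In feature code on trunk (illustrative names):
# if in_experiment(user.id, "new_profile_page", rollout_percent=5, qa_users=QA):
#     render_new_profile()
# else:
#     render_old_profile()
```

Because the gate lives in trunk, the half-built feature deploys daily with everything else while staying invisible to customers.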
15. Test Speed
- Slow tests are a burden to scaling
- Can't run all tests in a sandbox; it's faster to debug on the build cluster
- If possible: keep tests fast, keep tests specific
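One common way to keep tests fast and specific is dependency injection: pass slow collaborators in as parameters so unit tests can substitute in-memory fakes for database or network calls. A small sketch (the `AvatarRenderer` class and its catalog lookup are hypothetical examples, not IMVU code):

```python
class AvatarRenderer:
    """Example class whose catalog lookup is injected rather than hard-wired."""

    def __init__(self, fetch_item):
        # In production, fetch_item would hit the product catalog database.
        self.fetch_item = fetch_item

    def describe(self, avatar_id):
        return f"avatar {avatar_id} wearing {self.fetch_item(avatar_id)}"

def test_describe_uses_injected_catalog():
    # Fast and specific: no database, no network, runs in microseconds.
    renderer = AvatarRenderer(fetch_item=lambda _avatar_id: "red hat")
    assert renderer.describe(7) == "avatar 7 wearing red hat"
```

The test exercises only the formatting logic it names, so a failure points at one piece of code instead of an entire stack.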
16. The Cost of Failing Tests
- As the team grows:
  - Test failures become more likely
  - More people are blocked as a result
- Intermittent failures are very bad; eliminate the root cause
17. Other Issues
- Won't catch issues that fail slowly, e.g. SELECT * FROM growing_table WHERE 1
- Some critical areas cause hard lock-ups: MySQL, Memcached
- Lack of test coverage in older code (not an issue if you start with test coverage)
18. Does Continuous Deployment Scale?
- Technical staff of ~50 people
- 10 million monthly unique visitors
- Peak of ~115K concurrent IM client logins
- It's a real business: $40 million run rate, profitable, and doubled revenue in 2009
24. Build Throughput
- Initial implementation: sequential builds
  - Scaled okay to ~20 engineers
  - Like trains running every 20 minutes: one "red" build blocks all following builds
- Solution: build isolation
  - Enables testing a single build without deploying
  - A "red" build is pulled, allowing other builds to pass
25. Current Systems
- > 15,000 tests
- 72 web build servers (51 Linux, 21 Windows)
- > 6 hours of tests on average hardware
- Deploy to a cluster of ~700 servers
26. Web Build Software
- Custom test-file runner with a JavaScript GUI
- PHP SimpleTest
- Python's built-in unittest
- Selenium Core with an in-house API wrapper
- YUITest for browser JavaScript unit tests
- Erlang EUnit
27. Conclusion
- Continuous Deployment is good
- Try it; starting earlier is easier
- It's a key part of a nutritious development process
29. More on Continuous Deployment
- SD Times Leaders of Agile, Kent Beck's Principles of Agility: http://bit.ly/9wsAYv (this webinar is tomorrow, June 30)
- Eric Ries (Startup Lessons Learned) on Continuous Deployment: http://bit.ly/5l6X1
- Timothy Fitz (IMVU), "Doing the Impossible Fifty Times a Day": http://bit.ly/OxJv
30. Thank You!
Brett G. Durrett
bdurrett@imvu.com
Twitter: @bdurrett
IMVU was recognized as one of the "Best Places to Work" (and we're hiring): http://www.imvu.com/jobs/
Speaker Notes
IMVU is avatar-based chat with a hint of social networking. People decorate their avatars, rooms, and home pages with UGC. All content is built by customers, for customers: over 4 million items. We have a robust economy with micro-transactions and virtual currency; creators run real businesses. We offer a Windows download for chat and the web for social-network-type activities.
How many are familiar with continuous deployment? How many are practicing CD now? How frequently do you release? Once a week, several times a week, once a day, several times a day?
It's like continuous integration: where CI reduces the pain of integration, CD reduces the pain of deployment.
Early last year Timothy Fitz, an engineer at IMVU, blogged about this. The responses reflected a lot of skepticism. At this point IMVU had millions of customers, a $12 million run rate, and uptime above 99%.
The third point is critical. (Talk about other places I've worked.) A good system builds confidence: engineers have code running in production on their first day (the current record is 12:30 in the afternoon).
At another place I worked, each release represented six weeks of work and thousands of commits. Release day was everybody trying to react to issues.
This is a simple overview of the process. Tests not passing usually means intermittent failures or something missed locally. Rollbacks come from real issues or occasional metric spikes. Total time spent capturing metrics is 2-3 minutes… with lots of customers.
Here is a more detailed eye-chart, mostly to show that it's a little more complex. I'll make the slides available and provide details for anyone interested.
At IMVU, we went from 3 to 1. There were no tests, and monitoring / alerting was early customer e-mail. CI started off as local tests. Deploy was rsync; rollback was an SVN revert and rsync again. On monitoring: predictive monitoring is desirable, but we can't make it work.
Does it need tests? If it's important that it keeps working, it needs test coverage; if not, why is it in the code base in the first place? If you already have code not under test, older code gets less coverage. Keep tests fast: too many functional tests slow things down; use dependency injection.
When I talk to people trying this, the hurdles are often cultural. The battle over testing at BB: one person can be efficient without it, many people can't. Expect roughly 20% overhead. Morale: builds over 20 minutes == unhappy.
When describing this system, I frequently get asked about databases. Of course, NoSQL may fix this.
Another common question
It may seem obvious that fast tests are good; how slow tests hurt may not be obvious. Poor tagging leads to slow tests landing in the build.
We solved this with some custom work: isolating bad queries, a query killer, and outsourcing.
We hope so! Since we first blogged about this and were questioned, we have more than doubled the business and almost doubled the staff.
Square-wave pain: check-in pile-ups. In 2009 the company kept head count flat; we worked effectively, doubled revenue, and had few build problems. This year we are having to invest more: engineers are to a build system as cars are to a freeway.
This may seem obvious, but these are critical. It's easy to lose sight of how much productivity is being lost.
Coming from a Tech Ops background, I can't believe I missed this: apply the same diligence you would have for your live site, including monitoring, trending, and alerting.
Here is a sample of build and push times.
More engineers means more pushes and more opportunities for a red build. With a red build, more commits get included in the next push, increasing the chance of failure. More tests mean more chances of intermittent failures. A red test impacts the rest of the organization, so this requires process to keep working well.
Client build machines: 24
Web build machines: 72 total (51 Linux, 21 Windows)
Total test count: 1,392, including:
- Selenium tests: 134
- Windows-only tests: 183
- Linux-only tests: 10
- Other / general tests: 1,065