SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
How It All Goes Down 
Your next service outage will be during 
@ctosummit at 1:50pm. 
Daniel Doubrovkine 
db@artsy.net 
@dblockdotorg
Do you have a Syrian domain? 
somewhere here there’s your .sy TLD authority
Are you using MongoDB?
Do you use a PAAS?
Is there any logic involved? 
1. Restart Unicorns, no improvement. 
2. Restart a server, no improvement. 
3. Found an EC2 maintenance note buried in email. 
4. Patch responsible for 5x slower response times! 
5. Do a DB query from a production console. 
6. Identify a pattern (slow, then fast). 
7. Look in MongoDB logs.
Are there humans involved?
Are you using a beta version of a driver?
Post-Mortem
Subject
Summary 
We have experienced 3 separate 
outages over the last 72 hours that 
have affected api.artsy.net which is 
the backbone of all our 
applications, including Admin, 
CMS, artsy.net, m.artsy.net, etc. 
The first incident was slower 
response time 9/28 2AM-10AM 
EST. 
The second and third incidents 
were two outages, 9/29 11PM- 
12AM EST and 9/30 4AM-6:30AM 
EST.
Cause 
1. Trigger 
Servers rebooted. 
2. Unexpected behavior. 
Reboots should have been handled just fine by software. 
3. Cause. 
Human error + bug in the driver. 
4. Consequence. 
All front-ends down.
Resolution 
1. Ticket 
Woke up a contractor. 
2. Manual intervention. 
Kick servers.
Post-Mortem 
1. Human Error Preventable 
The dismissal of the alert was wrong. 
2. Failure to Plan. 
Reboot schedule needed a human to monitor it.
Outage History
Thanks! 
Daniel Doubrovkine 
db@artsy.net 
@dblockdotorg

Weitere ähnliche Inhalte

Ähnlich wie How it All Goes Down

Lessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at CraigslistLessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at CraigslistJeremy Zawodny
 
Bp106 Worst Practices Final
Bp106   Worst Practices FinalBp106   Worst Practices Final
Bp106 Worst Practices FinalBill Buchan
 
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-Bending
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-BendingIBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-Bending
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-BendingLuis Guirigay
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013Server Density
 
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)Jean-Luc David
 
Team 4 Presents: The Client Server Model
Team 4 Presents: The Client Server Model Team 4 Presents: The Client Server Model
Team 4 Presents: The Client Server Model anniekate93
 
How it's made - MyGet (CloudBurst)
How it's made - MyGet (CloudBurst)How it's made - MyGet (CloudBurst)
How it's made - MyGet (CloudBurst)Maarten Balliauw
 
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagramMohit Jain
 
Sirt roundtable malicious-emailtrendmicro
Sirt roundtable malicious-emailtrendmicroSirt roundtable malicious-emailtrendmicro
Sirt roundtable malicious-emailtrendmicroSumit Tambe
 
Real Life Java EE Performance Tuning
Real Life Java EE Performance TuningReal Life Java EE Performance Tuning
Real Life Java EE Performance TuningC2B2 Consulting
 
Real Life Java EE Performance Tuning
Real Life Java EE Performance TuningReal Life Java EE Performance Tuning
Real Life Java EE Performance Tuningmbrasier
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
 
STP201 Efficiency at Scale - AWS re: Invent 2012
STP201 Efficiency at Scale - AWS re: Invent 2012STP201 Efficiency at Scale - AWS re: Invent 2012
STP201 Efficiency at Scale - AWS re: Invent 2012Amazon Web Services
 
Wcl303 russinovich
Wcl303 russinovichWcl303 russinovich
Wcl303 russinovichconleyc
 
Linux server world
Linux server worldLinux server world
Linux server worldAkshat Singh
 
Cassandra at Zalando
Cassandra at ZalandoCassandra at Zalando
Cassandra at ZalandoLuis Mineiro
 
IBM Connect 2014 presentation. BP106 Managed Mail File Replicas.– How to Win...
IBM Connect 2014 presentation.  BP106 Managed Mail File Replicas.– How to Win...IBM Connect 2014 presentation.  BP106 Managed Mail File Replicas.– How to Win...
IBM Connect 2014 presentation. BP106 Managed Mail File Replicas.– How to Win...Don Martindale
 

Ähnlich wie How it All Goes Down (20)

Lessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at CraigslistLessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at Craigslist
 
Bp106 Worst Practices Final
Bp106   Worst Practices FinalBp106   Worst Practices Final
Bp106 Worst Practices Final
 
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-Bending
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-BendingIBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-Bending
IBM ConnectED 2015 - BP103: Solving the Weird, the Obscure, and the Mind-Bending
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013
 
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)
Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)
 
Team 4 Presents: The Client Server Model
Team 4 Presents: The Client Server Model Team 4 Presents: The Client Server Model
Team 4 Presents: The Client Server Model
 
How it's made - MyGet (CloudBurst)
How it's made - MyGet (CloudBurst)How it's made - MyGet (CloudBurst)
How it's made - MyGet (CloudBurst)
 
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram
89025069 mike-krieger-instagram-at-the-airbnb-tech-talk-on-scaling-instagram
 
Sirt roundtable malicious-emailtrendmicro
Sirt roundtable malicious-emailtrendmicroSirt roundtable malicious-emailtrendmicro
Sirt roundtable malicious-emailtrendmicro
 
Real Life Java EE Performance Tuning
Real Life Java EE Performance TuningReal Life Java EE Performance Tuning
Real Life Java EE Performance Tuning
 
Real Life Java EE Performance Tuning
Real Life Java EE Performance TuningReal Life Java EE Performance Tuning
Real Life Java EE Performance Tuning
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
STP201 Efficiency at Scale - AWS re: Invent 2012
STP201 Efficiency at Scale - AWS re: Invent 2012STP201 Efficiency at Scale - AWS re: Invent 2012
STP201 Efficiency at Scale - AWS re: Invent 2012
 
Tuning Java Servers
Tuning Java Servers Tuning Java Servers
Tuning Java Servers
 
Wcl303 russinovich
Wcl303 russinovichWcl303 russinovich
Wcl303 russinovich
 
Linux server world
Linux server worldLinux server world
Linux server world
 
Cassandra at Zalando
Cassandra at ZalandoCassandra at Zalando
Cassandra at Zalando
 
Micro services
Micro servicesMicro services
Micro services
 
Os Whitaker
Os WhitakerOs Whitaker
Os Whitaker
 
IBM Connect 2014 presentation. BP106 Managed Mail File Replicas.– How to Win...
IBM Connect 2014 presentation.  BP106 Managed Mail File Replicas.– How to Win...IBM Connect 2014 presentation.  BP106 Managed Mail File Replicas.– How to Win...
IBM Connect 2014 presentation. BP106 Managed Mail File Replicas.– How to Win...
 

Mehr von Daniel Doubrovkine

The Future of Art @ Worlds Fair Nano
The Future of Art @ Worlds Fair NanoThe Future of Art @ Worlds Fair Nano
The Future of Art @ Worlds Fair NanoDaniel Doubrovkine
 
Nasdaq CTO Summit: Inspiring Team Leads to Give Away Legos
Nasdaq CTO Summit: Inspiring Team Leads to Give Away LegosNasdaq CTO Summit: Inspiring Team Leads to Give Away Legos
Nasdaq CTO Summit: Inspiring Team Leads to Give Away LegosDaniel Doubrovkine
 
Open-Source by Default, UN Community.camp
Open-Source by Default, UN Community.campOpen-Source by Default, UN Community.camp
Open-Source by Default, UN Community.campDaniel Doubrovkine
 
Taking Over Open Source Projects @ GoGaRuCo 2014
Taking Over Open Source Projects @ GoGaRuCo 2014Taking Over Open Source Projects @ GoGaRuCo 2014
Taking Over Open Source Projects @ GoGaRuCo 2014Daniel Doubrovkine
 
Tiling and Zooming ASCII Art @ iOSoho
Tiling and Zooming ASCII Art @ iOSohoTiling and Zooming ASCII Art @ iOSoho
Tiling and Zooming ASCII Art @ iOSohoDaniel Doubrovkine
 
The Other Side of Your Interview
The Other Side of Your InterviewThe Other Side of Your Interview
The Other Side of Your InterviewDaniel Doubrovkine
 
Hiring Engineers (the Artsy Way)
Hiring Engineers (the Artsy Way)Hiring Engineers (the Artsy Way)
Hiring Engineers (the Artsy Way)Daniel Doubrovkine
 
Building and Scaling a Test Driven Culture
Building and Scaling a Test Driven CultureBuilding and Scaling a Test Driven Culture
Building and Scaling a Test Driven CultureDaniel Doubrovkine
 
Introducing Remote Install Framework
Introducing Remote Install FrameworkIntroducing Remote Install Framework
Introducing Remote Install FrameworkDaniel Doubrovkine
 
Taming the Testing Beast - AgileDC 2012
Taming the Testing Beast - AgileDC 2012Taming the Testing Beast - AgileDC 2012
Taming the Testing Beast - AgileDC 2012Daniel Doubrovkine
 
GeneralAssemb.ly Summer Program: Tech from the Ground Up
GeneralAssemb.ly Summer Program: Tech from the Ground UpGeneralAssemb.ly Summer Program: Tech from the Ground Up
GeneralAssemb.ly Summer Program: Tech from the Ground UpDaniel Doubrovkine
 
Making Agile Choices in Software Technology
Making Agile Choices in Software TechnologyMaking Agile Choices in Software Technology
Making Agile Choices in Software TechnologyDaniel Doubrovkine
 
From Zero to Mongo, Art.sy Experience w/ MongoDB
From Zero to Mongo, Art.sy Experience w/ MongoDBFrom Zero to Mongo, Art.sy Experience w/ MongoDB
From Zero to Mongo, Art.sy Experience w/ MongoDBDaniel Doubrovkine
 

Mehr von Daniel Doubrovkine (20)

The Future of Art @ Worlds Fair Nano
The Future of Art @ Worlds Fair NanoThe Future of Art @ Worlds Fair Nano
The Future of Art @ Worlds Fair Nano
 
Nasdaq CTO Summit: Inspiring Team Leads to Give Away Legos
Nasdaq CTO Summit: Inspiring Team Leads to Give Away LegosNasdaq CTO Summit: Inspiring Team Leads to Give Away Legos
Nasdaq CTO Summit: Inspiring Team Leads to Give Away Legos
 
Product Development 101
Product Development 101Product Development 101
Product Development 101
 
Open-Source by Default, UN Community.camp
Open-Source by Default, UN Community.campOpen-Source by Default, UN Community.camp
Open-Source by Default, UN Community.camp
 
Your First Slack Ruby Bot
Your First Slack Ruby BotYour First Slack Ruby Bot
Your First Slack Ruby Bot
 
Single Sign-On with Waffle
Single Sign-On with WaffleSingle Sign-On with Waffle
Single Sign-On with Waffle
 
Taking Over Open Source Projects @ GoGaRuCo 2014
Taking Over Open Source Projects @ GoGaRuCo 2014Taking Over Open Source Projects @ GoGaRuCo 2014
Taking Over Open Source Projects @ GoGaRuCo 2014
 
Mentoring Engineers & Humans
Mentoring Engineers & HumansMentoring Engineers & Humans
Mentoring Engineers & Humans
 
Tiling and Zooming ASCII Art @ iOSoho
Tiling and Zooming ASCII Art @ iOSohoTiling and Zooming ASCII Art @ iOSoho
Tiling and Zooming ASCII Art @ iOSoho
 
Artsy ♥ ASCII ART
Artsy ♥ ASCII ARTArtsy ♥ ASCII ART
Artsy ♥ ASCII ART
 
The Other Side of Your Interview
The Other Side of Your InterviewThe Other Side of Your Interview
The Other Side of Your Interview
 
Hiring Engineers (the Artsy Way)
Hiring Engineers (the Artsy Way)Hiring Engineers (the Artsy Way)
Hiring Engineers (the Artsy Way)
 
Mentoring 101 - the Artsy way
Mentoring 101 - the Artsy wayMentoring 101 - the Artsy way
Mentoring 101 - the Artsy way
 
Building and Scaling a Test Driven Culture
Building and Scaling a Test Driven CultureBuilding and Scaling a Test Driven Culture
Building and Scaling a Test Driven Culture
 
Introducing Remote Install Framework
Introducing Remote Install FrameworkIntroducing Remote Install Framework
Introducing Remote Install Framework
 
HackYale 0-60 in Startup Tech
HackYale 0-60 in Startup TechHackYale 0-60 in Startup Tech
HackYale 0-60 in Startup Tech
 
Taming the Testing Beast - AgileDC 2012
Taming the Testing Beast - AgileDC 2012Taming the Testing Beast - AgileDC 2012
Taming the Testing Beast - AgileDC 2012
 
GeneralAssemb.ly Summer Program: Tech from the Ground Up
GeneralAssemb.ly Summer Program: Tech from the Ground UpGeneralAssemb.ly Summer Program: Tech from the Ground Up
GeneralAssemb.ly Summer Program: Tech from the Ground Up
 
Making Agile Choices in Software Technology
Making Agile Choices in Software TechnologyMaking Agile Choices in Software Technology
Making Agile Choices in Software Technology
 
From Zero to Mongo, Art.sy Experience w/ MongoDB
From Zero to Mongo, Art.sy Experience w/ MongoDBFrom Zero to Mongo, Art.sy Experience w/ MongoDB
From Zero to Mongo, Art.sy Experience w/ MongoDB
 

Kürzlich hochgeladen

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sectoritnewsafrica
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 

Kürzlich hochgeladen (20)

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 

How it All Goes Down

  • 1. How It All Goes Down Your next service outage will be during @ctosummit at 1:50pm. Daniel Doubrovkine db@artsy.net @dblockdotorg
  • 2.
  • 3. Do you have a Syrian domain? somewhere here there’s your .sy TLD authority
  • 4. Are you using MongoDB?
  • 5. Do you use a PAAS?
  • 6. Is there any logic involved? 1. Restart Unicorns, no improvement. 2. Restart a server, no improvement. 3. Found an EC2 maintenance note buried in email. 4. Patch responsible for 5x slower response times! 5. Do a DB query from a production console. 6. Identify a pattern (slow, then fast). 7. Look in MongoDB logs.
  • 7. Are there humans involved?
  • 8. Are you using a beta version of a driver?
  • 11. Summary We have experienced 3 separate outages over the last 72 hours that have affected api.artsy.net which is the backbone of all our applications, including Admin, CMS, artsy.net, m.artsy.net, etc. The first incident was slower response time 9/28 2AM-10AM EST. The second and third incidents were two outages, 9/29 11PM- 12AM EST and 9/30 4AM-6:30AM EST.
  • 12. Cause 1. Trigger Servers rebooted. 2. Unexpected behavior. Reboots should have been handled just fine by software. 3. Cause. Human error + bug in the driver. 4. Consequence. All front-ends down.
  • 13. Resolution 1. Ticket Woke up a contractor. 2. Manual intervention. Kick servers.
  • 14. Post-Mortem 1. Human Error Preventable The dismissal of the alert was wrong. 2. Failure to Plan. Reboot schedule needed a human to monitor it.
  • 16. Thanks! Daniel Doubrovkine db@artsy.net @dblockdotorg