You’ve probably heard it said that there is no cloud, just somebody else’s computer. How can we monitor what we don’t own?
Developers and operations teams are increasingly relying on cloud providers to manage and operate their infrastructure. While this can offer many benefits, it also presents new challenges when it comes to observability. In this talk, we’ll explore the unique challenges of observability in a cloud-native environment, and discuss some best practices for ensuring that you can effectively monitor and troubleshoot your applications, even when you don’t have direct access to the underlying infrastructure.
We’ll begin by discussing the basic principles of observability in a cloud-native context, including the importance of monitoring not just the application itself, but also the underlying infrastructure and the interactions between different components. We’ll then explore some common challenges that can arise when it comes to observability in a cloud-native environment, including issues with data access and the need to deal with large volumes of data from multiple sources.
We’ll also discuss some practical strategies for addressing these challenges, including the use of cloud-native observability tools such as Kubernetes metrics and logging frameworks, as well as best practices for configuring and deploying these tools effectively. We’ll also explore the role of observability in incident response and how it can help teams quickly diagnose and resolve issues in a cloud-native environment.
Whether you’re just getting started with cloud-native observability or you’re looking to take your observability practices to the next level, this talk will provide valuable insights and practical tips for ensuring that you can effectively monitor and troubleshoot your applications, even when they’re running on somebody else’s computer.
4. Observability in the Cloud
Basic Principles
Common Challenges
and Best Practices
O11y in IRM
Observability in Incident Response Management
5. Why Me?
(The least interesting part of any presentation!)
● IT operations management since the “cloud” was new and scary
● 15+ years experience in observability
● Interviewed hundreds of SRE teams
● Literally wrote the book on Grafana, an open source o11y tool
7. Why set up monitoring at all?
What’s your bottom line?
How much does downtime cost you?
Are you happy with a $2000 SLA credit for that downtime?
Maybe!
Defaults cover the average user. If you’re doing something outside of the defaults, think about it – are you really monitoring what you think you’re monitoring?
RTFM!
My cloud provider does this for me, right?
The cloud provider defaults are good enough, right?
Lou Gold, CC BY-NC-SA 2.0 Deed
8. MLT
Where do I start?
Metrics tell you that something is wrong, but not what
Logs tell you what went wrong
(Traces tell you where a bottleneck is)
You can make logs act like metrics, but not the other way around
Logs are indispensable
Barnaby Dorfman, CC-BY 2.0
9. Infrastructure (k8s, database services, etc.)
Basic services - what does everything else depend on
The basics
Don’t care about things that scale (CPU, memory… even nodes)
At most, warn about a node failure. Never page for it.
Core resources first
You build for HA, so you should monitor that way
CI/CD - did deployments work? What versions are out there? Where are they?
10. Alert for user impact
(Hint: user experience = “free” SLOs!)
Synthetics can help
Make sure you catch everything!
(Did you remember PagerDuty? VictorOps? Grafana Cloud?)
So what should I alert for?
Symptoms, not causes
Monitor your monitoring
11. Quality varies – some are better than others
UX is not a solved problem here
Pick your tool(s)
Do you ever intend to go multi-cloud? (Or migrate?)
COMMUNITY MATTERS
Cloud-specific tools take time to learn
… but so does anything
Cloud provider monitoring
Think about lock-in
12. Automatically onboard
Use a consistent format
Use it!
Semantics → Standards → Automation
Automated SLOs
One dashboard for the whole company
Drive adoption
Build, then automate
15. Scaling
The cloud is infinitely scalable, right?
It looks that way until it isn’t.
What could go wrong?
● Object storage - even if the storage is there, are the IOPS?
● Autoscaling - what happens when a zone runs out of a specific machine type?
● Spot pricing - heaven help you on Black Friday
● Serverless … isn’t serverless
BorisFromStockdale, GNU FDL
16. If your monitoring system is in the same cloud and AZ as your production environment, what happens when that AZ goes down?
Monitoring scaling
Observe from outside
Watch your overhead
Even when you have scaling defined, keep some capacity
Know what’s important
If you do run out of capacity, what can you afford to lose first? What can’t go down?
(Make sure you’re monitoring that thing if nothing else!)
21. You already have tools for this!
(Especially if you think you don’t)
These are mission critical and need to be monitored!
22. Consistency is key
Use the same tools in dev and prod
(even if it costs money!)
Dev Prod
Litlnemo, CC-BY-NC-SA 2.0 Deed Jon Sullivan, Public Domain
23. Agree ahead of time on a starting point and process
Create a shared doc/room/channel for communication
Start simple – get everyone on the highest level dashboard to start and work down
Human Intelligence > Artificial Intelligence
Use runbooks over autoremediation
Automate where appropriate…
… but remember humans are great at thinking!
24. Takeaways
You need to monitor
Engagement matters
Infinity ain’t so
Monitor for scale
Scale your monitoring
Consistency is key
25. CREDITS: This presentation template was
created by Slidesgo, and includes icons by
Flaticon and infographics & images by Freepik
Thanks!
Contact
ronald@grafana.com
rm@mastodon.amaseto.com
Images are from Adam Y Zhang, Jerem34, Vectorportal,
Lou Gold, Robert Harker, Gael Mace, BorisFromStockdale,
Barnaby Dorfman, Robert Scoble, litlnemo, Jon Sullivan and
are used via the Creative Commons license.
I’m deeply indebted to my colleagues working in
observability, but most especially to Heds Simons and
Goutham Veeramachaneni for their advice and review of
this content.
Speaker notes
First let’s get all the cliches used in this presentation out in the open:
I’m sure everybody in here has heard this quote: There is no cloud, just other people’s computers. And while it’s funny because it’s true, it also means that everyone who uses cloud computing needs to think about how not being in control of their environment impacts how they think about monitoring it.
Pets are how we used to think about servers
They each had a name
We knew their quirks
We cared a lot about keeping them healthy
Cattle are numbered and replaceable
(I hate this analogy so if anyone can help me come up with a better one I’d appreciate it!)
Nobody’s here to hear about me, but in case you’re wondering why I’m qualified to talk about this…
Way way back in the day I built monitoring systems for on-prem datacenters when the idea of trusting Amazon to run your computing infrastructure would get you laughed out of the room. I’ve since worked in and around observability doing ITIL/ITSM, logging, metrics and monitoring, and then for the last 4.5 years been at Grafana Labs where I’ve had the opportunity to talk to companies from tiny startups to the Fortune 50s to understand and advise on their observability strategies.
And I’m never the smartest person in the room, but I get to hang out with the folks who are. I’ve drawn heavily on the experiences of my colleagues both in Grafana Labs running a multi-cloud SaaS environment at massive scale as well as my friends and colleagues in other companies.
Why care about monitoring at all?
Cloud provider does this for me, right?
Well, they do it for themselves
You need to know how you were impacted to make an SLA case
Without that you’re reliant on what they decide
Defaults are good enough, right?
Maybe!
They’ll get you the basics – if you’re doing something outside of the defaults, you need to think!
You need to know what you’re actually monitoring. Is it what you think it is?
RTFM!
If you have to pick one tool, pick logs. Always start there.
Metrics are great! They are easy to implement and alert off of and you can make pretty pictures with them to make the pointy haired bosses happy. But logs will tell you what’s actually broken.
(Traces are a distant third here. They’re fantastically useful, because identifying bottlenecks and hotspots in your environment can take your performance to the next level. But this doesn’t matter if you can’t keep your application running to start with.)
It’s sometimes painful and messy, but you can aggregate logs together and make them work like metrics. But you can’t pull detailed tracebacks from a metric – they just don’t exist there.
Where in the environment should we start monitoring?
It sounds simplistic, but it’s true: always start with core resources. These are the things that if they break, your whole production environment breaks. Things like k8s, databases, storage… things that you have to have working for everything else to function. Usually people manage to think about this, because that’s what we’ve been looking at since the beginning of time.
But also remember that core infrastructure means things like your deployment and orchestration systems! If your CI/CD system is broken, you’re going to have a rough time deploying an emergency fix when something breaks!
When talking about monitoring core resources, this doesn’t mean “speeds and feeds”. You almost certainly shouldn’t care about CPU utilization, free memory, disk space, even the health of individual nodes. (Remember “cattle vs. pets”? Don’t try to make a failing node healthy. Shoot it in the head and replace it. Worry about persistent or recurring failures.)
It makes sense to track node failures. You’ll want that data to look for patterns. But don’t alert anyone for it! You’ve spent a ton of time and money getting into a self-healing environment, so let it self-heal!
Focus on symptoms, not causes. Engineers are really good at finding the cause of an issue. If you try to build that intelligence into your monitoring, you’ll catch the things you thought about ahead of time but not the things that you didn’t expect.
As an aside, if you think about what impacts your users, that will guide you pretty clearly to building service level objectives, which we’ll look at more in a moment. SLOs can tell you if you’re meeting expectations, sure, but also if you’re making changes too slowly! It’s a great way to know if you can move faster and break more things without pissing off your users.
Synthetics: even a simple page load test to look for a 200 response on your observability system is huge. (You probably want to do this from outside of that system itself, of course!)
Remember to look at all the systems that are part of your observability suite!
(Is your alerting environment working? Ticketing? If you’re using Datadog, Grafana Cloud, etc. consider having a local environment ping that from time to time…!)
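A minimal synthetic check along these lines fits in a few lines of Python; the URL is a placeholder, and in practice you would run this from outside the environment being monitored and wire the failure path into your pager.

```python
import urllib.request

def synthetic_check(url, timeout=5):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, timeout... all count as "down".
        return False

# A closed local port should report as down.
print(synthetic_check("http://127.0.0.1:9"))  # False
```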
What should you use?
Cloud provider monitoring varies.
Cloudwatch is really powerful and you can probably (after beating your head against the docs for a while) get it to do what you want. (This is not necessarily the case for other providers’ tools.)
UX is inconsistent and often hard to figure out in these tools. There’s not a ton of motivation for cloud providers to improve this, because you’ll take what you get and like it. You can’t switch from Azure Monitor to CloudWatch or vice-versa without moving your whole environment.
If you are multi-cloud, your tools should be too. You don’t want to have to learn three different mutually incompatible systems to monitor your app!
Having a solid community around your tools matters. GCP had continuous profiling before anyone else, but nobody really knew about it because it wasn’t talked about widely. A middling tool that you know about and can get help with is better than the best tool that you’ve never heard of and can’t figure out. DD, Grafana, Splunk, etc. have great resources online to help you figure out how things work. (Google for “How do I do X in Stackdriver” vs. “How do I do X in Grafana or DataDog” – StackDriver is all questions, DD/Grafana are all answers)
You can get great results out of cloud provider tooling if you put in the effort
But again, now you’re locked into that toolset, and if you want to move or expand you’re stuck. If you had put that effort into learning something platform-agnostic, you’d be home by now
Monitoring is only as good as the use it’s put to!
You can build the best monitoring system in the world, but if nobody adopts it, it’s worthless
Automatically onboard people – make it simple for people to use. Sensible defaults, provided as part of your infrastructure
Keep your logs/metrics/traces in a consistent format with a consistent naming convention. Make it simple for people who aren’t familiar with a service to know what it’s doing and what its state is
Once you’ve got this baseline, you can build automation and tooling on top
Semantic conventions lead to standardization, which leads to automation
Then you can derive SLOs from your data without much hassle
This lets you build ONE SLO dashboard to see everything in the company and enable interested parties to drill down
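As a sketch of the “derive SLOs from your data” step, the error-budget arithmetic itself is simple; the 99.9% target and request counts below are illustrative assumptions.

```python
def error_budget_remaining(total_requests, failed_requests, slo_target=0.999):
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

# 1,000,000 requests at a 99.9% target allow ~1,000 failures;
# 250 failures leaves roughly 75% of the budget.
print(error_budget_remaining(1_000_000, 250))
```

A healthy budget remaining is also the signal that you can ship faster and break more things without upsetting users, as noted above.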
There’s a simple trick that I learned from our CTO Tom Wilkie. He wanted to drive toward that standardization of SLO-based monitoring. But it’s hard to get a bunch of engineers to really buy into something by just mandating it. (Cat herding!)
So he picked a few services and instrumented them the way he wanted, and started sending out a weekly report to the senior leadership team and every engineering manager. After a few weeks, people started asking “how can I get my team’s status into that report?” Leverage FOMO!
… although I have worked for a couple of startups that began in an academic setting and designed everything in a pristine theoretical environment before going to the real world. So cynically I might call this “why everything you think you know is wrong”.
Infinitely scalable?
This is the promise of cloud computing. And mostly, you can pretend that it’s true. But when it isn’t, it’s a disaster. Suddenly you go from not knowing or caring about the underlying infrastructure you’re running on to being completely limited by it, but unable to fix it!
Object storage IOPS - this is part of knowing and understanding the SLAs your provider gives. Read the fine print! Do they promise performance or just writable bytes?
Autoscaling - sure, there might be more resources available in your cloud provider, but are they the machine type you specified?
Spot pricing - someone will always have more money than you and be able to outbid you on compute!
Serverless isn’t serverless - it’s still somebody else’s computer! A lambda function can fail because there’s still an EC2 instance underneath it running things, and hardware does sometimes break. Don’t blindly assume that your serverless functions will always just work. They need to be monitored too!
If you're monitoring your environment from inside that environment, you're not monitoring! What happens when the DC goes down?
Separate cluster, separate AZ for o11y
Ensure you have some overhead capacity, even if you scale
Rate limits where appropriate... if you're running at 90% and there's a >10% spike that you can't scale, you're dead
(Plus scaling still takes time!)
Know which thing you're going to kill first if you have to... make sure you're monitoring the critical thing
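The 90%-plus-spike arithmetic above can be written down as a toy headroom check; the utilization figures are illustrative, and it deliberately assumes no new capacity arrives mid-spike, since scaling takes time.

```python
def survives_spike(current_util, spike_fraction, max_util=1.0):
    """Can current capacity absorb a traffic spike without scaling?

    Assumes autoscaling is too slow to add capacity during the spike.
    """
    return current_util * (1 + spike_fraction) <= max_util

print(survives_spike(0.90, 0.05))  # True:  a 5% spike fits in the headroom
print(survives_spike(0.90, 0.15))  # False: a >10% spike exceeds capacity
```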
The flip side of monitoring scaling is scaling monitoring.
Something I see really commonly for folks who are rolling their own: When you build your service environment you probably set up scaling groups, but did you remember to set up scaling for your observability?
If not, what’s going to happen when you grow that production environment on Black Friday?
You scale up to handle the load, but your Prometheus instance was sized to handle normal traffic. So what happens when it starts taking double the load it was scoped for?
It goes down!
And when your metric or log system goes down before you have an incident, now you’re really hosed!
I might be biased, but: scale your observability systems _before_ you scale your services!
Facebook’s “War Room” in 2010
This is the one point where I’m going to diverge a bit and talk about tools first. It’s not because process isn’t important, but because when we get to IRM, it’s the one area where I see people actually really think through process first.
(This is awesome!)
But tools are still important.
Communications tools - where you’ll stay in sync. Pick one AND ONLY ONE. Should be the same you use daily, but have a separate incident channel
Documentation tools - ideally something collaborative that you can update in real time
THESE ARE CRITICAL TOOLS AND SHOULD BE MONITORED
First and foremost, involve developers in observability. This goes back to the “adoption” points earlier. Your developers are the ones who know the most about the internals of systems, and thus they should have a good idea how they can fail. But they need to share a common language and tools with your production environment to be useful.
The biggest mistake I see in incident response happens way before bad code sneaks into production. It’s when someone says “Dynatrace/AppD/Splunk costs a lot of money, so we’ll use it in Prod but not in Dev”. Now you have two completely different ways of monitoring and thinking about your environment. Your ops team has a set of queries and alerts, but your devs have no idea how those work. Or worse yet, you’re doing DevOps and your developers spend 90% of their time in the Dev environment and don’t know how to use the Prod tools. Having carried a pager in a past life, I can tell you that issues always occur at 3AM when you’re half awake and not thinking straight. If you don’t intimately know how to use the tools you have, you’ll struggle to make them work in an emergency.
Developers should be taking the first crack at defining alerts, and should include information about how to resolve them. This is usually best done as a runbook rather than trying to automate a fix. Automation can hide recurring issues, and things like a quick reboot can disguise the root cause of issues. Having a runbook attached to an alert tells responders the most likely place to look, but allows for investigation and human intervention when appropriate. (This is specific to your application, not the infrastructure. You still want a pod to restart when something crashes, but if you hit a CrashLoopBackOff you definitely want to dig in and understand why.)
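One way to picture the “runbook attached to an alert” idea is an alert definition that carries its runbook link as an annotation. The field names and URL below are hypothetical, not taken from any particular alerting system.

```python
# Hypothetical alert definition; field names and the runbook URL are
# illustrative, not from a real system.
ALERT = {
    "name": "CheckoutErrorRateHigh",
    "condition": "checkout error rate over 5m above 2%",  # a symptom, not a cause
    "severity": "page",
    "annotations": {
        "summary": "Users are seeing failed checkouts",
        "runbook": "https://wiki.example.com/runbooks/checkout-errors",
    },
}

def render_page(alert):
    """What the responder sees at 3AM: the symptom, plus where to look first."""
    notes = alert["annotations"]
    return f'{alert["name"]}: {notes["summary"]} (runbook: {notes["runbook"]})'

print(render_page(ALERT))
```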
You need to monitor even in the cloud, because downtime costs you more than your cloud provider cares about – If nothing else, you need logs
Community! Your tools are useless if nobody can tell you how to use them, and they’re worse than useless if nobody in your organization uses them. Pick tools that you can use, and show people what they can do and make it easy to engage
The cloud looks infinite but it’s not! Think about less obvious failure modes and check for them. Think about what happens when scaling fails and what you’ll do as a result!
Monitor for scale – alert for user impact, not for infrastructure impact
But scale your monitoring – be sure you have the infrastructure to monitor your infrastructure effectively
Consistency is key - use your tools, and use the _same_ tools everywhere
Apply the same ideas about architecture to observability
Don’t care about CPU utilization, memory utilization
Care about service health
It looks that way… until it isn’t!
Throw a few charts in place and you’re good, right? [examples]