You’ve probably heard it said that there is no cloud, just somebody else’s computer. How can we monitor what we don’t own?
Developers and operations teams are increasingly relying on cloud providers to manage and operate their infrastructure. While this can offer many benefits, it also presents new challenges when it comes to observability. In this talk, we’ll explore the unique challenges of observability in a cloud-native environment, and discuss some best practices for ensuring that you can effectively monitor and troubleshoot your applications, even when you don’t have direct access to the underlying infrastructure.
We’ll begin by discussing the basic principles of observability in a cloud-native context, including the importance of monitoring not just the application itself, but also the underlying infrastructure and the interactions between different components. We’ll then explore some common challenges that can arise when it comes to observability in a cloud-native environment, including issues with data access and the need to deal with large volumes of data from multiple sources.
We’ll also discuss some practical strategies for addressing these challenges, including the use of cloud-native observability tools such as Kubernetes metrics and logging frameworks, as well as best practices for configuring and deploying these tools effectively. We’ll also explore the role of observability in incident response and how it can help teams quickly diagnose and resolve issues in a cloud-native environment.
Whether you’re just getting started with cloud-native observability or you’re looking to take your observability practices to the next level, this talk will provide valuable insights and practical tips for ensuring that you can effectively monitor and troubleshoot your applications, even when they’re running on somebody else’s computer.
4. Observability in the Cloud
Basic Principles
Common Challenges
and Best Practices
O11y in IRM
Observability in Incident Response Management
5. Why Me?
(The least interesting part of any presentation!)
● IT operations management since the “cloud” was new and scary
● 15+ years experience in observability
● Interviewed hundreds of SRE teams
● Literally wrote the book on Grafana, an open source o11y tool
7. Why set up monitoring at all?
What’s your bottom line?
How much does downtime cost you?
Are you happy with a $2000 SLA credit for that downtime?
Maybe!
Defaults cover the average user. If you’re doing something outside of the defaults, think about it – are you really monitoring what you think you’re monitoring?
RTFM!
My cloud provider does this for me, right?
The cloud provider defaults are good enough, right?
Lou Gold, CC BY-NC-SA 2.0 Deed
8. MLT
Where do I start?
Metrics tell you that something is wrong, but not what
Logs tell you what went wrong
(Traces tell you where a bottleneck is)
You can make logs act like metrics, but not the other way around
Logs are indispensable
Barnaby Dorfman, CC-BY 2.0
9. Infrastructure (k8s, database services, etc.)
Basic services - what does everything else depend on
The basics
Don’t care about things that scale (CPU, memory… even nodes)
At most, warn about a node failure. Never page for it.
Core resources first
You build for HA, so you should monitor that way
CI/CD - did deployments work? What versions are out there? Where are they?
10. Alert for user impact
(Hint: user experience = “free” SLOs!)
Synthetics can help
Make sure you catch everything!
(Did you remember PagerDuty? VictorOps? Grafana Cloud?)
So what should I alert for?
Symptoms, not causes
Monitor your monitoring
11. Quality varies – some are better than others
UX is not a solved problem here
Pick your tool(s)
Do you ever intend to go multi-cloud? (Or migrate?)
COMMUNITY MATTERS
Cloud-specific tools take time to learn
… but so does anything
Cloud provider monitoring
Think about lock-in
12. Automatically onboard
Use a consistent format
Use it!
Semantics → Standards → Automation
Automated SLOs
One dashboard for the whole company
Drive adoption
Build, then automate
15. Scaling
The cloud is infinitely scalable, right?
It looks that way until it isn’t.
What could go wrong?
● Object storage - even if the storage is there, are the IOPS?
● Autoscaling - what happens when a zone runs out of a specific machine type?
● Spot pricing - heaven help you on Black Friday
● Serverless … isn’t serverless
BorisFromStockdale, GNU FDL
16. If your monitoring system is in the same cloud and AZ as your production environment, what happens when that AZ goes down?
Monitoring scaling
Observe from outside
Watch your overhead
Even when you have scaling defined, keep some capacity
Know what’s important
If you do run out of capacity, what can you afford to lose first? What can’t go down?
(Make sure you’re monitoring that thing if nothing else!)
21. You already have tools for this!
(Especially if you think you don’t)
These are mission critical and need to be monitored!
22. Consistency is key
Use the same tools in dev and prod
(even if it costs money!)
Dev Prod
Litlnemo, CC-BY-NC-SA 2.0 Deed Jon Sullivan, Public Domain
23. Agree ahead of time on a starting point and process
Create a shared doc/room/channel for communication
Start simple – get everyone on the highest level dashboard to start and work down
Human Intelligence > Artificial Intelligence
Use runbooks over autoremediation
Automate where appropriate…
… but remember humans are great at thinking!
24. Takeaways
You need to monitor
Engagement matters
Infinity ain’t so
Monitor for scale
Scale your monitoring
Consistency is key
25. CREDITS: This presentation template was
created by Slidesgo, and includes icons by
Flaticon and infographics & images by Freepik
Thanks!
Contact
ronald@grafana.com
rm@mastodon.amaseto.com
Images are from Adam Y Zhang, Jerem34, Vectorportal,
Lou Gold, Robert Harker, Gael Mace, BorisFromStockdale,
Barnaby Dorfman, Robert Scoble, litlnemo, Jon Sullivan and
are used via the Creative Commons license.
I’m deeply indebted to my colleagues working in
observability, but most especially to Heds Simons and
Goutham Veeramachaneni for their advice and review of
this content.
Speaker notes
First let’s get all the cliches used in this presentation out in the open:
I’m sure everybody in here has heard this quote: There is no cloud, just other people’s computers. And while it’s funny because it’s true, it also means that everyone who uses cloud computing needs to think about how not being in control of their environment impacts how they think about monitoring it.
Pets are how we used to think about servers
They each had a name
We knew their quirks
We cared a lot about keeping them healthy
Cattle are numbered and replaceable
(I hate this analogy so if anyone can help me come up with a better one I’d appreciate it!)
Nobody’s here to hear about me, but in case you’re wondering why I’m qualified to talk about this…
Way way back in the day I built monitoring systems for on-prem datacenters when the idea of trusting Amazon to run your computing infrastructure would get you laughed out of the room. I’ve since worked in and around observability doing ITIL/ITSM, logging, metrics and monitoring, and then for the last 4.5 years been at Grafana Labs where I’ve had the opportunity to talk to companies from tiny startups to the Fortune 50s to understand and advise on their observability strategies.
And I’m never the smartest person in the room, but I get to hang out with the folks who are. I’ve drawn heavily on the experiences of my colleagues both in Grafana Labs running a multi-cloud SaaS environment at massive scale as well as my friends and colleagues in other companies.
Why care about monitoring at all?
Cloud provider does this for me, right?
Well, they do it for themselves
You need to know how you were impacted to make an SLA case
Without that you’re reliant on what they decide
Defaults are good enough, right?
Maybe!
They’ll get you the basics – if you’re doing something outside of the defaults, you need to think!
You need to know what you’re actually monitoring. Is it what you think it is?
RTFM!
If you have to pick one tool, pick logs. Always start there.
Metrics are great! They are easy to implement and alert off of and you can make pretty pictures with them to make the pointy haired bosses happy. But logs will tell you what’s actually broken.
(Traces are a distant third here. They’re fantastically useful, because identifying bottlenecks and hotspots in your environment can take your performance to the next level. But this doesn’t matter if you can’t keep your application running to start with.)
It’s sometimes painful and messy, but you can aggregate logs together and make them work like metrics. But you can’t pull detailed tracebacks from a metric – they just don’t exist there.
Where in the environment should we start monitoring?
It sounds simplistic, but it’s true: always start with core resources. These are the things that if they break, your whole production environment breaks. Things like k8s, databases, storage… things that you have to have working for everything else to function. Usually people manage to think about this, because that’s what we’ve been looking at since the beginning of time.
But also remember that core infrastructure means things like your deployment and orchestration systems! If your CI/CD system is broken, you’re going to have a rough time deploying an emergency fix when something breaks!
When talking about monitoring core resources, this doesn’t mean “speeds and feeds”. You almost certainly shouldn’t care about CPU utilization, free memory, disk space, even the health of individual nodes. (Remember “cattle vs. pets”? Don’t try to make a failing node healthy. Shoot it in the head and replace it. Worry about persistent or recurring failures.)
It makes sense to track node failures. You’ll want that data to look for patterns. But don’t alert anyone for it! You’ve spent a ton of time and money getting into a self-healing environment, so let it self-heal!
Focus on symptoms, not causes. Engineers are really good at finding the cause of an issue. If you try to build that intelligence into your monitoring, you’ll catch the things you thought about ahead of time but not the things that you didn’t expect.
As an aside, if you think about what impacts your users, that will guide you pretty clearly to building service level objectives, which we’ll look at more in a moment. SLOs can tell you if you’re meeting expectations, sure, but also if you’re making changes too slowly! It’s a great way to know if you can move faster and break more things without pissing off your users.
Synthetics: even a simple page load test to look for a 200 response on your observability system is huge. (You probably want to do this from outside of that system itself, of course!)
Remember to look at all the systems that are part of your observability suite!
(Is your alerting environment working? Ticketing? If you’re using Datadog, Grafana Cloud, etc. consider having a local environment ping that from time to time…!)
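A minimal synthetic check along these lines fits in a few lines of Python; the URL is a placeholder, and in practice you would run this from outside the environment being monitored and wire the failure path into your pager.

```python
import urllib.request

def synthetic_check(url, timeout=5):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, timeout... all count as "down".
        return False

# A closed local port should report as down.
print(synthetic_check("http://127.0.0.1:9"))  # False
```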
What should you use?
Cloud provider monitoring varies.
Cloudwatch is really powerful and you can probably (after beating your head against the docs for a while) get it to do what you want. (This is not necessarily the case for other providers’ tools.)
UX is inconsistent and often hard to figure out in these tools. There’s not a ton of motivation for cloud providers to improve this, because you’ll take what you get and like it. You can’t switch from Azure Monitor to CloudWatch or vice-versa without moving your whole environment.
If you are multi-cloud, your tools should be too. You don’t want to have to learn three different mutually incompatible systems to monitor your app!
Having a solid community around your tools matters. GCP had continuous profiling before anyone else, but nobody really knew about it because it wasn’t talked about widely. A middling tool that you know about and can get help with is better than the best tool that you’ve never heard of and can’t figure out. DD, Grafana, Splunk, etc. have great resources online to help you figure out how things work. (Google for “How do I do X in Stackdriver” vs. “How do I do X in Grafana or DataDog” – StackDriver is all questions, DD/Grafana are all answers)
You can get great results out of cloud provider tooling if you put in the effort
But again, now you’re locked into that toolset, and if you want to move or expand you’re stuck. If you had put that effort into learning something platform-agnostic, you’d be home by now
Monitoring is only as good as the use it’s put to!
You can build the best monitoring system in the world, but if nobody adopts it, it’s worthless
Automatically onboard people – make it simple for people to use. Sensible defaults, provided as part of your infrastructure
Keep your logs/metrics/traces in a consistent format with a consistent naming convention. Make it simple for people who aren’t familiar with a service to know what it’s doing and what its state is
Once you’ve got this baseline, you can build automation and tooling on top
Semantic conventions lead to standardization, which leads to automation
Then you can derive SLOs from your data without much hassle
This lets you build ONE SLO dashboard to see everything in the company and enable interested parties to drill down
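As a sketch of the “derive SLOs from your data” step, the error-budget arithmetic itself is simple; the 99.9% target and request counts below are illustrative assumptions.

```python
def error_budget_remaining(total_requests, failed_requests, slo_target=0.999):
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

# 1,000,000 requests at a 99.9% target allow ~1,000 failures;
# 250 failures leaves roughly 75% of the budget.
print(error_budget_remaining(1_000_000, 250))
```

A healthy budget remaining is also the signal that you can ship faster and break more things without upsetting users, as noted above.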
There’s a simple trick that I learned from our CTO Tom Wilkie. He wanted to drive toward that standardization of SLO-based monitoring. But it’s hard to get a bunch of engineers to really buy into something by just mandating it. (Cat herding!)
So he picked a few services and instrumented them the way he wanted, and started sending out a weekly report to the senior leadership team and every engineering manager. After a few weeks, people started asking “how can I get my team’s status into that report?” Leverage FOMO!
… although I have worked for a couple of startups that began in an academic setting and designed everything in a pristine theoretical environment before going to the real world. So cynically I might call this “why everything you think you know is wrong”.
Infinitely scalable?
This is the promise of cloud computing. And mostly, you can pretend that it’s true. But when it isn’t, it’s a disaster. Suddenly you go from not knowing or caring about the underlying infrastructure you’re running on to being completely limited by it, but unable to fix it!
Object storage IOPS - this is part of knowing and understanding the SLAs your provider gives. Read the fine print! Do they promise performance or just writable bytes?
Autoscaling - sure, there might be more resources available in your cloud provider, but are they the machine type you specified?
Spot pricing - someone will always have more money than you and be able to outbid you on compute!
Serverless isn’t serverless - it’s still somebody else’s computer! A lambda function can fail because there’s still an EC2 instance underneath it running things, and hardware does sometimes break. Don’t blindly assume that your serverless functions will always just work. They need to be monitored too!
If you're monitoring your environment from inside that environment, you're not monitoring! What happens when the DC goes down?
Separate cluster, separate AZ for o11y
Ensure you have some overhead capacity, even if you scale
Rate limits where appropriate... if you're running at 90% and there's a >10% spike that you can't scale, you're dead
(Plus scaling still takes time!)
Know which thing you're going to kill first if you have to... make sure you're monitoring the critical thing
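The 90%-plus-spike arithmetic above can be written down as a toy headroom check; the utilization figures are illustrative, and it deliberately assumes no new capacity arrives mid-spike, since scaling takes time.

```python
def survives_spike(current_util, spike_fraction, max_util=1.0):
    """Can current capacity absorb a traffic spike without scaling?

    Assumes autoscaling is too slow to add capacity during the spike.
    """
    return current_util * (1 + spike_fraction) <= max_util

print(survives_spike(0.90, 0.05))  # True:  a 5% spike fits in the headroom
print(survives_spike(0.90, 0.15))  # False: a >10% spike exceeds capacity
```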
The flip side of monitoring scaling is scaling monitoring.
Something I see really commonly for folks who are rolling their own: When you build your service environment you probably set up scaling groups, but did you remember to set up scaling for your observability?
If not, what’s going to happen when you grow that production environment on Black Friday?
You scale up to handle the load, but your Prometheus instance was sized to handle normal traffic. So what happens when it starts taking double the load it was scoped for?
It goes down!
And when your metric or log system goes down before you have an incident, now you’re really hosed!
I might be biased, but: scale your observability systems _before_ you scale your services!
Facebook’s “War Room” in 2010
This is the one point where I’m going to diverge a bit and talk about tools first. It’s not because process isn’t important, but because when we get to IRM, it’s the one area where I see people actually really think through process first.
(This is awesome!)
But tools are still important.
Communications tools - where you’ll stay in sync. Pick one AND ONLY ONE. Should be the same you use daily, but have a separate incident channel
Documentation tools - ideally something collaborative that you can update in real time
THESE ARE CRITICAL TOOLS AND SHOULD BE MONITORED
First and foremost, involve developers in observability. This goes back to the “adoption” points earlier. Your developers are the ones who know the most about the internals of systems, and thus they should have a good idea how they can fail. But they need to share a common language and tools with your production environment to be useful.
The biggest mistake I see in incident response happens way before bad code sneaks into production. It’s when someone says “Dynatrace/AppD/Splunk costs a lot of money, so we’ll use it in Prod but not in Dev”. Now you have two completely different ways of monitoring and thinking about your environment. Your ops team has a set of queries and alerts, but your devs have no idea how those work. Or worse yet, you’re doing DevOps and your developers spend 90% of their time in the Dev environment and don’t know how to use the Prod tools. Having carried a pager in a past life, I can tell you that issues always occur at 3AM when you’re half awake and not thinking straight. If you don’t intimately know how to use the tools you have, you’ll struggle to make them work in an emergency.
Developers should be taking the first crack at defining alerts, and should include information about how to resolve them. This is usually best done as a runbook rather than trying to automate a fix. Automation can hide recurring issues, and things like a quick reboot can disguise the root cause of issues. Having a runbook attached to an alert tells responders the most likely place to look, but allows for investigation and human intervention when appropriate. (This is specific to your application, not the infrastructure. You still want a pod to restart when something crashes, but if you hit a CrashLoopBackOff you definitely want to dig in and understand why.)
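One way to picture the “runbook attached to an alert” idea is an alert definition that carries its runbook link as an annotation. The field names and URL below are hypothetical, not taken from any particular alerting system.

```python
# Hypothetical alert definition; field names and the runbook URL are
# illustrative, not from a real system.
ALERT = {
    "name": "CheckoutErrorRateHigh",
    "condition": "checkout error rate over 5m above 2%",  # a symptom, not a cause
    "severity": "page",
    "annotations": {
        "summary": "Users are seeing failed checkouts",
        "runbook": "https://wiki.example.com/runbooks/checkout-errors",
    },
}

def render_page(alert):
    """What the responder sees at 3AM: the symptom, plus where to look first."""
    notes = alert["annotations"]
    return f'{alert["name"]}: {notes["summary"]} (runbook: {notes["runbook"]})'

print(render_page(ALERT))
```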
You need to monitor even in the cloud, because downtime costs you more than your cloud provider cares about – If nothing else, you need logs
Community! Your tools are useless if nobody can tell you how to use them, and they’re worse than useless if nobody in your organization uses them. Pick tools that you can use, and show people what they can do and make it easy to engage
The cloud looks infinite but it’s not! Think about less obvious failure modes and check for them. Think about what happens when scaling fails and what you’ll do as a result!
Monitor for scale – alert for user impact, not for infrastructure impact
But scale your monitoring – be sure you have the infrastructure to monitor your infrastructure effectively
Consistency is key - use your tools, and use the _same_ tools everywhere
Apply the same ideas about architecture to observability
Don’t care about CPU utilization, memory utilization
Care about service health
It looks that way… until it isn’t!
Throw a few charts in place and you’re good, right? [examples]