Bad monitoring, alerting and logging have made Gil Zellner very frustrated in some of his previous positions. It seems that almost nobody gets this exactly right. This talk covers the most annoying issues he has come across and advice on how to fix them.
25. Easy (days)
- no changes to infrastructure
- just policy
Intermediate (months)
- small changes to apps
- logging
- light automation
Hard (years)
- design for better operability
- long term
@Heathenaspargus
35. Solution: alert only on things that meet the following criteria:
1) Alert on symptoms, not suspected "causes"
2) Actionable
3) Business breaking
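The three criteria can be read as a filter that every candidate alert has to pass before it is allowed to page anyone. The following is a minimal sketch of that idea, not the speaker's implementation: the `Alert` class, its field names, and the two example alerts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # All fields are hypothetical names chosen for this sketch.
    name: str
    is_symptom: bool            # users can observe it (errors, latency), not a guessed cause
    is_actionable: bool         # the on-call engineer can do something about it right now
    is_business_breaking: bool  # the business is actually hurt while it lasts


def should_page(alert: Alert) -> bool:
    """Page only when all three criteria from the slide hold."""
    return alert.is_symptom and alert.is_actionable and alert.is_business_breaking


checkout_errors = Alert("checkout 5xx rate above threshold", True, True, True)
high_cpu = Alert("cpu above 90% on one host", False, False, False)

print(should_page(checkout_errors))  # True
print(should_page(high_cpu))         # False
```

Anything that fails the filter can still be recorded or emailed; it just should not wake anyone up.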
56. Bad artists copy, great artists steal
email: Gil.Zellner@gmail.com
Twitter: @Heathenaspargus
Editor's notes
Before we begin, let's all relax a bit and dim the lights.
PagerDuty alert - add alert
What is this nonsense?
Find something not worth waking me up over.
Find some app sending nonsense errors
Why:
lower capacity when starting
equipment etc.
External cost to other engineers
1st on-call
10th on-call
If nobody is dying, leave me alone.
Why mandatory: people don't allocate enough rest for themselves. Left to choose, a person would simply accumulate more vacation days.
Dedicated time in the week to prevent things from waking us up at night.
Simple fixes, even things like a Jenkins job that reboots components. At SimilarWeb, production engineers spend their on-call week working only on ease-of-operation issues.
Alert all the things!
Why this is bad:
boy who cried wolf syndrome, alert fatigue
Attention is a depletable resource
Alert everyone!
Why this is bad:
tragedy of the commons: everyone ends up turning off their phones, thinking "someone else will take care of this"
Only the people who can fix the problem get alerted; others get emails.
The system needs to be smart enough to make that choice, and it needs to be fixed when it makes a mistake and wakes up the wrong person.
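One way to picture this routing rule is a small ownership map that decides who is paged and who only receives an email, plus a way to correct the map after a wrong wake-up. This is a rough sketch under assumptions: the service names, the engineer names, and the `route` and `reassign` helpers are all hypothetical, and a real setup would keep this in the paging tool's own escalation configuration.

```python
from typing import Dict, List, Set

# Hypothetical ownership map: which engineers can actually fix which service.
OWNERS: Dict[str, Set[str]] = {
    "checkout": {"alice"},
    "search": {"bob", "carol"},
}
EVERYONE: Set[str] = {"alice", "bob", "carol", "dave"}


def route(service: str) -> Dict[str, List[str]]:
    """Page the people who can fix the service; email everyone else."""
    fixers = OWNERS.get(service, set())
    return {
        "page": sorted(fixers),
        "email": sorted(EVERYONE - fixers),
    }


def reassign(service: str, wrong: str, right: str) -> None:
    """Correct the map after the wrong person was woken up."""
    owners = OWNERS.setdefault(service, set())
    owners.discard(wrong)
    owners.add(right)


print(route("checkout"))
```

The `reassign` step is the part the note stresses: a mistaken page should end with the map being fixed, not with the engineer muting their phone.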
Log nothing
Log to /dev/null
Log to ephemeral storage
Measuring the wrong things: hosts are irrelevant, and so are CPU and memory.
Mention multi-AZ.
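To illustrate measuring the service rather than its hosts, the sketch below computes a single error ratio across all hosts and compares it with a threshold. The counters, host names, and the 1% threshold are made-up assumptions for the example, not figures from the talk.

```python
from typing import Dict, Tuple

# Hypothetical per-host counters for one time window: host -> (requests, errors).
window: Dict[str, Tuple[int, int]] = {
    "host-a": (1000, 5),
    "host-b": (1000, 7),
    "host-c": (0, 0),  # a dead host; its traffic moved to the others
}


def error_ratio(counters: Dict[str, Tuple[int, int]]) -> float:
    """Fraction of failed requests across the whole service, regardless of host."""
    total = sum(requests for requests, _ in counters.values())
    failed = sum(errors for _, errors in counters.values())
    # No traffic is treated as "no errors" here; a real check would
    # probably want to alert on missing traffic as well.
    return failed / total if total else 0.0


THRESHOLD = 0.01  # assumed: 1% of requests failing is visible to users

ratio = error_ratio(window)
print(ratio > THRESHOLD)
```

In this example one host is down, yet the service-level ratio stays under the threshold, so nobody is paged for the lost host.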
Tweet at me! Tell me your monitoring horror stories!