16. The Four Golden Signals
@phennex
Source: Site Reliability Engineering, How Google Runs Production Systems
17. What to monitor?
Latency
The time it takes to service a request.
It is important to distinguish between the latency of
successful requests and the latency of failed requests.
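To illustrate why the two latencies should be kept apart, here is a minimal sketch in Python. The `LatencyTracker` class and the rule "any 5xx status is a failure" are assumptions made for this example, not something taken from the SRE book.

```python
from collections import defaultdict
from statistics import mean


class LatencyTracker:
    """Keeps request durations in separate buckets per outcome (hypothetical helper)."""

    def __init__(self):
        self._durations_ms = defaultdict(list)  # outcome -> list of milliseconds

    def observe(self, duration_ms, status_code):
        # Assumption for this sketch: any 5xx response counts as a failed request.
        outcome = "failure" if status_code >= 500 else "success"
        self._durations_ms[outcome].append(duration_ms)

    def mean_latency(self, outcome):
        samples = self._durations_ms.get(outcome)
        return mean(samples) if samples else None

    def mean_latency_combined(self):
        samples = [d for bucket in self._durations_ms.values() for d in bucket]
        return mean(samples) if samples else None


tracker = LatencyTracker()
tracker.observe(120, 200)
tracker.observe(80, 200)
tracker.observe(2, 500)  # a failure that returns very quickly
```

Here the successful requests average 100 ms, while the single fast failure pulls the combined average down to roughly 67 ms, which would make the service look faster than it is for the requests that actually worked.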
18. What to monitor?
Traffic
A measure of how much demand is being placed on your system,
measured in a high-level system-specific metric.
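For a web service, one plausible high-level metric is HTTP requests per second. The sketch below derives it from two readings of an ever-increasing request counter; the function name and the sample numbers are made up for illustration.

```python
def request_rate(count_start, count_end, window_seconds):
    """Average requests per second between two readings of an increasing counter."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) / window_seconds


# Hypothetical readings taken 60 seconds apart.
traffic_rps = request_rate(count_start=1_200, count_end=1_500, window_seconds=60)
```

This simple version assumes the counter did not reset between the two readings.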
19. What to monitor?
Errors
The rate of requests that fail, either explicitly (e.g. HTTP 500s)
or implicitly (e.g. an HTTP 200 success response with the wrong content).
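The difference between explicit and implicit failures can be sketched as follows. The content check (looking for an expected marker string in the body) is only an assumed stand-in for whatever "wrong content" means in a real service.

```python
def is_failure(status_code, body, expected_marker):
    """Explicit failure: a 5xx status. Implicit failure: the body lacks the expected content."""
    if status_code >= 500:
        return True
    return expected_marker not in body


def error_rate(responses, expected_marker):
    """Fraction of (status_code, body) responses that count as failed."""
    if not responses:
        return 0.0
    failures = sum(is_failure(code, body, expected_marker) for code, body in responses)
    return failures / len(responses)


# Hypothetical responses from a /messages endpoint.
sample_responses = [
    (200, '{"messages": []}'),
    (500, "internal error"),            # explicit failure
    (200, "<html>login page</html>"),   # implicit failure: 200 with the wrong content
    (200, '{"messages": ["hi"]}'),
]
observed_error_rate = error_rate(sample_responses, expected_marker='"messages"')
```

Counting only status codes would report one failure in four here; including the content check reports two.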
20. What to monitor?
Saturation
How “full” your service is. A measure of the fraction of system capacity in use,
emphasizing the resources that are most constrained
(e.g. in a memory-constrained system, show memory).
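One way to read "emphasizing the most constrained resource" is to compute the used fraction of each resource and report the highest one. The resource names and numbers below are hypothetical.

```python
def saturation(usage, capacity):
    """Return (resource, fraction_used) for the most constrained resource."""
    fractions = {name: usage[name] / capacity[name] for name in capacity}
    tightest = max(fractions, key=fractions.get)
    return tightest, fractions[tightest]


GIB = 2**30
# Hypothetical readings for a single host.
usage = {"memory_bytes": 7 * GIB, "cpu_cores": 1.5, "disk_bytes": 40 * GIB}
capacity = {"memory_bytes": 8 * GIB, "cpu_cores": 4, "disk_bytes": 100 * GIB}

resource, fraction = saturation(usage, capacity)
```

With these numbers the host is memory-constrained (87.5% used), so memory is the saturation signal to show.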
29. The Data model
Notation:
<metric name>{<label name>=<label value>, …}
Example:
api_http_requests_total{method="POST", handler="/messages"}
Every time series is uniquely identified by its metric name and a set of
key-value pairs, also known as labels.
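The identity rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus stores series internally; `series_id` and `render` are hypothetical helpers, and `render` does not escape special characters in label values.

```python
def series_id(metric_name, labels):
    """Hashable identity: the metric name plus an order-independent set of labels."""
    return (metric_name, frozenset(labels.items()))


def render(metric_name, labels):
    """Format a series in the notation shown on the slide (no escaping, for brevity)."""
    pairs = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric_name}{{{pairs}}}"


post_a = series_id("api_http_requests_total", {"method": "POST", "handler": "/messages"})
post_b = series_id("api_http_requests_total", {"handler": "/messages", "method": "POST"})
get_a = series_id("api_http_requests_total", {"method": "GET", "handler": "/messages"})

rendered = render("api_http_requests_total", {"method": "POST", "handler": "/messages"})
```

The first two identifiers are equal because label order does not matter, while changing a single label value (`POST` to `GET`) yields a different time series.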
30. How to get metrics?
- Directly instrumented
- Not directly instrumented: via an exporter
Source: https://promcon.io/2016-berlin/talks/so-you-want-to-write-an-exporter/
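The core job of an exporter is to translate a system's native statistics into the text format that Prometheus scrapes. The sketch below shows only that translation step, under simplifying assumptions: every stat is treated as a gauge, the `myapp` prefix and the sample stats are invented, and the HTTP endpoint that would serve the result is left out. A real exporter would normally rely on an official Prometheus client library instead.

```python
def render_exposition(stats, prefix="myapp"):
    """Render a flat dict of numeric stats as Prometheus text-format gauges."""
    lines = []
    for key in sorted(stats):
        name = f"{prefix}_{key}"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {stats[key]}")
    return "\n".join(lines) + "\n"


# Hypothetical stats, as if read from a third-party system's status page.
native_stats = {"queue_depth": 12, "open_connections": 3}
exposition_text = render_exposition(native_stats)
```

For the sample stats this produces one `# TYPE` line and one sample line per metric, sorted by name.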
50. Hope is NOT a strategy
Source: Site Reliability Engineering, How Google Runs Production Systems (2016), B. Beyer et al.
51. If you wanna know more…
- prometheus.io
- promcon.io
- The Site Reliability Engineering book
- Podcasts:
  - https://dev.to/sedaily/prometheus-monitoring-with-brian-brazil
  - https://dev.to/sedaily/the-art-of-monitoring-with-james-turnbull
    (prefers a push-based approach, the opposite of Prometheus, which pulls)
  - https://dev.to/sedaily/prometheus-with-julius-volz