1. The document discusses observability in AWS and introduces a tool called SLIC Watch that automates the configuration of CloudWatch alarms and dashboards for serverless applications.
2. SLIC Watch generates CloudFormation templates that set up application-specific dashboards and alarms using metrics from services like Lambda, DynamoDB, and API Gateway to help operators more quickly identify issues.
3. The document provides an example of how SLIC Watch could help diagnose issues like DynamoDB throttling and Lambda timeouts by automating the creation of relevant metrics and alarms without requiring manual configuration of CloudWatch.
2. Hi! I'm Eoin
CTO
aiasaservicebook.com
@eoins
eoins
Get in touch
3. Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let's connect:
loige.co
@loige
loige
lucianomammino
4. We are business focused technologists
that deliver.
Accelerated Serverless | AI as a Service | Platform Modernisation
We are hiring! Let's have a chat
7. Observability in the cloud
a measure of how well internal states of a
system can be inferred from knowledge of its
external outputs
Structured Logs | Tracing | Metrics | Alarms
8. A typical case study
- Serverless app
  - Distributed system (100s of components)
- HTTP APIs using
  - Lambda
  - DynamoDB
  - API Gateway
  - Cognito
- Multiple services / stacks
- Using SLIC Starter (fth.link/slic)
173 resources!
9. A typical case study
The goal: know about problems before users do
How?
- Structured Logs
- Metrics
- Alarms
- Dashboards
- Traces (X-Ray)
10. Can we test our observability?
- We run a stress test
  - Simulate traffic using the integration test
  - Run the test a number of times in parallel (in a loop)
  - Exercises all the APIs with typical use cases (login, CRUD operations, etc.)
- After 10-15 minutes, we started to get alarms...
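The "in parallel, in a loop" idea can be sketched in a few lines. This is only an illustration, not the harness used in the talk: `runOnce` is a hypothetical stand-in for one full pass of the integration test, and `parallel` and `iterations` are made-up parameter names.

```typescript
// Hypothetical stand-in for one full pass of the integration test suite.
type TestRun = () => Promise<void>;

interface StressResult {
  passed: number;
  failed: number;
}

// Start `parallel` workers at once; each worker repeats the test
// `iterations` times in a loop, one run after the other.
async function runStressTest(
  runOnce: TestRun,
  parallel: number,
  iterations: number
): Promise<StressResult> {
  const result: StressResult = { passed: 0, failed: 0 };

  const worker = async (): Promise<void> => {
    for (let i = 0; i < iterations; i++) {
      try {
        await runOnce();
        result.passed++;
      } catch (_err) {
        // A failed run is counted, not rethrown, so the load keeps going.
        result.failed++;
      }
    }
  };

  // Wait until every worker has finished its loop.
  await Promise.all(Array.from({ length: parallel }, () => worker()));
  return result;
}
```

Calling `runStressTest(runOnce, 10, 50)` would produce 500 test runs with at most 10 in flight at any time; the right numbers depend on the application and on how much load is needed to trigger alarms.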
13. Initial Hypothesis
We got throttled (DynamoDB write throttle events)
  -> causing AWS SDK retries (in the Lambda function)
  -> causing Lambda timeouts
  -> causing API Gateway 502
How do we validate this?
1. Check the timeout cause -> Lambda metrics/logs
2. Check the Lambda error cause -> Lambda logs
3. Identify the source of 5xx errors in API Gateway -> X-Ray
4. Check the DynamoDB metrics -> Dashboards
15. Checking timeouts
- Check Lambda timeouts
  - Duration metrics (aggregated data)
  - Logs (individual requests)
- Logs Insights gives us the duration of each individual request. We can use this to isolate the logs for just that request.
- We use stats to see how many executions are affected.
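The three Logs Insights steps above could look roughly like the queries built below. This is a sketch, not the exact queries from the talk: the function names are invented for illustration, and the queries assume standard Lambda log groups, where `REPORT` lines expose `@requestId` and `@duration` and a timeout is logged with the text "Task timed out".

```typescript
// Query 1: list the slowest invocations, one row per request.
function slowestInvocationsQuery(limit: number): string {
  return [
    "fields @timestamp, @requestId, @duration",
    'filter @type = "REPORT"',
    "sort @duration desc",
    `limit ${limit}`,
  ].join("\n| ");
}

// Query 2: isolate every log line that mentions one request ID.
function logsForRequestQuery(requestId: string): string {
  return [
    "fields @timestamp, @message",
    `filter @message like /${requestId}/`,
    "sort @timestamp asc",
  ].join("\n| ");
}

// Query 3: use stats to count timed-out executions per time bucket.
function timeoutCountQuery(bucket: string): string {
  return [
    "filter @message like /Task timed out/",
    `stats count(*) as timeouts by bin(${bucket})`,
  ].join("\n| ");
}
```

The returned strings can be pasted into the Logs Insights console; for example, `timeoutCountQuery("5m")` groups timeouts into five-minute buckets.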
21. Conclusions
# | Symptom | Problem | Resolution
1 | DynamoDB throttles | Table with low provisioned WCUs (write capacity) | Switch table to PAY_PER_REQUEST. Add throttling in API Gateway to limit potential cost impact
2 | API 502 errors, Lambda timeouts | Throttles caused DynamoDB retries with exponential backoff, up to 50 seconds of retry | Change maxRetries to 3 (350ms max retry)
3 | API 500 errors | Attempt to update a missing record: a problem with the integration test! | Fix the integration test to ensure deletion occurs after other actions complete. Also improved the API design
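The "50 seconds" and "350ms" figures follow from simple exponential backoff arithmetic. The sketch below assumes a base delay of 50 ms that doubles on every retry and a default of 10 retries, which is our understanding of the AWS SDK for JavaScript v2 defaults for DynamoDB; the function name is made up, and a real SDK adds jitter, so actual delays are usually shorter than this upper bound.

```typescript
// Assumption: the delay before retry i (0-based) is at most baseDelayMs * 2^i.
// The result is the worst-case total time spent waiting between retries.
function maxTotalRetryDelayMs(maxRetries: number, baseDelayMs: number = 50): number {
  let total = 0;
  for (let i = 0; i < maxRetries; i++) {
    total += baseDelayMs * Math.pow(2, i);
  }
  return total;
}

// 50 + 100 + 200 = 350 ms
const withThreeRetries = maxTotalRetryDelayMs(3);

// 50 * (2^10 - 1) = 51150 ms, roughly 50 seconds
const withTenRetries = maxTotalRetryDelayMs(10);
```

With ten retries, the worst-case wait is far longer than a typical Lambda timeout, which is consistent with the throttles surfacing as timeouts and then as API Gateway 502 errors.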
23. What we have learned so far
- We were able to identify, understand and fix these errors quite quickly
- We didn't have to change the code to do that
- Nor did we run it locally with a debugger
- All of this was possible because we configured observability tools in AWS in advance
27. Getting the best out of CloudWatch
CloudWatch can be your friend if you...
- Research and understand available metrics
- Decide thresholds
- Write IaC for application dashboards
- Write IaC for service metric alarms
  -> Update every time your application changes
- Copy and paste for each stack in your application
(a.k.a. A LOT OF WORK!)
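To give a feel for the repetition involved, here is a minimal sketch that generates one CloudFormation alarm per Lambda function. It is not how SLIC Watch is implemented: the function name, the logical ID scheme and the threshold values (any error within one minute) are illustrative assumptions, while the property names are those of the `AWS::CloudWatch::Alarm` resource type.

```typescript
interface AlarmResource {
  Type: "AWS::CloudWatch::Alarm";
  Properties: {
    AlarmName: string;
    Namespace: string;
    MetricName: string;
    Dimensions: { Name: string; Value: string }[];
    Statistic: string;
    Period: number;
    EvaluationPeriods: number;
    Threshold: number;
    ComparisonOperator: string;
    AlarmActions: string[];
  };
}

// Build one "Errors" alarm per function, keyed by CloudFormation logical ID.
function lambdaErrorAlarms(
  functionNames: string[],
  topicArn: string
): Record<string, AlarmResource> {
  const resources: Record<string, AlarmResource> = {};
  for (const name of functionNames) {
    // Logical IDs must be alphanumeric, so strip everything else.
    const logicalId = `${name.replace(/[^A-Za-z0-9]/g, "")}ErrorsAlarm`;
    resources[logicalId] = {
      Type: "AWS::CloudWatch::Alarm",
      Properties: {
        AlarmName: `${name}-errors`,
        Namespace: "AWS/Lambda",
        MetricName: "Errors",
        Dimensions: [{ Name: "FunctionName", Value: name }],
        Statistic: "Sum",
        Period: 60, // seconds (illustrative value)
        EvaluationPeriods: 1,
        Threshold: 0, // alarm on any error (illustrative value)
        ComparisonOperator: "GreaterThanThreshold",
        AlarmActions: [topicArn],
      },
    };
  }
  return resources;
}
```

Even this small sketch covers a single metric for a single service; repeating it for throttles, duration, DynamoDB and API Gateway, and keeping it in sync with every stack, is where the effort adds up.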
28. Best practices
- AWS Well-Architected Framework
- 5 Pillars
  - Operational excellence pillar covers observability
- Serverless lens applies these pillars
- Good guidance on metrics to observe
- More reading and research + you still have to pick thresholds
39. Configuration
- SLIC Watch comes with sane defaults
- You can configure what you don't like
- Or disable specific dashboards or alarms
40. How to get started
- Create an SNS Topic as the alarm destination (optional)
- $ npm install serverless-slic-watch-plugin --save-dev
- Update serverless.yml
- Configure (optional)
- $ sls deploy
plugins:
  - serverless-slic-watch-plugin
Check out the complete example project in the repo!
41. Wrapping up
- If your services are failing, you definitely want to know about it!
- Observability can save you from hundreds of hours of blind debugging!
- CloudWatch is the go-to tool in AWS, but you have to configure it!
- Automation can take most of the configuration pain away
- SLIC Watch can give you this automation
- You still have control and flexibility
Try it out! Give feedback! Let's make it better!
fth.link/slic-watch