AWS Observability made simple
Eóin Shanaghy - Luciano Mammino
AWS Community Day - November 11th 2021
fth.link/o11y-simple
Hi! I’m Eoin 🙂
CTO
aiasaservicebook.com
@eoins
eoins
✉ Get in touch
👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino
We are business focused technologists
that deliver.
Accelerated Serverless | AI as a Service | Platform Modernisation
We are hiring! Let’s have a chat 🙂
Check out our new Podcast!
awsbites.com
Observability in the cloud
“a measure of how well internal states of a system can be inferred from knowledge of its external outputs”
🪵 Structured Logs 🔍 Tracing 📈 Metrics 🚨 Alarms
A typical case study
⚡ Serverless app
● Distributed system (100s of components)
🔌 HTTP APIs using
● Lambda
● DynamoDB
● API Gateway
● Cognito
🧱 Multiple services / stacks
🏁 Using SLIC Starter (fth.link/slic)
173 resources!
A typical case study
⚽ The goal: know about problems before users do
How?
📝 Structured Logs
📐 Metrics
🔔 Alarms
📊 Dashboards
🗺 Traces (X-Ray)
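As a minimal sketch of what "structured logs" can mean in practice, the snippet below writes one JSON object per log line so that individual fields can be filtered later (for example in CloudWatch Logs Insights). The helper name `log_event` and the field names (`level`, `message`, `request_id`, `table`) are illustrative assumptions, not something taken from the talk.

```python
import json
import sys
import time


def log_event(level, message, stream=sys.stdout, **fields):
    """Write one JSON object per line and return the serialized line."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "message": message,
        **fields,  # arbitrary extra context, e.g. request_id
    }
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    return line


if __name__ == "__main__":
    # Hypothetical example values
    log_event("ERROR", "write throttled", request_id="req-123", table="items")
```

Because every line is valid JSON, a log query can filter on `level` or `request_id` directly instead of pattern-matching free text.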
Can we test our observability?
We run a stress test
○ Simulate traffic using the integration test
○ Run the test a number of times in parallel (in a loop)
○ Exercise all the APIs with typical use cases (login, CRUD operations, etc.)
🚨 After 10-15 minutes, we started to get alarms...
🚨 Alerts flow!
Making sense of alerts
Initial Hypothesis
🛑 We got throttled (DynamoDB write throttle events)
↪ 🔁 causing AWS SDK retries (in the Lambda function)
↪ ⏱ causing Lambda timeouts
↪ 👎 causing API Gateway 502
🧪 How do we validate this?
1. Check the timeout cause ➡ Lambda metrics/logs
2. Check the Lambda error cause ➡ Lambda logs
3. Identify the source of 5xx errors in API Gateway ➡ X-Ray
4. Check the DynamoDB metrics ➡ Dashboards
Gathering evidence
Checking timeouts
● Check Lambda timeouts
○ Duration metrics (aggregated data)
○ Logs (individual requests)
● Logs Insights gives us the duration of each individual request. We can use this to isolate the logs for just that request.
● We use stats to see how many executions are affected.
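The checks above can be sketched locally. The query string is an illustrative Logs Insights query (an assumption, not the one used in the talk), and the Python below does the equivalent on a few sample lines. It assumes the usual Lambda `REPORT RequestId: ... Duration: ... ms` and `Task timed out after ... seconds` log lines. The sample data is made up.

```python
import re

# Illustrative Logs Insights query: count timed-out invocations per 5 minutes.
TIMEOUT_QUERY = """
filter @message like /Task timed out/
| stats count(*) as timeouts by bin(5m)
"""

REPORT_RE = re.compile(r"REPORT RequestId: (\S+)\s+Duration: ([\d.]+) ms")
TIMEOUT_RE = re.compile(r"(\S+) Task timed out after ([\d.]+) seconds")


def summarize(lines):
    """Return ({request_id: duration_ms}, {timed-out request_ids})."""
    durations, timed_out = {}, set()
    for line in lines:
        if (m := REPORT_RE.search(line)):
            durations[m.group(1)] = float(m.group(2))
        elif (m := TIMEOUT_RE.search(line)):
            timed_out.add(m.group(1))
    return durations, timed_out


# Made-up sample log lines (request IDs are hypothetical)
SAMPLE = [
    "REPORT RequestId: aaa-1\tDuration: 120.50 ms\tBilled Duration: 121 ms",
    "2021-11-11T10:00:06.000Z bbb-2 Task timed out after 6.01 seconds",
    "REPORT RequestId: bbb-2\tDuration: 6006.00 ms\tBilled Duration: 6000 ms",
]

durations, timed_out = summarize(SAMPLE)
affected = len(timed_out) / len(durations)  # share of executions that timed out
```

The per-request durations correspond to the "individual requests" view, and `affected` plays the role of the `stats` aggregation.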
Inspecting DynamoDB Capacity
Tracing errors
HTTP 502
HTTP 500 (UNEXPECTED! 😱)
Lambda CloudWatch Logs
Conclusions
1. 🌡 Symptom: DynamoDB throttles
   🐞 Problem: Table with low provisioned WCUs (write capacity)
   Resolution: Switch table to PAY_PER_REQUEST. Add throttling in API Gateway to limit potential cost impact.
2. 🌡 Symptom: API 502 Errors, Lambda Timeouts
   🐞 Problem: Throttles caused DynamoDB retries with exponential backoff, up to 50 seconds of retry
   Resolution: Change maxRetries to 3 (350ms max retry)
3. 🌡 Symptom: API 500 Errors
   🐞 Problem: Attempt to update a missing record (a problem with the integration test!)
   Resolution: Fix the integration test to ensure deletion occurs after other actions complete. Also improved the API design.
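The retry figures can be reproduced with a little arithmetic. This is a sketch under assumptions not stated on the slide: a 50 ms base delay, a default of 10 retries, and a delay that doubles on each retry, with jitter ignored so the result is the worst case. The function name is hypothetical.

```python
def worst_case_backoff_ms(max_retries, base_ms=50):
    """Sum of un-jittered exponential delays: base_ms * 2**i for retry i."""
    return sum(base_ms * 2 ** attempt for attempt in range(max_retries))


default_total = worst_case_backoff_ms(10)  # assumed default of 10 retries
reduced_total = worst_case_backoff_ms(3)   # maxRetries = 3
```

Under these assumptions the totals are 51,150 ms and 350 ms, in line with the "up to 50 seconds" and "350ms max retry" figures above.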
Before and after
What we have learned so far
● We were able to identify, understand and fix these errors quite quickly
● We didn’t have to change the code to do that
● Nor did we run it locally with a debugger
● All of this was possible because we configured observability tools in AWS in advance
AWS native o11y = CloudWatch
CloudWatch gives you:
➔ Logs with Insights
➔ Metrics
➔ Dashboards
➔ Alarms
➔ Canaries
➔ Distributed tracing (with X-Ray)
Alternatives outside AWS
Established
New entrants
Roll your own (only for the brave)
CloudWatch out of the box
😍 A toolkit you can use to build observability
🤩 Metrics are automatically generated for all services!
😟 Lots of dashboards, but by service and not by application!
😱 Zero alarms out of the box!
Getting the best out of CloudWatch
CloudWatch can be your friend if you...
📚 Research and understand available metrics
📐 Decide thresholds
📊 Write IaC for application dashboards
⏰ Write IaC for service metric alarms
⏪ Update every time your application changes
📋 Copy and paste for each stack in your application
(a.k.a. A LOT OF WORK!)
Best practices
😇 AWS Well Architected Framework
🏛 5 Pillars
⚙ Operational excellence pillar covers observability
🧐 Serverless lens applies these pillars
👍 Good guidance on metrics to observe
👎 More reading and research + you still have to pick thresholds
CloudFormation for CloudWatch Alarms 😬
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
  "ActionsEnabled": true,
  "AlarmActions": [
    "arn:aws:sns:eu-west-1:665863320777:FTSLICAlarms"
  ],
  "AlarmName": "LambdaThrottles_serverless-test-project-dev-hello",
  "AlarmDescription": "Throttles % for serverless-test-project-dev-hello ..",
  "EvaluationPeriods": 1,
  "ComparisonOperator": "GreaterThanThreshold",
  "Threshold": 0,
  "TreatMissingData": "notBreaching",
  "Metrics": [
    {
      "Id": "throttles_pc",
      "Expression": "(throttles / (throttles + invocations)) * 100",
      "Label": "% Throttles",
      "ReturnData": true
    },
    {
      "Id": "throttles",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Throttles",
          "Dimensions": [
            {
              "Name": "FunctionName",
              "Value": "serverless-test-project-dev-hello"
            }
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
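The throttle percentage in the alarm's metric math depends on how the expression is parenthesized. In this small sketch, plain Python arithmetic stands in for CloudWatch metric math, on the assumption that both follow the usual operator precedence. It compares dividing before adding with grouping the denominator first. The function names and sample numbers are hypothetical.

```python
def pct_division_first(throttles, invocations):
    # throttles / throttles is evaluated first, giving (1 + invocations) * 100
    return (throttles / throttles + invocations) * 100


def pct_grouped_denominator(throttles, invocations):
    # Throttles as a share of all attempts (throttled + invoked)
    return (throttles / (throttles + invocations)) * 100


# Made-up sample: 25 throttled requests and 75 successful invocations
division_first = pct_division_first(25, 75)
grouped = pct_grouped_denominator(25, 75)
```

With these sample numbers the first form yields 7600.0 and the second 25.0. Only the grouped form is a percentage of attempted invocations.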
Can we automate this?
Magically generated alarms and dashboards for each application!
fth.link/slic-watch
Introducing
SLIC Watch
How SLIC Watch works 🛠
Your app (serverless.yml)
➡ sls deploy
➡ CloudFormation stack (very-big.json)
➡ SLIC Watch 👀 🛠
➡ CloudFormation stack ++ (even-bigger.json)
➡ Deploy ☁ 📊📈
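The idea of turning a CloudFormation stack into a "stack ++" can be sketched as a function that scans a template for Lambda functions and adds one alarm per function. This is not SLIC Watch's actual implementation. The function name, the choice of the `Errors` metric, the logical IDs and the SNS topic ARN are all assumptions made for illustration.

```python
import copy


def add_lambda_error_alarms(template, topic_arn, threshold=0):
    """Return a copy of a CloudFormation template (as a dict) with one
    Errors alarm added for each AWS::Lambda::Function resource."""
    result = copy.deepcopy(template)  # leave the caller's template untouched
    resources = result.setdefault("Resources", {})
    for logical_id, resource in list(resources.items()):
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        resources[f"{logical_id}ErrorsAlarm"] = {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmActions": [topic_arn],
                "Namespace": "AWS/Lambda",
                "MetricName": "Errors",
                "Dimensions": [
                    {"Name": "FunctionName", "Value": {"Ref": logical_id}}
                ],
                "Statistic": "Sum",
                "Period": 60,
                "EvaluationPeriods": 1,
                "ComparisonOperator": "GreaterThanThreshold",
                "Threshold": threshold,
                "TreatMissingData": "notBreaching",
            },
        }
    return result


# Hypothetical input template and topic ARN
template = {
    "Resources": {
        "HelloLambdaFunction": {"Type": "AWS::Lambda::Function", "Properties": {}},
        "ItemsTable": {"Type": "AWS::DynamoDB::Table", "Properties": {}},
    }
}
augmented = add_lambda_error_alarms(
    template, "arn:aws:sns:eu-west-1:123456789012:example-alarms"
)
```

The input template is left unchanged, and the returned copy contains the original resources plus a `HelloLambdaFunctionErrorsAlarm` resource.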
Before SLIC Watch
After SLIC Watch
Check out SLIC Slack
Configuration
🎀 SLIC Watch comes with sane defaults
📝 You can configure what you don’t like
🔌 Or disable specific dashboards or alarms
How to get started
📣 Create an SNS Topic as the alarm destination (optional)
📦 ❯ npm install serverless-slic-watch-plugin --save-dev
✍ Update serverless.yml:
    plugins:
      - serverless-slic-watch-plugin
⚙ Configure (optional)
🚢 ❯ sls deploy
💡 Check out the complete example project in the repo!
Wrapping up 🎁
★ If your services are failing, you definitely want to know about it!
★ Observability can save you from hundreds of hours of blind debugging!
★ CloudWatch is the go-to tool in AWS, but you have to configure it!
★ Automation can take most of the configuration pain away
★ SLIC Watch can give you this automation
★ You still have control and flexibility
🔬 Try it out! 🗣 Give feedback! 🌈 Let’s make it better!
fth.link/slic-watch
Thank you!
fth.link/o11y-simple
Cover picture by Markus Spiske on Unsplash
