Predicting Production Outages with Micro-Metrics

1) The document discusses how monitoring micro-metrics, such as those derived from garbage collection logs and thread dumps, can help predict production outages in applications. It gives examples of specific micro-metrics that can predict memory leaks, backend slowdowns, CPU spikes, and poor response times.
2) The document also describes yCrash, a tool that captures micro-metrics from applications every 3 minutes, applies machine learning and pattern matching to them, and triggers a full 360° data capture if a problem is forecast. It points to an open-source script that collects system and application artifacts for troubleshooting.
3) Real-world case studies show how micro-metrics helped predict and solve issues at major financial, trading, and travel companies to prevent production …

Predicting Production Outages:
Unleashing the Power of Micro-Metrics
Ram Lakshmanan
Architect, yCrash

Memory of Healthy Application
- Full Garbage Collection Event

Acute Memory Leak
- Full Garbage Collection Event

Memory Leak
- Full Garbage Collection Event

Micro-metric: GC Throughput
Source: Garbage Collection Log

What is GC Throughput?
The percentage of time the application spends processing customer
transactions
vs
the time it spends on garbage
collection activity
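
The definition above can be turned into a small calculation. The sketch below is illustrative only: the class name `GcThroughput`, the method `percent`, and the sample numbers are assumptions, not something taken from the deck. It assumes the GC pause durations for an observation window have already been extracted from the GC log.

```java
import java.util.List;

/** Minimal sketch: GC throughput as a percentage of an observation window. */
final class GcThroughput {

    /**
     * @param elapsedMillis length of the observation window in milliseconds
     * @param pauseMillis   GC pause durations observed inside that window
     * @return percentage of the window that was NOT spent in GC pauses
     */
    static double percent(double elapsedMillis, List<Double> pauseMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        double gcMillis = 0;
        for (double p : pauseMillis) {
            gcMillis += p;
        }
        return 100.0 * (elapsedMillis - gcMillis) / elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 60-second window containing three pauses.
        double t = percent(60_000, List.of(300.0, 450.0, 750.0));
        System.out.printf("GC throughput: %.2f%%%n", t);
    }
}
```

A value that keeps dropping from one window to the next would be the kind of signal the deck associates with a developing memory problem.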

360° Troubleshooting artifacts
1. GC Log
2. Thread Dump
3. Heap Dump
4. Heap Substitute
5. top
6. ps
7. top -H
8. Disk Usage
9. dmesg
10. netstat
11. ping
12. vmstat
13. iostat
14. Kernel Params
15. App Logs
16. metadata
Open-source script:
https://github.com/ycrash/yc-data-script

Application Architecture
(Diagram labels: HTTP(S) request, Application Server, Server Thread Pool, JDBC, SOAP, MainFrame, REST)

Micro-metric: Threads with identical Stack trace
Source: Thread Dump
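
One way to derive this micro-metric is to group the threads in a textual thread dump by their stack frames and watch the size of the largest group. The following is a minimal sketch, not the deck's implementation: the class `IdenticalStacks` and its methods are hypothetical, and it assumes the usual HotSpot dump layout (a quoted thread name on the header line, frames starting with "at", blank lines between threads).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: group threads in a textual thread dump by identical stack frames. */
final class IdenticalStacks {

    /** Returns stack trace (joined "at ..." frames) -> names of the threads sharing it. */
    static Map<String, List<String>> group(String threadDump) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String name = null;
        StringBuilder frames = new StringBuilder();
        for (String raw : threadDump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {            // header line: "thread-name" #12 ...
                flush(groups, name, frames);
                int end = line.indexOf('"', 1);
                name = end > 0 ? line.substring(1, end) : line;
            } else if (line.startsWith("at ")) {    // stack frame
                frames.append(line).append('\n');
            } else if (line.isEmpty()) {            // blank line ends a thread block
                flush(groups, name, frames);
                name = null;
            }
        }
        flush(groups, name, frames);                // last block may lack a trailing blank line
        return groups;
    }

    private static void flush(Map<String, List<String>> groups, String name, StringBuilder frames) {
        if (name != null && frames.length() > 0) {
            groups.computeIfAbsent(frames.toString(), k -> new ArrayList<>()).add(name);
        }
        frames.setLength(0);
    }

    /** Size of the largest group of threads that share one stack trace. */
    static int largestGroup(Map<String, List<String>> groups) {
        int max = 0;
        for (List<String> names : groups.values()) {
            max = Math.max(max, names.size());
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical two-thread dump; both threads sit in the same frame.
        String dump = String.join("\n",
            "\"http-1\" #21 prio=5 tid=0x01 nid=0x11 runnable",
            "\tat com.example.Dao.query(Dao.java:42)",
            "",
            "\"http-2\" #22 prio=5 tid=0x02 nid=0x12 runnable",
            "\tat com.example.Dao.query(Dao.java:42)");
        System.out.println("Largest group: " + largestGroup(group(dump)));
    }
}
```

Lock addresses and thread-state lines are deliberately ignored, so threads stuck in the same code path are grouped together even when they wait on different objects.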

Case Study
Backend Slowdown in a Major Financial Institution in N. America

We all might have used 'top'
Secret option:
top -H -p <PROCESS_ID>

Micro-metric: Thread Level CPU consumption
Source: top -H -p <PROCESS_ID>
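
To connect this output to a thread dump, the decimal thread id printed by `top -H` has to be converted to hex, because HotSpot thread dumps print the native id as `nid=0x...`. The sketch below shows that step under stated assumptions: the class `ThreadCpu` is hypothetical, and the parser assumes top's default column layout (PID first, %CPU in the ninth column), which can differ if top has been customized.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: pick hot threads out of `top -H` output and map them to thread-dump nids. */
final class ThreadCpu {

    /**
     * @param topOutput        text captured from e.g. `top -H -b -n 1 -p <PROCESS_ID>`
     * @param thresholdPercent minimum %CPU for a thread to be reported
     * @return hex native id (as printed after "nid=" in a thread dump) -> %CPU
     */
    static Map<String, Double> hotThreads(String topOutput, double thresholdPercent) {
        Map<String, Double> hot = new LinkedHashMap<>();
        for (String line : topOutput.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            // Assumed layout: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
            if (cols.length < 12 || !cols[0].matches("\\d+")) {
                continue; // summary, header, or blank line
            }
            double cpu;
            try {
                cpu = Double.parseDouble(cols[8]);
            } catch (NumberFormatException e) {
                continue; // unexpected column content, skip the row
            }
            if (cpu >= thresholdPercent) {
                hot.put("0x" + Long.toHexString(Long.parseLong(cols[0])), cpu);
            }
        }
        return hot;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows.
        String sample = String.join("\n",
            "  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
            " 4321 app 20 0 4.5g 1.2g 20m R 99.3 7.5 12:34.56 java",
            " 4322 app 20 0 4.5g 1.2g 20m S 0.3 7.5 0:01.02 java");
        System.out.println(hotThreads(sample, 80.0));
    }
}
```

Searching the thread dump for the returned `nid` value then shows the stack trace of the thread that is consuming the CPU.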

Case Study
Major Trading app in N. America
https://blog.fastthread.io/2020/04/23/troubleshooting-cpu-spike-in-a-major-trading-application/

Concurrency Problem
public synchronized void getData() {
    doSomething();
}
(Diagram: Thread 1 and Thread 2 call getData(); while one thread holds the lock, the other is BLOCKED.)
BLOCKED THREADS

Micro-metric: BLOCKED state threads
Source: Thread Dump
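
A rough way to compute this micro-metric from a textual thread dump is to count the threads whose state line reads BLOCKED and to tally which lock they are waiting for. This is a sketch under assumptions: the class `BlockedThreads` is hypothetical, and it relies on the HotSpot dump lines `java.lang.Thread.State: BLOCKED` and `- waiting to lock <0x...>`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count BLOCKED threads and the locks they wait on in a thread dump. */
final class BlockedThreads {

    private static final Pattern WAITING_TO_LOCK =
        Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");

    /** Number of threads whose state line reports BLOCKED. */
    static int countBlocked(String threadDump) {
        int n = 0;
        for (String line : threadDump.split("\n")) {
            if (line.trim().startsWith("java.lang.Thread.State: BLOCKED")) {
                n++;
            }
        }
        return n;
    }

    /** Lock address -> number of threads waiting to acquire it. */
    static Map<String, Integer> waitersPerLock(String threadDump) {
        Map<String, Integer> waiters = new LinkedHashMap<>();
        Matcher m = WAITING_TO_LOCK.matcher(threadDump);
        while (m.find()) {
            waiters.merge(m.group(1), 1, Integer::sum);
        }
        return waiters;
    }

    public static void main(String[] args) {
        // Hypothetical one-thread dump.
        String dump = String.join("\n",
            "\"worker-1\" #31 prio=5 tid=0x01 nid=0x21 waiting for monitor entry",
            "   java.lang.Thread.State: BLOCKED (on object monitor)",
            "\tat com.example.Service.getData(Service.java:10)",
            "\t- waiting to lock <0x000000076ab62208> (a com.example.Service)");
        System.out.println(countBlocked(dump) + " blocked, locks: " + waitersPerLock(dump));
    }
}
```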

Case Study
Major Leisure Travel Service Provider
https://blog.ycrash.io/2022/03/09/java-uuid-generation-performance-impact/

What is Garbage?
HTTP Request
Objects
Memory
Garbage

How are objects Garbage Collected?
Evolution: Manual -> Automatic
3-4 decades ago: Developer writes code to manually evict garbage
Now: JVM automatically evicts garbage

Automatic GC sounds good, right?
Yes, but at the cost of:
GC pauses
CPU consumption

Application suffering from Consecutive Full GCs

Micro-metric: GC Pause Time
Source: Garbage Collection Log
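
As an illustration of how pause times might be pulled out of a GC log, the sketch below scans log lines for pause events, reports the ones above a threshold, and measures the longest run of back-to-back Full GCs. It is a sketch under assumptions: the class `GcPauses` is hypothetical, and the regular expression assumes the unified JVM logging format produced by `-Xlog:gc` on Java 9 and later; older log formats would need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract GC pause times from unified-logging GC log lines. */
final class GcPauses {

    // Example line this is meant to match (assumed format):
    // [2.345s][info][gc] GC(7) Pause Full (Allocation Failure) 510M->498M(512M) 1834.221ms
    private static final Pattern PAUSE =
        Pattern.compile("\\bPause (\\w+).*\\s(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Pause durations (ms) that are at or above the given threshold. */
    static List<Double> longPauses(List<String> gcLogLines, double thresholdMillis) {
        List<Double> out = new ArrayList<>();
        for (String line : gcLogLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                double ms = Double.parseDouble(m.group(2));
                if (ms >= thresholdMillis) {
                    out.add(ms);
                }
            }
        }
        return out;
    }

    /** Longest run of pause events that were all Full GCs. */
    static int maxConsecutiveFull(List<String> gcLogLines) {
        int run = 0;
        int max = 0;
        for (String line : gcLogLines) {
            Matcher m = PAUSE.matcher(line);
            if (!m.find()) {
                continue; // not a pause event (e.g. a concurrent phase)
            }
            run = "Full".equals(m.group(1)) ? run + 1 : 0;
            max = Math.max(max, run);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical log lines.
        List<String> lines = List.of(
            "[1.000s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.456ms",
            "[2.000s][info][gc] GC(1) Pause Full (Allocation Failure) 510M->498M(512M) 1834.221ms");
        System.out.println("Long pauses: " + longPauses(lines, 1000.0));
    }
}
```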

Micro-metric: Specific Errors/Exceptions
Source: Application Logs
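
A simple way to approximate this micro-metric is to count how often each exception or error type appears in the application log. The sketch below is illustrative: the class `ErrorCounts` is hypothetical, and the pattern simply looks for Java class names ending in `Exception` or `Error`, which is an assumption about how the application logs its failures.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count exception and error types mentioned in application log lines. */
final class ErrorCounts {

    // Simple or fully qualified class names ending in Exception or Error.
    private static final Pattern THROWABLE =
        Pattern.compile("\\b((?:\\w+\\.)*\\w+(?:Exception|Error))\\b");

    /** Throwable class name -> number of occurrences across the given lines. */
    static Map<String, Integer> count(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = THROWABLE.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines.
        List<String> lines = List.of(
            "2024-01-01 10:00:00 ERROR java.lang.OutOfMemoryError: Java heap space",
            "Caused by: java.sql.SQLException: connection timed out");
        System.out.println(count(lines));
    }
}
```

Comparing these counts between consecutive capture intervals would show whether a specific error is newly appearing or accelerating.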

Micro-Metrics Monitoring Architecture
(Diagram labels: My App, yCrash agent, Container/Machine, yCrash Server, Cloud/On-premise)
1. Every 3 minutes, Micro-Metrics* are captured
2. Metrics are transmitted
3. ML, Patterns applied on the Micro-Metrics
4. If problem forecasted, 360° data capture is triggered
* Micro-Metrics:
1. Garbage Collection Log
2. Thread Dump + top -H
3. Application Log
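
The four steps on this slide can be sketched as a periodic loop. Everything in the example below is hypothetical: `MicroMetricsLoop`, `Snapshot`, and the fixed thresholds are stand-ins chosen for illustration, and the simple threshold rule is explicitly not the ML and pattern logic the slide refers to. Only the 3-minute interval comes from the deck.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch of a capture -> evaluate -> trigger loop; not yCrash's actual agent. */
final class MicroMetricsLoop {

    /** Hypothetical summary of the three micro-metric sources. */
    static final class Snapshot {
        final double gcThroughputPercent;
        final int blockedThreads;
        final int errorCount;

        Snapshot(double gcThroughputPercent, int blockedThreads, int errorCount) {
            this.gcThroughputPercent = gcThroughputPercent;
            this.blockedThreads = blockedThreads;
            this.errorCount = errorCount;
        }
    }

    /** Stand-in for step 3: fixed, made-up thresholds instead of ML or patterns. */
    static boolean problemForecast(Snapshot s) {
        return s.gcThroughputPercent < 90.0 || s.blockedThreads > 20 || s.errorCount > 0;
    }

    /** One cycle: capture (step 1), evaluate (steps 2-3), trigger full capture (step 4). */
    static boolean runCycle(Supplier<Snapshot> capture, Runnable fullCapture) {
        Snapshot snapshot = capture.get();
        boolean forecast = problemForecast(snapshot);
        if (forecast) {
            fullCapture.run();
        }
        return forecast;
    }

    /** Runs a cycle every 3 minutes until the returned executor is shut down. */
    static ScheduledExecutorService start(Supplier<Snapshot> capture, Runnable fullCapture) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> runCycle(capture, fullCapture), 0, 3, TimeUnit.MINUTES);
        return ses;
    }

    public static void main(String[] args) {
        // Single cycle with made-up values, so the example exits immediately.
        boolean triggered = runCycle(
            () -> new Snapshot(85.0, 3, 0),
            () -> System.out.println("360-degree data capture triggered"));
        System.out.println("Problem forecast: " + triggered);
    }
}
```

Keeping the decision in `problemForecast` separate from the scheduling makes it straightforward to swap the threshold rule for a more capable model later.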

If you want to learn more …
Ram Lakshmanan
ram@tier1app.com
@tier1app
https://www.linkedin.com/company/ycrash
This deck will be published in:
https://blog.ycrash.io
THANK YOU FRIENDS

Predicting Production Outages with Micro-Metrics

1. Predicting Production Outages: Unleashing the Power of Micro-Metrics Ram Lakshmanan Architect yCrash

2. Predicting Memory Problems

3. Memory of Healthy Application - Full Garbage Collection Event

4. Acute Memory Leak - Full Garbage Collection Event

5. Memory Leak - Full Garbage Collection Event

6. Micro-metric: GC Throughput. Source: Garbage Collection Log

7. What is GC Throughput? The percentage of time the application spends processing customer transactions vs the time it spends on garbage collection activity

8. 360° Troubleshooting artifacts: 1. GC Log 2. Thread Dump 3. Heap Dump 4. Heap Substitute 5. top 6. ps 7. top -H 8. Disk Usage 9. dmesg 10. netstat 11. ping 12. vmstat 13. iostat 14. Kernel Params 15. App Logs 16. metadata. Open-source script: https://github.com/ycrash/yc-data-script

9. Predicting Backend Slowdown

10. Application Architecture (diagram labels: HTTP(S) request, Application Server, Server Thread Pool, JDBC, SOAP, MainFrame, REST)

11. Application Architecture (diagram labels: HTTP(S) request, Application Server, Server Thread Pool, JDBC, SOAP, MainFrame, REST)

12. Micro-metric: Threads with identical Stack trace. Source: Thread Dump

13. Case Study: Backend Slowdown in a Major Financial Institution in N. America

14. Predicting CPU Spike

15. What Causes CPU to Spike?

16. We all might have used 'top'. Secret option: top -H -p <PROCESS_ID>

17. Micro-metric: Thread Level CPU consumption. Source: top -H -p <PROCESS_ID>

18. Case Study: Major Trading app in N. America https://blog.fastthread.io/2020/04/23/troubleshooting-cpu-spike-in-a-major-trading-application/

19. Predicting Concurrency issues

20. Concurrency Problem: public synchronized void getData() { doSomething(); } While one thread holds the lock on getData(), other threads calling it are BLOCKED.

21. Micro-metric: BLOCKED state threads. Source: Thread Dump

22. Case Study: Major Leisure Travel Service Provider https://blog.ycrash.io/2022/03/09/java-uuid-generation-performance-impact/

23. Predicting Poor Response Time

24. What is Garbage? HTTP Request, Objects, Memory, Garbage

25. How are objects Garbage Collected? Evolution: Manual -> Automatic. 3-4 decades ago: Developer writes code to manually evict garbage. Now: JVM automatically evicts garbage.

26. Automatic GC sounds good, right? Yes, but at the cost of GC pauses and CPU consumption.

27. Application suffering from Consecutive Full GCs

28. Long GC Pause Duration

29. Micro-metric: GC Pause Time. Source: Garbage Collection Log

30. Micro-metric: Specific Errors/Exceptions. Source: Application Logs

31. Micro-Metrics Monitoring Architecture (diagram labels: My App, yCrash agent, Container/Machine, yCrash Server, Cloud/On-premise). 1. Every 3 minutes, Micro-Metrics* are captured. 2. Metrics are transmitted. 3. ML, Patterns applied on the Micro-Metrics. 4. If problem forecasted, 360° data capture is triggered. * Micro-Metrics: 1. Garbage Collection Log 2. Thread Dump + top -H 3. Application Log

32. 360° Troubleshooting artifacts: 1. GC Log 2. Thread Dump 3. Heap Dump 4. Heap Substitute 5. top 6. ps 7. top -H 8. Disk Usage 9. dmesg 10. netstat 11. ping 12. vmstat 13. iostat 14. Kernel Params 15. App Logs 16. metadata. Open-source script: https://github.com/ycrash/yc-data-script

33. If you want to learn more … Ram Lakshmanan, ram@tier1app.com, @tier1app, https://www.linkedin.com/company/ycrash This deck will be published in: https://blog.ycrash.io THANK YOU FRIENDS

Editor's notes

http://localhost:8080/yc-report.jsp?ou=SAP&de=198.134.23.1&app=yc&ts=2023-06-11T22-56-32
http://localhost:8080/yc-report.jsp?ou=SAP&de=32.123.89.12&app=yc&ts=2023-06-11T23-54-10
http://localhost:8080/yc-report.jsp?ou=SAP&de=90.21.123.19&app=yc&ts=2023-12-03T19-11-33