SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
1© Copyright 2015 EMC Corporation. All rights reserved.
RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Ao Ma
Fred Douglis
Guanlin Lu
Darren Sawyer
EMC
Surendar Chandra
Windsor Hsu
Datrium
2© Copyright 2015 EMC Corporation. All rights reserved.
Disk failures are commonplace
• Whole-disk failure
• Partial failure
RAID is widely deployed
• Protect data against failures with redundancy
Pervasive RAID Protection
3© Copyright 2015 EMC Corporation. All rights reserved.
Storage system is evolving
• Escalated use of less reliable drives causes more
whole-disk failures
• Increasing disk capacity results in more sector errors
Solution
• Add extra redundancy (RAID5, RAID6, …)
– Ensure data reliability at the cost of storage efficiency
RAID Overview
Is adding extra redundancy an efficient solution?
4© Copyright 2015 EMC Corporation. All rights reserved.
Analyzed 1 million SATA disks and revealed
• Failure modes degrading RAID reliability
• Reallocated sectors reflect disk reliability deterioration
• Disk failure is predictable
Built RAIDSHIELD, an active defense mechanism
• Reconstruct failing disk before it’s too late!
• PLATE: single-disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
What We Did
5© Copyright 2015 EMC Corporation. All rights reserved.
Background
Disk failure analysis
RAIDSHIELD:
• Identify failure indicator
• Reallocated Sector (RS) characterization
• Single disk proactive protection
• Disk group proactive protection
5
Outline
6© Copyright 2015 EMC Corporation. All rights reserved.
Disk failure does not follow a fail-stop model
The production systems studied define failure as
• Connection is lost
• An operation exceeds the timeout threshold
• Write fails
Whole-disk Failure Definition
7© Copyright 2015 EMC Corporation. All rights reserved.
• Each disk drive model is denoted as <family-capacity>
• Relative sizes within a family are ordered by the capacity number
– E.g. A-2 is larger than A-1
Disk Model Population
(Thousands)
First
Deployment
Log Length
(Months)
A-1 34 06/2008 60
A-2 165 11/2008 60
B-1 100 06/2008 48
C-1 93 10/2010 36
C-2 253 12/2010 36
D-1 384 09/2011 21
Disk Data Collection
8© Copyright 2015 EMC Corporation. All rights reserved.
What Do Real Disk
Failures Look Like?
9© Copyright 2015 EMC Corporation. All rights reserved.
0 0 0 1 4
24
46
20
3 1
0
10
20
30
40
50
0-6 12-18 24-30 36-42 48-54
A-2
Month
Distribution of Lifetime of Failed Drives
0 0 1 2 4
15
34
29
11
3
0
10
20
30
40
0-6 12-18 24-30 36-42 48-54
A-1
Month
A large fraction of failed drives are found at a similar age
Percentage(%)Percentage(%)
10© Copyright 2015 EMC Corporation. All rights reserved.
The number of affected disks keep growing
• About 10% of disks get sector errors at the 3rd year
Sector error numbers increase continuously
• Average error count increases 25% to 300% year
over year
Increasing Frequency of Sector Errors
11© Copyright 2015 EMC Corporation. All rights reserved.
Drive failing at a similar age
• Failure rate is not constant
• A high risk of multiple simultaneous failures
Increasing frequency of sector errors
• Exacerbate risk of reconstruction failures
Passive Redundancy is Inefficient
Ensuring reliability in the worst case requires
adding considerable extra redundancy, making it
unattractive from a cost perspective
12© Copyright 2015 EMC Corporation. All rights reserved.
Motivation
• Ensure data safety with minimal redundancy
• Proactively recognize impending failures and
migrate vulnerable data in advance
Methodology
• Identify indicator of impending failure
• Indicator characterization
• Proactive protection
RAIDSHIELD, The Proactive Protection
13© Copyright 2015 EMC Corporation. All rights reserved.
Potential indicators
• Various disk errors
Criteria of a good indicator
• It happens much more frequently on failed
disks rather than working disks
Approach
• Quantify the discrimination between error
value on failed disks and working ones
– Deciles comparison is used
Identify Failure Indicator
14© Copyright 2015 EMC Corporation. All rights reserved.
Failed disks have more media errors than working ones
The discrimination is not significant enough
Media Error Comparison
2 3 5 9
15
22
32
47
86
1 1 1 2 3 4 7
13
30
0
20
40
60
80
100
1 2 3 4 5 6 7 8 9
failed disk working disk
Deciles
A-2MediaErrorCount
15© Copyright 2015 EMC Corporation. All rights reserved.
2 23 87 187
327
522
812
1242
2025
0 0 0 0 0 1 2 6 29
0
500
1000
1500
2000
2500
1 2 3 4 5 6 7 8 9
failed disk working disk
A-2
Deciles
ReallocatedSectorCount
A-2
Deciles
RS is strongly correlated with disk failures
Reallocated Sector (RS) Comparison
16© Copyright 2015 EMC Corporation. All rights reserved.
Most failed drives tend to have a larger number of
RS than working ones
RS is strongly correlated with whole-disk failures,
followed by media errors, pending sector errors
and uncorrectable sector errors
Correlation Between Sector Errors
And Whole-disk Failure
RS is a strong indicator of impending disk failure
17© Copyright 2015 EMC Corporation. All rights reserved.
Larger RS count implies higher failure rate in
two-month observation window
Disk Failure Rate Given Different RS Count
1.7
67
75 80 83 85 86 89 90 90 91 92 93 93 94 95
0
20
40
60
80
100
0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600
DiskFailureRate(%)
RS count
A-2
RS Characterization (1)
18© Copyright 2015 EMC Corporation. All rights reserved.
Larger RS count, faster to fail
10%
25%
median
75%
90%
RS Count
TimeMargin(Days)
Disk Failure Time Given Different RS Count
RS Characterization (2)
19© Copyright 2015 EMC Corporation. All rights reserved.
RS count indicates the degree of disk
reliability deterioration
Use the RS count to predict impending
disk failure in advance
PLATE: Single Disk Proactive Protection
20© Copyright 2015 EMC Corporation. All rights reserved.
70.1 66.6 64 61.8 59.9
52.1
47
42.6 39 36.9
4.5 2.8 2.1 1.7 1.4 0.8 0.7 0.4 0.3 0.27
0
10
20
30
40
50
60
70
80
90
100
20 40 60 80 100 200 300 400 500 600
failures predicted
false positive
Percentage(%)
Both the predicted failure and false positive rates decrease
as the threshold increases
Simulation Result: Failures Captured
Rate Given Different RS Threshold
RS threshold
21© Copyright 2015 EMC Corporation. All rights reserved.
5 5
15 15
80
10
70
0
20
40
60
80
100
Without Proactive Protection With Proactive Protection
Hardware Failures Others Triple Failures Eliminated Triple Failures
Single proactive protection reduces about 70% of RAID
failures, equivalent to 88% of the triple-disk failures
PLATE Deployment Result:
Causes of Recovery Incidents
Percentage(%)
22© Copyright 2015 EMC Corporation. All rights reserved.
10% remaining triple failures
• PLATE misses RAID failures caused by multiple less reliable
drives, whose RS counts haven’t exceed the threshold
Triage
• Prioritize disk groups with highest risk
Motivation of ARMOR:
The RAID Group Proactive Protection
23© Copyright 2015 EMC Corporation. All rights reserved.
1-1
1-2
1-3
1-4
2-1
2-2
2-3
2-4
3-1
3-2
3-3
3-4
4-1
4-2
4-3
4-4
?
X
X
X X
?
?
?
?
Threat of
Failure
Imminent
Failure
Good
Disk
Healthy
DG1
Imminent
Failure of DG2
Protected
DG3
Possible Failure
of DG4
Single disk protection: Replace 2-3, 2-4, 3-4
(PLATE) Can’t identify DG4 nor the difference between DG2 and DG3
Group protection: Replace DG4 or increase redundancy
(ARMOR) Protect DG4 and recognize the difference between DG2 and DG3
Disk Group Protection Example
24© Copyright 2015 EMC Corporation. All rights reserved.
Calculate the single disk failure probability
• Conditional probability through Bayes Theorem
Calculate the probability of a vulnerable RAID
• Combination of those single disk probabilities
through joint probability
ARMOR Methodology
25© Copyright 2015 EMC Corporation. All rights reserved.
The discrimination shows ARMOR is effective to recognize endangered DGs
In practice, it identifies most DG failures that are not predicted by PLATE
Probability
Deciles distribution
Evaluation
0.25
0.33
0.39
0.44 0.46 0.5
0.63
0.73
0.93
0.15
0.2 0.23 0.25 0.27 0.28 0.3 0.31 0.32
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 6 7 8 9
Groups with more than one failure
Groups without failure
26© Copyright 2015 EMC Corporation. All rights reserved.
Google reports SMART metrics such as reallocated
sector strongly suggest an impending failure, but
they also determine that half of the failed disks
show no such errors [Pinheiro’07]
• Different workload and RAID rewrite
Disk failure prediction
• Average maximum latency [Goldszmidt’12]
• SMART failure prediction [Murray’05, Hughes’02]
Related Work
27© Copyright 2015 EMC Corporation. All rights reserved.
We analyzed 1 million SATA drives
• Observe failure modes degrading RAID reliability
• Reveal RS count reflects the disk reliability deterioration
• Disk failure is predictable
We built RAIDSHIELD, an active defense mechanism
• PLATE: single disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
– Hope to deploy in future
Is adding extra redundancy an efficient solution?
• Use as much redundancy as needed to ensure availability
• Proactive replacement should decrease the level needed
Summary
28© Copyright 2015 EMC Corporation. All rights reserved.
RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Questions?
Acknowledgement
Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau
Data Domain engineer team, members of AD and CTO office, Stephen Manley
Contact: ao.ma@emc.com

Weitere ähnliche Inhalte

Ähnlich wie RAIDShield of Fast15 slides

Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for CephCeph Community
 
Seatools dos-guide
Seatools dos-guideSeatools dos-guide
Seatools dos-guidessuserd6bd7c
 
02 fault tolerance
02 fault tolerance02 fault tolerance
02 fault toleranceashish61_scs
 
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...Transforming Backup and Recovery in VMware environments with EMC Avamar and D...
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...CTI Group
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax
 
Unitrends Sales Presentation 2010
Unitrends Sales Presentation 2010Unitrends Sales Presentation 2010
Unitrends Sales Presentation 2010lincolng
 
Micron: Addressing Process Scaling Challenges in Server Memory
Micron: Addressing Process Scaling Challenges in Server MemoryMicron: Addressing Process Scaling Challenges in Server Memory
Micron: Addressing Process Scaling Challenges in Server MemoryMicron Technology
 
recovery-series-family-datasheet 2015
recovery-series-family-datasheet 2015recovery-series-family-datasheet 2015
recovery-series-family-datasheet 2015Michael Standen
 
IT Infrastructure for Manufacturing
IT Infrastructure for ManufacturingIT Infrastructure for Manufacturing
IT Infrastructure for Manufacturingguestbe3b7a
 
Qualifying a high performance memory subsysten for Functional Safety
Qualifying a high performance memory subsysten for Functional SafetyQualifying a high performance memory subsysten for Functional Safety
Qualifying a high performance memory subsysten for Functional SafetyPankaj Singh
 
Why everyone speaks about DR but only few use it?
Why everyone speaks about DR but only few use it?Why everyone speaks about DR but only few use it?
Why everyone speaks about DR but only few use it?Francisco Alvarez
 
[iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors? [iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors? iROCTech
 
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesSolarWinds
 
5 Things You Need to Know About Enterprise Fl
 5 Things You Need to Know About Enterprise Fl 5 Things You Need to Know About Enterprise Fl
5 Things You Need to Know About Enterprise FlWestern Digital
 
Vx Rack : L'hyperconvergence avec l'experience VCE
Vx Rack : L'hyperconvergence avec l'experience VCEVx Rack : L'hyperconvergence avec l'experience VCE
Vx Rack : L'hyperconvergence avec l'experience VCERSD
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxSamsung Open Source Group
 
Performance & agilité les atouts du datacenter électronique selon XtremIO
Performance & agilité les atouts du datacenter électronique selon XtremIOPerformance & agilité les atouts du datacenter électronique selon XtremIO
Performance & agilité les atouts du datacenter électronique selon XtremIORSD
 

Ähnlich wie RAIDShield of Fast15 slides (20)

Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for Ceph
 
Unity v1 unity_xt_380
Unity v1 unity_xt_380Unity v1 unity_xt_380
Unity v1 unity_xt_380
 
Seatools dos-guide
Seatools dos-guideSeatools dos-guide
Seatools dos-guide
 
02 fault tolerance
02 fault tolerance02 fault tolerance
02 fault tolerance
 
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...Transforming Backup and Recovery in VMware environments with EMC Avamar and D...
Transforming Backup and Recovery in VMware environments with EMC Avamar and D...
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
 
Unitrends Sales Presentation 2010
Unitrends Sales Presentation 2010Unitrends Sales Presentation 2010
Unitrends Sales Presentation 2010
 
Micron: Addressing Process Scaling Challenges in Server Memory
Micron: Addressing Process Scaling Challenges in Server MemoryMicron: Addressing Process Scaling Challenges in Server Memory
Micron: Addressing Process Scaling Challenges in Server Memory
 
recovery-series-family-datasheet 2015
recovery-series-family-datasheet 2015recovery-series-family-datasheet 2015
recovery-series-family-datasheet 2015
 
IT Infrastructure for Manufacturing
IT Infrastructure for ManufacturingIT Infrastructure for Manufacturing
IT Infrastructure for Manufacturing
 
Qualifying a high performance memory subsysten for Functional Safety
Qualifying a high performance memory subsysten for Functional SafetyQualifying a high performance memory subsysten for Functional Safety
Qualifying a high performance memory subsysten for Functional Safety
 
Why everyone speaks about DR but only few use it?
Why everyone speaks about DR but only few use it?Why everyone speaks about DR but only few use it?
Why everyone speaks about DR but only few use it?
 
Module 03.pdf
Module 03.pdfModule 03.pdf
Module 03.pdf
 
[iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors? [iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors?
 
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machines
 
5 Things You Need to Know About Enterprise Fl
 5 Things You Need to Know About Enterprise Fl 5 Things You Need to Know About Enterprise Fl
5 Things You Need to Know About Enterprise Fl
 
Vx Rack : L'hyperconvergence avec l'experience VCE
Vx Rack : L'hyperconvergence avec l'experience VCEVx Rack : L'hyperconvergence avec l'experience VCE
Vx Rack : L'hyperconvergence avec l'experience VCE
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on Linux
 
Performance & agilité les atouts du datacenter électronique selon XtremIO
Performance & agilité les atouts du datacenter électronique selon XtremIOPerformance & agilité les atouts du datacenter électronique selon XtremIO
Performance & agilité les atouts du datacenter électronique selon XtremIO
 
10 reasons Why PCs crash U must Know
10 reasons Why PCs crash U must Know10 reasons Why PCs crash U must Know
10 reasons Why PCs crash U must Know
 

Kürzlich hochgeladen

Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 

Kürzlich hochgeladen (20)

Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 

RAIDShield of Fast15 slides

  • 1. 1© Copyright 2015 EMC Corporation. All rights reserved. RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures Ao Ma Fred Douglis Guanlin Lu Darren Sawyer EMC Surendar Chandra Windsor Hsu Datrium
  • 2. 2© Copyright 2015 EMC Corporation. All rights reserved. Disk failures are commonplace • Whole-disk failure • Partial failure RAID is widely deployed • Protect data against failures with redundancy Pervasive RAID Protection
  • 3. 3© Copyright 2015 EMC Corporation. All rights reserved. Storage system is evolving • Escalated use of less reliable drives causes more whole-disk failures • Increasing disk capacity results in more sector errors Solution • Add extra redundancy (RAID5, RAID6, …) – Ensure data reliability at the cost of storage efficiency RAID Overview Is adding extra redundancy an efficient solution?
  • 4. 4© Copyright 2015 EMC Corporation. All rights reserved. Analyzed 1 million SATA disks and revealed • Failure modes degrading RAID reliability • Reallocated sectors reflect disk reliability deterioration • Disk failure is predictable Built RAIDSHIELD, an active defense mechanism • Reconstruct failing disk before it’s too late! • PLATE: single-disk proactive protection – Deployment eliminates 70% of RAID failures • ARMOR: disk group proactive protection – Recognize vulnerable RAID groups What We Did
  • 5. 5© Copyright 2015 EMC Corporation. All rights reserved. Background Disk failure analysis RAIDSHIELD: • Identify failure indicator • Reallocated Sector (RS) characterization • Single disk proactive protection • Disk group proactive protection 5 Outline
  • 6. 6© Copyright 2015 EMC Corporation. All rights reserved. Disk failure does not follow a fail-stop model The production systems studied define failure as • Connection is lost • An operation exceeds the timeout threshold • Write fails Whole-disk Failure Definition
  • 7. 7© Copyright 2015 EMC Corporation. All rights reserved. • Each disk drive model is denoted as <family-capacity> • Relative sizes within a family are ordered by the capacity number – E.g. A-2 is larger than A-1 Disk Model Population (Thousands) First Deployment Log Length (Months) A-1 34 06/2008 60 A-2 165 11/2008 60 B-1 100 06/2008 48 C-1 93 10/2010 36 C-2 253 12/2010 36 D-1 384 09/2011 21 Disk Data Collection
  • 8. 8© Copyright 2015 EMC Corporation. All rights reserved. What Do Real Disk Failures Look Like?
  • 9. 9© Copyright 2015 EMC Corporation. All rights reserved. 0 0 0 1 4 24 46 20 3 1 0 10 20 30 40 50 0-6 12-18 24-30 36-42 48-54 A-2 Month Distribution of Lifetime of Failed Drives 0 0 1 2 4 15 34 29 11 3 0 10 20 30 40 0-6 12-18 24-30 36-42 48-54 A-1 Month A large fraction of failed drives are found at a similar age Percentage(%)Percentage(%)
  • 10. 10© Copyright 2015 EMC Corporation. All rights reserved. The number of affected disks keep growing • About 10% of disks get sector errors at the 3rd year Sector error numbers increase continuously • Average error count increases 25% to 300% year over year Increasing Frequency of Sector Errors
  • 11. 11© Copyright 2015 EMC Corporation. All rights reserved. Drive failing at a similar age • Failure rate is not constant • A high risk of multiple simultaneous failures Increasing frequency of sector errors • Exacerbate risk of reconstruction failures Passive Redundancy is Inefficient Ensuring reliability in the worst case requires adding considerable extra redundancy, making it unattractive from a cost perspective
  • 12. 12© Copyright 2015 EMC Corporation. All rights reserved. Motivation • Ensure data safety with minimal redundancy • Proactively recognize impending failures and migrate vulnerable data in advance Methodology • Identify indicator of impending failure • Indicator characterization • Proactive protection RAIDSHIELD, The Proactive Protection
  • 13. 13© Copyright 2015 EMC Corporation. All rights reserved. Potential indicators • Various disk errors Criteria of a good indicator • It happens much more frequently on failed disks rather than working disks Approach • Quantify the discrimination between error value on failed disks and working ones – Deciles comparison is used Identify Failure Indicator
  • 14. 14© Copyright 2015 EMC Corporation. All rights reserved. Failed disks have more media errors than working ones The discrimination is not significant enough Media Error Comparison 2 3 5 9 15 22 32 47 86 1 1 1 2 3 4 7 13 30 0 20 40 60 80 100 1 2 3 4 5 6 7 8 9 failed disk working disk Deciles A-2MediaErrorCount
  • 15. 15© Copyright 2015 EMC Corporation. All rights reserved. 2 23 87 187 327 522 812 1242 2025 0 0 0 0 0 1 2 6 29 0 500 1000 1500 2000 2500 1 2 3 4 5 6 7 8 9 failed disk working disk A-2 Deciles ReallocatedSectorCount A-2 Deciles RS is strongly correlated with disk failures Reallocated Sector (RS) Comparison
  • 16. 16© Copyright 2015 EMC Corporation. All rights reserved. Most failed drives tend to have a larger number of RS than working ones RS is strongly correlated with whole-disk failures, followed by media errors, pending sector errors and uncorrectable sector errors Correlation Between Sector Errors And Whole-disk Failure RS is a strong indicator of impending disk failure
  • 17. 17© Copyright 2015 EMC Corporation. All rights reserved. Larger RS count implies higher failure rate in two-month observation window Disk Failure Rate Given Different RS Count 1.7 67 75 80 83 85 86 89 90 90 91 92 93 93 94 95 0 20 40 60 80 100 0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600 DiskFailureRate(%) RS count A-2 RS Characterization (1)
  • 18. 18© Copyright 2015 EMC Corporation. All rights reserved. Larger RS count, faster to fail 10% 25% median 75% 90% RS Count TimeMargin(Days) Disk Failure Time Given Different RS Count RS Characterization (2)
  • 19. 19© Copyright 2015 EMC Corporation. All rights reserved. RS count indicates the degree of disk reliability deterioration Use the RS count to predict impending disk failure in advance PLATE: Single Disk Proactive Protection
  • 20. 20© Copyright 2015 EMC Corporation. All rights reserved. 70.1 66.6 64 61.8 59.9 52.1 47 42.6 39 36.9 4.5 2.8 2.1 1.7 1.4 0.8 0.7 0.4 0.3 0.27 0 10 20 30 40 50 60 70 80 90 100 20 40 60 80 100 200 300 400 500 600 failures predicted false positive Percentage(%) Both the predicted failure and false positive rates decrease as the threshold increases Simulation Result: Failures Captured Rate Given Different RS Threshold RS threshold
  • 21. 21© Copyright 2015 EMC Corporation. All rights reserved. 5 5 15 15 80 10 70 0 20 40 60 80 100 Without Proactive Protection With Proactive Protection Hardware Failures Others Triple Failures Eliminated Triple Failures Single proactive protection reduces about 70% of RAID failures, equivalent to 88% of the triple-disk failures PLATE Deployment Result: Causes of Recovery Incidents Percentage(%)
  • 22. 22© Copyright 2015 EMC Corporation. All rights reserved. 10% remaining triple failures • PLATE misses RAID failures caused by multiple less reliable drives, whose RS counts haven’t exceed the threshold Triage • Prioritize disk groups with highest risk Motivation of ARMOR: The RAID Group Proactive Protection
  • 23. 23© Copyright 2015 EMC Corporation. All rights reserved. 1-1 1-2 1-3 1-4 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 ? X X X X ? ? ? ? Threat of Failure Imminent Failure Good Disk Healthy DG1 Imminent Failure of DG2 Protected DG3 Possible Failure of DG4 Single disk protection: Replace 2-3, 2-4, 3-4 (PLATE) Can’t identify DG4 nor the difference between DG2 and DG3 Group protection: Replace DG4 or increase redundancy (ARMOR) Protect DG4 and recognize the difference between DG2 and DG3 Disk Group Protection Example
  • 24. 24© Copyright 2015 EMC Corporation. All rights reserved. Calculate the single disk failure probability • Conditional probability through Bayes Theorem Calculate the probability of a vulnerable RAID • Combination of those single disk probabilities through joint probability ARMOR Methodology
  • 25. 25© Copyright 2015 EMC Corporation. All rights reserved. The discrimination shows ARMOR is effective to recognize endangered DGs In practice, it identifies most DG failures that are not predicted by PLATE Probability Deciles distribution Evaluation 0.25 0.33 0.39 0.44 0.46 0.5 0.63 0.73 0.93 0.15 0.2 0.23 0.25 0.27 0.28 0.3 0.31 0.32 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 Groups with more than one failure Groups without failure
  • 26. 26© Copyright 2015 EMC Corporation. All rights reserved. Google reports SMART metrics such as reallocated sector strongly suggest an impending failure, but they also determine that half of the failed disks show no such errors [Pinheiro’07] • Different workload and RAID rewrite Disk failure prediction • Average maximum latency [Goldszmidt’12] • SMART failure prediction [Murray’05, Hughes’02] Related Work
  • 27. 27© Copyright 2015 EMC Corporation. All rights reserved. We analyzed 1 million SATA drives • Observe failure modes degrading RAID reliability • Reveal RS count reflects the disk reliability deterioration • Disk failure is predictable We built RAIDSHIELD, an active defense mechanism • PLATE: single disk proactive protection – Deployment eliminates 70% of RAID failures • ARMOR: disk group proactive protection – Recognize vulnerable RAID groups – Hope to deploy in future Is adding extra redundancy an efficient solution? • Use as much redundancy as needed to ensure availability • Proactive replacement should decrease the level needed Summary
  • 28. 28© Copyright 2015 EMC Corporation. All rights reserved. RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures Questions? Acknowledgement Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau Data Domain engineer team, members of AD and CTO office, Stephen Manley Contact: ao.ma@emc.com