RAIDShield of Fast15 slides

1© Copyright 2015 EMC Corporation. All rights reserved.
RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Ao Ma
Fred Douglis
Guanlin Lu
Darren Sawyer
EMC
Surendar Chandra
Windsor Hsu
Datrium

Disk failures are commonplace
• Whole-disk failure
• Partial failure
RAID is widely deployed
• Protect data against failures with redundancy
Pervasive RAID Protection

Storage system is evolving
• Escalated use of less reliable drives causes more
whole-disk failures
• Increasing disk capacity results in more sector errors
Solution
• Add extra redundancy (RAID5, RAID6, …)
– Ensure data reliability at the cost of storage efficiency
RAID Overview
Is adding extra redundancy an efficient solution?

Analyzed 1 million SATA disks and revealed
• Failure modes degrading RAID reliability
• Reallocated sectors reflect disk reliability deterioration
• Disk failure is predictable
Built RAIDSHIELD, an active defense mechanism
• Reconstruct failing disk before it’s too late!
• PLATE: single-disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
What We Did

Background
Disk failure analysis
RAIDSHIELD:
• Identify failure indicator
• Reallocated Sector (RS) characterization
• Single disk proactive protection
• Disk group proactive protection
5
Outline

Disk failure does not follow a fail-stop model
The production systems studied define failure as
• Connection is lost
• An operation exceeds the timeout threshold
• Write fails
Whole-disk Failure Definition

• Each disk drive model is denoted as <family-capacity>
• Relative sizes within a family are ordered by the capacity number
– E.g. A-2 is larger than A-1
Disk Model Population
(Thousands)
First
Deployment
Log Length
(Months)
A-1 34 06/2008 60
A-2 165 11/2008 60
B-1 100 06/2008 48
C-1 93 10/2010 36
C-2 253 12/2010 36
D-1 384 09/2011 21
Disk Data Collection

What Do Real Disk
Failures Look Like?

0 0 0 1 4
24
46
20
3 1
0
10
20
30
40
50
0-6 12-18 24-30 36-42 48-54
A-2
Month
Distribution of Lifetime of Failed Drives
0 0 1 2 4
15
34
29
11
3
0
10
20
30
40
0-6 12-18 24-30 36-42 48-54
A-1
Month
A large fraction of failed drives are found at a similar age
Percentage(%)Percentage(%)

The number of affected disks keep growing
• About 10% of disks get sector errors at the 3rd year
Sector error numbers increase continuously
• Average error count increases 25% to 300% year
over year
Increasing Frequency of Sector Errors

Drive failing at a similar age
• Failure rate is not constant
• A high risk of multiple simultaneous failures
Increasing frequency of sector errors
• Exacerbate risk of reconstruction failures
Passive Redundancy is Inefficient
Ensuring reliability in the worst case requires
adding considerable extra redundancy, making it
unattractive from a cost perspective

Motivation
• Ensure data safety with minimal redundancy
• Proactively recognize impending failures and
migrate vulnerable data in advance
Methodology
• Identify indicator of impending failure
• Indicator characterization
• Proactive protection
RAIDSHIELD, The Proactive Protection

Potential indicators
• Various disk errors
Criteria of a good indicator
• It happens much more frequently on failed
disks rather than working disks
Approach
• Quantify the discrimination between error
value on failed disks and working ones
– Deciles comparison is used
Identify Failure Indicator

Failed disks have more media errors than working ones
The discrimination is not significant enough
Media Error Comparison
2 3 5 9
15
22
32
47
86
1 1 1 2 3 4 7
13
30
0
20
40
60
80
100
1 2 3 4 5 6 7 8 9
failed disk working disk
Deciles
A-2MediaErrorCount

2 23 87 187
327
522
812
1242
2025
0 0 0 0 0 1 2 6 29
0
500
1000
1500
2000
2500
1 2 3 4 5 6 7 8 9
failed disk working disk
A-2
Deciles
ReallocatedSectorCount
A-2
Deciles
RS is strongly correlated with disk failures
Reallocated Sector (RS) Comparison

Most failed drives tend to have a larger number of
RS than working ones
RS is strongly correlated with whole-disk failures,
followed by media errors, pending sector errors
and uncorrectable sector errors
Correlation Between Sector Errors
And Whole-disk Failure
RS is a strong indicator of impending disk failure

Larger RS count implies higher failure rate in
two-month observation window
Disk Failure Rate Given Different RS Count
1.7
67
75 80 83 85 86 89 90 90 91 92 93 93 94 95
0
20
40
60
80
100
0 40 80 120 160 200 240 280 320 360 400 440 480 520 560 600
DiskFailureRate(%)
RS count
A-2
RS Characterization (1)

Larger RS count, faster to fail
10%
25%
median
75%
90%
RS Count
TimeMargin(Days)
Disk Failure Time Given Different RS Count
RS Characterization (2)

RS count indicates the degree of disk
reliability deterioration
Use the RS count to predict impending
disk failure in advance
PLATE: Single Disk Proactive Protection

70.1 66.6 64 61.8 59.9
52.1
47
42.6 39 36.9
4.5 2.8 2.1 1.7 1.4 0.8 0.7 0.4 0.3 0.27
0
10
20
30
40
50
60
70
80
90
100
20 40 60 80 100 200 300 400 500 600
failures predicted
false positive
Percentage(%)
Both the predicted failure and false positive rates decrease
as the threshold increases
Simulation Result: Failures Captured
Rate Given Different RS Threshold
RS threshold

5 5
15 15
80
10
70
0
20
40
60
80
100
Without Proactive Protection With Proactive Protection
Hardware Failures Others Triple Failures Eliminated Triple Failures
Single proactive protection reduces about 70% of RAID
failures, equivalent to 88% of the triple-disk failures
PLATE Deployment Result:
Causes of Recovery Incidents
Percentage(%)

10% remaining triple failures
• PLATE misses RAID failures caused by multiple less reliable
drives, whose RS counts haven’t exceed the threshold
Triage
• Prioritize disk groups with highest risk
Motivation of ARMOR:
The RAID Group Proactive Protection

1-1
1-2
1-3
1-4
2-1
2-2
2-3
2-4
3-1
3-2
3-3
3-4
4-1
4-2
4-3
4-4
?
X
X
X X
?
?
?
?
Threat of
Failure
Imminent
Failure
Good
Disk
Healthy
DG1
Imminent
Failure of DG2
Protected
DG3
Possible Failure
of DG4
Single disk protection: Replace 2-3, 2-4, 3-4
(PLATE) Can’t identify DG4 nor the difference between DG2 and DG3
Group protection: Replace DG4 or increase redundancy
(ARMOR) Protect DG4 and recognize the difference between DG2 and DG3
Disk Group Protection Example

Calculate the single disk failure probability
• Conditional probability through Bayes Theorem
Calculate the probability of a vulnerable RAID
• Combination of those single disk probabilities
through joint probability
ARMOR Methodology

The discrimination shows ARMOR is effective to recognize endangered DGs
In practice, it identifies most DG failures that are not predicted by PLATE
Probability
Deciles distribution
Evaluation
0.25
0.33
0.39
0.44 0.46 0.5
0.63
0.73
0.93
0.15
0.2 0.23 0.25 0.27 0.28 0.3 0.31 0.32
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 6 7 8 9
Groups with more than one failure
Groups without failure

Google reports SMART metrics such as reallocated
sector strongly suggest an impending failure, but
they also determine that half of the failed disks
show no such errors [Pinheiro’07]
• Different workload and RAID rewrite
Disk failure prediction
• Average maximum latency [Goldszmidt’12]
• SMART failure prediction [Murray’05, Hughes’02]
Related Work

We analyzed 1 million SATA drives
• Observe failure modes degrading RAID reliability
• Reveal RS count reflects the disk reliability deterioration
• Disk failure is predictable
We built RAIDSHIELD, an active defense mechanism
• PLATE: single disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
– Hope to deploy in future
Is adding extra redundancy an efficient solution?
• Use as much redundancy as needed to ensure availability
• Proactive replacement should decrease the level needed
Summary

RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Questions?
Acknowledgement
Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau
Data Domain engineer team, members of AD and CTO office, Stephen Manley
Contact: ao.ma@emc.com

RAIDShield of Fast15 slides

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie RAIDShield of Fast15 slides

Ähnlich wie RAIDShield of Fast15 slides (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

RAIDShield of Fast15 slides