Predicting Storage Failures, by Ahmed El-Shimi
VAULT - Linux Storage and File Systems Conference, Cambridge, MA. March 22, 2017
Copyright Minima Inc. 2017
This Talk
• Part I: Motivation
  • Drive Failure & Common Mitigations
  • Seeking a Better Recover/Rebuild
  • Use Cases & Goals
• Part II: Examining the Data
  • Dataset
  • Features, Trends, Challenges
• Part III: Predicting Drive Failure
  • How to Evaluate
  • Baseline
  • Approach & Models
  • Evaluation & Results
The Problem
2% Drive Failure Rate (averaged, annualized)
Source: Hard Drive Failure Rate Report Q3 2016, Backblaze.
[Figure: Drive Annual Failure Rate by Manufacturer and Model]
HGST (models 1-7): 1.2%, 0.7%, 0.6%, 0.4%, 0.4%, 0.3%, 0.0%
Seagate (models 1-4): 10.2%, 3.2%, 1.5%, 0.6%
WDC (models 1-3): 11.3%, 2.2%, 0.0%
Today’s Mitigation
Assume: “Everything that can fail will fail”
Design: Redundancy at every point of failure
RAID, ERASURE CODING, REPLICATION
But Rebuild is not free

Year                  2012         2015         2017
Capacity              4 TB HDD     8 TB HDD     16 TB HDD
Throughput            100 MB/sec   150 MB/sec   200 MB/sec
1-Disk Rebuild Time   11 hours     15 hours     22 hours
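
The rebuild times in the table are consistent with simply dividing capacity by sustained throughput. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^6 MB) and a best-case sequential rewrite of the whole drive with no competing workload; the function name `rebuild_hours` is illustrative, not from the talk:

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Best-case time to rewrite an entire drive sequentially, in hours."""
    capacity_mb = capacity_tb * 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB
    seconds = capacity_mb / throughput_mb_s
    return seconds / 3600


# The three capacity/throughput pairs from the table above
for year, tb, mb_s in [(2012, 4, 100), (2015, 8, 150), (2017, 16, 200)]:
    print(f"{year}: {tb} TB at {mb_s} MB/s -> {rebuild_hours(tb, mb_s):.1f} hours")
```

Capacity doubles at each step while throughput grows far more slowly, which is why the rebuild window keeps widening.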
[Figure: Single HDD Rebuild Time (Minutes), y-axis 0 to 1400, for 2002 (80 GB HDD), 2012 (4 TB HDD), 2015 (8 TB HDD), 2017 (16 TB HDD)]
Rebuild inflation has consequences:
• Availability and Durability 9s
• Rebuild is a workload!
• Resilience, Reliability
• Disk Capacity & Network Management
• Failure Modes
• Lots of complexity to address edge cases
Can we do better if we had an early warning?
Use Cases & Goals
Cloud
• Proactive Rebuild
• Smarter Ops Scheduling
Enterprise / Field
• Proactive Rebuild
• Better FRU SLA
End-User PC
• Backup Now
• Contingency planning
The Backblaze Dataset
• Backblaze.com: Online Backup and Cloud Storage provider.
• 83K+ drives
• 2013-2016
• Seagate, Hitachi, HGST, Western Digital, Toshiba, Samsung
https://www.backblaze.com/b2/hard-drive-test-data.html
• Hats off to them for sharing their data openly.
Understanding the Data
Dataset:
• 83K drives
• 46M Drive Days (2013-2016)
• >5000 failures
• Daily snapshot of each drive’s health state + SMART metrics
• SMART (Self-Monitoring, Analysis and Reporting Technology): a standard monitoring system included in HDDs and SSDs
Example SMART Attributes:
• SMART_1: Read Error Rate
• SMART_5: Reallocated Sectors Count
• SMART_9: Power-On Hours
• SMART_7: Seek Error Rate
• SMART_197: Current Pending Sector Count
https://en.wikipedia.org/wiki/S.M.A.R.T.
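
To make "daily snapshot" concrete, here is a small sketch of reading such records with the Python standard library. It assumes the column layout of the public Backblaze CSV files (`date`, `serial_number`, `model`, `capacity_bytes`, `failure`, then `smart_N_raw` columns); the two sample rows and the helper name `load_snapshots` are made up for illustration:

```python
import csv
import io

# Two made-up rows laid out like one day of the public CSV data
SAMPLE = """date,serial_number,model,capacity_bytes,failure,smart_5_raw,smart_9_raw,smart_197_raw
2016-01-01,SN0001,MODEL_A,4000787030016,0,0,12000,0
2016-01-01,SN0002,MODEL_B,4000787030016,1,24,30500,
"""


def load_snapshots(text):
    """Parse CSV text into one dict per drive-day."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        # Keep only SMART columns that the drive actually reported;
        # an empty cell means the attribute is not supported on that model.
        smart = {
            name: int(value)
            for name, value in rec.items()
            if name.startswith("smart_") and value != ""
        }
        rows.append({
            "date": rec["date"],
            "serial": rec["serial_number"],
            "model": rec["model"],
            "failed": rec["failure"] == "1",
            "smart": smart,
        })
    return rows


for row in load_snapshots(SAMPLE):
    print(row["serial"], row["failed"], row["smart"])
```

Skipping empty cells instead of filling in zeros keeps "not reported" distinct from "reported as 0".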
SMART 197: Current Pending Sector Count over 30 days
Sample of 7678 drives (50% failed / 50% healthy)
Part III
Disk failures can be predicted. Not perfectly. But better than simple heuristics.
Performance Metric
• Goals:
  • Detect rare event (1/20 or less)
  • Tune depending on Use Case
    • Tolerance for False Positives
    • vs. Tolerance for False Negatives
  • We want to maximize our ability to make better tradeoffs
• Performance Metric:
  • PR.AUC: Area Under Precision-Recall curve
[Figure: precision-recall curve (axes: Precision, Recall); PR.AUC is the area under the curve]
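
One way to compute such an area, sketched in plain Python. This uses the step-wise sum often called average precision, which is an assumption here; the talk does not say which estimator was used. `labels` are 1 for failed and 0 for healthy, and `scores` are model outputs where higher means more likely to fail:

```python
def pr_curve(labels, scores):
    """Return (recall, precision) points, sweeping the threshold from high to low."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points, tp, fp = [], 0, 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after every item sharing this score is counted
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def pr_auc(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in pr_curve(labels, scores):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(pr_curve(labels, scores))
print(pr_auc(labels, scores))
```

Each point on the curve corresponds to one decision threshold, so a model with a larger area offers better precision/recall tradeoffs to choose from.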
Baseline Heuristic
• If any of the critical SMART attributes > 0, then the drive is likely to fail:
  • SMART_5: Reallocated Sectors Count
  • SMART_187: Reported Uncorrectable
  • SMART_188: Command Timeout
  • SMART_197: Current Pending Sector Count
  • SMART_198: Offline Uncorrectable
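
The heuristic above fits in a few lines. This sketch assumes the check is applied to raw SMART values and that an attribute a drive does not report counts as 0; neither detail is stated on the slide, and the names are illustrative:

```python
# The five critical attributes listed above
CRITICAL_ATTRS = (5, 187, 188, 197, 198)


def baseline_predicts_failure(smart_raw):
    """smart_raw maps SMART attribute number -> raw value.

    Assumption: a missing attribute is treated as 0 (not reported).
    """
    return any(smart_raw.get(attr, 0) > 0 for attr in CRITICAL_ATTRS)


print(baseline_predicts_failure({5: 0, 9: 12000, 197: 3}))  # pending sectors present
print(baseline_predicts_failure({5: 0, 9: 12000}))           # no critical counter set
```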
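Baseline Performance
• Evaluation dataset:
  • 13980 drives
  • 699 failed
  • 13281 healthy
• Precision: 42%
  • (i.e. 58% of predicted failures were false positives)
• Recall: 68%
  • (i.e. 32% of actual failures were missed: false negatives)
Confusion Matrix (rows: truth; columns: baseline prediction)
                  Predicted Healthy   Predicted Failed
Truth Healthy     12625               656
Truth Failed      223                 476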
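
The precision and recall figures follow directly from the confusion matrix. A short check, using the four counts above (the function name is illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts from the confusion matrix above; "failed" is the positive class
tn, fp = 12625, 656   # truly healthy: predicted healthy / predicted failed
fn, tp = 223, 476     # truly failed:  predicted healthy / predicted failed

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.0%} recall={recall:.0%}")
```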
Approach
• Split the data into Train/Test and Eval
  • 2013-2015: Train/Test
  • 2016: Eval
• Sample from the Train data at 50/50
  • (Learn equally from failed/healthy drives)
• Sample from the Eval data at 95/05
  • (Evaluate at a fixed real-life healthy/failed mix)
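
A rough sketch of the split-then-sample idea. The record layout is hypothetical: each drive is a dict with a `year` and a `failed` flag, which simplifies away the fact that real drives span several years of snapshots. Sampling is without replacement, which is also an assumption:

```python
import random


def split_by_year(drives):
    """2013-2015 go to train/test, 2016 is held out for evaluation."""
    train = [d for d in drives if 2013 <= d["year"] <= 2015]
    evaluation = [d for d in drives if d["year"] == 2016]
    return train, evaluation


def sample_mix(drives, failed_fraction, n, seed=0):
    """Draw n drives with the requested share of failed drives.

    Raises ValueError if either class has too few drives to sample from.
    """
    rng = random.Random(seed)
    failed = [d for d in drives if d["failed"]]
    healthy = [d for d in drives if not d["failed"]]
    n_failed = round(n * failed_fraction)
    return rng.sample(failed, n_failed) + rng.sample(healthy, n - n_failed)


# Synthetic drives, purely for illustration
drives = (
    [{"year": 2014, "failed": i < 20} for i in range(100)]
    + [{"year": 2016, "failed": i < 20} for i in range(200)]
)
train, evaluation = split_by_year(drives)
train_sample = sample_mix(train, failed_fraction=0.50, n=20)
eval_sample = sample_mix(evaluation, failed_fraction=0.05, n=100)
print(sum(d["failed"] for d in train_sample), "failed of", len(train_sample))
print(sum(d["failed"] for d in eval_sample), "failed of", len(eval_sample))
```

Splitting by calendar year before sampling keeps any 2016 data out of training, so the evaluation reflects predictions on drive-days the models have not seen.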
Feature management challenges
• Inconsistency of SMART data support across vendors and drive models
• Data Sparseness
• Opacity of most SMART metrics
• Further opacity of Normalized SMART values
• Wide Range for most SMART values
• Dataset skewness by vendor and model
• Some gaps and inconsistencies in the telemetry data
Feature Selection & Models
• Feature Selection:
  • Raw over Normalized SMART Data
  • Created model_fam feature to collapse vendor/model
  • Z-Score Normalization of RAW values
  • 3-day, 5-day rolling Variance for all SMART RAW values
  • and a few things which didn’t work as well as initially hoped :)
• Models:
  • Random Forests
  • Logistic Regression
  • Support Vector Machines
• Goal:
  • Improve on the baseline
  • Expand options to tune Decision Threshold for various Precision vs. Recall tradeoffs (based on Use Case)
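
Two of the feature transforms listed above, sketched with the standard library. The talk does not say whether population or sample statistics were used, or how partial windows were handled; this sketch assumes population variance and leaves the first `window - 1` days empty:

```python
from statistics import mean, pstdev, pvariance


def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant series: nothing to scale
    return [(v - mu) / sigma for v in values]


def rolling_variance(values, window):
    """Variance over each trailing window; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(pvariance(values[i + 1 - window : i + 1]))
    return out


# Made-up daily raw readings of one SMART counter for one drive
daily = [0, 0, 0, 3, 6]
print(rolling_variance(daily, window=3))
print(z_scores(daily))
```

A counter that sits at zero and then starts moving yields a jump in rolling variance, which is the kind of change these features are meant to surface.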
Model 2: Logistic Regression
• Sigmoid (step-like) function to model a binomial output (0/1)
• Works well for linear continuous numerical inputs
  • Corollary: works horribly for categorical or non-linear variables that only appear to be continuous numerical
• In our case the key is to:
  • Isolate the right SMART metrics
  • Normalize where needed
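
For readers who want to see the mechanics, here is a toy logistic regression in plain Python: a sigmoid over a linear combination of inputs, fit by batch gradient descent on the log-loss. The training data, learning rate and epoch count are arbitrary illustrative choices, not the talk's setup:

```python
import math


def sigmoid(z):
    """Smooth step from 0 to 1, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Estimated probability of the positive class (failure)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(X, y, lr=0.1, epochs=2000):
    """Plain batch gradient descent on the log-loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(X, y):
            err = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# One already-normalized feature; label 1 = failed
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
weights, bias = train(X, y)
print(predict_proba(weights, bias, [2.0]), predict_proba(weights, bias, [-2.0]))
```

The model outputs a probability, so the decision threshold can be moved away from 0.5 to trade precision against recall.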
Model 3: Support Vector Machines
• Binary Classifier based on mapping non-linear data into a higher dimension to make linear separation possible
• Maximizes margin of separation between +/- categories
• In our case the key was still to isolate the right SMART metrics before training
[Figure credit: By Alisneaky, svg version by User:Zirguezi - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=47868867]
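
The "higher dimension" idea can be shown without an SVM solver. In this toy sketch (made-up points, explicit feature map x -> (x, x^2) chosen for illustration), no single cut on the number line separates the two classes, but a straight line in the lifted space does. It illustrates the mapping only and does not train a support vector machine:

```python
def lift(x):
    """Map a 1-D point into 2-D: (x, x^2)."""
    return (x, x * x)


inner = [-1.0, -0.5, 0.0, 0.5, 1.0]  # class -1: close to zero
outer = [-3.0, -2.0, 2.0, 3.0]       # class +1: far from zero on either side


def separable_by_threshold_1d(neg, pos):
    """True if a single cut on the number line splits the two classes."""
    return max(neg) < min(pos) or max(pos) < min(neg)


# In the lifted space, cut halfway between the closest points of each class
# along the x^2 axis, leaving the same gap on both sides.
cut = (max(lift(x)[1] for x in inner) + min(lift(x)[1] for x in outer)) / 2


def classify(x):
    """Linear rule in the lifted space: sign(0 * x + 1 * x^2 - cut)."""
    return 1 if lift(x)[1] - cut > 0 else -1


print(separable_by_threshold_1d(inner, outer))
print(cut, [classify(x) for x in inner + outer])
```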
Feature Importance – SMART Raw – All Models
[Figure: Importance (Random Forests model over SMART Raw features); y-axis 0 to 4000]
Feature Importance – SMART Variance – Seagate
[Figure: Importance (Random Forests model over SMART Rolling Variance features); y-axis 0 to 400]
Summary
• 2% of disks fail annually, on average. Mileage varies by model.
• SMART metrics can clearly signal failures, sometimes days before they happen
• We can train/predict across some of the drive models, overcoming training data sparsity
• We can reasonably train models to predict drive failure using SMART data and improve upon existing heuristics
• Picking and tuning the right model depends on Use Case and goals
  • Precision vs. Recall tradeoff
  • Different models give you more options
• More Data is better. Duh!