Predicting Storage Failures Rebuild is not free 2012 2015 2017 Capacity 4TB HDD 8TB HDD 16TB HDD...

Predicting Storage Failures by Ahmed El-Shimi


VAULT - Linux Storage and File Systems Conference, Cambridge, MA. March 22, 2017

Copyright Minima Inc. 2017

This Talk

• Part I: Motivation
  • Drive Failure & Common Mitigations
  • Seeking a Better Recover/Rebuild
  • Use Cases & Goals

• Part II: Examining the Data
  • Dataset
  • Features, Trends, Challenges

• Part III: Predicting Drive Failure
  • How to Evaluate
  • Baseline
  • Approach & Models
  • Evaluation & Results


Part I
Disks fail. And even with redundancy, failure has costs.


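The Problem

Source: Hard Drive Failure Rate Report Q3 2016, Backblaze.

2% Drive Failure Rate (averaged, annualized)

Drive Annual Failure Rate by Manufacturer and Model
(models are identified only by index on the chart)

Manufacturer   Model   Annual failure rate
HGST           1       1.2%
HGST           2       0.7%
HGST           3       0.6%
HGST           4       0.4%
HGST           5       0.4%
HGST           6       0.3%
HGST           7       0.0%
Seagate        1       10.2%
Seagate        2       3.2%
Seagate        3       1.5%
Seagate        4       0.6%
WDC            1       11.3%
WDC            2       2.2%
WDC            3       0.0%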


Today’s Mitigation

Assume: “Everything that can fail will fail”

Design: Redundancy at every point of failure

RAID, ERASURE CODING, REPLICATION


But Rebuild is not free

                      2012         2015         2017
Capacity              4 TB HDD     8 TB HDD     16 TB HDD
Throughput            100 MB/sec   150 MB/sec   200 MB/sec
1-Disk Rebuild Time   11 hours     15 hours     22 hours
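The rebuild times in the table follow from dividing capacity by sustained throughput. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a rebuild that runs at the full sequential rate with no competing workload; the function name `rebuild_hours` is only illustrative:

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Best-case time to rewrite a whole drive at a sustained sequential rate."""
    capacity_bytes = capacity_tb * 1e12        # decimal terabytes (assumption)
    bytes_per_second = throughput_mb_s * 1e6   # decimal megabytes per second (assumption)
    return capacity_bytes / bytes_per_second / 3600


# (year, capacity in TB, throughput in MB/s) taken from the table above
for year, tb, mb_s in [(2012, 4, 100), (2015, 8, 150), (2017, 16, 200)]:
    print(year, round(rebuild_hours(tb, mb_s), 1), "hours")
# 2012 11.1 hours
# 2015 14.8 hours
# 2017 22.2 hours
```

Capacity grows fourfold across the three columns while throughput only doubles, which is why the best-case rebuild time doubles as well.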

[Chart: Single HDD Rebuild Time (Minutes), for 2002 (80 GB HDD), 2012 (4 TB HDD), 2015 (8 TB HDD) and 2017 (16 TB HDD); vertical axis from 0 to 1400 minutes]

Rebuild inflation has consequences:
• Availability and Durability 9s
• Rebuild is a workload!
• Resilience, Reliability
• Disk Capacity & Network Management
• Failure Modes
• Lots of complexity to address edge cases

Could we do better if we had an early warning?


Use Cases & Goals

Cloud
• Proactive Rebuild
• Smarter Ops Scheduling

Enterprise / Field
• Proactive Rebuild
• Better FRU SLA

End-User PC
• Backup Now
• Contingency planning


Part II
Examining the Data.


The Backblaze Dataset

• Backblaze.com: Online Backup and Cloud Storage provider.
• 83K+ drives
• 2013-2016
• Seagate, Hitachi, HGST, Western Digital, Toshiba, Samsung

https://www.backblaze.com/b2/hard-drive-test-data.html
• Hats off to them for sharing their data openly.


Understanding the Data

Dataset:
• 83K drives
• 46M Drive Days (2013-2016)
• >5000 failures
• Daily snapshot of each drive’s health state + SMART metrics
• SMART (Self-Monitoring, Analysis and Reporting Technology): a standard monitoring system included in HDDs and SSDs

Example SMART Attributes:
• SMART_1: Read Error Rate
• SMART_5: Reallocated Sectors Count
• SMART_9: Power On Hours
• SMART_7: Seek Error Rate
• SMART_197: Current Pending Sector Count

https://en.wikipedia.org/wiki/S.M.A.R.T.
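To make "one row per drive per day" concrete, here is a small sketch that counts drives, drive days and failures from snapshot rows. It assumes the column layout of the published CSV files (`date`, `serial_number`, `model`, `capacity_bytes`, `failure`, `smart_N_raw`); the rows themselves are invented for illustration and are not taken from the dataset, and `summarize` is a hypothetical helper name.

```python
import csv
import io

# Invented rows in the shape of a daily snapshot file (values are illustrative only).
SAMPLE = """date,serial_number,model,capacity_bytes,failure,smart_5_raw,smart_197_raw
2016-01-01,SN001,MODEL_A,4000787030016,0,0,0
2016-01-01,SN002,MODEL_A,4000787030016,0,8,0
2016-01-02,SN001,MODEL_A,4000787030016,0,0,0
2016-01-02,SN002,MODEL_A,4000787030016,1,16,24
"""


def summarize(csv_text):
    """Count distinct drives, drive days (rows) and failure events."""
    drives, drive_days, failures = set(), 0, 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        drives.add(row["serial_number"])
        drive_days += 1                   # one row = one drive observed for one day
        failures += int(row["failure"])   # 1 on the day the drive is marked failed
    return {"drives": len(drives), "drive_days": drive_days, "failures": failures}


print(summarize(SAMPLE))
# {'drives': 2, 'drive_days': 4, 'failures': 1}
```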


SMART 5: Reallocated Sector Count

Sample of 7678 drives (50% failed / 50% healthy)


SMART 1: Read Error Rate


SMART 9: Power On Hours


SMART 197: Current Pending Sector Count over 30 days



Part III
Disk failures can be predicted. Not perfectly. But better than simple heuristics.


Performance Metric

• Goals:
  • Detect a rare event (1/20 or less)
  • Tune depending on Use Case
    • Tolerance for False Positives
    • vs. Tolerance for False Negatives
  • We want to maximize our ability to make better tradeoffs

• Performance Metric:
  • PR.AUC: Area Under the Precision-Recall curve
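One common way to approximate the area under a precision-recall curve is average precision: rank the drives by predicted score and, at every true failure, add the precision reached so far. The sketch below is an illustration of the metric, not necessarily the exact computation used for the talk; it assumes no tied scores, and the toy labels and scores are made up.

```python
import random


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Each failure found raises recall by 1/total_positives at this precision.
            area += true_positives / rank
    return area / total_positives


# Toy example: 1 = failed, 0 = healthy.
print(average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1]))  # (1/1 + 2/3) / 2

# A model that scores at random should land near the failure rate (here 1 in 20).
rng = random.Random(0)
labels = [1] * 500 + [0] * 9500
random_scores = [rng.random() for _ in labels]
print(average_precision(labels, random_scores))
```

Because a random ranking scores close to the share of failed drives, PR.AUC at a 1-in-20 failure rate has a floor of roughly 0.05 rather than 0.5, which makes gains on the rare class easy to see.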

[Figure: precision-recall curve, Precision plotted against Recall; PR.AUC is the area under the curve]

Baseline Heuristic

• If any of the critical SMART attributes > 0, then the drive is likely to fail
  • SMART_5: Reallocated Sectors Count
  • SMART_187: Reported Uncorrectable
  • SMART_188: Command Timeout
  • SMART_197: Current Pending Sector Count
  • SMART_198: Offline Uncorrectable
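The heuristic above fits in a few lines. This sketch assumes the rule is applied to raw SMART values and that an attribute a drive does not report counts as zero; neither detail is stated on the slide, and the function name is illustrative.

```python
# Attribute IDs listed on the slide as critical.
CRITICAL_ATTRS = (5, 187, 188, 197, 198)


def baseline_predicts_failure(smart_raw):
    """smart_raw maps SMART attribute ID -> raw value for one drive on one day.

    Assumption: attributes the drive does not report are treated as 0.
    """
    return any(smart_raw.get(attr, 0) > 0 for attr in CRITICAL_ATTRS)


print(baseline_predicts_failure({5: 0, 9: 12000, 197: 8}))  # True: pending sectors > 0
print(baseline_predicts_failure({5: 0, 9: 12000}))          # False: power-on hours is not critical
```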


Baseline Performance

• Evaluation dataset:
  • 13980 drives
  • 699 failed
  • 13281 healthy

• Precision: 42%
  • (i.e. 58% of predicted failures are false positives)

• Recall: 68%
  • (i.e. 32% of actual failures are missed as false negatives)

Confusion Matrix

                   Baseline Prediction
                   Healthy    Failed
Truth   Healthy    12625      656
        Failed     223        476
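The precision and recall figures can be checked directly against the confusion matrix. A short sketch, with "failed" as the positive class:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (failed) class."""
    precision = tp / (tp + fp)   # share of predicted failures that really failed
    recall = tp / (tp + fn)      # share of real failures that were predicted
    return precision, recall


# Counts from the confusion matrix above.
tn, fp = 12625, 656   # truly healthy drives: predicted healthy / predicted failed
fn, tp = 223, 476     # truly failed drives:  predicted healthy / predicted failed

precision, recall = precision_recall(tp, fp, fn)
failed_share = (tp + fn) / (tn + fp + fn + tp)
print(f"precision={precision:.2f} recall={recall:.2f} failed share={failed_share:.2f}")
# precision=0.42 recall=0.68 failed share=0.05
```

The failed share of 0.05 matches the 95/05 evaluation mix, so a baseline precision of 42% is already about eight times better than flagging drives at random.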


Approach

• Split the data into Train/Test and Eval
  • 2013-2015: Train/Test
  • 2016: Eval

• Sample from the Train data at 50/50
  • (Learn equally from failed/healthy drives)

• Sample from the Eval data at 95/05
  • (Evaluate at a fixed real-life healthy/failed mix)
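A minimal sketch of the split and the two sampling ratios. It assumes every failed drive is kept and healthy drives are downsampled to reach the target mix; the slides do not say how the sampling was actually done. The record fields `year` and `failed` and both function names are hypothetical.

```python
import random


def split_by_year(records):
    """2013-2015 goes to train/test, 2016 is held out for evaluation."""
    train_test = [r for r in records if 2013 <= r["year"] <= 2015]
    eval_set = [r for r in records if r["year"] == 2016]
    return train_test, eval_set


def resample(records, failed_fraction, rng):
    """Keep all failed drives; downsample healthy ones to hit failed_fraction."""
    failed = [r for r in records if r["failed"]]
    healthy = [r for r in records if not r["failed"]]
    n_healthy = round(len(failed) * (1 - failed_fraction) / failed_fraction)
    n_healthy = min(n_healthy, len(healthy))   # cannot draw more than exist
    sample = failed + rng.sample(healthy, n_healthy)
    rng.shuffle(sample)
    return sample


# Synthetic records purely for illustration.
records = (
    [{"year": 2014, "failed": True}] * 30 + [{"year": 2014, "failed": False}] * 500
    + [{"year": 2016, "failed": True}] * 20 + [{"year": 2016, "failed": False}] * 1000
)
rng = random.Random(42)
train_test, eval_set = split_by_year(records)
train_sample = resample(train_test, 0.50, rng)   # 50/50 failed/healthy
eval_sample = resample(eval_set, 0.05, rng)      # 95/05 healthy/failed
print(len(train_sample), len(eval_sample))
# 60 400
```

With 699 failed drives and a 5% target, the same formula yields 13281 healthy drives, which is the evaluation set size reported for the baseline.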


Feature management challenges

• Inconsistency of SMART data support across vendors and drive models
• Data Sparseness
• Opacity of most SMART metrics
• Further opacity of Normalized SMART values
• Wide Range for most SMART values
• Dataset skewness by vendor and model
• Some gaps and inconsistencies in the telemetry data


Feature Selection & Models

• Feature Selection:
  • Raw over Normalized SMART data
  • Created a model_fam feature to collapse vendor/model
  • Z-Score Normalization of RAW values
  • 3-day, 5-day rolling Variance for all SMART RAW values
  • and a few things which didn’t work as well as initially hoped :)

• Models:
  • Random Forests
  • Logistic Regression
  • Support Vector Machines

• Goal:
  • Improve on the baseline
  • Expand options to tune the Decision Threshold for various Precision vs. Recall tradeoffs (based on Use Case)
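Two of the feature transforms named above, z-score normalization and rolling variance, can be sketched with the standard library. The choices below (population statistics, trailing windows, `None` until a window is full) are assumptions; the slides do not specify them, nor over which group of drives the z-scores were computed.

```python
from statistics import mean, pstdev, pvariance


def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # constant series: no spread to scale by
    return [(v - mu) / sigma for v in values]


def rolling_variance(values, window):
    """Population variance over each trailing window; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(pvariance(values[i + 1 - window : i + 1]))
    return out


# Illustrative daily raw values of one SMART attribute for one drive.
pending_sectors = [0.0, 0.0, 0.0, 8.0, 16.0]
print(rolling_variance(pending_sectors, window=3))
print(z_scores(pending_sectors))
```

A counter that sits at zero and then starts climbing produces a jump in rolling variance, which is the kind of change a static "greater than zero" rule cannot rank by severity.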


Model 1: Random Forests

[Figure: Single Decision Tree vs. Random Forest]

Model 2: Logistic Regression

• Sigmoid (smooth step) function to model a binomial output (0/1)

• Works well for linear continuous numerical inputs
  • Corollary: works horribly for categorical or non-linear variables that appear to be continuous numerical

• In our case the key is to:
  • Isolate the right SMART metrics
  • Normalize where needed
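To illustrate how a sigmoid turns a weighted sum of features into a failure probability that can then be thresholded, here is a small sketch. The weights and bias are hypothetical and not fitted to any data; the two inputs stand in for z-scored SMART features.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; returns a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def failure_probability(features, weights, bias):
    """Linear combination of the inputs passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two z-scored SMART features (illustration only).
weights, bias = [1.5, 2.0], -3.0
p = failure_probability([2.0, 1.0], weights, bias)   # z = -3 + 3 + 2 = 2
print(round(p, 3), p >= 0.5, p >= 0.9)
# 0.881 True False
```

The same drive is flagged at a threshold of 0.5 but not at 0.9, which is the precision vs. recall dial: a lower threshold catches more failures at the cost of more false alarms.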


Model 3: Support Vector Machines

• Binary classifier based on mapping non-linear data into a higher dimension to make linear separation possible

• Maximizes the margin of separation between +/- categories

• In our case the key was still to isolate the right SMART metrics before training

By Alisneaky, svg version by User:Zirguezi - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=47868867
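The idea of mapping data into a higher dimension can be shown with a toy example. This sketch is not an SVM solver: it only demonstrates that four made-up points which no single threshold can split in one dimension become linearly separable after the explicit mapping x -> (x, x²). All names and values are illustrative.

```python
# Made-up 1-D points: label 1 sits on the outside, label 0 in the middle.
points = [(-3.0, 1), (-1.0, 0), (1.0, 0), (3.0, 1)]


def lift(x):
    """Map a 1-D value into 2-D by adding its square as a second coordinate."""
    return (x, x * x)


def separable_by_threshold_1d(points):
    """True if a single threshold on x splits the two classes."""
    labels = [label for _, label in sorted(points)]
    # Separable in 1-D only if the sorted labels switch value at most once.
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1


def classify_lifted(x, threshold=5.0):
    """In the lifted space the horizontal line x² = 5 separates the classes.

    5 is midway between the squared values of the two classes (1 and 9).
    """
    _, x_squared = lift(x)
    return 1 if x_squared > threshold else 0


print(separable_by_threshold_1d(points))                                # False
print(all(classify_lifted(x) == label for x, label in points))         # True
```

A kernel SVM gets the same effect without building the lifted coordinates explicitly, and additionally picks the separating boundary with the largest margin.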


Results: Seagate drives. Day Zero

Random Model AUC = 0.05

Results: Seagate drives. Day -1

Random Model AUC = 0.05

Random Model AUC = 0.05

Results: Hitachi drives. Day Zero

Random Model AUC = 0.05

Results: Hitachi drives. Day -1

Random Model AUC = 0.05

Feature Importance – SMART Raw – All Models

[Chart: Importance (Random Forests model over SMART Raw features); axis from 0 to 4000]


Feature Importance – SMART Variance – Seagate

[Chart: Importance (Random Forests model over SMART Rolling Variance features); axis from 0 to 400]


Summary

• 2% of disks fail annually, on average. Mileage varies by model.
• SMART metrics can clearly signal failures, sometimes days before they happen
• We can train/predict across some of the drive models, overcoming training data sparsity
• We can reasonably train models to predict drive failure using SMART data and improve upon existing heuristics
• Picking and tuning the right model depends on Use Case and goals
  • Precision vs. Recall tradeoff
  • Different models give you more options
• More Data is better. Duh!

