Before Terabytes Fall
Disk reliability in Windows Vista and beyond
Frank Shu, Program Manager, WDEG-Storage, Microsoft Corporation
Matthew Kerner, Program Manager, Windows Diagnosis, Microsoft Corporation
Windows Storage Devices: Strategic pillars
Optical Platform (Client/Consumer): Timely, comprehensive, quality platform support for optical devices
Personal Storage (Client/Consumer): Optimized platform features enabling your Windows experience, here and now
Storage Fabrics (Server/Enterprise): Leading platform enabling storage fabric adoption
Preferred Storage Platform (Partner/Customer): Preferred platform for developing, deploying, and using storage devices
Session Outline
Introduction (Frank Shu)
Windows Vista Disk Diagnostics (Matthew Kerner)
Future Technology (Frank Shu)
Demo (Microsoft and Samsung)
What Matters Most To Our Users?
A consumer bought a new computer, and it works great at work and at home. She couldn’t do her everyday tasks without it. What matters most to her?
a) CPU power
b) Network connection
c) Battery life
d) Something else…
The Answer Is…
The Data
Protecting Data: Windows Vista disk diagnostics
Matthew Kerner
Quantifying Disk Failures
Catastrophic disk failures
  ~200 disks replaced per week at Microsoft in 2003
  Top driver of Microsoft support’s hardware-related support calls in both client and server
  Based on Microsoft figures, disk failures cost many millions of dollars per year in enterprises
Localized failures (bad blocks)
  Kernel and user-mode crashes
    1.7% of customer-reported Microsoft Online Crash Analysis crashes are due to disk errors
  Application hangs during read recovery
Disk Failure Mitigations
Prevention
  Hybrid hard disks (mobile systems)
  RAID
Catastrophic failure recovery
  Data backup
  Disk replacement
Localized failure recovery
  Repair from redundant copy
  Restore from backup
Windows Vista Disk Diagnostics
Purpose: Save user data before catastrophic disk failure
  Client SKUs
Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) polling triggers diagnostic
  Uses S.M.A.R.T. trip status, with no threshold/attribute comparison
Warns user of impending failure and walks them through backup and replacement
  Windows Vista backup improvements
Disk Diagnostics Details
Disk class driver polls S.M.A.R.T. status hourly, as it has done since Windows 2000
  Based on industry feedback, no use of Disk Self-Test or attribute comparison
Failure triggers user-mode code
  Filter out duplicate failures
  Log SMART READ LOG details to OS event log
    Device error count from summary error log sector
    Life timestamp from most recent error log entry
  Trigger user-context interactive resolution
    Customizable by Group Policy
    Print instructions, walk user through backup
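The flow on this slide (hourly poll, duplicate filtering, event logging, user notification) can be sketched roughly as follows. This is a minimal illustration, not the Windows implementation: the class names, field names, and the in-memory "event log" are all hypothetical stand-ins, and the poll result is passed in rather than read from a real drive.

```python
from dataclasses import dataclass, field


@dataclass
class SmartPollResult:
    """One hourly poll result for a disk (hypothetical structure)."""
    disk_id: str
    tripped: bool                   # S.M.A.R.T. trip status: True = failure predicted
    device_error_count: int = 0     # would come from the summary error log sector
    life_timestamp_hours: int = 0   # would come from the most recent error log entry


@dataclass
class DiskDiagnostic:
    """Hypothetical user-mode handler triggered by a tripped status."""
    event_log: list = field(default_factory=list)
    notified_disks: list = field(default_factory=list)
    _already_reported: set = field(default_factory=set)

    def on_poll(self, result: SmartPollResult) -> bool:
        """Handle one poll result; return True if the user was notified."""
        if not result.tripped:
            return False                      # healthy: nothing to do
        if result.disk_id in self._already_reported:
            return False                      # filter out duplicate failures
        self._already_reported.add(result.disk_id)
        # Log the SMART READ LOG details named on the slide.
        self.event_log.append({
            "disk": result.disk_id,
            "device_error_count": result.device_error_count,
            "life_timestamp_hours": result.life_timestamp_hours,
        })
        # Stand-in for the interactive backup/replacement walkthrough.
        self.notified_disks.append(result.disk_id)
        return True


diag = DiskDiagnostic()
diag.on_poll(SmartPollResult("disk0", tripped=False))
diag.on_poll(SmartPollResult("disk0", tripped=True, device_error_count=3,
                             life_timestamp_hours=12000))
diag.on_poll(SmartPollResult("disk0", tripped=True))  # duplicate, ignored
print(len(diag.event_log), diag.notified_disks)
```

Only the trip status drives the decision here, mirroring the slide's point that no attribute or threshold comparison is done on the host.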
Startup Repair/Windows Recovery Environment
Purpose: Recover from non-bootable states, including those caused by disk failures
Automatic failover on boot failure to recovery partition
  Optionally deployed by OEM
  Available on installation media
Hands-free diagnosis and repair of top non-boot issues
Corrupted File Recovery
Purpose: Turn repeat user-mode crashes caused by corrupted system binaries into a one-time crash with silent repair from cache
Windows Error Reporting crash handler triggers diagnostic on inpage error crashes due to bad blocks
  Diagnoses corrupted system files
  Silent repair from System File Cache
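The diagnose-then-repair idea above can be illustrated with a small sketch. The slides do not say how Windows detects a corrupted binary, so the SHA-256 comparison, the function names, and the file layout below are assumptions made purely for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_from_cache(target: Path, cached_copy: Path, expected_sha256: str) -> str:
    """Hypothetical repair step: returns 'ok', 'repaired', or 'unrepairable'.

    Assumes a known-good digest is available for each protected file.
    """
    if target.exists() and sha256_of(target) == expected_sha256:
        return "ok"                           # file is intact, nothing to repair
    if cached_copy.exists() and sha256_of(cached_copy) == expected_sha256:
        shutil.copyfile(cached_copy, target)  # silent repair from the cached copy
        return "repaired"
    return "unrepairable"                     # cache is missing or also damaged
```

After a successful repair, the next launch reads the restored file, which is how a repeating crash becomes a one-time crash.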
Windows Vista Disk Diagnostics
Matthew Kerner, Program Manager, Windows Diagnosis
Opportunities For Future Technology
Proactive failure prevention
Reduce scenario pain by enabling resolutions other than just data recovery
  Requires finer-grained failure description to help host choose the best resolution
Increase warning time before failures to allow users to save data
Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively
Frank Shu
What Is PRCS?
Proactive Reporting and Correcting Safeguard (PRCS) enables a device and host to correct failure conditions proactively
  Device can report hostile conditions before damage or failure occurs
  Host reacts to a device event in real time based on policy and user preference
A proposal for the PRCS protocol has been submitted to T13
Why Is PRCS Important?
Users’ digital data is more valuable than ever before
Disk drive capacity continues to increase
Not every PC user can afford RAID
Deliver on opportunities for improvements beyond S.M.A.R.T.
Goals Of PRCS
Proactively protect user data
Improve the user experience when data is at risk
Reduce OEM’s customer support costs
Reduce warranty costs for disk drive vendors
PRCS Features
Device monitors its own conditions in real time
  Reduces host monitoring performance impact
Device sends meaningful PRCS events to the host for correction of hostile conditions and data protection
  No translations or guesses required
Host acts on device’s PRCS event proactively according to policy and user preference
PRCS Advantages
PRCS is proactive
  Taking a corrective action before errors occur
  Protecting data when it is at risk
PRCS is designed for end users, not just computer experts
  No need to understand a cryptic message to benefit from PRCS. For example: “The previous self-test completed having the electrical element of the test failed”
  PRCS enables transparent mitigation of a hostile condition or a recovery process
  Users do not need to configure a self-test mode or reporting method
  Users control policy as desired
Proactive Disk Diagnostics
Debasis Baral, Vice President of Engineering, Samsung
HDD Reliability 101
HDD reliability and performance are negatively impacted by extremes in the following operating conditions
  Temperature (demo)
  Vibration (demo)
  Shock (demo)
  Duty cycle
  Altitude
  Humidity
  A combination of the above conditions
  A history of the above combinations
Reliability Versus Temperature
HDD life decreases with temperature
  Failure rates increase exponentially with temperature for all HDD suppliers
  Environmental temperature increase from 25 °C to 100 °C could translate into 10x to 50x shorter life
Ref.: Samsung reliability tests
Samsung HDD Lab Engineering Sample Data
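To make the 10x to 50x figure more tangible, the arithmetic below assumes a pure exponential model, life(T) proportional to exp(-k * T), which is a simplification chosen for illustration and not a model stated in the slides. Under that assumption, the quoted range over a 75 °C rise implies that drive life halves roughly every 13 to 23 °C.

```python
import math


def implied_rate_per_degree(life_ratio: float, delta_t_c: float) -> float:
    """Rate k (per °C) such that exp(k * delta_t_c) equals life_ratio."""
    return math.log(life_ratio) / delta_t_c


def halving_interval_c(life_ratio: float, delta_t_c: float) -> float:
    """Temperature rise (°C) that halves life under the exponential assumption."""
    return math.log(2) / implied_rate_per_degree(life_ratio, delta_t_c)


delta_t = 100 - 25  # the 25 °C to 100 °C span quoted on the slide
for ratio in (10, 50):
    k = implied_rate_per_degree(ratio, delta_t)
    print(f"{ratio}x shorter life: k = {k:.4f} per °C, "
          f"life halves every {halving_interval_c(ratio, delta_t):.1f} °C")
```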
Performance Versus Vibration
Data throughput or drive performance can be significantly affected in the presence of vibration
Effect of vibration is reversible
Cumulative effects of vibration on long-term drive reliability are a subject of ongoing research
Figure: Performance Loss With Vibration. Throughput (MB/s) and off-track error (% of track pitch) plotted against vibration level (arbitrary units, 0.05 to 1.30).
Samsung HDD Lab Engineering Sample Data
Reliability Versus Shock
Excessive shock is the major cause of failure in both PC and consumer electronics environments
Shock Modeling
Courtesy: E. Jayson and Frank Talke, UC San Diego
Op. Shock Scratches
Damage by corners, leading edge, and side edges of the slider.
Operating shock damage
Non-operating shock damage
Reliability Design Guidelines
Failure modes and failure rates of disk drives depend on their operating environments
Temperature and handling (shock and vibration) are major factors impacting HDD reliability
HDD reliability will be enhanced if OS detects and manages reliability risks and stress events intelligently (PRCS)
Users can improve HDD data reliability by correctly responding to PRCS events
PRCS
Kai Chen, Microsoft Corporation
Debasis Baral, Samsung
Call To Action
Test your drives with Windows Vista Disk Diagnostics and send feedback
Ensure your drives comply with ATA-7 specs to surface device error count and life timestamp
Engage with the Startup Repair team to build a plan for Startup Repair in OEM factory processes
Participate in T13 discussions on PRCS
Plan your device designs in line with PRCS guidelines
Additional Resources
Whitepapers
  Windows Recovery Environment/Startup Repair/Built-in Diagnostics: http://www.microsoft.com/technet/windowsvista/evaluate/feat/relperf.mspx
Feedback/Questions
  Windows Vista Disk Diagnosis: Dfdfeed @ microsoft.com
  Corrupt File Recovery: Dfdfeed @ microsoft.com
  Windows Recovery Environment/Startup Repair: Recovery @ microsoft.com
  PRCS: Prcsdisc @ microsoft.com
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.