A Survey of Power-Saving Techniques for Storage Systems

An-I Andy Wang, Florida State University. May 3-4.



Energy-efficient Techniques for Storage Systems

Why Care about the Energy Consumption of Storage?
Relevant for mobile devices
Energy consumption of disk drives: 8% for laptops; 40% of the electricity cost for data centers
  More energy -> more heat -> more cooling -> lower computational density -> more space -> higher costs
Cost aside, the power infrastructure is fixed: we need to power more with less
[Lampe-Onnerud 2008; Gallinaro 2009; Schulz 2010]

Compared to Other Components
CPU (Xeon X5670): 16W per core when active, near-zero idle power
Disk (Hitachi Deskstar 7K1000): 12W active, 8W idle

How about Flash?
Samsung SSD SM825: 1.8/3.4W active (read/write), 1.3W idle
But 10x the $/GB: green, but maybe too green
Energy-efficient techniques need to meet diverse constraints: total cost of ownership (TCO), performance, capacity, reliability, etc.

TCO Example
Facebook: 100 petabytes (1 PB = 10^15 bytes)
Assumption: roughly $1/year of electricity per watt of continuous power
Using Hitachi Deskstar 7K1000 1TB disks: $7M for 90K disks, $960K/year for electricity
Using Hitachi Z5K500 500GB laptop disks: $11M for 190K disks, $261K/year for electricity
Flash? Don't even think about it.
[Ziegler 2012]
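A minimal sketch of the arithmetic behind these figures. The disk counts, unit prices, and active power draws come from the slides above; the electricity rate (about $0.89 per watt-year, i.e., roughly $0.10/kWh) is an assumption chosen to match the slide's "$1/year per watt" rule of thumb.

```python
# Rough TCO arithmetic for the 100 PB example (assumed rate: ~$0.89 per watt-year).
WATT_YEAR_COST = 0.89  # dollars per watt of continuous draw per year (assumption)

def tco(disks, unit_price, active_watts):
    purchase = disks * unit_price
    electricity_per_year = disks * active_watts * WATT_YEAR_COST
    return purchase, electricity_per_year

for name, disks, price, watts in [
    ("Hitachi Deskstar 7K1000 (1TB)", 90_000, 80, 12.0),
    ("Hitachi Z5K500 (500GB laptop)", 190_000, 61, 1.6),
]:
    purchase, power = tco(disks, price, watts)
    print(f"{name}: ${purchase/1e6:.1f}M purchase, ${power/1e3:.0f}K/year electricity")
# -> roughly $7.2M / $961K per year and $11.6M / $271K per year,
#    in line with the slide's figures
```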

Worse
Exponential growth in storage demand: data centers, cloud computing
Limited growth in storage density, for both disk and flash devices
Implication: storage can be both a performance and an energy bottleneck

Roadmap
Software storage stack overview
Power-saving techniques for different storage layers:
  Hardware
  Device/multi-device driver
  File system
  Cache
  Application
By no means exhaustive

Software Storage Stack
(Diagram) User level: applications such as databases and search engines. Operating-system level: the virtual file system (VFS) on top of file systems (e.g., Ext3, JFFS2), multi-device drivers, and device drivers (disk driver, NFTL, MTD drivers). Hardware sits below.

Hardware Level
Common storage media: disk drives, flash devices
Energy-saving techniques: higher-capacity disks, smaller rotating platters, slower/variable RPM, hybrid drives

Hard Disk
50-year-old storage technology
Disk access time = seek time + rotational delay + transfer time

(Figure: disk platters, disk arm, disk heads)

Energy Modes
Read/write modes
Active mode (head is not parked)
Idle mode (head is parked, disk still spinning)
Standby mode (disk is spun down)
Sleep mode (minimum power)

Hitachi Deskstar 7K1000 1TB
Average access time: 13ms
  Seek time: 9ms
  7200 RPM: 4ms for rotation
  Transfer time for 4KB: 0.1ms (transfer rate of 37.5 MB/s)
Power: 30W startup; 12W active, 8W idle, 3.7W low-RPM idle

Hitachi Deskstar 7K1000 1TB (continued)
Reliability: 50K power cycles (27 cycles/day for 5 years)
Error rate: 1 error per 100TB transferred (350GB/day for 5 years); limits the growth in disk capacity
Price: $80

Hitachi Z5K500 500GB ($61)
Average access time: 18ms
  Seek time: 13ms
  5400 RPM: 5ms for rotation
  Transfer time for 4KB: 0.03ms (transfer rate of 125.5 MB/s)
Power: 4.5W startup; 1.6W active, 1.5W idle, 0.1W sleep
Reliability: 600K power cycles (13/hr)
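A small sketch of the access-time model above, plugging in the drive parameters from these slides (the rotational delay is taken as half a revolution on average).

```python
# Disk access time = seek time + rotational delay + transfer time.
def access_time_ms(seek_ms, rpm, request_bytes, transfer_mb_per_s):
    rotational_delay_ms = 0.5 * (60_000.0 / rpm)          # on average, half a rotation
    transfer_ms = request_bytes / (transfer_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_delay_ms + transfer_ms

# Hitachi Deskstar 7K1000: 9ms seek, 7200 RPM, 37.5 MB/s transfer
print(access_time_ms(9.0, 7200, 4096, 37.5))    # ~13.3 ms, matching the slide's 13ms
# Hitachi Z5K500: 13ms seek, 5400 RPM, 125.5 MB/s transfer
print(access_time_ms(13.0, 5400, 4096, 125.5))  # ~18.6 ms, matching the slide's 18ms
```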

Flash Storage Devices
A form of solid-state memory, similar to ROM
Holds data without a power supply
Reads are fast
Can be written once, more slowly
Can be erased, but very slowly
Limited number of erase cycles before degradation (10,000 to 100,000)

Physical Characteristics
(figure)

NOR Flash
Used in cellular phones and PDAs
Byte-addressable: can write and erase individual bytes
Can execute programs

NAND Flash
Used in digital cameras and thumb drives
Page-addressable: 1 flash page ~= 1 disk block (1-4KB)
Cannot run programs
Erased in flash blocks, each consisting of 4-64 flash pages

Writing in Flash Memory
If writing to an empty flash page (~disk block), just write
If writing to a previously written location, erase it, then write
While erasing a flash block, other pages may still be accessed via other IO channels
Number of channels limited by power (e.g., 16 channels max)

Implications of Slow Erases
Use a flash translation layer (FTL): write the new version elsewhere, erase the old version later

Implications of Limited Erase Cycles
Wear-leveling mechanism: spread erases uniformly across storage locations
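To make the FTL and wear-leveling ideas concrete, here is a minimal sketch (not any particular vendor's FTL): logical pages are remapped on every write, the old physical page is only marked invalid, and whole-block erases are deferred and spread across blocks.

```python
# Toy flash translation layer: out-of-place writes with deferred erases.
class ToyFTL:
    def __init__(self, pages_per_block=4, blocks=8):
        self.ppb = pages_per_block
        self.free = list(range(pages_per_block * blocks))  # physical pages ready to program
        self.mapping = {}        # logical page -> physical page
        self.invalid = set()     # stale physical pages awaiting erase
        self.erase_count = [0] * blocks

    def write(self, logical_page, data):
        old = self.mapping.get(logical_page)
        if old is not None:
            self.invalid.add(old)          # do NOT erase in the write path
        phys = self.free.pop(0)            # program a fresh (already-erased) page
        self.mapping[logical_page] = phys
        # ... program `data` into physical page `phys` here ...

    def garbage_collect(self):
        # Later: erase blocks whose pages are all stale, preferring the least-erased
        # block (a crude form of wear leveling), and recycle their pages.
        for block in sorted(range(len(self.erase_count)), key=lambda b: self.erase_count[b]):
            pages = range(block * self.ppb, (block + 1) * self.ppb)
            if all(p in self.invalid for p in pages):
                self.erase_count[block] += 1
                for p in pages:
                    self.invalid.discard(p)
                    self.free.append(p)
```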

Multi-level Cells (MLC)
Use multiple voltage levels to represent bits

Implications of MLC
Higher density lowers the price/GB
The number of voltage levels increases exponentially for a linear increase in density, so the density gains max out quickly
Reliability and performance decrease as the number of voltage levels increases
  Need a guard band between two voltage levels
  Takes longer to program (incremental stepped pulse programming)
[Grupp et al. 2012]

Samsung SM825 400GB
Access time (4KB): read 0.02ms, write 0.09ms, erase not mentioned
Transfer rate: 220 MB/s
Power: 1.8/3.4W active (read/write), 1.3W idle

Samsung SM825 (continued)
Reliability: 17,500 erase cycles
  Can write 7PB before failure (4TB/day, or 44MB/s, sustained for 5 years)
  Perhaps wear leveling is no longer relevant: assuming 2% content change per day plus a 10x write-amplification factor gives only 80GB/day
Error rate: 1 in 13PB
Price: not released yet; at least $320 based on its prior 256GB model
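A quick sketch of the endurance arithmetic on this slide, using only the numbers given (400GB capacity, 17,500 erase cycles, a 5-year horizon, 2% daily change, 10x write amplification).

```python
# Endurance arithmetic for the Samsung SM825 400GB example.
capacity_tb = 0.4
erase_cycles = 17_500
lifetime_writes_pb = capacity_tb * erase_cycles / 1000    # ~7 PB before wear-out
per_day_tb = lifetime_writes_pb * 1000 / (5 * 365)        # ~3.8 TB/day over 5 years
sustained_mb_s = per_day_tb * 1e6 / 86_400                # ~44 MB/s, continuously
workload_gb_day = 400 * 0.02 * 10                         # 2%/day changed, 10x amplification ~= 80 GB/day
print(round(lifetime_writes_pb, 1), round(per_day_tb, 1),
      round(sustained_mb_s, 1), workload_gb_day)
```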

Overall Comparisons
Average disks
+ Cheap capacity
+ Good bandwidth
- Poor power consumption
- Poor average access times
- Limited number of power cycles
- Density limited by the error rate

Flash devices
+ Good performance
+ Low power
- More expensive
- Limited number of erase cycles
- Density limited by the number of voltage levels

HW Power-saving Techniques
Higher-capacity disks
Smaller disk platters
Disks with slower RPMs
Variable-RPM disks
Disk-flash hybrid drives
[Battles et al. 2007]

Higher-capacity Disks
Consolidate content onto fewer disks
+ Significant power savings
- Significant decrease in parallelism

Smaller Platters, Slower RPM
IBM Microdrive 1GB ($130)
Average access time: 20ms
  Seek time: 12ms
  3600 RPM: 8ms for rotation
  Transfer time for 4KB: 0.3ms (transfer rate of 13 MB/s)
Power: 0.8W active, 0.06W idle
Reliability: 300K power cycles

Smaller Platters, Slower RPM
IBM Microdrive 1GB ($130)
+ Low power
+ Small physical dimensions (for mobile devices)
- Poor performance
- Low capacity
- High price

Variable RPM Disks
Western Digital Caviar Green 3TB
Average access time: N/A
Peak transfer rate: 123 MB/s
Power: 6W active, 5.5W idle, 0.8W sleep
Reliability: 600K power cycles
Cost: $222

Variable RPM Disks
Western Digital Caviar Green 3TB
+ Low power
+ Capacity beyond mobile computing
- Potentially high latency
- Reduced reliability? Switching RPMs may count against the power-cycle budget
- Price somewhat higher than disks with the same capacity

Hybrid Drives
Seagate Momentus XT 750GB ($110), with 8GB of flash
Average access time: 17ms
  Seek time: 13ms
  7200 RPM: 4ms for rotation
  Transfer time for 4KB: negligible
Power: 3.3W active, 1.1W idle
Reliability: 600K power cycles

Hybrid Drives
Seagate Momentus XT 750GB ($110)
+ Good performance with good locality, especially if flash stores frequently accessed read-only data
- Reduced reliability? Flash used as a write buffer may not have enough erase cycles
- Some price markup

Device-driver-level Techniques
General descriptions
Energy-saving techniques: spin down disks; use flash to cache disk content

Device Drivers
Carry out medium- and vendor-specific operations and optimizations
Examples
  Disk: reorder requests according to seek distances
  Flash: remap writes via the FTL to avoid erases; carry out wear leveling

Spin down Disks When Idle
Save power when: power saved > power needed to spin up

(Figure: power vs. time — active, idle, spin down, spin up; spinning up takes ~10 seconds)

Spin down Disks When Idle
Prediction techniques
  Spin down whenever the disk has been idle for more than x seconds (typically 1-10 seconds)
  Probabilistic cost-benefit analysis
  Correlate sequences of program counters with the length of subsequent idle periods
[Douglis et al. 1994; Li et al. 1994; Krishnan et al. 1999; Gniady et al. 2006]
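A minimal sketch of the simplest policy above (a fixed idle timeout), together with the break-even condition from the previous slide. The power and spin-up energy numbers are placeholders for illustration, not measurements from the cited papers.

```python
# Fixed-timeout spin-down policy plus the break-even condition (illustrative numbers).
IDLE_POWER_W = 8.0           # platters spinning, no I/O
STANDBY_POWER_W = 1.0        # spun down
SPINUP_ENERGY_J = 300.0      # extra energy to spin back up (e.g., ~30W for ~10s)

def breakeven_s():
    """Idle time beyond which spinning down saves net energy (about 43s here)."""
    return SPINUP_ENERGY_J / (IDLE_POWER_W - STANDBY_POWER_W)

def spin_down_now(idle_so_far_s, timeout_s=5.0):
    """Threshold policy: spin down once the disk has been idle for timeout_s seconds."""
    return idle_so_far_s >= timeout_s

def energy_saved_j(idle_period_s, timeout_s=5.0):
    """Net energy saved over one idle period if we spin down after timeout_s."""
    down_time = max(0.0, idle_period_s - timeout_s)
    return down_time * (IDLE_POWER_W - STANDBY_POWER_W) - SPINUP_ENERGY_J

# Negative savings for short idle periods illustrate why prediction matters:
for idle in (10, 60, 600):
    print(idle, round(energy_saved_j(idle), 1))
```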

Spin down Disks When Idle
+ No special hardware
- Potentially high latency at times
- Need to consider the total number of power cycles

Use Flash for Caching
FlashCache sits between DRAM and disk
Reduces disk traffic for both reads and writes, so the disk can be switched to low-power modes when possible
A read miss brings in a block from disk
A write request first modifies the block in DRAM; when DRAM is full, the block is flushed to flash; when flash is full, it is flushed to disk
[Marsh et al. 1994]

Use Flash for Caching (continued)
FlashCache keeps three LRU lists on flash:
  Free list (already erased)
  Clean list (can be erased)
  Dirty list (needs to be flushed)
Identified the applicability of LFS-style management to flash
+ Reduces energy usage by up to 40%
+ Reduces overall response time by up to 70%
- Increases read response time by up to 50%
[Marsh et al. 1994]

Multi-device-level Techniques
General descriptions: common software RAID classifications
Energy-saving techniques: spin down individual disks; use cache disks; use variable-RPM disks; regenerate content; replicate data; use transposed mirroring

Multi-device Layer
Allows the use of multiple storage devices, but increases the chance of a single-device failure
RAID (Redundant Array of Independent Disks): the standard way of organizing and classifying the reliability of multi-disk systems

RAID Level 0
No redundancy
Uses block-level striping across disks: 1st block stored on disk 1, 2nd block on disk 2, and so on
Any disk failure causes data loss

RAID Level 1 (Mirrored Disks)
Each disk has a second disk that mirrors its contents
Writes go to both disks; no data striping
+ Reliability is doubled
+ Read access is faster
- Write access is slower
- Double the hardware cost

RAID Level 5
Data striped in blocks across disks
Each stripe contains a parity block
Parity blocks from different stripes are spread across disks to avoid bottlenecks

(Figure: RAID 5 reads and writes involving the parity block)

RAID Level 5 (continued)
+ Parallelism
+ Can survive a single-disk failure
- Small writes involve 4 IOs: read the data and parity blocks, then write the data and parity blocks
  New parity block = old parity block XOR old data block XOR new data block
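A small sketch of the small-write parity update described above; read_block and write_block are hypothetical callbacks into the RAID layer, and xor_blocks is a helper over equal-length byte blocks.

```python
# RAID 5 small write: 2 reads + 2 writes, with the parity updated by XOR.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(read_block, write_block, data_disk, parity_disk, stripe, new_data):
    old_data = read_block(data_disk, stripe)        # IO 1: read old data block
    old_parity = read_block(parity_disk, stripe)    # IO 2: read old parity block
    # new parity = old parity XOR old data XOR new data
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    write_block(data_disk, stripe, new_data)        # IO 3: write new data block
    write_block(parity_disk, stripe, new_parity)    # IO 4: write new parity block
```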

Other RAID Configurations
RAID 6: can survive two disk failures
RAID 10 (RAID 1+0): data striped across mirrored pairs
RAID 01 (RAID 0+1): mirroring two RAID 0 arrays
RAID 15, RAID 51

Spin down Individual Drives
Similar to the single-disk approach
+ No special hardware
- Not effective with data striping
[Gurumurthi et al. 2003]

Dedicated Disks for Caching
MAID (Massive Array of Idle Disks)
Designed for archival purposes
Dedicate a few disks to cache frequently referenced data; updates are deferred
+ Significant energy savings
- Not designed to maximize parallelism
[Colarelli and Grunwald 2002]

Use Variable RPM Disks
Hibernator
Periodically changes the RPM setting of disks, based on a targeted response time
Reorganizes blocks within individual stripes: blocks are assigned to disks based on their temperatures (hot/cold)
+ Good energy savings
- Data migration costs
[Zhu et al. 2005]

Flash + Content Regeneration
EERAID and RIMAC
Assume hardware RAID
Add flash to the RAID controller to buffer updates, mostly to retain reliability characteristics
Flush changes to the different drives periodically, lengthening idle periods
[Li et al. 2004; Yao and Wang 2006]

Flash + Content Regeneration (continued)
EERAID and RIMAC
Use blocks from powered disks to regenerate a block on a sleeping disk
  Suppose B1, B2, B3 are in a stripe: read B1 = read B2, read B3, XOR(B2, B3)
Up to 30% energy savings
The benefit diminishes as the number of disks increases
  Can improve performance for < 5 disks
  Can degrade performance for > 5 disks
[Li et al. 2004; Yao and Wang 2006]
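A sketch of the regeneration read above: when the disk holding B1 is asleep, the request is served by XOR-ing the remaining blocks of the stripe. This assumes a simple 3-disk parity stripe (B1 XOR B2 XOR B3 == 0); read_block and is_asleep are hypothetical callbacks, not EERAID's or RIMAC's actual interfaces.

```python
# Regeneration read: serve a request for a block on a sleeping disk without spinning it up.
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_with_regeneration(stripe_disks, target_disk, stripe, read_block, is_asleep):
    if not is_asleep(target_disk):
        return read_block(target_disk, stripe)            # normal path
    others = [read_block(d, stripe) for d in stripe_disks if d != target_disk]
    return reduce(xor_blocks, others)                      # regenerate from powered disks
```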

Flash + Content Regeneration (continued)
Use (n, m) erasure encoding: any m out of n blocks can reconstruct the original content
Separate the m original disks from the n - m disks holding redundant information
Spin down the n - m disks during light loads
Writes are buffered using NVRAM and flushed periodically
[Pinheiro et al. 2006]

Flash + Content Regeneration (continued)
Use (n, m) erasure encoding
+ Can save a significant amount of power
+ Can use all disks under peak load
- Erasure encoding can be computationally expensive
[Pinheiro et al. 2006]

Replicate Data
PARAID (Power-Aware RAID)
Replicate data from disks to be spun down into the spare areas of active disks
+ Preserves peak performance and reliability
- Requires spare storage areas
(Figure: a RAID level n whose unused areas hold the replicas)
[Weddle et al. 2007]

Transposed Mirroring
Diskgroup
Based on RAID 0+1 (mirrored RAID 0), with a twist
[Lu et al. 2007]

(Layout: each column is a disk 0-3, each row a stripe; the 2nd RAID 0 stores the transpose of the 1st)
  1st RAID 0:          2nd RAID 0:
  A1 B1 C1 D1          A1 A2 A3 A4
  A2 B2 C2 D2          B1 B2 B3 B4
  A3 B3 C3 D3          C1 C2 C3 C4
  A4 B4 C4 D4          D1 D2 D3 D4

Transposed Mirroring (continued)
Diskgroup
Use flash to buffer writes
+ Can power on disks in the 2nd RAID 0 incrementally to meet performance needs
- Reduced reliability
[Lu et al. 2007]

File-system-level Techniques
General descriptions
Energy-saving techniques: hot and cold tiers; replicate data; access remote RAM; use local storage as backup

File Systems
Provide names and attributes to blocks; map files to blocks
Note: a file system does not know the physical medium of storage (e.g., disk vs. flash); a RAID appears as a logical disk

File Systems (continued)
Implications
  Energy-saving techniques below the file system should be usable by all file systems
  File-system-level techniques should be agnostic to the physical storage medium
In reality, many energy-saving techniques are storage-medium-specific and cannot be applied to both disk and flash
  Example: data migration is ill-suited for flash

Hot and Cold Tiers
Nomad FS
Uses PDC (Popular Data Concentration)
Exploits the Zipf distribution of file popularity
Uses multi-level feedback queues to sort files by popularity
  N LRU queues with exponentially growing time slices
  Demote a file if it is not referenced within its time slice
  Promote a file if it is referenced more than 2N times
[Pinheiro and Bianchini 2004]

Hot and Cold Tiers (continued)
PDC
Periodically migrates files, sorted by popularity, onto a subset of the disks
Each disk is capped by either file size or load (file size / interarrival time)
+ Can save more energy than MAID
- Does not exploit parallelism
- Data migration overhead
[Pinheiro and Bianchini 2004]

Replicate Data
FS2
Identifies hot regions of a disk
Replicates file blocks from other regions into the hot region to reduce seek delays
Requires changes to both the file-system and device-driver layers
  The file system knows the free status of blocks and the file membership of blocks
[Huang et al. 2005]

Replicate Data (continued)
FS2
For reads, access the replica closest to the current disk head
For writes, invalidate the replicas
Additional data structures track the replicas and are periodically flushed
+ Saves energy and improves performance
- Needs free space, and crashes are a concern
[Huang et al. 2005]
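A minimal sketch of the read path above: among the still-valid replicas of a block, pick the one closest to the current head position, and on a write keep only the primary copy valid. The distance metric (absolute LBA difference) and the data structures are illustrative assumptions, not FS2's actual implementation.

```python
# FS2-style replica selection: serve a read from the copy nearest the disk head.
def choose_replica(replicas, head_lba, valid):
    """replicas: LBAs holding copies of a block; valid: set of still-valid LBAs."""
    candidates = [lba for lba in replicas if lba in valid]
    return min(candidates, key=lambda lba: abs(lba - head_lba))

def write_block(replicas, valid, primary_lba):
    # On a write, only the primary copy stays valid; the duplicates become stale.
    valid.difference_update(lba for lba in replicas if lba != primary_lba)
    valid.add(primary_lba)
```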

Access Remote RAM
BlueFS
Designed for mobile devices
Decides whether to fetch data from remote RAM or the local disk, depending on the power states and relative power consumption
Updates are persistently cached on the mobile client and propagated to the server in aggregate, via optimistic consistency semantics
[Nightingale et al. 2005]

Access Remote RAM (continued)
BlueFS
One problem: if files are accessed one at a time over a network while the local disk is powered down, there is little incentive to power up the local disk
Solution: provide ghost hints to the local disk about the opportunity cost
Can reduce latency by 3x and energy by 2x
[Nightingale et al. 2005]

Use Local Storage as Backup
GreenFS
Mostly designed for mobile devices; can be applied to desktops and servers as well
When disconnected from a network, use flash to buffer writes
When connected, use flash for caching
[Joukov and Sipek 2008]

Use Local Storage as Backup (continued)
GreenFS
The local disk is a backup of data stored remotely
  Provides continuous data protection
  Spun down most of the time; remote data is accessed whenever possible
  Spun up and synchronized at least once per day, when the device is shut down, or when flash is nearly full
  Used when network connectivity is poor
The number of disk power cycles is tracked and capped
[Joukov and Sipek 2008]

Use Local Storage as Backup (continued)
GreenFS
Performance depends on the workload
+ Can save some power (up to 14%)
+ Noise reduction
+ Better tolerance to shocks when disks are off most of the time
- Does not work well for large files: the power needed for network transfers exceeds the disk's, but large files are assumed to be rare
[Joukov and Sipek 2008]

VFS-level Techniques
General descriptions
Energy-efficient techniques: encourage IO bursts; cache cold disks

VFS
Enables an operating system to run different file systems simultaneously
Handles common functions such as caching, prefetching, and write buffering

Encourage IO Bursts
Prefetch aggressively based on monitored IO profiles of process groups, but not so aggressively as to cause eviction misses
Flush writes once per minute (as opposed to every 30 seconds)
Allow applications to specify whether it is okay to postpone updates until file close (e.g., access time stamps for media files)
Preallocate memory based on the write rate
[Papathanasiou and Scott 2004]

Encourage IO Bursts (continued)
Negligible performance overhead
+ Energy savings of up to 80%
- Somewhat reduced reliability
[Papathanasiou and Scott 2004]

Cache Cold Disks
Cache more blocks from cold disks
  Lengthens the idle periods of cold disks
  Slightly increases the load on hot disks
[Zhu et al. 2004]

Cache Cold Disks (continued)
Cache more blocks from cold disks
Characteristics of hot disks
  Small percentage of cold misses, identified by a Bloom filter
  High probability of long interarrival times, tracked by epoch-based histograms
Improves energy savings by 16%
Assumes the use of multi-speed disks
Assumes the use of NVRAM or a log disk to buffer writes
[Zhu et al. 2004]

Application-level Techniques
Many ways to apply system-level techniques (e.g., buffering)
Flash: energy-efficient encoding

Energy-efficient Encoding
Basic observation: programming the bit pattern 10101010 costs 15.6 µJ, while 11111111 costs 0.038 µJ
Thus, avoid the 10 and 01 bit patterns
Approach: a user-level encoder

  Memory type: Intel MLC NOR 28F256L18
  Operation    Time (µs)   Energy (µJ)
  Program 00     110.00       2.37
  Program 01     644.23      14.77
  Program 10     684.57      15.60
  Program 11      24.93       0.038
[Joo et al. 2007]

Energy-efficient Encoding (continued)
Tradeoff: 35% energy savings with 50% size overhead
+ No changes to the storage stack
+ Good energy savings
- File size overhead can be significant
- Longer latency due to larger files
- One-time encoding cost
[Joo et al. 2007]
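A small sketch of the idea: estimate relative programming energy from the per-pattern costs in the table above. Treating a buffer as a sequence of independent 2-bit patterns is a simplifying assumption (the slide's own figures are per programming operation), so only the relative comparison matters here; the actual encoder of Joo et al. is not reproduced.

```python
# Estimating MLC NOR programming energy from the per-pattern costs on the slide
# (energies in microjoules; independence across 2-bit patterns is a simplification).
PROGRAM_ENERGY_UJ = {"00": 2.37, "01": 14.77, "10": 15.60, "11": 0.038}

def program_energy_uj(bits: str) -> float:
    """bits: a string of '0'/'1' whose length is a multiple of 2."""
    return sum(PROGRAM_ENERGY_UJ[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(program_energy_uj("10101010"))  # dominated by the expensive '10' pattern
print(program_energy_uj("11111111"))  # nearly free: '11' is the erased state
# An energy-aware encoder trades extra bits for cheaper patterns, e.g. the 35%
# energy savings at 50% size overhead reported by Joo et al.
```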

Large-scale Techniques
General descriptions
  Some can operate at RAID granularity: no need to handle low-level reliability issues
  Involve distributed coordination
Energy-efficient techniques: spun-up token; graph coverage; write offloading; power proportionality

Spun-up Token
Pergamum
A distributed system designed for archival storage, built from commodity low-power components
Each node contains a flash device, a low-power CPU, and a low-RPM disk; flash stores metadata, the disk stores data
Uses erasure codes for reliability
Uses a spun-up token to limit the number of powered disks
[Storer et al. 2008]

Graph Coverage
Problem formulation: a bipartite graph between nodes and data items, where an edge indicates an item's host
Power-saving goal: minimize the number of powered nodes while keeping all data items available (a greedy sketch follows below)
For read-mostly workloads; writes are buffered via flash
[Harnik et al. 2009]

Graph Coverage (continued)
SRCMap
For each node, replicate the working set to other nodes
Increases the probability that a few nodes cover the working sets of many nodes
[Verma et al. 2010]
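To make the coverage formulation concrete, here is a minimal greedy sketch (the classic set-cover heuristic) that powers on nodes until every data item is hosted by some powered node. The cited systems use more sophisticated formulations; this is only illustrative.

```python
# Greedy "graph coverage": choose a small set of powered nodes that together host
# every data item. hosts[node] = set of data items stored on that node.
def choose_powered_nodes(hosts: dict[str, set[str]]) -> set[str]:
    uncovered = set().union(*hosts.values())
    powered = set()
    while uncovered:
        # Power on the node that covers the most still-uncovered items.
        best = max(hosts, key=lambda n: len(hosts[n] & uncovered))
        if not hosts[best] & uncovered:
            break  # remaining items are not hosted anywhere
        powered.add(best)
        uncovered -= hosts[best]
    return powered

hosts = {
    "node0": {"A", "B", "C"},
    "node1": {"B", "D"},
    "node2": {"C", "D", "E"},
    "node3": {"E"},
}
print(choose_powered_nodes(hosts))  # e.g. {'node0', 'node2'} keeps every item available
```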

Write Offloading
Observations
  The cache can handle most read requests
  Disks are active mostly due to writes
Solution: write offloading (sketched below)
  A volume manager redirects writes from sleeping disks to active disks
  Invalidates obsolete content
  Propagates updates when the disks are spun up (e.g., on a read miss, or when there is no more space for offloading)
[Narayanan et al. 2008]

Write Offloading (continued)
Uses versioning to ensure consistency
+ Retains the reliability semantics of the underlying RAID level
+ Can save up to 60% power (36% just from spinning down idle disks)
+ Can achieve better average response time
- Extra latency for reads, but this is expected

Power Proportionality
Rabbit
Matches power consumption with workload demands
Uses an equal-work data layout
Uses write offloading to handle updates
[Amur et al. 2010]
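A minimal sketch of write offloading as described above: writes aimed at a sleeping volume are logged on an active one, reads check the log first, and the log is reclaimed when the home volume spins up. The names and structures here are illustrative assumptions, not Narayanan et al.'s actual design.

```python
# Toy write offloading: redirect writes for sleeping volumes to an active volume's log.
class OffloadManager:
    def __init__(self, sleeping: set[str], active_log: dict):
        self.sleeping = sleeping       # volumes currently spun down
        self.log = active_log          # (volume, block) -> latest offloaded data

    def write(self, volume, block, data, write_home):
        if volume in self.sleeping:
            self.log[(volume, block)] = data     # offload; older versions are now obsolete
        else:
            write_home(volume, block, data)

    def read(self, volume, block, read_home):
        if (volume, block) in self.log:          # the latest copy lives in the log
            return self.log[(volume, block)]
        return read_home(volume, block)          # may force a spin-up (a "read miss")

    def reclaim(self, volume, write_home):
        """Called when `volume` spins up: propagate offloaded updates back home."""
        for (vol, block), data in list(self.log.items()):
            if vol == volume:
                write_home(vol, block, data)
                del self.log[(vol, block)]
        self.sleeping.discard(volume)
```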

Summary
Energy-efficient techniques: specialized hardware, caching, powering down devices, data regeneration, replication, special layouts, remote access, power provisioning, etc.
Common tradeoffs: performance, capacity, reliability, specialized hardware, data migration, price, etc.
No free lunch in general
Questions? Thank you!

Backup Slides

Erasure Codes
Examples: RAID level 5 parity code, Reed-Solomon code, Tornado code
[Mitzenmacher]

Bloom Filters
Compact data structures for a probabilistic representation of a set
Appropriate for answering membership queries
[Koloniari and Pitoura]

Bloom Filters (cont'd)

Query for b: check the bits at positions H1(b), H2(b), ..., H4(b).

Suppose the Goal Is 1M IOPS
Samsung SM825: 11K IOPS; 91 units x $320?/unit = $30K, plus "stuff" (rack, space, network interconnect)
Hitachi Deskstar 7K1000: 77 IOPS; 13K units x $80/unit = $1M, plus 140x the "stuff"
If your goal is performance, forget about disks and energy costs.
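A short sketch of the arithmetic behind this final comparison, using the per-device IOPS and prices from the slide.

```python
# How many devices (and dollars) does 1M IOPS take?
import math

def units_and_cost(target_iops, device_iops, unit_price):
    units = math.ceil(target_iops / device_iops)
    return units, units * unit_price

print(units_and_cost(1_000_000, 11_000, 320))  # SSDs:  (91, 29120)      -> ~$30K of devices
print(units_and_cost(1_000_000, 77, 80))       # disks: (12988, 1039040) -> ~$1M, ~140x the rack space
```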