![Page 1: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/1.jpg)
A Robust Main-Memory Compression Scheme
Magnus Ekman and Per Stenström
Chalmers University of Technology
Göteborg, Sweden
![Page 2: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/2.jpg)
Motivation
• Memory resources are wasted to compensate for the increasing processor/memory/disk speed gap
– >50% of die size occupied by caches
– >50% of cost of a server is DRAM (and increasing)
• Lossless data compression techniques have the potential to free up more than 50% of memory resources.
Unfortunately, compression introduces several challenging design and performance issues
![Page 3: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/3.jpg)
[Figure: two cores, each with an L1-cache, sharing an L2-cache. A request goes to the main memory space (addresses 0 to 448 in steps of 64) and data is returned.]
![Page 4: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/4.jpg)
[Figure: the same system with a compressed main memory space. The address of a request passes through a translation table, and returned data passes through a decompressor. A second view shows the fragmented compressed main memory space (addresses 0 to 448 in steps of 64).]
![Page 5: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/5.jpg)
Contributions
A low-overhead main-memory compression scheme:
• Low decompression latency by using a simple and fast algorithm (zero aware)
• Fast address translation by a proposed small translation structure that fits on the processor die
• Reduction of fragmentation through occasional relocation of data when compressibility varies
Overall, our compression scheme frees up 30% of the memory at a marginal performance loss of 0.2%!
![Page 6: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/6.jpg)
Outline
• Motivation
• Issues
• Contributions
• Effectiveness of Zero-Aware Compressors
• Our Compression Scheme
• Performance Results
• Related Work
• Conclusions
![Page 7: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/7.jpg)
Frequency of zero-valued locations
• 12% of all 8KB pages only contain zeros
• 30% of all 64B blocks only contain zeros
• 42% of all 4B words only contain zeros
• 55% of all bytes are zero!
Zero-aware compression schemes have a great potential!
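As an illustration of how such figures could be measured, the sketch below counts all-zero chunks at each granularity in a memory image. The function name `zero_fractions` and the use of a `bytes` object as the memory image are assumptions made for this example; the slides do not describe the measurement tooling.

```python
def zero_fractions(memory: bytes, sizes=(1, 4, 64, 8192)):
    """Fraction of aligned chunks of each size that hold only zero bytes.

    sizes are in bytes: 1 (byte), 4 (word), 64 (block), 8192 (page).
    Hypothetical helper, not taken from the slides.
    """
    result = {}
    for size in sizes:
        count = len(memory) // size  # number of whole aligned chunks
        zero = sum(
            1
            for i in range(count)
            if not any(memory[i * size:(i + 1) * size])  # True if all bytes are 0
        )
        result[size] = zero / count if count else 0.0
    return result


# Two 8KB pages: the first all zeros, the second with one nonzero byte.
image = bytes(8192) + bytes([1]) + bytes(8191)
fractions = zero_fractions(image)
```

For this two-page image, half of the pages are all zero while almost every byte, word, and block is zero, which shows why the finer granularities report higher zero fractions.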
![Page 8: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/8.jpg)
Evaluated Algorithms
Zero-aware algorithms:
• FPC (Alameldeen and Wood) + 3 simplified versions
For comparison, we also consider:
• X-Match Pro (efficient hardware implementations exist)
• LZSS (popular algorithm, previously used by IBM for memory compression)
• Deflate (upper bound on compressibility)
![Page 9: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/9.jpg)
Resulting Compressed Sizes
[Figure: resulting compressed sizes per algorithm for the SpecInt, SpecFP and Server workloads]
Main observations:
• FPC and all its variations can free up about 45% of memory
• LZSS and X-Match Pro are only marginally better in spite of their complexity
• Deflate can free up about 80% of memory, but it is not clear how to exploit it
Fast and efficient compression algorithms exist!
![Page 10: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/10.jpg)
Outline
• Motivation
• Issues
• Contributions
• Effectiveness of Zero-Aware Compressors
• Our Compression Scheme
• Performance Results
• Related Work
• Conclusions
![Page 11: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/11.jpg)
Address translation
[Figure: the same blocks shown as uncompressed data, compressed data, and compressed fragmented data, together with the TLB, the Block Size Table (BST) and the Address Calculator]
Block size vector: 01 00 11 10 00 01 10 11
A block is assigned one out of n predefined sizes. In this example n=4.
Block Size Table (BST) contents, one block size vector per row:
01 00 11 10 00 01 10 11
11 10 01 00 00 00 11 11
00 01 00 10 10 00 10 01
01 10 10 10 10 11 10 00
• OS changes
– Block size vector is kept in page table
– Each page is assigned one out of k predefined sizes. Physical address grows by log2(k) bits.
The Block Size Table enables fast translation!
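The address calculation can be sketched as a prefix sum over the block size vector. This is a minimal sketch under stated assumptions: the 2-bit codes are assumed to index the block sizes from the Predefined sizes slide in increasing order (00 = 0, 01 = 22, 10 = 44, 11 = 64 bytes), and sub-page and page slack are ignored. The function name `block_offset` is hypothetical.

```python
# Predefined block sizes in bytes, taken from the Predefined sizes slide.
# Assumption: code 0b00 maps to the first size, 0b11 to the last.
BLOCK_SIZES = (0, 22, 44, 64)


def block_offset(size_vector, index):
    """Byte offset of block `index` from the start of its size vector.

    size_vector is a sequence of 2-bit size codes, one per block.
    Slack between sub-pages is ignored in this sketch.
    """
    return sum(BLOCK_SIZES[code] for code in size_vector[:index])


# The example vector from the slide: 01 00 11 10 00 01 10 11
vector = [0b01, 0b00, 0b11, 0b10, 0b00, 0b01, 0b10, 0b11]
offset_of_block_3 = block_offset(vector, 3)  # 22 + 0 + 64 = 86
offset_of_block_7 = block_offset(vector, 7)  # 196
```

Because every size code is available on chip in the BST, the sum needs no memory access, which is what makes the translation fast compared with a memory-resident translation table.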
![Page 12: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/12.jpg)
Size changes and compaction
[Figure: a page laid out as Sub-page 0 and Sub-page 1, each followed by sub-page slack, with page slack at the end; labels mark a block overflow, a block underflow, and slack]
Terminology: block overflow/underflow, sub-page overflow/underflow, page overflow/underflow
![Page 13: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/13.jpg)
Handling of overflows/underflows
• Block and sub-page overflows/underflows imply moving data within a page
• On a page overflow/underflow the entire page needs to be moved to avoid having to move several pages
• Block and sub-page overflows/underflows are handled in hardware by an off-chip DMA-engine
• On a page overflow/underflow a trap is taken and the mapping for the page is changed
Processor has to stall if it accesses data that is being moved!
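The three levels of overflow and underflow can be illustrated with a small sketch. It assumes that each block, sub-page and page is always assigned the smallest predefined size that fits (the actual assignment policy may differ, for example by using hysteresis), and that a sub-page holds 16 blocks (inferred from 1024 / 64). All function names are hypothetical; the sizes come from the Predefined sizes slide.

```python
from bisect import bisect_left

# Sizes in bytes from the Predefined sizes slide.
BLOCK_SIZES = (0, 22, 44, 64)
SUBPAGE_SIZES = (256, 512, 768, 1024)
PAGE_SIZES = (2048, 4096, 6144, 8192)
BLOCKS_PER_SUBPAGE = 16  # assumption: 1024-byte sub-page / 64-byte block


def fit(sizes, nbytes):
    """Smallest predefined size that holds nbytes (assumed policy)."""
    return sizes[bisect_left(sizes, nbytes)]


def layout(compressed_bytes):
    """Assigned sizes per block, per sub-page, and for the page."""
    blocks = [fit(BLOCK_SIZES, b) for b in compressed_bytes]
    subpages = [
        fit(SUBPAGE_SIZES, sum(blocks[i:i + BLOCKS_PER_SUBPAGE]))
        for i in range(0, len(blocks), BLOCKS_PER_SUBPAGE)
    ]
    page = fit(PAGE_SIZES, sum(subpages))
    return blocks, subpages, [page]


def events(before, after):
    """List which levels overflow or underflow between two page states."""
    found = []
    levels = ("block", "sub-page", "page")
    for level, old, new in zip(levels, layout(before), layout(after)):
        if any(n > o for o, n in zip(old, new)):
            found.append(level + " overflow")
        if any(n < o for o, n in zip(old, new)):
            found.append(level + " underflow")
    return found


empty = [0] * 128                # an all-zero 8KB page (128 blocks)
one_grown = [30] + [0] * 127     # one block no longer compresses to 0 bytes
sub_grown = [20] * 16 + [0] * 112  # a whole sub-page grows
```

With these inputs, `events(empty, one_grown)` reports only a block overflow, since the sub-page still fits in its assigned 256 bytes, while `events(empty, sub_grown)` reports overflows at all three levels.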
![Page 14: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/14.jpg)
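Putting it all together
[Figure: the complete system: two cores with L1-caches, a shared L2-cache, the BST and address calculator (Calc.), the compressor (Comp), decompressor (Dec.) and DMA-engine, and a memory page (Page 0) divided into Sub-page 0 and Sub-page 1]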
![Page 15: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/15.jpg)
Experimental Methodology
Key issues to experimentally evaluate:
• Compressibility and impact of fragmentation
• Performance losses for our approach
Simulation approaches (both using Simics):
• Fast functional simulator (in-order, 1 instr/cycle) allowing entire benchmarks to be run
• Detailed microarchitecture simulator driven by a single sim-point per application
![Page 16: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/16.jpg)
Architectural Parameters

| Parameter | Value |
| --- | --- |
| Instr. issue | 4-w ooo |
| Exec units | 4 int, 2 int mul/div |
| Branch pred. | 16-k entr. gshare, 2k BTB |
| L1 I-cache | 32 k, 2-w, 2-cycle |
| L1 D-cache | 32 k, 4-w, 2-cycle |
| L2 cache | 512k/thread, 8-w, 16 cycles |
| Memory latency | 150 cycles (note 1) |
| Block lock-out | 4000 cycles (note 2) |
| Subpage lock-out | 23000 cycles (note 2) |
| Page lock-out | 23000 cycles (note 2) |

Note 1: Aggressive for future processors, so as not to give an advantage to our compression scheme
Note 2: Conservatively assumes 200 MHz DDR2 DRAM; leads to long lock-out times

Predefined sizes

| Level | Size 1 | Size 2 | Size 3 | Size 4 |
| --- | --- | --- | --- | --- |
| Block | 0 | 22 | 44 | 64 |
| Subpage | 256 | 512 | 768 | 1024 |
| Page | 2048 | 4096 | 6144 | 8192 |

Loads to a block only containing zeros can retire without accessing memory!
![Page 17: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/17.jpg)
Benchmarks
SPEC2000 ran with reference set. SAP and SpecJBB ran 4 billion instructions per thread.
• SpecInt2000: Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perlbmk, Twolf, Vpr
• SpecFP2000: Ammp, Art, Equake, Mesa
• Server: SAP S&D, SpecJBB
![Page 18: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/18.jpg)
Overflows/Underflows
Main observations:
• About 1 out of every thousand instructions causes a block overflow/underflow
• The use of subpages cuts down the number of page-level relocations by one order of magnitude
• Memory bandwidth goes up by 58% with subpages and 212% without. Note that this is not the bandwidth to the processor chip
Fragmentation, with the use of a hierarchy of pages, sub-pages and blocks, reduces memory savings to 30%
Note: y axis is logarithmic
![Page 19: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/19.jpg)
Detailed Performance Results
We used a single simpoint per application according to [Sherwood et al. 2002]
Main observations:
• Decompression latency reduces performance by 0.5% on average
• Misses to zero-valued blocks increase performance by 0.5% on average
• Also factoring in relocation/compaction losses, performance losses are only 0.2%!
![Page 20: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/20.jpg)
Related Work
Early work on main memory compression:
– Douglis [1993], Kjelso et al. [1999], and Wilson et al. [1999]
– These works aimed at reducing paging overhead, so the significant decompression and address translation latencies were offset by the wins
More recently:
– IBM MXT [Abali et al. 2001]
– Compresses entire memory with LZSS (64 cycle decompression latency)
– Translation through memory resident translation table
– Shields latency by huge (at that point in time) 32-MByte cache. Sensitive to working set size
Compression algorithm:
– Inspired by frequent-value locality work by Zhang, Yang and Gupta, 2000
– Compression algorithm from Alameldeen and Wood, 2004
![Page 21: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/21.jpg)
Concluding Remarks
It is possible to free up significant amounts of memory resources with virtually zero performance overhead
This was achieved by
• exploiting zero-valued bytes, which account for as much as 55% of the memory contents
• leveraging a fast compression/decompression scheme
• a fast translation mechanism
• a hierarchical memory layout which offers some slack at the block, subpage, and page level
Overall, 30% of memory could be freed up at a loss of 0.2% on average
![Page 22: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/22.jpg)
Backup Slides
![Page 23: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/23.jpg)
Fragmentation - Results
![Page 24: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/24.jpg)
% misses to zero-valued blocks
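| | bzi | gap | gcc | gzi | mcf | par | per | two | vor | vpr | am | art | equ | mes | sap | jbb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512k | 3 | 27 | 25 | 33 | 0.1 | 6 | 19 | 0.2 | 12 | 6 | 0.6 | 0.1 | 0.1 | 45 | 13 | 2 |
| 4 MB | 16 | 29 | 67 | 42 | 0.6 | 0.1 | 27 | 0.3 | 23 | 40 | 10 | 20 | 0.1 | 53 | 15 | 1 |
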
• For gap, gcc, gzip, mesa more than 20% of the misses request zero-valued blocks; for the rest the percentage is quite small
![Page 25: A Robust Main-Memory Compression Scheme Magnus Ekman and Per Stenström Chalmers University of Technology Göteborg, Sweden.](https://reader035.fdocuments.net/reader035/viewer/2022062719/56649ef05503460f94c00bd4/html5/thumbnails/25.jpg)
Frequent Pattern Compression (Alameldeen and Wood)
Variant with a 2-bit prefix:

| Prefix | Pattern encoded | Data size |
| --- | --- | --- |
| 00 | Zero word | 0 |
| 01 | One byte sign-extended | 8 bits |
| 10 | Halfword sign-extended | 16 bits |
| 11 | Uncompressed | 32 bits |

Each 32-bit word is coded using a prefix plus data:

| Prefix | Pattern encoded | Data size |
| --- | --- | --- |
| 000 | Zero run | 3 bits (for runs up to 8 "0") |
| 001 | 4-bit sign-extended | 4 bits |
| 010 | One byte sign-extended | 8 bits |
| 011 | Halfword sign-extended | 16 bits |
| 100 | Halfword padded with a zero halfword | 16 bits |
| 101 | Two halfwords, each a byte sign-extended | 16 bits |
| 110 | Word consisting of repeated bytes | 8 bits |
| 111 | Uncompressed | 32 bits |
111 Uncompressed 32 bits