Post on 27-Mar-2018
CRAMES: Compressed RAM for Embedded Systems
Robert Dick
http://robertdick.org/ or http://ziyang.eecs.northwestern.edu/∼dickrp/Department of Electrical Engineering and Computer Science
Northwestern University
Application
Swap
VirtualMemory
CRAMES
Application
PageDecomp
PageComp
Application
IntroductionCRAMES design
Experimental evaluation
Collaborators on project
NEC
Northwestern UniversityLei Yang
Lan S. BaiRobert P. Dick
NEC Labs AmericaDr. Haris Lekatsas
Dr. Srimat Chakradhar
2 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Outline
1. IntroductionMotivationPast work
2. CRAMES designDesign overviewPattern-based partial match compression
3. Experimental evaluationExperimental setupCommercialization and future directions
3 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Problem background
RAM quantity limits application functionality
RAM price dropping but usage growing faster
Secure Internet access, email, music, and games
How much RAM?
Functionality
Cost
Power consumption
Size
4 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Ideal hardware–software design process
Ideal case
Hardware and software engineers collaborate on system-level designfrom start to finish
We teach the advantages of this in our classes
It doesn’t always happen
5 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Real hardware–software design process
HW engineers SW engineers
6 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Real hardware–software design process
HW engineers SW engineers
6 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Options
Option 1: Add more memory
Implications: Hardware redesign, miss shipping target, get fired
Option 2: Rip out memory-hungry application features
Implications: Lose market to competitors, fail to recoup design andproduction costs, get fired
Option 3: Make it seem as if memory increased without changinghardware, without changing applications, and without performance orpower consumption penalties
Nobody knew how to do this
7 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Goals
Allow application RAM requirements to overrun initial estimates evenafter hardware design
Reduce physical RAM, negligible performance and energy cost
Improve functionality or performance with same physical RAM
8 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
HW RAM compression
Tremaine 2001, Benini 2002, Moore 2003
Hardware (de)compression unit between cache and RAM
Hardware redesign and application-specific compression hardware
Past work claimed application-specific hardware essential to keeppower and performance overhead low
9 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Code compression
Lekatsas 2000, Xu 2004
Store code compressed, decompress during execution
Compress off-line, decompress on-line
For RAM, less important than on-line data compression
10 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
MotivationPast work
Compression for swap performance
Compressed caching
Douglis 1993, Russinovich 1996, Wilson 1999, Kjelso 1999
Add compressed software cache to VM
Swap compression
RamDoubler, Cortez 2000, Roy 2001, Chihaia 2005
Compress swapped-out pages and store them in software cache
Both techniques
Target: general purpose system with disks
Goal: improve system performance
Interface to backing store (disk)
11 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Outline
1. IntroductionMotivationPast work
2. CRAMES designDesign overviewPattern-based partial match compression
3. Experimental evaluationExperimental setupCommercialization and future directions
12 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Design principles
Page selection
Scheduling compression and decompression
Organizing compressed and uncompressed regions
Dynamically adjust compressed region size
Compression scheme
High performance
Energy efficient
Good compression ratio
Low memory requirement
13 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
CRAMES design
Kernel Swapping System
CRAMES Device Driver
BlockDecompressor
MemoryManager
BlockCompressor
Virtual Memory
RAM
MMU
Page Fault
8
7
3
45
1
2
6
Memory Low
8
7
3
4
5
6
2
1
Memory Low
8
7
3
4
5
6
2
1 8
7
3
4
5
6
2
1
14 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
System configuration
Slightly slowermain memoryworking area
20 MB
Main memoryworking area
10 MB
User view
Slightly slowerRamdisk with
file system20 MB
Compressedswap device
10 MB
Main memoryworking area
10 MB
Compressed RAM device with file system
12 MB
System view
CRAMES
Main memoryworking area
20 MB
Ramdiskwith file system
12 MB
System/User view
Original
15 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Compression algorithm
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
LZObest existingcompressionalgorithm for
CRAMES
16 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Compression algorithm
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
Compression Ratio
0.14
0.19
0.24
0.29
0.34
0.39
512 1024 2048 4096 8192
Block Size (bytes)
Co
mp
ressio
n R
ati
o bzip2
zlib default
zilb level 1
zlib level 9
RLE
LZRW-1A
LZO
Compression Time
0
0.0005
0.001
0.0015
0.002
0.0025
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zilb default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Decompression Time
0
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
512 1024 2048 4096 8192
Block Size (bytes)
Tim
e (
sec)
bzip2
zlib default
zlib level 1
zlib level 9
RLE
LZRW-1A
LZO
Working Memory
1
10
100
1000
10000
bzip2 zlib default LZO LZRW-1A RLE
Compression Algorithms
log
2 (
byte
s)
Compression
Decompression
7600kB
3700kB
256KB
44KB64KB
0
16KB 16KB
0 0
LZObest existingcompressionalgorithm for
CRAMES
16 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Weak link: Compression algorithm
LZO average performance penalty 9.5% when RAM reduced to 40%
Developed simulation environment to permit profiling
Compression and decompression were taking most time
Needed a better compression algorithm
17 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Weak link: Compression algorithm
LZO average performance penalty 9.5% when RAM reduced to 40%
Developed simulation environment to permit profiling
Compression and decompression were taking most time
Needed a better compression algorithm
17 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Pattern-based partial dictionary match coding
Is something similar
in dictionary?
Input 32-bit word
Output dictionary
location and
difference
Update dictionary
entry
Move entry to
front of
dictionary
Eliminate LRU
dictionary entry
Place entry
in dictionary
Output value
YN
18 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Data regularity
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
(16,
1)(8
,2)
(4,4
)
(32,
1)
(16,
2)(8
,4)
(64,
1)
(32,
2)
(16,
4)
(128
,1)
(64,
2)
(32,
4)
(256
,1)
(128
,2)
(64,
4)
Dictionary layout (size, level)
Perc
en
tag
e %
xmxx
xzxz
zxxz
xzzx
zxzz
zzxz
xzzz
xxzz
xxzx
xzxx
xxxz
zxxx
zzxx
zxzx
mmmx
mmxx
mmmm
zzzx
xxxx
zzzz
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
(16,
1)(8
,2)
(4,4
)
(32,
1)
(16,
2)(8
,4)
(64,
1)
(32,
2)
(16,
4)
(128
,1)
(64,
2)
(32,
4)
(256
,1)
(128
,2)
(64,
4)
Dictionary layout (size, level)
Perc
en
tag
e %
xmxx
xzxz
zxxz
xzzx
zxzz
zzxz
xzzz
xxzz
xxzx
xzxx
xxxz
zxxx
zzxx
zxzx
mmmx
mmxx
mmmm
zzzx
xxxx
zzzz
Zeros: 38%
Small values: 10%
Nearby values similar(data, pointers): 27%
19 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Data regularity
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
(16,
1)(8
,2)
(4,4
)
(32,
1)
(16,
2)(8
,4)
(64,
1)
(32,
2)
(16,
4)
(128
,1)
(64,
2)
(32,
4)
(256
,1)
(128
,2)
(64,
4)
Dictionary layout (size, level)
Perc
en
tag
e %
xmxx
xzxz
zxxz
xzzx
zxzz
zzxz
xzzz
xxzz
xxzx
xzxx
xxxz
zxxx
zzxx
zxzx
mmmx
mmxx
mmmm
zzzx
xxxx
zzzz
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
(16,
1)(8
,2)
(4,4
)
(32,
1)
(16,
2)(8
,4)
(64,
1)
(32,
2)
(16,
4)
(128
,1)
(64,
2)
(32,
4)
(256
,1)
(128
,2)
(64,
4)
Dictionary layout (size, level)
Perc
en
tag
e %
xmxx
xzxz
zxxz
xzzx
zxzz
zzxz
xzzz
xxzz
xxzx
xzxx
xxxz
zxxx
zzxx
zxzx
mmmx
mmxx
mmmm
zzzx
xxxx
zzzz
Zeros: 38%
Small values: 10%
Nearby values similar(data, pointers): 27%
19 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
Pattern-based partial match
Algorithm
Consider each 32-bit word as an input
Allow partial dictionary match
Use most frequent patterns based on statistical analysis
Optimizations
Two-way associative LRU 16-entry hash-mapped dictionary
Optimized coding scheme
Early termination for uncompressable data
Fine-grained operation parallelization
20 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
PBPM evaluations
PBPM twice as fast as LZO
Compression ratio degrades by 10%
Improves performance and still doubles usable memory
21 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Design overviewPattern-based partial match compression
CRAMES and in-RAM filesystem compression
Compressed
RAM disk reserved
Kernel
Main
memory
7%
2.2 MB
55.6%
17.8 MB
37.5%
12 MB
22 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Outline
1. IntroductionMotivationPast work
2. CRAMES designDesign overviewPattern-based partial match compression
3. Experimental evaluationExperimental setupCommercialization and future directions
23 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Experimental setup
Sharp Zaurus SL-5600 PDA
Intel XScale PXA250
32 MB flash memory
32 MB RAM
Embedix (Linux 2.4.18 kernel)
Qt/Qtopia PDA edition
24 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
CRAMES for compressed filesystems
CRAMES was used to create a compressed RAM device onZaurus SL-5600
EXT2 file system
LZO compression
Resource map memory allocation
Benchmarks: common file system operations
Increase capacity to 1.6×Execution time increases by 8.4%
Energy consumption increases by 5.2%
25 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Benchmarks
Impact of using CRAMES to reduce physical RAM
Results for
ADPCM: Speech compression application from MediaBench
JPEG: Image encoding application from MediaBench
MPEG2: Video CODEC application from MediaBench
Straight-forward matrix multiplication
Intentionally difficult for CRAMES
Also tested on
10 GUI applications that came with Qtopia
Next-generation cellphone prototype
26 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Results
Reduced RAM from 20 MB to 8MB
Base case: 20 MB RAM, no compression
Without CRAMES
All suffered significant performance penalties
Matrix multiplication cannot execute
With CRAMES
LZO average case 9% overhead, worst case 29%
PBPM average case 2.5% overhead, worst case 9%
Also works on arbitrary in-RAM filesystems
27 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Status
Patent pending: Northwestern University and NEC
Millions of cellphones using CRAMES started shipping in June2007
Technology won Computerwold Horizon Award in 2007
28 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
What did this require?
Bunny suit army?
29 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
What did this require?
Bunny suit army?No! One good Ph.D. student.
30 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Another directionMemory expansion for MMU-less embedded systems
user application
byte code
modified byte code
executable
LLVM compiler
MEMMU compiler
Calls to
MEMMU
library
LLVM compiler
Instruction
replacement,
compiler-time
optimizations
MEMMU
runtime
library
Observations and Results
Application: Sensor networks
Implemented in LLVM,tested on TelosB nodes
Increases usable memory by50%, unchanged applications
Little overhead aftercompiler optimizations
CASES’06
With Lan Bai and Lei Yang
31 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Acknowledgments
This work was supported by NEC Labs America and NSF awardCCR-0347941.
We would like to acknowledge Hui Ding of NorthwesternUniversity for his suggestions on efficiently implementing PBPMand his help testing CRAMES with GUI applications.
Professors Peter Dinda, Gokhan Memik, and Lawrence Henschenprovided suggestions and feedback.
32 Robert Dick CRAMES
IntroductionCRAMES design
Experimental evaluation
Experimental setupCommercialization and future directions
Thank you for attending!
For additional information about this work
CRAMES and MEMMU athttp://ziyang.eecs.northwestern.edu/∼dickrp/tools.html
For additional information about the team that created it
http://ziyang.eecs.northwestern.edu/∼dickrp/people/students.html
33 Robert Dick CRAMES