Compression Options in Hadoop - A Tale of Tradeoffs
Govind Kamat, Sumeet Singh
Hadoop Summit (San Jose), June 27, 2013
Introduction
Sumeet Singh, Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
Leads Hadoop products team at Yahoo!
Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat, Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
Member of Technical Staff in the Hadoop Services team at Yahoo!
Focuses on HBase and Hadoop performance
Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology and electronic design
Agenda
1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
Compression Needs and Tradeoffs in Hadoop
The Compression Tradeoff: savings in storage, disk I/O, and network bandwidth, paid for with CPU time
Hadoop jobs are data-intensive, so compressing data can speed up I/O operations. MapReduce jobs are almost always I/O bound.
Compressed data can save storage space and speed up data transfers across the network.
Capital allocation for hardware can go further.
Reduced I/O and network load can bring significant performance improvements. MapReduce jobs can finish faster overall.
On the other hand, CPU utilization and processing time increase during compression and decompression.
Understanding the tradeoffs is important for the overall performance of the MapReduce pipeline.
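The tradeoff can be made concrete with a back-of-the-envelope model. The sketch below is an illustration, not something from the slides: it assumes a single sustained bandwidth figure, and it assumes codec CPU time and I/O do not overlap, which is pessimistic. All function and parameter names are hypothetical.

```python
def io_time_s(size_mb, bandwidth_mb_s):
    """Seconds needed to move size_mb at a sustained bandwidth in MB/s."""
    return size_mb / bandwidth_mb_s


def compressed_total_s(size_mb, bandwidth_mb_s, space_savings, cpu_s):
    """I/O time for the compressed bytes plus CPU time spent in the codec.

    Assumption: CPU work and I/O are not overlapped.
    space_savings is 1 - (compressed / uncompressed), as a fraction.
    """
    compressed_mb = size_mb * (1.0 - space_savings)
    return io_time_s(compressed_mb, bandwidth_mb_s) + cpu_s


def compression_pays_off(size_mb, bandwidth_mb_s, space_savings, cpu_s):
    """True when the compressed path beats moving the raw bytes."""
    raw_s = io_time_s(size_mb, bandwidth_mb_s)
    packed_s = compressed_total_s(size_mb, bandwidth_mb_s, space_savings, cpu_s)
    return packed_s < raw_s


if __name__ == "__main__":
    # 1000 MB over a 100 MB/s link, 50% space savings, two CPU budgets.
    for cpu_s in (2.0, 8.0):
        print(cpu_s, compression_pays_off(1000, 100, 0.5, cpu_s))
```

Under these assumptions, compression helps only while the codec's CPU time stays below the I/O time it saves, which is why slower links and larger data favor it.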
Data Compression in Hadoop’s MR Pipeline
[Figure: MapReduce data flow (input splits, map, buffer in memory, partition and sort, merge on disk, fetch, merge and sort, reduce, output; other maps and other reducers). Source: Hadoop: The Definitive Guide, Tom White]
Compression and decompression points in the pipeline:
1. Map input: the input is compressed; the mapper decompresses it.
2. Sort & Shuffle: the mapper output is compressed; the reducer decompresses its input.
3. Reduce output: the reducer output is compressed.
Compression Options in Hadoop (1/2)
Format | Algorithm | Strategy | Emphasis | Comments
zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib, codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
Compression Options in Hadoop (2/2)
Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java / Native
zlib / DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y / Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y / Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y / Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N / Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N / Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N / Y
NOTES:
Splittability – bzip2 is “splittable”: a single file can be decompressed in parallel by multiple MapReduce tasks. The other algorithms require all blocks together, so a file must be decompressed by a single MapReduce task.
LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23
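As a rough illustration of how a file extension selects a codec, the codec table can be written as a lookup. This is a plain Python sketch, not Hadoop's own Java implementation; the function names are hypothetical, and treating a file with no known extension as uncompressed (and therefore splittable) is an assumption.

```python
import os

# Rows copied from the codec table: extension -> (codec class, splittable).
CODECS_BY_EXTENSION = {
    ".deflate": ("org.apache.hadoop.io.compress.DefaultCodec", False),
    ".gz": ("org.apache.hadoop.io.compress.GzipCodec", False),
    ".bz2": ("org.apache.hadoop.io.compress.BZip2Codec", True),
    ".lzo": ("com.hadoop.compression.lzo.LzoCodec", False),
    ".lz4": ("org.apache.hadoop.io.compress.Lz4Codec", False),
    ".snappy": ("org.apache.hadoop.io.compress.SnappyCodec", False),
}


def codec_for_path(path):
    """Return (codec class name, splittable) for a path, or None if unknown."""
    _, extension = os.path.splitext(path)
    return CODECS_BY_EXTENSION.get(extension.lower())


def is_splittable(path):
    """Whether several tasks could each read a part of this file.

    Assumption: a file without a known compression extension is
    uncompressed and can be split.
    """
    entry = codec_for_path(path)
    if entry is None:
        return True
    return entry[1]


if __name__ == "__main__":
    for name in ("part-00000.bz2", "part-00000.gz", "part-00000"):
        print(name, codec_for_path(name), is_splittable(name))
```

With this table, a large `.gz` input is handled by one task, while a `.bz2` input of the same size can be shared among several.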
Space-Time Tradeoff of Compression Options
[Figure: Codec Performance on the Wikipedia Text Corpus. X-axis: space savings (40% to 75%). Y-axis: CPU time in seconds (compress + decompress), 0 to 70.]
CPU time by codec: Bzip2 60.0, Zlib (Deflate, Gzip) 32.3, LZO 4.8, Snappy 4.0, LZ4 2.4
Bzip2 and Zlib fall at the high-compression-ratio end of the chart; LZO, Snappy, and LZ4 fall at the high-compression-speed end.
Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as 1 - (Compressed / Uncompressed).
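The space-savings definition and the CPU-time measurement can be tried out with Python's standard zlib and bz2 modules. This is a minimal sketch on generated sample text, not the Wikipedia corpus or the test setup from the talk, so absolute numbers will differ; `sample_text` and `measure` are hypothetical helper names.

```python
import bz2
import time
import zlib


def space_savings(uncompressed_size, compressed_size):
    """Space savings = 1 - (compressed / uncompressed)."""
    return 1.0 - compressed_size / uncompressed_size


def sample_text(lines=20000):
    """Generate repetitive, numbered text lines as stand-in input data."""
    line = b"the quick brown fox jumps over the lazy dog near the cluster\n"
    return b"".join(b"%d %s" % (i, line) for i in range(lines))


def measure(name, compress, decompress, data):
    """Round-trip the data and report space savings and CPU seconds."""
    start = time.process_time()
    packed = compress(data)
    unpacked = decompress(packed)
    cpu_s = time.process_time() - start
    if unpacked != data:
        raise ValueError("round trip failed for " + name)
    return {
        "codec": name,
        "savings": space_savings(len(data), len(packed)),
        "cpu_s": cpu_s,
    }


if __name__ == "__main__":
    data = sample_text()
    for result in (
        measure("zlib", zlib.compress, zlib.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ):
        print("%(codec)s: savings=%(savings).3f cpu=%(cpu_s).3fs" % result)
```

Running the same harness on representative production data is a more reliable guide than any single benchmark corpus.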
Using Data Compression in Hadoop
Phase in MR Pipeline | Config | Values
Input data to Map | File extension recognized automatically for decompression | File extensions for supported formats. Note: For SequenceFile, headers have the information [compression (boolean), block compression (boolean), and compression codec]
Input data to Map | One of the supported codecs | one defined in io.compression.codecs
Intermediate (Map) Output | mapreduce.map.output.compress | false (default), true
Intermediate (Map) Output | mapreduce.map.output.compress.codec | one defined in io.compression.codecs
Final (Reduce) Output | mapreduce.output.fileoutputformat.compress | false (default), true
Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.codec | one defined in io.compression.codecs
Final (Reduce) Output | mapreduce.output.fileoutputformat.compress.type | Type of compression to use for SequenceFile outputs: NONE, RECORD (default), BLOCK
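One way to apply these properties per job is through `-D property=value` generic options on the command line. The helper below is a hypothetical sketch that only assembles the argument list from the property names in the table; it assumes the job's driver parses generic options (for example through ToolRunner), which not every job does.

```python
SEQUENCEFILE_TYPES = ("NONE", "RECORD", "BLOCK")


def compression_options(map_codec=None, output_codec=None, output_type=None):
    """Build -D arguments that enable map-output and final-output compression.

    map_codec / output_codec: fully qualified codec class names.
    output_type: SequenceFile compression type, used only with output_codec.
    """
    properties = []
    if map_codec is not None:
        properties.append(("mapreduce.map.output.compress", "true"))
        properties.append(("mapreduce.map.output.compress.codec", map_codec))
    if output_codec is not None:
        properties.append(("mapreduce.output.fileoutputformat.compress", "true"))
        properties.append(
            ("mapreduce.output.fileoutputformat.compress.codec", output_codec)
        )
        if output_type is not None:
            if output_type not in SEQUENCEFILE_TYPES:
                raise ValueError("unknown SequenceFile type: " + output_type)
            properties.append(
                ("mapreduce.output.fileoutputformat.compress.type", output_type)
            )
    args = []
    for key, value in properties:
        args.extend(["-D", key + "=" + value])
    return args


if __name__ == "__main__":
    print(" ".join(compression_options(
        map_codec="org.apache.hadoop.io.compress.SnappyCodec",
        output_codec="org.apache.hadoop.io.compress.BZip2Codec",
        output_type="BLOCK",
    )))
```

The same key/value pairs can instead be set cluster-wide in the site configuration; the command-line form is convenient for trying a codec on a single job.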
When to Use Compression and Which Codec
Phase | When to use | Which codec
1. Input data to Map | Compress the input data, if large | Use a splittable algorithm such as bzip2, or use zlib with SequenceFile format
2. Intermediate (Map) Output | Always use compression, particularly if there is spillage or slow network transfers | Use faster codecs such as LZO, LZ4, or Snappy
3. Final Reduce Output | Compress for storage/archival, better write speeds, or between MR jobs | Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
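The guidance for the three phases can be condensed into a small lookup. This sketch only restates the recommendations above; the phase keys and the `chained_job` flag are hypothetical names chosen for illustration.

```python
FAST_CODECS = ["LZO", "LZ4", "Snappy"]


def recommend_codecs(phase, chained_job=False):
    """Return the codecs suggested for a pipeline phase.

    phase: "input", "intermediate", or "output" (hypothetical keys).
    chained_job: True when the final output feeds another MapReduce job.
    """
    if phase == "input":
        # Splittable bzip2, or zlib inside a SequenceFile container.
        return ["bzip2", "zlib with SequenceFile"]
    if phase == "intermediate":
        return list(FAST_CODECS)
    if phase == "output":
        if chained_job:
            return list(FAST_CODECS)
        # Standard utilities for data interchange and archival.
        return ["gzip", "bzip2"]
    raise ValueError("unknown phase: " + phase)


if __name__ == "__main__":
    for phase in ("input", "intermediate", "output"):
        print(phase, recommend_codecs(phase))
    print("output (chained)", recommend_codecs("output", chained_job=True))
```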
Compression in the Hadoop Ecosystem
Component | When to Use | What to Use
Pig | Compressing data between MR jobs. Typical in Pig scripts that include joins or other operators that expand your data size | Enable compression and select the codec: pig.tmpfilecompression = true, pig.tmpfilecompression.codec = gzip or lzo
Hive | Intermediate files produced by Hive between multiple map-reduce jobs | Enable intermediate compression: hive.exec.compress.intermediate = true
Hive | Hive writes output to a table | Enable output compression: hive.exec.compress.output = true
HBase | Compress data at the CF level (support for LZO, gzip, Snappy, and LZ4) | List required JNI libraries: hbase.regionserver.codecs
HBase | Enabling compression | create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
HBase | Enabling compression | alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
Compression in Hadoop at Yahoo!
[Figure: pie charts of compression usage for the three pipeline phases (1. Input data to Map, 2. Intermediate (Map) Output, 3. Final Reduce Output). Chart captions: "4.2M Jobs, Jun 10-16, 2013" (two charts), "380M Files on Jun 16, 2013 (/data, /projects)", "Includes intermediate Pig/Hive compression", "Pig Intermediate", "Compressed".]
Chart values (the assignment of each chart to a phase is not recoverable here):
99.8% / 0.2% split; codecs: LZO 98.3%, gzip 1.1%, zlib / default 0.5%, bzip2 0.1%
39.0% / 61.0% split; codecs: LZO 55%, gzip 35%, bzip2 5%, zlib / default 5%
98% / 2% split; codecs: zlib / default 73%, gzip 22%, bzip2 4%, LZO 1%
Compression for Data Storage Efficiency
DSE considerations at Yahoo!
RCFile instead of SequenceFile
Faster implementation of bzip2
Native-code bzip2 codec
HADOOP-8462 [1], available in 0.23.7
Substituting the IPP library
[1] Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
IPP Libraries
Integrated Performance Primitives from Intel
Algorithmic and architectural optimizations
Processor-specific variants of each function
Applications remain processor-neutral
Compression: LZ, RLE, BWT, LZO
High level formats include: zlib, gzip, bzip2 and LZO
Measuring Standalone Performance
Standard programs (gzip, bzip2) used
Driver program written for other cases
32-bit mode
Single-threaded
JVM load overhead discounted
Default compression level
Quad-core Xeon machine
Data Corpuses Used
Binary files
Generated text from randomtextwriter
Wikipedia corpus
Silesia corpus
Compression Ratio
[Figure: bar chart of file size in MB (0 to 300) for uncompressed data, zlib, bzip2, LZO, Snappy, and LZ4; one series per corpus: exe, rtext, wiki, silesia]
Compression Performance
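[Figure: bar chart of compression CPU time in seconds (0 to 90) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2; one series per corpus: exe, rtext, wiki, silesia]
Data labels shown: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26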
Compression Performance (Fast Algorithms)
[Figure: bar chart of compression CPU time in seconds (0 to 3.5) for LZO, Snappy, and LZ4; one series per corpus: exe, rtext, wiki, silesia]
Data labels shown: LZO 3.2, Snappy 2.9, LZ4 1.7
Decompression Performance
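[Figure: bar chart of decompression CPU time in seconds (0 to 25) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2; one series per corpus: exe, rtext, wiki, silesia]
Data labels shown: zlib 3, IPP-zlib 2, Java-bzip2 21, bzip2 17, IPP-bzip2 12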
Decompression Performance (Fast Algorithms)
[Figure: bar chart of decompression CPU time in seconds (0 to 2) for LZO, Snappy, and LZ4; one series per corpus: exe, rtext, wiki, silesia]
Data labels shown: LZO 1.6, Snappy 1.1, LZ4 0.7
Compression Performance within Hadoop
Daytona performance framework
GridMix v1
Loadgen and sort jobs
Input data compressed with zlib / bzip2
LZO used for intermediate compression
35 datanodes, dual-quad-core machines
Map Performance
[Figure: bar chart of map time in seconds (0 to 50) by input codec]
Data labels shown: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33
Reduce Performance
[Figure: bar chart of reduce time in minutes (0 to 35) by codec]
Data labels shown: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14
Job Performance
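[Figure: bar chart of job time in minutes (0 to 40) by codec, with two series: sort and loadgen]
Data labels shown, in legend order: sort: Java-bzip2 38, bzip2 34, IPP-bzip2 23, zlib 19; loadgen: Java-bzip2 38, bzip2 34, IPP-bzip2 25, zlib 18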
Future Work
Splittability support for native-code bzip2 codec
Enhancing Pig to use common bzip2 codec
Optimizing the JNI interface and buffer copies
Varying the compression effort parameter
Performance evaluation for 64-bit mode
Updating the zlib codec to specify alternative libraries
Other codec combinations, such as zlib for transient data
Other compression algorithms
Considerations in Selecting Compression Type
Nature of the data set
Chained jobs
Data-storage efficiency requirements
Frequency of compression vs. decompression
Requirement for compatibility with a standard data format
Splittability requirements
Size of the intermediate and final data
Alternative implementations of compression libraries