Compression Options in Hadoop - A Tale of Tradeoffs

Govind Kamat, Sumeet Singh. Hadoop Summit (San Jose), June 27, 2013

Description

Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4, and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with GridMix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. These will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.

Transcript of Compression Options in Hadoop - A Tale of Tradeoffs

Page 1: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options In Hadoop – A Tale of Tradeoffs
Govind Kamat, Sumeet Singh

Hadoop Summit (San Jose), June 27, 2013

Page 2: Compression Options in Hadoop - A Tale of Tradeoffs

Introduction

Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA

Leads the Hadoop products team at Yahoo!

Responsible for Product Management, Customer Engagements, Evangelism, and Program Management

Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA

Member of Technical Staff in the Hadoop Services team at Yahoo!

Focuses on HBase and Hadoop performance

Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications

Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design

Page 3: Compression Options in Hadoop - A Tale of Tradeoffs

Agenda

1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work

Page 4: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Needs and Tradeoffs in Hadoop

The Compression Tradeoff: storage, disk I/O, and network bandwidth vs. CPU time.

Hadoop jobs are data-intensive, so compressing data can speed up their I/O operations; MapReduce jobs are almost always I/O bound.

Compressed data can save storage space and speed up data transfers across the network, so capital allocated to hardware can go further.

Reduced I/O and network load can bring significant performance improvements, and MapReduce jobs can finish faster overall.

On the other hand, CPU utilization and processing time increase during compression and decompression.

Understanding this tradeoff is important for the MapReduce pipeline's overall performance.

Page 5: Compression Options in Hadoop - A Tale of Tradeoffs

Data Compression in Hadoop’s MR Pipeline

[Diagram: the MapReduce pipeline from input splits through map, partition and sort, shuffle and merge (drawing from other maps and feeding other reducers), to reduce and output. Compression applies at three points: (1) compressed input is decompressed by the mapper; (2) map output is compressed, shuffled, and decompressed at the reducer input; (3) reducer output is compressed. Source: Hadoop: The Definitive Guide, Tom White]

Page 6: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (1/2)

Format | Algorithm | Strategy | Emphasis | Comments
zlib | DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib; codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy

Page 7: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (2/2)

Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y

NOTES:

Splittability – bzip2 is "splittable": it can be decompressed in parallel by multiple MapReduce tasks. The other formats must be decompressed in their entirety by a single MapReduce task.

LZO – removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported; the codec can be downloaded separately and enabled manually.

Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23.
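As an illustration of how these codecs surface in the API, the sketch below (assuming a Hadoop 2.x classpath) uses CompressionCodecFactory to map file extensions to codecs, the same mechanism input formats use, and checks splittability via the SplittableCompressionCodec interface that BZip2Codec implements:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class CodecProbe {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The factory reads io.compression.codecs and matches on file extension
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            String[] names = {"a.deflate", "a.gz", "a.bz2", "a.lzo", "a.lz4", "a.snappy"};
            for (String name : names) {
                CompressionCodec codec = factory.getCodec(new Path(name));
                if (codec == null) {
                    // e.g. LZO, unless its codec has been installed and registered
                    System.out.println(name + ": no codec registered");
                } else {
                    // Only codecs implementing SplittableCompressionCodec (bzip2)
                    // can be split across multiple map tasks
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.println(name + ": " + codec.getClass().getSimpleName()
                            + ", splittable=" + splittable);
                }
            }
        }
    }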

Page 8: Compression Options in Hadoop - A Tale of Tradeoffs

Space-Time Tradeoff of Compression Options

[Chart: Codec Performance on the Wikipedia Text Corpus. CPU time in seconds (compress + decompress) plotted against space savings (40%–75%). Bzip2 delivers the highest space savings at roughly 60.0 s of CPU time; zlib (DEFLATE, gzip) follows at roughly 32.3 s; LZO (~4.8 s), Snappy (~4.0 s), and LZ4 (~2.4 s) are far faster but save less space. The spectrum runs from high compression ratio (bzip2) to high compression speed (LZ4).]

Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as [1 – (Compressed/Uncompressed)].

Page 9: Compression Options in Hadoop - A Tale of Tradeoffs

Using Data Compression in Hadoop

Phase in MR Pipeline | Config | Values

1. Input data to Map
File extension recognized automatically for decompression; the file extensions for the supported formats apply. Note: for SequenceFile, the header carries the information [compression (boolean), block compression (boolean), and compression codec]. Codec: one defined in io.compression.codecs.

2. Intermediate (Map) Output
mapreduce.map.output.compress = false (default) | true
mapreduce.map.output.compress.codec = one defined in io.compression.codecs

3. Final (Reduce) Output
mapreduce.output.fileoutputformat.compress = false (default) | true
mapreduce.output.fileoutputformat.compress.codec = one defined in io.compression.codecs
mapreduce.output.fileoutputformat.compress.type = type of compression to use for SequenceFile outputs: NONE, RECORD (default), or BLOCK
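To make the table above concrete, here is a minimal sketch of setting these properties through the Java MapReduce API (Hadoop 2.x property and class names; the choice of Snappy for map output and bzip2 for final output is illustrative only). Input decompression, per row 1, needs no configuration because the codec is inferred from the file extension.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressedJobSetup {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();

            // (2) Intermediate (map) output: favor a fast codec
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-job");

            // (3) Final (reduce) output: favor ratio and splittability
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

            // For SequenceFile outputs, choose NONE, RECORD, or BLOCK
            SequenceFileOutputFormat.setOutputCompressionType(
                    job, SequenceFile.CompressionType.BLOCK);
            return job;
        }
    }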

Page 10: Compression Options in Hadoop - A Tale of Tradeoffs

When to Use Compression and Which Codec

Input data to Map: compress the input data, if large. Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format (see the SequenceFile sketch below).

Intermediate (Map) Output: always use compression, particularly if maps spill to disk or network transfers are slow. Use faster codecs such as LZO, LZ4, or Snappy.

Final (Reduce) Output: compress for storage/archival, better write speeds, or between MR jobs. Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs.

[Diagram: the same three compression/decompression points in the Map, Shuffle & Sort, and Reduce stages as on page 5.]
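For the "zlib with the SequenceFile format" route, block compression keeps the file splittable even though zlib alone is not, because sync markers separate the compressed blocks. A minimal writer sketch, assuming the Hadoop 2.x SequenceFile API and a hypothetical output path and key/value types:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SeqFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/example.seq");  // hypothetical output path

            // BLOCK compression compresses batches of records with the zlib-based
            // DefaultCodec; sync markers between blocks keep the file splittable.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                writer.append(new IntWritable(1), new Text("record one"));
                writer.append(new IntWritable(2), new Text("record two"));
            }
        }
    }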

Page 11: Compression Options in Hadoop - A Tale of Tradeoffs

Compression in the Hadoop Ecosystem

Component | When to Use | What to Use

Pig: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size. Enable compression and select the codec:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = gzip, lzo

Hive: intermediate files produced by Hive between multiple map-reduce jobs, and when Hive writes output to a table. Enable intermediate or output compression:
hive.exec.compress.intermediate = true
hive.exec.compress.output = true

HBase: compress data at the column-family level (support for LZO, gzip, Snappy, and LZ4). List the required JNI libraries in hbase.regionserver.codecs. Enable compression:
create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
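The HBase shell commands above can also be expressed through the Java client; the sketch below assumes the 0.94-era client API (the table and column-family names are hypothetical, and the exact classes vary across HBase versions, so check against your release):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedTableSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("table");   // hypothetical name
            HColumnDescriptor colfam = new HColumnDescriptor("colfam");
            colfam.setCompressionType(Compression.Algorithm.LZO);     // compress at the CF level
            table.addFamily(colfam);
            admin.createTable(table);
            admin.close();
        }
    }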

Page 12: Compression Options in Hadoop - A Tale of Tradeoffs

Compression in Hadoop at Yahoo!

[Pie charts keyed to the three pipeline stages (Map, Shuffle & Sort, Reduce):

Intermediate (Map) output, 4.2M jobs, Jun 10-16, 2013: 99.8% compressed vs. 0.2% uncompressed. Codec share: LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%. Includes intermediate Pig/Hive compression.

Final (Reduce) output, 4.2M jobs, Jun 10-16, 2013: 39.0% compressed vs. 61.0% uncompressed. Codec share: LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%.

Input data at rest, 380M files on Jun 16, 2013 (/data, /projects): a 98% / 2% split between compressed and uncompressed, as labeled in the chart. Codec share among compressed files: zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%.]

Page 13: Compression Options in Hadoop - A Tale of Tradeoffs

Compression for Data Storage Efficiency

DSE considerations at Yahoo!:

RCFile instead of SequenceFile

Faster implementation of bzip2

Native-code bzip2 codec: HADOOP-8462 [1], available in 0.23.7

Substituting the IPP library

[1] The native-code bzip2 implementation was done in collaboration with Jason Lowe, Hadoop Core PMC member.
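Assuming the behavior introduced by HADOOP-8462, the bzip2 implementation can be selected through the io.compression.codec.bzip2.library property. A small sketch of checking whether the native library was picked up; the method name is from org.apache.hadoop.io.compress.bzip2.Bzip2Factory and is worth verifying against your distribution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.bzip2.Bzip2Factory;

    public class NativeBzip2Check {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "system-native" uses libbz2 via JNI; "java-builtin" forces the pure-Java coder
            conf.set("io.compression.codec.bzip2.library", "system-native");
            System.out.println("native bzip2 loaded: "
                    + Bzip2Factory.isNativeBzip2Loaded(conf));
        }
    }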

Page 14: Compression Options in Hadoop - A Tale of Tradeoffs

IPP Libraries

Integrated Performance Primitives from Intel

Algorithmic and architectural optimizations

Processor-specific variants of each function

Applications remain processor-neutral

Compression: LZ, RLE, BWT, LZO

High-level formats include zlib, gzip, bzip2, and LZO

Page 15: Compression Options in Hadoop - A Tale of Tradeoffs

Measuring Standalone Performance

Standard programs (gzip, bzip2) used

Driver program written for other cases

32-bit mode

Single-threaded

JVM load overhead discounted

Default compression level

Quad-core Xeon machine

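The deck does not reproduce the driver program; a minimal sketch of what such a single-threaded timing harness might look like (the codec class name and input file arrive as hypothetical command-line arguments, and JVM startup is excluded by timing only the codec work):

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CodecTimer {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // args[0]: codec class, e.g. org.apache.hadoop.io.compress.SnappyCodec
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                    conf.getClassByName(args[0]), conf);
            byte[] input = Files.readAllBytes(Paths.get(args[1]));  // args[1]: input file

            long start = System.nanoTime();  // time only the compression itself
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (OutputStream out = codec.createOutputStream(compressed)) {
                out.write(input);
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("compressed %d -> %d bytes in %.2f s%n",
                    input.length, compressed.size(), secs);
        }
    }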

Page 16: Compression Options in Hadoop - A Tale of Tradeoffs

Data Corpuses Used

Binary files

Generated text from randomtextwriter

Wikipedia corpus

Silesia corpus


Page 17: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Ratio

[Chart: resulting file size in MB (0–300) for each corpus (exe, rtext, wiki, silesia) when stored uncompressed and when compressed with zlib, bzip2, LZO, Snappy, and LZ4.]

Page 18: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance

[Chart: compression CPU time in seconds (0–90) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26.]

Page 19: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance (Fast Algorithms)

[Chart: compression CPU time in seconds (0–3.5) for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: LZO 3.2, Snappy 2.9, LZ4 1.7.]

Page 20: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance

[Chart: decompression CPU time in seconds (0–25) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpora. Legible value labels, apparently for the wiki corpus: Java-bzip2 21, bzip2 17, IPP-bzip2 12; the zlib and IPP-zlib labels are garbled in the source (roughly 3 and 2).]

Page 21: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance (Fast Algorithms)

[Chart: decompression CPU time in seconds (0–2) for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: LZO 1.6, Snappy 1.1, LZ4 0.7.]

Page 22: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance within Hadoop

Daytona performance framework

GridMix v1

Loadgen and sort jobs

Input data compressed with zlib / bzip2

LZO used for intermediate compression

35 datanodes, dual-quad-core machines


Page 23: Compression Options in Hadoop - A Tale of Tradeoffs

Map Performance

[Chart: map time in seconds (0–50) for Java-bzip2, bzip2, IPP-bzip2, and zlib. Labeled values: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33.]

Page 24: Compression Options in Hadoop - A Tale of Tradeoffs

Reduce Performance

[Chart: reduce time in minutes (0–35) for Java-bzip2, bzip2, IPP-bzip2, and zlib. Labeled values: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14.]

Page 25: Compression Options in Hadoop - A Tale of Tradeoffs

Job Performance

[Chart: job time in minutes (0–40) for Java-bzip2, bzip2, IPP-bzip2, and zlib on sort and loadgen jobs. Labeled values, in that codec order: sort 38, 34, 23, 19; loadgen 38, 34, 25, 18.]

Page 26: Compression Options in Hadoop - A Tale of Tradeoffs

Future Work

Splittability support for native-code bzip2 codec

Enhancing Pig to use common bzip2 codec

Optimizing the JNI interface and buffer copies

Varying the compression effort parameter

Performance evaluation for 64-bit mode

Updating the zlib codec to specify alternative libraries

Other codec combinations, such as zlib for transient data

Other compression algorithms


Page 27: Compression Options in Hadoop - A Tale of Tradeoffs

Considerations in Selecting Compression Type

Nature of the data set

Chained jobs

Data-storage efficiency requirements

Frequency of compression vs. decompression

Requirement for compatibility with a standard data format

Splittability requirements

Size of the intermediate and final data

Alternative implementations of compression libraries


Page 28: Compression Options in Hadoop - A Tale of Tradeoffs