Compression Options in Hadoop - A Tale of Tradeoffs

Govind Kamat, Sumeet Singh. Hadoop Summit (San Jose), June 27, 2013

Description

Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4, and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with GridMix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. These will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible compression ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.

Transcript of Compression Options in Hadoop - A Tale of Tradeoffs

Page 1: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options In Hadoop – A Tale of Tradeoffs
Govind Kamat, Sumeet Singh

Hadoop Summit (San Jose), June 27, 2013

Page 2: Compression Options in Hadoop - A Tale of Tradeoffs

Introduction

Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA

Leads the Hadoop products team at Yahoo!

Responsible for Product Management, Customer Engagements, Evangelism, and Program Management

Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA

Member of Technical Staff in the Hadoop Services team at Yahoo!

Focuses on HBase and Hadoop performance

Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications

Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design

Page 3: Compression Options in Hadoop - A Tale of Tradeoffs

Agenda

1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work

Page 4: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Needs and Tradeoffs in Hadoop

The Compression Tradeoff: storage, disk I/O, and network bandwidth vs. CPU time.

Hadoop jobs are data-intensive, so compressing data can speed up their I/O operations; MapReduce jobs are almost always I/O bound.

Compressed data can save storage space and speed up data transfers across the network, so capital allocated to hardware can go further.

Reduced I/O and network load can bring significant performance improvements, and MapReduce jobs can finish faster overall.

On the other hand, CPU utilization and processing time increase during compression and decompression.

Understanding this tradeoff is important for the MapReduce pipeline's overall performance.

Page 5: Compression Options in Hadoop - A Tale of Tradeoffs

Data Compression in Hadoop’s MR Pipeline

[Diagram: the MapReduce pipeline from input splits through map, partition and sort, shuffle and merge (drawing from other maps and feeding other reducers), to reduce and output. Compression applies at three points: (1) compressed input is decompressed by the mapper; (2) map output is compressed, shuffled, and decompressed at the reducer input; (3) reducer output is compressed. Source: Hadoop: The Definitive Guide, Tom White]

Page 6: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (1/2)

Format | Algorithm | Strategy | Emphasis | Comments
zlib | DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib; codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy

Page 7: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Options in Hadoop (2/2)

Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y

NOTES:

Splittability – bzip2 is "splittable": it can be decompressed in parallel by multiple MapReduce tasks. The other formats must be decompressed in their entirety by a single MapReduce task.

LZO – removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported; the codec can be downloaded separately and enabled manually.

Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23.
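As an illustration of how these codecs surface in the API, the sketch below (assuming a Hadoop 2.x classpath) uses CompressionCodecFactory to map file extensions to codecs, the same mechanism input formats use, and checks splittability via the SplittableCompressionCodec interface that BZip2Codec implements:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class CodecProbe {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The factory reads io.compression.codecs and matches on file extension
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            String[] names = {"a.deflate", "a.gz", "a.bz2", "a.lzo", "a.lz4", "a.snappy"};
            for (String name : names) {
                CompressionCodec codec = factory.getCodec(new Path(name));
                if (codec == null) {
                    // e.g. LZO, unless its codec has been installed and registered
                    System.out.println(name + ": no codec registered");
                } else {
                    // Only codecs implementing SplittableCompressionCodec (bzip2)
                    // can be split across multiple map tasks
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.println(name + ": " + codec.getClass().getSimpleName()
                            + ", splittable=" + splittable);
                }
            }
        }
    }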

Page 8: Compression Options in Hadoop - A Tale of Tradeoffs

Space-Time Tradeoff of Compression Options

[Chart: Codec Performance on the Wikipedia Text Corpus. CPU time in seconds (compress + decompress) plotted against space savings (40%–75%). Bzip2 delivers the highest space savings at roughly 60.0 s of CPU time; zlib (DEFLATE, gzip) follows at roughly 32.3 s; LZO (~4.8 s), Snappy (~4.0 s), and LZ4 (~2.4 s) are far faster but save less space. The spectrum runs from high compression ratio (bzip2) to high compression speed (LZ4).]

Note: A 265 MB corpus from Wikipedia was used for the performance comparisons. Space savings is defined as [1 – (Compressed/Uncompressed)].

Page 9: Compression Options in Hadoop - A Tale of Tradeoffs

Using Data Compression in Hadoop

Phase in MR Pipeline | Config | Values

1. Input data to Map
File extension recognized automatically for decompression; the file extensions for the supported formats apply. Note: for SequenceFile, the header carries the information [compression (boolean), block compression (boolean), and compression codec]. Codec: one defined in io.compression.codecs.

2. Intermediate (Map) Output
mapreduce.map.output.compress = false (default) | true
mapreduce.map.output.compress.codec = one defined in io.compression.codecs

3. Final (Reduce) Output
mapreduce.output.fileoutputformat.compress = false (default) | true
mapreduce.output.fileoutputformat.compress.codec = one defined in io.compression.codecs
mapreduce.output.fileoutputformat.compress.type = type of compression to use for SequenceFile outputs: NONE, RECORD (default), or BLOCK
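To make the table above concrete, here is a minimal sketch of setting these properties through the Java MapReduce API (Hadoop 2.x property and class names; the choice of Snappy for map output and bzip2 for final output is illustrative only). Input decompression, per row 1, needs no configuration because the codec is inferred from the file extension.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressedJobSetup {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();

            // (2) Intermediate (map) output: favor a fast codec
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-job");

            // (3) Final (reduce) output: favor ratio and splittability
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

            // For SequenceFile outputs, choose NONE, RECORD, or BLOCK
            SequenceFileOutputFormat.setOutputCompressionType(
                    job, SequenceFile.CompressionType.BLOCK);
            return job;
        }
    }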

Page 10: Compression Options in Hadoop - A Tale of Tradeoffs

When to Use Compression and Which Codec

Input data to Map: compress the input data, if large. Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format (see the SequenceFile sketch below).

Intermediate (Map) Output: always use compression, particularly if maps spill to disk or network transfers are slow. Use faster codecs such as LZO, LZ4, or Snappy.

Final (Reduce) Output: compress for storage/archival, better write speeds, or between MR jobs. Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs.

[Diagram: the same three compression/decompression points in the Map, Shuffle & Sort, and Reduce stages as on page 5.]
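For the "zlib with the SequenceFile format" route, block compression keeps the file splittable even though zlib alone is not, because sync markers separate the compressed blocks. A minimal writer sketch, assuming the Hadoop 2.x SequenceFile API and a hypothetical output path and key/value types:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SeqFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/example.seq");  // hypothetical output path

            // BLOCK compression compresses batches of records with the zlib-based
            // DefaultCodec; sync markers between blocks keep the file splittable.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                writer.append(new IntWritable(1), new Text("record one"));
                writer.append(new IntWritable(2), new Text("record two"));
            }
        }
    }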

Page 11: Compression Options in Hadoop - A Tale of Tradeoffs

Compression in the Hadoop Ecosystem

Component | When to Use | What to Use

Pig: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size. Enable compression and select the codec:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = gzip, lzo

Hive: intermediate files produced by Hive between multiple map-reduce jobs, and when Hive writes output to a table. Enable intermediate or output compression:
hive.exec.compress.intermediate = true
hive.exec.compress.output = true

HBase: compress data at the column-family level (support for LZO, gzip, Snappy, and LZ4). List the required JNI libraries in hbase.regionserver.codecs. Enable compression:
create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
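The HBase shell commands above can also be expressed through the Java client; the sketch below assumes the 0.94-era client API (the table and column-family names are hypothetical, and the exact classes vary across HBase versions, so check against your release):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CompressedTableSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("table");   // hypothetical name
            HColumnDescriptor colfam = new HColumnDescriptor("colfam");
            colfam.setCompressionType(Compression.Algorithm.LZO);     // compress at the CF level
            table.addFamily(colfam);
            admin.createTable(table);
            admin.close();
        }
    }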

Page 12: Compression Options in Hadoop - A Tale of Tradeoffs

Compression in Hadoop at Yahoo!

[Pie charts keyed to the three pipeline stages (Map, Shuffle & Sort, Reduce):

Intermediate (Map) output, 4.2M jobs, Jun 10-16, 2013: 99.8% compressed vs. 0.2% uncompressed. Codec share: LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%. Includes intermediate Pig/Hive compression.

Final (Reduce) output, 4.2M jobs, Jun 10-16, 2013: 39.0% compressed vs. 61.0% uncompressed. Codec share: LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%.

Input data at rest, 380M files on Jun 16, 2013 (/data, /projects): a 98% / 2% split between compressed and uncompressed, as labeled in the chart. Codec share among compressed files: zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%.]

Page 13: Compression Options in Hadoop - A Tale of Tradeoffs

Compression for Data Storage Efficiency

DSE considerations at Yahoo!:

RCFile instead of SequenceFile

Faster implementation of bzip2

Native-code bzip2 codec: HADOOP-8462 [1], available in 0.23.7

Substituting the IPP library

[1] The native-code bzip2 implementation was done in collaboration with Jason Lowe, Hadoop Core PMC member.
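Assuming the behavior introduced by HADOOP-8462, the bzip2 implementation can be selected through the io.compression.codec.bzip2.library property. A small sketch of checking whether the native library was picked up; the method name is from org.apache.hadoop.io.compress.bzip2.Bzip2Factory and is worth verifying against your distribution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.bzip2.Bzip2Factory;

    public class NativeBzip2Check {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "system-native" uses libbz2 via JNI; "java-builtin" forces the pure-Java coder
            conf.set("io.compression.codec.bzip2.library", "system-native");
            System.out.println("native bzip2 loaded: "
                    + Bzip2Factory.isNativeBzip2Loaded(conf));
        }
    }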

Page 14: Compression Options in Hadoop - A Tale of Tradeoffs

IPP Libraries

Integrated Performance Primitives from Intel

Algorithmic and architectural optimizations

Processor-specific variants of each function

Applications remain processor-neutral

Compression: LZ, RLE, BWT, LZO

High-level formats include zlib, gzip, bzip2, and LZO

Page 15: Compression Options in Hadoop - A Tale of Tradeoffs

Measuring Standalone Performance

Standard programs (gzip, bzip2) used

Driver program written for other cases

32-bit mode

Single-threaded

JVM load overhead discounted

Default compression level

Quad-core Xeon machine

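The deck does not reproduce the driver program; a minimal sketch of what such a single-threaded timing harness might look like (the codec class name and input file arrive as hypothetical command-line arguments, and JVM startup is excluded by timing only the codec work):

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CodecTimer {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // args[0]: codec class, e.g. org.apache.hadoop.io.compress.SnappyCodec
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                    conf.getClassByName(args[0]), conf);
            byte[] input = Files.readAllBytes(Paths.get(args[1]));  // args[1]: input file

            long start = System.nanoTime();  // time only the compression itself
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (OutputStream out = codec.createOutputStream(compressed)) {
                out.write(input);
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("compressed %d -> %d bytes in %.2f s%n",
                    input.length, compressed.size(), secs);
        }
    }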

Page 16: Compression Options in Hadoop - A Tale of Tradeoffs

Data Corpuses Used

Binary files

Generated text from randomtextwriter

Wikipedia corpus

Silesia corpus


Page 17: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Ratio

[Chart: resulting file size in MB (0–300) for each corpus (exe, rtext, wiki, silesia) when stored uncompressed and when compressed with zlib, bzip2, LZO, Snappy, and LZ4.]

Page 18: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance

[Chart: compression CPU time in seconds (0–90) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: zlib 29, IPP-zlib 23, Java-bzip2 63, bzip2 44, IPP-bzip2 26.]

Page 19: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance (Fast Algorithms)

[Chart: compression CPU time in seconds (0–3.5) for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: LZO 3.2, Snappy 2.9, LZ4 1.7.]

Page 20: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance

[Chart: decompression CPU time in seconds (0–25) for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpora. Legible value labels, apparently for the wiki corpus: Java-bzip2 21, bzip2 17, IPP-bzip2 12; the zlib and IPP-zlib labels are garbled in the source (roughly 3 and 2).]

Page 21: Compression Options in Hadoop - A Tale of Tradeoffs

Decompression Performance (Fast Algorithms)

[Chart: decompression CPU time in seconds (0–2) for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpora. Labeled values, apparently for the wiki corpus: LZO 1.6, Snappy 1.1, LZ4 0.7.]

Page 22: Compression Options in Hadoop - A Tale of Tradeoffs

Compression Performance within Hadoop

Daytona performance framework

GridMix v1

Loadgen and sort jobs

Input data compressed with zlib / bzip2

LZO used for intermediate compression

35 datanodes, dual-quad-core machines


Page 23: Compression Options in Hadoop - A Tale of Tradeoffs

Map Performance

[Chart: map time in seconds (0–50) for Java-bzip2, bzip2, IPP-bzip2, and zlib. Labeled values: Java-bzip2 47, bzip2 46, IPP-bzip2 46, zlib 33.]

Page 24: Compression Options in Hadoop - A Tale of Tradeoffs

Reduce Performance

[Chart: reduce time in minutes (0–35) for Java-bzip2, bzip2, IPP-bzip2, and zlib. Labeled values: Java-bzip2 31, bzip2 28, IPP-bzip2 18, zlib 14.]

Page 25: Compression Options in Hadoop - A Tale of Tradeoffs

Job Performance

[Chart: job time in minutes (0–40) for Java-bzip2, bzip2, IPP-bzip2, and zlib on sort and loadgen jobs. Labeled values, in that codec order: sort 38, 34, 23, 19; loadgen 38, 34, 25, 18.]

Page 26: Compression Options in Hadoop - A Tale of Tradeoffs

Future Work

Splittability support for native-code bzip2 codec

Enhancing Pig to use common bzip2 codec

Optimizing the JNI interface and buffer copies

Varying the compression effort parameter

Performance evaluation for 64-bit mode

Updating the zlib codec to specify alternative libraries

Other codec combinations, such as zlib for transient data

Other compression algorithms


Page 27: Compression Options in Hadoop - A Tale of Tradeoffs

Considerations in Selecting Compression Type

Nature of the data set

Chained jobs

Data-storage efficiency requirements

Frequency of compression vs. decompression

Requirement for compatibility with a standard data format

Splittability requirements

Size of the intermediate and final data

Alternative implementations of compression libraries


Page 28: Compression Options in Hadoop - A Tale of Tradeoffs