© Hortonworks Inc. 2011
Daniel Dai (@daijy)
Thejas Nair (@thejasn)
Page 1
Making Pig Fly: Optimizing Data Processing on Hadoop
What is Apache Pig?
Architecting the Future of Big Data
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig-latin example
• Query: get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
• Faster development
– Fewer lines of code
– Don't re-invent the wheel
• Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Pig optimizations
• Ideally, the user should not have to bother
• Reality
– Pig is still young and immature
– Pig does not have the whole picture
 – Cluster configuration
 – Data histogram
– Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
– Safe transformations of the query to optimize it
– Optimized operations (join, sort)
• What you do
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use
Rule based optimizer
• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer
Column Pruner
• Pig will do column pruning automatically
• Cases where Pig will not do column pruning automatically
– No schema specified in the load statement

Pig will prune a2 automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';

No schema, so prune it yourself (DIY):
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';

becomes:
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';
Column Pruner
• Another case where Pig does not do column pruning
– Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';

DIY: prune a2 before grouping:
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it

– Original query: Filter (a0>0 and b0>10) applied to the output of Join(A, B)
– Split filter condition: Filter (a0>0) and Filter (b0>10) applied separately to the output of Join(A, B)
– Push up filter: Filter (a0>0) applied to A and Filter (b0>10) applied to B, before the join
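The rewrite can be sketched in Pig Latin (the aliases, fields, and join keys here are hypothetical):

-- What you write: filter after the join
A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
C = join A by a1, B by b1;
D = filter C by a0 > 0 and b0 > 10;

-- What effectively runs after the push: filters before the join
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
C1 = join A1 by a1, B1 by b1;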
Other push up/down
• Push down flatten
– Load → Flatten → Order becomes Load → Order → Flatten

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';

• Push up limit
– Load → Foreach → Limit becomes Load → Limit → Foreach, and the limit can be folded into the load itself: Load (limited) → Foreach
– Load → Order → Limit becomes Load → Order (limited)
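The two limit rewrites can be sketched in Pig Latin (aliases hypothetical):

-- Load -> Foreach -> Limit: the limit moves above the foreach,
-- and can then be folded into the load itself
A = load 'input';
B = foreach A generate $0 + $1;
C = limit B 10;

-- Load -> Order -> Limit: the limit is folded into the sort
D = load 'input';
E = order D by $0;
F = limit E 10;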
Partition pruning
• Prune unnecessary partitions entirely
– HCatLoader

– Before: HCatLoader reads partitions 2010, 2011, 2012; a separate Filter (year>=2011) then drops 2010
– After: HCatLoader (year>=2011) reads only the 2011 and 2012 partitions
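A sketch, assuming a table partitioned on a year column; the HCatLoader package name differs across HCatalog/Hive versions:

A = load 'default.weblogs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;
-- the filter on the partition column reaches the loader,
-- so the 2010 partition is never read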
Intermediate file compression
A Pig script compiles into a chain of MapReduce jobs, with Pig temp files written between jobs (map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3).

• Intermediate files between map and reduce
– Snappy
• Temp files between MapReduce jobs
– No compression by default
Enable temp file compression
• Pig temp files are not compressed by default
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache-compatible license
• Enable LZO compression
– Install LZO for Hadoop
– In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
– With LZO, more than 90% disk savings and up to 4x query speedup
Multiquery
• Combine two or more MapReduce jobs into one
– Happens automatically
– Cases where we want to control multiquery: it combines too many

Example plan: one Load feeding three branches (Group by $0, Group by $1, Group by $2), each followed by a Foreach and a Store.
Control multiquery
• Disable multiquery
– Command line option: -M
• Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
– Initial
– Intermediate
– Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

Map: Initial
Combiner: Intermediate
Reduce: Final
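The three stages must compose: applying Initial to chunks of the data, Intermediate to partial results, and Final at the end must give the same answer as one pass over everything. A minimal Python sketch of this decomposition for SUM (illustrative only, not Pig's actual Java interface):

```python
# Algebraic decomposition of SUM (sketch, not Pig's API).
def initial(records):        # map side: runs on each input chunk
    return sum(records)

def intermediate(partials):  # combiner: merges partial sums
    return sum(partials)

def final(partials):         # reduce side: produces the result
    return sum(partials)

data = [3, 1, 4, 1, 5, 9, 2, 6]
# Split the data across two "map tasks", combine, then reduce:
p1 = initial(data[:4])
p2 = initial(data[4:])
result = final([intermediate([p1]), intermediate([p2])])
assert result == sum(data)   # same answer as computing SUM in one pass
```

Because Intermediate can run any number of times (including zero), the combiner is free to merge partial results whenever map output spills.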
Implement the right UDF
• Accumulator UDF
– Reduce side UDF
– Normally takes a bag
• Benefit
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

public class my_accum extends EvalFunc<Long> implements Accumulator<Long> {
    public void accumulate(Tuple b) {
        // take one chunk of the bag
    }
    public Long getValue() {
        // called after all bag chunks are processed
    }
    public void cleanup() {
        // reset state between keys
    }
}
Memory optimization
• Control bag size on the reduce side
– If a bag's size exceeds the threshold, it spills to disk
– Tune the threshold so the bag fits in memory when possible:
pig.cachedbag.memusage=0.2

MapReduce signature: reduce(Text key, Iterator<Writable> values, …); the single iterator is materialized into one bag per input (bag of input 1, bag of input 2, bag of input 3).
Optimization starts before pig
• Input format
• Serialization format
• Compression
Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)
Input formats
[Chart: runtime in seconds for each input format]
Columnar format
• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns
RCFile test results
[Chart: runtime in seconds, Plain Text vs RCFile, for "project 1 column" and "project all columns"]
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query → measure → tune (repeat)
Cost based optimization - Aggregation

• Hash Based Agg
• Use pig.exec.mapPartAgg=true to enable

[Diagram: in the map task, the map logic feeds Hash Based Agg (HBA); HBA output goes to the reduce task]
Cost based optimization – Hash Agg.
• Auto-off feature
– Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure memory used: pig.cachedbag.memusage
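In pig.properties, for example (the numbers here are illustrative; check the defaults for your Pig version):

pig.exec.mapPartAgg = true
pig.exec.mapPartAgg.minReduction = 10
pig.cachedbag.memusage = 0.2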
Cost based optimization - Join
• Use the appropriate join algorithm
– Skew on the join key: skew join
– One input fits in memory: FR (fragment-replicate) join
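In Pig Latin the algorithm is chosen in the join clause (aliases hypothetical):

-- skew on the join key: skew join
J1 = join pages by uid, users by uid using 'skewed';
-- right-hand input small enough for memory: fragment-replicate join
J2 = join pages by uid, small_dim by uid using 'replicated';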
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
– Control spills using map-side sort parameters
– Reduce shuffle/sort-merge parameters
Parallelism of reduce tasks
[Chart: runtime vs number of reduce tasks (4, 6, 8, 24, 48, 256), ranging from about 0:14:24 to 0:25:55]

• Number of reduce slots = 6
• Factors affecting runtime
– Cores simultaneously used / skew
– Cost of having additional reduce tasks
Cost based optimization – keep data sorted
• Frequent join operations on the same keys
– Keep data sorted on the keys
– Use merge join
– Optimized group on sorted keys
– Works with few load functions; needs an additional interface implementation
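With both inputs already sorted on the key, the sort step is skipped entirely (sketch; the load function must support merge join, e.g. by implementing the ordered-load interfaces):

A = load 'sorted_left' as (key, v1);
B = load 'sorted_right' as (key, v2);
C = join A by key, B by key using 'merge';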
Optimizations for sorted data
[Chart: runtime breakdown (Sort1, Sort2, Join 1, Join 2) comparing sort+sort+join+join against join+join on already-sorted data]
Future Directions
• Optimize using stats
• Using historical stats with HCatalog
• Sampling
Questions
?