MRShare: Sharing Across Multiple Queries in MapReduce

By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Presented by Xiaolan Wang and Pengfei Tang


Page 2: MRShare: Sharing Across Multiple Queries in  MapReduce

Motivation

• Reducing the execution time
• Reducing energy consumption
• Monetary savings

*http://aws.amazon.com/ec2/#pricing

Page 3: MRShare: Sharing Across Multiple Queries in  MapReduce

MRShare – a sharing framework for Map Reduce

• MRShare framework:
  – Inspired by sharing primitives from the relational domain
  – Introduces a cost model for Map Reduce jobs
  – Searches for the optimal sharing strategies
  – Does not change the Map Reduce computational model


Page 4: MRShare: Sharing Across Multiple Queries in  MapReduce

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary


Page 6: MRShare: Sharing Across Multiple Queries in  MapReduce

Map Reduce recap.

[Figure: MapReduce data flow. Labels on the diagram: input splits (I), Map, network, Reduce, Output, HDFS.]

Page 7: MRShare: Sharing Across Multiple Queries in  MapReduce

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
  – Sharing scans
  – Sharing intermediate data
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary

Page 8: MRShare: Sharing Across Multiple Queries in  MapReduce

Sharing opportunities – sharing scans

• SELECT COUNT(*) FROM user GROUP BY hometown

• SELECT AVG(age) FROM user GROUP BY hometown

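[Figure: both SQL queries are compiled into a map stage and a reduce stage over the same table with columns User_id, Hometown, Occupation, Age.
COUNT job: for the record of user id1 (student, Toronto) the map emits Toronto 1; the reduce turns Toronto 1, Toronto 1, Toronto 1, Ottawa 1, Ottawa 1 into Toronto 3, Ottawa 2.
AVG job: for the same record the map emits Toronto 17; the reduce turns Toronto 17, Toronto 19, Montreal 20, Ottawa 23, Ottawa 25 into Toronto 18, Montreal 20, Ottawa 24.]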

Page 9: MRShare: Sharing Across Multiple Queries in  MapReduce

MRShare – sharing scans (map)

[Figure: a meta-map wraps Map 1, Map 2, Map 3, and Map 4; the input enters the meta-map and the map output leaves it.]

Page 10: MRShare: Sharing Across Multiple Queries in  MapReduce

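MRShare – sharing scans (reduce)

[Figure: a meta-reduce wraps Reduce 1, Reduce 2, Reduce 3, and Reduce 4. Its input is a table with tag columns J1, J2, J3, J4 followed by key and value; the rows shown are Toronto 1, Toronto 1, Toronto 1, Toronto 17, Toronto 19, Toronto 2, Toronto 5.]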
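The shared scan on the last two slides can be sketched in Python for the two hometown queries. This is a minimal illustration, not MRShare's actual code: the names `meta_map`, `meta_reduce`, `run_shared_scan`, and the tags "J1" and "J2" are assumptions, and the shuffle phase is simulated with an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical per-job functions for the COUNT(*) and AVG(age) queries.
def map_count(user):
    yield user["hometown"], 1

def map_avg_age(user):
    yield user["hometown"], user["age"]

def reduce_count(key, values):
    return key, sum(values)

def reduce_avg_age(key, values):
    return key, sum(values) / len(values)

# Assumed job registry: tag -> (map function, reduce function).
JOBS = {
    "J1": (map_count, reduce_count),
    "J2": (map_avg_age, reduce_avg_age),
}

def meta_map(user):
    """Apply every job's map to one record and tag each output pair."""
    for tag, (map_fn, _) in JOBS.items():
        for key, value in map_fn(user):
            yield key, (tag, value)

def meta_reduce(key, tagged_values):
    """Split values by tag and pass each subset to the matching reduce."""
    by_job = defaultdict(list)
    for tag, value in tagged_values:
        by_job[tag].append(value)
    return {tag: JOBS[tag][1](key, vals) for tag, vals in by_job.items()}

def run_shared_scan(users):
    """Scan the input once, then reduce per key (in-memory shuffle)."""
    shuffled = defaultdict(list)
    for user in users:
        for key, tagged in meta_map(user):
            shuffled[key].append(tagged)
    return {key: meta_reduce(key, vals) for key, vals in shuffled.items()}
```

The input is read once for both jobs, while the intermediate data still contains one tagged pair per job per record, which is why the sorting cost can grow.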

Page 11: MRShare: Sharing Across Multiple Queries in  MapReduce

Sharing Map Output

SELECT T.a, sum(T.b)
FROM T
WHERE T.a>10 AND T.a<20
GROUP BY T.a

SELECT T.a, avg(T.b)
FROM T
WHERE T.b>10 AND T.c<100
GROUP BY T.a
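Both queries group on T.a and aggregate T.b, so a tuple that passes both filters only needs to be emitted once. The sketch below illustrates that idea in Python under stated assumptions: the tags "Q1" and "Q2", the function names, and the dictionary-based rows are all hypothetical, and the shuffle is simulated in memory.

```python
from collections import defaultdict

def q1_filter(t):
    return 10 < t["a"] < 20

def q2_filter(t):
    return t["b"] > 10 and t["c"] < 100

def shared_map(t):
    """Emit (T.a, T.b) at most once, tagged with every query that needs it."""
    tags = []
    if q1_filter(t):
        tags.append("Q1")
    if q2_filter(t):
        tags.append("Q2")
    if tags:
        yield t["a"], (tuple(tags), t["b"])

def shared_reduce(tagged_values):
    """Compute sum(T.b) for Q1 and avg(T.b) for Q2 from the tagged values."""
    q1 = [b for tags, b in tagged_values if "Q1" in tags]
    q2 = [b for tags, b in tagged_values if "Q2" in tags]
    out = {}
    if q1:
        out["Q1"] = sum(q1)
    if q2:
        out["Q2"] = sum(q2) / len(q2)
    return out

def run_shared_output(rows):
    shuffled = defaultdict(list)
    emitted = 0
    for t in rows:
        for key, tagged in shared_map(t):
            shuffled[key].append(tagged)
            emitted += 1
    results = {key: shared_reduce(vals) for key, vals in shuffled.items()}
    return results, emitted
```

With three rows where one satisfies both filters, this emits three pairs instead of the four that two independent jobs would produce.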

Page 12: MRShare: Sharing Across Multiple Queries in  MapReduce

Sharing Map

SELECT T.c, sum(T.b)
FROM T
WHERE T.c > 10
GROUP BY T.c

SELECT T.a, avg(T.b)
FROM T
WHERE T.c > 10
GROUP BY T.a

Same reducing.

Page 13: MRShare: Sharing Across Multiple Queries in  MapReduce

Sharing Parts of Map

SELECT T.a, sum(T.b)
FROM T
WHERE T.c>10 AND T.a<20
GROUP BY T.a

SELECT T.a, avg(T.b)
FROM T
WHERE T.c>10 AND T.c<100
GROUP BY T.a
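Here the two WHERE clauses share the predicate T.c>10, so only that part of the map work can be done once. A minimal Python sketch of the idea, with hypothetical function names and tags ("Q1", "Q2"):

```python
def shared_filter(t):
    # Predicate common to both queries: T.c > 10.
    return t["c"] > 10

def q1_rest(t):
    # Remaining predicate of the first query: T.a < 20.
    return t["a"] < 20

def q2_rest(t):
    # Remaining predicate of the second query: T.c < 100.
    return t["c"] < 100

def partially_shared_map(t):
    """Evaluate the common predicate once, then the query-specific ones."""
    if not shared_filter(t):
        return
    if q1_rest(t):
        yield "Q1", t["a"], t["b"]
    if q2_rest(t):
        yield "Q2", t["a"], t["b"]
```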

Page 14: MRShare: Sharing Across Multiple Queries in  MapReduce

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithm
• MRShare Implementation and Evaluation
• Summary

Page 15: MRShare: Sharing Across Multiple Queries in  MapReduce

Cost model for Map Reduce (single job)

• Reading – f(input size)
• Sorting – f(intermediate data size)
• Transferring – f(intermediate data size)
• Writing – f(output size)

[Figure: timeline of a single job: reading input, sorting intermediate data, transferring, writing output.]

T(J) = T_read(J) + T_sort(J) + T_tr(J)

Page 16: MRShare: Sharing Across Multiple Queries in  MapReduce

Cost of executing a group of jobs

[Figure: J1, J2, and J3 executed separately each go through Read, Sort, Transfer, and Write; the merged job J1+J2+J3 goes through a single Read, Sort, Transfer, and Write. The stages are annotated as potential costs, savings, or potential savings.]

Page 17: MRShare: Sharing Across Multiple Queries in  MapReduce

Cost without grouping

n – the number of jobs
m – the number of map tasks
r – the number of reduce tasks
|Mi| – the average output size of a map task
|Ri| – the average input size of a reduce task
|Di| – the size of the intermediate data of job Ji

|Di| = |Mi| · m = |Ri| · r

n MapReduce jobs, J = {J1, . . . , Jn}, read from the same input file F.

Page 18: MRShare: Sharing Across Multiple Queries in  MapReduce

Cost with grouping

m – the number of map tasks
r – the number of reduce tasks
|Xm| – the average size of the combined output of map tasks
|Xr| – the average size of the combined input of reduce tasks
|XG| – the size of the intermediate data

|XG| = |Xm| · m = |Xr| · r

A single group G contains all n jobs and is executed as a single job JG.
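To make the comparison concrete, the following Python sketch evaluates T(J) = T_read(J) + T_sort(J) + T_tr(J) for n separate jobs and for one merged job. It is an illustration under several assumptions that are not taken from the slides: the per-unit constants `c_read`, `c_sort`, `c_transfer`, the simplified merge-pass count in `sort_passes`, and the choice |XG| = sum of |Di| for scan-only sharing.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class CostParams:
    c_read: float      # assumed cost per unit of input read
    c_sort: float      # assumed cost per unit of data per sort pass
    c_transfer: float  # assumed cost per unit of data transferred
    merge_factor: int  # assumed fan-in of the external merge sort
    num_maps: int      # m, the number of map tasks

def sort_passes(inter_size: float, p: CostParams) -> int:
    """Simplified pass count: grows as the per-map data outgrows the fan-in."""
    if p.merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    per_map = inter_size / p.num_maps
    passes, capacity = 1, p.merge_factor
    while capacity < per_map:
        capacity *= p.merge_factor
        passes += 1
    return passes

def job_cost(input_size: float, inter_size: float, p: CostParams) -> float:
    t_read = p.c_read * input_size
    t_sort = p.c_sort * inter_size * sort_passes(inter_size, p)
    t_tr = p.c_transfer * inter_size
    return t_read + t_sort + t_tr

def cost_without_grouping(input_size: float, inter_sizes: Sequence[float],
                          p: CostParams) -> float:
    # Every job Ji reads the whole input file F on its own.
    return sum(job_cost(input_size, d, p) for d in inter_sizes)

def cost_with_grouping(input_size: float, inter_sizes: Sequence[float],
                       p: CostParams) -> float:
    # One scan of F; the intermediate data of all jobs is sorted together.
    return job_cost(input_size, sum(inter_sizes), p)
```

Under these assumptions grouping wins when the input is large relative to the intermediate data, and can lose when the merged intermediate data needs an extra sort pass.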

Page 19: MRShare: Sharing Across Multiple Queries in  MapReduce


Beneficial conditions

n <= B

Page 20: MRShare: Sharing Across Multiple Queries in  MapReduce

Finding the optimal sharing strategy

• An optimization problem

[Figure: three alternative groupings of the jobs J1, J2, J3, J4, J5, two of them labelled “NoShare” and “GreedyShare”.]

Page 21: MRShare: Sharing Across Multiple Queries in  MapReduce

Sharing scans - cost based optimization

• Savings come from the reduced number of scans
• The sorting cost might change
• The costs of copying and writing the output do not change

[Figure: J1, J2, and J3 executed separately each go through Read and Sort; the merged job J1+J2+J3 goes through a single Read and Sort. The stages are annotated as potential costs or savings.]

Page 22: MRShare: Sharing Across Multiple Queries in  MapReduce

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
  – SplitJobs – cost based algorithm for sharing scans
  – MultiSplitJobs – an improvement of SplitJobs
• MRShare Evaluation
• Summary

Page 23: MRShare: Sharing Across Multiple Queries in  MapReduce

SplitJobs – a DP solution for sharing scans.

• We reduce the problem of grouping to the problem of splitting a sorted list of jobs by approximating the cost of sorting.
• Using our cost model and the approximation, we employ a dynamic programming (DP) algorithm to find the optimal split points.

[Figure: SplitJobs takes the sorted list J1 J2 J3 J4 J5 J6 and splits it into consecutive groups G1, G2, G3.]

Page 24: MRShare: Sharing Across Multiple Queries in  MapReduce


SplitJobs (cont.)

GS(i, l) = GAIN(i, l) − f

c(l) is the savings of the optimal grouping of jobs J1, …, Jl.
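The recurrence itself did not survive on this slide. The sketch below assumes the usual split-point form c(l) = max over 1 ≤ i ≤ l of c(i−1) + GS(i, l), with c(0) = 0 and a single-job group saving nothing. The function name `split_jobs` and the `group_savings` callback, which stands in for GS(i, l) = GAIN(i, l) − f, are hypothetical.

```python
from typing import Callable, List, Tuple

def split_jobs(n: int,
               group_savings: Callable[[int, int], float]
               ) -> Tuple[float, List[Tuple[int, int]]]:
    """Split the sorted jobs 1..n into consecutive groups with maximal savings.

    group_savings(i, l) plays the role of GS(i, l) for the group Ji..Jl, i < l.
    Returns the best total savings and the groups as (first, last) pairs.
    """
    c = [0.0] * (n + 1)    # c[l]: best savings for jobs 1..l, with c[0] = 0
    start = [0] * (n + 1)  # start[l]: first job of the last group in c[l]
    for l in range(1, n + 1):
        best, best_i = float("-inf"), l
        for i in range(1, l + 1):
            # Assumption: a job left on its own yields no savings.
            gs = 0.0 if i == l else group_savings(i, l)
            candidate = c[i - 1] + gs
            if candidate > best:
                best, best_i = candidate, i
        c[l], start[l] = best, best_i

    # Walk the recorded split points backwards to recover the groups.
    groups: List[Tuple[int, int]] = []
    l = n
    while l > 0:
        i = start[l]
        groups.append((i, l))
        l = i - 1
    groups.reverse()
    return c[n], groups
```

The double loop evaluates GS for every (i, l) pair, so the sketch runs in O(n²) calls to `group_savings`.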

Page 25: MRShare: Sharing Across Multiple Queries in  MapReduce

MultiSplitJobs – an improvement of SplitJobs

[Figure: MultiSplitJobs invokes SplitJobs three times over the jobs J1 to J8, producing the groups G1, G2, G3, and G4.]

Page 26: MRShare: Sharing Across Multiple Queries in  MapReduce


MultiSplitJobs (cont.)
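The body of this slide is missing from the transcript. One way repeated application of SplitJobs could work is sketched below, assuming that jobs left alone in a round are fed back into the next round until a round forms no shared group. The name `multi_split_jobs` and the `split_jobs` callback (any function that partitions a job list into groups) are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

Job = TypeVar("Job")

def multi_split_jobs(jobs: Sequence[Job],
                     split_jobs: Callable[[Sequence[Job]], List[List[Job]]]
                     ) -> List[List[Job]]:
    """Apply a splitter repeatedly to the jobs that were left ungrouped."""
    remaining = list(jobs)
    result: List[List[Job]] = []
    while remaining:
        groups = split_jobs(remaining)
        shared = [g for g in groups if len(g) > 1]
        singles = [g[0] for g in groups if len(g) == 1]
        if not shared:
            # No further sharing found: the leftover jobs run on their own.
            result.extend([job] for job in singles)
            break
        result.extend(shared)
        remaining = singles
    return result
```

Each round that continues removes at least two jobs from `remaining`, so the loop terminates.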

Page 27: MRShare: Sharing Across Multiple Queries in  MapReduce

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing primitives in Map-Reduce
• MRShare – Cost based approach to sharing
• MRShare Implementation and Evaluation
• Summary

Page 28: MRShare: Sharing Across Multiple Queries in  MapReduce

Implementing MRShare

• MRShare is implemented on Hadoop.
• First, a batch of jobs is acquired from the queries that arrive within a short time interval T.
• Second, MultiSplitJobs is called to compute the optimal grouping of the jobs.
• Third, the groups are rewritten using a meta-map and a meta-reduce function. These are MRShare-specific containers, and their functionality relies on tagging.
• Finally, the new jobs are submitted for execution.
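The four steps above can be sketched as a small Python pipeline. Everything here is a stand-in: `Query`, `acquire_batch`, `rewrite_group`, and `mrshare_pipeline` are hypothetical names, the grouping function is passed in rather than implemented, and the "rewritten job" is just a descriptive string instead of a real Hadoop job.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass(frozen=True)
class Query:
    name: str
    arrival: float  # arrival time in seconds (assumed unit)

def acquire_batch(queries: Sequence[Query], start: float,
                  interval_t: float) -> List[Query]:
    """Step 1: keep the queries that arrive within [start, start + T)."""
    return [q for q in queries if start <= q.arrival < start + interval_t]

def rewrite_group(group: Sequence[Query]) -> str:
    """Step 3 placeholder: describe the merged meta-map/meta-reduce job."""
    return "meta(" + "+".join(q.name for q in group) + ")"

def mrshare_pipeline(queries: Sequence[Query], start: float, interval_t: float,
                     group_jobs: Callable[[List[Query]], List[List[Query]]],
                     submit: Callable[[str], None]) -> List[str]:
    batch = acquire_batch(queries, start, interval_t)  # step 1
    groups = group_jobs(batch)                         # step 2
    merged = [rewrite_group(g) for g in groups]        # step 3
    for job in merged:                                 # step 4
        submit(job)
    return merged
```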

Page 29: MRShare: Sharing Across Multiple Queries in  MapReduce


Tagging for Sharing Only Scans

Page 30: MRShare: Sharing Across Multiple Queries in  MapReduce


Tagging for Sharing Map Output

Page 31: MRShare: Sharing Across Multiple Queries in  MapReduce


Tagging for Sharing Map Output

Page 32: MRShare: Sharing Across Multiple Queries in  MapReduce


Tagging for Sharing Map Output

Page 33: MRShare: Sharing Across Multiple Queries in  MapReduce

Evaluation setup

• 40 EC2 small-instance virtual machines
• Modified Hadoop engine
• 30 GB text dataset consisting of blogs
• Multiple grep-wordcount queries
  – Counts words matching a regular expression
  – Allows for variable intermediate data sizes
  – Generic aggregation Map Reduce job
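A grep-wordcount query can be sketched in a few lines of Python. This is an in-memory illustration, not the Hadoop job used in the evaluation: the function names are hypothetical, and the returned count of emitted pairs is only a rough stand-in for the intermediate data size that the regular expression controls.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

def grep_map(line: str, regex: "re.Pattern[str]") -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word on the line that matches the pattern."""
    for word in line.split():
        if regex.fullmatch(word):
            yield word, 1

def grep_wordcount(lines: Iterable[str], pattern: str) -> Tuple[Counter, int]:
    """Return the per-word counts and the number of emitted map pairs."""
    regex = re.compile(pattern)
    counts: Counter = Counter()
    emitted = 0
    for line in lines:
        for word, one in grep_map(line, regex):
            counts[word] += one  # reduce step: sum the ones per word
            emitted += 1
    return counts, emitted
```

A more selective pattern emits fewer pairs, which is how such queries can cover a range of intermediate data sizes over the same input.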

Page 34: MRShare: Sharing Across Multiple Queries in  MapReduce


Validation of the Cost Model

Page 35: MRShare: Sharing Across Multiple Queries in  MapReduce

Evaluation goals

• Sharing is not always beneficial.
  – ‘GreedyShare’ policy
• How much can we save on sharing scans?
  – MRShare - MultiSplitJobs evaluation
• How much can we save on sharing intermediate data?
  – MRShare - γ-MultiSplitJobs evaluation

Page 36: MRShare: Sharing Across Multiple Queries in  MapReduce

Is sharing always beneficial? – ‘GreedyShare’ policy

Group of jobs   Group size   d = |intermediate data| / |input data|
H1              16           0.3 < d < 0.7
H2              16           0.7 < d
H3              16           0.9 < d

Page 37: MRShare: Sharing Across Multiple Queries in  MapReduce

How much we save on sharing scans – MRShare MultiSplitJobs

Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2
G4              16           0.0 < d < max
G5              64           0.0 < d < max

Page 38: MRShare: Sharing Across Multiple Queries in  MapReduce


How much we save on sharing Map-output – MRShare MultiSplitJobs

Page 39: MRShare: Sharing Across Multiple Queries in  MapReduce

How much we save on sharing intermediate data - MRShare - γ-MultiSplitJobs

Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2

Page 40: MRShare: Sharing Across Multiple Queries in  MapReduce

Summary

• We introduced MRShare, a framework for automatic work sharing in Map Reduce.

• We identified sharing primitives and demonstrated the implementation thereof in a Map-Reduce engine.

• We established a cost model and solved several work sharing optimization problems.

• We demonstrated vast savings when using MRShare.


Page 41: MRShare: Sharing Across Multiple Queries in  MapReduce

Thank you!!!

Questions?
