Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...

Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong. City University of Hong Kong.

Transcript of Bohr: Similarity Aware Geo-distributed Data Analytics

Page 1:

Bohr: Similarity Aware Geo-distributed Data Analytics

Hangyu Li, Hong Xu, Sarana Nutanong

City University of Hong Kong

Page 2:

Big Data Analytics

Generate, Analysis

Page 3:

Data are geo-distributed

Sites: US Virginia, US California, US Oregon, São Paulo, Sydney, Tokyo, Frankfurt, Singapore

Page 4:

Centralizing data is infeasible

• Moving data through wide area networks (WANs) can be extremely slow.

• Data sovereignty laws restrict moving data across borders.

Sites: US Virginia, US California, US Oregon, São Paulo, Sydney, Tokyo, Frankfurt, Singapore

References: 1. Vulimiri et al., NSDI’15. 2. Pu et al., SIGCOMM’15. 3. Viswanathan et al., OSDI’16. 4. Hsieh et al., NSDI’17.

Page 5:

Processing queries in-place

A bottleneck site exists.

Sites: US Virginia, US California, US Oregon, São Paulo, Sydney, Tokyo, Frankfurt, Singapore

Page 6:

Processing queries in-place

System      Workload     Task / Data Placement    Optimization target
JetStream   streaming    no                       WAN bandwidth usage
Geode       batch        data                     WAN bandwidth usage
Iridium     batch        both                     query response time

Page 7:

Processing queries in-place

Iridium (Pu et al., SIGCOMM’15)

• Assumes recurring queries and abundant computation resources.

• Transfers data out of the bottleneck site.

• Re-schedules reducer tasks.

Page 8:

Iridium can be improved

• Similarity oblivious.

• Treats the data reduction ratio as constant.

Example: each site runs a map task with a combiner (Map1, Map2, Map3), followed by Reduce1, Reduce2, Reduce3.

Site     Input                          Intermediate            Intermediate records
Tokyo    UrlB,1 UrlB,1 UrlA,1 UrlC,1    UrlA,1 UrlB,2 UrlC,1    3
Oregon   UrlA,1 UrlA,1 UrlA,1           UrlA,3                  1
Seoul    UrlC,1 UrlC,1 UrlB,1           UrlB,1 UrlC,2           2
Total                                                           6

Page 9:

Iridium can be improved

• Similarity oblivious.

• Treats the data reduction ratio as constant.

Example: each site runs a map task with a combiner (Map1, Map2, Map3), followed by Reduce1, Reduce2, Reduce3.

Transfers: (UrlA,1) Tokyo to Seoul; (UrlB,1) Tokyo to Oregon.

Site     Input                          Intermediate            Intermediate records
Tokyo    UrlB,1 UrlC,1                  UrlB,1 UrlC,1           2
Oregon   UrlA,1 UrlA,1 UrlA,1 UrlB,1    UrlA,3 UrlB,1           2
Seoul    UrlC,1 UrlC,1 UrlB,1 UrlA,1    UrlB,1 UrlC,2 UrlA,1    3
Total                                                           7

Page 10:

Iridium can be improved

• Similarity oblivious.

• Treats the data reduction ratio as constant.

Example: each site runs a map task with a combiner (Map1, Map2, Map3), followed by Reduce1, Reduce2, Reduce3.

Transfers: (UrlC,1) Tokyo to Seoul; (UrlA,1) Tokyo to Oregon.

Site     Input                          Intermediate            Intermediate records
Tokyo    UrlB,1 UrlC,1                  UrlB,1 UrlC,1           2
Oregon   UrlA,1 UrlA,1 UrlA,1 UrlA,1    UrlA,4                  1
Seoul    UrlC,1 UrlC,1 UrlB,1 UrlC,1    UrlB,1 UrlC,3           2
Total                                                           5
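The record counts in the examples on pages 8 to 10 can be reproduced with a short sketch. It assumes the combiner simply sums counts per key, so each site emits one intermediate record per distinct URL. The helper names (`combine`, `intermediate_records`, `recs`) and the assignment of inputs to sites are illustrative assumptions and are not taken from the Bohr or Iridium code.

```python
from collections import Counter

def combine(records):
    """Map-side combiner (assumed): sum the counts of records sharing a key."""
    combined = Counter()
    for key, count in records:
        combined[key] += count
    return dict(combined)

def intermediate_records(placement):
    """Number of intermediate records each site emits after combining."""
    return {site: len(combine(records)) for site, records in placement.items()}

def recs(*keys):
    """Hypothetical helper: build (url, 1) input records."""
    return [(key, 1) for key in keys]

# Inputs per site as listed on each page of the slides.
placements = {
    "page 8": {
        "Tokyo": recs("UrlB", "UrlB", "UrlA", "UrlC"),
        "Oregon": recs("UrlA", "UrlA", "UrlA"),
        "Seoul": recs("UrlC", "UrlC", "UrlB"),
    },
    "page 9": {
        "Tokyo": recs("UrlB", "UrlC"),
        "Oregon": recs("UrlA", "UrlA", "UrlA", "UrlB"),
        "Seoul": recs("UrlC", "UrlC", "UrlB", "UrlA"),
    },
    "page 10": {
        "Tokyo": recs("UrlB", "UrlC"),
        "Oregon": recs("UrlA", "UrlA", "UrlA", "UrlA"),
        "Seoul": recs("UrlC", "UrlC", "UrlB", "UrlC"),
    },
}

# Total intermediate records that the reducers must fetch in each case.
totals = {
    page: sum(intermediate_records(placement).values())
    for page, placement in placements.items()
}
print(totals)
```

Moving records that share keys with the destination site lets the combiner merge them there, which is why the total drops on page 10 but rises on page 9.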

Page 11:

Bohr Design

Key ideas:

• Recurring queries, abundant computation resources.

• Similarity aware data transfer.

• Measuring and predicting the data reduction ratio.

Page 12:

Bohr Design

[Figure: Bohr architecture. Components: Global Manager, Job Queue, Site1 and Site2, OLAP Cube Generation at each site (OLAPCube1, OLAPCube2), Map1 and Map2 (combine enabled), Reduce1 and Reduce2. Labeled flows: probe, similar dataset, query tasks, final result.]

Page 13:

Why use an OLAP cube?

• Sorting the dataset according to an attribute.

• Using a record-level similarity score.

Page 14:

Why use an OLAP cube?

Advantages:

• Formats the unformatted raw data.

• Multi-dimensional cube for one dataset.
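To make the cube idea concrete, here is a minimal sketch of how raw records might be aggregated into cells keyed by dimension values, with two standard OLAP operations on top. The record fields (`url`, `region`, `hits`) and the function names are hypothetical; a real deployment would rely on an OLAP engine rather than in-memory dictionaries.

```python
from collections import defaultdict

def build_cube(records, dimensions, measure):
    """Aggregate raw records into cells keyed by their dimension values."""
    cube = defaultdict(int)
    for rec in records:
        cell = tuple(rec[d] for d in dimensions)
        cube[cell] += rec[measure]
    return dict(cube)

def roll_up(cube, dimensions, keep):
    """Roll up: aggregate away every dimension not listed in `keep`."""
    idx = [dimensions.index(d) for d in keep]
    rolled = defaultdict(int)
    for cell, value in cube.items():
        rolled[tuple(cell[i] for i in idx)] += value
    return dict(rolled)

def slice_cube(cube, dimensions, dim, value):
    """Slice: keep only the cells where one dimension equals a fixed value."""
    i = dimensions.index(dim)
    return {cell: v for cell, v in cube.items() if cell[i] == value}

# Hypothetical raw log records.
raw = [
    {"url": "UrlA", "region": "Tokyo", "hits": 1},
    {"url": "UrlA", "region": "Tokyo", "hits": 1},
    {"url": "UrlB", "region": "Tokyo", "hits": 1},
    {"url": "UrlA", "region": "Seoul", "hits": 1},
]
dims = ["url", "region"]

cube = build_cube(raw, dims, "hits")
by_url = roll_up(cube, dims, ["url"])
tokyo_only = slice_cube(cube, dims, "region", "Tokyo")
print(cube, by_url, tokyo_only)
```

Once the data is in this form, records with identical dimension values already sit in the same cell, which is what makes similarity between datasets cheap to look up.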

Page 15:

Similarity Search & Check

Search:

• OLAP operations (e.g., dice, slice, roll-up) retrieve the needed attributes.

Check:

• Cross site check: simple probes.

• The probe components.

[Figure: a probe between site 1 (OLAPCube1) and site 2 (OLAPCube2) carrying value(1), value(2), …, value(k).]
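One way to picture the cross-site check is sketched below: site 1 sends k values from its cube, and site 2 reports what fraction of them it also holds. The slides do not say how probe values are chosen or scored, so the selection rule (first k keys in sorted order) and the hit-fraction score are assumptions made only for illustration.

```python
def make_probe(cube, k):
    """Pick up to k cell keys from the local cube (assumed: first k in sorted order)."""
    return sorted(cube)[:k]

def probe_similarity(probe, remote_cube):
    """Fraction of probed keys that also exist in the remote cube (assumed score)."""
    if not probe:
        return 0.0
    hits = sum(1 for key in probe if key in remote_cube)
    return hits / len(probe)

# Hypothetical per-site cubes keyed by URL.
site1_cube = {"UrlA": 3, "UrlB": 1, "UrlC": 2}
site2_cube = {"UrlA": 1, "UrlC": 5, "UrlD": 2}

probe = make_probe(site1_cube, k=3)
score = probe_similarity(probe, site2_cube)
print(probe, score)
```

A probe of this kind only ships k keys across the WAN, so the check stays cheap relative to moving the data itself.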

Page 16:

Reduce tasks scheduling

Notation:

• t: data transfer time.

• I: input data size.

• R: data reduction ratio.

• U: uplink bandwidth.

• D: downlink bandwidth.

• r: fraction of reduce tasks placed at a site.

Goal: minimize the data transfer time.

The formulation has two annotated terms:

• Time to download data from other sites.

• Time to upload data from site i.
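The equations themselves did not survive in this transcript, so the sketch below is only one plausible reading of the notation, modeled on the Iridium-style bottleneck formulation. It assumes that R is the fraction of input removed by map-side combining (so site i holds I_i(1 - R_i) of intermediate data), that site i uploads the share of its intermediate data destined for reducers elsewhere, and that it downloads its share r_i of every other site's intermediate data. The function name and example numbers are hypothetical.

```python
def transfer_time(inputs, reduction, uplink, downlink, reduce_frac):
    """Bottleneck transfer time t under the assumed model.

    inputs[i]      -- I_i, input data size at site i
    reduction[i]   -- R_i, assumed fraction of input removed by combining
    uplink[i]      -- U_i, uplink bandwidth of site i
    downlink[i]    -- D_i, downlink bandwidth of site i
    reduce_frac[i] -- r_i, fraction of reduce tasks placed at site i
    """
    inter = [i * (1.0 - r) for i, r in zip(inputs, reduction)]
    total = sum(inter)
    worst = 0.0
    for s, u, d, r in zip(inter, uplink, downlink, reduce_frac):
        upload = (1.0 - r) * s / u        # time to upload data from site i
        download = r * (total - s) / d    # time to download data from other sites
        worst = max(worst, upload, download)
    return worst

# Two hypothetical sites: site 0 combines away half of its input, site 1 none.
args = dict(inputs=[100, 100], reduction=[0.5, 0.0],
            uplink=[10, 10], downlink=[10, 10])

even = transfer_time(reduce_frac=[0.5, 0.5], **args)
skewed = transfer_time(reduce_frac=[1 / 3, 2 / 3], **args)
print(even, skewed)
```

Under these assumptions, placing more reduce tasks at the site that holds more intermediate data shortens the bottleneck, which is the kind of trade-off an LP solver would search over for r.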

Page 17:

System Implementation

Built on Spark v2.1.0 and Iridium.

Uses Apache Kylin OLAP data cubes for all OLAP operations.

Uses the Gurobi Optimizer to solve the LP.

Page 18:

Evaluation Methodology

• Workloads:

• TPC-DS.

• AMPLab Big Data Benchmark.

• Hardware platform:

A real EC2 deployment with 10 sites (Singapore, Tokyo, …).

Page 19:

Evaluation Methodology

• Baseline:

Iridium (Pu et al., SIGCOMM’15).

• Performance metrics:

• Query completion time.

• Data reduction ratio.

Page 20:

Data Reduction Ratio

[Figure: bar chart comparing Iridium and Bohr. Y-axis: data reduction compared to in-place (%), from -10 to 60. X-axis: Singapore, Tokyo, Oregon, Virginia, Ohio, Frankfurt, Seoul, Sydney, London, Ireland.]

Bohr increases the data reduction ratio at all sites.

Page 21:

Query Completion Time

[Figure: bar chart comparing Iridium and Bohr. Y-axis: query completion time (secs), from 0 to 16. X-axis: Big data (scan), Big data (UDF), Big data (aggr), TPC-DS.]

Bohr reduces query completion time (QCT) by 30% compared to Iridium.

Page 22:

Conclusion

We propose Bohr:

• Similarity aware data transfer.

• Predicting and measuring data reduction ratio for each site.

Bohr is 30% faster than Iridium.

Page 23:

Thank you!

Page 24:

Future work

1. Generalize the basic idea to more workloads.

2. Deal with the overhead of operating OLAP cubes.

3. Relax the recurring batch workload assumption.

4. Support multiple types of queries accessing one dataset.