Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...

Bohr: Similarity Aware Geo-distributed Data

AnalyticsHangyu Li, Hong Xu, Sarana Nutanong

City University of Hong Kong

1

Big Data Analytics

2

GenerateAnalysis

Data are geo-distributed

3

US VirginiaUS California

US Oregon

San PauloSydney

Tokyo

Frankfurt

Singapore

Centralizing data is infeasible

4

• Moving data through wide area networks(WANS) can be extremely slow.

• Data sovereignty laws.


US Oregon

San Paulo

Sydney

Tokyo

Frankfurt

Singapore1. Vulimiri et al., NSDI’152. Pu et al., SIGCOMM’153. Viswanathan et al., OSDI’164. Hsieh et al., NSDI’17

Processing queries in-place

5

Bottleneck site exists.


US Oregon

San PauloSydney

Tokyo

Frankfurt

Singapore


6

System Workload Task / Data Placement

Optimize aspect

JetStream streaming no WAN bandwidth usage

Geode batch data WAN bandwidth usage

Iridium batch both query response time


7

Iridium (Pu et al. SIGCOMM’15)

• Recurring queries, abundant computation resource.

• Transfer data out from the bottleneck site.

• Re-schedule reducer tasks.

Iridium can be improved

8

• similarity oblivious

• data reduction ratio is constantTokyo

SeoulOregon

Map1(with

combiner)

Reduce1

Reduce2

Reduce3

Input:UrlA,1UrlA,1UrlA,1

Inter-media:UrlA,3

Input:UrlB,1UrlB,1UrlA,1UrlC,1

Input:UrlC,1UrlC,1UrlB,1

Inter-media:UrlA,1UrlB,2UrlC,1

Inter-media:UrlB,1UrlC,2

Map3(with

combiner)

Map2(with

combiner) SiteIntermed

-iate record

Tokyo 3

Oregon 1

Seoul 2Total 6


9



SeoulOregon

Map1(with

combiner)

Reduce1

Reduce2

Reduce3

Input:UrlA,1UrlA,1UrlA,1UrlB,1

Inter-media:UrlA,3UrlB,1

Input:UrlB,1UrlC,1

Input:UrlC,1UrlC,1UrlB,1UrlA,1


Inter-media:UrlB,1UrlC,2UrlA,1

Map3(with

combiner)

Map2(with

combiner)

(UrlB,1) Oregon (UrlA,1) Seoul

(UrlA,1) Tokyo to Seoul

(UrlB,1) Tokyo to OregonSite

Intermed-iate

recordTokyo 2

Oregon 2

Seoul 3Total 7


10



SeoulOregon

Map1(with

combiner)

Reduce1

Reduce2

Reduce3

Input:UrlA,1UrlA,1UrlA,1UrlA,1

Inter-media:UrlA,4

Input:UrlB,1UrlC,1

Input:UrlC,1UrlC,1UrlB,1UrlC,1



Map3(with

combiner)

Map2(with

combiner)

(UrlA,1) Oregon (UrlC,1) Seoul

(UrlC,1) Tokyo to Seoul

(UrlA,1) Tokyo to OregonSite

Intermed-iate

recordTokyo 2

Oregon 1

Seoul 2Total 5

11

Bohr DesignKey ideas:

• Recurring queries, abundant computation resource.

• Similarity aware data transfer.

• Measuring and predicting the data reduction ratio.

Bohr Design

12

Site1 Site2

Global Manager

OLAP Cube Generation

Map1Combine enabled

Reduce1 Reduce2

Job Queue

OLAP Cube Generation

OLAPCube1

OLAPCube2

Map2Combine enabled

probe

similardataset

query tasks

finalresult

Why using OLAP Cube?

13

• Sorting dataset according to the attribute.

• Using record level similarity score.

Why using OLAP Cube?

14

Advantages:

• Format the unformatted raw data.

• Multiple dimension cube for one dataset.

Similarity Search & Check

15

Search:

• OLAP instruction(eg. dice, slice, roll up) to retrieve the needed attributes.

Check:

• Cross site check: simple probes.

• The probe components.

site 1 site 2

OLAPCube1

OLAPCube2Probe

value(1)value(2)

…value(k)

Reduce tasks scheduling

16

t : data transfer time. I : input data size. R: data reduction ratio. U: uplink bandwidth.

D: downlink bandwidth. r : reduce tasks percentage.

Goal : Minimize the data transfer time

Time to download data from other sites.

Time to upload data from site i.

System Implementation

17

Based on Spark v2.1.0 and Iridium.

Utilize Apache Kylin OLAP data cube to deal with all the OLAP operation.

Use Gurobi Optimizer for optimize the LP function.

Evaluation Methodology

18

• Workloads:

• TPC-DS.

• Amplab big data benchmark.

• Hardware platform:

A real EC2 deployment with 10 sites(Singapore,Tokyo…).

Evaluation Methodology

19

• Baseline:

Iridium(Pu et al. Sigcomm’ 15).

• Performance metrics:

• Query completion time.

• Data reduction ratio.

Data Reduction Ratio

20

Singapo

reTok

yo

Oregon

Virginia

Ohio

FrankFurt

Seoul

Sydney

Londo

n

Irelan

d�10

0

10

20

30

40

50

60

Dat

are

duct

ion

com

pare

dto

in-p

lace

(%)

IridiumBohr

Bohr can increase the data reductio ratio in all sites.

Query Completion Time

21

Big data (scan)

Big data (UDF)

Big data (aggr)

TPC-DS0

2

4

6

8

10

12

14

16

Que

ryco

mpl

etio

nti

me

(sec

s)

IridiumBohr

Bohr saves 30% QCT compare to Iridium.

Conclusion

22

We propose Bohr:

• Similarity aware data transfer.

• Predicting and measuring data reduction ratio for each site.

Bohr is 30% faster than Iridium.

Thank you!

23

Future work

24

1. Generalize the basic idea to more workloads.

2. Dealing with the overhead bring by operate OLAP cubes.

3. Break recurring batch workload condition

4. Multiple types of queries accessing one dataset.

Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...

Documents

Transcript of Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...