Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...
Transcript of Bohr: Similarity Aware Geo-distributed Data Analytics · San Paulo Sydney Tokyo Frankfurt...
Bohr: Similarity Aware Geo-distributed Data
AnalyticsHangyu Li, Hong Xu, Sarana Nutanong
City University of Hong Kong
1
Big Data Analytics
2
GenerateAnalysis
Data are geo-distributed
3
US VirginiaUS California
US Oregon
San PauloSydney
Tokyo
Frankfurt
Singapore
Centralizing data is infeasible
4
• Moving data through wide area networks(WANS) can be extremely slow.
• Data sovereignty laws.
US VirginiaUS California
US Oregon
San Paulo
Sydney
Tokyo
Frankfurt
Singapore1. Vulimiri et al., NSDI’152. Pu et al., SIGCOMM’153. Viswanathan et al., OSDI’164. Hsieh et al., NSDI’17
Processing queries in-place
5
Bottleneck site exists.
US VirginiaUS California
US Oregon
San PauloSydney
Tokyo
Frankfurt
Singapore
Processing queries in-place
6
System Workload Task / Data Placement
Optimize aspect
JetStream streaming no WAN bandwidth usage
Geode batch data WAN bandwidth usage
Iridium batch both query response time
Processing queries in-place
7
Iridium (Pu et al. SIGCOMM’15)
• Recurring queries, abundant computation resource.
• Transfer data out from the bottleneck site.
• Re-schedule reducer tasks.
Iridium can be improved
8
• similarity oblivious
• data reduction ratio is constantTokyo
SeoulOregon
Map1(with
combiner)
Reduce1
Reduce2
Reduce3
Input:UrlA,1UrlA,1UrlA,1
Inter-media:UrlA,3
Input:UrlB,1UrlB,1UrlA,1UrlC,1
Input:UrlC,1UrlC,1UrlB,1
Inter-media:UrlA,1UrlB,2UrlC,1
Inter-media:UrlB,1UrlC,2
Map3(with
combiner)
Map2(with
combiner) SiteIntermed
-iate record
Tokyo 3
Oregon 1
Seoul 2Total 6
Iridium can be improved
9
• similarity oblivious
• data reduction ratio is constantTokyo
SeoulOregon
Map1(with
combiner)
Reduce1
Reduce2
Reduce3
Input:UrlA,1UrlA,1UrlA,1UrlB,1
Inter-media:UrlA,3UrlB,1
Input:UrlB,1UrlC,1
Input:UrlC,1UrlC,1UrlB,1UrlA,1
Inter-media:UrlB,1UrlC,1
Inter-media:UrlB,1UrlC,2UrlA,1
Map3(with
combiner)
Map2(with
combiner)
(UrlB,1) Oregon (UrlA,1) Seoul
(UrlA,1) Tokyo to Seoul
(UrlB,1) Tokyo to OregonSite
Intermed-iate
recordTokyo 2
Oregon 2
Seoul 3Total 7
Iridium can be improved
10
• similarity oblivious
• data reduction ratio is constantTokyo
SeoulOregon
Map1(with
combiner)
Reduce1
Reduce2
Reduce3
Input:UrlA,1UrlA,1UrlA,1UrlA,1
Inter-media:UrlA,4
Input:UrlB,1UrlC,1
Input:UrlC,1UrlC,1UrlB,1UrlC,1
Inter-media:UrlB,1UrlC,1
Inter-media:UrlB,1UrlC,3
Map3(with
combiner)
Map2(with
combiner)
(UrlA,1) Oregon (UrlC,1) Seoul
(UrlC,1) Tokyo to Seoul
(UrlA,1) Tokyo to OregonSite
Intermed-iate
recordTokyo 2
Oregon 1
Seoul 2Total 5
11
Bohr DesignKey ideas:
• Recurring queries, abundant computation resource.
• Similarity aware data transfer.
• Measuring and predicting the data reduction ratio.
Bohr Design
12
Site1 Site2
Global Manager
OLAP Cube Generation
Map1Combine enabled
Reduce1 Reduce2
Job Queue
OLAP Cube Generation
OLAPCube1
OLAPCube2
Map2Combine enabled
probe
similardataset
query tasks
finalresult
Why using OLAP Cube?
13
• Sorting dataset according to the attribute.
• Using record level similarity score.
Why using OLAP Cube?
14
Advantages:
• Format the unformatted raw data.
• Multiple dimension cube for one dataset.
Similarity Search & Check
15
Search:
• OLAP instruction(eg. dice, slice, roll up) to retrieve the needed attributes.
Check:
• Cross site check: simple probes.
• The probe components.
site 1 site 2
OLAPCube1
OLAPCube2Probe
value(1)value(2)
…value(k)
Reduce tasks scheduling
16
t : data transfer time. I : input data size. R: data reduction ratio. U: uplink bandwidth.
D: downlink bandwidth. r : reduce tasks percentage.
Goal : Minimize the data transfer time
Time to download data from other sites.
Time to upload data from site i.
System Implementation
17
Based on Spark v2.1.0 and Iridium.
Utilize Apache Kylin OLAP data cube to deal with all the OLAP operation.
Use Gurobi Optimizer for optimize the LP function.
Evaluation Methodology
18
• Workloads:
• TPC-DS.
• Amplab big data benchmark.
• Hardware platform:
A real EC2 deployment with 10 sites(Singapore,Tokyo…).
Evaluation Methodology
19
• Baseline:
Iridium(Pu et al. Sigcomm’ 15).
• Performance metrics:
• Query completion time.
• Data reduction ratio.
Data Reduction Ratio
20
Singapo
reTok
yo
Oregon
Virginia
Ohio
FrankFurt
Seoul
Sydney
Londo
n
Irelan
d�10
0
10
20
30
40
50
60
Dat
are
duct
ion
com
pare
dto
in-p
lace
(%)
IridiumBohr
Bohr can increase the data reductio ratio in all sites.
Query Completion Time
21
Big data (scan)
Big data (UDF)
Big data (aggr)
TPC-DS0
2
4
6
8
10
12
14
16
Que
ryco
mpl
etio
nti
me
(sec
s)
IridiumBohr
Bohr saves 30% QCT compare to Iridium.
Conclusion
22
We propose Bohr:
• Similarity aware data transfer.
• Predicting and measuring data reduction ratio for each site.
Bohr is 30% faster than Iridium.
Thank you!
23
Future work
24
1. Generalize the basic idea to more workloads.
2. Dealing with the overhead bring by operate OLAP cubes.
3. Break recurring batch workload condition
4. Multiple types of queries accessing one dataset.