Transcript of "Part 1: Systems Infrastructure for Big-Data Analytics" (2013-09-25)
Part 1: Systems Infrastructure for Big-Data Analytics
Ananth Grama
Center for Science of Information, Dept. of Computer Science, Purdue University
With help from Giorgos Kollias, Naresh Rapolu, Karthik Kambatla, and Adnan Hassan
Thanks to NSF, DoE, the Center for Science of Information, Microsoft, and Intel.
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 1 / 1
Outline
Motivating applications – PageRank, Graph Alignment
Case study: single mat-vec in MapReduce
Asynchrony and speculation: Optimizations across iterations
Asynchronous algorithms through relaxed synchronization
Speculative parallelism through TransMR (transactional MapReduce)
Locking techniques for efficient distributed transactions
Future Work
Motivating Example: Functional PageRank (PR)
Computing PageRank (PR)
PageRank as a random-surfer process: start surfing from a random node and keep following links with probability µ, restarting with probability 1 − µ; the node to restart from is selected according to a personalization vector v. The ranking value x_i of a node i is the probability of visiting this node during surfing.

PR can also be cast in power-series form as

x = (1 − µ) ∑_{j=0}^{k} µ^j S^j v,

where S encodes the column-stochastic adjacencies.

Functional rankings

A general method to assign ranking values to graph nodes:

x = ∑_{j=0}^{k} ζ_j S^j v.

PR is a functional ranking with ζ_j = (1 − µ) µ^j.

Terms are attenuated by the outdegrees in S and the damping coefficients ζ_j.
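As a concrete sketch of the power-series form above (the toy graph, the NumPy usage, and the helper name `functional_ranking` are illustrative, not from the slides):

```python
import numpy as np

# Toy 4-node directed graph; S is the column-stochastic adjacency matrix
# (column j holds 1/outdegree(j) for each out-neighbor of j).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 0]], dtype=float)   # A[i, j] = 1 iff edge j -> i
S = A / A.sum(axis=0)                        # normalize columns by outdegree

def functional_ranking(S, zeta, v):
    """x = sum_j zeta_j * S^j v : generic functional ranking."""
    x = np.zeros_like(v)
    term = v.copy()
    for z in zeta:
        x += z * term
        term = S @ term                      # next iterate S^j v
    return x

n = S.shape[0]
v = np.full(n, 1.0 / n)                      # uniform personalization vector
mu, k = 0.85, 50
# PageRank is the functional ranking with zeta_j = (1 - mu) * mu^j.
zeta_pr = [(1 - mu) * mu**j for j in range(k + 1)]
x_pr = functional_ranking(S, zeta_pr, v)
```

For k large enough, the truncated sum approaches the usual linear-system form x = (1 − µ)(I − µS)^{-1} v.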
Functional Rankings Through Multidamping
[Kollias, Gallopoulos, AG, TKDE’13]
[Figure: a surfer chain with continue probabilities µ1, µ2, ..., µκ and stop probabilities 1 − µ1, 1 − µ2, ..., 1 − µκ]

Computing µ_j in multidamping

Simulate a functional ranking by random surfers following emanating links with probability µ_j at step j, given by

µ_j = 1 − 1 / (1 + ρ_{k−j+1} / (1 − µ_{j−1})),  j = 1, ..., k,

where µ_0 = 0 and ρ_{k−j+1} = ζ_{k−j+1} / ζ_{k−j}.
Examples
LinearRank (LR): x_LR = ∑_{j=0}^{k} [2(k+1−j) / ((k+1)(k+2))] S^j v, with µ_j = j / (j+2), j = 1, ..., k.

TotalRank (TR): x_TR = ∑_{j=0}^{∞} [1 / ((j+1)(j+2))] S^j v, with µ_j = (k−j+1) / (k−j+2), j = 1, ..., k.
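The recurrence for µ_j can be checked against the LinearRank closed form quoted above; this is a small verification sketch (the helper name `multidamping` is not from the slides):

```python
# Recover the per-step damping factors mu_j from the coefficients zeta_j via
# mu_j = 1 - 1 / (1 + rho_{k-j+1} / (1 - mu_{j-1})), rho_i = zeta_i / zeta_{i-1},
# with mu_0 = 0.
def multidamping(zeta):
    k = len(zeta) - 1                 # zeta = [zeta_0, ..., zeta_k]
    mu, mu_prev = [], 0.0
    for j in range(1, k + 1):
        rho = zeta[k - j + 1] / zeta[k - j]
        mu_j = 1 - 1 / (1 + rho / (1 - mu_prev))
        mu.append(mu_j)
        mu_prev = mu_j
    return mu

k = 10
# LinearRank coefficients: zeta_j = 2(k+1-j) / ((k+1)(k+2))
zeta_lr = [2 * (k + 1 - j) / ((k + 1) * (k + 2)) for j in range(k + 1)]
mu_lr = multidamping(zeta_lr)
# Matches the closed form mu_j = j / (j + 2) for every step j.
```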
Multidamping and Computational Cost
Advantages of multidamping
Reduced computational cost in approximating functional rankings using the Monte Carlo approach. A random surfer terminates with probability 1 − µ_j at step j.
Inherently parallel and synchronization-free computation.
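A minimal Monte Carlo surfer along these lines, assuming a toy graph and the LinearRank damping schedule (helper names are illustrative); each surfer is independent, which is why the computation needs no synchronization:

```python
import random

# At step j the surfer continues with probability mu_j and terminates with
# probability 1 - mu_j; the node where it stops receives one "vote".
def surf(adj, mu_schedule, nodes, rng):
    node = rng.choice(nodes)                  # uniform personalization
    for mu_j in mu_schedule:
        if rng.random() >= mu_j or not adj[node]:
            break                             # terminate (or dangling node)
        node = rng.choice(adj[node])          # follow a random out-link
    return node

adj = {0: [1, 3], 1: [0, 2], 2: [0], 3: [1, 2]}   # toy graph as out-lists
rng = random.Random(42)
mu_schedule = [j / (j + 2) for j in range(1, 11)]  # LinearRank mu_j, k = 10
n_surfers = 20000
votes = [0, 0, 0, 0]
for _ in range(n_surfers):
    votes[surf(adj, mu_schedule, list(adj), rng)] += 1
ranking = [c / n_surfers for c in votes]      # Monte Carlo ranking estimate
```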
Multidamping Performance
[Left figure: TotalRank — Kendall tau vs. step for the top-k = 1000 nodes (uk-2005), iterations vs. surfers]
[Right figure: Personalized LinearRank — number of shared nodes (max = 30) vs. microstep (in-2004), iterations vs. surfers. For the seed node, 20% of the nodes rank better in the non-personalized run]

Approximate ranking: run n surfers to completion for graph size n. How well does the computed ranking capture the "reference" ordering of the top-k nodes, compared to standard iterations of equivalent computational cost (number of operations)? [Left]

Approximate personalized ranking: run fewer than n surfers to completion (each called a microstep, x-axis), but only from a selected node (personalized). How many of the "reference" top-k nodes are captured (y-axis), compared to the iterative approach of equivalent computational cost? [Right]
Motivating Example: Graph Matching
Node similarity: two nodes are similar if they are linked by other similar node pairs. By pairing similar nodes, the two graphs become aligned.

Let A and B be the normalized adjacency matrices of the graphs (normalized by columns), H_ij the independently known similarity scores (preferences matrix) of nodes i ∈ V_B and j ∈ V_A, and µ the fractional contribution of topological similarity.
To compute X, IsoRank iterates:

X ← µ B X A^T + (1 − µ) H
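The iteration above can be sketched in a few lines; here random nonnegative matrices with normalized columns stand in for the actual adjacency matrices (sizes and seed are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nB = 5, 4
A = rng.random((nA, nA)); A /= A.sum(axis=0)   # column-normalized adjacency of G_A
B = rng.random((nB, nB)); B /= B.sum(axis=0)   # column-normalized adjacency of G_B
H = rng.random((nB, nA)); H /= H.sum()         # known preferences, scaled to sum 1
mu = 0.8                                        # weight of topological similarity

X = H.copy()
for _ in range(30):                             # IsoRank triple-product iteration
    X = mu * (B @ X @ A.T) + (1 - mu) * H
```

Because both normalized matrices are column-stochastic, the iteration contracts with factor µ and converges to a fixed point.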
Network Similarity Decomposition (NSD) [Kollias, Mohammadi, AG, TKDE’12]
Network Similarity Decomposition (NSD)
After n steps of the iteration we reach

X^(n) = (1 − µ) ∑_{k=0}^{n−1} µ^k B^k H (A^T)^k + µ^n B^n H (A^T)^n

Assume that H = u v^T (one component). Two phases for computing X:

1. u^(k) = B^k u and v^(k) = A^k v (preprocess/compute iterates)
2. X^(n) = (1 − µ) ∑_{k=0}^{n−1} µ^k u^(k) (v^(k))^T + µ^n u^(n) (v^(n))^T (construct X)

This idea extends to s components, H ≈ ∑_{i=1}^{s} w_i z_i^T.
NSD computes matrix-vector iterates and builds X as a sum of outerproducts; these are much cheaper than triple matrix products.
We can then apply Primal-Dual or Greedy Matching (1/2 approximation) to extract the actual node pairs.
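The two NSD phases can be checked against the direct IsoRank iteration; a small sketch with random stand-in matrices (sizes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
nA, nB = 5, 4
A = rng.random((nA, nA)); A /= A.sum(axis=0)
B = rng.random((nB, nB)); B /= B.sum(axis=0)
u = rng.random(nB); v = rng.random(nA)
H = np.outer(u, v)                    # rank-1 preference matrix H = u v^T
mu, n = 0.8, 15

# Phase 1: matrix-vector iterates u^(k) = B^k u and v^(k) = A^k v.
us, vs = [u], [v]
for _ in range(n):
    us.append(B @ us[-1])
    vs.append(A @ vs[-1])

# Phase 2: build X as a sum of outer products.
X = mu**n * np.outer(us[n], vs[n])
for k in range(n):
    X += (1 - mu) * mu**k * np.outer(us[k], vs[k])

# Same result as n steps of the iteration X <- mu B X A^T + (1 - mu) H,
# but using only mat-vecs and outer products instead of triple products.
Y = H.copy()
for _ in range(n):
    Y = mu * (B @ Y @ A.T) + (1 - mu) * H
```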
![Page 9: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/9.jpg)
NSD: Performance [Kollias, Madan, Mohammadi, AG, BMC RN’12]
| Species | Nodes | Edges |
|---|---|---|
| celeg (worm) | 2805 | 4572 |
| dmela (fly) | 7518 | 25830 |
| ecoli (bacterium) | 1821 | 6849 |
| hpylo (bacterium) | 706 | 1414 |
| hsapi (human) | 9633 | 36386 |
| mmusc (mouse) | 290 | 254 |
| scere (yeast) | 5499 | 31898 |
| Species pair | NSD (secs) | PDM (secs) | GM (secs) | IsoRank (secs) |
|---|---|---|---|---|
| celeg-dmela | 3.15 | 152.12 | 7.29 | 783.48 |
| celeg-hsapi | 3.28 | 163.05 | 9.54 | 1209.28 |
| celeg-scere | 1.97 | 127.70 | 4.16 | 949.58 |
| dmela-ecoli | 1.86 | 86.80 | 4.78 | 807.93 |
| dmela-hsapi | 8.61 | 590.16 | 28.10 | 7840.00 |
| dmela-scere | 4.79 | 182.91 | 12.97 | 4905.00 |
| ecoli-hsapi | 2.41 | 79.23 | 4.76 | 2029.56 |
| ecoli-scere | 1.49 | 69.88 | 2.60 | 1264.24 |
| hsapi-scere | 6.09 | 181.17 | 15.56 | 6714.00 |
We compute similarity matrices X for various pairs of species using Protein-Protein Interaction (PPI) networks. µ = 0.80, uniform initial conditions (outer product of suitably normalized 1’s for each pair), 20 iterations, one component.
We then extract node matches using PDM and GM.
Three orders of magnitude speedup from NSD-based approaches compared to IsoRank.
NSD: Parallelization [KKG, JPDC’13, submitted; KMSAG, ParCo’13, submitted]

Parallelization: NSD has been ported to parallel and distributed platforms.

We have aligned graph instances of up to a million nodes using over 3K cores.

We process graph pairs of over a billion nodes and twenty billion edges each (!) on MapReduce-based distributed platforms.
More on this in the rest of the talk.
MapReduce: Basics
Execute user-defined functions in parallel on individual data items distributed across machines.

Simple programming model — map and reduce functions

Scalable, distributed execution of these functions on massive amounts of data on commodity hardware
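The model can be captured in a few lines; a minimal in-memory sketch (the `map_reduce` helper is illustrative, not a real MapReduce runtime):

```python
from collections import defaultdict

# Minimal in-memory model of the execution: apply map to every (k, v) record,
# group the intermediate pairs by key, then apply reduce per (k', value-list).
def map_reduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):    # map phase
            intermediate[out_key].append(out_value)      # shuffle/group by key
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}  # reduce

# Classic word count as a usage example.
docs = [("d1", "big data big compute"), ("d2", "big graphs")]
counts = map_reduce(docs,
                    map_fn=lambda k, text: [(w, 1) for w in text.split()],
                    reduce_fn=lambda w, ones: sum(ones))
# counts == {"big": 3, "data": 1, "compute": 1, "graphs": 1}
```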
[Figure: three map tasks, each with input <K1, V1> and output <K'1, V'1>, ..., <K'n, V'n>; an INTERMEDIATE <key, value-list> list; reduce tasks with input <K'1, V'1-list> and output <K''1, V''1>, ..., <K''n, V''n>. INPUT: <key, value> list; OUTPUT: value list]
MapReduce: DataFlow
[Figure: input splits (split 0, split 1, split 2) feed map tasks; map outputs are sorted, copied to the reducers, and merged; reduce tasks write the output partitions (part 0, part 1)]
Example: Shortest Path (mat-vec) in MapReduce
map(node, adjList) {
    for each arc in adjList {
        output(arc.dst, node.distance + arc.weight)
    }
}

reduce(node, newDistList) {
    node.distance = min(newDistList)
}

while (not converged) {
    runMapReduceJob(map, reduce, Ax)
}
We call this the naive mat-vec — each map takes an adjacency list as input.

The Pegasus (CMU) implementation reads one edge per map.
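The pseudocode above can be simulated end to end in plain Python; a small sketch (toy graph and helper name `sssp_round` are illustrative, not the slides' code):

```python
import math
from collections import defaultdict

# One MapReduce round of the relaxation: each map emits tentative distances to
# its neighbors; each reduce keeps the minimum distance seen for a node.
def sssp_round(dist, adj):
    candidates = defaultdict(list)
    for node, d in dist.items():
        candidates[node].append(d)             # keep the current distance
        if d < math.inf:
            for dst, w in adj[node]:           # map over the adjacency list
                candidates[dst].append(d + w)
    return {node: min(ds) for node, ds in candidates.items()}   # reduce: min

adj = {0: [(1, 2), (2, 7)], 1: [(2, 3)], 2: [(3, 1)], 3: []}
dist = {0: 0, 1: math.inf, 2: math.inf, 3: math.inf}
while True:                                    # iterate jobs until convergence
    new_dist = sssp_round(dist, adj)
    if new_dist == dist:
        break
    dist = new_dist
# dist == {0: 0, 1: 2, 2: 5, 3: 6}
```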
Naive mat-vec: Resource Utilization
[Figures: CPU usage (% CPU, user vs. waiting), network I/O (KB/s, in vs. out), and disk I/O (KB/s, block in vs. block out) over a ~1200-second naive mat-vec run]
Optimizing the mat-vec further
[Figure: percentage of total time spent in the map, spill, merge, shuffle, and reduce stages]
Stages overlap; hence, sum of percentages > 100
I/O time > Computation time
For performance:
Batch-read data
Each map processes more data (as much as fits in memory)
Partitioned mat-vec
Each map operates on a graph partition — a set of adjacency lists

Partition size is constrained by the heap size

On our setup, the maximum partition size was 2^19 nodes
[Figure: time in seconds vs. sub-graph size in powers of 2 (2^15.5–2^18.5)]
Performance: Pegasus vs Naive vs Partitioned
16 Amazon EC2 nodes
[Figure: time in seconds (log scale) vs. graph size in powers of 2 (2^23–2^27) for naive, partitioned, and pegasus]

32 Amazon EC2 nodes

[Figure: time in seconds (log scale) vs. graph size in powers of 2 (2^23–2^27) for naive and partitioned]
Optimizations Across Iterations
Algorithmic optimizations:
Asynchronous algorithms: algorithms that allow asynchrony — the ordering of updates doesn't affect correctness (e.g., PageRank, alignments)

Speculative parallelism: algorithms where concurrent computations can conflict; the conflicts are rare and can only be detected at runtime (e.g., Boruvka's MST, Single Source Shortest Path)
Asynchronous Iterations
Improve performance in parallel environments.
Infrequent synchronization reduces communication

Examples: graph algorithms, numerical methods, ML kernels, etc.

More pronounced gains in distributed environments

Higher communication and data-movement costs
Read once, process multiple times allows more computation for the same I/O cost
Relaxed Synchronization
Synchronize once every few iterations
Approach: two levels of MapReduce

Global MapReduce: the regular MapReduce — requires global synchronization

Local MapReduce: MapReduce within a global map — each global map runs a few iterations of local MapReduce; partial synchronization of the data of a single global map task

Input data is partitioned to reduce dependencies across partitions
PageRanks Using Traditional MapReduce
General PageRank

[Figure: in every iteration, the maps and reduces are separated by a global synchronization barrier. IEEE CLUSTER, Heraklion, Greece, 23rd September 2010]
PageRank with Relaxed Synchronization
Relaxed PageRank

[Figure: maps run several iterations separated by local barriers; the global synchronization barrier between maps and reduces is crossed only once every few iterations]
Realizing Relaxed Synchronization Semantics
Code gmap, greduce, lmap, lreduce

lmap, lreduce use EmitLocalIntermediate() and EmitLocal()
Synchronized hashtables for local storage
gmap(xs : X list) {
    while (no-local-convergence-intimated) {
        for each element x in xs {
            lmap(x); // emits lkey, lval
        }
        lreduce(); // operates on the output of the lmap functions
    }
    for each value in lreduce-output {
        EmitIntermediate(key, value);
    }
}
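The two-level scheme can be made concrete on shortest paths; a hedged sketch (partitioning, toy graph, and helper names `global_map`/`global_reduce` are illustrative) where each global map relaxes its own partition to a local fixed point before the barrier:

```python
import math
from collections import defaultdict

# Relaxed synchronization for SSSP: each global map task owns one partition and
# runs local relaxation rounds to a local fixed point, so cross-partition
# traffic happens once per global iteration rather than once per relaxation.
def global_map(nodes, dist, adj):
    local = {u: dist[u] for u in nodes}
    changed = True
    while changed:                              # local iterations, no barrier
        changed = False
        for u in nodes:
            for v, w in adj[u]:
                if v in local and local[u] + w < local[v]:
                    local[v] = local[u] + w     # relax inside the partition
                    changed = True
    out = defaultdict(lambda: math.inf)
    for u in nodes:
        out[u] = min(out[u], local[u])          # final local distances
        for v, w in adj[u]:
            if v not in local:                  # boundary edge: emit an update
                out[v] = min(out[v], local[u] + w)
    return out

def global_reduce(updates_list):                # the barrier: min per node
    merged = {}
    for updates in updates_list:
        for node, d in updates.items():
            merged[node] = min(merged.get(node, math.inf), d)
    return merged

adj = {0: [(1, 1)], 1: [(2, 1)], 2: [(3, 1)], 3: []}
parts = [[0, 1], [2, 3]]
dist = {0: 0, 1: math.inf, 2: math.inf, 3: math.inf}
for _ in range(2):                              # two global iterations suffice
    dist = global_reduce([global_map(p, dist, adj) for p in parts])
# dist == {0: 0, 1: 1, 2: 2, 3: 3}
```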
Evaluation — Relaxed Synchronization
Sample applications

Single Source Shortest Path (MST, transitive closure, etc.)
PageRank (mat-vec: eigenvalue and linear-system solvers)

Experimental testbed

8 Amazon EC2 Large instances: 64-bit compute units with 15 GB RAM and 4x 420 GB storage
Hadoop 0.20.1; 4 GB heap space per slave
Single Source Shortest Path
[Figures: number of iterations (left) and time in seconds (right) vs. number of partitions (100–6400) for General SSSP vs. Relaxed SSSP]
PageRank
Input: partitioned using METIS (less than 10 seconds)
Damping factor = 0.85
|  | GraphA | GraphB |
|---|---|---|
| Nodes | 280,000 | 100,000 |
| Edges | 3 million | 3 million |
PageRank Performance: GraphA
[Figures: number of iterations (left) and time in seconds (right) vs. number of partitions (100–6400) for General vs. Relaxed PageRank on GraphA]
PageRank Performance: GraphB
[Figures: number of iterations (left) and time in seconds (right) vs. number of partitions (100–6400) for General vs. Relaxed PageRank on GraphB]
Beyond Data-Parallelism: Speculation
Speculative parallelism
Most of the data can be operated on in parallel.
Some executions conflict; these conflicts can only be detected at runtime. [Pingali et al., PLDI’11]

Online algorithms / pipelined workflows

MapReduce Online [Condie’10] is an approach that needs heavy checkpointing.
Software Transactional Memory (STM)
TransMR: Transactional MapReduce
Goal: Exploit speculative data-parallelism
Support data-sharing across concurrent computations to detect and resolve conflicts at runtime
Solution:
Use distributed key-value stores as a shared address space across computations
Address inconsistencies arising from the disparate fault-tolerance mechanisms
Transactional execution of map and reduce functions
TransMR: System Architecture
Distributed key-value store provides a shared-memory abstraction to the distributed execution layer.
[Figure: nodes N1, N2, ..., Nn form the distributed execution layer on top of the distributed key-value store; on each node, computation units (CUs) with local stores (LSs) run against a global store (GS)]
Semantics of the API
A data-centric function scope (Map, Reduce, Merge, etc.), termed a Computation Unit (CU), is executed as a transaction.

Optimistic reads and write-buffering; the Local Store (LS) forms the write-buffer of a CU.

Put(K, V): write to the LS, which is later atomically committed to the GS.
Get(K): return from the LS if already present; otherwise, fetch from the GS and store in the LS.
Other ops: any thread-local operation.

The output of a CU is always committed to the GS before becoming visible to other CUs of the same or a different type.

This eliminates the costly shuffle phase of MapReduce.
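The CU semantics can be sketched as follows; this is an illustrative single-process model (class names, versioned validation, and the first-committer-wins policy are assumptions, not the TransMR implementation):

```python
# Sketch of a TransMR-style computation unit (CU): optimistic reads against a
# global store (GS), a per-CU write buffer (LS), validated at commit time.
class GlobalStore:
    def __init__(self):
        self.data, self.version = {}, {}

    def get(self, k):
        return self.data.get(k), self.version.get(k, 0)

    def commit(self, read_set, write_buffer):
        # Validate: abort if any key read by this CU changed since it was read.
        for k, seen_version in read_set.items():
            if self.version.get(k, 0) != seen_version:
                return False                    # conflict -> re-execute the CU
        for k, v in write_buffer.items():       # atomically install the writes
            self.data[k] = v
            self.version[k] = self.version.get(k, 0) + 1
        return True

class CU:
    def __init__(self, gs):
        self.gs, self.ls, self.reads = gs, {}, {}

    def get(self, k):
        if k in self.ls:                        # read-your-own-writes
            return self.ls[k]
        v, ver = self.gs.get(k)
        self.reads.setdefault(k, ver)           # remember the version we saw
        return v

    def put(self, k, v):
        self.ls[k] = v                          # buffered until commit

gs = GlobalStore()
cu1, cu2 = CU(gs), CU(gs)
cu1.get("x"); cu1.put("x", 1)
cu2.get("x"); cu2.put("x", 2)
ok1 = gs.commit(cu1.reads, cu1.ls)              # first committer wins
ok2 = gs.commit(cu2.reads, cu2.ls)              # conflicting CU aborts
# ok1 is True, ok2 is False
```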
Design Principles
Optimistic concurrency control over pessimistic locking
Locks are acquired at the end of the transaction. The write-buffer and read-set are validated against those of concurrent transactions, assuring serializability.
The client is potentially executing on the slowest node in the system; in that case, pessimistic locking hinders parallel transaction execution.
Consistency (C) and Tolerance to Network Partitions (P) overAvailability (A) in CAP Theorem for Distributed transactions.
Application correctness mandates strict consistency of execution; relaxed consistency models are application-specific optimizations.
Intermittent non-availability is not too costly for batch-processing applications, where the client is fault-prone in itself.
Evaluation
We show performance gains on two applications that were hitherto implemented sequentially, without transactional support; both exhibit optimistic data-parallelism.

Boruvka’s MST

Each iteration is coded as a Map function whose input is a node. Reduce is an identity function. Conflicting maps are serialized, while the others execute in parallel.
After n iterations of coalescing, we get the MST of an n-node graph.
Input: a graph of 100 thousand nodes with an average degree of 50, generated using the forest-fire model.
Boruvka’s MST
Speedup of 3.73 on 16 nodes, with less than 0.5% re-executions due to aborts.
[Figure: execution time in minutes and number of aborts vs. number of computing nodes (1–16)]
Maximum Flow Using Push-Relabel Algorithm
Each Map function executes a Push or a Relabel operation on the input node, depending on the constraints on its neighbors.

A Push operation increases the flow to a neighboring node and changes the excess of both nodes.

A Relabel operation increases the height of the input node if it is the lowest among its neighbors.

Conflicting Maps — those operating on neighboring nodes — get serialized due to their transactional nature.

Only a sequential implementation is possible without support for runtime conflict detection.
Maximum flow using Push-Relabel algorithm
A speedup of 4.5 is observed on 16 nodes, with 4% re-executions over a window of 40 iterations.
[Figure: execution time in minutes and number of aborts vs. number of computing nodes (1–16)]
TransMR: Intermediate Lessons
The TransMR programming model enables data-sharing in data-centric programming models, for enhanced applicability.

As in other data-centric programming models, the programmer only specifies the operation on an individual data element, without concern for its interaction with other operations.

A prototype implementation shows that many important applications can be expressed in this model while extracting significant performance gains through increased parallelism.
BUT: What about the locking operations!
Distributed Transactions on Key-Value Stores
Transactions are costly in a large-scale distributed setting:

two-phase locking (concurrency control)
two-phase commit (atomicity)

Careful examination of the protocols and optimizations is crucial to the performance of TransMR-like systems.

These optimizations are also useful for general-purpose transactions on databases that use a key-value store as the underlying storage.
Lock Management in Distributed Transactions
Lock management is the major bottleneck affecting the latency of distributed transactions.

Consider Strong Strict Two-Phase Locking (SS2PL), waiting case: the lock-acquiring stage is the only sequential stage; the other stages can be parallelized to finish in a single round-trip.

This holds true even for optimistic-concurrency techniques, where the locks are acquired at the end.
[Figure: sequential locking on ordered locks l1, l2, ..., ln — busy-wait on each lock in turn, moving to the next once acquired; if all locks are acquired, continue transaction execution (read/execute, update), then unlock on commit/abort]
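Acquiring locks in one agreed-on global order is what rules out deadlock; a minimal sketch with Python threads (the lock names and `run_transaction` helper are illustrative):

```python
import threading

# SS2PL-style acquisition: every transaction sorts its lock set in one global
# order, so two transactions can never wait on each other in a cycle.
locks = {name: threading.Lock() for name in ["a", "b", "c"]}

def run_transaction(keys, body):
    ordered = sorted(set(keys))          # the agreed-on global lock order
    for k in ordered:                    # sequential locking stage
        locks[k].acquire()
    try:
        return body()                    # execute and commit under the locks
    finally:
        for k in reversed(ordered):      # unlock on commit/abort
            locks[k].release()

result = []
t1 = threading.Thread(target=run_transaction,
                      args=(["b", "a"], lambda: result.append("t1")))
t2 = threading.Thread(target=run_transaction,
                      args=(["a", "b"], lambda: result.append("t2")))
t1.start(); t2.start(); t1.join(); t2.join()
# Both transactions complete despite requesting the keys in opposite orders.
```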
Workload Aware Lock Management
Contention-based lock ordering: order the locks so as to decrease the total amount of waiting time.

For the waiting case, the lock with the least contention should be acquired first. This increases pipelining while decreasing lock-holding times.

The contention order is a runtime characteristic and is updated consistently. All clients must adhere to the same order to avoid deadlocks.
[Figure: Locks l1 through l6 ordered by increasing or decreasing contention order and split into Group 1, Group 2, Group 3.]
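The ordering rule can be made concrete with a small sketch (illustrative only; `contention_counts` is a hypothetical name for the runtime contention statistic):

```python
def contention_order(lock_ids, contention_counts):
    """Order locks by increasing contention so the hottest lock is
    acquired last and therefore held for the shortest time.
    Ties are broken by lock id so the order is total and identical
    across all clients, which keeps the schedule deadlock-free."""
    return sorted(lock_ids,
                  key=lambda l: (contention_counts.get(l, 0), l))

counts = {"k_hot": 900, "k_warm": 40, "k_cold": 3}
print(contention_order(["k_hot", "k_cold", "k_warm"], counts))
# → ['k_cold', 'k_warm', 'k_hot']
```

Because the counts change at runtime, all clients must refresh them consistently before relying on the order.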
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 41 / 1
![Page 42: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/42.jpg)
Constrained k-way Graph Partitioning
A graph partitioning algorithm splits the locks into k non-overlapping partitions, minimizing the sum of weights on cut edges while approximately balancing the total weight (sum of node weights) of the individual partitions.

The result of the partitioning algorithm is a load-balanced partitioning of locks among k storage nodes.
[Figure: Lock graph with keys k1 through k6 as nodes, split into Partition a and Partition b. Node weights are occurrence counts (e.g., 10, 12, 15) and edge weights are co-occurrence counts (e.g., 2, 4, 6, 7, 8, 9).]
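Building the weighted lock graph from observed transactions is straightforward; the following sketch does that, then substitutes a toy greedy placement for the real partitioner (the evaluation later uses METIS):

```python
from collections import Counter
from itertools import combinations

def build_lock_graph(transactions):
    """Node weight = occurrence count of a key across transactions;
    edge weight = co-occurrence count of a key pair."""
    node_w, edge_w = Counter(), Counter()
    for keys in transactions:
        distinct = sorted(set(keys))
        node_w.update(distinct)
        edge_w.update(frozenset(p) for p in combinations(distinct, 2))
    return node_w, edge_w

def greedy_partition(node_w, k=2):
    """Toy stand-in for a real k-way partitioner such as METIS:
    place the heaviest nodes first, always into the lightest
    partition, to approximately balance total node weight.
    (Ignores edge cuts, which METIS would also minimize.)"""
    parts = [dict() for _ in range(k)]
    loads = [0] * k
    for key, w in node_w.most_common():
        i = loads.index(min(loads))
        parts[i][key] = w
        loads[i] += w
    return parts

txns = [["k1", "k2"], ["k1", "k2"], ["k3", "k4"], ["k1", "k3"]]
node_w, edge_w = build_lock_graph(txns)
print(node_w["k1"], edge_w[frozenset({"k1", "k2"})])  # 3 2
```

Each resulting partition of locks is then assigned to one storage node, so a transaction's sequential locking touches fewer nodes.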
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 42 / 1
![Page 43: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/43.jpg)
Evaluation
A cluster of 20 machines was used for all evaluations. Each machine had a quad-core Xeon processor with 8 GB of RAM. HBase is the underlying key-value store.

The YCSB benchmark was extended with an atomic multi-put operation. A client transaction performs an atomic read-modify-write operation on a set of keys.

The keys for the atomic operation are generated using a Zipfian generator with a variable Zipfian parameter. Each transaction updates 15 keys out of 50K keys.
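A minimal sketch of such a workload generator, assuming a standard Zipfian law where the probability of rank r is proportional to 1/r^z (the YCSB extension itself is not shown in the deck):

```python
import random

def zipf_weights(n, z):
    """Zipfian popularity weights over n ranked keys:
    P(rank r) is proportional to 1 / r**z."""
    w = [1.0 / (r ** z) for r in range(1, n + 1)]
    total = sum(w)
    return [x / total for x in w]

def pick_txn_keys(n_keys=50_000, z=0.9, per_txn=15, rng=random):
    """Draw the distinct key set for one transaction, mimicking the
    evaluated workload: 15 keys out of 50K, skewed by parameter z."""
    weights = zipf_weights(n_keys, z)
    keys = set()
    while len(keys) < per_txn:
        keys.update(rng.choices(range(n_keys),
                                weights=weights,
                                k=per_txn - len(keys)))
    return sorted(keys)

keys = pick_txn_keys()
print(len(keys))  # 15
```

Raising z from 0.7 toward 0.9 concentrates the draws on the hottest keys, which is exactly the contention knob varied in the plots that follow.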
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 43 / 1
![Page 44: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/44.jpg)
Two Phase Locking: Waiting Version
Figure : Performance of Lock Partitioning. [Plot: throughput (0 to 600) versus number of clients (2 to 20); six curves for z = 0.7, 0.8, 0.9, each with order = inc and order = dec.]
Ordering the keys in increasing order of their contention is significantly better than the decreasing order. The increasing order reduced the lock-holding time for the most contended locks, reducing the waiting time for other transactions.
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 44 / 1
![Page 45: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/45.jpg)
Two Phase Locking: Waiting Version
Figure : Performance of Lock Partitioning. [Plot: throughput (0 to 600) versus number of clients (2 to 20); three curves for z = 0.7, 0.8, 0.9.]
Partitioning is done using METIS, and the partitions are placed on separate nodes. Lock partitioning improves throughput by reducing the number of network round trips needed for sequential locking by the client.
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 45 / 1
![Page 46: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/46.jpg)
Optimistic Concurrency Control – No-waiting Version
Figure : Performance of Lock Ordering. [Plot: throughput (0 to 450) versus number of clients (2 to 20); six curves for z = 0.7, 0.8, 0.9, each with order = inc and order = dec.]
The improvement is smaller for OCC, mainly due to the shorter duration of locking. At similar contention levels, the throughput of optimistic concurrency control is significantly lower.
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 46 / 1
![Page 47: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/47.jpg)
Optimistic Concurrency Control – No-waiting Version
Figure : Lock wastage due to restarts. [Plot: average locks acquired per transaction (0 to 450) versus number of clients (2 to 20); six curves for z = 0.7, 0.8, 0.9, each with order = inc and order = dec.]
Optimistic techniques are not suitable at high contention levels, as the time spent reading and locally updating is wasted on conflicts detected during commit.
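Why conflicts at commit time waste all of a transaction's local work can be seen in a minimal version-based OCC sketch (illustrative; a real system would also take write-set locks during validation):

```python
def occ_commit(store, read_set, write_set):
    """No-waiting OCC commit: validate that every version read is
    unchanged, then install writes. `store` maps key -> (version,
    value). On any mismatch the whole transaction aborts, discarding
    all the reading and local updating already performed."""
    # validation phase: any version change since the read aborts us
    for key, seen_version in read_set.items():
        if store[key][0] != seen_version:
            return False
    # write phase: bump versions and install new values
    for key, value in write_set.items():
        version, _ = store[key]
        store[key] = (version + 1, value)
    return True

store = {"k1": (0, "a"), "k2": (0, "b")}
ok = occ_commit(store, read_set={"k1": 0}, write_set={"k1": "a2"})
print(ok, store["k1"])  # True (1, 'a2')
```

Under a skewed (high-z) workload, hot keys are frequently bumped between read and commit, so this validation fails often and the restart overhead dominates, matching the throughput gap in the plots.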
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 47 / 1
![Page 48: Part 1: Systems Infrastructure for Big-Data Analytics · 2013-09-25 · Part 1: Systems Infrastructure for Big-Data Analytics Ananth Grama Center for Science of Information Dept.](https://reader034.fdocuments.net/reader034/viewer/2022042110/5e8a99a0de2e2d2e3b08c263/html5/thumbnails/48.jpg)
Lock Optimization: Conclusions
The waiting version of SS2PL with increasing contention order and partitioning significantly outperforms the other protocols.

Restarts due to conflicts constitute a major overhead in distributed transactions. Reducing restarts by busy-waiting for locks is an important step toward increasing performance.

Understanding the workload, even through simple statistics on contention, is enough to achieve significant gains (up to 200%).

Lock partitioning through graph clustering and partitioning techniques can be performed dynamically to achieve performance gains.
Ananth Grama (Purdue University) Big Data Analytics Infrastructure 48 / 1