The Quest for Scalable Support of Data Intensive Applications in Distributed Systems

Ioan Raicu
Distributed Systems Laboratory, Computer Science Department
University of Chicago

In collaboration with: Ian Foster (University of Chicago and Argonne National Laboratory), Alex Szalay (The Johns Hopkins University), Yong Zhao (Microsoft Corporation), Philip Little, Christopher Moretti, Amitabh Chaudhary, and Douglas Thain (University of Notre Dame)

IEEE/ACM Supercomputing 2008, Argonne National Laboratory Booth
November 20th, 2008
Many-Core Growth Rates

• Increasing attention to parallel chips
  – Many plans for cores with "in-order" execution
• On-chip shared memory
  – Far faster to access on-chip memory than DRAM
  – Interesting challenges in synchronization (e.g. locking)
• Inexpensive low-power parallel chips
  – Amazing amounts of computing, very cheap
  – Slower (or same) sequential speed!

[Figure: projected many-core growth, with core counts roughly doubling each process generation]
Year:          2004   2006   2008   2010   2012   2014   2016   2018
Process node:  90 nm  65 nm  45 nm  32 nm  22 nm  16 nm  11 nm  8 nm
Cores:         2      4      8      16     32     64     128    256

Pat Helland, Microsoft, "The Irresistible Forces Meet the Movable Objects", November 9th, 2007
What will we do with 1+ Exaflops and 1M+ cores?
Storage Resource Scalability

• IBM BlueGene/P
  – 160K CPU cores
  – GPFS 8GB/s I/O rates (16 servers)
  – Experiments on 160K CPU BG/P achieved 0.3Mb/s per CPU core
  – Experiments on 5.7K CPU SiCortex achieved 0.06Mb/s per CPU core
• GPFS vs. LOCAL
  – Read throughput:
    • 1 node: 0.48Gb/s vs. 1.03Gb/s (2.15x)
    • 160 nodes: 3.4Gb/s vs. 165Gb/s (48x)
    • 11Mb/s per CPU vs. 515Mb/s per CPU
  – Read+write throughput:
    • 1 node: 0.2Gb/s vs. 0.39Gb/s (1.95x)
    • 160 nodes: 1.1Gb/s vs. 62Gb/s (55x)
  – Metadata (mkdir / rm -rf):
    • 1 node: 151/sec vs. 199/sec (1.3x)
    • 160 nodes: 21/sec vs. 31840/sec (1516x)

[Figure: throughput (Mb/s, log scale 100–1000000) vs. number of nodes (1–1000), for GPFS R, LOCAL R, GPFS R+W, and LOCAL R+W]
Programming Model Issues

• Multicore processors
• Massive task parallelism
• Massive data parallelism
• Integrating black box applications
• Complex task dependencies (task graphs)
• Failure, and other execution management issues
• Dynamic task graphs
• Documenting provenance of data products
• Data management: input, intermediate, output
• Dynamic data access involving large amounts of data
Problem Types

[Figure: problem types mapped by input data size (low to high) and number of tasks (1 to 1K to 1M): "Heroic MPI Tasks" at few tasks; "Data Analysis, Mining" at few tasks with high input data; "Many Loosely Coupled Apps" at many tasks; "Big Data and Many Tasks" at many tasks with high input data]
An Incomplete and Simplistic View of Programming Models and Tools
MTC: Many Task Computing

• Loosely coupled applications
  – High-performance computations comprising multiple distinct activities, coupled via file system operations or message passing
  – Emphasis on using many resources over short time periods
  – Tasks can be: small or large, independent or dependent, uniprocessor or multiprocessor, compute-intensive or data-intensive, static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled; large numbers of tasks, large quantities of computing, and large volumes of data…
Motivating Example: AstroPortal Stacking Service

• Purpose
  – On-demand "stacks" of random locations within ~10TB dataset
• Challenge
  – Processing costs: O(100ms) per object
  – Data intensive: 40MB:1sec
  – Rapid access to 10-10K "random" files
  – Time-varying load

[Figure: many image cutouts ("+") drawn from the Sloan dataset are summed ("=") into a single stacked image]
Hypothesis

"Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks."

• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance
Abstract Model

• AMDASK: An Abstract Model for DAta-centric taSK farms
  – Task Farm: a common parallel pattern that drives independent computational tasks
  – Models the efficiency of data analysis workloads for the MTC class of applications
• Captures the following data diffusion properties
  – Resources are acquired in response to demand
  – Data and applications diffuse from archival storage to new resources
  – Resource "caching" allows faster responses to subsequent requests
  – Resources are released when demand drops
  – Considers both data and computations to optimize performance
AMDASK: Base Definitions

• Data Stores: persistent & transient
  – Store capacity, load, ideal bandwidth, available bandwidth
• Data Objects
  – Data object size, data object's storage location(s), copy time
• Transient resources: compute speed, resource state
• Task: application, input/output data
AMDASK: Execution Model Concepts

• Dispatch Policy
  – next-available, first-available, max-compute-util, max-cache-hit
• Caching Policy
  – random, FIFO, LRU, LFU
• Replay Policy
• Data Fetch Policy
  – Just-in-Time, Spatial Locality
• Resource Acquisition Policy
  – one-at-a-time, additive, exponential, all-at-once, optimal
• Resource Release Policy
  – distributed, centralized
AMDASK: Performance Efficiency Model

• B: average task execution time
  – K: stream of tasks
  – µ(κ): task κ execution time

  B = \frac{1}{|K|} \sum_{\kappa \in K} \mu(\kappa)

• Y: average task execution time with overheads
  – o(κ): dispatch overhead
  – ζ(δ,τ): time to get data δ to transient resource τ

  Y = \begin{cases} \frac{1}{|K|} \sum_{\kappa \in K} [\mu(\kappa) + o(\kappa)], & \delta \in \phi(\tau),\ \tau \in \Omega \\ \frac{1}{|K|} \sum_{\kappa \in K} [\mu(\kappa) + o(\kappa) + \zeta(\delta,\tau)], & \delta \notin \phi(\tau),\ \tau \in \Omega \end{cases}

• V: workload execution time
  – A: arrival rate of tasks
  – T: transient resources

  V = \max\left(\frac{B}{|T|}, \frac{1}{A}\right) \cdot |K|

• W: workload execution time with overheads

  W = \max\left(\frac{Y}{|T|}, \frac{1}{A}\right) \cdot |K|
AMDASK: Performance Efficiency Model

• Efficiency: E = V / W, which reduces to

  E = \begin{cases} 1, & Y \le |T| \cdot \frac{1}{A} \\ \frac{\max\left(B,\ |T| \cdot \frac{1}{A}\right)}{Y}, & Y > |T| \cdot \frac{1}{A} \end{cases}

• Speedup

  S = E \cdot |T|

• Optimizing efficiency
  – Easy to maximize either efficiency or speedup independently
  – Harder to maximize both at the same time
    • Find the smallest number of transient resources |T| while maximizing speedup*efficiency
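To make the model concrete, here is a minimal Java sketch (my illustration, not code from the original work) that evaluates the efficiency and speedup formulas above for hypothetical values of B, Y, A, and |T|:

// Minimal sketch (not from the slides): evaluating the AMDASK efficiency
// and speedup formulas for hypothetical parameter values.
public class AmdaskModel {
    // b: average task time B; y: average time with overheads Y;
    // a: task arrival rate A; t: number of transient resources |T|.
    static double efficiency(double b, double y, double a, double t) {
        if (y <= t / a) {
            return 1.0;                    // arrival rate is the bottleneck
        }
        return Math.max(b, t / a) / y;     // otherwise overheads dominate
    }

    static double speedup(double b, double y, double a, double t) {
        return efficiency(b, y, a, t) * t; // S = E * |T|
    }

    public static void main(String[] args) {
        double b = 0.1;   // 100ms of useful work per task (hypothetical)
        double y = 0.15;  // 150ms including dispatch + data fetch overheads
        double a = 100;   // 100 tasks/sec arrival rate
        for (double t : new double[]{8, 32, 128}) {
            System.out.printf("|T|=%3.0f  E=%.2f  S=%.1f%n",
                    t, efficiency(b, y, a, t), speedup(b, y, a, t));
        }
    }
}

Note how the model behaves: once |T| is large enough that the arrival rate becomes the bottleneck (Y ≤ |T|/A), efficiency is 1 while speedup keeps growing with |T|, which is why the optimization above targets the speedup*efficiency product rather than either alone.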
Model Validation

• Stacking service (large-scale astronomy application)
• 92 experiments
• 558K files
  – Compressed (GZ): 2MB each, 1.1TB total
  – Uncompressed (FIT): 6MB each, 3.3TB total

[Figure: model error (0–30%) vs. number of CPUs (2–128) and vs. data locality (1–30), for GPFS (GZ/FIT) and Data Diffusion (GZ/FIT) at localities 1, 1.38, and 30]
Falkon: a Fast and Light-weight tasK executiON framework

• Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters
• Combines three components:
  – a streamlined task dispatcher
  – resource provisioning through multi-level scheduling techniques
  – data diffusion and data-aware scheduling to leverage the co-located computational and storage resources
• Integration into Swift to leverage many applications
  – Applications cover many domains: astronomy, astrophysics, medicine, chemistry, economics, climate modeling, etc.
Falkon Overview

[Figure: Falkon architecture overview]
Data Diffusion

• Resources acquired in response to demand
• Data and applications diffuse from archival storage to newly acquired resources
• Resource "caching" allows faster responses to subsequent requests
  – Cache eviction strategies: RANDOM, FIFO, LRU, LFU
• Resources are released when demand drops

[Figure: task dispatcher with data-aware scheduler drawing on persistent storage, a shared file system, and idle resources that become provisioned resources]
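As an illustration of the eviction strategies named above, the LRU variant can be sketched in a few lines of Java on top of LinkedHashMap's access-order mode; this is a generic sketch of the technique, not Falkon's actual cache code:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch keyed by file name, bounded by total bytes cached.
// Illustrates the LRU eviction strategy listed above; the real system also
// supports RANDOM, FIFO, and LFU eviction.
public class LruFileCache extends LinkedHashMap<String, byte[]> {
    private final long capacityBytes;
    private long usedBytes = 0;

    public LruFileCache(long capacityBytes) {
        super(16, 0.75f, true);            // accessOrder=true gives LRU order
        this.capacityBytes = capacityBytes;
    }

    @Override
    public byte[] put(String file, byte[] data) {
        usedBytes += data.length;
        byte[] old = super.put(file, data);
        if (old != null) usedBytes -= old.length;
        // Evict least-recently-used entries until we fit within capacity.
        while (usedBytes > capacityBytes && !isEmpty()) {
            Map.Entry<String, byte[]> eldest = entrySet().iterator().next();
            usedBytes -= eldest.getValue().length;
            remove(eldest.getKey());
        }
        return old;
    }
}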
Data Diffusion

• Considers both data and computations to optimize performance
  – Supports data-aware scheduling
  – Can optimize compute utilization, cache hit performance, or a mixture of the two
• Decreases dependency on a shared file system
  – Theoretical linear scalability with compute resources
  – Significantly increases metadata creation and/or modification performance
• Central to the "data-centric task farm" realization
Scheduling Policies

• first-available
  – simple load balancing
• max-cache-hit
  – maximize cache hits
• max-compute-util
  – maximize processor utilization
• good-cache-compute
  – maximize both cache hit rate and processor utilization at the same time (see the sketch below)
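A rough Java sketch of how a data-aware policy in the spirit of good-cache-compute might choose a worker; the class and field names and the saturation threshold are assumptions for illustration, not Falkon's implementation:

import java.util.List;
import java.util.Set;

// Illustrative data-aware dispatch: prefer workers that already cache the
// task's input file, but fall back to the least-loaded worker when all
// cache-holding workers are saturated (the good-cache-compute idea).
class DataAwareScheduler {
    static class Worker {
        Set<String> cachedFiles;   // files currently in this worker's cache
        int queuedTasks;           // current load
    }

    static final int MAX_QUEUED = 4;   // assumed saturation threshold

    static Worker pick(List<Worker> workers, String inputFile) {
        Worker bestCached = null, leastLoaded = null;
        for (Worker w : workers) {
            if (leastLoaded == null || w.queuedTasks < leastLoaded.queuedTasks)
                leastLoaded = w;
            if (w.cachedFiles.contains(inputFile)
                    && (bestCached == null || w.queuedTasks < bestCached.queuedTasks))
                bestCached = w;
        }
        // max-cache-hit would always return bestCached (waiting if needed);
        // first-available would always return leastLoaded.
        if (bestCached != null && bestCached.queuedTasks < MAX_QUEUED)
            return bestCached;
        return leastLoaded;
    }
}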
Data-Aware Scheduler Profiling

[Figure: per-task CPU time (ms, 0–5) broken into Task Submit, Notification for Task Availability, Task Dispatch (data-aware scheduler), Task Results (data-aware scheduler), Notification for Task Results, and WS Communication, alongside throughput (0–5000 tasks/sec), for first-available (without and with I/O), max-compute-util, max-cache-hit, and good-cache-compute]
AstroPortal Stacking Service

• Purpose
  – On-demand "stacks" of random locations within ~10TB dataset
• Challenge
  – Rapid access to 10-10K "random" files
  – Time-varying load
• Sample workloads (Sloan data, accessed via web page or web service):

Locality | Number of Objects | Number of Files
1        | 111700            | 111700
1.38     | 154345            | 111699
2        | 97999             | 49000
3        | 88857             | 29620
4        | 76575             | 19145
5        | 60590             | 12120
10       | 46480             | 4650
20       | 40460             | 2025
30       | 23695             | 790
AstroPortal Stacking Service

[Figure: per-stack time breakdown (ms, 0–450) across open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, and writeStacking, for each filesystem and image format combination: GPFS GZ, LOCAL GZ, GPFS FIT, LOCAL FIT]
AstroPortal Stacking Service with Data Diffusion

• Low data locality
  – Similar (but better) performance to GPFS
• High data locality
  – Near-perfect scalability

[Figure: time (ms) per stack per CPU (0–2000) vs. number of CPUs (2–128), for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT), at low and high data locality]
AstroPortal Stacking Service with Data Diffusion

• Aggregate throughput:
  – 39Gb/s
  – 10X higher than GPFS
• Reduced load on GPFS
  – 0.49Gb/s
  – 1/10 of the original load
• Big performance gains as locality increases

[Figure: aggregate throughput (Gb/s, 0–50) vs. locality (1–30), split into local, cache-to-cache, and GPFS contributions, compared against GPFS throughput (FIT/GZ); and time (ms) per stack per CPU (0–2000) vs. locality (1–30 and ideal) for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT)]
Monotonically Increasing Workload

• 250K tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Increment function: CEILING(*1.3) (see the sketch below)
  – Max: 1000 tasks/sec
• 128 processors
• Ideal case:
  – 1415 sec
  – 80Gb/s peak throughput

[Figure: arrival rate per second (0–1000) and tasks completed (0–250K) vs. time (0–1440 sec), against the ideal throughput (Mb/s)]
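Reading the increment function as "multiply the current rate by 1.3 and round up, capping at 1000 tasks/sec", a small sketch (my reading of the slide, not the original harness) reproduces the arrival-rate steps 1, 2, 3, 4, 6, 8, 11, 15, 20, 26, ... seen later in the slowdown plot:

// Sketch of the monotonically increasing arrival schedule described above:
// rate(0) = 1 task/sec, rate(i+1) = min(1000, CEILING(rate(i) * 1.3)).
public class IncreasingWorkload {
    public static void main(String[] args) {
        int rate = 1;
        int submitted = 0;
        while (submitted < 250_000) {
            submitted += rate;   // one second of submissions at this rate
            System.out.println("rate=" + rate + " total=" + submitted);
            rate = Math.min(1000, (int) Math.ceil(rate * 1.3));
        }
    }
}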
Data Diffusion: First-available (GPFS)

• GPFS vs. ideal: 5011 sec vs. 1415 sec

[Figure: number of nodes, throughput (Gb/s), and wait queue length (x1K) vs. time (0–5100 sec), against the ideal throughput]
Data Diffusion: Max-compute-util & max-cache-hit

[Figure: for max-compute-util (0–2880 sec) and max-cache-hit (0–2040 sec) — number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length (x1K) vs. time, with cache hit local %, cache hit global %, cache miss %, and CPU utilization %]
Data Diffusion: Good-cache-compute

[Figure: per-node cache sizes of 1GB (0–3600 sec), 1.5GB (0–1560 sec), 2GB, and 4GB (0–1440 sec) — number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length (x1K) vs. time, with cache hit local %, cache hit global %, and cache miss %]
Data Diffusion: Good-cache-compute

• Data Diffusion vs. ideal: 1436 sec vs. 1415 sec

[Figure: number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length vs. time (0–1440 sec), with cache hit local %, cache hit global %, and cache miss %]
Data Diffusion: Throughput and Response Time

• Throughput:
  – Average: 14Gb/s vs. 4Gb/s
  – Peak: 100Gb/s vs. 6Gb/s
• Response time:
  – 3 sec vs. 1569 sec (506X)

[Figure: throughput (Gb/s, log scale) from local worker caches, remote worker caches, and GPFS, for Ideal, first-available, good-cache-compute (1GB, 1.5GB, 2GB, 4GB), max-cache-hit (4GB), and max-compute-util (4GB); average response times (sec, log scale), in bar order: first-available 1084, good-cache-compute 230 (1GB), 287 (1.5GB), 3.1 (2GB), 114 (4GB), max-cache-hit 1569 (4GB), max-compute-util 3.4 (4GB)]
Data Diffusion: Performance Index, Slowdown, and Speedup

• Performance index:
  – 34X higher
• Speedup:
  – 3.5X faster than GPFS
• Slowdown:
  – 18X slowdown for GPFS
  – Near-ideal 1X slowdown for large enough caches

[Figure: performance index (0–1) and speedup (compared to first-available) for first-available, good-cache-compute (1GB, 1.5GB, 2GB, 4GB, and 4GB with SRP), max-cache-hit (4GB), and max-compute-util (4GB); slowdown (1–18) vs. arrival rate (1, 2, 3, 4, 6, 8, 11, 15, 20, 26, 34, 45, 59, 77, 101, 132, 172, 224, 292, 380, 494, 643, 836, 1000 tasks/sec) for the same policies]
Sin-Wave Workload

• 2M tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Arrival rate function (see the sketch below):
    A = \lfloor (\sin(\sqrt{time + 0.11} \cdot 2.859678) + 1) \cdot \sqrt{time + 0.11} \cdot 5.705 \rfloor
  – Max: 1000 tasks/sec
• 200 processors
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput

[Figure: arrival rate (0–1000 tasks/sec) and number of tasks completed (0–2M) vs. time (0–6600 sec)]
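The arrival-rate formula above is a best-effort reconstruction of a garbled extraction; it is at least consistent with the quoted workload, as this sketch shows: summing the per-second rates over the 6505-second ideal run yields roughly the 2M tasks stated above.

// Sketch of the sin-wave arrival rate function as reconstructed above:
// A(t) = floor((sin(sqrt(t+0.11)*2.859678) + 1) * sqrt(t+0.11) * 5.705)
public class SinWaveWorkload {
    static long arrivalRate(double t) {
        double s = Math.sqrt(t + 0.11);
        return (long) Math.floor((Math.sin(s * 2.859678) + 1) * s * 5.705);
    }

    public static void main(String[] args) {
        long total = 0;
        for (int t = 0; t <= 6505; t++) total += arrivalRate(t);
        System.out.println("total tasks ~ " + total);  // ~2M
    }
}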
Sin-Wave Workload

• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• DF+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
• DF+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
  (DF = data diffusion; SRP/DRP = static/dynamic resource provisioning)

[Figure: cache hit/miss %, number of nodes, throughput (Gb/s), wait queue length (x1K), and demand (Gb/s) vs. time for the data diffusion runs]
Sin-Wave Workload

[Figure: cache hit/miss %, number of nodes, throughput (Gb/s), wait queue length (x1K), and demand (Gb/s) vs. time]
All-Pairs Workload

• All-Pairs(set A, set B, function F) returns matrix M:
  – Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i], B[j])

  foreach $i in A
    foreach $j in B
      submit_job F $i $j
    end
  end

• 500x500
  – 250K tasks
  – 24MB reads
  – 100ms compute
  – 200 CPUs
• 1000x1000
  – 1M tasks
  – 24MB reads
  – 4sec compute
  – 4096 CPUs
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
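The nested submission loop above maps directly onto a task farm; a minimal Java rendering (illustrative only; submitJob stands in for the dispatcher's actual submit call):

// Illustrative Java rendering of the All-Pairs submission loop above.
// submitJob(...) is a stand-in for the real dispatcher's submit call.
public class AllPairs {
    static void submitJob(String function, String a, String b) {
        System.out.printf("submit: %s %s %s%n", function, a, b);
    }

    // Submits one task per (i, j) pair; the collected results form the
    // matrix M[i][j] = F(A[i], B[j]).
    static void allPairs(String[] A, String[] B, String F) {
        for (String i : A)
            for (String j : B)
                submitJob(F, i, j);
    }

    public static void main(String[] args) {
        allPairs(new String[]{"a1", "a2"}, new String[]{"b1", "b2"}, "F");
    }
}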
All-Pairs Workload: 500x500 on 200 CPUs

• Efficiency: 75%

[Figure: cache hit/miss % and throughput (Gb/s, 0–80) vs. time, comparing data diffusion throughput against the maximum GPFS and local disk throughputs]
All-Pairs Workload: 1000x1000 on 4K emulated CPUs

• Efficiency: 86%

[Figure: cache hit/miss % and throughput (Gb/s, 0–200) vs. time (0–1080 sec), comparing data diffusion throughput against the maximum GPFS and local memory throughputs]
All-Pairs Workload: Data Diffusion vs. Active Storage

• Push vs. Pull
  – Active Storage:
    • Pushes workload working set to all nodes
    • Static spanning tree
  – Data Diffusion:
    • Pulls task working set
    • Incremental spanning forest

[Figure: efficiency (0–100%) for Best Case (active storage), Falkon (data diffusion), and Falkon (GPFS), across the three experiments below]

Experiment                   | Approach                   | Local Disk/Memory (GB) | Network node-to-node (GB) | Shared File System (GB)
500x500, 200 CPUs, 1 sec     | Best Case (active storage) | 6000                   | 1536                      | 12
500x500, 200 CPUs, 1 sec     | Falkon (data diffusion)    | 6000                   | 1698                      | 34
500x500, 200 CPUs, 0.1 sec   | Best Case (active storage) | 6000                   | 1536                      | 12
500x500, 200 CPUs, 0.1 sec   | Falkon (data diffusion)    | 6000                   | 1528                      | 62
1000x1000, 4096 CPUs, 4 sec  | Best Case (active storage) | 24000                  | 12288                     | 24
1000x1000, 4096 CPUs, 4 sec  | Falkon (data diffusion)    | 24000                  | 4676                      | 384
All-Pairs Workload: Data Diffusion vs. Active Storage

• Best to use active storage if:
  – Slow data source
  – Workload working set fits on local node storage
  – Good aggregate network bandwidth
• Best to use data diffusion if:
  – Medium to fast data source
  – Task working set << workload working set
  – Task working set fits on local node storage
  – Good aggregate network bandwidth
• If task working set does not fit on local node storage:
  – Use a parallel file system (i.e. GPFS, Lustre, PVFS, etc.)
Limitations of Data Diffusion

• Needs Java 1.4+
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Per-task working set must fit in local storage
• Task definition must include input/output file metadata
• Data access patterns: write once, read many
Related Work: Data Management

• [Beynon01]: DataCutter
• [Ranganathan03]: Simulations
• [Ghemawat03, Dean04, Chang06]: GFS, MapReduce, BigTable
• [Liu04]: GridDB
• [Chervenak04, Chervenak06]: RLS (Replica Location Service), DRS (Data Replication Service)
• [Tatebe04, Xiaohui05]: GFarm
• [Branco04, Adams06]: DIAL/ATLAS
• [Kosar06]: Stork
• [Thain08]: Chirp/Parrot

Conclusion: none focused on the co-location of storage and generic black box computations with data-aware scheduling while operating in a dynamic environment
Scaling from 1K to 100K CPUs without Data Diffusion

• At 1K CPUs:
  – 1 server to manage all 1K CPUs
  – Use shared file system extensively
    • Invoke application from shared file system
    • Read/write data from/to shared file system
• At 100K CPUs:
  – N servers to manage 100K CPUs (1:256 ratio)
  – Don't trust the application I/O access patterns to behave optimally
    • Copy applications and input data to RAM
    • Read input data from RAM, compute, and write results to RAM
    • Archive all results in a single file in RAM
    • Copy 1 result file from RAM back to GPFS
• Great potential for improvements
  – Could leverage the Torus network for high aggregate bandwidth
  – Collective I/O (CIO) primitives
  – Roadblocks: machine global IP connectivity, Java support, and time
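A sketch of the RAM-staging pattern described above; the paths and file names are placeholders, and the sketch uses temp directories as stand-ins for GPFS and a RAM file system so it runs anywhere:

import java.io.IOException;
import java.nio.file.*;

// Sketch of the staging pattern above: copy input from the shared file
// system into a RAM-backed directory, compute against RAM, archive one
// result file, and copy only that file back to the shared file system.
public class RamStaging {
    public static void main(String[] args) throws IOException {
        Path gpfs = Files.createTempDirectory("gpfs-stand-in"); // stand-in for GPFS
        Path ram  = Files.createTempDirectory("ram-stand-in");  // stand-in for a RAM FS
        Files.writeString(gpfs.resolve("input.dat"), "input data");

        // 1. Stage application and input data from GPFS into RAM.
        Files.copy(gpfs.resolve("input.dat"), ram.resolve("input.dat"));

        // 2. Compute: read input from RAM, write results to RAM.
        String input = Files.readString(ram.resolve("input.dat"));
        Files.writeString(ram.resolve("results.tar"), "archived: " + input);

        // 3. Copy the single archived result file back to the shared FS.
        Files.copy(ram.resolve("results.tar"), gpfs.resolve("results.tar"));
        System.out.println("staged result back to " + gpfs);
    }
}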
Mythbusting

• "Embarrassingly" ("happily") parallel apps are trivial to run
  – Logistical problems can be tremendous
• Loosely coupled apps do not require "supercomputers"
  – Total computational requirements can be enormous
  – Individual tasks may be tightly coupled
  – Workloads frequently involve large amounts of I/O
  – Make use of idle resources from "supercomputers" via backfilling
  – Costs to run "supercomputers" per FLOP are among the best
    • BG/P: 0.35 gigaflops/watt (higher is better)
    • SiCortex: 0.32 gigaflops/watt
    • BG/L: 0.23 gigaflops/watt
    • x86-based HPC systems: an order of magnitude lower
• Loosely coupled apps do not require specialized system software
• Shared file systems are good for all applications
  – They don't scale proportionally with the compute resources
  – Data-intensive applications don't perform and scale well
Conclusions & Contributions

• Defined an abstract model for performance efficiency of data analysis workloads using data-centric task farms
• Provided a reference implementation (Falkon)
  – Uses a streamlined dispatcher to increase task throughput by several orders of magnitude over traditional LRMs
  – Uses multi-level scheduling to reduce perceived wait queue time for tasks to execute on remote resources
  – Addresses data diffusion through co-scheduling of storage and computational resources to improve performance and scalability
  – Provides the benefits of dedicated hardware without the associated high cost
  – Shows the effectiveness of data diffusion: a real large-scale astronomy application and a variety of synthetic workloads
More Information

• More information: http://people.cs.uchicago.edu/~iraicu/
• Related projects:
  – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
  – AstroPortal: http://people.cs.uchicago.edu/~iraicu/projects/Falkon/astro_portal.htm
  – Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:
  – NASA: Ames Research Center, Graduate Student Research Program (GSRP)
  – DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid
• In conjunction with IEEE/ACM SuperComputing 2008
• Location: Austin, Texas
• Date: November 17th, 2008

• In conjunction with IEEE/ACM SuperComputing 2008
• Location: Austin, Texas
• Date: November 18th, 2008