The Quest for Scalable Support of Data Intensive Applications in Distributed Systems

Ioan Raicu
Distributed Systems Laboratory, Computer Science Department
University of Chicago

In collaboration with: Ian Foster (University of Chicago and Argonne National Laboratory), Alex Szalay (The Johns Hopkins University), Yong Zhao (Microsoft Corporation), Philip Little, Christopher Moretti, Amitabh Chaudhary, and Douglas Thain (University of Notre Dame)

IEEE/ACM Supercomputing 2008, Argonne National Laboratory Booth
November 20th, 2008
Many-Core Growth Rates

• Increasing attention to parallel chips
  – Many plans for cores with "in-order" execution
• On-chip shared memory
  – Far faster to access on-chip memory than DRAM
  – Interesting challenges in synchronization (e.g. locking)
• Inexpensive low-power parallel chips
  – Amazing amounts of computing, very cheap
  – Slower (or same) sequential speed!

[Figure: projected many-core growth, with core counts roughly doubling each process generation]
Year:          2004   2006   2008   2010   2012   2014   2016   2018
Process node:  90 nm  65 nm  45 nm  32 nm  22 nm  16 nm  11 nm  8 nm
Cores:         2      4      8      16     32     64     128    256

Pat Helland, Microsoft, "The Irresistible Forces Meet the Movable Objects", November 9th, 2007
What will we do with 1+ Exaflops and 1M+ cores?
Storage Resource Scalability

• IBM BlueGene/P
  – 160K CPU cores
  – GPFS 8GB/s I/O rates (16 servers)
  – Experiments on 160K CPU BG/P achieved 0.3Mb/s per CPU core
  – Experiments on 5.7K CPU SiCortex achieved 0.06Mb/s per CPU core
• GPFS vs. LOCAL
  – Read throughput:
    • 1 node: 0.48Gb/s vs. 1.03Gb/s (2.15x)
    • 160 nodes: 3.4Gb/s vs. 165Gb/s (48x)
    • 11Mb/s per CPU vs. 515Mb/s per CPU
  – Read+write throughput:
    • 1 node: 0.2Gb/s vs. 0.39Gb/s (1.95x)
    • 160 nodes: 1.1Gb/s vs. 62Gb/s (55x)
  – Metadata (mkdir / rm -rf):
    • 1 node: 151/sec vs. 199/sec (1.3x)
    • 160 nodes: 21/sec vs. 31840/sec (1516x)

[Figure: throughput (Mb/s, log scale 100–1000000) vs. number of nodes (1–1000), for GPFS R, LOCAL R, GPFS R+W, and LOCAL R+W]
Programming Model Issues

• Multicore processors
• Massive task parallelism
• Massive data parallelism
• Integrating black box applications
• Complex task dependencies (task graphs)
• Failure, and other execution management issues
• Dynamic task graphs
• Documenting provenance of data products
• Data management: input, intermediate, output
• Dynamic data access involving large amounts of data
Problem Types

[Figure: problem types mapped by input data size (low to high) and number of tasks (1 to 1K to 1M): "Heroic MPI Tasks" at few tasks; "Data Analysis, Mining" at few tasks with high input data; "Many Loosely Coupled Apps" at many tasks; "Big Data and Many Tasks" at many tasks with high input data]
An Incomplete and Simplistic View of Programming Models and Tools
MTC: Many Task Computing

• Loosely coupled applications
  – High-performance computations comprising multiple distinct activities, coupled via file system operations or message passing
  – Emphasis on using many resources over short time periods
  – Tasks can be: small or large, independent or dependent, uniprocessor or multiprocessor, compute-intensive or data-intensive, static or dynamic, homogeneous or heterogeneous, loosely or tightly coupled; large numbers of tasks, large quantities of computing, and large volumes of data…
Motivating Example: AstroPortal Stacking Service

• Purpose
  – On-demand "stacks" of random locations within ~10TB dataset
• Challenge
  – Processing costs: O(100ms) per object
  – Data intensive: 40MB:1sec
  – Rapid access to 10-10K "random" files
  – Time-varying load

[Figure: many image cutouts ("+") drawn from the Sloan dataset are summed ("=") into a single stacked image]
Hypothesis

"Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks."

• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance
Abstract Model

• AMDASK: An Abstract Model for DAta-centric taSK farms
  – Task Farm: a common parallel pattern that drives independent computational tasks
  – Models the efficiency of data analysis workloads for the MTC class of applications
• Captures the following data diffusion properties
  – Resources are acquired in response to demand
  – Data and applications diffuse from archival storage to new resources
  – Resource "caching" allows faster responses to subsequent requests
  – Resources are released when demand drops
  – Considers both data and computations to optimize performance
AMDASK: Base Definitions

• Data Stores: persistent & transient
  – Store capacity, load, ideal bandwidth, available bandwidth
• Data Objects
  – Data object size, data object's storage location(s), copy time
• Transient resources: compute speed, resource state
• Task: application, input/output data
AMDASK: Execution Model Concepts

• Dispatch Policy
  – next-available, first-available, max-compute-util, max-cache-hit
• Caching Policy
  – random, FIFO, LRU, LFU
• Replay Policy
• Data Fetch Policy
  – Just-in-Time, Spatial Locality
• Resource Acquisition Policy
  – one-at-a-time, additive, exponential, all-at-once, optimal
• Resource Release Policy
  – distributed, centralized
AMDASK: Performance Efficiency Model

• B: average task execution time
  – K: stream of tasks
  – µ(κ): task κ execution time

  B = \frac{1}{|K|} \sum_{\kappa \in K} \mu(\kappa)

• Y: average task execution time with overheads
  – o(κ): dispatch overhead
  – ζ(δ,τ): time to get data δ to transient resource τ

  Y = \begin{cases} \frac{1}{|K|} \sum_{\kappa \in K} [\mu(\kappa) + o(\kappa)], & \delta \in \phi(\tau),\ \tau \in \Omega \\ \frac{1}{|K|} \sum_{\kappa \in K} [\mu(\kappa) + o(\kappa) + \zeta(\delta,\tau)], & \delta \notin \phi(\tau),\ \tau \in \Omega \end{cases}

• V: workload execution time
  – A: arrival rate of tasks
  – T: transient resources

  V = \max\left(\frac{B}{|T|}, \frac{1}{A}\right) \cdot |K|

• W: workload execution time with overheads

  W = \max\left(\frac{Y}{|T|}, \frac{1}{A}\right) \cdot |K|
AMDASK: Performance Efficiency Model

• Efficiency: E = V / W, which reduces to

  E = \begin{cases} 1, & Y \le |T| \cdot \frac{1}{A} \\ \frac{\max\left(B,\ |T| \cdot \frac{1}{A}\right)}{Y}, & Y > |T| \cdot \frac{1}{A} \end{cases}

• Speedup

  S = E \cdot |T|

• Optimizing efficiency
  – Easy to maximize either efficiency or speedup independently
  – Harder to maximize both at the same time
    • Find the smallest number of transient resources |T| while maximizing speedup*efficiency
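To make the model concrete, here is a minimal Java sketch (my illustration, not code from the original work) that evaluates the efficiency and speedup formulas above for hypothetical values of B, Y, A, and |T|:

// Minimal sketch (not from the slides): evaluating the AMDASK efficiency
// and speedup formulas for hypothetical parameter values.
public class AmdaskModel {
    // b: average task time B; y: average time with overheads Y;
    // a: task arrival rate A; t: number of transient resources |T|.
    static double efficiency(double b, double y, double a, double t) {
        if (y <= t / a) {
            return 1.0;                    // arrival rate is the bottleneck
        }
        return Math.max(b, t / a) / y;     // otherwise overheads dominate
    }

    static double speedup(double b, double y, double a, double t) {
        return efficiency(b, y, a, t) * t; // S = E * |T|
    }

    public static void main(String[] args) {
        double b = 0.1;   // 100ms of useful work per task (hypothetical)
        double y = 0.15;  // 150ms including dispatch + data fetch overheads
        double a = 100;   // 100 tasks/sec arrival rate
        for (double t : new double[]{8, 32, 128}) {
            System.out.printf("|T|=%3.0f  E=%.2f  S=%.1f%n",
                    t, efficiency(b, y, a, t), speedup(b, y, a, t));
        }
    }
}

Note how the model behaves: once |T| is large enough that the arrival rate becomes the bottleneck (Y ≤ |T|/A), efficiency is 1 while speedup keeps growing with |T|, which is why the optimization above targets the speedup*efficiency product rather than either alone.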
Model Validation

• Stacking service (large-scale astronomy application)
• 92 experiments
• 558K files
  – Compressed (GZ): 2MB each, 1.1TB total
  – Uncompressed (FIT): 6MB each, 3.3TB total

[Figure: model error (0–30%) vs. number of CPUs (2–128) and vs. data locality (1–30), for GPFS (GZ/FIT) and Data Diffusion (GZ/FIT) at localities 1, 1.38, and 30]
Falkon: a Fast and Light-weight tasK executiON framework

• Goal: enable the rapid and efficient execution of many independent jobs on large compute clusters
• Combines three components:
  – a streamlined task dispatcher
  – resource provisioning through multi-level scheduling techniques
  – data diffusion and data-aware scheduling to leverage the co-located computational and storage resources
• Integration into Swift to leverage many applications
  – Applications cover many domains: astronomy, astrophysics, medicine, chemistry, economics, climate modeling, etc.
Falkon Overview

[Figure: Falkon architecture overview]
Data Diffusion

• Resources acquired in response to demand
• Data and applications diffuse from archival storage to newly acquired resources
• Resource "caching" allows faster responses to subsequent requests
  – Cache eviction strategies: RANDOM, FIFO, LRU, LFU
• Resources are released when demand drops

[Figure: task dispatcher with data-aware scheduler drawing on persistent storage, a shared file system, and idle resources that become provisioned resources]
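As an illustration of the eviction strategies named above, the LRU variant can be sketched in a few lines of Java on top of LinkedHashMap's access-order mode; this is a generic sketch of the technique, not Falkon's actual cache code:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch keyed by file name, bounded by total bytes cached.
// Illustrates the LRU eviction strategy listed above; the real system also
// supports RANDOM, FIFO, and LFU eviction.
public class LruFileCache extends LinkedHashMap<String, byte[]> {
    private final long capacityBytes;
    private long usedBytes = 0;

    public LruFileCache(long capacityBytes) {
        super(16, 0.75f, true);            // accessOrder=true gives LRU order
        this.capacityBytes = capacityBytes;
    }

    @Override
    public byte[] put(String file, byte[] data) {
        usedBytes += data.length;
        byte[] old = super.put(file, data);
        if (old != null) usedBytes -= old.length;
        // Evict least-recently-used entries until we fit within capacity.
        while (usedBytes > capacityBytes && !isEmpty()) {
            Map.Entry<String, byte[]> eldest = entrySet().iterator().next();
            usedBytes -= eldest.getValue().length;
            remove(eldest.getKey());
        }
        return old;
    }
}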
Data Diffusion

• Considers both data and computations to optimize performance
  – Supports data-aware scheduling
  – Can optimize compute utilization, cache hit performance, or a mixture of the two
• Decreases dependency on a shared file system
  – Theoretical linear scalability with compute resources
  – Significantly increases metadata creation and/or modification performance
• Central to the "data-centric task farm" realization
Scheduling Policies

• first-available
  – simple load balancing
• max-cache-hit
  – maximize cache hits
• max-compute-util
  – maximize processor utilization
• good-cache-compute
  – maximize both cache hit rate and processor utilization at the same time (see the sketch below)
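A rough Java sketch of how a data-aware policy in the spirit of good-cache-compute might choose a worker; the class and field names and the saturation threshold are assumptions for illustration, not Falkon's implementation:

import java.util.List;
import java.util.Set;

// Illustrative data-aware dispatch: prefer workers that already cache the
// task's input file, but fall back to the least-loaded worker when all
// cache-holding workers are saturated (the good-cache-compute idea).
class DataAwareScheduler {
    static class Worker {
        Set<String> cachedFiles;   // files currently in this worker's cache
        int queuedTasks;           // current load
    }

    static final int MAX_QUEUED = 4;   // assumed saturation threshold

    static Worker pick(List<Worker> workers, String inputFile) {
        Worker bestCached = null, leastLoaded = null;
        for (Worker w : workers) {
            if (leastLoaded == null || w.queuedTasks < leastLoaded.queuedTasks)
                leastLoaded = w;
            if (w.cachedFiles.contains(inputFile)
                    && (bestCached == null || w.queuedTasks < bestCached.queuedTasks))
                bestCached = w;
        }
        // max-cache-hit would always return bestCached (waiting if needed);
        // first-available would always return leastLoaded.
        if (bestCached != null && bestCached.queuedTasks < MAX_QUEUED)
            return bestCached;
        return leastLoaded;
    }
}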
Data-Aware Scheduler Profiling

[Figure: per-task CPU time (ms, 0–5) broken into Task Submit, Notification for Task Availability, Task Dispatch (data-aware scheduler), Task Results (data-aware scheduler), Notification for Task Results, and WS Communication, alongside throughput (0–5000 tasks/sec), for first-available (without and with I/O), max-compute-util, max-cache-hit, and good-cache-compute]
AstroPortal Stacking Service

• Purpose
  – On-demand "stacks" of random locations within ~10TB dataset
• Challenge
  – Rapid access to 10-10K "random" files
  – Time-varying load
• Sample workloads (Sloan data, accessed via web page or web service):

Locality | Number of Objects | Number of Files
1        | 111700            | 111700
1.38     | 154345            | 111699
2        | 97999             | 49000
3        | 88857             | 29620
4        | 76575             | 19145
5        | 60590             | 12120
10       | 46480             | 4650
20       | 40460             | 2025
30       | 23695             | 790
AstroPortal Stacking Service

[Figure: per-stack time breakdown (ms, 0–450) across open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, and writeStacking, for each filesystem and image format combination: GPFS GZ, LOCAL GZ, GPFS FIT, LOCAL FIT]
AstroPortal Stacking Service with Data Diffusion

• Low data locality
  – Similar (but better) performance to GPFS
• High data locality
  – Near-perfect scalability

[Figure: time (ms) per stack per CPU (0–2000) vs. number of CPUs (2–128), for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT), at low and high data locality]
AstroPortal Stacking Service with Data Diffusion

• Aggregate throughput:
  – 39Gb/s
  – 10X higher than GPFS
• Reduced load on GPFS
  – 0.49Gb/s
  – 1/10 of the original load
• Big performance gains as locality increases

[Figure: aggregate throughput (Gb/s, 0–50) vs. locality (1–30), split into local, cache-to-cache, and GPFS contributions, compared against GPFS throughput (FIT/GZ); and time (ms) per stack per CPU (0–2000) vs. locality (1–30 and ideal) for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT)]
Monotonically Increasing Workload

• 250K tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Increment function: CEILING(*1.3) (see the sketch below)
  – Max: 1000 tasks/sec
• 128 processors
• Ideal case:
  – 1415 sec
  – 80Gb/s peak throughput

[Figure: arrival rate per second (0–1000) and tasks completed (0–250K) vs. time (0–1440 sec), against the ideal throughput (Mb/s)]
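Reading the increment function as "multiply the current rate by 1.3 and round up, capping at 1000 tasks/sec", a small sketch (my reading of the slide, not the original harness) reproduces the arrival-rate steps 1, 2, 3, 4, 6, 8, 11, 15, 20, 26, ... seen later in the slowdown plot:

// Sketch of the monotonically increasing arrival schedule described above:
// rate(0) = 1 task/sec, rate(i+1) = min(1000, CEILING(rate(i) * 1.3)).
public class IncreasingWorkload {
    public static void main(String[] args) {
        int rate = 1;
        int submitted = 0;
        while (submitted < 250_000) {
            submitted += rate;   // one second of submissions at this rate
            System.out.println("rate=" + rate + " total=" + submitted);
            rate = Math.min(1000, (int) Math.ceil(rate * 1.3));
        }
    }
}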
Data Diffusion: First-available (GPFS)

• GPFS vs. ideal: 5011 sec vs. 1415 sec

[Figure: number of nodes, throughput (Gb/s), and wait queue length (x1K) vs. time (0–5100 sec), against the ideal throughput]
Data Diffusion: Max-compute-util & max-cache-hit

[Figure: for max-compute-util (0–2880 sec) and max-cache-hit (0–2040 sec) — number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length (x1K) vs. time, with cache hit local %, cache hit global %, cache miss %, and CPU utilization %]
Data Diffusion: Good-cache-compute

[Figure: per-node cache sizes of 1GB (0–3600 sec), 1.5GB (0–1560 sec), 2GB, and 4GB (0–1440 sec) — number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length (x1K) vs. time, with cache hit local %, cache hit global %, and cache miss %]
Data Diffusion: Good-cache-compute

• Data Diffusion vs. ideal: 1436 sec vs. 1415 sec

[Figure: number of nodes, aggregate throughput (Gb/s), throughput per node (MB/s), and wait queue length vs. time (0–1440 sec), with cache hit local %, cache hit global %, and cache miss %]
Data Diffusion: Throughput and Response Time

• Throughput:
  – Average: 14Gb/s vs. 4Gb/s
  – Peak: 100Gb/s vs. 6Gb/s
• Response time:
  – 3 sec vs. 1569 sec (506X)

[Figure: throughput (Gb/s, log scale) from local worker caches, remote worker caches, and GPFS, for Ideal, first-available, good-cache-compute (1GB, 1.5GB, 2GB, 4GB), max-cache-hit (4GB), and max-compute-util (4GB); average response times (sec, log scale), in bar order: first-available 1084, good-cache-compute 230 (1GB), 287 (1.5GB), 3.1 (2GB), 114 (4GB), max-cache-hit 1569 (4GB), max-compute-util 3.4 (4GB)]
Data Diffusion: Performance Index, Slowdown, and Speedup

• Performance index:
  – 34X higher
• Speedup:
  – 3.5X faster than GPFS
• Slowdown:
  – 18X slowdown for GPFS
  – Near-ideal 1X slowdown for large enough caches

[Figure: performance index (0–1) and speedup (compared to first-available) for first-available, good-cache-compute (1GB, 1.5GB, 2GB, 4GB, and 4GB with SRP), max-cache-hit (4GB), and max-compute-util (4GB); slowdown (1–18) vs. arrival rate (1, 2, 3, 4, 6, 8, 11, 15, 20, 26, 34, 45, 59, 77, 101, 132, 172, 224, 292, 380, 494, 643, 836, 1000 tasks/sec) for the same policies]
Sin-Wave Workload

• 2M tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Arrival rate function (see the sketch below):
    A = \lfloor (\sin(\sqrt{time + 0.11} \cdot 2.859678) + 1) \cdot \sqrt{time + 0.11} \cdot 5.705 \rfloor
  – Max: 1000 tasks/sec
• 200 processors
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput

[Figure: arrival rate (0–1000 tasks/sec) and number of tasks completed (0–2M) vs. time (0–6600 sec)]
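The arrival-rate formula above is a best-effort reconstruction of a garbled extraction; it is at least consistent with the quoted workload, as this sketch shows: summing the per-second rates over the 6505-second ideal run yields roughly the 2M tasks stated above.

// Sketch of the sin-wave arrival rate function as reconstructed above:
// A(t) = floor((sin(sqrt(t+0.11)*2.859678) + 1) * sqrt(t+0.11) * 5.705)
public class SinWaveWorkload {
    static long arrivalRate(double t) {
        double s = Math.sqrt(t + 0.11);
        return (long) Math.floor((Math.sin(s * 2.859678) + 1) * s * 5.705);
    }

    public static void main(String[] args) {
        long total = 0;
        for (int t = 0; t <= 6505; t++) total += arrivalRate(t);
        System.out.println("total tasks ~ " + total);  // ~2M
    }
}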
Sin-Wave Workload

• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• DF+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
• DF+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
  (DF = data diffusion; SRP/DRP = static/dynamic resource provisioning)

[Figure: cache hit/miss %, number of nodes, throughput (Gb/s), wait queue length (x1K), and demand (Gb/s) vs. time for the data diffusion runs]
Sin-Wave Workload

[Figure: cache hit/miss %, number of nodes, throughput (Gb/s), wait queue length (x1K), and demand (Gb/s) vs. time]
All-Pairs Workload

• All-Pairs(set A, set B, function F) returns matrix M:
  – Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i], B[j])

  foreach $i in A
    foreach $j in B
      submit_job F $i $j
    end
  end

• 500x500
  – 250K tasks
  – 24MB reads
  – 100ms compute
  – 200 CPUs
• 1000x1000
  – 1M tasks
  – 24MB reads
  – 4sec compute
  – 4096 CPUs
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput
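The nested submission loop above maps directly onto a task farm; a minimal Java rendering (illustrative only; submitJob stands in for the dispatcher's actual submit call):

// Illustrative Java rendering of the All-Pairs submission loop above.
// submitJob(...) is a stand-in for the real dispatcher's submit call.
public class AllPairs {
    static void submitJob(String function, String a, String b) {
        System.out.printf("submit: %s %s %s%n", function, a, b);
    }

    // Submits one task per (i, j) pair; the collected results form the
    // matrix M[i][j] = F(A[i], B[j]).
    static void allPairs(String[] A, String[] B, String F) {
        for (String i : A)
            for (String j : B)
                submitJob(F, i, j);
    }

    public static void main(String[] args) {
        allPairs(new String[]{"a1", "a2"}, new String[]{"b1", "b2"}, "F");
    }
}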
All-Pairs Workload: 500x500 on 200 CPUs

• Efficiency: 75%

[Figure: cache hit/miss % and throughput (Gb/s, 0–80) vs. time, comparing data diffusion throughput against the maximum GPFS and local disk throughputs]
All-Pairs Workload: 1000x1000 on 4K emulated CPUs

• Efficiency: 86%

[Figure: cache hit/miss % and throughput (Gb/s, 0–200) vs. time (0–1080 sec), comparing data diffusion throughput against the maximum GPFS and local memory throughputs]
All-Pairs Workload: Data Diffusion vs. Active Storage

• Push vs. Pull
  – Active Storage:
    • Pushes workload working set to all nodes
    • Static spanning tree
  – Data Diffusion:
    • Pulls task working set
    • Incremental spanning forest

[Figure: efficiency (0–100%) for Best Case (active storage), Falkon (data diffusion), and Falkon (GPFS), across the three experiments below]

Experiment                   | Approach                   | Local Disk/Memory (GB) | Network node-to-node (GB) | Shared File System (GB)
500x500, 200 CPUs, 1 sec     | Best Case (active storage) | 6000                   | 1536                      | 12
500x500, 200 CPUs, 1 sec     | Falkon (data diffusion)    | 6000                   | 1698                      | 34
500x500, 200 CPUs, 0.1 sec   | Best Case (active storage) | 6000                   | 1536                      | 12
500x500, 200 CPUs, 0.1 sec   | Falkon (data diffusion)    | 6000                   | 1528                      | 62
1000x1000, 4096 CPUs, 4 sec  | Best Case (active storage) | 24000                  | 12288                     | 24
1000x1000, 4096 CPUs, 4 sec  | Falkon (data diffusion)    | 24000                  | 4676                      | 384
All-Pairs Workload: Data Diffusion vs. Active Storage

• Best to use active storage if:
  – Slow data source
  – Workload working set fits on local node storage
  – Good aggregate network bandwidth
• Best to use data diffusion if:
  – Medium to fast data source
  – Task working set << workload working set
  – Task working set fits on local node storage
  – Good aggregate network bandwidth
• If task working set does not fit on local node storage:
  – Use a parallel file system (i.e. GPFS, Lustre, PVFS, etc.)
Limitations of Data Diffusion

• Needs Java 1.4+
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Per-task working set must fit in local storage
• Task definition must include input/output file metadata
• Data access patterns: write once, read many
Related Work: Data Management

• [Beynon01]: DataCutter
• [Ranganathan03]: Simulations
• [Ghemawat03, Dean04, Chang06]: GFS, MapReduce, BigTable
• [Liu04]: GridDB
• [Chervenak04, Chervenak06]: RLS (Replica Location Service), DRS (Data Replication Service)
• [Tatebe04, Xiaohui05]: GFarm
• [Branco04, Adams06]: DIAL/ATLAS
• [Kosar06]: Stork
• [Thain08]: Chirp/Parrot

Conclusion: none focused on the co-location of storage and generic black box computations with data-aware scheduling while operating in a dynamic environment
Scaling from 1K to 100K CPUs without Data Diffusion

• At 1K CPUs:
  – 1 server to manage all 1K CPUs
  – Use shared file system extensively
    • Invoke application from shared file system
    • Read/write data from/to shared file system
• At 100K CPUs:
  – N servers to manage 100K CPUs (1:256 ratio)
  – Don't trust the application I/O access patterns to behave optimally
    • Copy applications and input data to RAM
    • Read input data from RAM, compute, and write results to RAM
    • Archive all results in a single file in RAM
    • Copy 1 result file from RAM back to GPFS
• Great potential for improvements
  – Could leverage the Torus network for high aggregate bandwidth
  – Collective I/O (CIO) primitives
  – Roadblocks: machine global IP connectivity, Java support, and time
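A sketch of the RAM-staging pattern described above; the paths and file names are placeholders, and the sketch uses temp directories as stand-ins for GPFS and a RAM file system so it runs anywhere:

import java.io.IOException;
import java.nio.file.*;

// Sketch of the staging pattern above: copy input from the shared file
// system into a RAM-backed directory, compute against RAM, archive one
// result file, and copy only that file back to the shared file system.
public class RamStaging {
    public static void main(String[] args) throws IOException {
        Path gpfs = Files.createTempDirectory("gpfs-stand-in"); // stand-in for GPFS
        Path ram  = Files.createTempDirectory("ram-stand-in");  // stand-in for a RAM FS
        Files.writeString(gpfs.resolve("input.dat"), "input data");

        // 1. Stage application and input data from GPFS into RAM.
        Files.copy(gpfs.resolve("input.dat"), ram.resolve("input.dat"));

        // 2. Compute: read input from RAM, write results to RAM.
        String input = Files.readString(ram.resolve("input.dat"));
        Files.writeString(ram.resolve("results.tar"), "archived: " + input);

        // 3. Copy the single archived result file back to the shared FS.
        Files.copy(ram.resolve("results.tar"), gpfs.resolve("results.tar"));
        System.out.println("staged result back to " + gpfs);
    }
}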
Mythbusting

• "Embarrassingly" ("happily") parallel apps are trivial to run
  – Logistical problems can be tremendous
• Loosely coupled apps do not require "supercomputers"
  – Total computational requirements can be enormous
  – Individual tasks may be tightly coupled
  – Workloads frequently involve large amounts of I/O
  – Make use of idle resources from "supercomputers" via backfilling
  – Costs to run "supercomputers" per FLOP are among the best
    • BG/P: 0.35 gigaflops/watt (higher is better)
    • SiCortex: 0.32 gigaflops/watt
    • BG/L: 0.23 gigaflops/watt
    • x86-based HPC systems: an order of magnitude lower
• Loosely coupled apps do not require specialized system software
• Shared file systems are good for all applications
  – They don't scale proportionally with the compute resources
  – Data-intensive applications don't perform and scale well
Conclusions & Contributions

• Defined an abstract model for performance efficiency of data analysis workloads using data-centric task farms
• Provided a reference implementation (Falkon)
  – Uses a streamlined dispatcher to increase task throughput by several orders of magnitude over traditional LRMs
  – Uses multi-level scheduling to reduce perceived wait queue time for tasks to execute on remote resources
  – Addresses data diffusion through co-scheduling of storage and computational resources to improve performance and scalability
  – Provides the benefits of dedicated hardware without the associated high cost
  – Shows the effectiveness of data diffusion: a real large-scale astronomy application and a variety of synthetic workloads
More Information

• More information: http://people.cs.uchicago.edu/~iraicu/
• Related projects:
  – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
  – AstroPortal: http://people.cs.uchicago.edu/~iraicu/projects/Falkon/astro_portal.htm
  – Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:
  – NASA: Ames Research Center, Graduate Student Research Program (GSRP)
  – DOE: Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
  – NSF: TeraGrid
• In conjunction with IEEE/ACM SuperComputing 2008
• Location: Austin, Texas
• Date: November 17th, 2008

• In conjunction with IEEE/ACM SuperComputing 2008
• Location: Austin, Texas
• Date: November 18th, 2008