John Bent
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
Explicit Control in a Batch-Aware Distributed File System
www.cs.wisc.edu/condor
Focus of work
› Harnessing, managing remote storage
› Batch-pipelined I/O intensive workloads
› Scientific workloads
› Wide-area grid computing
Batch-pipelined workloads
› General properties
  Large number of processes
  Process and data dependencies
  I/O intensive
› Different types of I/O
  Endpoint
  Batch
  Pipeline
Batch-pipelined workloads
[Figure: a batch-pipelined workload: multiple job pipelines linked by pipeline data, sharing common batch datasets, with endpoint inputs and outputs at the edges.]
Wide-area grid computing
[Figure: a home storage server connected to remote compute clusters across the Internet.]
Cluster-to-cluster (c2c)
› Not quite p2p
  More organized
  Less hostile
  More homogeneity
  Correlated failures
› Each cluster is autonomous
  Run and managed by different entities
› An obvious bottleneck is the wide-area link

How do we manage the flow of data into, within, and out of these clusters?
Current approaches
› Remote I/O (Condor standard universe)
  Very easy
  Consistency through serialization
› Prestaging (Condor vanilla universe)
  Manually intensive
  Good performance through knowledge
› Distributed file systems (AFS, NFS)
  Easy to use, uniform name space
  Impractical in this environment
Pros and cons

               Practical   Easy to use   Leverages workload info
Remote I/O         √            √                  X
Pre-staging        √            X                  √
Trad. DFS          X            √                  X
BAD-FS
› Solution: the Batch-Aware Distributed File System
› Leverages workload info with storage control
  Detailed information about the workload is known
  Storage layer allows external control
  External scheduler makes informed storage decisions
› Combining information and control results in
  Improved performance
  More robust failure handling
  Simplified implementation

               Practical   Easy to use   Leverages workload info
BAD-FS             √            √                  √
Practical and deployable
› User-level; requires no privilege
› Packaged as a modified Condor system
› A Condor system which includes BAD-FS
› General; glide-in works everywhere

[Figure: BAD-FS glided in atop remote SGE clusters, connected to the home store across the Internet.]
BAD-FS == Condor ++
1) NeST storage management
2) Batch-Aware Distributed File System
3) Expanded Condor submit language
4) BAD-FS scheduler

[Figure: Condor DAGMan feeds a job queue to the BAD-FS scheduler, which dispatches jobs to compute nodes each running a Condor startd, BAD-FS, and NeST, with home storage at the submit site.]
BAD-FS knowledge
› Remote cluster knowledge
  Storage availability
  Failure rates
› Workload knowledge
  Data type (batch, pipeline, or endpoint)
  Data quantity
  Job dependencies
Control through lots
› Abstraction that allows external storage control
› Guaranteed storage allocations
  Containers for job I/O, e.g. “I need 2 GB of space for at least 24 hours”
› Scheduler
  Creates lots to cache input data
  • Subsequent jobs can reuse this data
  Creates lots to buffer output data
  • Destroys pipeline, copies endpoint
  Configures workload to access lots
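The lot request quoted above (“I need 2 GB of space for at least 24 hours”) can be sketched as a tiny allocation object. The class name, fields, and methods here are illustrative assumptions, not the real BAD-FS/NeST interface; the point is only that a lot is a sized, time-limited, guaranteed allocation that refuses writes beyond its reservation.

```python
import time

class Lot:
    """A guaranteed storage allocation on a remote storage server.
    (Hypothetical sketch; names/fields are not the actual BAD-FS API.)"""
    def __init__(self, size_bytes, duration_s):
        self.size_bytes = size_bytes          # guaranteed capacity
        self.used = 0
        self.expires = time.time() + duration_s  # lease, not best-effort

    def write(self, nbytes):
        # Writes beyond the reservation are refused outright, so the
        # scheduler's capacity plan stays trustworthy.
        if self.used + nbytes > self.size_bytes:
            raise OSError("lot full: allocation exceeded")
        self.used += nbytes

# "I need 2 GB of space for at least 24 hours"
lot = Lot(size_bytes=2 * 1024**3, duration_s=24 * 3600)
lot.write(500 * 1024**2)   # e.g. cache a 500 MB batch dataset
```

A scheduler holding such handles can later reuse a cached input lot for subsequent jobs, or destroy a pipeline-output lot once its consumer finishes.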
Knowledge plus control
› Enhanced performance
  I/O scoping
  Capacity-aware scheduling
› Improved failure handling
  Cost-benefit replication
› Simplified implementation
  No cache consistency protocol
I/O scoping
› Technique to minimize wide-area traffic
› Allocate lots to cache batch data
› Allocate lots for pipeline and endpoint data
› Extract endpoint
› Cleanup

AMANDA: 200 MB pipeline, 500 MB batch, 5 MB endpoint
[Figure: BAD-FS scheduler directing compute nodes across the Internet; in steady state only 5 of the 705 MB traverse the wide-area link.]
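The steady-state number (5 of 705 MB crossing the wide-area link) follows directly from scoping each I/O type. This sketch (the function is mine, not BAD-FS code) makes the arithmetic explicit: pipeline data never leaves the cluster, batch data crosses only on a cold cache, and only endpoint data always traverses the link.

```python
def wide_area_mb(batch_mb, pipeline_mb, endpoint_mb, batch_cached=True):
    """Per-job wide-area traffic under I/O scoping (illustrative sketch).
    Pipeline data stays inside the cluster; batch data crosses only when
    the cluster cache is cold; endpoint data always crosses."""
    return endpoint_mb + (0 if batch_cached else batch_mb)

# AMANDA: 200 MB pipeline, 500 MB batch, 5 MB endpoint
print(wide_area_mb(batch_mb=500, pipeline_mb=200, endpoint_mb=5))  # 5
```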
Capacity-aware scheduling
› Technique to avoid over-allocations
› Scheduler has knowledge of Storage availability Storage usage within the workload
› Scheduler runs as many jobs as fit
› Avoids wasted storage utilization
› Improves job throughput
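The bullets above amount to a greedy admission test: run a job only if its declared storage footprint still fits the space available. The names and structure below are illustrative, not the actual BAD-FS scheduler.

```python
def admit_jobs(jobs, capacity_gb):
    """Greedy capacity-aware admission (sketch): start jobs whose storage
    needs fit the remaining space, rather than over-allocating and
    thrashing the caches."""
    running, free = [], capacity_gb
    for name, need_gb in jobs:
        if need_gb <= free:        # job fits: reserve its lot and run it
            running.append(name)
            free -= need_gb
    return running

jobs = [("amanda-1", 2), ("amanda-2", 2), ("amanda-3", 2)]
print(admit_jobs(jobs, capacity_gb=5))  # only two of the three fit
```

Jobs that do not fit simply wait for a lot to be freed, which trades a little latency for never evicting data a running job still needs.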
Improved failure handling
› Scheduler understands data semantics
  Data is not just a collection of bytes
  Losing data is not catastrophic
  • Output can be regenerated by rerunning jobs
› Cost-benefit replication
  Replicates only data whose replication cost is cheaper than the cost to rerun the job
› Can improve throughput in a lossy environment
Simplified implementation
› Data dependencies known
› Scheduler ensures proper ordering
› Build a distributed file system
  With cooperative caching
  But without a cache consistency protocol
Real workloads
› AMANDA: astrophysics study of cosmic events such as gamma-ray bursts
› BLAST: biology search for proteins within a genome
› CMS: physics simulation of large particle colliders
› HF: chemistry study of non-relativistic interactions between atomic nuclei and electrons
› IBIS: ecology global-scale simulation of earth's climate used to study effects of human activity (e.g. global warming)
Real workload experience
› Setup
  16 jobs
  16 compute nodes
  Emulated wide-area network
› Configurations
  Remote I/O
  AFS-like with /tmp
  BAD-FS
› Result is an order-of-magnitude improvement
BAD Conclusions
› Schedulers can obtain workload knowledge
› Schedulers need storage control
  Caching
  Consistency
  Replication
› Combining this control with knowledge yields
  Enhanced performance
  Improved failure handling
  Simplified implementation
For more information
› http://www.cs.wisc.edu/condor/publications.html
› Questions?
“Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003.
“Explicit Control in a Batch-Aware Distributed File System,” John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI ’04, 2004.