Using Application Structure to Handle Failures
and Improve Performance in a Migratory File Service
John Bent, Douglas Thain, Andrea Arpaci-Dusseau,
Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Project
14 April 2003
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
CPU Bound
• SETI@Home, Folding@Home, etc.
  – Excellent applications of distributed computing.
  – KB of data, days of CPU time.
  – Efficient to do tiny I/O on demand.
• Supporting systems:
  – Condor
  – BOINC
  – Google Toolbar
  – Custom software.
I/O Bound
• D-Zero data analysis:
  – Excellent app for cluster computing.
  – GB of data, seconds of CPU time.
  – Efficient to compute whenever data is ready.
• Supporting systems:
  – Fermi SAM
  – High-throughput document scanning
  – Custom software.
Batch Pipelined Applications
(Figure: a batch-pipelined workload. Each pipeline runs its own chain of stages a → b → c, passing pipeline-shared data between stages; batch-shared data x, y, z is read by every pipeline across the batch width.)
Example: AMANDA
(Figure: the four-stage AMANDA pipeline and its data.)
• corsika: reads corsika_input.txt (4 KB) plus batch-shared tables NUCNUCCS, GLAUBTAR, EGSDATA3.3, and QGSDATA4 (~1 MB); writes DAT (23 MB).
• corama: reads DAT; writes corama.out (26 MB).
• mmc: reads corama.out and mmc_input.txt; writes mmc_output.dat (126 MB).
• amasim: reads mmc_output.dat, amasim_input.dat, ice tables (3 files, 3 MB), and experiment geometry (100s of files, 500 MB); writes amasim_output.txt (5 MB).
Computing Environment
• Clusters dominate:
  – Similar configurations.
  – Fast interconnects.
  – Single administrative domain.
  – Underutilized commodity storage.
  – En masse, quite unreliable.
• Users wish to harness multiple clusters, but have jobs that are both I/O and CPU intensive.
Ugly Solutions
• “FTP-Net”
  – User finds remote clusters.
  – Manually stages data in.
  – Submits jobs, deals with failures.
  – Pulls data out.
  – Lather, rinse, repeat.
• “Remote I/O”
  – Submit jobs to a remote batch system.
  – Let all I/O come back to the archive.
  – Return in several decades.
What We Really Need
• Access resources outside my domain.
  – Assemble your own army.
• Automatic integration of CPU and I/O access.
  – Forget optimal: save administration costs.
  – Replacing remote with local always wins.
• Robustness to failures.
  – Can’t hire babysitters for New Year’s Eve.
Hawk: A Migratory File Service
• Automatically deploys a “task force” across an existing distributed system.
• Manages applications from a high level, using knowledge of process interactions.
• Provides dependable performance through peer-to-peer techniques.
• Understands and reacts to failures using knowledge of the system and workloads.
Philosophy of Hawk
“In allocating resources, strive to avoid disaster, rather than attempt to obtain an optimum.” - Butler Lampson
Why not AFS+Make?
• Quick answer:
  – Distributed filesystems provide an unnecessarily strong abstraction that is unacceptably expensive to provide in the wide area.
• Better answer after we explain what Hawk is and how it works.
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
Workflow Language 1
job a a.sub
job b b.sub
job c c.sub
job d d.sub
parent a child c
parent b child d
(Figure: the resulting DAG, with job a feeding c and job b feeding d, alongside home storage holding mydata as volume v1 and scratch volumes v2 and v3.)
Workflow Language 2
volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
(Figure: volume v1 maps the archive file mydata into jobs a and b at /data, while scratch volumes v2 and v3 are shared at /tmp along the a-to-c and b-to-d pipelines.)
Workflow Language 3
extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2
(Figure: the file x, written into scratch volumes v2 and v3 by the pipelines, is extracted back to home storage as out.1 and out.2.)
Mapping Logical to Physical
• Abstract jobs
  – Physical jobs in a batch system.
  – May run more than once!
• Logical “scratch” volumes
  – Temporary containers on a scratch disk.
  – May be created, replicated, and destroyed.
• Logical “read” volumes
  – Striped across cooperative proxy caches.
  – May be created, cached, and evicted.
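To make the logical-to-physical mapping above concrete, here is a minimal illustrative sketch in Python of how a workflow manager might resolve the logical volumes declared in the workflow language into physical locations: scratch volumes become containers on an execution host's disk, while read volumes resolve to cached copies of archive data served by the cooperative proxies. The class, function names, and URL layout are assumptions chosen to mirror the slides, not Hawk's actual code.

    # Hypothetical sketch only: names and URL layout are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Volume:
        name: str          # e.g. "v1", "v2"
        kind: str          # "read" (archive-backed) or "scratch"
        source: str = ""   # archive URL for read volumes, e.g. "ftp://home/mydata"

    def resolve(vol: Volume, exec_host: str, container_id: int) -> str:
        """Return a physical location for a logical volume on a chosen host."""
        if vol.kind == "scratch":
            # Scratch volumes live in temporary containers on the execution
            # site's disk; the workflow manager may replicate or destroy them.
            return f"container://{exec_host}/{container_id}"
        # Read volumes are served from the cooperative proxy caches, which
        # pull and cache the data from the home archive.
        return f"cache://{exec_host}/{vol.source.split('://', 1)[1]}"

    v1 = Volume("v1", "read", "ftp://home/mydata")
    v2 = Volume("v2", "scratch")
    print(resolve(v1, "host5", 120))   # cache://host5/home/mydata
    print(resolve(v2, "host5", 120))   # container://host5/120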
Starting System
(Figure: the starting configuration. The home site runs the workflow manager, match maker, batch queue, and archive; the available resources are an unmodified PBS cluster and a Condor pool of ordinary nodes.)
Gliding In
(Figure: gliding in. A glide-in job submitted through each cluster's own batch system starts a master on every node it acquires; the master in turn runs a StartD and a proxy, turning borrowed PBS and Condor nodes into Hawk workers.)
Hawk Architecture
(Figure: the assembled system. The workflow manager, guided by a system model and the application flow, drives the match maker, batch queue, and archive at the home site; each glided-in node runs a StartD and a proxy hosting a job and its agent, and the proxies cooperate as wide-area caches.)
I/O Interactions
(Figure: a job's I/O path. The job issues ordinary POSIX calls such as creat("/tmp/outfile") and open("/data/d15"); the agent's POSIX library interface ships them over the local-area network to the proxy, which maps /tmp to a data container (e.g. container://host5/120/data) and /data to the cooperative block cache (e.g. cache://host5/archive/data), backed by other proxies and ultimately the archive.)
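The remapping in this figure can be pictured as a simple prefix table: paths named by the workflow's mount directives are rewritten into container or cache URLs before a request reaches the proxy. The sketch below is an illustrative assumption in Python (the table contents and function name are invented, following the examples on the slide), not the agent's real interposition code.

    # Illustrative sketch of the agent's namespace remapping (assumption only).
    MOUNTS = {
        "/tmp":  "container://host5/120",   # pipeline data in a scratch container
        "/data": "cache://host5/archive",   # batch data via the cooperative cache
    }

    def remap(path: str) -> str:
        """Rewrite a logical path from the job into a physical URL for the proxy."""
        for prefix, target in sorted(MOUNTS.items(), key=lambda kv: len(kv[0]), reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return target + path[len(prefix):]
        raise FileNotFoundError(f"no volume mounted for {path}")

    print(remap("/tmp/outfile"))   # container://host5/120/outfile
    print(remap("/data/d15"))      # cache://host5/archive/d15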
Cooperative Proxies
(Figure: cooperative proxies A, B, and C glided in alongside the home-site match maker, batch queue, archive, and workflow manager. Each job's agent discovers its local proxy, and the proxies maintain a shared hash map from paths to proxies; the figure traces the map's contents over times t1 through t4 as membership changes.)
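Here is a minimal sketch of the paths-to-proxies hash map, assuming a rendezvous-style hash over the currently live proxies; the hashing scheme and names are illustrative assumptions, not necessarily the mechanism Hawk's cooperative proxies use. The point is that the mapping is deterministic, so any agent can locate the responsible proxy, and losing a proxy reassigns only the paths it owned.

    # Illustrative assumption: rendezvous hashing of paths onto live proxies.
    import hashlib

    def owner(path: str, proxies: list) -> str:
        """Pick the proxy responsible for caching a given path."""
        def score(proxy: str) -> int:
            return int(hashlib.sha1(f"{proxy}:{path}".encode()).hexdigest(), 16)
        return max(proxies, key=score)

    live = ["proxyA", "proxyB", "proxyC"]
    print(owner("/archive/data/d15", live))
    # If proxyA fails, only the paths it owned move to the remaining proxies:
    print(owner("/archive/data/d15", ["proxyB", "proxyC"]))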
Summary
• Archive
  – Sources input data, chooses coordinator.
• Glide-In
  – Deploys a “task force” of components.
• Cooperative Proxies
  – Provide dependable batch read-only data.
• Data Containers
  – Fault-isolated pipeline data.
• Workflow Manager
  – Directs the operation.
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
Performance Testbed
• Controlled testbed:
  – 32 550 MHz dual-CPU cluster machines, 1 GB memory, SCSI disks, 100 Mb/s Ethernet.
  – Simulated WAN: archive storage restricted to 800 KB/s across a router.
• Also some preliminary tests on uncontrolled systems:
  – MFS over a PBS cluster at Los Alamos.
  – MFS over a Condor system at INFN Italy.
Synthetic Apps
(Figure: three two-stage synthetic workloads, each with jobs a and b. Pipe intensive: a 10 MB pipeline file; mixed: 5 MB of batch data plus a 5 MB pipeline file; batch intensive: 10 MB of batch data.)
System Configurations
(Figure: the four configurations compared: Local, Co-Locate Data, Don’t Co-Locate, and Remote.)
Real Applications
• BLAST
  – Search tool for proteins and nucleotides in genomic databases.
• CMS
  – Simulation of a high-energy physics experiment to begin operation at CERN in 2006.
• H-F
  – Simulation of the non-relativistic interactions between nuclei and electrons.
• AMANDA
  – Simulation of a neutrino detector buried in the ice of the South Pole.
Application Throughput
Name      Stages   Remote      Hawk
BLAST          1     4.67    747.40
CMS            2    33.78   1273.96
HF             3    40.96   3187.22
AMANDA         4
Outline
• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing
Related Work
• Workflow management
• Dependency managers: TREC, make
• Private namespaces: UFO, db views
• Cooperative caching: no writes.
• P2P systems: wrong semantics.
• Filesystems: overly strong abstractions.
Why Not AFS+Make?
• Namespaces
  – Constructed per-process at submit time.
• Consistency
  – Enforced at the workflow level.
• Selective Commit
  – Everything tossed unless explicitly saved.
• Fault Awareness
  – CPUs and data can be lost at any point.
• Practicality
  – No special permission required.
Conclusions
• Traditional systems are built from the bottom up: this disk must have five nines, or we’re in big trouble!
• MFS is built from the top down: application semantics drive system structure.
• By posing the right problem, we solve the traditional hard problems of file systems.