Douglas Thain (Miron Livny)
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/~miron
High-Throughput Computing on Commodity Systems
www.cs.wisc.edu/condor
The Good News:
Raw computing power is everywhere - on desktops, shelves, racks, and in your pockets. It is: cheap, plentiful, and mass-produced.
The Bad News:
GFLOPS per year ≠ GFLOPS per second × 30,000,000 seconds/year
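The gap between the two sides of this inequality is the whole argument for HTC: multiplying a peak rate by seconds-per-year overstates real yearly output, because machines are down, busy, or reclaimed by their owners. A minimal sketch of the arithmetic, with purely illustrative numbers (the availability and efficiency figures are assumptions, not measurements):

```python
# Peak ("benchmark") rate versus sustained yearly throughput.
# availability and efficiency below are illustrative assumptions.

peak_gflops = 1.0               # GFLOPS per second under ideal conditions
seconds_per_year = 30_000_000   # the slide's round figure (~347 days)

# Naive extrapolation: pretends the machine computes every second at peak.
naive_gflop_year = peak_gflops * seconds_per_year

# Discount for downtime, owner activity, failures, and restarts.
availability = 0.60   # fraction of the year the machine is actually ours
efficiency = 0.50     # fraction of peak achieved while computing
real_gflop_year = peak_gflops * seconds_per_year * availability * efficiency

print(f"naive: {naive_gflop_year:,.0f} GFLOP/year")  # 30,000,000
print(f"real:  {real_gflop_year:,.0f} GFLOP/year")   # 9,000,000
```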
A variation on a chestnut:
What is a benchmark?
Answer:
The throughput which your system is guaranteed never to exceed!
Why?
› A community of commodity computers can be difficult to manage:
  Dynamic: state and availability change over time.
  Evolving: new hardware and software is continuously acquired and installed.
  Heterogeneous: hardware and software vary across machines.
  Distributed ownership: each machine has a different owner with different requirements and preferences.
Why?
› Even traditionally “static” systems (such as professionally managed clusters) suffer the same problems when viewed at a yearly scale:
  Power failures
  Hardware failures
  Software upgrades
  Load imbalance
  Network imbalance
How do we measure computer performance?
› High-Performance Computing: achieve maximum GFLOPS per second under ideal circumstances.
› High-Throughput Computing: achieve maximum GFLOPS per month or year in whatever conditions prevail.
High-Throughput Computing
› Focuses on maximizing…
  simulations run before the paper deadline…
  crystal lattices per week…
  reconstructions per week…
  video frames rendered per year…
› …without “babysitting” from the user.
› Cannot depend on “ideal” circumstances.
High-Throughput Computing
› Is achieved by:
  Expanding the CPUs available.
  Silently adapting to inevitable changes.
  Robust software.
› Is only marginally affected by:
  MB, MHz, MIPS, FLOPS…
  Robust hardware.
Solution: Condor
› Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.
Who are we?
The Condor Project (Established ‘85)
Distributed-systems CS research performed by a team that faces:
  software engineering challenges in a UNIX/Linux/NT environment,
  active interaction with users and collaborators,
  daily maintenance and support challenges of a distributed production environment, and
  educating and training students.
Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
Users and collaborators
› Scientists - Biochemistry, high energy physics, computer sciences, genetics, …
› Engineers - Hardware design, software building and testing, animation, ...
› Educators - Hardware design tools, distributed systems, networking, ...
National Grid Efforts
› National Technology Grid - NCSA Alliance (NSF-PACI)
› Information Power Grid - IPG (NASA)
› Particle Physics Data Grid - PPDG (DoE)
› Grid Physics Network - GriPhyN (NSF-ITR)
[Chart: “Condor CPUs on the UW Campus” - CPU counts for ‘88, ‘94, ‘99, and ‘00, broken into CS and Other.]
Some Numbers: UW-CS Pool, 6/98-6/00

  Total                     4,000,000 hours   ~450 years
  “Real” users              1,700,000 hours   ~260 years
    CS-Optimization           610,000 hours
    CS-Architecture           350,000 hours
    Physics                   245,000 hours
    Statistics                 80,000 hours
    Engine Research Center     38,000 hours
    Math                       90,000 hours
    Civil Engineering          27,000 hours
    Business                      970 hours
  “External” users            165,000 hours   ~19 years
    MIT                        76,000 hours
    Cornell                    38,000 hours
    UCSD                       38,000 hours
    CalTech                    18,000 hours
Start slow, but think BIG
Start slow, but think big!
  One Personal Condor - 1 machine on your desktop
  Condor Pool - 100 machines in your department
  Condor-G - 1000 machines in the GRID
Start slow, but think big!
› Personal Condor: manage just your machine with Condor. Fault tolerance, policy control, logging. Sleep soundly at night.
› Condor Pool: take advantage of your friends and colleagues: share cycles, gain ~100x throughput.
› Condor-G: jobs from your pool migrate to other computational facilities around the world. Gain 1000x throughput. (Record-breaking results!)
Key Condor User Services
› Local control - jobs are stored and managed locally by a personal scheduler.
› Priority scheduling - execution order controlled by a priority ranking assigned by the user.
› Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed.
› Local executing environment preserved - re-linked jobs can have their I/O redirected to the submission site.
More Condor User Services
› Powerful and flexible means for selecting execution site (requirements and preferences)
› Logging of job activities.
› Management of large numbers (10K+) of jobs per user.
› Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager).
› Support for dynamic master-worker (MW) applications (PVM- and file-based).
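DAGMan's core job, running each node only after all of its parents complete, can be sketched as a topological sort. This is an illustration of the idea (Kahn's algorithm), not DAGMan's implementation; the job names are hypothetical:

```python
# The idea behind DAGMan: run a job only after all of its parents have
# finished. Illustrative sketch; not DAGMan's actual implementation.
from collections import deque

def run_order(deps):
    """deps maps job -> list of parent jobs; returns one valid order."""
    jobs = set(deps) | {p for parents in deps.values() for p in parents}
    waiting = {j: set(deps.get(j, ())) for j in jobs}
    ready = deque(sorted(j for j, parents in waiting.items() if not parents))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for other, parents in waiting.items():
            if job in parents:              # 'other' was waiting on 'job'
                parents.remove(job)
                if not parents:             # all parents done: now runnable
                    ready.append(other)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle: not a DAG")
    return order

# A diamond workflow: setup -> {analyze, simulate} -> summarize
dag = {"analyze": ["setup"], "simulate": ["setup"],
       "summarize": ["analyze", "simulate"]}
print(run_order(dag))   # setup first, summarize last
```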
How does it work?
Basic HTC Mechanisms
› Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).
› Fault tolerance - checkpointing enables preemptive-resume scheduling (go ahead and use it as long as it is available!).
› Remote execution - enables transparent access to resources from any machine in the world.
› Asynchronicity - enables management of dynamic (opportunistic) resources.
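The matchmaking mechanism can be sketched in a few lines: both sides publish attributes plus a Requirements predicate over the other side, and a match requires both predicates to hold. The attribute names mirror ClassAds, but the dict-and-lambda encoding and all the values are purely illustrative:

```python
# ClassAd-style matchmaking in miniature: each side publishes attributes
# plus a Requirements predicate over the other side. Values illustrative.

machine = {
    "Arch": "INTEL", "OpSys": "LINUX", "Memory": 512,
    "Requirements": lambda other: other["ImageSize"] <= 512,
}
jobs = [
    {"Owner": "raman", "ImageSize": 256,
     "Requirements": lambda other: other["Arch"] == "INTEL"},
    {"Owner": "rival", "ImageSize": 1024,
     "Requirements": lambda other: other["OpSys"] == "LINUX"},
]

def match(ad_a, ad_b):
    """The matchmaker pairs two ads only if BOTH Requirements hold."""
    return ad_a["Requirements"](ad_b) and ad_b["Requirements"](ad_a)

for job in jobs:
    verdict = "match" if match(job, machine) else "no match"
    print(job["Owner"], "->", verdict)   # raman matches; rival is too big
```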
Every Community needs a Matchmaker!
Why? Because ...
… someone has to bring together community members who have requests for goods and services with members who offer them.
  Both sides are looking for each other.
  Both sides have constraints.
  Both sides have preferences.
ClassAd - Properties

  Type        = "Machine";
  Activity    = "Idle";
  KbdIdle     = '00:22:31';
  Disk        = 2.1G;        // 2.1 Gigabytes
  Memory      = 64M;         // 64 Megabytes
  State       = "Unclaimed";
  LoadAverage = 0.042969;
  Arch        = "INTEL";
  OpSys       = "SOLARIS251";
ClassAd - Policy

  RsrchGrp  = { "raman", "miron", "solomon" };
  Friends   = { "dilbert", "wally" };
  Untrusted = { "rival", "riffraff", "TPHB" };

  Tier = member(RsrchGrp, other.Owner) ? 2
       : ( member(Friends, other.Owner) ? 1 : 0 );

  Requirements = !member(Untrusted, other.Owner)
      && ( Tier == 2 ? True
         : Tier == 1 ? LoadAvg < 0.3 && KbdIdle > '00:15'
         : DayTime() < '08:00' || DayTime() > '18:00' );
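As a sanity check, the slide's three-tier policy can be rendered in plain Python. The user sets and thresholds come from the ad; the example inputs at the bottom are made up:

```python
# The three-tier policy in plain Python. Group membership and thresholds
# come from the ad; the example calls below are made-up inputs.

RSRCH_GRP = {"raman", "miron", "solomon"}
FRIENDS = {"dilbert", "wally"}
UNTRUSTED = {"rival", "riffraff", "TPHB"}

def requirements(owner, load_avg, kbd_idle_min, hour):
    if owner in UNTRUSTED:                      # untrusted users: never
        return False
    tier = 2 if owner in RSRCH_GRP else (1 if owner in FRIENDS else 0)
    if tier == 2:
        return True                             # research group: always
    if tier == 1:                               # friends: only an idle machine
        return load_avg < 0.3 and kbd_idle_min > 15
    return hour < 8 or hour > 18                # strangers: outside work hours

print(requirements("miron", 0.9, 0, 12))     # True  (group member)
print(requirements("dilbert", 0.1, 20, 12))  # True  (friend, machine idle)
print(requirements("rival", 0.0, 60, 2))     # False (untrusted)
```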
Advantages of Matchmaking
› Hybrid (centralized + distributed) resource allocation algorithm
› End-to-end verification
› Bilateral specialization
› Weak consistency requirements
› Authentication
› Fault tolerance
› Incremental system evolution
Fault-Tolerance
› Condor can checkpoint a program by writing its image to disk.
› If a machine should fail, the program may resume from the last checkpoint.
› If a job must vacate a machine, it may resume from where it left off.
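Condor checkpoints a whole process image transparently; the same resume-from-last-checkpoint idea can be shown with explicit state and pickle. Everything here (the file name, the checkpoint interval, the workload) is illustrative:

```python
# Resume-from-last-checkpoint with explicit state. Condor checkpoints the
# whole process image; pickling named state is the same idea in miniature.
import os, pickle, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "htc_demo.ckpt")

def long_sum(n, ckpt=CKPT):
    start, total = 0, 0
    if os.path.exists(ckpt):            # a crash or eviction left a checkpoint:
        with open(ckpt, "rb") as f:     # pick up where we left off
            start, total = pickle.load(f)
    for i in range(start, n):
        total += i
        if i % 1000 == 0:               # periodically save (next step, state)
            with open(ckpt, "wb") as f:
                pickle.dump((i + 1, total), f)
    if os.path.exists(ckpt):
        os.remove(ckpt)                 # finished cleanly: discard checkpoint
    return total

print(long_sum(10_000))   # 49995000, even if interrupted and re-run
```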
Remote Execution
› Condor might run your jobs on machines spread around the world – not all of them will have your files.
› Condor provides an adapter – a library – which converts your job’s I/O operations into remote I/O back to your home machine.
› No matter where your job runs, it sees the same environment.
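The adapter idea can be sketched as an interposition layer: job code asks for a file through one entry point, and the layer decides whether to serve it locally or fetch it from the home machine. The "remote" side here is faked with an in-memory dict and a made-up file name; real Condor does this with a re-linked I/O library, as the slide says:

```python
# Interposition sketch: the job calls one open-like entry point, and the
# adapter serves local files directly and "remote" ones from home.
# HOME_FILES and the file names are illustrative stand-ins.
import io, os

HOME_FILES = {"nug30_input.dat": b"data that lives on the submit machine\n"}

class RemoteIO:
    def open_for_read(self, path):
        if os.path.exists(path):
            return open(path, "rb")        # file is local: ordinary I/O
        if path in HOME_FILES:             # pretend this is an RPC to home
            return io.BytesIO(HOME_FILES[path])
        raise FileNotFoundError(path)

job_io = RemoteIO()
with job_io.open_for_read("nug30_input.dat") as f:   # job code looks unchanged
    print(f.read().decode(), end="")
```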
Asynchronicity
› A fact of life in a system of 1000s of machines:
  Power on/off
  Lunch breaks
  Jobs start and finish
› Condor never depends on a fixed configuration - it works with whatever is available.
Does it work?
An example - NUG28

We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well-known nug30 problem using the distance matrix from a 4-by-7 grid and the flow matrix from nug30 with the last 2 facilities deleted. This is, to our knowledge, the largest instance from the nugxx series ever provably solved to optimality.
The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing
system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged
approximately 200. Machines from UW, UNM, and INFN all participated in the computation.
NUG30 Personal Condor …
For the run we will be flocking to
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using "glide_in" to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
It works!!!
Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT) From: Jeff Linderoth <[email protected]> To: Miron Livny <[email protected]> Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT) From: Jeff Linderoth <[email protected]>
Still rolling along. Over three billion nodes in about 1 day!
Up to a Point …
Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT) From: Jeff Linderoth <[email protected]> Hi Gang,
The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.
Back in Business
Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT) From: Jeff Linderoth <[email protected]>
Hi Gang,
We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.
The First 600K seconds …
We made it!!!
Sender: [email protected] Subject: Re: Let the festivities begin.
Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of
Condor Time. In just seven days !
More stats tomorrow !!! We are off celebrating !
condor rules !
cheers,
JP.
Do not be picky, be agile!!!