Douglas Thain (Miron Livny) Computer Sciences Department University of Wisconsin-Madison...

Page 1: Douglas Thain (Miron Livny) Computer Sciences Department University of Wisconsin-Madison miron@cs.wisc.edu miron High-Throughput.

Douglas Thain (Miron Livny)

Computer Sciences Department
University of Wisconsin-Madison

miron@cs.wisc.edu
http://www.cs.wisc.edu/~miron

High-Throughput Computing on Commodity Systems

Page 2:

www.cs.wisc.edu/condor

The Good News:

Raw computing power is everywhere: on desktops, shelves, racks, and in your pockets. It is:
  - Cheap
  - Plentiful
  - Mass-produced

Page 3:

The Bad News:

GFLOPS per year ≠ GFLOPS per second × 30,000,000 seconds/year
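The inequality above can be made concrete with a small Python sketch. Only the 30,000,000 seconds per year comes from the slide; the peak rate, the availability fraction, and the efficiency fraction are made-up illustrative inputs, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 30_000_000  # round figure used on the slide


def nominal_gflop_per_year(peak_gflops: float) -> float:
    """Naive estimate: assume the peak rate is sustained all year."""
    return peak_gflops * SECONDS_PER_YEAR


def delivered_gflop_per_year(peak_gflops: float,
                             availability: float,
                             efficiency: float) -> float:
    """Scale the naive estimate by the fraction of the year the machine
    is usable (availability) and the fraction of peak the workload
    actually reaches (efficiency). Both are assumed inputs in [0, 1]."""
    return peak_gflops * SECONDS_PER_YEAR * availability * efficiency


# Hypothetical machine: 1 GFLOPS peak, usable 60% of the year,
# with the workload reaching 50% of peak while it runs.
nominal = nominal_gflop_per_year(1.0)
delivered = delivered_gflop_per_year(1.0, availability=0.6, efficiency=0.5)
print(f"nominal:   {nominal:.3e} GFLOP/year")
print(f"delivered: {delivered:.3e} GFLOP/year")
```

With these assumed factors the delivered work is less than a third of the nominal figure, which is the gap the slide is pointing at.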

Page 4:

A variation on a chestnut:

What is a benchmark?

Page 5:

Answer:

The throughput which your system is guaranteed never to exceed!

Page 6:

Why?

› A community of commodity computers can be difficult to manage:
  - Dynamic: State and availability change over time
  - Evolving: New hardware and software are continuously acquired and installed
  - Heterogeneous: Hardware and software
  - Distributed ownership: Each machine has a different owner with different requirements and preferences.

Page 7:

Why?

› Even traditionally “static” systems (such as professionally managed clusters) suffer the same problems when viewed at a yearly scale:
  - Power failures
  - Hardware failures
  - Software upgrades
  - Load imbalance
  - Network imbalance

Page 8:

How do we measure computer performance?

› High-Performance Computing: Achieve max GFLOP per second under ideal circumstances.

› High-Throughput Computing: Achieve max GFLOP per month or year in whatever conditions prevail.

Page 9:

High-Throughput Computing

› Focuses on maximizing…
  - simulations run before the paper deadline…
  - crystal lattices per week…
  - reconstructions per week…
  - video frames rendered per year…

› …without “babysitting” from the user.

› Cannot depend on “ideal” circumstances.

Page 10:

High-Throughput Computing

› Is achieved by:
  - Expanding the CPUs available.
  - Silently adapting to inevitable changes.
  - Robust software

› Is only marginally affected by:
  - MB, MHz, MIPS, FLOPS…
  - Robust hardware

Page 11:

Solution: Condor

› Condor is software for creating a high-throughput computing environment on a community of workstations, ranging from commodity PCs to supercomputers.

Page 12:

Who are we?

Page 13:

The Condor Project (Established ‘85)

Distributed systems CS research performed by a team that faces:
  - software engineering challenges in a UNIX/Linux/NT environment,
  - active interaction with users and collaborators,
  - daily maintenance and support challenges of a distributed production environment, and
  - educating and training students.

Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft, and the UW Graduate School.

Page 14:

Users and collaborators

› Scientists - Biochemistry, high energy physics, computer sciences, genetics, …

› Engineers - Hardware design, software building and testing, animation, ...

› Educators - Hardware design tools, distributed systems, networking, ...

Page 15:

National Grid Efforts

› National Technology Grid - NCSA Alliance (NSF-PACI)

› Information Power Grid - IPG (NASA)

› Particle Physics Data Grid - PPDG (DoE)

› Grid Physics Network - GriPhyN (NSF-ITR)

Page 16:

[Figure: Condor CPUs on the UW Campus. Chart of CPU counts (axis from 0 to 1400) for '88, '94, '99, and '00, with the series CS and Other.]

Page 17:

Some Numbers: UW-CS Pool

6/98-6/00                 4,000,000 hours   ~450 years

“Real” Users              1,700,000 hours   ~260 years
  CS-Optimization           610,000 hours
  CS-Architecture           350,000 hours
  Physics                   245,000 hours
  Statistics                 80,000 hours
  Engine Research Center     38,000 hours
  Math                       90,000 hours
  Civil Engineering          27,000 hours
  Business                      970 hours

“External” Users            165,000 hours   ~19 years
  MIT                        76,000 hours
  Cornell                    38,000 hours
  UCSD                       38,000 hours
  CalTech                    18,000 hours

Page 18:

Start slow, but think BIG

Page 19:

Start slow, but think big!

Personal Condor: 1 machine on your desktop

Condor Pool: 100 machines in your department

Condor-G: 1000 machines in the GRID.

Page 20:

Start slow, but think big!

› Personal Condor: Manage just your machine with Condor. Fault tolerance, policy control, logging. Sleep soundly at night.

› Condor Pool: Take advantage of your friends and colleagues: share cycles, gain ~100x throughput.

› Condor-G: Jobs from your pool migrate to other computational facilities around the world. Gain 1000x throughput. (Record-breaking results!)

Page 21:

Key Condor User Services

› Local control - jobs are stored and managed locally by a personal scheduler.

› Priority scheduling - execution order controlled by priority ranking assigned by user.

› Job preemption - re-linked jobs can be checkpointed, suspended, held, and resumed.

› Local execution environment preserved - re-linked jobs can have their I/O redirected to the submission site.

Page 22:

More Condor User Services

› Powerful and flexible means for selecting execution site (requirements and preferences)

› Logging of job activities.

› Management of large (10K) numbers of jobs per user.

› Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager)

› Support for dynamic MW (PVM and File) applications
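The idea behind the job-dependency support listed above can be sketched in a few lines of Python. This is not DAGMan and does not use its input format; it only illustrates releasing a job once all of its parents have finished. The job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
dependencies = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collect": {"simulate_a", "simulate_b"},
}


def run_order(deps):
    """Return batches of jobs. Every job in a batch has all of its
    parents in earlier batches, so a batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to release children
    return batches


batches = run_order(dependencies)
for i, batch in enumerate(batches):
    print(i, batch)
```

Here the two simulation jobs land in the same batch because neither depends on the other, while `collect` waits for both.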

Page 23:

How does it work?

Page 24:

Basic HTC Mechanisms

› Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).

› Fault tolerance - Checkpointing enables preemptive resume scheduling (go ahead and use it as long as it is available!).

› Remote execution – enables transparent access to resources from any machine in the world.

› Asynchronicity - enables management of dynamic (opportunistic) resources.

Page 25:

Every Community needs a Matchmaker!

Page 26:

Why? Because ...

... someone has to bring together community members who have requests for goods and services with members who offer them.
  - Both sides are looking for each other
  - Both sides have constraints
  - Both sides have preferences
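A minimal Python sketch of this two-sided matching follows. It is not Condor's matchmaker or the ClassAd language; the dictionaries, attribute names, and the `matches` and `best_offer` helpers are all hypothetical. For brevity only the requester's preference is used for ranking, although the offering side can have preferences too.

```python
def matches(request, offer):
    """A match requires that each side's constraint accepts the other."""
    return request["requirements"](offer) and offer["requirements"](request)


def best_offer(request, offers):
    """Among compatible offers, pick the one the requester ranks highest."""
    compatible = [o for o in offers if matches(request, o)]
    if not compatible:
        return None
    return max(compatible, key=request["rank"])


# Hypothetical advertisements; attribute names are illustrative only.
job = {
    "owner": "miron",
    "memory_needed": 32,
    "requirements": lambda m: m["arch"] == "INTEL" and m["memory"] >= 32,
    "rank": lambda m: m["memory"],  # preference: more memory is better
}
machines = [
    {"name": "a", "arch": "INTEL", "memory": 64,
     "requirements": lambda j: j["owner"] != "rival"},
    {"name": "b", "arch": "INTEL", "memory": 128,
     "requirements": lambda j: j["memory_needed"] <= 16},  # rejects this job
    {"name": "c", "arch": "SPARC", "memory": 256,
     "requirements": lambda j: True},                       # wrong architecture
]

chosen = best_offer(job, machines)
print(chosen["name"] if chosen else "no match")
```

Machine `b` has the most memory among the INTEL machines, but its own constraint rejects the job, so the match goes to `a`: a match needs agreement from both sides.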

Page 27:

ClassAd - Properties

Type = "Machine";
Activity = "Idle";
KbdIdle = '00:22:31';
Disk = 2.1G; // 2.1 Gigs
Memory = 64M; // 64 Megs
State = "Unclaimed";
LoadAverage = 0.042969;
Arch = "INTEL";
OpSys = "SOLARIS251";

Page 28:

ClassAd - Policy

RsrchGrp = { "raman", "miron", "solomon" };
Friends = { "dilbert", "wally" };
Untrusted = { "rival", "riffraff", "TPHB" };

Tier = member(RsrchGrp, other.Owner) ? 2 : ( member(Friends, other.Owner) ? 1 : 0 );

Requirements = !member(Untrusted, other.Owner) &&
    ( Tier == 2 ? True
    : Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' )
    : ( DayTime() < '08:00' || DayTime() > '18:00' ) );
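The tiered policy above can be restated as ordinary Python to make the decision order explicit. This is an illustrative translation, not ClassAd semantics: the function names are hypothetical, and it assumes that '00:15' means 15 minutes of keyboard idle time.

```python
from datetime import time, timedelta

RESEARCH_GROUP = {"raman", "miron", "solomon"}
FRIENDS = {"dilbert", "wally"}
UNTRUSTED = {"rival", "riffraff", "TPHB"}


def tier(owner):
    """2 for the research group, 1 for friends, 0 for everyone else."""
    if owner in RESEARCH_GROUP:
        return 2
    if owner in FRIENDS:
        return 1
    return 0


def requirements(owner, load_avg, kbd_idle, now):
    """Decide whether the machine accepts a job from this owner."""
    if owner in UNTRUSTED:
        return False                      # never run untrusted jobs
    t = tier(owner)
    if t == 2:
        return True                       # research group: always welcome
    if t == 1:
        # Friends: only when the machine is lightly loaded and idle.
        # Assumption: '00:15' in the policy means 15 minutes.
        return load_avg < 0.3 and kbd_idle > timedelta(minutes=15)
    # Strangers: only outside working hours.
    return now < time(8, 0) or now > time(18, 0)


print(requirements("dilbert", 0.04, timedelta(minutes=22, seconds=31), time(12, 0)))
```

The example call uses the load average and keyboard idle time from the machine advertisement on the previous slide, with a friend as the job owner.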

Page 29:

Advantages of Matchmaking

› Hybrid (Centralized+Distributed) resource allocation algorithm

› End-to-end verification

› Bilateral specialization

› Weak consistency requirements

› Authentication

› Fault tolerance

› Incremental system evolution

Page 30:

Fault-Tolerance

› Condor can checkpoint a program by writing its image to disk.

› If a machine should fail, the program may resume from the last checkpoint.

› If a job must vacate a machine, it may resume from where it left off.
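The resume-from-checkpoint behaviour can be illustrated with a small Python sketch. Condor checkpoints the image of the running program; the sketch below is a different, simpler technique in which the application saves its own state explicitly. The function name, the state layout, and the simulated failure are all hypothetical.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, fail_at=None):
    """Sum 0..total_steps-1, saving progress after every step.
    If fail_at is given, simulate a machine failure at that step."""
    # Resume from the last checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated machine failure")
        state["total"] += step
        state["next_step"] = step + 1
        # Write to a temporary file, then rename, so a crash during the
        # write never leaves a half-written checkpoint behind.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state["total"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_with_checkpoints(10, ckpt, fail_at=6)  # first attempt is interrupted
except RuntimeError:
    pass
result = run_with_checkpoints(10, ckpt)        # second attempt resumes at step 6
print(result)
```

The second call does not repeat steps 0 through 5; it picks up the saved total and finishes the remaining steps.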

Page 31:

Remote Execution

› Condor might run your jobs on machines spread around the world – not all of them will have your files.

› Condor provides an adapter – a library – which converts your job’s I/O operations into remote I/O back to your home machine.

› No matter where your job runs, it sees the same environment.

Page 32:

Asynchronicity

› A fact of life in a system of 1000s of machines.
  - Power on/off
  - Lunch breaks
  - Jobs start and finish

› Condor never depends on a fixed configuration – work with what is available.
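Working with whatever is available can be sketched as a task queue that tolerates workers disappearing. This is a toy model, not Condor's scheduler: the `run_tasks` helper, the round-based schedule, and the worker names are hypothetical.

```python
from collections import deque


def run_tasks(tasks, schedule):
    """Process tasks with whichever workers each round offers.

    schedule is a list of rounds; each round is a list of
    (worker_name, survives) pairs. A worker that does not survive the
    round loses its task, which goes back on the queue.
    """
    pending = deque(tasks)
    done = []
    for round_workers in schedule:
        for worker, survives in round_workers:
            if not pending:
                break
            task = pending.popleft()
            if survives:
                done.append((task, worker))
            else:
                pending.append(task)  # worker vanished: retry later
    return done, list(pending)


# Hypothetical availability: the pool changes from round to round.
schedule = [
    [("w1", True), ("w2", False)],   # w2 is powered off mid-task
    [("w1", True)],                  # only one machine is around
    [("w3", True), ("w4", True)],    # two new machines appear
]
done, left = run_tasks(["t1", "t2", "t3", "t4"], schedule)
print(done, left)
```

No task is tied to a particular machine, so the task lost with `w2` is simply picked up later by a machine that did not exist when the run started.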

Page 33:

Does it work?

Page 34:

An example - NUG28

We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality.

The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM, and INFN all participated in the computation.

Page 35:

NUG30 Personal Condor …

For the run we will be flocking to

-- the main Condor pool at Wisconsin (600 processors)

-- the Condor pool at Georgia Tech (190 Linux boxes)

-- the Condor pool at UNM (40 processors)

-- the Condor pool at Columbia (16 processors)

-- the Condor pool at Northwestern (12 processors)

-- the Condor pool at NCSA (65 processors)

-- the Condor pool at INFN (200 processors)

We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.

We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

Page 36:

It works!!!

Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
From: Jeff Linderoth
To: Miron Livny
Subject: Re: Priority

This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…

Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth

Still rolling along. Over three billion nodes in about 1 day!

Page 37:

Up to a Point …

Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
From: Jeff Linderoth

Hi Gang,

The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

Page 38:

Back in Business

Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
From: Jeff Linderoth

Hi Gang,

We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

Page 39:

The First 600K seconds …

Page 40:

We made it!!!

Sender: [email protected]
Subject: Re: Let the festivities begin.

Hi dear Condor Team,

you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days!

More stats tomorrow!!! We are off celebrating!

condor rules!

cheers,

JP.
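As a rough check on the figures in this message, the average number of busy CPUs can be estimated by dividing the total compute time by the elapsed time. The sketch below assumes that "Condor Time" means accumulated CPU time and uses a 365-day year; the helper name is hypothetical.

```python
DAYS_PER_YEAR = 365  # approximation


def mean_concurrent_cpus(cpu_years: float, wall_days: float) -> float:
    """Average number of busy CPUs = total CPU time / elapsed wall time."""
    return cpu_years * DAYS_PER_YEAR / wall_days


# Figures quoted in the message: 10.9 years of Condor time in seven days.
avg = mean_concurrent_cpus(10.9, 7)
print(round(avg))
```

Under these assumptions the run kept several hundred processors busy on average, which is in line with the earlier messages reporting around 900 machines at peak.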

Page 41:

Do not be picky, be agile!!!