Greg Thain Computer Sciences Department University of Wisconsin-Madison Gthain @ cs.wisc.edu http://www.cs.wisc.edu/condor Interactive MPI on Demand


Greg Thain
Computer Sciences Department
University of Wisconsin-Madison

Gthain @ cs.wisc.edu
http://www.cs.wisc.edu/condor

Interactive MPI on Demand


Unix Tool Philosophy

› 1) Individual tools do one thing well

› 2) Communicate via ASCII streams

› 3) Are composable


The Paradox

› Universal assent that it’s good

› No one uses it (except for shell one-liners)

• grep '^abc' | sort | uniq -c | sort -n
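As a concrete instance of such a one-liner, the sketch below counts how often each line starting with `abc` occurs, least frequent first. The file names and sample input are invented for illustration.

```shell
# Hypothetical sample input, one record per line.
printf '%s\n' abc1 abc2 xyz abc1 abc1 abc2 > sample.txt

# Each tool does one job; plain text flows between them:
#   grep keeps matching lines, sort groups duplicates,
#   uniq -c counts each group, sort -n orders by count.
grep '^abc' sample.txt | sort | uniq -c | sort -n > counts.txt
cat counts.txt
```

Any stage can be swapped out or extended without touching the others, which is the composability the slide refers to.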


More than just shell scripts

Division into Unix processes provides:

› Restartability

› Better security

› Scalability across multiple cores


For example…

› qmail:
  • Secure, stable
  • Implemented across ~dozen processes


Getting back to Condor…

› Condor uses this in some places
  • x-GAHPs
  • condor_master
  • Replaceable shadow/starter pairs
  • Multi_shadow vs. many shadows

› But not everywhere
  • schedd


Condor Daemons as Components

› Very successful strategy:
  • Glide-in
  • Personal Condor
  • “Hoffman” and schedds as jobs
  • Condor-C


Case Study: MPI on Demand

› The problem:
  • Have a pool with lots of machines
  • Very long-running (weeks) vanilla jobs
  • Need to run big, but short, MPI jobs
  • Can’t reboot startds

› Need dedicated scheduler
  • Requires dedicated machines


Possible Solutions

› Add “suspension slot”
  • Requires reboot

› Submit MPI job normally
  • Preempts vanilla job


COD refresher

› COD: Computing On Demand
  • No scheduling
  • No file transfer
  • When COD runs, vanilla job suspends
    • “Checkpoint to swap”
  • Needs security on to work
  • Explicitly allowed


Startd as COD job

› Overview:

› Launch personal Condor

› Run startds as COD jobs on base pool
  • Report to personal Condor
  • Base jobs suspend

› Submit parallel job to personal Condor

› Remove COD startds
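The parallel job submitted to the personal Condor could be described roughly as follows. This is a minimal sketch, not the submit file used in the talk: the file name `mpi.sub`, the wrapper script `mp1script`, the program name `my_mpi_program`, and the machine count are all placeholders chosen for illustration.

```shell
# Write a hypothetical parallel-universe submit description.
# Every value below is a placeholder, not taken from the talk.
cat > mpi.sub <<'EOF'
universe      = parallel
executable    = mp1script
arguments     = my_mpi_program
machine_count = 3
output        = mpi.out.$(NODE)
error         = mpi.err.$(NODE)
log           = mpi.log
queue
EOF

# With CONDOR_CONFIG pointing at the personal Condor, submit it:
#   condor_submit mpi.sub
```

The machine count would normally match the number of startds launched under COD, since those are the only slots the personal Condor knows about.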


Startd under COD: Details

› Two condor_config files: careful!

› COD provides no file transfer
  • Can re-use existing startd binary
  • Need to pre-stage or NFS-mount the config file

› Don’t lose claimid!
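To make the second condor_config concrete, here is one guess at what the config read by the COD-launched startd might contain. The host name `personal.example.org` and the local file name are invented, and the knob values are assumptions rather than the configuration used in the talk; the dedicated-scheduler lines follow the usual pattern for machines that accept parallel-universe jobs.

```shell
# Generate a hypothetical config for the startd that runs under COD.
# personal.example.org stands in for the machine running the personal Condor.
cat > new_config <<'EOF'
CONDOR_HOST  = personal.example.org
LOCAL_DIR    = /tmp/p-condor
SPOOL        = $(LOCAL_DIR)/spool
LOG          = $(LOCAL_DIR)/log
EXECUTE      = $(LOCAL_DIR)/execute
DAEMON_LIST  = MASTER, STARTD
START        = True
DedicatedScheduler = "DedicatedScheduler@personal.example.org"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
EOF

# Copy new_config to wherever CONDOR_CONFIG will point
# (pre-staged on each host, or on a shared filesystem).
```

A working setup would likely also need the binary locations (for example `RELEASE_DIR`) and security settings carried over from the base pool's config. Keeping `LOCAL_DIR` distinct from the base startd's is what keeps the two instances from sharing log, spool, and execute directories.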


Example code

› HOSTS="a b c"

› for h in $HOSTS; do condor_cod request -name $h > claimid.$h; done

› for n in claimid.*; do condor_cod activate -id `cat $n` -jobad ja; done
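The saved claim IDs are also what makes the final "remove COD startds" step possible, presumably by releasing each claim. The sketch below assumes each `claimid.*` file holds a claim ID exactly as saved by the request loop; `release_claims` is a hypothetical helper name, not part of Condor.

```shell
# Release every COD claim recorded in claimid.* files in the given
# directory (default: current directory).
release_claims() {
    dir=${1:-.}
    for n in "$dir"/claimid.*; do
        [ -e "$n" ] || continue      # glob matched nothing
        if condor_cod release -id "$(cat "$n")"; then
            rm -f "$n"               # forget the claim only once released
        fi
    done
}
```

Calling `release_claims` from the directory where the request loop ran would release all three claims. Removing a file only after a successful release means a failed release can simply be retried.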


COD JOB_AD

› CMD = "/nfs/path/run-startd.sh"

› IWD = "/tmp"

› Out = "startd.out"

› Err = "startd.err"

› Universe = 5


run-startd.sh

› mkdir -p p-condor/{spool,log,execute}

› export CONDOR_CONFIG=/nfs/new_config

› exec /usr/sbin/condor_master -f -t


Summary

› Use Condor daemons as components

› Mix-and-match as needed


Questions?

› Thank You!