Greg Thain
Computer Sciences Department
University of Wisconsin-Madison
gthain@cs.wisc.edu
http://www.cs.wisc.edu/condor
Interactive MPI on Demand
Unix Tool Philosophy
› 1) Individual tools do one thing well
› 2) Communicate via ASCII streams
› 3) Are composable
The Paradox
› Universal assent that it’s good
› No one uses it (except for shell one-liners)
• grep ^abc | sort | uniq -c | sort -n
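As a small illustration of that composability, the one-liner can be wrapped in a function and fed a stream. This is only a sketch: the function name `count_prefixed` and the sample input are made up for the example.

```shell
# Hypothetical helper: read lines on stdin, keep those starting with the
# given prefix, and count duplicates, least frequent first.
count_prefixed() {
    prefix=$1
    grep "^$prefix" | sort | uniq -c | sort -n
}

# Made-up sample input: three lines start with "abc", two of them identical.
printf 'abc 1\nabd 2\nabc 1\nxyz 3\nabc 2\n' | count_prefixed abc
```

Each stage does one job and only sees a text stream, so any stage can be swapped or extended without touching the others.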
More than just shell scripts
Division into Unix processes provides:
› Restartability
› Better security
› Scalability across multiple cores
For example…
› Qmail: secure and stable
• Implemented across roughly a dozen processes
Getting back to Condor…
› Condor uses this in some places
• x-GAHPs
• condor_master
• Replaceable shadow/starter pairs
• Multi_shadow vs. many shadows
› But not everywhere
• schedd
Condor Daemons as Components
› Very successful strategy:
• Glide-in
• Personal Condor
• “Hoffman” and schedds as jobs
• Condor-C
Case Study: MPI on Demand
› The problem:
• Have a pool with lots of machines
• Very long-running (weeks) vanilla jobs
• Need to run big but short MPI jobs
• Can’t reboot startds
› Need dedicated scheduler
• Requires dedicated machines
Possible Solutions
› Add “suspension slot”
• Requires reboot
› Submit MPI job normally
• Preempts vanilla job
COD refresher
› COD: Computing On Demand
• No scheduling
• No file transfer
• When COD runs, vanilla job suspends (“checkpoint to swap”)
• Needs security on to work
• Explicitly allowed
Startd as COD job
Overview:
› Launch personal Condor
› Run startds as COD jobs on base pool
• Report to personal Condor
• Base jobs suspend
› Submit parallel job to personal Condor
› Remove COD startds
Startd under COD: Details
› Two condor_config files: careful!
› COD provides no file transfer
• Can re-use existing startd binary
• Need to pre-stage or NFS config_file
› Don’t lose claimid!
Example code
› HOSTS="a b c"
› for h in $HOSTS; do condor_cod request -name "$h" > "claimid.$h"; done
› for n in claimid.*; do condor_cod activate -id "$(cat "$n")" -jobad ja; done
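The last overview step, removing the COD startds, is not shown on the slides. A minimal sketch, assuming each `claimid.<host>` file holds exactly one claim id and using `condor_cod release`; the function name `release_claims` and the `COD` override variable are hypothetical, introduced only for this example:

```shell
# COD is the command to invoke; setting COD=echo gives a dry run that
# prints the arguments instead of contacting any startd (assumption).
COD=${COD:-condor_cod}

# Release every claim whose id was saved in a claimid.* file in the
# current directory. The files are left in place afterwards.
release_claims() {
    for f in claimid.*; do
        [ -e "$f" ] || continue      # glob matched nothing: skip
        "$COD" release -id "$(cat "$f")"
    done
}
```

Running `COD=echo; release_claims` first shows which claims would be released. Because the saved ids are the only handle on the claims, this is also why losing them is costly.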
Cod JOB_AD
› CMD = "/nfs/path/run-startd.sh"
› IWD = "/tmp"
› Out = "startd.out"
› Err = "startd.err"
› Universe = 5
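The activate step reads this ad from a file (named `ja` in the example code). One way to produce that file is sketched below. The attribute names and values are copied from the slide as-is and should be checked against the COD documentation for the Condor version in use; the function name `write_jobad` is hypothetical.

```shell
# Write the COD job ad to the given file (default: "ja", the name passed
# to -jobad in the example code). Contents are taken from the slide.
write_jobad() {
    cat > "${1:-ja}" <<'EOF'
CMD = "/nfs/path/run-startd.sh"
IWD = "/tmp"
Out = "startd.out"
Err = "startd.err"
Universe = 5
EOF
}
```

Calling `write_jobad` with no argument creates `ja` in the current directory.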
Run-startd.sh
› mkdir -p p-condor/{spool,log,execute}
› export CONDOR_CONFIG=/nfs/new_config
› exec /usr/sbin/condor_master -f -t
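The slides do not show what `/nfs/new_config` contains. The sketch below writes a partial configuration for the COD-launched startd, under stated assumptions: the directories match the `p-condor` tree created under the job's `/tmp` working directory, the central manager host is passed in by the caller, and the function name `write_startd_config` is hypothetical. A real deployment would likely need further settings (security, release directory) that are not guessed at here.

```shell
# Sketch only: every value below is an assumption, not taken from the slides.
write_startd_config() {
    out=$1      # where to write the config, e.g. /nfs/new_config
    cm_host=$2  # hypothetical: host where the personal Condor's collector runs
    cat > "$out" <<EOF
CONDOR_HOST = $cm_host
DAEMON_LIST = MASTER, STARTD
LOCAL_DIR = /tmp/p-condor
LOG = \$(LOCAL_DIR)/log
SPOOL = \$(LOCAL_DIR)/spool
EXECUTE = \$(LOCAL_DIR)/execute
EOF
}
```

Keeping this file separate from the base pool's configuration is what lets the new startd report to the personal Condor instead of the base pool, which is the "two condor_config files" caveat from the details slide.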
Summary
› Use Condor daemons as components
› Mix-and-match as needed
Questions?
› Thank You!