An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded...
An introduction to checkpointing for scientific applications
[email protected]/CISM
November 2016
CISM/CÉCI training session
What is checkpointing?

Without checkpointing:
$ ./count
1
2
3
^C
$ ./count
1
2
3

With checkpointing:
$ ./count
1
2
3
^C
$ ./count
4
5
6

Checkpointing:
'saving' a computation so that it can be resumed later
(rather than started again)
Today's agenda:
1. General concepts and scientific software
2. Working with Signals
3. Slurm recipes
4. DMTCP
Why do we need checkpointing?
Imagine a text editor without 'checkpointing' ...
The idea:
Save the program state (values in variables, open files, ...)
every time a checkpoint is encountered (position in the code, signal or event, ...)
and restart from there upon (un)planned stop
rather than bootstrap again from scratch (starting loops at iteration 0, creating tmp files, ...)
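A minimal shell sketch of the idea, assuming the `count` demo simply prints increasing integers. The function `count_until` and the file `count.state` are hypothetical names chosen for illustration, not the code used in the training session. The interruption is simulated by stopping the first run at 3.

```shell
#!/bin/bash
# Hypothetical checkpoint file holding the last value printed.
STATE_FILE="${STATE_FILE:-./count.state}"

# Count up to the given limit, resuming from the checkpoint if one exists.
count_until() {
    local limit=$1
    local i=0
    # Restart step: restore the saved state instead of starting at 0.
    if [ -f "$STATE_FILE" ]; then
        i=$(cat "$STATE_FILE")
    fi
    while [ "$i" -lt "$limit" ]; do
        i=$((i + 1))
        echo "$i"
        # Checkpoint step: write to a temporary file, then rename it,
        # so an interruption never leaves a half-written checkpoint.
        echo "$i" > "$STATE_FILE.tmp"
        mv "$STATE_FILE.tmp" "$STATE_FILE"
    done
}

rm -f "$STATE_FILE"   # start the demo from a clean slate
count_until 3         # first run, "stopped" after 3
count_until 6         # second run resumes from the checkpoint: 4, 5, 6
```

Run as is, the sketch prints 1 to 3 and then 4 to 6, as in the 'with checkpointing' terminal. Deleting `count.state` between the two runs gives the 'without checkpointing' behaviour.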
Goals of checkpointing in HPC:
1. Fit in time constraints
2. Debugging, monitoring
3. Cope with hardware failures
4. Job preemption
1. Fit in time constraints
All clusters limit the maximum 'wall' time of jobs
to allow high job turnover
to ensure fair time sharing of the cluster
(and reduce waiting times...)
2. Debugging, monitoring
Checkpointing means saving the state on disk
-> You can view the state while the job is running
-> You can restart at the checkpoint before a bug occurred
3. Cope with hardware failures
4. Job preemption
Preemption is the ability of a high-priority job to re-queue a low-priority job.
It is not used at CÉCI.
1. Many scientific software packages save their state after N iterations.
Working with checkpoint-restart-able software
Many scientific software packages have built-in checkpointing capabilities
(although it might not be called that way)
Check the documentation
Evaluate the options: tradeoff between
I/O overhead
portability
ease of use
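To make the I/O overhead side of the tradeoff concrete, here is a small shell sketch that saves its state only every `EVERY` iterations. It assumes a loop whose whole state is the iteration number; `run_iterations`, `EVERY` and `iter.ckpt` are hypothetical names. Fewer checkpoints mean less I/O, but up to `EVERY - 1` iterations have to be redone after a stop.

```shell
#!/bin/bash
CKPT="${CKPT:-./iter.ckpt}"   # hypothetical checkpoint file
EVERY=4                       # save the state every 4 iterations

# Iterate up to the given iteration number, resuming from the checkpoint.
run_iterations() {
    local stop_at=$1
    local i=0
    if [ -f "$CKPT" ]; then
        i=$(cat "$CKPT")
    fi
    echo "starting at iteration $i"
    while [ "$i" -lt "$stop_at" ]; do
        i=$((i + 1))
        # ... one iteration of real work would go here ...
        if [ $((i % EVERY)) -eq 0 ]; then
            echo "$i" > "$CKPT.tmp"
            mv "$CKPT.tmp" "$CKPT"
        fi
    done
}

rm -f "$CKPT"
run_iterations 10   # stops at 10, but the last checkpoint was taken at 8
run_iterations 12   # resumes at 8: iterations 9 and 10 are computed again
```

The second call reports that it starts at iteration 8, not 10: that lost work is the price paid for writing the state four times less often.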
http://www.gaussian.com/g_blog/faq2.htm
http://cfd.direct/openfoam/user-guide/controlDict/
https://www.cp2k.org/restarting
https://www.molpro.net/info/2015.1/doc/quickstart/node65.html
http://www.cfs.dl.ac.uk/docs/html/part3/node6.html
http://cms.mpi.univie.ac.at/vasp/vasp/ISTART_tag.html
http://lammps.sandia.gov/doc/restart.html
http://www.gromacs.org/Documentation/How-tos/Doing_Restarts
http://www.abinit.org/doc/helpfiles/for-v7.10/input_variables/varrlx.html#restartxf
So you can play
On lemaitre2: ~dfr/checkpoint.tgz
2. Using UNIX signals to reduce overhead: do not save the state at each iteration, wait for the signal.
UNIX processes can receive 'signals' from the user, the OS, or another process
^C, ^Z, ^D
fg, bg
kill -9, kill
UNIX processes can receive 'signals' with an associated default action
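A hedged sketch of signal-driven checkpointing in shell: the `trap` builtin replaces the default action of a signal with a handler, and the state is written only when the signal arrives. The handler name `save_state`, the file `signal.ckpt` and the choice of SIGUSR1 are assumptions made for this example. For demonstration purposes the script sends the signal to itself at iteration 5; in practice it would come from the user or the scheduler.

```shell
#!/bin/bash
CKPT="${CKPT:-./signal.ckpt}"   # hypothetical checkpoint file
i=0
stop=0

# Handler: save the state only when asked to, then request a clean stop.
save_state() {
    echo "$i" > "$CKPT"
    echo "signal caught: state saved at iteration $i"
    stop=1
}

# Replace the default action of SIGUSR1 (terminate the process) with the handler.
trap save_state USR1

while [ "$stop" -eq 0 ] && [ "$i" -lt 1000 ]; do
    i=$((i + 1))
    # ... one iteration of real work, without any checkpoint I/O ...

    # Demo only: the script signals itself once, at iteration 5.
    if [ "$i" -eq 5 ]; then
        kill -USR1 "${BASHPID:-$$}"
    fi
done
```

The shell runs a trap handler only between commands, so a long-running foreground command delays it. SIGKILL (`kill -9`) cannot be trapped at all, which is why it is of no use for checkpointing.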
3. Use Slurm signaling abilities to manage checkpoint-able software in Slurm scripts on the clusters.
scancel is used to send signals to jobs
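As an illustration, a small wrapper around `scancel` with its `--signal` option. The function name `signal_job` and the job ID in the comment are hypothetical; only `scancel --signal=<name> <jobid>` comes from Slurm itself.

```shell
#!/bin/bash
# signal_job JOBID [SIGNAL]: ask Slurm to deliver SIGNAL (default USR1) to a job.
# Without --signal, scancel would cancel the job instead.
signal_job() {
    local jobid=$1
    local sig=${2:-USR1}
    scancel --signal="$sig" "$jobid"
}

# Example, to be run on a cluster login node (not executed here):
#   signal_job 123456 INT
```

By default such signals go to the job steps, not to the batch script itself, so the program should be started with `srun` for the signal to reach it.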
--signal to have Slurm send signals automatically before the end of the allocation
Example: send SIGINT 60 seconds before job is killed (so, here, after 2 minutes)
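A possible submission script matching this example, written to a file by the snippet below so that it can be inspected before submission. The file name `count.job` and the job name are assumptions; the 3-minute time limit is deduced from the 'after 2 minutes' remark.

```shell
#!/bin/bash
# Write a hypothetical job script; submit it afterwards with: sbatch count.job
cat > count.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=count
#SBATCH --time=00:03:00
#SBATCH --signal=INT@60

# 3-minute allocation, SIGINT about 60 seconds before the limit.
# The signal is sent to the job steps, hence the srun.
srun ./count
EOF
echo "wrote count.job"
```

Slurm's event handling is coarse: according to the `sbatch` documentation, the signal may arrive up to 60 seconds earlier than requested, so the program should not rely on exact timing.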
If your program exits with a non-zero exit code in case of interruption, you can have your job re-queued automatically
Note the --open-mode=append
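One way to write such a script, as a sketch: it assumes `./count` returns zero when it has finished and non-zero when it was interrupted after saving its state. The file name `requeue.job` is hypothetical.

```shell
#!/bin/bash
# Write a hypothetical self-requeueing job script; submit with: sbatch requeue.job
cat > requeue.job <<'EOF'
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --signal=INT@60
#SBATCH --open-mode=append

# If the program was interrupted (non-zero exit code), put the job back
# in the queue; it restarts later from its own checkpoint.
srun ./count || scontrol requeue "$SLURM_JOB_ID"
EOF
echo "wrote requeue.job"
```

With `--open-mode=append`, each new run of the requeued job adds to the existing output file instead of truncating it. Note that a program failing for any other reason would also be requeued, so distinct exit codes for 'interrupted' and 'error' are worth considering.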
Or chain the jobs...
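Chaining can be sketched with job dependencies: each job is submitted with `--dependency=afterany:<previous job>`, so it starts only once the previous one has ended. The function `submit_chain` and its arguments are hypothetical; the original slide may well use a different dependency type.

```shell
#!/bin/bash
# submit_chain SCRIPT N: submit SCRIPT N times, each job waiting for the previous one.
submit_chain() {
    local script=$1
    local links=$2
    local n=1
    local jobid
    # --parsable makes sbatch print the job ID (optionally followed by ";cluster").
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "$jobid"
    while [ "$n" -lt "$links" ]; do
        jobid=$(sbatch --parsable --dependency=afterany:"$jobid" "$script") || return 1
        jobid=${jobid%%;*}
        echo "$jobid"
        n=$((n + 1))
    done
}

# Example, to be run on a cluster login node (not executed here):
#   submit_chain count.job 3
```

Every job in the chain runs the same script, so the program itself must detect an existing checkpoint and resume from it, and exit quickly when the computation is already complete.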
Using a signal-based watchdog to re-queue the job just before it is killed
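A sketch of such a watchdog, under the assumption that `./count` saves its state by itself at regular intervals. Here the signal is sent to the batch shell only (the `B:` prefix of `--signal`), which traps it and requeues the job. The names `watchdog.job` and `requeue_job` are hypothetical.

```shell
#!/bin/bash
# Write a hypothetical watchdog job script; submit with: sbatch watchdog.job
cat > watchdog.job <<'EOF'
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --signal=B:USR1@60
#SBATCH --open-mode=append

# Watchdog: runs in the batch shell when Slurm delivers SIGUSR1.
requeue_job() {
    echo "time limit approaching: requeueing job $SLURM_JOB_ID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap requeue_job USR1

# Run the program in the background and wait for it: bash only runs a
# trap handler once the current foreground command has returned.
srun ./count &
wait
EOF
echo "wrote watchdog.job"
```

Compared with testing the exit code, this variant does not require the program to handle signals at all; it only needs a recent checkpoint to exist when the job is requeued.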
4. Making non-restartable software restartable with DMTCP
Advertised Features
● Distributed Multi-Threaded CheckPointing
● Works with Linux Kernel 2.6.9 and later
● Supports sequential and multi-threaded computations across single/multiple hosts
● Entirely in user space (no kernel modules or root privilege)
● Transparent (no recompiling, no re-linking)
● Written at Northeastern U. and MIT and under active development for 5+ years
● LGPL'd and freely available
● No remote I/O
● Supports threads, mutexes/semaphores, forks, shared memory, exec, and many more

From their FAQ:
“What types of programs can DMTCP checkpoint?
It checkpoints most binary programs on most Linux distributions. Some examples on which users have verified that DMTCP works are: Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope, Open MPI, MPICH-2, OpenMP, and Cilk. See Supported Applications for further details. Our goal is to support DMTCP for all vanilla programs. If DMTCP does not work correctly on your program, then this is a bug in DMTCP. We would be appreciative if you can then file a bug report with DMTCP.”
Imagine a non-checkpointable program
Run with dmtcp_launch (runs monitoring daemon if necessary)
Restart with dmtcp_restart_script.sh
Launch the coordinator and the program with automatic checkpointing every 30 seconds
Launch coordinator and restart program
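The two steps could look as follows in shell. This is a sketch, not the commands shown in the session: the function names are hypothetical, and it assumes the DMTCP tools are in the `PATH`, that the default coordinator port is used, and that `dmtcp_restart_script.sh` was generated in the current directory by an earlier checkpoint.

```shell
#!/bin/bash
# Start a coordinator in the background that requests a checkpoint every
# 30 seconds and exits when the last program disconnects.
start_coordinator() {
    dmtcp_coordinator --daemon --exit-on-last --interval 30
}

# First run: launch the (non-checkpointable) program under DMTCP control.
start_with_dmtcp() {
    start_coordinator
    dmtcp_launch "$@"
}

# Later runs: restart from the latest checkpoint using the generated script.
restart_with_dmtcp() {
    local script=${1:-./dmtcp_restart_script.sh}
    start_coordinator
    "$script"
}

# Examples (not executed here):
#   start_with_dmtcp ./count
#   restart_with_dmtcp
```

In a submission script, a natural pattern is to call `restart_with_dmtcp` when `dmtcp_restart_script.sh` exists and `start_with_dmtcp` otherwise, so that the same script serves for the first run and for every requeue.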
https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_rstr.job
Never click 'Discard' again...
The submission script(s)
● Either one big one or two small ones
● Checkpoint periodically or --signal
● Requeue automatically
● Open-mode=append