Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-019 / published: 2011-11-15)
Chapter 7. Checkpoint/Restart
MPT 2.02 (or later) supports application checkpoint/restart by using the Berkeley
Lab Checkpoint/Restart (BLCR) implementation. This allows applications
to periodically save a copy of their state. They can then later resume
from that point in time if the application crashes or the job is aborted
to free up resources for higher priority jobs.
There are some important limitations to keep in mind, as follows:
BLCR does not checkpoint the state of any data files that the application may be using.
Certain MPI features, including spawning and one-sided MPI, are not supported when using checkpoint/restart (CPR).
InfiniBand XRC queue pairs are not supported.
Checkpoint files are often very large and require significant disk bandwidth to create in a timely manner.
For more information on BLCR, see https://ftg.lbl.gov/projects/CheckpointRestart.
To use checkpoint/restart
with MPT, BLCR must first be installed. This requires installing the
blcr-, blcr-libs-, and blcr-kmp-
RPMs. BLCR must then be enabled by root, as follows:
BLCR uses a kernel module which must be built against the specific
kernel that the operating system is running. In the case that the kernel
module fails to load, it must be rebuilt and installed. Install the
blcr- SRPM. In the blcr.spec file, set the
kernel variable to the name of the current kernel, then rebuild and install
the new set of RPMs.
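Before rebuilding, it can help to confirm whether the kernel module is actually loaded. The following is a minimal sketch, not part of MPT or BLCR: the function names are hypothetical, and the module name blcr is an assumption that may differ on a given installation.

```shell
#!/bin/sh
# Return 0 if a module named "blcr" appears in the given module list.
# The list defaults to /proc/modules; the module name "blcr" is an assumption.
blcr_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q '^blcr ' "$modules_file"
}

# Print a short status line that includes the running kernel release.
blcr_status() {
    if blcr_loaded "$@"; then
        echo "blcr module loaded for kernel $(uname -r)"
    else
        echo "blcr module not loaded for kernel $(uname -r); a rebuild may be needed"
    fi
}
```

Calling blcr_status with no arguments inspects the running system. Passing a file name instead allows the check to be tried against a saved copy of a module list.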
To enable checkpoint/restart within MPT, mpirun or mpiexec_mpt
must be passed the -cpr option, for example:
    % mpirun -cpr hostA, hostB -np 8 ./a.out
To checkpoint a job, use the mpt_checkpoint command
on the same host where mpirun is running.
mpt_checkpoint needs to be passed the PID of mpirun
and a name with which you want to prefix all the checkpoint
files. For example:
    % mpt_checkpoint -p 12345 -f my_checkpoint
This will create a my_checkpoint.cps meta-data
file and a number of my_checkpoint.*.cpd files.
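Because checkpoints are meant to be taken periodically, the command above could be wrapped in a small loop. The sketch below is one possible approach, not something shipped with MPT: the function names, the numbered-prefix scheme, and the optional checkpoint limit are all hypothetical. Only the -p and -f options shown above are used.

```shell
#!/bin/sh
# Build the prefix for the Nth checkpoint, e.g. "my_checkpoint_3".
# Numbering the prefixes is an assumption made to keep each set of files distinct.
checkpoint_prefix() {
    echo "${1}_${2}"
}

# Checkpoint the mpirun process $1 every $2 seconds using base prefix $3.
# Stops when the process exits or, if $4 is given, after $4 checkpoints.
# Prints the number of checkpoints taken.
periodic_checkpoint() {
    pid=$1 interval=$2 base=$3 max=${4:-0}
    n=0
    # kill -0 only tests whether the process still exists; it sends no signal.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        kill -0 "$pid" 2>/dev/null || break
        n=$((n + 1))
        mpt_checkpoint -p "$pid" -f "$(checkpoint_prefix "$base" "$n")" || return 1
        [ "$max" -gt 0 ] && [ "$n" -ge "$max" ] && break
    done
    echo "$n"
}
```

For example, periodic_checkpoint 12345 3600 my_checkpoint would attempt one checkpoint per hour until the mpirun process with PID 12345 exits. Since checkpoint files are often very large, the interval and the number of retained sets should be chosen with disk space and bandwidth in mind.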
To restart the job, pass the name of the .cps
file to mpirun, for example:
    % mpirun -restart my_checkpoint.cps hostC, hostD -np 8 ./a.out
The job may be restarted on a different set of hosts but there must
be the same number of hosts and each host must have the same number of
ranks as the corresponding host in the original run of the job.
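The layout rule above can be checked before attempting a restart. The helper below is a hypothetical sketch and not an MPT tool. It assumes each layout is written as a space-separated list of ranks per host, in host order, so the example runs above would both be described as "8 8".

```shell
#!/bin/sh
# Compare two layouts given as space-separated ranks-per-host lists,
# e.g. "8 8" for two hosts with 8 ranks each. Host names are ignored,
# since only the host count and the per-host rank counts must match.
same_layout() {
    # Unquoted expansion inside echo collapses repeated whitespace, so
    # "8  8" and "8 8" compare equal; order and length are still significant.
    [ "$(echo $1)" = "$(echo $2)" ]
}
```

With this helper, same_layout "8 8" "8 8" succeeds, while same_layout "8 8" "8 4" and same_layout "8 8" "8 8 8" fail, reflecting a mismatched rank count and a mismatched host count respectively.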