Launch a run
scripts/TLCC/scr_run.in
Prepares a resource allocation for SCR, launches a run, re-launches on failures, and scavenges and rebuilds files for most recent checkpoint if needed. Updates SCR index file in prefix directory to account for last checkpoint.
- Interprets $SCR_ENABLE, calls srun and bails if set to 0.
- Interprets $SCR_DEBUG, enables verbosity if set > 0.
- Invokes scr_test_runtime to check that runtime dependencies are available.
- Invokes “scr_env --jobid” to get jobid of current job.
- Interprets $SCR_NODELIST to determine set of nodes job is using, sets and exports $SCR_NODELIST to value returned by “scr_env --nodes” if not set.
- Invokes $scr_prefix to get prefix directory on parallel file system.
- Interprets $SCR_WATCHDOG.
- Invokes scr_glob_hosts to check that this command is running on a node in the nodeset, bails with error if not.
- Invokes scr_list_dir to get control directory.
- Issues a NOP srun command on all nodes to force each node to run SLURM prologue to delete old files from cache.
- Invokes scr_prerun to prepare nodes for SCR run.
- If $SCR_FLUSH_ASYNC == 1, invokes scr_glob_hosts to get count of number of nodes and invokes srun to launch an scr_transfer process on each node.
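As an illustration of how the first two environment checks might be written, here is a minimal POSIX shell sketch. The function names scr_enabled and scr_verbose are hypothetical and do not come from the SCR scripts. The sketch assumes that an unset $SCR_ENABLE means SCR stays enabled, and that a non-numeric $SCR_DEBUG is treated as "not verbose".

```sh
# Hypothetical helpers for illustration; not taken from the SCR scripts.

# Succeeds (exit 0) unless SCR_ENABLE is explicitly set to 0.
# Assumption: an unset SCR_ENABLE leaves SCR enabled.
scr_enabled() {
    [ "${SCR_ENABLE:-1}" != "0" ]
}

# Succeeds when SCR_DEBUG holds an integer greater than 0.
# Assumption: non-numeric values are treated as "not verbose".
scr_verbose() {
    case "${SCR_DEBUG:-0}" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "${SCR_DEBUG:-0}" -gt 0 ]
}

# Possible use at the top of a launch script:
#   if ! scr_enabled; then
#       exec srun "$@"      # run the job directly, without SCR
#   fi
#   if scr_verbose; then set -x; fi
```

Keeping each check in a small function makes the default behavior explicit and easy to change if the real scripts treat unset values differently.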
ENTER LOOP
- Invokes scr_list_down_nodes to determine list of bad nodes. If any node has been previously marked down, force it to continue to be marked down. We do this to avoid re-running on “bad” nodes, the logic being that if a node was identified as being bad in this resource allocation once already, there is a good chance that it is still bad (even if it currently seems to be healthy), so avoid it.
- Invokes scr_list_down_nodes to print reason for down nodes, if any.
- Count the number of nodes that the application needs. Invokes scr_glob_hosts to count number of nodes in $SCR_NODELIST, which lists all nodes in allocation. Interprets $SCR_MIN_NODES to use that value if set, otherwise invokes “scr_env --runnodes” to get number of nodes used in last run.
- Invokes scr_glob_hosts to count number of nodes left in the allocation.
- If number of nodes left is smaller than number needed, break loop.
- Invokes scr_glob_hosts to ensure node running scr_srun script is not listed as a down node, if it is, break loop.
- Build list of nodes to be excluded from run.
- Optionally log start of run.
- Invokes srun including node where the scr_srun command is running and excluding down nodes.
- If watchdog is enabled, record pid of srun, invokes sleep 10 so job shows up in squeue, invokes scr_get_jobstep_id to get SLURM jobstep id from pid, invokes scr_watchdog and records pid of watchdog process.
- Invokes scr_list_down_nodes to get list of down nodes.
- Optionally log end of run (and down nodes and reason those nodes are down).
- If number of attempted runs is >= number of allowed retries, break loop.
- Invokes scr_retries_halt and breaks loop if halt condition is detected.
- Invokes “sleep 60” to give nodes in allocation a chance to clean up.
- Invokes scr_retries_halt and breaks loop if halt condition is detected. We do this a second time in case a command to halt came in while we were sleeping.
- Loop back.
EXIT LOOP
- If $SCR_FLUSH_ASYNC == 1, invokes “scr_halt --immediate” to kill scr_transfer processes on each node.
- Invokes scr_postrun to scavenge most recent checkpoint.
- Invokes kill to kill watchdog process if it is running.
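The control flow between ENTER LOOP and EXIT LOOP can be sketched roughly as follows. This is a simplified illustration, not the script's actual code. The hooks count_healthy_nodes, launch_job, and halt_requested are hypothetical stand-ins for the scr_list_down_nodes / scr_glob_hosts bookkeeping, the srun launch, and scr_retries_halt. RETRY_SLEEP is likewise an assumed variable, defaulting to the 60-second pause described above. Logging, the watchdog, and the check on the launching node are omitted.

```sh
# Hypothetical sketch of the relaunch loop; the caller must define three hooks:
#   count_healthy_nodes  prints the number of usable nodes left in the allocation
#   launch_job           launches one run (stand-in for the srun invocation)
#   halt_requested       succeeds when a halt condition is detected

# run_with_retries MAX_RUNS NODES_NEEDED
# Prints the number of runs that were attempted.
run_with_retries() {
    max_runs=$1
    needed=$2
    attempts=0
    while :; do
        left=$(count_healthy_nodes)
        [ "$left" -lt "$needed" ] && break        # too few nodes remain
        launch_job || true                        # a failed run may be retried
        attempts=$((attempts + 1))
        [ "$attempts" -ge "$max_runs" ] && break  # retry budget exhausted
        halt_requested && break
        sleep "${RETRY_SLEEP:-60}"                # give nodes a chance to clean up
        halt_requested && break                   # halt may arrive during the sleep
    done
    echo "$attempts"
}
```

The second halt_requested call mirrors the double check in the list above: a halt command issued during the sleep would otherwise go unnoticed until after another run had been launched.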
scripts/common/scr_test_runtime.in
Checks that various runtime dependencies are available.
- Checks for pdsh command,
- Checks for dshbak command,
- Checks for Date::Manip perl module.
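A dependency check of this kind could look something like the sketch below. The function names have_commands and have_perl_module are hypothetical, and the real script may detect and report missing dependencies differently. The sketch relies only on the standard command -v lookup and on perl's -M switch for loading a module.

```sh
# Hypothetical helpers for illustration; not taken from scr_test_runtime.

# have_commands CMD...
# Returns 0 if every named command is found on PATH, 1 otherwise.
have_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# have_perl_module MODULE
# Returns 0 if perl can load the named module, nonzero otherwise.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Possible use:
#   have_commands pdsh dshbak && have_perl_module Date::Manip
```

Checking every command before returning, rather than stopping at the first failure, reports all missing dependencies in a single pass.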
scripts/common/scr_prerun.in
Executes commands to prepare an allocation for SCR.
- Interprets $SCR_ENABLE, calls srun and bails if set to 0.
- Interprets $SCR_DEBUG, enables verbosity if set > 0.
- Invokes scr_test_runtime to check for necessary run time dependencies.
- Invokes mkdir to create .scr subdirectory in prefix directory.
- Invokes rm -f to remove flush and nodes files from prefix directory.
- Returns 0 if allocation is ready, 1 otherwise.
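The directory preparation steps might be sketched as below. The function name prepare_prefix is hypothetical. The names of the flush and nodes files are not given here, so the sketch takes the stale file names as arguments instead of guessing them. Using mkdir -p, so that an existing .scr directory is not an error, is also an assumption.

```sh
# Hypothetical helper for illustration; not taken from scr_prerun.

# prepare_prefix PREFIX [STALE_FILE...]
# Creates PREFIX/.scr and removes the named stale files from PREFIX.
# Returns 0 if the directory is ready, 1 otherwise.
prepare_prefix() {
    prefix=$1
    shift
    mkdir -p "$prefix/.scr" || return 1   # assumption: -p tolerates an existing directory
    for f in "$@"; do
        rm -f "$prefix/$f" || return 1
    done
    return 0
}
```

The return values follow the convention stated in the list: 0 when the allocation is ready, 1 otherwise.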
src/scr_retries_halt.c
Reads halt file and returns exit code depending on whether the run should be halted or not.