Launch a run
scripts/TLCC/scr_run.in
Prepares a resource allocation for SCR, launches a run, re-launches on failures, and scavenges and rebuilds files for most recent checkpoint if needed. Updates SCR index file in prefix directory to account for last checkpoint.
- Interprets $SCR_ENABLE, calls srun and bails if set to 0.
- Interprets $SCR_DEBUG, enables verbosity if set \(>\) 0.
- Invokes scr_test_runtime to check that runtime dependencies are available.
- Invokes “scr_env --jobid” to get jobid of current job.
- Interprets $SCR_NODELIST to determine set of nodes job is using, sets and exports $SCR_NODELIST to value returned by “scr_env --nodes” if not set.
- Invokes $scr_prefix to get prefix directory on parallel file system.
- Interprets $SCR_WATCHDOG.
- Invokes scr_glob_hosts to check that this command is running on a node in the nodeset, bails with error if not.
- Invokes scr_list_dir to get control directory.
- Issues a NOP srun command on all nodes to force each node to run SLURM prologue to delete old files from cache.
- Invokes scr_prerun to prepare nodes for SCR run.
- If $SCR_FLUSH_ASYNC == 1, invokes scr_glob_hosts to get the number of nodes and invokes srun to launch an scr_transfer process on each node.
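The first few setup steps above amount to reading environment variables and falling back to scr_env when a value is missing. The following is a minimal sketch of that logic, not the script's actual code. The function names are hypothetical, and treating an unset $SCR_ENABLE as "enabled" is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: mirrors the environment handling in the setup list.
# Function names are hypothetical and do not come from the SCR sources.

# Returns 0 (true) unless $SCR_ENABLE is exactly 0.
# Assumption: an unset or empty $SCR_ENABLE means SCR is enabled.
scr_is_enabled() {
    [ "${SCR_ENABLE:-1}" != "0" ]
}

# Returns 0 (true) when $SCR_DEBUG is an integer greater than 0.
scr_is_verbose() {
    case "${SCR_DEBUG:-0}" in
        *[!0-9]*) return 1 ;;   # reject anything that is not a plain integer
    esac
    [ "${SCR_DEBUG:-0}" -gt 0 ]
}

# Prints the nodeset to use: $SCR_NODELIST if it is set, otherwise
# whatever "scr_env --nodes" reports.
scr_resolve_nodelist() {
    if [ -n "${SCR_NODELIST:-}" ]; then
        printf '%s\n' "$SCR_NODELIST"
    else
        scr_env --nodes
    fi
}
```

A caller could then assign the output of scr_resolve_nodelist to SCR_NODELIST and export it, which matches the "sets and exports" step in the list.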
ENTER LOOP
- Invokes scr_list_down_nodes to determine list of bad nodes. If any node has been previously marked down, force it to continue to be marked down. We do this to avoid re-running on “bad” nodes, the logic being that if a node was identified as being bad in this resource allocation once already, there is a good chance that it is still bad (even if it currently seems to be healthy), so avoid it.
- Invokes scr_list_down_nodes to print reason for down nodes, if any.
- Count the number of nodes that the application needs. Invokes scr_glob_hosts to count number of nodes in $SCR_NODELIST, which lists all nodes in allocation. Interprets $SCR_MIN_NODES to use that value if set, otherwise invokes “scr_env --runnodes” to get number of nodes used in last run.
- Invokes scr_glob_hosts to count number of nodes left in the allocation.
- If number of nodes left is smaller than number needed, break loop.
- Invokes scr_glob_hosts to ensure node running scr_srun script is not listed as a down node, if it is, break loop.
- Build list of nodes to be excluded from run.
- Optionally log start of run.
- Invokes srun including node where the scr_srun command is running and excluding down nodes.
- If watchdog is enabled, record pid of srun, invokes “sleep 10” so job shows up in squeue, invokes scr_get_jobstep_id to get SLURM jobstep id from pid, invokes scr_watchdog and records pid of watchdog process.
- Invokes scr_list_down_nodes to get list of down nodes.
- Optionally log end of run (and down nodes and reason those nodes are down).
- If number of attempted runs is \(\geq\) the number of allowed retries, break loop.
- Invokes scr_retries_halt and breaks loop if halt condition is detected.
- Invokes “sleep 60” to give nodes in allocation a chance to clean up.
- Invokes scr_retries_halt and breaks loop if halt condition is detected. We do this a second time in case a command to halt came in while we were sleeping.
- Loop back.
EXIT LOOP
- If $SCR_FLUSH_ASYNC == 1, invokes “scr_halt --immediate” to kill scr_transfer processes on each node.
- Invokes scr_postrun to scavenge most recent checkpoint.
- Invokes kill to kill watchdog process if it is running.
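The control flow of the relaunch loop described above can be sketched as follows. This is an illustration under stated assumptions, not the script itself. The hooks launch_once, nodes_left, and halt_requested are hypothetical stand-ins for the srun launch, the node accounting, and scr_retries_halt. The convention that halt_requested returns 0 when the run should stop is assumed, and RETRY_SLEEP is a hypothetical knob standing in for the fixed 60-second pause. Logging, the watchdog, and the post-loop cleanup are omitted.

```shell
#!/bin/sh
# Sketch of the relaunch loop's control flow only.
# Hypothetical hooks the caller must define:
#   launch_once     runs the application once (stands in for srun)
#   nodes_left      prints the number of healthy nodes in the allocation
#   halt_requested  returns 0 if the run should stop (assumed convention)

# Usage: run_with_retries MAX_RUNS NODES_NEEDED
# Prints the number of launch attempts that were made.
run_with_retries() {
    max_runs=$1
    nodes_needed=$2
    attempts=0
    while :; do
        # Break if too few healthy nodes remain for the application.
        [ "$(nodes_left)" -ge "$nodes_needed" ] || break

        launch_once
        attempts=$((attempts + 1))

        # Break once the allowed number of attempts is used up.
        [ "$attempts" -ge "$max_runs" ] && break

        # Check for a halt, pause so nodes can clean up, then check
        # again in case a halt command arrived during the pause.
        halt_requested && break
        sleep "${RETRY_SLEEP:-60}"
        halt_requested && break
    done
    echo "$attempts"
}
```

Checking the halt condition on both sides of the sleep follows the list above: the second check catches a halt request issued while the loop was paused.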
scripts/common/scr_test_runtime.in
Checks that various runtime dependencies are available.
- Checks for pdsh command,
- Checks for dshbak command,
- Checks for Date::Manip perl module.
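The three checks above could be written along these lines. This is a minimal sketch that assumes a POSIX shell; the function names are hypothetical, and the real script may report failures differently.

```shell
#!/bin/sh
# Sketch of a runtime dependency check; function names are hypothetical.

# Returns 0 if every named command is found on $PATH.
# Each missing command is reported on stderr.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Returns 0 if perl can load the named module.
check_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Runs all three checks from the list and returns 0 only if all pass.
test_runtime() {
    rc=0
    check_commands pdsh dshbak || rc=1
    check_perl_module Date::Manip || rc=1
    return "$rc"
}
```

Running every check before returning, instead of stopping at the first failure, lets a user see all missing dependencies in one pass.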
scripts/common/scr_prerun.in
Executes commands to prepare an allocation for SCR.
- Interprets $SCR_ENABLE, calls srun and bails if set to 0.
- Interprets $SCR_DEBUG, enables verbosity if set \(>\) 0.
- Invokes scr_test_runtime to check for necessary runtime dependencies.
- Invokes mkdir to create .scr subdirectory in prefix directory.
- Invokes rm -f to remove flush and nodes files from prefix directory.
- Returns 0 if allocation is ready, 1 otherwise.
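The directory preparation in the last three steps might look like the sketch below. The function name prepare_prefix is hypothetical, and the stale file names are passed in as arguments because the list above does not give their exact names. Using mkdir -p so that a rerun does not fail on an existing directory is also a choice of this sketch.

```shell
#!/bin/sh
# Sketch of the prefix directory preparation; prepare_prefix is a
# hypothetical name, and the caller supplies the stale file names.

# Usage: prepare_prefix PREFIX_DIR [FILE...]
# Returns 0 if the directory is ready, 1 otherwise.
prepare_prefix() {
    prefix=$1
    shift
    # Create the .scr subdirectory; -p tolerates an existing one.
    mkdir -p "$prefix/.scr" || return 1
    # Remove each stale file; rm -f succeeds if the file is absent.
    for name in "$@"; do
        rm -f "$prefix/$name" || return 1
    done
    return 0
}
```

For example, `prepare_prefix "$prefix" flush.scr nodes.scr` would clear two files, where flush.scr and nodes.scr are placeholder names chosen for illustration.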
src/scr_retries_halt.c
Reads halt file and returns exit code depending on whether the run should be halted or not.