Scavenge

SCR commands should be executed after the final run of the application in a resource allocation to check that the most recent checkpoint is successfully copied to the parallel file system before exiting the allocation. This logic is encapsulated in the scr_postrun command.

scripts/common/scr_postrun.in

Checks whether there is a dataset in cache that must be copied to the parallel file system. If so, it scavenges this dataset, rebuilds any missing files if possible, and finally updates the SCR index file in the prefix directory. (A sketch of the key scr_flush_file checks follows the list below.)

  1. Interprets $SCR_ENABLE, bails with success if set to 0.

  2. Interprets $SCR_DEBUG, enables verbosity if set > 0.

  3. Invokes scr_prefix to determine prefix directory on parallel file system (but this value is overridden via the "-p" option when called from scr_srun).

  4. Interprets $SCR_NODELIST to determine set of nodes job is using, invokes "scr_env --nodes" if not set.

  5. Invokes scr_list_down_nodes to determine which nodes are down.

  6. Invokes scr_glob_hosts to subtract down nodes from node list to determine which nodes are still up, bails with error if there are no up nodes left.

  7. Invokes scr_list_dir control to get the control directory.

  8. Invokes "scr_flush_file --dir $pardir --latest" providing prefix directory to determine id of most recent dataset.

  9. If this command fails, there is no dataset to scavenge, so bail out with error.

  10. Invokes "scr_inspect --up $UPNODES --from $cntldir" to get list of datasets in cache.

  11. Invokes "scr_flush_file --dir $pardir --needflush $id" providing prefix directory and dataset id to determine whether this dataset needs to be copied.

  12. If this command fails, the dataset has already been flushed, so bail out with success.

  13. Invokes "scr_flush_file --dir $pardir --subdir $id" to get name for dataset directory.

  14. Creates dataset directory on parallel file system, and creates hidden .scr directory.

  15. Invokes scr_scavenge providing control directory, dataset id to be copied, dataset directory, and set of known down nodes, which copies dataset files from cache to the PFS.

  16. Invokes scr_index providing dataset directory, which checks whether all files are accounted for, attempts to rebuild missing files if it can, and records the new directory and status in the SCR index file.

  17. If dataset was copied and indexed successfully, marks the dataset as current in the index file.
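
The heart of this sequence is the pair of scr_flush_file queries in steps 8 through 12. A minimal C sketch of just that pair is below. It assumes, as the steps above imply, that "--latest" prints the id of the most recent dataset on stdout and that "--needflush" exits nonzero when the dataset has already been flushed; the prefix directory is a hypothetical example value, and all other steps are elided.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
      const char* pardir = "/pfs/user/job123"; /* hypothetical prefix directory */
      char cmd[1024];
      char id[64];

      /* step 8: ask for the id of the most recent dataset */
      snprintf(cmd, sizeof(cmd), "scr_flush_file --dir %s --latest", pardir);
      FILE* p = popen(cmd, "r");
      if (p == NULL) {
        return 1;
      }
      if (fgets(id, sizeof(id), p) == NULL) {
        pclose(p);
        return 1; /* step 9: no dataset to scavenge, bail out with error */
      }
      pclose(p);
      id[strcspn(id, "\n")] = '\0';

      /* step 11: check whether this dataset still needs to be copied */
      snprintf(cmd, sizeof(cmd), "scr_flush_file --dir %s --needflush %s", pardir, id);
      if (system(cmd) != 0) {
        return 0; /* step 12: already flushed, bail out with success */
      }

      /* steps 13-17: create dataset directory, scavenge, and re-index */
      printf("dataset %s needs scavenging\n", id);
      return 0;
    }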

scripts/TLCC/scr_scavenge.in

Executes within job batch script to manage scavenge of files from cache to parallel file system. Uses scr_hostlist.pm and scr_param.pm.

  1. Uses scr_param.pm to read SCR_FILE_BUF_SIZE (sets size of buffer when writing to file system).

  2. Uses scr_param.pm to read SCR_CRC_ON_FLUSH (flag indicating whether to compute CRC32 on file during scavenge).

  3. Uses scr_param.pm to read SCR_USE_CONTAINERS (flag indicating whether to combine application files into container files).

  4. Invokes "scr_env --jobid" to get jobid.

  5. Invokes "scr_env --nodes" to get the current nodeset; this can be overridden with "--jobset" on the command line.

  6. Logs start of scavenge operation, if logging is enabled.

START ROUND 1

  1. Invokes pdsh of scr_copy providing control directory, dataset id, dataset directory, buffer size, CRC32 flag, partner flag, container flag, and list of known down nodes.

  2. Directs stdout to one file, directs stderr to another.

  3. Scans stdout file to build list of partner nodes and list of nodes where copy command failed.

  4. Scans stderr file for a few well-known error strings indicating pdsh failed.

  5. Builds list of all failed nodes and list of nodes that were partners to those failed nodes, if any.

  6. If there were any nodes that failed in ROUND 1, enter ROUND 2.

END ROUND 1, START ROUND 2

  1. Builds updated list of failed nodes, which includes nodes known to be down before ROUND 1 plus any nodes detected as failed during ROUND 1.

  2. Invokes pdsh of scr_copy on partner nodes of failed nodes (if we found a partner for each failed node) or on all non-failed nodes otherwise, providing the updated list of failed nodes.

END ROUND 2

  1. Logs end of scavenge, if logging is enabled.
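
The ROUND 2 target selection is the subtle part: if every node that failed in ROUND 1 has a known partner, only those partners need to be retried; otherwise every surviving node is retried. A C sketch of that rule is below, where the node names and per-node results are hypothetical stand-ins for what the script parses out of the captured pdsh output.

    #include <stdio.h>

    #define NNODES 4

    int main(void)
    {
      const char* nodes[NNODES]   = {"node1", "node2", "node3", "node4"};
      const int   failed[NNODES]  = {0, 1, 0, 0};  /* node2 failed in ROUND 1 */
      const char* partner[NNODES] = {NULL, "node3", NULL, NULL}; /* node3 holds node2's copies */

      /* do we have a partner for every failed node? */
      int all_partnered = 1;
      for (int i = 0; i < NNODES; i++) {
        if (failed[i] && partner[i] == NULL) {
          all_partnered = 0;
        }
      }

      /* build the ROUND 2 target list */
      printf("ROUND 2 targets:");
      for (int i = 0; i < NNODES; i++) {
        if (all_partnered) {
          if (failed[i]) printf(" %s", partner[i]); /* retry only on partners */
        } else {
          if (!failed[i]) printf(" %s", nodes[i]);  /* retry on all up nodes */
        }
      }
      printf("\n");
      return 0;
    }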

src/scr_copy.c

Serial process that runs on a compute node and copies files for specified dataset to parallel file system.

  1. Read control directory, dataset id, destination dataset directory, etc. from command line.

  2. Read master filemap and each filemap it lists.

  3. If specified dataset id does not exist, we can’t copy it so bail out with error.

  4. Loop over each rank we have for this dataset.

RANK LOOP

  1. Get flush descriptor from filemap. Record partner if set and whether we should preserve user directories or use containers.

  2. If partner flag is set, print node name of partner and loop to next rank.

  3. Otherwise, check whether we have all files for this rank, and if not loop to next rank.

  4. Otherwise, we’ll actually start to copy files.

  5. Allocate a rank filemap object and set expected number of files.

  6. Copy dataset hash into rank filemap.

  7. Record whether we’re preserving user directories or using containers in the rank filemap.

  8. Loop over each file for this rank in this dataset.

FILE LOOP

  1. Get file name.

  2. Check that we can read the file, if not record an error state.

  3. Get file meta data from filemap.

  4. Check whether file is application or SCR file.

  5. If user file, set destination directory. If preserving directories, get user-specified directory from meta data and call mkdir.

  6. Create destination file name.

  7. Copy file from cache to destination, optionally compute CRC32 during copy.

  8. Compute relative path to destination file from dataset directory.

  9. Add relative path to file to rank filemap.

  10. If CRC32 was enabled: if a CRC32 value was already set on the original file, verify that the computed value matches; otherwise, record the computed value in the file meta data.

  11. Record file meta data in rank filemap.

  12. Free temporaries.

END FILE LOOP

  1. Write rank filemap to dataset directory.

  2. Delete rank filemap object.

END RANK LOOP

  1. Free path to dataset directory and hidden .scr directory.

  2. Print and exit with code indicating success or error.
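
The per-file copy at the core of the FILE LOOP (step 7) streams data through a buffer and can fold each block into a running CRC32. Below is a condensed sketch using zlib's crc32 (link with -lz); the buffer size stands in for SCR_FILE_BUF_SIZE, the paths in main are hypothetical, and the filemap and meta data updates are omitted.

    #include <stdio.h>
    #include <zlib.h>

    /* copy src to dst; if crc is non-NULL, compute CRC32 of the data on the way */
    static int copy_file(const char* src, const char* dst, unsigned long* crc)
    {
      FILE* in  = fopen(src, "rb");
      FILE* out = fopen(dst, "wb");
      if (in == NULL || out == NULL) {
        if (in)  { fclose(in); }
        if (out) { fclose(out); }
        return 1;
      }

      if (crc != NULL) {
        *crc = crc32(0L, Z_NULL, 0); /* initialize running CRC */
      }

      char buf[64 * 1024]; /* stand-in for SCR_FILE_BUF_SIZE */
      size_t n;
      int rc = 0;
      while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (crc != NULL) {
          *crc = crc32(*crc, (const Bytef*)buf, (uInt)n);
        }
        if (fwrite(buf, 1, n, out) != n) {
          rc = 1; /* short write: record error and stop */
          break;
        }
      }

      fclose(in);
      fclose(out);
      return rc;
    }

    int main(void)
    {
      unsigned long crc = 0;
      int rc = copy_file("/tmp/cache/ckpt.18.0", "/pfs/user/job123/ckpt.18.0", &crc);
      printf("copy rc=%d crc=0x%08lx\n", rc, crc);
      return rc;
    }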

src/scr_index.c

Given a dataset directory as command line argument, checks whether dataset is indexed and adds to index if not. Attempts to rebuild missing files if needed.

  1. If "--add" option is specified, call index_add_dir (Section 0.1.5) to add directory to index file.

  2. If "--remove" option is specified, call index_remove_dir to delete dataset directory from index file. Does not delete associated files, only the reference to the directory from the index file.

  3. If "--current" option is specified, call index_current_dir to mark specified dataset directory as current. When a dataset is marked as current, SCR attempts to restart the job from that dataset, falling back to earlier datasets if that restart fails.

  4. If "--list" option is specified, call index_list to list contents of index file.
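
A sketch of this option dispatch using getopt_long 
is below. The long option names come from the list above; that each of the first three takes the dataset directory as its option argument is an assumption for illustration, and printf calls stand in for the index_* functions.

    #include <getopt.h>
    #include <stdio.h>

    int main(int argc, char* argv[])
    {
      static struct option opts[] = {
        {"add",     required_argument, NULL, 'a'},
        {"remove",  required_argument, NULL, 'r'},
        {"current", required_argument, NULL, 'c'},
        {"list",    no_argument,       NULL, 'l'},
        {NULL, 0, NULL, 0}
      };

      int ch;
      while ((ch = getopt_long(argc, argv, "", opts, NULL)) != -1) {
        switch (ch) {
          case 'a': printf("index_add_dir(%s)\n", optarg);     break;
          case 'r': printf("index_remove_dir(%s)\n", optarg);  break;
          case 'c': printf("index_current_dir(%s)\n", optarg); break;
          case 'l': printf("index_list()\n");                  break;
          default:  return 1;
        }
      }
      return 0;
    }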

index_add_dir

Adds specified dataset directory to index file, if it doesn’t already exist. Rebuilds files if possible, and writes summary file if needed.

  1. Read index file.

  2. Look up dataset directory in index file; if it is already indexed, bail out with success.

  3. Otherwise, concatenate dataset subdirectory name with prefix directory to get full path to the dataset directory.

  4. Attempt to read summary file from dataset directory. Call scr_summary_build (Section 0.1.6) if it does not exist.

  5. Read dataset id from summary file; if this fails, exit with error.

  6. Read completeness flag from summary file.

  7. Write entry to index hash for this dataset, including directory name, dataset id, complete flag, and flush timestamp.

  8. Write hash out as new index file.

scr_summary_build

Scans all files in dataset directory, attempts to rebuild files, and writes summary file.

  1. If we can read the summary file, bail out with success.

  2. Call scr_scan_files (Section 0.1.7) to read meta data for all files in directory. This records all data in a scan hash.

  3. Call scr_inspect_scan (Section 0.1.9) to examine whether all files in scan hash are complete, and record results in scan hash.

  4. If files are missing, call scr_rebuild_scan (Section 0.1.10) to attempt to rebuild files. After the rebuild, we delete the scan hash, rescan, and re-inspect to produce an updated scan hash.

  5. Delete extraneous entries from scan hash to form our summary file hash (see the Summary file section).

  6. Write out summary file.
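
The control flow above, condensed into C. The scr_* functions here are stubs standing in for the real implementations, and the scan hash is reduced to an int; the point is the rescan after a rebuild, which produces a fresh scan hash before the summary is written.

    #include <stdio.h>

    #define SCR_SUCCESS (0)

    /* stubs standing in for the real functions of the same names */
    static int scr_scan_files(int* scan)  { *scan = 1; return SCR_SUCCESS; }
    static int scr_inspect_scan(int scan) { (void)scan; return SCR_SUCCESS; }
    static int scr_rebuild_scan(int scan) { (void)scan; return SCR_SUCCESS; }

    int main(void)
    {
      int scan;
      if (scr_scan_files(&scan) != SCR_SUCCESS) {
        return 1;
      }
      if (scr_inspect_scan(scan) != SCR_SUCCESS) {
        /* files are missing: rebuild, then rescan and re-inspect */
        if (scr_rebuild_scan(scan) != SCR_SUCCESS) {
          return 1;
        }
        scr_scan_files(&scan); /* fresh scan hash after the rebuild */
        if (scr_inspect_scan(scan) != SCR_SUCCESS) {
          return 1;
        }
      }
      /* prune the scan hash to summary entries and write the summary file */
      printf("summary file written\n");
      return 0;
    }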

scr_scan_files

Reads all filemap and meta data files in directory to build a hash listing all files in dataset directory.

  1. Build path string to the hidden .scr subdirectory of the dataset directory.

  2. Build regular expression to identify XOR files.

  3. Open hidden directory.

BEGIN LOOP

  1. Call readdir to get next directory item.

  2. Get item name.

  3. If item does not end with ".scrfilemap", loop.

  4. Otherwise, create full path to file name.

  5. Call scr_scan_file to read file into scan hash.

  6. Free full path and loop to next item.

END LOOP
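
A sketch of this directory scan in C: walk the hidden .scr directory and pick out entries ending in ".scrfilemap". The directory path is a hypothetical example, and a printf stands in for the call to scr_scan_file.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
      const char* dir = "/pfs/user/job123/scr.dataset.18/.scr"; /* hypothetical */
      DIR* d = opendir(dir);
      if (d == NULL) {
        return 1;
      }

      struct dirent* dp;
      while ((dp = readdir(d)) != NULL) {
        const char* name   = dp->d_name;
        const char* suffix = ".scrfilemap";
        size_t len  = strlen(name);
        size_t slen = strlen(suffix);

        /* skip anything that does not end with ".scrfilemap" */
        if (len < slen || strcmp(name + len - slen, suffix) != 0) {
          continue;
        }

        /* build full path and hand it to the scanner */
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, name);
        printf("scanning %s\n", path);
      }

      closedir(d);
      return 0;
    }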

scr_scan_file

  1. Create new rank filemap object.

  2. Read filemap.

  3. For each dataset id in filemap…

  4. Get dataset id.

  5. Get scan hash for this dataset.

  6. Look up rank2file map in scan hash, or create one if it does not exist.

  7. For each rank in this dataset…

  8. Get rank id.

  9. Read dataset hash from filemap and record in scan hash.

  10. Get rank hash from rank2file hash for the current rank.

  11. Set number of expected files.

  12. For each file for this rank and dataset…

  13. Get file name.

  14. Build full path to file.

  15. Get meta data for file from rank filemap.

  16. Read number of ranks, file name, file size, and complete flag for file.

  17. Check that file exists.

  18. Check that file size matches.

  19. Check that the number of ranks we expect matches the number from meta data; use this value to set the expected number of ranks if it is not already set.

  20. If any check fails, skip to next file.

  21. Otherwise, add entry for this file in scan hash.

  22. If meta data is for an XOR file, add an XOR entry in scan hash.
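
Steps 17 and 18 reduce to a single stat call. A small sketch, with a hypothetical file path and size (as if read from meta data) in main:

    #include <stdio.h>
    #include <sys/stat.h>

    /* return 0 if path exists and has the expected size, nonzero otherwise */
    static int check_file(const char* path, unsigned long expected_size)
    {
      struct stat st;
      if (stat(path, &st) != 0) {
        return 1; /* step 17: file does not exist (or is unreadable) */
      }
      if ((unsigned long)st.st_size != expected_size) {
        return 1; /* step 18: file size does not match meta data */
      }
      return 0;
    }

    int main(void)
    {
      int rc = check_file("/pfs/user/job123/scr.dataset.18/ckpt.18.0", 1048576);
      printf("check rc=%d\n", rc);
      return 0;
    }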

scr_inspect_scan

Checks that each rank has an entry in the scan hash, and checks that each rank has an entry for each of its expected number of files.

  1. For each dataset in scan hash, get dataset id and pointer to its hash entries.

  2. Look up rank2file hash under RANK2FILE key.

  3. Look up hash for RANKS key, and check that we have exactly one entry.

  4. Read number of ranks for this dataset.

  5. Sort entries for ranks in scan hash by rank id.

  6. Set expected rank to 0, and iterate over each rank in loop.

BEGIN LOOP

  1. Get rank id and hash entries for current rank.

  2. If rank id is invalid or out of order compared to expected rank, throw an error and mark dataset as invalid.

  3. While current rank id is higher than expected rank id, mark expected rank id as missing and increment expected rank id.

  4. Get FILES hash for this rank, and check that we have exactly one entry.

  5. Read number of expected files for this rank.

  6. Get hash of file names for this rank recorded in scr_scan_files.

  7. For each file, if it is marked as incomplete, mark rank as missing.

  8. If number of file entries for this rank is less than expected number of files, mark rank as missing.

  9. If number of file entries for this rank is more than expected number of files, mark dataset as invalid.

  10. Increment expected rank id.

END LOOP

  1. While expected rank id is less than the number of ranks for this dataset, mark expected rank id as missing and increment expected rank id.

  2. If expected rank id is more than the number of ranks for this dataset, mark dataset as invalid.

  3. Return SCR_SUCCESS if and only if we have all files for each dataset.
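
The expected-rank sweep in the loop above is easiest to see in isolation. A small C sketch, where present[] is hypothetical example data for the sorted rank ids found in the scan hash:

    #include <stdio.h>

    int main(void)
    {
      int nranks    = 6;                 /* ranks expected in the dataset */
      int present[] = {0, 1, 3, 4};      /* rank ids found, sorted ascending */
      int npresent  = sizeof(present) / sizeof(present[0]);

      int expected = 0;
      for (int i = 0; i < npresent; i++) {
        int rank = present[i];
        if (rank < expected || rank >= nranks) {
          printf("invalid dataset: bad rank id %d\n", rank);
          return 1;
        }
        /* any rank ids skipped over are missing */
        while (expected < rank) {
          printf("rank %d missing\n", expected++);
        }
        expected++; /* this rank is accounted for */
      }

      /* ranks off the end of the list are also missing */
      while (expected < nranks) {
        printf("rank %d missing\n", expected++);
      }
      return 0;
    }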

scr_rebuild_scan

Identifies whether any files are missing and forks and execs processes to rebuild missing files if possible.

  1. Iterate over each dataset id recorded in scan hash.

  2. Get dataset id and its hash entries.

  3. Look for flag indicating that dataset is invalid. We assume the dataset is bad beyond repair if we find such a flag.

  4. Check whether there are any ranks listed as missing files for this dataset, if not, go to next dataset.

  5. Otherwise, iterate over entries for each XOR set.

BEGIN LOOP

  1. Get XOR set id and number of members for this set.

  2. Iterate over entries for each member in the set. If we are missing an entry for the member, or if we have its entry but its associated rank is listed as one of the missing ranks, mark this member as missing.

  3. If we are missing files for more than one member of the set, mark the dataset as being unrecoverable. In this case, we won’t attempt to rebuild any files.

  4. Otherwise, if we are missing any files for the set, build the string that we’ll use later to fork and exec a process to rebuild the missing files.

END LOOP

  1. If dataset is recoverable, call scr_fork_rebuilds to fork and exec processes to rebuild missing files. This forks a process for each missing file where each invokes scr_rebuild_xor utility, implemented in scr_rebuild_xor.c. If any of these rebuild processes fail, then consider the rebuild as failed.

  2. Return SCR_SUCCESS if and only if, for each dataset id in the scan hash, the dataset is not explicitly marked as bad, and all files already existed or we were able to rebuild all missing files.
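
The recoverability rule in the loop above is the standard XOR (RAID-5-style) constraint: each set can rebuild at most one lost member, so a dataset is unrecoverable as soon as any XOR set is missing two or more members. A small C sketch with hypothetical per-set missing counts:

    #include <stdio.h>

    #define NSETS 2

    int main(void)
    {
      /* hypothetical: members missing per XOR set for one dataset */
      int missing[NSETS] = {1, 0};

      int recoverable  = 1;
      int need_rebuild = 0;
      for (int i = 0; i < NSETS; i++) {
        if (missing[i] > 1) {
          recoverable = 0;  /* two or more lost members: beyond repair */
        } else if (missing[i] == 1) {
          need_rebuild = 1; /* exactly one lost member: schedule a rebuild */
        }
      }

      if (!recoverable) {
        printf("dataset unrecoverable, skip rebuild\n");
      } else if (need_rebuild) {
        printf("fork scr_rebuild_xor for each set with one missing member\n");
      } else {
        printf("dataset complete\n");
      }
      return 0;
    }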