SCR_Init

During SCR_Init(), the library allocates and initializes its data structures. It then inspects the cache; if it finds cached datasets, it distributes and rebuilds their files. Otherwise, it attempts to fetch the most recent checkpoint from the parallel file system. This function is implemented in scr.c.

  1. Interpret $SCR_ENABLE; if not enabled, bail out with error.

  2. Call DTCMP_Init if the datatype compare (DTCMP) library is available.

  3. Create scr_comm_world by duplicating MPI_COMM_WORLD and set scr_my_rank_world and scr_ranks_world.

  4. Call scr_env_nodename to get hostname, store in scr_my_hostname.

  5. Call getpagesize to get memory page size, store in scr_page_size.

  6. Initialize parameters: rank 0 reads any config files and broadcasts info to other ranks.

  7. Check whether we are still enabled (a config file may disable us), and bail out with error if not.

  8. Check that scr_username and scr_jobid are defined, which are used for logging purposes.

  9. Call scr_groupdescs_create to create group descriptors (Section 0.1.1).

  10. Call scr_storedescs_create to create store descriptors (Section 0.1.2).

  11. Call scr_reddescs_create to create redundancy descriptors (Section 0.1.3).

  12. Check that we have a valid redundancy descriptor that we can use for each checkpoint.

  13. Create scr_comm_node and scr_comm_node_across.

  14. Log the start of this run, if logging is enabled.

  15. Verify that scr_prefix is defined.

  16. Define scr_prefix_scr to be .scr directory within prefix directory.

  17. Create .scr directory in prefix directory.

  18. Define control directory and save in scr_cntl_prefix.

  19. Create the control directory.

  20. Create each of the cache directories.

BARRIER

  1. Define file names for halt, flush, nodes, transfer, and filemap files.

  2. Delete any existing transfer file.

  3. Create nodes file, and write the total number of nodes in the job (the maximum size of scr_comm_level across all procs).

  4. Allocate a hash to hold halt status, and initialize halt seconds if needed.

BARRIER

  1. Stop any ongoing asynchronous flush.

  2. Check whether we need to halt and exit this run.

  3. Enable scr_flush_on_restart and disable scr_fetch if scr_global_restart is set.

  4. Create empty hash for filemap.

  5. Master process for each store reads and distributes info from filemaps (Section 0.1.4).

  6. Call scr_cache_rebuild to rebuild datasets in cache (Section 0.1.5).

  7. If rebuild was successful, call scr_flush_file_rebuild to update flush file.

  8. If rebuild successful, check whether dataset should be flushed.

  9. If we don’t have a checkpoint (rebuild failed), call scr_cache_purge to clear cache (delete all files).

  10. If fetch is enabled, call scr_fetch_sync to read checkpoint from parallel file system (Section 0.1.12).

  11. If the fetch failed, clear the cache again.

BARRIER

  1. Log end of initialization.

  2. Start timer to record length of compute phase and log start of compute phase.
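
For context, the following minimal sketch shows where these initialization and shutdown steps sit in an application that uses the library (the restart, compute, and checkpoint loop is elided):

    #include <mpi.h>
    #include "scr.h"

    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);
      SCR_Init();      /* runs the steps detailed above */

      /* ... restart, compute, and checkpoint loop ... */

      SCR_Finalize();  /* flushes the latest checkpoint; see SCR_Finalize below */
      MPI_Finalize();
      return 0;
    }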

scr_groupdescs_create

The group descriptors are kept in the scr_groupdescs array (Section Group descriptors). This function is implemented in scr_groupdesc.c.

  1. Read GROUPS key from scr_groupdesc_hash which is set while processing parameters in SCR_Init.

  2. Count number of groups and add two.

  3. Allocate space for scr_groupdescs array.

  4. Initialize each group descriptor.

  5. Create descriptor for all tasks on the same node called NODE.

  6. Create descriptor for all tasks in job called WORLD.

  7. Create each group descriptor specified in scr_groupdesc_hash by calling scr_groupdesc_create_by_str. Rank 0 broadcasts the group name to determine the order.

The scr_groupdesc_create_by_str function creates groups of processes by splitting scr_comm_world into subcommunicators containing all procs that specify the same string. The real work is delegated to scr_rank_str, which is implemented in scr_split.c. It executes a bitonic sort on string names, and it returns the number of distinct groups across all procs and the group id to which the calling process belongs. This id is then used in a call to MPI_Comm_split.
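
The following is a minimal illustration of the same splitting idea using an allgather rather than SCR's scalable bitonic sort; the fixed buffer size and helper name are assumptions for the sketch, not SCR's actual code:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define STR_MAX 256  /* assumed bound on group name length */

    int split_comm_by_str(MPI_Comm comm, const char* str, MPI_Comm* newcomm)
    {
      int rank, ranks;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &ranks);

      /* gather every process's string (memory scales with job size,
       * which is why SCR sorts instead) */
      char mine[STR_MAX] = {0};
      strncpy(mine, str, STR_MAX - 1);
      char* all = malloc((size_t)ranks * STR_MAX);
      MPI_Allgather(mine, STR_MAX, MPI_CHAR, all, STR_MAX, MPI_CHAR, comm);

      /* color by the lowest rank that supplied the same string, so equal
       * strings map to equal colors */
      int color = rank;
      for (int i = 0; i < ranks; i++) {
        if (strcmp(&all[(size_t)i * STR_MAX], mine) == 0) {
          color = i;
          break;
        }
      }
      free(all);

      return MPI_Comm_split(comm, color, rank, newcomm);
    }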

scr_storedescs_create

The store descriptors are kept in the scr_storedescs array (Section Store descriptors). This function is implemented in scr_storedesc.c.

  1. Read STORE key from scr_storedesc_hash which is set while processing parameters in SCR_Init.

  2. Count number of store descriptors.

  3. Allocate space for scr_storedescs array.

  4. Sort store descriptors to ensure they are in the same order on all procs.

  5. Create each store descriptor specified in scr_storedesc_hash by calling scr_storedesc_create_from_hash.

  6. Create store descriptor for control directory and save it in scr_storedesc_cntl.

The scr_storedesc_create_from_hash function sets all fields in the descriptor using default values or values defined in the hash. A key field is a communicator consisting of the group of processes that share the associated storage device. This communicator is used to coordinate processes when accessing the device. It is created by duplicating a communicator from a group descriptor.

scr_reddescs_create

The redundancy descriptors are kept in the scr_reddescs array (Section Redundancy descriptors). This function is implemented in scr_reddesc.c.

  1. Read CKPT key from scr_reddesc_hash which is set while processing parameters in SCR_Init.

  2. Count number of redundancy descriptors.

  3. Allocate space for scr_reddescs array.

  4. Sort redundancy descriptors to ensure they are in the same order on all procs.

  5. Create each redundancy descriptor specified in scr_reddesc_hash by calling scr_reddesc_create_from_hash.

The scr_reddesc_create_from_hash function sets all fields in the descriptor from default values or values defined in the hash. Two key fields consist of an index to the store descriptor providing details on the class of storage to use and a communicator on which to compute redundancy data. To build the communicator, a new communicator is created by splitting scr_comm_world into subcommunicators consisting of processes from different failure groups.
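
A hedged sketch of that split: if each process passes in the communicator of its failure group (its node, say), then using its rank within that group as the split color places the i-th member of every failure group into the same subcommunicator, so no redundancy set contains two processes from the same failure group. The helper name is illustrative:

    #include <mpi.h>

    int build_redundancy_comm(MPI_Comm world, MPI_Comm failure_group,
                              MPI_Comm* redcomm)
    {
      int world_rank, group_rank;
      MPI_Comm_rank(world, &world_rank);
      MPI_Comm_rank(failure_group, &group_rank);

      /* all processes that are rank k within their failure group form set k */
      return MPI_Comm_split(world, group_rank, world_rank, redcomm);
    }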

scr_scatter_filemaps

During a restart, the master process for each control directory reads in all filemap data and distributes this data to the other processes sharing the control directory, if any. After this distribution phase, a process is responsible for each file it has filemap data for, and each file in cache is the responsibility of some process. We use this approach to handle cases where the number of tasks accessing the control directory in the current run is different from the number of tasks in the prior run. This function is implemented in scr_cache_rebuild.c.

  1. Master reads master filemap file.

  2. Master creates empty filemap and reads each filemap file listed in the master filemap. Deletes each filemap file as it’s read.

  3. Gather list of global rank ids sharing the store to master process.

  4. If the filemap has data for a rank, master prepares hash to send corresponding data to that rank.

  5. Master evenly distributes the remainder of the filemap data to all processes.

  6. Distribute filemap data via scr_hash_exchange() (the exchange pattern is sketched after this list).

  7. Master writes new master filemap file.

  8. Each process writes new filemap file.
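
The sparse exchange pattern used by scr_hash_exchange can be sketched as follows; this illustration exchanges sizes with MPI_Alltoall and then moves payloads point-to-point, whereas SCR's actual implementation operates on serialized hashes with a more scalable algorithm:

    #include <mpi.h>
    #include <stdlib.h>

    /* sendbufs[i]/sendsizes[i] hold the payload for rank i (size 0 = none).
     * On return, recvbufs[i]/recvsizes[i] hold what rank i sent us; nonzero
     * receive buffers are malloc'd and owned by the caller. */
    void sparse_exchange(char** sendbufs, int* sendsizes,
                         char** recvbufs, int* recvsizes, MPI_Comm comm)
    {
      int ranks;
      MPI_Comm_size(comm, &ranks);

      /* everyone learns how many bytes to expect from everyone else */
      MPI_Alltoall(sendsizes, 1, MPI_INT, recvsizes, 1, MPI_INT, comm);

      /* post receives and sends only for the nonzero pairs */
      MPI_Request* reqs = malloc(2 * (size_t)ranks * sizeof(MPI_Request));
      int nreq = 0;
      for (int i = 0; i < ranks; i++) {
        recvbufs[i] = NULL;
        if (recvsizes[i] > 0) {
          recvbufs[i] = malloc((size_t)recvsizes[i]);
          MPI_Irecv(recvbufs[i], recvsizes[i], MPI_CHAR, i, 0, comm, &reqs[nreq++]);
        }
      }
      for (int i = 0; i < ranks; i++) {
        if (sendsizes[i] > 0) {
          MPI_Isend(sendbufs[i], sendsizes[i], MPI_CHAR, i, 0, comm, &reqs[nreq++]);
        }
      }
      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
      free(reqs);
    }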

scr_cache_rebuild

This section describes the logic to distribute and rebuild files in cache. SCR attempts to rebuild all cached datasets. This functionality is implemented in scr_cache_rebuild.c.

  1. Start timer.

  2. Delete any files from cache known to be incomplete.

  3. Get list of dataset ids currently in cache.

LOOP

  1. Identify dataset with lowest id across all procs yet to be rebuilt (see the sketch at the end of this section).

  2. If there is no dataset id specified on any process, break loop.

  3. Otherwise, log which dataset we are attempting to rebuild.

  4. Distribute hash for this dataset and store in map object (Section 0.1.6).

  5. If we fail to distribute the hash to all processes, delete this dataset from cache and loop.

  6. Distribute redundancy descriptors for this dataset and store in temporary redundancy descriptor object (Section 0.1.7). This informs each process about the cache device and the redundancy scheme to use for this dataset.

  7. If we fail to distribute the redundancy descriptors to all processes, delete this dataset from cache and loop.

  8. Create dataset directory in cache according to redundancy descriptor.

  9. Distribute files to the ranks that wrote them (Section 0.1.8). The owner ranks may now be on different nodes.

  10. Rebuild any missing files for this dataset using redundancy scheme specified in redundancy descriptor (Section 0.1.9).

  11. If the rebuild fails, delete this dataset from cache and loop.

  12. Otherwise, the rebuild succeeded. Update scr_dataset_id and scr_checkpoint_id if the id for the current dataset is higher, so that we continue counting up from this number when assigning ids to later datasets.

  13. Unset FLUSHING flag in flush file.

  14. Free the temporary redundancy descriptor.

EXIT LOOP

  1. Stop timer and log whether we were able to rebuild any dataset from cache.
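
Step 1 of the loop above must produce a single answer on all processes; a minimal sketch of that agreement, with illustrative names:

    #include <limits.h>
    #include <mpi.h>

    /* my_lowest_id is the lowest dataset id this process still holds files
     * for, or -1 if none remain; returns -1 on all procs when the loop is done */
    int next_dataset_to_rebuild(int my_lowest_id, MPI_Comm comm)
    {
      int mine = (my_lowest_id < 0) ? INT_MAX : my_lowest_id;
      int lowest;
      MPI_Allreduce(&mine, &lowest, 1, MPI_INT, MPI_MIN, comm);
      return (lowest == INT_MAX) ? -1 : lowest;
    }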

scr_distribute_datasets

Given a filemap and dataset id, distribute dataset hash and store in filemap.

  1. Create empty send hash for transferring dataset hashes.

  2. Get list of ranks that we have files for as part of the specified dataset.

  3. For each rank, lookup dataset hash from filemap and add to send hash.

  4. Delete list of ranks.

  5. Check that no process found files for an invalid rank. If the restarted run uses fewer processes than the previous run, we may (but are not guaranteed to) discover this condition here.

  6. Identify smallest rank that has a copy of the dataset hash.

  7. Return with failure if no such rank exists.

  8. Otherwise, broadcast hash from this rank.

  9. Store dataset hash in filemap and write filemap to disk.

  10. Delete send hash.

scr_distribute_reddescs

Given a filemap and dataset id, distribute redundancy descriptor that was applied to the dataset and store in filemap. This creates the same group and redundancy scheme that was applied to the dataset, even if the user has configured new schemes for the current run.

  1. Create empty send hash for transferring redundancy descriptor hashes.

  2. Get list of ranks that we have files for as part of the specified dataset.

  3. For each rank, lookup redundancy descriptor hash from filemap and add to send hash.

  4. Delete list of ranks.

  5. Check that no process found files for an invalid rank. If the restarted run uses fewer processes than the previous run, we may (but are not guaranteed to) discover this condition here.

  6. Execute sparse data exchange with scr_hash_exchange.

  7. Check that each rank received its descriptor, return with failure if not.

  8. Store redundancy descriptor hash in filemap and write filemap to disk.

  9. Create redundancy descriptor by calling scr_reddesc_create_from_filemap.

  10. Delete send and receive hashes from exchange.

scr_distribute_files

This section describes the algorithm used to distribute files for a specified dataset. SCR transfers files from their current location to the storage device accessible from the node where the owner rank is now running. The algorithm operates over a number of rounds. In each round, a process may send files to at most one other process. A process may only send files if it has all of the files written by the owner process. The caller specifies a filemap, a redundancy descriptor, and a dataset id as input. This implementation is in scr_cache_rebuild.c.

  1. Delete all bad (incomplete or inaccessible) files from the filemap.

  2. Get list of ranks that we have files for as part of the specified dataset.

  3. From this list, set a start index to the position of the first rank that is equal to or greater than our own rank (looping back to the beginning of the list if we pass the last rank). We stagger the start index across processes in this way to help distribute load later (see the sketch at the end of this section).

  4. Check that no process found files for an invalid rank while scanning for its start index. If the restarted run uses fewer processes than the previous run, we may (but are not guaranteed to) discover this condition here.

  5. Allocate arrays to record which rank we can send files to in each round.

  6. Check that we have all files for each rank, and record the round in which we can send them. The round we pick here is affected by the start index computed earlier.

  7. Issue sparse global exchange via scr_hash_exchange to inform each process in which round we can send it its files, and receive similar messages from other processes.

  8. Search for minimum round in which we can retrieve our own files, and remember the corresponding round and source rank. If we can fetch files from ourselves, we always select this option, since it will be the minimum round.

  9. Free the list of ranks we have files for.

  10. Determine whether all processes can obtain their files, and bail with error if not.

  11. Determine the maximum round any process needs to get its files.

  12. Identify which rank we’ll get our files from and issue sparse global exchange to distribute this info.

  13. Determine which ranks want to receive files from us, if any, and record the round they want to receive their files in.

  14. Get the directory name for this dataset.

  15. Loop through the maximum number of rounds and exchange files.

LOOP ROUNDS

  1. Check whether we can send files to a rank in this round, and if so, record destination and number of files.

  2. Check whether we need to receive our files in this round, and if so, record source rank.

  3. If we need to send files to ourselves, just move (rename) each file, update the filemap, and loop to the next round.

  4. Otherwise, if we have files for this round but the owner rank does not need them, delete them.

  5. If we do not need to send or receive any files this round, loop to next round.

  6. Otherwise, exchange number of files we’ll be sending and/or receiving, and record expected number that we’ll receive in our filemap.

  7. If we’re sending files, get a list of files for the destination.

  8. Enter exchange loop.

LOOP EXCHANGE

  1. Get next file name from our list of files to send, if any remaining.

  2. Swap file names with partners.

  3. If we’ll receive a file in this iteration, add the file name to the filemap and write out our filemap.

  4. Transfer file via scr_swap_files(). This call overwrites the outgoing file (if any) with the incoming file (if any), so there’s no need to delete the outgoing file. If there is no incoming file, it deletes the outgoing file (if any). We use this approach to conserve storage space, since we assume the cache is small. We also transfer file metadata with this function.

  5. If we sent a file, remove that file from our filemap and write out the filemap.

  6. Decrement the number of files we have to send / receive by one. When both counts hit zero, break exchange loop.

  7. Write updated filemap to disk.

EXIT LOOP EXCHANGE

  1. Free list of files that we sent in this round.

EXIT LOOP ROUNDS

  1. If we have files for more ranks than there were rounds, delete the files for all remaining ranks.

  2. Write out filemap file.

  3. Delete bad files (incomplete or inaccessible) from the filemap.
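
The staggered start index referenced in step 3 of the first list above can be sketched as follows (hypothetical helper; the list of owned ranks is assumed sorted):

    /* owned_ranks is the sorted list of ranks we hold files for; start at the
     * first entry at or above our own rank, wrapping to the front otherwise */
    int start_index(const int* owned_ranks, int count, int my_rank)
    {
      for (int i = 0; i < count; i++) {
        if (owned_ranks[i] >= my_rank) {
          return i;
        }
      }
      return 0;
    }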

scr_reddesc_recover

This function attempts to rebuild any missing files for a dataset. It returns SCR_SUCCESS on all processes if successful; it returns !SCR_SUCCESS on all processes otherwise. The caller specifies a filemap, a redundancy descriptor, and a dataset id as input. This function is implemented in scr_reddesc_recover.c.

  1. Attempt to rebuild files according to the redundancy scheme specified in the redundancy descriptor. Currently, only XOR can actually rebuild files (Section 0.1.10).

  2. If the rebuild failed, return with an error.

  3. Otherwise, check that all processes have all of their files for the dataset.

  4. If not, return with an error.

  5. If so, reapply the redundancy scheme, if needed. No need to do this with XOR, since it does this step as part of the rebuild.

scr_reddesc_recover_xor_attempt

Before we attempt to rebuild files using the XOR redundancy scheme, we first check whether it is possible. If we detect that two or more processes from the same XOR set are missing files, we cannot recover all files, and there is no point in rebuilding any of them. We execute this check in scr_reddesc_recover.c. The caller specifies a filemap, a redundancy descriptor, and a dataset id as input.

  1. Check whether we have our dataset files, and check whether we have our XOR file. If we’re missing any of these files, assume that we’re missing them all.

  2. Count the number of processes in our XOR set that need their files. We can recover all files from a set so long as no more than a single member needs its files.

  3. Check whether we can recover files for all sets; if not, bail out with an error (this check is sketched after the list).

  4. If the current process is in a set which needs to be rebuilt, identify which rank needs its files and call scr_reddesc_recover_xor() to rebuild files (Section 0.1.11).

  5. Check that the rebuild succeeded on all tasks, return error if not, otherwise return success.
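
Steps 2 and 3 amount to a pair of reductions: count the members missing files within each set, then require that every set be recoverable. A hedged sketch, with illustrative names (SCR tracks this state through its filemaps rather than plain flags):

    #include <mpi.h>

    /* need_files is 1 if this process is missing any of its dataset or XOR
     * files (step 1); a set is recoverable iff at most one member is missing
     * files (step 2); recovery proceeds only if every set is recoverable
     * (step 3). */
    int xor_can_recover(int need_files, MPI_Comm set_comm, MPI_Comm world_comm)
    {
      int set_missing;
      MPI_Allreduce(&need_files, &set_missing, 1, MPI_INT, MPI_SUM, set_comm);

      int set_ok = (set_missing <= 1);
      int all_ok;
      MPI_Allreduce(&set_ok, &all_ok, 1, MPI_INT, MPI_LAND, world_comm);
      return all_ok;
    }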

scr_reddesc_recover_xor

We invoke this routine within each XOR set that is missing files. The caller specifies a filemap, a redundancy descriptor, and a dataset id as input, as well as the rank of the process in the XOR set that is missing its files. We refer to the process that needs to rebuild its files as the root. This function is implemented in scr_reddesc_recover.c.

ALL

  1. Get pointer to XOR state structure from copy_state field of redundancy descriptor.

  2. Allocate empty hash to hold the header of our XOR file.

NON-ROOT

  1. Get name of our XOR file.

  2. Open XOR file for reading.

  3. Read header from file.

  4. From header, get hash of files we wrote.

  5. From this file hash, get the number of files we wrote.

  6. Allocate arrays to hold file descriptor, file name, and file size for each of our files.

  7. Get path of dataset directory from XOR file name.

  8. Open each of our files for reading and store file descriptor, file name, and file size of each file in our arrays.

  9. If the failed rank is to our left, send it our header. Our header stores a copy of the file hash for the rank to our left under the PARTNER key.

  10. If the failed rank is to our right, send it our file hash. When the failed rank rebuilds its XOR file, it needs to record our file hash in its header under the PARTNER key.

ROOT

  1. Receive XOR header from rank to our right.

  2. Rename PARTNER key in this header to CURRENT. The rank to our right stored a copy of our file hash under PARTNER.

  3. Receive file hash from rank to our left, and store it under PARTNER in our header.

  4. Get our file hash from CURRENT key in the header.

  5. From our file hash, get the number of files we wrote during the dataset.

  6. Allocate arrays to hold file descriptor, file name, and file size for each of our files.

  7. Build the file name for our XOR file, and add XOR file to the filemap.

  8. For each of our files, get meta data from file hash, then get file name and file size from meta data. Add file name to filemap, and record file name and file size in arrays.

  9. Record the number of files we expect to have in the filemap, including the XOR file.

  10. Write out filemap.

  11. Open XOR file for writing.

  12. Open each of our dataset files for writing, and record file descriptors in our file descriptor array.

  13. Write out XOR header to XOR file.

ALL

  1. Read XOR chunk size from header.

  2. Allocate buffers to send and receive data during reduction.

  3. Execute pipelined XOR reduction to root to reconstruct missing data, as illustrated in Figure Pipelined XOR reduction to root; a code sketch of the pattern follows this list. For a full description of the redundancy scheme, see Section XOR algorithm.

  4. Close our XOR file.

  5. Close each of our dataset files.
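
Step 3 above is the core of the rebuild. The sketch below shows one way to pipeline an XOR reduction to a root around a ring, processing the chunk in bounded pieces so that successive pieces can be in flight at different hops. It illustrates the communication pattern only; the real code reads each piece from its dataset and XOR files and writes reconstructed data as it goes:

    #include <mpi.h>
    #include <stdlib.h>

    void xor_reduce_to_root(const char* local, char* result, size_t chunk_size,
                            size_t bufsize, int root, MPI_Comm comm)
    {
      int rank, ranks;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &ranks);
      int right = (rank + 1) % ranks;          /* data flows right, toward root */
      int left  = (rank + ranks - 1) % ranks;

      char* buf = malloc(bufsize);
      for (size_t off = 0; off < chunk_size; off += bufsize) {
        int count = (int)((chunk_size - off < bufsize) ? chunk_size - off : bufsize);
        if (rank == root) {
          /* by the time a piece arrives here it is the XOR of every other
           * member's piece, i.e. our reconstructed data */
          MPI_Recv(result + off, count, MPI_BYTE, left, 0, comm, MPI_STATUS_IGNORE);
        } else if (left == root) {
          /* first link of the chain: send our piece unmodified */
          MPI_Send(local + off, count, MPI_BYTE, right, 0, comm);
        } else {
          /* middle of the chain: fold our piece into the incoming one */
          MPI_Recv(buf, count, MPI_BYTE, left, 0, comm, MPI_STATUS_IGNORE);
          for (int i = 0; i < count; i++) buf[i] ^= local[off + i];
          MPI_Send(buf, count, MPI_BYTE, right, 0, comm);
        }
      }
      free(buf);
    }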

ROOT

  1. For each of our dataset files and our XOR file, update filemap.

  2. Write filemap to disk.

  3. Also compute and record CRC32 checksum for each file if SCR_CRC_ON_COPY is set.

ALL

  1. Free data buffers.

  2. Free arrays for file descriptors, file names, and file sizes.

  3. Free XOR header hash.

scr_fetch_sync

This section describes the loop used to fetch a checkpoint from the parallel file system. SCR starts with the most recent checkpoint on the parallel file system as specified in the index file. If SCR fails to fetch this checkpoint, it then works backwards and attempts to fetch the next most recent checkpoint until it either succeeds or runs out of checkpoints. It acquires the list of available checkpoints from the index file. This functionality is implemented within scr_fetch.c.

  1. Start timer.

  2. Rank 0 reads index file from prefix directory, bail if failed to read file.

LOOP

  1. Rank 0 selects a target directory name. Start with directory marked as current if set, and otherwise use most recent checkpoint specified in index file. For successive iterations, attempt the checkpoint that is the next most recent.

  2. Rank 0 records fetch attempt in index file.

  3. Rank 0 builds full path to dataset.

  4. Broadcast dataset path from rank 0.

  5. Attempt to fetch checkpoint from selected directory.

  6. If fetch fails, rank 0 deletes “current” designation from dataset and marks dataset as “failed” in index file.

  7. If fetch succeeds, rank 0 updates “current” designation to point to this dataset in index file, break loop.

EXIT LOOP

  1. Delete index hash.

  2. Stop timer and print statistics.
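
The control flow of the loop reduces to a backward scan over index entries; a minimal sketch, with an illustrative callback standing in for the broadcast-and-fetch steps above:

    /* try_fetch stands in for the broadcast-and-fetch steps above and returns
     * 0 on success; indices run from oldest (0) to most recent (num_ckpts-1) */
    int fetch_most_recent(int num_ckpts, int (*try_fetch)(int index))
    {
      for (int i = num_ckpts - 1; i >= 0; i--) {
        if (try_fetch(i) == 0) {
          return i;  /* rank 0 marks this checkpoint as current */
        }
        /* rank 0 marks this checkpoint as failed and tries the next one */
      }
      return -1;     /* no checkpoint could be fetched */
    }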

SCR_Need_checkpoint

Determines whether a checkpoint should be taken. This function is implemented in scr.c.

  1. If not enabled, bail out with error.

  2. If not initialized, bail out with error.

  3. Increment the scr_need_checkpoint_id counter. We use this counter so the user can specify that the application should checkpoint after every so many calls to SCR_Need_checkpoint.

  4. Check whether we need to halt. If so, then set need checkpoint flag to true.

  5. Rank 0 checks several conditions to make the decision (sketched below): the user has called SCR_Need_checkpoint a set number of times, the maximum time between consecutive checkpoints has expired, or the ratio of the total checkpoint time to the total compute time is below a threshold.

  6. Rank 0 broadcasts the decision to all other tasks.
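
A hedged sketch of the decision and broadcast in steps 5 and 6; all parameter names are illustrative rather than SCR's actual configuration knobs:

    #include <mpi.h>

    int need_checkpoint(int calls, int every_n_calls,
                        double secs_since_ckpt, double max_secs,
                        double ckpt_secs, double compute_secs,
                        double max_overhead, MPI_Comm comm)
    {
      int rank, flag = 0;
      MPI_Comm_rank(comm, &rank);
      if (rank == 0) {
        /* user asked to checkpoint every N calls */
        if (every_n_calls > 0 && calls % every_n_calls == 0) flag = 1;
        /* too long since the last checkpoint */
        if (max_secs > 0.0 && secs_since_ckpt >= max_secs) flag = 1;
        /* checkpoint overhead is still below the allowed fraction */
        if (max_overhead > 0.0 && compute_secs > 0.0 &&
            ckpt_secs / compute_secs < max_overhead) flag = 1;
      }
      /* everyone gets rank 0's decision */
      MPI_Bcast(&flag, 1, MPI_INT, 0, comm);
      return flag;
    }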

SCR_Start_checkpoint

Prepares the cache for a new checkpoint. This function is implemented in scr.c.

  1. If not enabled, bail out with error.

  2. If not initialized, bail out with error.

  3. If this is being called from within a Start/Complete pair, bail out with error.

  4. Issue a barrier here so that processes don’t delete checkpoint files from the cache before we’re sure that all processes will actually make it this far.

BARRIER

  1. Stop timer of compute phase, and log this compute section.

  2. Increment scr_dataset_id and scr_checkpoint_id.

  3. Get redundancy descriptor for this checkpoint id.

  4. Start timer for checkpoint phase, and log start of checkpoint.

  5. Get a list of all datasets in cache.

  6. Get store descriptor associated with redundancy descriptor.

  7. Determine how many checkpoints are currently in the cache directory specified by the store descriptor.

  8. Delete oldest datasets from this directory until we have sufficient room for this new checkpoint. When selecting checkpoints to delete, skip checkpoints that are being flushed. If the only option is a checkpoint that is being flushed, wait for it to complete then delete it.

  9. Free the list of checkpoints.

  10. Rank 0 fills in the dataset descriptor hash and broadcasts it.

  11. Store dataset hash in filemap.

  12. Add flush descriptor entries to filemap for this dataset.

  13. Store redundancy descriptor in filemap.

  14. Write filemap to disk.

  15. Create dataset directory in cache.

SCR_Route_file

Given the name of a file, return the string the caller should use to access the file. This function is implemented in scr.c.

  1. If not enabled, bail out with error.

  2. If not initialized, bail out with error.

  3. Lookup redundancy descriptor for current checkpoint id.

  4. Build the path into the dataset directory in cache according to the redundancy descriptor.

  5. If called from within a Start/Complete pair, add the file name to the filemap. Record in the filemap the original file name as specified by the caller, the absolute path to the file, and the number of ranks in the job. Update the filemap on disk.

  6. Otherwise, we assume we are in a restart, so check whether we can read the file, and return error if not. The goal in this case is to provide a mechanism for a process to determine whether it can read its checkpoint file from cache during a restart.

  7. Return success.

SCR_Complete_checkpoint

Applies redundancy scheme to checkpoint files, may flush checkpoint to parallel file system, and may exit run if the run should be halted. This function is implemented in scr.c.

  1. If not enabled, bail out with error.

  2. If not initialized, bail out with error.

  3. If not called from within Start/Complete pair, bail out with error.

  4. Record file size and valid flag for each file written during checkpoint.

  5. Write out meta data for each file registered in filemap for this dataset id.

  6. Compute total data size across all procs and determine whether all procs specified a valid write.

  7. Update filemap and write to disk.

  8. Verify that the flush is valid by checking that all files belong to the same subdirectory, and compute container offsets if containers are used.

  9. Apply redundancy scheme specified in redundancy descriptor (Section 0.5.1 or Section 0.5.2).

  10. Stop timer measuring length of checkpoint, and log cost of checkpoint.

  11. If checkpoint was successful, update our flush file, check whether we need to halt, and check whether we need to flush.

  12. If checkpoint was not successful, delete it from cache.

  13. Check whether any ongoing asynchronous flush has completed.

BARRIER

  1. Start timer for start of compute phase, and log start of compute phase.
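
Taken together, SCR_Start_checkpoint, SCR_Route_file, and SCR_Complete_checkpoint form the checkpoint path an application follows. A minimal usage sketch (error handling trimmed; the file name is illustrative):

    #include <stdio.h>
    #include "scr.h"

    /* write one checkpoint file per rank through SCR's cache */
    void write_checkpoint(int rank, const void* state, size_t bytes)
    {
      SCR_Start_checkpoint();

      /* ask SCR where this file should live in cache */
      char name[SCR_MAX_FILENAME];
      char path[SCR_MAX_FILENAME];
      snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
      SCR_Route_file(name, path);

      FILE* fp = fopen(path, "wb");
      int valid = (fp != NULL) && (fwrite(state, 1, bytes, fp) == bytes);
      if (fp != NULL) {
        fclose(fp);
      }

      /* every process reports whether its own write succeeded */
      SCR_Complete_checkpoint(valid);
    }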

scr_reddesc_apply_partner

Algorithm to compute PARTNER redundancy scheme. Caller provides a filemap, a redundancy descriptor, and a dataset id. This function is implemented in scr_reddesc_apply.c.

  1. Get pointer to partner state structure from copy_state field in redundancy descriptor.

  2. Read list of files for this rank for the specified checkpoint.

  3. Inform our right-hand partner how many files we’ll send.

  4. Record number of files we expect to receive from our left-hand partner in our filemap.

  5. Remember the node name where our left-hand partner is running (used during scavenge).

  6. Record the redundancy descriptor hash for our left-hand partner. Each process needs to be able to recover its own redundancy descriptor hash after a failure, so we make a copy in our partner’s filemap.

  7. Write filemap to disk.

  8. Get checkpoint directory we’ll copy partner’s files to.

  9. While we have a file to send or receive, loop.

LOOP

  1. If we have a file to send, get the file name.

  2. Exchange file names with left-hand and right-hand partners.

  3. If our left-hand partner will be sending us a file, add the file name to our filemap, and write out our filemap.

  4. Exchange files by calling scr_swap_files(), and update filemap with meta data for file.

EXIT LOOP

  1. Write filemap to disk.

  2. Free the list of file names.

scr_reddesc_apply_xor

Algorithm to compute XOR redundancy scheme. Caller provides a filemap, a redundancy descriptor, and a dataset id. The XOR set is the group of processes defined by the communicator specified in the redundancy descriptor. This function is implemented in scr_reddesc_apply.c.

  1. Get pointer to XOR state structure from copy_state field in redundancy descriptor.

  2. Allocate buffers to send and receive data.

  3. Count the number of files this process wrote during the specified dataset id. Allocate space to record a file descriptor, the file name, and the size of each file.

  4. Record the redundancy descriptor hash for our left-hand partner in our filemap. Each process needs to be able to recover its own redundancy descriptor hash after a failure, so each process sends a copy to its right-hand partner.

  5. Allocate a hash to hold the header of our XOR redundancy file.

  6. Record the global ranks of the MPI tasks in our XOR set.

  7. Record the dataset id in our header.

  8. Open each of our files, get the size of each file, and read the meta data for each file.

  9. Create a hash and record our rank, the number of files we have, and the meta data for each file.

  10. Send this hash to our right-hand partner, and receive equivalent hash from left-hand partner.

  11. Record our hash along with the hash from our left-hand partner in our XOR header hash. This way, the meta data for each file is recorded in the headers of two different XOR files.

  12. Determine chunk size for RAID algorithm (Section XOR algorithm) and record this size in the XOR header (a sketch of this computation follows the list).

  13. Determine full path name for XOR file.

  14. Record XOR file name in our filemap and update the filemap on disk.

  15. Open the XOR file for writing.

  16. Write header to file and delete header hash.

  17. Execute RAID algorithm and write data to XOR file (Section XOR algorithm).

  18. Close and fsync our XOR file and close each of our dataset files.

  19. Free scratch space memory and MPI buffers.

  20. Write out meta data file for XOR file.

  21. If SCR_CRC_ON_COPY is specified, compute CRC32 checksum of XOR file.
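
The chunk size chosen in step 12 must be identical across the set. The sketch below shows one consistent way to compute it, under the assumption that each process's files are treated as a single logical byte stream divided among the N-1 data slots; see Section XOR algorithm for the authoritative description:

    #include <stddef.h>
    #include <mpi.h>

    size_t xor_chunk_size(size_t my_bytes, MPI_Comm set_comm)
    {
      int ranks;
      MPI_Comm_size(set_comm, &ranks);

      /* largest per-process total across the set */
      unsigned long mine = (unsigned long)my_bytes;
      unsigned long max_bytes;
      MPI_Allreduce(&mine, &max_bytes, 1, MPI_UNSIGNED_LONG, MPI_MAX, set_comm);

      /* divide among the ranks-1 data slots, rounding up */
      unsigned long slots = (unsigned long)(ranks - 1);
      return (size_t)((max_bytes + slots - 1) / slots);
    }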

SCR_Finalize

Shuts down the SCR library, flushes most recent checkpoint to the parallel file system, and frees data structures. This function is implemented in scr.c.

  1. If not enabled, bail out with error.

  2. If not initialized, bail out with error.

  3. Stop timer measuring length of compute phase.

  4. Add reason for exiting to halt file. We assume the user really wants to stop once the application calls SCR_Finalize. We add a reason to the halt file so we know not to start another run after we exit from this one.

  5. Complete or stop any ongoing asynchronous flush.

  6. Flush most recent checkpoint if we still need to.

  7. Disconnect logging functions.

  8. Free internal data structures.

  9. Call DTCMP_Finalize if used.