SCR_Init
--------

During ``SCR_Init()``, the library allocates and initializes its data
structures. It inspects the cache, distributing and rebuilding files for any
cached datasets. If no usable checkpoint is found in cache, it attempts to
fetch the most recent checkpoint from the parallel file system. This function
is implemented in ``scr.c``.

#. Interpret ``$SCR_ENABLE``; if not enabled, bail out with error.
#. Call ``DTCMP_Init`` if the datatype compare library is available.
#. Create ``scr_comm_world`` by duplicating ``MPI_COMM_WORLD``, and set
   ``scr_my_rank_world`` and ``scr_ranks_world``.
#. Call ``scr_env_nodename`` to get the hostname, and store it in
   ``scr_my_hostname``.
#. Call ``getpagesize`` to get the memory page size, and store it in
   ``scr_page_size``.
#. Initialize parameters: rank 0 reads any config files and broadcasts the
   info to the other ranks (a sketch of this pattern follows this list).
#. Check whether we are still enabled (a config file may disable us), and
   bail out with error if not.
#. Check that ``scr_username`` and ``scr_jobid`` are defined, which are used
   for logging purposes.
#. Call ``scr_groupdescs_create`` to create group descriptors
   (Section :ref:`0.1.1 <groupdesc_create>`).
#. Call ``scr_storedescs_create`` to create store descriptors
   (Section :ref:`0.1.2 <storedesc_create>`).
#. Call ``scr_reddescs_create`` to create redundancy descriptors
   (Section :ref:`0.1.3 <reddesc_create>`).
#. Check that we have a valid redundancy descriptor that we can use for each
   checkpoint.
#. Create ``scr_comm_node`` and ``scr_comm_node_across``.
#. Log the start of this run, if logging is enabled.
#. Verify that ``scr_prefix`` is defined.
#. Define ``scr_prefix_scr`` to be the ``.scr`` directory within the prefix
   directory.
#. Create the ``.scr`` directory in the prefix directory.
#. Define the control directory and save it in ``scr_cntl_prefix``.
#. Create the control directory.
#. Create each of the cache directories.

BARRIER

#. Define file names for the halt, flush, nodes, transfer, and filemap files.
#. Delete any existing transfer file.
#. Create the nodes file, writing the total number of nodes in the job (the
   max of the size of ``scr_comm_level``).
#. Allocate a hash to hold halt status, and initialize halt seconds if
   needed.

BARRIER

#. Stop any ongoing asynchronous flush.
#. Check whether we need to halt and exit this run.
#. Enable ``scr_flush_on_restart`` and disable ``scr_fetch`` if
   ``scr_global_restart`` is set.
#. Create an empty hash for the filemap.
#. The master process for each store reads and distributes info from the
   filemaps (Section :ref:`0.1.4 <distribute_filemap>`).
#. Call ``scr_cache_rebuild`` to rebuild datasets in cache
   (Section :ref:`0.1.5 <rebuild_loop>`).
#. If the rebuild was successful, call ``scr_flush_file_rebuild`` to update
   the flush file.
#. If the rebuild was successful, check whether the dataset should be
   flushed.
#. If we don't have a checkpoint (the rebuild failed), call
   ``scr_cache_purge`` to clear the cache (delete all files).
#. If fetch is enabled, call ``scr_fetch_sync`` to read a checkpoint from
   the parallel file system (Section :ref:`0.1.12 <fetch_loop>`).
#. If the fetch failed, clear the cache again.

BARRIER

#. Log the end of initialization.
#. Start the timer to record the length of the compute phase and log the
   start of the compute phase.
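The read-and-broadcast step for parameters can be illustrated with a small
sketch: rank 0 reads the configuration file and broadcasts its contents so
that every process parses identical data. The function and file handling
here are a hypothetical illustration of the pattern, not the library's code:

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>
   #include "mpi.h"

   /* Hypothetical sketch: rank 0 reads the config file and broadcasts
    * its contents so every process can parse identical data. */
   char* bcast_config(const char* path, MPI_Comm comm)
   {
     int rank;
     MPI_Comm_rank(comm, &rank);

     long size = 0;
     char* buf = NULL;
     if (rank == 0) {
       FILE* fp = fopen(path, "r");
       if (fp != NULL) {
         fseek(fp, 0, SEEK_END);
         size = ftell(fp);
         rewind(fp);
         buf = (char*) malloc(size + 1);
         size_t got = fread(buf, 1, (size_t) size, fp);
         buf[got] = '\0';
         size = (long) got;
         fclose(fp);
       }
     }

     /* broadcast the size, then the contents */
     MPI_Bcast(&size, 1, MPI_LONG, 0, comm);
     if (buf == NULL) {
       /* non-root ranks, or a missing file on rank 0 */
       buf = (char*) calloc(size + 1, 1);
     }
     if (size > 0) {
       MPI_Bcast(buf, (int) size, MPI_CHAR, 0, comm);
     }
     return buf;
   }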
.. _groupdesc_create:

scr_groupdescs_create
~~~~~~~~~~~~~~~~~~~~~

The group descriptors are kept in the ``scr_groupdescs`` array
(Section :ref:`Group descriptors <group_descriptors>`). This function is
implemented in ``scr_groupdesc.c``.

#. Read the ``GROUPS`` key from ``scr_groupdesc_hash``, which is set while
   processing parameters in ``SCR_Init``.
#. Count the number of groups and add two (for the ``NODE`` and ``WORLD``
   descriptors).
#. Allocate space for the ``scr_groupdescs`` array.
#. Initialize each group descriptor.
#. Create the descriptor called ``NODE`` for all tasks on the same node.
#. Create the descriptor called ``WORLD`` for all tasks in the job.
#. Create each group descriptor specified in ``scr_groupdesc_hash`` by
   calling ``scr_groupdesc_create_by_str``. Rank 0 broadcasts the group name
   to determine the order.

The ``scr_groupdesc_create_by_str`` function creates groups of processes by
splitting ``scr_comm_world`` into subcommunicators containing all procs that
specify the same string. The real work is delegated to ``scr_rank_str``,
which is implemented in ``scr_split.c``. It executes a bitonic sort on the
string names, and it returns the number of distinct groups across all procs
and the id of the group to which the calling process belongs. This id is
then used in a call to ``MPI_Comm_split``.
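The effect of ``scr_rank_str`` can be illustrated with a simplified,
non-scalable stand-in that gathers every string to rank 0 to assign group
ids; the production code instead sorts the strings in parallel to remain
scalable at large process counts. All names here are hypothetical:

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include "mpi.h"

   #define NAME_LEN 256

   /* Hypothetical, simplified stand-in for scr_rank_str: every process
    * that passes the same string lands in the same subcommunicator. */
   int split_by_string(const char* str, MPI_Comm comm, MPI_Comm* newcomm)
   {
     int rank, ranks;
     MPI_Comm_rank(comm, &rank);
     MPI_Comm_size(comm, &ranks);

     /* gather a fixed-size copy of each string to rank 0 */
     char name[NAME_LEN];
     snprintf(name, sizeof(name), "%s", str);
     char* all = NULL;
     if (rank == 0) {
       all = (char*) malloc((size_t) ranks * NAME_LEN);
     }
     MPI_Gather(name, NAME_LEN, MPI_CHAR, all, NAME_LEN, MPI_CHAR, 0, comm);

     /* rank 0 assigns one color per distinct string */
     int* colors = NULL;
     if (rank == 0) {
       colors = (int*) malloc(ranks * sizeof(int));
       int groups = 0;
       for (int i = 0; i < ranks; i++) {
         colors[i] = -1;
         for (int j = 0; j < i; j++) {
           if (strcmp(all + i * NAME_LEN, all + j * NAME_LEN) == 0) {
             colors[i] = colors[j];
             break;
           }
         }
         if (colors[i] == -1) {
           colors[i] = groups++;
         }
       }
     }

     /* deliver each process its color and split the communicator */
     int color;
     MPI_Scatter(colors, 1, MPI_INT, &color, 1, MPI_INT, 0, comm);
     MPI_Comm_split(comm, color, rank, newcomm);

     if (rank == 0) {
       free(all);
       free(colors);
     }
     return 0;
   }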
.. _storedesc_create:

scr_storedescs_create
~~~~~~~~~~~~~~~~~~~~~

The store descriptors are kept in the ``scr_storedescs`` array
(Section :ref:`Store descriptors `). This function is implemented in
``scr_storedesc.c``.

#. Read the ``STORE`` key from ``scr_storedesc_hash``, which is set while
   processing parameters in ``SCR_Init``.
#. Count the number of store descriptors.
#. Allocate space for the ``scr_storedescs`` array.
#. Sort the store descriptors to ensure they are in the same order on all
   procs.
#. Create each store descriptor specified in ``scr_storedesc_hash`` by
   calling ``scr_storedesc_create_from_hash``.
#. Create a store descriptor for the control directory and save it in
   ``scr_storedesc_cntl``.

The ``scr_storedesc_create_from_hash`` function sets all fields in the
descriptor using default values or values defined in the hash. A key field
is a communicator consisting of the group of processes that share the
associated storage device. This communicator is used to coordinate processes
when accessing the device. It is created by duplicating a communicator from
a group descriptor.

.. _reddesc_create:

scr_reddescs_create
~~~~~~~~~~~~~~~~~~~

The redundancy descriptors are kept in the ``scr_reddescs`` array
(Section :ref:`Redundancy descriptors `). This function is implemented in
``scr_reddesc.c``.

#. Read the ``CKPT`` key from ``scr_reddesc_hash``, which is set while
   processing parameters in ``SCR_Init``.
#. Count the number of redundancy descriptors.
#. Allocate space for the ``scr_reddescs`` array.
#. Sort the redundancy descriptors to ensure they are in the same order on
   all procs.
#. Create each redundancy descriptor specified in ``scr_reddesc_hash`` by
   calling ``scr_reddesc_create_from_hash``.

The ``scr_reddesc_create_from_hash`` function sets all fields in the
descriptor from default values or values defined in the hash. Two key fields
are an index into the store descriptors, which provides details on the class
of storage to use, and a communicator on which to compute redundancy data.
To build the communicator, ``scr_comm_world`` is split into subcommunicators
consisting of processes from different failure groups.
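That split can be sketched as follows, assuming each process already knows
the id of its failure group (for example, its node id) and its rank within
that group; both inputs are hypothetical names here, and this is an
illustration of the idea rather than the library's code:

.. code-block:: c

   #include "mpi.h"

   /* Hypothetical sketch: splitting with the within-group rank as the
    * color yields subcommunicators whose members all come from
    * different failure groups, which is what the redundancy schemes
    * need in order to survive the loss of a whole group. */
   MPI_Comm cross_group_comm(int group_id, int group_rank)
   {
     MPI_Comm comm;
     MPI_Comm_split(MPI_COMM_WORLD,
                    group_rank, /* color: same position, different groups */
                    group_id,   /* key: order members by group id */
                    &comm);
     return comm;
   }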
.. _distribute_filemap:

scr_scatter_filemaps
~~~~~~~~~~~~~~~~~~~~

During a restart, the master process for each control directory reads in all
filemap data and distributes this data to the other processes sharing the
control directory, if any. After this distribution phase, a process is
responsible for each file it has filemap data for, and each file in cache is
the responsibility of some process. We use this approach to handle cases
where the number of tasks accessing the control directory in the current run
is different from the number of tasks in the prior run. This function is
implemented in ``scr_cache_rebuild.c``.

#. Master reads the master filemap file.
#. Master creates an empty filemap and reads each filemap file listed in the
   master filemap, deleting each filemap file as it is read.
#. Gather the list of global rank ids sharing the store to the master
   process.
#. If the filemap has data for a rank, master prepares a hash to send the
   corresponding data to that rank.
#. Master evenly distributes the remainder of the filemap data among all
   processes.
#. Distribute the filemap data via ``scr_hash_exchange()`` (the exchange
   pattern is sketched after this list).
#. Master writes a new master filemap file.
#. Each process writes a new filemap file.
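The sparse exchange pattern used here, and again later in this chapter, can
be sketched with plain MPI collectives. This illustrates the pattern only;
it is not the implementation of ``scr_hash_exchange``, which exchanges
serialized hash objects and avoids dense collectives at scale:

.. code-block:: c

   #include <stdlib.h>
   #include "mpi.h"

   /* Hypothetical sketch: each process holds a byte buffer destined
    * for an arbitrary subset of ranks. An alltoall on the counts tells
    * every process how much to expect from each peer, then the
    * payloads move in one alltoallv. */
   void sparse_exchange(const char* sendbuf, const int* sendcounts,
                        const int* sdispls, char** recvbuf_out,
                        int** recvcounts_out, MPI_Comm comm)
   {
     int ranks;
     MPI_Comm_size(comm, &ranks);

     /* learn how many bytes each peer will send us */
     int* recvcounts = (int*) malloc(ranks * sizeof(int));
     MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

     /* compute receive displacements and the total size */
     int* rdispls = (int*) malloc(ranks * sizeof(int));
     int total = 0;
     for (int i = 0; i < ranks; i++) {
       rdispls[i] = total;
       total += recvcounts[i];
     }
     char* recvbuf = (char*) malloc(total > 0 ? total : 1);

     /* move the payloads */
     MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_CHAR,
                   recvbuf, recvcounts, rdispls, MPI_CHAR, comm);

     free(rdispls);
     *recvbuf_out = recvbuf;
     *recvcounts_out = recvcounts;
   }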
.. _rebuild_loop:

scr_cache_rebuild
~~~~~~~~~~~~~~~~~

This section describes the logic used to distribute and rebuild files in
cache. SCR attempts to rebuild all cached datasets. This functionality is
implemented in ``scr_cache_rebuild.c``.

#. Start timer.
#. Delete any files from cache known to be incomplete.
#. Get the list of dataset ids currently in cache.

LOOP

#. Identify the dataset with the lowest id across all procs yet to be
   rebuilt.
#. If there is no dataset id specified on any process, break the loop.
#. Otherwise, log which dataset we are attempting to rebuild.
#. Distribute the hash for this dataset and store it in the map object
   (Section :ref:`0.1.6 <distribute_dset_hash>`).
#. If we fail to distribute the hash to all processes, delete this dataset
   from cache and loop.
#. Distribute the redundancy descriptors for this dataset and store them in
   a temporary redundancy descriptor object
   (Section :ref:`0.1.7 <distribute_reddesc>`). This informs each process
   about the cache device and the redundancy scheme to use for this dataset.
#. If we fail to distribute the redundancy descriptors to all processes,
   delete this dataset from cache and loop.
#. Create the dataset directory in cache according to the redundancy
   descriptor.
#. Distribute files to the ranks that wrote them
   (Section :ref:`0.1.8 <distribute_files>`). The owner ranks may now be on
   different nodes.
#. Rebuild any missing files for this dataset using the redundancy scheme
   specified in the redundancy descriptor
   (Section :ref:`0.1.9 <rebuild_files>`).
#. If the rebuild fails, delete this dataset from cache and loop.
#. Otherwise, the rebuild succeeded. Update ``scr_dataset_id`` and
   ``scr_checkpoint_id`` if the id for the current dataset is higher, so
   that we continue counting up from this number when assigning ids to later
   datasets.
#. Unset the FLUSHING flag in the flush file.
#. Free the temporary redundancy descriptor.

EXIT LOOP

#. Stop timer and log whether we were able to rebuild any dataset from
   cache.

.. _distribute_dset_hash:

scr_distribute_datasets
~~~~~~~~~~~~~~~~~~~~~~~

Given a filemap and dataset id, distribute the dataset hash and store it in
the filemap.

#. Create an empty send hash for transferring dataset hashes.
#. Get the list of ranks that we have files for as part of the specified
   dataset.
#. For each rank, look up the dataset hash from the filemap and add it to
   the send hash.
#. Delete the list of ranks.
#. Check that no rank identified an invalid rank. If the restarted run uses
   a smaller number of processes than the previous run, we may (but are not
   guaranteed to) discover this condition here.
#. Identify the smallest rank that has a copy of the dataset hash.
#. Return with failure if no such rank exists.
#. Otherwise, broadcast the hash from this rank.
#. Store the dataset hash in the filemap and write the filemap to disk.
#. Delete the send hash.

.. _distribute_reddesc:

scr_distribute_reddescs
~~~~~~~~~~~~~~~~~~~~~~~

Given a filemap and dataset id, distribute the redundancy descriptor that
was applied to the dataset and store it in the filemap. This recreates the
same group and redundancy scheme that was applied to the dataset, even if
the user has configured new schemes for the current run.

#. Create an empty send hash for transferring redundancy descriptor hashes.
#. Get the list of ranks that we have files for as part of the specified
   dataset.
#. For each rank, look up the redundancy descriptor hash from the filemap
   and add it to the send hash.
#. Delete the list of ranks.
#. Check that no rank identified an invalid rank. If the restarted run uses
   a smaller number of processes than the previous run, we may (but are not
   guaranteed to) discover this condition here.
#. Execute a sparse data exchange with ``scr_hash_exchange``.
#. Check that each rank received its descriptor, and return with failure if
   not.
#. Store the redundancy descriptor hash in the filemap and write the filemap
   to disk.
#. Create the redundancy descriptor by calling
   ``scr_reddesc_create_from_filemap``.
#. Delete the send and receive hashes from the exchange.

.. _distribute_files:

scr_distribute_files
~~~~~~~~~~~~~~~~~~~~

This section describes the algorithm used to distribute files for a
specified dataset. SCR transfers files from their current location to the
storage device accessible from the node where the owner rank is now running.
The algorithm operates over a number of rounds. In each round, a process may
send files to at most one other process. A process may only send files if it
has all of the files written by the owner process. The caller specifies a
filemap, a redundancy descriptor, and a dataset id as input. This
implementation is in ``scr_cache_rebuild.c``.

#. Delete all bad (incomplete or inaccessible) files from the filemap.
#. Get the list of ranks that we have files for as part of the specified
   dataset.
#. From this list, set a start index to the position corresponding to the
   first rank that is equal to or greater than our own rank (looping back to
   rank 0 if we pass the last rank). We stagger the start index across
   processes in this way to help distribute load later (a sketch of this
   round assignment appears at the end of this section).
#. Check that no rank identified an invalid rank while scanning for its
   start index. If the restarted run uses a smaller number of processes than
   the previous run, we may (but are not guaranteed to) discover this
   condition here.
#. Allocate arrays to record which rank we can send files to in each round.
#. Check that we have all files for each rank, and record the round in which
   we can send them. The round we pick here is affected by the start index
   computed earlier.
#. Issue a sparse global exchange via ``scr_hash_exchange`` to inform each
   process in which round we can send it its files, and receive similar
   messages from other processes.
#. Search for the minimum round in which we can retrieve our own files, and
   remember the corresponding round and source rank. If we can fetch files
   from ourselves, we'll always select this option, as it will be the
   minimum round.
#. Free the list of ranks we have files for.
#. Determine whether all processes can obtain their files, and bail with
   error if not.
#. Determine the maximum round any process needs to get its files.
#. Identify which rank we'll get our files from, and issue a sparse global
   exchange to distribute this info.
#. Determine which ranks want to receive files from us, if any, and record
   the round in which they want to receive their files.
#. Get the directory name for this dataset.
#. Loop through the maximum number of rounds and exchange files.

LOOP ROUNDS

#. Check whether we can send files to a rank in this round, and if so,
   record the destination and number of files.
#. Check whether we need to receive our files in this round, and if so,
   record the source rank.
#. If we need to send files to ourselves, just move (rename) each file,
   update the filemap, and loop to the next round.
#. Otherwise, if we have files for this round but the owner rank does not
   need them, delete them.
#. If we do not need to send or receive any files this round, loop to the
   next round.
#. Otherwise, exchange the number of files we'll be sending and/or
   receiving, and record the expected number that we'll receive in our
   filemap.
#. If we're sending files, get the list of files for the destination.
#. Enter the exchange loop.

LOOP EXCHANGE

#. Get the next file name from our list of files to send, if any remain.
#. Swap file names with partners.
#. If we'll receive a file in this iteration, add the file name to the
   filemap and write out our filemap.
#. Transfer the file via ``scr_swap_files()``. This call overwrites the
   outgoing file (if any) with the incoming file (if any), so there is no
   need to delete the outgoing file. If there is no incoming file, it
   deletes the outgoing file (if any). We use this approach to conserve
   storage space, since we assume the cache is small. We also transfer file
   metadata with this function.
#. If we sent a file, remove that file from our filemap and write out the
   filemap.
#. Decrement the number of files we have to send / receive by one. When both
   counts hit zero, break the exchange loop.
#. Write the updated filemap to disk.

EXIT LOOP EXCHANGE

#. Free the list of files that we sent in this round.

EXIT LOOP ROUNDS

#. If we have files for more ranks than there were rounds, delete the files
   for all remaining ranks.
#. Write out the filemap file.
#. Delete bad files (incomplete or inaccessible) from the filemap.
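The staggered round assignment can be made concrete with a small sketch,
under the assumption that each process holds a sorted list of the owner
ranks for its files; the names and exact bookkeeping here are hypothetical:

.. code-block:: c

   /* Hypothetical sketch of the round assignment. owner[] is the
    * sorted list of ranks whose files this process holds, n its
    * length, my_rank this process's rank in scr_comm_world, and
    * round[i] receives the round in which we could send owner[i] its
    * files. Starting each sender at the first owner at or above its
    * own rank staggers the senders so that no round is overloaded. */
   void assign_rounds(const int* owner, int n, int my_rank, int* round)
   {
     /* start index: first owner at or above our own rank */
     int start = 0;
     while (start < n && owner[start] < my_rank) {
       start++;
     }
     if (start == n) {
       start = 0; /* wrap back to the first entry in the list */
     }

     /* walk the list circularly; the i-th owner visited could be
      * served in round i */
     for (int i = 0; i < n; i++) {
       round[(start + i) % n] = i;
     }
   }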
.. _rebuild_files:

scr_reddesc_recover
~~~~~~~~~~~~~~~~~~~

This function attempts to rebuild any missing files for a dataset. It
returns ``SCR_SUCCESS`` on all processes if successful; it returns
``!SCR_SUCCESS`` on all processes otherwise. The caller specifies a filemap,
a redundancy descriptor, and a dataset id as input. This function is
implemented in ``scr_reddesc_recover.c``.

#. Attempt to rebuild files according to the redundancy scheme specified in
   the redundancy descriptor. Currently, only ``XOR`` can actually rebuild
   files (Section :ref:`0.1.10 <attempt_rebuild_files_xor>`).
#. If the rebuild failed, return with an error.
#. Otherwise, check that all processes have all of their files for the
   dataset.
#. If not, return with an error.
#. If so, reapply the redundancy scheme, if needed. There is no need to do
   this with ``XOR``, since it performs this step as part of the rebuild.

.. _attempt_rebuild_files_xor:

scr_reddesc_recover_xor_attempt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before we attempt to rebuild files using the ``XOR`` redundancy scheme, we
first check whether it is possible. If we detect that two or more processes
from the same ``XOR`` set are missing files, we cannot recover all of the
files, and there is no point in rebuilding any of them. We execute this
check in ``scr_reddesc_recover.c``. The caller specifies a filemap, a
redundancy descriptor, and a dataset id as input.

#. Check whether we have our dataset files, and check whether we have our
   ``XOR`` file. If we're missing any of these files, assume that we're
   missing them all.
#. Count the number of processes in our ``XOR`` set that need their files.
   We can recover all files in a set so long as no more than a single member
   needs its files.
#. Check whether we can recover files for all sets; if not, bail with an
   error.
#. If the current process is in a set which needs to be rebuilt, identify
   which rank needs its files and call ``scr_reddesc_recover_xor()`` to
   rebuild the files (Section :ref:`0.1.11 <rebuild_files_xor>`).
#. Check that the rebuild succeeded on all tasks; return an error if not,
   and otherwise return success.

.. _rebuild_files_xor:

scr_reddesc_recover_xor
~~~~~~~~~~~~~~~~~~~~~~~

We invoke this routine within each ``XOR`` set that is missing files. The
caller specifies a filemap, a redundancy descriptor, and a dataset id as
input, as well as the rank of the process in the ``XOR`` set that is missing
its files. We refer to the process that needs to rebuild its files as the
*root*. This function is implemented in ``scr_reddesc_recover.c``.

ALL

#. Get a pointer to the ``XOR`` state structure from the ``copy_state``
   field of the redundancy descriptor.
#. Allocate an empty hash to hold the header of our ``XOR`` file.

NON-ROOT

#. Get the name of our ``XOR`` file.
#. Open the ``XOR`` file for reading.
#. Read the header from the file.
#. From the header, get the hash of files we wrote.
#. From this file hash, get the number of files we wrote.
#. Allocate arrays to hold a file descriptor, file name, and file size for
   each of our files.
#. Get the path of the dataset directory from the ``XOR`` file name.
#. Open each of our files for reading, and store the file descriptor, file
   name, and file size of each file in our arrays.
#. If the failed rank is to our left, send it our header. Our header stores
   a copy of the file hash for the rank to our left under the ``PARTNER``
   key.
#. If the failed rank is to our right, send it our file hash. When the
   failed rank rebuilds its ``XOR`` file, it needs to record our file hash
   in its header under the ``PARTNER`` key.

ROOT

#. Receive the ``XOR`` header from the rank to our right.
#. Rename the ``PARTNER`` key in this header to ``CURRENT``. The rank to our
   right stored a copy of our file hash under ``PARTNER``.
#. Receive the file hash from the rank to our left, and store it under
   ``PARTNER`` in our header.
#. Get our file hash from the ``CURRENT`` key in the header.
#. From our file hash, get the number of files we wrote during the dataset.
#. Allocate arrays to hold a file descriptor, file name, and file size for
   each of our files.
#. Build the file name for our ``XOR`` file, and add the ``XOR`` file to the
   filemap.
#. For each of our files, get the meta data from the file hash, then get the
   file name and file size from the meta data. Add the file name to the
   filemap, and record the file name and file size in our arrays.
#. Record the number of files we expect to have in the filemap, including
   the ``XOR`` file.
#. Write out the filemap.
#. Open the ``XOR`` file for writing.
#. Open each of our dataset files for writing, and record the file
   descriptors in our file descriptor array.
#. Write out the ``XOR`` header to the ``XOR`` file.

ALL

#. Read the ``XOR`` chunk size from the header.
#. Allocate buffers to send and receive data during the reduction.
#. Execute a pipelined ``XOR`` reduction to the root to reconstruct the
   missing data, as illustrated in Figure :ref:`xor_reduce` (a simplified
   sketch of this reduction appears at the end of this section). For a full
   description of the redundancy scheme, see
   Section :ref:`XOR algorithm `.
#. Close our ``XOR`` file.
#. Close each of our dataset files.

ROOT

#. For each of our dataset files and our ``XOR`` file, update the filemap.
#. Write the filemap to disk.
#. Also compute and record a CRC32 checksum for each file if
   ``SCR_CRC_ON_COPY`` is set.

ALL

#. Free the data buffers.
#. Free the arrays for file descriptors, file names, and file sizes.
#. Free the ``XOR`` header hash.
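The reduction can be sketched as a ring that folds each member's chunk into
a partial result on its way to the root. The real implementation pipelines
the chunk in fixed-size pieces so that all links stay busy; this simplified
sketch, with hypothetical names, moves the whole chunk at once:

.. code-block:: c

   #include <string.h>
   #include "mpi.h"

   /* Hypothetical sketch: the partial XOR travels around the ring
    * starting at the process just past the root, each member folding
    * in its own chunk, until the root receives the XOR of every other
    * member's contribution. comm is the XOR-set communicator (at
    * least two members), n the chunk size in bytes. */
   void xor_reduce_to_root(char* buf, const char* local, int n,
                           int root, MPI_Comm comm)
   {
     int rank, ranks;
     MPI_Comm_rank(comm, &rank);
     MPI_Comm_size(comm, &ranks);

     int right = (rank + 1) % ranks;
     int left  = (rank + ranks - 1) % ranks;

     if (rank == root) {
       /* root: receive the accumulated XOR of all other chunks */
       MPI_Recv(buf, n, MPI_CHAR, left, 0, comm, MPI_STATUS_IGNORE);
     } else {
       if (rank == (root + 1) % ranks) {
         /* first link in the chain: seed with our own chunk */
         memcpy(buf, local, n);
       } else {
         /* fold our chunk into the partial result from the left */
         MPI_Recv(buf, n, MPI_CHAR, left, 0, comm, MPI_STATUS_IGNORE);
         for (int i = 0; i < n; i++) {
           buf[i] ^= local[i];
         }
       }
       MPI_Send(buf, n, MPI_CHAR, right, 0, comm);
     }
   }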
.. _fetch_loop:

scr_fetch_sync
~~~~~~~~~~~~~~

This section describes the loop used to fetch a checkpoint from the parallel
file system. SCR starts with the most recent checkpoint on the parallel file
system, as specified in the index file. If SCR fails to fetch this
checkpoint, it then works backwards and attempts to fetch the next most
recent checkpoint until it either succeeds or runs out of checkpoints. It
acquires the list of available checkpoints from the index file. This
functionality is implemented within ``scr_fetch.c``.

#. Start timer.
#. Rank 0 reads the index file from the prefix directory; bail if the read
   fails.

LOOP

#. Rank 0 selects a target directory name. Start with the directory marked
   as current, if set, and otherwise use the most recent checkpoint
   specified in the index file. On successive iterations, attempt the
   checkpoint that is the next most recent.
#. Rank 0 records the fetch attempt in the index file.
#. Rank 0 builds the full path to the dataset.
#. Broadcast the dataset path from rank 0.
#. Attempt to fetch the checkpoint from the selected directory.
#. If the fetch fails, rank 0 deletes the "current" designation from the
   dataset and marks the dataset as "failed" in the index file.
#. If the fetch succeeds, rank 0 updates the "current" designation to point
   to this dataset in the index file; break the loop.

EXIT LOOP

#. Delete the index hash.
#. Stop timer and print statistics.

SCR_Need_checkpoint
-------------------

Determines whether a checkpoint should be taken. This function is
implemented in ``scr.c``.

#. If not enabled, bail out with error.
#. If not initialized, bail out with error.
#. Increment the ``scr_need_checkpoint_id`` counter. We use this counter so
   the user can specify that the application should checkpoint after every
   so many calls to ``SCR_Need_checkpoint``.
#. Check whether we need to halt. If so, set the need-checkpoint flag to
   true.
#. Rank 0 checks various properties to make a decision: the user has called
   ``SCR_Need_checkpoint`` an appropriate number of times, or the maximum
   time between consecutive checkpoints has expired, or the ratio of the
   total checkpoint time to the total compute time is below a threshold
   (sketched after this list).
#. Rank 0 broadcasts the decision to all other tasks.
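The rank 0 decision can be sketched as follows. The three conditions come
from the list above, but the parameter names and structure here are
illustrative, not the library's:

.. code-block:: c

   /* Hypothetical sketch of the rank 0 decision; the result is
    * broadcast to all other tasks. */
   int need_checkpoint(int calls, int every,
                       double since_last, double max_seconds,
                       double ckpt_secs, double compute_secs,
                       double max_ratio)
   {
     int flag = 0;
     /* the user asked to checkpoint after every N calls */
     if (every > 0 && calls % every == 0) {
       flag = 1;
     }
     /* too much time has passed since the last checkpoint */
     if (max_seconds > 0 && since_last >= max_seconds) {
       flag = 1;
     }
     /* checkpoints remain cheap relative to compute time */
     if (compute_secs > 0 && ckpt_secs / compute_secs < max_ratio) {
       flag = 1;
     }
     return flag;
   }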
SCR_Start_checkpoint
--------------------

Prepares the cache for a new checkpoint. This function is implemented in
``scr.c``.

#. If not enabled, bail out with error.
#. If not initialized, bail out with error.
#. If this is being called from within a Start/Complete pair, bail out with
   error.
#. Issue a barrier so that processes don't delete checkpoint files from the
   cache before we're sure that all processes will actually make it this
   far.

BARRIER

#. Stop the timer for the compute phase, and log this compute section.
#. Increment ``scr_dataset_id`` and ``scr_checkpoint_id``.
#. Get the redundancy descriptor for this checkpoint id.
#. Start the timer for the checkpoint phase, and log the start of the
   checkpoint.
#. Get a list of all datasets in cache.
#. Get the store descriptor associated with the redundancy descriptor.
#. Determine how many checkpoints are currently in the cache directory
   specified by the store descriptor.
#. Delete the oldest datasets from this directory until we have sufficient
   room for the new checkpoint. When selecting checkpoints to delete, skip
   checkpoints that are being flushed. If the only option is a checkpoint
   that is being flushed, wait for the flush to complete and then delete it.
#. Free the list of checkpoints.
#. Rank 0 fills in the dataset descriptor hash and broadcasts it.
#. Store the dataset hash in the filemap.
#. Add flush descriptor entries to the filemap for this dataset.
#. Store the redundancy descriptor in the filemap.
#. Write the filemap to disk.
#. Create the dataset directory in cache.

SCR_Route_file
--------------

Given the name of a file, return the string the caller should use to access
the file. This function is implemented in ``scr.c``.

#. If not enabled, bail out with error.
#. If not initialized, bail out with error.
#. Look up the redundancy descriptor for the current checkpoint id.
#. Build the path to the dataset directory in cache according to the
   redundancy descriptor.
#. If called from within a Start/Complete pair, add the file name to the
   filemap. Record the original file name as specified by the caller, the
   absolute path to the file, and the number of ranks in the job in the
   filemap. Update the filemap on disk.
#. Otherwise, we assume we are in a restart, so check whether we can read
   the file, and return an error if not. The goal in this case is to provide
   a mechanism for a process to determine whether it can read its checkpoint
   file from cache during a restart.
#. Return success.

SCR_Complete_checkpoint
-----------------------

Applies the redundancy scheme to the checkpoint files, may flush the
checkpoint to the parallel file system, and may exit the run if the run
should be halted. This function is implemented in ``scr.c``.

#. If not enabled, bail out with error.
#. If not initialized, bail out with error.
#. If not called from within a Start/Complete pair, bail out with error.
#. Record the file size and valid flag for each file written during the
   checkpoint.
#. Write out meta data for each file registered in the filemap for this
   dataset id.
#. Compute the total data size across all procs, and determine whether all
   procs specified a valid write.
#. Update the filemap and write it to disk.
#. Verify that the flush is valid by checking that all files belong to the
   same subdirectory, and compute container offsets if containers are used.
#. Apply the redundancy scheme specified in the redundancy descriptor
   (Section :ref:`0.5.1 <copy_partner>` or Section :ref:`0.5.2 <copy_xor>`).
#. Stop the timer measuring the length of the checkpoint, and log the cost
   of the checkpoint.
#. If the checkpoint was successful, update our flush file, check whether we
   need to halt, and check whether we need to flush.
#. If the checkpoint was not successful, delete it from cache.
#. Check whether any ongoing asynchronous flush has completed.

BARRIER

#. Start the timer for the compute phase, and log the start of the compute
   phase.

.. _copy_partner:

scr_reddesc_apply_partner
~~~~~~~~~~~~~~~~~~~~~~~~~

Algorithm to compute the ``PARTNER`` redundancy scheme. The caller provides
a filemap, a redundancy descriptor, and a dataset id. This function is
implemented in ``scr_reddesc_apply.c``.

#. Get a pointer to the partner state structure from the ``copy_state``
   field in the redundancy descriptor.
#. Read the list of files for this rank for the specified checkpoint.
#. Inform our right-hand partner how many files we'll send.
#. Record the number of files we expect to receive from our left-hand
   partner in our filemap.
#. Remember the node name where our left-hand partner is running (used
   during scavenge).
#. Record the redundancy descriptor hash for our left-hand partner. Each
   process needs to be able to recover its own redundancy descriptor hash
   after a failure, so we make a copy in our partner's filemap.
#. Write the filemap to disk.
#. Get the checkpoint directory we'll copy our partner's files to.
#. While we have a file to send or receive, loop.

LOOP

#. If we have a file to send, get the file name.
#. Exchange file names with our left-hand and right-hand partners (a sketch
   of this handshake follows this section).
#. If our left-hand partner will be sending us a file, add the file name to
   our filemap, and write out our filemap.
#. Exchange files by calling ``scr_swap_files()``, and update the filemap
   with the meta data for the file.

EXIT LOOP

#. Write the filemap to disk.
#. Free the list of file names.
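The per-iteration file-name handshake in the loop above maps naturally onto
``MPI_Sendrecv``. A minimal sketch, with hypothetical names; the actual byte
transfer then happens inside ``scr_swap_files()``:

.. code-block:: c

   #include "mpi.h"

   /* Hypothetical sketch: tell the right-hand partner (rhs) which
    * file we will send next while learning from the left-hand partner
    * (lhs) which file it will send us. An empty string can signal
    * "no more files". */
   void exchange_names(const char* send_name, char* recv_name, int len,
                       int lhs, int rhs, MPI_Comm comm)
   {
     MPI_Sendrecv(send_name, len, MPI_CHAR, rhs, 0,
                  recv_name, len, MPI_CHAR, lhs, 0,
                  comm, MPI_STATUS_IGNORE);
   }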
.. _copy_xor:

scr_reddesc_apply_xor
~~~~~~~~~~~~~~~~~~~~~

Algorithm to compute the ``XOR`` redundancy scheme. The caller provides a
filemap, a redundancy descriptor, and a dataset id. The ``XOR`` set is the
group of processes defined by the communicator specified in the redundancy
descriptor. This function is implemented in ``scr_reddesc_apply.c``.

#. Get a pointer to the ``XOR`` state structure from the ``copy_state``
   field in the redundancy descriptor.
#. Allocate buffers to send and receive data.
#. Count the number of files this process wrote during the specified dataset
   id. Allocate space to record a file descriptor, the file name, and the
   size of each file.
#. Record the redundancy descriptor hash for our left-hand partner in our
   filemap. Each process needs to be able to recover its own redundancy
   descriptor hash after a failure, so each process sends a copy to its
   right-hand partner.
#. Allocate a hash to hold the header of our ``XOR`` redundancy file.
#. Record the global ranks of the MPI tasks in our ``XOR`` set.
#. Record the dataset id in our header.
#. Open each of our files, get the size of each file, and read the meta data
   for each file.
#. Create a hash and record our rank, the number of files we have, and the
   meta data for each file.
#. Send this hash to our right-hand partner, and receive the equivalent hash
   from our left-hand partner.
#. Record our hash along with the hash from our left-hand partner in our
   ``XOR`` header hash. This way, the meta data for each file is recorded in
   the headers of two different ``XOR`` files.
#. Determine the chunk size for the RAID algorithm
   (Section :ref:`XOR algorithm `) and record this size in the ``XOR``
   header.
#. Determine the full path name for the ``XOR`` file.
#. Record the ``XOR`` file name in our filemap, and update the filemap on
   disk.
#. Open the ``XOR`` file for writing.
#. Write the header to the file, and delete the header hash.
#. Execute the RAID algorithm and write the data to the ``XOR`` file
   (Section :ref:`XOR algorithm `).
#. Close and fsync our ``XOR`` file, and close each of our dataset files.
#. Free the scratch space memory and MPI buffers.
#. Write out the meta data file for the ``XOR`` file.
#. If ``SCR_CRC_ON_COPY`` is specified, compute a CRC32 checksum of the
   ``XOR`` file.

SCR_Finalize
------------

Shuts down the SCR library, flushes the most recent checkpoint to the
parallel file system, and frees data structures. This function is
implemented in ``scr.c``.

#. If not enabled, bail out with error.
#. If not initialized, bail out with error.
#. Stop the timer measuring the length of the compute phase.
#. Add the reason for exiting to the halt file. We assume the user really
   wants to stop once the application calls ``SCR_Finalize``, so we add a
   reason to the halt file so we know not to start another run after we exit
   from this one.
#. Complete or stop any ongoing asynchronous flush.
#. Flush the most recent checkpoint if we still need to.
#. Disconnect the logging functions.
#. Free internal data structures.
#. Call ``DTCMP_Finalize`` if used.
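To close, here is a minimal application skeleton tying together the API
calls described in this chapter. Error handling is elided, and the
checkpoint contents are application-specific:

.. code-block:: c

   #include <stdio.h>
   #include "mpi.h"
   #include "scr.h"

   int main(int argc, char* argv[])
   {
     MPI_Init(&argc, &argv);
     SCR_Init(); /* rebuilds cache or fetches a checkpoint */

     int done = 0;
     while (!done) {
       /* ... compute ... */

       int flag = 0;
       SCR_Need_checkpoint(&flag);
       if (flag) {
         SCR_Start_checkpoint();

         /* map our file name into the cache path assigned by SCR */
         char file[SCR_MAX_FILENAME];
         SCR_Route_file("rank.ckpt", file);

         int valid = 1;
         FILE* fp = fopen(file, "w");
         if (fp != NULL) {
           /* ... write checkpoint data to fp ... */
           fclose(fp);
         } else {
           valid = 0;
         }

         /* apply redundancy; may flush or halt the run */
         SCR_Complete_checkpoint(valid);
       }

       done = 1; /* placeholder for the real termination test */
     }

     SCR_Finalize(); /* flushes the most recent checkpoint */
     MPI_Finalize();
     return 0;
   }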