.. _xor:

XOR
---

The ``XOR`` redundancy scheme divides the set of processes in the run into
subsets, called *XOR sets*. For each dataset, each process in a set computes
and stores redundancy data in an *XOR file*. This file is stored in the
dataset subdirectory within the cache directory alongside any files that the
application process writes.

The ``XOR`` redundancy scheme is designed such that the dataset files for
any single member of a set can be reconstructed using the dataset files and
``XOR`` files from all remaining members. Thus, all dataset files can be
recovered, even if the files for one process from each set are lost. On the
other hand, if any set loses files for two or more processes, the ``XOR``
redundancy scheme cannot recover all files.

The processes within each set are ordered, and each process has a *rank* in
the set, counting up from 0. The process whose rank in the set is one less
than the rank of the current process is called the *left neighbor*, and the
process whose rank is one more is the *right neighbor*. The last rank wraps
back to the first to form a ring. At run time, the library caches the
``XOR`` set in the MPI communicator associated with a redundancy descriptor.
Each process also caches information about its left and right neighbor
processes in the redundancy descriptor.

SCR ensures that it does not select two processes from the same failure
group to be in the same ``XOR`` set. The ``SCR_SET_SIZE`` parameter
determines the minimum number of processes to include in a set. The
selection algorithm is implemented in ``scr_reddesc_create()`` and
``scr_reddesc_create_xor()`` in ``scr_reddesc.c``, as well as
``scr_set_partners()`` in ``scr_util_mpi.c``.

.. _raid:

XOR algorithm
~~~~~~~~~~~~~

The ``XOR`` redundancy scheme applies the algorithm described in [Ross]_
(which is based on [RAID5]_). Assuming that each process writes one file and
that the files on all processes are the same size, this algorithm is
illustrated in Figure :ref:`xor_fig`. Given :math:`N` processes in the set,
each file is logically partitioned into :math:`N-1` chunks, and an empty,
zero-padded chunk is logically inserted into the file at alternating
positions depending on the rank of the process. Then a reduce-scatter is
computed across the set of logical files. The resulting chunk from this
reduce-scatter is the data that the process stores in its ``XOR`` file.

.. _xor_fig:

.. figure:: fig/xor.png
   :alt: XOR reduce-scatter
   :name: fig:xor
   :width: 12cm

   XOR reduce-scatter

In general, different processes may write different numbers of files, and
file sizes may be arbitrary. Figure :ref:`xor_general` illustrates how the
algorithm extends to this general case. First, we logically concatenate all
of the files a process writes into a single file. We then compute the
minimum chunk size such that :math:`N-1` chunks are equal to or larger than
the largest logical file. Finally, we pad the end of each logical file with
zeros, such that each logical file extends to the number of bytes contained
in :math:`N-1` chunks. This extension is efficient when all processes write
about the same amount of data. A short sketch of the chunk size computation
appears after the figure.

.. _xor_general:

.. figure:: fig/xor_general.png
   :alt: Extension to multiple files
   :width: 12cm

   Extension to multiple files
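The minimum chunk size amounts to a ceiling division of the largest logical
file size by :math:`N-1`. The following is a minimal sketch of that
computation; ``xor_chunk_size`` is an illustrative helper, not an actual SCR
function:

.. code-block:: c

   #include <stdint.h>

   /* Illustrative sketch (not SCR's actual code): compute the minimum
    * chunk size such that set_size-1 chunks cover the largest logical
    * file in the set. */
   uint64_t xor_chunk_size(uint64_t max_file_size, int set_size)
   {
     uint64_t num_chunks = (uint64_t) set_size - 1;

     /* ceiling division: the smallest size for which
      * num_chunks * size >= max_file_size */
     return (max_file_size + num_chunks - 1) / num_chunks;
   }

For instance, with a set of 4 processes and a largest logical file of
524297 bytes, this yields a chunk size of 174766 bytes, consistent with the
``CHUNK`` field in the example header shown later in this section.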
In practice, to read from this logical file, we first open each physical
file and then call ``scr_read_pad_n()``. As input, this function takes an
array of file names, an array of file sizes, and an array of opened file
descriptors, along with an integer defining how many elements are in each
array, as well as an offset and the number of bytes to read. It returns data
as if the set of files were concatenated into a single file in the order
specified by the arrays. This read also pads the end of the concatenated
file with zeros if the read extends past the amount of real data. There is a
corresponding ``scr_write_pad_n()`` function to issue writes to this logical
file. These functions are implemented in ``scr_io.c``. This way, we can
operate as though each process has exactly one file, where each file has the
same length, and that length is evenly divisible by :math:`N-1`.

For an efficient reduce-scatter implementation, we use an algorithm that
achieves the following goals:

#. Evenly distributes the work among all processes in the set.

#. Structures communication so that a process always receives data from its
   left neighbor and sends data to its right neighbor. This helps eliminate
   network contention.

#. Only reads data from each checkpoint file once, and only writes data to
   the ``XOR`` file once. This minimizes file accesses, which may be slow.

#. Operates on the data in small pieces, so that the working set fits within
   the processor’s cache.

To accomplish this, we divide each chunk into a series of smaller pieces,
and we operate on each piece in phases. In the first phase, we compute the
reduce-scatter result for the first piece of all chunks. Then, in the second
phase, we compute the reduce-scatter result for the second piece of all
chunks, and so on. In each phase, the reduce-scatter computation is
pipelined among the processes. The first phase of this reduce-scatter
algorithm is illustrated in Figure :ref:`reduce_scatter`, and a simplified
sketch of the loop structure follows the figure. This algorithm is
implemented in ``scr_reddesc_apply_xor()`` in ``scr_reddesc_apply.c``.

.. _reduce_scatter:

.. figure:: fig/reduce_scatter.png
   :alt: XOR reduce-scatter implementation
   :width: 12cm

   XOR reduce-scatter implementation
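The loop structure of one phase of this pipelined reduce-scatter can be
sketched as below. This is a simplified illustration, not SCR’s
implementation: the hypothetical ``read_piece()`` callback stands in for
reads from the padded logical file via ``scr_read_pad_n()``, error handling
is omitted, and the chunk-to-rank assignment is one valid choice rather than
SCR’s exact layout.

.. code-block:: c

   #include <mpi.h>
   #include <stdlib.h>
   #include <string.h>

   /* Simplified sketch of one phase of the pipelined XOR reduce-scatter
    * (not SCR's actual code).  Each process folds its local piece into
    * the partial result arriving from its left neighbor and forwards the
    * new partial result to its right neighbor.  After ranks-1 steps, each
    * process holds one fully reduced piece of its XOR chunk. */
   static void xor_reduce_scatter_phase(
     MPI_Comm comm, int rank, int ranks, size_t piece_size,
     void (*read_piece)(int chunk, char* buf), char* result)
   {
     int left  = (rank - 1 + ranks) % ranks;
     int right = (rank + 1) % ranks;

     char* work  = malloc(piece_size); /* outgoing partial XOR result */
     char* recv  = malloc(piece_size); /* incoming partial XOR result */
     char* local = malloc(piece_size); /* local piece read from file */

     /* prime the pipeline with the local piece of chunk 'rank' */
     read_piece(rank, work);

     for (int step = 0; step < ranks - 1; step++) {
       /* chunk whose partial result arrives from the left this step */
       int chunk = (rank - step - 1 + ranks) % ranks;

       /* forward our partial result right; receive the next from left */
       MPI_Sendrecv(work, (int) piece_size, MPI_BYTE, right, 0,
                    recv, (int) piece_size, MPI_BYTE, left,  0,
                    comm, MPI_STATUS_IGNORE);

       /* fold our local contribution into the received partial result */
       read_piece(chunk, local);
       for (size_t i = 0; i < piece_size; i++) {
         recv[i] ^= local[i];
       }

       /* the updated partial result becomes our next outgoing message */
       char* tmp = work; work = recv; recv = tmp;
     }

     /* 'work' now holds this process's piece of its XOR chunk */
     memcpy(result, work, piece_size);
     free(work); free(recv); free(local);
   }

Swapping the ``work`` and ``recv`` buffers after each step avoids copying
the partial result, and operating piece by piece keeps the working set
small, per the fourth goal above.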
XOR file
~~~~~~~~

The ``XOR`` file contains a header, which is stored as a hash, followed by
the ``XOR`` chunk data, which is stored as binary data. The header provides
information on the process that wrote the file, meta data for the process’s
files, and the group of processes that belong to its ``XOR`` set. SCR also
makes a copy of the meta data for a process’s files in the header of the
``XOR`` file written by the process’s right neighbor. This way, SCR can
recover all meta data even if one ``XOR`` file is lost. An example header is
shown below::

   DSET
     COMPLETE
       1
     SIZE
       2097182
     FILES
       4
     ID
       6
     NAME
       scr.dataset.6
     CREATED
       1312850690668536
     USER
       user1
     JOBNAME
       simulation123
     JOBID
       112573
     CKPT
       6
   RANKS
     4
   GROUP
     RANKS
       4
     RANK
       0
         0
       1
         1
       2
         2
       3
         3
   CHUNK
     174766
   CURRENT
     RANK
       3
     FILES
       1
     FILE
       0
         FILE
           rank_3.ckpt
         TYPE
           FULL
         RANKS
           4
         ORIG
           rank_3.ckpt
         PATH
           /p/lscratchb/user1/simulation123
         NAME
           rank_3.ckpt
         SIZE
           524297
         COMPLETE
           1
   PARTNER
     RANK
       2
     FILES
       1
     FILE
       0
         FILE
           rank_2.ckpt
         TYPE
           FULL
         RANKS
           4
         ORIG
           rank_2.ckpt
         PATH
           /p/lscratchb/user1/simulation123
         NAME
           rank_2.ckpt
         SIZE
           524296
         COMPLETE
           1

The topmost ``DSET`` hash records the descriptor of the dataset that the
``XOR`` file belongs to, and the topmost ``RANKS`` field records the number
of ranks in the run (i.e., the size of ``scr_comm_world``). The ``GROUP``
hash records the set of processes in the ``XOR`` set. The number of
processes in the set is listed under the ``RANKS`` field, and a mapping of a
process’s rank in the group to its rank in ``scr_comm_world`` is stored
under the ``RANK`` hash. The size of the ``XOR`` chunk in bytes is specified
in the ``CHUNK`` field. The meta data for the checkpoint files written by
the process is recorded under the ``CURRENT`` hash, and a copy of the meta
data for the checkpoint files written by the left neighbor is recorded under
the ``PARTNER`` hash. Each of these hashes records the rank of the process
(in ``scr_comm_world``) under ``RANK``, the number of checkpoint files the
process wrote under ``FILES``, and an ordered list of meta data for each
file under the ``FILE`` hash. Each checkpoint file is assigned an integer
index, counting up from 0, which specifies the order in which the files were
logically concatenated to compute the ``XOR`` chunk. The meta data for each
file is then recorded under its index.

At times, ``XOR`` files from different processes reside in the same
directory, so SCR specifies a unique name for the ``XOR`` file on each
process. Furthermore, SCR encodes certain information in the file name to
simplify the task of grouping files belonging to the same set. SCR assigns a
unique integer id to each ``XOR`` set. To select this id, SCR computes the
minimum rank in ``scr_comm_world`` of all processes in the set and uses that
rank as the set id. SCR then incorporates a process’s rank within its set,
the size of its set, and its set id into its file name, such that the
``XOR`` file name is of the form:
``<set_rank>_of_<set_size>_in_<set_id>.xor``.

XOR rebuild
~~~~~~~~~~~

SCR provides two different methods to rebuild files using the ``XOR``
scheme. If a run is restarted and a dataset is stored in cache, then SCR
rebuilds files during ``SCR_Init()``. On the other hand, at the end of an
allocation, SCR can rebuild files after scavenging a dataset from cache.
This section discusses the method used in ``SCR_Init()``. For discussion of
rebuilding during a scavenge, see the Scavenge section and the scavenge
discussion under Program Flow.

During ``SCR_Init()`` in a restarted run, SCR uses MPI to rebuild files in
parallel. The processes in each set check whether any files are missing and
whether the missing files can be rebuilt. If so, the processes identify
which rank in the set needs its files rebuilt. This rank is then set as the
root of a reduction over the data in the remaining application files and
``XOR`` files to reconstruct the missing data. SCR implements a reduction
algorithm that achieves the same goals as the reduce-scatter described in
Section :ref:`raid`. Namely, the implementation attempts to distribute work
evenly among all processes, minimize network contention, and minimize file
accesses. This algorithm is implemented in ``scr_reddesc_recover_xor()`` in
``scr_reddesc_recover.c``. An example is illustrated in Figure
:ref:`xor_reduce`, and a simplified sketch of the reduction follows the
figure.

.. _xor_reduce:

.. figure:: fig/xor_reduce.png
   :alt: Pipelined XOR reduction to root
   :width: 12cm

   Pipelined XOR reduction to root
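The rebuild reduction can be sketched in the same style as the
reduce-scatter. The code below is a hypothetical, simplified illustration of
rebuilding one piece of the failed rank’s data, not SCR’s actual
implementation; the ``read_contribution()`` callback stands in for reading
and combining the process’s checkpoint and ``XOR`` file data, and error
handling is omitted.

.. code-block:: c

   #include <mpi.h>
   #include <stdlib.h>

   /* Hypothetical sketch of the pipelined XOR reduction used to rebuild
    * one piece of a failed process's data (not SCR's actual code).
    * 'root' is the rank in the set whose files were lost.  Each surviving
    * process XORs its own contribution (read from its checkpoint and XOR
    * files, represented by read_contribution) into the partial result
    * flowing around the ring toward the root. */
   static void xor_rebuild_piece(
     MPI_Comm comm, int rank, int ranks, int root, size_t piece_size,
     void (*read_contribution)(char* buf), char* rebuilt)
   {
     int left  = (rank - 1 + ranks) % ranks;
     int right = (rank + 1) % ranks;

     if (rank == root) {
       /* the failed process receives the fully reduced piece from its
        * left neighbor: this is its rebuilt data */
       MPI_Recv(rebuilt, (int) piece_size, MPI_BYTE, left, 0,
                comm, MPI_STATUS_IGNORE);
       return;
     }

     char* buf = malloc(piece_size);
     read_contribution(buf);

     /* every rank except the root's right neighbor first receives a
      * partial result from its left neighbor and XORs it in */
     if (rank != (root + 1) % ranks) {
       char* partial = malloc(piece_size);
       MPI_Recv(partial, (int) piece_size, MPI_BYTE, left, 0,
                comm, MPI_STATUS_IGNORE);
       for (size_t i = 0; i < piece_size; i++) {
         buf[i] ^= partial[i];
       }
       free(partial);
     }

     /* forward the partial result one hop toward the root */
     MPI_Send(buf, (int) piece_size, MPI_BYTE, right, 0, comm);
     free(buf);
   }

As with the reduce-scatter, applying this exchange piece by piece pipelines
the reduction among the processes in the set.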
.. [Ross] W. Gropp, R. Ross, and N. Miller, “Providing Efficient I/O
   Redundancy in MPI Environments,” in Lecture Notes in Computer Science,
   3241:77-86, September 2004. 11th European PVM/MPI Users Group Meeting,
   2004.

.. [RAID5] D. Patterson, G. Gibson, and R. Katz, “A Case for Redundant
   Arrays of Inexpensive Disks (RAID),” in Proc. of 1988 ACM SIGMOD Conf. on
   Management of Data, 1988.