XOR
The XOR
redundancy scheme divides the set of processes in the run
into subsets, called XOR sets. For each dataset, each process in a set
computes and stores redundancy data in an XOR file. This file is
stored in the dataset subdirectory within the cache directory along side
any files that the application process writes.
The XOR
redundancy scheme is designed such that the dataset files
for any single member of a set can be reconstructed using the dataset
files and XOR
files from all remaining members. Thus, all dataset
files can be recovered, even if the files for one process from each set
are lost. On the other hand, if any set loses files for two or more
processes, the XOR
redundancy scheme cannot recover all files.
The processes within each set are ordered, and each process has a rank
in the set, counting up from 0. The process whose rank in the set is one
less than the rank of the current process is called the left neighbor,
and the process whose rank is one more is the right neighbor. The last
rank wraps back to the first to form a ring. At run time, the library
caches the XOR
set in the MPI communicator associated with a
redundancy descriptor. Each process also caches information abouts its
left and right neighbor processes in the redundancy descriptor.
SCR ensures that it does not select two processes from the same failure
group to be in the same XOR
set. The SCR_SET_SIZE
parameter
determines the minimum number of processes to include in a set. The
selection algorithm is implemented in scr_reddesc_create()
and
scr_reddesc_create_xor()
in scr_reddesc.c
, as well as
scr_set_partners()
in scr_util_mpi.c
.
XOR algorithm
The XOR
redundancy scheme applies the algorithm described
in [Ross] (which is based on [RAID5]). Assuming that each process
writes one file and that the files on all processes are the same size,
this algorithm is illustrated in Figure 1. Given
\(N\) processes in the set, each file is logically partitioned into
\(N-1\) chunks, and an empty, zero-padded chunk is logically
inserted into the file at alternating positions depending on the rank of
the process. Then a reduce-scatter is computed across the set of logical
files. The resulting chunk from this reduce-scatter is the data that the
process stores in its XOR
file.
In general, different processes may write different numbers of files, and file sizes may be arbitrary. In Figure Extension to multiple files, we illustrate how to extend the algorithm for the general case. First, we logically concatenate all of the files a process writes into a single file. We then compute the minimum chunk size such that \(N-1\) chunks are equal to or larger than the largest logical file. Finally, we pad the end of each logical file with zeros, such that each logical file extends to the number of bytes contained in \(N-1\) chunks. This extension is efficient when all processes write about the same amount of data.
In practice, to read from this logical file, we first open each physical
file, and then we call scr_read_pad_n()
. As input, this function
takes an array of file names, an array of file sizes, and an array of
opened file descriptors, along with an integer defining how many
elements are in each array, as well as, an offset and the number of
bytes to read. It returns data as if the set of files were concatenated
as a single file in the order specified by the arrays. This read also
pads the end of the concatenated file with zeros if the read extends
past the amount of real data. There is a corresponding
scr_write_pad_n()
function to issue writes to this logical file.
These functions are implemented in scr_io.c
.
This way, we can operate as though each process has exactly one file, where each file has the same length and is evenly divisible by \(N-1\). For an efficient reduce-scatter implementation, we use an algorithm that achieves the following goals:
Evenly distributes the work among all processes in the set.
Structures communication so that a process always receives data from its left neighbor and sends data to its right neighbor. This is useful to eliminate network contention.
Only reads data from each checkpoint file once, and only writes data to the
XOR
file once. This minimizes file accesses, which may be slow.Operates on the data in small pieces, so that the working set fits within the processor’s cache.
To accomplish this, we divide each chunk into a series of smaller
pieces, and we operate on each piece in phases. In the first phase, we
compute the reduce-scatter result for the first piece of all chunks.
Then, in the second phase, we compute the reduce-scatter result for the
second piece of all chunks, and so on. In each phase, the reduce-scatter
computation is pipelined among the processes. The first phase of this
reduce-scatter algorithm is illustrated in
Figure XOR reduce-scatter implementation. This algorithm
is implemented in scr_reddesc_apply_xor()
in
scr_reddesc_apply.c
.
XOR file
The XOR
file contains a header, which is stored as a hash, followed
by the XOR
chunk data, which is stored as binary data. The header
provides information on the process that wrote the file, meta data for
the process’s files, and the group of processes that belong to its
XOR
set. SCR also makes a copy of the meta data for a process’s
files in the header of the XOR
file written by the process’s right
neighbor. This way, SCR can recover all meta data even if one XOR
file is lost. An example header is shown below:
DSET
COMPLETE
1
SIZE
2097182
FILES
4
ID
6
NAME
scr.dataset.6
CREATED
1312850690668536
USER
user1
JOBNAME
simulation123
JOBID
112573
CKPT
6
RANKS
4
GROUP
RANKS
4
RANK
0
0
1
1
2
2
3
3
CHUNK
174766
CURRENT
RANK
3
FILES
1
FILE
0
FILE
rank_3.ckpt
TYPE
FULL
RANKS
4
ORIG
rank_3.ckpt
PATH
/p/lscratchb/user1/simulation123
NAME
rank_3.ckpt
SIZE
524297
COMPLETE
1
PARTNER
RANK
2
FILES
1
FILE
0
FILE
rank_2.ckpt
TYPE
FULL
RANKS
4
ORIG
rank_2.ckpt
PATH
/p/lscratchb/user1/simulation123
NAME
rank_2.ckpt
SIZE
524296
COMPLETE
1
The topmost DSET
field records the dataset descriptor the XOR
file belongs to, and the topmost RANKS
field records the number of
ranks in the run (i.e., the size of scr_comm_world
). The GROUP
hash records the set of processes in the XOR
set. The number of
processes in the set is listed under the RANKS
field, and a mapping
of a process’s rank in the group to its rank in scr_comm_world
is
stored under the RANK
hash. The size of the XOR
chunk in number
of bytes is specified in the CHUNK
field.
Then, the meta data for the checkpoint files written by the process are
recorded under the CURRENT
hash, and a copy of the meta data for the
checkpoint files written by the left neighbor are recorded under the
PARTNER
hash. Each hash records the rank of the process (in
scr_comm_world
) under RANK
, the number of checkpoint files the
process wrote under FILES
, and a ordered list of meta data for each
file under the FILE
hash. Each checkpoint file is assigned an
integer index, counting up from 0, which specifies the order in which
the files were logically concatenated to compute the XOR
chunk. The
meta data for each file is then recorded under its index.
At times, XOR
files from different processes reside in the same
directory, so SCR specifies a unique name for the XOR
file on each
process. Furthermore, SCR encodes certain information in the file name
to simplify the task of grouping files belonging to the same set. SCR
assigns a unique integer id to each XOR
set. To select this id, SCR
computes the minimum rank in scr_comm_world
of all processes in the
set and uses that rank as the set id. SCR then incorporates a process’s
rank within its set, the size of its set, and its set id into its file
name, such that the XOR
file name is of the form:
<grouprank+1>_of_<groupsize>_in_<groupid>.xor
.
XOR rebuild
SCR provides two different methods to rebuild files using the XOR
scheme. If a run is restarted and a dataset is stored in cache, then SCR
rebuilds files during SCR_Init()
. On the other hand, at the end of
an allocation, SCR can rebuild files after scavenging a dataset from
cache. This section discusses the method used in SCR_Init()
. For
discussion on rebuilding during a scavenge, see
Sections Scavenge and Program Flow>Scavenge.
During SCR_Init()
in a restarted run, SCR uses MPI to rebuild files
in parallel. The processes in each set check whether they need to and
whether they can rebuild any missing files. If so, the processes
identify which rank in the set needs its files rebuilt. This rank is
then set as the root of a reduction over the data in the remaining
application files and XOR
files to reconstruct the missing data. SCR
implements a reduction algorithm that achieves the same goals as the
reduce-scatter described in Section 0.1.1. Namely, the
implementation attempts to distribute work evenly among all processes,
minimize network contention, and minimize file accesses. This algorithm
is implemented in scr_reddesc_recover_xor()
in
scr_reddesc_recover.c
. An example is illustrated in
Figure Pipelined XOR reduction to root.
Gropp, R. Ross, and N. Miller, “Providing Efficient I/O Redundancy in MPI Environments,” in Lecture Notes in Computer Science, 3241:7786, September 2004. 11th European PVM/MPI Users Group Meeting, 2004.
Patterson, G. Gibson, and R. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” in Proc. of 1988 ACM SIGMOD Conf. on Management of Data, 1988.