Redundancy schemesο
SINGLEο
With the SINGLE
redundancy scheme, SCR keeps a single copy of each
dataset file. It tracks meta data on application files, but no
redundancy data is stored. The communicator in the redundancy descriptor
is a duplicate of MPI_COMM_SELF
. During a restart, files are
distributed according to the current process mapping, but if a failure
leads to a loss of files, affected datasets are simply deleted from the
cache.
Although the lack of redundancy prevents SINGLE
from being useful in
cases where access to storage is lost, in practice many failures are due
to application-level software which do not impact storage accessibility.
This scheme is also useful when writing to highly reliable storage.
PARTNERο
TODO: This scheme assumes node-local storage.
With PARTNER
a full copy of every dataset file is made. The partner
process is always selected from a different failure group, so that itβs
unlikely for a process and its partner to fail at the same time.
When creating the redundancy descriptor, SCR splits scr_comm_world
into subcommunicators each of which contains at most one process from
each failure group. Within this communicator, each process picks the
process whose rank is one more than its own (right-hand side) to send
its copies, and it stores copies of the files from the process whose
rank is one less (left-hand side). Processes at the end of the
communicator wrap around to find partners. The hostname, the rank within
the redundancy communicator, and the global rank of the left and right
neighbors are stored in the copy_state
field of the redundancy
descriptor. This is all implemented in scr_reddesc_create()
and
scr_reddesc_create_partner()
in scr_reddesc.c
.
When applying the redundancy scheme, each process sends its files to its right neighbor. The meta data for each file is transferred and stored, as well as, the redundancy descriptor hash for the process. Each process writes the copies to the same dataset directory in which it wrote its own original files. Note that if a process and its partner share access to the same storage, then this scheme should not be used.
Each process also records the name of the node for which it is serving as the partner. This information is used during a scavenge in order to target the partner node for a copy if the source node has failed. This is a useful optimization when the cache is in node-local storage.
During the distribute phase of a restart, a process may obtain its files
from either the original copy or the partner copy. If neither is
available, the distribute fails and the dataset is deleted from cache.
If the distribute phase succeeds, the PARTNER
scheme is immediately
applied again to restore the redundancy.
During the first round of a scavenge, only original files are copied from cache to the parallel file system. If the scavenge fails to copy data from some nodes, the second round attempts to target just the relevant partner nodes which it identified from the partner key recorded in the filemap. This optimization avoids unnecessarily copying every files twice, the original plus its copy.