Redundancy descriptors
Overview
A redundancy descriptor is a data structure that describes how a dataset is cached. It tracks information such as the cache directory that is used, the redundancy scheme that is applied, and the frequency with which this combination should be used. The data structure also records information on the group of processes that make up a redundancy set, such as the number of processes in the set, as well as, a unique integer that identifies the set, called the group id.
There is both a C struct and an equivalent specialized hash for storing redundancy descriptors. The hash is primarily used to persist group information across runs, such that the same process group can be reconstructed in a later run (even if the user changes configuration parameters between runs). These hashes are stored in filemap files. The C struct is used within the library to cache additional runtime information such as an MPI communicator for each group and the location of certain MPI ranks.
During the run, the SCR library maintains an array of redundancy
descriptor structures in a global variable named scr_reddescs
. It
records the number of descriptors in this list in a variable named
scr_nreddescs
. It builds this list during SCR_Init()
using a
series of redundancy descriptor hashes defined in a third variable named
scr_reddesc_hash
. The hashes in this variable are constructed while
processing SCR parameters.
Redundancy descriptor struct
Here is the definition for the C struct.
typedef struct {
int enabled; /* flag indicating whether this descriptor is active */
int index; /* each descriptor is indexed starting from 0 */
int interval; /* how often to apply this descriptor, pick largest such
* that interval evenly divides checkpoint id */
int store_index; /* index into scr_storedesc for storage descriptor */
int group_index; /* index into scr_groupdesc for failure group */
char* base; /* base cache directory to use */
char* directory; /* full directory base/dataset.id */
int copy_type; /* redundancy scheme to apply */
void* copy_state; /* pointer to extra state depending on copy type */
MPI_Comm comm; /* communicator holding procs for this scheme */
int groups; /* number of redundancy sets */
int group_id; /* unique id assigned to this redundancy set */
int ranks; /* number of ranks in this set */
int my_rank; /* caller's rank within its set */
} scr_reddesc;
The enabled
field is set to 0 (false) or 1 (true) to indicate
whether this particular redundancy descriptor may be used. Even though a
redundancy descriptor may be defined, it may be disabled. The index
field records the index number of this redundancy descriptor. This
corresponds to the redundancy descriptor’s index in the scr_reddescs
array. The interval
field describes how often this redundancy
descriptor should be selected for different checkpoints. To choose a
redundancy descriptor to apply to a given checkpoint, SCR picks the
descriptor that has the largest interval value which evenly divides the
checkpoint id.
The store_index
field tracks the index of the store descriptor
within the scr_storedescs
array that describes the storage used with
this redundancy descriptor. The group_index
field tracks the index
of the group descriptor within the scr_groupdescs
array that
describes the group of processes likely to fail at the same time. The
redundancy scheme will protect against failures for this group using the
specified storage device.
The base
field is a character array that records the cache base
directory that is used. The directory
field is a character array
that records the directory in which the dataset subdirectory is created.
This path consists of the cache directory followed by the redundancy
descriptor index directory, such that one must only append the dataset
id to compute the full path of the corresponding dataset directory.
The copy_type
field specifies the type of redundancy scheme that is
applied. It may be set to one of: SCR_COPY_NULL
,
SCR_COPY_SINGLE
, SCR_COPY_PARTNER
, or SCR_COPY_XOR
. The
copy_state
field is a void*
that points to any extra state that
is needed depending on the redundancy scheme.
The remaining fields describe the group of processes that make up the
redundancy set for a particular process. For a given redundancy
descriptor, the entire set of processes in the run is divided into
distinct groups, and each of these groups is assigned a unique integer
id called the group id. The set of group ids may not be contiguous. Each
process knows the total number of groups, which is recorded in the
groups
field, as well as, the id of the group the process is a
member of, which is recorded in the group_id
field.
Since the processes within a group communicate frequently, SCR creates a
communicator for each group. The comm
field is a handle to the MPI
communicator that defines the group the process is a member of. The
my_rank
and ranks
fields cache the rank of the process in this
communicator and the number of processes in this communicator,
respectively.
If the redundancy scheme requires additional information to be kept in
the redundancy descriptor, it allocates additional memory and records a
pointer to it via the copy_state
pointer.
Extra state for PARTNER
The SCR_COPY_PARTNER
scheme allocates the following structure:
typedef struct {
int lhs_rank; /* rank which is one less (with wrap to highest) within set */
int lhs_rank_world; /* rank of lhs process in comm world */
char* lhs_hostname; /* hostname of lhs process */
int rhs_rank; /* rank which is one more (with wrap to lowest) within set */
int rhs_rank_world; /* rank of rhs process in comm world */
char* rhs_hostname; /* hostname of rhs process */
} scr_reddesc_partner;
For SCR_COPY_PARTNER
, the processes within a group form a logical
ring, ordered by their rank in the group. Each process has a left and
right neighbor in this ring. The left neighbor is the process whose rank
is one less than the current process, and the right neighbor is the
process whose rank is one more. The last process in the group wraps back
around to the first. SCR caches information about the ranks to the left
and right of a process. The lhs_rank
, lhs_rank_world
, and
lhs_hostname
fields describe the rank to the left of the process,
and the rhs_rank
, rhs_rank_world
, and rhs_hostname
fields
describe the rank to the right. The lhs_rank
and rhs_rank
fields
record the ranks of the neighbor processes in comm
. The
lhs_rank_world
and rhs_rank_world
fields record the ranks of the
neighbor processes in scr_comm_world
. Finally, the lhs_hostname
and rhs_hostname
fields record the hostnames where those processes
are running.
Extra state for XOR
The SCR_COPY_XOR
scheme allocates the following structure:
typedef struct {
scr_hash* group_map; /* hash that maps group rank to world rank */
int lhs_rank; /* rank which is one less (with wrap to highest) within set */
int lhs_rank_world; /* rank of lhs process in comm world */
char* lhs_hostname; /* hostname of lhs process */
int rhs_rank; /* rank which is one more (with wrap to lowest) within set */
int rhs_rank_world; /* rank of rhs process in comm world */
char* rhs_hostname; /* hostname of rhs process */
} scr_reddesc_xor;
The fields here are similar to the fields of SCR_COPY_PARTNER
with
the exception of an additional group_map
field, which records a hash
that maps a group rank to its rank in MPI_COMM_WORLD
.
Example redundancy descriptor hash
Each redundancy descriptor can be stored in a hash. Here is an example redundancy descriptor hash.
ENABLED
1
INDEX
0
INTERVAL
1
BASE
/tmp
DIRECTORY
/tmp/user1/scr.1145655/index.0
TYPE
XOR
HOP_DISTANCE
1
SET_SIZE
8
GROUPS
1
GROUP_ID
0
GROUP_SIZE
4
GROUP_RANK
0
Most field names in the hash match field names in the C struct, and the
meanings are the same. The one exception is GROUP_RANK
, which
corresponds to my_rank
in the struct. Note that not all fields from
the C struct are recorded in the hash. At runtime, it’s possible to
reconstruct data for the missing struct fields using data from the hash.
In particular, one may recreate the group communicator by calling
MPI_Comm_split()
on scr_comm_world
specifying the GROUP_ID
value as the color and specifying the GROUP_RANK
value as the key.
After recreating the group communicator, one may easily find info for
the left and right neighbors.
Example redundancy descriptor configuration file entries
SCR must be configured with redundancy schemes. By default, SCR protects
against single compute node failures using XOR
, and it caches one
checkpoint in /tmp
. To specify something different, edit a
configuration file to include checkpoint descriptors. Checkpoint
descriptors look like the following.
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# the following instructs SCR to run with three checkpoint configurations:
# - save every 8th checkpoint to /ssd using the PARTNER scheme
# - save every 4th checkpoint (not divisible by 8) to /ssd using XOR with
# a set size of 8
# - save all other checkpoints (not divisible by 4 or 8) to /tmp using XOR with
# a set size of 16
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=16
CKPT=1 INTERVAL=4 GROUP=NODE STORE=/ssd TYPE=XOR SET_SIZE=8
CKPT=2 INTERVAL=8 GROUP=SWITCH STORE=/ssd TYPE=PARTNER
First, one must set the SCR_COPY_TYPE
parameter to “FILE
”.
Otherwise, an implied checkpoint descriptor is constructed using various
SCR parameters including SCR_GROUP
, SCR_CACHE_BASE
,
SCR_COPY_TYPE
, and SCR_SET_SIZE
.
Checkpoint descriptor entries are identified by a leading CKPT
key.
The values of the CKPT
keys must be numbered sequentially starting
from 0. The INTERVAL
key specifies how often a checkpoint is to be
applied. For each checkpoint, SCR selects the descriptor having the
largest interval value that evenly divides the internal SCR checkpoint
iteration number. It is necessary that one descriptor has an interval of
1. This key is optional, and it defaults to 1 if not specified. The
GROUP
key lists the failure group, i.e., the name of the group of
processes likely to fail. This key is optional, and it defaults to the
value of the SCR_GROUP
parameter if not specified. The STORE
key
specifies the directory in which to cache the checkpoint. This key is
optional, and it defaults to the value of the SCR_CACHE_BASE
parameter if not specified. The TYPE
key identifies the redundancy
scheme to be applied. This key is optional, and it defaults to the value
of the SCR_COPY_TYPE
parameter if not specified.
Other keys may exist depending on the selected redundancy scheme. For
XOR
schemes, the SET_SIZE
key specifies the minimum number of
processes to include in each XOR
set.
Common functions
This section describes some of the most common redundancy descriptor
functions. The implementation can be found in scr.c
.
Initializing and freeing redundancy descriptors
Initialize a redundancy descriptor structure (clear its fields).
struct scr_reddesc desc;
scr_reddesc_init(&desc)
Free memory associated with a redundancy descriptor.
scr_reddesc_free(&desc)
Redundancy descriptor array
Allocate and fill in scr_reddescs
array using redundancy descriptor
hashes provided in scr_reddesc_hash
.
scr_reddescs_create()
Free the list of redundancy descriptors.
scr_reddescs_free()
Select a redundancy descriptor for a specified checkpoint id from among
the ndescs
descriptors in the array of descriptor structs pointed to
by descs
.
struct scr_reddesc* desc = scr_reddesc_for_checkpoint(ckpt, ndescs, descs)
Converting between structs and hashes
Convert a redundancy descriptor struct to its equivalent hash.
scr_reddesc_store_to_hash(desc, hash)
This function clears any entries in the specified hash before setting fields according to the struct.
Given a redundancy descriptor hash, build and fill in the fields for its equivalent redundancy descriptor struct.
scr_reddesc_create_from_hash(desc, index, hash)
This function creates a communicator for the redundancy group and fills
in neighbor information relative to the calling process. Note that this
call is collective over scr_comm_world
, because it creates a
communicator. The index value specified in the call is overridden if an
index field is set in the hash.
Interacting with filemaps
Redundancy descriptor hashes are cached in filemaps. There are functions to set, get, and unset a redundancy descriptor hash in a filemap for a given dataset id and rank id (Section Filemap redundancy descriptors). There are additional functions to extract info from a redundancy descriptor hash that is stored in a filemap.
For a given dataset id and rank id, return the base directory associated with the redundancy descriptor stored in the filemap.
char* basedir = scr_reddesc_base_from_filemap(map, dset, rank)
For a given dataset id and rank id, return the path associated with the redundancy descriptor stored in the filemap in which dataset directories are to be created.
char* dir = scr_reddesc_dir_from_filemap(map, dset, rank)
For a given dataset id and rank id, fill in the specified redundancy descriptor struct using the redundancy descriptor stored in the filemap.
scr_reddesc_create_from_filemap(map, dset, rank, desc)
Note that this call is collective over scr_comm_world
, because it
creates a communicator.