Containers

NOTE: This feature is experimental and not yet complete, so it is not documented in the user guide.

SCR requires checkpoint data to be stored primarily as one file per process. However, writing a large number of files can be inefficient or difficult to manage on some file systems. To alleviate this problem, SCR provides an abstraction called “containers”. When writing data to or reading data from the prefix directory, SCR combines multiple application files into a container. Containers are disabled by default. To enable them, set the SCR_USE_CONTAINERS parameter to 1.
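
Like other SCR parameters, this setting can be supplied through the environment or an SCR configuration file, for example as a line of the form:

SCR_USE_CONTAINERS=1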

During a flush, SCR identifies the containers and the offsets within those containers where each file should be stored. SCR records the file-to-container mapping in the rank2file map, which it later references to extract files during the fetch operation.

A container has a maximum size, which is determined by the SCR_CONTAINER_SIZE parameter. This parameter defaults to 100GB. Application file data is packed sequentially within a container until the container is full, and then the remaining data spills over to the next container. The total number of containers required depends on the total number of bytes in the dataset and the container size. A container file name is of the form ctr.<id>.scr, where <id> is the container id, which counts up from 0. All containers are written to the dataset directory within the prefix directory.
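
As an illustration of the sizing and naming described above (this is a sketch, not the actual SCR code), the following computes how many containers a dataset needs and prints their file names:

#include <stdio.h>

/* Illustrative sketch, not the actual SCR code: compute the number of
 * containers needed for a dataset and print their file names.  The last
 * container may be only partially filled. */
void print_container_names(unsigned long total_bytes,
                           unsigned long container_size)
{
  /* ceiling division: any leftover bytes spill into one more container */
  unsigned long num_containers =
    (total_bytes + container_size - 1) / container_size;

  for (unsigned long id = 0; id < num_containers; id++) {
    printf("ctr.%lu.scr\n", id);
  }
}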

SCR combines files in an order such that all files on the same node are grouped sequentially. This limits the number of container files that each compute node must access. For this purpose, SCR creates two global communicators during SCR_Init. Both are defined in scr_globals.c. The scr_comm_node communicator consists of all processes on the same compute node. The scr_comm_node_across communicator consists of all processes having the same rank within scr_comm_node. On each node in the run, the process with rank 0 in scr_comm_node is called the “node leader”.
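
A minimal sketch of how such communicators could be built is shown below. This is for illustration only; the actual construction in scr_globals.c may differ, for example by splitting on hostname rather than MPI_COMM_TYPE_SHARED, and the function name here is hypothetical.

#include <mpi.h>

MPI_Comm scr_comm_node;        /* processes on the same compute node */
MPI_Comm scr_comm_node_across; /* processes with the same rank in scr_comm_node */

/* Hypothetical sketch of building the two communicators described above. */
void build_container_comms(MPI_Comm comm_world)
{
  int rank_world;
  MPI_Comm_rank(comm_world, &rank_world);

  /* group processes that share a compute node */
  MPI_Comm_split_type(comm_world, MPI_COMM_TYPE_SHARED, rank_world,
                      MPI_INFO_NULL, &scr_comm_node);

  /* group processes having the same rank within their node communicator;
   * the processes with node rank 0 form the "node leader" communicator */
  int rank_node;
  MPI_Comm_rank(scr_comm_node, &rank_node);
  MPI_Comm_split(comm_world, rank_node, rank_world, &scr_comm_node_across);
}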

To determine the offset at which each process should write its data, SCR first sums the sizes of all files on the node via a reduce on scr_comm_node. The node leaders then execute a scan across nodes using the scr_comm_node_across communicator to obtain a node offset. A final scan within scr_comm_node produces the offset for each process.
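
The following sketch illustrates this three-step computation with MPI calls. It is a simplified illustration, not the actual SCR implementation; the function name and the broadcast of the node offset are assumptions made for clarity.

#include <mpi.h>

/* Illustrative sketch: compute the container offset at which this process
 * should write its bytes, given its total file size and the two
 * communicators described above. */
unsigned long compute_write_offset(unsigned long my_bytes,
                                   MPI_Comm comm_node,
                                   MPI_Comm comm_node_across)
{
  int rank_node;
  MPI_Comm_rank(comm_node, &rank_node);

  /* 1) sum the bytes of all files on the node; the result lands on the
   *    node leader (rank 0 in comm_node) */
  unsigned long node_bytes = 0;
  MPI_Reduce(&my_bytes, &node_bytes, 1, MPI_UNSIGNED_LONG,
             MPI_SUM, 0, comm_node);

  /* 2) node leaders compute an exclusive prefix sum across nodes to get
   *    the starting offset of their node's data */
  unsigned long node_offset = 0;
  if (rank_node == 0) {
    MPI_Exscan(&node_bytes, &node_offset, 1, MPI_UNSIGNED_LONG,
               MPI_SUM, comm_node_across);
    int rank_across;
    MPI_Comm_rank(comm_node_across, &rank_across);
    if (rank_across == 0) {
      node_offset = 0; /* MPI_Exscan leaves rank 0's output undefined */
    }
  }

  /* 3) share the node offset and scan within the node so each process
   *    learns where its own data starts */
  MPI_Bcast(&node_offset, 1, MPI_UNSIGNED_LONG, 0, comm_node);
  unsigned long my_offset = 0;
  MPI_Exscan(&my_bytes, &my_offset, 1, MPI_UNSIGNED_LONG,
             MPI_SUM, comm_node);
  if (rank_node == 0) {
    my_offset = 0; /* MPI_Exscan leaves rank 0's output undefined */
  }
  return node_offset + my_offset;
}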

TODO: discuss setting in flush descriptor stored in filemap under dataset id and rank

TODO: discuss containers during a scavenge

TODO: should we copy redundancy data to containers as well?

Within a rank2file map file, the file-to-container mapping is recorded as entries under the SEG key for each file. An example entry looks like the following:

rank_2.ckpt
  SEG
    0
      FILE
        .scr/ctr.1.scr
      OFFSET
        224295
      LENGTH
        75705
    1
      FILE
        .scr/ctr.2.scr
      OFFSET
        0
      LENGTH
        300000
    2
      FILE
        .scr/ctr.3.scr
      OFFSET
        0
      LENGTH
        148591

The SEG key specifies file data as a list of numbered segments starting from 0. Each segment specifies a length of file data, along with the name of the container file and the offset within that container where the data can be found. Reading all segments in order produces the full sequence of bytes that make up the file. The name of the container file is given as a relative path from the dataset directory.

In the above example, the container size is set to 300000. This size is smaller than normal to illustrate the various fields. The data for the rank_2.ckpt file is split among three segments. The first segment of 75705 bytes is in the container file named .scr/ctr.1.scr starting at offset 224295. The next segment is 300000 bytes and is in .scr/ctr.2.scr starting at offset 0. The final segment of 148591 bytes is in .scr/ctr.3.scr starting at offset 0.
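
A sketch of how a fetch might use these segments to reconstruct an application file is shown below. This is illustrative only, not the actual SCR fetch code; the struct and function names are assumptions. For the example above, the segment list would contain three entries: (.scr/ctr.1.scr, 224295, 75705), (.scr/ctr.2.scr, 0, 300000), and (.scr/ctr.3.scr, 0, 148591).

#include <stdio.h>

/* Illustrative sketch, not the actual SCR fetch code: reconstruct an
 * application file by copying each of its segments, in order, from the
 * container files recorded under the SEG key. */
struct segment {
  const char* container; /* container path, relative to the dataset directory */
  unsigned long offset;  /* offset of this segment within the container */
  unsigned long length;  /* number of bytes in this segment */
};

int extract_file(const char* dest, const struct segment* segs, int count)
{
  FILE* out = fopen(dest, "wb");
  if (out == NULL) {
    return 1;
  }
  for (int i = 0; i < count; i++) {
    FILE* in = fopen(segs[i].container, "rb");
    if (in == NULL) {
      fclose(out);
      return 1;
    }
    /* seek to the segment within the container and copy length bytes;
     * real code would use 64-bit seeks for containers larger than 2GB */
    fseek(in, (long) segs[i].offset, SEEK_SET);
    char buf[65536];
    unsigned long remaining = segs[i].length;
    while (remaining > 0) {
      size_t chunk = remaining < sizeof(buf) ? remaining : sizeof(buf);
      size_t n = fread(buf, 1, chunk, in);
      if (n == 0) {
        break;
      }
      fwrite(buf, 1, n, out);
      remaining -= (unsigned long) n;
    }
    fclose(in);
  }
  fclose(out);
  return 0;
}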