Containers
==========

NOTE: This feature is experimental and not yet complete, so it is not documented in the user guide.

SCR requires checkpoint data to be stored primarily as a file per process. However, writing a large number of files is inefficient or difficult to manage on some file systems. To alleviate this problem, SCR provides an abstraction called “containers”. When writing data to or reading data from the prefix directory, SCR combines multiple application files into a container. Containers are disabled by default. To enable them, set the ``SCR_USE_CONTAINERS`` parameter to 1.

During a flush, SCR identifies the containers and the offsets within those containers where each file should be stored. SCR records the file-to-container mapping in the rank2file map, which it later references to extract files during the fetch operation.

A container has a maximum size, which is determined by the ``SCR_CONTAINER_SIZE`` parameter. This parameter defaults to 100GB. Application file data is packed sequentially within a container until the container is full, and then the remaining data spills over to the next container. The total number of containers required depends on the total number of bytes in the dataset and the container size. A container file name is of the form ``ctr.<id>.scr``, where ``<id>`` is the container id, which counts up from 0. All containers are written to the dataset directory within the prefix directory.

SCR combines files in an order such that all files on the same node are grouped sequentially. This limits the number of container files that each compute node must access. For this purpose, SCR creates two global communicators during ``SCR_Init``. Both are defined in ``scr_globals.c``. The ``scr_comm_node`` communicator consists of all processes on the same compute node. The ``scr_comm_node_across`` communicator consists of all processes having the same rank within ``scr_comm_node``. Note that for each node in the run, some process has rank 0 in ``scr_comm_node``. This process is called the “node leader”.

To get the offset where each process should write its data, SCR first sums the sizes of all files on the node via a reduce on ``scr_comm_node``. The node leaders then execute a scan across nodes using the ``scr_comm_node_across`` communicator to get a node offset. A final scan within ``scr_comm_node`` produces the offset at which each process should write its data.
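The sketch below is a minimal illustration of this reduce-and-scan pattern and of the spill-over packing rule; it is not SCR's actual implementation. The function names, the use of ``MPI_Reduce``/``MPI_Exscan``, and the broadcast of the node offset are assumptions made for this example; only the communicator roles mirror ``scr_comm_node`` and ``scr_comm_node_across``. ::

  #include <mpi.h>
  #include <stdint.h>
  #include <inttypes.h>
  #include <stdio.h>

  /* Return the global byte offset at which this process writes its
   * my_bytes of file data into the packed container stream.
   * Hypothetical sketch; not SCR's actual code. */
  static uint64_t compute_write_offset(uint64_t my_bytes,
                                       MPI_Comm comm_node,
                                       MPI_Comm comm_node_across)
  {
    int rank_node;
    MPI_Comm_rank(comm_node, &rank_node);

    /* Sum the sizes of all files on the node; the total lands on the
     * node leader (rank 0 in comm_node). */
    uint64_t node_bytes = 0;
    MPI_Reduce(&my_bytes, &node_bytes, 1, MPI_UINT64_T, MPI_SUM, 0, comm_node);

    /* Node leaders compute a node offset via an exclusive scan across nodes. */
    uint64_t node_offset = 0;
    if (rank_node == 0) {
      int rank_across;
      MPI_Comm_rank(comm_node_across, &rank_across);
      MPI_Exscan(&node_bytes, &node_offset, 1, MPI_UINT64_T, MPI_SUM,
                 comm_node_across);
      if (rank_across == 0) {
        node_offset = 0; /* MPI_Exscan leaves rank 0's result undefined */
      }
    }

    /* Share the node offset with every process on the node (an assumption;
     * the real code may fold this into the final scan instead). */
    MPI_Bcast(&node_offset, 1, MPI_UINT64_T, 0, comm_node);

    /* An exclusive scan within the node gives each process its offset
     * relative to the start of the node's data. */
    uint64_t proc_offset = 0;
    MPI_Exscan(&my_bytes, &proc_offset, 1, MPI_UINT64_T, MPI_SUM, comm_node);
    if (rank_node == 0) {
      proc_offset = 0;
    }

    return node_offset + proc_offset;
  }

  /* Split a file at a given global offset into (container id, offset, length)
   * segments, spilling over at container_size boundaries. */
  static void print_segments(uint64_t offset, uint64_t length,
                             uint64_t container_size)
  {
    while (length > 0) {
      uint64_t id    = offset / container_size;
      uint64_t pos   = offset % container_size;
      uint64_t avail = container_size - pos;
      uint64_t len   = (length < avail) ? length : avail;
      printf("ctr.%" PRIu64 ".scr offset=%" PRIu64 " length=%" PRIu64 "\n",
             id, pos, len);
      offset += len;
      length -= len;
    }
  }

For instance, with a container size of 300000, calling ``print_segments(524295, 524296, 300000)`` produces the three segments shown in the rank2file example below.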
TODO: discuss setting in flush descriptor stored in filemap under dataset id and rank

TODO: discuss containers during a scavenge

TODO: should we copy redundancy data to containers as well?

Within a rank2file map file, the file-to-container map adds entries under the ``SEG`` key for each file. An example entry looks like the following::

  rank_2.ckpt
    SEG
      0
        FILE
          .scr/ctr.1.scr
        OFFSET
          224295
        LENGTH
          75705
      1
        FILE
          .scr/ctr.2.scr
        OFFSET
          0
        LENGTH
          300000
      2
        FILE
          .scr/ctr.3.scr
        OFFSET
          0
        LENGTH
          148591

The ``SEG`` key specifies file data as a list of numbered segments starting from 0. Each segment specifies the length of its file data, along with the name of the container file and the offset within that container where the data can be found. Reading all segments in order produces the full sequence of bytes that make up the file. The name of the container file is given as a relative path from the dataset directory.

In the above example, the container size is set to 300000. This size is smaller than normal to illustrate the various fields. The data for the ``rank_2.ckpt`` file is split among three segments. The first segment of 75705 bytes is in the container file named ``.scr/ctr.1.scr`` starting at offset 224295. The next segment is 300000 bytes and is in ``.scr/ctr.2.scr`` starting at offset 0. The final segment of 148591 bytes is in ``.scr/ctr.3.scr`` starting at offset 0.
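During a fetch, SCR walks these segments in order to reassemble each application file. The following is a minimal sketch of that idea under stated assumptions: the ``seg_t`` type and ``fetch_file`` function are hypothetical names invented for this example (SCR's actual code reads the segment list from its internal hash structures), and error handling is reduced to a simple failure code. ::

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
    const char* file;   /* container path, relative to the dataset dir */
    long        offset; /* byte offset of the segment within the container */
    size_t      length; /* number of bytes in the segment */
  } seg_t;

  /* Read each segment in order and append it to the output file. */
  static int fetch_file(const char* out_path, const seg_t* segs, int nsegs)
  {
    FILE* out = fopen(out_path, "wb");
    if (out == NULL) {
      return 1;
    }

    char buf[1 << 16]; /* 64 KiB copy buffer */
    for (int i = 0; i < nsegs; i++) {
      FILE* ctr = fopen(segs[i].file, "rb");
      if (ctr == NULL || fseek(ctr, segs[i].offset, SEEK_SET) != 0) {
        if (ctr != NULL) fclose(ctr);
        fclose(out);
        return 1;
      }

      /* Copy segs[i].length bytes from the container to the output file. */
      size_t remaining = segs[i].length;
      while (remaining > 0) {
        size_t chunk = remaining < sizeof(buf) ? remaining : sizeof(buf);
        if (fread(buf, 1, chunk, ctr) != chunk ||
            fwrite(buf, 1, chunk, out) != chunk) {
          fclose(ctr);
          fclose(out);
          return 1;
        }
        remaining -= chunk;
      }
      fclose(ctr);
    }

    fclose(out);
    return 0;
  }

For the ``rank_2.ckpt`` entry above, the segment array would hold ``{ ".scr/ctr.1.scr", 224295, 75705 }``, ``{ ".scr/ctr.2.scr", 0, 300000 }``, and ``{ ".scr/ctr.3.scr", 0, 148591 }``, and ``fetch_file("rank_2.ckpt", segs, 3)`` would rebuild the 524296-byte file.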