Containers
==========

NOTE: This feature is experimental and not yet complete, so it is not documented in the user guide.

SCR requires checkpoint data to be stored primarily as a file per process. However, writing a large number of files is inefficient or difficult to manage on some file systems. To alleviate this problem, SCR provides an abstraction called “containers”. When writing data to or reading data from the prefix directory, SCR combines multiple application files into a container. Containers are disabled by default. To enable them, set the ``SCR_USE_CONTAINERS`` parameter to 1.

During a flush, SCR identifies the containers and the offsets within those containers where each file should be stored. SCR records the file-to-container mapping in the rank2file map, which it later references to extract files during the fetch operation.

A container has a maximum size, which is determined by the ``SCR_CONTAINER_SIZE`` parameter. This parameter defaults to 100GB. Application file data is packed sequentially within a container until the container is full, and then the remaining data spills over to the next container. The total number of containers required depends on the total number of bytes in the dataset and the container size. A container file name is of the form ``ctr.<id>.scr``, where ``<id>`` is the container id, which counts up from 0. All containers are written to the dataset directory within the prefix directory.

SCR combines files in an order such that all files on the same node are grouped sequentially. This limits the number of container files that each compute node must access. For this purpose, SCR creates two global communicators during ``SCR_Init``. Both are defined in ``scr_globals.c``. The ``scr_comm_node`` communicator consists of all processes on the same compute node. The ``scr_comm_node_across`` communicator consists of all processes having the same rank within ``scr_comm_node``. Note that for each node in the run, some process has rank 0 in ``scr_comm_node``. This process is called the “node leader”.

To get the offset where each process should write its data, SCR first sums the sizes of all files on the node via a reduce on ``scr_comm_node``. The node leaders then execute a scan across nodes using the ``scr_comm_node_across`` communicator to get a node offset. A final scan within ``scr_comm_node`` produces the offset at which each process should write its data.
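The sketch below is a minimal illustration of this reduce-and-scan pattern and of the spill-over packing rule; it is not SCR's actual implementation. The function names, the use of ``MPI_Reduce``/``MPI_Exscan``, and the broadcast of the node offset are assumptions made for this example; only the communicator roles mirror ``scr_comm_node`` and ``scr_comm_node_across``. ::

  #include <mpi.h>
  #include <stdint.h>
  #include <inttypes.h>
  #include <stdio.h>

  /* Return the global byte offset at which this process writes its
   * my_bytes of file data into the packed container stream.
   * Hypothetical sketch; not SCR's actual code. */
  static uint64_t compute_write_offset(uint64_t my_bytes,
                                       MPI_Comm comm_node,
                                       MPI_Comm comm_node_across)
  {
    int rank_node;
    MPI_Comm_rank(comm_node, &rank_node);

    /* Sum the sizes of all files on the node; the total lands on the
     * node leader (rank 0 in comm_node). */
    uint64_t node_bytes = 0;
    MPI_Reduce(&my_bytes, &node_bytes, 1, MPI_UINT64_T, MPI_SUM, 0, comm_node);

    /* Node leaders compute a node offset via an exclusive scan across nodes. */
    uint64_t node_offset = 0;
    if (rank_node == 0) {
      int rank_across;
      MPI_Comm_rank(comm_node_across, &rank_across);
      MPI_Exscan(&node_bytes, &node_offset, 1, MPI_UINT64_T, MPI_SUM,
                 comm_node_across);
      if (rank_across == 0) {
        node_offset = 0; /* MPI_Exscan leaves rank 0's result undefined */
      }
    }

    /* Share the node offset with every process on the node (an assumption;
     * the real code may fold this into the final scan instead). */
    MPI_Bcast(&node_offset, 1, MPI_UINT64_T, 0, comm_node);

    /* An exclusive scan within the node gives each process its offset
     * relative to the start of the node's data. */
    uint64_t proc_offset = 0;
    MPI_Exscan(&my_bytes, &proc_offset, 1, MPI_UINT64_T, MPI_SUM, comm_node);
    if (rank_node == 0) {
      proc_offset = 0;
    }

    return node_offset + proc_offset;
  }

  /* Split a file at a given global offset into (container id, offset, length)
   * segments, spilling over at container_size boundaries. */
  static void print_segments(uint64_t offset, uint64_t length,
                             uint64_t container_size)
  {
    while (length > 0) {
      uint64_t id    = offset / container_size;
      uint64_t pos   = offset % container_size;
      uint64_t avail = container_size - pos;
      uint64_t len   = (length < avail) ? length : avail;
      printf("ctr.%" PRIu64 ".scr offset=%" PRIu64 " length=%" PRIu64 "\n",
             id, pos, len);
      offset += len;
      length -= len;
    }
  }

For instance, with a container size of 300000, calling ``print_segments(524295, 524296, 300000)`` produces the three segments shown in the rank2file example below.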
TODO: discuss setting in flush descriptor stored in filemap under dataset id and rank

TODO: discuss containers during a scavenge

TODO: should we copy redundancy data to containers as well?

Within a rank2file map file, the file-to-container map adds entries under the ``SEG`` key for each file. An example entry looks like the following::

  rank_2.ckpt
    SEG
      0
        FILE
          .scr/ctr.1.scr
        OFFSET
          224295
        LENGTH
          75705
      1
        FILE
          .scr/ctr.2.scr
        OFFSET
          0
        LENGTH
          300000
      2
        FILE
          .scr/ctr.3.scr
        OFFSET
          0
        LENGTH
          148591

The ``SEG`` key specifies file data as a list of numbered segments starting from 0. Each segment specifies the length of its file data, along with the name of the container file and the offset within that container where the data can be found. Reading all segments in order produces the full sequence of bytes that make up the file. The name of the container file is given as a relative path from the dataset directory.

In the above example, the container size is set to 300000. This size is smaller than normal to illustrate the various fields. The data for the ``rank_2.ckpt`` file is split among three segments. The first segment of 75705 bytes is in the container file named ``.scr/ctr.1.scr`` starting at offset 224295. The next segment is 300000 bytes and is in ``.scr/ctr.2.scr`` starting at offset 0. The final segment of 148591 bytes is in ``.scr/ctr.3.scr`` starting at offset 0.
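During a fetch, SCR walks these segments in order to reassemble each application file. The following is a minimal sketch of that idea under stated assumptions: the ``seg_t`` type and ``fetch_file`` function are hypothetical names invented for this example (SCR's actual code reads the segment list from its internal hash structures), and error handling is reduced to a simple failure code. ::

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
    const char* file;   /* container path, relative to the dataset dir */
    long        offset; /* byte offset of the segment within the container */
    size_t      length; /* number of bytes in the segment */
  } seg_t;

  /* Read each segment in order and append it to the output file. */
  static int fetch_file(const char* out_path, const seg_t* segs, int nsegs)
  {
    FILE* out = fopen(out_path, "wb");
    if (out == NULL) {
      return 1;
    }

    char buf[1 << 16]; /* 64 KiB copy buffer */
    for (int i = 0; i < nsegs; i++) {
      FILE* ctr = fopen(segs[i].file, "rb");
      if (ctr == NULL || fseek(ctr, segs[i].offset, SEEK_SET) != 0) {
        if (ctr != NULL) fclose(ctr);
        fclose(out);
        return 1;
      }

      /* Copy segs[i].length bytes from the container to the output file. */
      size_t remaining = segs[i].length;
      while (remaining > 0) {
        size_t chunk = remaining < sizeof(buf) ? remaining : sizeof(buf);
        if (fread(buf, 1, chunk, ctr) != chunk ||
            fwrite(buf, 1, chunk, out) != chunk) {
          fclose(ctr);
          fclose(out);
          return 1;
        }
        remaining -= chunk;
      }
      fclose(ctr);
    }

    fclose(out);
    return 0;
  }

For the ``rank_2.ckpt`` entry above, the segment array would hold ``{ ".scr/ctr.1.scr", 224295, 75705 }``, ``{ ".scr/ctr.2.scr", 0, 300000 }``, and ``{ ".scr/ctr.3.scr", 0, 148591 }``, and ``fetch_file("rank_2.ckpt", segs, 3)`` would rebuild the 524296-byte file.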