Rank2file map

The rank2file map tracks which files were written by which ranks during a particular dataset. This map contains information for every rank and file. For large jobs, it may consist of more bytes than can be loaded into any single MPI process. This information is scattered among multiple files that are organized as a tree. These files are stored in the dataset directory on the parallel file system. Internally, the data of the rank2file map is organized as a hash.

There is always a root file named rank2file.scr. Here are the contents of an example root rank2file map.

LEVEL
  1
RANKS
  4
RANK
  0
    OFFSET
      0
    FILE
      .scr/rank2file.0.0.scr

Note that there is no VERSION field. The version is implied from the summary file for the dataset. The LEVEL field lists the level at which the current rank2file map is located in the tree. The leaves of the tree are at level 0. The RANKS field specifies the number of ranks the current file (and its associated subtree) contains information for.

For levels that are above level 0, the RANK hash contains information about other rank2file map files to be read. Each entry in this hash is identified by a rank id, and then for each rank, a FILE and OFFSET are given. The rank id specifies which rank is responsible for reading content at the next level. The FILE field specifies the file name that is to be read, and the OFFSET field gives the starting byte offset within that file.

A process reading a file at the current level scatters the hash info to the designated “reader” ranks, and those processes read data for the next level. In this way, the task of reading the rank2file map is distributed among multiple processes in the job. The SCR library ensures that the maximum amount of data any process reads in any step is limited (currently 1MB).

File names at levels lower than the root have names of the form rank2file.<level>.<rank>.scr, where level is the level number within the tree and rank is the rank of the process that wrote the file.

Finally, level 0 contains the data that maps a rank to a list of files names. Here are the contents of an example rank2file map file at level 0.

RANK2FILE
  LEVEL
    0
  RANKS
    4
  RANK
    0
      FILE
        rank_0.ckpt
          SIZE
            524294
          CRC
            0x6697d4ef
    1
      FILE
        rank_1.ckpt
          SIZE
            524295
          CRC
            0x28eeb9e
    2
      FILE
        rank_2.ckpt
          SIZE
            524296
          CRC
            0xb6a62246
    3
      FILE
        rank_3.ckpt
          SIZE
            524297
          CRC
            0x213c897a

Again, the number of ranks that this file contains information for is recorded under the RANKS field.

There are entries for specific ranks under the RANK hash, which is indexed by rank id within scr_comm_world. For a given rank, each file that rank wrote as part of the dataset is indexed by file name under the FILE hash. The file name specifies the relative path to the file starting from the dataset directory. For each file, SCR records the size of the file in bytes under SIZE, and SCR may also record the CRC32 checksum value over the contents of the file under the CRC field.

On restart, the reader rank that reads this hash scatters the information to the owner rank, so that by the end of processing the tree, all processes know which files to read.