Scavenge

At the end of an allocation, certain SCR commands inspect the cache to verify that the most recent checkpoint has been copied to the parallel file system. If not, these commands execute other SCR commands to scavenge this checkpoint before the allocation ends. In this section, we detail key concepts referenced as part of the scavenge operations. Detailed program flow for these operations is provided in Section Program Flow>Scavenge.

Rank filemap file

The scr_copy command is a serial program (non-MPI) that executes on a compute node and copies all files belonging to a specified dataset id from the cache to a specified dataset directory on the parallel file system. It is implemented in scr_copy.c whose program flow is described in Section <scr_copy>. The scr_copy command copies all application files and SCR redundancy data files. In addition, it writes a special filemap file for each rank to the dataset directory. The name of this filemap file is of the format: <rank>.scrfilemap. An example hash for such a filemap file is shown below:

DSET
  6
    RANK
      2
RANK
  2
    DSET
      6
        DSETDESC
          COMPLETE
            1
          SIZE
            2097182
          FILES
            4
          ID
            6
          NAME
            scr.dataset.6
          CREATED
            1312850690668536
          USER
            user1
          JOBNAME
            simulation123
          JOBID
            112573
          CKPT
            6
        FILES
          2
        FILE
          3_of_4_in_0.xor
            META
              RANKS
                4
              COMPLETE
                1
              SIZE
                175693
              TYPE
                XOR
              FILE
                3_of_4_in_0.xor
              CRC
                0x2ef519a1
          rank_2.ckpt
            META
              COMPLETE
                1
              SIZE
                524296
              NAME
                rank_2.ckpt
              PATH
                /p/lscratchb/user1/simulation123
              ORIG
                rank_2.ckpt
              RANKS
                4
              TYPE
                FULL
              FILE
                rank_2.ckpt
              CRC
                0x738bb68f

It lists the files owned by a rank for a particular dataset. In this case, it shows that rank 2 wrote two files (FILES=2) as part of dataset id 6. Those files are named rank_2.ckpt and 3_of_4_in_0.xor.

This format is similar to the filemap hash format described in Section Filemap. The main differences are that files are listed using relative paths instead of absolute paths and there are no redundancy descriptors. The paths are relative so that the dataset directory on the parallel file system may be moved or renamed. Redundancy descriptors are cache-specific, so these entries are excluded.

Scanning files

After scr_copy copies files from the cache on each compute node to the parallel file system, the scr_index command runs to check whether all files were recovered, rebuild missing files if possible, and add an entry for the dataset to the SCR index file (Section Index_file). When invoking the scr_index command, the full path to the prefix directory and the name of the dataset directory are specified on the command line. The scr_index command is implemented in scr_index.c, and its program flow is described in Section <scr_index>.

The scr_index command first acquires a listing of all items contained in the dataset directory by calling scr_read_dir, which is implemented in scr_index.c. This function uses POSIX calls to list all files and subdirectories contained in the dataset directory. The hash returned by this function distinguishes directories from files using the following format.

DIR
  <dir1>
  <dir2>
  ...
FILE
  <file1>
  <file2>
  ...

The scr_index command then iterates over the list of file names and reads each file that ends with the “.scrfilemap” extension. These files are the filemap files written by scr_copy as described above. The scr_index command records the number of expected files for each rank into a single hash called the scan hash.

For each file listed in the rank filemap file, the scr_index command verifies the meta data from the rank filemap map against the original file (excluding CRC32 checks). If the file passes these checks, the command adds a corresponding entry for the file to the scan hash. This entry is formatted such that it can be used as an entry in the summary file hash (Section Summary file). If the file is an XOR file, it sets a NOFETCH flag under the FILE key, which instructs the SCR library to exclude this file during a fetch operation.

Furthermore, for each XOR file, the scr_index command extracts info about the XOR set from the file name and adds an entry under an XOR key in the scan hash. It records the XOR set id (under XOR), the number of members in the set (under MEMBERS), and the group rank of the current file in this set (under MEMBER), as well as, the global rank id (under RANK) and the name of the XOR file (under FILE). After this all of this, the scan hash might look like the following example:

DLIST
  <dataset_id>
    DSET
      COMPLETE
        1
      SIZE
        2097182
      FILES
        4
      ID
        6
      NAME
        scr.dataset.6
      CREATED
        1312850690668536
      USER
        user1
      JOBNAME
        simulation123
      JOBID
        112573
      CKPT
        6
    RANK2FILE
      RANKS
        <num_ranks>
      RANK
        <rank1>
          FILES
            <num_expected_files_for_rank1>
          FILE
             <filename>
               SIZE
                 <filesize>
               CRC
                 <crc>
             <xor_filename>
               NOFETCH
               SIZE
                 <filesize>
               CRC
                 <crc>
             ...
        <rank2>
          FILES
            <num_expected_files_for_rank2>
          FILE
             <filename>
               SIZE
                 <filesize>
               CRC
                 <crc>
             <xor_filename>
               NOFETCH
               SIZE
                 <filesize>
               CRC
                 <crc>
             ...
        ...
    XOR
      <set1>
        MEMBERS
          <num_members_in_set1>
        MEMBER
          <member1>
            FILE
              <xor_filename_of_member1_in_set1>
            RANK
              <rank_id_of_member1_in_set1>
          <member2>
            FILE
              <xor_filename_of_member2_in_set1>
            RANK
              <rank_id_of_member2_in_set1>
          ...
      <set2>
        MEMBERS
          <num_members_in_set2>
        MEMBER
          <member1>
            FILE
              <xor_filename_of_member1_in_set2>
            RANK
              <rank_id_of_member1_in_set2>
          <member2>
            FILE
              <xor_filename_of_member2_in_set2>
            RANK
              <rank_id_of_member2_in_set2>
          ...
      ...

Inspecting files

After merging data from all filemap files in the dataset directory, the scr_index command inspects the scan hash to identify any missing files. For each dataset, it determines the number of ranks associated with the dataset, and it checks that it has an entry in the scan hash for each rank. It then checks whether each rank has as an entry for each of its expected number of files. If any file is determined to be missing, the command adds an INVALID flag to the scan hash, and it lists all ranks that are missing files under the MISSING key. This operation may thus add entries like the following to the scan hash.

DLIST
  <dataset_id>
    INVALID
    MISSING
      <rank1>
      <rank2>
      ...

Rebuilding files

If any ranks are missing files, then the scr_index command attempts to rebuild files. Currently, only the XOR redundancy scheme can be used to rebuild files. The command iterates over each of the XOR sets listed in the scan hash, and it checks that each set has an entry for each of its members. If it finds an XOR set that is missing a member, or if it finds that a set contains a rank which is known to be missing files, the command constructs a string that can be used to fork and exec a process to rebuild the files for that process. It records these strings under the BUILD key in the scan hash. If it finds that one or more files cannot be recovered, it sets an UNRECOVERABLE flag in the scan hash. If the scr_index command determines that it is possible to rebuild all missing files, it forks and execs a process for each string listed under the BUILD hash. Thus this operation may add entries like the following to the scan hash.

DLIST
  <dataset_id>
    UNRECOVERABLE
    BUILD
      <cmd_to_rebuild_files_for_set1>
      <cmd_to_rebuild_files_for_set2>
      ...

Scan hash

After all of these steps, the scan hash is of the following form:

DLIST
  <dataset_id>
    UNRECOVERABLE
    BUILD
      <cmd_to_rebuild_files_for_set1>
      <cmd_to_rebuild_files_for_set2>
      ...
    INVALID
    MISSING
      <rank1>
      <rank2>
      ...
    RANKS
      <num_ranks>
    RANK
      <rank>
        FILES
          <num_files_to_expect>
        FILE
          <file_name>
            SIZE
              <size_in_bytes>
            CRC
              <crc32_string_in_0x_form>
          <xor_file_name>
            NOFETCH
            SIZE
              <size_in_bytes>
            CRC
              <crc32_string_in_0x_form>
          ...
      ...
    XOR
      <xor_setid>
        MEMBERS
          <num_members_in_set>
        MEMBER
          <member_id>
            FILE
              <xor_filename>
            RANK
              <rank>
          ...
      ...

After the rebuild attempt, the scr_index command writes a summary file in the dataset directory. To produce the hash for the summary file, the command deletes extraneous entries from the scan hash (UNRECOVERABLE, BUILD, INVALID, MISSING, XOR) and adds the summary file format version number.