Flush

This section describes the process of a synchronous flush.

scr_flush_sync

This is implemented in scr_flush_sync.c.

  1. Return with failure if flush is disabled.

  2. Return with success if specified dataset id has already been flushed.

  3. Barrier to ensure all procs are ready to start.

  4. Start timer to record flush duration.

  5. If async flush is in progress, wait for it to stop. Then check that our dataset still needs to be flushed.

  6. Log start of flush.

  7. Add FLUSHING marker for dataset in flush file to denote flush started.

  8. Get list of files to flush, identify containers, create directories, and create containers (Section 0.1.2). Store list in new hash.

  9. Flush data to files or containers (Section 0.1.7).

  10. Write summary file (Section 0.1.9).

  11. Get total bytes from dataset hash in filemap.

  12. Delete hashes of data and list of files.

  13. Remove FLUSHING marker from flush file.

  14. Stop timer, compute bandwidth, log end.

scr_flush_prepare

Given a filemap and dataset id, prepare and return a list of files to be flushed, and create the corresponding directories and container files. This is implemented in scr_flush.c.

  1. Build hash of files, directories, and containers for flush (Section 0.1.3).

  2. Create directory tree for dataset (Section 0.1.6).

  3. Create container files in scr_flush_create_containers.

    1. Loop over each file in file list hash. If the process writes to offset 0, have it open, create, truncate, and close the container file.

scr_flush_identify

Creates a hash of files to flush. This is implemented in scr_flush.c.

  1. Check that all procs have all of their files for this dataset.

  2. Add files to file list hash, including meta data in scr_flush_identify_files.

    1. Read dataset hash from filemap, add to file list hash.

    2. Loop over each file for the dataset; if the file is not an XOR redundancy file, add it and its meta data to the file list hash.

  3. Add directories to file list hash (Section 0.1.4).

  4. Add containers to file list hash (Section 0.1.5).

scr_flush_identify_dirs

Specifies directories which must be created as part of flush, and identifies processes responsible for creating them. This is implemented in scr_flush.c.

  1. Extract dataset hash from file list hash.

  2. If we’re preserving user directories:

    1. Allocate arrays to call DTCMP to rank directory names of all files.

    2. For each file, check that its user-specified path is under the prefix directory, insert dataset directory name in subdir array, and insert full parent directory in dir array.

    3. Check that all files from all procs are within a directory under the prefix directory.

    4. Call DTCMP with subdir array to rank user directory names for all files, and check that one common dataset directory contains all files.

    5. Broadcast dataset directory name to all procs.

    6. Record dataset directory in file list hash.

    7. Call DTCMP with the dir array to rank all directories across all procs.

    8. For each unique directory, we pick one process to later create that directory. This process records the directory name in the file list hash.

    9. Free arrays.

  3. Otherwise (if we’re not preserving user-defined directories):

    1. Get name of dataset from dataset hash.

    2. Append dataset name to prefix directory to define dataset directory.

    3. Record dataset directory in file list hash.

    4. Record dataset directory as destination path for each file in file list hash.

scr_flush_identify_containers

For each file to be flushed in file list hash, identify segments, containers, offsets, and lengths. This is implemented in scr_flush.c.

  1. Get our rank within the scr_comm_node communicator.

  2. Get the container size.

  3. Extract dataset hash from file list hash.

  4. Define path within dataset directory to all container files.

  5. Loop over each file in file list hash and compute total byte count.

  6. Compute total bytes across all processes in run with allreduce on scr_comm_world.

  7. Compute total bytes per node by reducing to node leader in scr_comm_node.

  8. Compute offset for each node with a scan across node leaders in scr_comm_node_across.

  9. Compute offset of processes within each node with scan within scr_comm_node.

  10. Loop over each file and compute offset of each file.

  11. Given the container size, the offset and length of each file, compute container file name, length, and offset for each segment and store details within file list hash.

  12. Check that all procs identified their containers.

scr_flush_create_dirs

Given a file list hash, create dataset directory and any subdirectories to hold dataset. This is implemented in scr_flush.c.

  1. Get file mode for creating directories.

  2. Rank 0 creates the dataset directory:

    1. Read path from file list hash.

    2. Get subdirectory name of dataset within prefix directory.

    3. Extract dataset hash from file list hash, and get dataset id.

    4. Add dataset directory and id to index file, write index file to disk.

    5. Create dataset directory and its hidden .scr subdirectory.

  3. Barrier across all procs.

  4. Have each leader create its directory as designated in Section 0.1.4.

  5. Ensure that all directories were created.

scr_flush_data

This is implemented in scr_flush_sync.c. To flow control the number of processes writing, rank 0 writes its data first and then serves as a gatekeeper. All other processes wait until they receive a go-ahead message from rank 0 before starting, and each sends a message back to rank 0 when finished. Rank 0 maintains a sliding window of active writers. Each completion message includes a flag indicating whether that writer succeeded in copying its files. If rank 0 detects that a process has failed, the go-ahead messages it sends to later writers indicate this failure, in which case those writers immediately send back a message without copying their files. This way, no time is wasted by later writers once an earlier writer has already failed.

RANK 0

  1. Flush files in list, writing data to containers if used (Section 0.1.8).

  2. Allocate arrays to manage a window of active writers.

  3. Send “go ahead” message to first W writers.

  4. Waitany for any writer to send completion notification, record flag indicating whether that writer was successful, and send “go ahead” message to next writer.

  5. Loop until all writers have completed.

  6. Execute allreduce to inform all procs whether flush was successful.

NON-RANK 0

  1. Wait for go ahead message.

  2. Flush files in list, writing data to containers if used (Section 0.1.8).

  3. Send completion message to rank 0 indicating whether copy succeeded.

  4. Execute allreduce to inform all procs whether flush was successful.

scr_flush_files_list

Given a list of files, this function copies data file-by-file, and then it updates the hash that forms the rank2file map. It is implemented in scr_flush_sync.c.

  1. Get path to summary file from file list hash.

  2. Loop over each file in file list.

LOOP

  1. Get file name.

  2. Get basename of file (throw away directory portion).

  3. Get hash for this file.

  4. Get file meta data from hash.

  5. Check for container segments (TODO: what if a process has no files?).

CONTAINERS

  1. Add basename to rank2file map.

  2. Flush file to its containers.

  3. If successful, record file size, CRC32 if computed, and segment info in rank2file map.

  4. Otherwise, record 0 for COMPLETE flag in rank2file map.

  5. Delete file name and loop.

NON-CONTAINERS

  1. Get directory to write file from PATH key in file hash.

  2. Append basename to directory to get full path.

  3. Compute relative path to file starting from dataset directory.

  4. Add relative path to rank2file map.

  5. Copy file data to destination.

  6. If successful, copy file size and CRC32 if computed in rank2file map.

  7. Otherwise, record 0 for COMPLETE flag in rank2file map.

  8. Delete relative and full path names and loop.

END LOOP

scr_flush_complete

Writes out summary and rank2file map files. This is implemented in scr_flush.c.

  1. Extract dataset hash from file list hash.

  2. Get dataset directory path.

  3. Write summary file (Section 0.1.10).

  4. Update index file to mark dataset as “current”.

  5. Broadcast signal from rank 0 to indicate whether flush succeeded.

  6. Update flush file that dataset is now on parallel file system.

scr_flush_summary

Produces summary and rank2file map files in dataset directory on parallel file system. Data for the rank2file maps are gathered and written via a data-dependent tree, such that no process has to write more than 1MB to each file. This is implemented in scr_flush.c.

  1. Get path to dataset directory and hidden .scr directory.

  2. Given data to write to rank2file map file, pick a writer process so that each writer gets at most 1MB.

  3. Call scr_hash_exchange_direction to fold data up tree.

  4. Rank 0 creates summary file and writes dataset hash.

  5. Define name of rank2file map files.

  6. Funnel rank2file data up tree in recursive manner (Section 0.1.11).

  7. If process is a writer, write rank2file map data to file.

  8. Free temporaries.

  9. Check that all procs wrote all files successfully.

scr_flush_summary_map

Recursive helper for scr_flush_summary that funnels rank2file map data up the tree, writing map files at each level. This is implemented in scr_flush.c.

  1. Get path to dataset directory and hidden .scr directory.

  2. If we received rank2file map in the previous step, create hash to specify its file name to include at next level in tree.

  3. Given this hash, pick a writer process so that each writer gets at most 1MB.

  4. Call scr_hash_exchange_direction to fold data up tree.

  5. Define name of rank2file map files.

  6. Funnel rank2file data up tree by calling scr_flush_summary_map recursively.

  7. If process is a writer, write rank2file map data to file.

  8. Free temporaries.