Flush¶
This section describes the process of a synchronous flush.
scr_flush_sync¶
This is implemented in scr_flush_sync.c.
- Return with failure if flush is disabled.
- Return with success if specified dataset id has already been flushed.
- Barrier to ensure all procs are ready to start.
- Start timer to record flush duration.
- If async flush is in progress, wait for it to stop. Then check that our dataset still needs to be flushed.
- Log start of flush.
- Add FLUSHING marker for dataset in flush file to denote flush started.
- Get list of files to flush, identify containers, create directories, and create containers (Section 0.1.2). Store list in new hash.
- Flush data to files or containers (Section 0.1.7).
- Write summary file (Section 0.1.9).
- Get total bytes from dataset hash in filemap.
- Delete hashes of data and list of files.
- Remove FLUSHING marker from flush file.
- Stop timer, compute bandwidth, log end.
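The sequence above can be sketched as a C skeleton. Every helper below is a stub standing in for the corresponding step (waiting out an async flush, marking the flush file, copying data, writing the summary); the names and signatures are hypothetical, not the actual scr_flush_sync internals.

```c
/* Skeleton of the synchronous flush sequence.  All helpers are stubs
 * with hypothetical names; they stand in for the steps listed above. */
enum { FLUSH_SUCCESS = 0, FLUSH_FAILURE = 1 };

static int already_flushed(int id)    { (void)id; return 0; }
static int wait_for_async_flush(void) { return FLUSH_SUCCESS; }
static int mark_flushing(int id)      { (void)id; return FLUSH_SUCCESS; }
static int flush_files(int id)        { (void)id; return FLUSH_SUCCESS; }
static int write_summary(int id)      { (void)id; return FLUSH_SUCCESS; }
static int unmark_flushing(int id)    { (void)id; return FLUSH_SUCCESS; }

int flush_sync(int id, int flush_enabled)
{
  if (!flush_enabled)      return FLUSH_FAILURE;  /* flush disabled */
  if (already_flushed(id)) return FLUSH_SUCCESS;  /* nothing to do  */

  /* barrier across procs, start of timer, and log of start go here */

  if (wait_for_async_flush() != FLUSH_SUCCESS) return FLUSH_FAILURE;
  if (mark_flushing(id) != FLUSH_SUCCESS)      return FLUSH_FAILURE;

  int rc = flush_files(id);        /* copy data to files/containers */
  if (rc == FLUSH_SUCCESS) {
    rc = write_summary(id);        /* write summary file */
  }
  unmark_flushing(id);             /* remove FLUSHING marker */

  /* stop timer, compute bandwidth, and log end would go here */
  return rc;
}
```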
scr_flush_prepare¶
Given a filemap and dataset id, prepare and return a list of files to be
flushed, and create the corresponding directories and container files. This
is implemented in scr_flush.c.
- Build hash of files, directories, and containers for flush (Section 0.1.3).
- Create directory tree for dataset (Section 0.1.6).
- Create container files in scr_flush_create_containers:
  - Loop over each file in the file list hash. If the process writes to offset 0, have it open, create, truncate, and close the container file.
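As a minimal sketch of that create-and-truncate step, the process that owns offset 0 of a container might create the file as below. The function name and flag choices are illustrative, not the actual SCR code.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* The process that writes to offset 0 of a container creates and
 * truncates it; other processes later write into the same file at
 * their own offsets.  Hypothetical helper, not the SCR API. */
int create_container(const char* file, mode_t mode, int writes_offset_zero)
{
  if (!writes_offset_zero) {
    return 0;  /* another process is responsible for this container */
  }
  int fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, mode);
  if (fd < 0) {
    return -1;
  }
  return close(fd);
}
```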
scr_flush_identify¶
Creates a hash of files to flush. This is implemented in scr_flush.c.
- Check that all procs have all of their files for this dataset.
- Add files to the file list hash, including their meta data, in scr_flush_identify_files:
  - Read dataset hash from filemap, add to file list hash.
  - Loop over each file in the dataset; if the file is not an XOR redundancy file, add it and its meta data to the file list hash.
- Add directories to file list hash (Section 0.1.4).
- Add containers to file list hash (Section 0.1.5).
scr_flush_identify_dirs¶
Specifies the directories which must be created as part of the flush, and
identifies the processes responsible for creating them. This is implemented
in scr_flush.c.
- Extract dataset hash from file list hash.
- If we’re preserving user directories:
- Allocate arrays to call DTCMP to rank directory names of all files.
- For each file, check that its user-specified path is under the prefix directory, insert dataset directory name in subdir array, and insert full parent directory in dir array.
- Check that all files from all procs are within a directory under the prefix directory.
- Call DTCMP with subdir array to rank user directory names for all files, and check that one common dataset directory contains all files.
- Broadcast dataset directory name to all procs.
- Record dataset directory in file list hash.
- Call DTCMP with the dir array to rank all directories across all procs.
- For each unique directory, we pick one process to later create that directory. This process records the directory name in the file list hash.
- Free arrays.
- Otherwise (if we’re not preserving user-defined directories):
- Get name of dataset from dataset hash.
- Append dataset name to prefix directory to define dataset directory.
- Record dataset directory in file list hash.
- Record dataset directory as destination path for each file in file list hash.
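The non-preserving branch amounts to simple path construction; a sketch with hypothetical names:

```c
#include <stdio.h>
#include <string.h>

/* When user directories are not preserved, the dataset directory is
 * simply the dataset name appended to the prefix directory, and that
 * path becomes the destination for every file.  Illustrative helper,
 * not the SCR implementation. */
int build_dataset_dir(char* out, size_t outlen,
                      const char* prefix, const char* dataset_name)
{
  int n = snprintf(out, outlen, "%s/%s", prefix, dataset_name);
  return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```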
scr_flush_identify_containers¶
For each file to be flushed in the file list hash, identify its segments,
containers, offsets, and lengths. This is implemented in scr_flush.c.
- Get our rank within the scr_comm_node communicator.
- Get the container size.
- Extract dataset hash from file list hash.
- Define path within dataset directory to all container files.
- Loop over each file in file list hash and compute total byte count.
- Compute total bytes across all processes in the run with an allreduce on scr_comm_world.
- Compute total bytes per node by reducing to the node leader in scr_comm_node.
- Compute the offset for each node with a scan across node leaders in scr_comm_node_across.
- Compute the offset of processes within each node with a scan within scr_comm_node.
- Loop over each file and compute the offset of each file.
- Given the container size and the offset and length of each file, compute the container file name, length, and offset for each segment, and store these details within the file list hash.
- Check that all procs identified their containers.
scr_flush_create_dirs¶
Given a file list hash, create the dataset directory and any subdirectories
needed to hold the dataset. This is implemented in scr_flush.c.
- Get file mode for creating directories.
- Rank 0 creates the dataset directory:
- Read path from file list hash.
- Get subdirectory name of dataset within prefix directory.
- Extract dataset hash from file list hash, and get dataset id.
- Add dataset directory and id to index file, write index file to disk.
- Create dataset directory and its hidden .scr subdirectory.
- Barrier across all procs.
- Have each leader create its directory as designated in Section 0.1.4.
- Ensure that all directories were created.
scr_flush_data¶
This is implemented in scr_flush_sync.c. To flow control the number
of processes writing, rank 0 writes its data first and then serves as a
gatekeeper. All processes wait until they receive a go-ahead message
from rank 0 before starting, and each sends a message back to rank 0
when finished. Rank 0 maintains a sliding window of active writers. Each
completion message includes a flag indicating whether the writer failed or
succeeded in copying its files. If rank 0 detects that a process failed,
the go-ahead message it sends to later writers indicates this failure, in
which case those writers immediately send back a message without copying
their files. This way, time is not wasted by later writers once an earlier
writer has failed.
RANK 0
- Flush files in list, writing data to containers if used (Section 0.1.8).
- Allocate arrays to manage a window of active writers.
- Send “go ahead” message to first W writers.
- Waitany for any writer to send completion notification, record flag indicating whether that writer was successful, and send “go ahead” message to next writer.
- Loop until all writers have completed.
- Execute allreduce to inform all procs whether flush was successful.
NON-RANK 0
- Wait for go ahead message.
- Flush files in list, writing data to containers if used (Section 0.1.8).
- Send completion message to rank 0 indicating whether copy succeeded.
- Execute allreduce to inform all procs whether flush was successful.
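The gatekeeper protocol can be illustrated with a single-process simulation. This is not MPI code: releasing a rank into the window stands in for the "go ahead" message, popping it stands in for MPI_Waitany on a completion, and fail_rank marks the one writer whose copy fails (-1 for none). All names are hypothetical.

```c
/* Single-process simulation of the sliding-window flow control in
 * scr_flush_data.  In the real code rank 0 sends "go ahead" messages
 * to at most `window` writers at a time and waits for completions;
 * after a failure, later writers skip their copies.  Illustrative. */
int flush_with_window(int nwriters, int window, int fail_rank)
{
  int pending[64];      /* ranks currently holding a "go ahead" */
  int all_ok = 1;
  int next = 0;         /* next writer to release */
  int outstanding = 0;

  if (window < 1)  window = 1;
  if (window > 64) window = 64;

  while (next < nwriters || outstanding > 0) {
    /* release writers up to the window size */
    while (next < nwriters && outstanding < window) {
      pending[outstanding++] = next++;
    }
    /* "wait" for one writer to finish; after a failure, later
     * writers report back immediately without copying */
    int rank = pending[--outstanding];
    int ok = all_ok ? (rank != fail_rank) : 1;
    if (!ok) {
      all_ok = 0;
    }
  }
  return all_ok;  /* SCR shares this result via an allreduce */
}
```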
scr_flush_files_list¶
Given a list of files, this function copies data file-by-file, and then
it updates the hash that forms the rank2file map. It is implemented in
scr_flush_sync.c.
- Get path to summary file from file list hash.
- Loop over each file in file list.
LOOP
- Get file name.
- Get basename of file (throw away directory portion).
- Get hash for this file.
- Get file meta data from hash.
- Check for container segments (TODO: what if a process has no files?).
CONTAINERS
- Add basename to rank2file map.
- Flush file to its containers.
- If successful, record file size, CRC32 if computed, and segment info in rank2file map.
- Otherwise, record 0 for COMPLETE flag in rank2file map.
- Delete file name and loop.
NON-CONTAINERS
- Get directory to write file from PATH key in file hash.
- Append basename to directory to get full path.
- Compute relative path to file starting from dataset directory.
- Add relative path to rank2file map.
- Copy file data to destination.
- If successful, record file size, and CRC32 if computed, in rank2file map.
- Otherwise, record 0 for COMPLETE flag in rank2file map.
- Delete relative and full path names and loop.
END LOOP
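The copy in the non-container branch is an ordinary chunked file copy. A sketch, with an illustrative name (SCR's real copy can also compute a CRC32 as it streams and records the result in the rank2file map):

```c
#include <stdio.h>

/* Copy src to dst in fixed-size chunks; return bytes copied, or -1
 * on error.  Hypothetical helper, not the SCR implementation. */
long copy_file(const char* src, const char* dst)
{
  FILE* in = fopen(src, "rb");
  if (in == NULL) {
    return -1;
  }
  FILE* out = fopen(dst, "wb");
  if (out == NULL) {
    fclose(in);
    return -1;
  }

  char buf[4096];
  long total = 0;
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
    if (fwrite(buf, 1, n, out) != n) {
      total = -1;  /* short write, e.g. file system full */
      break;
    }
    total += (long)n;
  }
  fclose(in);
  if (fclose(out) != 0) {
    total = -1;
  }
  return total;
}
```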
scr_flush_complete¶
Writes out summary and rank2file map files. This is implemented in
scr_flush.c.
- Extract dataset hash from file list hash.
- Get dataset directory path.
- Write summary file (Section 0.1.10).
- Update index file to mark dataset as “current”.
- Broadcast signal from rank 0 to indicate whether flush succeeded.
- Update flush file that dataset is now on parallel file system.
scr_flush_summary¶
Produces summary and rank2file map files in dataset directory on
parallel file system. Data for the rank2file maps are gathered and
written via a data-dependent tree, such that no process has to write
more than 1MB to each file. This is implemented in scr_flush.c.
- Get path to dataset directory and hidden .scr directory.
- Given data to write to the rank2file map file, pick a writer process so that each writer gets at most 1MB.
- Call scr_hash_exchange_direction to fold data up the tree.
- Rank 0 creates summary file and writes dataset hash.
- Define name of rank2file map files.
- Funnel rank2file data up tree in recursive manner (Section 0.1.11).
- If process is a writer, write rank2file map data to file.
- Free temporaries.
- Check that all procs wrote all files successfully.
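Writer selection can be sketched serially: in SCR the per-rank byte offsets come from an MPI scan, but the packing rule is the same. Names are illustrative, and each contribution is assumed to be at most max_bytes on its own.

```c
/* Assign ranks to writers in order so that no writer receives more
 * than max_bytes (e.g. 1MB).  sizes[i] is rank i's contribution to
 * the rank2file map; writer[i] receives the index of the writer that
 * collects rank i's data.  Returns the number of writers used.
 * Hypothetical helper, not the SCR implementation. */
int assign_writers(const unsigned long* sizes, int n,
                   unsigned long max_bytes, int* writer)
{
  unsigned long filled = 0;  /* bytes assigned to the current writer */
  int w = 0;
  for (int i = 0; i < n; i++) {
    if (filled + sizes[i] > max_bytes && filled > 0) {
      w++;          /* current writer is full; start a new one */
      filled = 0;
    }
    writer[i] = w;
    filled += sizes[i];
  }
  return w + (n > 0 ? 1 : 0);
}
```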
scr_flush_summary_map¶
Produces rank2file map files in the dataset directory on the
parallel file system. This is implemented in scr_flush.c.
- Get path to dataset directory and hidden .scr directory.
- If we received a rank2file map in the previous step, create a hash to specify its file name to include at the next level in the tree.
- Given this hash, pick a writer process so that each writer gets at most 1MB.
- Call scr_hash_exchange_direction to fold data up the tree.
- Define name of rank2file map files.
- Funnel rank2file data up the tree by calling scr_flush_summary_map recursively.
- If process is a writer, write rank2file map data to file.
- Free temporaries.