.. _flow_flush:

Flush
-----

This section describes the process of a synchronous flush.

scr_flush_sync
~~~~~~~~~~~~~~

This is implemented in ``scr_flush_sync.c``.

#. Return with failure if flush is disabled.
#. Return with success if the specified dataset id has already been flushed.
#. Barrier to ensure all procs are ready to start.
#. Start timer to record flush duration.
#. If an async flush is in progress, wait for it to stop.
   Then check that our dataset still needs to be flushed.
#. Log start of flush.
#. Add FLUSHING marker for dataset in flush file to denote that flush started.
#. Get list of files to flush, identify containers, create directories,
   and create containers (Section :ref:`0.1.2 <flow_flush_prepare>`).
   Store list in new hash.
#. Flush data to files or containers (Section :ref:`0.1.7 <flow_flush_data>`).
#. Write summary file (Section :ref:`0.1.9 <flow_flush_complete>`).
#. Get total bytes from dataset hash in filemap.
#. Delete hashes of data and list of files.
#. Remove FLUSHING marker from flush file.
#. Stop timer, compute bandwidth, log end.

.. _flow_flush_prepare:

scr_flush_prepare
~~~~~~~~~~~~~~~~~

Given a filemap and dataset id, prepare and return a list of files to be flushed,
and also create the corresponding directories and container files.
This is implemented in ``scr_flush.c``.

#. Build hash of files, directories, and containers for flush
   (Section :ref:`0.1.3 <flow_flush_identify>`).
#. Create directory tree for dataset (Section :ref:`0.1.6 <flow_flush_create_dirs>`).
#. Create container files in ``scr_flush_create_containers``.

   #. Loop over each file in file list hash.
      If the process writes to offset 0, have it open, create, truncate,
      and close the container file.

.. _flow_flush_identify:

scr_flush_identify
~~~~~~~~~~~~~~~~~~

Creates a hash of files to flush.
This is implemented in ``scr_flush.c``.

#. Check that all procs have all of their files for this dataset.
#. Add files to file list hash, including meta data, in ``scr_flush_identify_files``.

   #. Read dataset hash from filemap, add to file list hash.
   #. Loop over each file for dataset; if file is not ``XOR``,
      add it and its meta data to file list hash.

#. Add directories to file list hash (Section :ref:`0.1.4 <flow_flush_identify_dirs>`).
#. Add containers to file list hash (Section :ref:`0.1.5 <flow_flush_identify_containers>`).

.. _flow_flush_identify_dirs:

scr_flush_identify_dirs
~~~~~~~~~~~~~~~~~~~~~~~

Specifies directories which must be created as part of the flush,
and identifies the processes responsible for creating them.
This is implemented in ``scr_flush.c``.

#. Extract dataset hash from file list hash.
#. If we’re preserving user directories:

   #. Allocate arrays to call DTCMP to rank directory names of all files.
   #. For each file, check that its user-specified path is under the prefix directory
      (see the sketch at the end of this section), insert the dataset directory name
      in the subdir array, and insert the full parent directory in the dir array.
   #. Check that all files from all procs are within a directory under the prefix directory.
   #. Call DTCMP with the subdir array to rank user directory names for all files,
      and check that one common dataset directory contains all files.
   #. Broadcast dataset directory name to all procs.
   #. Record dataset directory in file list hash.
   #. Call DTCMP with the dir array to rank all directories across all procs.
   #. For each unique directory, we pick one process to later create that directory.
      This process records the directory name in the file list hash.
   #. Free arrays.

#. Otherwise (if we’re not preserving user-defined directories):

   #. Get name of dataset from dataset hash.
   #. Append dataset name to prefix directory to define dataset directory.
   #. Record dataset directory in file list hash.
   #. Record dataset directory as destination path for each file in file list hash.
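
The per-file path check in the directory-preservation branch is essentially string
handling against the prefix directory. The helper below is a minimal sketch of that
idea, not the code in ``scr_flush.c``; the function name ``check_under_prefix`` is
hypothetical, and real code would normalize both paths before comparing them.

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   /* Hypothetical sketch: verify that 'path' lies under 'prefix' and copy the
    * first path component below the prefix (the dataset subdirectory) into
    * 'subdir'.  Returns 0 on success, -1 if 'path' is outside the prefix. */
   static int check_under_prefix(const char* prefix, const char* path,
                                 char* subdir, size_t subdir_len)
   {
     size_t n = strlen(prefix);
     /* path must start with "<prefix>/" */
     if (strncmp(path, prefix, n) != 0 || path[n] != '/') {
       return -1;
     }

     /* copy characters after the prefix up to the next '/' */
     const char* start = path + n + 1;
     const char* end   = strchr(start, '/');
     size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);
     if (len + 1 > subdir_len) {
       return -1;
     }
     memcpy(subdir, start, len);
     subdir[len] = '\0';
     return 0;
   }

   int main(void)
   {
     char subdir[256];
     if (check_under_prefix("/p/lscratch/user",
                            "/p/lscratch/user/run1/ckpt/file.0",
                            subdir, sizeof(subdir)) == 0) {
       printf("dataset subdirectory: %s\n", subdir);  /* prints "run1" */
     }
     return 0;
   }

With prefix ``/p/lscratch/user`` and file ``/p/lscratch/user/run1/ckpt/file.0``,
the extracted dataset subdirectory is ``run1``, which is the kind of name placed
in the subdir array for the DTCMP ranking step above.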
.. _flow_flush_identify_containers:

scr_flush_identify_containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For each file to be flushed in the file list hash, identify segments, containers,
offsets, and lengths.
This is implemented in ``scr_flush.c``.

#. Get our rank within the ``scr_comm_node`` communicator.
#. Get the container size.
#. Extract dataset hash from file list hash.
#. Define path within dataset directory to all container files.
#. Loop over each file in file list hash and compute total byte count.
#. Compute total bytes across all processes in run with an allreduce on ``scr_comm_world``.
#. Compute total bytes per node by reducing to the node leader in ``scr_comm_node``.
#. Compute offset for each node with a scan across node leaders in ``scr_comm_node_across``.
#. Compute offset of processes within each node with a scan within ``scr_comm_node``.
#. Loop over each file and compute the offset of each file.
#. Given the container size and the offset and length of each file, compute the
   container file name, length, and offset for each segment, and store these details
   within the file list hash.
#. Check that all procs identified their containers.

.. _flow_flush_create_dirs:

scr_flush_create_dirs
~~~~~~~~~~~~~~~~~~~~~

Given a file list hash, create the dataset directory and any subdirectories
needed to hold the dataset.
This is implemented in ``scr_flush.c``.

#. Get file mode for creating directories.
#. Rank 0 creates the dataset directory:

   #. Read path from file list hash.
   #. Get subdirectory name of dataset within prefix directory.
   #. Extract dataset hash from file list hash, and get dataset id.
   #. Add dataset directory and id to index file, write index file to disk.
   #. Create dataset directory and its hidden ``.scr`` subdirectory.

#. Barrier across all procs.
#. Have each leader create its directory as designated in
   Section :ref:`0.1.4 <flow_flush_identify_dirs>`.
#. Ensure that all directories were created.

.. _flow_flush_data:

scr_flush_data
~~~~~~~~~~~~~~

This is implemented in ``scr_flush_sync.c``.

To flow control the number of processes writing, rank 0 writes its data first and
then serves as a gate keeper.
All processes wait until they receive a go-ahead message from rank 0 before starting,
and each sends a message back to rank 0 when finished.
Rank 0 maintains a sliding window of active writers.
Each process includes a flag indicating whether it succeeded or failed in copying
its files.
If rank 0 detects that a process has failed, the go-ahead message it sends to later
writers indicates this failure, in which case that writer immediately sends back a
message without copying its files.
This way, time is not wasted by later writers if an earlier writer has already failed.
A sketch of this windowed handshake appears at the end of this section.

RANK 0

#. Flush files in list, writing data to containers if used
   (Section :ref:`0.1.8 <flow_flush_list>`).
#. Allocate arrays to manage a window of active writers.
#. Send “go ahead” message to first W writers.
#. Waitany for any writer to send its completion notification, record a flag
   indicating whether that writer was successful, and send a “go ahead” message
   to the next writer.
#. Loop until all writers have completed.
#. Execute allreduce to inform all procs whether flush was successful.

NON-RANK 0

#. Wait for go-ahead message.
#. Flush files in list, writing data to containers if used
   (Section :ref:`0.1.8 <flow_flush_list>`).
#. Send completion message to rank 0 indicating whether copy succeeded.
#. Execute allreduce to inform all procs whether flush was successful.
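
The windowed handshake described above maps onto plain MPI point-to-point calls.
The following is a minimal sketch of that flow-control pattern, not the actual code
in ``scr_flush_sync.c``: the window size, the message tag, and the ``flush_my_files``
helper are assumptions made for illustration.

.. code-block:: c

   #include <mpi.h>

   #define TAG_FLUSH 0
   #define WINDOW    4   /* simultaneous writers; the real value is configurable */

   /* Hypothetical helper: copy this process's files; return 1 on success, 0 on failure. */
   extern int flush_my_files(void);

   /* Sketch of the rank 0 gatekeeper with a sliding window of active writers. */
   int flush_with_window(MPI_Comm comm)
   {
     int rank, ranks;
     MPI_Comm_rank(comm, &rank);
     MPI_Comm_size(comm, &ranks);

     int success = 1;
     if (rank == 0) {
       /* rank 0 writes its own files first */
       success = flush_my_files();

       MPI_Request reqs[WINDOW];
       int flags[WINDOW];          /* completion flag returned by each writer */
       for (int i = 0; i < WINDOW; i++) {
         reqs[i] = MPI_REQUEST_NULL;
       }

       /* give the first W writers the go-ahead */
       int next = 1;               /* next rank to start */
       int active = 0;             /* outstanding completion receives */
       while (next < ranks && active < WINDOW) {
         MPI_Send(&success, 1, MPI_INT, next, TAG_FLUSH, comm);
         MPI_Irecv(&flags[active], 1, MPI_INT, next, TAG_FLUSH, comm, &reqs[active]);
         next++;
         active++;
       }

       /* as each writer finishes, record its result and start the next writer */
       while (active > 0) {
         int idx;
         MPI_Waitany(WINDOW, reqs, &idx, MPI_STATUS_IGNORE);
         if (!flags[idx]) {
           success = 0;            /* a writer failed; later go-aheads carry this flag */
         }
         if (next < ranks) {
           MPI_Send(&success, 1, MPI_INT, next, TAG_FLUSH, comm);
           MPI_Irecv(&flags[idx], 1, MPI_INT, next, TAG_FLUSH, comm, &reqs[idx]);
           next++;
         } else {
           active--;
         }
       }
     } else {
       /* wait for the go-ahead; skip the copy if an earlier writer already failed */
       int go;
       MPI_Recv(&go, 1, MPI_INT, 0, TAG_FLUSH, comm, MPI_STATUS_IGNORE);
       success = go ? flush_my_files() : 0;
       MPI_Send(&success, 1, MPI_INT, 0, TAG_FLUSH, comm);
     }

     /* agree on a global result across all processes */
     int all_success;
     MPI_Allreduce(&success, &all_success, 1, MPI_INT, MPI_LAND, comm);
     return all_success;
   }

Capping the window keeps the parallel file system from being hit by all ranks at
once, while the failure flag carried in the go-ahead message lets later writers skip
their copies as soon as any writer reports an error.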
.. _flow_flush_list:

scr_flush_files_list
~~~~~~~~~~~~~~~~~~~~

Given a list of files, this function copies data file-by-file, and then it updates
the hash that forms the rank2file map.
It is implemented in ``scr_flush_sync.c``.

#. Get path to summary file from file list hash.
#. Loop over each file in file list.

LOOP

#. Get file name.
#. Get basename of file (throw away directory portion).
#. Get hash for this file.
#. Get file meta data from hash.
#. Check for container segments (TODO: what if a process has no files?).

CONTAINERS

#. Add basename to rank2file map.
#. Flush file to its containers.
#. If successful, record file size, CRC32 if computed, and segment info in rank2file map.
#. Otherwise, record 0 for COMPLETE flag in rank2file map.
#. Delete file name and loop.

NON-CONTAINERS

#. Get the directory to write the file to from the PATH key in the file hash.
#. Append basename to directory to get full path.
#. Compute relative path to file starting from dataset directory.
#. Add relative path to rank2file map.
#. Copy file data to destination.
#. If successful, record file size, and CRC32 if computed, in rank2file map.
#. Otherwise, record 0 for COMPLETE flag in rank2file map.
#. Delete relative and full path names and loop.

END LOOP

.. _flow_flush_complete:

scr_flush_complete
~~~~~~~~~~~~~~~~~~

Writes out summary and rank2file map files.
This is implemented in ``scr_flush.c``.

#. Extract dataset hash from file list hash.
#. Get dataset directory path.
#. Write summary file (Section :ref:`0.1.10 <flow_flush_summary>`).
#. Update index file to mark dataset as “current”.
#. Broadcast signal from rank 0 to indicate whether flush succeeded.
#. Update flush file to indicate that dataset is now on the parallel file system.

.. _flow_flush_summary:

scr_flush_summary
~~~~~~~~~~~~~~~~~

Produces summary and rank2file map files in the dataset directory on the parallel
file system.
Data for the rank2file maps are gathered and written via a data-dependent tree,
such that no process has to write more than 1MB to each file.
This is implemented in ``scr_flush.c``.

#. Get path to dataset directory and hidden ``.scr`` directory.
#. Given the data to write to the rank2file map file, pick a writer process so that
   each writer gets at most 1MB.
#. Call ``scr_hash_exchange_direction`` to fold data up the tree.
#. Rank 0 creates summary file and writes dataset hash.
#. Define names of rank2file map files.
#. Funnel rank2file data up the tree in a recursive manner
   (Section :ref:`0.1.11 <flow_flush_summary_map>`).
#. If process is a writer, write rank2file map data to file.
#. Free temporaries.
#. Check that all procs wrote all files successfully.

.. _flow_flush_summary_map:

scr_flush_summary_map
~~~~~~~~~~~~~~~~~~~~~

Produces summary and rank2file map files in the dataset directory on the parallel
file system.
This is implemented in ``scr_flush.c``.

#. Get path to dataset directory and hidden ``.scr`` directory.
#. If we received a rank2file map in the previous step, create a hash to specify
   its file name to include at the next level in the tree.
#. Given this hash, pick a writer process so that each writer gets at most 1MB
   (see the sketch after this list).
#. Call ``scr_hash_exchange_direction`` to fold data up the tree.
#. Define names of rank2file map files.
#. Funnel rank2file data up the tree by calling ``scr_flush_summary_map`` recursively.
#. If process is a writer, write rank2file map data to file.
#. Free temporaries.
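
The 1MB writer selection above can be pictured as grouping ranks by the running
total of the bytes they contribute. The sketch below illustrates that grouping with
an exclusive scan and a communicator split; it is an approximation for illustration
only, not the tree exchange performed by ``scr_hash_exchange_direction``, and a
group may slightly exceed the limit when one rank's data straddles a 1MB boundary.

.. code-block:: c

   #include <stdlib.h>
   #include <mpi.h>

   #define WRITE_LIMIT (1024 * 1024)  /* target bytes per writer (1MB) */

   /* Sketch: gather each rank's rank2file bytes to a nearby "writer" rank so that
    * each writer collects roughly WRITE_LIMIT bytes.  Returns the gathered buffer
    * on writer ranks (caller frees), NULL elsewhere. */
   char* gather_to_writers(const char* data, int size, MPI_Comm comm,
                           int* out_is_writer, int* out_total)
   {
     int rank;
     MPI_Comm_rank(comm, &rank);

     /* running byte offset of this rank's data across the communicator */
     long bytes = (long) size;
     long offset = 0;
     MPI_Exscan(&bytes, &offset, 1, MPI_LONG, MPI_SUM, comm);
     if (rank == 0) {
       offset = 0;  /* MPI_Exscan leaves rank 0's result undefined */
     }

     /* ranks whose offsets fall in the same 1MB window share one writer */
     int color = (int)(offset / WRITE_LIMIT);
     MPI_Comm group;
     MPI_Comm_split(comm, color, 0, &group);

     int grank, gsize;
     MPI_Comm_rank(group, &grank);
     MPI_Comm_size(group, &gsize);

     /* the lowest rank in each window acts as the writer and gathers the bytes */
     int* counts = NULL;
     int* displs = NULL;
     char* recvbuf = NULL;
     int total = 0;
     if (grank == 0) {
       counts = malloc(gsize * sizeof(int));
       displs = malloc(gsize * sizeof(int));
     }
     MPI_Gather(&size, 1, MPI_INT, counts, 1, MPI_INT, 0, group);
     if (grank == 0) {
       for (int i = 0; i < gsize; i++) {
         displs[i] = total;
         total += counts[i];
       }
       recvbuf = malloc(total > 0 ? total : 1);
     }
     MPI_Gatherv(data, size, MPI_CHAR, recvbuf, counts, displs, MPI_CHAR, 0, group);

     free(counts);
     free(displs);
     MPI_Comm_free(&group);

     *out_is_writer = (grank == 0);
     *out_total = total;
     return recvbuf;  /* writer ranks now hold the data for their window */
   }

SCR itself folds the hash data up a tree with ``scr_hash_exchange_direction`` rather
than gathering within split communicators, but the goal is the same: each designated
writer ends up holding roughly 1MB of rank2file data to write to its map file.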