How to Manage Large Data
Introduction
The size of data files used on the cluster has increased significantly over the past years. Working with files on the order of terabytes requires careful consideration of I/O and data management. That means that before you start a large series of production runs, you should ask yourself a few questions about the data files involved and, if necessary, run a few benchmarks to better understand your application.
For I/O, you first need to understand what kind of I/O your HPC application performs. How many data files are read or written, and what is the typical size of a data file? How does the size of a data file change when parameters of the application are changed? What is the I/O pattern of your application, i.e. how often and how much data is read or written at which point of a run? Is the I/O sequential or random access? And, in the case of parallel applications, is the I/O done serially or in parallel?
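If you are not sure what I/O your application actually performs, a system-call summary can give a first overview. The following is only a sketch; the application name my_app and its input file are placeholders for your own program.

```bash
# Summarize the system calls of a (hypothetical) application run.
# -c prints a table with counts and time per syscall, -f follows child processes,
# so you can see how many read/write/open calls your run issues.
strace -c -f -o io-summary.txt ./my_app input.dat

# Inspect the summary afterwards, e.g. the read and write lines:
grep -E 'read|write' io-summary.txt
```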
For data management, you have to consider how to organize your data on the HPC system to minimize the need for storage space and data transfers between different filesystems. What is the total amount of data generated or needed for a project? Which data files are needed for processing, i.e. must be accessible from the compute nodes? Which data files can be deleted after processing? Which data files need to be kept for a longer period of time? Is it possible to pack or compress the data files to reduce their size?
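A quick inventory of a project directory often helps to answer these questions. The commands below are a sketch; the path $WORK/myproject is only an example.

```bash
# Total size of the project directory.
du -sh $WORK/myproject

# Number of files (many small files are often a sign of inefficient I/O).
find $WORK/myproject -type f | wc -l

# Size per subdirectory, sorted, to find the largest contributors.
du -h --max-depth=1 $WORK/myproject | sort -h
```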
The Hardware and Filesystems for Storage
The HPC cluster has different filesystems available, which should be chosen based on the considerations about I/O and data management above. The underlying hardware and its connection to the cluster determine what each filesystem can deliver, namely the bandwidth for I/O and the capacity for storing data. The following table summarizes the limitations for each filesystem:
Hardware | Filesystem | Network | Total Capacity | Capacity per User | Total Bandwidth | Bandwidth per Process |
---|---|---|---|---|---|---|
ESS | $HOME | 10G/1G | > 2 PB | 1 TB | 2.5 GB/s | 0.125 GB/s |
ESS | $DATA | 10G/1G | > 2 PB | 20 TB | 2.5 GB/s | 0.125 GB/s |
ESS | $OFFSITE | 10G/1G | > 2 PB | 12.5 TB | 2.5 GB/s | 0.125 GB/s |
ESS | $GROUP | 10G/1G | > 2 PB | - | 2.5 GB/s | 0.125 GB/s |
GPFS | $WORK | FDR IB | 900 TB | 25 TB | 12.5 GB/s | > 1 GB/s |
local HDD or NVMe | $TMPDIR | - | 290 TB | up to 1.7 TB | > 1 GB/s | > 1 GB/s |
Explanations: The different filesystems are provided by two major hardware resources, the central storage system of the University (ESS) and the parallel filesystem of the cluster (GPFS). In addition, the compute nodes in the CARL cluster are equipped with a local HDD or NVMe card. It is recommended to access the different filesystems via the environment variables, e.g. in your job scripts. Except for the local HDDs or NVMes, data transfer uses either the ethernet network (ESS) or the FDR Infiniband (GPFS). The cluster has two 10 Gbit/s connections to the campus network, which means a total bandwidth of 2.5 GB/s can be achieved when writing to the ESS filesystems. However, since the compute nodes only have 1G ethernet uplinks, the maximum bandwidth a single job or task running there can achieve is 0.125 GB/s. And if multiple jobs on a single node have to share the bandwidth of the uplink, the transfer rate can become as low as 0.005 GB/s (in fact, the same applies to all jobs on the cluster sharing the 10G uplink to the campus/ESS). On the other hand, data transfer to the GPFS over the Infiniband is five times faster in total and in practice 10x faster for a single job or task running on a compute node. In addition, the limiting factor here is not the network bandwidth of the node, which is about 6 GB/s, so multiple jobs on the same node could write to $WORK up to 100x faster than to e.g. $HOME.
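If you want to get a feeling for these numbers yourself, a simple sequential-write test with dd is a rough sketch of such a benchmark. Block size and file size below are example values, and the rates you measure on a shared system will vary with the load caused by other users.

```bash
# Write 4 GB sequentially to $WORK and to $HOME and compare the rates dd reports.
# conv=fsync makes dd flush the data to disk before reporting the transfer rate.
dd if=/dev/zero of=$WORK/ddtest.dat bs=1M count=4096 conv=fsync
dd if=/dev/zero of=$HOME/ddtest.dat bs=1M count=4096 conv=fsync

# Clean up the test files afterwards.
rm -f $WORK/ddtest.dat $HOME/ddtest.dat
```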
Best Practice Job I/O
As mentioned above, you first need to understand what I/O your application is doing and how it might affect performance. Here are some things to consider:
- Preferably use $WORK and/or $TMPDIR for I/O during job runtime. The easiest way to achieve this is to prepare the job in $WORK and run sbatch from there (see the example job script after this list). Alternatively, you can modify your job script to run in a directory under $WORK (which could be created from the job script).
- Estimate the fraction of the total job runtime used for I/O and choose your filesystem accordingly. For example, if your job takes 2h per run but spends 30 minutes writing to $HOME, you could maybe run the same job in a little over 1:30h when writing to $WORK instead.
- Avoid writing many small files and consider unformatted I/O.
- If your application does very random I/O, use $TMPDIR. Also, try to increase the memory limit for the job, as the OS will use the available RAM for I/O caching.
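As an illustration of these points, here is a minimal sketch of a Slurm job script that keeps I/O on $WORK and $TMPDIR. The program name my_app, its options, and the directory layout are placeholders that you would adapt to your own application.

```bash
#!/bin/bash
#SBATCH --job-name=large-data-example
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# Create a per-job run directory on the parallel filesystem and work there.
RUNDIR=$WORK/myproject/run_$SLURM_JOB_ID
mkdir -p "$RUNDIR"
cd "$RUNDIR"

# Stage the input data once from $DATA to the faster filesystem.
cp $DATA/myproject/input.dat .

# Very random or temporary I/O goes to the node-local $TMPDIR.
SCRATCH=$TMPDIR/scratch
mkdir -p "$SCRATCH"

# Run the (hypothetical) application with its working files on the local scratch.
$WORK/myproject/my_app --input input.dat --workdir "$SCRATCH" --output "$SCRATCH/results.dat"

# Copy the results back before the job ends; $TMPDIR is cleaned up afterwards.
cp "$SCRATCH/results.dat" "$RUNDIR/"
```

Whether staging to $TMPDIR pays off depends on the I/O pattern; for purely sequential writes, running directly in a directory under $WORK is usually sufficient.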
Best Practice Data Management
After your jobs are completed you will need to decide how to proceed with the data (ideally you have already planned ahead for that at the beginning of the project). Since the capacity of the filesystems is limited (both by the hardware and by quotas per user), you will also need to regularly clean up your data on the HPC cluster.
- Only keep those files on $WORK that you still need to process or work with on the cluster. Delete data files that you no longer need, for example, raw output files from computations that could be redone if needed. Note that there is no backup and there are no snapshots for $WORK, so a removed file cannot be recovered.
- Move files you need to keep from $WORK to $DATA or $OFFSITE, see e.g. this guide.
- Files that are retained only to comply with the rules of good scientific practice and are no longer actively needed should be stored as compressed tar files (see the sketch after this list).
- Document your files with README files or in a similar way to make sure you will still know what is in the data files in two years.
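As a sketch of the last three points, the following commands pack a finished project into a compressed tar file on $DATA together with a short README. All paths and names are examples and should be adapted to your own project layout.

```bash
# A short README next to the data helps you remember its contents later.
echo "Raw output of parameter study X, produced by my_app; job scripts in ./scripts" \
    > $WORK/myproject/README

# Create the target directory on $DATA and write a compressed tar file.
mkdir -p $DATA/archive
tar czf $DATA/archive/myproject.tar.gz -C $WORK myproject

# Verify that the archive is readable before deleting the originals from $WORK.
tar tzf $DATA/archive/myproject.tar.gz > /dev/null && rm -rf $WORK/myproject
```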