How to Manage Large Data

Introduction

The size of data files being used on the cluster has increased significantly in recent years. Working with files on the order of terabytes requires careful consideration of I/O and data management. That means before you start a large series of production runs, you should ask yourself a few questions about the data files you need, and if necessary run a few benchmarks to better understand your application.

For I/O you first need to understand what kind of I/O your HPC application produces. How many data files are read or written, and what is the typical size of a data file? How does the size of a data file change when parameters of the application are changed? What is the I/O pattern of your application, i.e. how often and how much data is read or written, and when? Is the I/O sequential or random access? And, in the case of parallel applications, is the I/O done serially or in parallel?
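If you are unsure about the I/O behaviour of your application, a small benchmark on the target filesystem can help answer these questions. The following sketch measures the sequential write and read bandwidth a single process achieves; the target path, file size, and chunk size are placeholder assumptions to adjust to your own setup, and the script is only an illustration, not an official benchmark tool.

 #!/usr/bin/env python3
 """Minimal sequential I/O benchmark (illustrative sketch only)."""
 import os
 import time
 
 TARGET = "/scratch/iotest.bin"   # placeholder path; point this at the filesystem to test
 CHUNK = 4 * 1024 * 1024          # 4 MiB per write/read call
 TOTAL = 1024 ** 3                # 1 GiB test file
 
 def write_test():
     buf = os.urandom(CHUNK)
     start = time.perf_counter()
     with open(TARGET, "wb") as f:
         for _ in range(TOTAL // CHUNK):
             f.write(buf)
         f.flush()
         os.fsync(f.fileno())     # make sure the data really reaches the filesystem
     return TOTAL / (time.perf_counter() - start)
 
 def read_test():
     start = time.perf_counter()
     with open(TARGET, "rb") as f:
         while f.read(CHUNK):
             pass
     return TOTAL / (time.perf_counter() - start)
 
 if __name__ == "__main__":
     print(f"sequential write: {write_test() / 1e9:.2f} GB/s")
     print(f"sequential read:  {read_test() / 1e9:.2f} GB/s")
     os.remove(TARGET)

Note that the read result can be inflated by the page cache on the node; for realistic numbers, use a test file larger than the node's memory or repeat the read on a freshly started job.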

For data management, you have to consider how to organize your data on the HPC system to minimize the need for storage space and data transfers between different filesystems. What is the total amount of data generated or needed for a project? Which data files are needed for processing, i.e. must be accessible from the compute nodes? Which data files can be deleted after processing? Which data files need to be kept for a longer period of time? Is it possible to pack or compress the data files to reduce the size?
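As an illustration of the last point, many small result files can be packed into a single compressed archive before they are moved to longer-term storage. The sketch below uses Python's standard tarfile module; the directory and archive names are placeholders.

 #!/usr/bin/env python3
 """Sketch: pack a directory of result files into one compressed archive."""
 import tarfile
 from pathlib import Path
 
 results_dir = Path("run_001")        # placeholder: directory with output files
 archive = Path("run_001.tar.gz")     # placeholder: compressed archive to create
 
 with tarfile.open(str(archive), "w:gz") as tar:
     tar.add(str(results_dir), arcname=results_dir.name)
 
 original = sum(f.stat().st_size for f in results_dir.rglob("*") if f.is_file())
 print(f"packed {original} bytes into {archive.stat().st_size} bytes")

Besides saving space, packing reduces the number of files, which matters on parallel filesystems where metadata operations on millions of small files are expensive.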

The Hardware and Filesystems for Storage

The HPC cluster has different filesystems available, which should be chosen based on the considerations about I/O and data management above. The underlying hardware and its connection to the cluster determine what each filesystem can deliver, namely the bandwidth for I/O and the capacity for storing data. The following table summarizes the limitations for each filesystem:

Hardware  Filesystem  Network  Total Capacity  Capacity per User  Total Bandwidth  Bandwidth per Process
ESS       $HOME       10G/1G   > 2 PB          1 TB               2.5 GB/s         0.125 GB/s
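These numbers translate directly into time. For example, a single process writing a 1 TB data set to $HOME at the listed 0.125 GB/s per process needs roughly 1000 GB / 0.125 GB/s = 8000 s, i.e. a bit over two hours, assuming the full per-process bandwidth is actually achieved. A minimal calculation:

 # rough estimate: time for one process to write a 1 TB data set to $HOME
 data_size_gb = 1000        # example data set size in GB
 bandwidth_gb_s = 0.125     # bandwidth per process from the table above
 print(f"{data_size_gb / bandwidth_gb_s / 3600:.1f} hours")   # about 2.2 hours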