Staging Data for Computation

Summary

Savio's scratch filesystem was designed for very fast reading and writing of files (i.e., I/O). Using scratch benefits your jobs and reduces the I/O load on the filesystem supporting home and group directories, thereby helping other users get their work done. In many cases you should therefore copy ('stage') your data to scratch before running your job(s) and write your job output files to scratch.

When to stage data to scratch

Copying a file to scratch costs the same, in terms of filesystem usage, as reading the file once in a job, so staging pays off whenever data will be read more than once. Situations in which you should use scratch include:

  • files that are used repeatedly within a job or across multiple jobs,
  • files you wish to read/write in parallel,
  • large input/output files, and
  • large numbers of files.

As a rule of thumb, if you are working with more than 100 MB of data within your job, you should consider staging it to scratch.
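
As a minimal sketch of the staging pattern, the following Python snippet copies input data from a home directory to scratch before the compute phase and writes outputs under scratch. The scratch path /global/scratch/users/<user> and the directory names are illustrative assumptions; substitute your own layout.

```python
import os
import shutil
from pathlib import Path

# Illustrative scratch location; substitute your cluster's actual path.
scratch = Path("/global/scratch/users") / os.environ["USER"]

# Hypothetical input data living in the home directory.
src = Path.home() / "project" / "inputs"
staged = scratch / "project" / "inputs"

# Stage the inputs to scratch once, before the I/O-heavy work begins.
if not staged.exists():
    staged.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, staged)

# Write job outputs to scratch as well, rather than to the home directory.
outdir = scratch / "project" / "outputs"
outdir.mkdir(parents=True, exist_ok=True)
(outdir / "result.txt").write_text("job output goes here\n")
```

Once staged, subsequent jobs can read the scratch copy directly instead of re-reading the files from the home directory.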

Using data in condo storage

For condo storage set up in summer 2021 or later, the condo storage is on the same filesystem as scratch, so data stored there does not need to be (and should not be) staged to scratch before use. For condo storage set up before summer 2021, please consider staging your data to scratch based on the guidance above.

Staging data to /tmp

As an alternative to using your scratch directory for a job's I/O, you can use the /tmp directory on the node(s) your Slurm job is running on (for example, if the specifics of your software favor node-local storage). While in general we suggest using scratch, one situation where /tmp can be advantageous is on savio3_htc, whose fast NVMe solid-state drives may achieve several-fold faster I/O than scratch for jobs doing non-parallel I/O.

Note that you will need to copy data to or from /tmp within every job, as /tmp is local to a given machine and not directly accessible from the login nodes.
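
As a sketch of that pattern, assuming Slurm's standard SLURM_JOB_ID environment variable and the same illustrative scratch path as above, a job might copy its inputs to /tmp, work locally, and copy results back before exiting:

```python
import os
import shutil
from pathlib import Path

# SLURM_JOB_ID is set by Slurm inside a job; it keeps /tmp paths unique.
job_id = os.environ.get("SLURM_JOB_ID", "interactive")
local = Path("/tmp") / f"stage-{job_id}"
local.mkdir(parents=True, exist_ok=True)

# Illustrative scratch layout; adjust to your own directories.
scratch = Path("/global/scratch/users") / os.environ["USER"]
(scratch / "outputs").mkdir(parents=True, exist_ok=True)

try:
    # Copy inputs from scratch to node-local /tmp at the start of the job.
    shutil.copy2(scratch / "inputs" / "data.bin", local / "data.bin")

    # ... run the I/O-heavy computation against the local copy here ...

    # Copy results back to scratch before the job ends, since /tmp
    # cannot be reached from the login nodes afterward.
    shutil.copy2(local / "results.bin", scratch / "outputs" / "results.bin")
finally:
    # Clean up so the node's local disk is not left full for later jobs.
    shutil.rmtree(local, ignore_errors=True)
```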

Python packages and Conda environments

Python packages (and Conda environments) comprise many small files, often thousands or more. Python packages installed via pip are by default stored in ~/.local in your home directory, and Conda environments and their constituent packages are by default stored in ~/.conda.

Using these packages or environments (particularly when running Python in parallel) can place a heavy load on the filesystem supporting users' home and group directories. In contrast, Python packages and Conda environments installed on scratch can be accessed quickly and without burdening that filesystem.
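
One way to keep pip's many small files on scratch (a sketch under the same illustrative scratch path as above) is to point the standard PYTHONUSERBASE variable at a scratch directory before installing, since pip's --user installs honor it; the package name here is just an example:

```python
import os
import subprocess
import sys

# Illustrative scratch location for Python user installs.
scratch = f"/global/scratch/users/{os.environ['USER']}"
os.environ["PYTHONUSERBASE"] = f"{scratch}/python"

# 'pip install --user' honors PYTHONUSERBASE, so package files land
# on scratch instead of ~/.local in the home directory.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--user", "numpy"],
    check=True,
)

# Note: PYTHONUSERBASE must also be set when you later run Python
# (e.g., exported in your job script), so that the interpreter adds
# the scratch site-packages directory to sys.path.
```

Similarly, Conda environments can be created under scratch with conda's --prefix (-p) option instead of the default ~/.conda location.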