
Frequently Asked Questions

Account issues

Q. How can I access the Faculty Computing Allowance application (Requirements Survey) and Additional User Request forms? I'm seeing a "You need permission" error.

A. You need to authenticate via CalNet to access the online forms for applying for a Faculty Computing Allowance, and for requesting additional user accounts on the Savio cluster.

When accessing either form, you may encounter the error message, "You need permission. This form can only be viewed by users in the owner's organization", under either of these circumstances:

1. If you haven't already successfully logged in via CalNet. (If you don't have a CalNet ID, please work with a UCB faculty member or other researcher who can access the form on your behalf.)
2. If you've logged in via CalNet, but you're also simultaneously connected, in your browser, to a non-UCB Google account; for instance, to access a personal Gmail account. (If so, the easiest way to access the online forms might be to use a private/incognito window in your primary browser, or else use a second browser on your computer, one in which you aren't already logged into a Google account. As an alternative, you can first log out of all of your Google accounts in your primary browser, before attempting to access these forms.)

Q: How can I add new users to my FCA or Condo account?

A: You can add a new or existing Savio user to your FCA or Condo account by filling out and submitting the Additional User Account Request/Modification Form, which you can find here. New user accounts are usually set up within a few days to a week or so. The PI of the FCA or Condo account can fill out and submit the form on behalf of the new or existing user, or the new or existing user can fill out the form themselves if they have a CalNet ID (i.e., a UC Berkeley email address).

Please make sure that the new user also fills out the BRC User Access Agreement Form as well (after the Additional User Account Request/Modification Form has been filled out and submitted). An existing Savio user does not need to fill this form out, as they already filled out and submitted the form when their Savio account was initially created.

Q: How long does it take until my account is created?

A: After the BRC User Access Agreement Form has been filled out and submitted, it usually takes between a few days and a week (or, in some circumstances, a bit longer) for us to send out the confirmation email. That email points you to the page with instructions for linking one of your personal email accounts with your BRC HPC Cluster account, installing and setting up Google Authenticator on a mobile device, setting up your PIN / token, and logging into the BRC Cluster, so that you can access your account and start working on the BRC Cluster for the first time.

Q: I haven’t heard back about my account request. How do I know if it is ready and is there anything else I need to do in order to get access to my new account?

A: After the BRC User Access Agreement Form has been submitted, it usually takes between a few days and a week (or, in some cases, a bit longer) for your new user account to be set up on the Savio cluster and ready to use. There are no additional steps you need to take during this time. We encourage you to use the wait to familiarize yourself with our extensive Savio documentation; many of the questions that come up as you begin working on Savio are answered there. When the new user account is ready, you will receive a confirmation email pointing you to a page with instructions for linking one of your personal email accounts with your BRC HPC Cluster account, installing and setting up Google Authenticator on a mobile device, setting up your PIN / token, and logging into the BRC Cluster. Please do not attempt to follow any of these instructions before you have received the confirmation email. If you do not receive the confirmation email in your regular inbox within a week or so after submitting the BRC User Access Agreement Form, please check your spam folder.

Q: Do I need PI permission before requesting an account? Do I need to fill out the Additional User Account Request/Modification Form myself, or can the PI fill it out on my behalf?

A: Yes, you need to check with your PI before requesting a new Savio account to be created under or to have access to the PI’s existing FCA or Condo account. Once you have checked with the PI, then if you have a CalNet ID (i.e., berkeley.edu email address) you can fill out and submit the Additional User Account Request/Modification Form yourself. If you do not have a CalNet ID then the PI can fill out and submit the form on your behalf.

Q. What happens to my Savio account when I graduate or otherwise leave UC Berkeley?

A: Your Savio account and your access to Savio should remain active for some period of time after you have left UC Berkeley, as long as your CalNet ID remains active, and you can continue to log into the cluster during that time.

If your CalNet ID has expired, this could in some cases affect your access to and use of Savio. Note, however, that a CalNet ID is not required for Savio authentication once you have a Savio account; in keeping with our pledge to support collaborators from other institutions, there isn't any requirement that Savio users have a CalNet ID. Also, if you do have a CalNet ID and it is revoked (for example, because you graduate or your employment with the University ends), you may still be able to continue using the Google Authenticator app on your smartphone or tablet to log into and access Savio. Issues with CalNet should be directed to the IT Campus Shared Services (ITCSS) team; their help desk email is itcsshelp@berkeley.edu. For example, you can work with your PI to request a Sponsored Guest type account that would enable you to continue to have access to some UC Berkeley services, including Savio.

Also, as noted in our documentation, please recall that the PI for the project(s) is ultimately responsible for notifying Berkeley Research Computing when user accounts should be closed, and to specify the disposition of a user's software, files, and data.

If you are no longer able to log into Savio but still need access to data on Savio, you can arrange with your research colleagues who have active Savio accounts and access to Savio and/or the PI(s) of the projects you're involved with to transfer the needed data to bDrive (Google Drive), Box, or an external server or personal computer following the instructions, guidelines, and examples in our documentation.

Q. How much am I charged for computational time on Savio?

A: For those with Faculty Computing Allowance accounts, usage of computational time on Savio is tracked (in effect, "charged" for, although no monetary costs are incurred) via abstract measurement units called "Service Units". When all of the Service Units provided under an Allowance have been exhausted, no more jobs can be run under that account. When you have exhausted your entire Faculty Computing Allowance, there are a number of options open to you for getting more computing time for your project. Note that usage tracking does not impact Condo users, who have no Service Unit-based limits on the use of their associated compute pools.

You can view how many Service Units have been used to date under a Faculty Computing Allowance, or by a particular user account on Savio, via the check_usage.sh script.

Q: How and when can I renew my Faculty Computing Allowance (FCA)?

A: Each year, beginning in May, you can submit an application form to renew your Faculty Computing Allowance. Links to this renewal application form are typically emailed to Allowance recipients (and to other designated "main contacts" on such accounts) by or before mid-May each year. (There are often some at least modest differences in the renewal application process from year to year, so there is no permanent online location for this form.)

Renewal applications submitted during May will be processed beginning June 1st. Those submitted and approved later in the year, after the May/June period (i.e., after June 30th), will receive pro-rated allowances of computing time, based on the month of application during the allowance year. Note that only new allowances set up in June, or existing allowances renewed in June, receive the full 300,000 SU allocation.

Please note that there are some additional options open to you as well for getting more computing time for your project when your allowance is used up.

Please also recall that if a researcher is already the PI of an existing FCA project, they cannot request the creation of a new FCA account. So, if a researcher has exhausted the allocated service units on their FCA, they should not request the creation of a new FCA; rather, they should renew their already existing FCA or purchase additional computing time on Savio.

Q: How can I get access to CGRL resources such as the Rosalind or Vector condos?

A: Email cgrl@berkeley.edu. See the CGRL website for more information.

Q: How can I work with sensitive (P2/P3) data on Savio?

A: Please see our documentation on working with sensitive data on Savio. To start the process of creating a P2/P3 project, see Accounts for Sensitive Data.

Login and connection issues

Q: I can’t log into Savio. For example, when I try to log in, I get the following error message: 'username@hpc.brc.berkeley.edu: Permission denied (gssapi-keyex,gssapi-with-mic,keyboard-interactive)'. What should I do?

A: The first thing to do is to make sure you’ve tried all of the steps in the Savio login troubleshooting documentation. In particular, we recommend resetting your OTP on the LBL token management page (troubleshooting step #5).

If the troubleshooting steps don't resolve the issue, then it could be that your external public IPv4 address has been blocked for too many failed authentication attempts or for some other reason(s). (Note that on Savio, there is a policy of automatically blocking IP addresses that have too many failed login attempts.) To check if this is the case, try to change your external public IP (IPv4) address (maybe try logging in from a different IP address/network or different wifi or phone hotspot). For example, if you’re able to log into Savio on your phone hotspot but not on the campus network, it's possible that a campus IP address is being blocked. Other factors to consider and questions to ask yourself include:

  • Which BRC endpoint are you trying to connect to (such as dtn.brc.berkeley.edu or hpc.brc.berkeley.edu)?
  • Do you have proper network connectivity to other SSH hosts?
  • Are you trying to connect through a public wifi? Maybe try using a VPN or switch to another wifi?
  • What error message do you get when you try to ssh from your computer?
  • What computer/OS version/ssh version are you using and what is your exact ssh command?
  • Does your router/network have a firewall against Savio domain names / IP addresses?

If it looks like your IP address has been blocked, let us know your external (public) IPv4 address (which you can find by visiting here (see first line) or here), and we can unblock it.
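If you prefer the command line, you can also usually find your public IPv4 address by querying an external service; for example (ifconfig.me is just one common such service):

curl -4 ifconfig.me    # prints your external public IPv4 address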

Q: When logging into the Non-LBL Token Management web page to set up my token/PIN so I can log into Savio for the first time, I receive the following error message: 'Login Error: There was an error logging you in. The account that you logged in with has not been mapped to a system account. If you believe that you may have mapped a different account, choose it below and try again.'

A: This may be an indication that you need to link your email account with your BRC HPC Cluster account. To do this, please follow the steps listed in our documentation here.

Please also make sure you are logging into the LBL Token Management page with the same account that was linked (i.e., the same information that you used when filling out the account linking form). For example, if you received the linking invitation email via your UC Berkeley (CalNet) email address, make sure you log into the LBL Token Management page with your CalNet ID.

Also, if you need to reset your PIN, you can go to the token management page. Select the identity you linked with your cluster account when you first set it up (for UC Berkeley people, this is usually your UC Berkeley account -- your CalNet ID), and log in. On the token management page, click the “Update PIN” link for the login token you want to change, and you can change your PIN.

If you're using the Google Chrome browser and find that you are still getting the same error message when you try to log into the BRC Cluster, please follow the above instructions using a different browser instead, if at all possible, and see if that helps.

Q: I’m in the process of setting up my OTP (for logging into Savio for the first time), but I haven’t received an account linking email. What should I do / how should I resolve this?

A: Make sure that you fill out and submit the account-linking form following the instructions in our documentation. You should receive an email within 10 to 30 minutes (usually under 10 minutes) that you can use to continue the set-up process described in our documentation. Sometimes users enter a variant of their name that differs from what we have on file; if the name doesn’t match exactly, the form won’t trigger the email. If needed, a consultant can fill out and submit the form on the user’s behalf, and/or send the user the correct information to use when filling out the form. If you fill out the form with the correct data and don’t receive an account-linking invitation email within half an hour or so, please get in touch with us. If a user or consultant fills out the form with known-good information and the email still doesn’t arrive within half an hour, the consultant may contact the Savio system administration team for further assistance.

Q: I forgot my PIN/Password. What do I need to do in order to be able to login to Savio again?

A: If you haven't used the system in a while, you might forget the PIN associated with your login token. To reset/update your PIN, you can go to the Non-LBL Token Management web page, select the identity you linked with your cluster account when you first set it up (for UC Berkeley people, this is usually your UC Berkeley account), and log in. On the token management page, click the “Update PIN” or “Reset PIN” link for the login token you want to change, and you can change your PIN. If you have any trouble with it, please let us know.

Q: I need to remove/transfer old data from my account or the account of a collaborator, but I can’t login anymore or I can’t access their account. What do I do?

A: Your Savio account and your access to Savio should remain active for some period of time after you have left UC Berkeley, as long as your CalNet ID remains active, and you can continue to log into the cluster during that time.

You won't be able to log into Savio without an active CalNet account. So, if your CalNet ID has expired, this will have an impact on your access to and use of Savio. Issues with CalNet should be directed to the IT Campus Shared Services (ITCSS) team; their help desk email is itcsshelp@berkeley.edu. For example, you can work with your PI to request a Sponsored Guest type account that would enable you to continue to have access to some UC Berkeley services, including Savio.

Also, as noted in our documentation, please recall that the PI for the project(s) is ultimately responsible for notifying Berkeley Research Computing when user accounts should be closed, and to specify the disposition of a user's software, files, and data.

If you are no longer able to log into Savio but still need access to data on Savio, you can arrange with your research colleagues who have active Savio accounts and access to Savio and/or the PI(s) of the projects you're involved with to transfer the needed data to bDrive (Google Drive), Box, or an external computer/server following the instructions, guidelines, and examples in our documentation. We can also assist you with this if and as needed.

Q: My terminal / ssh connection to Savio keeps timing out. Is there a way to stay logged on?

A: Inactive connections are terminated after five minutes. It’s possible to set up your SSH configuration so that your session will not time out.

If your laptop/desktop is a Mac or Linux machine, you can add this to your ~/.ssh/config file:

Host *
    ServerAliveInterval 300
    ServerAliveCountMax 2

If you're using PuTTY on Windows, there is information online about configuring keepalives.

Another option is to use the screen or tmux programs for your interactive session. If you are disconnected (or choose to logout from Savio), you can reconnect to your running screen or tmux session when you log back in.
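For example, a minimal tmux workflow (the session name here is arbitrary) looks like this:

tmux new -s mysession      # start a named session on the login node
# ... run your interactive work inside the session ...
# press Ctrl-b then d to detach; the session keeps running if your connection drops
tmux attach -t mysession   # reattach after logging back in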

Finally, in many cases you may be best off setting up your computation to run as a background job using sbatch rather than running interactively using srun.
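For example, a long-running command can be wrapped in a simple batch script and submitted with sbatch. Here is a minimal sketch; the account, partition, and script names are placeholders that you would replace with your own values:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=fc_projectname     # your FCA or Condo account (placeholder)
#SBATCH --partition=savio2           # a partition you have access to (placeholder)
#SBATCH --time=12:00:00              # wall-clock limit (always required)
#SBATCH --nodes=1

module load python                   # load whatever software your job needs
python my_analysis.py                # the long-running command

Save this as, say, my_job.sh and submit it with sbatch my_job.sh; the job continues running on the compute node even after you log out.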

Command line, storage, and software installation issues

Q: I can’t run standard shell commands, I'm unable to list available software modules, or I can’t list directories to navigate to.

A: If you find, after logging onto the cluster, that you can’t list available software modules (with the module avail command), you can’t get a listing of directories to navigate to, or your shell prompt looks like, e.g., -bash-4.2$ instead of, say, [myashar@ln001 ~]$, this may indicate that something is wrong with your shell environment, which in turn may point to a problem with your ~/.bashrc and/or ~/.bash_profile file. If so, you can try to fix these files manually, or you can replace them with the system defaults using the following commands:

/usr/bin/mv ~/.bashrc ~/.bashrc.orig
/usr/bin/cp /etc/skel/.bashrc ~/.bashrc
/usr/bin/mv ~/.bash_profile ~/.bash_profile.orig
/usr/bin/cp /etc/skel/.bash_profile ~/.bash_profile

Then, log out and log back in again, and check whether this has resolved the issue.

Q: I’m transferring data to or from Savio and the transfer is going very slowly.

A: Slow transfers can occur for a variety of reasons. These include:

  • Heavy usage of scratch by Savio users.
  • Bandwidth limits somewhere between Savio and the location you are transferring to/from. (E.g., transferring data to/from a computer at your home is limited by the bandwidth of your home internet connection.)
  • If you are transferring a large number of (possibly small) files, there is a cost simply to opening that many files. Transferring a smaller number of larger files may be more successful, though this may require you to use tar or zip to aggregate the smaller files, which will also take time (see the example below).
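For example, a rough sketch of aggregating a directory of many small files before a transfer (the directory and archive names are hypothetical):

tar -czf my_results.tar.gz my_results/    # bundle and compress the directory into a single file
# ... transfer my_results.tar.gz instead of the individual files ...
tar -xzf my_results.tar.gz                # unpack at the destination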

In general we recommend the use of Globus to transfer files when the location you are transferring to/from can be set up as a Globus endpoint. (Note that we are acquiring a Globus license that will potentially allow transfers to Box or bDrive via Globus, but that is not yet possible.) As a benchmark, one might achieve something like 75 MB/s when transferring between Savio and servers elsewhere on the Berkeley campus.

Q: I am getting a 'Permission denied' error (e.g., when trying to install or use software).

A: Permission denied errors happen when your account does not have permission to perform an attempted read or write on a file or directory.

  • If you are trying to read or write to a file or directory owned by someone else (not the root user), you may need to ask the owner to set permissions so you can do so. See our documentation about making files accessible to other users for more details about how file permissions work and how to set permissions to grant access to other users.
  • Savio users are not able to modify the root filesystem (paths writable only by the root user, usually beginning with /etc, /usr, or /bin). If you need to change these paths for your software to work, you may use a Singularity container.
  • If the permission denied error occurs during an attempted software installation, often the software can be installed at a custom prefix (such as in your home/group directory instead of in /bin or /usr/bin), or otherwise you can install it inside a Singularity container. See our documentation on installing your own software on Savio for more details on how to install software at different locations.
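As an illustration of the custom-prefix approach, here is a minimal sketch for a hypothetical autotools-based package installed under your home directory (the package and directory names are placeholders):

cd ~/src/mypackage-1.0                              # wherever you unpacked the source
./configure --prefix=$HOME/software/mypackage-1.0   # install under your home directory instead of /usr
make
make install
export PATH=$HOME/software/mypackage-1.0/bin:$PATH  # add this line to ~/.bashrc to make it persistent
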
Q: I’ve received a 'disk quota exceeded' error when trying to install software in, move data to, or write to files in my Savio home directory. What does this mean and how can I resolve this?

A: By default, each user on Savio is entitled to a 10 GB home directory, which receives regular backups. If you exceed this 10 GB disk usage quota, you will receive a “disk quota exceeded” error message when you try to install software in, move or add data to, or write to files in your Savio home directory, and you won’t be able to proceed further. Removing files and directories in your home directory, and/or moving files to your Savio scratch or scratch2 directory or your shared group directory (if you have one), until you are below the 10 GB limit should resolve this issue. Be cautious not to remove hidden files beginning with ".", as some of them set useful environment parameters and would need to be regenerated. You can use ls -al to show the hidden files and directories in your home directory (the ‘-a’ flag shows hidden files and directories).

Another option to free up space in your home directory is to delete files inside the ~/.local directory (as well as the ~/.conda/pkgs directory), which is where pip (and Conda) install Python packages by default. It's also possible to install Python packages somewhere other than ~/.local, such as scratch. If there are Python packages in ~/.local that are taking up a lot of space, it's cleaner to remove them with pip rather than just deleting the files; otherwise, Python may think a package is installed when it has actually been deleted. You can use pip uninstall $PACKAGE_NAME for this.

Another typical example occurs when you run a job that has a software package write output to your home directory (or perhaps a group directory) and you go over your 10 GB quota limit (or the 30 GB quota limit in your shared research group directory). This can also happen if the software package writes to a configuration directory that it sets up in your home (or group) directory; e.g., a program might create a directory ~/.name_of_program and put files in there. In that case, you can configure the program to write any output to scratch, for example, if such output files are the problem. Similarly, you could configure the software package to use scratch (instead of your home directory) for its configuration directory.

Note that you can monitor your home directory disk usage by using commands such as quota -s, quota -u $USER, and/or du -sk /global/home/users/$USER (where $USER is your Savio username). Similarly, you can use the command du -h -d 1 ~ to see how much space is used by each top-level subdirectory in your home directory (which is what the ~ indicates).

Please take care not to exceed the 10 GB quota limit by monitoring it on a regular basis and moving files to scratch or scratch2, or to bDrive, Box, or an external system or server, for example, if and as needed, to stay below the 10 GB limit.

Q: I’ve exceed the 12 TB disk usage soft quota in my Savio scratch directory and I need to remove some files from there to get below the 12 TB limit. What are my best options for where to transfer those files and how / what tools are best to use to transfer the data?

A: You can check your current disk usage on the Savio scratch file system (/global/scratch) with the following command:

grep <username> /global/scratch/scratch-usage.txt 

where <username> is your Savio username. This command gives you your current Savio scratch disk usage in KB. To convert to TB, divide by 10^9.

We currently request that users who exceed the 12 TB quota limit in their scratch directories take immediate action to clean up this space by removing or deleting as many files as necessary to get below the 12 TB limit. Otherwise, enforced purge procedures, including disabling job submissions, may take place. This is especially important if and when we approach 100% disk storage usage on the Savio scratch file system, as that can interfere with the ability of other Savio users to run jobs on the system. To help alleviate this situation, we need Savio scratch users who are above the 12 TB quota limit to remove or transfer files from their scratch directories as soon as possible.

If this restricts your research productivity, please get in touch with us ASAP. If you need advice on where to move your files, or if you have questions or special needs for additional storage, please also get in touch with us.

For example, if you have access to /global/scratch2/ for your lab, you're welcome (and encouraged) to move data there.

In general we recommend the use of Globus to transfer files when the location you are transferring to/from can be set up as a Globus endpoint. (Note that we are acquiring a Globus license that will potentially allow transfers to Box or bDrive via Globus, but that is not yet possible.) As a benchmark, one might achieve something like 75 MB/s when transferring between Savio and servers elsewhere on the Berkeley campus.

For other data transfer tools (such as rclone) that you can use to transfer data between Savio and Box, bDrive, your lab server, personal computer, or other external system, please see our documentation.

As far as long-term storage, one option would be our condo storage program.

As far as other options, we'd be happy to discuss further with you, either by email or directly, potentially during our office hours.

If you are no longer affiliated with UC Berkeley and/or no longer have an active Savio account and no longer need the data in your scratch directory, please let us know and we can delete the data on your behalf (after receiving final confirmation from the PI(s) of the FCA and/or Condo account(s) you had access to).

Q: How can I resolve a 'No Space Left on Device' error related to /tmp (and not to scratch or my home directory filling up)?

A: You may receive a "No Space Left on Device" error when you run a job on Savio that turns out to have nothing to do with global scratch or your home directory filling up, but rather is caused by your job writing to the /tmp directory on a compute node (or, in some cases, on one of the login nodes) instead of scratch or your home directory, and /tmp filling up. If your job writes temporary files to /tmp, it's possible that you’ve run out of space there, since /tmp is only around 3.7 GB in size on most compute nodes (and 7.8 GB on savio3 nodes, for example).

Depending on the program that you are running, you may be able to control where it writes temporary files by setting the TMPDIR environment variable in your job script. Two main options are to use your scratch space or (only if using compute nodes) use /local (which is local scratch space on each node).

For example, you can use a directory in your scratch space by setting the TMPDIR environment variable so that the executable uses that scratch directory rather than /tmp. Most programs respect the TMPDIR environment variable, so you can make a temp directory in your personal scratch directory as follows (e.g., you can add this to your SLURM batch script):

mkdir -p /global/scratch/$USER/tmp        # create a tmp directory in your scratch space
export TMPDIR=/global/scratch/$USER/tmp   # point TMPDIR at it

where $USER is your Savio username. If your program respects the TMPDIR environment variable then this should make it use scratch instead of /tmp.

Q: How do I get sudo/root access (for example, to install software)?

A: Unfortunately, Savio does not allow users to have sudo/root access.

If you have requirements for using paths/directories owned by root, you can use a Singularity container which allows you to change root directories from the perspective of the program.

Often software can be installed without root access at a custom prefix (for example, instead of installing the binary in /bin/ or /usr/bin you can put it in a directory you created). Please see our documentation on installing software, including using environment modules to simplify software installed at custom prefix locations.

If none of these solutions fit your needs, you may contact us, and we can advise on how to proceed on Savio without root access.

Q: How do I access files on my personal computer from Savio? Do you allow users to NFS mount their own storage onto the compute nodes?

A: There are several options for transferring files to and from Savio. While you cannot mount your own storage on Savio nodes, if you want to synchronize files in near real-time between your own computer and Savio, then there are tools such as Cyberduck (for Mac/Windows) or sshfs (for Linux) which allow you to synchronize a directory on your computer with a directory on Savio. On Savio, you should use the dtn.brc.berkeley.edu address when connecting using one of these tools.
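For example, on a Linux machine with sshfs installed, mounting your Savio home directory locally might look roughly like this (the local mount point and username are placeholders):

mkdir -p ~/savio-home
sshfs myusername@dtn.brc.berkeley.edu:/global/home/users/myusername ~/savio-home   # mount over SSH
# ... browse and edit files under ~/savio-home as if they were local ...
fusermount -u ~/savio-home    # unmount when finished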

We do not allow users to NFS mount their own storage. We NFS mount Savio's storage across all compute nodes so that data is available independent of which compute nodes are used; however, medium to large clusters can place a very high load on NFS storage servers and many, including Linux-based NFS servers, cannot handle this load and will lock up. A non-responding NFS mount can hang the entire cluster, so we can't risk allowing outside mounts.

Q: I deleted my files accidentally! What can I do?

A: Note: While the home directory filesystem (/global/home) is backed up so it is possible to restore old versions of files, no such backups are taken in scratch (/global/scratch). Important data that needs to be backed up should not be stored in scratch.

The home directory filesystem (files within /global/home) is backed up hourly. Backup snapshots can be accessed from the hidden .snapshots directory within each directory. For example, to access the snapshots of my home directory:

[nicolaschan@ln000 ~]$ cd .snapshots
[nicolaschan@ln000 .snapshots]$ ls
daily_2020_11_08__16_00   hourly_2020_11_12__20_00  hourly_2020_11_13__04_00  hourly_2020_11_13__12_00
daily_2020_11_09__16_00   hourly_2020_11_12__21_00  hourly_2020_11_13__05_00  hourly_2020_11_13__13_00
daily_2020_11_10__16_00   hourly_2020_11_12__22_00  hourly_2020_11_13__06_00  hourly_2020_11_13__14_00
daily_2020_11_11__16_00   hourly_2020_11_12__23_00  hourly_2020_11_13__07_00  hourly_2020_11_13__15_00
daily_2020_11_12__16_00   hourly_2020_11_13__00_00  hourly_2020_11_13__08_00  hourly_2020_11_13__16_00
daily_2020_11_13__16_00   hourly_2020_11_13__01_00  hourly_2020_11_13__09_00  weekly_2020_11_04__16_00
hourly_2020_11_12__18_00  hourly_2020_11_13__02_00  hourly_2020_11_13__10_00  weekly_2020_11_11__16_00
hourly_2020_11_12__19_00  hourly_2020_11_13__03_00  hourly_2020_11_13__11_00

You can then copy the file you want to restore from the snapshot (using the normal cp command).
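For example, to restore a hypothetical file named notes.txt from the most recent hourly snapshot shown above back into your home directory:

cp ~/.snapshots/hourly_2020_11_13__16_00/notes.txt ~/notes.txt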

Slurm / job issues

Q: How do I know which partition, account, and QoS I should use when submitting my job?

A: SLURM provides a command you can run to check the partitions, accounts, and Quality of Service (QoS) options that you're permitted to use:

sacctmgr -p show associations user=$USER

The "-p" option shown in this command produces parsable (pipe-delimited) output; you can omit it if you prefer the default columnar layout.

Q: Why hasn’t my job started? When will my job start?

A: Please see our suggestions. In particular you may wish to try our sq tool to diagnose problems.

Q. How can I check the usage on my Faculty Computing Allowance (FCA)?

A: Savio provides a "check_usage.sh" command line tool you can use to check cluster usage by user or account.

Running "check_usage.sh -E" will report total usage by the current user, as well as a breakdown of their usage within each of their related project accounts, since the most recent reset/introduction date (normally June 1st of each year). To check usage for another user on the system, add a "-u sampleusername" option (substituting an actual user name for 'sampleusername' in this example).

You can check usage for a project's account, rather than for an individual user's account, with the '-a sampleprojectname' option to this command (substituting an actual account name for 'sampleprojectname' in this example).

Also, when checking usage for either users or accounts, you can display usage during a specified time period by adding start date (-s) and/or end date (-e) options, as in "-s YYYY-MM-DD" and "-e YYYY-MM-DD" (substituting actual Year-Month-Day values for 'YYYY-MM-DD' in these examples). Run "check_usage.sh -h" for more information and additional options.
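For instance, combining these options, the following command (with a placeholder username and dates) shows one user's usage, broken down by project account, for June through August of a given allowance year:

check_usage.sh -E -u sampleusername -s 2023-06-01 -e 2023-08-31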

When checking usage for accounts that have overall usage limits (such as Faculty Computing Allowances), the value of the Service Units (SUs) field is color-coded to help you see at a glance how much computational time is still available: green means your project has used less than 50% of its available SUs; yellow means your project has used more than 50% but less than 100% of its available SUs; and red means your project has used 100% or more of its available SUs (and has likely been disabled). Note that if you specify the starttime and/or endtime with "-s" and/or "-e" option(s) you will not get the color coded output.

Here are a couple of output samples from running this command line tool with user and project options, respectively, along with some tips on interpreting that output:

check_usage.sh -E -u sampleusername
Usage for USER sampleusername [2016-06-01T00:00:00, 2016-08-17T18:18:37]: 38 jobs, 1311.40 CPUHrs, 1208.16 SUs used
Usage for USER sampleusername in ACCOUNT co_samplecondoname [2016-06-01T00:00:00 2016-08-17T18:18:37]: 23 jobs, 857.72 CPUHrs, 827.59 SUs
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00 2016-08-17T18:18:37]: 15 jobs, 453.68 CPUHrs, 380.57 SUs

Total usage from June 1, 2016 through the early evening of August 17, 2016 by the 'sampleusername' cluster user consists of 38 jobs run, using approximately 1,311 CPU hours, and resulting in usage of approximately 1208 Service Units. (The total number of Service Units is less than the total number of CPU hours in this example, because some jobs were run on older or otherwise less expensive hardware pools (partitions) which cost less than one Service Unit per CPU hour.)

Of that total usage, 23 jobs were run under the Condo project account 'co_samplecondoname', using approximately 858 CPU hours and 828 Service Units, and 15 jobs were run under the Faculty Computing Allowance project account 'fc_sampleprojectname', using approximately 454 CPU hours and 381 Service Units.

check_usage.sh -a fc_sampleprojectname
Usage for ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00, 2016-08-17T18:19:15]: 156 jobs, 85263.80 CPUHrs, 92852.12 SUs used from an allocation of 300000 SUs.

Usage from June 1, 2016 through the early evening of August 17, 2016 by all cluster users of the Faculty Computing Allowance account 'fc_sampleprojectname'  consists of 156 jobs run, using a total of approximately 85,263 CPU hours, and resulting in usage of approximately 92,852 Service Units. (The total number of Service Units is greater than the total number of CPU hours in this example, because some jobs were run on hardware pools (partitions) which cost more than one Service Unit per CPU hour.) The total Faculty Computing Allowance allocation for this project's account is 300,000 Service Units, so there are approximately 207,148 Service Units still available for running jobs during the remainder of the current Allowance year (June 1 to May 31): 300,000 total Service Units granted, less 92,852 used to date. The total of 92,852 Service Units used to date is colored green, because this project's account has used less than 50% of its total Service Units available.

To also view individual usages by each cluster user of the Faculty Computing Allowance project account 'fc_sampleprojectname', you can add a '-E' option to the above command; e.g., check_usage.sh -E -a fc_sampleprojectname

Finally, if your Faculty Computing Allowance has become completely exhausted, the output from running the "check_usage.sh" command line tool will by default show only information for the period of time after your job scheduler account was disabled; for example:

Usage for ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 3 jobs, 0.00 CPUHrs, 0.00 SUs from an allocation of 0 SUs.
ACCOUNT fc_sampleprojectname has exceeded its allowance. Allocation has been set to 0 SUs.
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 0 jobs, 0.00 CPUHrs, 0.00 (0%) SUs

To display the - more meaningful - information about the earlier usage that resulted in the Faculty Computing Allowance becoming exhausted, use the start date (-s) option and specify the most recently-passed June 1st - the first day of the current Allowance year - as that start date. E.g., to view usage for an Allowance that became exhausted anytime during the 2018-19 Allowance year, use a start date of June 1, 2018:

check_usage.sh -E -s 2018-06-01 -a fc_sampleprojectname

Q: How can I check my job's running status? How do I monitor the performance and resource use of my job?

A: To monitor the status of running batch jobs, please see our documentation on the use of the squeue and wwall tools. For information on the different options available with the use of wwall, enter wwall --help at the Linux command prompt on Savio.

Alternatively, you can login to the node your job is running on as follows:

srun --jobid=$your_job_id --pty /bin/bash

This runs a shell in the context of your existing job. Once on the node, you can run top, htop, ps, or other tools.

You can also see a "top"-like summary for all nodes by running wwtop from a login node. You can use the page up and down keys to scroll through the nodes to find the node(s) your job is using. All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used.

Q: How do I submit jobs to a condo?

A: If you are running your job using a condo account (those beginning with "co_" ), make sure to specify the condo account name when submitting your SLURM batch job script, e.g.,

#SBATCH --account=co_projectname

where ‘projectname’ is the name of the condo account.

A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of most of the cluster's Condos under their priority access QoS), you can specify a sufficiently long time limit value, but this parameter should not be omitted. Jobs submitted without providing a time limit will be rejected by the scheduler.

As a condo contributor, you are also entitled to use extra resources that are available on the Savio cluster (across all partitions). For more information and details, please see our documentation on low-priority jobs.

Note that by default any jobs run in a condo account will use the default QoS (generally savio_normal) if not specified.

For the different QoS configurations for Savio condos, including the corresponding QoS limits, see here. To specify the QoS configuration for your Savio condo, you would include a line in your SLURM batch job script such as the following example:

#SBATCH --qos=lsdi_knl2_normal

Recall that to check which QoS you are allowed to use, simply run sacctmgr -p show associations user=$USER, where $USER is your Savio username.
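Putting these pieces together, the condo-specific portion of a job script header might look like the following sketch; the account, partition, and QoS names are placeholders, so use the values that sacctmgr reports for your own account:

#SBATCH --account=co_projectname     # your Condo account (placeholder)
#SBATCH --partition=savio2           # a partition your condo has access to (placeholder)
#SBATCH --qos=savio_normal           # a QoS you are permitted to use (placeholder)
#SBATCH --time=72:00:00              # a time limit must always be specified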

Q: How can I run multiple jobs on a single node to take account of all the compute cores on a node? How can I run many jobs at once?

A: There are various approaches you can take if you have many jobs you need to run and want to take full advantage of the cores on each Savio node, given that all except the savio2_htc and the various GPU partitions are allocated as entire nodes, so your FCA is charged for use of all the cores on a node.

Q. How do I "burst" onto more nodes?

A. There are two ways to do this. First, Condo users can access more nodes via Savio's preemptable, low-priority quality of service option. Second, faculty can obtain a Faculty Computing Allowance, and their users can then submit jobs to the General queues to run on the compute nodes provided by the institution. (Use of these nodes is subject to the current job queue policies for general institutional access.)

Q: Why is my job not using the GPU nodes / How can I get access to the GPU nodes?

A: Please see this example of a SLURM sbatch GPU job script.

To help the job scheduler effectively manage the use of GPUs, your job submission script must request two CPUs for each GPU you will use. Jobs that do not request a minimum of two CPUs for every GPU will be rejected by the scheduler. You can request the correct number of CPUs either by setting --cpus-per-task to the total number of CPUs needed (assuming you use GPUs on one node only), or by setting --ntasks to the number of GPUs requested and --cpus-per-task=2.

Note that the --gres=gpu:[1-4] specification must be between 1 and 4. This is because the feature is associated with a node, and the nodes each have 4 GPUs. If you wish to use more than 4 GPUs, your --gres=gpu:[1-4] specification should include how many GPUs to use per node requested. For example, if you wish to use eight GPUs, your job script should include options to the effect of "--gres=gpu:4", "--nodes=2", "--ntasks=8", and "--cpus-per-task=2"
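For example, for the eight-GPU case just described, the relevant scheduler directives might look like the following sketch; the account and partition names are placeholders, and this assumes a partition whose nodes each have 4 GPUs:

#SBATCH --account=fc_projectname   # your FCA or Condo account (placeholder)
#SBATCH --partition=savio2_gpu     # a GPU partition you have access to (placeholder)
#SBATCH --nodes=2                  # two nodes with 4 GPUs each
#SBATCH --gres=gpu:4               # 4 GPUs per node
#SBATCH --ntasks=8                 # one task per GPU
#SBATCH --cpus-per-task=2          # two CPUs per GPU, as required by the scheduler
#SBATCH --time=24:00:00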

The available Savio GPU partitions can be found here and include savio2_gpu, savio2_1080ti, savio3_gpu, and savio3_2080ti. You can use the command sacctmgr -p show associations user=$USER (where $USER is your Savio username) to check which of these GPU partitions your account has access to (which you can use when submitting your job on Savio).

To obtain information on the available GPUs, their status, and their utilization on a particular node of a GPU partition, access a GPU node interactively using the srun command, as in the following example:

srun --pty -A fc_projectname --partition=savio2_1080ti --nodes=1 --gres=gpu:1 \
        --ntasks=1 --cpus-per-task=2 -t 48:00:00 bash -i

where ‘projectname’ is the name of your FCA (i.e., fc_projectname) or Condo account (i.e., co_projectname).

Then, once you have logged into the particular GPU node, enter the following at the command prompt: nvidia-smi

You can also use srun to get an interactive shell on the GPU node your job is running on already, and then use nvidia-smi as follows:

srun --jobid=$your_job_id --pty /bin/bash
nvidia-smi

where you substitute your job's ID for $your_job_id.

Q: I can’t connect to other nodes within a job, possibly with the errors 'Host key verification failed' or 'ORTE was unable to reliably start one or more daemons. This usually is caused by:...'. How can I resolve this?

A: This message is usually associated with problems communicating between nodes in jobs that use more than one node. It is often caused by your ~/.ssh directory being misconfigured or corrupted in some way that prevents ssh from connecting to the other nodes in your job’s allocation. It can also be caused by problems on one or more of the nodes your job is running on. To check your ~/.ssh directory, run

ls -al ~/.ssh

The contents should look something like this:

[paciorek@ln002 etc]$ ls -al ~/.ssh
total 116
drwx------ 1 paciorek ucb   348 Oct  7  2019 .
drwxr-xr-x 1 paciorek ucb 65536 Nov 24 15:26 ..
-rw------- 1 paciorek ucb   610 Aug  2  2019 authorized_keys
-rw------- 1 paciorek ucb   672 Aug  2  2019 cluster
-rw-r--r-- 1 paciorek ucb   610 Aug  2  2019 cluster.pub
-rw------- 1 paciorek ucb    98 Aug  2  2019 config
-rw-r--r-- 1 paciorek ucb 28652 Nov 24 15:40 known_hosts

If files are missing or empty, you can try to recreate your ~/.ssh directory as follows:

cp -pr ~/.ssh ~/.ssh-save   #make a backup
/usr/bin/cluster-env        #recreate your .ssh folder

If necessary you can restore your original .ssh from the .ssh-save directory. Or you can delete .ssh-save if you no longer need it.

If your .ssh directory seems fine (or the connection problem persists after running cluster-env), please submit a ticket.

Also as a troubleshooting step, you could try to start a multiple (e.g., two) node job using srun and then try to ssh to the other node(s) from the first node. You can determine the nodes allocated to the job using the following command (once you are on the first node):

echo $SLURM_NODELIST

If you suspect a particular node is the problem, you can request that a specific node be used for your job with the ‘-w’ flag, e.g., -w n0204.savio2, but of course the node of interest may be in use by another job.

Q: My job fails with a message about being out of memory. How can I fix this and how can I know how much memory my job used?

A: Jobs may use a lot of memory simply because the code requires a lot of memory or because the code was not written to use memory efficiently, or because you are using the code in an inefficient way. You may want to look through the code to better understand what steps are using a lot of memory. BRC consultants can also help diagnose what is going on. It’s also possible that the code you are using has a memory leak (i.e., a bug in the code that causes increasing memory use).

Finally, parallelized code can use a lot of memory if multiple copies of input data are made, often one per worker process. In this latter case, one option is to reduce the number of worker processes your job uses by modifying the SLURM submission flags or changing how the code is configured. Another option is to spread your computation across multiple nodes so as to take advantage of the memory of more than one node; this is only possible with code that is set up specifically to parallelize across multiple nodes and won’t generally “just work”.

If you can’t reduce the memory usage, one option is to submit your job to one of our high memory partitions. If you need even more memory, BRC consultants can help you access resources outside Savio (e.g., on NSF’s XSEDE network or commercial cloud providers such as Google Cloud or AWS) that host machines with even more memory available.

To see how much memory your job used (or is currently using), please use these approaches for monitoring jobs.

Q: What do I do when my job's runtime exceeds the maximum queue time, or the wall-clock time specified in my SLURM batch script?

A: There are a variety of possibilities that may help with this:

  • If you have access to a condo, you could run in the condo with a more generous time limit, as most condos don’t have a time limit.
  • For jobs not needing many cores, you can run under our long queue on the savio2_htc partition under an FCA.
  • You may be able to break up your job into separate steps that you can run as individual (shorter) jobs.
  • You may be able to parallelize your code so that it runs faster.
  • You may be able to set up your code to use checkpointing. The general idea is to save the state of your code regularly and then when the job stops at the time limit, submit a new job that starts up at the last checkpoint. Existing software may already have functionality to produce restart files if requested.
  • BRC consultants can help you access resources outside Savio (e.g., on NSF’s XSEDE network or commercial cloud providers such as Google Cloud or AWS) where there are more generous (or no) time limits. Please contact us at brc@berkeley.edu.

Q: I have a job that will take longer than the 3-day time limit for FCAs. Can I run jobs for longer?

A: Yes, you can submit to the long queue, which will run your job on the savio2_htc partition, up to a limit of 10 days. Please see these details on the long queue. Note that time limits on Condo accounts are determined by the Condo PI(s) and in many cases are quite long or unlimited (if no time limit is indicated, that means there is no limit).

Q: Is there any way to resume jobs that will run for longer than the time available to a job?

A: When a job reaches or exceeds its allowed wall-clock limit, is preempted, or fails due to software or hardware faults, it is useful to be able to checkpoint and restart it.

Checkpointing/restarting is the technique where the applications’ state is stored in the filesystem, allowing the user to restart computation from this saved state in order to minimize loss of computation time. By checkpointing or saving intermediate results frequently, you won't lose as much work if your running job exceeds the time limit you specified, your job is preempted, or your job is otherwise terminated prematurely for some reason.

If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.

In case your program does not natively support checkpointing, it may also be worthwhile to consider a generic checkpoint/restart solution that works in an application-agnostic way. One example of such a solution is DMTCP: Distributed MultiThreaded CheckPointing. DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You can find its quick-start documentation here.

Software issues

Q: I want to use software with a graphical interface, such as MATLAB or RStudio, or I want to run a Jupyter notebook. Is this possible on Savio?

A: Some software with a graphical interface, including RStudio and MATLAB, is available through Savio’s Open OnDemand service at ood.brc.berkeley.edu. All your files are accessible in your session. Jupyter notebooks can be run through the Open OnDemand service as well.

Q: How do I know what software Savio has?

A: Savio has software installed using environment modules.

Savio administrators and consultants install modules into /global/software/sl-7.x86_64/ and /global/home/groups/consultsw/sl-7.x86_64/. Groups and users may also create their own module trees. Modules in /global/software/sl-7.x86_64/ are maintained by the system administrators and should be preferred over modules in /global/home/groups/consultsw/sl-7.x86_64/. Modules in /global/home/groups/consultsw/sl-7.x86_64/ are installed by consultants, usually at the request of users and include software and versions that are not in the sysadmin-maintained software tree.

Sometimes software is not visible in module avail until its dependency has been loaded. For example, openmpi cannot be loaded until after gcc or intel has already been loaded. To see an extensive list of all module files, including those only visible after loading other prerequisite modules, enter the following command:

find /global/software/sl-7.x86_64/modfiles -type d -exec ls -d {} \;
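For example, to see the MPI builds that only become visible after a compiler is loaded (module names follow the pattern described above; exact versions will vary):

module avail openmpi     # may show nothing until a compiler module is loaded
module load gcc          # load a compiler first
module avail openmpi     # the matching openmpi builds are now listed
module load openmpi
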
Q: How can I request software be installed on Savio?

A: Please see this section of our documentation to request that software be installed system-wide.

Q: How do I make my software run in parallel or run on more than one node?

A: First, note that in general, software will not run across multiple nodes unless it is set up to do so, and you invoke the software correctly. In some, but certainly not all cases, software that is set up to run in parallel on one node may do so without you having to do anything. Even if it does, it may not use all the cores on a node.

If you are using software developed by someone else, parallelization depends on how the developer designed the software, and ease of use usually depends on how well documented the software is.

There are three main ways in which parallelization parameters are set:

  • command line arguments set when running the software
  • compilation options that produce separate executables when installing the software
  • configuration files that can be modified in advance of running the software

Our training on parallelization on Savio has a section with detailed information, including a number of examples based on popular bioinformatics software.

If you are writing your own code, you have various options:

  • If using C or C++ you can write code that uses openMP (for parallelization across threads on one node) or MPI (for parallelization across processes on one or more nodes).
  • If using Python, R, or MATLAB, there are various ways to parallelize independent computations. You can also use MPI directly from within Python and R.
  • Python’s Dask package and Spark are both tools for working with large datasets in parallel.

Q: How can I run a Jupyter notebook on a GPU?

A: Please note that we've stopped supporting use of the compute nodes (including GPU nodes) through Jupyterhub, as we've migrated the job submission options for Jupyter notebooks over to the Savio Open OnDemand service as of December 2020. You can login with your Savio username and PIN+OTP and select the Jupyter Server under Interactive Apps. From there, you can choose a GPU partition and specify the number of GPUs you want (e.g., gpu:1 for 1 GPU).

Another option for starting a Jupyter notebook on a GPU node is to access it via our visualization node, following these instructions.

Q: How can I use Conda and/or Pip to install Packages in a directory other than my home directory (e.g., shared group directory or scratch directory)?

A: To use Conda to install a package (for example, biopython) in a subdirectory in your personal global scratch directory (/global/scratch/username/) called 'your_directory', with environment name 'test', you can do this (for example):

module load python/3.6
conda create -p /global/scratch/$USER/your_directory/test python=3.7
source activate /global/scratch/$USER/your_directory/test
conda install -p /global/scratch/$USER/your_directory/test biopython
source deactivate

There are a couple of different ways to use pip to install python packages in a different/selected directory from your home directory. For example, you can use pip install --prefix=/path/to/directory package_name (where ‘package_name’ is the name of the python package you are installing) and then modify PYTHONPATH to include /path/to/directory. Also, see this Stack Overflow thread for additional options. For example, you can try (where $USER is your Savio username):

pip install --install-option="--prefix=/global/scratch/$USER/your_directory" biopython

or

pip install --prefix=/global/scratch/$USER/your_directory biopython

You might also want to use the "--ignore-installed" option to force all dependencies to be reinstalled using this new prefix. Recall that pip installs packages into the ~/.local directory (in your home directory) when you use the command pip install --user <package_name>. So, another approach you can take is to copy the contents of ~/.local to your scratch space and then replace it with a symbolic link. The commands to do this are, for example, as follows (where $USER is your Savio username):

cp -r ~/.local /global/scratch/$USER/.local
rm -rf ~/.local
ln -s /global/scratch/$USER/.local ~/.local

Then, when you use pip to install packages in your home directory, they will be installed in your scratch directory instead.

Q. When trying to install Python packages with Conda in my home directory, I receive an error message that includes "Disk quota exceeded" and can’t proceed. How can I resolve this?

A. When a user tries to install additional Python packages in their home directory with Conda and/or set up a Conda environment, they may receive an error message that includes, e.g., "[Errno 122] Disk quota exceeded" once they've exceeded their 10 GB home directory quota. This happens because Conda installs packages inside the ~/.conda directory (in the user’s home directory) and the user has run out of available storage space there. To work around this, you can move the ~/.conda directory to your scratch directory and then create a symbolic link to it.

There are a couple ways to do this. If you want to move your existing Conda environment (note that the cp invocation might take a long time):

mv ~/.conda ~/.conda.orig
cp -r ~/.conda.orig /global/scratch/$USER/.conda   # $USER is your Savio username
ln -s /global/scratch/$USER/.conda ~/.conda

Once everything is done and working, you can delete your old Conda environment from your home directory to free up space if there are no environments you care about keeping.

rm -rf ~/.conda.orig

You can also replace the "cp -r" line with a mkdir and start fresh. This means you'll lose any existing environments, but you won't have to wait for the lengthy copy to finish:

mkdir -p /global/scratch/$USER/.conda   # create a new directory in scratch to take the place of ~/.conda

Again, you should keep in mind that the above process will remove any existing conda environments you have, so you might consider exporting these to an environment.yml file if you need to recreate them.

Also, please keep in mind that you can follow the instructions below to remove redundant/unused Conda environments as needed:

conda info --envs                   # lists all your environments
conda remove --name myenv --all     # where myenv is the name of the environment you want to remove

Another option to free up space in your home directory is to delete files inside the ~/.local directory (in your home directory), which is where pip installs python packages (for example) by default. It's also possible to install into someplace other than .local, such as scratch. If there are python packages in ~/.local that are taking up a lot of space, it's cleaner to remove them with pip rather than just deleting files there. Otherwise, you might have issues with Python thinking a package is installed but it has actually been deleted. You can use pip uninstall <PACKAGE_NAME> for this.

You can also check if there are files in ~/.conda/pkgs that are taking up a lot of space. If you run

du -h -d 1 ~

you'll see how much space is used by each top-level subdirectory in your home directory (which is what the ~ indicates).

Miscellaneous issues

Q. How can I acknowledge the Savio cluster in my presentations or publications?

A. You can use the following sentence in order to acknowledge computational and storage services associated with the Savio cluster:

"This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer)."

Acknowledgements of this type are an important factor in helping to justify ongoing funding for, and expansion of, the cluster. We also encourage you to tell us how BRC impacts your research (Google Form), at any time!