
Frequently Asked Questions

Account issues

Q. How can I access the Faculty Computing Allowance application (Requirements Survey) and Additional User Request forms? I'm seeing a "You need permission" error.

A. You need to authenticate via CalNet to access the online forms for applying for a Faculty Computing Allowance, and for requesting additional user accounts on the Savio cluster.

When accessing either form, you may encounter the error message, "You need permission. This form can only be viewed by users in the owner's organization", under either of these circumstances:

1. If you haven't already successfully logged in via CalNet. (If you don't have a CalNet ID, please work with a UCB faculty member or other researcher who can access the form on your behalf.)
2. If you've logged in via CalNet, but you're also simultaneously connected, in your browser, to a non-UCB Google account; for instance, to access a personal Gmail account. (If so, the easiest way to access the online forms might be to use a private/incognito window in your primary browser, or else use a second browser on your computer, one in which you aren't already logged into a Google account. As an alternative, you can first log out of all of your Google accounts in your primary browser, before attempting to access these forms.)

Q: How can I add new users to my FCA or Condo account?

A: You can add a new or existing Savio user to your FCA or Condo account by filling out and submitting the Additional User Account Request/Modification Form, which you can find here. New user accounts are usually set up within a few days to a week or so. The PI of the FCA or Condo account can fill out and submit the form on behalf of the new or existing user, or the new or existing user can fill out the form themselves if they have a CalNet ID (i.e., a UC Berkeley email address).

Please make sure that the new user also fills out the BRC User Access Agreement Form as well (after the Additional User Account Request/Modification Form has been filled out and submitted). An existing Savio user does not need to fill this form out, as they already filled out and submitted the form when their Savio account was initially created.

Q: How long does it take until my account is created?

A: After the BRC User Access Agreement Form has been filled out and submitted, it usually takes between a few days and a week (or, in some circumstances, a bit longer) for us to send out the confirmation email. That email points you to the page with instructions for linking one of your personal email accounts with your BRC HPC Cluster account, installing and setting up Google Authenticator on a mobile device, setting up your PIN / token, and logging into the BRC Cluster, so that you can access your account and start working on the BRC Cluster for the first time.

Q: I haven’t heard back about my account request. How do I know if it is ready and is there anything else I need to do in order to get access to my new account?

A: After the BRC User Access Agreement Form has been submitted, it usually takes between a few days and a week (or, in some cases, a bit longer) for your new user account to be set up on the Savio cluster and ready to use. There are no additional steps you need to take during this time. We encourage you to use the wait to familiarize yourself with our extensive Savio documentation; many of the questions that come up as you begin working on Savio are answered there. When the new user account is ready, you will receive a confirmation email pointing you to a page with instructions for linking one of your personal email accounts with your BRC HPC Cluster account, installing and setting up Google Authenticator on a mobile device, setting up your PIN / token, and logging into the BRC Cluster. Please do not attempt to follow any of these instructions before you have received the confirmation email. If you do not receive the confirmation email in your regular inbox within a week or so after submitting the BRC User Access Agreement Form, please check your spam folder.

Q: Do I need PI permission before requesting an account? Do I need to fill out the Additional User Account Request/Modification Form myself, or can the PI fill it out on my behalf?

A: Yes, you need to check with your PI before requesting a new Savio account to be created under or to have access to the PI’s existing FCA or Condo account. Once you have checked with the PI, then if you have a CalNet ID (i.e., berkeley.edu email address) you can fill out and submit the Additional User Account Request/Modification Form yourself. If you do not have a CalNet ID then the PI can fill out and submit the form on your behalf.

Q. What happens to my Savio account when I graduate or otherwise leave UC Berkeley?

A: Your Savio account and your access to Savio should remain active for some period of time after you have left UC Berkeley, as long as your CalNet ID remains active, and you can continue to log into the cluster during that time.

If your CalNet ID has expired, this could in some cases affect your access to and use of Savio. Note, however, that a CalNet ID is not required for Savio authentication once you have a Savio account; in keeping with our pledge to support collaborators from other institutions, there isn't any requirement that Savio users have a CalNet ID. Also, if you do have a CalNet ID and it is revoked (for example, because you graduate or your employment with the University ends), you may still be able to continue using the Google Authenticator app on your smartphone or tablet to log into and access Savio. Issues with CalNet should be directed to the IT Campus Shared Services (ITCSS) team; their help desk email is itcsshelp@berkeley.edu. For example, you can work with your PI to request a Sponsored Guest type account that would enable you to continue to have access to some UC Berkeley services, including Savio.

Also, as noted in our documentation, please recall that the PI for the project(s) is ultimately responsible for notifying Berkeley Research Computing when user accounts should be closed, and to specify the disposition of a user's software, files, and data.

If you are no longer able to log into Savio but still need access to data on Savio, you can arrange with your research colleagues who have active Savio accounts and access to Savio and/or the PI(s) of the projects you're involved with to transfer the needed data to bDrive (Google Drive), Box, or an external server or personal computer following the instructions, guidelines, and examples in our documentation.

Q. How much am I charged for computational time on Savio?

A: For those with Faculty Computing Allowance accounts, usage of computational time on Savio is tracked (in effect, "charged" for, although no monetary costs are incurred) via abstract measurement units called "Service Units". When all of the Service Units provided under an Allowance have been exhausted, no more jobs can be run under that account. When you have exhausted your entire Faculty Computing Allowance, there are a number of options open to you for getting more computing time for your project. Note that usage tracking does not impact Condo users, who have no Service Unit-based limits on the use of their associated compute pools.

You can view how many Service Units have been used to date under a Faculty Computing Allowance, or by a particular user account on Savio, via the check_usage.sh script.

Q: How and when can I renew my Faculty Computing Allowance (FCA)?

A: Each year, beginning in May, you can submit an application form to renew your Faculty Computing Allowance. Links to this renewal application form are typically emailed to Allowance recipients (and to other designated "main contacts" on such accounts) by or before mid-May each year. (There are often some at least modest differences in the renewal application process from year to year, so there is no permanent online location for this form.)

Renewal applications submitted during May will be processed beginning June 1st. Those submitted and approved later in the year, after the May/June period (i.e., after June 30th), will receive pro-rated allowances of computing time, based on the month of application during the allowance year. Note that only new allowances set up in June, or existing allowances renewed in June, receive the full 300,000 SU allocation.

Please note that there are some additional options open to you as well for getting more computing time for your project when your allowance is used up.

Please also recall that if a researcher is already the PI of an existing FCA project, they cannot request the creation of a new FCA account. So, if a researcher has exhausted the allocated service units on their FCA, they should not request the creation of a new FCA; rather, they should renew their already existing FCA or purchase additional computing time on Savio.

Q: How can I get access to CGRL resources such as the Rosalind or Vector condos?

A: Email cgrl@berkeley.edu. See the CGRL website for more information.

Q: How can I work with sensitive (P2/P3) data on Savio?

A: Please see our documentation on working with sensitive data on Savio. To start the process of creating a P2/P3 project, see Accounts for Sensitive Data.

Login and connection issues

Q: I can’t log into Savio. For example, when I try to log in, I get the following error message: 'username@hpc.brc.berkeley.edu: Permission denied (gssapi-keyex,gssapi-with-mic,keyboard-interactive)'. What should I do?

A: The first thing to do is to make sure you’ve tried all of the steps in the Savio login troubleshooting documentation. In particular, we recommend resetting your OTP on the LBL token management page (troubleshooting step #5).

If the troubleshooting steps don't resolve the issue, then it could be that your external public IPv4 address has been blocked for too many failed authentication attempts or for some other reason(s). (Note that on Savio, there is a policy of automatically blocking IP addresses that have too many failed login attempts.) To check if this is the case, try to change your external public IP (IPv4) address (maybe try logging in from a different IP address/network or different wifi or phone hotspot). For example, if you’re able to log into Savio on your phone hotspot but not on the campus network, it's possible that a campus IP address is being blocked. Other factors to consider and questions to ask yourself include:

  • Which BRC endpoint are you trying to connect to (such as dtn.brc.berkeley.edu or hpc.brc.berkeley.edu)?
  • Do you have proper network connectivity to other SSH hosts?
  • Are you trying to connect through a public wifi? Maybe try using a VPN or switch to another wifi?
  • What error message do you get when you try to ssh from your computer?
  • What computer/OS version/ssh version are you using and what is your exact ssh command?
  • Does your router/network have a firewall against Savio domain names / IP addresses?

If it looks like your IP address has been blocked, let us know your external (public) IPv4 address (which you can find by visiting here (see first line) or here), and we can unblock it.
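If you prefer the command line, you can also usually find your public IPv4 address by querying an external service; for example (ifconfig.me is just one common such service):

curl -4 ifconfig.me    # prints your external public IPv4 address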

Q: When logging into the Non-LBL Token Management web page to set up my token/PIN so I can log into Savio for the first time, I receive the following error message: 'Login Error: There was an error logging you in. The account that you logged in with has not been mapped to a system account. If you believe that you may have mapped a different account, choose it below and try again.'

A: This may be an indication that you need to link your email account with your BRC HPC Cluster account. To do this, please follow the steps listed in our documentation here.

Please also make sure you are logging into the LBL Token Management page with the same account that was linked (i.e., the same information that you used when filling out the account linking form). For example, if you received the linking invitation email via your UC Berkeley (CalNet) email address, make sure you log into the LBL Token Management page with your CalNet ID.

Also, if you need to reset your PIN, you can go to the token management page. Select the identity you linked with your cluster account when you first set it up (for UC Berkeley people, this is usually your UC Berkeley account -- your CalNet ID), and log in. On the token management page, click the “Update PIN” link for the login token you want to change, and you can change your PIN.

If you're using the Google Chrome browser and find that you are still getting the same error message when you try to log into the BRC Cluster, please follow the above instructions using a different browser instead, if at all possible, and see if that helps.

Q: I’m in the process of setting up my OTP (for logging into Savio for the first time), but I haven’t received an account linking email. What should I do / how should I resolve this?

A: Make sure that you fill out and submit the account-linking form following the instructions in our documentation. You should receive an email within 10 to 30 minutes (usually under 10 minutes) that you can use to continue the set-up process described in our documentation. Sometimes users enter a variant of their name that differs from what we have on file; if the name doesn’t match exactly, the form won’t trigger the email. If needed, a consultant can fill out and submit the form on the user’s behalf, and/or send the user the correct information to use when filling out the form. If you fill out the form with the correct data and don’t receive an account-linking invitation email within half an hour or so, please get in touch with us. If a user or consultant fills out the form with known-good information and the email still doesn’t arrive within half an hour, the consultant may contact the Savio system administration team for further assistance.

Q: I forgot my PIN/Password. What do I need to do in order to be able to login to Savio again?

A: If you haven't used the system in a while, you might forget the PIN associated with your login token. To reset/update your PIN, you can go to the Non-LBL Token Management web page, select the identity you linked with your cluster account when you first set it up (for UC Berkeley people, this is usually your UC Berkeley account), and log in. On the token management page, click the “Update PIN” or “Reset PIN” link for the login token you want to change, and you can change your PIN. If you have any trouble with it, please let us know.

Q: I need to remove/transfer old data from my account or the account of a collaborator, but I can’t login anymore or I can’t access their account. What do I do?

A: Your Savio account and your access to Savio should remain active for some period of time after you have left UC Berkeley, as long as your CalNet ID remains active, and you can continue to log into the cluster during that time.

You won't be able to log into Savio without an active CalNet account. So, if your CalNet ID has expired, this will have an impact on your access to and use of Savio. Issues with CalNet should be directed to the IT Campus Shared Services (ITCSS) team; their help desk email is itcsshelp@berkeley.edu. For example, you can work with your PI to request a Sponsored Guest type account that would enable you to continue to have access to some UC Berkeley services, including Savio.

Also, as noted in our documentation, please recall that the PI for the project(s) is ultimately responsible for notifying Berkeley Research Computing when user accounts should be closed, and to specify the disposition of a user's software, files, and data.

If you are no longer able to log into Savio but still need access to data on Savio, you can arrange with your research colleagues who have active Savio accounts and access to Savio and/or the PI(s) of the projects you're involved with to transfer the needed data to bDrive (Google Drive), Box, or an external computer/server following the instructions, guidelines, and examples in our documentation. We can also assist you with this if and as needed.

Q: My terminal / ssh connection to Savio keeps timing out. Is there a way to stay logged on?

A: Inactive connections are terminated after five minutes. It’s possible to set up your SSH configuration so that your session will not time out.

If your laptop/desktop is a Mac or Linux machine, you can add this to your ~/.ssh/config file:

Host *
    ServerAliveInterval 300
    ServerAliveCountMax 2

If you're using PuTTY on Windows, there is information online about configuring keepalives.

Another option is to use the screen or tmux programs for your interactive session. If you are disconnected (or choose to logout from Savio), you can reconnect to your running screen or tmux session when you log back in.
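For example, a minimal tmux workflow (the session name here is arbitrary) looks like this:

tmux new -s mysession      # start a named session on the login node
# ... run your interactive work inside the session ...
# press Ctrl-b then d to detach; the session keeps running if your connection drops
tmux attach -t mysession   # reattach after logging back in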

Finally, in many cases you may be best off setting up your computation to run as a background job using sbatch rather than running interactively using srun.
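For example, a long-running command can be wrapped in a simple batch script and submitted with sbatch. Here is a minimal sketch; the account, partition, and script names are placeholders that you would replace with your own values:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=fc_projectname     # your FCA or Condo account (placeholder)
#SBATCH --partition=savio2           # a partition you have access to (placeholder)
#SBATCH --time=12:00:00              # wall-clock limit (always required)
#SBATCH --nodes=1

module load python                   # load whatever software your job needs
python my_analysis.py                # the long-running command

Save this as, say, my_job.sh and submit it with sbatch my_job.sh; the job continues running on the compute node even after you log out.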

Command line, storage, and software installation issues

Q: I can’t run standard shell commands, I'm unable to list available software modules, or I can’t list directories to navigate to.

A: If you find, after logging onto the cluster, that you can’t list available software modules (with the module avail command), you can’t get a listing of directories to navigate to, or your shell prompt looks like, e.g., -bash-4.2$ instead of, say, [myashar@ln001 ~]$, this may indicate that something is wrong with your shell environment, which in turn may point to a problem with your ~/.bashrc and/or ~/.bash_profile file. If so, you can try to fix these files manually, or you can replace them with the system defaults using the following commands:

/usr/bin/mv ~/.bashrc ~/.bashrc.orig
/usr/bin/cp /etc/skel/.bashrc ~/.bashrc
/usr/bin/mv ~/.bash_profile ~/.bash_profile.orig
/usr/bin/cp /etc/skel/.bash_profile ~/.bash_profile

Then, log out and log back in again, and check whether this has resolved the issue.

Q: I’m transferring data to or from Savio and the transfer is going very slowly.

A: Slow transfers can occur for a variety of reasons. These include:

  • Heavy usage of scratch by Savio users.
  • Bandwidth limits somewhere between Savio and the location you are transferring to/from. (E.g., transferring data to/from a computer at your home is limited by the bandwidth of your home internet connection.)
  • If you are transferring a large number of (possibly small) files, there is a cost simply to opening that many files. Transferring a smaller number of larger files may be more successful, though this may require you to use tar or zip to aggregate the smaller files, which will also take time (see the example below).
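For example, a rough sketch of aggregating a directory of many small files before a transfer (the directory and archive names are hypothetical):

tar -czf my_results.tar.gz my_results/    # bundle and compress the directory into a single file
# ... transfer my_results.tar.gz instead of the individual files ...
tar -xzf my_results.tar.gz                # unpack at the destination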

In general we recommend the use of Globus to transfer files when the location you are transferring to/from can be set up as a Globus endpoint. (Note that we are acquiring a Globus license that will potentially allow transfers to Box or bDrive via Globus, but that is not yet possible.) As a benchmark, one might achieve something like 75 MB/s when transferring between Savio and servers elsewhere on the Berkeley campus.

Q: I am getting a 'Permission denied' error (e.g., when trying to install or use software).

A: Permission denied errors happen when your account does not have permission to perform an attempted read or write on a file or directory.

  • If you are trying to read or write to a file or directory owned by someone else (not the root user), you may need to ask the owner to set permissions so you can do so. See our documentation about making files accessible to other users for more details about how file permissions work and how to set permissions to grant access to other users.
  • Savio users are not able to modify the root filesystem (paths writable only by the root user, usually beginning with /etc, /usr, or /bin). If you need to change these paths for your software to work, you may use a Singularity container.
  • If the permission denied error occurs during an attempted software installation, often the software can be installed at a custom prefix (such as in your home/group directory instead of in /bin or /usr/bin), or otherwise you can install it inside a Singularity container. See our documentation on installing your own software on Savio for more details on how to install software at different locations.
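As an illustration of the custom-prefix approach, here is a minimal sketch for a hypothetical autotools-based package installed under your home directory (the package and directory names are placeholders):

cd ~/src/mypackage-1.0                              # wherever you unpacked the source
./configure --prefix=$HOME/software/mypackage-1.0   # install under your home directory instead of /usr
make
make install
export PATH=$HOME/software/mypackage-1.0/bin:$PATH  # add this line to ~/.bashrc to make it persistent
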
Q: I’ve received a 'disk quota exceeded' error when trying to install software in, move data to, or write to files in my Savio home directory. What does this mean and how can I resolve this?

A: By default, each user on Savio is entitled to a 10 GB home directory, which receives regular backups. If you exceed this 10 GB disk usage quota, you will receive a “disk quota exceeded” error message when you try to install software in, move or add data to, or write to files in your Savio home directory, and you won’t be able to proceed further. Removing files and directories in your home directory, and/or moving files to your Savio scratch or scratch2 directory or your shared group directory (if you have one), until you are below the 10 GB limit should resolve this issue. Be cautious not to remove hidden files beginning with ".", as some of them set useful environment parameters and would need to be regenerated. You can use ls -al to show the hidden files and directories in your home directory (the ‘-a’ flag shows hidden files and directories).

Another option to free up space in your home directory is to delete files inside the ~/.local directory (as well as the ~/.conda/pkgs directory), which is where pip (and Conda) install Python packages by default. It's also possible to install Python packages somewhere other than ~/.local, such as scratch. If there are Python packages in ~/.local that are taking up a lot of space, it's cleaner to remove them with pip rather than just deleting the files; otherwise, Python may think a package is installed when it has actually been deleted. You can use pip uninstall $PACKAGE_NAME for this.

Another typical example occurs when you run a job that has a software package write output to your home directory (or perhaps a group directory) and you go over your 10 GB quota limit (or the 30 GB quota limit in your shared research group directory). This can also happen if the software package writes to a configuration directory that it sets up in your home (or group) directory; e.g., a program might create a directory ~/.name_of_program and put files in there. In that case, you can configure the program to write any output to scratch, for example, if such output files are the problem. Similarly, you could configure the software package to use scratch (instead of your home directory) for its configuration directory.

Note that you can monitor your home directory disk usage by using commands such as quota -s, quota -u $USER, and/or du -sk /global/home/users/$USER (where $USER is your Savio username). Similarly, you can use the command du -h -d 1 ~ to see how much space is used by each top-level subdirectory in your home directory (which is what the ~ indicates).

Please take care not to exceed the 10 GB quota limit by monitoring it on a regular basis and moving files to scratch or scratch2, or to bDrive, Box, or an external system or server, for example, if and as needed, to stay below the 10 GB limit.

Q: I’ve exceed the 12 TB disk usage soft quota in my Savio scratch directory and I need to remove some files from there to get below the 12 TB limit. What are my best options for where to transfer those files and how / what tools are best to use to transfer the data?

A: You can check your current disk usage on the Savio scratch file system (/global/scratch) with the following command:

grep <username> /global/scratch/scratch-usage.txt 

where <username> is your Savio username. This command gives you your current Savio scratch disk usage in KB. To convert to TB, divide by 10^9.

We currently request that users who exceed the 12 TB quota limit in their scratch directories take immediate action to clean up this space by removing or deleting as many files as necessary to get below the 12 TB limit. Otherwise, enforced purge procedures, including disabling job submissions, may take place. This is especially important if and when we approach 100% disk storage usage on the Savio scratch file system, as that can interfere with the ability of other Savio users to run jobs on the system. To help alleviate this situation, we need Savio scratch users who are above the 12 TB quota limit to remove or transfer files from their scratch directories as soon as possible.

If this restricts your research productivity, please get in touch with us ASAP. If you need advice on where to move your files, or if you have questions or special needs for additional storage, please also get in touch with us.

For example, if you have access to /global/scratch2/ for your lab, you're welcome (and encouraged) to move data there.

In general we recommend the use of Globus to transfer files when the location you are transferring to/from can be set up as a Globus endpoint. (Note that we are acquiring a Globus license that will potentially allow transfers to Box or bDrive via Globus, but that is not yet possible.) As a benchmark, one might achieve something like 75 MB/s when transferring between Savio and servers elsewhere on the Berkeley campus.

For other data transfer tools (such as rclone) that you can use to transfer data between Savio and Box, bDrive, your lab server, personal computer, or other external system, please see our documentation.

As far as long-term storage, one option would be our condo storage program.

As far as other options, we'd be happy to discuss further with you, either by email or directly, potentially during our office hours.

If you are no longer affiliated with UC Berkeley and/or no longer have an active Savio account and no longer need the data in your scratch directory, please let us know and we can delete the data on your behalf (after receiving final confirmation from the PI(s) of the FCA and/or Condo account(s) you had access to).

Q: How can I resolve a 'No Space Left on Device' error related to /tmp (and not to scratch or my home directory filling up)?

A: You may receive a "No Space Left on Device" error when you run a job on Savio that turns out to have nothing to do with global scratch or your home directory filling up, but rather is caused by your job writing to the /tmp directory on a compute node (or, in some cases, on one of the login nodes) instead of scratch or your home directory, and /tmp filling up. If your job writes temporary files to /tmp, it's possible that you’ve run out of space there, since /tmp is only around 3.7 GB in size on most compute nodes (and 7.8 GB on savio3 nodes, for example).

Depending on the program that you are running, you may be able to control where it writes temporary files by setting the TMPDIR environment variable in your job script. Two main options are to use your scratch space or (only if using compute nodes) use /local (which is local scratch space on each node).

For example, you can use a directory in your scratch space by setting the TMPDIR environment variable so that the executable uses that scratch directory rather than /tmp. Most programs respect the TMPDIR environment variable, so you can make a temp directory in your personal scratch directory as follows (e.g., you can add this to your SLURM batch script):

mkdir -p /global/scratch/$USER/tmp        # create a tmp directory in your scratch space
export TMPDIR=/global/scratch/$USER/tmp   # point TMPDIR at it

where $USER is your Savio username. If your program respects the TMPDIR environment variable then this should make it use scratch instead of /tmp.

Q: How do I get sudo/root access (for example, to install software)?

A: Unfortunately, Savio does not allow users to have sudo/root access.

If you have requirements for using paths/directories owned by root, you can use a Singularity container which allows you to change root directories from the perspective of the program.

Often software can be installed without root access at a custom prefix (for example, instead of installing the binary in /bin/ or /usr/bin you can put it in a directory you created). Please see our documentation on installing software, including using environment modules to simplify software installed at custom prefix locations.

If none of these solutions fit your needs, you may contact us, and we can advise on how to proceed on Savio without root access.

Q: How do I access files on my personal computer from Savio? Do you allow users to NFS mount their own storage onto the compute nodes?

A: There are several options for transferring files to and from Savio. While you cannot mount your own storage on Savio nodes, if you want to synchronize files in near real-time between your own computer and Savio, then there are tools such as Cyberduck (for Mac/Windows) or sshfs (for Linux) which allow you to synchronize a directory on your computer with a directory on Savio. On Savio, you should use the dtn.brc.berkeley.edu address when connecting using one of these tools.
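For example, on a Linux machine with sshfs installed, mounting your Savio home directory locally might look roughly like this (the local mount point and username are placeholders):

mkdir -p ~/savio-home
sshfs myusername@dtn.brc.berkeley.edu:/global/home/users/myusername ~/savio-home   # mount over SSH
# ... browse and edit files under ~/savio-home as if they were local ...
fusermount -u ~/savio-home    # unmount when finished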

We do not allow users to NFS mount their own storage. We NFS mount Savio's storage across all compute nodes so that data is available independent of which compute nodes are used; however, medium to large clusters can place a very high load on NFS storage servers and many, including Linux-based NFS servers, cannot handle this load and will lock up. A non-responding NFS mount can hang the entire cluster, so we can't risk allowing outside mounts.

Q: I deleted my files accidentally! What can I do?

A: Note: While the home directory filesystem (/global/home) is backed up so it is possible to restore old versions of files, no such backups are taken in scratch (/global/scratch). Important data that needs to be backed up should not be stored in scratch.

The home directory filesystem (files within /global/home) is backed up hourly. Backup snapshots can be accessed from the hidden .snapshots directory within each directory. For example, to access the snapshots of my home directory:

[nicolaschan@ln000 ~]$ cd .snapshots
[nicolaschan@ln000 .snapshots]$ ls
daily_2020_11_08__16_00   hourly_2020_11_12__20_00  hourly_2020_11_13__04_00  hourly_2020_11_13__12_00
daily_2020_11_09__16_00   hourly_2020_11_12__21_00  hourly_2020_11_13__05_00  hourly_2020_11_13__13_00
daily_2020_11_10__16_00   hourly_2020_11_12__22_00  hourly_2020_11_13__06_00  hourly_2020_11_13__14_00
daily_2020_11_11__16_00   hourly_2020_11_12__23_00  hourly_2020_11_13__07_00  hourly_2020_11_13__15_00
daily_2020_11_12__16_00   hourly_2020_11_13__00_00  hourly_2020_11_13__08_00  hourly_2020_11_13__16_00
daily_2020_11_13__16_00   hourly_2020_11_13__01_00  hourly_2020_11_13__09_00  weekly_2020_11_04__16_00
hourly_2020_11_12__18_00  hourly_2020_11_13__02_00  hourly_2020_11_13__10_00  weekly_2020_11_11__16_00
hourly_2020_11_12__19_00  hourly_2020_11_13__03_00  hourly_2020_11_13__11_00

You can then copy the file you want to restore from the snapshot (using the normal cp command).
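For example, to restore a hypothetical file named notes.txt from the most recent hourly snapshot shown above back into your home directory:

cp ~/.snapshots/hourly_2020_11_13__16_00/notes.txt ~/notes.txt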

Slurm / job issues

Q: How do I know which partition, account, and QoS I should use when submitting my job?

A: SLURM provides a command you can run to check the partitions, accounts, and Quality of Service (QoS) options that you're permitted to use:

sacctmgr -p show associations user=$USER

The "-p" option shown in this command produces parsable (pipe-delimited) output; you can omit it if you prefer the default columnar layout.

Q: Why hasn’t my job started? When will my job start?

A: Please see our suggestions. In particular you may wish to try our sq tool to diagnose problems.

Q. How can I check the usage on my Faculty Computing Allowance (FCA)?

A: Savio provides a "check_usage.sh" command line tool you can use to check cluster usage by user or account.

Running "check_usage.sh -E" will report total usage by the current user, as well as a breakdown of their usage within each of their related project accounts, since the most recent reset/introduction date (normally June 1st of each year). To check usage for another user on the system, add a "-u sampleusername" option (substituting an actual user name for 'sampleusername' in this example).

You can check usage for a project's account, rather than for an individual user's account, with the '-a sampleprojectname' option to this command (substituting an actual account name for 'sampleprojectname' in this example).

Also, when checking usage for either users or accounts, you can display usage during a specified time period by adding start date (-s) and/or end date (-e) options, as in "-s YYYY-MM-DD" and "-e YYYY-MM-DD" (substituting actual Year-Month-Day values for 'YYYY-MM-DD' in these examples). Run "check_usage.sh -h" for more information and additional options.
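For instance, combining these options, the following command (with a placeholder username and dates) shows one user's usage, broken down by project account, for June through August of a given allowance year:

check_usage.sh -E -u sampleusername -s 2023-06-01 -e 2023-08-31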

When checking usage for accounts that have overall usage limits (such as Faculty Computing Allowances), the value of the Service Units (SUs) field is color-coded to help you see at a glance how much computational time is still available: green means your project has used less than 50% of its available SUs; yellow means your project has used more than 50% but less than 100% of its available SUs; and red means your project has used 100% or more of its available SUs (and has likely been disabled). Note that if you specify the starttime and/or endtime with "-s" and/or "-e" option(s) you will not get the color coded output.

Here are a couple of output samples from running this command line tool with user and project options, respectively, along with some tips on interpreting that output:

check_usage.sh -E -u sampleusername
Usage for USER sampleusername [2016-06-01T00:00:00, 2016-08-17T18:18:37]: 38 jobs, 1311.40 CPUHrs, 1208.16 SUs used
Usage for USER sampleusername in ACCOUNT co_samplecondoname [2016-06-01T00:00:00 2016-08-17T18:18:37]: 23 jobs, 857.72 CPUHrs, 827.59 SUs
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00 2016-08-17T18:18:37]: 15 jobs, 453.68 CPUHrs, 380.57 SUs

Total usage from June 1, 2016 through the early evening of August 17, 2016 by the 'sampleusername' cluster user consists of 38 jobs run, using approximately 1,311 CPU hours, and resulting in usage of approximately 1208 Service Units. (The total number of Service Units is less than the total number of CPU hours in this example, because some jobs were run on older or otherwise less expensive hardware pools (partitions) which cost less than one Service Unit per CPU hour.)

Of that total usage, 23 jobs were run under the Condo project account 'co_samplecondoname', using approximately 858 CPU hours and 828 Service Units, and 15 jobs were run under the Faculty Computing Allowance project account 'fc_sampleprojectname', using approximately 454 CPU hours and 381 Service Units.

check_usage.sh -a fc_sampleprojectname
Usage for ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00, 2016-08-17T18:19:15]: 156 jobs, 85263.80 CPUHrs, 92852.12 SUs used from an allocation of 300000 SUs.

Usage from June 1, 2016 through the early evening of August 17, 2016 by all cluster users of the Faculty Computing Allowance account 'fc_sampleprojectname'  consists of 156 jobs run, using a total of approximately 85,263 CPU hours, and resulting in usage of approximately 92,852 Service Units. (The total number of Service Units is greater than the total number of CPU hours in this example, because some jobs were run on hardware pools (partitions) which cost more than one Service Unit per CPU hour.) The total Faculty Computing Allowance allocation for this project's account is 300,000 Service Units, so there are approximately 207,148 Service Units still available for running jobs during the remainder of the current Allowance year (June 1 to May 31): 300,000 total Service Units granted, less 92,852 used to date. The total of 92,852 Service Units used to date is colored green, because this project's account has used less than 50% of its total Service Units available.

To also view individual usages by each cluster user of the Faculty Computing Allowance project account 'fc_sampleprojectname', you can add a '-E' option to the above command; e.g., check_usage.sh -E -a fc_sampleprojectname

Finally, if your Faculty Computing Allowance has become completely exhausted, the output from running the "check_usage.sh" command line tool will by default show only information for the period of time after your job scheduler account was disabled; for example:

Usage for ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 3 jobs, 0.00 CPUHrs, 0.00 SUs from an allocation of 0 SUs.
ACCOUNT fc_sampleprojectname has exceeded its allowance. Allocation has been set to 0 SUs.
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 0 jobs, 0.00 CPUHrs, 0.00 (0%) SUs

To display the - more meaningful - information about the earlier usage that resulted in the Faculty Computing Allowance becoming exhausted, use the start date (-s) option and specify the most recently-passed June 1st - the first day of the current Allowance year - as that start date. E.g., to view usage for an Allowance that became exhausted anytime during the 2018-19 Allowance year, use a start date of June 1, 2018:

check_usage.sh -E -s 2018-06-01 -a fc_sampleprojectname

Q: How can I check my job's running status? How do I monitor the performance and resource use of my job?

A: To monitor the status of running batch jobs, please see our documentation on the use of the squeue and wwall tools. For information on the different options available with the use of wwall, enter wwall --help at the Linux command prompt on Savio.

Alternatively, you can login to the node your job is running on as follows:

srun --jobid=$your_job_id --pty /bin/bash

This runs a shell in the context of your existing job. Once on the node, you can run top, htop, ps, or other tools.

You can also see a "top"-like summary for all nodes by running wwtop from a login node. You can use the page up and down keys to scroll through the nodes to find the node(s) your job is using. All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used.

Q: How do I submit jobs to a condo?

A: If you are running your job using a condo account (those beginning with "co_" ), make sure to specify the condo account name when submitting your SLURM batch job script, e.g.,

#SBATCH --account=co_projectname

where ‘projectname’ is the name of the condo account.

A maximum time limit for the job is required under all conditions. When running your job under a QoS that does not have a time limit (such as jobs submitted by the users of most of the cluster's Condos under their priority access QoS), you can specify a sufficiently long time limit value, but this parameter should not be omitted. Jobs submitted without providing a time limit will be rejected by the scheduler.

As a condo contributor, you are also entitled to use extra resources that are available on the Savio cluster (across all partitions). For more information and details, please see our documentation on low-priority jobs.

Note that by default any jobs run in a condo account will use the default QoS (generally savio_normal) if not specified.

For the different QoS configurations for Savio condos, including the corresponding QoS limits, see here. To specify the QoS configuration for your Savio condo, you would include a line in your SLURM batch job script such as the following example:

#SBATCH --qos=lsdi_knl2_normal

Recall that to check which QoS you are allowed to use, simply run sacctmgr -p show associations user=$USER, where $USER is your Savio username.
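Putting these pieces together, the condo-specific portion of a job script header might look like the following sketch; the account, partition, and QoS names are placeholders, so use the values that sacctmgr reports for your own account:

#SBATCH --account=co_projectname     # your Condo account (placeholder)
#SBATCH --partition=savio2           # a partition your condo has access to (placeholder)
#SBATCH --qos=savio_normal           # a QoS you are permitted to use (placeholder)
#SBATCH --time=72:00:00              # a time limit must always be specified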

Q: How can I run multiple jobs on a single node to take account of all the compute cores on a node? How can I run many jobs at once?

A: There are various approaches you can take if you have many jobs you need to run and want to take full advantage of the cores on each Savio node, given that all except the savio2_htc and the various GPU partitions are allocated as entire nodes, so your FCA is charged for use of all the cores on a node.

Q. How do I "burst" onto more nodes?

A. There are two ways to do this. First, Condo users can access more nodes via Savio's preemptable, low-priority quality of service option. Second, faculty can obtain a Faculty Computing Allowance, and their users can then submit jobs to the General queues to run on the compute nodes provided by the institution. (Use of these nodes is subject to the current job queue policies for general institutional access.)

Q: Why is my job not using the GPU nodes / How can I get access to the GPU nodes?

A: Please see this example of a SLURM sbatch GPU job script.

To help the job scheduler effectively manage the use of GPUs, your job submission script must request two CPUs for each GPU you will use. Jobs that do not request a minimum of two CPUs for every GPU will be rejected by the scheduler. You can request the correct number of CPUs either by setting --cpus-per-task to the total number of CPUs needed (assuming you use GPUs on one node only), or by setting --ntasks to the number of GPUs requested and --cpus-per-task=2.

Note that the --gres=gpu:[1-4] specification must be between 1 and 4. This is because the feature is associated with a node, and the nodes each have 4 GPUs. If you wish to use more than 4 GPUs, your --gres=gpu:[1-4] specification should include how many GPUs to use per node requested. For example, if you wish to use eight GPUs, your job script should include options to the effect of "--gres=gpu:4", "--nodes=2", "--ntasks=8", and "--cpus-per-task=2"
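For example, for the eight-GPU case just described, the relevant scheduler directives might look like the following sketch; the account and partition names are placeholders, and this assumes a partition whose nodes each have 4 GPUs:

#SBATCH --account=fc_projectname   # your FCA or Condo account (placeholder)
#SBATCH --partition=savio2_gpu     # a GPU partition you have access to (placeholder)
#SBATCH --nodes=2                  # two nodes with 4 GPUs each
#SBATCH --gres=gpu:4               # 4 GPUs per node
#SBATCH --ntasks=8                 # one task per GPU
#SBATCH --cpus-per-task=2          # two CPUs per GPU, as required by the scheduler
#SBATCH --time=24:00:00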

The available Savio GPU partitions can be found here and include savio2_gpu, savio2_1080ti, savio3_gpu, and savio3_2080ti. You can use the command sacctmgr -p show associations user=$USER (where $USER is your Savio username) to check which of these GPU partitions your account has access to (which you can use when submitting your job on Savio).

To obtain information on the available GPUs, their status, and their utilization on a particular node of a GPU partition, access a GPU node interactively using the srun command, as in the following example:

srun --pty -A fc_projectname --partition=savio2_1080ti --nodes=1 --gres=gpu:1 \
        --ntasks=1 --cpus-per-task=2 -t 48:00:00 bash -i

where ‘projectname’ is the name of your FCA (i.e., fc_projectname) or Condo account (i.e., co_projectname).

Then, once you have logged into the particular GPU node, enter the following at the command prompt: nvidia-smi

You can also use srun to get an interactive shell on the GPU node your job is running on already, and then use nvidia-smi as follows:

srun --jobid=$your_job_id --pty /bin/bash
nvidia-smi

where you substitute your job's ID for $your_job_id.

Q: I can’t connect to other nodes within a job, possibly with the errors 'Host key verification failed' or 'ORTE was unable to reliably start one or more daemons. This usually is caused by:...'. How can I resolve this?

A: This message is usually associated with problems communicating between nodes in jobs that use more than one node. It is often caused by your ~/.ssh directory being misconfigured or corrupted in some way that prevents ssh from connecting to the other nodes in your job’s allocation. It can also be caused by problems on one or more of the nodes your job is running on. To check your ~/.ssh directory, run

ls -al ~/.ssh

The contents should look something like this:

[paciorek@ln002 etc]$ ls -al ~/.ssh
total 116
drwx------ 1 paciorek ucb   348 Oct  7  2019 .
drwxr-xr-x 1 paciorek ucb 65536 Nov 24 15:26 ..
-rw------- 1 paciorek ucb   610 Aug  2  2019 authorized_keys
-rw------- 1 paciorek ucb   672 Aug  2  2019 cluster
-rw-r--r-- 1 paciorek ucb   610 Aug  2  2019 cluster.pub
-rw------- 1 paciorek ucb    98 Aug  2  2019 config
-rw-r--r-- 1 paciorek ucb 28652 Nov 24 15:40 known_hosts

If files are missing or empty, you can try to recreate your ~/.ssh directory as follows:

cp -pr ~/.ssh ~/.ssh-save   #make a backup
/usr/bin/cluster-env        #recreate your .ssh folder

If necessary you can restore your original .ssh from the .ssh-save directory. Or you can delete .ssh-save if you no longer need it.

If your .ssh directory seems fine (or the connection problem persists after running cluster-env), please submit a ticket.

Also as a troubleshooting step, you could try to start a multiple (e.g., two) node job using srun and then try to ssh to the other node(s) from the first node. You can determine the nodes allocated to the job using the following command (once you are on the first node):

echo $SLURM_NODELIST

If you suspect a particular node is the problem, you can request that a specific node be used for your job with the ‘-w’ flag, e.g., -w n0204.savio2, but of course the node of interest may be in use by another job.

Q: My job fails with a message about being out of memory. How can I fix this and how can I know how much memory my job used?

A: Jobs may use a lot of memory simply because the code requires a lot of memory or because the code was not written to use memory efficiently, or because you are using the code in an inefficient way. You may want to look through the code to better understand what steps are using a lot of memory. BRC consultants can also help diagnose what is going on. It’s also possible that the code you are using has a memory leak (i.e., a bug in the code that causes increasing memory use).

Finally, parallelized code can use a lot of memory if multiple copies of input data are made, often one per worker process. In this latter case, one option is to reduce the number of worker processes your job uses by modifying the SLURM submission flags or changing how the code is configured. Another option is to spread your computation across multiple nodes so as to take advantage of the memory of more than one node; this is only possible with code that is set up specifically to parallelize across multiple nodes and won’t generally “just work”.

If you can’t reduce the memory usage, one option is to submit your job to one of our high memory partitions. If you need even more memory, BRC consultants can help you access resources outside Savio (e.g., on NSF’s XSEDE network or commercial cloud providers such as Google Cloud or AWS) that host machines with even more memory available.

To see how much memory your job used (or is currently using), please use these approaches for monitoring jobs.

Q: What do I do when my job's runtime exceeds the maximum queue time, or the wall-clock time specified in my SLURM batch script?

A: There are a variety of possibilities that may help with this:

  • If you have access to a condo, you could run in the condo with a more generous time limit, as most condos don’t have a time limit.
  • For jobs not needing many cores, you can run under our long queue on the savio2_htc partition under an FCA.
  • You may be able to break up your job into separate steps that you can run as individual (shorter) jobs.
  • You may be able to parallelize your code so that it runs faster.
  • You may be able to set up your code to use checkpointing. The general idea is to save the state of your code regularly and then when the job stops at the time limit, submit a new job that starts up at the last checkpoint. Existing software may already have functionality to produce restart files if requested.
  • BRC consultants can help you access resources outside Savio (e.g., on NSF’s XSEDE network or commercial cloud providers such as Google Cloud or AWS) where there are more generous (or no) time limits. Please contact us at brc@berkeley.edu.

Q: I have a job that will take longer than the 3-day time limit for FCAs. Can I run jobs for longer?

A: Yes, you can submit to the long queue, which will run your job on the savio2_htc partition, up to a limit of 10 days. Please see these details on the long queue. Note that time limits on Condo accounts are determined by the Condo PI(s) and in many cases are quite long or unlimited (if no time limit is indicated, that means there is no limit).

Q: Is there any way to resume jobs that will run for longer than the time available to a job?

A: When a job reaches or exceeds its allowed wall-clock limit, is preempted, or fails due to software or hardware faults, it is useful to be able to checkpoint and restart it.

Checkpointing/restarting is the technique where the applications’ state is stored in the filesystem, allowing the user to restart computation from this saved state in order to minimize loss of computation time. By checkpointing or saving intermediate results frequently, you won't lose as much work if your running job exceeds the time limit you specified, your job is preempted, or your job is otherwise terminated prematurely for some reason.

If you wish to do checkpointing, your first step should always be to check if your application already has such capabilities built-in, as that is the most stable and safe way of doing it. Applications that are known to have some sort of native checkpointing include: Abaqus, Amber, Gaussian, GROMACS, LAMMPS, NAMD, NWChem, Quantum Espresso, STAR-CCM+, and VASP.

In case your program does not natively support checkpointing, it may also be worthwhile to consider a generic checkpoint/restart solution that works in an application-agnostic way. One example of such a solution is DMTCP: Distributed MultiThreaded CheckPointing. DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You can find its quick-start documentation here.

Software issues

Q: I want to use software with a graphical interface, such as MATLAB or RStudio, or I want to run a Jupyter notebook. Is this possible on Savio?

A: Some software with a graphical interface, including RStudio and MATLAB, is available through Savio’s Open OnDemand service at ood.brc.berkeley.edu. All your files are accessible in your session. Jupyter notebooks can be run through the Open OnDemand service as well.

Q: How do I know what software Savio has?

A: Savio has software installed using environment modules.

Savio administrators and consultants install modules into /global/software/sl-7.x86_64/ and /global/home/groups/consultsw/sl-7.x86_64/. Groups and users may also create their own module trees. Modules in /global/software/sl-7.x86_64/ are maintained by the system administrators and should be preferred over modules in /global/home/groups/consultsw/sl-7.x86_64/. Modules in /global/home/groups/consultsw/sl-7.x86_64/ are installed by consultants, usually at the request of users and include software and versions that are not in the sysadmin-maintained software tree.

Sometimes software is not visible in module avail until its dependency has been loaded. For example, openmpi cannot be loaded until after gcc or intel has already been loaded. To see an extensive list of all module files, including those only visible after loading other prerequisite modules, enter the following command:

find /global/software/sl-7.x86_64/modfiles -type d -exec ls -d {} \;
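For example, to see the MPI builds that only become visible after a compiler is loaded (module names follow the pattern described above; exact versions will vary):

module avail openmpi     # may show nothing until a compiler module is loaded
module load gcc          # load a compiler first
module avail openmpi     # the matching openmpi builds are now listed
module load openmpi
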
Q: How can I request software be installed on Savio?

A: Please see this section of our documentation to request that software be installed system-wide.

Q: How do I make my software run in parallel or run on more than one node?

A: First, note that in general, software will not run across multiple nodes unless it is set up to do so, and you invoke the software correctly. In some, but certainly not all cases, software that is set up to run in parallel on one node may do so without you having to do anything. Even if it does, it may not use all the cores on a node.

If you are using software developed by someone else, parallelization depends on how the developer designed the software, and ease of use usually depends on how well documented the software is.

There are three main ways in which parallelization parameters are set:

  • command line arguments set when running the software
  • compilation options that produce separate executables when installing the software
  • configuration files that can be modified in advance of running the software

Our training on parallelization on Savio has a section with detailed information, including a number of examples based on popular bioinformatics software.

If you are writing your own code, you have various options:

  • If using C or C++ you can write code that uses openMP (for parallelization across threads on one node) or MPI (for parallelization across processes on one or more nodes).
  • If using Python, R, or MATLAB, there are various ways to parallelize independent computations. You can also use MPI directly from within Python and R.
  • Python’s Dask package and Spark are both tools for working with large datasets in parallel.

Q: How can I run a Jupyter notebook on a GPU?

A: Please note that we've stopped supporting use of the compute nodes (including GPU nodes) through Jupyterhub, as we've migrated the job submission options for Jupyter notebooks over to the Savio Open OnDemand service as of December 2020. You can login with your Savio username and PIN+OTP and select the Jupyter Server under Interactive Apps. From there, you can choose a GPU partition and specify the number of GPUs you want (e.g., gpu:1 for 1 GPU).

Another option for starting a Jupyter notebook on a GPU node is to access it via our visualization node, following these instructions.

Q: How can I use Conda and/or Pip to install Packages in a directory other than my home directory (e.g., shared group directory or scratch directory)?

A: To use Conda to install a package (for example, biopython) in a subdirectory in your personal global scratch directory (/global/scratch/username/) called 'your_directory', with environment name 'test', you can do this (for example):

module load python/3.6
conda create -p /global/scratch/$USER/your_directory/test python=3.7
source activate /global/scratch/$USER/your_directory/test
conda install -p /global/scratch/$USER/your_directory/test biopython
source deactivate

There are a couple of different ways to use pip to install python packages in a different/selected directory from your home directory. For example, you can use pip install --prefix=/path/to/directory package_name (where ‘package_name’ is the name of the python package you are installing) and then modify PYTHONPATH to include /path/to/directory. Also, see this Stack Overflow thread for additional options. For example, you can try (where $USER is your Savio username):

pip install --install-option="--prefix=/global/scratch/$USER/your_directory" biopython

or

pip install --prefix=/global/scratch/$USER/your_directory biopython

You might also want to use the "--ignore-installed" option to force all dependencies to be reinstalled using this new prefix. Recall that pip installs packages into the ~/.local directory (in your home directory) when you use the command pip install --user <package_name>. So, another approach you can take is to copy the contents of ~/.local to your scratch space and then replace it with a symbolic link. The commands to do this are, for example, as follows (where $USER is your Savio username):

cp -r ~/.local /global/scratch/$USER/.local
rm -rf ~/.local
ln -s /global/scratch/$USER/.local ~/.local

Then, when you use pip to install packages in your home directory, they will be installed in your scratch directory instead.

Q. When trying to install Python packages with Conda in my home directory, I receive an error message that includes "Disk quota exceeded" and can’t proceed. How can I resolve this?

A. When a user tries to install additional Python packages in their home directory with Conda and/or set up a Conda environment, they may receive an error message that includes, e.g., "[Errno 122] Disk quota exceeded" once they've exceeded their 10 GB home directory quota. This happens because Conda installs packages inside the ~/.conda directory (in the user’s home directory) and the user has run out of available storage space there. To work around this, you can move the ~/.conda directory to your scratch directory and then create a symbolic link to it.

There are a couple ways to do this. If you want to move your existing Conda environment (note that the cp invocation might take a long time):

mv ~/.conda ~/.conda.orig
cp -r ~/.conda.orig /global/scratch/$USER/.conda   # $USER is your Savio username
ln -s /global/scratch/$USER/.conda ~/.conda

Once everything is done and working, you can delete your old Conda environment from your home directory to free up space if there are no environments you care about keeping.

rm -rf ~/.conda.orig

You can also replace the "cp -r" line with a mkdir and start fresh. This means you'll lose any existing environments, but you won't have to wait for the lengthy copy to finish:

mkdir -p /global/scratch/$USER/.conda   # create a new directory in scratch to take the place of ~/.conda

Again, you should keep in mind that the above process will remove any existing conda environments you have, so you might consider exporting these to an environment.yml file if you need to recreate them.

Also, please keep in mind that you can follow the instructions below to remove redundant/unused Conda environments as needed:

conda info --envs                   # lists all your environments
conda remove --name myenv --all     # where myenv is the name of the environment you want to remove

Another option to free up space in your home directory is to delete files inside the ~/.local directory (in your home directory), which is where pip installs python packages (for example) by default. It's also possible to install into someplace other than .local, such as scratch. If there are python packages in ~/.local that are taking up a lot of space, it's cleaner to remove them with pip rather than just deleting files there. Otherwise, you might have issues with Python thinking a package is installed but it has actually been deleted. You can use pip uninstall <PACKAGE_NAME> for this.

You can also check if there are files in ~/.conda/pkgs that are taking up a lot of space. If you run

du -h -d 1 ~

you'll see how much space is used by each top-level subdirectory in your home directory (which is what the ~ indicates).

Miscellaneous issues

Q. How can I acknowledge the Savio cluster in my presentations or publications?

A. You can use the following sentence in order to acknowledge computational and storage services associated with the Savio cluster:

"This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer)."

Acknowledgements of this type are an important factor in helping to justify ongoing funding for, and expansion of, the cluster. We also encourage you to tell us how BRC impacts your research (Google Form), at any time!