
Frequently Asked Questions

General FAQs

Q. How can I access the Faculty Computing Allowance application (Requirements Survey) and Additional User Request forms? I'm seeing a "You need permission" error.

A. You need to authenticate via CalNet to access the online forms for applying for a Faculty Computing Allowance, and for requesting additional user accounts on the Savio cluster.

When accessing either form, you may encounter the error message, "You need permission. This form can only be viewed by users in the owner's organization", under either of these circumstances:

1. If you haven't already successfully logged in via CalNet. (If you don't have a CalNet ID, please work with a UCB faculty member or other researcher who can access the form on your behalf.)
2. If you've logged in via CalNet, but you're also simultaneously connected, in your browser, to a non-UCB Google account; for instance, to access a personal Gmail account. (If so, the easiest way to access the online forms might be to use a private/incognito window in your primary browser, or else use a second browser on your computer, one in which you aren't already logged into a Google account. As an alternative, you can first log out of all of your Google accounts in your primary browser, before attempting to access these forms.)

Q. How do I know which partition, account, and QoS I should use when submitting my job?

A. SLURM provides a command you can run to check on the partitions, accounts and Quality of Service (QoS) options that you're permitted to use. Please run the "sacctmgr show associations user=$USER" command to find this information for your job submission. You can also add the "-p" option to this command to get a parsable output, i.e., "sacctmgr -p show associations user=$USER".
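
As a sketch of reading that parsable output (the one-line sample below is hypothetical, and the column order of "sacctmgr -p" output can vary by Slurm version), you can pull out individual fields with awk:

```shell
# Hypothetical one-line sample of `sacctmgr -p show associations user=$USER`
# output. Real output is pipe-delimited, one association per line; the
# column order shown here (cluster|account|user|partition|qos) is illustrative.
assoc="savio|fc_example|jdoe|savio2|savio_normal|"

# Print the account (field 2) and QoS (field 5) from the sample line.
echo "$assoc" | awk -F'|' '{print "account=" $2 " qos=" $5}'
```

Each line of the real output corresponds to one partition/account/QoS combination you may use in a job submission.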

Q. How can I check on my Faculty Computing Allowance (FCA) usage?

A. Savio provides a command-line tool you can use to check cluster usage by user or account.

Running this tool with the "-E" option will report total usage by the current user, as well as a breakdown of their usage within each of their related project accounts, since the most recent reset/introduction date (normally June 1st of each year). To check usage for another user on the system, add a "-u sampleusername" option (substituting an actual user name for 'sampleusername' in this example).

You can check usage for a project's account, rather than for an individual user's account, with the '-a sampleprojectname' option to this command (substituting an actual account name for 'sampleprojectname' in this example).

Also, when checking usage for either users or accounts, you can display usage during a specified time period by adding start date (-s) and/or end date (-e) options, as in "-s YYYY-MM-DD" and "-e YYYY-MM-DD" (substituting actual Year-Month-Day values for 'YYYY-MM-DD' in these examples). Run the tool with "-h" for more information and additional options.

When checking usage for accounts that have overall usage limits (such as Faculty Computing Allowances), the value of the Service Units (SUs) field is color-coded to help you see at a glance how much computational time is still available: green means your project has used less than 50% of its available SUs; yellow means your project has used more than 50% but less than 100% of its available SUs; and red means your project has used 100% or more of its available SUs (and has likely been disabled). Note that if you specify a start time and/or end time with the "-s" and/or "-e" option(s), you will not get the color-coded output.
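
The thresholds described above can be sketched as simple integer arithmetic (the usage and allocation numbers below are hypothetical):

```shell
# Classify usage into the green/yellow/red bands described above.
used=92852        # hypothetical SUs used
allocation=300000 # hypothetical SUs granted
pct=$((100 * used / allocation))  # integer percentage of the allocation used

if [ "$pct" -lt 50 ]; then
  echo "green: ${pct}% of SUs used"
elif [ "$pct" -lt 100 ]; then
  echo "yellow: ${pct}% of SUs used"
else
  echo "red: ${pct}% of SUs used"
fi
```

With these sample numbers, the script prints "green: 30% of SUs used".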

A couple of sample outputs from running this command-line tool with user and project options, respectively, along with some tips on interpreting that output. First, checking an individual user's usage with the "-E -u sampleusername" options:

Usage for USER sampleusername [2016-06-01T00:00:00, 2016-08-17T18:18:37]: 38 jobs, 1311.40 CPUHrs, 1208.16 SUs used
Usage for USER sampleusername in ACCOUNT co_samplecondoname [2016-06-01T00:00:00, 2016-08-17T18:18:37]: 23 jobs, 857.72 CPUHrs, 827.59 SUs
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00, 2016-08-17T18:18:37]: 15 jobs, 453.68 CPUHrs, 380.57 SUs

Total usage from June 1, 2016 through the early evening of August 17, 2016 by the 'sampleusername' cluster user consists of 38 jobs run, using approximately 1,311 CPU hours, and resulting in usage of approximately 1,208 Service Units. (The total number of Service Units is less than the total number of CPU hours in this example because some jobs were run on older or otherwise less expensive hardware pools (partitions) that cost less than one Service Unit per CPU hour.)

Of that total usage, 23 jobs were run under the Condo project account 'co_samplecondoname', using approximately 858 CPU hours and 828 Service Units, and 15 jobs were run under the Faculty Computing Allowance project account 'fc_sampleprojectname', using approximately 454 CPU hours and 381 Service Units.

Next, checking a project account's usage with the "-a fc_sampleprojectname" option:

Usage for ACCOUNT fc_sampleprojectname [2016-06-01T00:00:00, 2016-08-17T18:19:15]: 156 jobs, 85263.80 CPUHrs, 92852.12 SUs used from an allocation of 300000 SUs.

Usage from June 1, 2016 through the early evening of August 17, 2016 by all cluster users of the Faculty Computing Allowance account 'fc_sampleprojectname' consists of 156 jobs run, using a total of approximately 85,263 CPU hours, and resulting in usage of approximately 92,852 Service Units. (The total number of Service Units is greater than the total number of CPU hours in this example because some jobs were run on hardware pools (partitions) that cost more than one Service Unit per CPU hour.) The total Faculty Computing Allowance allocation for this project's account is 300,000 Service Units, so there are approximately 207,148 Service Units still available for running jobs during the remainder of the current Allowance year (June 1 to May 31): 300,000 total Service Units granted, less 92,852 used to date. The total of 92,852 Service Units used to date is colored green, because this project's account has used less than 50% of its total available Service Units.
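
The remaining-balance arithmetic in that example is simply:

```shell
# 300,000 SUs granted minus 92,852 SUs used (values from the sample above).
allocation=300000
used=92852
remaining=$((allocation - used))
echo "${remaining} SUs remaining"  # prints "207148 SUs remaining"
```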

To also view individual usages by each cluster user of the Faculty Computing Allowance project account 'fc_sampleprojectname', you can add a '-E' option to the above command; e.g., "-E -a fc_sampleprojectname".

Finally, if your Faculty Computing Allowance has been completely exhausted, the output from running this command-line tool will by default show only information for the period of time after your job scheduler account was disabled; for example:

Usage for ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 3 jobs, 0.00 CPUHrs, 0.00 SUs from an allocation of 0 SUs.
ACCOUNT fc_sampleprojectname has exceeded its allowance. Allocation has been set to 0 SUs.
Usage for USER sampleusername in ACCOUNT fc_sampleprojectname [2017-04-05T11:00:00, 2017-04-24T17:19:12]: 0 jobs, 0.00 CPUHrs, 0.00 (0%) SUs

To display the more meaningful information about the earlier usage that resulted in the Faculty Computing Allowance becoming exhausted, use the start date (-s) option and specify the most recently passed June 1st (the first day of the current Allowance year) as that start date. E.g., to view usage for an Allowance that became exhausted anytime during the 2018-19 Allowance year, use a start date of June 1, 2018, via the options "-E -s 2018-06-01 -a fc_sampleprojectname".

Q. How can I use $SLURM_JOBID in my output and/or error log filenames?

A. SLURM manages its scheduler-specific environment variables differently than PBS and other job schedulers do. Before you use a SLURM environment variable, please check its scope of availability by entering "man sbatch" or "man srun".

To include the job ID in the output and/or error file names, sbatch takes a filename pattern rather than an environment variable. See "man sbatch" for details; as a quick reference, the proper syntax is "--output=%j.out", where "%j" expands to the job ID.
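
For example, a minimal job script using this pattern might look as follows (the job name and time limit are placeholders; run as a plain script outside SLURM, the final line simply prints an empty job ID):

```shell
#!/bin/bash
# Minimal sketch of a SLURM job script using filename patterns.
# %j expands to the numeric job ID, so a job with ID 123456 writes
# its standard output to 123456.out and its standard error to 123456.err.
#SBATCH --job-name=example
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --time=00:10:00

# Inside the job itself, the same ID is available as $SLURM_JOB_ID.
echo "Running as job ${SLURM_JOB_ID}"
```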

Q. When is my job going to start to run?

A. In most cases you should be able to get an estimate of when your job will start by running the command "squeue --start -j $your_job_id" (substituting your actual job ID for '$your_job_id' in this example). However, under either of the following circumstances you may get "N/A" as the reported START_TIME.

  • You did not specify a run time limit for the job with the "--time" option. This prevents the job from being backfilled.
  • The job that's currently running and blocking your job from starting didn't specify a run time limit with the "--time" option.

Thus, to improve scheduler efficiency and to obtain a more accurate estimate of the start time, it is highly recommended that you always use "--time" to specify a run time limit for your jobs. It is also worth noting that the start time is only an estimate, based on the jobs currently queued in the scheduler. If new jobs with higher priorities are submitted later, this estimate will be updated accordingly. So treat this estimate as a reference, not a guaranteed start time.

Q. Why is my job not starting to run?

A. Many factors can prevent a job from starting when you expect it to:

  • The first troubleshooting step is to get an estimate of the start time with "squeue --start -j $your_job_id" (substituting your actual job ID for '$your_job_id' in this example). If you are satisfied with the estimated start time, you can stop here.
  • If you would like to troubleshoot further, you can run "sinfo -p $partition" (substituting the actual name of the partition on which you're trying to run your job, such as 'savio2' or 'savio2_gpu', for '$partition' in this example) to see whether the resources you are requesting are currently allocated to other jobs. In general, "idle" nodes are free and available to run new jobs, while nodes in most other states are currently unavailable. If you used node features in your job submission file, make sure you only check the resources that match the node feature you requested. (All node features are documented on your cluster's webpage.) If the partition on which you want to run your job is currently heavily impacted (has few idle nodes), and you are not satisfied with your job's estimated start time (see above), you might consider running the job on another, less-impacted partition to which you also have access, if that partition's features are compatible with your job.
  • If you see enough resources available but your job is still not showing a reasonable start time, please run "sprio" to see if there are jobs with higher priorities that are currently blocking your job. (For instance, "sprio -o "%Y %i %u" | sort -n -r" will show pending jobs across all of Savio's partitions, ordered from highest to lowest priority, together with their job IDs and usernames.)
  • If there are no higher priority jobs blocking the resources that you requested, you can check whether there may be any reservation on the resources that you requested with "scontrol show reservations". Reservations are used, for instance, to defer the start of jobs whose requested wall clock times might overlap with scheduled maintenance periods on the cluster. In those instances, if your job can be run within a shorter time period, you can adjust its wall clock time to avoid such overlap.
  • For Faculty Computing Allowance users, you may also want to check whether your account's allowance has been completely used up, via the usage-checking command-line tool. (See the tip on using this tool, above.)

If, after going through all of these steps, you are still puzzled by why your job is not starting, please feel free to contact us.

Q. How can I check my job's running status?

A. If you suspect your job is not running properly, or you simply want to understand how much memory or CPU the job is actually using on the compute nodes, RIT provides a script, "wwall", to check that. "wwall -j $your_job_id" provides a snapshot of the status of the node(s) your job is running on. "wwall -j $your_job_id -t" provides a text-based user interface (TUI) for monitoring node status as the job progresses. To exit the TUI, enter "q" to quit the interface and return to the command line.

Q. How can I run High-Throughput Computing (HTC) type of jobs? (How can I run multiple jobs on the same node?)

A. If you have a large set of short-duration tasks that you would like to perform on the cluster, your workload falls into the category of High-Throughput Computing (HTC). Typical applications, such as parameter/configuration scans and divide-and-conquer approaches, fit this category. Running an HTC workload isn't easy on a traditional HPC cluster with time and resource limits; however, within those constraints, there are still some options available.

The usage instructions for the "GNU parallel" shell tool can be found at this page: GNU parallel

The Savio cluster also offers High-Throughput Computing nodes, which may be suitable for some of these types of HTC tasks.
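
As a portable sketch of the fan-out pattern these tools provide (using xargs -P here as a stand-in for GNU parallel's -j option; the echo is a placeholder for your real per-task command), the following runs eight short tasks with at most four running concurrently:

```shell
# Run 8 short, independent tasks, at most 4 at a time, inside one allocation.
# Replace the echo with your real per-task command.
seq 1 8 | xargs -P 4 -I{} sh -c 'echo "task {} done"'
```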

Q. My job keeps failing with the error "Host key verification failed". How can I resolve this?

A. The message "Host key verification failed" often means something is wrong with your ~/.ssh directory or the files in it. For example, the directory may be missing the 'known_hosts' file. Please try these steps to regenerate your ~/.ssh folder:

$ mv ~/.ssh ~/.ssh.orig # move your .ssh folder to a backup

$ cluster-env

Your ~/.ssh directory should then include, for example, files such as the following:

$ ls -l ~/.ssh
total 16
-rw------- 1 myashar ucb 3326 Aug 16 2018 id_rsa
-rw-r--r-- 1 myashar ucb 747 Aug 16 2018 id_rsa.pub
-rw------- 1 myashar ucb 3986 Apr 8 00:03 known_hosts

Q. When trying to install Python packages with Conda in my home directory, I receive an error message that includes "Disk quota exceeded" and can’t proceed. How can I resolve this?

A. When a user is trying to install additional Python packages in their home directory with Conda and/or set up a Conda environment, they may sometimes receive an error message that includes, e.g., "[Errno 122] Disk quota exceeded" when they've exceeded their 10 GB home directory quota. This happens because Conda installs packages inside the ~/.conda directory (in the user's home directory), and the user has run out of available storage space there. To work around this, you can move the ~/.conda directory to your scratch directory and then create a symbolic link to it.

There are a couple of ways to do this. If you want to keep your existing Conda environments, move them to scratch (the cp might take a long time):

mv ~/.conda ~/.conda.orig

cp -r ~/.conda.orig /global/scratch/$USER/.conda # $USER is your Savio username

ln -s /global/scratch/$USER/.conda ~/.conda

Once everything is done and working, you can delete your old Conda environment from your home directory to free up space if there are no environments you care about keeping.

rm -rf ~/.conda.orig

Alternatively, you can replace the "cp -r" line with a mkdir and start fresh. This means you'll lose any existing environments, but you won't have to wait for the lengthy copy to finish:

mkdir -p /global/scratch/$USER/.conda # create a new directory for ~/.conda in scratch (note the leading dot, so the ln -s step above points at it)

Again, you should keep in mind that the above process will remove any existing conda environments you have, so you might consider exporting these to an environment.yml file if you need to recreate them.

Also, please keep in mind that you can follow the instructions below to remove redundant/unused Conda environments as needed:

1) conda info --envs (this lists all your environments)

2) conda remove --name myenv --all (where myenv is the name of the environment you want to remove)

Another option to free up space in your home directory is to delete files inside the ~/.local directory (in your home directory), which is where pip installs Python packages by default. It's also possible to install into someplace other than ~/.local, such as scratch. If there are Python packages in ~/.local that are taking up a lot of space, it's cleaner to remove them with pip rather than just deleting files there; otherwise, you might have issues with Python thinking a package is installed when it has actually been deleted. You can use "pip uninstall $PACKAGE_NAME" for this.

You can also check if there are files in ~/.conda/pkgs that are taking up a lot of space. If you run

du -h -d 1 ~

you'll see how much space is used by each top-level subdirectory in your home directory (which is what the ~ indicates).

Q. I’m unable to list available software modules and/or I can’t display various directories to navigate to. How can I resolve this?

A. If you find, after logging onto the cluster, that you can't list available software modules (when using the "module avail" command) and/or you can't get a listing of directories to navigate to, or if your Linux shell prompt looks like, e.g., "-bash-4.2$" instead of, say, "[myashar@ln001 ~]$", this may indicate that something is wrong with your shell environment, which in turn may point to a problem with your ~/.bashrc and/or ~/.bash_profile file. If so, you can try to fix these files manually, or you can replace them with the system defaults using the following commands:

mv ~/.bashrc ~/.bashrc.orig

cp /etc/skel/.bashrc ~/.bashrc

mv ~/.bash_profile ~/.bash_profile.orig

cp /etc/skel/.bash_profile ~/.bash_profile

Then, log out and log back in again, and check whether this has resolved the issue.

Q. What does the #error "This Intel <math.h> is for use with only the Intel compilers!" mean?

A. You likely are seeing this error because you have an Intel compiler module loaded in your environment, but you are trying to build your application with a GCC compiler. Please unload any Intel compiler module(s) from your current environment and rebuild with GCC. (See Accessing Software for instructions on unloading modules.)

Q. My code uses C++11 features. Do any compilers on the cluster support that?

A. C++11 (formerly known as C++0x) features have been partially supported by Intel's C++ compilers, beginning with version 11.x, and are fully supported in the 2015.x series. For more details please refer to Intel's C++11 features support page. Note: to support the full set of C++11 features, GCC 4.8 and above is also needed. Please follow this guidance when compiling your C++ code with C++11 features on the cluster:

  • Start by loading the environment module for the default version of the Intel compilers via module load intel. Compile the code with "icpc -std=c++11 some_file" (replacing "some_file" with the actual name of your C++11 source code file).
  • If the command above finishes successfully you can stop here. Otherwise please check Intel's C++11 features support page to learn whether the C++11 features your code uses are supported by the default version of the Intel compilers. If not, please switch to the cluster's environment module that provides a higher version of the Intel compilers. To do so, enter "module switch intel intel/xxxx.yy.zz" (replacing "xxxx.yy.zz" with that higher version number; enter "module avail" to find that number, if needed).
  • If your code uses the C++11 Standard Template Library (STL), you’ll also need to load the GCC/4.8.5 software module as a driver; its header files provide support for the C++11 STL. To do so, enter "module load gcc/4.8.5" before compiling your code.

Q. How can I see all of the available modules?

A. To see an extensive list of all module files, including those only visible after loading other prerequisite modules, enter the following command:

find /global/software/sl-7.x86_64/modfiles -type d -exec ls -d {} \;

Q. Can I get root access to my compute nodes?

A. Unfortunately, that is not possible. All the compute nodes download the same operating system image from the master node and load the image into RAM disk, so changes to the operating system on the compute node would not be persistent. If you believe that you may need root access for software installations, or any other purpose related to your research workflow, please contact us and we'll be glad to explore various alternative approaches with you.

Q. Do you allow users to NFS mount their own storage onto the compute nodes?

A. No. We NFS mount storage across all compute nodes so that data is available independent of which compute nodes are used; however, medium to large clusters can place a very high load on NFS storage servers and many, including Linux-based NFS servers, cannot handle this load and will lock up. A non-responding NFS mount can hang the entire cluster, so we can't risk allowing outside mounts.

Q. How much am I charged for computational time on Savio?

A. For those with Faculty Computing Allowance accounts, usage of computational time on Savio is tracked (in effect, "charged" for, although no costs are incurred) via abstract measurement units called "Service Units." (Please see Service Units on Savio for a description of how this usage is calculated.) When all of the Service Units provided under an Allowance have been exhausted, no more jobs can be run under that account. Usage tracking does not impact Condo users, who have no Service Unit-based limits on the use of their associated compute pools.

Q. How can I acknowledge the Savio Cluster in my presentations or publications?

A. You can use the following sentence in order to acknowledge computational and storage services associated with the Savio Cluster:

"This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer)."

Acknowledgements of this type are an important factor in helping to justify ongoing funding for, and expansion of, the cluster. As well, we encourage you to tell us how BRC impacts your research (Google Form), at any time!

Condo Cluster Computing Program FAQs

Q. What are the benefits of participating in the Condo cluster program?

A. A major incentive for researchers to participate is that they only have to purchase their compute nodes, and support of the compute nodes is provided for free in exchange for their unused compute cycles. In addition to receiving professional systems administration support, researchers will be able to leverage the use of the HPC infrastructure (firewalled subnet, login nodes, commercial compiler, parallel filesystem, etc.) when they use their compute nodes. This infrastructure is provided for free and saves researchers from having to purchase and create any of these components on their own.

Q. What are the support costs for participating in the Condo program?

A. The monthly cluster support, colocation and network fees are waived for researchers who buy into the Condo. Essentially, the institution waives those costs in exchange for excess compute cycles. Each user of the system receives a 10 GB storage allocation, which includes backups. Condo groups are also eligible to receive additional group storage of 200 GB. In addition, use of the large, shared parallel scratch filesystem is provided at no cost. Condo users needing more storage for persistent data can purchase additional allocations at current rates. In addition, users needing very large amounts of persistent storage can also take advantage of the Condo Storage Service.

Q. How do I purchase compute nodes for the Condo program?

A. Prospective condo owners are invited to contact us. Our team will work with you to understand your application and to determine if the Condo cluster would be a suitable platform. We will provide an estimate of the costs of the compute nodes and associated InfiniBand network equipment and then work with your Procurement buyer to specify the correct items to order. Participants are expected to contribute the compute nodes and InfiniBand cable.

Q. How do I get access to my nodes?

A. We will set up a floating reservation equivalent to the number of nodes that you contribute to the Condo to provide priority access to you and your users. You can determine the run time limits for your reservation. If you are not using your reservation, then other users will be allowed to run jobs on unused nodes. If you submit a job to run when all nodes are busy, your job will be given priority over all other waiting jobs to run, but your job will have to wait until nodes become free in order to run. We do not do pre-emptive scheduling where running jobs are killed in order to give immediate access to priority jobs.

Q. I need dedicated or immediate access to my nodes. Can you accommodate that?

A. The basic premise of Condo participation is to facilitate the sharing of unused resources. Dedicating or reserving compute resources works counter to sharing, so this is not possible in the Condo model. As an alternative, PIs can purchase nodes and set them up as a Private Pool in the Condo environment, which will allow a researcher to tailor the access and job queues to meet their specific needs. Private Pool compute nodes will share the HPC infrastructure along with the Condo cluster; however, researchers will have to cover the support costs for BRC staff to manage their compute nodes. Rates for Private Pool compute nodes will be determined at a later date.

Q. How do I "burst" onto more nodes?

A. There are two ways to do this. First, Condo users can access more nodes via Savio's preemptable, low-priority quality of service option. Second, faculty can obtain a Faculty Computing Allowance, and their users can then submit jobs to the General queues to run on the compute nodes provided by the institution. (Use of these nodes is subject to the current job queue policies for general institutional access.)