Troubleshooting Open OnDemand

Common problems

Problem: when I log in to OOD, I immediately get the error: "can't find user for YOUR-USERNAME. Run 'nginx_stage --help' to see a full list of available command line options"

This error occurs if your account has not been correctly set up to use OOD. Please contact us.

Problem: my OOD apps on shared nodes (VS Code, Jupyter shared node app, RStudio shared node app) never start

In some situations when using apps that run outside of a Slurm job (e.g., VS Code and Jupyter Server sessions on the shared Jupyter node), the session starts to be created, but the "Connect" button never appears on the "Interactive Sessions" page and the session terminates after a minute or two. We've seen three causes for this.

One possibility is that the base Conda environment is activated automatically whenever you log in. This will generally be the case if you have run conda init and thereby modified your .bashrc file as discussed here. If so, you will generally see "(base)" at the beginning of your shell prompt, e.g. (base) [your_username@ln002 ~]$. The solution is to tell Conda not to enter the base environment when you log in to Savio. Once in your shell, simply run

conda config --set auto_activate_base False
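To confirm that the setting took effect, one option is to look for it in your user-level Conda configuration file. The following is a minimal sketch, assuming the setting was written to ~/.condarc (Conda's default per-user configuration file); the function name is hypothetical.

```shell
# Hypothetical helper: succeeds if the given condarc file (default: ~/.condarc)
# contains a line that turns off automatic activation of the base environment.
base_autoactivate_disabled() {
  condarc="${1:-$HOME/.condarc}"
  grep -Eiq '^[[:space:]]*auto_activate_base:[[:space:]]*false[[:space:]]*$' "$condarc" 2>/dev/null
}

if base_autoactivate_disabled; then
  echo "auto_activate_base is disabled in ~/.condarc"
else
  echo "auto_activate_base is not disabled in ~/.condarc"
fi
```

After changing the setting, open a new login shell; the prompt should no longer begin with "(base)".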

A second possibility is that you have installed a version of some Python package in your ~/.local directory (via pip install --user) that is masking the system version of the same package and interfering with starting your session. Please try moving aside your ~/.local directory (e.g., mv .local .local.save in a terminal) and trying again to see if that allows your session to start.
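Setting ~/.local aside is easy to undo once you have finished testing. The sketch below wraps the mv command from the paragraph above in two small functions that refuse to overwrite anything; the function names are hypothetical and the optional argument (a directory to use instead of your home directory) exists only so the functions can be tried safely elsewhere.

```shell
# Hypothetical helpers. The optional argument is the directory that holds
# .local (default: your home directory).
set_aside_local() {
  base="${1:-$HOME}"
  if [ ! -d "$base/.local" ]; then
    echo "no .local directory in $base" >&2
    return 1
  fi
  if [ -e "$base/.local.save" ]; then
    echo "$base/.local.save already exists; not overwriting it" >&2
    return 1
  fi
  mv "$base/.local" "$base/.local.save"
}

restore_local() {
  base="${1:-$HOME}"
  if [ ! -d "$base/.local.save" ]; then
    echo "no .local.save directory in $base" >&2
    return 1
  fi
  if [ -e "$base/.local" ]; then
    echo "$base/.local exists again; remove or merge it before restoring" >&2
    return 1
  fi
  mv "$base/.local.save" "$base/.local"
}
```

Run set_aside_local, try starting the session again, and run restore_local afterwards if ~/.local turned out not to be the cause.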

A third possibility is that you have something configured in your .bashrc file that is preventing the OOD session from starting. Try saving a copy of your .bashrc and then using a plain version:

cp -rf ~/.bashrc ~/.bashrc.save

Then either remove items from your .bashrc or create a very simple .bashrc that looks like this:

# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

Problem: my OOD apps on shared nodes (VS Code, Jupyter shared node app, RStudio shared node app) report "permission denied (gssapi-keyex,gssapi-with-mic,keyboard-interactive)"

This can occur if permissions are incorrectly set on your home directory or your .ssh directory, preventing OOD from using SSH to connect to the shared node. Make sure that your home directory is not world-writable. In the following example, the last "w" in the first listing indicates that the user's home directory is world-writable; chmod o-w removes that permission.

[bugs_bunny@ln002 users]$ ls -ld bugs_bunny
drwxrwxrwx 1 bugs_bunny ucb 1048576 Mar 29 16:06 bugs_bunny
[bugs_bunny@ln002 users]$ chmod o-w ~bugs_bunny
[bugs_bunny@ln002 users]$ ls -ld bugs_bunny
drwxrwxr-x 1 bugs_bunny ucb 1048576 Mar 29 16:06 bugs_bunny

You might also need to modify the permissions on your .ssh directory to look like this:

[bugs_bunny@ln002 ~]$ ls -ld .ssh
drwx------ 1 bugs_bunny ucb 348 Jan  6 15:58 .ssh
[bugs_bunny@ln002 ~]$ ls -l .ssh 
total 104
-rw-r--r-- 1 bugs_bunny ucb  1013 Jan  6 15:58 authorized_keys
-rw------- 1 bugs_bunny ucb   672 Aug  2  2019 cluster
-rw-r--r-- 1 bugs_bunny ucb   610 Aug  2  2019 cluster.pub
-rw------- 1 bugs_bunny ucb    98 Jan  6 15:58 config
-rw-r--r-- 1 bugs_bunny ucb 79896 Mar 16 13:22 known_hosts
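The world-writable check described above can also be scripted. This is a minimal sketch that reads the mode string printed by ls -ld; the function name is hypothetical.

```shell
# Hypothetical helper: succeeds if "other" users have write permission on the
# given directory, i.e. the 9th character of the ls -ld mode string is "w".
is_world_writable() {
  mode=$(ls -ld "$1" | cut -c1-10)
  [ "$(printf '%s' "$mode" | cut -c9)" = "w" ]
}

if is_world_writable "$HOME"; then
  echo "$HOME is world-writable; fix it with: chmod o-w $HOME"
else
  echo "$HOME is not world-writable"
fi
```

In numeric terms, the .ssh listing above corresponds to mode 700 on the .ssh directory, 600 on the private key and config files, and 644 on the remaining files.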

Problem: my OOD apps report an SSH or connection error

If OOD reports an error related to SSH or a connection problem, the issue may be that your account is not properly configured to be able to connect to the Savio compute nodes from the Savio login nodes. To troubleshoot:

  • First, check that you can log in to Savio by using SSH to connect to a login node.
  • Second, check that you can ssh between nodes on Savio. For example, try to ssh to the DTN or another login node from a terminal session on one of the Savio login nodes:
    ssh dtn
    ssh ln001  # This assumes you are not already on ln001.
    

If any of these tests do not work, please contact us.
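The node-to-node test can be run without any prompts, which makes a failure easier to spot. The sketch below assumes that SSH between Savio nodes does not require typing a password; if it does, BatchMode will report a failure even though an interactive ssh would succeed. The function name and the SSH_CMD variable are hypothetical and are not part of any Savio tooling.

```shell
# Hypothetical helper: try a non-interactive SSH connection to each host and
# report the result. BatchMode=yes makes ssh fail instead of prompting;
# ConnectTimeout=10 gives up after 10 seconds.
# SSH_CMD is an illustrative override for the ssh command (default: ssh).
check_ssh_hosts() {
  failed=0
  for host in "$@"; do
    if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$host" true </dev/null >/dev/null 2>&1; then
      echo "ok: $host"
    else
      echo "FAILED: $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example, using the host names from the list above (run on a login node):
# check_ssh_hosts dtn ln001
```

The function returns a non-zero status if any host could not be reached, so it can also be used in a script.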

Problem: my OOD Jupyter kernel or my RStudio's R session keeps dying

There can be various reasons a Jupyter kernel in an OOD Jupyter session or an RStudio R session may repeatedly die.

  • Various users have reported problems when using Jupyter notebooks with the Safari web browser. (If you try to open a Terminal under the OOD Clusters tab, you will probably see a "websocket connection" error message.) Try using Chrome or another browser and see if the problem persists.
  • If your code uses more memory than available to your session, the Jupyter kernel or R session can die without telling you why. In particular this is likely to occur when using a Jupyter Server or RStudio Server that uses the shared Jupyter/OOD node (i.e., outside of a Slurm job) because these sessions are limited to 8 GB of memory. Try running your Jupyter notebook or RStudio in a Slurm-based session and see if the problem persists.

Problem: Slurm-based OOD sessions never start

In some situations when starting a session that is Slurm-based (e.g., Jupyter server sessions that use the Savio partitions, RStudio sessions, or MATLAB sessions), the session starts to be created, but the user is never provided with the "Connect" button on the "Interactive Sessions" page and the session terminates in a minute or two. This can occur because of problems with your Slurm configuration or with your account's access to Savio software modules.

  • First, check that you can run regular Savio jobs (outside of OOD) using either srun or sbatch.
  • Second, check that you can load Savio software modules from within a terminal on a Savio login node. For example, check that you can run module load python without error and that your MODULEPATH environment variable looks like this:
    echo $MODULEPATH
    ## /global/software/rocky-8.x86_64/modfiles/langs:/global/software/rocky-8.x86_64/modfiles/tools:/global/software/rocky-8.x86_64/modfiles/compilers:/global/software/rocky-8.x86_64/modfiles/apps:/global/software/site/modfiles:/global/home/groups/consultsw/rocky-8.x86_64/modfiles
    module load python
    module list
    ##  1) python/3.11.6-gcc-11.4.0
    

If you have problems in either case, please contact us.
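Besides comparing MODULEPATH by eye, you can check that every directory it lists actually exists. This is a minimal sketch; the function name is hypothetical, and an empty result only shows that the directories are present, not that the list matches the expected one above.

```shell
# Hypothetical helper: print every entry of a colon-separated path list
# (default: $MODULEPATH) that is not an existing directory.
missing_modulepath_dirs() {
  path_list="${1-${MODULEPATH-}}"
  printf '%s\n' "$path_list" | tr ':' '\n' | while IFS= read -r dir; do
    if [ -n "$dir" ] && [ ! -d "$dir" ]; then
      echo "$dir"
    fi
  done
}

missing=$(missing_modulepath_dirs)
if [ -z "${MODULEPATH-}" ]; then
  echo "MODULEPATH is empty or unset"
elif [ -n "$missing" ]; then
  echo "MODULEPATH entries that do not exist:"
  echo "$missing"
else
  echo "all MODULEPATH entries exist"
fi
```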

Another possibility is that you have installed a version of some Python package (likely one that is Jupyter-related) in your ~/.local directory (via pip install --user) that is masking the system version of the same package and interfering with starting your session. Please try moving aside your ~/.local directory (e.g., mv .local .local.save in a terminal) and trying again to see if that allows your session to start.

Problem: OOD sessions on the shared node never start and report being in a "bad state"

This may be caused by instability of the shared node. Please try the following steps to delete your session and start over:

  1. Delete the job using the Delete button to the right.
  2. If that doesn't work, log out and back into the OOD web portal.
  3. If that doesn't work, delete the folder under ~/ondemand/data/sys/dashboard/batch_connect/db corresponding to the app having the "bad state" problem.
  4. If none of those steps work, please contact us.

Problem: OOD apps fail to start with an error message about rsync/file IO/disk quota

If you see this message when trying to start an OOD app,

rsync: close failed on "/global/home/users/smith/ondemand/data/sys/dashboard/batch_connect/sys/..." Disk quota exceeded (122)
rsync error: error in file IO (code 11) at ...

it indicates that you've exceeded your disk quota in your home directory. Please see this FAQ for information on how to reduce your usage.
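To see what is taking up space before cleaning up, you can list the largest items at the top level of your home directory. This is a minimal sketch using standard find and du; the function name is hypothetical, and the sizes it reports are in kilobytes and may not match exactly how the quota is accounted.

```shell
# Hypothetical helper: list the N largest top-level entries (hidden ones
# included) of a directory, sizes in kilobytes, largest first.
largest_entries() {
  dir="${1:-$HOME}"
  count="${2:-10}"
  find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + 2>/dev/null \
    | sort -rn | head -n "$count"
}

# Example: show the ten largest items in your home directory.
# largest_entries "$HOME" 10
```

Hidden directories such as ~/.local, ~/.conda, and ~/.cache are included in the listing, which a plain du -sk ~/* would miss.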

Problem: Accessing ood.brc.berkeley.edu gives a 'Bad Request' message

If you see this message when accessing ood.brc.berkeley.edu,

Bad Request
Your browser sent a request that this server could not understand.
Size of a request header field exceeds server limit.

try clearing the cookies in your browser. As a workaround, you might also try your browser's incognito mode or a different browser.

Problem: OOD reports a "Proxy Error"

We've seen this error occur in multiple contexts (sometimes as a "502 Proxy Error").

  • If you experience this error when you are trying to log in/authenticate to OOD (i.e., entering your CalNet credentials via CILogon), it may be caused by stale processes from your previous OOD sessions that are still running on the server that runs the OOD service. You can try to remove such processes as follows. Close any browser tabs that are accessing https://ood.brc.berkeley.edu. Then, in a plain SSH session, log in to Savio and connect to the server running the OOD service by invoking ssh ood. Use top or ps to find the process IDs of any nginx, Passenger, or ruby processes running under your username. Kill the processes (kill <pid>) and then try to log in at https://ood.brc.berkeley.edu again. If that doesn't help or you're not comfortable trying those steps, please contact us.
  • If the proxy error occurs when trying to start, access, or use an OOD app, it may be caused by problems with the scratch filesystem. If we have emailed users about problems on scratch or if there is a message about scratch problems on the Research IT/Savio status page, please try later when scratch is functioning properly again. If not, please let us know of the problem.
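
For the login case in the first item above, the search for stale processes can be sketched as follows. It is meant to be run on the OOD server (after ssh ood), the function name is hypothetical, and it only lists candidates; review the list before killing anything.

```shell
# Hypothetical helper: from "PID COMMAND ..." lines on standard input, print
# the lines whose command mentions nginx, Passenger, or ruby.
filter_ood_processes() {
  grep -Ei 'nginx|passenger|ruby' || true
}

# Capture your own processes first so that the grep above is not listed,
# then show the candidates. To stop one of them: kill <pid>
if procs=$(ps -u "$(id -un)" -o pid=,args=); then
  printf '%s\n' "$procs" | filter_ood_processes
fi
```

The first column of each printed line is the process ID to pass to kill.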

Problem (Deprecated): the OOD login pop-up box keeps reappearing

Note: this problem should no longer occur with changes made to the OOD login system during the switch to the Rocky 8 operating system in July 2024. If you see this problem, please let us know.

If you have trouble logging into OOD (in particular if the login pop-up box keeps reappearing after you enter your username and password), you may need to make sure you have completely exited out of other OOD sessions. This could include closing browser tab(s)/window(s), clearing your browser cache and clearing relevant cookies. You might also try running OOD in an incognito window (or if using Google Chrome, in a new user profile) or in a different browser (such as Google Chrome, Safari, or Firefox). For instructions on clearing your browser cache and cookies for the different browsers, see the links below:

General information for troubleshooting

Logs and scripts for each interactive session with Open OnDemand are stored in:

~/ondemand/data/sys/dashboard/batch_connect/sys

There are directories for each interactive app type within this directory. For example, to see the scripts and logs for an RStudio session, you might look at the files under:

~/ondemand/data/sys/dashboard/batch_connect/sys/brc_rstudio-compute/output/b5733507-a750-4bb9-8d4b-710618ce0de1

where b5733507-a750-4bb9-8d4b-710618ce0de1 corresponds to a specific session of an OOD app (the RStudio app in this case).

In particular, the main place to look for errors is output.log.
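Because each session gets its own directory, finding the right output.log can take some digging. The sketch below picks the most recently modified one; it assumes output.log sits directly inside each session directory, as in the example path above. The function name and the error pattern are hypothetical.

```shell
# Hypothetical helper: print the most recently modified output.log under the
# batch_connect/sys directory (default: the location given above).
latest_output_log() {
  base="${1:-$HOME/ondemand/data/sys/dashboard/batch_connect/sys}"
  newest=""
  for f in "$base"/*/output/*/output.log; do
    [ -f "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  if [ -n "$newest" ]; then
    printf '%s\n' "$newest"
  fi
}

log=$(latest_output_log)
if [ -n "$log" ]; then
  echo "Most recent session log: $log"
  # Show lines that look like errors, with line numbers; the pattern is a guess.
  grep -inE 'error|denied|quota|not found' "$log" || echo "no matching lines"
else
  echo "no output.log files found"
fi
```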

The BRC Open OnDemand interactive apps configuration is on GitHub. Additional information about Open OnDemand configuration is available on the Open OnDemand documentation.