Using Python on Savio¶

In addition to the extensive information provided here, you can also refer to the workshop on Python and Jupyter training that we presented in November 2021.

Loading Python and accessing Python packages¶

To access Python, you need to load in one of the modules that provides Python. There are a few options:

python/3.11.6-gcc-11.4.0 provides Python 3.11.6 with a variety of useful Python packages available.
This is suitable for work in Python when you don't need to use Conda environments or packages. However note that there is a limited set of Python packages (including numpy, scipy, pandas, jax, matplotlib) installed with this module.
anaconda3/2024.02-1-11.4 provides Python 3.11.7 with a wide variety of useful Python packages available. -This is what you want if you wish to use Conda environments and packages.
ml/pytorch/2.3.1-py3.11.7 provides PyTorch.
ml/tensorflow/2.15.0-py3.10.0 provides Tensorflow.
intelpython proides the Intel distribution of Python 3.9, including the mpi4py package linked to the Intel MPI library.

In addition some older versions are available. Run module avail to see them.

Warning

We recommend specifying the Python version explicitly for reproduciblity.

module load python/3.11.6-gcc-11.4.0

Warning

Python 2 has reached end of life status; we strongly recommend you transition to Python 3 if you are still using Python 2. However if you really need Python 2, you can create a Conda environment with Python 2 as the executable:

module load anaconda3
conda create -n py2 python=2.7

Once you have loaded the module of interest, you can determine what Python packages are provided by the system by entering either of the following:

pip list
conda list  # Does not work with the `python` modules, only `anaconda3` and PyTorch/Tensorflow modules.

When writing a job script file for use with sbatch, we recommend that you load the Python module (and any other needed modules) just after the comments that contain the Slurm flags, e.g.,

#SBATCH --partition=savio2
#SBATCH --ntasks=2
#SBATCH --time=00:00:30

module load python/3.11.6-gcc-11.4.0

Installing additional Python packages¶

Using pip¶

You can also install additional Python packages, such as those available via the Python Package Index (PyPI), that are not already provided on the system. You’ll need to install them into your home directory, your scratch directory, or your group directory (if any), rather than into system directories.

After loading Python, check whether some or all of the packages you need might already be available using conda list.

If not, you have two main options for installing packages.

First, you can use pip with the --user option to ensure they are installed in your home directory (by default in ~/.local):

pip install --user your_package_name

Conflicting modules and non-isolated environments when using pip

pip install --user will install packages into ~/.local/lib/python3.11 (or 3.10, 3.9, etc.). This means that you could end up using a package installed for one of the modules providing Python when using a different module if both modules provide the same version of Python (e.g., Python 3.10). Setting up a Conda environment is often a good strategy instead.

Using Conda¶

You can also use Conda to install packages, but to do so you need to install packages within a Conda environment. Here's how you can create an environment and install packages in it, in this case choosing to have the environment use Python 3.12 and to install numpy.

To use Conda, you'll need the anaconda3 module loaded.

module load anaconda3

To create a new environment:

conda create --name=your_env_name python=3.12 numpy

To access an existing environment:

source activate your_env_name
conda install scipy # install another package into the environment

Using the libmamba solver for more robust and faster package installations

It's fairly common that using conda to install a package takes a long time or fails with a report of incompatible package versions. We recommend using the libmamba "solver" instead, as it is generally faster and better at resolving dependencies. To use the solver, you can configure Conda as follows:

module load anaconda3
conda config --set solver libmamba

That will add the line solver: libmamba to your .condarc configuration file such that libmamba is always used.

Using the conda-forge channel to find packages

Conda packages are available from various "channels", namely locations or repositories of packages online. The conda-forge channel is a popular channel with a wide variety of up-to-date, community-supported packages. We recommend that you use conda-forge in general. To do so, you can configure Conda as follows:

module load anaconda3
conda config --add channels conda-forge
conda config --set channel_priority strict

That will add lines to to your .condarc configuration file such that conda-forge is used when possible.

Use caution with conda init

We recommend caution if activating Conda environments using conda activate despite the fact that running source activate will give a deprecation warning. conda activate prompts the user to run conda init. conda init results in changes to your shell configuration (by modifying your .bashrc) that can affect the behavior of your shell more generally (as does mamba init).

If you do run conda init, thereby modifying your .bashrc), we strongly recommend preventing Conda from activating the base environment when a new shell starts (e.g., when logging in) as this activation can slow down the login process and experience difficulties when the filesystem is responding slowly (e.g., potentially preventing you from logging in at all). To ensure this does not happen, after you run conda init, run the following:

conda config --set auto_activate_base False

Activate your environment within your job code

We recommend that when running code within an interactive (srun) session or an sbatch job that you activate your environment after starting the job (i.e., after getting connected via srun or within the commands in your sbatch job script). When an environment is activated, various shell environment variables are set, so activating within the context of the job will ensure those variables are set properly.

To leave an environment:

source deactivate

Using pip within a Conda environment is tricky

You can use pip within a Conda environment, but this can cause issues because conda doesn't consider pip-installed packages when installing additional packages, so proceed with caution as follows:

Use pip only after conda
If Conda changes are needed after using pip, create a new environment
Don't use --user when calling pip install
Always run pip with --upgrade-strategy only-if-needed (the default)

Using Conda channels¶

Channels are locations where packages are stored. By default you'll get packages from the default Anaconda channel.

In some cases a package my only be available from a specific channel. For example to install the diphase package from the bioconda channel, you'd do:

conda install -c bioconda diphase

You can also set the channels you want used whenever you use Conda (and the prioritization of them) by configuring Conda (or directly modifying your ~/.condarc file).

The conda-forge channel is a very popular community channel that provides a large variety of up-to-date package versions. We can make conda-forge the default channel (and enforce that it will always be used when it provides a given package) like this:

conda config --add channels conda-forge
conda config --set channel_priority strict

Isolating your Conda environment¶

There are a few steps that can help to isolate your Conda environment from the system Python and packages installed on the system or installed by you outside the environment.

Specify the Python version specifically when creating the environment (e.g., python=3.12 above). This ensures that the Python executable will be part of the environment rather than using a Python executable from the system.
Don't use --user when calling pip install inside your environment. If you use --user the package will be installed in ~/.local rather than in the environment.
Note that if you have installed packages via pip install --user previously outside the environment (i.e., into ~/.local), the Conda environment will try to use those packages. You may want to avoid having packages installed in ~/.local.

Reproducibility¶

You can create configuration files that store the state of your installed packages or environment and can be subsequently used to recreate that state.

## When using pip
pip freeze > requirements.txt
## When using conda
conda env export > environment.yml

Installing packages outside your home directory¶

You may want to install Python packages or Conda environments outside your home directory, particularly if you bumping up against the quota in your home directory.

Warning

Also note that Conda caches package source files and files shared across environments in ~/.conda/pkgs for use if you later install the same package again. (Note that the ~/.conda/envs directory is the default location where conda environments that you have created are stored.) This can contribute to running out of space in your home directory. In addition to the strategy of putting your .conda directory in scratch (discussed below), you may also want to remove uneeded files from the pkgs directory by running conda clean --all.

Pip¶

For packages installed via pip, which are installed by default in ~/.local (e.g., in ~/.local/lib/python3.11/site-packages when using Python 3.11), one strategy is to locate your .local directory in your scratch directly and symlink to it from your home directory:

cp -pr ~/.local /global/scratch/users/$USER/.local
rm -rf ~/.local
ln -s /global/scratch/users/$USER/.local ~/.local

Alternatively you can install packages individually in some other directory:

pip install --prefix=/path/to/directory package_name

and then modify PYTHONPATH to include /path/to/directory.

Conda¶

You can also use the symlink approach discussed above with ~/.conda, so that Conda environments are stored in scratch. Or you could do this just for ~/.conda/pkgs to have the directory containing cached package source files not take up space in your home directory.

And you can create Conda environments that are stored in some other directory:

SOME_DIR=/global/scratch/users/$USER/your_directory
module load anaconda3
conda create -p ${SOME_DIR}/test 
source activate ${SOME_DIR}/test
conda install biopython
source deactivate

Running Python interactively¶

Using `srun` to run Python on the command line¶

To use Python interactively on Savio's compute nodes, you can use srun to submit an interactive job.

Once you're working on a compute node, you can then load the Python module and start Python.

We recommend use of IPython for interactive work with Python. To start IPython for command line use, simply enter ipython.

IPython provides tab completion, easy access to help information, useful aliases, and lots more. Type ? at the IPython command line for more details.

Using Open OnDemand to run Python in a Jupyter notebook¶

We now provide access to Jupyter notebooks via Open OnDemand. Notebooks can be run on compute nodes in any of the Savio partitions or on a standalone server for testing and exploration purposes.

Running Python code on a GPU¶

To run Python jobs on one or more GPUs, you'll need to request access to the GPU(s) by including the --gres=gpu:x flag to sbatch or srun, where x is the number of GPUs you need, following our example GPU job script.

Perhaps the most common use of Python with GPUs is via machine learning and other software that moves computation onto the GPU but hides the details of the GPU use from the user. Packages such as TensorFlow and PyTorch can operate in this way.

PyTorch¶

PyTorch is usually easy to install with Pip or Conda/Mamba, but it is available system-wide via the ml/pytorch modules.

To check if you are able to run PyTorch calculations on the GPU, run this code:

# Check GPU availability:
import torch
torch.cuda.is_available()

# Manually place operations on GPU:
print(torch.rand(2,3).cuda())

Tensorflow¶

We have Tensorflow installed system-wide via the ml/tensorflow modules.

Tensorflow can be tricky to install such that it will seamlessly use the GPU -- if you would like help installing Tensorflow yourself (e.g., in a Conda environment) for use on the GPU, please contact us and we can help you with the steps needed.

To check if you are able to run Tensorflow calculations on the GPU, here is some basic testing code for Tensorflow:

# Check GPU availability:
import tensorflow as tf
tf.config.list_physical_devices('GPU')

# Find out which device your operations and tensors are assigned to:
tf.debugging.set_log_device_placement(True)
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

# Manually place operations on a GPU:
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

Monitoring GPU usage¶

To check on the current usage (and hence availability) of each of the GPUs on your GPU node, you can run nvidia-smi from the Linux shell within an interactive session on that GPU node. Near the end of that command's output, the "Processes: GPU Memory" table will list the GPUs currently in use, if any. For example, in a scenario where GPUs 0 and 1 are in use on your GPU node, you'll see something like the following. (By implication from the output below, GPUs 2 and 3 are currently not in use, and thus fully available, on this node.)

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|===================================================================================|
| 0 32699 C python 2462MiB |
| 1 32710 C python 2108MiB |
====================================================================================|

To determine which line reflects your usage, you'd need to determine which of the process IDs (here 32699 and 32710) is the process associated with your job, e.g., using top or ps.

Parallel processing in Python on Savio¶

While there are many, many packages for parallelization in Python, we’ll focus on a few widely-used ones: ipyparallel, Dask, and Ray. For more complicated kinds of parallelism, you may want to use MPI within Python via the mpi4py package, discussed below.

Threaded linear algebra¶

The numpy and scipy packages included with Savio's Python installations are linked to the OpenBLAS library when using the python module and the MKL library when using the anaconda3 module. These provides fast, threaded versions of core linear algebra functions. You don't have to do anything -- any linear algebra that you run (either directly or that gets called by code that you run) will run in parallel on multiple threads, limited only by the number of cores available to your Slurm job on the node you are running on.

You can verify the linear algebra library and number of threads being used like this:

import numpy
import threadpoolctl
print(threadpoolctl.threadpool_info())

However, there are some circumstances under which you might want to set the number of threads yourself. These include:

If you are running a hybrid parallelization in which individual threaded linear algebra operations are being run in parallel (e.g., using linear algebra nested within some other parallelized work, such as with ipyparallel as seen in the next section). Here you probably want to set the number of threads equal to the number of cores available divided by the number of parallel workers running.
Some (often smaller) problems may not run faster with more and more threads (or could possibly even run more slowly).

You can control the number of threads by setting the OMP_NUM_THREADS (for OpenBLAS) or MKL_NUM_THREADS (for MKL) enviroment variables, either in the shell before starting Python or within numpy via os.environ (e.g., os.environ['MKL_NUM_THREADS']='4') before importing numpy.

Single-node threading only

Threading only works on individual nodes. Any linear algebra will only run in parallel on the cores available on one node.

OpenBLAS versus MKL

If performance of your linear algebra is critical, you might want to compare performance with OpenBLAS (using the python module) to MLK (using the anaconda3 module), as some performance comparisons suggest MKL may be faster in some situations.

Using `ipyparallel`¶

The ipyparallel package allows you to parallelize independent computations across multiple cores on one or more nodes. To use it on multiple nodes, you'll need to start up the worker processes yourself, as discussed below.

To use Python in parallel, we need to request resources on one or more nodes, start up ipyparallel worker processes, and then run our calculations in Python.

Single node parallelization¶

As of version 7 of ipyparallel, one can start the worker processes from within Python. We recommend this approach, so we'll describe it first, covering other approaches in later sections.

Here is an example job script.

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Wall clock limit:
#SBATCH --time=00:30:00
#
## Command(s) to run (example):
module load python/3.11.6-gcc-11.4.0
ipython job.py > job.pyout

Then in Python code in the job.py file we'll start up our worker processes and do our computations, as illustrated next.

First, get an object 'handle' (called c here) to the cluster of workers.

import os
import ipyparallel as ipp
ipp.__version__   # check that at least version 7
mycluster = ipp.Cluster(n = int(os.getenv('SLURM_CPUS_ON_NODE')))
c = mycluster.start_and_connect_sync()

c.ids

Here's a simple example of using a direct view (dispatching computational tasks in a simple way to the workers) to run a function on all the workers.

dview = c[:]
# Cause execution on main process to wait while tasks sent to workers finish
dview.block = True   
dview.apply(lambda : "Hello, World")

Tip

Alternatively (if not being run in an Open OnDemand-based Jupyter notebook), you can use $SLURM_NTASKS or SLURM_CPUS_PER_TASK as appropriate based on your job's Slurm flags. You can also use some other number of workers, as desired, though one would generally not want to use more workers than cores available on the node.

An example parallel computation¶

Let's set up an example calculation. Suppose we want to do leave-one-out cross-validation of a random forest statistical model. We define a function that fits to all but one observation and predicts for that observation.

def looFit(index, Ylocal, Xlocal):
    rf = rfr(n_estimators=100)
    fitted = rf.fit(np.delete(Xlocal, index, axis = 0), np.delete(Ylocal, index))
    pred = rf.predict(np.array([Xlocal[index, :]]))
    return(pred[0])

Now use a direct view to load packages on the workers and broadcast data structures to all the workers using push.

dview.execute('from sklearn.ensemble import RandomForestRegressor as rfr')
dview.execute('import numpy as np')
# assume predictors are in 2-d array X and outcomes in 1-d array Y
# here we generate random arrays for X and Y
# we need to broadcast those data objects to the workers
import numpy as np
X = np.random.random((200,5))
Y = np.random.random(200)
mydict = dict(X = X, Y = Y, looFit = looFit)
dview.push(mydict)

Next we'll use a load-balanced view (sequentially dispatching computational tasks as earlier computational tasks finish) to run the code across all the observations in parallel using the map operation.

# Need a wrapper function because map() only operates on one argument
def wrapper(i):
    return(looFit(i, Y, X))

n = len(Y)
import time
time.time()
# Run a parallel map, executing the wrapper function on indices 0,...,n-1
lview = c.load_balanced_view()
# Cause execution on main process to wait while tasks sent to workers finish
lview.block = True 
pred = lview.map(wrapper, range(n))   # Run calculation in parallel
time.time()
print(pred[0:10])

Single node parallelization - starting workers outside Python¶

One can also start up the workers outside of Python and then connect to the workers from within Python. Here's an example job script that starts the Python worker processes.

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Request one node:
#SBATCH --nodes=1
#
# Wall clock limit:
#SBATCH --time=00:30:00
#
## Command(s) to run (example):
module load python/3.11.6-gcc-11.4.0
ipcluster start -n $SLURM_CPUS_ON_NODE &    # Start worker processes
ipython job.py > job.pyout
ipcluster stop

Then within Python, we need to get an object 'handle' to the cluster of workers, making sure to pause while the workers start.

from ipyparallel import Client
c = Client()
c.wait_for_engines(n = int(os.getenv('SLURM_CPUS_ON_NODE')))
c.ids

Then proceed using the c object as discussed in the previous section.

Multi-node parallelization¶

To run ipyparallel workers across multiple nodes, you need to modify the #SBATCH options in your submission script and the commands used to start up the worker processes at the end of that script, but your Python code can stay the same.

Here’s an example submission script, with the syntax for starting the workers:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Total number of tasks
#SBATCH --ntasks=48
#
# Wall clock limit:
#SBATCH --time=00:30:00
#
## Command(s) to run (example):
module load python/3.11.6-gcc-11.4.0
# Start the controller process:
ipcontroller --ip='*' &
sleep 50                   # Wait for controller to start
srun ipengine &            # Start as many worker processes as we have Slurm tasks
ipython job.py > job.pyout
ipcluster stop

Your Python code in job.py can follow the same approach as discussed in the previous section, though you'll need to adjust the value of the wait_for_engines argument to match the number of workers started.

Parallelization in a Jupyter notebook¶

You can use ipyparallel functionality within a Jupyter notebook.

Using Dask¶

The Dask package provides a variety of tools for managing parallel computations. In addition to providing tools that allow you to parallelize independent computations as discussed above using ipyparallel, Dask also allows you to run computations across datasets in parallel using distributed data objects.

Tip

Dask is available in the anaconda3/2024.02-1-11.4 module.

Some of the key ideas/features are that users can:

separate what to parallelize from how and where the parallelization is actually carried out,
run the same code on different computational resources (without touching the actual code that does the computation), and
use Dask's distributed data structures that can be treated as a single data structure when running operations on them (like Spark).

First you'll need to request one or more nodes via Slurm using sbatch or srun.

Then, to run Dask on multiple cores on a single node, you can use any of the 'threads', 'processes', or 'distributed' schedulers, simply setting the number of workers (four in the examples below). Generally you'd want to use only as many workers as you have cores available.

import dask

# Threads scheduler
dask.config.set(scheduler='threads', num_workers = 4)

# Processes scheduler
import dask.multiprocessing
dask.config.set(scheduler='processes', num_workers = 4)  

# Distributed scheduler 
# Fine to use this on a single node and it provides some nice functionality
# If you experience issues with worker memory then try the processes scheduler
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers = 4)
c = Client(cluster)

To run Dask across cores on multiple nodes, you can use dask-scheduler and dask-worker to start up the necessary Python processes (an alternative is to use dask-ssh. Here's an example that uses the Slurm srun command with dask-worker to make the connections to the nodes in the Slurm allocation:

export SCHED=$(hostname):8786
# Start scheduler process
dask-scheduler &
sleep 50    # might need even more time to ensure scheduler starts up fully
# Start worker processes
srun dask-worker tcp://${SCHED} --local-directory /tmp &
sleep 100   # might need even more time to make sure workers start up fully

The srun command here (which is of course being run within another sbatch or srun invocation that you ran to get access to the compute nodes) will run dask-worker on as many workers as the number of Slurm 'tasks', across the allocated nodes.

The use of --local-directory is because we've seen problems when the local directory for Dask is a Savio user home directory.

Then in Python, connect to the cluster via the scheduler.

import os, time, dask
from dask.distributed import Client
c = Client(address = os.getenv('SCHED'))

Using Ray¶

Ray provides a variety of tools for managing parallel computations. At its simplest, it allows you to parallelize independent computations across multiple cores on one or more machines. One nice feature relative to Dask is that Ray allows one to share data across all worker processes on a node, without multiple copies of the data, using the object store. At its more complicated, Ray provides tools to build distributed (across multiple cores or nodes) applications where different processes interact with each other (using the notion of 'actors').

Tip

You'll need to install the ray package yourself. It's not provided with any of the modules that provide Python.

First you'll need to request one or more nodes via Slurm using sbatch or srun.

Then, to run Ray on multiple cores on a single node, you can initialize Ray from within Python.

import ray
ray.init()
## alternatively, to specify a specific number of cores:
ray.init(num_cpus = 4)

To run Ray across cores on multiple nodes, see these Ray instructions for Slurm.

Using MPI with Python¶

You can use the mpi4py package to provide MPI functionality within your Python code. This allows individual Python processes to communicate amongst each other to carry out a computation. We won’t go into the details of writing mpi4py code, but will simply show how to set things up so an mpi4py-based job will run.

Here’s an example job script, illustrating how to load the needed modules and start the Python job:

#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Account:
#SBATCH --account=account_name
#
# Partition:
#SBATCH --partition=partition_name
#
# Total number of tasks
#SBATCH --ntasks=48
#
# Wall clock limit:
#SBATCH --time=00:30:00
#
## Command(s) to run (example):
module load gcc/11.4.0 openmpi python/3.11.6-gcc-11.4.0 
mpirun python mpiCode.py > mpiCode.pyout

Here’s a brief ‘hello, world’ example of using mpi4py. Note that this particular example requires that you load the numpy module in your job script, as shown above. Also note that this example does not demonstrate the key motivation for using mpi4py, which is to enable more complicated inter-process communication:

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD

# simple print out Rank & Size
id = comm.Get_rank()
print("Of ", comm.Get_size() , " workers, I am number " , id, ".")

# function to be run many times
def f(id, n):
    np.random.seed(id)
    return np.mean(np.random.normal(0, 1, n))

n = 1000000
# execute on each worker process
result = f(id, n)

# gather results to single ‘main’ process and print output
output = comm.gather(result, root = 0)
if id == 0:
    print(output)

Using Python on Savio¶

Loading Python and accessing Python packages¶

Installing additional Python packages¶

Using pip¶

Using Conda¶

Using Conda channels¶

Isolating your Conda environment¶

Reproducibility¶

Installing packages outside your home directory¶

Pip¶

Conda¶

Running Python interactively¶

Using srun to run Python on the command line¶

Using Open OnDemand to run Python in a Jupyter notebook¶

Running Python code on a GPU¶

PyTorch¶

Tensorflow¶

Monitoring GPU usage¶

Parallel processing in Python on Savio¶

Threaded linear algebra¶

Using ipyparallel¶

Single node parallelization¶

An example parallel computation¶

Single node parallelization - starting workers outside Python¶

Multi-node parallelization¶

Parallelization in a Jupyter notebook¶

Using Dask¶

Using Ray¶

Using MPI with Python¶

Using `srun` to run Python on the command line¶

Using `ipyparallel`¶