Running MATLAB jobs across multiple nodes
The MATLAB Distributed Computing Server (DCS) toolbox allows one to run MATLAB jobs across multiple nodes. There are a few things you need to do to get it set up, but once you do, the parallel code you use in MATLAB will be the same as discussed in the previous section.
One important limitation of this functionality is that the campus license only provides Savio with 32 MATLAB DCS licenses. This means that across all Savio users, only 32 MATLAB workers can be running at once. Even if you are using all the licenses, 32 workers is not much more than the number of cores on a single node on most Savio partitions. However, each worker can use multiple threads, so a MATLAB DCS job can potentially use more than 32 cores across multiple nodes. For additional notes and slides from a workshop on MATLAB DCS held in August 2018, please see this GitHub repo.
One-time setup
In order to configure MATLAB to operate across multiple nodes you need to start MATLAB (it’s simplest to just start MATLAB on a login node for this) and run the command
configCluster
You only need to do this once; the settings are saved for future sessions when you log in to Savio and use MATLAB.
License usage
Please be sure to use the -L flag when submitting your job (see details below), and make sure that the number of licenses requested via the -L flag equals the number of MATLAB workers you will start, which will generally be the number of SLURM tasks or the number of SLURM tasks minus 1. If you don’t do this, your job will check out licenses without SLURM knowing about it, and any other user who tries to use this functionality may find that their code hangs without explanation when it tries to check out licenses that are not actually available.
Because of the license limitation, if other users are using licenses, you may not be able to run your job until enough licenses become free. In general, your SLURM submission should queue until enough licenses are available, but limitations in the license management process (or other users failing to use the -L flag) may occasionally result in MATLAB itself hanging when your MATLAB code attempts to check out licenses. So please keep an eye on your log files when using this functionality to avoid a long-running job that fails to do any computation.
To see how many licenses are in use at any time, you can run the following on either a login or compute node:
scontrol show licenses
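If you want to use that output in a script, one option is to extract the free-license count before submitting. The sketch below assumes that `scontrol show licenses` prints a `LicenseName=mdcs` record followed by fields such as `Total=`, `Used=`, and `Free=`; the exact layout can differ between SLURM versions. `free_licenses` is a hypothetical helper name, not part of SLURM or Savio.

```shell
# Hypothetical helper: read `scontrol show licenses`-style text on stdin and
# print the Free= count for the license named in $1.
# Assumption: each record starts with "LicenseName=<name>" and a "Free=<n>"
# field appears on that line or on a following line.
free_licenses() {
    awk -v name="$1" '
        $1 == ("LicenseName=" name) { found = 1 }
        found && match($0, /Free=[0-9]+/) {
            # "Free=" is 5 characters long; print only the digits after it.
            print substr($0, RSTART + 5, RLENGTH - 5)
            exit
        }
    '
}

# Assumed usage on Savio:
#   scontrol show licenses | free_licenses mdcs
```

If the number printed is smaller than the number of workers you plan to start, your job will likely wait in the queue until licenses are released.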
Batch use (i.e., sbatch)
Here are the pieces you need in your sbatch script:
- Request the number of licenses equal to the number of MATLAB workers you plan to use (up to the limit of 32 imposed by our license) in your SLURM job script (in the example here we request 28 licenses):
#SBATCH --licenses=mdcs:28
- Make sure to specify -n or --ntasks (because the maxPoolSize function mentioned below queries SLURM_NTASKS):
#SBATCH --ntasks=28
- Include this line in your script before starting MATLAB:
export MDCE_OVERRIDE_EXTERNAL_HOSTNAME=$(/bin/hostname -f)
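Put together, the pieces above might look like the following job script. This is only a sketch: the account, partition, and time limit are borrowed from the srun example on this page, while the `module load matlab` line, the job name, and the file `my_script.m` are assumptions that you should adapt to your own setup. The script is written out with a here-document so that you can paste the whole block into a shell.

```shell
# Write a sketch of a complete job script to matlab_dcs_job.sh.
cat > matlab_dcs_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=matlab-dcs
#SBATCH --account=fc_smith
#SBATCH --partition=savio2
#SBATCH --ntasks=28
#SBATCH --licenses=mdcs:28
#SBATCH --time=00:30:00

# Assumption: MATLAB is made available through an environment module.
module load matlab

# Set the hostname override before starting MATLAB, as described above.
export MDCE_OVERRIDE_EXTERNAL_HOSTNAME=$(/bin/hostname -f)

# my_script.m is a hypothetical file containing your parpool code.
matlab -nodisplay -nosplash < my_script.m
EOF

# Submit with: sbatch matlab_dcs_job.sh
```

Here the license count equals the task count, which fits MATLAB code that starts one worker per task; request one fewer license if you keep a core for the master process.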
Interactive use (i.e., srun)
When invoking srun, you’ll need to do the following so that your interactive session can properly use MATLAB DCS.
- Your srun command should request the number of licenses equal to the number of MATLAB workers you plan to use (up to the limit of 32 imposed by our license) and should also specify the number of tasks, which will generally be equal to the number of workers or one more than the number of workers (see next section). For example, for 28 MATLAB workers:
srun -A fc_smith -p savio2 -L mdcs:28 -n 28 -t 30:00 --pty bash
- Run the following once you have a shell on the compute node:
export MDCE_OVERRIDE_EXTERNAL_HOSTNAME=$(/bin/hostname -f)
One can also run the command setLocalHostName from within MATLAB instead of Step 2 above.
MATLAB code for using DCS - basic parpool example
Here’s how you set up your parallel pool of workers in your MATLAB code so that you can use workers across multiple nodes.
c = parcluster('savio');
mps = maxPoolSize(); % this will equal the number of SLURM tasks
% MATLAB recommends one fewer worker than the number of cores available, but your job also may work with a single core being shared by the master process and a worker
p = c.parpool(mps-1);
parfor ….
…
p.delete;
In the above code we have one of the cores for the master process and the remainder for the workers, as recommended by MATLAB. So actually in the sbatch/srun example code from the previous sections, if --ntasks were 28, then we’d only need to ask for 27 licenses. However, your job may run fine with a single core shared by the master process and a worker, in which case you would have c.parpool(mps) and would then ask for 28 licenses.
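The license arithmetic described here can be sketched as a small shell helper that you might call before writing your sbatch or srun request. The function name `mdcs_licenses` and its yes/no argument are hypothetical; the only facts taken from this page are the 32-license cap and the "tasks minus one" rule.

```shell
# Hypothetical helper: number of mdcs licenses to request for a job.
#   $1 = number of SLURM tasks (--ntasks)
#   $2 = "yes" to keep one task free for the master process, anything else
#        to share a core between the master process and a worker
mdcs_licenses() {
    ntasks=$1
    reserve_master=$2
    if [ "$reserve_master" = "yes" ]; then
        workers=$((ntasks - 1))
    else
        workers=$ntasks
    fi
    # The campus license caps Savio at 32 workers in total.
    if [ "$workers" -gt 32 ]; then
        echo "requested $workers workers, but only 32 licenses exist" >&2
        return 1
    fi
    echo "$workers"
}

mdcs_licenses 28 yes   # prints 27, matching c.parpool(mps-1)
mdcs_licenses 28 no    # prints 28, matching c.parpool(mps)
```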
Other parallel MATLAB code will be different, but in all cases you need to make sure that you specify the ‘savio’ parallel profile and that you only start as many workers as the number of licenses you requested.
MATLAB code for using DCS - parpool with multiple threads per worker
Each MATLAB worker can make use of multiple threads (i.e., cores). To enable this, make sure to set -c (--cpus-per-task) to a number greater than 1 in your SLURM submission. Then your MATLAB code would look like the following:
c = parcluster('savio');
c.NumThreads = threadCount(); % this will be set to the same as the value of SLURM_CPUS_PER_TASK
mps = maxPoolSize(); % this will equal the number of SLURM tasks
% MATLAB recommends one fewer worker than the number of cores available, but your job may work with a single core being shared by the master process and a worker; here we don’t subtract one because it would result in setting aside multiple cores for the master, which seems excessive
p = c.parpool(mps);
parfor ….
…
p.delete;
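For the matching resource request, a rough sketch: with one worker per SLURM task and several threads per worker, the license count follows the number of tasks while the number of cores is tasks times threads. The helper name `dcs_threaded_request` is hypothetical, and the values 28 and 2 are only illustrative.

```shell
# Hypothetical helper: print SLURM flags for a multi-threaded DCS job.
#   $1 = number of MATLAB workers (one per SLURM task)
#   $2 = threads per worker (SLURM --cpus-per-task)
dcs_threaded_request() {
    workers=$1
    threads=$2
    # Licenses are counted per worker, not per core.
    echo "--ntasks=${workers} --cpus-per-task=${threads} --licenses=mdcs:${workers}"
    echo "total cores: $((workers * threads))"
}

dcs_threaded_request 28 2
# prints:
#   --ntasks=28 --cpus-per-task=2 --licenses=mdcs:28
#   total cores: 56
```

The printed flags can be added to an srun command line or turned into #SBATCH lines in a job script.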