mlp cluster
Author: Hao Tang (hao.tang@ed.ac.uk)
Date: 21 Oct, 2025
mlp
Try the following command and log into mlp.
ssh <UUN>@mlp -J <UUN>@student.ssh.inf.ed.ac.uk
The -J argument specifies a jump host. It is necessary when you are
not on the School of Informatics network.
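If you log in often, you can put the jump into your ~/.ssh/config so that a plain ssh mlp works on its own. Below is a minimal sketch; it assumes the head node is reachable as mlp.inf.ed.ac.uk (the alias mentioned in the login banner below) and uses the student jump host from above.
Host mlp
    HostName mlp.inf.ed.ac.uk
    User <UUN>
    ProxyJump <UUN>@student.ssh.inf.ed.ac.uk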
Below is what I saw when I logged into mlp.
[haotang@vaio ~]$ ssh htang2@mlp -J htang2@staff.ssh.inf.ed.ac.uk
(htang2@staff.ssh.inf.ed.ac.uk) Password:
(htang2@mlp) Password:
Welcome to Ubuntu 24.04.3 LTS (GNU/Linux 6.8.0-57-generic x86_64)

This is a cluster head node please do not run compute intensive processes here
this node is intended to provide an interface to the cluster only

Please nice any long running processes

If you are running interactive jobs via srun please do do so for longer than
24 hours. Specifically do not leave them to idle as they may be killed, other
people may need to use the GPU you are blocking

-------------------------------------------------------------------------------
Looking for tips? https://computing.help.inf.ed.ac.uk/cluster-tips

Jan 2025: Note that some GRES have changed to swap dashes to underscores.

**************************************************************************
We have scheduled the final set of nodes to upgrade for week commencing
17 August and we will be switching over the head nodes to new hardware
running noble starting with uhtred (currnetly aliased as mlp.inf.ed.ac.uk
as of 9am Monday.
**********************************************updated: 2025-05-15 10:00 **
Last login: Thu Oct 9 20:24:49 2025 from porthos.inf.ed.ac.uk
[htang2@hastings ~]$
The first thing to note is the hostname hastings,
the name of the head node.
The head node is responsible for taking commands and submitting
them to one of the compute nodes.
Once we are on the head node, we can run the following command to see the list of machines and partitions.
sinfo
Below is the output of sinfo.
We can see that a partition is a set of nodes,
and a node is a physical machine.
[htang2@hastings ~]$ sinfo
PARTITION          AVAIL  TIMELIMIT   NODES  STATE  NODELIST
SRF-Teaching       up     2:00:00     1      idle   saxa
SRF-Reserach       up     10-00:00:0  7      mix    herman,scotia[01-03,05-06,08]
SRF-Reserach       up     10-00:00:0  1      alloc  scotia07
SRF-Reserach       up     10-00:00:0  1      idle   scotia04
Teach-Interactive  up     2:00:00     1      down*  landonia01
Teach-Interactive  up     2:00:00     1      idle   landonia03
Teach-Standard*    up     3-08:00:00  2      down*  landonia[09,19]
Teach-Standard*    up     3-08:00:00  7      idle   landonia[02,04-06,08,21,25]
Teach-Short        drain  4:00:00     0      n/a
Teach-LongJobs     up     3-08:00:00  1      down*  landonia22
Teach-LongJobs     up     3-08:00:00  1      drain  landonia23
Teach-LongJobs     up     3-08:00:00  3      idle   landonia[21,24],saxa
General-Usage      up     3-08:00:00  1      down*  meme
General-Usage      up     3-08:00:00  1      down   letha06
PGR-Standard       up     5-00:00:00  1      inval  damnii10
PGR-Standard       up     5-00:00:00  1      down*  damnii08
PGR-Standard       up     5-00:00:00  13     mix    crannog[02-07],herman,scotia[01-03,05-06,08]
PGR-Standard       up     5-00:00:00  2      alloc  crannog01,scotia07
PGR-Standard       up     5-00:00:00  11     idle   damnii[01-07,09,11-12],scotia04
ILCC-Standard      up     10-00:00:0  1      mix    duflo
ILCC-Standard      up     10-00:00:0  5      idle   barre,greider,levi,mcclintock,nuesslein
ILCC-CDT           up     10-00:00:0  1      mix    strickland
ILCC-CDT           up     10-00:00:0  1      idle   arnold
The two most important partitions are Teach-Interactive and
Teach-Standard, and I will say more about them in the
rest of the tutorial.
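If you want to see which GPUs each node in a partition offers, sinfo can print the GRES column. The format string below uses the standard specifiers %N (node name), %G (generic resources), and %T (state); the partition name is just an example.
sinfo -p Teach-Standard -N -o "%N %G %T"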
Let's run our first job.
srun -p Teach-Interactive --pty bash
This job is literally the shell bash itself.
Once you run it, it is as if you ssh-ed into a compute
node.
Below is the output I got from running the command.
[htang2@hastings ~]$ srun -p Teach-Interactive --pty bash
[htang2@landonia03 ~]$
In this case, I'm on landonia03 and I can make use
of the compute resources on it.
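By default you only get a small slice of the node. If you need more, srun accepts the usual Slurm resource flags; the numbers below are placeholders, and the actual limits depend on how the cluster is configured.
srun -p Teach-Interactive --cpus-per-task=2 --mem=8G --pty bash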
Remember to run the following command when you are done.
exit
Below is what I got after running exit. You should be back on
the head node (in this case hastings).
[htang2@landonia03 ~]$ exit
exit
[htang2@hastings ~]$
Before we move on, it is worth talking about how files are stored on the cluster. A job cannot do much if it cannot access the files it needs from the compute node.
When we are on the head node, the home directory (i.e.,
/home/<UUN>) is on NFS.
The nice thing about NFS is that
all compute nodes can access files on the NFS,
including your home directory.
The downside of NFS is that it is terribly slow.
Fortunately, every compute node also has a scratch space, located
at /disk/scratch.
Access to the scratch space is a lot faster than access to the home directory,
because it is simply a local disk.
The common pattern is to move the data to /disk/scratch before the compute
happens, and move the results back to the home directory after the compute
is done.
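For example, the pattern might look like the following on a compute node. The per-user directory under /disk/scratch and the names data and output are placeholders for illustration.
mkdir -p /disk/scratch/<UUN>
cp -r ~/data /disk/scratch/<UUN>/data
# ... run the compute, writing results under /disk/scratch/<UUN>/output ...
cp -r /disk/scratch/<UUN>/output ~/output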
Moving data to and from NFS does not scale linearly in the number of files. It is much more efficient to move a big file than to move many small files.
If your data set consists of many small files, pack them into a single archive before copying it to /disk/scratch.
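For example, assuming the data sits in a directory called data (again a placeholder), you can pack it on the head node and unpack it on the scratch disk.
tar cf data.tar data/
cp data.tar /disk/scratch/<UUN>/
cd /disk/scratch/<UUN> && tar xf data.tar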
Note that the NFS is not backed up. Make sure you back up the code and the results whenever possible.
Your home directory on AFS, /afs/inf.ed.ac.uk/user/<prefix>/<UUN>,
is still accessible from mlp.
It is backed up daily, so put your precious data over there.
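For example, to copy your code and results over (the directory names are placeholders, and this assumes rsync is available on the head node):
rsync -a ~/code ~/results /afs/inf.ed.ac.uk/user/<prefix>/<UUN>/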
I assume your main use would be to run and to train neural networks with PyTorch.
For anything you do with GPUs, you will need to have access to the CUDA library.
A simple way to check whether CUDA is properly installed is
to run nvcc (the NVIDIA compiler).
nvcc --version
Unfortunately, you will see the following error message.
[htang2@hastings ~]$ nvcc --version
Sorry, 'nvcc' was not found. Perhaps you meant:
 - nvlc
One solution is to use conda and install your own CUDA, but we would end up with hundreds of copies of the same CUDA library on the cluster. I have built a simple toolchain to make everybody's life easier.
. /home/htang2/toolchain-20251006/toolchain.rc
Now if you run nvcc --version, you will see the following.
[htang2@hastings ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
To install PyTorch, we first create a virtual environment and activate it.
python -m venv venv
. venv/bin/activate
The virtual environment of course needs to be on the NFS so that the compute nodes can use it. You will also get to experience how slow the NFS is when you create this virtual environment.
PyTorch can be installed with the following commands.
Don't forget to install numpy.
In case you wonder, these two packages are specifically
compiled for the cluster, and you might find that they run
faster than the default ones you get with a regular pip install.
pip install /home/htang2/toolchain-20251006/whl/numpy-2.2.3-cp312-cp312-linux_x86_64.whl
pip install /home/htang2/toolchain-20251006/whl/torch-2.8.0a0+gitunknown-cp312-cp312-linux_x86_64.whl
After doing all the above, you will get CUDA version 12.8 and PyTorch 2.8.0.
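A quick way to check that the installation worked is to print the versions from inside the activated virtual environment. (On the head node torch.cuda.is_available() will report False because there is no GPU there; that is expected.)
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"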
We are now ready to run a job that makes use of a GPU. Let's run the following interactive job.
srun -p Teach-Interactive --gres=gpu:1 --pty bash
Note the additional argument --gres=gpu:1.
It means we will need one GPU for this job.
To check whether you have a GPU to work with, run the following command to see what GPU you have.
nvidia-smi
In my case, I have a GeForce GTX TITAN X with around 12 GB of memory.
[htang2@landonia03 ~]$ nvidia-smi
Sun Oct 12 23:42:26 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX TITAN X Off | 00000000:03:00.0 Off | N/A |
| 22% 32C P8 16W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Don't forget to exit after the interactive job.
As a general rule, submit jobs with sbatch unless you are debugging,
and only run srun on the partition Teach-Interactive.
Let's actually write a Python script that runs PyTorch.
Open your editor and save the following code to a file test.py.
#!/usr/bin/env python3
#
# This is test.py
#
import torch
print(torch.cuda.is_available())
a = torch.rand(3)
b = torch.rand(3)
a = a.to('cuda')
b = b.to('cuda')
print(a + b)
To run this script, we need to activate the virtual environment first.
Save the following bash script to a file run.sh.
#!/bin/bash
#
# This is run.sh
#
. /home/htang2/toolchain-20251006/toolchain.rc
. venv/bin/activate
python3 test.py
Instead of running an interactive job with srun, we are going to run
a regular job with sbatch.
sbatch -p Teach-Standard --gres=gpu:1 ./run.sh
All the arguments should be familiar to you by now.
The only difference is that we switch to the partition Teach-Standard.
Below is the output of the command.
[htang2@hastings ~]$ sbatch -p Teach-Standard --gres=gpu:1 ./run.sh
Submitted batch job 2093068
The job is now submitted and given a job id (in my case, 2093068).
You will also notice that there is a new file slurm-<job_id>.out
in your home directory.
This is the file where the standard output goes.
Once the job is done, I can look at the output by doing
the following.
cat slurm-2093068.out
Below is the output I got.
[htang2@hastings ~]$ cat slurm-2093068.out
True
tensor([1.1060, 1.1293, 0.9643], device='cuda:0')
Based on device='cuda:0', the tensor is indeed allocated
on the GPU memory.
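Instead of passing -p and --gres on the command line every time, you can also embed them at the top of run.sh as #SBATCH directives, which sbatch reads before running the script. Below is a sketch; the --output pattern (%j expands to the job id) is optional and only shown as an example.
#!/bin/bash
#SBATCH --partition=Teach-Standard
#SBATCH --gres=gpu:1
#SBATCH --output=slurm-%j.out
. /home/htang2/toolchain-20251006/toolchain.rc
. venv/bin/activate
python3 test.py
With the directives in place, submitting the job is simply sbatch ./run.sh.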
Once you are comfortable running jobs, you will need to manage them.
You can use squeue to see a list of jobs you are currently running.
squeue -u <UUN>
If you find that some jobs are not useful anymore, you can kill them
using scancel.
scancel <job_id>
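If you want to cancel everything you have queued at once, scancel also accepts a user filter.
scancel -u <UUN>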