mlp cluster
Author: Hao Tang (hao.tang@ed.ac.uk)
Date: 21 Oct, 2025
mlp
Try the following command and log into mlp.
ssh <UUN>@mlp -J <UUN>@student.ssh.inf.ed.ac.uk
The -J argument specifies a jump host. It is necessary when you are
not on the School of Informatics network.
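If you log in often, you can put the jump into your ~/.ssh/config so that a plain ssh mlp works on its own. Below is a minimal sketch; it assumes the head node is reachable as mlp.inf.ed.ac.uk (the alias mentioned in the login banner below) and uses the student jump host from above.
Host mlp
    HostName mlp.inf.ed.ac.uk
    User <UUN>
    ProxyJump <UUN>@student.ssh.inf.ed.ac.uk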
Below is what I saw when I logged into mlp.
[haotang@vaio ~]$ ssh htang2@mlp -J htang2@staff.ssh.inf.ed.ac.uk
(htang2@staff.ssh.inf.ed.ac.uk) Password:
(htang2@mlp) Password:
Welcome to Ubuntu 24.04.3 LTS (GNU/Linux 6.8.0-57-generic x86_64)

This is a cluster head node please do not run compute intensive processes here
this node is intended to provide an interface to the cluster only

Please nice any long running processes

If you are running interactive jobs via srun please do do so for longer than
24 hours. Specifically do not leave them to idle as they may be killed, other
people may need to use the GPU you are blocking

-------------------------------------------------------------------------------
Looking for tips? https://computing.help.inf.ed.ac.uk/cluster-tips

Jan 2025: Note that some GRES have changed to swap dashes to underscores.

**************************************************************************
We have scheduled the final set of nodes to upgrade for week commencing
17 August and we will be switching over the head nodes to new hardware
running noble starting with uhtred (currnetly aliased as mlp.inf.ed.ac.uk
as of 9am Monday.
**********************************************updated: 2025-05-15 10:00 **
Last login: Thu Oct 9 20:24:49 2025 from porthos.inf.ed.ac.uk
[htang2@hastings ~]$
The first thing to note is the hostname hastings,
the name of the head node.
The head node is responsible for taking commands and submitting
them to one of the compute nodes.
Once we are on the head node, we can run the following command to see the list of machines and partitions.
sinfo
Below is the output of sinfo.
We can see that a partition is a set of nodes,
and a node is a physical machine.
[htang2@hastings ~]$ sinfo
PARTITION          AVAIL  TIMELIMIT   NODES  STATE  NODELIST
SRF-Teaching       up     2:00:00     1      idle   saxa
SRF-Reserach       up     10-00:00:0  7      mix    herman,scotia[01-03,05-06,08]
SRF-Reserach       up     10-00:00:0  1      alloc  scotia07
SRF-Reserach       up     10-00:00:0  1      idle   scotia04
Teach-Interactive  up     2:00:00     1      down*  landonia01
Teach-Interactive  up     2:00:00     1      idle   landonia03
Teach-Standard*    up     3-08:00:00  2      down*  landonia[09,19]
Teach-Standard*    up     3-08:00:00  7      idle   landonia[02,04-06,08,21,25]
Teach-Short        drain  4:00:00     0      n/a
Teach-LongJobs     up     3-08:00:00  1      down*  landonia22
Teach-LongJobs     up     3-08:00:00  1      drain  landonia23
Teach-LongJobs     up     3-08:00:00  3      idle   landonia[21,24],saxa
General-Usage      up     3-08:00:00  1      down*  meme
General-Usage      up     3-08:00:00  1      down   letha06
PGR-Standard       up     5-00:00:00  1      inval  damnii10
PGR-Standard       up     5-00:00:00  1      down*  damnii08
PGR-Standard       up     5-00:00:00  13     mix    crannog[02-07],herman,scotia[01-03,05-06,08]
PGR-Standard       up     5-00:00:00  2      alloc  crannog01,scotia07
PGR-Standard       up     5-00:00:00  11     idle   damnii[01-07,09,11-12],scotia04
ILCC-Standard      up     10-00:00:0  1      mix    duflo
ILCC-Standard      up     10-00:00:0  5      idle   barre,greider,levi,mcclintock,nuesslein
ILCC-CDT           up     10-00:00:0  1      mix    strickland
ILCC-CDT           up     10-00:00:0  1      idle   arnold
The two most important partitions are Teach-Interactive and
Teach-Standard, and I will say more about them in the
rest of the tutorial.
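If you want to see which GPUs each node in a partition offers, sinfo can print the GRES column. The format string below uses the standard specifiers %N (node name), %G (generic resources), and %T (state); the partition name is just an example.
sinfo -p Teach-Standard -N -o "%N %G %T"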
Let's run our first job.
srun -p Teach-Interactive --pty bash
This job is literally the shell bash itself.
Once you run it, it is as if you ssh-ed into a compute
node.
Below is the output I got from running the command.
[htang2@hastings ~]$ srun -p Teach-Interactive --pty bash
[htang2@landonia03 ~]$
In this case, I'm on landonia03 and I can make use
of the compute resources on it.
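By default you only get a small slice of the node. If you need more, srun accepts the usual Slurm resource flags; the numbers below are placeholders, and the actual limits depend on how the cluster is configured.
srun -p Teach-Interactive --cpus-per-task=2 --mem=8G --pty bash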
Remember to run the following command when you are done.
exit
Below is what I got after running exit. You should be back on
the head node (in this case hastings).
[htang2@landonia03 ~]$ exit
exit
[htang2@hastings ~]$
Before we move on, it is worth talking about how files are stored on the cluster. A job cannot do much if it cannot access the files it needs from the compute node.
When we are on the head node, the home directory (i.e.,
/home/<UUN>) is on NFS.
The nice thing about NFS is that
all compute nodes can access files on the NFS,
including your home directory.
The downside of NFS is that it is terribly slow.
Fortunately, every compute node also has a scratch space, located
at /disk/scratch.
Access to the scratch space is a lot faster than access to the home directory,
because it is simply a local disk.
The common pattern is to move the data to /disk/scratch before the compute
happens, and move the results back to the home directory after the compute
is done.
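For example, the pattern might look like the following on a compute node. The per-user directory under /disk/scratch and the names data and output are placeholders for illustration.
mkdir -p /disk/scratch/<UUN>
cp -r ~/data /disk/scratch/<UUN>/data
# ... run the compute, writing results under /disk/scratch/<UUN>/output ...
cp -r /disk/scratch/<UUN>/output ~/output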
Moving data to and from NFS does not scale linearly in the number of files. It is much more efficient to move a big file than to move many small files.
If your data set consists of many small files, pack them into a single archive before copying it to /disk/scratch.
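For example, assuming the data sits in a directory called data (again a placeholder), you can pack it on the head node and unpack it on the scratch disk.
tar cf data.tar data/
cp data.tar /disk/scratch/<UUN>/
cd /disk/scratch/<UUN> && tar xf data.tar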
Note that the NFS is not backed up. Make sure you back up the code and the results whenever possible.
Your home directory on AFS, /afs/inf.ed.ac.uk/user/<prefix>/<UUN>,
is still accessible from mlp.
It is backed up daily, so put your precious data over there.
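For example, to copy your code and results over (the directory names are placeholders, and this assumes rsync is available on the head node):
rsync -a ~/code ~/results /afs/inf.ed.ac.uk/user/<prefix>/<UUN>/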
I assume your main use would be to run and to train neural networks with PyTorch.
For anything you do with GPUs, you will need to have access to the CUDA library.
A simple way to check whether CUDA is properly installed is
to run nvcc (the NVIDIA compiler).
nvcc --version
Unfortunately, you will see the following error message.
[htang2@hastings ~]$ nvcc --version
Sorry, 'nvcc' was not found. Perhaps you meant:
 - nvlc
One solution is to use conda and install your own CUDA, but we would end up with hundreds of copies of the same CUDA library on the cluster. I have built a simple toolchain to make everybody's life easier.
. /home/htang2/toolchain-20251006/toolchain.rc
Now if you run nvcc --version, you will see the following.
[htang2@hastings ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
To install PyTorch, we first create a virtual environment and activate it.
python -m venv venv
. venv/bin/activate
The virtual environment of course needs to be on the NFS so that the compute nodes can use it. You will also get to experience how slow the NFS is when you create this virtual environment.
PyTorch can be installed with the following commands.
Don't forget to install numpy.
In case you wonder, these two packages are specifically
compiled for the cluster, and you might find that they run
faster than the default ones you get with a regular pip install.
pip install /home/htang2/toolchain-20251006/whl/numpy-2.2.3-cp312-cp312-linux_x86_64.whl
pip install /home/htang2/toolchain-20251006/whl/torch-2.8.0a0+gitunknown-cp312-cp312-linux_x86_64.whl
After doing all the above, you will get CUDA version 12.8 and PyTorch 2.8.0.
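A quick way to check that the installation worked is to print the versions from inside the activated virtual environment. (On the head node torch.cuda.is_available() will report False because there is no GPU there; that is expected.)
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"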
We are now ready to run a job that makes use of a GPU. Let's run the following interactive job.
srun -p Teach-Interactive --gres=gpu:1 --pty bash
Note the additional argument --gres=gpu:1.
It means we will need one GPU for this job.
To check whether you have a GPU to work with, run the following command to see what GPU you have.
nvidia-smi
In my case, I have a GeForce GTX TITAN X with around 12 GB of memory.
[htang2@landonia03 ~]$ nvidia-smi
Sun Oct 12 23:42:26 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01 Driver Version: 570.158.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX TITAN X Off | 00000000:03:00.0 Off | N/A |
| 22% 32C P8 16W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Don't forget to exit after the interactive job.
As a general rule, submit jobs with sbatch unless you are debugging,
and only run srun on the partition Teach-Interactive.
Let's actually write a Python script that runs PyTorch.
Open your editor and save the following code to a file test.py.
#!/usr/bin/env python3
#
# This is test.py
#
import torch
print(torch.cuda.is_available())
a = torch.rand(3)
b = torch.rand(3)
a = a.to('cuda')
b = b.to('cuda')
print(a + b)
To run this script, we need to activate the virtual environment first.
Save the following bash script to a file run.sh.
#!/bin/bash
#
# This is run.sh
#
. /home/htang2/toolchain-20251006/toolchain.rc
. venv/bin/activate
python3 test.py
Instead of running an interactive job with srun, we are going to run
a regular job with sbatch.
sbatch -p Teach-Standard --gres=gpu:1 ./run.sh
All the arguments should be familiar to you by now.
The only difference is that we switch to the partition Teach-Standard.
Below is the output of the command.
[htang2@hastings ~]$ sbatch -p Teach-Standard --gres=gpu:1 ./run.sh
Submitted batch job 2093068
The job is now submitted and given a job id (in my case, 2093068).
You will also notice that there is a new file slurm-<job_id>.out
in your home directory.
This is the file where the standard output goes.
Once the job is done, I can look at the output by doing
the following.
cat slurm-2093068.out
Below is the output I got.
[htang2@hastings ~]$ cat slurm-2093068.out
True
tensor([1.1060, 1.1293, 0.9643], device='cuda:0')
Based on device='cuda:0', the tensor is indeed allocated
on the GPU memory.
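Instead of passing -p and --gres on the command line every time, you can also embed them at the top of run.sh as #SBATCH directives, which sbatch reads before running the script. Below is a sketch; the --output pattern (%j expands to the job id) is optional and only shown as an example.
#!/bin/bash
#SBATCH --partition=Teach-Standard
#SBATCH --gres=gpu:1
#SBATCH --output=slurm-%j.out
. /home/htang2/toolchain-20251006/toolchain.rc
. venv/bin/activate
python3 test.py
With the directives in place, submitting the job is simply sbatch ./run.sh.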
Once you are comfortable running jobs, you will need to manage them.
You can use squeue to see a list of jobs you are currently running.
squeue -u <UUN>
If you find that some jobs are not useful anymore, you can kill them
using scancel.
scancel <job_id>
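If you want to cancel everything you have queued at once, scancel also accepts a user filter.
scancel -u <UUN>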