A tutorial on using the icf cluster

Author: Hao Tang (hao.tang@ed.ac.uk)
Date: 18 May, 2026

Task 1: Logging into icf

Try the following command and log into icf.

ssh <UUN>@icf -J <UUN>@student.ssh.inf.ed.ac.uk

The -J argument specifies a jump host: ssh first connects to that host and then hops from there to icf. It is necessary when you are outside the School of Informatics network.

Below is what I saw when I logged into icf.

[haotang@latitude2 ~]$ ssh htang2@icf -J htang2@staff.ssh.inf.ed.ac.uk
(htang2@staff.ssh.inf.ed.ac.uk) Password: 
(htang2@icf) Password: 
Welcome to Ubuntu 24.04.4 LTS (GNU/Linux 6.8.0-90-generic x86_64)

               ____   ___    _   _  ___ _____   ____  _   _ _   _ 
              |  _ \ / _ \  | \ | |/ _ \_   _| |  _ \| | | | \ | |
              | | | | | | | |  \| | | | || |   | |_) | | | |  \| |
              | |_| | |_| | | |\  | |_| || |   |  _ <| |_| | |\  |
              |____/ \___/  |_| \_|\___/ |_|   |_| \_\\___/|_| \_|
                                                                  
               __  __ _____ __  __  ___  ______   __   ___  ____  
              |  \/  | ____|  \/  |/ _ \|  _ \ \ / /  / _ \|  _ \ 
              | |\/| |  _| | |\/| | | | | |_) \ V /  | | | | |_) |
              | |  | | |___| |  | | |_| |  _ < | |   | |_| |  _ < 
              |_|  |_|_____|_|  |_|\___/|_| \_\|_|    \___/|_| \_\
                                                                  
                    ____ ___  __  __ ____  _   _ _____ _____ 
                   / ___/ _ \|  \/  |  _ \| | | |_   _| ____|
                  | |  | | | | |\/| | |_) | | | | | | |  _|  
                  | |__| |_| | |  | |  __/| |_| | | | | |___ 
                   \____\___/|_|  |_|_|    \___/  |_| |_____|
                                                             
               ___ _   _ _____ _____ _   _ ____ _____     _______ 
              |_ _| \ | |_   _| ____| \ | / ___|_ _\ \   / / ____|
               | ||  \| | | | |  _| |  \| \___ \| | \ \ / /|  _|  
               | || |\  | | | | |___| |\  |___) | |  \ V / | |___ 
              |___|_| \_| |_| |_____|_| \_|____/___|  \_/  |_____|
                                                                  
              ____  ____   ___   ____ _____ ____ ____  _____ ____  
             |  _ \|  _ \ / _ \ / ___| ____/ ___/ ___|| ____/ ___| 
             | |_) | |_) | | | | |   |  _| \___ \___ \|  _| \___ \ 
             |  __/|  _ <| |_| | |___| |___ ___) |__) | |___ ___) |
             |_|   |_| \_\\___/ \____|_____|____/____/|_____|____/ 
                                                                   

This is a shared node, do not run compute intensive processes.


This is a cluster head node please do not run compute intensive processes here
this node is intended to provide an interface to the cluster only
****************************************************************************
***Please nice any long running processes***
*****************************************************************************
If you are running interactive jobs via srun please do not do so for longer than
an hour. Specifically do not leave them to idle as they may be killed,
other people may need to use the GPU you are blocking
-------------------------------------------------------------------------------
The ICF status can be viewed online: https://icfwebview.inf.ed.ac.uk
-------------------------------------------------------------------------------
Looking for tips?  https://computing.help.inf.ed.ac.uk/cluster-tips
**************************************************************************

**********************************************updated: 2026-03-27 10:48 **

Last login: Mon May 18 12:00:33 2026 from porthos.inf.ed.ac.uk

The first thing to note is the hostname hastings, the name of the head node (it appears in the shell prompt once you are logged in). The head node is responsible for taking commands and submitting them to the compute nodes.

Warning: Do not run anything on the head node. The head node is where we submit jobs, and it needs to serve a lot of people. Instead, submit jobs so that the compute happens on the compute nodes.

Task 2: Listing all the machines and partitions

Once we are on the head node, we can run the following command to see the list of machines and partitions.

sinfo

Below is the output of sinfo. We can see that a partition is a set of nodes, and a node is a physical machine.

[htang2@hastings ~]$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive      up    4:00:00      2   idle landonia[01-02]
ICF-Free         up 2-00:00:00      1   mix- herman
ICF-Free         up 2-00:00:00     12    mix crannog[01-07],landonia[11,21],saxa,scotia[01-02]
ICF-Free         up 2-00:00:00     17   idle damnii[07-12],landonia[03,05,08,23,25],scotia[03-08]
Teaching*        up 2-00:00:00      4    mix crannog[01-02],landonia11,saxa
Teaching*        up 2-00:00:00     11   idle damnii[07-12],landonia[03,05,08,23,25]
ICF-Research     up 2-00:00:00      1   mix- herman
ICF-Research     up 2-00:00:00      2    mix scotia[01-02]
ICF-Research     up 2-00:00:00      6   idle scotia[03-08]
Open-Research    up 2-00:00:00      5    mix crannog[03-07]
ILCC-Standard    up 5-00:00:00      6   idle barre,duflo,greider,levi,mcclintock,nuesslein
ILCC-CDT         up 5-00:00:00      2   idle arnold,strickland

The two most important partitions are Interactive and Teaching, and I will say more about them in the rest of the tutorial.
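
To make the relationship between partitions and nodes concrete, here is a minimal Python sketch that tallies the NODES column per partition and state. It is an illustration only: the function name summarize_sinfo is hypothetical, and the sample text is a few rows copied from the sinfo output above.

```python
from collections import defaultdict

# A few rows copied from the sinfo output above.
SAMPLE = """\
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
Interactive      up    4:00:00      2   idle landonia[01-02]
Teaching*        up 2-00:00:00      4    mix crannog[01-02],landonia11,saxa
Teaching*        up 2-00:00:00     11   idle damnii[07-12],landonia[03,05,08,23,25]
"""


def summarize_sinfo(text):
    """Return {partition: {state: node_count}} from default sinfo output."""
    summary = defaultdict(dict)
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed lines
        partition, _avail, _limit, nodes, state = fields[:5]
        # sinfo marks the default partition with a trailing '*'.
        partition = partition.rstrip("*")
        summary[partition][state] = summary[partition].get(state, 0) + int(nodes)
    return dict(summary)


if __name__ == "__main__":
    for name, states in summarize_sinfo(SAMPLE).items():
        print(name, states)
```

On the head node, the same function could be fed live output, for example the stdout of subprocess.run(["sinfo"], capture_output=True, text=True).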

Task 3: Running an interactive job

Let's run our first job.

srun -p Interactive --pty bash

This job is literally the shell bash itself. Once you run it, it is as if you ssh-ed into a compute node. Below is the output I got from running the command.

[htang2@hastings ~]$ srun -p Interactive --pty bash
[htang2@landonia01 ~]$ 

In this case, I'm on landonia01 and I can make use of the compute resources on it.

Pro tip: Interactive jobs are super useful for debugging when you encounter errors.

Remember to run the following command when you are done.

exit

Below is what I got after running exit. You should be back on the head node (in this case hastings).

[htang2@landonia01 ~]$ exit
exit
[htang2@hastings ~]$ 

File systems

Before we move on, it is worth talking about how the files are stored on the cluster. A job cannot really do much if we don't let the job on the compute node access the files it needs.

When we are on the head node, the home directory (i.e., /home/<UUN>) is on NFS. The nice thing about NFS is that all compute nodes can access files on the NFS, including your home directory. The downside of NFS is that it is terribly slow.

Fortunately, every compute node also has a scratch space, located at /disk/scratch. Access to the scratch space is a lot faster than access to the home directory, because it is simply a local disk.

Pro tip: To get the best of both worlds, move the data that the job needs to /disk/scratch before the compute happens, and move the result back to the home directory after the compute is done.
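
The stage-in, compute, stage-out pattern from the tip above can be sketched in Python as follows. This is a minimal sketch, not a prescribed workflow: the helper names stage_in and stage_out are hypothetical, and temporary directories stand in for the home directory and /disk/scratch so that the code runs anywhere. On the cluster you would pass a directory of your own under /disk/scratch instead (the exact subdirectory layout is an assumption).

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(source, scratch_root):
    """Copy one input file from NFS into a scratch directory."""
    scratch_root = Path(scratch_root)
    scratch_root.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, scratch_root))


def stage_out(result, home_dir):
    """Copy one result file from scratch back to NFS."""
    home_dir = Path(home_dir)
    home_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(result, home_dir))


if __name__ == "__main__":
    # Temporary directories stand in for the home directory and /disk/scratch.
    with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as scratch:
        data = Path(home) / "data.txt"
        data.write_text("1 2 3\n")

        # NFS -> scratch, before the compute happens.
        local = stage_in(data, Path(scratch) / "job")

        # The "compute": read and write only on the fast local disk.
        total = sum(int(x) for x in local.read_text().split())
        result = local.with_name("result.txt")
        result.write_text(f"{total}\n")

        # scratch -> NFS, after the compute is done.
        saved = stage_out(result, Path(home) / "results")
        print(saved.read_text().strip())
```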

Moving data to and from NFS gets disproportionately slower as the number of files grows. It is much more efficient to move one big file than to move many small files.

Pro tip: Zip your dataset ahead of time. Moving the single archive to /disk/scratch is much faster than moving the individual files.
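
As a rough illustration of the tip above, the following Python sketch packs a directory of many small files into one zip archive and unpacks it elsewhere. The helper names pack and unpack are hypothetical, and a temporary directory stands in for both NFS and /disk/scratch so that the sketch is runnable on any machine.

```python
import shutil
import tempfile
from pathlib import Path


def pack(dataset_dir, archive_stem):
    """Zip dataset_dir into <archive_stem>.zip and return the archive path."""
    return Path(shutil.make_archive(str(archive_stem), "zip", root_dir=str(dataset_dir)))


def unpack(archive, target_dir):
    """Extract the archive into target_dir (e.g. a directory on scratch)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_dir))
    return target_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        dataset = tmp / "dataset"
        dataset.mkdir()
        for i in range(100):  # many small files
            (dataset / f"sample{i:03d}.txt").write_text(str(i))

        archive = pack(dataset, tmp / "dataset")     # one file to move
        restored = unpack(archive, tmp / "scratch")  # stand-in for /disk/scratch
        print(archive.name, len(list(restored.iterdir())))
```

Packing is done once, ahead of time; each job then only needs to copy and extract a single file.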

Note that the NFS is not backed up. Make sure you back up the code and the results whenever possible.

Pro tip: Your AFS directory /afs/inf.ed.ac.uk/user/<prefix>/<UUN> is still accessible from the cluster. It is backed up daily, so put your precious data over there.

Task 4: Preparing the virtual environment

I assume your main use will be running and training neural networks with PyTorch. For anything you do with GPUs, you will need access to the CUDA library. A simple way to check whether CUDA is properly installed is to run nvcc (the NVIDIA CUDA compiler).

nvcc --version

Unfortunately, you will see the following error message.

[htang2@hastings ~]$ nvcc --version
Sorry, 'nvcc' was not found.  Perhaps you meant:
 - nvlc

One solution is to use conda and install your own CUDA, but we would end up with hundreds of copies of the same CUDA library on the cluster. I have built a simple toolchain to make everybody's life easier.

. /home/htang2/toolchain-20251006/toolchain.rc

Now if you run nvcc --version, you will see the following.

[htang2@hastings ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

Pro tip: Remember to activate the toolchain whenever you need to use a GPU.

To install PyTorch, we first create a virtual environment and activate it.

python -m venv venv
. venv/bin/activate

The virtual environment of course needs to be on the NFS so that the compute nodes can use it. You will also get to experience how slow the NFS is when you create this virtual environment.

PyTorch can be installed with the following commands. Don't forget to install numpy. In case you wonder, these two packages are specifically compiled for the cluster, and you might find that they run faster than the default builds you get with a regular pip install.

pip install /home/htang2/toolchain-20251006/whl/numpy-2.2.3-cp312-cp312-linux_x86_64.whl
pip install /home/htang2/toolchain-20251006/whl/torch-2.8.0a0+gitunknown-cp312-cp312-linux_x86_64.whl

After doing all the above, you will get CUDA version 12.8 and PyTorch 2.8.0.
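
A quick way to confirm the setup is a small Python check like the sketch below. The function name describe_environment is hypothetical. It reports whether a virtual environment is active and, if PyTorch can be imported, which PyTorch version is installed and which CUDA version it was built with. If the steps above went well, I would expect the last two values to start with 2.8.0 and to be 12.8, respectively.

```python
import importlib.util
import sys


def describe_environment():
    """Collect basic facts about the active Python environment."""
    info = {
        "python": sys.version.split()[0],
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "torch": None,
        "torch_cuda": None,
    }
    # Only import torch if it is actually installed in this environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        # CUDA version PyTorch was built with (None for CPU-only builds).
        info["torch_cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

This check does not need a GPU, so it only tells you what is installed, not whether a GPU is usable.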

Task 5: Running a job with a GPU

We are now ready to run a job that makes use of a GPU. Let's run the following interactive job.

srun -p Interactive --gres=gpu:1 --pty bash

Note the additional argument --gres=gpu:1. It means we will need one GPU for this job.

Pro tip: Don't ask for many GPUs per job, as the job will be queued until there are enough GPUs. Instead, run many jobs, each of which uses one single GPU, and the jobs will be run as soon as there is a GPU available.

To check whether you have a GPU to work with, run the following command to see what GPU you have.

nvidia-smi

In my case, I have a GeForce RTX 2080 Ti with around 11 GB of memory.

[htang2@landonia01 ~]$ nvidia-smi
Mon May 18 12:13:32 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:04:00.0 Off |                  N/A |
| 29%   34C    P8             15W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Don't forget to exit after the interactive job.

Warning: Because an interactive job typically has low GPU utilization, it is never a good idea to run an interactive job with a GPU for anything other than debugging. Always run jobs with sbatch unless you are debugging, and only run srun on the partition Interactive.

Task 6: Running a regular job

Let's actually write a Python script that runs PyTorch. Open your editor and save the following code to a file test.py.

#!/usr/bin/env python3

#
# This is test.py
#

import torch

print(torch.cuda.is_available())

a = torch.rand(3)
b = torch.rand(3)

a = a.to('cuda')
b = b.to('cuda')

print(a + b)

To run this script, we need to activate the virtual environment first. Save the following bash script to a file run.sh.

#!/bin/bash

#
# This is run.sh
#

. /home/htang2/toolchain-20251006/toolchain.rc
. venv/bin/activate

python3 test.py

Pro tip: Remember to activate the toolchain whenever you need to use a GPU.

Instead of running an interactive job with srun, we are going to run a regular job with sbatch.

sbatch -p Teaching --gres=gpu:1 ./run.sh

All the arguments should be familiar to you by now. The only difference is that we switch to the partition Teaching. Below is the output of the command.

[htang2@hastings ~]$ sbatch -p Teaching --gres=gpu:1 ./run.sh
Submitted batch job 2093068

The job has now been submitted and given a job ID (in my case, 2093068). You will also notice a new file slurm-<job_id>.out in the directory where you ran sbatch (in my case, the home directory). This is the file where the standard output goes. Once the job is done, I can look at the output by doing the following.

cat slurm-2093068.out

Below is the output I got.

[htang2@hastings ~]$ cat slurm-2093068.out
True
tensor([1.1060, 1.1293, 0.9643], device='cuda:0')

Based on device='cuda:0', the tensor is indeed allocated in GPU memory.
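
If you end up submitting jobs from a script, the submit-then-read-output steps above can be sketched in Python as follows. This is only a sketch under assumptions: the helper names parse_job_id, default_output_file, and submit are hypothetical, and submit only works on the head node where sbatch is available. It relies on the "Submitted batch job <job_id>" line and the slurm-<job_id>.out file name shown above.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job ID from a line like 'Submitted batch job 2093068'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def default_output_file(job_id):
    """Name of the file where the job's standard output goes by default."""
    return f"slurm-{job_id}.out"


def submit(script, partition="Teaching", gres="gpu:1"):
    """Run sbatch (head node only) and return the ID of the new job."""
    done = subprocess.run(
        ["sbatch", "-p", partition, f"--gres={gres}", script],
        capture_output=True, text=True, check=True,
    )
    return parse_job_id(done.stdout)


if __name__ == "__main__":
    # Parsing demo that runs anywhere; submit("./run.sh") needs the cluster.
    job_id = parse_job_id("Submitted batch job 2093068")
    print(job_id, default_output_file(job_id))
```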

Managing your jobs

Once you are comfortable running jobs, you will need to manage them. You can use squeue to see a list of your jobs that are currently queued or running.

squeue -u <UUN>

If you find that some jobs are not useful anymore, you can kill them using scancel.

scancel <job_id>
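
The two commands above can also be combined in a script. Below is a minimal Python sketch that lists your jobs and builds (without running) a scancel command for every job that is still pending. The helper names are hypothetical and the sample text is made up for illustration. The sketch assumes the squeue options -h (no header) and -o "%i %T" (job ID and state), which are not used elsewhere in this tutorial.

```python
import subprocess


def parse_squeue(text):
    """Parse lines of 'JOBID STATE' into a {job_id: state} dictionary."""
    jobs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            # Keep job IDs as strings; array jobs have IDs such as 123_4.
            jobs[fields[0]] = fields[1]
    return jobs


def list_my_jobs(uun):
    """Ask squeue for the user's jobs (head node only)."""
    done = subprocess.run(
        ["squeue", "-u", uun, "-h", "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(done.stdout)


def cancel_commands(jobs, state="PENDING"):
    """Build, but do not run, one scancel command per job in the given state."""
    return [["scancel", job_id] for job_id, s in jobs.items() if s == state]


if __name__ == "__main__":
    # Made-up sample output; on the cluster, use list_my_jobs("<UUN>") instead.
    sample = "2093068 RUNNING\n2093069 PENDING\n"
    for command in cancel_commands(parse_squeue(sample)):
        print(" ".join(command))
```

Printing the commands first, rather than running them, gives you a chance to check the list before anything is killed.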