NVIDIA GPU sharing under Linux

At the time of writing, NVIDIA didn’t provide any tools for monitoring the usage of their boards, although they were “looking into it”. This page describes the hackish solutions I provided to share boards in Toronto. When running jobs on CPUs it is most efficient to run only one job per CPU. On GPUs this rule seems even more important, as system crashes appear to happen more often when multiple jobs are run at once.

Discretionary locking: gpu_lock

I have written a Python script gpu_lock that implements a discretionary locking scheme: if everyone uses it, only one user will use a given GPU ID at a time. If an ID lock is obtained in the default way, the lock is automatically freed when the parent process ends (even if it crashes) or on a system reboot. It is also possible to get a lock that will persist and that must be manually freed.

Toronto ML users: the command to alias is /u/murray/bin/gpu_lock.py. To use it as a Python module, symlink both /u/murray/bin/gpu_lock.py and /u/murray/bin/run_on_me_or_pid_quit into a directory in your PYTHONPATH.

Edinburgh ML users: the command to alias is /disk/scratch/imurray2/gpu_lock/gpu_lock.py. To use it as a Python module, symlink both /disk/scratch/imurray2/gpu_lock/gpu_lock.py and /disk/scratch/imurray2/gpu_lock/run_on_me_or_pid_quit into a directory in your PYTHONPATH.

Run the script with no arguments to see the locks currently assigned and to get more information on using the script. It should be easy to use. In a shell script, do:

ID=`gpu_lock.py --id`

Then tell your GPU-using program to use $ID. From within a Python program do:

import gpu_lock
id = gpu_lock.obtain_lock_id()

From within Matlab do:

id = obtain_gpu_lock_id();

Generically, in a programming language with a system command that runs a sub-shell do:

id = str2int(system("exec gpu_lock.py --id"));

More examples are given in the directory. For all of the above methods of locking an ID, the lock will be freed when the calling program finishes. An ID of -1 is returned if none are free. Run gpu_lock.py with no arguments for help on obtaining a persistent lock and freeing it manually, although this is not recommended.
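
If no board is free, you probably want to fail cleanly rather than press on with an invalid ID. A minimal sketch of that check from Python (the error message and exit behaviour are just one way to handle it):

import sys
import gpu_lock

board_id = gpu_lock.obtain_lock_id()
if board_id < 0:
    sys.exit("No free GPU board available; try again later.")
# ...otherwise pass board_id to your GPU code as before...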

The script doesn't really know anything about GPUs other than the number of boards available. It's just a way of assigning integers on a first-come first-served basis.

Portability: the number of boards is found in a Linux-specific way (by counting the /dev/nvidia[0-9]* devices), as is the process monitoring used to free locks (peeking in /proc). Some tweaking would probably be required for OSX. The symlink-based locking mechanism itself should work on any POSIX system (Linux and OSX, but not Windows).
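
For the curious, both pieces come down to a few lines of standard-library Python. The sketch below only illustrates the general idea, not the script's exact code, and the lock path is made up:

import glob
import os

# Linux-specific board count: one /dev/nvidiaN device file per board.
num_boards = len(glob.glob('/dev/nvidia[0-9]*'))

# Claim board 0 with an atomic symlink: os.symlink fails if the link
# already exists, so at most one process can hold a given lock.
try:
    os.symlink(str(os.getpid()), '/tmp/gpu_lock_board_0')  # hypothetical lock path
    print('locked board 0 of %d' % num_boards)
except OSError:
    print('board 0 is already locked')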

Our first approach: gpu_usage

Note: this section describes a tool we no longer really use as it doesn't work properly, but it may be of some use.

It would be nice to see whether a GPU ID is actually being used, rather than just locked. The first script I wrote was gpu_usage. It uses a setuid wrapper around /usr/bin/lsof to see which users are using the /dev/nvidia[0-9]* devices as memory-mapped files. This used to work, but now (for reasons unknown, presumably after something was upgraded) a user doing real computation on just one of the devices appears to be using all of them, so the script currently over-reports usage.
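
The check itself amounts to asking lsof who has the device files open. A rough sketch of that query (run directly, without the setuid wrapper, so it only sees your own processes):

import glob
import subprocess

# Ask lsof who has the NVIDIA device files open; the real script goes
# through a setuid lsof wrapper (so it can see every user's processes)
# and looks specifically at memory-mapped use.
devices = glob.glob('/dev/nvidia[0-9]*')
if devices:
    out = subprocess.run(['lsof'] + devices,
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[1:]:  # skip lsof's header line
        command, pid, user = line.split()[:3]
        print('%s is using a board via %s (pid %s)' % (user, command, pid))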

In some ways gpu_lock is better anyway, despite not knowing about real usage. It ensures that no two users use the same ID (as long as everyone uses the locking system). With gpu_usage we had problems with jobs simultaneously requesting an ID.