Efficient Matlab and Octave

In March 2011 I was asked to provide a short tutorial on “writing efficient Matlab code”. Here are some notes from that tutorial, including some notes on good practice that don’t strictly relate to efficiency.

For more detail, see some of this recommended reading:

YAGTOM: Yet Another Guide TO Matlab, by Matthew Dunham and Kevin Murphy.
The Mathwork’s vectorization guide.
Marios Athineos’s collection of notes (some slightly dated).
Tom Minka’s notes.
Loren Shure’s checklist of best practices and the rest of her blog.
My other niche Matlab/Octave pages: tricks, oddities, R notes, code.

Opening Quiz

Here are some things to think about.

How would you sample, uniformly at random, K values from a list of N values without replacement?
You have a large NxD data array data, (with N=1e7 say). The first column consists of user IDs, integers between 1 and 1e5. A row, data(n,:), is only relevant if data(n,1) appears in interesting, a Kx1 list of interesting users (K=1e3 say). How would you construct subset, an array consisting of the interesting rows of data?
How would you quickly find the NxM square distances between two sets of D-dimensional vectors of size N and M stored in DxN and DxM arrays? How about if you measure distance with different lenghscales (Mahalanobis distance with a diagonal metric): d_n,m = Σ_d (x_d,n - x_d,m)² / l_d²
You are given a list of N positive integers class=[1,9,3,4,...]. Find a neat way to get a logical NxK array, with only one true element in each row, where row n should be true in column number class(n).
You are given X, an NxD matrix of D-dimensional locations, and Y, an Nx1 vector of values at those locations. Lots of the rows of X are identical. Return new arrays Xu, an MxD matrix of unique D-dimensional locations, and Yu, a Mx1 vector of the sums of the Y values associated with those locations.
What compiler options would you use to compile a numerical C or C++ routine, and how should you link it with a mex/oct file?

Meta-question: (When) should we care about questions like these?

My answers are below, but the questions will be more useful if you think about them for yourself first. Please contact me if you have interesting alternative answers or questions, or any other comments. Thank you to those who have helped improve this document already.

When (not) to use Matlab and Octave
Vectorization
Profiling and Testing
Function documentation and input checking
C (and other) extensions: mex and oct files
3rd-party GPU toolkits
Summary
Quiz Answers
Acknowledgements

When (not) to use Matlab and Octave

Data types and structures: In the beginning, MATLAB the “MATrix LABoratory” had one data-type: a matrix of doubles. Matlab software is still dominated by numerical computation with matrices. Personally I like that. I don't want to coerce data into the right sort of vector, or data type. I've got used to pushing blocks of numbers around and not having to worry about types. (Although I do document the size/orientation of each matrix.) In other languages I’ve found coercing my data into the right type cumbersome, and often I’ve had to use far more of someone’s framework than I’d like when using one of their functions.

Historically, complex data-structures haven’t been easy or efficient in Matlab. It is now possible to do a fair amount with structure arrays, cell arrays and classes. But these aren’t always particularly efficient. Profile your code! Sometimes it is necessary to resort to linking with other languages. If much of your code is heavy on complicated data-structures then Matlab/Octave may not be the way to go. (e.g., Matlab only recently got native hash tables, and Octave doesn’t ship with them!)

Free vs Proprietary: Octave is Free Software, whereas Matlab is closed source, cannot be modified and costs money. The relative political/ethical importance of this distinction I’ll leave up to you. But some practical pros and cons of Matlab/Octave directly relate to their commercial/free development models.

Dealing with licenses: the license manager that Matlab uses is annoying to set up. The toolboxes that Mathworks sell are expensive and so most institutions never have quite enough copies. Then you find your code bombs out because you used a trivial function, like range, which you could have implemented yourself in one or two lines if you had realized it was part of an expensive toolbox. This sort of thing punishes me for paying for software. If after writing your code you want to deploy jobs across a high-performance cluster, you may have to worry about licenses again.

Matlab and Octave are not just similar, some parts use the same underlying libraries. The performance of vectorized code is often similar, although Matlab uses MKL and Octave is often linked with another BLAS implementation, such as ATLAS. Overall I do run things in Matlab when available though, as it is much faster at interpreting loop-heavy code. There can be surprising differences. I have had lots of numerical trouble with Matlab’s quadrature routines, when Octave’s have worked fine, even though they use the same underlying library! Of course Octave, like any complex software system, has buggy routines too.

Installing binaries: Matlab comes pre-compiled and linked with a barrage of required libraries. In some ways this bundling makes Matlab Just Work. But because Matlab doesn't necessarily play nicely with the surrounding operating system, linking extensions can be a pain and getting updates doesn’t happen automatically. At the time of writing I was six Matlab releases behind the current version. And a recent experience with a collaborator was that I couldn’t even assume they had bsxfun. Why should they buy an upgrade just to run one bit of code from me as a one-off? So I had to waste time writing a compatibility stub. (My code would have worked in Octave, but that wasn’t installed and I didn’t have access to set it up.)

Backwards compatibility has a pretty good history in Matlab. I have used research code from more than a decade ago with minimal or no tweaking. I think commercial systems often feel a strong pressure to keep their big customers happy, even when it means keeping warts in the language. What are the chances that some Python+numpy+scipy code will work in >10 years time? Octave hasn’t always been so Matlab compatible. In shifting to Matlab-compatibility as a project goal, old Octave code has been broken. Code written now that targets the large subset of the language that is supported by Matlab and Octave is likely to be runnable (perhaps with minimal tweaking) for some time to come.

Plotting is one of the things that Octave continues to be least compatible with Matlab over. On the one hand, Octave plotting is now much more compatible with Matlab. On the other hand, few of my old Octave plotting scripts work any more. In many programming environments, anything to do with displaying graphics or GUIs seems to be harder to keep working or port. As always, separate out functionality: don’t make your core simulation depend on plotting routines. Write out results and plot them separately, or make plotting optional and swappable with callback functions. For some purposes, R produces much more professional looking graphics out of the box. Separating out your plotting routines will make it easier to redo plots in (say) R.

Vectorization

Writing an expensive computation as a cryptic line or two of dense Matlab can be a fun puzzle. Is it worth it? Sometimes. Matlab’s JIT means that vectorization is sometimes unnecessary or worse than more naive code. Always profile and test.

The references listed above are just a few of the available notes you can find on the web. Here I provide a checklist of things you may want to look into if you are unsure about them.

logical, integer and subscript indexing: know how they all work and how to use:

find, and when (not) to use it
:
sub2ind, ind2sub (and maybe my subv2ind, ind2subv)

Frequently used functions for vectorizing:

~~bsxfun (often used where we used to use repmat)~~
There's no need to use bsxfun now that Matlab supports broadcasting (and Octave has for even longer).
histc
unique (sometimes slower than it should be)
ones/true and zeros/false
reshape, vec=@(x)x(:);
cumsum
accumarray
diff (slightly neater than doing yourself with indexing)
Others? Often mentioned, but I haven’t used much: vectorize, filter, kron, meshgrid

If you’re a functional programming fan you may like to play around with cell arrays and the “map” functions cellfun, arrayfun, etc. Sometimes these allow neat, fast code. If you have arranged computations like this, it should be easier to parallelize them. Octave-forge contains a parallel version of cellfun: parcellfun.

You could consider using sparse arrays to vectorize some operations that may not naturally map onto dense linear algebra operations.

Pre-allocation. If you must use a for loop, try not to grow an array inside it:

X = zeros(1, N);
for nn = 1:N
    X(nn) = fn(nn);
end
% Do NOT do:
X = [];
for nn = 1:N
    X = [X fn(nn)];
end

Sometimes you have to guess how big the array will be. It is usually faster to make it too big, or double the length every time it fills up, and trim at the end. Matlab r2011a improved performance if you forget to pre-allocate. Before that, growing an array could easily be disastrously slow.

Profiling and Testing

Time individual snippets with:

tic; stuff(); wallclock_time_taken = toc
timeit(@() stuff(), 0) — more careful and less variable timings.

(For Octave or older Matlab, Mathworks used to distribute timeit on its fileexchange with a BSD license. I have mirrored the download here: timeit.zip. The second argument to timeit makes it work in older versions of Octave, but won't be necessary now.)

Which snippets should you bother to time?

Use Matlab’s built-in graphical profiler: profile viewer. It is awesome and happens to be the only part of the Matlab GUI (besides plot windows) that I use. Most computer time is usually spent in a small part of your code, and it can be surprising which lines of code dominate timings.
Don’t take it on trust that a 3rd-party “fast” replacement of a Matlab primitive actually is still faster in current Matlab. Benchmark it!

Write test-cases to check that versions of your code are really producing the same output as each other. Also check on small cases you can solve by hand.

I put tests in a sub-directory called testing. If I start Matlab there I can still see my project’s code because my startup.m includes:

% If in a testing directory, add the parent directory to the Matlab path.
% Make it easier to separate out tests into a subdirectory,
% then I'm less likely to delete the tests and more likely to run them:
if regexp(pwd,'/test(ing|s|)(|_)(|stuff|rig)$')
    try
        addgenpath('..'); % cleverer version from my Matlab toolbox
    catch
        addpath(genpath('..'));
    end
end

If “shipping” my code, I create a startup.m in the testing directory:

% run usual startup script
startups = which('startup', '-ALL');
if length(startups) > 1
    run(startups{2})
end

% Add the parent directory (the project I'm testing) to the Matlab path
oldpwd = pwd;
cd ..
try; addgenpath(pwd); catch; addpath(genpath(pwd)); end
cd(oldpwd);

clear startups oldpwd ans

Does your code work if some of your array dimensions happen to be one or zero? 1) Always specify the dim argument in functions like sum unless you know you will always be processing a vector, otherwise the summing direction might switch for some inputs. 2) Know that Matlab cannot explicitly have ND-arrays with trailing singleton dimensions, but that most functions allow you to access an arbitrary number of trailing singleton dimensions as if they were there:

>> A = zeros(4,3,2,1);
>> sz = size(A)
sz =
     4     3     2
>> sz(4) % bug, because last dimension has length 1 and didn't appear
??? Attempted to access sz(4); index out of bounds because numel(sz)=3.

>> size(A,4)
ans =
     1
>> size(A,5) % implicitly exists even though you didn't put it there
ans =
     1

3) Don’t use squeeze, which can squeeze dims you didn’t intend. Instead permute the singleton dimensions you want to squeeze to the end of your ND-Array, where they become invisible.

Matlab computations can produce NaN, Inf and complex numbers. Is your code correct in those cases? It can be hard to see where these esoteric numbers came from when they pop out as a final result. Typing dbstop naninf in Matlab causes a breakpoint whenever a NaN or Inf is produced. Putting log=@reallog;sqrt=@realsqrt; at the top of your functions prevents common sources of unintended complex numbers.

Function documentation and input checking

When optimizing a snippet of code, it’s often useful to separate it out into a meaningful function. Separated-out functions are more easily swapped with alternatives, tend to be more understandable, are more easily passed to timeit, and are easier to exercise separately in test cases.

A block of comments immediately beneath a function declaration is returned by the Matlab help command. The first line is returned by the lookfor command. This first line is conventionally the function name in capitals (yuck) followed by a short description. As a minimum I usually add an example of the function being used and a list of the sizes of all the input and output arrays. For a minimal example, see my unique_totals.m function. Creating these headers can be tedious, but I’ve found them worthwhile. For example is the input data matrix NxD, as standard for a design matrix in statistics, or DxN adjacent column vectors which keeps data cases together in memory? People do both. I have done both. I pipe my function header into a Python(!) program to help me make the documentation that keeps me sane: docmatlab, example.

The beginning of a function often has quite a bit of fairly boring code to check the number and sizes of arguments, fill in default values, pre-process arrays for the intended algorithm and so on. There is a tension between efficiency and robustness/clarity. For example I have an m-file DEFAULT.m, which allows me to do this:

function [xx, yy, zz] = blah(aa, bb, cc, dd)
DEFAULT('bb', zeros(size(aa)));
DEFAULT('dd', 0);
...


>> blah(aa, [], cc); % 2nd & 4th arguments initialized to default values

I find using DEFAULT clearer and shorter than manually putting in the logic to see if bb is defined and assigning a value if not. However, DEFAULT adds even more overhead than these checks would already have, which can cause a problem if the function is called frequently. I rely on profiling to tell me when this overhead is really an issue.

One common pattern is that a function is called many times, in a loop or recursively with similar arguments. After the first valid call, the arguments might be guaranteed to be valid and complete given what has gone before. In these cases I have an end-user facing function with extensive argument checking. This front-end function only checks the user-provided arguments once and then carefully drives an internal/private function many times. This internal function has minimal checking.

function [xx, yy, zz] = sprocket_solver(lots, of, args)
% Documentation
% Extensive arg checking and processing

for big_loop = 1:maybe_lots
    sprocket_helper(pre, processed, args);
end

function sprocket_helper(pre, processed, args)
% We now get on with some serious work with little arg-checking.
% sprocket_helper might call itself recursively, but sprocket_solver
% shouldn't because it has lots of arg-checking overhead.

Most of Matlab’s library routines do a lot of checking, and are quite general. If you are sure of your input, only need a simple case, and call the function often, then writing your own version can give a speedup. Although realize that you will be throwing away some of the care that has gone into writing seemingly trivial functions.

C (and other) extensions: mex and oct files

First, question whether you really want to write some of your code in another language. What are the chances that it and all its dependencies will compile and link cleanly with Matlab/Octave on a different system? How about 10 years from now?

Second, question whether you can actually write fast C code. If your algorithm involves large matrix operations, you may have to call optimized BLAS routines to do some of the work, or you risk being outperformed by the m-file that you are trying to replace.

There is fairly neat Java (and now C#) integration in Matlab. Using these features could help with Matlab’s limited data-structure support. However, I’m not much of a fan of Java and these language interfaces aren’t compatible with Octave, so personally I avoid them.

Personally, if I want to write low level code I tend to do it in ‘plain C’ (ANSI C, or C99). It’s much less hassle to link than C++, compiles much faster, and the compilers tend to give me more interpretable error messages.

Matlab extension basics

Matlab comes with a script called mex to compile and link C/C++/Fortran extensions. Matlab also exposes a command called mex within Matlab, which seems to do the same thing as calling the command via, system('mex'). I’ve had most success calling the command from within Matlab; linking correctly can be tricky, and you’re more likely to get the environment set up correct if compiling within Matlab.

You’ll need to look at the documentation and find examples to copy and paste. But here are some notes to orientate you. A mex C program needs to contain:

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* main code */
}

Rather than a main() routine. As you would expect, plhs and prhs are arrays of pointers to mex-defined datatypes. There are a bunch of functions in the Matlab documentation beginning with mex, which are general routines for interfacing with Matlab. The routines to do with manipulating mxArrays all begin with mx.

There are some example programs in the <MATLAB>/extern/examples/mex directory. You can probably copy and paste one of those to form the basis of your first simple mex file. Here is an example program to perform an elementwise function to a general ND-Array:

#include "mex.h"

void my_element_wise_function(double* inputs, double* answer, int nn);

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const mxArray *mxIn;
    double *outArray, *inArray;
    int nIn;

    if (nrhs != 1)
        mexErrMsgTxt("There must be exactly one input.");
    if (nlhs > 1) /* can be "zero", in which case Matlab allows one */
        mexErrMsgTxt("There must be at most one output.");

    /* get information about the input */
    mxIn = prhs[0];
    inArray = mxGetPr(mxIn);
    nIn = mxGetNumberOfElements(mxIn);

    /* create output the same size as the input */
    plhs[0] = mxCreateNumericArray(
            mxGetNumberOfDimensions(mxIn),
            mxGetDimensions(mxIn),
            mxDOUBLE_CLASS, mxREAL);
    outArray = mxGetPr(plhs[0]);

    my_element_wise_function(inArray, outArray, nIn);
}

Where my_element_wise_function could be defined in a separate C file. This separation lets me write and test the interesting code in a C standalone program without worrying about mex, before finally linking:

system('gcc -c my_func.c');
mex my_mex_file.c my_func.o

Given the above code, it might be tempting to try to modify inArray in place. There is no supported way of doing in-place operations in mex. If you write a mex file that does that anyway, here is what can happen:

>> A = 1
A =
     1
>> B = A % Matlab internally does copy-on-write, so just points to A
B =
     1
>> modify_in_place(A); A
A =
    2.7183
>> B % B still points to A's memory; Matlab didn't notice the change
B =
    2.7183

There are hacks to avoid accidentally modifying other variables, (do A(1)=A(1); before calling the mex file on A), but they aren’t guaranteed to stay working.

Octave extension basics

Octave has a mex compatibility layer. Use:

mkoctfile --mex

in place of mex. In Debian/Ubuntu you may need to install the octave-headers package to get mkoctfile. It doesn’t support everything, but for simple mex files it seems to work fine.

Octave’s mex files have more overhead time than in Matlab, partly because of compatibility conversions, but partly because Octave doesn’t trust you not to alter the inputs, so unnecessarily copies them before giving them to you. Some, but not all, of this overhead can be removed by writing a native oct extension. If you separate out the main work of your routine into a standalone .c or .cc file, writing an extra interface stub is little work. See tl_exp.cc in fast_exp, which is almost entirely boiler-plate, for an example.

The documentation for writing Octave extensions seems a little thin. However, the examples directory provides a good starting point. In Debian/Ubuntu the examples are currently installed in /usr/share/doc/octave3.2-headers/examples, or you could grab the source tarball from the Octave website.

Debugging C code

It’s all too easy with C (or C++) to mis-use memory and get random crashes (‘segmentation faults’, ‘bus errors’, etc.) some time after the mistake was actually made. These mistakes are now easy to find! Compile a C program with -g to include debug symbols and run it through valgrind:

valgrind ./my_program

It usually tells you the precise line on which you made your mistake. In fact valgrind works regardless of which compiled language you used to make my_program.

It is possible to run your whole Matlab session inside valgrind (see the man page about tracing children). But loading Matlab in valgrind is problematic because a) valgrind slows everything down dramatically, so loading Matlab and doing anything takes ages; b) valgrind throws up so many warnings to do with Matlab itself, finding your transgressions is tricky.

It is strongly recommended that you find all the mistakes that you can in your code before attempting to run it inside Matlab. Make your main functions take normal C arrays rather than Matlab data types, and make a small standalone program to exercise your code for testing. After perfecting your C routine, then write a thin Matlab interface in a separate .c file. The interface obtains pointers to the input and output data and calls your C routine, which itself doesn’t need to know anything about mex. This separation will make it easier for you to use your C code in other places. For example you could write a Python wrapper, or write a .cc file giving a native Octave interface (rather than forcing Octave to use its Matlab compatibility interface).

Another useful C debugging tool is mudflap (should come with gcc, but in Debian/Ubuntu I had to install a package called libmudflap0-X.Y-dev, where X.Y matches the current version of gcc):

gcc -g -fmudflap -lmudflap main.c -o my_program

The program runs much faster than under valgrind, but still reports memory mistakes, sometimes catching ones that valgrind misses, sometimes missing ones valgrind gets. Note: valgrind will not work on programs linked with mudflap; do a normal "production" build before testing with valgrind.

Update November 2015: mudflap has been deprecated in favour of "address sanitzer", and removed from recent versions of gcc. Instead compile with:

gcc -g -fsanitize=address -fno-omit-frame-pointer main.c -o my_program

You can replace "gcc" with "clang" if you prefer. Programs run faster than under valgrind, and stack level problems are found that valgrind misses. Again, maybe valgrind will find some mistakes that address sanitizer won’t (I don’t know yet). If trying valgrind, don’t also compile with address sanitizer!

C/C++ compilers and their options

How much do you care about numerical correctness, IEEE floating point conformance, etc.? Your priorities will guide some of your compiler options. For example: -mieee-with-inexact means you really want very strict IEEE conformance, even on hardware where this is slow (I don’t), whereas -ffast-math says you don’t care about a bunch of correctness concerns.

A gcc compiler option I’ve sometimes found really important on x86 or i686 architectures is -msse2. This flag allows the compiler to automatically vectorize some floating point operations, potentially giving a large speed-up. gcc doesn’t turn on the option by default because the required instructions are not implemented by all 32-bit processors. Although the flag is turned on by default for x86_64.

To avoid keeping track of processor flags, and which CPU I need for them to work, I tend to compile with -march=native, which means “use whatever the CPU in this machine can support”. Be careful if transferring binaries to other machines however.

I usually use the -O2 optimization flag. The -O3 flag sometimes makes things better, but can make things worse and has a reputation for being more buggy.

I often turn on all the warning and -pedantic flags I can without getting too irritated.

I’ve heard informally that the Intel compilers sometimes produce much faster code than gcc. It’s not very expensive for academic use, but academic use cannot use the free personal license (last time I checked). I value my time, and have never bothered sorting out a purchase and fighting with licenses.

Building mex/oct and linking

C++ is notoriously binary incompatible between versions. Matlab won’t have been compiled with the same compiler as installed on your system. If you compile C++ mex files on your system and try to use them they often won’t work. Currently I have Matlab aliased to start like this:

LD_PRELOAD='/lib/libgcc_s.so.1:/usr/lib/libstdc++.so.6.0.13:/lib/libz.so.1.2.3.3' matlab

This line causes Matlab to use my system's libraries and made some C++ mex file work. It could cause other problems, and might not fix things for you. There are various solutions on the Mathworks tech support forums and elsewhere on the web. Searching for the error message that you get will probably find something that works. I have much better luck with C mex files.

It’s still possible to break C mex files by linking them incorrectly though. Recently any time I tried to display a diagnostic with mexErrMsgTxt, my whole Matlab session bombed out with:

terminate called after throwing an instance of 'MathWorks::System::SimpleException'
zsh: abort      LD_PRELOAD= matlab -nodesktop -nosplash

The reason was that I had accidentally overridden mex’s carefully-set CFLAGS. The mex file could be compiled correctly from within Matlab (on Linux) in either of these two ways:

mex CFLAGS="\$CFLAGS -march=native -O2" mymexfile.c
% or:
system('mex CFLAGS=''$CFLAGS -march=native -O2'' mymexfile.c');

But not including the $CFLAGS, or failing to escape the $ in precisely the right way, caused the problem above. Note that mkoctfile uses CFLAGS set in the environment of your shell. The same escaping doesn’t work, but you can augment the CFLAGS it would normally use like this:

system('CFLAGS="-march=native $(mkoctfile -p CFLAGS)" mkoctfile --mex mymexfile.c')

Remember that for C++ you actually want to specify CXXFLAGS.

Octave usually doesn’t give nearly as many linking or compilation problems as Matlab. Octave is likely to have been built with the same compiler as installed on your system (at least if installed through a Linux distribution or self-compiled), so there are fewer sources of incompatibility.

OpenMP / multithreading C code

Multithreading is easy in a mex file: ‘just’ write multithreaded C code and then link it in. OpenMP allows you to parallelize some for loops by adding a single annotation and compiling with the right flags. See tl_omp_exp in fast_exp for an example.

As noted elsewhere on this page, multithreading isn’t necessarily a good idea for overall performance though. Also the Matlab API isn’t thread safe. Don’t try to call any mex or mx functions within threaded code. Same goes for the Octave API.

‘Embarassingly-parallel’ batch jobs

This probably isn’t a popular opinion with readers who’ve come to learn about parallel code, but: often I think simple single-threaded code is the way to go. (At least when using CPUs. And of course there are exceptions.)

When developing code I run simple, fast test cases so that I have some hope my code is correct. By the time I have serious computations to run, I probably have a grid of parameter settings, or a variety of control experiments to try out. Then I can run these experiments as independent parallel jobs. Matlab can now automatically use more than one CPU for a single job, but multithreading can be a bad idea. Running two jobs sequentially using two threads for each job is usually slower for me than running the jobs in parallel with one thread each.

Aside, detail on this claim: Super-linear speedups from multi-threading are possible due to cache effects. Also running two jobs in parallel might not work well if you need all your memory or shared cache for one job. But for my jobs, running parallel single-threaded jobs usually works best. Some algorithms (such as that used by eigs in Matlab) are memory bound and throwing more cores at them won’t help. If your code has some things that can’t be multi-threaded well, then some of your cores will spend some time sitting idle or pointlessly thrashing. Then, assuming you have more than one experiment to run, not doing multi-threading is faster over-all. I turn off multi-threading for production runs. Turning off multi-threading can also make profiling much less confusing.

In Matlab you can currently specify that you want single-threading with:

maxNumCompThreads(1);

However, this function has been deprecated in favour of starting Matlab with

matlab -singleCompThread

Although the version that I have installed doesn’t support this yet. In addition you may want to do setenv('OMP_NUM_THREADS','1') in your startup.m if anything you use is built with OpenMP support.

Running many separate jobs in parallel across the CPUs in a machine, or indeed across a cluster of machines, can be hard to keep track of. Having some good scripts to manage things for you is important. A good experiment setup can be interrupted and be restarted again (because you realize a mistake, or because the cluster goes down).

You can look at experiment_toolbox inside some code I’ve written. And on request I could give you the ugly bundle of shell-scripts I have used to farm things over machines. But neither are particularly sophisticated and I’m not going to support them.

Whatever batch system you do use, you’ll want to be able to run things from the command line. To run myscript.m in Matlab do:

matlab -nodisplay -r myscript -logfile myscript.log

Note myscript.m must end in quit or the Matlab session will stay open. Octave behaves like most Unix programs:

octave myscript.m > myscript.log

and doesn’t need you to finish myscript.m with quit, although doing so for Matlab compatibility does no harm.

3rd-party GPU toolkits

A disruptive change to computing has been the introduction of “General Purpose computations on Graphics Processing Units” (GPGPUs, or simply using GPUs). A modern desktop computer might be able to do a large computation 250x faster on its graphics card (GPU) than the central processing unit (CPU). That sort of speed-up is game-changing enough that it might be worth some implementation effort to achieve, and I start to reconsider my thoughts about multi-threading (above).

The theoretical speedups are hard to achieve. If you do something suboptimally, performance can disastrously crash back down to CPU levels. On the other hand, if you do standard vectorization tricks to reduce your bottlenecks to matrix operations, you might get a 30x speedup without having to do anything more sophisticated than get a library to do the linear algebra operations instead of Matlab’s intrinsics. That can change a month’s wait into working on your results the next day. (Ignoring the month you spent fighting with GPU software tools.)

In 2011 the options included Jacket, a commercial 3rd-party GPU library for Matlab, and GPUmat:, which was “freeware”, but is now open source abandonware. It was popular with some machine learning researchers, but is now obsolete.

The Mathworks Parallel Processing Toolbox was just introducing GPU support. I haven’t used it. As a separate product it’s often not available.

Back in 2011 I listed the Free Software offerings for Python for comparison. These included the simple (which can be a good thing) cudamat+gnumpy and the more sophisticated theano. Now there are a lot of machine learning toolkits that bring easy GPU support to Python.

Aside: most of the currently-available toolkits use NVIDIA’s CUDA language, rather than Open-CL which can run on other vendor’s cards.

In the future will GPU support be much more built into the main libraries? Will CPU vectorization automatically do much of what GPUs do now? Or is it important to learn GPU-specific technology? Whatever the answers to these questions, well-organized, functionally separated code, expressed where possible with standard vector and matrix operations will fare well.

Summary

Use the profiler.
Good code practice (separating functionality, good style, good test cases, version control, etc.) is more important than ever when optimizing code.
My view: for production runs and profiling purposes, turn off multi-threading and do as many jobs in parallel as you have CPUs. Use maxNumCompThreads(1); or matlab -singleCompThread, depending on Matlab version, and have a good batch experiment-running system. This advice won’t fit everyone.
If a hotspot is a loop, consult a vectorization guide or local expert for advice.
mex/oct files have their place, but can be a headache to maintain and debug. Develop the main C (or whatever) code separately, with native test cases, and make the Matlab/Octave glue as thin and separate as possible. Use good compiler options for your code, but be careful how you link the mex/oct file. Provide good instructions for others, and try to have an m-file for comparison and as a fallback.
GPUs are making a big difference for some people. But the tools are immature and in a state of flux. Can you afford to wait?

Quiz Answers

Have you thought about the quiz questions for yourself yet? If not, go back and have a think.

1. How would you sample, uniformly at random, K values from a list of N values without replacement?

The following costs O(N.log(N)):

idx = randperm(N);
idx = idx(1:K);
subset = values(idx);

This code is simple, and if it isn’t a bottleneck I’d stop there. However, finding the indices of the values, idx, should have a smaller big-O cost. One alternative algorithm could go something like this:

opts = 1:N; % This is still O(N) right here though
idx = zeros(K, 1); % remember to pre-allocate arrays when possible
for kk = 1:K
    last = (N-kk+1);
    ii = ceil(rand()*last);
    idx(kk) = opts(ii);
    opts(ii) = last;
end

This code isn’t straight-forward to vectorize. (Although I think it can be done!) But even in Octave, without a just-in-time compiler, this loop will be faster than the naive algorithm for K≪N. For K≪N I get much faster results again by sampling with replacement >K values until I get ≥K distinct values. Then I draw my K values without replacement from the smaller set using the naive algorithm.

idx = [];
while numel(idx) < K
    % For small K/N this loop will usually run once.
    % For larger K, the strategy should be adjusted!
    idx = unique(ceil(rand(2*K, 1)*N)); % O(K)
end
idx = idx(randperm(numel(idx))); % O(K.log(K))
idx = idx(1:K);

If building a responsible routine one would check K to avoid getting caught looping for a very long time. Perhaps replacing that magic value of 2 with something more principled. It would take a bit of effort to make a polished function that deals well with large ranges of K and N.

The statistics toolbox has a function that does exactly the required task:

subset = randsample(values, K);

Although using it could lead you into license manager hell. You could also do better yourself if you cared about this problem: the second two options I outlined above were faster for me when I tried N=1e6 and K=1e3.

2. You have a large NxD data array data, (with N=1e7 say). The first column consists of user IDs, integers between 1 and 1e5. A row, data(n,:), is only relevant if data(n,1) appears in interesting, a Kx1 list of interesting users (K=1e3 say). How would you construct subset, an array consisting of the interesting rows of data?

The main reason for including this exercise is that almost anything you might come up with involving for loops is likely to be disastrously slow.

This is perhaps the most obvious way involving indexing:

subset = data(ismember(data(:, 1), interesting), :);

This code takes about a second on my computer for the problem size specified. A second for a one-off pre-processing step is no problem. If we had a genuine need for speed there is more we can do. Given that we know that interesting consists of integers, we can create a lookup-table to test membership more quickly.

num_users = 1e5;
mask = false(num_users, 1);
mask(interesting) = true;
subset = data(mask(data(:,1)), :);

This solution is about 3x faster for the problem size given. The difference could be more extreme in other cases.

3. How would you quickly find the NxM square distances between two sets of D-dimensional vectors of size N and M stored in DxN and DxM arrays? How about if you measure distance with different lenghscales (Mahalanobis distance with a diagonal metric): d_n,m = Σ_d (x_d,n - x_d,m)² / l_d²

One answer is square_dist.m in my Matlab toolbox. For large arrays this m-file has been faster than mex files that ship with some Matlab toolboxes. The answer that uses the fewest flops depends on the dimensionality of the vectors, I usually assume D is big. One potential problem with my function is loss of numerical precision:

>> x1 = 1;
>> x2 = 1 + 1e-9;
>> (x1-x2)^2            
ans =
   1.0000e-18
>> x1^2 + x2^2 - 2*x1*x2
ans =
     0

I usually centre my data and decided that this wouldn't be an issue for me. If you grab a fast routine somewhere, will it match your assumptions? Be skeptical!

To find the Mahalanobis distance using Dx1 or 1xD lengthscales ell, simply scale the input data:

d2 = square_dist(x1 ./ ell(:), x2 ./ ell(:));

Or for older Matlab:

d2 = square_dist(bsxfun(@rdivide, x1, ell(:)), bsxfun(@rdivide, x2, ell(:)));

The (:) is a useful idiom so that you don't have to worry about which way around a vector is. Note that if N and M are large this O(D(N+M)) work is much more efficient than the O(DNM) work of scaling after taking the differences between vectors.

There is a general message: before scaling or shifting every element of a large matrix, consider whether the shift or scale can be performed before construction the matrix, or to the smaller results of applying it. For example, if you are going to repeatedly multiply a large array by a scalar (e.g., weight decay in neural network training) simply keep track of the overall scale of the matrix in a separate scalar variable, rather than doing O(NM) work every time.

4. You are given a list of N positive integers class=[1,9,3,4,...]. Find a neat way to get a logical NxK array, with only one true element in each row, where row n should be true in column number class(n).

One way is to ‘pre-compute’ all possible rows, and then do lookup for the answer for each row:

row_dict = logical(eye(max(class)));
encoding = row_dict(class, :);

If max(class) is big, the answer will be sparse, so this might be more appropriate:

encoding = sparse(1:numel(class), class, true);

The sparse call can be surrounded with full(...) to get a normal Matlab array, which is faster overall for large K. Finally here is yet another answer:

K = max(class(:));
N = numel(class);
encoding = false(N, K);
encoding(sub2ind([N, K], (1:N)', class(:))) = true;

The final answer uses indexing to specify the interesting elements, which is often useful. Maybe these indexes could be used directly later in the computation rather than producing this logical array?

5. You are given X, an NxD matrix of D-dimensional locations, and Y, an Nx1 vector of values at those locations. Lots of the rows of X are identical. Return new arrays Xu, an MxD matrix of unique D-dimensional locations, and Yu, a Mx1 vector of the sums of the Y values associated with those locations.

There’s an answer in my Matlab toolbox. That was fast enough for the real-time psychophysics experiment it was being used in, so I stopped. It could be made faster though because Matlab’s unique function could be faster. For large D one could get a speed-up by hashing the input locations in some way.

5. What compiler options would you use to compile a numerical C or C++ routine, and how should you link it with a mex/oct file?

Covered above.

Meta-question: (When) should we care about questions like these?

A common answer is that one should only worry about trying out clever vectorization tricks if you already have a correct version, and if profiling has said a speed-up is useful. My feeling is that knowing the common tricks, functions and indexing methods can help you write clearer Matlab/Octave code more quickly.

A possible answer is that the biggest speedups come from using GPUs and much of this document is a distraction. To some extent, I haven’t said much about GPUs because I’m out of date and haven’t got wide experience of using GPUs with Matlab to share. But having well-organized code, and algorithms expressed as cache-friendly standard numerical operations will always be useful, especially for porting code to a GPU.

Sometimes most of this document is a distraction! Although I generally stand by testing your code and making it clear, sometimes almost all of the above advice about making code faster is misplaced. There are several trivial “optimization strategies” that are often not given nearly enough consideration. These include: Use less data: do you actually need to be solving a problem on such a massive scale to get the answers you need...at least Right Now? Early stopping: if you are using a fancy optimizer it could be that most of its time is spent refining significant digits you don’t care about. Or worse, time could be spent overfitting. Maybe running your code for less time might give you the same or better results. Have you checked? Make your approach simpler: could something simpler work? I refer you to, without entirely agreeing with, a beautiful rant.

Acknowledgements

I received useful comments, which helped improve these notes, from Ali Eslami and Jonathan Pillow. In addition, the various references given above will have influenced me.