# Efficient Matlab and Octave

In March 2011 I was asked to provide a short tutorial on “writing efficient Matlab code”. Here are some notes from that tutorial, including some notes on good practice that don’t strictly relate to efficiency.

**For more detail, see some of this recommended reading:**

- YAGTOM: Yet Another Guide TO Matlab, by Matthew Dunham and Kevin Murphy.
- The Mathwork’s vectorization guide.
- Marios Athineos’s collection of notes (some slightly dated).
- Tom Minka’s notes.
- Loren Shure’s checklist of best practices and the rest of her blog.
- My other niche Matlab/Octave pages: tricks, oddities, R notes, code.

## Opening Quiz

Here are some things to think about.

- How would you sample, uniformly at random, K values from a list of N values
*without*replacement? - You have a large NxD data array
`data`, (with N=1e7 say). The first column consists of user IDs, integers between 1 and 1e5. A row,`data(n,:)`, is only relevant if`data(n,1)`appears in`interesting`, a Kx1 list of interesting users (K=1e3 say). How would you construct`subset`, an array consisting of the interesting rows of`data`? - How would you quickly find the NxM square distances between two sets of
D-dimensional vectors of size N and M stored in DxN and DxM arrays?
How about if you measure distance with different lenghscales (Mahalanobis
distance with a diagonal metric):
*d*= Σ_{n,m}_{d}(*x*)² /_{d,n}- x_{d,m}*l*²_{d} - You are given a list of N positive integers
`class=[1,9,3,4,...]`. Find a neat way to get a logical NxK array, with only one true element in each row, where row`n`should be true in column number`class(n)`. - You are given
`X`, an NxD matrix of D-dimensional locations, and`Y`, an Nx1 vector of values at those locations. Lots of the rows of X are identical. Return new arrays`Xu`, an MxD matrix of unique D-dimensional locations, and`Yu`, a Mx1 vector of the sums of the Y values associated with those locations. - What compiler options would you use to compile a numerical C or C++ routine,
and how should you link it with a
`mex`/`oct`file?

**Meta-question:** (When) should we care about questions like these?

My answers are below, but the questions will be more useful if you think about them for yourself first. Please contact me if you have interesting alternative answers or questions, or any other comments. Thank you to those who have helped improve this document already.

## Contents

- When (not) to use Matlab and Octave
- Vectorization
- Profiling and Testing
- Function documentation and input checking
- C (and other) extensions: mex and oct files
- 3rd-party GPU toolkits
**Summary**- Quiz Answers
- Acknowledgements

## When (not) to use Matlab and Octave

**Data types and structures:** In the beginning, MATLAB the
“MATrix LABoratory” had one data-type: a matrix of doubles. Matlab
software is still dominated by numerical computation with matrices. Personally I
like that. I don't want to coerce data into the right sort of vector, or data
type. I've got used to pushing blocks of numbers around and not having to
worry about types. (Although I do document the
size/orientation of each matrix.) In other languages I’ve found
coercing my data into the right type cumbersome, and often I’ve had to use
far more of someone’s framework than I’d like when using one of their
functions.

Historically, complex data-structures haven’t been easy or efficient in Matlab. It is now possible to do a fair amount with structure arrays, cell arrays and classes. But these aren’t always particularly efficient. Profile your code! Sometimes it is necessary to resort to linking with other languages. If much of your code is heavy on complicated data-structures then Matlab/Octave may not be the way to go. (e.g., Matlab only recently got native hash tables, and Octave doesn’t ship with them!)

**Free vs Proprietary:** Octave is Free Software,
whereas Matlab is closed source, cannot be modified and costs money. The
relative political/ethical importance of this distinction I’ll leave up to
you. But some practical pros and cons of Matlab/Octave directly relate to
their commercial/free development models.

Dealing with licenses: the license manager that Matlab uses is annoying to
set up. The toolboxes that Mathworks sell are expensive and so most institutions
never have quite enough copies. Then you find your code bombs out because you
used a trivial function, like `range`, which you could have implemented
yourself in one or two lines if you had realized it was part of an expensive
toolbox. This sort of thing punishes me for paying for software. If after
writing your code you want to deploy jobs across a high-performance cluster, you
may have to worry about licenses again.

Matlab and Octave are not just similar, some parts use the same underlying libraries. The performance of vectorized code is often similar, although Matlab uses MKL and Octave is often linked with another BLAS implementation, such as ATLAS. Overall I do run things in Matlab when available though, as it is much faster at interpreting loop-heavy code. There can be surprising differences. I have had lots of numerical trouble with Matlab’s quadrature routines, when Octave’s have worked fine, even though they use the same underlying library! Of course Octave, like any complex software system, has buggy routines too.

**Installing binaries:** Matlab comes pre-compiled and linked
with a barrage of required libraries. In some ways this bundling makes Matlab Just Work. But
because Matlab doesn't necessarily play nicely with the surrounding operating system, linking extensions can be a pain and getting updates
doesn’t happen automatically. At the time of writing I was six Matlab
releases behind the current version. And a recent experience with a collaborator
was that I
couldn’t even assume they had bsxfun. Why should they buy an upgrade
just to run one bit of code from me as a one-off? So I had to waste time writing
a compatibility
stub. (My code would have worked in Octave, but that wasn’t
installed and I didn’t have access to set it up.)

**Backwards compatibility** has a pretty good history in Matlab.
I have used research code from more than a decade ago with minimal or no
tweaking. I think commercial systems often feel a strong pressure to
keep their big customers happy, even when it means keeping warts in the
language. What are the chances that some Python+numpy+scipy code will work in
>10 years time? Octave hasn’t always been so Matlab compatible. In shifting
to Matlab-compatibility as a project goal, old Octave code has been broken. Code
written now that targets the large subset of the language that is supported by
Matlab and Octave is likely to be runnable (perhaps with minimal tweaking) for
some time to come.

**Plotting** is one of the things that Octave continues to be
least compatible with Matlab over. On the one hand, Octave plotting is now much
more compatible with Matlab. On the other hand, few of my old Octave plotting
scripts work any more. In many programming environments, anything to do with
displaying graphics or GUIs seems to be harder to keep working or port. As
always, separate out functionality: don’t make your core simulation depend
on plotting routines. Write out results and plot them separately, or make
plotting optional and swappable with callback functions.
For some purposes, R produces much more professional looking graphics out of the
box. Separating out your plotting routines will make it easier to redo plots in
(say) R.

## Vectorization

Writing an expensive computation as a cryptic line or two of dense Matlab can be
a fun puzzle. Is it worth it? Sometimes. Matlab’s JIT means that vectorization
is sometimes unnecessary or worse than more naive code.
**Always profile and test.**

The references listed above are just a few of the available notes you can find on the web. Here I provide a checklist of things you may want to look into if you are unsure about them.

**logical, integer and subscript indexing:** know how they all
work and how to use:

**Frequently used functions for vectorizing:**

`bsxfun`(often used where we used to use`repmat`)

There's no need to use`bsxfun`now that Matlab supports broadcasting (and Octave has for even longer).`histc``unique`(sometimes slower than it should be)`ones`/`true`and`zeros`/`false``reshape`,`vec=@(x)x(:);``cumsum``accumarray``diff`(slightly neater than doing yourself with indexing)- Others? Often mentioned, but I haven’t used much:
`vectorize`,`filter`,`kron`,`meshgrid`

If you’re a functional programming fan you may like to
play around with cell arrays and the “map” functions
`cellfun`, `arrayfun`, etc. Sometimes these allow neat, fast
code. If you have arranged computations like this, it should be easier to
parallelize them. Octave-forge contains a parallel version of `cellfun`: `parcellfun`.

You could consider using `sparse` arrays to vectorize some operations
that may not naturally map onto dense linear algebra operations.

**Pre-allocation**. If you must use a for loop, try not to grow an array inside it:

X = zeros(1, N); for nn = 1:N X(nn) = fn(nn); end % Do NOT do: X = []; for nn = 1:N X = [X fn(nn)]; end

Sometimes you have to guess how big the array will be. It is usually faster to make it too big, or double the length every time it fills up, and trim at the end. Matlab r2011a improved performance if you forget to pre-allocate. Before that, growing an array could easily be disastrously slow.

## Profiling and Testing

**Time individual snippets** with:

`tic; stuff(); wallclock_time_taken = toc``timeit(@() stuff(), 0)`— more careful and less variable timings.

(The second argument to `timeit`
makes it work in older versions of Octave.)
Which snippets should you bother to time?

**Use Matlab’s built-in graphical profiler:**`profile viewer`. It is awesome and happens to be the only part of the Matlab GUI (besides plot windows) that I use. Most computer time is usually spent in a small part of your code, and it can be surprising which lines of code dominate timings.- Don’t take it on trust that a 3rd-party “fast” replacement of a
Matlab primitive actually
*is*still faster in current Matlab. Benchmark it!

**Write test-cases** to check that versions of your code are
really producing the same output as each other. Also check on small cases you
can solve by hand.

I put tests in a sub-directory called testing. If I start Matlab there I can
still see my project’s code because my `startup.m` includes:

% If in a testing directory, add the parent directory to the Matlab path. % Make it easier to separate out tests into a subdirectory, % then I'm less likely to delete the tests and more likely to run them: if regexp(pwd,'/test(ing|s|)(|_)(|stuff|rig)$') try addgenpath('..'); % cleverer version from my Matlab toolbox catch addpath(genpath('..')); end end

If “shipping” my code, I create a startup.m in the testing directory:

% run usual startup script startups = which('startup', '-ALL'); if length(startups) > 1 run(startups{2}) end % Add the parent directory (the project I'm testing) to the Matlab path oldpwd = pwd; cd .. try; addgenpath(pwd); catch; addpath(genpath(pwd)); end cd(oldpwd); clear startups oldpwd ans

**Does your code work if some of your array dimensions happen to be one
or zero?** 1) Always specify the `dim` argument in
functions like `sum` unless you *know* you will always be
processing a vector, otherwise the summing direction might switch for some
inputs. 2) Know that Matlab cannot explicitly have ND-arrays with trailing
singleton dimensions, but that most functions allow you to access an arbitrary
number of trailing singleton dimensions as if they were there:

>> A = zeros(4,3,2,1); >> sz = size(A) sz = 4 3 2 >> sz(4) % bug, because last dimension has length 1 and didn't appear ??? Attempted to access sz(4); index out of bounds because numel(sz)=3. >> size(A,4) ans = 1 >> size(A,5) % implicitly exists even though you didn't put it there ans = 1

3) Don’t use `squeeze`, which can
squeeze dims you didn’t intend. Instead `permute` the singleton
dimensions you want to squeeze to the end of your ND-Array, where they become
invisible.

**Matlab computations can produce NaN, Inf and
complex numbers**. Is your code correct in those cases? It can be
hard to see where these esoteric numbers came from when they pop out as a final
result. Typing

`dbstop naninf`in Matlab causes a breakpoint whenever a

`NaN`or

`Inf`is produced. Putting

`log=@reallog;sqrt=@realsqrt;`at the top of your functions prevents common sources of unintended complex numbers.

## Function documentation and input checking

When optimizing a snippet of code, it’s often useful to separate it out
into a meaningful function. Separated-out functions are more easily swapped with
alternatives, tend to be more understandable, are more easily
passed to
`timeit`,
and are easier to exercise separately in
test cases.

A block of comments immediately beneath a function declaration is returned by
the Matlab `help` command. The first line is returned by the
`lookfor` command. This first line is conventionally the function name in capitals
(yuck) followed by a short description. As a minimum I usually add an example of
the function being used and a list of the sizes of all the input and output
arrays. For a minimal example, see my unique_totals.m
function. Creating these headers can be tedious, but I’ve found them
worthwhile. For example is the input data matrix NxD, as standard for a design matrix in
statistics, or DxN adjacent column vectors which keeps
data cases together in memory? People do both. I have done both. I pipe my function
header into a Python(!) program to help me make the documentation that keeps me
sane: docmatlab,
example.

The beginning of a function often has quite a bit of fairly boring code to check the number and sizes of arguments, fill in default values, pre-process arrays for the intended algorithm and so on. There is a tension between efficiency and robustness/clarity. For example I have an m-file DEFAULT.m, which allows me to do this:

function [xx, yy, zz] = blah(aa, bb, cc, dd) DEFAULT('bb', zeros(size(aa))); DEFAULT('dd', 0); ... >> blah(aa, [], cc); % 2nd & 4th arguments initialized to default values

I find using `DEFAULT` clearer and shorter than manually putting in
the logic to see if `bb` is defined and assigning a value if not.
However, `DEFAULT` adds even more overhead than these checks would
already have, which can cause a problem if the function is called frequently. I
rely on profiling to tell me when this overhead is really an issue.

One common pattern is that a function is called many times, in a loop or recursively with similar arguments. After the first valid call, the arguments might be guaranteed to be valid and complete given what has gone before. In these cases I have an end-user facing function with extensive argument checking. This front-end function only checks the user-provided arguments once and then carefully drives an internal/private function many times. This internal function has minimal checking.

function [xx, yy, zz] = sprocket_solver(lots, of, args) % Documentation % Extensive arg checking and processing for big_loop = 1:maybe_lots sprocket_helper(pre, processed, args); end function sprocket_helper(pre, processed, args) % We now get on with some serious work with little arg-checking. % sprocket_helper might call itself recursively, but sprocket_solver % shouldn't because it has lots of arg-checking overhead.

Most of Matlab’s library routines do a lot of checking, and are quite general. If you are sure of your input, only need a simple case, and call the function often, then writing your own version can give a speedup. Although realize that you will be throwing away some of the care that has gone into writing seemingly trivial functions.

## C (and other) extensions: mex and oct files

First, question whether you really want to write some of your code in another language. What are the chances that it and all its dependencies will compile and link cleanly with Matlab/Octave on a different system? How about 10 years from now?

Second, question whether you can actually write fast C code. If your algorithm involves large matrix operations, you may have to call optimized BLAS routines to do some of the work, or you risk being outperformed by the m-file that you are trying to replace.

There is fairly neat Java (and now C#) integration in Matlab. Using these features could help with Matlab’s limited data-structure support. However, I’m not much of a fan of Java and these language interfaces aren’t compatible with Octave, so personally I avoid them.

Personally, if I want to write low level code I tend to do it in ‘plain C’ (ANSI C, or C99). It’s much less hassle to link than C++, compiles much faster, and the compilers tend to give me more interpretable error messages.

### Matlab extension basics

Matlab comes with a script called `mex` to compile and link
C/C++/Fortran extensions. Matlab also exposes a command called `mex` within
Matlab, which seems to do the same thing as calling the command via,
`system('mex')`. I’ve had most success calling the command from
within Matlab; linking correctly can be tricky, and you’re more
likely to get the environment set up correct if compiling within Matlab.

You’ll need to look at the documentation and find examples to copy and paste. But here are some notes to orientate you. A mex C program needs to contain:

#include "mex.h" void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { /* main code */ }

Rather than a `main()` routine. As you would expect, plhs and prhs are
arrays of pointers to mex-defined datatypes. There are a bunch of functions in
the Matlab documentation beginning with `mex`, which are general routines for
interfacing with Matlab. The routines to do with manipulating `mxArray`s
all begin with `mx`.

There are some example programs in the
`<MATLAB>/extern/examples/mex` directory. You can probably copy and
paste one of those to form the basis of your first simple mex file. Here is an example
program to perform an elementwise function to a general ND-Array:

#include "mex.h" void my_element_wise_function(double* inputs, double* answer, int nn); void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { const mxArray *mxIn; double *outArray, *inArray; int nIn; if (nrhs != 1) mexErrMsgTxt("There must be exactly one input."); if (nlhs > 1) /* can be "zero", in which case Matlab allows one */ mexErrMsgTxt("There must be at most one output."); /* get information about the input */ mxIn = prhs[0]; inArray = mxGetPr(mxIn); nIn = mxGetNumberOfElements(mxIn); /* create output the same size as the input */ plhs[0] = mxCreateNumericArray( mxGetNumberOfDimensions(mxIn), mxGetDimensions(mxIn), mxDOUBLE_CLASS, mxREAL); outArray = mxGetPr(plhs[0]); my_element_wise_function(inArray, outArray, nIn); }

Where `my_element_wise_function` could be defined in a separate C
file. This separation lets me write and test the interesting code in a C standalone program without worrying
about mex, before finally linking:

system('gcc -c my_func.c'); mex my_mex_file.c my_func.o

Given the above code, it might be tempting to try to modify `inArray`
in place. There is no supported way of doing in-place operations in mex. If you write a mex file that
does that anyway, here is what can happen:

>> A = 1 A = 1 >> B = A % Matlab internally does copy-on-write, so just points to A B = 1 >> modify_in_place(A); A A = 2.7183 >> B % B still points to A's memory; Matlab didn't notice the change B = 2.7183

There are hacks to avoid accidentally modifying other variables, (do
`A(1)=A(1);` before calling the mex file on `A`), but they
aren’t guaranteed to stay working.

### Octave extension basics

Octave has a mex compatibility layer. Use:

mkoctfile --mex

in place of `mex`. In Debian/Ubuntu you may need to install the
`octave-headers` package to get `mkoctfile`. It doesn’t
support everything, but for simple mex files it seems to work fine.

Octave’s mex files have more overhead time than in Matlab, partly
because of compatibility conversions, but partly because Octave doesn’t
trust you not to alter the inputs, so unnecessarily copies them before giving
them to you. Some, but not all, of this overhead can be removed by writing a
native oct extension. If you separate out the main work of your routine into a
standalone .c or .cc file, writing an extra interface stub is little work.
See `tl_exp.cc` in fast_exp,
which is almost entirely boiler-plate, for an example.

The
documentation for writing Octave extensions seems a little thin. However,
the `examples` directory provides a good starting point. In Debian/Ubuntu the examples are
currently installed in `/usr/share/doc/octave3.2-headers/examples`, or
you could grab the source tarball from the Octave website.

### Debugging C code

It’s all too easy with C (or C++) to mis-use memory and get random
crashes (‘segmentation faults’, ‘bus errors’, etc.) some
time after the mistake was actually made. These mistakes are now easy to find!
Compile a C program with `-g` to include debug symbols and run it through
valgrind:

valgrind ./my_program

It usually tells you the precise line on which you made your mistake. In fact
valgrind works regardless of which compiled language you used to make
`my_program`.

It is possible to run your whole Matlab session inside valgrind (see the man
page about tracing children). But loading Matlab in valgrind is problematic
because a) valgrind slows everything down dramatically, so loading Matlab
and doing anything takes *ages*; b) valgrind throws up so many
warnings to do with Matlab itself, finding *your* transgressions is
tricky.

It is *strongly* recommended that you find all the mistakes that you
can in your code before attempting to run it inside Matlab. Make your main
functions take normal C arrays rather than Matlab data types, and make a small
standalone program to exercise your code for testing. After perfecting your C
routine, *then* write a thin Matlab interface in a separate .c file. The
interface obtains pointers to the input and output data and calls your C
routine, which itself doesn’t need to know anything about mex. This
separation will make it easier for you to use your C code in other places. For
example you could write a Python wrapper, or write a .cc file giving a native
Octave interface (rather than forcing Octave to use its Matlab compatibility
interface).

Another useful C debugging tool is mudflap (should come with gcc, but in
Debian/Ubuntu I had to install a package called `libmudflap0- X.Y-dev`,
where

`X.Y`matches the current version of gcc):

gcc -g -fmudflap -lmudflap main.c -o my_program

The program runs much faster than under valgrind, but still reports memory mistakes, sometimes catching ones that valgrind misses, sometimes missing ones valgrind gets. Note: valgrind will not work on programs linked with mudflap; do a normal "production" build before testing with valgrind.

Update November 2015: mudflap has been deprecated in favour of "address sanitzer", and removed from recent versions of gcc. Instead compile with:

gcc -g -fsanitize=address -fno-omit-frame-pointer main.c -o my_program

You can replace "gcc" with "clang" if you prefer. Programs run faster than under valgrind, and stack level problems are found that valgrind misses. Again, maybe valgrind will find some mistakes that address sanitizer won’t (I don’t know yet). If trying valgrind, don’t also compile with address sanitizer!

### C/C++ compilers and their options

How much do you care about numerical correctness, IEEE floating point
conformance, etc.? Your priorities will guide some of your compiler options. For
example: `-mieee-with-inexact` means you really want very strict IEEE
conformance, even on hardware where this is slow (I don’t), whereas
`-ffast-math` says you don’t care about a bunch of correctness
concerns.

A `gcc` compiler option I’ve sometimes found *really* important on x86 or
i686 architectures is `-msse2`. This flag allows the compiler to
automatically vectorize some floating point operations, potentially giving a large speed-up. `gcc`
doesn’t turn on the option by default because the
required instructions are not implemented by all 32-bit processors. Although the
flag is turned on by default for `x86_64`.

**To avoid keeping track of processor flags**, and which CPU I
need for them to work, I tend to compile with `-march=native`, which means
“use whatever the CPU in this machine can support”. Be careful if
transferring binaries to other machines however.

I usually use the `-O2` optimization flag. The `-O3` flag
sometimes makes things better, but can make things worse and has a reputation
for being more buggy.

I often turn on all the warning and `-pedantic` flags I can without
getting too irritated.

I’ve heard informally that the Intel
compilers sometimes produce much faster code than `gcc`.
It’s not very expensive for academic use, but academic use *cannot*
use the free personal license (last time I checked). I value my time, and have
never bothered sorting out a purchase and fighting
with licenses.

### Building mex/oct and linking

C++ is notoriously binary incompatible between versions. Matlab won’t
have been compiled with the same compiler as installed on your system. If you
compile C++ mex files on your system and try to use them they *often*
won’t work. Currently I have Matlab aliased to start like this:

LD_PRELOAD='/lib/libgcc_s.so.1:/usr/lib/libstdc++.so.6.0.13:/lib/libz.so.1.2.3.3' matlab

This line causes Matlab to use my system's libraries and made some C++ mex file work. It could cause other problems, and might not fix things for you. There are various solutions on the Mathworks tech support forums and elsewhere on the web. Searching for the error message that you get will probably find something that works. I have much better luck with C mex files.

It’s still possible to break C mex files by linking them incorrectly
though. Recently any time I tried to display a diagnostic with
`mexErrMsgTxt`, my whole Matlab session bombed out with:

terminate called after throwing an instance of 'MathWorks::System::SimpleException' zsh: abort LD_PRELOAD= matlab -nodesktop -nosplash

The reason was that I had accidentally overridden mex’s carefully-set CFLAGS. The mex file could be compiled correctly from within Matlab (on Linux) in either of these two ways:

mex CFLAGS="\$CFLAGS -march=native -O2" mymexfile.c % or: system('mex CFLAGS=''$CFLAGS -march=native -O2'' mymexfile.c');

But not including the `$CFLAGS`, or failing to escape the `$`
in precisely the right way, caused the problem above. Note that
`mkoctfile` uses `CFLAGS` set in the environment of your shell.
The same escaping doesn’t work, but you can augment the `CFLAGS` it
would normally use like this:

system('CFLAGS="-march=native $(mkoctfile -p CFLAGS)" mkoctfile --mex mymexfile.c')

Remember that for C++ you actually want to specify `CXXFLAGS`.

Octave usually doesn’t give nearly as many linking or compilation problems as Matlab. Octave is likely to have been built with the same compiler as installed on your system (at least if installed through a Linux distribution or self-compiled), so there are fewer sources of incompatibility.

### OpenMP / multithreading C code

Multithreading is easy in a mex file: ‘just’ write multithreaded
C code and then link it in. OpenMP allows you to parallelize some for loops by
adding a single annotation and compiling with the right flags. See
`tl_omp_exp` in fast_exp
for an example.

As noted elsewhere on this page, multithreading isn’t necessarily a good idea for overall performance though. Also the Matlab API isn’t thread safe. Don’t try to call any mex or mx functions within threaded code. Same goes for the Octave API.

## ‘Embarassingly-parallel’ batch jobs

This probably isn’t a popular opinion with readers who’ve come to
learn about parallel code, but: *often I think simple single-threaded code is
the way to go*. (At least when using CPUs. And of course there are exceptions.)

When developing code I run simple, fast test cases so that I have some hope my code is correct. By the time I have serious computations to run, I probably have a grid of parameter settings, or a variety of control experiments to try out. Then I can run these experiments as independent parallel jobs. Matlab can now automatically use more than one CPU for a single job, but multithreading can be a bad idea. Running two jobs sequentially using two threads for each job is usually slower for me than running the jobs in parallel with one thread each.

*Aside, detail on this claim:* Super-linear speedups from
multi-threading are possible due to cache effects. Also running two jobs in
parallel might not work well if you need all your memory or shared cache for one
job. But for my jobs, running parallel single-threaded jobs usually works best.
Some algorithms (such as that used by `eigs` in Matlab) are memory bound
and throwing more cores at them won’t help. If your code has some things
that can’t be multi-threaded well, then some of your cores will spend some
time sitting idle or pointlessly thrashing. Then, assuming you have more than
one experiment to run, *not* doing multi-threading is faster over-all. I
turn off multi-threading for production runs. Turning off multi-threading can
also make profiling much less confusing.

In Matlab you can currently specify that you want single-threading with:

maxNumCompThreads(1);

However, this function has been deprecated in favour of starting Matlab with

matlab -singleCompThread

Although the version that I have installed doesn’t support this yet. In
addition you may want to do `setenv('OMP_NUM_THREADS','1')` in your
`startup.m` if anything you use is built with OpenMP support.

Running many separate jobs in parallel across the CPUs in a machine, or
indeed across a cluster of machines, can be hard to keep track of. Having some
good scripts to manage things for you is important. **A good experiment
setup can be interrupted and be restarted again** (because you
realize a mistake, or because the cluster goes down).

You can look at `experiment_toolbox` inside
some
code I’ve written. And on request I could give you the ugly bundle
of shell-scripts I have used to farm things over machines. But neither are
particularly sophisticated and I’m not going to support them.

Whatever batch system you do use, you’ll want to be able to run things from the
command line. To run `myscript.m` in Matlab do:

matlab -nodisplay -r myscript -logfile myscript.log

Note `myscript.m` *must* end in `quit` or the Matlab
session will stay open. Octave behaves like most Unix programs:

octave myscript.m > myscript.log

and doesn’t need you to finish `myscript.m` with `quit`,
although doing so for Matlab compatibility does no harm.

## 3rd-party GPU toolkits

A disruptive change to computing has been the introduction of “General Purpose computations on Graphics Processing Units” (GPGPUs, or simply using GPUs). A modern desktop computer might be able to do a large computation 250x faster on its graphics card (GPU) than the central processing unit (CPU). That sort of speed-up is game-changing enough that it might be worth some implementation effort to achieve, and I start to reconsider my thoughts about multi-threading (above).

The theoretical speedups are hard to achieve. If you do something suboptimaly, performance can disastrously crash back down to CPU levels. On the other hand, if you do standard vectorization tricks to reduce your bottlenecks to matrix operations, you might get a 30x speedup without having to do anything more sophisticated than get a library to do the linear algebra operations instead of Matlab’s intrinsics. That can change a month’s wait into working on your results the next day. (Ignoring the month you spent fighting with GPU software tools.)

**Jacket:** commercial
3rd-party GPU support for Matlab.

**GPUmat:**
“freeware”, but *not* Free Software (a
real concern for long term availability and viability) 3rd party GPU support for
Matlab. Popular with some machine learning researchers. For example Graham
Taylor has put GPUmat to good
use.

**The Mathworks Parallel Processing Toolbox** is introducing GPU
support. I haven’t used it. As a separate product it’s often not
available.

It may be instructive to compare to the Free Software offerings for Python. These include the simple (which can be a good thing) cudamat+gnumpy and the more sophisticated theano.

Aside: most of the currently-available toolkits use NVIDIA’s CUDA language, rather than Open-CL which can run on other vendor’s cards.

In the future will GPU support be much more built into the main libraries? Will CPU vectorization automatically do much of what GPUs do now? Or is it important to learn GPU-specific technology? Whatever the answers to these questions, well-organized, functionally separated code, expressed where possible with standard vector and matrix operations will fare well.

## Summary

**Use the profiler**.- Good code practice (separating functionality, good style, good test cases, version control, etc.) is more important than ever when optimizing code.
- My view: for production runs and profiling purposes, turn off
multi-threading and do as many jobs in parallel as you have CPUs. Use
`maxNumCompThreads(1);`or`matlab -singleCompThread`, depending on Matlab version, and have a good batch experiment-running system. This advice won’t fit everyone. - If a hotspot is a loop, consult a vectorization guide or local expert for advice.
`mex`/`oct`files have their place, but can be a headache to maintain and debug. Develop the main C (or whatever) code separately, with native test cases, and make the Matlab/Octave glue as thin and separate as possible. Use good compiler options for your code, but be careful how you link the`mex`/`oct`file. Provide good instructions for others, and try to have an m-file for comparison and as a fallback.- GPUs are making a big difference for some people. But the tools are immature and in a state of flux. Can you afford to wait?

## Quiz Answers

*Have you thought about the quiz questions for yourself yet?
If not, go back and have a think.*

**1. How would you sample, uniformly at random, K values from a list of N values
without replacement?**

The following costs O(N.log(N)):

idx = randperm(N); idx = idx(1:K); subset = values(idx);

This code is simple, and if it isn’t a bottleneck I’d stop there.
However, finding the indices of the values, `idx`, should have a smaller
big-O cost. One alternative algorithm could go something like this:

opts = 1:N; % This is still O(N) right here though idx = zeros(K, 1); % remember to pre-allocate arrays when possible for kk = 1:K last = (N-kk+1); ii = ceil(rand()*last); idx(kk) = opts(ii); opts(ii) = last; end

This code isn’t straight-forward to vectorize. (Although I think it can be done!) But even in Octave, without a just-in-time compiler, this loop will be faster than the naive algorithm for K≪N. For K≪N I get much faster results again by sampling with replacement >K values until I get ≥K distinct values. Then I draw my K values without replacement from the smaller set using the naive algorithm.

idx = []; while numel(idx) < K % For small K/N this loop will usually run once. % For larger K, the strategy should be adjusted! idx = unique(ceil(rand(2*K, 1)*N)); % O(K) end idx = idx(randperm(numel(idx))); % O(K.log(K)) idx = idx(1:K);

If building a responsible routine one would check K to avoid getting caught
looping for a *very* long time. Perhaps replacing that magic value of 2
with something more principled. It would take a bit of effort to make a polished
function that deals well with large ranges of K and N.

The statistics toolbox has a function that does exactly the required task:

subset = randsample(values, K);

Although using it could lead you into license manager hell. You could also do better yourself if you cared about this problem: the second two options I outlined above were faster for me when I tried N=1e6 and K=1e3.

**2. You have a large NxD data array data, (with N=1e7 say). The first
column consists of user IDs, integers between 1 and 1e5. A row,
data(n,:), is only relevant if data(n,1) appears in
interesting, a Kx1 list of interesting users (K=1e3 say). How would
you construct subset, an array consisting of the interesting rows of
data?**

The main reason for including this exercise is that almost anything you might come up with involving for loops is likely to be disastrously slow.

This is perhaps the most obvious way involving indexing:

subset = data(ismember(data(:, 1), interesting), :);

This code takes about a second on my computer for the problem size specified.
A second for a one-off pre-processing step is no problem. If we had a genuine
need for speed there is more we can do. Given that we know that
`interesting` consists of integers, we can create a lookup-table to test
membership more quickly.

num_users = 1e5; mask = false(num_users, 1); mask(interesting) = true; subset = data(mask(data(:,1)), :);

This solution is about 3x faster for the problem size given. The difference could be more extreme in other cases.

**3. How would you quickly find the NxM square distances between two sets of
D-dimensional vectors of size N and M stored in DxN and DxM arrays?
How about if you measure distance with different lenghscales (Mahalanobis
distance with a diagonal metric):**
*d _{n,m}* = Σ

_{d}(

*x*)² /

_{d,n}- x_{d,m}*l*²

_{d}One answer is `square_dist.m`
in my
Matlab toolbox. For large arrays this m-file has been faster than mex files
that ship with some Matlab toolboxes. The answer that uses the fewest flops
depends on the dimensionality of the vectors, I usually assume D is big. One
potential problem with my function is loss of numerical precision:

>> x1 = 1; >> x2 = 1 + 1e-9; >> (x1-x2)^2 ans = 1.0000e-18 >> x1^2 + x2^2 - 2*x1*x2 ans = 0

I usually centre my data and decided that this wouldn't be an issue for me. If you grab a fast routine somewhere, will it match your assumptions? Be skeptical!

To find the Mahalanobis distance using Dx1 or 1xD lengthscales
`ell`, simply scale the input data:

d2 = square_dist(x1 ./ ell(:), x2 ./ ell(:));

Or for older Matlab:

d2 = square_dist(bsxfun(@rdivide, x1, ell(:)), bsxfun(@rdivide, x2, ell(:)));

The `(:)` is a useful idiom so that you don't have to worry about
which way around a vector is. Note that if N and M are large this O(D(N+M))
work is *much* more efficient than the O(DNM) work of scaling after
taking the differences between vectors.

There is a general message: before scaling or shifting every element of a large matrix, consider whether the shift or scale can be performed before construction the matrix, or to the smaller results of applying it. For example, if you are going to repeatedly multiply a large array by a scalar (e.g., weight decay in neural network training) simply keep track of the overall scale of the matrix in a separate scalar variable, rather than doing O(NM) work every time.

**4. You are given a list of N positive integers class=[1,9,3,4,...].
Find a neat way to get a logical NxK array, with only one true
element in each row, where row n should be true in column number
class(n).**

One way is to ‘pre-compute’ all possible rows, and then do lookup for the answer for each row:

row_dict = logical(eye(max(class))); encoding = row_dict(class, :);

If `max(class)` is big, the answer will be sparse, so this might be
more appropriate:

encoding = sparse(1:numel(class), class, true);

The `sparse` call can be surrounded with `full(...)` to get a
normal Matlab array, which is faster overall for large K. Finally here is yet another answer:

K = max(class(:)); N = numel(class); encoding = false(N, K); encoding(sub2ind([N, K], (1:N)', class(:))) = true;

The final answer uses indexing to specify the interesting elements, which is often useful. Maybe these indexes could be used directly later in the computation rather than producing this logical array?

**5. You are given X, an NxD matrix of D-dimensional locations, and
Y, an Nx1 vector of values at those locations. Lots of the rows of
X are identical. Return new arrays Xu, an MxD matrix of unique
D-dimensional locations, and Yu, a Mx1 vector of the sums of the Y
values associated with those locations.**

There’s an
answer in my Matlab toolbox.
That was fast enough for the real-time psychophysics experiment it was being
used in, so I stopped. It could be made faster though because Matlab’s
`unique` function could be faster. For large D one could get a
speed-up by hashing the input locations in some way.

**5. What compiler options would you use to compile a numerical C or C++ routine,
and how should you link it with a mex/oct
file?**

**Meta-question: (When) should we care about questions like these?**

A common answer is that one should only worry about trying out clever vectorization tricks if you already have a correct version, and if profiling has said a speed-up is useful. My feeling is that knowing the common tricks, functions and indexing methods can help you write clearer Matlab/Octave code more quickly.

A possible answer is that the biggest speedups come from using GPUs and
much of this document is a distraction. To *some* extent,
I haven’t said much about GPUs
because I’m out of date and haven’t got wide
experience of using GPUs with Matlab to share. But having well-organized
code, and algorithms expressed as cache-friendly standard numerical
operations will always be useful, especially for porting code to a GPU.

**Sometimes most of this document is a distraction!** Although I
generally stand by testing your code and making it clear, *sometimes*
almost all of the above advice about making code faster is misplaced. There
are several trivial “optimization strategies” that are often not
given nearly enough consideration. These include: **Use less
data:** do you actually need to be solving a problem on such a
massive scale to get the answers you need...at least Right Now?
**Early stopping:** if you are using a fancy optimizer it could
be that *most* of its time is spent refining significant digits you
don’t care about. Or worse, time could be spent overfitting. Maybe
running your code for less time might give you the same or better results.
Have you checked? **Make your approach simpler:** could
something simpler work? I refer you to, without *entirely*
agreeing with, a beautiful rant.

## Acknowledgements

I received useful comments, which helped improve these notes, from Ali Eslami and Jonathan Pillow. In addition, the various references given above will have influenced me.