MATLAB: Vectorisation is a double-edged sword
Update (2nd July 2015): The code on github has moved on a little since this post was written, so I changed the link in the text below to point at the exact commit that produced the results discussed here.
Imagine that you are a very new MATLAB programmer and you have to create an N x N matrix called A where A(i,j) = i + j.
Your first attempt at a solution might look like this:
N = 2000;
% Generate an N-by-N matrix where A(i,j) = i + j
for ii = 1:N
    for jj = 1:N
        A(ii,jj) = ii + jj;
    end
end
On my current machine (MacBook Pro bought in early 2015), the above loop takes 2.03 seconds. You might think that this is a long time for something so simple and complain that MATLAB is slow. The person you complain to points out that you should preallocate your array before assigning to it, since growing an array element by element forces MATLAB to repeatedly reallocate and copy it.
N = 2000;
% Generate an N-by-N matrix where A(i,j) = i + j
A = zeros(N,N);
for ii = 1:N
    for jj = 1:N
        A(ii,jj) = ii + jj;
    end
end
This now takes 0.049 seconds on my machine – a speed-up of over 41 times! MATLAB suddenly doesn’t seem so slow after all.
Word gets around about your problem, however, and seasoned MATLAB-ers see that nested loop, develop a funny facial twitch and start muttering ‘vectorise, vectorise, vectorise’. Emails soon pile in with vectorised solutions:
% Method 1: MESHGRID
[X, Y] = meshgrid(1:N, 1:N);
A = X + Y;
This takes 0.025 seconds on my machine – a healthy speed-up on the loop-with-preallocation solution. You have to understand the meshgrid command, however, in order to understand what’s going on here. It’s still clear (to me at least) what it’s doing, but not as clear as the nice, obvious double loop. Call me old fashioned but I like loops…I understand them.
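If meshgrid is new to you, a small case makes it obvious what’s being added together (N = 3 here purely for illustration, not part of the timed code):

% meshgrid returns the column indices in X and the row indices in Y
[X, Y] = meshgrid(1:3, 1:3)
% X =          Y =          X + Y =
%  1 2 3        1 1 1        2 3 4
%  1 2 3        2 2 2        3 4 5
%  1 2 3        3 3 3        4 5 6
% so (X + Y) has (i,j) entry i + j, exactly as required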
% Method 2: Matrix multiplication
A = (1:N).' * ones(1, N) + ones(N, 1) * (1:N);
This one is MUCH harder to read, but you don’t worry about it too much because, at 0.032 seconds, it’s slower than meshgrid.
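To unpick it: each term is an outer product that builds one of the two index grids. The named variables below are my own, added purely for explanation:

% (1:N).' * ones(1, N) multiplies an N-by-1 column by a 1-by-N row of ones,
% giving a matrix whose (i,j) entry is i; the second product similarly gives j
rowIdx = (1:N).' * ones(1, N);   % rowIdx(i,j) = i
colIdx = ones(N, 1) * (1:N);     % colIdx(i,j) = j
A = rowIdx + colIdx;             % A(i,j) = i + j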
% Method 3: REPMAT
A = repmat(1:N, N, 1) + repmat((1:N).', 1, N);
This one appears to be interesting! At 0.009 seconds, it’s the fastest so far – by a healthy amount!
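If you haven’t met repmat before, it simply tiles copies of its input; the two tilings here are the row and column index grids. The small example below is my own annotation, not part of the timed code:

% repmat(1:N, N, 1) stacks N copies of the row vector 1:N, so entry (i,j) = j
% repmat((1:N).', 1, N) tiles N copies of the column vector, so entry (i,j) = i
% For N = 3:
%   repmat(1:3, 3, 1)      repmat((1:3).', 1, 3)
%        1 2 3                  1 1 1
%        1 2 3                  2 2 2
%        1 2 3                  3 3 3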
% Method 4: CUMSUM
A = cumsum(ones(N)) + cumsum(ones(N), 2);
Coming in at 0.052 seconds, this cumsum solution is slower than the preallocated loop.
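The cumsum trick deserves a word of explanation: running sums over a matrix of ones regenerate the row and column indices. A small sketch (my annotation, using N = 3 for illustration):

% cumsum(ones(N)) sums down each column, so its (i,j) entry is i
% cumsum(ones(N), 2) sums along each row, so its (i,j) entry is j
%   cumsum(ones(3))        cumsum(ones(3), 2)
%        1 1 1                  1 2 3
%        2 2 2                  1 2 3
%        3 3 3                  1 2 3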
% Method 5: BSXFUN
A = bsxfun(@plus, 1:N, (1:N).');
Ahhh, bsxfun or ‘The Widow-maker function’ as I sometimes refer to it. Responsible for some of the fastest and most unreadable vectorised MATLAB code I’ve ever written. In this case, it brings execution time down to 0.0045 seconds.
Whenever I see something that can be vectorised with a repmat, I try to figure out if I can rewrite it as a bsxfun. The result is usually horrible to look at so I tend to keep the original loop commented out above it as an explanation! This particular example isn’t too bad but bsxfun can quickly get hairy.
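If you’ve not met bsxfun before: it applies a binary function elementwise, virtually expanding any singleton dimensions so the two arguments match, which is why it avoids the explicit copies that repmat makes. A tiny sketch of the broadcasting (my own example, not from the timed code):

% A 1-by-3 row meets a 2-by-1 column; bsxfun expands both to 2-by-3 and adds:
bsxfun(@plus, [10 20 30], [1; 2])
% ans =
%     11    21    31
%     12    22    32
% In Method 5, 1:N (1-by-N) meets (1:N).' (N-by-1), giving an N-by-N result
% with entry (i,j) = i + j and no intermediate N-by-N copies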
Conclusion
Loops in MATLAB aren’t anywhere near as bad as they used to be thanks to advances in JIT compilation but it can often pay, speed-wise, to switch to vectorisation. The price you often pay for this speed-up is that vectorised code can become very difficult to read.
If you’d like the code I ran to get the timings above, it’s on github (the link refers to the exact commit used for this post). Here’s the output from the run I referred to in this post.
Original loop time is 2.025441
Preallocate and loop is 0.048643
Meshgrid time is 0.025277
Matmul time is 0.032069
Repmat time is 0.009030
Cumsum time is 0.051966
bsxfun time is 0.004527
- MATLAB Version: 2015a
- Early 2015 MacBook Pro with 16GB RAM
- CPU: 2.8GHz quad-core i7 (Haswell)
This post is based on a demonstration given by MathWorks’ Ken Deeley during a recent session at The University of Sheffield.
What about MATLAB-generated C code?
Great post!
This is very interesting and I will file this article and the tests in my repository for later use.
The output on my machine:
Original loop time is 3.510969
Preallocate and loop is 0.051073
Meshgrid time is 0.023769
Matmul time is 0.031171
Repmat time is 0.020366
Cumsum time is 0.051336
bsxfun time is 0.009737
– MATLAB Version: 2015a
– HP Notebook with 24GB RAM
– CPU: 2.8GHz eight-core i7
– OS: Win7 Prof., 64-bit
Best,
Dominik
Interesting. I bash Matlab (and Euler Math Toolbox) sometimes because you have to resort to all sorts of tricks to get things to run fast. This is bad: you should not have to spend time on such tricks when the solution in a compiled language is obvious. It is like doing math with Lisp. It can be done, but the price is high.
Now I see that some loops are well optimized by Matlab. The JIT compiler was not part of my Matlab version when I last tried it, however. My recent experience was a Matlab code that took me 4 hours to fix and ran in 90 seconds, compared to the same code in Java, developed in one hour and running in 5 seconds. That used dictionaries and file access etc. I am not sure if JIT would have helped.
Here are the results for EMT. The vectorization is much simpler thanks to the extended matrix language. The C code is in fact slower; I am not sure why that happens. The loops in EMT are not compiled and are thus very slow.
>n=2000;
>tic; (1:n)+(1:n)'; toc;
Used 0.021 seconds
>function tinyc mm (n) ...
$ARG_INT(n);
$RES_DOUBLE_MATRIX(A,n,n);
$for (int i=1; i<=n; i++)
$ for (int j=1; j<=n; j++)
$  A[(i-1)*n+(j-1)] = i+j;
$endfunction
>tic; mm(n); toc
Used 0.067 seconds
0.067
>function mm1 (n) ...
$A=zeros(n,n);
$for i=1 to n;
$ for j=1 to n;
$ A[i,j]=i+j;
$ end;
$end;
$return A;
$endfunction
>tic; mm1(n); toc;
Used 7.225 seconds
Two things:
1) Did you try inverting the order of the loops? From my knowledge of MATLAB and Fortran, it seems that you are accessing the array in the wrong memory order.
2) For reference (not for competition): in C++, this (non-parallel) code takes 0.0139 seconds on my Intel® Core™ i7-3612QM CPU @ 2.10GHz × 8, with all optimizations on:
boost::multi_array<double, 2> A(boost::extents[2000][2000]);
for(int i = 0; i != 2000; ++i){
    for(int j = 0; j != 2000; ++j){
        A[i][j] = i + j;
    }
}
@alfC
1. No, we didn’t, and you are correct – we were doing it in the wrong order. I guess we were too focused on vectorisation. I am officially embarrassed! Anyway… here’s what happens if you switch the loop order.
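MATLAB stores arrays column-major, so putting jj on the outer loop fills each column contiguously in memory. A minimal sketch of the switched version (the exact timed code is in the linked repo):

N = 2000;
A = zeros(N, N);
for jj = 1:N          % columns on the outside...
    for ii = 1:N      % ...so each column is written contiguously
        A(ii, jj) = ii + jj;
    end
end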
It is indeed faster at 0.018092 seconds – faster than all the vectorisations except the repmat and bsxfun ones!
Thanks for pointing this out.
Thanks also for the C++ code. I may write a follow-up piece that uses C/C++ via mex and include it.
The repository containing the example code has been updated.
https://github.com/mikecroucher/WalkingRandomly
Great post and comments, thanks! Out of curiosity, in which of the versions can we be reasonably confident that MATLAB is calling processor-optimized libraries (e.g. the MKL)?
That’s a tricky one, David. Matrix multiplication will probably be using the MKL, but it’s not the fastest solution shown here. I suspect that the creation of all the intermediates destroys the benefit of the MKL.
Hi, thanks for trying. The wrong loop order in C++ gives a timing of 0.0887 seconds (4+ times slower).
BTW, the formatting originally ate the template parameters in my C++ code; the declaration should read boost::multi_array<double, 2>.
Original loop time is 4.541423
Preallocate and loop is 0.047281
Switched loop order is 0.025103
Meshgrid time is 0.025185
Matmul time is 0.038637
Repmat time is 0.013518
Cumsum time is 0.031999
bsxfun time is 0.006002
All results are equal
R2015a, Arch Linux, MacBook Air 13″ 2013, Haswell i5 @ 1.3GHz, 8GB RAM
Here is what I would “instinctively” code – a half-way vectorisation, which is half-way readable. How does it compare?
A = zeros(N);
A(:,1) = [2:N+1]';
for k = 2:N
    A(:,k) = A(:,k-1) + 1;
end
Talking about hard-to-read solutions, here is a one-liner just for fun :)
A = hankel(2:N+1, N+1:2*N);
Internally, the HANKEL function uses BSXFUN as well.
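For anyone puzzling over that call: hankel(c, r) builds a matrix whose first column is c and whose last row is r, with every anti-diagonal constant. A small case (N = 3, my own illustration):

A = hankel(2:4, 4:6)
% A =
%      2     3     4
%      3     4     5
%      4     5     6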
Nice! Thanks for the Pull Request too.
I’ve not tried your version yet but merged it anyway, and I’ve changed the links in this post to refer to the commit that produced the times discussed here.
Hi Mike,
Just found this – a nice and somehow optimally non-trivial benchmark. Two more data points (both self-advertising, but I hope useful and interesting).
1.
I love bsxfun but hate its ugliness, so I’ve made a simple MATLAB class that creates “broadcastable” matrices. In use, it just delegates to bsxfun but generates (IMO) more readable code, so the hankel example becomes:
A = broadcast(1:N) + (1:N)';
The code can be found at http://awful.codeplex.com in examples/hankel_test/hankel_test.m
2.
MEX should be easier to write. Here’s the source of my implementation, which on my box generates optimal assembly and runs at the same speed as the bsxfun implementation. The code is in hankel_test, as above.
#include
// Declare mlx_function (C++ version of mexFunction)
void mlx_function(mlx_inputs& in, mlx_outputs& out)
{
    mlx_array A(mlx_size {1,1}, in[0]);   // Get input 0
    int size = int(A[0]);
    mlx_make_array sum(size, size);       // Make output array
    // Perform the operation
    for(mwSize j = 0; j < size; ++j)
        for(mwSize i = 0; i < size; ++i)
            sum(i,j) = double(i+j+2);
    out[0] = sum;                         // Assign to output
}
I wonder: does the MATLAB R2015b release improve these results?
Thank you.
A = (1:N) + (1:N).';
is 3 times faster than bsxfun on my machine.
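For context: that one-liner relies on implicit expansion, which arrived in MATLAB R2016b. Binary operators now broadcast singleton dimensions automatically, exactly as bsxfun does, so it needs R2016b or later:

% Requires R2016b or later: 1-by-N plus N-by-1 broadcasts to N-by-N
A = (1:N) + (1:N).';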