MATLAB: Vectorisation is a double-edged sword

May 22nd, 2015 | Categories: Making MATLAB faster, math software, matlab, programming

Update: 2nd July 2015 The code in github has moved on a little since this post was written so I changed the link in the text below to the exact commit that produced the results discussed here.

Imagine that you are a very new MATLAB programmer and you have to create an N x N matrix called A where A(i,j) = i + j.
Your first attempt at a solution might look like this:

N = 2000;
% Generate an N-by-N matrix where A(i,j) = i + j
for ii = 1:N
    for jj = 1:N
        A(ii,jj) = ii + jj;
    end
end

On my current machine (a MacBook Pro bought in early 2015), the above loop takes 2.03 seconds. You might think that this is a long time for something so simple and complain that MATLAB is slow. The person you complain to points out that you should preallocate your array before assigning to it.

N = 2000;
% Generate an N-by-N matrix where A(i,j) = i + j
A = zeros(N,N);
for ii = 1:N
    for jj = 1:N
        A(ii,jj) = ii + jj;
    end
end

This now takes 0.049 seconds on my machine – a speed-up of over 41 times! MATLAB suddenly doesn’t seem so slow after all.
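The difference comes from MATLAB having to repeatedly reallocate and copy A as it grows inside the loop. The effect is easy to see in isolation with a throwaway sketch (not part of the benchmark script; the names here are purely illustrative):

% Growing an array one element at a time forces repeated resizing
clear v
tic
for k = 1:1e5
    v(k) = k;   % v may be reallocated and copied as it grows
end
growTime = toc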

Word gets around about your problem, however, and seasoned MATLAB-ers see that nested loop, their faces twitch, and they start muttering ‘vectorise, vectorise, vectorise’. Emails soon pile in with vectorised solutions:

% Method 1: MESHGRID.
[X, Y] = meshgrid(1:N, 1:N);
A = X + Y;

This takes 0.025 seconds on my machine – a healthy speed-up over the loop-with-preallocation solution. You have to understand the meshgrid command, however, in order to understand what’s going on here. It’s still clear (to me at least) what it’s doing, but not as clear as the nice, obvious double loop. Call me old fashioned but I like loops…I understand them.
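If meshgrid is new to you, inspecting a small case shows exactly what it hands back (a quick sketch with N=3 rather than 2000):

% meshgrid returns coordinate matrices: X varies along rows, Y down columns
[X, Y] = meshgrid(1:3, 1:3)
% X =            Y =
%   1  2  3        1  1  1
%   1  2  3        2  2  2
%   1  2  3        3  3  3
% X(i,j) = j and Y(i,j) = i, so X + Y is exactly the matrix we want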

% Method 2: Matrix multiplication.
A = (1:N).' * ones(1, N) + ones(N, 1) * (1:N);

This one is MUCH harder to read but you don’t worry about it too much because at 0.032 seconds it’s slower than meshgrid.
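To see why the matrix multiplication version works, look at the two outer products separately (a small illustration with N=3):

% Multiplying by a vector of ones simply replicates the other vector
(1:3).' * ones(1, 3)   % = [1 1 1; 2 2 2; 3 3 3]  (column copied across)
ones(3, 1) * (1:3)     % = [1 2 3; 1 2 3; 1 2 3]  (row copied down)
% Their sum has (i,j) entry i + j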

% Method 3: REPMAT.
A = repmat(1:N, N, 1) + repmat((1:N).', 1, N);

This one appears to be interesting! At 0.009 seconds, it’s the fastest so far – by a healthy amount!

% Method 4: CUMSUM.
A = cumsum(ones(N)) + cumsum(ones(N), 2);

Coming in at 0.052 seconds, this cumsum solution is slower than the preallocated loop.
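The cumsum trick is perhaps the least obvious of the lot; running the two halves on a small matrix makes it clearer (again N=3 for illustration):

% Cumulative sums over a matrix of ones count the row and column indices
cumsum(ones(3))      % = [1 1 1; 2 2 2; 3 3 3]  (summed down each column)
cumsum(ones(3), 2)   % = [1 2 3; 1 2 3; 1 2 3]  (summed along each row)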

% Method 5: BSXFUN.
A = bsxfun(@plus, 1:N, (1:N).');

Ahhh, bsxfun or ‘The Widow-maker function’ as I sometimes refer to it. Responsible for some of the fastest and most unreadable vectorised MATLAB code I’ve ever written. In this case, it brings execution time down to 0.0045 seconds.

Whenever I see something that can be vectorised with a repmat, I try to figure out if I can rewrite it as a bsxfun. The result is usually horrible to look at so I tend to keep the original loop commented out above it as an explanation! This particular example isn’t too bad but bsxfun can quickly get hairy.
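To give a flavour of how bsxfun generalises beyond this example, here is a typical repmat-avoiding use (an illustrative sketch, not part of the benchmark): subtracting the column means from a data matrix.

% Centre each column of M without an explicit loop or repmat
M  = rand(5, 3);
M0 = bsxfun(@minus, M, mean(M, 1));   % mean(M,1) is 1-by-3; bsxfun expands it to 5-by-3
% The repmat equivalent allocates an intermediate: M - repmat(mean(M,1), 5, 1)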

Conclusion

Loops in MATLAB aren’t anywhere near as bad as they used to be, thanks to advances in JIT compilation, but it can often pay, speed-wise, to switch to vectorisation. The price you often pay for this speed-up is that vectorised code can become very difficult to read.
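For reference, the timings above were taken in the obvious way with tic and toc; a minimal sketch of one such measurement (variable names illustrative, not necessarily those in the script):

N = 2000;
tic
A = repmat(1:N, N, 1) + repmat((1:N).', 1, N);
repmatTime = toc;
fprintf('Repmat time is %f\n', repmatTime)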

If you’d like the code I ran to get the timings above, it’s on github (the link refers to the exact commit used for this post). Here’s the output from the run I referred to in this post:

Original loop time is 2.025441
Preallocate and loop is 0.048643
Meshgrid time is 0.025277
Matmul time is 0.032069
Repmat time is 0.009030
Cumsum time is 0.051966
bsxfun time is 0.004527
  • MATLAB Version: 2015a
  • Early 2015 MacBook Pro with 16GB RAM
  • CPU: 2.8GHz quad-core i7 Haswell

This post is based on a demonstration given by MathWorks’ Ken Deeley during a recent session at The University of Sheffield.

  1. Royi
    May 22nd, 2015 at 12:01

    What about MATLAB-generated C code?

  2. Dominik
    May 22nd, 2015 at 12:33

    Great post!
    This is very interesting and I will file this article and the tests in my repository for later use.

    The output on my machine:
    Original loop time is 3.510969
    Preallocate and loop is 0.051073
    Meshgrid time is 0.023769
    Matmul time is 0.031171
    Repmat time is 0.020366
    Cumsum time is 0.051336
    bsxfun time is 0.009737

    – MATLAB Version: 2015a
    – HP Notebook with 24 GB RAM
    – CPU: 2.8GHz with eight cores, i7
    – OS: Win7 Prof., 64-bit

    Best,
    Dominik

  3. May 22nd, 2015 at 17:02

    Interesting. I bash Matlab (and Euler Math Toolbox) sometimes because you have to resort to all sorts of tricks to get things to work fast. This is bad. You should not spend time on such tricks when the solution in a compiled language is obvious. It is like doing math with Lisp. It can be done, but the price is high.

    Now I see that some loops are well optimized by Matlab. The JIT compiler was not part of my Matlab version when I last tried it, however. My recent experience was a Matlab code that took me 4 hours to fix and ran in 90 seconds, compared to the same code in Java developed in one hour and running in 5 seconds. That used dictionaries and file access etc. I am not sure if JIT would have helped.

    Here are the results for EMT. The vectorization is much simpler thanks to the extended matrix language. The C code is in fact slower; I am not sure why that happens. The loops in EMT are not compiled and thus very slow.


    >n=2000;
    >tic; (1:n)+(1:n)'; toc;
    Used 0.021 seconds
    >function tinyc mm (n) ...
    $ARG_INT(n);
    $RES_DOUBLE_MATRIX(A,n,n);
    $for (int i=1; i<=n; i++)
    $ for (int j=1; j<=n; j++)
    $  A[(i-1)*n+(j-1)] = i+j;
    $endfunction
    >tic; mm(n); toc
    Used 0.067 seconds
    0.067
    >function mm1 (n) ...
    $A=zeros(n,n);
    $for i=1 to n;
    $ for j=1 to n;
    $ A[i,j]=i+j;
    $ end;
    $end;
    $return A;
    $endfunction
    >tic; mm1(n); toc;
    Used 7.225 seconds

  4. alfC
    May 22nd, 2015 at 21:17

    Two things:

    1) Did you try inverting the order of the loops? From my knowledge of MATLAB and Fortran, it seems that you are accessing the array in a non-conventional memory order.

    2) For reference (not for competition), in C++ this non-parallel code takes 0.0139 seconds on my Intel® Core™ i7-3612QM CPU @ 2.10GHz × 8, with all optimizations on:

    boost::multi_array<double, 2> A(boost::extents[2000][2000]);
    for(int i = 0; i != 2000; ++i){
        for(int j = 0; j != 2000; ++j){
            A[i][j] = i + j;
        }
    }

  5. Mike Croucher
    May 25th, 2015 at 09:35

    @alfC

    1. No, we didn’t and you are correct – we were doing it in the wrong order. I guess we were too focused on vectorisation. I am officially embarrassed! Anyway…if you switch the loop order

    %preallocate
    tic
    % Generate an N-by-N matrix where A(i,j) = i + j
    Aswitchloop = zeros(N,N);
    for jj = 1:N
        for ii = 1:N
            Aswitchloop(ii,jj) = ii + jj;
        end
    end
    switchloop = toc;

    It is indeed faster at 0.018092 seconds. Faster than all vectorisations except the repmat and bsxfun ones!
    Thanks for pointing this out.
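    For anyone wondering why the order matters: MATLAB stores matrices column-major, so making ii (the row index) the inner loop writes consecutive memory locations. A quick sketch shows the layout:

    % Elements are stored down each column first
    B = reshape(1:6, 2, 3)
    % B =
    %      1     3     5
    %      2     4     6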

    Thanks for the C++ code. I may write a follow-up piece that uses C/C++ via MEX and include it.

  6. Mike Croucher
    May 25th, 2015 at 09:42

    The repository containing the example code has been updated.

    https://github.com/mikecroucher/WalkingRandomly

  7. May 25th, 2015 at 21:01

    Great post and comments, thanks! Out of curiosity, in which of the versions can we be reasonably confident that Matlab’s calling processor-optimized libraries (e.g. MKL)?

  8. Mike Croucher
    May 26th, 2015 at 12:58

    That’s a tricky one, David. Matrix multiplication will probably be using the MKL but it’s not the fastest solution shown here. I suspect that the creation of all the intermediates destroys the benefit of the MKL.

  9. alfC
    May 31st, 2015 at 10:02

    Hi, thanks for trying. The wrong loop order in C++ gives a timing of 0.0887 seconds (4+ times slower).

    BTW, the C++ code in my post originally wasn’t complete because the formatting ate my template parameters; the declaration should read boost::multi_array<double, 2>.

  10. markuman
    June 4th, 2015 at 19:02

    Original loop time is 4.541423
    Preallocate and loop is 0.047281
    Switched loop order is 0.025103
    Meshgrid time is 0.025185
    Matmul time is 0.038637
    Repmat time is 0.013518
    Cumsum time is 0.031999
    bsxfun time is 0.006002

    All results are equal

    R2015a, Arch Linux, MacBook Air 13″ (2013), Haswell i5 @ 1.3GHz, 8GB RAM

  11. William Heath
    June 20th, 2015 at 07:35

    Here is what I would “instinctively” code – a half-way vectorisation, which is half-way readable. How does it compare?

    A = zeros(N);
    A(:,1) = (2:N+1)';
    for k = 2:N
        A(:,k) = A(:,k-1) + 1;
    end

  12. July 2nd, 2015 at 16:37

    Talking about hard-to-read solutions, here is a one-liner just for fun :)


    A = hankel(2:N+1, N+1:2*N);

    Internally, the HANKEL function uses BSXFUN as well.
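    For the curious, a quick sketch confirming that the one-liner builds the same matrix as the vectorised solutions above:

    N = 2000;
    [X, Y] = meshgrid(1:N, 1:N);
    isequal(hankel(2:N+1, N+1:2*N), X + Y)   % returns true (logical 1)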

  13. Mike Croucher
    July 2nd, 2015 at 18:41

    Nice! Thanks for the Pull Request too.

    I’ve not tried your version yet but merged it anyway and changed links in this post to refer to the commit that produced the times discussed here.

  14. November 25th, 2015 at 13:29

    Hi Mike,

    Just found this, a nice and somehow optimally nontrivial benchmark. Two more datapoints (both self-advertising but I hope useful and interesting).

    1.
    I love bsxfun, but hate its ugliness, so I’ve made a simple MATLAB class that creates “broadcastable” matrices. In use, it just delegates to bsxfun but generates (IMO) more readable code, so the hankel example becomes:

    A = broadcast(1:N) + (1:N)';

    The code can be found at http://awful.codeplex.com in examples/hankel_test/hankel_test.m
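    The real class lives in the repository above; purely as an illustration of the delegation idea (not the actual awful code), a minimal sketch of such a wrapper might look like this:

    classdef broadcast
        % Minimal sketch: wrap an array so that plus delegates to bsxfun
        properties
            Data
        end
        methods
            function obj = broadcast(x)
                obj.Data = x;
            end
            function C = plus(a, b)
                % Unwrap broadcast operands, then let bsxfun expand singleton dimensions
                if isa(a, 'broadcast'), a = a.Data; end
                if isa(b, 'broadcast'), b = b.Data; end
                C = bsxfun(@plus, a, b);
            end
        end
    end

    With that in place, broadcast(1:N) + (1:N)' dispatches to the overloaded plus and returns an ordinary matrix.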

    2.
    MEX should be easier to write. Here’s the source of my implementation, which on my box generates optimal assembly and runs at the same speed as the bsxfun implementation. Code in hankel_test as above.

    #include

    // Declare mlx_function (C++ version of mexFunction)
    void mlx_function(mlx_inputs& in, mlx_outputs& out)
    {
        mlx_array A(mlx_size {1,1}, in[0]); // Get input 0

        int size = int(A[0]);

        mlx_make_array sum(size, size); // Make output array

        // Perform the operation
        for(mwSize j = 0; j < size; ++j)
            for(mwSize i = 0; i < size; ++i)
                sum(i,j) = double(i+j+2);

        out[0] = sum; // Assign to output
    }

  15. Royi
    November 30th, 2015 at 20:52

    I wonder,
    Does the MATLAB R2015b release improve results?

    Thank You.

  16. Elad
    November 13th, 2019 at 05:40

    A = (1:N) + (1:N).';
    is 3 times faster than bsxfun on my machine.
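    That one-liner uses implicit expansion, introduced in R2016b, which broadcasts compatible row and column vectors automatically and largely supersedes bsxfun for cases like this. A quick equivalence check (assuming R2016b or later):

    % Implicit expansion gives the same matrix as the bsxfun version
    N = 2000;
    isequal((1:N) + (1:N).', bsxfun(@plus, 1:N, (1:N).'))   % returns true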