October 24th, 2011 | Categories: Making MATLAB faster, matlab, NAG Library, parallel programming, programming, tutorials | Tags:

The NAG C Library is one of the largest commercial collections of numerical software currently available and I often find it very useful when writing MATLAB mex files.  “Why is that?” I hear you ask.

The main reason for writing a mex file is speed: you want to go faster than native MATLAB.  The main problem with writing mex files, however, is that you have to do it in a low-level language such as Fortran or C, so you lose much of the ease of use of MATLAB.

In particular, you lose straightforward access to most of the massive collection of MATLAB routines that you take for granted.  Technically speaking that’s a lie because you could use the mex function mexCallMATLAB to call a MATLAB routine from within your mex file, but then you’ll be paying a time overhead every time you go in and out of the mex interface.  Since you are going down the mex route in order to gain speed, this doesn’t seem like the best idea in the world.  This is also the reason why you’d use the NAG C Library and not the NAG Toolbox for MATLAB when writing mex functions.

The way out that I often use is the NAG C Library, and it turns out that it is extremely easy to add it to your mex projects on Windows.  Let’s look at a trivial example.  The following code, nag_normcdf.c, uses the NAG function nag_cumul_normal to produce a simplified version of MATLAB’s normcdf function (laziness is all that prevented me from implementing a full replacement).

/* A simplified version of normcdf that uses the NAG C library
 * Written to demonstrate how to compile MATLAB mex files that use the NAG C Library
 * Only returns a normcdf where mu=0 and sigma=1
 * October 2011 Michael Croucher
 * www.walkingrandomly.com
 */

#include <math.h>
#include "mex.h"
#include "nag.h"
#include "nags.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    int rows,cols,num_elements,i;

    if(nrhs!=1)
    {
        mexErrMsgIdAndTxt("NAG:BadArgs","This simplified version of normcdf only takes 1 input argument");
    }

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input matrix*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);
    num_elements = rows*cols;

    /* Create output matrix */
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    for(i=0; i<num_elements; i++){
        out[i] = nag_cumul_normal(in[i]);
    }
}

To compile this in MATLAB, just use the following command

 mex nag_normcdf.c CLW6I09DA_nag.lib
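
If mex can’t find the NAG header files or the import library then you may need to point it at them explicitly.  Something along the following lines should do it; the install directory shown is an assumption based on a default CLW6I09DAL installation, so adjust it to match your own setup.

%Sketch only: nag_dir is an assumed default install location for CLW6I09DAL
nag_dir = 'C:\Program Files\NAG\CL09\clw6i09dal';
mex(['-I' fullfile(nag_dir,'include')], ...
    'nag_normcdf.c', ...
    fullfile(nag_dir,'lib','CLW6I09DA_nag.lib'))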

If your system is set up the same as mine then the simple one-line command should ‘just work’ (see System Information at the bottom of this post).  The new function works just as you would expect it to:

>> format long
>> format compact
>> nag_normcdf(1)
ans =
   0.841344746068543

Compare the result to normcdf from the statistics toolbox

>> normcdf(1)
ans =
   0.841344746068543

So far so good. I could stop the post here since all I really wanted to do was say ‘The NAG C library is useful for MATLAB mex functions and it’s a doddle to use – here’s a toy example and here’s the mex command to compile it’

However, out of curiosity, I looked to see if my toy version of normcdf was any faster than The Mathworks’ version.  Let there be 10 million numbers:

>> x=rand(1,10000000);

Let’s time how long it takes MATLAB to take the normcdf of those numbers

>> tic;y=normcdf(x);toc
Elapsed time is 0.445883 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.405764 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.366708 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.409375 seconds.

Now let’s look at my toy-version that uses NAG.

>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.544642 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.556883 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.553920 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.540510 seconds.

So my version is slower!  Never mind, I’ll just make my version parallel using OpenMP.  Here is the code, nag_normcdf_openmp.c:

/* A simplified version of normcdf that uses the NAG C library
 * Written to demonstrate how to compile MATLAB mex files that use the NAG C Library
 * Only returns a normcdf where mu=0 and sigma=1
 * October 2011 Michael Croucher
 * www.walkingrandomly.com
 */

#include <math.h>
#include "mex.h"
#include "nag.h"
#include "nags.h"
#include <omp.h>

void do_calculation(double in[],double out[],int num_elements)
{
    int i;

#pragma omp parallel for shared(in,out,num_elements) private(i)
    for(i=0; i<num_elements; i++){
        out[i] = nag_cumul_normal(in[i]);
    }
}

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    int rows,cols,num_elements;

    if(nrhs!=1)
    {
        mexErrMsgIdAndTxt("NAG_NORMCDF:BadArgs","This simplified version of normcdf only takes 1 input argument");
    }

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input matrix*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);
    num_elements = rows*cols;

    /* Create output matrix */
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    do_calculation(in,out,num_elements);
}

Compile that with

 mex  COMPFLAGS="$COMPFLAGS /openmp"  nag_normcdf_openmp.c CLW6I09DA_nag.lib

and on my quad-core machine I get the following timings

>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.237925 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.197531 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.206511 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.211416 seconds.

This is faster than MATLAB and so normal service is resumed :)
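
Before celebrating too much it’s worth checking that the parallel version still agrees with The Mathworks’ implementation.  A quick sanity check along these lines (assuming you have the Statistics Toolbox for normcdf) should return something down at the level of rounding error.

x = rand(1,10000000);
max(abs(nag_normcdf_openmp(x) - normcdf(x)))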

System Information

  • 64bit Windows 7
  • MATLAB 2011b
  • NAG C Library Mark 9 – CLW6I09DAL
  • Visual Studio 2008
  • Intel Core i7-2630QM processor
October 8th, 2011 | Categories: general math, probability, Problem of the week | Tags:

Consider a periodic sequence, seq_p, with period p

seq_p = a_1, a_2, …, a_p,  a_1, a_2, …, a_p,  a_1, a_2, …, a_p, …

Two sub-sequences, each of length N where 2 < N < p, are produced by choosing two random start points in seq_p and using the next N-1 numbers. What is the probability that these two sub-sequences will overlap?

If you take M > 2 such length-N sub-sequences then what is the probability that any two will overlap?

For a more concrete example, assume that p = 2^19937 - 1 (yes, a very big number) and consider 10,000 sub-sequences each of length 10^9.

Disclaimer: I don’t have the solution to this and haven’t yet tried to find it
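
If you fancy experimenting before attempting any analysis, here is a rough Monte Carlo sketch for the two sub-sequence case (with small, illustrative values of p and N rather than the enormous ones above).

%Monte Carlo estimate of the probability that two random length-N
%sub-sequences of a period-p sequence overlap (small, illustrative p and N)
p = 1000; N = 10; trials = 1e6;
s1 = randi(p, trials, 1);            %random start points
s2 = randi(p, trials, 1);
d  = mod(s1 - s2, p);                %circular distance between start points
overlap = (d < N) | (d > p - N);     %starts within N-1 of each other, allowing wrap-around
fprintf('Estimated overlap probability: %f\n', mean(overlap));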

October 4th, 2011 | Categories: math software, Month of Math Software | Tags:

Welcome to September’s A Month of Math Software.  Last month’s can be found here and the archive is here.  If you’ve got news for next month’s edition then contact me.

General mathematics and statistics

  • The commercial computer algebra package, MAGMA, has been upgraded to version 2.17-11.  The change log is at magma.maths.usyd.edu.au/magma/releasenotes/2/17/11/
  • Version 2011b of MATLAB has been released.  I’m all excited about the improved GPU functionality in the parallel computing toolbox along with the fact that the MATLAB compiler now supports GPU-enabled code.  This little combination means that we’ll be able to include a small amount of MATLAB/GPU support on Manchester University’s Condor pool.  If GPU support isn’t your thing then maybe you’d like to hear about updates to the MATLAB comparison tool from Mike on the MATLAB Desktop.
  • Version 2.13.2 of the extremely popular free, open source statistics package, R, has been released.  This is intended to be the last of the 2.13 series and you can find out what’s new by going to https://stat.ethz.ch/pipermail/r-announce/2011/000543.html

Freely available linear algebra software

Math software in the blogs

September 13th, 2011 | Categories: general math, just for fun, mathematica, matlab | Tags:

Consider this indefinite integral

\int \frac{1}{\sqrt{x(2-x)}}\,dx

Feed it to MATLAB’s symbolic toolbox:

syms x
int(1/sqrt(x*(2 - x)))

ans =
asin(x - 1)

Feed it to Mathematica 8.0.1:

Integrate[1/Sqrt[x (2 - x)], x] // InputForm

(2*Sqrt[-2 + x]*Sqrt[x]*Log[Sqrt[-2 + x] + Sqrt[x]])/Sqrt[-((-2 + x)*x)]

Let x=1.2 in both results:

MATLAB's answer evaluates to 0.2014
Mathematica's answer evaluates to -1.36944 + 0.693147 I
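
If you want to reproduce those numbers yourself, something along these lines should do it (assuming you have the Symbolic Math Toolbox; the two expressions are just the results above typed back into MATLAB).

syms x
matlab_result      = asin(x - 1);
mathematica_result = (2*sqrt(-2 + x)*sqrt(x)*log(sqrt(-2 + x) + sqrt(x)))/sqrt(-((-2 + x)*x));
double(subs(matlab_result, x, 1.2))        %0.2014
double(subs(mathematica_result, x, 1.2))   %-1.3694 + 0.6931i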

Discuss!

September 11th, 2011 | Categories: Making MATLAB faster, matlab, parallel programming, programming | Tags:

So, you’re the proud owner of a new license for MATLAB’s parallel computing toolbox (PCT) and you are wondering how to get some bang for your buck as quickly as possible.  Sure, you are going to learn about constructs such as parfor and spmd but that takes time and effort.  Wouldn’t it be nice if you could speed up some of your MATLAB code simply by saying ‘Turn parallelisation on’?

It turns out that The Mathworks have been adding support for their parallel computing toolbox all over the place and all you have to do is switch it on (assuming that you actually have the parallel computing toolbox, of course).  For example, say you had the following call to fmincon (part of the optimisation toolbox) in your code

[x, fval] = fmincon(@objfun, x0, [], [], [], [], [], [], @confun,opts)

To turn on parallelisation across 2 cores just do

matlabpool 2;
opts = optimset('fmincon');
opts = optimset(opts,'UseParallel','always');
[x, fval] = fmincon(@objfun, x0, [], [], [], [], [], [], @confun,opts);

That wasn’t so hard was it?  The speedup (if any) completely depends upon your particular optimization problem.

Why isn’t parallelisation turned on by default?
The next question that might occur to you is ‘Why doesn’t The Mathworks just turn parallelisation on by default?’ After all, although the above modification is straightforward, it does require you to know that this particular function supports parallel execution via the PCT. If you didn’t think to check then your code would be doomed to serial execution forever.

The simple answer to this question is ‘Sometimes the parallel version is slower‘. Take this serial code for example.

objfun = @(x)exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
confun = @(x) deal( [1.5+x(1)*x(2)-x(1)-x(2); -x(1)*x(2)-10], [] );
x0 = [-1, 1];   %starting point (an assumed example; pick whatever suits your problem)
tic;
[x, fval] = fmincon(objfun, x0, [], [], [], [], [], [], confun);
toc

On the machine I am currently sat at (quad core running MATLAB 2011a on Linux) this typically takes around 0.032 seconds to solve. With a problem that trivial my gut feeling is that we are not going to get much out of switching to parallel mode.

objfun = @(x)exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1);
confun = @(x) deal( [1.5+x(1)*x(2)-x(1)-x(2); -x(1)*x(2)-10],[] );
x0 = [-1, 1];   %same assumed starting point as before

%only do this next line once.  It opens two MATLAB workers
matlabpool 2;

opts = optimset('fmincon');
opts = optimset(opts,'UseParallel','always');

tic;
[x, fval] = fmincon(objfun, x0, [], [], [], [], [], [], confun,opts);
toc

Sure enough, this increases execution time dramatically to an average of 0.23 seconds on my machine.  There is always a computational overhead that needs paying when you go parallel and if your problem is too trivial then this overhead costs more than the calculation itself.
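
One small bit of housekeeping: the pool of workers opened by matlabpool hangs around until you explicitly close it, so when you have finished experimenting you can shut it down with

matlabpool close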

So which functions support the Parallel Computing Toolbox?

I wanted a web-page that listed all functions that gain benefit from the Parallel Computing Toolbox but couldn’t find one.  I found some documentation on specific toolboxes such as Parallel Statistics but nothing that covered all of MATLAB in one place.  Here is my attempt at producing such a document.  Feel free to contact me if I have missed anything out.

This covers MATLAB 2011b and is almost certainly incomplete. I’ve only covered toolboxes that I have access to and so some are missing.  Please contact me if you have any extra information.

Bioinformatics Toolbox

Global Optimisation

Image Processing

Optimisation Toolbox

Simulink

  • Running parallel simulations
  • You can increase the speed of diagram updates for models containing large model reference hierarchies by building referenced models that are configured in Accelerator mode in parallel whenever conditions allow.  This is covered in the documentation.

Statistics Toolbox

Other articles about parallel computing in MATLAB from WalkingRandomly

September 6th, 2011 | Categories: general math, just for fun, matlab | Tags:

Matt Tearle has produced a MATLAB version of my Interactive Slinky Thing which, in turn, was originally inspired by a post by Sol over at Playing with Mathematica.  Matt adapted the code from some earlier work he did and you can click on the image below to get it.  Thanks Matt!
MATLAB slinky thing

September 2nd, 2011 | Categories: Month of Math Software | Tags:

Welcome to August’s ‘A Month of Math Software’ where I look at everything from blog posts about math libraries through to the latest releases.  Click here for earlier articles in the series.

Mathematical software packages

“Sho is an interactive environment for data analysis and scientific computing that lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to enable fast and flexible prototyping.”

Sho console

It certainly looks very interesting with features such as direct integration with Azure (Microsoft’s cloud computing product), Optimization and loads more.  Let me know what you think of it if you try it out.

Mathematical Software Libraries

  • The AMD Core Math Library was upgraded to version 5.0 this month.  I think the following is the complete change-log
    • DGEMM and SGEMM have been tuned for AMD Family 15h processors.  These take advantage of AVX and FMA-4 instructions to achieve high efficiency using either one or both threads of a compute unit.
    • The Fortran code base for the library is compiled with AVX and FMA-4 flags to support the AMD Family 15h processors.  This library will not run on processors that do not support AVX and FMA-4.  The package includes legacy libraries with SSE/SSE2 instructions suitable for use on AMD Family 10h and AMD Family 0fh processors.
    • New 2D and 3D real-to-complex FFT functions have been introduced.  Included are samples demonstrating how to use the new functions.
    • The L’Ecuyer, Wichmann-Hill, and Mersenne Twister random number generators have been updated to improve performance on all processor types.
    • The vector math library dependency has been removed from the library, and libacml_mv has been removed from the build.  These AMD math functions are available as a separate download from the AMD web page.
  • While on the subject of the ACML, check out this interview about the library with Chip Freitag, one of its developers, that was recorded earlier this month.
  • AMD have also released version 1.4 of the AMD Accelerated Parallel Processing Math Libraries (APPML).  This is an OpenCL library aimed at GPUs.  As far as I can tell, this is just a bug fix release and so there are no new routines available.
  • Jack Dongarra et al have moved their linear algebra library for heterogeneous/hybrid architectures, MAGMA, from release candidate 5 to a full version 1.0 release.  Roughly speaking, you can think of this project as LAPACK for CUDA GPUs (although this scope will probably widen in the future).  I believe that it is used in products such as MATLAB’s parallel computing toolbox and Accelereyes’ Jacket (Correction: I’ve since learned that Jacket uses CULA and not MAGMA) among others.
  • Odds and Ends
    • Intel have come up with a new hardware-based random number generator.  Read about it at http://spectrum.ieee.org/computing/hardware/behind-intels-new-randomnumber-generator/  I wonder if hardware-based generators will ever replace pseudo random number generators for simulation work?
    • Finally, William Hart released a new version of Coopr back in July but I didn’t learn about it in time to get it into July’s edition of Month of Math Software.  So, I’m telling you about it now (more accurately, I’m quoting William).  ‘Coopr is a collection of Python software packages that supports a diverse set of optimization capabilities for formulating and analyzing optimization models’

That’s it for this month.  Thanks to everyone who contacted me with news items, this series would be a lot more difficult to compile without you.  If you have found something interesting in the world of mathematical software then feel free to let me know and I’ll include it in a future edition.

August 20th, 2011 | Categories: walking randomly | Tags:

This little corner of the web that I like to call home has been online for 4 years as of today.  I’d like to thank everyone who reads and comments on the stuff I put up here, I hope you find my wanderings interesting and/or useful.

I’d also like to thank all of the people and organisations I have collaborated with over the years while writing articles here. I’m not going to mention names out of fear of missing someone out but you know who you are.  The only exception would be Matthew Haworth of Reason Digital who talked me into starting this up during his University of Manchester farewell party.  He also provides me with web hosting (surviving three slashdottings so far), technical support and, on occasion, lunch!  Without Matt, there would be no Walking Randomly.

Back on WR’s first birthday I gave out some stats (143 RSS subscribers and 250 visits per day) and so I’ll do the same today (1645 RSS subscribers and 800 unique visits per day).

Finally, here is a list of some of my favourite articles from over the last 4 years.  I hope you enjoy reading them as much as I enjoyed writing them.

August 16th, 2011 | Categories: CUDA, GPU, Making MATLAB faster, matlab | Tags:

This is part 2 of an ongoing series of articles about MATLAB programming for GPUs using the Parallel Computing Toolbox.  The introduction and index to the series is at https://www.walkingrandomly.com/?p=3730.  All timings are performed on my laptop (hardware detailed at the end of this article) unless otherwise indicated.

Last time

In the previous article we saw that it is very easy to run MATLAB code on suitable GPUs using the parallel computing toolbox.  We also saw that the GPU has its own area of memory that is completely separate from main memory.  So, if you have a program that runs in main memory on the CPU and you want to accelerate part of it using the GPU then you’ll need to explicitly transfer data to the GPU and this takes time.  If your GPU calculation is particularly simple then the transfer times can completely swamp your performance gains and you may as well not bother.

So, the moral of the story is either ‘Keep data transfers between GPU and host down to a minimum’ or ‘If you are going to transfer a load of data to or from the GPU then make sure you’ve got a ton of work for the GPU to do.’  Today, I’m going to consider the latter case.

A more complicated elementwise function – CPU version
Last time we looked at a very simple elementwise calculation – taking the sine of a large array of numbers.  This time we are going to increase the complexity ever so slightly.  Consider myseries.m which is defined as follows

function out = myseries(t)
  out = 2*sin(t)-sin(2*t)+2/3*sin(3*t)-1/2*sin(4*t)+2/5*sin(5*t)-1/3*sin(6*t)+2/7*sin(7*t);
end

Let’s generate 75 million random numbers and see how long it takes to apply this function to all of them.

cpu_x = rand(1,75*1e6)*10*pi;
tic;myseries(cpu_x);toc

I did the above calculation 20 times using a for loop and took the mean of the results to get 4.89 seconds¹. Note that this calculation is fully parallelised…all 4 cores of my i7 CPU were working on the problem simultaneously (Many built-in MATLAB functions are parallelised these days by the way…No extra toolbox necessary).
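
For reference, the timing loop was along these lines (a sketch of the approach described in footnote 1, not the exact script):

cpu_x = rand(1,75*1e6)*10*pi;
times = zeros(1,20);
for k = 1:20
    tic; cpu_y = myseries(cpu_x); times(k) = toc;   %time each of the 20 runs
end
mean(times)   %mean over the 20 runs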

GPU version 1 (Or ‘How not to do it’)

One way of performing this calculation on the GPU would be to proceed as follows

cpu_x = rand(1,75*1e6)*10*pi;
tic
%Transfer data to GPU
gpu_x = gpuArray(cpu_x);
%Do calculation using GPU
gpu_y = myseries(gpu_x);
%Transfer results back to main memory
cpu_y = gather(gpu_y);
toc

The mean time of the above calculation on my laptop’s GPU is 3.27 seconds, faster than the CPU version but not by much. If you don’t include data transfer times in the calculation then you end up with 2.74 seconds which isn’t even a factor of 2 speed-up compared to the CPU.

GPU version 2 (arrayfun to the rescue)

We can get better performance out of the GPU simply by changing the line that does the calculation from

gpu_y = myseries(gpu_x);

to a version that uses MATLAB’s arrayfun command.

gpu_y = arrayfun(@myseries,gpu_x);

So, the full version of the code is now

cpu_x = rand(1,75*1e6)*10*pi;
tic
%Transfer data to GPU
gpu_x = gpuArray(cpu_x);
%Do calculation using GPU via arrayfun
gpu_y = arrayfun(@myseries,gpu_x);
%Transfer results back to main memory
cpu_y = gather(gpu_y);
toc

This improves the timings quite a bit. If we don’t include data transfer times then the arrayfun version completes in 1.42 seconds, down from 2.74 seconds for the original GPU code. Including data transfer, the arrayfun version completes in 1.94 seconds compared with 3.27 seconds for the original.

Using arrayfun for the GPU is definitely the way to go! Giving the GPU every disadvantage I can think of (double precision, including transfer times, comparing against multi-threaded CPU code etc) we still get a speed-up of just over 2.5 times on my laptop’s hardware. Pretty useful for hardware that was designed for energy-efficient gaming!

Note: Something that I learned while writing this post is that the first call to arrayfun will be slower than all of the rest.  This is because arrayfun compiles the MATLAB function you pass it down to PTX and this can take a while (seconds).  Subsequent calls will be much faster since arrayfun will use the compiled results.  The compiled PTX functions are not saved between MATLAB sessions.
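
You can see this warm-up cost for yourself with something like the following (the gather calls are there just to make sure each timing includes the actual work):

gpu_small = gpuArray(rand(1,1e6)*10*pi);
tic; y1 = gather(arrayfun(@myseries, gpu_small)); toc   %first call: includes PTX compilation
tic; y2 = gather(arrayfun(@myseries, gpu_small)); toc   %subsequent calls reuse the compiled code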

arrayfun – Good for your memory too!
Using the arrayfun function is not only good for performance, it’s also good for memory management. Imagine if I had 100 million elements to operate on instead of only 75 million. On my 3Gb GPU, the following code fails:

cpu_x = rand(1,100*1e6)*10*pi;
gpu_x = gpuArray(cpu_x);
gpu_y = myseries(gpu_x);
??? Error using ==> GPUArray.mtimes at 27
Out of memory on device. You requested: 762.94Mb, device has 724.21Mb free.

Error in ==> myseries at 3
  out = 2*sin(t)-sin(2*t)+2/3*sin(3*t)-1/2*sin(4*t)+2/5*sin(5*t)-1/3*sin(6*t)+2/7*sin(7*t);

If we use arrayfun, however, then we are in clover. The following executes without complaint.

cpu_x = rand(1,100*1e6)*10*pi;
gpu_x = gpuArray(cpu_x);
gpu_y = arrayfun(@myseries,gpu_x);
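
Incidentally, if you want to see how much memory your GPU has free before attempting something like this, the gpuDevice object can tell you (a small sketch; I believe the relevant properties are FreeMemory and TotalMemory, both reported in bytes).

g = gpuDevice;
g.FreeMemory/2^20    %free device memory in MB
g.TotalMemory/2^20   %total device memory in MB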

Some Graphs

Just like last time, I ran this calculation on a range of input arrays from 1 million to 100 million elements on both my laptop’s GT 555M GPU and a Tesla C2050 I have access to.  Unfortunately, the C2050 is running MATLAB 2010b rather than 2011a so it’s not as fair a test as I’d like.  I could only get up to 84 million elements on the Tesla before it exited due to memory issues.  I’m not sure if this is down to the hardware itself or the fact that it was running an older version of MATLAB.

GPU trig series timings

Next, I looked at the actual speed-up shown by the GPUs compared to my laptop’s i7 CPU.  Again, this includes data transfer times, is in full double precision and the CPU version was multi-threaded (No dodgy ‘GPUs are awesome’ techniques used here).  No matter what array size is used I get almost a factor of 3 speed-up using the laptop GPU and more than a factor of 7 speed-up when using the Tesla.  Not too shabby considering that the programming effort to achieve this speed-up was minimal.
GPU trig series speed-up

Conclusions

  • It is VERY easy to modify simple, element-wise functions to take advantage of the GPU in MATLAB using the Parallel Computing Toolbox.
  • arrayfun is the most efficient way of dealing with such functions.
  • My laptop’s GPU demonstrated almost a 3 times speed-up compared to its CPU.

Hardware / Software used for the majority of this article

  • Laptop model: Dell XPS L702X
  • CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
  • GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor clock: 1180Mhz. 3072 Mb DDR3 Memory
  • RAM: 8 Gb
  • OS: Windows 7 Home Premium 64 bit.  I’m not using Linux because of the lack of official support for Optimus.
  • MATLAB: 2011a with the parallel computing toolbox

Code and sample timings (Will grow over time)

You need myseries.m and trigseries_test.m and the Parallel Computing Toolbox.  These are the times given by the following function call for various systems (transfer included and using arrayfun).

[cpu,~,~,~,gpu] = trigseries_test(10,50*1e6,'mean')

GPUs

  • Tesla C2050, Linux, 2010b – 0.4751 seconds
  • NVIDIA GT 555M – 144 CUDA Cores, 3Gb RAM, Windows 7, 2011a – 1.2986 seconds

CPUs

  • Intel Core i7-2630QM, Windows 7, 2011a (My laptop’s CPU) – 3.33 seconds
  • Intel Core 2 Quad Q9650 @ 3.00GHz, Linux, 2011a – 3.9452 seconds

Footnotes
1 – This is the method used in all subsequent timings and is similar to that used by the File Exchange function, timeit (timeit takes a median, I took a mean). If you prefer to use timeit then the function call would be timeit(@()myseries(cpu_x)). I stick to tic and toc in the article because it makes it clear exactly where timing starts and stops using syntax well known to most MATLABers.

Thanks to various people at The Mathworks for some useful discussions, advice and tutorials while creating this series of articles.

August 15th, 2011 | Categories: just for fun | Tags:

This lunchtime I stumbled across the following tweet by GrrlScientist

“mmm, cheese! RT @LookMaNoFans How many Calories would a Moon made of cheese be, you ask? A made-of-manchego Moon = 285 septillion Calories.”

A septillion is 10^24 so that’s a lot of calories!  Naturally, I wanted to check the facts so I asked Wolfram Alpha which, sadly, didn’t have the specifics on manchego cheese so I substituted monterey cheese to get a result of 8.5*10^25 calories. This disagrees with the tweet but it’s the figure I’m going to go with for now.

Since I am into running, I wondered how far I’d have to run to burn that many calories.  Using Wolfram Alpha’s running calorie calculator I found out that at a pace of 8 minutes per mile I would need to run 0.675 miles to burn 85 calories (assuming average body weight*).  Burning off the 8.5*10^25 calories contained in a moon made of monterey cheese would therefore take 10^24 such runs, which works out at 6.75*10^23 miles, or almost 115 billion light years. That’s larger than the diameter of the observable universe!

*Obviously, if you ate the moon then you would weigh slightly more than average so this is a gross over-simplification.