## Making Mathematica faster with Compile

June 18th, 2011

Over at Sol Lederman’s fantastic new blog, Playing with Mathematica, he shared some code that produced the following figure.

Here’s Sol’s code with an AbsoluteTiming command thrown in.

f[x_, y_] := Module[{},
If[
Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
]

AbsoluteTiming[
\[Delta] = 0.02;
range = 11;
xyPoints = Table[{x, y}, {y, 0, range, \[Delta]}, {x, 0, range, \[Delta]}];
image = Map[f @@ # &, xyPoints, {2}];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}], 135 Degree]

This took 8.02 seconds on the laptop I am currently working on (Windows 7, AMD Phenom II N620 dual-core at 2.8 GHz). Note that I am only measuring how long the calculation itself took and am ignoring the time taken to render the image and define the function.

### Compiled functions make Mathematica code go faster

Mathematica has a Compile function which does exactly what you’d expect: it produces a compiled version of the function you give it (if it can!). Sol’s function gave it no problems at all.

f = Compile[{{x, _Real}, {y, _Real}}, If[
Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
];

AbsoluteTiming[
\[Delta] = 0.02;
range = 11;
xyPoints =
Table[{x, y}, {y, 0, range, \[Delta]}, {x, 0, range, \[Delta]}];
image = Map[f @@ # &, xyPoints, {2}];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}],
135 Degree]

This simple change takes computation time down from 8.02 seconds to 1.23 seconds, a 6.5 times speed-up for hardly any extra coding work. Not too shabby!

### Switch to C code to get it even faster

I’m not done yet though! By default the Compile command produces code for the so-called Mathematica Virtual Machine but recent versions of Mathematica allow us to go even further.

Install Visual Studio Express 2010 (and the Windows 7.1 SDK if you are running 64-bit Windows) and you can ask Mathematica to convert the function to low-level C code, compile it and produce a function object linked to the resulting compiled code. It sounds complicated but is a snap to actually do. Just add

CompilationTarget -> "C"

to the Compile command.

f = Compile[{{x, _Real}, {y, _Real}},
If[Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
, CompilationTarget -> "C"
];

AbsoluteTiming[\[Delta] = 0.02;
range = 11;
xyPoints =
Table[{x, y}, {y, 0, range, \[Delta]}, {x, 0, range, \[Delta]}];
image = Map[f @@ # &, xyPoints, {2}];]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}],
135 Degree]

On my machine this takes calculation time down to 0.89 seconds which is 9 times faster than the original.

### Making the compiled function listable

The current compiled function takes just a single (x, y) pair and returns a result.

In[8]:= f[1, 2]

Out[8]= 1

It can’t directly accept a list of x values and a list of y values. For example, for the two points (1,2) and (10,20) I’d like to be able to do f[{1, 10}, {2, 20}] and get the result {1, 1}. However, what I actually end up with is an error:

f[{1, 10}, {2, 20}]

CompiledFunction::cfsa: Argument {1,10} at position 1 should be a machine-size real number. >>

To fix this I need to make my compiled function listable which is as easy as adding

RuntimeAttributes -> {Listable}

to the function definition.

f = Compile[{{x, _Real}, {y, _Real}},
If[Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
, CompilationTarget -> "C", RuntimeAttributes -> {Listable}
];

So now I can pass the entire array to this compiled function at once. No need for Map.

AbsoluteTiming[
\[Delta] = 0.02;
range = 11;
xpoints = Table[x, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
ypoints = Table[y, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
image = f[xpoints, ypoints];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}],
135 Degree]

On my machine this gets calculation time down to 0.28 seconds, a whopping 28.5 times faster than the original. In fact, rendering time is now becoming much more of an issue than calculation time!

### Parallel anyone?

By adding

Parallelization -> True

to the Compile command I can parallelise the code using threads. Since I have a dual-core machine, this might be a good thing to do. Let’s take a look.

f = Compile[{{x, _Real}, {y, _Real}},
If[
Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
, RuntimeAttributes -> {Listable}, CompilationTarget -> "C",
Parallelization -> True];

AbsoluteTiming[
\[Delta] = 0.02;
range = 11;
xpoints = Table[x, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
ypoints = Table[y, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
image = f[xpoints, ypoints];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}],
135 Degree]

The first time I ran this it was SLOWER than the non-threaded version, coming in at 0.33 seconds. Subsequent runs varied and occasionally got as low as 0.244 seconds, which is only a few hundredths of a second faster than the serial listable version.

If I make the problem bigger, however, by decreasing \[Delta], then we start to see the benefit of parallelisation.

AbsoluteTiming[
\[Delta] = 0.01;
range = 11;
xpoints = Table[x, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
ypoints = Table[y, {x, 0, range, \[Delta]}, {y, 0, range, \[Delta]}];
image = f[xpoints, ypoints];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}],
135 Degree]

The above calculation (sans rendering) took 0.988 seconds using a parallelised version of f and 1.24 seconds using a serial version. Rendering took significantly longer! As a comparison, let’s put a \[Delta] of 0.01 into the original, uncompiled code:

f[x_, y_] := Module[{},
If[
Sin[Min[x*Sin[y], y*Sin[x]]] >
Cos[Max[x*Cos[y],
y*Cos[x]]] + (((2 (x - y)^2 + (x + y - 6)^2)/40)^3)/
6400000 + (12 - x - y)/30, 1, 0]
]

AbsoluteTiming[
\[Delta] = 0.01;
range = 11;
xyPoints = Table[{x, y}, {y, 0, range, \[Delta]}, {x, 0, range, \[Delta]}];
image = Map[f @@ # &, xyPoints, {2}];
]
Rotate[ArrayPlot[image, ColorRules -> {0 -> White, 1 -> Black}], 135 Degree]

The calculation (again, ignoring rendering time) took 32.56 seconds, so our C-compiled, parallel version is almost 33 times faster!

### Summary

• The Compile function can make your code run significantly faster by compiling it for the Mathematica Virtual Machine (MVM).  Note that not every function is suitable for compilation.
• If you have a C compiler installed on your machine then you can switch from the MVM to compiled C code using a single option.  The resulting code is even faster.
• Making your functions listable can increase performance.
• Parallelising your compiled function is easy and can lead to even more speed but only if your problem is of a suitable size.
• Sol Lederman has a very cool Mathematica blog – check it out!  The code that inspired this blog post originated there.

## Parallel MATLAB with openmp mex files

October 21st, 2009

Slowly but surely, more and more MATLAB functions are becoming able to take advantage of multi-core processors.  For example, in MATLAB 2009b, functions such as sort, bsxfun, filter and erf (among others) gained the ability to spread the computational load across several processor cores.  This is good news because if your code uses these functions, and if you have a multi-core processor, then you will get faster execution times without having to modify your program.  This kind of parallelism is called implicit parallelism because it doesn’t require any special commands in order to take advantage of multiple cores – MATLAB just does it automagically.  Faster code for free!

For many people though, this isn’t enough and they find themselves needing to use explicit parallelism where it becomes the programmer’s responsibility to tell MATLAB exactly how to spread the work between the multiple processing cores available.  The standard way of doing this in MATLAB is to use the Parallel Computing Toolbox but, unlike packages such as Mathematica and Maple, this functionality is going to cost you extra.

There is another way though…

I’ve recently been working with David Szotten, a postgraduate student at Manchester University, on writing parallel mex files for MATLAB using openmp and C.  Granted, it’s not as easy as using the Parallel Computing Toolbox but it does work and the results can be blisteringly fast since we are working in C.  It’s also not quite as difficult as we originally thought it might be.

There is one golden rule you need to observe:

Never, ever call any of the mex api functions inside the portion of your code that you are trying to parallelise using openmp.  Don’t try to interact with MATLAB at all during the parallel portion of your code.

The reason for the golden rule is that, at the time of writing at least (MATLAB 2009b), none of the mex api functions are thread-safe – and that includes printf, since printf is defined to be mexPrintf in the mex.h header file.  Stick to the golden rule and you’ll be fine.  Move away from it, however, and you’ll find dragons.  Trust me on this!

Let’s start with an example and find the sum of foo(x) where foo is a moderately complicated function and x is a large number of values.  We have implemented such an example in the mex function mex_sum_openmp.

I assume you are running MATLAB 2009b on 32bit Ubuntu Linux version 9.04 – just like me.  If your setup differs from this then you may need to modify the following instructions accordingly.

• Before you start.  Close all currently running instances of MATLAB.
• Download and unzip the file  mex_sum_openmp.zip to give you mex_sum_openmp.c
• Open up a bash terminal and enter export OMP_NUM_THREADS=2 (replace 2 with however many cores your machine has)
• Start MATLAB from this terminal by entering matlab.
• Inside MATLAB, navigate to the directory containing mex_sum_openmp.c
• Inside MATLAB, enter the following to compile the .c file
mex mex_sum_openmp.c CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"
• Inside MATLAB, test the code by entering
x=rand(1,9999999); %create an array of random x values
mex_sum_openmp(x) %calculate the sum of foo(x(i))
• If all goes well, this calculation will be done in parallel and you will be rewarded with a single number.  Time the calculation as follows.
tic;mex_sum_openmp(x);toc
• My dual core laptop did this in about 0.6 seconds on average.  To see how much faster this is compared to single core mode just repeat all of the steps above but start off with export OMP_NUM_THREADS=1 (You won’t need to recompile the mex function).  On doing this, my laptop did it in 1.17 seconds and so the 2 core mode is almost exactly 2 times faster.

Let’s take a look at the code.

#include "mex.h"
#include <math.h>
#include <omp.h>

double spawn_threads(double x[], int cols)
{
    double sum = 0;
    int count;
    #pragma omp parallel for shared(x, cols) private(count) reduction(+: sum)
    for (count = 0; count < cols; count++) {
        sum += sin(x[count] * x[count] * x[count]) / cos(x[count] + 1.0);
    }

    return sum;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double sum;
    /* Check for proper number of arguments. */
    if (nrhs != 1) {
        mexErrMsgTxt("One input required.");
    } else if (nlhs > 1) {
        mexErrMsgTxt("Too many output arguments.");
    }

    /* Find number of rows and columns in input data */
    int rows = mxGetM(prhs[0]);
    int cols = mxGetN(prhs[0]);

    /* I'm only going to consider row matrices in this code so ensure that rows isn't more than 1 */
    if (rows > 1) {
        mexErrMsgTxt("Input data has too many Rows.  Just the one please");
    }

    /* Get pointer to input data */
    double *x = mxGetPr(prhs[0]);

    /* Do the calculation */
    sum = spawn_threads(x, cols);

    /* Create space for output */
    plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);

    /* Get pointer to output array */
    double *y = mxGetPr(plhs[0]);

    /* Return sum to output array */
    y[0] = sum;
}

That’s pretty much it – the golden rule in action. I hope you find this example useful and thanks again to David Szotten who helped me figure all of this out.  Comments welcomed.
