## Using Intel’s SPMD Compiler (ispc) with MATLAB on Linux

Modern CPUs are capable of parallel processing at multiple levels with the most obvious being the fact that a typical CPU contains multiple processor cores. My laptop, for example, contains a quad-core Intel Sandy Bridge i7 processor and so has 4 processor cores. You may be forgiven for thinking that, with 4 cores, my laptop can do up to 4 things simultaneously but life isn’t quite that simple.

The first complication is hyper-threading where each physical core appears to the operating system as two or more virtual cores. For example, the processor in my laptop is capable of using hyper-threading and so I have access to up to 8 virtual cores! I have heard stories where unscrupulous sales people have passed off a 4 core CPU with hyperthreading as being as good as an 8 core CPU…. after all, if you fire up the Windows Task Manager you can see 8 cores and so there you have it! However, this is very far from the truth since what you really have is 4 real cores with 4 brain damaged cousins. Sometimes the brain damaged cousins can do something useful but they are no substitute for physical cores. There is a great explanation of this technology at makeuseof.com.

The second complication is the fact that **each physical processor core** contains a SIMD (Single Instruction Multiple Data) lane of a certain width. SIMD lanes, aka SIMD units or vector units, can process several numbers simultaneously with a single instruction rather than only one a time. The 256-bit wide SIMD lanes on my laptop’s processor, for example, can operate on up to 8 single (or 4 double) precision numbers per instruction. Since each physical core has its own SIMD lane this means that a 4 core processor could theoretically operate on up to 32 single precision (or 16 double precision) numbers per clock cycle!

So, all we need now is a way of programming for these SIMD lanes!

Intel’s SPMD Program Compiler, ispc, is a free product that allows programmers to take direct advantage of the SIMD lanes in modern CPUS using a C-like syntax. The speed-ups compared to single-threaded code can be impressive with Intel reporting up to 32 times speed-up (on an i7 quad-core) for a **single precision** Black-Scholes option pricing routine for example.

**Using ispc on MATLAB
**

Since ispc routines are callable from C, it stands to reason that we’ll be able to call them from MATLAB using mex. To demonstrate this, I thought that I’d write a sqrt function that works faster than MATLAB’s built-in version. This is a tall order since the sqrt function is pretty fast and is already multi-threaded. Taking the square root of 200 million random numbers doesn’t take very long in MATLAB:

>> x=rand(1,200000000)*10; >> tic;y=sqrt(x);toc Elapsed time is 0.666847 seconds.

This might not be the most useful example in the world but I wanted to focus on how to get ispc to work from within MATLAB rather than worrying about the details of a more interesting example.

**Step 1 – A reference single-threaded mex file**

Before getting all fancy, let’s write a nice, straightforward single-threaded mex file in C and see how fast that goes.

#include <math.h> #include "mex.h" void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) { double *in,*out; int rows,cols,num_elements,i; /*Get pointers to input matrix*/ in = mxGetPr(prhs[0]); /*Get rows and columns of input matrix*/ rows = mxGetM(prhs[0]); cols = mxGetN(prhs[0]); num_elements = rows*cols; /* Create output matrix */ plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL); /* Assign pointer to the output */ out = mxGetPr(plhs[0]); for(i=0; i<num_elements; i++) { out[i] = sqrt(in[i]); } }

Save the above to a text file called** sqrt_mex.c** and compile using the following command in MATLAB

mex sqrt_mex.c

Let’s check out its speed:

>> x=rand(1,200000000)*10; >> tic;y=sqrt_mex(x);toc Elapsed time is 1.993684 seconds.

Well, it works but it’s quite a but slower than the built-in MATLAB function so we still have some work to do.

**Step 2 – Using the SIMD lane on one core via ispc**

Using ispc is a two step process. First of all you need the .ispc program

export void ispc_sqrt(uniform double vin[], uniform double vout[], uniform int count) { foreach (index = 0 ... count) { vout[index] = sqrt(vin[index]); } }

Save this to a file called** ispc_sqrt.ispc** and compile it at the **Bash prompt** using

ispc -O2 ispc_sqrt.ispc -o ispc_sqrt.o -h ispc_sqrt.h --pic

This creates an object file, **ispc_sqrt.o**, and a header file, **ispc_sqrt.h**. Now create the mex file in MATLAB

#include <math.h> #include "mex.h" #include "ispc_sqrt.h" void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) { double *in,*out; int rows,cols,num_elements,i; /*Get pointers to input matrix*/ in = mxGetPr(prhs[0]); /*Get rows and columns of input matrix*/ rows = mxGetM(prhs[0]); cols = mxGetN(prhs[0]); num_elements = rows*cols; /* Create output matrix */ plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL); /* Assign pointer to the output */ out = mxGetPr(plhs[0]); ispc::ispc_sqrt(in,out,num_elements); }

Call this **ispc_sqrt_mex.cpp ** and compile **in MATLAB** with the command

mex ispc_sqrt_mex.cpp ispc_sqrt.o

Let’s see how that does for speed:

>> tic;y=ispc_sqrt_mex(x);toc Elapsed time is 1.379214 seconds.

So, we’ve improved on the single-threaded mex file a bit (1.37 instead of 2 seconds) but it’s still not enough to beat the MATLAB built-in. To do that, we are going to have to use the SIMD lanes on all 4 cores simultaneously.

**Step 3 – A reference multi-threaded mex file using OpenMP**

Let’s step away from ispc for a while and see how we do with something we’ve seen before– a mex file using OpenMP (see here and here for previous articles on this topic).

#include <math.h> #include "mex.h" #include <omp.h> void do_calculation(double in[],double out[],int num_elements) { int i; #pragma omp parallel for shared(in,out,num_elements) for(i=0; i<num_elements; i++){ out[i] = sqrt(in[i]); } } void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) { double *in,*out; int rows,cols,num_elements,i; /*Get pointers to input matrix*/ in = mxGetPr(prhs[0]); /*Get rows and columns of input matrix*/ rows = mxGetM(prhs[0]); cols = mxGetN(prhs[0]); num_elements = rows*cols; /* Create output matrix */ plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL); /* Assign pointer to the output */ out = mxGetPr(plhs[0]); do_calculation(in,out,num_elements); }

Save this to a text file called **openmp_sqrt_mex.c** and compile **in MATLAB** by doing

mex openmp_sqrt_mex.c CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"

Let’s see how that does (OMP_NUM_THREADS has been set to 4):

>> tic;y=openmp_sqrt_mex(x);toc Elapsed time is 0.641203 seconds.

That’s very similar to the MATLAB built-in and I suspect that The Mathworks have implemented their sqrt function in a very similar manner. Theirs will have error checking, complex number handling and what-not but it probably comes down to a for-loop that’s been parallelized using Open-MP.

**Step 4 – Using the SIMD lanes on all cores via ispc**

To get a ispc program to run on all of my processors cores simultaneously, I need to break the calculation down into a series of tasks. The .ispc file is as follows

task void ispc_sqrt_block(uniform double vin[], uniform double vout[], uniform int block_size,uniform int num_elems){ uniform int index_start = taskIndex * block_size; uniform int index_end = min((taskIndex+1) * block_size, (unsigned int)num_elems); foreach (yi = index_start ... index_end) { vout[yi] = sqrt(vin[yi]); } } export void ispc_sqrt_task(uniform double vin[], uniform double vout[], uniform int block_size,uniform int num_elems,uniform int num_tasks) { launch[num_tasks] < ispc_sqrt_block(vin, vout, block_size, num_elems) >; }

Compile this by doing the following at the **Bash prompt**

ispc -O2 ispc_sqrt_task.ispc -o ispc_sqrt_task.o -h ispc_sqrt_task.h --pic

We’ll need to make use of a task scheduling system. The ispc documentation suggests that you could use the scheduler in Intel’s Threading Building Blocks or Microsoft’s Concurrency Runtime but a basic scheduler is provided with ispc in the form of tasksys.cpp (I’ve also included it in the .tar.gz file in the downloads section at the end of this post), We’ll need to compile this too so do the following **at the Bash prompt**

g++ tasksys.cpp -O3 -Wall -m64 -c -o tasksys.o -fPIC

Finally, we write the mex file

#include <math.h> #include "mex.h" #include "ispc_sqrt_task.h" void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] ) { double *in,*out; int rows,cols,i; unsigned int num_elements; unsigned int block_size; unsigned int num_tasks; /*Get pointers to input matrix*/ in = mxGetPr(prhs[0]); /*Get rows and columns of input matrix*/ rows = mxGetM(prhs[0]); cols = mxGetN(prhs[0]); num_elements = rows*cols; /* Create output matrix */ plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL); /* Assign pointer to the output */ out = mxGetPr(plhs[0]); block_size = 1000000; num_tasks = num_elements/block_size; ispc::ispc_sqrt_task(in,out,block_size,num_elements,num_tasks); }

In the above, the input array is divided into tasks where each task takes care of 1 million elements. Our 200 million element test array will, therefore, be split into 200 tasks– many more than I have processor cores. I’ll let the task scheduler worry about how to schedule these tasks efficiently across the cores in my machine. Compile this **in MATLAB** by doing

mex ispc_sqrt_task_mex.cpp ispc_sqrt_task.o tasksys.o

Now for crunch time:

>> x=rand(1,200000000)*10; >> tic;ys=sqrt(x);toc %MATLAB's built-in Elapsed time is 0.670766 seconds. >> tic;y=ispc_sqrt_task_mex(x);toc %my version using ispc Elapsed time is 0.393870 seconds.

There we have it! A version of the sqrt function that works faster than MATLAB’s own by virtue of the fact that I am now making full use of the SIMD lanes in my laptop’s Sandy Bridge i7 processor thanks to ispc.

Although this example isn’t very useful as it stands, I hope that it shows that using the ispc compiler from within MATLAB isn’t as hard as you might think and is yet another tool in the arsenal of weaponry that can be used to make MATLAB faster.

**Final Timings, downloads and links
**

- Single threaded: 2.01 seconds
- Single threaded with ispc: 1.37 seconds
- MATLAB built-in: 0.67 seconds
- Multi-threaded with OpenMP (OMP_NUM_THREADS=4): 0.64 seconds
- Multi-threaded with OpenMP and hyper-threading (OMP_NUM_THREADS=8): 0.55 seconds
- Task-based multicore with ispc: 0.39 seconds

Finally, here’s some links and downloads

**System Specs**

- MATLAB 2011b running on 64 bit linux
- gcc 4.6.1
- ispc version 1.1.1
- Intel Core i7-2630QM with 8Gb RAM

Very interesting! Thank you.

I implemented step 2 but MATLAB crashed (even if I use the ispc_sqrt_mex function with just one number).

System Specs:

– MATLAB 2011a running on 32 bit Mac OS X 10.6.8

– gcc 4.2.1

– ispc version 1.1.1

– Intel Core 2 Duo 2.4 Ghz with 4Gb RAM

Bye,

michele.

Hi Michele

Thanks for trying it out, shame it doesn’t work for you. Do the examples that come with ispc work for you?

Cheers,

Mike

Very interesting!

But for a bit more complex number crunching functions the efficiency rapidly decreased. The ISPC is still not matured tools for real HPC problems.

Hi Michal

Do you have examples of the decreasing efficiency please? I’ll be trying out some more complex things myself in the near future.

Cheers,

Mike

To be fair, your last example that goes faster than the MATLAB built-in SQRT isn’t entirely comparable. You said it yourself:

“Theirs will have error checking, complex number handling and what-not but it probably comes down to a for-loop that’s been parallelized using Open-MP.”

You can see this if you make a one-character change to the example:

x=randn(1,200000000)*10; % RANDN instead of RAND

I’m not certain what your version will do on this vector (which contains negative numbers), but since you don’t do anything with the imaginary part of the output it won’t be the same as SQRT’s answer. At best it’ll throw some sort of error (or assert, perhaps?) and at worst it’ll return just the real part of the answer. [Silently returning the wrong answer is a Very Bad Thing.]

I understand that you’re using this as a simple example of how to write a MEX-file using this new tool, not as a complete replacement for SQRT. However, I believe your statement that it’s faster could be interpreted in the latter way.

Hi Steve

For negative numbers, all of the examples I’ve written return NaN which I think is OK but not as good as what MATLAB’s sqrt would do.

You are right, comparing this highly simplified version of sqrt directly to MATLAB’s is probably unfair. The real fair comparison is between the mex files themselves where ispc is clearly of benefit if you have the right hardware.

With that said, however, if you know in advance that you are only going to be dealing with positive numbers then you are in clover :)

I guess that MATLAB’s sqrt function is implemented as a mex file? If so then perhaps you could try applying the ispc compiler to your code and see how it pans out?

Cheers,

Mike

Nice article. I wonder if your parallel version would be faster with pthreads instead of OpenMP (but that would be a bit of work)? Also, it would be interesting (and easy) to test Matlab in singlethreaded mode. I think Matlab’s latest releases are making it harder or impossible to do this, but starting Matlab with the “-singleCompThread” might work.

Also, since Matlab uses ACML and Intel MKL for its LAPACK and BLAS, and since these libraries contain much more than BLAS (for example, Intel MKL contains a vectorized sqrt), Matlab may in fact be using AMD or Intel’s sqrt. In that case, if you are beating Matlab, it really means you’re beating AMD or Intel, so bravo! Although as Steve L mentions, the error handing may be non-negligible.