Parallel MATLAB with OpenMP mex files

October 21st, 2009

Slowly but surely, more and more MATLAB functions are becoming able to take advantage of multi-core processors.  For example, in MATLAB 2009b, functions such as sort, bsxfun, filter and erf (among others) gained the ability to spread their computational load across several processor cores.  This is good news because if your code uses these functions, and if you have a multi-core processor, then you will get faster execution times without having to modify your program.  This kind of parallelism is called implicit parallelism because it doesn’t require any special commands in order to take advantage of multiple cores – MATLAB just does it automagically.  Faster code for free!

For many people though, this isn’t enough and they find themselves needing to use explicit parallelism, where it becomes the programmer’s responsibility to tell MATLAB exactly how to spread the work between the multiple processing cores available.  The standard way of doing this in MATLAB is to use the Parallel Computing Toolbox but, unlike the equivalent functionality in packages such as Mathematica and Maple, it is going to cost you extra.

There is another way though…

I’ve recently been working with David Szotten, a postgraduate student at Manchester University, on writing parallel mex files for MATLAB using OpenMP and C.  Granted, it’s not as easy as using the Parallel Computing Toolbox, but it does work and the results can be blisteringly fast since we are working in C.  It’s also not quite as difficult as we originally thought it might be.

There is one golden rule you need to observe:

Never, ever call any of the mex API functions inside the portion of your code that you are trying to parallelise with OpenMP.  Don’t try to interact with MATLAB at all during the parallel portion of your code.

The reason for the golden rule is that, at the time of writing at least (MATLAB 2009b), none of the mex API functions are thread-safe – and that includes printf, since printf is defined to be mexPrintf in the mex.h header file.  Stick to the golden rule and you’ll be fine.  Stray from it, however, and you’ll find dragons.  Trust me on this!

Let’s start with an example: finding the sum of foo(x(i)), where foo is a moderately complicated function and x is a large array of values.  We have implemented such an example in the mex function mex_sum_openmp.

I assume you are running MATLAB 2009b on 32-bit Ubuntu Linux 9.04 – just like me.  If your setup differs from this then you may need to modify the following instructions accordingly.

  • Before you start, close all currently running instances of MATLAB.
  • Download and unzip the file mex_sum_openmp.zip to give you mex_sum_openmp.c
  • Open up a bash terminal and enter export OMP_NUM_THREADS=2 (replace 2 with however many cores your machine has).
  • Start MATLAB from this terminal by entering matlab.
  • Inside MATLAB, navigate to the directory containing mex_sum_openmp.c
  • Inside MATLAB, enter the following to compile the .c file
    mex mex_sum_openmp.c CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"
  • Inside MATLAB, test the code by entering
    x=rand(1,9999999); %create an array of random x values
    mex_sum_openmp(x) %calculate the sum of foo(x(i))
  • If all goes well, this calculation will be done in parallel and you will be rewarded with a single number.  Time the calculation as follows.
    tic;mex_sum_openmp(x);toc
  • My dual core laptop did this in about 0.6 seconds on average.  To see how much faster this is compared to single-core mode, repeat all of the steps above but start off with export OMP_NUM_THREADS=1 (you won’t need to recompile the mex function).  On doing this, my laptop took 1.17 seconds, so the two-core mode is almost exactly twice as fast.

Let’s take a look at the code.

#include "mex.h"
#include <math.h>
#include <omp.h>

double spawn_threads(double x[], int cols)
{
	double sum = 0;
	int count;
	#pragma omp parallel for shared(x, cols) private(count) reduction(+: sum)
	for(count = 0; count < cols; count++)
	{
		sum += sin(x[count]*x[count]*x[count])/cos(x[count]+1.0);
	}

	return sum;
}

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
	double sum;

	/* Check for proper number of arguments. */
	if(nrhs != 1) {
		mexErrMsgTxt("One input required.");
	} else if(nlhs > 1) {
		mexErrMsgTxt("Too many output arguments.");
	}

	/* Find number of rows and columns in input data */
	int rows = mxGetM(prhs[0]);
	int cols = mxGetN(prhs[0]);

	/* Only row vectors are considered in this code, so ensure that rows isn't more than 1 */
	if(rows > 1) {
		mexErrMsgTxt("Input data has too many rows.  Just the one please.");
	}

	/* Get pointer to input data */
	double* x = mxGetPr(prhs[0]);

	/* Do the calculation */
	sum = spawn_threads(x, cols);

	/* Create space for output */
	plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);

	/* Get pointer to output array */
	double* y = mxGetPr(plhs[0]);

	/* Return sum to output array */
	y[0] = sum;
}

That’s pretty much it – the golden rule in action.  I hope you find this example useful and thanks again to David Szotten, who helped me figure all of this out.  Comments welcome.


  1. kyle
    April 9th, 2010 at 15:19

    just curious. Can Matlab 2008b do this job or not?

  2. April 10th, 2010 at 08:47

    Hi Kyle

    I can’t see why not although I haven’t tried it.

    Cheers,
    Mike

  3. Anika
    May 11th, 2010 at 10:29

    Hello.
    I’m working on my thesis and try using OpenMP to parallelize a few things.
    I am using MATLAB 7.1 and Visual Studio Pro 2008. With these two things I’ve written some small C code and compiled it successfully.
    Now I’m trying something different and it does not work.
    Do you know if you can call MATLAB functions inside a parallel for loop?

    int i;
    mxArray *input[3], *out[1];
    #pragma omp parallel for
    for(i = 1; i <= 6; i++){
        mexCallMATLAB(1, out, 3, input, "CalcSomething");
    }

    I’d be very grateful for any help or ideas.
    Greetings
    Anika

  4. May 17th, 2010 at 05:35

    Hi Anika

    I can see WHY you want to do this since it will effectively be a replacement for MATLAB’s parfor function (for which you have to pay extra) but it is just not possible at the moment.

    I’m afraid that what you are trying to do will simply not work, since you are calling a mex API function inside your parallelised loop. As I said in the article, this does not work because the mex API functions are not thread-safe.

    So, what can you do in your situation? Well, at the moment you have a couple of options if you want to take the direct parallelisation route.

    *Use the free parallel toolbox from MIT, pMATLAB. http://www.ll.mit.edu/mission/isr/pmatlab/pmatlab.html
    *Pay for MATLAB’s Parallel Computing Toolbox. http://www.mathworks.com/products/parallel-computing/

    Remember, however, that there is always more than one way to crack a nut and, depending on your application, there might be other things you can use in order to get your results more quickly.

    For example, if you were at my university then I might suggest that you consider high throughput computing (HTC) on our experimental condor pool or replacing some key MATLAB functions with routines from the NAG Toolbox for MATLAB among other things.

    The very first thing I’d suggest, however, would be to run your code through the MATLAB profiler and see if you can speed things up with good, old fashioned optimisation.

    Cheers,
    Mike

  5. August 7th, 2010 at 00:29

    Thanks for this great article. Also thanks for the link to the MIT toolbox. What are the advantages of using this functionality vs the MIT toolbox? When would you pick one over the other?

  6. August 7th, 2010 at 00:42

    Also, do I need to install any openmpi packages from the Ubuntu repositories?

  7. Vladko
    August 15th, 2010 at 23:45

    Thanks for the insight. I ran the code on the cluster of my university. Just opening MATLAB and running the program gave me no problems and I got what I expected. However, when I used nohup and OMP_NUM_THREADS was bigger than 1, everything was again executed correctly (on more than one thread) but I got the message Segmentation Fault at the end. Do you know why nohup caused that issue and what the consequences are? Any response will be greatly appreciated.

  8. Alberto
    January 13th, 2011 at 15:55

    Thanks for the article.
    I have done some tests on Matlab 7.9 under Linux and I get substantial time improvements with respect to the same MEX file compiled without OpenMP. However, when I compile the same code in a machine with Matlab 7.5, I do not see any improvements.
    Do you know what is the minimum version of Matlab that I need to benefit from OpenMP in MEX files?
    Or do you think that it could be a library problem, not necessarily related to the Matlab version?
    Thanks for your insights.

  9. January 13th, 2011 at 16:13

    Hi Alberto

    Thanks for your comment. First off, I haven’t tried it in 7.5 so everything that follows is guesswork…

    It should be the version of gcc that matters, not the version of MATLAB. So, what version of gcc are you using on the MATLAB 7.5 machine? According to

    http://gcc.gnu.org/wiki/openmp

    you need gcc 4.2 or above to get OpenMP. That’s where I would start looking to try and understand what you are seeing.

    Also, are you definitely sure that the MATLAB 7.5 machine has a dual core processor?

    Cheers,
    Mike

  10. Alberto
    January 13th, 2011 at 17:52

    Thanks for your quick reply.
    I am using gcc 4.4 and yes, the machine I am trying to use has multiple cores. In fact, I have been able to successfully compile and run a stand-alone program using OpenMP on that machine; however, the MEX file does not seem to benefit from the MP capabilities. It looks as if MATLAB couldn’t see the value of the OMP_NUM_THREADS variable.
    Strange, because I am doing exactly the same thing in both machines.

  11. Ram
    February 13th, 2011 at 04:20

    I think your tutorial is for Unix systems. I modified a few things but probably the wrong ones. Anyway, for Windows machines like mine the reply by ‘Oliver’ works (http://www.mathworks.com/matlabcentral/newsreader/view_thread/238799).

    Good tutorial though.

    –Ram.

  12. Xavier
    July 24th, 2011 at 21:07

    Thanks for the article, it got me started using openmp in my mex files.

    The inability to use printf or mexPrintf in the parallel regions is a problem. In case the program encounters errors, how can it warn the user? One possible solution is to use fprintf and to print the error messages to a file (instead of to the screen). Is that feasible? Any experience with that, or with other solutions?

    Thanks very much

    -Xavier

  13. July 25th, 2011 at 10:47

    @Xavier

    I’ve never used fprintf within a parallel region so can’t comment.

    One approach that springs to mind would be to catch the error within your parallel region and pass an error flag back to the main body of the mex function. You can then use this to inform a mexPrintf within the serial section of the code without causing any problems.

  14. Race
    October 21st, 2011 at 12:36

    @Vladko Hi Vladko
    I’ve the same problem as yours. Have you found any solution for this?

  15. Thomas Kesler
    November 15th, 2011 at 22:13

    Many thanks for this clearly explained example. I easily got it working. I had to use gcc/4.2.4 with mex (which wants gcc/4.2.3). Then I replaced gcc/4.2.4 with gcc/4.4.0 to get the shared OpenMP library. Here are the results:

    OMP_NUM_THREADS   time (sec)
    ---------------   ----------
    1                 1.001227
    2                 0.467961
    4                 0.244345

  16. November 23rd, 2011 at 15:13

    Thanks for letting me know Thomas :)

  17. Tom Benson
    November 29th, 2011 at 19:44

    Mike

    Thanks for posting this. I managed to get it working with Matlab 2008b using Visual Studio Pro 2008 intel compiler with an almost 1-to-1 speed-up. A couple of things to note when trying to do this on a PC:

    (1) don’t use “export” in the command window but use “set” instead.
    i.e.

    set OMP_NUM_THREADS=2
    

    (2) when compiling the mex file, the matlab command should be:

    mex mex_sum_openmp.c COMPFLAGS="$COMPFLAGS -openmp"  LINKFLAGS="$LINKFLAGS -openmp"
    

    (3) the Intel compiler also requires variables in the c-code to be declared at the beginning of routine. The following modified code worked for me.

    include "mex.h"
    #include <stdio.h>
    #include <math.h>
    #include <omp.h>
    
    
    double spawn_threads(double x[],int cols)
    {
    	double sum;
    	int count;
        
        sum=0;
            
        #pragma omp parallel for shared(x, cols) private(count) reduction(+: sum) 
                
    	for(count = 0;count<cols;count++)
    	{
    	     sum += sin(x[count]*x[count]*x[count])/cos(x[count]+1.0);
    	}
    	
    	return sum;
    }
    
    
    
    void mexFunction(int nlhs,mxArray* plhs[], int nrhs, const mxArray* prhs[])
    {
    	double sum;
        int rows,cols;
        double *x,*y;
        
    	 /* Check for proper number of arguments. */
     	if(nrhs!=1) {
        	mexErrMsgTxt("One input required.");
      	} else if(nlhs>1) {
        	mexErrMsgTxt("Too many output arguments.");
      	}
    	
    	/*Find number of rows and columns in input data*/
    	rows = mxGetM(prhs[0]);
    	cols = mxGetN(prhs[0]);		
    
    	/*I'm only going to consider row matrices in this code so ensure that rows isn't more than 1*/
    	if(rows>1){
    	mexErrMsgTxt("Input data has too many Rows.  Just the one please");
    	}
    		
    	/*Get pointer to input data*/
    	x = mxGetPr(prhs[0]);
    
    	/*Do the calculation*/	
    	sum = spawn_threads(x,cols);
    	
    	/*create space for output*/
     	plhs[0] = mxCreateDoubleMatrix(1,1, mxREAL);
    
    	/*Get pointer to output array*/
    	y = mxGetPr(plhs[0]);
    
    	/*Return sum to output array*/
    	y[0]=sum;
       
    }
    
  18. James Tursa
    February 23rd, 2012 at 21:54

    This article is misleading. Your Golden Rule at the top of the article I agree with:

    “Never, ever call any of the mex api functions inside the portion of your code that you are trying to parallelise using openmp. Don’t try to interact with MATLAB at all during the parallel portion of your code.”

    But then you make the following statements near the end of the article:

    “The key point to notice is that I don’t have any parallel code inside mexFunction() which serves as MATLAB’s entry point into the C code. Likewise, I don’t have any mex functions inside my spawn_threads() function – the part that is actually parallelised using openmp.”

    I don’t see the connection between the Golden Rule and the latter statements. Why can’t you have OpenMP code as part of the mexFunction itself? As long as there are no API calls inside the OpenMP part of the code it should work just fine (and does for me on a 32-bit WinXP machine with MSVC). I don’t see why you imply a requirement to have the OpenMP code in a separate function. Likewise, I don’t see why you imply that there can’t be any MATLAB API calls inside your spawn_threads function. Should work fine (and does for me) as long as the API calls are not inside any OpenMP code.

    I have written several mex functions using OpenMP. One in particular, MTIMESX, is on the TMW FEX section and has been compiled on several different 32-bit & 64-bit machines with different Operating Systems by various users. The code has multiple instances of OpenMP and MATLAB API code within the same function. It all seems to work fine. What I *don’t* do is have API calls *inside* the OpenMP code sections (i.e., your Golden Rule).

    Am I missing something?

  19. February 24th, 2012 at 01:38

    Hi James

    Thanks for the comment. It’s been a while since I wrote this article and I can’t remember the process I went through to arrive at the form that this example is in…i.e. with all OpenMP loops in a separate function. I have hazy recollections of crashes when I tried putting an OpenMP loop in my main function but I honestly don’t remember the details.

    Looking at it with fresh eyes, I agree with you. There should be no reason why I can’t put the OpenMP code in the main loop. So, I tested it…I got rid of spawn_threads and just put the OpenMP loop inside the mexFunction and…it worked just fine.

    So either I was just plain wrong, or there was a problem with MATLAB 2009b on 32-bit Ubuntu Linux 9.04 that doesn’t exist on my current configuration (2011b on 64-bit Ubuntu 11.04).

    Anyway, I’ve removed the final paragraph and think that now the article isn’t misleading.

    Cheers,
    Mike

  20. Pat
    May 15th, 2013 at 13:30

    As this is the best explanation I’ve found on the web of getting openmp running in a mex file, I’ll just leave a couple of notes so others don’t replicate my errors:

    Firstly, some versions of MATLAB seem to reset OMP_NUM_THREADS back to 1 on startup. You can check for this by typing, in MATLAB,

    >> getenv OMP_NUM_THREADS

    and if this returns 1 you can fix it with

    >> setenv('OMP_NUM_THREADS', '2');

    Secondly, if your mex file is written in C++ rather than C, then when compiling you need to change every instance of CFLAGS in the above to CXXFLAGS. For example:

    >> mex mex_sum_openmp.cpp CXXFLAGS="\$CXXFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"

    If you don’t then it will still compile and run just fine, but nothing will be parallelised. Yeah yeah, I know, it should have been obvious, but I was convinced I’d done something more fundamentally wrong and managed to overlook this for hours…

    Other than this it runs great! Thanks for putting this guide online.

  21. Mike Croucher
    May 17th, 2013 at 10:04

    You are welcome Pat. Thanks for the extra tips :)

    Cheers,
    Mike

  22. Ugur Karban
    January 14th, 2014 at 15:40

    Hi all,

    I want to call some Fortran subroutines which include OpenMP blocks from MATLAB. Is it possible to get parallel computation in Fortran when it is called by MATLAB? Is there a process for Fortran similar to the one for C/C++?

    Thanks in advance.

    Ugur

  23. Ugur Karban
    January 27th, 2014 at 18:09

    @Ugur Karban
    I tried 'FFLAGS' instead of 'CFLAGS' and it worked…