{"id":4064,"date":"2012-06-14T20:38:32","date_gmt":"2012-06-14T19:38:32","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=4064"},"modified":"2012-07-12T00:50:33","modified_gmt":"2012-07-11T23:50:33","slug":"using-the-portland-pgi-compiler-for-matlab-mex-files-in-windows-1","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=4064","title":{"rendered":"Using the Portland PGI Compiler for MATLAB mex files in Windows #1"},"content":{"rendered":"<p>I recently got access to a shiny new (new to me at least) set of compilers, <a href=\"http:\/\/www.pgroup.com\/\">The Portland PGI compiler<\/a> suite which comes with a great set of technologies to play with including <a href=\"http:\/\/www.pgroup.com\/lit\/articles\/insider\/v3n2a4.htm\">AVX vector support<\/a>, <a href=\"http:\/\/www.pgroup.com\/lit\/articles\/insider\/v3n2a1.htm\">CUDA for x86<\/a> and <a href=\"http:\/\/www.pgroup.com\/resources\/accel.htm\">GPU pragma-based acceleration<\/a>.\u00a0 So naturally, it wasn&#8217;t long before I wondered if I could use the PGI suite as compilers for <a href=\"http:\/\/www.mathworks.co.uk\/help\/techdoc\/matlab_external\/f29322.html\">MATLAB mex files<\/a>.\u00a0 The bad news is that The Mathworks don&#8217;t support the PGI Compilers out of the box but that leads to the good news&#8230;I get to dig down and figure out how to add support for unsupported compilers.<\/p>\n<p>In what follows I made use of <strong>MATLAB 2012a<\/strong> on <strong>64bit Windows 7<\/strong> with <strong>Version 12.5 of the PGI Portland Compiler Suite<\/strong>.<\/p>\n<p>In order to set up a C mex-compiler in MATLAB you execute the following<\/p>\n<pre>mex -setup<\/pre>\n<p>which causes MATLAB to execute a Perl script at <strong>C:\\Program Files\\MATLAB\\R2012a\\bin\\mexsetup.pm<\/strong>.\u00a0 This script scans the directory <strong>C:\\Program Files\\MATLAB\\R2012a\\bin\\win64\\mexopts<\/strong> looking for Perl scripts with the extension .stp and running 
whatever it finds.  Each .stp file looks for a particular compiler.\u00a0 After all .stp files have been executed, a list of the compilers found is returned to the user.  When the user chooses a compiler, the corresponding .bat file gets copied to the directory returned by MATLAB&#8217;s <a href=\"http:\/\/www.mathworks.co.uk\/help\/techdoc\/ref\/prefdir.html\">prefdir<\/a> function.  This sets up the compiler for use.\u00a0 All of this is nicely documented in the <strong>mexsetup.pm<\/strong> file itself.<\/p>\n<p>So, I&#8217;ve had my first crack at this and the results are the following two files.<\/p>\n<ul>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/mex\/pgi\/1\/pgi.bat\">pgi.bat<\/a><\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/mex\/pgi\/1\/pgi.stp\">pgi.stp<\/a><\/li>\n<\/ul>\n<p>These are crude, and there&#8217;s probably lots missing\/wrong but they seem to work.\u00a0 Copy them to <strong>C:\\Program Files\\MATLAB\\R2012a\\bin\\win64\\mexopts. 
<\/strong>The location of the compiler is hard-coded in pgi.stp so you&#8217;ll need to change the following line if your compiler location differs from mine<\/p>\n<pre>my $default_location = \"C:\\\\Program Files\\\\PGI\\\\win64\\\\12.5\\\\bin\";<\/pre>\n<p>Now, when you do <strong>mex -setup<\/strong>, you should get an entry <strong>PGI Workstation 12.5 64bit 12.5 in C:\\Program Files\\PGI\\win64\\12.5\\bin<\/strong> which you can select as normal.<\/p>\n<p><strong>An example compilation and some details.<\/strong><\/p>\n<p>Let&#8217;s use the PGI compiler to compile the following very simple mex file, <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/mex\/pgi\/1\/mex_sin.c\">mex_sin.c<\/a>, which does little more than take the elementwise sine of the input matrix.<\/p>\n<pre>#include &lt;math.h&gt;\r\n#include \"mex.h\"\r\n\r\nvoid mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )\r\n{\r\n    double *in,*out;\r\n    double dist,a,b;\r\n    int rows,cols,outsize;\r\n    int i,j,k;\r\n\r\n    \/*Get pointers to input matrix*\/\r\n    in = mxGetPr(prhs[0]);\r\n    \/*Get rows and columns of input*\/\r\n    rows = mxGetM(prhs[0]);\r\n    cols = mxGetN(prhs[0]);\r\n\r\n    \/* Create output matrix *\/\r\n    outsize = rows*cols;\r\n    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);\r\n    \/* Assign pointer to the output *\/\r\n    out = mxGetPr(plhs[0]);\r\n\r\n    for(i=0;i&lt;outsize;i++){\r\n        out[i] = sin(in[i]);\r\n    }\r\n\r\n}<\/pre>\n<p>Compile using the -v switch to get verbose information about the compilation<\/p>\n<pre>mex mex_sin.c -v<\/pre>\n<p>You&#8217;ll see that the compiled mex file is actually a renamed .dll file that was compiled and linked with the following flags<\/p>\n<pre>pgcc -c -Bdynamic  -Minfo -fast\r\npgcc --Mmakedll=export_all  -L\"C:\\Program Files\\MATLAB\\R2012a\\extern\\lib\\win64\\microsoft\" libmx.lib libmex.lib libmat.lib<\/pre>\n<p>The switch 
<strong>--Mmakedll=export_all<\/strong> is actually <a href=\"http:\/\/www.pgroup.com\/userforum\/viewtopic.php?p=4724&amp;sid=2a722becc630a95d9a186c5a1e27b008\">not supported by PGI<\/a> which makes this whole setup doubly unsupported!  However, I couldn&#8217;t find a way to export the required symbols without modifying the mex source code so I lived with it.\u00a0 Maybe I&#8217;ll figure out a better way in the future.\u00a0 Let&#8217;s try the new function out<\/p>\n<pre>&gt;&gt; a=[1 2 3];\r\n&gt;&gt; mex_sin(a)\r\nInvalid MEX-file 'C:\\Work\\mex_sin.mexw64': The specified module could not be found.<\/pre>\n<p>The reason for the error message is that a required PGI .dll file, pgc.dll, is not on my system path so I need to do the following in MATLAB.<\/p>\n<pre>setenv('PATH', [getenv('PATH') ';C:\\Program Files\\PGI\\win64\\12.5\\bin\\']);<\/pre>\n<p>This fixes things<\/p>\n<pre>&gt;&gt; mex_sin(a)\r\nans =\r\n    0.8415    0.9093    0.1411<\/pre>\n<p><strong>Performance<\/strong><\/p>\n<p>I took a quick look at the performance of this mex function using my quad-core, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Sandy_Bridge\">Sandy Bridge<\/a> laptop.  
I highly doubted that I was going to beat MATLAB&#8217;s built-in sin function (which is highly optimised and <a href=\"https:\/\/www.walkingrandomly.com\/?p=1894\">multithreaded<\/a>) with so little work and I was right:<\/p>\n<pre>&gt;&gt; a=rand(1,100000000);\r\n&gt;&gt; tic;mex_sin(a);toc\r\nElapsed time is 1.320855 seconds.\r\n&gt;&gt; tic;sin(a);toc\r\nElapsed time is 0.486369 seconds.<\/pre>\n<p>That&#8217;s not really a fair comparison though since I am purposely leaving multithreading out of the PGI mex equation for now.\u00a0 It&#8217;s much fairer to compare the exact same mex file built with different compilers so let&#8217;s do that.\u00a0 I created three different compiled mex routines from the source code above using the three compilers installed on my laptop and performed a very crude time test as follows<\/p>\n<pre>&gt;&gt; a=rand(1,100000000);\r\n&gt;&gt; tic;mex_sin_pgi(a);toc              %PGI 12.5 run 1\r\nElapsed time is 1.317122 seconds.\r\n&gt;&gt; tic;mex_sin_pgi(a);toc              %PGI 12.5 run 2\r\nElapsed time is 1.338271 seconds.\r\n\r\n&gt;&gt; tic;mex_sin_vs(a);toc               %Visual Studio 2008 run 1\r\nElapsed time is 1.459463 seconds.\r\n&gt;&gt; tic;mex_sin_vs(a);toc               %Visual Studio 2008 run 2\r\nElapsed time is 1.446947 seconds.\r\n\r\n&gt;&gt; tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 1\r\nElapsed time is 0.907018 seconds.\r\n&gt;&gt; tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 2\r\nElapsed time is 0.860218 seconds.<\/pre>\n<p>PGI did a little better than Visual Studio 2008 but was beaten by Intel. 
I&#8217;m hoping that I&#8217;ll be able to get more performance out of the PGI compiler as I learn more about the compilation flags.<\/p>\n<p><strong>Getting PGI to make use of SSE extensions<\/strong><\/p>\n<p>Part of the output of the <strong>mex mex_sin.c -v<\/strong> compilation command is the following notice<\/p>\n<pre>mexFunction:\r\n     23, Loop not vectorized: data dependency<\/pre>\n<p>This notice is a result of the <strong>-Minfo<\/strong> compilation switch and indicates that the PGI compiler can&#8217;t determine if the <strong>in<\/strong> and <strong>out<\/strong> arrays overlap or not.\u00a0 If they don&#8217;t overlap then it would be safe to <a href=\"http:\/\/en.wikipedia.org\/wiki\/Loop_unwinding\">unroll the loop<\/a> and make use of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Streaming_SIMD_Extensions\">SSE<\/a> or <a href=\"http:\/\/software.intel.com\/en-us\/avx\/\">AVX<\/a> instructions to make better use of my Sandy Bridge processor.\u00a0 This should hopefully speed things up a little.<\/p>\n<p>As the programmer, I am sure that the two arrays don&#8217;t overlap so I need to give the compiler a hand.\u00a0 One way to do this would be to modify the <strong>pgi.bat<\/strong> file to include the compilation switch <strong>-Msafeptr<\/strong> which tells the compiler that arrays never overlap anywhere.\u00a0 This might not be a good idea since it may not always be true so I decided to be more cautious and make use of the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Restrict\">restrict<\/a> keyword.\u00a0 That is, I changed the mex source code so that<\/p>\n<pre>double 
*in,*out;<\/pre>\n<p>becomes<\/p>\n<pre>double * restrict in,* restrict out;<\/pre>\n<p>Now when I compile using the PGI compiler, the notice from -Minfo becomes<\/p>\n<pre>mexFunction:\r\n     23, Generated 3 alternate versions of the loop\r\n         Generated vector sse code for the loop\r\n         Generated a prefetch instruction for the loop<\/pre>\n<p>which demonstrates that the compiler is much happier!  So, what did this do for performance?<\/p>\n<pre>&gt;&gt; tic;mex_sin_pgi(a);toc\r\nElapsed time is 1.450002 seconds.\r\n&gt;&gt; tic;mex_sin_pgi(a);toc\r\nElapsed time is 1.460536 seconds.<\/pre>\n<p>This is slower than when SSE instructions weren&#8217;t being used which isn&#8217;t what I was expecting at all! If anyone has any insight into what&#8217;s going on here, I&#8217;d love to hear from you.<\/p>\n<p><strong>Future Work<\/strong><\/p>\n<p>I&#8217;m happy that I&#8217;ve got this compiler working in MATLAB but there is a lot to do including:<\/p>\n<ul>\n<li>Tidy up the pgi.bat and pgi.stp files so that they look and act more professionally.<\/li>\n<li>Figure out the best set of compiler switches to use; it is almost certain that what I&#8217;m using now is sub-optimal since I am new to the PGI compiler.<\/li>\n<li>Get OpenMP support working.\u00a0 I tried using the <strong>-Mconcur<\/strong> compilation flag which auto-parallelised the loop but it crashed MATLAB when I ran it. 
This needs investigating.<\/li>\n<li>Get PGI accelerator support working so I can offload work to the GPU.<\/li>\n<li>Figure out why the SSE version of this function is slower than the non-SSE version.<\/li>\n<li>Figure out how to determine whether or not the compiler is emitting AVX instructions.\u00a0 The documentation suggests that if the compiler is called on a Sandy Bridge machine, and if vectorisation is possible then it will produce AVX instructions but AVX is not mentioned in the output of -Minfo.\u00a0 Nothing changes if you explicitly set the target to Sandy Bridge with the compiler switch <strong>-tp sandybridge-64<\/strong>.<\/li>\n<\/ul>\n<p>Look out for more articles on this in the future.<\/p>\n<p><strong>Related WalkingRandomly Articles<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=1894\">Which MATLAB functions make use of multithreading<\/a>?<\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=3988\">Using Intel\u2019s SPMD Compiler (ispc) with MATLAB on Linux<\/a><\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=1795\">Parallel MATLAB with OpenMP mex files<\/a><\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=3898\">MATLAB mex functions using the NAG C Library<\/a><\/li>\n<\/ul>\n<p><strong>My setup<\/strong><\/p>\n<ul>\n<li>Laptop model: Dell XPS L702X<\/li>\n<li>CPU:<a href=\"http:\/\/www.notebookcheck.net\/Intel-Core-i7-2630QM-Notebook-Processor.41483.0.html\"> Intel Core i7-2630QM<\/a> @ 2 GHz, software overclockable to 2.9 GHz. 4 physical cores but 8 virtual cores in total due to Hyperthreading.<\/li>\n<li>GPU: <a href=\"http:\/\/www.notebookcheck.net\/NVIDIA-GeForce-GT-555M.41933.0.html\">GeForce GT 555M<\/a> with 144 CUDA Cores.\u00a0 Graphics clock: 590 MHz.\u00a0 Processor clock: 1180 MHz. 
3072 MB DDR3 memory<\/li>\n<li>RAM: 8 GB<\/li>\n<li>OS: Windows 7 Home Premium 64 bit.<\/li>\n<li>MATLAB: 2012a<\/li>\n<li>PGI Compiler: 12.5<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I recently got access to a shiny new (new to me at least) set of compilers, The Portland PGI compiler suite which comes with a great set of technologies to play with including AVX vector support, CUDA for x86 and GPU pragma-based acceleration.\u00a0 So naturally, it wasn&#8217;t long before I wondered if I could use [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[53,11,7,42],"tags":[],"class_list":["post-4064","post","type-post","status-publish","format-standard","hentry","category-making-matlab-faster","category-matlab","category-programming","category-tutorials"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-13y","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/4064","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4064"}],"version-history":[{"count":19,"href":"https:\/\/walkingrandomly.com\/i
ndex.php?rest_route=\/wp\/v2\/posts\/4064\/revisions"}],"predecessor-version":[{"id":4403,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/4064\/revisions\/4403"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4064"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4064"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4064"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}