{"id":3978,"date":"2012-02-09T20:31:57","date_gmt":"2012-02-09T19:31:57","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=3978"},"modified":"2012-02-13T19:00:11","modified_gmt":"2012-02-13T18:00:11","slug":"optimising-a-correlated-asset-calculation-on-matlab-2-using-the-gpu-via-the-pct","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=3978","title":{"rendered":"Optimising a correlated asset calculation on MATLAB #2 &#8211; Using the GPU via the PCT"},"content":{"rendered":"<p>This article is the second part of a series where I look at rewriting a particular piece of MATLAB code using various techniques.\u00a0 The introduction to the series <a href=\"https:\/\/www.walkingrandomly.com\/?p=3604\">is here<\/a> and the introduction to the larger series of <a href=\"https:\/\/www.walkingrandomly.com\/?p=3730\">GPU articles for MATLAB on WalkingRandomly is here<\/a>.<\/p>\n<p><strong>Attempt 1 &#8211; Make as few modifications as possible<\/strong><\/p>\n<p>I took my best CPU-only code from last time (<a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/corr_asset\/1\/optimised_corr2.m\">optimised_corr2.m<\/a>) and changed a load of data-types from <strong>double<\/strong> to <strong>gpuArray<\/strong> in order to get the calculation to run on my laptop&#8217;s GPU using the <a href=\"http:\/\/www.mathworks.co.uk\/products\/parallel-computing\/\">parallel computing toolbox<\/a> in MATLAB 2010b.\u00a0 I also switched to using the GPU versions of various functions such as <strong>parallel.gpu.GPUArray.randn<\/strong> instead of <strong>randn<\/strong> for example.\u00a0 Functions such as <strong>cumprod<\/strong> needed no\u00a0 modifications at all since they are nicely overloaded; if the argument to cumprod is of type double then the calculation happens on the CPU whereas if it is gpuArray then it happens on the GPU.<\/p>\n<p>The above work took about a minute to do which isn&#8217;t bad for a CUDA &#8216;porting&#8217; 
effort!\u00a0 The result, which I&#8217;ve called <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/corr_asset\/2\/GPU_PCT_corr1.m\">GPU_PCT_corr1.m<\/a>, is available for you to download and try out.<\/p>\n<p>How about performance?\u00a0 Let&#8217;s do a quick tic and toc using my laptop&#8217;s <a href=\"http:\/\/www.notebookcheck.net\/NVIDIA-GeForce-GT-555M.41933.0.html\">NVIDIA GT 555M GPU<\/a>.<\/p>\n<pre>&gt;&gt; tic;GPU_PCT_corr1;toc\r\nElapsed time is 950.573743 seconds.<\/pre>\n<p>The CPU version of this code took only 3.42 seconds which means that this GPU version is <strong>over 277 times slower!<\/strong> Something has gone horribly, horribly wrong!<\/p>\n<p><strong>Attempt 2 &#8211; Switch from script to function<\/strong><\/p>\n<p>In general, functions should be faster than scripts in MATLAB because more automatic optimisations are performed on functions.\u00a0 I didn&#8217;t see any difference in the CPU version of this code (see <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/corr_asset\/1\/optimised_corr3.m\">optimised_corr3.m<\/a> from part 1 for a CPU function version) and so left it as a script (partly so I had an excuse to discuss it here if I am honest).\u00a0 This GPU version, however, benefits measurably from conversion to a function.\u00a0 To do this, add the following line to the top of GPU_PCT_corr1.m<\/p>\n<pre>function [SimulPrices] = GPU_PCT_corr2(n, sub_size)<\/pre>\n<p>Next, you need to <strong>delete<\/strong> the following two lines<\/p>\n<pre>n=100000;                       %Number of simulations\r\nsub_size = 125;<\/pre>\n<p>Finally, add the following to the end of our new function<\/p>\n<pre>end<\/pre>\n<p>That&#8217;s pretty much all I did to get <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/corr_asset\/2\/GPU_PCT_corr2.m\">GPU_PCT_corr2.m<\/a>.  
Let&#8217;s see how that performs using the same parameters as our script (100,000 simulations in blocks of 125).\u00a0 I used <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/corr_asset\/2\/func_vs_script.m\">script_vs_func.m<\/a> to run both twice after a quick warm-up iteration and the results were:<\/p>\n<pre>Warm up\r\nElapsed time is 1.195806 seconds.\r\nMain event\r\nscript\r\nElapsed time is 950.399920 seconds.\r\nfunction\r\nElapsed time is 938.238956 seconds.\r\nscript\r\nElapsed time is 959.420186 seconds.\r\nfunction\r\nElapsed time is 939.716443 seconds.<\/pre>\n<p>So, switching to a function has saved us between 12 and 20 seconds out of roughly 950, but performance is still very bad!<\/p>\n<p><strong>Attempt 3 &#8211; One big matrix multiply!<br \/>\n<\/strong><\/p>\n<p>So far all I have done is take a program that works OK on a CPU and run it more or less as-is on the GPU in the hope that something magical would happen to make it go faster.\u00a0 Of course, GPUs and CPUs are very different beasts with differing sets of strengths and weaknesses, so it is rather naive to think that this might actually work.\u00a0 What we need to do is play to the GPU&#8217;s strengths more, and the way to do this is to focus on this piece of code:<\/p>\n<pre>for i=1:sub_size\r\n     CorrWiener(:,:,i)=parallel.gpu.GPUArray.randn(T-1,2)*UpperTriangle;\r\nend<\/pre>\n<p>Here, we are performing lots of small matrix multiplications and, as mentioned in part 1, we might hope to get better performance by performing just one large matrix multiplication instead.  
To do this we can change the above code to<\/p>\n<pre>%Generate correlated random numbers\r\n%using one big multiplication\r\nrandoms = parallel.gpu.GPUArray.randn(sub_size*(T-1),2);\r\nCorrWiener = randoms*UpperTriangle;\r\nCorrWiener = reshape(CorrWiener,(T-1),sub_size,2);\r\n%CorrWiener = permute(CorrWiener,[1 3 2]); %Can't do this on the GPU in 2011b or below\r\n\r\n%poor man's permute since GPU permute is not available in 2011b\r\nCorrWiener_final = parallel.gpu.GPUArray.zeros(T-1,2,sub_size);\r\nfor s = 1:2\r\n    CorrWiener_final(:, s, :) = CorrWiener(:, :, s);\r\nend<\/pre>\n<p>The reshape and permute are necessary to get the matrix in the form needed later on.  Sadly, MATLAB 2011b doesn&#8217;t support permute on GPUArrays and so I had to use the &#8216;poor man&#8217;s permute&#8217; instead.<\/p>\n<p>The result of the above is contained in <a href=\"..\/images\/matlab\/corr_asset\/2\/GPU_PCT_corr3.m\">GPU_PCT_corr3.m<\/a> so let&#8217;s see how that does in a fresh instance of MATLAB.<\/p>\n<pre>&gt;&gt; tic;GPU_PCT_corr3(100000,125);toc\r\nElapsed time is 16.666352 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,125);toc\r\nElapsed time is 8.725997 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,125);toc\r\nElapsed time is 8.778124 seconds.<\/pre>\n<p>The first thing to note is that performance is MUCH better so we appear to be on the right track.  The next thing to note is that the first evaluation takes almost twice as long as the subsequent ones.  This is totally expected and is due to various start-up overheads.<\/p>\n<p>Recall that 125 in the above function calls refers to the block size of our Monte Carlo simulation.  We are doing 100,000 simulations in blocks of 125, a number chosen because I determined empirically that this was the best choice on my CPU.  
It turns out we are better off using much larger block sizes on the GPU:<\/p>\n<pre>&gt;&gt; tic;GPU_PCT_corr3(100000,250);toc\r\nElapsed time is 6.052939 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,500);toc\r\nElapsed time is 4.916741 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,1000);toc\r\nElapsed time is 4.404133 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,2000);toc\r\nElapsed time is 4.223403 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,5000);toc\r\nElapsed time is 4.069734 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,10000);toc\r\nElapsed time is 4.039446 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,20000);toc\r\nElapsed time is 4.068248 seconds.\r\n&gt;&gt; tic;GPU_PCT_corr3(100000,25000);toc\r\nElapsed time is 4.099588 seconds.<\/pre>\n<p>The above, rather crude, test suggests that block sizes of 10,000 are the best choice on my laptop&#8217;s GPU.  Sadly, however, it&#8217;s STILL slower than the 3.42 seconds I managed on the i7 CPU and represents the best I&#8217;ve managed using pure MATLAB code.  The profiler tells me that the vast majority of the GPU execution time is spent in the cumprod line and in random number generation (over 40% each).<br \/>\n<strong><\/strong><\/p>\n<p><strong>Trying a better GPU<\/strong><br \/>\nOf course now that I have code that runs on a GPU I could just throw it at a better GPU and see how that does.  I have access to MATLAB 2011b on a  Tesla M2070 hooked up to a Linux machine so I ran the code on that.  I tried various block sizes and the best time was 0.8489 seconds with the call GPU_PCT_corr3(100000,20000) which is just over 4 times faster than my laptop&#8217;s CPU.<\/p>\n<p><strong>Ask the Audience<\/strong><br \/>\nCan you do better using just the GPU functionality provided in the Parallel Computing Toolbox (so no bespoke CUDA kernels or Jacket just yet)?  
I&#8217;ll be looking at AccelerEyes&#8217; Jacket myself in the next post.<\/p>\n<p><strong>Results so far<\/strong><\/p>\n<ul>\n<li> Best CPU Result on laptop (i7-2630QM) with pure MATLAB code &#8211; 3.42 seconds<\/li>\n<li> Best GPU Result with PCT on laptop (GT555M) &#8211; 4.04 seconds<\/li>\n<li> Best GPU Result with PCT on Tesla M2070 &#8211; 0.85 seconds<\/li>\n<\/ul>\n<p><strong>Test System Specification<\/strong><\/p>\n<ul>\n<li>Laptop model: Dell XPS L702X<\/li>\n<li>CPU: <a href=\"http:\/\/www.notebookcheck.net\/Intel-Core-i7-2630QM-Notebook-Processor.41483.0.html\">Intel Core i7-2630QM<\/a> @ 2GHz, software overclockable to 2.9GHz. 4 physical cores but total 8 virtual cores due to Hyperthreading.<\/li>\n<li>GPU: <a href=\"http:\/\/www.notebookcheck.net\/NVIDIA-GeForce-GT-555M.41933.0.html\">GeForce GT 555M<\/a> with 144 CUDA Cores.\u00a0 Graphics clock: 590MHz.\u00a0 Processor clock: 1180MHz. 3072MB DDR3 memory<\/li>\n<li>RAM: 8GB<\/li>\n<li>OS: Windows 7 Home Premium 64 bit.<\/li>\n<li>MATLAB: 2011b<\/li>\n<\/ul>\n<p><strong>Acknowledgements<\/strong><\/p>\n<p>Thanks to Yong Woong Lee of the Manchester Business School as well as various employees at The MathWorks for useful discussions and advice.\u00a0 Any mistakes that remain are all my own :)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article is the second part of a series where I look at rewriting a particular piece of MATLAB code using various techniques.\u00a0 The introduction to the series is here and the introduction to the larger series of GPU articles for MATLAB on WalkingRandomly is here. 
Attempt 1 &#8211; Make as few modifications as possible [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[44,51,53,4,11,7],"tags":[],"class_list":["post-3978","post","type-post","status-publish","format-standard","hentry","category-cuda","category-gpu","category-making-matlab-faster","category-math-software","category-matlab","category-programming"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-12a","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3978","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3978"}],"version-history":[{"count":29,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3978\/revisions"}],"predecessor-version":[{"id":4177,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3978\/revisions\/4177"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3978"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv
2%2Fcategories&post=3978"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3978"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}