{"id":3736,"date":"2011-07-27T22:02:43","date_gmt":"2011-07-27T21:02:43","guid":{"rendered":"http:\/\/www.walkingrandomly.com\/?p=3736"},"modified":"2011-08-16T15:12:54","modified_gmt":"2011-08-16T14:12:54","slug":"matlab-gpu-cuda-experiences-on-my-laptop-elementwise-operations-on-the-gpu-1","status":"publish","type":"post","link":"https:\/\/walkingrandomly.com\/?p=3736","title":{"rendered":"MATLAB GPU \/ CUDA experiences on my laptop &#8211; Elementwise operations on the GPU #1"},"content":{"rendered":"<p>This is part 1 of an ongoing series of articles about MATLAB programming for GPUs using the Parallel Computing Toolbox.\u00a0 The introduction and index to the series is at <a href=\"..\/?p=3730\">https:\/\/www.walkingrandomly.com\/?p=3730<\/a>.<\/p>\n<p>Have you ever needed to take the sine of 100 million random numbers?\u00a0 Me neither, but such an operation gives us an excuse to look at the basic concepts of GPU computing with MATLAB and get an idea of the timings we can expect for simple elementwise calculations.<\/p>\n<p><strong>Taking the sine of 100 million numbers on the CPU<\/strong><\/p>\n<p>Let&#8217;s forget about GPUs for a second and look at how this would be done on the CPU using MATLAB.\u00a0 First, I create 100 million random numbers over a range from 0 to 10*pi and store them in the variable cpu_x:<\/p>\n<pre>cpu_x = rand(1,100000000)*10*pi;<\/pre>\n<p>Now I take the sine of all 100 million elements of cpu_x using a single command.<\/p>\n<pre>cpu_y = sin(cpu_x);<\/pre>\n<p>I have to confess that I find the above command very cool.  Not only are we looping over a massive array using just a single line of code but MATLAB will also be performing the operation <strong>in parallel<\/strong>.  So, if you have a multicore machine (and pretty much everyone does these days) then the above command will make good use of many of those cores.  
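<\/p>\n<p>If you&#8217;d like to see this implicit parallelism for yourself, one quick way is to restrict MATLAB to a single computational thread and compare timings.\u00a0 Here&#8217;s a minimal sketch (the variable names are mine, and note that <strong>maxNumCompThreads<\/strong> already gives a deprecation warning in recent releases, though it does still work):<\/p>\n<pre>cpu_x = rand(1,100000000)*10*pi;\r\nold_threads = maxNumCompThreads(1); % force a single computational thread\r\ntic; y1 = sin(cpu_x); t_single = toc\r\nmaxNumCompThreads(old_threads);     % restore the previous thread count\r\ntic; y2 = sin(cpu_x); t_multi = toc<\/pre>\n<p>On a multicore machine you should see t_multi come out noticeably smaller than t_single.\u00a0 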
Furthermore, this kind of parallelisation is built into the core of MATLAB&#8230;no Parallel Computing Toolbox necessary.  As an aside, if you&#8217;d like to see a list of functions that automatically run in parallel on the CPU then <a href=\"https:\/\/www.walkingrandomly.com\/?p=1894\">check out my blog post on the issue<\/a>.<\/p>\n<p>So, how quickly does my 4-core laptop get through this 100 million element array?\u00a0 We can find out using the MATLAB functions <strong>tic<\/strong> and <strong>toc<\/strong>.  I ran it three times on my laptop and got the following<\/p>\n<pre>&gt;&gt; tic;cpu_y = sin(cpu_x);toc\r\nElapsed time is 0.833626 seconds.\r\n&gt;&gt; tic;cpu_y = sin(cpu_x);toc\r\nElapsed time is 0.899769 seconds.\r\n&gt;&gt; tic;cpu_y = sin(cpu_x);toc\r\nElapsed time is 0.916969 seconds.<\/pre>\n<p>So the first thing you&#8217;ll notice is that the timings vary and I&#8217;m not going to go into the reasons why here.  What I am going to say is that because of this variation it makes sense to time the calculation a number of times (say, 20) and take an average.  
Let&#8217;s do that<\/p>\n<pre>sintimes=zeros(1,20);\r\nfor i=1:20;tic;cpu_y = sin(cpu_x);sintimes(i)=toc;end\r\naverage_time = sum(sintimes)\/20\r\n\r\naverage_time =\r\n    0.8011<\/pre>\n<p>So, on average, it takes my quad core laptop just over 0.8 seconds to take the sine of 100 million elements using the CPU.\u00a0 A couple of points:<\/p>\n<ul>\n<li>I note that this time is smaller than any of the three test times I did before running the loop and I&#8217;m not really sure why.\u00a0 I&#8217;m guessing that it takes my CPU a short while to decide that it&#8217;s got a lot of work to do and ramp up to full speed but further insights are welcomed.<\/li>\n<li>While staring at the CPU monitor I noticed that the above calculation never used more than 50% of the available virtual cores.\u00a0 It&#8217;s using all 4 of my physical CPU cores but perhaps if it took advantage of hyperthreading I&#8217;d get even better performance?\u00a0 Changing OMP_NUM_THREADS to 8 before launching MATLAB did nothing to change this.<\/li>\n<\/ul>\n<p><strong>Taking the sine of 100 million numbers on the GPU<\/strong><\/p>\n<p>Just like before, we start off by using the CPU to generate the 100 million random numbers<sup>1<\/sup><\/p>\n<pre>cpu_x = rand(1,100000000)*10*pi;<\/pre>\n<p>The first thing you need to know about GPUs is that they have their own memory that is completely separate from main memory.  So, the GPU doesn&#8217;t know anything about the array created above. 
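<\/p>\n<p>It&#8217;s also worth remembering that GPU memory is usually much smaller than main memory, so before transferring a big array it can pay to check that it will actually fit.\u00a0 A minimal sketch using the Parallel Computing Toolbox&#8217;s <strong>gpuDevice<\/strong> function (the variable names are my own):<\/p>\n<pre>g = gpuDevice;                 % query the currently selected GPU\r\nbytes_needed = numel(cpu_x)*8; % 100 million doubles is 800 MB\r\nif g.FreeMemory &gt; bytes_needed\r\n    disp('The array should fit in GPU memory')\r\nend<\/pre>\n<p>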
Before our GPU can get to work on our data we have to transfer it from main memory to GPU memory and we achieve this using the <strong>gpuArray<\/strong> command.<\/p>\n<pre>gpu_x = gpuArray(cpu_x); % this moves our data to the GPU<\/pre>\n<p>Once the GPU can see all our data we can apply the sine function to it very easily.<\/p>\n<pre>gpu_y = sin(gpu_x);<\/pre>\n<p>Finally, we transfer the results back to main memory.<\/p>\n<pre>cpu_y = gather(gpu_y);<\/pre>\n<p>If, like many of the GPU articles you see in the literature, you don&#8217;t want to include transfer times between GPU and host then you time the calculation like this:<\/p>\n<pre>tic\r\ngpu_y = sin(gpu_x);\r\ntoc<\/pre>\n<p>Just like the CPU version, I repeated this calculation several times and took an average.  The result was 0.3008 seconds, giving a <strong>speedup of 2.75 times compared to the CPU version<\/strong>.<br \/>\nIf, however, you include the time taken to transfer the input data to the GPU and the results back to the CPU then you need to time as follows<\/p>\n<pre>tic\r\ngpu_x = gpuArray(cpu_x);\r\ngpu_y = sin(gpu_x);\r\ncpu_y = gather(gpu_y);\r\ntoc<\/pre>\n<p>On my system this takes 1.0159 seconds on average &#8211; <strong>longer than it takes to simply do the whole thing on the CPU<\/strong>.  
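<\/p>\n<p>A small caveat when timing GPU code this way: GPU operations can execute asynchronously, so <strong>tic<\/strong> and <strong>toc<\/strong> around a bare kernel launch may stop the clock before the device has actually finished.\u00a0 Timing the <strong>gather<\/strong> forces completion because the result must be copied back; alternatively, later Parallel Computing Toolbox releases (from around 2011b, I believe) let you synchronise explicitly:<\/p>\n<pre>tic\r\ngpu_y = sin(gpu_x);\r\nwait(gpuDevice); % block until all work on the GPU has completed\r\ntoc<\/pre>\n<p>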
So, for this particular calculation, transfer times between host and GPU swamp the benefits gained by all of those CUDA cores.<\/p>\n<p><strong>Benchmark code<\/strong><br \/>\nI took the ideas above and wrote a simple benchmark program called <a href=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/gpu\/sine_test.m\">sine_test<\/a>.\u00a0 The way you call it is as follows<\/p>\n<pre>[cpu,gpu_notransfer,gpu_withtransfer] = sine_test(numrepeats,num_elements)<\/pre>\n<p>For example, if you wanted to run the benchmarks 20 times on a 1 million element array and return the average times then you just do<\/p>\n<pre>&gt;&gt; [cpu,gpu_notransfer,gpu_withtransfer] = sine_test(20,1e6)\r\ncpu =\r\n    0.0085\r\ngpu_notransfer =\r\n    0.0022\r\ngpu_withtransfer =\r\n    0.0116<\/pre>\n<p>I then ran this on my laptop for array sizes ranging from 1 million to 100 million and used the results to plot the graph below.<br \/>\n<img decoding=\"async\" src=\"https:\/\/www.walkingrandomly.com\/images\/matlab\/gpu_vs_cpu_lots_of_sines.png\" alt=\"GPU vs CPU for lots of sines\" \/><\/p>\n<p><strong>But I wanna write a &#8216;GPUs are awesome&#8217; paper<\/strong><\/p>\n<p>So far in this little story things are not looking so hot for the GPU, and yet all of the <strong>&#8216;GPUs are awesome&#8217;<\/strong> papers you&#8217;ve ever read seem to disagree with me entirely.\u00a0 What on earth is going on?\u00a0 Well, let&#8217;s take the advice given by <a href=\"http:\/\/csgillespie.wordpress.com\/2011\/07\/12\/how-to-review-a-gpu-statistics-paper\/\">csgillespie.wordpress.com<\/a> and turn it on its head.\u00a0 How do we get awesome speedup figures from the above benchmarks to help us pump out a &#8216;GPUs are awesome&#8217; paper?<\/p>\n<p>0. 
Don&#8217;t consider transfer times between CPU and GPU.<\/p>\n<p>We&#8217;ve already seen that this can ruin performance so let&#8217;s not do it, shall we?\u00a0 As long as we explicitly say that we are not including transfer times then we are covered.<\/p>\n<p>1. Use a single-threaded CPU.<\/p>\n<p>Many papers in the literature compare the GPU version with a single-threaded CPU version and yet I&#8217;ve been using all 4 cores of my processor.\u00a0 Silly me&#8230;let&#8217;s fix that by running MATLAB in single-threaded mode by launching it with the command<\/p>\n<pre>matlab -singleCompThread<\/pre>\n<p>Now when I run the benchmark for 100 million elements I get the following times<\/p>\n<pre>&gt;&gt; [cpu,gpu_no,gpu_with] = sine_test(10,1e8)\r\ncpu =\r\n    2.8875\r\ngpu_no =\r\n    0.3016\r\ngpu_with =\r\n    1.0205<\/pre>\n<p>Now we&#8217;re talking! I can now claim that my GPU version is over 9 times faster than the CPU version.<\/p>\n<p>2. Use an old CPU.<\/p>\n<p>My laptop has got one of those new-fangled Sandy Bridge i7 processors&#8230;one of the best classes of CPU you can get for a laptop.\u00a0 If, however, I was doing these tests at work then I guess I&#8217;d be using a GPU mounted in my university desktop machine.\u00a0 Obviously I would compare the GPU version of my program with the CPU in the desktop&#8230;an Intel Core 2 Quad Q9650.\u00a0 Heck, it&#8217;s running at 3GHz, which is more GHz than my laptop, so to the casual observer (or <a href=\"http:\/\/en.wikipedia.org\/wiki\/Pointy-haired_Boss\">a PHB<\/a>) it would look like I was using a more beefed-up processor in order to make my comparison fairer.<\/p>\n<p>So, I ran the CPU benchmark on that (in singleCompThread mode, obviously) and got 4.009 seconds&#8230;noticeably slower than my laptop.\u00a0 Awesome&#8230;I am definitely going to use that figure!<\/p>\n<p>I know what you&#8217;re thinking&#8230;Mike&#8217;s being a fool for the sake of it but <a 
href=\"http:\/\/csgillespie.wordpress.com\/2011\/07\/12\/how-to-review-a-gpu-statistics-paper\/\">csgillespie.wordpress.com<\/a> puts it like this: <strong><em>&#8216;Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older.&#8217;<\/em><\/strong> So, some of those &#8216;GPUs are awesome&#8217; articles will be accidentally misleading us in exactly this manner.<\/p>\n<p>3. Work in single precision.<\/p>\n<p>Yeah, I know that you like working with double precision arithmetic but that slows GPUs down.\u00a0 So, let&#8217;s switch to single precision.\u00a0 Just argue in your paper that single precision is OK for this particular calculation and we&#8217;ll be set.\u00a0 To change the benchmarking code all you need to do is change every instance of<\/p>\n<pre>rand(1,num_elems)*10*pi;<\/pre>\n<p>to<\/p>\n<pre>rand(1,num_elems,'single')*10*pi;<\/pre>\n<p>Since we are reputable researchers we will, of course, modify both the CPU and GPU versions to work in single precision.\u00a0 Timings are below.<\/p>\n<ul>\n<li>Desktop at work (single thread, single precision): 3.49 seconds<\/li>\n<li>Laptop GPU (single precision, not including transfer): 0.122 seconds<\/li>\n<\/ul>\n<p>OK, so switching to single precision made the CPU version a bit faster but it&#8217;s more than doubled GPU performance.\u00a0 We can now say that the GPU version is over 28 times faster than the CPU version.\u00a0 Now we have ourselves a bona fide &#8216;GPUs are awesome&#8217; paper.<\/p>\n<p>4. 
Use the best GPU we can find.<\/p>\n<p>So far I have been comparing the CPU against the relatively lowly GPU in my laptop.\u00a0 Obviously, however, if I were to do this for real then I&#8217;d get a top-of-the-range Tesla.\u00a0 It turns out that I know someone who has a Tesla C2050 and so we ran the single precision benchmark on that.\u00a0 The result was astonishing&#8230;0.0295 seconds for 100 million numbers, not including transfer times.\u00a0 The double precision performance for the same calculation on the C2050 was 0.0524 seconds.<\/p>\n<p>5. Write the abstract for our &#8216;GPUs are awesome&#8217; paper.<\/p>\n<p><strong><em>We took an Nvidia Tesla C2050 GPU and mounted it in a machine containing an Intel Quad Core CPU running at 3GHz.\u00a0 We developed a program that performs element-wise trigonometry on arrays of up to 100 million single precision random numbers using both the CPU and the GPU.\u00a0 The GPU version of our code ran up to 118 times faster than the CPU version.\u00a0 GPUs are awesome!<\/em><\/strong><\/p>\n<p><strong>Results from different CPUs and GPUs.\u00a0 Double precision, multi-threaded<\/strong><\/p>\n<p>I ran the sine_test benchmark on several different systems for 100 million elements.\u00a0 The CPU was set to be multi-threaded and double precision was used throughout.<\/p>\n<pre>sine_test(10,1e8)<\/pre>\n<p>GPUs<\/p>\n<ul>\n<li>Tesla C2050, Linux, 2011a &#8211; 0.7487 seconds including transfers, 0.0524 seconds excluding transfers<\/li>\n<li>GT 555M &#8211; 144 CUDA Cores, 3GB RAM, Windows 7, 2011a (my laptop&#8217;s GPU) &#8211; 1.0205 seconds including transfers, 0.3016 seconds excluding transfers<\/li>\n<\/ul>\n<p>CPUs<\/p>\n<ul>\n<li>Intel Core i7-880 @ 3.07GHz, Linux, 2011a &#8211; 0.659 seconds<\/li>\n<li>Intel Core i7-2630QM, Windows 7, 2011a (my laptop&#8217;s CPU) &#8211; 0.801 seconds<\/li>\n<li>Intel Core 2 Quad Q9650 @ 3.00GHz, Linux &#8211; 0.958 
seconds<\/li>\n<\/ul>\n<p><strong>Conclusions<\/strong><\/p>\n<ul>\n<li>MATLAB&#8217;s new GPU functions are very easy to use!\u00a0 No need to learn low-level CUDA programming.<\/li>\n<li>It&#8217;s very easy to massage CPU vs GPU numbers to look impressive.\u00a0 Read those &#8216;GPUs are awesome&#8217; papers with care!<\/li>\n<li>In real life you have to consider data transfer times between GPU and CPU since these can dominate overall wall clock time with simple calculations such as those considered here.\u00a0 The more work you can do on the GPU, the better.<\/li>\n<li>My laptop&#8217;s GPU is nowhere near as good as I would have liked it to be.\u00a0 Almost 6 times slower than a Tesla C2050 (excluding data transfer) for elementwise double precision calculations.\u00a0 Data transfer times seem to be about the same though.<\/li>\n<\/ul>\n<p><strong>Next time<\/strong><\/p>\n<p>In the next article in the series I&#8217;ll look at an element-wise calculation that really is worth doing on the GPU &#8211; even using the wimpy GPU in my laptop &#8211; and introduce the MATLAB function arrayfun.<\/p>\n<p><strong>Footnote<\/strong><\/p>\n<p>1 &#8211; MATLAB 2011a can&#8217;t create random numbers directly on the GPU.  I have no doubt that we&#8217;ll be able to do this in future versions of MATLAB, which will change the nature of this particular calculation somewhat.\u00a0 Then it will make sense to include the random number generation in the overall benchmark; transfer times <strong>to<\/strong> the GPU will be non-existent.  
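<\/p>\n<p>For the curious, GPU-side random number generation would look something like this (an assumption on my part: the constructor below comes from Parallel Computing Toolbox releases after 2011a, so it won&#8217;t run on the setup used in this article):<\/p>\n<pre>gpu_x = parallel.gpu.GPUArray.rand(1,100000000)*10*pi; % created directly on the GPU\r\ngpu_y = sin(gpu_x); % no host-to-GPU transfer needed at all<\/pre>\n<p>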
In general, however, we&#8217;ll still come across plenty of situations where we&#8217;ll have a huge array in main memory that needs to be transferred to the GPU for further processing, so what we learn here will not be wasted.<\/p>\n<p><strong>Hardware \/ Software used for the majority of this article<\/strong><\/p>\n<ul>\n<li>Laptop model: Dell XPS L702X<\/li>\n<li>CPU: <a href=\"http:\/\/www.notebookcheck.net\/Intel-Core-i7-2630QM-Notebook-Processor.41483.0.html\">Intel Core i7-2630QM<\/a> @ 2GHz, software overclockable to 2.9GHz. 4 physical cores but a total of 8 virtual cores due to Hyperthreading.<\/li>\n<li>GPU: <a href=\"http:\/\/www.notebookcheck.net\/NVIDIA-GeForce-GT-555M.41933.0.html\">GeForce GT 555M<\/a> with 144 CUDA Cores.\u00a0 Graphics clock: 590MHz.\u00a0 Processor clock: 1180MHz. 3072MB DDR3 memory.<\/li>\n<li>RAM: 8GB<\/li>\n<li>OS: Windows 7 Home Premium 64 bit.\u00a0 I\u2019m not using Linux because of the <a href=\"..\/?p=3653\">lack of official support for Optimus<\/a>.<\/li>\n<li>MATLAB: 2011a with the Parallel Computing Toolbox<\/li>\n<\/ul>\n<p><strong>Other GPU articles at Walking Randomly<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=3436\">GPU Support in Mathematica, <\/a><a href=\"https:\/\/www.walkingrandomly.com\/?p=3436\">Maple, MATLAB and Maple Prime<\/a> &#8211; See the various options available<\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=2860\">Insert new laptop to continue<\/a> &#8211; My first attempt at using the GPU functionality in MATLAB<\/li>\n<li><a href=\"https:\/\/www.walkingrandomly.com\/?p=2860\">NVIDIA lets down Linux laptop users<\/a> &#8211; and how an open source project saves the day<\/li>\n<\/ul>\n<p>Thanks to various people at The MathWorks for some useful discussions, advice and tutorials while creating this series of articles.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is part 1 of an ongoing series of articles about MATLAB programming for GPUs using 
the Parallel Computing Toolbox.\u00a0 The introduction and index to the series is at https:\/\/www.walkingrandomly.com\/?p=3730. Have you ever needed to take the sine of 100 million random numbers?\u00a0 Me neither, but such an operation gives us an excuse to look [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[44,51,53,4,11,41,7],"tags":[],"class_list":["post-3736","post","type-post","status-publish","format-standard","hentry","category-cuda","category-gpu","category-making-matlab-faster","category-math-software","category-matlab","category-parallel-programming","category-programming"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p3swhs-Yg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3736"}],"version-history":[{"count":33,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3736\/revisions"}],"predecessor-version":[{"id":3823,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=\/wp\/v2\/posts\/3736\/revisions\/3
823"}],"wp:attachment":[{"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/walkingrandomly.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}