Optimising a correlated asset calculation on MATLAB #2 – Using the GPU via the PCT

February 9th, 2012 | Categories: CUDA, GPU, Making MATLAB faster, math software, matlab, programming

This article is the second part of a series where I look at rewriting a particular piece of MATLAB code using various techniques.  The introduction to the series is here and the introduction to the larger series of GPU articles for MATLAB on WalkingRandomly is here.

Attempt 1 – Make as few modifications as possible

I took my best CPU-only code from last time (optimised_corr2.m) and changed a load of data types from double to gpuArray in order to get the calculation to run on my laptop's GPU using the Parallel Computing Toolbox in MATLAB 2011b.  I also switched to the GPU versions of various functions, for example parallel.gpu.GPUArray.randn instead of randn.  Functions such as cumprod needed no modifications at all since they are nicely overloaded: if the argument to cumprod is of type double then the calculation happens on the CPU, whereas if it is a gpuArray then it happens on the GPU.
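To give a flavour of the sort of change involved, here is an illustrative fragment (not the actual contents of GPU_PCT_corr1.m; in particular, the final gather() is just an assumption about how the result gets back to the host):

%CPU version: random numbers generated and multiplied on the CPU
CorrWiener(:,:,i) = randn(T-1,2)*UpperTriangle;

%GPU version: random numbers generated directly on the GPU, so the
%multiplication (and everything downstream of it) happens on the GPU too
CorrWiener(:,:,i) = parallel.gpu.GPUArray.randn(T-1,2)*UpperTriangle;

%Bring the final result back from the GPU to the CPU when needed
SimulPrices = gather(SimulPrices);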

The above work took about a minute to do which isn’t bad for a CUDA ‘porting’ effort!  The result, which I’ve called GPU_PCT_corr1.m is available for you to download and try out.

How about performance?  Let’s do a quick tic and toc using my laptop’s NVIDIA GT 555M GPU.

>> tic;GPU_PCT_corr1;toc
Elapsed time is 950.573743 seconds.

The CPU version of this code took only 3.42 seconds which means that this GPU version is over 277 times slower! Something has gone horribly, horribly wrong!

Attempt 2 – Switch from script to function

In general, functions should be faster than scripts in MATLAB because more automatic optimisations are performed on functions.  I didn't see any difference for the CPU version of this code (see optimised_corr3.m from part 1 for a CPU function version) and so left it as a script (partly, if I'm honest, so that I had an excuse to discuss it here).  This GPU version, however, benefits noticeably from conversion to a function.  To do this, add the following line to the top of GPU_PCT_corr1.m

function [SimulPrices] = GPU_PCT_corr2(n, sub_size)

Next, you need to delete the following two lines

n=100000;                       %Number of simulations
sub_size = 125;                 %Simulation block size

Finally, add the following to the end of our new function

end
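
Putting those three edits together, the skeleton of GPU_PCT_corr2.m ends up looking something like this (a sketch only; the body is unchanged from GPU_PCT_corr1.m apart from the two deleted lines):

function [SimulPrices] = GPU_PCT_corr2(n, sub_size)
%GPU_PCT_corr2 - GPU version of the correlated asset calculation
%n        - number of simulations
%sub_size - simulation block size

    %...body exactly as in GPU_PCT_corr1.m...

end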

That’s pretty much all I did to get GPU_PCT_corr2.m. Let’s see how that performs using the same parameters as our script (100,000 simulations in blocks of 125).  I used script_vs_func.m to run both twice after a quick warm-up iteration and the results were:

Warm up
Elapsed time is 1.195806 seconds.
Main event
script
Elapsed time is 950.399920 seconds.
function
Elapsed time is 938.238956 seconds.
script
Elapsed time is 959.420186 seconds.
function
Elapsed time is 939.716443 seconds.

So, switching to a function has saved us a few seconds but performance is still very bad!

Attempt 3 – One big matrix multiply!

So far, all I have done is take a program that works OK on a CPU and run it exactly as-is on the GPU in the hope that something magical would happen to make it go faster.  Of course, GPUs and CPUs are very different beasts with differing sets of strengths and weaknesses, so it is rather naive to think that this might actually work.  What we need to do is play to the GPU's strengths, and the way to do this is to focus on this piece of code.

for i=1:sub_size
     CorrWiener(:,:,i)=parallel.gpu.GPUArray.randn(T-1,2)*UpperTriangle;
end

Here, we are performing lots of small matrix multiplications and, as mentioned in part 1, we might hope to get better performance by performing just one large matrix multiplication instead. To do this we can change the above code to

%Generate correlated random numbers
%using one big multiplication
randoms = parallel.gpu.GPUArray.randn(sub_size*(T-1),2);
CorrWiener = randoms*UpperTriangle;
CorrWiener = reshape(CorrWiener,(T-1),sub_size,2);
%CorrWiener = permute(CorrWiener,[1 3 2]); %Can't do this on the GPU in 2011b or below

%Poor man's permute since permute is not available for GPUArrays in 2011b
CorrWiener_final = parallel.gpu.GPUArray.zeros(T-1,2,sub_size);
for s = 1:2
    CorrWiener_final(:, s, :) = CorrWiener(:, :, s);
end

The reshape and permute are necessary to get the matrix into the form needed later on. Sadly, MATLAB 2011b doesn't support permute on GPUArrays, so I had to use the 'poor man's permute' instead.
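
For what it's worth, on a release where permute is supported for GPUArrays the loop above wouldn't be needed at all; the commented-out one-liner in the listing does the same job:

%On releases where gpuArray supports permute, this replaces the loop
CorrWiener = permute(CorrWiener,[1 3 2]);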

The result of the above is contained in GPU_PCT_corr3.m so let’s see how that does in a fresh instance of MATLAB.

>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 16.666352 seconds.
>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 8.725997 seconds.
>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 8.778124 seconds.

The first thing to note is that performance is MUCH better so we appear to be on the right track. The next thing to note is that the first evaluation is much slower than all subsequent ones. This is totally expected and is due to various start-up overheads.

Recall that the 125 in the above function calls refers to the block size of our Monte Carlo simulation. We are doing 100,000 simulations in blocks of 125, a number chosen because I determined empirically that it was the best choice on my CPU. It turns out we are better off using much larger block sizes on the GPU:

>> tic;GPU_PCT_corr3(100000,250);toc
Elapsed time is 6.052939 seconds.
>> tic;GPU_PCT_corr3(100000,500);toc
Elapsed time is 4.916741 seconds.
>> tic;GPU_PCT_corr3(100000,1000);toc
Elapsed time is 4.404133 seconds.
>> tic;GPU_PCT_corr3(100000,2000);toc
Elapsed time is 4.223403 seconds.
>> tic;GPU_PCT_corr3(100000,5000);toc
Elapsed time is 4.069734 seconds.
>> tic;GPU_PCT_corr3(100000,10000);toc
Elapsed time is 4.039446 seconds.
>> tic;GPU_PCT_corr3(100000,20000);toc
Elapsed time is 4.068248 seconds.
>> tic;GPU_PCT_corr3(100000,25000);toc
Elapsed time is 4.099588 seconds.

The above, rather crude, test suggests that a block size of 10,000 is the best choice on my laptop's GPU. Sadly, however, it's STILL slower than the 3.42 seconds I managed on the i7 CPU, and it represents the best I've managed using pure MATLAB code. The profiler tells me that the vast majority of the GPU execution time is spent in the cumprod line and in random number generation (over 40% each).
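
If you would like to repeat this crude block-size sweep on your own hardware, a simple harness along the following lines will do it (illustrative only; it uses the same warm-up-then-time approach as above):

%Crude block-size sweep for GPU_PCT_corr3
block_sizes = [250 500 1000 2000 5000 10000 20000 25000];
GPU_PCT_corr3(100000,250);     %warm-up run to absorb start-up overheads
for b = block_sizes
    tic;
    GPU_PCT_corr3(100000,b);
    fprintf('Block size %6d: %f seconds\n',b,toc);
end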

Trying a better GPU
Of course now that I have code that runs on a GPU I could just throw it at a better GPU and see how that does. I have access to MATLAB 2011b on a Tesla M2070 hooked up to a Linux machine so I ran the code on that. I tried various block sizes and the best time was 0.8489 seconds with the call GPU_PCT_corr3(100000,20000) which is just over 4 times faster than my laptop’s CPU.

Ask the Audience
Can you do better using just the GPU functionality provided in the Parallel Computing Toolbox (so no bespoke CUDA kernels or Jacket just yet)? I'll be looking at how AccelerEyes' Jacket performs in the next post.

Results so far

  • Best CPU Result on laptop (i7-2630QM) with pure MATLAB code – 3.42 seconds
  • Best GPU Result with PCT on laptop (GT555M) – 4.04 seconds
  • Best GPU Result with PCT on Tesla M2070 – 0.85 seconds

Test System Specification

  • Laptop model: Dell XPS L702X
  • CPU: Intel Core i7-2630QM @ 2GHz, software overclockable to 2.9GHz. 4 physical cores but a total of 8 virtual cores due to Hyperthreading.
  • GPU: GeForce GT 555M with 144 CUDA cores.  Graphics clock: 590MHz.  Processor clock: 1180MHz. 3072MB DDR3 memory.
  • RAM: 8GB
  • OS: Windows 7 Home Premium 64 bit.
  • MATLAB: 2011b

Acknowledgements

Thanks to Yong Woong Lee of the Manchester Business School as well as various employees at The Mathworks for useful discussions and advice.  Any mistakes that remain are all my own :)

  1. Narfi Stefansson
    February 15th, 2012 at 22:58

    Thank you for the detailed writeup. This is indeed a fair presentation of how your algorithm works with our software on your hardware. You are absolutely correct that the random number generation is somewhat computationally intensive on the GPU, and I thought it would be appropriate to share with you some of the thinking behind that.

    It is rare that the speed of the random number generator has a significant effect on overall application performance, but it does happen. In most cases, the computations that are done with the random numbers outweigh the cost of their generation. We should note that users rarely worry about the quality of the default random number generator in MATLAB because they trust that it yields “good” random numbers. We wanted that to remain true for the default random number generator on the GPU, even if that meant that the random number generation was more computationally intensive.

    We chose mrg32k3a (see http://www.mathworks.com/help/techdoc/math/brn4ixh.html#brvku_2 ) as the default random number generator on the GPU as it meets the following criteria:
    – It is a very well-tested algorithm
    – It has full support for parallel random number generation
    – Its statistical properties (i.e. lack of correlation) are known in the parallel setting. Thus, it provides “good” random numbers on a single GPU, on a cluster of GPUs, or on a cluster of CPUs.
    – It has a large number of parallel independent substreams, each of which has length 2^76, long enough to make it extremely unlikely that any customer will ever see any effects from exhausting the random number generator.
    – It supports both the creation of random matrices and unrestricted usage within arrayfun to generate random scalars(*).
    – It supports reproducibility. That is, one can reproduce previous results through retrieving, storing and later re-instating the random number generator state.
    – It is supported on the CPU in MATLAB. Furthermore, the random numbers produced on the GPU match exactly those produced by that random number generator when used on the CPU in MATLAB.

    Using mrg32k3a with the Inversion method to generate normally distributed random numbers allows us to meet all of the criteria listed above. This makes it a very good choice as the default random number generator for the GPU, one that we are very comfortable recommending to our customers.
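
    As a rough illustration of the reproducibility point (a minimal sketch only; it assumes the parallel.gpu.RandStream interface in the Parallel Computing Toolbox):

    %Make mrg32k3a ('CombRecursive') the global GPU generator with a fixed seed
    s = parallel.gpu.RandStream('CombRecursive','Seed',1234);
    parallel.gpu.RandStream.setGlobalStream(s);

    A = parallel.gpu.GPUArray.randn(4);    %first batch of random numbers

    reset(s);                              %re-instate the initial generator state
    B = parallel.gpu.GPUArray.randn(4);    %reproduces the first batch exactly

    isequal(gather(A),gather(B))           %returns true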

    (*) Complicated functions that are invoked using arrayfun may vary the number of times they invoke rand based on the input values or even based on the random numbers themselves. A realistic example might involve a Monte-Carlo simulation that draws random numbers to perform a simulation until convergence is reached. However, it can be demonstrated via the following contrived example:
    function m = meanOfNumbers(n)
    % meanOfNumbers Return the mean of n random numbers.
        s = 0;
        for i = 1:n
            s = s + rand();
        end
        m = s./n;
    end
    If you invoke it via
    y = arrayfun(@meanOfNumbers, gpuArray(1:1e4)), then y(1) is based on one random number, whereas y(1e4) is based on 1e4 random numbers.

    I hope this helps you understand why we arrived at that particular choice of random number generator, even though it did not give you the performance you expected.

    Best,

    Narfi Stefansson
    Parallel Computing, MathWorks

  2. KHOA
    February 6th, 2014 at 02:21

    Just following on from the last comment: I wish to ask if it would be better to generate a huge pool of random numbers on the CPU and then transfer them to the GPU for use later on. I'm thinking of doing that in my algorithm. Probably I can share it with you later this week.

  3. Mike Croucher
    February 6th, 2014 at 20:17

    Hi

    I wouldn’t go down that route if I were you. GPU-based random number generation is faster in modern MATLAB — http://www.walkingrandomly.com/?p=4870

    Cheers,
    Mike