## A brief look at CUDA support in Maple 15

February 12th, 2012 | Categories: CUDA, GPU, Maple

Maple has had support for NVIDIA GPUs since version 14 but I hadn't played with it much until recently.  Essentially, I was put off by the fact that Maple's CUDA package seemed to support only one function: matrix-matrix multiplication. However, a recent conversation with a Maple developer changed my mind.

It is true that only MatrixMatrixMultiply has been accelerated but when you flip the CUDA switch in Maple, every function in the LinearAlgebra package that calls MatrixMatrixMultiply also gets accelerated.  This leads to the possibility of a lot of speed-ups for very little work.

So, this morning I thought I would take a closer look using my laptop.  Let's start by timing how long it takes the CPU to multiply two 4000 by 4000 double-precision matrices:

with(LinearAlgebra):
CUDA:-Enable(false):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t

The exact time varied a little from run to run, but 3.76 seconds is a typical result. I'm only feeling my way at this stage, so I'm not doing any proper benchmarking.
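A rough way to sanity-check a timing like this is to convert it to an effective floating-point rate, using the standard 2n³ flop count for a dense n by n matrix product. A quick Python sketch of the arithmetic (Python here is just a calculator, not part of the Maple session):

```python
# Estimate effective GFLOPS for an n-by-n matrix multiply timed at t seconds.
# A dense matrix product costs roughly 2*n**3 floating-point operations.
def effective_gflops(n, seconds):
    return 2 * n**3 / seconds / 1e9

# The 4000 x 4000 double-precision multiply took about 3.76 s on the CPU above
print(round(effective_gflops(4000, 3.76), 1))  # roughly 34 GFLOPS
```

Around 34 GFLOPS is plausible for a quad-core Sandy Bridge laptop CPU running a multithreaded BLAS, which suggests the CPU path is already well optimised.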

To do this calculation on the GPU, all I need to do is change the line

CUDA:-Enable(false):

to

CUDA:-Enable(true):

like so

with(LinearAlgebra):
CUDA:-Enable(true):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t

Typical execution time was 8.37 seconds, so the GPU version is more than twice as slow as the CPU version on my machine.

### Trying different matrix sizes

Not wanting to admit defeat after just a single trial, I timed the above code using different matrix sizes.  Here are the results:

• 1000 by 1000: CPU=0.07 seconds GPU=0.17 seconds
• 2000 by 2000: CPU=0.53 seconds GPU=1.07 seconds
• 4000 by 4000: CPU=3.76 seconds GPU=8.37 seconds
• 5000 by 5000: CPU=7.44 seconds GPU=19.48 seconds
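Tabulating the slowdown ratio from the numbers above (again, a small Python sketch just for the arithmetic) shows the GPU staying roughly 2x to 2.6x slower at every size, so the gap doesn't close as the matrices grow:

```python
# CPU and GPU timings (seconds) from the double-precision runs above
timings = {
    1000: (0.07, 0.17),
    2000: (0.53, 1.07),
    4000: (3.76, 8.37),
    5000: (7.44, 19.48),
}

for n, (cpu, gpu) in timings.items():
    print(f"{n} by {n}: GPU is {gpu / cpu:.2f}x slower than CPU")
```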

### Switching to single precision

GPUs do much better with single-precision numbers, so I gave those a try too.  All you need to do is change

datatype = float[8]

to

datatype = float[4]

in the above code. The results are:

• 1000 by 1000: CPU=0.03 seconds GPU=0.07 seconds
• 2000 by 2000: CPU=0.35 seconds GPU=0.66 seconds
• 4000 by 4000: CPU=1.86 seconds GPU=2.37 seconds
• 5000 by 5000: CPU=3.81 seconds GPU=5.2 seconds
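Comparing the 4000 by 4000 timings shows that single precision helps the GPU more than the CPU, roughly 3.5x versus 2x faster, but not by enough to close the gap. A quick Python check of the arithmetic:

```python
# Speedup from switching datatype=float[8] (double) to float[4] (single),
# using the 4000 by 4000 timings above
cpu_gain = 3.76 / 1.86   # CPU runs about 2x faster in single precision
gpu_gain = 8.37 / 2.37   # GPU runs about 3.5x faster in single precision
print(f"CPU gain: {cpu_gain:.2f}x, GPU gain: {gpu_gain:.2f}x")
```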

So the GPU loses in single-precision mode too on my hardware.  If I can't get a speedup with MatrixMatrixMultiply on my system, there is no point exploring the other LinearAlgebra routines, since they all rely on MatrixMatrixMultiply for their CUDA acceleration and so will be slower too.

I guess that in this case, my CPU is too powerful and my GPU is too wimpy to see the acceleration I was hoping for.

Thanks to Maplesoft for providing me with a review copy of Maple 15.

### Test System Specification

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @ 2 GHz, software overclockable to 2.9 GHz. 4 physical cores but 8 virtual cores in total due to Hyper-Threading.
• GPU: GeForce GT 555M with 144 CUDA cores.  Graphics clock: 590 MHz.  Processor clock: 1180 MHz. 3072 MB DDR3 memory.
• RAM: 8 GB
• OS: Windows 7 Home Premium 64 bit.
• Maple 15
### Comments

1. 144 CUDA cores should have given you better results than seen above. Are you seeing similar performance in MATLAB and Mathematica?

2. Some rough back of the envelope calculations:

Your cuda card has 144 cores running at 1.18 GHz;

144 * 1.18 = 169.92 Giga instructions per second.

Matrix multiplication is also done in parallel with all the tricks in the book.
Your processor has 4 cores + 4 hyperthreads, each capable of running 4 simultaneous double precision SIMD instructions (or 8 single precision) at (up to) 2.9 GHz (this is a rough upper bound as hyperthreading will likely be less than double the effect):

(4+4)*4*2.9 = 92.8 Giga instructions per second

The CUDA card should have about double the throughput compared to the CPU. The main difference is that for the GPU operation, the random matrix must be copied from main memory and back, whereas the CPU operation can be done directly in main memory. It is the copy operations that are skewing the benchmarks against an already very fast matrix multiplication algorithm.
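The two throughput estimates in the comment above are easy to reproduce. A Python sketch of the same back-of-the-envelope arithmetic (the constants come straight from the comment):

```python
# Reproducing the back-of-the-envelope throughput estimates from the comment
gpu_ops = 144 * 1.18          # 144 CUDA cores at 1.18 GHz -> giga-ops per second
cpu_ops = (4 + 4) * 4 * 2.9   # 8 hardware threads x 4-wide double SIMD at 2.9 GHz

print(f"GPU: {gpu_ops:.2f} G ops/s, CPU: {cpu_ops:.1f} G ops/s")
print(f"GPU/CPU ratio: {gpu_ops / cpu_ops:.2f}")  # roughly 1.8x
```

As the commenter notes, this puts the GPU at a bit under double the CPU's rough peak, so the host-to-device copy overhead only has to eat a modest margin for the GPU to lose.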

Here are some benchmarks using your test with a slightly bigger card:
CPU: Intel Core i7 920 @ 2.67GHz
GPU: NVIDIA Quadro FX 5800 (240 cores)

• 1000 by 1000: CPU=0.400 seconds, GPU=0.176 seconds
• 2000 by 2000: CPU=2.40 seconds, GPU=0.504 seconds
• 3000 by 3000: CPU=7.600 seconds, GPU=1.212 seconds
• 4000 by 4000: CPU=19.396 seconds, GPU=2.653 seconds
• 5000 by 5000: CPU=67.900 seconds, GPU=5.451 seconds
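From those timings, the Quadro system's GPU speedup over its CPU grows steadily with matrix size. A quick Python tabulation (I read the last row as 5000 by 5000, assuming "5000 by 4000" is a typo):

```python
# GPU-vs-CPU speedup on the Quadro FX 5800 system, from the timings above
# (last row assumed to be 5000 by 5000)
timings_fx5800 = {
    1000: (0.400, 0.176),
    2000: (2.40, 0.504),
    3000: (7.600, 1.212),
    4000: (19.396, 2.653),
    5000: (67.900, 5.451),
}
for n, (cpu_t, gpu_t) in timings_fx5800.items():
    print(f"{n} by {n}: GPU {cpu_t / gpu_t:.1f}x faster")
```

The speedup climbs from about 2.3x at the smallest size to over 12x at the largest, which is the pattern you'd expect once the fixed PCIe copy cost is amortised over more work.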

3. Thanks, maple guy. Your CPU timings are really slow compared to mine but our processors seem comparable. 4000 by 4000 takes 3.76 seconds on my machine but 19.396 seconds on yours. That’s quite a difference!

Which operating system are you using?

4. Good point. Your i7-2630QM should be more similar to my i7-920. The 920 came out in 2008, so it is a bit older, has a smaller cache (6 MB instead of 8 MB), slower memory (DDR3-800/1066 instead of DDR3-1066/1333) and handles only SSE, whereas your machine also supports Intel’s Advanced Vector Extensions (AVX). But the big difference does seem to be indirectly O/S related. I’m using Linux (openSUSE 11.0), and Maple is plugging into an ATLAS BLAS with only dual-core support instead of the MKL BLAS you’re getting on Windows.