## A brief look at CUDA support in Maple 15

February 12th, 2012 | Categories: CUDA, GPU, Maple

Maple has had support for NVIDIA GPUs since version 14 but I hadn't played with it much until recently.  Essentially, I was put off by the fact that Maple's CUDA package seemed to support only one function: matrix-matrix multiplication. However, a recent conversation with a Maple developer changed my mind.

It is true that only MatrixMatrixMultiply has been accelerated but when you flip the CUDA switch in Maple, every function in the LinearAlgebra package that calls MatrixMatrixMultiply also gets accelerated.  This leads to the possibility of a lot of speed-ups for very little work.

So, this morning I thought I would take a closer look using my laptop.  Let's start by timing how long it takes the CPU to multiply two 4000 by 4000 double-precision matrices:

with(LinearAlgebra):
CUDA:-Enable(false):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t

The exact time varied a little from run to run, but 3.76 seconds is a typical result. I'm only feeling my way at this stage, so I'm not doing any proper benchmarking.
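A rough way to sanity-check a timing like this is to convert it to an effective floating-point rate, using the standard 2n³ flop count for a dense n by n matrix product. A quick Python sketch of the arithmetic (Python here is just a calculator, not part of the Maple session):

```python
# Estimate effective GFLOPS for an n-by-n matrix multiply timed at t seconds.
# A dense matrix product costs roughly 2*n**3 floating-point operations.
def effective_gflops(n, seconds):
    return 2 * n**3 / seconds / 1e9

# The 4000 x 4000 double-precision multiply took about 3.76 s on the CPU above
print(round(effective_gflops(4000, 3.76), 1))  # roughly 34 GFLOPS
```

Around 34 GFLOPS is plausible for a quad-core Sandy Bridge laptop CPU running a multithreaded BLAS, which suggests the CPU path is already well optimised.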

To do this calculation on the GPU, all I need to do is change the line

CUDA:-Enable(false):

to

CUDA:-Enable(true):

like so

with(LinearAlgebra):
CUDA:-Enable(true):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t

Typical execution time was 8.37 seconds, so the GPU version is more than twice as slow as the CPU version on my machine.

### Trying different matrix sizes

Not wanting to admit defeat after just a single trial, I timed the above code using different matrix sizes.  Here are the results:

• 1000 by 1000: CPU=0.07 seconds GPU=0.17 seconds
• 2000 by 2000: CPU=0.53 seconds GPU=1.07 seconds
• 4000 by 4000: CPU=3.76 seconds GPU=8.37 seconds
• 5000 by 5000: CPU=7.44 seconds GPU=19.48 seconds
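Tabulating the slowdown ratio from the numbers above (again, a small Python sketch just for the arithmetic) shows the GPU staying roughly 2x to 2.6x slower at every size, so the gap doesn't close as the matrices grow:

```python
# CPU and GPU timings (seconds) from the double-precision runs above
timings = {
    1000: (0.07, 0.17),
    2000: (0.53, 1.07),
    4000: (3.76, 8.37),
    5000: (7.44, 19.48),
}

for n, (cpu, gpu) in timings.items():
    print(f"{n} by {n}: GPU is {gpu / cpu:.2f}x slower than CPU")
```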

### Switching to single precision

GPUs do much better with single-precision numbers, so I gave those a try too.  All you need to do is change

datatype = float[8]

to

datatype = float[4]

in the above code. The results are:

• 1000 by 1000: CPU=0.03 seconds GPU=0.07 seconds
• 2000 by 2000: CPU=0.35 seconds GPU=0.66 seconds
• 4000 by 4000: CPU=1.86 seconds GPU=2.37 seconds
• 5000 by 5000: CPU=3.81 seconds GPU=5.2 seconds
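Comparing the 4000 by 4000 timings shows that single precision helps the GPU more than the CPU, roughly 3.5x versus 2x faster, but not by enough to close the gap. A quick Python check of the arithmetic:

```python
# Speedup from switching datatype=float[8] (double) to float[4] (single),
# using the 4000 by 4000 timings above
cpu_gain = 3.76 / 1.86   # CPU runs about 2x faster in single precision
gpu_gain = 8.37 / 2.37   # GPU runs about 3.5x faster in single precision
print(f"CPU gain: {cpu_gain:.2f}x, GPU gain: {gpu_gain:.2f}x")
```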

So the GPU loses in single-precision mode too on my hardware.  If I can't get a speedup with MatrixMatrixMultiply on my system, there is no point exploring the other LinearAlgebra routines, since they all rely on MatrixMatrixMultiply for their CUDA acceleration and so will be slower too.

I guess that in this case, my CPU is too powerful and my GPU is too wimpy to see the acceleration I was hoping for.

Thanks to Maplesoft for providing me with a review copy of Maple 15.

### Test System Specification

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @ 2 GHz, software overclockable to 2.9 GHz. 4 physical cores but 8 virtual cores in total due to Hyper-Threading.
• GPU: GeForce GT 555M with 144 CUDA cores.  Graphics clock: 590 MHz.  Processor clock: 1180 MHz. 3072 MB DDR3 memory.
• RAM: 8 GB
• OS: Windows 7 Home Premium 64 bit.
• Maple 15
### Comments

1. 144 CUDA cores should have given you better results than seen above. Are you seeing similar performance in MATLAB and Mathematica?

2. Some rough back of the envelope calculations:

Your cuda card has 144 cores running at 1.18 GHz;

144 * 1.18 = 169.92 Giga instructions per second.

Matrix multiplication is also done in parallel with all the tricks in the book.
Your processor has 4 cores + 4 hyperthreads, each capable of running 4 simultaneous double precision SIMD instructions (or 8 single precision) at (up to) 2.9 GHz (this is a rough upper bound as hyperthreading will likely be less than double the effect):

(4+4)*4*2.9 = 92.8 Giga instructions per second

The CUDA card should have about double the throughput compared to the CPU. The main difference is that for the GPU operation, the random matrix must be copied from main memory and back, whereas the CPU operation can be done directly in main memory. It is the copy operations that are skewing the benchmarks against an already very fast matrix multiplication algorithm.
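The two throughput estimates in the comment above are easy to reproduce. A Python sketch of the same back-of-the-envelope arithmetic (the constants come straight from the comment):

```python
# Reproducing the back-of-the-envelope throughput estimates from the comment
gpu_ops = 144 * 1.18          # 144 CUDA cores at 1.18 GHz -> giga-ops per second
cpu_ops = (4 + 4) * 4 * 2.9   # 8 hardware threads x 4-wide double SIMD at 2.9 GHz

print(f"GPU: {gpu_ops:.2f} G ops/s, CPU: {cpu_ops:.1f} G ops/s")
print(f"GPU/CPU ratio: {gpu_ops / cpu_ops:.2f}")  # roughly 1.8x
```

As the commenter notes, this puts the GPU at a bit under double the CPU's rough peak, so the host-to-device copy overhead only has to eat a modest margin for the GPU to lose.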

Here are some benchmarks using your test with a slightly bigger card:
CPU: Intel Core i7 920 @ 2.67GHz
GPU: NVIDIA Quadro FX 5800 (240 cores)

• 1000 by 1000: CPU=0.400 seconds, GPU=0.176 seconds
• 2000 by 2000: CPU=2.40 seconds, GPU=0.504 seconds
• 3000 by 3000: CPU=7.600 seconds, GPU=1.212 seconds
• 4000 by 4000: CPU=19.396 seconds, GPU=2.653 seconds
• 5000 by 5000: CPU=67.900 seconds, GPU=5.451 seconds
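From those timings, the Quadro system's GPU speedup over its CPU grows steadily with matrix size. A quick Python tabulation (I read the last row as 5000 by 5000, assuming "5000 by 4000" is a typo):

```python
# GPU-vs-CPU speedup on the Quadro FX 5800 system, from the timings above
# (last row assumed to be 5000 by 5000)
timings_fx5800 = {
    1000: (0.400, 0.176),
    2000: (2.40, 0.504),
    3000: (7.600, 1.212),
    4000: (19.396, 2.653),
    5000: (67.900, 5.451),
}
for n, (cpu_t, gpu_t) in timings_fx5800.items():
    print(f"{n} by {n}: GPU {cpu_t / gpu_t:.1f}x faster")
```

The speedup climbs from about 2.3x at the smallest size to over 12x at the largest, which is the pattern you'd expect once the fixed PCIe copy cost is amortised over more work.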

3. Thanks, maple guy. Your CPU timings are really slow compared to mine but our processors seem comparable. 4000 by 4000 takes 3.76 seconds on my machine but 19.396 seconds on yours. That’s quite a difference!

Which operating system are you using?

4. Good point. Your i7-2630QM should be more similar to my i7-920. The 920 came out in 2008, so it is a bit older, has a smaller cache (6 MB instead of 8 MB), slower memory (DDR3-800/1066 instead of DDR3-1066/1333) and handles only SSE, whereas your machine also supports Intel’s Advanced Vector Extensions (AVX). But the big difference does seem to be indirectly O/S related. I’m using Linux (openSUSE 11.0), and Maple is plugging into an ATLAS BLAS with only dual-core support instead of the MKL BLAS you’re getting on Windows.