## Playing with NAG’s Library for SMP & multicore (Or parallel programming made easy)

I work for the University of Manchester in the UK as a ‘Science and Engineering Applications specialist’ which basically means that I am obsessed with software used by mathematicians and scientists. One of the applications within my portfolio is the NAG library – a product that we use rather a lot in its various incarnations. We have people using it from Fortran, C, C++, MATLAB, Python and even Visual Basic in areas as diverse as engineering, applied maths, biology and economics.

Yes, we are big users of NAG at Manchester but then that stands to reason because NAG and Manchester have a collaborative history stretching back 40 years to NAG’s very inception. Manchester takes a lot of NAG’s products but for reasons that are lost in the mists of time, we have never (to my knowledge at least) had a site license for their SMP library (more recently called The NAG Library for SMP & multicore). Very recently, that changed!

SMP stands for Symmetric Multi-Processor which essentially means ‘two or more CPUs sharing the same bit of memory.’ Five years ago, it would have been rare for a desktop user to have an SMP machine but these days they are everywhere. Got a dual-core laptop or a quad-core desktop? If the answer’s ‘yes’ then you have an SMP machine and you are probably wondering how to get the most out of it.

### ‘How can I use all my cores (without any effort)’

One of the most common newbie questions I get asked these days goes along the lines of ‘Program X is only using 50%/25%/12.5% of my CPU – how can I fix this?’ and, of course, the reason for this is that the program in question is only using a single core of their multicore machine. So, the problem is easy enough to explain but not so easy to fix because it invariably involves telling the user that they are going to have to learn how to program in parallel.

Explicit parallel programming is a funny thing in that sometimes it is trivial and other times it is pretty much impossible – it all depends on the problem you see. Sometimes all you need to do is drop a few OpenMP pragmas here and there and you’ll get a 2-4 times speed up. Other times you need to completely rewrite your algorithm from the ground up to get even a modest speed up. Yet more times you are simply stuck because your problem is inherently non-parallelisable. It is even possible to slow your code down by trying to parallelize it!

If you are lucky, however, then you can make use of all those extra cores with no extra effort at all! Slowly but surely, mathematical software vendors have been re-writing some of their functions to ensure that they work efficiently on parallel processors in such a way that it is transparent to the user. This is sometimes referred to as implicit parallelism.

Take MATLAB, for example, ever since release 2007a more and more built in MATLAB functions have been modified to allow them to make full use of multi-processor systems. If your program makes extensive use of these functions then you don’t need to spend extra money or time on the parallel computing toolbox, just run your code on the latest version of MATLAB, sit back and enjoy the speed boost. It doesn’t work for all functions of course but it does for an ever increasing subset.

### The NAG SMP Library – zero effort parallel programming

For users of NAG routines, the zero-effort approach to making better use of your multicore system is to use their SMP library. According to NAG’s advertising blurb you don’t need to rewrite your code to get a speed boost – you just need to link to the SMP library instead of the Fortran library at compile time.

Just like MATLAB, you won’t get this speed boost for every function, but you will get it for a significant subset of the library (around 300+ functions as of Mark 22 – the full list is available on NAG’s website). Also, just like MATLAB, sometimes the speed-up will be massive and other times it will be much more modest.

I wanted to test NAG’s claims for myself on the kind of calculations that researchers at Manchester tend to perform so I asked NAG for a trial of the upcoming Mark 22 of the SMP library and, since I am lazy, I also asked them to provide me with simple demonstration programs and I’d like to share the results with you all. ** These are not meant to be an exhaustive set of benchmarks** (I don’t have the time to do such a thing) but should give you an idea of what you can expect from NAG’s new library.

**System specs**

All of these timings were made on a 3Ghz Intel Core2 Quad Q9650 CPU desktop machine with 8Gb of RAM running Ubuntu 9.10. The serial version of the NAG library used was fll6a22dfl and the SMP version of the library was fsl6a22dfl. The version of gfortran used was 4.4.1 and the cpu-frequency governor was switched off as per a previous blog post of mine.

**Example 1 – Nearest correlation matrix**

The first routine I looked at was one that calculated the nearest correlation matrix. In other words ‘Given a symmetric matrix, what is the nearest correlation matrix—that^{ }is, the nearest symmetric positive semidefinite matrix with^{ }unit diagonal?’^{[1]} This is a problem that pops up a lot in Finance – option pricing, risk management – that sort of thing.

The NAG routine that calculates the nearest correlation matrix is G02AAF which is based on the algorithm developed by Qi and Sun^{[2]}. The example program I was sent first constructs a N x N tridiagonal matrix A of the form A(j,j)=2, A(j-1,j)=-1.1 and A(j,j-1)=1.0. It then times how long it takes G02AAF to find the nearest correlation matrix to A. You can download this example, along with all supporting files, from the links below.

To compile this benchmark against the** serial library** I did

gfortran ./bench_g02aaf.f90 shtim.c wltime.f /opt/NAG/fll6a22dfl/lib/libnag_nag.a \ /opt/NAG/fll6a22dfl/acml/libacml_mv.a -o serial_g02aaf

To compile the **parallel version** I did

gfortran -fopenmp ./bench_g02aaf.f90 shtim.c wltime.f /opt/NAG/fsl6a22dfl/lib/libnagsmp.a \ /opt/NAG/fsl6a22dfl/acml4.3.0/lib/libacml_mp.a -o smp_g02aaf

I didn’t need to modify the source code in any way when going from a serial version to a parallel version of this benchmark. The only work required was to link to the SMP library instead of the serial library – so far so good. Let’s run the two versions and see the speed differences.

First things first, let’s ensure that there is no limit to the stack size of my shell by doing the following in bash

ulimit -s unlimited

Also, for the parallel version, I need to set the number of threads to equal the number of processors I have on my machine by setting the OMP_NUM_THREADS environment variable.

export OMP_NUM_THREADS=4

This won’t affect the serial version at all. So, onto the program itself. When you run it, it will ask you for two inputs – an integer, N, and a boolean, IO. N gives the size of the starting matrix and IO determines whether or not to output the result.

Here’s an example run of the serial version with N=1000 and IO set to False. (In Fortran True is represented as** .t.** and False is represented as** .f. **)

./serial_g02aaf G02AAF: Enter N, IO 1000 .f. G02AAF: Time, IFAIL = 37.138997077941895 0 G02AAF: ITER, FEVAL, NRMGRD = 4 5 4.31049822255544465E-012

The only output I am really interested in here is the time and 37.13 seconds doesn’t seem too bad for a 1000 x 1000 matrix at first glance. Move to the parallel version though and you get a very nice surprise

./smp_g02aaf G02AAF: Enter N, IO 1000 .f. G02AAF: Time, IFAIL = 5.1906139850616455 0 G02AAF: ITER, FEVAL, NRMGRD = 4 5 4.30898281428799420E-012

The above times were typical and varied little from run to run (although the SMP version varied by a bigger percentage than the serial version). However I averaged over 10 runs to make sure and got **37.14 s** for the serial version and **4.88 s** for the SMP version which gives a speedup of **7.6 times**!

Now, this is rather impressive. Usually, when one parallelises over N cores then the very best you can expect in an ideal word is a speed up of just less than N times, so called ‘linear scaling’. Here we have N=4 and a speedup of 7.6 implying that NAG have achieved ‘super-linear scaling’ which is usually pretty rare.

I dropped them a line to ask what was going on. It turns out that when they looked at parallelising this routine they worked out a way of speeding up the serial version as well. This serial speedup will be included in the serial library in its next release, Mark 23 but the SMP library got it as of Mark 22.

So, some of that 7.6 times speedup is as a result of serial speedup and the rest is parallelisation. By setting OMP_NUM_THREADS to 1 we can force the SMP version of the benchmark to only run on only one core and thus see how much of the speedup we can attribute to parallelisation:

export OMP_NUM_THREADS=1 ./smp_g02aaf G02AAF: Enter N, IO 1000 .f. G02AAF: Time, IFAIL = 12.714214086532593 0 G02AAF: ITER, FEVAL, NRMGRD = 4 5 4.31152424170390294E-012

Recall that the 4 core version took an average of 4.88 seconds so the speedup from parallelisation alone is **2.6 times** – much closer to what I expected to see. Now, it is probably worth mentioning that there is an extra level of complication (with parallel programming there is **always** an extra level of complication) in that some of this parallelisation speedup comes from extra work that NAG has done in their algorithm and some of it comes from the fact that they are linking to parallel versions of the BLAS and LAPACK libraries. We could go one step further and determine how much of the speed up comes from NAG’s work and how much comes from using parallel BLAS/LAPACK but, well, life is short.

The practical upshot is that if you come straight from the Mark 22 serial version then you can expect a **speed-up of around 7.6 times**. In the future, when you compare the Mark 22 SMP library to the Mark 23 serial library then you can expect a speedup of around 2.6 times on a 4 core machine like mine.

**Example 2 – Quasi random number generation**

Quasi random numbers (also referred to as ‘Low discrepancy sequences’) are extensively used in Monte-Carlo simulations which have applications in areas such as finance, chemistry and computational physics. When people need a set of quasi random numbers they usually need a LOT of them and so the faster they can be produced the better. The NAG library contains several quasi random number generators but the one considered here is the routine g05ymf. The benchmark program I used is called bench_g05ymf.f90 and the full set of files needed to compile it are available at the end of this section.

The benchmark program requires the user to input 4 numbers and a boolean as follows.

- The first number is the quasi random number generator to use with the options being:
- NAG’s newer Sobol generator (based on the 2008 Joe and Kuo algorithm
^{[3]}) - NAG’s older Sobol generator
- NAG’s Niederreiter generator
- NAG’s Faure generator

- NAG’s newer Sobol generator (based on the 2008 Joe and Kuo algorithm
- The second number is the order in which the generated values are returned (The parameter RCORD as referred to in the documentation for g05ymf). Say that the matrix of returned values is called QUAS then if RCORD=1, QUAS(i,j) holds the jth value for the ith dimension, otherwise QUAS(i,j) holds the ith value for the jth dimension.
- The third number is the number of dimensions required.
- The fourth number is the number of number of quasi-random numbers required.
- The boolean (either .t. or .f.) determines whether or not the output should be verbose or not. A value of .t. will output the first 100 numbers in the sequence.

To compile the benchmark program against the serial library I did:

gfortran ./bench_g05ymf.f90 shtim.c wltime.f /opt/NAG/fll6a22dfl/lib/libnag_nag.a \ /opt/NAG/fll6a22dfl/acml/libacml_mv.a -o serial_g02aaf

As before, the only thing needed to turn this into a parallel program was to compile against the SMP library and add the -fopenmp switch

gfortran -fopenmp ./bench_g05ymf.f90 shtim.c wltime.f /opt/NAG/fsl6a22dfl/lib/libnagsmp.a \ /opt/NAG/fsl6a22dfl/acml4.3.0/lib/libacml_mp.a -o smp_g02aaf

The first set of parameters I used was

1 2 900 1000000 .f.

Which uses NAG’s new Sobol generator with RCORD=2 to generate and store 1,000,000 numbers over 900 dimensions with no verbose output. Averaged over 10 runs the times were **5.17 seconds** for the serial version and **2.14 seconds** for the parallel version giving a **speedup of 2.4x** on a quad-core machine. I couldn’t push number of dimensions much higher because the benchmark stores all of the numbers in one large array and I was starting to run out of memory.

If you only want a relatively small sequence then switching to the SMP library is actually slower thanks to the overhead involved in spawning extra threads. For example if you only want 100,000 numbers over 8 dimensions:

1 2 8 100000 .f.

then the serial version of the code takes an average of ** 0.0048 seconds** compared to **0.0479 seconds** for the parallel version so the parallel version is almost 10 times slower when using 4 threads for small problems. Setting OMP_NUM_THREADS to 1 gives exactly the same speed as the serial version as you might expect.

NAG have clearly optimized this function to ensure that you get a good speedup for large problem sets which is where it really matters so I am very pleased with it. However, the degradation in performance for smaller problems concerns me a little. I think I’d prefer it if NAG were to modify this function so that it works serially for small sequences and to automatically switch to parallel execution if the user requests a large sequence.

**Conclusions**

In the old days we could speedup our programs simply by buying a new computer. The new computer would have a faster clock speed than the old one and so our code would run faster with close to zero effort on our part. Thanks to the fact that clock speeds have stayed more or less constant for the last few years those days are over. Now, when we buy a new computer we get more processors rather than faster ones and this requires a change in thinking. Products such as the NAG Library for SMP & multicore help us to to get the maximum benefit out of our machines with the minimum amount of effort on our part. If switching to a product like this doesn’t give you the performance boost you need then the next thing for you to do is to learn how to program in parallel. The free ride is over.

In summary:

- You don’t need to modify your code if it already uses NAG’s serial library. Just recompile against the new library and away you go. You don’t need to know anything about parallel programming to make use of this product.
- The SMP Library works best with big problems. Small problems don’t work so well because of the inherent overheads of parallelisation.
- On a quad-core machine you can expect speed-ups around 2-4 times compared to the serial library. In exceptional circumstances you can expect speed-up as large as 7 times or more.
- You should notice a speed increase in over 300 functions compared to the serial library.
- Some of this speed increase comes from fundamental libraries such as BLAS and LAPACK, some of it comes from NAG directly parallelising their algorithms and yet more comes from improvements to the underlying serial code.
- I’m hoping that this Fortran version is just the beginning. I’ve always felt that NAG program in Fortran so I don’t have to and I’m hoping that they will go on to incorporate their SMP improvements in their other products, especially their C library and MATLAB toolbox.

**Acknowledgements**

Thanks to several staff at NAG who suffered my seemingly endless stream of emails during the writing of this article. Ed, in particular, has the patience of a saint. Any mistakes left over are all mine!

**Links**

- The Nearest Correlation Matrix benchmark program used here – bench_g02aaf.f90 – along with the source code for the timing routine.
- NAG’s documentation for the g02aaf routine.
- The Quasi Random Number benchmark program used here – bench_g05ymf.f90 – along with the source code for the timing routine.
- An Introduction to quasi random numbers (written by a member of NAG’s staff)
- Official NAG documentation for g05ymf – the quasi random number generator considered here.

**References**

[1] – Higham N, Computing the nearest correlation matrix—a problem from finance, IMA Journal of Numerical Analysis 22 (3): 329–343

[2] – Qi H and Sun D (2006), A Quadratically Convergent Newton Method for Computing the Nearest

Correlation Matrix, SIAM J. Matrix AnalAppl 29(2) 360–385

[3] – Joe S and Kuo F Y (2008) Constructing Sobol sequences with better two-dimensional projects SIAM J. Sci. Comput. 30 2635–2654

**Dislaimer: This is a personal blog and the opinions expressed on it are mine and do not necessarily reflect the policies and opinions of my employer, The University of Manchester.**

Is it really true the NAG compilers aren’t OpenMP compliant themselves? Or was there another reason you used gfortan in the above examples?

Hi Michael

At the time of writing it’s true although I believe that the next version will be OpenMP compliant.

Cheers,

Mike