Archive for the ‘NAG Library’ Category

May 18th, 2011

MATLAB’s lsqcurvefit function is a very useful piece of code that will help you solve non-linear least squares curve fitting problems and it is used a lot by researchers at my workplace, The University of Manchester.  As far as we are concerned it has two problems:

  • lsqcurvefit is part of the Optimisation toolbox and, since we only have a limited number of licenses for that toolbox, the function is sometimes inaccessible.  When we run out of licenses on campus users get the following error message
??? License checkout failed.
License Manager Error -4
Maximum number of users for Optimization_Toolbox reached.
Try again later.
To see a list of current users use the lmstat utility or contact your License Administrator.
  • lsqcurvefit is written as an .m file and so isn’t as fast as it could be.

One solution to these problems is to switch to the NAG Toolbox for MATLAB.  Since we have a full site license for this product, we never run out of licenses and, since it is written in compiled Fortran, it is sometimes a lot faster than MATLAB itself.  However, the current version of the NAG Toolbox (Mark 22 at the time of writing) isn’t without its issues either:

  • There is no direct equivalent to lsqcurvefit in the NAG toolbox.  You have to use NAG’s equivalent of lsqnonlin instead (which NAG call e04fy)
  • The current version of the NAG Toolbox, Mark 22, doesn’t support function handles.
  • The NAG toolbox requires the use of the datatypes int32 and int64 depending on the architecture of your machine.  The practical upshot of this is that your MATLAB code is suddenly a lot less portable.  If you develop on a 64 bit machine then it will need modifying to run on a 32 bit machine and vice-versa.

While working on optimising someone’s code a few months ago I put together a couple of .m files in an attempt to address these issues.  My intent was to improve the functionality of these files and eventually publish them but I never seemed to find the time.  However, they have turned out to be a minor hit and I’ve sent them out to researcher after researcher with the caveat “These are a first draft, I need to tidy them up sometime” only to get the reply “Thanks for that, I no longer have any license problems and my code is executing more quickly.”  They may be simple but it seems that they are useful.

So, until I find the time to add more functionality, here are my little wrappers that allow you to do non linear least squares fitting using the NAG Toolbox for MATLAB.  They are about as minimal as you can get but they work and have proven to be useful time and time again.  They also solve all 3 of the NAG issues mentioned above.

To use them just put them both somewhere on your MATLAB path.  The syntax is identical to lsqcurvefit but this first version of my wrapper doesn’t do anything more complicated than the following


[x,resnorm] = lsqcurvefit(@myfun,x0,xdata,ydata);

NAG with my wrapper:

[x,resnorm] = nag_lsqcurvefit(@myfun,x0,xdata,ydata);

I may add extra functionality in the future but it depends upon demand.


Let’s look at the example given on the MATLAB help page for lsqcurvefit and compare it to the NAG version.  First create a file called myfun.m as follows

function F = myfun(x,xdata)
F = x(1)*exp(x(2)*xdata);

Now create the data and call the MATLAB fitting function

% Assume you determined xdata and ydata experimentally
xdata = [0.9 1.5 13.8 19.8 24.1 28.2 35.2 60.3 74.6 81.3];
ydata = [455.2 428.6 124.1 67.3 43.2 28.1 13.1 -0.4 -1.3 -1.5];
x0 = [100; -1] % Starting guess
[x,resnorm] = lsqcurvefit(@myfun,x0,xdata,ydata);

On my system I get the following timing for MATLAB (typical over around 10 runs)

tic;[x,resnorm] = lsqcurvefit(@myfun,x0,xdata,ydata); toc
Elapsed time is 0.075685 seconds.

with the following results

x =
   1.0e+02 *
>> resnorm
resnorm =

and for NAG using my wrapper function I get the following timing (typical over around 10 runs)

tic;[x,resnorm] = nag_lsqcurvefit(@myfun,x0,xdata,ydata); toc
Elapsed time is 0.008163 seconds.

So, for this example, NAG is around 9 times faster! The results agree with MATLAB to several decimal places

x =
   1.0e+02 *
>> resnorm
resnorm =

In the real world I find that the relative timings vary enormously and have seen speed-ups that range from a factor of 14 down to none at all.  Whenever I am optimisng MATLAB code and see a lsqcurvefit function I always give the NAG version a quick try and am often impressed with the results.

My system specs if you want to compare results: 64bit Linux on a Intel Core 2 Quad Q9650  @ 3.00GHz running MATLAB 2010b and NAG Toolbox version MBL6A22DJL

More articles about NAG

November 13th, 2010

Christmas isn’t all that far away so I thought that it was high time that I wrote my Christmas list for mathematical software developers and vendors.  All I want for christmas is….



  • Merge the parallel computing toolbox with core MATLAB.  Everyone uses multicore these days but only a few can feel the full benefit in MATLAB.   The rest are essentially second class MATLAB citizens muddling by with a single core (most of the time)
  • Make the mex interface thread safe so I can more easily write parallel mex files


  • More CUDA accelerated functions please.  I was initially excited by your CUDA package but then discovered that it only accelerated one function (Matrix Multiply).  CUDA accelerated Random Number Generators would be nice along with fast Fourier transforms and a bit more linear algebra.


  • Release Mathcad Prime.
  • Mac and Linux versions of Mathcad.  Maple,Mathematica and MATLAB have versions for all 3 platforms so why don’t you?

NAG Library

  • Produce vector versions of functions like g01bk (poisson distribution function).  They might not be needed in Fortran or C code but your MATLAB toolbox desperately needs them
  • A Mac version of the MATLAB toolbox.  I’ve got users practically begging for it :)
  • A NAG version of the MATLAB gamfit command



  • A Locator control for the interact function.  I still have a bounty outstanding for the person who implements this.
  • A fully featured, native windows version.  I know about the VM solution and it isn’t suitable for what I want to do (which is to deploy it on around 5000 University windows machines to introduce students to one of the best open source maths packages)

SMath Studio

  • An Android version please.  Don’t make it free – you deserve some money for this awesome Mathcad alternative.

SpaceTime Mathematics

  • The fact that you give the Windows version away for free is awesome but registration is a pain when you are dealing with mass deployment.  I’d love to deploy this to my University’s Windows desktop image but the per-machine registration requirement makes it difficult.  Most large developers who require registration usually come up with an alternative mechanism for enterprise-wide deployment.  You ask schools with more than 5 machines to link back to you.  I want tot put it on a few thousand machines and I would happily link back to you from several locations if you’ll help me with some sort of volume license.  I’ll also give internal (and external if anyone is interested) seminars at Manchester on why I think Spacetime is useful for teaching mathematics.  Finally, I’d encourage other UK University applications specialists to evaluate the software too.
  • An Android version please.

How about you?  What would you ask for Christmas from your favourite mathematical software developers?

October 5th, 2010

Not long after publishing my article, A faster version of MATLAB’s fsolve using the NAG Toolbox for MATLAB, I was contacted by Biao Guo of Quantitative Finance Collector who asked me what I could do with the following piece of code.

function MarkowitzWeight = fsolve_markowitz_MATLAB()

RandStream.setDefaultStream (RandStream('mt19937ar','seed',1));
% simulate 500 stocks return
SimR = randn(1000,500);
r = mean(SimR); % use mean return as expected return
targetR = 0.02;
rCOV = cov(SimR); % covariance matrix
NumAsset = length(r);

startX = [1/NumAsset*ones(1, NumAsset), 1, 1];
x = fsolve(@fsolve_markowitz,startX,options, r, targetR, rCOV);
MarkowitzWeight = x(1:NumAsset);

function F = fsolve_markowitz(x, r, targetR, rCOV)
NumAsset = length(r);
F = zeros(1,NumAsset+2);
weight = x(1:NumAsset); % for asset weights
lambda = x(NumAsset+1);
mu = x(NumAsset+2);
for i = 1:NumAsset
    F(i) = weight(i)*sum(rCOV(i,:))-lambda*r(i)-mu;
F(NumAsset+1) = sum(weight.*r)-targetR;
F(NumAsset+2) = sum(weight)-1;

This is an example from financial mathematics and Biao explains it in more detail in a post over at his blog, Now, it isn’t particularly computationally challenging since it only takes 6.5 seconds to run on my 2Ghz dual core laptop. It is, however, significantly more challenging than the toy problem I discussed in my original fsolve post. So, let’s see how much we can optimise it using NAG and any other techniques that come to mind.

Before going on, I’d just like to point out that we intentionally set the seed of the random number generator with

RandStream.setDefaultStream (RandStream('mt19937ar','seed',1));

to ensure we are always operating with the same data set. This is purely for benchmarking purposes.

Optimisation trick 1 – replacing fsolve with NAG’s c05nb

The first thing I had to do to Biao’s code in order to apply the NAG toolbox was to split it into two files. We have the main program, fsolve_markowitz_NAG.m

function MarkowitzWeight = fsolve_markowitz_NAG()
global r targetR rCOV;
%seed random generator to ensure same results every time for benchmark purposes
RandStream.setDefaultStream (RandStream('mt19937ar','seed',1));

% simulate 500 stocks return
SimR = randn(1000,500);
r = mean(SimR)'; % use mean return as expected return
targetR = 0.02;
rCOV = cov(SimR); % covariance matrix
NumAsset = length(r);

startX = [1/NumAsset*ones(1,NumAsset), 1, 1];
x = c05nb('fsolve_obj_NAG',startX);
MarkowitzWeight = x(1:NumAsset);

and the objective function, fsolve_obj_NAG.m

function [F,iflag] = fsolve_obj_NAG(n,x,iflag)
global r targetR rCOV;
NumAsset = length(r);
F = zeros(1,NumAsset+2);
weight = x(1:NumAsset); % for asset weights
lambda = x(NumAsset+1);
mu = x(NumAsset+2);
for i = 1:NumAsset
    F(i) = weight(i)*sum(rCOV(i,:))-lambda*r(i)-mu;
F(NumAsset+1) = sum(weight.*r)-targetR;
F(NumAsset+2) = sum(weight)-1;

I had to put the objective function into its own .m file because because the NAG toolbox currently doesn’t accept function handles (This may change in a future release I am reliably told). This is a pain but not a show stopper.

Note also that the function header for the NAG version of the objective function is different from the pure MATLAB version as
per my original article.

Transposition problems
The next thing to notice is that I have done the following in the main program, fsolve_markowitz_NAG.m

r = mean(SimR)';

compared to the original

r = mean(SimR);

This is because the NAG routine, c05nb, does something rather bizarre. I discovered that if you pass a 1xN matrix to NAGs c05nb then it gets transposed in the objective function to Nx1. Since Biao relies on this being 1xN when he does

F(NumAsset+1) = sum(weight.*r)-targetR;

we get an error unless you take the transpose of r in the main program. I consider this to be a bug in the NAG Toolbox and it is currently with NAG technical support.

Global variables
Sadly, there is no easy way to pass extra variables to the objective function when using NAG’s c05nb. In MATLAB Biao could do this

x = fsolve(@fsolve_markowitz,startX,options, r, targetR, rCOV);

No such luck with c05nb. The only way to get around it is to use global variables (Well, a small lie, there is another way, you can use reverse communication in a different NAG function, c05nd, but that’s a bit more complicated and I’ll leave it for another time I think)

So, It’s pretty clear that the NAG function isn’t as easy to use as MATLAB’s fsolve function but is it any faster? The following is typical.

>> tic;fsolve_markowitz_NAG;toc
Elapsed time is 2.324291 seconds.

So, we have sped up our overall computation time by a factor of 3. Not bad, but nowhere near the 18 times speed-up that I demonstrated in my original post.

Optimisation trick 2 – Don’t sum so much

In his recent blog post, Biao challenged me to make his code almost 20 times faster and after applying my NAG Toolbox trick I had ‘only’ managed to make it 3 times faster. So, in order to see what else I might be able to do for him I ran his code through the MATLAB profiler and discovered that it spent the vast majority of its time in the following section of the objective function

for i = 1:NumAsset
    F(i) = weight(i)*sum(rCOV(i,:))-lambda*r(i)-mu;

It seemed to me that the original code was calculating the sum of rCOV too many times. On a small number of assets (50 say) you’ll probably not notice it but with 500 assets, as we have here, it’s very wasteful.

So, I changed the above to

for i = 1:NumAsset
    F(i) = weight(i)*sumrCOV(i)-lambda*r(i)-mu;

So,instead of rCOV, the objective function needs to be passed the following in the main program


We do this summation once and once only and make a big time saving.  This is done in the files fsolve_markowitz_NAGnew.m and fsolve_obj_NAGnew.m

With this optimisation, along with using NAG’s c05nb we now get the calculation time down to

>> tic;fsolve_markowitz_NAGnew;toc
Elapsed time is 0.203243 seconds.

Compared to the 6.5 seconds of the original code, this represents a speed-up of over 32 times. More than enough to satisfy Biao’s challenge.

Checking the answers

Let’s make sure that the results agree with each other. I don’t expect them to be exactly equal due to round off errors etc but they should be extremely close. I check that the difference between each result is less than 1e-7:

original_results = fsolve_markowitz_MATLAB;
nagver1_results = fsolve_markowitz_NAGnew;
>> all(abs(original_results - nagver1_results)<1e8)
ans =

That’ll do for me!


The current iteration of the NAG Toolbox for MATLAB (Mark 22) may not be as easy to use as native MATLAB toolbox functions (for this example at least) but it can often provide useful speed-ups and in this example, NAG gave us a factor of 3 improvement compared to using fsolve.

For this particular example, however, the more impressive speed-up came from pushing the code through the MATLAB profiler, identifying the hotspot and then doing something about it.  There are almost certainly more optimisations to make but with the code running at less than a quarter of a second on my modest little laptop I think we will leave it there for now.

September 21st, 2010

Part of my job at the University of Manchester is to help researchers improve their code in systems such as Mathematica, MATLAB, NAG and so on.  One of the most common questions I get asked is ‘Can you make this go any faster please?’ and it always makes my day if I manage to do something significant such as a speed up of a factor of 10 or more.  The users, however, are often happy with a lot less and I once got bought a bottle of (rather good) wine for a mere 25% speed-up (Note:  Gifts of wine aren’t mandatory)

I employ numerous tricks to get these speed-ups including applying vectorisation, using mex files, modifying the algorithm to something more efficient,picking the brains of colleagues and so on.  Over the last year or so, I have managed to help several researchers get significant speed-ups in their MATLAB programs by doing one simple thing – swapping the fsolve function for an equivalent from the NAG Toolbox for MATLAB.

The fsolve function is part of MATLAB’s optimisation toolbox and, according to the documentation, it does the following:

“fsolve Solves a system of nonlinear equation of the form F(X) = 0 where F and X may be vectors or matrices.”

The NAG Toolbox for MATLAB contains a total of 6 different routines that can perform similar calculations to fsolve.  These 6 routines are split into 2 sets of 3 as follows

For when you have function values only:
c05nb – Solution of system of nonlinear equations using function values only (easy-to-use)
c05nc – Solution of system of nonlinear equations using function values only (comprehensive)
c05nd – Solution of system of nonlinear equations using function values only (reverse communication)

For when you have both function values and first derivatives
c05pb – Solution of system of nonlinear equations using first derivatives (easy-to-use)
c05pc – Solution of system of nonlinear equations using first derivatives (comprehensive)
c05pd – Solution of system of nonlinear equations using first derivatives (reverse communication)

So, the NAG routine you choose depends on whether or not you can supply first derivatives and exactly which options you’d like to exercise.  For many problems it will come down to a choice between the two ‘easy to use’ routines – c05nb and c05pb (at least, they are the ones I’ve used most of the time)

Let’s look at a simple example.  You’d like to find a root of the following system of non-linear equations.

F(1)=exp(-x(1)) + sinh(2*x(2)) + tanh(2*x(3)) – 5.01;
F(2)=exp(2*x(1)) + sinh(-x(2) ) + tanh(2*x(3) ) – 5.85;
F(3)=exp(2*x(1)) + sinh(2*x(2) ) + tanh(-x(3) ) – 8.88;

i.e. you want to find a vector X containing three values for which F(X)=0.

To solve this using MATLAB and the optimisation toolbox you could proceed as follows, first create a .m file called fsolve_obj_MATLAB.m (henceforth referred to as the objective function) that contains the following

function  F=fsolve_obj_MATLAB(x)
F(1)=exp(-x(1))  + sinh(2*x(2)) + tanh(2*x(3)) - 5.01;
F(2)=exp(2*x(1)) +  sinh(-x(2) ) + tanh(2*x(3) ) - 5.85;
F(3)=exp(2*x(1)) +  sinh(2*x(2) ) + tanh(-x(3) ) - 8.88;

Now type the following at the MATLAB command prompt to actually solve the problem:

options=optimset('Display','off'); %Prevents fsolve from  displaying too much information
startX=[0 0 0]; %Our  starting guess for the solution vector

MATLAB finds the following solution (Note that it’s only found one of the solutions, not all of them, which is typical of the type of algorithm used in fsolve)

X =
0.9001     1.0002    1.0945

Since we are not supplying the derivatives of our objective function and we are not using any special options, it turns out to be very easy to switch from using the optimisation toolbox to the NAG toolbox for this particular problem by making use of the routine c05nb.

First, we need to change the header of our objective function’s .m file from

function F=fsolve_obj_MATLAB(x)


function  [F,iflag]=fsolve_obj_NAG(n,x,iflag)

so our new .m file, fsolve_obj_NAG.m, will contain the following

function  [F,iflag]=fsolve_obj_NAG(n,x,iflag)
F(1)=exp(-x(1)) + sinh(2*x(2)) + tanh(2*x(3)) - 5.01;
F(2)=exp(2*x(1)) + sinh(-x(2) ) + tanh(2*x(3) ) - 5.85;
F(3)=exp(2*x(1)) + sinh(2*x(2) ) + tanh(-x(3) ) - 8.88;

Note that the ONLY change we made to the objective function was the very first line.  The NAG routine c05nb expects to see some extra arguments in the objective function (iflag and n in this example) and it expects to see them in a very particular order but we are not obliged to use them.  Using this modified objective function we can proceed as follows.

startX=[0 0 0]; %Our starting guess

NAG gives the same result as the optimisation toolbox:

X =
0.9001    1.0002    1.0945

Let’s look at timings.  I performed the calculations above on an 800Mhz laptop running Ubuntu Linux 10.04, MATLAB 2009 and Mark 22 of the NAG toolbox and got the following timings (averaged over 10 runs):

Optimisation toolbox: 0.0414 seconds
NAG toolbox: 0.0021 seconds

So, for this particular problem NAG was faster than MATLAB by a factor of 19.7 times on my machine.

That’s all well and good, but this is a simple problem. Does this scale to more realistic problems I hear you ask? Well, the answer is ‘yes’. I’ve used this technique several times now on production code and it almost always results in some kind of speed-up. Not always 19.7 times faster, I’ll grant you, but usually enough to make the modifications worthwhile.

I can’t actually post any of the more complicated examples here because the code in question belongs to other people but I was recently sent a piece of code from a researcher that made heavy use of fsolve and a typical run took 18.19 seconds (that’s for everything, not just the fsolve bit).  Simply by swapping a couple of lines of code to make it use the NAG toolbox rather than the optimisation toolbox I reduced the runtime to 1.08 seconds – a speed-up of almost 17 times.

Sadly, this technique doesn’t always result in a speed-up and I’m working on figuring out exactly when the benefits disappear. I guess that the main reason for NAG’s good performance is that it uses highly optimised, compiled Fortran code compared to MATLAB’s interpreted .m code. One case where NAG didn’t help was for a massively complicated objective function and the majority of the run-time of the code was spent evaluating this function. In this situation, NAG and MATLAB came out roughly neck and neck.

In the meantime, if you have some code that you’d like me to try it on then drop me a line.

If you enjoyed this post then you may also like:

July 29th, 2010

The MATLAB function ranksum is part of MATLAB’s Statistics Toolbox. Like many organizations who use network licensing for MATLAB and its toolboxes, my employer, The University of Manchester, sometimes runs out of licenses for this toolbox which leads to following error message when you attempt to evaluate ranksum.

??? License checkout failed.
License Manager Error -4
Maximum number of users for Statistics_Toolbox reached.
Try again later.
To see a list of current users use the lmstat utility or contact your License Administrator.

An alternative to the Statistics Toolbox is the NAG Toolbox for MATLAB for which we have an unlimited number of licenses. Here’s how to replace ranksum with the NAG routine g08ah.

Original MATLAB / Statistics Toolbox code

x = [0.8147;0.9058;0.1270;0.9134;0.6324;0.0975;0.2785;0.5469;0.9575;0.9649];
y=  [0.4076;1.220;1.207;0.735;1.0502;0.3918;0.671;1.165;1.0422;1.2094;0.9057;0.285;1.099;1.18;0.928];
p = ranksum(x,y)

The result is p = 0.0375

Code using the NAG Toolbox for MATLAB

x = [0.8147;0.9058;0.1270;0.9134;0.6324;0.0975;0.2785;0.5469;0.9575;0.9649];
y =  [0.4076;1.220;1.207;0.735;1.0502;0.3918;0.671;1.165;1.0422;1.2094;0.9057;0.285;1.099;1.18;0.928];
tail = 'T';
[u, unor, p, ties, ranks, ifail] = g08ah(x, y, tail);

The value for p is the same as that calculated by ranksum: p = 0.0375

NAG’s g08ah routine returns a lot more than just the value p but, for this particular example, we can just ignore it all. In fact, if you have MATLAB 2009b or above then you could call g08ah like this

tail = 'T';
[~, ~, p, ~, ~, ~] = g08ah(x, y, tail);

Which explicitly indicates that you are not going to use any of the outputs other than p.

People at Manchester are using the NAG toolbox for MATLAB more and more; not only because we have a full site license for it but because it can sometimes be very fast.  Here’s some more articles on the NAG toolbox you may find useful.

April 24th, 2010

Late last year I was asked to give a talk at my University to a group of academics studying financial mathematics (Black-Scholes, monte-carlo simulations, time series analysis – that sort of thing).  The talk was about the NAG Numerical Library, what it could do, what environments you can use it from, licensing and how they could get their hands on it via our site license.

I also spent a few minutes talking about Python including why I thought it was a good language for numerical computing and how to interface it with the NAG C library using my tutorials and PyNAG code.  I finished off with a simple graphical demonstration that minimised the Rosenbrock function using Python, Matplotlib and the NAG C-library running on Ubuntu Linux.

“So!”, I asked, “Any questions?”

The very first reply was “Does your Python-NAG interface work on Windows machines?” to which the answer at the time was “No!”  I took the opportunity to ask the audience how many of them did their numerical computing in Windows (Most of the room of around 50+ people), how many people did it using Mac OS X (A small group at the back), and how many people did it in Linux (about 3).

So, if I wanted the majority of that audience to use PyNAG then I had to get it working on Windows.  Fortunately, thanks to the portability of Python and the consistency of the NAG library across platforms, this proved to be rather easy to do and the result is now available for download on the main PyNAG webpage.

Let’s look at the main differences between the Linux and Windows versions

Loading the NAG Library

In the examples given in my Linux NAG-Python tutorials, the cllux08dgl NAG C-library was loaded using the line

libnag = cdll.LoadLibrary("/opt/NAG/cllux08dgl/lib/")

In Windows the call is slightly different. Here it is for CLDLL084ZL (Mark 8 of the C library for Windows)

C_libnag = ctypes.windll.CLDLL084Z_mkl

If you are modifying any of the codes shown in my NAG-Python tutorials then you’ll need to change this line yourself. If you are using PyNAG, however, then it will already be done for you.

Callback functions

For my full introduction to NAG callback functions on Linux click here. The main difference between using callbacks on Windows compared to Linux is that on Linux we have

C_FUNC = CFUNCTYPE(c_double,c_double )

but on Windows we have

C_FUNC = WINFUNCTYPE(c_double,c_double )

There are several callback examples in the PyNAG distribution.

That’s pretty much it I think.  Working through all of the PyNAG examples and making sure that they ran on Windows uncovered one or two bugs in my codes that didn’t affect Linux for one reason or another so it was a useful exercise all in all.

So, now you head over to the main PyNAG page and download the Windows version of my Python/NAG interface which includes a set of example codes.  I also took the opportunity to throw in a couple of extra features and so upgraded PyNAG to version 0.16, check out the readme for more details.  Thanks to several employees at NAG for all of their help with this including Matt, John, David, Marcin and Sorin.

April 20th, 2010

I work for the University of Manchester in the UK as a ‘Science and Engineering Applications specialist’ which basically means that I am obsessed with software used by mathematicians and scientists.  One of the applications within my portfolio is the NAG library – a product that we use rather a lot in its various incarnations.  We have people using it from Fortran, C, C++, MATLAB, Python and even Visual Basic in areas as diverse as engineering, applied maths, biology and economics.

Yes, we are big users of NAG at Manchester but then that stands to reason because NAG and Manchester have a collaborative history stretching back 40 years to NAG’s very inception.  Manchester takes a lot of NAG’s products but for reasons that are lost in the mists of time, we have never (to my knowledge at least) had a site license for their SMP library (more recently called The NAG Library for SMP & multicore).  Very recently, that changed!

SMP stands for Symmetric Multi-Processor which essentially means ‘two or more CPUs sharing the same bit of memory.’  Five years ago, it would have been rare for a desktop user to have an SMP machine but these days they are everywhere.  Got a dual-core laptop or a quad-core desktop?  If the answer’s ‘yes’ then you have an SMP machine and you are probably wondering how to get the most out of it.

‘How can I use all my cores (without any effort)’

One of the most common newbie questions I get asked these days goes along the lines of ‘Program X is only using 50%/25%/12.5% of my CPU – how can I fix this?’ and, of course, the reason for this is that the program in question is only using a single core of their multicore machine.  So, the problem is easy enough to explain but not so easy to fix because it invariably involves telling the user that they are going to have to learn how to program in parallel.

Explicit parallel programming is a funny thing in that sometimes it is trivial and other  times it is pretty much impossible – it all depends on the problem you see.  Sometimes all you need to do is drop a few OpenMP pragmas here and there and you’ll get a 2-4 times speed up.  Other times you need to completely rewrite your algorithm from the ground up to get even a modest speed up.  Yet more times you are simply stuck because your problem is inherently non-parallelisable.  It is even possible to slow your code down by trying to parallelize it!

If you are lucky, however, then you can make use of all those extra cores with no extra effort at all!  Slowly but surely, mathematical software vendors have been re-writing some of their functions to ensure that they work efficiently on parallel processors in such a way that it is transparent to the user.  This is sometimes referred to as implicit parallelism.

Take MATLAB, for example, ever since release 2007a more and more built in MATLAB functions have been modified to allow them to make full use of multi-processor systems.  If your program makes extensive use of these functions then you don’t need to spend extra money or time on the parallel computing toolbox, just run your code on the latest version of MATLAB, sit back and enjoy the speed boost.  It doesn’t work for all functions of course but it does for an ever increasing subset.

The NAG SMP Library – zero effort parallel programming

For users of NAG routines, the zero-effort approach to making better use of your multicore system is to use their SMP library.  According to NAG’s advertising blurb you don’t need to rewrite your code to get a speed boost – you just need to link to the SMP library instead of the Fortran library at compile time.

Just like MATLAB, you won’t get this speed boost for every function, but you will get it for a significant subset of the library (around 300+ functions as of Mark 22 – the full list is available on NAG’s website).  Also, just like MATLAB, sometimes the speed-up will be massive and other times it will be much more modest.

I wanted to test NAG’s claims for myself on the kind of calculations that researchers at Manchester tend to perform so I asked NAG for a trial of the upcoming Mark 22 of the SMP library and, since I am lazy, I also asked them to provide me with simple demonstration programs and I’d like to share the results with you all.  These are not meant to be an exhaustive set of benchmarks (I don’t have the time to do such a thing) but should give you an idea of what you can expect from NAG’s new library.

System specs

All of these timings were made on a 3Ghz Intel Core2 Quad Q9650 CPU desktop machine with 8Gb of RAM running Ubuntu 9.10.  The serial version of the NAG library used was fll6a22dfl  and the SMP version of the library was fsl6a22dfl.  The version of gfortran used was 4.4.1 and the cpu-frequency governor was switched off as per a previous blog post of mine.

Example 1 – Nearest correlation matrix

The first routine I looked at was one that calculated the nearest correlation matrix. In other words ‘Given a symmetric matrix, what is the nearest correlation matrix—that is, the nearest symmetric positive semidefinite matrix with unit diagonal?’[1] This is a problem that pops up a lot in Finance – option pricing, risk management – that sort of thing.

The NAG routine that calculates the nearest correlation matrix is G02AAF which is based on the algorithm developed by Qi and Sun[2].  The example program I was sent first constructs a N x N tridiagonal matrix A of the form A(j,j)=2, A(j-1,j)=-1.1 and A(j,j-1)=1.0.  It then times how long it takes G02AAF to find the nearest correlation matrix to A.  You can download this example, along with all supporting files, from the links below.

To compile this benchmark against the serial library I did

gfortran ./bench_g02aaf.f90 shtim.c wltime.f /opt/NAG/fll6a22dfl/lib/libnag_nag.a  \
/opt/NAG/fll6a22dfl/acml/libacml_mv.a -o serial_g02aaf

To compile the parallel version I did

gfortran -fopenmp ./bench_g02aaf.f90 shtim.c wltime.f /opt/NAG/fsl6a22dfl/lib/libnagsmp.a \
  /opt/NAG/fsl6a22dfl/acml4.3.0/lib/libacml_mp.a -o smp_g02aaf

I didn’t need to modify the source code in any way when going from a serial version to a parallel version of this benchmark. The only work required was to link to the SMP library instead of the serial library – so far so good. Let’s run the two versions and see the speed differences.

First things first, let’s ensure that there is no limit to the stack size of my shell by doing the following in bash

ulimit -s unlimited

Also, for the parallel version, I need to set the number of threads to equal the number of processors I have on my machine by setting the OMP_NUM_THREADS environment variable.


This won’t affect the serial version at all. So, onto the program itself. When you run it, it will ask you for two inputs – an integer, N, and a boolean, IO. N gives the size of the starting matrix and IO determines whether or not to output the result.

Here’s an example run of the serial version with N=1000 and IO set to False. (In Fortran True is represented as .t. and False is represented as .f. )

 G02AAF: Enter N, IO
1000 .f.
 G02AAF: Time, IFAIL =    37.138997077941895                0
 G02AAF: ITER, FEVAL, NRMGRD =            4           5  4.31049822255544465E-012

The only output I am really interested in here is the time and 37.13 seconds doesn’t seem too bad for a 1000 x 1000 matrix at first glance. Move to the parallel version though and you get a very nice surprise

 G02AAF: Enter N, IO
1000 .f.
 G02AAF: Time, IFAIL =    5.1906139850616455                0
 G02AAF: ITER, FEVAL, NRMGRD =            4           5  4.30898281428799420E-012

The above times were typical and varied little from run to run (although the SMP version varied by a bigger percentage than the serial version). However I averaged over 10 runs to make sure and got 37.14 s for the serial version and 4.88 s for the SMP version which gives a speedup of 7.6 times!

Now, this is rather impressive. Usually, when one parallelises over N cores then the very best you can expect in an ideal word is a speed up of just less than N times, so called ‘linear scaling’. Here we have N=4 and a speedup of 7.6 implying that NAG have achieved ‘super-linear scaling’ which is usually pretty rare.

I dropped them a line to ask what was going on. It turns out that when they looked at parallelising this routine they worked out a way of speeding up the serial version as well. This serial speedup will be included in the serial library in its next release, Mark 23 but the SMP library got it as of Mark 22.

So, some of that 7.6 times speedup is as a result of serial speedup and the rest is parallelisation. By setting OMP_NUM_THREADS to 1 we can force the SMP version of the benchmark to only run on only one core and thus see how much of the speedup we can attribute to parallelisation:

 G02AAF: Enter N, IO
1000 .f.
 G02AAF: Time, IFAIL =    12.714214086532593                0
 G02AAF: ITER, FEVAL, NRMGRD =            4           5  4.31152424170390294E-012

Recall that the 4 core version took an average of 4.88 seconds so the speedup from parallelisation alone is 2.6 times – much closer to what I expected to see.  Now, it is probably worth mentioning that there is an extra level of complication (with parallel programming there is always an extra level of complication) in that some of this parallelisation speedup comes from extra work that NAG has done in their algorithm and some of it comes from the fact that they are linking to parallel versions of the BLAS and LAPACK libraries.  We could go one step further and determine how much of the speed up comes from NAG’s work and how much comes from using parallel BLAS/LAPACK but, well, life is short.

The practical upshot is that if you come straight from the Mark 22 serial version then you can expect a speed-up of around 7.6 times.  In the future, when you compare the Mark 22 SMP library to the Mark 23 serial library then you can expect a speedup of around 2.6 times on a 4 core machine like mine.

Example 2 – Quasi random number generation

Quasi random numbers (also referred to as ‘Low discrepancy sequences’) are extensively used in Monte-Carlo simulations which have applications in areas such as finance, chemistry and computational physics.  When people need a set of quasi random numbers they usually need a LOT of them and so the faster they can be produced the better.  The NAG library contains several quasi random number generators but the one considered here is the routine g05ymf.  The benchmark program I used is called bench_g05ymf.f90 and the full set of files needed to compile it are available at the end of this section.

The benchmark program requires the user to input 4 numbers and a boolean as follows.

  • The first number is the quasi random number generator to use with the options being:
    1. NAG’s newer Sobol generator (based on the 2008 Joe and Kuo algorithm[3] )
    2. NAG’s older Sobol generator
    3. NAG’s Niederreiter generator
    4. NAG’s Faure generator
  • The second number is the order in which the generated values are returned (The parameter RCORD as referred to in the documentation for g05ymf). Say that the matrix of returned values is called QUAS then if RCORD=1, QUAS(i,j) holds the jth value for the ith dimension, otherwise QUAS(i,j) holds the ith value for the jth dimension.
  • The third number is the number of dimensions required.
  • The fourth number is the number of number of quasi-random numbers required.
  • The boolean (either .t. or .f.) determines whether or not the output should be verbose or not.  A value of .t. will output the first 100 numbers in the sequence.

To compile the benchmark program against the serial library I did:

gfortran ./bench_g05ymf.f90 shtim.c wltime.f /opt/NAG/fll6a22dfl/lib/libnag_nag.a  \
/opt/NAG/fll6a22dfl/acml/libacml_mv.a -o serial_g02aaf

As before, the only thing needed to turn this into a parallel program was to compile against the SMP library and add the -fopenmp switch

gfortran -fopenmp ./bench_g05ymf.f90 shtim.c wltime.f /opt/NAG/fsl6a22dfl/lib/libnagsmp.a  \
/opt/NAG/fsl6a22dfl/acml4.3.0/lib/libacml_mp.a -o smp_g02aaf

The first set of parameters I used was

1 2 900 1000000 .f.

Which uses NAG’s new Sobol generator with RCORD=2 to generate and store 1,000,000 numbers over 900 dimensions with no verbose output.  Averaged over 10 runs the times were 5.17 seconds for the serial version and 2.14 seconds for the parallel version giving a speedup of 2.4x on a quad-core machine.  I couldn’t push number of dimensions much higher because the benchmark stores all of the numbers in one large array and I was starting to run out of memory.

If you only want a relatively small sequence then switching to the SMP library is actually slower thanks to the overhead involved in spawning extra threads.  For example if you only want 100,000 numbers over 8 dimensions:

1 2 8 100000 .f.

then the serial version of the code takes an average of  0.0048 seconds compared to 0.0479 seconds for the parallel version so the parallel version is almost 10 times slower when using 4 threads for small problems. Setting OMP_NUM_THREADS to 1 gives exactly the same speed as the serial version as you might expect.

NAG have clearly optimized this function to ensure that you get a good speedup for large problem sets which is where it really matters so I am very pleased with it.  However, the degradation in performance for smaller problems concerns me a little.  I think I’d prefer it if NAG were to modify this function so that it works serially for small sequences and to automatically switch to parallel execution if the user requests a large sequence.


In the old days we could speedup our programs simply by buying a new computer.  The new computer would have a faster clock speed than the old one and so our code would run faster with close to zero effort on our part.  Thanks to the fact that clock speeds have stayed more or less constant for the last few years those days are over.  Now, when we buy a new computer we get more processors rather than faster ones and this requires a change in thinking.  Products such as the NAG Library for SMP & multicore help us to to get the maximum benefit out of our machines with the minimum amount of effort on our part.  If switching to a product like this doesn’t give you the performance boost you need then the next thing for you to do is to learn how to program in parallel.  The free ride is over.

In summary:

  • You don’t need to modify your code if it already uses NAG’s serial library.  Just recompile against the new library and away you go.  You don’t need to know anything about parallel programming to make use of this product.
  • The SMP Library works best with big problems.  Small problems don’t work so well because of the inherent overheads of parallelisation.
  • On a quad-core machine you can expect speed-ups around 2-4 times compared to the serial library.  In exceptional circumstances you can expect speed-up as large as 7 times or more.
  • You should notice a speed increase in over 300 functions compared to the serial library.
  • Some of this speed increase comes from fundamental libraries such as BLAS and LAPACK, some of it comes from NAG directly parallelising their algorithms and yet more comes from improvements to the underlying serial code.
  • I’m hoping that this Fortran version is just the beginning.  I’ve always felt that NAG program in Fortran so I don’t have to and I’m hoping that they will go on to incorporate their SMP improvements in their other products, especially their C library and MATLAB toolbox.


Thanks to several staff at NAG who suffered my seemingly endless stream of emails during the writing of this article.  Ed, in particular, has the patience of a saint.  Any mistakes left over are all mine!



[1] – Higham N, Computing the nearest correlation matrix—a problem from finance, IMA Journal of Numerical Analysis 22 (3): 329–343

[2] – Qi H and Sun D (2006), A Quadratically Convergent Newton Method for Computing the Nearest
Correlation Matrix, SIAM J. Matrix AnalAppl 29(2) 360–385

[3] – Joe S and Kuo F Y (2008) Constructing Sobol sequences with better two-dimensional projects SIAM J. Sci. Comput. 30 2635–2654

Dislaimer: This is a personal blog and the opinions expressed on it are mine and do not necessarily reflect the policies and opinions of my employer, The University of Manchester.

March 1st, 2010

The Problem

A MATLAB user came to me with a complaint recently.  He had a piece of code that made use of the MATLAB Statistics toolbox but couldn’t get reliable access to  a stats toolbox license from my employer’s server.  You see, although we have hundreds of licenses for MATLAB itself, we only have a few dozen licenses for the statistics toolbox.  This has always served us well in the past but use of this particular toolbox is on the rise and so we sometimes run out which means that users are more likely to get the following error message these days

??? License checkout failed.
License Manager Error -4
Maximum number of users for Statistics_Toolbox reached.
Try again later.
To see a list of current users use the lmstat utility or contact your License Administrator.

The user had some options if he wanted to run his code via our shared network licenses for MATLAB (rather than rewrite in a free language or buy his own, personal license for MATLAB):

  • Wait until a license became free.
  • Wait until we found funds for more licenses.
  • Buy an additional license token for himself and add it to the network (235 pounds +VAT currently)
  • Allow me to gave a look at his code and come up with another way

He went for the final option.  So, I sat with a colleague of mine and we looked through the code.  It turned out that there was only one line in the entire program that used the Statistics toolbox and all that line did was call binopdf.  That one function call almost cost him 235 quid!

Now, binopdf() simply evaluates the binomial probability density function and the definition doesn’t look too difficult so you may wonder why I didn’t just cook up my own version of binopdf for him?  Quite simply, I don’t trust myself to do as good a job as The Mathworks.  There’s bound to be gotchas and I am bound to not know them at first.  What would you rather use, binpodf.m from Mathworks or binopdf.m from ‘Dodgy Sysadmin’s Functions Inc.’?  Your research depends upon this function remember…

NAG to the rescue!

I wanted to use a version of this function that was written by someone I trust so my mind immediately turned to NAG (Numerical Algorithms Group) and their MATLAB toolbox which my employer has a full site license for.  The NAG Toolbox for MATLAB has a function called g01bj which does what we need and a whole lot more.  The basic syntax is

[plek, pgtk, peqk, ifail] = g01bj(n, p, x)

From the NAG documentation: Let X denote a random variable having a binomial distribution with parameters n and p. g01bj calculates the following probabilities

  • plek = Prob{X<=x}
  • pgtk = Prob{X > x}
  • peqk = Prob{X = x}

MATLAB’s binopodf only calculates Prob(X=x) so the practical upshot is that for the following MATLAB code


the corresponding NAG Toolbox code (on a 32bit machine) is

[plek, pgtk, y, ifail] = g01bj(n, p, x);

both of these code snippets gives

y =

Further tests show that NAG’s and MATLAB’s results agree either exactly or to within what is essentially numerical noise for a range of values. The only thing that remained was to make NAG’s function look more ‘normal’ for the user in question which meant creating a file called nag_binopdf.m which contained the following

function pbin = nag_binopdf(x,n,p)
   [~,~,pbin,~] = g01bj(n,p,x);

Now the user had a function called nag_binopdf that behaved just like the Statistics toolbox’s binopdf, for his particular application, argument order and everything.

It’s not all plain sailing though!

As good as the NAG toolbox is, it’s not perfect.  Note that I placed particular emphasis on the fact that we had come up with a suitable, drop-in replacement for binopdf for this user’s particular application.  The salient points of his application are:

  • His input argument, x, is just a single value – not a vector of values. The NAG library wasn’t written with MATLAB in mind, it’s written in Fortran, and so many of its functions are not vectorised (some of them are though).  In Fortran this is no problem at all but in MATLAB it’s a pain.  I could, of course, write a nag_binopodf that looks like it’s vectorised by putting g01bj in a loop hidden away from the user but I’d soon get found out because the performance would suck.  To better support MATLAB users, NAG needs to write properly vectorised versions of its functions.
  • He only wanted to run his code on 32bit machines. If he had wanted to run it on a 64bit machine then he would have to change all of those int32() calls to int64().  I agree that this is no big deal but it’s something that he wouldn’t have to worry about if he had coughed up the cash for a Mathworks statistics license.

Final thoughts

I use the NAG toolbox for MATLAB to do this sort of thing all the time and have saved researchers at my university several thousand pounds over the last year or so through unneeded Mathworks toolbox licenses.  As well as statistics, the NAG toolbox is also good for optimisation (local and global), curve fitting, splines, wavelets, finance,partial differential equations and probably more.  Not everyone likes this approach and it is not always suitable but when it works, it works well.  Sometimes, you even get a significant speed boost into the bargain (check out my interp1 article for example).

Comments are, as always, welcomed.

Blog posts similar to this one

January 15th, 2010

The Numerical Algorithms Group (NAG) is 40 years old this year and they have begun their celebrations by starting a company blog.  In their first blog post, one of the numerical library developers has asked the world a question:

“How useful would routines that provide particular solutions to the Painleve equations be to our customer base and to the research community in particular?”

I’ll confess that, until 10 minutes ago, I had never heard of the Painlevé equations but there are articles on Wikipedia and Mathworld to get you started if you are in the same boat as me.

If, on the other hand, you know all about them and can venture an opinion then head over to NAG’s new blog and say hi.

I am a big fan of the NAG’s products and you can see some of my posts about them at the following links:

September 10th, 2009

Just a quick note to say that I’ve released version 0.15 of PyNAG – my Python interface to the NAG C library and given it its own dedicated page on Walking Randomly.  New features include

  • Basic support for 165 functions from Mark 8 of the NAG C library.
  • 29 simple but fully functional examples.
  • Added initial support for NAG’s Complex structure.