Archive for July, 2010
The MATLAB function ranksum is part of MATLAB’s Statistics Toolbox. Like many organizations who use network licensing for MATLAB and its toolboxes, my employer, The University of Manchester, sometimes runs out of licenses for this toolbox which leads to following error message when you attempt to evaluate ranksum.
??? License checkout failed. License Manager Error -4 Maximum number of users for Statistics_Toolbox reached. Try again later. To see a list of current users use the lmstat utility or contact your License Administrator.
An alternative to the Statistics Toolbox is the NAG Toolbox for MATLAB for which we have an unlimited number of licenses. Here’s how to replace ranksum with the NAG routine g08ah.
Original MATLAB / Statistics Toolbox code
x = [0.8147;0.9058;0.1270;0.9134;0.6324;0.0975;0.2785;0.5469;0.9575;0.9649]; y= [0.4076;1.220;1.207;0.735;1.0502;0.3918;0.671;1.165;1.0422;1.2094;0.9057;0.285;1.099;1.18;0.928]; p = ranksum(x,y)
The result is p = 0.0375
Code using the NAG Toolbox for MATLAB
x = [0.8147;0.9058;0.1270;0.9134;0.6324;0.0975;0.2785;0.5469;0.9575;0.9649]; y = [0.4076;1.220;1.207;0.735;1.0502;0.3918;0.671;1.165;1.0422;1.2094;0.9057;0.285;1.099;1.18;0.928]; tail = 'T'; [u, unor, p, ties, ranks, ifail] = g08ah(x, y, tail);
The value for p is the same as that calculated by ranksum: p = 0.0375
NAG’s g08ah routine returns a lot more than just the value p but, for this particular example, we can just ignore it all. In fact, if you have MATLAB 2009b or above then you could call g08ah like this
tail = 'T'; [~, ~, p, ~, ~, ~] = g08ah(x, y, tail);
Which explicitly indicates that you are not going to use any of the outputs other than p.
People at Manchester are using the NAG toolbox for MATLAB more and more; not only because we have a full site license for it but because it can sometimes be very fast. Here’s some more articles on the NAG toolbox you may find useful.
A bit of background to this post
I work in the IT department of the University of Manchester and we are currently developing a Condor Pool which is basically a method of linking together hundreds of desktop machines to produce a high-throughput computing resource. A MATLAB user recently submitted some jobs to our pool and complained that all of them gave identical results which is stupid because his code used MATLAB’s rand command to mix things up a bit.
I was asked if I knew why this should happen to which I replied ‘yes.’ I was then asked to advise the user how to fix the problem and I did so. The next request was for me to write some recommendations and tutorials on how users should use random numbers in MATLAB (and Mathematica and possibly Python while I was at it) along with our Condor Pool and I went uncharacteristically quiet for a while.
It turned out that I had a lot to learn about random numbers. This is the first of a series of (probably 2) posts that will start off by telling you what I knew and move on to what I have learned. It’s as much a vehicle for getting the concepts straight in my head as it is a tutorial.
Ask MATLAB for 10 Random Numbers
Before we go on, I’d like you to try something for me. You have to start on a system that doesn’t have MATLAB running at all so if MATLAB is running then close it before proceeding. Now, open up MATLAB and before you do anything else issue the following command
rand(10,1)
As many of you will know, the rand command produces random numbers from the uniform distribution between 0 and 1 and the command above is asking for 10 such numbers. You may reasonably expect that the 10 random numbers that you get will be different from the 10 random numbers that I get; after all, they are random right? Well, I got the following numbers when running the above command on MATLAB 2009b running on Linux.
ans = 0.8147 0.9058 0.1270 0.9134 0.6324 0.0975 0.2785 0.5469 0.9575 0.9649
Look familiar?
Now I’ve done this experiment with a number of people over the last few weeks and the responses can be roughly split into two different camps as follows:
1. Oh yeah, I know all about that – nothing to worry about. It’s pretty obvious why it happens isn’t it?
2. It’s a bug. How can the numbers be random if MATLAB always returns the same set?
What does random mean anyway?
If you are new to the computer generation of random numbers then there is something that you need to understand and that is that, strictly speaking, these numbers (like all software generated ‘random’ numbers) are not ‘truly’ random. Instead they are pseudorandom – my personal working definition of which is “A sequence of numbers generated by some deterministic algorithm in such a way that they have the same statistical properties of ‘true’ random numbers”. In other words, they are not random they just appear to be but the appearance is good enough most of the time.
Pseudorandom numbers are generated from deterministic algorithms with names like Mersenne Twister, L’Ecuyer’s mrg32k3a [1] and Blum Blum Schub whereas ‘true’ random numbers come from physical processes such as radioactive decay or atmospheric noise (the website www.random.org uses atmospheric noise for example).
For many applications, the distinction between ‘truly random’ and ‘pseudorandom’ doesn’t really matter since pseudorandom numbers are ‘random enough’ for most purposes. What does ‘random enough’ mean you might ask? Well as far as I am concerned it means that the random number generator in question has passed a set of well defined tests for randomness – something like Marsaglia’s Diehard tests or, better still, L’Ecuyer and Simard’s TestU01 suite will do nicely for example.
The generation of random numbers is a complicated topic and I don’t know enough about it to do it real justice but I know a man who does. So, if you want to know more about the theory behind random numbers then I suggest that you read Pierre L’Ecuyer’s paper simply called ‘Random Numbers’ (pdf file).
Back to MATLAB…
Always the same seed
So, which of my two groups are correct? Is there a bug in MATLAB’s random number generator or not?
There is nothing wrong with MATLAB’s random number generator at all. The reason why the command rand(10,1) will always return the same 10 numbers if executed on startup is because MATLAB always uses the same seed for its pseudorandom number generator (which at the time of writing is a Mersenne Twister) unless you tell it to do otherwise.
Without going into details, a seed is (usually) an integer that determines the internal state of a random number generator. So, if you initialize a random number generator with the same seed then you’ll always get the same sequence of numbers and that’s what we are seeing in the example above.
Sometimes, this behaviour isn’t what we want. For example, say I am doing a Monte Carlo simulation and I want to run it several times to verify my results. I’m going to want a different sequence of random numbers each time or the whole exercise is going to be pointless.
One way to do this is to initialize the random number generator with a different seed at startup and a common way of achieving this is via the system clock. The following comes straight out of the current MATLAB documentation for example
RandStream.setDefaultStream(RandStream('mt19937ar','seed',sum(100*clock)));
If you are using MATLAB 2011a or above then you can use the following, much simpler syntax to do the same thing
rng shuffle
Do this once per MATLAB session and you should be good to go (there is usually no point in doing it more than once per session by the way….your numbers won’t be any ‘more random’ if you so. In fact, there is a chance that they will become less so!).
Condor and ‘random’ random seeds
Sometimes the system clock approach isn’t good enough either. For example, at my workplace, Manchester University, we have a Condor Pool of hundreds of desktop machines which is perfect for people doing Monte Carlo simulations. Say a single simulation takes 5 hours and it needs to be run 100 times in order to get good results. On one machine that’s going to take about 3 weeks but on our Condor Pool it can take just 5 hours since all 100 simulations run at the same time but on different machines.
If you don’t think about random seeding at all then you end up with 100 identical sets of results using MATLAB on Condor for the reasons I’ve explained above. Of course you know all about this so you switch to using the clock seeding method, try again and….get 100 identical sets of results[2].
The reason for this is that the time on all 100 machines is synchronized using internet time servers. So, when you start up 100 simultaneous jobs they’ll all have the same timestamp and, therefore, have the same random seed.
It seems that what we need to do is to guarantee (as far as possible) that every single one of our condor jobs gets a unique seed in order to provide a unique random number stream and one way to do this would be to incorporate the condor process ID into the seed generation in some way and there are many ways one could do this. Here, however, I’m going to take a different route.
On Linux machines it is possible to obtain small numbers of random numbers using the special files /dev/random and /dev/urandom which are interfaces to the kernel’s random number generator. According to the documentation ‘The random number generator gathers environmental noise from device drivers and other sources into an entropy pool. The generator also keeps an estimate of the number of bit of the noise in the entropy pool. From this entropy pool random numbers are created.’
This kernel generator isn’t suitable for simulation purposes but it will do just fine for generating an initial seed for MATLAB’s pseudorandom number generator. Here’re the MATLAB commands
[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}'''); seed=str2double(seed); RandStream.setDefaultStream(RandStream('mt19937ar','seed',seed));
In MATLAB 2011a or above you can change this to
[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}'''); seed=str2double(seed); rng(seed);
Put this at the beginning of the MATLAB script that defines your condor job and you should be good to go. Don’t do it more than once per MATLAB session though – you won’t gain anything!
The end or the end of the beginning?
If you asked me the question ‘How do I generate a random seed for a pseudorandom number generator?’ then I think that the above solution answers it quite well. If, however, you asked me ‘What is the best way to generate multiple independent random number streams that would be good for thousands of monte-carlo simulations?‘ then we need to rethink somewhat for the following reasons.
Seed collisions: The Mersenne twister algorithm currently used as the default random number generator for MATLAB uses a 32bit integer seed which means that it can take on 2^32 or 4,294,967,296 different values – which seems a lot! However, by considering a generalisation of the birthday problem it can be seen that if you select such a seed at random then you have a 50% chance choosing two identical seeds after only 65,536 runs. In other words, if you perform 65,536 simulations then there is a 50% chance that two such simulations will produce identical results.
Bad seeds: I have read about (but never experienced) the possibility of ‘bad seeds’; that is some seeds that produce some very strange, non-random results – for the first few thousand iterates at least. This has led to some people advising that you should ‘warm-up’ your random number generator by asking for, and throwing away, a few thousand random numbers before you start using them. Does anyone know of any such bad seeds?
Overlapping or correlated sequences: All pseudorandom number generators are periodic (at least, all the ones that I know about are) – which means that after N iterations the sequence repeats itself. If your generator is good then N is usually large enough that you don’t need to worry about this. The Mersenne Twister used in MATLAB, for instance, has a huge period of (2^19937 – 1)/2 (half of the standard 32bit implementation).
The point I want to make is that you don’t get several different streams of random numbers, you get just one, albeit a very big one. Now, when you choose a seed you are essentially choosing a random point in this stream and there is no guarantee how far apart these two points are. They could be separated by a distance of trillions of points or they could be right next to each other – we simply do not know – and this leads to the possibility of overlapping sequences.
Now, one could argue that the possibility of overlap is very small in a generator such as the Mersenne Twister and I do not know of any situation where it has occurred in practice but that doesn’t mean that we shouldn’t worry about it. If your work is based on the assumption that all of your simulations have used independent, uncorrelated random number streams then there is a possibility that your assumptions could be wrong which means that your conclusions could be wrong. Unlikely maybe, but still no way to do science.
Next Time
Next time I’ll be looking at methods for generating guaranteed independent random number streams using MATLAB’s in-built functions as well as methods taken from the NAG Toolbox for MATLAB. I’ll also be including explicit examples that use this stuff in Condor.
Ask the audience
I assume that some of you will be in the business of performing Monte-Carlo simulations and so you’ll probably know much more about all of this than me. I have some questions
- Has anyone come across any ‘bad seeds’ when dealing with MATLAB’s Mersenne Twister implementation?
- Has anyone come across overlapping sequences when using MATLAB’s Mersenne Twister implementation?
- How do YOU set up your random number generator(s).
I’m going to change my comment policy for this particular post in that I am not going to allow (most) anonymous comments. This means that you will have to give me your email address (which, of course, I will not publish) which I will use once to verify that it really was you that sent the comment.
Notes and References
[1] L’Ecuyer P (1999) Good parameter sets for combined multiple recursive random number generators Operations Research 47:1 159–164
[2] More usually you’ll get several different groups of results. For example you might get 3 sets of results, A B C, and get 30 sets of A, 50 sets of B and 20 sets of C. This is due to the fact that all 100 jobs won’t hit the pool at precisely the same instant.
[3] Much of this stuff has already been discussed by The Mathworks and there is an excellent set of articles over at Loren Shure’s blog – Loren onThe Art of MATLAB.
I was recently asked to install 32bit Gaussian 03 binaries on an Ubuntu 9.10 machine and when I tried to run a test job I got the following error message
Erroneous write during file extend. write -1 instead of 4096 Probably out of disk space. Erroneous write during file extend. write -1 instead of 4096 Probably out of disk space. Write error in NtrExt1 Write error in NtrExt1: Bad address Segmentation fault
A bit of googling suggested that the following might work
sudo echo 0 > /proc/sys/kernel/randomize_va_space
but this will result in permission denied (explanation here). The command you really want to use is
sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space"
Once this was done, Gaussian worked as advertised. Maybe this post will help a googler sometime in the future.