## Archive for the ‘Statistics’ Category

In a recent article, Matt Asher considered the feasibility of doing statistical computations in JavaScript. In particular, he showed that the generation of 10 million normal variates can be as fast in JavaScript as it is in R, provided you use Google’s Chrome as the web browser. From this, one might infer that using JavaScript to do your Monte Carlo simulations could be a good idea.

It is worth bearing in mind, however, that we are not comparing like for like here.

The default random number generator for R uses the Mersenne Twister algorithm, which is of very high quality, has a huge period and is well suited to Monte Carlo simulations. It is also the default algorithm for modern versions of MATLAB and is available in many other high quality mathematical products such as Mathematica, the NAG Library, Julia and NumPy.

The algorithm used for JavaScript’s **Math.random()** function depends upon your web browser. A little googling uncovered a document that gives details on some implementations. According to this document, Internet Explorer and Firefox both use 48-bit Linear Congruential Generator (LCG)-style generators but use different methods to set the seed. Safari on Mac OS X uses a 31-bit LCG and Version 8 of Chrome on Windows uses 2 calls to rand() in msvcrt.dll. So, for V8 Chrome on Windows, Math.random() is a floating point number consisting of the second rand() value, concatenated with the first rand() value, divided by 2^30.
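To make the Chrome construction concrete, here is a minimal Python sketch of the recipe as described above: two draws concatenated and divided by 2^30. It assumes each rand() call yields 15 bits (msvcrt's RAND_MAX is 0x7FFF), and the helper `fake_rand` is a hypothetical stand-in built on Python's own generator, not the msvcrt algorithm.

```python
import random

RAND_BITS = 15  # assumption: each rand() call returns a value in 0..32767


def fake_rand(rng: random.Random) -> int:
    """Hypothetical stand-in for the C rand() call; NOT the msvcrt algorithm."""
    return rng.getrandbits(RAND_BITS)


def chrome_style_random(rng: random.Random) -> float:
    """Combine two 15-bit draws into one float in [0, 1)."""
    first = fake_rand(rng)
    second = fake_rand(rng)
    # The second value supplies the high bits, the first value the low bits
    combined = (second << RAND_BITS) | first
    return combined / 2**30


rng = random.Random(1)
samples = [chrome_style_random(rng) for _ in range(5)]
print(samples)
```

Whatever the quality of the underlying rand(), a construction like this can only ever return one of 2^30 distinct values, which is far coarser than a full 53-bit double.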

The points I want to make here are:

- JavaScript’s **Math.random()** uses different algorithms in different browsers.
- These algorithms have relatively small periods. For example, a 48-bit LCG has a period of 2^48 compared to 2^19937-1 for Mersenne Twister.
- They have poor statistical properties. For example, the 48-bit LCG implemented in Java’s `java.util.Random` class fails 21 of the BigCrush tests. I haven’t found any test results for JavaScript implementations but expect them to be at least as bad. I understand that Mersenne Twister fails 2 of the BigCrush tests but these are not considered to be an issue by many people.
- You can’t manually set the seed for **Math.random()** so reproducibility is impossible.
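As a rough illustration of the points above, the following Python sketch implements a seedable 48-bit LCG using the constants documented for `java.util.Random`, and sets it beside Python's `random` module, which uses Mersenne Twister. The class name `Lcg48` is my own invention for illustration, and none of this is a browser's actual implementation.

```python
import random

MULTIPLIER = 0x5DEECE66D  # constants documented for java.util.Random
INCREMENT = 0xB
MASK = (1 << 48) - 1      # the state is kept modulo 2^48


class Lcg48:
    """Illustrative 48-bit LCG; the whole state is a single 48-bit integer."""

    def __init__(self, seed: int) -> None:
        # java.util.Random scrambles the seed with the multiplier before use
        self.state = (seed ^ MULTIPLIER) & MASK

    def next_bits(self, bits: int) -> int:
        # One LCG step, then keep only the top `bits` bits of the state
        self.state = (self.state * MULTIPLIER + INCREMENT) & MASK
        return self.state >> (48 - bits)

    def random(self) -> float:
        # 26 + 27 bits make a 53-bit fraction, the nextDouble() recipe
        return ((self.next_bits(26) << 27) + self.next_bits(27)) / (1 << 53)


# Same seed, same stream: this is what reproducibility means in practice
lcg_run_1 = [Lcg48(2012).random() for _ in range(1)]
gen_a, gen_b = Lcg48(2012), Lcg48(2012)
lcg_run_1 = [gen_a.random() for _ in range(3)]
lcg_run_2 = [gen_b.random() for _ in range(3)]
print(lcg_run_1 == lcg_run_2)

# Python's Mersenne Twister can be seeded in the same way
mt_a, mt_b = random.Random(2012), random.Random(2012)
mt_run_1 = [mt_a.gauss(0.0, 1.0) for _ in range(3)]
mt_run_2 = [mt_b.gauss(0.0, 1.0) for _ in range(3)]
print(mt_run_1 == mt_run_2)
```

Because the LCG's entire state fits in 48 bits, its sequence must repeat after at most 2^48 steps. A generator with no way to set the seed gives you no means of replaying a simulation run like the ones above.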

Genstat, a general-purpose statistics package that seems to have a good following in the biosciences, has been upgraded to version 12.1. A selection from the ‘what’s new?’ section:

- Release 12.1 includes 2 new directives, 42 new procedures and 41 new functions.
- Support for Cornell ecology file formats.
- Support for MapQTL loc and map files.
- Support for R/QTL rotated comma-delimited separate genotype (csvsr) files.
- Support for Stata 10 files.
- Quantile regression menu for fitting quantile regression models.
- Parallel regression analyses.
- Ability to display confidence intervals for parameter estimates.
- Display all standard errors of differences in General Analysis of Variance menu.
- Trellis plot of means.
- Biplots added to principal components and canonical variates analysis menus.
- Permutation tests available for MANOVA.

Check out the full list of new features over at VSNI’s website.

Since I am not a statistician and I don’t support statistics software, I hadn’t heard about Genstat until fairly recently but they seem to be a very cool company. Take a look at the free Genstat Discovery Edition for example which is for use by not-for-profit research organizations, charities and educational institutes based in the developing world.

Commercial software companies who give something back to the community in ways like this always earn respect in my eyes.

A commercial statistical package called Genstat appeared on my radar fairly recently and I have just discovered that a new version of it was released last month. The list of new features is here.

My knowledge of stats is pitiful at the moment (I am working on it though) so I won’t even attempt a review but will, instead, point you to one by John Wass over at ScientificComputing.com.

A full 30 day trial is available from Genstat’s website (you’ll need to watch a video before getting access to the trial). Feel free to drop me a comment about this software if you know anything about it.

**Update 9th July 2009:** As someone pointed out to me via email – Genstat 11 was released in June 2008 and not last month as I stated above. Sorry for any confusion this might have caused.

I love mathematical software – which makes me a lucky guy because I tend to deal with it a lot as part of my job as a science and engineering software specialist at the University of Manchester in the UK. Thanks to this role, I often get asked if I can recommend software for various tasks and my answer depends on many factors such as

- Does our university have a site license for it (or is it free)?
- What packages do people in your research / teaching area tend to use? Sometimes I know this off the top of my head, other times I know people who know people who know ;)
- What is your computing background?
- Are you likely to need more support from me? If so then I am more likely to recommend that you use a package that I know a lot about.
- What is the support from the vendor like? Researchers tend to want to do odd things and a software vendor who is willing to work with them tends to bubble to the top of my list. There are some who have written routines just for us at no extra charge – I LIKE these kind of vendors.
- If I don’t know enough about a package to offer direct support to you then do I know someone who is willing and able to (bear this in mind if I ever offer to buy you a coffee, I may call in a favour at some point)?
- What software are you currently working with? Will you want something that can integrate with it or do you want to start afresh?

There are many others of course such as my mood, whether or not I have a clue what you are talking about and what software I have been playing around with recently but that pretty much covers it. One application that I have started to recommend to people recently is called Simfit.

**What is Simfit?**

Simfit is a **free** piece of software written by Bill Bardsley of the University of Manchester. According to the application’s website “*Simfit is a computer package for simulation, statistical analysis, curve fitting and graph plotting*.” It is written for Windows but can also be run under Linux via WINE (instructions specific to OpenSuse and KDE are given at this website).

Simfit actually consists of a suite of over forty individual programs but all of these are held together by one central ‘program manager’. Essentially you spend your Simfit life in the program manager and it will run the sub-programs as and when they are needed. It’s all very straightforward although, in my opinion, the graphical user interface isn’t as intuitive as most applications I have used. I guess I should qualify this somewhat – by ‘not as intuitive’ I mean ‘doesn’t quite behave like all other Windows applications’ and after speaking to Bill it seems that the reason for this is that he wanted to avoid anything that would make Simfit less stable or more memory hungry. His interface works well and all of his users are happy with it so why change? Furthermore, it hardly uses any memory at all and so Simfit will quite happily accept massive data sets that can bring other packages to their knees. It may be quirky but once you get used to it you’ll be fine.

**Academic versus Professional versions**

Simfit comes in two flavours and both of them are free in the sense that Bill will charge you nothing for them. However, one of them (*the Professional version*) has the NAG Library as a pre-requisite and NAG costs money. If you (or your institution) have a license for the NAG Library then you are good to go and can happily use the Professional version of Simfit at no extra charge. If, however, you don’t have a license for the NAG Library then you need to use the *Academic version* which is also free but has less functionality. Details of the differences in functionality between the two versions can be found on the Simfit website.

When you download Simfit you get both versions at the same time and the package comes with a utility that allows you to easily swap between the Academic and NAG Library editions.

At this point, it might be worth mentioning that if you desperately need some functionality that is only available in the professional version then you can obtain a trial version of the NAG library from NAG themselves.

**Installation**

Simfit works perfectly on just about any version of Windows you care to mention – Windows 95, 98, XP and Vista to name a few. I haven’t had chance to try it out on the new Windows 7 beta yet but I know one thing for sure – if it doesn’t work initially then I’ll bet my last pound that Bill will get it working very quickly. He has put a lot of effort into ensuring that Simfit is portable across Windows versions and it shows.

In order to use all of Simfit’s features you will need to install a few pre-requisites first but happily all of them are free and Bill has made most of them conveniently available from his website. I installed things in the following order.

For .eps file support you need the following

- Ghostscript package – available from the Simfit website (among other places)
- GSview Windows – available from the Simfit website (among other places)

For pdf viewing support you need

- Adobe Reader – available from Adobe.

If you want to use Simfit in full ‘professional’ mode then you will need to install the commercial NAG libraries. If you don’t have a license for the NAG Library then don’t worry about it – Simfit will still work but with a few less bells and whistles.

- NAG Library (FLDLL214ML or FLDLL214AL should do the job) – available from the Numerical Algorithms Group.

Finally, you install Simfit itself using simfit_setup.exe. The installer is so quick and easy that I won’t insult your intelligence by going through it blow by blow. If you forget to install any of the pre-requisites then don’t worry – you can still install Simfit and simply add the other packages if and when you need them.

**Plotting Graphs in Simfit**

Let’s start out with one of the (mathematically) simplest things that Simfit can do – plotting. They say that a picture can speak a thousand words so here are four plots produced with Simfit (taken from the author’s website)

Simfit’s graphical output is in the widely used EPS format (Encapsulated PostScript) which is of extremely high quality and easily incorporated into LaTeX documents. An extra bonus that you get with EPS files is that they are actually nothing more than text files (well the ones produced by Simfit are at least) – you can open them up in something like Notepad and edit them by hand (if you know what you are doing). If you are unfortunate enough to be unable to use EPS files then Simfit can convert them to a wide range of graphics formats including bmp, jpg, pcx, tiff and Windows quality emf. There is also a built-in PostScript editor to manipulate eps files and create collages, insets etc.
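To give a feel for what editing an EPS file by hand involves, here is a small Python sketch that writes a tiny EPS file as plain text and then thickens the line it draws by changing one number. The file contents are a hand-written minimal example of my own, not Simfit output, and real Simfit files will be far longer.

```python
import tempfile
from pathlib import Path

# A hand-written minimal EPS file (an illustration, not Simfit output)
MINIMAL_EPS = """%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 0 0 200 100
newpath
10 10 moveto
190 90 lineto
1 setlinewidth
stroke
showpage
"""

eps_path = Path(tempfile.mkdtemp()) / "line.eps"
eps_path.write_text(MINIMAL_EPS, encoding="ascii")

# The "hand edit": change the line width from 1 to 4 points with a text replace
original = eps_path.read_text(encoding="ascii")
edited = original.replace("1 setlinewidth", "4 setlinewidth")
eps_path.write_text(edited, encoding="ascii")

print(eps_path.read_text(encoding="ascii"))
```

The same change could be made in Notepad. The point is only that the drawing instructions are ordinary readable text.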

Simfit has the capability to generate many different plot types including (among others) standard x-y line plots, 2D and 3D histograms, pie charts, box and whisker plots, scree diagrams, dendrograms, Scatchard plots, 3D surface plots and contour plots. Phew! I don’t even know what some of these are. There are many other graphical facilities available but I think I will simply direct you to the author’s webpage rather than repeat the full list here.

**Statistical analysis with Simfit**

In its early days, Simfit only handled simulation and curve fitting but Bill and his collaborators felt that it would be useful to add an additional suite of statistical tests to the package. But which tests should be given priority? Bill’s approach was to speak to one of his collaborators, Tom Sharpe, a member of the Manchester University’s School of Biological Sciences (as it was called at the time). Tom drew up a list of all the statistical tests that had been used by the department over the previous few years and Bill implemented the lot! This formed the foundation of Simfit’s statistical abilities and Bill has been adding more routines ever since.

The full list of statistical routines is available on the Simfit website but, to save you a bit of clicking, I’ll reproduce a **small sample** of them here (shamelessly copied and pasted from the larger list here)

- chi-square (O/E vectors, m by n contingency tables and wssq/ndof)
- McNemar test on n by n frequency tables
- Cochran Q test
- Fisher exact (2 by 2 contingency table) with all p values
- Fisher exact Poisson distribution test
- t (both equal and unequal variances, paired and unpaired)
- variance ratio
- F for model validation
- 1,2,3-way Anova (with automatic variance stabilizing transformations and nonparametric equivalents)
- Tukey post-ANOVA Q test
- Factorial ANOVA with marginal plots
- Repeated measures ANOVA with Helmert matrix of orthonormal contrasts, Mauchly sphericity test and Greenhouse-Geisser/Huynh-Feldt epsilon corrections
- MANOVA with Wilks lambda, Roy’s largest root, Lawley-Hotelling trace, and Pillai trace for equality of mean vectors, Box’s test for equality of covariance matrices, and profile analysis for repeated measurements.

**Documentation**

It doesn’t matter how good a piece of software is if you can’t figure out how to use it and so good documentation is essential – especially for programs that contain advanced mathematics. Fortunately Simfit is well served in this area as it comes with a beautifully produced manual. The manual has been typeset using LaTeX and is accessible from the Windows start menu in either pdf or ps format. Bill even makes the manual’s full LaTeX source code available from the Simfit website (in the downloads section) so you can compile it from source if you wish. The pdf file is very easy to navigate since it comes with a full table of contents and is extensively hyper-linked.

Weighing in at 425 pages (at the time of writing) it is certainly comprehensive but it is also very clearly written – Bill’s multi-decade experience of being a university lecturer clearly shining through. As far as I can tell, the manual contains a fully worked example for every single Simfit routine including a suitable dataset and the sample data sets are also built into Simfit itself as you might expect. In fact I think that this is more than just a software manual, it gives a nice introduction to some of the mathematics too. In all honesty I think that this is one of the best produced manuals for a free piece of software that I have ever seen and it gives commercial offerings a run for their money too.

**Use of Simfit in the academic community**

If you do a quick google search for free software that can handle things such as curve fitting, statistical analysis and publication quality plots then you will quickly be overwhelmed with choices but which ones can you trust? If you are going to submit your results for publication in a peer-reviewed journal then you need to be really sure that the software you choose gives you the right answer. One way to measure confidence in a piece of software like this is to look in the scientific literature and see if other researchers are using and citing it (in a good way).

So, I did exactly that and discovered that Simfit has been used by hundreds of researchers for many years all over the world. Here is a completely random selection of scientific papers that have made use of Simfit calculations over the last few years. I have tried to work out exactly how they used the package and put this in bold at the end of each citation but bear in mind that my own knowledge of statistics is pitiful so I apologize if I have got anything wrong.

- *Diverging catalytic capacities and selectivity profiles with haloalkane substrates of chimeric alpha class glutathione transferases*, Protein engineering, design & selection [1741-0126] Kurtovic yr:2008 (**biplots and dendrogram analyses**)
- *Phylogeny and evolution of papillomaviruses based on the E1 and E2 proteins*, Bravo and Alonso, Virus genes [0920-8569] yr:2007 vol:34 iss:3 pg:249 (**deconvolution of Gaussian distributions**)
- *Catalysis of potato epoxide hydrolase, StEH1*, The Biochemical journal [0264-6021] Elfström yr:2005 vol:390 iss:Pt 2 pg:633 (**fit to the Michaelis–Menten equation by non linear regression**)
- *Selective Estrogen Receptor Modulators Accelerate Cutaneous Wound Healing in Ovariectomized Female Mice*, Endocrinology [0013-7227] Hardman yr:2008 vol:149 iss:2 pg:551 (**Spearman’s coefficient**)
- *Sex Dimorphism in Wound Healing: The Roles of Sex Steroids and Macrophage Migration Inhibitory Factor*, Endocrinology [0013-7227] Gilliver yr:2008 vol:149 iss:11 pg:5747 (**unpaired Student’s *t* test and one-way ANOVA**)
- *Three dimensional analysis of microaneurysms in the human diabetic retina*, Journal of anatomy [0021-8782] MOORE yr:1999 vol:194 iss:01 pg:89 (**Kruskal-Wallis test and Kolmogorov-Smirnov 1 sample test**)

This is just a small sample of what I found but I think you get the idea. If you spend a bit of time searching on Google Scholar then you will come up with **many** more. So, it seems that Simfit is definitely trusted and well used by many in the scientific community.

**What if Simfit doesn’t have the routine you need?**

If you find yourself wanting to do some sort of analysis that isn’t in Simfit then simply email Bill and (politely) ask! If time permits then he may well help you out.

**Spanish Version**

Simfit is used extensively in Spain and a Spanish version of the program is maintained by a team in Salamanca. I don’t speak Spanish so can’t offer any comment on this except to say that it is fully endorsed by Bill (in fact he gives Simfit courses over in Salamanca) and that the URL for the Spanish version is http://simfit.usal.es/

**Source code**

Simfit is written in FORTRAN for Windows and the full source code is available to download from the Simfit website. One of the great things about having access to the source code is that you can see exactly how each and every mathematical routine works if you so desire. In addition, if you find any bugs then you will be able to fix them yourself if Bill isn’t around and you will be safe in the knowledge that you will always be able to obtain a copy of the program and (with a bit of work) port it to any computing platform you like.

Something that I would personally **love** to see is a native Linux version of Simfit. Bill tells me that he has done his best to separate the Windows-centric code from the calculational parts of the program so maybe, just maybe, someone out there has enough Linux GUI programming experience to produce a native version for KDE or GNOME. It would be fantastic if Simfit were to eventually appear in the standard package repositories for Debian, Ubuntu and Fedora.

Something that crossed my mind was the possibility of making some (or all) of the Simfit routines callable from Python. I have no idea how feasible this would be and would need to delve into the source code (and probably send a few emails to Bill) to get a better idea. It is on my ‘to do’ list but one or two people have seen the size of that particular list and have wondered if anyone would be able to get through it before the heat-death of the universe. So if anyone fancies rolling up their sleeves and having a bash…the source is there :)

If anyone is **seriously** interested in working on a Linux (or Mac) port then I am sure Bill would love to hear from you. His email address is all over the Simfit site.

**Full Disclosure**

Now I know what you are thinking…Mike (that’s me) is from Manchester University and the author of this software (that’s Bill) is also from Manchester University and so Mike is simply promoting his colleague’s software via his blog. Well in the interests of full disclosure I’ll tell you that I have met Bill a few times and like him a lot but if his software stank then I would tell him (and you) so. He isn’t my boss or anything (in fact he is now retired) so I don’t have a vested interest in making him happy. I don’t use or recommend software because I like the author – I use or recommend it because I think it is the right tool for the job.

As always, comments are welcome!