A free version of the pdist command for MATLAB
MATLAB contains a function called pdist that calculates the ‘Pairwise distance between pairs of objects’. Typical usage is
X=rand(10,2); dists=pdist(X,'euclidean');
It’s a nice function but the problem with it is that it is part of the Statistics Toolbox and that costs extra. I was recently approached by a user who needed access to the pdist function but all of the statistics toolbox license tokens on our network were in use and this led to the error message
??? License checkout failed. License Manager Error -4 Maximum number of users for Statistics_Toolbox reached. Try again later. To see a list of current users use the lmstat utility or contact your License Administrator
One option, of course, is to buy more licenses for the statistics toolbox but there is another way. You may have heard of GNU Octave, a free,open-source MATLAB-like program that has been in development for many years. Well, Octave has a sister project called Octave-Forge which aims to provide a set of free toolboxes for Octave. It turns out that not only does Octave-forge contain a statistics toolbox but that toolbox contains an pdist function. I wondered how hard it would be to take Octave-forge’s pdist function and modify it so that it ran on MATLAB.
Not very! There is a script called oct2mat that is designed to automate some of the translation but I chose not to use it – I prefer to get my hands dirty you see. I named the resulting function octave_pdist to help clearly identify the fact that you are using an Octave function rather than a MATLAB function. This may matter if one or the other turns out to have bugs. It appears to work rather nicely:
dists_oct = octave_pdist(X,'euclidean'); % let's see if it agrees with the stats toolbox version all( abs(dists_oct-dists)<1e-10) ans = 1
Let’s look at timings on a slightly bigger problem.
>> X=rand(1000,2); >> tic;matdists=pdist(X,'euclidean');toc Elapsed time is 0.018972 seconds. >> tic;octdists=octave_pdist(X,'euclidean');toc Elapsed time is 6.644317 seconds.
Uh-oh! The Octave version is 350 times slower (for this problem) than the MATLAB version. Ouch! As far as I can tell, this isn’t down to my dodgy porting efforts, the original Octave pdist really does take that long on my machine (Ubuntu 9.10, Octave 3.0.5).
This was far too slow to be of practical use and we didn’t want to be modifying algorithms so we ditched this function and went with the NAG Toolbox for MATLAB instead (routine g03ea in case you are interested) since Manchester effectively has an infinite number of licenses for that product.
If,however, you’d like to play with my MATLAB port of Octave’s pdist then download it below.
- octave_pdist.m makes use of some functions in the excellent NaN Toolbox so you will need to download and install that package first.
Another option would be to look on the File Exchange eg IPDM: Inter-Point Distance Matrix calculates the full interpair distance matrix.
The output isn’t in the form of the matrix returned by pdist, so for comparison purposes I’m using squareform from the Statistics Toolbox for converting the pdist output:
>> tic;dists_pdist = pdist(X, ‘euclidean’);toc % Default form of output
Elapsed time is 0.023311 seconds.
>> tic;dists_pdist = squareform(pdist(X, ‘euclidean’));toc % Full matrix output
Elapsed time is 0.067669 seconds.
>> tic;dists_ipdm = ipdm(X); toc
Elapsed time is 0.081956 seconds.
>> all(all((dists_ipdm – dists_pdist) == 0))
ans =
1
That’s great – thanks Matt. Always more than one way to skin a cat :)
Another approach might be to link the NAG function with Octave. I suspect if a user was genuinely interested they’d help. They have a report on their site http://www.nag.co.uk/doc/techrep/html/TR4_09/Tr4_09.asp
>> tic;octdists=octave_pdist(X,’euclidean’);toc
??? Undefined function or method ‘sumsq’ for input arguments of type ‘double’.
@Robert
sumsq is in the NaN Toolbox mentioned at the bottom of the post. You can get it at
http://biosig-consulting.com/matlab/NaN/
Cheers,
Mike