## Archive for the ‘matlab’ Category

Just over 4 years ago I was very happy with the 1.2 Teraflops of single precision performance I measured on my then-new Dell XPS 95600 laptop using MATLAB’s GPU Bench and noted that its performance was on-par with the first supercomputer I supported professionally. Double precision performance stank of course but I had gotten used to that with laptop GPUs!

Fast forward to 2021 and my personal computational landscape has changed substantially. The pandemic has confined me to my home office and high performance laptops don’t have quite the same allure that they did when I was doing most of my interesting computational work in airports and coffee shops. I swapped a 15 inch screen for a 49 inch Ultrawide monitor and my laptop is gathering dust while I enjoy my new Dell desktop with an 8 core Intel i7-11700 and an NVIDIA GTX 3070.

This is the first desktop machine I’ve personally owned for almost two decades and the amount of performance I got for less money than the high spec laptops I used to favour is astonishing! Here are the GPU Bench results using MATLAB 2021a

The standout figure is 10,399 Gigaflops for single precision Matrix-Matrix multiplication. At only slightly shy of **10.4 Teraflops** that’s almost an order of magnitude faster than the results that impressed me on my laptop 4 years ago!

Double precision performance is nothing to write home about but it never is on consumer NVIDIA GPU cards. I stopped being upset about that years ago!

The other result I find interesting with respect to my personal history of devices is the 622 Gigaflops result for single precision from the Intel i7 CPU. This is getting on for twice as fast as the GPU on my 2015 Apple MacbookPro which managed 349 Gigaflops….a machine I still use today although primarily for Netflix.

My stepchildren are pretty good at mathematics for their age and have recently learned about Pythagora’s theorem

$c=\sqrt{a^2+b^2}$

The fact that they have learned about this so early in their mathematical lives is testament to its importance. Pythagoras is everywhere in computational science and it may well be the case that you’ll need to compute the hypotenuse to a triangle some day.

Fortunately for you, this important computation is implemented in every computational environment I can think of!

It’s almost always called hypot so it will be easy to find.

Here it is in action using Python’s numpy module

import numpy as np a = 3 b = 4 np.hypot(3,4) 5

When I’m out and about giving talks and tutorials about Research Software Engineering, High Performance Computing and so on, I often get the chance to mention the hypot function and it turns out that fewer people know about this routine than you might expect.

### Trivial Calculation? Do it Yourself!

Such a trivial calculation, so easy to code up yourself! Here’s a one-line implementation

def mike_hypot(a,b): return(np.sqrt(a*a+b*b))

In use it looks fine

mike_hypot(3,4) 5.0

### Overflow and Underflow

I could probably work for quite some time before I found that my implementation was flawed in several places. Here’s one

mike_hypot(1e154,1e154) inf

You would, of course, expect the result to be large but not infinity. Numpy doesn’t have this problem

np.hypot(1e154,1e154) 1.414213562373095e+154

My function also doesn’t do well when things are small.

a = mike_hypot(1e-200,1e-200) 0.0

but again, the more carefully implemented hypot function in numpy does fine.

np.hypot(1e-200,1e-200) 1.414213562373095e-200

### Standards Compliance

Next up — standards compliance. It turns out that there is a an official standard for how hypot implementations should behave in certain edge cases. The IEEE-754 standard for floating point arithmetic has something to say about how any implementation of hypot handles NaNs (Not a Number) and inf (Infinity).

It states that any implementation of hypot should behave as follows (Here’s a human readable summary https://www.agner.org/optimize/nan_propagation.pdf)

hypot(nan,inf) = hypot(inf,nan) = inf

numpy behaves well!

np.hypot(np.nan,np.inf) inf np.hypot(np.inf,np.nan) inf

My implementation does not

mike_hypot(np.inf,np.nan) nan

So in summary, my implementation is

- Wrong for very large numbers
- Wrong for very small numbers
- Not standards compliant

That’s a lot of mistakes for one line of code! Of course, we can do better with a small number of extra lines of code as John D Cook demonstrates in the blog post What’s so hard about finding a hypotenuse?

### Hypot implementations in production

Production versions of the hypot function, however, are much more complex than you might imagine. The source code for the implementation used in openlibm (used by Julia for example) was 132 lines long last time I checked. Here’s a screenshot of part of the implementation I saw for prosterity. At the time of writing the code is at https://github.com/JuliaMath/openlibm/blob/master/src/e_hypot.c

That’s what bullet-proof, bug checked, has been compiled on every platform you can imagine and survived code looks like.

There’s more!

### Active Research

When I learned how complex production versions of hypot could be, I shouted out about it on twitter and learned that the story of hypot was far from over!

**The implementation of the hypot function is still a matter of active research! **See the paper here https://arxiv.org/abs/1904.09481

### Is Your Research Software Correct?

Given that such a ‘simple’ computation is so complicated to implement well, consider your own code and ask Is Your Research Software Correct?.

In a recent blog post, Daniel Lemire explicitly demonstrated that vectorising random number generators using SIMD instructions could give useful speed-ups. This reminded me of the one of the first times I played with the Julia language where I learned that Julia’s random number generator used a SIMD-accelerated implementation of Mersenne Twister called dSFMT to generate random numbers much faster than MATLAB’s Mersenne Twister implementation.

Just recently, I learned that MATLAB now has its own SIMD accelerated version of Mersenne Twister which can be activated like so:

seed=1; rng(seed,'simdTwister')

This new Mersenne Twister implementation gives different random variates to the original implementation (which I demonstrated is the same as Numpy’s implementation in an older post) as you might expect

>> rng(1,'Twister') >> rand(1,5) ans = 0.4170 0.7203 0.0001 0.3023 0.1468 >> rng(1,'simdTwister') >> rand(1,5) ans = 0.1194 0.9124 0.5032 0.8713 0.5324

So it’s clearly a different algorithm and, on CPUs that support the relevant instructions, it’s about twice as fast! Using my very unscientific test code:

format compact number_of_randoms = 10000000 disp('Timing standard twister') rng(1,'Twister') tic;rand(1,number_of_randoms);toc disp('Timing SIMD twister') rng(1,'simdTwister') tic;rand(1,number_of_randoms);toc

gives the following results for a typical run on my Dell XPS 15 9560 which supports AVX instructions

number_of_randoms = 10000000 Timing standard twister Elapsed time is 0.101307 seconds. Timing SIMD twister Elapsed time is 0.057441 seconds

The MATLAB documentation does not tell us which algorithm their implementation is based on but it seems to be different from Julia’s. In Julia, if we set the seed to 1 as we did for MATLAB and ask for 5 random numbers, we get something different from MATLAB:

julia> using Random julia> Random.seed!(1); julia> rand(1,5) 1×5 Array{Float64,2}: 0.236033 0.346517 0.312707 0.00790928 0.48861

The performance of MATLAB’s new generator is on-par with Julia’s although I’ll repeat that these timing tests are far far from rigorous.

julia> Random.seed!(1); julia> @time rand(1,10000000); 0.052981 seconds (6 allocations: 76.294 MiB, 29.40% gc time)

**Update**

A discussion on twitter determined that this was an issue with Locales. The practical upshot is that we can make R act the same way as the others by doing

`Sys.setlocale("LC_COLLATE", "C")`

```
```

` which may or may not be what you should do! `

**Original post**

While working on a project that involves using multiple languages, I noticed some tests failing in one language and not the other. Further investigation revealed that this was essentially because R's default sort order for strings is different from everyone else's.

I have no idea how to say to R 'Use the sort order that everyone else is using'. Suggestions welcomed.

**R 3.3.2**

```
sort(c("#b","-b","-a","#a","a","b"))
[1] "-a" "-b" "#a" "#b" "a" "b"
```

**Python 3.6
**

```
sorted({"#b","-b","-a","#a","a","b"})
['#a', '#b', '-a', '-b', 'a', 'b']
```

```
```

**MATLAB 2018a**

```
sort([{'#b'},{'-b'},{'-a'},{'#a'},{'a'},{'b'}])
ans =
1×6 cell array
{'#a'} {'#b'} {'-a'} {'-b'} {'a'} {'b'}
```

```
```

**C++**

```
int main(){
std::string mystrs[] = {"#b","-b","-a","#a","a","b"};
std::vector<std::string> stringarray(mystrs,mystrs+6);
std::vector<std::string>::iterator it;
std::sort(stringarray.begin(),stringarray.end());
for(it=stringarray.begin(); it!=stringarray.end();++it) {
std::cout << *it << " ";
}
return 0;
}
```

Result:

```
#a #b -a -b a b
```

I’m working on some MATLAB code at the moment that I’ve managed to reduce down to a bunch of implicitly parallel functions. This is nice because the data that we’ll eventually throw at it will be represented as a lot of huge matrices. As such, I’m expecting that if we throw a lot of cores at it, we’ll get a lot of speed-up. Preliminary testing on local HPC nodes shows that I’m probably right.

During testing and profiling on a smaller data set I thought that it would be fun to run the code on the most powerful single node I can lay my hands on. In my case that’s an Azure F72s_v2 which I currently get for free thanks to a Microsoft Azure for Research grant I won.

These Azure F72s_v2 machines are NICE! Running Intel Xeon Platinum 8168 CPUs with 72 virtual cores and 144GB of RAM, they put my Macbook Pro to shame! Theoretically, they should be more powerful than any of the nodes I can access on my University HPC system.

So, you can imagine my surprise when **the production code ran almost 3 times slower than on my Macbook Pro!**

Here’s a Microbenchmark, extracted from the production code, running on MATLAB 2017b on a few machines to show the kind of slowdown I experienced on these super powerful virtual machines.

test_t = rand(8755,1);

test_c = rand(5799,1);

tic;test_res = bsxfun(@times,test_t,test_c');toc

tic;test_res = bsxfun(@times,test_t,test_c');toc

I ran the bsxfun twice and report the fastest since the first call to any function in MATLAB is often slower than subsequent ones for various reasons. This quick and dirty benchmark isn’t exactly rigorous but its good enough to show the issue.

- Azure F72s_v2 (72 vcpus, 144 GB memory) running Windows Server 2016:
**0.3 seconds** - Azure F32s_v2 (32 vcpus, 64 GB memory) running Windows Server 2016:
**0.29 seconds** - 2014 Macbook Pro running OS X:
**0.11 seconds** - Dell XPS 15 9560 laptop running Windows 10:
**0.11 seconds** - 8 cores on a node of Sheffield University’s Linux HPC cluster:
**0.03 seconds** - 16 cores on a node of Sheffield University’s Linux HPC cluster:
**0.015 seconds**

After a conversation on twitter, I ran it on Azure twice — once on a 72 vCPU instance and once on a 32 vCPU instance. This was to test if the issue was related to having 2 physical CPUs. The results were pretty much identical.

The results from the University HPC cluster are more in line with what I expected to see — faster than a laptop and good scaling with respect to number of cores. I tried running it on 32 cores but the benchmark is still in the queue ;)

**What’s going on?**

I have no idea! I’m stumped to be honest. Here are some thoughts that occur to me in no particular order

- Maybe it’s an issue with Windows Server 2016. Is there some environment variable I should have set or security option I could have changed? Maybe the Windows version of MATLAB doesn’t get on well with large core counts? I can only test up to 4 on my own hardware and that’s using Windows 10 rather than Windows server. I need to repeat the experiment using a Linux guest OS.
- Is it an issue related to the fact that there isn’t a 1:1 mapping between physical hardware and virtual cores? Intel Xeon Platinum 8168 CPUs have 24 cores giving 48 hyperthreads so two of them would give me 48 cores and 96 hyperthreads. They appear to the virtualised OS as 2 x 18 core CPUs with 72 hyperthreads in total. Does this matter in any way?

My new toy is a 2017 Dell XPS 15 9560 laptop on which I am running Windows 10. Once I got over (and fixed) the annoyance of all the advertising in Windows Home, I quickly starting loving this new device.

To get a handle on its performance, I used GPUBench in MATLAB 2016b and got the following results (This was the best of 4 runs…I note that MTimes performance for the CPU (Host PC), for example, varied between 130 and 150 Glops).

- CPU: Intel Core I7-7700HQ (6M Cache, up to 3.8Ghz)
- GPU: NVIDIA GTX 1050 with 4GB GDDR5

I last did this for my Retina MacBook Pro and am happy to see that the numbers are better across the board. The standout figure for me is the 1206 Gflops (That’s 1.2 Teraflops!) of single precision performance for Matrix-Matrix Multiply.

That figure of 1.2 Teraflops rang a bell for me and it took me a while to realise why…..

**My laptop vs Manchester University’s old HPC system – Horace**

Old timers like me (I’m almost 40) like to compare modern hardware with bygone supercomputers (1980s Crays vs mobile phones for example) and we know we are truly old when the numbers coming out of laptop benchmarks match the peak theoretical performance of institutional HPC systems we actually used as part of our career.

This has now finally happened to me! I was at the University of Manchester when it commissioned a HPC service called Horace and I was there when it was switched off in 2010 (only 6 and a bit years ago!). It was the University’s primary HPC service with a support team, helpdesk, sysadmins…the lot. The specs are still available on Manchester’s website:

- 24 nodes, each with 8 cores giving 192 cores in total.
- Each core had a theoretical peak compute performance of 6.4 double precision Gflop/s
- So a node had a theoretical peak performance of 51.2 Gflop/s
- The whole thing could theoretically manage 1.2 Teraflop/s
- It had four special ‘high memory’ nodes with 32Gb RAM each

Good luck getting that 1.2 Teraflops out of it in practice!

I get a big geek-kick out of the fact that my new laptop has the same amount of RAM as one of these ‘big memory’ nodes and that my laptop’s double precision CPU performance is on par with the combined power of 3 of Horace’s nodes. Furthermore, my laptop’s GPU can just about manage 1.2 Teraflop/s of **single precision ** performance in MATLAB — on par with the total combined power of the HPC system*.

* (I know, I know….Horace’s numbers are for **double precision **and my GPU numbers are **single precision** — apples to oranges — but it still astonishes me that the headline numbers are the same — 1.2 Teraflops).

I stumbled across a great list of resources about the R programming language recently – a list called awesome-R. The list said it was inspired by awesome-machine-learning which, in turn, was inspired by awesome-PHP. It turns out that there is a whole network of these lists.

I noticed that there wasn’t a list for MATLAB so started the awesome-MATLAB list. Pull Requests are welcome.

**Update: 2nd July 2015 **The code in github has moved on a little since this post was written so I changed the link in the text below to the exact commit that produced the results discussed here.

Imagine that you are a very new MATLAB programmer and you have to create an N x N matrix called A where A(i,j) = i+j

Your first attempt at a solution might look like this

N=2000 % Generate a N-by-N matrix where A(i,j) = i + j; for ii = 1:N for jj = 1:N A(ii,jj) = ii + jj; end end

On my current machine (Macbook Pro bought in early 2015), the above loop takes **2.03 seconds.** You might think that this is a long time for something so simple and complain that MATLAB is slow. The person you complain to points out that you should preallocate your array before assigning to it.

N=2000 % Generate a N-by-N matrix where A(i,j) = i + j; A=zeros(N,N); for ii = 1:N for jj = 1:N A(ii,jj) = ii + jj; end end

This now takes **0.049 seconds** on my machine – a speed up of over 41 times! MATLAB suddenly doesn’t seem so slow after all.

Word gets around about your problem, however, and seasoned MATLAB-ers see that nested loop, make a funny face twitch and start muttering ‘vectorise, vectorise, vectorise’. Emails soon pile in with vectorised solutions

% Method 1: MESHGRID. [X, Y] = meshgrid(1:N, 1:N); A = X + Y;

This takes **0.025 seconds** on my machine — a healthy speed-up on the loop-with-preallocation solution. You have to understand the meshgrid command, however, in order to understand what’s going on here. It’s still clear (to me at least) what its doing but not as clear as the nice,obvious double loop. Call me old fashioned but I like loops…I understand them.

% Method 2: Matrix multiplication. A = (1:N).' * ones(1, N) + ones(N, 1) * (1:N);

This one is MUCH harder to read but you don’t worry about it too much because at **0.032 seconds** it’s slower than meshgrid.

% Method 3: REPMAT. A = repmat(1:N, N, 1) + repmat((1:N).', 1, N);

This one appears to be interesting! **At 0.009 seconds**, it’s the fastest so far – by a healthy amount!

% Method 4: CUMSUM. A = cumsum(ones(N)) + cumsum(ones(N), 2);

Coming in at **0.052 seconds**, this cumsum solution is slower than the preallocated loop.

% Method 5: BSXFUN. A = bsxfun(@plus, 1:N, (1:N).');

Ahhh, bsxfun or ‘The Widow-maker function’ as I sometimes refer to it. Responsible for some of the fastest and most unreadable vectorised MATLAB code I’ve ever written. In this case, it brings execution time down to **0.0045 seconds**.

Whenever I see something that can be vectorised with a repmat, I try to figure out if I can rewrite it as a bsxfun. The result is usually horrible to look at so I tend to keep the original loop commented out above it as an explanation! This particular example isn’t too bad but bsxfun can quickly get hairy.

**Conclusion**

Loops in MATLAB aren’t anywhere near as bad as they used to be thanks to advances in JIT compilation but it can often pay, speed-wise, to switch to vectorisation. The price you often pay for this speed-up is that vectorised code can become very difficult to read.

If you’d like the code I ran to get the timings above, it’s on github (link refers to the exact commit used for this post) . Here’s the output from the run I referred to in this post.

Original loop time is 2.025441 Preallocate and loop is 0.048643 Meshgrid time is 0.025277 Matmul time is 0.032069 Repmat time is 0.009030 Cumsum time is 0.051966 bsxfun time is 0.004527

- MATLAB Version: 2015a
- Early 2015 Macbook Pro with 16Gb RAM
- CPU: 2.8Ghz quad core i7 Haswell

This post is based on a demonstration given by Mathwork’s Ken Deeley during a recent session at The University of Sheffield.

I recently got a 15 inch Retina Macbook Pro which contains an NVIDIA GT 750M GPU. It’s been a while since I last got a laptop with a decent GPU in it so I wondered how it would perform in MATLAB using the Parallel Computing Toolbox.

Of course I didn’t read any documentation; I simply fired up MATLAB 2015a and issued the **gpuDevice** command.

>> gpuDevice Error using gpuDevice (line 26) There is a problem with the CUDA driver or with this GPU device. Be sure that you have a supported GPU and that the latest driver is installed. Caused by: The CUDA driver could not be loaded. The library name used was '/usr/local/cuda/lib/libcuda.dylib'. The error was: dlopen(/usr/local/cuda/lib/libcuda.dylib, 10): image not found

This is because I didn’t install a load of CUDA-related stuff! Following these instructions did the trick!

>> gpuDevice() ans = CUDADevice with properties: Name: 'GeForce GT 750M' Index: 1 ComputeCapability: '3.0' SupportsDouble: 1 DriverVersion: 6.5000 ToolkitVersion: 6.5000 MaxThreadsPerBlock: 1024 MaxShmemPerBlock: 49152 MaxThreadBlockSize: [1024 1024 64] MaxGridSize: [2.1475e+09 65535 65535] SIMDWidth: 32 TotalMemory: 2.1470e+09 AvailableMemory: 444055552 MultiprocessorCount: 2 ClockRateKHz: 925500 ComputeMode: 'Default' GPUOverlapsTransfers: 1 KernelExecutionTimeout: 1 CanMapHostMemory: 1 DeviceSupported: 1 DeviceSelected: 1

I headed over to the MATLAB File Exchange to get the GPU Bench App for MATLAB and fired it up. The summary of the results is below. Click on the image to see the detailed results.

The double precision performance of this GPU card is very poor – MUCH slower than the CPU on the Macbook Pro.

Looking on the bright side, the numbers for the CPU are pretty good for a laptop!

I’ve been working at The University of Manchester for almost a decade and will be leaving at the end of this week! A huge part of my job was to support a major subset of Manchester’s site licensed application software portfolio so naturally I’ve made use of a lot of it over the years. As of February 20th, I will no longer be entitled to use any of it!

This article is the second in a series where I’ll look at some of the software that’s become important to me and what my options are on leaving Manchester. Here, I consider MATLAB – a technical computing environment that has come to dominate my career at Manchester. For the last 10 years, I’ve used MATLAB at least every week, if not most days.

I had a standalone license for MATLAB and several toolboxes – Simulink, Image Processing, Parallel Computing, Statistics and Optimization. Now, I’ve got nothing! Unfortunately for me, I’ve also got hundreds of scripts, mex files and a few Simulink models that I can no longer run! These are my options:

**Go somewhere else that has a MATLAB site license**

- I’ll soon be joining the University of Sheffield who have a MATLAB site license. A great option if you can do it.

**Use something else**

- Octave – Octave is a pretty good free and open source clone of MATLAB and quite a few of my programs would work without modification. Others would require some rewriting and, in some cases, that rewriting could be extensive! There is no Simulink support.
- Scilab – It’s free and it’s MATLAB-like-ish but I’d have to rewrite my code most of the time. I could also port some of my Simulink models to Scilab as was done in this link.
- Rewrite all my code to use something completely different. What I’d choose would depend on what I’m trying to achieve but options include Python, Julia and R among others.

**Compile!**

- If all I needed was the ability to run a few MATLAB applications I’d written, I could compile them using the MATLAB Compiler and keep the result. The whole point of the MATLAB Compiler is to distribute MATLAB applications to those who don’t have a MATLAB license. Of course once I’ve lost access to MATLAB itself, debugging and adding features will be um……tricky!

**Get a hobbyist license for MATLAB**

- MATLAB Home – This is the full version of MATLAB for hobbyists. Writing a non-profit blog such as WalkingRandomly counts as a suitable ‘hobby’ activity so I could buy this license.
**MATLAB itself for 85 pounds**with most of the toolboxes coming in at**an extra 25 pounds each**. Not bad at all! The extra cost of the toolboxes would still lead me to obsess over how to do things without toolboxes but, to be honest, I think that’s an obsession I’d miss if it weren’t there! Buying all of the same toolboxes as I had before would end up costing me a total of £210+VAT. - Find a MOOC that comes with free MATLAB – Mathworks make MATLAB available for free for students of some online courses such as the one linked to here. Bear in mind, however, that the license only lasts for the duration of the course.

**Academic Use**

If I were to stay in academia but go to an institution with no MATLAB license, I could buy myself an academic standalone license for MATLAB and the various toolboxes I’m interested in. The price lists are available at http://uk.mathworks.com/pricing-licensing/

For reference, current UK academic prices are

- MATLAB £375 + VAT
- Simulink £375 + VAT
- Standard Toolboxes (statistics, optimisation, image processing etc) £150 +VAT each
- Premium Toolboxes (MATLAB Compiler, MATLAB Coder etc) – Pricing currently not available

My personal mix of MATLAB, Simulink and 4 toolboxes would set me back £1350 + VAT.

**Commercial Use**

If I were to use MATLAB professionally and outside of academia, I’d need to get a commercial license. Prices are available from the link above which, at the time of writing, are

- MATLAB £1600 +VAT
- Simulink £2400 + VAT
- Standard Toolboxes £800 +VAT each
- Premium Toolboxes – Pricing currently not available

My personal mix of MATLAB, Simulink and 4 toolboxes would set me back £7200 + VAT.

**Contact MathWorks**

If anyone does find themselves in a situation where they have MATLAB code and no means to run it, then they can always try contacting MathWorks and ask for help in finding a solution.