## Strange MATLAB performance issue on Microsoft Azure F72s_v2 instances

March 1st, 2018 | Categories: Cloud Computing, HPC, Making MATLAB faster, matlab

I’m working on some MATLAB code at the moment that I’ve managed to reduce down to a bunch of implicitly parallel functions. This is nice because the data that we’ll eventually throw at it will be represented as a lot of huge matrices.  As such, I’m expecting that if we throw a lot of cores at it, we’ll get a lot of speed-up.  Preliminary testing on local HPC nodes shows that I’m probably right.

During testing and profiling on a smaller data set I thought that it would be fun to run the code on the most powerful single node I can lay my hands on.  In my case that’s an Azure F72s_v2 which I currently get for free thanks to a Microsoft Azure for Research grant I won.

These Azure F72s_v2 machines are NICE! Running Intel Xeon Platinum 8168 CPUs with 72 virtual cores and 144 GB of RAM, they put my MacBook Pro to shame! Theoretically, they should be more powerful than any of the nodes I can access on my University HPC system.

So, you can imagine my surprise when the production code ran almost 3 times slower than on my MacBook Pro!

Here’s a microbenchmark, extracted from the production code, run on MATLAB R2017b on a few machines to show the kind of slowdown I experienced on these super powerful virtual machines.

    test_t = rand(8755,1);
    test_c = rand(5799,1);
    tic; test_res = bsxfun(@times,test_t,test_c'); toc
    tic; test_res = bsxfun(@times,test_t,test_c'); toc

I ran the bsxfun twice and report the fastest since the first call to any function in MATLAB is often slower than subsequent ones for various reasons. This quick and dirty benchmark isn’t exactly rigorous but it’s good enough to show the issue.

• Azure F72s_v2 (72 vcpus, 144 GB memory) running Windows Server 2016: 0.3 seconds
• Azure F32s_v2 (32 vcpus, 64 GB memory) running Windows Server 2016: 0.29 seconds
• 2014 MacBook Pro running OS X: 0.11 seconds
• Dell XPS 15 9560 laptop running Windows 10: 0.11 seconds
• 8 cores on a node of Sheffield University’s Linux HPC cluster: 0.03 seconds
• 16 cores on a node of Sheffield University’s Linux HPC cluster: 0.015 seconds

After a conversation on Twitter, I ran it on Azure twice: once on a 72 vCPU instance and once on a 32 vCPU instance. This was to test whether the issue was related to having 2 physical CPUs. The results were pretty much identical.

The results from the University HPC cluster are more in line with what I expected to see: faster than a laptop, with good scaling with respect to the number of cores.  I tried running it on 32 cores but the benchmark is still in the queue ;)

What’s going on?

I have no idea! I’m stumped, to be honest.  Here are some thoughts that occur to me, in no particular order:

• Maybe it’s an issue with Windows Server 2016. Is there some environment variable I should have set or security option I could have changed? Maybe the Windows version of MATLAB doesn’t get on well with large core counts? I can only test up to 4 cores on my own hardware and that’s using Windows 10 rather than Windows Server.  I need to repeat the experiment using a Linux guest OS.
• Is it an issue related to the fact that there isn’t a 1:1 mapping between physical hardware and virtual cores? Intel Xeon Platinum 8168 CPUs have 24 cores giving 48 hyperthreads so two of them would give me 48 cores and 96 hyperthreads.  They appear to the virtualised OS as 2 x 18 core CPUs with 72 hyperthreads in total.   Does this matter in any way?
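
One way to probe the core-count question is to cap MATLAB's computational threads with `maxNumCompThreads` and re-time the same `bsxfun` call at each setting. This is only a rough sketch, not a diagnosis: the thread counts are arbitrary picks, and it assumes that `bsxfun` actually honours the cap on the machine being tested.

```matlab
% Sketch: time the microbenchmark under different thread caps.
test_t = rand(8755,1);
test_c = rand(5799,1);

original = maxNumCompThreads;                 % remember the current setting
thread_counts = [1 2 4 8 16 32 72];           % illustrative values only
thread_counts = thread_counts(thread_counts <= original);  % skip caps above the default
times = zeros(size(thread_counts));

for k = 1:numel(thread_counts)
    maxNumCompThreads(thread_counts(k));          % apply the cap
    test_res = bsxfun(@times, test_t, test_c');   % warm-up call, not timed
    tic;
    test_res = bsxfun(@times, test_t, test_c');
    times(k) = toc;
    fprintf('%2d threads: %.3f s\n', thread_counts(k), times(k));
end

maxNumCompThreads(original);                  % restore the original setting
```

If the timings stop improving, or get worse, beyond some thread count on the Azure VM but not on the HPC node, that would point towards threading or topology rather than raw CPU speed.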

1. Are you sure it is a MATLAB-only issue? Does it also occur when running, for instance, Prime95 (threaded)?

2. I have a remark regarding: “I ran the bsxfun twice and report the fastest since the first call to any function in MATLAB is often slower than subsequent ones for various reasons.” Actually, that is not always true, because MATLAB can show different performance patterns during the warmup state (the initial executions of the code). This can be seen in this graph: https://photos.app.goo.gl/fBv1NgeStKGhmnVE8.

The results show (from left to right): a long warmup, a fast warmup and a slowdown. Interestingly, the first result reflects an upgrade of the MATLAB JIT compiler, which now seems to perform some kind of tracing/profiling followed by compilation (two stages) instead of pre-compiling the code during the first run only.

The three benchmarks come from the Ostrich2 suite by the Sable Team (https://github.com/Sable/Ostrich2). Each benchmark was repeated 300 times in a loop in order to trigger JIT compilation. For the evaluation, we followed the methodology of Barrett et al., but without using KRUN (https://arxiv.org/abs/1602.00602; an excellent paper and study, by the way).

In conclusion: these warmup patterns change with MATLAB versions and with the programs under analysis. They need to be taken into account if one wants to report execution times from the steady state only (subsequent executions with minimal variance, I guess).

As a side note: the longest warmup pattern I have encountered so far took 1500 iterations (on R2018b).
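
To see which warmup pattern, if any, applies to the benchmark from the post, one option is to record the time of every iteration rather than just two. The following is a minimal sketch, not the Ostrich2 or Barrett et al. methodology. The iteration count `n_iter` is an arbitrary choice, far below the 300 repetitions mentioned above.

```matlab
% Sketch: record per-iteration times to look for warmup patterns.
test_t = rand(8755,1);
test_c = rand(5799,1);

n_iter = 50;                        % arbitrary; increase to look for longer warmups
iter_times = zeros(n_iter,1);

for k = 1:n_iter
    t0 = tic;                                    % per-iteration timer
    test_res = bsxfun(@times, test_t, test_c');
    iter_times(k) = toc(t0);
end

fprintf('first: %.3f s, median: %.3f s, min: %.3f s\n', ...
        iter_times(1), median(iter_times), min(iter_times));

% Plot the series: a flat line suggests steady state, a step or drift suggests warmup.
plot(iter_times, '.-');
xlabel('iteration');
ylabel('time (s)');
```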

3. You’re probably already aware of this, but you should consider using the timeit function (https://www.mathworks.com/help/matlab/ref/timeit.html). It takes care of measuring performance in the presence of warmup.
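
As a sketch of how the benchmark from the post might look with `timeit`: the function takes a handle to a zero-argument function, calls it several times, and returns a typical run time in seconds. The comparison with implicit expansion (`.*`, available from R2016b) is an addition for illustration and was not part of the original benchmark.

```matlab
% Sketch: the post's microbenchmark measured with timeit.
test_t = rand(8755,1);
test_c = rand(5799,1);

% timeit needs a function handle that takes no arguments.
f_bsxfun = @() bsxfun(@times, test_t, test_c');
t_bsxfun = timeit(f_bsxfun);
fprintf('bsxfun:             %.3f s\n', t_bsxfun);

% Same outer product via implicit expansion (R2016b and later), for comparison.
f_implicit = @() test_t .* test_c';
t_implicit = timeit(f_implicit);
fprintf('implicit expansion: %.3f s\n', t_implicit);
```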