In defense of inefficient scientific code

June 10th, 2011 | Categories: programming

One part of my job that I really enjoy is the optimisation of researchers’ code.  Typically, the code comes to me in a language such as MATLAB or Mathematica and may take anywhere from a couple of hours to several weeks to run.  I’ve had some nice successes recently in areas as diverse as finance, computer science, applied math and chemical engineering, among others.  The size of the speed-up can vary from 10% right up to 5000% (yes, 50 times faster!) and that’s before I break out the big guns such as Manchester’s Condor pool or turn the code over to our HPC specialists for some SERIOUS (though much more costly in developer time) optimisations.

Reporting these speed-ups to colleagues (along with the techniques I used) draws responses such as ‘Well, they shouldn’t do time-consuming computing in high-level languages.  They should rewrite the whole thing in Fortran’, or words to that effect.  I disagree!

In my opinion, high-level programming languages such as Mathematica, MATLAB and Python have democratised scientific programming.  Now, almost anyone who can think logically can turn their scientific ideas into working code.  I’ve seen people with no formal programming training at all whip up models, get results and move on with their research.  Let’s be clear here: it’s results that matter, not how you coded them.

It comes down to this.  CPU time is cheap.  Very cheap.  Human time, particularly specialised human time, is expensive.

Here’s an example:  Earlier this year I was working with a biologist who had put together some MATLAB code to analyse her data.  She had written the code in less than a day and it gave the correct results but it ran too slowly for her tastes.  Her sole programming experience came from reading the MATLAB manual and yet she could cook up useful code in next to no time.  Sure, it was slow and (to my eyes) badly written but give the gal a break…she’s a professional biologist and not a professional programmer.  Her programming is a lot better than my biology!

In less than two hours I gave her a crash course in MATLAB code optimisation: how to use the profiler, vectorisation and so on.  We identified the hotspot in the code and, between us, recoded it so that it was an order of magnitude faster.  This was more than fast enough for her needs; she could now analyse data significantly faster than she could collect it.  I realised that I could make it even faster by using parallelised MEX functions, but that would probably take a few more hours’ work.  She declined my offer…the code was fast enough.
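To give a flavour of the kind of change involved, here is a toy sketch (not her actual code, and my_analysis_script is just a placeholder name): swap an element-by-element loop for a single vectorised expression, and let the profiler tell you which lines really dominate the run time.

    n = 1e6;
    x = rand(n, 1);

    % loop version: quick to write, slow to run
    y = zeros(n, 1);
    for k = 1:n
        y(k) = exp(-x(k)) * sin(2*pi*x(k));
    end

    % vectorised version: the same result in one line, usually much faster
    y2 = exp(-x) .* sin(2*pi*x);

    % the profiler shows where the time really goes
    % profile on; my_analysis_script; profile viewer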

In my opinion, this is an optimal use of resources.  I spend my days obsessing about mathematical software and she spends her days obsessing about experimental biology.  She doesn’t need a formal course in how to write uber-efficient code because her code runs as fast as she needs it to (with a little help from her friends).  The solution we eventually reached might not be the most CPU-efficient one, but it is a good trade-off between CPU efficiency and developer efficiency.

It was easy…trivial, even…for someone like me to take her inefficient code and turn it into something that was efficient enough.  However, the whole endeavour relied on her producing working code in the first place.  If high-level languages such as MATLAB didn’t exist, her only options would have been to hire a professional programmer (expensive in cash) or to spend a load of time learning how to code in a low-level language such as Fortran or C (expensive in time).

Also, because she is a beginner programmer, her C or Fortran code would almost certainly be crappy too, and one thing I am sure of is this: crappy MATLAB/Python/Mathematica/R code is a heck of a lot easier to debug and optimise than crappy C code.  Segfault, anyone?

  1. MySchizoBuddy
    June 10th, 2011 at 14:01

    Do you ever get LabVIEW code to optimize from the professors?

  2. MySchizoBuddy
    June 10th, 2011 at 14:08

    I believe Maple 15 allows you to take Maple code and convert it into C or Fortran. There should be something similar in MATLAB as well.

  3. June 10th, 2011 at 14:52

    LabVIEW? Not so far. I’d have quite a steep learning curve to overcome if I ever did.

    Conversion to C or Fortran in MATLAB? There is a new product called MATLAB coder that I’ve not tried yet. http://www.mathworks.com/products/matlab-coder/

  4. MySchizoBuddy
    June 10th, 2011 at 21:58

    I believe you have posts about optimization where you take some MATLAB code and optimize it, correct? You should create a category for it.

    Btw, it would make a good senior design project to use MATLAB HDL tools to actually program an FPGA. This FPGA would be specially designed to solve your particular problem. So for your colleague’s biology problem you could have an FPGA that does just that problem. Which would be faster: a generic x86 or a highly specific FPGA? This would make sense for researchers.

  5. June 13th, 2011 at 02:58

    I think I understand what you wrote about. A good example of collaborating with a colleague without worrying about being “used” or misunderstood. Having a strong sense of your own identity is a great blessing, and it helps you avoid wondering whether helping someone is good or bad. Being focused upon self is a terrible weakness, but it influences so much of our lives.
    Good for you. The character exhibited by the example is impressive.
    But you know I see you through very rosy glasses……

    Dad

  6. June 14th, 2011 at 04:53

    Hi,
    Nice post. Having worked as a part-time experimental biologist, and now trying to run my own lab with people who do experiments and write their own (quick & dirty) MATLAB code to analyse them, I totally agree!
    Indeed, we have some guys sitting in the lab coding only C++, and others who do this half-way.
    It works very well.
    Cheers
    Chaitanya

  7. Vicky van der Linden
    June 16th, 2011 at 14:52

    Fascinating reading material…given that I’m dealing with numbers & balances all day!

  8. J. F. Sebastian
    June 18th, 2011 at 22:23

    You spent two hours playing with the profiler and teaching vectorization in MATLAB? What about coding her formulas directly (FORmula TRANslator) and using the -parallel switch on your multicore computer? It should have taken half of your time, with comparable (or even better) results. If she needed graphics then your choice was better, but… there are many free graphics libraries for Fortran around.

    Cheers,

    J. F. Sebastian

  9. June 18th, 2011 at 23:53

    There was quite a lot of code using many MATLAB functions and graphics. It would have taken a lot longer than two hours to do a rewrite, and it would have needed several libraries, including graphics. A rewrite would probably have introduced bugs. She later added a GUI, which is rather easier in MATLAB than in Fortran.

    I prefer to code in a high-level language…Python, MATLAB, Mathematica etc…and just recode the computationally expensive part in Fortran or C.
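    Roughly, the pattern looks like this (file names made up for illustration): the orchestration, I/O and graphics stay in MATLAB, and only the profiled hotspot gets a compiled replacement.

        % compile the hot kernel once (hotspot.c being a C file with a
        % standard mexFunction gateway)
        mex hotspot.c

        % everything else stays as ordinary MATLAB
        data   = load('experiment.mat');   % data handling in MATLAB
        result = hotspot(data.samples);    % compiled MEX routine, called like any other function
        plot(result)                       % graphics in MATLAB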

    When a user who has spent time writing and debugging a program in their favourite language comes for advice, the suggestion ‘let’s start again, but this time in Fortran’ rarely goes down well.

  10. J. F. Sebastian
    June 19th, 2011 at 11:11

    You are right, of course. The coding had already been done in this case. I wanted to point out that if somebody is focused on the problem and wants to get results quickly, without investing time in programming hacks in any language, then the shortest path might be different. Especially if there is no access to expensive products like MATLAB, with their licensing etc.

  11. Ben
    June 19th, 2011 at 17:31

    I completely agree with your comments in this post. The thing is, MATLAB/Mathematica/Python already include reasonably performant numerical libraries, so often one only needs to know the proper way to take advantage of these in the language in question in order to gain nearly all the performance that one could obtain by a rewrite in a lower-level language such as Fortran. In fact, since we as scientists (even computer scientists or mathematicians) are not necessarily experts in numerical analysis, using the algorithms from the high-level languages can be preferable, both in terms of asymptotic complexity and in the (possibly huge) constant factors. For example, who wants to try writing a fast and accurate numerical quadrature, FFT, or SVD routine in Fortran or C? Not me, that’s for sure; these are decidedly nontrivial tasks, best left to the true experts.
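    For instance (just as an illustration), in MATLAB each of those is a single call into code written and tuned by people who do numerical analysis for a living:

        q = quadgk(@(x) exp(-x.^2), 0, 1);   % adaptive numerical quadrature
        X = fft(randn(1024, 1));             % fast Fourier transform
        [U, S, V] = svd(rand(100, 50));      % singular value decomposition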

  12. June 30th, 2011 at 13:58

    MATLAB has indeed made it easy for scientists to get results without having to be computer scientists. I’m part of the team making Jacket, which lets you write regular MATLAB and offload parts of it to the GPU: http://accelereyes.com. Well-vectorized code definitely goes faster on the CPU, but often even more so on the GPU, which has dozens more vector units. We also put together a C++ version that borrows a lot from the NumPy and MATLAB APIs, and have even started Fortran and Python wrappers: http://accelereyes.com/wiki/libjacket

  13. Ian Cottam
    June 30th, 2011 at 19:31

    Excellent article Mike, deserving of even wider dissemination.

    MATLAB (and similar, as you mention) is doing for scientists what Fortran and Algol 60 did for us computing scientists.

    Telling scientists to use Fortran today would be like telling computing scientists and software engineers to use assembly-level languages.

  14. June 30th, 2011 at 22:29

    Thanks Ian. Don’t get me wrong: low-level languages such as Fortran and C have their place. When you need to make every FLOP count, they are where you turn. I use C all the time to help me speed up MATLAB code, for instance, and one of my favourite library vendors, NAG, uses Fortran exclusively to keep their code super speedy (which I then use as black-box routines). When you use Python, you turn to C (via Cython etc.) to cool down the hotspots…and so on.

    For many, MANY problems, however, you just don’t need such heavyweight tools. MATLAB, Python, Mathematica…they are plenty fast enough much of the time.

    As for when they are not…well, a scientist without a traditional programming background has a choice: spend potentially ages learning how to make their code go faster (and hence spend less time in the lab or writing those all-important grant proposals), or just hand it over to their friendly neighbourhood research support team, who do this kind of stuff all the time…for fun!

    Cheers,
    Mike

  15. July 1st, 2011 at 15:01

    Good article, Mike — we’ve been trying to persuade people of many of the same points in the Software Carpentry project (http://software-carpentry.org) for years. Unfortunately, even when we do the math (http://software-carpentry.org/2011/06/doing-the-math/), too many of the people in charge of funding still confuse “scientific computing” with “high-performance computing”, which leaves the 90+% of scientists who _don’t_ need teraflops out in the cold.

  16. July 1st, 2011 at 17:27

    Don Knuth: “Premature optimization is the root of all evil”.
    Michael A. Jackson: “The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization (for experts only!): Don’t do it yet.”