Software problems in research #1: Excel turns genes into dates!

August 24th, 2016 | Categories: programming, RSE | Tags:

I sometimes give a talk called Is Your Research Software correct (github repo, slide deck) where I attempt to give a (hopefully) entertaining overview of some of the basic issues in modern research software practice and what can be done to make the world a little better.

One section of this talk is a look at some case studies where software errors caused problems in research. Ideally, I try to concentrate on simple errors that led to profound scientific screw-ups. I want the audience to think ‘Damn! *I* could have made that mistake in my code‘.

Curating this talk has turned me into an interested collector of such stories. This is not an exercise in naming and shaming (after all, the odds are that its only a matter of time before I, or one of my collaborators, makes it into the list — why set myself up for a beating?). Instead, it is an exercise in observing the problems that other people have had and using them to enhance our own working practices.

Thus begins a new recurring WalkingRandomly feature.

Excel corrupts genetics data

Today’s entry comes courtesy of a recent paper by Mark Ziemann, Yotam Eren and Assam El-OstaEmail – ‘Gene name errors are widespread in the scientific literature‘ where they demonstrate that the supplementary data files for hundreds of papers in genetics have been corrupted by Microsoft Excel which has helpfully turned gene symbols into dates and floating point numbers.

The paper gives advice to reviewers on how to spot this particular error and the authors have also published the code used for the analysis. I’ve not run it myself so can only attest to its existence, not it’s accuracy.

I’ve not dealt with genetic data directly myself so ask you — what would you have used instead of Excel? (my gut tells me R or Python but I have no details to offer).

Do you have a story to contribute?

If you are interested in contributing a story where a software glitch caused problems in research, please contact me to discuss details.

Update (31st August 2016)

One of the authors of the paper, Mark Ziemann, has written a follow up of the Excel work on his blog: http://genomespot.blogspot.co.uk/2016/08/my-personal-thoughts-on-gene-name-errors.html

  1. Laurence Billingham
    August 24th, 2016 at 15:45
    Reply | Quote | #1

    Hello Mike,

    Thank you for this interesting blog post. We met IRL at the SSI CW16 way back when.

    Assuming the $7 Bn rocket failure is the Ariane 5/Cluster I launch, it also
    torpedoed the careers of quite a few UK researchers whose funding depended on the prospective-data.

    If we are allowed more engineering focussed bug reports then there is also the Mariner I launch failure (https://en.wikipedia.org/wiki/Mariner_1) and the really sad Therac 25 radiation overdosing bugs (https://en.wikipedia.org/wiki/Therac-25).

    The link to your github repo doesn’t seem to be a link (assume it is to https://github.com/mikecroucher/MLPM_talk).

    Would it be OK for me to reuse the talk or bits of it? The LICENSE file seems to be for Reveal.js rather than the talk content.

  2. Ian
    August 24th, 2016 at 15:57
    Reply | Quote | #2

    R and python are tools for manipulating data. Excel is often treated as tool for storing, displaying and sharing data. A good, easy to use CSV table editor might help, but you try explaining to someone barely understands the difference between a package and a file format that they should use this different table editor when they’ve been using Excel for years.

  3. Mike Croucher
    August 24th, 2016 at 16:19
    Reply | Quote | #3

    @Laurence – I remember you :) Feel free to reuse the talk (I fixed the github link — thanks for the heads-up). All I hope for is attribution. I’m also happy to travel to give the talk as long as travel expenses can be covered. I’ve added more details of this to the Readme.md of the repo.

    Thanks for the extra examples too.

  4. Andrew Landels
    August 25th, 2016 at 06:19
    Reply | Quote | #4

    I was performing a cluster analysis in R back when I first started using it, and a series of what I thought were numeric values had been converted to strings when I imported them, except unbeknownst to me they had been converted into factors in the dataframe on import. When I converted them with as.numeric my code ran fine, but clustered by the factor integers rather than the values, so the output was wrong. As a habit I always set strings as factors to False on import now, but it’s one to watch for!

  5. Laurence Billingham
    August 25th, 2016 at 08:03
    Reply | Quote | #5

    Thanks Mike. Not sure we have a budget for intercity travel but could probably bring you across town and buy you dinner if you were in Edinburgh. I’ll be sure to give attribution when I use any content… especially the ‘Croucher’s Law’ part. That should definitely be a thing.

  6. Mike Croucher
    August 25th, 2016 at 08:32
    Reply | Quote | #6

    @Andrew – Yes, I find that behaviour of R a PITA to be honest. It makes teaching introductory R a little more difficult for a start.

    @Laurence – I’m in Edinburgh from time to time so will give you a heads-up next time it happens. It turns out that Croucher’s Law isn’t the original…See this for details http://www.walkingrandomly.com/?p=5970