## Call for information: Why do good software people leave academia?

July 27th, 2016

This post is also published over at the Software Sustainability Institute.

William Stein, lead developer of the computer algebra system, Sage, and its cloud-based spin-off, SageMathCloud, recently announced that he was quitting academia to go and form a company. In his talk, William says ‘I can’t figure out how to create Sage in academia. The money isn’t there. The mathematical community doesn’t care enough. The only option left is for me to build a company.’

His talk is below and slides are at http://wstein.org/talks/2016-06-sage-bp/bp.pdf

“Every great open source math library is built on the ashes of someone’s academic career.”

William’s departure is not unique. Here’s a tweet from Wes McKinney, creator of pandas, one of the essential data science tools for Python.

We are looking for similar stories: good research software people who felt that they had to leave academia because there wasn’t enough support, recognition or funding. Equally, we want to hear from you if you think academia is a rewarding environment for software development. Either way, please contact us at rse-study@software.ac.uk

## High Performance Computing: Think about where you do data operations

July 12th, 2016

The High Performance Computing system at the University of Sheffield has several different filesystems available to it. We have:

• /fastdata – A Lustre-based, shared filesystem with hundreds of terabytes of space. No backup. No quota.
• /data – An NFS filesystem where each user has access to 100 GB of storage. Backups go back 7 days.
• /home – An NFS filesystem where each user has 10 GB. Backed up over 28 days. Mirrored.
• /scratch – Local disk on each worker node. No backup. Uses ext4.

Lots of options with differing amounts of space, back-up policy and, as I’m about to demonstrate, performance characteristics. I suspect that many other HPC systems have a similar set up.

On our system, it’s very tempting to do everything in /fastdata. There’s lots of space, no quota, and it is readable from all worker nodes simultaneously. Good times! I try to encourage people to think about what they are doing, however. Bad things can happen if the Lustre filesystem is hammered too much. Also, there can be a huge difference in performance for some operations across different filesystems.

Let’s take an example. I want to download and untar gcc 4.9.2. How long does that take on three of these filesystems?

On the scratch directory of a worker node

cd /scratch
mkdir testing123
cd testing123
wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
time tar xfz ./gcc-4.9.2.tar.gz

real    0m6.237s
user    0m5.302s
sys     0m3.033s


On the Lustre filesystem

cd /fastdata/
mkdir testing123
cd testing123
wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
time tar xfz ./gcc-4.9.2.tar.gz

real    7m18.170s
user    0m6.751s
sys     0m56.802s



On the NFS filesystem

cd /data/myusername
mkdir testing123
cd testing123
wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
time tar xfz ./gcc-4.9.2.tar.gz

real	16m37.343s
user	0m6.052s
sys	0m23.438s


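For this particular operation, there is a difference of roughly two orders of magnitude between the worst and the best option: about 6 seconds on local scratch versus over 16 minutes on NFS, a factor of roughly 160.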

I’m not an expert in filesystems and I have no idea what’s causing these differences or if I’d see a similar speed difference given a different file operation. I currently have no interest in doing a robust set of benchmarks. The point I’m making is that if you are using a system that has multiple filesystems it may be worth checking if there’s an advantage to using one over the other for your particular use case.
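If you want to run that kind of check yourself, the following is a minimal sketch rather than a robust benchmark. It extracts the same tarball into a temporary subdirectory of each directory you name and prints the elapsed wall-clock time in whole seconds. The script name `untar_bench.sh` and the function name `time_untar` are made up for illustration, and a single run like this says nothing about caching or about how busy the shared filesystems were at the time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: time the extraction of one tarball on several filesystems.
# Usage: ./untar_bench.sh /absolute/path/to/TARBALL DIR [DIR ...]

time_untar() {
    local tarball=$1
    local target=$2
    local workdir start end

    # Work in a fresh subdirectory so nothing that already exists is touched.
    workdir=$(mktemp -d "$target/untar_bench.XXXXXX") || return 1

    # date +%s has one-second resolution, which is enough when the
    # differences are measured in minutes.
    start=$(date +%s)
    if ! tar xfz "$tarball" -C "$workdir"; then
        rm -rf "$workdir"
        return 1
    fi
    end=$(date +%s)

    # Clean up the extracted files, then report "<directory> <seconds>".
    rm -rf "$workdir"
    echo "$target $((end - start))"
}

# Only run the comparison when a tarball and at least one directory are given.
if [ "$#" -ge 2 ]; then
    tarball=$1
    shift
    for dir in "$@"; do
        time_untar "$tarball" "$dir"
    done
fi
```

On a system laid out like ours, you would call it with the tarball followed by the directories to compare, for example `/scratch`, `/fastdata` and `/data/myusername`.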

## Dagstuhl Seminar : Engineering Academic Software

July 6th, 2016

I was recently invited to a Schloss Dagstuhl Workshop on ‘Engineering Academic Software’ by organisers Carole Goble, James Howison, Claude Kirchner and Oscar M. Nierstrasz. One week of geeking-out with research software people from all over the world in lovely surroundings with as much beer and cheese as you can eat: sounds good to me!

I gave a presentation about life on the frontline of Research Software Engineering support or the RSE Accident and Emergency department as I sometimes think of it. I spent some time discussing Sheffield’s new Research Software Engineering group formed by Paul Richmond and me off the back of our EPSRC Research Software Engineering Fellowships. I also discussed a worrying trend I’ve noticed in research software — top people are leaving academia for industry, not because they want to but because of a lack of support! Slides for my talk are at https://mikecroucher.github.io/dagstuhl_RSE_Sheffield/#/.

Highlights

I love attending seminars like this because I get to learn about all of the wonderful things that the community is up to.  Personal highlights included:

Effective computation in physics

Meeting Katy Huff, co-author of my favourite Python book, Effective Computation in Physics. The only problem with this book is the word ‘physics’ in the title since it suggests that it’s only useful if you are a physicist. Totally not the case! If you are doing science in Python, get this book! Fellow blogger John D Cook interviewed both authors of the book back in 2015 – see the write-up at http://www.johndcook.com/blog/2015/08/08/effective-computation-in-physics/.

Software Heritage

Learning about the Software Heritage project that launched very recently. The project harvests and archives projects from various locations: GitHub, Debian and the GNU Project for now. They say that ‘we preserve software, because it contains our technical and scientific knowledge.’ It’s shaping up to be a ‘Library of Alexandria of Software’. The full mission statement is over at https://www.softwareheritage.org/mission/

Software citation and credit

There was a lot of discussion about considering software as a first-class scientific output and several projects were mentioned that help the situation. The FORCE11 software citation principles address how software should be cited and depsy.org is ‘an open-source webapp that tracks research software impact’. Dan Katz’s blog post ‘How should we add citations inside software’ is also worth a read.