Bespoke High Performance Computing Clusters in the Cloud with Alces Flight

February 21st, 2018 | Categories: Cloud Computing, HPC

I needed a supercomputer… quickly!

One of the things that we do in Sheffield’s Research Software Engineering Group is host training courses delivered by external providers. One such course is on parallel programming using MPI, for which we turn to the experts at NAG (Numerical Algorithms Group). A few days before turning up to deliver the course, the trainer got in touch with me to ask for details about our HPC cluster.

Because Croucher’s law, I had forgotten to let our HPC sysadmin know that I’d need a bunch of training accounts and around 128 cores set aside for us to play around with for a couple of days.

In other words, I was hosting a supercomputing course and had forgotten the supercomputer.

Building an HPC cluster in the cloud

Alces Flight is a relatively new product that allows you to spin up a traditional-looking High Performance Computing cluster on cloud computing substrates such as Microsoft Azure or Amazon AWS. You get a head node, a bunch of worker nodes and a job scheduler such as Slurm or Sun Grid Engine. It looks just like the systems that The University of Sheffield provides for its researchers!

You also get lots of nice features such as the ability to scale the number of worker nodes according to demand, a metric ton of available applications and the ability to customise the cluster at start up.

The supercomputing budget was less than the coffee budget

…and I only bought coffee for myself and the two trainers over the two days!  The attendees had to buy their own (In my defence…the course was free for attendees!).

I used the following

• A head node of: t2.large (2 vCPUs, 8GB RAM)
• Initial worker nodes: 4 of c4.4xlarge (16 vCPUs and 30GB RAM each)
• Maximum worker nodes: 8 of c4.4xlarge (16 vCPUs and 30GB RAM each)

This gave me a cluster with between 64 and 128 virtual cores, depending on how heavily the class was using it. Much of the time, only 4 nodes were up and running – the others spun up automatically when the class needed them and vanished when they hadn’t been used for a while.
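As a quick check on those core counts, here is a minimal Python sketch of the arithmetic. The constant and function names are mine, chosen purely for illustration; the values come from the node list above, and the head node is deliberately left out of the count.

```python
# Figures taken from the node list above; the names are illustrative only.
VCPUS_PER_WORKER = 16   # one c4.4xlarge
MIN_WORKERS = 4         # initial worker nodes
MAX_WORKERS = 8         # maximum worker nodes


def worker_cores(n_workers: int) -> int:
    """Total virtual cores across n_workers worker nodes (head node excluded)."""
    return n_workers * VCPUS_PER_WORKER


if __name__ == "__main__":
    low = worker_cores(MIN_WORKERS)
    high = worker_cores(MAX_WORKERS)
    print(f"Between {low} and {high} virtual cores")
```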

I was using the EU (Ireland) region and the prices at the time were

• Head node: On demand pricing of $0.101 per hour
• Worker nodes: $0.24 (ish) using spot pricing. Each one about twice as powerful as a 2014 MacBook Pro according to this benchmark.

HPC cost: As such, the maximum cost of this cluster was $2.73 per hour when all nodes were up and running. The class ran from 10am to 5pm for two days so we needed it for 14 hours. Maximum cost would have been $38.22.

Coffee cost: 2 instructors and me needed coffee twice a day. So that’s 12 coffees in total. Around £2.50 or $3.37 per coffee, so $40.44.

The HPC cost was probably less than that since we didn’t use 128 cores all the time and the coffee probably cost a little more.
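For anyone who wants to replay the comparison, the sums can be sketched in a few lines of Python. This takes the quoted $2.73 per hour maximum as given rather than re-deriving it from the per-node prices, and all the names are made up for illustration.

```python
# All figures are the ones quoted in the post; the names are illustrative only.
MAX_CLUSTER_COST_PER_HOUR = 2.73   # USD, quoted maximum with all nodes up
HOURS_PER_DAY = 7                  # 10am to 5pm
DAYS = 2

PEOPLE = 3                         # two instructors and me
COFFEES_PER_PERSON_PER_DAY = 2
COST_PER_COFFEE = 3.37             # USD, roughly £2.50 at the time


def max_hpc_cost() -> float:
    """Upper bound on the cluster bill, assuming every node ran the whole time."""
    return round(MAX_CLUSTER_COST_PER_HOUR * HOURS_PER_DAY * DAYS, 2)


def coffee_cost() -> float:
    """Total coffee bill over the whole course."""
    coffees = PEOPLE * COFFEES_PER_PERSON_PER_DAY * DAYS
    return round(coffees * COST_PER_COFFEE, 2)


if __name__ == "__main__":
    print(f"HPC (maximum): ${max_hpc_cost():.2f}")
    print(f"Coffee:        ${coffee_cost():.2f}")
```

Since the cluster spent much of its time at 4 worker nodes rather than 8, the real HPC bill would sit below the figure this prints.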

Setting up the cluster

Technical details of how I configured the cluster can be found in the follow-up post at https://www.walkingrandomly.com/?p=6431