## Intel’s Xeon Phi – GPU level power without the hassle?

November 13th, 2012 | Categories: CUDA, GPU, HPC, OpenCL, parallel programming, programming

Intel have finally released the Xeon Phi – an accelerator card based on 60 or so customised Intel cores to give around a Teraflop of double precision performance.  That’s comparable to the latest cards from NVIDIA (1.3 Teraflops according to http://www.theregister.co.uk/2012/11/12/nvidia_tesla_k20_k20x_gpu_coprocessors/) but with one key difference—you don’t need to learn any new languages or technologies to take advantage of it (although you can do so if you wish)!

The Xeon Phi uses good old-fashioned High Performance Computing technologies that we’ve been using for years, such as OpenMP and MPI.  There’s no need to completely recode your algorithms in CUDA or OpenCL to get a performance boost…just a sprinkling of OpenMP pragmas might be enough in many cases.  Obviously it will take quite a bit of work to squeeze every last drop of performance out of the thing, but this might just be the realisation of the ‘personal supercomputer’ we’ve all been waiting for.
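To give a feel for what a ‘sprinkling of OpenMP pragmas’ means, here is a minimal sketch in C. The function name `sum_of_squares` is made up for illustration, and nothing in it is specific to the Xeon Phi: it is ordinary OpenMP that should behave the same on any multicore CPU. Treat it as an example of the style rather than a tuned kernel.

```c
#include <stddef.h>

/* Hypothetical example function: returns x[0]^2 + ... + x[n-1]^2.
   If the compiler has OpenMP disabled, the pragma is ignored and the
   loop simply runs serially. */
double sum_of_squares(const double *x, size_t n)
{
    double total = 0.0;

    /* The only line added to the serial code: share the iterations
       between threads and combine each thread's partial sum. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++) {
        total += x[i] * x[i];
    }

    return total;
}
```

With GCC you would switch the pragma on by compiling with `-fopenmp`; other compilers have their own equivalent flag. Note that the order of the floating point additions can differ between runs, so results may vary in the last few bits compared with the serial loop.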

Here are some links I’ve found so far.  I’d love to see what everyone else has come up with, and I’ll update as I find more.

I also note that the Xeon Phi uses AVX-like vector extensions but with a wider vector width of 512 bits (that’s eight doubles per vector rather than the four you get from 256-bit AVX), so if you’ve been taking advantage of that technology in your code (using one of these techniques perhaps) you’ll reap the benefits there too.

I, for one, am very excited and can’t wait to get my hands on one!  Thoughts, comments and links gratefully received!

1. Nvidia supports OpenACC, which is also directive-based, so there is no need to learn a new language to leverage Nvidia GPUs. Nvidia was smart to jump on OpenACC. Now Intel’s entire case of “you don’t need to learn a new language” is rather weak.

2. Indeed they do. We have a few users of OpenACC on NVIDIA kit at Manchester University via our PGI Compiler license. They seem very happy!

3. I’ve been hearing vendors say “*this* language/compiler/hardware/accelerator will make parallelization to large scales a piece of cake!” for 20+ years. The cake is a lie.

It’s absolutely true that you can program this thing with OpenMP (say), and so that makes the initial barrier to getting things up and running much smaller. That’s genuinely nice, because it’s easier to get something that already works to go faster than if you have to re-write code to make it work in the first place.

But I see a lot of perfectly sensibly written OpenMP code that has trouble scaling to 8-12 cores on a node, and MPI code that has trouble scaling to 24-32 cores. Existing OpenMP code is not going to magically scale to 50-60 cores. And if you’re paying $2000-$3000 (and 300 extra watts) to buy one of these things, 8x the performance you’d get on one of the Phi’s low-speed cores just isn’t going to be enough to justify the purchase and power costs (and fan noise).

So you have to go through the trouble of rewriting for performance anyway, and the things you’ll have to do to get performance on *this* many-core, data-parallel architecture are pretty much the same things you’ll have to do for the *other guy’s* many-core, data-parallel architecture.

So yeah, one may be in OpenMP, and the other may be in OpenACC/OpenCL/CUDA, but I just don’t see this as a huge shift one way or the other (although maybe it’ll drive NVIDIA’s HPC offerings down in price, which I wouldn’t mind). Maybe I’m wrong, because Intel really is genuinely good at development tools, and maybe they just make it really easy this time, but I just don’t see that yet.
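The scaling worry in the comment above can be made a little more concrete with Amdahl’s law, which bounds the speedup of a program when only part of it runs in parallel. The sketch below is only an illustration: the function name `amdahl_speedup` and the 95% parallel fraction used afterwards are assumptions, not measurements of any real code or of the Xeon Phi.

```c
/* Amdahl's law: upper bound on speedup when a fraction
   `parallel_fraction` (between 0 and 1) of the run time is spread
   perfectly over `cores` cores and the remainder stays serial.
   Hypothetical helper for illustration only. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

Under that assumed 95% figure, `amdahl_speedup(0.95, 60)` comes out at roughly 15, well short of 60, and this simple model ignores memory bandwidth and synchronisation costs, which would only make things worse.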