Vectorising code to take advantage of modern CPUs (AVX and SSE)

August 24th, 2012 | Categories: CUDA, GPU, OpenCL, parallel programming, programming

Updated 26th March 2015

I’ve been playing with AVX vectorisation on modern CPUs off and on for a while now and thought that I’d write up a little of what I’ve discovered. The basic idea of vectorisation is that each processor core in a modern CPU can operate on multiple values (i.e. a vector) with a single instruction.

Modern processors have 256-bit wide vector units, which means that each core can operate on four double precision (or eight single precision) floating point values per instruction. On recent cores with two fused multiply-add (FMA) units, where each FMA counts as two operations per element, that works out at up to 16 double precision or 32 single precision floating point operations (FLOPs) per clock cycle per core. So, on the quad core CPU that’s typically found in a decent laptop you have 4 vector units (one per core) and could perform up to 64 double precision FLOPs per cycle. The Intel Xeon Phi accelerator has even wider vector units: 512-bit!

This all sounds great but how does a programmer actually make use of this neat hardware trick? There are many routes:

Intrinsics

At the ‘close to the metal’ level you code for these vector units using AVX intrinsics: compiler-provided functions that map more or less directly onto the underlying machine instructions. This is relatively difficult and leads to non-portable code if you are not careful.
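
To give a feel for the style, here is a minimal sketch in C (the function and variable names are my own invention) that adds two arrays of doubles using AVX intrinsics:

    #include <immintrin.h>  /* AVX intrinsics */

    /* Add two arrays of doubles, four elements per instruction.
       A sketch only: it assumes n is a multiple of 4; real code
       would also need a scalar loop for any left-over elements. */
    void add_arrays(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);  /* load 4 doubles */
            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_add_pd(va, vb);   /* 4 additions at once */
            _mm256_storeu_pd(&c[i], vc);          /* store 4 doubles */
        }
    }

Compiled with, for example, gcc -O2 -mavx. Note how the code is welded to 256-bit registers; that is the portability problem mentioned above.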

Auto-vectorisation in compilers

Since working with intrinsics is such hard work, why not let the compiler take the strain? Many modern compilers, including gcc, PGI and Intel, can automatically vectorise your C, C++ or Fortran code. Sometimes all you need to do is add an extra switch at compile time and reap the speed benefits. In truth, vectorisation isn’t always automatic and the programmer sometimes needs to give the compiler some assistance, but it is a lot easier than hand-coding intrinsics.
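
To illustrate, the loop below is the sort of thing auto-vectorisers handle well (the flags shown are for gcc; other compilers use different switches):

    /* saxpy.c -- y = a*x + y, a loop that compilers can auto-vectorise.
       'restrict' is an example of the assistance mentioned above: it
       promises the compiler that x and y do not overlap in memory. */
    void saxpy(int n, float a, const float * restrict x, float * restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

Building with something like gcc -O3 -march=native -fopt-info-vec should report that the loop was vectorised, with no intrinsics in sight.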

Intel SPMD Program Compiler (ispc)

There is a midway point between automagic vectorisation and having to use intrinsics. Intel have a free compiler called ispc (http://ispc.github.com/) that allows you to write compute kernels in a modified subset of C. These kernels are then compiled to make use of vectorised instruction sets. Programming using ispc feels a little like using OpenCL or CUDA. I figured out how to hook it up to MATLAB a few months ago and developed a version of the square root function that is almost twice as fast as MATLAB’s own version on Sandy Bridge i7 processors.
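
For a flavour of the language, a minimal square-root kernel might look something like this sketch, written in ispc’s C-like dialect (the file and function names are my own); the foreach construct spreads the iterations across the SIMD lanes:

    // sqrt_kernel.ispc -- compiled by ispc, callable from C/C++
    export void sqrt_ispc(uniform int count,
                          uniform float vin[],
                          uniform float vout[])
    {
        foreach (i = 0 ... count) {
            vout[i] = sqrt(vin[i]);
        }
    }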

OpenMP

OpenMP is an API specification for parallel programming that’s been supported by several compilers for many years. OpenMP 4, released in mid-2013, added support for vectorisation via the new simd directives.
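
As a minimal sketch (assuming a compiler with OpenMP 4 support, such as a recent gcc invoked with -fopenmp), the simd directive asks for a loop to be vectorised and can be combined with parallel for to spread the work across cores as well:

    /* OpenMP 4: vectorise the loop and share it across cores. */
    void scale(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }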

Vectorised Libraries

Vendors of numerical libraries are steadily applying vectorisation techniques in order to maximise performance.  If the execution speed of your application depends upon these library functions, you may get a significant speed boost simply by updating to the latest version of the library and recompiling with the relevant compiler flags.
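
For example, Intel’s MKL includes a Vector Math Library whose elementwise functions are vectorised internally. A minimal sketch, assuming MKL is installed and linked, replaces a hand-written loop with a single call:

    #include <mkl.h>  /* Intel MKL, including the Vector Math Library */

    /* y[i] = sqrt(a[i]) for the whole array; the vectorised loop
       lives inside the library rather than in our source code. */
    void sqrt_array(int n, const double *a, double *y)
    {
        vdSqrt(n, a, y);
    }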

CUDA for x86

Another route to vectorised code is to make use of the PGI compiler’s support for CUDA on x86. What you do is take CUDA kernels written for NVIDIA GPUs and use the PGI compiler to compile these kernels for x86 processors. The resulting executables take advantage of vectorisation. In essence, the vector units of the CPU act like CUDA cores (which they sort of are anyway!).

The PGI compilers also have a technology called PGI Unified Binary, which produces executables that use an NVIDIA GPU when one is present and fall back to multi-core x86 when it is not.

  • PGI CUDA-x86 – PGI’s main page for their CUDA on x86 technologies

OpenCL for x86 processors

Yet another route to vectorisation is to use Intel’s OpenCL implementation, which takes OpenCL kernels and compiles them down to take advantage of the vector units (http://software.intel.com/en-us/blogs/2011/09/26/autovectorization-in-intel-opencl-sdk-15/). The AMD OpenCL implementation may also do this but I haven’t tried it and haven’t had the chance to research it yet.
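
For illustration, OpenCL kernels are written in a C dialect. With a CPU implementation like Intel’s, the work-items of a simple kernel such as this sketch (names my own) get mapped onto the SIMD lanes of the vector units:

    // square.cl -- each work-item squares one array element
    __kernel void square(__global const float *in, __global float *out)
    {
        int i = get_global_id(0);  // this work-item's global index
        out[i] = in[i] * in[i];
    }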

WalkingRandomly posts

I’ve written a couple of blog posts that made use of this technology.

Miscellaneous resources

There is other stuff out there, but the above covers everything that I have used so far. I’ll finish by saying that everyone interested in vectorisation should check out this website… it’s the bible!

Research Articles on SSE/AVX vectorisation

I found the following research articles useful/interesting.  I’ll add to this list over time as I dig out other articles.

Comments

  1. ludolph
    August 27th, 2012 at 11:27

    For technical (scientific) number crunching, SSE/AVX-based vectorization only makes sense as an auto-vectorization tool plus SSE/AVX-enabled libraries, because source-level code becomes strongly dependent on the current CPU architecture.

    Do not waste your time learning low-level SSE/AVX programming tricks, because the final benefit is very often insignificant.

  2. August 27th, 2012 at 13:08

    @ludolph I agree, I wouldn’t bother with the low-level intrinsics as a scientist or mathematician. Using the auto-vectorisers, OpenCL, CUDA, ispc or libraries is the way to go in my opinion.

  3. ludolph
    August 28th, 2012 at 09:12

    @ludolph
    Just one additional remark:
    SSE/AVX still has terrible problems with double precision. As a scientist, I always need fast arithmetic with double precision accuracy.

    So finally, yes, it is occasionally possible to benefit from SSE/AVX. But in general, these vectorization techniques are, so far, not very suitable for numerically intensive computational problems.

  4. August 28th, 2012 at 09:56

    Hi ludolph,

    What do you think are the problems with double precision AVX? Do you have any links please?

  5. ludolph
    August 29th, 2012 at 09:56

    Ok… “terrible” is not the proper word. The problem is that any SSE/AVX-based vectorization technique produces an unpredictable speed up. See: http://www.efda-hlst.eu/training/HLST_scripts/performance-tuning-using-vectorization/at_download/file

    The only generally recommended approach is auto-vectorization, so everything depends on your compiler.