Archive for August, 2012
While on the train to work I came across a very interesting blog entry. Full LaTeX support (on-device compilation and a .dvi viewer) is now available on the iPad courtesy of TeX Writer by FastIntelligence. Here is the blog post telling us the good news: http://litchie.com/blog/?p=406
At the time of writing, the blog is down (Update: working again), possibly because of the click storm that my twitter announcement caused, especially after it was picked up by @TeXtip. So, here is the iTunes link: http://itunes.apple.com/us/app/tex-writer/id552717222?mt=8
I haven’t tried this yet but it looks VERY interesting. If you get a chance to try it out, feel free to let me know how you get on in the comments section.
Update 1: This version of TeX Writer (1.1) cannot output to .pdf. Only .dvi output is supported at the moment.
Let’s use Mathematica to discover the longest English words whose letters are in alphabetical order. The following command will give all such words:
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]
I’m not going to show all of the output because there are 562 of them (including single letter words such as ‘I’ and ‘a’) as we can see by doing
Length[DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]]

562
The longest of these words has seven characters:
Max[Map[StringLength, DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]]]]]

7
It turns out that only one such word has the maximum seven characters:

DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 7]

{"billowy"}
There are 34 such words that contain six characters:

DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 6]

{"abbess", "Abbott", "abhors", "accent", "accept", "access", "accost", "adders", "almost", "begins", "bellow", "Bellow", "bijoux", "billow", "biopsy", "bloops", "cellos", "chills", "chilly", "chimps", "chinos", "chintz", "chippy", "chivvy", "choosy", "choppy", "Deimos", "effort", "floors", "floppy", "flossy", "gloppy", "glossy", "knotty"}
If you insist on all letters being different, there are 9:
DictionaryLookup[x__ /; Characters[x] == Sort[Characters[x]] && StringLength[x] == 6 && Length[Union[Characters[x]]] == Length[Characters[x]]]

{"abhors", "almost", "begins", "bijoux", "biopsy", "chimps", "chinos", "chintz", "Deimos"}
How about words where all the letters are in reverse alphabetical order with no repeats? The longest such words have seven characters:

Max[Map[StringLength, DictionaryLookup[x__ /; Characters[x] == Reverse[Sort[Characters[x]]]]]]

7
Here they are:

DictionaryLookup[x__ /; Characters[x] == Reverse[Sort[Characters[x]]] && StringLength[x] == 7 && Length[Union[Characters[x]]] == Length[Characters[x]]]

{"sponged", "wronged"}
Updated 26th March 2015
I’ve been playing with AVX vectorisation on modern CPUs off and on for a while now and thought that I’d write up a little of what I’ve discovered. The basic idea of vectorisation is that each processor core in a modern CPU can operate on multiple values (i.e. a vector) simultaneously per instruction cycle.
Modern processors have 256-bit wide vector units, which means that each CORE can perform up to 16 double precision or 32 single precision floating point operations (FLOPs) per clock cycle. So, on the quad-core CPU that’s typically found in a decent laptop you have 4 vector units (one per core) and could perform up to 64 double precision FLOPs per cycle. The Intel Xeon Phi accelerator has even wider vector units: 512-bit!
This all sounds great but how does a programmer actually make use of this neat hardware trick? There are many routes:
Intrinsics
At the ‘close to the metal’ level you code for these vector units using instructions called AVX intrinsics. This is relatively difficult and can lead to non-portable code if you are not careful.
- The Intel Intrinsics Guide – An interactive reference tool for Intel intrinsic instructions.
- Introduction to Intel Advanced Vector Extensions – Includes some example C++ programs using AVX intrinsics.
- Benefits of Intel AVX for small matrices – More code examples along with speed comparisons.
Auto-vectorisation in compilers
Since working with intrinsics is such hard work, why not let the compiler take the strain? Many modern compilers, including gcc, PGI and Intel, can automatically vectorise your C, C++ or Fortran code. Sometimes all you need to do is add an extra switch at compile time and reap the speed benefits. In truth, vectorisation isn’t always automatic and the programmer needs to give the compiler some assistance, but it is a lot easier than hand-coding intrinsics.
- A Guide to Auto-vectorization with Intel C++ Compilers – Exactly what it says. In my experience, the Intel compilers do auto-vectorisation better than other compilers.
- Auto-vectorisation in gcc 4.7 – A superb article showing how auto-vectorisation works in practice when using gcc 4.7. Lots of C code examples along with the emitted assembler and a good discussion of the hints you may need to give to the compiler to get maximum performance.
- Auto-vectorisation in gcc – The project page for auto-vectorisation in gcc
- Optimizing Application Performance on x64 Processor-based Systems with PGI Compilers and Tools – Includes discussion and example of auto-vectorisation using the PGI compiler
- Jim Radigan: Inside Auto-Vectorization, 1 of n – A video by a Microsoft engineer working on Visual Studio 2012. A superb introduction to what vectorisation is along with speed-up demonstrations and discussion on how the auto-vectoriser will work in Visual Studio 2012.
- Auto Vectorizer in Visual Studio 2012 – A series of blog articles about vectorization in Visual Studio 2012.
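To make this concrete, here is a minimal C sketch (my own example, not taken from any of the articles above) of the kind of loop an auto-vectoriser handles well. The restrict qualifiers are an instance of the programmer assistance mentioned above: they promise the compiler that the arrays do not overlap, which is often all it needs to emit AVX instructions. With gcc, -O3 (or -O2 -ftree-vectorize) enables the vectoriser, and -ftree-vectorizer-verbose=2 (gcc 4.7) or -fopt-info-vec (later versions) reports what was vectorised.

```c
#include <stddef.h>

/* A SAXPY-style loop: one multiply and one add per element,
 * unit-stride access, no loop-carried dependencies. This is
 * the auto-vectoriser's favourite food. The restrict
 * qualifiers assert that x and y do not alias each other. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The same source compiles to plain scalar code at -O0, so you can time both builds to measure what vectorisation buys you on your own hardware.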
Intel SPMD Program Compiler (ispc)
There is a midway point between automagic vectorisation and having to use intrinsics. Intel have a free compiler called ispc (http://ispc.github.com/) that allows you to write compute kernels in a modified subset of C. These kernels are then compiled to make use of vectorised instruction sets. Programming using ispc feels a little like using OpenCL or CUDA. I figured out how to hook it up to MATLAB a few months ago and developed a version of the square root function that is almost twice as fast as MATLAB’s own version on Sandy Bridge i7 processors.
- http://ispc.github.com/ – The website for ispc
- http://ispc.github.com/perf.html – Some performance metrics. In some cases combining vectorisation and parallelisation can increase single precision throughput by more than a factor of 32 on a quad-core machine!
- ispc: A SPMD Compiler For High-Performance CPU Programming – Illinois-Intel Parallelism Center Distinguished Speaker Series (UIUC), March 15, 2012 (talk video; requires Windows Media Player). This link was taken from the website of Matt Pharr, the author of ispc.
OpenMP
OpenMP is an API specification for parallel programming that has been supported by several compilers for many years. OpenMP 4.0 was released in mid-2013 and included support for vectorisation.
- Performance Essentials with OpenMP 4.0 Vectorization – A series of seven videos from Intel covering performance essentials using OpenMP 4.0 Vectorization with C/C++.
- Explicit Vector Programming with OpenMP 4.0 SIMD Extensions – A tutorial from HPC today.
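As a flavour of what the explicit OpenMP 4.0 approach looks like, here is a small sketch of my own in C. The simd pragma asks the compiler to vectorise the loop, and the reduction clause tells it the summation may be reordered across vector lanes. Compilers that do not understand the pragma simply ignore it, so the code remains portable; with gcc you would compile with -fopenmp or -fopenmp-simd.

```c
#include <stddef.h>

/* Dot product with an explicit OpenMP 4.0 SIMD directive.
 * reduction(+:sum) lets the vectoriser keep partial sums in
 * separate vector lanes and combine them after the loop,
 * which would otherwise be blocked by the dependency on sum. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Note that reordering a floating point sum can change the rounding of the result slightly, which is exactly why the standard makes you ask for it explicitly with the reduction clause.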
Vectorised Libraries
Vendors of numerical libraries are steadily applying vectorisation techniques in order to maximise performance. If the execution speed of your application depends upon these library functions, you may get a significant speed boost simply by updating to the latest version of the library and recompiling with the relevant compiler flags.
- Yeppp! – A fast, vectorised math library (benchmarks here)
- NAG Library for Xeon Phi – A huge, commercial library for the Intel Xeon Phi Accelerator
- Intel AVX optimization in Intel Math Kernel Library (MKL) – See what’s been vectorised in version 10.3 of the MKL
- Intel Integrated Performance Primitives (IPP) Functions Optimized for AVX – The IPP library includes many basic algorithms used in image and signal processing
- SIMD Library for Evaluating Elementary Functions (SLEEF) – An open-source, vectorised library for the calculation of various mathematical functions. Someone has done benchmarks for it here.
- SIMD-oriented Fast Mersenne Twister (SFMT) – Uses vectorisation to implement very fast random number generation.
CUDA for x86
Another route to vectorised code is to make use of the PGI Compiler’s support for x86 CUDA. You take CUDA kernels written for NVIDIA GPUs and use the PGI Compiler to compile them for x86 processors. The resulting executables take advantage of vectorisation. In essence, the vector units of the CPU are acting like CUDA cores, which they sort of are anyway!
The PGI compilers also have a technology called PGI Unified Binary which allows you to use NVIDIA GPUs when present or default to multi-core x86 if no GPU is present.
- PGI CUDA-x86 – PGI’s main page for their CUDA on x86 technologies
OpenCL for x86 processors
Yet another route to vectorisation is to use Intel’s OpenCL implementation, which takes OpenCL kernels and compiles them down to take advantage of vector units (http://software.intel.com/en-us/blogs/2011/09/26/autovectorization-in-intel-opencl-sdk-15/). The AMD OpenCL implementation may also do this but I haven’t tried it and haven’t had the chance to research it yet.
WalkingRandomly posts
I’ve written a couple of blog posts that made use of this technology.
- Using Intel’s SPMD Compiler (ispc) with MATLAB on Linux
- Using the Portland PGI Compiler for MATLAB mex files in Windows
Miscellaneous resources
There is other stuff out there but the above covers everything that I have used so far. I’ll finish by saying that everyone interested in vectorisation should check out this website…It’s the bible!
Research Articles on SSE/AVX vectorisation
I found the following research articles useful/interesting. I’ll add to this list over time as I dig out other articles.
Someone recently asked me if WalkingRandomly.com had a Facebook page. Since it wasn’t much effort to create one, I have now done so. I have no idea if this will be of any use to anyone but a first stab at it is at http://www.facebook.com/Walkingrandomly
Now I have to decide what to do with it. Does anyone have any thoughts on a blog such as this having its own Facebook page? Is it a good idea? Will anyone make use of it or is it just pointless?
I work for a very large UK university and am part of a team that is responsible for the network licenses of dozens of software applications serving thousands of clients. Many of these licenses need to be renewed annually. From a technical point of view this is usually fairly trivial, but there are various management and administrative considerations involved in ensuring that user-side disruption is kept as low as possible.
Each software vendor has its own process for managing the license renewal. Some of them do it very well and some seem to go out of their way to make our lives as difficult as possible, but there is one problem that is shared by many software applications.
In a misguided attempt to be helpful to the user, many network licensed applications will display a launch-time message if the network license is set to expire ‘soon’ (where ‘soon’ might be as much as 90 days in the future). The user sees something like:
‘Your license for this software will expire in X days. Please contact your administrator for more information’
Can you guess what happens when several hundred users see that message? Yep, we get snowed under with panic queries about a ‘problem’ that does not exist. The license renewal is scheduled and will happen behind the scenes without the user ever needing to know anything about it.
So..to all software vendors whose applications do this for network licenses….PLEASE SHUT UP!
My first post on WalkingRandomly was on 20th August 2007 and, to be honest, I had no idea what I was going to write about or if I would keep at it; I certainly didn’t think that I would still be going strong 5 years later. Although I love to write, there is one reason why I keep on pounding out the posts here…you!
To all commenters, twitter followers, emailers and anyone else who has contacted me over the last 5 years regarding this blog….THANK YOU! Without you, I would have given up years ago because no one wants to write reams of stuff that doesn’t get read.
There are various things one could write about for a post like this. Maybe I could post reader statistics for example or discuss what I’ve got out of blogging over the last few years. Alternatively, I could go down the route of posting links to some of the posts I’m most proud of (or otherwise!). I considered all these things but realised that what was most important to me was just to say thank you to all of you.
Thanks!
I first mentioned Julia, a new language for high performance scientific computing, back in the February edition of A Month of Math Software and it certainly hasn’t stood still since then. A WalkingRandomly reader, Ohad, recently contacted me to tell me about a forum post announcing some Julia speed improvements.
Julia has a set of micro-benchmarks and the slowest of them is now only two times slower than the equivalent in C. That’s near compiled-language performance from an easy to use scripting language. Astonishingly, Julia is faster than gfortran in a couple of instances. Nice work!
Comparison times between Julia and other scientific scripting languages (MATLAB, Python and R for instance) for these micro-benchmarks are posted on Julia’s website. The Julia team have included the full benchmark source code used for all languages so if you are an expert in one of them, why not take a look at how they have represented your language and see what you think?
Let me know if you’ve used Julia at all, I’m interested in what people think of it.
Someone recently contacted me complaining that MATLAB could not do a QR factorisation of a variable precision arithmetic matrix. Double precision matrices work fine of course:
>> A=[2 1 3;-1 0 7; 0 -1 -1];
>> [Q R]=qr(A)

Q =

   -0.8944   -0.1826    0.4082
    0.4472   -0.3651    0.8165
         0    0.9129    0.4082

R =

   -2.2361   -0.8944    0.4472
         0   -1.0954   -4.0166
         0         0    6.5320
Variable precision matrices do not (I’m using MATLAB 2012a and the symbolic toolbox here).
>> a=vpa([2 1 3;-1 0 7; 0 -1 -1]);
>> [Q R]=qr(a)
Undefined function 'qr' for input arguments of type 'sym'.
It turns out that MATLAB and the symbolic toolbox CAN do variable precision QR factorisation…it’s just hidden a bit. The following very simple function, vpa_qr.m, shows how to get at it:
function [ q,r ] = vpa_qr( x )
    result = feval(symengine,'linalg::factorQR',x);
    q = result(1);
    r = result(2);
end
Let’s see how that does:

>> a=vpa([2 1 3;-1 0 7; 0 -1 -1]);
>> [Q R]=vpa_qr(a);
I’ve suppressed the output because it’s so large but it has definitely worked. Let’s take a look at the first element of Q for example
>> Q(1)

ans =

0.89442719099991587856366946749251
This is correct to the default number of variable precision digits, 32. Of course, we could change this to anything we like using the digits function.
- Download vpa_qr.m (needs symbolic toolbox)
I recently installed Ubuntu 12.04 on my laptop. I gave Unity a chance for a few days just in case it had improved since I last used it but still found it unusable. The following tweaks made Ubuntu usable again (for me at least).
- Install Bumblebee – I have an NVIDIA graphics card that makes use of Optimus technology. I learned the hard way that NVIDIA do not support Optimus laptops on Linux but, thankfully, this open source project makes everything work.
- Install Cinnamon – Unity is an awful desktop interface in my opinion. Cinnamon, on the other hand, is very nice indeed.
- Fix the scroll bars – Even after switching to Cinnamon, many applications in Ubuntu still use some truly awful window scrollbars. This link shows you how to make them go away.
That was pretty much it and I’m very happy with the result. Do you use Ubuntu? If so, are there any tweaks that you simply must make to the default setup before you feel that it’s usable for you?
The MathWorks sell dozens of toolboxes for MATLAB but they are not the only ones doing it. Many 3rd party vendors also sell MATLAB toolboxes, many of which are of extremely high quality. Here are some of my favourites:
- The NAG Toolbox for MATLAB – Over 1500 high quality numerical routines from the Numerical Algorithms Group. Contains local and global optimisation, statistics, finance, wavelets, curve fitting, special functions and much much more. Superb!
- AccelerEyes Jacket – Very fast and comprehensive GPU computing toolbox for MATLAB. I’ve used it a lot.
- Multi-precision toolbox for MATLAB – A colleague at Manchester recently told me about this one as his group uses it a lot in their research. I’ve yet to play with it but it looks great.
Which commercial, 3rd party toolboxes do you use/rate and why?
If you like this list, you may also like my list of high quality, free MATLAB toolboxes