## A Parallel programming tip for Mathematica

Someone emailed me recently complaining that the ParallelTable function in Mathematica didn’t work. In fact it made his code slower! Let’s take a look at an instance where this can happen and what we can do about it. I’ll make the problem as simple as possible to allow us to better see what is going on.

Let’s say that all we want to do is generate a list of the prime numbers as quickly as possible using Mathematica. The code to generate the first 20 million primes is very straightforward.

primelist = Table[Prime[k], {k, 1, 20000000}];

To time how long this list takes to calculate we can use the AbsoluteTime function as follows.

t = AbsoluteTime[]; primelist = Table[Prime[k], {k, 1, 20000000}]; time2 = AbsoluteTime[] - t

This took about 63 seconds on my machine. Taking a quick look at the Mathematica documentation we can see that the parallel version of Table is, obviously enough, ParallelTable. What could be simpler? So, to spread this calculation over both processors in a dual-core laptop we simply need to do the following

t = AbsoluteTime[]; primelist = ParallelTable[Prime[k], {k, 1, 20000000}]; time2 = AbsoluteTime[] - t

The result? 98 seconds! So, it seems that the parallel version is more than **50% SLOWER**! My correspondent was right – ParallelTable doesn’t seem to work at all.

Before we send a message to Wolfram Tech Support though, let’s dig a little deeper. It turns out that a single evaluation of

Prime[k] for any given k is very quick – even for high k. Prime[100000000] evaluates in a tiny fraction of a second for example (try it! – it really is astonishingly quick.)

What I suspect is happening here is that when you move to the parallel version, the kernels spend most of the time communicating with each other rather than actually calculating anything. I think that ParallelTable does something roughly like the following.

Kernel1:Give me a k.

Master:have k=1.

Kernel2:Give me a k.

Master:have k=2

Kernel1:I’ve done k=1 and the answer is 2, give me another k

Master: have k=3

kernel2:I’ve done k=2 and the answer is 3, give me another k etc

So, the kernels spend more time talking about the work that needs to be done rather than actually doing it. Also, for various reasons, kernel1 and kernel2 might get out of sync and so the Master kernel may end up with a list out of order. If so then it will need to reorder the lists at the end of the calculations and so that’s even more time taken.

So, my approach was to try and reduce the amount of communication between kernels to something like

Master: Kernel 1 – you go away and get me first 10 million primes. Don’t bother me until it’s done.

Master: kernel 2 – you go away and get me the second 10 million. Don’t bother me until it’s done.

The following Mathematica code does this.

t = AbsoluteTime[]; job1 = ParallelSubmit[Table[Prime[k], {k, 1, 10000000}]]; job2 = ParallelSubmit[Table[Prime[k], {k, 10000001, 20000000}]]; {a1, a2} = WaitAll[{job1, job2}]; time2 = AbsoluteTime[] - t

This works in 40 seconds compared to the original time of 62 seconds. Not a 2x speedup but not too shabby for so little extra work on our part.

The moral of the story? Make sure that your code spends more time actually doing work than it does just talking about it.

**Disclaimer:** I am still learning about Mathematica’s parallel tools myself so don’t take this stuff as gospel. Also, there are almost certainly more efficient ways of getting a list of primes than using the Prime[k] function. I am only using Prime[k] here because it is an example of a function that evaluates very quickly.

Mike,

In the last section of Mathematica code of this blog entry, do you mean to use a2 instead of b1 in the job2 line. (Isn’t that confusing.)

Original:

job2 = ParallelSubmit[b1 = Table[Prime[k], {k, 10000001, 20000000}]];

Should be?

job2 = ParallelSubmit[a2 = Table[Prime[k], {k, 10000001, 20000000}]];

I recently used your advice in this blog entry to speed up a section of Mathematica code I was writing which had originally needed 2 hours to complete. Using your hints, I shrunk the execution time down to 45 minutes! Thank you!

Todd

Hi Todd

Thanks for the comment and I’m glad I helped you out a little. You are right – the way I originally wrote the code was confusing and I think that the extra assignments were completely unnecessary. I have now changed

job2 = ParallelSubmit[b1 = Table[Prime[k], {k, 10000001, 20000000}]];

to

job2 = ParallelSubmit[Table[Prime[k], {k, 10000001, 20000000}]];

and similarly for job1. Does that work for you?

Cheers,

Mike

“So, the kernels spend more time talking about the work that needs to be done rather than actually doing it.”

sounds more like GovernmentTable[] or AcademiaTable[]

Mike :)

ParallelTable is not the problem here, the default method is just suited for a different workload profile, where the parallel computations are much bigger and/or non-uniform in execution time.

Use option Method -> “CoarsestGrained” for this situation, where you have many tiny computations. The computations will be sent out in large batches.

AbsoluteTiming[

primelist = ParallelTable[Prime[k], {k, 1, 20000000},

Method -> “CoarsestGrained”];]

This should give you similar execution time as the ParallelSubmit/WaitAll approach shown in the post.

You can see the description of the Method argument under the More Information section of the Parallelize function: http://reference.wolfram.com/mathematica/ref/Parallelize.html.

–Joel

Wolfram Research Kernel Developer

Parallel Tools Team

Hi Joel

Thanks for the extra information.

Best Wishes,

Mike

AWESOME AWESOME AWESOME.. I wish there was more documentation on parallel computing though. A couple examples mathematica gives end up being slower on my machine like your correspondent.. Hence the confusion.

Actually scratch that last comment. Mathematica does provide a lot of documentation hah.. You’re still awesome!! thanks for the website!!

Hi, I tried Joel’s suggest. It works for his example, but not work for below case:

-> AbsoluteTiming[ ParallelTable[k^(100*Pi), {k, 1, 2000000},Method -> “CoarsestGrained”];]

Out-> {18.4867043, Null}

-> AbsoluteTiming[Table[k^(100*Pi), {k, 1, 2000000}];]

Out-> {8.7347663, Null}

So, it seems to depend on what kind of command? Joel, could you plz explain?

Btw, I run on AMD 2 core.

Hi Nghia

Have you tried applying the method I discussed in the main post? Does that work for you?

Cheers,

Mike

Hi Mike,

Your way is perfect way for general case (it’s natural way, right ^_^) and I’m using it now. Thank you very much for that. However for the example I tried above, it’s does not faster. I don’t know why.

But it doesn’t matter, just example. I don’t face with that problem in real application. So everything’s still fine.

Cheers

Nghia

You can do better than Joel’s suggestion and better than what Mike did.

For this sort of code it is likely that whoever gets the second half of the table runs slower than whoever gets the first half, and so you land up waiting for the laggard. (Not to mention you have limited yourself in your code to two kernels, perhaps on a 4 CPU machine).

The easy way to get what Mike did is via

ParallelTable[Prime[k], {k, 1, 20000000},

Method -> “EvaluationsPerKernel” -> 1]

which does what it says it does — split the problem into one unit of work per kernel — presumably by splitting the table indices at 10,000,000, though Wolfram don’t tell us exactly. This will run 4-way on a 4-CPU machine.

But we are still hostage to the issue of uneven workloads. So a better choice is something like

ParallelTable[Prime[k], {k, 1, 20000000},

Method -> “EvaluationsPerKernel” -> 5]

which is more likely to balance the amount of work between the available kernels.

On my (2-core) system, the first (1 evaluation per kernel) took 54 sec, the second took 46 sec.