Fused Multiply Add (FMA) – One flop or two?
I am having a friendly argument with a colleague over how you calculate the peak number of floating-point operations per second (flops) for devices that support Fused Multiply Add (FMA). The FMA operation is d=a+b*c, an operation that can be done in one cycle on devices that support it.
I say that an FMA operation is two flops, he says it's one. So, when I calculate the theoretical peak of a device I get twice the value he does. So, what do you think: is FMA one flop or two?
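To make the disagreement concrete, here is a minimal sketch in C of how the two counting conventions lead to peak figures that differ by a factor of two. The device parameters are entirely made up for illustration, not any real part's specification.

```c
#include <stdio.h>

/* Theoretical peak in Gflops: clock (GHz) x number of FMA units x flops
 * counted per FMA. The only disputed quantity is the last factor. */
static double peak_gflops(double clock_ghz, int fma_units, int flops_per_fma)
{
    return clock_ghz * fma_units * flops_per_fma;
}

int main(void)
{
    /* Hypothetical device: 1 GHz, 512 FMA units, one FMA issued per cycle. */
    printf("Colleague (FMA = 1 flop):  %.0f Gflops\n", peak_gflops(1.0, 512, 1));
    printf("Me        (FMA = 2 flops): %.0f Gflops\n", peak_gflops(1.0, 512, 2));
    return 0;
}
```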
Just a portability issue. On devices that support FMA, your friend is right. On devices that don't, you are. If you want the worst case, you are.
The real issue is why do you care? If it’s an efficiency argument, then I can’t see 1 or 2 flops here or there changing the overall efficiency of an algorithm in the big-O little-o sense.
I would consider it one, but I would also point out that performance could be higher if FMAs are utilized.
The reason I would say one is that when I read a flops number, I take it to mean I can get that many multiplies OR adds. If you say two (which I think many advertisers would), you are saying you can get up to X flops… but only if your problem needs nothing but FMAs… so saying two is not really stating flops, it's stating X multiplies AND X adds.
@Craig If you consider an FMA operation to be one flop and so report the theoretical peak of a device as being X flops, is there then a possibility that a user could time an operation such as Matrix-Matrix multiply and go on to report that they’ve managed to get over 100% of peak?
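To spell out how that could happen, with made-up numbers: the usual convention charges an n×n matrix-matrix multiply 2n³ flops (n³ multiplies plus n³ adds), so on a device whose quoted peak counts each FMA as one flop the measured rate can come out above 100%:

```c
#include <stdio.h>

int main(void)
{
    double n = 4096.0;                /* matrix dimension (hypothetical run)   */
    double flops = 2.0 * n * n * n;   /* conventional GEMM flop count: 2n^3    */
    double seconds = 0.2;             /* measured wall-clock time (made up)    */
    double peak_fma_as_one = 512e9;   /* quoted peak if each FMA is 1 "flop"   */

    double achieved = flops / seconds;
    printf("Achieved: %.1f Gflops = %.0f%% of the 'FMA = 1' peak\n",
           achieved / 1e9, 100.0 * achieved / peak_fma_as_one);
    return 0;
}
```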
@William. Why do we care? GPU and CPU vendors often report the theoretical peak Gflops of their devices without always explaining how they reached that figure. Say you were comparing two devices, one supports FMA and the other doesn’t but they are otherwise equal.
What would the marketing people do? I think they’d likely agree with me and consider an FMA operation to be 2 flops, hence doubling their theoretical peak.
Is this the right thing to do? It depends on whether or not your application would make good use of FMA, of course. Matrix-Matrix multiply would (see the sketch after this comment).
You are right, it doesn’t affect the big-O state of affairs but it does affect the wall clock time.
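For reference, here is why Matrix-Matrix multiply maps so well onto FMA: the innermost statement of the naive algorithm is exactly the d=a+b*c pattern. A minimal sketch:

```c
/* Naive n x n matrix-matrix multiply. The innermost statement is a
 * multiply-add, so an FMA-capable device can issue it as one fused
 * instruction per iteration. */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];  /* sum = sum + A*B: an FMA */
            C[i*n + j] = sum;
        }
}
```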
A few comments from Twitter. I'll include direct links to the tweets but I don't know if they are permanent or not.
@StreamComputing: I vote for 1 RT @walkingrandomly When reporting peak flops, do you consider FMA to be 1 flop or 2? https://twitter.com/StreamComputing/status/296555373249327104
@StreamComputing: See previous tweet. Say somebody makes Sin(x) twice as efficient, do the flops double? A: no, only in specific cases. – I see FMA like this. https://twitter.com/StreamComputing/status/296556174256504832
@gepasi: @walkingrandomly two or else measuring flops becomes impossible. https://twitter.com/gepasi/status/296364596082638848
@jensnockert: Most count as two, which kind of makes sense. @walkingrandomly @streamcomputing https://twitter.com/jensnockert/status/296561951713161216
TWO. It has the *potential* to perform 2 floating point operations – an ADD and a MULTIPLY.
This industry always uses the potential maximum number of add/subtract & multiply ops.
If we started counting only FLOPS that you can actually achieve as opposed to the maximum, then well over half of the marketing material across the industry would have to be thrown in the bin! :-)
Sorry for the misunderstanding on Twitter: yes, it is two. The problem I have with FLOPS is that it is tricked up with things like FMA. It is like measuring the power of chess pieces by the Queen.
About the precision of FMA: “In 2008 the IEEE 754 standard was revised to include the fused multiply-add operation (FMA). The FMA operation computes rn(X × Y + Z) with only one rounding step. Without the FMA operation the result would have to be computed as rn(rn(X × Y ) + Z) with two rounding steps, one for multiply and one for add. Because the FMA uses only a single rounding step the result is computed more accurately”. Source: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
I had previously understood that the precision might be lower than with the two-step method. All the sources I had read were quite vague about the precision of FMA, so I assumed it was less precise.
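You can see the single-rounding effect in a few lines of C99. This sketch assumes fma() from math.h and a compiler that does not silently contract a*a - 1.0 into an FMA by itself (with gcc you may need -ffp-contract=off, and remember -lm):

```c
#include <stdio.h>
#include <math.h>

/* Ask the compiler not to fuse x*y + z on its own. */
#pragma STDC FP_CONTRACT OFF

int main(void)
{
    double a = 1.0 + 0x1p-27;          /* exactly 1 + 2^-27 */

    /* a*a = 1 + 2^-26 + 2^-54 exactly; the 2^-54 term is lost when the
     * product is rounded to double before the subtraction... */
    double two_step = a * a - 1.0;     /* gives 2^-26 */

    /* ...but survives when multiply and add share a single rounding. */
    double fused = fma(a, a, -1.0);    /* gives 2^-26 + 2^-54 */

    printf("two-step: %.20e\n", two_step);
    printf("fused:    %.20e\n", fused);
    return 0;
}
```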
More comments from Twitter.
Explicit examples from NVIDIA, AMD and Intel where they count FMA as 2 flops, thanks to @jensnockert:
@jensnockert: @walkingrandomly http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last … for example, each Cuda Core with 1 FMA unit provides 2 flops.
@jensnockert: @walkingrandomly Same thing with AMD, http://www.anandtech.com/show/6445/amd-announces-firepro-s10000
@jensnockert: @walkingrandomly Intel does the same, http://www.anandtech.com/show/6451/the-xeon-phi-at-work-at-tacc/2 (Note that the Xeon Phi has a 512-bit vector unit doing the fma.)
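For what it's worth, the arithmetic behind headline figures like these is peak = cores × clock × vector lanes × 2 flops per FMA. A sketch with illustrative numbers (not any particular part's specification):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative Xeon-Phi-style numbers; check the spec sheet for a real part. */
    int    cores         = 60;
    double clock_ghz     = 1.0;
    int    lanes         = 8;    /* 512-bit vector / 64-bit doubles */
    int    flops_per_fma = 2;    /* the vendors' counting convention */

    double peak_gflops = cores * clock_ghz * lanes * flops_per_fma;
    printf("Peak: %.0f Gflops (double precision)\n", peak_gflops);
    return 0;
}
```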
From @edward_smyth:
@Edward_smyth: @walkingrandomly Two, as it is two in the numerical algorithm. Vendors will count it as two in their peak calculations but note that hardware performance counters may only count it as one (i.e. one instruction).
After reading some of the other votes, I think I would change mine to two. Due to the way things have been marketed in the past, we should state the maximum. I would still want to know how the quoted number of flops was calculated, but all single-number statistics are misleading.
@Craig
FMA is most definitely 2 flops!
It is just the same as the way a 256-bit wide register can do four 64-bit floating point operations from a single instruction (see the sketch below).
And remember, when CPUs had separate multiply and add units you still added them together, even if your 5-point stencil CFD code did a lot more adds per point than multiplies.
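To illustrate the 256-bit point above, here is a sketch using x86 AVX/FMA3 intrinsics; it assumes a CPU with FMA3 support and a compile flag such as gcc's -mfma. One instruction performs four double-precision multiply-adds, i.e. eight flops by the "FMA = 2" convention.

```c
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    /* One 256-bit FMA instruction: four 64-bit multiply-adds at once. */
    __m256d a = _mm256_set1_pd(2.0);
    __m256d b = _mm256_set1_pd(3.0);
    __m256d c = _mm256_set1_pd(1.0);
    __m256d d = _mm256_fmadd_pd(a, b, c);   /* d = a*b + c in every lane */

    double out[4];
    _mm256_storeu_pd(out, d);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 7 7 7 7 */
    return 0;
}
```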