Fused Multiply Add (FMA) – One flop or two?

January 29th, 2013 | Categories: HPC, just for fun, walking randomly

I am having a friendly argument with a colleague over how you calculate the peak number of floating-point operations per second (flops) for devices that support Fused Multiply Add (FMA). The FMA operation is d = a + b*c, which can be done in a single cycle on devices that support it.
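
In code, the operation looks something like this (a minimal sketch using C99’s fma() from math.h; whether the compiler maps it to a single hardware FMA instruction depends on the target):

    #include <math.h>   /* C99 fma(): fused multiply-add */
    #include <stdio.h>

    int main(void)
    {
        double a = 1.5, b = 2.0, c = 3.0;

        double two_ops = a + b * c;     /* a multiply then an add: two rounded operations  */
        double fused   = fma(b, c, a);  /* fma(x, y, z) computes x*y + z with one rounding */

        printf("%f %f\n", two_ops, fused);   /* both print 7.500000 for these inputs */
        return 0;
    }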

I say that an FMA operation is two flops; he says it’s one. So, when I calculate the theoretical peak of a device, I get twice the value he does. What do you think: is FMA one flop or two?

  1. william
    January 29th, 2013 at 21:09

    Just a portability issue. On devices that support FMA, your friend is right; on devices that don’t, you are. If you want the worst case, you are.

    The real issue is: why do you care? If it’s an efficiency argument, then I can’t see 1 or 2 flops here or there changing the overall efficiency of an algorithm in the big-O, little-o sense.

  2. Craig
    January 30th, 2013 at 00:27

    I would consider it one, but I would also point out that performance could be higher if FMAs are utilized.

  3. Craig
    January 30th, 2013 at 00:39

    The reason I would say one is that when I read a flops number, I think it means I can get that many multiplies OR adds. If you say two (which I think many advertisers would), you are saying you can get up to X flops… but only if your problem needs nothing but FMAs… so saying two is not really stating flops, it’s stating X multiplies AND X adds.

  4. January 30th, 2013 at 10:39

    @Craig If you consider an FMA operation to be one flop and so report the theoretical peak of a device as being X flops, is there then a possibility that a user could time an operation such as Matrix-Matrix multiply and go on to report that they’ve managed to get over 100% of peak?
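
    For example (a rough back-of-the-envelope sketch with made-up numbers): suppose a device can issue 10^9 FMAs per second and, counting each FMA as one flop, we quote a peak of 1 Gflop/s. A matrix-matrix multiply of two n×n matrices performs n^3 multiplies and n^3 adds (2n^3 flops in the usual operation count) but, in the best case, needs only n^3 FMA issues:

        #include <stdio.h>

        int main(void)
        {
            double fmas_per_sec = 1.0e9;          /* hypothetical device: 1e9 FMAs per second     */
            double peak_flops   = fmas_per_sec;   /* quoted "peak" if each FMA counts as ONE flop */

            double n    = 1000.0;                 /* n x n matrix-matrix multiply                 */
            double work = 2.0 * n * n * n;        /* n^3 multiplies + n^3 adds                    */
            double time = (n * n * n) / fmas_per_sec;  /* best case: one FMA per multiply/add pair */

            double achieved = work / time;
            printf("achieved/peak = %.0f%%\n", 100.0 * achieved / peak_flops);  /* prints 200% */
            return 0;
        }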

  5. January 30th, 2013 at 10:40

    @William: Why do we care? GPU and CPU vendors often report the theoretical peak Gflops of their devices without always explaining how they reached that figure. Say you were comparing two devices: one supports FMA and the other doesn’t, but they are otherwise equal.

    What would the marketing people do? I think they’d likely agree with me and consider an FMA operation to be 2 flops, hence doubling their theoretical peak.

    Is this the right thing to do? That depends on whether or not your application would make good use of FMA, of course. Matrix-matrix multiply would.

    You are right that it doesn’t affect the big-O state of affairs, but it does affect the wall-clock time.

  6. January 30th, 2013 at 11:01

    A few comments from Twitter. I’ll include direct links to the tweets, but I don’t know whether they are permanent.

    @StreamComputing: I vote for 1 RT @walkingrandomly When reporting peak flops, do you consider FMA to be 1 flop or 2? https://twitter.com/StreamComputing/status/296555373249327104

    @StreamComputing: See previous tweet. Say somebody makes Sin(x) twice as efficient, do the flops double? A: no, only in specific cases. – I see FMA like this. https://twitter.com/StreamComputing/status/296556174256504832

    @gepasi: @walkingrandomly two or else measuring flops becomes impossible. https://twitter.com/gepasi/status/296364596082638848

    @jensnockert: Most count as two, which kind of makes sense. @walkingrandomly @streamcomputing https://twitter.com/jensnockert/status/296561951713161216

  7. January 30th, 2013 at 11:35

    TWO. It has the *potential* to produce 2 floating point operations – ADD and MULTIPLY.

    This industry always uses the potential maximum number of add/subtract & multiply ops.

    If we started counting only the FLOPS that you can actually achieve, as opposed to the maximum, then well over half of the marketing material across the industry would have to be thrown in the bin! :-)

  8. January 30th, 2013 at 14:48

    Sorry for the misunderstanding on Twitter: yes, it is two. The problem I have with FLOPS is that the figure gets tricked up with things like FMA. It is like measuring the power of a set of chess pieces by the Queen alone.

    About the precision of FMA: “In 2008 the IEEE 754 standard was revised to include the fused multiply-add operation (FMA). The FMA operation computes rn(X × Y + Z) with only one rounding step. Without the FMA operation the result would have to be computed as rn(rn(X × Y ) + Z) with two rounding steps, one for multiply and one for add. Because the FMA uses only a single rounding step the result is computed more accurately”. Source: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

    I had understood that the precision might be lower than with the two-step method, but all the sources I read were quite vague about the precision of FMA, so I assumed it was less precise.
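
    A small experiment makes the single-rounding point concrete (a sketch assuming C99’s fma() from math.h; compile with something like cc -std=c11 -ffp-contract=off demo.c -lm so the compiler doesn’t silently fuse the two-step version as well):

        #include <math.h>
        #include <float.h>
        #include <stdio.h>

        int main(void)
        {
            double a = 1.0 + DBL_EPSILON;        /* 1 + 2^-52, exactly representable       */
            double p = a * a;                    /* rounded product: 1 + 2^-51             */
                                                 /* (the 2^-104 term is lost in rounding)  */

            double two_step = a * a - p;         /* multiply rounds first, so this is 0    */
            double fused    = fma(a, a, -p);     /* a*a - p rounded once: recovers 2^-104  */

            printf("two-step: %g\n", two_step);  /* 0                                      */
            printf("fused   : %g\n", fused);     /* ~4.93e-32, i.e. 2^-104                 */
            return 0;
        }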

  9. January 30th, 2013 at 15:41

    More comments from Twitter.

    Explicit examples from NVIDIA (CUDA), AMD and Intel where they count FMA as 2 flops, thanks to @jensnockert:

    @jensnockert: @walkingrandomly http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last … for example, each Cuda Core with 1 FMA unit provides 2 flops.

    @jensnockert: @walkingrandomly Same thing with AMD, http://www.anandtech.com/show/6445/amd-announces-firepro-s10000

    @jensnockert: @walkingrandomly Intel does the same, http://www.anandtech.com/show/6451/the-xeon-phi-at-work-at-tacc/2 (Note that the Xeon Phi has a 512-bit vector unit doing the fma.)

    From @edward_smyth:

    @Edward_smyth: @walkingrandomly Two, as it is two in the numerical algorithm. Vendors will count it as two in their peak calculations but note that hardware performance counters may only count it as one (i.e. one instruction).
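
    To make the counting explicit, here is a sketch using roughly the Tesla K20X numbers from the AnandTech piece linked above (2688 CUDA cores at 732 MHz; treat the figures as approximate):

        #include <stdio.h>

        int main(void)
        {
            /* Approximate Tesla K20X figures; each CUDA core retires one FMA per cycle. */
            double cuda_cores    = 2688.0;
            double clock_hz      = 732.0e6;
            double flops_per_fma = 2.0;      /* the vendor convention: multiply + add */

            double peak_sp = cuda_cores * clock_hz * flops_per_fma;
            printf("single-precision peak ~ %.2f Tflops\n", peak_sp / 1e12);  /* ~3.94 */
            return 0;
        }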

  10. Craig
    January 30th, 2013 at 19:17

    After reading some of the other votes, I think I would change mine to two. Given the way things have been marketed in the past, we should state the maximum. I would still want to know how the quoted number of flops was calculated, but all single-number statistics are misleading.

  11. Dan Kidger
    January 31st, 2013 at 17:12

    @Craig
    FMA is most definitely 2 flops!
    It is just the same as a 256-bit wide register doing 4 64-bit floating-point operations from a single instruction.
    And remember, back when CPUs had separate multiply and add units, you still added the two counts together, even if your 5-point stencil CFD code does a lot more adds per point than multiplies.
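
    To make the analogy concrete, a hypothetical sketch using x86 AVX/FMA3 intrinsics (it needs a CPU and compiler that support them, e.g. gcc -mavx -mfma):

        #include <immintrin.h>   /* AVX and FMA intrinsics */
        #include <stdio.h>

        int main(void)
        {
            __m256d a = _mm256_set1_pd(1.5);
            __m256d b = _mm256_set1_pd(2.0);
            __m256d c = _mm256_set1_pd(3.0);

            /* One 256-bit add instruction: 4 double-precision adds, counted as 4 flops. */
            __m256d sums = _mm256_add_pd(a, b);

            /* One 256-bit FMA instruction: 4 fused multiply-adds, counted as 8 flops.   */
            __m256d fmas = _mm256_fmadd_pd(b, c, a);   /* per lane: b*c + a = 7.5 */

            double out[4];
            _mm256_storeu_pd(out, sums);
            printf("add lane 0: %f\n", out[0]);   /* 3.5 */
            _mm256_storeu_pd(out, fmas);
            printf("fma lane 0: %f\n", out[0]);   /* 7.5 */
            return 0;
        }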