Interesting. Modern descriptions of FMA highlight the improved accuracy realised by not double-rounding. But the introductory material in this thesis ("Floating-Point Fused Multiply-Add Architectures" by Eric Charles Quinnell) has IBM claiming:

Quote:

...benefits of combining the floating-point adder and floating-point multiplier into a single functional unit. First, the latency for a multiply-add fused mathematical operation is reduced significantly by having an addition combined with a multiplication in hardware. Second, the precision of the final result is increased, since the operands only go through a single rounding stage. Third, there is a decrease in the number of required input/output ports to the register file and their controlling sub-units. Finally, a reduced area of both the floating-point adder and floating-point multiplier may be realized since the adder is only wired to the output connections of the multiplier.

There seems to be some historical to-and-fro depending on whether the FMA is an extra functional unit or one which replaces the separate multiplier and adder.

Quote:

Even though the fused multiply-add architecture has troublesome latencies, high power consumption, and a performance degradation with single-instruction execution, it may be fully expected that more and more x87 designs will find floating-point fused multiply-add units in their silicon.

There are some great diagrams in the early parts of that thesis.

Why not replace FMA with double-width results? Conventionally, I would expect power, area, and time all to count against it: for the desired extra one or two bits of accuracy (surely FMA doesn't offer more than that), doubling the precision is going to incur a major cost. However, FPGAs may lead to an unconventional answer: the transistors may already be there and unused; the timing may depend on other parts of the design, or be dominated by routing costs; the power budget may be unimportant.
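
To make the single-rounding point concrete, here is a minimal Python sketch. It does not use a hardware FMA: it assumes exact rational arithmetic (the standard `fractions` module) followed by one conversion to float as a stand-in for the fused operation. The function names and the operands are my own, picked for illustration so that the product lies just below 1.0 and the addition cancels.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once.

    Fraction(float) is exact, and converting the final Fraction back to
    float performs a single correctly rounded division.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative operands (an assumption, not taken from the thesis):
# a * b is exactly 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

two_step = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print("multiply, round, add:", two_step)
print("single rounding:     ", fused)
```

With these operands the two-step version returns 0.0, while the single-rounding version keeps the residual of -2**-60. In a cancellation case like this the gain is more than a couple of bits, though well-conditioned inputs will normally differ only in the last place.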