Understanding Floating-point Performance

Inexact Floating Point Comparisons

Some floating-point applications exhibit extremely poor performance because they never terminate. In many cases, the cause is an exact floating-point (FP) comparison against a given value. The following example demonstrates the concept:

Example
if (foo() == 2.0)

Here, foo() may return a value arbitrarily close to 2.0 without ever matching 2.0 exactly, so the test never succeeds. You can improve the performance of such code by using an inexact (fuzzy) floating-point comparison that tests a value to within a certain tolerance, as shown below:

Example

epsilon = 1E-8;
if (fabs(foo() - 2.0) <= epsilon)
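A fixed epsilon such as 1E-8 suits values near 2.0, but it can be too loose for very small values and too tight for very large ones. The following is a minimal sketch of a comparison that accepts either an absolute or a relative tolerance. The function name nearly_equal and both tolerance parameters are illustrative assumptions, not part of any compiler library.

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when a and b differ by no more than abs_tol, or by no more
   than rel_tol times the larger of their magnitudes.
   The name and both tolerance parameters are hypothetical. */
bool nearly_equal(double a, double b, double abs_tol, double rel_tol)
{
    double diff = fabs(a - b);
    if (diff <= abs_tol)
        return true;                       /* close in absolute terms */

    double mag_a = fabs(a);
    double mag_b = fabs(b);
    double largest = (mag_a > mag_b) ? mag_a : mag_b;
    return diff <= rel_tol * largest;      /* close relative to magnitude */
}
```

With this helper, the test above could be written as if (nearly_equal(foo(), 2.0, 1E-8, 0.0)). Suitable tolerances depend on the accuracy of the computation that produces the value.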

Denormal Computations

A denormal number is one whose mantissa is nonzero but whose exponent field is zero in an IEEE* floating-point representation. The smallest normal single-precision floating-point number greater than zero is about 1.175494350822288e-38. Smaller numbers are possible, but they are denormal and require hardware or operating system intervention to handle, which can cost hundreds of clock cycles.
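To check whether a particular value falls in the denormal range, standard C provides the fpclassify macro and the FLT_MIN constant (the smallest normal single-precision value). The sketch below wraps them in a helper. The name is_denormal is an illustrative assumption, and the code assumes the program is not already running in flush-to-zero mode.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if x is a denormal (subnormal) single-precision value,
   otherwise 0. Zero, normal values, infinities, and NaNs all return 0.
   The function name is hypothetical. */
int is_denormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

For example, is_denormal(FLT_MIN) returns 0, while is_denormal(FLT_MIN / 2.0f) returns 1.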

In many cases, denormal numbers are evidence of a poorly chosen algorithm that performs too much computation in the denormal range. There are several ways to work around denormal numbers. For example, you can translate to normal: multiply by a large scalar, do the remaining computations in the normal range, then scale the result back down to the denormal range. Do this when the small denormal values matter to the program design. In many other cases, denormals that can be treated as zero may simply be flushed to zero.
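The translate-to-normal approach can be sketched as follows for a simple summation. The scale factor of 2^64 and the function name scaled_sum are illustrative assumptions; an appropriate factor depends on the range of the data. A power of two is used because it changes only the exponent, so the scaling itself introduces no rounding while values stay in the normal range.

```c
#include <stddef.h>

/* Hypothetical scale factors: 2^64 and its reciprocal (C99 hex floats). */
#define SCALE_UP   0x1p64f
#define SCALE_DOWN 0x1p-64f

/* Sums n single-precision values that may lie in the denormal range.
   Each input is scaled into the normal range before it is accumulated,
   and the total is scaled back down once at the end. */
float scaled_sum(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * SCALE_UP;   /* accumulate in the normal range */
    return acc * SCALE_DOWN;      /* single rescale back down */
}
```

In this sketch the multiplication that reads each input and the final rescale can still touch denormal values, but the accumulation, which is the repeated work, stays in the normal range. In real code the inputs would ideally be produced at the larger scale in the first place.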

Denormals are computed in software on Itanium® processors. Hundreds of clock cycles are required, resulting in excessive kernel time. Attempt to understand why denormal results occur and determine if they are justified. If you determine they are not justified, then use the following steps to handle the results:

Translate to a normal problem by scaling values.
Increase precision and range by using a wider data type.
Set flush-to-zero mode in the floating-point status register: -ftz (Linux*) or /Qftz (Windows*).
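The second step, using a wider data type, can be illustrated with a value such as 1e-40, which is below the smallest normal single-precision number but far above the smallest normal double-precision number. The helper names in this sketch are illustrative assumptions.

```c
#include <math.h>

/* Returns 1 if v is a normal number once stored in single precision. */
int is_normal_as_float(double v)
{
    return fpclassify((float)v) == FP_NORMAL;
}

/* Returns 1 if v is a normal number in double precision. */
int is_normal_as_double(double v)
{
    return fpclassify(v) == FP_NORMAL;
}
```

For 1e-40, is_normal_as_float returns 0 and is_normal_as_double returns 1, so the same computation carried out in double precision avoids the denormal range.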

Note

This process applies to the source file containing the main() function only. See

Denormal numbers always indicate a loss of precision, an underflow condition, and usually an error (or at least a less-than-desirable condition). On the Intel® Pentium® 4 processor and the Intel® Itanium® processor, floating-point computations that generate denormal results can be set to flush those results to zero, which improves performance.

Itanium® compiler

The Itanium® compiler supports the -ftz (Linux) or /Qftz (Windows) option, which flushes denormal results to zero when the application is in gradual underflow mode. Use this option if the denormal values are not critical to application behavior. The option is OFF by default, so the compiler lets results underflow gradually.

The -ftz (Linux) or /Qftz (Windows) switch only needs to be used on the source file containing main(). The switch turns on Flush-to-Zero (FTZ) mode for the process started by main(). The initial thread, and any threads subsequently created by that process, operate in FTZ mode. Note that the -O3 (Linux) or /O3 (Windows) option turns -ftz (Linux) or /Qftz (Windows) ON. Use /Qftz- (Windows) to disable flushing denormal results to zero.

IA-32 compiler

The IA-32 compiler does not support the -ftz (Linux) or /Qftz (Windows) option. However, -xK or -xW (Linux) or /QxK or /QxW (Windows) automatically flushes to zero, which is the preferred approach. The only other way to enable Flush-to-Zero mode on an Intel® Pentium® 4 processor is to manually program the SIMD control and status register (MXCSR), as illustrated in the following example:

Example

void SIMDFlushToZero (void)
{
    DWORD SIMDCtrl;
    _asm
    {
        STMXCSR SIMDCtrl
        mov eax, SIMDCtrl
        // flush-to-zero = bit 15
        // mask underflow = bit 11
        // denormals are zero = bit 6
        or eax, 08840h
        mov SIMDCtrl, eax
        LDMXCSR SIMDCtrl
    }
}
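Where inline assembly is not available, the same register update can be sketched with the _mm_getcsr and _mm_setcsr intrinsics declared in xmmintrin.h. This sketch assumes an x86 target with SSE support and a processor that implements the denormals-are-zero bit. The function name and the macro names are illustrative assumptions; the bit positions are the ones listed in the assembly example above.

```c
#include <xmmintrin.h>

/* MXCSR bits, matching the comments in the assembly example.
   The macro names are hypothetical. */
#define MXCSR_FLUSH_TO_ZERO      0x8000u  /* bit 15 */
#define MXCSR_UNDERFLOW_MASK     0x0800u  /* bit 11 */
#define MXCSR_DENORMALS_ARE_ZERO 0x0040u  /* bit 6  */

/* Sets the same three bits as "or eax, 08840h" in the assembly version. */
void simd_flush_to_zero(void)
{
    unsigned int csr = _mm_getcsr();          /* read MXCSR (STMXCSR)  */
    csr |= MXCSR_FLUSH_TO_ZERO
         | MXCSR_UNDERFLOW_MASK
         | MXCSR_DENORMALS_ARE_ZERO;
    _mm_setcsr(csr);                          /* write MXCSR (LDMXCSR) */
}
```

MXCSR is part of each thread's state, so the setting made here applies to the calling thread. It affects SSE and SSE2 arithmetic only, not x87 floating-point instructions.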

Refer to IA-32 Intel® Architecture Software Developer’s Manual Volume 1: Basic Architecture for more details about flush to zero or specific bit field settings.

Detailed Microarchitectural Optimization Analysis (Itanium® Compiler)

For more detailed advice on microarchitectural optimization and cycle accounting, refer to the Introduction to Microarchitectural Optimization for Itanium® 2 Processors Reference Manual, also known as the "Software Optimization book", document number 251464-001, located at http://www.Intel.com/software/products/vtune/techtopic/Software_Optimization.pdf.