TensorFloat-32

Number format in Nvidia hardware

TensorFloat-32, or TF32, is a floating-point number format designed for the Tensor Cores of certain Nvidia GPUs.

Format

The binary format is:

  • 1 sign bit
  • 8 exponent bits
  • 10 fraction bits (also called mantissa, or precision bits)
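
Assuming the fields are interpreted like IEEE 754 binary32 (a bias-127 exponent and an implicit leading 1), a normal TF32 value with sign bit s, exponent field e, and fraction field f decodes as

    x = (-1)^s \times 2^{e-127} \times \left(1 + \frac{f}{2^{10}}\right)

giving TF32 the dynamic range of FP32 (8 exponent bits) with the precision of FP16 (10 fraction bits).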

The total of 19 bits fits within a 32-bit word, and while it has less precision than a standard 32-bit IEEE 754 floating-point number (which carries 23 fraction bits), it provides much faster computation, up to 8 times faster on an A100 compared with FP32 on a V100.
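
The narrowing from FP32 to TF32 can be emulated in software by rounding the 23 fraction bits of a binary32 value down to 10. Below is a minimal C sketch; round_to_tf32 is a hypothetical name rather than an Nvidia API, round-to-nearest-even is assumed, and NaN inputs are not special-cased:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper, not an Nvidia API: keep only the top 10 of the
       23 fraction bits of a binary32 value, rounding to nearest even.
       NaN payloads are not special-cased in this sketch. */
    static float round_to_tf32(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);        /* reinterpret the IEEE 754 encoding */

        uint32_t keep_lsb = (bits >> 13) & 1;  /* LSB of the kept fraction (tie-break) */
        bits += 0x0FFFu + keep_lsb;            /* rounding bias: nearest, ties to even */
        bits &= ~0x1FFFu;                      /* clear the 13 discarded fraction bits */

        memcpy(&x, &bits, sizeof x);
        return x;
    }

    int main(void)
    {
        float x = 0.1f;
        printf("%.9f -> %.9f\n", x, round_to_tf32(x));  /* 0.100000001 -> 0.099975586 */
        return 0;
    }

Adding the bias 0x0FFF plus the kept LSB and then masking implements round-to-nearest-even with a single integer addition; the same trick is commonly used to emulate FP32-to-bfloat16 conversion.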


