Here’s a helpful write-up on loss-scaling-free training, aimed at practitioners who use mixed precision (FP16/BF16) and want to avoid the complexity of manual or dynamic loss scaling.

# Loss Scaling Free: Training Deep Learning Models Without the Scaling Headache
## The Problem Loss Scaling Solves

In FP16 mixed precision training, activations and gradients are stored as 16-bit floats. The issue: gradients often become too small to represent in FP16’s limited dynamic range (the smallest representable positive value is about 5.96e-8). When underflow happens, gradients become zero, and training stops learning.
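To make the underflow concrete, here is a small illustration (not part of the original write-up): a gradient-sized value that is fine in FP32 collapses to zero when cast to FP16, while BF16 keeps it.

```python
import torch

# A gradient-sized value: representable in FP32, but below FP16's
# smallest positive value (~5.96e-8).
g = torch.tensor(1e-9, dtype=torch.float32)

print(g.to(torch.float16))   # prints 0.0 -- the value has underflowed
print(g.to(torch.bfloat16))  # prints a small nonzero value close to 1e-9
```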
Loss scaling multiplies the loss by a constant S before backpropagation, scaling gradients up so they fall into the representable range of FP16. After backprop, gradients are divided by S before the optimizer step.
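As a rough sketch of what that looks like in code, assuming a PyTorch-style training loop (`model`, `optimizer`, `loss_fn`, and the scale value are placeholders; real setups typically let a framework helper such as PyTorch's GradScaler adjust the scale dynamically):

```python
import torch

S = 1024.0  # static loss scale; the value itself is a hypothetical choice

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    (loss * S).backward()          # scale up so small FP16 gradients don't underflow
    for p in model.parameters():   # unscale gradients before the optimizer step
        if p.grad is not None:
            p.grad.div_(S)
    optimizer.step()
    return loss.item()
```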
## Loss Scaling Free (Better)

| Format | Exponent Bits | Mantissa Bits | Dynamic Range (approx.) |
|--------|---------------|---------------|-------------------------|
| FP16   | 5             | 10            | 5.96e-8 to 65504        |
| BF16   | 8             | 7             | 1.18e-38 to 3.4e38      |
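These limits can be checked directly; note that `torch.finfo(...).tiny` reports the smallest positive *normal* value (about 6.1e-5 for FP16), while FP16 subnormals extend down to roughly 5.96e-8, which is the figure quoted in the table.

```python
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "smallest normal:", info.tiny, "max:", info.max)
```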
BF16 has the same 8 exponent bits as FP32, and therefore nearly the same dynamic range, so gradients rarely underflow, even without loss scaling. The tradeoff is less precision (7 vs. 10 mantissa bits), but for most deep learning tasks, BF16’s precision is sufficient.
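Going loss scaling free is then mostly a matter of deleting code. A minimal sketch, again assuming a PyTorch-style loop and BF16-capable hardware (the helper names are placeholders):

```python
import torch

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    # BF16 autocast: FP32-like dynamic range, so no loss scaling or unscaling step
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()    # gradients rarely underflow in BF16
    optimizer.step()
    return loss.item()
```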

