[NFC][Docs] Add documentation for NVPTX conversion intrinsics (#175536)

This change adds documentation for the NVPTX narrow floating-point conversion intrinsics. PTX ISA Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt
2026-03-03 15:39:20 +05:30 · 2026-03-03 15:39:20 +05:30 · b4fffcd8e4
commit b4fffcd8e4
parent 5d8c6c198d
1 changed files with 177 additions and 0 deletions
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1106,6 +1106,183 @@ If the given pointer in the generic address space refers to memory which falls
 within the state space of the intrinsic (and therefore could be safely address
 space casted to this space), 1 is returned, otherwise 0 is returned.

+Narrow Floating-Point Conversion Intrinsics
+-------------------------------------------
+
+These intrinsics perform conversions involving narrow floating-point formats.
+The following table describes the rounding modes used across these intrinsics:
+
+.. _narrow-fp-rounding-modes:
+
+.. table:: Narrow Floating-Point Conversion Rounding Modes
+   :widths: 30 60
+
+   +-----------------------+---------------------------------------------------+
+   | Rounding Mode         | Description                                       |
+   +=======================+===================================================+
+   |``rn`` (default)       | Round to nearest, with ties to even               |
+   +-----------------------+---------------------------------------------------+
+   |``rz``                 | Round towards zero                                |
+   +-----------------------+---------------------------------------------------+
+   |``rp``                 | Round towards positive infinity                   |
+   +-----------------------+---------------------------------------------------+
+   |``rs``                 | Stochastic rounding, achieved through the use of  |
+   |                       | the supplied random bits (``%rnd_bits``). The     |
+   |                       | result is rounded towards zero or away from zero  |
+   |                       | based on the carry out of the integer addition    |
+   |                       | of the random bits to the discarded mantissa      |
+   |                       | bits of the input.                                |
+   +-----------------------+---------------------------------------------------+
+
+``fp8`` Conversion Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to{.e4m3x2, .e5m2x2}.rn{.relu}(float %a, float %b)
+    declare i16 @llvm.nvvm.ff.to.ue8m0x2{.rz, .rp}{.satfinite}(float %a, float %b)
+    declare i16 @llvm.nvvm.f16x2.to{.e4m3x2, .e5m2x2}.rn{.relu}(<2 x half> %a)
+    declare i16 @llvm.nvvm.bf16x2.to{.e4m3x2, .e5m2x2}.rn{.relu}.satfinite(<2 x bfloat> %a)
+    declare i16 @llvm.nvvm.bf16x2.to.ue8m0x2{.rz, .rp}{.satfinite}(<2 x bfloat> %a)
+    declare <2 x half> @llvm.nvvm{.e4m3x2, .e5m2x2}.to.f16x2.rn{.relu}(i16 %a)
+    declare <2 x bfloat> @llvm.nvvm.ue8m0x2.to.bf16x2(i16 %a)
+    declare <4 x i8> @llvm.nvvm.f32x4.to{.e4m3x4, .e5m2x4}.rs{.relu}.satfinite(<4 x float> %a, i32 %rnd_bits)
+
+Overview:
+"""""""""
+
+These intrinsics perform conversions involving the ``e4m3``, ``e5m2``, and
+``ue8m0`` narrow floating-point formats. For intrinsics that take two inputs,
+the value converted from ``%a`` is stored in the upper 8 bits of the result,
+and the value converted from ``%b`` is stored in the lower 8 bits.
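+
+As a minimal sketch of the packing described above, the two converted values
+could be separated as follows. The function name ``@pack_e4m3x2`` is
+hypothetical, and the intrinsic name is one expansion of the syntax listed
+earlier:
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.e4m3x2.rn(float, float)
+
+    define i8 @pack_e4m3x2(float %a, float %b) {
+      ; Convert both inputs; the result packs two 8-bit e4m3 values.
+      %packed = call i16 @llvm.nvvm.ff.to.e4m3x2.rn(float %a, float %b)
+      ; Bits 15:8 hold the value converted from %a.
+      %hi.shift = lshr i16 %packed, 8
+      %from.a = trunc i16 %hi.shift to i8
+      ; Bits 7:0 hold the value converted from %b (unused in this sketch).
+      %from.b = trunc i16 %packed to i8
+      ret i8 %from.a
+    }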
+
+For rounding modes, see :ref:`narrow-fp-rounding-modes`.
+
+The ``relu`` modifier clamps negative results to 0.
+
+When ``satfinite`` is specified and the magnitude of the input is greater than
+``MAX_NORM`` of the destination format, the result is ``MAX_NORM`` of the
+destination format with the sign of the input preserved. For ``.ue8m0x2``
+destinations, which have no sign bit, the result is positive ``MAX_NORM``.
+If the input is ``NaN``, the result is ``NaN`` in the destination format.
+The ``satfinite`` modifier is implicitly assumed for conversions whose
+destination type is ``e4m3`` or ``e5m2``.
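+
+The rounding mode and the ``relu`` and ``satfinite`` modifiers are selected
+through the intrinsic name. The following is a hedged sketch rather than a
+definitive usage pattern: the intrinsic names are expansions of the syntax
+above, and the wrapper function names are hypothetical.
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.e5m2x2.rn.relu(float, float)
+    declare i16 @llvm.nvvm.ff.to.ue8m0x2.rz.satfinite(float, float)
+    declare <4 x i8> @llvm.nvvm.f32x4.to.e4m3x4.rs.relu.satfinite(<4 x float>, i32)
+
+    ; Round to nearest even; relu clamps negative results to 0.
+    define i16 @cvt_e5m2x2_relu(float %a, float %b) {
+      %r = call i16 @llvm.nvvm.ff.to.e5m2x2.rn.relu(float %a, float %b)
+      ret i16 %r
+    }
+
+    ; Round towards zero; out-of-range inputs saturate to MAX_NORM.
+    define i16 @cvt_ue8m0x2_sat(float %a, float %b) {
+      %r = call i16 @llvm.nvvm.ff.to.ue8m0x2.rz.satfinite(float %a, float %b)
+      ret i16 %r
+    }
+
+    ; Stochastic rounding; %rnd_bits supplies the random bits.
+    define <4 x i8> @cvt_e4m3x4_rs(<4 x float> %v, i32 %rnd_bits) {
+      %r = call <4 x i8> @llvm.nvvm.f32x4.to.e4m3x4.rs.relu.satfinite(<4 x float> %v, i32 %rnd_bits)
+      ret <4 x i8> %r
+    }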
+
+For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt>`__.
+
+``s2f6`` Conversion Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.s2f6x2.rn{.relu}.satfinite.scale.n2.ue8m0(float %a, float %b, i16 %scale_factor)
+    declare i16 @llvm.nvvm.bf16x2.to.s2f6x2.rn{.relu}.satfinite.scale.n2.ue8m0(<2 x bfloat> %a, i16 %scale_factor)
+    declare <2 x bfloat> @llvm.nvvm.s2f6x2.to.bf16x2.rn{.relu}{.satfinite}.scale.n2.ue8m0(i16 %a, i16 %scale_factor)
+
+Overview:
+"""""""""
+
+These intrinsics perform conversions involving the ``s2f6`` narrow
+floating-point format. For intrinsics that take two inputs, the value
+converted from ``%a`` is stored in the upper 8 bits of the result, and the
+value converted from ``%b`` is stored in the lower 8 bits.
+
+For rounding modes, see :ref:`narrow-fp-rounding-modes`.
+
+The ``relu`` modifier clamps negative results to 0.
+
+When ``satfinite`` is specified and the magnitude of the input is greater than
+``MAX_NORM`` of the destination format, the result is ``MAX_NORM`` of the
+destination format with the sign of the input preserved. If the input is
+``NaN``, the result is the positive ``MAX_NORM`` of the destination format.
+
+The operand ``%scale_factor`` holds two packed scaling factors of type
+``ue8m0``, one for each input. For down-conversion, each input is divided by
+its scaling factor before the conversion is performed. For up-conversion, each
+input is converted to the destination type and then multiplied by its scaling
+factor.
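+
+A minimal round-trip sketch of the scaling behaviour follows. It assumes the
+usual ``ue8m0`` encoding, in which the byte value 127 represents a scale of
+``2^0``, so the packed constant ``0x7F7F`` applies a scale of 1.0 to both
+inputs. The intrinsic names are expansions of the syntax above, and
+``@s2f6_round_trip`` is a hypothetical name.
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.s2f6x2.rn.satfinite.scale.n2.ue8m0(float, float, i16)
+    declare <2 x bfloat> @llvm.nvvm.s2f6x2.to.bf16x2.rn.scale.n2.ue8m0(i16, i16)
+
+    define <2 x bfloat> @s2f6_round_trip(float %a, float %b) {
+      ; 32639 = 0x7F7F: two packed ue8m0 bytes, each assumed to encode 2^0.
+      ; Down-conversion: inputs are divided by the scale, then converted.
+      %packed = call i16 @llvm.nvvm.ff.to.s2f6x2.rn.satfinite.scale.n2.ue8m0(float %a, float %b, i16 32639)
+      ; Up-conversion: values are converted, then multiplied by the scale.
+      %back = call <2 x bfloat> @llvm.nvvm.s2f6x2.to.bf16x2.rn.scale.n2.ue8m0(i16 %packed, i16 32639)
+      ret <2 x bfloat> %back
+    }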
+
+For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt>`__.
+
+``fp6`` Conversion Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to{.e2m3x2, .e3m2x2}.rn{.relu}.satfinite(float %a, float %b)
+    declare i16 @llvm.nvvm.f16x2.to{.e2m3x2, .e3m2x2}.rn{.relu}.satfinite(<2 x half> %a)
+    declare i16 @llvm.nvvm.bf16x2.to{.e2m3x2, .e3m2x2}.rn{.relu}.satfinite(<2 x bfloat> %a)
+    declare <2 x half> @llvm.nvvm{.e2m3x2, .e3m2x2}.to.f16x2.rn{.relu}(i16 %a)
+    declare <4 x i8> @llvm.nvvm.f32x4.to{.e2m3x4, .e3m2x4}.rs{.relu}.satfinite(<4 x float> %a, i32 %rnd_bits)
+
+Overview:
+"""""""""
+
+These intrinsics perform conversions involving the ``e2m3`` and ``e3m2`` narrow
+floating-point formats. For intrinsics that take two inputs, the value
+converted from ``%a`` is stored in the upper 8 bits of the result, and the
+value converted from ``%b`` is stored in the lower 8 bits. In both cases, the
+2 MSBs of the 8-bit field are padded with 0s.
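+
+To illustrate the layout described above, the following sketch checks that the
+padding bits of a packed ``e2m3x2`` result are zero. The function name
+``@e2m3x2_padding_is_zero`` is hypothetical, and the intrinsic name is one
+expansion of the syntax above:
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.e2m3x2.rn.satfinite(float, float)
+
+    define i1 @e2m3x2_padding_is_zero(float %a, float %b) {
+      %packed = call i16 @llvm.nvvm.ff.to.e2m3x2.rn.satfinite(float %a, float %b)
+      ; 49344 = 0xC0C0 selects the 2 MSBs of each byte, which are padding.
+      %pad = and i16 %packed, 49344
+      ; Per the description above, the padding bits are expected to be 0.
+      %ok = icmp eq i16 %pad, 0
+      ret i1 %ok
+    }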
+
+For rounding modes, see :ref:`narrow-fp-rounding-modes`.
+
+The ``relu`` modifier clamps negative results to 0.
+
+When ``satfinite`` is specified and the magnitude of the input is greater than
+``MAX_NORM`` of the destination format, the result is ``MAX_NORM`` of the
+destination format with the sign of the input preserved. If the input is
+``NaN``, the result is the positive ``MAX_NORM`` of the destination format.
+
+For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt>`__.
+
+``fp4`` Conversion Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.e2m1x2.rn{.relu}.satfinite(float %a, float %b)
+    declare i16 @llvm.nvvm.f16x2.to.e2m1x2.rn{.relu}.satfinite(<2 x half> %a)
+    declare i16 @llvm.nvvm.bf16x2.to.e2m1x2.rn{.relu}.satfinite(<2 x bfloat> %a)
+    declare <2 x half> @llvm.nvvm.e2m1x2.to.f16x2.rn{.relu}(i16 %a)
+    declare i16 @llvm.nvvm.f32x4.to.e2m1x4.rs{.relu}.satfinite(<4 x float> %a, i32 %rnd_bits)
+
+Overview:
+"""""""""
+
+These intrinsics perform conversions involving the ``e2m1`` narrow
+floating-point format. For conversions involving ``e2m1x2``, the packed
+``e2m1x2`` value is stored in the lower byte of the ``i16`` argument or result.
+For intrinsics that take two inputs, the value converted from ``%a`` is stored
+in the upper 4 bits of that byte, and the value converted from ``%b`` is
+stored in the lower 4 bits.
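+
+As a small sketch of this layout, the packed pair could be narrowed to a single
+byte as shown below. The function name ``@pack_e2m1x2`` is hypothetical, and
+the intrinsic name is one expansion of the syntax above:
+
+.. code-block:: llvm
+
+    declare i16 @llvm.nvvm.ff.to.e2m1x2.rn.satfinite(float, float)
+
+    define i8 @pack_e2m1x2(float %a, float %b) {
+      %r = call i16 @llvm.nvvm.ff.to.e2m1x2.rn.satfinite(float %a, float %b)
+      ; The packed pair occupies the lower byte of the i16 result:
+      ; bits 7:4 come from %a and bits 3:0 come from %b.
+      %byte = trunc i16 %r to i8
+      ret i8 %byte
+    }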
+
+For rounding modes, see :ref:`narrow-fp-rounding-modes`.
+
+The ``relu`` modifier clamps negative results to 0.
+
+When ``satfinite`` is specified and the magnitude of the input is greater than
+``MAX_NORM`` of the destination format, the result is ``MAX_NORM`` of the
+destination format with the sign of the input preserved. If the input is
+``NaN``, the result is the positive ``MAX_NORM`` of the destination format.
+
+For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cvt>`__.
+
 Arithmetic Intrinsics
 ---------------------