369 Commits

Author SHA1 Message Date
Dmitry Sidorov
4e95be7043
[RFC][SPIR-V] Add intrinsics to convert to/from ap.float (#164252)
The patch adds two intrinsics: llvm.convert.to.arbitrary.fp and
llvm.convert.from.arbitrary.fp.

The intrinsics perform conversions between values whose interpretation
differs from their representation in LLVM IR. The intrinsics are
overloaded on both its return type and first argument. Metadata operands
describe how the raw bits should be interpreted before and after the
conversion.

Typical use case is to convert IEEE-754 floating point types to FP8/FP4
and backwards for ML applications.

Addresses
https://discourse.llvm.org/t/rfc-spir-v-way-to-represent-float8-in-llvm-ir/87758/10
2026-01-14 16:53:53 +01:00
Yao Hwang
a24784ce11
[ADT][NFC] Expose fltSemantics struct (#175676)
This patch moves the `fltSemantics` struct definition (along with
`fltNonfiniteBehavior`
and `fltNanEncoding` enums) from APFloat.cpp to APFloat.h, making them
part of the
public API.

Currently, downstream projects cannot define custom floating-point
semantics because
`fltSemantics` is an opaque forward declaration in the header. This
forces projects
with specialized float formats to either patch LLVM locally or request
new formats be added upstream for each variant. By exposing the struct,
downstream users can define their own semantics.
2026-01-14 10:17:35 +01:00
Ramkumar Ramachandra
d69335bac9
[LLVM] Clean up code using [not_]equal_to (NFC) (#175824)
Use llvm::[not_]equal_to landed in d2a521750 ([ADT] Introduce
bind_{front,back}, [not_]equal_to, #175056) across LLVM for cleaner
code.
2026-01-13 21:19:39 +00:00
Mehdi Amini
efd9dc83f2
Revert "[APFloat] Add exp function for APFloat::IEEESsingle using expf implementation from LLVM libc. (#143959)" (#172325)
This reverts commit 4190d576823c18f45ee0632baee7d798448178ac.

See https://lab.llvm.org/buildbot/#/builders/181/builds/33524

```
                 from /home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/lib/Support/APFloat.cpp:32:
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/math/asin_utils.h:548:1:   in ‘constexpr’ expansion of ‘__llvm_libc::fputil::DyadicFloat<128>(__llvm_libc::Sign::POS, -127, __llvm_libc::BigInt<128, false, long unsigned int>(__llvm_libc::operator""_u128(((const char*)"0x80000000\'00000000\'00000000\'00000000"))))’
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/FPUtil/dyadic_float.h:109:5:   in ‘constexpr’ expansion of ‘((__llvm_libc::fputil::DyadicFloat<128>*)this)->__llvm_libc::fputil::DyadicFloat<128>::normalize()’
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/FPUtil/dyadic_float.h:118:16:   in ‘constexpr’ expansion of ‘((__llvm_libc::fputil::DyadicFloat<128>*)this)->__llvm_libc::fputil::DyadicFloat<128>::mantissa.__llvm_libc::BigInt<128, false, long unsigned int>::operator<<=(((size_t)shift_length))’
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/big_int.h:829:52:   in ‘constexpr’ expansion of ‘__llvm_libc::multiword::shift<__llvm_libc::multiword::LEFT, false, long unsigned int, 2>(((__llvm_libc::BigInt<128, false, long unsigned int>*)this)->__llvm_libc::BigInt<128, false, long unsigned int>::val, s)’
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/big_int.h:264:35: error: ‘constexpr __llvm_libc::cpp::enable_if_t<((((sizeof (To) == sizeof (From)) && __llvm_libc::cpp::is_trivially_constructible<To>::value) && __llvm_libc::cpp::is_trivially_copyable<T>::value) && __llvm_libc::cpp::is_trivially_copyable<From>::value), To> __llvm_libc::cpp::bit_cast(const From&) [with To = __int128 unsigned; From = __llvm_libc::cpp::array<long unsigned int, 2>; __llvm_libc::cpp::enable_if_t<((((sizeof (To) == sizeof (From)) && __llvm_libc::cpp::is_trivially_constructible<To>::value) && __llvm_libc::cpp::is_trivially_copyable<T>::value) && __llvm_libc::cpp::is_trivially_copyable<From>::value), To> = __int128 unsigned]’ called in a constant expression
  264 |     auto tmp = cpp::bit_cast<type>(array);
      |                ~~~~~~~~~~~~~~~~~~~^~~~~~~
                 from /home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/lib/Support/APFloat.cpp:32:
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/CPP/bit.h:48:1: note: ‘constexpr __llvm_libc::cpp::enable_if_t<((((sizeof (To) == sizeof (From)) && __llvm_libc::cpp::is_trivially_constructible<To>::value) && __llvm_libc::cpp::is_trivially_copyable<T>::value) && __llvm_libc::cpp::is_trivially_copyable<From>::value), To> __llvm_libc::cpp::bit_cast(const From&) [with To = __int128 unsigned; From = __llvm_libc::cpp::array<long unsigned int, 2>; __llvm_libc::cpp::enable_if_t<((((sizeof (To) == sizeof (From)) && __llvm_libc::cpp::is_trivially_constructible<To>::value) && __llvm_libc::cpp::is_trivially_copyable<T>::value) && __llvm_libc::cpp::is_trivially_copyable<From>::value), To> = __int128 unsigned]’ is not usable as a ‘constexpr’ function because:
   48 | bit_cast(const From &from) {
      | ^~~~~~~~
/home/buildbot/buildbot-root/cross-project-tests-sie-ubuntu/llvm-project/llvm/../libc/src/__support/CPP/bit.h:56:28: error: call to non-‘constexpr’ function ‘void __llvm_libc::cpp::inline_copy(const char*, char*) [with unsigned int N = 16]’
   56 |   inline_copy<sizeof(From)>(src, dst);
```
2025-12-15 16:57:37 +01:00
lntue
4190d57682
[APFloat] Add exp function for APFloat::IEEESsingle using expf implementation from LLVM libc. (#143959)
Discourse RFC:
https://discourse.llvm.org/t/rfc-make-clang-builtin-math-functions-constexpr-with-llvm-libc-to-support-c-23-constexpr-math-functions/86450

- The implementation in LLVM libc is header-only.
- expf implementation in LLVM libc is correctly rounded for all rounding
modes.
- LLVM libc implementation will round to the floating point
environment's rounding mode.
- No cmake build dependency between LLVM and LLVM libc, only requires
LLVM libc source presents in llvm-project/libc folder.
2025-12-15 10:21:45 -05:00
Kazu Hirata
0f76fbf546
[Support] Remove redundant declarations (NFC) (#165971)
In C++17, static constexpr members are implicitly inline, so they no
longer require an out-of-line definition.

Identified with readability-redundant-declaration.
2025-11-01 09:25:12 -07:00
Kazu Hirata
50a37c0226
[llvm] Migrate away from a soft-deprecated constructor of APInt (NFC) (#165164)
We have:

/// Once all uses of this constructor are migrated to other
constructors,
/// consider marking this overload ""= delete" to prevent calls from
being
/// incorrectly bound to the APInt(unsigned, uint64_t, bool)
constructor.
LLVM_ABI APInt(unsigned numBits, unsigned numWords, const uint64_t
bigVal[]);

This patch migrates away from this soft-deprecated constructor.
2025-10-26 15:20:00 -07:00
Kazu Hirata
d43ad92e20
[ADT, Support] Use std::min and std::max (NFC) (#164145) 2025-10-19 12:37:43 -07:00
Yingwei Zheng
035f811388
[APFloat] Outline special member functions (#164073)
As discussed in
https://github.com/llvm/llvm-project/pull/111544#issuecomment-3405281695,
large special member functions in APFloat prevent function inlining and
cause compile-time regression. This patch moves them into the cpp file.

Compile-time improvement (-0.1%):
https://llvm-compile-time-tracker.com/compare.php?from=0f68dc6cffd93954188f73bff8aced93aab63687&to=d3105c0860920651a7e939346e67c040776b2278&stat=instructions:u
2025-10-18 22:48:05 +08:00
Yingwei Zheng
0f68dc6cff
[APFloat] Inline static getters (#163794)
This patch exposes the declaration of fltSemantics to inline
PPCDoubleDouble() calls in the IEEEFloat/DoubleAPFloat dispatch.
It slightly improves the compile time:
https://llvm-compile-time-tracker.com/compare.php?from=f4359301c033694d36865c7560714164d2050240&to=68de94d77d5bd33603193e8769829345b18fbae3&stat=instructions:u
With https://github.com/llvm/llvm-project/pull/111544, the improvement
is more significant:
https://llvm-compile-time-tracker.com/compare.php?from=e438bae71d1fd55640d942b9ad795de2f60e44f2&to=04751477940890c092dc4edb74e284de8f746d5a&stat=instructions:u
 
Address comment
https://github.com/llvm/llvm-project/pull/111544#issuecomment-3405281695.

If breaking changes are allowed, we can encode all the properties of
fltSemantics within a 64-bit integer. Then we don't need `Semantics <->
const fltSemantic` conversion.
2025-10-18 17:07:25 +08:00
Kazu Hirata
18997b5a8f
[ADT] Fix a bug in DoubleAPFloat::frexp (#161625)
Without this patch, we call APFloat::makeQuiet() in frexp like so:

  Quiet.getFirst().makeQuiet();

The problem is that makeQuiet returns a new value instead of modifying
"*this" in place, so we end up discarding the newly returned value.

This patch fixes the problem by assigning the result back to
Quiet.getFirst().

We should put [[nodiscard]] on APFloat::makeQuiet, but I'll do that in
another patch.
2025-10-02 09:26:46 -07:00
David Majnemer
45f3263c6f [APFloat] Properly implement DoubleAPFloat::convertFromAPInt
The old implementation converted to the legacy semantics, inducing
rounding and not properly handling inputs like (2^1000 + 2^200) which
have have more precision than the legacy semantics can represent.

Instead, we convert the integer into two floats and an error.  The error
is used to implement the rounding behavior.

Remove related dead, untested code: convertFrom*ExtendedInteger
2025-08-24 15:47:54 -07:00
David Majnemer
f961b61f88 [APFloat] Properly implement DoubleAPFloat::compareAbsoluteValue
The prior implementation would treat X+Y and X-Y as having equal
magnitude.  Rework the implementation to be more resilient.
2025-08-21 15:42:07 -07:00
David Majnemer
0a7eabcc56 Reapply "[APFloat] Fix getExactInverse for DoubleAPFloat"
The previous implementation of getExactInverse used the following check
to identify powers of two:

  // Check that the number is a power of two by making sure that only the
  // integer bit is set in the significand.
  if (significandLSB() != semantics->precision - 1)
    return false;

This condition verifies that the only set bit in the significand is the
integer bit, which is correct for normal numbers. However, this logic is
not correct for subnormal values.

APFloat represents subnormal numbers by shifting the significand right
while holding the exponent at its minimum value. For a power of two in
the subnormal range, its single set bit will therefore be at a position
lower than precision - 1. The original check would consequently fail,
causing the function to determine that these numbers do not have an
exact multiplicative inverse.

The new logic calculated this correctly but it seems that
test/CodeGen/Thumb2/mve-vcvt-fixed-to-float.ll expected the old
behavior.

Seeing as how getExactInverse does not have tests or documentation, we
conservatively maintain (and document) this behavior.

This reverts commit 47e62e846beb267aad50eb9195dfd855e160483e.
2025-08-20 14:02:36 -07:00
Aiden Grossman
47e62e846b Revert "[APFloat] Fix getExactInverse for DoubleAPFloat"
This reverts commit f4941319cba19d7691baa6ec783c84be4d847637.

This broke llvm/test/CodeGen/Thumb2/mve-vcvt-fixed-to-float.ll which
took out a ton of buildbots and also broke premerge.
2025-08-14 04:39:50 +00:00
David Majnemer
f4941319cb [APFloat] Fix getExactInverse for DoubleAPFloat
Some background: getExactInverse()'s callers expect that
the result is not subnormal.

DoubleAPFloat implemented getExactInverse() by going through
semPPCDoubleDoubleLegacy.

This means that numbers like 0x1p1022 which would have a normal inverse
in semPPCDoubleDouble would not in semPPCDoubleDoubleLegacy.

This commit refactors the logic into a single method on APFloat which
uses getExactLog2Abs() and scalbn() to calculate the inverse without
having to compute a reciprocal and test if it is inexact.
This approach works for both IEEEFloat and DoubleAPFloat.
2025-08-13 11:39:53 -07:00
David Majnemer
acef1db3b2 [APFloat] Remove some overly optimistic assertions
An earlier draft of DoubleAPFloat::convertToSignExtendedInteger had
arranged for overflow to be handled in a different way.  However, these
assertions are now possible if Hi+Lo are out of range and Lo != 0.

A test has been added to defend against a regression.
2025-08-12 18:32:58 -07:00
David Majnemer
f6d143fd1f [APFloat] Properly implement frexp(DoubleAPFloat)
The prior implementation did not consider that the Lo component may
underflow when it undergoes scaling.  This means that we need to
carefully handle things like binade crossings or how to handle
roundTowardZero when Hi and Lo have different signs.

Particularly annoying is roundTiesToAway when Hi and Lo have different
signs.  It basically requires us to implement roundTiesTowardZero.
2025-08-12 17:03:27 -07:00
David Majnemer
e722ef4956 Reapply "[APFloat] Properly implement DoubleAPFloat::convertToSignExtendedInteger"
This reverts commit 8b44945a9231d4d7be0858a1c5d9c13d397bc512.

The compilation failure under !NDEBUG has been fixed.
2025-08-12 17:01:49 -07:00
Kazu Hirata
8b44945a92 Revert "[APFloat] Properly implement DoubleAPFloat::convertToSignExtendedInteger"
This reverts commit 052c38be824d9dabb1e8fb64c1c7c3908d786e83.

I'm getting:

llvm/lib/Support/APFloat.cpp:5627:29: error:
use of undeclared identifier 'Parts'
 5627 |     assert(DstPartsCount <= Parts.size() && "Integer too big");
      |                             ^
1 error generated.
2025-08-10 00:36:29 -07:00
David Majnemer
052c38be82 [APFloat] Properly implement DoubleAPFloat::convertToSignExtendedInteger
Use DoubleAPFloat::roundToIntegral to get a pair of APFloat values which
hold integral values.  Then we sum the pair, taking care to make sure
that we handle edge cases like (hi=2^128, lo=-1) and ensuring that they
fit in an unsigned i128.
2025-08-09 23:23:01 -07:00
David Majnemer
0a23b22d1d [APFloat] Properly implement DoubleAPFloat::roundToIntegral
The previous implementation did not correctly handle double-doubles like
0x1p100 + 0x1p1 as the low order component would need more than a
106-bit significand to represent.
2025-08-06 10:42:07 -07:00
David Majnemer
9fdb5e1fef [APFloat] Properly implement next for DoubleAPFloat
Rather than converting to the legacy 106-bit format, perform next() on the
low APFloat. Of course, we need to renormalize the two APFloats if
either of the two constraints are violated:
1. abs(low) <= ulp(high)/2
2. high = rtne(high + low)

Should renormalization be needed, it will increment the high component
and set low to the smallest value which obeys these rules.
2025-08-01 12:34:33 -07:00
Rahul Joshi
a76bf4da53
[NFC][ADT/Support] Add {} for else when if body has {} (#140758) 2025-05-21 13:19:09 -07:00
Kazu Hirata
7470131b43
[llvm] Use StringRef::consume_front (NFC) (#139458) 2025-05-11 10:28:49 -07:00
Yingwei Zheng
e710a5a9f2
[InstCombine] Fold fneg/fabs patterns with ppc_f128 (#130557)
This patch is needed by
https://github.com/llvm/llvm-project/pull/130496.
2025-04-14 14:30:00 +08:00
beetrees
8b9c91ec7d
[APFloat] Fix IEEEFloat::addOrSubtractSignificand and IEEEFloat::normalize (#98721)
Fixes #63895
Fixes #104984

Before this PR, `addOrSubtractSignificand` presumed that the loss came
from the side being subtracted, and didn't handle the case where lhs ==
rhs and there was loss. This can occur during FMA. This PR fixes the
situation by correctly determining where the loss came from and handling
it appropriately.

Additionally, `normalize` failed to adjust the exponent when the
significand is zero but `lost_fraction != lfExactlyZero`. This meant
that the test case from #63895 was rounded incorrectly as the loss
wasn't adjusted to account for the exponent being below the minimum
exponent. This PR fixes this by only skipping the exponent adjustment if
the significand is zero and there was no lost fraction.

(Note to reviewer: I don't have commit access)
2025-03-10 09:44:27 +07:00
Peter Collingbourne
c72ebeeb7f
ADT: Switch to a raw pointer for DoubleAPFloat::Floats.
In order for the union APFloat::Storage to permit access to the
semantics field when another union member is stored there, all members
of Storage must be standard layout. This is not necessarily the case
for DoubleAPFloat which may be non-standard layout because there is no
requirement that its std::unique_ptr member is standard layout. Fix this
by converting Floats to a raw pointer.

Reviewers: arsenm

Reviewed By: arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/129981
2025-03-05 21:15:08 -08:00
Yingwei Zheng
44d1dbd24c
[X86][DAGCombiner] Skip x87 fp80 values in combineFMulOrFDivWithIntPow2 (#128618)
f80 is not a valid IEEE floating-point type.
Closes https://github.com/llvm/llvm-project/issues/128528.
2025-02-25 22:03:17 +08:00
Congcong Cai
ad56f6267f
[APFloat][NFC]extract fltSemantics::isRepresentableBy to header (#122636)
isRepresentableBy is useful to check float point type compatibility
2025-01-14 06:12:04 +08:00
Matthias Springer
f50ce316ec
[llvm][NFC] APFloat: Add missing semantics to enum (#117291)
* Add missing semantics to the `Semantics` enum.
* Move all documentation of the semantics to the header file.
* Also rename some functions for consistency.
2024-12-04 14:58:59 -08:00
Matthias Springer
309c890921
[llvm] APFloat: Add helpers to query NaN/inf semantics (#116315)
`APFloat` changes extracted from #116176 as per reviewer comments.
2024-11-16 11:48:05 +09:00
Matthias Springer
2f55de4e31
[llvm] APFloat: Query hasNanOrInf from semantics (#116158)
Whether a floating point type supports NaN or infinity can be queried
from its semantics. No need to hard-code a list of types.
2024-11-15 09:29:54 +09:00
Kazu Hirata
4048c64306
[llvm] Remove redundant control flow statements (NFC) (#115831)
Identified with readability-redundant-control-flow.
2024-11-12 10:09:42 -08:00
Sergey Kozub
7b308b18c3
Fix bitcasting E8M0 APFloat to APInt (#113298)
Fixes a bug in APFloat handling of E8M0 type (zero mantissa).

Related PRs:
- https://github.com/llvm/llvm-project/pull/107127
- https://github.com/llvm/llvm-project/pull/111028
2024-10-22 19:15:29 +02:00
Ariel-Burton
3b7091bcf3
[APFloat] add predicates to fltSemantics for hasZero and hasSignedRepr (#111451)
We add static methods to APFloatBase to allow the hasZero and
hasSignedRepr properties of fltSemantics to be obtained.
2024-10-09 08:37:40 -04:00
Yingwei Zheng
6004f5550c
[ADT][APFloat] Make sure EBO is performed on APFloat (#111641)
Since both APFloat and (Double)IEEEFloat inherit from APFloatBase, empty
base optimization is not performed by GCC/Clang (Minimal reproducer:
https://godbolt.org/z/dY8cM3Wre). This patch removes inheritance
relation between (Double)IEEEFloat and APFloatBase to make sure EBO is
performed on APFloat. After this patch, the size of `ConstantFPRange`
will be reduced from 72 to 56.

Address comment
https://github.com/llvm/llvm-project/pull/111544#discussion_r1792398427.
2024-10-09 16:39:02 +08:00
Durgadoss R
99f527d280
[APFloat] Add APFloat support for E8M0 type (#107127)
This patch adds an APFloat type for unsigned E8M0 format. This format is
used for representing the "scale-format" in the MX specification:
(section 5.4)

https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

This format does not support {Inf, denorms, zeroes}. Like FP32, this
format's exponents are 8-bits (all bits here) and the bias value is 127.
However, it differs from IEEE-FP32 in that the minExponent is -127
(instead of -126). There are updates done in the APFloat utility
functions to handle these constraints for this format.

* The bias calculation is different and convertIEEE* APIs are updated to
handle this.
* Since there are no significand bits, the isSignificandAll{Zeroes/Ones}
methods are updated accordingly.
* Although the format does not have any precision, the precision bit in
the fltSemantics is set to 1 for consistency with APFloat's internal
representation.
* Many utility functions are updated to handle the fact that this format
does not support Zero.
* Provide a separate initFromAPInt() implementation to handle the quirks
of the format.
* Add specific tests to verify the range of values for this format.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-10-02 23:04:21 +05:30
Yingwei Zheng
fa824dc0dd
[LLVM][IR] Add constant range support for floating-point types (#86483)
This patch adds basic constant range support for floating-point types to enable range-based optimizations.
2024-09-25 13:58:23 +08:00
NAKAMURA Takumi
3ef64f7ab5 Revert "Enable logf128 constant folding for hosts with 128bit long double (#104929)"
ConstantFolding behaves differently depending on host's `HAS_IEE754_FLOAT128`.
LLVM should not change the behavior depending on host configurations.

This reverts commit 14c7e4a1844904f3db9b2dc93b722925a8c66b27.
(llvmorg-20-init-3262-g14c7e4a18449 and llvmorg-20-init-3498-g001e423ac626)
2024-08-25 08:30:23 +09:00
Matthew Devereau
14c7e4a184
Enable logf128 constant folding for hosts with 128bit long double (#104929)
This is a reland of (#96287). This patch attempts to reduce the reverted
patch's clang compile time by removing #includes of float128.h and
inlining convertToQuad functions instead.
2024-08-22 10:12:59 +01:00
Nikita Popov
6300233de1 Revert "Reland logf128 constant folding (#103217)"
This reverts commit 3cab7c555ad6451f2b1b4dc918a4b4f4e4a3e45d.

The modified test fails on ppc64le buildbots.
2024-08-14 12:30:33 +02:00
Matthew Devereau
3cab7c555a
Reland logf128 constant folding (#103217)
This is a reland of #96287. This change makes tests in logf128.ll ignore
the sign of NaNs for negative value tests and moves an #include <cmath>
to be blocked behind #ifndef _GLIBCXX_MATH_H.
2024-08-14 08:55:52 +01:00
Nikita Popov
a15de17772 Revert "Enable logf128 constant folding for hosts with 128bit floats (#96287)"
This reverts commit ccb2b011e577e861254f61df9c59494e9e122b38.

Causes buildbot failures, e.g. on ppc64le builders.
2024-08-09 15:12:11 +02:00
Matthew Devereau
ccb2b011e5
Enable logf128 constant folding for hosts with 128bit floats (#96287)
Hosts which support a float size of 128 bits can benefit from constant
fp128 folding.
2024-08-09 11:12:43 +01:00
Alexander Pivovarov
abc2fe31fc
[APFloat] Add support for f8E3M4 IEEE 754 type (#99698)
This PR adds `f8E4M3` type to APFloat.

`f8E3M4` type  follows IEEE 754 convention

```c
f8E3M4 (IEEE 754)
- Exponent bias: 3
- Maximum stored exponent value: 6 (binary 110)
- Maximum unbiased exponent value: 6 - 3 = 3
- Minimum stored exponent value: 1 (binary 001)
- Minimum unbiased exponent value: 1 − 3 = −2
- Precision specifies the total number of bits used for the significand (mantissa), 
    including implicit leading integer bit = 4 + 1 = 5
- Follows IEEE 754 conventions for representation of special values
- Has Positive and Negative zero
- Has Positive and Negative infinity
- Has NaNs

Additional details:
- Max exp (unbiased): 3
- Min exp (unbiased): -2
- Infinities (+/-): S.111.0000
- Zeros (+/-): S.000.0000
- NaNs: S.111.{0,1}⁴ except S.111.0000
- Max normal number: S.110.1111 = +/-2^(6-3) x (1 + 15/16) = +/-2^3 x 31 x 2^(-4) = +/-15.5
- Min normal number: S.001.0000 = +/-2^(1-3) x (1 + 0) = +/-2^(-2)
- Max subnormal number: S.000.1111 = +/-2^(-2) x 15/16 = +/-2^(-2) x 15 x 2^(-4) = +/-15 x 2^(-6)
- Min subnormal number: S.000.0001 = +/-2^(-2) x 1/16 =  +/-2^(-2) x 2^(-4) = +/-2^(-6)
```

Related PRs:
- [PR-97179](https://github.com/llvm/llvm-project/pull/97179) [APFloat]
Add support for f8E4M3 IEEE 754 type
2024-07-30 00:11:10 -07:00
Alexander Pivovarov
f363317702
[APFloat] Add support for f8E4M3 IEEE 754 type (#97179)
This PR adds `f8E4M3` type to APFloat.

`f8E4M3` type  follows IEEE 754 convention

```c
f8E4M3 (IEEE 754)
- Exponent bias: 7
- Maximum stored exponent value: 14 (binary 1110)
- Maximum unbiased exponent value: 14 - 7 = 7
- Minimum stored exponent value: 1 (binary 0001)
- Minimum unbiased exponent value: 1 − 7 = −6
- Precision specifies the total number of bits used for the significand (mantisa), 
    including implicit leading integer bit = 3 + 1 = 4
- Follows IEEE 754 conventions for representation of special values
- Has Positive and Negative zero
- Has Positive and Negative infinity
- Has NaNs

Additional details:
- Max exp (unbiased): 7
- Min exp (unbiased): -6
- Infinities (+/-): S.1111.000
- Zeros (+/-): S.0000.000
- NaNs: S.1111.{001, 010, 011, 100, 101, 110, 111}
- Max normal number: S.1110.111 = +/-2^(7) x (1 + 0.875) = +/-240
- Min normal number: S.0001.000 = +/-2^(-6)
- Max subnormal number: S.0000.111 = +/-2^(-6) x 0.875 = +/-2^(-9) x 7
- Min subnormal number: S.0000.001 = +/-2^(-6) x 0.125 = +/-2^(-9)
```

Related PRs:
- [PR-97118](https://github.com/llvm/llvm-project/pull/97118) Add f8E4M3
IEEE 754 type to mlir
2024-07-17 23:33:52 -07:00
Ariel-Burton
bbc6504b3d
[NFC] [APFloat] Refactor IEEEFloat::toString (#97117)
This PR lifts the body of IEEEFloat::toString out to a standalone
function. We do this to facilitate code sharing with other floating
point types, e.g., the forthcoming support for HexFloat.

There is no change in functionality.
2024-07-04 08:43:45 -04:00
Alexander Pivovarov
2628a5fd24
Rename f8E4M3 to f8E4M3FN in mlir.extras.types py package (#97102)
Currently `f8E4M3` is mapped to `Float8E4M3FNType`.

This PR renames `f8E4M3` to `f8E4M3FN` to accurately reflect the actual
type.

This PR is needed to avoid names conflict in upcoming PR which will add
IEEE 754 `Float8E4M3Type`.
https://github.com/llvm/llvm-project/pull/97118 Add f8E4M3 IEEE 754 type

Maksim, can you review this PR? @makslevental ?
2024-06-29 12:48:11 -07:00
Durgadoss R
880d37038c
[APFloat] Add APFloat support for FP4 data type (#95392)
This patch adds APFloat type support for the E2M1
FP4 datatype. The definitions for this format are
detailed in section 5.3.3 of the OCP specification,
which can be accessed here:

https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-06-14 14:17:37 +05:30