Always match ABD patterns pre-legalization, and use TargetLowering::expandABD to expand again during legalization.
abdu(lhs, rhs) -> sub(xor(sub(lhs, rhs), usub_overflow(lhs, rhs)), usub_overflow(lhs, rhs))
Alive2: https://alive2.llvm.org/ce/z/dVdMyv
REAPPLIED: fix a regression where the "abs(ext(x) - ext(y)) -> zext(abd(x, y))" fold failed after type legalization
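As a scalar C++ model of the expansion above (an illustration only, not the DAG code): the usub_overflow result, widened to an all-ones/zero mask, conditionally negates the wrapped difference.
```
#include <cassert>
#include <cstdint>

// Scalar model of
//   abdu(lhs, rhs) -> sub(xor(sub(lhs, rhs), usub_overflow(lhs, rhs)),
//                         usub_overflow(lhs, rhs))
// where the usub_overflow flag is sign-extended to an all-ones/zero mask.
uint32_t abdu_expanded(uint32_t lhs, uint32_t rhs) {
  uint32_t diff = lhs - rhs;            // sub(lhs, rhs), wraps on borrow
  uint32_t ovf = lhs < rhs ? ~0u : 0u;  // usub_overflow, widened to a mask
  return (diff ^ ovf) - ovf;            // negates diff exactly when lhs < rhs
}

int main() {
  assert(abdu_expanded(7, 10) == 3);
  assert(abdu_expanded(10, 7) == 3);
  assert(abdu_expanded(5, 5) == 0);
}
```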
For a __thread variable x, when emulated TLS is enabled and there is an
access to x, the compiler first looks up the symbol __emutls_v.x within
the module. However, when y is an alias of x, the compiler still tries
to look up __emutls_v.y instead of __emutls_v.x. As a result, the lookup
returns a nullptr, causing the compiler to crash. The purpose of this MR
(Merge Request) is to ensure that, in emulated TLS, before checking
__emutls_v.y, the compiler first identifies which global value y is an
alias of.
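A minimal sketch of the setup being described, using hypothetical names and the GNU alias attribute; aliases here are an IR-level concept, so the exact source spelling that produced one in the original report may differ. With emulated TLS enabled, the access through y must be lowered via __emutls_v.x (the control variable of the aliasee), since no __emutls_v.y is emitted.
```
#include <cstdio>

// x is a __thread variable; y is declared as an alias of x.
__thread int x = 0;
extern __thread int y __attribute__((alias("x")));

// With -femulated-tls this access must resolve to __emutls_v.x.
int readThroughAlias() { return y; }

int main() {
  x = 42;
  std::printf("%d\n", readThroughAlias());  // prints 42
}
```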
This produces better code on x86_64 only in the unordered case. I'm not
sure what the exact condition should be to avoid the regression; a free
fabs might do it, or it might require legality checks for the alternative
integer expansion.
Use a simpler lowering for exact udivs in both SelectionDAG and
GlobalISel.
The algorithm is the same for unsigned exact divisions as for signed
divisions, save for using a logical rather than an arithmetic shift for
even divisors, according to Hacker's Delight, 2nd Edition, page 242.
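A scalar sketch of the technique under the stated reference (the helper names below are made up, not from the patch), assuming the division is known to be exact:
```
#include <cassert>
#include <cstdint>

// Multiplicative inverse of an odd value modulo 2^32 via Newton-Raphson;
// each step doubles the number of correct low bits (3 -> 6 -> ... -> 96).
static uint32_t inverseMod2Pow32(uint32_t d) {
  uint32_t inv = d;  // d * d == 1 (mod 8), so the seed is correct to 3 bits
  for (int i = 0; i < 5; ++i)
    inv *= 2 - d * inv;
  return inv;
}

// Exact unsigned division by d: strip d's trailing zeros with a *logical*
// shift, then multiply by the inverse of its odd part.  The signed variant
// is identical except that it uses an arithmetic shift for even divisors.
uint32_t exactUDiv(uint32_t x, uint32_t d) {
  unsigned k = __builtin_ctz(d);  // trailing zero count of the divisor
  return (x >> k) * inverseMod2Pow32(d >> k);
}

int main() {
  assert(exactUDiv(300, 12) == 25);  // 300 is an exact multiple of 12
  assert(exactUDiv(0xFFFFFFF0u, 16) == 0x0FFFFFFFu);
}
```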
This PR adds a new vector intrinsic `@llvm.experimental.vector.compress`
to "compress" data within a vector based on a selection mask, i.e., it
moves all selected values (i.e., where `mask[i] == 1`) to consecutive
lanes in the result vector. A `passthru` vector can be provided, from
which remaining lanes are filled.
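A scalar C++ model of the intended semantics on a fixed-width vector (reference behaviour only, not the lowering):
```
#include <array>
#include <cassert>
#include <cstddef>

// Reference semantics for the compress operation: selected lanes are
// packed into consecutive low lanes of the result, and the remaining
// lanes are taken from the passthru vector.
template <typename T, std::size_t N>
std::array<T, N> vectorCompress(const std::array<T, N> &vec,
                                const std::array<bool, N> &mask,
                                const std::array<T, N> &passthru) {
  std::array<T, N> result = passthru;
  std::size_t out = 0;
  for (std::size_t i = 0; i < N; ++i)
    if (mask[i])
      result[out++] = vec[i];  // move selected values to consecutive lanes
  return result;
}

int main() {
  std::array<int, 4> v{10, 20, 30, 40};
  std::array<bool, 4> m{true, false, true, false};
  std::array<int, 4> p{-1, -1, -1, -1};
  auto r = vectorCompress(v, m, p);
  assert(r[0] == 10 && r[1] == 30 && r[2] == -1 && r[3] == -1);
}
```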
The main reason for this is that the existing
`@llvm.masked.compressstore` has very strong constraints in that it can
only write values that were selected, resulting in guard branches for
all targets except AVX-512 (and even there the AMD implementation is
_very_ slow). More instruction sets support "compress" logic, but only
within registers. So to store the values, an additional store is needed.
But this combination is likely significantly faster on many targets, as
it avoids branches.
In follow-up PRs, my plan is to add target-specific lowerings for x86,
SVE, and possibly RISCV. I also want to combine this with a store
instruction, as this is probably a common case and we can avoid some
memory writes in that case.
See the [discussion on
Discourse](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663)
for the initial design discussion.
The previous expansion of [US]CMP was done using two selects and two
compares. It produced decent code, but on many platforms it is better to
implement [US]CMP nodes by performing the following operation:
```
[us]cmp(x, y) = (x [us]> y) - (x [us]< y)
```
This patch adds this new expansion, as well as a hook in TargetLowering to allow some targets to still use the select-based approach. AArch64 and SystemZ are currently the only targets that prefer the select-based expansion, but other targets may also start to use it if it provides better codegen.
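In scalar C++ terms (a sketch of the expansion above, not the DAG implementation), with the usual -1/0/+1 result:
```
#include <cassert>
#include <cstdint>

// scmp(x, y) = (x s> y) - (x s< y), yielding +1 / 0 / -1.
int8_t scmpExpanded(int32_t x, int32_t y) {
  return static_cast<int8_t>((x > y) - (x < y));
}

// ucmp(x, y) = (x u> y) - (x u< y), with unsigned comparisons.
int8_t ucmpExpanded(uint32_t x, uint32_t y) {
  return static_cast<int8_t>((x > y) - (x < y));
}

int main() {
  assert(scmpExpanded(-5, 3) == -1);
  assert(scmpExpanded(7, 7) == 0);
  assert(ucmpExpanded(0xFFFFFFFFu, 3) == 1);  // unsigned: 0xFFFFFFFF > 3
}
```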
When looking for the largest legal integer type for a target,
`TargetLowering::findOptimalMemOpLowering` assumes that `MVT::i64` is
the largest possible integer type. The patch removes this assumption and
uses `MVT::LAST_INTEGER_VALUETYPE` instead.
#97645 proposed to remove LegalTypes from getShiftAmountTy. This patch
removes it from getShiftAmountConstant, which is one of the callers of
getShiftAmountTy.
As far as I can tell, this pull request was not approved, and
did not go through an RFC on discourse.
This reverts commit 89881480030f48f83af668175b70a9798edca2fb.
This reverts commit 225d8fc8eb24fb797154c1ef6dcbe5ba033142da.
Currently, the behavior of llvm.minnum differs across platforms when one
operand is sNaN. When we compare sNaN vs NUM:
- ARM/AArch64/PowerPC: follow IEEE754-2008's minNum and return qNaN.
- RISC-V/Hexagon: follow IEEE754-2019's minimumNumber and return NUM.
- X86: returns NUM, but does not match IEEE754-2019's minimumNumber, as
  +0.0 is not always treated as greater than -0.0.
- MIPS/LoongArch/Generic: return NUM.
- LIBCALL: returns qNaN.
So, let's introduce llvm.minimumnum/llvm.maximumnum, which always follow
IEEE754-2019's minimumNumber/maximumNumber.
Half-fix: #93033
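A scalar C++ sketch of the IEEE754-2019 minimumNumber semantics the new intrinsic pins down (ignoring exception flags; this is only an illustration, not the lowering): a NaN operand, quiet or signaling, loses to a number, and -0.0 orders below +0.0.
```
#include <cassert>
#include <cmath>
#include <limits>

// IEEE754-2019 minimumNumber: a NaN operand loses to a number, both-NaN
// returns qNaN, and -0.0 is treated as smaller than +0.0.
double minimumNumber(double a, double b) {
  if (std::isnan(a))
    return std::isnan(b) ? std::numeric_limits<double>::quiet_NaN() : b;
  if (std::isnan(b))
    return a;
  if (a == b)  // only the +0.0 vs -0.0 case needs the sign check
    return std::signbit(a) ? a : b;
  return a < b ? a : b;
}

int main() {
  assert(minimumNumber(std::nan(""), 1.0) == 1.0);  // NaN vs NUM -> NUM
  assert(std::signbit(minimumNumber(+0.0, -0.0)));  // picks -0.0
}
```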
This PR adds initial support for the `scmp`/`ucmp` 3-way comparison
intrinsics in the SelectionDAG. Some of the expansions/lowerings
are not optimal yet.
Converting to avgfloor and then expanding it back to shift+add later is likely to prevent other folds (re-association and value-tracking in particular) in the meantime.
Fixes #95284
Always match AVG patterns pre-legalization, and use TargetLowering::expandAVG to expand again during legalization.
I've removed the X86 custom AVGCEILU pattern detection and replaced it with combines that try to convert other AVG nodes to AVGCEILU.
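For reference, the shift/logic identities that the AVG nodes correspond to, shown as a scalar sketch for the unsigned flavours (the signed forms are analogous; whether expandAVG uses exactly this form depends on the target and type):
```
#include <cassert>
#include <cstdint>

// avgflooru(a, b): (a + b) / 2 rounded down, without a wider intermediate.
uint8_t avgflooru(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>((a & b) + ((a ^ b) >> 1));
}

// avgceilu(a, b): (a + b + 1) / 2, i.e. rounded up.
uint8_t avgceilu(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>((a | b) - ((a ^ b) >> 1));
}

int main() {
  assert(avgflooru(250, 255) == 252);  // (250 + 255) / 2 rounds down to 252
  assert(avgceilu(250, 255) == 253);   // rounds up to 253
  assert(avgceilu(4, 4) == 4);
}
```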
First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client
wanted it expanded, and it can always be expanded. This logic was tied
up with what the VectorLegalizer wanted.
Prefer using the min/max opcodes, and unrolling if we don't have a
vselect.
This seems to produce better code in all the changed tests.
The getValidShiftAmountConstant/getValidMinimumShiftAmountConstant/getValidMaximumShiftAmountConstant helpers only worked with constant shift amounts, which could be problematic after type legalization (e.g. v2i64 might be partially scalarized or split into v4i32 on some targets such as 32-bit x86, Thumb2 MVE).
This patch proposes we generalize these helpers to work with ConstantRange+KnownBits if a scalar/buildvector constant isn't available.
Most restrictions are the same: the helper fails if any shift amount is out of bounds, getValidShiftConstant must be a specific uniform constant, etc.
However, getValidMinimumShiftAmount/getValidMaximumShiftAmount now can return bounds values that aren't values in the actual data, as they are based off the common KnownBits of every vector element.
This addresses feedback on #92096
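As a plain-C++ sketch of where such bounds come from (not the helper code itself; LLVM's KnownBits min/max value queries embody the same idea): given known-zero/known-one masks common to every element, the smallest consistent value sets only the known-one bits and the largest sets every bit not known to be zero.
```
#include <cassert>
#include <cstdint>

// Tightest value range consistent with a KnownBits-style (Zero, One) pair.
struct Bounds {
  uint64_t Min, Max;
};

Bounds boundsFromKnownBits(uint64_t knownZero, uint64_t knownOne) {
  // Minimum: unknown bits all clear.  Maximum: unknown bits all set.
  return {knownOne, ~knownZero};
}

int main() {
  // Shift amounts known to look like 0b??10: bit 1 known one, bits 0 and
  // 4..63 known zero, bits 2-3 unknown.
  uint64_t knownZero = ~uint64_t{0xF} | 0x1;
  uint64_t knownOne = 0x2;
  Bounds b = boundsFromKnownBits(knownZero, knownOne);
  assert(b.Min == 2 && b.Max == 14);  // the amount lies in [2, 14]
}
```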
The operation selection logic here doesn't really work when vector types
need to be split. This was also dropping the flags, and losing nnan made
the combine from select back to fmin/fmax unrecoverable. Preserve the
flags to assist a future commit.
If the comparison results are allbits masks, we can expand as `abd(lhs, rhs) -> sub(cmpgt(lhs, rhs), xor(sub(lhs, rhs), cmpgt(lhs, rhs)))`, replacing a sub+sub+select pattern with the simpler sub+xor+sub pattern.
This allows us to remove a lot of X86 specific legalization code, and will be useful in future generic expansion for the legalization work in #92576
Alive2: https://alive2.llvm.org/ce/z/sj863C
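A scalar model of this variant (illustration only), with the comparison producing an all-ones or all-zero mask as described:
```
#include <cassert>
#include <cstdint>

// Scalar model of
//   abds(lhs, rhs) -> sub(cmpgt(lhs, rhs), xor(sub(lhs, rhs), cmpgt(lhs, rhs)))
// where cmpgt yields an all-bits mask (-1) when true and 0 when false.
int32_t abdsExpanded(int32_t lhs, int32_t rhs) {
  int32_t cmp = lhs > rhs ? -1 : 0;  // allbits comparison result
  int32_t diff = static_cast<int32_t>(static_cast<uint32_t>(lhs) -
                                      static_cast<uint32_t>(rhs));  // wrapping sub
  return cmp - (diff ^ cmp);
}

int main() {
  assert(abdsExpanded(-3, 5) == 8);
  assert(abdsExpanded(5, -3) == 8);
  assert(abdsExpanded(4, 4) == 0);
}
```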
In `TargetLowering::ShrinkDemandedOp`, the types of the lhs and rhs may
differ before legalization.
In the original case, `VT` is `i64` and `SmallVT` is `i32`, but the type
of the rhs is `i8`, so invalid truncate nodes would be created.
See the description of ISD::SHL for further information:
> After legalization, the type of the shift amount is known to be
TLI.getShiftAmountTy(). Before legalization, the shift amount can be any
type, but care must be taken to ensure it is large enough.
605ae4e93b/llvm/include/llvm/CodeGen/ISDOpcodes.h (L691-L712)
This patch stops handling ISD::SHL in `TargetLowering::ShrinkDemandedOp`
and duplicates the logic in `TargetLowering::SimplifyDemandedBits`.
Additionally, it adds checks like `isNarrowingProfitable` and
`isTypeDesirableForOp` to improve the codegen on AArch64.
Fixes https://github.com/llvm/llvm-project/issues/92720.
Pulled out of #92096 - ensure we have completed a topological simplification of the SRA/SRL shift operands before we try to combine to an AVG node, as it's difficult to simplify through AVG nodes later.
This allows us to handle cases where the constant has already been type legalized behind a bitcast.
Despite calling ComputeKnownBits, I'm not seeing any notable change in compile time.
Prior to this, fixed-point multiplication would lead to this assertion
error on AArch64, armv8, and armv7.
```
_Accum f(_Accum x, _Accum y) { return x * y; }
// ./bin/clang++ -ffixed-point /tmp/test2.cc -c -S -o - -target aarch64 -O3
clang++: llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp:10245: void llvm::TargetLowering::forceExpandWideMUL(SelectionDAG &, const SDLoc &, bool, EVT, const SDValue, const SDValue, const SDValue, const SDValue, SDValue &, SDValue &) const: Assertion `Ret.getOpcode() == ISD::MERGE_VALUES && "Ret value is a collection of constituent nodes holding result."' failed.
```
This path into forceExpandWideMUL should only be taken if we don't
support [US]MUL_LOHI or MULH[US] for the operand size (32 in this case).
But we should also check whether we can just leverage regular wide
multiplication: extend the operands from 32 to 64 bits, do a regular
64-bit mul, then trunc and shift. These ops are certainly available on
AArch64, just for wider types than the original operation.
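A scalar sketch of that fallback: widen, multiply, shift, truncate. The 15 fractional bits assumed here for a 32-bit _Accum are illustrative, not taken from the patch.
```
#include <cassert>
#include <cstdint>

// Fixed-point multiply of two 32-bit values with kFracBits fractional
// bits: extend both operands to 64 bits, do one ordinary 64-bit multiply,
// shift the product right by the scale, then truncate back to 32 bits.
constexpr unsigned kFracBits = 15;  // illustrative _Accum scale

int32_t fixedPointMul(int32_t x, int32_t y) {
  int64_t wide = static_cast<int64_t>(x) * static_cast<int64_t>(y);
  return static_cast<int32_t>(wide >> kFracBits);  // rescale and truncate
}

int main() {
  const int32_t one = 1 << kFracBits;   // 1.0 in fixed point
  const int32_t half = one / 2;         // 0.5
  assert(fixedPointMul(one, half) == half);            // 1.0 * 0.5 == 0.5
  assert(fixedPointMul(3 * one, half) == one + half);  // 3.0 * 0.5 == 1.5
}
```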
According to the LangRef, llvm.maximum/minimum has -0.0 < +0.0 semantics
and propagates NaN.
Expand the nodes on targets not supporting the operation, by adding an
extra check for NaN and using is_fpclass to check the sign of zeros.
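A scalar sketch of the llvm.maximum semantics being expanded (NaN propagates, +0.0 compares greater than -0.0); the explicit zero-sign check below plays the role that is_fpclass serves in the expansion:
```
#include <cassert>
#include <cmath>
#include <limits>

// Reference semantics for llvm.maximum: any NaN operand propagates as
// qNaN, and +0.0 compares greater than -0.0.
double fmaximumRef(double a, double b) {
  if (std::isnan(a) || std::isnan(b))
    return std::numeric_limits<double>::quiet_NaN();  // propagate NaN
  if (a == b)  // only the +0.0 vs -0.0 case needs the sign check
    return std::signbit(a) ? b : a;
  return a > b ? a : b;
}

int main() {
  assert(std::isnan(fmaximumRef(1.0, std::nan(""))));
  assert(!std::signbit(fmaximumRef(-0.0, +0.0)));  // picks +0.0
}
```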
The load narrowing part of TargetLowering::SimplifySetCC is updated
as follows:
1) The offset calculation (for big endian) did not work properly for
non byte-sized types. This is basically solved by an early exit
if the memory type isn't byte-sized. But the code is also corrected
to use the store size when calculating the offset.
2) To still allow some optimizations for non-byte-sized types the
TargetLowering::isPaddedAtMostSignificantBitsWhenStored hook is
added. By default it assumes that scalar integer types are padded
starting at the most significant bits, if the type needs padding
when being stored to memory.
3) Allow optimizing when isPaddedAtMostSignificantBitsWhenStored is
true, as that hook makes it possible for TargetLowering to know
how the non byte-sized value is aligned in memory.
4) Update the algorithm to always search for a narrowed load with
a power-of-2 byte-sized type. In the past the algorithm started
with the width of the original load, and then divided it by
two for each iteration. But for a type such as i48 that would
just end up trying to narrow the load into an i24 or i12 load,
and then we would fail sooner or later due to not finding a
newVT that fulfilled newVT.isRound().
With this new approach we can narrow the i48 load into either
an i8, i16 or i32 load. By checking if such a load is allowed
(e.g. alignment wise) for any "multiple of 8 offset", then we can find
more opportunities for the optimization to trigger. So even for a
byte-sized type such as i32 we may now end up narrowing the load
into loading the 16 bits starting at offset 8 (if that is allowed
by the target), as sketched below. The old algorithm did not even consider that case.
5) Also start using getObjectPtrOffset instead of getMemBasePlusOffset
when creating the new ptr. This way we get "nsw" on the add.
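To make point 4 concrete, a source-level sketch (illustrative only, little-endian layout, hypothetical values) of the kind of narrowing this enables: when the compare only demands a byte-aligned 16-bit slice of a wider loaded value, the load can be shrunk to that slice at the corresponding offset, subject to the target's legality and alignment checks.
```
#include <cassert>
#include <cstdint>
#include <cstring>

// The compare only demands bits [8, 24) of the loaded i32, so instead of
// loading all 32 bits and masking, load just those 16 bits at byte
// offset 1 (little-endian layout assumed in this sketch).
bool wideCompare(const uint32_t *p) {
  return (*p & 0x00FFFF00u) == 0x00ABCD00u;
}

bool narrowedCompare(const uint32_t *p) {
  uint16_t slice;
  std::memcpy(&slice, reinterpret_cast<const char *>(p) + 1, sizeof(slice));
  return slice == 0xABCD;
}

int main() {
  uint32_t v = 0x00ABCD00u;
  assert(wideCompare(&v) == narrowedCompare(&v));
  v = 0x12ABCD34u;  // surrounding bytes do not affect either form
  assert(wideCompare(&v) && narrowedCompare(&v));
}
```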
The current APInt::multiplicativeInverse takes a modulus which can be
any value, but all in-tree callers use a power of two. Moreover, most
callers want to use two to the power of the width of an existing APInt,
which is awkward because 2^N is not representable as an N-bit APInt.
Add a new overload of multiplicativeInverse which implicitly uses
2^BitWidth as the modulus.
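For context, a scalar sketch of computing such an inverse for an odd 64-bit value (an illustration, not the APInt code): the modulus 2^64 is implicit in the type width, which is exactly why it cannot be passed as a same-width value.
```
#include <cassert>
#include <cstdint>

// Inverse of an odd value modulo 2^64.  The modulus is implicit in the
// type width: 2^64 does not fit in 64 bits.
uint64_t inverseMod2Pow64(uint64_t v) {
  uint64_t inv = v;  // v * v == 1 (mod 8), so the seed is correct to 3 bits
  for (int i = 0; i < 5; ++i)  // 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits
    inv *= 2 - v * inv;
  return inv;
}

int main() {
  uint64_t v = 0x123456789ABCDEF1ull;    // odd
  assert(v * inverseMod2Pow64(v) == 1);  // product wraps to 1 mod 2^64
}
```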
Reverse the fold, handling it inside canCreateUndefOrPoison for cases where we know that the extract index is in bounds.
This exposed a number of regressions, and required some initial freeze handling of SCALAR_TO_VECTOR, which will require us to properly improve DemandedElts support to handle its undef upper elements.
There is still one outstanding regression to be addressed in the future - how do we want to handle folds involving frozen loads?
Fixes #86968