Rename one signature of getAtomic to getAtomicLoad and pass LoadExtType.
Previously we had to set the extension type after the node was created,
but we don't usually modify SDNodes once they are created. It's possible
the node already existed and has been CSEd. If that happens, modifying
the node may affect the other users. It's therefore safer to add the
extension type at creation so that it is part of the CSE information.
I don't know of any failures related to the current implementation. I
only noticed that it doesn't match how we usually do things.
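A minimal sketch of the before/after (the parameter order of the renamed
overload is assumed here):

```cpp
// Before: create the node, then mutate it - risky if the node was CSEd.
//   SDValue L = DAG.getAtomic(ISD::ATOMIC_LOAD, DL, MemVT, VT, Chain, Ptr, MMO);
//   cast<AtomicSDNode>(L)->setExtensionType(ISD::ZEXTLOAD);
// After: the extension type is part of the creation call, and so part
// of the CSE key.
SDValue L = DAG.getAtomicLoad(ISD::ZEXTLOAD, DL, MemVT, VT, Chain, Ptr, MMO);
```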
If an ATOMIC_LOAD has ZEXTLOAD/SEXTLOAD extension type we should trust
that over getExtendForAtomicOps().
SystemZ is the only target that uses setAtomicLoadExtAction and they
return ANY_EXTEND from getExtendForAtomicOps(). So I'm not sure there's
a way to get a contradiction currently.
Note, type legalization uses getExtendForAtomicOps() when promoting
ATOMIC_LOAD so we may not need to check getExtendForAtomicOps() for
ATOMIC_LOAD. I have not done much investigating of this.
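A sketch of what trusting the node-level extension type looks like on
the known-bits side (paraphrased, not the exact upstream code):

```cpp
// If the ATOMIC_LOAD itself is marked ZEXTLOAD, the bits above the
// memory type are zero, whatever getExtendForAtomicOps() says.
auto *ALoad = cast<AtomicSDNode>(Op.getNode());
if (ALoad->getExtensionType() == ISD::ZEXTLOAD)
  Known.Zero.setBitsFrom(ALoad->getMemoryVT().getScalarSizeInBits());
```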
This PR fixes the issue
https://github.com/llvm/llvm-project/issues/122728
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
It also propagates to poison in `SDValue
SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT, SDValue
N1, SDValue N2, const SDNodeFlags Flags)` when one of the inputs is
poison.
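A minimal sketch of the propagation rule (opcode and helper names as I
understand them; treat them as assumptions):

```cpp
// In getNode() for a two-operand node: if either input is poison, the
// result is poison of the result type instead of a folded constant.
if (N1.getOpcode() == ISD::POISON || N2.getOpcode() == ISD::POISON)
  return getPOISON(VT);
```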
The patch also reverts the test cases
llvm/test/CodeGen/X86/pr119158.ll
llvm/test/CodeGen/X86/half.ll
which are mentioned in
https://github.com/llvm/llvm-project/pull/125883#discussion_r2021390919
---------
Co-authored-by: Amy Kwan <amy.kwan1@ibm.com>
#135597 didn't correctly fix the issue of binops with an undef element
from only one operand - only reporting the common undef elements could
incorrectly recognise splats where the (binop X, undef) fold might
actually be different - we need to ensure both operands have the same
demanded undefs for certainty.
Fixes #135917
#134602 demonstrated an issue where an AND node always had at least one demanded UNDEF element in either operand, and incorrectly reported this as an all-undef result - despite the other element being 0 (so would correctly fold to 0).
This fix only assumes a binop's splat element is undefined if both operands are undef.
Fixes #134602
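A sketch of the conservative rule both fixes converge on, in the style
of the demanded-undef tracking in isSplatValue:

```cpp
// Only report an element as undef when it is undef in *both* operands;
// an element undef on one side only may still fold to a concrete value,
// e.g. and(undef, 0) -> 0.
UndefElts = UndefLHS & UndefRHS;
```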
KnownBits passed to computeKnownBitsFromRangeMetadata must have the same
bit width as the range metadata bit width. Otherwise the calculated
results will be incorrect.
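A sketch of the invariant (shown for a load carrying !range metadata):

```cpp
// The KnownBits object must be sized to the width of the !range
// metadata's type before querying; a mismatched width gives wrong bits.
unsigned BitWidth = LD->getMemoryVT().getScalarSizeInBits();
KnownBits Known(BitWidth);
computeKnownBitsFromRangeMetadata(*Ranges, Known);
```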
---------
Signed-off-by: John Lu <John.Lu@amd.com>
From #106446, this adds a variant of getVectorIdxTy that returns an LLT.
Many uses only look at the width, so a getVectorIdxWidth was added as
the common base.
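A sketch of the resulting API shape (signatures paraphrased from the
description, so treat the exact forms as assumptions):

```cpp
// Common base: just the index width in bits.
unsigned getVectorIdxWidth(const DataLayout &DL) const;
// Existing EVT query, now derived from the width.
EVT getVectorIdxTy(const DataLayout &DL) const;
// New variant returning an LLT scalar of the same width.
LLT getVectorIdxLLT(const DataLayout &DL) const;
```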
A (two-result) node like `FMODF` or `FFREXP` may be expanded to a
library call with a prototype like `float(float, float*)` -- that is,
it returns one float directly from the call and another via an output
pointer. The first result of the node maps to the value returned by
value and the second result maps to the value returned via the output
pointer.
If only the second result is used after the expansion, we hit an issue
on x87 targets:
```
// Before expansion:
t0, t1 = fmodf x
return t1 // t0 is unused
```
After expansion:
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = copy_from_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
So far things are alright, but the DAGCombiner optimizes this to:
```
ptr = alloca
ch0 = call modf ptr
// copy_from_reg optimized out
t1, ch1 = ldr ptr, ch0
return t1
```
On most targets this is fine. The optimized out `copy_from_reg` is
unused and is a NOP. However, x87 uses a floating-point stack, and if
the `copy_from_reg` is optimized out it won't emit a pop needed to
remove the unused result.
The prior solution for this was to attach the chain from the
`copy_from_reg` to the root, which did work, however, the root is not
always available (it's set to null during legalize types). So the
alternate solution in this patch is to replace the `copy_from_reg` with
an `X86ISD::POP_FROM_X87_REG` within the X86 call lowering. This node is
the same as `copy_from_reg` except this node makes it explicit that it
may lower to an x87 FPU stack pop. Optimizations should be more cautious
when handling this node than a normal CopyFromReg to avoid removing a
required FPU stack pop.
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = pop_from_x87_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
Using this node ensures a required x87 FPU pop is not removed due to the
DAGCombiner.
This is an alternate solution for #127976.
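For reference, C++ source along these lines reproduces the shape of the
problem, since only the output-pointer result is live:

```cpp
#include <cmath>

float only_integral_part(float x) {
  float ipart;
  (void)std::modf(x, &ipart); // returned fractional part is dead
  return ipart;               // only the output-pointer result is used
}
```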
The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.
I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.
By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.
Noticed while fighting the x86 regressions in #122671
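A sketch of the corrected check:

```cpp
// Wrong: SDNode::hasOneUse() counts chain users too, so any load that
// was part of a chain failed the check.
//   if (!Load->hasOneUse()) return false;
// Right: only ask about users of the loaded value (result 0).
if (!SDValue(Load, 0).hasOneUse())
  return false;
```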
Add signed and unsigned PARTIAL_REDUCE_MLA ISD nodes. Add a command
line argument (aarch64-enable-partial-reduce-nodes) that indicates
whether the intrinsic experimental_vector_partial_reduce_add will be
transformed into the new ISD node. Lowering with the new ISD nodes
will, for now, always be done as an expand.
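A sketch of forming the signed flavour (operand order assumed):

```cpp
// Acc: the narrower accumulator (e.g. nxv4i32); A/B: wider inputs
// (e.g. nxv16i8) that are implicitly extended, multiplied, and
// partially reduced into Acc's lanes.
SDValue MLA = DAG.getNode(ISD::PARTIAL_REDUCE_SMLA, DL,
                          Acc.getValueType(), Acc, A, B);
```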
In 5235973ee03aca4148ecabe5eff64da2af1e034e, an ICE was fixed in
getMemsetStringVal where f128 wasn't handled. It was noted at the time
[1] that the code below this also looks suspect, since it assumes the
element type of VT is either an f32 or f64.
This part of getMemsetStringVal relates to memcpy operations where the
source is a copy from a zero constant. The VT in question is determined
by TargetLowering::findOptimalMemOpLowering, which in turn calls a
further TLI hook getOptimalMemOpType.
For AArch64, getOptimalMemOpType returns either a v16i8, f128, i64, i32
or Other. For Other, TargetLowering::findOptimalMemOpLowering will then
pick an integer VT. So on AArch64 at least, I don't believe the suspect
code can be reached.
For other targets, ARM and x86 are the only ones that return a FP vector
type from getOptimalMemOpType. For both targets, the only such type is
v2f64, but given f64 is already handled it should also be fine.
To defend this, I considered adding an assert as mentioned in [1], but
given getConstantFP handles vector types, I figured using this to fully
handle the FP types makes the code simpler and more robust.
For test coverage I added unreachables to both of the branches handling
FP types in this code, but found neither fired with check-llvm across
all targets.
Test coverage was added to llvm/test/CodeGen/AArch64/memcpy-f128.ll in
5235973ee03aca4148ecabe5eff64da2af1e034e to defend ICE on f128, but at
some point it stopped hitting this code.
AArch64TargetLowering::getOptimalMemOpType was updated in
29200611055f49a0d37243caa5f8bba1df9d57a6, so I suspect this is when it
happened, although I haven't verified this. I did find that by
updating the test to disable NEON, getOptimalMemOpType returns an f128
and the branch is once again hit.
For the final branch noted as suspect in [1], as far as I can tell this
has never had any test coverage, so I've added a test to the ARM backend
for this.
Fixes: https://github.com/llvm/llvm-project/issues/20521 [1]
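A sketch of the simplification, since getConstantFP splats zero for
vector VTs and handles f128:

```cpp
// Zero memcpy source: no per-element-type casing needed.
if (VT.isFloatingPoint())
  return DAG.getConstantFP(0.0, dl, VT);
```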
A BUILD_VECTOR can implicitly shrink the bits of its operands if the
operand types are not legal. For example a v8i16 constant BUILD_VECTOR
might be represented as v8i16 BUILD_VECTOR(i32 1, i32 2, ...).
Unfortunately this means that the constants are not accepted by
matchUnaryPredicateImpl, preventing, in this case, funnel shifts from
detecting that all the operands are non-zero. Add a flag to help it
match.
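A sketch of the matcher with the new flag (flag name assumed):

```cpp
// Match "every element is non-zero" even when v8i16 elements are stored
// as implicitly-truncated i32 constants.
bool AllNonZero = ISD::matchUnaryPredicate(
    Amt, [](ConstantSDNode *C) { return !C->isZero(); },
    /*AllowUndefs=*/false, /*AllowTruncation=*/true);
```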
Once we get to SelectionDAG the IR should not be changing anymore, so we
can use BatchAAResults rather than AAResults to cache AA queries.
This should be a NFC change for targets that enable AA during codegen
(such as AArch64), but also give a nice compile-time improvement in some
cases. See:
https://github.com/llvm/llvm-project/pull/123787#issuecomment-2606797041
Note: This follows Nikita's suggestion on #123787.
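A sketch of the pattern, which is valid because the IR is frozen during
selection:

```cpp
// Construct once; repeated queries against the same locations hit the
// internal cache instead of re-running the AA pipeline.
BatchAAResults BatchAA(*AA);
bool MayAlias = BatchAA.alias(MemoryLocation::getAfter(PtrA),
                              MemoryLocation::getAfter(PtrB)) !=
                AliasResult::NoAlias;
```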
This pattern was originally spotted in 429.mcf by @topperc.
We already have a DAGCombiner pattern to turn `(neg (abs x))` into `(min
x, (neg x))`. But in some cases `(neg (max x, (neg x)))` is formed by an
expanded `abs` followed by a `neg` that is generated only after the
`abs` expansion. This patch adds a separate pattern to match cases like
this, as well as its inverse pattern: `(neg (min X, (neg X))) --> (max
X, (neg X))`.
This pattern is applicable to both signed and unsigned min/max.
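A sketch of the fold, signed flavour (`Inner`, `X` and `NegX` stand for
the matched values; the unsigned variant uses UMAX/UMIN):

```cpp
// (neg (smax X, (neg X))) --> (smin X, (neg X))
// (neg (smin X, (neg X))) --> (smax X, (neg X))
unsigned Flipped = Inner.getOpcode() == ISD::SMAX ? ISD::SMIN : ISD::SMAX;
return DAG.getNode(Flipped, DL, VT, X, NegX);
```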
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
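For example, the two loop shapes now look like:

```cpp
// Edge-based iteration: each SDUse knows its user (and operand slot).
unsigned NumCopyUses = 0;
for (SDUse &U : N->uses())
  if (U.getUser()->getOpcode() == ISD::CopyToReg)
    ++NumCopyUses;

// User-based iteration when the edge itself doesn't matter.
bool HasStoreUser = false;
for (SDNode *User : N->users())
  HasStoreUser |= User->getOpcode() == ISD::STORE;
```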
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
This adds a new helper `canFoldStoreIntoLibCallOutputPointers()` to
check that it is safe to fold a store into a node that will expand to a
library call that takes output pointers. This requires checking for two
(independent) properties:
1. The store is not within a CALLSEQ_START..CALLSEQ_END pair
   * If it is, the expansion would lead to nested call sequences (which
     is invalid)
2. The node does not appear as a predecessor to the store
   * If it does, attempting to merge the store into the call would
     result in a cycle in the DAG
These two properties are checked as part of the same traversal in
`canFoldStoreIntoLibCallOutputPointers()`.
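A sketch of the two checks (the chain-walking helper is illustrative,
not the real name; isPredecessorOf is the existing SDNode query):

```cpp
// 1. An unmatched CALLSEQ_START on the store's chain means we are
//    inside a call sequence; folding would nest call sequences.
bool InsideCallSeq = chainHasUnmatchedCallSeqStart(StoreNode); // illustrative
// 2. If the libcall node is already a predecessor of the store, merging
//    the store into the call's output pointers would create a cycle.
bool WouldCycle = FPNode->isPredecessorOf(StoreNode);
bool SafeToFold = !InsideCallSeq && !WouldCycle;
```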
EVTs potentially contain a Type * that points into memory owned by an
LLVMContext. Storing them in a function scoped static means they may
outlive the LLVMContext they point to.
This std::set is used to unique single element VT lists containing a
single extended EVT. Single element VT list with a simple EVT are
uniqued by a separate cache indexed by the MVT::SimpleValueType enum. VT
lists with more than one element are uniqued by a FoldingSet owned by
the SelectionDAG object.
This patch moves the single element cache into SelectionDAG so that it
will be destroyed when SelectionDAG is destroyed.
Fixes #88233
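The cache in question, after the move (shape as described above):

```cpp
class SelectionDAG {
  // Uniques single-element VT lists containing an extended EVT. Owned
  // by the DAG so it is destroyed before the LLVMContext it points into.
  std::set<EVT, EVT::compareRawBits> EVTs;
  // ... rest of SelectionDAG ...
};
```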
Assert that the passed value is a valid unsigned integer value for the
specified type.
For signed values getSignedConstant() / getSignedTargetConstant() should
be used instead.
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
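For example:

```cpp
SDValue A = DAG.getConstant(255, DL, MVT::i8);      // ok: 0xFF fits i8
SDValue B = DAG.getSignedConstant(-1, DL, MVT::i8); // ok: signed form
// DAG.getConstant(-1, DL, MVT::i8) would now trip the assert, since
// the implicit uint64_t -1 does not fit an unsigned i8.
```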
When the chain is not the entry node there is a risk the stores are
within a (CALLSEQ_START, CALLSEQ_END) pair, in which case expanding the
node would lead to nested call sequences.
It should be possible to check for this and allow more cases, but for
now, let's limit this to cases where it's definitely safe.
Fixes #115323
We already support computing known bits for extending loads, but not for
masked loads. For now I've only added support for zero-extends because
that's the only thing currently tested. Even when the passthru value is
poison we still know the top X bits are zero.
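A sketch of the zero-extend case:

```cpp
// For a zero-extending masked load, every lane's bits above the memory
// element width are zero - even lanes taken from a poison passthru.
if (MLoad->getExtensionType() == ISD::ZEXTLOAD)
  Known.Zero.setBitsFrom(MLoad->getMemoryVT().getScalarSizeInBits());
```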
This merges the logic for expanding both FFREXP and FSINCOS into one
method `DAG.expandMultipleResultFPLibCall()`. This reduces duplication
and also allows FFREXP to benefit from the stack slot elimination
implemented for FSINCOS. This method will also be used in future to
implement more multiple-result intrinsics (such as modf and sincospi).
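A sketch of the merged entry point in use (signature paraphrased, so
treat the exact parameters as an assumption):

```cpp
SmallVector<SDValue, 2> Results;
// One path now serves FSINCOS and FFREXP (and, later, modf/sincospi).
if (DAG.expandMultipleResultFPLibCall(RTLIB::FREXP_F32, Node, Results)) {
  // Results[0]: fraction returned by value.
  // Results[1]: exponent returned via the output pointer.
}
```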
This shares most of its code with the scalar sincos expansion. It allows
expanding vector FSINCOS nodes to a library call from the specified
`-vector-library`. The upside of this is it will mean the vectorizer
only needs to handle the sincos intrinsic, which has no memory effects,
and this can handle lowering the intrinsic to a call that takes output
pointers.
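Illustratively, in the same notation as the earlier DAG sketches (the
vector routine name is invented), a v4f32 FSINCOS becomes one call with
two output pointers:

```
ptr_sin = alloca
ptr_cos = alloca
ch0 = call vec_sincosf4 x, ptr_sin, ptr_cos
s, ch1 = ldr ptr_sin, ch0
c, ch2 = ldr ptr_cos, ch1
```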