Always match AVG patterns pre-legalization, and use TargetLowering::expandAVG to expand again during legalization.
I've removed the X86 custom AVGCEILU pattern detection and replaced with combines to try and convert other AVG nodes to AVGCEILU.
This moves the combine of fdiv by constant to fmul out of an
'if (Options.UnsafeFPMath || Flags.hasAllowReciprocal()' block,
so that it triggers if the divide is exact. An extra check for
Recip.isDenormal() is added as multiple places make reference
to it being unsafe or slow on certain platforms.
The pr description in #94008 mismatches with the code.
> + When VT is smaller than ShiftVT, it is safe to use trunc.
> + When VT is larger than ShiftVT, it is safe to use zext iff
`is_zero_poison` is true (i.e., `opcode == ISD::CTTZ_ZERO_UNDEF`). See
also the counterexample `src_shl_cttz2 -> tgt_shl_cttz2` in the alive2
proofs.
Closes#94824.
In some case, constant can survive early constant folding optimization
because they are hidden behind several layers of type changes.
E.g., consider the following sequence (extracted from the arm test that
this commit changes):
```
t2: v1f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>
t4: v1f16 = insert_vector_elt t2, ConstantFP:f16<APFloat(0)>, Constant:i32<0>
t5: f16 = bitcast t4
t6: f32 = fp_extend t5
```
Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all
the constant folding that normally happen for scalar nodes when using
`SelectionDAG::getNode` are blocked.
As a result the constant manages to survive as an actual conversion
instruction down to the select phase:
```
t11: f32 = fp16_to_fp Constant:i32<0>
```
With the change in this patch, we try to do constant folding one more
time during dag combine, which in the motivating example result in the
much better sequence:
```
t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32<0.000000e+00>
```
Note: I'm sure we have this problem in a lot of other places. Generally
speaking I believe SDISel is not that good with <1 x ty> compared to
pure scalar. However, I only changed what I could easily test.
This "small" set grows quite large and it's more performant to store
whether a node has been combined before in the node itself.
As this information is only relevant for nodes that are currently not in
the worklist, add a second state to the CombinerWorklistIndex (-2) to
indicate that a node is currently not in a worklist, but was combined
before.
This brings a substantial performance improvement.
DenseMap for pointer lookup is expensive, and this is only used for
deduplication and index lookup. Instead, store the worklist index in the
node itself.
This brings a substantial performance improvement.
In #91747, we changed the SDNode from `X86ISD::SUB` (FROM) to
`X86ISD::CCMP`
(TO) in the DAGCombine. The value type of `X86ISD::SUB` can be `i8, i32`
while the value type of `X86ISD::CCMP` is i32. This breaks the
assumption
that the value type should match after the combine and triggers the
error
```
SelectionDAG.cpp:10942: void
llvm::SelectionDAG::transferDbgValues(llvm::SDValue, llvm::SDValue,
unsigned int, unsigned int, bool): Assertion `FromNode && ToNode &&
"Can't modify dbg values"' failed.
```
when running tests
llvm/test/CodeGen/X86/apx/ccmp.ll
llvm/test/CodeGen/X86/apx/ctest.ll
in Release build when LLVM_ENABLE_ASSERTIONS is on.
In this patch, we fix it by creating a merged value.
DAG combine for `CCMP` and `CTESTrr`:
```
and/or(setcc(cc0, flag0), setcc(cc1, sub (X, Y)))
->
setcc(cc1, ccmp(X, Y, ~cflags/cflags, cc0/~cc0, flag0))
and/or(setcc(cc0, flag0), setcc(cc1, cmp (X, 0)))
->
setcc(cc1, ctest(X, X, ~cflags/cflags, cc0/~cc0, flag0))
```
where `cflags` is determined by `cc1`.
Generic DAG combine:
```
cmp(setcc(cc, X), 0)
brcond ne
->
X
brcond cc
sub(setcc(cc, X), 1)
brcond ne
->
X
brcond ~cc
```
Post DAG transform: `ANDrr/rm + CTESTrr -> CTESTrr/CTESTmr`
Pattern match for `CTESTri`:
```
X= and A, B
ctest(X, X, cflags, cc0/, flag0)
->
ctest(A, B, cflags, cc0/, flag0)
```
`CTESTmi` is already handled by the memory folding mechanism in MIR.
As a small addition to #91148, this uses copysign to produce the correct
sign for zero when converting frem to div/trunc/mul when we do not know
that the input is positive (and we care about sign bits). The copysign
lets us get the sign of zero correct.
In testing, the only case this produced different results than fmod was:
frem -inf, 4.0 -> nan vs -nan
If we are lowering a frem and the divisor is known to be an integer power-2, we
can use the formula 'frem = x - trunc(x / d) * d'. This avoids the more
expensive call to fmod. The results are identical as fmod so long as d is a
power-2 (so the mul does not round incorrectly), and the sign of the return is
either always positive or not important for zeroes (nsz).
Unfortunately Alive2 does not handle this well at the moment. I was using
exhaustive checking to test this:
(https://gist.github.com/davemgreen/6078015f30d3bacd1e9572f8db5d4b64).
I found this in cpythons implementation of float_pow. I currently added it as a
DAG combine for frem with power-2 fp constants.
reassociationCanBreakAddressingModePattern tries to prevent bad add
reassociations that would break adrressing mode patterns. This adds
support for vscale offset addressing modes, making sure we don't break
patterns that already exist. It does not optimize _to_ the correct
addressing modes yet, but prevents us from optimizating _away_ from
them.
This is useful when the inner add has multiple uses, and so cannot be
canonicalized by pushing the constants down through the mul. This patch
adds patterns for both `add(mul(add(A, CA), CM), CB)` and with an extra add
`add(add(mul(add(A, CA), CM), B) CB)` as the second can come up when
lowering geps.
Previously we recursively looked through extends and truncates on both
SourceValue and WideVal.
SourceValue is the largest source found for each of the stores we are
combining. WideVal is the source for the current store.
Previously we could incorrectly look through a (zext (trunc X)) pair and
incorrectly believe X to be a good source.
I think we could also look through a zext on one store and a sext on
another store and arbitrarily pick one of the extends as the final
source.
With this patch we only look through one level of extend or truncate.
And we don't look through extends/truncs on both SourceValue and WideVal
at the same time.
This may lose some optimization cases, but keeps everything we had tests
for.
Fixes#90936.
If the shuffle mask contains no undef elements, then we can move the freeze through a shuffle node.
This requires special case handling to create a new ShuffleVectorSDNode.
Includes VECTOR_SHUFFLE support for isGuaranteedNotToBeUndefOrPoison / canCreateUndefOrPoison.
When looking through a right shift, we need to make sure that all of
the bits we are using from the shift come from the shift input and
not the sign or zero bits that are shifted in.
Fixes#90936.
In #70452 DAGCombiner::mayAlias was taught to handle scalable sizes, but
when it checks via AA->isNoAlias it didn't take into account the case
where the size is scalable but there was an offset too.
For the fixed length case the offset was just accounted for by adding to
the LocationSize, but for the scalable case there doesn't seem to be a
way to represent both a scalable and fixed part in it. So this patch
works around it by bailing if there is an offset.
Fixes#90559
This reverts commit 16bd10a38730fed27a3bf111076b8ef7a7e7b3ee.
Re-applies:
b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"
with a fix in DAGCombiner::visitFREEZE.
This reverts:
b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
(because it updates a test case that I don't know how to resolve the conflict for)
8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"
Due to a test suite failure on AArch64 when compiling for SVE.
https://lab.llvm.org/buildbot/#/builders/197/builds/13955
clang: ../llvm/llvm/include/llvm/CodeGen/ValueTypes.h:307: MVT llvm::EVT::getSimpleVT() const: Assertion `isSimple() && "Expected a SimpleValueType!"' failed.
[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison
Handle SELECT_CC similarly as SETCC.
Handle these operations that only propagate poison/undef based on the
input operands:
SADDSAT, UADDSAT, SSUBSAT, USUBSAT, MULHU, MULHS,
SMIN, SMAX, UMIN, UMAX
These operations may create poison based on shift amount and exact
flag being violated:
SRL, SRA
One goal here is to allow pushing freeze through these operations
when allowed, as well as letting analyses such as
isGuaranteedNotToBeUndefOrPoison to not break on such operations.
Since some problems have been observed with pushing freeze through
SRA/SRL we block that explicitly in DAGCombiner::visitFreeze now.
That way we can still model SRA/SRL properly in
SelectionDAG::canCreateUndefOrPoison, e.g. when used by
isGuaranteedNotToBeUndefOrPoison, even if we do not want to push
freeze through those instructions.
It's really unfortunate that STORE and ATOMIC_STORE are separate
opcodes. This duplicates a basic simplify demanded for the truncating
case. This avoids some AMDGPU lit regressions in a future patch.
I'm not sure how to craft a test that exposes this without first
introducing the regressions by promoting half to i16.
(Sorry, not an actual build_pair node just a similar pattern).
For cases where we're concatenating 2 integers into a double width integer, see if both integer sources are NOT patterns.
We could take this further and handle all logic ops with a constant operands, but I just wanted to handle the case reported on #89533 initially.
Fixes#89533
Avoid turning a BUILD_VECTOR that can be recognized as "all zeros",
"all ones" or "constant" into something that depends on
freeze(undef), as that would destroy those properties.
Instead we replace undef by 0/-1 in such vectors, making it possible
to fold away the freeze. We typically use -1 if the BUILD_VECTOR
would identify as "all ones", and otherwise we use the value 0.
Ensure that the sum of the shift amounts does not overflow the
shift amount type when combining shifts in combineShiftOfShiftedLogic.
Solves a miscompile bug found when testing the C23 BitInt feature.
Targets like X86 that only use an i8 for shift amounts after
legalization seems to be extra susceptible for bugs like this as it
isn't legal to shift more than 255 steps.
This requires type legalization to keep them the same. This means we no
longer need to legalize the operand since it will be legalized when we
legalize the second result.
In the attach test case we were creating a UADDO_CARRY with i1 carry out
and i41 carry in. i41 exceeds is larger than the setcc result type for
AArch64 which is i32. i41 needs to be promoted to i64 since it is larger
than i32. The type legalizer tried to use promoteTargetBoolean, but that
can only promote from a type smaller than setcc result type.
The easiest fix here is to force the carryin type to match the carryout
type at the type of creation. This should ensure the node won't exceeed
setcc result type as long as the output type doesn't.
I think we should explore requiring the types to match for this node.
Fixes#88966
If the extract_subvector is cheap, attempt to extract directly from an inserted subvector
Reapplied with a check to ensure we only attempt this for fixed vectors
As we canonicalizes smin with non-negative operands into umin in the
middle-end, the saturation pattern will be broken.
This patch reverts the transform in DAGCombine to fix the regression on
ARM.
Fixes https://github.com/llvm/llvm-project/issues/85706.