The address calculation for the offset assumes that the offset can be computed by multiplying the index by the store size of the element. But that only works if the element's store size is exactly its real size, since we store vectors tightly packed in memory. There are improvements we could make to this, like special-casing the extraction of element 0. I think we could also handle cases where the extracted VT is byte sized and the index is aligned with the extract element count.
Differential Revision: https://reviews.llvm.org/D75377
Select_cc isn't used by all targets. X86 doesn't have optimizations
for it.
Since we already know the input to the sint_to_fp/uint_to_fp is
a setcc we can just emit a plain select using that setcc as the
condition. Other DAG combines can turn that into a select_cc on
targets that support it.
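For illustration only, a minimal C example of the kind of source that can produce a sint_to_fp/uint_to_fp fed by a setcc (whether the compare reaches the conversion directly depends on the target and on earlier combines):

float cmp_to_fp(int x, int y) {
  // The conversion of a compare result: the combine can now build a plain
  // select(setcc, 1.0f, 0.0f) here (for the unsigned case) instead of
  // relying on a select_cc-based fold; targets that support select_cc
  // still get it from later combines.
  return (float)(x < y);
}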
Differential Revision: https://reviews.llvm.org/D75415
We get the simple cases of this via demanded elements and other folds,
but that doesn't work if the values have >1 use, so add a dedicated
match for the pattern.
We already have this transform in IR, but it doesn't help the
motivating x86 tests (based on PR42024) because the shuffles don't
exist until after legalization and other combines have happened.
The AArch64 test shows a minimal IR example of the problem.
Differential Revision: https://reviews.llvm.org/D75348
The alias analysis in DAG Combine looks at the BaseAlign, the Offset and
the Size of two accesses, and determines if they are known to access
different parts of memory by the fact that they are different offsets
from inside that "alignment window". It does not seem to account for
accesses that are not a multiple of the size, and may overflow from one
alignment window into another.
For example, in the test case we have a 19-byte memset that is split into
a 16-byte NEON store and an unaligned 4-byte store at a 15-byte offset.
This 15-byte offset (with a base align of 8) wraps around into the next
alignment window. When compared to an access at a 16-byte offset (of the
same 4-byte size and 8-byte base align), the two accesses are said not to
alias, even though the byte ranges [15, 19) and [16, 20) overlap.
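Roughly the shape of the source behind the test case (a sketch; the actual test is an LLVM IR memset):

#include <string.h>

// A 19-byte memset may be split into a 16-byte vector store at offset 0
// and a trailing 4-byte store at offset 15; that last store covers bytes
// [15, 19) and so crosses an 8-byte alignment window.
void zero19(char *p) {
  memset(p, 0, 19);
}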
I've fixed this here by just ensuring that the offsets are a multiple of
the size, so that they cannot overlap by wrapping into the next window.
Fixes PR45035, which was exposed by the UseAA changes in the ARM backend.
Differential Revision: https://reviews.llvm.org/D75238
This may inhibit vector narrowing in general, but there's
already an inconsistency in the way that we deal with this
pattern as shown by the test diff.
We may want to add a dedicated function for narrowing fneg.
It's often folded into some other op, so moving it away from
other math ops may cause regressions that we would not see
for normal binops.
See D73978 for more details.
Use the SelectionDAG::getValidShiftAmountConstant helper to get constant/constant-splat shift amounts, which allows us to drop the out-of-range shift amount early-out.
First step towards better non-uniform shift amount support in SimplifyDemandedBits.
Similar to what we already do with SimplifyDemandedVectorElts, call SimplifyDemandedBits across all the extracted elements of the source vector, treating it as a single use.
There's a minor regression in store-weird-sizes.ll which will be addressed in an upcoming SimplifyDemandedBits patch.
Summary:
This patch implements the part of the calling convention
where SVE vectors are passed by reference. This means the
caller must allocate stack space for these objects and
pass the address to the callee.
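As a sketch (assuming the AAPCS rule that only z0-z7 are available for SVE vector arguments, so the ninth argument below is the one passed indirectly; compile with SVE enabled):

#include <arm_sve.h>

// a8 does not fit in a vector argument register: the caller allocates
// stack space for the value and passes its address, and the callee reads
// it from that caller-owned slot.
svfloat32_t add_first_last(svfloat32_t a0, svfloat32_t a1, svfloat32_t a2,
                           svfloat32_t a3, svfloat32_t a4, svfloat32_t a5,
                           svfloat32_t a6, svfloat32_t a7, svfloat32_t a8) {
  return svadd_f32_z(svptrue_b32(), a0, a8);
}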
Reviewers: efriedma, rovka, cameron.mcinally, c-rhodes, rengolin
Reviewed By: efriedma
Subscribers: tschuett, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71216
This adds another pattern to the combiner for a case that we were not handling,
so that we generate the REV16 instruction for ARM/Thumb2 and a bswap+ror on X86.
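For reference, a typical byte-swap-within-halfwords idiom that maps to REV16 (the exact pattern added by this change may be a different variation of it):

#include <stdint.h>

// Swap the two bytes inside each 16-bit half of a 32-bit word: REV16 on
// ARM/Thumb2, or bswap followed by a 16-bit rotate on x86.
uint32_t rev16(uint32_t x) {
  return ((x & 0x00FF00FFu) << 8) | ((x & 0xFF00FF00u) >> 8);
}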
Differential Revision: https://reviews.llvm.org/D74032
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.
This patch replaces the char return value with a NegatibleCost enum to more clearly demonstrate what is implied.
It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect what's going on.
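The enum is along these lines (a sketch; see the actual definition in TargetLowering.h):

enum class NegatibleCost {
  Cheaper,    // negated expression is cheaper than the original
  Neutral,    // negated expression costs the same as the original
  Expensive   // negated expression is more expensive than the original
};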
Differential Revision: https://reviews.llvm.org/D74221
Add a simplification to fuse a manual vector extract with shifts and
truncate into a bitcast.
Unpacking and packing values into vectors is only optimized with
extractelement instructions, not when manually unpacked using shifts
and truncates.
This patch simplifies shifts and truncates into a bitcast if possible.
Simplify (build_vec (trunc $1)
(trunc (srl $1 width))
(trunc (srl $1 (2 * width))) ...)
to (bitcast $1)
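A C-level sketch of the pattern (hypothetical example using Clang vector extensions; on a little-endian target the shift-and-truncate unpack below is just a bitcast of the 64-bit value):

#include <stdint.h>

typedef uint32_t v2u32 __attribute__((vector_size(8)));

v2u32 unpack(uint64_t x) {
  v2u32 r;
  r[0] = (uint32_t)x;          // trunc $1
  r[1] = (uint32_t)(x >> 32);  // trunc (srl $1, 32)
  return r;                    // build_vec of the truncs -> bitcast $1
}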
Differential Revision: https://reviews.llvm.org/D73892
AMDGPU and x86 at least both have separate controls for whether
denormal results are flushed on output, and for whether denormals are
implicitly treated as 0 as an input. The current DAGCombiner use only
really cares about the input treatment of denormals.
Updated the FoldConstantArithmetic method signature to match that of
FoldConstantVectorArithmetic in preparation for merging the two
functions together.
https://bugs.llvm.org/show_bug.cgi?id=36544
This is the first step in combining the various
FoldConstantArithmetic and FoldConstantVectorArithmetic
functions into one FoldConstantArithmetic function.
Differential Revision: https://reviews.llvm.org/D72870
Unlike the existing code that I modified here, I only handle the
case where the strict_fsetcc has a single use. Not sure exactly
how to handle multiple uses.
Testing this on X86 is hard because we already have other combines
that get rid of the lowered version of the integer setcc that
this xor will eventually become. So this combine really just
saves a bunch of extra nodes being created. Not sure about other
targets.
Differential Revision: https://reviews.llvm.org/D71816
This patch also fixes up a number of cases in DAGCombine and
SelectionDAGBuilder where the size of a scalable vector is used in a
fixed-width context (thus triggering an assertion failure).
Reviewers: efriedma, c-rhodes, rovka, cameron.mcinally
Reviewed By: efriedma
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71215
This was unconditionally folding this to the source operand, even if the access was out of bounds. Use undef instead if the extract is not of the first element.
This helps with some cases where 3-vectors are legalized and avoids processing the 4th component.
Original Patch by: arsenm (Matt Arsenault)
Differential Revision: https://reviews.llvm.org/D51589
If we are extracting a chunk of a vector that is a fraction of an
operand of the concatenated vector, we can extract that chunk directly
from one of those original operands.
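A rough C-level analogue using Clang vector extensions (hypothetical example; the combine itself operates on extract_subvector/concat_vectors DAG nodes):

typedef long long v4i64 __attribute__((vector_size(32)));
typedef long long v2i64 __attribute__((vector_size(16)));

// Extracting elements 6..7 of concat(x, y)...
v2i64 chunk_of_concat(v4i64 x, v4i64 y) {
  return __builtin_shufflevector(x, y, 6, 7);
}

// ...is the same as extracting elements 2..3 directly from y.
v2i64 chunk_direct(v4i64 y) {
  return __builtin_shufflevector(y, y, 2, 3);
}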
This is another suggestion from PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024#c2
But I'm not sure yet if it will make any difference on those patterns.
It seems to help a few existing AVX512 tests though.
Differential Revision: https://reviews.llvm.org/D72361
This is a positive combination as long as the NEG is NOT free,
as we are reducing the number of NEGs from two to one.
Differential Revision: https://reviews.llvm.org/D72312
This is possibly a small part towards solving PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
The vectorizer is creating shuffles of concat like this:
%63 = shufflevector <4 x i64> %x, <4 x i64> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%64 = shufflevector <8 x i64> %63, <8 x i64> undef, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
That might be fixable in the vectorizers, but we're not allowed to fold that into a single shuffle in instcombine,
so we should have a backend backstop to convert that into the likely simpler form:
%64 = shufflevector <4 x i64> %x, <4 x i64> undef, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>
Differential Revision: https://reviews.llvm.org/D72300
This is the DAG node for SIGN_EXTEND_INREG:
t21: v4i32 = sign_extend_inreg t18, ValueType:ch:v4i16
It has two operands. The first one is the value it wants to extend, and the second
one is the type that specifies how to extend the value. For this example, it means
that t18 (v4i32) is sign-extended from v4i16 to v4i32. That matches the
semantics of the C code:
vector int foo(vector int m) {
return m << 16 >> 16;
}
And it could be any vector type for which the hardware supports the operation,
even though a type like 'v4i16' is NOT legal for the target. When we are trying
to combine the srl + sra, what we currently do is call TLI.isOperationLegal(),
which also checks the legality of the type. That doesn't make sense.
Differential Revision: https://reviews.llvm.org/D70230
The fold 'A - (A & (B - 1))' -> 'A & (0 - B)'
added in 8dab0a4a7d691f2704f1079538e0ef29548db159
is too specific. It should/can just be 'A - (A & B)' -> 'A & (~B)'.
Even if we don't manage to fold `~` into B,
we have likely formed an `ANDN` node.
Also, this way there are fewer similar-but-duplicate folds.
Name: X - (X & Y) -> X & (~Y)
%o = and i32 %X, %Y
%r = sub i32 %X, %o
=>
%n = xor i32 %Y, -1
%r = and i32 %X, %n
https://rise4fun.com/Alive/kOUl
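As a C-level illustration of the general form (with BMI on x86 the right-hand side becomes a single ANDN):

#include <stdint.h>

uint32_t strip_common_bits(uint32_t a, uint32_t b) {
  // a - (a & b) subtracts exactly the bits of a that are also set in b,
  // so it is equal to a & ~b.
  return a - (a & b);
}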
See
https://bugs.llvm.org/show_bug.cgi?id=44448
https://reviews.llvm.org/D71499
The fold 'A - (A & (B - 1))' -> 'A & (0 - B)'
added in 8dab0a4a7d691f2704f1079538e0ef29548db159
is too specific. It should just be 'A - (A & B)' -> 'A & (~B)',
but we currently fail to sink that '~' into `(B - 1)`.
Name: ~(X - 1) -> (0 - X)
%o = add i32 %X, -1
%r = xor i32 %o, -1
=>
%r = sub i32 0, %X
https://rise4fun.com/Alive/rjU
While we do manage to fold integer-typed IR in the middle-end,
we can't do that for the main motivational case of pointers.
There is the @llvm.ptrmask() intrinsic, which may or may not be helpful,
but I'm not sure it is considered fully canonical yet,
and likely not everything is aware of it.
Name: PR44448 ptr - (ptr & C) -> ptr & (~C)
%bias = and i32 %ptr, C
%r = sub i32 %ptr, %bias
=>
%r = and i32 %ptr, ~C
See
https://bugs.llvm.org/show_bug.cgi?id=44448
https://reviews.llvm.org/D71499
While we do manage to fold integer-typed IR in the middle-end,
we can't do that for the main motivational case of pointers.
There is the @llvm.ptrmask() intrinsic, which may or may not be helpful,
but I'm not sure it is considered fully canonical yet,
and likely not everything is aware of it.
https://rise4fun.com/Alive/ZVdp
Name: ptr - (ptr & (alignment-1)) -> ptr & (0 - alignment)
%mask = add i64 %alignment, -1
%bias = and i64 %ptr, %mask
%r = sub i64 %ptr, %bias
=>
%highbitmask = sub i64 0, %alignment
%r = and i64 %ptr, %highbitmask
See
https://bugs.llvm.org/show_bug.cgi?id=44448
https://reviews.llvm.org/D71499
This was increasing the number of instructions when fsub was legalized
on AMDGPU with no signed zeros enabled. This fold should be guarded by
hasOneUse, and I don't think getNode should be doing that. The same
fold is already done as a regular combine through isNegatibleForFree.
This does require some duplication to avoid one PPC regression, even
though isNegatibleForFree does this combine already (and properly checks
hasOneUse). In the regression, the outer fneg has nsz but the fsub
operand does not. isNegatibleForFree only sees the operand, and
doesn't see that it's used from an nsz context. An nsz parameter needs to
be added and threaded through isNegatibleForFree to avoid this.
This moves the X86-specific transform from rL364407
into DAGCombiner to generically handle 'little to big' cases
(for example: extract_subvector(v2i64 bitcast(v16i8))). This
allows us to remove both the x86 implementation and the AArch64
bitcast(extract_subvector(bitcast())) combine.
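A small Clang-vector-extension sketch of the 'little to big' shape (illustration only; the combine itself works on the DAG nodes, and the v2i64/v16i8 case mentioned above is analogous):

typedef signed char v32i8 __attribute__((vector_size(32)));
typedef long long   v4i64 __attribute__((vector_size(32)));
typedef long long   v2i64 __attribute__((vector_size(16)));

// Bitcast from small elements to big elements, then take a subvector of
// the result: extract_subvector(v4i64 bitcast(v32i8)).
v2i64 low_half(v32i8 x) {
  v4i64 wide = (v4i64)x;                            // bitcast v32i8 -> v4i64
  return __builtin_shufflevector(wide, wide, 0, 1); // low v2i64 subvector
}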
Earlier patches that dealt with regressions initially exposed
by this patch:
rG5e5e99c041e4
rG0b38af89e2c0
Patch by: @RKSimon (Simon Pilgrim)
Differential Revision: https://reviews.llvm.org/D63815
Summary:
Without this check, unnecessary FMA instructions are generated when the FSUB terms are reused.
This also has the side-effect that the same value is computed to different levels of precision, which can create undesirable effects if the results are used together in subsequent computation.
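A sketch of one reuse scenario (hypothetical example):

// If the product a*b has another use, turning the subtraction into an FMA
// recomputes a*b inside the FMA, possibly at a different precision than
// the value stored to *reused.
float fsub_with_reuse(float a, float b, float c, float *reused) {
  float p = a * b;
  *reused = p;
  return p - c;  // without the new check this could still become fma(a, b, -c)
}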
Reviewers: arsenm, nhaehnle, foad, tpr, dstuttard, spatel
Reviewed By: arsenm
Subscribers: jvesely, wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71656
Summary:
Right now, DAGCombiner processes the nodes in an implementation-defined order. This tends to be fragile, as optimisations may or may not kick in depending on the traversal order.
This is part of a larger effort to get the DAGCombiner to process its nodes in topological order.
Reviewers: craig.topper, efriedma, RKSimon, lebedev.ri
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70921