Allow length-changing shuffle masks in the "bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'" fold.
This also exposes some poor shuffle mask detection for extract/insert subvector cases inside improveShuffleKindFromMask.
First stage towards addressing Issue #67803.
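A rough sketch of a length-changing case this enables (hypothetical values; lane numbering assumes a little-endian layout):
%shuf = shufflevector <4 x i32> %v, <4 x i32> poison, <2 x i32> <i32 1, i32 0>
%cast = bitcast <2 x i32> %shuf to <4 x i16>
-->
%cast = bitcast <4 x i32> %v to <8 x i16>
%shuf = shufflevector <8 x i16> %cast, <8 x i16> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>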
We should check whether the element type is non-byte-sized, not
the vector type. For types like <32 x i1> the whole type is
byte-sized, but the individual elements (that we scalarize to)
are not.
Fixes https://github.com/llvm/llvm-project/issues/67060.
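A minimal sketch of the kind of case where the distinction matters (hypothetical function): the whole <32 x i1> vector occupies 4 bytes, but a single i1 element is not byte-sized, so the extract below cannot be scalarized into a simple byte-offset load.
define i1 @extract_bit(ptr %p, i64 %idx) {
  %v = load <32 x i1>, ptr %p, align 4
  %e = extractelement <32 x i1> %v, i64 %idx
  ret i1 %e
}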
VP intrinsics whose vector operands are both splat values may be
simplified into the scalar version of the operation, with the result
splatted back into a vector.
This issue is the intrinsic dual of #65072.
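A sketch of the intended rewrite (hypothetical types and values; the mask and EVL operands still need to be accounted for by the actual transform):
%xs = insertelement <4 x i32> poison, i32 %x, i64 0
%xsplat = shufflevector <4 x i32> %xs, <4 x i32> poison, <4 x i32> zeroinitializer
%ys = insertelement <4 x i32> poison, i32 %y, i64 0
%ysplat = shufflevector <4 x i32> %ys, <4 x i32> poison, <4 x i32> zeroinitializer
%r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %xsplat, <4 x i32> %ysplat, <4 x i1> %m, i32 %evl)
-->
%s = add i32 %x, %y
%rs = insertelement <4 x i32> poison, i32 %s, i64 0
%r = shufflevector <4 x i32> %rs, <4 x i32> poison, <4 x i32> zeroinitializer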
The transform 'scalarizeLoadExtract' can be applied to scalable
vector types if the index is less than the minimum number of elements.
The check that the index is less than the minimum number of elements is
located at lines 1175-1180: 'scalarizeLoadExtract' calls
'canScalarizeAccess' and uses its result to decide whether the transform
is safe. At the beginning of 'canScalarizeAccess', the index is checked:
1. whether it is less than the number of elements of a fixed vector type, or
2. whether it is less than the minimum number of elements of a scalable vector type.
Otherwise, 'canScalarizeAccess' reports the access as unsafe and the
transform is not applied.
The transform 'foldSingleElementStore' can be applied to scalable
vector types if the index is less than the minimum number of elements.
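For illustration, a sketch of the scalable load/extract case (hypothetical values; the constant index 1 is below the minimum element count of 4, so the access is known to be in bounds):
%v = load <vscale x 4 x i32>, ptr %p
%e = extractelement <vscale x 4 x i32> %v, i64 1
-->
%gep = getelementptr inbounds i32, ptr %p, i64 1
%e = load i32, ptr %gep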
Reviewed By: dmgreen, nikic
Differential Revision: https://reviews.llvm.org/D157676
These vector lanes are never accessed. They are used for shifting a value into the right lane,
and therefore only one value of the whole vector is actually used.
Following the change in shufflevector semantics,
poison will be used to represent undefined elements in shufflevector masks.
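For example, an undefined element in a shufflevector mask is now written as poison rather than undef (hypothetical snippet):
%s = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 0, i32 poison, i32 2, i32 3>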
Differential Revision: https://reviews.llvm.org/D149256
As shown in issue #60649, the new shuffles were
being inserted before a phi, and that is invalid.
It seems like most test coverage for this fold
(foldSelectShuffle) lives in the AArch64 dir,
but this doesn't repro there for a base target.
These passes are part of the optimization pipeline, whose legacy pass manager version is deprecated.
Namely:
* Internalize
* StripSymbols
* StripNonDebugSymbols
* StripDeadDebugInfo
* StripDeadPrototypes
* VectorCombine
* WarnMissedTransformations
Fixed previously failing OCaml tests (one of them seems to have already been failing?)
These passes are part of the optimization pipeline, whose legacy pass manager version is deprecated.
Namely:
* Internalize
* StripSymbols
* StripNonDebugSymbols
* StripDeadDebugInfo
* StripDeadPrototypes
* VectorCombine
* WarnMissedTransformations
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
The same is true of getShuffleCost() calling getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds a CostKind argument to these functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
This reverts a change that excluded scalarizeBinopOrCmp in VectorCombine for
scalable vectors, which caused poor codegen for scalable binops.
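For reference, a sketch of the scalarizeBinopOrCmp transform on a scalable type (hypothetical values):
%i0 = insertelement <vscale x 4 x i32> zeroinitializer, i32 %x, i64 0
%i1 = insertelement <vscale x 4 x i32> zeroinitializer, i32 %y, i64 0
%r = add <vscale x 4 x i32> %i0, %i1
-->
%s = add i32 %x, %y
%r = insertelement <vscale x 4 x i32> zeroinitializer, i32 %s, i64 0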
Differential Revision: https://reviews.llvm.org/D138545
This follows 87debdadaf18 to further eliminate wasting time
calling helper functions only to early return to the main
run loop.
Once again, this results in significant savings based on
experimental data:
https://llvm-compile-time-tracker.com/compare.php?from=01023bfcd33f922ed8c934ce563e54abe8bfe246&to=3dce4f70b73e48ccb045decb634c185e6b4c67d5&stat=instructions:u
This is NFCI other than making the pass faster. The total
cost of VectorCombine runs in an -O3 build appears to be
well under 0.1% of compile-time now, so there's not much
left to do AFAICT.
There's a TODO about making the code cleaner, but it
probably doesn't change timing much. I didn't include those
changes here because they require updating much more code.
The option was added with https://reviews.llvm.org/D102496,
and currently the name is accurate, but I am hoping to add
a load transform that is not a scalarization. See issue #17113.
This adapts/copies code from the existing fold that allows
widening of load scalar+insert. It can help in IR because
it removes a shuffle, and the backend can already narrow
loads if that is profitable in codegen.
We might be able to consolidate more of the logic, but
handling this basic pattern should be enough to make a small
difference on one of the motivating examples from issue #17113.
The final goal of combining loads on those patterns is not
solved though.
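A rough sketch of the kind of shuffle-padded load this aims at (hypothetical types and names; the wider load is only formed when the extra bytes are known to be dereferenceable):
%l = load <2 x float>, ptr %p, align 16
%s = shufflevector <2 x float> %l, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
-->
%s = load <4 x float>, ptr %p, align 16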
Differential Revision: https://reviews.llvm.org/D137341
We can't assume that operand 0 is the negated operand because
the matcher handles "fsub -0.0, X" (and also +0.0 with FMF).
By capturing the extract within the match, we avoid the bug
and make the transform more robust (can't assume that this
pass will only see canonical IR).
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index --> shuffle DestVec, (fneg SrcVec), Mask
This is a specialized form of what could be a more general fold for a binop.
It's also possible that fneg is overlooked by SLP in this kind of
insert/extract pattern since it's a unary op.
This shows up in the motivating example from issue #58139, but it won't solve
it (that probably requires some x86-specific backend changes). There are also
some small enhancements (see TODO comments) that can be done as follow-up
patches.
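For illustration, with hypothetical 4-element vectors and Index = 1:
%e = extractelement <4 x float> %src, i32 1
%n = fneg float %e
%r = insertelement <4 x float> %dest, float %n, i32 1
-->
%neg = fneg <4 x float> %src
%r = shufflevector <4 x float> %dest, <4 x float> %neg, <4 x i32> <i32 0, i32 5, i32 2, i32 3>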
Differential Revision: https://reviews.llvm.org/D135278
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D131197
1) The overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
The backend getShuffleCosts do not currently handle shuffles that change
size very well. Limit the shuffles we collect to the same type to make
sure they do not cause issues as reported in D128732.
This is an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handling of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
This is a recommit with some additional checks for supported forms and
out-of-bounds mask elements, with some extra tests.
Differential Revision: https://reviews.llvm.org/D128732
This reverts commit 19a1e20b8a0f69da2a871eae6cbd03d1314ee02d.
Clang crashes while linking bullet from llvm-test-suite in
ReleaseLTO-g cmake configuration.
This is an extension to the code added in D123911 which added vector
combine folding of shuffle-select patterns, attempting to reduce the
total amount of shuffling required in patterns like:
%x = shuffle %i1, %i2
%y = shuffle %i1, %i2
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
This patch extends the handling of shuffles that are dependent on one
another, which can arise from the SLP vectorizer, as in:
%x = shuffle %i1, %i2
%y = shuffle %x
The input shuffles can also be emitted, in which case they are treated
like identity shuffles. This patch also attempts to calculate a better
ordering of input shuffles, which can help getting lower cost input
shuffles, pushing complex shuffles further down the tree.
Differential Revision: https://reviews.llvm.org/D128732
Given a commutative reduction fed by a shuffle, the order of the
lanes in the shuffle is not important for the result. This means we can
reorder the shuffle to something simpler, trying to use the
first vector's lanes first. This was D123494.
The new shuffle may not be profitable though, and if it is not we can
try the folding of select shuffles from D123911. This, with some
adjustment as the output lane ordering is now unimportant, can allow the
final shuffle to simplify given the inputs to the patterns from D123911.
Whereas each transformation on its own is not profitable, the
combination is.
We can only support a single shuffle when called from reductions, but we
are able to sort the ReconstructMask, potentially allowing it to
simplify to an identity or concat mask.
Differential Revision: https://reviews.llvm.org/D125086
This patch adds a combine to attempt to reduce the costs of certain
select-shuffle patterns. The form of code it attempts to detect is:
%x = shuffle ...
%y = shuffle ...
%a = binop %x, %y
%b = binop %x, %y
shuffle %a, %b, selectmask
A classic select mask will pick items from each lane of a or b. These
do not always have a great lowering on many architectures. This patch
attempts to pack a and b into the lower elements, creating a differently
ordered shuffle for reconstructing the original, which may be better than
the select mask. This can be better for performance, especially if fewer
elements of a and b need to be computed and the input shuffles are
cheaper.
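For illustration (hypothetical 4-lane example), a select mask such as
<4 x i32> <i32 0, i32 5, i32 2, i32 7>
keeps lanes 0 and 2 of %a and lanes 1 and 3 of %b in their original
positions, whereas the rewritten form packs the computed lanes of %a and
%b into the low elements and rebuilds the result with a differently
ordered reconstruction shuffle.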
Because select-masks are just one form of shuffle, we generalize to any
mask. So long as the backend has a decent cost model for the shuffles, this
can generally improve things when they come up. For more basic cost
models the folds do not appear to be profitable, not getting past the
cost checks.
Differential Revision: https://reviews.llvm.org/D123911
Running iwyu-diff on the LLVM codebase since fa5a4e1b95c8f37796 detected a few
regressions; this fixes them.
Differential Revision: https://reviews.llvm.org/D124847
I think this sort comparator was overly complex, and the Windows
expensive-checks bot agreed, failing as it was not giving a strict weak
ordering. Change it to use the comparison of the mask values as unsigned
integers. This should sort the undef elements to the end whilst keeping
X<Y otherwise.
Given a shuffle feeding a commutative reduction, the lane ordering of
the shuffle will not alter the result. This is also true if there are a
number of operations between the reduction and the shuffle, providing
they only operate lane-wise. This patch searches for cases like that in
Vector Combine, allowing us to check the cost of the shuffle vs an
in-order identity shuffle and replace the order if possible. This only
handles a single shuffle at the moment to keep things simple, and is
able to ignore splats that produce results where every result is the
same.
This is a more powerful version of a combine that already happens in
InstCombine, capable of optimizing more cases by looking through more
instructions and being able to cost the shuffle.
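A sketch of the kind of case this covers (hypothetical values): the xor with a splat constant is lane-wise and the reduction sums every lane, so the reversing shuffle can be dropped:
%s = shufflevector <4 x i32> %a, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%n = xor <4 x i32> %s, <i32 -1, i32 -1, i32 -1, i32 -1>
%r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %n)
-->
%n = xor <4 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1>
%r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %n)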
Differential Revision: https://reviews.llvm.org/D123494
We cannot bitcast pointers across different address spaces. This was
previously fixed in D89577, but then in D93229 an enhancement was added
which peeks further through the pointer operand, opening up the
possibility that address-space violations could be introduced.
Instead of bailing as the previous fix did, simply insert an
addrspacecast instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D121787
We should not lose analysis precision if an 'add' has both no-wrap
flags (nsw and nuw) compared to just one or the other.
This patch is modeled on a similar construct that was added with
D59386.
I don't think it is possible to expose a problem with an unsigned
compare because of the way this was coded (nuw is handled first).
InstCombine has an assert that fires with the example from:
https://github.com/llvm/llvm-project/issues/52884
...because it was expecting InstSimplify to handle this kind of
pattern with an smax.
Fixes #52884
Differential Revision: https://reviews.llvm.org/D116322
These are deprecated and should be replaced with getAlign().
Some of these asserts don't do anything because Load/Store/AllocaInst never have an alignment value of 0.
shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
This is motivated by an example in D111800
(although that patch avoids the problem for that particular example).
The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D
There is no difference on the PhaseOrdering test from D111800
because the aarch64 cost model says that the shuffle cost is 3 while
the fadd cost is 2.
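A sketch with hypothetical 4-element vectors and an arbitrary mask:
%b0 = fadd <4 x float> %x, %y
%b1 = fadd <4 x float> %x, %w
%r = shufflevector <4 x float> %b0, <4 x float> %b1, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
-->
%sx = shufflevector <4 x float> %x, <4 x float> %x, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
%syw = shufflevector <4 x float> %y, <4 x float> %w, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
%r = fadd <4 x float> %sx, %syw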
Differential Revision: https://reviews.llvm.org/D111901