This change tries to enable vector reordering during vectorization for
non-power-of-two vectors. Specifically, my goal is to be able to
vectorize reductions whose operands appear in other than identity order.
(i.e. a[1] + a[0] + a[2]). Our standard pass pipeline, Reassociation
effectively canonicalizes towards this form. So for reduction
vectorization to be wildly applicable, we need this feature.
This change enables the use of a non-empty ReorderIndices structure -
which is effectively required for out of order loads or gathers - while
leaving the ReuseShuffleIndices mechanism unused and disabled. If I've
understood the code structure, the former is used when describing
implicit shuffles required by the vectorization strategy (i.e. loading
elements 0,1,3,2 in the order 0,1,2,3 and then shuffling later), while
the later is used when trying to optimize explode/buildvectors (called
gathers in this code).
I audited all the code enabled by this change, but can't claim to
deeply understand most of it. I added a couple of bailouts in places
which appeared to be difficult to audit and optional optimizations. I've
tried to do so in the least risky way I can, but am not completely
confident in this change. Careful review appreciated.
Need to correctly track reduced values with multiple uses in the same
reduction emission attempt. Otherwise, the number of the reuses might be
calculated incorrectly, and may cause compiler crash.
Fixes https://github.com/llvm/llvm-project/issues/107037
Need to correctly track reduced values with multiple uses in the same
reduction emission attempt. Otherwise, the number of the reuses might be
calculated incorrectly, and may cause compiler crash.
Fixes https://github.com/llvm/llvm-project/issues/107037
This patch replaces the find-try_emplace sequence with just one call
to try_emplace, thereby avoiding two successive hash lookups on the
same key. I am not using the "inserted" boolean from try_emplace to
preserve the original behavior (that is, before PR 107123) that checks
to see if the value is nullptr or not.
NEON has non-IEEE compliant denormal flushing and the compiler should
check if it safe to vectorize instructions for NEON in non-fast math
mode.
Fixes https://github.com/llvm/llvm-project/issues/106909
When the shuffle masks are `PoisonMaskElem`, there is not need to check
the cost of `SK_ExtractSubvector`. It is free. Otherwise, it will cause
the compiler to crash.
Assertion `(Idx + EltsPerVector) <= alignTo(NumElts, EltsPerVector) &&
"SK_ExtractSubvector index out of range"' failed.
Patch adds basic support for non-power-of-2 number of elements in
operands. The patch still requires that this number addresses whole
registers.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/106449
Patch adds basic support for non-power-of-2 number of elements in
operands. The patch still requires that this number addresses whole
registers.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/106449
After landing support for actual vectorization of the "clustered" loads,
need better estimate the cost between the masked gather and clustered loads.
This includes estimation of the address calculation and better
estimation of the gathered loads. Also, this estimation now relies on
SLPCostThreshold option, allowing modify the behavior of the compiler.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/105858
If the operand node has the same scalars as one of the vectorized nodes,
the compiler could miss this and incorrectly request minbitwidth data
for the wrong node. It may lead to a compiler crash, because the
vectorized node might have different minbw result.
Fixes https://github.com/llvm/llvm-project/issues/106667
If the value is used in Scalar several times, the first attempt to find
its position in the node (if ReuseShuffleIndices and ReorderIndices not
empty) may fail. In this case need to find another copy of the same
value and try again.
Fixes https://github.com/llvm/llvm-project/issues/106626
Need to consider the maximum type size in the graph before doing attempt
for the vectorization of non-power-of-2 number of elements, which may be
less than MinVF.
This isn't quite just code motion as the four different versions we had
of this routine differed in whether they ignored the "size" marker used
to represent undef. I doubt this matters in practice, but it is a
functional change.
---------
Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
Need to use original cmp type i1 when estimating the cost for the
buildvector node, not its operand types to prevent compiler crash upon
TTI cost estimation.
Currently, SLP uses shuffle for the external user of `InsertElementInst`
and iterates through the `InsertElementInst` chain to fill the mask with
constant indices. However, it may override the original Vec lane. Using
the original Vec lane is sufficient.
Build on the -slp-vectorize-non-power-of-2 experimental option, and
support vectorizing reductions with 2^N-1 sized vector.
Specifically, two related changes:
1) When searching for a profitable VL, start with the 2^N-1 reduction
width.
If cost model does not select that VL, return to power of two boundaries
when halfing the search VL. The later is mostly for simplicity.
2) Reduce the minimum reduction width from 4 to 3 when supporting
non-power
of two vectors. This is required to support <3 x Ty> cases.
One thing which isn't directly related to this change, but I want to
note for clarity is that the non-power-of-two vectorization appears to
be sensative to operand order of reduction. I haven't yet fully figured
out why, but I suspect this is non-power-of-two specific.
This particular variable name is shadowed by another lower in the
function, so reducing it's scope to it's single use removes the
shadowing and makes the code much less error prone.
If the node is not in MinBWs container and the user node is icmp node,
the compiler should not check the type size of the user instruction, it
is always 1 and is not good for actual bitwidth analysis.
Fixes https://github.com/llvm/llvm-project/issues/105988
Before checking the user components of the gather/buildvector nodes,
need to check if the node has users at all. Root nodes might not have
users, if it is a node for the reduction.
Fixes https://github.com/llvm/llvm-project/issues/105904
If the strided node is reversed, need to cehck for the last instruction,
not the first one in the list of scalars, when checking if the root
pointer must be extracted.
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
with "[Vectorize] Fix warnings"
It introduced compiler crashes, see #104144.
This reverts commit 69332bb8995aef60d830406de12cb79a50390261 and
351f4a5593f1ef507708ec5eeca165b20add3340.