Add a NumSrcElts param to the is..Mask functions in the
ShuffleVectorInst class for better mask analysis. Mask.size() does not
always match the sizes of the permuted vector(s). This allows better
cost estimation in SLP and fixes uses of the functions in other cases.
Differential Revision: https://reviews.llvm.org/D158449
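
To illustrate why the mask length alone is not enough, here is a minimal
standalone sketch. It is not the LLVM implementation: the helper name
isIdentityMaskSketch and its signature are hypothetical, and it only
mirrors the idea of passing the source element count separately from the
mask. A mask element of -1 is assumed to mean an undefined lane.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an is..Mask-style query; not LLVM's code.
// Mask[I] == -1 is treated as an undefined lane that matches anything.
bool isIdentityMaskSketch(const std::vector<int> &Mask, int NumSrcElts) {
  assert(NumSrcElts > 0 && "source vector must have at least one element");
  // An identity shuffle returns the whole source vector unchanged, so a
  // mask whose length differs from the source length cannot be one.
  if (static_cast<int>(Mask.size()) != NumSrcElts)
    return false;
  for (int I = 0; I < NumSrcElts; ++I)
    if (Mask[I] != -1 && Mask[I] != I)
      return false;
  return true;
}
```

With a 4-element source, the mask {0, 1} selects a 2-element subvector.
Judging it by Mask.size() alone would misclassify it as an identity.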
Need to consider the length of the original vector for extractelements,
not the number of matched scalars. This fixes 2 issues: 1) it improves
cost estimation; 2) it fixes crashes after D158449.
This caused asserts:
Assertion failed: NumElts > 1 && "Expected at least 2-element fixed length vector(s).",
file C:\b\s\w\ir\cache\builder\src\third_party\llvm\llvm\lib\Transforms\Vectorize\SLPVectorizer.cpp, line 7096
see comment on 59a67ea35d
> Need to consider the length of the original vector for extractelements,
> not the length, matched number of the scalars. It fixes 2 issues: 1)
> improves cost estimation; 2) Fixes crashes after D158449.
This reverts commit 59a67ea35d608480257fc64ec3e5106ef50de740.
Need to consider the length of the original vector for extractelements,
not the number of matched scalars. This fixes 2 issues: 1) it improves
cost estimation; 2) it fixes crashes after D158449.
scheduling, is previously vectorized.
If the main node was already vectorized but does not require
scheduling, we can still try to vectorize it in this new node instead of
gathering.
Reordering of possibly strided nodes in bottom-to-top order requires
top-to-bottom reordering of the operands of such nodes, which is not
supported. Disable reordering of strided operands to avoid compiler
crashes.
This reverts commit 9a99944df068b29b905cd8ba9a2132cc6382b6fb.
Due to test suite failures on all our SVE buildbots e.g.:
https://lab.llvm.org/buildbot/#/builders/184/builds/7375
clang: ../llvm/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:3565:
InstructionCost llvm::AArch64TTIImpl::getShuffleCost(TTI::ShuffleKind,
VectorType *, ArrayRef<int>, TTI::TargetCostKind, int, VectorType *,
ArrayRef<const Value *>): Assertion `Mask.size() == TpNumElts && "Expected Mask and Tp size to match!"' failed.
artificial for better cost estimation.
Need to use the original source vector type, not the one artificially
constructed from the number of vectorized scalars. It affects the
cost significantly.
scheduling, is previously vectorized.
If the main node was already vectorized but does not require
scheduling, we can still try to vectorize it in this new node instead of
gathering.
instruction.
Need to check if the operand scalars are vectorized in a different
vector node, if the main instruction is already vectorized in another
vector node.
vectorized node uses.
If the instruction is vectorized in many different vector nodes, it may
break the dependency analysis for gathered nodes with matched scalars.
Need to properly check the dependency between such gather nodes to avoid
a cyclic dependency.
On the very first iteration for the reductions, when trying to build a
reduction for boolean logic operations, there is no need to compare
LHS/RHS with the Reduction(VectorizedTree); they need to be compared
with the actual parameters of the reduction operations.
If the scalar does not need to be scheduled and it was already
vectorized in one of the vector nodes, we can still try to vectorize it
in another node. Its cost just does not need to be accounted for in the
scalar total cost, as it will be handled in the main vectorized node.
Differential Revision: https://reviews.llvm.org/D159205
This reverts commit db5d845c73ee2d64f1a5bab3fc72edece9e3a7ba.
As per PR discussion "Looks like we've missed lowering of bitcasts
between v2f16 and v2i16 and it breaks XLA."
We can still try to vectorize the bundle of instructions, even if the
number of repeated instructions is not a power of 2. In this case, the
cost needs to be adjusted (calculated only for the unique scalar
instructions), along with the cost of the extracts. Also, when
scheduling the bundle, only the unique scalars need to be scheduled, to
avoid a compiler crash caused by multiple dependencies. This can be
safely applied only if all of the scalars' users are also vectorized and
do not require memory accesses (this is a temporary requirement and can
be relaxed later).
---------
Co-authored-by: Alexey Bataev <a.bataev@outlook.com>
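
The idea of costing and scheduling only the unique scalars can be
sketched in isolation as follows. This is an illustrative assumption,
not the SLPVectorizer code: the helper uniquifyBundleSketch is
hypothetical, scalars are modelled as strings, and the returned index
list plays the role of a reuse mask that maps each original lane back to
its unique scalar.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper; scalars are modelled as names for illustration.
// Returns the unique scalars in first-seen order plus, for every lane of
// the original bundle, the index of its unique scalar.
std::pair<std::vector<std::string>, std::vector<int>>
uniquifyBundleSketch(const std::vector<std::string> &Bundle) {
  assert(!Bundle.empty() && "expected a non-empty bundle");
  std::vector<std::string> Unique;
  std::vector<int> ReuseMask;
  std::unordered_map<std::string, int> FirstIndex;
  for (const std::string &Scalar : Bundle) {
    // emplace leaves the map untouched if the scalar was seen before.
    auto Res = FirstIndex.emplace(Scalar, static_cast<int>(Unique.size()));
    if (Res.second)
      Unique.push_back(Scalar);
    ReuseMask.push_back(Res.first->second);
  }
  return std::make_pair(Unique, ReuseMask);
}
```

For a bundle of six lanes with three distinct scalars, only the three
unique entries would be costed and scheduled, and the index list records
how to rebuild the six lanes.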
The issue #55208 describes a current deficiency of the SLPVectorizer,
namely that it doesn't vectorize code written with lrint, while similar
code written with rint is vectorized. Add a test corresponding to this
issue for the RISC-V target.
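
For orientation, the two source shapes being contrasted could look like
the sketch below. This is an assumed straight-line form written for
illustration; it is not the code from issue #55208, and the function
names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Rounds four floats with rint; the result stays a float.
void roundWithRint(const float *In, float *Out) {
  assert(In && Out);
  Out[0] = std::rint(In[0]);
  Out[1] = std::rint(In[1]);
  Out[2] = std::rint(In[2]);
  Out[3] = std::rint(In[3]);
}

// Same shape with lrint; the result is an integer type (long).
void roundWithLrint(const float *In, long *Out) {
  assert(In && Out);
  Out[0] = std::lrint(In[0]);
  Out[1] = std::lrint(In[1]);
  Out[2] = std::lrint(In[2]);
  Out[3] = std::lrint(In[3]);
}
```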
Recently, 7f26c27 turned on SLP by default for RISC-V, and although
there are quite a few tests for SLP under the X86/ target, it is unclear
whether the same constructs would be vectorized on RISC-V. This patch
takes a step in the direction of remedying this, by noticing that ctpop
is often vectorized on RISC-V, and adding four tests for different
integer widths.
On sm_90 some instructions now support i16x2, which allows the hardware
to execute add, min and max instructions more efficiently.
In order to support that we need to make i16x2 a native type in the
backend. This does the necessary changes to make i16x2 a native type and
adds support for the instructions natively supporting i16x2.
This caused a negative test in nvptx slp to start passing. Changed the
test to a positive one as the IR is correctly vectorized.
In the past we sort of pretended float might be implementable
as a non-IEEE type but that never realistically would work. Exotic
FP types would need to be added to the IR. Turning these
into FP operations enables FP tracking optimizations.
https://reviews.llvm.org/D151937
This is a resurrection of D18874. This was previously wrong with
fneg conflated with fsub, but we now have a proper fneg instruction.
Additionally, I think it is now clearer that IR float=IEEE float,
and a different bit layout would require adding a different IR type.
https://reviews.llvm.org/D151934
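
As a source-level illustration of why float being IEEE float matters
here, the sketch below flips the sign bit of a float through its integer
bit pattern, which for IEEE binary32 equals negation. That this is the
exact pattern the patch handles is an assumption; the helper name
negateViaSignBitSketch is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: negates a float by xor-ing the sign bit of its
// bit pattern. Assumes float is IEEE binary32.
float negateViaSignBitSketch(float X) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "sketch assumes a 32-bit float");
  const bool WasNegative = std::signbit(X);
  std::uint32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits)); // bitcast float -> i32
  Bits ^= 0x80000000u;                  // flip only the sign bit
  std::memcpy(&X, &Bits, sizeof(X));    // bitcast i32 -> float
  assert(std::signbit(X) != WasNegative && "sign bit must have flipped");
  return X;
}
```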
Some callers pass in an empty mask to represent "unknown". We should use the generic costs for these cases. We can add VL=1 costing separately if desired.
Reapplying after revert. A new test had been added, and I'd missed updating it when rebasing before. This is a great happy accident as I hadn't figured out how to get SLP to exercise this case, I'd merely noticed it via inspection.
This assertion is introduced by D157425.
We should calculate the cost iff `Mask` is not empty.
Fixes 64901
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D158590
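
The guard can be sketched on its own as follows. Everything here is a
hypothetical illustration, not the target cost model: the function
shuffleCostSketch, the constant GenericShuffleCost, and the toy
"count the moved lanes" cost are all assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical fallback cost, chosen arbitrarily for the example.
constexpr int GenericShuffleCost = 4;

// Hypothetical cost query. An empty mask means "mask unknown", so the
// mask-specific path (and its size assertion) is skipped entirely.
int shuffleCostSketch(const std::vector<int> &Mask, int NumElts) {
  assert(NumElts > 0 && "expected a non-empty vector type");
  if (Mask.empty())
    return GenericShuffleCost;
  assert(static_cast<int>(Mask.size()) == NumElts &&
         "expected mask and type sizes to match");
  // Toy mask-specific cost: the number of lanes that actually move.
  int Moved = 0;
  for (int I = 0; I < NumElts; ++I)
    if (Mask[I] != -1 && Mask[I] != I)
      ++Moved;
  return Moved;
}
```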
If the masked gathers can be reordered so that they produce a strided
access pattern, and the reordering does not affect the common
reordering, it is better to try to reorder the masked gathers for
better performance.
Differential Revision: https://reviews.llvm.org/D157009
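
A minimal sketch of the underlying check, under the assumption that each
gathered lane is described by a byte offset from a common base: after
sorting, the offsets form a strided pattern if consecutive differences
are equal and nonzero. The helper isStridedAfterReorderSketch is
hypothetical and is not the SLPVectorizer logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: takes the offsets by value, sorts them, and
// reports whether they are evenly spaced. On success, Stride holds the
// common distance between consecutive offsets.
bool isStridedAfterReorderSketch(std::vector<std::int64_t> Offsets,
                                 std::int64_t &Stride) {
  assert(Offsets.size() >= 2 && "need at least two offsets");
  std::sort(Offsets.begin(), Offsets.end());
  Stride = Offsets[1] - Offsets[0];
  if (Stride == 0)
    return false; // duplicate addresses are not a strided pattern
  for (std::size_t I = 2; I < Offsets.size(); ++I)
    if (Offsets[I] - Offsets[I - 1] != Stride)
      return false;
  return true;
}
```

Offsets {24, 0, 16, 8} are not strided in their original order, but
become a stride-8 pattern once reordered.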