Added support for LShr instructions as a base for copyable elements.
Also added a simple analysis to select the best base instruction when
multiple candidates are available.
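A hypothetical sketch (names invented): the mul operands {%x0, %b} can
now use lshr as their base opcode, with %b modelled as an identity
shift:

  define i32 @lshr_base_sketch(i32 %a, i32 %b, i32 %c) {
    %x0 = lshr i32 %a, 2   ; operand lane 0: a real lshr, the base
    %y0 = mul i32 %x0, %c
    %y1 = mul i32 %b, %c   ; operand lane 1 is plain %b, now modellable
                           ; as "lshr i32 %b, 0" (an identity shift)
    %r  = add i32 %y0, %y1
    ret i32 %r
  }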
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/153393
After clearing the dependencies in the copyable data, we need to
recalculate the dependencies for the original ScheduleData, if it can be
marked as control dependent.
Fixes #153289
Adds initial support for copyable elements, both schedulable and
non-schedulable.
Adds support only for add for now; other opcodes will be added in the
future.
Some cases are still not handled, e.g. stores, because they currently do
not check for copyable elements.
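A small illustrative sketch (names invented): the xor operands below are
{%x0, %c}, and lane 1 is not an add; modelling it as "add i32 %c, 0"
lets both operand lanes share the add opcode:

  define void @copyable_sketch(ptr %p, i32 %a, i32 %b, i32 %c, i32 %d) {
    %x0 = add i32 %a, %b   ; operand lane 0: a real add
    %y0 = xor i32 %x0, %d
    %y1 = xor i32 %c, %d   ; operand lane 1 is plain %c, modelled as
                           ; "add i32 %c, 0" for vectorization
    store i32 %y0, ptr %p
    %q = getelementptr inbounds i32, ptr %p, i64 1
    store i32 %y1, ptr %q
    ret void
  }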
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/147366
Added an initial check for potential fmad conversion in reductions and
operands vectorization.
Added the check for the instruction to fix #152683.
Skipped the code for reductions to avoid regressions.
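A minimal sketch (flags and names are illustrative) of the kind of
fmul + fadd pair a backend may contract into a single fma, which is
what the check tries to detect:

  define float @fma_candidate(float %a, float %b, float %c) {
    ; An fadd fed by an fmul; with the contract flag a backend may fold
    ; the pair into one fma, so keeping it scalar can be cheaper than
    ; vectorizing the operands separately.
    %m = fmul contract float %a, %b
    %s = fadd contract float %m, %c
    ret float %s
  }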
When an instruction is checked for a match against the main instruction,
we need to check whether the opcode of the main instruction is
compatible with the operands of the instruction. If it is not, we need
to check the alternate instruction and its operands for compatibility
and return the alternate instruction as the match.
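For context, a minimal sketch of a main/alternate opcode bundle (the
usual add/sub alternation; names are illustrative), where lane 0 matches
the main opcode and lane 1 matches the alternate one:

  define <2 x i32> @alt_opcode_sketch(i32 %a0, i32 %b0, i32 %a1, i32 %b1) {
    %x0 = add i32 %a0, %b0   ; main opcode: add
    %x1 = sub i32 %a1, %b1   ; alternate opcode: sub
    %v0 = insertelement <2 x i32> poison, i32 %x0, i32 0
    %v1 = insertelement <2 x i32> %v0, i32 %x1, i32 1
    ret <2 x i32> %v1
  }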
Fixes #151699
Fixed the check for non-supported binary operations.
When an instruction is checked for a match against the main instruction,
we need to check whether the opcode of the main instruction is
compatible with the operands of the instruction. If it is not, we need
to check the alternate instruction and its operands for compatibility
and return the alternate instruction as the match.
Fixes #151699
Adds initial support for copyable elements. This patch only models adds
and models copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements which do not require scheduling is added, to
reduce the size of the patch.
Fixed compile-time regressions and reported crashes, updated release
notes.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
This reverts commit c9cea24fe68e24750b2d479144f839e1c2ec9d2b.
This is being reverted as it is intermixed with another commit
(898bba311f180ed54de33dc09e7071c279a4942a) that needs to be reverted.
Adds initial support for copyable elements. This patch only models adds
and models copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements which do not require scheduling is added, to
reduce the size of the patch.
Fixed compile-time regressions, updated release notes.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
Adds initial support for copyable elements. This patch only models adds
and models copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements which do not require scheduling is added, to
reduce the size of the patch.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
Update LV to vectorize maxnum/minnum reductions without fast-math flags
by adding an extra check in the loop for whether any inputs to
maxnum/minnum are NaN, due to maxnum/minnum behavior w.r.t. signaling
NaNs. Signed zeros are already handled consistently by maxnum/minnum.
If any input is NaN:
 * exit the vector loop,
 * compute the reduction result up to the vector iteration that
   contained the NaN inputs, and
 * resume in the scalar loop.
New recurrence kinds are added for reductions using maxnum/minnum
without fast-math flags.
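For illustration, a reduced form of the kind of loop this enables (no
fast-math flags on the maxnum; names are made up):

  declare float @llvm.maxnum.f32(float, float)

  define float @maxnum_red(ptr %a, i64 %n, float %start) {
  entry:
    br label %loop
  loop:
    %iv  = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %red = phi float [ %start, %entry ], [ %red.next, %loop ]
    %gep = getelementptr inbounds float, ptr %a, i64 %iv
    %x   = load float, ptr %gep, align 4
    %red.next = call float @llvm.maxnum.f32(float %red, float %x)
    %iv.next  = add nuw nsw i64 %iv, 1
    %ec  = icmp eq i64 %iv.next, %n
    br i1 %ec, label %exit, label %loop
  exit:
    ret float %red.next
  }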
PR: https://github.com/llvm/llvm-project/pull/148239
If all slices are small and end up with strided or even vectorization
states, it is better not to consider these candidates for vectorization
and instead to try to vectorize the whole bunch as gathered loads.
Reviewers: hiraditya, RKSimon, HanKuanChen
Reviewed By: RKSimon, HanKuanChen
Pull Request: https://github.com/llvm/llvm-project/pull/149209
The code assumed that the VF remains constant throughout the tree.
That's not always true. This meant that we could query the extraction
cost for a lane that is out of bounds.
While experimenting with re-vectorisation for AArch64, we ran into this
issue. We cannot add a proper AArch64 test as more changes would need to
be brought in.
This commit only fixes the computation of the VF and adds an assert.
Some tests were failing after adding the assert:
- foo() in llvm/test/Transforms/SLPVectorizer/X86/horizontal.ll
- test() in
llvm/test/Transforms/SLPVectorizer/X86/reduction-with-removed-extracts.ll
- test_with_extract() in
llvm/test/Transforms/SLPVectorizer/RISCV/segmented-loads.ll
Added emission of a 2-element reduction instead of 2 extracts + a scalar
op when trying to vectorize the operands of an instruction, if it is
more profitable.
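Roughly speaking, instead of emitting (types are illustrative)

  %e0 = extractelement <2 x i32> %v, i32 0
  %e1 = extractelement <2 x i32> %v, i32 1
  %r  = add i32 %e0, %e1

the vectorizer can now emit a single 2-element reduction when the cost
model says it is cheaper:

  %r = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %v)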
This reverts commit ac4a38e9bd573a173432b89cbef7cce7a48e7907.
This breaks the RVV builders
(MicroBenchmarks/ImageProcessing/Blur/blur.test and
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test from llvm-test-suite)
and reportedly SPEC Accel2023
<https://github.com/llvm/llvm-project/pull/147583#issuecomment-3057183138>.
Added emission of a 2-element reduction instead of 2 extracts + a scalar
op when trying to vectorize the operands of an instruction, if it is
more profitable.
Similar to FindLastIV, add FindFirstIVSMin to support select(icmp(), x, y)
reductions where one of x or y is a decreasing induction, producing an
SMin reduction. It uses the signed maximum as the sentinel value.
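For illustration, a reduced loop of roughly this shape (names are
invented):

  define i32 @find_first_dec(ptr %a, i64 %n, i32 %key, i32 %idx.start) {
  entry:
    br label %loop
  loop:
    %iv   = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    %idx  = phi i32 [ %idx.start, %entry ], [ %idx.next, %loop ]
    %res  = phi i32 [ -1, %entry ], [ %res.next, %loop ]
    %gep  = getelementptr inbounds i32, ptr %a, i64 %iv
    %x    = load i32, ptr %gep, align 4
    %c    = icmp eq i32 %x, %key
    %res.next = select i1 %c, i32 %idx, i32 %res
    %idx.next = add nsw i32 %idx, -1   ; decreasing induction
    %iv.next  = add nuw nsw i64 %iv, 1
    %ec   = icmp eq i64 %iv.next, %n
    br i1 %ec, label %exit, label %loop
  exit:
    ret i32 %res.next
  }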
PR: https://github.com/llvm/llvm-project/pull/140451
A shuffle takes two input vectors and a mask, producing a new vector of
size <MaskElts x SrcEltTy>. Historically it has been assumed that the
SrcTy and the DstTy are the same for getShuffleCost, with that being
relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elements and the
source element size, but the Mask is not always provided and the Tp is
not reliably always the SrcTy. This has led to situations, notably in
the SLP vectorizer but also in the generic cost routines, where
assumptions about how vectors will be legalized are built into the
generic cost routines - for example whether they will widen or promote,
with the cost modelling assuming they will widen while the default
lowering promotes integer vectors.
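For example, in IR the result type of a shuffle need not match its
source type, which is why the SrcTy alone is ambiguous:

  define <8 x i8> @concat(<4 x i8> %a, <4 x i8> %b) {
    ; The sources are <4 x i8>, but the 8-element mask makes the result
    ; <8 x i8>, so the shuffle's SrcTy and DstTy differ.
    %r = shufflevector <4 x i8> %a, <4 x i8> %b,
                       <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                  i32 4, i32 5, i32 6, i32 7>
    ret <8 x i8> %r
  }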
This patch attempts to start improving that - it originally tried to
alter more of the cost model, but that quickly became too many changes
at once, so this patch just plumbs a DstTy into getShuffleCost so that
DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy as the primary type used in the shuffle cost routines and only
using the DstTy where it was used in the past (for InsertSubVector, for
example).
Some asserts have been added that help check for consistent values when
a Mask and a DstTy are provided to getShuffleCost. Some of them took a
while to get right, and some non-mask calls might still be incorrect.
Hopefully this will provide a useful base on which to build more
shuffles that alter size.
ArrayRef has a constructor that accepts std::nullopt. This constructor
dates back to the days when we still had llvm::Optional.
Since the use of std::nullopt outside the context of std::optional is
something of an abuse and not intuitive to newcomers, I would like to
move away from this constructor and eventually remove it.
This patch takes care of the llvm side of the migration.
The patch fixes cost estimation for extractelements from non-power-of-2
vectors, modelled as subvector extracts. In this case the subvector size
might not be adjusted to a whole register size, so we need to take the
minimum of the whole vector size and the actual difference to prevent a
compiler crash.
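For illustration (hypothetical, reduced sizes): extracting what would be
a register-wide subvector near the end of a non-power-of-2 vector runs
past its last element, so the subvector size has to be clamped to the
elements that actually remain:

  define <2 x i32> @tail_of_v6(<6 x i32> %v) {
    ; At offset 4 of a <6 x i32> only two elements remain, even if a
    ; whole register could hold four.
    %sub = shufflevector <6 x i32> %v, <6 x i32> poison,
                         <2 x i32> <i32 4, i32 5>
    ret <2 x i32> %sub
  }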
Fixes #143513
When implementing the vectorization, we potentially need to add shuffles
for external users. In such cases, we may be shuffling a smaller vector
into a larger vector. When this happens, `ResizeToVF` will just build a
poison-padded identity vector. Then, to build the final shuffle, we just
use the `SK_InsertSubvector` mask.
This is possibly clearer by looking at the included test in
SLPVectorizer/AMDGPU/external-shuffle.ll.
In the exit block we have a bunch of shuffles to glue the vectorized
tree to the `InsertElement` users. `TMP25` holds the result of resizing
the v2i16 vectorized sequence to match the `InsertElement` size v16i16.
Then `TMP26` is the final shuffle which replaces the `InsertElement`
sequence. This is just an insertsubvector.
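A reduced sketch of the same structure (smaller types than the test;
names invented):

  define <8 x i16> @resize_and_insert(<2 x i16> %small, <8 x i16> %wide) {
    ; ResizeToVF-style padding of the small vectorized result up to the
    ; wide width, using a poison-padded identity shuffle.
    %pad = shufflevector <2 x i16> %small, <2 x i16> poison,
                         <8 x i32> <i32 0, i32 1, i32 poison, i32 poison,
                                    i32 poison, i32 poison, i32 poison,
                                    i32 poison>
    ; The final shuffle: lanes 4 and 5 take elements 8 and 9, i.e. the
    ; first two elements of the padded vector - an insert-subvector mask.
    %res = shufflevector <8 x i16> %wide, <8 x i16> %pad,
                         <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                    i32 8, i32 9, i32 poison, i32 poison>
    ret <8 x i16> %res
  }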
However, when calculating the cost for these shuffles, we aren't
modelling this correctly. `ResizeToVF` will indicate to
`performExtractsShuffleAction` that we cannot use the original mask due
to the resize shuffle. The consequence is that the cost calculation uses
a different shuffle mask than what is ultimately used.
Going back to the included test, we can consider `TMP26` again. Clearly
we can see the shuffle uses the mask {0, 1, 2, 3, 16, 17, poison ..}.
However, we will currently calculate the cost with the mask {0, 1, 2, 3,
20, 21, ...}: 16 and 17 have been replaced with 20 and 21 (Index +
Vector Size). Queries like BasicTTIImpl::improveShuffleKindFromMask will
not recognize this as an `SK_InsertSubvector` mask, and targets which
have reduced costs for `SK_InsertSubvector` will not accurately
calculate the cost.
Seeing how we can't generate any debug intrinsics any more: delete a
variety of codepaths where they're handled. For the most part these are
plain deletions; in others I've tweaked comments to remain coherent, or
added a type to (what were) type-generic lambdas.
This isn't all the DbgInfoIntrinsic call sites, but it's most of the
simple scenarios.
Co-authored-by: Nikita Popov <github@npopov.com>