After clearing the dependencies in copyable data, need to recalculate
dependencies for the original ScheduleData, if it can be marked as
control dependent.
Fixes#153289
Adds initial support for copyable elements, both schedulable and
non-schedulable.
Adds support only for add for now, other opcodes will added in future.
Still some cases are not handled, e.g. stores do not include this,
because currently do not check for copyable elements.
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/147366
Added initial check for potential fmad conversion in reductions and
operands vectorization.
Added the check for instruction to fix#152683
Skipped the code for reduction to avoid regressions.
Changes: The original patch, landed as 1336675, was reverted due to a
bug in LoopVectorize resulting in a crash. The bug has now been fixed by
95c32bf ([VPlan] Return invalid cost if any skeleton block has invalid
costs), and this reland is identical to the original patch.
If the instruction is checked for matching the main instruction, need to
check if the opcode of the main instruction is compatible with the
operands of the instruction. If they are not, need to check the
alternate instruction and its operands for compatibility and return
alternate instruction as a match.
Fixes#151699
Fixed check for non-supported binary operations.
If the instruction is checked for matching the main instruction, need to
check if the opcode of the main instruction is compatible with the
operands of the instruction. If they are not, need to check the
alternate instruction and its operands for compatibility and return
alternate instruction as a match.
Fixes#151699
Certain fcmp predicates need to be expanded into multiple operations and
or'd together. This adds some more accurate cost modelling for them
based on the predicate. Unsupported operations are given the cost of a
libcall and the latency is set to 2 as that seemed to be fairly common
between different CPUs.
When these were converted to CostKindTblEntry the throughput was mainly copied to all cost kinds
Regenerated with my check_cost_tables.py helper script
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.
Fixed compile time regressions, reported crashes, updated release notes
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
This reverts commit c9cea24fe68e24750b2d479144f839e1c2ec9d2b.
This is being reverted as it is intermixed with another commit
(898bba311f180ed54de33dc09e7071c279a4942a) that needs to be reverted.
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.
Fixed compile time regressions, updated release notes
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/140279
If all slices are small and end up with strided or even vectorization
states, better to not consider these candidates for the vectorization
and try to vectorize the whole bunch as gathered loads.
Reviewers: hiraditya, RKSimon, HanKuanChen
Reviewed By: RKSimon, HanKuanChen
Pull Request: https://github.com/llvm/llvm-project/pull/149209
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.
This meant that passes like LoopVectorize and SLPVectorizer, which
likely are the main users of the interface, would understimate the cost
of insert/extracts that move data between GPR and vector registers,
which has non-trivial costs.
The patch removes the special case and only returns costs of zero for
lane 0 if it there is no need to transfer between integer and vector
registers.
This impacts a number of SLP test, and most of them look like general
improvements.I think the change should make things more accurate for any
AArch64 target, but if not it could also just be Apple CPU specific.
I am seeing +2% end-to-end improvements on SLP-heavy workloads.
PR: https://github.com/llvm/llvm-project/pull/146526
It assumed that the VF remains constant throughout the tree. That's not
always true. This meant that we could query the extraction cost for a
lane that is out of bounds.
While experimenting with re-vectorisation for AArch64, we ran into this
issue. We cannot add a proper AArch64 test as more changes would need to
be brought in.
This commit is only fixing the computation of VF and adding an assert.
Some tests were failing after adding the assert:
- foo() in llvm/test/Transforms/SLPVectorizer/X86/horizontal.ll
- test() in
llvm/test/Transforms/SLPVectorizer/X86/reduction-with-removed-extracts.ll
- test_with_extract() in
llvm/test/Transforms/SLPVectorizer/RISCV/segmented-loads.ll
Update test with all zero constant input values which get folded during
IR construction to actually use different input values, which require
materializing build vectors.
Added emission of the 2-element reduction instead of 2 extracts + scalar
op, when trying to vectorize operands of the instruction, if it is more
profitable.
This reverts commit ac4a38e9bd573a173432b89cbef7cce7a48e7907.
This breaks the RVV builders
(MicroBenchmarks/ImageProcessing/Blur/blur.test and
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test from llvm-test-suite)
and reportedly SPEC Accel2023
<https://github.com/llvm/llvm-project/pull/147583#issuecomment-3057183138>.
Added emission of the 2-element reduction instead of 2 extracts + scalar
op, when trying to vectorize operands of the instruction, if it is more
profitable.
This canonicalizes fneg/fabs (shuffle X, poison, mask) -> shuffle
(fneg/fabs X), posion, mask
This undoes part of b331a7ebc1e02f9939d1a4a1509e7eb6cdda3d38 and
a8f13dbdeb31be37ee15b5febb7cc2137bbece67, but keeps the binary shuffle
case i.e. shuffle fneg, fneg, mask.
By pulling out the shuffle we bring it inline with the same
canonicalisation we perform on binary ops and intrinsics, which the
original commit acknowledges it goes in the opposite direction.
However nowadays VectorCombine is more powerful and can do more
optimisations when the shuffle is pulled out, so I think we should
revisit this. In particular we get more shuffles folded and can perform
scalarization.
This patch adjusts the cost model to account for the ability of the
AMDGPU optimizer to group together i8 values into i32 values.
Co-authored-by: Erich Keane <ekeane@nvidia.com>
The memory value and mask value types might legalise differently - e.g. a v64i32 might split into 4 x v16i32 / 8 x v8i32 but the mask might legalize as 1 x v64i8 / 2 x v32i8 etc.
If the legalised value type has been split, then we must ensure we compute the cost for the entire mask value type and let getShuffleCost handle any legalisation, not assume that only a single trailing split mask will require widening.
We should only divide the number of pieces to fit the packed instructions
if we actually have pk instructions. This increases the cost of copysign,
but is closer to the current codegen output. It could be much cheaper
than it is now.
Patch fixes cost estimation for the extractelements from non-power-of-2
vectors, defined as subvector extracts. In this case the subvector size
might be not adjusted to a whole register size, need to get the minimum
between whole vector size and the actual difference to prevent compiler
crash.
Fixes#143513
When implementing the vectorization, we potentially need to add shuffles
for external users. In such cases, we may be shuffling a smaller vector
into a larger vector. When this happens `ResizeToVF` will just build a
poison padded identity vector. Then the to build the final shuffle, we
just use the `SK_InsertSubvector` mask.
This is possibly clearer by looking at the included test in
SLPVectorizer/AMDGPU/external-shuffle.ll
In the exit block we have a bunch of shuffles to glue the vectorized
tree match the `InsertElement` users. `TMP25` holds the result of
resizing the v2i16 vectorized sequence to match the `InsertElement` size
v16i16. Then `TMP26` is the final shuffle which replaces the
`InsertElement` sequence. This is just an insertsubvector.
However, when calculating the cost for these shuffles, we aren't
modelling this correctly. `ResizeToVF` will indicate to
`performExtractsShuffleAction` that we cannot use the original mask due
to the resize shuffle. The consequence is that the cost calculation uses
a different shuffle mask than what is ultimately used.
Going back to the included test, we can consider again `TMP26`. Clearly
we can see the shuffle uses a mask {0, 1, 2, 3, 16, 17, poison ..}.
However, we will currently calculate the cost with a mask {0, 1, 2, 3,
20, 21, ...} we have replaced 16 and 17 with 20 and 21 (Index + Vector
Size). Queries like BasicTTImpl::improveShuffleKindFromMask will not
recognize this as an `SK_InsertSubvector` mask, and targets which have
reduced costs for `SK_InsertSubvector` will not accurately calculate the
cost.