This reverts commit d1e477b00b49c63ff4dd513eeb14a5b18bc055d7.
Recommit with extra checks making sure extends are VPWidenCastRecipes,
rejecting VPReplicateRecipes.
Original message:
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
This reverts commit f4e8cc1a2229dca76d21c8d37439c4c194b06b86.
This change wasn't NFC; it causes failed asserts when building
ffmpeg for i686 windows, see
https://github.com/llvm/llvm-project/pull/167851 for details.
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
Re-commit of https://github.com/llvm/llvm-project/pull/175839 after
fixing build without `LLVM_ENABLE_DUMP`.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
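The macro-plus-helper pattern described in the list above can be sketched as follows. This is a minimal standalone illustration, not the actual LLVM code: `ToyPlan`, `runToyPass`, `RUN_TOY_PASS`, `addConstant` and `doubleValue` are hypothetical names, and the log stands in for printing the plan after each transformation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a VPlan: a value to transform plus a log of the
// pass names recorded after each transformation.
struct ToyPlan {
  int Value = 0;
  std::vector<std::string> Log;
};

// One variadic helper instead of several overloads: run the pass with any
// extra arguments, then record its name (a real helper could print the plan
// at this point instead).
template <typename PassTy, typename... ArgsTy>
void runToyPass(const char *Name, PassTy &&Pass, ToyPlan &Plan,
                ArgsTy &&...Args) {
  std::forward<PassTy>(Pass)(Plan, std::forward<ArgsTy>(Args)...);
  Plan.Log.push_back(Name);
}

// Stringify the pass expression so call sites do not have to repeat the name.
#define RUN_TOY_PASS(PASS, ...) runToyPass(#PASS, PASS, __VA_ARGS__)

// Two illustrative "passes" with different signatures.
void addConstant(ToyPlan &Plan, int C) { Plan.Value += C; }
void doubleValue(ToyPlan &Plan) { Plan.Value *= 2; }

// Runs both passes and returns the plan so the recorded names can be checked.
ToyPlan runExample() {
  ToyPlan Plan;
  RUN_TOY_PASS(addConstant, Plan, 3);
  RUN_TOY_PASS(doubleValue, Plan);
  assert(Plan.Log.size() == 2);
  return Plan;
}
```

Because the name is captured by the preprocessor, adding a new pass to the pipeline needs no separate string to keep in sync with the function name.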
VPWidenActiveLaneMaskPHIRecipe does not have side-effects and also does
not access memory. Mark accordingly. This allows hoisting of some
invariant loads out of loops and also removing unused phi recipes in the
future.
In
llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll,
the hoisting makes vectorization profitable.
PR: https://github.com/llvm/llvm-project/pull/177886
This makes use of the llvm.vector.partial.reduce.fadd intrinsics added
in #163975 to handle the following with FDOT:
```
float32_t fdot(float16_t *src, int N) {
  float32_t sum = 0.0f;
  for (int i = 0; i < N; ++i)
    sum += src[i];
  return sum;
}
```
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
In some cases, we identify patterns as reductions, even though they can
be simplified to a non-reduction.
Mark VPReductionPHIRecipe as not reading from memory & not having
side-effects, to clean them up.
We also need to remove ComputeReductionResult VPInstructions with
live-in arguments. This means there is actually no reduction, and we
need to fold it to the live in. Otherwise we would incorrectly reduce
the live-in.
PR: https://github.com/llvm/llvm-project/pull/176795
This patch aims to align all nontemporal store/load handling to
systematically enforce a little-endian target. This has been the
effective support LLVM had for NT store/load lowering (there has been no
effective support for big-endian, even with the inconsistencies).
The change in `llvm/lib/Target/AArch64/AArch64InstrInfo.td` is
effectively NFC, because the only lowering of LDNP, in
`llvm/lib/Target/AArch64/AArch64ISelLowering.cpp`, already checks
for `isLittleEndian`. The change in
`llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h` affects its
single caller
`llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp`. The
previous logic has been wrong, enabling vectorization of effectively
illegal nontemporal store/load instructions on big-endian.
Fixes #176208. Scaled-back version of #176515 that only affects the RISCV backend.
Only modifies the cost for cases when DIV is a legal operation.
Updates the cost for both Scalar and Vector types.
Used `TTI::TCC_Expensive` as suggested by
https://github.com/llvm/llvm-project/issues/176208#issuecomment-3760902537.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
When a vectorized loop has a constant trip count, it's important to update
the profile information accordingly, because hotness analysis only looks at
profile info.
For example, in the `tripcount.ll` test, without producing the profile
info, in the `const_trip_over_profile` function, the BFI of the
`vector.body` would be 32 (this is the expected value when synthetic
branch weights are used, in loops). The real value is 250. The
`for.body` value was _very_ incorrect before, too (and detrimentally so,
as it would have appeared as "very hot" when it wasn't):
The table below was obtained by printing BFI in the RUN: command, i.e.
`build/bin/opt < llvm/test/Transforms/LoopVectorize/tripcount.ll
-passes="loop-vectorize,print<block-freq>"
-loop-vectorize-with-block-frequency -S -o /dev/null`. Showing only the
`float` value, i.e. the BFI relative to the function entry BB.
```
Printing analysis results of BFI for function 'const_trip_over_profile':
block-frequency-info: const_trip_over_profile
```
| Block | Before | After |
| ----- | ------ | ----- |
| `entry` | float = 1.0 | float = 1.0 |
| `vector.ph` | float = 1.0 | float = 1.0 |
| `vector.body` | float = **32.0** | float = **250.0** |
| `middle.block` | float = 1.0 | float = 1.0 |
| `scalar.ph` | float = 1.0 | float = 1.0 |
| `for.body` | float = **2147483647.8** | float = **1.0** |
| `for.end` | float = 1.0 | float = 1.0 |
In the isOutsideLoopWorkProfitable function, there are two places where only
the runtime check cost (RtC) should be used, but where the costs of the
middle blocks and early-exit blocks were incorrectly included as well:
1. VectorizeMemoryCheckThreshold comparison for interleaving-only
2. Minimum trip count that bounds runtime check overhead, i.e. MinTC2
calculation
This results in an overly conservative minimum profitable trip count.
This patch separates the runtime check cost from the total overhead
cost, and uses only RtC for VectorizeMemoryCheckThreshold comparison and
the MinTC2 calculation.
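To illustrate why the extra costs make the bound overly conservative, here is a minimal sketch. It assumes the MinTC2 bound has the form `RtC * X / ScalarC` (the runtime checks may cost at most `1/X` of the scalar loop); the helper names, the value of `X`, and all cost numbers are illustrative assumptions, not values taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Integer division rounding up; D must be nonzero.
uint64_t divideCeil(uint64_t N, uint64_t D) {
  assert(D != 0 && "division by zero");
  return (N + D - 1) / D;
}

// Hypothetical helper: smallest trip count TC for which
//   CheckCost <= ScalarCost * TC / Fraction
// holds, i.e. the checks cost at most 1/Fraction of the scalar loop.
uint64_t minTripCountForChecks(uint64_t CheckCost, uint64_t ScalarCost,
                               uint64_t Fraction) {
  return divideCeil(CheckCost * Fraction, ScalarCost);
}

// Illustrative costs: runtime checks = 20, middle/early-exit blocks = 30,
// one scalar iteration = 4, Fraction = 10.
uint64_t boundWithChecksOnly() { return minTripCountForChecks(20, 4, 10); }
uint64_t boundWithAllOverhead() { return minTripCountForChecks(20 + 30, 4, 10); }
```

With these numbers the bound is 50 when only the runtime check cost is used and 125 when the middle and early-exit block costs are folded in, so the latter rejects loops that the former would accept.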
FindLast in-loop reductions are not supported, similarly to FindLastIV
reductions. Skip them in collectInLoopReductions, to avoid a crash for
loops with FindLast reductions and in-loop reductions preferred.
Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to model finding the first/last induction more
explicitly and simply. This boils down to a min/max reduction +
compare and select of the sentinel value.
PR: https://github.com/llvm/llvm-project/pull/176672
If a UserIC is provided, each iteration of the vector loop will process
VF * UserIC scalar iterations. Pass UserIC through to computeFeasibleMaxVF
and use it to limit the max VF to factors where VF * UserIC <= MaxTripCount.
This avoids creating dead vector loops with user-provided interleave counts.
PR: https://github.com/llvm/llvm-project/pull/174573
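The clamping condition can be sketched as below. This is a simplified model, not the code in computeFeasibleMaxVF: `clampMaxVF` and `floorPowerOfTwo` are hypothetical helpers, and the sketch assumes fixed-width, power-of-two VFs, with 0 meaning "unknown" for the trip count and "not provided" for the interleave count.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that is <= X; X must be nonzero.
uint64_t floorPowerOfTwo(uint64_t X) {
  assert(X != 0 && "no power of two is <= 0");
  uint64_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Hypothetical helper: limit MaxVF (assumed to be a power of two) so that
// VF * UserIC <= MaxTripCount.
uint64_t clampMaxVF(uint64_t MaxVF, uint64_t UserIC, uint64_t MaxTripCount) {
  if (UserIC == 0)
    UserIC = 1; // No user-provided interleave count.
  if (MaxTripCount == 0)
    return MaxVF; // Unknown trip count: nothing to clamp against.
  uint64_t Limit = MaxTripCount / UserIC;
  if (Limit == 0)
    return 1; // Even VF = 1 would exceed the trip count.
  uint64_t Clamped = floorPowerOfTwo(Limit);
  return Clamped < MaxVF ? Clamped : MaxVF;
}
```

For example, with a maximum trip count of 20 and a user interleave count of 4, a candidate VF of 16 would need 64 iterations per vector-loop iteration, so the sketch clamps the VF to 4 (16 iterations per vector-loop iteration).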
Whilst reviewing PR #176754 I realised there seemed to be some odd cost
model issues for the tests in file
LoopVectorize/AArch64/fold-tail-low-trip-count.ll
where we seemed to be vectorising loops that aren't worth it. It turns
out the tests were not targeting AArch64 despite being in the AArch64
directory. I fixed the RUN line for the file and also added a new file
for RISCV so we get more test coverage.
If any of the operands of a VPReplicateRecipe have been
force-scalarized, then the legacy cost model skips the scalarization
overhead, but we cannot match this in the VPlan cost model.
Bail out for now in those very rare cases.
Fixes https://github.com/llvm/llvm-project/issues/176720.
This patch removes the single uncountable exit constraint, allowing
loops with multiple early exits, if the exits form a dominance chain and
all other constraints hold for all uncountable early exits.
While legality now accepts such loops, vectorization is not yet
supported. VPlan support will be added in a follow-up:
https://github.com/llvm/llvm-project/pull/174864
PR: https://github.com/llvm/llvm-project/pull/176403
This reverts commit ed004cf42bf57ca79b57bc3076ef83a8477426ea.
The original commit exposed an independent cost issue, triggering an
assertion. That issue has been fixed in 3457e7efc3.
Reland the patch now that the assertion has been fixed.
VPlan transforms may invert logical AND/OR selects, which can impact
costs on targets where the select is not cheap but the boolean AND/OR is.
Also match the inverted logical AND/OR to improve the accuracy of the
cost estimation. This fixes the underlying issue for the cost
divergence between the legacy and VPlan-based cost models that caused
the revert of 01d34eb38fa058 in ed004cf42bf57c.
Fix a miscompile in the FindLast handling by normalizing selects
with the phi node as the first op to ones that select the data value
when the condition is true, by swapping operands and inverting the
condition.
This should ensure correct codegen for both cases.
Select normalization:
https://alive2.llvm.org/ce/z/yFdivK
Fixes a miscompile reported for 2abd6d6d7ac (#158088).
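The normalization can be modeled in a few lines. This is a toy model under stated assumptions, not the VPlan code: `ToySelect`, `Operand`, `evaluate` and `normalize` are hypothetical, and the condition inversion is represented by a flag rather than a separate `not` instruction.

```cpp
#include <cassert>
#include <initializer_list>

// Which value a select operand refers to in this toy model.
enum class Operand { Phi, Data };

// Models `select (Cond xor InvertCond), TrueOp, FalseOp`.
struct ToySelect {
  bool InvertCond;
  Operand TrueOp;
  Operand FalseOp;
};

// Evaluate the select for a concrete condition and concrete operand values.
int evaluate(const ToySelect &S, bool Cond, int Phi, int Data) {
  bool Effective = Cond != S.InvertCond;
  Operand Chosen = Effective ? S.TrueOp : S.FalseOp;
  return Chosen == Operand::Phi ? Phi : Data;
}

// If the phi is the first (true) operand, swap the operands and invert the
// condition so the data value is selected when the condition is true.
ToySelect normalize(ToySelect S) {
  if (S.TrueOp == Operand::Phi && S.FalseOp == Operand::Data)
    return {!S.InvertCond, S.FalseOp, S.TrueOp};
  return S;
}

// Checks that normalization keeps the selected value for both conditions and
// that the normalized form has the data value as its first operand.
bool normalizationPreservesValue() {
  const ToySelect PhiFirst{false, Operand::Phi, Operand::Data};
  const ToySelect Normalized = normalize(PhiFirst);
  for (bool Cond : {false, true})
    if (evaluate(PhiFirst, Cond, 10, 20) != evaluate(Normalized, Cond, 10, 20))
      return false;
  return Normalized.TrueOp == Operand::Data;
}
```

After normalization, later handling only has to deal with the "data on true" form, which is the point of canonicalizing before codegen.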
When `extract-lane` has only a single vector operand, we can simplify
it to `extractelement`.
This patch makes `extract-lane` generate a simple `extractelement` when it
has only a single vector operand, to avoid generating unused IR.
This patch is mostly NFC; the unused IR should be removed by subsequent
IR passes.
Fold select c, false, true -> not c. This allows for more accurate cost
estimation and fixes the underlying issue for the cost divergence
between legacy and VPlan-based cost model that caused the revert of
01d34eb38fa058 in ed004cf42bf57c.
https://alive2.llvm.org/ce/z/yVuSgW.
Following up on d5c11b9a24c84f, also handle min/max recurrence kinds in
::printFlags, so the proper kind is printed instead of icmp.
NFC modulo debug printing changes
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Remove the artificial PhiR operand of ComputeReductionResult, which was
only used to look up recurrence kind, in-loop and ordered properties.
Instead, encode them as VPIRFlags as suggested by @ayalz in
https://github.com/llvm/llvm-project/pull/170223.
This addresses a TODO to make codegen for ComputeReductionResult
independent of looking up information from other recipes.
This is NFC w.r.t. codegen, the printing has been improved to include
the reduction type, and whether it is in-loop/ordered.
PR: https://github.com/llvm/llvm-project/pull/174026