The legacy cost model skips gather/scatters when determining the
predicated stores in useEmulatedMaskMemRefHack. Match the behavior in
the VPlan-based implementation. This fixes a cost divergence in the
attached test.
Add a symbolic unroll factor (UF) to VPlan similar to VF & VFxUF that
gets replaced with the concrete UF during plan execution, similar to how VF
is used for the vectorization factor. This is a preparatory change that
allows transforms to use the symbolic UF before the concrete UF is
determined.
Note that the old getUF that returns the concrete UF after unrolling has
been renamed to getConcreteUF.
Split off from the re-commit of 8d29d093096
(https://github.com/llvm/llvm-project/pull/149706) as suggested.
Compute the number of predicated stores directly in VPlan instead of
using CM.useEmulatedMaskMemRefHack(), which will only account for the
number of predicated stores for the last VF the legacy cost model
considered.
Fixes https://github.com/llvm/llvm-project/issues/181183
Re-commit of https://github.com/llvm/llvm-project/pull/175839 after
fixing build without `LLVM_ENABLE_DUMP`.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function
to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final
VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Move SubclassID to VPRecipeBase, and store VPRecipeBase directly in
VPRecipeValue, instead of VPDef. This allows for some additional
simplifications and VPDef now just holds various helpers to deal with
removing and adding VPValues.
This reverts commit 16395da0ff577750571b99fe28281ce6fb6a3ae8.
PR: https://github.com/llvm/llvm-project/pull/174282
When a vectorized loop has constant trip, it's important to update the
profile information accordingly. Hotness analysis will only look at
profile info.
For example, in the `tripcount.ll` test, without producing the profile
info, in the `const_trip_over_profile` function, the BFI of the
`vector.body` would be 32 (this is the expected value when synthetic
branch weights are used, in loops). The real value is 250. The
`for.body`value was _very_ incorrect before, too (and detrimentally so,
as it would have appeared as "very hot" when it wasn't):
The table below was obtained by printing BFI in the RUN: command, i.e.
`build/bin/opt < llvm/test/Transforms/LoopVectorize/tripcount.ll
-passes="loop-vectorize,print<block-freq>"
-loop-vectorize-with-block-frequency -S -o /dev/null`. Showing only the
`float` value, i.e. the BFI relative to the function entry BB.
```
Printing analysis results of BFI for function 'const_trip_over_profile':
block-frequency-info: const_trip_over_profile
```
| Block | Before | After |
| ----- | ------ | ----- |
| `entry` | float = 1.0 | float = 1.0 |
| `vector.ph` | float = 1.0 | float = 1.0 |
| `vector.body` | float = **32.0** | float = **250.0** |
| `middle.block` | float = 1.0 | float = 1.0 |
| `scalar.ph` | float = 1.0 | float = 1.0 |
| `for.body` | float = **2147483647.8** | float = **1.0** |
| `for.end` | float = 1.0 | float = 1.0 |
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoid the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Follow up on 7c4f188 ([LV] Support multiplies by constants when forming
scaled reductions), introducing m_APInt, and improving code around
canConstantBeExtended: we change canConstantBeExtended to take an APInt.
Replication is currently not supported for scalable VFs. Make sure
VPReplicateRecipe::computeCost returns an invalid cost early, for
scalable VFs if the recipe is not a single-scalar.
Note that this moves the existing invalid-costs.ll out of the AArch64
subdirectory, as it does not use a target triple.
Fixes https://github.com/llvm/llvm-project/issues/160792.
This reverts commit f80c0baf058dbdc5 and 94eade61a02ae5.
Recommit a small fix for targets using prefersVectorizedAddressing.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
This reverts commit f61be4352592639a0903e67a9b5d3ec664ad4d23.
Recommit a small fix handling scalarization overhead consistently with
legacy cost model if a load is used directly as operand of another
memory operation, which fixes
https://github.com/llvm/llvm-project/issues/161404.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
We can create partial reductions for multiplies with constants, if the
constant is small enough to be extended from source to destination type
w/o changing the value.
This only handles constant on the right side of a multiply, relying on
other passes to canonicalize the input.
Alive2 Proofs: https://alive2.llvm.org/ce/z/iWRMr6
PR: https://github.com/llvm/llvm-project/pull/161092
DeleteDeadBlocks will remove single-entry phis. Remove them from the exit
VPIRBBs in VPlan as well, otherwise we would retain references to deleted
IR instructions.
Fixes MSan failures after 8907b6d39
https://lab.llvm.org/buildbot/#/builders/164/builds/14013
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
The VPlan cost model is not used to compute costs of scalar VFs
currently, as conversion to replicate regions makes accurately computing
the original scalar cost difficult.
Remove left over, dead code.
This patch removes the metadata emission for EVL‑vectorized loops,
since there is no current in-tree consumer:
1) after VPlan performs canonical IV replacement #147222 and
2) RISCV dropped EVLIndVarSimplifyPass #151483, which was the only user
of this metadata.
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.
Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branch from entry blocks.
In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.
Depends on https://github.com/llvm/llvm-project/pull/153643.
PR: https://github.com/llvm/llvm-project/pull/154510
Extend replicateByVF added in #142433 (aa240293190) to also explicitly
unroll replicating VPInstructions.
Now the only remaining case where we replicate for all lanes is
VPReplicateRecipes in replicate regions.
PR: https://github.com/llvm/llvm-project/pull/155102
This patch consolidates updating loop metadata and profile info for both
the remainder and vector loops in a single place. This is NFC, modulo
consistently applying vectorization specific metadata also in the
experimental VPlan-native path.
Split off from https://github.com/llvm/llvm-project/pull/154510.
Materialze Build(Struct)Vectors explicitly for VPRecplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.
Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code-path alltogether.
PR: https://github.com/llvm/llvm-project/pull/151487
Materialize VF and VFxUF computation using VPInstruction
instead of directly creating IR.
This is one of the last few steps needed to model the full vector
skeleton in VPlan.
This is mostly NFC, although in some cases we remove some unused
computations.
PR: https://github.com/llvm/llvm-project/pull/152879
A lot of time getCanonicalIV() is used to get the canonical IV type,
e.g. to instantiate a VPTypeAnalysis or to get the LLVMContext.
However VPTypeAnalysis has a constructor that takes the VPlan directly
and there's a method on VPlan to get the LLVMContext directly, so use
those instead where possible.
This lets us remove a constructor on VPTypeAnalysis.
Also remove an unused LLVMContext argument in UnrollState whilst we're
here.
Materialize the vector trip count computation using VPInstruction
instead of directly creating IR. This is one of the last few steps
needed to model the full vector skeleton in VPlan. It also simplifies
vector-trip count computations for scalable vectors, as we can re-use
the UF x VF computation.
PR: https://github.com/llvm/llvm-project/pull/151925
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
There is a new VPInstruction added, WidePtrAdd to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
Originally this transformation was copied from the original recipe's
execute code, but it's since been simplifed by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it inline
with VPWidenIntOrFpInductionRecipe.