Move cloneFrom from a file-static function in VPlan.cpp to a public
static method VPBlockUtils::cloneFrom, and move
mergeBlocksIntoPredecessors from a file-static function in
VPlanTransforms.cpp to a public static method
VPlanTransforms::mergeBlocksIntoPredecessors.
This is in preparation for dissolving replicate regions which needs both
utilities.
Split off from approved
https://github.com/llvm/llvm-project/pull/170212.
PR: https://github.com/llvm/llvm-project/pull/188818
Now that we can vectorize loops with multiple early exits, we emit
dispatch blocks after the middle block to go to a specific exit or
continue in the dispatch chain.
With that, we need to be a bit more careful when it comes to picking the
loop the dispatch block belongs to. The dispatch block will belong to
the innermost loop of all exit blocks reachable from the current block.
Fixes https://github.com/llvm/llvm-project/issues/185362
PR: https://github.com/llvm/llvm-project/pull/185618
After VPSymbolicValues (like VF and VFxUF) are materialized via
replaceAllUsesWith, they should not be accessed again. This patch:
1. Tracks materialization state in VPSymbolicValue.
2. Asserts if the materialized VPValue is used again. Currently it
adds asserts to various member functions, preventing calling them
on materialized symbolic values.
Note that this still allows some uses (e.g. comparing VPSymbolicValue
references or pointers), but this should be relatively harmless given
that it is impossible to (re-)add any users. If we want to further
tighten the checks, we could add asserts to the accessors or override
operator&, but that will require more changes and not add much extra
guards I think.
Depends on https://github.com/llvm/llvm-project/pull/182146 to fix a
current access violation.
PR: https://github.com/llvm/llvm-project/pull/182318
BranchInst currently represents both unconditional and conditional
branches. However, these are quite different operations that are often
handled separately. Therefore, split them into separate opcodes and
classes to allow distinguishing these operations in the type system.
Additionally, this also slightly improves compile-time performance.
Currently the logic for introducing a header mask and predicating the
vector loop region is done inside introduceMasksAndLinearize.
This splits the tail folding part out into an individual VPlan transform
so that VPlanPredicator.cpp doesn't need to worry about tail folding,
which seemed to be a temporary measure according to a comment in
VPlanTransforms.h.
To perform tail folding independently, this splits the "body" of the
vector loop region between the phis in the header and the branch + iv
increment in the latch:
Before:
```
+-------------------------------------------+
|%iv = ... |
|... |
|%iv.next = add %iv, vfxuf |
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
```
After:
```
+-------------------------------------------+
|%iv = ... |
|%wide.iv = widen-canonical-iv ... |
|%header-mask = icmp ule %wide.iv, BTC |---+
|branch-on-cond %header-mask | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|... | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|%iv.next = add %iv, vfxuf |<--+
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
```
Phis are then inserted in the latch for any value in the loop body that
have outside uses, with poison as their incoming value from the header
edge.
The motivation for this is to allow us to share the same "predicate all
successor blocks" type of predication we do for tail folding, but for
early-exit loops in #172454. This may also allow us to directly emit an
EVL based header mask, instead of having to match + transform the
existing header mask in addExplicitVectorLength.
This also allows us to eventually handle recurrences in the same
transform, avoiding the need to special case tail folding in
addReductionResultComputation.
The legacy cost model skips gather/scatters when determining the
predicated stores in useEmulatedMaskMemRefHack. Match the behavior in
the VPlan-based implementation. This fixes a cost divergence in the
attached test.
Add a symbolic unroll factor (UF) to VPlan similar to VF & VFxUF that
gets replaced with the concrete UF during plan execution, similar to how VF
is used for the vectorization factor. This is a preparatory change that
allows transforms to use the symbolic UF before the concrete UF is
determined.
Note that the old getUF that returns the concrete UF after unrolling has
been renamed to getConcreteUF.
Split off from the re-commit of 8d29d093096
(https://github.com/llvm/llvm-project/pull/149706) as suggested.
Compute the number of predicated stores directly in VPlan instead of
using CM.useEmulatedMaskMemRefHack(), which will only account for the
number of predicated stores for the last VF the legacy cost model
considered.
Fixes https://github.com/llvm/llvm-project/issues/181183
Re-commit of https://github.com/llvm/llvm-project/pull/175839 after
fixing build without `LLVM_ENABLE_DUMP`.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function
to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final
VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Move SubclassID to VPRecipeBase, and store VPRecipeBase directly in
VPRecipeValue, instead of VPDef. This allows for some additional
simplifications and VPDef now just holds various helpers to deal with
removing and adding VPValues.
This reverts commit 16395da0ff577750571b99fe28281ce6fb6a3ae8.
PR: https://github.com/llvm/llvm-project/pull/174282
When a vectorized loop has constant trip, it's important to update the
profile information accordingly. Hotness analysis will only look at
profile info.
For example, in the `tripcount.ll` test, without producing the profile
info, in the `const_trip_over_profile` function, the BFI of the
`vector.body` would be 32 (this is the expected value when synthetic
branch weights are used, in loops). The real value is 250. The
`for.body`value was _very_ incorrect before, too (and detrimentally so,
as it would have appeared as "very hot" when it wasn't):
The table below was obtained by printing BFI in the RUN: command, i.e.
`build/bin/opt < llvm/test/Transforms/LoopVectorize/tripcount.ll
-passes="loop-vectorize,print<block-freq>"
-loop-vectorize-with-block-frequency -S -o /dev/null`. Showing only the
`float` value, i.e. the BFI relative to the function entry BB.
```
Printing analysis results of BFI for function 'const_trip_over_profile':
block-frequency-info: const_trip_over_profile
```
| Block | Before | After |
| ----- | ------ | ----- |
| `entry` | float = 1.0 | float = 1.0 |
| `vector.ph` | float = 1.0 | float = 1.0 |
| `vector.body` | float = **32.0** | float = **250.0** |
| `middle.block` | float = 1.0 | float = 1.0 |
| `scalar.ph` | float = 1.0 | float = 1.0 |
| `for.body` | float = **2147483647.8** | float = **1.0** |
| `for.end` | float = 1.0 | float = 1.0 |
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoid the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Follow up on 7c4f188 ([LV] Support multiplies by constants when forming
scaled reductions), introducing m_APInt, and improving code around
canConstantBeExtended: we change canConstantBeExtended to take an APInt.
Replication is currently not supported for scalable VFs. Make sure
VPReplicateRecipe::computeCost returns an invalid cost early, for
scalable VFs if the recipe is not a single-scalar.
Note that this moves the existing invalid-costs.ll out of the AArch64
subdirectory, as it does not use a target triple.
Fixes https://github.com/llvm/llvm-project/issues/160792.
This reverts commit f80c0baf058dbdc5 and 94eade61a02ae5.
Recommit a small fix for targets using prefersVectorizedAddressing.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
This reverts commit f61be4352592639a0903e67a9b5d3ec664ad4d23.
Recommit a small fix handling scalarization overhead consistently with
legacy cost model if a load is used directly as operand of another
memory operation, which fixes
https://github.com/llvm/llvm-project/issues/161404.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
We can create partial reductions for multiplies with constants, if the
constant is small enough to be extended from source to destination type
w/o changing the value.
This only handles constant on the right side of a multiply, relying on
other passes to canonicalize the input.
Alive2 Proofs: https://alive2.llvm.org/ce/z/iWRMr6
PR: https://github.com/llvm/llvm-project/pull/161092
DeleteDeadBlocks will remove single-entry phis. Remove them from the exit
VPIRBBs in VPlan as well, otherwise we would retain references to deleted
IR instructions.
Fixes MSan failures after 8907b6d39
https://lab.llvm.org/buildbot/#/builders/164/builds/14013
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
The VPlan cost model is not used to compute costs of scalar VFs
currently, as conversion to replicate regions makes accurately computing
the original scalar cost difficult.
Remove left over, dead code.
This patch removes the metadata emission for EVL‑vectorized loops,
since there is no current in-tree consumer:
1) after VPlan performs canonical IV replacement #147222 and
2) RISCV dropped EVLIndVarSimplifyPass #151483, which was the only user
of this metadata.
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.
Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branch from entry blocks.
In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.
Depends on https://github.com/llvm/llvm-project/pull/153643.
PR: https://github.com/llvm/llvm-project/pull/154510
Extend replicateByVF added in #142433 (aa240293190) to also explicitly
unroll replicating VPInstructions.
Now the only remaining case where we replicate for all lanes is
VPReplicateRecipes in replicate regions.
PR: https://github.com/llvm/llvm-project/pull/155102
This patch consolidates updating loop metadata and profile info for both
the remainder and vector loops in a single place. This is NFC, modulo
consistently applying vectorization specific metadata also in the
experimental VPlan-native path.
Split off from https://github.com/llvm/llvm-project/pull/154510.