Materialize VF and VFxUF computation using VPInstruction
instead of directly creating IR.
This is one of the last few steps needed to model the full vector
skeleton in VPlan.
This is mostly NFC, although in some cases we remove some unused
computations.
PR: https://github.com/llvm/llvm-project/pull/152879
A lot of time getCanonicalIV() is used to get the canonical IV type,
e.g. to instantiate a VPTypeAnalysis or to get the LLVMContext.
However VPTypeAnalysis has a constructor that takes the VPlan directly
and there's a method on VPlan to get the LLVMContext directly, so use
those instead where possible.
This lets us remove a constructor on VPTypeAnalysis.
Also remove an unused LLVMContext argument in UnrollState whilst we're
here.
Materialize the vector trip count computation using VPInstruction
instead of directly creating IR. This is one of the last few steps
needed to model the full vector skeleton in VPlan. It also simplifies
vector-trip count computations for scalable vectors, as we can re-use
the UF x VF computation.
PR: https://github.com/llvm/llvm-project/pull/151925
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
There is a new VPInstruction added, WidePtrAdd to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
Originally this transformation was copied from the original recipe's
execute code, but it's since been simplifed by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it inline
with VPWidenIntOrFpInductionRecipe.
Explicitly compute the backedge-taken count using VPInstruction. This is
needed to model the full skeleton in VPlan.
NFC modulo some instruction re-ordering.
Materialize constant vector trip counts before ::execute, if the trip
count can be computed as Original (TC / (VF * UF)) * (VF * UF). For now
this excludes when the tail is folded or scalar epilogues are required.
This enables removing a number of redundant branches from the middle
block.
For now this is also only done when not vectorizing the epilogue, as the
simplification complicates stitching the 2 plans together.
PR: https://github.com/llvm/llvm-project/pull/142309
Connect SCEV and memory runtime check block directly in VPlan as
VPIRBasicBlocks, removing ILV::emitSCEVChecks and
ILV::emitMemRuntimeChecks.
The new logic is currently split across
LoopVectorizationPlanner::addRuntimeChecks which collects a list of
{Condition, CheckBlock} pairs and performs some checks and emits remarks
if needed. The list of checks is then added to VPlan in
VPlanTransforms::connectCheckBlocks.
PR: https://github.com/llvm/llvm-project/pull/143879
Currently, when VPSlotTracker is initialized with a VPlan, its
assignName method calls printAsOperand on each underlying instruction.
Each such call recomputes slot numbers for the entire function, leading
to O(N × M) complexity, where M is the number of instructions in the
loop and N is the number of instructions in the function.
This results in slow debug output for large loops. For example, printing
costs of all instructions becomes O(M² × N), which is especially painful
when enabling verbose dumps.
This patch improves debugging performance by caching slot numbers using
ModuleSlotTracker. It avoids redundant recomputation and makes debug
output significantly faster.
Explicitly unroll VPReplicateRecipes outside replicate regions by VF,
replacing them by VF single-scalar recipes. Extracts for operands are
added as needed and the scalar results are combined to a vector using a
new BuildVector VPInstruction.
It also adds a few folds to simplify unnecessary extracts/BuildVectors.
It also adds a BuildStructVector opcode for handling of calls that have
struct return types.
VPReplicateRecipe in replicate regions can will be unrolled as follow
up, turing non-single-scalar VPReplicateRecipes into 'abstract', i.e.
not executable.
PR: https://github.com/llvm/llvm-project/pull/142433
The motivation of this PR is to make #115274 easier to implement, and
should allow us to add EVL support by just passing EVL to the VF
operand.
The current difficulty with widening IVs with EVL is that
VPWidenIntOrFpInductionRecipe generates its own backedge value. Since
it's a VPHeaderPHIRecipe the VF operand must be in the preheader, which
means we can't use the EVL since it's defined in the loop body.
The gist in this PR is to take the approach in #114305 and expand
VPWidenIntOrFpInductionRecipe into several recipes for the initial
value, phi and backedge value just before execution. I.e. this example:
```
vector.ph:
Successor(s): vector loop
<x1> vector loop: {
vector.body:
WIDEN-INDUCTION %i = phi %start, %step, %vf
...
EMIT branch-on-count ...
No successors
}
```
gets expanded to:
```
vector.ph:
...
vp<%induction.start> = ...
vp<%induction.increment> = ...
Successor(s): vector loop
<x1> vector loop: {
vector.body:
ir<%i> = WIDEN-PHI vp<%induction.start>, vp<%vec.ind.next>
...
vp<%vec.ind.next> = add ir<%i>, vp<%induction.increment>
EMIT branch-on-count ...
No successors
}
```
This allows us to a value defined in the loop in the backedge value, and
also means we can just reuse the existing backedge fixups in
VPlan::execute without having to specially handle it ourselves.
After this #115274 should just become a matter of setting the VF operand
to EVL (and building the increment step in the loop body, not the
preheader).
There are many places in VPlan and LoopVectorize where we use
getKnownMinValue to discover the number of elements in a vector. Where
we expect the vector to have a fixed length, I have used the stronger
getFixedValue call. I believe this is clearer and adds extra protection
in the form of an assert in getFixedValue that the vector is not
scalable.
While looking at VPFirstOrderRecurrencePHIRecipe::computeCost I also
took the liberty of simplifying the code.
In theory I believe this patch should be NFC, but I'm reluctant to add
that to the title in case we're just missing tests for some of the VPlan
changes. I built and ran the LLVM test suite when targeting neoverse-v1
and it seemed ok.
Part of the coverage-tracking feature, following #107279.
In order for DebugLoc coverage testing to work, we firstly have to set
annotations for intentionally-empty DebugLocs, and secondly we have to
ensure that we do not drop these annotations as we propagate DebugLocs
throughout compilation. As the annotations exist as part of the DebugLoc
class, and not the underlying DILocation, they will not survive a
DebugLoc->DILocation->DebugLoc roundtrip. Therefore this patch modifies
a number of places in the compiler to propagate DebugLocs directly
rather than via the underlying DILocation. This has no effect on the
output of normal builds; it only ensures that during coverage builds, we
do not drop incorrectly annotations and therefore create false
positives.
The bulk of these changes are in replacing
DILocation::getMergedLocation(s) with a DebugLoc equivalent, and in
changing the IRBuilder to store a DebugLoc directly rather than storing
DILocations in its general Metadata array. We also use a new function,
`DebugLoc::orElse`, which selects the "best" DebugLoc out of a pair
(valid location > annotated > empty), preferring the current DebugLoc on
a tie - this encapsulates the existing behaviour at a few sites where we
_may_ assign a DebugLoc to an existing instruction, while extending the
logic to handle annotation DebugLocs at the same time.
Directly replace the canonical IV when we dissolve the containing
region. That ensures that it won't get removed before the region gets
removed, which would result in an invalid region.
This removes the current ordering constraint between
convertToConcreteRecipes and dissolving regions.
PR: https://github.com/llvm/llvm-project/pull/142372
Building on top of https://github.com/llvm/llvm-project/pull/114305,
replace VPRegionBlocks with explicit CFG before executing.
This brings the final VPlan closer to the IR that is generated and
helps to simplify codegen.
It will also enable further simplifications of phi handling during
execution and transformations that do not have to preserve the
canonical IV required by loop regions. This for example could include
replacing the canonical IV with an EVL based phi while completely
removing the original canonical IV.
PR: https://github.com/llvm/llvm-project/pull/117506
Support cloning VPlans as they are created by the initial buildVPlan,
i.e. scalar header not yet connected and no trip-count set. This is not
used yet but will in follow-up changes/
Also add a unit test for cloning & printing.
When vectorizing loops with early exits that is nested within another
one, one of the loop exits may be outside both loops, so setting adding
it to the parent loop is incorrect. Also use the original parent loop
for exit blocks.
Add a new helper to manage IR metadata that can be progated to generated
instructions for recipes.
This helps to remove a number of remaining uses of getUnderlyingInstr
during VPlan execution.
PR: https://github.com/llvm/llvm-project/pull/135272
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.
Depends on https://github.com/llvm/llvm-project/pull/126437
Update VPlan::duplicate to add cloned exit blocks to ExitBlocks.
Currently there are no uses of the exit blocks after cloning so this is
NFC at the moment.
This reverts commit 46a2f4174a051f29a09dbc3844df763571c67309.
Recommits 2fd6f8fb5e3a with corresponding VPlan change to ensure
LoopInfo is updated for all blocks during VPlan execution if needed.
Exit blocks of the VPlan are now hold in ExitBlocks. Use it to check if
a block is an exit block. Otherwise we currently mis-classify the scalar
loop header also as exit block, as it is not explicitly connected to the
exit blocks.
NFC at the moment, as the helper currently is never queried with the
scalar header, but that will change in the future.
Set the debug location for each recipe before executing the recipe,
instead of ad-hoc setting the debug location during individual recipe
execution.
This simplifies the code and ensures that all recipe repsect the
recipe's debug location. There are some minor changes, where previously
we would re-use a previously set debug location.
Add a new VPIRPhi subclass of VPIRInstruction, that purely serves as an
overlay, to provide more convenient checking (via directly doing
isa/dyn_cast/cast) and specialied execute/print implementations.
Both VPIRInstruction and VPIRPhi share the same VPDefID, and are
differentiated by the backing IR instruction.
This pattern could alos be used to provide more specialized interfaces
for some VPInstructions ocpodes, without introducing new, completely
spearate recipes. An example would be modeling VPWidenPHIRecipe &
VPScalarPHIRecip using VPInstructions opcodes and providing an interface
to retrieve incoming blocks and values through a VPInstruction subclass
similar to VPIRPhi.
PR: https://github.com/llvm/llvm-project/pull/129387
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.
Depends on #106431.
This lands the main parts of the approved
https://github.com/llvm/llvm-project/pull/106441 as suggested to break
things up a bit more.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
Update initial VPlan-construction in VPlanNativePath in line with the
inner loop path, in that it bails out when encountering constructs it
cannot handle, like non-intrinsic calls.
Fixes https://github.com/llvm/llvm-project/issues/131071.
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.
Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.
PR: https://github.com/llvm/llvm-project/pull/129767
This patch restricts broadcast operations from being hoisted to the vector
preheader unless the basic block that defines the broadcasted value properly
dominates the vector preheader.
This prevents potential use-before-definition issues when the broadcasted
value is defined within the plan. VPDominatorTree is used to confirm this
restriction while still allowing safe hoisting for broadcasted values defined
outside the plan.
Issue https://github.com/llvm/llvm-project/issues/117139
Create an empty VPlan first, then let the HCFG builder create a plain
CFG for the top-level loop (w/o a top-level region). The top-level
region is introduced by a separate VPlan-transform. This is instead of
creating the vector loop region before building the VPlan CFG for the
input loop.
This simplifies the HCFG builder (which should probably be renamed) and
moves along the roadmap ('buildLoop') outlined in [1].
As follow-up, I plan to also preserve the exit branches in the initial
VPlan out of the CFG builder, including connections to the exit blocks.
The conversion from plain CFG with potentially multiple exits to a
single entry/exit region will be done as VPlan transform in a follow-up.
This is needed to enable VPlan-based predication. Currently early exit
support relies on building the block-in masks on the original CFG,
because exiting branches and conditions aren't preserved in the VPlan.
So in order to switch to VPlan-based predication, we will have to
preserve them in the initial plain CFG, so the exit conditions are
available explicitly when we convert to single entry/exit regions.
Another follow-up is updating the outer loop handling to also introduce
VPRegionBlocks for nested loops as transform. Currently the existing
logic in the builder will take care of creating VPRegionBlocks for
nested loops, but not the top-level loop.
[1]
https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf
PR: https://github.com/llvm/llvm-project/pull/128419