The motivation for this PR is to make #115274 easier to implement; it should allow us to add EVL support by simply passing EVL as the VF operand.
The current difficulty with widening IVs with EVL is that
VPWidenIntOrFpInductionRecipe generates its own backedge value. Since
it's a VPHeaderPHIRecipe, its VF operand must be defined in the preheader,
which means we can't use the EVL, since that is defined in the loop body.
The gist of this PR is to take the approach from #114305 and expand
VPWidenIntOrFpInductionRecipe into separate recipes for the initial
value, the phi and the backedge value just before execution. That is, this example:
```
vector.ph:
Successor(s): vector loop
<x1> vector loop: {
vector.body:
WIDEN-INDUCTION %i = phi %start, %step, %vf
...
EMIT branch-on-count ...
No successors
}
```
gets expanded to:
```
vector.ph:
...
vp<%induction.start> = ...
vp<%induction.increment> = ...
Successor(s): vector loop
<x1> vector loop: {
vector.body:
ir<%i> = WIDEN-PHI vp<%induction.start>, vp<%vec.ind.next>
...
vp<%vec.ind.next> = add ir<%i>, vp<%induction.increment>
EMIT branch-on-count ...
No successors
}
```
This allows us to use a value defined in the loop as the backedge value, and
also means we can reuse the existing backedge fixups in
VPlan::execute without having to handle it specially ourselves.
After this, #115274 should just become a matter of setting the VF operand
to EVL (and building the increment step in the loop body, not the
preheader).
This reverts commit 0604dc199c019b23746f4a54885ba0c75569cdae.
The recommitted version addresses post-commit comments and adjusts where
the branch weights are added. They are now added before VPlans are optimized
for VF and UF, since that optimization may remove the vector loop region, and
trying to get the middle block after that point caused a crash. A test case
was added in 72f99b75afc12bb.
Original message:
Manage branch weights for the BranchOnCond in the middle block in VPlan.
This requires updating VPInstruction to inherit from VPIRMetadata, which
in general makes sense as there are a number of opcodes that could take
metadata.
There are other branches (part of the skeleton) that also need branch
weights adding.
PR: https://github.com/llvm/llvm-project/pull/143035
Similar to modeling the start value as an operand, also model the sentinel
value explicitly as an operand. This makes all information required for
code-gen available directly in VPlan.
PR: https://github.com/llvm/llvm-project/pull/142291
There are many places in VPlan and LoopVectorize where we use
getKnownMinValue to discover the number of elements in a vector. Where
we expect the vector to have a fixed length, I have used the stronger
getFixedValue call. I believe this is clearer and adds extra protection
in the form of an assert in getFixedValue that the vector is not
scalable.
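As a minimal illustration of the difference (not code from the patch), getFixedValue asserts that the count is not scalable, while getKnownMinValue only reports the known minimum:
```
#include "llvm/Support/TypeSize.h"
using namespace llvm;

void elementCountExample() {
  ElementCount Fixed = ElementCount::getFixed(4);
  ElementCount Scalable = ElementCount::getScalable(4);
  (void)Fixed.getKnownMinValue();    // 4
  (void)Fixed.getFixedValue();       // 4, and asserts !isScalable()
  (void)Scalable.getKnownMinValue(); // 4 (actual count is 4 * vscale)
  // Scalable.getFixedValue();       // would trip the assertion
}
```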
While looking at VPFirstOrderRecurrencePHIRecipe::computeCost I also
took the liberty of simplifying the code.
In theory I believe this patch should be NFC, but I'm reluctant to add
that to the title in case we're just missing tests for some of the VPlan
changes. I built and ran the LLVM test suite when targeting neoverse-v1
and it seemed ok.
The use of VectorBuilder here was simply obscuring what was actually
going on. For vp.load and vp.store, the resulting code is significantly
more idiomatic. For the vp.reduce cases, we remove several layers of
indirection, including passing parameters via implicit state on the
builder. In both cases, the code is significantly easier to follow.
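For context, emitting a vp.load directly through IRBuilder, rather than through VectorBuilder's implicit mask/EVL state, looks roughly like this; a sketch only, with placeholder parameter names rather than the actual code from the patch:
```
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
using namespace llvm;

// Sketch: emit llvm.vp.load directly, passing the mask and EVL as explicit
// operands instead of threading them through VectorBuilder's builder state.
static CallInst *emitVPLoad(IRBuilderBase &Builder, Type *DataTy, Value *Ptr,
                            Value *Mask, Value *EVL) {
  return Builder.CreateIntrinsic(Intrinsic::vp_load,
                                 {DataTy, Ptr->getType()}, {Ptr, Mask, EVL});
}
```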
This caused assertion failures:
```
llvm/lib/Transforms/Vectorize/VPlan.h:4021:
llvm::VPBasicBlock* llvm::VPlan::getMiddleBlock():
Assertion `LoopRegion && "cannot call the function after vector loop region has been removed"' failed.
```
See comment on the PR.
> Manage branch weights for the BranchOnCond in the middle block in VPlan.
> This requires updating VPInstruction to inherit from VPIRMetadata, which
> in general makes sense as there are a number of opcodes that could take
> metadata.
>
> There are other branches (part of the skeleton) that also need branch
> weights adding.
>
> PR: https://github.com/llvm/llvm-project/pull/143035
This reverts commit db8d34db26e9ea92c08d6e813eca9cce40c48478.
Currently the loop vectorizer can only vectorize interleave groups with
power-of-2 factors at scalable VFs, by recursively emitting
[de]interleave2 intrinsics.
However after https://github.com/llvm/llvm-project/pull/124825 and
#139893, we now have [de]interleave intrinsics for all factors up to 8,
which is enough to support all types of segmented loads and stores on
RISC-V.
Now that the interleaved access pass has been taught to lower these in
#139373 and #141512, this patch teaches the loop vectorizer to emit
these intrinsics for factors up to 8, which enables scalable
vectorization for non-power-of-2 factors.
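As a rough sketch of what emitting these intrinsics means for a factor-4 load group (illustrative only; the intrinsic ID name and the surrounding code are assumptions, not the patch itself):
```
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
using namespace llvm;

// Sketch only: load a factor-4 interleaved group as one wide vector and
// split it with a single deinterleave call instead of a deinterleave2 tree.
// Intrinsic::vector_deinterleave4 is assumed to be the ID added by the
// PRs referenced above.
static Value *loadGroupMember0(IRBuilderBase &Builder, VectorType *WideVecTy,
                               Value *Ptr, Align Alignment) {
  Value *Wide = Builder.CreateAlignedLoad(WideVecTy, Ptr, Alignment);
  CallInst *DI = Builder.CreateIntrinsic(Intrinsic::vector_deinterleave4,
                                         {WideVecTy}, {Wide});
  // Member I of the group is result I of the returned struct.
  return Builder.CreateExtractValue(DI, 0);
}
```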
As far as I'm aware, no in-tree target will vectorize a scalable
interleave group above factor 8, because the maximum interleave factor is
capped at 4 on AArch64 and 8 on RISC-V, and the
`-max-interleave-group-factor` CLI option defaults to 8, so the
recursive [de]interleaving code has been removed for now.
Factors of 3 with scalable VFs are also turned off on AArch64 since
there's no lowering for [de]interleave3 just yet either.
Manage branch weights for the BranchOnCond in the middle block in VPlan.
This requires updating VPInstruction to inherit from VPIRMetadata, which
in general makes sense as there are a number of opcodes that could take
metadata.
There are other branches (part of the skeleton) that also need branch
weights adding.
PR: https://github.com/llvm/llvm-project/pull/143035
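For reference, at the IR level attaching branch weights amounts to setting !prof metadata on the branch; a minimal sketch, not the VPlan-side code, and the weight values are placeholders:
```
#include "llvm/IR/Instructions.h"
#include "llvm/IR/MDBuilder.h"
using namespace llvm;

// Sketch: attach !prof branch weights to a conditional branch (e.g. the
// middle block's BranchOnCond). The weight values here are placeholders.
static void addBranchWeights(BranchInst *MiddleBr) {
  MDBuilder MDB(MiddleBr->getContext());
  MDNode *Weights =
      MDB.createBranchWeights(/*TrueWeight=*/1, /*FalseWeight=*/3);
  MiddleBr->setMetadata(LLVMContext::MD_prof, Weights);
}
```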
Add a new VPInstruction::ReductionStartVector opcode to create the start
values for wide reductions. This more accurately models the start value
creation in VPlan and simplifies VPReductionPHIRecipe::execute. Down the
line it also allows removing VPReductionPHIRecipe::RdxDesc.
PR: https://github.com/llvm/llvm-project/pull/142290
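For intuition, this is a sketch of the value being created for an integer add reduction (not the new opcode's implementation): the identity splat with the scalar start value inserted into lane 0:
```
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Sketch: wide start value for an integer add reduction. Zero is the
// identity element, so only lane 0 carries the scalar start value.
static Value *buildAddReductionStart(IRBuilderBase &Builder, VectorType *VecTy,
                                     Value *ScalarStart) {
  Value *Identity = Constant::getNullValue(VecTy);
  return Builder.CreateInsertElement(Identity, ScalarStart, uint64_t(0),
                                     "rdx.start");
}
```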
This patch implements the VPlan-based cost model for VPReduction,
VPExtendedReduction and VPMulAccumulateReduction.
With this patch, we can calculate the reduction cost with the VPlan-based
cost model, so the reduction costs in `precomputeCost()` are removed.
Ref: Original instruction based implementation:
https://reviews.llvm.org/D93476
Now that there is only a single FindLastIV recurrence kind, simply pass
the sentinel value instead of the full recurrence descriptor to tighten
the interface.
Now that there is only a single AnyOf recurrence kind, simply pass the
start value instead of the full recurrence descriptor, to tighten the
interface.
Manage fast-math flags for VPInstruction using VPIRFlags, in line
with other VPInstructions. With this change, we now print the correct
flags for ComputeReductionResult; other than that, this is NFC.
This patch moves the logic to manage IR flags to a separate VPIRFlags
class. For now, VPRecipeWithIRFlags is the only class that inherits from
VPIRFlags. The new class allows for simpler passing of flags when
constructing recipes, simplifying the constructors for various recipes
(VPInstruction in particular, which now just has 2 constructors, one
taking an extra VPIRFlags argument).
This mirrors the approach taken for VPIRMetadata and makes it easier to
extend in the future. The patch also adds a unified flagsValidForOpcode
to check if the flags in a VPIRFlags match the provided opcode.
PR: https://github.com/llvm/llvm-project/pull/140621
Building on top of https://github.com/llvm/llvm-project/pull/114305,
replace VPRegionBlocks with explicit CFG before executing.
This brings the final VPlan closer to the IR that is generated and
helps to simplify codegen.
It will also enable further simplifications of phi handling during
execution and transformations that do not have to preserve the
canonical IV required by loop regions. This for example could include
replacing the canonical IV with an EVL based phi while completely
removing the original canonical IV.
PR: https://github.com/llvm/llvm-project/pull/117506
This patch introduces two new recipes:
* VPExtendedReductionRecipe
- cast + reduction.
* VPMulAccumulateReductionRecipe
- (cast) + mul + reduction.
This patch also implements a VPlan-based transformation that matches the
following patterns and converts them to the abstract recipes for better cost
estimation (a source-level example of these patterns is sketched after the list).
* VPExtendedReduction
- reduce(cast(...))
* VPMulAccumulateReductionRecipe
- reduce.add(mul(...))
- reduce.add(mul(ext(...), ext(...))
- reduce.add(ext(mul(ext(...), ext(...))))
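For example, a source loop whose reduction matches the multiply-accumulate patterns above is a classic dot product; this is just an illustration of the kind of input, not code from the patch:
```
#include <cstdint>

// Dot product: each i16 element is sign-extended, the products are summed
// into a 64-bit accumulator, matching reduce.add(mul(ext(...), ext(...))).
int64_t dot(const int16_t *A, const int16_t *B, int N) {
  int64_t Sum = 0;
  for (int I = 0; I < N; ++I)
    Sum += int64_t(A[I]) * int64_t(B[I]);
  return Sum;
}
```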
The converted abstract recipes will be lowered to the concrete recipes
(widen-cast + widen-mul + reduction) just before recipe execution.
Note that this patch still relies on the legacy cost model to calculate the
cost for these patterns.
Will enable vplan-based cost decision in #113903.
Split from #113903.
Directly compute costs for binary ops and GEPs in
VPReplicateRecipe::computeCost. This simply ports the legacy cost
computation for uniform/replicating binary ops to the VPlan cost model.
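Roughly, the ported computation queries the scalar cost from TTI and scales it by the number of replicated lanes; a sketch under assumed names and simplifications, not the exact computeCost code:
```
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Sketch: cost of a uniform/replicated scalar add at a fixed VF. The actual
// VPReplicateRecipe::computeCost also handles GEPs, operand info and
// scalable VFs.
static InstructionCost
replicatedAddCost(const TargetTransformInfo &TTI, Type *ScalarTy,
                  ElementCount VF, bool IsUniform,
                  TargetTransformInfo::TargetCostKind CostKind) {
  InstructionCost ScalarCost =
      TTI.getArithmeticInstrCost(Instruction::Add, ScalarTy, CostKind);
  return IsUniform ? ScalarCost : ScalarCost * VF.getFixedValue();
}
```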
Similarly to VPInstructionWithType and VPIRPhi, add VPPhi as a subclass
for VPInstruction. This allows implementing the VPPhiAccessors trait,
making available helpers for generic printing of incoming values /
blocks and accessors for incoming blocks and values.
It will also allow properly verifying def-uses for values used by
VPInstructions with PHI opcodes via
https://github.com/llvm/llvm-project/pull/124838.
PR: https://github.com/llvm/llvm-project/pull/139151
(NFC modulo debug output changes)
Add generic helper to print phi operands (incoming values) together with
their incoming blocks.
As more and more transforms are added, keeping track of the incoming blocks of
phis becomes more important. Print incoming blocks via VPPhiAccessors, to
make debugging easier.
Split off from #118638, this adds VPInstruction::StepVector, which
generates integer step vectors (0,1,2,...,VF). This is a step towards
eventually modelling all the separate parts of
VPWidenIntOrFpInductionRecipe in VPlan.
This is then used by VPWidenIntOrFpInductionRecipe, where we materialize
it just before unrolling so the operands stay in a fixed position.
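At the IR level, such a step vector can be produced with IRBuilder::CreateStepVector; the following is only a sketch of the kind of code the opcode lowers to, not the actual VPlan implementation:
```
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Sketch: materialize an i32 step vector <0, 1, 2, ...> of VF elements.
static Value *buildStepVector(IRBuilderBase &Builder, ElementCount VF) {
  Type *StepVecTy = VectorType::get(Builder.getInt32Ty(), VF);
  return Builder.CreateStepVector(StepVecTy, "step.vec");
}
```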
The need for a separate operand in VPWidenIntOrFpInductionRecipe, as
well as the need to update it in
optimizeVectorInductionWidthForTCAndVFUF, should be removed with #118638
when everything is expanded in convertToConcreteRecipes.
Replace CreateTrunc with CreateSExtOrTrunc in VPScalarIVStepsRecipe to
safely handle type conversion. This prevents assertion failures from
invalid truncation when StartIdx0 has a smaller integer type than
IntStepTy. The assertion was introduced by commit 783a846.
Fixes https://github.com/llvm/llvm-project/issues/137185
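The shape of the fix, using the names from the description above (illustrative, not the exact recipe code):
```
// Before the fix (asserted when StartIdx0 was narrower than IntStepTy):
//   Value *StartIdx = Builder.CreateTrunc(StartIdx0, IntStepTy);
// After the fix (extends when narrower, truncates when wider):
Value *StartIdx = Builder.CreateSExtOrTrunc(StartIdx0, IntStepTy);
```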
We can reuse isVectorIntrinsicWithScalarOpAtArg in VectorUtils to
determine if only the first lane will be used for a
VPWidenIntrinsicRecipe, provided that we also move the VP EVL operand
check into it.
This was needed by a local patch I was working on that created a
VPWidenIntrinsicRecipe with a VP intrinsic, and prevents the need to
update the scalar arguments in two places.
When calls were transformed to VP intrinsics with EVL tail folding
in #110412, this workaround was added in computeCost to avoid an
assertion when checking ICA.getArgs().
However, it turned out that the actual arguments were never used, and the
assertion was later removed in #115983, so it's now fine to leave
the arguments empty and use the type-based cost instead.
The type-based and value-based costs are the same for these VP
intrinsics.
This was tested by adding back in the transformation code in #110412 and
checking that no assertions were still hit.
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.
PR: https://github.com/llvm/llvm-project/pull/137030
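For context, extracting the last element of a possibly scalable vector at execution time looks roughly like this; a sketch only, independent of the new opcodes themselves:
```
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Sketch: extract the last element of a possibly scalable vector by indexing
// at (runtime element count - 1).
static Value *extractLastElement(IRBuilderBase &Builder, Value *Vec) {
  auto *VecTy = cast<VectorType>(Vec->getType());
  Value *NumElts = Builder.CreateElementCount(Builder.getInt64Ty(),
                                              VecTy->getElementCount());
  Value *LastIdx = Builder.CreateSub(NumElts, Builder.getInt64(1));
  return Builder.CreateExtractElement(Vec, LastIdx, "last.elt");
}
```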
Add a new helper to manage IR metadata that can be propagated to generated
instructions for recipes.
This helps to remove a number of remaining uses of getUnderlyingInstr
during VPlan execution.
PR: https://github.com/llvm/llvm-project/pull/135272