Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to more explicitly and simpler model computing finding
the first/last induction, which boils down to a min/max reduction +
compare and select of the sentinel value.
PR: https://github.com/llvm/llvm-project/pull/176672
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
All extra state has been removed from VPWidenSelectRecipe at this point.
There's no benefit of having a separate recipe and Select can easily be
handled by the existing VPWidenRecipe.
PR: https://github.com/llvm/llvm-project/pull/174234
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoid the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
Replace ExtractLastElement and ExtractLastLanePerPart with more generic
and specific ExtractLastLane and ExtractLastPart, which model distinct
parts of extracting across parts and lanes. ExtractLastElement ==
ExtractLastLane(ExtractLastPart) and ExtractLastLanePerPart ==
ExtractLastLane, the latter clarifying the name of the opcode. A new
m_ExtractLastElement matcher is provided for convenience.
The patch should be NFC modulo printing changes.
PR: https://github.com/llvm/llvm-project/pull/164124
When narrowing stores of a single-scalar, we currently use
ExtractLastElement, which extracts the last element across all parts.
This is not correct if the store's address is not uniform across all
parts. If it is only uniform-per-part, the last lane per part must be
extracted. Add a new ExtractLastLanePerPart opcode to handle this
correctly. Most transforms apply to both ExtractLastElement and
ExtractLastLanePerPart, with the only difference being their treatment
during unrolling.
Fixes https://github.com/llvm/llvm-project/issues/162498.
PR: https://github.com/llvm/llvm-project/pull/163056
Check if the scale-factor of the accumulator is the same as the request
ScaleFactor in tryToCreatePartialReductions.
This prevents creating partial reductions if not all instructions in the
reduction chain form partial reductions. e.g. because we do not form a
partial reduction for the loop exit instruction.
Currently code-gen works fine, because the scale factor of
VPPartialReduction is not used during ::execute, but it means we compute
incorrect cost/register pressure, because the partial reduction won't
reduce to the specified scaling factor.
PR: https://github.com/llvm/llvm-project/pull/158603
Update calculateRegisterUsageForPlan to track live-ness of VPValues
instead of recipes. This gives slightly more accurate results for
recipes that define multiple values (i.e. VPInterleaveRecipe).
When tracking the live-ness of recipes, all VPValues defined by an
VPInterleaveRecipe are considered alive until the last use of any of
them. When tracking the live-ness of individual VPValues, we can
accurately track the individual values until their last use.
Note the changes in large-loop-rdx.ll and pr47437.ll. This patch
restores the original behavior before introducing VPlan-based liveness
tracking.
PR: https://github.com/llvm/llvm-project/pull/155301
The InterleavedAccess pass already supports transforming
vector-predicated (vp) load/store intrinsics. With this patch, we start
enabling interleaved access under tail folding by EVL.
This patch introduces a new base class, VPInterleaveBase, and a concrete
class, VPInterleaveEVLRecipe. Both the existing VPInterleaveRecipe and
the new VPInterleaveEVLRecipe inherit from and implement
VPInterleaveBase.
Compared to VPInterleaveRecipe, VPInterleaveEVLRecipe adds an EVL
operand to emit vp.load/vp.store intrinsics.
Currently, tail folding by EVL is only supported for scalable
vectorization. Therefore, VPInterleaveEVLRecipe will only emit
interleave/deinterleave intrinsics. Reverse accesses are not yet
implemented, as masked reverse interleaved access under tail folding is
not yet supported.
Fixed#123201
`VPEVLBasedIVPHIRecipe` will lower to VPInstruction scalar phi and
generate scalar phi. This recipe will only occupy a scalar register just
like other phi recipes.
This patch fix the register usage for `VPEVLBasedIVPHIRecipe` from
vector
to scalar which is close to generated vector IR.
https://godbolt.org/z/6Mzd6W6ha shows that no register spills when
choosing `<vscale x 16>`.
Note that this test is basically copied from AArch64.
A lot of time getCanonicalIV() is used to get the canonical IV type,
e.g. to instantiate a VPTypeAnalysis or to get the LLVMContext.
However VPTypeAnalysis has a constructor that takes the VPlan directly
and there's a method on VPlan to get the LLVMContext directly, so use
those instead where possible.
This lets us remove a constructor on VPTypeAnalysis.
Also remove an unused LLVMContext argument in UnrollState whilst we're
here.
Epilogue vectorization currently relies on the resume phi for the
canonical induction being always available, which is why VPPhi are
considered to have side-effects, to prevent their removal.
This patch adds a new ResumeForEpilogue opcode to mark the resume phi as
used for epilogue vectorization. This allows treating VPPhis in general
as not having side-effects, enabling removal of unused VPPhis.
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
There is a new VPInstruction added, WidePtrAdd to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
Originally this transformation was copied from the original recipe's
execute code, but it's since been simplifed by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it inline
with VPWidenIntOrFpInductionRecipe.
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.
The patch updates early-exit codegen to use it instead ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.
The patch removes the restrictions added in 6f43754e9 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.
I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
PR: https://github.com/llvm/llvm-project/pull/148817
Update LV to vectorize maxnum/minnum reductions without fast-math flags,
by adding an extra check in the loop if any inputs to maxnum/minnum are
NaN, due to maxnum/minnum behavior w.r.t to signaling NaNs. Signed-zeros
are already handled consistently by maxnum/minnum.
If any input is NaN,
*exit the vector loop,
*compute the reduction result up to the vector iteration that contained
NaN inputs and
* resume in the scalar loop
New recurrence kinds are added for reductions using maxnum/minnum
without fast-math flags.
PR: https://github.com/llvm/llvm-project/pull/148239
Currently if MaxBandwidth is enabled, the register pressure is checked
for each VF. This changes that to only perform said check if the VF
would not have otherwise been considered by the LoopVectorizer if
maxBandwidth was not enabled.
Theoretically this allows for higher VFs to be considered than would
otherwise be deemed "safe" (from a regpressure perspective), but more
concretely this reduces the amount of work done at compile-time when
maxBandwidth is enabled.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This patch adds a new recipe to combine multiple recipes into an
'expression' recipe, which should be considered as single entity for
cost-modeling and transforms. The recipe needs to be 'decomposed', i.e.
replaced by its individual recipes before execute.
This subsumes VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe and should make it easier to extend to
include more types of bundled patterns, like e.g. extends folded into
loads or various arithmetic instructions, if supported by the target.
It allows avoiding re-creating the original recipes when converting to
concrete recipes, together with removing the need to record various
information. The current version of the patch still retains the original
printing matching VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe, but this specialized print could be
replaced with printing the bundled recipes directly.
PR: https://github.com/llvm/llvm-project/pull/144281
Similar to FindLastIV, add FindFirstIVSMin to support select (icmp(), x, y)
reductions where one of x or y is a decreasing induction, producing a SMin
reduction. It uses signed max as sentinel value.
PR: https://github.com/llvm/llvm-project/pull/140451
Explicitly unroll VPReplicateRecipes outside replicate regions by VF,
replacing them by VF single-scalar recipes. Extracts for operands are
added as needed and the scalar results are combined to a vector using a
new BuildVector VPInstruction.
It also adds a few folds to simplify unnecessary extracts/BuildVectors.
It also adds a BuildStructVector opcode for handling of calls that have
struct return types.
VPReplicateRecipe in replicate regions can will be unrolled as follow
up, turing non-single-scalar VPReplicateRecipes into 'abstract', i.e.
not executable.
PR: https://github.com/llvm/llvm-project/pull/142433
Add a new VPInstruction::ReductionStartVector opcode to create the start
values for wide reductions. This more accurately models the start value
creation in VPlan and simplifies VPReductionPHIRecipe::execute. Down the
line it also allows removing VPReductionPHIRecipe::RdxDesc.
PR: https://github.com/llvm/llvm-project/pull/142290
Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.
PR: https://github.com/llvm/llvm-project/pull/135673
This patch introduce two new recipes.
* VPExtendedReductionRecipe
- cast + reduction.
* VPMulAccumulateReductionRecipe
- (cast) + mul + reduction.
This patch also implements the transformation that match following
patterns via vplan and converts to abstract recipes for better cost
estimation.
* VPExtendedReduction
- reduce(cast(...))
* VPMulAccumulateReductionRecipe
- reduce.add(mul(...))
- reduce.add(mul(ext(...), ext(...))
- reduce.add(ext(mul(ext(...), ext(...))))
The converted abstract recipes will be lower to the concrete recipes
(widen-cast + widen-mul + reduction) just before recipe execution.
Note that this patch still relies on legacy cost model the calculate the
cost for these patters.
Will enable vplan-based cost decision in #113903.
Split from #113903.
Add constructor that retrieves the scalar type from the trip count
expression, if no canonical IV is available. Used in the verifier, in
preparation for late verification, when the canonical IV has been
dissolved.
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.
PR: https://github.com/llvm/llvm-project/pull/137030
There are some opcodes that currently require specialized recipes, due
to their result type not being implied by their operands, including
casts.
This leads to duplication from defining multiple full recipes.
This patch introduces a new VPInstructionWithType subclass that also
stores the result type. The general idea is to have opcodes needing to
specify a result type to use this general recipe. The current patch
replaces VPScalarCastRecipe with VInstructionWithType, a similar patch
for VPWidenCastRecipe will follow soon.
There are a few proposed opcodes that should also benefit, without the
need of workarounds:
* https://github.com/llvm/llvm-project/pull/129508
* https://github.com/llvm/llvm-project/pull/119284
PR: https://github.com/llvm/llvm-project/pull/129706