Any VPlan we generate that contains a replicator region will result in
replicated blocks in the output, causing a large code size increase.
Reject such VPlans when optimizing for size, as the code size impact is
usually worse than having a scalar epilogue, which we already forbid
with optsize.
This change requires a lot of test changes. For tests that specifically
target optsize I've updated the tests with the new output; otherwise the
tests have been adjusted to not rely on optsize.
Fixes #66652
Fixed-order recurrence phis cannot be forced to be scalar; they will
always be widened at the moment.
Make sure we don't add them to ForcedScalars; otherwise the legacy cost
model will compute incorrect costs.
This fixes an assertion reported with
https://github.com/llvm/llvm-project/pull/129645.
This patch adds a WideIVStep opcode that can be used to create a vector
with the steps to increment a wide induction. The opcode has 2 operands:
* the vector step
* the scale of the vector step
The opcode is later converted into a sequence of recipes that convert
the scale and step to the target type, if needed, and then multiply the
vector step by the scale.
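As a rough sketch of that lowering (based purely on the description above;
castToTargetTypeIfNeeded is a hypothetical helper, not an upstream
function):

  // The two operands of the WideIVStep opcode.
  VPValue *Step = WideIVStep->getOperand(0);
  VPValue *Scale = WideIVStep->getOperand(1);
  // Convert both to the target type Ty if needed, then multiply the
  // vector step by the scale.
  VPValue *StepTy = castToTargetTypeIfNeeded(Step, Ty);   // hypothetical
  VPValue *ScaleTy = castToTargetTypeIfNeeded(Scale, Ty); // hypothetical
  auto *Mul = new VPInstruction(Instruction::Mul, {StepTy, ScaleTy});
  WideIVStep->replaceAllUsesWith(Mul);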
This simplifies code that needs to materialize step vectors, e.g.
replacing wide IVs as a follow-up to
https://github.com/llvm/llvm-project/pull/108378 with an increment of the
wide IV step.
PR: https://github.com/llvm/llvm-project/pull/119284
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.
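A minimal sketch of the adjustment (getScaledReductionFactor is a
hypothetical helper standing in for however the scale is obtained):

  // A reduction scaled by a factor K produces only VF / K lanes, so its
  // register footprint is computed with the narrower element count.
  ElementCount RegVF = VF;
  if (unsigned Scale = getScaledReductionFactor(R)) // hypothetical
    RegVF = VF.divideCoefficientBy(Scale);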
Depends on https://github.com/llvm/llvm-project/pull/126437
Now that VPlan is able to fold away redundant branches to the scalar
preheader, we can directly check in VPlan if the scalar tail may
execute. hasScalarTail returns true if the tail may execute.
We know that the scalar tail won't execute if the scalar preheader
doesn't have any predecessors, i.e. is not reachable.
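A minimal sketch of that check, assuming the accessor names used here (the
upstream implementation may differ in detail):

  bool VPlan::hasScalarTail() const {
    // The scalar tail can only execute if the scalar preheader is
    // reachable, i.e. has at least one predecessor.
    return !getScalarPreheader()->getPredecessors().empty();
  }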
This removes some late uses of the legacy cost model.
PR: https://github.com/llvm/llvm-project/pull/134674
There are some opcodes that currently require specialized recipes, due
to their result type not being implied by their operands, including
casts.
This leads to duplication from defining multiple full recipes.
This patch introduces a new VPInstructionWithType subclass that also
stores the result type. The general idea is for opcodes that need to
specify a result type to use this general recipe. The current patch
replaces VPScalarCastRecipe with VPInstructionWithType; a similar patch
for VPWidenCastRecipe will follow soon.
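A simplified sketch of the new recipe (not the exact upstream
declaration):

  class VPInstructionWithType : public VPInstruction {
    // The result type, which cannot be inferred from the operands.
    Type *ResultTy;

  public:
    VPInstructionWithType(unsigned Opcode, ArrayRef<VPValue *> Operands,
                          Type *ResultTy, DebugLoc DL)
        : VPInstruction(Opcode, Operands, DL), ResultTy(ResultTy) {}

    Type *getResultType() const { return ResultTy; }
  };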
There are a few proposed opcodes that should also benefit, without the
need for workarounds:
* https://github.com/llvm/llvm-project/pull/129508
* https://github.com/llvm/llvm-project/pull/119284
PR: https://github.com/llvm/llvm-project/pull/129706
Add a version of calculateRegisterUsage that estimates register usage for
a VPlan. This mostly just ports the existing code, with some updates to
figure out which recipes will generate vectors vs scalars.
There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.
There are the following changes:
* Scalar usage increases in most cases by 1, as we always create a
scalar canonical IV, which is alive across the loop and is not
considered by the legacy implementation
* Output is ordered by insertion; scalar registers are now added first
due to the canonical IV phi.
* Using the VPlan, we now also more precisely know if an induction will
be vectorized or scalarized.
Depends on https://github.com/llvm/llvm-project/pull/126415
PR: https://github.com/llvm/llvm-project/pull/126437
Add a dedicated function to check if a plan is for a loop with an early
exit. This can easily be determined by checking the exit blocks.
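For illustration, such a check could look like this (a sketch; the
accessor names are assumptions rather than the exact VPlan API):

  static bool planHasEarlyExit(const VPlan &Plan) {
    // A plan for a loop without an early exit has a single exit block;
    // any additional exit block implies an (uncountable) early exit.
    return Plan.getExitBlocks().size() > 1;
  }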
This allows removing a use of Legal->hasUncountableEarlyExit() from
InnerLoopVectorizer.
PR: https://github.com/llvm/llvm-project/pull/134720
This patch changes the preferInLoopReduction function to take a
RecurKind instead of an unsigned Opcode.
This makes it possible to distinguish non-arithmetic reductions such as
min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf
and IFindLastIV with FFindLastIV.
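Roughly, the interface change is (a sketch; any additional parameters are
elided):

  // Before: only reductions describable by an IR opcode.
  bool preferInLoopReduction(unsigned Opcode, Type *Ty) const;
  // After: RecurKind also covers min/max, AnyOf and FindLastIV
  // reductions, and no longer distinguishes integer and FP variants of
  // the same kind.
  bool preferInLoopReduction(RecurKind Kind, Type *Ty) const;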
Related patches: #118393, #131830
This reverts commit 46a2f4174a051f29a09dbc3844df763571c67309.
Recommits 2fd6f8fb5e3a with corresponding VPlan change to ensure
LoopInfo is updated for all blocks during VPlan execution if needed.
Add an initial CFG simplification transform, which removes the dead
edges for blocks terminated with BranchOnCond true.
At the moment, this removes the edge between middle block and scalar
preheader when folding the tail.
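The transform is conceptually as follows (a sketch; isConstantTrue stands
in for however the condition is recognized as always-true):

  // If a block's terminator is BranchOnCond on an always-true condition,
  // the second successor (the false edge) is dead and can be removed.
  auto *Term = dyn_cast_or_null<VPInstruction>(VPBB->getTerminator());
  if (Term && Term->getOpcode() == VPInstruction::BranchOnCond &&
      isConstantTrue(Term->getOperand(0))) { // hypothetical helper
    VPBlockBase *DeadSucc = VPBB->getSuccessors()[1];
    VPBlockUtils::disconnectBlocks(VPBB, DeadSucc);
  }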
PR: https://github.com/llvm/llvm-project/pull/106748
If ScalarPH has predecessors, we may need to update its reduction resume
values. If the middle block branches to the scalar preheader, it must be
the first predecessor. Note that the first predecessor will not be the
middle block if the middle block doesn't branch to the scalar preheader;
in that case, fixReductionScalarResumeWhenVectorizingEpilog will be a
no-op.
In preparation for https://github.com/llvm/llvm-project/pull/106748.
Update both VPInterleaveRecipe and VPReplicateRecipe codegen to use
debug location directly from the recipe, not the underlying instruction.
This removes another dependency on underlying instructions.
In order to facilitate targets that only support masked loads/stores
on certain address spaces (AMDGPU will support them in an upcoming
patch, but only for address space 7), add an AddressSpace parameter
to isLegalMaskedLoad and isLegalMaskedStore
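The updated TTI hooks then look roughly like this (sketch, simplified):

  // The address space of the access lets targets answer per address
  // space, e.g. AMDGPU supporting masked loads/stores only for address
  // space 7.
  bool isLegalMaskedLoad(Type *DataType, Align Alignment,
                         unsigned AddressSpace) const;
  bool isLegalMaskedStore(Type *DataType, Align Alignment,
                          unsigned AddressSpace) const;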
The patch splits the store-load forwarding distance analysis from other
dependency analysis in LAA. Currently it supports only power-of-2
distances; the split is required to support non-power-of-2 distances in
the future.
Part of #100755
Avoid the pattern of always calling collectInstsToScalarize after
collectUniformsAndScalars, and call it in collectUniformsAndScalars
instead. Also strengthen checks for early exits in the function.
Add a new VPIRPhi subclass of VPIRInstruction, that purely serves as an
overlay, to provide more convenient checking (via directly doing
isa/dyn_cast/cast) and specialized execute/print implementations.
Both VPIRInstruction and VPIRPhi share the same VPDefID, and are
differentiated by the backing IR instruction.
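The overlay essentially boils down to a classof that inspects the wrapped
IR instruction; a simplified sketch (not the exact upstream declaration):

  struct VPIRPhi : public VPIRInstruction {
    // Shares VPIRInstruction's VPDefID; isa/dyn_cast/cast discriminate
    // purely on the backing IR instruction being a PHINode.
    static inline bool classof(const VPRecipeBase *R) {
      auto *IRI = dyn_cast<VPIRInstruction>(R);
      return IRI && isa<PHINode>(IRI->getInstruction());
    }

    void execute(VPTransformState &State) override;
    void print(raw_ostream &O, const Twine &Indent,
               VPSlotTracker &SlotTracker) const override;
  };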
This pattern could also be used to provide more specialized interfaces
for some VPInstruction opcodes, without introducing new, completely
separate recipes. An example would be modeling VPWidenPHIRecipe &
VPScalarPHIRecipe using VPInstruction opcodes and providing an interface
to retrieve incoming blocks and values through a VPInstruction subclass
similar to VPIRPhi.
PR: https://github.com/llvm/llvm-project/pull/129387
This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.
This option is only specified for AArch64, where it returns a minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.
VPReductionRecipes take a RecurrenceDescriptor, but only use the
RecurKind and FastMathFlags in it when executing. This patch makes the
recipe more lightweight by stripping it to only take the latter two.
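Conceptually, the constructor goes from the former to the latter (a sketch
with the remaining parameters abridged):

  // Before: carried a whole RecurrenceDescriptor.
  VPReductionRecipe(const RecurrenceDescriptor &RdxDesc, Instruction *I,
                    VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
                    bool IsOrdered);
  // After: only the two pieces actually used when executing.
  VPReductionRecipe(RecurKind Kind, FastMathFlags FMFs, Instruction *I,
                    VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
                    bool IsOrdered);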
The motivation for this is to simplify an upcoming patch to support
in-loop AnyOf reductions. For an in-loop AnyOf reduction we want to
create an Or reduction, and by using RecurKind we can create an
arbitrary reduction without needing a full RecurrenceDescriptor.
Instead of executing the whole entry VPIRBB twice, first only execute
the VPExpandSCEVRecipes and replace their uses with the expanded
VPValue, which will be a live-in. This allows removing special logic in
VPExpandSCEVRecipe to support executing twice and allows moving the
ExpandedSCEVs map out of VPTransformState.
It will also allow adding other recipes to the entry VPBB in the future.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch uses insert_range with
iterator ranges. For each case, I've verified that foos is defined as
make_range(foo_begin(), foo_end()) or in a similar manner.
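One illustrative instance of the pattern (the names are generic; each call
site uses its own accessor pair):

  // Before:
  Set.insert(R->op_begin(), R->op_end());
  // After, given that operands() is defined as
  // make_range(op_begin(), op_end()):
  Set.insert_range(R->operands());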
createInductionAdditionalBypassValues is only used for epilogue
vectorization now. Move it out of ILV, which means we do not have to
thread through ExpandedSCEVs and also don't have to track the bypass
values in ILV. Instead, directly create them if needed after executing
the epilogue plan. This moves more of the epilogue-specific logic out of
the generic executePlan.
At the moment if we decide to enable tail-folding we do not include
the cost of generating the mask per VF. This can mean we make some
poor choices of VF, which is definitely true for SVE-enabled AArch64
targets where mask generation for fixed-width vectors is more
expensive than for scalable vectors.
I've added a VPInstruction::computeCost function to return the costs
of the ActiveLaneMask and ExplicitVectorLength operations.
Unfortunately, in order to prevent asserts firing I've also had to
duplicate the same code in the legacy cost model to make sure the
chosen VFs match up. I've wrapped this up in an ifndef NDEBUG for
now. The alternative would be to disable the assert completely when
tail-folding, which I imagine is just as bad.
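The ActiveLaneMask case, for example, is costed roughly like this (a
sketch; the field names of the cost context are assumptions and the
handling of other opcodes is omitted):

  // Inside VPInstruction::computeCost:
  case VPInstruction::ActiveLaneMask: {
    // Cost the llvm.get.active.lane.mask intrinsic producing a <VF x i1>
    // mask from two scalar i64 bounds.
    Type *RetTy = VectorType::get(Type::getInt1Ty(Ctx.LLVMCtx), VF);
    Type *ArgTy = Type::getInt64Ty(Ctx.LLVMCtx);
    IntrinsicCostAttributes ICA(Intrinsic::get_active_lane_mask, RetTy,
                                {ArgTy, ArgTy});
    return Ctx.TTI.getIntrinsicInstrCost(ICA, TTI::TCK_RecipThroughput);
  }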
New tests added:
Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
Transforms/LoopVectorize/RISCV/tail-folding-cost.ll
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.
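For illustration, the kind of candidate this targets looks like the
following in source form (the transform itself operates on the VPlan, not
on C code):

  // An interleaved load group feeding an interleaved store group, with
  // each member forwarded unchanged. With a VF matching the group width,
  // the groups can be replaced by one wide load and one wide store per
  // iteration.
  void copy_pairs(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i += 2) {
      dst[i] = src[i];
      dst[i + 1] = src[i + 1];
    }
  }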
Depends on #106431.
This lands the main parts of the approved
https://github.com/llvm/llvm-project/pull/106441 as suggested to break
things up a bit more.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin functions like succ_begin for now.
Update initial VPlan-construction in VPlanNativePath in line with the
inner loop path, in that it bails out when encountering constructs it
cannot handle, like non-intrinsic calls.
Fixes https://github.com/llvm/llvm-project/issues/131071.
calculateRegisterUsage adds end points for each user of an instruction
to Ends and ignores instructions not added to it, i.e. instructions with
no users.
This means things like stores aren't included, which in turn means values
that are only used by stores are also not considered. As a result, we
underestimate the register usage in cases where the only users are things
like stores.
Update the code to not skip instructions without users (i.e. not in Ends)
if they have side-effects.
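The gist of the change, as a sketch (the exact guard may differ; the
side-effect check is illustrative):

  // Previously instructions not in Ends (i.e. without users) were skipped
  // entirely; now keep them if they may have side-effects, e.g. stores,
  // so their operands still count toward register pressure.
  if (!Ends.count(I) && !I->mayHaveSideEffects())
    continue;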
PR: https://github.com/llvm/llvm-project/pull/126415