If a phi has fast math flags, we can propagate them to the widened select.
To do this, this patch makes VPPhi and VPBlendRecipe subclasses of
VPRecipeWithIRFlags, and propagates the flags through PlainCFGBuilder and
VPPredicator.
Alive2 proofs for some of the FMFs (it looks like Alive2 can't reason about
the full "fast" set yet):
nnan: https://alive2.llvm.org/ce/z/f0bRd4
nsz: https://alive2.llvm.org/ce/z/u9P96T
The actual motivation for this is to eventually be able to move the special
casing for tail folding in
LoopVectorizationPlanner::addReductionResultComputation into the CFG in
#176143, which requires passing through FMFs.
Update VPReplicateRecipe::computeCost to compute predicated load/store
costs directly, unless the pointer is uniform. In that case, the legacy
cost model uses a different logic, which will be migrated separately.
PR: https://github.com/llvm/llvm-project/pull/179129
This patch restructures Find(First|Last)IV handling. Instead of
differentiating between FindLast, FindFirstIV and FindLastIV up front,
this patch simplifies the logic in IVDescriptor to just identify the
FindLast pattern up-front.
It then adds a new VPlan transformation to optimize FindLast reductions
to FindIV reductions if there is a suitable sentinel value.
It also merges the Find(Last|First)IV recurrence kinds into a single FindIV kind.
This is simpler and more accurate, given selecting the first/last
induction of the final IV reduction is directly controlled by the
corresponding recurrence kind of the ComputeReductionResult.
The new structure also allows further optimizations, like vectorizing
FindLastIV with another boolean reduction that tracks if the condition
in the loop was ever true, if there is no suitable sentinel value.
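As a scalar illustration of that last point, a find-last reduction with no usable sentinel can carry a second boolean reduction that records whether the condition ever held. This is only a sketch: the function name and signature are made up for the example and are not part of LLVM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model, not LLVM code: returns the last index I with
// A[I] > Threshold, or Start if no element matches.
int64_t findLastIndexAbove(const std::vector<int32_t> &A, int32_t Threshold,
                           int64_t Start) {
  int64_t LastIdx = 0;   // tracks the most recent matching index
  bool AnyMatch = false; // boolean (any-of) reduction over the condition
  for (size_t I = 0; I < A.size(); ++I) {
    bool Cond = A[I] > Threshold;
    if (Cond)
      LastIdx = static_cast<int64_t>(I);
    AnyMatch |= Cond;
  }
  // The final select is keyed on the boolean, so no index value has to be
  // reserved as a sentinel.
  return AnyMatch ? LastIdx : Start;
}
```

Because the result is selected on `AnyMatch`, index 0 remains a legal answer and the start value can be arbitrary.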
PR: https://github.com/llvm/llvm-project/pull/177870
Enforce that all VPInstructions set the correct OpType of the VPIRFlags.
Flag mis-matches (e.g. VPInstruction Add without `OverflowingBinOp`
being set) can cause crashes (e.g. in CSE) or potentially mis-compiles.
Add a few helpers in VPBuilder to create common instructions with
correct flags.
PR: https://github.com/llvm/llvm-project/pull/179138
When a recipe can be safely sunk and all of its users are outside the
vector loop region in the same dedicated exit block, the recipe does not
need to be executed on every iteration.
This patch extends the VPlan-based LICM (Loop Invariant Code Motion) to
also sink such recipes from the vector loop region into the exit block.
This reduces redundant computation and improves cost model accuracy.
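The effect can be pictured with a scalar analogy. The two functions below are illustrative only (their names are invented here) and model a value that is computed in the loop but whose only user sits after it.

```cpp
#include <vector>

// In-loop form: Scaled is recomputed on every iteration, although only its
// final value is read after the loop.
int scaledSumInLoop(const std::vector<int> &A, int K) {
  int Sum = 0, Scaled = 0;
  for (int X : A) {
    Sum += X;
    Scaled = Sum * K; // every user of Scaled is outside the loop
  }
  return Scaled;
}

// Sunk form: the multiply runs once, in the code that follows the loop.
int scaledSumSunk(const std::vector<int> &A, int K) {
  int Sum = 0;
  for (int X : A)
    Sum += X;
  return Sum * K;
}
```

Both functions return the same value for any input that does not overflow; the second one simply does less work per iteration.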
TODO: Support nested loop sinking
TODO: Support sinking `VPReplicateRecipe` (requires `replicateByVF`
fixes)
TODO: Support recipes with multiple defined values (e.g., interleaved
loads)
TODO: Clone recipes without users to all exit blocks
TODO: Support PHI node users by checking incoming value blocks
TODO: Support sinking when users are in multiple blocks
TODO: Clone recipes when users are on multiple exit paths
Co-authored-by: Luke Lau <luke@igalia.com>
Co-authored-by: Luke Lau <luke_lau@icloud.com>
This is split out from #177114.
In order to make canonicalizeEVLLoops a generic "convert to variable
stepping" transform, move the code that changes the exit condition to a
separate transform since not all variable stepping loops will want to
transform the exit condition. Run it before canonicalizeEVLLoops, before
VPEVLBasedIVPHIRecipe is expanded.
Also relax the assertion for VPInstruction::ExplicitVectorLength to just
bail instead, since eventually VPEVLBasedIVPHIRecipe will be used by
other loops that aren't EVL tail folded.
This reverts commit d1e477b00b49c63ff4dd513eeb14a5b18bc055d7.
Recommit with extra checks making sure extends are VPWidenCastRecipes,
rejecting VPReplicateRecipes.
Original message:
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
This reverts commit f4e8cc1a2229dca76d21c8d37439c4c194b06b86.
This change wasn't NFC; it causes failed asserts when building
ffmpeg for i686 windows, see
https://github.com/llvm/llvm-project/pull/167851 for details.
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
Re-commit of https://github.com/llvm/llvm-project/pull/175839 after
fixing build without `LLVM_ENABLE_DUMP`.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
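The general shape of a "single helper plus name-capturing macro" can be sketched as follows. Everything here is hypothetical: `Plan`, `runPass`, `RUN_PASS_SKETCH`, `simplify` and `unrollBy` are stand-ins written for this example and do not reproduce the actual `VPlanTransforms::runPass` or `RUN_VPLAN_PASS` definitions.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a plan; Printed records which passes were
// "printed" after running.
struct Plan {
  bool PrintAfterAll = false;
  bool Simplified = false;
  unsigned UnrollFactor = 1;
  std::vector<std::string> Printed;
};

// One variadic helper replaces a family of overloads: it runs the pass with
// whatever extra arguments it takes, then optionally reports its name.
template <typename PassT, typename... ArgsT>
void runPass(const char *Name, PassT Pass, Plan &P, ArgsT &&...Args) {
  Pass(P, std::forward<ArgsT>(Args)...);
  if (P.PrintAfterAll)
    P.Printed.push_back(Name);
}

// The macro stringifies the pass name so callers do not repeat it by hand.
#define RUN_PASS_SKETCH(PASS, ...) runPass(#PASS, PASS, __VA_ARGS__)

// Two toy passes with different signatures.
void simplify(Plan &P) { P.Simplified = true; }
void unrollBy(Plan &P, unsigned Factor) { P.UnrollFactor = Factor; }
```

A call such as `RUN_PASS_SKETCH(unrollBy, P, 4u)` expands to `runPass("unrollBy", unrollBy, P, 4u)`, so the printed name always matches the function that ran.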
This makes use of the llvm.vector.partial.reduce.fadd intrinsics added
in #163975 to handle the following with FDOT:
```
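#include <arm_neon.h> // provides float16_t and float32_t

float32_t fdot(float16_t *src, int N) {
  float32_t sum = 0.0f;
  for (int i = 0; i < N; ++i)
    sum += src[i]; // each half-precision element is widened before the add
  return sum;
}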
```
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Always use the information from ComputeReductionResult to identify
recurrence kinds when connecting main and epilogue plans. Connecting the
live-outs involves the reduction result computations, so it is natural
and more accurate to check the reduction result for the correct
structure.
Suggested cleanup from https://github.com/llvm/llvm-project/pull/170223
Move SubclassID to VPRecipeBase, and store VPRecipeBase directly in
VPRecipeValue, instead of VPDef. This allows for some additional
simplifications and VPDef now just holds various helpers to deal with
removing and adding VPValues.
This reverts commit 16395da0ff577750571b99fe28281ce6fb6a3ae8.
PR: https://github.com/llvm/llvm-project/pull/174282
In the isOutsideLoopWorkProfitable function, there are two places where only
the runtime check cost (RtC) should be used, but the costs of the middle
blocks and early-exit blocks were incorrectly included as well:
1. VectorizeMemoryCheckThreshold comparison for interleaving-only
2. Minimum trip count that bounds runtime check overhead, i.e. MinTC2
calculation
This results in an overly conservative minimum profitable trip count.
This patch separates the runtime check cost from the total overhead
cost, and uses only RtC for VectorizeMemoryCheckThreshold comparison and
the MinTC2 calculation.
FindLast in-loop reductions are not supported, similarly to FindLastIV
reductions. Skip them in collectInLoopReductions, to avoid a crash for
loops with FindLast reductions and in-loop reductions preferred.
Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to model finding the first/last induction more
explicitly and more simply: it boils down to a min/max reduction +
compare and select of the sentinel value.
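A scalar sketch of that decomposition for the find-first case is shown below. The helper name and the choice of sentinel are assumptions for the example; the only requirement is that the sentinel can never be a real induction value.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical scalar model, not LLVM code: returns the first index I with
// A[I] > Threshold, or Start if there is none.
int64_t findFirstIVWithSentinel(const std::vector<int32_t> &A,
                                int32_t Threshold, int64_t Start) {
  // Indices are non-negative and far below INT64_MAX, so INT64_MAX is a safe
  // sentinel for a min reduction.
  const int64_t Sentinel = std::numeric_limits<int64_t>::max();
  int64_t Red = Sentinel;
  for (size_t I = 0; I < A.size(); ++I) {
    // Non-matching iterations feed the sentinel into the min reduction.
    int64_t Candidate = A[I] > Threshold ? static_cast<int64_t>(I) : Sentinel;
    Red = std::min(Red, Candidate);
  }
  // Explicit compare + select: a sentinel result means "never matched".
  return Red != Sentinel ? Red : Start;
}
```

The find-last variant is symmetric: use a max reduction and a sentinel below every possible induction value.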
PR: https://github.com/llvm/llvm-project/pull/176672
Directly check the VPlan to see if there are any FindLast reductions.
Currently this is NFC, but checking in the VPlan is more future proof,
e.g. if reductions are simplified, removed or transformed. Then checking
in legacy LoopVectorizationLegality is inaccurate.
If a UserIC is provided, each iteration of the vector loop will process
VF * UserIC elements. Pass UserIC through to computeFeasibleMaxVF and use it
to limit the max VF to factors where VF * UserIC <= MaxTripCount. This avoids
creating dead vector loops with user-provided interleave counts.
PR: https://github.com/llvm/llvm-project/pull/174573
This patch removes the single uncountable exit constraint, allowing
loops with multiple early exits, if the exits form a dominance chain and
all other constraints hold for all uncountable early exits.
While legality now accepts such loops, vectorization is not yet
supported. VPlan support will be added in a follow up:
https://github.com/llvm/llvm-project/pull/174864
PR: https://github.com/llvm/llvm-project/pull/176403
This reverts commit ed004cf42bf57ca79b57bc3076ef83a8477426ea.
The original commit exposed an independent cost issue, triggering an
assertion. That issue has been fixed in 3457e7efc3.
Reland the patch now that the assertion has been fixed.
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Remove the artificial PhiR operand of ComputeReductionResult, which was
only used to look up recurrence kind, in-loop and ordered properties.
Instead, encode them as VPIRFlags as suggested by @ayalz in
https://github.com/llvm/llvm-project/pull/170223.
This addresses a TODO to make codegen for ComputeReductionResult
independent of looking up information from other recipes.
This is NFC w.r.t. codegen, the printing has been improved to include
the reduction type, and whether it is in-loop/ordered.
PR: https://github.com/llvm/llvm-project/pull/174026
There is a bug in this logic:
```
InstructionCost Cost = ScalarCost;
InstWidening Decision = CM_Scalarize;

if (VectorCost <= Cost) {
  Cost = VectorCost;
  Decision = CM_VectorCall;
}

if (IntrinsicCost <= Cost) {
  Cost = IntrinsicCost;
  Decision = CM_IntrinsicCall;
}
```
because it assumes that the comparisons behave sensibly in the face of
invalid costs. Unfortunately, PR #174835 exposes an issue when
attempting to vectorise the new test
uadd_with_overflow_i32 for AArch64 targets. Specifically, there are
situations where all costs are invalid (e.g. VF=vscale x 1), but some
costs are more invalid than others. For example, when querying the
intrinsic cost via the TTI hook we get an invalid cost with a non-zero
value, whereas the vector cost is invalid with a zero value. That leads
to us erroneously choosing CM_VectorCall as the call widening decision,
despite the lack of a vector math variant. Inevitably this causes
crashes because we create a VPCallWidenRecipe without a variant
function.
Fix this by only performing comparisons if the costs are valid. It now
leads to us choosing CM_Scalarize more often, but it's a coin toss
anyway between CM_Scalarize and CM_IntrinsicCall when both strategies
are invalid. Potentially we could also create a new strategy called
CM_Invalid, and avoid the creation of VPlans entirely.
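The failure mode and the fix can be reproduced with a small stand-alone model. The `Cost` struct and its ordering below are assumptions made for illustration (an invalid cost compares greater than any valid one, and two invalid costs fall back to comparing their values); they are a stand-in for, not a copy of, InstructionCost.

```cpp
#include <cstdint>

// Hypothetical cost type: a value plus a validity bit.
struct Cost {
  bool Valid;
  int64_t Value;
};

// Assumed ordering: valid < invalid; ties on validity compare the values.
bool lessOrEqual(Cost A, Cost B) {
  if (A.Valid != B.Valid)
    return A.Valid;
  return A.Value <= B.Value;
}

enum class Decision { Scalarize, VectorCall, IntrinsicCall };

// Mirrors the buggy logic: compares costs without checking validity.
Decision pickBuggy(Cost Scalar, Cost Vector, Cost Intrinsic) {
  Cost Best = Scalar;
  Decision D = Decision::Scalarize;
  if (lessOrEqual(Vector, Best)) {
    Best = Vector;
    D = Decision::VectorCall;
  }
  if (lessOrEqual(Intrinsic, Best)) {
    Best = Intrinsic;
    D = Decision::IntrinsicCall;
  }
  return D;
}

// Fixed logic: a candidate is only considered when its own cost is valid.
Decision pickFixed(Cost Scalar, Cost Vector, Cost Intrinsic) {
  Cost Best = Scalar;
  Decision D = Decision::Scalarize;
  if (Vector.Valid && lessOrEqual(Vector, Best)) {
    Best = Vector;
    D = Decision::VectorCall;
  }
  if (Intrinsic.Valid && lessOrEqual(Intrinsic, Best)) {
    Best = Intrinsic;
    D = Decision::IntrinsicCall;
  }
  return D;
}
```

With all three costs invalid and the vector cost carrying the smallest value, `pickBuggy` returns `VectorCall` while `pickFixed` stays on `Scalarize`, which matches the behaviour described above.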
Addresses part of #153144 and splits off part of #166164
There are two parts to the EVL transform:
1) Convert the loop so the number of elements processed each iteration
is EVL, not VF. The IV and header mask are replaced with EVL-based
variants.
2) Optimize users of the EVL based header mask to VP intrinsic based
recipes.
(1) changes the semantics of the vector loop region, whereas (2) needs
to preserve them. This splits (2) out so we don't mix the two up, and
allows us to move (1) earlier in the pipeline in a future PR.
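Part (1) can be pictured with a scalar model of the stepping. The sketch below assumes EVL is simply min(VF, remaining); a real target may choose EVL differently, and `evlSteps` is a name invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an EVL-stepped loop: returns the number of elements
// processed by each vector iteration.
std::vector<size_t> evlSteps(size_t TripCount, size_t VF) {
  assert(VF > 0 && "VF must be non-zero");
  std::vector<size_t> Steps;
  for (size_t IV = 0; IV < TripCount;) {
    // Assumed EVL choice: as many elements as fit, capped at VF.
    size_t EVL = std::min(VF, TripCount - IV);
    Steps.push_back(EVL);
    IV += EVL; // the IV advances by EVL, not by VF
  }
  return Steps;
}
```

For a trip count of 10 and VF 4 this yields steps of 4, 4 and 2, so no scalar remainder loop and no separate header mask are needed in the model.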
Skip live-ins in findRecipe to prevent a crash for cases with degenerate
reductions (where the backedge value is a live-in). Such reductions
should be removed, but this requires further changes.
Fixes https://github.com/llvm/llvm-project/issues/175229.
Split off from https://github.com/llvm/llvm-project/pull/174026. Make
the lookup of the reduction phi recipe/compute-reduction-result
VPInstruction independent of the latter having the reduction phi as
operand.
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
Conservatively predicate sdiv/srem:
- RHS may carry poison in masked‑off lanes.
- RHS could be −1 while LHS has masked‑off lanes (risking INT_MIN/−1
overflow).
We’ll relax this once we can prove non‑wrap/non‑poison conditions.
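To see why masked-off lanes matter, consider a lane-by-lane model of a division that executes on every lane. One common way to keep such a division safe is to substitute a harmless divisor in inactive lanes. The sketch below assumes that approach and uses invented names; it does not claim to be what this patch emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a masked signed division. Active lanes must already
// be well defined (non-zero divisor, no INT32_MIN / -1); inactive lanes may
// hold anything, including 0 or -1 divisors.
std::vector<int32_t> maskedSDiv(const std::vector<int32_t> &LHS,
                                const std::vector<int32_t> &RHS,
                                const std::vector<bool> &Mask,
                                int32_t PassThru) {
  assert(LHS.size() == RHS.size() && LHS.size() == Mask.size());
  std::vector<int32_t> Out(LHS.size(), PassThru);
  for (size_t I = 0; I < LHS.size(); ++I) {
    // Inactive lanes divide by 1, which can neither trap nor overflow.
    int32_t SafeRHS = Mask[I] ? RHS[I] : 1;
    int32_t Quotient = LHS[I] / SafeRHS; // runs on every lane
    if (Mask[I])
      Out[I] = Quotient;
  }
  return Out;
}
```

Without the substitution, an inactive lane holding `INT32_MIN / -1` or a zero divisor would make the whole operation undefined even though its result is never used.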
Fixes #170775.
Follow-up to https://github.com/llvm/llvm-project/pull/171204 and
1f331e453f to only rely on isAddressSCEVForCost in legacy isAddressSCEVForCost,
completely aligning the decisions of VPlan and legacy cost model.
All extra state has been removed from VPWidenSelectRecipe at this point.
There's no benefit of having a separate recipe and Select can easily be
handled by the existing VPWidenRecipe.
PR: https://github.com/llvm/llvm-project/pull/174234
Currently we need to precompute costs for exit conditions, to match the
legacy cost, as they will get replaced by a compare against the
canonical IV (or others, like active-lane-mask or EVL based) and the
original compare will get removed.
This is not true for instructions with users other than the exit
condition. Those will remain, and we can just use the VPlan-based cost
model to get more accurate results.
This improves results in some cases, like
@test_value_in_exit_compare_chain_used_outside because the IV increment
user outside the loop is replaced by computing the final value outside
the loop.
It also fixes a crash introduced by f196b1d66ff (#146525).
PR: https://github.com/llvm/llvm-project/pull/174029
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
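The priority between the two conditions can be written down as a tiny model. The function below is illustrative only and returns the index of the successor that would be taken.

```cpp
// Hypothetical model of the branch semantics: conditions are checked in
// order, and the third successor is taken only when both are false.
unsigned branchOnTwoConds(bool Cond0, bool Cond1) {
  if (Cond0)
    return 0; // first condition wins, even if Cond1 is also true
  if (Cond1)
    return 1;
  return 2;
}
```

Written this way it is also clear how the instruction decomposes into two chained two-way conditional branches.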
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoiding the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
No phi recipes are being transformed in the main loop any longer, so
skip phi recipes.
This also makes explicit which recipes need skipping: those that have
already been transformed.
Follow-up to post-commit comment in
https://github.com/llvm/llvm-project/pull/168291.