Instead of checking dereferenceability early during
LoopVectorizationLegality, defer the check to VPlan construction via
areAllLoadsDereferenceable.
This in preparation for supporting early exit vectorization of
non-dereferencable loads, e.g. via speculative loads
(https://discourse.llvm.org/t/rfc-provide-intrinsics-for-speculative-loads/89692)
or first-faulting loads. Detection in VPlan allows easily replacing
potentially non-deref loads with other loads as needed.
PR: https://github.com/llvm/llvm-project/pull/185323
Recursively splitting out some work from #183318; this covers
the enums for early exit loop type (none, readonly, readwrite)
and the style used (just readonly and
masked-handle-ee-in-scalar-tail for now) and refactoring for
basic use of those enums.
This reverts commit 91b928f919364b29e241821fc639b9ef56dab1a5.
This complicates some analysis that need the happen on the scalar VPlan,
before regions have been created, e.g.
https://github.com/llvm/llvm-project/pull/185323/.
The function instructionsWithoutDebug serves two uses: skipping debug
intrinsics and skipping pseudo instructions. Nonetheless, these
functions are expensive due to out-of-line filtering using
std::function. Ideally, the filter should be inlined, but that would
require including IntrinsicInst.h in BasicBlock.h.
We no longer use debug intrinsics, so the first use (parameter false) is
no longer needed. The second use is sometimes needed, but the
distinction between PseudoProbe instructions can be made at the call
sites more easily in many cases.
Therefore, remove instructionsWithoutDebug/sizeWithoutDebug.
c-t-t stage2-O3 -0.21%.
Since 1b29ac1d1857ea42273fc7862ea019a74a55195d, regions are constructed
as part of the scalar transforms; moving header phi creation after
region creation slightly simplifies the code.
Move handleEarlyExits, predication and region creation to operate
directly on VPlan0. This means they only have to run once, reducing
compile time a bit; the relative order remains unchanged.
Introducing the regions at this point in particular unlocks performing
more transforms once, on the initial VPlan, instead of running them for
each VF.
Whether a scalar epilogue is required is still determined by legacy cost
model, so we need to still account for that in the VF specific VPlan
logic.
PR: https://github.com/llvm/llvm-project/pull/185305
Remove updateScalarResumePhis and create extracts for live-outs early in
addInitialSkeleton. Instead of extracting the from the header phi
recipes for the resume values (which is incorrect), extract the last
lane of the backedege value.
Then update optimizeInductionExitUsers to optimize both the scalar
resume values for IVs and IV exit values together.
This removes the need to pass state between transforms and addresses a
TODO.
PR: https://github.com/llvm/llvm-project/pull/174239
This extends the existing support to work with arbitrary interleave
factors. The main change here is reworking the ExtractLastActive
VPInstruction to take a variable amount of arguments and handling it in
unrollRecipeByUF and VPInstruction::generate.
The select condition for all mask/data values in a find-last recurrence
is the true if the mask for any part is true. Because of this the masks
for inactive parts will be updated to all-false when the parts with
active lanes are updated. This ensures the mask/data for last active
element always corresponds to the greatest part with an active lane.
This means finding the last element in the middle block simply requires
chaining the `extract.last.active` to forward the result from the last
active part through any inactive parts ahead of it.
Currently the logic for introducing a header mask and predicating the
vector loop region is done inside introduceMasksAndLinearize.
This splits the tail folding part out into an individual VPlan transform
so that VPlanPredicator.cpp doesn't need to worry about tail folding,
which seemed to be a temporary measure according to a comment in
VPlanTransforms.h.
To perform tail folding independently, this splits the "body" of the
vector loop region between the phis in the header and the branch + iv
increment in the latch:
Before:
```
+-------------------------------------------+
|%iv = ... |
|... |
|%iv.next = add %iv, vfxuf |
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
```
After:
```
+-------------------------------------------+
|%iv = ... |
|%wide.iv = widen-canonical-iv ... |
|%header-mask = icmp ule %wide.iv, BTC |---+
|branch-on-cond %header-mask | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|... | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|%iv.next = add %iv, vfxuf |<--+
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
```
Phis are then inserted in the latch for any value in the loop body that
have outside uses, with poison as their incoming value from the header
edge.
The motivation for this is to allow us to share the same "predicate all
successor blocks" type of predication we do for tail folding, but for
early-exit loops in #172454. This may also allow us to directly emit an
EVL based header mask, instead of having to match + transform the
existing header mask in addExplicitVectorLength.
This also allows us to eventually handle recurrences in the same
transform, avoiding the need to special case tail folding in
addReductionResultComputation.
Now that we have ExitingIVValue, we can also use it for tail-folded
loops; the only difference is that we have to compute the end value with
the original trip count instead the vector trip count.
This allows removing the induction increment operand only used when
tail-folding.
PR: https://github.com/llvm/llvm-project/pull/182507
Previously, we could miscompile when vectorizing conditional scalar
assignments with forced tail folding, as the backedge select could be
based on the header mask, not the assignment conditional.
This resulted in a number of failures in the LLVM test suite when
building with `-O3 -march=armv8-a+sve -mllvm
-prefer-predicate-over-epilogue=predicate-dont-vectorize`.
The patch reworks `handleFindLastReductions()` to correctly handle tail
folding.
Add an alternative to test VPlan in more isolation via a new
`vplan-test-transform` option, which builds VPlan0 for each loop in the
input IR and then can invoke a set of transforms on it.
In order to allow different recipe types to be created, a new
widen-from-metadata transform is added, which transforms VPInstructions
to different recipes, based on custom !vplan.widen metadata. Currently
this supports creating widen & replicate recipes, but can easily be
extended in the future.
Currently the handling is intentionally bare-bones, to be extended
gradually as needed.
PR: https://github.com/llvm/llvm-project/pull/178522
Currently createExtractsForLiveOuts only handles creating extracts when
the middle block has one predecessor, but if an early exit exits to the
same block as the latch then it might have multiple predecessors.
This handles the latter case to avoid the need to handle it in
VPlanTransforms::handleUncountableEarlyExits. Addresses the comment in
https://github.com/llvm/llvm-project/pull/174864#discussion_r2794153217
In order to be able to create selects for reduction phis through tail
folding in foldTailByMasking (#176143), make VPReductionPHIRecipe an
instance of VPIRFlags and plumb the FMFs from the original RdxDesc.
This allows us to remove more uses of the RecurrenceDescriptor in
addReductionResultComputation, which should help untie it from
LoopVectorizationLegality.
Extend handleMultiUseReductions to support strict predicates (>, <),
matching the first index instead of the last for non-strict predicates.
Builds on top of https://github.com/llvm/llvm-project/pull/141431.
FindLast reductions with strict predicates are adjusted to compute the
correct result as follows:
1. Find the first canonical indices corresponding to partial min/max
values, using loop reductions.
2. Find which of the partial min/max values are equal to the overall
min/max value.
3. Select among the canonical indices those corresponding to the overall
min/max value.
4. Find the first canonical index of overall min/max and scale it back to
the original IV using VPDerivedIVRecipe.
5. If the overall min/max equals the starting min/max, the condition in
the loop was always false, due to being strict; return the original start
value in that case.
Building on top of the recent changes to introduce BranchOnTwoConds,
this patch adds support for vectorizing loops with multiple early exits,
all dominating a countable latch. The early exits must form a
dominance chain, so we can simply check which early exit has been taken
in dominance order.
Currently LoopVectorizationLegality ensures that all exits other than
the latch must be uncountable. handleUncountableEarlyExits now collects
those uncountable exits and processes each exit.
In the vector region, we compute if any exit has been taken, by taking
the OR of all early exit conditions (EarlyExitConds) and checking if
there's
any active lane.
If the early exit is taken, we exit the loop and compute which early
exit
has been taken. The first taken early exit is the one where its exit
condition is true in the first active lane of EarlyExitConds.
We create a chain of dispatch blocks outside the loop to check this for
the early exit blocks ordered by dominance.
Depends on https://github.com/llvm/llvm-project/pull/174016.
PR: https://github.com/llvm/llvm-project/pull/174864
This patch extends the support added in #158088 to loops where the
assignment is non-speculatable (e.g. a conditional load or divide).
For example, the following loop can now be vectorized:
```
int simple_csa_int_load(
int* a, int* b, int default_val, int N, int threshold)
{
int result = default_val;
for (int i = 0; i < N; ++i)
if (a[i] > threshold)
result = b[i];
return result;
}
```
It does this by extending the recurrence matching from only looking for
selects, to include phis where all operands are the header phi, except
for one which can be an arbitrary value outside the recurrence.
---
Reverts llvm/llvm-project#180275 (original PR: #178862)
Additional type legalization for `ISD::VECTOR_FIND_LAST_ACTIVE` was
added in #180290, which should resolve the backend crashes on x86.
If a phi has fast math flags, we can propagate it to the widened select.
To do this, this patch makes VPPhi and VPBlendRecipe subclasses of
VPRecipeWithIRFlags, and propagates it through PlainCFGBuilder and
VPPredicator.
Alive2 proofs for some of the FMFs (it looks like it can't reason about
the full "fast" set yet)
nnan: https://alive2.llvm.org/ce/z/f0bRd4
nsz: https://alive2.llvm.org/ce/z/u9P96T
The actual motivation for this to eventually be able to move the special
casing for tail folding in
LoopVectorizationPlanner::addReductionResultComputation into the CFG in
#176143, which requires passing through FMFs.
Add a new VPInstruction opcode to compute the exiting value of an
induction variable after vectorization. This replaces the pattern of
extracting the last lane from the last part of the induction backedge
value when applicable.
This allows us to always use the pre-computed IV end value. It will also
allow unifying end value creation for both induction resume and exit
values.
PR: https://github.com/llvm/llvm-project/pull/175651
This patch extends the support added in #158088 to loops where the
assignment is non-speculatable (e.g. a conditional load or divide).
For example, the following loop can now be vectorized:
```
int simple_csa_int_load(
int* a, int* b, int default_val, int N, int threshold)
{
int result = default_val;
for (int i = 0; i < N; ++i)
if (a[i] > threshold)
result = b[i];
return result;
}
```
It does this by extending the recurrence matching from only looking for
selects, to include phis where all operands are the header phi, except
for one which can be an arbitrary value outside the recurrence.
This patch restructures Find(First|Last)IV handling. Instead of
differentiating between FindLast, FindFirstIV and FindLastIV up front,
this patch simplifies the logic in IVDescriptor to just identify the
FindLast pattern up-front.
It then adds a new VPlan transformation to optimize FindLast reductions
to FindIV reductions if there is a suitable sentinel value.
Find(Last|First)IV recurrence kinds to a single FindIV kind.
This is simpler and more accurate, given selecting the first/last
induction of the final IV reduction is directly controlled by the
corresponding recurrence kind of the ComputeReductionResult.
The new structure also allows further optimizations, like vectorizing
FindLastIV with another boolean reduction that tracks if the condition
in the loop was ever true, if there is no suitable sentinel value.
PR: https://github.com/llvm/llvm-project/pull/177870
Enforce that all VPInstructions set the correct OpType of the VPIRFlags.
Flag mis-matches (e.g. VPInstruction Add without `OverflowingBinOp`
being set) can cause crashes (e.g. in CSE) or potentially mis-compiles.
Add a few helpers in VPBuilder to create common instructions with
correct flags.
PR: https://github.com/llvm/llvm-project/pull/179138
This reverts commit d1e477b00b49c63ff4dd513eeb14a5b18bc055d7.
Recommit with a extra checks making sure extends are VPWidenCastRecipes,
rejecting VPReplicateRecipes.
Original message:
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
This reverts commit f4e8cc1a2229dca76d21c8d37439c4c194b06b86.
This change wasn't NFC; it causes failed asserts when building
ffmpeg for i686 windows, see
https://github.com/llvm/llvm-project/pull/167851 for details.
Split up attachCheckBlock into its distinct operations:
* inserting the check block in the CFG + updating phis, and
* adding the branch VPInstruction.
Those helpers can be re-used in follow-up changes.
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
Re-commit of https://github.com/llvm/llvm-project/pull/175839 after
fixing build without `LLVM_ENABLE_DUMP`.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
This consists of the following changes:
* Merge several overloads of `VPlanTransforms::runPass` into a single
function
to avoid code duplication.
* Add helper macro `RUN_VPLAN_PASS` to capture the transformation name
and pass it to the helper above for printing.
* Add new `-vplan-print-after-all` option (somewhat similar to existing
`-vplan-verify-each`).
* Add two empty passes `printAfterInitialConstruction`/`printFinalVPlan`
so that initial/final
VPlans would be supported in `-vplan-print-after-all`
This follows the original future plans in
https://github.com/llvm/llvm-project/pull/123640.
Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to more explicitly and simpler model computing finding
the first/last induction, which boils down to a min/max reduction +
compare and select of the sentinel value.
PR: https://github.com/llvm/llvm-project/pull/176672
Fix a miscompile in the FindLast handling by normalizing selects
with the phi node as the first op to ones that select the data value
when the condition is true, by swapping operands and inverting the
condition.
This should ensure correct codegen for both cases.
Select normalization:
https://alive2.llvm.org/ce/z/yFdivK
Fixes a miscompile reported for 2abd6d6d7ac (#158088).
After d5c11b9a2 (https://github.com/llvm/llvm-project/pull/174026),
finding the ComputeReductionResult VPInstruction for a reduction
requires an extra step. Factor out code to helper to be re-used in
follow-up patches.