Successors outside of any loop do not contribute to the innermost loop;
skip them to avoid incorrect results due to
getSmallestCommonLoop(nullptr, X) returning nullptr.
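A minimal sketch of why the skip matters, assuming a simplified loop nest. The `Loop` struct, `smallestCommonLoop` and `commonLoopOfSuccessors` are hypothetical stand-ins, not the LLVM classes or the code touched by this change; only the null-in, null-out behavior is taken from the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a loop nest node; not llvm::Loop.
struct Loop {
  Loop *Parent = nullptr;
  unsigned Depth = 1; // 1 for an outermost loop
};

// Assumed behavior: a null argument yields a null result.
Loop *smallestCommonLoop(Loop *A, Loop *B) {
  if (!A || !B)
    return nullptr;
  while (A != B) {
    // Walk up from the deeper (or equally deep) side until both meet.
    if (A->Depth >= B->Depth)
      A = A->Parent;
    else
      B = B->Parent;
    if (!A || !B)
      return nullptr;
  }
  return A;
}

// Fold the loops of all successors, skipping successors outside any loop
// so that a single null entry does not collapse the result to null.
Loop *commonLoopOfSuccessors(const std::vector<Loop *> &SuccLoops) {
  Loop *Result = nullptr;
  bool Seen = false;
  for (Loop *L : SuccLoops) {
    if (!L)
      continue; // Outside any loop: does not contribute.
    Result = Seen ? smallestCommonLoop(Result, L) : L;
    Seen = true;
  }
  return Result;
}
```

Without the `continue`, one successor outside any loop would force the result to nullptr regardless of the other successors.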
computeBestVF iterates over all VPlans and picks the VF of the most
profitable VPlan. This VPlan is later needed for execution and
additional checks. Instead of retrieving it multiple times later, just
directly return it from computeBestVF.
This removes some redundant lookups.
PR: https://github.com/llvm/llvm-project/pull/190385
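A rough sketch of the shape of this change, assuming one VF and one cost per plan. `Plan`, `CostPerLane` and `pickBestVF` are hypothetical names for illustration; the real computeBestVF signature and cost comparison are not shown here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, simplified plan: a single VF and a single cost.
struct Plan {
  unsigned VF;
  double CostPerLane; // lower is better
};

// Return the chosen VF together with the plan it came from, so callers
// do not have to look the plan up again.
std::pair<unsigned, const Plan *> pickBestVF(const std::vector<Plan> &Plans) {
  const Plan *Best = nullptr;
  for (const Plan &P : Plans)
    if (!Best || P.CostPerLane < Best->CostPerLane)
      Best = &P;
  // Fall back to scalar (VF = 1) when there is no plan at all.
  return {Best ? Best->VF : 1u, Best};
}
```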
Match the select operands directly against PhiR using m_Specific,
binding only the non-phi IV expression. This replaces the generic
TrueVal/FalseVal matching followed by an assert and conditional
extraction.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.
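The idea of matching directly against the phi can be sketched without the pattern-match helpers. `Value`, `Select` and `matchSelectAgainstPhi` below are hypothetical stand-ins, not the VPlan types or matchers used by the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for a value and a select with three operands.
struct Value {
  int Id;
};
struct Select {
  const Value *Cond;
  const Value *TrueVal;
  const Value *FalseVal;
};

// Match select(c, Phi, X) or select(c, X, Phi) and bind only X.
// Returns nullptr when neither operand is the given phi.
const Value *matchSelectAgainstPhi(const Select &S, const Value *Phi) {
  if (S.TrueVal == Phi)
    return S.FalseVal;
  if (S.FalseVal == Phi)
    return S.TrueVal;
  return nullptr;
}
```

Because the phi is matched by identity, no follow-up assert or conditional extraction is needed to find out which operand was the phi.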
Simplify the sentinel checking logic by using APSInt and checking for
both a signed and unsigned sentinel in a single call.
Removes the IsSigned argument.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.
…ns (NFC).
Use the more descriptive name FindLastSelect for the conditional select
that picks between the reduction phi and the IV value.
Split off from approved
https://github.com/llvm/llvm-project/pull/183911/ as suggested.
Remove unused ReductionLiveOuts variable in `canFoldTailByMasking()`.
The set was being populated with reduction loop exit instructions but
was never actually used anywhere in the function.
We want the LV cost-model to make the best possible decision of VF and
whether or not to use partial reductions. At the moment, when the LV can
use partial reductions for a given VF range, it assumes those are always
preferred. After transforming the plan to use partial reductions, it
then chooses the most profitable VF. It is possible for a different VF
to have been more profitable, if it wouldn't have chosen to use partial
reductions.
This PR changes that, to first decide whether partial reductions are
more profitable for a given chain. If not, then it won't do the
transform.
This causes some regressions for AArch64 which are addressed in a
follow-up PR to keep this one simple.
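A sketch of the per-chain decision, assuming each chain has one cost with and one cost without partial reductions for the VF under consideration. `ChainCost` and `choosePartialReductions` are hypothetical, and the tie-breaking (keep the regular reduction on equal cost) is an assumption, not taken from the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-chain costs for one VF; lower is better.
struct ChainCost {
  unsigned RegularCost;
  unsigned PartialCost;
};

// Decide for each chain whether the partial-reduction transform should
// be applied. Assumption: on a tie the regular reduction is kept.
std::vector<bool>
choosePartialReductions(const std::vector<ChainCost> &Chains) {
  std::vector<bool> UsePartial;
  for (const ChainCost &C : Chains)
    UsePartial.push_back(C.PartialCost < C.RegularCost);
  return UsePartial;
}
```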
Splitting out some work from #178454; this covers the enums for
early exit loop type (none, readonly, readwrite) and the style
used (readonly with multiple exit blocks, or masking with the
last iteration done in scalar code), along with changing the early
exit recipe detection to suit moving the transform for handling
early exit readwrite loops earlier in the vplan pipeline.
The isTreeTinyAndNotFullyVectorizable check for 2-node trees
(insertelement root + gather child) was too aggressive: it rejected
trees even when LoadEntriesToVectorize was non-empty, preventing
gathered loads from being vectorized into masked loads/strided loads, etc.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/190181
The isTreeTinyAndNotFullyVectorizable check for 2-node trees
(insertelement root + gather child) was too aggressive: it rejected
trees even when LoadEntriesToVectorize was non-empty, preventing
gathered loads from being vectorized into masked loads/strided loads, etc.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/190040
The FMulAdd (CombinedVectorize) transformation in transformNodes() marks
an FMul child entry with zero cost, assuming it is fully absorbed into
the fmuladd intrinsic. However, when any FMul scalar has multiple uses
(e.g., also stored separately), the FMul must survive as a separate
node.
Reviewers: hiraditya, RKSimon, bababuck
Pull Request: https://github.com/llvm/llvm-project/pull/189692
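One way to picture the condition, assuming the only input is the use count of each FMul scalar. `FMulScalar`, `fmulIsAbsorbed` and `fmulNodeCost` are hypothetical; the actual check in transformNodes() may look at more than use counts.

```cpp
#include <cassert>
#include <vector>

// Hypothetical view of an FMul scalar: only its number of users.
struct FMulScalar {
  unsigned NumUses;
};

// The FMul node can be treated as free only if every scalar is used
// solely by the fused multiply-add.
bool fmulIsAbsorbed(const std::vector<FMulScalar> &Scalars) {
  for (const FMulScalar &S : Scalars)
    if (S.NumUses != 1)
      return false;
  return true;
}

// Zero cost when fully absorbed, otherwise the FMul survives as a
// separate node and keeps its vector cost.
unsigned fmulNodeCost(const std::vector<FMulScalar> &Scalars,
                      unsigned VectorFMulCost) {
  return fmulIsAbsorbed(Scalars) ? 0 : VectorFMulCost;
}
```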
ReductionRoot was initialized to nullptr instead of the RdxRoot
parameter. This caused two ScaleCost calls (for MinBWs cast cost and
ReductionBitWidth resize cost) to pass nullptr as the user instruction,
and suppressed the "Reduction Cost" line in debug output. In practice
the scale factor is the same because the tree root's main op and the
reduction root share the same basic block, so this is NFC.
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/189994
Update LV to also use the VPlan-based addMinimumIterationCheck for the
iteration count check for the epilogue.
As the VPlan-based addMinimumIterationCheck uses VPExpandSCEV, those
need to be placed in the entry block for now, moving vscale * VF * IC to
the entry for scalable vectors.
The new logic also fails to simplify some checks involving PtrToInt,
because they were only simplified when going through generated IR, then
folding some PtrToInt in IR, then constructing SCEVs again. But those
should be cleaned up by later combines, and there is not really much we
can do other than trying to go through IR.
PR: https://github.com/llvm/llvm-project/pull/189372
The Uses in foldShuffleToIdentity is intended to detect where an operand
is used to distinguish between splats, identities and concats of the
same value. When looking through multiple unsimplified shuffles the same
Use could be both a splat and an identity though. This patch changes the
Use to a Value and an original Use, so that even if we are looking
through multiple vectors we recognise the splat vs identity vs concat of
each use correctly.
Fixes #180338
Replace the DenseMap<Value*, Value*> TrackedToOrig with a SmallVector<Value*>
indexed in parallel with Candidates. This avoids hash-table overhead for the
tracked-value-to-original-value mapping in horizontal reduction processing.
Fixes #189686
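The parallel-vector layout can be sketched as follows, using std::vector in place of SmallVector so the example stays self-contained. `ReductionCandidates`, `add` and `origFor` are hypothetical names, not the members used in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Value {
  int Id;
};

// Candidates[I] is the tracked value and TrackedToOrig[I] is its
// original value; the two vectors always have the same length.
struct ReductionCandidates {
  std::vector<Value *> Candidates;
  std::vector<Value *> TrackedToOrig;

  void add(Value *Tracked, Value *Orig) {
    Candidates.push_back(Tracked);
    TrackedToOrig.push_back(Orig);
  }

  // Lookup by candidate index: a plain array access, no hashing.
  Value *origFor(std::size_t Idx) const { return TrackedToOrig[Idx]; }
};
```

This only works when the lookup already has the candidate's index at hand, which is the situation described above.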
In outer-loop VPlan, avoid emitting vector intrinsic calls for intrinsics
without a vector form. In VPRecipeBuilder, detect missing vector intrinsic
mapping and emit scalar handling instead of a vector call.
Also fix assertion when `llvm.pseudoprobe` in VPlan's native path is being
treated as a `WIDEN-INTRINSIC`.
Reproducer: https://godbolt.org/z/GsPYobvYs
A CSE crash is observed arising from outdated hash values unless we
forbid replacements in successor phis in blocks that are not dominated
by the def: the crash is observed when there is a block with CSE'able
phis with CSE'able incoming values, with incoming values coming from a
non-dominating block, under the condition that the block with the phis
is visited before the non-dominating block. It is unfortunately
impossible to write a test case showing a crash at present, but crashes
do occur when attempting to CSE DerivedIV recipes. The root cause of the
crash is visiting a non-dominated use before a def, and hence would be
fixed by a reverse post-order traversal.
Fixes #187499.
Co-authored-by: Luke Lau <luke@igalia.com>
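For reference, a reverse post-order traversal on a small CFG can be sketched like this. Blocks are plain indices and `CFG`, `postOrder` and `reversePostOrder` are hypothetical helpers, not the VPlan traversal utilities.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Blocks are indices; Succs[B] lists the successors of block B.
using CFG = std::vector<std::vector<int>>;

// Depth-first walk that records a block after all of its successors.
static void postOrder(const CFG &Succs, int B, std::vector<bool> &Visited,
                      std::vector<int> &Out) {
  Visited[B] = true;
  for (int S : Succs[B])
    if (!Visited[S])
      postOrder(Succs, S, Visited, Out);
  Out.push_back(B);
}

// Reversing the post-order visits every block before its successors,
// except along back edges.
std::vector<int> reversePostOrder(const CFG &Succs, int Entry) {
  std::vector<bool> Visited(Succs.size(), false);
  std::vector<int> Order;
  postOrder(Succs, Entry, Visited, Order);
  std::reverse(Order.begin(), Order.end());
  return Order;
}
```

On a diamond CFG (0 to 1 and 2, both to 3), the join block 3 holding the phis is visited only after both of its predecessors, including the one that does not dominate it.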
Following #146525, separate the reverse mask from reverse access
recipes.
At the same time, remove the unused member variable `Reverse` from
`VPWidenMemoryRecipe`.
This will help to reduce redundant reverse mask computations by
VPlan-based common subexpression elimination.
The truncating store analogue of #181104.
Adds `Alignment` and `AddrSpace` parameters to
`TargetLoweringBase::getTruncStoreAction` and dependents, and introduces
a `getCustomTruncStoreAction` hook for targets to customize legalization
behavior using this new information.
This change is fully backwards compatible from the target's point of
view, with `setTruncStoreAction` having identical functionality. The
change is purely additive.
This patch introduces an iterator that helps us iterate over lane-value
pairs in a range. For example, given a container `(i32 %v0, <2 x i32>
%v1, i32 %v2)` we get:
```
Lane Value
0 %v0
1 %v1
3 %v2
```
We use this iterator to replace the lane counting logic in
BottomUpVec.cpp.
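The lane numbering in the table above can be reproduced with a small sketch. `Val` and `enumerateLanes` are hypothetical, and the function returns an eagerly built vector rather than a lazy iterator like the one introduced by the patch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a value: a name and its number of lanes
// (1 for a scalar, N for an N-element vector).
struct Val {
  std::string Name;
  unsigned NumLanes;
};

// Pair each value with the lane at which it starts.
std::vector<std::pair<unsigned, std::string>>
enumerateLanes(const std::vector<Val> &Range) {
  std::vector<std::pair<unsigned, std::string>> Result;
  unsigned Lane = 0;
  for (const Val &V : Range) {
    Result.emplace_back(Lane, V.Name);
    Lane += V.NumLanes; // A vector value occupies several lanes.
  }
  return Result;
}
```

For `(%v0: 1 lane, %v1: 2 lanes, %v2: 1 lane)` this yields lanes 0, 1 and 3, matching the table.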
If the trimming candidate subtree is rooted at an alternate-shuffle node
with binary ops, and this subtree has the same cost as the buildvector
node, it is better to stick with the buildvector node to avoid runtime
perf regressions from shuffle/extra-operation overhead that the cost model may
underestimate. Skip trimming if the subtree contains ExtractElement
nodes, since those operate on already-materialized vectors, which may
reduce vector-to-scalar code movement and give better perf.
Reviewers: hiraditya, bababuck, fhahn, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/188272
- Differentiate between operations that need masking because they are in
a conditionally-executed block, and operations that need masking because
the loop is tail-folded (predicated).
- This is needed for future work when we need to support a predicated
vector epilogue in combination with an unpredicated vector body.
- This is the first patch in a series.
- See #181401 for the follow-on work.
Extend initial unrolling of replicate regions
(https://github.com/llvm/llvm-project/pull/170212) to support live-outs,
if the VF is scalar.
This allows adding the logic needed to explicitly unroll, and replacing
VPPredPhiInsts with regular scalar VPPhi, without yet having to worry
about packing values into vector phis. This will be done in a follow-up
change, which means all replicate regions will be fully dissolved.
PR: https://github.com/llvm/llvm-project/pull/186252
Instead of replacing all uses of the canonical IV with an add of the
resume value and then relying on the fold to simplify, directly create
offset versions of both the canonical IV and its increment.
The original offset computations were incorrect, but did not result in
miscompiles thanks to the corresponding fold.
Split off from approved
https://github.com/llvm/llvm-project/pull/156262.
Need to check if the potential bitcast/bswap-like construct is a root of
the reduction, otherwise it cannot represent a bitcast/bswap construct.
Fixes #189184
Refactor to precede #185964.
Much of this is a refactor to address these issues. Instead of iterating over one chain at a time, attempting all VFs for that given chain, we now iterate over VFs, trying each chain for the current VF.
Includes a fix for a use-after-free bug.
Due to a somewhat recent change, IntOrFpInduction recipes have
associated VPIRFlags. The VPlanUnroll logic for WidenInduction recipes
predates this change, and computes incomplete wrap-flags: update it to
simply use the flags on IntOrFpInduction recipes; PointerInduction
recipes have no associated flags, and indeed, no flags should be used.