Skip live-ins in findRecipe to prevent a crash for cases with degenerate
reductions (where the backedge value is a live-in). Such reductions
should be removed, but this requires further changes.
Fixes https://github.com/llvm/llvm-project/issues/175229.
Split off from https://github.com/llvm/llvm-project/pull/174026. Make
the lookup of the reduction phi recipe/compute-reduction-result
VPInstruction independent of the latter having the reduction phi as
operand.
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
Conservatively predicate sdiv/srem:
- RHS may carry poison in masked‑off lanes.
- RHS could be −1 while LHS has masked‑off lanes (risking INT_MIN/−1
overflow).
We’ll relax this once we can prove non‑wrap/non‑poison conditions.
Fixes#170775.
Follow-up to https://github.com/llvm/llvm-project/pull/171204 and
1f331e453f to only rely on isAddressSCEVForCost in legacy isAddressSCEVForCost,
completely aligning the decisions of VPlan and legacy cost model.
All extra state has been removed from VPWidenSelectRecipe at this point.
There's no benefit of having a separate recipe and Select can easily be
handled by the existing VPWidenRecipe.
PR: https://github.com/llvm/llvm-project/pull/174234
Currently we need to precompute costs for exit conditions, to match the
legacy cost, as they will get replaced by a compare against the
canonical IV (or others, like active-lane-mask or EVL based) and the
original compare will get removed.
This is not true for instructions with users other than the exit
condition. Those will remain, and we can just use the VPlan-based cost
model to get more accurate results.
This improves results in some cases, like
@test_value_in_exit_compare_chain_used_outside because the IV increment
user outside the loop is replaced by computing the final value outside
the loop.
It also fixes a crash introduced by f196b1d66ff (#146525).
PR: https://github.com/llvm/llvm-project/pull/174029
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoid the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
No phi recipes are being transformed in the main loop any longer, so
skip phi recipes.
This also allows to clarify which recipes need skipping explicitly.
Those are recipes that have been already transformed.
Follow-up to post-commit comment in
https://github.com/llvm/llvm-project/pull/168291.
getSCEVExprForVPValue is used to create SCEVs for expressions from the
original loop, which may be predicated. Use PSE to construct predicated
SCEVs if possible. This matches the legacy LV code behavior.
Currently should be NFC, but will enable migrating more SCEV/cost-based
computations to VPlan.
The patch requires exposing a new getPredicatedSCEV helper to
PredicatedScalarEvolution which just takes a SCEV, to avoid needing to
go through IR values, which isn't an option for getSCEVExprForVPValue.
getAddressAccessSCEV previously had some restrictive checks that limited
pointer SCEV expressions passed to TTI to GEPs with operands that must
either be invariant or marked as inductions.
As a consequence, the check rejected things like `GEP %base, (%iv + 1)`,
while the SCEV for the GEP should be as easily analyzeable as for `GEP
%base, %v`, with the only difference being the of the AddRec start
adjusted by 1.
This patch changes the code to use a SCEV-based check, limiting the
address SCEV to be loop invariant, an affine AddRec (i.e. induction ),
or an add expression of such operands or a sign-extended AddRec.
This catches all existing cases getAddressAccessSCEV caught, plus
additional ones like the cases mentioned above.
This means we pass address SCEVs in more cases, giving the backends a
better change to make informed decisions. It also unifies the decision
when to use an address SCEV between the legacy and VPlan-based cost
model.
An illustrative example of showing the impact are the gather-cost.ll
tests. Previously they were considered not profitable to vectorize
because we failed to determine that
%gep.src_data = getelementptr inbounds [1536 x float], ptr @src_data,
i64 0, i64 %mul
has a relatively small constant stride.
There may be some rough edges in the cost models, where not passing
pointer SCEVs hid some incorrect modeling, but those issues should be
fixed in the target cost models if they surface.
PR: https://github.com/llvm/llvm-project/pull/171204
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
Reapply 8a115b6934a90441 with an update to tests handling remarks.
The patch now directly emits a clear remark when we bail out
due to the memory check threshold.
Original message:
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadate.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.
In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.
However for some VPlans we may have a blend with only the first lane
used:
BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
vp<%5> = vector-pointer ir<%gep>
And in the legacy cost model we cost a blend as a phi if it's uniform:
// If we know that this instruction will remain uniform, check the cost
of
// the scalar version.
if (isUniformAfterVectorization(I, VF))
VF = ElementCount::getFixed(1);
So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.
A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that too.
Pass backedge values directly to VPFirstOrderRecurrencePHIRecipe and
VPReductionPHIRecipe directly, as they must be provided and availbale.
Split off from https://github.com/llvm/llvm-project/pull/168291.
Use SCEV to simplify all live-ins during VPlan0 construction. This
enables us to remove special SCEV queries when constructing
VPWidenRecipes and improves results in some cases.
This leads to simplifications in a number of cases in real-world
applications (~250 files changed across LLVM, SPEC, ffmpeg)
PR: https://github.com/llvm/llvm-project/pull/155304
Always include the cost of the middle block in
isOutsideLoopWorkProfitable. This addresses the TODO from
https://github.com/llvm/llvm-project/pull/168949 and removes the
temporary restriction.
isOutsideLoopWorkProfitable already scales the cost outside loops
according the expected trip counts.
In practice this increases the minimum iteration threshold in a few
cases. On a large IR corpus based on C/C++ workloads, ~50 out of 179450
vector loops have their thresholds increased slightly.
PR: https://github.com/llvm/llvm-project/pull/171102
A VPBlendRecipe always emits selects, even when the VF is scalar.
However the legacy cost model always costs all scalar non-header phis as
a phi, and the VPlan cost model has to account for this.
This can cause the cost to be a little off, for example not including
the cost of the select in @smax_call_uniform leading to unprofitable
vectorization.
This removes this from the VPlan cost model and handles checks for the
case in planContainsAdditionalSimplifications instead.
I considered trying to make the legacy cost model more accurate but I'm
not sure if it's possible. We need information as to whether or not the
scalar VF we are costing is the original loop in which case it's
actually a phi, or if it's a VPBlendRecipe that emits a select,
potentially from a VF=1, UF>=1 VPlan.
This reverts commit 8a115b6934a90441d77ea54af73e7aaaa1394b38.
This broke premerge. https://lab.llvm.org/staging/#/builders/192/builds/13326
/home/gha/llvm-project/clang/test/Frontend/optimization-remark-options.c:10:11: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadate.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
When the probability of a block is extremely low, HeaderFreq / BBFreq
may be larger than 32 bits. Previously this got truncated to uint32_t
which could cause division by zero exceptions on x86. Widen the return
type to uint64_t which should fit the entire range of BlockFrequency
values.
It's also worth noting that a frequency can never be zero according to
BlockFrequency.h, so we shouldn't need to worry about divide by zero in
getPredBlockCostDivisor itself.
Instead of comparing plain VPValue in the assertion checking the start
values, directly compare the SCEV's. This future-proofs the code in
preparation of performing more simplifications/canonicalizations for
live-ins.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.
The loop looks something like:
```c
for (int i = 0; i < n; i++) {
if (x0[i] == a)
if (x1[i] == b)
if (x2[i] == c)
// do stuff...
}
```
Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.
This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.
We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the probability as `HeaderFrequency / BlockFrequency`.
Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
Whilst there's a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:
- On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there's 0.9%/0% less geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.
Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
Replace ExtractLastElement and ExtractLastLanePerPart with more generic
and specific ExtractLastLane and ExtractLastPart, which model distinct
parts of extracting across parts and lanes. ExtractLastElement ==
ExtractLastLane(ExtractLastPart) and ExtractLastLanePerPart ==
ExtractLastLane, the latter clarifying the name of the opcode. A new
m_ExtractLastElement matcher is provided for convenience.
The patch should be NFC modulo printing changes.
PR: https://github.com/llvm/llvm-project/pull/164124
My understanding is that a VPWidenRecipe should be used for recipes with
an exact underlying scalar instruction, and VPInstruction should be used
elsewhere e.g. for instructions generated as a part of the vectorization
process.
The only user of the VPWidenRecipe constructor that doesn't take an
underlying instruction is in adjustRecipesForReductions, but we can just
use VPInstruction there.
The VPlan-based cost model assigns the forced cost once for a whole
VPInterleaveRecipe. Update the legacy cost model to match this behavior.
This fixes a cost-model divergence, and assigns the cost in a way that
matches the generated code more accurately.
PR: https://github.com/llvm/llvm-project/pull/168270
Extend the logic to hoist predicated loads
(https://github.com/llvm/llvm-project/pull/168373) to sink predicated
stores with complementary masks in a similar fashion.
The patch refactors some of the existing logic for legality checks to be
shared between hosting and sinking, and adds a new sinking transform on
top.
With respect to the legality checks, for sinking stores the code also
checks if there are any aliasing stores that may alias, not only loads.
PR: https://github.com/llvm/llvm-project/pull/168771
The VPlan-based cost model use vp_gather/vp_scatter for gather/scatter
costs, which is different to the legacy cost model and cannot be matched
there. Don't verify the costs match for plans containing gather/scatters
with EVL.
Fixes https://github.com/llvm/llvm-project/issues/169948.
Add support for vectorizing loops that select the index of the minimum
or maximum element. The patch implements vectorizing those patterns by
combining Min/Max and FindFirstIV reductions.
It extends matching Min/Max reductions to allow in-loop users that are
FindLastIV reductions. It records a flag indicating that the Min/Max
reduction is used by another reduction. The extra user is then check as
part of the new `handleMultiUseReductions` VPlan transformation.
It processes any reduction that has other reduction users. The reduction
using the min/max reduction currently must be a FindLastIV reduction,
which needs adjusting to compute the correct result:
1. We need to find the last IV for which the condition based on the
min/max reduction is true,
2. Compare the partial min/max reduction result to its final value and,
3. Select the lanes of the partial FindLastIV reductions which
correspond to the lanes matching the min/max reduction result.
Depends on https://github.com/llvm/llvm-project/pull/140451
PR: https://github.com/llvm/llvm-project/pull/141431
In #160470, there is a discussion about the possibility to explored a
general approach for handling memory intrinsics.
API changes:
- Remove getMaskedMemoryOpCost, getGatherScatterOpCost,
getExpandCompressMemoryOpCost, getStridedMemoryOpCost from
Analysis/TargetTransformInfo.
- Add getMemIntrinsicInstrCost.
In BasicTTIImpl, map intrinsic IDs to existing target implementation
until the legacy TTI hooks are retired.
- masked_load/store → getMaskedMemoryOpCost
- masked_/vp_gather/scatter → getGatherScatterOpCost
- masked_expandload/compressstore → getExpandCompressMemoryOpCost
- experimental_vp_strided_{load,store} → getStridedMemoryOpCost
TODO: add support for vp_load_ff.
No functional change intended; costs continue to route to the same
target-specific hooks.
This patch adds a new VPlan transformation to hoist predicated loads, if
we can prove they execute unconditionally, i.e. there are 2 predicated
loads to the same address with complementary masks. Then we are
guaranteed to execute one of them on each iteration, allowing us to
remove the mask.
The transform groups masked replicating loads by their address SCEV,
then checks if there are 2 loads with complementary mask. If that is the
case, we check if there are any writes that may alias the load address
in the blocks between the first and last load with the same address.
The transforms operates after linearizing the CFG, but before
introducing replicate regions, which means this is just checking a chain
of consecutive blocks.
Currently this only uses noalias metadata to check for no-alias (using
the helpers added in https://github.com/llvm/llvm-project/pull/166247).
Then we create an unpredicated VPReplicateRecipe at the position of the
first load, then replace all users of the grouped loads with it.
Small Alive2 proof for hoisting with complementary masks:
https://alive2.llvm.org/ce/z/kUx742
PR: https://github.com/llvm/llvm-project/pull/168373
If the expected trip count is less than the VF, the vector loop will
only execute a single iteration. When that's the case, the cost of the
middle block has the same impact as the cost of the vector loop. Include
it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra
work in the middle block makes it unprofitable.
Note that isOutsideLoopWorkProfitable already scales the cost of blocks
outside the vector region, but the patch restricts accounting for the
middle block to cases where VF <= ExpectedTC, to initially catch some
worst cases and avoid regressions.
This initial version should specifically avoid unprofitable tail-folding
for loops with low trip counts after re-applying
https://github.com/llvm/llvm-project/pull/149042.
PR: https://github.com/llvm/llvm-project/pull/168949
Create phi recipes for scalar resume value up front in addInitialSkeleton during initial construction. This will allow moving the remaining code dealing with resume values to VPlan transforms/construction.
PR: https://github.com/llvm/llvm-project/pull/166099
Remove `VPWidenPointerInductionRecipe::IsScalarAfterVectorization` and
replace it with `onlyScalarValuesUsed`. This removes the need to carry
state from the legacy cost model through VPlan, and the VPlan-based
analysis gives more accurate results, avoiding a number of extracts.
PR: https://github.com/llvm/llvm-project/pull/168289