This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
Reapply 8a115b6934a90441 with an update to tests handling remarks.
The patch now directly emits a clear remark when we bail out
due to the memory check threshold.
Original message:
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadate.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.
In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.
However for some VPlans we may have a blend with only the first lane
used:
BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
vp<%5> = vector-pointer ir<%gep>
And in the legacy cost model we cost a blend as a phi if it's uniform:
// If we know that this instruction will remain uniform, check the cost
of
// the scalar version.
if (isUniformAfterVectorization(I, VF))
VF = ElementCount::getFixed(1);
So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.
A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that too.
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.
Add test coverage for remark when runtime checks are not profitable with
threshold provided.
Also make sure that X86 remark tests actually passes an X86 triple,
which is needed for the threshold remark.
Also clean up the tests a bit.
A VPBlendRecipe always emits selects, even when the VF is scalar.
However the legacy cost model always costs all scalar non-header phis as
a phi, and the VPlan cost model has to account for this.
This can cause the cost to be a little off, for example not including
the cost of the select in @smax_call_uniform leading to unprofitable
vectorization.
This removes this from the VPlan cost model and handles checks for the
case in planContainsAdditionalSimplifications instead.
I considered trying to make the legacy cost model more accurate but I'm
not sure if it's possible. We need information as to whether or not the
scalar VF we are costing is the original loop in which case it's
actually a phi, or if it's a VPBlendRecipe that emits a select,
potentially from a VF=1, UF>=1 VPlan.
Update the logic in narrowToSingleScalar to allow narrowing even if not
all users use scalars, if at least one of the operands already needs
broadcasting.
In that case, there won't be any additional broadcasts introduced. This
should allow removing the special handling for stores, which can
introduce additional broadcasts currently.
Fixes https://github.com/llvm/llvm-project/issues/169668.
PR: https://github.com/llvm/llvm-project/pull/168246
In preparation to strip VPUnrollPartAccessor and unroll recipes
directly, strip unnecessary complication in getGEPIndexTy, as the unroll
part will no longer be available in follow-ups (see #168886 for
instance). The patch also helps by doing a mass test update up-front.
Narrowing the GEP index type conditionally does not yield any benefit,
and the change is non-functional in terms of emitted assembly. While at
it, avoid hard-coding address-space 0, and use the pointer operand's
address space to get the GEP index type.
Split off from PR #163525, this standalone patch replaces almost all the
remaining cases where undef is used as value in loop vectoriser tests.
This will reduce the likelihood of contributors hitting the `undef
deprecator` warning in github.
NOTE: The remaining use of undef in iv_outside_user.ll will be fixed in
a separate PR.
I've removed the test stride_undef from version-mem-access.ll, since
there is already a stride_poison test.
If the expected trip count is less than the VF, the vector loop will
only execute a single iteration. When that's the case, the cost of the
middle block has the same impact as the cost of the vector loop. Include
it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra
work in the middle block makes it unprofitable.
Note that isOutsideLoopWorkProfitable already scales the cost of blocks
outside the vector region, but the patch restricts accounting for the
middle block to cases where VF <= ExpectedTC, to initially catch some
worst cases and avoid regressions.
This initial version should specifically avoid unprofitable tail-folding
for loops with low trip counts after re-applying
https://github.com/llvm/llvm-project/pull/149042.
PR: https://github.com/llvm/llvm-project/pull/168949
Remove `VPWidenPointerInductionRecipe::IsScalarAfterVectorization` and
replace it with `onlyScalarValuesUsed`. This removes the need to carry
state from the legacy cost model through VPlan, and the VPlan-based
analysis gives more accurate results, avoiding a number of extracts.
PR: https://github.com/llvm/llvm-project/pull/168289
This patch implements a transform to hoists single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.
This enables hosting of loads we can prove do not alias any stores in
the loop due to memory runtime checks added during vectorization.
PR: https://github.com/llvm/llvm-project/pull/166247
Changes: The previous patch had to be reverted to a mismatching-OpType
assert in cse. The reduced-test has now been added corresponding to a
RVV pointer-induction, and the pointer-induction case has been updated
to use createOverflowingBinaryOp.
While at it, record VPIRFlags in VPWidenInductionRecipe.
Follow up on c2d4c7c18b96 ([VPlan] Permit more users in
narrowToSingleScalars) to fix an assert related to WidenStore users of
the recipe being narrowed in narrowToSingleScalars.
Split off from #158690. Currently if an instruction needs predicated due
to tail folding, it will also have a predicated discount applied to it
in multiple places.
This is likely inaccurate because we can expect a tail folded
instruction to be executed on every iteration bar the last.
This fixes it by checking if the instruction/block was originally
predicated, and in doing so prevents vectorization with tail folding
where we would have had to scalarize the memory op anyway.
On llvm-test-suite this causes 4 loops in total to no longer be
vectorized with -O3 on arm64-apple-darwin, and there's no observable
performance impact.
Call getVectorTripCount first, and call getTripCount failing that, in
simplifyBranchConditionForVFAndUF, to simplify missed cases. While at
it, strip the dead check for a zero TC.
Generalize VPWidenSelectRecipe codegen to consider single-scalar
conditions instead of just loop-invariant ones.
If the condition is a single-scalar, we can simply use a scalar
condition.
PR: https://github.com/llvm/llvm-project/pull/165506
Split off from PR #163525, this standalone patch replaces simple cases
where undef is used as a value for arithmetic or getelementptr
instructions. This will reduce the likelihood of contributors hitting
the `undef deprecator` warning in github.
Update isNoWrap to only use the inbounds/nusw flags from GEPs that are
guaranteed to be dereferenced on every iteration. This fixes a case
where we incorrectly determine no dependence.
I think the issue is isolated to code that evaluates the resulting
AddRec at BTC, just using it to compute the distance between accesses
should still be fine; if the access does not execute in a given
iteration, there's no dependence in that iteration. But isolating the
code is not straight-forward, so be conservative for now. The practical
impact should be very minor (only one loop changed across a corpus with
27k modules from large C/C++ workloads.
Fixes https://github.com/llvm/llvm-project/issues/160912.
PR: https://github.com/llvm/llvm-project/pull/161445
This follows similar reasoning as 45ce88758d24
(https://github.com/llvm/llvm-project/pull/159556):
LV does not preserve LCSSA, it constructs it just before processing a
loop to vectorize. Runtime check expressions are invariant to that loop,
so expanding them should not break LCSSA form for the loop we are about
to vectorize.
LV creates SCEV and memory runtime checks early on and then disconnects
the blocks temporarily. The patch fixes a mis-compile, where previously
LCSSA construction during SCEV expand may replace uses in currently
unreachable SCEV/memory check blocks.
Fixes https://github.com/llvm/llvm-project/issues/162512
PR: https://github.com/llvm/llvm-project/pull/165505
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Split off from PR #163525, this standalone patch replaces
use of undef as incoming PHI values with zero, in order
to reduce the likelihood of contributors hitting the
`undef deprecator` warning in github.
When the legacy cost model scalarizes loads that are used as addresses
for other loads and stores, it looks to phi nodes, if they are direct
address operands of loads/stores. Match this behavior in
isUsedByLoadStoreAddress, to fix a divergence between legacy and
VPlan-based cost model.
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.
This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)
It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
Split off from PR #163525, this standalone patch replaces `ret * undef`
returns with `ret void` in order to reduce the likelihood of
contributors hitting the `undef deprecator` warning in github.
Consistently scalarize loads used as part of address computations across
all uses in the loop. This aligns the VPlan and legacy cost model and
fixes a divergence crash. It doesn't matter if the load and address
users are in different blocks, as long as they are in the same loop, the
scalar value can be used. This removes a number of insert/extracts.
VPBlendRecipes are introduced as part of if-conversion, potentially adding
a def-use chain from a load used in a compare to another load/store. In
the scalar IR, there is no connection via def-use chains, so the legacy
cost model won't consider the load used by memory operation.
Skipping blends brings the VPlan-based cost-computation in line with the
legacy cost model after https://github.com/llvm/llvm-project/pull/162157.