This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
Reapply 8a115b6934a90441 with an update to tests handling remarks.
The patch now directly emits a clear remark when we bail out
due to the memory check threshold.
Original message:
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadata.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.
In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.
However for some VPlans we may have a blend with only the first lane
used:
BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
vp<%5> = vector-pointer ir<%gep>
And in the legacy cost model we cost a blend as a phi if it's uniform:
  // If we know that this instruction will remain uniform, check the cost of
  // the scalar version.
  if (isUniformAfterVectorization(I, VF))
    VF = ElementCount::getFixed(1);
So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.
A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that too.
This patch optimizes vector scatters that have a uniform (single-scalar)
address by replacing them with "extract-last-lane + scalar store" when
the scatter is unmasked.
Notes:
- The legacy cost model can scalarize a store if both the address and
the value are uniform. In VPlan we materialize the stored value via
ExtractLastLane, so only the address must be uniform.
- Some loops will no longer be vectorized, since no vector instructions
would be generated.
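The equivalence behind this transform can be modeled in plain C++. This is a minimal sketch, not the VPlan implementation: `scatter` and `storeLastLane` are hypothetical helpers, the fixed `VF` of 4 is an assumption, and the scatter is assumed to write its lanes in order from lowest to highest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // Assumed fixed vector width for the sketch.

// Reference semantics of an unmasked scatter: lanes are stored in order,
// so with overlapping addresses the highest lane wins.
void scatter(const std::array<int *, VF> &Ptrs,
             const std::array<int, VF> &Vals) {
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    *Ptrs[Lane] = Vals[Lane];
}

// Equivalent form when every lane targets the same address: extract the
// last lane and store it once.
void storeLastLane(int *Addr, const std::array<int, VF> &Vals) {
  *Addr = Vals[VF - 1];
}
```

Note that the stored values do not have to be uniform for the two forms to agree, which is why only the address is required to be uniform.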
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.
Use SCEV to simplify all live-ins during VPlan0 construction. This
enables us to remove special SCEV queries when constructing
VPWidenRecipes and improves results in some cases.
This leads to simplifications in a number of cases in real-world
applications (~250 files changed across LLVM, SPEC, ffmpeg)
PR: https://github.com/llvm/llvm-project/pull/155304
Add test coverage for remark when runtime checks are not profitable with
threshold provided.
Also make sure that the X86 remark tests actually pass an X86 triple,
which is needed for the threshold remark.
Also clean up the tests a bit.
Always include the cost of the middle block in
isOutsideLoopWorkProfitable. This addresses the TODO from
https://github.com/llvm/llvm-project/pull/168949 and removes the
temporary restriction.
isOutsideLoopWorkProfitable already scales the cost outside loops
according to the expected trip counts.
In practice this increases the minimum iteration threshold in a few
cases. On a large IR corpus based on C/C++ workloads, ~50 out of 179450
vector loops have their thresholds increased slightly.
PR: https://github.com/llvm/llvm-project/pull/171102
A VPBlendRecipe always emits selects, even when the VF is scalar.
However the legacy cost model always costs all scalar non-header phis as
a phi, and the VPlan cost model has to account for this.
This can cause the cost to be a little off, for example not including
the cost of the select in @smax_call_uniform leading to unprofitable
vectorization.
This patch removes that special case from the VPlan cost model and
instead checks for it in planContainsAdditionalSimplifications.
I considered trying to make the legacy cost model more accurate but I'm
not sure if it's possible. We need information as to whether or not the
scalar VF we are costing is the original loop in which case it's
actually a phi, or if it's a VPBlendRecipe that emits a select,
potentially from a VF=1, UF>=1 VPlan.
This reverts commit 8a115b6934a90441d77ea54af73e7aaaa1394b38.
This broke premerge. https://lab.llvm.org/staging/#/builders/192/builds/13326
/home/gha/llvm-project/clang/test/Frontend/optimization-remark-options.c:10:11: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.
Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadata.
Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
ExtractLastLane is a no-op for scalar VFs. Update simplifyRecipe to
remove them. This also requires adjusting the code in VPlanUnroll.cpp to
split off handling of ExtractLastLane/ExtractPenultimateElement for
scalar VFs, which now needs to match ExtractLastPart.
PR: https://github.com/llvm/llvm-project/pull/171145
When the probability of a block is extremely low, HeaderFreq / BBFreq
may be larger than 32 bits. Previously this got truncated to uint32_t
which could cause division by zero exceptions on x86. Widen the return
type to uint64_t which should fit the entire range of BlockFrequency
values.
It's also worth noting that a frequency can never be zero according to
BlockFrequency.h, so we shouldn't need to worry about divide by zero in
getPredBlockCostDivisor itself.
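The truncation hazard can be reproduced outside LLVM. A minimal sketch, assuming hypothetical helpers `divisorTruncated` and `divisorWide` that stand in for the old and new return types (neither is LLVM API):

```cpp
#include <cassert>
#include <cstdint>

// Old behaviour (modeled): the 64-bit quotient is narrowed to 32 bits, so
// a quotient of exactly 2^32 becomes 0 and a later division by it traps.
// Assumes BBFreq != 0, as the commit message states for BlockFrequency.
uint32_t divisorTruncated(uint64_t HeaderFreq, uint64_t BBFreq) {
  return static_cast<uint32_t>(HeaderFreq / BBFreq);
}

// New behaviour (modeled): the quotient keeps its full 64-bit range.
uint64_t divisorWide(uint64_t HeaderFreq, uint64_t BBFreq) {
  return HeaderFreq / BBFreq;
}
```

With `HeaderFreq = 1ULL << 32` and `BBFreq = 1`, the truncated variant yields 0 while the wide variant preserves the quotient.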
These quantities should never unsigned-wrap. This matches the behavior
if only VFxUF is used (and not VF): when computing both VF and VFxUF,
nuw should hold for each step separately.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.
The loop looks something like:
```c
for (int i = 0; i < n; i++) {
if (x0[i] == a)
if (x1[i] == b)
if (x2[i] == c)
// do stuff...
}
```
Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.
This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.
We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the cost divisor as `HeaderFrequency / BlockFrequency`.
Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
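The arithmetic can be illustrated with a small model. This is a sketch under stated assumptions, not the LLVM code: `blockFreq`, `predBlockCostDivisor` and `discountedCost` are hypothetical helpers, and each nested condition is assumed to be taken half the time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: each nested `if` is taken half the time, so a block
// nested Depth conditions deep runs HeaderFreq / 2^Depth times.
uint64_t blockFreq(uint64_t HeaderFreq, unsigned Depth) {
  return HeaderFreq >> Depth;
}

// Ratio used to discount a predicated block's cost. Assumes BBFreq != 0.
uint64_t predBlockCostDivisor(uint64_t HeaderFreq, uint64_t BBFreq) {
  return HeaderFreq / BBFreq;
}

// Expected per-iteration scalar cost of a predicated block.
uint64_t discountedCost(uint64_t BlockCost, uint64_t Divisor) {
  return BlockCost / Divisor;
}
```

For a block three conditions deep, the divisor is 8 (12.5% execution) rather than the fixed 2 of the 50% heuristic, so a block costing 40 contributes 5 to the scalar loop cost instead of 20, making the scalar loop look cheaper relative to the if-converted vector loop.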
Whilst there are a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:
- On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there's 0.9%/0% fewer geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.
Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
Replace ExtractLastElement and ExtractLastLanePerPart with more generic
and specific ExtractLastLane and ExtractLastPart, which model distinct
parts of extracting across parts and lanes. ExtractLastElement ==
ExtractLastLane(ExtractLastPart) and ExtractLastLanePerPart ==
ExtractLastLane, the latter clarifying the name of the opcode. A new
m_ExtractLastElement matcher is provided for convenience.
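The relationship between the opcodes can be modeled with plain containers. A minimal sketch, assuming an unrolled value is represented as a list of UF parts with VF lanes each; the helper functions below only mimic the meaning of the opcodes and are not VPlan API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: one part holds the VF lanes of a single unrolled copy.
using Part = std::vector<int>;

// Select the last unrolled part.
Part extractLastPart(const std::vector<Part> &Parts) { return Parts.back(); }

// Select the last lane of one part. For a scalar VF (one lane) this
// returns the only element, i.e. it is a no-op.
int extractLastLane(const Part &P) { return P.back(); }

// The composition that the old ExtractLastElement opcode expressed.
int extractLastElement(const std::vector<Part> &Parts) {
  return extractLastLane(extractLastPart(Parts));
}
```

For example, with two parts `{0, 1, 2, 3}` and `{4, 5, 6, 7}`, `extractLastElement` yields 7, while `extractLastLane` applied to the first part alone yields 3.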
The patch should be NFC modulo printing changes.
PR: https://github.com/llvm/llvm-project/pull/164124
In some cases, the lowering of a select depends on the predicate. If the
condition of a select is a compare instruction, thread the predicate
through to the TTI hook.
PR: https://github.com/llvm/llvm-project/pull/170278
The VPlan-based cost model assigns the forced cost once for a whole
VPInterleaveRecipe. Update the legacy cost model to match this behavior.
This fixes a cost-model divergence, and assigns the cost in a way that
matches the generated code more accurately.
PR: https://github.com/llvm/llvm-project/pull/168270
While looking into fixing #158499, I found some other cases where the
messages emitted could be improved. This PR improves both the messages
printed to the debug output and the missed-optimization messages in
cases where:
- loop vectorization is explicitly disabled
- loop vectorization is implicitly disabled by disabling all loop
transformations
- loop vectorization is set to happen only where explicitly enabled
A branch that should currently be unreachable is also added. If the
related logic ever breaks (e.g. due to changes to getForce() or the
ForceKind enum) this should alert devs and users. New test cases are
also added to verify that the correct messages (and only those) are
emitted.
---------
Co-authored-by: GYT <tiborgyri@gmail.com>
Co-authored-by: Florian Hahn <flo@fhahn.com>
Extend the logic to hoist predicated loads
(https://github.com/llvm/llvm-project/pull/168373) to sink predicated
stores with complementary masks in a similar fashion.
The patch refactors some of the existing logic for legality checks to be
shared between hoisting and sinking, and adds a new sinking transform on
top.
With respect to the legality checks, for sinking stores the code also
checks for stores that may alias, not only loads.
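At the source level, the effect of sinking two stores with complementary masks can be sketched as follows. This is an illustration only: the function names are hypothetical, and the real transform operates on VPlan recipes after its legality checks rather than on C++ source.

```cpp
#include <cassert>
#include <cstddef>

// Before: two predicated stores to the same address under complementary
// conditions.
void storesPredicated(int *P, const bool *C, const int *A, const int *B,
                      std::size_t N) {
  for (std::size_t I = 0; I < N; ++I) {
    if (C[I])
      P[I] = A[I];
    else
      P[I] = B[I];
  }
}

// After: exactly one of the two stores always executes, so they can be
// replaced by a single unconditional store of a selected value.
void storeSelected(int *P, const bool *C, const int *A, const int *B,
                   std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    P[I] = C[I] ? A[I] : B[I];
}
```

Both functions leave `P` in the same state for any inputs, provided `P` does not overlap `C`, `A` or `B`.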
PR: https://github.com/llvm/llvm-project/pull/168771
For scalable vectors, VPScalarIVStepsRecipe cannot create all scalar
step values. At the moment, it creates a vector, in addition to the
first lane. The only supported case for this is when only the last lane
is used. A recipe should not set both scalar and vector values.
Instead, we can simply use a vector induction. It would also be possible
to preserve the current vector code-gen, by creating VPInstructions
based on the first lane of VPScalarIVStepsRecipe, but using a vector
induction seems simpler.
PR: https://github.com/llvm/llvm-project/pull/169796
While attempting to remove the use of undef from more loop vectoriser
tests I discovered a bug where this assert was firing:
```
llvm::Constant* llvm::Constant::getSplatValue(bool) const: Assertion `this->getType()->isVectorTy() && "Only valid for vectors!"' failed.
...
#8 0x0000aaaab9e2fba4 llvm::Constant::getSplatValue
#9 0x0000aaaab9dfb844 llvm::ConstantFoldBinaryInstruction
```
This seems to be happening because we are incorrectly generating
WidePtrAdd recipes for scalar VFs. The PR fixes this by checking whether
a plan has a scalar VF only in legalizeAndOptimizeInductions.
This PR also removes the use of undef from the test `both` in
Transforms/LoopVectorize/iv_outside_user.ll, which is what started
triggering the assert.
Fixes #169334
The VPlan-based cost model uses vp_gather/vp_scatter for gather/scatter
costs, which is different to the legacy cost model and cannot be matched
there. Don't verify the costs match for plans containing gather/scatters
with EVL.
Fixes https://github.com/llvm/llvm-project/issues/169948.
Turn assertion added in 99addbf73 [0] into an early exit.
There are cases where the operand may not be a
VPWidenIntOrFpInductionRecipe, e.g. if the IV increment is selected,
as in the test cases.
[0] https://github.com/llvm/llvm-project/pull/141431
Add support for vectorizing loops that select the index of the minimum
or maximum element. The patch implements vectorizing those patterns by
combining Min/Max and FindFirstIV reductions.
It extends matching Min/Max reductions to allow in-loop users that are
FindLastIV reductions. It records a flag indicating that the Min/Max
reduction is used by another reduction. The extra user is then checked as
part of the new `handleMultiUseReductions` VPlan transformation.
It processes any reduction that has other reduction users. The reduction
using the min/max reduction currently must be a FindLastIV reduction,
which needs adjusting to compute the correct result:
1. We need to find the last IV for which the condition based on the
min/max reduction is true,
2. Compare the partial min/max reduction result to its final value and,
3. Select the lanes of the partial FindLastIV reductions which
correspond to the lanes matching the min/max reduction result.
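The steps above can be sketched with a lane-wise simulation. This is a rough model under stated assumptions, not the transform itself: it uses a fixed `VF` of 4, a `<=` comparison so that the last index of the minimum is selected, and hypothetical function names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // Assumed fixed vector width for the sketch.

// Scalar reference: index of the last occurrence of the minimum.
std::size_t lastArgMinScalar(const std::vector<int> &A) {
  int Min = INT_MAX;
  std::size_t Idx = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    if (A[I] <= Min) {
      Min = A[I];
      Idx = I;
    }
  return Idx;
}

// Lane-wise model: each lane keeps a partial minimum and the last index
// at which that partial minimum was seen.
std::size_t lastArgMinLanes(const std::vector<int> &A) {
  std::array<int, VF> Min;
  Min.fill(INT_MAX);
  std::array<std::size_t, VF> Idx{};
  for (std::size_t I = 0; I < A.size(); ++I) {
    std::size_t L = I % VF;
    if (A[I] <= Min[L]) {
      Min[L] = A[I];
      Idx[L] = I;
    }
  }
  // Compare each partial minimum to the final minimum and only keep the
  // indices of matching lanes; reduce those with max to get the last one.
  int Final = *std::min_element(Min.begin(), Min.end());
  std::size_t Result = 0;
  for (std::size_t L = 0; L < VF; ++L)
    if (Min[L] == Final)
      Result = std::max(Result, Idx[L]);
  return Result;
}
```

Lanes whose partial minimum differs from the final minimum hold indices of elements that are not the overall minimum, which is why they must be masked out before the index reduction.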
Depends on https://github.com/llvm/llvm-project/pull/140451
PR: https://github.com/llvm/llvm-project/pull/141431