This reverts commit 580454526b936f7a576ddbc9bb932cf9be376ec4.
The recommitted version contains an extra check to not peel if the
latch exit is controlled by a pointer induction.
Original message:
Remove the restriction that the loop must be known to execute at least 2
iterations when peeling the last iteration. If we cannot prove at least
2 iterations are executed, a check and branch to skip the peeled loop is
inserted.
PR: https://github.com/llvm/llvm-project/pull/140792
This reverts commit 24b97756decb7bf0e26dcf0e30a7a9aaf27f417c.
Also reverts ac9a466e39bf97ffeab127982aa7c405cb257551.
Building CMake triggers a crash with the patch, revert while I
investigate.
Make sure the new phis are inserted before any non-phi instructions.
This fixes a crash when dbg_value instructions are present in the
original exit block.
Remove the restriction that the loop must be known to execute at least 2
iterations when peeling the last iteration. If we cannot prove at least
2 iterations are executed, a check and branch to skip the peeled loop is
inserted.
PR: https://github.com/llvm/llvm-project/pull/140792
Add more test coverage for peeling the last iteration with variable trip
counts. Separate test cases for constant and variable trip counts in
different files.
Update the check in canPeelLastIteration to make sure the exiting
condition has a single use. When peeling the last iteration, we adjust
the condition in the loop body to be true one iteration early, which
would be incorrect for other users.
Fixes https://github.com/llvm/llvm-project/issues/140444.
Account for constant values when updating exit values after peeling an
iteration from the end. This can happen if the inner loop gets unrolled
and simplified.
Fixes https://github.com/llvm/llvm-project/issues/140442.
This reverts the revert commit bf92b127d2637948f53d11a187e865aa10e2e74c.
This adds missing initialization of PeelLast in gatherPeelingPreferences.
Original message:
Generalize countToEliminateCompares to also consider peeling off the
last iteration if it eliminates a compare.
At the moment, codegen for peeling off the last iteration is quite
restrictive and callers have to make sure that the exit condition can be
adjusted when peeling and that the loop executes at least 2 iterations.
Both will be relaxed in follow-ups.
PR: https://github.com/llvm/llvm-project/pull/139551
Generalize countToEliminateCompares to also consider peeling off the
last iteration if it eliminates a compare.
At the moment, codegen for peeling off the last iteration is quite
restrictive and callers have to make sure that the exit condition can be
adjusted when peeling and that the loop executes at least 2 iterations.
Both will be relaxed in follow-ups.
PR: https://github.com/llvm/llvm-project/pull/139551
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), andt
we regress in undefined cases as we don't fold out asserts on undef.
While sinking instructions (that are loop invariant) from preheader to
the exit block, we are skipping instructions due to decrementing
instruction iterator twice.
Considering that "or disjoint" is the canonical for certain add
operations, then I think we want to support such "add like" operations
when doing ADD+GEP->GEP+GEP rewrites to make things more consistent.
Problem was found when improving ValueTracking, which turned an ADD into
OR, and then suddenly optimizations got worse due to these rewrites no
longer triggering.
It can be highly beneficial to unroll small, two-block search loops
that look for a value in an array. An example of this would be
something that uses std::find to find a value in libc++. Older
versions of std::find in the libstdc++ headers are manually unrolled
in the source code, but this might change in newer releases where
the compiler is expected to either vectorise or unroll itself.
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, non-intrinsic format is the default and has been on for more
than a year, there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
Extend unrolling preferences to allow more aggressive unrolling of
search loops with 2 exits, building on the TTI hook added in
ad9da92cf6.
In combination with
eac23a5b97
this enables unrolling loops like
std::find, which can improve performance significantly (+15% end-to-end
on a workload that makes heavy use of std::find). It increase the total
number of unrolled loops by ~2.5% across a very large corpus of
workloads.
For SPEC2017, +1.6% more loops are unrolled and the following workloads
increase in size (`__text`):
workload base patch
500.perlbench_r 1682884.00 1694104.00 0.7%
523.xalancbmk_r 3001716.00 3003832.00 0.1%
PR: https://github.com/llvm/llvm-project/pull/124751
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
For processors with low overhead branching (LOB), runtime unrolling the
innermost loop is often detrimental to performance. In these cases the
loop remainder gets unrolled into a series of compare-and-jump blocks,
which in deeply nested loops get executed multiple times, negating the
benefits of LOB.
This is particularly noticable when the loop trip count of the innermost
loop varies within the outer loop, such as in the case of triangular
matrix decompositions.
In these cases we will prefer to not unroll the innermost loop, with the
intention for it to be executed as a low overhead loop.
Add initial heuristics to selectively enable runtime unrolling for loops
where doing so is expected to be highly beneficial on Apple Silicon
CPUs.
To start with, we try to runtime-unroll small, single block loops, if
they have load/store dependencies, to expose more parallel memory
access streams [1] and to improve instruction delivery [2].
We also explicitly avoid runtime-unrolling for loop structures that may
limit the expected gains from runtime unrolling. Such loops include
loops with complex control flow (aren't innermost loops, have multiple
exits, have a large number of blocks), trip count expansion is
expensive and are expected to execute a small number of iterations.
Note that the heuristics here may be overly conservative and we err on
the side of avoiding runtime unrolling rather than unroll excessively.
They are all subject to further refinement.
Across a large set of workloads, this increase the total number of
unrolled loops by 2.9%.
[1] 4.6.10 in Apple Silicon CPU Optimization Guide
[2] 4.4.4 in Apple Silicon CPU Optimization Guide
Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI
changes.
PR: https://github.com/llvm/llvm-project/pull/118317
Add test for existing loop unroll behaviour.
Current behaviour is the single loop with fmul gets runtime unrolled by
count of 4, with the loop remainder unrolled as the 3 for.body9.us.prol
sections. This is quite a lot of compare and branch, negating the
benefits of the low overhead loop mechanism.
This patch fixes a simple error where as part of loop unrolling we
optimize conditional loop-exiting branches into unconditional branches
when we know that they will or won't exit the loop, but does not
propagate the source location of the original branch to the new one.
Found using https://github.com/llvm/llvm-project/pull/107279.
In the test case, the BECount of the second loop uses %load,
but we only have an LCSSA phi node for %add, so that is what
gets invalidated. Use the forgetLcssaPhiWithNewPredecessor()
API instead, which will invalidate the roots of the expression
instead.
Fixes https://github.com/llvm/llvm-project/issues/109333.
The current isHighCostExpansion cost model for addrecs computes the cost
for some kind of polynomial expansion that does not appear to have any
relation to addrec expansion whatsoever.
A literal expansion of an affine addrec is a phi and add (plus the
expansion of start and step). For a non-affine addrec, we get another
phi+add for each additional addrec nested in the step recurrence.
This partially `fixes` https://github.com/llvm/llvm-project/issues/53205
(the runtime unroll test case in this PR).
Use ConstantFoldLoadFromConst() instead of a partial re-implementation.
This makes the code slightly more generic by not depending on the
exact structure of the constant.
It is now translated to `<1 x i64>`, which allows the removal of a bunch
of special casing.
This _incompatibly_ changes the ABI of any LLVM IR function with
`x86_mmx` arguments or returns: instead of passing in mmx registers,
they will now be passed via integer registers. However, the real-world
incompatibility caused by this is expected to be minimal, because Clang
never uses the x86_mmx type -- it lowers `__m64` to either `<1 x i64>`
or `double`, depending on ABI.
This change does _not_ eliminate the SelectionDAG `MVT::x86mmx` type.
That type simply no longer corresponds to an IR type, and is used only
by MMX intrinsics and inline-asm operands.
Because SelectionDAGBuilder only knows how to generate the
operands/results of intrinsics based on the IR type, it thus now
generates the intrinsics with the type MVT::v1i64, instead of
MVT::x86mmx. We need to fix this before the DAG LegalizeTypes, and thus
have the X86 backend fix them up in DAGCombine. (This may be a
short-lived hack, if all the MMX intrinsics can be removed in upcoming
changes.)
Works towards issue #98272.
The use-def walk in forgetValue() was skipping instructions with
non-SCEVable types. However, SCEV may look past with.overflow
intrinsics returning aggregates.
Fixes#97586.