For the attached test:
Before the loop-idiom pass, we have a store into the inner loop which is
considered simple and one that does not have any side effects on the
loop. Post loop-idiom pass, we get a memset into the outer loop that is
considered to introduce side effects on the loop. This changes the
backedge taken count before and after the pass and hence, the crash with
verify-scev.
We try to consider non-volatile memory intrinsics as not having
side-effect for forward progress to fix the issue.
Fixes#149377
In order to keep the change as incremental as possible, this only
introduces the memset.pattern intrinsic in cases where memset_pattern16
would have been used. Future patches can enable it on targets that don't
have the intrinsic, and select it in cases where the libcall isn't
directly usable. As the memset.pattern intrinsic takes the number of
times to store the pattern as an argument unlike memset_pattern16 which
takes the number of bytes to write, we no longer try to form an i128
pattern.
Special care is taken for cases where multiple stores in the same loop
iteration were combined to form a single pattern. For such cases, we
inherit the limitation that loops such as the following are supported:
```
for (unsigned i = 0; i < 2 * n; i += 2) {
f[i] = 2;
f[i+1] = 2;
}
```
But the following doesn't result in a memset.pattern (even though it
could be, by forming an appropriate pattern):
```
for (unsigned i = 0; i < 2 * n; i += 2) {
f[i] = 2;
f[i+1] = 3;
}
```
Addressing this existing deficiency is left for a follow-up due to a
desire not to change too much at once (i.e. to target equivalence to the
current codegen).
A command line option is introduced to force the selection of the
intrinsic even in cases it wouldn't be (i.e. in cases where the libcall
wouldn't have been selected). This is intended as a transitionary option
for testing and experimentation, to be removed at a later point.
The only platforms this should impact are those that have the memset_pattern16 libcall (Apple platforms). Testing performed to check for no unexpected codegen changes is described here https://github.com/llvm/llvm-project/pull/126736#issuecomment-3005097468
This patch adds `nocallback` attributes for string/math libcalls. It
allows FuncAttributor to infer `norecurse` more precisely and encourages
more aggressive global optimization.
Fixes the case where subsequent passes were unable to find and delete
the invariant loop left over by the strlen idiom conversion. Since
`loop-deletion` only operate on computable loops, we can update the loop
condition to something more easily picked up by `loop-deletion`
As pointed out in https://github.com/llvm/llvm-project/issues/134736
Reland https://github.com/llvm/llvm-project/pull/108985
Extend `LoopIdiomRecognize` to find and replace loops of the form
```c
base = str;
while (*str)
++str;
```
and transforming the `strlen` loop idiom into the appropriate `strlen`
and `wcslen` library call which will give a small performance boost if
replaced.
```c
str = base + strlen(base)
len = str - base
```
This reverts commit 7c52886700a5a70d873400ec022a99d7dce8b03b.
Reverting this as I have to revert another preceding commit,
ac9049df7e62e2ca4dc5d103593b51639b5715e3.
This PR continues the effort made in
https://discourse.llvm.org/t/rfc-strlen-loop-idiom-recognition-folding/55848
and https://reviews.llvm.org/D83392 and https://reviews.llvm.org/D88460
to extend `LoopIdiomRecognize` to find and replace loops of the form
```c
base = str;
while (*str)
++str;
```
and transforming the `strlen` loop idiom into the appropriate `strlen`
and `wcslen` library call which will give a small performance boost if
replaced.
```c
str = base + strlen(base)
len = str - base
```
---------
Co-authored-by: Michael Kruse <github@meinersbur.de>
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, non-intrinsic format is the default and has been on for more
than a year, there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
This patch adds a new loop to LoopIdiomVectorizePass, enabling it to
recognise and vectorise loops such as:
```cpp
template<class InputIt, class ForwardIt>
InputIt find_first_of(InputIt first, InputIt last,
ForwardIt s_first, ForwardIt s_last)
{
for (; first != last; ++first)
for (ForwardIt it = s_first; it != s_last; ++it)
if (*first == *it)
return first;
return last;
}
```
These loops match the C++ standard library function `std::find_first_of`.
This reduces the diff for upcoming changes. In some cases there were
already CHECK lines for the globals, but re-running update_test_check.py
deletes them without --check-globals being added. For
memset-pattern-tbaa.ll, the globals weren't checked but should have
been.
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
This commit makes the `generic` target to support FP and LSX, as
discussed in #110211. Thereby, it allows 128-bit vector to be enabled by
default in the loongarch64 backend.
When expanding SCEV adds to geps, transfer the nuw flag to the resulting
gep. (Note that this doesn't apply to IV increment GEPs, which go
through a different code path.)
This adds the TTI predicate for conveying the availability of fast
`popcnt`, which subsequently allows passes like `LoopIdiomRecognize` to
collapse known popcount patterns. Since SPIR-V natively exposes
`OpBitcount`, it seems preferable to compress the resulting code, and
retain the information, even if a concrete target might have to expand
back into a loop structure.
The original patch failed to handle the case where the loopback
condition compared against a constant exceeding 64 bit unsigned range -
which caused a buildbot failure.
This PR fixes this and relands the original PR #95002.
The current loop idiom code for recognising and inserting a CTLZ
intrinsic does not support loops where the loopback control is based on
an unsigned less-than condition. This patch adds support for recognising
these loops and inserting a CTLZ intrinsic.
Fixes the missed optimization cases in #51064.
---------
Co-authored-by: David Sherwood <david.sherwood@arm.com>
The current loop idiom code for recognising and inserting a CTLZ
intrinsic does not support loops where the loopback control is based on
an unsigned less-than condition. This patch adds support for recognising
these loops and inserting a CTLZ intrinsic.
Fixes the missed optimization cases in #51064
---------
Co-authored-by: David Sherwood <david.sherwood@arm.com>
This patch makes the final major change of the RemoveDIs project, changing the
default IR output from debug intrinsics to debug records. This is expected to
break a large number of tests: every single one that tests for uses or
declarations of debug intrinsics and does not explicitly disable writing
records.
If this patch has broken your downstream tests (or upstream tests on a
configuration I wasn't able to run):
1. If you need to immediately unblock a build, pass
`--write-experimental-debuginfo=false` to LLVM's option processing for all
failing tests (remember to use `-mllvm` for clang/flang to forward arguments to
LLVM).
2. For most test failures, the changes are trivial and mechanical, enough that
they can be done by script; see the migration guide for a guide on how to do
this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates
3. If any tests fail for reasons other than FileCheck check lines that need
updating, such as assertion failures, that is most likely a real bug with this
patch and should be reported as such.
For more information, see the recent PSA:
https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578
To facilitate sharing LoopIdiomTransform between AArch64 and RISC-V,
this first patch moves AArch64LoopIdiomTransform from lib/Target/AArch64
to lib/Transforms/Vectorize and renames it to LoopIdiomVectorize. The
following patch (#94082) will teach LoopIdiomVectorize how to generate VP
intrinsics (in addition to the current masked vector style) in favor of
RVV.
PR #83567 ports `SelectionDAGISel` to the new pass manager, then each
backend should provide `<Target>DagToDagISel()` in new pass manager
style. Then each target should provide `<Target>PassRegistry.def` to
register backend passes in `registerPassBuilderCallbacks` to reduce
duplicate code.
This PR adds `AArch64PassRegistry.def` to AArch64 backend and
boilerplate code in `registerPassBuilderCallbacks`.
SCEVExpanderCleaner will currently remove instructions created by
SCEVExpander, but not restore poison generating flags that it may have
dropped. As such, running LIR can currently spuriously drop flags
without performing any transforms.
Fix this by keeping track of original instruction flags in SCEVExpander.
Fixes https://github.com/llvm/llvm-project/issues/82337.
Similar to the non-ptr case, directly create the getelementptr
instruction. Going through expandAddToGEP() no longer makes sense
with opaque pointers, where generating the necessary instruction
is trivial.
This avoids recursive expansion of (the SCEV of) StepV while the
IR is in an inconsistent state, in particular with an incomplete
IV phi node, which utilities may not be prepared to deal with.
Fixes https://github.com/llvm/llvm-project/issues/80954.
I found another case where in the end block we could have a PHI that we
deal with incorrectly. The two incoming values are unique - one of them
is
the induction variable and another one is a value defined outside the
loop, e.g.
%final_val = phi i32 [ %inc, %while.body ], [ %d, %while.cond ]
We won't correctly select between the two values in the new end block
that
we create and so we will get the wrong result.
We have added a new pass that looks for loops such as the following:
```
while (i != max_len)
if (a[i] != b[i])
break;
... use index i ...
```
Although similar to a memcmp, this is slightly different because instead
of returning the difference between the values of the first non-matching
pair of bytes, it returns the index of the first mismatch. As such, we
are not able to lower this to a memcmp call.
The new pass can now spot such idioms and transform them into a
specialised predicated loop that gives a significant performance
improvement for AArch64. It is intended as a stop-gap solution until
this can be handled by the vectoriser, which doesn't currently deal with
early exits.
This specialised loop makes use of a generic intrinsic that counts the
trailing zero elements in a predicate vector. This was added in
https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp
instructions.
Although we have added this pass only for AArch64, it was written in a
generic way so that in theory it could be used by other targets.
Currently the pass requires scalable vector support and needs to know
the minimum page size for the target, however it's possible to make it
work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts
intrinsic used by the pass has generic lowering, but can be made
efficient for targets with instructions similar to SVE's brkb, cntp and
incp.
Original version of patch was posted on Phabricator:
https://reviews.llvm.org/D158291
Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David
Sherwood (@david-arm)
See the original discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.