The current tests in iv-select-cmp.ll are not representative of clang
output of common real-world C programs, which are often written with i32
induction vars, as opposed to i64 induction vars. Hence, add five tests
corresponding to the following programs:
int test(int *a, int n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a) {
int rdx = 331;
for (int i = 0; i < 20000; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a, long n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a, unsigned n) {
int rdx = 331;
for (int i = 0; i < n; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
int test(int *a) {
int rdx = 331;
for (long i = INT_MIN - 1; i < UINT_MAX; i++) {
if (a[i] > 3)
rdx = i;
}
return rdx;
}
The first two can theoretically be vectorized without a runtime-check,
while the third and fourth cannot. The fifth cannot be vectorized, even
with a runtime-check.
This issue was found while reviewing D150851.
Differential Revision: https://reviews.llvm.org/D156124
Split off from D150398 to avoid builder-related diff changes there.
Using IRBuilder to create ICmps simplifies the result if both operands
are constants.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D158332
Suppose we have a nested loop like this:
void foo(int32_t *dst, int32_t *src, int m, int n) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
dst[(i * n) + j] += src[(i * n) + j];
}
}
}
We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then the cost of these checks becomes quite expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:
1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]
we can instead look at the expanded ranges:
1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]
which are outer-loop-invariant. As with many optimisations there
is a trade-off here, because there is a danger that using the
expanded ranges we may never enter the vectorised inner loop,
whereas with the smaller ranges we might enter at least once.
I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.
When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:
SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%
I didn't investigate all the differences to see if they are
genuine or noise, but I know the x264 improvement is real because
it has some hot nested loops with low trip counts where I can
see this hoisting is beneficial.
Tests have been added here:
Transforms/LoopVectorize/runtime-checks-hoist.ll
Differential Revision: https://reviews.llvm.org/D152366
The current version of the test doesn't use any of the loads, so they
can be removed together with the mask of the interleave group.
Use some loaded values and store them, to prevent the mask from being
optimized away.
Try to avoid some unprofitable predication on PPC. Recognize in the cost model that computing on i1 values will require extra mask or compare operation.
Differential Revision: https://reviews.llvm.org/D155876
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.
Reviewed By: dtemirbulatov
Differential Revision: https://reviews.llvm.org/D157628
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.
This change was previously reverted (e85fd3cbdd68) due to Asan failure with
use-after-free error. A testcase is added and the bug is fixed in this
version of the patch.
Differential Revision: https://reviews.llvm.org/D155520
Model wrap flags directly using VPRecipeWithIRFlags and clean up the
duplicated *NUW opcodes.
D157144 will build on this and also model FMFs for VPInstruction.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D157194
Use the printOperands for printing VPInstruction's operands to be more
in line with other recipes and ensure consistent printing after D15719.
Also removes some stray spaces in print output.
The original test has a unused load, which is removed. Also add a
variant with a store that cannot be removed, forcing the mask for the
block to always be generated.
This reverts commit 245ec675a4e41f7ec24dfc998720bffdc46a6c53.
Recommits eea9258648ce with a fix to only erase the instruction from the
first part if it is defined outside the loop. This fixes a
use-after-free error reported.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/bT62Wa
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Set phi inputs to poison whenever we find a dead edge (either
during initial worklist population or the main InstCombine run),
instead of only doing this for successors of dead blocks.
This means that the phi operand is set to poison even if for
critical edges without an intermediate block.
There are quite a few test changes, because the pattern is fairly
common in vectorizer output, for cases where we know the vectorized
loop will be entered.
This reverts commit 3e386b227886e2fb77b0c1e9182026c4e049f346.
Next to the original fold, this also implements an unnecessary and
inappropriate simplifyICmpWithDominatingAssume() based fold.
We check the loop trip count is known a power of 2 to determine
whether the tail loop can be eliminated in D146199.
However, the remainder loop of mask scalable loop can also be removed
If we know the mask is always going to be true for every vector iteration.
Depend on the assume of power-of-two vscale on D155350
proofs: https://alive2.llvm.org/ce/z/FkTMoy
Fix https://github.com/llvm/llvm-project/issues/63616.
Reviewed By: goldstein.w.n, nikic, david-arm, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154953
Make sure the full IR is checked for loop-vectorization-factors.ll and
to make sure nothing gets missed and add missing checks for type-shrinkage-insertelt.ll.
Also removes some undef ops from tests.
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.
This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).
The lower insert/extract cost will reduce the cost of insert, extracts,
shuffling and scalarization. The adjustments of ScalaizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.
We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.
Differential Revision: https://reviews.llvm.org/D155459
Split off min-max in-loop reduction tests into separate file and extend
them by adding tests with
* min & max intrinsics
* fmuladd with permuted operands
* min & max select tests with permuted operands.
Adds extra test coverage as suggested in D155845.
The most straightforward extension to D150851 would involve a loop with
decreasing induction variable, with a constant start value.
iv-select-cmp.ll only contains a negative test for the decreasing
induction variable case when the start value is variable, namely
not_vectorized_select_decreasing_induction_icmp. Hence, add a test for
the most straightforward extension to D150851, in preparation to
vectorize:
long rdx = 331;
for (long i = 19999; i >= 0; i--) {
if (a[i] > 3)
rdx = i;
}
return rdx;
Differential Revision: https://reviews.llvm.org/D156152
This is a complete fix for CompleteLoadGroups introduced in
D154309. We need to check for dependency between A and every member of
the load Group of B.
This patch also fixes another miscompile seen when we incorrectly sink stores
below a depending load (see testcase in
interleaved-accesses-sink-store-across-load.ll). This is fixed by
releasing store groups correctly.
Differential Revision: https://reviews.llvm.org/D155520
This reverts commit eea9258648ce73507f6f85c395de978af659d498.
That commit triggered crashes in the following testcase:
$ cat reduced.c
typedef struct {
int a[8]
} b;
typedef struct {
b *c;
short d
} e;
void f() {
int g;
char *h;
e *i = f;
short j = i->d;
int a = i->c->a[0];
for (;;)
for (; g < a; g++) {
*h = j * i->d >> 8;
h++;
}
}
$ clang -target aarch64-linux-gnu -w -c -O2 reduced.c
These tests were originally added in 0aff1798b5721d5f95d16f465b99d, where they
were measuring the cost of fadd and fmuladd reductions, which should be fairly
high cost. For some reason, due to the forced vector factors, the debug costs
of each instruction are printed twice by the vectorizer. Once as if the
instruction is a simple fadd/fmuladd, and later with the correct reduction
cost.
In d827865e9f778f5b27edb2afe003c2a the costs were updated to match the first
print statements, where they would be better to match the second to test the
cost of the reduction.
This patch returns them to testing the original reduction costs.
vrgather.vv across multiple vector registers (i.e. LMUL > 1) requires all to all data movement. This includes two conceptual sets of changes:
For permutes, we were modeling these as being linear in LMUL.
For reverse, we were modeling them as being fixed cost in LMUL.
Both were wrong, and have been adjusted to O(LMUL^2). Noticed via code inspection while looking at something else.
Its worth asking whether we should be lowering reverse to something other than a vrgather at high LMULs. That shuffle is quite expensive. (Future work)
Differential Revision: https://reviews.llvm.org/D152019
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D155428
Reorder VPlan transforms slightly so they are all grouped together,
after disabling Value -> VPValue lookup. In terms of codegen impact,
this should be NFC modulo a small number of instruction reorderings.
Preparation to split up tryToBuildVPlanWithVPRecipes in a follow-up.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D154640
Identified another miscompile while working on fixing interleaving's
current miscompile in D154309. This is different from testcases landed in D154309,
since it showcases an incorrect sinking of store (the former testcases
in that review and follow-up ones) showed incorrect hoisting of loads
across stores.
This patch is separated from D154953 to see what tests are affected by this
change alone according comment.
Depend on the related updating of LangRef on D155193.
Reviewed By: paulwalker-arm, nikic, david-arm
Differential Revision: https://reviews.llvm.org/D155350
Arm Performance Libraries contain math library which provides
vectorized versions of common math functions.
This patch allows to use it with clang and llvm via -fveclib=ArmPL or
-vector-library=ArmPL, so loops with such calls can be vectorized.
The executable needs to be linked with the amath library.
Arm Performance Libraries are available at:
https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries
Reviewed by: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D154508