2005 Commits

Author SHA1 Message Date
David Sherwood
af2ed59f2b [NFC][LoopVectorize] Add zext/sext cost tests when there is type shrinkage
Differential Revision: https://reviews.llvm.org/D147151
2023-04-03 13:12:11 +00:00
Florian Hahn
0d61ffd350
[Loads] Support SCEVAddExpr as start for pointer AddRec.
Extend handling to support `%base + offset` as start for AddRecs in
isDereferenceableAndAlignedInLoop. This is done by adjusting AccessSize
by the offset and effectively checking if the full object starting from
%base to %base + offset + access-size is dereferenceable.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D147260
2023-04-02 12:33:44 +01:00
Florian Hahn
fdaebbeff4
[LV] Improve test added in 74dee4791a2.
Adjust test so it triggers a case missed in the original version
of D147260.
2023-03-31 21:50:39 +01:00
Florian Hahn
74dee4791a
[LV] Add test with predicated load where EltSize % Align != 0. 2023-03-31 21:15:24 +01:00
Philip Reames
78ae870f11 {tests] Rerun autogen to reduce a diff [nfc] 2023-03-31 12:47:08 -07:00
Philip Reames
a512ce5e12 [LV] Add tests for non-constant stride pointer inductions
Reduced from the case which triggered the revert of 498aa534f472, and then generalized to cover both expansion paths.
2023-03-31 09:10:59 -07:00
David Green
965a090f02 Revert "[IVDescriptors] Add pointer InductionDescriptors with non-constant strides"
Multiple errors have being reported on
https://reviews.llvm.org/rG498aa534f472d28db893aa9a8627d0b46e17f312

Reverting until the correctness issues can be resolved.

We are also seeing a lot of performance differences from the patch.  Some are
looking good, but some are looking pretty bad.
2023-03-31 11:08:50 +01:00
Florian Hahn
b060ca7042
[LV] Regenerate check lines for test to reduce diff in follow-up patch. 2023-03-30 20:17:12 +01:00
Philip Reames
498aa534f4 [IVDescriptors] Add pointer InductionDescriptors with non-constant strides
This matches the handling for integer IVs.  I left the non-opaque cases alone, mostly because they're largely irrelevant today.

This doesn't actually make much difference in vectorization right now as we immediately fail on aliasing checks (which also bail on non-constant strides).  Slightly suprisingly, it's the case which *do* need runtime checks which work after this patch as they don't use the same dependency analysis path.

This will also enable non-constant stride pointer recurrences for other consumers.  I've auditted said code, and don't see any obvious issues.
2023-03-30 11:56:00 -07:00
Philip Reames
1c5bb25d62 [RISCV][LV] Add test coverage for strided access patterns [nfc] 2023-03-30 10:02:27 -07:00
Florian Hahn
4173ed1382
[LV] Add test cases for global struct dereferencability.
Currently LLVM fails to determine that conditional loads in
@accesses_to_struct_dereferenceable are dereferenceable unconditionally.
2023-03-29 17:47:41 +01:00
Paul Osmialowski
6b6f312cce [TLI][AArch64] Extend SLEEF vectorized functions mapping with VLA functions
This commit extends D134719 "[AArch64] Enable libm vectorized
functions via SLEEF" with the mappings for the scalable functions.

It also introduces all the necessary changes needed to support masked
interfaces.

Reviewed By: danielkiss, sdesmalen

Differential Revision: https://reviews.llvm.org/D146839
2023-03-29 13:07:09 +01:00
Paul Osmialowski
f8f1909d36 Revert "[TLI][AArch64] Extend SLEEF vectorized functions mapping with VLA functions"
Reverting it so I could land it with Arcanist.

This reverts commit 59dcf927ee43e995374907b6846b657f68d7ea49.
2023-03-29 12:54:22 +01:00
Paul Osmialowski
59dcf927ee [TLI][AArch64] Extend SLEEF vectorized functions mapping with VLA functions
This commit extends D134719 "[AArch64] Enable libm vectorized
functions via SLEEF" with the mappings for the scalable functions.

It also introduces all the necessary changes needed to support masked
interfaces.

Signed-off-by: Paul Osmialowski <pawel.osmialowski@arm.com>
2023-03-29 11:07:35 +01:00
Graham Hunter
fba2a7c695 [LV][AArch64] Precommit interleaved access tests
Precommit for D145163
2023-03-29 10:26:14 +01:00
Philip Reames
64f69e453e [RISCV] Cost model for general case of single vector permute
The cost model was not accounting for the fact that we can generate vrgather + an index expression.

Two cases to call out.
1) I did not model the difference between vrgather and vrgatherei16. The result is the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 (which is the limit of what this costs) has some room for improvement.

Differential Revision: https://reviews.llvm.org/D147000
2023-03-28 07:34:11 -07:00
David Sherwood
636efd2e35 [SVE][LoopVectorize] Add option to disable tail-folding for reverse loops
If we use tail-folding for reverse loops that contain loads
and stores then we will need to reverse the loop predicate.
This patch adds a new 'reverse' sve-tail-folding option and
ensures they are not considered 'simple'.

I did this by adding a function called
containsDecreasingPointers to AArch64TargetTransformInfo.cpp
that searches all instructions in the loop for loads or
stores with negative strides.

Differential Revision: https://reviews.llvm.org/D146128
2023-03-27 14:10:15 +00:00
David Green
3f8160e9ce [LV][ARM][AArch64] Add multi-exit Loop Vectorizer tests. NFC
These are useful to test with the various predication schemes available on
different targets.
2023-03-27 10:21:24 +01:00
David Sherwood
1c4fedfa35 [LoopVectorize] Don't tail-fold for scalable VFs when there is no scalar tail
Currently in LoopVectorize we avoid tail-folding if we can
prove the trip count is always a multiple of the maximum
fixed-width VF. This works because we know the vectoriser
only ever chooses a VF that is a power of 2. However, if
we are also considering scalable VFs then we conservatively
bail out of the optimisation because we don't know the value
of vscale, which could be an odd or prime number, etc.

This patch tries to enable the same optimisation for scalable
VFs by asking if vscale is known to be a power of 2. If so,
we can then query the maximum value of vscale and use the same
logic as we do for fixed-width VFs. I've also added a new TTI
hook called isVScaleKnownToBeAPowerOfTwo that does the same
thing as the existing TargetLowering hook.

Differential Revision: https://reviews.llvm.org/D146199
2023-03-27 08:34:30 +00:00
David Sherwood
bd0c281fcd [NFC][LoopVectorize] Change trip counts for some tests to guarantee a scalar tail
Quite a few vectoriser tests were using a trip count of 1024,
which meant:

1. For fixed-length VFs we would never actually tail-fold, e.g.
see Transforms/LoopVectorize/RISCV/uniform-load-store.ll. This
is because we can prove at compile-time there will never be a
scalar tail.
2. As of D146199 the same optimisation mentioned above will also
apply to scalable VFs too.

I've changed all such trip counts to be 1025 instead.

Differential Revision: https://reviews.llvm.org/D146219
2023-03-24 09:43:50 +00:00
Luke Lau
8d16c6809a [RISCV] Increase default vectorizer LMUL to 2
After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead this just does the easy, pragmatic thing by changing the default to a safe value that doesn't affect register pressure signifcantly[1], but should increase throughput and unlock more interleaving.

[1] Register spilling when compiling sqlite at various levels of `-riscv-v-register-bit-width-lmul`:

LMUL=1    2573 spills
LMUL=2    2583 spills
LMUL=4    2819 spills
LMUL=8    3256 spills

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D143723
2023-03-23 10:33:50 +00:00
Luke Lau
06f16232b1 [RISCV][NFC] Make interleaved access test more vectorizable
The previous test case stored the result of a deinterleaved load and add
into the same source address, which resulted in some scatters which we
weren't testing for and made the tests harder to understand.
Store it at a separate address, which will make the tests easier to read
when the cost model is changed after D145085 is landed

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D146442
2023-03-22 16:02:44 +00:00
Anna Thomas
4277d932ef [LV] Use speculatability within entire loop to avoid strided load predication
Use existing functionality for identifying total access size by strided
loads. If we can speculate the load across all vector iterations, we can
avoid predication for these strided loads (or masked gathers in
architectures which support it).

Differential Revision: https://reviews.llvm.org/D145616
2023-03-21 12:08:25 -04:00
Florian Hahn
962c306a11
[LV] Don't consider pointer as uniform if it is also stored.
Update isVectorizedMemAccessUse to also check if the pointer is stored.
This prevents LV to incorrectly consider a pointer as uniform if it is
used as both pointer and stored by the same StoreInst.

Fixes #61396.
2023-03-17 16:26:16 +00:00
Florian Hahn
a4bb037418
[LV] Add test where pointer is incorrectly marked as uniform.
Test for #61396.
2023-03-17 14:24:13 +00:00
Florian Hahn
565b98e793
[LV] Convert consecutive-ptr-uniforms.ll to use opaque pointers (NFC). 2023-03-17 14:07:11 +00:00
Douglas Yung
3e16488769 Mark test added in D145155 as requiring asserts since it uses the "-debug-only" option.
This should fix the test failure in Release builds.
2023-03-16 15:03:19 -07:00
Luke Lau
b9238abe05 [RISCV] Enable interleaved access vectorization
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145155
2023-03-16 15:48:55 +00:00
Luke Lau
fc220a1aa9 Revert "[RISCV] Enable interleaved access vectorization"
This reverts commit acc03ad10af4f379a644e3956cb9aca54e40696c.
2023-03-15 22:00:48 +00:00
Luke Lau
acc03ad10a [RISCV] Enable interleaved access vectorization
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145155
2023-03-15 21:56:30 +00:00
David Green
98481bc723 [LV][VPlan] Fix printing TripCount liveins. NFC
The TripCount liveins would currently be printed as badref in the vplan as they
are not allocated slots in the VPSlotTracker. This patch allocates them a slot
and adds them to the printed Live-Ins. It also makes a minor adjustment to
printing of Live-ins to reduce the empty lines when multiple Live-ins are
present.

Differential Revision: https://reviews.llvm.org/D145507
2023-03-13 19:44:12 +00:00
Sanjay Patel
ef6f23535d Revert "[InstCombine] use loop info when running the pass after loop vectorization"
This reverts commit 43ae4b62b2671cf73e691c0b53324cd39405cd51.

This was intended to be practically NFC in terms of the overall
opt pipeline, but there is experimental data showing that code
changes occurred here:
https://llvm-compile-time-tracker.com/compare.php?from=772aa05452f8ff90a47168e6801cda2acb5a1873&to=43ae4b62b2671cf73e691c0b53324cd39405cd51&stat=size-text
2023-03-11 17:28:56 -05:00
Sanjay Patel
43ae4b62b2 [InstCombine] use loop info when running the pass after loop vectorization
This is the follow-up to D144199 and suggestion from D144045.
We make use of loop info explicit via InstCombine pass parameter
rather than semi-arbitrary via caching.

The only InstCombine transform that uses LoopInfo currently is a
GEP fold in visitGEPOfGEP(), so that shows up as a failure in the
dedicated test for the fold as well as several LoopVectorizer tests
that run extra passes.

I don't see any pass manager regression tests that actually check
for pass options, but this is intended to be NFC for the pass
pipeline behavior - we only try to use loop info where it would
have been used before via caching .

Differential Revision: https://reviews.llvm.org/D144274
2023-03-11 14:20:30 -05:00
Arthur Eubanks
7c3c981442 [Passes] Remove some legacy passes
DFAJumpThreading
JumpThreading
LibCallsShrink
LoopVectorize
SLPVectorizer
DeadStoreElimination
AggressiveDCE
CorrelatedValuePropagation
IndVarSimplify

These are part of the optimization pipeline, of which the legacy version is deprecated and being removed.
2023-03-10 17:17:00 -08:00
Luke Lau
d7323f6a7a [RISCV][NFC] Add tests for interleaved accesses in loop vectorizer
Precommit test for D145155

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145697
2023-03-09 17:43:16 +00:00
Anna Thomas
ac4c0ea73b [Tests] Precommit tests for D145616 2023-03-08 17:30:53 -05:00
Florian Hahn
79272ec028
[VPlan] Add predicate to VPReplicateRecipe, expand region later.
This patch adds the predicate as additional operand to VPReplicateRecipe
during initial construction. The predicated recipes are later moved into
replicate regions. This simplifies constructions and some VPlan
transformations, like fixed-order recurrence handling.

It also improves codegen in some cases (e.g. for in-loop reductions),
because the recipes remain in the same block.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D143865
2023-03-08 20:11:28 +01:00
Florian Hahn
7019624ee1
[SCEV] Strengthen nowrap flags via ranges for ARs on construction.
At the moment, proveNoWrapViaConstantRanges is only used when creating
SCEV[Zero,Sign]ExtendExprs. We can get significant improvements by
strengthening flags after creating the AddRec.

I'll also share a follow-up patch that removes the code to strengthen
flags when creating SCEV[Zero,Sign]ExtendExprs. Modifying AddRecs while
creating those can lead to surprising changes.

Compile-time looks neutral:
https://llvm-compile-time-tracker.com/compare.php?from=94676cf8a13c511a9acfc24ed53c98964a87bde3&to=aced434e8b103109104882776824c4136c90030d&stat=instructions:u

Reviewed By: mkazantsev, nikic

Differential Revision: https://reviews.llvm.org/D144050
2023-03-07 17:10:34 +01:00
Mel Chen
d5c404d1b9 [RISCV] Enable ordered reduction.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D144455
2023-03-07 04:36:50 -08:00
Graham Hunter
92e0ab937f [AArch64] Don't map llvm sqrt intrinsics to veclib functions
Since AArch64 has sqrt instructions, we want to use those instead of
calls to vector math routines for llvm sqrt intrinsics (since those
don't imply some of the constraints that libm calls might have) so
we just remove the mappings.

Code originally written by mgabka

Reviewed By: danielkiss, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D145392
2023-03-07 11:43:41 +00:00
Graham Hunter
a180344589 [LV] Allow scalarization of function calls when masking is required
This patch adds support for scalarizing calls to a function when
there is a vector variant that cannot be used, either because there
isn't a masked variant or because the cost model indicated a VF
without a masked variant was better.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D134422
2023-03-03 15:26:04 +00:00
Mel Chen
9d0703a646 [RISCV] Pre-commit test case for ordered reduction, NFC
Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D144458
2023-03-01 06:27:43 -08:00
Sander de Smalen
c41b41eb11 [LoopVectorize] Use overflow-check analysis to improve tail-folding.
This work follows on from D142109 and addresses a possible regression
when we know the loop iteration counter cannot overflow.

When we know the overflow-check always evaluates to false, it's better to
use the other style of tail folding where it assumes a runtime check was
added, because that avoids having to calculate a modified trip-count.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D142894
2023-03-01 14:17:58 +00:00
Sander de Smalen
fe1b51ffee [LoopVectorize] Remove runtime check and scalar tail loop when tail-folding.
When using tail-folding and using the predicate for both data and control-flow
(the next vector iteration's predicate is generated with the llvm.active.lane.mask
intrinsic and then tested for the backedge), the LoopVectorizer still inserts a
runtime check to see if the 'i + VF' may at any point overflow for the given
trip-count. When it does, it falls back to a scalar epilogue loop.

We can get rid of that runtime check in the pre-header and therefore also
remove the scalar epilogue loop. This reduces code-size and avoids a runtime
check.

Consider the following loop:

  void foo(char * __restrict__ dst, char *src, unsigned long N) {
      for (unsigned long  i=0; i<N; ++i)
          dst[i] = src[i] + 42;
  }

If 'N' is e.g. ULONG_MAX, and the VF > 1, then the loop iteration counter
will overflow when calculating the predicate for the next vector iteration
at some point, because LLVM does:

  vector.ph:
    %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)

  vector.body:
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
    ...

    %index.next = add i64 %index, 16
      ; The add above may overflow, which would affect the lane mask and control flow. Hence a runtime check is needed.
    %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next, i64 %N)
    %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
    br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7

The solution:

What we can do instead is calculate the predicate before incrementing
the loop iteration counter, such that the llvm.active.lane.mask is
calculated from 'i' to 'tripcount > VF ? tripcount - VF : 0', i.e.

  vector.ph:
    %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
    %N_minus_VF = select %N > 16 ? %N - 16 : 0

  vector.body:
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
    ...

    %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %N_minus_VF)
    %index.next = add i64 %index, %4
      ; The add above may still overflow, but this time the active.lane.mask is not affected
    %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
    br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7

For N = 20, we'd then get:

  vector.ph:
    %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
      ; %active.lane.mask.entry = <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1>
    %N_minus_VF = select 20 > 16 ? 20 - 16 : 0
      ; %N_minus_VF = 4

  vector.body: (1st iteration)
    ... ; using <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> as predicate in the loop
    ...
    %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 4)
      ; %active.lane.mask.next = <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    %index.next = add i64 0, 16
      ; %index.next = 16
    %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
      ; %8 = 1
    br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
      ; branch to %vector.body

  vector.body: (2nd iteration)
    ... ; using <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> as predicate in the loop
    ...
    %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 16, i64 4)
      ; %active.lane.mask.next = <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    %index.next = add i64 16, 16
      ; %index.next = 32
    %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
      ; %8 = 0
    br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
      ; branch to %for.cond.cleanup

Reviewed By: fhahn, david-arm

Differential Revision: https://reviews.llvm.org/D142109
2023-03-01 09:01:19 +00:00
Sander de Smalen
de111ae70a NFC: Use generate_test_checks script for LV tests which seem to have been auto-generated. 2023-03-01 09:01:19 +00:00
sgokhale
ac67ec3a54 [LV] Reland testcase in 0ec4cae 2023-02-28 18:14:18 +05:30
sgokhale
0ec4cae146 [LV] Modify test case for commit 4f9a544
Was observing test failure. Relanding the test
2023-02-28 18:07:36 +05:30
sgokhale
4f9a5447c6 [LV] Reland "Update logic for calculating register usage due to invariants"
Previously, while calculating register usage due to invariants, it was assumed that invariant would always be part of widening
instructions. This resulted in calculating vector register types for vectors which cant be legalized(check the newly added test for more details).

An invariant might not always need a vector register. For e.g., invariant might just be used for iteration check.

This patch checks if the invariant is part of any widening instruction and considers register usage accordingly. Fixes issue 60493

Differential Revision: https://reviews.llvm.org/D143422
2023-02-28 17:32:39 +05:30
sgokhale
3c8ddbde37 Revert "[LV] Update logic for calculating register usage due to invariants"
Observing test failure for llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll

This reverts commit d1628266946fdddb44bdad2b3ccf3cd5fc769f42.
2023-02-28 15:46:59 +05:30
sgokhale
d162826694 [LV] Update logic for calculating register usage due to invariants
Previously, while calculating register usage due to invariants, it was assumed that invariant would always be part of widening
instructions. This resulted in calculating vector register types for vectors which cant be legalized(check the newly added test for more details).

An invariant might not always need a vector register. For e.g., invariant might just be used for iteration check.

This patch checks if the invariant is part of any widening instruction and considers register usage accordingly. Fixes issue 60493

Differential Revision: https://reviews.llvm.org/D143422
2023-02-28 11:05:26 +05:30