655 Commits

Author SHA1 Message Date
Shih-Po Hung
266ff98cba
[LV][VPlan] Use VF VPValue in VPVectorPointerRecipe (#110974)
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.

Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
2024-10-26 23:18:50 +08:00
Tex Riddell
c03d09ce3e
[aarch64] atan2 intrinsic lowering (p5) (#112611)
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

- `VecFuncs.def`: define intrinsic to sleef/armpl mapping
- `LegalizerHelper.cpp`: add missing fewerElementsVector handling for
the new atan2 intrinsic
- `AArch64ISelLowering.cpp`: Add AArch64 specializations for lowering
like neon instructions
- `AArch64LegalizerInfo.cpp`: Legalize atan2.

Part 5 for Implement the atan2 HLSL Function #70096.
2024-10-24 17:53:12 -07:00
Florian Hahn
ddbb382a7c
[LV] Regenerate check-lines for some tests. 2024-10-23 04:34:13 +01:00
Paul Walker
5bb34803a4 [NFC] Migrate tests to use autoupdate for CHECK lines. 2024-10-22 12:55:15 +00:00
Florian Hahn
c7496cebac
[LV] Use SCEV to check if minimum iteration check is known. (#111310)
Use SCEV to check if the minimum iteration check (TC < Step) is known to
be false.

This is a first step towards addressing
https://github.com/llvm/llvm-project/issues/111098. To catch the exact
case from the issue, we need to do extra work to make sure the wrap
flags on the shl are preserved and used by SCEV.

Note that skeleton creation will be gradually moved to VPlan and this
simplification should be done as VPlan transform eventually. The current
plan is to move skeleton creation to VPlan starting from parts closest
to the parts already created by VPlan, starting with induction resume
value creation (started with
https://github.com/llvm/llvm-project/pull/110577), then memory and SCEV
checks and finally minimum iteration checks.

PR: https://github.com/llvm/llvm-project/pull/111310
2024-10-18 15:22:59 -07:00
Graham Hunter
091a235ec5
Revert "[AArch64][SVE] Enable max vector bandwidth for SVE" (#112873)
Reverts llvm/llvm-project#109671

Reverting due to some performance regressions on neoverse-v1.
2024-10-18 11:05:55 +01:00
Florian Hahn
b497010854
[VPlan] Use VPInstruction::Name when assigning names (NFCI).
This slightly improves the printing of VPInstructions. NFC except debug
output.
2024-10-18 05:52:35 +01:00
Yingwei Zheng
095d49da76
[InstCombine] Set samesign when converting signed predicates into unsigned (#112642)
Alive2: https://alive2.llvm.org/ce/z/6cqdt-
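
As a rough illustration of why this conversion is sound (a Python sketch, not the InstCombine code; the 8-bit width and the helper names are assumptions made for this example): when both operands have the same sign bit, a signed comparison and the unsigned comparison of the same bit patterns agree. That is the property the samesign flag records.

```python
BITS = 8  # assumed width, chosen only to keep the exhaustive check small


def as_unsigned(x: int) -> int:
    """Reinterpret a signed BITS-wide integer as its unsigned bit pattern."""
    return x & ((1 << BITS) - 1)


def same_sign(a: int, b: int) -> bool:
    """True when both values are negative or both are non-negative."""
    return (a < 0) == (b < 0)


def signed_and_unsigned_lt_agree(a: int, b: int) -> bool:
    """Compare 'a < b' as signed values and as unsigned bit patterns."""
    return (a < b) == (as_unsigned(a) < as_unsigned(b))


lo, hi = -(1 << (BITS - 1)), 1 << (BITS - 1)
# Exhaustively check every same-sign pair of 8-bit values.
all_agree = all(
    signed_and_unsigned_lt_agree(a, b)
    for a in range(lo, hi)
    for b in range(lo, hi)
    if same_sign(a, b)
)
```

For operands of mixed sign the two predicates can disagree (for example -1 and 1), which is why the flag is only valid when the same-sign condition holds.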
2024-10-17 20:43:48 +08:00
Graham Hunter
c980a20b10
[AArch64][SVE] Enable max vector bandwidth for SVE (#109671)
Returns true for shouldMaximizeVectorBandwidth when the register type
is a scalable vector and SVE or streaming SVE are available.
2024-10-17 13:17:24 +01:00
David Sherwood
671976ff59
[NFC][LoopVectorize] Add more simple early exit tests (#112529)
I realised we are missing tests to cover more loops with multiple early
exits - some countable and some uncountable.

I've also added a few SVE versions of the test in the AArch64 directory.
Once we can vectorise such early exit loops it's a good sanity check to
make sure they also vectorise for SVE. Also, for some of the tests I
expect there to be some divergence from the same tests in the top level
directory once we start vectorising them.
2024-10-17 09:49:51 +01:00
Florian Hahn
3860e29e0e
[VPlan] Mark VPVectorPointerRecipe as not having sideeffects.
VectorPointer doesn't read from memory or have any sideeffects. Mark it
accordingly.
2024-10-16 06:10:19 +01:00
David Sherwood
72f339de45
[LoopVectorize] Use predicated version of getSmallConstantMaxTripCount (#109928)
There are a number of places where we call getSmallConstantMaxTripCount
without passing a vector of predicates:

getSmallBestKnownTC
isIndvarOverflowCheckKnownFalse
computeMaxVF
isMoreProfitable

I've changed all of these to now pass in a predicate vector so that
we get the benefit of making better vectorisation choices when we
know the max trip count for loops that require SCEV predicate checks.

I've tried to add tests that cover all the cases affected by these
changes.
2024-10-11 10:10:15 +01:00
Florian Hahn
bb937e276d
[LV] Compute value of escaped induction based on the computed end value. (#110576)
Update fixupIVUsers to compute the value for escaped inductions using
the already computed end value of the induction (EndValue), but
subtracting the step.

This results in slightly simpler codegen, as we avoid computing the full
transformed index at VectorTripCount - 1.
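
The arithmetic behind this can be sketched as follows, assuming a plain integer induction of the form Start + i * Step (the function and variable names are illustrative and do not come from LoopVectorize):

```python
def induction_at(start: int, step: int, i: int) -> int:
    """Value of the induction variable in iteration i."""
    return start + i * step


def escaped_value_via_index(start: int, step: int, vector_trip_count: int) -> int:
    """Older approach: recompute the transformed index at VectorTripCount - 1."""
    return induction_at(start, step, vector_trip_count - 1)


def escaped_value_via_end(end_value: int, step: int) -> int:
    """Newer approach: reuse the already computed end value and subtract the step."""
    return end_value - step


start, step, vector_trip_count = 7, 3, 16
end_value = induction_at(start, step, vector_trip_count)  # 7 + 16 * 3
old = escaped_value_via_index(start, step, vector_trip_count)
new = escaped_value_via_end(end_value, step)
```

Both routes give the same value; the second needs only one subtraction once the end value is available.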

PR: https://github.com/llvm/llvm-project/pull/110576
2024-10-10 20:04:46 +01:00
Florian Hahn
6fbbe152fa
[VPlan] Introduce VPWidenIntrinsicRecipe to separate from libcall. (#110486)
This patch splits off intrinsic handling to a new
VPWidenIntrinsicRecipe. VPWidenIntrinsicRecipes only need access to the
intrinsic ID to widen and the scalar result type (in case the intrinsic
is overloaded on the result type). It does not need access to an
underlying IR call instruction or function.

This means VPWidenIntrinsicRecipe can be created easily without access
to underlying IR.
2024-10-08 22:37:20 +01:00
Florian Hahn
36fc291b6e
[VPlan] Implement VPBlendRecipe::computeCost.
Implement VPBlendRecipe::computeCost. VPBlendRecipe is currently also
used if only the first lane is used.

This also requires pre-computing costs for forced scalars and
instructions considered profitable to scalarize. For those, the cost
will be computed separately in the legacy cost model. This will also be
needed when implementing VPReplicateRecipe::computeCost.
2024-10-08 21:33:42 +01:00
Florian Hahn
3ec6f805c5
[VPlan] Don't create GEP x, 0 for interleave group pointers.
The GEP with offset 0 is redundant, remove it. This addresses a TODO
from 7f74651837b (#106431).
2024-10-08 12:08:13 +01:00
Florian Hahn
7f74651837
[VPlan] Use pointer to member 0 as VPInterleaveRecipe's pointer arg. (#106431)
Update VPInterleaveRecipe to always use the pointer to member 0 as
pointer argument. This in many cases helps to remove unneeded index
adjustments and simplifies VPInterleaveRecipe::execute.

In some rare cases, the address of member 0 does not dominate the insert
position of the interleave group. In those cases a PtrAdd VPInstruction
is emitted to compute the address of member 0 based on the address of
the insert position. Alternatively we could hoist the recipe computing
the address of member 0.
2024-10-06 22:53:13 +01:00
Benjamin Maxwell
01a1398971
[AArch64][Test] Update test variable names (NFC) (#110667)
Simply by running update_test_checks.py with no changes. This is to make
updating these tests for later changes easier.
2024-10-03 16:14:21 +01:00
Nikita Popov
9f3d1695eb
[SCEVExpander] Preserve gep nuw during expansion (#102133)
When expanding SCEV adds to geps, transfer the nuw flag to the resulting
gep. (Note that this doesn't apply to IV increment GEPs, which go
through a different code path.)
2024-10-02 11:45:00 +02:00
Florian Hahn
0344123ffb
[VPlan] Manage FMFs for VPWidenCall via VPRecipeWithIRFlags. (NFC)
Update VPWidenCallRecipe to manage fast-math flags directly via
VPRecipeWithIRFlags. This addresses a TODO and allows adjusting the FMFs
directly on the recipe. Also fixes printing for flags for
VPWidenCallRecipe.
2024-10-01 13:20:34 +01:00
Graham Hunter
6f1a8c2da2
[LV] Vectorize histogram operations (#99851)
This patch implements autovectorization support for the 'all-in-one'
histogram intrinsic, which seems to have more support than the
'standalone' intrinsic. See
https://discourse.llvm.org/t/rfc-vectorization-support-for-histogram-count-operations/74788/
for an overview of the work and my notes on the tradeoffs between the
two approaches.
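
For context, a small Python model of why histogram updates need special handling when vectorized. This is only a sketch of the semantics, not the intrinsic or the vectorizer; the function names and the vf parameter are assumptions for this example. A plain gather, add, scatter sequence loses updates when two lanes of the same vector hit the same bucket, so a histogram operation has to account for matching lanes.

```python
def histogram_scalar(indices, num_buckets):
    """Reference scalar loop: buckets[idx] += 1 for every index."""
    buckets = [0] * num_buckets
    for idx in indices:
        buckets[idx] += 1
    return buckets


def histogram_naive_vector(indices, num_buckets, vf):
    """Gather, add, scatter per group of vf lanes; duplicates lose updates."""
    buckets = [0] * num_buckets
    for base in range(0, len(indices), vf):
        lanes = indices[base:base + vf]
        gathered = [buckets[i] for i in lanes]   # all lanes read old values
        updated = [v + 1 for v in gathered]
        for i, v in zip(lanes, updated):         # later lanes overwrite earlier
            buckets[i] = v
    return buckets


def histogram_conflict_aware(indices, num_buckets, vf):
    """Count matching lanes per bucket first, then apply one update each."""
    buckets = [0] * num_buckets
    for base in range(0, len(indices), vf):
        counts = {}
        for i in indices[base:base + vf]:
            counts[i] = counts.get(i, 0) + 1
        for i, c in counts.items():
            buckets[i] += c
    return buckets


indices = [1, 1, 2, 1]
scalar = histogram_scalar(indices, 4)
naive = histogram_naive_vector(indices, 4, vf=4)
aware = histogram_conflict_aware(indices, 4, vf=4)
```

With these inputs the naive version records only one hit for bucket 1, while the conflict-aware version matches the scalar loop.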
2024-09-27 13:08:55 +01:00
Benjamin Maxwell
50a1ab12ab
[LAA] Don't assume libcalls with output/input pointers can be vectorized (#108980)
LoopAccessAnalysis currently does not check/track aliasing from the
output pointers, but assumes vectorizing library calls with a mapping is
safe.

This can result in incorrect codegen if something like the following is
vectorized:

```
for(int i=0; i<N; i++) {
  // No aliasing between input and output pointers detected.
  sincos(cos_out[0], sin_out+i, cos_out+i);
}
```

Where for VF >= 2 `cos_out[1]` to `cos_out[VF-1]` is the cosine of the
original value of `cos_out[0]` not the updated value.
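
The difference can be reproduced with a small Python model of the loop above. This is a sketch only: the helper names and the vf parameter are made up for this example, and the "vectorized" version simply reads the input once per group of vf lanes, which is what vectorizing without an alias check would effectively do.

```python
import math


def sincos_scalar(cos_out, n):
    """Scalar loop: cos_out[0] is re-read in every iteration."""
    cos_out = list(cos_out)
    sin_out = [0.0] * n
    for i in range(n):
        x = cos_out[0]
        sin_out[i] = math.sin(x)
        cos_out[i] = math.cos(x)
    return sin_out, cos_out


def sincos_vectorized(cos_out, n, vf):
    """Model of vectorizing by vf: the input is read once per vector iteration."""
    cos_out = list(cos_out)
    sin_out = [0.0] * n
    for base in range(0, n, vf):
        x = cos_out[0]  # broadcast of the value seen before any lane stores
        for i in range(base, min(base + vf, n)):
            sin_out[i] = math.sin(x)
            cos_out[i] = math.cos(x)
    return sin_out, cos_out


initial = [1.0, 0.0, 0.0, 0.0]
_, scalar_cos = sincos_scalar(initial, 4)
_, vector_cos = sincos_vectorized(initial, 4, vf=4)
```

Lane 0 agrees in both versions, but from lane 1 onwards the scalar loop sees the updated `cos_out[0]` while the vector model still uses the original value.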
2024-09-23 16:05:55 +01:00
Graham Hunter
785337e2d9
[LV][AArch64] Don't query registers for illegal scalable vector elts (#109411)
When trying to maximize vector bandwidth we ask TTI for the number of
registers required for a given operation. If the type of that operation
happens to be something illegal for scalable vectors (e.g.
<vscale x 4 x fp128>) then we would see a crash.

Instead, just return a default value and let the cost model reject the
invalid operation later.
2024-09-23 13:35:23 +01:00
Florian Hahn
53266f73f0
[VPlan] Run DCE after unrolling.
This cleans up a number of dead recipes after unrolling if only their
first or last parts are used. This simplifies a number of tests.

Fixes https://github.com/llvm/llvm-project/issues/109581.
2024-09-22 22:08:46 +01:00
Florian Hahn
8ec406757c
[VPlan] Implement unrolling as VPlan-to-VPlan transform. (#95842)
This patch implements explicit unrolling by UF as VPlan transform. In
follow up patches this will allow simplifying VPTransformState (no need
to store unrolled parts) as well as recipe execution (no need to
generate code for multiple parts in each recipe). It also allows for
more general optimizations (e.g. avoid generating code for recipes that
are uniform-across parts).

It also unifies the logic dealing with unrolled parts in a single place,
rather than spreading it out across multiple places (e.g. VPlan post
processing for header-phi recipes previously.)

In the initial implementation, a number of recipes still take the
unrolled part as additional, optional argument, if their execution
depends on the unrolled part.

The computation for start/step values for scalable inductions changed
slightly. Previously the step would be computed as scalar and then
splatted, now vscale gets splatted and multiplied by the step in a
vector mul.

This has been split off https://github.com/llvm/llvm-project/pull/94339
which also includes changes to simplify VPTransformState and recipes'
::execute.

The current version mostly leaves existing ::execute untouched and
instead sets VPTransformState::UF to 1.

A follow-up patch will clean up all references to VPTransformState::UF.

Another follow-up patch will simplify VPTransformState to only store a
single vector value per VPValue.

PR: https://github.com/llvm/llvm-project/pull/95842
2024-09-21 19:47:37 +01:00
Florian Hahn
58e05779b4
[LV] Move test requiring AArch64 to target subdir.
The test added in bd8fe9972e3f depends on the AArch64 target. Move it.
2024-09-21 12:54:59 +01:00
Florian Hahn
4eb9838409
[VPlan] Generalize VPValue::isDefinedOutsideLoopRegions.
Update isDefinedOutsideLoopRegions to check if a recipe is defined
outside any region. Split off already approved
https://github.com/llvm/llvm-project/pull/95842 now that this can be
tested separately after landing VPlan-based LICM
https://github.com/llvm/llvm-project/issues/107501
2024-09-20 15:34:00 +01:00
Florian Hahn
a861ed411a
[VPlan] Add initial loop-invariant code motion transform. (#107894)
Add initial transform to move out loop-invariant recipes.

This also helps to fix a divergence between legacy and VPlan-based cost
model due to legacy using ScalarEvolution::isLoopInvariant in some
cases.

Fixes https://github.com/llvm/llvm-project/issues/107501.

PR: https://github.com/llvm/llvm-project/pull/107894
2024-09-20 11:22:03 +01:00
Florian Hahn
e584278289
[LV] Update tests to avoid loop invariant instructions.
Update some tests with loop invariant instructions so the instructions
cannot be hoisted out.

This preserves the original test intention after
https://github.com/llvm/llvm-project/pull/107894.
2024-09-19 18:50:10 +01:00
Shih-Po Hung
ffcff2f465
[VPlan][NFC] Fix the value name of VECTOR_GEP (#107544)
This patch passes the string `"vector.gep"` to CreateGEP instead of
CreateMul.
2024-09-18 19:22:36 +08:00
Florian Hahn
012dbec604
[VPlan] Handle ForceTargetInstructionCost during precomputeCosts.
Make sure ForceTargetInstructionCost is respected in precomputeCosts.
2024-09-15 10:53:43 +01:00
Florian Hahn
ea83e1c05a
[LV] Assign cost to all interleave members when not interleaving.
At the moment, the full cost of all interleave group members is assigned
to the instruction at the group's insert position, even if the decision
was to not form an interleave group.

This can lead to inaccurate cost estimates, e.g. if the instruction at
the insert position is dead. If the decision is to not vectorize but
scalarize or scatter/gather, then the cost will be the total cost for all
members. In those cases, assign the cost individually per member, to more
closely reflect the choice per instruction.

This fixes a divergence between legacy and VPlan-based cost model.

Fixes https://github.com/llvm/llvm-project/issues/108098.
2024-09-11 21:04:34 +01:00
Florian Hahn
a794ee4559
[VPlan] Add VPValue for VF, use it for VPWidenIntOrFpInductionRecipe. (#95305)
Similar to VFxUF, also add a VF VPValue to VPlan and use it to get the
runtime VF in VPWidenIntOrFpInductionRecipe. Code for VF is only
generated if there are users of VF, to avoid unnecessary test changes.

PR: https://github.com/llvm/llvm-project/pull/95305
2024-09-10 10:41:35 +01:00
Florian Hahn
aa158bf402
[LV] Update tests to replace some code with loop varying instructions.
Update some tests with loop-invariant instructions, where hoisting them
out of the loop changes the vectorization decision. This should preserve
their original spirit when making further improvements.
2024-09-09 14:10:12 +01:00
Florian Hahn
3bd161e98d
[LV] Honor forced scalars in setVectorizedCallDecision.
Similarly to dd94537b4, setVectorizedCallDecision also did not consider
ForcedScalars. This led to VPlans not reflecting the decision by the
legacy cost model (cost computation would use scalar cost, VPlan would
have VPWidenCallRecipe).

To fix this, check if the call has been forced to scalar in
setVectorizedCallDecision.

Note that this requires moving setVectorizedCallDecision after
collectLoopUniforms (which sets ForcedScalars). collectLoopUniforms does
not depend on call decisions and can safely be moved.

Fixes https://github.com/llvm/llvm-project/issues/107051.
2024-09-03 21:06:32 +01:00
Philip Reames
1fbb6b4efc
[LV] Prefer FLT_MIN/MAX for fmin/fmax reductions with ninf (#107141)
Analogous to 2c7786e94a1058bd4f96794a1d4f70dcb86e5cc5, cleanup a case
where the vectorizer is emitting a non-canonical identity value given
the available flags. We use largest/smallest value during ISEL, and VP
expansion, but not during vectorization.

Since the fmin/fmax/fminimum/fmaximum intrinsics don't require a start
value, this difference is only visible when masking of inactive lanes is
required.

Primary motivation of this change is simply to remove a difference
between versions of code which reason about the identity value of a
reduction so I can kill all but one off.

In review, it was pointed out that this is actually a functional fix as well. 
The old code used inf on a noinf reduction instruction - whose
result is poison!  That wasn't the intent of the code.
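
A minimal Python sketch of the masked case, under the assumption that all active inputs are finite (as ninf promises). Python floats are doubles, so `sys.float_info.max` stands in for FLT_MAX here, and the helper name is made up for this example. Python cannot model poison, so the sketch only shows that the largest finite value is a sufficient fill for inactive lanes of an fmin reduction.

```python
import math
import sys

# Stand-in for FLT_MAX: Python floats are doubles, so this is DBL_MAX.
LARGEST_FINITE = sys.float_info.max


def masked_fmin_reduce(values, mask, identity):
    """Reduce with min after replacing inactive lanes by the identity."""
    filled = [v if active else identity for v, active in zip(values, mask)]
    result = identity
    for v in filled:
        result = min(result, v)
    return result


values = [3.0, -2.5, -7.0, 1.0]
mask = [True, True, False, False]  # the last two lanes are inactive
with_finite_identity = masked_fmin_reduce(values, mask, LARGEST_FINITE)
with_inf_identity = masked_fmin_reduce(values, mask, math.inf)
```

Both identities give the same result for finite inputs, so nothing is lost by avoiding inf when the reduction carries ninf.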
2024-09-03 12:21:54 -07:00
Philip Reames
2c7786e94a
Prefer use of 0.0 over -0.0 for fadd reductions w/nsz (in IR) (#106770)
This is a follow up to 924907bc6, and is mostly motivated by consistency
but does include one additional optimization. In general, we prefer 0.0
over -0.0 as the identity value for an fadd. We use that value in
several places, but don't in others. So, let's be consistent and use the
same identity (when nsz allows) everywhere.

This creates a bunch of test churn, but due to 924907bc6, most of that
churn doesn't actually indicate a change in codegen. The exception is
that this change enables the use of 0.0 for nsz, but *not* reassoc, fadd
reductions. Or said differently, it allows the neutral value of an
ordered fadd reduction to be 0.0.
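
The role of nsz can be seen in a small Python sketch using IEEE-754 doubles (the helper names are assumptions for this example): -0.0 is the exact identity for fadd, while 0.0 only differs from it in the sign of a zero result, which is the one thing nsz allows us to ignore.

```python
import math


def fadd_reduce(values, start):
    """In-order (ordered) floating-point add reduction from a start value."""
    acc = start
    for v in values:
        acc = acc + v
    return acc


def sign_of(x: float) -> float:
    """Return 1.0 or -1.0 according to the sign bit, also for zeros."""
    return math.copysign(1.0, x)


all_negative_zero = [-0.0, -0.0]
from_negative_zero = fadd_reduce(all_negative_zero, -0.0)  # stays -0.0
from_positive_zero = fadd_reduce(all_negative_zero, 0.0)   # becomes +0.0

mixed = [1.5, -0.25, 2.0]
mixed_equal = fadd_reduce(mixed, 0.0) == fadd_reduce(mixed, -0.0)
```

The two start values only disagree on the sign of a zero result; for any non-zero sum they are interchangeable.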
2024-09-03 09:16:37 -07:00
Florian Hahn
dd94537b40
[LV] Update call widening decision when scalarizing calls.
collectInstsToScalarize may decide to scalarize a call. If so, we have
to update the widening decision for the call, otherwise the call won't
be scalarized as expected during VPlan construction.

This issue was uncovered by f82543d509.
2024-09-03 14:12:41 +01:00
Florian Hahn
954ed05c10
[VPlan] Simplify MUL operands at recipe construction.
This moves the logic to create simplified operands using SCEV to MUL
recipe creation. This is needed to match the behavior of the legacy cost
model. TODOs are to extend to other opcodes and move to a transform.

Note that this also restricts the number of SCEV simplifications we
apply to more precisely match the cases handled by the legacy cost
model.

Fixes https://github.com/llvm/llvm-project/issues/107015.
2024-09-02 21:25:31 +01:00
Florian Hahn
50a02e7c68
[VPlan] Pass intrinsic inst to TTI in VPWidenCallRecipe::computeCost.
Follow-up to 9ccf825, adjust computeCost to also pass IntrinsicInst to
TTI if available, as there are multiple places in TTI which use the
IntrinsicInst.

Fixes https://github.com/llvm/llvm-project/issues/107016.
2024-09-02 20:47:37 +01:00
Florian Hahn
b0de7fa466
[VPlan] Use op from underlying call in computeCost if needed.
This fixes a divergence between legacy and VPlan-based cost model, e.g.
if one of the operands has a first-order recurrence phi as operand.
2024-09-02 14:00:10 +01:00
Yingwei Zheng
380fa875ab
[InstCombine] Replace all dominated uses of condition with constants (#105510)
This patch replaces all dominated uses of condition with true/false to
improve context-sensitive optimizations. It eliminates a bunch of
branches in llvm-opt-benchmark.

As a side effect, it may introduce new phi nodes in some corner cases.
See the following case:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
   br i1 %cond, label %bb1, label %bb2
bb1:
   br i1 %cmp, label %if.then, label %if.else
if.then:
   br label %bb2
if.else:
   br label %bb2
bb2:
  %res = phi i1 [%cmp, %entry], [%cmp, %if.then], [%cmp, %if.else]
  ret i1 %res
}
```
It will be simplified into:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
   br i1 %cond, label %bb1, label %bb2
bb1:
   br i1 %cmp, label %if.then, label %if.else
if.then:
   br label %bb2
if.else:
   br label %bb2
bb2:
  %res = phi i1 [%cmp, %entry], [true, %if.then], [false, %if.else]
  ret i1 %res
}
```

I am planning to fix this in late pipeline/CGP since this problem exists
before the patch.
2024-09-01 09:49:23 +08:00
Philip Reames
4b553f4916 Regen a bunch of vectorizer tests to avoid naming churn in upcoming review 2024-08-30 10:13:02 -07:00
Paul Walker
ce5620ba9a
[LLVM][VPlan] Pick more optimal initial value for VPBlend. (#104019)
By choosing an initial value whose mask is only used by the blend we can
remove the need for the mask entirely.
2024-08-30 13:30:23 +01:00
Maciej Gabka
95d2d1cba0
Move stepvector intrinsic out of experimental namespace (#98043)
This patch moves the stepvector intrinsic out of the experimental
namespace.

This intrinsic has existed in LLVM for several years now and is widely used.
2024-08-28 12:48:20 +01:00
Florian Hahn
885c4365c1
[VPlan] Skip branches marked as dead in cost precomputation.
Don't consider the cost of branches marked to be skipped in VPlan cost
pre-computation. Those aren't included in the legacy cost, so they
should not be included in the VPlan cost.
2024-08-23 15:58:29 +01:00
Nikita Popov
a105877646
[InstCombine] Remove some of the complexity-based canonicalization (#91185)
The idea behind this canonicalization is that it allows us to handle fewer
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.

However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.

The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.

For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
2024-08-21 12:02:54 +02:00
Florian Hahn
42555cdba4
[VPlan] Run VPlan optimizations on plans in native path.
Update buildVPlans (used in native path) to also run general VPlan
optimizations in another small step to align both codepaths.
2024-08-15 13:05:51 +01:00
Paul Walker
9e318bac5b [LLVM] Regenerate some test outputs for llvm/test/Transforms/LoopVectorize. 2024-08-14 10:59:46 +00:00
Madhur Amilkanthwar
b73771cf0f
[AArch64] Increase scatter overhead on Neoverse-V2 (#101296)
This patch increases scatter overhead on Neoverse-V2 to 13. This
benefits s128 kernel from TSVC_2 test suite.
SPEC 17, RAJAPerf, and Spatter are unaffected by this patch.

This patch boosts s128 kernel's performance from TSVC test suite by about
40% as this enables vectorization. Also, handle minor code refactoring
for gather related part.
2024-08-14 10:12:40 +05:30