2743 Commits

Author SHA1 Message Date
Sjoerd Meijer
9bccf61f5f
[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385)
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
2024-11-20 09:33:39 +00:00
David Sherwood
aeb88f6778
Fix test failures introduced by PR #113697 (#116941)
Don't match the entire floating point debug output since it's prone to
rounding errors depending upon the target.
2024-11-20 09:10:51 +00:00
David Sherwood
3097c60928
[LoopVectorize][NFC] Rewrite tests to check output of vplan cost model (#113697)
Currently it's very difficult to improve the cost model for tail-folded
loops because as soon as you add a VPInstruction::computeCost function
that adds the costs of instructions such as
VPInstruction::ActiveLaneMask
and VPInstruction::ExplicitVectorLength the assert in
LoopVectorizationPlanner::computeBestVF fails for some tests. This is
because the VF chosen by the legacy cost model doesn't match the vplan
cost model. See PR #90191. This assert is currently making it difficult
to improve the cost model.

Hopefully we will be in a position to remove the assert soon, however
in order to do that we have to fix up a whole bunch of tests that rely
upon the legacy cost model output. I've tried my best to update
these tests to use vplan output instead.

There is still work needed for the VF=1 case because the vplan cost
model is not printed out in this case. I've not attempted to fix those
in this patch.
2024-11-19 08:55:39 +00:00
Nikita Popov
9a844a36eb
[InstCombine] Use InstSimplify in FoldOpIntoSelect (#116073)
Instead of only trying to constant fold the select arms, try to simplify
them. This subsumes https://github.com/llvm/llvm-project/pull/115969
which implements this for extractvalue only.

This is still fairly limited in that we will usually only call
FoldOpIntoSelect in the first place if we have a constant operand. This
can be relaxed in the future if worthwhile.
2024-11-18 10:07:31 +01:00
Julian Nagele
a8538b9138
[LV] Vectorize Epilogues for loops with small VF but high IC (#108190)
- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF

PR: https://github.com/llvm/llvm-project/pull/108190
---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
2024-11-17 19:35:32 +00:00
Stephen Tozer
caa9a82797
[DebugInfo][LoopVectorizer] Avoid dropping !dbg in optimizeForVFAndUF (#114243)
Prior to this patch, optimizeForVFAndUF may optimize the conditional
branch for a VPBasicblock to have a constant condition, but
unnecessarily drops the DILocation attachment when it does so; this
patch changes it to preserve the DILocation.
2024-11-14 09:33:46 +00:00
Luke Lau
d119d43e92 [LV] Add missing REQUIRES: asserts to test 2024-11-14 17:41:40 +09:00
Luke Lau
050e2d325a
[LV] Remove assertions in IV overflow check (#115705)
In #111310 an assert was added that for the IV overflow check used with
tail folding, the overflow check is never known.

However when applying the loop guards, it looks like it's possible that
we might actually know the IV won't overflow: this occurs in
500.perlbench_r from SPEC CPU 2017 and triggers the assertion:

Assertion failed: (!isIndvarOverflowCheckKnownFalse(Cost, VF * UF) &&
!SE.isKnownPredicate(CmpInst::getInversePredicate(ICmpInst::ICMP_ULT),
TC2OverflowSCEV, SE.getSCEV(Step)) && "unexpectedly proved overflow
check to be known"), function emitIterationCountCheck, file
LoopVectorize.cpp, line 2501.

There is a discrepancy between `isIndvarOverflowCheckKnownFalse` and the
ICMP_ULT check, because the former uses `getSmallConstantMaxTripCount`
which only takes into trip counts that fit into 32 bits. There doesn't
seem to be an easy way to make the assertion aware of this, so this PR
just removes it for now.

There are two potential follow up things from this PR:

1. We miss calculating the max trip count in `@trip_count_max_1024`, it
looks like we might need to apply loop guards somewhere in
`ScalarEvolution::computeExitLimitFromICmp`
2. In `@overflow_at_0`, if `%tc == 0` then we the overflow check will
always return false, even though it will overflow

Fixes https://github.com/llvm/llvm-project/issues/115755
2024-11-14 17:04:49 +09:00
Luke Lau
9e77f59005
[LV] Account for vp_merge in out of loop EVL reductions in legacy cost model (#115903)
In #101641, support for out of loop reductions with EVL tail folding was
added by transforming selects to vp_merges in
transformRecipestoEVLRecipes.

Whilst the select was previously free, the vp_merge wasn't and incurs a
cost on RISC-V with the VPlan cost model. But this diverged from the
legacy cost model and caused the "VPlan cost model and legacy cost model
disagreed" assertion to trigger when building 502.gcc_r from SPEC CPU
2017.

Neither the select nor vp_merge recipes from the VPlan exist in the
underlying instructions, so I thought it would make the most sense to
fix this by adding the cost to the underlying phi instruction in
getInstructionCost.

It's worth noting that on RISC-V this vp_merge won't actually generate
any instructions because the mask is all true, and will be folded away.
So we should update the cost model at some point to reflect that.
2024-11-14 16:55:18 +09:00
Elvis Wang
b4d23cf685
[LV] Fix missing precomptueCosts() in emitInvalidCostRemarks(). (#114918)
We should always update the `SkipComputation` which is set in
`VPCostContext` before VPlan compute costs.

This patch prevent the assertion of in-loop reduction in the
`VPReductionRecipe::computeCost()` and other potential assertions of
partially implemented VPlan-based cost model.
2024-11-14 08:29:55 +08:00
David Green
7665d3f0df [ARM] Extra MVE reduction test cases. NFC 2024-11-12 10:00:57 +00:00
Nikita Popov
6dc23b7009
[SCEVExpander] Don't try to reuse SCEVUnknown values (#115141)
The expansion of a SCEVUnknown is trivial (it's just the wrapped value).
If we try to reuse an existing value it might be a more complex
expression that simplifies to the SCEVUnknown.

This is inspired by https://github.com/llvm/llvm-project/issues/114879,
because SCEVExpander replacing a constant with a phi node is just silly.
(I don't consider this a fix for that issue though.)
2024-11-11 12:36:29 +01:00
Luke Lau
b2e2d8b3f6
[RISCV] Enable scalable loop vectorization for zvfhmin/zvfbfmin (#115272)
This PR enables scalable loop vectorization for f16 with zvfhmin and
bf16 with zvfbfmin.

Enabling this was dependent on filling out the gaps for scalable
zvfhmin/zvfbfmin codegen, but everything that the loop vectorizer might
emit should now be handled.

It does this by marking f16 and bf16 as legal in
`isLegalElementTypeForRVV`. There are a few users of
`isLegalElementTypeForRVV` that have already been enabled in other PRs:

- `isLegalStridedLoadStore` #115264
- `isLegalInterleavedAccessType` #115257
- `isLegalMaskedLoadStore` #115145
- `isLegalMaskedGatherScatter` #114945

The remaining user is `isLegalToVectorizeReduction`. We can't promote
f16/bf16 reductions to f32 so we need to disable them for scalable
vectors. The cost model actually marks these as invalid, but for
out-of-tree reductions `ComputeReductionResult` doesn't get costed and
it will end up emitting a reduction intrinsic regardless, so we still
need to mark them as illegal. We might be able to remove this
restriction later for fmax and fmin reductions.
2024-11-11 13:29:48 +08:00
Florian Hahn
a5a1612deb
[VPlan] Consistently use DEBUG_TYPE loop-vectorize.
This ensures debug messages in VPlan.cpp are included in the commonly
used -debug-only=loop-vectorize.
2024-11-10 09:17:03 +00:00
Tex Riddell
818d715989
[Analysis] atan2: isTriviallyVectorizable; add to massv and accelerate veclibs (#113637)
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

- Return true for atan2 from isTriviallyVectorizable
- Add atan2 to VecFuncs.def for massv and accelerate libraries.
- Add atan2 to hasOptimizedCodeGen
- Add atan2 support in llvm/lib/Analysis/ValueTracking.cpp
llvm::getIntrinsicForCallSite and update vectorization tests
- Add atan2 name check to isLoweredToCall in
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
- Note: there's no test coverage for these names in isLoweredToCall, except that Transforms/TailCallElim/inf-recursion.ll is impacted by the "fabs" case

Thanks to @jroelofs for the atan2 accelerate veclib and associated test
additions, plus the hasOptimizedCodeGen addition.

Part of: Implement the atan2 HLSL Function #70096.
2024-11-08 16:07:38 -08:00
Noah Goldstein
8af5ae0648 [VPlan] Preserve IR flags when widening casts
We have `nneg` for both `sext` and `uitofp`.

Fixes #114856

Closes #115373
2024-11-08 17:21:05 -06:00
Noah Goldstein
50850bc78b [LV] Add test for preserving flags when widening casts; NFC 2024-11-08 17:21:05 -06:00
Florian Hahn
144bdf3eb7
[VPlan] Also check if plan for best legacy VF contains simplifications.
The plan for the VF chosen by the legacy cost model could also contain
additional simplifications that cause cost differences. Also check if it
contains simplifications.

Fixes https://github.com/llvm/llvm-project/issues/114860.
2024-11-08 20:53:03 +00:00
David Green
ab9178e3e7 [ARM] Add a couple of new MVE reduction tests. NFC
Nowadays we generate add(zext(mul(sext, sext)) with nneg zext and the multi-use
test is awkward to get right. This should help our test coverage with the vplan
cost transition.
2024-11-08 14:32:06 +00:00
Nikita Popov
dd116369f6
[InstSimplify] Fix incorrect poison propagation when folding phi (#96631)
We can only replace phi(X, undef) with X, if X is known not to be
poison. Otherwise, the result may be more poisonous on the undef branch.

Fixes https://github.com/llvm/llvm-project/issues/68683.
2024-11-07 14:09:45 +01:00
Yingwei Zheng
0b9f1cc024
[SCEV] Disallow simplifying phi(undef, X) to X (#115109)
See the following case:
```
@GlobIntONE = global i32 0, align 4

define ptr @src() {
entry:
  br label %for.body.peel.begin

for.body.peel.begin:                              ; preds = %entry
  br label %for.body.peel

for.body.peel:                                    ; preds = %for.body.peel.begin
  br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel

cleanup.loopexit.peel:                            ; preds = %for.body.peel
  br label %cleanup.peel

cleanup.peel:                                     ; preds = %cleanup.loopexit.peel, %for.body.peel
  %retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ]
  br i1 true, label %for.body.peel.next, label %cleanup7

for.body.peel.next:                               ; preds = %cleanup.peel
  br label %for.body.peel.next1

for.body.peel.next1:                              ; preds = %for.body.peel.next
  br label %entry.peel.newph

entry.peel.newph:                                 ; preds = %for.body.peel.next1
  br label %for.body

for.body:                                         ; preds = %cleanup, %entry.peel.newph
  %retval.0 = phi ptr [ %retval.2.peel, %entry.peel.newph ], [ %retval.2, %cleanup ]
  br i1 false, label %cleanup, label %cleanup.loopexit

cleanup.loopexit:                                 ; preds = %for.body
  br label %cleanup

cleanup:                                          ; preds = %cleanup.loopexit, %for.body
  %retval.2 = phi ptr [ %retval.0, %for.body ], [ @GlobIntONE, %cleanup.loopexit ]
  br i1 false, label %for.body, label %cleanup7.loopexit

cleanup7.loopexit:                                ; preds = %cleanup
  %retval.2.lcssa.ph = phi ptr [ %retval.2, %cleanup ]
  br label %cleanup7

cleanup7:                                         ; preds = %cleanup7.loopexit, %cleanup.peel
  %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ]
  ret ptr %retval.2.lcssa
}

define ptr @tgt() {
entry:
  br label %for.body.peel.begin

for.body.peel.begin:                              ; preds = %entry
  br label %for.body.peel

for.body.peel:                                    ; preds = %for.body.peel.begin
  br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel

cleanup.loopexit.peel:                            ; preds = %for.body.peel
  br label %cleanup.peel

cleanup.peel:                                     ; preds = %cleanup.loopexit.peel, %for.body.peel
  %retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ]
  br i1 true, label %for.body.peel.next, label %cleanup7

for.body.peel.next:                               ; preds = %cleanup.peel
  br label %for.body.peel.next1

for.body.peel.next1:                              ; preds = %for.body.peel.next
  br label %entry.peel.newph

entry.peel.newph:                                 ; preds = %for.body.peel.next1
  br label %for.body

for.body:                                         ; preds = %cleanup, %entry.peel.newph
  br i1 false, label %cleanup, label %cleanup.loopexit

cleanup.loopexit:                                 ; preds = %for.body
  br label %cleanup

cleanup:                                          ; preds = %cleanup.loopexit, %for.body
  br i1 false, label %for.body, label %cleanup7.loopexit

cleanup7.loopexit:                                ; preds = %cleanup
  %retval.2.lcssa.ph = phi ptr [ %retval.2.peel, %cleanup ]
  br label %cleanup7

cleanup7:                                         ; preds = %cleanup7.loopexit, %cleanup.peel
  %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ]
  ret ptr %retval.2.lcssa
}
```
1. `simplifyInstruction(%retval.2.peel)` returns `@GlobIntONE`. Thus,
`ScalarEvolution::createNodeForPHI` returns SCEV expr `@GlobIntONE` for
`%retval.2.peel`.
2. `SimplifyIndvar::replaceIVUserWithLoopInvariant` tries to replace the
use of `%retval.2.peel` in `%retval.2.lcssa.ph` with `@GlobIntONE`.
3. `simplifyLoopAfterUnroll -> simplifyLoopIVs -> SCEVExpander::expand`
reuses `%retval.2.peel = phi ptr [ undef, %for.body.peel ], [
@GlobIntONE, %cleanup.loopexit.peel ]` to generate code for
`@GlobIntONE`. It is incorrect.

This patch disallows simplifying `phi(undef, X)` to `X` by setting
`CanUseUndef` to false.
Closes https://github.com/llvm/llvm-project/issues/114879.
2024-11-07 15:53:51 +08:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Mel Chen
4480a22c2b
[LV][EVL] Emit vp.merge intrinsic to enable out-loop reduction in EVL vectorization. (#101641)
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.

This patch performs transformation to convert 
  select(header_mask, LHS, RHS)
into
  vp.merge(all-true, LHS, RHS, EVL) 
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.

TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
2024-11-06 14:53:49 +08:00
Florian Hahn
a353e258ba
[LAA] Don't require Stride == 1/-1 for inbounds pointer AddRecs nowrap. (#113126)
If we have a pointer AddRec, the maximum increment is
2^(pointer-index-wdith - 1) - 1. This means that if incrementing the
AddRec wraps, the distance between the previously accessed location and
the wrapped location is > 2^(pointer-index-wdith - 1), i.e. if the GEP
for the AddRec is inbounds, this would be poison due to the object being
larger than half the pointer index type space. The poison would be
immediate UB when the memory access gets executed..

Similar reasoning can be applied for decrements.

PR: https://github.com/llvm/llvm-project/pull/113126
2024-11-05 22:45:56 +01:00
Yongtao Huang
f6948e8f9d
[LoopVectorize] Fix typo in branch-weights.ll test CEHCK->CHECK (NFC) (#113574)
Fix the typo CEHCK.

Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2024-11-05 14:22:24 +01:00
zhijian lin
a51712751c
[PowerPC][LLC] Utilize PPC::getNormalizedPPCTargetCPU() to set CPU (#113943)
Utilize common API in PPCTargetParser
(https://github.com/llvm/llvm-project/pull/97541) to set default CPU
with same interfaces for LLC.
This will update AIX default CPU to pwr7 and LoP powerppc64 default CPU
to ppc64.
2024-11-04 09:40:54 -05:00
Luke Lau
beb12f92c7
[RISCV] Add +optimized-nfN-segment-load-store (#114414)
This is a follow up to #111511, where after benchmarking we learnt that
the Banana Pi F3 has fast segmented loads for not just NF=2, but also
NF=3 and NF=4:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput

This adds tuning features to allow these segment loads and stores to be
costed cheaper and enables it for the spacemit-x60.

It also enables +optimized-nf2-segment-load-store by default in the
generic tuning to maintain the previous behaviour when compiled without
-mcpu or -mtune.
2024-11-04 06:43:58 +08:00
Florian Hahn
17bad1a9da
[LV] Bail out on header phis in shouldConsiderInvariant.
This fixes an infinite recursion in rare cases.

Fixes https://github.com/llvm/llvm-project/issues/113794.
2024-11-01 20:51:25 +00:00
Yingwei Zheng
96b14f2ccb
[Reland][InstCombine] Fix FMF propagation in foldSelectIntoOp (#114499)
Relands #114356. Compared to the last version, this patch only merges
poison-generating/nsz flags from the select to fix LV regression in
`llvm/test/Transforms/PhaseOrdering/AArch64/predicated-reduction.ll`.
2024-11-01 12:22:57 +08:00
Florian Hahn
b021464d35
[VPlan] Introduce scalar loop header in plan, remove VPLiveOut. (#109975)
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.

Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.

PR: https://github.com/llvm/llvm-project/pull/109975
2024-10-31 21:36:44 +01:00
Ami-zhang
1897bf61f0
[LoongArch] Enable FeatureExtLSX for generic-la64 processor (#113421)
This commit makes the `generic` target to support FP and LSX, as
discussed in #110211. Thereby, it allows 128-bit vector to be enabled by
default in the loongarch64 backend.
2024-10-31 15:58:15 +08:00
David Green
9735c05186
[ValueTracking] Compute KnownFP state from recursive select/phi. (#113686)
Given a recursive phi with select:
 %p = phi [ 0, entry ], [ %sel, loop]
 %sel = select %c, %other, %p

The fp state can be calculated using the knowledge that the select/phi
pair can only be the initial state (0 here) or from %other. This adds a
short-cut into computeKnownFPClass for PHI to detect that the select is
recursive back to the phi, and if so use the state from the other
operand.

This helps to address a regression from #83200.
2024-10-31 07:50:44 +00:00
Luke Lau
14045de250
[RISCV] Account for factor in interleave memory op costs (#111511)
Currently we cost an interleaved memory op as if it were a load/store of
the widened vector type, but this was undercosting in all cases when
compared to the measured performance of todays hardware.

On the x280 at NF=2 and spacemit-x60 at NF=2,3 and 4, a segmented load
is carried out as a wide load and NF LMUL shuffle ops:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput

All other NFs go through a slow path. On the spacemit-x60 this is
proportional to VLMAX * NF, and on the x280 proportional to the number
of segments.

This patch increases the cost by implementing a wide load + NF LMUL
shuffle op cost for the lowest common denominator NF=2, and then a
slower cost proportional to VL for the other NFs.

In a follow up patch we can add a tuning flag to use the faster cost
model for NF=3 and 4 on the spacemit-x60.

Note that the FIXME about illegal vectors seems to have been fixed in
#100436
2024-10-31 05:36:46 +08:00
David Sherwood
7f498a865f
[CostModel][LoopVectorize] Move some loop vectoriser tests (#113702)
Many tests that were in test/Analysis/CostModel were actually
loop vectoriser tests. I've moved them as follows:

Analysis/CostModel/X86 -> Transforms/LoopVectorize/X86/CostModel
Analysis/CostModel/AArch64/arith-fp-frem.ll ->
  Transforms/LoopVectorize/AArch64/arith-fp-frem-costs.ll
2024-10-30 13:50:02 +00:00
Rohit Aggarwal
dfb60bb919
Adding more vector calls for -fveclib=AMDLIBM (#109662)
AMD has it's own implementation of vector calls.
New vector calls are introduced in the library for exp10, log10, sincos and finite asin/acos
Please refer [https://github.com/amd/aocl-libm-ose]

---------

Co-authored-by: Rohit Aggarwal <Rohit.Aggarwal@amd.com>
2024-10-29 10:09:55 +00:00
Florian Hahn
0d0abb351b
[VPlan] Use ResumePhi to create reduction resume phis. (#110004)
Use VPInstruction::ResumePhi to create phi nodes for reduction resume
values in the scalar preheader, similar to how ResumePhis are used for
first-order recurrence resume values after 9a5a8731e77.

This allows simplifying createAndCollectMergePhiForReduction to only
collect reduction resume phis when vectorizing epilogue loops and adding
extra incoming edges from the main vector loop. Updating phis for the
epilogue vector loops requires special attention, because additional
incoming values from the bypass blocks need to be added.

PR: https://github.com/llvm/llvm-project/pull/110004
2024-10-28 20:14:08 +01:00
Shih-Po Hung
266ff98cba
[LV][VPlan] Use VF VPValue in VPVectorPointerRecipe (#110974)
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.

Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
2024-10-26 23:18:50 +08:00
Florian Hahn
e724226da7
[VPlan] Return cost of 0 for VPWidenCastRecipe without underlying value.
In some cases, VPWidenCastRecipes are created but not considered in the
legacy cost model, including truncates/extends when evaluating a reduction
in a smaller type. Return 0 for such casts for now, to avoid divergences
between VPlan and legacy cost models.

Fixes https://github.com/llvm/llvm-project/issues/113526.
2024-10-25 21:25:44 +02:00
Tex Riddell
c03d09ce3e
[aarch64] atan2 intrinsic lowering (p5) (#112611)
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

- `VecFuncs.def`: define intrinsic to sleef/armpl mapping
- `LegalizerHelper.cpp`: add missing fewerElementsVector handling for
the new atan2 intrinsic
- `AArch64ISelLowering.cpp`: Add arch64 specializations for lowering
like neon instructions
- `AArch64LegalizerInfo.cpp`: Legalize atan2.

Part 5 for Implement the atan2 HLSL Function #70096.
2024-10-24 17:53:12 -07:00
Florian Hahn
2dfb1c664c
[VPlan] Try to hoist Previous (and operands), if sinking fails for FORs. (#108945)
In some cases, Previous (and its operands) can be hoisted. This allows
supporting additional cases where sinking of all users of to FOR fails,
e.g. due having to sink recipes with side-effects.

This fixes a crash where we fail to create a scalar VPlan for a
first-order recurrence, but can create a vector VPlan, because the trunc
instruction of an IV which generates the previous value of the
recurrence has been optimized to a truncated induction recipe, thus
hoisting it to the beginning.

Fixes https://github.com/llvm/llvm-project/issues/106523.

PR: https://github.com/llvm/llvm-project/pull/108945
2024-10-23 13:12:03 -07:00
Florian Hahn
ddbb382a7c
[LV] Regenerate check-lines for some tests. 2024-10-23 04:34:13 +01:00
Paul Walker
5bb34803a4 [NFC] Migrate tests to use autoupdate for CHECK lines. 2024-10-22 12:55:15 +00:00
Ramkumar Ramachandra
f719cfa868
LAA: be less conservative in isNoWrap (#112553)
isNoWrap has exactly one caller which handles Assume = true separately,
but too conservatively. Instead, pass Assume to isNoWrap, so it is
threaded into getPtrStride, which has the correct handling for the
Assume flag. Also note that the Stride == 1 check in isNoWrap is
incorrect: getPtrStride returns Strides == 1 or -1, except when
isNoWrapAddRec or Assume are true, assuming ShouldCheckWrap is true; we
can include the case of -1 Stride, and when isNoWrapAddRec is true. With
this change, passing Assume = true to getPtrStride could return a
non-unit stride, and we correctly handle that case as well.
2024-10-22 09:55:51 +01:00
Florian Hahn
2a6b09e0d3
[LV] Use type from InsertPos for cost computation of interleave groups.
Previously the legacy cost model would pick the type for the cost
computation depending on the order of the members in the input IR.
This is incompatible with the VPlan-based cost model (independent of
original IR order) and also doesn't match code-gen, which uses the type
of the insert position.

Update the legacy cost model to use the type (and address space) from
the Group's insert position.

This brings the legacy cost model in line with the legacy cost model and
fixes a divergence between both models.

Note that the X86 cost model seems to assign different costs to groups
with i64 and double types. Added a TODO to check.

Fixes https://github.com/llvm/llvm-project/issues/112922.
2024-10-18 19:12:40 -07:00
Florian Hahn
c7496cebac
[LV] Use SCEV to check if minimum iteration check is known. (#111310)
Use SCEV to check if the minimum iteration check (TC < Step) is known to
be false.

This is a first step towards addressing
https://github.com/llvm/llvm-project/issues/111098. To catch the exact
case from the issue, we need to do extra work to make sure the wrap
flags on the shl are preserved and used by SCEV.

Note that skeleton creation will be gradually moved to VPlan and this
simplification should be done as VPlan transform eventually. The current
plan is to move skeleton creation to VPlan starting from parts closest
to the parts already created by VPlan, starting with induction resume
value creation (started with
https://github.com/llvm/llvm-project/pull/110577), then memory and SCEV
checks and finally minimum iteration checks.

PR: https://github.com/llvm/llvm-project/pull/111310
2024-10-18 15:22:59 -07:00
Alexey Bataev
f148d5791b
[LV]Initial support for safe distance in predicated DataWithEVL vectorization mode.
Enabled initial support for max safe distance in DataWithEVL mode. If
max safe distance is required, need to emit special code:
CMP = icmp ult AVL, MAX_SAFE_DISTANCE
SAFE_AVL = select CMP, AVL, MAX_SAFE_DISTANCE
EVL = call i32 @llvm.experimental.get.vector.length(i64 SAFE_AVL)

while vectorize the loop in DataWithEVL tail folding mode.

Reviewers: fhahn

Reviewed By: fhahn

Pull Request: https://github.com/llvm/llvm-project/pull/102897
2024-10-18 15:51:49 -04:00
Graham Hunter
091a235ec5
Revert "[AArch64][SVE] Enable max vector bandwidth for SVE" (#112873)
Reverts llvm/llvm-project#109671

Reverting due to some performance regressions on neoverse-v1.
2024-10-18 11:05:55 +01:00
Florian Hahn
b497010854
[VPlan] Use VPInstruction::Name when assigning names (NFCI).
This slightly improves the printing of VPInstructions. NFC except debug
output.
2024-10-18 05:52:35 +01:00
Florian Hahn
b060661da8
[SCEVExpander] Expand UDiv avoiding UB when in seq_min/max. (#92177)
Update SCEVExpander to introduce an SafeUDivMode, which is set
when expanding operands of SCEVSequentialMinMaxExpr. In this mode,
the expander will make sure that the divisor of the expanded UDiv is
neither 0 nor poison.

Fixes https://github.com/llvm/llvm-project/issues/89958.


PR https://github.com/llvm/llvm-project/pull/92177
2024-10-17 13:55:20 -07:00
David Sherwood
76f3776185
[NFC][LoopVectorize] Restructure simple early exit tests (#112721)
The previous simple_early_exit.ll was growing too large and difficult to
manage. Instead I've decided to refactor the tests by splitting out into
notional groups:

1. single_early_exit.ll: loops with a single uncountable exit that do
not have live-outs from the loop.
2. single_early_exit_live_outs.ll: loops with a single uncountable exit
with live-outs.
3. multi_early_exit.ll: loops with multiple early exits, i.e. a mixture
of countable and uncountable exits, but with no live-outs from the loop.
4. multi_early_exit_live_outs.ll: as above, but with live-outs.
5. single_early_exit_unsafe_ptrs.ll: loops with a single uncountable
exit, but with pointers that are not unconditionally dereferenceable.
6. unsupported_early_exit.ll: loops with uncountable exits that we
cannot yet vectorise.
7. early_exit_legality.ll: tests the debug output from
LoopVectorizationLegality to make sure we handle different scenarios
correctly.

Only the last test now requires asserts. Over time some of these tests
should start vectorising as more support is added.

I also tried to rename the multi early exit tests to make it clear there
what mixture of countable and uncountable exits are present.
2024-10-17 16:50:59 +01:00