2963 Commits

Author SHA1 Message Date
Florian Hahn
4e9894498e
[VPlan] Truncate VFxUF if needed in VPWidenPointerInduction::execute.
Create truncate if needed after 56b05a0d6. Note that this preserves the
original behavior pre 56b05a0d6. If truncate would strip any set bits,
then the explicit computation in the narrower type would wrap.
2025-03-16 11:37:58 +00:00
Florian Hahn
6a8d5f22ff
[VPlan] Don't access canonical IV in VPWidenPointerInduction::execute.
This updates VPWidenPointerInductionRecipe::execute to not use the
canonical IV to determine the insert point. Instead, it relies on the
current recipe position. In cases where this is not sufficient, set the
insert point to the first non-phi instruction, to ensure phis are
created together.
2025-03-15 21:32:48 +00:00
Florian Hahn
aadfa9f6c8
[LV] Add additional tests for narrowing interleave groups.
Extend test coverage for https://github.com/llvm/llvm-project/pull/106441.
2025-03-15 21:13:49 +00:00
Florian Hahn
37a57ca257
[FMF] Set all bits if needed when setting individual flags. (#131321)
Currently fast() won't return true if all flags are set via setXXX,
which is surprising. Update setters to set all bits if needed to make
sure isFast() consistently returns the expected result.

PR: https://github.com/llvm/llvm-project/pull/131321
2025-03-15 18:46:26 +00:00
Florian Hahn
56b05a0d6b
[VPlan] Use VFxUF in VPWidenPointerInductionRecipe.
Use VFxUF VPValue instead of computing VF * UF explicitly.
2025-03-15 18:18:53 +00:00
Jeremy Morse
792a6f8119
[RemoveDIs] Remove "try-debuginfo-iterators..." test flags (#130298)
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.

Nowadays, non-intrinsic format is the default and has been on for more
than a year, there's no need for this flag to exist.

(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
2025-03-14 15:50:49 +00:00
David Sherwood
3b6d0093aa
[LV][NFC] Refactor code for extracting first active element (#131118)
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
a Instruction::ExtractElement node.
2025-03-14 11:14:09 +00:00
Luke Lau
26324bc1bf
[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice (#129645)
After #124093 we now support fixed-order recurrences with EVL tail
folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP
splice intrinsic.

However the costing for the splice is currently done in
VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic
we end up costing it twice.

This fixes it by splitting out the cost for the splice into
FirstOrderRecurrenceSplice so that it's not duplicated when we replace
it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe
since the splice might end up dead and discarded, e.g. in the test
@pr97452_scalable_vf1_for.
2025-03-14 15:33:32 +08:00
Luke Lau
315c02aa02
[VPlan] Fix crash with inloop fmuladd reductions with blend (#131154)
When visiting in-loop reduction links, we previously crashed if we had
an fmuladd with a blend after it in the chain. This fixes it by lifting
the existing blend folding to also handle fmuladd.

This also simplifies the code structure slightly for an upcoming patch I
want to post to handle in-loop AnyOf reductions.

I removed the PhiR->isInLoop() check since it's already guarded at the
top of the parent Header->Phis() loop.
2025-03-14 09:08:32 +08:00
Florian Hahn
dfb661cd1c
[LAA] Add extra tests for #128061.
Extend test coverage for
https://github.com/llvm/llvm-project/pull/128061.
2025-03-13 21:42:32 +00:00
Florian Hahn
02575f887b
[VPlan] Use VPInstruction for VPScalarPHIRecipe. (NFCI) (#129767)
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.

Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.

PR: https://github.com/llvm/llvm-project/pull/129767
2025-03-13 18:35:07 +00:00
Mel Chen
5d5e706691 [VPlan] Restrict hoisting of broadcast operations using VPDominatorTree (#117138)
This patch restricts broadcast operations from being hoisted to the vector
preheader unless the basic block that defines the broadcasted value properly
dominates the vector preheader.

This prevents potential use-before-definition issues when the broadcasted
value is defined within the plan. VPDominatorTree is used to confirm this
restriction while still allowing safe hoisting for broadcasted values defined
outside the plan.

Issue https://github.com/llvm/llvm-project/issues/117139
2025-03-13 07:16:04 -07:00
Mel Chen
ffe202ca00 Revert "[LV] Limits the splat operations be hoisted must not be defined by a recipe. (#117138)"
This reverts commit 1ff10fa82fff83bb2f0a5c1ffde6203b52bc9619.
2025-03-13 07:16:04 -07:00
Florian Hahn
62994c3291
[VPlan] Also introduce explicit broadcasts for values from entry VPBB.
Update and generalize materializeBroadcasts to also introduce explicit
broadcasts for VPValues defined in the Plans Entry block.

This fixes a crash when trying to insert the broadcasts generated by
VPTransformState::get after the generating instruction, which isn't
possible after invoke instructions.

Fixes https://github.com/llvm/llvm-project/issues/128838.
2025-03-12 22:03:19 +00:00
Florian Hahn
8132c4f554
[VPlan] Also introduce broadcasts for live-ins used in vec preheader.
Slightly generalize materializeLiveInBroadcasts to also introduce
broadcasts for live-ins used in the vector preheader. This should cover
all live-ins.

If the live-in is used in the vector preheader, insert the broadcast at
the beginning of the block.
2025-03-11 21:19:14 +00:00
David Sherwood
26ecf97895
[LoopVectorize] Further improve cost model for early exit loops (#126235)
Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
2025-03-11 11:48:55 +00:00
David Sherwood
055db3ec33
[LV] Optimise latch exit induction users for some early exit loops (#128880)
This is the first of two PRs that attempts to improve the IR
generated in the exit blocks of vectorised loops with uncountable
early exits. In this PR I am improving the generated code for
users of induction variables in early exit loops that have a
unique exit block, when exiting via the latch.

I have moved some of the code for calculating the exit values in
latch exit blocks from `optimizeInductionExitUsers` into a new
function `optimizeLatchExitInductionUser`.

I intend to follow this up very soon with another patch to
optimise the code for induction users in the vector.early.exit
block.
2025-03-11 10:13:16 +00:00
Mel Chen
1ff10fa82f
[LV] Limits the splat operations be hoisted must not be defined by a recipe. (#117138)
Issue https://github.com/llvm/llvm-project/issues/117139
2025-03-11 17:59:12 +08:00
Florian Hahn
f84f4e1e05
[LV] Don't crash on inner loops with RT checks in VPlan-native path.
Assert that we only generate runtime checks for inner loops in
emitMemRuntimeChecks, instead of returning nullptr in the VPlan-native
path, which is causing crashes and incorrect code.
2025-03-10 20:28:56 +00:00
Sushant Gokhale
c4808741e8
[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant (#123552)
This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236

The typical codegen for the different scenarios has been mentioned as notes/comments in the code itself( this is done owing to lot of scenarios such that it would be difficult to mention them here in the patch description).
2025-03-09 22:26:39 -07:00
Florian Hahn
437d587e48
[LV] Add outer loop test with different successor orders in inner latch. 2025-03-09 18:13:06 +00:00
Florian Hahn
fd267082ee
[VPlan] Refactor VPlan creation, add transform introducing region (NFC). (#128419)
Create an empty VPlan first, then let the HCFG builder create a plain
CFG for the top-level loop (w/o a top-level region). The top-level
region is introduced by a separate VPlan-transform. This is instead of
creating the vector loop region before building the VPlan CFG for the
input loop.

This simplifies the HCFG builder (which should probably be renamed) and
moves along the roadmap ('buildLoop') outlined in [1].

As follow-up, I plan to also preserve the exit branches in the initial
VPlan out of the CFG builder, including connections to the exit blocks.

The conversion from plain CFG with potentially multiple exits to a
single entry/exit region will be done as VPlan transform in a follow-up.

This is needed to enable VPlan-based predication. Currently early exit
support relies on building the block-in masks on the original CFG,
because exiting branches and conditions aren't preserved in the VPlan.
So in order to switch to VPlan-based predication, we will have to
preserve them in the initial plain CFG, so the exit conditions are
available explicitly when we convert to single entry/exit regions.

Another follow-up is updating the outer loop handling to also introduce
VPRegionBlocks for nested loops as transform. Currently the existing
logic in the builder will take care of creating VPRegionBlocks for
nested loops, but not the top-level loop.

[1]
https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf

PR: https://github.com/llvm/llvm-project/pull/128419
2025-03-09 15:05:35 +00:00
Florian Hahn
8dd160f476
Revert "[VPlan] Fold NOT into predicate of wide compares." (#130347)
Reverts llvm/llvm-project#129430

this seems to have introduced a divergence between legacy and
VPlan-based cost model

https://lab.llvm.org/buildbot/#/builders/30/builds/17159
2025-03-07 21:18:49 +00:00
Florian Hahn
cb3ce30ca8
[VPlan] Fold NOT into predicate of wide compares. (#129430)
Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.

Alive2 Proofs for FPCMP test changes:  https://alive2.llvm.org/ce/z/WGDz9U

PR: https://github.com/llvm/llvm-project/pull/129430
2025-03-07 20:32:43 +00:00
Ramkumar Ramachandra
ddffb74afd
[LV] Strip unreachable SCEV-check blocks (#130079)
emitSCEVChecks checks if SCEVCheckCond matches zero, and returns
nullptr. However, it sets SCEVCheckCond as used before it does this,
which prevents it from being removed during cleanup, resulting in
unreachable blocks being emitted. Fix this.
2025-03-06 19:30:25 +00:00
Luke Lau
c6e2cbe5fd [LV] Regenerate select-cmp-predicated.ll with UTC. NFC
The main select-cmp.ll tests seem to be generated with UTC after
it should probably be converted to UTC beforehand.
2025-03-06 16:20:03 +08:00
Ramkumar Ramachandra
03da079968
[LoopUtils] Saturate at INT_MAX when estimating TC (#129683)
getLoopEstimatedTripCount returns std::nullopt when the trip count would
overflow the return type, but since it is an estimate anyway, we might
as well saturate at UINT_MAX, improving results.
2025-03-05 18:19:39 +00:00
Luke Lau
5e54c92314
[VPlan] Fix crash when unrolling in-loop reduction chains (#129840)
If an in-loop reduction is chained e.g.

    WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
    REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
    REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)

When we try to unroll the second add reduction, we crash because we
currently expect the chain to be a VPReductionPHIRecipe, when in fact
it's the previous reduction. This relaxes the cast to a dyn_cast, so we
end up unrolling to:

    WIDEN-REDUCTION-PHI ir<%rdx> = phi ir<0>, ir<%add2>
    WIDEN-REDUCTION-PHI ir<%rdx>.1 = phi ir<0>, ir<%add2>.1, ir<1>
    WIDEN-REDUCTION-PHI ir<%rdx>.2 = phi ir<0>, ir<%add2>.2, ir<2>
    WIDEN-REDUCTION-PHI ir<%rdx>.3 = phi ir<0>, ir<%add2>.3, ir<3>
    REDUCE ir<%add1> = ir<%rdx> + reduce.add (ir<%x>)
    REDUCE ir<%add1>.1 = ir<%rdx>.1 + reduce.add (ir<%x>.1)
    REDUCE ir<%add1>.2 = ir<%rdx>.2 + reduce.add (ir<%x>.2)
    REDUCE ir<%add1>.3 = ir<%rdx>.3 + reduce.add (ir<%x>.3)
    REDUCE ir<%add2> = ir<%add1> + reduce.add (ir<%y>)
    REDUCE ir<%add2>.1 = ir<%add1>.1 + reduce.add (ir<%y>.1)
    REDUCE ir<%add2>.2 = ir<%add1>.2 + reduce.add (ir<%y>.2)
    REDUCE ir<%add2>.3 = ir<%add1>.3 + reduce.add (ir<%y>.3)

This fixes a crash when building 525.x264_r from SPEC CPU 2017 on
AArch64 with -mllvm -prefer-inloop-reductions
2025-03-05 19:13:23 +08:00
Krzysztof Drewniak
e697c99b63
[AMDGPU] Add custom MachineValueType entries for buffer fat poiners (#127692)
The old hack of returning v5/v6i32 for the fat and strided buffer
pointers was causing issuse during vectorization queries that expected
to be able to construct a VectorType from the return value of `MVT
getPointerType()`. On example is in the test attached to this PR, which
used to crash.

Now, we define the custom MVT entries, the 160-bit
amdgpuBufferFatPointer and 192-bit amdgpuBufferStridedPointer, which are
used to represent ptr addrspace(7) and ptr addrspace(9) respectively.

Neither of these types will be present at the time of lowering to a
SelectionDAG or other MIR - MVT::amdgpuBufferFatPointer is eliminated by
the LowerBufferFatPointers pass and amdgpu::bufferStridedPointer is not
currently used outside of the SPIR-V translator (which does its own
lowering).

An alternative solution would be to add MVT::i160 and MVT::i192. We
elect not to do this now as it would require changes to unrelated code
and runs the risk of breaking any SelectionDAG code that assumes that
the MVT series are all powers of two (and so can be split apart and
merged back together) in ways that wouldn't be obvious if someone tried
to use MVT::i160 in codegen. If i160 is added at some future point,
these custom types can be retired.
2025-03-04 17:19:06 -06:00
Jim Lin
03505a004f
[RISCV] Enable scalable loop vectorization for fmax/fmin reductions with f16/bf16 type for zvfhmin/zvfbfmin (#129629)
This PR enable scalable loop vectorization for fmax and fmin reductions
with f16/bf16 type when only zvfhmin/zvfbfmin are enabled.

After https://github.com/llvm/llvm-project/pull/128800, we can promote
the fmax/fmin reductions with f16/bf16 type to f32 reductions for
zvfhmin/zvfbfmin.
2025-03-04 16:49:24 +08:00
Ramkumar Ramachandra
80bdfcd411
[LoopUtils] Don't wrap in getLoopEstimatedTripCount (#129080)
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
2025-03-04 08:43:08 +00:00
Mel Chen
9b4ad2fe50
[LV][EVL] Support fixed-order recurrence idiom with EVL tail folding. (#124093)
This patch converts the llvm.vector.splice intrinsic to
llvm.experimental.vp.splice, ensuring that fixed-order recurrences
execute correctly when tail folding by EVL is enable.
Due to the non-VFxUF penultimate EVL issue, the EVL from the previous
iteration will be preserved and used in llvm.experimental.vp.splice.
2025-03-03 21:27:13 +08:00
Florian Hahn
f937b17e85
[LV] Don't query SCEV for non-invariant values in cost model.
This fixes a divergence between VPlan and legacy cost model, matching
behavior further up in getInstructionCost as well.

Fixes https://github.com/llvm/llvm-project/issues/129236.
2025-03-02 10:55:52 +00:00
Benjamin Maxwell
89e7f4d31b
[LV] Teach the vectorizer to cost and vectorize modf and sincospi intrinsics (#129064)
Follow on to #128035. It is a small extension to support vectorizing
`llvm.modf.*` and `llvm.sincospi.*` too.

This renames the test files from `sincos.ll` ->
`multiple-result-intrinsics.ll` to group together the similar tests
(which make up most of this PR).
2025-02-28 12:56:12 +00:00
Florian Hahn
6ce41db6b0
[VPlan] Preserve DebugLoc for VPBranchOnMaskRecipe.
Update code to set and generate debug location for branch recipe
2025-02-27 20:19:42 +00:00
Florian Hahn
1e1b9bccc0
[VPlan] Simplify BLEND %a, %b, NOT(%m) -> BLEND %b, %a, %m. (#128375)
Avoid negations for normalized blends by reordering operands.

PR: https://github.com/llvm/llvm-project/pull/128375
2025-02-27 17:43:24 +00:00
Florian Hahn
649f4dcc19
[LV] Fix tests after 8150ab93f741.
PR #124119 wasn't rebased & tested before merging. Update the failing
tests.
2025-02-27 12:15:24 +00:00
John Brawn
8150ab93f7
[LoopVectorize] Use CodeSize as the cost kind for minsize (#124119)
Functions marked with minsize should aim for minimum code size, so the
vectorizer should use CodeSize for the cost kind and also the cost we
compare should be the cost for the entire loop: it shouldn't be divided
by the number of vector elements and block costs shouldn't be divided by
the block probability.

Possibly we should also be doing this for optsize as well, but there are
a lot of tests that assume the current behaviour and the definition of
optsize is less clear than minsize (for minsize the goal is to "keep the
code size of this function as small as possible" whereas for optsize
it's "keep the code size of this function low").
2025-02-27 11:07:02 +00:00
Benjamin Maxwell
3307b0374a
[LV] Teach the loop vectorizer llvm.sincos is trivially vectorizable (#128035)
Depends on #123210
2025-02-27 09:37:06 +00:00
Florian Hahn
7b6abd827f
[LV] Remove stray check lines after be28365ca78. 2025-02-26 21:14:01 +00:00
Florian Hahn
be28365ca7
[LV] Generate check lines for if-conversion.ll
The limited check lines make it difficult to reason about test changes
in https://github.com/llvm/llvm-project/pull/128375.
2025-02-26 20:39:58 +00:00
Florian Hahn
4277c21059
[VPlan] Introduce explicit broadcasts for live-ins. (#124644)
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().

PR: https://github.com/llvm/llvm-project/pull/124644
2025-02-26 13:57:51 +00:00
Elvis Wang
8009c1fd81
[LV][VPlan] Prevent calculate cost for skiped instructions in precomputeCosts(). (#127966)
Skip calculating instruction costs for exit conditions in
precomputeCosts() when it should be skipped.

Reported from:
https://github.com/llvm/llvm-project/issues/115744#issuecomment-2670479463
Godbolt for reduced test cases: https://godbolt.org/z/fr4YMeqcv
2025-02-25 11:09:09 +08:00
Florian Hahn
65e44b4301 [LV] Add tests with deref assumptions and non-constant sizes. 2025-02-23 18:21:22 +00:00
Luke Lau
e23ab73335
[VPlan] Don't convert widen recipes to VP intrinsics in EVL transform (#127180)
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote

This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.

IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.

Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
2025-02-22 19:38:11 +08:00
Florian Hahn
52ded67249
[LAA] Always require non-wrapping pointers for runtime checks. (#127543)
Currently we only check if the pointers involved in runtime checks do
not wrap if we need to perform dependency checks. If that's not the
case, we generate runtime checks, even if the pointers may wrap (see
test/Analysis/LoopAccessAnalysis/runtime-checks-may-wrap.ll).

If the pointer wraps, then we swap start and end of the runtime check,
leading to incorrect checks.

An Alive2 proof of what the runtime checks are checking conceptually (on
i4 to have it complete in reasonable time) showing the incorrect result
should be https://alive2.llvm.org/ce/z/KsHzn8

Depends on https://github.com/llvm/llvm-project/pull/127410 to avoid
more regressions.

PR: https://github.com/llvm/llvm-project/pull/127543
2025-02-20 19:00:23 +01:00
Florian Hahn
404af37175 [VPlan] Remove stale assertion in HCFG builder.
The assertion was left over from a time when VPBBs still had an
associated condition bit. This is not the case any more (comment was
stale). In case a branch on condition is needed, a BranchOnCond
VPInstruction is added when constructing recipes. That's also where it
is checked if the condition is available.

Exposed by 38376dee9.
2025-02-20 17:01:49 +01:00
Florian Hahn
04b5c63ddf [LV] Add inbounds to interleave test.
In preparation for https://github.com/llvm/llvm-project/pull/127543
2025-02-20 16:33:01 +01:00
Ramkumar Ramachandra
3b9f9645e0
[LV] Regen a couple of tests with UTC (#127785) 2025-02-20 15:27:51 +00:00
Florian Hahn
a96444af44 [VPlan] Remove dead exit block handling code in HCFGBuilder.
The mapping of IR ExitBB to a VPBB isn't used. It also sets an incorrect
VPBB for the ExitBB; the regions successor is the middle block, no the
exit block.

It also unnecessarily triggers an assertion after 38376dee922.
2025-02-19 18:51:45 +01:00