467 Commits

Author SHA1 Message Date
Sam Tebbs
664b227089
[LV] Keep duplicate recipes in VPExpressionRecipe (#156976)
The VPExpressionRecipe class uses a set to store its bundled recipes. If
repeated recipes are bundled then the duplicates will be lost, causing
the following recipes to not be at the expected place in the set.

When printing a reduce.add(mul(ext, ext)) bundle, for example, if the
extends are the same then the 3rd element of the set will be the
reduction, rather than the expected mul, causing a cast error. With this
change, the recipes are at the expected index in the set.

Fixes #156464
2025-10-01 16:01:54 +01:00
Florian Hahn
f61be43525
Revert "[VPlan] Compute cost of more replicating loads/stores in ::computeCost. (#160053)"
This reverts commit b4be7ecaf06bfcb4aa8d47c4fda1eed9bbe4ae77.

See https://github.com/llvm/llvm-project/issues/161404 for a crash
exposed by the change. Revert while I investigate.
2025-09-30 22:13:06 +01:00
Sam Tebbs
88658dbbc5
[LV] Add ExtNegatedMulAccReduction expression type (#160154)
This PR adds the ExtNegatedMulAccReduction expression type for
VPExpressionRecipe so that extend-multiply-accumulate reductions with a
negated multiply can be bundled.

Stacked PRs:

1. https://github.com/llvm/llvm-project/pull/156976
2. -> https://github.com/llvm/llvm-project/pull/160154
3. https://github.com/llvm/llvm-project/pull/147302
2025-09-30 10:10:37 +01:00
Florian Hahn
b4be7ecaf0
[VPlan] Compute cost of more replicating loads/stores in ::computeCost. (#160053)
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.

There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.

Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.

PR: https://github.com/llvm/llvm-project/pull/160053
2025-09-29 08:08:09 +00:00
Florian Hahn
8460dbb450
[VPlan] Mark VPInstruction::Broadcast as not reading/writing memory.
This enables additional DCE/CSE opportunities and ensures that we don't
end up with multiple redundant users of a VPInstruction using EVL. It
fixes a verifier error in the added test_3_inductions test.
2025-09-27 20:48:42 +01:00
Luke Lau
7275c178bd
[VPlan] Fix packed replication of struct types (#160274)
I ran into this crash when #158690 caused a loop with a struct call to
be vectorized.

If we have a replicate recipe in a branch-on-mask predicated region
that's used by a widened recipe in another block then it will be packed
together with the other lanes via a VPPredInstPHIRecipe.

If we're replicating a call with a struct return type then we currently
crash. The code that handles structs in packScalarIntoVectorizedValue
seemed to be untested at least on test/Transforms/LoopVectorize.

There's two places that need to be fixed. The poison value that the
scalar is packed into needs to use toVectorizedTy to correctly handle
structs (not to be confused with toVectorTy!)

The other is that VPPredInstPHIRecipe expects its operand to be an
InsertElementInstr when stringing together the different lanes. For
structs this will be an InsertVlaueInstr, and the value for the previous
lane will be at the back of a chain of InsertValueInstrs.
2025-09-26 02:22:15 +00:00
Florian Hahn
70a26da639
[VPlan] Set correct flags when creating and cloning VPWidenCastRecipe.
Make sure that we set the correct wrap flags when creating new
VPWidenCastRecipes for truncs and preserve the flags from the recipe
directly when cloning, to make sure they are not dropped.

Fixes https://github.com/llvm/llvm-project/issues/160396
2025-09-25 09:00:47 +01:00
Ramkumar Ramachandra
66c35ebf3c
[VPlan] Avoid branching around State.get (NFC) (#159042) 2025-09-22 10:31:16 +01:00
Ramkumar Ramachandra
019913e4fa
[VPlan] Add WidenGEP::getSourceElementType (NFC) (#159029) 2025-09-22 10:02:08 +01:00
Sander de Smalen
17e008db17
[IR] NFC: Remove 'experimental' from partial.reduce.add intrinsic (#158637)
The partial reduction intrinsics are no longer experimental, because
they've been used in production for a while and are unlikely to change.
2025-09-17 11:44:47 +01:00
Florian Hahn
1858532c48
[VPlan] Handle predicated UDiv in VPReplicateRecipe::computeCost.
Account for predicated UDiv,SDiv,URem,SRem in
VPReplicateRecipe::computeCost: compute costs of extra phis and apply
getPredBlockCostDivisor.

Fixes https://github.com/llvm/llvm-project/issues/158660
2025-09-15 21:46:50 +01:00
Ramkumar Ramachandra
148a83543b
[LV] Introduce m_One and improve (0|1)-match (NFC) (#157419) 2025-09-15 10:34:06 +00:00
Florian Hahn
fb60d0337c
[VPlan] Return non-option cost from getCostForRecipeWithOpcode (NFC).
getCostForRecipeWithOpcode must only be called with supported opcodes.
Directly return the cost, and add llvm_unreachable to catch unhandled
cases.
2025-09-14 22:24:57 +01:00
Florian Hahn
91d4c0dfdf
Reapply "[VPlan] Compute cost of scalar (U|S)Div, (U|S)Rem in computeCost (NFCI)."
This reverts commit 9490d58fa92bb338db96af331194c9ba26eb0201.

Recommits de7e3a58952 with a fix for an unhandled case, causing crashes
in some configs.
2025-09-14 13:15:07 +01:00
Aiden Grossman
9490d58fa9 Revert "[VPlan] Compute cost of scalar (U|S)Div, (U|S)Rem in computeCost (NFCI)."
This reverts commit de7e3a589525179f3b02b84b194aac6cf581425c.

This broke quite a few upstream buildbots and premerge. Reverting for now to
get things back to green.

https://lab.llvm.org/buildbot/#/builders/137/builds/25467
2025-09-13 22:32:48 +00:00
Florian Hahn
de7e3a5895
[VPlan] Compute cost of scalar (U|S)Div, (U|S)Rem in computeCost (NFCI).
Directly compute the cost of UDiv, SDiv, URem, SRem in VPlan.
2025-09-13 22:09:06 +01:00
Florian Hahn
30e9cbacab
[VPlan] Move logic to compute scalarization overhead to cost helper(NFC)
Extract the logic to compute the scalarization overhead to a helper for
easy re-use in the future.
2025-09-13 20:41:44 +01:00
Florian Hahn
b8eaceb39b
[VPlan] Explicitly replicate VPInstructions by VF. (#155102)
Extend replicateByVF added in #142433 (aa240293190) to also explicitly
unroll replicating VPInstructions.

Now the only remaining case where we replicate for all lanes is
VPReplicateRecipes in replicate regions.

PR: https://github.com/llvm/llvm-project/pull/155102
2025-09-12 17:06:26 +01:00
Florian Hahn
c3e76b2770
[VPlan] Keep common flags during CSE. (#157664)
During CSE, we don't have to drop all poison-generating flags on
mis-match, we can keep the ones common on both recipes.

PR: https://github.com/llvm/llvm-project/pull/157664
2025-09-10 10:20:48 +00:00
David Sherwood
ba4ce60f1a
[LV] Add scalar load/stores to VPReplicateRecipe::computeCost (#153218)
Avoid calling getLegacyCost for single scalar loads and stores where the
cost is trivial to calculate.
2025-09-05 11:52:07 +01:00
Sam Tebbs
37127f74f4
[LV] Bundle sub reductions into VPExpressionRecipe (#147255)
This PR bundles sub reductions into the VPExpressionRecipe class and
adjusts the cost functions to take the negation into account.

Stacked PRs:
1. https://github.com/llvm/llvm-project/pull/147026
2. -> https://github.com/llvm/llvm-project/pull/147255
3. https://github.com/llvm/llvm-project/pull/147302
4. https://github.com/llvm/llvm-project/pull/147513
2025-09-01 17:25:01 +01:00
Mel Chen
13357e8a12
[LV][EVL] Support interleaved access with tail folding by EVL (#152070)
The InterleavedAccess pass already supports transforming
vector-predicated (vp) load/store intrinsics. With this patch, we start
enabling interleaved access under tail folding by EVL.

This patch introduces a new base class, VPInterleaveBase, and a concrete
class, VPInterleaveEVLRecipe. Both the existing VPInterleaveRecipe and
the new VPInterleaveEVLRecipe inherit from and implement
VPInterleaveBase.

Compared to VPInterleaveRecipe, VPInterleaveEVLRecipe adds an EVL
operand to emit vp.load/vp.store intrinsics.

Currently, tail folding by EVL is only supported for scalable
vectorization. Therefore, VPInterleaveEVLRecipe will only emit
interleave/deinterleave intrinsics. Reverse accesses are not yet
implemented, as masked reverse interleaved access under tail folding is
not yet supported.

Fixed #123201
2025-09-01 21:20:06 +08:00
Kerry McLaughlin
f0e9bba024
[LoopVectorize] Generate wide active lane masks (#147535)
This patch adds a new flag (-enable-wide-lane-mask) which allows
LoopVectorize to generate wider-than-VF active lane masks when it
is safe to do so (i.e. the mask is used for data and control flow).

The transform in extractFromWideActiveLaneMask creates vector
extracts from the first active lane mask in the header & loop body,
modifying the active lane mask phi operands to use the extracts.

An additional operand is passed to the ActiveLaneMask instruction,
the value of which is used as a multiplier of VF when generating the
mask.
By default this is 1, and is updated to UF by
extractFromWideActiveLaneMask.

The motivation for this change is to improve interleaved loops when
SVE2.1 is available, where we can make use of the whilelo instruction
which returns a predicate pair.

This is based on a PR that was created by @momchil-velikov (#81140)
and contains tests which were added there.
2025-09-01 13:53:30 +01:00
Florian Hahn
df098796ec
[VPlan] Compute cost of intrinsics directly for VPReplicateRecipe (NFCI). (#154617)
Handle intrinsic calls in VPReplicateRecipe::computeCost. There are some
intrinsics pseudo intrinsics for which the computed cost is known zero,
so we handle those up front.

Depends on https://github.com/llvm/llvm-project/pull/154291.

PR: https://github.com/llvm/llvm-project/pull/154617
2025-08-27 21:40:47 +01:00
Florian Hahn
5e32f728ec
[VPlan] Move logic to compute cost for intrinsic to helper (NFC).
Refactor to prepare for https://github.com/llvm/llvm-project/pull/154617.
2025-08-27 19:26:34 +01:00
Florian Hahn
c3470d1cdd
[VPlan] Compute cost of replicating calls in VPlan. (NFCI) (#154291)
Implement computing the scalarization overhead for replicating calls in
VPlan, matching the legacy cost model.

Depends on https://github.com/llvm/llvm-project/pull/154126. 

PR: https://github.com/llvm/llvm-project/pull/154291
2025-08-26 13:37:49 +01:00
David Sherwood
d606eae2ce
[LV] Stop using the legacy cost model for udiv + friends (#152707)
In VPWidenRecipe::computeCost for the instructions udiv, sdiv, urem and
srem we fall back on the legacy cost unnecessarily. At this point we
know that the vplan must be functionally correct, i.e. if the
divide/remainder is not safe to speculatively execute then we must have
either:

1. Scalarised the operation, in which case we wouldn't be using a
VPWidenRecipe, or
2. We've inserted a select for the second operand to ensure we don't
fault through divide-by-zero.

For 2) it's necessary to add the select operation to
VPInstruction::computeCost so that we mirror the cost of the legacy cost
model. The only problem with this is that we also generate selects in
vplan for predicated loops with reductions, which *aren't* accounted for
in the legacy cost model. In order to prevent asserts firing I've also
added the selects to precomputeCosts to ensure the legacy costs match
the vplan costs for reductions.
2025-08-26 10:17:23 +01:00
Elvis Wang
ed52bdd453
[VPlan] Get Addr computation cost with scalar type if it is uniform for gather/scatter. (NFC) (#150371)
This patch query `getAddressComputationCost()` with scalar type if the
address is uniform. This can help the cost for gather/scatter more
accurate.

In current LV, non consecutive VPWidenMemoryRecipe (gather/scatter) will
account the cost of address computation. But there are some cases that
the address is uniform across all lanes, that makes the address can be
calculated with scalar type and broadcast.

I have a followup optimization that tries to convert gather/scatter with
uniform memory access to scalar load/store + broadcast (and select if
needed). With this optimization, we can remove this temporary change.

This patch is preparation for #149955 to prevent regressions.
2025-08-26 09:04:15 +08:00
Florian Hahn
c950a72974
[VPlan] Support scalar VF for ExtractLane and FirstActiveLane.
Extend ExtractLane and FirstActiveLane to support scalable VFs. This
allows correct handling when interleaving with VF = 1.

Alive2 proofs:
 - Fixed codegen with this patch: https://alive2.llvm.org/ce/z/8Y5_Vc
   (verifies as correct)
 - Original codegen: https://alive2.llvm.org/ce/z/twdg3X (doesn't
   verify)

Fixes https://github.com/llvm/llvm-project/issues/154967.
2025-08-25 21:45:21 +01:00
Florian Hahn
f492eb9509
[VPlan] Make VPInstruction::AnyOf poison-safe. (#154156)
AnyOf reduces multiple input vectors to a single boolean value. When
used for early-exit vectorization, we need to consider any lane after
the early exit being poison. Any poison lane would result in poison
after the AnyOf reduction. To prevent this, freeze all inputs to AnyOf.

Fixes https://github.com/llvm/llvm-project/issues/153946.
Fixes https://github.com/llvm/llvm-project/issues/155162.

https://alive2.llvm.org/ce/z/FD-XxA

PR: https://github.com/llvm/llvm-project/pull/154156
2025-08-25 18:55:23 +01:00
Florian Hahn
d84be8a9b4
[VPlan] Get Cmp cost via getCostForRecipeWithOp for VPReplicateR (NFCI).
Use common getCostForRecipeWithOpcode to get the cost for ICmp/FCmp.
2025-08-24 19:43:22 +01:00
Ramkumar Ramachandra
66be00d635
[VPlan] Introduce m_Cmp; match more compares (#154771)
Extend [Specific]Cmp_match to handle floating-point compares, and
introduce m_Cmp that matches both integer and floating-point compares.
Use it in simplifyRecipe to match and simplify the general case of
compares. The change has necessitated a bugfix in
VPReplicateRecipe::execute.
2025-08-24 13:27:06 +01:00
Florian Hahn
300d2c6d20
[VPlan] Move SCEV expansion to VPlan transform. (NFCI).
Move the logic to expand SCEVs directly to a late VPlan transform that
expands SCEVs in the entry block. This turns VPExpandSCEVRecipe into an
abstract recipe without execute, which clarifies how the recipe is
handled, i.e. it is not executed like regular recipes.

It also helps to simplify construction, as now scalar evolution isn't
required to be passed to the recipe.
2025-08-21 22:03:26 +01:00
Florian Hahn
e41aaf5a64
[VPlan] Use VPIRMetadata for VPInterleaveRecipe. (#153084)
Use VPIRMetadata for VPInterleaveRecipe to preserve noalias metadata
added by versioning.

This still uses InterleaveGroup's logic to preserve existing metadata
from IR. This can be migrated separately.

Fixes https://github.com/llvm/llvm-project/issues/153006.

PR: https://github.com/llvm/llvm-project/pull/153084
2025-08-21 18:58:10 +01:00
Luke Lau
955c475ae6
[VPlan] Add m_Sub to VPlanPatternMatch. NFC (#154705)
To mirror PatternMatch.h, and we'll also be able to use it in #152167
2025-08-21 09:33:46 +00:00
Ramkumar Ramachandra
0db57ab586
[VPlan] Improve code using onlyScalarValuesUsed (NFC) (#154564) 2025-08-20 22:38:00 +01:00
Florian Hahn
35be64a416
[VPlan] Factor out logic to common compute costs to helper (NFCI). (#153361)
A number of recipes compute costs for the same opcodes for scalars or
vectors, depending on the recipe.

Move the common logic out to a helper in VPRecipeWithIRFlags, that is
then used by VPReplicateRecipe, VPWidenRecipe and VPInstruction.

This makes it easier to cover all relevant opcodes, without duplication.

PR: https://github.com/llvm/llvm-project/pull/153361
2025-08-20 16:05:20 +01:00
David Sherwood
13d8ba7dea
[LV][TTI] Calculate cost of extracting last index in a scalable vector (#144086)
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.

To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector. Suppose a vector has
a given ElementCount EC, the extracted/inserted lane would be
EC - 1 - Index. For scalable vectors this index is unknown at
compile time. I've added a AArch64 hook that better represents
the cost, and also a RISCV hook that maintains compatibility
with the behaviour prior to this PR.

I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
2025-08-19 09:31:37 +01:00
Mel Chen
1dac302ce7
[LV] Explicitly disallow interleaved access requiring gap mask for scalable VFs. nfc (#154122)
Currently, VPInterleaveRecipe::execute does not support generating LLVM
IR for interleaved accesses that require a gap mask for scalable VFs.
It would be better to detect and prevent such groups from being
vectorized as interleaved accesses in
LoopVectorizationCostModel::interleavedAccessCanBeWidened, rather than
relying on the TTI function getInterleavedMemoryOpCost to return an
invalid cost.
2025-08-19 08:42:39 +08:00
Florian Hahn
79be94c984
[VPlan] Compute cost single-scalar calls in computeCost. (NFC)
Compute the cost of non-intrinsic, single-scalar calls directly in
VPReplicateRecipe::computeCost.

This starts moving call cost computations to VPlan, handling the
simplest case first.
2025-08-18 21:56:56 +01:00
Florian Hahn
7e9989390d
[VPlan] Materialize Build(Struct)Vectors for VPReplicateRecipes. (NFCI) (#151487)
Materialze Build(Struct)Vectors explicitly for VPRecplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.

Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code-path alltogether.

PR: https://github.com/llvm/llvm-project/pull/151487
2025-08-18 20:49:42 +01:00
David Sherwood
7ee6cf06c8
[LV] Fix incorrect cost kind in VPReplicateRecipe::computeCost (#153216)
We were incorrectly using the TTI::TCK_RecipThroughput cost kind and
ignoring the kind set in the context.
2025-08-18 09:52:31 +01:00
Florian Hahn
177f27d220
[VPlan] Add incoming_[blocks,values] iterators to VPPhiAccessors (NFC) (#138472)
Add 3 new iterator ranges to VPPhiAccessors

* incoming_values(): returns a range over the incoming
  values of a phi 
* incoming_blocks(): returns a range over the incoming 
  blocks of a phi
* incoming_values_and_blocks: returns a range over pairs of
   incoming values and blocks.

Depends on https://github.com/llvm/llvm-project/pull/124838.

PR: https://github.com/llvm/llvm-project/pull/138472
2025-08-14 16:47:04 +01:00
Elvis Wang
01fac67e2a
[TTI] Add cost kind to getAddressComputationCost(). NFC. (#153342)
This patch add cost kind to `getAddressComputationCost()` for #149955.

Note that this patch also remove all the default value in `getAddressComputationCost()`.
2025-08-14 16:01:44 +08:00
Florian Hahn
424258947e
[VPlan] Materialize VF and VFxUF using VPInstructions. (#152879)
Materialize VF and VFxUF computation using VPInstruction
instead of directly creating IR.

This is one of the last few steps needed to model the full vector
skeleton in VPlan.

This is mostly NFC, although in some cases we remove some unused
computations.

PR: https://github.com/llvm/llvm-project/pull/152879
2025-08-12 14:13:13 +01:00
Sam Tebbs
0bfa1718af
[LV] Create in-loop sub reductions (#147026)
This PR allows the loop vectorizer to handle in-loop sub reductions by
forming a normal in-loop add reduction with a negated input.

Stacked PRs:
1. -> https://github.com/llvm/llvm-project/pull/147026
2. https://github.com/llvm/llvm-project/pull/147255
3. https://github.com/llvm/llvm-project/pull/147302
4. https://github.com/llvm/llvm-project/pull/147513
2025-08-12 10:22:41 +01:00
Luke Lau
acb86fb9e0
[TTI] Consistently pass the pointer type to getAddressComputationCost. NFCI (#152657)
In some places we were passing the type of value being accessed, in
other cases we were passing the type of the pointer for the access.

The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.

This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.

No target actually checks the contents of the type passed, only to see
if it's a vector or not, so this shouldn't have an effect.
2025-08-11 18:00:12 +08:00
Elvis Wang
37fe7a9933
[LV] Generate scalar xor for VPInstruction::Not if possible. (#152628)
`VPInstruction::Not` which will generate xor instruction is widely used
for the exit condition. This patch make `VPInstruction::Not` generate
scalar `xor` if possible.

This can help reducing the (splat true) in the `xor` and make `xor` be
scalar.
2025-08-11 16:35:21 +08:00
Florian Hahn
86813aa786
[VPlan] Add dedicated user for resume phi with epilogue vectorization.
Epilogue vectorization currently relies on the resume phi for the
canonical induction being always available, which is why VPPhi are
considered to have side-effects, to prevent their removal.

This patch adds a new ResumeForEpilogue opcode to mark the resume phi as
used for epilogue vectorization. This allows treating VPPhis in general
as not having side-effects, enabling removal of unused VPPhis.
2025-08-10 21:21:16 +01:00
Florian Hahn
47258ca470
[VPlan] Use VPPhi instead of dyn_cast + opcode check in isPhi (NFC). 2025-08-05 19:20:12 +01:00