1019 Commits

Author SHA1 Message Date
David Green
6ad25c5912
[AArch64] Improve the cost model for extending mull (#125651)
We already have cost model code for detecting extending mull multiplies
of the form `mul(ext, ext)`. Since it was added, the codegen for mull
has improved; this patch attempts to catch the cost model up.

The main idea is to incorporate extends of larger sizes. A vector `v8i32
mul(zext(v8i8), zext(v8i8))` will be code-generated as `zext(v8i16
mul(zext(v8i8), zext(v8i8)))`, or umull+ushll+ushll2.

So the total cost should be 3ish if each instruction costs 1. Where
exactly we attribute the costs is debatable; this patch opts to set
the cost of the extend to 0 (or the cost of the extend not included in
the mull), and the mul gets the cost of the mull+extra extends.
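
As a rough illustration of the reasoning above (not code from the patch), the
sketch below checks that multiplying two zero-extended i8 values in i32 gives
the same result as multiplying in i16 and then extending, which is why the
umull+ushll+ushll2 sequence is a valid lowering. The names `mulWide`,
`mulNarrowThenExtend`, `CostSplit` and `unitCostSplit` are hypothetical, and the
unit-cost split only mirrors the "extend costs 0, mul costs 3" attribution
described in this message.

```cpp
#include <cassert>
#include <cstdint>

// Wide form: zero-extend both i8 operands to i32, then multiply.
std::uint32_t mulWide(std::uint8_t a, std::uint8_t b) {
  return static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
}

// Narrow form: multiply in i16 (one lane of a umull), then zero-extend the
// i16 product to i32 (what ushll/ushll2 do for the low/high halves).
std::uint32_t mulNarrowThenExtend(std::uint8_t a, std::uint8_t b) {
  std::uint16_t narrow = static_cast<std::uint16_t>(
      static_cast<std::uint16_t>(a) * static_cast<std::uint16_t>(b));
  return static_cast<std::uint32_t>(narrow);
}

// Exhaustively compare both forms over all pairs of i8 inputs.
bool formsAgree() {
  for (unsigned a = 0; a <= 255; ++a)
    for (unsigned b = 0; b <= 255; ++b)
      if (mulWide(static_cast<std::uint8_t>(a), static_cast<std::uint8_t>(b)) !=
          mulNarrowThenExtend(static_cast<std::uint8_t>(a),
                              static_cast<std::uint8_t>(b)))
        return false;
  return true;
}

// Hypothetical cost attribution, assuming every instruction costs 1.
struct CostSplit {
  unsigned extendA; // first zext, folded into the mull
  unsigned extendB; // second zext, folded into the mull
  unsigned mul;     // umull + ushll + ushll2
  unsigned total() const { return extendA + extendB + mul; }
};

CostSplit unitCostSplit() { return CostSplit{0, 0, 3}; }
```

The forms agree because the largest product, 255 * 255 = 65025, still fits in
16 bits, so nothing is lost by multiplying narrow and extending afterwards.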

isWideningInstruction is split into two functions for the two types of
operands it supports. isSingleExtWideningInstruction now handles addw
instructions that extend the second operand; isBinExtWideningInstruction
is for instructions like addl that extend both operands.
2025-11-04 07:50:51 +00:00
Sander de Smalen
f17c95ba54 [LV] Simplify vplan-printing.ll test (NFC)
This simplifies the test by moving some of the complicated options
to loop attributes, so that it's easier to extend the test file
with new cases.

The options `-enable-epilogue-vectorization` and
`-epilogue-vectorization-force-VF=2` were not strictly necessary
for the test.
2025-11-03 08:34:23 +00:00
Florian Hahn
b7e922a3da
[VPlan] Convert BuildVector with all-equal values to Broadcast. (#165826)
Fold BuildVector where all operands are equal to Broadcast of the first
operand. This will subsequently make it easier to remove additional
buildvectors/broadcasts, e.g. via
https://github.com/llvm/llvm-project/pull/165506.

PR: https://github.com/llvm/llvm-project/pull/165826
2025-11-01 17:28:42 -07:00
Florian Hahn
683b00bb50
[VPlan] Limit VPScalarIVSteps to step == 1 in getSCEVExprForVPValue.
For now, just support VPScalarIVSteps with step == 1 in
getSCEVExprForVPValue. This fixes a crash when the step would be != 1.
2025-10-31 02:22:56 +00:00
Hassnaa Hamdi
be29f0dd86
[LV]: Improve accuracy of calculating remaining iterations of MainLoopVF (#156723)
Transform TC and VF to same numerical space when they are different.
2025-10-26 14:45:44 +00:00
Florian Hahn
301fa24671 [VPlan] Limit narrowInterleaveGroups to single block regions for now.
Currently only regions with a single block are supported by the legality
checks.
2025-10-23 23:55:59 +01:00
Florian Hahn
4ec5852c1d [LV] Add tests for narrowing interleave groups with multiple blocks.
Add additional test coverage for narrowInterleaveGroups with loops with
multiple blocks.
2025-10-23 22:54:03 +01:00
paperchalice
249883d0c5
[test][Transforms] Remove unsafe-fp-math uses part 2 (NFC) (#164786)
Post cleanup for #164534.
2025-10-23 20:31:31 +08:00
Sam Tebbs
6b19a546aa
[LV] Bundle partial reductions inside VPExpressionRecipe (#147302)
This PR bundles partial reductions inside the VPExpressionRecipe class.

Stacked PRs:
1. https://github.com/llvm/llvm-project/pull/147026
2. https://github.com/llvm/llvm-project/pull/147255
3. https://github.com/llvm/llvm-project/pull/156976
4. https://github.com/llvm/llvm-project/pull/160154
5. -> https://github.com/llvm/llvm-project/pull/147302
6. https://github.com/llvm/llvm-project/pull/162503
7. https://github.com/llvm/llvm-project/pull/147513
2025-10-23 11:18:55 +00:00
Florian Hahn
bfc322dd72
Revert "[VPlan] Run narrowInterleaveGroups during general VPlan optimizations. (#149706)"
This reverts commit 8d29d09309654541fb2861524276ada6a3ebf84c.

There have been reports of mis-compiles
in https://github.com/llvm/llvm-project/pull/149706.

Revert while I investigate.
2025-10-22 21:27:11 +01:00
Kerry McLaughlin
45c0b29171
[LV] Ignore user-specified interleave count when unsafe. (#153009)
When a VF is specified via a loop hint, it will be clamped to a safe
VF or ignored if it is found to be unsafe. This is not the case for
user-specified interleave counts, which can lead to loops such as
the following with a memory dependence being vectorised with
interleaving:

```
#pragma clang loop interleave_count(4)
for (int i = 4; i < LEN; i++)
    b[i] = b[i - 4] + a[i];
```

According to [1], loop hints are ignored if they are not safe to apply.

This patch adds a check to prevent vectorisation with interleaving if
isSafeForAnyVectorWidth() returns false. This is already checked in
selectInterleaveCount().

[1]
https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
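
To see why the loop in this message is unsafe to widen arbitrarily, here is a
small toy model (an assumption-laden sketch, not the vectoriser's actual
behaviour). `runWidened` is a hypothetical stand-in for a loop whose effective
width is VF times the interleave count: it performs all loads of a step before
any store of that step. The dependence distance in the loop is 4, so widths up
to 4 match the scalar loop and larger widths can diverge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics: the scalar loop from the commit message.
std::vector<int> runScalar(std::vector<int> b, const std::vector<int> &a) {
  assert(a.size() == b.size());
  for (std::size_t i = 4; i < b.size(); ++i)
    b[i] = b[i - 4] + a[i];
  return b;
}

// Toy model of a widened loop: each step loads `width` values of b[i - 4]
// up front, then stores `width` results. The final step is shortened so the
// loop never runs past the end of the arrays.
std::vector<int> runWidened(std::vector<int> b, const std::vector<int> &a,
                            std::size_t width) {
  assert(a.size() == b.size() && width > 0);
  for (std::size_t i = 4; i < b.size(); i += width) {
    std::size_t n = std::min(width, b.size() - i);
    std::vector<int> loaded(n);
    for (std::size_t k = 0; k < n; ++k)
      loaded[k] = b[i - 4 + k]; // all loads happen before any store
    for (std::size_t k = 0; k < n; ++k)
      b[i + k] = loaded[k] + a[i + k];
  }
  return b;
}
```

With 16 elements all initialised to 1, `runWidened(b, a, 4)` matches
`runScalar(b, a)`, while `runWidened(b, a, 8)` reads stale values of `b` and
produces a different result.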
2025-10-22 15:21:27 +01:00
Florian Hahn
aca53f4375
[VPlan] Skip masked interleave groups in narrowInterleaveGroups.
8d29d09309 exposed a crash due to incorrectly trying to handle masked
interleave recipes. For now, the current code does not support masked
interleave recipes. Bail out for them.
2025-10-22 14:10:01 +01:00
Florian Hahn
8d29d09309
[VPlan] Run narrowInterleaveGroups during general VPlan optimizations. (#149706)
Move narrowInterleaveGroups to the general VPlan optimization stage.

To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.

If such a VF is found, the original VPlan is split into 2:
 a) a new clone which contains all VFs of Plan, except VFToOptimize, and
 b) the original Plan with VFToOptimize as single VF.

The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.

Together with https://github.com/llvm/llvm-project/pull/149702, this
allows taking the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.

One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz

PR: https://github.com/llvm/llvm-project/pull/149706
2025-10-21 11:37:42 +01:00
David Sherwood
822c291aac
[LV][NFC] Remove undef from phi incoming values (#163762)
Split off from PR #163525, this standalone patch replaces
use of undef as incoming PHI values with zero, in order
to reduce the likelihood of contributors hitting the
`undef deprecator` warning on GitHub.
2025-10-21 10:49:27 +01:00
Sushant Gokhale
005ec78b71
[AArch64][CostModel] Add constraints on which partial reductions are (#163728)
natively supported on Neon and SVE

PR #158641 refined and refactored the cost model for partial reductions.
While doing so, it missed out on certain constraints. Specifically,
cases like i32 -> i64 partial reduce are not natively supported. This
patch adds back the condition/constraint that was present before PR
#158641.
2025-10-20 17:36:44 -07:00
Ramkumar Ramachandra
9bfaf12c07
[VPlan] Handle more replicates in isUniformAcrossVFsAndUFs (#162342)
A single-scalar replicate without side-effects, and with uniform
operands, is uniform. Special-case assumes and stores.
2025-10-20 10:26:23 +00:00
Nikita Popov
573ca36753
[IR] Replace alignment argument with attribute on masked intrinsics (#163802)
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.

This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)

It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
2025-10-20 08:50:09 +00:00
Florian Hahn
445415709e
[LV] Move test for incomplete partial reduction chains to separate file.
Move test to new file, to prepare for adding similar tests in
https://github.com/llvm/llvm-project/pull/162822.
2025-10-19 22:23:53 +01:00
Florian Hahn
b9ce7656e9
[VPlan] Add VPInstruction to unpack vector values to scalars. (#155670)
Add a new Unpack VPInstruction (name to be improved) to explicitly
extract scalar values from vectors.

Test changes are movements of the extracts: they are now generated
together and also directly after the producer.

Depends on https://github.com/llvm/llvm-project/pull/155102 (included in
PR)

PR: https://github.com/llvm/llvm-project/pull/155670
2025-10-19 18:49:05 +00:00
Nikita Popov
8fa4a1029c [LoopVectorize] Regenerate test checks (NFC) 2025-10-16 18:21:42 +02:00
Florian Hahn
7f54fccc0e
[VPlan] Add ExtractLastLanePerPart, use in narrowToSingleScalar. (#163056)
When narrowing stores of a single-scalar, we currently use
ExtractLastElement, which extracts the last element across all parts.
This is not correct if the store's address is not uniform across all
parts. If it is only uniform-per-part, the last lane per part must be
extracted. Add a new ExtractLastLanePerPart opcode to handle this
correctly. Most transforms apply to both ExtractLastElement and
ExtractLastLanePerPart, with the only difference being their treatment
during unrolling.
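
The difference between the two opcodes can be pictured with a toy model in
which each unrolled part is a plain vector of lanes. This is only a sketch:
the `Part` alias and both function names are hypothetical and do not
correspond to VPlan APIs.

```cpp
#include <cassert>
#include <vector>

// One unrolled part of a vector value: VF lanes.
using Part = std::vector<int>;

// Models ExtractLastElement: the last lane of the last part only.
int extractLastElement(const std::vector<Part> &parts) {
  assert(!parts.empty() && !parts.back().empty());
  return parts.back().back();
}

// Models ExtractLastLanePerPart: the last lane of every part.
std::vector<int> extractLastLanePerPart(const std::vector<Part> &parts) {
  std::vector<int> result;
  for (const Part &part : parts) {
    assert(!part.empty());
    result.push_back(part.back());
  }
  return result;
}
```

For two parts `{0, 1, 2, 3}` and `{4, 5, 6, 7}`, the first function yields 7,
while the second yields `{3, 7}`. If a store's address is only uniform within
each part, the per-part values are the ones that are needed.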

Fixes https://github.com/llvm/llvm-project/issues/162498.

PR: https://github.com/llvm/llvm-project/pull/163056
2025-10-15 13:46:09 +01:00
David Sherwood
4f2c867756
[LV][NFC] Fix "cpu" attribute in some partial-reduce*.ll tests (#163518) 2025-10-15 09:26:04 +01:00
Sushant Gokhale
778d3c8ccc
[NFC] Partial reduce test to demonstrate regression post commit #cc9c64d (#162681)
We have seen performance regressions for several instances of the Numba
benchmark, with some ranging around 70%, on Neoverse-v2 post #158641.
The mentioned case is a short reproducer of the same. See
https://godbolt.org/z/j9Mj5WM7c for the IR differences. A future patch
will address this.
2025-10-14 23:51:36 -07:00
Florian Hahn
ae7b15f2e2
[VPlan] Return invalid for scalable VF in VPReplicateRecipe::computeCost
Replication is currently not supported for scalable VFs. Make sure
VPReplicateRecipe::computeCost returns an invalid cost early, for
scalable VFs if the recipe is not a single-scalar.

Note that this moves the existing invalid-costs.ll out of the AArch64
subdirectory, as it does not use a target triple.

Fixes https://github.com/llvm/llvm-project/issues/160792.
2025-10-11 19:28:02 +01:00
Ramkumar Ramachandra
7296734394
[VPlan] Mark ActiveLaneMask as not having mem effects (#162330)
VPInstruction::ActiveLaneMask does not read or write memory. This allows
us to clean up some dead recipes.
2025-10-08 09:19:24 +01:00
Florian Hahn
9c0e09e0c1
[VPlan] Process ExpressionRecipes in reverse order in constructor.
Currently there's a crash when trying to construct VPExpressionRecipes
for a mul (ext, ext), if the multiply has outside users; the mul will be
cloned to serve its external users, but the extends won't get cloned and
will stay connected to users outside the loop (the cloned multiply).

To fix this, process recipes in reverse order. This ensures that we
visit bundled users before their operands, properly ensuring that the
extends for the external user are cloned as well.
2025-10-06 22:24:02 +01:00
Sander de Smalen
f3a952311c
[AArch64] Return Invalid partial reduction cost for i128 accumulator. (#162066)
PR #158641 introduced an issue where i128 accumulator types resulted
in a valid cost, because for a <2 x i128> type the code that
checks for unsupported type legalization would see a type action
of 'TypeSplitVector' which is supported, even though the legalised
type of <1 x i128> would require further scalarization.

This fixes https://github.com/llvm/llvm-project/issues/162009
2025-10-06 15:32:13 +01:00
Sander de Smalen
cc9c64d525
[AArch64] Refactor and refine cost-model for partial reductions (#158641)
This cost-model takes into account any type-legalisation that would
happen on vectors such as splitting and promotion. This results in wider
VFs being chosen for loops that can use partial reductions.

The cost-model now also assumes that when SVE is available, the SVE dot
instructions for i16 -> i64 dot products can be used for fixed-length
vectors. In practice this means that loops with non-scalable VFs are
vectorized using partial reductions where they wouldn't before, e.g.

```
  int64_t foo2(int8_t *src1, int8_t *src2, int N) {
    int64_t sum = 0;
    for (int i=0; i<N; ++i)
      sum += (int64_t)src1[i] * (int64_t)src2[i];
    return sum;
  }
```

These changes also fix an issue where previously a partial reduction
would be used for mixed sign/zero-extends (USDOT), even when +i8mm was
not available.
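
For readers unfamiliar with partial reductions, the following scalar sketch
shows the general idea on the dot-product loop above: a wide vector of
products is folded into a narrower vector of accumulators, and only the final
step reduces the accumulators to a scalar. The lane counts (16 inputs, 4
accumulators) and the grouping of adjacent lanes are illustrative assumptions,
not what the cost model or any particular instruction mandates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: the plain scalar dot product with a 64-bit accumulator.
std::int64_t dotScalar(const std::vector<std::int8_t> &a,
                       const std::vector<std::int8_t> &b) {
  assert(a.size() == b.size());
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += static_cast<std::int64_t>(a[i]) * static_cast<std::int64_t>(b[i]);
  return sum;
}

constexpr std::size_t kInputLanes = 16; // assumed width of the input vectors
constexpr std::size_t kAccLanes = 4;    // assumed width of the accumulator

// Emulates a partial reduction: every group of 4 adjacent products is added
// into one accumulator lane. Leftover elements are handled by a scalar tail.
std::int64_t dotPartialReduce(const std::vector<std::int8_t> &a,
                              const std::vector<std::int8_t> &b) {
  assert(a.size() == b.size());
  std::array<std::int64_t, kAccLanes> acc{};
  std::size_t i = 0;
  for (; i + kInputLanes <= a.size(); i += kInputLanes)
    for (std::size_t lane = 0; lane < kInputLanes; ++lane)
      acc[lane / (kInputLanes / kAccLanes)] +=
          static_cast<std::int64_t>(a[i + lane]) *
          static_cast<std::int64_t>(b[i + lane]);
  std::int64_t sum = 0;
  for (std::int64_t lane : acc) // final reduction of the accumulator lanes
    sum += lane;
  for (; i < a.size(); ++i)     // scalar tail
    sum += static_cast<std::int64_t>(a[i]) * static_cast<std::int64_t>(b[i]);
  return sum;
}
```

Both functions return the same value for any input because integer addition
is associative; the partial form just keeps fewer, wider accumulator lanes
live in the loop.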
2025-10-03 10:07:07 +01:00
Florian Hahn
2b2bc6320f
[LV] Add tests with multiple F(Max|Min)Num reductions w/o fast-math.
Pre-commits extra test coverage for loops with multiple F(Max|Min)Num
reductions w/o fast-math-flags for follow-up PR.
2025-10-02 21:30:54 +01:00
Florian Hahn
7c4f188f27
[LV] Support multiplies by constants when forming scaled reductions. (#161092)
We can create partial reductions for multiplies with constants, if the
constant is small enough to be extended from source to destination type
w/o changing the value.
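
A minimal sketch of the "small enough" condition, assuming it amounts to a
range check on the source type. The helper name `constantSurvivesExtend` and
its parameters are hypothetical; the actual check in the patch is not
reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if `c` is representable in a `srcBits`-wide integer of the
// given signedness, i.e. truncating it to the source type and extending it
// back (sext for signed, zext for unsigned) gives `c` again.
bool constantSurvivesExtend(std::int64_t c, unsigned srcBits, bool isSigned) {
  assert(srcBits >= 1 && srcBits < 63);
  if (isSigned) {
    const std::int64_t lo = -(std::int64_t{1} << (srcBits - 1));
    const std::int64_t hi = (std::int64_t{1} << (srcBits - 1)) - 1;
    return c >= lo && c <= hi;
  }
  const std::int64_t hi = (std::int64_t{1} << srcBits) - 1;
  return c >= 0 && c <= hi;
}
```

For an i8 source, 127 passes the signed check and 128 does not, while 128 and
255 pass the unsigned check and 256 does not.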

This only handles a constant on the right side of a multiply, relying on
other passes to canonicalize the input.

Alive2 Proofs: https://alive2.llvm.org/ce/z/iWRMr6

PR: https://github.com/llvm/llvm-project/pull/161092
2025-10-02 10:53:17 +00:00
Alexey Bader
2d6e7ef567
[LV] Add additional tests for replicating load/store costs.
Includes test for https://github.com/llvm/llvm-project/issues/161404
2025-10-01 19:15:19 +01:00
Florian Hahn
8907b6d393
[VPlan] Remove original loop blocks if dead. (#155497)
Build on top of https://github.com/llvm/llvm-project/pull/154510 to
completely remove the blocks of dead scalar loops.

Depends on https://github.com/llvm/llvm-project/pull/154510. 

PR: https://github.com/llvm/llvm-project/pull/155497
2025-10-01 16:53:59 +00:00
Paul Walker
9e0c0a0939
[LLVM][SCEV] udiv (mul nuw a, vscale), (mul nuw b, vscale) -> udiv a, b (#157836) 2025-10-01 15:46:12 +01:00
Florian Hahn
78af056137
[VPlan] Run CSE closer to VPlan::execute. (#160572)
Additional CSE opportunities are exposed after converting to concrete
recipes/dissolving regions and materializing various expressions. Run
CSE later, to capitalize on some of the late opportunities.

PR: https://github.com/llvm/llvm-project/pull/160572
2025-09-26 09:38:58 +00:00
Ramkumar Ramachandra
4769e52bb6
[LV] Fixup a test after c1f8dbb (#160688)
Follow up on c1f8dbb ([LV] Add coverage for fixing-up scalar resume
values) to regenerate a test with UTC.
2025-09-25 12:04:26 +00:00
Ramkumar Ramachandra
c1f8dbb11c
[LV] Add coverage for fixing-up scalar resume values (#160492)
Increase coverage of the routine fixScalarResumeValuesFromBypass in the
case where the original scalar resume value is zero.

Co-authored-by: Florian Hahn <flo@fhahn.com>
2025-09-25 11:39:55 +01:00
Florian Hahn
ce63093e2b
[LV] Add partial reduction tests multiplying extend with constants. 2025-09-25 11:13:21 +01:00
Florian Hahn
2016af5652
[VPlan] Create epilogue minimum iteration check in VPlan. (#157545)
Move creation of the minimum iteration check for the epilogue vector
loop to VPlan. This is a first step towards breaking up and moving
skeleton creation for epilogue vectorization to VPlan.

It moves most logic out of EpilogueVectorizerEpilogueLoop: the minimum
iteration check is created directly in VPlan, connecting the check
blocks from the main vector loop is done as post-processing. Next steps
are to move connecting and updating the branches from the check blocks
to VPlan, as well as updating the incoming values for phis.

Test changes are improvements due to folding of live-ins.

PR: https://github.com/llvm/llvm-project/pull/157545
2025-09-25 07:13:38 +00:00
Florian Hahn
a7b4dd42bd
[LV] Don't create partial reductions if factor doesn't match accumulator (#158603)
Check if the scale-factor of the accumulator is the same as the requested
ScaleFactor in tryToCreatePartialReductions.

This prevents creating partial reductions if not all instructions in the
reduction chain form partial reductions, e.g. because we do not form a
partial reduction for the loop exit instruction.

Currently code-gen works fine, because the scale factor of
VPPartialReduction is not used during ::execute, but it means we compute
incorrect cost/register pressure, because the partial reduction won't
reduce to the specified scaling factor.

PR: https://github.com/llvm/llvm-project/pull/158603
2025-09-24 12:21:03 +01:00
Matthew Devereau
819e6b2043
[InstSimplify] Consider vscale_range for get active lane mask (#160073)
Scalable get_active_lane_mask intrinsic calls can be simplified to an i1
splat (ptrue) when their constant range is larger than or equal to the
maximum possible number of elements, which can be inferred from
vscale_range(x, y).
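
The condition can be sketched as follows, modelling lane `i` of the mask as
`base + i < n` and a scalable vector as `minLanes * vscale` lanes. This is an
illustration under those assumptions, with hypothetical function names, and it
ignores overflow of `base + i`; it is not the InstSimplify implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lane i of the mask is active when base + i < n.
std::vector<bool> activeLaneMask(std::uint64_t base, std::uint64_t n,
                                 std::size_t numLanes) {
  std::vector<bool> mask(numLanes);
  for (std::size_t i = 0; i < numLanes; ++i)
    mask[i] = base + i < n;
  return mask;
}

// True when every lane of the mask is set.
bool allActive(const std::vector<bool> &mask) {
  for (bool lane : mask)
    if (!lane)
      return false;
  return true;
}

// The mask is all-true for every legal vscale if the constant range n - base
// covers the largest possible lane count, minLanes * vscaleMax.
bool allTrueForAnyVscale(std::uint64_t base, std::uint64_t n,
                         std::uint64_t minLanes, std::uint64_t vscaleMax) {
  assert(vscaleMax >= 1);
  return n >= base && n - base >= minLanes * vscaleMax;
}
```

With 4 lanes per vscale unit and a maximum vscale of 16, a range of 64 covers
every possible vector length, whereas a range of 63 leaves the last lane
inactive at the maximum vscale.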
2025-09-24 11:35:15 +01:00
Florian Hahn
88aab08ae5
[LV] Check for hoisted safe-div selects in planContainsAdditionalSimp.
In some cases, safe-divisor selects can be hoisted out of the vector
loop. Catching all cases in the legacy cost model isn't possible, in
particular checking if all conditions guarding a division are loop
invariant.

Instead, check in planContainsAdditionalSimplifications if there are any
hoisted safe-divisor selects. If so, don't compare to the more
inaccurate legacy cost model.

Fixes https://github.com/llvm/llvm-project/issues/160354.
Fixes https://github.com/llvm/llvm-project/issues/160356.
2025-09-23 21:54:02 +01:00
Luke Lau
70ab1201e4
[LV] Regenerate literal struct return tests with UTC. NFC (#160268)
This is a precommit for an upcoming patch which fixes a crash when
replicating struct calls
2025-09-23 18:51:53 +08:00
Florian Hahn
49605a4727
[LV] Set correct costs for interleave group members.
This ensures each scalarized member has an accurate cost, matching the
cost it would have if it would not have been considered for an
interleave group.
2025-09-21 18:07:22 +01:00
Florian Hahn
7dd9b3d814
[LV] Also handle non-uniform scalarized loads when processing AddrDefs.
Loads of addresses are scalarized and have their costs computed w/o
scalarization overhead. Consistently apply this logic also to
non-uniform loads that are already scalarized, to ensure their costs are
consistent with other scalarized loads that are used as addresses.
2025-09-21 09:36:58 +01:00
Florian Hahn
c506c28ec0
[LV] Add additional tests for scalar load costs of addresses. 2025-09-20 21:12:48 +01:00
Florian Hahn
19659eec2b
[LV] Add additional test for replicating store costs.
Add tests for costing replicating stores with x86_fp80, scalarizing
costs after discarding interleave groups and cost when preferring vector
addressing.
2025-09-19 20:38:53 +01:00
Paul Walker
7b8fd8f31b
[LLVM][SCEV] Look through common vscale multiplicand when simplifying compares. (#141798)
My usecase is simplifying the control flow generated by LoopVectorize
when vectorising loops whose tripcount is a function of the runtime
vector length. This can be problematic because:

* CSE is a pre-LoopVectorize transform and so it's common for an IR
function to include several calls to llvm.vscale(). (NOTE: Code
generation will typically remove the duplicates)
* Pre-LoopVectorize instcombines will rewrite some multiplies as shifts.
This leads to a mismatch between VL based maths of the scalar loop and
that created for the vector loop, which prevents some obvious
simplifications.

SCEV does not suffer these issues because it effectively does CSE during
construction and shifts are represented as multiplies.
2025-09-19 12:57:13 +01:00
Florian Hahn
0c028bbf33
[LV] Always add uniform pointers to uniforms list.
Always add pointers proved to be uniform via legal/SCEV to worklist.
This extends the existing logic to handle a few more pointers known to
be uniform.
2025-09-18 22:56:19 +01:00
Florian Hahn
70a7ffdc29
[LV] Add missing test cover for replicating load/store costs. 2025-09-18 19:47:06 +01:00
Florian Hahn
50b9ca4dda
[VPlan] Simplify Plan's entry in removeBranchOnConst. (#154510)
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.

Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branches from entry blocks.

In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.

Depends on https://github.com/llvm/llvm-project/pull/153643.

PR: https://github.com/llvm/llvm-project/pull/154510
2025-09-18 19:25:05 +01:00