6896 Commits

Author SHA1 Message Date
Mel Chen
f196b1d66f
[VPlan] Extract reverse operation for reverse accesses (#146525)
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
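
As a rough illustration of the idea (not the VPlan implementation), a reverse access can be modelled as a plain contiguous access plus a separate reverse step, which is what makes the permutation visible to later passes. The names `wideLoad`, `reverseLanes`, and `reverseLoad` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Contiguous load of VF elements starting at index First.
Vec wideLoad(const Vec &Mem, std::size_t First, std::size_t VF) {
  assert(First + VF <= Mem.size() && "load must stay in bounds");
  return Vec(Mem.begin() + First, Mem.begin() + First + VF);
}

// Stand-alone reverse step, modelling the extracted reverse operation.
Vec reverseLanes(Vec V) {
  std::reverse(V.begin(), V.end());
  return V;
}

// Fused form: lane I reads Mem[Last - I].
Vec reverseLoad(const Vec &Mem, std::size_t Last, std::size_t VF) {
  assert(VF >= 1 && Last + 1 >= VF && Last < Mem.size());
  Vec R(VF);
  for (std::size_t I = 0; I < VF; ++I)
    R[I] = Mem[Last - I];
  return R;
}
```

In this model, `reverseLoad(Mem, Last, VF)` and `reverseLanes(wideLoad(Mem, Last + 1 - VF, VF))` produce the same lanes.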
2025-12-18 14:57:48 +00:00
Simon Pilgrim
24d9550b27
[VectorCombine] foldShuffleOfBinops - if both operands are the same don't duplicate the total new cost (#172719)
If we're shuffling/concatenating the same operands then ensure we don't
duplicate the total cost, ensure we reuse the final shuffle and
recognise that we reduce the total instruction count (so fold even when
NewCost == OldCost, not just NewCost < OldCost).
2025-12-18 07:03:06 +00:00
Florian Hahn
9cc1585b13
[VPlan] Add VPBlockUtils::transferSuccessors (NFCI).
Add a new helper to transfer successors to a new, unconnected VPBB.
Helps to simplify existing code, and prepare for upcoming changes.
2025-12-17 22:48:22 +00:00
Florian Hahn
bab0dc4d48
Reapply "[LV] Mark checks as never succeeding for high cost cutoff."
Reapply 8a115b6934a90441 with an update to tests handling remarks.

The patch now directly emits a clear remark when we bail out
due to the memory check threshold.

Original message:
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.

Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadata.

Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
2025-12-17 20:21:49 +00:00
Florian Hahn
eb0c7e752f
[VPlan] Replace BranchOnCount with Compare + BranchOnCond (NFC). (#172181)
Expand BranchOnCount to BranchOnCond + ICmp in convertToConcreteRecipes
to simplify codegen.

PR: https://github.com/llvm/llvm-project/pull/172181
2025-12-16 19:19:31 +00:00
Ramkumar Ramachandra
1c6e5b2d04
[LV] Improve code using VPlan::get{ConstantInt,True} (NFC) (#172471)
2025-12-16 13:03:43 +00:00
Luke Lau
67d0e21a62
Reapply "[VPlan] Remove legacy costing inside VPBlendRecipe::computeCost (#171846)" (#172261)
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.

In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.

However for some VPlans we may have a blend with only the first lane
used:

    BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
    CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
    vp<%5> = vector-pointer ir<%gep>

And in the legacy cost model we cost a blend as a phi if it's uniform:

    // If we know that this instruction will remain uniform, check the cost of
    // the scalar version.
    if (isUniformAfterVectorization(I, VF))
      VF = ElementCount::getFixed(1);

So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.

A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that too.
2025-12-16 06:30:54 +00:00
Elvis Wang
1eba2cbe72
[LV] Convert uniform-address unmasked scatters to scalar store. (#166114)
This patch optimizes vector scatters that have a uniform (single-scalar)
address by replacing them with "extract-last-lane + scalar store" when
the scatter is unmasked.

Notes:

- The legacy cost model can scalarize a store if both the address and
the value are uniform. In VPlan we materialize the stored value via
ExtractLastLane, so only the address must be uniform.
- Some of the loops won't be vectorized any more, since no vector instructions
will be generated.
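
A minimal sketch of why the rewrite is sound, assuming the lanes of an unmasked scatter are written in lane order so that the last lane wins when every lane targets the same address. The function names are hypothetical and the model ignores volatile or atomic stores.

```cpp
#include <cassert>
#include <vector>

// Model of an unmasked scatter whose lanes all target the same address:
// lanes are written in order, so later lanes overwrite earlier ones.
void scatterToUniformAddress(int &Dest, const std::vector<int> &Lanes) {
  for (int V : Lanes)
    Dest = V;
}

// Equivalent form: extract the last lane and perform one scalar store.
void storeLastLane(int &Dest, const std::vector<int> &Lanes) {
  assert(!Lanes.empty() && "a vector has at least one lane");
  Dest = Lanes.back();
}
```

Both functions leave the same value in `Dest`, which is why only the address, not the stored value, needs to be uniform.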
2025-12-16 12:24:22 +08:00
Florian Hahn
83eea87a36
[VPlan] Create header phis once, after constructing VPlan0 (NFC). (#168291)
Together with https://github.com/llvm/llvm-project/pull/168289 &
https://github.com/llvm/llvm-project/pull/166099 we can construct header
phis once up front, after creating VPlan0, as the
induction/reduction/first-order-recurrence classification applies across
all VFs.

Depends on https://github.com/llvm/llvm-project/pull/168289 &
https://github.com/llvm/llvm-project/pull/166099 

PR: https://github.com/llvm/llvm-project/pull/168291
2025-12-15 22:12:10 +00:00
Florian Hahn
dbb4f5c2dd
[VPlan] Set VF scale factor in tryToCreatePartialReduction (NFCI).
Split off unrelated change from approved
https://github.com/llvm/llvm-project/pull/168291/ to land separately as
suggested.
2025-12-15 21:18:07 +00:00
Nicolai Hähnle
88bd56597c
VectorCombine: Improve the insert/extract fold in the narrowing case (#168820)
Keeping the extracted element in a natural position in the narrowed
vector has two beneficial effects:

1. It makes the narrowing shuffles cheaper (at least on AMDGPU), which
allows the insert/extract fold to trigger.
2. It makes the narrowing shuffles in a chain of extract/insert
compatible, which allows foldLengthChangingShuffles to successfully
recognize a chain that can be folded.

There are minor X86 test changes that look reasonable to me. The IR
change for AVX2 in
llvm/test/Transforms/VectorCombine/X86/extract-insert-poison.ll
doesn't change the assembly generated by `llc -mtriple=x86_64--
-mattr=AVX2`
at all.
2025-12-15 11:25:51 -08:00
Alexey Bataev
b988555812
[SLP]Check if the extractelement is part of other buildvector node before marking for erasing
Need to check if the extractelement instruction is part of other
buildvector node, before trying to mark it for the deletion, otherwise
the compiler may reuse the deleted instruction.

Fixes #172221
2025-12-15 09:54:05 -08:00
Bala_Bhuvan_Varma
0b2fe07e6b
[VectorCombine] Prevent redundant cost computation for repeated operand pairs in foldShuffleOfIntrinsics (#171965)
This pr resolves [#170867](https://github.com/llvm/llvm-project/issues/170867)

Existing code recomputes the cost of creating a shuffle instruction even for
repeated intrinsic operand pairs. This results in a higher newCost, so the
cost model decides not to fold.

The change proposed in this pr will address this issue. When calculating
the newCost we are skipping the cost calculation of an operand pair if
it was already considered. And when creating the transformed code, we
are reusing the already created shuffle instruction for repeated operand
pair.
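
The deduplication can be sketched as follows. This is an illustrative model only: operands are represented by integer ids, every new shuffle is assumed to have the same cost, and `newShuffleCost` is a hypothetical name rather than a function in VectorCombine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// ArgsA[I] and ArgsB[I] are the I-th operands of the two intrinsic calls
// being shuffled together; each distinct pair needs one new shuffle.
uint64_t newShuffleCost(const std::vector<int> &ArgsA,
                        const std::vector<int> &ArgsB,
                        uint64_t CostPerShuffle) {
  assert(ArgsA.size() == ArgsB.size() && "both calls must have the same arity");
  std::set<std::pair<int, int>> Seen;
  uint64_t Total = 0;
  for (std::size_t I = 0; I < ArgsA.size(); ++I)
    // insert().second is true only the first time a pair is seen, so a
    // repeated pair is not charged again and its shuffle can be reused.
    if (Seen.insert({ArgsA[I], ArgsB[I]}).second)
      Total += CostPerShuffle;
  return Total;
}
```

With operands `{1, 1, 2}` and `{5, 5, 6}`, only two distinct pairs exist, so two shuffles are charged instead of three.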
2025-12-15 14:42:41 +00:00
Ramkumar Ramachandra
0636225b93
[VPlan] Directly unroll VectorPointerRecipe (#168886)
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.
2025-12-15 10:54:06 +00:00
Florian Hahn
bcbbe2c2bc
[VPlan] Pass backedge value directly to FOR and reduction phis (NFC).
Pass backedge values directly to VPFirstOrderRecurrencePHIRecipe and
VPReductionPHIRecipe, as they must be provided and available.

Split off from https://github.com/llvm/llvm-project/pull/168291.
2025-12-14 20:59:22 +00:00
Florian Hahn
53cf22f3a1
[VPlan] Simplify live-ins early using SCEV. (#155304)
Use SCEV to simplify all live-ins during VPlan0 construction. This
enables us to remove special SCEV queries when constructing
VPWidenRecipes and improves results in some cases.

This leads to simplifications in a number of cases in real-world
applications (~250 files changed across LLVM, SPEC, ffmpeg)

PR: https://github.com/llvm/llvm-project/pull/155304
2025-12-14 20:15:05 +00:00
Luke Lau
4ea8157773
Revert "[VPlan] Remove legacy costing inside VPBlendRecipe::computeCost (#171846)"
This reverts commit fd5f53aa9b21060063484fc6c346316a34a6464c.

It's triggering legacy cost model assertions reported in
https://github.com/llvm/llvm-project/pull/171846#issuecomment-3647640019
2025-12-13 20:05:34 +08:00
Nicolai Hähnle
54ae1222ef
VectorCombine: Fold chains of shuffles fed by length-changing shuffles (#168819)
Such chains can arise from folding insert/extract chains.
2025-12-12 13:53:03 -08:00
Florian Hahn
e6e3f94b5c
[VPlan] Re-add clarifying comment regarding part to extract. (NFC)
Re-add and emphasize comment regarding extracting from the last part, as
suggested post-commit in https://github.com/llvm/llvm-project/pull/171145.
2025-12-12 21:51:33 +00:00
Florian Hahn
333ee931df
[LV] Update stale comment after 4e05d702f02a. (NFC)
Address post-commit suggestion, update stale comment after 4e05d702f.
2025-12-12 21:36:56 +00:00
Florian Hahn
0171e881b5
[VPlan] Strip stray whitespace when printing VPWidenIntOrFpInduction.
printFlags takes care of inserting the needed spaces, so remove the
unneeded stray whitespace.
2025-12-12 21:28:50 +00:00
Florian Hahn
65deac0872
[VPlan] Remove vector type checking in inferScalartType (NFC).
inferScalarTypeForRecipe always infers a scalar type, so BaseTy must be
a scalar type. Remove unneeded cast.
2025-12-11 22:10:31 +00:00
Florian Hahn
4e05d702f0
[LV] Always include middle block cost in isOutsideLoopWorkProfitable. (#171102)
Always include the cost of the middle block in
isOutsideLoopWorkProfitable. This addresses the TODO from
https://github.com/llvm/llvm-project/pull/168949 and removes the
temporary restriction.

isOutsideLoopWorkProfitable already scales the cost outside loops
according to the expected trip counts.

In practice this increases the minimum iteration threshold in a few
cases. On a large IR corpus based on C/C++ workloads, ~50 out of 179450
vector loops have their thresholds increased slightly.


PR: https://github.com/llvm/llvm-project/pull/171102
2025-12-11 21:41:47 +00:00
Nikita Popov
8a9d9e4853
[LV] Use getSigned() for stride
The stride may be negative.
2025-12-11 17:30:37 +01:00
Luke Lau
fd5f53aa9b
[VPlan] Remove legacy costing inside VPBlendRecipe::computeCost (#171846)
A VPBlendRecipe always emits selects, even when the VF is scalar.

However the legacy cost model always costs all scalar non-header phis as
a phi, and the VPlan cost model has to account for this.

This can cause the cost to be a little off, for example not including
the cost of the select in @smax_call_uniform leading to unprofitable
vectorization.

This removes this from the VPlan cost model and handles checks for the
case in planContainsAdditionalSimplifications instead.

I considered trying to make the legacy cost model more accurate but I'm
not sure if it's possible. We need information as to whether or not the
scalar VF we are costing is the original loop in which case it's
actually a phi, or if it's a VPBlendRecipe that emits a select,
potentially from a VF=1, UF>=1 VPlan.
2025-12-12 00:25:58 +08:00
Ramkumar Ramachandra
85fafd5db0
[SCEVExp] Get DL from SE, strip constructor arg (NFC) (#171823)
2025-12-11 14:26:47 +00:00
Luke Lau
2967815249
[VPlan] Don't emit VPBlendRecipes with only one incoming value. NFC (#171804)
We can just directly use the incoming value. These single value blends
would get optimized later on in simplifyBlends, but by doing it early it
removes the notion of an "immediately normalized" blend, and simplifies
an upcoming patch.
2025-12-11 12:55:56 +00:00
Florian Hahn
5a1299b196
[VPlan] Strip stray whitespace when printing VPWidenSelectRecipe. (NFCI)
printFlags takes care of inserting the correct amount of spaces,
depending on whether there are flags to print or not.
2025-12-10 22:15:35 +00:00
Ramkumar Ramachandra
1c7126d8db
[VPlan] Combine LiveIns fields into MapVector (NFC) (#170220)
Combine Value2VPValue and VPLiveIns into a single MapVector LiveIns
field, simplifying users.
2025-12-10 07:09:21 +00:00
Ramkumar Ramachandra
3310c0be58
[VPlan] Strip TODO to consolidate (ActiveLaneMask|Widen)PHI (#171392)
They cannot be consolidated, as WidenPHI is not a header PHI, while
ActiveLaneMaskPHI is.
2025-12-09 21:38:58 +00:00
Aiden Grossman
f29d06029f
Revert "[LV] Mark checks as never succeeding for high cost cutoff."
This reverts commit 8a115b6934a90441d77ea54af73e7aaaa1394b38.

This broke premerge. https://lab.llvm.org/staging/#/builders/192/builds/13326

/home/gha/llvm-project/clang/test/Frontend/optimization-remark-options.c:10:11: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'
2025-12-09 21:32:09 +00:00
Florian Hahn
8a115b6934
[LV] Mark checks as never succeeding for high cost cutoff.
When GeneratedRTChecks::create bails out due to exceeding the cost
threshold, no runtime checks are generated and we must not proceed
assuming checks have been generated.

Mark the checks as never succeeding, to make sure we don't try to
vectorize assuming the runtime checks hold. This fixes a case where we
previously incorrectly vectorized assuming runtime checks had been
generated when forcing vectorization via metadata.

Fixes the mis-compile mentioned in
https://github.com/llvm/llvm-project/pull/166247#issuecomment-3631471588
2025-12-09 20:37:21 +00:00
Florian Hahn
c61a481a23
[VPlan] Use SCEV to prove non-aliasing for stores at different offsets. (#170347)
Extend the logic added in https://github.com/llvm/llvm-project/pull/168771
to also allow sinking stores past stores in the same noalias set by
checking if we can prove no-alias via the distance between accesses,
checked via SCEV.
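
A simplified sketch of a distance-based no-alias check, assuming both stores share a base pointer and that the difference of their offsets folds to a known constant (in the patch this reasoning goes through SCEV; the helper below is hypothetical).

```cpp
#include <cassert>
#include <cstdint>

// Distance is OffsetB - OffsetA in bytes; SizeA and SizeB are store sizes.
// Returns true if the two byte ranges cannot overlap.
bool provablyDisjoint(int64_t Distance, uint64_t SizeA, uint64_t SizeB) {
  assert(SizeA > 0 && SizeB > 0 && "accesses must touch at least one byte");
  if (Distance >= 0)
    // B starts at or after A: disjoint if B begins at or past the end of A.
    return static_cast<uint64_t>(Distance) >= SizeA;
  // A starts after B; negate in unsigned arithmetic to avoid signed overflow.
  return (0 - static_cast<uint64_t>(Distance)) >= SizeB;
}
```

For two 4-byte stores, a distance of 4 bytes is enough to prove they do not overlap, while a distance of 3 bytes is not.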

PR: https://github.com/llvm/llvm-project/pull/170347
2025-12-09 16:19:13 +00:00
Fabrice de Gans
d478baa238
Add more missing LLVM_ABI annotations (#168765)
This patch updates various LLVM headers to properly add the `LLVM_ABI`
and `LLVM_ABI_FOR_TEST` annotations to build LLVM as a DLL on Windows.

This effort is tracked in #109483.
2025-12-09 09:03:15 -05:00
Florian Hahn
0768068ff0
[VPlan] Remove ExtractLastLane for plans with scalar VFs. (#171145)
ExtractLastLane is a no-op for scalar VFs. Update simplifyRecipe to
remove them. This also requires adjusting the code in VPlanUnroll.cpp to
split off handling of ExtractLastLane/ExtractPenultimateElement for
scalar VFs, which now needs to match ExtractLastPart.

PR: https://github.com/llvm/llvm-project/pull/171145
2025-12-09 11:59:40 +00:00
Pengcheng Wang
1ef0a56b55
[LV][NFC] Use foldTailWithEVL() (#171282)
2025-12-09 16:58:53 +08:00
Luke Lau
0fbb45e7d6
[LV] Return getPredBlockCostDivisor in uint64_t
When the probability of a block is extremely low, HeaderFreq / BBFreq
may be larger than 32 bits. Previously this got truncated to uint32_t
which could cause division by zero exceptions on x86. Widen the return
type to uint64_t which should fit the entire range of BlockFrequency
values.

It's also worth noting that a frequency can never be zero according to
BlockFrequency.h, so we shouldn't need to worry about divide by zero in
getPredBlockCostDivisor itself.
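
The truncation hazard can be reproduced in isolation. The two helpers below are hypothetical stand-ins for the old and new return types, not the actual LoopVectorize code.

```cpp
#include <cassert>
#include <cstdint>

// Old behaviour (modelled): the quotient is truncated to 32 bits.
uint32_t divisorTruncated(uint64_t HeaderFreq, uint64_t BBFreq) {
  assert(BBFreq != 0 && "block frequencies are never zero");
  return static_cast<uint32_t>(HeaderFreq / BBFreq);
}

// New behaviour (modelled): the full 64-bit quotient is kept.
uint64_t divisorWide(uint64_t HeaderFreq, uint64_t BBFreq) {
  assert(BBFreq != 0 && "block frequencies are never zero");
  return HeaderFreq / BBFreq;
}
```

With `HeaderFreq = 1ULL << 32` and `BBFreq = 1`, the truncated divisor is 0, so any later `Cost / Divisor` would divide by zero, while the wide divisor keeps the value 4294967296.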
2025-12-09 15:43:13 +08:00
Drew Kersnar
5c8c7f3d21
[LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization (#159388)
This change introduces Gap Filling, an optimization that aims to fill in
holes in otherwise contiguous load/store chains to enable vectorization.
It also introduces Chain Extending, which extends the end of a chain to
the closest power of 2.

This was originally motivated by the NVPTX target, but I tried to
generalize it to be universally applicable to all targets that may use
the LSV. I'm more than willing to make adjustments to improve the
target-agnostic-ness of this change. I fully expect there are some
issues and encourage feedback on how to improve things.

For both loads and stores we only perform the optimization when we can
generate a legal llvm masked load/store intrinsic, masking off the
"extra" elements. Determining legality for stores is a little tricky
from the NVPTX side, because these intrinsics are only supported for
256-bit vectors. See the other PR I opened for the implementation of the
NVPTX lowering of masked store intrinsics, which includes NVPTX TTI
changes that return true for isLegalMaskedStore under certain
conditions: https://github.com/llvm/llvm-project/pull/159387. This
change is dependent on that backend change, but I predict this change
will require more discussion, so I am putting them both up at the same
time. The backend change will be merged first assuming both are
approved.

Edited: both stores _and loads_ must use masked intrinsics for this
optimization to be legal.
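
A rough sketch of the mask that results from Gap Filling and Chain Extending, assuming "closest power of 2" means the next power of two at or above the chain length. `laneMask` is a hypothetical helper; the real pass additionally checks target legality and alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Present holds the element indices that the original chain accesses.
// The result has one entry per lane of the widened access: true for real
// elements, false for gap or extension lanes that must be masked off.
std::vector<bool> laneMask(const std::vector<unsigned> &Present) {
  assert(!Present.empty() && "a chain has at least one access");
  unsigned Last = *std::max_element(Present.begin(), Present.end());
  unsigned Width = 1;
  while (Width < Last + 1) // extend the chain to a power-of-two length
    Width *= 2;
  std::vector<bool> Mask(Width, false);
  for (unsigned I : Present)
    Mask[I] = true;
  return Mask;
}
```

A chain touching elements 0, 1, 3 and 4 becomes an 8-lane masked access in which lanes 2, 5, 6 and 7 are disabled.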

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-12-08 15:57:17 -06:00
Florian Hahn
65dd29b335
[LV] Compare induction start values via SCEV in assertion (NFCI).
Instead of comparing plain VPValue in the assertion checking the start
values, directly compare the SCEVs. This future-proofs the code in
preparation for performing more simplifications/canonicalizations for
live-ins.
2025-12-08 21:31:53 +00:00
Alexey Bataev
f8d0c355f5
[SLP]Prefer instructions, used outside the block, as the initial main copyable instructions
Instructions, used outside the block, must be considered the first
choice for the main instructions in the copyable nodes, to avoid
use-before-def.

Fixes #171055
2025-12-08 09:46:15 -08:00
Ramkumar Ramachandra
c5b90103da
[VPlan] Use nuw when computing {VF,VScale}xUF (#170710)
These quantities should never unsigned-wrap. This matches the behavior
if only VFxUF is used (and not VF): when computing both VF and VFxUF,
nuw should hold for each step separately.
2025-12-08 15:46:02 +00:00
Luke Lau
e8219e5ce8
[VPlan] Use BlockFrequencyInfo in getPredBlockCostDivisor (#158690)
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.

The loop looks something like:

```c
  for (int i = 0; i < n; i++) {
    if (x0[i] == a)
      if (x1[i] == b)
        if (x2[i] == c)
          // do stuff...
  }
```

Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.

This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.

We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the probability as `HeaderFrequency / BlockFrequency`.
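
To make the arithmetic concrete, here is a small sketch contrasting the fixed 50% heuristic with a frequency-based divisor. The function names and the frequency values are illustrative assumptions; only the `HeaderFrequency / BlockFrequency` ratio comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Fixed heuristic: a predicated block is assumed to run on half of the
// iterations, so its cost is divided by 2.
uint64_t fixedDivisor() { return 2; }

// Frequency-based divisor: header frequency over block frequency.
uint64_t frequencyDivisor(uint64_t HeaderFreq, uint64_t BlockFreq) {
  assert(BlockFreq != 0 && "block frequencies are never zero");
  return HeaderFreq / BlockFreq;
}

// Scalar cost attributed to a predicated block after scaling.
uint64_t scaledCost(uint64_t BlockCost, uint64_t Divisor) {
  assert(Divisor != 0 && "divisor must be non-zero");
  return BlockCost / Divisor;
}
```

If the header has frequency 1000 and the innermost block 125 (12.5%), the divisor is 8 rather than 2, so a block cost of 40 is scaled to 5 instead of 20, making the scalar loop look cheaper relative to the if-converted vector loop.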

Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.

Whilst there's a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:

- On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there's 0.9%/0% less geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.

Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
2025-12-08 14:28:26 +00:00
Aiden Grossman
7bfdaa51f1
[VPlan] Fix unused variable warning
llvm-project/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp:312:19: warning: unused variable 'EB' [-Wunused-variable]
  312 |     VPBasicBlock *EB = Plan.getExitBlocks().front();
      |                   ^~

This showed up in a non-assertions build.
2025-12-07 18:07:52 +00:00
Florian Hahn
3fc7419236
[VPlan] Replace ExtractLast(Elem|LanePerPart) with ExtractLast(Lane/Part) (#164124)
Replace ExtractLastElement and ExtractLastLanePerPart with more generic
and specific ExtractLastLane and ExtractLastPart, which model distinct
parts of extracting across parts and lanes. ExtractLastElement ==
ExtractLastLane(ExtractLastPart) and ExtractLastLanePerPart ==
ExtractLastLane, the latter clarifying the name of the opcode. A new
m_ExtractLastElement matcher is provided for convenience.
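
The decomposition can be pictured with a toy model in which an unrolled value is a list of parts and each part is a list of lanes. The C++ functions below only mirror the opcode names for illustration; they are not VPlan code.

```cpp
#include <cassert>
#include <vector>

using Part = std::vector<int>; // the lanes of one unrolled part

// Select the last unrolled part.
const Part &extractLastPart(const std::vector<Part> &Parts) {
  assert(!Parts.empty() && "need at least one part");
  return Parts.back();
}

// Select the last lane of a single part.
int extractLastLane(const Part &P) {
  assert(!P.empty() && "need at least one lane");
  return P.back();
}

// ExtractLastElement == ExtractLastLane(ExtractLastPart).
int extractLastElement(const std::vector<Part> &Parts) {
  return extractLastLane(extractLastPart(Parts));
}
```

Applying `extractLastLane` to each part separately corresponds to the former ExtractLastLanePerPart.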

The patch should be NFC modulo printing changes.

PR: https://github.com/llvm/llvm-project/pull/164124
2025-12-07 15:15:43 +00:00
Florian Hahn
ba836dc5ed
[VPlan] Remove stray space before ops when printing vector-ptr (NFC)
2025-12-06 13:07:07 +00:00
Jerry Dang
23f09fd3e9
[VectorCombine] Fold permute of intrinsics into intrinsic of permutes: shuffle(intrinsic, poison/undef) -> intrinsic(shuffle) (#170052)
[VectorCombine] Fold permute of intrinsics into intrinsic of permutes

Add foldPermuteOfIntrinsic to transform:
  shuffle(intrinsic(args), poison) -> intrinsic(shuffle(args))
when the shuffle is a permute (operates on single vector) and the cost
model determines the transformation is profitable.

This optimization is particularly beneficial for subvector extractions
where we can avoid computing unused elements.

For example:
  %fma = call <8 x float> @llvm.fma.v8f32(<8 x float> %a, %b, %c)
  %result = shufflevector <8 x float> %fma, poison, <4 x i32> <0,1,2,3>
transforms to:
  %a_low = shufflevector <8 x float> %a, poison, <4 x i32> <0,1,2,3>
  %b_low = shufflevector <8 x float> %b, poison, <4 x i32> <0,1,2,3>
  %c_low = shufflevector <8 x float> %c, poison, <4 x i32> <0,1,2,3>
  %result = call <4 x float> @llvm.fma.v4f32(%a_low, %b_low, %c_low)

The transformation creates one shuffle per vector argument and calls the
intrinsic with smaller vector types, reducing computation when only a
subset of elements is needed.

The existing foldShuffleOfIntrinsics handled the blend case (two
intrinsic inputs), this adds support for the permute case (single
intrinsic input).
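
The equivalence behind the fold can be checked on a lane-wise model. This is a sketch under the assumption that the intrinsic is purely element-wise (as fma is); `permute` and `fmaLanes` are hypothetical helpers, not LLVM APIs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Single-source shuffle (a permute): result lane I takes V[Mask[I]].
Vec permute(const Vec &V, const std::vector<std::size_t> &Mask) {
  Vec R;
  R.reserve(Mask.size());
  for (std::size_t Idx : Mask) {
    assert(Idx < V.size() && "mask index out of range");
    R.push_back(V[Idx]);
  }
  return R;
}

// Lane-wise fused multiply-add, standing in for @llvm.fma.
Vec fmaLanes(const Vec &A, const Vec &B, const Vec &C) {
  assert(A.size() == B.size() && B.size() == C.size());
  Vec R(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    R[I] = std::fma(A[I], B[I], C[I]);
  return R;
}
```

Because each output lane depends only on the matching input lanes, permuting the result of `fmaLanes` gives the same values as calling `fmaLanes` on the permuted arguments, while the latter only computes the lanes that are actually used.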

Fixes #170002
2025-12-05 15:54:53 +00:00
Florian Hahn
f02dc4d198
[VPlan] Don't try to hoist multi-defs for first-order recurrences.
Currently the hoisting implementation expects single-defs. Bail out on
multi-defs (VPInterleaveRecipe), to fix an assertion.

Fixes https://github.com/llvm/llvm-project/issues/170666
2025-12-04 21:09:16 +00:00
Ramkumar Ramachandra
ef58670f03
Revert [VPlan] Consolidate logic for narrowToSingleScalars (#170720)
This reverts commit 7b3ec51, as a crash was reported:
https://llvm.godbolt.org/z/dK6ff5zvr -- this will give us time to
investigate a re-land.
2025-12-04 19:14:51 +00:00
Alexey Bataev
a2a3d89e08
[SLP][NFC]Hoist invariant request for user nodes out of the loop, NFC
2025-12-04 06:57:54 -08:00
Alexey Bataev
e502dce8b5
[SLP][NFC]Simplify analysis of the scalars, NFC.
Just an attempt to simplify some checks, remove extra calls and reorder
checks to make code simpler and faster

Reviewers: RKSimon, hiraditya

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/170382
2025-12-04 08:28:38 -05:00