40764 Commits

Author SHA1 Message Date
Luke Lau
955c475ae6
[VPlan] Add m_Sub to VPlanPatternMatch. NFC (#154705)
This mirrors PatternMatch.h, and we'll also be able to use it in #152167.
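
Hypothetical usage, mirroring the IR-level matchers (this assumes the
`m_VPValue` matcher from VPlanPatternMatch.h; a sketch, not code from the
patch):

```c++
// Match a recipe R computing X - Y and capture its operands.
using namespace llvm::VPlanPatternMatch;
VPValue *X, *Y;
if (match(&R, m_Sub(m_VPValue(X), m_VPValue(Y)))) {
  // R is known to be a subtract; X and Y are its operands.
}
```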
2025-08-21 09:33:46 +00:00
Elvis Wang
d611a9ca15
[LV][VPlan] Reduce register usage of VPEVLBasedIVPHIRecipe. (#154482)
`VPEVLBasedIVPHIRecipe` lowers to a scalar VPInstruction phi and
generates a scalar phi, so it only occupies a scalar register, just
like other phi recipes.

This patch fixes the register usage for `VPEVLBasedIVPHIRecipe`,
changing it from vector to scalar, which matches the generated vector
IR.

https://godbolt.org/z/6Mzd6W6ha shows that there are no register spills
when choosing `<vscale x 16>`.

Note that this test is basically copied from AArch64.
2025-08-21 07:39:01 +08:00
Shih-Po Hung
cf0e86118d
[VPlan] Handle canonical VPWidenIntOrFpInduction in branch-condition simplification (#153539)
SimplifyBranchConditionForVFAndUF only recognized canonical IVs and a
few PHI recipes in the loop header. With more IV-step optimizations,
the canonical widen-canonical-iv can be replaced by a canonical
VPWidenIntOrFpInduction, which the pass did not handle, causing
regressions (missed simplifications).

This patch replaces the canonical VPWidenIntOrFpInduction with a
StepVector in the vector preheader, since the vector loop region only
executes once.
2025-08-21 07:34:54 +08:00
Kazu Hirata
9aae8ef329
[Scalar] Use SmallPtrSet directly instead of SmallSet (NFC) (#154473)
I'm trying to remove the redirection in SmallSet.h:

  template <typename PointeeType, unsigned N>
  class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N> {};

to make it clear that we are using SmallPtrSet.  There are only a
handful of places that rely on this redirection.

This patch replaces SmallSet with SmallPtrSet where the element type is
a pointer.
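
For illustration, a typical mechanical replacement looks like this
(the variable names are hypothetical, not taken from the patch):

```c++
#include "llvm/ADT/SmallPtrSet.h"

void example(int *P) {
  // Before: llvm::SmallSet<int *, 8> Visited;  // silently a SmallPtrSet
  // After: spell out the type that was actually in use.
  llvm::SmallPtrSet<int *, 8> Visited;
  Visited.insert(P); // the pointer-set API is unchanged
}
```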
2025-08-20 16:30:39 -07:00
Ramkumar Ramachandra
0db57ab586
[VPlan] Improve code using onlyScalarValuesUsed (NFC) (#154564) 2025-08-20 22:38:00 +01:00
Philip Reames
e6b4a21849
[IR] Add utilities for manipulating length of MemIntrinsic [nfc] (#153856)
The goal is simply to reduce direct usage of getLength and setLength so that
if we end up moving memset.pattern (whose length is in elements) there
are fewer places to audit.
2025-08-20 13:50:11 -07:00
Florian Hahn
4e6c88be7c
[TTI] Remove Args argument from getOperandsScalarizationOverhead (NFC). (#154126)
Remove the ArrayRef<const Value*> Args operand from
getOperandsScalarizationOverhead and require that the callers
de-duplicate arguments and filter constant operands.

Removing the Value * based Args argument enables callers where no Value
* operands are available to use the function in a follow-up: computing
the scalarization cost directly for a VPlan recipe.

It also allows more accurate cost-estimates in the future: for example,
when vectorizing a loop, we could also skip operands that are live-ins,
as those also do not require scalarization.

PR: https://github.com/llvm/llvm-project/pull/154126
2025-08-20 21:09:08 +01:00
Florian Hahn
35be64a416
[VPlan] Factor out logic to compute common costs to a helper (NFCI). (#153361)
A number of recipes compute costs for the same opcodes for scalars or
vectors, depending on the recipe.

Move the common logic out to a helper in VPRecipeWithIRFlags, that is
then used by VPReplicateRecipe, VPWidenRecipe and VPInstruction.

This makes it easier to cover all relevant opcodes, without duplication.

PR: https://github.com/llvm/llvm-project/pull/153361
2025-08-20 16:05:20 +01:00
David Sherwood
e172110d12
[LV] Don't calculate scalar costs for scalable VFs in setVectorizedCallDecision (#152713)
In setVectorizedCallDecision we attempt to calculate the scalar costs
of calls, even for scalable VFs where we already know the answer is
Invalid. We can avoid doing unnecessary work by skipping this
completely for scalable vectors.
2025-08-20 15:00:31 +01:00
Ryotaro Kasuga
2330fd2f73
[LoopPeel] Add new option to peeling loops to convert PHI into IV (#121104)
LoopPeel currently considers PHI nodes that become loop invariants
through peeling. However, in some cases, peeling transforms PHI nodes
into induction variables (IVs), potentially enabling further
optimizations such as loop vectorization. For example:

```c
// TSVC s292
int im = N-1;
for (int i=0; i<N; i++) {
  a[i] = b[i] + b[im];
  im = i;
}
```

In this case, peeling one iteration converts `im` into an IV, allowing
it to be handled by the loop vectorizer.
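
For illustration, after peeling one iteration the loop conceptually
becomes the following (a sketch, not the exact pass output):

```c++
// Peeled first iteration (i == 0, im == N-1), guarded for N > 0.
if (N > 0)
  a[0] = b[0] + b[N - 1];
// In the remaining loop, im is always i - 1: an induction variable.
for (int i = 1; i < N; i++)
  a[i] = b[i] + b[i - 1];
```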

This patch adds a new feature that peels loops when doing so converts
PHIs into IVs. At the moment this feature is disabled by default.

Enabling it allows the above example to be vectorized. I have measured
on neoverse-v2 and observed a speedup of more than 60% (options: `-O3
-ffast-math -mcpu=neoverse-v2 -mllvm -enable-peeling-for-iv`).

This PR is taken over from #94900
Related #81851
2025-08-20 13:44:56 +00:00
Florian Hahn
dc23869f98
[LV] Handle vector trip count being zero in preparePlanForEpiVectorLoop.
After a485e0e, we may not set the vector trip count in
preparePlanForEpilogueVectorLoop if it is zero. We should not choose a
VF * UF that makes the main vector loop dead (i.e. vector trip count is
zero), but there are some cases where this can happen currently.

In those cases, set EPI.VectorTripCount to zero.
2025-08-20 11:54:22 +01:00
Nikita Popov
5ae749b77d
[FunctionAttr] Invalidate callers with mismatching signature (#154289)
If FunctionAttrs infers additional attributes on a function, it also
invalidates analysis on callers of that function. The way it does this
right now limits this to calls with matching signature. However, the
function attributes will also be used when the signatures do not match.
Use getCalledOperand() to avoid a signature check.

This is not a correctness fix, just improves analysis quality. I noticed
this due to
https://github.com/llvm/llvm-project/pull/144497#issuecomment-3199330709,
where LICM ends up with a stale MemoryDef that could be a MemoryUse
(which is a bug in LICM, but still non-optimal).
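
For reference, the relevant API difference can be sketched like this
(an illustrative helper, not the patch itself):

```c++
#include "llvm/IR/Function.h"
#include "llvm/IR/InstrTypes.h"
using namespace llvm;

// getCalledFunction() returns null when the call site's signature does
// not match the callee, so signature-mismatched callers were skipped.
// getCalledOperand() still exposes the underlying function.
Function *calledFunctionIgnoringSignature(CallBase &CB) {
  return dyn_cast<Function>(CB.getCalledOperand()->stripPointerCasts());
}
```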
2025-08-20 11:38:31 +02:00
Orlando Cazalet-Hyams
6c9352530a
[RemoveDIs][NFC] Clean up BasicBlockUtils now intrinsics are gone (#154326)
A couple of minor readability changes now that we're not supporting both
intrinsics and records.
2025-08-20 10:03:44 +01:00
Stephen Tozer
5cedb01487 [Debugify] Fix compile error in tracking coverage build
Forward-fixes a compile error in bc216b057d (#150212) in specific build
configurations, due to a missing const_cast.
2025-08-19 11:18:42 +01:00
David Green
a7df02f83c
[InstCombine] Make strlen optimization more resilient to different gep types. (#153623)
This makes the optimization in optimizeStringLength for strlen(gep
@glob, %x) -> sub endof@glob, %x a little more resilient, and maybe a
bit more correct for geps with non-array types.
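
At the source level, the fold corresponds to something like the
following (an illustrative sketch; the transform itself operates on IR):

```c++
#include <cstddef>

static const char glob[] = "hello world";

// strlen(&glob[x]) folds to "end of @glob" minus the offset, provided
// glob contains no embedded nul and x stays within the string.
size_t foldedStrlen(size_t x) {
  return (sizeof(glob) - 1) - x;
}
// e.g. foldedStrlen(6) == strlen(glob + 6) == 5
```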
2025-08-19 10:37:17 +01:00
Nikita Popov
4ab87ffd1e
[SCCP] Enable PredicateInfo for non-interprocedural SCCP (#153003)
SCCP can use PredicateInfo to constrain ranges based on assume and
branch conditions. Currently, this is only enabled during IPSCCP.

This enables it for SCCP as well, which runs after functions have
already been simplified, while IPSCCP runs pre-inline. To a large
degree, CVP already handles range-based optimizations, but SCCP is more
reliable for the cases it can handle. In particular, SCCP works reliably
inside loops, which is something that CVP struggles with due to LVI
cycles.

I have made various optimizations to make PredicateInfo more efficient,
but unfortunately this still has significant compile-time cost (around
0.1-0.2%).
2025-08-19 10:59:38 +02:00
David Sherwood
13d8ba7dea
[LV][TTI] Calculate cost of extracting last index in a scalable vector (#144086)
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.

To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector: given a vector with
ElementCount EC, the extracted/inserted lane is EC - 1 - Index.
For scalable vectors this index is unknown at
compile time. I've added an AArch64 hook that better represents
the cost, and also a RISCV hook that maintains compatibility
with the behaviour prior to this PR.

I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
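
The indexing scheme can be illustrated with a standalone helper (a
sketch, not the TTI interface itself):

```c++
#include <cassert>

// For a vector of NumElts lanes, reverse index Idx addresses lane
// NumElts - 1 - Idx; with scalable vectors the concrete lane is
// unknown at compile time.
unsigned reverseLane(unsigned NumElts, unsigned Idx) {
  assert(Idx < NumElts && "reverse index out of range");
  return NumElts - 1 - Idx;
}
// reverseLane(4, 0) == 3: the last element of a 4-lane vector.
```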
2025-08-19 09:31:37 +01:00
Nikita Popov
5753ee2434 [LICM] Avoid assertion failure on stale MemoryDef
It can happen that the call is originally created as a MemoryDef,
and then later transforms show it is actually read-only and could
be a MemoryUse -- however, this is not guaranteed to be reflected
in MSSA.
2025-08-19 10:25:45 +02:00
Luke Lau
cabf6433c6
[VPlan] EVL transform VPVectorEndPointerRecipe alongside load/store recipes. NFC (#152542)
This is the first step in untangling the variable step transform and
header mask optimizations as described in #152541.

Currently we replace all VF users globally in the plan, including
VPVectorEndPointerRecipe. However this leaves reversed loads and stores
in an incorrect state until they are adjusted in optimizeMaskToEVL.

This moves the VPVectorEndPointerRecipe transform so that it is updated
in lockstep with the actual load/store recipe.

One thought that crossed my mind was that VPInterleaveRecipe could also
use VPVectorEndPointerRecipe, in which case we would have also been
computing the wrong address because we don't transform it to an EVL
recipe which accounts for the reversed address.
2025-08-19 08:16:48 +00:00
Luke Lau
144736b07e
[VPlan] Don't fold live ins with both scalar and vector operands (#154067)
If we end up with an extract_element VPInstruction where both operands
are live-ins, we will try to fold the live-ins even though the first
operand is a vector whilst the live-in is scalar.

This fixes it by just returning the vector live-in instead of calling
the folder, and removes the handling for insertelement where we aren't
able to do the fold. From some quick testing we previously never hit
this fold anyway, and were probably just missing test coverage.

Fixes #154045
2025-08-19 04:10:53 +00:00
Mel Chen
1dac302ce7
[LV] Explicitly disallow interleaved access requiring gap mask for scalable VFs. nfc (#154122)
Currently, VPInterleaveRecipe::execute does not support generating LLVM
IR for interleaved accesses that require a gap mask for scalable VFs.
It would be better to detect and prevent such groups from being
vectorized as interleaved accesses in
LoopVectorizationCostModel::interleavedAccessCanBeWidened, rather than
relying on the TTI function getInterleavedMemoryOpCost to return an
invalid cost.
2025-08-19 08:42:39 +08:00
Florian Hahn
79be94c984
[VPlan] Compute cost of single-scalar calls in computeCost. (NFC)
Compute the cost of non-intrinsic, single-scalar calls directly in
VPReplicateRecipe::computeCost.

This starts moving call cost computations to VPlan, handling the
simplest case first.
2025-08-18 21:56:56 +01:00
Thurston Dang
4220538e25
[msan] Handle multiply-add-accumulate; apply to AVX Vector Neural Network Instructions (VNNI) (#153927)
This extends the pmadd handler (recently improved in https://github.com/llvm/llvm-project/pull/153353) to three-operand intrinsics (multiply-add-accumulate), and applies it to the AVX Vector Neural Network Instructions.

Updates the tests from https://github.com/llvm/llvm-project/pull/153135
2025-08-18 13:18:27 -07:00
Florian Hahn
7e9989390d
[VPlan] Materialize Build(Struct)Vectors for VPReplicateRecipes. (NFCI) (#151487)
Materialize Build(Struct)Vectors explicitly for VPReplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.

Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code path altogether.

PR: https://github.com/llvm/llvm-project/pull/151487
2025-08-18 20:49:42 +01:00
Thurston Dang
ade755d62b
[msan] Add Instrumentation for Avx512 Instructions: pmaddw, pmaddubs (#153919)
This applies the pmadd handler (recently improved in https://github.com/llvm/llvm-project/pull/153353) to the Avx512
equivalent of the pmaddw and pmaddubs intrinsics:
  <16 x i32> @llvm.x86.avx512.pmaddw.d.512(<32 x i16>, <32 x i16>)
  <32 x i16> @llvm.x86.avx512.pmaddubs.w.512(<64 x i8>, <64 x i8>)
2025-08-18 11:31:15 -07:00
Kyle Wang
064f02dac0
[VectorCombine] Preserve scoped alias metadata (#153714)
Right now, if a load op is scalarized, the `!alias.scope` and `!noalias`
metadata are dropped. This PR keeps them if they exist.
2025-08-18 18:16:32 +00:00
Tobias Stadler
8135b7c1ab
[LV] Emit all remarks for unvectorizable instructions (#153833)
If ExtraAnalysis is requested, emit all remarks caused by unvectorizable instructions - instead of only the first.
This is in line with how other places handle DoExtraAnalysis and it can be quite helpful to get info about all instructions in a loop that prevent vectorization.
2025-08-18 18:04:53 +01:00
Ramkumar Ramachandra
97f554249c
[VPlan] Preserve nusw in createInBoundsPtrAdd (#151549)
Rename createInBoundsPtrAdd to createNoWrapPtrAdd, and preserve nusw as
well as inbounds at the callsite.
2025-08-18 17:48:42 +01:00
Andreas Jonson
1b60236200
[SimplifyCFG] Avoid redundant calls in gather. (NFC) (#154133)
Split out from https://github.com/llvm/llvm-project/pull/154007 as it
showed compile-time improvements.

NFC, as there need to be at least two icmps that are part of the chain.
2025-08-18 18:45:52 +02:00
Antonio Frighetto
33761df961
Revert "[SimpleLoopUnswitch] Record loops from unswitching non-trivial conditions"
This reverts commit e9de32fd159d30cfd6fcc861b57b7e99ec2742ab due to
multiple performance regressions observed across downstream Numba
benchmarks (https://github.com/llvm/llvm-project/issues/138509#issuecomment-3193855772).

While avoiding non-trivial unswitches on newly-cloned loops helps
mitigate the pathological case reported in https://github.com/llvm/llvm-project/issues/138509,
it may also make the IR less friendly to vectorization / loop
canonicalization (in the reported test, previously no select with a
loop-carried dependence existed in the new specialized loops), leading
to the approach being reconsidered.
2025-08-18 17:40:08 +02:00
Kazu Hirata
07eb7b7692
[llvm] Replace SmallSet with SmallPtrSet (NFC) (#154068)
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>.  Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:

  template <typename PointeeType, unsigned N>
  class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N> {};

We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
2025-08-18 07:01:29 -07:00
Arne Stenkrona
ea2f5395b1
[SimplifyCFG] Avoid threading for loop headers (#151142)
Updates SimplifyCFG to avoid jump threading through loop headers if
-keep-loops is requested. Canonical loop form requires a loop header
that dominates all blocks in the loop. If we thread through a header, we
risk breaking its domination of the loop. This change avoids this issue
by conservatively avoiding threading through headers entirely.

Fixes: https://github.com/llvm/llvm-project/issues/151144
2025-08-18 09:46:55 +00:00
David Sherwood
7ee6cf06c8
[LV] Fix incorrect cost kind in VPReplicateRecipe::computeCost (#153216)
We were incorrectly using the TTI::TCK_RecipThroughput cost kind and
ignoring the kind set in the context.
2025-08-18 09:52:31 +01:00
David Green
790bee99de
[VectorCombine] Remove dead node immediately in VectorCombine (#149047)
The vector combiner processes all instructions as it first loops
through the function, adding any newly added and deleted instructions to
a worklist which is then processed when all nodes are done. This leaves
extra uses in the graph while the initial processing is performed,
leading to sub-optimal decisions being made for other combines. This
changes it so that trivially dead instructions are removed immediately.
The main change this requires is making sure iterator invalidation does
not occur.
2025-08-18 07:55:21 +01:00
Kazu Hirata
cbf5af9668
[llvm] Remove unused includes (NFC) (#154051)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-08-17 23:46:35 -07:00
Mel Chen
145e8aadca
[LV][EVL] Add dead EVL mask into ToErase for consistency. nfc (#153761) 2025-08-18 14:11:50 +08:00
Owen Anderson
69e4514978
[GlobalOpt] Do not fold away addrspacecasts which may be runtime operations (#153753)
Specifically in the context of the once-stored transformation, GlobalOpt
would strip all pointer casts unconditionally, even though
addrspacecasts might be runtime operations. This manifested
particularly on CHERI targets.

This patch was inspired by an existing change in CHERI LLVM
(91afa60f17),
but has been reimplemented with updated conventions, and a testcase
constructed from scratch.
2025-08-18 02:11:51 +00:00
Florian Hahn
5892a2beec
[VPlan] Remove dead code from GetBroadCastInstr (NFCI).
All relevant places should already explicitly materialize broadcasts.
Remove dead code from VPTransformState::get.
2025-08-17 21:51:14 +01:00
Andreas Jonson
5ae8a9b8ce
[SimplifyCfg] Handle trunc nuw i1 condition in Equality comparison. (#153051)
proof: https://alive2.llvm.org/ce/z/WVt4-F
2025-08-17 09:53:40 +02:00
Florian Hahn
351d398a37
[VPlan] Run final VPlan simplifications before codegen.
Dissolving the hierarchical VPlan CFG and converting abstract to
concrete recipes can expose additional simplification opportunities.

Do a final run of simplifyRecipes before executing the VPlan.
2025-08-16 18:54:27 +01:00
Mircea Trofin
c971c25544
[licm] don't drop MD_prof when dropping other metadata (#152420)
Part of Issue #147390
2025-08-16 07:26:13 -07:00
Thurston Dang
638bd11c13
[msan] Handle SSE/AVX pshuf intrinsic by applying to shadow (#153895)
llvm.x86.sse.pshuf.w(<1 x i64>, i8) and
llvm.x86.avx512.pshuf.b.512(<64 x i8>, <64 x i8>) are currently handled
strictly, which is suboptimal.

llvm.x86.ssse3.pshuf.b(<1 x i64>, <1 x i64>),
llvm.x86.ssse3.pshuf.b.128(<16 x i8>, <16 x i8>) and
llvm.x86.avx2.pshuf.b(<32 x i8>, <32 x i8>) are currently heuristically
handled using maybeHandleSimpleNomemIntrinsic, which is incorrect.

Since the second argument is the shuffle order, we instrument all these
intrinsics using `handleIntrinsicByApplyingToShadow(...,
/*trailingVerbatimArgs=*/1)`
(https://github.com/llvm/llvm-project/pull/114490).
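
The idea can be modelled on plain bytes (a toy sketch of pshufb-style
lane selection, not the MSan implementation):

```c++
#include <array>
#include <cstdint>

// The shuffle's lane selection is applied to the shadow exactly as to
// the data, while the order operand is passed through verbatim (hence
// trailingVerbatimArgs=1).
std::array<uint8_t, 16> shuffleShadow(const std::array<uint8_t, 16> &Shadow,
                                      const std::array<uint8_t, 16> &Order) {
  std::array<uint8_t, 16> Out{};
  for (int I = 0; I < 16; ++I)
    Out[I] = (Order[I] & 0x80) ? 0 // pshufb: high bit selects a zero lane
                               : Shadow[Order[I] & 0x0F];
  return Out;
}
```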
2025-08-15 20:28:30 -07:00
Matt Arsenault
3e5d8a1439 Reapply "RuntimeLibcalls: Generate table of libcall name lengths (#153… (#153864)
This reverts commit 334e9bf2dd01fbbfe785624c0de477b725cde6f2.

Check if llvm-nm exists before building the benchmark.
2025-08-16 09:53:50 +09:00
Thurston Dang
2b75ff192d
[msan] Reland with even more improvement: Improve packed multiply-add instrumentation (#153353)
This reverts commit cf002847a464c004a57ca4777251b1aafc33d958 i.e.,
relands ba603b5e4d44f1a25207a2a00196471d2ba93424. It was reverted
because it was subtly wrong: multiplying an uninitialized zero should
not result in an initialized zero.

This reland fixes the issue by using instrumentation analogous to
visitAnd (bitwise AND of an initialized zero and an uninitialized value
results in an initialized value). Additionally, this reland expands a
test case; fixes the commit message; and optimizes the change to avoid
the need for horizontalReduce.

The current instrumentation has false positives: it does not take into
account that multiplying an initialized zero value with an uninitialized
value results in an initialized zero value. This change fixes the issue
during the multiplication step. The horizontal add step is modeled using
bitwise OR.

Future work can apply this improved handler to the AVX512 equivalent
intrinsics (x86_avx512_pmaddw_d_512, x86_avx512_pmaddubs_w_512) and AVX
VNNI intrinsics.
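
The zero-times-uninitialized rule can be illustrated with a toy shadow
model (a sketch, not the actual instrumentation):

```c++
#include <cstdint>

// One shadow bit per value bit; nonzero shadow = uninitialized bits.
struct Shadowed { uint32_t V; uint32_t S; };

Shadowed shadowedMul(Shadowed A, Shadowed B) {
  // A fully-initialized zero annihilates the other operand, so the
  // product is an initialized zero: the case the old handler missed.
  if (A.S == 0 && A.V == 0) return {0, 0};
  if (B.S == 0 && B.V == 0) return {0, 0};
  // Conservative otherwise: any uninitialized bit taints the product.
  return {A.V * B.V, (A.S | B.S) ? ~0u : 0u};
}
```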
2025-08-15 16:35:42 -07:00
gulfemsavrun
334e9bf2dd
Revert "RuntimeLibcalls: Generate table of libcall name lengths (#153… (#153864)
…210)"

This reverts commit 9a14b1d254a43dc0d4445c3ffa3d393bca007ba3.

Revert "RuntimeLibcalls: Return StringRef for libcall names (#153209)"

This reverts commit cb1228fbd535b8f9fe78505a15292b0ba23b17de.

Revert "TableGen: Emit statically generated hash table for runtime
libcalls (#150192)"

This reverts commit 769a9058c8d04fc920994f6a5bbb03c8a4fbcd05.

Reverted three changes because of a CMake error while building llvm-nm
as reported in the following PR:
https://github.com/llvm/llvm-project/pull/150192#issuecomment-3192223073
2025-08-15 13:32:27 -07:00
Alexey Bataev
b157599156 [SLP]Do not include copyable data to the same user twice
If the copyable schedule data is created and the user is used several
times in the user node, there is no need to count the same data for the
same user several times; it needs to be included only once.

Fixes #153754
2025-08-15 12:36:45 -07:00
Florian Hahn
2ed727f3f6
[VPlan] Move SCEV invalidation to ::executePlan. (NFCI)
Move SCEV invalidation from legacy ILV code-path directly to ::executePlan.
2025-08-15 20:32:41 +01:00
Alexey Bataev
09f5b9ab0a Revert "[SLP]Do not include copyable data to the same user twice"
This reverts commit 758c6852c3ffe6b5e259cafadd811e60d8c276fb to fix
buildbot https://lab.llvm.org/buildbot/#/builders/195/builds/13298
2025-08-15 12:08:31 -07:00
zGoldthorpe
82caa251d4
[InstCombine] Fold integer unpack/repack patterns through ZExt (#153583)
This patch explicitly enables the InstCombiner to fold integer
unpack/repack patterns such as

```llvm
define i64 @src_combine(i32 %lower, i32 %upper) {
  %base = zext i32 %lower to i64

  %u.0 = and i32 %upper, u0xff
  %z.0 = zext i32 %u.0 to i64
  %s.0 = shl i64 %z.0, 32
  %o.0 = or i64 %base, %s.0

  %r.1 = lshr i32 %upper, 8
  %u.1 = and i32 %r.1, u0xff
  %z.1 = zext i32 %u.1 to i64
  %s.1 = shl i64 %z.1, 40
  %o.1 = or i64 %o.0, %s.1

  %r.2 = lshr i32 %upper, 16
  %u.2 = and i32 %r.2, u0xff
  %z.2 = zext i32 %u.2 to i64
  %s.2 = shl i64 %z.2, 48
  %o.2 = or i64 %o.1, %s.2

  %r.3 = lshr i32 %upper, 24
  %u.3 = and i32 %r.3, u0xff
  %z.3 = zext i32 %u.3 to i64
  %s.3 = shl i64 %z.3, 56
  %o.3 = or i64 %o.2, %s.3

  ret i64 %o.3
}
; =>
define i64 @tgt_combine(i32 %lower, i32 %upper) {
  %base = zext i32 %lower to i64
  %upper.zext = zext i32 %upper to i64
  %s.0 = shl nuw i64 %upper.zext, 32
  %o.3 = or disjoint i64 %s.0, %base
  ret i64 %o.3
}
```

Alive2 proofs: [YAy7ny](https://alive2.llvm.org/ce/z/YAy7ny)
2025-08-15 12:48:32 -06:00
Alexey Bataev
758c6852c3 [SLP]Do not include copyable data to the same user twice
If the copyable schedule data is created and the user is used several
times in the user node, there is no need to count the same data for the
same user several times; it needs to be included only once.

Fixes #153754
2025-08-15 11:47:35 -07:00