595 Commits

Author SHA1 Message Date
Kerry McLaughlin
c34cba0413
[AArch64][SME] Lower aarch64.sme.cnts* to vscale when in streaming mode (#154305)
In streaming mode, the @llvm.aarch64.sme.cnts* and @llvm.aarch64.sve.cnt*
intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic
to @llvm.vscale(). This patch lowers the SME intrinsics similarly when
in streaming mode.
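For illustration, a minimal sketch assuming the ACLE intrinsics svcntw/svcntsw and an SME-enabled toolchain (e.g. -march=armv8-a+sme):

```cpp
#include <arm_sme.h>
#include <arm_sve.h>

// In a streaming function the runtime vector length equals the streaming
// vector length, so the SVE count (svcntw) and the SME streaming count
// (svcntsw) agree, and both reduce to a multiple of vscale.
bool counts_match(void) __arm_streaming {
  return svcntw() == svcntsw();  // always true while streaming
}
```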
2025-08-20 09:48:36 +01:00
David Sherwood
13d8ba7dea
[LV][TTI] Calculate cost of extracting last index in a scalable vector (#144086)
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.

To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector: given a vector with ElementCount
EC, the extracted/inserted lane is EC - 1 - Index. For scalable vectors
this index is unknown at compile time. I've added an AArch64 hook that
better represents the cost, and also a RISCV hook that maintains
compatibility with the behaviour prior to this PR.

I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
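As a rough illustration of the mapping (not the actual TTI hook):

```cpp
#include <cstdint>

// A reverse index I addresses lane EC - 1 - I of a vector with EC elements.
// For a fixed-length vector EC is a constant; for a scalable vector EC is
// vscale * KnownMin, so the forward lane is not known at compile time and
// the cost hook has to model that.
constexpr uint64_t reverseToForwardLane(uint64_t EC, uint64_t ReverseIdx) {
  return EC - 1 - ReverseIdx;
}

static_assert(reverseToForwardLane(4, 0) == 3, "last lane of a 4-wide vector");
static_assert(reverseToForwardLane(4, 1) == 2, "second-to-last lane");
```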
2025-08-19 09:31:37 +01:00
Benjamin Maxwell
81c06d198e
Reland "[AArch64][SME] Port all SME routines to RuntimeLibcalls" (#153417)
This updates everywhere we emit/check SME routines to use
RuntimeLibcalls to get the function name and calling convention.
2025-08-18 14:53:40 +01:00
Ahmad Yasin
1b0bce972b
Reorder checks to speed up getAppleRuntimeUnrollPreferences() (#154010)
- Delay the load/store value calculation until a best unroll count is
found
- Remove extra getLoopLatch() invocation
2025-08-18 11:06:37 +03:00
David Green
5836bae463
[AArch64] Change the cost of fma and fmuladd to match fmul. (#152963)
As fmul and fmadd are so similar, their performance characteristics tend
to be the same on most platforms, at least in terms of reciprocal
throughputs. Processors capable of performing a given number of fmul per
cycle can usually perform the same number of fma, with the extra add
being relatively simple on top. This patch makes the scores of the two
operations the same, which brings the throughput cost of a fma/fmuladd
to 2, and the latency to 3, which are the defaults for fmul.

Note that we might also want to change the throughput cost of a fmul to
1, as most processors have ample bandwidth for them, but they should
still stay in-line with one another.
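For context, a sketch of the kind of loop this affects (illustrative, not taken from the patch):

```cpp
// An axpy-style loop: each iteration is one fused multiply-add (an fmla once
// vectorized). With this change its throughput cost is 2 and its latency 3,
// the same values used for a standalone fmul.
void axpy(float *y, const float *x, float a, int n) {
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```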
2025-08-14 21:53:45 +01:00
Elvis Wang
01fac67e2a
[TTI] Add cost kind to getAddressComputationCost(). NFC. (#153342)
This patch adds a cost kind to `getAddressComputationCost()` for #149955.

Note that this patch also removes all the default values in `getAddressComputationCost()`.
2025-08-14 16:01:44 +08:00
Nikita Popov
48beed5b71
Revert "[AArch64][SME] Port all SME routines to RuntimeLibcalls" (#153392)
This introduced a 5% compile-time regression on AArch64, see
https://llvm-compile-time-tracker.com/compare.php?from=b9138bde3562de5c28a239dbd303caf2406678c6&to=271688b87abe7cf45aceaff8266270a25eb7b436&stat=instructions:u.

Reverts llvm/llvm-project#152505.
2025-08-13 11:54:39 +00:00
Ahmad Yasin
1f2fb8e979
[AArch64] Tune unrolling prefs for more patterns on Apple CPUs (#149358)
Enhance the heuristics in `getAppleRuntimeUnrollPreferences` to let a
few more loops be unrolled.

Specifically, this patch adjusts two checks:
I. Tune the loop size budget from 8 to 10
II. Include immediate in-loop users of loaded values in the load/store
dependencies predicate

---------

Co-authored-by: Florian Hahn <flo@fhahn.com>

PR: https://github.com/llvm/llvm-project/pull/149358
2025-08-13 11:16:54 +01:00
Benjamin Maxwell
271688b87a
[AArch64][SME] Port all SME routines to RuntimeLibcalls (#152505)
This updates everywhere we emit/check SME routines to use
RuntimeLibcalls to get the function name and calling convention.

Note: RuntimeLibcallEmitter had some issues with emitting non-unique
variable names for sets of libcalls, so I tweaked the output to avoid
the need for variables.
2025-08-13 08:48:59 +01:00
Sam Tebbs
0bfa1718af
[LV] Create in-loop sub reductions (#147026)
This PR allows the loop vectorizer to handle in-loop sub reductions by
forming a normal in-loop add reduction with a negated input.

Stacked PRs:
1. -> https://github.com/llvm/llvm-project/pull/147026
2. https://github.com/llvm/llvm-project/pull/147255
3. https://github.com/llvm/llvm-project/pull/147302
4. https://github.com/llvm/llvm-project/pull/147513
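A minimal source-level sketch of the rewrite described above (not code from the PR):

```cpp
// A sub reduction: acc -= x[i]. The vectorizer can treat this as an in-loop
// add reduction whose input has been negated, i.e. acc += -x[i].
int subReduce(const int *x, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc -= x[i];
  return acc;
}
```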
2025-08-12 10:22:41 +01:00
Luke Lau
acb86fb9e0
[TTI] Consistently pass the pointer type to getAddressComputationCost. NFCI (#152657)
In some places we were passing the type of the value being accessed, in
other cases we were passing the type of the pointer for the access.

The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.

This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.

No target actually checks the contents of the type passed, only to see
if it's a vector or not, so this shouldn't have an effect.
2025-08-11 18:00:12 +08:00
David Green
26b302fd8b [AArch64] Rename Cost -> PromotedCost to avoid shadowing error 2025-08-08 14:37:24 +01:00
David Green
7f1638efc1
[AArch64] Generalize costing for FP16 instructions (#150033)
This extracts the code for modelling a fp16 operation as
`fptrunc(fpop(fpext,fpext))` into a new function named
getFP16BF16PromoteCost so that it can be reused by the
arithmetic instructions. The function takes a lambda to
calculate the cost of the operation with the promoted type.
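A minimal sketch of the promotion pattern being costed, assuming a toolchain with _Float16 support:

```cpp
// Without native fp16 arithmetic, an fp16 multiply is modelled as
// fptrunc(fmul(fpext(a), fpext(b))): extend both operands to float,
// perform the operation, then truncate the result back to fp16.
_Float16 mulF16(_Float16 A, _Float16 B) {
  return (_Float16)((float)A * (float)B);
}
```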
2025-08-08 13:40:07 +01:00
Graham Hunter
de72cca671
[CostModel] Provide a default model for histogram intrinsics (#149348)
Since we scalarize these intrinsics when the target does not support
them, we should model that for costing purposes.
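For illustration, the scalar loop shape that a scalarized histogram update amounts to (a sketch, not the intrinsic lowering itself):

```cpp
// Per-element histogram update: when the target has no native histogram
// support, the vector intrinsic is costed as this kind of scalarized loop.
void histogram(unsigned *hist, const unsigned *idx, int n) {
  for (int i = 0; i < n; ++i)
    hist[idx[i]] += 1;
}
```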
2025-08-08 11:00:00 +01:00
Paul Walker
94d374ab6c
[LLVM][CGP] Allow finer control for sinking compares. (#151366)
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse-grained because
it does not take into account the differences between scalar and vector
compares. This PR extends the interface to take an EVT to allow finer
control.
    
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
2025-08-05 11:43:41 +01:00
David Green
b30d5315b7
[AArch64] Add better fcmp costs for expanded predicates (#147940)
Certain fcmp predicates need to be expanded into multiple operations that
are or'd together. This adds some more accurate cost modelling for them
based on the predicate. Unsupported operations are given the cost of a
libcall and the latency is set to 2 as that seemed to be fairly common
between different CPUs.
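As one example of such an expansion (illustrative only), the "ordered and not equal" (ONE) predicate has no single AArch64 compare and becomes two compares or'd together:

```cpp
// fcmp one a, b  ==>  (a < b) || (a > b)
// Two floating-point compares plus an OR of the condition results; both
// sides are false if either input is NaN, matching ONE's semantics.
bool fcmpOne(double a, double b) {
  return (a < b) || (a > b);
}
```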
2025-08-04 13:42:57 +01:00
David Green
e136fb04f2
[AArch64] Add sve bf16 fpext and fpround costs. (#150485)
This prevents them from generating Invalid costs, as generating the
instructions seems to work fine with and without +bf16. The costs are
mostly taken from the number of instructions (minus ptrue and constants).
2025-08-04 09:47:41 +01:00
Sander de Smalen
76ce464073
[AArch64] Dont inline streaming fn into non-streaming caller (#150595)
Without this change, the following test would fail to compile
with `-march=armv8-a+sme`:

```
  void func1(const svuint32_t *in, svuint32_t *out) {
    [&]() __arm_streaming { *out = *in; }();
  }
```

But in general, it's probably better never to inline
streaming functions into non-streaming functions, because
they will have been marked as 'streaming' for a reason
by the user.
2025-08-01 09:05:19 +01:00
John Brawn
9a9b8b7d1c
[AArch64] Allow unrolling of scalar epilogue loops (#151164)
#147420 changed the unrolling preferences to permit unrolling of
non-auto-vectorized loops by checking for the isvectorized attribute.
However, when a loop is vectorized this attribute is put on both the
vector loop and the scalar epilogue, so that change prevented the scalar
epilogue from being unrolled.

Restore the previous behaviour of unrolling the scalar epilogue by
checking both for the isvectorized attribute and vector instructions in
the loop.
2025-07-31 11:03:41 +01:00
Alexandros Lamprineas
3ab64c5b29
[NFC][Clang][FMV] Make FMV priority data type future proof. (#150079)
FMV priority is the returned value of a polymorphic function. On RISC-V
and X86 targets a 32-bit value is enough. On AArch64 we currently need
64 bits and we will soon exceed that. APInt seems to be a suitable
replacement for uint64_t, presumably with minimal compile time overhead.
It allows bit manipulation, comparison and variable bit width.
2025-07-23 10:37:29 +01:00
David Green
828a867ee0
[AArch64] Reduce the costs of and/or/xor reductions (#148553)
Since the costs were added the codegen for i8/i16 and/or/xor reductions
has improved. This updates the cost model to produce the same costs in
terms of number of instructions.
2025-07-16 09:59:36 +01:00
Jeremy Morse
57a5f9c47e
[DebugInfo][RemoveDIs] Suppress getNextNonDebugInfoInstruction (#144383)
There are no longer debug-info instructions, thus we don't need this
skipping. Hooray!
2025-07-15 15:34:10 +01:00
Florian Hahn
02d3738be9
[AArch64,TTI] Remove RealUse check for vector insert/extract costs. (#146526)
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.

This meant that passes like LoopVectorize and SLPVectorizer, which
likely are the main users of the interface, would underestimate the cost
of inserts/extracts that move data between GPR and vector registers,
which have non-trivial costs.

The patch removes the special case and only returns costs of zero for
lane 0 if there is no need to transfer between integer and vector
registers.

This impacts a number of SLP tests, and most of them look like general
improvements. I think the change should make things more accurate for any
AArch64 target, but if not it could also just be Apple CPU specific.

I am seeing +2% end-to-end improvements on SLP-heavy workloads.

PR: https://github.com/llvm/llvm-project/pull/146526
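For reference, a small NEON sketch (assuming <arm_neon.h>) of the kind of insert the old model priced at zero:

```cpp
#include <arm_neon.h>

// Inserting a scalar held in a general-purpose register into a vector lane
// needs a cross-register-file move (e.g. an INS); the reverse extract to a
// GPR (UMOV/FMOV) has a similar, non-zero cost.
int32x4_t setLane2(int32x4_t v, int s) {
  return vsetq_lane_s32(s, v, 2);
}
```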
2025-07-15 15:19:27 +01:00
David Green
58d79aaba6
[AArch64] Guard against non-simple types in udiv sve costs. (#148580)
The code here probably needs to change to handle types more uniformly,
but this patch prevents it from trying to use a simple type where one does
not exist.

Fixes #148438.
2025-07-15 10:25:08 +01:00
Ahmad Yasin
671072e830
[AArch64] Unrolling of loops with vector instructions. (#147420)
This patch permits loops with vector instructions to be unrolled.

Today there is an early exit in `getUnrollingPreferences()` of AArch64
targets if a vector instruction is observed in any of the loop blocks.
This patch fixes that so common loops like this one get a chance to be
unrolled:

```cpp
void saxpy(float *dst, const float *src, const float a, const int len) {
  float32x4_t *vdst = (float32x4_t *)dst;
  float32x4_t *vsrc = (float32x4_t *)src;
  float32x4_t vk = vdupq_n_f32(a);
  for (int i = 0; i < (len >> 2); i++) {
    vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
  }
}
```

Auto-vectorized loops are still not unrolled, unless they were not
interleaved when vectorized.

The provided test case shows the enhancement on top of runtime/partial
unrolling, depending on the CPU.

PR: https://github.com/llvm/llvm-project/pull/147420
2025-07-14 20:53:09 +01:00
Benjamin Maxwell
43a9ec2ecd
[AArch64][SME] Instcombine llvm.aarch64.sme.in.streaming.mode() (#147930)
This can fold away in functions with known streaming modes.
2025-07-13 13:20:20 +01:00
David Green
a647fd7dda [AArch64] Add a cost for v2i32 vecreduce.add.
These can lower to an addp. The scores do not change with this patch, but this
should help keep them the same with #146526.
2025-07-13 08:06:10 +01:00
David Green
10f782456e
[AArch64] Enable other cost kinds for getCmpSelInstrCost. (#144375)
This removes the CostKind == TCK_RecipThroughput limitation from
getCmpSelInstrCost, allowing it to return more accurate costs for CodeSize and
Lat / SizeLat. Especially for larger vectors under CodeSize, the returned costs
are currently 1, not the legalization cost.
2025-07-10 07:12:21 +01:00
Florian Hahn
d3d77f71aa
[EarlyCSE,TTI] Don't create new, unused, instructions. (#134534)
getOrCreateResultFromMemIntrinsic can modify the current function by
inserting new instructions without EarlyCSE keeping track of the
changes.

Introduce a new CanCreate argument, and update the function to only create
new instructions when CanCreate = true. Use it when appropriate.

Fixes https://github.com/llvm/llvm-project/issues/145183
2025-07-08 21:43:51 +01:00
David Sherwood
f575b18fdc
[LV] Add support for partial reductions without a binary op (#133922)
Consider IR such as this:

```
for.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %accum = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %gep.a = getelementptr i8, ptr %a, i64 %iv
  %load.a = load i8, ptr %gep.a, align 1
  %ext.a = zext i8 %load.a to i32
  %add = add i32 %ext.a, %accum
  %iv.next = add i64 %iv, 1
  %exitcond.not = icmp eq i64 %iv.next, 1025
  br i1 %exitcond.not, label %for.exit, label %for.body
```

Conceptually we can vectorise this using partial reductions too,
although the current loop vectoriser implementation requires the
accumulation of a multiply. For AArch64 this is easily done with
a udot or sdot with an identity operand, i.e. a vector of (i16 1).

In order to do this I had to teach getScaledReductions that the
accumulated value may come from a unary op, hence there is only
one extension to consider. Similarly, I updated the vplan and
AArch64 TTI cost model to understand the possible unary op.
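A source-level equivalent of the IR above (illustrative), which on AArch64 can now be handled as a partial reduction using udot with an all-ones operand:

```cpp
// Sum of zero-extended bytes with no multiply involved; the accumulator
// update comes from a unary zext rather than a binary op.
unsigned sumBytes(const unsigned char *a) {
  unsigned acc = 0;
  for (int i = 0; i < 1025; ++i)
    acc += a[i];
  return acc;
}
```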

---------

Co-authored-by: Matt Devereau <matthew.devereau@arm.com>
2025-07-02 13:05:51 +01:00
Graham Hunter
85bc868417
[AArch64][TTI] Reduce cost for splatting whole first vector segment (SVE) (#145701)
Improve cost modeling for splatting the first 128b segment.
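A hedged ACLE sketch of the operation whose cost is reduced, assuming the svdupq_lane intrinsic:

```cpp
#include <arm_sve.h>

// Broadcast the first 128-bit segment of an SVE vector to every segment.
// This maps to a single quadword DUP rather than a general shuffle, so it
// deserves a correspondingly low cost.
svuint32_t splatFirstSegment(svuint32_t v) {
  return svdupq_lane_u32(v, 0);
}
```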
2025-07-02 09:51:56 +01:00
Kazu Hirata
3d5903c4d8
[llvm] Use llvm::is_contained (NFC) (#145844)
llvm::is_contained is shorter than llvm::any_of plus a lambda.
2025-06-26 08:41:18 -07:00
Graham Hunter
2db3cc4f34
[AArch64][CostModel] Lower cost of dupq (SVE2.1) (#144918)
With codegen in place to match shuffles to dupq, we can now lower the
cost to something reasonable.
2025-06-25 13:26:56 +01:00
David Green
8583882bdc [AArch64] Remove unnecessary DL variable. NFC 2025-06-22 10:52:13 +01:00
David Green
77941eba7f
[CostModel] Add a DstTy to getShuffleCost (#141634)
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions about
how vectors will be legalized are built into the cost code -
for example whether they will widen or promote, with the cost modelling
assuming they will widen while the default lowering for integer vectors
is to promote.

This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).

Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
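A minimal sketch, using public LLVM headers rather than code from the patch, of how a destination type relates to the source element type and the mask length:

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"

// For a fixed-length shuffle the result type is <Mask.size() x SrcEltTy>.
// When the mask is absent, or the type passed in is not reliably the source
// type, this cannot be reconstructed - hence the explicit DstTy parameter.
llvm::FixedVectorType *shuffleDstTy(llvm::FixedVectorType *SrcTy,
                                    llvm::ArrayRef<int> Mask) {
  return llvm::FixedVectorType::get(SrcTy->getElementType(), Mask.size());
}
```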
2025-06-21 12:29:29 +01:00
Philip Reames
b96370131d
[TTI] Plumb CostKind through getPartialReductionCost (#144953)
Purely for the sake of being idiomatic with other TTI costing routines,
no direct motivation beyond that.
2025-06-19 15:29:56 -07:00
AZero13
dc72b91ffe
[AArch64] Report icmp as free if it can be folded into ands (#143286)
Since changing the backend to fold x >= 1 / x < 1 -> x > 0 / x <= 0 and
x <= -1 / x > -1 -> x < 0 / x >= 0, this should be reflected in the
cost.
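For illustration (not from the patch), a compare of this shape now folds into the flag-setting AND and is treated as free:

```cpp
// For signed x, (x & mask) >= 1 is the same as (x & mask) > 0, so the
// compare can be absorbed by an ANDS/TST setting the flags, leaving no
// separate icmp to pay for.
bool maskedValuePositive(int x, int mask) {
  return (x & mask) >= 1;
}
```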
2025-06-17 14:59:38 +01:00
Luke Lau
7ef77eb998
[LV] Support scalable interleave groups for factors 3,5,6 and 7 (#141865)
Currently the loop vectorizer can only vectorize interleave groups for
power-of-2 factors at scalable VFs by recursively interleaving
[de]interleave2 intrinsics.

However after https://github.com/llvm/llvm-project/pull/124825 and
#139893, we now have [de]interleave intrinsics for all factors up to 8,
which is enough to support all types of segmented loads and stores on
RISC-V.

Now that the interleaved access pass has been taught to lower these in
#139373 and #141512, this patch teaches the loop vectorizer to emit
these intrinsics for factors up to 8, which enables scalable
vectorization for non-power-of-2 factors.

As far as I'm aware, no in-tree target will vectorize a scalable
interleave group above factor 8 because the maximum interleave factor is
capped at 4 on AArch64 and 8 on RISC-V, and the
`-max-interleave-group-factor` CLI option defaults to 8, so the
recursive [de]interleaving code has been removed for now.

Factors of 3 with scalable VFs are also turned off in AArch64 since
there's no lowering for [de]interleave3 just yet either.
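An example of a factor-3 interleave group (illustrative; whether it vectorizes at a scalable VF depends on the target, as noted above):

```cpp
// Each iteration reads and writes three adjacent elements (an RGB pixel),
// forming an interleave group of factor 3, which the [de]interleave
// intrinsics can now express at scalable VFs where the target supports it.
void brighten(unsigned char *px, int n) {
  for (int i = 0; i < n; ++i) {
    px[3 * i + 0] += 16;  // R
    px[3 * i + 1] += 16;  // G
    px[3 * i + 2] += 16;  // B
  }
}
```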
2025-06-12 11:09:09 +01:00
AZero13
79a72c47d0
[AArch64] Consider negated powers of 2 when calculating throughput cost (#143013)
Negated powers of 2 have similar codegen to positive powers of 2 when
lowering sdiv (and identical codegen in the case of remainder). For sdiv,
the result is simply negated at the end, so the cost is essentially the same.
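A small worked example of the equivalence, using standard C/C++ truncating division:

```cpp
// Division by a negated power of two is the positive-power sequence plus a
// final negate; the remainder is identical to the positive-power case.
int divNeg8(int x) { return x / -8; }  // equal to -(x / 8)
int remNeg8(int x) { return x % -8; }  // equal to x % 8
```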
2025-06-11 11:29:37 +01:00
Paul Walker
f43aaf90df
[NFC][LLVM] Refactor IRBuilder::Create{VScale,ElementCount,TypeSize}. (#142803)
CreateVScale took a scaling parameter that had a single use outside of
IRBuilder with all other callers having to create a redundant
ConstantInt. To work around this, some code preferred to use
CreateIntrinsic directly.

This patch simplifies CreateVScale to return a call to the llvm.vscale()
intrinsic and nothing more. As well as simplifying the existing call
sites I've also migrated the uses of CreateIntrinsic.

Whilst IRBuilder used CreateVScale's scaling parameter as part of the
implementations of CreateElementCount and CreateTypeSize, I have
follow-on work to switch them to the NUW variety and thus they would
stop using CreateVScale's scaling as well. To prepare for this I have
moved the multiplication and constant folding into the implementations
of CreateElementCount and CreateTypeSize.

As a final step I have replaced some callers of CreateVScale with
CreateElementCount where it's clear from the code they wanted the
latter.
2025-06-10 12:35:59 +01:00
Jonathan Thackray
62cae4ffcb
[AArch64] Fix a multitude of AArch64 typos (NFC) (#143370)
Fix a multitude of typos in the AArch64 codebase using the
https://github.com/crate-ci/typos Rust package.
2025-06-09 22:13:22 +01:00
Ramkumar Ramachandra
0240129218
[IVDesc] Unify RecurKinds [I|F]AnyOf (#118393)
Co-authored-by: Mel Chen <mel.chen@sifive.com>
2025-05-23 11:57:30 +01:00
David Green
ff78d233c0 [AArch64] Ensure Source1 and Source2 are initialized.
Try to appease the sanitizer builders by making sure Source1 and Source2
are initialized when passed to the tuple.

See #139331.
2025-05-16 22:03:25 +01:00
David Green
0de8ff6d9c
[AArch64] Reduce the cost of repeated sub-shuffle (#139331)
Given a larger-than-legal shuffle we will split into multiple sub-parts.
This adds a check to the computed costs of sub-shuffles so that repeated
sequences are not accounted for multiple times. This especially reduces
the cost of broadcasts/splats.
2025-05-16 18:50:05 +01:00
Benjamin Maxwell
647db1b02d
Reland "[AArch64][SME] Split SMECallAttrs out of SMEAttrs" (#138671)
SMECallAttrs is a new helper class that holds all the SMEAttrs for a
call. The interfaces to query actions needed for the call (e.g. change
streaming mode) have been moved to the SMECallAttrs class.

The main motivation for this change is to make the split between the
caller, callee, and callsite attributes more apparent.

Before this change, we would always merge callsite and callee
attributes. The main reason to do this was to handle indirect calls;
however, we also occasionally used callsite attributes on direct calls
in tests (mainly to avoid creating multiple function declarations). With
this patch, we now explicitly handle indirect calls and disallow
incompatible attributes on direct calls (so this patch is not entirely
an NFC).

Same as #137239, but with a change to avoid inferring SME attributes for
function definitions. This allows stubbing the SME ABI routines in C/C++
(and matches the old behaviour).
2025-05-15 08:37:08 +01:00
David Green
1b13849a9b [AArch64] Add bf16 broadcast and transpose costs
These are only based on the size of the element, not the type (although the
codegen does need to account for it).
2025-05-09 22:38:57 +01:00
David Green
3b4d5638b3 [AArch64] Limit vector splitting to vectors of size larger than 128bit
The intent of this code is to split larger vectors into smaller shuffles, but
it is currently triggering on some small vector types. Limit it to vectors
larger than 128 bits.
2025-05-09 22:17:28 +01:00
Benjamin Maxwell
703b479f16
Revert "[AArch64][SME] Split SMECallAttrs out of SMEAttrs" (#138664)
Reverts llvm/llvm-project#137239

This broke implementing SME ABI routines in C/C++ (used for some stubs),
see: https://lab.llvm.org/buildbot/#/builders/94/builds/6859
2025-05-06 10:28:13 +01:00
Benjamin Maxwell
cadf652857
[AArch64][SME] Split SMECallAttrs out of SMEAttrs (#137239)
SMECallAttrs is a new helper class that holds all the SMEAttrs for a
call. The interfaces to query actions needed for the call (e.g. change
streaming mode) have been moved to the SMECallAttrs class.

The main motivation for this change is to make the split between the
caller, callee, and callsite attributes more apparent.

Before this change, we would always merge callsite and callee
attributes. The main reason to do this was to handle indirect calls;
however, we also occasionally used callsite attributes on direct calls
in tests (mainly to avoid creating multiple function declarations). With
this patch, we now explicitly handle indirect calls and disallow
incompatible attributes on direct calls (so this patch is not entirely
an NFC).
2025-05-06 09:36:26 +01:00
Ricardo Jesus
fbd9a3160b
[AArch64][SVE] Combine UXT[BHW] intrinsics to AND. (#137956)
This patch combines uxt[bhw] intrinsics to and_u when the governing
predicate is all-true or the passthrough is undef (e.g. in cases of
"unknown" merging). This improves code gen as the latter can be
emitted as AND immediate instructions.

For example, given:
```cpp
svuint64_t foo(svuint64_t x) {
  return svextb_z(svptrue_b64(), x);
}
```

Currently:
```gas
foo:
  ptrue   p0.d
  movi    v1.2d, #0000000000000000
  uxtb    z0.d, p0/m, z0.d
  ret
```

Becomes:
```gas
foo:
  and     z0.d, z0.d, #0xff
  ret
```
2025-05-06 08:48:08 +01:00