In streaming mode, the @llvm.aarch64.sme.cnts and @llvm.aarch64.sve.cnt
intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic
to @llvm.vscale(). This patch lowers the SME intrinsic similarly when in
streaming mode.
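As an illustration (not part of the patch), the equivalence is visible at the
ACLE level: inside a streaming function the SVE and SME counting builtins
return the same value (compiled with SME enabled, e.g. `-march=armv8-a+sme`):
```cpp
#include <arm_sme.h>
#include <stdint.h>

// svcntw() goes through @llvm.aarch64.sve.cntw and svcntsw() through
// @llvm.aarch64.sme.cntsw; in streaming mode both count the streaming
// vector length, i.e. vscale * 4 for 32-bit elements.
uint64_t count_words(void) __arm_streaming {
  uint64_t sve_count = svcntw();   // SVE counting intrinsic
  uint64_t sme_count = svcntsw();  // SME counting intrinsic
  return sve_count == sme_count ? sve_count : 0;  // always equal here
}
```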
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is counted
backwards from the end of the vector: for a vector with
ElementCount EC, the extracted/inserted lane is EC - 1 - Index.
For scalable vectors that lane is unknown at compile time. I've
added an AArch64 hook that better represents the cost, and also
a RISCV hook that maintains compatibility with the behaviour
prior to this PR.
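A minimal sketch of the indexing convention (the helper name is illustrative
and not part of the new interface):
```cpp
#include <cstdint>

// Reverse index Idx refers to lane EC - 1 - Idx of a vector with element
// count EC; Idx == 0 is the last lane. For scalable vectors EC is only a
// known minimum, so the concrete lane is a runtime value.
uint64_t laneFromReverseIndex(uint64_t EC, uint64_t Idx) {
  return EC - 1 - Idx;
}
```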
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
As fmul and fmadd are so similar, their performance characteristics tend
to be the same on most platforms, at least in terms of reciprocal
throughputs. Processors capable of performing a given number of fmul per
cycle can usually perform the same number of fma, with the extra add
being relatively simple on top. This patch makes the scores of the two
operations the same, which brings the throughput cost of an fma/fmuladd
to 2 and the latency to 3, which are the defaults for fmul.
Note that we might also want to change the throughput cost of an fmul to
1, as most processors have ample bandwidth for them, but the two should
still stay in line with one another.
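For reference (not part of the patch), the two operations whose costs are
being aligned:
```cpp
#include <cmath>

// On AArch64 std::fma lowers to fmadd; with this change it is costed the
// same as a plain fmul: throughput 2, latency 3.
float mul(float a, float b)              { return a * b; }             // fmul
float mul_add(float a, float b, float c) { return std::fma(a, b, c); } // fmadd
```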
This patch adds a cost kind to `getAddressComputationCost()` for #149955.
Note that this patch also removes all the default values in `getAddressComputationCost()`.
Enhance the heuristics in `getAppleRuntimeUnrollPreferences` to let a
few more loops be unrolled.
Specifically, this patch adjusts two checks:
I. Raise the loop size budget from 8 to 10
II. Include immediate in-loop users of loaded values in the load/store
dependencies predicate
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
PR: https://github.com/llvm/llvm-project/pull/149358
This updates everywhere we emit or check an SME routine to use
RuntimeLibcalls to get the function name and calling convention.
Note: RuntimeLibcallEmitter had some issues with emitting non-unique
variable names for sets of libcalls, so I tweaked the output to avoid
the need for variables.
In some places we were passing the type of the value being accessed, and
in other cases the type of the pointer for the access.
The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.
This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.
No target actually inspects the type passed beyond checking whether or
not it is a vector, so this shouldn't have an effect.
This extracts the code for modelling an fp16 operation as
`fptrunc(fpop(fpext,fpext))` into a new function named
getFP16BF16PromoteCost so that it can be reused by the
arithmetic instructions. The function takes a lambda to
calculate the cost of the operation with the promoted type.
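For illustration (not part of the patch), the promotion pattern being costed
looks like this at the source level:
```cpp
// Without native fp16 arithmetic, an _Float16 add is modelled as
// fptrunc(fadd(fpext a, fpext b)): extend both operands to float, add,
// then truncate the result back to half.
_Float16 add_h(_Float16 a, _Float16 b) {
  return a + b;
}
```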
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse-grained, as
it does not take into account the differences between scalar and vector
compares. This PR extends the interface to take an EVT to allow finer
control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
Certain fcmp predicates need to be expanded into multiple operations
that are then or'd together. This adds some more accurate cost modelling
for them
based on the predicate. Unsupported operations are given the cost of a
libcall and the latency is set to 2 as that seemed to be fairly common
between different CPUs.
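As an illustration (not from the patch), the ordered not-equal predicate is
one such case: with no single matching compare instruction it becomes two
compares whose results are or'd together:
```cpp
// Semantics of fcmp "one" (ordered, not equal): false if either operand is
// NaN, true when both are ordered and differ; built from two compares + or.
bool fcmp_one(float a, float b) {
  return (a < b) || (a > b);
}
```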
This prevents them from generating Invalid costs, as generating the
instructions seems to work fine with and without +bf16. The costs are
mostly taken from the number of instructions (minus ptrue and constants).
Without this change, the following test would fail to compile
with `-march=armv8-a+sme`:
```
void func1(const svuint32_t *in, svuint32_t *out) {
  [&]() __arm_streaming { *out = *in; }();
}
```
But in general, it's probably better never to inline
streaming functions into non-streaming functions, because
they will have been marked as 'streaming' for a reason
by the user.
#147420 changed the unrolling preferences to permit unrolling of
non-auto-vectorized loops by checking for the isvectorized attribute.
However, when a loop is vectorized this attribute is put on both the
vector loop and the scalar epilogue, so the change prevented the scalar
epilogue from being unrolled.
Restore the previous behaviour of unrolling the scalar epilogue by
checking both for the isvectorized attribute and vector instructions in
the loop.
FMV priority is the value returned by a polymorphic function. On RISC-V
and X86 targets a 32-bit value is enough. On AArch64 we currently need
64 bits and we will soon exceed that. APInt seems to be a suitable
replacement for uint64_t, presumably with minimal compile-time overhead.
It allows bit manipulation, comparison, and variable bit widths.
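A minimal sketch (values and widths are illustrative) of the kind of use
APInt supports here:
```cpp
#include "llvm/ADT/APInt.h"

// An APInt priority can be wider than 64 bits while still supporting the
// bit manipulation and comparisons the FMV priority computation needs.
llvm::APInt makePriority(unsigned NumBits, unsigned FeatureBit) {
  llvm::APInt Priority(NumBits, 0);  // e.g. NumBits > 64 once AArch64 needs it
  Priority.setBit(FeatureBit);       // mark one feature's priority bit
  return Priority;
}
```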
Since the costs were added, the codegen for i8/i16 and/or/xor reductions
has improved. This updates the cost model so that the costs match the
number of instructions now generated.
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.
This meant that passes like LoopVectorize and SLPVectorizer, which are
likely the main users of the interface, would underestimate the cost of
inserts/extracts that move data between GPR and vector registers, which
have non-trivial costs.
The patch removes the special case and only returns costs of zero for
lane 0 if there is no need to transfer between integer and vector
registers.
This impacts a number of SLP tests, and most of them look like general
improvements. I think the change should make things more accurate for
any AArch64 target, but if not it could also just be made Apple CPU
specific.
I am seeing +2% end-to-end improvements on SLP-heavy workloads.
PR: https://github.com/llvm/llvm-project/pull/146526
The code here probably needs to change to handle types more uniformly,
but this patch prevents it from trying to use a simple type where one
does not exist.
Fixes #148438.
This patch permits loops with vector instructions to be unrolled.
Today there is an early exit in `getUnrollingPreferences()` of AArch64
targets if a vector instruction is observed in any of the loop blocks.
This patch fixes that so common loops like this one get a chance to be
unrolled:
void saxpy (float * dst, const float * src, const float a, const int len) {
  float32x4_t * vdst = (float32x4_t *)dst;
  float32x4_t * vsrc = (float32x4_t *)src;
  float32x4_t vk = vdupq_n_f32(a);
  for (int i = 0; i < (len >> 2); i++)
  {
    vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
  }
}
Auto-vectorized loops are still not unrolled, unless they were not
interleaved when vectorized.
The provided test case shows the enhancement on top of runtime/partial
unrolling, depending on the CPU.
PR: https://github.com/llvm/llvm-project/pull/147420
This removes the CostKind == TCK_RecipThroughput limitation from
getCmpSelInstrCost, allowing it to return more accurate costs for CodeSize and
Lat / SizeLat. Especially for larger vectors under CodeSize, the returned costs
are currently 1, not the legalization cost.
getOrCreateResultFromMemIntrinsic can modify the current function by
inserting new instructions without EarlyCSE keeping track of the
changes.
Introduce a new CanCreate argument, and update the function to only
create new instructions when CanCreate is true. Use it when appropriate.
Fixes https://github.com/llvm/llvm-project/issues/145183
Consider IR such as this:
for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
%accum = phi i32 [ 0, %entry ], [ %add, %for.body ]
%gep.a = getelementptr i8, ptr %a, i64 %iv
%load.a = load i8, ptr %gep.a, align 1
%ext.a = zext i8 %load.a to i32
%add = add i32 %ext.a, %accum
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1025
br i1 %exitcond.not, label %for.exit, label %for.body
Conceptually we can vectorise this using partial reductions too,
although the current loop vectoriser implementation requires the
accumulation of a multiply. For AArch64 this is easily done with
a udot or sdot with an identity operand, i.e. a vector of (i16 1).
In order to do this I had to teach getScaledReductions that the
accumulated value may come from a unary op, hence there is only
one extension to consider. Similarly, I updated the vplan and
AArch64 TTI cost model to understand the possible unary op.
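For reference (not part of the patch), the IR above is a plain widening sum;
treating the all-ones vector as the dot-product operand is what makes the
udot/sdot lowering applicable:
```cpp
#include <cstdint>

// acc += zext(a[i]) is sum(zext(a[i])), which equals dot(a, <1, 1, ..., 1>),
// so a dot-product instruction with an identity (all-ones) operand can
// accumulate it even though the source loop contains no multiply.
uint32_t sum_bytes(const uint8_t *a) {
  uint32_t acc = 0;
  for (uint64_t i = 0; i < 1025; ++i)
    acc += a[i];
  return acc;
}
```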
---------
Co-authored-by: Matt Devereau <matthew.devereau@arm.com>
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen even though the default lowering promotes
integer vectors.
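A small sketch of that relationship (the helper name is illustrative, not
from the patch):
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"

// With a mask available, the shuffle destination type is
// <MaskElts x SrcEltTy>, which need not match the source vector type.
static llvm::VectorType *shuffleDstTyFromMask(llvm::VectorType *SrcTy,
                                              llvm::ArrayRef<int> Mask) {
  return llvm::VectorType::get(
      SrcTy->getElementType(), static_cast<unsigned>(Mask.size()),
      /*Scalable=*/llvm::isa<llvm::ScalableVectorType>(SrcTy));
}
```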
This patch attempts to start improving that - it originally tried to
alter more of the cost model, but that quickly became too many changes
at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
Currently the loop vectorizer can only vectorize interleave groups for
power-of-2 factors at scalable VFs, by recursively nesting
[de]interleave2 intrinsics.
However after https://github.com/llvm/llvm-project/pull/124825 and
#139893, we now have [de]interleave intrinsics for all factors up to 8,
which is enough to support all types of segmented loads and stores on
RISC-V.
Now that the interleaved access pass has been taught to lower these in
#139373 and #141512, this patch teaches the loop vectorizer to emit
these intrinsics for factors up to 8, which enables scalable
vectorization for non-power-of-2 factors.
As far as I'm aware, no in-tree target will vectorize a scalable
interleave group above factor 8 because the maximum interleave factor is
capped at 4 on AArch64 and 8 on RISC-V, and the
`-max-interleave-group-factor` CLI option defaults to 8, so the
recursive [de]interleaving code has been removed for now.
Factors of 3 with scalable VFs are also turned off in AArch64 since
there's no lowering for [de]interleave3 just yet either.
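As an illustration (not from the patch), a non-power-of-2 interleave group,
here with factor 3, of the kind that scalable-VF vectorization can now handle
on targets with the corresponding lowering (e.g. RISC-V):
```cpp
#include <cstdint>

// A factor-3 interleave group: three members sharing one stride-3 access
// pattern. With [de]interleave intrinsics for factors up to 8, such groups
// no longer need a power-of-2 factor to be vectorized at scalable VFs.
void scale_rgb(uint8_t *px, int n, int s) {
  for (int i = 0; i < n; ++i) {
    px[3 * i + 0] = (uint8_t)((px[3 * i + 0] * s) >> 8);  // member 0
    px[3 * i + 1] = (uint8_t)((px[3 * i + 1] * s) >> 8);  // member 1
    px[3 * i + 2] = (uint8_t)((px[3 * i + 2] * s) >> 8);  // member 2
  }
}
```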
Negated powers of 2 have similar codegen to their positive counterparts
when lowering sdiv (and identical codegen in the case of srem). For
sdiv, the lowering just negates the result at the end, so the codegen is
otherwise the same.
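A small illustration of that observation (not from the patch):
```cpp
// Dividing by a negated power of two is the power-of-two division followed
// by a negate, and the remainder is identical to the positive-divisor case.
int sdiv_neg8(int x) { return x / -8; }  // lowers like -(x / 8)
int srem_neg8(int x) { return x % -8; }  // lowers exactly like x % 8
```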
CreateVScale took a scaling parameter that had a single use outside of
IRBuilder with all other callers having to create a redundant
ConstantInt. To work around this some code preferred to use
CreateIntrinsic directly.
This patch simplifies CreateVScale to return a call to the llvm.vscale()
intrinsic and nothing more. As well as simplifying the existing call
sites I've also migrated the uses of CreateIntrinsic.
Whilst IRBuilder used CreateVScale's scaling parameter as part of the
implementations of CreateElementCount and CreateTypeSize, I have
follow-on work to switch them to the NUW variety and thus they would
stop using CreateVScale's scaling as well. To prepare for this I have
moved the multiplication and constant folding into the implementations
of CreateElementCount and CreateTypeSize.
As a final step I have replaced some callers of CreateVScale with
CreateElementCount where it's clear from the code they wanted the
latter.
Given a larger-than-legal shuffle we will split into multiple sub-parts.
This adds a check to the computed costs of sub-shuffles so that repeated
sequences are not accounted for multiple times. This especially reduces
the cost of broadcasts/splats.
SMECallAttrs is a new helper class that holds all the SMEAttrs for a
call. The interfaces to query actions needed for the call (e.g. change
streaming mode) have been moved to the SMECallAttrs class.
The main motivation for this change is to make the split between the
caller, callee, and callsite attributes more apparent.
Before this change, we would always merge callsite and callee
attributes. The main reason to do this was to handle indirect calls,
however, we also occasionally used callsite attributes on direct calls
in tests (mainly to avoid creating multiple function declarations). With
this patch, we now explicitly handle indirect calls and disallow
incompatible attributes on direct calls (so this patch is not entirely
an NFC).
Same as #137239, but with a change to avoid inferring SME attributes for
function definitions. This allows stubbing the SME ABI routines in C/C++
(and matches the old behaviour).
The intent of this code is to split larger vectors into smaller shuffles, but
it is currently triggering on some small vector types. Limit it to vectors
larger than 128 bits.
This patch combines uxt[bhw] intrinsics to and_u when the governing
predicate is all-true or the passthrough is undef (e.g. in cases of
"unknown" merging). This improves codegen as the latter can be
emitted as AND immediate instructions.
For example, given:
```cpp
svuint64_t foo(svuint64_t x) {
return svextb_z(svptrue_b64(), x);
}
```
Currently:
```gas
foo:
ptrue p0.d
movi v1.2d, #0000000000000000
uxtb z0.d, p0/m, z0.d
ret
```
Becomes:
```gas
foo:
and z0.d, z0.d, #0xff
ret
```