llvm-project

Author	SHA1	Message	Date
Jeremy Morse	81d18ad864	[NFC][DebugInfo] Make some block-start-position methods return iterators (#124287) As part of the "RemoveDIs" work to eliminate debug intrinsics, we're replacing methods that use Instructions as positions with iterators. A number of these (such as getFirstNonPHIOrDbg) are sufficiently infrequently used that we can just replace the pointer-returning version with an iterator-returning version, hopefully without much/any disruption. Thus this patch has getFirstNonPHIOrDbg and getFirstNonPHIOrDbgOrLifetime return an iterator, and updates all call-sites. There are no concerns about the iterators returned being converted to Instructions and losing the debug-info bit: because the methods skip debug intrinsics, the iterator head bit is always false anyway.	2025-01-27 16:27:54 +00:00
Jeremy Morse	8e70273509	[NFC][DebugInfo] Use iterator moveBefore at many call-sites (#123583) As part of the "RemoveDIs" project, BasicBlock::iterator now carries a debug-info bit that's needed when getFirstNonPHI and similar feed into instruction insertion positions. Call-sites where that's necessary were updated a year ago; however, to ensure some type safety, we'd like to have all calls to moveBefore use iterators. This patch adds a (guaranteed dereferenceable) iterator-taking moveBefore, and changes a bunch of call-sites where it's obviously safe to change to use it by just calling getIterator() on an instruction pointer. A follow-up patch will contain less-obviously-safe changes. We'll eventually deprecate and remove the instruction-pointer insertBefore, but not before adding concise documentation of what considerations are needed (very few).	2025-01-24 10:53:11 +00:00
Mats Jun Larsen	d7c14c8f97	[IR] Replace PointerType::getUnqual(Type) with opaque version (NFC) (#123909) Follow up to https://github.com/llvm/llvm-project/issues/123569	2025-01-23 18:23:05 +09:00
David Sherwood	a733c1fa90	[AArch64][NFC] Move getPartialReductionCost into cpp file (#123370 ) The function getPartialReductionCost is already quite large and is likely to grow in size as we add support for more cases in future. Therefore, I think it's best to move this into the cpp file.	2025-01-20 14:07:03 +00:00
David Green	27a2d3d088	[AArch64] Build v2i64 Mul cost out of getArithmeticInstrCost and getVectorInstrCost. NFCI This should not affect the result, unless the getArithmeticInstrCost and getVectorInstrCost routines learn to produce different costs (with CostKind = CodeSize for example). The -1 lanes prevent 0 lanes from (incorrectly) being marked as free.	2025-01-20 11:43:57 +00:00
Alexandros Lamprineas	831527a5ef	[FMV][GlobalOpt] Statically resolve calls to versioned functions. (#87939 ) To deduce whether the optimization is legal we need to compare the target features between caller and callee versions. The criteria for bypassing the resolver are the following: * If the callee's feature set is a subset of the caller's feature set, then the callee is a candidate for direct call. * Among such candidates the one of highest priority is the best match and it shall be picked, unless there is a version of the callee with higher priority than the best match which cannot be picked from a higher priority caller (directly or through the resolver). * For every higher priority callee version than the best match, there is a higher priority caller version whose feature set availability is implied by the callee's feature set. Example: Callers and Callees are ordered in decreasing priority. The arrows indicate successful call redirections. Caller Callee Explanation ========================================================================= mops+sve2 --+--> mops all the callee versions are subsets of the \| caller but mops has the highest priority \| mops --+ sve2 between mops and default callees, mops wins sve sve between sve and default callees, sve wins but sve2 does not have a high priority caller default -----> default sve (callee) implies sve (caller), sve2(callee) implies sve (caller), mops(callee) implies mops(caller)	2025-01-17 10:49:43 +00:00
David Green	5a069eac5f	[AArch64] Don't try to sink and(load) (#122274) If we sink the and in and(load), CGP can hoist it back again to the load, getting into an infinite loop. This prevents sinking the and in this case. Fixes #122074	2025-01-10 11:54:46 +00:00
David Green	c05fc9b6d5	[AArch64] Fix sebvector -> subvector typo. NFC	2025-01-09 12:10:43 +00:00
David Green	32bc029be6	[AArch64] Fix signed comparison warning. NFC	2025-01-08 08:59:15 +00:00
David Green	a8dab1aa03	[AArch64] Add a subvector extract cost. (#121472) These can generally be emitted using an ext instruction or mov from the high half. The high-half extracts can be free depending on the users, but that is not handled here, just the basic costs. It originally included all subvector extracts, but that was toned down to just half-vector extracts to try and help the mid end not break up high/low extracts without having the SLP vectorizer create a mess using other shuffles.	2025-01-08 08:13:07 +00:00
David Green	2db7b314da	[AArch64] Add BF16 fpext and fptrunc costs. (#119524 ) This expands the recently added fp16 fpext and fpround costs to bf16. Some of the costs are taken from the rough number of instructions needed, some are a little aspirational. https://godbolt.org/z/bGEEd1vsW	2025-01-07 09:39:08 +00:00
David Sherwood	346185c42c	[AArch64] Improve codegen of vectorised early exit loops (#119534) Once PR #112138 lands we are able to start vectorising more loops that have uncountable early exits. The typical loop structure looks like this: vector.body: ... %pred = icmp eq <2 x ptr> %wide.load, %broadcast.splat ... %or.reduc = tail call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> %pred) %iv.cmp = icmp eq i64 %index.next, 4 %exit.cond = or i1 %or.reduc, %iv.cmp br i1 %exit.cond, label %middle.split, label %vector.body middle.split: br i1 %or.reduc, label %found, label %notfound found: ret i64 1 notfound: ret i64 0 The problem with this is that %or.reduc is kept live after the loop, and since this is a boolean it typically requires making a copy of the condition code register. For AArch64 this requires an additional cset instruction, which is quite expensive for a typical find loop that only contains 6 or 7 instructions. This patch attempts to improve the codegen by sinking the reduction out of the loop to the location of its user. It's a lot cheaper to keep the predicate alive if the type is legal and has lots of registers for it. There is a potential downside in that a little more work is required after the loop, but I believe this is worth it since we are likely to spend most of our time in the loop.	2025-01-06 13:17:14 +00:00
Kerry McLaughlin	d8d4c18761	[AArch64][SME] Disable inlining of callees with new ZT0 state (#121338 ) Inlining must be disabled for new-ZT0 callees as the callee is required to save ZT0 and toggle PSTATE.ZA on entry.	2025-01-06 12:02:28 +00:00
Yingwei Zheng	a77346bad0	[IRBuilder] Refactor FMF interface (#121657 ) Up to now, the only way to set specified FMF flags in IRBuilder is to use `FastMathFlagGuard`. It makes the code ugly and hard to maintain. This patch introduces a helper class `FMFSource` to replace the original parameter `Instruction *FMFSource` in IRBuilder. To maximize the compatibility, it accepts an instruction or a specified FMF. This patch also removes the use of `FastMathFlagGuard` in some simple cases. Compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=f87a9db8322643ccbc324e317a75b55903129b55&to=9397e712f6010be15ccf62f12740e9b4a67de2f4&stat=instructions%3Au	2025-01-06 14:37:04 +08:00
David Green	b35d3453dd	[AArch64] Add an option for sve-prefer-fixed-over-scalable-if-equal. NFC Add a new option to control preferFixedOverScalableIfEqualCost, useful for testing.	2024-12-31 11:07:42 +00:00
Sander de Smalen	2ce168baed	[AArch64] SME implementation for agnostic-ZA functions (#120150 ) This implements the lowering of calls from agnostic-ZA functions to non-agnostic-ZA functions, using the ABI routines `__arm_sme_state_size`, `__arm_sme_save` and `__arm_sme_restore`. This implements the proposal described in the following PRs: * https://github.com/ARM-software/acle/pull/336 * https://github.com/ARM-software/abi-aa/pull/264	2024-12-23 19:10:21 +00:00
Florian Hahn	d486b76823	[AArch64] Unroll some loops with early-continues on Apple Silicon. (#118499 ) Try to runtime-unroll loops with early-continues depending on loop-varying loads; this helps with branch-prediction for the early-continues and can significantly improve performance for such loops Builds on top of https://github.com/llvm/llvm-project/pull/118317. PR: https://github.com/llvm/llvm-project/pull/118499.	2024-12-22 13:10:54 +00:00
David Sherwood	eaf482f012	[AArch64] Tweak truncate costs for some scalable vector types (#119542 ) == We were previously returning an invalid cost when truncating anything to <vscale x 2 x i1>, which is incorrect since we can generate perfectly good code for this. == The costs for truncating legal or unpacked types to predicates seemed overly optimistic. For example, when truncating <vscale x 8 x i16> to <vscale x 8 x i1> we typically do something like and z0.h, z0.h, #0x1 cmpne p0.h, p0/z, z0.h, #0 I guess it might depend upon whether the input value is generated in the same block or not and if we can avoid the inreg zero-extend. However, it feels safe to take the more conservative cost here. == The costs for some truncates such as trunc <vscale x 2 x i32> %a to <vscale x 2 x i16> were 1, whereas in actual fact they are free and no instructions are required. == Also, for this trunc <vscale x 8 x i32> %a to <vscale x 8 x i16> it's just a single uzp1 instruction so I reduced the cost to 1. In general, I've added costs for all cases where the destination type is legal or unpacked. One unfortunate side effect of this is the costs for some fixed-width truncates when using SVE now look too optimistic.	2024-12-19 10:07:41 +00:00
Ramkumar Ramachandra	4a0d53a0b0	PatternMatch: migrate to CmpPredicate (#118534) With the introduction of CmpPredicate in 51a895a (IR: introduce struct with CmpInst::Predicate and samesign), PatternMatch is one of the first key pieces of infrastructure that must be updated to match a CmpInst respecting samesign information. Implement this change to Cmp-matchers. This is a preparatory step in migrating the codebase over to CmpPredicate. Since no functional changes are desired at this stage, we have chosen not to migrate CmpPredicate::operator==(CmpPredicate) calls to use CmpPredicate::getMatching(), as that would have visible impact on tests that are not yet written: instead, we call CmpPredicate::operator==(Predicate), preserving the old behavior, while also inserting a few FIXME comments for follow-ups.	2024-12-13 14:18:33 +00:00
Ricardo Jesus	2fe30bc669	[AArch64] Add cost model for @experimental.vector.match (#118512 ) The base cost approximates the expansion code in SelectionDAGBuilder. For the AArch64 cases that don't need generic expansion, fixed-length search vectors have a higher cost than scalable vectors due to the extra instructions to convert the boolean mask.	2024-12-11 07:51:11 +00:00
David Green	2f18b5ef03	[AArch64] Add fpext and fpround costs (#119292 ) This adds some basic costs for fpext and fpround, many of which were already handled by the generic costing routines but this does make some adjustments for larger vector types that can use fcvtn+fcvtn2, as opposed to fcvtn+fcvtn+concat. These should now more closely match the codegen from https://godbolt.org/z/r3P9Mf8ez, for example.	2024-12-11 06:26:41 +00:00
David Green	ca884009e4	[AArch64] Add test coverage of fp16 and bf16 fptrunc and fpext. NFC Some of the scalable tests have been split off to make the tests more manageable. AArch64TTIImpl::getCastInstrCost is also formatted to avoid the need to fight against CI.	2024-12-09 23:41:18 +00:00
Florian Hahn	0bb7bd4b4e	[AArch64] Runtime-unroll small load/store loops for Apple Silicon CPUs. (#118317) Add initial heuristics to selectively enable runtime unrolling for loops where doing so is expected to be highly beneficial on Apple Silicon CPUs. To start with, we try to runtime-unroll small, single block loops, if they have load/store dependencies, to expose more parallel memory access streams [1] and to improve instruction delivery [2]. We also explicitly avoid runtime-unrolling for loop structures that may limit the expected gains from runtime unrolling. Such loops include loops with complex control flow (aren't innermost loops, have multiple exits, have a large number of blocks), loops whose trip count expansion is expensive, and loops that are expected to execute a small number of iterations. Note that the heuristics here may be overly conservative and we err on the side of avoiding runtime unrolling rather than unroll excessively. They are all subject to further refinement. Across a large set of workloads, this increases the total number of unrolled loops by 2.9%. [1] 4.6.10 in Apple Silicon CPU Optimization Guide [2] 4.4.4 in Apple Silicon CPU Optimization Guide Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI changes. PR: https://github.com/llvm/llvm-project/pull/118317	2024-12-09 14:28:31 +00:00
Hari Limaye	8bc9551d9b	[AArch64] Improve operand sinking for mul instructions (#116604 ) - Sink splat operands to mul instructions for types where we can use the lane-indexed variants. - When sinking operands for [su]mull, also sink the ext instruction.	2024-12-06 12:45:18 +00:00
Florian Hahn	9a0f25158c	[SelectOpt] Support ADD and SUB with zext operands. (#115489) Extend the support for implicit selects in the form of OR with a ZExt operand to support ADD and SUB binops as well. They similarly can form implicit selects which can be profitable to convert back to branches. PR: https://github.com/llvm/llvm-project/pull/115489	2024-11-30 21:05:41 +00:00
Jonas Paulsson	0ad6be1927	[SLPVectorizer, TargetTransformInfo, SystemZ] Improve SLP getGatherCost(). (#112491 ) As vector element loads are free on SystemZ, this patch improves the cost computation in getGatherCost() to reflect this. getScalarizationOverhead() gets an optional parameter which can hold the actual Values so that they in turn can be passed (by BasicTTIImpl) to getVectorInstrCost(). SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and typically return a 0 cost for it, with some exceptions.	2024-11-29 21:19:45 +01:00
David Green	d714b221c7	[AArch64] Guard against getRegisterBitWidth returning zero in vector instr cost. (#117749 ) If the getRegisterBitWidth is zero (such as in sme streaming functions), then we could hit a crash from using % RegWidth.	2024-11-29 04:01:03 +00:00
David Green	d106a39c33	[AArch64] Minor cleanup and speedup for getVectorInstrCostHelper If UserToExtractIdx is empty then we can skip checking the users.	2024-11-29 01:11:39 +00:00
hev	e26af0938c	[llvm] Add `BasicTTIImpl::areInlineCompatible` for target feature subset checks (#117493 ) This patch moves the `areInlineCompatible` implementation from multiple subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to the base class `BasicTTIImpl`. The new implementation checks whether the callee's target features are a subset of the caller's, enabling consistent behavior across targets. Subclasses now simply delegate to the base implementation, reducing code duplication and improving maintainability.	2024-11-25 11:22:49 +08:00
Sjoerd Meijer	9bccf61f5f	[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385 ) Set the maximum interleaving factor to 4, aligning with the number of available SIMD pipelines. This increases the number of vector instructions in the vectorised loop body, enhancing performance during its execution. However, for very low iteration counts, the vectorised body might not execute at all, leaving only the epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which experienced a performance regression. To address this, the patch reduces the minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be vectorised and largely mitigating the regression.	2024-11-20 09:33:39 +00:00
Hari Limaye	4f0403fe96	[CodeGen][AArch64] Sink splat operands of FMul instructions (#116222 ) Sink shuffle operands of FMul instructions if these are splats, as we can generate lane-indexed variants for these.	2024-11-19 12:59:22 +00:00
Sushant Gokhale	9991ea28fc	[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible (#111479) In case of Neon, if there exists extractelement from lane != 0 such that 1. extractelement does not necessitate a move from vector_reg -> GPR 2. extractelement result feeds into fmul 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0 then the extractelement can be merged with fmul in the backend and it incurs no cost. e.g. ``` define double @foo(<2 x double> %a) { %1 = extractelement <2 x double> %a, i32 0 %2 = extractelement <2 x double> %a, i32 1 %res = fmul double %1, %2 ret double %res } ``` `%2` and `%res` can be merged in the backend to generate: `fmul d0, d0, v0.d[1]` The change was tested with SPEC FP(C/C++) on Neoverse-v2. Compile time impact: None Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.	2024-11-13 11:10:49 +05:30
David Green	8274be509e	[AArch64] Remove header dependencies of AArch64ISelLowering.h. NFC This patch aims to reduce the includes used by AArch64ISelLowering, allowing it to be included by unittests so that they can reference the AArch64ISD nodes. It: - Moves the inclusion of AArch64SMEAttributes.h to the uses. - Moves LowerPtrAuthGlobalAddressStatically to a static function, so that AArch64PACKey is not required in the header. - Moves the definitions of getExceptionPointerRegister to the cpp file, to remove the reference of AArch64::X0.	2024-10-28 18:53:37 +00:00
Graham Hunter	091a235ec5	Revert "[AArch64][SVE] Enable max vector bandwidth for SVE" (#112873 ) Reverts llvm/llvm-project#109671 Reverting due to some performance regressions on neoverse-v1.	2024-10-18 11:05:55 +01:00
Danila Malyutin	1a609052b6	[AArch64][InstCombine] Eliminate redundant barrier intrinsics (#112023 ) If there are no memory ops on the path from one dmb to another then one barrier can be eliminated.	2024-10-17 21:04:04 +04:00
Graham Hunter	c980a20b10	[AArch64][SVE] Enable max vector bandwidth for SVE (#109671 ) Returns true for shouldMaximizeVectorBandwidth when the register type is a scalable vector and SVE or streaming SVE are available.	2024-10-17 13:17:24 +01:00
Philip Reames	b3c687b4e9	[LV] Check early for supported interleave factors with scalable types [nfc] (#111592 ) Previously, the cost model was returning an invalid cost. This simply moves the check from one place to another. This is mostly to make the cost modeling code a bit easier to follow. --------- Co-authored-by: Mel Chen <mel.chen@sifive.com>	2024-10-15 07:37:46 -07:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).	2024-10-11 05:26:03 -07:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Paul Walker	d283705829	[AArch64][SVE] Fix definition of bfloat fcvt intrinsics. (#110281) Affected intrinsics: llvm.aarch64.sve.fcvt.bf16f32 llvm.aarch64.sve.fcvtnt.bf16f32 The named intrinsics took a predicate based on the smallest element type when it should be based on the largest. The intrinsics have been replaced by v2 equivalents and affected code ported to use them. Patch includes changes to getSVEPredicateBitCast() that ensure the generated code for the auto-upgraded old intrinsics is unchanged.	2024-10-03 12:36:01 +01:00
Paul Walker	be9461cda6	[LLVM][InstCombine][SVE] fcvtnt(a,all_active,b) != fcvtnt(undef,all_active,b) (#110278) The "narrowing top" convert instructions leave the bottom half of active elements untouched and thus the first parameter of their associated intrinsic remains live even when there are no inactive lanes.	2024-10-01 11:13:04 +01:00
Philip Reames	d288574363	[TTI][RISCV] Model cost of loading constant arms of selects and compares (#109824) This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704, and extends the costing API for compares and selects to provide information about the operands passed in an analogous manner. This allows us to model the cost of materializing the vector constant, as some select-of-constants are significantly more expensive than others when you account for the cost of materializing the constants involved. This is a stepping stone towards fixing https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch will be required to utilize the new API.	2024-09-25 07:25:57 -07:00
Paul Walker	622ae7ffa4	[LLVM][InstCombine][AArch64] sve.insr(splat(x), x) ==> splat(x) (#109445 ) Fixes https://github.com/llvm/llvm-project/issues/100497	2024-09-24 15:11:36 +01:00
Sushant Gokhale	c5672e21ca	[AArch64][CostModel] Reduce the cost of fadd reduction with fast flag (#108791) An fadd reduction with 1. Fast flag set 2. Number of elements in the input vector being a power of 2 results in a series of faddp instructions. The faddp instruction has latency/throughput identical to the fadd instruction and hence we set relative cost=1 for faddp as well. The change didn't show any regression with SPEC17-FP(C/C++), llvm-test-suite on Neoverse-V2.	2024-09-24 14:35:01 +05:30
Matthew Devereau	1808fc13c8	[AArch64][InstCombine] Bail from combining SRAD on +/-1 divisor (#109274 ) This fixes a crash when svdiv's third parameter is svdup_s64(1)	2024-09-20 13:53:02 +01:00
Samuel Tebbs	b1b436c108	[AArch64] Fix build error from extra ! This fixes a build failure caused by https://github.com/llvm/llvm-project/pull/108521	2024-09-19 14:45:30 +01:00
Sam Tebbs	b49a6b2a9d	[AArch64] Consider histcnt smaller than i32 in the cost model (#108521 ) This PR updates the AArch64 cost model to consider the cheaper cost of <i32 histograms to reflect the improvements from https://github.com/llvm/llvm-project/pull/101017 and https://github.com/llvm/llvm-project/pull/103037 Work by Max Beck-Jones (@DevM-uk) --------- Co-authored-by: DevM-uk <max.beck-jones@arm.com>	2024-09-19 13:56:52 +01:00
Lukacma	d57be195e3	[AArch64] replace SVE intrinsics with no active lanes with zero (#107413 ) This patch extends https://github.com/llvm/llvm-project/pull/73964 and optimises SVE intrinsics into zero constants when predicate is zero.	2024-09-09 10:28:01 +01:00
Jon Roelofs	bded3b3ea9	[llvm][AArch64] Improve the cost model for i128 div's (#107306 )	2024-09-05 07:42:23 -07:00
Lukacma	113806d187	[AArch64] optimise SVE cvt intrinsics with no active lanes (#104809 ) This patch extends https://github.com/llvm/llvm-project/pull/73964 and optimises SVE cvt intrinsics away when predicate is zero.	2024-08-29 11:45:14 +01:00
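Commits e26af0938c (`areInlineCompatible`) and 831527a5ef (FMV call resolution) both rest on the same test: the callee's target features must be a subset of the caller's. The following is a minimal sketch of that test only, not LLVM's implementation. The `FeatureSet` alias, the `Feature` indices, and both helper names are hypothetical and stand in for LLVM's own feature bitsets.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical stand-in for a target feature set (not an LLVM type).
constexpr std::size_t kNumFeatures = 8;
using FeatureSet = std::bitset<kNumFeatures>;

// Illustrative feature indices; the numbering is an assumption.
enum Feature : std::size_t { kSve = 0, kSve2 = 1, kMops = 2 };

// Builds a feature set from a list of feature indices.
FeatureSet makeFeatures(std::initializer_list<Feature> features) {
  FeatureSet set;
  for (Feature f : features) {
    assert(static_cast<std::size_t>(f) < kNumFeatures &&
           "feature index out of range");
    set.set(f);
  }
  return set;
}

// Returns true when every feature enabled for the callee is also enabled
// for the caller, i.e. the callee's features are a subset of the caller's.
bool calleeFeaturesAreSubset(const FeatureSet &caller,
                             const FeatureSet &callee) {
  return (caller & callee) == callee;
}
```

With these helpers, a caller built with `{kSve, kSve2, kMops}` passes the check against a `{kMops}` callee, while a `{kSve}` caller does not pass against a `{kSve2}` callee. The priority rules described in 831527a5ef are deliberately left out of this sketch.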


498 Commits