In https://reviews.llvm.org/D159196 we avoided stack-slot scavenging
when no FP was available. But in the case where the FP is available
we should actually prefer using the FP over the BP.
This change affects more than just SME, but it should be a general
improvement: any slot above the (address pointed to by the) FP
is always closer to the FP than to the BP, so it makes sense to favour
the FP for addressing it whenever the FP is available.
This also fixes the issue for SME where this is not just preferred
but required.
An fadd reduction with
1. the fast flag set, and
2. a power-of-2 number of elements in the input vector
results in a series of faddp instructions. The faddp instruction has
latency/throughput identical to the fadd instruction and hence we set
relative cost=1 for faddp as well.
The change didn't show any regression with SPEC17-FP (C/C++) or
llvm-test-suite on Neoverse-V2.
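Roughly the shape of code this cost change models (illustrative only, not from the patch): a fast-math float sum reduction that the vectorizer turns into vector fadds plus a trailing faddp sequence.
```
// Illustrative: under fast-math the accumulation may be reassociated into
// vector fadds followed by faddp instructions, whose cost is now 1 each.
float sum(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i]; // reassociation allowed under fast-math
  return s;
}
```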
These fp128 G_FNEG operations should be treated more like G_FABS, where the
operation is lowered to simple integer arithmetic. All other operations are the
same between the two ActionDefinitionsBuilders.
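As a source-level sketch (assuming an AAPCS64 target, where long double is the IEEE 128-bit type), the negation below is an fp128 G_FNEG and can be lowered by flipping the sign bit with integer arithmetic, mirroring G_FABS.
```
// On AAPCS64, long double is IEEE binary128 (fp128). Negation only needs the
// sign bit flipped, so no libcall is required.
long double negate_fp128(long double x) {
  return -x;
}
```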
Originally I tried splitting these features in the compiler with
https://github.com/llvm/llvm-project/pull/101712, but we decided to lump
these features together in the ACLE specification (see
https://github.com/ARM-software/acle/pull/346). Since there are no
hardware implementations out there which implement ls64 without ls64_v
or ls64_accdata, this shouldn't be a regression for feature detection.
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In the future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
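A minimal sketch of the style change (the helper below is hypothetical, not code from the patch):
```
#include "llvm/ADT/ArrayRef.h"

// Hypothetical helper taking an ArrayRef.
static int sumAll(llvm::ArrayRef<int> Vals) {
  int S = 0;
  for (int V : Vals)
    S += V;
  return S;
}

int demo() {
  return sumAll({}); // preferred over sumAll(std::nullopt) after this patch
}
```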
The ldr instructions implicitly zero any upper lanes, so we can use them
for insert(zerovec, load, 0) patterns. Likewise insert(undef, load, 0)
or scalar_to_reg can reuse the scalar loads as the top bits are undef.
This patch makes sure there are patterns for each type and for each of
the normal, unaligned, roW and roX addressing modes.
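For example (purely illustrative, written with Neon intrinsics), inserting a loaded scalar into lane 0 of a zero vector is exactly the insert(zerovec, load, 0) case:
```
#include <arm_neon.h>

// Because the scalar ldr implicitly zeroes the upper lanes, this can lower to
// a single "ldr s0, [x0]" instead of a load plus an insert into a zero vector.
float32x4_t load_into_lane0(const float *p) {
  return vsetq_lane_f32(*p, vdupq_n_f32(0.0f), 0);
}
```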
This extends the existing patterns for addp to 64-bit outputs with a single
input. Whilst the general pattern is similar to the 128-bit patterns
(add(uzp1(extract_lo, extract_hi), uzp2(extract_lo, extract_hi))), by this late
stage other optimisations have turned the first uzp1 into a trunc and
the second into an extract(uzp2) with undef.
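A sketch of the shape being matched, written with generic vector extensions (illustrative; after legalization the actual DAG looks as described above):
```
typedef int v4i32 __attribute__((vector_size(16)));
typedef int v2i32 __attribute__((vector_size(8)));

// Pairwise add of a 4 x i32 vector into a 2 x i32 result: the even/odd
// deinterleave plus add is the add(uzp1, uzp2) form that can become a single
// 64-bit-output addp.
v2i32 pairwise_add(v4i32 v) {
  v2i32 even = __builtin_shufflevector(v, v, 0, 2); // uzp1 of the two halves
  v2i32 odd  = __builtin_shufflevector(v, v, 1, 3); // uzp2 of the two halves
  return even + odd;
}
```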
Fixes #109108
This commit adds the missing support for varargs in the instruction
selection pass for AAPCS. Previously we only implemented this for
Darwin.
The implementation follows AAPCS and SelectionDAG's LowerAAPCS_VASTART.
It resolves all VA_START fallbacks in RAJAperf, llvm-test-suite, and
SPEC CPU2017. These benchmarks now compile and pass without fallbacks
due to varargs.
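For reference, a minimal varargs function of the kind that previously caused a GlobalISel fallback on AAPCS targets:
```
#include <cstdarg>

// Any use of va_start like this used to fall back to SelectionDAG when
// targeting AAPCS; it is now handled in GlobalISel as well.
int sum_ints(int count, ...) {
  va_list ap;
  va_start(ap, count);
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}
```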
---------
Co-authored-by: Madhur Amilkanthwar <madhura@nvidia.com>
Using the LSR or LSL aliases of UBFM can be faster on some CPUs, so it
is worth changing 64-bit UBFM instructions that are equivalent to 32-bit
LSR/LSL operations into their 32-bit variants.
This change folds the following patterns:
* If `Imms == 31` and `Immr <= Imms`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr, Imms`
* If `Immr == Imms + 33`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr - 32, Imms`
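As a worked instance of the first pattern (values chosen for illustration): `UBFMXri %x, 2, 31` extracts bits [31:2], which depend only on the low 32 bits of `%x`, so it can be rewritten as `UBFMWri %x.sub_32, 2, 31`, i.e. the `lsr w, w, #2` alias.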
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
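A sketch of the arithmetic expansion, written with Neon intrinsics (illustrative; the actual lowering works on generic nodes):
```
#include <arm_neon.h>

// Per-lane signed three-way compare: the compare masks are 0 or all-ones
// (i.e. -1), so scmp(a, b) is simply (a < b) - (a > b), with no selects.
int32x4_t scmp_v4i32(int32x4_t a, int32x4_t b) {
  int32x4_t lt = vreinterpretq_s32_u32(vcltq_s32(a, b)); // -1 where a < b
  int32x4_t gt = vreinterpretq_s32_u32(vcgtq_s32(a, b)); // -1 where a > b
  return vsubq_s32(lt, gt);                              // -1, 0 or +1
}
```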
This is an implementation of the saturating fp to int conversions for
GlobalISel. On AArch64 the conversion instructions already work this way,
producing saturating results. LegalizerHelper::lowerFPTOINT_SAT is
ported from SDAG.
AArch64 has a lot of existing tests for fptosi_sat, covering a wide
range of types. I have tried to make most of them work all at once, but
a few fall back due to other missing features such as f128 handling for
min/max.
When the AArch64LoadStoreOptimizer pass merges an SP update with a
load/store instruction and needs to adjust unwind information, it must either:
* create the merged instruction at the location of the SP update
(so no CFI instructions are moved), or
* only move a CFI instruction if the move would not reorder it across
other CFI instructions
If neither of the above is possible, don't perform the optimisation.
Update the predicate protecting bfloat instructions to only reference
FEAT_SVE_B16B16, which matches the specification.
Rename and move instruction classes to match the names of the encoding
groups the bfloat arithmetic instructions belong to.
GCC compiles the built-in function `__builtin_bswap16` to the ARM
instruction rev16, which reverses the byte order of 16-bit data. Clang,
on the other hand, compiles the same built-in function to e.g.
```
rev w8, w0
lsr w0, w8, #16
```
i.e. it performs a byte reversal of the 32-bit register (which moves the
lower half, containing the 16-bit data, to the upper half) and then
right-shifts the reversed 16-bit data back into the lower half of the
register.
We can improve Clang codegen by generating `rev16` instead of `rev` and
`lsr`, like GCC.
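For reference, the affected source pattern:
```
// With this change the 16-bit byte swap lowers to a single rev16, matching
// GCC, instead of a 32-bit rev followed by an lsr.
unsigned short swap16(unsigned short x) {
  return __builtin_bswap16(x);
}
```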
This patch implements the intrinsics of the form
floatNxM_t vamin[q]_fN(floatNxM_t vn, floatNxM_t vm);
floatNxM_t vamax[q]_fN(floatNxM_t vn, floatNxM_t vm);
as defined in https://github.com/ARM-software/acle/pull/324
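A usage sketch instantiating the prototype form above for the 128-bit f32 case (assumes a compiler and target with the FEAT_FAMINMAX support added here):
```
#include <arm_neon.h>

// vaminq_f32 is the [q]/f32 instance of the vamin form above: the per-lane
// minimum of the absolute values of the two inputs. Requires the faminmax
// target feature to be enabled.
float32x4_t absolute_minimum(float32x4_t vn, float32x4_t vm) {
  return vaminq_f32(vn, vm);
}
```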
---------
Co-authored-by: Hassnaa Hamdi <hassnaa.hamdi@arm.com>
After #107201 and #107367 the codegen for zext(ld4) can use and / shift
to extract the lanes out of the original vector's elements. This avoids
the need for the expensive ld4 operations, so can lead to performance
improvements over using the interleaving loads and ushll.
This patch stops the generation of ld4 for uitofp(ld4) that would become
uitofp(zext(ld4)). It doesn't handle zext yet to make sure that widening
instructions like mull and addl are not adversely affected.
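Roughly the shape of source affected (illustrative): a strided u8 load whose values are converted to float, which previously vectorized to ld4 + ushll + ucvtf.
```
// Every 4th byte converted to float: uitofp(zext(i8)) of a deinterleaved
// load. With this change the lanes are extracted with and/shift from plain
// vector loads rather than via ld4.
void every_fourth_to_float(float *dst, const unsigned char *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (float)src[4 * i];
}
```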
Generating a tbl instruction for the zext in an expression like
mul(zext(i8), sext) is not optimal.
Instead, allowing later optimisations to generate smull(zext, sext)
performs some of the type extensions implicitly and is faster.
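An illustrative source shape for the expression in question, with one zero-extended and one sign-extended operand of a widening multiply:
```
// mul(zext(i8), sext(i8)): widening the unsigned operand with tbl would block
// forming smull(zext, sext), which does the extensions implicitly.
void widening_mul(int *dst, const unsigned char *a, const signed char *b,
                  int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (int)a[i] * (int)b[i];
}
```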
This PR adds lowering for fixed-width <4 x i32> and <2 x i32> partial
reductions to a dot product when Neon and the dot product feature are
available.
The work is by Max Beck-Jones (@DevM-uk).
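The classic source shape that now maps onto a dot product (illustrative):
```
// An i8 x i8 multiply accumulated into an i32 sum: the vectorizer expresses
// this as a fixed-width partial reduction, which can now lower to udot when
// Neon and the dot-product feature are available.
int dot_u8(const unsigned char *a, const unsigned char *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}
```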
Similar to #107201, this comes up from the lowering of zext of
deinterleaving shuffles. Patterns such as ext(extract_subvector(uzp(a,
b))) can be converted to a simple and, performing the extract/zext from a
uzp1. A uzp2 can be handled with an extra shift, and due to the existing
legalization there may already be an and / shift in between, which can be
combined in too.
Mostly this reduces the instruction count or increases the amount of
parallelism in the sequence.
This removes a redundant 'COPY' instruction that #81716 probably forgot
to remove.
This redundant COPY led to an issue because code in
LiveRangeSplitting expects the instruction emitted by
`loadRegFromStackSlot` to be an instruction that accesses memory, which
isn't the case for the COPY instruction.
NOTE: There are no dedicated SVE instructions, but bf16->f32 is just a
left shift because the two types share the same exponent range, and from
there other convert instructions can be used.
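A scalar sketch of why the left shift is enough (bfloat16 is the top half of the f32 encoding):
```
#include <cstdint>
#include <cstring>

// bf16 -> f32: bfloat16 shares f32's exponent range and is simply the high
// 16 bits of the f32 encoding, so widening is a 16-bit left shift of the raw
// bits (the low mantissa bits become zero).
float bf16_to_f32(uint16_t bf16_bits) {
  uint32_t f32_bits = (uint32_t)bf16_bits << 16;
  float result;
  std::memcpy(&result, &f32_bits, sizeof result);
  return result;
}
```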
This patch fixes incorrect usage of the scalar+immediate variant of ld1/st1
instructions during stack allocation, caused by
[c4bac7f](c4bac7f7dc).
That commit used ld1/st1 even when the stack offset was outside the
immediate range of these instructions, producing invalid assembly, and it
also used incorrect offsets when using ld1/st1.
This is part 1 of a few patches that are intended to take deinterleaving
shuffles with masks like `[0,4,8,12]`, where the shuffle is
zero-extended to a larger size, and optimize away the deinterleave. In
this case it converts them to `and(uzp1, mask)`, where the `uzp1` acts
upon the elements in the larger type size to get the lanes into the
correct positions and the `and` performs the zext. It performs the
combine fairly late, on the legalized types, so that uitofp operations
that are converted to uitofp(zext(..)) will also be handled.
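A sketch of the pattern, written with clang vector builtins (illustrative):
```
typedef unsigned char v16u8 __attribute__((vector_size(16)));
typedef unsigned char v4u8 __attribute__((vector_size(4)));
typedef unsigned int v4u32 __attribute__((vector_size(16)));

// Take every 4th byte and zero-extend it to i32: a [0,4,8,12] deinterleaving
// shuffle feeding a zext. After this patch the shuffle becomes
// and(uzp1, mask) on the wider elements instead of a real deinterleave.
v4u32 deinterleave_zext(v16u8 v) {
  v4u8 lanes = __builtin_shufflevector(v, v, 0, 4, 8, 12);
  return __builtin_convertvector(lanes, v4u32);
}
```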
Fix incorrect use of AArch64ISD::UZP1/UUNPK{HI,LO} in:
* AArch64TargetLowering::LowerDIV
* AArch64TargetLowering::LowerINSERT_SUBVECTOR
The latter highlighted DAG combines that relied on broken behaviour,
which this patch also fixes.