8389 Commits

Author SHA1 Message Date
Paul Walker
622ae7ffa4
[LLVM][InstCombine][AArch64] sve.insr(splat(x), x) ==> splat(x) (#109445)
Fixes https://github.com/llvm/llvm-project/issues/100497
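A minimal C++ sketch of source that produces the redundant pattern (function name is illustrative, not from the patch):

```cpp
#include <arm_sve.h>

// insr shifts the vector up and inserts x into element 0; when the vector is
// already splat(x), the result is unchanged, so the combine keeps only the splat.
svfloat64_t redundant_insr(double x) {
  svfloat64_t splat = svdup_n_f64(x);  // splat(x)
  return svinsr_n_f64(splat, x);       // insr(splat(x), x) ==> splat(x)
}
```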
2024-09-24 15:11:36 +01:00
Sander de Smalen
db054a1970
[AArch64][SME] Fix ADDVL addressing to scavenged stackslot. (#109674)
In https://reviews.llvm.org/D159196 we avoided stackslot scavenging
when there was no FP available. But in the case where FP is available
we need to actually prefer using the FP over the BP.

This change affects more than just SME, but it should be a general
improvement, since any slot above the (address pointed to by) FP
is always closer to FP than BP, so it makes sense to always favour
using the FP to address it when the FP is available.

This also fixes the issue for SME where this is not just preferred
but required.
2024-09-24 13:29:30 +01:00
Paul Walker
3e3780ef6a
[LLVM][CodeGen][SVE] Implement nxvf32 fpround to nxvbf16. (#107420) 2024-09-24 13:15:26 +01:00
Sushant Gokhale
c5672e21ca
[AArch64][CostModel] Reduce the cost of fadd reduction with fast flag (#108791)
An fadd reduction with
1. the fast flag set, and
2. a power-of-two number of elements in the input vector
results in a series of faddp instructions. The faddp instruction has
latency/throughput identical to fadd, so we set a relative cost of 1 for
faddp as well.

The change didn't show any regression with SPEC17-FP (C/C++) or the
llvm-test-suite on Neoverse-V2.
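A hedged sketch of the kind of reduction this costing targets, assuming -ffast-math so the reduction carries the fast flag:

```cpp
// With a power-of-two trip count and fast-math, the vectorised reduction can
// finish with a tree of faddp instructions rather than an ordered fadd chain.
float reduce8(const float *a) {
  float sum = 0.0f;
  for (int i = 0; i < 8; ++i)  // 8 elements: a power of two
    sum += a[i];
  return sum;
}
```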
2024-09-24 14:35:01 +05:30
Craig Topper
19f04e9086 [AArch64] Use MCRegister in more places. NFC 2024-09-23 10:24:48 -07:00
chuongg3
b0dc7b5b86
[AArch64][GlobalISel] Prefer to use Vector Truncate (#105692)
Tries to combine scalarised truncates into vector truncate operations

EXAMPLE:
`%a(i32), %b(i32) = G_UNMERGE %src(<2 x i32>)`
`%T_a(i16) = G_TRUNC %a(i32)`
`%T_b(i16) = G_TRUNC %b(i32)`
`%Imp(i16) = G_IMPLICIT_DEF(i16)`
`%dst(<4 x s16>) = G_MERGE_VALUES %T_a(i16), %T_b(i16), %Imp(i16), %Imp(i16)`

===>
`%Imp(<2 x i32>) = G_IMPLICIT_DEF(<2 x i32>)`
`%Mid(<4 x s16>) = G_CONCAT_VECTORS %src(<2 x i32>), %Imp(<2 x i32>)`
`%dst(<4 x s16>) = G_TRUNC %Mid(<4 x s16>)`
2024-09-23 13:52:37 +01:00
David Green
c3f9b736c4 [AArch64] Treat fp128 G_FNEG like G_FABS
These fp128 G_FNEG operations should be treated more like G_FABS, where the
operation is lowered to simple integer arithmetic. All other operations are the
same between the two ActionDefinitionsBuilders.
2024-09-21 21:10:55 +01:00
Alexandros Lamprineas
d497f465df
[FMV][AArch64] Unify ls64, ls64_v and ls64_accdata. (#108024)
Originally I tried splitting these features in the compiler with
https://github.com/llvm/llvm-project/pull/101712, but we decided to lump
those features in the ACLE specification (see
https://github.com/ARM-software/acle/pull/346). Since there are no
hardware implementations out there which implement ls64 without ls64_v
or ls64_accdata, this shouldn't be a regression for feature detection.
2024-09-20 19:10:54 +01:00
Matthew Devereau
1808fc13c8
[AArch64][InstCombine] Bail from combining SRAD on +/-1 divisor (#109274)
This fixes a crash when svdiv's third parameter is svdup_s64(1).
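A hedged C++ sketch of the previously crashing shape (names illustrative):

```cpp
#include <arm_sve.h>

// A divide whose divisor is a splat of 1: the SRAD combine now bails out
// instead of mishandling the +/-1 divisor.
svint64_t div_by_one(svbool_t pg, svint64_t x) {
  return svdiv_s64_m(pg, x, svdup_s64(1));
}
```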
2024-09-20 13:53:02 +01:00
David Green
2da70572a2 [AArch64][GISel] Scalarize fp128 fadd/fsub/fmul/etc.
Like other fp128/i128 vectors, we scalarize these operations to allow them to
be libcalled.
2024-09-20 10:40:22 +01:00
Franklin
e45f9aa7fa
[AArch64] Initial sched model for Neoverse N3 (#106371)
References:

* Arm Neoverse N3 Software Optimization Guide
* Arm A64 Instruction Set for A-profile architecture
2024-09-19 19:22:24 +01:00
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
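A minimal sketch of the before/after shape, assuming the deprecated `ArrayRef(std::nullopt_t)` constructor is still present:

```cpp
#include <optional>
#include "llvm/ADT/ArrayRef.h"

void empty_arrayref_styles() {
  llvm::ArrayRef<int> OldStyle = std::nullopt; // relies on the constructor slated for removal
  llvm::ArrayRef<int> NewStyle = {};           // preferred: an empty ArrayRef
}
```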
2024-09-19 16:16:38 +01:00
David Green
02a1d311bd
[AArch64] Extend and rewrite load zero and load undef patterns (#108185)
The ldr instructions implicitly zero any upper lanes, so we can use them
for insert(zerovec, load, 0) patterns. Likewise insert(undef, load, 0)
or scalar_to_reg can reuse the scalar loads as the top bits are undef.

This patch makes sure there are patterns for each type and for each of
the normal, unaligned, roW and roX addressing modes.
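A hedged NEON sketch of the insert(zerovec, load, 0) shape that can now reuse the implicit zeroing of ldr (names illustrative):

```cpp
#include <arm_neon.h>

// Loading a scalar into lane 0 of a zero vector: ldr already zeroes the upper
// lanes, so no separate zeroing or insert instruction should be needed.
float32x4_t load_into_zero_vector(const float *p) {
  return vsetq_lane_f32(*p, vdupq_n_f32(0.0f), 0);
}
```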
2024-09-19 14:52:52 +01:00
Samuel Tebbs
b1b436c108 [AArch64] Fix build error from extra !
This fixes a build failure caused by https://github.com/llvm/llvm-project/pull/108521
2024-09-19 14:45:30 +01:00
Sam Tebbs
f7714342ae
[AArch64][NEON][SVE] Lower mixed sign/zero extended partial reductions to usdot (#107566)
This PR adds lowering for partial reductions of a mix of sign/zero
extended inputs to the usdot intrinsic.
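A hedged sketch of a loop whose vectorised form is a mixed sign/zero extended partial reduction (names illustrative):

```cpp
#include <cstdint>

// acc += zext(u8) * sext(s8): with this patch the partial reduction can
// lower to usdot when the dot-product feature is available.
int32_t mixed_dot(const uint8_t *a, const int8_t *b, int n) {
  int32_t acc = 0;
  for (int i = 0; i < n; ++i)
    acc += (int32_t)a[i] * (int32_t)b[i];
  return acc;
}
```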
2024-09-19 14:00:45 +01:00
Sam Tebbs
b49a6b2a9d
[AArch64] Consider histcnt smaller than i32 in the cost model (#108521)
This PR updates the AArch64 cost model to reflect the cheaper cost of
histograms with element types smaller than i32, following the improvements from
https://github.com/llvm/llvm-project/pull/101017 and
https://github.com/llvm/llvm-project/pull/103037

Work by Max Beck-Jones (@DevM-uk)

---------

Co-authored-by: DevM-uk <max.beck-jones@arm.com>
2024-09-19 13:56:52 +01:00
Daniil Kovalev
3d5e8e4693
[PAC][CodeGen] Do not emit trivial 'mov xN, xN' on tail call (#109100)
Under some conditions, a trivial `mov xN, xN` instruction was emitted on
tail calls. Consider the following code:

```
class Test {
public:
  virtual void f() {}
};

void call_f(Test *t) {
  t->f();
}
```

Corresponding assembly:

```
_Z6call_fP4Test:
        ldr     x16, [x0]
        mov     x17, x0
        movk    x17, #6503, lsl #48
        autda   x16, x17
        ldr     x1, [x16]
 =====> mov     x16, x16
        movk    x16, #54167, lsl #48
        braa    x1, x16
```

With this patch, such movs are omitted.

Co-authored-by: Anatoly Trosinenko <atrosinenko@accesssoftek.com>
2024-09-19 12:17:58 +03:00
David Green
4c50112ba1 [AArch64] Add patterns for 64bit vector addp
This extends the existing patterns for addp to 64bit outputs with a single
input. Whilst the general pattern is similar to the 128bit patterns
(add(uzp1(extract_lo, extract_hi), uzp2(extract_lo, extract_hi))), by this late
stage other optimizations have happened that turn the first uzp1 into a trunc and
the second into extract(uzp2) with undef.

Fixes #109108
2024-09-19 08:50:43 +01:00
Him188
77af9d1023
[AArch64][GlobalISel] Implement selectVaStartAAPCS (#106979)
This commit adds the missing support for varargs in the instruction
selection pass for AAPCS. Previously we only implemented this for
Darwin.

The implementation follows AAPCS and SelectionDAG's
LowerAAPCS_VASTART.

It resolves all VA_START fallbacks in RAJAperf, llvm-test-suite, and
SPEC CPU2017. These benchmarks now compile and pass without falling
back due to varargs.
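A hedged sketch of the kind of variadic callee this lowering handles (name illustrative):

```cpp
#include <cstdarg>

// Any AAPCS variadic callee needs va_start lowered; previously GlobalISel
// fell back to SelectionDAG for this on non-Darwin targets.
int sum_all(int count, ...) {
  va_list ap;
  va_start(ap, count);
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}
```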

---------

Co-authored-by: Madhur Amilkanthwar <madhura@nvidia.com>
2024-09-19 11:48:14 +05:30
Lei Huang
4b524088a8
[NFC] Update function names in MCTargetAsmParser.h (#108643)
Update function names to adhere to LLVM coding standard.
2024-09-18 11:43:49 -04:00
Franklin
ef34cba1c3
[AArch64] Fix sched model of Neoverse N2 (#106376)
* fix write order of "Load vector reg, immed post-index"
* fix a typo
2024-09-18 09:33:57 +01:00
Csanád Hajdú
72901fe19e
[AArch64] Fold UBFMXri to UBFMWri when it's an LSR or LSL alias (#106968)
Using the LSR or LSL aliases of UBFM can be faster on some CPUs, so it
is worth changing 64-bit UBFM instructions that are equivalent to 32-bit
LSR/LSL operations into their 32-bit variants.

This change folds the following patterns:
* If `Imms == 31` and `Immr <= Imms`:
   `UBFMXri %0, Immr, Imms`  ->  `UBFMWri %0.sub_32, Immr, Imms`
* If `Immr == Imms + 33`:
   `UBFMXri %0, Immr, Imms`  ->  `UBFMWri %0.sub_32, Immr - 32, Imms`
2024-09-17 11:21:23 +01:00
SpencerAbson
79d380f2ca
[AArch64][SVE2] Add codegen patterns for SVE2 FAMINMAX (#107284)
Tablegen patterns were previously added to lower the following sequences
from generic IR to NEON FAMIN/FAMAX instructions

- `fminimum(abs(a), abs(b)) -> famin(a, b)`
- `fmaximum(abs(a), abs(b)) -> famax(a, b)`
- https://github.com/llvm/llvm-project/pull/103027
- `fminnum[nnan](abs(a), abs(b)) -> famin(a, b)`
- `fmaxnum[nnan](abs(a), abs(b)) -> famax(a, b)`
- https://github.com/llvm/llvm-project/pull/104766

The same idea has been applied for the scalable vector variants of
[FAMIN](https://developer.arm.com/documentation/ddi0602/2024-06/SVE-Instructions/FAMIN--Floating-point-absolute-minimum--predicated--)/[FAMAX](https://developer.arm.com/documentation/ddi0602/2024-06/SVE-Instructions/FAMAX--Floating-point-absolute-maximum--predicated--).
('nnan' documentation:
https://llvm.org/docs/LangRef.html#fast-math-flags).

- Changes to LLVM
  - lib/Target/AArch64/AArch64SVEInstrInfo.td
    - Add 'AArch64fminnm_p_nnan' and 'AArch64fmaxnm_p_nnan' patfrags
      (patterns predicated on the 'nnan' flag).
    - Add 'AArch64famax_p' and 'AArch64famin_p'
  - test/CodeGen/AArch64/aarch64-sve2-faminmax.ll
    - Add tests to verify the new patterns, including both positive and
      negative tests for 'nnan'-predicated behavior.
2024-09-17 09:12:27 +01:00
David Green
960c975acd
[AArch64] Expand scmp/ucmp vector operations with sub (#108830)
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
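A hedged sketch of an element-wise three-way compare that can form a vector scmp; the arithmetic form in the loop body mirrors the expansion that avoids a scalarised select (names illustrative):

```cpp
#include <cstdint>

// out[i] is -1, 0 or 1; expanding as (a > b) - (a < b) uses two compares and
// a subtract per vector, rather than compares feeding a scalarised select.
void scmp_elements(const int32_t *a, const int32_t *b, int32_t *out, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = (a[i] > b[i]) - (a[i] < b[i]);
}
```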
2024-09-16 18:44:52 +01:00
David Green
feac761f37
[GlobalISel][AArch64] Add G_FPTOSI_SAT/G_FPTOUI_SAT (#96297)
This is an implementation of the saturating fp to int conversions for
GlobalISel. On AArch64 the conversion instructions work this way,
producing saturating results. LegalizerHelper::lowerFPTOINT_SAT is
ported from SDAG.

AArch64 has a lot of existing tests for fptosi_sat, covering a wide
range of types. I have tried to make most of them work all at once, but
a few fall back due to other missing features such as f128 handling for
min/max.
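A hedged scalar sketch of the saturating semantics being selected (not the in-tree lowering): NaN maps to 0 and out-of-range values clamp, which is what AArch64's fcvtzs already does.

```cpp
#include <cstdint>

int32_t fptosi_sat_i32(float f) {
  if (f != f)
    return 0;               // NaN saturates to 0
  if (f >= 2147483648.0f)   // >= 2^31
    return INT32_MAX;
  if (f <= -2147483648.0f)  // <= -2^31 (this bound is exactly representable)
    return INT32_MIN;
  return (int32_t)f;        // in range: plain conversion
}
```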
2024-09-16 10:33:59 +01:00
David Green
3a4b30e11e [AArch64][GISel] Scalarize i128 ICmp and Select.
Similar to other i128 bit operations, we scalarize any icmps or selects larger
than 64 bits.
2024-09-13 18:44:26 +01:00
David Green
758230827d [AArch64][GISel] Scalarize i128 vector shifts.
Like most other i128 operations, this adds scalarization for i128 vector
shifts. Which in turn allows a few other operations to legalize too.
2024-09-13 18:44:25 +01:00
David Green
1642f64b52 [AArch64] Replace _Ncyc_ with _Nc_ in Neoverse scheduling models.
This brings them in line with the other Neoverse scheduling models, reducing
the amount of differences between them.
2024-09-12 14:47:13 +01:00
David Green
5c7957dd4f [AArch64] Allow i16->f64 uitofp tbl shuffles
Just as we convert i8->f32 uitofp to tbl to perform the zext, we can do the
same for i16->f64.
2024-09-11 22:21:52 +01:00
Momchil Velikov
b0ffaa7905
[AArch64] Prevent the AArch64LoadStoreOptimizer from reordering CFI instructions (#101317)
When AArch64LoadStoreOptimizer pass merges an SP update with a
load/store instruction and needs to adjust unwind information either:
* create the merged instruction at the location of the SP update
  (so no CFI  instructions are moved), or
* only move a CFI instruction if the move would not reorder it across
  other CFI  instructions

If neither of the above is possible, don't perform the optimisation.
2024-09-10 13:07:06 +01:00
Paul Walker
516f08b415
[LLVM][AArch64] Refactor sve-b16b16 instruction definitions. (#107265)
Update the predicate protecting bfloat instructions to only reference
FEAT_SVE_B16B16, which matches the specification.

Rename and move instruction classes to match the names of the encoding
groups that the bfloat arithmetic instructions belong to.
2024-09-10 10:58:46 +01:00
adprasad-nvidia
23595d1b96
[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend (#105375)
GCC compiles the built-in function `__builtin_bswap16` to the ARM
instruction rev16, which reverses the byte order of 16-bit data. Clang,
on the other hand, compiles the same built-in function to e.g.
```     
        rev     w8, w0
        lsr     w0, w8, #16
```
i.e. it performs a byte reversal of a 32-bit register (which moves the
lower half, containing the 16-bit data, to the upper half) and then
right shifts the reversed 16-bit data back to the lower half of the
register.
We can improve Clang codegen by generating `rev16` instead of `rev` and
`lsr`, like GCC.
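A minimal sketch of the built-in whose extension now selects rev16 (name illustrative):

```cpp
// Returning the 16-bit result extends it into a 32-bit register, which is the
// bswap-followed-by-any_extend shape this patch lowers to a single rev16.
unsigned short swap16(unsigned short x) {
  return __builtin_bswap16(x);
}
```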
2024-09-10 10:57:07 +01:00
Momchil Velikov
cf8fb4320f
[AArch64] Implement NEON vamin/vamax intrinsics (#99041)
This patch implements the intrinsics of the form

    floatNxM_t vamin[q]_fN(floatNxM_t vn, floatNxM_t vm);
    floatNxM_t vamax[q]_fN(floatNxM_t vn, floatNxM_t vm);

as defined in https://github.com/ARM-software/acle/pull/324

---------

Co-authored-by: Hassnaa Hamdi <hassnaa.hamdi@arm.com>
2024-09-09 13:34:41 +01:00
Lukacma
d57be195e3
[AArch64] replace SVE intrinsics with no active lanes with zero (#107413)
This patch extends https://github.com/llvm/llvm-project/pull/73964 and
optimises SVE intrinsics into zero constants when the predicate has no
active lanes.
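A hedged sketch of the kind of fold, assuming a zeroing (_z) intrinsic with an all-false predicate (names illustrative):

```cpp
#include <arm_sve.h>

// With no active lanes, a zeroing intrinsic's result is all zeros, so the
// call can be replaced by a zero constant.
svint32_t folds_to_zero(svint32_t a, svint32_t b) {
  return svadd_s32_z(svpfalse_b(), a, b);
}
```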
2024-09-09 10:28:01 +01:00
Craig Topper
f2b71491d1
[MC] Make MCRegisterInfo::getLLVMRegNum return std::optional<MCRegister>. NFC (#107776) 2024-09-08 21:21:51 -07:00
David Green
307713aafc
[AArch64] Do not generate uitofp(ld4) where and/shift can be used. (#107538)
After #107201 and #107367 the codegen for zext(ld4) can use and/shift
to extract the lanes out of the original vector's elements. This avoids
the need for the expensive ld4 operations, so it can lead to performance
improvements over using the interleaving loads and ushll.

This patch stops the generation of ld4 for uitofp(ld4) that would become
uitofp(zext(ld4)). It doesn't handle zext yet to make sure that widening
instructions like mull and addl are not adversely affected.
2024-09-07 15:31:26 +01:00
Igor Kirillov
bf57ecf06e
[AArch64] Prevent generating tbl instruction instead of smull (#106375)
Generating a tbl instruction for the zext in an expression like
mul(zext(i8), sext) is not optimal.
Instead, allowing later optimisations to generate smull(zext, sext)
would perform some of the type extensions implicitly and be faster.
2024-09-06 14:53:50 +01:00
Sam Tebbs
458c91d810
[AArch64][NEON] Lower fixed-width add partial reductions to dot product (#107078)
This PR adds lowering for fixed-width <4 x i32> and <2 x i32> partial
reductions to a dot product when Neon and the dot product feature are
available.

The work is by Max Beck-Jones (@DevM-uk).
2024-09-06 09:38:03 +01:00
David Green
9df592fb80
[AArch64] Fold away zext of extract of uzp. (#107367)
Similar to #107201, this comes up from the lowering of zext of
deinterleaving shuffles. Patterns such as ext(extract_subvector(uzp(a,
b))) can be converted to a simple and to perform the extract/zext from a
uzp1. Uzp2 can be handled with an extra shift, and due to the existing
legalization there may be an and/shift in between, which can be combined in as well.

Mostly this reduces instruction count or increases the amount of
parallelism in the sequence.
2024-09-05 20:25:56 +01:00
Sander de Smalen
91a3c6f3d6
[AArch64] Remove redundant COPY from loadRegFromStackSlot (#107396)
This removes a redundant 'COPY' instruction that #81716 probably forgot
to remove.

This redundant COPY led to an issue because code in
LiveRangeSplitting expects that the instruction emitted by
`loadRegFromStackSlot` is an instruction that accesses memory, which
isn't the case for the COPY instruction.
2024-09-05 17:54:57 +01:00
Paul Walker
be1958fd48
[LLVM][CodeGen][SVE] Implement nxvbf16 fpextend to nxvf32/nxvf64. (#107253)
NOTE: There are no dedicated SVE instructions, but bf16->f32 is just a
left shift because the two formats share the same exponent range, and from
there the other convert instructions can be used.
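A hedged scalar sketch of why the extension is just a shift: bf16 is the top 16 bits of an IEEE-754 binary32 value (names illustrative):

```cpp
#include <cstdint>
#include <cstring>

float bf16_bits_to_f32(uint16_t bits) {
  uint32_t widened = (uint32_t)bits << 16; // bf16 -> f32 is a 16-bit left shift
  float f;
  std::memcpy(&f, &widened, sizeof f);
  return f;
}
```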
2024-09-05 17:02:48 +01:00
Jon Roelofs
bded3b3ea9
[llvm][AArch64] Improve the cost model for i128 div's (#107306) 2024-09-05 07:42:23 -07:00
Lukacma
7f0c5b0502
[AArch64]Fix invalid use of ld1/st1 in stack alloc (#105518)
This patch fixes incorrect usage of the scalar+immediate variant of the
ld1/st1 instructions during stack allocation, caused by commit
c4bac7f7dc. That commit used ld1/st1 even when the stack offset was
outside the instruction's immediate range, producing invalid assembly,
and it was also using incorrect offsets when using ld1/st1.
2024-09-05 14:47:10 +01:00
David Green
77f0488225
[AArch64] Combine zext of deinterleaving shuffle. (#107201)
This is part 1 of a few patches that are intended to take deinterleaving
shuffles with masks like `[0,4,8,12]`, where the shuffle is
zero-extended to a larger size, and optimize away the deinterleave. In
this case it converts them to `and(uzp1, mask)`, where the `uzp1` acts
upon the elements in the larger type size to get the lanes into the
correct positions, and the `and` performs the zext. It performs the
combine fairly late, on the legalized type, so that uitofp operations that are
converted to uitofp(zext(..)) will also be handled.
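A hedged sketch of source that can produce a zero-extended deinterleaving shuffle once vectorised (names illustrative):

```cpp
#include <cstdint>

// Reading every 4th byte and widening it: after vectorisation this becomes a
// shuffle with a mask like [0,4,8,12] feeding a zext, which the combine can
// turn into and(uzp1, mask).
void deinterleave_and_widen(const uint8_t *src, uint32_t *dst, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[4 * i];
}
```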
2024-09-05 08:11:29 +01:00
Momchil Velikov
f7fa75b208
[AArch64] Implement intrinsics for SME2 FAMIN/FAMAX (#99063)
This patch implements these intrinsics:

``` c
  // Variants are also available for:
  //  [_f32_x2], [_f64_x2],
  //  [_f16_x4], [_f32_x4], [_f64_x4]
  svfloat16x2_t svamax[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming;
  svfloat16x2_t svamin[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming;
```

(cf. https://github.com/ARM-software/acle/pull/324)

Co-authored-by: Caroline Concatto <caroline.concatto@arm.com>
2024-09-04 15:29:32 +01:00
Momchil Velikov
bb1b368e0a
[AArch64] Implement intrinsics for SVE FAMIN/FAMAX (#99042)
This patch implements the following intrinsics:

* Floating-point absolute maximum (predicated)

svfloat16_t svamax[_f16]_m(svbool_t, svfloat16_t, svfloat16_t);
svfloat16_t svamax[_f16]_x(svbool_t, svfloat16_t, svfloat16_t);
svfloat16_t svamax[_f16]_z(svbool_t, svfloat16_t, svfloat16_t);

svfloat16_t svamax[_n_f16]_m(svbool_t, svfloat16_t, float16_t);
svfloat16_t svamax[_n_f16]_x(svbool_t, svfloat16_t, float16_t);
svfloat16_t svamax[_n_f16]_z(svbool_t, svfloat16_t, float16_t);

* Floating-point absolute minimum (predicated)

svfloat16_t svamin[_f16]_m(svbool_t, svfloat16_t, svfloat16_t);
svfloat16_t svamin[_f16]_x(svbool_t, svfloat16_t, svfloat16_t);
svfloat16_t svamin[_f16]_z(svbool_t, svfloat16_t, svfloat16_t);

svfloat16_t svamin[_n_f16]_m(svbool_t, svfloat16_t, float16_t);
svfloat16_t svamin[_n_f16]_x(svbool_t, svfloat16_t, float16_t);
svfloat16_t svamin[_n_f16]_z(svbool_t, svfloat16_t, float16_t);

All the intrinsics also have variants for `f32` and `f64`, and have the
`__arm_streaming` attribute.

(cf. https://github.com/ARM-software/acle/pull/324)
2024-09-04 13:07:57 +01:00
Paul Walker
2fef449f30
[LLVM][AArch64] Enable verifyTargetSDNode for scalable vectors and fix the fallout. (#104820)
Fix incorrect use of AArch64ISD::UZP1/UUNPK{HI,LO} in:
  AArch64TargetLowering::LowerDIV
  AArch64TargetLowering::LowerINSERT_SUBVECTOR
    
The latter highlighted DAG combines that relied on broken behaviour,
which this patch also fixes.
2024-09-04 11:07:11 +01:00
Lukacma
3e948eb3e8
[AArch64][NEON] Add intrinsics for LUTI (#96883)
This patch adds intrinsics for NEON LUTI2 and LUTI4 instructions as
specified in the [ACLE
proposal](https://github.com/ARM-software/acle/pull/324)
2024-09-04 10:39:59 +01:00
Lukacma
59093cae86
[AARCH64][SVE] Add intrinsics for SVE LUTI instructions (#97058)
This patch adds intrinsics for LUTI2 and LUTI4 instructions, which use
SVE registers, as specified in
https://github.com/ARM-software/acle/pull/324
2024-09-04 10:39:43 +01:00
Kazu Hirata
a628bc3c2e [AArch64] Fix a warning
This patch fixes:

  lib/Target/AArch64/AArch64GenPostLegalizeGILowering.inc:506:14:
  error: unused variable 'GIMatchData_matchinfo'
  [-Werror,-Wunused-variable]
2024-09-03 19:55:08 -07:00