2099 Commits

Craig Topper
bc282605df
[SelectionDAG] Require last operand of (STRICT_)FP_ROUND to be a TargetConstant. (#117639)
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
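
A minimal sketch of the required idiom (helper name and surrounding context
assumed): use `getIntPtrConstant` with `isTarget = true` so the trailing
"truncating" flag is a TargetConstant rather than a plain Constant.

```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Build an (STRICT_)FP_ROUND whose last operand is a TargetConstant.
static SDValue buildFPRound(SelectionDAG &DAG, const SDLoc &DL, EVT VT,
                            SDValue Src, SDValue Chain = SDValue()) {
  SDValue TruncFlag = DAG.getIntPtrConstant(0, DL, /*isTarget=*/true);
  if (!Chain)
    return DAG.getNode(ISD::FP_ROUND, DL, VT, Src, TruncFlag);
  return DAG.getNode(ISD::STRICT_FP_ROUND, DL, DAG.getVTList(VT, MVT::Other),
                     {Chain, Src, TruncFlag});
}
```
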
2024-11-25 21:36:33 -08:00
Benjamin Maxwell
cc721dba4e
[AArch64][Codegen] Improve small shufflevector/concat lowering for SME (#116662)
This now tries to widen the shuffle before generating a possibly
expensive SVE TBL, which may allow the shuffle to be matched as something
cheaper, like a ZIP1.
2024-11-22 10:15:23 +00:00
Paul Walker
71b87d1267
[LLVM][SVE] Ensure all fixed length mask bits are defined. (#116819)
convertFixedMaskToScalableVector expects the mask input to honour the
BoolContents scheme employed by the target. For AArch64 this means a
mask should be zero or all ones, and thus when promoting a mask we must
use a sign extend.
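
A small sketch of that promotion (helper name assumed): sign-extending the i1
mask keeps every element either all-zeros or all-ones, whereas an any-extend
would leave the high bits undefined.

```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Promote a fixed-length i1 mask so it honours ZeroOrNegativeOneBooleanContent.
static SDValue promoteFixedMask(SelectionDAG &DAG, const SDLoc &DL,
                                SDValue Mask, EVT WideVT) {
  return DAG.getNode(ISD::SIGN_EXTEND, DL, WideVT, Mask);
}
```
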
2024-11-20 13:54:50 +00:00
Sergei Barannikov
a160e51500
[AArch64] Fix SDNode type mismatches between *.td files and ISel (#116523)
* `MRS`, `PTEST` and FP comparisons were missing the "flags" result, and
were sometimes created with invalid types (f32, Glue, Other).
* `REV16`, `REV32`, `REV64`, and `CMGEz` were sometimes created with an
extra operand.
* `TLSDESC_CALLSEQ` had `SDNPInGlue` property, but the node was never
created with a glue operand.
2024-11-20 15:55:28 +03:00
David Green
bca846d462
[AArch64] Improve mull generation (#114997)
This attempts to clean up and improve where we generate smull/umull
using known-bits. For v2i64 types (where no mul is present), we try to
create mull more aggressively to avoid scalarization.
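
For illustration, the kind of v2i64 multiply this targets (written with the
Clang/GCC vector extension; the exact cases caught depend on the known-bits
analysis):

```
#include <cstdint>

typedef uint64_t v2u64 __attribute__((vector_size(16)));

// Both operands are provably 32-bit values, so the 64-bit element multiply
// can be selected as a single umull instead of being scalarised.
v2u64 mul_small(v2u64 a, v2u64 b) {
  const v2u64 lo32 = {0xffffffff, 0xffffffff};
  a &= lo32;
  b &= lo32;
  return a * b;
}
```
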
2024-11-20 09:12:22 +00:00
Craig Topper
ce0cc8e9eb [AArch64][VE][X86] Use getSignedTargetConstant. NFC 2024-11-18 13:12:23 -08:00
Sergei Barannikov
baf59be89b
[SelectionDAG] Fix return types of TC_RETURN for several targets (#116504)
TC_RETURN nodes do not have a glue result.
2024-11-17 02:14:05 +03:00
Sergei Barannikov
b69f646c46
[AArch64] Remove unused SDNodes (NFC) (#116236)
The corresponding enum members were only used by `EmitMOPS`, which
immediately translated them to machine opcodes. Just pass the machine
opcodes instead.
2024-11-16 13:14:42 +03:00
James Chesterman
4e1db6a318
[AArch64][SVE] Add AArch64ISD nodes for wide add instructions (#115895)
Previously, when lowering a partial reduction to a pair of wide adds, the
corresponding intrinsics were returned as nodes. Now AArch64ISD nodes are
returned instead.
2024-11-15 11:01:10 +00:00
Ricardo Jesus
e52238b59f
[AArch64] Add @llvm.experimental.vector.match (#101974)
This patch introduces an experimental intrinsic for matching the
elements of one vector against the elements of another.

For AArch64 targets that support SVE2, the intrinsic lowers to a MATCH
instruction for supported fixed and scalable vector types.
2024-11-14 09:00:19 +00:00
James Chesterman
c3c2e1e161
[AArch64][SVE] Add codegen support for partial reduction lowering to wide add instructions (#114406)
For partial reductions where the number of elements is halved, a pair of
wide add instructions can be used.
2024-11-12 13:53:35 +00:00
Kazu Hirata
a41922ad75
[AArch64] Remove unused includes (NFC) (#115685)
Identified with misc-include-cleaner.
2024-11-11 07:35:08 -08:00
Daniil Kovalev
da083e358e
[PAC][CodeGen][ELF][AArch64] Support signed GOT (#113811)
This re-applies #96164 after revert in #102434.

Support the following relocations and assembly operators:

- `R_AARCH64_AUTH_ADR_GOT_PAGE` (`:got_auth:` for `adrp`)
- `R_AARCH64_AUTH_LD64_GOT_LO12_NC` (`:got_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_GOT_ADD_LO12_NC` (`:got_auth_lo12:` for `add`)

A `LOADgotAUTH` pseudo-instruction is introduced, which is later expanded to
an actual instruction sequence like the following.

```
adrp x16, :got_auth:sym
add x16, x16, :got_auth_lo12:sym
ldr x0, [x16]
autia x0, x16
```

If re-signing is requested, as below, the `LOADgotPAC` pseudo is used, and the
GOT load is lowered similarly to `LOADgotAUTH`.

```
@var = global i32 0
define ptr @resign_globalvar() {
  ret ptr ptrauth (ptr @var, i32 3, i64 43)
}
```

If the FPAC bit is not set and an auth instruction is emitted, a check+trap
sequence similar to the one used for the `AUT` pseudo is emitted to ensure
auth success.

Both SelectionDAG and GlobalISel are supported.
For FastISel, we fall back to SelectionDAG.

Tests starting with 'ptrauth-' have corresponding variants w/o this prefix.

See also specification
https://github.com/ARM-software/abi-aa/blob/main/pauthabielf64/pauthabielf64.rst#appendix-signed-got
2024-11-01 12:21:10 +03:00
Yingwei Zheng
cf9d1c1486
[SDAG] Simplify SDNodeFlags with bitwise logic (#114061)
This patch allows using enumeration values directly and simplifies the
implementation with bitwise logic. It addresses the comment in
https://github.com/llvm/llvm-project/pull/113808#discussion_r1819923625.
2024-10-31 08:10:07 +08:00
David Green
83ae171722
[AArch64] Add ComputeNumSignBits for VASHR. (#113957)
As with a normal ISD::SRA node, they take the number of sign bits of the
incoming value and increase it by the shift amount.
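
Worked form of that rule (a standalone helper for illustration only):

```
#include <algorithm>

// An arithmetic shift right by ShiftAmt duplicates the sign bit, so the
// result has at least SrcSignBits + ShiftAmt sign bits, clamped to the
// element width.
unsigned signBitsAfterAshr(unsigned SrcSignBits, unsigned ShiftAmt,
                           unsigned EltBits) {
  return std::min(SrcSignBits + ShiftAmt, EltBits);
}
```
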
2024-10-29 21:02:32 +00:00
David Green
8274be509e [AArch64] Remove header dependencies of AArch64ISelLowering.h. NFC
This patch aims to reduce the includes used by AArch64ISelLowering.h, allowing it
to be included by unittests so that they can reference the AArch64ISD nodes.
It:
 - Moves the inclusion of AArch64SMEAttributes.h to the uses.
 - Moves LowerPtrAuthGlobalAddressStatically to a static function, so that
   AArch64PACKey is not required in the header.
 - Moves the definitions of getExceptionPointerRegister to the cpp file, to
   remove the reference of AArch64::X0.
2024-10-28 18:53:37 +00:00
Tex Riddell
c03d09ce3e
[aarch64] atan2 intrinsic lowering (p5) (#112611)
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

- `VecFuncs.def`: define intrinsic to sleef/armpl mapping
- `LegalizerHelper.cpp`: add missing fewerElementsVector handling for
the new atan2 intrinsic
- `AArch64ISelLowering.cpp`: Add AArch64 specializations so atan2 is lowered
like the other NEON instructions
- `AArch64LegalizerInfo.cpp`: Legalize atan2.

Part 5 for Implement the atan2 HLSL Function #70096.
2024-10-24 17:53:12 -07:00
Sander de Smalen
db0e376044
[AArch64] Fix failure with inline asm and svcount (#112537)
This fixes an issue where the compiler runs into an assertion failure
for the following example:

  register svcount_t pred asm("pn8") = svptrue_c8();
  asm("ld1w { z0.s, z4.s, z8.s, z12.s }, %[pred]/z, [x0]\n"
    :
    : [pred] "Uph" (pred)
    : "memory", "cc");

Here the register constraint that ends up in the LLVM IR is "{pn8}", but
the code in `TargetRegisterInfo::getRegForInlineAsmConstraint` that
parses that string follows a path where it queries a suitable register
class for this register (i.e. the PPRorPNR regclass), for which it then
chooses `nxv16i1` as a suitable type. These choices individually are
correct, but the combined result isn't, because the type should be
`aarch64svcount`.
This then results in issues later on in SelectionDAGBuilder.cpp in
CopyToReg because the type of the actual value and the computed type
from the constraint don't match.

This PR pre-empts this issue by parsing the predicate explicitly and
returning the correct register class.
2024-10-24 17:41:07 +01:00
Ricardo Jesus
8a9921f569
[AArch64] Use INDEX for constant Neon step vectors (#113424)
When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors, e.g.:
```
uint32x4_t foo() {
  return (uint32x4_t){0, 1, 2, 3};
}
```
Currently:
```
foo():
        adrp    x8, .LCPI1_0
        ldr     q0, [x8, :lo12:.LCPI1_0]
        ret
```
With INDEX:
```
foo():
        index   z0.s, #0, #1
        ret
```

The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under a check for `!Subtarget->isNeonAvailable()`. This patch
refactors this to enable the corresponding code path unconditionally for
constant step vectors (as long as we can use SVE for them).
2024-10-23 15:20:33 +01:00
Paul Walker
f1ade1f874
[LLVM][CodeGen][AArch64] while_le(#,max_int) -> all_active (#111183)
When the second operand of an incrementing while instruction is the
maximum value, comparisons that include equality can never fail.
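
For illustration (assuming the ACLE `svwhilele` intrinsic): with the limit at
the maximum signed value the comparison can never fail, so the predicate is
known to be all-active and the while can become a ptrue.

```
#include <arm_sve.h>
#include <cstdint>

// Every lane satisfies "index <= INT32_MAX", so this is an all-active mask.
svbool_t all_lanes(int32_t start) {
  return svwhilele_b8_s32(start, INT32_MAX);
}
```
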
2024-10-22 12:28:39 +01:00
James Chesterman
11c818816d
[AArch64] Improve index selection for histograms (#111150)
Removes unnecessary extends on the indices passed into histogram instructions. It also removes the instruction when the mask is zero.
2024-10-22 11:14:00 +01:00
David Green
009fb567ce [AArch64] Add patterns for combining qxtn+rshr to qrshrn
Similar to bd861d0e690cfd05184d86, this adds some patterns for converting
signed and unsigned variants of rshr+qxtn to qrshrn.
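
The source-level shape of the pattern, for reference (NEON intrinsics; the
combine itself works on the DAG nodes):

```
#include <arm_neon.h>

// Rounding shift right followed by a saturating narrow; the pair can be
// matched as a single sqrshrn #4.
int16x4_t round_shift_narrow(int32x4_t v) {
  return vqmovn_s32(vrshrq_n_s32(v, 4));
}
```
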
2024-10-21 21:06:48 +01:00
David Green
bd861d0e69 [AArch64] Add some basic patterns for qshrn.
With the truncssat nodes these are relatively simple tablegen patterns to add.
The existing intrinsics are converted to shift+truncsat so they can lower using
the new patterns.

Fixes #112925.
2024-10-21 15:04:20 +01:00
David Green
ecfeacd152 [AArch64] Convert aarch64_neon_sqxtn to ISD::TRUNCATE_SSAT_S and replace tablegen patterns
This lowers the aarch64_neon_sqxtn intrinsics to the new TRUNCATE_SSAT_S ISD
nodes, performing the same for sqxtun and uqxtn. This allows us to clean up the
tablegen patterns a little and in a future commit add combines for sqxtn.
2024-10-21 14:29:21 +01:00
Simon Pilgrim
f0b3b6d15b [DAG] isConstantIntBuildVectorOrConstantInt - peek through bitcasts (#112710) (REAPPLIED)
Alter both isConstantIntBuildVectorOrConstantInt + isConstantFPBuildVectorOrConstantFP to return a bool instead of the underlying SDNode, and adjust usage to account for this.

Update isConstantIntBuildVectorOrConstantInt to peek through bitcasts when attempting to find a constant; in particular, this improves canonicalization of constants to the RHS on commutable instructions.

X86 is the main beneficiary here, as it often rematerializes 0/-1 vector constants as vXi32 and bitcasts them to the requested type.

Minor cleanup that helps with #107423

Reapplied after regression fix ba1255def64a9c3c68d97ace051eec76f546eeb0
2024-10-20 14:23:21 +01:00
Martin Storsjö
b26df3e463 Revert "[DAG] isConstantIntBuildVectorOrConstantInt - peek through bitcasts (#112710)"
This reverts commit a630771b28f4b252e2754776b8f3ab416133951a.

This caused compilation to hang for Windows/ARM, see
https://github.com/llvm/llvm-project/pull/112710 for details.
2024-10-20 00:49:16 +03:00
Simon Pilgrim
a630771b28
[DAG] isConstantIntBuildVectorOrConstantInt - peek through bitcasts (#112710)
Alter both isConstantIntBuildVectorOrConstantInt + isConstantFPBuildVectorOrConstantFP to return a bool instead of the underlying SDNode, and adjust usage to account for this.

Update isConstantIntBuildVectorOrConstantInt to peek through bitcasts when attempting to find a constant; in particular, this improves canonicalization of constants to the RHS on commutable instructions.

X86 is the main beneficiary here, as it often rematerializes 0/-1 vector constants as vXi32 and bitcasts them to the requested type.

Minor cleanup that helps with #107423
2024-10-18 10:52:55 +01:00
Benjamin Maxwell
5f7502bf1f
[AArch64][SVE] Support lowering fixed-length BUILD_VECTORS to ZIPs (#111698)
This allows lowering fixed-length (non-constant) BUILD_VECTORS (<=
128-bit) to a chain of ZIP1 instructions when Neon is not available,
rather than using the default lowering, which is to spill to the stack
and reload.

For example,

```
t5: v4f32 = BUILD_VECTOR(t0, t1, t2, t3)
```

Becomes:

```
zip1 z0.s, z0.s, z1.s // z0 = t0,t1,...
zip1 z2.s, z2.s, z3.s // z2 = t2,t3,...
zip1 z0.d, z0.d, z2.d // z0 = t0,t1,t2,t3,...
```

When values are already in FPRs, this generally seems to lead to a more
compact output with less movement to/from the stack.
2024-10-18 10:19:22 +01:00
Keith Packard
44b020a381
[PowerPC][ISelLowering] Support -mstack-protector-guard=tls (#110928)
Add support for using a thread-local variable with a specified offset
for holding the stack guard canary value. This supports both 32- and 64-
bit PowerPC targets.

This mirrors changes from #108942 but targeting PowerPC instead of
RISCV. Because both of these PRs modify the same driver functions, this
series is stacked on top of the RISC-V one.

---------

Signed-off-by: Keith Packard <keithp@keithp.com>
2024-10-17 19:06:47 -07:00
Jay Foad
85c17e4092
[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706)
Convert many instances of:
  Fn = Intrinsic::getOrInsertDeclaration(...);
  CreateCall(Fn, ...)
to the equivalent CreateIntrinsic call.
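
A sketch of the before/after shape for an overloaded intrinsic such as
`llvm.fabs` (helper names here are illustrative):

```
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Old style: materialise the declaration, then call it.
Value *emitFabsOld(IRBuilder<> &Builder, Module *M, Type *Ty, Value *X) {
  Function *Fn = Intrinsic::getOrInsertDeclaration(M, Intrinsic::fabs, {Ty});
  return Builder.CreateCall(Fn, {X});
}

// New style: CreateIntrinsic resolves the declaration and emits the call.
Value *emitFabsNew(IRBuilder<> &Builder, Type *Ty, Value *X) {
  return Builder.CreateIntrinsic(Intrinsic::fabs, {Ty}, {X});
}
```
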
2024-10-17 16:20:43 +01:00
Nikita Popov
255a99c29f
[APInt] Fix APInt constructions where value does not fit bitwidth (NFCI) (#80309)
This fixes all the places that hit the new assertion added in
https://github.com/llvm/llvm-project/pull/106524 in tests. That is,
cases where the value passed to the APInt constructor is not an N-bit
signed/unsigned integer, where N is the bit width and signedness is
determined by the isSigned flag.

The fixes either set the correct value for isSigned, set the
implicitTrunc flag, or perform more calculations inside APInt.

Note that the assertion is currently still disabled by default, so this
patch is mostly NFC.
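
A sketch of the two escape hatches mentioned above (assuming the `isSigned`
and `implicitTrunc` constructor flags from #106524):

```
#include "llvm/ADT/APInt.h"
#include <cstdint>
using namespace llvm;

void apintFlagExamples() {
  // -1 is a valid 8-bit *signed* value, so mark the construction as signed.
  APInt A(/*numBits=*/8, /*val=*/uint64_t(-1), /*isSigned=*/true);
  // 0x1ff does not fit in 8 bits; request truncation explicitly.
  APInt B(/*numBits=*/8, /*val=*/0x1ff, /*isSigned=*/false,
          /*implicitTrunc=*/true);
}
```
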
2024-10-17 08:48:08 +02:00
Jay Foad
9255850e89 [LLVM] Remove unused variables after #112546 2024-10-16 16:15:34 +01:00
Jay Foad
d9c95efb6c
[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112546)
Convert almost every instance of:
  CreateCall(Intrinsic::getOrInsertDeclaration(...), ...)
to the equivalent CreateIntrinsic call.
2024-10-16 15:43:30 +01:00
David Green
a07639f4bb
[AArch64] Increase inline memmove limit to 16 stored registers (#111848)
The memcpy inline limit has been 16 for a long time; this patch makes
the memmove inline limit the same, allowing small constant-sized
memmoves to be emitted inline. The 16 is the number of registers stored,
which equates to a limit of 256 bytes.
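
For example, a constant 256-byte memmove now sits exactly at the limit and can
be expanded inline rather than calling the library routine:

```
#include <cstring>

// 256 bytes = 16 stored 16-byte registers, the new inline limit.
void shift_down(char *buf) {
  std::memmove(buf, buf + 64, 256);
}
```
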
2024-10-14 08:57:32 +01:00
Benjamin Maxwell
c3a10dc849
[AArch64] Disable consecutive store merging when Neon is unavailable (#111519)
Lowering fixed-size BUILD_VECTORS without Neon may introduce stack
spills, leading to more stores/reloads than if the stores were not
merged. In some cases, it can also prevent using paired store
instructions.

In the future, we may want to relax when SVE is available, but
currently, the SVE lowerings for BUILD_VECTOR are limited to a few
specific cases.
2024-10-11 14:15:01 +01:00
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation for
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
2024-10-11 05:26:03 -07:00
YunQiang Su
72fb379225
AArch64: Select FCANONICALIZE (#104429)
AArch64's FMINNM/FMAXNM instructions follow IEEE754-2008, so we can use them
to canonicalize a floating-point number. FMINNUM_IEEE/FMAXNUM_IEEE are also
used when expanding operations such as FMINIMUMNUM/FMAXIMUMNUM, so let's
define them.
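
For illustration (assuming Clang's `__builtin_canonicalizef`, which lowers to
`llvm.canonicalize.f32` and hence ISD::FCANONICALIZE), canonicalization can now
be selected using FMINNM/FMAXNM as described above:

```
// Canonicalize a float; selectable via FMINNM/FMAXNM on AArch64.
float canon(float x) {
  return __builtin_canonicalizef(x);
}
```
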

---------

Co-authored-by: Your Name <you@example.com>
2024-10-11 08:45:14 +08:00
YunQiang Su
8d35ab80fc
AArch64: Add FMINNUM_IEEE and FMAXNUM_IEEE support (#107855)
AArch64's FMINNM/FMAXNM instructions follow IEEE754-2008, so we can use them
to canonicalize a floating-point number. FMINNUM_IEEE/FMAXNUM_IEEE are also
used when expanding operations such as FMINIMUMNUM/FMAXIMUMNUM, so let's
define them.

Update combine_andor_with_cmps.ll.
Add fp-maximumnum-minimumnum.ll, with nnan testcases only.

v1f64 is not supported yet. If we set v1f64 as legal, FMINNUM/FMAXNUM will
have a problem: both of them use
`if (isOperationLegalOrCustom(FMAXNUM_IEEE, VT))`, and AArch64 depends on
`expandFMINNUM_FMAXNUM` returning `SDValue()` for FMAXNUM and FMINNUM.
We should fix this, but that will be done in a future patch.
2024-10-10 15:09:47 +08:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Paul Walker
bfe066676b
[LLVM][CodeGen][SVE2] Implement nxvf64 fpround to nxvbf16. (#111012)
NOTE: SVE2 only because that is when FCVTX is available, which is
required to perform the necessary two-step rounding.
2024-10-08 12:01:57 +01:00
Paul Walker
02dd6b1014
[LLVM][CodeGen] Add lowering for scalable vector bfloat operations. (#109803)
Specifically:
  fabs, fadd, fceil, fdiv, ffloor, fma, fmax, fmaxnm, fmin, fminnm,
  fmul, fnearbyint, fneg, frint, fround, froundeven, fsub, fsqrt &
  ftrunc
2024-10-07 13:01:59 +01:00
Jorge Botto
a4516da49f
[AArch64] - Fold and and cmp into tst (#110347)
Fixes https://github.com/llvm/llvm-project/issues/102703.

https://godbolt.org/z/nfj8xsb1Y

The following pattern:

```
%2 = and i32 %0, 254
%3 = icmp eq i32 %2, 0
```
is optimised by instcombine into:

```
%3 = icmp ult i32 %0, 2
```

However, the post-instcombine form leads to worse AArch64 code than the unoptimised version.

Pre instcombine:
```
        tst     w0, #0xfe
        cset    w0, eq
        ret
```
Post instcombine:
```
        and     w8, w0, #0xff
        cmp     w8, #2
        cset    w0, lo
        ret
```


In the unoptimised version, SelectionDAG converts `SETCC (AND X 254) 0 EQ` into `CSEL 0 1 1 (ANDS X 254)`, which gets emitted as a `tst`.

In the optimised version, SelectionDAG converts `SETCC (AND X 255) 2 ULT` into `CSEL 0 1 2 (SUBS (AND X 255) 2)`, which gets emitted as an `and`/`cmp`.

This PR adds an optimisation to `AArch64ISelLowering`, converting `SETCC (AND X Y) Z ULT` into `SETCC (AND X (Y & ~(Z - 1))) 0 EQ` when `Z` is a power of two. This makes SelectionDAG/Codegen produce the same optimised code for both examples.
2024-10-03 17:56:01 +01:00
Paul Walker
d283705829
[AArch64][SVE] Fix definition of bfloat fcvt intrinsics. (#110281)
Affected intrinsics:
  llvm.aarch64.sve.fcvt.bf16f32
  llvm.aarch64.sve.fcvtnt.bf16f32
    
The named intrinsics took a predicate based on the smallest element type
when it should have been based on the largest. The intrinsics have been
replaced by v2 equivalents and affected code ported to use them.

The patch includes changes to getSVEPredicateBitCast() that ensure the
generated code for the auto-upgraded old intrinsics is unchanged.
2024-10-03 12:36:01 +01:00
James Chesterman
b2a6814126
[AArch64][NEON][SVE] Lower i8 to i64 partial reduction to a dot product (#110220)
An i8 to i64 partial reduction can instead be done with an i8 to i32 dot
product followed by a sign extension.
2024-10-01 13:26:38 +01:00
CarolineConcatto
308c9a9451
[Clang][LLVM][AArch64] Add intrinsic for MOVT SME2 instruction (#97602)
This patch adds these intrinsics:

  // Variants are also available for:
  // [_s8], [_u16], [_s16], [_u32], [_s32], [_u64], [_s64]
  // [_bf16], [_f16], [_f32], [_f64]
  void svwrite_lane_zt[_u8](uint64_t zt0, svuint8_t zt, uint64_t idx)
      __arm_streaming __arm_inout("zt0");
  void svwrite_zt[_u8](uint64_t zt0, svuint8_t zt)
      __arm_streaming __arm_inout("zt0");

according to PR#324[1]
[1]https://github.com/ARM-software/acle/pull/324
2024-10-01 10:11:32 +01:00
Paul Walker
3e3780ef6a
[LLVM][CodeGen][SVE] Implement nxvf32 fpround to nxvbf16. (#107420) 2024-09-24 13:15:26 +01:00
Sam Tebbs
f7714342ae
[AArch64][NEON][SVE] Lower mixed sign/zero extended partial reductions to usdot (#107566)
This PR adds lowering for partial reductions of a mix of sign/zero
extended inputs to the usdot intrinsic.
2024-09-19 14:00:45 +01:00
David Green
960c975acd
[AArch64] Expand scmp/ucmp vector operations with sub (#108830)
Unlike the scalar case, where AArch64 prefers expanding scmp/ucmp with selects,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably, it also prevents the scalarization of vselect
during vector legalization.
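
One form of that arithmetic expansion, sketched with the Clang/GCC vector
extension (vector comparisons yield 0/-1 lane masks):

```
#include <cstdint>

typedef int32_t v4i32 __attribute__((vector_size(16)));

// Per lane: -1 if x < y, 0 if equal, 1 if x > y, with no per-lane select.
v4i32 scmp_lanes(v4i32 x, v4i32 y) {
  return (v4i32)(x < y) - (v4i32)(x > y);
}
```
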
2024-09-16 18:44:52 +01:00
David Green
5c7957dd4f [AArch64] Allow i16->f64 uitofp tbl shuffles
Just as we convert i8->f32 uitofp to tbl to perform the zext, we can do the
same for i16->f64.
2024-09-11 22:21:52 +01:00
adprasad-nvidia
23595d1b96
[AArch64] Lower __builtin_bswap16 to rev16 if bswap followed by any_extend (#105375)
GCC compiles the built-in function `__builtin_bswap16` to the ARM
instruction `rev16`, which reverses the byte order of 16-bit data. Clang, on
the other hand, compiles the same built-in function to e.g.
```     
        rev     w8, w0
        lsr     w0, w8, #16
```
i.e. it performs a byte reversal of a 32-bit register (which moves the
lower half, containing the 16-bit data, to the upper half) and then
right-shifts the reversed 16-bit data back to the lower half of the
register.
We can improve Clang codegen by generating `rev16` instead of `rev` and
`lsr`, like GCC.
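
A source-level example of the improved case (only the low 16 bits of the
result are used, so the extend is an any_extend and a single `rev16` suffices):

```
#include <cstdint>

uint16_t swap16(uint16_t x) {
  return __builtin_bswap16(x);
}
```
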
2024-09-10 10:57:07 +01:00