1915 Commits

Author SHA1 Message Date
Dinar Temirbulatov
990c4bc95f
[AArch64][SVE2] Generate SVE2 BSL instruction in LLVM for bit-twiddling. (#83514)
Allow folding or/and-and to the BSL instruction for scalable vectors.
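
For readers unfamiliar with the operation, the bitwise select that the or/and-and shape computes can be modelled in scalar C. This is only an illustrative sketch: the function names are hypothetical and are not taken from the patch, and the operand order of the real instruction is not modelled.

```c
#include <stdint.h>

/* Bitwise select: take bits of a where mask is 1 and bits of b where
   mask is 0. This is the or/and-and shape: (a & mask) | (b & ~mask). */
uint64_t bitwise_select(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}

/* An equivalent xor-based formulation of the same select. */
uint64_t bitwise_select_xor(uint64_t mask, uint64_t a, uint64_t b) {
    return b ^ ((a ^ b) & mask);
}
```

Both helpers return the same value for any inputs, which is why a single select instruction can stand in for the three logical operations.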
2024-04-10 11:07:59 +01:00
Sizov Nikita
d38bff460a
[AArch64] SimplifyDemandedBitsForTargetNode - add AArch64ISD::BICi handling (#76644)
Fold BICi if all destination bits are already known to be zeroes

```llvm
define <8 x i16> @haddu_known(<8 x i8> %a0, <8 x i8> %a1) {
  %x0 = zext <8 x i8> %a0 to <8 x i16>
  %x1 = zext <8 x i8> %a1 to <8 x i16>
  %hadd = call <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16> %x0, <8 x i16> %x1)
  %res = and <8 x i16> %hadd, <i16 511, i16 511, i16 511, i16 511,i16 511, i16 511, i16 511, i16 511>
  ret <8 x i16> %res
}
declare <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16>, <8 x i16>)
```

```
haddu_known:                            // @haddu_known
        ushll   v0.8h, v0.8b, #0
        ushll   v1.8h, v1.8b, #0
        uhadd   v0.8h, v0.8h, v1.8h
        bic     v0.8h, #254, lsl #8 <-- this one will be removed as we know high bits are zero extended
        ret
```

Fixes #53881
Fixes #53622
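
The known-bits argument behind the example above can be checked exhaustively in scalar C. This is a sketch under the assumption that each lane behaves like the hypothetical `uhadd16` helper below; it is not code from the patch.

```c
#include <stdint.h>

/* Unsigned halving add on a 16-bit lane: (a + b) >> 1, computed wide
   enough that the carry is not lost. */
uint16_t uhadd16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a + (uint32_t)b) >> 1);
}

/* Returns 1 if clearing bits 9..15 (the effect of bic #254, lsl #8,
   i.e. and with 0x01FF) never changes the halving add of two
   zero-extended bytes, and 0 otherwise. */
int bic_is_redundant(void) {
    for (uint32_t a = 0; a <= 0xFF; ++a) {
        for (uint32_t b = 0; b <= 0xFF; ++b) {
            uint16_t h = uhadd16((uint16_t)a, (uint16_t)b);
            if ((uint16_t)(h & 0x01FFu) != h)
                return 0;
        }
    }
    return 1;
}
```

Because both inputs are at most 255, the halving add is at most 255 as well, so the cleared bits are already zero in every case.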
2024-04-06 21:41:24 +01:00
Daniel Paoliello
43ba568daa
Prepend all library intrinsics with # when building for Arm64EC (#87542)
While attempting to build some Rust code, I was getting linker errors
due to missing functions that are implemented in `compiler-rt`. Turns
out that when `compiler-rt` is built for Arm64EC, all its function names
are mangled with the leading `#`.

This change removes the hard-coded list of library-implemented
intrinsics to mangle for Arm64EC, and instead assumes that they all must
be mangled.
2024-04-05 12:06:47 -07:00
Paul Kirth
f0724f0704
[llvm][NFC] Update URL in comment about Android ABI
The previous URL was stale, and referenced 'master' instead of 'main',
which will never be updated.

Reviewers: topperc, enh-google

Reviewed By: enh-google

Pull Request: https://github.com/llvm/llvm-project/pull/87726
2024-04-05 09:41:53 -07:00
Prabhuk
212b1a84a6
[CallSiteInfo][NFC] CallSiteInfo -> CallSiteInfo.ArgRegPairs (#86842)
CallSiteInfo was originally used only for argument-register pairs. Make
it a struct, in which we can store additional data for call sites.

Also, the variables/methods used for CallSiteInfo are named for its
original use case, e.g., CallFwdRegsInfo. Refactor these for the
upcoming use, e.g. addCallArgsForwardingRegs() -> addCallSiteInfo().

An upcoming patch will add type ids for indirect calls to propagate them
from the middle-end to the back-end. The type ids will then be used to
emit the call graph section.

Original RFC:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151044.html
Updated RFC:
https://lists.llvm.org/pipermail/llvm-dev/2021-July/151739.html

Differential Revision: https://reviews.llvm.org/D107109?id=362888

Co-authored-by: Necip Fazil Yildiran <necip@google.com>
2024-04-02 13:05:16 -07:00
Sander de Smalen
f914e8e77c
[AArch64][SME] Add coalescer barrier for args/results in locally streaming functions. (#85388)
Similar to how we protected FP/fixed-vector arguments and results from
calls, we should do the same for arguments/results from locally-streaming
functions such that those are not spilled/filled as ZPR registers.

This may cause a small regression (additional spills/fills), which is
addressed by #85386.
2024-03-26 11:40:31 +00:00
David Green
96819daa3d
[AArch64] Handle v2i16 and v2i8 in concat load combine. (#86264)
This extends the concat load patch from
https://reviews.llvm.org/D121400, which was later moved to a combine, to
handle v2i8 and v2i16 concat loads too.
2024-03-25 17:10:23 +00:00
Graham Hunter
36a3f8f647
[TTI][TLI][AArch64] Support scalable immediates with isLegalAddImmediate (#84173)
Adds a second parameter (default to 0) to isLegalAddImmediate, to
represent a scalable immediate.

Extends the AArch64 implementation to match immediates based on what addvl and inc[h|w|d] support.
2024-03-20 10:28:46 +00:00
Graham Hunter
cd768ec983
[AArch64] Support scalable offsets with isLegalAddressingMode (#83255)
Allows us to indicate that an addressing mode featuring a
vscale-relative immediate offset is supported.
2024-03-20 10:13:20 +00:00
Takuya Shimizu
65058a8d73
[AArch64][SelectionDAG] Expand v1f64-typed sin,cos,pow,log,exp intrinsics (#83745)
This patch makes NEON-enabled AArch64 backend expand the `sin, cos, pow,
log, log2, log10, exp, exp2, exp10` intrinsics for `v1f64` data type,
all of which caused selection failure before this patch.
Fixes https://github.com/llvm/llvm-project/issues/83729
2024-03-17 11:28:32 +09:00
Sander de Smalen
e639e7e986
[AArch64] NFC: Simplify the smstart/smstop pseudo. (#85067)
This is just a bit of cleanup to make the pseudo/code easier to
understand. This is based on the observation that we only need to pass
in a runtime value for 'pstate' if it is actually needed for generating a
runtime check.
2024-03-15 08:50:58 +00:00
Usman Nadeem
0b46884036
Revert "Revert "[AArch64] Improve lowering of truncating uzp1"" (#85119)
Reverts llvm/llvm-project#85115
The fix was already merged in
79cd2c0bb9
2024-03-13 11:58:10 -07:00
Mehdi Amini
06e310fee1
Revert "[AArch64] Improve lowering of truncating uzp1" (#85115)
Reverts llvm/llvm-project#82457

The bot is broken, likely because of mid-air collision.
2024-03-13 11:32:53 -07:00
Usman Nadeem
57b991ab39
[AArch64] Improve lowering of truncating uzp1 (#82457)
There were two existing patterns:
`concat_vectors(trunc(x), trunc(y)) -> uzp1(x, y)`
`concat_vectors(assertzext(trunc(x)), assertzext(trunc(y))) -> uzp1(x, y)`

Move them into a class and add the following `assertsext` pattern to it:
`concat_vectors(assertsext(trunc(x)), assertsext(trunc(y))) -> uzp1(x, y)`

Add the following transform for v8i8 and v4i16 result types to help with
pattern matching:
`truncating uzp1(x, y) -> trunc(concat(x, y))`
And a pattern to go with it:
`trunc(concat_vectors(x, y)) -> uzp1(x, y)`

Add another isel pattern for v8i8 and v4i16 result vector types, similar
to the existing concat pattern, but with a trunc node at the beginning:
`trunc(concat_vectors(assertext_trunc(x), assertext_trunc(y))) -> xtn(uzp1(x, y))`
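
The equivalence between `concat_vectors(trunc(x), trunc(y))` and `uzp1(x, y)` can be illustrated with a scalar C model. This is a sketch that assumes little-endian lane layout and v4i16 inputs; all function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* concat_vectors(trunc(x), trunc(y)): keep the low byte of each 16-bit lane. */
void concat_trunc(const uint16_t x[4], const uint16_t y[4], uint8_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = (uint8_t)x[i];
        out[i + 4] = (uint8_t)y[i];
    }
}

/* View 16-bit lanes as bytes, low byte first (little-endian layout). */
void lanes_to_le_bytes(const uint16_t v[4], uint8_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[2 * i] = (uint8_t)(v[i] & 0xFF);
        out[2 * i + 1] = (uint8_t)(v[i] >> 8);
    }
}

/* uzp1 on byte lanes: even-indexed bytes of xb, then even-indexed bytes of yb. */
void uzp1_bytes(const uint8_t xb[8], const uint8_t yb[8], uint8_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i] = xb[2 * i];
        out[i + 4] = yb[2 * i];
    }
}

/* Returns 1 if both formulations agree on a fixed pair of sample vectors. */
int uzp1_matches_concat_trunc_demo(void) {
    const uint16_t x[4] = {0x1234, 0xABCD, 0x00FF, 0xFF00};
    const uint16_t y[4] = {0x0001, 0x8000, 0x7F7F, 0xFFFF};
    uint8_t xb[8], yb[8], via_trunc[8], via_uzp1[8];
    lanes_to_le_bytes(x, xb);
    lanes_to_le_bytes(y, yb);
    concat_trunc(x, y, via_trunc);
    uzp1_bytes(xb, yb, via_uzp1);
    return memcmp(via_trunc, via_uzp1, sizeof via_trunc) == 0;
}
```

The even-indexed bytes are exactly the low halves of the wider lanes, which is why the truncation needs no separate instruction.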
2024-03-13 09:05:55 -07:00
Sander de Smalen
e42e97a4ad
[AArch64][SME] Don't mark 'smstart za' as using/defining VG. (#84775)
VG is only used/defined when changing the streaming mode, using 'smstart
sm' or plainly 'smstart' (same for smstop).
2024-03-13 08:21:33 +00:00
David Majnemer
edc1c3d24e [AArch64] Make more vector f16 operations legal
v8f16 is a legal type but promoting to v16f16 would result in an illegal
type.

Let's legalize these by a combination of splitting+promoting resulting
in a pair of v4f16.

Also, we were being overly cautious with different v4f16 nodes. Mark
more of them safe to promote to v4f32.
2024-03-08 19:52:54 +00:00
David Majnemer
5f935e9181 [AArch64] Optimize fp64 <-> fp16 SIMD conversions
Legalization would result in needless scalarization. Add some
DAGCombines to fix this up.
2024-03-08 19:52:53 +00:00
David Majnemer
9e759f3523 [AArch64] Fix fptoi/itofp for bf16
There were a number of issues that needed to be addressed:
- i64 to bf16 did not correctly round
- strict rounding needed to yield a chain
- fastisel did not have logic to bail on bf16
2024-03-06 06:17:39 +00:00
Fangrui Song
201572e34b
[AArch64] Implement -fno-plt for SelectionDAG/GlobalISel
Clang sets the nonlazybind attribute for certain ObjC features. The
AArch64 SelectionDAG implementation for non-intrinsic calls
(46e36f0953aabb5e5cd00ed8d296d60f9f71b424) is behind a cl option.

GCC implements -fno-plt for a few ELF targets. In Clang, -fno-plt also
sets the nonlazybind attribute. For SelectionDAG, make the cl option not
affect ELF so that non-intrinsic calls to a dso_preemptable function use
GOT. Adjust AArch64TargetLowering::LowerCall to handle intrinsic calls.

For FastISel, change `fastLowerCall` to bail out when a call is due to
-fno-plt.

For GlobalISel, handle non-intrinsic calls in CallLowering::lowerCall
and intrinsic calls in AArch64CallLowering::lowerCall (where the
target-independent CallLowering::lowerCall is not called).
The GlobalISel test in `call-rv-marker.ll` is therefore updated.

Note: the current -fno-plt -fpic implementation does not use GOT for a
preemptable function.

Link: #78275

Pull Request: https://github.com/llvm/llvm-project/pull/78890
2024-03-05 13:55:29 -08:00
Benjamin Kramer
8cc8fdaf5c [AArch64] Also promote vector bf16 INT_TP_FP to f32
This mirrors the scalar version.
2024-03-04 23:34:56 +01:00
David Majnemer
930e7ff9ae [AArch64] Optimize abs, neg and copysign for fp16/bf16
We can use bitwise arithmetic to implement these, making them
considerably faster than legalization via promotion.
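
As an illustration of the bitwise approach, the three operations can be written on raw 16-bit patterns in C. This sketch relies only on the fact that fp16 and bf16 both keep the sign in bit 15; the helper names are hypothetical and it is not the lowering code itself.

```c
#include <stdint.h>

/* fp16 and bf16 are both 16-bit formats with the sign in bit 15. */
#define HALF_SIGN_BIT 0x8000u

/* abs: clear the sign bit. */
uint16_t half_abs(uint16_t bits) {
    return (uint16_t)(bits & ~HALF_SIGN_BIT);
}

/* neg: flip the sign bit. */
uint16_t half_neg(uint16_t bits) {
    return (uint16_t)(bits ^ HALF_SIGN_BIT);
}

/* copysign: magnitude bits from mag, sign bit from sgn. */
uint16_t half_copysign(uint16_t mag, uint16_t sgn) {
    return (uint16_t)((mag & ~HALF_SIGN_BIT) | (sgn & HALF_SIGN_BIT));
}
```

For example, fp16 1.0 is `0x3C00` and bf16 1.0 is `0x3F80`; negating either only flips the top bit, with no conversion to a wider type.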
2024-03-04 20:05:05 +00:00
David Majnemer
23bc5b6392 [AArch64] Mark bf16 as custom for truncating stores & add a comment
While we don't use SVE2 as a fallback for missing NEON instructions for
BF16, it is confusing to break symmetry with fp16.

While we are here, add a comment explaining how BF16 immediates work.
2024-03-04 06:33:25 +00:00
David Majnemer
3dd6750027 [AArch64] Add more complete support for BF16
We can use a small amount of integer arithmetic to round FP32 to BF16
and extend BF16 to FP32.

While a number of operations still require promotion, this can be
reduced for some rather simple operations like abs, copysign, fneg but
these can be done in a follow-up.

A few neat optimizations are implemented:
- round-inexact-to-odd is used for F64 to BF16 rounding.
- quieting signaling NaNs for f32 -> bf16 tries to detect if a prior
  operation makes it unnecessary.
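
A common software formulation of FP32 to BF16 rounding with integer arithmetic is sketched below in C. It is an assumption-laden illustration with hypothetical function names, not necessarily the exact sequence the backend emits, and it does not cover the round-inexact-to-odd step used for F64 inputs.

```c
#include <stdint.h>
#include <string.h>

/* Round an FP32 bit pattern to BF16 using round-to-nearest-even. */
uint16_t f32_bits_to_bf16(uint32_t bits) {
    /* NaN: exponent all ones and a non-zero mantissa. Keep the top half
       and set the quiet bit so the result stays a (quiet) NaN. */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);
    /* Adding 0x7FFF plus the lowest kept bit makes exact ties round to even. */
    uint32_t lsb = (bits >> 16) & 1u;
    return (uint16_t)((bits + 0x7FFFu + lsb) >> 16);
}

/* Extending BF16 to FP32 is exact: the 16 bits become the top half. */
uint32_t bf16_to_f32_bits(uint16_t h) {
    return (uint32_t)h << 16;
}

/* Convenience wrappers that move between float values and bit patterns. */
uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return f32_bits_to_bf16(bits);
}

float bf16_to_f32(uint16_t h) {
    uint32_t bits = bf16_to_f32_bits(h);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The NaN branch is the part that a prior operation can make unnecessary: if the input is already known to be a quiet NaN or not a NaN at all, only the add and shift remain.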
2024-03-03 22:39:50 +00:00
David Green
0e9a102129 [AArch64] Remove unused AArch64ISD::BIT. NFC
These were last used in the fcopysign lowering, which now uses AArch64ISD::BSP.
2024-03-01 11:44:58 +00:00
Sander de Smalen
5bd01ac822
[AArch64] Re-enable rematerialization for streaming-mode-changing functions. (#83235)
We can add implicit defs/uses of the 'VG' register to the instructions
to prevent the register allocator from rematerializing values in between
streaming-mode changes, as the def/use of VG will further nail down the
ordering that comes out of ISel. This avoids the heavy-handed approach
to prevent any kind of rematerialization.

While we could add 'VG' as a Use to all SVE instructions, we only really
need to do this for instructions that are rematerializable, as the
smstart/smstop instructions and pseudos act as scheduling barriers which
is sufficient to prevent other instructions from being scheduled in
between the streaming-mode-changing call sequence. However, we may
revisit this in the future.
2024-02-29 15:35:46 +00:00
Lukacma
26402777eb
[AArch64] Optimized generated assembly for bool to svbool_t conversions (#83001)
In certain cases Legalizer was generating `AND(WHILELO, SPLAT 1)` instruction pattern, when `WHILELO` would be sufficient.
2024-02-28 16:45:39 +00:00
Sander de Smalen
41427b0e8e
[AArch64] Disable FastISel/GlobalISel for ZT0 state (#82768)
For __arm_new("zt0") we need to have special setup code in the prologue.
For calls that don't preserve zt0, we need to emit code to preserve ZT0
around the call.
This is only emitted by SelectionDAG ISel at the moment.
2024-02-28 10:42:16 +00:00
Nashe Mncube
744c0057e7
[AArch64][CodeGen] Fix crash when fptrunc returns fp16 with +nofp attr (#81724)
When performing lowering of the fptrunc opcode returning fp16 with the
+nofp flag enabled we could trigger a compiler crash. This is because we
had no custom lowering implemented. This patch handles
the case in which we need to promote an fp16 return type
for fptrunc when the +nofp attr is enabled.
2024-02-22 19:15:52 +00:00
Dinar Temirbulatov
5a023f564f
[AArch64][SVE2] Enable dynamic shuffle for fixed length types. (#72490)
When the SVE register size is unknown, or the minimum size is not equal to
the maximum size, we can determine the actual SVE register size at
runtime and adjust the shuffle mask at runtime.
2024-02-21 14:59:47 +00:00
David Green
6c84709eff
[AArch64] Materialize constants via fneg. (#80641)
This is something that is already done as a special case for copysign,
this patch extends it to be more generally applied. If we are trying to
materialize a negative constant (notably -0.0, 0x80000000), then there
may be no movi encoding that creates the immediate, but a fneg(movi)
might.

Some of the existing patterns for RADDHN needed to be adjusted to keep
them in line with the new immediates.
2024-02-14 13:55:51 +00:00
Usman Nadeem
44d85c5b15
[AArch64][SVE2] Use a PatFrag for URSHR (#81304)
Follow-up for #78374
2024-02-12 09:03:41 -08:00
Nikita Popov
92d7992205
[AArch64] Only apply bool vector bitcast opt if result is scalar (#81256)
This optimization tries to optimize bitcasts from `<N x i1>` to iN, but
currently also triggers for `<N x i1>` to `<M x iK>` bitcasts, if custom
lowering has been requested for these for an unrelated reason. Fix this
by explicitly checking that the result type is scalar.

Fixes https://github.com/llvm/llvm-project/issues/81216.
2024-02-12 10:00:34 +01:00
ostannard
5452cbc4a6
[AArch64] Indirect tail-calls cannot use x16 with pac-ret+pc (#81020)
When using -mbranch-protection=pac-ret+pc, x16 is used in the function
epilogue to hold the address of the signing instruction. This is used by
a HINT instruction which can only use x16, so we can't change this. This
means that we can't use it to hold the function pointer for an indirect
tail-call.

There is existing code to force indirect tail-calls to use x16 or x17
when BTI is enabled, so there are now 4 combinations:

bti  pac-ret+pc  Valid function pointer registers
off  off         Any non callee-saved register
on   off         x16 or x17
off  on          Any non callee-saved register except x16
on   on          x17
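
The four combinations above can be expressed as a small decision function. This is a hypothetical C helper that only mirrors the table; the enum and function names are assumptions and do not correspond to LLVM's API.

```c
#include <stdbool.h>

/* Hypothetical labels for the four rows of the table. */
enum TailCallPointerRegs {
    ANY_NON_CALLEE_SAVED,            /* bti off, pac-ret+pc off */
    X16_OR_X17,                      /* bti on,  pac-ret+pc off */
    ANY_NON_CALLEE_SAVED_EXCEPT_X16, /* bti off, pac-ret+pc on  */
    X17_ONLY                         /* bti on,  pac-ret+pc on  */
};

/* Map the two branch-protection settings to the allowed register set. */
enum TailCallPointerRegs valid_tail_call_regs(bool bti, bool pac_ret_pc) {
    if (bti && pac_ret_pc)
        return X17_ONLY;
    if (bti)
        return X16_OR_X17;
    if (pac_ret_pc)
        return ANY_NON_CALLEE_SAVED_EXCEPT_X16;
    return ANY_NON_CALLEE_SAVED;
}
```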
2024-02-08 15:31:54 +00:00
Rin Dobrescu
7f292b8fb1
[AArch64] Convert concat(uhadd(a,b), uhadd(c,d)) to uhadd(concat(a,c), concat(b,d)) (#80674)
We can convert concat(v4i16 uhadd(a,b), v4i16 uhadd(c,d)) to v8i16
uhadd(concat(a,c), concat(b,d)), which can lead to further
simplifications.
2024-02-06 11:02:06 +00:00
Fangrui Song
d4de4c3eaf
[AArch64] Support optional constant offset for constraint "S" (#80255)
Modify the initial implementation (https://reviews.llvm.org/D46745) to
support a constant offset so that the following code will compile:
```
int a[2][2];
void foo() { asm("// %0" :: "S"(&a[1][1])); }
```

We use the generic code path for "s". In GCC's aarch64 port, "S" is
supported for PIC while "s" isn't, making "s" less useful. We implement
"S" but not "s".

Similar to #80201 for RISC-V.
2024-02-02 10:33:09 -08:00
Matthew Devereau
d9c20e437f
[AArch64][SME] Implement inline-asm clobbers for za/zt0 (#79276)
This enables specifying "za" or "zt0" in the clobber list for inline asm.
This complies with the ACLE SME addition to the asm extension here:
https://github.com/ARM-software/acle/pull/276
2024-02-02 08:12:05 +00:00
Usman Nadeem
1d1432356e
[AArch64][SVE2] Generate urshr rounding shift rights (#78374)
Add a new node `AArch64ISD::URSHR_I_PRED`.

`srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` is transformed to
`urshr`, or to `rshrnb` (as before) if the result is truncated.

`uzp1(rshrnb(uunpklo(X),C), rshrnb(uunpkhi(X), C))` is converted to
`urshr(X, C)` (tested by the wide_trunc tests).

Pattern matching code in `canLowerSRLToRoundingShiftForVT` is taken
from prior code in rshrnb. It returns true if the add has NUW or if the
number of bits used in the return value allow us to not care about the
overflow (tested by rshrnb test cases).
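
The reason the overflow check matters can be seen in a scalar C sketch. It assumes 32-bit lanes and a shift amount between 1 and 31; the function names are hypothetical and the code only models the arithmetic, not the DAG combine.

```c
#include <stdint.h>

/* The matched pattern: srl(add(x, 1 << (shift - 1)), shift).
   In fixed-width arithmetic the add can wrap around. */
uint32_t rounding_shift_naive(uint32_t x, unsigned shift) {
    return (x + (1u << (shift - 1))) >> shift;
}

/* A rounding shift right that cannot overflow: shift first, then add
   back the most significant discarded bit as the rounding increment. */
uint32_t rounding_shift_exact(uint32_t x, unsigned shift) {
    return (x >> shift) + ((x >> (shift - 1)) & 1u);
}
```

The two agree whenever the add does not wrap, for example `x = 5, shift = 1` gives 3 in both. For `x = 0xFFFFFFFF, shift = 4` the naive form wraps and yields 0 while the exact form yields `0x10000000`, so the rewrite is only safe when the add is known not to overflow or the affected bits are unused.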
2024-01-31 14:03:58 -08:00
Rin Dobrescu
2907c63311
Revert "[AArch64] Convert concat(uhadd(a,b), uhadd(c,d)) to uhadd(concat(a,c), concat(b,d))" (#80157)
Reverts llvm/llvm-project#79464 while figuring out why the tests are
failing.
2024-01-31 16:45:25 +00:00
Rin Dobrescu
cf828aee24
[AArch64] Convert concat(uhadd(a,b), uhadd(c,d)) to uhadd(concat(a,c), concat(b,d)) (#79464)
We can convert concat(v4i16 uhadd(a,b), v4i16 uhadd(c,d)) to v8i16
uhadd(concat(a,c), concat(b,d)), which can lead to further
simplifications.
2024-01-31 12:52:12 +00:00
Sander de Smalen
dd73666182
[SME] Stop RA from coalescing COPY instructions that transcend beyond smstart/smstop. (#78294)
This patch introduces a 'COALESCER_BARRIER' which is a pseudo node that
expands to a 'nop', but which stops the register allocator from coalescing
a COPY node when its use/def crosses a SMSTART or SMSTOP instruction.

For example:

    %0:fpr64 = COPY killed $d0
    undef %2.dsub:zpr = COPY %0       // <- Do not coalesce this COPY
    ADJCALLSTACKDOWN 0, 0
    MSRpstatesvcrImm1 1, 0, csr_aarch64_smstartstop, implicit-def dead $d0
    $d0 = COPY killed %0
    BL @use_f64, csr_aarch64_aapcs

If the COPY would be coalesced, that would lead to:

    $d0 = COPY killed %0

being replaced by:

    $d0 = COPY killed %2.dsub

which means the whole ZPR reg would be live up to the call, causing the
MSRpstatesvcrImm1 (smstop) to spill/reload the ZPR register:

    str     q0, [sp]   // 16-byte Folded Spill
    smstop  sm
    ldr     z0, [sp]   // 16-byte Folded Reload
    bl      use_f64

which would be incorrect for two reasons:
1. The program may load more data than it has allocated.
2. If there are other SVE objects on the stack, the compiler might use
   the 'mul vl' addressing modes to access the spill location.

By disabling the coalescing, we get the desired results:

    str     d0, [sp, #8]  // 8-byte Folded Spill
    smstop  sm
    ldr     d0, [sp, #8]  // 8-byte Folded Reload
    bl      use_f64
2024-01-31 09:04:13 +00:00
Billy Laws
c761b4a5e4
[AArch64] Fix variadic tail-calls on ARM64EC (#79774)
ARM64EC varargs calls expect that x4 = sp at entry, special handling is
needed to ensure this with tail calls since they occur after the
epilogue and the x4 write happens before.

I tried going through AArch64MachineFrameLowering for this, hoping to
avoid creating the dummy object, but this was the best I could do since
the stack info that it uses isn't populated at this stage, and
CreateFixedObject also explicitly forbids 0-sized objects.
2024-01-30 18:32:15 -08:00
Florian Hahn
d1e162e5d9
[AArch64] Add custom lowering for load <3 x i8>. (#78632)
Add custom combine to lower load <3 x i8> as the more efficient sequence
below:
   ldrb wX, [x0, #2]
   ldrh wY, [x0]
   orr wX, wY, wX, lsl #16
   fmov s0, wX

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790
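
The value assembled by the sequence above can be modelled in C as follows. This is a sketch assuming little-endian byte order; the function names are hypothetical.

```c
#include <stdint.h>

/* Combine a 16-bit load from offset 0 with a byte load from offset 2
   into one 32-bit value, mirroring the ldrh + ldrb + orr sequence. */
uint32_t load_3_bytes(const uint8_t *p) {
    uint32_t lo = (uint32_t)p[0] | ((uint32_t)p[1] << 8); /* ldrh wY, [x0]           */
    uint32_t hi = p[2];                                   /* ldrb wX, [x0, #2]       */
    return lo | (hi << 16);                               /* orr wX, wY, wX, lsl #16 */
}

/* Returns 1 if three sample bytes are packed in the expected order. */
int load_3_bytes_demo(void) {
    const uint8_t bytes[3] = {0xAA, 0xBB, 0xCC};
    return load_3_bytes(bytes) == 0x00CCBBAAu;
}
```

Only three bytes are read, so the model never touches memory past the end of the `<3 x i8>` object.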
2024-01-30 14:04:27 +00:00
David Green
9520773c46
[AArch64] Don't generate neon integer complex numbers with +sve2. NFC (#79829)
The condition for allowing integer complex number support could also
allow neon fixed length complex numbers if +sve2 was specified. This
tightens the condition to only allow integer complex number support for
scalable vectors.

We could generalize this in the future to generate SVE intrinsics for
fixed-length vectors, but for the moment this opts for the simpler fix.
2024-01-29 16:46:22 +00:00
Kazu Hirata
8f8cab6b78 [llvm] Use Instruction::hasMetadata (NFC)
2024-01-27 22:20:22 -08:00
Eli Friedman
bee1557ffc [NFC][AArch64] Fix indentation.
2024-01-26 10:26:19 -08:00
Florian Hahn
eb678d8993
[AArch64] Combine store (trunc X to <3 x i8>) to sequence of ST1.b. (#78637)
Improve codegen for (trunc X to <3 x i8>) by converting it to a sequence
of 3 ST1.b, but first converting the truncate operand to either v8i8 or
v16i8, extracting the lanes for the truncate results and storing them.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790

PR: https://github.com/llvm/llvm-project/pull/78637
2024-01-25 18:28:44 +00:00
Nico Weber
184ca39529
[llvm] Move CodeGenTypes library to its own directory (#79444)
Finally addresses https://reviews.llvm.org/D148769#4311232 :)

No behavior change.
2024-01-25 12:01:31 -05:00
Eli Friedman
a6065f0fa5
Arm64EC entry/exit thunks, consolidated. (#79067)
This combines the previously posted patches with some additional work
I've done to more closely match MSVC output.

Most of the important logic here is implemented in
AArch64Arm64ECCallLowering. The purpose of the
AArch64Arm64ECCallLowering is to take "normal" IR we'd generate for
other targets, and generate most of the Arm64EC-specific bits:
generating thunks, mangling symbols, generating aliases, and generating
the .hybmp$x table. This is all done late for a few reasons: to
consolidate the logic as much as possible, and to ensure the IR exposed
to optimization passes doesn't contain complex arm64ec-specific
constructs.

The other changes are supporting changes, to handle the new constructs
generated by that pass.

There's a global llvm.arm64ec.symbolmap representing the .hybmp$x
entries for the thunks. This gets handled directly by the AsmPrinter
because it needs symbol indexes that aren't available before that.

There are two new calling conventions used to represent calls to and
from thunks: ARM64EC_Thunk_X64 and ARM64EC_Thunk_Native. There are a few
changes to handle the associated exception-handling info,
SEH_SaveAnyRegQP and SEH_SaveAnyRegQPX.

I've intentionally left out handling for structs with small
non-power-of-two sizes, because that's easily separated out. The rest of
my current work is here. I squashed my current patches because they were
split in ways that didn't really make sense. Maybe I could split out
some bits, but it's hard to meaningfully test most of the parts
independently.

Thanks to @dpaoliello for extensive testing and suggestions.

(Originally posted as https://reviews.llvm.org/D157547 .)
2024-01-22 21:28:07 -08:00
Rin Dobrescu
365aa1574a
[AArch64] Convert UADDV(add(zext, zext)) into UADDLV(concat). (#78301)
We can convert a UADDV(add(zext(64-bit source), zext(64-bit source)))
into UADDLV(concat), where the concat represents the 64-bit zext
sources.
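
The equivalence can be checked with a scalar C model. This sketch assumes v8i8 sources widened to 16-bit lanes; the function names are hypothetical and only the arithmetic is modelled.

```c
#include <stdint.h>
#include <string.h>

/* UADDV(add(zext(a), zext(b))): widen each byte, add lane-wise, reduce. */
uint16_t uaddv_of_add_zext(const uint8_t a[8], const uint8_t b[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        uint16_t lane = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
        sum = (uint16_t)(sum + lane);
    }
    return sum;
}

/* UADDLV(concat(a, b)): one widening sum across all sixteen bytes. */
uint16_t uaddlv_of_concat(const uint8_t a[8], const uint8_t b[8]) {
    uint8_t cat[16];
    memcpy(cat, a, 8);
    memcpy(cat + 8, b, 8);
    uint16_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum = (uint16_t)(sum + cat[i]);
    return sum;
}

/* Returns 1 if both formulations give the expected sum on sample data. */
int uaddlv_demo(void) {
    const uint8_t a[8] = {255, 255, 255, 255, 255, 255, 255, 255};
    const uint8_t b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    return uaddv_of_add_zext(a, b) == 2076 && uaddlv_of_concat(a, b) == 2076;
}
```

Sixteen bytes sum to at most 4080, so the 16-bit result cannot overflow in either form.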
2024-01-22 11:59:40 +00:00
Kerry McLaughlin
a8a3711e74
[AArch64][SME2] Preserve ZT0 state around function calls (#78321)
If a function has ZT0 state and calls a function which does not
preserve ZT0, the caller must save and restore ZT0 around the call.
If the caller shares ZT0 state and the callee is not shared ZA, we must
additionally call SMSTOP/SMSTART ZA around the call.

This patch adds new AArch64ISDNodes for spilling & filling ZT0.
Where requiresPreservingZT0 is true, ZT0 state will be preserved
across a call.
2024-01-20 12:06:00 +00:00