Fold BICi if all destination bits are already known to be zeroes
```llvm
define <8 x i16> @haddu_known(<8 x i8> %a0, <8 x i8> %a1) {
%x0 = zext <8 x i8> %a0 to <8 x i16>
%x1 = zext <8 x i8> %a1 to <8 x i16>
%hadd = call <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16> %x0, <8 x i16> %x1)
%res = and <8 x i16> %hadd, <i16 511, i16 511, i16 511, i16 511,i16 511, i16 511, i16 511, i16 511>
ret <8 x i16> %res
}
declare <8 x i16> @llvm.aarch64.neon.uhadd.v8i16(<8 x i16>, <8 x i16>)
```
```
haddu_known: // @haddu_known
ushll v0.8h, v0.8b, #0
ushll v1.8h, v1.8b, #0
uhadd v0.8h, v0.8h, v1.8h
bic v0.8h, #254, lsl #8 <-- this one will be removed as we know high bits are zero extended
ret
```
Fixes#53881Fixes#53622
While attempting to build some Rust code, I was getting linker errors
due to missing functions that are implemented in `compiler-rt`. Turns
out that when `compiler-rt` is built for Arm64EC, all its function names
are mangled with the leading `#`.
This change removes the hard-coded list of library-implemented
intrinsics to mangle for Arm64EC, and instead assumes that they all must
be mangled.
The previous URL was stale, and referenced 'master' instead of 'main',
which will never be updated.
Reviewers: topperc, enh-google
Reviewed By: enh-google
Pull Request: https://github.com/llvm/llvm-project/pull/87726
CallSiteInfo is originally used only for argument - register pairs. Make
it struct, in which we can store additional data for call sites.
Also, the variables/methods used for CallSiteInfo are named for its
original use case, e.g., CallFwdRegsInfo. Refactor these for the
upcoming
use, e.g. addCallArgsForwardingRegs() -> addCallSiteInfo().
An upcoming patch will add type ids for indirect calls to propogate them
from
middle-end to the back-end. The type ids will be then used to emit the
call
graph section.
Original RFC:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151044.html
Updated RFC:
https://lists.llvm.org/pipermail/llvm-dev/2021-July/151739.html
Differential Revision: https://reviews.llvm.org/D107109?id=362888
Co-authored-by: Necip Fazil Yildiran <necip@google.com>
Similar to how we protected FP/fixed-vector arguments and results from
calls, we should do the same for arguments/results from locally-streaming
functions such that those are not spilled/filled as ZPR registers.
This may cause a small regression (additional spills/fills), which is
addressed by #85386.
This extends the concat load patch from
https://reviews.llvm.org/D121400, which was later moved to a combine, to
handle v2i8 and v2i16 concat loads too.
Adds a second parameter (default to 0) to isLegalAddImmediate, to
represent a scalable immediate.
Extends the AArch64 implementation to match immediates based on what addvl and inc[h|w|d] support.
This patch makes NEON-enabled AArch64 backend expand the `sin, cos, pow,
log, log2, log10, exp, exp2, exp10` intrinsics for `v1f64` data type,
all of which caused selection failure before this patch.
Fixes https://github.com/llvm/llvm-project/issues/83729
This is just a bit of cleanup to make the pseudo/code easier to
understand. This is based on the observation that we only need to pass
in a runtime value for 'pstate' if is actually needed for generating a
runtime check.
There were two existing patterns:
`concat_vectors(trunc(x), trunc(y)) -> uzp1(x, y)`
`concat_vectors(assertzext(trunc(x)), assertzext(trunc(y))) -> uzp1(x,
y)`
Move them into a class and add the following `assertsext` pattern to it:
`concat_vectors(assertsext(trunc(x)), assertsext(trunc(y))) -> uzp1(x,
y)`
Add the following transform for v8i8 and v4i16 result types to help with
pattern matching:
`truncating uzp1(x, y) -> trunc(concat(x, y))`
And a pattern to go with it:
`trunc(concat_vectors(x, y)) -> uzp1 (x, y)`
Add another isel pattern for v8i8 and v4i16 result vector types, similar
to
the existing concat pattern, but with a trunc node in the begining:
`trunc(concat_vectors(assertext_trunc(x), assertext_trunc(y))) ->
xtn(uzp1(x, y))`
v8f16 is a legal type but promoting to v16f16 would result in an illegal
type.
Let's legalize these by a combination of splitting+promoting resulting
in a pair of v4f16.
Also, we were being overly cautious with different v4f16 nodes. Mark
more of them safe to promote to v4f32.
There were a number of issues that needed to be addressed:
- i64 to bf16 did not correctly round
- strict rounding needed to yield a chain
- fastisel did not have logic to bail on bf16
Clang sets the nonlazybind attribute for certain ObjC features. The
AArch64 SelectionDAG implementation for non-intrinsic calls
(46e36f0953aabb5e5cd00ed8d296d60f9f71b424) is behind a cl option.
GCC implements -fno-plt for a few ELF targets. In Clang, -fno-plt also
sets the nonlazybind attribute. For SelectionDAG, make the cl option not
affect ELF so that non-intrinsic calls to a dso_preemptable function use
GOT. Adjust AArch64TargetLowering::LowerCall to handle intrinsic calls.
For FastISel, change `fastLowerCall` to bail out when a call is due to
-fno-plt.
For GlobalISel, handle non-intrinsic calls in CallLowering::lowerCall
and intrinsic calls in AArch64CallLowering::lowerCall (where the
target-independent CallLowering::lowerCall is not called).
The GlobalISel test in `call-rv-marker.ll` is therefore updated.
Note: the current -fno-plt -fpic implementation does not use GOT for a
preemptable function.
Link: #78275
Pull Request: https://github.com/llvm/llvm-project/pull/78890
While we don't use SVE2 as a fallback for missing NEON instructions for
BF16, it is confusing to break symmetry with fp16.
While we are here, add a comment explaining how BF16 immediates work.
We can use a small amount of integer arithmetic to round FP32 to BF16
and extend BF16 to FP32.
While a number of operations still require promotion, this can be
reduced for some rather simple operations like abs, copysign, fneg but
these can be done in a follow-up.
A few neat optimizations are implemented:
- round-inexact-to-odd is used for F64 to BF16 rounding.
- quieting signaling NaNs for f32 -> bf16 tries to detect if a prior
operation makes it unnecessary.
We can add implicit defs/uses of the 'VG' register to the instructions
to prevent the register allocator from rematerializing values in between
streaming-mode changes, as the def/use of VG will further nail down the
ordering that comes out of ISel. This avoids the heavy-handed approach
to prevent any kind of rematerialization.
While we could add 'VG' as a Use to all SVE instructions, we only really
need to do this for instructions that are rematerializable, as the
smstart/smstop instructions and pseudos act as scheduling barriers which
is sufficient to prevent other instructions from being scheduled in
between the streaming-mode-changing call sequence. However, we may
revisit this in the future.
For __arm_new("zt0") we need to have special setup code in the prologue.
For calls that don't preserve zt0, we need to emit code preserve ZT0
around the call.
This is only emitted by SelectionDAG ISel at the moment.
When performing lowering of the fptrunc opcode returning fp16 with the
+nofp flag enabled we could trigger a compiler crash. This is because we
had no custom lowering implemented. This patch
the case in which we need to promote an fp16 return type
for fptrunc when the +nofp attr is enabled.
When SVE register size is unknown or the minimal size is not equal to
the maximum size then we could determine the actual SVE register size in
the runtime and adjust shuffle mask in the runtime.
This is something that is already done as a special case for copysign,
this patch extends it to be more generally applied. If we are trying to
matrialize a negative constant (notably -0.0, 0x80000000), then there
may be no movi encoding that creates the immediate, but a fneg(movi)
might.
Some of the existing patterns for RADDHN needed to be adjusted to keep
them in line with the new immediates.
This optimization tries to optimize bitcasts from `<N x i1>` to iN, but
currently also triggers for `<N x i1>` to `<M x iK>` bitcasts, if custom
lowering has been requested for these for an unrelated reason. Fix this
by explicitly checking that the result type is scalar.
Fixes https://github.com/llvm/llvm-project/issues/81216.
When using -mbranch-protection=pac-ret+pc, x16 is used in the function
epilogue to hold the address of the signing instruction. This is used by
a HINT instruction which can only use x16, so we can't change this. This
means that we can't use it to hold the function pointer for an indirect
tail-call.
There is existing code to force indirect tail-calls to use x16 or x17
when BTI is enabled, so there are now 4 combinations:
bti pac-ret+pc Valid function pointer registers
off off Any non callee-saved register
on off x16 or x17
off on Any non callee-saved register except x16
on on x17
Modify the initial implementation (https://reviews.llvm.org/D46745) to
support a constant offset so that the following code will compile:
```
int a[2][2];
void foo() { asm("// %0" :: "S"(&a[1][1])); }
```
We use the generic code path for "s". In GCC's aarch64 port, "S" is
supported for PIC while "s" isn't, making "s" less useful. We implement
"S" but not "s".
Similar to #80201 for RISC-V.
This enables specifing "za" or "zt0" to the clobber list for inline asm.
This complies with the acle SME addition to the asm extension here:
https://github.com/ARM-software/acle/pull/276
Add a new node `AArch64ISD::URSHR_I_PRED`.
`srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` is transformed to
`urshr`, or to `rshrnb` (as before) if the result it truncated.
`uzp1(rshrnb(uunpklo(X),C), rshrnb(uunpkhi(X), C))` is converted to
`urshr(X, C)` (tested by the wide_trunc tests).
Pattern matching code in `canLowerSRLToRoundingShiftForVT` is taken
from prior code in rshrnb. It returns true if the add has NUW or if the
number of bits used in the return value allow us to not care about the
overflow (tested by rshrnb test cases).
This patch introduces a 'COALESCER_BARRIER' which is a pseudo node that
expands to
a 'nop', but which stops the register allocator from coalescing a COPY
node when
its use/def crosses a SMSTART or SMSTOP instruction.
For example:
%0:fpr64 = COPY killed $d0
undef %2.dsub:zpr = COPY %0 // <- Do not coalesce this COPY
ADJCALLSTACKDOWN 0, 0
MSRpstatesvcrImm1 1, 0, csr_aarch64_smstartstop, implicit-def dead $d0
$d0 = COPY killed %0
BL @use_f64, csr_aarch64_aapcs
If the COPY would be coalesced, that would lead to:
$d0 = COPY killed %0
being replaced by:
$d0 = COPY killed %2.dsub
which means the whole ZPR reg would be live upto the call, causing the
MSRpstatesvcrImm1 (smstop) to spill/reload the ZPR register:
str q0, [sp] // 16-byte Folded Spill
smstop sm
ldr z0, [sp] // 16-byte Folded Reload
bl use_f64
which would be incorrect for two reasons:
1. The program may load more data than it has allocated.
2. If there are other SVE objects on the stack, the compiler might use
the
'mul vl' addressing modes to access the spill location.
By disabling the coalescing, we get the desired results:
str d0, [sp, #8] // 8-byte Folded Spill
smstop sm
ldr d0, [sp, #8] // 8-byte Folded Reload
bl use_f64
ARM64EC varargs calls expect that x4 = sp at entry, special handling is
needed to ensure this with tail calls since they occur after the
epilogue and the x4 write happens before.
I tried going through AArch64MachineFrameLowering for this, hoping to
avoid creating the dummy object but this was the best I could do since
the stack info that uses isn't populated at this stage,
CreateFixedObject also explicitly forbids 0 sized objects.
Add custom combine to lower load <3 x i8> as the more efficient sequence
below:
ldrb wX, [x0, #2]
ldrh wY, [x0]
orr wX, wY, wX, lsl #16
fmov s0, wX
At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790
The condition for allowing integer complex number support could also
allow neon fixed length complex numbers if +sve2 was specified. This
tightens the condition to only allow integer complex number support for
scalable vectors.
We could generalize this in the future to generate SVE intrinsics for
fixed-length vectors, but for the moment this opts for the simpler fix.
Improve codegen for (trunc X to <3 x i8>) by converting it to a sequence
of 3 ST1.b, but first converting the truncate operand to either v8i8 or
v16i8, extracting the lanes for the truncate results and storing them.
At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790
PR: https://github.com/llvm/llvm-project/pull/78637
This combines the previously posted patches with some additional work
I've done to more closely match MSVC output.
Most of the important logic here is implemented in
AArch64Arm64ECCallLowering. The purpose of the
AArch64Arm64ECCallLowering is to take "normal" IR we'd generate for
other targets, and generate most of the Arm64EC-specific bits:
generating thunks, mangling symbols, generating aliases, and generating
the .hybmp$x table. This is all done late for a few reasons: to
consolidate the logic as much as possible, and to ensure the IR exposed
to optimization passes doesn't contain complex arm64ec-specific
constructs.
The other changes are supporting changes, to handle the new constructs
generated by that pass.
There's a global llvm.arm64ec.symbolmap representing the .hybmp$x
entries for the thunks. This gets handled directly by the AsmPrinter
because it needs symbol indexes that aren't available before that.
There are two new calling conventions used to represent calls to and
from thunks: ARM64EC_Thunk_X64 and ARM64EC_Thunk_Native. There are a few
changes to handle the associated exception-handling info,
SEH_SaveAnyRegQP and SEH_SaveAnyRegQPX.
I've intentionally left out handling for structs with small
non-power-of-two sizes, because that's easily separated out. The rest of
my current work is here. I squashed my current patches because they were
split in ways that didn't really make sense. Maybe I could split out
some bits, but it's hard to meaningfully test most of the parts
independently.
Thanks to @dpaoliello for extensive testing and suggestions.
(Originally posted as https://reviews.llvm.org/D157547 .)
If a function has ZT0 state and calls a function which does not
preserve ZT0, the caller must save and restore ZT0 around the call.
If the caller shares ZT0 state and the callee is not shared ZA, we must
additionally call SMSTOP/SMSTART ZA around the call.
This patch adds new AArch64ISDNodes for spilling & filling ZT0.
Where requiresPreservingZT0 is true, ZT0 state will be preserved
across a call.