This re-applies #96164 after revert in #102434.
Support the following relocations and assembly operators:
- `R_AARCH64_AUTH_ADR_GOT_PAGE` (`:got_auth:` for `adrp`)
- `R_AARCH64_AUTH_LD64_GOT_LO12_NC` (`:got_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_GOT_ADD_LO12_NC` (`:got_auth_lo12:` for `add`)
A `LOADgotAUTH` pseudo-instruction is introduced, which is later expanded to an
actual instruction sequence like the following:
```
adrp x16, :got_auth:sym
add x16, x16, :got_auth_lo12:sym
ldr x0, [x16]
autia x0, x16
```
If a resign is requested, as in the example below, the `LOADgotPAC` pseudo is
used, and the GOT load is lowered similarly to `LOADgotAUTH`.
```
@var = global i32 0
define ptr @resign_globalvar() {
ret ptr ptrauth (ptr @var, i32 3, i64 43)
}
```
If the FPAC bit is not set and an auth instruction is emitted, a check+trap
sequence similar to the one used for the `AUT` pseudo is emitted to ensure auth
success.
Both SelectionDAG and GlobalISel are supported.
For FastISel, we fall back to SelectionDAG.
Tests starting with 'ptrauth-' have corresponding variants without this prefix.
See also the specification:
https://github.com/ARM-software/abi-aa/blob/main/pauthabielf64/pauthabielf64.rst#appendix-signed-got
This shares most of its code with the scalar sincos expansion. It allows
expanding vector FSINCOS nodes to a library call from the specified
`-vector-library`. The upside is that the vectorizer only needs to
handle the sincos intrinsic, which has no memory effects, while this
expansion handles lowering the intrinsic to a call that takes output
pointers.
This teaches dagcombiner to fold:
`(asr (add nsw x, y), 1) -> (avgfloors x, y)`
`(lsr (add nuw x, y), 1) -> (avgflooru x, y)`
as well as to combine them into a ceil variant:
`(avgfloors (add nsw x, y), 1) -> (avgceils x, y)`
`(avgflooru (add nuw x, y), 1) -> (avgceilu x, y)`
iff valid for the target.
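As a sanity check on why the `nsw`/`nuw` flags matter, here is a minimal Python sketch that exhaustively checks the folds at 8 bits. It is not the DAGCombiner code; the helper names and the overflow-free formulas used to model the avg nodes are assumptions chosen for illustration.

```python
BITS = 8  # small width so every input pair can be checked exhaustively


def wrap_signed(value, bits=BITS):
    """Wrap an unbounded integer to a two's-complement value of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def wrap_unsigned(value, bits=BITS):
    """Wrap an unbounded integer to an unsigned value of `bits` bits."""
    return value & ((1 << bits) - 1)


def avgfloor(x, y):
    # Overflow-free floor average: x + y == 2 * (x & y) + (x ^ y).
    return (x & y) + ((x ^ y) >> 1)


def avgceil(x, y):
    # Overflow-free ceiling average: x + y == 2 * (x | y) - (x ^ y).
    return (x | y) - ((x ^ y) >> 1)


def folds_hold(lo, hi, wrap):
    """Check both folds for every pair in [lo, hi] whose add does not wrap."""
    for x in range(lo, hi + 1):
        for y in range(lo, hi + 1):
            total = wrap(x + y)
            if total != x + y:
                continue  # the add wrapped, so the nsw/nuw precondition fails
            if total >> 1 != avgfloor(x, y):      # shift of add -> avgfloor
                return False
            if avgfloor(total, 1) != avgceil(x, y):  # avgfloor(add, 1) -> avgceil
                return False
    return True


signed_ok = folds_hold(-(1 << (BITS - 1)), (1 << (BITS - 1)) - 1, wrap_signed)
unsigned_ok = folds_hold(0, (1 << BITS) - 1, wrap_unsigned)

# Without nsw the fold would be wrong: 127 + 1 wraps to -128 in 8 bits.
wrapped_shift = wrap_signed(127 + 1) >> 1
```

When the add wraps, the shifted result (`wrapped_shift`) and `avgfloor(127, 1)` disagree, which is the case the no-wrap flags rule out.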
This also removes some of the ARM MVE patterns that are now dead code.
It adds the avg opcodes to `IsQRMVEInstruction` so as to preserve the
immediate splatting as before.
SVE2 adds the constructive splice instruction, which takes a tuple.
Even though the register allocator must ensure that the tuple uses
consecutive registers, it's likely to be more efficient than using the
destructive splice instruction when the first operand is reused.
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
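The intended semantics can be modelled in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the per-lane loop stands in for what a vector type would compute, not for how the legalizer actually splits or widens types.

```python
import math


def sincos(value):
    """Scalar model: return (sin, cos) as a pair, like the struct result."""
    return (math.sin(value), math.cos(value))


def sincos_vector(values):
    """Vector model: apply the scalar operation per lane.

    Returns two lists (all sines, all cosines), mirroring a
    { <N x float>, <N x float> } result.
    """
    pairs = [sincos(v) for v in values]
    return ([s for s, _ in pairs], [c for _, c in pairs])
```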
As part of FEAT_PAuthLR, a new DWARF Frame Instruction was introduced,
`DW_CFA_AARCH64_negate_ra_state_with_pc`. This instructs Libunwind that
the PC has been used with the signing instruction. This change includes
three commits:
- Libunwind support for the newly introduced DWARF Instruction
- CodeGen support for the DWARF Instruction
- Reversing the changes made in #96377.
`DW_CFA_AARCH64_negate_ra_state_with_pc` must be placed immediately
after the signing instruction, so the CFI instruction location would
not be consistent with the location generated when not using
FEAT_PAuthLR. The commit reverses the changes and makes the location
consistent across the different branch protection options.
While this does have a code size effect, it is negligible.
For the ABI information, see here:
853286c7ab/aadwarf64/aadwarf64.rst (id23)
Some tests contain errors in constrained intrinsic usage, such as missing
or extra type parameters, wrong type parameter order, and others.
---------
Co-authored-by: Andy Kaylor <andy_kaylor@yahoo.com>
Credits: https://github.com/llvm/llvm-project/pull/72976
LLVM ERROR: cannot select: %3:zpr(<vscale x 2 x s64>) = G_MUL %0:fpr,
%1:fpr (in function: xmulnxv2i64)
;; mul
define void @xmulnxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b,
ptr %p) {
entry:
%c = mul <vscale x 2 x i64> %a, %b
store <vscale x 2 x i64> %c, ptr %p, align 16
ret void
}
define void @mulnxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
ptr %p) {
entry:
%c = mul <vscale x 4 x i32> %a, %b
store <vscale x 4 x i32> %c, ptr %p, align 16
ret void
}
define void @mulnxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b,
ptr %p) {
entry:
%c = mul <vscale x 8 x i16> %a, %b
store <vscale x 8 x i16> %c, ptr %p, align 16
ret void
}
define void @mulnxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b,
ptr %p) {
entry:
%c = mul <vscale x 16 x i8> %a, %b
store <vscale x 16 x i8> %c, ptr %p, align 16
ret void
}
Both are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, defined by the `MachineLICM`
and `EarlyMachineLICM` classes.
The `PreRegAlloc` flag used to be overwritten based on `MRI.isSSA()`,
which is unreliable due to how it is inferred by the MIRParser. I see
that we can now define isSSA in MIR (thanks @gargaroff), meaning the
fix isn’t really needed anymore, but redefining that flag still feels
wrong.
Note that I'm looking into upstreaming more changes to MachineLICM, see
[the discourse
thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- `VecFuncs.def`: define intrinsic to sleef/armpl mapping
- `LegalizerHelper.cpp`: add missing fewerElementsVector handling for
the new atan2 intrinsic
- `AArch64ISelLowering.cpp`: add AArch64 specializations for lowering,
such as Neon instructions
- `AArch64LegalizerInfo.cpp`: legalize atan2.
Part 5 of "Implement the atan2 HLSL Function" (#70096).
This fixes an issue where the compiler runs into an assertion failure
for the following example:
register svcount_t pred asm("pn8") = svptrue_c8();
asm("ld1w { z0.s, z4.s, z8.s, z12.s }, %[pred]/z, [x0]\n"
:
: [pred] "Uph" (pred)
: "memory", "cc");
Here the register constraint that ends up in the LLVM IR is "{pn8}", but
the code in `TargetRegisterInfo::getRegForInlineAsmConstraint` that
parses that string follows a path where it queries a suitable register
class for this register (<=> PPRorPNR regclass), for which it then
chooses `nxv16i1` as a suitable type. These choices are individually
correct, but the combined result isn't, because the type should be
`aarch64svcount`.
This then results in issues later on in SelectionDAGBuilder.cpp in
CopyToReg because the type of the actual value and the computed type
from the constraint don't match.
This PR pre-empts this issue by parsing the predicate explicitly and
returning the correct register class.
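The idea of parsing the predicate register name explicitly can be sketched as follows. This is a hypothetical Python model, not the code in the patch: the function name, the returned tuple shape, and the simplification that every `p` register maps to `nxv16i1` are assumptions made for illustration.

```python
import re


def parse_predicate_constraint(constraint):
    """Map "{pN}" or "{pnN}" to (register class, value type, index).

    Returns None for anything else so the caller can fall back to the
    generic constraint handling.
    """
    match = re.fullmatch(r"\{(pn|p)(\d+)\}", constraint)
    if match is None:
        return None
    kind, index = match.group(1), int(match.group(2))
    if index > 15:
        return None  # SVE has predicate registers p0-p15 / pn0-pn15
    if kind == "pn":
        # Predicate-as-counter registers carry the svcount type.
        return ("PNR", "aarch64svcount", index)
    return ("PPR", "nxv16i1", index)
```

With this split, `"{pn8}"` yields the counter type directly instead of going through a shared register class and picking a mask type.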
When compiling for an SVE target we can use INDEX to generate constant
fixed-length step vectors, e.g.:
```
uint32x4_t foo() {
return (uint32x4_t){0, 1, 2, 3};
}
```
Currently:
```
foo():
adrp x8, .LCPI1_0
ldr q0, [x8, :lo12:.LCPI1_0]
ret
```
With INDEX:
```
foo():
index z0.s, #0, #1
ret
```
The logic for this was already in `LowerBUILD_VECTOR`, though it was
hidden under a check for `!Subtarget->isNeonAvailable()`. This patch
refactors this to enable the corresponding code path unconditionally for
constant step vectors (as long as we can use SVE for them).
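The matching condition can be illustrated with a small Python sketch. It is not the `LowerBUILD_VECTOR` logic: the function names are hypothetical, and it only models the immediate form of INDEX, whose start and step operands are signed 5-bit values; whether the patch applies exactly this range check is an assumption.

```python
def match_step_vector(elements):
    """Return (start, step) if elements form an arithmetic sequence, else None."""
    if len(elements) < 2:
        return None
    start, step = elements[0], elements[1] - elements[0]
    for lane, value in enumerate(elements):
        if value != start + lane * step:
            return None
    return start, step


def fits_index_immediates(start, step):
    # The immediate form of SVE INDEX encodes both operands in [-16, 15].
    return -16 <= start <= 15 and -16 <= step <= 15


def select_lowering(elements):
    """Pick a lowering for a constant vector of 32-bit lanes."""
    match = match_step_vector(elements)
    if match is not None and fits_index_immediates(*match):
        return "index z0.s, #%d, #%d" % match
    return "constant-pool load"
```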
Before this patch, a redundant COPY couldn't be removed in the following
case:
```
$R0 = OP ...
... // Read of $R0
$R1 = COPY killed $R0
```
This patch adds support for tracking the users of the source register
during backward propagation, so that we can remove the redundant COPY in
the above case and optimize it to:
```
$R1 = OP ...
... // Replace all uses of $R0 with $R1
```
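The transformation can be modelled on a toy instruction list. The sketch below is an assumption-laden illustration, not the MachineCopyPropagation code: instructions are `(dest, opcode, sources)` tuples, every COPY source is assumed to be killed by the COPY, and all names are hypothetical.

```python
def propagate_copy_backward(instrs):
    """Remove `dst = COPY src` by renaming src to dst earlier in the block.

    instrs is a list of (dest, opcode, sources) tuples for one basic block.
    The source of every COPY is assumed to be killed by that COPY.
    """
    result = list(instrs)
    i = 0
    while i < len(result):
        dest, opcode, sources = result[i]
        if opcode != "COPY":
            i += 1
            continue
        src = sources[0]
        # Walk backward to the definition of src, giving up if dst is
        # read or written on the way (renaming would then be unsafe).
        def_index = None
        for j in range(i - 1, -1, -1):
            d, _, s = result[j]
            if d == src:
                def_index = j
                break
            if d == dest or dest in s:
                break
        if def_index is None:
            i += 1
            continue
        # Rename the definition, then every tracked read between it and the COPY.
        d, op, s = result[def_index]
        result[def_index] = (dest, op, s)
        for j in range(def_index + 1, i):
            d, op, s = result[j]
            result[j] = (d, op, tuple(dest if r == src else r for r in s))
        del result[i]  # the COPY is now redundant
    return result


block = [
    ("R0", "OP", ("R2",)),
    ("R3", "USE", ("R0",)),
    ("R1", "COPY", ("R0",)),
]
optimized = propagate_copy_backward(block)
```

The intervening read of `R0` is what previously blocked the optimization; tracking it lets the read be rewritten along with the definition.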
With this change and appropriate linker changes
(https://r.android.com/3236256)
AOSP boots with memtag-global throughout the platform.
Without this change, we would sometimes generate PC-relative references
to tagged globals, which then do not have the proper tag.
Change the spill weight calculations for `optsize` functions to remove
the block frequency multiplier. For those functions, we do not want to
consider the runtime cost of spilling, only the codesize cost.
I built a large app once with the basic and once with the greedy
(default) register allocator enabled.
| Regalloc Type | Uncompressed Size Delta | Compressed Size Delta |
| - | - | - |
| Basic | -303.8 KiB (-0.23%) | -232.0 KiB (-0.39%) |
| Greedy | 159.1 KiB (0.12%) | 130.1 KiB (0.22%) |
Since I only saw a size win with the basic register allocator, I decided
to only change the behavior for that type.
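To illustrate the effect, here is a simplified Python model of a per-instruction spill weight contribution. It is a sketch under assumptions: the function name and parameters are hypothetical, and the real calculation includes further factors (such as hints and normalization) that are omitted here.

```python
def spill_weight(num_defs, num_uses, block_frequency, optimize_for_size):
    """Contribution of one instruction to a virtual register's spill weight."""
    accesses = num_defs + num_uses
    if optimize_for_size:
        # Every spill or reload costs one instruction, however hot the block is.
        return float(accesses)
    return accesses * block_frequency


# One use inside a hot loop versus three accesses in cold code.
hot_speed = spill_weight(0, 1, 100.0, optimize_for_size=False)
cold_speed = spill_weight(1, 2, 1.0, optimize_for_size=False)
hot_size = spill_weight(0, 1, 100.0, optimize_for_size=True)
cold_size = spill_weight(1, 2, 1.0, optimize_for_size=True)
```

In this model the hot register looks expensive to spill when optimizing for speed, but cheap when optimizing for size, because spilling it adds only one instruction.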
With the truncssat nodes these are relatively simple tablegen patterns to add.
The existing intrinsics are converted to shift+truncsat so they can lower using
the new patterns.
Fixes #112925.
This allows lowering fixed-length (non-constant) BUILD_VECTORS (<=
128-bit) to a chain of ZIP1 instructions when Neon is not available,
rather than using the default lowering, which is to spill to the stack
and reload.
For example,
```
t5: v4f32 = BUILD_VECTOR(t0, t1, t2, t3)
```
Becomes:
```
zip1 z0.s, z0.s, z1.s // z0 = t0,t1,...
zip1 z2.s, z2.s, z3.s // z2 = t2,t3,...
zip1 z0.d, z0.d, z2.d // z0 = t0,t1,t2,t3,...
```
When values are already in FPRs, this generally seems to lead to a more
compact output with less movement to/from the stack.
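The lane movement of the ZIP1 chain can be traced with a short Python model. This is only a sketch: it models the low four 32-bit lanes of each register as a list, `None` marks don't-care lanes, and the helper names are hypothetical.

```python
def zip1(a, b, group=1):
    """Interleave the low halves of a and b.

    `group` is the number of 32-bit lanes per element: 1 models `.s`,
    2 models `.d`.
    """
    count = len(a) // group
    elems_a = [a[i * group:(i + 1) * group] for i in range(count)]
    elems_b = [b[i * group:(i + 1) * group] for i in range(count)]
    out = []
    for i in range(count // 2):
        out.extend(elems_a[i])
        out.extend(elems_b[i])
    return out


def build_vector(t0, t1, t2, t3):
    # Each scalar starts in lane 0 of its own register.
    z0, z1, z2, z3 = ([t, None, None, None] for t in (t0, t1, t2, t3))
    z0 = zip1(z0, z1)           # zip1 z0.s, z0.s, z1.s -> t0, t1, ...
    z2 = zip1(z2, z3)           # zip1 z2.s, z2.s, z3.s -> t2, t3, ...
    return zip1(z0, z2, group=2)  # zip1 z0.d, z0.d, z2.d -> t0, t1, t2, t3
```

Each level of the chain doubles the element size, so N scalars need log2(N) levels of ZIP1.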
This helps clear up some of the legalization artefacts. Not all of the
cast_combines are added (notably select combines) as they currently have
questionable benefit in the test updates.
In #99726, `-fptrauth-type-info-vtable-pointer-discrimination` was
introduced, which is intended to enable type and address discrimination
for type_info vtable pointers. However, some codegen logic for actually
enabling address discrimination was missing. This patch addresses the
issue.
Fixes #101716
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
Similar to #111287, this moves the UseOutlineAtomics legalization rules to a
boolean predicate as opposed to needing to be nested functions.
There appeared to be a pair of redundant customIfs for s128 sizes (assuming
only scalars are supported).
Under AArch64 it is common, and will become more common, to have operation
legalization rules dependent on a feature of the architecture, for
example HasFP16 or the newer CSSC integer min/max instructions, among
many others. With the current legalization rules this either means
adding a custom predicate based on the feature as in
`legalIf([=](const LegalityQuery &Query) { return HasFP16 && ...; })` or
splitting the legalization rules into pieces that place rules optionally
into them based on the features available.
This patch proposes an alternative where the existing routines like
legalFor(..) are provided a boolean predicate, which if false skips
adding the rule. It makes the rules cleaner and will hopefully allow
them to scale better as we add more features.
I have changed the SVE predicates for loads/stores to just be always
available. Scalable vectors without SVE have never been supported, but
a condition could also be added.
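The shape of the proposal can be shown with a toy rule builder in Python. This is not the GlobalISel API: the `RuleSet` class, the `when` keyword, and the type strings are hypothetical stand-ins used only to show how a boolean predicate skips a rule at construction time.

```python
class RuleSet:
    """Toy stand-in for a legalization rule builder; not the LLVM API."""

    def __init__(self):
        self.legal_types = []

    def legal_for(self, types, when=True):
        # Skip adding the rule entirely when the feature predicate is false.
        if when:
            self.legal_types.extend(types)
        return self  # allow chaining, as rule builders usually do

    def is_legal(self, ty):
        return ty in self.legal_types


def build_fadd_rules(has_fp16):
    """Build rules for a hypothetical floating-point add."""
    return (
        RuleSet()
        .legal_for(["s32", "s64"])
        .legal_for(["s16"], when=has_fp16)
    )
```

Compared with wrapping the feature check in a per-query lambda, the predicate is evaluated once while the rules are built, and the rule list reads as a flat chain.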
Fix a check for extending loads in DAGCombiner,
where if the result type has more bits than the
loaded type it should count as an extending load.
All backends apart from AArch64 ignore this
ExtTy argument to shouldReduceLoadWidth, so this
change currently only impacts AArch64.