The selection code for the aarch64_sve_ld[nt]1_pn_x{2,4} intrinsics gates
the use of strided load instructions behind the SME2 target feature.
However, these instructions are only available in streaming mode, so the
check must also require that we are in streaming mode.
On most operating systems, the x16 and x17 registers are not special,
so there is no benefit, and only a code size cost, to constraining AUT to
only using them. Therefore, adjust the backend to only use the AUT pseudo
(renamed AUTx16x17 for clarity) on Darwin platforms. All other platforms
use an unconstrained variant of the pseudo, AUTxMxN, for selection.
Reviewers: ahmedbougacha, kovdan01, atrosinenko
Reviewed By: atrosinenko
Pull Request: https://github.com/llvm/llvm-project/pull/132857
When targeting big-endian, we need to use ld1/st1 for vector loads and
stores so that we get the elements in the correct order, but this prevents
post-index addressing from being used. Fix this by adding the appropriate
ISel patterns, plus the relevant changes in ISelLowering and ISelDAGToDAG,
so that post-index addressing is used.
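As a rough illustration (not a test from the patch), a simple vectorizable loop like the one below is the kind of code where the per-iteration pointer increments can now fold into post-indexed ld1/st1 when compiling for a big-endian AArch64 target:
```c
// Hypothetical example: on big-endian targets the vectorized loads/stores
// must use ld1/st1, and with this change the pointer bump each iteration can
// be folded into them as a post-index update.
void scale(float *restrict dst, const float *restrict src, long n) {
  for (long i = 0; i < n; ++i)
    dst[i] = src[i] * 2.0f;
}
```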
When converting ROTR(XOR(a, b)) to XAR(a, b), or ROTR(a, a) to XAR(a, zero)
we were not handling v1i64 types, meaning illegal copies get generated. This
addresses that by generating insert_subreg and extract_subreg for v1i64 to
keep the values with the correct types.
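For illustration (assuming NEON intrinsics and a target where XAR is available, e.g. with SHA3), the ROTR(XOR(a, b)) shape on a single 64-bit lane looks like:
```c
#include <arm_neon.h>

// A v1i64 rotate of an XOR: the ROTR(XOR(a, b)) pattern that can now be
// selected to XAR without generating illegal copies.
uint64x1_t xar64(uint64x1_t a, uint64x1_t b) {
  uint64x1_t x = veor_u64(a, b);
  return vorr_u64(vshr_n_u64(x, 17), vshl_n_u64(x, 47));  // rotate right by 17
}
```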
Fixes #141746
When compiling VLS SVE, the compiler often replaces VL-based offsets
with immediate-based ones. This leads to a mismatch in the allowed
addressing modes due to SVE loads/stores generally expecting immediate
offsets relative to VL. For example, given:
```c
svfloat64_t foo(const double *x) {
  svbool_t pg = svptrue_b64();
  return svld1_f64(pg, x + svcntd());
}
```
When compiled with `-msve-vector-bits=128`, we currently generate:
```gas
foo:
        ptrue   p0.d
        mov     x8, #2
        ld1d    { z0.d }, p0/z, [x0, x8, lsl #3]
        ret
```
Instead, we could be generating:
```gas
foo:
        ldr     z0, [x0, #1, mul vl]
        ret
```
Likewise for other types, stores, and other VLS lengths.
This patch achieves the above by extending `SelectAddrModeIndexedSVE`
to let constants through when `vscale` is known.
This patch adds codegen for all Armv9.6-A compare-and-branch
instructions that operate on full w or x registers. The instruction
variants operating on half-words (cbh) and bytes (cbb) will be added in a
subsequent patch.
Since CB doesn't use standard 4-bit Arm condition codes but a reduced
set of conditions, encoded in 3 bits, some conditions are expressed by
modifying operands, namely incrementing or decrementing immediate
operands and swapping register operands. To invert a CB instruction it is
therefore not enough to just modify the condition code, which doesn't
play particularly well with how the backend is currently organized. We
therefore introduce a number of pseudos which operate on the standard
4-bit condition codes and lower them late during codegen.
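A minimal sketch of the operand-rewriting idea, in plain C rather than compiler code (the reduced condition set assumed here is purely for illustration):
```c
#include <assert.h>
#include <stdint.h>

// Suppose only a "greater-than immediate" compare-and-branch condition is
// available. The inverse of "branch if x < imm" is "branch if x >= imm",
// which can be expressed by decrementing the immediate: "branch if x > imm-1".
static int branch_if_ge(uint32_t x, uint32_t imm) {
  return x > imm - 1u;  // equivalent to x >= imm for imm > 0
}

int main(void) {
  for (uint32_t x = 0; x < 10; ++x)
    assert(branch_if_ge(x, 5) == (x >= 5));
  return 0;
}
```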
When the predicate of a destructive operation is known to be all-true,
for example
fabs z0.s, p0/m, z1.s
then the entire output register is written and we can use a zeroing
(instead of a merging) form of the instruction, for example
fabs z0.s, p0/z, z1.s
thus eliminating the dependency on the destination register (which the
merging form also reads), without the need to insert a `movprfx`.
This patch complements (and in the case of
2b3266c170,
fixes a regression) the following:
7f4414b2a1
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (4/11)
(https://github.com/llvm/llvm-project/pull/116830)
2474cf7ad1
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (3/11)
(https://github.com/llvm/llvm-project/pull/116829)
6f285d3115
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (2/11)
(https://github.com/llvm/llvm-project/pull/116828)
2b3266c170
[AArch64] Generate zeroing forms of certain SVE2.2 instructions (1/11)
(https://github.com/llvm/llvm-project/pull/116259)
The `uses()` function is most often used in range-based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than an SDUse *, so users() is a better name.
I've long been annoyed that we can't write a range-based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
This patch implements the following intrinsics:
8-bit floating-point convert to half-precision or BFloat16 (in-order).
```c
// Variant is also available for: _bf16[_mf8]_x2
svfloat16x2_t svcvt1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
svfloat16x2_t svcvt2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
```
In accordance with https://github.com/ARM-software/acle/pull/323.
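As a usage sketch (assuming a compiler and headers with SME2/FP8 ACLE support; the wrapper function is only for illustration), the non-overloaded form of the first intrinsic might be called like this:
```c
#include <arm_sme.h>

// Convert the FP8 elements of zn to two half-precision vectors (in-order),
// using the FP8 mode settings passed in fpm. Must be called in streaming mode.
svfloat16x2_t widen_f16(svmfloat8_t zn, fpm_t fpm) __arm_streaming {
  return svcvt1_f16_mf8_x2_fpm(zn, fpm);
}
```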
Co-authored-by: Marian Lukac <marian.lukac@arm.com>
Co-authored-by: Caroline Concatto <caroline.concatto@arm.com>
This patch implements the following intrinsics:
8-bit floating-point convert to deinterleaved half-precision or
BFloat16.
```c
// Variant is also available for: _bf16[_mf8]_x2
svfloat16x2_t svcvtl1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
svfloat16x2_t svcvtl2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming;
```
Defined in https://github.com/ARM-software/acle/pull/323
Co-authored-by: Caroline Concatto <caroline.concatto@arm.com>
Co-authored-by: Marian Lukac <marian.lukac@arm.com>
Alter the #ifdef guards from #110986 and #115292 to use _MSC_VER instead of
_WIN32, to stop the pragmas from being used on gcc/mingw builds.
Noticed by @mstorsjo
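The shape of the change is roughly as below; the pragma shown is a stand-in, not necessarily the one used in the affected files. MinGW/gcc builds also define _WIN32, so MSVC-only pragmas have to be keyed on _MSC_VER:
```c
// Illustrative sketch only: guard MSVC-specific pragmas with _MSC_VER so
// that gcc/mingw builds (which define _WIN32 but not _MSC_VER) never see them.
#if defined(_MSC_VER)
#pragma optimize("", off)  /* stand-in for the MSVC-only pragma */
#endif
/* ... code that is excessively slow for CL.EXE to optimize ... */
#if defined(_MSC_VER)
#pragma optimize("", on)
#endif
```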
Similar to #110986 - disabling inlining on MSVC release builds avoids an excessive build time issue affecting all recent versions of CL.EXE
Fixes #114425
When trying to evaluate an expression in a narrower type, the
DAGCombine should propagate the disjoint flag, as it's equally
valid on the narrower expression.
This helps make better use of addressing modes for some
Arm SME instructions, for example.
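As a generic illustration of the property the flag records (not a test from the patch): an OR whose operands share no set bits is equivalent to an ADD, so it can participate in addressing-mode folding, and that remains true after the expression is narrowed.
```c
#include <stdint.h>

// The low bit of (i << 1) is known to be zero, so the OR below is "disjoint"
// and behaves exactly like an ADD when forming the address of base[idx].
int32_t load_odd(const int32_t *base, uint64_t i) {
  uint64_t idx = (i << 1) | 1;
  return base[idx];
}
```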
This patch was previously reverted because of a failing C test.
That failure has now been fixed, so the patch can be merged into main again.
This patch adds these intrinsics:
// Variants are also available for: _s8
svuint8x4_t svluti4_zt_u8_x4(uint64_t zt0, svuint8x2_t zn)
__arm_streaming __arm_in("zt0");
according to PR #324 [1].
[1] ARM-software/acle#324
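A usage sketch, assuming an SME2-capable toolchain (per the ACLE proposal, 0 is the only valid value for the zt0 argument; the wrapper name is illustrative):
```c
#include <arm_sme.h>

// Look up packed 4-bit indices from zn in the ZT0 table, producing four
// u8 result vectors. Requires streaming mode and read access to ZT0.
svuint8x4_t lookup(svuint8x2_t zn) __arm_streaming __arm_in("zt0") {
  return svluti4_zt_u8_x4(0, zn);
}
```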
This patch implements these intrinsics:
FSCALE SINGLE AND MULTI
```
// Variants are also available for:
// [_single_f32_x2], [_single_f64_x2],
// [_single_f16_x4], [_single_f32_x4], [_single_f64_x4]
svfloat16x2_t svscale[_single_f16_x2](svfloat16x2_t zd, svfloat16_t zm) __arm_streaming;
// Variants are also available for:
// [_f32_x2], [_f64_x2],
// [_f16_x4], [_f32_x4], [_f64_x4]
svfloat16x2_t svscale[_f16_x2](svfloat16x2_t zd, svfloat16x2_t zm) __arm_streaming
```
(cf. https://github.com/ARM-software/acle/pull/323)
Co-authored-by: Caroline Concatto <caroline.concatto@arm.com>
This introduces 3 hardening modes in the authentication step of
auth/resign lowering:
- unchecked, which uses the AUT instructions as-is
- poison, which detects authentication failure (using an XPAC+CMP
sequence), explicitly yielding the XPAC result rather than the
AUT result, to avoid leaking
- trap, which additionally traps on authentication failure,
  using BRK #0xC470 + key (IA: 0xC470, IB: 0xC471, DA: 0xC472, DB: 0xC473).
Not all modes are necessarily useful in all contexts, and there
are more performant alternative lowerings in specific contexts
(e.g., when I/D TBI enablement is a target ABI guarantee.)
These will be implemented separately.
This is controlled by the `ptrauth-auth-traps` function attribute,
and can be overridden using `-aarch64-ptrauth-auth-checks=`.
This also adds the FPAC extension, which we haven't needed
before, to improve isel when we can rely on HW checking.
This adds the intrinsics defined in ARM-software/acle#260.
Doing this requires some changes to the GCS instruction definitions, as
these intrinsics rely on the fact that some instructions don't modify the
input register when GCS is disabled, and the instructions need to be
correctly marked with mayLoad/mayStore/hasSideEffects for instruction
selection to work.
According to the specification in
ARM-software/acle#309 this adds the intrinsics
Move and zero multiple ZA single-vector groups to vector registers
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x2_t svreadz_za8_s8_vg1x2(uint32_t slice)
__arm_streaming __arm_inout("za");
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x4_t svreadz_za8_s8_vg1x4(uint32_t slice)
__arm_streaming __arm_inout("za");
According to the specification in
ARM-software/acle#309 this adds the intrinsics
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x2_t svreadz_hor_za8_s8_vg2(uint64_t tile, uint32_t slice)
__arm_streaming __arm_inout("za");
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x4_t svreadz_hor_za8_s8_vg4(uint64_t tile, uint32_t slice)
__arm_streaming __arm_inout("za");
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x2_t svreadz_ver_za8_s8_vg2(uint64_t tile, uint32_t slice)
__arm_streaming __arm_inout("za");
// Variants are also available for _za8_u8, _za16_s16, _za16_u16,
// _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32,
// _za64_s64, _za64_u64 and _za64_f64
svint8x4_t svreadz_ver_za8_s8_vg4(uint64_t tile, uint32_t slice)
__arm_streaming __arm_inout("za");
I've not added any new tests for these, because the original conditions
were wrong (they did not consider streaming mode) and we have tests for
the positive cases.
- Fix build with `EXPENSIVE_CHECKS`
- Remove unused `PassName::ID` to resolve warning
- Mark `~SelectionDAGISel` virtual so AArch64 backend can work properly
This reverts commit de37c06f01772e02465ccc9f538894c76d89a7a1 to
de37c06f01772e02465ccc9f538894c76d89a7a1
It still breaks the EXPENSIVE_CHECKS build. Sorry.
Port SelectionDAG ISel to the new pass manager.
Only `AMDGPU` and `X86` support the new pass manager version so far. In the new pass manager, `-verify-machineinstrs` is part of the verify instrumentation, which is enabled by default.
According to the specification in
https://github.com/ARM-software/acle/pull/309 this adds the intrinsics
```
svfloat32x2_t svcvt_f32[_f16_x2](svfloat16_t zn) __arm_streaming;
svfloat32x2_t svcvtl_f32[_f16_x2](svfloat16_t zn) __arm_streaming;
```
These are available only if __ARM_FEATURE_SME_F16F16 is enabled.
---------
Co-authored-by: Caroline Concatto <caroline.concatto@arm.com>
According to the specification in
https://github.com/ARM-software/acle/pull/309 this adds the intrinsics
```
svbfloat16x2_t svclamp[_single_bf16_x2](svbfloat16x2_t zd, svbfloat16_t zn,
svbfloat16_t zm) __arm_streaming;
svbfloat16x4_t svclamp[_single_bf16_x4](svbfloat16x4_t zd, svbfloat16_t zn,
svbfloat16_t zm) __arm_streaming;
```
These are available only if __ARM_FEATURE_SME_B16B16 is enabled.
The immediate forms of SQADD and SQSUB interpret their immediate
operand as unsigned and thus effectively only support positive
immediate operands.
The original code was only wrong for the i8 cases, because they
previously accepted all immediate values; however, the new patterns enable
more uses, so I've added tests for the larger element types as well.
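For illustration (assuming SVE2 intrinsics are available; the wrapper name is illustrative), the kind of code these immediate patterns target is a saturating add of a small positive constant, which can use the SQADD (immediate) form directly:
```c
#include <arm_sve.h>

// Saturating add of an immediate; with the new patterns this is expected to
// select the SQADD (immediate) form for 16-bit elements as well.
svint16_t sat_add_five(svint16_t v) {
  return svqadd_n_s16(v, 5);
}
```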
The existing heuristics were assuming that every core behaves like an
Apple A7, where any extend/shift costs an extra micro-op... but in
reality, nothing else behaves like that.
On some older Cortex designs, shifts by 1 or 4 cost extra, but all other
shifts/extensions are free. On all other cores, as far as I can tell,
all shifts/extensions for integer loads are free (i.e. the same cost as
an unshifted load).
To reflect this, this patch:
- Enables aggressive folding of shifts into loads by default.
- Removes the old AddrLSLFast feature, since it applies to everything
except A7 (and even if you are explicitly targeting A7, we want to
assume extensions are free because the code will almost always run on a
newer core).
- Adds a new feature AddrLSLSlow14 that applies specifically to the
Cortex cores where shifts by 1 or 4 cost extra.
I didn't add support for AddrLSLSlow14 on the GlobalISel side because it
would require a bunch of refactoring to work correctly. Someone can pick
this up as a followup.
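For illustration, an indexed load like the one below is the kind of code affected: with aggressive folding the shift by 3 is folded into the load's addressing mode (something like `ldr x0, [x0, x1, lsl #3]`) rather than being computed by a separate instruction.
```c
#include <stdint.h>

// The index scaling (lsl #3 for 8-byte elements) can be folded into the
// load's register-offset addressing mode on cores where that is free.
uint64_t load_elem(const uint64_t *p, uint64_t i) {
  return p[i];
}
```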