Implement the remaining CIR lowerings for the AdvSIMD (Neon)
`vceqz` intrinsic group (bitwise equal to zero).
Most variants of `vceqz` were already supported; this patch
completes the rest of the group [1] that was left as a TODO.
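For readers unfamiliar with the group, the per-lane behaviour being lowered can be pictured with a scalar reference model. This is only a sketch: `vceqzModel` is an illustrative name, the four-lane `int32_t` shape is one arbitrary choice of variant, and it is not the CIR lowering itself.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference model of a four-lane "compare equal to zero":
// each output lane is all-ones when the input lane is zero and
// all-zeros otherwise. vceqzModel is a hypothetical helper name.
inline std::array<std::uint32_t, 4>
vceqzModel(const std::array<std::int32_t, 4> &in) {
  std::array<std::uint32_t, 4> out{};
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = (in[i] == 0) ? 0xFFFFFFFFu : 0u;
  return out;
}
```

The result is a mask rather than a boolean, which is why the lowering is a compare followed by a sign extension rather than a plain `icmp`.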
Tests for these intrinsics are moved from:
* test/CodeGen/AArch64/neon_intrinsics.c
* test/CodeGen/AArch64/v8.2a-fp16-intrinsics.c
to:
* test/CodeGen/AArch64/neon/intrinsics.c
* test/CodeGen/AArch64/neon/fullfp16,
respectively.
The implementation largely mirrors the existing lowering in
CodeGen/TargetBuiltins/ARM.cpp.
Reference:
[1] https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#bitwise-equal-to-zero
Not all AArch64 intrinsics categorized as SISD (Single Instruction
Single Data) are truly SISD. Add comments clarifying this distinction.
Also update EmitCommonNeonSISDBuiltinExpr:
* Move the assert to the top of the function and add a descriptive
message to make the assumptions explicit.
* Remove unnecessary temporary variables (e.g. BuiltinID) and use
SISDInfo directly.
No functional changes intended.
Implement the remaining CIR lowerings for the AdvSIMD (Neon)
`vceqz{|q|d|s}_*` intrinsic group (bitwise equal to zero).
The `vceqzd_s64` variant was already supported; this patch completes
the rest of the group [1].
Tests for these intrinsics are moved from:
* test/CodeGen/AArch64/neon-misc.c
to:
* test/CodeGen/AArch64/neon/intrinsics.c
The implementation largely mirrors the existing lowering in
CodeGen/TargetBuiltins/ARM.cpp.
`emitCommonNeonBuiltinExpr` is introduced to support these lowerings.
`getNeonType` is moved without functional changes.
Reference:
[1] https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#bitwise-equal-to-zero
This patch performs small cleanups and fixes in the AArch64 builtins
lowering code, with the goal of aligning the CIR path more closely
with the existing Clang CodeGen implementation.
Changes include:
* Make sure that `noundef` is consistently matched using `{{.*}}`.
* Rename `AArch64BuiltinInfo` to `armVectorIntrinsicInfo` for better
consistency with the original CodeGen implementation.
* Simplify `emitAArch64CompareBuiltinExpr`, fix an incorrect
assert condition (missing `!`) and make sure to use the input `kind`
condition instead of hard-coding `cir::CmpOpKind::eq`.
* Improve and clarify comments.
No functional changes intended (NFC).
Add `__arm_atomic_store_with_stshh` implementation as defined in the
ACLE. Validate arguments passed are correct, and lower to the `stshh`
intrinsic plus an atomic store using a pseudo-instruction with the
allowed orderings:
* memory orderings: relaxed, release, seq_cst
* retention policies: keep, strm
The `STSHH` instruction (Store with Store Hint for Hardware) is part
of the `FEAT_PCDPHINT` extension.
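The ordering restriction listed above matches the orderings that are valid for any C++ atomic store. A minimal sketch of such a check, assuming a hypothetical helper name `isAllowedStoreOrder` (the actual Sema check in the patch may be structured differently):

```cpp
#include <atomic>

// Returns true for the memory orderings listed above (relaxed, release,
// seq_cst). Acquire-flavoured orderings make no sense for a store and
// are rejected. isAllowedStoreOrder is a hypothetical helper name.
inline bool isAllowedStoreOrder(std::memory_order order) {
  return order == std::memory_order_relaxed ||
         order == std::memory_order_release ||
         order == std::memory_order_seq_cst;
}
```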
Remove the outstanding calls to `EmitScalarExpr` in
`EmitAArch64BuiltinExpr` that are no longer required.
This is a follow-up for #181794 and #181974 - please refer to
those PRs for more context.
Refactor `EmitAArch64BuiltinExpr` so that all AArch64/NEON builtins
handled by this hook _and marked as overloaded_ share a common path
for generating LLVM IR arguments (collected into the `Ops`
`SmallVector<Value*>`) (*). This is a follow-up for #181794 - please
refer to that PR for more context.
As in the previous PR, the key change is implemented in
`HasExtraNeonArgument`, i.e. in the hook that identifies builtins with
the extra argument. In this PR, I am replacing the ad-hoc switch
statement with a more principled approach borrowed from SemaARM.cpp,
namely:
```cpp
static bool HasExtraNeonArgument(unsigned BuiltinID) {
  // (...)
  uint64_t mask = 0;
  switch (BuiltinID) {
#define GET_NEON_OVERLOAD_CHECK
#include "clang/Basic/arm_fp16.inc"
#include "clang/Basic/arm_neon.inc"
#undef GET_NEON_OVERLOAD_CHECK
  // Non-neon builtins for controlling VFP that take extra argument for
  // discriminating the type.
  case ARM::BI__builtin_arm_vcvtr_f:
  case ARM::BI__builtin_arm_vcvtr_d:
    mask = 1;
  }
  switch (BuiltinID) {
  default: break;
  }
  if (mask)
    return true;
  return false;
}
```
This is preferred because the extra argument is defined for Sema
verification. CodeGen should reuse the same source of truth rather than
duplicating or partially reimplementing the logic.
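The "single source of truth" idea can be illustrated in isolation with an X-macro. This is a toy sketch, not the real mechanism: the `TOY_OVERLOADED_BUILTINS` list stands in for the generated `.inc` files, and the enum values and function names are made up for the example.

```cpp
// Toy builtin IDs; the two *_v entries stand in for overloaded builtins.
enum ToyBuiltinID { BI_vabsh_f16, BI_vbsl_v, BI_vsriq_n_v };

// Hypothetical stand-in for the generated overload-check list.
#define TOY_OVERLOADED_BUILTINS(X)                                             \
  X(BI_vbsl_v)                                                                 \
  X(BI_vsriq_n_v)

// "Sema" side: does this builtin carry a trailing type discriminator?
inline bool toySemaExpectsTypeFlag(ToyBuiltinID id) {
  switch (id) {
#define X(ID) case ID:
    TOY_OVERLOADED_BUILTINS(X)
#undef X
    return true;
  default:
    return false;
  }
}

// "CodeGen" side: expands the same list, so the two cannot drift apart.
inline bool toyHasExtraNeonArgument(ToyBuiltinID id) {
  unsigned mask = 0;
  switch (id) {
#define X(ID)                                                                  \
  case ID:                                                                     \
    mask = 1;                                                                  \
    break;
    TOY_OVERLOADED_BUILTINS(X)
#undef X
  default:
    break;
  }
  return mask != 0;
}
```

Because both functions expand the same list, adding a builtin to the list updates both checks at once.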
No functional change intended.
(*) `EmitAArch64BuiltinExpr` contains two large switch statements
intended to separate handling of non-overloaded and overloaded builtins.
In practice, the split is not consistently enforced. Patch 1/2
refactored the first switch (non-overloaded path). This patch applies
the same cleanup to the overloaded path and completes the refactoring.
Refactor `EmitAArch64BuiltinExpr` so that all AArch64/NEON builtins
handled by this hook _and marked as non-overloaded_ share a common path
for generating LLVM IR arguments (collected into the `Ops`
`SmallVector<Value*>`) (*)
Previously, the argument emission loop unconditionally skipped the
trailing argument:
```cpp
for (unsigned i = 0, e = E->getNumArgs() - 1; i != e; ++i)
```
This was originally intended to ignore the extra Sema-only argument
used by overloaded NEON builtins (e.g. the type discriminator passed
by `__builtin_neon_*` intrinsics). However, this logic was applied
unconditionally.
This patch updates the loop to skip the trailing argument only when
`HasExtraNeonArgument` returns true for non-SISD builtins:
```cpp
bool HasExtraArg = !IsSISD && HasExtraNeonArgument(BuiltinID);
unsigned NumArgs =
E->getNumArgs() - (HasExtraArg ? 1 : 0);
for (unsigned i = 0, e = NumArgs; i != e; ++i)
```
This preserves existing IR generation behaviour while making the
handling of Sema-only NEON discriminator arguments explicit.
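The argument-count rule above can be checked in isolation. The sketch below lifts the same expression into a free function; `numArgsToEmit` and its parameters are names chosen for this example and do not exist in Clang.

```cpp
#include <cassert>

// Number of call arguments that should be emitted as IR operands.
// The trailing argument is dropped only for non-SISD builtins that
// carry the Sema-only type discriminator.
inline unsigned numArgsToEmit(unsigned numCallArgs, bool isSISD,
                              bool hasExtraNeonArgument) {
  const bool hasExtraArg = !isSISD && hasExtraNeonArgument;
  assert((!hasExtraArg || numCallArgs > 0) &&
         "a discriminator implies at least one argument");
  return numCallArgs - (hasExtraArg ? 1u : 0u);
}
```

For example, a four-argument overloaded builtin yields three operands, while a SISD builtin keeps all of its arguments.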
For context, type discriminators can be found in definitions of various
builtins in `arm_neon.h`. For example, `vsriq_n_p64(<args>)` expands
into the following call:
```cpp
__builtin_neon_vsriq_n_v(<args>, 38)
```
The trailing `38` encodes the concrete NEON vector type
(e.g. `poly64x2_t`) for overload resolution in Sema; it is not
semantically part of the operation and is ignored during IR generation.
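To make the encoding concrete, here is a small decoder. It assumes a bit layout modelled on Clang's `NeonTypeFlags` (low four bits for the element type, `0x10` for unsigned, `0x20` for 128-bit "quad" vectors); treat that layout, and the names `DecodedNeonType` and `decodeNeonTypeFlags`, as assumptions of this sketch rather than a documented contract.

```cpp
#include <cstdint>

// Assumed layout: bits 0-3 element type index, bit 4 unsigned, bit 5 quad.
struct DecodedNeonType {
  unsigned eltType;
  bool isUnsigned;
  bool isQuad;
};

inline DecodedNeonType decodeNeonTypeFlags(std::uint32_t flags) {
  DecodedNeonType d;
  d.eltType = flags & 0xFu;
  d.isUnsigned = (flags & 0x10u) != 0;
  d.isQuad = (flags & 0x20u) != 0;
  return d;
}
```

Under that assumed layout, `38` is `0x20 | 6`: the quad bit is set and the element type index is 6, which is consistent with a two-lane 64-bit polynomial vector such as `poly64x2_t`.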
As part of this change, `HasExtraNeonArgument` was completed so
that these discriminator arguments are correctly identified.
No functional change intended.
(*) This refers to two large `switch` stmts inside
`EmitAArch64BuiltinExpr` that are meant to switch the processing into
non-overloaded and overloaded builtins. The intended split between
non-overloaded and overloaded builtins is not consistently enforced: the
second switch (nominally handling overloaded builtins) also processes
some non-overloaded cases. This patch refactors only the first switch
and prepares for a follow-up cleanup in 2/2.
Updates the logic in `CodeGenFunction::EmitAArch64BuiltinExpr` so that
we always start with the general code and we only fall back to
specialised cases (i.e. `switch` stmts) for intrinsics for which the
general code does not apply.
BEFORE (high-level only):
```cpp
Value *CodeGenFunction::EmitAArch64BuiltinExpr() {
(...)
/// 1. SWITCH STMT FOR NON-OVERLOADED INTRINSICS
switch (BuiltinID) {
default: break;
case NEON::BI__builtin_neon_vabsh_f16:
(...)
}
/// 2. GENERAL CODE
Builtin = findARMVectorIntrinsicInMap(AArch64SIMDIntrinsicMap, BuiltinID,
AArch64SIMDIntrinsicsProvenSorted);
if (Builtin)
return EmitCommonNeonBuiltinExpr(
Builtin->BuiltinID, Builtin->LLVMIntrinsic, Builtin->AltLLVMIntrinsic,
Builtin->NameHint, Builtin->TypeModifier, E, Ops,
/*never use addresses*/ Address::invalid(), Address::invalid(), Arch);
if (Value *V = EmitAArch64TblBuiltinExpr(*this, BuiltinID, E, Ops, Arch))
return V;
/// 3. SWITCH STMT FOR THE REMAINING INTRINSICS
switch (BuiltinID) {
default: return nullptr;
case NEON::BI__builtin_neon_vbsl_v:
(...)
}
}
```
AFTER:
```cpp
Value *CodeGenFunction::EmitAArch64BuiltinExpr() {
/// 1. GENERAL CODE
Builtin = findARMVectorIntrinsicInMap(AArch64SIMDIntrinsicMap, BuiltinID,
AArch64SIMDIntrinsicsProvenSorted);
if (Builtin)
return EmitCommonNeonBuiltinExpr(
Builtin->BuiltinID, Builtin->LLVMIntrinsic, Builtin->AltLLVMIntrinsic,
Builtin->NameHint, Builtin->TypeModifier, E, Ops,
/*never use addresses*/ Address::invalid(), Address::invalid(), Arch);
if (Value *V = EmitAArch64TblBuiltinExpr(*this, BuiltinID, E, Ops, Arch))
return V;
/// 2. SWITCH STMT FOR NON-OVERLOADED INTRINSICS
switch (BuiltinID) {
default: break;
case NEON::BI__builtin_neon_vabsh_f16:
(...)
}
/// 3. SWITCH STMT FOR THE REMAINING INTRINSICS
switch (BuiltinID) {
default: return nullptr;
case NEON::BI__builtin_neon_vbsl_v:
(...)
}
}
```
In addition:
* Remove `vaddq_p128`, `vcvtq_high_bf16_f32` and `vcvtq_low_bf16_f32`
from `AArch64SIMDIntrinsicMap`. They do not belong there: the map
lists intrinsics for which the general code-gen works, which is not
the case for these three.
* Extract the declaration of `Int` so that it can be re-used.
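The "general code first, special cases second" ordering can be sketched with a toy dispatcher. Everything here is hypothetical: the IDs, the `ToyIntrinsicMap` table and the `Path` enum only illustrate the control flow, with a binary search over a sorted table standing in for the map lookup.

```cpp
#include <algorithm>
#include <iterator>

enum class Path { General, NonOverloadedSwitch, RemainingSwitch, Unhandled };

struct ToyIntrinsicInfo {
  unsigned builtinID;
};

// Toy table of builtins the general code can handle; sorted by ID so
// that a binary search is valid.
static const ToyIntrinsicInfo ToyIntrinsicMap[] = {{10}, {20}, {30}};

inline const ToyIntrinsicInfo *findInToyMap(unsigned id) {
  const ToyIntrinsicInfo *first = std::begin(ToyIntrinsicMap);
  const ToyIntrinsicInfo *last = std::end(ToyIntrinsicMap);
  const ToyIntrinsicInfo *it = std::lower_bound(
      first, last, id, [](const ToyIntrinsicInfo &e, unsigned key) {
        return e.builtinID < key;
      });
  return (it != last && it->builtinID == id) ? it : nullptr;
}

inline Path dispatchToyBuiltin(unsigned id) {
  // 1. General code: table lookup first.
  if (findInToyMap(id))
    return Path::General;
  // 2. Switch for non-overloaded special cases.
  switch (id) {
  case 15:
    return Path::NonOverloadedSwitch;
  default:
    break;
  }
  // 3. Switch for the remaining special cases.
  switch (id) {
  case 25:
    return Path::RemainingSwitch;
  default:
    return Path::Unhandled;
  }
}
```

With this ordering, an entry that sits in the table but needs special handling would never reach its `switch` case, which is why such entries have to be removed from the table.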
Add CIR lowering support for the non-overloaded NEON intrinsics
`vnegd_s64` and `vnegh_f16`.
The associated tests are shared with the existing default codegen tests:
* `neon-intrinsics.c` → `neon/intrinsics.c`
* `v8.2a-fp16-intrinsics.c` → `neon/fullfp16.c`
A new test file,
* `clang/test/CodeGen/AArch64/neon/fullfp16.c`
is introduced and is intended to eventually replace:
* `clang/test/CodeGen/AArch64/v8.2a-fp16-intrinsics.c`
Since both intrinsics are non-overloaded, the CIR and default codegen
handling is moved to the appropriate switch statements. The previous
placement was incorrect.
This change also includes minor refactoring in `CIRGenBuilder.h` to
better group related hooks.
Rather than creating a dedicated ClangIR test file, the original test file for
this intrinsic is effectively reused:
* clang/test/CodeGen/AArch64/neon-intrinsics.c
“Effectively” meaning that the corresponding test is moved (rather than
literally reused) to a new file within the original AArch64 builtins test
directory:
* clang/test/CodeGen/AArch64/neon/intrinsics.c
This is necessary to avoid lowering unsupported examples from intrinsics.c with
`-fclangir`. The new file will eventually replace the original one once all
builtins from it can be lowered via ClangIR.
To facilitate test re-use, a new LIT "feature" is added so that CIR tests can be
run conditionally, e.g. the following will only run when `CLANG_ENABLE_CIR` is
set:
```C
// RUN: %if cir %{%clang_cc1 ... %}
```
Substitutions of this kind are documented in [2].
REFERENCES:
[1] https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vceqzd_s64
[2] https://llvm.org/docs/TestingGuide.html#substitutions
This patch adds support in Clang for the RPRFM instruction, by adding
the following intrinsics:
```
void __pldx_range(unsigned int access_kind, unsigned int retention_policy,
                  signed int length, unsigned int count, signed int stride,
                  size_t reuse_distance, void const *addr);
void __pld_range(unsigned int access_kind, unsigned int retention_policy,
                 uint64_t metadata, void const *addr);
```
The `__ARM_PREFETCH_RANGE` macro can be used to test whether these
intrinsics are implemented. If the RPRFM instruction is not available, this
instruction is a NOP.
This implements the following ACLE proposal:
https://github.com/ARM-software/acle/pull/423
This is the last of the generic instructions created from MVE
intrinsics. It was a little more awkward than the others due to it
taking a Type as one of the arguments. This creates a new function to
create the intrinsic we need.
Proposed in [this ACLE
proposal](https://github.com/ARM-software/acle/pull/409), this PR
implements widening FMMLA intrinsics.
- F16 to F32
- MF8 to F32
- MF8 to F16
Additional changes:
- IsOverloadCvt flag renamed to IsOverloadFirstandLast for clarity, as
the name implies conversion. Implementation remains unchanged.
Add support for the following new AArch64 Neon intrinsics:
```
float16x8_t vmmlaq_f16_mf8_fpm(float16x8_t, mfloat8x16_t, mfloat8x16_t, fpm_t);
float32x4_t vmmlaq_f32_mf8_fpm(float32x4_t, mfloat8x16_t, mfloat8x16_t, fpm_t);
```
This is the first step in removing some NEON reduction intrinsics that
duplicate the behaviour of their llvm.vector.reduce counterpart.
NOTE: The i8/i16 variants differ in that the NEON versions return an i32
result. However, this looks to be more about making their code generation
convenient, with SelectionDAG discarding the extra bits. This is only
relevant for the next phase because the Clang usage always truncates
their result, making llvm.vector.reduce a drop-in replacement.
`UnqualPtrTy` didn't always match `llvm::PointerType::getUnqual`:
sometimes it returned a pointer that is not in address space 0 (notably
for SPIRV).
Since `UnqualPtrTy` was used as the "generic" or "default" pointer type,
this patch renames it to `DefaultPtrTy` to avoid confusion with LLVM's
`PointerType::getUnqual`.
Builtins for reading the streaming vector length are canonicalised to
use the aarch64.sme.cntsd intrinsic and a multiply, i.e.
- cntsb -> cntsd * 8
- cntsh -> cntsd * 4
- cntsw -> cntsd * 2
This patch also removes the LLVM intrinsics for cnts[b,h,w], and adds
patterns to improve codegen when cntsd is multiplied by a constant.
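The canonical forms follow directly from the element sizes. A minimal arithmetic sketch, where the streaming vector length is passed in as a plain number of bits and the function names simply mirror the builtins (these are not the real intrinsics):

```cpp
#include <cassert>
#include <cstdint>

// Number of 64-bit elements in a streaming vector of svlBits bits.
// The streaming vector length is assumed to be a multiple of 128 bits.
inline std::uint64_t cntsd(std::uint64_t svlBits) {
  assert(svlBits % 128 == 0 && "streaming vector length is a multiple of 128");
  return svlBits / 64;
}

// The narrower counts expressed in terms of cntsd, as in the patch.
inline std::uint64_t cntsb(std::uint64_t svlBits) { return cntsd(svlBits) * 8; }
inline std::uint64_t cntsh(std::uint64_t svlBits) { return cntsd(svlBits) * 4; }
inline std::uint64_t cntsw(std::uint64_t svlBits) { return cntsd(svlBits) * 2; }
```

For a 512-bit streaming vector this gives 8 doublewords, 16 words, 32 halfwords and 64 bytes.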
Use .i16.f16 intrinsic formats for intrinsics like vcvth_s16_f16.
Avoids issues with incorrect saturation that arise when using .i32.f16
formats for the same conversions.
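One way such a saturation mismatch can arise is sketched below. The model uses `float` as a stand-in for the half-precision input and plain C++ helpers instead of the real intrinsics; the helper names, and the choice to map NaN to zero, are assumptions of this sketch.

```cpp
#include <cstdint>
#include <limits>

// Convert with round-toward-zero and saturation to the range of IntT.
// NaN is mapped to 0 in this model.
template <typename IntT> inline IntT saturatingConvert(float x) {
  if (x != x)
    return 0;
  const double d = x;
  const double lo = static_cast<double>(std::numeric_limits<IntT>::min());
  const double hi = static_cast<double>(std::numeric_limits<IntT>::max());
  if (d <= lo)
    return std::numeric_limits<IntT>::min();
  if (d >= hi)
    return std::numeric_limits<IntT>::max();
  return static_cast<IntT>(d);
}

// Saturate at 32 bits first, then keep only the low 16 bits.
inline std::int16_t convertViaI32ThenTruncate(float x) {
  const std::uint32_t bits =
      static_cast<std::uint32_t>(saturatingConvert<std::int32_t>(x)) & 0xFFFFu;
  const std::int32_t wrapped = bits >= 0x8000u
                                   ? static_cast<std::int32_t>(bits) - 0x10000
                                   : static_cast<std::int32_t>(bits);
  return static_cast<std::int16_t>(wrapped);
}
```

For an input of 40000, which is within half-precision range, saturating directly at 16 bits clamps to 32767, whereas saturating at 32 bits and truncating wraps around to a negative value.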
Fixes https://github.com/llvm/llvm-project/issues/154343.
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
This option is confusingly named. What it actually controls is whether,
under the default of `-ffloat16-excess-precision=standard`, it is
beneficial for performance to perform calculations on float (without
intermediate rounding) or not. For `-ffloat16-excess-precision=none` the
LLVM `half` type will always be used, and all backends are expected to
legalize it correctly.
This is handled by the instcombine added in #147930; there is no need
for any clang-specific folding. NFC as all clang tests for
`__arm_in_streaming_mode()` used -O1, which applies the LLVM
instcombines.
FMV priority is the returned value of a polymorphic function. On RISC-V
and X86 targets a 32-bit value is enough. On AArch64 we currently need
64 bits and we will soon exceed that. APInt seems to be a suitable
replacement for uint64_t, presumably with minimal compile time overhead.
It allows bit manipulation, comparison and variable bit width.
Adds support for the __sys Clang builtin for AArch64.
__sys is a long-standing MSVC intrinsic used to manage caches, TLBs, etc.
by writing to system registers:
* It takes a macro-generated constant and uses it to form the AArch64 SYS instruction which is MSR with op0=1. The macro drops op0 and expects the implementation to hardcode it to 1 in the encoding.
* Volume use is in systems code (kernels, hypervisors, boot environments, firmware)
* Has an unused return value due to MSVC cut/paste error
Implementation:
* Clang builtin, sharing code with Read/WriteStatusReg
* Hardcodes the op0=1
* Explicitly returns 0
* Code-format change from clang-format
* Unittests included
* Not limited to MSVC environments, as it's generally useful and neutral
This marks ffloor as legal provided that armv8 and neon are present (or
fullfp16 for the fp16 instructions). The existing arm_neon_vrintm
intrinsics are auto-upgraded to llvm.floor.
If this is OK I will update the other vrint intrinsics.
CreateVScale took a scaling parameter that had a single use outside of
IRBuilder with all other callers having to create a redundant
ConstantInt. To work round this some code preferred to use
CreateIntrinsic directly.
This patch simplifies CreateVScale to return a call to the llvm.vscale()
intrinsic and nothing more. As well as simplifying the existing call
sites I've also migrated the uses of CreateIntrinsic.
Whilst IRBuilder used CreateVScale's scaling parameter as part of the
implementations of CreateElementCount and CreateTypeSize, I have
follow-on work to switch them to the NUW variety and thus they would
stop using CreateVScale's scaling as well. To prepare for this I have
moved the multiplication and constant folding into the implementations
of CreateElementCount and CreateTypeSize.
As a final step I have replaced some callers of CreateVScale with
CreateElementCount where it's clear from the code they wanted the
latter.
This patch adds fp8 variants to existing intrinsics, whose operation
doesn't depend on arguments being a specific type.
It also changes mfloat8 type representation in memory from `i8` to
`<1xi8>`
Most callers want a constant index. Instead of making every caller
create a ConstantInt, we can do it in IRBuilder. This is similar to
createInsertElement/createExtractElement.
This is an expensive header, only include it where needed. Move some
functions out of line to achieve that.
This reduces time to build clang by ~0.5% in terms of instructions
retired.
Currently arm_neon.h emits C-style casts to do vector type casts. This
relies on implicit conversion between vector types to be enabled, which
is currently deprecated behaviour and soon will disappear. To ensure
NEON code will keep working afterwards, this patch changes all these
vector type casts into bitcasts.
Co-authored-by: Momchil Velikov <momchil.velikov@arm.com>
clang/lib/CodeGen/CGBuiltin.cpp is over 1MB long (>23k LoC), and can
take minutes to recompile (depending on compiler and host system) when
modified, and 5 seconds for clangd to update for every edit. Splitting
this file was discussed in this thread:
https://discourse.llvm.org/t/splitting-clang-s-cgbuiltin-cpp-over-23k-lines-long-takes-1min-to-compile/
and the idea has received a number of +1 votes, hence this change.