This option is confusingly named. What it actually controls is whether,
under the default of `-ffloat16-excess-precision=standard`, it is
beneficial for performance to perform calculations in `float` (without
intermediate rounding) or not. With `-ffloat16-excess-precision=none`, the
LLVM `half` type is always used, and all backends are expected to
legalize it correctly.
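A minimal sketch of what the two modes mean, assuming a target where `_Float16` arithmetic is promoted:

```cpp
// Under -ffloat16-excess-precision=standard, the whole expression may be
// evaluated in float and rounded to half only once, at the return; under
// =none, every operation is performed on the LLVM `half` type, rounding
// after each step.
_Float16 mul_add(_Float16 x, _Float16 y, _Float16 z) {
  return x * y + z;
}
```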
Replaces the XOP/AVX512 per-element rotation/funnel-shift builtins with the generic `__builtin_elementwise_fshl`/`__builtin_elementwise_fshr` builtins.
We still have the uniform-immediate variants to handle next.
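For example, a per-element rotate is a funnel shift with both inputs equal; a sketch (`v4u` is just a local typedef):

```cpp
typedef unsigned v4u __attribute__((ext_vector_type(4)));

// Rotate-left is fshl with the same vector as both halves of the funnel;
// this is the kind of pattern the removed per-element builtins now map to.
v4u rotl(v4u x, v4u amt) {
  return __builtin_elementwise_fshl(x, x, amt);
}
```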
Part of #153152
The following intrinsics were replaced by a combination of
`__builtin_shufflevector` and `__builtin_convertvector`:
- `__builtin_ia32_vcvtph2ps`
- `__builtin_ia32_vcvtph2ps256`
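Roughly the replacement pattern (a sketch; the real headers handle the casts and element counts per intrinsic):

```cpp
typedef _Float16 v8hf __attribute__((ext_vector_type(8)));
typedef float v4sf __attribute__((ext_vector_type(4)));

// Take the low four half elements, then widen each one to float.
v4sf cvtph2ps(v8hf h) {
  return __builtin_convertvector(__builtin_shufflevector(h, h, 0, 1, 2, 3),
                                 v4sf);
}
```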
Fixes #152749
The following intrinsics were replaced by `__builtin_elementwise_fma`:
- `__builtin_ia32_vfmaddps(256)`
- `__builtin_ia32_vfmaddpd(256)`
- `__builtin_ia32_vfmaddph(256)`
- `__builtin_ia32_vfmaddbf16(128 | 256 | 512)`
All the aforementioned `__builtin_ia32_vfmadd*` intrinsics are
equivalent to a `__builtin_elementwise_fma`, so keeping them is an
unnecessary indirection.
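A minimal sketch of the equivalence on a 128-bit float vector:

```cpp
typedef float v4sf __attribute__((ext_vector_type(4)));

// Per-element fused multiply-add, which is exactly what the removed
// __builtin_ia32_vfmadd* builtins exposed.
v4sf fma4(v4sf a, v4sf b, v4sf c) {
  return __builtin_elementwise_fma(a, b, c);
}
```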
Fixes [#152461](https://github.com/llvm/llvm-project/issues/152461)
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
I fixed support for varargs functions
(previously it didn't crash, but the codegen was incorrect).
I added tests for structs and unions, which already work. With the
multivalue ABI they crash in the backend, so I added a Sema check that
rejects structs and unions for that ABI.
It will also crash in the backend if passed an int128 or float128 type.
This is handled by the instcombine added in #147930; there is no need
for any clang-specific folding. This is NFC, as all clang tests for
`__arm_in_streaming_mode()` used -O1, which applies the LLVM
instcombines.
Tests whether the runtime type of the function pointer matches the static
type. If this returns false, calling the function pointer will trap.
Uses `@llvm.wasm.ref.test.func` added in #147486.
Also adds a "gc" wasm feature to gate the use of the ref.test
instruction.
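A hedged usage sketch of the function-pointer test; the builtin spelling below is assumed from this patch series, not guaranteed:

```cpp
// Guard an indirect call: if the runtime signature of fp doesn't match its
// static type, calling it would trap, so bail out instead.
int call_checked(int (*fp)(int), int arg) {
  if (!__builtin_wasm_test_function_pointer_signature(fp))
    return -1;
  return fp(arg);
}
```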
"amdgpu-as" is way too vague and doesn't give enough context.
We may want to support it on normal atomics too, to control the synchronized (ordered) AS.
If we do that, the name has to be less vague.
FMV priority is the value returned by a polymorphic function. On RISC-V
and X86 targets a 32-bit value is enough. On AArch64 we currently need
64 bits and will soon exceed that. APInt seems to be a suitable
replacement for uint64_t, presumably with minimal compile-time overhead.
It allows bit manipulation, comparison, and variable bit width.
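A sketch of why APInt fits; the bit widths and positions here are illustrative, not the actual FMV encoding:

```cpp
#include "llvm/ADT/APInt.h"
using llvm::APInt;

// Priorities wider than 64 bits compare naturally.
bool higherPriority(const APInt &A, const APInt &B) {
  return A.ugt(B); // unsigned comparison at any bit width
}

APInt withFeature(APInt Priority, unsigned FeatureBit) {
  Priority.setBit(FeatureBit); // e.g. bit 70 no longer overflows uint64_t
  return Priority;
}
```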
- [x] Implement refract using HLSL source in hlsl_intrinsics.h
- [x] Implement the refract SPIR-V target built-in in
clang/include/clang/Basic/BuiltinsSPIRV.td
- [x] Add sema checks for refract to CheckSPIRVBuiltinFunctionCall in
clang/lib/Sema/SemaSPIRV.cpp
- [x] Add codegen for spv refract to EmitSPIRVBuiltinExpr in
CGBuiltin.cpp
- [x] Add codegen tests to clang/test/CodeGenHLSL/builtins/refract.hlsl
- [x] Add spv codegen test to clang/test/CodeGenSPIRV/Builtins/refract.c
- [x] Add sema tests to clang/test/SemaHLSL/BuiltIns/refract-errors.hlsl
- [x] Add spv sema tests to
clang/test/SemaSPIRV/BuiltIns/refract-errors.c
- [x] Create the int_spv_refract intrinsic in IntrinsicsSPIRV.td
- [x] In SPIRVInstructionSelector.cpp create the refract lowering and
map it to int_spv_refract in SPIRVInstructionSelector::selectIntrinsic.
- [x] Create SPIR-V backend test case in
llvm/test/CodeGen/SPIRV/hlsl-intrinsics/refract.ll
- [x] Check for what OpenCL support is needed.
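For reference, the standard refract semantics being implemented, written out as a scalar 3-vector sketch (the builtin applies the same formula component-wise with vector dot products):

```cpp
#include <cmath>

// refract(I, N, eta) per the usual HLSL/GLSL definition: returns the zero
// vector on total internal reflection.
struct V3 { float x, y, z; };
static float dot3(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

V3 refract3(V3 I, V3 N, float eta) {
  float d = dot3(N, I);
  float k = 1.0f - eta * eta * (1.0f - d * d);
  if (k < 0.0f)
    return {0.0f, 0.0f, 0.0f}; // total internal reflection
  float s = eta * d + std::sqrt(k);
  return {eta * I.x - s * N.x, eta * I.y - s * N.y, eta * I.z - s * N.z};
}
```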
Resolves https://github.com/llvm/llvm-project/issues/99153
XAndesBFHCvt provides two builtin functions for converting between
float and bf16. Users can use them to convert bf16 values loaded from
memory to float, perform arithmetic operations, then convert the results
back to bf16 and store them to memory.
The load/store and move operations for bf16 will be handled in a later
patch.
The patch adds intrinsics and lowering logic for GlobalSize,
GlobalOffset, SubgroupMaxSize, NumWorkgroups, WorkgroupSize,
WorkgroupId, LocalInvocationId, GlobalInvocationId, SubgroupSize,
NumSubgroups, SubgroupId and SubgroupLocalInvocationId SPIR-V builtins.
The patch also extends spv_thread_id, spv_group_id and
spv_thread_id_in_group to return anyint rather than i32. This allows the
intrinsics to support the OpenCL environment.
For each of the intrinsics, new clang builtins were added, as well as a
binding in the SPIR-V "friendly" format. The original format doesn't
define such a binding (it uses global variables), but global variables
cannot express the Input storage class normally required by the
environment specs, and builtin functions are the usual approach for other
backends and programming models.
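A hedged sketch of what such a friendly-format binding looks like; the exact prototype follows the translator convention and is assumed here:

```cpp
// Assumed spelling: one builtin function per SPIR-V builtin variable,
// indexed by dimension, instead of an Input global variable.
extern "C" unsigned long __spirv_BuiltInGlobalInvocationId(int dim);

unsigned long global_id_x() { return __spirv_BuiltInGlobalInvocationId(0); }
```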
Adds support for the __sys Clang builtin for AArch64.
__sys is a long-standing MSVC intrinsic used to manage caches, TLBs, etc.
by writing to system registers:
* It takes a macro-generated constant and uses it to form the AArch64 SYS instruction, which is MSR with op0=1. The macro drops op0 and expects the implementation to hardcode it to 1 in the encoding.
* The bulk of its use is in systems code (kernels, hypervisors, boot environments, firmware)
* It has an unused return value due to an MSVC cut-and-paste error
Implementation:
* Clang builtin, sharing code with Read/WriteStatusReg
* Hardcodes op0=1
* Explicitly returns 0
* Code-format change from clang-format
* Unit tests included
* Not limited to the MSVC environment, as it's generally useful and neutral
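A hedged sketch of the intrinsic's shape; the MSVC-style prototype is assumed, and the register constant is illustrative rather than a real maintenance operation:

```cpp
// Assumed MSVC-style prototype: int __sys(int, __int64); op0 is hardcoded
// to 1, so the constant encodes only op1:Cn:Cm:op2.
void sysreg_write(int reg_encoding, long long value) {
  (void)__sys(reg_encoding, value); // return value is vestigial, always 0
}
```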
This marks ffloor as legal provided that Armv8 and NEON are present (or
fullfp16 for the fp16 instructions). The existing arm_neon_vrintm
intrinsics are auto-upgraded to llvm.floor.
If this is OK I will update the other vrint intrinsics.
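The ACLE rounding intrinsics keep working; they now go through the generic floor intrinsic. A sketch, assuming compilation for Armv8 with NEON:

```cpp
#include <arm_neon.h>

// vrndmq_f32 rounds each lane toward minus infinity; it now lowers via
// llvm.floor.v4f32 rather than llvm.arm.neon.vrintm.
float32x4_t floor_lanes(float32x4_t v) { return vrndmq_f32(v); }
```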
CreateVScale took a scaling parameter that had a single use outside of
IRBuilder, with all other callers having to create a redundant
ConstantInt. To work around this, some code preferred to use
CreateIntrinsic directly.
This patch simplifies CreateVScale to return a call to the llvm.vscale()
intrinsic and nothing more. As well as simplifying the existing call
sites, I've also migrated the uses of CreateIntrinsic.
Whilst IRBuilder used CreateVScale's scaling parameter as part of the
implementations of CreateElementCount and CreateTypeSize, I have
follow-on work to switch them to the NUW variety, and thus they would
stop using CreateVScale's scaling as well. To prepare for this I have
moved the multiplication and constant folding into the implementations
of CreateElementCount and CreateTypeSize.
As a final step I have replaced some callers of CreateVScale with
CreateElementCount where it's clear from the code they wanted the
latter.
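A sketch of the before/after shape of the API, with signatures abbreviated:

```cpp
#include "llvm/IR/IRBuilder.h"

// Before: CreateVScale(Scaling) multiplied vscale by Scaling, so most
// callers passed a redundant ConstantInt of 1.
// After: it returns a bare call to llvm.vscale(), and any multiplication
// lives inside CreateElementCount / CreateTypeSize.
llvm::Value *emitScalableCount(llvm::IRBuilderBase &B) {
  llvm::Value *VScale = B.CreateVScale(B.getInt64Ty());
  (void)VScale;
  // Equivalent to vscale * 4, with the scaling folded by CreateElementCount:
  return B.CreateElementCount(B.getInt64Ty(),
                              llvm::ElementCount::getScalable(4));
}
```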
The patch introduces __builtin_spirv_generic_cast_to_ptr_explicit, which
is lowered to the llvm.spv.generic.cast.to.ptr.explicit intrinsic.
The SPIR-V builtins are now split into separate files:
BuiltinsSPIRVCore.td,
BuiltinsSPIRVVK.td for Vulkan-specific builtins, BuiltinsSPIRVCL.td for
OpenCL-specific builtins,
and BuiltinsSPIRVCommon.td for common ones.
The patch also introduces a new header defining its SPIR-V friendly
equivalents (__spirv_GenericCastToPtrExplicit_ToGlobal,
__spirv_GenericCastToPtrExplicit_ToLocal and
__spirv_GenericCastToPtrExplicit_ToPrivate). The functions are declared
as aliases to the new builtin, allowing C-like languages to have a
definition to rely on as well as gaining proper front-end diagnostics.
The motivation for the header is to provide a stable binding for
applications or libraries (such as SYCL) and to allow non-SPIR-V targets
to provide an implementation (via libclc, or similar to how it is done
for gpuintrin.h).
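A hedged usage sketch; the prototype below is assumed for illustration, and the real declarations, with proper address-space qualifiers, come from the new header:

```cpp
extern "C" int *__spirv_GenericCastToPtrExplicit_ToGlobal(int *p);

// Yields a usable pointer when p's dynamic address space is global,
// otherwise null, mirroring OpGenericCastToPtrExplicit.
int *to_global_int(int *p) {
  return __spirv_GenericCastToPtrExplicit_ToGlobal(p);
}
```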
Note: This relands #140615 adding a ".count" suffix to the non-".all"
variants.
Our current support for barrier intrinsics is confusing and
incomplete, with multiple intrinsics mapping to the same instruction and
intrinsic names not clearly conveying their semantics. Further, we
lack support for some variants. This change unifies the IR
representation to a single, consistently named set of intrinsics:
- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.count(i32, i32)
The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:
* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned.count(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync.count(x, y)
…__builtin_scalbn
Clang generates library calls for `__builtin_*` functions, which can be a
problem for GPUs that cannot handle them. This patch generates calls to
the device implementation for __builtin_logb and to the ldexp intrinsic
for __builtin_scalbn.
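The equivalence being relied on: for radix-2 floats, scalbn and ldexp are the same operation, which is why the ldexp path is a valid lowering. A sketch:

```cpp
#include <float.h>
static_assert(FLT_RADIX == 2, "scalbn == ldexp only for radix-2 floats");

// On GPU targets this now lowers to the device ldexp path instead of a
// library call that may not exist.
double scale_by_pow2(double x, int n) { return __builtin_scalbn(x, n); }
```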
This PR adds an amdgcn.load.to.lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.
This commit adds the intrinsic and its lowerings for address spaces 1 and
7, and updates the MLIR wrappers to use it (loosening the restrictions
on loads to LDS along the way to match the ground truth from target
features).
It also plumbs the intrinsic through to clang.
This patch adds fp8 variants to existing intrinsics whose operation
doesn't depend on the arguments being a specific type.
It also changes the in-memory representation of the mfloat8 type from
`i8` to `<1 x i8>`.