With this commit, the CLC fmin/fmax builtins use clang's
__builtin_elementwise_(min|max)imumnum which helps us generate LLVM
minimumnum/maximumnum intrinsics directly. These intrinsics uniformly
select the non-NaN input over the (quiet or signalling) NaN input, which
corresponds to what the OpenCL CTS tests.
These intrinsics maintain the vector types, as opposed to scalarizing,
which was previously happening. This commit therefore helps to optimize
codegen for those targets.
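The selection rule described above can be sketched in plain C. This is only an illustration of the "prefer the non-NaN input" behaviour, not the libclc implementation: the `sketch_*` names are hypothetical, plain C cannot portably tell a quiet NaN from a signalling one, and the ordering of -0.0 before +0.0 is included on the assumption that it matches the minimumnum/maximumnum intrinsics.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: the smaller input, preferring a number over a NaN. */
float sketch_fmin(float a, float b) {
  if (isnan(a)) return b; /* a is NaN: pick b (which may itself be NaN) */
  if (isnan(b)) return a; /* b is NaN: pick a */
  if (a == b) return signbit(a) ? a : b; /* order -0.0 before +0.0 */
  return a < b ? a : b;
}

/* Hypothetical helper: the larger input, preferring a number over a NaN. */
float sketch_fmax(float a, float b) {
  if (isnan(a)) return b;
  if (isnan(b)) return a;
  if (a == b) return signbit(a) ? b : a; /* order +0.0 after -0.0 */
  return a > b ? a : b;
}

/* Usage example: a NaN is only returned when both inputs are NaN. */
void sketch_fminmax_check(void) {
  assert(sketch_fmin(NAN, 1.0f) == 1.0f);
  assert(sketch_fmax(2.0f, NAN) == 2.0f);
  assert(isnan(sketch_fmin(NAN, NAN)));
  assert(signbit(sketch_fmin(0.0f, -0.0f)));
}
```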
Note that there is ongoing discussion regarding how these builtins
should handle signalling NaNs in the OpenCL specification and whether
they should be able to return a quiet NaN as per the IEEE behaviour. If
the specification and/or CTS is ever updated to allow or mandate
returning a qNaN, these builtins could/should be updated to use
__builtin_elementwise_(min|max)num instead which would lower to LLVM
minnum/maxnum intrinsics.
The SPIR-V targets maintain the old implementations, as the LLVM ->
SPIR-V translator can't currently handle the LLVM intrinsics. The
implementation has been simplified to consistently use clang builtins,
as opposed to before where the half version was explicitly defined.
[1] https://github.com/KhronosGroup/OpenCL-CTS/pull/2285
With libclc being a 'runtime', the top-level build assumes that there is
a corresponding 'libclc' target. We previously weren't providing this,
leading to a build failure if the user tried to build it.
This commit remedies this by adding support for building the 'libclc'
target. It does so by adding dependencies from the OpenCL builtins to
this target. It uses a configurable in-between target -
libclc-opencl-builtins - to ease the possibility of adding non-OpenCL
builtin libraries in the future.
Also delete unary_def_via_fp32.inc. There are small changes in
amdgcn--amdhsa.bc because the vector conversion is now scalarized, e.g.
  %2 = fpext <4 x half> %0 to <4 x float>
  %3 = extractelement <4 x float> %2, i64 0
  %4 = tail call float @llvm.fabs.f32(float %3)
->
  %2 = extractelement <4 x half> %0, i64 0
  %3 = tail call half @llvm.fabs.f16(half %2)
  %4 = fpext half %3 to float
Fix the symlink creation logic to use relative paths instead of
absolute, in order to ensure that the installed symlinks actually refer
to the installed .bc files rather than the ones from the build
directory. This was broken in #146833. The change is a bit roundabout
but it attempts to preserve the spirit of #146833, that is the ability
to use multiple output directories (provided they all reside in
`${LIBCLC_OUTPUT_LIBRARY_DIR}` and preserve the same structure in the
installed tree).
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Fix `libclc/utils/CMakeLists.txt` to expose `prepare_builtins_*`
variables in the parent scope. This was a regression introduced in
#148815, where the code was moved into a subdirectory and the variables
were no longer accessible to calls in the top-level CMakeLists,
resulting in attempts to build targets with an empty command:
```
[1566/1676] cd /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build && -o /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/obj.libclc.dir/clspv--/builtins.opt.clspv--.bc
FAILED: clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc
cd /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build && -o /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/clspv--.bc /var/tmp/portage/llvm-core/libclc-22.0.0.9999/work/libclc_build/obj.libclc.dir/clspv--/builtins.opt.clspv--.bc
/bin/sh: line 1: -o: command not found
```
Add corresponding clc functions, which are implemented with clang
__scoped_atomic builtins. OpenCL functions are implemented as a wrapper
over clc functions.
Also change legacy atomic_inc and atomic_dec to re-use the newly added
clc_atomic_inc/dec implementations. llvm-diff shows no change to
atomic_inc and atomic_dec in bitcode.
Notes:
* Generic OpenCL built-in functions use __ATOMIC_SEQ_CST and
__MEMORY_SCOPE_DEVICE for the memory order and memory scope parameters.
* OpenCL atomic_*_explicit, atomic_flag* built-ins are not implemented
yet.
* OpenCL built-ins of atomic_intptr_t, atomic_uintptr_t, atomic_size_t
and atomic_ptrdiff_t types are not implemented yet.
* llvm-diff shows no change to nvptx64--nvidiacl.bc and
amdgcn--amdhsa.bc since __opencl_c_atomic_order_seq_cst and
__opencl_c_atomic_scope_device are not defined in these two targets.
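The layering described above (an OpenCL function wrapping a CLC function that does a sequentially consistent read-modify-write) can be sketched in portable C. This is an assumption-laden stand-in: C11 `<stdatomic.h>` replaces clang's __scoped_atomic builtins, C11 has no notion of memory scope so __MEMORY_SCOPE_DEVICE has no equivalent here, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical CLC-level helpers: sequentially consistent add/subtract of 1,
   returning the value held before the operation. */
int sketch_clc_atomic_inc(atomic_int *p) {
  return atomic_fetch_add_explicit(p, 1, memory_order_seq_cst);
}

int sketch_clc_atomic_dec(atomic_int *p) {
  return atomic_fetch_sub_explicit(p, 1, memory_order_seq_cst);
}

/* Hypothetical OpenCL-level wrappers: thin forwarding layer over CLC. */
int sketch_atomic_inc(atomic_int *p) { return sketch_clc_atomic_inc(p); }
int sketch_atomic_dec(atomic_int *p) { return sketch_clc_atomic_dec(p); }

/* Usage example: both wrappers return the old value. */
void sketch_atomic_check(void) {
  atomic_int counter;
  atomic_init(&counter, 41);
  assert(sketch_atomic_inc(&counter) == 41);
  assert(atomic_load(&counter) == 42);
  assert(sketch_atomic_dec(&counter) == 42);
  assert(atomic_load(&counter) == 41);
}
```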
The implementation is based on the reference implementation in
OpenCL-CTS/test_integer_ops. The generic implementations pass
OpenCL-CTS/test_integer_ops tests on Intel GPU.
The file is listing build artifacts to ignore, but LLVM has long had the
policy that in-tree builds are not supported, so the ignore rules
shouldn't serve their original purpose anymore.
The rules however are annoying because although they probably intended
only to ignore top-level build artifacts, they lack the leading `/` so
they match any file with the ignored name anywhere under `libclc/`.
Changes in this PR:
* Declare most workitem functions in the clc and opencl folders.
* Call clc workitem function in corresponding OpenCL workitem function.
* Move ptx-nvidiacl workitem built-in implementations into clc.
* Move a few amdgcn workitem built-in implementations into clc.
* Include only needed headers in OpenCL workitem functions.
* Implement get_local_linear_id, get_max_sub_group_size,
get_num_sub_groups, get_sub_group_id, get_sub_group_local_id and
get_sub_group_size for ptx-nvidiacl.
llvm-diff shows this PR adds a few new symbols to nvptx64--nvidiacl.bc.
llvm-diff shows no change to amdgcn--amdhsa.bc, nvptx--.bc and
nvptx64--.bc.
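As an illustration of one of the builtins listed above, get_local_linear_id is defined by the OpenCL C specification in terms of the local IDs and local sizes. The sketch below only evaluates that formula on the host; the `sketch_dim3` type and function names are hypothetical, and a real target would obtain the IDs and sizes from its own workitem builtins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for a 3-dimensional ID or size. */
typedef struct {
  size_t x, y, z;
} sketch_dim3;

/* Linearize a local ID: z is the slowest-varying dimension, x the fastest. */
size_t sketch_get_local_linear_id(sketch_dim3 id, sketch_dim3 size) {
  return (id.z * size.y * size.x) + (id.y * size.x) + id.x;
}

/* Usage example with a 4x5x6 work-group. */
void sketch_linear_id_check(void) {
  sketch_dim3 size = {4, 5, 6};
  sketch_dim3 first = {0, 0, 0};
  sketch_dim3 middle = {1, 2, 3};
  sketch_dim3 last = {3, 4, 5};
  assert(sketch_get_local_linear_id(first, size) == 0);
  assert(sketch_get_local_linear_id(middle, size) == 69);
  assert(sketch_get_local_linear_id(last, size) == 4 * 5 * 6 - 1);
}
```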
This commit finishes the work started in #146840 and #147276. It makes
each OpenCL header self-contained and each implementation file include
only the headers it needs. It removes the need for a catch-all include
file of all OpenCL builtin declarations.
This commit continues the work from #146840 and extends it to the maths,
geometric, common, and relational directories.
All headers have include guards and, where appropriate, include the
minimal code required for their specific definitions. Implementation
files no longer include the large catch-all header of all OpenCL builtin
declarations.
This commit starts the process of reducing the amount of code included
by OpenCL builtins, hopefully reducing build times in the process.
It introduces a minimal OpenCL header - opencl-base.h - which includes
only the OpenCL type definitions and the macros necessary for
declaring/defining functions.
Where the OpenCL builtin implementations would currently include the
whole of <clc/opencl/clc.h>, which defines *all* OpenCL builtins, now
they include only the specific declaration they need.
This mirrors how the CLC builtins are defined.
Rename __CLC_FUNCTION to FUNCTION where it is used for a declaration,
since it doesn't make much sense to use __CLC_FUNCTION for an OpenCL
function declaration. Rename it to __IMPL_FUNCTION where it is used for
a definition, since in some cases the implementation function isn't a
clc_* function.
The prepare target was depending on the output of a custom command, but
wasn't using the full path to that file. This tripped up CMake if the
file was removed, as it didn't know how to rebuild it.
These changes were split off from #146503.
This commit makes the output directories of libclc artefacts explicit.
It creates a variable for the final output directory -
LIBCLC_OUTPUT_LIBRARY_DIR - which has not changed. This allows future
changes to alter the output directory more simply, such as by pointing
it to somewhere inside clang's resource directory.
This commit also changes the output directory of each target's
intermediate builtins.*.bc files. They are now placed into each
respective libclc target's object directory, rather than the top-level
libclc binary directory. This should help keep the binary directory a
bit tidier.
This target provides a unified build target for all devices under the
single triple. This way a user doesn't have to know device names to
build a specific target's bytecode libraries.
Device names may be considered as internal implementation details as
they are not exposed to users of CMake; users only specify triples to
build. Now, instead of `prepare-{barts,cayman,cedar,cypress}-r600--.bc`,
for example, a user may now build simply `prepare-r600--` and have all
four of those libraries built.
This commit also refactors the CMake somewhat. We were previously
diverging between the SPIR-V and other targets, and duplicating a bit of
logic like the creation of the 'prepare' targets, the targets'
properties, and the installation directory. It's cleaner and hopefully
more robust to share this code between all targets. This commit also
takes this opportunity to improve some comments around this code.
In the OpenCL Extended Instruction Set Specification, nancode can be a
signed integer or a vector of signed integer values.
This PR has no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc
because the newly added clc functions are not used in OpenCL library.
With this PR, if we have a customized implementation for scalars or for
vectors of length 2, we don't need to write new macros, e.g.
https://github.com/intel/llvm/blob/fb18321705f6/libclc/clc/include/clc/clcmacro.h#L15
Undef __HALF_ONLY, __FLOAT_ONLY and __DOUBLE_ONLY at the end of
clc/include/clc/math/gentype.inc
llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
For a kernel such as
  kernel void foo(__global double3 *z) {
    double3 x = {0.6631661088, 0.6612268107, 0.1513627528};
    int3 y = {-1980459213, -660855407, 615708204};
    *z = pown(x, y);
  }
we were not storing anything to z, because the implementation of pown
relied on a floating-point-to-integer conversion where the
floating-point value was outside of the integer's range. Although in
LLVM IR we permit that operation so long as we end up ignoring its
result -- that is the general rule for poison -- one thing we are not
permitted to do is have conditional branches that depend on it, and
through the call to __clc_ldexp, we did have that.
To fix this, rather than changing expv at the end to INFINITY/0, we can
change v at the start to values that we know will produce INFINITY/0
without performing such out-of-range conversions.
Tested with
  clang --target=nvptx64 -S -O3 -o - test.cl \
    -Xclang -mlink-builtin-bitcode \
    -Xclang runtimes/runtimes-bins/libclc/nvptx64--.bc
A grep showed that this exact same code existed in three more places, so
I changed it there too, though I did not do a broader search for other
similar code that potentially has the same problem.
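The idea behind the pown fix, clamping the floating-point value before converting it rather than patching up the result afterwards, can be sketched in plain C. This is not the libclc code: the helper names and the clamp bounds of +/-200 are assumptions chosen for single-precision floats, and C's undefined behaviour for out-of-range conversions stands in loosely for poison in LLVM IR.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: scale x by 2^n with repeated multiplication, so the
   result overflows to INFINITY or underflows to 0 without special cases. */
static float sketch_scale_pow2(float x, int n) {
  for (; n > 0; --n) x *= 2.0f;
  for (; n < 0; ++n) x *= 0.5f;
  return x;
}

/* Hypothetical helper: compute 2^trunc(v). The clamp happens before the
   conversion, so (int)v is always in range and nothing later depends on
   an invalid conversion result. */
float sketch_pow2_trunc(float v) {
  if (isnan(v)) return v;       /* NaN cannot be converted; pass it through */
  if (v > 200.0f) v = 200.0f;   /* 2^200 already overflows a float */
  if (v < -200.0f) v = -200.0f; /* 2^-200 already underflows to 0 */
  int n = (int)v;               /* safe: v is within [-200, 200] */
  return sketch_scale_pow2(1.0f, n);
}

/* Usage example: huge inputs yield INFINITY/0 without an invalid cast. */
void sketch_pow2_check(void) {
  assert(sketch_pow2_trunc(3.0f) == 8.0f);
  assert(sketch_pow2_trunc(1e30f) == INFINITY);
  assert(sketch_pow2_trunc(-1e30f) == 0.0f);
}
```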
Also delete unused _CLC_DEFINE_BINARY_BUILTIN_WITH_SCALAR_SECOND_ARG,
_CLC_DEFINE_UNARY_BUILTIN_FP16 and _CLC_DEFINE_BINARY_BUILTIN_FP16.
llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
This commit deprecates the use of LLVM_ENABLE_PROJECTS in favour of
LLVM_ENABLE_RUNTIMES when building libclc.
Alternatively, using -DLLVM_RUNTIME_TARGETS=<triple> combined with
-DRUNTIMES_<triple>_LLVM_ENABLE_RUNTIMES=libclc also gets pretty far but
fails due to zlib problems building the LLVM utility 'prepare_builtins'.
I'm not sure what's going on there but I don't think it's required at
this stage. More work would be required to support that option.
This does nothing to change how the host tools are found in order to be
used to actually build the libclc libraries.
Note that under such a configuration the final libclc builtin libraries
are placed in `<build>/runtimes/runtimes-bins/libclc/`, which differs
from a non-runtimes build. The installation location remains the same.
Fixes #124013.
This commit moves the various vload and vstore builtins (including
vload_half, vloada_half, etc.) to the CLC library.
This is almost entirely a code move and does not make any attempt to
clean up or optimize the definitions of these builtins. There is no
change to any of the targets' builtin libraries, except that the vstore
helper rounding functions are now internalized.
Cleanups can come in future work. The new CLC declarations and new
OpenCL wrappers show how these CLC implementations could be defined more
simply. The builtins could probably also be vectorized in future work;
right now all of the 'half' versions for both vload and vstore are
essentially scalarized.
The half variants were missing but are trivial to implement. There were
some incorrect mixed type overloads (step(float, double)) which aren't
in the OpenCL specification and so have been removed.
Like certain other builtins the CLC step function only deals with
identical types. The OpenCL layer is responsible for casting the scalar
argument to a vector.
This commit also trivially vectorizes the CLC function, generating
better bytecode.
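The split of responsibilities described above can be sketched in plain C, with a small struct standing in for an OpenCL float4. The `sketch_*` names are hypothetical; only the step semantics (0.0 if x < edge, otherwise 1.0) come from the OpenCL specification.

```c
#include <assert.h>

/* Hypothetical stand-in for an OpenCL float4. */
typedef struct {
  float s[4];
} sketch_float4;

/* CLC layer, scalar: 0.0 if x < edge, otherwise 1.0. */
float sketch_clc_step(float edge, float x) { return x < edge ? 0.0f : 1.0f; }

/* CLC layer, vector: both arguments have the same type. */
sketch_float4 sketch_clc_step4(sketch_float4 edge, sketch_float4 x) {
  sketch_float4 r;
  for (int i = 0; i < 4; ++i) r.s[i] = sketch_clc_step(edge.s[i], x.s[i]);
  return r;
}

/* OpenCL layer: step(float, float4) splats the scalar edge into a vector
   and forwards to the identical-type CLC function. */
sketch_float4 sketch_step_scalar_edge(float edge, sketch_float4 x) {
  sketch_float4 splat = {{edge, edge, edge, edge}};
  return sketch_clc_step4(splat, x);
}

/* Usage example. */
void sketch_step_check(void) {
  sketch_float4 x = {{0.5f, 1.0f, 2.0f, -3.0f}};
  sketch_float4 r = sketch_step_scalar_edge(1.0f, x);
  assert(r.s[0] == 0.0f && r.s[1] == 1.0f);
  assert(r.s[2] == 1.0f && r.s[3] == 0.0f);
}
```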
This commit provides definitions of builtins with the generic address
space.
One concept to consider is the difference between supporting the generic
address space from the user's perspective and the requirement for libclc
as a compiler implementation detail to define separate generic address
space builtins. In practice a target (like NVPTX) might notionally
support the generic address space, but it's mapped to the same LLVM
target address space as another address space (often the private one).
In such cases libclc must be careful not to define both private and
generic overloads of the same builtin. We track these two concepts
separately, and make the assumption that if the generic address space
does clash with another, it's with the private one. We track the
concepts separately because there are some builtins such as atomics that
are defined for the generic address space but not the private address
space.
Previously the OpenCL address space overloads of remquo would call into
the one and only 'private' CLC remquo. This was an outlier compared with
the other pointer-argumented maths builtins.
This commit moves the definitions of all address space overloads to the
CLC library to give more control over each address space to CLC
implementers.
There are some minor changes to the generated bytecode but it's simply
moving IR instructions around.
This commit moves all OpenCL builtins under a top-level 'opencl'
directory, akin to how the CLC builtins are organized. This new
structure aims to better convey the separation of the two layers and
that 'CLC' is not a subset of OpenCL or a libclc target.
In doing so this commit moves the location of the 'lib' directory to
match CLC: libclc/generic/lib/ becomes libclc/opencl/lib/generic/. This
allows us to remove some special casing in CMake and ensure a common
directory structure.
It also tries to better communicate that the OpenCL headers are
libclc-specific OpenCL headers and should not be confused with or used
as standard OpenCL headers. It does so by ensuring includes are of the
form <clc/opencl/*>. It might be that we don't specifically need the
libclc OpenCL headers and we simply could use clang's built-in
declarations, but we can revisit that later.
Aside from the code move, there is some code formatting and updating a
couple of OpenCL builtin includes to use the readily available gentype
helpers. This allows us to remove some '.inc' files.
This completes the set of maths builtins.
No attempt has been made to vectorize or optimize this code. The implementation is
licensed to SunPro so will probably need to be replaced at some point in
the future anyway. Calls to other builtins have been replaced with the
CLC equivalents, and some bit-hacking was replaced with the fabs
builtin.
This enables file_specific_compile_options to take precedence over
ARG_COMPILE_FLAGS. For example, if we add -fno-slp-vectorize to
COMPILE_OPTIONS of a file, the behavior changes as follows:
* Before this PR: -fno-slp-vectorize is overwritten by -O3, so the SLP
vectorizer remains enabled.
* After this PR: -fno-slp-vectorize overrides -O3, effectively
disabling the SLP vectorizer.
The previous method split the vector data into two halves, and
shuffle_vector concatenated the two results into a vector of the
original size. This PR eliminates the use of shuffle_vector.
The half overloads are trivially identical to the float and double ones.
It didn't seem worth using 'gentype' for the OpenCL layer or CLC
declarations so they're just written out explicitly. It does help avoid
less trivial repetition in the CLC implementation, though.
This commit moves the logb and ilogb builtins to the CLC library.
It simultaneously optimizes them both for vector types and for half
types. Vector types were being scalarized in some cases. Half types were
previously promoting to float, whereas this commit provides them a
native implementation.
Everything passes the OpenCL-CTS.
I had to intuit some magic numbers used by these implementations in
order to generate the half variants. I gave them clearer definitions
derived from what I believe are their actual component numbers, but
named them 'magic' to convey that they weren't derived from first
principles.
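For readers unfamiliar with this kind of bit manipulation, the sketch below shows one common way to read the unbiased exponent straight out of a float's encoding. It is not the libclc implementation: the function name is hypothetical, the constants 23, 0xff and 127 are the standard IEEE 754 binary32 layout rather than the 'magic' numbers mentioned above, and it only covers normal finite inputs (zero, subnormals, infinities and NaN need separate handling).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: ilogb for normal, finite single-precision values. */
int sketch_ilogb_normal(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);           /* reinterpret the float's bits */
  return (int)((bits >> 23) & 0xffu) - 127; /* biased exponent minus bias */
}

/* Usage example: the sign bit does not affect the result. */
void sketch_ilogb_check(void) {
  assert(sketch_ilogb_normal(1.0f) == 0);
  assert(sketch_ilogb_normal(8.0f) == 3);
  assert(sketch_ilogb_normal(0.75f) == -1);
  assert(sketch_ilogb_normal(-10.0f) == 3);
}
```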
This commit also refactors how geometric builtins are defined and
declared, by sharing more helpers. It also removes an unnecessary
gentype-like helper in favour of the more complete math/gentype.inc.
There are no changes to the IR for any of these four builtins.
The 'normalize' builtin will follow in a subsequent commit because it
would involve the addition of missing halfn-type overloads for
completeness.
There was already a __clc_tan in the OpenCL layer. This commit moves the
function over whilst vectorizing it.
The function __clc_tan is no longer a public symbol; it should never
have been one.