Changes in this PR:
* Declare most of the workitem functions in the clc and opencl folders.
* Call the clc workitem function from the corresponding OpenCL workitem
function (see the sketch after this list).
* Move ptx-nvidiacl workitem built-in implementations into clc.
* Move a few amdgcn workitem built-in implementations into clc.
* Include only needed headers in OpenCL workitem functions.
* Implement get_local_linear_id, get_max_sub_group_size,
get_num_sub_groups, get_sub_group_id, get_sub_group_local_id and
get_sub_group_size for ptx-nvidiacl.
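A minimal sketch of the new wrapper pattern, assuming the CLC-side
functions follow the __clc_ prefix convention (names inferred from the
description, not copied from the tree):

_CLC_DEF _CLC_OVERLOAD size_t get_local_id(uint dim) {
  // The OpenCL entry point simply forwards to the CLC implementation.
  return __clc_get_local_id(dim);
}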
llvm-diff shows this PR adds a few new symbols to nvptx64--nvidiacl.bc.
llvm-diff shows no change to amdgcn--amdhsa.bc, nvptx--.bc and
nvptx64--.bc.
Rename to FUNCTION if it is for a declaration, since it doesn't make much
sense to use __CLC_FUNCTION for an OpenCL function declaration. Rename to
__IMPL_FUNCTION if it is for a definition, since in some cases the
implementation function isn't a clc_* function.
In the OpenCL Extended Instruction Set Specification, nancode can be a
signed integer or a vector of signed integer values.
This PR has no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc
because the newly added clc functions are not used in the OpenCL library.
With this PR, if we have a customized implementation for the scalar case
or for vector length 2, we don't need to write new macros, e.g.
https://github.com/intel/llvm/blob/fb18321705f6/libclc/clc/include/clc/clcmacro.h#L15
Undef __HALF_ONLY, __FLOAT_ONLY and __DOUBLE_ONLY at the end of
clc/include/clc/math/gentype.inc
llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
For a kernel such as
kernel void foo(__global double3 *z) {
  double3 x = {0.6631661088, 0.6612268107, 0.1513627528};
  int3 y = {-1980459213, -660855407, 615708204};
  *z = pown(x, y);
}
we were not storing anything to z, because the implementation of pown
relied on a floating-point-to-integer conversion where the
floating-point value was outside of the integer's range. Although LLVM
IR permits that operation so long as we end up ignoring its result --
that is the general rule for poison -- one thing we are not permitted to
do is have conditional branches that depend on it, and through the call
to __clc_ldexp, we did have that.
To fix this, rather than changing expv at the end to INFINITY/0, we can
change v at the start to values that we know will produce INFINITY/0
without performing such out-of-range conversions.
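A minimal sketch of the approach, assuming a pown built on
exp2(y * log2(|x|)); the helper name and bounds below are illustrative,
not the exact in-tree code:

float clamp_pown_exponent(float v) {
  // Any |v| at or beyond this bound already makes exp2 return INFINITY
  // (overflow) or 0 (underflow), so clamping preserves the result while
  // keeping the float-to-int conversion inside __clc_ldexp in range.
  return clamp(v, -200.0f, 200.0f);
}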
Tested with
clang --target=nvptx64 -S -O3 -o - test.cl \
-Xclang -mlink-builtin-bitcode \
-Xclang runtimes/runtimes-bins/libclc/nvptx64--.bc
A grep showed that this exact same code existed in three more places, so
I changed it there too, though I did not do a broader search for other
similar code that potentially has the same problem.
Also delete unused _CLC_DEFINE_BINARY_BUILTIN_WITH_SCALAR_SECOND_ARG,
_CLC_DEFINE_UNARY_BUILTIN_FP16 and _CLC_DEFINE_BINARY_BUILTIN_FP16.
llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
This commit moves the various vload and vstore builtins (including
vload_half, vloada_half, etc.) to the CLC library.
This is almost entirely a code move and does not make any attempt to
clean up or optimize the definitions of these builtins. There is no
change to any of the targets' builtin libraries, except that the vstore
helper rounding functions are now internalized.
Cleanups can come in future work. The new CLC declarations and new
OpenCL wrappers show how these CLC implementations could be defined more
simply. The builtins could probably also be vectorized in future work;
right now all of the 'half' versions for both vload and vstore are
essentially scalarized.
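A hedged sketch of the wrapper shape (the exact CLC-side names and the
macro machinery that stamps out every type, width and address space
differ in-tree):

_CLC_OVERLOAD _CLC_DEF float4 vload4(size_t offset, const __global float *p) {
  // The OpenCL builtin forwards directly to its CLC counterpart.
  return __clc_vload4(offset, p);
}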
The half variants were missing but are trivial to implement. There were
some incorrect mixed-type overloads (e.g. step(float, double)) which
aren't in the OpenCL specification and so have been removed.
Like certain other builtins, the CLC step function only deals with
identical types. The OpenCL layer is responsible for casting the scalar
argument to a vector.
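A minimal sketch of that split, assuming the CLC function is spelled
__clc_step:

_CLC_OVERLOAD _CLC_DEF float4 step(float edge, float4 x) {
  // Splat the scalar edge so the CLC function sees identical types.
  return __clc_step((float4)edge, x);
}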
This commit also trivially vectorizes the CLC function, generating
better bytecode.
This commit provides definitions of builtins with the generic address
space.
One concept to consider is the difference between supporting the generic
address space from the user's perspective and the requirement for libclc
as a compiler implementation detail to define separate generic address
space builtins. In practice a target (like NVPTX) might notionally
support the generic address space, but it's mapped to the same LLVM
target address space as another address space (often the private one).
In such cases libclc must be careful not to define both private and
generic overloads of the same builtin. We track these two concepts
separately, and make the assumption that if the generic address space
does clash with another, it's with the private one. We track the
concepts separately because there are some builtins such as atomics that
are defined for the generic address space but not the private address
space.
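A sketch of what that care amounts to, with a hypothetical guard macro
(libclc's actual spelling may differ):

#if _CLC_DISTINCT_GENERIC_AS_SUPPORTED
// Only emitted when 'generic' maps to its own LLVM address space;
// otherwise this would collide with the private overload.
_CLC_OVERLOAD _CLC_DEF float fract(float x, generic float *iptr) {
  float i = __clc_floor(x);
  *iptr = i;
  return __clc_fmin(x - i, 0x1.fffffep-1f);
}
#endif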
Previously the OpenCL address space overloads of remquo would call into
the one and only 'private' CLC remquo. This was an outlier compared with
the other maths builtins that take pointer arguments.
This commit moves the definitions of all address space overloads to the
CLC library to give more control over each address space to CLC
implementers.
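A sketch of the per-address-space CLC entry points this enables
(signatures illustrative):

_CLC_OVERLOAD _CLC_DECL float __clc_remquo(float x, float y, private int *q);
_CLC_OVERLOAD _CLC_DECL float __clc_remquo(float x, float y, local int *q);
_CLC_OVERLOAD _CLC_DECL float __clc_remquo(float x, float y, global int *q);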
There are some minor changes to the generated bytecode but it's simply
moving IR instructions around.
This completes the set of maths builtins.
No attempt has been made to vectorize or optimize this code. The
implementation is licensed to SunPro so will probably need to be
replaced at some point in the future anyway. Calls to other builtins
have been replaced with the CLC equivalents, and some bit-hacking was
replaced with the fabs builtin.
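A sketch of the kind of replacement described (the helper name is
illustrative; the constant is the IEEE-754 single-precision sign mask):

_CLC_OVERLOAD _CLC_DEF float drop_sign(float x) {
  // before: return as_float(as_int(x) & 0x7fffffff);
  return __clc_fabs(x);
}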
The previous method split the vector data into two halves, and
shuffle_vector concatenated the two results into a vector of the
original size. This PR eliminates the use of shuffle_vector.
The half overloads are trivially identical to the float and double ones.
Using 'gentype' didn't seem worth it for the OpenCL layer or the CLC
declarations, so they're just written out explicitly. It does help avoid
less trivial repetition in the CLC implementation, though.
This commit moves the logb and ilogb builtins to the CLC library.
It simultaneously optimizes them both for vector types and for half
types. Vector types were being scalarized in some cases. Half types were
previously promoted to float, whereas this commit provides them a native
implementation.
Everything passes the OpenCL-CTS.
I had to intuit some magic numbers used by these implementations in
order to generate the half variants. I gave them clearer definitions
derived from what I believe are their actual component numbers, but
named them 'magic' to convey that they weren't derived from first
principles.
This commit also refactors how geometric builtins are defined and
declared, by sharing more helpers. It also removes an unnecessary
gentype-like helper in favour of the more complete math/gentype.inc.
There are no changes to the IR for any of these four builtins.
The 'normalize' builtin will follow in a subsequent commit because it
would involve the addition of missing halfn-type overloads for
completeness.
There was already a __clc_tan in the OpenCL layer. This commit moves the
function over whilst vectorizing it.
The function __clc_tan is no longer a public symbol, which should never
have been the case.
This commit moves the remaining FP64 sin and cos helper functions to the
CLC library. As a consequence, it formally moves all sin, cos and sincos
builtins to the CLC library. Previously, the FP16 and FP32 ones were
nominally there but still in the OpenCL layer, waiting for the FP64
ones.
The FP64 builtins are now vectorized as the FP16 and FP32 ones were
earlier.
One helper table had to be changed. It was previously a table of bytes
loaded by each work-item as uint4. Since this doesn't vectorize well,
the table was split to load two ulongNs per work-item. While this might
not be as efficient on some devices, one mitigating factor is that we
were previously loading 48 bytes per work-item in total, but only using
40 of them. With this commit we only load the bytes we need.
These two tables were being used by the CLC library but their
definitions still remained in the OpenCL layer. This worked out after
linking the two together but is a layering violation.
This had a side effect of removing the two table getters from the final
bytecode library, which were never intended to be exposed.
These two tables should probably be refactored to allow better
vectorization of log/log2/log10, but that is left to future work.
This commit moves the fdim builtin to the CLC library. It simultaneously
simplifies the codegen, unifying it between scalar and vector and
avoiding bithacking for vector types.
We had two ways of achieving the same thing. This commit removes
unary_builtin.inc in favour of the approach combining gentype.inc with
unary_def.inc.
There is no change to the codegen for any target.
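For reference, a sketch of the retained idiom, pairing a per-builtin
body with the shared type iterator (the include paths follow the file
names in this commit):

#define __CLC_BODY <clc/math/unary_def.inc>
#include <clc/math/gentype.inc>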
This is an alternative to #128506 which doesn't attempt to change the
codegen for fmin and fmax on their way to the CLC library.
The amdgcn and r600 custom definitions of fmin/fmax are now converted to
custom definitions of __clc_fmin and __clc_fmax.
For simplicity, the CLC library doesn't provide vector/scalar versions
of these builtins. The OpenCL layer wraps those up to the vector/vector
versions.
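A minimal sketch of that wrapping, assuming the obvious splat:

_CLC_OVERLOAD _CLC_DEF float2 fmin(float2 x, float y) {
  // Splat the scalar and defer to the vector/vector CLC builtin.
  return __clc_fmin(x, (float2)y);
}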
The only codegen change is that non-standard vector/scalar overloads of
fmin/fmax have been removed. We had been (accidentally, presumably)
providing overloads with mixed element types such as fmin(double2,
float), fmax(half4, double), etc. The only vector/scalar overloads in
the OpenCL spec are those whose scalar has the same element type as the
vector in the first argument.
This commit moves the shuffle and shuffle2 builtins to the CLC library.
In so doing it makes the headers simpler and re-usable for other builtin
layers to hook into the CLC functions, if they wish.
An additional gentype utility has been made available, which provides a
consistent vector-size-or-1 macro for use.
The existing __CLC_VECSIZE is defined but empty, which is useful in
certain applications, such as concatenating it with a type to form a
correctly sized scalar or vector type. However, it isn't usable on the
same preprocessor lines when checking for specific vector sizes, as
e.g. '__CLC_VECSIZE == 2' expands to '== 2', which is invalid. In local
testing the new macro is also useful for the geometric builtins, which
are only available for scalar types and vector types of 2, 3, or 4
elements.
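A sketch of the intended use, assuming the new macro is spelled
__CLC_VECSIZE_OR_1 (the name is inferred from its description):

#if __CLC_VECSIZE_OR_1 == 1 || __CLC_VECSIZE_OR_1 == 2 || \
    __CLC_VECSIZE_OR_1 == 3 || __CLC_VECSIZE_OR_1 == 4
/* define the geometric builtin for this scalar or vector size */
#endif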
No codegen changes are observed, except the internal shuffle/shuffle2
utility functions are no longer made publicly available.
Some files were accidentally given two copyright headers. Another was
missing one. This commit also converts that file's dos line endings to
unix ones and reformats a comment.
On devices that do not support denormals, a denormal value can compare
equal to zero. This leads to results that do not match the CTS
expectation, whether denormals are supported or not.
For example, for 0x1.008p-140 we get {0x1.008p-140, 0}, while the CTS
expects {0x1.008p-1, -139} when denormals are supported, or {0, 0} when
they are not (flushed to zero).
Ref #129871
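A hedged sketch of a denormal-safe zero test (illustrative; not
necessarily the exact in-tree fix):

_CLC_OVERLOAD _CLC_DEF int is_true_zero(float x) {
  // Compare bit patterns rather than values: on a flush-to-zero device
  // a denormal compares equal to 0.0f, but its bit pattern is nonzero.
  return (as_uint(x) & 0x7fffffffu) == 0u;
}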
clspv uses a better implementation that does not rely on a wider type
when one is not available.
Add a dummy implementation for mul_hi to avoid overriding the clspv
implementation with the one in libclc.
These are the three remaining native builtins not yet ported.
There are elementwise versions of exp10 and tan which correspond to the
intrinsics, which may be preferable to the current versions which route
through other native builtins. Those could be changed in a follow-up if
desired.
Also enable half-precision variants of tgamma, which were previously
missing.
Note that unlike recent work, these builtins are not vectorized as part
of this commit. Ultimately all three call into lgamma_r, which has heavy
control flow (including switch statements) that would be difficult to
vectorize. Additionally the lgamma_r algorithm is copyrighted to SunPro
so may need a rewrite in the future anyway.
There are no codegen changes (to non-SPIR-V targets) with this commit,
aside from the new half builtins.
This commit moves the 'native' builtins that use asm statements to
generate LLVM intrinsics to the CLC library. In doing so it converts
them to use the appropriate elementwise builtin to generate the same
intrinsic; there are no codegen changes to any target except AMDGPU
targets, where `native_log` is no longer custom implemented and instead
uses the clang elementwise builtin.
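A before/after sketch of the conversion (names illustrative):

// before: bind the symbol to the LLVM intrinsic via an asm label
_CLC_OVERLOAD float __clc_native_log2(float x) __asm("llvm.log2.f32");

// after: the clang elementwise builtin emits the same intrinsic
_CLC_OVERLOAD _CLC_DEF float __clc_native_log2(float x) {
  return __builtin_elementwise_log2(x);
}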
This work forms part of #127196 and indeed with this commit there are no
'generic' builtins using/abusing asm statements - the remaining builtins
are specific to the amdgpu and r600 targets.
The function was already nominally in the CLC namespace; this commit
just moves it over.
This commit also vectorizes the builtin to avoid scalarization.
Splitting the 'ln_tbl' into two in db98e292 wasn't done thoroughly
enough as some references to the old table still remained. This commit
fixes the unresolved references by updating to the new split table.
These functions were already nominally in the CLC library.
Similar to others, these builtins are now vectorized and are not broken
down into scalar types.