This commit moves the various vload and vstore builtins (including
vload_half, vloada_half, etc.) to the CLC library.
This is almost entirely a code move and does not make any attempt to
clean up or optimize the definitions of these builtins. There is no
change to any of the targets' builtin libraries, except that the vstore
helper rounding functions are now internalized.
Cleanups can come in future work. The new CLC declarations and new
OpenCL wrappers show how these CLC implementations could be defined more
simply. The builtins could probably also be vectorized in future work;
right now all of the 'half' versions for both vload and vstore are
essentially scalarized.
The half variants were missing but are trivial to implement. There were
some incorrect mixed type overloads (step(float, double)) which aren't
in the OpenCL specification and so have been removed.
Like certain other builtins the CLC step function only deals with
identical types. The OpenCL layer is responsible for casting the scalar
argument to a vector.
This commit also trivially vectorizes the CLC function, generating
better bytecode.
This completes the set of maths builtins.
No attempt to vectorize or optimize this code. The implementation is
licensed to SunPro so will probably need to be replaced at some point in
the future anyway. Calls to other builtins have been replaced with the
CLC equivalents, and some bit-hacking was replaced with the fabs
builtin.
The half overloads are trivially identical to the float and double ones.
It didn't seem worth using 'gentype' for the OpenCL layer or CLC
declarations so they're just written out explicitly. It does help avoid
less trivial repetition in the CLC implementation, though.
This commit moves the logb and ilogb builtins to the CLC library.
It simultaneously optimizes them both for vector types and for half
types. Vector types were being scalarized in some cases. Half types were
previously promoting to float, whereas this commit provides them a
native implementation.
Everything passes the OpenCL-CTS.
I had to intuit some magic numbers used by these implementations in
order to generate the half variants. I gave them clearer definitions
derived from what I believe are their actual component numbers, but
named them 'magic' to convey that they weren't derived from first
principles.
This commit also refactors how geometric builtins are defined and
declared, by sharing more helpers. It also removes an unnecessary
gentype-like helper in favour of the more complete math/gentype.inc.
There are no changes to the IR for any of these four builtins.
The 'normalize' builtin will follow in a subsequent commit because it
would involve the addition of missing halfn-type overloads for
completeness.
There was already a __clc_tan in the OpenCL layer. This commit moves the
function over whilst vectorizing it.
The function __clc_tan is no longer a public symbol, which should have
never been the case.
This commit moves the remaining FP64 sin and cos helper functions to the
CLC library. As a consequence, it formally moves all sin, cos and sincos
builtins to the CLC library. Previously, the FP16 and FP32 were
nominally there but still in the OpenCL layer while waiting for the FP64
ones.
The FP64 builtins are now vectorized as the FP16 and FP32 ones were
earlier.
One helper table had to be changed. It was previously a table of bytes
loaded by each work-item as uint4. Since this doesn't vectorize well,
the table was split to load two ulongNs per work-item. While this might
not be as efficient on some devices, one mitigating factor is that we
were previously loading 48 bytes per work-item in total, but only using
40 of them. With this commit we only load the bytes we need.
This commit moves the fdim builtin to the CLC library. It simultaneously
simplifies the codegen, unifying it between scalar and vector and
avoiding bithacking for vector types.
This is an alternative to #128506 which doesn't attempt to change the
codegen for fmin and fmax on their way to the CLC library.
The amdgcn and r600 custom definitions of fmin/fmax are now converted to
custom definitions of __clc_fmin and __clc_fmax.
For simplicity, the CLC library doesn't provide vector/scalar versions
of these builtins. The OpenCL layer wraps those up to the vector/vector
versions.
The only codegen change is that non-standard vector/scalar overloads of
fmin/fmax have been removed. We were currently (accidentally,
presumably) providing overloads with mixed elment types such as
fmin(double2, float), fmax(half4, double), etc. The only vector/scalar
overloads in the OpenCL spec are those with scalars of the same element
type as the vector in the first argument.
This commit moves the shuffle and shuffle2 builtins to the CLC library.
In so doing it makes the headers simpler and re-usable for other builtin
layers to hook into the CLC functions, if they wish.
An additional gentype utility has been made available, which provides a
consistent vector-size-or-1 macro for use.
The existing __CLC_VECSIZE is defined but empty which is useful in
certain applications, such as in concatenation with a type to make a
correctly sized scalar or vector type. However, this isn't usable in the
same preprocessor lines when wanting to check for specific vector sizes,
as e.g., '__CLC_VECSIZE == 2' resolves to '== 2' which is invalid. In
local testing this is also useful for the geometric builtins which are
only available for scalar types and vector types of 2, 3, or 4 elements.
No codegen changes are observed, except the internal shuffle/shuffle2
utility functions are no longer made publicly available.
These are the three remaining native builtins not yet ported.
There are elementwise versions of exp10 and tan which correspond to the
intrinsics, which may be preferable to the current versions which route
through other native builtins. Those could be changed in a follow-up if
desired.
Also enable half-precision variants of tgamma, which were previously
missing.
Note that unlike recent work, these builtins are not vectorized as part
of this commit. Ultimately all three call into lgamma_r, which has heavy
control flow (including switch statements) that would be difficult to
vectorize. Additionally the lgamma_r algorithm is copyrighted to SunPro
so may need a rewrite in the future anyway.
There are no codegen changes (to non-SPIR-V targets) with this commit,
aside from the new half builtins.
This commit moves the 'native' builtins that use asm statements to
generate LLVM intrinsics to the CLC library. In doing so it converts
them to use the appropriate elementwise builtin to generate the same
intrinsic; there are no codegen changes to any target except to AMDGPU
targets where `native_log` is no longer custom implemented and instead
used the clang elementwise builtin.
This work forms part of #127196 and indeed with this commit there are no
'generic' builtins using/abusing asm statements - the remaining builtins
are specific to the amdgpu and r600 targets.
The function was already nominally in the CLC namespace; this commit
just moves it over.
This commit also vectorizes the builtin to avoid scalarization.
These functions were already nominally in the CLC library.
Similar to others, these builtins are now vectorized and are not broken
down into scalar types.
These functions were already nominally in the CLC namespace; this commit
just formally moves them over.
Note that 'half' versions of these CLC functions are now provided.
Previously the corresponding OpenCL builtins would forward directly to
the 'float' versions of the CLC builtins. Now the OpenCL builtins call
the 'half' CLC builtins, which themselves call the 'float' CLC versions.
This keeps the interface between the OpenCL and CLC libraries neater and
keeps the CLC library self-contained.
No changes to the generated code for non-SPIR-V targets is observed.
As with other work in this area, these builtins are now vectorized.
A further table has been split into two. There was discrepancy between
comments above the table describing the values as "lead" and "tail" and
variables taken from the table called "head" and "tail", so these have
been unified as head/tail.
These four functions all related in that they share tables and helper
functions. Furthermore, the acosh and atanh builtins call log1p.
As with other work in this area, these builtins are now vectorized. To
enable this, there are new table accessor functions which return a
vector of table values using a vector of indices. These are internally
scalarized, in the absence of gather operations. Some tables which were
tables of multiple entries (e.g., double2) are split into two separate
"low" and "high" tables. This might affect the performance of memory
operations but are hopefully mitigated by better codegen overall.
Similar to d46a6999, this commit simultaneously moves these three
functions to the CLC library and optimizes them for vector types by
avoiding scalarization.
This commit moves most of the sincos helper functions to the CLC
library. It simultaneously vectorizes them with the aim to increase
performance for vector types by avoiding scalarization.
Some helpers for double types remain as they use various features not
yet ready, like 'fract' which in turn relies on 'fmin'; neither of these
are in the CLC library. They also use table lookups and type punning
which don't translate well to vector versions.
As a proof of concept, float and half versions of the sin and cos
builtins are now vectorized and use the CLC helpers to do so. They
remain in the OpenCL layer but will be simpler to move to the CLC
library when the double versions are ready.
This was already nominally in the CLC library; this commit just formally
moves it over. It simultaneously optimizes it for vector types by
avoiding scalarization.
This also adds missing half variants to certain targets.
It also optimizes some targets' implementations to perform the operation
directly in vector types, as opposed to scalarizing.
This is fairly straightforward for most targets.
We use the element-wise sqrt builtin by default. We also remove a legacy
pre-filtering of the input argument, which the intrinsic now officially
handles.
AMDGPU provides its own implementation of sqrt for double types. This
commit moves this into the implementation of CLC sqrt. It uses weak
linkage on the 'default' CLC sqrt to allow AMDGPU to only override the
builtin for the types it cares about.
This function was already conceptually in the CLC namespace - this just
formally moves it over.
Note however that this commit marks a change in how libclc functions may
be overridden by targets.
Until now we have been using a purely build-system-based approach where
targets could register identically-named files which took responsibility
for the implementation of the builtin in its entirety.
This system wasn't well equipped to deal with AMD's overriding of
__clc_ldexp for only a subset of types, and furthermore conditionally on
a pre-defined macro.
One option for handling this would be to require AMD to duplicate code
for the versions of __clc_ldexp it's *not* interested in overriding. We
could also make it easier for targets to re-define CLC functions through
macros or .inc files. Both of these have obvious downsides. We could
also keep AMD's overriding in the OpenCL layer and bypass CLC
altogether, but this has limited use.
We could use weak linkage on the "base" implementations of CLC
functions, and allow targets to opt-in to providing their own
implementations on a much finer granularity. This commit supports this
as a proof of concept; we could expand it to all CLC builtins if
accepted.
Note that the existing filename-based "claiming" approach is still in
effect, so targets have to name their overrides differently to have both
files compiled. This could also be refined.
This commit also enables fp16 log, which was previously missing.
Other than that, no changes to codegen for AMDGPU/Nvidia targets.
Note that for simplicity this commit doesn't try to refactor or optimize
the implementations. Notably, each log is only implementated for scalar
types; vector types are scalarized. It doesn't look too difficult to
make the implementations suitable for vector codegen, so I'll try that
in a future commit.
There's also an unused implementation of log in clc_log_base.h, whereas
the implementation currently used by libclc targets re-uses log2 with an
additional multiplication. That should also be cleaned up as on first
inspection it looks a more optimal implementation, though it would have
to be checked against the OpenCL CTS for good measure.
This builtin is a little more involved than others as targets deal with
fma in various different ways.
Fundamentally, the CLC __clc_fma builtin compiles to
__builtin_elementwise_fma, which compiles to the @llvm.fma intrinsic.
However, in the case of fp32 fma some targets call the __clc_sw_fma
function, which provides a software implementation of the builtin. This
in principle is controlled by the __CLC_HAVE_HW_FMA32 macro and may be a
runtime decision, depending on how the target defines that macro.
All targets build the CLC fma functions for all types. This is to the
CLC library can have a reliable internal implementation for its own
purposes.
For AMD/NVPTX targets there are no meaningful changes to the generated
LLVM bytecode. Some blocks of code have moved around, which confounds
llvm-diff.
For the clspv and SPIR-V/Mesa targets, only fp32 fma is of interest. Its
use in libclc is tightly controlled by checking __CLC_HAVE_HW_FMA32
first. This can either be a compile-time constant (1, for clspv) or a
runtime function for SPIR-V/Mesa.
The SPIR-V/Mesa target only provided fp32 fma in the OpenCL layer. It
unconditionally mapped that to the __clc_sw_fma builtin, even though the
generic version in theory had a runtime toggle through
__CLC_HAVE_HW_FMA32 specifically for that target. Callers of fma,
though, would end up using the ExtInst fma, *not* calling the _Z3fmafff
function provided by libclc.
This commit keeps this system in place in the OpenCL layer, by mapping
fma to __clc_sw_fma. Where other builtins would previously call fma
(i.e., result in the ExtInst), they now call __clc_fma. This function
checks the __CLC_HAVE_HW_FMA32 runtime toggle, which selects between the
slow version or the quick version. The quick version is the LLVM fma
intrinsic which llvm-spirv translates to the ExtInst.
The clspv target had its own software implementation of fp32 fma, which
it called unconditionally - even though __CLC_HAVE_HW_FMA32 is 1 for
that target. This is potentially just so its library ships a software
version which it can fall back on. In the OpenCL layer, the target
doesn't provide fp64 fma, and maps fp16 fma to fp32 mad.
This commit keeps this system roughly in place: in the OpenCL layer it
maps fp32 fma to __clc_sw_fma, and fp16 fma to mad. Where builtins would
previously call into fma, they now call __clc_fma, which compiles to the
LLVM intrinsic. If this goes through a translation to SPIR-V it will
become the fma ExtInst, or the intrinsic could be replaced by the
_Z3fmafff software implementation.
The clspv and SPIR-V/Mesa targets could potentially be cleaned up later,
depending on their needs.
This commit moves the frexp builtin to the CLC library.
It simultaneously optimizes the code generated for half vectors, which
was previously scalarizing and casting up to float. With this commit it
still casts up to float, but keeps it in the vector form.
The "generic" unary_(def|decl)_with_ptr files are intended to be re-used
by the sincos and fract builtins in the future as they share an
identical type signature.
This commit moves the sign builtin's implementation to the CLC library.
It simultaneously optimizes it (for vector types) by removing
control-flow from the implementation.
The __CLC_INTERNAL preprocessor definition has been repurposed (without
the leading underscores) to be passed when building the internal CLC
library. It was only used in one other place to guard an extra maths
preprocessor definition, which we can do unconditionally.
This commit moves the rotate builtin to the CLC library.
It also optimizes rotate(x, n) to generate the @llvm.fshl(x, x, n)
intrinsic, for both scalar and vector types. The previous implementation
was too cautious in its handling of the shift amount; the OpenCL rules
state that the shift amount is always treated as an unsigned value
modulo the bitwidth.
This commit moves the mad_sat builtin to the CLC library.
It also optimizes it for vector types by avoiding scalarization. To help
do this it transforms the previous control-flow code into vector select
code. This has also been done for the scalar versions for simplicity.
This commit moves over the OpenCL clz, hadd, mad24, mad_hi, mul24,
mul_hi, popcount, rhadd, and upsample builtins to the CLC library.
This commit also optimizes the vector forms of the mul_hi and upsample
builtins to consistently remain in vector types, instead of recursively
splitting vectors down to the scalar form.
The OpenCL mad_hi builtin wasn't previously publicly available from the
CLC libraries, as it was hash-defined to mul_hi in the header files.
That issue has been fixed, and mad_hi is now exposed.
The custom AMD implementation/workaround for popcount has been removed
as it was only required for clang < 7.
There are still two integer functions which haven't been moved over. The
OpenCL mad_sat builtin uses many of the other integer builtins, and
would benefit from optimization for vector types. That can take place in
a follow-up commit. The rotate builtin could similarly use some more
dedicated focus, potentially using clang builtins.
Using the `__builtin_elementwise_(add|sub)_sat` functions allows us to
directly optimize to the desired intrinsic, and avoid scalarization for
vector types.
This commit moves the implementation of the copysign builtin to the CLC
library.
It simultaneously optimizes it for vector types by avoiding
scalarization. It does so by using the __builtin_elementwise_copysign
clang builtins, which can handle vector types.
It also fixes a bug in the half/fp16 implementation of the builtin. This
version was using an incorrect mask (0x7FFFF instead of 0x7FFF) and was
thus preserving the original sign bit, rather than masking it out.
There were two implementations of this - one that implemented nextafter
in software, and another that called a clang builtin. No in-tree targets
called the builtin, so all targets build the software version. The
builtin version has been removed, and the software version has been
renamed to be the "default".
This commit also optimizes nextafter, to avoid scalarization as much as
possible. Note however that the (CLC) relational builtins still
scalarize; those will be optimized in a separate commit.
Since nextafter is used by some convert_type builtins, the diff to IR
codegen is not limited to the builtin itself.
All targets build `__clc_mad` -- even SPIR-V targets -- since it
compiles to the optimal `llvm.fmuladd` intrinsic. There is no change to
the bytecode generated for non-SPIR-V targets.
The `mix` builtin, which is implemented as a wrapper around `mad`, is
left as an OpenCL-layer wrapper of `__clc_mad`. I don't know if it's
worth having a specific CLC version of `mix`.
The changes to the other CLC files/functions are moving uses of `mad` to
`__clc_mad`, and reformatting. There is an additional instance of
`trunc` becoming `__clc_trunc`, which was missed before.
Missing half variants were also added.
The builtins are now consistently emitted in vector form (i.e., with a
splat of the literal to the appropriate vector size).