64 Commits

Author SHA1 Message Date
Wenju He
76bb98746b
[NFC][libclc] add missing __CLC_ prefix all internal macros (#153523)
This unifies naming scheme of macros to address review comment
https://github.com/intel/llvm/pull/19779#discussion_r2272194357

math constant value macros are not changed, e.g.
`#define AU0 -9.86494292470009928597e-03`
2025-08-18 07:21:04 +08:00
Fraser Cormack
586cacdbdd
[libclc] Optimize generic CLC fmin/fmax (#128506)
With this commit, the CLC fmin/fmax builtins use clang's
__builtin_elementwise_(min|max)imumnum which helps us generate LLVM
minimumnum/maximumnum intrinsics directly. These intrinsics uniformly
select the non-NaN input over the (quiet or signalling) NaN input, which
corresponds to what the OpenCL CTS tests.

These intrinsics maintain the vector types, as opposed to scalarizing,
which was previously happening. This commit therefore helps to optimize
codegen for those targets.

Note that there is ongoing discussion regarding how these builtins
should handle signalling NaNs in the OpenCL specification and whether
they should be able to return a quiet NaN as per the IEEE behaviour. If
the specification and/or CTS is ever updated to allow or mandate
returning a qNAN, these builtins could/should be updated to use
__builtin_elementwise_(min|max)num instead which would lower to LLVM
minnum/maxnum intrinsics.

The SPIR-V targets maintain the old implementations, as the LLVM ->
SPIR-V translator can't currently handle the LLVM intrinsics. The
implementation has been simplifies to consistently use clang builtins,
as opposed to before where the half version was explicitly defined.

[1] https://github.com/KhronosGroup/OpenCL-CTS/pull/2285
2025-07-29 13:21:42 +01:00
Wenju He
bcd0d97224
[libclc] Simplify unary_def_scalarize.inc's use in __clc_erf/erfc/tgamma (#150181)
Also delete unary_def_via_fp32.inc. There are small changes in
amdgcn--amdhsa.bc due to vector conversion is scalarized, e.g.
  %2 = fpext <4 x half> %0 to <4 x float>
  %3 = extractelement <4 x float> %2, i64 0
  %4 = tail call float @llvm.fabs.f32(float %3)
->
  %2 = extractelement <4 x half> %0, i64 0
  %3 = tail call half @llvm.fabs.f16(half %2)
  %4 = fpext half %3 to float
2025-07-29 08:25:58 +08:00
Wenju He
cf36f49c04
[libclc] Enable clang fp reciprocal in clc_native_divide/recip/rsqrt/tan (#149269)
The pragma adds `arcp` flag to `fdiv` instruction in these functions.
The flag can provide better performance.
2025-07-18 07:50:35 +08:00
Wenju He
fa9cd47328
[NFC][libclc] Rename __CLC_FUNCTION to either FUNCTION or __IMPL_FUNCTION (#146999)
Rename to FUNCTION if it is for declaration, since it doesn't make much
sense to use __CLC_FUNCTION for OpenCL function declaration. Rename to
__IMPL_FUNCTION if it is for definition, since in some cases
implementation function isn't clc_* function.
2025-07-07 08:07:51 +08:00
Wenju He
b0e6faae08
[libclc] Add missing clc_lgamma_r with generic address space pointer arg (#146495)
There is no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc because
__opencl_c_generic_address_space is not defined for them.
2025-07-02 08:28:01 +08:00
Wenju He
93fe52f19e
[libclc] Add __clc_nan implementation with signed nancode argument (#146485)
In OpenCL Extended Instruction Set Specification, nancode can be signed
integer or vector of signed integers values.
This PR has no change to amdgcn--amdhsa.bc and nvptx64--nvidiacl.bc
because the newly added clc functions are not used in OpenCL library.
2025-07-02 08:27:46 +08:00
Wenju He
338dee0742
[NFC][libclc] Refactor _CLC_*_VECTORIZE macros to functions in .inc files (#145678)
With this PR, if we have customized implementation for scalar or vector
length = 2, we don't need to write new macros, e.g.
https://github.com/intel/llvm/blob/fb18321705f6/libclc/clc/include/clc/clcmacro.h#L15

Undef __HALF_ONLY, __FLOAT_ONLY and __DOUBLE_ONLY at the end of
clc/include/clc/math/gentype.inc

llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
2025-06-30 17:19:19 +08:00
Harald van Dijk
46ee7f1908
[libclc] Avoid out-of-range float-to-int. (#145698)
For a kernel such as

    kernel void foo(__global double3 *z) {
      double3 x = {0.6631661088,0.6612268107,0.1513627528};
      int3 y = {-1980459213,-660855407,615708204};
      *z = pown(x, y);
    }

we were not storing anything to z, because the implementation of pown
relied on an floating-point-to-integer conversion where the
floating-point value was outside of the integer's range. Although in
LLVM IR we permit that operation so long as we end up ignoring its
result -- that is the general rule for poison -- one thing we are not
permitted to do is have conditional branches that depend on it, and
through the call to __clc_ldexp, we did have that.

To fix this, rather than changing expv at the end to INFINITY/0, we can
change v at the start to values that we know will produce INFINITY/0
without performing such out-of-range conversions.

Tested with

    clang --target=nvptx64 -S -O3 -o - test.cl \
      -Xclang -mlink-builtin-bitcode \
      -Xclang runtimes/runtimes-bins/libclc/nvptx64--.bc

A grep showed that this exact same code existed in three more places, so
I changed it there too, though I did not do a broader search for other
similar code that potentially has the same problem.
2025-06-25 16:37:06 +01:00
Wenju He
13a9b86f62
[NFC][libclc] Replace and delete _CLC_DEFINE_UNARY/BINARY/TERNARY_BUILTIN macros (#145458)
Also delete unused _CLC_DEFINE_BINARY_BUILTIN_WITH_SCALAR_SECOND_ARG,
_CLC_DEFINE_UNARY_BUILTIN_FP16 and _CLC_DEFINE_BINARY_BUILTIN_FP16.

llvm-diff shows no change to nvptx64--nvidiacl.bc and amdgcn--amdhsa.bc
2025-06-25 13:48:53 +08:00
Fraser Cormack
94142d9bb0
[libclc] Support the generic address space (#137183)
This commit provides definitions of builtins with the generic address
space.

One concept to consider is the difference between supporting the generic
address space from the user's perspective and the requirement for libclc
as a compiler implementation detail to define separate generic address
space builtins. In practice a target (like NVPTX) might notionally
support the generic address space, but it's mapped to the same LLVM
target address space as another address space (often the private one).

In such cases libclc must be careful not to define both private and
generic overloads of the same builtin. We track these two concepts
separately, and make the assumption that if the generic address space
does clash with another, it's with the private one. We track the
concepts separately because there are some builtins such as atomics that
are defined for the generic address space but not the private address
space.
2025-05-21 17:50:00 +01:00
Fraser Cormack
0bc7f41db8
[libclc] Move all remquo address spaces to CLC library (#140871)
Previously the OpenCL address space overloads of remquo would call into
the one and only 'private' CLC remquo. This was an outlier compared with
the other pointer-argumented maths builtins.

This commit moves the definitions of all address space overloads to the
CLC library to give more control over each address space to CLC
implementers.

There are some minor changes to the generated bytecode but it's simply
moving IR instructions around.
2025-05-21 11:26:04 +01:00
Fraser Cormack
c27e10fa65
[libclc] Mov erf & erfc to CLC library (#140524)
This completes the set of maths builtins.

No attempt to vectorize or optimize this code. The implementation is
licensed to SunPro so will probably need to be replaced at some point in
the future anyway. Calls to other builtins have been replaced with the
CLC equivalents, and some bit-hacking was replaced with the fabs
builtin.
2025-05-19 11:32:35 +01:00
Wenju He
299a278db1
[libclc] Improving vector code generated from scalar code (#140008)
The previous method splits vector data into two halves. shuffle_vector
concatenates the two results into a vector data of original size. This
PR eliminates the use of shuffle_vector.
2025-05-16 10:20:32 +01:00
Fraser Cormack
95c683fc1b
[libclc] Move logb/ilogb to CLC library; optimize (#128028)
This commit moves the logb and ilogb builtins to the CLC library.

It simultaneously optimizes them both for vector types and for half
types. Vector types were being scalarized in some cases. Half types were
previously promoting to float, whereas this commit provides them a
native implementation.

Everything passes the OpenCL-CTS.

I had to intuit some magic numbers used by these implementations in
order to generate the half variants. I gave them clearer definitions
derived from what I believe are their actual component numbers, but
named them 'magic' to convey that they weren't derived from first
principles.
2025-05-13 11:47:35 +01:00
Fraser Cormack
dd89af7f55
[libclc] Move 'half' builtins to CLC library (#139563)
There are no changes to the generated bytecode.
2025-05-12 17:32:05 +01:00
Fraser Cormack
87978ea272
[libclc] Move tan to the CLC library (#139547)
There was already a __clc_tan in the OpenCL layer. This commit moves the
function over whilst vectorizing it.

The function __clc_tan is no longer a public symbol, which should have
never been the case.
2025-05-12 14:55:27 +01:00
Fraser Cormack
4f107cd8f8
[libclc] Move sin, cos & sincos to CLC library (#139527)
This commit moves the remaining FP64 sin and cos helper functions to the
CLC library. As a consequence, it formally moves all sin, cos and sincos
builtins to the CLC library. Previously, the FP16 and FP32 were
nominally there but still in the OpenCL layer while waiting for the FP64
ones.

The FP64 builtins are now vectorized as the FP16 and FP32 ones were
earlier.

One helper table had to be changed. It was previously a table of bytes
loaded by each work-item as uint4. Since this doesn't vectorize well,
the table was split to load two ulongNs per work-item. While this might
not be as efficient on some devices, one mitigating factor is that we
were previously loading 48 bytes per work-item in total, but only using
40 of them. With this commit we only load the bytes we need.
2025-05-12 11:32:15 +01:00
Fraser Cormack
c5b750f5af [libclc] Move log2/log10 tables to CLC tables impl
These two tables were being used by the CLC library but their
definitions still remained in the OpenCL layer. This worked out after
linking the two together but is a layering violation.

This had a side effect of removing the two table getters from the final
bytecode library, which were never intended to be exposed.

These two tables should probably be refactored so allow better
vectorization of log/log2/log10, but that is left to future work.
2025-05-01 10:23:28 +01:00
Fraser Cormack
6c4dd8d1d2
[libclc] Move minmag & maxmag to the CLC library (#137982) 2025-05-01 09:43:40 +01:00
Fraser Cormack
75f040ab3e
[libclc] Clean up unnecessary #undef __CLC_BODYs (#137959)
This macro is automatically undefined by the various gentype-like
helpers.
2025-04-30 16:13:04 +01:00
Fraser Cormack
694a42f018
[libclc] Avoid casting NANs & literals to 'gentype' (#137824)
By having these already defined as type 'gentype' we can avoid
unnecessary casting.
2025-04-29 17:33:21 +01:00
Fraser Cormack
ea688c031e
[libclc] Move fdim to CLC library; simplify (#137811)
This commit moves the fdim builtin to the CLC library. It simultaneously
simplifies the codegen, unifying it between scalar and vector and
avoiding bithacking for vector types.
2025-04-29 16:41:07 +01:00
Fraser Cormack
837d5a740f
[libclc][NFC] Remove unary_builtin.inc (#137656)
We had two ways of achieving the same thing. This commit removes
unary_builtin.inc in favour of the approach combining gentype.inc with
unary_def.inc.

There is no change to the codegen for any target.
2025-04-29 14:17:17 +01:00
Fraser Cormack
78d95cc544
[libclc] Move fract to the CLC library (#137785)
The builtin was already vectorized so there's no difference to codegen
for non-SPIR-V targets.
2025-04-29 13:58:13 +01:00
Fraser Cormack
4609b6a3e7
[libclc] Move fmin & fmax to CLC library (#134218)
This is an alternative to #128506 which doesn't attempt to change the
codegen for fmin and fmax on their way to the CLC library.

The amdgcn and r600 custom definitions of fmin/fmax are now converted to
custom definitions of __clc_fmin and __clc_fmax.

For simplicity, the CLC library doesn't provide vector/scalar versions
of these builtins. The OpenCL layer wraps those up to the vector/vector
versions.

The only codegen change is that non-standard vector/scalar overloads of
fmin/fmax have been removed. We were currently (accidentally,
presumably) providing overloads with mixed elment types such as
fmin(double2, float), fmax(half4, double), etc. The only vector/scalar
overloads in the OpenCL spec are those with scalars of the same element
type as the vector in the first argument.
2025-04-29 10:51:24 +01:00
Romaric Jodin
0e98817458
libclc: frexp: fix implementation regarding denormals (#134823)
Devices not supporting denormals can compare them true against zero. It
leads to result not matching the CTS expectation when either supporting
or not denormals.

For example for 0x1.008p-140 we get {0x1.008p-140, 0} while the CTS
expects {0x1.008p-1, -139} when supporting denormals, or {0, 0} when not
supporting denormals (flushed to zero).

Ref #129871
2025-04-08 14:50:26 +01:00
Fraser Cormack
ddc48fefe3
[libclc] Move native_(exp10|powr|tan) to CLC library (#134080)
These are the three remaining native builtins not yet ported.

There are elementwise versions of exp10 and tan which correspond to the
intrinsics, which may be preferable to the current versions which route
through other native builtins. Those could be changed in a follow-up if
desired.
2025-04-02 17:37:17 +01:00
Fraser Cormack
f186041553
[libclc] Move sinh, cosh & tanh to the CLC library (#134063)
This commit also vectorizes the builtins.
2025-04-02 15:22:42 +01:00
Fraser Cormack
d51525ba36
[libclc] Move lgamma, lgamma_r & tgamma to CLC library (#134053)
Also enable half-precision variants of tgamma, which were previously
missing.

Note that unlike recent work, these builtins are not vectorized as part
of this commit. Ultimately all three call into lgamma_r, which has heavy
control flow (including switch statements) that would be difficult to
vectorize. Additionally the lgamma_r algorithm is copyrighted to SunPro
so may need a rewrite in the future anyway.

There are no codegen changes (to non-SPIR-V targets) with this commit,
aside from the new half builtins.
2025-04-02 15:20:32 +01:00
Fraser Cormack
dd19e7eaaa
[libclc] Move cbrt to the CLC library; vectorize (#133940) 2025-04-02 10:18:24 +01:00
Fraser Cormack
f14ff59da7
[libclc] Move exp, exp2 and expm1 to the CLC library (#133932)
These all share the use of a common helper function so are handled in
one go. These builtins are also now vectorized.
2025-04-01 18:15:37 +01:00
Fraser Cormack
bcf0f8d8aa
[libclc] Move exp10 to the CLC library (#133899)
The builtin was already nominally in the CLC library; this commit just
moves it over. It also vectorizes the builtin on its way.
2025-04-01 14:39:17 +01:00
Fraser Cormack
13a313fe58
[libclc] Move sinpi/cospi/tanpi to the CLC library (#133889)
Additionally, these builtins are now vectorized.

This also moves the native_recip and native_divide builtins as they are
used by the tanpi builtin.
2025-04-01 12:03:21 +01:00
Fraser Cormack
ad48fffb53
[libclc] Move several 'native' builtins to CLC library (#129679)
This commit moves the 'native' builtins that use asm statements to
generate LLVM intrinsics to the CLC library. In doing so it converts
them to use the appropriate elementwise builtin to generate the same
intrinsic; there are no codegen changes to any target except to AMDGPU
targets where `native_log` is no longer custom implemented and instead
used the clang elementwise builtin.

This work forms part of #127196 and indeed with this commit there are no
'generic' builtins using/abusing asm statements - the remaining builtins
are specific to the amdgpu and r600 targets.
2025-04-01 09:20:54 +01:00
Fraser Cormack
7a2b160e76
[libclc] Move rootn to the CLC library; optimize (#133735)
The function was already nominally in the CLC namespace; this commit
just moves it over.

This commit also vectorizes the builtin to avoid scalarization.
2025-04-01 09:19:50 +01:00
Fraser Cormack
87602f6d03
[libclc] Fix unresolved reference to missing table (#133691)
Splitting the 'ln_tbl' into two in db98e292 wasn't done thoroughly
enough as some references to the old table still remained. This commit
fixes the unresolved references by updating to the new split table.
2025-03-31 16:55:23 +01:00
Fraser Cormack
b52977b868
[libclc] Move pow, powr & pown to the CLC library (#133294)
These functions were already nominally in the CLC library.

Similar to others, these builtins are now vectorized and are not broken
down into scalar types.
2025-03-28 08:23:24 +00:00
Fraser Cormack
d32e71d7c7
[libclc] Move fmod, remainder & remquo to the CLC library (#132054)
These functions were already nominally in the CLC namespace; this commit
just formally moves them over.

Note that 'half' versions of these CLC functions are now provided.
Previously the corresponding OpenCL builtins would forward directly to
the 'float' versions of the CLC builtins. Now the OpenCL builtins call
the 'half' CLC builtins, which themselves call the 'float' CLC versions.
This keeps the interface between the OpenCL and CLC libraries neater and
keeps the CLC library self-contained.

No changes to the generated code for non-SPIR-V targets is observed.
2025-03-27 14:53:19 +00:00
Fraser Cormack
3284559cca
[libclc] Move atan2/atan2pi to the CLC library (#133226)
As with other work in this area, these builtins are now vectorized.

A further table has been split into two. There was discrepancy between
comments above the table describing the values as "lead" and "tail" and
variables taken from the table called "head" and "tail", so these have
been unified as head/tail.
2025-03-27 10:59:09 +00:00
Fraser Cormack
db98e2922f
[libclc] Move log1p/asinh/acosh/atanh to the CLC library (#132956)
These four functions all related in that they share tables and helper
functions. Furthermore, the acosh and atanh builtins call log1p.

As with other work in this area, these builtins are now vectorized. To
enable this, there are new table accessor functions which return a
vector of table values using a vector of indices. These are internally
scalarized, in the absence of gather operations. Some tables which were
tables of multiple entries (e.g., double2) are split into two separate
"low" and "high" tables. This might affect the performance of memory
operations but are hopefully mitigated by better codegen overall.
2025-03-27 09:19:07 +00:00
Fraser Cormack
3013458a79
[libclc] Move asinpi/acospi/atanpi to the CLC library (#132918)
Similar to d46a6999, this commit simultaneously moves these three
functions to the CLC library and optimizes them for vector types by
avoiding scalarization.
2025-03-25 13:31:53 +00:00
Fraser Cormack
d46a699953
[libclc] Move asin/acos/atan to the CLC library (#132788)
This commit simultaneously moves these three functions to the CLC
library and optimizing them for vector types by avoiding scalarization.
2025-03-25 09:11:32 +00:00
Fraser Cormack
70c325bf6a
[libclc] Move fp32 sincos helpers to CLC library (#132753)
This commit moves most of the sincos helper functions to the CLC
library. It simultaneously vectorizes them with the aim to increase
performance for vector types by avoiding scalarization.

Some helpers for double types remain as they use various features not
yet ready, like 'fract' which in turn relies on 'fmin'; neither of these
are in the CLC library. They also use table lookups and type punning
which don't translate well to vector versions.

As a proof of concept, float and half versions of the sin and cos
builtins are now vectorized and use the CLC helpers to do so. They
remain in the OpenCL layer but will be simpler to move to the CLC
library when the double versions are ready.
2025-03-24 16:09:31 +00:00
Fraser Cormack
7d048674a4
[libclc] Add license headers to files missing them (#132239)
This commit bulk updates all '.h', '.cl', '.inc', and '.cpp' files to
add any missing license headers.

The remaining files are generally CMake, SOURCES, scripts, markdown,
etc.

There are still some '.ll' files which may benefit from a license
header. I can't find an example of an LLVM IR file with a license header
in the rest of LLVM, but unlike most other (sub)projects, libclc has
examples of LLVM IR as source files, compiled and built into the
library.
2025-03-24 10:10:38 +00:00
Fraser Cormack
82912fd620
[libclc] Update license headers (#132070)
This commit bulk-updates the libclc license headers to the current
Apache-2.0 WITH LLVM-exception license in situations where they were
previously attributed to AMD - and occasionally under an additional
single individual contributor - under an MIT license.

AMD signed the LLVM relicensing agreement and so agreed for their past
contributions under the new LLVM license.

The LLVM project also has had a long-standing, unwritten, policy of not
adding copyright notices to source code. This policy was recently
written up [1]. This commit therefore also removes these copyright
notices at the same time.

Note that there are outstanding copyright notices attributed to others -
and many files missing copyright headers - which will be dealt with in
future work.

[1]
https://llvm.org/docs/DeveloperPolicy.html#embedded-copyright-or-contributed-by-statements
2025-03-20 11:40:09 +00:00
Fraser Cormack
760eeac6a2
[libclc] Reduce bithacking in CLC frexp (#129871)
Also replace some magic constants with named ones.

Checking against FP zero and using isnan and isinf functions allows the
optimizer to create one unified @llvm.is.fpclass intrinsic. This results
in fewer more canonical IR instructions.
2025-03-05 14:18:51 +00:00
Fraser Cormack
e5d5503e4e
[libclc] Move hypot to CLC library; optimize (#129551)
This was already nominally in the CLC library; this commit just formally
moves it over. It simultaneously optimizes it for vector types by
avoiding scalarization.
2025-03-04 14:16:16 +00:00
Fraser Cormack
1357279df9
[libclc] Move rsqrt to the CLC library (#129045)
This also adds missing half variants to certain targets.

It also optimizes some targets' implementations to perform the operation
directly in vector types, as opposed to scalarizing.
2025-02-27 15:46:58 +00:00
Fraser Cormack
285b411e46
[libclc] Move sqrt to CLC library (#128748)
This is fairly straightforward for most targets.

We use the element-wise sqrt builtin by default. We also remove a legacy
pre-filtering of the input argument, which the intrinsic now officially
handles.

AMDGPU provides its own implementation of sqrt for double types. This
commit moves this into the implementation of CLC sqrt. It uses weak
linkage on the 'default' CLC sqrt to allow AMDGPU to only override the
builtin for the types it cares about.
2025-02-27 12:30:24 +00:00