(#187999)
This fixes conformance failures for double, and for float
without -cl-denorms-are-zero. Optimizations are
able to eliminate the unused quo handling without
duplicating most of the code.
This was failing in the float case without -cl-denorms-are-zero
and failing for double. This now passes in all cases.
This was originally ported from rocm device libs in
8db45e4cf170cc6044a0afe7a0ed8876dcd9a863. This is mostly a port
of more recent changes, with a few modifications:
- Templatification, which almost, but not quite, enables
vectorization; the outer branch and loop still block it.
- Merging of the 3 types into one shared code path, instead of
duplicating per type with 3 different functions implemented together.
There are only some slight differences for the half case, which mostly
evaluates as float.
- Splitting out of the is_odd tracking, instead of deriving it from the
accumulated quotient. This costs an extra register, but saves several
instructions. This also enables automatic elimination of all of the quo
output handling when this code is reused for remainder. I'm guessing
this would be unnecessary if SimplifyDemandedBits handled phis.
- Removal of the slow FMA path. I don't see how this would ever be
faster with the number of instructions replacing it. This is really a
problem for the compiler to solve anyway.
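The is_odd split can be sketched in plain C. This is a minimal illustration under stated assumptions, not the libclc code: remquo_sketch and remainder_sketch are hypothetical names, the shift-subtract loop is a simplified stand-in for the real reduction, and inputs are assumed finite with nonzero y. The point is that rounding depends only on is_odd, so the accumulated quotient q is dead when the caller discards quo.

```c
/* Hypothetical sketch; assumes finite x and finite, nonzero y.
   Ignores overflow of ax + ax for very large |y|. */
static double remquo_sketch(double x, double y, int *quo) {
    double ax = x < 0 ? -x : x;
    double ay = y < 0 ? -y : y;

    /* Double the divisor while the doubled value still fits under ax. */
    double yy = ay;
    int steps = 0;
    while (yy + yy <= ax) {
        yy += yy;
        ++steps;
    }

    unsigned q = 0; /* accumulated quotient, only feeds *quo */
    int is_odd = 0; /* low quotient bit, tracked separately */

    /* Shift-subtract: one quotient bit per iteration. */
    for (int i = steps; i >= 0; --i) {
        int bit = ax >= yy;
        if (bit)
            ax -= yy;
        q = (q << 1) | (unsigned)bit;
        is_odd = bit;
        yy *= 0.5;
    }

    /* Round to nearest, ties to even; uses is_odd, never q. */
    double twice = ax + ax;
    if (twice > ay || (twice == ay && is_odd)) {
        ax -= ay;
        q += 1;
    }

    int sq = (int)(q & 0x7fffffffu);
    *quo = ((x < 0) != (y < 0)) ? -sq : sq;
    return x < 0 ? -ax : ax;
}

/* Reuse for remainder: quo is discarded, so every update of q is dead. */
static double remainder_sketch(double x, double y) {
    int unused_quo;
    return remquo_sketch(x, y, &unused_quo);
}
```

If the rounding step tested q & 1 instead, q would stay live through the loop even in remainder_sketch, which is the situation the extra register avoids.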
This is pretty verbose and ugly. We're pulling the base implementation
in for the double cases, and scalarizing it. Also fully defining the
half and float cases to directly use the intrinsic, for all vector
types. It would be much more convenient if we had linker-based overrides
for the generic implementations, rather than per-source-file overrides.
Follow the ordinary gentype conventions for the log implementation,
instead of using a plain header. This doesn't quite yet enable
vectorization, due to how the table is currently indexed. This should
make it easier for targets to selectively overload the function for
a subset of types.
These should be implementable by checking the behavior of
the canonicalize intrinsic. Hack around spirv still failing
on canonicalize by overriding and assuming DAZ for float.
The base case is correct denormal handling, not flushing. This
also matches the spec controls, which start at IEEE behavior, with
flushing enabled by -cl-denorms-are-zero.
Also fix wrong defaults for half and double. Denormal support is
not optional for these.
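A rough host-side C sketch of the idea of probing denormal behavior follows. It is an illustration only: float_denormals_preserved is a hypothetical name, and a multiply by 1.0f through volatile values stands in for the canonicalize intrinsic, which is an assumption of this sketch rather than what libclc does.

```c
#include <float.h>

/* Hypothetical helper, not a libclc function.
   Returns 1 if a subnormal float survives an arithmetic operation,
   0 if the environment flushes it to zero. */
static int float_denormals_preserved(void) {
    volatile float smallest_normal = FLT_MIN;
    volatile float half = 0.5f;
    /* Subnormal when denormals are supported, zero when flushed. */
    volatile float sub = smallest_normal * half;
    volatile float one = 1.0f;
    /* Stand-in for canonicalize: a flushing mode turns this into zero. */
    volatile float canon = sub * one;
    return canon != 0.0f;
}
```

The volatile qualifiers keep the compiler from folding the multiplies at compile time, so the result reflects the runtime floating-point mode.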
It seems that ?: is not quite equivalent to select for floating-point
vectors. With ?:, the resulting IR involves integer bitcasts and
integer vector typed select. Use select so this is an fp-select. This
enables finite math only contexts to optimize out the select.
This feels like it's a clang bug though.
Project-specific headers should use "". Keep #include <amdhsa_abi.h>.
llvm-diff shows no change to libclc.bc for spir--, spir64--, nvptx64--,
nvptx64--nvidiacl, nvptx64-nvidia-cuda and amdgcn-amd-amdhsa-llvm when
LIBCLC_TARGETS_TO_BUILD is "all".
Verified that reversing spir64--/libclc.spv and spir--/libclc.spv to
LLVM bitcode shows no diff.
Also fix `__CLC_INTEGER_CLC_BITFIELD_EXTRACT_SIGNED_H__` guard per
copilot review.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
NaN should work on either path, but the small reduction
path is smaller. There are also possible codegen benefits to
knowing the large reduction will not need to handle NaNs.
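One way to route NaN to the small path is to write the dispatch test so that an unordered comparison falls through to it. The following is a sketch under assumptions: use_small_reduction is a hypothetical name, and the threshold is passed in because the actual cutoff is not stated here.

```c
#include <math.h>

/* Hypothetical predicate, not the libclc code.
   Returns 1 when x should take the small reduction path. */
static int use_small_reduction(float x, float threshold) {
    float ax = x < 0.0f ? -x : x;
    /* NaN >= threshold is false, so NaN lands on the small path. */
    return !(ax >= threshold);
}
```

With this form the large path only ever sees ordered values at or above the threshold.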
The 4 flavors of pow were originally ported from rocm
device libs between c45ec604f593fcb03d770f4398142d2446017f68,
cc5c65b2c25e0a82fbad95f0ce3bb5262e29eeee, and
fe8e00bc3c65115b2e3d2a43cf3d0d756a934a52. Update to a newer
version. Additionally expose fast variants for use by the
libcall optimizer (e.g., __pow_fast) for float types.
The explicit handling of NaN is unnecessary. Clamp infinities
to NaN at the input. This allows optimizations of the following
implementation code to take advantage of the knowledge that it
does not need to handle infinities.
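The input clamp can be sketched in C as below. clamp_inf_to_nan is a hypothetical name, and the sketch assumes a function whose result for an infinite input is NaN anyway (as with sin), so that mapping infinity to NaN up front does not change the result.

```c
#include <math.h>

/* Hypothetical helper, not the libclc code.
   Maps +/-infinity to NaN and passes every other value through. */
static float clamp_inf_to_nan(float x) {
    return isinf(x) ? NAN : x;
}
```

After this step the remaining code only sees finite values or NaN, and NaN propagates through ordinary arithmetic without a dedicated check.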
When LLVM_TARGETS_TO_BUILD contains the host target, the runtime build sets
CMAKE_C_COMPILER to clang-cl on Windows.
Changes to fix the build on Windows:
- libclc struggles to pass specific flags to clang-cl's MSVC-like interface.
- compile flag handling will be consistent across all host systems.
- libclc build is cross-compilation for offloading targets.
Fix "unknown target triple" errors when LLVM_TARGETS_TO_BUILD is empty.
Adding -disable-llvm-passes reduces this to a very basic sanity check
of the Clang frontend. This allows the test to pass even if the SPIR-V
backend is not enabled, as the frontend can still generate IR for the target.
Summary:
This can be made generic, which works as expected on NVPTX and SPIR-V.
We do not replace this for AMDGPU because the dedicated built-in has an
extra argument that controls whether or not local memory or global
memory will be invalidated. It would be correct to use this generic
operation there, but we'd lose that minor optimization so we likely
should not regress.
Follow-up of 9b96ebc. There are binary_def.inc and unary_def.inc in
the header directory.
- clc_ep.inc -> clc_ep_decl.inc
- relational/binary_def.inc -> relational/relational_binary_def.inc
- relational/unary_def.inc -> relational/relational_unary_def.inc
These .inc files in the header directory have the same names as .inc
files in the implementation directory. Rename them to avoid the name
conflict and to keep the wrong file from being used in the implementation.
This fixes the bitcode change seen when changing `#include <>` to `#include ""`.