Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently only
AMDGPU exercises in its tests.
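A minimal sketch of the difference, using the SelectionDAG API (the helper
function is hypothetical):

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical helper; DAG and DL would come from the surrounding lowering.
static void buildNegativeOne(SelectionDAG &DAG, const SDLoc &DL) {
  // getConstant() takes a uint64_t, so a negative value relies on implicit
  // truncation to the target type -- the case that will assert once implicit
  // truncation is disabled.
  SDValue Truncating = DAG.getConstant(uint64_t(-1), DL, MVT::i32);
  // getSignedConstant() takes an int64_t and handles the sign explicitly.
  SDValue Explicit = DAG.getSignedConstant(-1, DL, MVT::i32);
  (void)Truncating;
  (void)Explicit;
}
```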
Use a local pointer type to represent the named barrier in the builtin and
intrinsic. This makes the definitions more user friendly, because users do
not need to worry about hardware ID assignment. This approach is also more
similar to other popular GPU programming languages.
Named barriers should be represented as global variables in addrspace(3) in
LLVM IR. The compiler assigns special LDS offsets to those variables during
the AMDGPULowerModuleLDS pass, and those addresses are converted to hardware
barrier IDs during instruction selection. The rest of the
instruction-selection changes are primarily due to the intrinsic-definition
changes.
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction-cost-based sinking without introducing code
duplication.
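As a sketch of what this enables (the helper and threshold are illustrative,
not part of the patch):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Hypothetical heuristic: with TTI in hand, a sinking decision can consult
// the target's actual cost model instead of a hard-coded guess.
static bool isCheapToSink(const TargetTransformInfo &TTI,
                          const Instruction &I) {
  InstructionCost Cost =
      TTI.getInstructionCost(&I, TargetTransformInfo::TCK_SizeAndLatency);
  return Cost.isValid() && Cost <= 1; // illustrative threshold
}
```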
Use GCNPat instead of custom lowering to select instructions for the
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is added to translate the rounding modes
to the corresponding ones that the hardware recognizes.
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it also checks whether
the type is legal, which makes it useless for custom lowering on non-legal
types (which ppcf128 always is).
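In essence (a simplified paraphrase of the TargetLowering logic, not the
exact source):

```cpp
// Simplified paraphrase: the "OrCustom" variant also requires the *type* to
// be legal, so Custom lowering registered for an illegal type (and ppcf128
// is never a legal type here) is not reported.
bool isOperationLegalOrCustom(unsigned Op, EVT VT) const {
  LegalizeAction Action = getOperationAction(Op, VT);
  return isTypeLegal(VT) && // <- the extra check that defeats ppcf128
         (Action == Legal || Action == Custom);
}
```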
Really, the DAG builder shouldn't be expanding this; doing so makes the
result difficult to work with. It is only done there to work around the DAG
requiring legal integer types of the same size as the FP type after type
legalization.
SMUL_LOHI and UMUL_LOHI are different operations because the high part
of the result is different, so it is not OK to optimize the signed
version to MUL_U24/MULHI_U24 or the unsigned version to
MUL_I24/MULHI_I24.
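A self-contained illustration of why the high halves differ:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // The 24-bit pattern 0xFFFFFF is 16777215 unsigned, but -1 signed.
  uint64_t UProd = (uint64_t)0xFFFFFF * 0xFFFFFF; // 0xFFFFFE000001
  int64_t SProd = (int64_t)-1 * -1;               // 1
  // High 24 bits of the 48-bit result: 0xFFFFFE vs 0x000000, so the
  // U24 and I24 mul-high operations are not interchangeable.
  printf("unsigned: 0x%llx  signed: 0x%llx\n",
         (unsigned long long)UProd, (unsigned long long)SProd);
  return 0;
}
```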
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.
"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncation, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by the hardware
at this time.
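A hedged sketch of emitting the intrinsic with one of the new modes through
IRBuilder (the helper name is made up for illustration):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Metadata.h"
using namespace llvm;

// Hypothetical helper: truncate f32 -> f16 with an explicit rounding mode.
static Value *emitFPTruncTowardZero(IRBuilder<> &Builder, Value *F32Val) {
  LLVMContext &Ctx = Builder.getContext();
  Value *Mode =
      MetadataAsValue::get(Ctx, MDString::get(Ctx, "round.towardzero"));
  return Builder.CreateIntrinsic(Builder.getHalfTy(),
                                 Intrinsic::fptrunc_round, {F32Val, Mode});
}
```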
This patch enables the target-independent lowering of llvm.lround via
GlobalISel. For SelectionDAG, the intrinsic is custom lowered for AMDGPU.
In order to support vector floating-point input for llvm.lround, this patch
extends the target-independent APIs and provides support for scalarizing.
pr98950 is needed to let the verifier allow vector floating-point types.
Summary:
These Libcalls represent which functions are available to the backend.
If a runtime call is not available, the target sets the name to `nullptr`.
Currently, this logic is spread around the various targets. This patch
pulls all of the locations that disable libcalls into the initializer.
This patch is effectively NFC.
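The pattern being centralized looks roughly like this (the specific libcall
is only an example):

```cpp
// Inside a target's TargetLowering constructor: a runtime function is
// marked unavailable by clearing its name. This patch gathers such calls
// into the initializer instead.
setLibcallName(RTLIB::SINCOS_F32, nullptr); // e.g. no sincosf on this target
```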
The motivation behind this patch is that currently the LTO handling uses
the list of all runtime calls to determine which functions cannot be
internalized and must be extracted from static libraries. We do not want
this to happen for libcalls that are not emitted by the backend. A
follow-up patch will move this logic out so the LTO pass can know which
rtlib calls are actually used by the backend.
In the SelectionDAG lowering of the memcpy intrinsic, this optimization
introduces additional chains between fixed-size groups of loads and the
corresponding stores. While initially introduced to ensure that wider
load/store-pair instructions are generated on AArch64, this optimization
also improves code generation for AMDGPU: Ganged loads are scheduled
into a clause; stores only await completion of their corresponding load.
The chosen value of 16 performed well in microbenchmarks; values of 8, 32,
or 64 would perform similarly.
The testcase updates are autogenerated by
utils/update_llc_test_checks.py.
See also:
- PR introducing this optimization: https://reviews.llvm.org/D46477
Part of SWDEV-455845.
These are redundant with the unsuffixed versions, and have a name
collision with surprising behavior when the base intrinsic is used with
v2bf16.
The global and flat variants should be removed too, but those are complicated
due to using v2i16 in place of the natural v2bf16. Those cases can soon be
completely deleted in favor of atomicrmw.
The GlobalISel codegen change is broken and substitutes handling as bf16
for handling as f16, but it's a bug that this passed the IRTranslator in the first
place.
Use LSH to lower ctlz_zero_undef instead of subtracting leading zeros
for i8 and i16.
Related to [77615](https://github.com/llvm/llvm-project/pull/77615).
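The idea in C terms, assuming zero-undef semantics (so X == 0 need not be
handled):

```cpp
#include <cstdint>

// Old approach: count on the zero-extended 32-bit value, then subtract the
// 24 leading zeros contributed by the extension.
unsigned ctlz8_sub(uint8_t X) { return __builtin_clz((uint32_t)X) - 24; }

// Shift-based approach: move the value into the top bits so no correction
// is needed (valid because ctlz_zero_undef leaves X == 0 undefined).
unsigned ctlz8_shl(uint8_t X) { return __builtin_clz((uint32_t)X << 24); }
```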
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
This is the first step toward eliminating shouldCastAtomicRMWIInIR. This
and the other atomic expand casting hooks should be removed; they add
duplicate legalization machinery and interfaces. This is already what
codegen is supposed to do, and already does for the promotion case.
In the case of atomicrmw xchg, there seems to be some benefit to having
the bitcasts moved outside of the cmpxchg loop on targets with separate
int and FP registers, which we should be able to deal with by directly
checking for the legality of the underlying operation.
The casting path was also losing metadata when it recreated the
instruction.
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
38 under llvm/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
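The mechanical change, in code form:

```cpp
#include "llvm/ADT/StringRef.h"
using namespace llvm;

bool isFoo(StringRef S) { return S == "foo"; }  // was: S.equals("foo")
bool notStr(StringRef S) { return S != "str"; } // was: !S.equals("str")
```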
These hooks should be removed. This is a trivial legalization transform
that the legalizer needs to support. The IR just complicates things, and it
was losing metadata. Implement the DAG promotion support, and switch
AMDGPU over to using it.
Really we'd be a lot better off merging ATOMIC_LOAD and LOAD like
GlobalISel does.
On gfx11, shaders run with PRIV=1, which causes `s_trap 2` to be treated
as a nop, so it isn't a correct lowering for the trap intrinsic. As a
workaround, this commit instead lowers the trap intrinsic to instructions
that simulate the behavior of `s_trap 2`.
Fixes: SWDEV-438421
We had some instances where LLVM would not inline a fixed-count memcpy and
ended up attempting to lower it as a libcall, which does not work on AMDGPU
because the address space doesn't meet the requirement, causing a compiler
crash. The patch relaxes the threshold used for -Os/-Oz compilation so
we're always allowed to inline memory copy functions.
This patch basically does the same thing as
https://reviews.llvm.org/D158226 for AMDGPU.
Fixes #88497.
Fixes casts between double/float/half and i128. The pass seems to be broken
for bfloat, though. I also believe we could have a better implementation
that attempts to make use of the native 32-bit conversion instructions,
like the 64-bit expansion does.
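For reference, the conversions in question, expressed in C++ via the
__int128 extension:

```cpp
// The casts the pass now expands correctly (bfloat remains broken):
__int128 FPToI128(double D) { return (__int128)D; }
double I128ToFP(__int128 I) { return (double)I; }
```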
Remove LSH transform and restore previous lowering.
Fixes a conformance issue in
[77615](https://github.com/llvm/llvm-project/pull/77615) where OpenCL
integer_ops tests fail for integer_clz.
Co-authored-by: Leon Clark <leoclark@amd.com>
This enables IR expansion for i128 divisions. The vector case is still
broken because ExpandLargeDivRem doesn't try to handle them.
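In C++ terms (__int128 maps to i128), the scalar case now expanded:

```cpp
// Scalar i128 division now goes through IR expansion; vector equivalents
// are still unsupported by ExpandLargeDivRem.
__int128 divI128(__int128 A, __int128 B) { return A / B; }
```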
Fixes: SWDEV-426193
MUL24 can now return an i64 for i32 operands, but the combine was never
updated to handle this case. Extend the operand when rewriting the ADD
to handle it.
Fixes SWDEV-436654
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>