628 Commits

Author SHA1 Message Date
Jon Chesterfield
4e0ba801ea Revert "[amdgpu][lds] Simplify error diag path - lds variable names are no longer special"
Test case didn't run locally, investigating

This reverts commit 7bad469182ff2f6423ea209d5a1e81acca600568.
2024-12-08 12:00:13 +00:00
Jon Chesterfield
7bad469182 [amdgpu][lds] Simplify error diag path - lds variable names are no longer special 2024-12-08 11:26:33 +00:00
Nikita Popov
3317c9ceac
[AMDGPU] Use getSignedConstant() where necessary (#117328)
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().

This also touches some generic legalization code, which apparently only
AMDGPU tests.
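
To illustrate why the signed variant matters, here is a minimal standalone sketch (not LLVM code). The helpers `fitsUnsigned` and `fitsSigned` are hypothetical names introduced for this example; they model the kind of range check that would fire once a constant constructor stops truncating implicitly. A negative value passed through an unsigned 64-bit parameter looks like a huge number that does not fit a 32-bit type, while the same value checked as signed does fit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (assumption, not an LLVM API):
// does Val fit in an unsigned integer of Bits bits? Assumes Bits >= 1.
constexpr bool fitsUnsigned(unsigned Bits, uint64_t Val) {
  return Bits >= 64 || Val < (uint64_t{1} << Bits);
}

// Hypothetical helper (assumption, not an LLVM API):
// does Val fit in a two's-complement integer of Bits bits? Assumes Bits >= 1.
constexpr bool fitsSigned(unsigned Bits, int64_t Val) {
  return Bits >= 64 ||
         (Val >= -(int64_t{1} << (Bits - 1)) &&
          Val <= (int64_t{1} << (Bits - 1)) - 1);
}

// Exercises the two checks on the value -1 for a 32-bit constant.
inline void checkSignedConstantExample() {
  const int64_t MinusOne = -1;
  // Seen as unsigned, -1 is 0xFFFFFFFFFFFFFFFF and needs truncation to fit.
  assert(!fitsUnsigned(32, static_cast<uint64_t>(MinusOne)));
  // Seen as signed, -1 fits without any truncation.
  assert(fitsSigned(32, MinusOne));
}
```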
2024-11-25 09:49:34 +01:00
Kazu Hirata
be187369a0
[AMDGPU] Remove unused includes (NFC) (#116154)
Identified with misc-include-cleaner.
2024-11-13 21:10:03 -08:00
Sergei Barannikov
3d73dbe7f0
[AMDGPU] Remove unused AMDGPUISD enum members (NFC) (#115582)
Those were only used in `getTargetNodeName`.
2024-11-11 23:39:20 +03:00
Gang Chen
8c752900dd
[AMDGPU] modify named barrier builtins and intrinsics (#114550)
Use a local pointer type to represent the named barrier in the builtin and
intrinsic. This makes the definitions more user-friendly
because users do not need to worry about the hardware ID assignment. This
approach is also closer to other popular GPU programming languages.
Named barriers should be represented as global variables of addrspace(3)
in LLVM-IR. The compiler assigns the special LDS offsets for those variables
during the AMDGPULowerModuleLDS pass. Those addresses are converted to hw
barrier ID during instruction selection. The rest of the
instruction-selection changes are primarily due to the
intrinsic-definition changes.
2024-11-06 10:37:22 -08:00
Matt Arsenault
88e23eb2cf
DAG: Fix legalization of vector addrspacecasts (#113964) 2024-10-29 08:08:50 -05:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Jay Foad
39babbffc9
[AMDGPU] Implement isSDNodeAlwaysUniform for INTRINSIC_W_CHAIN (#110114)
There are no always uniform side-effecting intrinsics upstream to test
this with, but we have examples downstream.
2024-09-26 14:44:14 +01:00
Pierre van Houtryve
758444ca3e
[AMDGPU] Promote uniform ops to I32 in DAGISel (#106383)
Promote uniform binops, selects and setcc between 2 and 16 bits to 32
bits in DAGISel

Solves #64591
2024-09-19 09:00:21 +02:00
Stanislav Mekhanoshin
0745219d4a
[AMDGPU] Add target intrinsic for s_buffer_prefetch_data (#107293) 2024-09-06 11:41:21 -07:00
Changpeng Fang
26b0bef192
AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761)
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
2024-08-29 11:43:58 -07:00
Matt Arsenault
7b7b0b95b2
DAG: Check if is_fpclass is custom, instead of isLegalOrCustom (#105577)
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal, which makes it useless for custom lowering
on non-legal types (which is always ppcf128).

Really the DAG builder shouldn't be going to expand this in the
builder, it makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
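
The distinction described in this message can be sketched with a toy model. This is an assumption-laden simplification, not the real TargetLowering code: `ToyLowering`, its fields, and the argument-free predicates are hypothetical, and the only behavior taken from the message is that the combined query also requires the type to be legal.

```cpp
#include <cassert>

// Toy stand-in for a per-(operation, type) legalization action.
enum class Action { Legal, Custom, Expand };

// Hypothetical model of one (operation, type) pair; not an LLVM class.
struct ToyLowering {
  bool TypeIsLegal;
  Action OpAction;

  bool isOperationLegal() const {
    return TypeIsLegal && OpAction == Action::Legal;
  }
  // Looks only at the action, not at type legality.
  bool isOperationCustom() const { return OpAction == Action::Custom; }
  // Also requires the type to be legal, per the commit message.
  bool isOperationLegalOrCustom() const {
    return TypeIsLegal &&
           (OpAction == Action::Legal || OpAction == Action::Custom);
  }
};

// A custom-lowered operation on a non-legal type: the two queries disagree.
inline void checkLegalOrCustomExample() {
  const ToyLowering NonLegalCustom{false, Action::Custom};
  assert(NonLegalCustom.isOperationCustom());
  assert(!NonLegalCustom.isOperationLegalOrCustom());
}
```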
2024-08-29 14:05:43 +04:00
Jay Foad
d0fe52d951
[AMDGPU] Fix sign confusion in performMulLoHiCombine (#105831)
SMUL_LOHI and UMUL_LOHI are different operations because the high part
of the result is different, so it is not OK to optimize the signed
version to MUL_U24/MULHI_U24 or the unsigned version to
MUL_I24/MULHI_I24.
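
A small standalone sketch of why the signedness matters, using plain 32-bit arithmetic rather than the 24-bit nodes named above. The helper names are hypothetical. For the same operand bits, the low halves of the signed and unsigned products agree, but the high halves do not.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of the product; identical for signed and unsigned inputs.
constexpr uint32_t mulLo(uint32_t A, uint32_t B) { return A * B; }

// High 32 bits of the signed 32x32 -> 64 product, returned as a bit pattern.
constexpr uint32_t mulHiSigned(int32_t A, int32_t B) {
  return static_cast<uint32_t>(
      static_cast<uint64_t>(static_cast<int64_t>(A) * B) >> 32);
}

// High 32 bits of the unsigned 32x32 -> 64 product.
constexpr uint32_t mulHiUnsigned(uint32_t A, uint32_t B) {
  return static_cast<uint32_t>((static_cast<uint64_t>(A) * B) >> 32);
}

// Same bits (0xFFFFFFFF and 1), different high halves.
inline void checkMulLoHiExample() {
  assert(mulLo(0xFFFFFFFFu, 1u) == 0xFFFFFFFFu);
  assert(mulHiSigned(-1, 1) == 0xFFFFFFFFu);      // -1 * 1 = -1
  assert(mulHiUnsigned(0xFFFFFFFFu, 1u) == 0u);   // 4294967295 * 1
}
```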
2024-08-27 17:09:40 +01:00
Changpeng Fang
16929219b0
AMDGPU: Add tonearest and towardzero roundings for intrinsic llvm.fptrunc.round (#104486)
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.

"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncating, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by hardware at
this moment.
2024-08-17 11:22:47 -07:00
Craig Topper
51bad732dc [SelectionDAG] Replace EVTToAPFloatSemantics with MVT/EVT::getFltSemantics. (#103001) 2024-08-13 11:35:28 -07:00
Kazu Hirata
f4fb735840
[llvm] Construct SmallVector<SDValue> with ArrayRef (NFC) (#102578) 2024-08-09 09:15:42 -07:00
Matt Arsenault
88a85942ce
AMDGPU: Directly handle all atomicrmw cases in SIISelLowering (#102439) 2024-08-08 22:45:43 +04:00
Matt Arsenault
1d2b2d29d7
AMDGPU: Cleanup extract_subvector actions (NFC) (#101454)
The base AMDGPUISelLowering was setting custom action on 16-bit
vector types, but also set in SIISelLowering.
2024-08-01 10:55:28 +04:00
Matt Arsenault
e24dc34aa0
AMDGPU: Fix asserting in DAG kernel argument lowering on v6i32 (#100528)
Remove this pointless assertion for the number of vector elements.
2024-07-25 14:03:28 +04:00
Sumanth Gundapaneni
0ee32c4573
[AMDGPU] Implement llvm.lrint intrinsic lowering (#98931)
This patch enables the target-independent lowering of llvm.lrint via
GlobalISel.
For SelectionDAG, the intrinsic is custom lowered for AMDGPU.
2024-07-24 23:34:31 +04:00
Sumanth Gundapaneni
fc832d5349
[AMDGPU] Implement llvm.lround intrinsic lowering. (#98970)
This patch enables the target-independent lowering of llvm.lround via
GlobalISel. For SelectionDAG, the intrinsic is custom lowered for
AMDGPU. In order to support vector floating point input for llvm.lround,
this patch extends the target-independent APIs and provides support for
scalarizing. pr98950 is needed to let the verifier allow vector floating
point types.
2024-07-23 20:34:34 +04:00
Jay Foad
c7309dadbf
[AMDGPU] Use range-based for loops. NFC. (#99047) 2024-07-17 10:18:03 +01:00
Jay Foad
38a1dec30b [AMDGPU] Use std::min with initializer list. NFC. 2024-07-16 15:25:08 +01:00
Joseph Huber
3f1a767572
[LLVM] Factor disabled Libcalls into the initializer (#98421)
Summary:
These Libcalls represent which functions are available to the backend.
If a runtime call is not available, the target sets the name to
`nullptr`. Currently, this logic is spread around the various targets.
This patch pulls all of the locations that disable libcalls into the
initializer. This patch is effectively NFC.

The motivation behind this patch is that currently the LTO handling uses
the list of all runtime calls to determine which functions cannot be
internalized and must be extracted from static libraries. We do not want
this to happen for libcalls that are not emitted by the backend. A
follow-up patch will move out this logic so the LTO pass can know which
rtlib calls are actually used by the backend.
2024-07-11 12:59:25 -05:00
Fabian Ritter
e1094dd889
[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (#96185)
In the SelectionDAG lowering of the memcpy intrinsic, this optimization
introduces additional chains between fixed-size groups of loads and the
corresponding stores. While initially introduced to ensure that wider
load/store-pair instructions are generated on AArch64, this optimization
also improves code generation for AMDGPU: Ganged loads are scheduled
into a clause; stores only await completion of their corresponding load.

The chosen value of 16 performed well in microbenchmarks; values of 8,
32, or 64 would perform similarly.
The testcase updates are autogenerated by
utils/update_llc_test_checks.py.

See also:
 - PR introducing this optimization: https://reviews.llvm.org/D46477

Part of SWDEV-455845.
2024-07-03 08:32:35 +02:00
Nikita Popov
9df71d7673
[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the
current `getParent()->getDataLayout()` pattern.
2024-06-28 08:36:49 +02:00
Matt Arsenault
8520061281
AMDGPU: Support local atomicrmw fmin/fmax for float/double (#95590)
This has always been supported. Somehow, we ended up with 2
copies of clang builtins for this case, and the newer one
erroneously requires gfx8-insts.
2024-06-18 18:34:34 +02:00
Matt Arsenault
3b997294d6
AMDGPU: Remove .v2bf16 buffer atomic fadd intrinsics (#95783)
These are redundant with the unsuffixed versions, and have a name
collision with surprising behavior when the base intrinsic is used with
v2bf16.

The global and flat variants should be removed too, but those are complicated
due to using v2i16 in place of the natural v2bf16. Those cases can soon be
completely deleted in favor of atomicrmw.

The GlobalISel codegen change is broken and substitutes handling as bf16
for handling as f16, but it's a bug that this passed the IRTranslator in the first
place.
2024-06-17 21:44:52 +02:00
Matt Arsenault
c894f90c58
AMDGPU: Do not assert on v6x16 buffer load intrinsics (#94966)
Just use the original type and let it hit a standard legalization error.
2024-06-10 16:38:06 +02:00
Mirko Brkušanin
1e6a82b8ef
[AMDGPU] Legalize and select raw/struct_buffer_load with tfe (#93310) 2024-05-27 14:09:17 +02:00
Leon Clark
fb2c6597e3
[AMDGPU] Use LSH for lowering ctlz_zero_undef.i8/i16 (#88512)
Use LSH to lower ctlz_zero_undef instead of subtracting leading zeros
for i8 and i16.
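
A minimal sketch of the two strategies, assuming the "subtract" form means counting leading zeros of the zero-extended 32-bit value and subtracting the extra width. The function names are hypothetical and `clz32` is a portable stand-in for a hardware count-leading-zeros instruction. The two forms agree for every nonzero input and differ only for zero, which is consistent with using the shift form for the zero-undef variant only.

```cpp
#include <cassert>
#include <cstdint>

// Portable count of leading zeros in a 32-bit value; returns 32 for 0.
inline unsigned clz32(uint32_t X) {
  unsigned N = 0;
  for (uint32_t Mask = 0x80000000u; Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++N;
  return N;
}

// Zero-extend, count, then subtract the 24 (or 16) extra leading zeros.
inline unsigned ctlz8BySubtract(uint8_t X) { return clz32(X) - 24; }
inline unsigned ctlz16BySubtract(uint16_t X) { return clz32(X) - 16; }

// Shift the value to the top of the 32-bit register, then count directly.
inline unsigned ctlz8ByShift(uint8_t X) {
  return clz32(static_cast<uint32_t>(X) << 24);
}
inline unsigned ctlz16ByShift(uint16_t X) {
  return clz32(static_cast<uint32_t>(X) << 16);
}

// Both forms agree on all nonzero 8-bit inputs.
inline void checkCtlzShiftExample() {
  for (unsigned V = 1; V < 256; ++V)
    assert(ctlz8BySubtract(static_cast<uint8_t>(V)) ==
           ctlz8ByShift(static_cast<uint8_t>(V)));
}
```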

Related to [77615](https://github.com/llvm/llvm-project/pull/77615).

---------

Co-authored-by: Leon Clark <leoclark@amd.com>
2024-05-19 21:45:24 +01:00
Stanislav Mekhanoshin
5d18d575d8
[AMDGPU] Make fneg/fabs/copysign legal for bf16 (#91676)
These are just bit operations, exactly the same as with f16.
2024-05-10 14:33:47 -07:00
Matt Arsenault
82bb2534d4
AMDGPU: Don't bitcast float typed atomic store in IR (#90116)
Implement the promotion in the DAG.

Depends #90113
2024-05-07 21:43:22 +02:00
Matt Arsenault
7927bcdb8a
AMDGPU: Do not bitcast atomicrmw in IR (#90045)
This is the first step to eliminating shouldCastAtomicRMWIInIR. This and
the other atomic expand casting hooks should be removed. This adds
duplicate legalization machinery and interfaces. This is already what
codegen is supposed to do, and already does for the promotion case.

In the case of atomicrmw xchg, there seems to be some benefit to having
the bitcasts moved outside of the cmpxchg loop on targets with separate
int and FP registers, which we should be able to deal with by directly
checking for the legality of the underlying operation.

The casting path was also losing metadata when it recreated the
instruction.
2024-05-07 18:26:32 +02:00
Kazu Hirata
c18bcd0a57
[Target] Use StringRef::operator== instead of StringRef::equals (NFC) (#91072) (#91138)
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.

- StringRef::operator==/!= outnumber StringRef::equals by a factor of
  38 under llvm/ in terms of their usage.

- The elimination of StringRef::equals brings StringRef closer to
  std::string_view, which has operator== but not equals.

- S == "foo" is more readable than S.equals("foo"), especially for
  !Long.Expression.equals("str") vs Long.Expression != "str".
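
The readability argument can be seen with `std::string_view`, which the message cites as having operator== but no equals. This is a sketch using the standard library type rather than StringRef, and the function names are hypothetical.

```cpp
#include <cassert>
#include <string_view>

// Equality via operator==, the spelling the message prefers.
constexpr bool isFoo(std::string_view S) { return S == "foo"; }

// Inequality reads directly, with no leading negation of a method call.
constexpr bool isNotStr(std::string_view S) { return S != "str"; }

inline void checkStringCompareExample() {
  assert(isFoo("foo"));
  assert(!isFoo("bar"));
  assert(isNotStr("foo"));
  assert(!isNotStr("str"));
}
```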
2024-05-05 13:43:10 -07:00
Shilei Tian
d47c4984e9
[AMDGPU][ISel] Add more trunc store actions regarding bf16 (#90493) 2024-04-29 18:27:52 -04:00
Shilei Tian
8e17c84836
[AMDGPU][ISel] Set trunc store action to expand for v4f32->v4bf16 (#90427) 2024-04-29 09:08:54 -04:00
Matt Arsenault
f1112ebe07
AMDGPU: Do not bitcast atomic load in IR (#90060)
These hooks should be removed. This is a trivial legalization transform
the legalizer needs to support. The IR just complicates things, and it
was losing metadata. Implement the DAG promotion support, and switch
AMDGPU over to using it.

Really we'd be a lot better off merging ATOMIC_LOAD and LOAD like
GlobalISel does.
2024-04-26 12:20:40 +02:00
Emma Pilkington
a04714701f
[AMDGPU] Add a trap lowering workaround for gfx11 (#85854)
On gfx11 shaders run with PRIV=1, which causes `s_trap 2` to be treated
as a nop, which means it isn't a correct lowering for the trap
intrinsic. As a workaround, this commit instead lowers the trap
intrinsic to instructions that simulate the behavior of s_trap 2.

Fixes: SWDEV-438421
2024-04-24 09:43:54 -04:00
Shilei Tian
9ce74d6d47
[AMDGPU][CodeGen] Improve handling of memcpy for -Os/-Oz compilations (#87632)
We had some instances where LLVM would not inline a fixed-count memcpy and
ended up attempting to lower it as a libcall, which would not work on AMDGPU
because the address space doesn't meet the requirement, causing a compiler
crash.

The patch relaxes the threshold used for -Os/-Oz compilation so we're always
allowed to inline memory copy functions.

This patch basically does the same thing as https://reviews.llvm.org/D158226
for AMDGPU.

Fix #88497.
2024-04-16 09:34:18 -04:00
Jay Foad
95258419f6
[AMDGPU] Use AMDGPU::isIntrinsicAlwaysUniform in isSDNodeAlwaysUniform (#87085)
This is mostly just a simplification, but tests show a slight codegen
improvement in code using the deprecated amdgcn.icmp/fcmp intrinsics.
2024-03-30 08:01:18 +00:00
Matt Arsenault
c7c561ef98 AMDGPU: Enable ExpandLargeFpConvert for > 64-bit types
Fixes casts between double/float/half and i128. The pass seems to be
broken for bfloat though. I also believe we could have a better implementation
which attempts to make use of the native 32-bit conversion instructions like
the 64-bit expansion does.
2024-03-15 16:08:39 +05:30
Leon Clark
5b07fd4799
[AMDGPU] Fix OpenCL conformance test failures for ctlz. (#83170)
Remove LSH transform and restore previous lowering.

Fixes conformance issue in
[77615](https://github.com/llvm/llvm-project/pull/77615) where OpenCL
integer_ops tests fail for integer_clz.

Co-authored-by: Leon Clark <leoclark@amd.com>
2024-02-29 22:28:13 +00:00
Francesco Petrogalli
969d7ecf0b
[llvm][CodeGen] Add ValueType v3i1. [NFCI] (#82338) 2024-02-26 16:01:52 +01:00
Francesco Petrogalli
fffcc5ca83
[CodeGen] Add ValueType v3i8 (NFCI). (#80826) 2024-02-08 16:54:12 +01:00
Matt Arsenault
a5d206df79
AMDGPU: Set max supported div/rem size to 64 (#80669)
This enables IR expansion for i128 divisions. The vector case is still
broken because ExpandLargeDivRem doesn't try to handle them.

Fixes: SWDEV-426193
2024-02-05 19:09:38 +05:30
Pierre van Houtryve
ce72f78f37
[AMDGPU] Fix mul combine for MUL24 (#79110)
MUL24 can now return an i64 for i32 operands, but the combine was never
updated to handle this case. Extend the operand when rewriting the ADD
to handle it.

Fixes SWDEV-436654
2024-01-29 16:37:20 +01:00
Leon Clark
2759cfa0c3
[AMDGPU] Remove unnecessary add instructions in ctlz.i8 (#77615)
Add custom lowering for ctlz.i8 to avoid multiple add/sub operations.

---------

Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2024-01-19 10:16:46 +00:00