9345 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
8fcb712167
[AMDGPU] gfx1250 runlines for global-atomicrmw-fadd.ll. NFC (#159817) 2025-09-19 10:58:41 -07:00
Akash Dutta
c256966fe2
[AMDGPU]: Unpack packed instructions overlapped by MFMAs post-RA scheduling (#157968)
This is a cleaned up version of PR #151704. These optimizations are now
performed post-RA scheduling.
2025-09-19 09:41:02 -07:00
Jeffrey Byrnes
ac8f3cdcf3 [AMDGPU] Precommit test for memory intrinics CGP handling
Change-Id: Id229f849b1d8552bbe59d6e18114042ef1614fad
2025-09-19 07:42:26 -07:00
Mariusz Sikora
eed99d5008
[AMDGPU] Fix the magic number RegisterClass for SReg_32 in test (#159761) 2025-09-19 14:14:33 +02:00
Fabian Ritter
d5607694e1
[AMDGPU][SDAG] DAGCombine PTRADD -> disjoint OR (#146075)
If we can't fold a PTRADD's offset into its users, lowering them to
disjoint ORs is preferable: Often, a 32-bit OR instruction suffices
where we'd otherwise use a pair of 32-bit additions with carry.

This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to just turn PTRADDs into disjoint ORs whenever that's
sound because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.

For SWDEV-516125.
2025-09-19 11:58:41 +02:00
Fabian Ritter
a2dcc88f39
[AMDGPU][SDAG] Handle ISD::PTRADD in various special cases (#145330)
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.

For SWDEV-516125.
2025-09-19 10:19:38 +02:00
Fabian Ritter
adfa6a4c14
[AMDGPU][SDAG] Test ISD::PTRADD handling in various special cases (#145329)
Pre-committing tests to show improvements in a follow-up PR.
2025-09-19 09:43:30 +02:00
Matt Arsenault
116ca9522e
Greedy: Take copy hints involving subregisters (#159570)
Previously this would only accept full copy hints. This relaxes
this to accept some subregister copies. Specifically, this now
accepts:
  - Copies to/from physical registers if there is a compatible
    super register
  - Subreg-to-subreg copies

This has the potential to repeatedly add the same hint to the
hint vector, but not sure if that's a real problem.
2025-09-19 09:37:36 +09:00
Matt Arsenault
33e8e5a846
AMDGPU: Add more mfma loop test cases (#159492)
Test cases where the exit uses must be VGPRs,
and don't happen to be a store that could use AGPRs.
2025-09-19 09:36:46 +09:00
Stanislav Mekhanoshin
6ac0abf8c4
[AMDGPU] gfx1251 VOP3 dpp support (#159654) 2025-09-18 16:18:09 -07:00
Stanislav Mekhanoshin
8cfbace7b2
[AMDGPU] gfx1251 VOP2 dpp support (#159641) 2025-09-18 15:38:29 -07:00
Stanislav Mekhanoshin
e3c7b7f806
[AMDGPU] gfx1251 VOP1 dpp support (#159637) 2025-09-18 13:42:06 -07:00
Simon Pilgrim
6e47bff24d
[AMDGPU] callee-special-input-vgprs.ll / callee-special-input-vgprs-packed.ll - regenerate test coverage (#159587) 2025-09-18 15:19:48 +00:00
Fabian Ritter
01b4b2a5b8
[AMDGPU][SDAG] Handle ISD::PTRADD in VOP3 patterns (#143881)
This patch mirrors similar patterns for ISD::ADD. The main difference is
that ISD::ADD is commutative, so that a pattern definition for, e.g.,
(add (mul x, y), z), automatically also handles (add z, (mul x, y)).
ISD::PTRADD is not commutative, so we would need to handle these cases
explicitly. This patch only implements (ptradd z, (op x, y)) patterns,
where the nested operation (shift or multiply) is the offset of the
ptradd (i.e., the right operand), since base pointers that are the
result of a shift or multiply seem less likely.

For SWDEV-516125.
2025-09-18 15:01:07 +02:00
Petar Avramovic
2ec7959b96
[AMDGPU][SIInsertWaitcnts] Track SCC. Insert KM_CNT waits for SCC writes. (#157843)
Add new event SCC_WRITE for s_barrier_signal_isfirst and s_barrier_leave,
instructions that write to SCC, counter is KM_CNT.
Also start tracking SCC for reads and writes.
s_barrier_wait on the same barrier guarantees that the SCC write from
s_barrier_signal_isfirst has landed, no need to insert s_wait_kmcnt.
2025-09-18 14:41:01 +02:00
Simon Pilgrim
85527609a0
[AMDGPU] kernel-argument-dag-lowering.ll - regenerate test coverage (#159526) 2025-09-18 09:34:38 +00:00
Stanislav Mekhanoshin
221f8eef9d
[AMDGPU] Add gfx1251 runlines to cooperative atomcis tests. NFC (#159437) 2025-09-17 14:08:05 -07:00
Stanislav Mekhanoshin
e556dc0b23
[AMDGPU] Add gfx1251 subtarget (#159430) 2025-09-17 13:02:02 -07:00
Stanislav Mekhanoshin
f0090bacc1
[AMDGPU] Fold copies of constant physical registers into their uses (#154410)
Co-authored-by: Jay Foad <Jay.Foad@amd.com>

Co-authored-by: Jay Foad <Jay.Foad@amd.com>
2025-09-17 10:49:34 -07:00
Brox Chen
2b2b580c8d
[AMDGPU][CodeGen][True16] Track waitcnt as vgpr32 instead of vgpr16 for D16 Instructions in GFX11 (#157795)
It seems the VMEM access on hi/lo half could interfere the other half.
Track waitcnt of vgpr32 instead of vgpr16 for 16bit reg in GFX11.

---------

Co-authored-by: Joe Nash <joseph.nash@amd.com>
2025-09-17 10:09:06 -04:00
Matt Arsenault
aac8eb85b2
AMDGPU: Fixes for regbankselecting copies of i1 physregs to sgprs (#159283)
If the source register of a copy was a physical sgpr copied to an
s1 value, this would assert.
2025-09-17 19:48:39 +09:00
sstipano
56ebbebada
[AMDGPU][NFC] Add back -new-reg-bank-select flag. (#159181) 2025-09-17 11:29:16 +02:00
Shilei Tian
f7f7abcde4
[NFC][AMDGPU] Add a missing test case about cluster dims (#159179) 2025-09-16 22:56:24 -04:00
Krzysztof Drewniak
96ce9f9d64
[AMDGPU] Prevent re-visits in LowerBufferFatPointers (#159168)
Fixes https://github.com/iree-org/iree/issues/22001

The visitor in SplitPtrStructs would re-visit instructions if an
instruction earlier in program order caused a recursive visit() call via
getPtrParts(). This would cause instructions to be processed multiple
times.

As a consequence of this, PHI nodes could be added to the Conditionals
array multiple times, which would to a conditinoal that was already
simplified being processed multiple times. After the code moved to
InstSimplifyFolder, this re-processing, combined with more agressive
simplifications, would lead to an attempt to replace an instruction with
itself, causing an assertion failure and crash.

This commit resolves the issue and adds the reduced form of the crashing
input as a test.
2025-09-16 18:02:18 -07:00
Stanislav Mekhanoshin
4ab8dabc25
[AMDGPU] Add s_cluster_barrier on gfx1250 (#159175) 2025-09-16 14:49:48 -07:00
Shilei Tian
8122ccdca9
[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120)
Hardware initializes a single value in ttmp9 which is either the
workgroup ID X or cluster ID X. Most of this patch is a refactoring to
use a single `PreloadedValue` enumerator for this value, instead of two
enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same
value.

This makes it simpler to have a single attribute
`amdgpu-no-workgroup-id-x` indicating that this value is not used, which
in turns sets the TGID_EN_X bit appropriately to tell the hardware
whether to initialize it.

All of the above applies to Y and Z similarly.

Fixes: LWPSCGFX13-568

Co-authored-by: Jay Foad <jay.foad@amd.com>
2025-09-16 15:37:01 -04:00
Shilei Tian
158eeb344b
[AMDGPU] Change scale_sel to be 4 bits (#157900)
The latest SP changes updated it to use `OP_SEL[0:3]` instead of
`OP_SEL[0:2]`.

Fixes SWDEV-554472.
2025-09-16 15:36:45 -04:00
Stanislav Mekhanoshin
fd59fd563f
[AMDGPU] Add aperture classes to VS_64 (#158823)
Should not do anything.
2025-09-16 11:15:50 -07:00
Matt Arsenault
0648c5183f
AMDGPU: Fix some broken regclass numbers in mir tests (#159102) 2025-09-16 17:45:29 +00:00
Janek van Oirschot
341cdbc970
[AMDGPU] Elide bitcast fold i64 imm to build_vector (#154115)
Elide bitcast combine to build_vector in case of i64 immediate that can
be materialized through 64b mov
2025-09-16 16:44:51 +01:00
Rajveer Singh Bharadwaj
08a58b2cea
[InstCombine] Optimize redundant floating point comparisons in or/and inst's (#158097)
Resolves #157371

We can eliminate one of the `fcmp` when we have two same `olt` or `ogt`
instructions matched in `or`/`and` simplification.
2025-09-16 20:52:11 +05:30
Matt Arsenault
babdad3fdb
AMDGPU: Try to unspill VGPRs after rewriting MFMAs to AGPR form (#154323)
After replacing VGPR MFMAs with the AGPR form, we've alleviated VGPR
pressure which may have triggered spills during allocation. Identify
these spill slots, and try to reassign them to newly freed VGPRs,
and replace the spill instructions with copies.

Fixes #154260
2025-09-17 00:11:32 +09:00
Jay Foad
eeced0d073
[AMDGPU] Use larger immediate values in S_NOP (#158990)
The S_NOP instruction has an immediate operand which is one less than
the number of cycles to delay for. The maximum value that may be encoded
in this field was increased in GFX8 and again in GFX12.
2025-09-16 15:51:06 +01:00
Stanislav Mekhanoshin
76efbc068a
[AMDGPU] Fix codegen to emit COPY instead of S_MOV_B64 for aperture regs (#158754) 2025-09-16 02:26:32 -07:00
Stanislav Mekhanoshin
72aa946762
[AMDGPU] Drop high 32 bits of aperture registers (#158725)
Fixes: SWDEV-551181
2025-09-16 02:11:39 -07:00
Shilei Tian
04cd39ae28
[AMDGPU] Add the support for .cluster_dims code object metadata (#158721)
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-15 16:13:07 -04:00
Shilei Tian
27b242fbff
[AMDGPU][Attributor] Add AAAMDGPUClusterDims (#158076) 2025-09-15 15:04:33 -04:00
macurtis-amd
2c091e6aec
AMDGPU: Report unaligned scratch access as fast if supported by tgt (#158036)
This enables more consecutive load folding during
aggressive-instcombine.

The original motivating example provided by Jeff Byrnes:
https://godbolt.org/z/8ebcTEjTs

Example provided by Nikita Popov: https://godbolt.org/z/Gv1j4vjqE as
part of my original attempt to fix the issue (PR
[#133301](https://github.com/llvm/llvm-project/pull/133301), see his
[comment](https://github.com/llvm/llvm-project/pull/133301#issuecomment-2984905809)).

This changes the value of `IsFast` returned by `In
SITargetLowering::allowsMisalignedMemoryAccessesImpl` to be non-zero for
private and flat addresses if the subtarget supports unaligned scratch
accesses.

This enables aggressive-instcombine to do more folding of consecutive
loads (see
[here](cbd496581f/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp (L811))).

Summary performance impact on
[composable_kernel](https://github.com/ROCm/composable_kernel):

|GPU|speedup (geomean*)|
|---|---|
|MI300A| 1.11|
|MI300X| 1.14|
|MI350X| 1.03|

[*] Just to be clear, this is the geomean across kernels which were
impacted by this change - not across all CK kernels.
2025-09-15 05:03:02 -05:00
David Green
1c21d5cb9b
[GlobalISel] Remove GI known bits cache (#157352)
There is a cache on the known-bit computed by global-isel. It only works
inside a single query to computeKnownBits, which limits its usefulness,
and according to the tests can sometimes limit the effectiveness of
known-bits queries. (Although some AMD tests look longer). Keeping the
cache valid and clearing it at the correct times can also require being
careful about the functions called inside known-bits queries.

I measured compile-time of removing it and came up with:
```
7zip      2.06405E+11     2.06436E+11     0.015018992
Bullet    1.01298E+11     1.01186E+11     -0.110236169
ClamAV    57942466667     57848066667     -0.16292023
SPASS     45444466667     45402966667     -0.091320249
consumer  35432466667     35381233333     -0.144594317
kimwitu++ 40858833333     40927933333     0.169118877
lencod    70022366667     69950633333     -0.102443457
mafft     38439900000     38413233333     -0.069372362
sqlite3   35822266667     35770033333     -0.145812474
tramp3d   82083133333     82045600000     -0.045726
Average                                   -0.068828739
```
The last column is % difference between with / without the cache. So in
total it seems to be costing slightly more to keep the current
known-bits cache than if it was removed. (Measured in instruction count,
similar to llvm-compile-time-tracker).

The hit rate wasn't terrible - higher than I expected. In the
llvm-test-suite+external projects it was hit 4791030 times out of
91107008 queries, slightly more than 5%.

Note that as globalisel increases in complexity, more known bits calls
might be made and the numbers might shift. If that is the case it might
be better to have a cache that works across calls, providing it doesn't
make effectiveness worse.
2025-09-15 07:32:00 +01:00
Shilei Tian
1180c2ced0
[AMDGPU] Support lowering of cluster related instrinsics (#157978)
Since many code are connected, this also changes how workgroup id is lowered.

Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-12 21:11:17 -04:00
Matt Arsenault
9af4a85460
AMDGPU: Add test which shows unnecessary register alignment (#158168)
The b96 tr loads are a special case that does not require even
aligned VGPRs
2025-09-13 10:10:59 +09:00
choikwa
ef7de8d144
[AMDGPU] Remove scope check in SIInsertWaitcnts::generateWaitcntInstBefore (#157821)
This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
   ptr argument.

However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.

AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.

Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.

Passed PSDB.

Previous attempt to address the issue in IPSCCP, likely stalled:
https://github.com/llvm/llvm-project/pull/154522
This solution may be preferrable over that as issue only affects AMDGPU.
2025-09-12 14:51:36 -04:00
Matt Arsenault
9e1d656c68
AMDGPU: Remove MIMG special case in adjustAllocatableRegClass (#158184)
I have no idea why this was here. MIMG atomics use tied operands
for the input and output, so AV classes should have always worked.
We have poor test coverage for AGPRs with atomics, so add a partial
set. Everything seems to work OK, although it seems image cmpswap
always uses VGPRs unnecessarily.
2025-09-12 09:02:24 +00:00
Matt Arsenault
188901d6ca
AMDGPU: Fix returning wrong type for stack passed sub-dword arguments (#158002)
Fixes assertion with -debug-only=isel on LowerFormalArguments result.
That assert really shouldn't be under LLVM_DEBUG.

Fixes #157997
2025-09-12 10:50:03 +09:00
Matt Arsenault
5a21128f24
AMDGPU: Relax legal register operand constraint (#157989)
Find a common subclass instead of directly checking for a subclass
relationship. This fixes folding logic for unaligned register defs
into aligned use contexts. e.g., a vreg_64 def into an av_64_align2
use should be able to find the common subclass vreg_align2. This
avoids regressions in future patches.

Checking the subclass was also redundant on the subregister path;
getMatchingSuperRegClass is sufficient.
2025-09-12 08:57:47 +09:00
Changpeng Fang
05a705efda
[AMDGPU] Restrict to VGPR only for mfma scale operands (#158117)
Restrict to VGPR only (VRegSrc_32) for mfma scale operands to workaround
a hardware design defect: For all Inline/SGPR constants, SP HW use bits
[30:23] as the scale.

TODO: We may still be able to allow Inline Constants/SGPR, with a proper
shift, to obtain a potentially better performance.

Fixes: SWDEV-548629
2025-09-11 13:10:33 -07:00
Chris Jackson
5e6564b098
[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions (#140694)
- Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional
combines to make use of these newly legalised instructions.
- Update several tests and separate legacy r600 tests where necessary.
2025-09-11 13:32:44 +01:00
Petar Avramovic
b97010865c
AMDGPU/GlobalISel: Import D16 load patterns and add combines for them (#153178)
Add G_AMDGPU_LOAD_D16 generic instructions and GINodeEquivs for them,
this will import D16 load patterns to global-isel's tablegened
instruction selector.
For newly imported patterns to work add combines for G_AMDGPU_LOAD_D16
in AMDGPURegBankCombiner.
2025-09-11 12:02:50 +02:00
Petar Avramovic
40270e8ef2
AMDGPU/GlobalISel: Add regbanklegalize rules for load and store (#153176)
Cover all the missing cases and add very detailed tests for each rule.
In summary:
- Flat and Scratch, addrspace(0) and addrspace(5), loads are always
  divergent.
- Global and Constant, addrspace(1) and addrspace(4), have real uniform
  loads, s_load, but require additional checks for align and flags in mmo.
  For not natural align or not uniform mmo do uniform-in-vgpr lowering.
- Private, addrspace(3), only has instructions for divergent load, for
  uniform do uniform-in-vgpr lowering.
- Store rules are simplified using Ptr32 and Ptr64.
  All operands need to be vgpr.

Some tests have code size regression since they use more sgpr instructions,
marked with FixMe comment to get back to later.
2025-09-11 11:26:20 +02:00
choikwa
8ae3aeaca0
[AMDGPU] NFC. Add testcase to test SIInsertWaitcnts::generateWaitcntInstBefore (#157938)
Pre-commit testcase for https://github.com/llvm/llvm-project/pull/157821
2025-09-10 16:39:02 -04:00