105 Commits

Author SHA1 Message Date
LU-JOHN
18f7e625bd
Revert "[AMDGPU] Generate more swaps" (#187723)
Reverts llvm/llvm-project#184164. Issue hit in testing, LCOMPILER-1587.
2026-03-20 12:03:20 -05:00
LU-JOHN
81396ebc51
[AMDGPU] Generate more swaps (#184164)
Generate more swaps from:

```
   mov T, X
   ...
   mov X, Y
   ...
   mov Y, T
```
by being more careful about what use/defs of X, Y, T are allowed in
intervening code and allowing flexibility where the swap is inserted.
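
As a rough illustration of why three moves through a temporary are equivalent to a single swap, the following Python sketch models registers as a dictionary. The `mov`, `swap`, `three_moves` and `mov_then_swap` helpers are hypothetical names for this example, not part of the LLVM implementation, and the sketch ignores the intervening-code constraints this commit deals with.

```python
def mov(regs, dst, src):
    """Copy the value of register src into register dst."""
    regs[dst] = regs[src]


def swap(regs, a, b):
    """Exchange the values of registers a and b in one step."""
    regs[a], regs[b] = regs[b], regs[a]


def three_moves(regs):
    """Exchange X and Y through the temporary T using three moves."""
    out = dict(regs)
    mov(out, "T", "X")
    mov(out, "X", "Y")
    mov(out, "Y", "T")
    return out


def mov_then_swap(regs):
    """Keep the copy into T (it may be dead), then swap X and Y directly."""
    out = dict(regs)
    mov(out, "T", "X")
    swap(out, "X", "Y")
    return out


start = {"X": 1, "Y": 2, "T": 0}
print(three_moves(start) == mov_then_swap(start))
```

Both helpers leave the same final register state, which is the property a swap rewrite has to preserve.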

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-03-03 09:11:11 -06:00
Ruiling, Song
686987a540
ValueTracking/AMDGPU: handle mbcnt in computeKnownBitsFromOperator (#183229)
This helps canonicalize some address calculations, which in turn
helps fold immediates into memory load instructions in the backend.

The order change to v_mad_u32_u24 is just because
@llvm.amdgcn.mul.u24.i32 was used in codegen prepare after this change.
It does not really change anything important.
2026-03-02 10:48:15 +08:00
Carl Ritson
5cc4b05380
[AMDGPU] Add scheduling DAG mutation for hazard latencies (#170075)
Improve waitcnt merging in ML kernel loops by increasing latencies on
VALU writes to SGPRs.
Specifically this helps with the case of V_CMP output feeding V_CNDMASK
instructions.
2026-02-03 11:10:28 +09:00
Matt Arsenault
1db5d6410b
AMDGPU: Move softPromoteHalfType override to R600 only (#177419)
As expected the code is much worse, but more correct.
We could do a better job with source modifier management around
fp16_to_fp/fp_to_fp16.
2026-01-26 15:23:04 +00:00
Jay Foad
d748c81218
[AMDGPU] Change the immediate operand of s_waitcnt_depctr / s_wait_alu (#169378)
The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some
unused bits. Previously codegen would set these bits to 1, but setting
them to 0 matches the SP3 assembler behaviour better, which in turn
means that we can print them using the human readable SP3 syntax:

s_wait_alu 0xfffd ; unused bits set to 1
s_wait_alu 0xff9d ; unused bits set to 0
s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable

Note that the set of unused bits changed between GFX10.1 and GFX10.3.
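
The two raw encodings quoted above differ only in the bits the message calls unused. A small Python check, using nothing but those two values, makes the affected bit positions explicit. The variable names are illustrative, and the resulting mask holds only for the target these example encodings come from, since the set of unused bits varies between targets.

```python
ones = 0xFFFD   # encoding with the unused bits set to 1
zeros = 0xFF9D  # encoding with the unused bits set to 0

# Bits that differ between the two encodings.
diff = ones ^ zeros
diff_bits = [i for i in range(16) if (diff >> i) & 1]

print(hex(diff), diff_bits)
```

Clearing the differing bits in the first encoding (`ones & ~diff`) yields the second one.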
2025-11-25 11:55:26 +00:00
hstk30-hw
a6cec3f3e5
Reland "[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661)" (#169219)
Reland d5f3ab8ec97786476a077b0c8e35c7c337dfddf2, fix testcases.
2025-11-24 09:27:25 +08:00
Aiden Grossman
d5f3ab8ec9
Revert "[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661)"
This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc.

This caused a couple test failures, likely due to a mid-air collision.
Reverting for now to get the tree back to green and allow the original
author to run UTC/friends and verify the output.
2025-11-23 05:17:45 +00:00
hstk30-hw
0859ac5866
[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661)
This may be a bug introduced by commit
6749ae36b4a33769e7a77cf812d7cd0a908ae3b9 that has been present ever
since.
In this case, `OtherReg` always overlaps with `DstReg` because they both
come from the `Copy`.
2025-11-23 10:11:24 +08:00
Matt Arsenault
0fa6a67a42
AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166)
Some cases are relying on SIFixSGPRCopies to force VALU
reg_sequence inputs with SGPR inputs to use all VGPR inputs,
but this doesn't always happen if the reg_sequence isn't
invalid. Make sure we use a vgpr up-front here so we don't
rely on something later.
2025-11-14 20:19:24 -08:00
Matt Arsenault
bbde792786
AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475)
Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.

This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's trying
to "widen" a register, it's generally packing a register into an existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a11015 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ead4b didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be garden-variety subreg handling issues. We could use more dynamic
state information here if it really is an issue.

I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.
2025-11-11 13:50:57 -08:00
Carl Ritson
385c12134a
[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663)
Apply additional counter waits to address VALU writes to SGPRs. Rework
expiry detection and apply wait coalescing to mitigate some of the
additional waits.
2025-10-28 16:09:28 +09:00
LU-JOHN
9abbec66bf
[AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201)
Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by
excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0.
Passes check-libc-amdgcn-amd-amdhsa now.

Distribution of instructions that allowed a redundant S_CMP to be
deleted in check-libc-amdgcn-amd-amdhsa test:

```
S_AND_B32      485
S_AND_B64      47
S_ANDN2_B32    42
S_ANDN2_B64    277492
S_CSELECT_B64  17631
S_LSHL_B32     6
S_OR_B64       11
```
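
For a sense of scale, the counts above can be totalled and turned into shares. This is only arithmetic on the numbers listed in this message; the variable names are illustrative.

```python
counts = {
    "S_AND_B32": 485,
    "S_AND_B64": 47,
    "S_ANDN2_B32": 42,
    "S_ANDN2_B64": 277492,
    "S_CSELECT_B64": 17631,
    "S_LSHL_B32": 6,
    "S_OR_B64": 11,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
top = max(counts, key=counts.get)

print(total, top, round(100 * shares[top], 1))
```

By this count, a single opcode accounts for the large majority of the deleted compares in that test run.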

---------

Signed-off-by: John Lu <John.Lu@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-10-22 08:42:29 -05:00
Jan Patrick Lehr
023b1f6a8e
Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116)
Reverts llvm/llvm-project#162352

Broke our buildbot:
https://lab.llvm.org/buildbot/#/builders/10/builds/15674
To reproduce

cd llvm-project
cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake -GNinja
cd thebuild
ninja
ninja check-libc-amdgcn-amd-amdhsa
2025-10-18 22:38:14 +02:00
LU-JOHN
8e5f6dd37c
[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352)
Remove redundant s_cmp_lg_* sX, 0 if SALU instruction already sets SCC
if sX!=0.
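
A minimal sketch of why the compare is redundant, assuming a SALU AND that sets SCC to "result != 0" (S_AND_B32 is among the instructions the reland lists as enabling the deletion). The Python functions below are a toy model with hypothetical names, not the backend code.

```python
MASK32 = 0xFFFFFFFF


def s_and_b32(a, b):
    """Toy model of a SALU AND: returns (result, scc) with scc = (result != 0)."""
    result = a & b & MASK32
    return result, int(result != 0)


def s_cmp_lg_u32(a, b):
    """Toy model of the compare: returns scc = (a != b)."""
    return int((a & MASK32) != (b & MASK32))


def scc_with_compare(a, b):
    """SCC after the AND followed by an explicit compare against 0."""
    result, _ = s_and_b32(a, b)
    return s_cmp_lg_u32(result, 0)


def scc_without_compare(a, b):
    """SCC taken directly from the AND, with the compare removed."""
    _, scc = s_and_b32(a, b)
    return scc


for a, b in [(0, 0), (1, 2), (0xF0, 0x30), (MASK32, 1)]:
    print(a, b, scc_with_compare(a, b) == scc_without_compare(a, b))
```

In this model the compare recomputes exactly the SCC value the AND already produced, so dropping it does not change what later SCC readers observe.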

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-10-18 09:33:47 -05:00
Matt Arsenault
c6e280e7ed
PeepholeOpt: Fix losing subregister indexes on full copies (#161310)
Previously if we had a subregister extract reading from a
full copy, the no-subregister incoming copy would overwrite
the DefSubReg index of the folding context.

There's one ugly rvv regression, but it's a downstream
issue of this; an unnecessary same class reg-to-reg full copy
was avoided.
2025-10-02 13:36:47 +09:00
Brox Chen
dfdfc4e490
[AMDGPU][True16][Codegen] remove another build_vector pattern from true16 (#149861)
Remove another build_vector pattern from true16 mode, one that takes an
i16 placed in a VGPR_32. This stops ISel from generating an illegal
"vgpr_32 = COPY vgpr_16".

ISel will use the vgpr16 build_vector pattern in true16 mode instead.
2025-09-04 18:08:18 -04:00
Matt Arsenault
b1b5102624
AMDGPU: Start considering new atomicrmw metadata on integer operations (#122138)
Start considering !amdgpu.no.remote.memory.access and
!amdgpu.no.fine.grained.host.memory metadata when deciding to expand
integer atomic operations. This does not yet attempt to accurately
handle fadd/fmin/fmax, which are trickier and require migrating the
old "amdgpu-unsafe-fp-atomics" attribute.
2025-08-22 05:29:36 +00:00
Matt Arsenault
01f785cac4
AMDGPU: Expand remaining system atomic operations (#122137)
System scope atomics need to use cmpxchg loops if we know
nothing about the allocation the address is from.
aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this, this
expands the set to cover the remaining integer operations.
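
The general shape of a cmpxchg loop can be sketched as follows. This is a single-threaded Python model with hypothetical names (`Cell`, `atomicrmw_via_cmpxchg`); it only illustrates the retry structure and says nothing about the actual AMDGPU expansion or its memory ordering.

```python
class Cell:
    """A single memory location with a compare-and-exchange primitive."""

    def __init__(self, value):
        self.value = value

    def load(self):
        return self.value

    def cmpxchg(self, expected, new):
        """Store new if the cell holds expected; return (old value, success)."""
        old = self.value
        if old == expected:
            self.value = new
            return old, True
        return old, False


def atomicrmw_via_cmpxchg(cell, op, operand):
    """Emulate a read-modify-write with a cmpxchg retry loop; returns the old value."""
    expected = cell.load()
    while True:
        old, ok = cell.cmpxchg(expected, op(expected, operand))
        if ok:
            return old
        # Another writer changed the cell first; retry with the fresh value.
        expected = old


cell = Cell(5)
old = atomicrmw_via_cmpxchg(cell, max, 9)
print(old, cell.load())
```

Any integer operation can be plugged in as `op`, which is why the loop serves as a generic fallback when the native atomic cannot be trusted for the given memory.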

Don't expand xchg and add, those theoretically should work over PCIe.
This is a pre-commit which will introduce performance regressions.
Subsequent changes will add handling of new atomicrmw metadata, which
will avoid the expansion.

Note this still isn't conservative enough; we do need to expand
some device scope atomics if the memory is in fine-grained remote
memory.
2025-08-22 13:55:04 +09:00
Brox Chen
c50ed05cad
[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (reopen #153894) (#154211)
Recreate this patch from
https://github.com/llvm/llvm-project/pull/153894

It seems ISel silently ignores the `i64 = zext i16` with a chained
`reg_sequence` pattern, which causes a selection failure in a
HIP test. This recreates the patch with an alternative pattern and adds
an ll test, global-extload-gfx11plus.ll.
2025-08-20 10:26:49 -04:00
Brox Chen
d49aab10bd
Revert "[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#1538… (#154163)
This reverts commit 7c53c6162bd43d952546a3ef7d019babd5244c29.

This patch hit an issue in a HIP test. Reverting; it will be reopened later.
2025-08-18 14:01:19 -04:00
Brox Chen
7c53c6162b
[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#153894)
Update true16 mode with zext patterns using vgpr16 for 16-bit data types.
This stops ISel from inserting an invalid "vgpr32 = copy vgpr16".
2025-08-18 11:01:57 -04:00
Shilei Tian
fc0653f31c
[RFC][NFC][AMDGPU] Remove -verify-machineinstrs from llvm/test/CodeGen/AMDGPU/*.ll (#150024)
Recent upstream trends have moved away from explicitly using `-verify-machineinstrs`, as it's already covered by the expensive checks. This PR removes almost all `-verify-machineinstrs` from tests in `llvm/test/CodeGen/AMDGPU/*.ll`, leaving only those tests where its removal currently causes failures.
2025-07-23 13:42:46 -04:00
Brox Chen
0d2b47ae4a
[AMDGPU][True16][CodeGen] stop emitting sgpr_lo16 from isel (#144819)
When true16 is enabled, ISel starts to emit sgpr_lo16 registers when a
trunc/sext i16/i32 is generated, or when a salu32 is used by a vgpr16 or
vice versa. This causes a problem because sgpr_lo16 is not fully
supported in the pipeline.

True16 mode works fine at -O3 because the folding pass removes sgpr_lo16
from the pipeline. However, it hits a problem at -O0 because the folding
pass is skipped.

This patch:
1. stops emitting sgpr_lo16 from ISel
2. updates the codegen patterns to split uniform/divergent patterns for
i16/i32 conversion
3. updates the fix-sgpr-copy pass to address the legalization requirement
in true16 mode, and updates the fix-sgpr-copies-f16-true16.mir
test to include all possible combinations

This patch was tested with CTS and a downstream repo with -O0 testing.
2025-07-09 16:17:14 -04:00
Guy David
76274eb2b3
[PHIElimination] Revert #131837 #146320 #146337 (#146850)
Reverting because of miscompiles:
- https://github.com/llvm/llvm-project/pull/131837
- https://github.com/llvm/llvm-project/pull/146320
- https://github.com/llvm/llvm-project/pull/146337
2025-07-03 07:48:08 -04:00
Guy David
f5c62ee0fa
[PHIElimination] Reuse existing COPY in predecessor basic block (#131837)
The insertion point of COPY isn't always optimal and could eventually
lead to a worse block layout, see the regression test in the first
commit.

This change affects many architectures, but the total number of
instructions in the test cases seems to be slightly lower.
2025-06-29 21:28:42 +03:00
Ana Mihajlovic
08d747c1ef
[AMDGPU] Fix bad removal of s_delay_alu (#145728)
The instructionWaitsForSGPRWrites function covers ALL SALU instructions,
including those like s_waitcnt that don't read from an SGPR. This results
in removing s_delay_alu instructions in cases like VALU->SGPR->VALU, which
causes a performance regression. This change modifies the function so that
it also checks whether the instruction reads an SGPR.
2025-06-27 16:15:10 +02:00
Brox Chen
6dbc01e801
[AMDGPU][True16][CodeGen] update GFX11Plus codegen test with true16 flag (#135078)
This is an NFC patch.

This patch runs a bulk update on CodeGen tests that are impacted by the
true16 features. This patch:
1. duplicates GFX11plus runlines and applies
"+mattr=+real-true16" and "+mattr=-real-true16" to them
2. updates the tests with the update script

For some GISEL runlines, the current CodeGen does not fully support the
true16 version. The runlines are still updated, but the failing
ones are commented out, with a "FIXME-TRUE16" comment added to the test
for easier tracking. These tests will be fixed in follow-up patches.

This is a transition state in which we support both
"+real-true16" and "-real-true16" in our code base. We plan to move to
"+real-true16" as the default, and finally remove the "-real-true16" mode
and test lines.
2025-04-23 13:06:52 -04:00
zhijian lin
afda4c295b
Reland [SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#136701)
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
2025-04-22 17:36:41 -04:00
Nico Weber
e18a77cfbe
Revert "[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)"
This reverts commit f12078e72601e7c03e5d66afab034313caf8f791.

Breaks `check-llvm`, see comments on https://github.com/llvm/llvm-project/pull/122741
2025-04-21 10:51:03 -04:00
zhijian lin
f12078e726
[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)
The PR will fix the issue
https://github.com/llvm/llvm-project/issues/122728

This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
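
A minimal sketch of the folding rule, using a hypothetical `Poison` stand-in class and plain Python integers for constants. It models only the behaviour described above (extending poison yields poison of the wider type rather than zero) and is not the SelectionDAG code.

```python
class Poison:
    """Stand-in for a poison value of a given bit width."""

    def __init__(self, bits):
        self.bits = bits

    def __eq__(self, other):
        return isinstance(other, Poison) and other.bits == self.bits

    def __repr__(self):
        return f"Poison(i{self.bits})"


def zero_extend(value, from_bits, to_bits):
    """Zero-extend; poison stays poison, now of the wider type."""
    if isinstance(value, Poison):
        return Poison(to_bits)
    return value & ((1 << from_bits) - 1)


def sign_extend(value, from_bits, to_bits):
    """Sign-extend; poison stays poison, now of the wider type."""
    if isinstance(value, Poison):
        return Poison(to_bits)
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):
        value -= 1 << from_bits  # interpret the top bit as the sign
    return value & ((1 << to_bits) - 1)


print(zero_extend(Poison(16), 16, 32), sign_extend(Poison(16), 16, 32))
```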
2025-04-21 10:02:21 -04:00
Shoreshen
121cd7c6f0
Re apply 130577 narrow math for and operand (#133896)
Re-apply https://github.com/llvm/llvm-project/pull/130577

It was reverted in https://github.com/llvm/llvm-project/pull/133880

The previous version failed under address sanitizer because
`tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in
`AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, which caused a
use-after-free.

To fix this, `tryNarrowMathIfNoOverflow` is now called first, and the
function returns directly if `tryNarrowMathIfNoOverflow` returns true.
2025-04-17 17:03:32 +08:00
Shoreshen
7f14b2a9eb
Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880)
Reverts llvm/llvm-project#130577
2025-04-01 17:37:02 +08:00
Shoreshen
145b4a3950
[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577)
For Add, Sub, Mul with Int64 type, if profitable, then do:
1. Trunc operands to Int32 type
2. Apply 32 bit Add/Sub/Mul
3. Zext to Int64 type
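
The three steps can be sketched for addition as follows. This is a Python model on concrete values with hypothetical helper names; the `narrowing_is_safe` check tests the actual result, whereas a compiler has to prove the absence of overflow statically, and the profitability decision is not modelled at all.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add64(a, b):
    """Reference 64-bit wrapping add."""
    return (a + b) & MASK64


def narrowed_add(a, b):
    """Trunc both operands to 32 bits, add in 32 bits, then zext to 64 bits."""
    lo = ((a & MASK32) + (b & MASK32)) & MASK32
    return lo  # zext is a no-op on a non-negative Python int


def narrowing_is_safe(a, b):
    """The narrowed form matches when the 64-bit result already fits in 32 bits."""
    return add64(a, b) <= MASK32


for a, b in [(3, 4), (0x7FFFFFFF, 1), (MASK32, 1)]:
    print(a, b, narrowing_is_safe(a, b), narrowed_add(a, b) == add64(a, b))
```

In this model the narrowed result equals the 64-bit result exactly when `narrowing_is_safe` holds, which is why the transform needs a no-overflow guarantee.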
2025-04-01 11:18:17 +08:00
Ana Mihajlovic
8c7550132f
[AMDGPU] Unused sdst writing to null (#133229)
Write an unused sdst to null to avoid a false VALU->SALU dependency
stall. This requires using the VOP3 encoding.
2025-03-28 18:12:34 +01:00
Ana Mihajlovic
459b4e3fe1
Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)" (#131111)
We have a VALU->SGPR->SALU sequence (a VALU writing to an SGPR and a SALU
reading from it). When the VALU is issued, it increments the internal
counter VA_SDST used to track use of this SGPR. The SALU will not issue
until VA_SDST is zero, that is, when the VALU has finished writing.
Therefore, delays added by s_delay_alu are not needed in this situation.
2025-03-13 10:26:20 +01:00
Kazu Hirata
aa008e0008
Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)"
This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/127212
2025-03-12 12:09:09 -07:00
Ana Mihajlovic
71582c6667
[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)
We have a VALU->SGPR->SALU sequence (a VALU writing to an SGPR and a SALU
reading from it). When the VALU is issued, it increments the internal
counter VA_SDST used to track use of this SGPR. The SALU will not issue
until VA_SDST is zero, that is, when the VALU has finished writing.
Therefore, delays added by s_delay_alu are not needed in this situation.
2025-03-12 09:33:07 -07:00
Matt Arsenault
6aea6308d1
AMDGPU: Fix creating illegally typed readfirstlane in atomic optimizer (#128388)
We need to promote 8/16-bit cases to 32-bit. Unfortunately we are
missing demanded bits optimizations on readfirstlane, so we end up
emitting an and instruction on the input. I'm also surprised this pass
isn't handling half or bfloat yet.
2025-02-24 18:39:49 +07:00
Matt Arsenault
1bb43068f1
PeepholeOpt: Allow introducing subregister uses on reg_sequence (#127052)
This reverts d246cc618adc52fdbd69d44a2a375c8af97b6106. We now handle
composing subregister extracts through reg_sequence.
2025-02-22 09:16:14 +07:00
Sergei Barannikov
ff9c041d96
[MachineScheduler] Fix physreg dependencies of ExitSU (#123541)
Providing the correct operand index allows addPhysRegDataDeps to compute
the correct latency.

Pull Request: https://github.com/llvm/llvm-project/pull/123541
2025-02-01 20:40:50 +03:00
Matt Arsenault
d246cc618a
PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111)
Given the rest of the pass just gives up when it needs to compose
subregisters, folding a subregister extract directly into a reg_sequence
is counterproductive. Later fold attempts in the function will give up
on the subregister operand, preventing looking up through the reg_sequence.

It may still be profitable to do these folds if we start handling
the composes. There are some test regressions, but this mostly
looks better.
2025-01-30 20:42:02 +07:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Shilei Tian
6548b6354d
Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08 20:21:16 -05:00
Shilei Tian
ca33649abe
Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both
hip and openmp buildbots.
2024-11-08 16:36:35 -05:00
Shilei Tian
e215a1e27d
[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)
2024-11-08 13:05:35 -05:00
Stanislav Mekhanoshin
6d7e51de5e
[AMDGPU] Extend type support for update_dpp intrinsic (#114597)
We can split 64-bit DPP as a post-RA pseudo if control values are
supported, but cannot handle other types.
2024-11-05 13:59:14 -08:00
Stanislav Mekhanoshin
3277c7cd28
[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765)
MSG_DEALLOC_VGPRS slows down very small waveslot-limited kernels. It has
been identified that this message is only really needed for VGPR-limited
kernels. A kernel becomes VGPR limited if the total number of VGPRs per
SIMD / number of used VGPRs is less than the number of wave slots.
2024-10-21 09:39:52 -07:00
Pierre van Houtryve
924a64a348
[AMDGPU] Only emit SCOPE_SYS global_wb (#110636)
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.

I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
2024-10-07 07:35:31 +02:00
Matt Arsenault
8632e8bd64
AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514)
2024-09-23 13:20:21 +04:00