96 Commits

Author SHA1 Message Date
Matt Arsenault
0fa6a67a42
AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166)
Some cases are relying on SIFixSGPRCopies to force VALU
reg_sequence inputs with SGPR inputs to use all VGPR inputs,
but this doesn't always happen if the reg_sequence isn't
invalid. Make sure we use a vgpr up-front here so we don't
rely on something later.
2025-11-14 20:19:24 -08:00
Matt Arsenault
bbde792786
AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475)
Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.

This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's
trying
to "widen" a register, it's generally packing a register into an
existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a11015 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ead4b didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.

I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.
2025-11-11 13:50:57 -08:00
Carl Ritson
385c12134a
[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663)
Apply additional counter waits to address VALU writes to SGPRs. Rework
expiry detection and apply wait coalescing to mitigate some of the
additional waits.
2025-10-28 16:09:28 +09:00
LU-JOHN
9abbec66bf
[AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201)
Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by
excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0.
Passes check-libc-amdgcn-amd-amdhsa now.

Distribution of instructions that allowed a redundant S_CMP to be
deleted in check-libc-amdgcn-amd-amdhsa test:

```
S_AND_B32      485
S_AND_B64      47
S_ANDN2_B32    42
S_ANDN2_B64    277492
S_CSELECT_B64  17631
S_LSHL_B32     6
S_OR_B64       11
```

---------

Signed-off-by: John Lu <John.Lu@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-10-22 08:42:29 -05:00
Jan Patrick Lehr
023b1f6a8e
Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116)
Reverts llvm/llvm-project#162352

Broke our buildbot:
https://lab.llvm.org/buildbot/#/builders/10/builds/15674
To reproduce

cd llvm-project
cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake
-GNinja
cd thebuild
ninja
ninja check-libc-amdgcn-amd-amdhsa
2025-10-18 22:38:14 +02:00
LU-JOHN
8e5f6dd37c
[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352)
Remove redundant s_cmp_lg_* sX, 0 if SALU instruction already sets SCC
if sX!=0.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-10-18 09:33:47 -05:00
Matt Arsenault
c6e280e7ed
PeepholeOpt: Fix losing subregister indexes on full copies (#161310)
Previously if we had a subregister extract reading from a
full copy, the no-subregister incoming copy would overwrite
the DefSubReg index of the folding context.

There's one ugly rvv regression, but it's a downstream
issue of this; an unnecessary same class reg-to-reg full copy
was avoided.
2025-10-02 13:36:47 +09:00
Brox Chen
dfdfc4e490
[AMDGPU][True16][Codegen] remove another build_vector pattern from true16 (#149861)
Remove another build_vector pattern which takes a i16 but placed in a
VGPR_32 from true16 mode. This stop isel from generating illegal
"vgpr_32 = COPY vgpr_16".

ISel will use vgpr16 build vector pattern in true16 mode instead
2025-09-04 18:08:18 -04:00
Matt Arsenault
b1b5102624
AMDGPU: Start considering new atomicrmw metadata on integer operations (#122138)
Start considering !amdgpu.no.remote.memory.access and
!amdgpu.no.fine.grained.host.memory metadata when deciding to expand
integer atomic operations. This does not yet attempt to accurately
handle fadd/fmin/fmax, which are trickier and require migrating the
old "amdgpu-unsafe-fp-atomics" attribute.
2025-08-22 05:29:36 +00:00
Matt Arsenault
01f785cac4
AMDGPU: Expand remaining system atomic operations (#122137)
System scope atomics need to use cmpxchg loops if we know
nothing about the allocation the address is from.
aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this, this
expands the set to cover the remaining integer operations.

Don't expand xchg and add, those theoretically should work over PCIe.
This is a pre-commit which will introduce performance regressions.
Subsequent changes will add handling of new atomicrmw metadata, which
will avoid the expansion.

Note this still isn't conservative enough; we do need to expand
some device scope atomics if the memory is in fine-grained remote
memory.
2025-08-22 13:55:04 +09:00
Brox Chen
c50ed05cad
[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (reopen #153894) (#154211)
recreate this patch from
https://github.com/llvm/llvm-project/pull/153894

It seems ISel sliently ignore the `i64 = zext i16` with a chained
`reg_sequence` pattern and thus this is causing a selection failure in
hip test. Recreate a new patch with an alternative pattern, and added a
ll test global-extload-gfx11plus.ll
2025-08-20 10:26:49 -04:00
Brox Chen
d49aab10bd
Revert "[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#1538… (#154163)
This reverts commit 7c53c6162bd43d952546a3ef7d019babd5244c29.

This patch hit an issue in hip test. revert and will reopen later
2025-08-18 14:01:19 -04:00
Brox Chen
7c53c6162b
[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#153894)
Update true16 mode with zext patterns using vgpr16 for 16bit data types.
This stop isel from inserting invalid "vgpr32 = copy vgpr16"
2025-08-18 11:01:57 -04:00
Shilei Tian
fc0653f31c
[RFC][NFC][AMDGPU] Remove -verify-machineinstrs from llvm/test/CodeGen/AMDGPU/*.ll (#150024)
Recent upstream trends have moved away from explicitly using `-verify-machineinstrs`, as it's already covered by the expensive checks. This PR removes almost all `-verify-machineinstrs` from tests in `llvm/test/CodeGen/AMDGPU/*.ll`, leaving only those tests where its removal currently causes failures.
2025-07-23 13:42:46 -04:00
Brox Chen
0d2b47ae4a
[AMDGPU][True16][CodeGen] stop emitting spgr_lo16 from isel (#144819)
When true16 is enabled, isel start to emit sgpr_lo16 register when a
trunc/sext i16/i32 is generated, or a salu32 is used by vgpr16 or vice
versa. And this causes a problem as sgpr_lo16 is not fully supported in
the pipeline.

True16 mode works fine in -O3 mode since folding pass remove sgpr_lo16
from the pipeline. However it hit a problem in -O0 mode as folding pass
is skipped.

This patch did:
1. stop emitting sgpr_lo16 from isel
2. update codegen pattern to split uniformed/divergent pattern for
i16/i32 conversion
3. update fix-sgpr-copy pass to address legalization requirement in
true16 mode, update fix-sgpr-copies-f16-true16.mir
test to include all possible combinations

This patch is tested with cts and downstream repo with -O0 testing
2025-07-09 16:17:14 -04:00
Guy David
76274eb2b3
[PHIElimination] Revert #131837 #146320 #146337 (#146850)
Reverting because mis-compiles:
- https://github.com/llvm/llvm-project/pull/131837
- https://github.com/llvm/llvm-project/pull/146320
- https://github.com/llvm/llvm-project/pull/146337
2025-07-03 07:48:08 -04:00
Guy David
f5c62ee0fa
[PHIElimination] Reuse existing COPY in predecessor basic block (#131837)
The insertion point of COPY isn't always optimal and could eventually
lead to a worse block layout, see the regression test in the first
commit.

This change affects many architectures but the amount of total
instructions in the test cases seems too be slightly lower.
2025-06-29 21:28:42 +03:00
Ana Mihajlovic
08d747c1ef
[AMDGPU] Fix bad removal of s_delay_alu (#145728)
instructionWaitsForSGPRWrites function covers ALL SALU instructions,
including those like s_waitcnt that don't read from sgpr. This results
in removing delay_alu instructions in cases like VALU->SGPR->VALU, which
results in performance regression. Change modifies the function so that
it checks if instruction also reads a sgpr.
2025-06-27 16:15:10 +02:00
Brox Chen
6dbc01e801
[AMDGPU][True16][CodeGen] update GFX11Plus codegen test with true16 flag (#135078)
This is a NFC patch.

This patch run a bulk update on CodeGen tests that are impacted by the
true16 features. This patch applies:
1. duplicate GFX11plus runlines and apply them with
"+mattr=+real-true16" and "+mattr=-real-true16"
2. update the test with the update script

For some GISEL runlines, the current CodeGen do not fully support the
true16 version. Still update the runlines, but comment out the failing
one, and added a "FIXME-TRUE16" comment to that test for easier
tracking. These test will be fixed in the following patches.

This is in a transition state that we support both
"+real-true16/-real-true16" in our code base. We plan to move to
"+real-true16" as default, and finally remove "-real-true16" mode and
test lines.
2025-04-23 13:06:52 -04:00
zhijian lin
afda4c295b
Reland [SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#136701)
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
2025-04-22 17:36:41 -04:00
Nico Weber
e18a77cfbe Revert "[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)"
This reverts commit f12078e72601e7c03e5d66afab034313caf8f791.

Breaks `check-llvm`, see comments on https://github.com/llvm/llvm-project/pull/122741
2025-04-21 10:51:03 -04:00
zhijian lin
f12078e726
[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)
The PR will fix the issue
https://github.com/llvm/llvm-project/issues/122728

This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
2025-04-21 10:02:21 -04:00
Shoreshen
121cd7c6f0
Re apply 130577 narrow math for and operand (#133896)
Re-apply https://github.com/llvm/llvm-project/pull/130577

Which is reverted in https://github.com/llvm/llvm-project/pull/133880

The old application failed in address sanitizer due to
`tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in
`AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, it create a use after
free failure.

To fix this, `tryNarrowMathIfNoOverflow` will be called before and
directly return if `tryNarrowMathIfNoOverflow` result in true.
2025-04-17 17:03:32 +08:00
Shoreshen
7f14b2a9eb
Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880)
Reverts llvm/llvm-project#130577
2025-04-01 17:37:02 +08:00
Shoreshen
145b4a3950
[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577)
For Add, Sub, Mul with Int64 type, if profitable, then do:
1. Trunc operands to Int32 type
2. Apply 32 bit Add/Sub/Mul
3. Zext to Int64 type
2025-04-01 11:18:17 +08:00
Ana Mihajlovic
8c7550132f
[AMDGPU] Unused sdst writing to null (#133229)
Unused sdst writing to null to avoid a false VALU->SALU dependency
stall. This requires using the VOP3 encoding.
2025-03-28 18:12:34 +01:00
Ana Mihajlovic
459b4e3fe1
Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)" (#131111)
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
2025-03-13 10:26:20 +01:00
Kazu Hirata
aa008e0008 Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)"
This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/127212
2025-03-12 12:09:09 -07:00
Ana Mihajlovic
71582c6667
[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
2025-03-12 09:33:07 -07:00
Matt Arsenault
6aea6308d1
AMDGPU: Fix creating illegally typed readfirstlane in atomic optimizer (#128388)
We need to promote 8/16-bit cases to 32-bit. Unfortunately we are
missing demanded bits optimizations on readfirstlane, so we end up
emitting
an and instruction on the input. I'm also surprised this pass isn't
handling
half or bfloat yet.
2025-02-24 18:39:49 +07:00
Matt Arsenault
1bb43068f1
PeepholeOpt: Allow introducing subregister uses on reg_sequence (#127052)
This reverts d246cc618adc52fdbd69d44a2a375c8af97b6106. We now handle
composing subregister extracts through reg_sequence.
2025-02-22 09:16:14 +07:00
Sergei Barannikov
ff9c041d96
[MachineScheduler] Fix physreg dependencies of ExitSU (#123541)
Providing the correct operand index allows addPhysRegDataDeps to compute
the correct latency.

Pull Request: https://github.com/llvm/llvm-project/pull/123541
2025-02-01 20:40:50 +03:00
Matt Arsenault
d246cc618a
PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111)
Given the rest of the pass just gives up when it needs to compose
subregisters, folding a subregister extract directly into a reg_sequence
is counterproductive. Later fold attempts in the function will give up
on the subregister operand, preventing looking up through the reg_sequence.

It may still be profitable to do these folds if we start handling
the composes. There are some test regressions, but this mostly
looks better.
2025-01-30 20:42:02 +07:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Shilei Tian
6548b6354d Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08 20:21:16 -05:00
Shilei Tian
ca33649abe Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both
hip and openmp buildbots.
2024-11-08 16:36:35 -05:00
Shilei Tian
e215a1e27d
[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403) 2024-11-08 13:05:35 -05:00
Stanislav Mekhanoshin
6d7e51de5e
[AMDGPU] Extend type support for update_dpp intrinsic (#114597)
We can split 64-bit DPP as a post-RA pseudo if control values are
supported, but cannot handle other types.
2024-11-05 13:59:14 -08:00
Stanislav Mekhanoshin
3277c7cd28
[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765)
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
2024-10-21 09:39:52 -07:00
Pierre van Houtryve
924a64a348
[AMDGPU] Only emit SCOPE_SYS global_wb (#110636)
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.

I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
2024-10-07 07:35:31 +02:00
Matt Arsenault
8632e8bd64
AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514) 2024-09-23 13:20:21 +04:00
Jay Foad
e55d6f5ea2
[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889)
Always generate v_cndmask_b32 instead of modifying exec around
v_mov_b32. This is expected to be faster because
modifying exec generally causes pipeline stalls.
2024-09-11 17:16:06 +01:00
Carl Ritson
16cda01d22
[AMDGPU] V_SET_INACTIVE optimizations (#98864)
Optimize V_SET_INACTIVE by allow it to run in WWM.
Hence WWM sections are not broken up for inactive lane setting.
WWM V_SET_INACTIVE can typically be lower to V_CNDMASK.
Some cases require use of exec manipulation V_MOV as previous code.
GFX9 sees slight instruction count increase in edge cases due to
smaller constant bus.

Additionally avoid introducing exec manipulation and V_MOVs where
a source of V_SET_INACTIVE is the destination.
This is a common pattern as WWM register pre-allocation often
assigns the same register.
2024-09-05 14:39:28 +09:00
Jay Foad
5a6926ce49 [AMDGPU] Fix test update after #107108 2024-09-04 11:48:08 +01:00
Jay Foad
126d6f2710
[AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108)
Use poison for an unused input to the permlanex16 intrinsic, to improve
register allocation and avoid an unnecessary v_mov instruction.
2024-09-04 11:03:22 +01:00
Carl Ritson
86627149f6
[AMDGPU] Mitigate GFX12 VALU read SGPR hazard (#100067)
Any SGPR read by a VALU can potentially obscure SALU writes to the same
register.
Insert s_wait_alu instructions to mitigate the hazard on affected paths.

Compute a global cache of SGPRs with any VALU reads and use this to
avoid inserting mitigation for SGPRs never accessed by VALUs.

To avoid excessive search when compile time is priority implement
secondary mode where all SALU writes are mitigated.

Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-09-04 12:15:20 +09:00
Changpeng Fang
a82032918c
[AMDGPU] Remove -wavefrontsize32 and -wavefrontsize64 from GFX10+ tests (NFC) (#100711)
They are no longer needed after the patch: [AMDGPU] Remove wavefrontsize
feature from GFX10: https://github.com/llvm/llvm-project/pull/98400
The exception is when "target-features" are set to "+wavefrontsize32" or
"+wavefrontsize64", we still need to remove a wavefrontsize feature
before add a different one to make sure only one of them are present.
2024-07-26 00:42:24 -07:00
Christudasan Devadasan
229e118559
[AMDGPU] Codegen support for constrained multi-dword sloads (#96163)
For targets that support xnack replay feature (gfx8+), the
multi-dword scalar loads shouldn't clobber any register that
holds the src address. The constrained version of the scalar
loads have the early clobber flag attached to the dst operand
to restrict RA from re-allocating any of the src regs for its
dst operand.
2024-07-23 13:59:15 +05:30
Pierre van Houtryve
b3a446650c
[AMDGPU] Implement GFX12 Memory Model (#98591)
- Emit GLOBAL_WB instructions
- Reflect synscope on instructions's `scope:` operand

Fixes SWDEV-468508
Fixes SWDEV-470735
Fixes SWDEV-468392
Fixes SWDEV-469622
2024-07-16 10:53:06 +02:00
Vikram Hegde
cf230e7799
[AMDGPU] Enable atomic optimizer for divergent i64 and double values (#96934) 2024-07-15 17:49:09 +05:30