548 Commits

Author SHA1 Message Date
Pierre van Houtryve
6f7c77fe90
[AMDGPU] Check noalias.addrspace in mayAccessScratchThroughFlat (#151319)
PR #149247 made the MD accessible by the backend so we can now leverage
it in the memory model. The first use case here is detecting if a flat op
can access scratch memory.
Benefits both the MemoryLegalizer and InsertWaitCnt.
2025-08-19 07:42:59 +02:00
Stanislav Mekhanoshin
a26c3e9491
[AMDGPU] User SGPR count increased to 32 on gfx1250 (#154205) 2025-08-18 15:04:56 -07:00
Stanislav Mekhanoshin
57c1e01e48
[AMDGPU] Don't allow wgp mode on gfx1250 (#153680)
- gfx1250 only supports cu mode
2025-08-14 15:16:56 -07:00
Stanislav Mekhanoshin
49f2093477
[AMDGPU] Increase LDS to 320K on gfx1250 (#153645) 2025-08-14 12:52:00 -07:00
Stanislav Mekhanoshin
80d430df5d
[AMDGPU] Add MSG_SAVEWAVE_HAS_TDM on gfx1250 (#153483) 2025-08-13 23:01:50 -07:00
Stanislav Mekhanoshin
fc911fe928
[AMDGPU] Add HW_REG_IB_STS2 on gfx1250 (#153479) 2025-08-13 23:01:28 -07:00
Stanislav Mekhanoshin
ea14834966
[AMDGPU] Per-subtarget DPP instruction classification (#153096)
This is NFCI at this point.
2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin
dddeb07c2e
[AMDGPU] Restrict packed math FP32 instructions to read only one SGPR per operand on gfx12+ (#152465)
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.

Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>

Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
2025-08-07 16:13:34 -07:00
Stanislav Mekhanoshin
66392a8d8d
[AMDGPU] Add XNACK_STATE_PRIV and _MASK gfx1250 registers (#152374)
Co-authored-by: Pierre Vanhoutryve <pierre.vanhoutryve@amd.com>

Co-authored-by: Pierre Vanhoutryve <pierre.vanhoutryve@amd.com>
2025-08-06 14:44:17 -07:00
Stanislav Mekhanoshin
d08c2977e8
[AMDGPU] Add MC support for new gfx1250 src_flat_scratch_base_lo/hi (#152203) 2025-08-05 14:35:48 -07:00
Matt Arsenault
1d7a0fa08a
AMDGPU: Move asm constraint physreg parsing to utils (#150903)
Also fixes an assertion on out of bound physical register
indexes.
2025-08-01 16:11:11 +09:00
Stanislav Mekhanoshin
ce40863209
[AMDGPU] Add v_cvt_sr|pk_bf8|fp8_f16 gfx1250 instructions (#151415) 2025-07-30 17:24:45 -07:00
Changpeng Fang
67e2faa50c
[AMDGPU] MC support for async load and store on gfx1250 (#151030) 2025-07-28 13:45:37 -07:00
Stanislav Mekhanoshin
a0b854d576
[AMDGPU] MC support for gfx1250 scale_offset modifier (#149881) 2025-07-21 15:04:59 -07:00
Changpeng Fang
d6094370cb
AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-21 10:09:42 -07:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Changpeng Fang
b52cf756ce
AMDGPU: Treat WMMA XDL ops as TRANS in S_DELAY_ALU insertion for gfx1250 (#149208)
WMMA XDL instructions are tracked as TRANs ops and the compiler should
consider them the same as TRANS in S_DELAY_ALU insertion. We use a searchable
table for the InsertDelayAlu pass to recognize these WMMA XDL instructions.

Co-authored-by: Stefan Stipanovic <Stefan.Stipanovic@amd.com>
2025-07-16 17:07:48 -07:00
Stanislav Mekhanoshin
5277021c3c
[AMDGPU] Add gfx1250 v_fmac_f64 implementation (#148725) 2025-07-14 15:39:04 -07:00
Stanislav Mekhanoshin
f090554359
[AMDGPU] MC support for v_fmaak_f64/v_fmamk_f64 gfx1250 intructions (#148282) 2025-07-11 14:17:03 -07:00
Stanislav Mekhanoshin
7920dff394
[AMDGPU] VOPD/VOPD3 changes for gfx1250 (#147602) 2025-07-10 14:15:01 -07:00
Stanislav Mekhanoshin
00a85e5704
[AMDGPU] gfx1250: MC support for 64-bit literals (#147861) 2025-07-09 22:25:47 -07:00
Changpeng Fang
eda3161c35
AMDGPU: Implement tensor load and store instructions for gfx1250 (#146636)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-03 13:49:34 -07:00
Kazu Hirata
e79e22c9af
[AMDGPU] Remove an unnecessary cast (NFC) (#146548)
Val is already of uint64_t.
2025-07-01 10:42:22 -07:00
Christudasan Devadasan
08b8d467d4
[AMDGPU][GFX1250] Insert S_WAIT_XCNT for SMEM and VMEM load-stores (#145566)
This patch tracks the register operands of both VMEM (FLAT, MUBUF,
MTBUF) and SMEM load-store operations and inserts a S_WAIT_XCNT
instruction with sufficient wait-count before potentially redefining
them. For VMEM instructions, XNACK is returned in the same order as
they were issued and hence non-zero counter values can be inserted.
However, SMEM execution is out-of-order and so is their XNACK reception.
Thus, only zero counter value can be inserted to capture SMEM dependencies.
2025-06-25 10:40:36 +05:30
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Stanislav Mekhanoshin
fa0b84f23c
[AMDGPU] Rename call instructions from b64 to i64 (#145103)
These get renamed in gfx1250 and on from B64 to I64:

  S_CALL_I64
  S_GET_PC_I64
  S_RFE_I64
  S_SET_PC_I64
  S_SWAP_PC_I64
2025-06-21 21:42:09 -07:00
Diana Picus
40a7dce9ef
[AMDGPU] Remove duplicated/confusing helpers. NFCI (#142598)
Move canGuaranteeTCO and mayTailCallThisCC into AMDGPUBaseInfo instead
of keeping two copies for DAG/Global ISel.

Also remove isKernelCC, which doesn't agree with isKernel and doesn't
seem very useful.

While at it, also move all the CC-related helpers into AMDGPUBaseInfo.h and
mark them constexpr.
2025-06-05 11:19:20 +02:00
Fangrui Song
a0901a2f87 Replace #include MCAsmLexer.h with AsmLexer.h
MCAsmLexer.h has been made a forwarder header since #134207
2025-05-25 11:57:29 -07:00
Kazu Hirata
70ef89b913
[AMDGPU] Use std::optional::value_or (NFC) (#140006) 2025-05-14 23:50:13 -07:00
Lucas Ramirez
6456ee056f
Reapply "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)" (#139548)
This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to
issues detected by the address sanitizer (MIs have to be removed from
live intervals before being removed from their parent MBB).

Original commit description below.

AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
2025-05-13 11:11:00 +02:00
Vitaly Buka
b35f6e26a5
Revert "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)" (#139341)
And related "[AMDGPU] Regenerate mfma-loop.ll test"

Introduce memory error detected by Asan #125885.

This reverts commit 382a085a95b0abeac77b150b7b644b372bd08e78.
This reverts commit 067caaafb58a156d0d77229422607782a639f5b5.
2025-05-09 17:51:46 -07:00
Ivan Kosarev
66d3980b53
[AMDGPU][NFC] Remove _DEFERRED operands. (#139123)
All immediates are deferred now.
2025-05-09 10:10:53 +01:00
Ivan Kosarev
c290f48a45
[AMDGPU][NFC] Remove unused operand types. (#139062) 2025-05-08 12:48:25 +01:00
Lucas Ramirez
067caaafb5
[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)
AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
2025-05-08 12:51:06 +02:00
David Stuttard
1a32613dac
[AMDGPU] Update pal metadata for v3.6 and fix v3.0 (#135196)
Update entry_point for all pal versions below 3.6.
3.6 and above removes entry_point.
2025-04-28 13:31:14 +01:00
Kazu Hirata
fcd0664126
[AMDGPU] Simplify GetMember...::Get (NFC) (#137536)
We can use "constexpr if" to combine the two variants of functions.
2025-04-27 14:09:16 -07:00
Kazu Hirata
cbd32833fb
[AMDGPU] Simplify PrintField::printField (NFC) (#137502)
We can use "constexpr if" to combine the two variants of functions.
2025-04-27 08:52:29 -07:00
Brox Chen
cbe8f3ad76
[AMDGPU][True16][MC] fix fmac_f16_t16 vop3 format (#135464)
add fmac_f16_t16_e64 to isfmac check to fix the vop3 format of
fmac_f16_t16 instruction
2025-04-13 18:10:31 -04:00
David Stuttard
15428e0d78
[AMDGPU] Add support for point sample accel out of order returns (#127991)
Add target feature for point sample acceleration and enable it for
relevant
targets.

Also add support to insert waitcnts where required when point sample
accel may
have occurred. This has implications for out of order returns, which is
why
extra waitcnts are required.

Add a VMEM_NOSAMPLER bit in the register masks to determine when
waitcnt is required.
2025-04-10 15:33:48 +01:00
Shilei Tian
3e742b517a
[NFC][AMDGPU] clang-format AMDGPUBaseInfo.[h,cpp] (#133559) 2025-03-29 10:28:34 -04:00
Shilei Tian
02b45f4b81
[AMDGPU] Add a new function getIntegerPairAttribute (#133271)
The new function will return `std::nullopt` when any error occurs.
2025-03-27 15:38:54 -04:00
Shilei Tian
f1ac2afe21
Reapply "[AMDGPU] Use COV6 by default (#118515)" (#130963)
This reverts commit 68bcba6d7a1cc18996c0bcb7c62267c62d2040d0.
2025-03-21 15:26:45 -04:00
Diana Picus
1f84495255
[AMDGPU] Update target helpers & GCNSchedStrategy for dynamic VGPRs (#130047)
In dynamic VGPR mode, we can allocate up to 8 blocks of either 16 or 32
VGPRs (based on a chip-wide setting which we can model with a Subtarget
feature). Update some of the subtarget helpers to reflect this.

In particular:
- getVGPRAllocGranule is set to the block size
- getAddresableNumVGPR will limit itself to 8 * size of a block

We also try to be more careful about how many VGPR blocks we allocate.
Therefore, when deciding if we should revert scheduling after a given
stage, we check that we haven't increased the number of VGPR blocks that
need to be allocated.

---------

Co-authored-by: Jannik Silvanus <jannik.silvanus@amd.com>
2025-03-19 10:29:38 +01:00
Alex Voicu
c1fabd681f
[llvm][AMDGPU] Enable FWD_PROGRESS bit for GFX10+ (#128367)
From GFX10 onwards it is possible to employ benevolent scheduling of
waves. This patch unconditionally enables, for the `amdhsa` OS, the bit
which controls that capability, as it is beneficial for algorithms that
rely on more complex concurrent coordination and it is generally
performance neutral otherwise.
2025-03-17 23:17:46 +00:00
Kazu Hirata
508db53d1a
[AMDGPU] Avoid repeated hash lookups (NFC) (#131493) 2025-03-15 22:35:42 -07:00
Fangrui Song
7722d7519c [MC] evaluateAsRelocatableImpl: remove the Fixup argument
Follow-up to d6fbffa23c84e622735b3e880fd800985c1c0072 . This commit
updates all call sites and removes the argument from the function.
2025-03-15 16:10:19 -07:00
Shilei Tian
dccc0a836c
[NFC][AMDGPU] Replace more direct arch comparison with isAMDGCN() (#131379)
This is an extension of #131357. Hopefully this would be the last one.
2025-03-14 17:02:15 -04:00
Ana Mihajlovic
65ade6d2eb
[AMDGPU] Merge consecutive wait_alu instruction (#128916) 2025-03-12 10:27:27 +01:00
Matt Arsenault
0eaca04125 AMDGPU: Remove repeated define in base info header
The identical define is repeated on the previous line.
2025-03-04 15:58:06 +07:00
Jay Foad
44607666b3
[AMDGPU] Simplify conditional expressions. NFC. (#129228)
Simplfy `cond ? val : false` to `cond && val` and similar.
2025-03-03 10:40:49 +00:00