971 Commits

Author SHA1 Message Date
Matt Arsenault
694a488708
AMDGPU: Add pseudoinstruction for 64-bit agpr or vgpr constants (#154499)
64-bit version of 7425af4b7aaa31da10bd1bc7996d3bb212c79d88. We
still need to lower to 32-bit v_accvgpr_write_b32s, so this has
a unique value restriction that requires both halves of the constant
to be 32-bit inline immediates. This only introduces the new
pseudo definitions, but doesn't try to use them yet.
2025-08-20 22:54:37 +09:00
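The restriction described in the commit above (both 32-bit halves of the 64-bit constant must be inline immediates) can be illustrated with a small standalone sketch. This is not the code from the patch: the helper names `isInlineImm32` and `bothHalvesInline` are hypothetical, and the inline-immediate list is a simplified assumption (small integers in [-16, 64] plus a few float bit patterns; the subtarget-dependent 1/(2*pi) constant is deliberately left out).

```cpp
#include <cstdint>

// Hypothetical helper: true if a 32-bit pattern is encodable as an inline
// immediate. Assumption: simplified list, 1/(2*pi) is intentionally omitted.
bool isInlineImm32(uint32_t Bits) {
  int32_t Signed = static_cast<int32_t>(Bits);
  if (Signed >= -16 && Signed <= 64)
    return true; // small signed integers
  switch (Bits) {
  case 0x3F000000u: // 0.5
  case 0xBF000000u: // -0.5
  case 0x3F800000u: // 1.0
  case 0xBF800000u: // -1.0
  case 0x40000000u: // 2.0
  case 0xC0000000u: // -2.0
  case 0x40800000u: // 4.0
  case 0xC0800000u: // -4.0
    return true;
  default:
    return false;
  }
}

// Hypothetical helper: a 64-bit constant qualifies only if the low and the
// high half can each be written with a separate 32-bit inline immediate.
bool bothHalvesInline(uint64_t Imm) {
  return isInlineImm32(static_cast<uint32_t>(Imm)) &&
         isInlineImm32(static_cast<uint32_t>(Imm >> 32));
}
```

Under these assumptions, a constant such as 0x3F800000FFFFFFFF (high half 1.0, low half -1) would qualify, while any constant with an arbitrary 32-bit literal in either half would not.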
Matt Arsenault
ed0e531044
AMDGPU: Use Register type for isStackAccess (#154320) 2025-08-19 23:00:45 +09:00
Pierre van Houtryve
6f7c77fe90
[AMDGPU] Check noalias.addrspace in mayAccessScratchThroughFlat (#151319)
PR #149247 made the MD accessible by the backend so we can now leverage
it in the memory model. The first use case here is detecting if a flat op
can access scratch memory.
Benefits both the MemoryLegalizer and InsertWaitCnt.
2025-08-19 07:42:59 +02:00
Stanislav Mekhanoshin
906c9e9542
[AMDGPU] Remove misplaced assert. (#154187)
The assert that a RegScavenger is required for long branching is now
placed below the s_add_pc64 code path, at the point where the scavenger
is actually used.
2025-08-18 13:58:54 -07:00
Stanislav Mekhanoshin
13716843eb
[AMDGPU] Make s_setprio_inc_wg a scheduling boundary (#154188) 2025-08-18 13:20:38 -07:00
Stanislav Mekhanoshin
ea14834966
[AMDGPU] Per-subtarget DPP instruction classification (#153096)
This is NFCI at this point.
2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin
dddeb07c2e
[AMDGPU] Restrict packed math FP32 instructions to read only one SGPR per operand on gfx12+ (#152465)
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.

Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
2025-08-07 16:13:34 -07:00
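The `op_sel`/`op_sel_hi` rule in the commit above reduces to a bitmask check. The following is a minimal sketch, not the verifier code from the patch; the function name and the `SgprMask` encoding (bit i set when source operand i is an SGPR) are assumptions made for illustration.

```cpp
// Hypothetical check: for every source operand that is an SGPR, the
// corresponding op_sel and op_sel_hi bits must agree, because only one
// SGPR is read for both the low and the high operation.
bool opSelLegalForSgprs(unsigned OpSel, unsigned OpSelHi, unsigned SgprMask) {
  // Bits set here mark operands whose low and high selections differ.
  unsigned Differs = OpSel ^ OpSelHi;
  // Differences are only a problem on operands that are SGPRs.
  return (Differs & SgprMask) == 0;
}
```

Operands that are not SGPRs are unconstrained in this sketch, so mismatched bits on them are accepted.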
Shilei Tian
351b38f266
[AMDGPU] Mark address space cast from private to flat as divergent if target supports globally addressable scratch (#152376)
Globally addressable scratch is a new feature introduced in gfx1250.
However, this feature changes how scratch space is mapped into the flat
aperture, making address space casts from private to flat no longer
uniform.
2025-08-06 17:08:56 -04:00
Changpeng Fang
32161e9de3
[AMDGPU] Do not fold an immediate into instructions with frame indexes (#151263)
Do not fold an immediate into an instruction that already has a frame
index operand. A frame index could possibly turn out to be another immediate.

Fixes: SWDEV-536263

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-08-06 11:47:37 -07:00
Stanislav Mekhanoshin
33abf05af4
[AMDGPU] gfx1250 v_permlane_* instructions (#151749) 2025-08-01 16:14:19 -07:00
Stanislav Mekhanoshin
ce40863209
[AMDGPU] Add v_cvt_sr|pk_bf8|fp8_f16 gfx1250 instructions (#151415) 2025-07-30 17:24:45 -07:00
Brox Chen
2a3f72ee6e
[AMDGPU][CodeGen][True16] Correct size calculation for d16 insts (#151042)
D16 pseudo instructions are introduced in true16 mode to represent a D16
load/store. In MC lowering, the pseudo instructions are lowered to the
corresponding D16 Lo/Hi MC Inst, respecting the register allocation.

However, the pseudo instruction has size 0, which causes an issue in the
Inst size estimation. Use D16 Lo when calculating the inst size.
2025-07-29 13:01:57 -04:00
Pierre van Houtryve
2ad4e93ded
[AMDGPU][gfx1250] Use SCOPE_SE for stores that may hit scratch (#150586) 2025-07-28 11:40:56 +02:00
Changpeng Fang
400ce1a3d3
[AMDGPU] Support AMDGPUClamp for bf16 on gfx1250 (#150663)
The scalar version uses V_MAX_BF16_PSEUDO, which is expanded to V_PK_MAX_BF16
with unused high bits. Producing V_PK_MAX_BF16 directly instead
creates a problem when folding the clamp into other scalar
instructions, due to incompatible clamp bits.

FIXME-TRUE16: enable bf16 clamp with true16

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-25 12:13:06 -07:00
Jay Foad
8005c6a108
[AMDGPU] Simplify SIInstrInfo::isLegalToSwap. NFC. (#149058) 2025-07-25 13:02:34 +01:00
Stanislav Mekhanoshin
2346968807
[AMDGPU] Add V_ADD|SUB|MUL_U64 gfx1250 opcodes (#150291) 2025-07-23 13:17:56 -07:00
Stanislav Mekhanoshin
a0b854d576
[AMDGPU] MC support for gfx1250 scale_offset modifier (#149881) 2025-07-21 15:04:59 -07:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Matt Arsenault
176ae32de0
AMDGPU: Fix introducing use of killed vgpr in gfx908 agpr copy (#149291)
When searching for an existing VGPR source for an AGPR to AGPR
copy on gfx908, this wasn't verifying the vgpr wasn't killed by
other prior uses.
2025-07-18 15:34:47 +09:00
Matt Arsenault
1614c3b3c7
AMDGPU: Always use AV spill pseudos on targets with AGPRs (#149099)
This increases allocator freedom to inflate register classes
to the AV class; we don't need to introduce a new restriction
by basing the opcode on the current virtual register class.
Ideally we would avoid this if we don't have any allocatable
AGPRs for the function, but it probably doesn't make much
difference in the end result if they are excluded from the
final allocation order.
2025-07-18 15:31:50 +09:00
Matt Arsenault
4a3cb437a3
AMDGPU: Avoid hardcoding mov opcode 2025-07-17 15:11:52 +09:00
Changpeng Fang
b52cf756ce
AMDGPU: Treat WMMA XDL ops as TRANS in S_DELAY_ALU insertion for gfx1250 (#149208)
WMMA XDL instructions are tracked as TRANS ops and the compiler should
consider them the same as TRANS in S_DELAY_ALU insertion. We use a searchable
table for the InsertDelayAlu pass to recognize these WMMA XDL instructions.

Co-authored-by: Stefan Stipanovic <Stefan.Stipanovic@amd.com>
2025-07-16 17:07:48 -07:00
Stanislav Mekhanoshin
703501e661
[AMDGPU] Select flat GVS loads on gfx1250 (#149183) 2025-07-16 15:06:37 -07:00
Stanislav Mekhanoshin
82d7405b3b
[AMDGPU] Use S_ADD_PC_I64 for long branches in gfx1250 (#148961) 2025-07-15 17:14:56 -07:00
Stanislav Mekhanoshin
2d6534b7da
[AMDGPU] gfx1250 64-bit relocations and fixups (#148951) 2025-07-15 17:13:42 -07:00
Paul Trojahn
70e1a3cead
[AMDGPU] Check legality of both operands before swap (#148843)
When trying to fold an SGPR into the second operand to a DPP add,
si-fold-operands correctly determines that this is not possible and
attempts to swap the second and third operand. This succeeds even if the
third operand is an SGPR, creating an illegal dpp add with two SGPR
operands. We need to check both operands if they are legal in their new
position.

This causes a crash at compile time for a test in triton on gfx12:

345c633787/python/test/unit/language/test_core.py (L2718)

Co-authored-by: Paul Trojahn <paul.trojahn@amd.com>
2025-07-15 15:55:26 -04:00
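The fix in the commit above amounts to checking both operands in their new positions before commuting. The toy model below is only a sketch of that idea, not the si-fold-operands code: `OpKind`, `isLegalAt`, and `foldViaCommute` are hypothetical names, and the legality rule (operand index 1 accepts only a VGPR, index 2 accepts a VGPR or an SGPR) is an assumption chosen to mimic the DPP case.

```cpp
#include <array>

enum class OpKind { VGPR, SGPR };

// Hypothetical legality rule for a DPP-like instruction {dst, src0, src1}:
// src0 (index 1) must be a VGPR, other positions accept either kind.
bool isLegalAt(OpKind K, unsigned Idx) {
  if (Idx == 1)
    return K == OpKind::VGPR;
  return K == OpKind::VGPR || K == OpKind::SGPR;
}

// Try to fold NewOp into FoldIdx by commuting with OtherIdx. After the
// commute, NewOp sits at OtherIdx and the old OtherIdx operand sits at
// FoldIdx, so both placements are checked before anything is modified.
bool foldViaCommute(std::array<OpKind, 3> &Ops, unsigned FoldIdx,
                    unsigned OtherIdx, OpKind NewOp) {
  if (!isLegalAt(NewOp, OtherIdx) || !isLegalAt(Ops[OtherIdx], FoldIdx))
    return false; // leave the instruction untouched
  Ops[FoldIdx] = Ops[OtherIdx];
  Ops[OtherIdx] = NewOp;
  return true;
}
```

Checking only the folded operand would accept the case where the other source is already an SGPR, which is the illegal two-SGPR result the commit describes.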
Stanislav Mekhanoshin
cbba8f0acb
[AMDGPU] Codegen support for v_fmaak_f64/v_fmamk_f64 (#148734) 2025-07-14 17:57:06 -07:00
Stanislav Mekhanoshin
a32040e483
[AMDGPU] Use 64-bit literals in codegen on gfx1250 (#148727) 2025-07-14 15:47:24 -07:00
Stanislav Mekhanoshin
d1e3ab9c4b
[AMDGPU] Use v_mov_b64 in codegen on gfx1250 (#148272) 2025-07-11 22:16:50 -07:00
Stanislav Mekhanoshin
f090554359
[AMDGPU] MC support for v_fmaak_f64/v_fmamk_f64 gfx1250 instructions (#148282) 2025-07-11 14:17:03 -07:00
Brox Chen
0d2b47ae4a
[AMDGPU][True16][CodeGen] stop emitting sgpr_lo16 from isel (#144819)
When true16 is enabled, isel starts to emit sgpr_lo16 registers when a
trunc/sext i16/i32 is generated, or when a salu32 is used by a vgpr16 or vice
versa. This causes a problem because sgpr_lo16 is not fully supported in
the pipeline.

True16 mode works fine at -O3 since the folding pass removes sgpr_lo16
from the pipeline. However, it hits a problem at -O0 because the folding pass
is skipped.

This patch:
1. stops emitting sgpr_lo16 from isel
2. updates the codegen patterns to split uniform/divergent patterns for
i16/i32 conversion
3. updates the fix-sgpr-copy pass to address the legalization requirement in
true16 mode, and updates the fix-sgpr-copies-f16-true16.mir
test to include all possible combinations

This patch was tested with cts and a downstream repo with -O0 testing.
Changpeng Fang
eda3161c35
AMDGPU: Implement tensor load and store instructions for gfx1250 (#146636)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-03 13:49:34 -07:00
Changpeng Fang
5035d20dcb
AMDGPU: Implement ds_atomic_async_barrier_arrive_b64/ds_atomic_barrier_arrive_rtn_b64 (#146409)
These two instructions are supported by gfx1250. We define the
instructions and implement the corresponding intrinsic and builtin.

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-01 11:08:49 -07:00
Matt Arsenault
4cb8308ee9
AMDGPU: Avoid report_fatal_error for unsupported ds_ordered_count (#145172) 2025-06-26 14:06:05 +09:00
Brox Chen
505906bff6
[AMDGPU][True16][CodeGen] do not legalize t16 operand during user scan (#145450)
The legalize t16 operand function can insert a reg_sequence, which
modifies the user list of the targeted register, so we should not call
it in the middle of a user list iteration.
2025-06-24 23:49:22 -04:00
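The hazard named in the commit above (mutating a user list while iterating over it) is commonly avoided by snapshotting the list first. The sketch below shows that general pattern only; it is not the moveToVALU code, and `UseList`, `legalizeAll`, and the "odd ids need legalization" rule are hypothetical stand-ins.

```cpp
#include <vector>

// Toy stand-in for a register's user list. Legalizing a user may append
// a new user, much like an inserted REG_SEQUENCE would.
struct UseList {
  std::vector<int> Users;
};

// Returns how many new users were inserted.
int legalizeAll(UseList &L) {
  // Snapshot first, so the insertions below cannot invalidate the
  // iterators of the loop that is walking the users.
  const std::vector<int> Worklist = L.Users;
  int Inserted = 0;
  for (int U : Worklist) {
    if (U % 2 != 0) {             // hypothetical "needs legalization" test
      L.Users.push_back(U * 100); // models inserting a new user
      ++Inserted;
    }
  }
  return Inserted;
}
```

Iterating `L.Users` directly while calling `push_back` on it could reallocate the vector mid-loop, which is the same class of bug as modifying a use list during a scan.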
Stanislav Mekhanoshin
40eee8ec7f
[AMDGPU] Add s_setprio_inc_wg gfx1250 instruction (#145152) 2025-06-22 12:52:05 -07:00
Brox Chen
e75e2485f2
[AMDGPU][True16][Codegen] keep srcmod/clamp/omod from v_s_xxx_f16 when moved to VALU (#144781)
https://github.com/llvm/llvm-project/pull/141152 causes an issue in
v_s_xxx_f16 lowering in both true16/fake16 flow.

V_S_XXX_F16 are special insts which have scalar input/output but use the VALU
VOP3 format. We need to keep the srcmod/clamp/omod when lowering one to its
corresponding VALU inst with vector input/output.
2025-06-19 09:26:45 -04:00
Brox Chen
e48731bc03
[AMDGPU][True16][CodeGen] v_s_xxx_f16 t16 mode handling in movetoVALU process (#141152)
Add op_sel for v_s_xxx_f16 when moving them to VALU.

Update a few related codegen tests for gfx12 in true16 mode.
2025-06-10 15:36:44 -04:00
Jay Foad
9cacc4138e
[AMDGPU] Move S_ADD_U64_PSEUDO handling into getVALUOp. NFC. (#142934)
S_ADD_U64_PSEUDO and S_SUB_U64_PSEUDO are not "special cases" so can be
handled in getVALUOp instead of moveToVALUImpl.
2025-06-05 16:49:24 +01:00
Brox Chen
b668b6439a
[AMDGPU][True16][CodeGen] legalize 16bit and 32bit use-def chain for moveToVALU in si-fix-sgpr-lowering (#138734)
Two changes in this patch:
1. Cover another case in the legalizeOperandVALUt16 functions and the COPY
lowering: when a SALU16 is used by a SALU32, a reg_sequence needs to be
inserted after the move to VALU (previously only the SALU32 used by SALU16
case was considered)
2. Move the useMI analysis into addUsersToMoveVALUList. Legalize the
targeted operand when needed.

Enable the frem test in true16 mode for gfx1150, which failed before
this patch. A few bitcast tests are also affected by this change, with some
v_mov instructions replaced by dual movs.
2025-06-04 09:53:10 -04:00
Matt Arsenault
65b90c59ce
AMDGPU: Remove redundant operand folding checks (#140587)
This was pre-filtering out a specific situation from being
added to the fold candidate list. The operand legality will
ultimately be checked with isOperandLegal before the fold is
performed, so I don't see the plus in pre-filtering this one
case.
2025-05-29 19:38:45 +02:00
Justin Bogner
b7bb256703
Warn on misuse of DiagnosticInfo classes that hold Twines (#137397)
This annotates the `Twine` passed to the constructors of the various
DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes
us to warn when we would try to print the twine after it had already
been destructed.

We also update `DiagnosticInfoUnsupported` to hold a `const Twine &`
like all of the other DiagnosticInfo classes, since this warning allows
us to clean up all of the places where it was being used incorrectly.
2025-05-28 12:26:39 -07:00
Ivan Kosarev
66d3980b53
[AMDGPU][NFC] Remove _DEFERRED operands. (#139123)
All immediates are deferred now.
2025-05-09 10:10:53 +01:00
Ivan Kosarev
c290f48a45
[AMDGPU][NFC] Remove unused operand types. (#139062) 2025-05-08 12:48:25 +01:00
Brox Chen
09d01be856
[AMDGPU][True16][CodeGen] replace subreg_to_reg with reg_sequence (#138746)
Since subreg_to_reg is considered broken in llvm, replace subreg_to_reg
with reg_sequence
2025-05-07 10:28:10 -04:00
Frederik Harwath
f541a3aad8
[AMDGPU] SIInstrInfo: Fix resultDependsOnExec for VOPC instructions (#134629)
SIInstrInfo::resultDependsOnExec assumes that operand 0 of a comparison
is always the destination of the instruction. This is not true for
instructions in VOPC form where it is "src0". This led to a crash in
machine-cse.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-04-22 10:17:35 +02:00
Philip Reames
f2ecd86e34
[Analysis] Remove implicit LocationSize conversion from uint64_t (#133342)
This change removes the uint64_t constructor on LocationSize
preventing implicit conversion, and fixes up the using APIs to adapt to
the change. Note that I'm adding a couple of explicit conversion points
on routines where passing in a fixed offset as an integer seems likely
to have well understood semantics.

We had an unfortunate case which arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize which works
just fine for fixed values, but loses information and fails assertions
if the TypeSize was scalable. This change breaks the first link in that
implicit conversion chain since that seemed to be the easier one.
2025-04-18 07:46:31 -07:00
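The conversion chain described in the commit above can be reproduced with two toy types. This is a sketch under stated assumptions, not LLVM's actual classes: `TypeSizeLike` and `LocationSizeLike` are hypothetical, and the named `precise` factories only illustrate the idea of explicit conversion points replacing an implicit integer constructor.

```cpp
#include <cstdint>
#include <type_traits>

// Toy stand-in for a size that may be scalable. The implicit conversion
// to uint64_t silently drops the Scalable flag.
struct TypeSizeLike {
  uint64_t MinSize;
  bool Scalable;
  operator uint64_t() const { return MinSize; }
};

// Toy stand-in for a location size with no public integer constructor,
// so TypeSizeLike -> uint64_t -> LocationSizeLike cannot happen implicitly.
class LocationSizeLike {
  uint64_t Value;
  bool Scalable;
  LocationSizeLike(uint64_t V, bool S) : Value(V), Scalable(S) {}

public:
  // Explicit conversion points (hypothetical names).
  static LocationSizeLike precise(uint64_t V) { return {V, false}; }
  static LocationSizeLike precise(TypeSizeLike T) {
    return {T.MinSize, T.Scalable}; // keeps the scalable flag
  }
  uint64_t value() const { return Value; }
  bool isScalable() const { return Scalable; }
};

// Neither source type converts implicitly any more.
static_assert(!std::is_convertible<uint64_t, LocationSizeLike>::value,
              "no implicit conversion from an integer");
static_assert(!std::is_convertible<TypeSizeLike, LocationSizeLike>::value,
              "no implicit conversion through uint64_t");
```

With the implicit constructor gone, a caller has to choose a conversion point by name, and the overload taking `TypeSizeLike` preserves the scalable flag that the integer path would have lost.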
Brox Chen
bf388f8a43
[AMDGPU][True16][CodeGen] legalize operands when move16bit SALU to VALU (#133985)
This is a follow up PR from
https://github.com/llvm/llvm-project/pull/132089.

When a V2S copy and its useMI are lowered to VALU, this patch checks
whether the newly generated VALU is a true16 inst and, if so, adds subreg
access on all operands where necessary.

an example MIR looks like:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:sreg_32 = COPY %1:vgpr_32
%3:sreg_32 = S_FLOOR_F16 %2:sreg_32, ...
```
currently lowered to
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1:vgpr_32, 0, 0, 0 ...
```
after this patch
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1.lo16:vgpr_32, 0, 0, 0 ...
```
2025-04-03 12:26:41 -04:00
Brox Chen
dd1d41f833
[AMDGPU][True16][CodeGen] fix moveToVALU with proper subreg access in true16 (#132089)
There are V2S copies between vgpr16 and sgpr32 in true16 mode. This is
caused by vgpr16 and sgpr32 both being selectable for a 16-bit src in ISel.

When a V2S copy and its useMI are lowered to VALU, this patch:
1. Checks whether the newly generated VALU is used by a true16 inst and adds
subreg access if necessary.
2. Legalizes the V2S copy by replacing it with subreg_to_reg

an example MIR looks like:
```
%2:sgpr_32 = COPY %1:vgpr_16
%3:sgpr_32 = S_OR_B32 %2:sgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:sgpr_32, ...
```
currently lowered to
```
%2:vgpr_32 = COPY %1:vgpr_16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:vgpr_32, ...
```
after this patch
```
%2:vgpr_32 = SUBREG_TO_REG 0, %1:vgpr_16, lo16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3.lo16:vgpr_32, ...
```
2025-04-01 12:40:18 -04:00
Stephen Thomas
2e3fa4ba9e
[AMDGPU] Insert before and after instructions that always use GDS (#131338)
It is an architectural requirement that there must be no outstanding GDS
instructions when an "always GDS" instruction is issued, and also that
an always GDS instruction must be allowed to complete.

Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to
(unconditionally) any always GDS instruction, and an additional S_NOP if
the subsequent wait was followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT,
DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered
always GDS as of this patch).
2025-03-21 09:33:04 +00:00