64-bit version of 7425af4b7aaa31da10bd1bc7996d3bb212c79d88. We
still need to lower to 32-bit v_accvgpr_write_b32s, so this has
a unique value restriction that requires both halves of the constant
to be 32-bit inline immediates. This only introduces the new
pseudo definitions, but doesn't try to use them yet.
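For illustration, a hedged MIR sketch; the pseudo name, register class,
and operand shape below are assumptions, not taken from this patch:
```
; 0x3F80000000000000 qualifies: hi = 0x3f800000 (1.0f) and lo = 0x0 are
; both 32-bit inline immediates, so the 64-bit pseudo can hold it.
%0:av_64 = AV_MOV_B64_IMM_PSEUDO 4575657221408423936, implicit $exec
; 0x00000000deadbeef does not qualify: 0xdeadbeef is not a 32-bit
; inline immediate, so no pseudo would be formed for it.
```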
PR #149247 made the metadata accessible by the backend, so we can now
leverage it in the memory model. The first use case here is detecting
whether a flat op can access scratch memory.
This benefits both the MemoryLegalizer and InsertWaitcnts.
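As a hedged sketch (assuming the metadata in question is
`!noalias.addrspace`, and that private/scratch is address space 5):
```
define i32 @no_scratch_flat(ptr %p) {
  ; The range [5, 6) excludes the private address space, so the backend
  ; may assume this flat load never accesses scratch and relax the
  ; corresponding waits.
  %v = load i32, ptr %p, !noalias.addrspace !0
  ret i32 %v
}
!0 = !{i32 5, i32 6}
```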
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.
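For illustration, a hedged assembly sketch (generic VOP3P syntax; src0
is the SGPR, so its op_sel and op_sel_hi bits must match):
```
v_pk_add_f16 v0, s2, v1 op_sel:[1,0] op_sel_hi:[1,1] // ok: src0 bits match
v_pk_add_f16 v0, s2, v1 op_sel:[0,0] op_sel_hi:[1,1] // invalid: src0 bits differ
```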
Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
Globally addressable scratch is a new feature introduced in gfx1250.
However, this feature changes how scratch space is mapped into the flat
aperture, making address space casts from private to flat no longer
uniform.
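As a hedged sketch of the consequence at the IR level:
```
define amdgpu_kernel void @example(i32 %v) {
  %slot = alloca i32, align 4, addrspace(5)
  ; %slot is uniform, but with globally addressable scratch the flat
  ; address also encodes the lane, so %flat is divergent.
  %flat = addrspacecast ptr addrspace(5) %slot to ptr
  store i32 %v, ptr %flat
  ret void
}
```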
Do not fold an immediate into an instruction that already has a frame
index operand, since the frame index may itself be resolved to another
immediate and the instruction cannot generally encode two literals.
Fixes: SWDEV-536263
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
D16 pseudo instructions are introduced in true16 mode to represent a D16
load/store. In MC lowering, the pseudo instructions are lowered to the
corresponding D16 Lo/Hi MCInst respecting the register allocation.
However, the pseudo instruction has size 0, which causes an issue in the
instruction size estimation. Use the D16 Lo form when calculating the
instruction size.
Scalar version uses V_MAX_BF16_PSEUDO, which is expanded to V_PK_MAX_BF16
with unused high bits. If V_PK_MAX_BF16 were produced directly instead,
that would create problems with folding of the clamp into other scalar
instructions due to incompatible clamp bits.
FIXME-TRUE16: enable bf16 clamp with true16
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.
Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used; the active lanes only
for the CSRs.
At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
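A hedged IR sketch of this contract (how the function is marked as a
whole wave function, beyond its calling convention, is assumed here):
```
define amdgpu_gfx i32 @wwf(i1 %active, i32 %x) {
  ; %active is true exactly in the lanes that were active at the call
  ; site; the inactive lanes of %x and of the result hold poison.
  %r = select i1 %active, i32 %x, i32 0
  ret i32 %r
}
```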
This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
an SReg_1 representing `%active`, which needs to be passed into
SI_WHOLE_WAVE_FUNC_RETURN (see the MIR sketch after this list).
* SelectionDAG support for generating these 2 new pseudos and the
special handling of %active. Since the return may be in a different
basic block, it's difficult to add the virtual reg for %active to
SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
marks any used VGPRs as WWM registers, which are then spilled and
restored with the usual logic.
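A hedged MIR sketch of the two pseudos (wave64 assumed, so `%active`
lives in an sreg_64):
```
bb.0:
  ; Saves the original EXEC and sets EXEC to -1.
  %0:sreg_64 = SI_SETUP_WHOLE_WAVE_FUNC
  ...
bb.1:
  ; Restores the original EXEC in the epilogue.
  SI_WHOLE_WAVE_FUNC_RETURN %0:sreg_64
```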
Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
This increases the allocator's freedom to inflate register classes
to the AV class; we don't need to introduce a new restriction
by basing the opcode on the current virtual register class.
Ideally we would avoid this if we don't have any allocatable
AGPRs for the function, but it probably doesn't make much
difference to the end result if they are excluded from the
final allocation order.
WMMA XDL instructions are tracked as TRANS ops, and the compiler should
treat them the same as TRANS in S_DELAY_ALU insertion. We use a searchable
table for the InsertDelayAlu pass to recognize these WMMA XDL instructions.
Co-authored-by: Stefan Stipanovic <Stefan.Stipanovic@amd.com>
When trying to fold an SGPR into the second operand to a DPP add,
si-fold-operands correctly determines that this is not possible and
attempts to swap the second and third operand. This succeeds even if the
third operand is an SGPR, creating an illegal dpp add with two SGPR
operands. We need to check that both operands are legal in their new
positions.
This causes a crash at compile time for a test in triton on gfx12:
345c633787/python/test/unit/language/test_core.py (L2718)
Co-authored-by: Paul Trojahn <paul.trojahn@amd.com>
When true16 is enabled, isel starts to emit sgpr_lo16 registers when a
trunc/sext between i16 and i32 is generated, or when a salu32 is used by
a vgpr16 or vice versa. This causes a problem, as sgpr_lo16 is not fully
supported in the pipeline.
True16 mode works fine at -O3, since the folding pass removes sgpr_lo16
from the pipeline. However, it hits a problem at -O0, where the folding
pass is skipped.
This patch does the following:
1. stop emitting sgpr_lo16 from isel
2. update the codegen patterns to split uniform/divergent patterns for
i16/i32 conversion
3. update the fix-sgpr-copies pass to address the legalization
requirements in true16 mode, and update the fix-sgpr-copies-f16-true16.mir
test to include all possible combinations
This patch is tested with the CTS and the downstream repo with -O0 testing.
These two instructions are supported by gfx1250. We define the
instructions and implement the corresponding intrinsic and builtin.
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
The legalize-t16-operand function could insert a reg_sequence that
modifies the user list of the targeted register, so we should not call
it in the middle of a user list iteration.
https://github.com/llvm/llvm-project/pull/141152 causes an issue in
v_s_xxx_f16 lowering in both the true16 and fake16 flows.
V_S_XXX_F16 are special instructions which have scalar input/output but
use the VALU VOP3 format. We need to keep the srcmod/clamp/omod operands
when lowering them to the corresponding VALU instructions with vector
input/output.
Two changes in this patch:
1. Covered another case in the legalizeOperandVALUt16 functions and the
COPY lowering: when a SALU16 is used by a SALU32, we need to insert a
reg_sequence after moving to VALU (previously only the case of a SALU32
used by a SALU16 was considered).
2. Moved the useMI analysis into addUsersToMoveVALUList, legalizing the
targeted operand when needed.
Turn on the frem test with true16 mode for gfx1150, which was failing
before this patch. A few bitcast tests are also impacted by this change,
with some v_mov instructions being replaced by dual movs.
This was pre-filtering a specific situation out of the fold candidate
list. The operand legality will ultimately be checked with isOperandLegal
before the fold is performed, so I don't see the benefit of pre-filtering
this one case.
This annotates the `Twine` passed to the constructors of the various
DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes
us to warn when we would try to print the twine after it had already
been destructed.
We also update `DiagnosticInfoUnsupported` to hold a `const Twine &`
like all of the other DiagnosticInfo classes, since this warning allows
us to clean up all of the places where it was being used incorrectly.
SIInstrInfo::resultDependsOnExec assumes that operand 0 of a comparison
is always the destination of the instruction. This is not true for
instructions in VOPC form where it is "src0". This led to a crash in
machine-cse.
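For illustration, a hedged MIR sketch of the two forms:
```
; VOP3 form: operand 0 is the destination.
%2:sreg_64_xexec = V_CMP_EQ_U32_e64 %0:vgpr_32, %1:vgpr_32, implicit $exec
; VOPC form: the destination is the implicit $vcc, so operand 0 is src0.
V_CMP_EQ_U32_e32 %0:vgpr_32, %1:vgpr_32, implicit-def $vcc, implicit $exec
```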
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This change removes the uint64_t constructor on LocationSize
preventing implicit conversion, and fixes up the using APIs to adapt to
the change. Note that I'm adding a couple of explicit conversion points
on routines where passing in a fixed offset as an integer seems likely
to have well understood semantics.
We had an unfortunate case which arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize which works
just fine for fixed values, but loses information and fails assertions
if the TypeSize was scalable. This change breaks the first link in that
implicit conversion chain since that seemed to be the easier one.
This is a follow up PR from
https://github.com/llvm/llvm-project/pull/132089.
When a V2S copy and its useMI are lowered to VALU, this patch checks
whether the newly generated VALU is a true16 instruction, and adds subreg
accesses on all operands where necessary.
An example MIR looks like:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:sreg_32 = COPY %1:vgpr_32
%3:sreg_32 = S_FLOOR_F16 %2:sreg_32, ...
```
currently lowered to
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1:vgpr_32, 0, 0, 0 ...
```
after this patch
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1.lo16:vgpr_32, 0, 0, 0 ...
```
There are V2S copies between vgpr16 and sgpr32 in true16 mode. This is
caused by vgpr16 and sgpr32 both being selectable for 16-bit sources in
ISel.
When a V2S copy and its useMI are lowered to VALU, this patch:
1. Checks if the newly generated VALU is used by a true16 instruction,
and adds subreg accesses if necessary.
2. Legalizes the V2S copy by replacing it with a SUBREG_TO_REG.
An example MIR looks like:
```
%2:sgpr_32 = COPY %1:vgpr_16
%3:sgpr_32 = S_OR_B32 %2:sgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:sgpr_32, ...
```
currently lowered to
```
%2:vgpr_32 = COPY %1:vgpr_16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:vgpr_32, ...
```
after this patch
```
%2:vgpr_32 = SUBREG_TO_REG 0, %1:vgpr_16, lo16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3.lo16:vgpr_32, ...
```
It is an architectural requirement that there must be no outstanding GDS
instructions when an "always GDS" instruction is issued, and also that
an always GDS instruction must be allowed to complete.
Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to
(unconditionally) any always GDS instruction, and an additional S_NOP if
the subsequent wait was followed by S_ENDPGM.
Always GDS instructions are the GWS instructions, DS_ORDERED_COUNT,
DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered
always GDS as of this patch).
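A hedged assembly sketch of the resulting sequence (legacy counter names
shown; the exact operands and offsets are illustrative):
```
s_waitcnt lgkmcnt(0)        // before: only if GDS ops may be outstanding
ds_ordered_count v0, v1 offset:4 gds
s_waitcnt lgkmcnt(0)        // after: unconditional, lets the GDS op complete
s_nop 0                     // extra nop because s_endpgm follows
s_endpgm
```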