llvm-project

Author	SHA1	Message	Date
Petar Avramovic	2ec7959b96	[AMDGPU][SIInsertWaitcnts] Track SCC. Insert KM_CNT waits for SCC writes. (#157843 ) Add new event SCC_WRITE for s_barrier_signal_isfirst and s_barrier_leave, instructions that write to SCC, counter is KM_CNT. Also start tracking SCC for reads and writes. s_barrier_wait on the same barrier guarantees that the SCC write from s_barrier_signal_isfirst has landed, no need to insert s_wait_kmcnt.	2025-09-18 14:41:01 +02:00
Brox Chen	2b2b580c8d	[AMDGPU][CodeGen][True16] Track waitcnt as vgpr32 instead of vgpr16 for D16 Instructions in GFX11 (#157795 ) It seems that VMEM access on the hi/lo half could interfere with the other half. Track waitcnt of vgpr32 instead of vgpr16 for 16bit reg in GFX11. --------- Co-authored-by: Joe Nash <joseph.nash@amd.com>	2025-09-17 10:09:06 -04:00
choikwa	ef7de8d144	[AMDGPU] Remove scope check in SIInsertWaitcnts::generateWaitcntInstBefore (#157821 ) This change was motivated by CK where many VMCNT(0)'s were generated due to instructions lacking !alias.scope metadata. The two causes of this were: 1) LowerLDSModule not tacking on scope metadata on a single LDS variable 2) IPSCCP pass before inliner replacing noalias ptr derivative with a global value, which made inliner unable to track it back to the noalias ptr argument. However, it turns out that IPSCCP losing the scope information was largely ineffectual as ScopedNoAliasAA was able to handle asymmetric condition, where one MemLoc was missing scope, and still return NoAlias result. AMDGPU however was checking for existence of scope in SIInsertWaitcnts and conservatively treating it as aliasing all and inserted VMCNT(0) before DS_READs, forcing it to wait for all previous LDS DMA instructions. Since we know that ScopedNoAliasAA can handle asymmetry, we should also allow AA query to determine if two MIs may alias. Passed PSDB. Previous attempt to address the issue in IPSCCP, likely stalled: https://github.com/llvm/llvm-project/pull/154522 This solution may be preferable to that, as the issue only affects AMDGPU.	2025-09-12 14:51:36 -04:00
Stanislav Mekhanoshin	6aebbb0a85	[AMDGPU] Define 1024 VGPRs on gfx1250 (#156765 ) This is a baseline support, it is not useable yet.	2025-09-03 16:25:18 -07:00
Nicolai Hähnle	46762421c3	AMDGPU/GFX12: Do not wait unnecessarily before barriers (#154970 ) The barrier intrinsic itself should not have memory semantics. Frontends should use appropriate fence instructions for memory effects, and some frontends want to rely on that for performance (e.g. wait only for LDS before a barrier). See the code comment for more detail.	2025-08-23 00:07:59 -07:00
Ivan Kosarev	faca8c9ed4	[AMDGPU][NFC] Only include CodeGenPassBuilder.h where needed. (#154769 ) Saves around 125-210 MB of compilation memory usage per source for roughly one third of our backend sources, ~60 MB on average.	2025-08-22 10:05:06 +01:00
Stanislav Mekhanoshin	d0ee82040c	[AMDGPU] Add s_barrier_init|join|leave instructions (#153296 )	2025-08-12 15:07:07 -07:00
Sameer Sahasrabuddhe	8f187c74b3	[AMDGPU] introduce S_WAITCNT_LDS_DIRECT in the memory legalizer (#150887 ) The new instruction represents the unknown number of waitcnts needed at a release operation to ensure that prior direct loads to LDS (formerly called LDS DMA) are completed. The instruction is replaced in SIInsertWaitcnts with a suitable value for vmcnt(). Co-authored-by: Austin Kerbow <austin.kerbow@amd.com>.	2025-07-30 11:23:28 +05:30
Pierre van Houtryve	2ad4e93ded	[AMDGPU][gfx1250] Use SCOPE_SE for stores that may hit scratch (#150586 )	2025-07-28 11:40:56 +02:00
Stanislav Mekhanoshin	9deb7f6062	[AMDGPU] gfx1250 vmem prefetch target intrinsics and builtins (#150466 )	2025-07-24 12:13:59 -07:00
Diana Picus	20d8398825	[AMDGPU] ISel & PEI for whole wave functions (#145858 ) Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics. Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs. At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison. This patch contains the following work: * 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN. * SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter. * Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic. Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls). --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com> Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-07-21 10:39:09 +02:00
Jay Foad	ab25de7dec	[AMDGPU] Move common fields out of WaitcntBrackets. NFC. (#148864 ) WaitcntBrackets holds per-basic-block information about the state of wait counters. It also held a bunch of fields that are constant throughout a run of the pass. This patch moves them out into the SIInsertWaitcnts class, for better logical separation and to save a tiny bit of memory.	2025-07-17 15:20:24 +01:00
Jay Foad	359dca0dad	[AMDGPU] Move class WaitcntBrackets after class SIInsertWaitcnts. NFC. This is a prerequisite for "[AMDGPU] Move common fields out of WaitcntBrackets. NFC. (#148864)"	2025-07-17 13:58:41 +01:00
Jay Foad	ad715beca1	[AMDGPU] Remove HasSampler variable. NFC. (#146682 ) Putting the complex condition in a variable does not help readability. It is simpler to use separate `if`s.	2025-07-02 15:44:32 +01:00
Jay Foad	2b03efc7fb	[AMDGPU] Use isImage. NFC. (#146677 )	2025-07-02 14:18:42 +01:00
Sameer Sahasrabuddhe	a34a024812	[AMDGPU][SIInsertWaitCnts] skip meta instructions early (#145720 ) When iterating over a block, meta instructions have no effect on wait counts, but their presence drops the reference to earlier waitcnt instructions before they are processed. This results in spurious wait counts, which do not affect correctness, but are also not required in the resulting program. Skipping meta instructions as soon as they are seen cleans this up.	2025-07-01 22:02:48 +05:30
Sameer Sahasrabuddhe	67302b2e6f	[AMDGPU][NFC] rename some constants for readability (#145870 )	2025-06-26 20:00:26 +05:30
Jay Foad	b6b08117bd	[AMDGPU] Simplify S_WAIT_XCNT insertion. NFC. (#145682 )	2025-06-25 20:56:41 +01:00
Sameer Sahasrabuddhe	81e07996aa	[AMDGPU][SIInsertWaitcnts] don't crash when printing messages at end of block (#145694 )	2025-06-25 20:59:55 +05:30
Christudasan Devadasan	08b8d467d4	[AMDGPU][GFX1250] Insert S_WAIT_XCNT for SMEM and VMEM load-stores (#145566 ) This patch tracks the register operands of both VMEM (FLAT, MUBUF, MTBUF) and SMEM load-store operations and inserts a S_WAIT_XCNT instruction with sufficient wait-count before potentially redefining them. For VMEM instructions, XNACK is returned in the same order as they were issued and hence non-zero counter values can be inserted. However, SMEM execution is out-of-order and so is their XNACK reception. Thus, only zero counter value can be inserted to capture SMEM dependencies.	2025-06-25 10:40:36 +05:30
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Stanislav Mekhanoshin	0c2191b3a7	[AMDGPU] Omit image waits in function prologue on gfx1250 (#145097 )	2025-06-20 14:11:29 -07:00
Sameer Sahasrabuddhe	62fe5e428a	[NFC][AMDGPU] print more info when debugging SIInsertWaitcnts pass (#144629 )	2025-06-19 13:47:37 +05:30
Carl Ritson	34b6285735	[AMDGPU] Treat image_msaa_load as a sampler operation (#141726 ) While image_msaa_load does not take a sampler, it can behave as if it does on some hardware. This has implications for wait counting and clausing.	2025-05-28 19:56:05 +09:00
Kazu Hirata	cc78177e8f	[llvm] Use *Map::try_emplace (NFC) (#141190 ) try_emplace can default-construct values, so we do not need to do so on our own. Plus, try_emplace(Key) is much simpler/shorter than insert({Key, LongValueType()}).	2025-05-22 23:50:58 -07:00
Robert Imschweiler	e55172f139	[AMDGPU] Classify FLAT instructions as VMEM (#137148 ) Also adapt hazard and wait handling.	2025-05-07 09:20:52 +02:00
Pierre van Houtryve	ec3a90509d	[AMDGPU][InsertWaitCnts] Track global_wb/inv/wbinv (#135340 ) wb/wbinv use storecnt, inv uses loadcnt. Track them as VMEM_WRITE_ACCESS and VMEM_READ_ACCESS to avoid InsertWaitCnt incorrectly eliminating the waitcnts after these instructions. Solves SWDEV-526604	2025-04-22 14:53:55 +02:00
David Stuttard	15428e0d78	[AMDGPU] Add support for point sample accel out of order returns (#127991 ) Add target feature for point sample acceleration and enable it for relevant targets. Also add support to insert waitcnts where required when point sample accel may have occurred. This has implications for out of order returns, which is why extra waitcnts are required. Add a VMEM_NOSAMPLER bit in the register masks to determine when waitcnt is required.	2025-04-10 15:33:48 +01:00
Jay Foad	008c875be8	[AMDGPU] Fix excessive stack usage in SIInsertWaitcnts::run (#134835 ) Noticed on Windows when running LLVM as part of a graphics driver, with total stack usage limited to about 128 KB. In some cases this function would overflow the stack. On Linux this reduces stack usage in this function from about 32 KB to about 0.5 KB.	2025-04-08 14:08:42 +01:00
Jay Foad	6f93c0676f	[AMDGPU] Make a few WaitcntBrackets methods const. NFC. (#134824 )	2025-04-08 10:44:02 +01:00
Austin Kerbow	e75f586b81	[AMDGPU] Relax lds dma waitcnt with no aliasing pair (#131842 ) If we cannot find any lds DMA instruction that is aliased by some load from lds, we will still insert vmcnt(0). This is overly cautious since handling inter-thread dependences is normally managed by the memory model instead of the waitcnt pass, so this change updates the behavior to be more in line with how other types of memory events are handled.	2025-03-24 10:38:47 -07:00
Akshat Oke	f10dc76f03	[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061 )	2025-03-24 21:36:45 +05:30
Kazu Hirata	3c2731ce46	[AMDGPU] Avoid repeated hash lookups (NFC) (#132657 )	2025-03-23 22:30:09 -07:00
Stephen Thomas	2e3fa4ba9e	[AMDGPU] Insert before and after instructions that always use GDS (#131338 ) It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete. Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait was followed by S_ENDPGM. Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).	2025-03-21 09:33:04 +00:00
Diana Picus	8a53324aa5	[AMDGPU] Deallocate VGPRs before exiting in dynamic VGPR mode (#130037 ) In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0` before S_ENDPGM, but this may incur some performance cost. Therefore it's better if the compiler proactively generates that instruction. This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a `S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.	2025-03-19 09:00:36 +01:00
Brox Chen	222b99d3aa	[AMDGPU][True16][CodeGen] update waitcnt for true16 (#128927 ) update waitcnt pass to check hi16 and lo16 in true16 mode --------- Co-authored-by: Jay Foad <jay.foad@gmail.com>	2025-03-11 10:59:51 -04:00
Jay Foad	d6c0839c9c	[AMDGPU] Reduce size of SGPR arrays in SIInsertWaitcnts. NFC. (#130097 )	2025-03-06 13:44:16 +00:00
Dmitri Gribenko	4e6721b70d	[llvm] Fix an unused variable warning	2025-03-06 14:43:13 +01:00
Jay Foad	59e0704a52	[AMDGPU] Remove RegisterEncoding from SIInsertWaitcnts. NFC. (#130056 ) The information in this struct seemed useless. VGPR0 and SGPR0 were always 0. VGPRL and SGPRL were only used in assertions.	2025-03-06 12:50:00 +00:00
Mariusz Sikora	cd3acd1bff	[AMDGPU] Remove unused s_barrier_{init,join,leave} instructions (#129548 )	2025-03-04 17:52:43 +01:00
Rahul Joshi	bee9664970	[TableGen] Emit OpName as an enum class instead of a namespace (#125313 ) - Change InstrInfoEmitter to emit OpName as an enum class instead of an anonymous enum in the OpName namespace. - This will help clearly distinguish between values that are OpNames vs just operand indices and should help avoid bugs due to confusion between the two. - Rename OpName::OPERAND_LAST to NUM_OPERAND_NAMES. - Emit declaration of getOperandIdx() along with the OpName enum so it doesn't have to be repeated in various headers. - Also updated AMDGPU, RISCV, and WebAssembly backends to conform to the new definition of OpName (mostly mechanical changes).	2025-02-12 08:19:30 -08:00
Stanislav Mekhanoshin	8a20c6459e	[AMDGPU] Create new option for force flush load counter (#124974 ) In certain situations it is beneficial to wait for all outstanding loads regardless of which specific load's data we need. This may reduce the number of cache requests. Fixes: SWDEV-511507	2025-01-30 11:14:38 -08:00
Kazu Hirata	be187369a0	[AMDGPU] Remove unused includes (NFC) (#116154 ) Identified with misc-include-cleaner.	2024-11-13 21:10:03 -08:00
Stanislav Mekhanoshin	3277c7cd28	[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765 ) MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's been identified this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if a total number of VGPRs per SIMD / number of used VGPRs is more than a number of wave slots.	2024-10-21 09:39:52 -07:00
Shilei Tian	a74659445d	[AMDGPU] Skip terminators when forcing emit zero flag (#112116 ) When forcing emit zero, we need to skip terminators of a MBB; otherwise the terminator list of the MBB would be broken.	2024-10-14 11:46:18 -04:00
Jay Foad	cbc4be2dd5	[AMDGPU] Use MachineInstr::mayLoadOrStore. NFC.	2024-10-14 15:37:56 +01:00
Shilei Tian	ed77df56f2	[NFC] clang-format llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp	2024-10-14 00:57:01 -04:00
Shilei Tian	3da7d55b35	[NFC][AMDGPU] Remove unnecessary member `ForceEmitZeroWaitcnts` (#112114 ) We can use `ForceEmitZeroFlag` directly.	2024-10-14 00:54:16 -04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Jay Foad	e64ef74e64	[AMDGPU] Remember to clear a DenseMap between runs of SIInsertWaitcnts (#110650 ) This caused nondeterministic codegen in some cases.	2024-10-02 10:07:54 +01:00

261 Commits