llvm-project

Author	SHA1	Message	Date
Domenic Nutile	5b33f85a08	[AMDGPU] Change isSingleLaneExecution to account for WWM enabling lanes even if there's only one workitem (#188316 ) This issue was discovered during some downstream work around Vulkan CTS tests, specifically `dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float` --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2026-04-06 12:51:46 -04:00
Mirko Brkušanin	5d9eb0c76a	[AMDGPU] Define new targets gfx1171 and gfx1172 (#187735 )	2026-04-01 18:16:11 +02:00
Nicolai Hähnle	2fb59b4e6c	AMDGPU: Document two more in-flight address spaces (#185690 ) There are two more address spaces being worked on internally for future features. Note this in the documentation now to reduce the risk of clashes.	2026-03-10 17:13:59 +00:00
Diana Picus	f2e8e2faff	[AMDGPU] Make chain functions receive a stack pointer (#184616 ) Currently, chain functions are free to set up a stack pointer if they need one, and they assume they can start at scratch offset 0. This is not correct if CWSR and dynamic VGPRs are both enabled, since in that case we need to reserve an area at offset 0 for the trap handler, but only when running on a compute queue (which we determine at runtime). Rather than duplicate in every chain function the code sequence for determining if/how much scratch space needs to be reserved, this patch changes the ABI of chain functions so that they receive a stack pointer from their caller. Since chain functions can no longer use plain offsets to access their own stack, we'll also need to allocate a frame pointer more often (and sometimes also a base pointer). For simplicity, we use the same registers that `amdgpu_gfx` functions do (s32, s33, s34). This may change in the future. Chain functions never return to their caller and thus don't need to preserve the frame or base pointer. Another consequence is that now we might need to realign the stack in some cases (since it no longer starts at the infinitely aligned 0).	2026-03-06 11:01:42 +01:00
Mirko Brkušanin	d0f50d5574	[AMDGPU] Remove DX10_CLAMP and IEEE bits from gfx1170 (#182107 ) Add `DX10ClampAndIEEEMode` feature and set it for every subtarget prior to gfx1170	2026-03-04 12:16:41 +01:00
David Stuttard	cd68939326	[AMDGPU] Add attribute for FWD_PROGRESS (#181675 ) Added an attribute for FWD_PROGRESS that allows it to be turned off for some shaders.	2026-02-26 11:35:36 +00:00
Stanislav Mekhanoshin	33fd75f55d	[AMDGPU] Add gfx12-5-generic subtarget (#183381 ) This is functionally equivalent to gfx1250.	2026-02-25 13:34:48 -08:00
Konstantin Zhuravlyov	ea4bbfbf5b	AMDGPU/Docs: Reserve 0x060 and 0x070 ELF MACH (e_flags) (#182341 )	2026-02-19 14:06:02 -05:00
Pierre van Houtryve	c30879ea91	[AMDGPU][Doc] Small fix for GFX12 release atomic memory model doc (#182241 ) That row goes for both generic/global but it only said global.	2026-02-19 11:38:05 +01:00
Sameer Sahasrabuddhe	b02b395a1e	[AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466 ) The existing "LDS DMA" builtins/intrinsics copy data from global/buffer pointer to LDS. These are now augmented with their ".async" version, where the compiler does not automatically track completion. The completion is now tracked using explicit mark/wait intrinsics, which must be inserted by the user. This makes it possible to write programs with efficient waits in software pipeline loops. The program can now wait for only the oldest outstanding operations to finish, while launching more operations for later use. This change only contains the new names of the builtins/intrinsics, which continue to behave exactly like their non-async counterparts. A later change will implement the actual mark/wait semantics in SIInsertWaitcnts. This is part of a stack split out from #173259: - #180467 - #180466 Fixes: SWDEV-521121	2026-02-11 05:26:58 +00:00
Pierre van Houtryve	b79ba02479	[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343 ) Load monitor operations make more sense as atomic operations, as non-atomic operations cannot be used for inter-thread communication w/o additional synchronization. The previous built-in made it work because one could just override the CPol bits, but that bypasses the memory model and forces the user to learn about ISA bits encoding. Making load monitor an atomic operation has a couple of advantages. First, the memory model foundation for it is stronger. We just lean on the existing rules for atomic operations. Second, the CPol bits are abstracted away from the user, which avoids leaking ISA details into the API. This patch also adds supporting memory model and intrinsics documentation to AMDGPUUsage. Solves SWDEV-516398.	2026-02-09 09:57:27 +01:00
Mirko Brkušanin	20b5849e17	[AMDGPU] Define new target gfx1170 (#180185 )	2026-02-06 14:38:50 +01:00
Aaditya	e9e8b38c80	[AMDGPU] Update documentation for wave reduction intrinsics (#175132 )	2026-01-30 18:19:40 +05:30
Mariusz Sikora	6de6f7b46b	[AMDGPU] Define gfx1310 target with ELF number 0x50 (#177355 ) For now this is identical to gfx1250. --------- Co-authored-by: Jay Foad <jay.foad@amd.com>	2026-01-22 17:08:38 +01:00
Pierre van Houtryve	3dda4b5463	[Doc][AMDGPU] Add barrier execution & memory model (#170447 ) Add a formal execution model, and a memory model for the execution barrier primitives available in GFX12.0 and below. The model also works for GFX12.5 workgroup/workgroup trap barriers, but does not include the new barrier types and instructions added in GFX12.5. These will be added at a later date.	2026-01-21 16:12:33 +01:00
Stanislav Mekhanoshin	dd947ebcf3	[AMDGPU] Update gfx1250 memory model for global acquire/release (#175865 ) Inserts required waits around GLOBAL_INV/GLOBAL_WBINV for agent scope and above.	2026-01-15 03:25:03 -08:00
Pankaj Dwivedi	017a27cb1a	[AMDGPU][Docs] Document amdgpu-expand-waitcnt-profiling attribute (#175750 )	2026-01-13 18:21:13 +05:30
Jay Foad	475f022cb7	[AMDGPU] Add support for GFX12 expert scheduling mode 2 (#170319 )	2026-01-09 15:49:10 +00:00
Jay Foad	b3c3e5fd99	[AMDGPU] Simplify and document waitcnt handling on call and return (#172453 ) Start documenting the ABI conventions for dependency counters on function call and return. Stop pretending that SIInsertWaitcnts can handle anything other than the default documented behavior.	2026-01-05 13:29:54 +00:00
Shilei Tian	c97de4387b	Revert "[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069 )" (#174303 ) This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it breaks assembler. ``` $ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse" v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c] ``` We have a fundamental issue in the clamp support in VOP3P instructions, which will need more changes.	2026-01-04 02:13:21 +00:00
Muhammad Abdul	2c376ffeca	[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069 ) Fixes #166989 - Adds a clamp immediate operand to the AMDGPU WMMA iu8 intrinsic and threads it through LLVM IR, MIR lowering, Clang builtins/tests, and MLIR ROCDL dialect so all layers agree on the new operand - Updates AMDGPUWmmaIntrinsicModsAB so the clamp attribute is emitted, teaches VOP3P encoding to accept the immediate, and adjusts Clang codegen/builtin headers plus MLIR op definitions and tests to match - Documents what the WMMA clamp operand does - Implements bitcode AutoUpgrade for source compatibility on WMMA IU8 Intrinsic op Possible future enhancements: - infer clamping as an optimization fold based on the use context --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-12-27 12:51:29 -05:00
anjenner	27651133e2	AMDGPU: Drop and upgrade llvm.amdgcn.atomic.csub/cond.sub to atomicrmw (#105553 ) These both perform conditional subtraction, returning the minuend and zero respectively, if the difference is negative.	2025-12-09 23:13:33 +00:00
Jan Patrick Lehr	41a6a0a1d2	[AMDGPU] Add some more product names for GPUs (#170469 )	2025-12-04 07:08:20 +01:00
Changpeng Fang	5f38ae4a77	[AMDGPU] update LDS block size for gfx1250 (#167614 ) LDS block size should be 2048 bytes (512 dwords) based on current spec.	2025-11-17 16:03:47 -08:00
Kazu Hirata	ea0ecd63d4	[llvm] Proofread *.rst (#168254 ) This patch is limited to hyphenation to ease the review process.	2025-11-16 08:09:07 -08:00
Krzysztof Drewniak	0190951a3e	[AMDGPU] Update buffer fat pointer docs for gfx1250, fix formatting (#167818 )	2025-11-14 11:13:57 -08:00
Krzysztof Drewniak	d4e9982787	[AMDGPU] Document meaning of alignment of buffer fat pointers, intrinsics (#167553 ) This commit adds documentation clarifying the meaning of `align` on ptr addrspace(7) (buffer fat pointer) and ptr addrspace(9) (bufferef structured pointer) operations (specifying that both the base and the offset need to be aligned) and documents the meaning of the `align` attribute when used as an argument on .buffer.ptr. intrinsics.	2025-11-12 19:39:27 -08:00
Kazu Hirata	ce7f9f9ccd	[llvm] Proofread *.rst (#167108 ) This patch is limited to single-word replacements to fix spelling and/or grammar to ease the review process. Punctuation and markdown fixes are specifically excluded.	2025-11-08 07:41:23 -08:00
Nicolai Hähnle	917d815d4e	AMDGPU: Preliminary documentation for named barriers (#165502 )	2025-11-07 18:10:59 +00:00
Nicolai Hähnle	cfca229782	AMDGPU: Add and clarify reserved address spaces (#166486 ) Address spaces 10 and 11 are reserved for future use in the sense that we plan to upstream their use. Address space 12 is used by LLPC. It is used in a workaround for an issue with SMEM accesses to PRT buffers that is specific to the LLPC ecosystem and makes no sense to upstream.	2025-11-05 01:57:54 +00:00
Pierre van Houtryve	07d47c792b	[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (#161638 ) They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+. This breaks transitivity however, for example if we have the following sequence of events in one thread: - some stores - store atomic release syncscope("workgroup") - barrier then another thread follows with - barrier - load atomic acquire - store atomic release syncscope("agent") It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves. We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11 do not have the writeback instruction. It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives. Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance. So the choice here is to streamline the code sequences by making CU and WGP mode emit almost identical (vL0 inv is not needed in CU mode) code for release (or stronger) atomic ordering. This also removes the `vm_vsrc(0)` wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed. Supersedes #160501 Solves SC1-6454	2025-10-21 09:23:46 +02:00
Krzysztof Drewniak	d37141776f	[AMDGPU] Enable volatile and non-temporal for loads to LDS (#153244 ) The primary purpose of this commit is to enable marking loads to LDS (global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux as with normal buffer loads) and to ensure that their !nontemporal annotations translate to appropriate settings of the cache control bits. However, in the process of implementing this feature, we also fixed - Incorrect handling of buffer loads to LDS in GlobalISel - Updating the handling of volatile on buffers in SIMemoryLegalizer: previously, the mapping of address spaces would cause volatile on buffer loads to be silently dropped on at least gfx10. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-10-20 12:42:22 -05:00
Nicolai Hähnle	56ee43a863	AMDGPU: Document address spaces as reserved (#163996 ) They are going to be used for internal work downstream that we do expect to upstream eventually.	2025-10-17 17:34:28 +00:00
Jan Patrick Lehr	4773751799	[AMDGPU] Add product names to processor table (#163717 )	2025-10-16 12:17:20 +02:00
Kazu Hirata	bc0c232a2a	[llvm] Proofread AMDGPUUsage.rst (#163331 )	2025-10-14 07:15:53 -07:00
Jun Wang	bdd98a0147	[AMDGPU] Add documentation files for GFX12. (#157151 ) This patch adds documentation files for GFX12.	2025-10-01 15:31:50 -07:00
Stanislav Mekhanoshin	f81cc8bddc	[AMDGPU] Update gfx1250 documentation. NFC (#160457 )	2025-09-24 10:04:47 -07:00
Stanislav Mekhanoshin	5f105fe806	[AMDGPU] Update documentation about DWARF registers mapping. NFC (#159447 )	2025-09-17 14:08:19 -07:00
Stanislav Mekhanoshin	e556dc0b23	[AMDGPU] Add gfx1251 subtarget (#159430 )	2025-09-17 13:02:02 -07:00
Shilei Tian	04cd39ae28	[AMDGPU] Add the support for `.cluster_dims` code object metadata (#158721 ) Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-15 16:13:07 -04:00
Shilei Tian	27b242fbff	[AMDGPU][Attributor] Add `AAAMDGPUClusterDims` (#158076 )	2025-09-15 15:04:33 -04:00
Shilei Tian	1180c2ced0	[AMDGPU] Support lowering of cluster related intrinsics (#157978 ) Since much of the code is connected, this also changes how workgroup id is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-12 21:11:17 -04:00
Stanislav Mekhanoshin	10cb685939	[AMDGPU] llvm.prefetch documentation for gfx1250. NFC (#157949 )	2025-09-10 14:06:32 -07:00
Pierre van Houtryve	49a898f9b5	[AMDGPU][gfx1250] Support "cluster" syncscope (#157641 ) Defaults to "agent" for targets that do not support it. - Add documentation - Register it in MachineModuleInfo - Add MemoryLegalizer support	2025-09-10 11:41:43 +02:00
Pierre van Houtryve	dcaa29c8ed	Revert "[AMDGPU][gfx1250] Add `cu-store` subtarget feature (#150588 )" (#157639 ) This reverts commit be17791f2624f22b3ed24a2539406164a379125d. This is not necessary for gfx1250 anymore.	2025-09-10 10:20:59 +02:00
Jan Patrick Lehr	c209cca677	[Docs][AMDGPU] Add gfx1200/gfx1201 product names (#155577 ) I took the liberty to add the product names according to Wikipedia.	2025-08-27 14:03:26 +02:00
Stanislav Mekhanoshin	438c099c23	[AMDGPU] gfx1250 kernel descriptor update (#155008 )	2025-08-22 12:58:41 -07:00
Gang Chen	60dbde69cd	[AMDGPU] report named barrier cnt part2 (#154588 )	2025-08-20 12:00:45 -07:00
Tim Renouf	f279c47cb3	AMDGPU gfx12: Add _dvgpr$ symbols for dynamic VGPRs (#148251 ) For each function with the AMDGPU_CS_Chain calling convention, with dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the function symbol, plus an offset encoding one less than the number of VGPR blocks used by the function (16 VGPRs per block, no more than 128) in bits 5..3 of the symbol value. This is used by a front-end to have functions that are chained rather than called, and a dispatcher that dynamically resizes the VGPR count before dispatching to a function.	2025-08-15 16:33:06 +01:00
Stanislav Mekhanoshin	49f2093477	[AMDGPU] Increase LDS to 320K on gfx1250 (#153645 )	2025-08-14 12:52:00 -07:00
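
The `_dvgpr$` symbol encoding described in commit f279c47cb3 above can be illustrated with a minimal sketch. This is not the LLVM implementation: the function names `dvgpr_symbol_value` and `decode_dvgpr_symbol` are hypothetical, and the sketch assumes that bits 5..3 of the function address are zero (so the block count can be recovered by masking) and that a VGPR count is rounded up to whole 16-register blocks.

```python
VGPRS_PER_BLOCK = 16   # block size stated in the commit message
MAX_VGPRS = 128        # upper bound stated in the commit message
BLOCK_FIELD_SHIFT = 3  # field occupies bits 5..3 of the symbol value
BLOCK_FIELD_MASK = 0x7 << BLOCK_FIELD_SHIFT


def dvgpr_symbol_value(func_addr: int, num_vgprs: int) -> int:
    """Return func_addr plus (blocks - 1) placed in bits 5..3 (hypothetical helper)."""
    if not 0 < num_vgprs <= MAX_VGPRS:
        raise ValueError("num_vgprs must be in 1..128")
    # Assumption: round up to whole blocks of 16 VGPRs.
    blocks = -(-num_vgprs // VGPRS_PER_BLOCK)
    return func_addr + ((blocks - 1) << BLOCK_FIELD_SHIFT)


def decode_dvgpr_symbol(value: int) -> tuple[int, int]:
    """Split a symbol value into (func_addr, blocks).

    Assumption: bits 5..3 of the original function address were zero.
    """
    blocks = ((value & BLOCK_FIELD_MASK) >> BLOCK_FIELD_SHIFT) + 1
    return value & ~BLOCK_FIELD_MASK, blocks
```

Under these assumptions, a function at 0x1000 that uses 48 VGPRs occupies 3 blocks, so the field holds 2 and the symbol value is 0x1010; a dispatcher could recover the block count from that value before resizing the VGPR allocation.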


461 Commits