llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	8fcb712167	[AMDGPU] gfx1250 runlines for global-atomicrmw-fadd.ll. NFC (#159817 )	2025-09-19 10:58:41 -07:00
Akash Dutta	c256966fe2	[AMDGPU]: Unpack packed instructions overlapped by MFMAs post-RA scheduling (#157968 ) This is a cleaned up version of PR #151704. These optimizations are now performed post-RA scheduling.	2025-09-19 09:41:02 -07:00
Jeffrey Byrnes	ac8f3cdcf3	[AMDGPU] Precommit test for memory intrinics CGP handling Change-Id: Id229f849b1d8552bbe59d6e18114042ef1614fad	2025-09-19 07:42:26 -07:00
Mariusz Sikora	eed99d5008	[AMDGPU] Fix the magic number RegisterClass for SReg_32 in test (#159761 )	2025-09-19 14:14:33 +02:00
Fabian Ritter	d5607694e1	[AMDGPU][SDAG] DAGCombine PTRADD -> disjoint OR (#146075 ) If we can't fold a PTRADD's offset into its users, lowering them to disjoint ORs is preferable: Often, a 32-bit OR instruction suffices where we'd otherwise use a pair of 32-bit additions with carry. This needs to be a DAGCombine (and not a selection rule) because its main purpose is to enable subsequent DAGCombines for bitwise operations. We don't want to just turn PTRADDs into disjoint ORs whenever that's sound because this transform loses the information that the operation implements pointer arithmetic, which AMDGPU for instance needs when folding constant offsets. For SWDEV-516125.	2025-09-19 11:58:41 +02:00
Fabian Ritter	a2dcc88f39	[AMDGPU][SDAG] Handle ISD::PTRADD in various special cases (#145330 ) There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp that check for ISD::ADD in a pointer context, but as far as I can tell those are only relevant for 32-bit pointer arithmetic (like frame indices/scratch addresses and LDS), for which we don't enable PTRADD generation yet. For SWDEV-516125.	2025-09-19 10:19:38 +02:00
Fabian Ritter	adfa6a4c14	[AMDGPU][SDAG] Test ISD::PTRADD handling in various special cases (#145329 ) Pre-committing tests to show improvements in a follow-up PR.	2025-09-19 09:43:30 +02:00
Matt Arsenault	116ca9522e	Greedy: Take copy hints involving subregisters (#159570 ) Previously this would only accept full copy hints. This relaxes this to accept some subregister copies. Specifically, this now accepts: - Copies to/from physical registers if there is a compatible super register - Subreg-to-subreg copies This has the potential to repeatedly add the same hint to the hint vector, but not sure if that's a real problem.	2025-09-19 09:37:36 +09:00
Matt Arsenault	33e8e5a846	AMDGPU: Add more mfma loop test cases (#159492 ) Test cases where the exit uses must be VGPRs, and don't happen to be a store that could use AGPRs.	2025-09-19 09:36:46 +09:00
Stanislav Mekhanoshin	6ac0abf8c4	[AMDGPU] gfx1251 VOP3 dpp support (#159654 )	2025-09-18 16:18:09 -07:00
Stanislav Mekhanoshin	8cfbace7b2	[AMDGPU] gfx1251 VOP2 dpp support (#159641 )	2025-09-18 15:38:29 -07:00
Stanislav Mekhanoshin	e3c7b7f806	[AMDGPU] gfx1251 VOP1 dpp support (#159637 )	2025-09-18 13:42:06 -07:00
Simon Pilgrim	6e47bff24d	[AMDGPU] callee-special-input-vgprs.ll / callee-special-input-vgprs-packed.ll - regenerate test coverage (#159587 )	2025-09-18 15:19:48 +00:00
Fabian Ritter	01b4b2a5b8	[AMDGPU][SDAG] Handle ISD::PTRADD in VOP3 patterns (#143881 ) This patch mirrors similar patterns for ISD::ADD. The main difference is that ISD::ADD is commutative, so that a pattern definition for, e.g., (add (mul x, y), z), automatically also handles (add z, (mul x, y)). ISD::PTRADD is not commutative, so we would need to handle these cases explicitly. This patch only implements (ptradd z, (op x, y)) patterns, where the nested operation (shift or multiply) is the offset of the ptradd (i.e., the right operand), since base pointers that are the result of a shift or multiply seem less likely. For SWDEV-516125.	2025-09-18 15:01:07 +02:00
Petar Avramovic	2ec7959b96	[AMDGPU][SIInsertWaitcnts] Track SCC. Insert KM_CNT waits for SCC writes. (#157843 ) Add new event SCC_WRITE for s_barrier_signal_isfirst and s_barrier_leave, instructions that write to SCC, counter is KM_CNT. Also start tracking SCC for reads and writes. s_barrier_wait on the same barrier guarantees that the SCC write from s_barrier_signal_isfirst has landed, no need to insert s_wait_kmcnt.	2025-09-18 14:41:01 +02:00
Simon Pilgrim	85527609a0	[AMDGPU] kernel-argument-dag-lowering.ll - regenerate test coverage (#159526 )	2025-09-18 09:34:38 +00:00
Stanislav Mekhanoshin	221f8eef9d	[AMDGPU] Add gfx1251 runlines to cooperative atomcis tests. NFC (#159437 )	2025-09-17 14:08:05 -07:00
Stanislav Mekhanoshin	e556dc0b23	[AMDGPU] Add gfx1251 subtarget (#159430 )	2025-09-17 13:02:02 -07:00
Stanislav Mekhanoshin	f0090bacc1	[AMDGPU] Fold copies of constant physical registers into their uses (#154410 ) Co-authored-by: Jay Foad <Jay.Foad@amd.com> Co-authored-by: Jay Foad <Jay.Foad@amd.com>	2025-09-17 10:49:34 -07:00
Brox Chen	2b2b580c8d	[AMDGPU][CodeGen][True16] Track waitcnt as vgpr32 instead of vgpr16 for D16 Instructions in GFX11 (#157795 ) It seems the VMEM access on hi/lo half could interfere the other half. Track waitcnt of vgpr32 instead of vgpr16 for 16bit reg in GFX11. --------- Co-authored-by: Joe Nash <joseph.nash@amd.com>	2025-09-17 10:09:06 -04:00
Matt Arsenault	aac8eb85b2	AMDGPU: Fixes for regbankselecting copies of i1 physregs to sgprs (#159283 ) If the source register of a copy was a physical sgpr copied to an s1 value, this would assert.	2025-09-17 19:48:39 +09:00
sstipano	56ebbebada	[AMDGPU][NFC] Add back -new-reg-bank-select flag. (#159181 )	2025-09-17 11:29:16 +02:00
Shilei Tian	f7f7abcde4	[NFC][AMDGPU] Add a missing test case about cluster dims (#159179 )	2025-09-16 22:56:24 -04:00
Krzysztof Drewniak	96ce9f9d64	[AMDGPU] Prevent re-visits in LowerBufferFatPointers (#159168 ) Fixes https://github.com/iree-org/iree/issues/22001 The visitor in SplitPtrStructs would re-visit instructions if an instruction earlier in program order caused a recursive visit() call via getPtrParts(). This would cause instructions to be processed multiple times. As a consequence of this, PHI nodes could be added to the Conditionals array multiple times, which would to a conditinoal that was already simplified being processed multiple times. After the code moved to InstSimplifyFolder, this re-processing, combined with more agressive simplifications, would lead to an attempt to replace an instruction with itself, causing an assertion failure and crash. This commit resolves the issue and adds the reduced form of the crashing input as a test.	2025-09-16 18:02:18 -07:00
Stanislav Mekhanoshin	4ab8dabc25	[AMDGPU] Add s_cluster_barrier on gfx1250 (#159175 )	2025-09-16 14:49:48 -07:00
Shilei Tian	8122ccdca9	[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120 ) Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turns sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>	2025-09-16 15:37:01 -04:00
Shilei Tian	158eeb344b	[AMDGPU] Change `scale_sel` to be 4 bits (#157900 ) The latest SP changes updated it to use `OP_SEL[0:3]` instead of `OP_SEL[0:2]`. Fixes SWDEV-554472.	2025-09-16 15:36:45 -04:00
Stanislav Mekhanoshin	fd59fd563f	[AMDGPU] Add aperture classes to VS_64 (#158823 ) Should not do anything.	2025-09-16 11:15:50 -07:00
Matt Arsenault	0648c5183f	AMDGPU: Fix some broken regclass numbers in mir tests (#159102 )	2025-09-16 17:45:29 +00:00
Janek van Oirschot	341cdbc970	[AMDGPU] Elide bitcast fold i64 imm to build_vector (#154115 ) Elide bitcast combine to build_vector in case of i64 immediate that can be materialized through 64b mov	2025-09-16 16:44:51 +01:00
Rajveer Singh Bharadwaj	08a58b2cea	[InstCombine] Optimize redundant floating point comparisons in `or`/`and` inst's (#158097 ) Resolves #157371 We can eliminate one of the `fcmp` when we have two same `olt` or `ogt` instructions matched in `or`/`and` simplification.	2025-09-16 20:52:11 +05:30
Matt Arsenault	babdad3fdb	AMDGPU: Try to unspill VGPRs after rewriting MFMAs to AGPR form (#154323 ) After replacing VGPR MFMAs with the AGPR form, we've alleviated VGPR pressure which may have triggered spills during allocation. Identify these spill slots, and try to reassign them to newly freed VGPRs, and replace the spill instructions with copies. Fixes #154260	2025-09-17 00:11:32 +09:00
Jay Foad	eeced0d073	[AMDGPU] Use larger immediate values in S_NOP (#158990 ) The S_NOP instruction has an immediate operand which is one less than the number of cycles to delay for. The maximum value that may be encoded in this field was increased in GFX8 and again in GFX12.	2025-09-16 15:51:06 +01:00
Stanislav Mekhanoshin	76efbc068a	[AMDGPU] Fix codegen to emit COPY instead of S_MOV_B64 for aperture regs (#158754 )	2025-09-16 02:26:32 -07:00
Stanislav Mekhanoshin	72aa946762	[AMDGPU] Drop high 32 bits of aperture registers (#158725 ) Fixes: SWDEV-551181	2025-09-16 02:11:39 -07:00
Shilei Tian	04cd39ae28	[AMDGPU] Add the support for `.cluster_dims` code object metadata (#158721 ) Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-15 16:13:07 -04:00
Shilei Tian	27b242fbff	[AMDGPU][Attributor] Add `AAAMDGPUClusterDims` (#158076 )	2025-09-15 15:04:33 -04:00
macurtis-amd	2c091e6aec	AMDGPU: Report unaligned scratch access as fast if supported by tgt (#158036 ) This enables more consecutive load folding during aggressive-instcombine. The original motivating example provided by Jeff Byrnes: https://godbolt.org/z/8ebcTEjTs Example provided by Nikita Popov: https://godbolt.org/z/Gv1j4vjqE as part of my original attempt to fix the issue (PR [#133301](https://github.com/llvm/llvm-project/pull/133301), see his [comment](https://github.com/llvm/llvm-project/pull/133301#issuecomment-2984905809)). This changes the value of `IsFast` returned by `In SITargetLowering::allowsMisalignedMemoryAccessesImpl` to be non-zero for private and flat addresses if the subtarget supports unaligned scratch accesses. This enables aggressive-instcombine to do more folding of consecutive loads (see [here](`cbd496581f/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp (L811)`)). Summary performance impact on [composable_kernel](https://github.com/ROCm/composable_kernel): \|GPU\|speedup (geomean)\| \|---\|---\| \|MI300A\| 1.11\| \|MI300X\| 1.14\| \|MI350X\| 1.03\| [] Just to be clear, this is the geomean across kernels which were impacted by this change - not across all CK kernels.	2025-09-15 05:03:02 -05:00
David Green	1c21d5cb9b	[GlobalISel] Remove GI known bits cache (#157352 ) There is a cache on the known-bit computed by global-isel. It only works inside a single query to computeKnownBits, which limits its usefulness, and according to the tests can sometimes limit the effectiveness of known-bits queries. (Although some AMD tests look longer). Keeping the cache valid and clearing it at the correct times can also require being careful about the functions called inside known-bits queries. I measured compile-time of removing it and came up with: ``` 7zip 2.06405E+11 2.06436E+11 0.015018992 Bullet 1.01298E+11 1.01186E+11 -0.110236169 ClamAV 57942466667 57848066667 -0.16292023 SPASS 45444466667 45402966667 -0.091320249 consumer 35432466667 35381233333 -0.144594317 kimwitu++ 40858833333 40927933333 0.169118877 lencod 70022366667 69950633333 -0.102443457 mafft 38439900000 38413233333 -0.069372362 sqlite3 35822266667 35770033333 -0.145812474 tramp3d 82083133333 82045600000 -0.045726 Average -0.068828739 ``` The last column is % difference between with / without the cache. So in total it seems to be costing slightly more to keep the current known-bits cache than if it was removed. (Measured in instruction count, similar to llvm-compile-time-tracker). The hit rate wasn't terrible - higher than I expected. In the llvm-test-suite+external projects it was hit 4791030 times out of 91107008 queries, slightly more than 5%. Note that as globalisel increases in complexity, more known bits calls might be made and the numbers might shift. If that is the case it might be better to have a cache that works across calls, providing it doesn't make effectiveness worse.	2025-09-15 07:32:00 +01:00
Shilei Tian	1180c2ced0	[AMDGPU] Support lowering of cluster related instrinsics (#157978 ) Since many code are connected, this also changes how workgroup id is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-12 21:11:17 -04:00
Matt Arsenault	9af4a85460	AMDGPU: Add test which shows unnecessary register alignment (#158168 ) The b96 tr loads are a special case that does not require even aligned VGPRs	2025-09-13 10:10:59 +09:00
choikwa	ef7de8d144	[AMDGPU] Remove scope check in SIInsertWaitcnts::generateWaitcntInstBefore (#157821 ) This change was motivated by CK where many VMCNT(0)'s were generated due to instructions lacking !alias.scope metadata. The two causes of this were: 1) LowerLDSModule not tacking on scope metadata on a single LDS variable 2) IPSCCP pass before inliner replacing noalias ptr derivative with a global value, which made inliner unable to track it back to the noalias ptr argument. However, it turns out that IPSCCP losing the scope information was largely ineffectual as ScopedNoAliasAA was able to handle asymmetric condition, where one MemLoc was missing scope, and still return NoAlias result. AMDGPU however was checking for existence of scope in SIInsertWaitcnts and conservatively treating it as aliasing all and inserted VMCNT(0) before DS_READs, forcing it to wait for all previous LDS DMA instructions. Since we know that ScopedNoAliasAA can handle asymmetry, we should also allow AA query to determine if two MIs may alias. Passed PSDB. Previous attempt to address the issue in IPSCCP, likely stalled: https://github.com/llvm/llvm-project/pull/154522 This solution may be preferrable over that as issue only affects AMDGPU.	2025-09-12 14:51:36 -04:00
Matt Arsenault	9e1d656c68	AMDGPU: Remove MIMG special case in adjustAllocatableRegClass (#158184 ) I have no idea why this was here. MIMG atomics use tied operands for the input and output, so AV classes should have always worked. We have poor test coverage for AGPRs with atomics, so add a partial set. Everything seems to work OK, although it seems image cmpswap always uses VGPRs unnecessarily.	2025-09-12 09:02:24 +00:00
Matt Arsenault	188901d6ca	AMDGPU: Fix returning wrong type for stack passed sub-dword arguments (#158002 ) Fixes assertion with -debug-only=isel on LowerFormalArguments result. That assert really shouldn't be under LLVM_DEBUG. Fixes #157997	2025-09-12 10:50:03 +09:00
Matt Arsenault	5a21128f24	AMDGPU: Relax legal register operand constraint (#157989 ) Find a common subclass instead of directly checking for a subclass relationship. This fixes folding logic for unaligned register defs into aligned use contexts. e.g., a vreg_64 def into an av_64_align2 use should be able to find the common subclass vreg_align2. This avoids regressions in future patches. Checking the subclass was also redundant on the subregister path; getMatchingSuperRegClass is sufficient.	2025-09-12 08:57:47 +09:00
Changpeng Fang	05a705efda	[AMDGPU] Restrict to VGPR only for mfma scale operands (#158117 ) Restrict to VGPR only (VRegSrc_32) for mfma scale operands to workaround a hardware design defect: For all Inline/SGPR constants, SP HW use bits [30:23] as the scale. TODO: We may still be able to allow Inline Constants/SGPR, with a proper shift, to obtain a potentially better performance. Fixes: SWDEV-548629	2025-09-11 13:10:33 -07:00
Chris Jackson	5e6564b098	[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions (#140694 ) - Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional combines to make use of these newly legalised instructions. - Update several tests and separate legacy r600 tests where necessary.	2025-09-11 13:32:44 +01:00
Petar Avramovic	b97010865c	AMDGPU/GlobalISel: Import D16 load patterns and add combines for them (#153178 ) Add G_AMDGPU_LOAD_D16 generic instructions and GINodeEquivs for them, this will import D16 load patterns to global-isel's tablegened instruction selector. For newly imported patterns to work add combines for G_AMDGPU_LOAD_D16 in AMDGPURegBankCombiner.	2025-09-11 12:02:50 +02:00
Petar Avramovic	40270e8ef2	AMDGPU/GlobalISel: Add regbanklegalize rules for load and store (#153176 ) Cover all the missing cases and add very detailed tests for each rule. In summary: - Flat and Scratch, addrspace(0) and addrspace(5), loads are always divergent. - Global and Constant, addrspace(1) and addrspace(4), have real uniform loads, s_load, but require additional checks for align and flags in mmo. For not natural align or not uniform mmo do uniform-in-vgpr lowering. - Private, addrspace(3), only has instructions for divergent load, for uniform do uniform-in-vgpr lowering. - Store rules are simplified using Ptr32 and Ptr64. All operands need to be vgpr. Some tests have code size regression since they use more sgpr instructions, marked with FixMe comment to get back to later.	2025-09-11 11:26:20 +02:00
choikwa	8ae3aeaca0	[AMDGPU] NFC. Add testcase to test SIInsertWaitcnts::generateWaitcntInstBefore (#157938 ) Pre-commit testcase for https://github.com/llvm/llvm-project/pull/157821	2025-09-10 16:39:02 -04:00

1 2 3 4 5 ...

9345 Commits