llvm-project

Author	SHA1	Message	Date
Matt Arsenault	0fa6a67a42	AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166 ) Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.	2025-11-14 20:19:24 -08:00
Matt Arsenault	bbde792786	AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475 ) Allow widening up to 128-bit registers or if the new register class is at least as large as one of the existing register classes. This was artificially limiting. In particular this was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements. The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a11015 addressed the allocation failure concern by rejecting coalescing if there are no available registers. The original change in a4e63ead4b didn't include a realistic testcase to judge if this is harmful for pressure. I would expect any issues from this to be of garden variety subreg handling issue. We could use more dynamic state information here if it really is an issue. I get the best results by removing this override completely. This is a smaller step for patch splitting purposes.	2025-11-11 13:50:57 -08:00
Carl Ritson	385c12134a	[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663 ) Apply additional counter waits to address VALU writes to SGPRs. Rework expiry detection and apply wait coalescing to mitigate some of the additional waits.	2025-10-28 16:09:28 +09:00
LU-JOHN	9abbec66bf	[AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201 ) Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0. Passes check-libc-amdgcn-amd-amdhsa now. Distribution of instructions that allowed a redundant S_CMP to be deleted in check-libc-amdgcn-amd-amdhsa test: ``` S_AND_B32 485 S_AND_B64 47 S_ANDN2_B32 42 S_ANDN2_B64 277492 S_CSELECT_B64 17631 S_LSHL_B32 6 S_OR_B64 11 ``` --------- Signed-off-by: John Lu <John.Lu@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-10-22 08:42:29 -05:00
Jan Patrick Lehr	023b1f6a8e	Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116 ) Reverts llvm/llvm-project#162352 Broke our buildbot: https://lab.llvm.org/buildbot/#/builders/10/builds/15674 To reproduce cd llvm-project cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake -GNinja cd thebuild ninja ninja check-libc-amdgcn-amd-amdhsa	2025-10-18 22:38:14 +02:00
LU-JOHN	8e5f6dd37c	[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352 ) Remove redundant s_cmp_lg_* sX, 0 if SALU instruction already sets SCC if sX!=0. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-10-18 09:33:47 -05:00
Matt Arsenault	c6e280e7ed	PeepholeOpt: Fix losing subregister indexes on full copies (#161310 ) Previously if we had a subregister extract reading from a full copy, the no-subregister incoming copy would overwrite the DefSubReg index of the folding context. There's one ugly rvv regression, but it's a downstream issue of this; an unnecessary same class reg-to-reg full copy was avoided.	2025-10-02 13:36:47 +09:00
Brox Chen	dfdfc4e490	[AMDGPU][True16][Codegen] remove another build_vector pattern from true16 (#149861 ) Remove another build_vector pattern which takes an i16 but places it in a VGPR_32 from true16 mode. This stops isel from generating illegal "vgpr_32 = COPY vgpr_16". ISel will use the vgpr16 build_vector pattern in true16 mode instead.	2025-09-04 18:08:18 -04:00
Matt Arsenault	b1b5102624	AMDGPU: Start considering new atomicrmw metadata on integer operations (#122138 ) Start considering !amdgpu.no.remote.memory.access and !amdgpu.no.fine.grained.host.memory metadata when deciding to expand integer atomic operations. This does not yet attempt to accurately handle fadd/fmin/fmax, which are trickier and require migrating the old "amdgpu-unsafe-fp-atomics" attribute.	2025-08-22 05:29:36 +00:00
Matt Arsenault	01f785cac4	AMDGPU: Expand remaining system atomic operations (#122137 ) System scope atomics need to use cmpxchg loops if we know nothing about the allocation the address is from. aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this, this expands the set to cover the remaining integer operations. Don't expand xchg and add, those theoretically should work over PCIe. This is a pre-commit which will introduce performance regressions. Subsequent changes will add handling of new atomicrmw metadata, which will avoid the expansion. Note this still isn't conservative enough; we do need to expand some device scope atomics if the memory is in fine-grained remote memory.	2025-08-22 13:55:04 +09:00
Brox Chen	c50ed05cad	[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (reopen #153894 ) (#154211 ) Recreates https://github.com/llvm/llvm-project/pull/153894 as a new patch. It seems ISel silently ignores the `i64 = zext i16` pattern with a chained `reg_sequence`, which causes a selection failure in a hip test. This patch uses an alternative pattern and adds an ll test, global-extload-gfx11plus.ll	2025-08-20 10:26:49 -04:00
Brox Chen	d49aab10bd	Revert "[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#1538… (#154163 ) This reverts commit 7c53c6162bd43d952546a3ef7d019babd5244c29. This patch hit an issue in hip test. revert and will reopen later	2025-08-18 14:01:19 -04:00
Brox Chen	7c53c6162b	[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#153894 ) Update true16 mode with zext patterns using vgpr16 for 16-bit data types. This stops isel from inserting invalid "vgpr32 = copy vgpr16"	2025-08-18 11:01:57 -04:00
Shilei Tian	fc0653f31c	[RFC][NFC][AMDGPU] Remove `-verify-machineinstrs` from `llvm/test/CodeGen/AMDGPU/.ll` (#150024 ) Recent upstream trends have moved away from explicitly using `-verify-machineinstrs`, as it's already covered by the expensive checks. This PR removes almost all `-verify-machineinstrs` from tests in `llvm/test/CodeGen/AMDGPU/.ll`, leaving only those tests where its removal currently causes failures.	2025-07-23 13:42:46 -04:00
Brox Chen	0d2b47ae4a	[AMDGPU][True16][CodeGen] stop emitting sgpr_lo16 from isel (#144819 ) When true16 is enabled, isel starts to emit the sgpr_lo16 register when a trunc/sext i16/i32 is generated, or when a salu32 is used by a vgpr16 or vice versa. This causes a problem because sgpr_lo16 is not fully supported in the pipeline. True16 mode works fine in -O3 mode because the folding pass removes sgpr_lo16 from the pipeline. However, it hits a problem in -O0 mode, where the folding pass is skipped. This patch: 1. stops emitting sgpr_lo16 from isel 2. updates the codegen patterns to split uniform/divergent patterns for i16/i32 conversion 3. updates the fix-sgpr-copy pass to address the legalization requirement in true16 mode, and updates the fix-sgpr-copies-f16-true16.mir test to include all possible combinations. This patch was tested with cts and a downstream repo with -O0 testing	2025-07-09 16:17:14 -04:00
Guy David	76274eb2b3	[PHIElimination] Revert #131837 #146320 #146337 (#146850 ) Reverting because mis-compiles: - https://github.com/llvm/llvm-project/pull/131837 - https://github.com/llvm/llvm-project/pull/146320 - https://github.com/llvm/llvm-project/pull/146337	2025-07-03 07:48:08 -04:00
Guy David	f5c62ee0fa	[PHIElimination] Reuse existing COPY in predecessor basic block (#131837 ) The insertion point of COPY isn't always optimal and could eventually lead to a worse block layout, see the regression test in the first commit. This change affects many architectures, but the total number of instructions in the test cases seems to be slightly lower.	2025-06-29 21:28:42 +03:00
Ana Mihajlovic	08d747c1ef	[AMDGPU] Fix bad removal of s_delay_alu (#145728 ) instructionWaitsForSGPRWrites function covers ALL SALU instructions, including those like s_waitcnt that don't read from sgpr. This results in removing delay_alu instructions in cases like VALU->SGPR->VALU, which results in performance regression. Change modifies the function so that it checks if instruction also reads a sgpr.	2025-06-27 16:15:10 +02:00
Brox Chen	6dbc01e801	[AMDGPU][True16][CodeGen] update GFX11Plus codegen test with true16 flag (#135078 ) This is an NFC patch. It runs a bulk update on the CodeGen tests that are impacted by the true16 features. This patch: 1. duplicates the GFX11plus runlines and applies them with "+mattr=+real-true16" and "+mattr=-real-true16" 2. updates the tests with the update script. For some GISEL runlines, the current CodeGen does not fully support the true16 version. Those runlines are still updated, but the failing ones are commented out and a "FIXME-TRUE16" comment is added to the test for easier tracking. These tests will be fixed in following patches. This is a transition state in which we support both "+real-true16/-real-true16" in our code base. We plan to move to "+real-true16" as the default, and finally remove the "-real-true16" mode and test lines.	2025-04-23 13:06:52 -04:00
zhijian lin	afda4c295b	Reland [SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#136701 ) This patch addresses the signed/zero extension of poison by using a poison value of the extended type instead of a constant zero of the extended type.	2025-04-22 17:36:41 -04:00
Nico Weber	e18a77cfbe	Revert "[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741 )" This reverts commit f12078e72601e7c03e5d66afab034313caf8f791. Breaks `check-llvm`, see comments on https://github.com/llvm/llvm-project/pull/122741	2025-04-21 10:51:03 -04:00
zhijian lin	f12078e726	[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741 ) The PR will fix the issue https://github.com/llvm/llvm-project/issues/122728 This patch addresses the signed/zero extension of poison by using a poison value of the extended type instead of a constant zero of the extended type.	2025-04-21 10:02:21 -04:00
Shoreshen	121cd7c6f0	Re apply 130577 narrow math for and operand (#133896 ) Re-apply https://github.com/llvm/llvm-project/pull/130577 which was reverted in https://github.com/llvm/llvm-project/pull/133880 The original patch failed under address sanitizer because `tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in `AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, creating a use-after-free. To fix this, `tryNarrowMathIfNoOverflow` is now called first, and the function returns directly if `tryNarrowMathIfNoOverflow` returns true.	2025-04-17 17:03:32 +08:00
Shoreshen	7f14b2a9eb	Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880 ) Reverts llvm/llvm-project#130577	2025-04-01 17:37:02 +08:00
Shoreshen	145b4a3950	[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577 ) For Add, Sub, Mul with Int64 type, if profitable, then do: 1. Trunc operands to Int32 type 2. Apply 32 bit Add/Sub/Mul 3. Zext to Int64 type	2025-04-01 11:18:17 +08:00
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Unused sdst writing to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
Ana Mihajlovic	459b4e3fe1	Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" (#131111 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-13 10:26:20 +01:00
Kazu Hirata	aa008e0008	Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d. Multiple buildbot failures have been reported: https://github.com/llvm/llvm-project/pull/127212	2025-03-12 12:09:09 -07:00
Ana Mihajlovic	71582c6667	[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-12 09:33:07 -07:00
Matt Arsenault	6aea6308d1	AMDGPU: Fix creating illegally typed readfirstlane in atomic optimizer (#128388 ) We need to promote 8/16-bit cases to 32-bit. Unfortunately we are missing demanded bits optimizations on readfirstlane, so we end up emitting an and instruction on the input. I'm also surprised this pass isn't handling half or bfloat yet.	2025-02-24 18:39:49 +07:00
Matt Arsenault	1bb43068f1	PeepholeOpt: Allow introducing subregister uses on reg_sequence (#127052 ) This reverts d246cc618adc52fdbd69d44a2a375c8af97b6106. We now handle composing subregister extracts through reg_sequence.	2025-02-22 09:16:14 +07:00
Sergei Barannikov	ff9c041d96	[MachineScheduler] Fix physreg dependencies of ExitSU (#123541 ) Providing the correct operand index allows addPhysRegDataDeps to compute the correct latency. Pull Request: https://github.com/llvm/llvm-project/pull/123541	2025-02-01 20:40:50 +03:00
Matt Arsenault	d246cc618a	PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111 ) Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.	2025-01-30 20:42:02 +07:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Shilei Tian	6548b6354d	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.	2024-11-08 20:21:16 -05:00
Shilei Tian	ca33649abe	Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.	2024-11-08 16:36:35 -05:00
Shilei Tian	e215a1e27d	[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )	2024-11-08 13:05:35 -05:00
Stanislav Mekhanoshin	6d7e51de5e	[AMDGPU] Extend type support for update_dpp intrinsic (#114597 ) We can split 64-bit DPP as a post-RA pseudo if control values are supported, but cannot handle other types.	2024-11-05 13:59:14 -08:00
Stanislav Mekhanoshin	3277c7cd28	[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765 ) MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's been identified this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if a total number of VGPRs per SIMD / number of used VGPRs is more than a number of wave slots.	2024-10-21 09:39:52 -07:00
Pierre van Houtryve	924a64a348	[AMDGPU] Only emit SCOPE_SYS global_wb (#110636 ) global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness. I was initially optimistic they would be very cheap no-ops but they can actually be quite expensive so let's avoid them.	2024-10-07 07:35:31 +02:00
Matt Arsenault	8632e8bd64	AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514 )	2024-09-23 13:20:21 +04:00
Jay Foad	e55d6f5ea2	[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889 ) Always generate v_cndmask_b32 instead of modifying exec around v_mov_b32. This is expected to be faster because modifying exec generally causes pipeline stalls.	2024-09-11 17:16:06 +01:00
Carl Ritson	16cda01d22	[AMDGPU] V_SET_INACTIVE optimizations (#98864 ) Optimize V_SET_INACTIVE by allowing it to run in WWM. Hence WWM sections are not broken up for inactive lane setting. WWM V_SET_INACTIVE can typically be lowered to V_CNDMASK. Some cases require use of exec manipulation V_MOV as in the previous code. GFX9 sees a slight instruction count increase in edge cases due to its smaller constant bus. Additionally avoid introducing exec manipulation and V_MOVs where a source of V_SET_INACTIVE is the destination. This is a common pattern as WWM register pre-allocation often assigns the same register.	2024-09-05 14:39:28 +09:00
Jay Foad	5a6926ce49	[AMDGPU] Fix test update after #107108	2024-09-04 11:48:08 +01:00
Jay Foad	126d6f2710	[AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108 ) Use poison for an unused input to the permlanex16 intrinsic, to improve register allocation and avoid an unnecessary v_mov instruction.	2024-09-04 11:03:22 +01:00
Carl Ritson	86627149f6	[AMDGPU] Mitigate GFX12 VALU read SGPR hazard (#100067 ) Any SGPR read by a VALU can potentially obscure SALU writes to the same register. Insert s_wait_alu instructions to mitigate the hazard on affected paths. Compute a global cache of SGPRs with any VALU reads and use this to avoid inserting mitigation for SGPRs never accessed by VALUs. To avoid excessive search when compile time is priority implement secondary mode where all SALU writes are mitigated. Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-09-04 12:15:20 +09:00
Changpeng Fang	a82032918c	[AMDGPU] Remove -wavefrontsize32 and -wavefrontsize64 from GFX10+ tests (NFC) (#100711 ) They are no longer needed after the patch: [AMDGPU] Remove wavefrontsize feature from GFX10: https://github.com/llvm/llvm-project/pull/98400 The exception is when "target-features" are set to "+wavefrontsize32" or "+wavefrontsize64": we still need to remove a wavefrontsize feature before adding a different one to make sure only one of them is present.	2024-07-26 00:42:24 -07:00
Christudasan Devadasan	229e118559	[AMDGPU] Codegen support for constrained multi-dword sloads (#96163 ) For targets that support xnack replay feature (gfx8+), the multi-dword scalar loads shouldn't clobber any register that holds the src address. The constrained version of the scalar loads have the early clobber flag attached to the dst operand to restrict RA from re-allocating any of the src regs for its dst operand.	2024-07-23 13:59:15 +05:30
Pierre van Houtryve	b3a446650c	[AMDGPU] Implement GFX12 Memory Model (#98591 ) - Emit GLOBAL_WB instructions - Reflect synscope on instructions's `scope:` operand Fixes SWDEV-468508 Fixes SWDEV-470735 Fixes SWDEV-468392 Fixes SWDEV-469622	2024-07-16 10:53:06 +02:00
Vikram Hegde	cf230e7799	[AMDGPU] Enable atomic optimizer for divergent i64 and double values (#96934 )	2024-07-15 17:49:09 +05:30
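
Note on 145b4a3950 (narrow 64-bit math to 32-bit): the three steps in that commit message (trunc operands, 32-bit op, zext result) can be modelled roughly as below. This is only a sketch in Python, not the LLVM pass. The names `fits_without_overflow` and `narrow_math` are hypothetical, and the legality check here inspects concrete values, whereas a compiler would have to prove the same thing statically; the profitability side is not modelled at all.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def _apply(op, a, b):
    # Exact (unbounded) integer result of the requested operation.
    return {"add": a + b, "sub": a - b, "mul": a * b}[op]

def fits_without_overflow(op, a, b):
    # Hypothetical legality check: both operands and the exact result
    # must be representable as unsigned 32-bit values.
    return (0 <= a <= MASK32 and 0 <= b <= MASK32
            and 0 <= _apply(op, a, b) <= MASK32)

def narrow_math(op, a, b):
    # a and b model unsigned 64-bit values.
    if fits_without_overflow(op, a, b):
        a32, b32 = a & MASK32, b & MASK32    # 1. trunc operands to 32 bits
        r32 = _apply(op, a32, b32) & MASK32  # 2. 32-bit add/sub/mul
        return r32                           # 3. zext: upper 32 bits are zero
    # Otherwise keep the original 64-bit operation (wrapping modulo 2**64).
    return _apply(op, a, b) & MASK64
```

When the check passes, the narrowed result equals the 64-bit result by construction; when it fails (for example `sub` of 3 and 4, which wraps), the 64-bit path is kept.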


96 Commits