llvm-project

Author	SHA1	Message	Date
LU-JOHN	18f7e625bd	Revert "[AMDGPU] Generate more swaps" (#187723 ) Reverts llvm/llvm-project#184164. Issue hit in testing, LCOMPILER-1587.	2026-03-20 12:03:20 -05:00
LU-JOHN	81396ebc51	[AMDGPU] Generate more swaps (#184164 ) Generate more swaps from: ``` mov T, X ... mov X, Y ... mov Y, T ``` by being more careful about which uses/defs of X, Y, T are allowed in intervening code and allowing flexibility in where the swap is inserted. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2026-03-03 09:11:11 -06:00
Ruiling, Song	686987a540	ValueTracking/AMDGPU: handle mbcnt in computeKnownBitsFromOperator (#183229 ) This helps canonicalize some address calculation. This would further help immediate folding into memory load instructions in the backend. The order changes to v_mad_u32_u24 is just because @llvm.amdgcn.mul.u24.i32 was used in codegen prepare after this change. It does not really change anything important.	2026-03-02 10:48:15 +08:00
Carl Ritson	5cc4b05380	[AMDGPU] Add scheduling DAG mutation for hazard latencies (#170075 ) Improve waitcnt merging in ML kernel loops by increasing latencies on VALU writes to SGPRs. Specifically this helps with the case of V_CMP output feeding V_CNDMASK instructions.	2026-02-03 11:10:28 +09:00
Matt Arsenault	1db5d6410b	AMDGPU: Move softPromoteHalfType override to R600 only (#177419 ) As expected the code is much worse, but more correct. We could do a better job with source modifier management around fp16_to_fp/fp_to_fp16.	2026-01-26 15:23:04 +00:00
Jay Foad	d748c81218	[AMDGPU] Change the immediate operand of s_waitcnt_depctr / s_wait_alu (#169378 ) The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some unused bits. Previously codegen would set these bits to 1, but setting them to 0 matches the SP3 assembler behaviour better, which in turn means that we can print them using the human readable SP3 syntax: s_wait_alu 0xfffd ; unused bits set to 1 s_wait_alu 0xff9d ; unused bits set to 0 s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable Note that the set of unused bits changed between GFX10.1 and GFX10.3.	2025-11-25 11:55:26 +00:00
hstk30-hw	a6cec3f3e5	Reland "[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661 )" (#169219 ) Reland d5f3ab8ec97786476a077b0c8e35c7c337dfddf2, fix testcases.	2025-11-24 09:27:25 +08:00
Aiden Grossman	d5f3ab8ec9	Revert "[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661 )" This reverts commit 0859ac5866a0228f5607dd329f83f4a9622dedcc. This caused a couple test failures, likely due to a mid-air collision. Reverting for now to get the tree back to green and allow the original author to run UTC/friends and verify the output.	2025-11-23 05:17:45 +00:00
hstk30-hw	0859ac5866	[RegAlloc] Fix the terminal rule check for interfere with DstReg (#168661 ) This may be a bug introduced by commit 6749ae36b4a33769e7a77cf812d7cd0a908ae3b9, and it has been present ever since. In this case, `OtherReg` always overlaps with `DstReg` because they all come from the `Copy`.	2025-11-23 10:11:24 +08:00
Matt Arsenault	0fa6a67a42	AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166 ) Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.	2025-11-14 20:19:24 -08:00
Matt Arsenault	bbde792786	AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475 ) Allow widening up to 128-bit registers or if the new register class is at least as large as one of the existing register classes. This was artificially limiting. In particular this was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements. The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a11015 addressed the allocation failure concern by rejecting coalescing if there are no available registers. The original change in a4e63ead4b didn't include a realistic testcase to judge if this is harmful for pressure. I would expect any issues from this to be of garden variety subreg handling issue. We could use more dynamic state information here if it really is an issue. I get the best results by removing this override completely. This is a smaller step for patch splitting purposes.	2025-11-11 13:50:57 -08:00
Carl Ritson	385c12134a	[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663 ) Apply additional counter waits to address VALU writes to SGPRs. Rework expiry detection and apply wait coalescing to mitigate some of the additional waits.	2025-10-28 16:09:28 +09:00
LU-JOHN	9abbec66bf	[AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201 ) Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0. Passes check-libc-amdgcn-amd-amdhsa now. Distribution of instructions that allowed a redundant S_CMP to be deleted in check-libc-amdgcn-amd-amdhsa test: ``` S_AND_B32 485 S_AND_B64 47 S_ANDN2_B32 42 S_ANDN2_B64 277492 S_CSELECT_B64 17631 S_LSHL_B32 6 S_OR_B64 11 ``` --------- Signed-off-by: John Lu <John.Lu@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-10-22 08:42:29 -05:00
Jan Patrick Lehr	023b1f6a8e	Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116 ) Reverts llvm/llvm-project#162352 Broke our buildbot: https://lab.llvm.org/buildbot/#/builders/10/builds/15674 To reproduce cd llvm-project cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake -GNinja cd thebuild ninja ninja check-libc-amdgcn-amd-amdhsa	2025-10-18 22:38:14 +02:00
LU-JOHN	8e5f6dd37c	[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352 ) Remove redundant s_cmp_lg_* sX, 0 if SALU instruction already sets SCC if sX!=0. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-10-18 09:33:47 -05:00
Matt Arsenault	c6e280e7ed	PeepholeOpt: Fix losing subregister indexes on full copies (#161310 ) Previously if we had a subregister extract reading from a full copy, the no-subregister incoming copy would overwrite the DefSubReg index of the folding context. There's one ugly rvv regression, but it's a downstream issue of this; an unnecessary same class reg-to-reg full copy was avoided.	2025-10-02 13:36:47 +09:00
Brox Chen	dfdfc4e490	[AMDGPU][True16][Codegen] remove another build_vector pattern from true16 (#149861 ) Remove another build_vector pattern from true16 mode, one which takes an i16 but places it in a VGPR_32. This stops isel from generating the illegal "vgpr_32 = COPY vgpr_16". ISel will use the vgpr16 build_vector pattern in true16 mode instead.	2025-09-04 18:08:18 -04:00
Matt Arsenault	b1b5102624	AMDGPU: Start considering new atomicrmw metadata on integer operations (#122138 ) Start considering !amdgpu.no.remote.memory.access and !amdgpu.no.fine.grained.host.memory metadata when deciding to expand integer atomic operations. This does not yet attempt to accurately handle fadd/fmin/fmax, which are trickier and require migrating the old "amdgpu-unsafe-fp-atomics" attribute.	2025-08-22 05:29:36 +00:00
Matt Arsenault	01f785cac4	AMDGPU: Expand remaining system atomic operations (#122137 ) System scope atomics need to use cmpxchg loops if we know nothing about the allocation the address is from. aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this, this expands the set to cover the remaining integer operations. Don't expand xchg and add, those theoretically should work over PCIe. This is a pre-commit which will introduce performance regressions. Subsequent changes will add handling of new atomicrmw metadata, which will avoid the expansion. Note this still isn't conservative enough; we do need to expand some device scope atomics if the memory is in fine-grained remote memory.	2025-08-22 13:55:04 +09:00
Brox Chen	c50ed05cad	[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (reopen #153894 ) (#154211 ) Recreate this patch from https://github.com/llvm/llvm-project/pull/153894 It seems ISel silently ignores the `i64 = zext i16` with a chained `reg_sequence` pattern, which causes a selection failure in the hip test. Recreate the patch with an alternative pattern, and add an ll test, global-extload-gfx11plus.ll.	2025-08-20 10:26:49 -04:00
Brox Chen	d49aab10bd	Revert "[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#1538… (#154163 ) This reverts commit 7c53c6162bd43d952546a3ef7d019babd5244c29. This patch hit an issue in the hip test. Reverting for now; it will be reopened later.	2025-08-18 14:01:19 -04:00
Brox Chen	7c53c6162b	[AMDGPU][True16][CodeGen] use vgpr16 for zext patterns (#153894 ) Update true16 mode with zext patterns using vgpr16 for 16-bit data types. This stops isel from inserting the invalid "vgpr32 = copy vgpr16".	2025-08-18 11:01:57 -04:00
Shilei Tian	fc0653f31c	[RFC][NFC][AMDGPU] Remove `-verify-machineinstrs` from `llvm/test/CodeGen/AMDGPU/*.ll` (#150024 ) Recent upstream trends have moved away from explicitly using `-verify-machineinstrs`, as it's already covered by the expensive checks. This PR removes almost all `-verify-machineinstrs` from tests in `llvm/test/CodeGen/AMDGPU/*.ll`, leaving only those tests where its removal currently causes failures.	2025-07-23 13:42:46 -04:00
Brox Chen	0d2b47ae4a	[AMDGPU][True16][CodeGen] stop emitting sgpr_lo16 from isel (#144819 ) When true16 is enabled, isel starts to emit the sgpr_lo16 register when a trunc/sext i16/i32 is generated, or when a salu32 is used by a vgpr16 or vice versa. This causes a problem because sgpr_lo16 is not fully supported in the pipeline. True16 mode works fine at -O3 since the folding pass removes sgpr_lo16 from the pipeline. However, it hits a problem at -O0 because the folding pass is skipped. This patch: 1. stops emitting sgpr_lo16 from isel 2. updates the codegen patterns to split uniform/divergent patterns for i16/i32 conversion 3. updates the fix-sgpr-copy pass to address the legalization requirement in true16 mode, and updates the fix-sgpr-copies-f16-true16.mir test to include all possible combinations This patch is tested with cts and a downstream repo with -O0 testing	2025-07-09 16:17:14 -04:00
Guy David	76274eb2b3	[PHIElimination] Revert #131837 #146320 #146337 (#146850 ) Reverting because mis-compiles: - https://github.com/llvm/llvm-project/pull/131837 - https://github.com/llvm/llvm-project/pull/146320 - https://github.com/llvm/llvm-project/pull/146337	2025-07-03 07:48:08 -04:00
Guy David	f5c62ee0fa	[PHIElimination] Reuse existing COPY in predecessor basic block (#131837 ) The insertion point of COPY isn't always optimal and could eventually lead to a worse block layout, see the regression test in the first commit. This change affects many architectures, but the total number of instructions in the test cases seems to be slightly lower.	2025-06-29 21:28:42 +03:00
Ana Mihajlovic	08d747c1ef	[AMDGPU] Fix bad removal of s_delay_alu (#145728 ) The instructionWaitsForSGPRWrites function covers ALL SALU instructions, including those like s_waitcnt that don't read from an SGPR. This results in removing delay_alu instructions in cases like VALU->SGPR->VALU, which causes a performance regression. The change modifies the function so that it also checks whether the instruction reads an SGPR.	2025-06-27 16:15:10 +02:00
Brox Chen	6dbc01e801	[AMDGPU][True16][CodeGen] update GFX11Plus codegen test with true16 flag (#135078 ) This is an NFC patch. It runs a bulk update on CodeGen tests that are impacted by the true16 features. This patch: 1. duplicates GFX11plus runlines and applies them with "+mattr=+real-true16" and "+mattr=-real-true16" 2. updates the tests with the update script For some GISEL runlines, the current CodeGen does not fully support the true16 version. The runlines are still updated, but the failing ones are commented out and a "FIXME-TRUE16" comment is added to each such test for easier tracking. These tests will be fixed in following patches. This is a transition state in which we support both "+real-true16/-real-true16" in our code base. We plan to move to "+real-true16" as the default, and finally remove the "-real-true16" mode and test lines.	2025-04-23 13:06:52 -04:00
zhijian lin	afda4c295b	Reland [SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#136701 ) This patch addresses the signed/zero extension of poison by using a poison value of the extended type instead of a constant zero of the extended type.	2025-04-22 17:36:41 -04:00
Nico Weber	e18a77cfbe	Revert "[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741 )" This reverts commit f12078e72601e7c03e5d66afab034313caf8f791. Breaks `check-llvm`, see comments on https://github.com/llvm/llvm-project/pull/122741	2025-04-21 10:51:03 -04:00
zhijian lin	f12078e726	[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741 ) The PR will fix the issue https://github.com/llvm/llvm-project/issues/122728 This patch addresses the signed/zero extension of poison by using a poison value of the extended type instead of a constant zero of the extended type.	2025-04-21 10:02:21 -04:00
Shoreshen	121cd7c6f0	Re apply 130577 narrow math for and operand (#133896 ) Re-apply https://github.com/llvm/llvm-project/pull/130577 which was reverted in https://github.com/llvm/llvm-project/pull/133880 The old application failed under address sanitizer because `tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in `AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, which created a use-after-free. To fix this, `tryNarrowMathIfNoOverflow` is now called before it, and the function returns directly if `tryNarrowMathIfNoOverflow` returns true.	2025-04-17 17:03:32 +08:00
Shoreshen	7f14b2a9eb	Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880 ) Reverts llvm/llvm-project#130577	2025-04-01 17:37:02 +08:00
Shoreshen	145b4a3950	[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577 ) For Add, Sub, Mul with Int64 type, if profitable, then do: 1. Trunc operands to Int32 type 2. Apply 32 bit Add/Sub/Mul 3. Zext to Int64 type	2025-04-01 11:18:17 +08:00
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Unused sdst writing to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
Ana Mihajlovic	459b4e3fe1	Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" (#131111 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-13 10:26:20 +01:00
Kazu Hirata	aa008e0008	Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d. Multiple buildbot failures have been reported: https://github.com/llvm/llvm-project/pull/127212	2025-03-12 12:09:09 -07:00
Ana Mihajlovic	71582c6667	[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-12 09:33:07 -07:00
Matt Arsenault	6aea6308d1	AMDGPU: Fix creating illegally typed readfirstlane in atomic optimizer (#128388 ) We need to promote 8/16-bit cases to 32-bit. Unfortunately we are missing demanded bits optimizations on readfirstlane, so we end up emitting an and instruction on the input. I'm also surprised this pass isn't handling half or bfloat yet.	2025-02-24 18:39:49 +07:00
Matt Arsenault	1bb43068f1	PeepholeOpt: Allow introducing subregister uses on reg_sequence (#127052 ) This reverts d246cc618adc52fdbd69d44a2a375c8af97b6106. We now handle composing subregister extracts through reg_sequence.	2025-02-22 09:16:14 +07:00
Sergei Barannikov	ff9c041d96	[MachineScheduler] Fix physreg dependencies of ExitSU (#123541 ) Providing the correct operand index allows addPhysRegDataDeps to compute the correct latency. Pull Request: https://github.com/llvm/llvm-project/pull/123541	2025-02-01 20:40:50 +03:00
Matt Arsenault	d246cc618a	PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111 ) Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.	2025-01-30 20:42:02 +07:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Shilei Tian	6548b6354d	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.	2024-11-08 20:21:16 -05:00
Shilei Tian	ca33649abe	Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.	2024-11-08 16:36:35 -05:00
Shilei Tian	e215a1e27d	[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )	2024-11-08 13:05:35 -05:00
Stanislav Mekhanoshin	6d7e51de5e	[AMDGPU] Extend type support for update_dpp intrinsic (#114597 ) We can split 64-bit DPP as a post-RA pseudo if control values are supported, but cannot handle other types.	2024-11-05 13:59:14 -08:00
Stanislav Mekhanoshin	3277c7cd28	[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765 ) MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's been identified this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if a total number of VGPRs per SIMD / number of used VGPRs is more than a number of wave slots.	2024-10-21 09:39:52 -07:00
Pierre van Houtryve	924a64a348	[AMDGPU] Only emit SCOPE_SYS global_wb (#110636 ) global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness. I was initially optimistic they would be very cheap no-ops but they can actually be quite expensive so let's avoid them.	2024-10-07 07:35:31 +02:00
Matt Arsenault	8632e8bd64	AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514 )	2024-09-23 13:20:21 +04:00
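
The narrowing listed in commit 145b4a3950 (truncate the operands to 32 bits, apply the 32-bit Add/Sub/Mul, zero-extend back to 64 bits) can be illustrated with a small Python model of the arithmetic. This is only a sketch, not the LLVM implementation: the helper names `wide_op`, `narrow_op`, and `fits_in_32_bits` are hypothetical, and the no-overflow condition is an assumption inferred from the `tryNarrowMathIfNoOverflow` name mentioned in 121cd7c6f0. The log does not describe the profitability check, so it is not modeled.

```python
import operator

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def wide_op(op, a, b):
    """Reference result: the unsigned 64-bit operation, wrapping modulo 2**64."""
    return op(a, b) & MASK64


def narrow_op(op, a, b):
    """Trunc both operands to 32 bits, apply the 32-bit op, zext the result."""
    a32 = a & MASK32                 # step 1: trunc operands to Int32
    b32 = b & MASK32
    r32 = op(a32, b32) & MASK32      # step 2: 32-bit Add/Sub/Mul (wrapping)
    return r32                       # step 3: zext, upper 32 bits stay zero


def fits_in_32_bits(op, a, b):
    """Assumed legality check: operands and the exact result fit in unsigned 32 bits."""
    return a <= MASK32 and b <= MASK32 and 0 <= op(a, b) <= MASK32


if __name__ == "__main__":
    for op, a, b in [(operator.add, 5, 7), (operator.mul, 1000, 1000),
                     (operator.add, MASK32, 1), (operator.sub, 3, 5)]:
        same = narrow_op(op, a, b) == wide_op(op, a, b)
        print(op.__name__, a, b, "fits:", fits_in_32_bits(op, a, b), "same:", same)
```

Whenever `fits_in_32_bits` holds, the narrowed and 64-bit results agree; when it does not (for example `0xFFFFFFFF + 1` or `3 - 5`), the two differ, which is why the transform needs a no-overflow guard.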

105 Commits