llvm-project

Author	SHA1	Message	Date
Manuel Carrasco	ab4b689258	[AMDGPU][SIFoldOperands] Fix OR -1 fold (#189655 ) In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed `Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after canonicalization). Closes https://github.com/llvm/llvm-project/issues/189677.	2026-04-01 13:37:37 +01:00
Stanislav Mekhanoshin	06a903e938	[AMDGPU] Clear no convergence flag on operand folding. NFCI (#179438 ) Clear the flag. It fails verification if set: only convergent operations may have the NoConvergent flag. It is NFCI for now because this case does not currently happen.	2026-02-03 10:46:26 -08:00
paperchalice	62aa40a4dd	[AMDGPU] Remove `NoSignedZerosFPMath` uses (#178343 ) One of global flags in `resetTargetOptions`, users should use `nsz` instead. `fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` will have regression when `fadd` is annotated with `nsz`.	2026-01-30 09:18:40 +08:00
Ryan Mitchell	13b20e7aea	[AMDGPU][SILoadStoreOptimizer] Fix lds address operand offset (#176816 ) The offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is added to both the lds and global address, but SILoadStoreOptimizer is currently unaware of that. This PR inserts an add to counteract the offset meant for the global address. This one add is better than not doing the optimization at all, and having to insert 2 adds for each global address calculation (with no offset). ``` ; ENABLE-LABEL: name: promote_async_load_offset ; ENABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1 ; ENABLE-NEXT: {{ $}} ; ENABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec ; ENABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 512, 0, implicit $exec ; ENABLE-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec ; ENABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec ; ENABLE-NEXT: renamable $vgpr0 = V_ADD_U32_e32 256, $vgpr1, implicit $exec ; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -256, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; DISABLE-LABEL: name: promote_async_load_offset ; DISABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1 ; DISABLE-NEXT: {{ $}} ; DISABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec ; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 256, $vgpr0, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, $vgpr0, killed $vcc_lo, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec ; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 512, $vgpr0, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec ; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ``` This PR also promotes the global address to an offset when the offset is calculated with V_ADD_U64 on applicable gfx versions (and inversely adds the LDS offset), whereas previously the optimization opportunity was missed entirely.	2026-01-26 09:23:17 +01:00
Sam Elliott	7184229fea	[NFC][MI] Tidy Up RegState enum use (2/2) (#177090 ) This Change makes `RegState` into an enum class, with bitwise operators. It also: - Updates declarations of flag variables/arguments/returns from `unsigned` to `RegState`. - Updates empty RegState initializers from 0 to `{}`. If this is causing problems in downstream code: - Adopt the `RegState getXXXRegState(bool)` functions instead of using a ternary operator such as `bool ? RegState::XXX : 0`. - Adopt the `bool hasRegState(RegState, RegState)` function instead of using a bitwise check of the flags.	2026-01-23 00:19:03 -08:00
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) With this cleanup, we are getting pretty close to being able to use `GET_SUBTARGETINFO_MACRO` in the header.	2026-01-22 09:07:15 -05:00
Shilei Tian	b4aa3d3ae3	[NFC] Check operand type instead of opcode (#168641 ) A follow-up to #168458.	2025-11-18 21:37:56 -05:00
Shilei Tian	6665642ce4	[AMDGPU] Don't fold an i64 immediate value if it can't be replicated from its lower 32 bits (#168458 ) On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal) and replicates the bits to both channels. In this case, we should not fold an immediate value if it can't be replicated from its lower 32 bits. Fixes SWDEV-567139.	2025-11-18 17:11:10 -05:00
LU-JOHN	9fa15ef916	[AMDGPU] When shrinking and/or to bitset, remove implicit scc def (#168128 ) When shrinking and/or to bitset remove leftover implicit scc def. bitset* instructions do not set scc. Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-15 09:21:43 -06:00
Ivan Kosarev	71eaf14094	[TableGen] Split *GenRegisterInfo.inc. (#167700 ) Reduces memory usage when compiling backend sources, most notably for AMDGPU, by ~98 MB per source on average. AMDGPUGenRegisterInfo.inc is tens of megabytes in size now, and is even larger downstream. At the same time, it is included in nearly all backend sources, typically just for a small portion of its content, resulting in compilation being unnecessarily memory-hungry, which in turn stresses buildbots and wastes their resources. Splitting .inc files also helps avoid extra ccache misses where changes in .td files don't cause changes in all parts of what previously was a single .inc file. It is thought that rather than building on top of the current single-output-file design of TableGen, e.g., using `split-file`, it would be preferable to recognise the need for multi-file outputs and give it proper first-class support directly in TableGen.	2025-11-14 16:30:51 +00:00
Jay Foad	72c69aefba	[AMDGPU] Make use of getFunction and getMF. NFC. (#167872 )	2025-11-14 11:00:57 +00:00
Matt Arsenault	e3a9ac5e24	AMDGPU: Remove wrapper around TRI::getRegClass (#159885 ) This shadows the member in the base class, but differs slightly in behavior. The base method doesn't check for the invalid case.	2025-11-11 15:31:52 -08:00
Matt Arsenault	55422e804b	CodeGen: Remove TRI argument from getRegClass (#158225 ) TargetInstrInfo now directly holds a reference to TargetRegisterInfo and does not need TRI passed in anywhere.	2025-11-10 15:43:55 -08:00
Abhay Kanhere	b4b57adb89	[AMDGPU][MachineVerifier] test failures in SIFoldOperands (#166600 ) After PR https://github.com/llvm/llvm-project/pull/151421 was merged, the following test failures in SIFoldOperands showed up: LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll LLVM :: CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll LLVM :: CodeGen/AMDGPU/llvm.amdgcn.mfma.ll LLVM :: CodeGen/AMDGPU/mfma-loop.ll LLVM :: CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr.ll In the folding code, if the folded operand is a register, ensure earlyClobber is set. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com> Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-11-07 21:12:19 -08:00
Matt Arsenault	67b6fd04dd	AMDGPU: Delete redundant recursive copy handling code (#157032 ) This fixes a regression exposed after 445415219708f9539801018e03282049ca33e0e2. This introduces a few small regressions for true16. There are more cases where the value can propagate through subregister extracts which need new handling. They're also small enough that perhaps there's a way to avoid needing to deal with this case in the first place.	2025-11-05 18:01:12 -08:00
Matt Arsenault	1a5494ca4a	AMDGPU: Use RegClassByHwMode to manage operand VGPR operand constraints (#158272 ) This removes special case processing in TargetInstrInfo::getRegClass to fix up register operands which, depending on the subtarget, support AGPRs or require even-aligned registers. This regresses assembler diagnostics, which currently work by hackily accepting invalid cases and then post-rejecting a validly parsed instruction. On the plus side, this now emits a comment when disassembling unaligned registers for targets with the alignment requirement.	2025-10-08 11:19:54 +09:00
Brox Chen	b8127cc8d0	[AMDGPU][True16][CodeGen] fix v_mov_b16_t16 index in folding pass (#161764 ) In true16 mode, v_mov_b16_t16 is added as a new foldable copy inst, but its src operand is at a different index. Use the correct src index for v_mov_b16_t16.	2025-10-03 17:34:42 -04:00
Matt Arsenault	80fd3eda25	AMDGPU: Fix constrain register logic for physregs (#161794 ) We do not need to reconstrain physical registers. Enables an additional fold for constant physregs.	2025-10-03 21:20:36 +09:00
Matt Arsenault	597f93d36b	AMDGPU: Check if immediate is legal for av_mov_b32_imm_pseudo (#160819 ) This is primarily to avoid folding a frame index materialized into an SGPR into the pseudo; this would end up looking like: %sreg = s_mov_b32 %stack.0 %av_32 = av_mov_b32_imm_pseudo %sreg Which is not useful. Match the check used for the b64 case. This is limited to the pseudo to avoid regression due to gfx908's special case - it is expecting to pass here with v_accvgpr_write_b32 for illegal cases, and stay in the intermediate state with an sgpr input. This avoids regressions in a future patch.	2025-09-27 08:24:20 +09:00
Josh Hutton	de59bc42ed	[AMDGPU] Avoid constraining RC based on folded into operand (NFC) (#160743 ) The RC of the folded operand does not need to be constrained based on the RC of the current operand we are folding into. The purpose of this PR is to facilitate this PR: https://github.com/llvm/llvm-project/pull/151033	2025-09-26 05:08:09 +00:00
Stanislav Mekhanoshin	f0090bacc1	[AMDGPU] Fold copies of constant physical registers into their uses (#154410 ) Co-authored-by: Jay Foad <Jay.Foad@amd.com>	2025-09-17 10:49:34 -07:00
Matt Arsenault	ea9acc97f1	CodeGen: Surface shouldRewriteCopySrc utility function (#158524 ) Change shouldRewriteCopySrc to return the common register class and expose it as a utility function. I've found myself reproducing essentially the same logic in multiple places. The purpose of this function is to just work through the API constraints of which combination of register class and subreg indexes you have. That is, you need to use a different function if you have 0, 1, or 2 subregister indexes involved in a pair of copy-like operations.	2025-09-16 14:53:49 +09:00
Matt Arsenault	7289f2cd0c	CodeGen: Remove MachineFunction argument from getRegClass (#158188 ) This is a low level utility to parse the MCInstrInfo and should not depend on the state of the function.	2025-09-12 19:22:02 +09:00
Matt Arsenault	dd5eb46690	AMDGPU: Fold 64-bit immediate into copy to AV class (#155615 ) This is in preparation for patches which will introduce more copies to av registers.	2025-09-03 09:29:59 +09:00
Matt Arsenault	3a7d14acce	AMDGPU: Avoid using exact class check in reg_sequence AGPR fold (#156135 ) This does better in cases which mix align2 and non-align2 classes.	2025-09-03 09:05:48 +09:00
Matt Arsenault	96e4caadb4	AMDGPU: Stop special casing aligned VGPR targets in operand folding (#155559 ) Perform a register class constraint check when performing the fold	2025-09-02 16:15:25 +00:00
Matt Arsenault	a0c472d50f	AMDGPU: Remove special case of SGPR_LO class in imm folding (#155518 ) Previous change accidentally broke this which shows it's not doing anything.	2025-08-27 06:08:38 +00:00
Matt Arsenault	9091108c66	AMDGPU: Fold mov imm to copy to av_32 class (#155428 ) Previously we had special case folding into copies to AGPR_32, ignoring AV_32. Try folding into the pseudos. Not sure why the true16 case regressed.	2025-08-27 02:13:14 +00:00
Matt Arsenault	4454152197	AMDGPU: Replace copy-to-mov-imm folding logic with class compat checks (#154501 ) This strengthens the check to ensure the new mov's source class is compatible with the source register. This avoids using the register sized based checks in getMovOpcode, which don't quite understand AV superclasses correctly. As a side effect it also enables more folds into true16 movs. getMovOpcode should probably be deleted, or at least replaced with class check based logic. In this particular case other legality checks need to be mixed in with attempted IR changes, so I didn't try to push all of that into the opcode selection.	2025-08-26 23:41:35 +09:00
Stanislav Mekhanoshin	3ef3b30c3c	Revert "[AMDGPU] Fold copies of constant physical registers into their uses (#154183 )" (#154219 ) This reverts commit 3395676a18ab580f21ebcd4324feaf1294a8b6d9. Fails libc/test/src/string/libc.test.src.string.memmove_test.__hermetic__	2025-08-18 16:22:47 -07:00
Stanislav Mekhanoshin	3395676a18	[AMDGPU] Fold copies of constant physical registers into their uses (#154183 ) With current codegen this only affects src_flat_scratch_base_lo/hi. Co-authored-by: Jay Foad <Jay.Foad@amd.com>	2025-08-18 13:07:36 -07:00
Stanislav Mekhanoshin	d09dbdabb9	[AMDGPU] bf16 clamp folding (#152573 )	2025-08-07 12:59:50 -07:00
Ivan Kosarev	2b20cf7291	[AMDGPU] Fold into uses of splat REG_SEQUENCEs through COPYs. (#145691 )	2025-08-04 16:18:33 +01:00
Stanislav Mekhanoshin	ce40863209	[AMDGPU] Add v_cvt_sr|pk_bf8|fp8_f16 gfx1250 instructions (#151415 )	2025-07-30 17:24:45 -07:00
Matt Arsenault	5f3eea7ef2	AMDGPU: Fix not folding splat immediate into VGPR MFMA src2 (#150628 )	2025-07-26 13:54:49 +09:00
Stanislav Mekhanoshin	006858cd4d	[AMDGPU] Prevent folding of FI with scale_offset on gfx1250 (#149894 ) SS forms of SCRATCH_LOAD_DWORD do not support SCALE_OFFSET, so if this bit is used SCRATCH_LOAD_DWORD_SADDR cannot be formed. This generally shall not happen because FI is not supposed to be scaled, but add this as a precaution.	2025-07-21 15:05:43 -07:00
macurtis-amd	402b989693	AMDGPU: Fix assert when multi operands to update after folding imm (#148205 ) In the original motivating test case, [FoldList](`d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1764)`) had entries: ``` #0: UseMI: %224:sreg_32 = S_OR_B32 %219.sub0:sreg_64, %219.sub1:sreg_64, implicit-def dead $scc UseOpNo: 1 #1: UseMI: %224:sreg_32 = S_OR_B32 %219.sub0:sreg_64, %219.sub1:sreg_64, implicit-def dead $scc UseOpNo: 2 ``` After calling [updateOperand(#0)](`d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1773)`), [tryConstantFoldOp(#0.UseMI)](`d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1786)`) removed operand 1, and entry #1.UseOpNo was no longer valid, resulting in an [assert](`4a35214bdd/llvm/include/llvm/ADT/ArrayRef.h (L452)`). This change defers constant folding until all operands have been updated so that UseOpNo values remain stable.	2025-07-16 06:37:08 -05:00
Matt Arsenault	e1f224b99a	AMDGPU: Handle folding vector splats of inline split f64 inline immediates (#140878 ) Recognize a reg_sequence with 32-bit elements that produce a 64-bit splat value. This enables folding f64 constants into mfma operands	2025-06-26 07:45:49 +09:00
Matt Arsenault	472c9141f9	AMDGPU: Fix tracking subreg defs when folding through reg_sequence (#140608 ) We weren't fully respecting the type of a def of an immediate vs. the type at the use point. Refactor the folding logic to track the value to fold, as well as a subregister to apply to the underlying value. This is similar to how PeepholeOpt tracks subregisters (though only for pure copy-like instructions, no constants). Fixes #139317	2025-06-26 07:42:55 +09:00
Matt Arsenault	80064b6e32	AMDGPU: Try constant fold after folding immediate (#141862 ) This helps avoid some regressions in a future patch. The or 0 pattern appears in the division tests because reducing a 64-bit bit operation to a 32-bit one, when one half is an identity value, is only implemented for constants. We could fix that by using computeKnownBits. Additionally, the pattern disappears if I optimize the IR division expansion, so that IR should probably be emitted more optimally in the first place.	2025-06-10 11:44:44 +09:00
Daniil Fukalov	5208f722d8	[AMDGPU] Fix SIFoldOperandsImpl::canUseImmWithOpSel() for VOP3 packed [B]F16 imms. (#142142 ) VOP3 instructions ignore opsel source modifiers, so a constant that contains two different [B]F16 imms cannot be encoded into an instruction with a src opsel. E.g., without the fix, the following instructions `s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>` `v_cvt_scalef32_pk_fp8_f16 v0, s0, v2` lose the `2.0` imm and are folded into `v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0` Fixes SWDEV-531672	2025-05-30 16:38:07 +02:00
Matt Arsenault	65b90c59ce	AMDGPU: Remove redundant operand folding checks (#140587 ) This was pre-filtering out a specific situation from being added to the fold candidate list. The operand legality will ultimately be checked with isOperandLegal before the fold is performed, so I don't see the plus in pre-filtering this one case.	2025-05-29 19:38:45 +02:00
Matt Arsenault	1b07c589b2	AMDGPU: Delete seemingly dead s_fmaak_f32/s_fmamk_f32 folding code (#140580 ) No tests fail with this. I'm not sure I understand the comment, there can't be any folding into an operand that had to already be a constant. I tried different combinations of immediates to these instructions but never hit the condition.	2025-05-29 19:36:05 +02:00
Fabian Ritter	fb27867bd5	[AMDGPU] SIFoldOperands: Delay foldCopyToVGPROfScalarAddOfFrameIndex (#141558 ) foldCopyToVGPROfScalarAddOfFrameIndex transforms s_adds whose results are copied to vector registers into v_adds. We don't want to do that if foldInstOperand (which so far runs later) can fold the sreg->vreg copy away. This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex until after foldInstOperand. This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids regressions in an upcoming patch to enable ISD::PTRADD nodes.	2025-05-27 11:30:51 +02:00
Rahul Joshi	52c2e45c11	[NFC][CodeGen] Adopt MachineFunctionProperties convenience accessors (#141101 )	2025-05-23 08:30:29 -07:00
Matt Arsenault	36018494fd	AMDGPU: Check for subreg match when folding through reg_sequence (#140582 ) We need to consider the use instruction's interpretation of the bits, not the defined immediate without use context. This will regress some cases where we previously could match f64 inline constants. We can restore them by either using pseudo instructions to materialize f64 constants, or recognizing reg_sequence decomposed into 32-bit pieces for them (which essentially means recognizing every other input is a 0). Fixes #139908	2025-05-19 21:44:44 +02:00
Matt Arsenault	4ddab1252f	AMDGPU: Move reg_sequence splat handling (#140313 ) This code clunkily tried to find a splat reg_sequence by looking at every use of the reg_sequence, and then looking back at the reg_sequence to see if it's a splat. Extract this into a separate helper function to help clean this up. This now parses whether the reg_sequence forms a splat once, and defers the legal inline immediate check to the use check (which is really use context dependent) The one regression is in globalisel, which has an extra copy that should have been separately folded out. It was getting dealt with by the handling of foldable copies in tryToFoldACImm. This is preparation for #139908 and #139317	2025-05-17 08:18:01 +02:00
Ivan Kosarev	c290f48a45	[AMDGPU][NFC] Remove unused operand types. (#139062 )	2025-05-08 12:48:25 +01:00
Akhilesh Moorthy	9c9013f703	[AMDGPU] Handle MachineOperandType global address in SIFoldOperands. (#135424 ) This patch handles the global operand type properly, fixing the bug: Assertion `(isFI() || isCPI() || isTargetIndex() || isJTI()) && "Wrong MachineOperand accessor"` failed. Fixes SWDEV-504645 --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-05-05 18:12:35 +02:00
Kazu Hirata	d144c13ae5	[Target] Remove unused local variables (NFC) (#138443 )	2025-05-04 07:56:38 -07:00

303 Commits