llvm-project

Author	SHA1	Message	Date
Matt Arsenault	fbf74b2553	AMDGPU: Select vector reg class for divergent build_vector (#168169 ) The main improvement is to the mfma tests. There are some mild regressions scattered around, and a few major ones. The worst regressions are in some of the bitcast tests; these are cases where the SGPR argument list runs out and uses VGPRs, and the copies-from-VGPR are misidentified as divergent. Most of the shufflevector tests are also regressions. These end up with cleaner MIR, but then get poor regalloc decisions.	2025-11-14 21:53:39 -08:00
Matt Arsenault	0fa6a67a42	AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166 ) Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.	2025-11-14 20:19:24 -08:00
Matt Arsenault	bbde792786	AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475 ) Allow widening up to 128-bit registers or if the new register class is at least as large as one of the existing register classes. This was artificially limiting. In particular this was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements. The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a11015 addressed the allocation failure concern by rejecting coalescing if there are no available registers. The original change in a4e63ead4b didn't include a realistic testcase to judge if this is harmful for pressure. I would expect any issues from this to be of garden variety subreg handling issue. We could use more dynamic state information here if it really is an issue. I get the best results by removing this override completely. This is a smaller step for patch splitting purposes.	2025-11-11 13:50:57 -08:00
Carl Ritson	385c12134a	[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663 ) Apply additional counter waits to address VALU writes to SGPRs. Rework expiry detection and apply wait coalescing to mitigate some of the additional waits.	2025-10-28 16:09:28 +09:00
LU-JOHN	9abbec66bf	[AMDGPU] Reland "Remove redundant s_cmp_lg_* sX, 0" (#164201 ) Reland PR https://github.com/llvm/llvm-project/pull/162352. Fix by excluding SI_PC_ADD_REL_OFFSET from instructions that set SCC = DST!=0. Passes check-libc-amdgcn-amd-amdhsa now. Distribution of instructions that allowed a redundant S_CMP to be deleted in check-libc-amdgcn-amd-amdhsa test: ``` S_AND_B32 485 S_AND_B64 47 S_ANDN2_B32 42 S_ANDN2_B64 277492 S_CSELECT_B64 17631 S_LSHL_B32 6 S_OR_B64 11 ``` --------- Signed-off-by: John Lu <John.Lu@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-10-22 08:42:29 -05:00
Jan Patrick Lehr	023b1f6a8e	Revert "[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 " (#164116 ) Reverts llvm/llvm-project#162352 Broke our buildbot: https://lab.llvm.org/buildbot/#/builders/10/builds/15674 To reproduce cd llvm-project cmake -S llvm -B thebuild -C offload/cmake/caches/AMDGPULibcBot.cmake -GNinja cd thebuild ninja ninja check-libc-amdgcn-amd-amdhsa	2025-10-18 22:38:14 +02:00
LU-JOHN	8e5f6dd37c	[AMDGPU] Remove redundant s_cmp_lg_* sX, 0 (#162352 ) Remove redundant s_cmp_lg_* sX, 0 if SALU instruction already sets SCC if sX!=0. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-10-18 09:33:47 -05:00
Matt Arsenault	884130bf93	AMDGPU: Allow folding multiple uses of some immediates into copies (#154757 ) In some cases this will require an avoidable re-defining of a register, but it works out better most of the time. Also allow folding 64-bit immediates into subregister extracts, unless it would break an inline constant. We could be more aggressive here, but this set of conditions seems to do a reasonable job without introducing too many regressions.	2025-09-06 08:22:09 +09:00
Ana Mihajlovic	08d747c1ef	[AMDGPU] Fix bad removal of s_delay_alu (#145728 ) instructionWaitsForSGPRWrites function covers ALL SALU instructions, including those like s_waitcnt that don't read from sgpr. This results in removing delay_alu instructions in cases like VALU->SGPR->VALU, which results in performance regression. Change modifies the function so that it checks if instruction also reads a sgpr.	2025-06-27 16:15:10 +02:00
Ruiling, Song	3e47d8deba	MachineScheduler: Reset next cluster candidate for each node (#139513 ) When a node is picked, we should reset its next cluster candidate to null before releasing its successors/predecessors.	2025-05-28 14:53:46 +08:00
Jay Foad	e30301aaec	[MachineCopyPropagation] Make use of lane mask info in basic block liveins (#140248 )	2025-05-16 16:54:34 +01:00
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Unused sdst writing to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
Ana Mihajlovic	459b4e3fe1	Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" (#131111 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-13 10:26:20 +01:00
Matt Arsenault	26ae98c4d7	AMDGPU: Replace undef global initializers in tests with poison (#131051 )	2025-03-13 13:41:03 +07:00
Kazu Hirata	aa008e0008	Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d. Multiple buildbot failures have been reported: https://github.com/llvm/llvm-project/pull/127212	2025-03-12 12:09:09 -07:00
Ana Mihajlovic	71582c6667	[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-12 09:33:07 -07:00
Sergei Barannikov	ff9c041d96	[MachineScheduler] Fix physreg dependencies of ExitSU (#123541 ) Providing the correct operand index allows addPhysRegDataDeps to compute the correct latency. Pull Request: https://github.com/llvm/llvm-project/pull/123541	2025-02-01 20:40:50 +03:00
Matt Arsenault	d246cc618a	PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111 ) Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.	2025-01-30 20:42:02 +07:00
Matt Arsenault	5a3299a684	AMDGPU: Remove some -verify-machineinstrs from tests (#117736 ) We should leave these for EXPENSIVE_CHECKS builds. Some of these were near the top of slowest tests.	2024-11-26 12:59:15 -05:00
Shilei Tian	6548b6354d	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.	2024-11-08 20:21:16 -05:00
Shilei Tian	ca33649abe	Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.	2024-11-08 16:36:35 -05:00
Shilei Tian	e215a1e27d	[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )	2024-11-08 13:05:35 -05:00
Stanislav Mekhanoshin	6d7e51de5e	[AMDGPU] Extend type support for update_dpp intrinsic (#114597 ) We can split 64-bit DPP as a post-RA pseudo if control values are supported, but cannot handle other types.	2024-11-05 13:59:14 -08:00
Stanislav Mekhanoshin	3277c7cd28	[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765 ) MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's been identified this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if a total number of VGPRs per SIMD / number of used VGPRs is more than a number of wave slots.	2024-10-21 09:39:52 -07:00
Matt Arsenault	8632e8bd64	AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514 )	2024-09-23 13:20:21 +04:00
Jay Foad	e55d6f5ea2	[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889 ) Always generate v_cndmask_b32 instead of modifying exec around v_mov_b32. This is expected to be faster because modifying exec generally causes pipeline stalls.	2024-09-11 17:16:06 +01:00
Carl Ritson	16cda01d22	[AMDGPU] V_SET_INACTIVE optimizations (#98864 ) Optimize V_SET_INACTIVE by allow it to run in WWM. Hence WWM sections are not broken up for inactive lane setting. WWM V_SET_INACTIVE can typically be lower to V_CNDMASK. Some cases require use of exec manipulation V_MOV as previous code. GFX9 sees slight instruction count increase in edge cases due to smaller constant bus. Additionally avoid introducing exec manipulation and V_MOVs where a source of V_SET_INACTIVE is the destination. This is a common pattern as WWM register pre-allocation often assigns the same register.	2024-09-05 14:39:28 +09:00
Jay Foad	126d6f2710	[AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108 ) Use poison for an unused input to the permlanex16 intrinsic, to improve register allocation and avoid an unnecessary v_mov instruction.	2024-09-04 11:03:22 +01:00
Changpeng Fang	a82032918c	[AMDGPU] Remove -wavefrontsize32 and -wavefrontsize64 from GFX10+ tests (NFC) (#100711 ) They are no longer needed after the patch: [AMDGPU] Remove wavefrontsize feature from GFX10: https://github.com/llvm/llvm-project/pull/98400 The exception is when "target-features" are set to "+wavefrontsize32" or "+wavefrontsize64", we still need to remove a wavefrontsize feature before add a different one to make sure only one of them are present.	2024-07-26 00:42:24 -07:00
Christudasan Devadasan	229e118559	[AMDGPU] Codegen support for constrained multi-dword sloads (#96163 ) For targets that support xnack replay feature (gfx8+), the multi-dword scalar loads shouldn't clobber any register that holds the src address. The constrained version of the scalar loads have the early clobber flag attached to the dst operand to restrict RA from re-allocating any of the src regs for its dst operand.	2024-07-23 13:59:15 +05:30
Vikram Hegde	cf230e7799	[AMDGPU] Enable atomic optimizer for divergent i64 and double values (#96934 )	2024-07-15 17:49:09 +05:30
Matt Arsenault	b1bcb7ca46	Reapply "AMDGPU: Move attributor into optimization pipeline (#83131 )" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851 ) This reverts commit adaff46d087799072438dd744b038e6fd50a2d78. Drop the -O3 checks from default-attributes.hip. I don't know why they are different on some bots but reverting this is far too disruptive.	2024-07-15 11:51:44 +04:00
dyung	adaff46d08	Revert "AMDGPU: Move attributor into optimization pipeline (#83131 )" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851 ) This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and 78bc1b64a6dc3fb6191355a5e1b502be8b3668e7. The test CodeGenHIP/default-attributes.hip is failing on multiple bots even after the attempted fix including the following: - https://lab.llvm.org/buildbot/#/builders/3/builds/1473 - https://lab.llvm.org/buildbot/#/builders/65/builds/1380 - https://lab.llvm.org/buildbot/#/builders/161/builds/595 - https://lab.llvm.org/buildbot/#/builders/154/builds/1372 - https://lab.llvm.org/buildbot/#/builders/133/builds/1547 - https://lab.llvm.org/buildbot/#/builders/81/builds/755 - https://lab.llvm.org/buildbot/#/builders/40/builds/570 - https://lab.llvm.org/buildbot/#/builders/13/builds/748 - https://lab.llvm.org/buildbot/#/builders/12/builds/1845 - https://lab.llvm.org/buildbot/#/builders/11/builds/1695 - https://lab.llvm.org/buildbot/#/builders/190/builds/1829 - https://lab.llvm.org/buildbot/#/builders/193/builds/962 - https://lab.llvm.org/buildbot/#/builders/23/builds/991 - https://lab.llvm.org/buildbot/#/builders/144/builds/2256 - https://lab.llvm.org/buildbot/#/builders/46/builds/1614 These bots have been broken for a day, so reverting to get everything back to green.	2024-07-14 18:48:54 -07:00
Matt Arsenault	78bc1b64a6	AMDGPU: Move attributor into optimization pipeline (#83131 ) Removing it from the codegen pipeline induces a lot of test churn because llc is no longer optimizing out implicit arguments to kernels. Mostly mechanical, but there are some creative test updates. I preferred to take the changes as-is in tests where the ABI isn't relevant. In cases where it's more relevant, or the optimize out logic was too ingrained in the test, I pre-run the optimization. Some cases manually add attributes to disable inputs.	2024-07-14 08:36:33 +04:00
Vikram Hegde	2a9607168b	[AMDGPU] Cleanup bitcast spam in atomic optimizer (#96933 )	2024-07-08 10:53:16 +05:30
Jay Foad	3cf539fb04	[AMDGPU] Combine or remove redundant waitcnts at the end of each MBB (#87539 ) Call generateWaitcnt unconditionally at the end of SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to generate a new waitcnt instruction it has the effect of combining or removing redundant waitcnts that were already present. Tests show various small improvements in waitcnt placement.	2024-04-04 10:14:16 +01:00
Fangrui Song	9e9907f1cf	[AMDGPU,test] Change llc -march= to -mtriple= (#75982 ) Similar to 806761a7629df268c8aed49657aeccffa6bca449. For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outrightly. This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed: ``` LLVM :: CodeGen/AMDGPU/fabs.f64.ll LLVM :: CodeGen/AMDGPU/fabs.ll LLVM :: CodeGen/AMDGPU/floor.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll LLVM :: CodeGen/AMDGPU/fneg-fabs.ll LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll LLVM :: CodeGen/AMDGPU/schedule-if-2.ll ```	2024-01-16 21:54:58 -08:00
Acim Maravic	48f36c6e74	[LLVM] Make use of s_flbit_i32_b64 and s_ff1_i32_b64 (#75158 ) Update DAG ISel to support 64bit versions S_FF1_I32_B64 and S_FLBIT_I32_B664 --------- Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>	2023-12-25 11:55:20 +01:00
Pierre van Houtryve	ef067f5204	[AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already (#72830 ) Co-authored-by: Juan Manuel MARTINEZ CAAMAÑO <juamarti@amd.com>	2023-12-15 12:33:32 +01:00
Stanislav Mekhanoshin	fe8335babb	[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395 ) This allows folding of 64-bit operands if fit into 32-bit. Fixes https://github.com/llvm/llvm-project/issues/67781	2023-10-30 08:12:28 -07:00
Pravin Jagtap	f09360d20d	[AMDGPU] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer. Reduction and Scan are implemented using `Iterative` and `DPP` strategy for `float` type. Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D156301	2023-08-30 11:57:48 -04:00
Jay Foad	7fa7a08f21	[AMDGPU] Insert s_nop before s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) Differential Revision: https://reviews.llvm.org/D155681	2023-07-19 10:33:11 +01:00
Sameer Sahasrabuddhe	7a101798b7	Revert "[AMDGPU] Mark mbcnt as convergent" This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb. The output of mbcnt does not depend on other active lanes, and hence it is not convergent. The original change was made as a possible fix for https://github.com/ROCm-Developer-Tools/HIP/issues/3172 But changing mbcnt does not fix that issue. Reviewed By: ruiling, foad, yaxunl Differential Revision: https://reviews.llvm.org/D153953	2023-06-30 13:10:44 +05:30
Pravin Jagtap	597fb7fb46	[AMDGPU] Switch to the new cl option amdgpu-atomic-optimizer-strategy. Atomic optimizer is turned on by default through D152649. This patch removes the usage of old command line option amdgpu-atomic-optimizations and transfer the responsibility to `amdgpu-atomic-optimizer-strategy`. We can safely remove old option when LLPC remove its all usage. Reviewed By: foad, arsenm, #amdgpu, cdevadas Differential Revision: https://reviews.llvm.org/D153007	2023-06-22 07:06:42 -04:00
Pravin Jagtap	f6c8a8e9cb	[AMDGPU] Iterative scan implementation for atomic optimizer. This patch provides an alternative implementation to DPP for Scan Computations. An alternative implementation iterates over all active lanes of Wavefront using llvm.cttz and performs the following steps: 1. Read the value that needs to be atomically incremented using llvm.amdgcn.readlane intrinsic 2. Accumulate the result. 3. Update the scan result using llvm.amdgcn.writelane intrinsic if intermediate scan results are needed later in the kernel. Reviewed By: arsenm, cdevadas Differential Revision: https://reviews.llvm.org/D147408	2023-06-09 01:08:44 -04:00
Nikita Popov	bdf2fbba9c	[AMDGPU] Convert some tests to opaque pointers (NFC)	2022-12-19 12:41:13 +01:00
Guozhi Wei	11e86868c1	[MachineCSE] Allow CSE for instructions with ignorable operands Ignorable operands don't impact instruction's behavior, we can safely do CSE on the instruction. It is split from D130919. It has big impact to some AMDGPU test cases. For example in atomic_optimizations_raw_buffer.ll, when trying to check if the following instruction can be CSEed %37:vgpr_32 = V_MOV_B32_e32 0, implicit $exec Function isCallerPreservedOrConstPhysReg is called on operand "implicit $exec", this function is implemented as - return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| + return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| TII.isIgnorableUse(MO) \|\| (MRI.reservedRegsFrozen() && MRI.isConstantPhysReg(Reg)); Both TRI.isCallerPreservedPhysReg and MRI.isConstantPhysReg return false on this operand, so isCallerPreservedOrConstPhysReg is also false, it causes LLVM failed to CSE this instruction. With this patch TII.isIgnorableUse returns true for the operand $exec, so isCallerPreservedOrConstPhysReg also returns true, it causes this instruction to be CSEed with previous instruction %14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec So I got different result from here. AMDGPU's implementation of isIgnorableUse is bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const { // Any implicit use of exec by VALU is not a real register read. return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() && isVALU(MO.getParent()) && !resultDependsOnExec(MO.getParent()); } Since the operand $exec is not a real register read, my understanding is it's reasonable to do CSE on such instructions. Because more instructions are CSEed, so I get less instructions generated for these tests. Differential Revision: https://reviews.llvm.org/D137222	2022-11-14 19:34:59 +00:00
Carl Ritson	c316332e17	[Sink] Allow sinking of invariant loads across critical edges Invariant loads can always be sunk. Reviewed By: foad, arsenm Differential Revision: https://reviews.llvm.org/D135133	2022-10-06 09:21:12 +09:00
Joe Nash	d1af09ad96	[AMDGPU] gfx11 Generate VOPD Instructions We form VOPD instructions in the GCNCreateVOPD pass by combining back-to-back component instructions. There are strict register constraints for creating a legal VOPD, namely that the matching operands (e.g. src0x and src0y, src1x and src1y) must be in different register banks. We add a PostRA scheduler mutation to put possible VOPD components back-to-back. Depends on D128442, D128270 Reviewed By: #amdgpu, rampitec Differential Revision: https://reviews.llvm.org/D128656	2022-07-05 09:18:19 -04:00
Jay Foad	0f94d2b385	[AMDGPU] GFX11: automatically release VGPRs at the end of the shader GFX11 has a new message type MSG_DEALLOC_VGPRS which can be used to release a shader's VGPRs. Sending this at the end of a shader (just before the s_endpgm) can help overall system performance in cases where the s_endpgm would have to wait for outstanding VMEM stores to complete before releasing the VGPRs. Differential Revision: https://reviews.llvm.org/D128442	2022-06-30 20:55:14 +01:00

1 2 3

107 Commits