llvm-project

Author	SHA1	Message	Date
Brox Chen	a61cc1b99a	[AMDGPU][True16][CodeGen] Skip combineDpp with t16 instructions (#128918 ) We only emit v_mov_b32/64_dpp. Don't combine t16 instructions with mov dpp. Update the test inputs to be legal. It is future work to emit v_mov_b16_dpp, and then update GCNDPPCombine to combine it with the 16-bit instructions.	2025-03-31 10:18:25 -04:00
Simon Pilgrim	9b32f3d096	[DAG] visitEXTRACT_SUBVECTOR - don't return early on failure of EXTRACT_SUBVECTOR(INSERT_SUBVECTOR()) -> BITCAST fold (#133695 ) Always allow later folds to try to match as well.	2025-03-31 14:32:43 +01:00
Fangrui Song	04a67528d3	[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses (#133674 ) The existing pretty printer generates excessive parentheses for MCBinaryExpr expressions. This update removes unnecessary parentheses of MCBinaryExpr with +/- operators and MCUnaryExpr. Since relocatable expressions only use + and -, this change improves readability in most cases. Examples: - (SymA - SymB) + C now prints as SymA - SymB + C. This updates the output of -fexperimental-relative-c++-abi-vtables for AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8` - expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this change primarily affecting AMDGPUMCExpr.	2025-03-30 22:03:14 -07:00
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Unused sdst writing to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
LU-JOHN	827f2ad643	AMDGPU: Convert vector 64-bit shl to 32-bit if shift amt >= 32 (#132964 ) Convert vector 64-bit shl to 32-bit if shift amt is known to be >= 32. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-03-28 23:46:35 +07:00
Ana Mihajlovic	f7a034d400	[AMDGPU] (x or y) xor -1 -> x nor y (#130264 ) Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor. This patch is for i1 divergent. The ballot in the test is added for the retrieval of lanemask. The control flow is needed because the combiner can't pass through phi instructions.	2025-03-28 11:20:17 +01:00
Brox Chen	06411399fb	[AMDGPU][True16][CodeGen] srl pattern for true16 mode (#132987 ) Added an srl pattern for the true16 flow. It changes a 16-bit right shift to a reg_sequence `srl vgpr32, 16 -> reg_sequence (vgpr32.hi16, 0)`, which is finally lowered to two COPYs: `vdst.lo16 = COPY vsrc.hi16` `vdst.hi16 = COPY 0` The benefit of this transform is that it allows the following pass to optimize out these copies.	2025-03-26 18:38:20 -04:00
Alex MacLean	672c51c9cb	[SDAG][tests] add some test cases covering an add-based rotate (#132842 ) Add tests to various targets covering rotate idioms where an 'ADD' node is used to combine the halves instead of an 'OR'. Some of these cases will be better optimized following #125612, while others are already well optimized or do not have a valid fold to a rotate or funnel-shift.	2025-03-26 09:47:28 -07:00
Akshat Oke	719b029c16	[AMDGPU][NPM] Port SILateBranchLowering to NPM (#130063 )	2025-03-26 19:28:19 +05:30
Jeffrey Byrnes	e5641f6584	[AMDGPU] Autogen checks for mfma-loop.ll (#133004 ) Needed for a RegisterCoalescing patch	2025-03-25 15:24:40 -07:00
Juan Manuel Martinez Caamaño	2f8d699845	[AMDGPU][SelectionDAG] Use COPY instead of S_MOV_B32 to assign values to M0 (#132957 ) This is consistent with what's done on GISel. This allows the register coalescer to remove the redundant intermediate `s_mov_b32` instructions by using `m0` directly as the result register.	2025-03-25 19:05:43 +01:00
Jeffrey Byrnes	25938389c0	[AMDGPU] Autogen checks for agpr-csr.ll (#132959 ) Needed for a RegisterCoalescer patch	2025-03-25 10:28:35 -07:00
LU-JOHN	70aeb89094	Calculate KnownBits from Metadata correctly for vector loads (#128908 ) Calculate KnownBits correctly from metadata for vector loads. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-03-25 22:46:30 +07:00
Akshat Oke	f8e908a0ed	[AMDGPU][NPM] Port SIInsertHardClauses to NPM (#130062 )	2025-03-25 15:33:32 +05:30
Austin Kerbow	e75f586b81	[AMDGPU] Relax lds dma waitcnt with no aliasing pair (#131842 ) If we cannot find any lds DMA instruction that is aliased by some load from lds, we will still insert vmcnt(0). This is overly cautious since handling inter-thread dependences is normally managed by the memory model instead of the waitcnt pass, so this change updates the behavior to be more in line with how other types of memory events are handled.	2025-03-24 10:38:47 -07:00
Jay Foad	02ed65912e	[AMDGPU] 4-align TTMP triples (#132759 ) Follow up to e4284a7c70cd "[AMDGPU] 4-align SGPR triples". Previously TTMP triples like ttmp[3:5] were aligned on a 3-TTMP boundary which has no basis in hardware. Aligning them on a 4-TTMP boundary matches what we do for SGPRs, which reduces the number of extra register classes synthesized by TableGen, bringing the total number down from 653 to 615.	2025-03-24 17:11:39 +00:00
Ana Mihajlovic	cdea46cc8c	[AMDGPU] Add pattern for inverse.ballot.i64 Wave32 (#132770 )	2025-03-24 17:30:02 +01:00
Akshat Oke	f10dc76f03	[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061 )	2025-03-24 21:36:45 +05:30
Juan Manuel Martinez Caamaño	5634e7e2f0	[AMDGCN][SIWholeQuadMode] Rework splitBlock/lowerKillI1/lowerKillF32 to handle case when SI_KILL_I1_TERMINATOR -1 0 is not the unique terminator The lowerKillI1 method wrongly handled cases where it inserted a new S_BRANCH instruction when the kill was not the only terminator, and then tried to split the block. `SI_KILL_I1_TERMINATOR -1,0` doesn't have any effect. Instead of lowering to an unconditional branch, we remove the instruction and insert an unconditional branch only if the instruction is the last terminator. No split is needed in this case (if the last terminator has been reached, then the whole block was processed). Also stop generating an unconditional branch in splitBlock: this branch was redundant since TermMI is promoted to a terminator that already falls through to the next block. Solves SWDEV-508819	2025-03-24 15:57:08 +01:00
Pierre van Houtryve	c457c88951	[GlobalISel] Combine (sext (trunc x)) to (sext_inreg x) (#131622 ) Split from #131312	2025-03-24 09:32:04 +01:00
Pierre van Houtryve	6e3c24fc0a	[DAG] Combine (sext (sext_in_reg x)) to (sext_in_reg (any_extend x)) (#132386 )	2025-03-24 09:31:02 +01:00
Shoreshen	054e0b41a8	[AMDGPU] Add all type for bitcast on VReg_512 (#131775 ) Add all types pattern for bitcast on VReg_512	2025-03-24 11:52:10 +08:00
Shilei Tian	f1ac2afe21	Reapply "[AMDGPU] Use COV6 by default (#118515 )" (#130963 ) This reverts commit 68bcba6d7a1cc18996c0bcb7c62267c62d2040d0.	2025-03-21 15:26:45 -04:00
Stephen Thomas	2e3fa4ba9e	[AMDGPU] Insert before and after instructions that always use GDS (#131338 ) It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete. Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait was followed by S_ENDPGM. Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).	2025-03-21 09:33:04 +00:00
Brox Chen	55d3a55cc1	[AMDGPU][True16][CodeGen]disable true16 on fneg test (#132221 ) This is an NFC change. Revert the failed test case in https://github.com/llvm/llvm-project/pull/131206	2025-03-20 10:32:49 -04:00
Diana Picus	e17b3cdfb3	[AMDGPU] Dynamic VGPR support for llvm.amdgcn.cs.chain (#130094 ) The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may indicate that we want to reallocate the VGPRs before performing the call. A call with the following arguments: ``` llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args, /flags/0x1, %num_vgprs, %fallback_exec, %fallback_callee ``` is supposed to do the following: - copy the SGPR and VGPR args into their respective registers - try to change the VGPR allocation - if the allocation has succeeded, set EXEC to %exec and jump to %callee, otherwise set EXEC to %fallback_exec and jump to %fallback_callee This patch implements the dynamic VGPR behaviour by generating an S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and callee. The rest of the call sequence is left undisturbed (i.e. identical to the case where the flags are 0 and we don't use dynamic VGPRs). We achieve this by introducing some new pseudos (SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason is so that we don't risk other passes (particularly the PostRA scheduler) introducing instructions between the S_ALLOC_VGPR and the jump. Such instructions might end up using VGPRs that have been deallocated, or the wrong EXEC mask. Once the whole backend treats S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use VGPRs, we could in principle move the expansion earlier (but in the absence of a good reason for that my personal preference is to keep it later in order to make debugging easier). Since the expansion happens after register allocation, we're careful to select constants to immediate operands instead of letting ISel generate S_MOVs which could interfere with register allocation (i.e. make it look like we need more registers than we actually do). 
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during ISel in wave64 mode. However, we can define the pseudos for wave64 too so it's easy to handle if future generations support it. --------- Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com> Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-03-20 08:38:04 +01:00
Jeffrey Byrnes	6cc918089a	[AMDGPU] Autogen checks for mfma-no-register-aliasing.ll (#132117 ) For an upcoming RegisterCoalescer PR	2025-03-19 16:02:51 -07:00
Mariusz Sikora	4f5ccf22fa	[AMDGPU] Support image_bvh8_intersect_ray instruction and intrinsic. (#130041 ) Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-03-19 16:08:08 +01:00
Emma Pilkington	3eddb992d0	[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167 ) Fixes SWDEV-514946	2025-03-19 09:31:54 -04:00
Diana Picus	72c3c30452	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055 ) The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).	2025-03-19 13:49:19 +01:00
Matt Arsenault	5b6b4fdb4b	DAG: Fix promote of half freeze (#131844 )	2025-03-19 18:30:34 +07:00
Diana Picus	8a53324aa5	[AMDGPU] Deallocate VGPRs before exiting in dynamic VGPR mode (#130037 ) In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0` before S_ENDPGM, but this may incur some performance cost. Therefore it's better if the compiler proactively generates that instruction. This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a `S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.	2025-03-19 09:00:36 +01:00
Shoreshen	b907920058	[AMDGPU] auto-generate file check line for amdgcn.bitcast.ll (#131955 ) Replace check lines by auto-generated	2025-03-19 15:40:58 +08:00
Mariusz Sikora	575fde0995	[AMDGPU] Add intrinsic and MI for image_bvh_dual_intersect_ray (#130038 ) - Add llvm.amdgcn.image.bvh.dual.intersect.ray intrinsic and image_bvh_dual_intersect_ray machine instruction. - Add llvm_v10i32_ty and llvm_v10f32_ty --------- Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>	2025-03-19 07:35:09 +01:00
Akshat Oke	6cc23faaac	[AMDGPU][NPM] Port AMDGPUMarkLastScratchLoad to NPM (#131738 ) This finishes all passes for the optimized regalloc path. --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-03-19 09:27:05 +05:30
Matt Arsenault	5ac680c5bf	AMDGPU: Add more freeze codegen tests (#131843 )	2025-03-19 10:17:59 +07:00
Matt Arsenault	428e3a27c3	AMDGPU: Fix attributor not handling all trap intrinsics (#131758 )	2025-03-19 10:17:28 +07:00
Matt Arsenault	2e39533e50	AMDGPU: Fix broken check prefix and degraded cov4 test coverage (#131757 )	2025-03-19 08:30:05 +07:00
Matt Arsenault	6f44be97d0	IR: Make llvm.fake.use a DefaultAttrsIntrinsic (#131743 ) This shouldn't be special and is just an ordinary sideeffect.	2025-03-19 08:29:04 +07:00
Carl Ritson	0e4116a6b9	[AMDGPU] Fix typing error in multi dimensional promote alloca (#131763 ) Fix type error when GEP uses i64 index introduced in #127973.	2025-03-19 08:17:04 +09:00
Julian Brown	84909d7977	[AMDGCN] Allow unscheduling of bundled insns This is a patch arising from AMD's fuzzing project. In the test case, the scheduling algorithm decides to undo an attempted schedule, but is unprepared to handle bundled instructions at that point -- and those can arise via the expansion of intrinsics earlier in compilation. The fix is to use the splice method instead of remove/insert, since that can handle bundles properly.	2025-03-18 11:56:51 -05:00
Matt Arsenault	dea5aa73fa	AMDGPU: Move insertion into V2SCopies map (#130776 ) Insert the start instruction directly into the map before the uses. This prevents improperly re-visiting sgpr->vgpr phi inputs multiple times which would trigger a use after free. I don't particularly trust the iteration scheme here. This is also unnecessarily revisiting transitive users of a phi or reg_sequence for every input operand, but I will address that separately. Fixes #130646. I also believe it fixes #130119, although that test fails less consistently for me.	2025-03-18 23:28:49 +07:00
Fabian Ritter	332f060363	[SeparateConstOffsetFromGEP] Don't set unsound inbounds flag (#130616 ) The language reference says about inbounds geps that "if the getelementptr has any non-zero indices[...] [t]he base pointer has an in bounds address of the allocated object that it is based on [and] [d]uring the successive addition of offsets to the address, the resulting pointer must remain in bounds of the allocated object at each step." If (gep inbounds p, (a + 5)) is translated to (gep [inbounds] (gep p, a), 5) with p pointing to the beginning of an object and a=-4, as the example in the comments suggests, that's the case for neither of the resulting geps. Therefore, we need to clear the inbounds flag for both geps. We might want to use ValueTracking to check if a is known to be non-negative to preserve the inbounds flags. For the AMDGPU tests with scratch instructions, removing the unsound inbounds flag means that AMDGPUDAGToDAGISel::isFlatScratchBaseLegal sees no NUW flag at the pointer add, which prevents generation of scratch instructions with immediate offsets. For SWDEV-516125.	2025-03-18 12:30:20 +01:00
Diana Picus	0a21ef9536	[AMDGPU] Add SubtargetFeature for dynamic VGPR mode (#130030 ) This represents a hardware mode supported only for wave32 compute shaders. When enabled, we set the `.dynamic_vgpr_en` field of `.compute_registers` to true in the PAL metadata. This will be changed to use an attribute after downstream consumers have been migrated.	2025-03-18 11:48:01 +01:00
Matt Arsenault	c5fe075eaf	AMDGPU: Use freeze poison instead of undef in alloca promotion (#131285 ) Previously the value created to represent the uninitialized memory of the alloca was undef. Use freeze poison instead. Enables some optimization improvements (which need defeating in the limit tests), but also a few regressions. Seems to leave behind dead code in some cases too.	2025-03-18 17:27:02 +07:00
Vikash Gupta	bdb63208b4	[AMDGPU][CodeGen] Using MBB's liveIn check in tandem with MCRegAliasIterator in SILowerSGPRSpills (#129848 ) This patch replaces the use of MachineRegisterInfo's liveIn check with the machine basic block's liveIn, because the MRI's liveIn is inconsistent with the entry MBB's liveIns when it comes to the machine verifier checks. PS: It's an alternative solution to #126926.	2025-03-18 10:51:07 +05:30
Jim Lin	00cad3ed22	[SDAG] Handle extract_subvector in isKnownNeverNaN (#131581 ) Propagate nnan across extract_subvector.	2025-03-18 09:37:16 +08:00
Matt Arsenault	092e25571c	AMDGPU: Add REQUIRES: asserts to machine pass violation test We should promote this to a proper error and not llvm_unreachable	2025-03-18 07:31:50 +07:00
Tim Gymnich	887cf1f8ce	[AMDGPU][GlobalISel] Enable vector reductions (#131413 ) - Enable llvm vector reductions for AMDGPU. fixes https://github.com/llvm/llvm-project/issues/114816	2025-03-17 14:25:30 -07:00
Shilei Tian	e2c43ba981	[NFC][AMDGPU] Auto generate check lines for `llvm/test/CodeGen/AMDGPU/packed-fp32.ll` (#131629 )	2025-03-17 11:42:03 -04:00
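
The arithmetic behind 827f2ad643 ("Convert vector 64-bit shl to 32-bit if shift amt >= 32") can be checked with a small model. This is only a sketch of the identity the commit message describes, not the LLVM implementation; the function names `shl64_reference` and `shl64_via_32` are hypothetical and chosen for illustration. When the shift amount is known to be in the range 32..63, the high input half is shifted out entirely, the low result half is zero, and the high result half is a 32-bit shift of the low input half by (amount - 32).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def shl64_reference(x, amt):
    """Plain 64-bit left shift, truncated to 64 bits."""
    return (x << amt) & MASK64


def shl64_via_32(x, amt):
    """64-bit left shift built from one 32-bit shift (hypothetical helper).

    Only valid when the shift amount is known to be in 32..63.
    """
    assert 32 <= amt <= 63
    lo = x & MASK32                    # high input half does not contribute
    hi = (lo << (amt - 32)) & MASK32   # single 32-bit shift
    return hi << 32                    # low result half is always zero


# Compare both forms on a few sample values for every legal amount.
for x in (0, 1, 0x12345678, 0xDEADBEEFCAFEF00D, MASK64):
    for amt in range(32, 64):
        assert shl64_via_32(x, amt) == shl64_reference(x, amt)
```

The loop at the end compares the two forms over every shift amount from 32 to 63; for amounts below 32 the identity does not hold, which is why the helper asserts on its input range.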


8516 Commits