llvm-project

Author	SHA1	Message	Date
Sterling-Augustine	7514225052	Use a more proper idiom for "the output file doesn't matter". NFC. (#134280 ) As in the description. Follow up to PR #134179.	2025-04-03 10:24:10 -07:00
Brox Chen	bf388f8a43	[AMDGPU][True16][CodeGen] legalize operands when move16bit SALU to VALU (#133985 ) This is a follow up PR from https://github.com/llvm/llvm-project/pull/132089. When a V2S copy and its useMI are lowered to VALU, this patch checks whether the generated new VALU is a true16 inst and, if so, adds subreg access on all operands where necessary. An example MIR looks like: ``` %1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ... %2:sreg_32 = COPY %1:vgpr_32 %3:sreg_32 = S_FLOOR_F16 %2:sreg_32, ... ``` currently lowered to ``` %1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ... %2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1:vgpr_32, 0, 0, 0 ... ``` after this patch ``` %1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ... %2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1.lo16:vgpr_32, 0, 0, 0 ... ```	2025-04-03 12:26:41 -04:00
Krzysztof Drewniak	f23bb530cf	[AMDGPULowerBufferFatPointers] Use InstSimplifyFolder during rewrites (#134137 ) This PR updates AMDGPULowerBufferFatPointers to use the InstSimplifyFolder when creating IR during buffer fat pointer lowering. This shouldn't cause any large functional changes and might improve the quality of the generated code.	2025-04-03 10:12:18 -05:00
Sterling-Augustine	f68a5185d0	Allow this test to pass when the source is on a read-only filesystem (#134179 ) llc attempts to create an empty file in the current directory, but it can't do that on a read-only file system. Send that empty-output to stdout, which prevents this failure.	2025-04-02 16:49:57 -07:00
Brox Chen	066787b9bd	[AMDGPU][True16][CodeGen] fold clamp update for true16 (#128919 ) Check through COPY for possible clamp folding for v_mad_mixhi_f16 isel	2025-04-02 17:10:53 -04:00
Brox Chen	fb0e7b5f16	[AMDGPU][True16][CodeGen] Implement sgpr folding in true16 (#128929 ) We haven't implemented 16-bit SGPRs. For now, allow 32-bit SGPRs to be folded into True16 instructions taking 16-bit values. Also use sgpr_32 when Imm is copied to sgpr_lo16 so it can be folded further. This improves generated code quality.	2025-04-02 16:08:26 -04:00
Juan Manuel Martinez Caamaño	beae0e9f1a	[AMDGPU] Use a target feature to enable __builtin_amdgcn_global_load_lds on gfx9/10 (#133055 ) This patch introduces the `vmem-to-lds-load-insts` target feature, which can be used to enable builtins `__builtin_amdgcn_global_load_lds` and `__builtin_amdgcn_raw_ptr_buffer_load_lds` on platforms which have this feature. This feature is only available on gfx9/10. A limitation of using a common target feature for both builtins is that we could have made `__builtin_amdgcn_raw_ptr_buffer_load_lds` available on gfx6,7,8.	2025-04-02 20:00:09 +02:00
Juan Manuel Martinez Caamaño	0375ef07c3	[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741 ) This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as a 4-bit signed integer and returns `0.0625f * src`. SWDEV-518861	2025-04-02 19:51:40 +02:00
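The arithmetic described for `V_CVT_OFF_F32_I4` can be modelled in a few lines. This is a minimal Python sketch based only on the description in the commit message above; the function name `cvt_off_f32_i4` and the choice to ignore everything but the low 4 bits of the input are assumptions made for illustration, not the hardware definition.

```python
def cvt_off_f32_i4(src: int) -> float:
    """Model: read the low 4 bits of src as a signed integer, scale by 0.0625."""
    nibble = src & 0xF  # assumption: only the low 4 bits take part
    # Sign-extend from 4 bits: 0x8..0xF map to -8..-1.
    signed = nibble - 16 if nibble & 0x8 else nibble
    return 0.0625 * signed


if __name__ == "__main__":
    for value in (0x0, 0x7, 0x8, 0xF):
        print(hex(value), cvt_off_f32_i4(value))
```

Under this model the result always lies in the range -0.5 to 0.4375, in steps of 0.0625.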
Akshat Oke	a13a51b91f	[AMDGPU][NPM] Port AMDGPUSetWavePriority to NPM (#130064 )	2025-04-02 16:28:05 +05:30
Brox Chen	dd1d41f833	[AMDGPU][True16][CodeGen] fix moveToVALU with proper subreg access in true16 (#132089 ) There are V2S copies between vgpr16 and sgpr32 in true16 mode. This is caused by vgpr16 and sgpr32 both being selectable by a 16-bit src in ISel. When a V2S copy and its useMI are lowered to VALU, this patch: 1. Checks whether the generated new VALU is used by a true16 inst, and adds subreg access if necessary. 2. Legalizes the V2S copy by replacing it with subreg_to_reg. An example MIR looks like: ``` %2:sgpr_32 = COPY %1:vgpr_16 %3:sgpr_32 = S_OR_B32 %2:sgpr_32, ... %4:vgpr_16 = V_ADD_F16_t16 %3:sgpr_32, ... ``` currently lowered to ``` %2:vgpr_32 = COPY %1:vgpr_16 %3:vgpr_32 = V_OR_B32 %2:vgpr_32, ... %4:vgpr_16 = V_ADD_F16_t16 %3:vgpr_32, ... ``` after this patch ``` %2:vgpr_32 = SUBREG_TO_REG 0, %1:vgpr_16, lo16 %3:vgpr_32 = V_OR_B32 %2:vgpr_32, ... %4:vgpr_16 = V_ADD_F16_t16 %3.lo16:vgpr_32, ... ```	2025-04-01 12:40:18 -04:00
Shoreshen	7f14b2a9eb	Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880 ) Reverts llvm/llvm-project#130577	2025-04-01 17:37:02 +08:00
Valery Pykhtin	af0b0ce665	[AMDGPU] Fix SIFoldOperandsImpl::tryFoldZeroHighBits when met non-reg src1 operand. (#133761 ) This happens when a constant is propagated to a V_AND 0xFFFF, reg instruction. Fixes failures like: ``` llc: /github/llvm-project/llvm/include/llvm/CodeGen/MachineOperand.h:366: llvm::Register llvm::MachineOperand::getReg() const: Assertion `isReg() && "This is not a register operand!"' failed. Stack dump: 0. Program arguments: /github/llvm-project/build/Debug/bin/llc -mtriple=amdgcn -mcpu=gfx1101 -verify-machineinstrs -run-pass si-fold-operands /github/llvm-project/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir -o - 1. Running pass 'Function Pass Manager' on module '/github/llvm-project/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir'. 2. Running pass 'SI Fold Operands' on function '@test_tryFoldZeroHighBits_skips_nonreg' ... #12 0x00007f5a55005cfc llvm::MachineOperand::getReg() const /github/llvm-project/llvm/include/llvm/CodeGen/MachineOperand.h:0:5 #13 0x00007f5a555c6bf5 (anonymous namespace)::SIFoldOperandsImpl::tryFoldZeroHighBits(llvm::MachineInstr&) const /github/llvm-project/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:1459:36 #14 0x00007f5a555c63ad (anonymous namespace)::SIFoldOperandsImpl::run(llvm::MachineFunction&) /github/llvm-project/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:2455:11 #15 0x00007f5a555c6780 (anonymous namespace)::SIFoldOperandsLegacy::runOnMachineFunction ```	2025-04-01 10:27:58 +02:00
Shoreshen	145b4a3950	[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577 ) For Add, Sub, Mul with Int64 type, if profitable, then do: 1. Trunc operands to Int32 type 2. Apply 32 bit Add/Sub/Mul 3. Zext to Int64 type	2025-04-01 11:18:17 +08:00
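The three steps in the commit message above (trunc, 32-bit operation, zext) can be sketched in Python for the Add case. The helper names are hypothetical and this is an illustration, not the pass itself. It also shows that the narrowed form only matches the 64-bit result when the 32-bit operation does not wrap; how the pass establishes that, and how it judges profitability, is not modelled here.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add_i64(a: int, b: int) -> int:
    """Reference: plain 64-bit wrapping add."""
    return (a + b) & MASK64


def add_narrowed(a: int, b: int) -> int:
    """Narrowed form: trunc operands, add in 32 bits, zext the result."""
    a32 = a & MASK32             # 1. trunc to i32
    b32 = b & MASK32
    r32 = (a32 + b32) & MASK32   # 2. 32-bit add (wraps at 2**32)
    return r32                   # 3. zext to i64: high 32 bits are zero


if __name__ == "__main__":
    # Values that fit comfortably in 32 bits: both forms agree.
    print(add_i64(1000, 2000) == add_narrowed(1000, 2000))
    # A carry out of bit 31 is lost by the narrowed form: the forms differ.
    print(add_i64(0xFFFFFFFF, 1) == add_narrowed(0xFFFFFFFF, 1))
```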
Brox Chen	a61cc1b99a	[AMDGPU][True16][CodeGen] Skip combineDpp with t16 instructions (#128918 ) We only emit v_mov_b32/64_dpp. Don't combine t16 instructions with mov dpp. Update the test inputs to be legal. It is future work to emit v_mov_b16_dpp, and then update GCNDPPCombine to combine it with the 16-bit instructions.	2025-03-31 10:18:25 -04:00
Simon Pilgrim	9b32f3d096	[DAG] visitEXTRACT_SUBVECTOR - don't return early on failure of EXTRACT_SUBVECTOR(INSERT_SUBVECTOR()) -> BITCAST fold (#133695 ) Always allow later folds to try to match as well.	2025-03-31 14:32:43 +01:00
Fangrui Song	04a67528d3	[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses (#133674 ) The existing pretty printer generates excessive parentheses for MCBinaryExpr expressions. This update removes unnecessary parentheses of MCBinaryExpr with +/- operators and MCUnaryExpr. Since relocatable expressions only use + and -, this change improves readability in most cases. Examples: - (SymA - SymB) + C now prints as SymA - SymB + C. This updates the output of -fexperimental-relative-c++-abi-vtables for AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8` - expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this change primarily affecting AMDGPUMCExpr.	2025-03-30 22:03:14 -07:00
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Write an unused sdst to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
LU-JOHN	827f2ad643	AMDGPU: Convert vector 64-bit shl to 32-bit if shift amt >= 32 (#132964 ) Convert vector 64-bit shl to 32-bit if shift amt is known to be >= 32. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-03-28 23:46:35 +07:00
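The identity behind this conversion can be checked with a short Python sketch. The helper names are hypothetical, and the sketch models only the scalar arithmetic, not the vector lowering: when the shift amount is known to be at least 32, the low half of the result is zero and the high half is the low input half shifted by `amt - 32`.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def shl_i64(x: int, amt: int) -> int:
    """Reference: 64-bit shift left."""
    return (x << amt) & MASK64


def shl_i64_via_i32(x: int, amt: int) -> int:
    """64-bit shl built from one 32-bit shl, valid only for 32 <= amt < 64."""
    if not 32 <= amt < 64:
        raise ValueError("shift amount must be in [32, 64)")
    lo = x & MASK32                    # only the low input half survives
    hi = (lo << (amt - 32)) & MASK32   # single 32-bit shift
    return hi << 32                    # low half of the result is zero


if __name__ == "__main__":
    x = 0x123456789ABCDEF0
    for amt in (32, 40, 63):
        print(amt, shl_i64(x, amt) == shl_i64_via_i32(x, amt))
```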
Ana Mihajlovic	f7a034d400	[AMDGPU] (x or y) xor -1 -> x nor y (#130264 ) Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead of s_or and s_xor . This patch is for i1 divergent. The ballot in the test is added for the retrieval of lanemask. The control flow is needed because the combiner can't pass through phi instructions.	2025-03-28 11:20:17 +01:00
Brox Chen	06411399fb	[AMDGPU][True16][CodeGen] srl pattern for true16 mode (#132987 ) Added a srl pattern for the true16 flow. A right shift by 16 bits is changed to a reg_sequence `srl vgpr32, 16 -> reg_sequence (vgpr32.hi16, 0)`, which is finally lowered to two COPYs: `vdst.lo16 = COPY vsrc.hi16` `vdst.hi16 = COPY 0` The benefit of this transform is that it allows later passes to optimize out these copies.	2025-03-26 18:38:20 -04:00
Alex MacLean	672c51c9cb	[SDAG][tests] add some test cases covering an add-based rotate (#132842 ) Add tests to various targets covering rotate idioms where an 'ADD' node is used to combine the halves instead of an 'OR'. Some of these cases will be better optimized following #125612, while others are already well optimized or do not have a valid fold to a rotate or funnel-shift.	2025-03-26 09:47:28 -07:00
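A short Python sketch of why an 'ADD' can stand in for the 'OR' in a rotate idiom: the two shifted halves never share a set bit, so the addition produces no carries. The function names are hypothetical, the width is fixed at 32 bits, and the rotate amount is restricted to 1..31 as an assumption so that neither shift is by the full width.

```python
MASK32 = (1 << 32) - 1


def rotl_or(x: int, n: int) -> int:
    """Classic rotate-left idiom: combine the halves with OR."""
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl_add(x: int, n: int) -> int:
    """Same idiom with ADD: the halves occupy disjoint bits, so no carry occurs."""
    x &= MASK32
    return ((x << n) + (x >> (32 - n))) & MASK32


if __name__ == "__main__":
    x = 0x80000001
    for n in (1, 8, 31):
        print(n, hex(rotl_or(x, n)), rotl_or(x, n) == rotl_add(x, n))
```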
Akshat Oke	719b029c16	[AMDGPU][NPM] Port SILateBranchLowering to NPM (#130063 )	2025-03-26 19:28:19 +05:30
Jeffrey Byrnes	e5641f6584	[AMDGPU] Autogen checks for mfma-loop.ll (#133004 ) Needed for a RegisterCoalescing patch	2025-03-25 15:24:40 -07:00
Juan Manuel Martinez Caamaño	2f8d699845	[AMDGPU][SelectionDAG] Use COPY instead of S_MOV_B32 to assign values to M0 (#132957 ) This is consistent with what's done on GISel. This allows the register coalescer to remove the redundant intermediate `s_mov_b32` instructions by using `m0` directly as the result register.	2025-03-25 19:05:43 +01:00
Jeffrey Byrnes	25938389c0	[AMDGPU] Autogen checks for agpr-csr.ll (#132959 ) Needed for a RegisterCoalescer patch	2025-03-25 10:28:35 -07:00
LU-JOHN	70aeb89094	Calculate KnownBits from Metadata correctly for vector loads (#128908 ) Calculate KnownBits correctly from metadata for vector loads. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-03-25 22:46:30 +07:00
Akshat Oke	f8e908a0ed	[AMDGPU][NPM] Port SIInsertHardClauses to NPM (#130062 )	2025-03-25 15:33:32 +05:30
Austin Kerbow	e75f586b81	[AMDGPU] Relax lds dma waitcnt with no aliasing pair (#131842 ) If we cannot find any lds DMA instruction that is aliased by some load from lds, we will still insert vmcnt(0). This is overly cautious since handling inter-thread dependences is normally managed by the memory model instead of the waitcnt pass, so this change updates the behavior to be more in line with how other types of memory events are handled.	2025-03-24 10:38:47 -07:00
Jay Foad	02ed65912e	[AMDGPU] 4-align TTMP triples (#132759 ) Follow up to e4284a7c70cd "[AMDGPU] 4-align SGPR triples". Previously TTMP triples like ttmp[3:5] were aligned on a 3-TTMP boundary which has no basis in hardware. Aligning them on a 4-TTMP boundary matches what we do for SGPRs, which reduces the number of extra register classes synthesized by TableGen, bringing the total number down from 653 to 615.	2025-03-24 17:11:39 +00:00
Ana Mihajlovic	cdea46cc8c	[AMDGPU] Add pattern for inverse.ballot.i64 Wave32 (#132770 )	2025-03-24 17:30:02 +01:00
Akshat Oke	f10dc76f03	[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061 )	2025-03-24 21:36:45 +05:30
Juan Manuel Martinez Caamaño	5634e7e2f0	[AMDGCN][SIWholeQuadMode] Rework splitBlock/lowerKillI1/lowerKillF32 to handle case when SI_KILL_I1_TERMINATOR -1 0 is not the unique terminator The lowerKillI1 method wrongly handled cases where it inserted a new S_BRANCH instruction when the kill was not the only terminator, and then tried to split the block. `SI_KILL_I1_TERMINATOR -1,0` doesn't have any effect. Instead of lowering to an unconditional branch, we remove the instruction and insert an unconditional branch only if the instruction is the last terminator. No split is needed in this case (if the last terminator has been reached, then the whole block was processed). Also stop generating an unconditional branch in splitBlock: this branch was redundant since TermMI is promoted to a terminator that fallsthrough to the next block already. Solves SWDEV-508819	2025-03-24 15:57:08 +01:00
Pierre van Houtryve	c457c88951	[GlobalISel] Combine (sext (trunc x)) to (sext_inreg x) (#131622 ) Split from #131312	2025-03-24 09:32:04 +01:00
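The equivalence behind this combine can be sketched in Python for the case where the source and destination have the same width. The helper names and the 8-bit/32-bit widths are assumptions for illustration; the real combine works on GlobalISel generic instructions, not integers.

```python
def mask(bits: int) -> int:
    return (1 << bits) - 1


def trunc(x: int, bits: int) -> int:
    """Keep the low `bits` bits."""
    return x & mask(bits)


def sext(x: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a from_bits-wide value to to_bits."""
    sign = 1 << (from_bits - 1)
    return ((x ^ sign) - sign) & mask(to_bits)


def sext_inreg(x: int, from_bits: int, width: int) -> int:
    """Sign-extend the low from_bits of x in place: shl, then arithmetic shr."""
    shift = width - from_bits
    shifted = (x << shift) & mask(width)
    if shifted & (1 << (width - 1)):   # reinterpret as a signed value
        shifted -= 1 << width
    return (shifted >> shift) & mask(width)  # Python >> is arithmetic on negatives


if __name__ == "__main__":
    x = 0x12345680
    print(hex(sext(trunc(x, 8), 8, 32)), hex(sext_inreg(x, 8, 32)))
```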
Pierre van Houtryve	6e3c24fc0a	[DAG] Combine (sext (sext_in_reg x)) to (sext_in_reg (any_extend x)) (#132386 )	2025-03-24 09:31:02 +01:00
Shoreshen	054e0b41a8	[AMDGPU] Add all type for bitcast on VReg_512 (#131775 ) Add all types pattern for bitcast on VReg_512	2025-03-24 11:52:10 +08:00
Shilei Tian	f1ac2afe21	Reapply "[AMDGPU] Use COV6 by default (#118515 )" (#130963 ) This reverts commit 68bcba6d7a1cc18996c0bcb7c62267c62d2040d0.	2025-03-21 15:26:45 -04:00
Stephen Thomas	2e3fa4ba9e	[AMDGPU] Insert before and after instructions that always use GDS (#131338 ) It is an architectural requirement that there must be no outstanding GDS instructions when an "always GDS" instruction is issued, and also that an always GDS instruction must be allowed to complete. Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to (unconditionally) any always GDS instruction, and an additional S_NOP if the subsequent wait was followed by S_ENDPGM. Always GDS instructions are GWS instructions, DS_ORDERED_COUNT, DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered always GDS as of this patch).	2025-03-21 09:33:04 +00:00
Brox Chen	55d3a55cc1	[AMDGPU][True16][CodeGen] disable true16 on fneg test (#132221 ) This is an NFC change. Revert the failed test case in https://github.com/llvm/llvm-project/pull/131206	2025-03-20 10:32:49 -04:00
Diana Picus	e17b3cdfb3	[AMDGPU] Dynamic VGPR support for llvm.amdgcn.cs.chain (#130094 ) The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may indicate that we want to reallocate the VGPRs before performing the call. A call with the following arguments: ``` llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args, /flags/0x1, %num_vgprs, %fallback_exec, %fallback_callee ``` is supposed to do the following: - copy the SGPR and VGPR args into their respective registers - try to change the VGPR allocation - if the allocation has succeeded, set EXEC to %exec and jump to %callee, otherwise set EXEC to %fallback_exec and jump to %fallback_callee This patch implements the dynamic VGPR behaviour by generating an S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and callee. The rest of the call sequence is left undisturbed (i.e. identical to the case where the flags are 0 and we don't use dynamic VGPRs). We achieve this by introducing some new pseudos (SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason is so that we don't risk other passes (particularly the PostRA scheduler) introducing instructions between the S_ALLOC_VGPR and the jump. Such instructions might end up using VGPRs that have been deallocated, or the wrong EXEC mask. Once the whole backend treats S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use VGPRs, we could in principle move the expansion earlier (but in the absence of a good reason for that my personal preference is to keep it later in order to make debugging easier). Since the expansion happens after register allocation, we're careful to select constants to immediate operands instead of letting ISel generate S_MOVs which could interfere with register allocation (i.e. make it look like we need more registers than we actually do). 
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during ISel in wave64 mode. However, we can define the pseudos for wave64 too so it's easy to handle if future generations support it. --------- Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com> Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-03-20 08:38:04 +01:00
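The control decision described in the commit message above (use `%exec` and `%callee` if the VGPR reallocation succeeds, otherwise the fallback pair) can be modelled as a plain select. This Python sketch is only a behavioural illustration: `ChainTarget` and `select_chain_target` are hypothetical names, and the success flag stands in for the outcome of `S_ALLOC_VGPR`, which the real expansion consumes through `S_CSELECT_B32/64`.

```python
from typing import NamedTuple


class ChainTarget(NamedTuple):
    exec_mask: int
    callee: str


def select_chain_target(alloc_succeeded: bool,
                        exec_mask: int, callee: str,
                        fallback_exec: int, fallback_callee: str) -> ChainTarget:
    """Pick the EXEC mask and jump target based on the allocation result."""
    if alloc_succeeded:
        return ChainTarget(exec_mask, callee)
    return ChainTarget(fallback_exec, fallback_callee)


if __name__ == "__main__":
    print(select_chain_target(True, 0xFFFFFFFF, "callee", 0x1, "fallback"))
    print(select_chain_target(False, 0xFFFFFFFF, "callee", 0x1, "fallback"))
```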
Jeffrey Byrnes	6cc918089a	[AMDGPU] Autogen checks for mfma-no-register-aliasing.ll (#132117 ) For an upcoming RegisterCoalescer PR	2025-03-19 16:02:51 -07:00
Mariusz Sikora	4f5ccf22fa	[AMDGPU] Support image_bvh8_intersect_ray instruction and intrinsic. (#130041 ) Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-03-19 16:08:08 +01:00
Emma Pilkington	3eddb992d0	[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167 ) Fixes SWDEV-514946	2025-03-19 09:31:54 -04:00
Diana Picus	72c3c30452	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055 ) The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).	2025-03-19 13:49:19 +01:00
Matt Arsenault	5b6b4fdb4b	DAG: Fix promote of half freeze (#131844 )	2025-03-19 18:30:34 +07:00
Diana Picus	8a53324aa5	[AMDGPU] Deallocate VGPRs before exiting in dynamic VGPR mode (#130037 ) In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0` before S_ENDPGM, but this may incur some performance cost. Therefore it's better if the compiler proactively generates that instruction. This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a `S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.	2025-03-19 09:00:36 +01:00
Shoreshen	b907920058	[AMDGPU] auto-generate file check line for amdgcn.bitcast.ll (#131955 ) Replace check lines by auto-generated	2025-03-19 15:40:58 +08:00
Mariusz Sikora	575fde0995	[AMDGPU] Add intrinsic and MI for image_bvh_dual_intersect_ray (#130038 ) - Add llvm.amdgcn.image.bvh.dual.intersect.ray intrinsic and image_bvh_dual_intersect_ray machine instruction. - Add llvm_v10i32_ty and llvm_v10f32_ty --------- Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>	2025-03-19 07:35:09 +01:00
Akshat Oke	6cc23faaac	[AMDGPU][NPM] Port AMDGPUMarkLastScratchLoad to NPM (#131738 ) This finishes all passes for the optimized regalloc path. --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-03-19 09:27:05 +05:30
Matt Arsenault	5ac680c5bf	AMDGPU: Add more freeze codegen tests (#131843 )	2025-03-19 10:17:59 +07:00
Matt Arsenault	428e3a27c3	AMDGPU: Fix attributor not handling all trap intrinsics (#131758 )	2025-03-19 10:17:28 +07:00

8529 Commits