This patch fixes:
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:13908:46: error:
comparison of integers of different signs: 'uint32_t' (aka 'unsigned
int') and 'int' [-Werror,-Wsign-compare]
The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.
This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.
Fixes: SWDEV-487672, SWDEV-487669
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which amount to just using a different register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
If the low part of one of the inputs is all 0 bits, the low part cannot
carry and we can just pass through the original value.
Add case: https://alive2.llvm.org/ce/z/TNc7hf
Sub case: https://alive2.llvm.org/ce/z/AjH2-J
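As a concrete illustration in plain C++ (not the backend code; the function name is illustrative only): if the low 32 bits of one 64-bit addend are all zero, the low half cannot produce a carry, so the low half of the result is simply the low half of the other operand and only the high halves need an add.
```
#include <cassert>
#include <cstdint>

// x + k where the low 32 bits of k are known to be zero: the low half of the
// sum is just the low half of x (no carry), only the high halves are added.
uint64_t addHighOnly(uint64_t x, uint32_t kHi) {
  uint32_t lo = static_cast<uint32_t>(x);              // passes through
  uint32_t hi = static_cast<uint32_t>(x >> 32) + kHi;  // high halves add
  return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
  uint64_t x = 0x123456789abcdef0ULL;
  uint64_t k = 0x0000000500000000ULL;                  // low 32 bits are zero
  assert(addHighOnly(x, static_cast<uint32_t>(k >> 32)) == x + k);
}
```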
We could do this in the general case with computeKnownBits,
but add is so common this could be potentially expensive for
something which will fire infrequently.
One potential concern is this could break the 64-bit add
we expect to see for addressing mode matching, but these
constants shouldn't appear often in addressing expressions.
One test for large offset expressions changes but isn't worse.
Fixes https://github.com/ROCm/llvm-project/issues/237
Currently, the AMDGPU backend can handle uniform-sized dynamic allocas.
This patch extends support to divergent-sized dynamic allocas.
When the size argument of a dynamic alloca is divergent,
a wave-wide reduction is performed to get the required stack space.
`@llvm.amdgcn.wave.reduce.umax` is used to perform the
wave reduction.
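A minimal model of the reduction in plain C++ (names are illustrative, standing in for `@llvm.amdgcn.wave.reduce.umax`): the divergent per-lane request is replaced by the maximum across the wave, so every lane bumps the stack pointer by the same uniform amount.
```
#include <algorithm>
#include <cstdint>
#include <vector>

// Model of the wave-wide umax reduction: given each active lane's requested
// size, return the single uniform size all lanes use for the stack bump.
uint32_t waveReduceUMax(const std::vector<uint32_t> &laneSizes) {
  uint32_t uniformSize = 0;
  for (uint32_t size : laneSizes)
    uniformSize = std::max(uniformSize, size);
  return uniformSize; // uniform across the wave, so the SP bump is too
}
```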
Dynamic allocas are not completely supported yet,
as the stack is not properly restored on function exit.
This patch doesn't attempt to address the aforementioned issue.
Note: the compiler already zero-extends or truncates all other types
(of the alloca size argument) to i32.
Currently, the compiler calculates the base address of a
dynamically sized stack object (alloca) as follows:
1. `NewSP = Align(CurrSP + Size)`
_where_ `Size = # of elements * wave size * alloca type`
2. `BaseAddr = NewSP`
3. The alignment is computed as: `AlignedAddr = Addr & ~(Alignment - 1)`
4. Return the `BaseAddr`
This makes sense when the stack grows downwards.
The AMDGPU stack grows upwards, so the base address
needs to be aligned first and the SP bumped by the required size afterwards
(see the sketch after the list):
1. `BaseAddr = Align(CurrSP)`
2. `NewSP = BaseAddr + Size`
3. `AlignedAddr = (Addr + (Alignment - 1)) & ~(Alignment - 1)`
4. Return the `BaseAddr`.
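A minimal sketch of the new scheme in plain C++, assuming `Alignment` is a power of two (struct and function names are illustrative, not the backend code):
```
#include <cstdint>

struct DynAlloca {
  uint64_t BaseAddr; // address handed back for the alloca
  uint64_t NewSP;    // stack pointer after the allocation
};

// Upward-growing stack: align the base first, then bump the SP by the size.
DynAlloca allocateUpward(uint64_t CurrSP, uint64_t Size, uint64_t Alignment) {
  uint64_t BaseAddr = (CurrSP + (Alignment - 1)) & ~(Alignment - 1); // steps 1/3
  uint64_t NewSP = BaseAddr + Size;                                  // step 2
  return {BaseAddr, NewSP};                                          // step 4
}
```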
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
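A short sketch of the resulting style (an illustrative helper, not code from the patch), using only the iteration API described above:
```
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// The two loops are equivalent: the first inspects each use as an SDUse&,
// the second dereferences straight to the user SDNode* (one entry per use).
static bool hasCopyToRegUser(SDNode *N) {
  for (SDUse &U : N->uses())
    if (U.getUser()->getOpcode() == ISD::CopyToReg)
      return true;
  for (SDNode *User : N->users())
    if (User->getOpcode() == ISD::CopyToReg)
      return true;
  return false;
}
```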
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * for the user rather than an SDUse *, so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
Update the check in the FMA combine to check dot10-insts instead of
dot7-insts.
The target of the combine, v_dot2_f32_f16, is available only if
dot10-insts target feature is enabled.
The issue probably dates back to the change that split dot10-insts
out of dot7-insts.
As far as I can see, this does not affect any current targets, but if a
future target had dot7-insts but not dot10-insts, it would cause a
crash ("cannot select") for the input IR in the test.
Materializing a 32-bit non-inline constant for an fmul is relatively
expensive; where possible, it is cheaper to fold the multiply into an
ldexp instruction for specific scenarios (for data types like f64, f32
and f16). That is handled here by a DAG combine for any pair of select
values which are exact powers of 2:
```
fmul x, select(y, A, B) -> ldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B) -> ldexp ((fneg x), select i32 (y, a, b))
where A = 2^a and B = 2^b; a and b are integers.
```
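A concrete instance in plain C++ terms (using std::ldexp rather than the generated instruction; constants chosen for illustration): with A = 8.0 (2^3) and B = 0.5 (2^-3), `fmul x, select(y, 8.0, 0.5)` becomes `ldexp(x, select(y, 3, -3))`.
```
#include <cmath>

// Before: one fmul with a select between two power-of-two constants.
float before(float x, bool y) { return x * (y ? 8.0f : 0.5f); }

// After: the select is rewritten to the integer exponents and the multiply
// becomes an ldexp, so the FP constants never need to be materialized.
float after(float x, bool y) { return std::ldexp(x, y ? 3 : -3); }
```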
This DAG combine is implemented separately in fmulCombine (newly defined in
SIISelLowering), targeting fmul and fusing it with its select operand into
an ldexp.
This fixes #104900.
It is possible that the number of hidden arguments that are selected to
be preloaded in AMDGPULowerKernelArguments and in isel can differ. This
isn't an issue with explicit arguments since isel can lower the argument
correctly either way, but with hidden arguments we may have alignment
issues if we try to load these hidden arguments that were added to the
kernel signature.
The reason for the mismatch is that isel reserves an extra synthetic
user SGPR for module LDS.
Instead of teaching lowerFormalArguments how to handle these properly it
makes more sense and is less expensive to fix the mismatch and assert if
we ever run into this issue again. We should never be trying to lower
these in the normal way.
In a future change we probably want to revise how we track "synthetic"
user SGPRs and unify the handling in GCNUserSGPRUsageInfo. Sometimes
synthetic SGPRs are considered user SGPRs and sometimes they are not.
Until then, this patch resolves the inconsistency, fixes the bug, and is
otherwise an NFC.
Extend the optimization that converts s_barrier to wave_barrier (a nop)
when the number of work items is not larger than the wave size.
This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the path where s_barrier is used on gfx12 (and split later) already
has this optimization, but some front-ends may prefer to emit the split
intrinsics directly; that is what this patch addresses.
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently only
AMDGPU tests.
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).
I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.
The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
Fixes a copy-paste typo.
The typo resulted in producing bad v_perm based operands for the v_dot4
combine. When adding a corresponding byte pair to the v_dot byte pair
chains, we must take note of the byte position in the corresponding
source nodes. These byte positions are used to ensure we extract the
correct DWord from the ultimate source, and formulate a correct
perm_mask from the extracted DWord.
With the typo, the S0 byte would use the DWord offset of the
corresponding S1 byte. If this offset was not the same as the true DWord
offset for the S0 byte, we would extract and use the wrong byte for S0
in the v_dot.
Fixes https://github.com/llvm/llvm-project/issues/112941
Use a local pointer type to represent the named barrier in the builtin and
intrinsic. This makes the definitions more user-friendly
because users do not need to worry about the hardware ID assignment. This
approach is also more like other popular GPU programming languages.
Named barriers should be represented as global variables of addrspace(3)
in LLVM IR. The compiler assigns special LDS offsets to those variables
during the AMDGPULowerModuleLDS pass. Those addresses are converted to HW
barrier IDs during instruction selection. The rest of the
instruction-selection changes are primarily due to the
intrinsic-definition changes.
64-bit flat cmpxchg instructions do not work correctly for scratch
addresses, and need to be expanded as non-atomic.
Allow custom expansion of cmpxchg in AtomicExpand, as is
already the case for atomicrmw.
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.
Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
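A hedged sketch of the inserted control flow, in plain C++ rather than the actual AtomicExpand output; `AddrIsScratch` stands in for the runtime address-space check and the function name is illustrative.
```
#include <atomic>
#include <cstdint>

// Expanded form of a 64-bit flat cmpxchg: if the flat pointer actually refers
// to scratch (private) memory, do the compare-exchange non-atomically;
// otherwise take the real atomic path.
uint64_t flatCmpXchg64(uint64_t *Ptr, uint64_t Expected, uint64_t Desired,
                       bool AddrIsScratch /* runtime address-space check */) {
  if (AddrIsScratch) {               // quite likely uniform across the wave
    uint64_t Old = *Ptr;             // non-atomic expansion
    if (Old == Expected)
      *Ptr = Desired;
    return Old;
  }
  std::atomic_ref<uint64_t> A(*Ptr); // real atomic cmpxchg (C++20 atomic_ref)
  A.compare_exchange_strong(Expected, Desired);
  return Expected;                   // holds the old value after the call
}
```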
The int_amdgcn_mov_dpp8 intrinsic is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp, we need it to be able to split unsupported values.