llvm-project

Author	SHA1	Message	Date
Craig Topper	f139bde8d8	[SelectionDAG] Move SDNode::use_iterator::getOperandNo to SDUse. (#120536 ) This allows us to write more range based for loops because we no longer need the iterator. It also matches IR's Use class.	2024-12-19 09:07:42 -08:00
Craig Topper	e6b2495545	[SelectionDAG] Split SDNode::use_iterator into user_iterator and use_iterator. (#120531 ) SDNode::use_iterator now returns an SDUse& when dereferenced. SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses work on use_iterator. SDNode::user_begin/user_end/users work on user_iterator. We can now write range based for loops using SDUse& and SDNode::uses(). I've converted many of these in this patch. I didn't update loops that have additional variables updated in their for statement. Some loops use SDNode::use_iterator::getOperandNo() which also prevents using range based for loops. I plan to move this into SDUse in a follow up patch.	2024-12-19 08:35:32 -08:00
Craig Topper	bd261ecc5a	[SelectionDAG] Add SDNode::user_begin() and use it in some places (#120509 ) Most of these are just places that want the first user and aren't iterating over the whole list. While there I changed some use_size() == 1 to hasOneUse() which is more efficient. This is part of an effort to rename use_iterator to user_iterator and provide a use_iterator that dereferences to SDUse&. This patch helps reduce the diff on later patches.	2024-12-18 22:13:04 -08:00
Craig Topper	104ad9258a	[SelectionDAG] Rename SDNode::uses() to users(). (#120499 ) This function is most often used in range based loops or algorithms where the iterator is implicitly dereferenced. The dereference returns an SDNode * of the user rather than SDUse * so users() is a better name. I've long been annoyed that we can't write a range based loop over SDUse when we need getOperandNo. I plan to rename use_iterator to user_iterator and add a use_iterator that returns SDUse& on dereference. This will make it more like IR.	2024-12-18 20:09:33 -08:00
Aaditya	0ae75eba67	[AMDGPU] Assert if stack grows downwards. (#119888 )	2024-12-14 17:44:40 +05:30
Piotr Sobczak	a2d086af2c	[AMDGPU] Fix FMA combine (#119217 ) Update the check in the FMA combine to check dot10-insts instead of dot7-insts. The target of the combine, v_dot2_f32_f16, is available only if dot10-insts target feature is enabled. The issue probably dates back to the change that split out dot10-insts out of dot7-insts. As far as I can see, this does not affect any current targets, but if a future target has dot7-insts, but not dot10-insts that would cause a crash ("cannot select") for the input ir in the test.	2024-12-10 10:11:19 +01:00
Vikash Gupta	0b0d9a3bee	[CodeGen] [AMDGPU] Attempt DAGCombine for fmul with select to ldexp (#111109 ) Materializing a 32-bit non-inline constant for an fmul is relatively expensive, so where possible the fmul is combined into an ldexp instruction (for f64, f32 and f16). The DAG combine applies to any pair of select values that are exact powers of 2. ``` fmul x, select(y, A, B) -> ldexp (x, select i32 (y, a, b)) fmul x, select(y, -A, -B) -> ldexp ((fneg x), select i32 (y, a, b)) where, A=2^a & B=2^b ; a and b are integers. ``` This DAG combine is handled separately in fmulCombine (newly defined in SIISelLowering), which targets an fmul with a select operand and fuses it into ldexp. Thus, it fixes #104900.	2024-12-09 12:52:04 +05:30
Austin Kerbow	b1d42465fc	[AMDGPU] Fix hidden kernarg preload count inconsistency (#116759 ) It is possible that the number of hidden arguments that are selected to be preloaded in AMDGPULowerKernelArguments and isel can differ. This isn't an issue with explicit arguments since isel can lower the argument correctly either way, but with hidden arguments we may have alignment issues if we try to load these hidden arguments that were added to the kernel signature. The reason for the mismatch is that isel reserves an extra synthetic user SGPR for module LDS. Instead of teaching lowerFormalArguments how to handle these properly it makes more sense and is less expensive to fix the mismatch and assert if we ever run into this issue again. We should never be trying to lower these in the normal way. In a future change we probably want to revise how we track "synthetic" user SGPRs and unify the handling in GCNUserSGPRUsageInfo. Sometimes synthetic SGPRs are considered user SGPRs and sometimes they are not. Until then this patch resolves the inconsistency, fixes the bug, and is otherwise NFC.	2024-12-08 10:10:08 -08:00
Jakub Chlanda	edbebda454	[AMDGPU] Assert previous SGPR exists when bundling preloaded args (#118802 ) This came up from a downstream static analysis tool.	2024-12-06 09:15:22 +01:00
Matt Arsenault	15676ec552	AMDGPU: Add support for V_CVT_PK_F16_F32 instruction for gfx950 (#118300 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-12-02 16:04:24 -05:00
Matt Arsenault	7221bc74bc	AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738 )	2024-11-26 14:51:05 -05:00
Matt Arsenault	f5e92eb04b	AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737 )	2024-11-26 14:47:52 -05:00
Matt Arsenault	e57b327be2	AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634 ) Select to minimum3/maximum3. Leave f16/v2f16 for later since it's complicated by only having the vector version.	2024-11-26 14:44:09 -05:00
Piotr Sobczak	a96ec01e1a	[AMDGPU] Optimize out s_barrier_signal/_wait (#116993 ) Extend the optimization that converts s_barrier to wave_barrier (nop) when the number of work items is not larger than wave size. This handles the "split barrier" form of s_barrier where the barrier is represented by separate intrinsics (s_barrier_signal/s_barrier_wait). Note: the version where s_barrier is used in gfx12 (and later split) has the optimization already, but some front-ends may prefer to use split intrinsics and this is being addressed by the patch.	2024-11-26 10:04:32 +01:00
Craig Topper	bc282605df	[SelectionDAG] Require last operand of (STRICT_)FP_ROUND to be a TargetConstant. (#117639 ) Fix all the places I could find that didn't do this. We were already mostly correct for FP_ROUND after 9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.	2024-11-25 21:36:33 -08:00
Matt Arsenault	e97fb2207e	AMDGPU: Add support for load transpose instructions for gfx950 (#117378 ) This patch adds support for intrinsics in clang, as well as assembly instructions in the backend. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 09:39:04 -08:00
Nikita Popov	3317c9ceac	[AMDGPU] Use getSignedConstant() where necessary (#117328 ) Create signed constant using getSignedConstant(), to avoid future assertion failures when we disable implicit truncation in getConstant(). This also touches some generic legalization code, which apparently only AMDGPU tests.	2024-11-25 09:49:34 +01:00
Matt Arsenault	1944d192bd	AMDGPU: Use isWave[32|64] instead of comparing size value (#117411 )	2024-11-23 09:30:57 -08:00
Matt Arsenault	bd8a953e9b	AMDGPU: Fix mfma scale source legalization (#117238 ) Code inside assert changes the variable instead of the comparison. Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-11-21 15:30:01 -08:00
Matt Arsenault	01c9a14ccf	AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723 ) These use a new VOP3PX encoding for the v_mfma_scale_* instructions, which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers are supported yet (op_sel, neg or clamp). I'm not sure the intrinsic should really expose op_sel (or any of the others). If I'm reading the documentation correctly, we should be able to just have the raw scale operands and auto-match op_sel to byte extract patterns. The op_sel syntax also seems extra horrible in this usage, especially with the usual assumed op_sel_hi=-1 behavior.	2024-11-21 08:51:58 -08:00
Jay Foad	ade0750e35	[AMDGPU] Fix some cache policy checks for GFX12+ (#116396 ) Fix coding errors found by inspection and check that the swz bit still serves to prevent merging of buffer loads/stores on GFX12+.	2024-11-21 08:22:59 +00:00
Matt Arsenault	927032807d	AMDGPU: Handle gfx950 96/128-bit buffer_load_lds (#116681 ) Enforcing this limit in the clang builtin will come later.	2024-11-18 22:01:56 -08:00
Matt Arsenault	50224bd5ba	AMDGPU: Handle gfx950 global_load_lds_* instructions (#116680 ) Define global_load_lds_dwordx3 and global_load_dwordx4. Oddly it seems dwordx2 was skipped.	2024-11-18 21:58:02 -08:00
Matt Arsenault	738bdd4969	AMDGPU: Add V_CVT_PK_BF16_F32 for gfx950 (#116678 )	2024-11-18 21:50:54 -08:00
Sergei Barannikov	baf59be89b	[SelectionDAG] Fix return types of TC_RETURN for several targets (#116504 ) TC_RETURN nodes do not have a glue result.	2024-11-17 02:14:05 +03:00
Kazu Hirata	be187369a0	[AMDGPU] Remove unused includes (NFC) (#116154 ) Identified with misc-include-cleaner.	2024-11-13 21:10:03 -08:00
Kazu Hirata	4048c64306	[llvm] Remove redundant control flow statements (NFC) (#115831 ) Identified with readability-redundant-control-flow.	2024-11-12 10:09:42 -08:00
Shilei Tian	6548b6354d	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.	2024-11-08 20:21:16 -05:00
Shilei Tian	ca33649abe	Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both hip and openmp buildbots.	2024-11-08 16:36:35 -05:00
Shilei Tian	e215a1e27d	[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )	2024-11-08 13:05:35 -05:00
Jeffrey Byrnes	ae6dbed594	[AMDGPU] Use correct DWord for v_dot4 S0 operand (#115224 ) Fixes a copy-paste typo. The typo resulted in producing bad v_perm based operands for the v_dot4 combine. When adding a corresponding byte pair to the v_dot byte pair chains, we must take note of the byte position in the corresponding source nodes. These byte positions are used to ensure we extract the correct DWord from the ultimate source, and formulate a correct perm_mask from the extracted DWord. With the typo, the S0 byte would use the DWord offset for the corresponding S1 byte. If this offset was not the same as the true DWord offset for the S0 byte, we would extract and use the wrong byte for S0 in the v_dot. Fixes https://github.com/llvm/llvm-project/issues/112941	2024-11-06 20:48:20 -08:00
Gang Chen	8c752900dd	[AMDGPU] modify named barrier builtins and intrinsics (#114550 ) Use a local pointer type to represent the named barrier in builtin and intrinsic. This makes the definitions more user friendly because they do not need to worry about the hardware ID assignment. This approach is also more like other popular GPU programming languages. Named barriers should be represented as global variables of addrspace(3) in LLVM-IR. The compiler assigns the special LDS offsets for those variables during the AMDGPULowerModuleLDS pass. Those addresses are converted to hw barrier ID during instruction selection. The rest of the instruction-selection changes are primarily due to the intrinsic-definition changes.	2024-11-06 10:37:22 -08:00
Stanislav Mekhanoshin	6d7e51de5e	[AMDGPU] Extend type support for update_dpp intrinsic (#114597 ) We can split 64-bit DPP as a post-RA pseudo if control values are supported, but cannot handle other types.	2024-11-05 13:59:14 -08:00
Matt Arsenault	30dd1297fa	AMDGPU: Custom expand flat cmpxchg which may access private (#109410 ) 64-bit flat cmpxchg instructions do not work correctly for scratch addresses, and need to be expanded as non-atomic. Allow custom expansion of cmpxchg in AtomicExpand, as is already the case for atomicrmw.	2024-11-04 09:29:38 -08:00
Shilei Tian	11df0ce140	[NFC][AMDGPU] Use structured binding to replace explicit use of std::pair	2024-11-02 15:11:55 -04:00
Matt Arsenault	1d0370872f	AMDGPU: Expand flat atomics that may access private memory (#109407 ) If the runtime flat address resolves to a scratch address, 64-bit atomics do not work correctly. Insert a runtime address space check (which is quite likely to be uniform) and select between the non-atomic and real atomic cases. Consider noalias.addrspace metadata and avoid this expansion when possible (we also need to consider it to avoid infinitely expanding after adding the predication code).	2024-10-31 08:08:48 -07:00
Stanislav Mekhanoshin	7cd29741fa	[AMDGPU] Extend mov_dpp8 intrinsic lowering for generic types (#114296 ) The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32. To allow a corresponding builtin to be overloaded the same way as int_amdgcn_mov_dpp we need it to be able to split unsupported values.	2024-10-31 01:15:25 -07:00
Jay Foad	6bf4476ffb	[AMDGPU] Fix @llvm.amdgcn.cs.chain with callee not provably uniform (#114200 ) The correct behavior is to insert a readfirstlane. This worked except for an inappropriate assertion in SITargetLowering::LowerCall.	2024-10-30 16:18:29 +00:00
Jay Foad	8ee5e19c87	[AMDGPU] Fix @llvm.amdgcn.cs.chain with SGPR args not provably uniform (#114232 ) The correct behaviour is to insert a readfirstlane. SelectionDAG was already doing this in some cases, but not in the general case for chain calls. GlobalISel was already doing this for return values but not for arguments.	2024-10-30 16:12:37 +00:00
Shilei Tian	4cf128512b	[NFC][AMDGPU] Use C++17 structured bindings as much as possible (#113939 ) This only changes `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`. There are five uses of `std::tie` remaining because they can't be replaced with C++17 structured bindings.	2024-10-28 13:55:57 -04:00
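The commit message does not say why the five remaining `std::tie` uses could not be converted. One common reason, shown here only as an illustrative assumption, is that structured bindings always declare new variables, whereas `std::tie` assigns into variables that already exist. The helper `splitHiLo` and the two `sum...` functions are hypothetical and unrelated to the LLVM sources.

```cpp
#include <tuple>
#include <utility>

// Hypothetical helper standing in for any function that returns a pair.
std::pair<int, int> splitHiLo(int value) { return {value / 100, value % 100}; }

// Structured bindings: hi and lo are declared and initialized in one statement.
int sumWithStructuredBinding(int value) {
  auto [hi, lo] = splitHiLo(value);
  return hi + lo;
}

// std::tie: hi and lo already exist and are only conditionally overwritten,
// which a structured binding cannot express because it must declare new names.
int sumWithTie(int value) {
  int hi = 0;
  int lo = 0;
  if (value >= 0)
    std::tie(hi, lo) = splitHiLo(value);
  return hi + lo;
}
```

For example, `sumWithStructuredBinding(1234)` and `sumWithTie(1234)` both add 12 and 34, while `sumWithTie(-5)` keeps the defaults and returns 0.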
Shilei Tian	c3fe0e46e2	[NFC][AMDGPU] clang-format `llvm/lib/Target/AMDGPU/SIISelLowering.cpp` (#112645 )	2024-10-21 16:42:25 -07:00
Rahul Joshi	6924fc0326	[LLVM] Add `Intrinsic::getDeclarationIfExists` (#112428 ) Add `Intrinsic::getDeclarationIfExists` to lookup an existing declaration of an intrinsic in a `Module`.	2024-10-16 07:21:10 -07:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jay Foad	62b3a4bc70	[AMDGPU] Improve codegen for s_barrier_init (#111866 )	2024-10-10 19:40:02 +01:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Shilei Tian	88a239d292	[AMDGPU] Adopt new lowering sequence for `fdiv16` (#109295 ) The current lowering of `fdiv16` can generate an incorrectly rounded result in some cases. The new sequence was provided by the HW team, as shown below written in C++. ``` half fdiv(half a, half b) { float a32 = float(a); float b32 = float(b); float r32 = 1.0f / b32; float q32 = a32 * r32; float e32 = -b32 * q32 + a32; q32 = e32 * r32 + q32; e32 = -b32 * q32 + a32; float tmp = e32 * r32; uint32_t tmp32 = std::bit_cast<uint32_t>(tmp); tmp32 = tmp32 & 0xff800000; tmp = std::bit_cast<float>(tmp32); q32 = tmp + q32; half q16 = half(q32); q16 = div_fixup_f16(q16); return q16; } ``` Fixes SWDEV-477608.	2024-10-08 09:49:20 -04:00
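The refinement steps of the `fdiv16` sequence above can be tried on a host in single precision. This is a float-only sketch under stated assumptions: the `half` conversions and `div_fixup_f16` are left out because standard C++ has no portable equivalents, `std::memcpy` replaces `std::bit_cast` so the code builds before C++20, and the function name `fdivRefineSketch` is hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Float-only sketch of the quotient refinement; not the actual lowering.
float fdivRefineSketch(float a32, float b32) {
  float r32 = 1.0f / b32;        // reciprocal estimate
  float q32 = a32 * r32;         // initial quotient
  float e32 = -b32 * q32 + a32;  // residual of the initial quotient
  q32 = e32 * r32 + q32;         // first correction
  e32 = -b32 * q32 + a32;        // residual of the corrected quotient
  float tmp = e32 * r32;         // second correction term

  // Keep only the sign and exponent bits of the correction term
  // (mask 0xff800000 clears the 23 mantissa bits).
  std::uint32_t tmp32;
  std::memcpy(&tmp32, &tmp, sizeof tmp32);
  tmp32 &= 0xff800000u;
  std::memcpy(&tmp, &tmp32, sizeof tmp);

  q32 = tmp + q32;
  return q32;
}
```

Dividing 8 by 2 yields exactly 4 because every residual is zero, and dividing 1 by 3 stays within a small relative error of the native float quotient.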
Austin Kerbow	c4d89203f3	[AMDGPU] Support preloading hidden kernel arguments (#98861 ) Adds hidden kernel arguments to the function signature and marks them inreg if they should be preloaded into user SGPRs. The normal kernarg preloading logic then takes over with some additional checks for the correct implicitarg_ptr alignment. Special care is needed so that metadata for the hidden arguments is not added twice when generating the code object.	2024-10-06 17:44:33 -07:00
Matt Arsenault	428ae0f12e	AMDGPU: Do not tail call if an inreg argument requires waterfalling (#111002 ) If we have a divergent value passed to an outgoing inreg argument, the call needs to be executed in a waterfall loop and thus cannot be tail called. The waterfall handling of arbitrary calls is broken on the selectiondag path, so some of these cases still hit an error later. I also noticed the argument evaluation code in isEligibleForTailCallOptimization is not correctly accounting for implicit argument assignments. It also seems inreg codegen is generally broken; we are assigning arguments to the reserved private resource descriptor.	2024-10-04 00:04:02 +04:00
Matt Arsenault	c08d7b3de7	AMDGPU: Fix verifier error on tail call target in vgprs (#110984 ) We allow tail calls of known uniform function pointers. This would produce a verifier error if the uniform value is in VGPRs. Insert readfirstlanes just in case this occurs, which will fold out later if it is unnecessary. GlobalISel should need a similar fix, but it currently does not attempt tail calls of indirect calls. Fixes #107447 Fixes subissue of #110930	2024-10-03 21:50:56 +04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00

1482 Commits