llvm-project

Author	SHA1	Message	Date
Ana Mihajlovic	8c7550132f	[AMDGPU] Unused sdst writing to null (#133229 ) Unused sdst writing to null to avoid a false VALU->SALU dependency stall. This requires using the VOP3 encoding.	2025-03-28 18:12:34 +01:00
Brox Chen	8bc0f879a0	[AMDGPU][True16][CodeGen] D16 LDS load/store pseudo instructions in true16 (#131427 ) Implement new pseudos with the suffix _t16 which have VGPR_16 as the store src or load dst. This affects LDS 8 and 16-bit loads and stores. Lower the pseudos to the existing real Hi/Lo instructions in MC inst layer with VGPR_32 src or dst --------- Co-authored-by: Abhinav <abhinav.garg@amd.com>	2025-03-17 10:28:45 -04:00
Ana Mihajlovic	459b4e3fe1	Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" (#131111 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-13 10:26:20 +01:00
Kazu Hirata	aa008e0008	Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 )" This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d. Multiple buildbot failures have been reported: https://github.com/llvm/llvm-project/pull/127212	2025-03-12 12:09:09 -07:00
Ana Mihajlovic	71582c6667	[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212 ) We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from it). When VALU is issued, it increments internal counter VA_SDST used to track use of this SGPR. SALU will not issue until VA_SDST is zero, that is when VALU is finished writing. Therefore, delays added by s_delay_alu are not needed in this situation.	2025-03-12 09:33:07 -07:00
Matt Arsenault	d246cc618a	PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111 ) Given the rest of the pass just gives up when it needs to compose subregisters, folding a subregister extract directly into a reg_sequence is counterproductive. Later fold attempts in the function will give up on the subregister operand, preventing looking up through the reg_sequence. It may still be profitable to do these folds if we start handling the composes. There are some test regressions, but this mostly looks better.	2025-01-30 20:42:02 +07:00
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU ay any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Sergei Barannikov	1941f34172	[TableGen][GISel] Import more "multi-level" patterns (#120332 ) Previously, if the destination DAG has an untyped leaf, we would import the pattern only if that leaf is defined by the top-level source DAG. This is an unnecessary restriction. Here is an example of such pattern: ``` def : Pat<(add (mul v8i16:$vA, v8i16:$vB), v8i16:$vC), (VMLADDUHM $vA, $vB, $vC)>; ``` Previously, it failed to import because `add` doesn't define neither `$vA` nor `$vB`. This change reduces the number of skipped patterns as follows: ``` AArch64: 8695 -> 8548 (-147) AMDGPU: 11333 -> 11240 (-93) ARM: 4297 -> 4278 (-1) PowerPC: 3955 -> 3010 (-945) ``` Other GISel-enabled targets are unaffected.	2024-12-18 14:44:55 +03:00
Jay Foad	4e70720139	[AMDGPU] Add some gfx1200 test coverage	2024-06-27 14:53:59 +01:00
David Stuttard	1fb1fcf343	Add mad support for v_pk_* 16 bit integer (#95104 )	2024-06-13 16:46:40 +01:00
Pierre van Houtryve	4b1910b11d	[GlobalISel][AMDGPU] Import patterns with multiple defs (#84171 ) Fixes #63216	2024-03-08 09:39:10 +01:00
Nicolai Hähnle	49b492048a	AMDGPU: Fix packed 16-bit inline constants (#76522 ) Consistently treat packed 16-bit operands as 32-bit values, because that's really what they are. The attempt to treat them differently was ultimately incorrect and lead to miscompiles, e.g. when using non-splat constants such as (1, 0) as operands. Recognize 32-bit float constants for i/u16 instructions. This is a bit odd conceptually, but it matches HW behavior and SP3. Remove isFoldableLiteralV216; there was too much magic in the dependency between it and its use in SIFoldOperands. Instead, we now simply rely on checking whether a constant is an inline constant, and trying a bunch of permutations of the low and high halves. This is more obviously correct and leads to some new cases where inline constants are used as shown by tests. Move the logic for switching packed add vs. sub into SIFoldOperands. This has two benefits: all logic that optimizes for inline constants in packed math is now in one place; and it applies to both SelectionDAG and GISel paths. Disable the use of opsel with v_dot* instructions on gfx11. They are documented to ignore opsel on src0 and src1. It may be interesting to re-enable to use of opsel on src2 as a future optimization. A similar "proper" fix of what inline constants mean could potentially be applied to unpacked 16-bit ops. However, it's less clear what the benefit would be, and there are surely places where we'd have to carefully audit whether values are properly sign- or zero-extended. It is best to keep such a change separate. Fixes: Corruption in FSR 2.0 (latent bug exposed by an LLPC change)	2024-01-04 00:10:15 +01:00
Pierre van Houtryve	dd32d26a37	[AMDGPU] Form V_MAD_U64_U32 from mul24 (#72393 ) Fixes SWDEV-421067	2023-12-11 11:38:27 +01:00
Jay Foad	e38c29c2b7	[AMDGPU] Add GFX11 test coverage to integer-mad-patterns.ll	2023-12-08 13:06:03 +00:00
Simon Pilgrim	e9caa37e9c	[DAG] Move lshr narrowing from visitANDLike to SimplifyDemandedBits Inspired by some of the cases from D145468 Let SimplifyDemandedBits handle the narrowing of lshr to half-width if we don't require the upper bits, the narrowed shift is profitable and the zext/trunc are free. A future patch will propose the equivalent shl narrowing combine. Differential Revision: https://reviews.llvm.org/D146121	2023-07-17 15:50:09 +01:00
Jay Foad	f7684d8510	[DAG] Use legal shift amount type in DAGTypeLegalizer::JoinIntegers Documentation for TargetLowering::getShiftAmountTy says that LegalTypes should generally be true during type legalization, so this patch does that. On AMDGPU the effect is that we use i32 (a sane type) instead of i64 (pointer sized type) for more shift amounts, which in turn allows more formation of rotates and funnel shifts pre-legalization. Differential Revision: https://reviews.llvm.org/D154960	2023-07-12 08:12:09 +01:00
Amara Emerson	3a80bdb316	[GlobalISel] Remove an erroneous oneuse check in the G_ADD reassociation combine. This check was unnecessary/incorrect, it was already being done by the target hook default implementation, and the one in the matcher was checking for a completely different thing. This change: 1) Removes the check and updates affected tests which now do some more reassociations. 2) Modifies the AMDGPU hooks which were stubbed with "return true" to also do the oneuse check. Not sure why I didn't do this the first time.	2023-07-10 01:03:12 -07:00
Jay Foad	f2c164c815	[AMDGPU] Do not wait for vscnt on function entry and return SIInsertWaitcnts inserts waitcnt instructions to resolve data dependencies. The GFX10+ vscnt (VMEM store count) counter is never used in this way. It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer. Hence there is no need to conservatively wait for vscnt to be 0 on function entry and before returns. Differential Revision: https://reviews.llvm.org/D153537	2023-07-04 12:22:38 +01:00
Amaury Séchet	a70d5e25f3	[DAGCombine] Make sure combined nodes are added back to the worklist in topological order. Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D127115	2023-06-13 09:14:37 +00:00
Matt Arsenault	4469aff148	AMDGPU: Add baseline tests for integer mad matching Test some clpeak-like patterns with multiple use muls.	2023-06-09 19:17:56 -04:00

21 Commits