llvm-project

Author	SHA1	Message	Date
Jeffrey Byrnes	acb7859f07	[MachineSink] Extend loop sinking capability (#117247 ) The current MIR cycle sinking capabilities are rather limited. It only supports sinking copies into a single successor block while obeying limits. This opt-in feature adds a more aggressive option that is not limited to the above concerns. The feature will try to "sink" by duplicating any top-level preheader instruction (that we are sure is safe to sink) into any user block, then does some dead code cleanup. In particular, this is useful for high RP situations when loop bodies have control flow.	2025-01-23 17:08:23 -08:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
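The worked example in the commit above can be checked with a small brute-force sketch. This is not the constant-time algorithm the commit implements; it tries every workgroup size in the range. The constants come from the example (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS usage). The function names are hypothetical, and two things are assumptions: waves per EU are rounded up when they do not divide evenly, and no other limit on groups per CU applies.

```python
# Brute-force check of the occupancy example; NOT the constant-time
# algorithm from the commit. All names here are hypothetical.
WAVE_SIZE = 64         # work-items per wave (from the example)
EUS_PER_CU = 4         # execution units per compute unit (from the example)
MAX_WAVES_PER_EU = 10  # wave slots per EU (from the example)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up, for positive b."""
    return -(-a // b)


def occupancy_for_group_size(group_size: int) -> int:
    """Waves per EU when the CU is filled with whole groups of this size."""
    waves_per_group = ceil_div(group_size, WAVE_SIZE)
    wave_slots = EUS_PER_CU * MAX_WAVES_PER_EU
    # Only complete groups can be resident on the CU.
    groups_per_cu = wave_slots // waves_per_group
    waves_on_cu = groups_per_cu * waves_per_group
    # Assumption: round up when waves do not divide evenly across EUs.
    return ceil_div(waves_on_cu, EUS_PER_CU)


def occupancy_range(min_size: int, max_size: int) -> tuple[int, int]:
    """(min, max) occupancy over every workgroup size in [min_size, max_size]."""
    values = [occupancy_for_group_size(s) for s in range(min_size, max_size + 1)]
    return min(values), max(values)


if __name__ == "__main__":
    for size in (513, 640, 896, 1024):
        print(size, occupancy_for_group_size(size))
    print(occupancy_range(513, 1024))
```

Under these assumptions the bounds 513 and 1024 give 9 and 8, while the interior sizes 640 and 896 give 10 and 7, matching the [7,10] range stated in the commit message.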
Nico Weber	99d450e9f5	Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 )" This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab. Breaks check-llvm, see https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953	2025-01-23 09:19:42 -05:00
Matt Arsenault	e28e93550a	AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684 ) For VALU shuffles, this saves an instruction in some case.	2025-01-23 20:58:02 +07:00
Kareem Ergawy	ff55c9bc63	[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089 ) Fixes #123800 Extends LDS lowering by allowing it to discover transitive indirect/escaping references to LDS GVs. For example, given the following input: ```llvm @lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8 %store_type = type { i32, ptr } @place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8 define amdgpu_kernel void @offloading_kernel() { store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8 call void @call_unknown() ret void } define void @call_unknown() { %1 = alloca ptr, align 8 %2 = call i32 %1() ret void } define void @indirectly_load_lds() { call void @directly_load_lds() ret void } define void @directly_load_lds() { %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8 ret void } ``` With the above input, prior to this patch, LDS lowering failed to lower the reference to `@lds_item_to_indirectly_load` because: 1. it is indirectly called by a function whose address is taken in the kernel. 2. we did not check if the kernel indirectly makes any calls to unknown functions (we only checked the direct calls). Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>	2025-01-23 14:53:11 +01:00
Frederik Harwath	6fdaaafd89	[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942 ) This is meant as a short-term workaround for an invalid conversion in this pass that occurs because existing SDWA selections are not correctly taken into account during the conversion. See the draft PR #123221 for an attempt to fix the actual issue. --------- Co-authored-by: Frederik Harwath <fharwath@amd.com>	2025-01-23 14:32:01 +01:00
Matt Arsenault	93d35ad5f5	AMDGPU: Delete FillMFMAShadowMutation (#123861 ) No test changes with this removed and it appears to be obsolete.	2025-01-22 22:41:25 +07:00
Akshat Oke	a343b8e595	[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695 )	2025-01-22 14:54:01 +05:30
TiborGY	3630d9ef65	[PartiallyInlineLibCalls] Add infrastructure for emitting optimization remarks from PartiallyInlineLibCalls (#122654 ) I am planning to add some optimization remarks to the `PartiallyInlineLibCalls` pass. However, since this pass does not emit any optimization remarks yet, I have to add the "infrastructure" for that first, which is what this PR is about.	2025-01-22 13:15:40 +07:00
Shoreshen	7c58d6363a	[AMDGPU] Add commute for some VOP3 inst (#121326 ) add commute for some VOP3 inst, allow commute for both inline constant operand, adjust tests Fixes #111205	2025-01-22 11:08:26 +07:00
Shoreshen	e8811ad3cc	[AMDGPU] Fix unreachable reg bit width (#122107 ) Add register class bit width for SReg_256_XNULL and SReg_128_XNULL	2025-01-22 10:05:47 +07:00
Matt Arsenault	5e79ae60a6	DAG: Fix vector_shuffle -> splat fold defining undef lanes (#123596 ) For shuffle vector splats with undef lanes in the mask, this was introducing real values. Filter out build_vector results based on the undef elements in the mask. This avoids AMDGPU test regressions in a future change. test/CodeGen/X86/urem-seteq-illegal-types.ll looks worse but I didn't investigate.	2025-01-21 23:55:50 +07:00
Brox Chen	70632f9566	[AMDGPU][True16][MC] true16 for v_cmp_xx_f16 (#122943 ) A bulk commit of true16 support for v_cmp_xx_f16 instructions including: v_cmp_f_f16 v_cmp_eq_f16 v_cmp_le_f16 v_cmp_gt_f16 v_cmp_lg_f16 v_cmp_ge_f16 v_cmp_o_f16 v_cmp_u_f16 v_cmp_nge_f16 v_cmp_nlg_f16 v_cmp_ngt_f16 v_cmp_nle_f16 v_cmp_neq_f16 v_cmp_nlt_f16 v_cmp_t_f16 Added a GFX12 runline for fcmp.f16	2025-01-21 10:06:22 -05:00
Chinmay Deshpande	9ca1323de1	[AMDGPU] Fix crash due to missing check for FLAT instructions that don't use vector registers when computing VALU hazard (#123627 )	2025-01-21 05:50:58 -08:00
lialan	5d9c717597	[GISel] Fold shifts to constant result. (#123510 ) This resolves #123212	2025-01-21 05:10:45 -08:00
Janek van Oirschot	82944595fa	[AMDGPU] Change scope of resource usage info symbols (#114810 ) Change scope of resource usage info MC symbols to align with the function linkage type	2025-01-21 13:10:06 +00:00
Akshat Oke	9b6e8df896	[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592 ) Extends NPM pipeline support till PostRegAlloc passes (greedy is in the works)	2025-01-21 15:27:46 +05:30
David Stuttard	ebc5020564	[AMDGPU] Update entry point name for PAL metadata (#123581 ) Old entry-point metadata being updated. Nothing is required to account for deprecation as nothing uses the old style	2025-01-21 09:37:22 +00:00
Matt Arsenault	585858aeb6	AMDGPU: Fix asm constraints in new shuffle tests These passed prechecks but failed after cc5eba1737146a727a61b5dbe16d8c2ac453981e	2025-01-21 10:49:42 +07:00
Matt Arsenault	7786266dc7	AMDGPU: Expand shuffle testing with generated tests (#123574 ) Add some generated tests with every shuffle permutation for relevant vector element types and sizes. Not sure if this is going overboard with the number of tests. I pruned out the largest cases (16 and 32-bit cases are impractically large), and there's redundancy when testing the pointer cases (at least for SelectionDAG). This uses inline assembly to produce sample values because of how the ABI is lowered when using a function argument. Since we break all arguments into 32-bit pieces, a shuffle never ends up forming. We need separate handling to reconstruct shuffles in contexts involving physical registers in ABI contexts. I wrote a small tool to generate these, so I can easily change the exact test body. Not sure if it's worth posting anywhere. This is in preparation for making better use of v_pk_mov_b32, v_mov_b64 and s_mov_b64 in shuffles.	2025-01-21 10:08:42 +07:00
Krzysztof Drewniak	697c1883f1	Reapply "[AMDGPU] Handle natively unsupported types in addrspace(7) lowering" (#123660 ) (#123657) This reverts commit 64749fb01538fba2b56d9850497d5f3a626cabc2. Adds a constructor to VecSlice to address the failure	2025-01-20 16:12:17 -06:00
Krzysztof Drewniak	64749fb015	Revert "[AMDGPU] Handle natively unsupported types in addrspace(7) lowering" (#123657 ) Reverts llvm/llvm-project#110572 Seem to have broken a buildbot, not sure why https://lab.llvm.org/buildbot/#/builders/108/builds/8346	2025-01-20 13:14:04 -05:00
Krzysztof Drewniak	3805355ef6	[AMDGPU] Handle natively unsupported types in addrspace(7) lowering (#110572 ) The current lowering for ptr addrspace(7) assumed that the instruction selector can handle arbitrary LLVM types, which is not the case. Code generation can't deal with - Values that aren't 8, 16, 32, 64, 96, or 128 bits long - Aggregates (this commit only handles arrays of scalars, more may come) - Vectors of more than one byte - 3-word values that aren't a vector of 3 32-bit values (for example, a <6 x half>) This commit adds a buffer contents type legalizer that adds the needed bitcasts, zero-extensions, and splits into subcomponents needed to convert a load or store operation into one that can be successfully lowered through code generation. In the long run, some of the involved bitcasts (though potentially not the buffer operation splitting) ought to be handled by the instruction legalizer, but SelectionDAG makes this difficult. It also takes advantage of the new `nuw` flag on `getelementptr` when lowering GEPs to offset additions. We don't currently plumb through `nsw` on GEPs since that should likely be a separate change and would require declaring what we mean by "the address" in the context of the GEP guarantees.	2025-01-20 11:33:35 -06:00
Fabian Ritter	cc5eba1737	[AMDGPU] Reject misaligned SGPR constraints for inline asm (#123590 ) The indices of SGPR register pairs need to be 2-aligned and SGPR quadruplets need to be 4-aligned. With this patch, we report an error when inline asm register constraints specify a misaligned register index, instead of silently dropping the specified index. Fixes #123208 --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-20 15:47:11 +01:00
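The alignment rule stated in the commit above (pairs must start at an index divisible by 2, quadruplets at an index divisible by 4) can be sketched as a small check. The helper name and the choice to reject other tuple widths are hypothetical; this is not the code from the patch, only an illustration of the rule.

```python
# Minimal sketch of the SGPR tuple alignment rule described in the commit.
# `sgpr_tuple_is_aligned` is a hypothetical helper, not an LLVM API.

# Required alignment of the first register index, keyed by tuple width.
# Only the widths named in the commit message (plus single registers) are
# covered; wider tuples are deliberately left out of this sketch.
REQUIRED_ALIGNMENT = {1: 1, 2: 2, 4: 4}


def sgpr_tuple_is_aligned(first_index: int, num_regs: int) -> bool:
    """Return True if s[first_index : first_index + num_regs - 1] is aligned."""
    if num_regs not in REQUIRED_ALIGNMENT:
        raise ValueError(f"tuple width {num_regs} is not covered by this sketch")
    return first_index % REQUIRED_ALIGNMENT[num_regs] == 0


if __name__ == "__main__":
    print(sgpr_tuple_is_aligned(2, 2))  # s[2:3]
    print(sgpr_tuple_is_aligned(1, 2))  # s[1:2]
    print(sgpr_tuple_is_aligned(4, 4))  # s[4:7]
    print(sgpr_tuple_is_aligned(2, 4))  # s[2:5]
```

With this rule, a constraint such as `{s[1:2]}` would be reported as an error rather than having its index silently dropped.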
Fraser Cormack	9cf24652e7	[AMDGPU] Fix spurious NoAlias results (#122309 ) After a30e50fc, AMDGPUAAResult is being called in more situations where BasicAA isn't sure. This exposed some regressions where NoAlias is being incorrectly returned for two identical pointers. The fix is to check the underlying objects for equality before returning NoAlias.	2025-01-20 14:19:30 +00:00
Akshat Oke	96c4f978d0	[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572 )	2025-01-20 16:34:01 +05:30
Carl Ritson	f811482a74	[AMDGPU] SIWholeQuadMode: Ensure earliest WQM entry point for PS (#123266 ) Ensure shaders running WQM (PS) enter at the earliest point irrespective of WQM marking.	2025-01-19 15:50:33 +09:00
Stanislav Mekhanoshin	fbea21aa52	[AMDGPU] Add test for VALU hoisting from WWM region. NFC. (#123234 ) The test demonstrates a suboptimal VALU hoisting from a WWM region. As a result we have 2 WWM regions instead of one.	2025-01-17 10:06:44 -08:00
Brox Chen	703e9e97d9	[AMDGPU][True16][CodeGen] true16 codegen for bswap (#122849 ) true16 codegen pattern for bswap	2025-01-17 09:36:55 -05:00
Vikram Hegde	225fc4f356	[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218 ) The intention is to use a "copy" instead of a "sub" to handle the high parts of 64-bit multiply for this specific case. This unlocks copy prop use cases where the copy can be reused by later multiply+add sequences if possible. Fixes: SWDEV-487672, SWDEV-487669	2025-01-17 11:09:39 +05:30
Matt Arsenault	7475f0a345	DAG: Avoid forming shufflevector from a single extract_vector_elt (#122672 ) This avoids regressions in a future AMDGPU commit. Previously we would have a build_vector (extract_vector_elt x), undef with free access to the elements bloated into a shuffle of one element + undef, which has much worse combine support than the extract. Alternatively could check aggressivelyPreferBuildVectorSources, but I'm not sure it's really different than isExtractVecEltCheap.	2025-01-17 08:44:43 +07:00
Matt Arsenault	ca95519704	AMDGPU: Implement isExtractVecEltCheap (#122460 ) Once again we have excessive TLI hooks with bad defaults. Permit this for 32-bit element vectors, which are just use-different-register. We should permit 16-bit vectors as cheap with legal packed instructions, but I see some mixed improvements and regressions that need investigation.	2025-01-17 08:38:01 +07:00
Matt Arsenault	4431106630	DAG: Fix vector bin op scalarize defining a partially undef vector (#122459 ) This avoids some of the pending regressions after AMDGPU implements isExtractVecEltCheap. In a case like shl <value, undef>, splat k, because the second operand was fully defined, we would fall through and use the splat value for the first operand, losing the undef high bits. This would result in an additional instruction to handle the high bits. Add some reduced testcases for different opcodes for one of the regressions.	2025-01-17 08:34:03 +07:00
Brox Chen	8a0c2e7567	[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736 ) Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and fake16 flow. Since we are replacing `v_cndmask_b16` with `v_cndmask_b16_t16/fake16`, we have to at least update the fake16 codeGen to get the codeGen tests passing. For this case, we have to update true16 and fake16 together, otherwise some of the true16 tests will fail	2025-01-16 17:18:28 -05:00
Christudasan Devadasan	1797fb6b23	[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045 )	2025-01-16 11:06:38 +05:30
jofrn	c8bbbaa5c7	[SelectionDAG][AMDGPU] Negative offset when selecting scratch sv offsets (#122251 ) APInt will fail when given a negative offset. SelectScratchSVAddr utilizes this function and can be given a negative offset as well, so this change modifies it to use APSInt instead.	2025-01-15 06:56:28 -05:00
Mariusz Sikora	b3924cb9ec	[AMDGPU] Set Convergent property for image.(getlod/sample*) intrinsics which use WQM (#122908 ) This change adds the IntrConvergent property to the image.getlod intrinsic and to several image.sample intrinsics. All image.sample intrinsics apart from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.	2025-01-15 10:23:28 +01:00
Shoreshen	b665dddd70	[AMDGPU] Add tests for v_sat_pk_u8_i16 codegen (#122438 ) Preparation for #121124 This PR provides tests added into [PR](https://github.com/llvm/llvm-project/pull/121124) that add selection patterns for instruction `v_sat_pk`, in order to specify the change of the tests before and after the commit. Pre-commit tests PR for #121124 : Add selection patterns for instruction `v_sat_pk`	2025-01-14 19:26:46 -05:00
Brox Chen	f1b1c7f3c1	[AMDGPU][True16][CodeGen] Undo sub(x,c) to add in true16 flow (#118854 ) Undo sub x, c -> add x, -c canonicalization in true16 flow. This duplicates the pattern from fake16 and implements the same pattern in true16 format	2025-01-14 10:57:33 -05:00
Brox Chen	5e26ff35c1	[AMDGPU][True16][MC] true16 for v_cmp_lt_f16 (#122499 ) True16 format for v_cmp_lt_f16. Update VOPC t16 and fake16 pseudo.	2025-01-14 10:03:36 -05:00
Acim Maravic	cc3aab580b	[AMDGPU] Handle nontemporal and amdgpu.last.use metadata in amdgpu-lower-buffer-fat-pointers (#120139 )	2025-01-14 11:22:20 +01:00
Piotr Sobczak	40fa7f5e8b	[AMDGPU] Fix computed kill mask (#122736 ) Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill lowering. This has the effect of AND'ing the demote/kill condition with exec, which is needed for a proper live mask update. The S_XOR is inadequate because it may return true for a lane with exec=0. This patch fixes an image corruption in a game. I think the issue went unnoticed because the demote/kill condition is often naturally dependent on exec, so AND'ing with exec is usually not required.	2025-01-14 10:00:40 +01:00
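The difference between the two mask computations in the commit above can be illustrated with plain bitmasks, one bit per lane. This is a sketch under stated assumptions, not the lowering code itself: it assumes the old kill mask was `cond XOR exec` and the new one is `exec AND NOT cond` (S_ANDN2 computes its first source AND'ed with the complement of its second). The 4-lane values and all variable names are made up for illustration.

```python
# Illustrative 4-lane bitmask model of the kill mask fix.
# Operand order and example values are assumptions, not taken from the patch.

EXEC = 0b0011  # lanes 0 and 1 are currently executing
COND = 0b0101  # per-lane condition: 1 = lane survives, 0 = lane is killed
LIVE = 0b1111  # live mask before the kill; may include lanes with exec=0

# Old computation (assumed): XOR of the condition with exec.
# Lane 2 has cond=1 and exec=0, so XOR wrongly flags it as killed.
killed_xor = COND ^ EXEC

# New computation (assumed): exec AND NOT cond, as S_ANDN2 does.
# Only lanes that are executing and fail the condition are flagged.
killed_andn2 = EXEC & ~COND

# Live mask update: clear the killed lanes.
live_after_xor = LIVE & ~killed_xor
live_after_andn2 = LIVE & ~killed_andn2

if __name__ == "__main__":
    print(f"killed (xor)   = {killed_xor:04b}")
    print(f"killed (andn2) = {killed_andn2:04b}")
    print(f"live   (xor)   = {live_after_xor:04b}")
    print(f"live   (andn2) = {live_after_andn2:04b}")
```

In this model the XOR variant drops lane 2 from the live mask even though that lane is not executing, which is the kind of incorrect live mask update the commit message describes.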
Brox Chen	0f3aeca16f	[AMDGPU][True16][CodeGen] Update and/or/xor codegen pattern for i16 (#121835 ) In true16 flow, remove and/or/xor 32bit patterns for i16	2025-01-13 16:48:00 -05:00
Brox Chen	26e13091ea	[AMDGPU][True16][CodeGen] true16 codegen pattern for v_pack_b32_f16 (#121988 ) true16 codegen pattern for v_pack_b32_f16	2025-01-13 12:26:36 -05:00
Matt Arsenault	f4598194b5	DAG: Fold bitcast of scalar_to_vector to anyext (#122660 ) scalar_to_vector is difficult to make appear and test, but I found one case where this makes an observable difference. It fires more often than this in the test suite, but most of them have no net result in the final code. This helps reduce regressions in a future commit.	2025-01-13 19:38:58 +07:00
Matt Arsenault	e9a55770dc	AMDGPU: Add gfx9 run line to scalar_to_vector test (#122659 )	2025-01-13 19:35:56 +07:00
Akshat Oke	73b0e8a191	[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434 )	2025-01-13 17:52:30 +05:30
Akshat Oke	7bf1cb702b	[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261 )	2025-01-13 10:11:40 +05:30
Shilei Tian	f15da5fb78	[AMDGPU] Fix an invalid cast in `AMDGPULateCodeGenPrepare::visitLoadInst` (#122494 ) Fixes: SWDEV-507695	2025-01-12 23:40:25 -05:00
Austin Kerbow	657fb4433e	[AMDGPU] Add target hook to isGlobalMemoryObject (#112781 ) We want special handling for IGLP instructions in the scheduler but they should still be treated like they have side effects by other passes. Add a target hook to the ScheduleDAGInstrs DAG builder so that we have more control over this.	2025-01-11 09:57:57 -08:00


8192 Commits