8192 Commits

Author SHA1 Message Date
Jeffrey Byrnes
acb7859f07
[MachineSink] Extend loop sinking capability (#117247)
The current MIR cycle sinking capabilities are rather limited. It only
support sinking copies into a single successor block while obeying
limits.

This opt-in feature adds a more aggressive option, that is not limited
to the above concerns. The feature will try to "sink" by duplicating any
top-level preheader instruction (that we are sure is safe to sink) into
any user block, then does some dead code cleanup. In particular, this is
useful for high RP situations when loop bodies have control flow.
2025-01-23 17:08:23 -08:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Nico Weber
99d450e9f5 Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)"
This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab.
Breaks check-llvm, see
https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953
2025-01-23 09:19:42 -05:00
Matt Arsenault
e28e93550a
AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684)
For VALU shuffles, this saves an instruction in some case.
2025-01-23 20:58:02 +07:00
Kareem Ergawy
ff55c9bc63
[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089)
Fixes #123800

Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).

Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>
2025-01-23 14:53:11 +01:00
Frederik Harwath
6fdaaafd89
[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)
This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.

See the draft PR #123221 for an attempt to fix the actual issue.

---------

Co-authored-by: Frederik Harwath <fharwath@amd.com>
2025-01-23 14:32:01 +01:00
Matt Arsenault
93d35ad5f5
AMDGPU: Delete FillMFMAShadowMutation (#123861)
No test changes with this removed and it appears to
be obsolete.
2025-01-22 22:41:25 +07:00
Akshat Oke
a343b8e595
[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695) 2025-01-22 14:54:01 +05:30
TiborGY
3630d9ef65
[PartiallyInlineLibCalls] Add infrastructure for emitting optimization remarks from PartiallyInlineLibCalls (#122654)
I am planning to add some optimization remarks to the
`PartiallyInlineLibCalls` pass. However, since this pass does not emit any 
optimization remarks yet, I have to add the "infrastructure" for that first, which 
is what this PR is about.
2025-01-22 13:15:40 +07:00
Shoreshen
7c58d6363a
[AMDGPU] Add commute for some VOP3 inst (#121326)
add commute for some VOP3 inst, allow commute for both inline constant
operand, adjust tests

Fixes #111205
2025-01-22 11:08:26 +07:00
Shoreshen
e8811ad3cc
[AMDGPU] Fix unreachable reg bit width (#122107)
Add register class bit width for SReg_256_XNULL and SReg_128_XNULL
2025-01-22 10:05:47 +07:00
Matt Arsenault
5e79ae60a6
DAG: Fix vector_shuffle -> splat fold defining undef lanes (#123596)
For shuffle vector splats with undef lanes in the mask,
this was introducing real values. Filter out build_vector
results based on the undef elements in the mask.

This avoids AMDGPU test regressions in a future change.

test/CodeGen/X86/urem-seteq-illegal-types.ll looks worse
but I didn't investigate.
2025-01-21 23:55:50 +07:00
Brox Chen
70632f9566
[AMDGPU][True16][MC] true16 for v_cmp_xx_f16 (#122943)
A bulk commit of true16 support for v_cmp_xx_f16 instructions including:
v_cmp_f_f16
v_cmp_eq_f16
v_cmp_le_f16
v_cmp_gt_f16
v_cmp_lg_f16
v_cmp_ge_f16
v_cmp_o_f16
v_cmp_u_f16
v_cmp_nge_f16
v_cmp_nlg_f16
v_cmp_ngt_f16
v_cmp_nle_f16
v_cmp_neq_f16
v_cmp_nlt_f16
v_cmp_t_f16

Added a GFX12 runline for fcmp.f16
2025-01-21 10:06:22 -05:00
Chinmay Deshpande
9ca1323de1
[AMDGPU] Fix crash due to missing check for FLAT instructions that dont use vector registers when computing VALU hazard (#123627) 2025-01-21 05:50:58 -08:00
lialan
5d9c717597
[GISel] Fold shifts to constant result. (#123510)
This resolves #123212
2025-01-21 05:10:45 -08:00
Janek van Oirschot
82944595fa
[AMDGPU] Change scope of resource usage info symbols (#114810)
Change scope of resource usage info MC symbols to align with the function linkage type
2025-01-21 13:10:06 +00:00
Akshat Oke
9b6e8df896
[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592)
Extends NPM pipeline support till PostRegAlloc passes (greedy is in the
works)
2025-01-21 15:27:46 +05:30
David Stuttard
ebc5020564
[AMDGPU] Update entry point name for PAL metadata (#123581)
Old entry-point metadata being updated. Nothing is required
to account for deprecation as nothing uses the old style
2025-01-21 09:37:22 +00:00
Matt Arsenault
585858aeb6 AMDGPU: Fix asm constrains in new shuffle tests
These passed prechecks but failed after cc5eba1737146a727a61b5dbe16d8c2ac453981e
2025-01-21 10:49:42 +07:00
Matt Arsenault
7786266dc7
AMDGPU: Expand shuffle testing with generated tests (#123574)
Add some generated tests with every shuffle permutation
for relevant vector element types and sizes. Not sure if this
is going overboard with the number of tests. I pruned out the largest
cases (16 and 32-bit cases are impractically large), and there's
redundancy when testing the pointer cases (at least for SelectionDAG).

This uses inline assembly to produce sample values because of how the
ABI is lowered when using a function argument. Since we break all
arguments into 32-bit pieces, a shuffle never ends up forming. We
need separate handling to reconstruct shuffles in contexts involving
physical registers in ABI contexts.

I wrote a small tool to generate these, so I can easily change the
exact test body. Not sure if it's worth posting anywhere.

This is in preparation for making better use of v_pk_mov_b32,
v_mov_b64 and s_mov_b64 in shuffles.
2025-01-21 10:08:42 +07:00
Krzysztof Drewniak
697c1883f1
Reapply "[AMDGPU] Handle natively unsupported types in addrspace(7) lowering" (#123660)
(#123657)

This reverts commit 64749fb01538fba2b56d9850497d5f3a626cabc2.

Adds a constructor to VecSlice to address the failure
2025-01-20 16:12:17 -06:00
Krzysztof Drewniak
64749fb015
Revert "[AMDGPU] Handle natively unsupported types in addrspace(7) lowering" (#123657)
Reverts llvm/llvm-project#110572

Seem to have broken a buildbot, not sure why
https://lab.llvm.org/buildbot/#/builders/108/builds/8346
2025-01-20 13:14:04 -05:00
Krzysztof Drewniak
3805355ef6
[AMDGPU] Handle natively unsupported types in addrspace(7) lowering (#110572)
The current lowering for ptr addrspace(7) assumed that the instruction
selector can handle arbtrary LLVM types, which is not the case. Code
generation can't deal with
- Values that aren't 8, 16, 32, 64, 96, or 128 bits long
- Aggregates (this commit only handles arrays of scalars, more may come)
- Vectors of more than one byte
- 3-word values that aren't a vector of 3 32-bit values (for axample, a
<6 x half>)

This commit adds a buffer contents type legalizer that adds the needed
bitcasts, zero-extensions, and splits into subcompnents needed to
convert a load or store operation into one that can be successfully
lowered through code generation.

In the long run, some of the involved bitcasts (though potentially not
the buffer operation splitting) ought to be handled by the instruction
legalizer, but SelectionDAG makes this difficult.

It also takes advantage of the new `nuw` flag on `getelementptr` when
lowering GEPs to offset additions.

We don't currently plumb through `nsw` on GEPs since that should likely
be a separate change and would require declaring what we mean by "the
address" in the context of the GEP guarantees.
2025-01-20 11:33:35 -06:00
Fabian Ritter
cc5eba1737
[AMDGPU] Reject misaligned SGPR constraints for inline asm (#123590)
The indices of SGPR register pairs need to be 2-aligned and SGPR
quadruplets need to be 4-aligned. With this patch, we report an error
when inline asm register constraints specify a misaligned register
index, instead of silently dropping the specified index.

Fixes #123208

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-20 15:47:11 +01:00
Fraser Cormack
9cf24652e7
[AMDGPU] Fix spurious NoAlias results (#122309)
After a30e50fc, AMDGPUAAResult is being called in more situations where
BasicAA isn't sure. This exposed some regressions where NoAlias is being
incorrectly returned for two identical pointers.

The fix is to check the underlying objects for equality before returning
NoAlias.
2025-01-20 14:19:30 +00:00
Akshat Oke
96c4f978d0
[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572) 2025-01-20 16:34:01 +05:30
Carl Ritson
f811482a74
[AMDGPU] SIWholeQuadMode: Ensure earliest WQM entry point for PS (#123266)
Ensure shaders running WQM (PS) enter at the earliest point irrespective
of WQM marking.
2025-01-19 15:50:33 +09:00
Stanislav Mekhanoshin
fbea21aa52
[AMDGPU] Add test for VALU hoisiting from WWM region. NFC. (#123234)
The test demonstraits a suboptimal VALU hoisting from a WWM
region. As a result we have 2 WWM regions instead of one.
2025-01-17 10:06:44 -08:00
Brox Chen
703e9e97d9
[AMDGPU][True16][CodeGen] true16 codegen for bswap (#122849)
true16 codegen pattern for bswap
2025-01-17 09:36:55 -05:00
Vikram Hegde
225fc4f356
[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218)
The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.

This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.

Fixes: SWDEV-487672, SWDEV-487669
2025-01-17 11:09:39 +05:30
Matt Arsenault
7475f0a345
DAG: Avoid forming shufflevector from a single extract_vector_elt (#122672)
This avoids regressions in a future AMDGPU commit. Previously we
would have a build_vector (extract_vector_elt x), undef with free
access to the elements bloated into a shuffle of one element + undef,
which has much worse combine support than the extract.

Alternatively could check aggressivelyPreferBuildVectorSources, but
I'm not sure it's really different than isExtractVecEltCheap.
2025-01-17 08:44:43 +07:00
Matt Arsenault
ca95519704
AMDGPU: Implement isExtractVecEltCheap (#122460)
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, which are just use-different-register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
2025-01-17 08:38:01 +07:00
Matt Arsenault
4431106630
DAG: Fix vector bin op scalarize defining a partially undef vector (#122459)
This avoids some of the pending regressions after AMDGPU implements
isExtractVecEltCheap.

In a case like shl <value, undef>, splat k, because the second operand
was fully defined, we would fall through and use the splat value for the
first operand, losing the undef high bits. This would result in an additional
instruction to handle the high bits. Add some reduced testcases for different
opcodes for one of the regressions.
2025-01-17 08:34:03 +07:00
Brox Chen
8a0c2e7567
[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736)
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.

Since we are replacing `v_cndmask_b16` to `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 codeGen to get codeGen test passing.
For this case, we have to update the true16 and with fake16 together,
otherwise some of the true16 tests will fail
2025-01-16 17:18:28 -05:00
Christudasan Devadasan
1797fb6b23
[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045) 2025-01-16 11:06:38 +05:30
jofrn
c8bbbaa5c7
[SelectionDAG][AMDGPU] Negative offset when selecting scratch sv offsets (#122251)
APInt will fail when given a negative offset. SelectScratchSVAddr
utilizes this function and can be given a negative offset as well, so
this change modifies it to use APSInt instead.
2025-01-15 06:56:28 -05:00
Mariusz Sikora
b3924cb9ec
[AMDGPU] Set Convergent property for image.(getlod/sample*) intrinsics which uses WQM (#122908)
This change adds IntrConvergent property to image.getlod intrinsic and
to several image.sample intrinsics. All image.sample intrinsics apart
from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.
2025-01-15 10:23:28 +01:00
Shoreshen
b665dddd70
[AMDGPU] Add tests for v_sat_pk_u8_i16 codegen (#122438)
Preparation for #121124 

This PR provides tests added into
[PR](https://github.com/llvm/llvm-project/pull/121124) that add
selection patterns for instruction `v_sat_pk`, in order to specify the
change of the tests before and after the commit.

Pre-commit tests PR for #121124 : Add selection patterns for instruction
`v_sat_pk`
2025-01-14 19:26:46 -05:00
Brox Chen
f1b1c7f3c1
[AMDGPU][True16][CodeGen] Undo sub(x,c) to add in true16 flow (#118854)
Undo sub x, c -> add x, -c canonicalization in true16 fow.

This duplicating the pattern from fake16 and implemement the same
pattern in true16 format
2025-01-14 10:57:33 -05:00
Brox Chen
5e26ff35c1
[AMDGPU][True16][MC] true16 for v_cmp_lt_f16 (#122499)
True16 format for v_cmp_lt_f16. Update VOPC t16 and fake16 pseudo.
2025-01-14 10:03:36 -05:00
Acim Maravic
cc3aab580b
[AMDGPU] Handle nontemporal and amdgpu.last.use metadata in amdgpu-lower-buffer-fat-pointers (#120139) 2025-01-14 11:22:20 +01:00
Piotr Sobczak
40fa7f5e8b
[AMDGPU] Fix computed kill mask (#122736)
Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill
lowering. This has the effect of AND'ing demote/kill condition with exec
which is needed for proper live mask update.

The S_XOR is inadequate because it may return true for lane with exec=0.

This patch fixes an image corruption in game.

I think the issue went unnoticed because demote/kill condition is often
naturally dependent on exec, so AND'ing with exec is usually not
required.
2025-01-14 10:00:40 +01:00
Brox Chen
0f3aeca16f
[AMDGPU][True16][CodeGen] Update and/or/xor codegen pattern for i16 (#121835)
In true16 flow, remove and/or/xor 32bit patterns for i16
2025-01-13 16:48:00 -05:00
Brox Chen
26e13091ea
[AMDGPU][True16][CodeGen] true16 codegen pattern for v_pack_b32_f16 (#121988)
true16 codegen pattern for v_pack_b32_f16
2025-01-13 12:26:36 -05:00
Matt Arsenault
f4598194b5
DAG: Fold bitcast of scalar_to_vector to anyext (#122660)
scalar_to_vector is difficult to make appear and test,
but I found one case where this makes an observable difference.
It fires more often than this in the test suite, but most of them
have no net result in the final code. This helps reduce regressions
in a future commit.
2025-01-13 19:38:58 +07:00
Matt Arsenault
e9a55770dc
AMDGPU: Add gfx9 run line to scalar_to_vector test (#122659) 2025-01-13 19:35:56 +07:00
Akshat Oke
73b0e8a191
[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434) 2025-01-13 17:52:30 +05:30
Akshat Oke
7bf1cb702b
[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261) 2025-01-13 10:11:40 +05:30
Shilei Tian
f15da5fb78
[AMDGPU] Fix an invalid cast in AMDGPULateCodeGenPrepare::visitLoadInst (#122494)
Fixes: SWDEV-507695
2025-01-12 23:40:25 -05:00
Austin Kerbow
657fb4433e
[AMDGPU] Add target hook to isGlobalMemoryObject (#112781)
We want special handing for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
2025-01-11 09:57:57 -08:00