9133 Commits

Author SHA1 Message Date
Shoreshen
04aebbfbe2
[AMDGPU] Delete AMDGPU Unify Metadata pass (#153548)
Fixes #153150
2025-08-14 16:16:32 +08:00
Robert Imschweiler
d21feb5e66
[AMDGPU] Fix crash for inline-asm inputs of type MVT::Other (#153425) 2025-08-13 17:27:31 +02:00
Petar Avramovic
4d4966d481
AMDGPU/GlobalISel: Add regbanklegalize rules for ptr-add (#153175) 2025-08-13 15:49:48 +02:00
Diana Picus
420a5de1a4
[AMDGPU] Ignore inactive VGPRs in .vgpr_count (#149052)
When using the `amdgcn.init.whole.wave` intrinsic, we add dummy VGPR
arguments with the purpose of preserving their inactive lanes. The
pattern may look something like this:

```
entry:
  call amdgcn.init.whole.wave
  branch to shader or tail

shader:
  $vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
  actual code...

tail:
  call amdgcn.cs.chain [...], implicit $vInactive
```

We should not report these VGPRs in the `.vgpr_count` metadata. This
patch achieves that goal by ignoring meta instructions and calls. This should
be safe since if those registers are actually used in any other context,
they will be counted there. The same reasoning applies in the general
case, so we don't explicitly check for the existence of `init.whole.wave`.

This is a reworked version of #133242, which was reverted in #144039
and split into smaller bits.
2025-08-13 10:47:00 +02:00
Shoreshen
db96363c0a
[AMDGPU] Avoid put implicit_def into bundle that break reg's liveness (#142563)
Cause:
1. `implicit_def` inside bundle does not count for define of reg in
machineinst verifier
2. Including `implicit_def` will cause relative reg not define, result
in `Bad machine code: Using an undefined physical register` in the
machineinst verifier

Fixes https://github.com/llvm/llvm-project/issues/139102

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-08-13 10:41:44 +08:00
Stanislav Mekhanoshin
d0ee82040c
[AMDGPU] Add s_barrier_init|join|leave instructions (#153296) 2025-08-12 15:07:07 -07:00
Adam Yang
8710571aba
[AMDGPU] Fixed llvm-debuginfo-analyzer for AMDGPU. (#145125)
Constructing Target triple with `ObjectFile::makeTriple` instead of just
with `Arch` and leaving the rest unknown. Also creating the subtarget
with the `CPU`. AMDGPU needs the full triple and `CPU` to disassemble
correctly.

To run a full test, also fixed a failure in `SIPreAllocateWWMRegs` with
the `$noreg` operand in `DBG_VALUE`.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-08-12 22:04:52 +00:00
Philip Reames
4d629f9744
[MIR] Remove std::variant from multiple save/restore point handling [nfc] (#153226)
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it bacame clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
2025-08-12 11:23:05 -07:00
Orlando Cazalet-Hyams
54f92c7806
[RemoveDIs][AMDGPU] Replace defunct getAssignmentMarkers call (#153212)
Not quite NFC as it looks like the original intrinsic-handling code
never got updated to use records. This was never caught because that
code wasn't tested. I've adjusted an existing test so the behaviour is
now covered.
2025-08-12 17:20:38 +01:00
Petar Avramovic
f88be47fbf
AMDGPU/GlobalISel: Switch a few tests to new-reg-bank-select (#153174) 2025-08-12 15:03:31 +02:00
Jim M. R. Teichgräber
5d54a576fe
[AMDGPU] AMDGPULateCodeGenPrepare Legacy PM: replace setPreservesAll() with setPreservesCFG() (#148167)
This PR depends on #148165; the first commit
(90f1d0a881a21a8b4f192622d798c290770fda63) belongs to that PR. The
changes are distinct, so separate PRs seemed like the best option. I
don't have commit access, so I couldn't use user-branches to mark the
dependency.

As AMDGPULateCodeGenPrepare actually performs changes that invalidate
Uniformity Analysis; use `setPreservesCFG()` to mark this, instead of
`setPreservesAll()` which wrongly includes preserving Uniformity
Analysis.

Note that before #148165, this would still have preserved Uniformity
Analysis, hence the dependency. In addition, `amdgpu/llc-pipeline.cc`
needs to be changed when both changes are in effect, but those changes
would make the test fail if the PRs weren't based on one another.

Note on why this hasn't caused issues so far:
It just so happens that AMDGPULateCodeGenPrepare is always immediately
followed by AMDGPUUnifyDivergentExitNodes, which *does* invalidate most
analyses, including Uniformity. And because UnifyDivergentExitNodes only
looks at terminators, and LateCGP seemingly does not replace uniform
values with divergent values, or divergent values with uniform values,
and it only *inserts new values that are not looked at by
UnifyDivergentExitNodes*, this bug remained hidden.

---

I ran `git-clang-format` on my changes. I tested them using the
`check-llvm` target; no unexpected failures occurred after I made the
change to `amdgpu/llc-pipeline.ll`.
2025-08-12 19:40:02 +09:00
Fabian Ritter
e9ece175f9
[AMDGPU][GISel] Only fold flat offsets if they are inbounds (#153001)
For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.

Relevant tests are in fold-gep-offset.ll.

Analogous to #132353 for SDAG (which is not yet in a mergeable state, its
progress is currently blocked by #146076).

Fixes SWDEV-516125 for GISel.
2025-08-12 10:14:20 +02:00
Fabian Ritter
53af2e693d
[AMDGPU][GISel] Add inbounds flag to FLAT GISel tests (#153000)
This is in preparation for a patch that disables folding offsets into FLAT
instructions if the corresponding address computation is not inbounds, to avoid
miscompilations where this would lead to wrong aperture check results.

With the added inbounds flags for GEPs and G_PTR_ADDs affecting FLAT
instructions, the outputs for these tests won't change.

For SWDEV-516125.
2025-08-12 09:35:19 +02:00
Matt Arsenault
ff53086924
AMDGPU: Add new VA inline asm constraint for AV registers (#152665)
Add a new constraint corresponding to the AV_* register classes
for operands which can allocate AGPRs or VGPRs. This applies
to load and stores on gfx90a+, and srcA / srcB for MFMA instructions.

The error emitted on unsupported targets isn't ideal, it is
produced by the register allocator without a rationale, but it is
consistent with the existing errors.

I mostly want this for writing allocation tests.
2025-08-12 10:17:28 +09:00
Stanislav Mekhanoshin
ea14834966
[AMDGPU] Per-subtarget DPP instruction classification (#153096)
This is NFCI at this point.
2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin
b9ecee9d47
[AMDGPU] Fix DPP combining into V_BITOP3_B32 (#153083) 2025-08-11 15:39:02 -07:00
Matt Arsenault
9a293530d9
AMDGPU: Handle multiple AGPR MFMA rewrites (#147975)
I have this firing on one of the real examples, need to
produce the tests and check a few edge cases
2025-08-11 23:10:35 +09:00
Alexander Richardson
87ad9122e5
[AMDGPULowerBufferFatPointers] Handle ptrtoaddr by extending the offset
Reviewed By: krzysz00

Pull Request: https://github.com/llvm/llvm-project/pull/139413
2025-08-09 16:28:12 -07:00
Alexander Richardson
4cdfc0dc0d
[AMDGPU] Baseline test for ptrtoaddr in lower-buffer-fat-pointers
We should only be extracting the 32-bit offset in the ptrtoaddr case.

Reviewed By: arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/143812
2025-08-09 16:26:57 -07:00
Chris Jackson
f9cb95c9b0
[AMDGPU] Add additional test cases to integer src mod test (#152692)
Adds missing 16-bit test cases to the test that src mods are not applied
to integers in instructions with canonicalizing patterns.
2025-08-08 23:14:36 +01:00
Lucas Ramirez
83c308f014
[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization (#149224)
The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.

This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.
2025-08-08 14:26:04 +02:00
Diana Picus
a910a6a8b5
[AMDGPU] AsmPrinter: Unify arg handling (#151672)
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function`  arguments in the `AMDGPUAsmPrinter`. This pretty much 
repeats some of the logic from instruction selection.

This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.

This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
2025-08-08 12:00:37 +02:00
Matt Arsenault
81f3ddf4a2
AMDGPU: Rewrite VGPR MFMAs to AGPR when directly copied to AGPR class (#152480) 2025-08-08 18:20:21 +09:00
Nikita Popov
c23b4fbdbb
[IR] Remove size argument from lifetime intrinsics (#150248)
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.

This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
2025-08-08 11:09:34 +02:00
David Stuttard
c7c0229480
Revert "[AMDGPU] SelectionDAG divergence tracking should take into account Target divergency. (#147560)" (#152548)
This reverts commit 9293b65a616b8de432a654d046e802540b146372.
2025-08-08 09:05:59 +01:00
Stanislav Mekhanoshin
29cde86ecc
[AMDGPU] Removed extra blank lines from tests. NFC. (#152612) 2025-08-08 00:53:00 -07:00
Carl Ritson
0bdd312b1d
[AMDGPU] Generate some WQM/WWM tests (NFC) (#152635)
Update llvm.amdgcn.kill.ll and wqm.mir to be generated.
This preparatory work for refactoring of WQM/WWM pass.
2025-08-08 14:10:13 +09:00
Stanislav Mekhanoshin
469863111f
[AMDGPU] Enable CodeGen for v_pk_fma_bf16 (#152578) 2025-08-07 16:19:59 -07:00
Stanislav Mekhanoshin
dddeb07c2e
[AMDGPU] Restrict packed math FP32 instructions to read only one SGPR per operand on gfx12+ (#152465)
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.

Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>

Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
2025-08-07 16:13:34 -07:00
Stanislav Mekhanoshin
82046c7f33
[AMDGPU] Adjust hard clause rules for gfx1250 (#152592)
Change from GFX12: Relax S_CLAUSE rules to all all non-flat memory types
in
the same clause, and all Flat types in the same.

For VMEM/FLAT clause types now look like:

- Non-Flat (load, store, atomic): buffer, global, scratch, TDM, Async
- Flat: load, store, atomic
2025-08-07 14:59:31 -07:00
Stanislav Mekhanoshin
abc22f771e
[AMDGPU] Fix buffer addressing mode matching (#152584)
Starting in gfx1250, voffset and immoffset are zero-extended from 32
bits
to 45 bits before being added together.
2025-08-07 14:23:41 -07:00
Stanislav Mekhanoshin
d09dbdabb9
[AMDGPU] bf16 clamp folding (#152573) 2025-08-07 12:59:50 -07:00
Chris Jackson
9f102a9004
[AMDGPU] Recognise bitmask operations as srcmods on select (#152119)
Add to the VOP patterns to recognise when or/xor/and are masking only the most significant bit of i32/v2i32/i64 and replace with the corresponding FP source modifier.
2025-08-07 19:45:09 +01:00
Matt Arsenault
7402cd6ded AMDGPU: Disable AGPR selection in mfma rewrite test
This makes the test actually test the intended rewrite
pass. Also add some tests with inline immediates in src2.
Switch the target to gfx942 for future test functions.
2025-08-07 16:28:19 +09:00
Matt Arsenault
f44d8d583c
AMDGPU: Add a few missing mfma rewrite tests (#152434)
Test other splitting situations that appear in greedy.
This includes ensuring we have a case that hits a local split
and instruction split (most of the tests hit the region split path).

Also test a few cases where the final result isn't fully used, resulting
in partial copy bundles instead of a simple full copy. Test physreg
and virtreg agpr interference with a reassignment candidate.

I'm accumulating too many failure cases, and MIR tests are very prone
to painful merge conflicts, so I've added a few more tests and extracted
new tests from #147975.

Closes #149026
2025-08-07 16:14:45 +09:00
Stanislav Mekhanoshin
b296ea9c14
[AMDGPU] s_get_shader_cycles_u64 gfx1250 instruction (#152390)
It is the same as reading SHADER_CYCLES_LO and SHADER_CYCLES_HI
but with a single instruction.
2025-08-06 15:32:28 -07:00
Stanislav Mekhanoshin
c2eddec4ff
[AMDGPU] System scope atomics are emulated over PCIe in gfx1250 (#152369)
HW will emulate unsupported PCIe atomics via CAS loop, we do not need to
expand these anymore.
2025-08-06 13:08:12 -07:00
Stanislav Mekhanoshin
334d0be2d4
[AMDGPU] Support 64-bit LDS atomic fadd on gfx1250 (#152368) 2025-08-06 13:07:56 -07:00
Changpeng Fang
32161e9de3
[AMDGPU] Do not fold an immediate into instructions with frame indexes (#151263)
Do not fold an immediate into an instruction that already has a frame
index operand. A frame index could possibly turn out to be another immediate.

Fixes: SWDEV-536263

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-08-06 11:47:37 -07:00
Tim Renouf
e99c565cd2
MC,AMDGPU: Don't pad .text with s_code_end if it would otherwise be empty (#147980)
We don't want that padding in a module that only contains data, not
code.

Also fix MCSection::hasInstructions() so it works with the asm streamer
too.
2025-08-06 13:25:45 +01:00
Diana Picus
14cd133931
Revert "[AMDGPU] Intrinsic for launching whole wave functions" (#152286)
Reverts llvm/llvm-project#145859 because it broke a HIP test:
```
[34/59] Building CXX object External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o
FAILED: External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o 
/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/opt/botworker/llvm/External/hip/rocm-6.3.0 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.3.0.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /home/botworker/bbot/clang-hip-vega20/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
fatal error: error in backend: Cannot select: intrinsic %llvm.amdgcn.readfirstlane
```
2025-08-06 12:24:52 +02:00
Diana Picus
0461cd3d1d
[AMDGPU] Intrinsic for launching whole wave functions (#145859)
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.

Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.

Unspeakable horrors happen around calls from whole wave functions, the
plan is to improve the handling of caller/callee-saved registers in
a future patch.

Tail calls are also handled in a future patch.
2025-08-06 10:25:53 +02:00
Stanislav Mekhanoshin
b8eb61adc9
[AMDGPU] Implement addrspacecast from flat <-> private on gfx1250 (#152218) 2025-08-05 16:25:23 -07:00
Stanislav Mekhanoshin
34aed0ed56
[AMDGPU] Add gfx1250 wmma_scale[16]_f32_32x16x128_f4 instructions (#152194) 2025-08-05 15:15:21 -07:00
Simon Pilgrim
9f50224b25
[DAG] Remove Depth=1 hack from isGuaranteedNotToBeUndefOrPoison checks (#152127)
Now that #146490 removed the assertion in visitFreeze to assert that the
node was still isGuaranteedNotToBeUndefOrPoison we no longer need this
reduced depth hack (which had to account for the difference in depth of
freeze(op()) vs op(freeze())

Helps with some of the minor regressions in #150017
2025-08-05 13:35:04 +01:00
Simon Pilgrim
d561259a08
[DAG] visitFREEZE - replace multiple frozen/unfrozen uses of an SDValue with just the frozen node (#150017)
Similar to InstCombinerImpl::freezeOtherUses, attempt to ensure that we
merge multiple frozen/unfrozen uses of a SDValue. This fixes a number of
hasOneUse() problems when trying to push FREEZE nodes through the DAG.

Remove SimplifyMultipleUseDemandedBits handling of FREEZE nodes as we
now want to keep the common node, and not bypass for some nodes just
because of DemandedElts.

Fixes #149799
2025-08-05 09:24:09 +01:00
Stanislav Mekhanoshin
a153e83e41
[AMDGPU] gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 codegen (#152036) 2025-08-04 19:16:34 -07:00
Maksim Shelegov
862fb42b06
[AMDGPU][GlobalISel] Fix G_UNMERGE_VALUES combine (#141812)
Fixes #139791.

When trying to combine two G_UNMERGE_VALUES with a COPY between them,
a crash occurs in tryCombineUnmergeValues() because getDefIndex() tries
to find the index of the original source register in the def found by
getDefIgnoringCopies(). In the case of a COPY in between, this register
is not present in the def, only in the COPY.
2025-08-04 17:06:34 -07:00
Stanislav Mekhanoshin
37fe9f6382
[AMDGPU] Add gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 MC support (#152014)
This adds new VOP3PX2e encoding
2025-08-04 14:20:12 -07:00
Ivan Kosarev
2b20cf7291
[AMDGPU] Fold into uses of splat REG_SEQUENCEs through COPYs. (#145691) 2025-08-04 16:18:33 +01:00