8607 Commits

Author SHA1 Message Date
Yingwei Zheng
c37b2549ff
Revert "[InstSimplify] Fold getelementptr inbounds null, idx -> null (#130742)" (#138168)
Revert #130742 for now to avoid glibc breakage until the
workaround patches have landed.
2025-05-01 14:21:59 -07:00
Shilei Tian
cf9d4048fb Reapply "[NFC][AMDGPU] Correct the check line update script for llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior.ll"
This reverts commit 74f55c744a18b848cc780c42f0e3dde7e7c96195 with a fix for the
check lines.
2025-05-01 14:13:23 -04:00
Shilei Tian
74f55c744a Revert "[NFC][AMDGPU] Correct the check line update script for llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior.ll"
This reverts commit 49a5dd3dac285ba12f3fcaa55cacbea5968f5a37.
2025-05-01 14:12:03 -04:00
Shilei Tian
49a5dd3dac [NFC][AMDGPU] Correct the check line update script for llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior.ll 2025-05-01 14:10:46 -04:00
Shilei Tian
e25ddf9081 [NFC][AMDGPU] Auto generate check lines for llvm/test/CodeGen/AMDGPU/inline-attr.ll 2025-05-01 13:52:27 -04:00
Lucas Ramirez
e377dc4d38
[AMDGPU] Max. WG size-induced occupancy limits max. waves/EU (#137807)
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.

This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.
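
The clamping described above can be sketched roughly as follows. This is a simplified Python model, not the C++ code in `AMDGPUSubtarget`: the function name, its parameters, and the numbers are hypothetical and chosen only for illustration.

```python
def clamp_waves_per_eu(subtarget_max, max_achievable_occupancy, min_waves=1):
    """Return a (min, max) waves/EU range.

    Hypothetical helper: it only models the idea that the upper bound
    may not exceed the occupancy that the flat workgroup size and LDS
    usage can actually achieve.
    """
    max_waves = min(subtarget_max, max_achievable_occupancy)
    return (min_waves, max_waves)


# Default flat workgroup size range: the subtarget maximum is achievable.
default_range = clamp_waves_per_eu(subtarget_max=10, max_achievable_occupancy=10)

# Restricted flat workgroup size: the achievable occupancy caps the range.
restricted_range = clamp_waves_per_eu(subtarget_max=10, max_achievable_occupancy=4)
```

In this model only the second call changes the range, which matches the statement that functions without a restricted flat workgroup size are unaffected.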

Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
2025-05-01 13:22:23 +02:00
mssefat
71039bbc58
[AMDGPU] Fix register class constraints for si-fold-operands pass when folding immediate into copies (#131387)
Fixes https://github.com/llvm/llvm-project/issues/130020

This fixes an issue where the si-fold-operands pass would incorrectly
fold immediate values into COPY instructions targeting av_32 registers.

The pass now checks register class constraints before attempting to fold
the immediate.
2025-04-30 17:36:46 -05:00
Alexander Richardson
ee13638362
[AMDGPU] Remove explicit datalayout from tests where not needed
Since e39f6c1844fab59c638d8059a6cf139adb42279a opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.

Reviewed By: shiltian, arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/137921
2025-04-30 10:58:17 -07:00
mssefat
7495f92f08
[AMDGPU] Fix undefined scc register in successor block of SI_KILL terminators (#134718)
Fix issue 131298 where an undefined $scc register causes verifier errors
when using SI_KILL_F32_COND_IMM_TERMINATOR instructions. The problem
occurs because the $scc register defined in a comparison before the kill
terminator is used in successor blocks, but was not properly marked as live-in.

This patch:
- Adds code to check if SCC is used in the successor block
- Adds SCC as a live-in to successor blocks
- Handles both explicit and implicit uses of SCC
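
The steps listed above can be sketched in a toy model. The instruction representation (dicts with `uses` and `defs` sets, where `uses` holds both explicit and implicit operands) and the function name are assumptions made for this sketch. Stopping at a redefinition of the register is also an assumption and is not taken from the patch.

```python
def add_scc_live_in(successor_instrs, live_ins, reg="$scc"):
    """Add `reg` to `live_ins` if the successor block reads it before
    redefining it. Returns True if the live-in was added.

    Hypothetical model: each instruction is a dict with 'uses' and
    'defs' sets of register names.
    """
    for instr in successor_instrs:
        if reg in instr["uses"]:
            live_ins.add(reg)
            return True
        if reg in instr["defs"]:
            # The block overwrites the register first, so the incoming
            # value is not needed.
            return False
    return False


# Successor block that reads $scc in its second instruction.
reads_scc = [
    {"uses": {"$vgpr0"}, "defs": {"$vgpr1"}},
    {"uses": {"$scc"}, "defs": set()},
]
live_ins = set()
added = add_scc_live_in(reads_scc, live_ins)
```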

With this patch, the machine verifier no longer reports undefined $scc
errors in blocks following a kill terminator instruction.

Fixes #131298

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-04-30 09:02:45 -05:00
Akshat Oke
e91cbd4f29
[CodeGen][NPM] Port VirtRegRewriter to NPM (#130564) 2025-04-30 14:10:46 +05:30
Vikram Hegde
86d8e8d9a6
[CodeGen][NewPM] Port "PrologEpilogInserter" to NPM (#130550) 2025-04-29 13:13:45 +05:30
Sirish Pande
abec9ff47d
[AMDGPU] Correctly merge noalias scopes during lowering of LDS data. (#131664)
Currently, if noalias metadata is already present on loads and
stores, the lower module LDS pass generates a more conservative aliasing
set. This inhibits scheduling intrinsics that would otherwise have
produced a better-pipelined instruction sequence.

The fix is to stop always intersecting the existing noalias metadata
with the noalias metadata created for the lowering of LDS. Instead,
intersect only if the noalias scopes are from the same domain; otherwise,
concatenate the existing noalias sets with the LDS noalias.
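
The merge rule described above can be sketched as follows. This is a simplified model rather than the pass's actual code: scopes are represented as hypothetical `(domain, scope)` pairs, and "same domain" is taken to mean that the two lists share at least one domain.

```python
def merge_noalias(existing, lds):
    """Merge two noalias scope lists.

    existing, lds: lists of (domain, scope) pairs (hypothetical model
    of !noalias metadata).
    """
    existing_domains = {domain for domain, _ in existing}
    lds_domains = {domain for domain, _ in lds}
    if existing_domains & lds_domains:
        # Same domain: keep only the scopes present in both lists.
        return [scope for scope in existing if scope in lds]
    # Different domains: keep everything from both lists.
    return existing + lds


user_scopes = [("user", "a")]
lds_scopes = [("lds", "x"), ("lds", "y")]
concatenated = merge_noalias(user_scopes, lds_scopes)

intersected = merge_noalias([("lds", "x"), ("lds", "z")], lds_scopes)
```

Concatenation keeps the pre-existing scopes intact, which is what avoids the overly conservative result mentioned in the commit message.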

There have been a few patches for scoped AA in the past. The following
three should provide enough background information:
https://reviews.llvm.org/D91576
https://reviews.llvm.org/D108315
https://reviews.llvm.org/D110049

Essentially, after a pass that might change aliasing info, one should
check whether that pass changes the number of MayAlias or ModRef results
using the following:
`opt -S -aa-pipeline=basic-aa,scoped-noalias-aa -passes=aa-eval
-evaluate-aa-metadata -print-all-alias-modref-info -disable-output`
2025-04-28 14:02:18 -05:00
Shilei Tian
3570908519
[NFC][AMDGPU] Auto generate check lines for some codegen tests (#137534)
Make preparation for #137488.
2025-04-28 09:25:05 -04:00
David Stuttard
1a32613dac
[AMDGPU] Update pal metadata for v3.6 and fix v3.0 (#135196)
Update entry_point for all pal versions below 3.6.
3.6 and above removes entry_point.
2025-04-28 13:31:14 +01:00
John Brawn
dd87127f4e
[DAGCombiner] Eliminate fp casts if we have the right fast math flags (#131345)
When floating-point operations are legalized to operations of a higher
precision (e.g. f16 fadd being legalized to f32 fadd) then we get
narrowing then widening operations between each operation. With the
appropriate fast math flags (nnan ninf contract) we can eliminate these
casts.
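
To see why removing these casts needs permission from fast math flags, the following sketch shows that narrowing to f16 between operations can give a different result than keeping the intermediate value wide. It uses Python's `struct` module to round to binary16. Python floats are binary64 rather than f32, but every value in this example is exactly representable in f32 as well, so the arithmetic is the same.

```python
import struct


def round_to_f16(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


a, b, c = 2048.0, 1.0, 1.0

# With the casts: every intermediate result is narrowed back to f16.
with_casts = round_to_f16(round_to_f16(a + b) + c)

# With the casts eliminated: the intermediate stays wide and only the
# final result is narrowed.
without_casts = round_to_f16((a + b) + c)
```

Near 2048 the spacing between adjacent f16 values is 2, so each narrowed `+ 1.0` is rounded away, while the wide computation reaches 2050 before the single final rounding.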
2025-04-28 11:21:51 +01:00
Brox Chen
72bc0525d8
[AMDGPU][True16][CodeGen] update wwm reg sorting check condition (#135053)
We currently just need to shift down 32-bit wwm registers.

The previous check condition mistakenly selected 16-bit registers in
true16 mode. Update the check condition to skip 16-bit registers in wwm
reg sorting.
2025-04-27 14:30:34 -04:00
Shilei Tian
3bc125490a
[AMDGPU][Verifier] Check address space of alloca instruction (#135820)
This PR updates the `Verifier` to enforce that `alloca` instructions on
AMDGPU must be in AS5. This prevents hitting a misleading backend error
like "unable to select FrameIndex," which makes it look like a backend
bug when it's actually an IR-level issue.
2025-04-26 00:54:00 -04:00
Diana Picus
5bad5d84a1
Reland [AMDGPU] Support block load/store for CSR #130013 (#137169)
Add support for using the existing SCRATCH_STORE_BLOCK and
SCRATCH_LOAD_BLOCK instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, block-vgpr-csr. It
does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.

Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.
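
The stack-usage trade-off above can be made concrete with a small calculation. This is a sketch under stated assumptions: the helper name and mask representation are hypothetical, and the 4 bytes per register figure is derived from the 32-register, 128-byte slot mentioned in the text.

```python
BLOCK_REGS = 32  # registers covered by one block transfer
REG_BYTES = 4    # 128-byte slot / 32 registers


def block_slot_bytes(mask):
    """Stack bytes needed for one block.

    mask: bit i set means register i of the block must be transferred.
    Only unused registers at the END of the block are trimmed; gaps
    below the highest used register still occupy stack space.
    """
    assert 0 < mask < (1 << BLOCK_REGS)
    return REG_BYTES * mask.bit_length()


full_block = block_slot_bytes(0xFFFFFFFF)   # all 32 registers
low_eight = block_slot_bytes(0x000000FF)    # only registers 0-7
interleaved = block_slot_bytes(0x00FF00FF)  # registers 0-7 and 16-23
```

In the interleaved case, 16 registers are transferred but the slot still spans 24 registers, which is the "not compacted" cost described above.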

In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in SIFrameLowering. We also add new pseudos,
SI_BLOCK_SPILL_V1024_SAVE/RESTORE.

One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).

This was reverted due to failures in the builds with expensive checks
on, now fixed by always updating LiveIntervals and SlotIndexes in
SILowerSGPRSpills.
2025-04-25 11:29:27 +02:00
Matt Arsenault
4f5cfa81dc
AMDGPU: Remove amdhsa_code_object_version module flags from most tests (#136363)
These were added for the migration from v4 to v5 and should be removed
now that the default has changed.
2025-04-24 17:13:03 +02:00
Craig Topper
d43ce35048
[TableGen][GISel] Allow isTrivialOperatorNode to import patterns with isStore and a memory VT. (#137080)
This removes the need to explicitly set isTruncStore on truncstorei8 and
other similar PatFrags that include truncstore in their frags DAG.

This allows some new patterns to be imported for AMDGPU as you can see
in the changed test.

The extra isTruncStore were added in ae2b36e8bdfa6, along with some
other tablegen changes to look for MemoryVT along with isTruncStore. I
did not remove the code, because I'm not sure if any out of tree users
have become dependent on it. It's no longer exercised in tree.
2025-04-24 08:10:07 -07:00
anjenner
a3d05e8987
Remove an incorrect assert in MFMASmallGemmSingleWaveOpt. (#130131)
This assert was failing in a fuzzing test. I consulted with @jrbyrnes
who said:

The MFMASmallGemmSingleWaveOpt::apply() method is invoked if and only if
the user has inserted an intrinsic llvm.amdgcn.iglp.opt(i32 1) into
their source code. This intrinsic applies a highly specialized DAG
mutation to result in specific scheduling for a specific set of kernels.
These assertions are really just confirming that the characteristics of
the kernel match what is expected (i.e. the kernels are similar to the
ones this DAG mutation strategy was designed against).

However, if we apply this DAG mutation to kernels for which it was not
designed, then we may not find the types of instructions we are looking
for, and may end up with empty caches.

I think it should be fine to just return false if the cache is empty
instead of the assert.
2025-04-24 09:22:24 +01:00
Brox Chen
6dbc01e801
[AMDGPU][True16][CodeGen] update GFX11Plus codegen test with true16 flag (#135078)
This is an NFC patch.

This patch runs a bulk update on CodeGen tests that are impacted by the
true16 features. It applies the following:
1. duplicate GFX11plus runlines and apply them with
"-mattr=+real-true16" and "-mattr=-real-true16"
2. update the test with the update script

For some GISEL runlines, the current CodeGen does not fully support the
true16 version. These runlines are still updated, but the failing ones
are commented out and a "FIXME-TRUE16" comment is added to the test for
easier tracking. These tests will be fixed in follow-up patches.

This is a transition state in which we support both "+real-true16" and
"-real-true16" in our code base. We plan to move to "+real-true16" as
the default, and finally remove the "-real-true16" mode and test lines.
2025-04-23 13:06:52 -04:00
Diana Picus
6bb2f90557
Revert "[AMDGPU] Support block load/store for CSR" (#136846)
Reverts llvm/llvm-project#130013 due to failures with expensive checks
on.
2025-04-23 14:01:00 +02:00
Diana Picus
4a58071d87
[AMDGPU] Support block load/store for CSR (#130013)
Add support for using the existing `SCRATCH_STORE_BLOCK` and
`SCRATCH_LOAD_BLOCK` instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, `block-vgpr-csr`.
It does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.

Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.

In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in `SIFrameLowering`. We also add new pseudos,
`SI_BLOCK_SPILL_V1024_SAVE/RESTORE`.

One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).
2025-04-23 10:33:36 +02:00
zhijian lin
afda4c295b
Reland [SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#136701)
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
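
A toy model of the fold, for illustration only. The sentinel, the opcode strings, and the function name are hypothetical; the real change is in the C++ `getNode` code of SelectionDAG.

```python
POISON = "poison"  # sentinel standing in for a poison value


def fold_extend(opcode, operand, src_bits, dst_bits):
    """Constant-fold ZERO_EXTEND / SIGN_EXTEND of a src_bits-wide
    unsigned integer, or propagate poison."""
    if operand == POISON:
        # Poison of the extended type, rather than a constant zero.
        return POISON
    if opcode == "SIGN_EXTEND" and operand >> (src_bits - 1):
        # Replicate the sign bit into the new high bits.
        high_bits = ((1 << dst_bits) - 1) & ~((1 << src_bits) - 1)
        operand |= high_bits
    return operand


poison_result = fold_extend("ZERO_EXTEND", POISON, 8, 32)
sext_result = fold_extend("SIGN_EXTEND", 0x80, 8, 16)
zext_result = fold_extend("ZERO_EXTEND", 0x80, 8, 16)
```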
2025-04-22 17:36:41 -04:00
Pierre van Houtryve
ec3a90509d
[AMDGPU][InsertWaitCnts] Track global_wb/inv/wbinv (#135340)
wb/wbinv use storecnt, inv uses loadcnt.
Track them as VMEM_WRITE_ACCESS and VMEM_READ_ACCESS to avoid
InsertWaitCnt incorrectly eliminating the waitcnts after these instructions.
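
The mapping stated above can be written out as a lookup table. The dictionary layout and helper name are assumptions of this sketch; only the instruction, counter, and event names come from the commit message.

```python
# Which counter each cache-control instruction uses.
COUNTER_FOR = {
    "global_wb": "storecnt",
    "global_wbinv": "storecnt",
    "global_inv": "loadcnt",
}

# Which event each counter is tracked as.
EVENT_FOR = {
    "storecnt": "VMEM_WRITE_ACCESS",
    "loadcnt": "VMEM_READ_ACCESS",
}


def tracked_event(opcode):
    """Return the event under which `opcode` is tracked."""
    return EVENT_FOR[COUNTER_FOR[opcode]]


events = {op: tracked_event(op) for op in COUNTER_FOR}
```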

Solves SWDEV-526604
2025-04-22 14:53:55 +02:00
Pierre van Houtryve
47903e3372
[AMDGPU][InsertWaitCnts] Add test for global_wb/inv/wbinv tracking (#135339) 2025-04-22 14:50:43 +02:00
Pankaj Dwivedi
a25fdd7aca
Reapply "[AMDGPU] Insert readfirstlane in the function returns in sgpr." (#136678)
Reapply #135326 and fix the target-dependent constant check.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-04-22 17:48:55 +05:30
Mariusz Sikora
1a48e1df45
[AMDGPU] Do not fold COPY with implicit operands (#136003)
Folding may remove COPY from inside of the divergent loop.
2025-04-22 13:33:06 +02:00
Frederik Harwath
f541a3aad8
[AMDGPU] SIInstrInfo: Fix resultDependsOnExec for VOPC instructions (#134629)
SIInstrInfo::resultDependsOnExec assumes that operand 0 of a comparison
is always the destination of the instruction. This is not true for
instructions in VOPC form where it is "src0". This led to a crash in
machine-cse.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-04-22 10:17:35 +02:00
Changpeng Fang
a945f5917c
AMDGPU: Add global-isel checks and rename fptrunc.v2f16.fpmath.ll (#136609)
Also remove the checks with -enable-unsafe-fp-math (already in fptrunc.f16.ll)
2025-04-21 14:38:02 -07:00
Shilei Tian
9968ba8652 Revert "[AMDGPU] Insert readfirstlane in the function returns in sgpr. (#135326)"
This reverts commit 76ced7fa782f0d7db9efea871fa6de74706dd9cc since it breaks a
lot of bots.
2025-04-21 14:31:10 -04:00
Pankaj Dwivedi
76ced7fa78
[AMDGPU] Insert readfirstlane in the function returns in sgpr. (#135326)
Insert `readfirstlane` in the function returns in sgpr.
2025-04-21 21:57:16 +05:30
Nico Weber
e18a77cfbe Revert "[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)"
This reverts commit f12078e72601e7c03e5d66afab034313caf8f791.

Breaks `check-llvm`, see comments on https://github.com/llvm/llvm-project/pull/122741
2025-04-21 10:51:03 -04:00
zhijian lin
f12078e726
[SelectionDAG] Folding ZERO-EXTEND/SIGN_EXTEND poison to Poison value in getNode (#122741)
The PR will fix the issue
https://github.com/llvm/llvm-project/issues/122728

This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
2025-04-21 10:02:21 -04:00
Matt Arsenault
a3f8836ae8 AMDGPU: Regenerate baseline checks
Clean up now unnecessary second check prefix.
2025-04-18 22:07:47 +02:00
Matt Arsenault
9bdd9dc895
AMDGPU: Mark workitem ID intrinsics with range attribute (#136196)
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), and
we regress in undefined cases as we don't fold out asserts on undef.
2025-04-18 12:27:38 +02:00
Simon Pilgrim
64ffecfc43
[DAG] isKnownNeverNaN - add DemandedElts element mask to isKnownNeverNaN calls (#135952)
Matches what we've done for computeKnownBits etc. to improve vector handling
2025-04-18 09:24:02 +01:00
Shoreshen
a3f38f27cd
Revert "[AMDGPU] Implement vop3p complex pattern optmization for gisel" (#136249)
Reverts llvm/llvm-project#130234
2025-04-17 23:45:30 -04:00
Shoreshen
a04580f71b
[AMDGPU] Implement vop3p complex pattern optmization for gisel (#130234)
Seek opportunities to optimize VOP3P instructions by altering the opsel,
opsel_hi, neg, and neg_hi bits.

Test differences:
1. fix op_sel_hi bit for inline constant:
   1. `CodeGen/AMDGPU/packed-fp32.ll`
2. use neg bit to remove xor with 0x80008000
   1. `CodeGen/AMDGPU/strict_fsub.f16.ll`
   2. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.fdot2.ll`
   3. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.sdot4.ll`
   4. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.sdot8.ll`
   5. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.udot2.ll`
   6. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.udot4.ll`
   7. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.udot8.ll`
3. Remove xor 0x80008000, and use opsel, opsel_hi to remove alignbit
   1. `CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.sdot2.ll`
2025-04-18 10:56:20 +08:00
Changpeng Fang
8b46b98b91
AMDGPU: Fix the double rounding issue in v2f64 -> v2f16 conversion (#135659)
On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64
-> v2f16 Legal, we will generate the following sequence of instructions:
  v_cvt_f32_f64_e32 v1, s[6:7]
  v_cvt_f32_f64_e32 v2, s[4:5]
  v_cvt_pk_f16_f32 v1, v2, v1
This sequence may return imprecise results due to double rounding. This
patch fixes the issue by not marking the conversion Legal. While we may
still expect the above sequence of code when unsafe fpmath is set, I hope
https://github.com/llvm/llvm-project/pull/134738 can address that
performance concern.
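
The double rounding problem can be reproduced on the host with a single value. The sketch below uses Python's `struct` module to round a binary64 value to binary32 and binary16. The input is chosen for illustration (it lies just above the midpoint between two adjacent f16 values) and is not taken from the bug report.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f16(x):
    """Round a Python float (binary64) to the nearest binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Slightly above the midpoint between 1.0 and the next f16 value.
x = 1.0 + 2.0**-11 + 2.0**-30

direct = to_f16(x)            # single rounding: f64 -> f16
two_step = to_f16(to_f32(x))  # double rounding: f64 -> f32 -> f16
```

The first rounding to f32 discards the small `2**-30` term and leaves an exact tie, which the second rounding then resolves downward to 1.0. The direct conversion rounds upward to `1.0 + 2**-10`.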

Fixes: SWDEV-523856
2025-04-17 11:15:49 -07:00
Yingwei Zheng
5a993558c5
[InstSimplify] Fold getelementptr inbounds null, idx -> null (#130742)
Proof: https://alive2.llvm.org/ce/z/5ZkPx-
See also https://github.com/llvm/llvm-project/pull/130734 for the motivation.
2025-04-17 20:44:46 +08:00
Shoreshen
121cd7c6f0
Re apply 130577 narrow math for and operand (#133896)
Re-apply https://github.com/llvm/llvm-project/pull/130577,
which was reverted in https://github.com/llvm/llvm-project/pull/133880.

The previous version failed under AddressSanitizer because
`tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in
`AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, which caused a
use-after-free.

To fix this, `tryNarrowMathIfNoOverflow` is now called first, and
`visitBinaryOperator` returns immediately if it returns true.
2025-04-17 17:03:32 +08:00
Shoreshen
d647d66da6
[AMDGPU] Add illegal type convertion (#135729)
Add more bit-convert tests for illegal type conversions.
2025-04-17 12:14:34 +08:00
Vikram Hegde
123b0e2a1e
Reapply "[AMDGPU][GlobalISel] Properly handle lane op lowering for larger vector types (#132358)" (#135758)
reapply https://github.com/llvm/llvm-project/pull/132358, tests updated.
2025-04-16 11:28:28 +05:30
Jun Wang
31f39c8325
[AMDGPU] Remove the AnnotateKernelFeatures pass (#130198)
Previously the AnnotateKernelFeatures pass inferred two attributes:
amdgpu-calls and amdgpu-stack-objects, which are used to help determine
if flat scratch init is allowed. PR #118907 created the
amdgpu-no-flat-scratch-init attribute. Continuing with that work, this
patch makes use of this attribute to determine flat scratch init,
replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the
removal of the AnnotateKernelFeatures pass.
2025-04-15 15:17:33 -07:00
Kazu Hirata
f46cea5b42 Revert "[AMDGPU][GlobalISel] Properly handle lane op lowering for larger vector types (#132358)"
This reverts commit 62ef10a0f62c668e1fa7e357f56052f3364544c5.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/132358
2025-04-14 23:03:55 -07:00
Vikram Hegde
62ef10a0f6
[AMDGPU][GlobalISel] Properly handle lane op lowering for larger vector types (#132358)
Fixes https://github.com/llvm/llvm-project/issues/128650

Also adds a few previously existing permlane64 tests that were somehow
removed in the meantime.
2025-04-15 10:51:58 +05:30
Pierre van Houtryve
c9eebc7af4
[GlobalISel] Combine redundant sext_inreg (#131624) 2025-04-14 11:48:08 +02:00
Pierre van Houtryve
931a78a1db
[AMDGPU] Add sext_trunc in RegBankCombiner (#131623) 2025-04-14 10:15:29 +02:00