1622 Commits

Author SHA1 Message Date
Diana Picus
a910a6a8b5
[AMDGPU] AsmPrinter: Unify arg handling (#151672)
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function`  arguments in the `AMDGPUAsmPrinter`. This pretty much 
repeats some of the logic from instruction selection.

This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.

This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
2025-08-08 12:00:37 +02:00
Stanislav Mekhanoshin
469863111f
[AMDGPU] Enable CodeGen for v_pk_fma_bf16 (#152578) 2025-08-07 16:19:59 -07:00
Stanislav Mekhanoshin
abc22f771e
[AMDGPU] Fix buffer addressing mode matching (#152584)
Starting in gfx1250, voffset and immoffset are zero-extended from 32
bits
to 45 bits before being added together.
2025-08-07 14:23:41 -07:00
Stanislav Mekhanoshin
c2eddec4ff
[AMDGPU] System scope atomics are emulated over PCIe in gfx1250 (#152369)
HW will emulate unsupported PCIe atomics via CAS loop, we do not need to
expand these anymore.
2025-08-06 13:08:12 -07:00
Stanislav Mekhanoshin
b8eb61adc9
[AMDGPU] Implement addrspacecast from flat <-> private on gfx1250 (#152218) 2025-08-05 16:25:23 -07:00
Matt Arsenault
1d7a0fa08a
AMDGPU: Move asm constraint physreg parsing to utils (#150903)
Also fixes an assertion on out of bound physical register
indexes.
2025-08-01 16:11:11 +09:00
paperchalice
8bacfb2538
[AMDGPU] Remove UnsafeFPMath uses (#151079)
Remove `UnsafeFPMath` in AMDGPU part, it blocks some bugfixes related to
clang and the ultimate goal is to remove `resetTargetOptions` method in
`TargetMachine`, see FIXME in `resetTargetOptions`.
See also
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast

https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-07-31 17:36:57 +08:00
Kazu Hirata
c6a376371d
[AMDGPU] Remove an unnecessary cast (NFC) (#151279)
value() already returns uint64_t.
2025-07-30 07:29:57 -07:00
Stanislav Mekhanoshin
3dfd939a16
[AMDGPU] gfx1250 V_{MIN|MAX}_{I|U}64 opcodes (#151256) 2025-07-29 19:13:51 -07:00
Changpeng Fang
3b66d4a987
[AMDGPU] Support builtin/intrinsics for async loads/stores on gfx1250 (#151058) 2025-07-29 08:20:05 -07:00
Daniil Fukalov
e650c4b9ef
[NFC][AMDGPU] Move cmp+select arguments optimization to SIISelLowering. (#150929)
As requested in #148740.
2025-07-28 22:11:36 +02:00
Changpeng Fang
400ce1a3d3
[AMDGPU] Support AMDGPUClamp for bf16 on gfx1250 (#150663)
Scalar version uses V_MAX_BF16_PSEUDO which is expanded to V_PK_MAX_BF16
with unused high bits. If V_PK_MAX_BF16 is produced directly instead
that creates problem with folding of the clamp into other scalar
instructions due to incompatible clamp bits.

FIXME-TRUE16: enable bf16 clamp with true16

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-25 12:13:06 -07:00
Changpeng Fang
d7a38a94cd
[AMDGPU] Support builtin/intrinsics for load monitors on gfx1250 (#150540) 2025-07-24 16:23:33 -07:00
Stanislav Mekhanoshin
96e5eed92a
[AMDGPU] Select VMEM prefetch for llvm.prefetch on gfx1250 (#150493)
We have a choice to use a scalar or vector prefetch for an uniform
pointer. Since we do not have scalar stores our scalar cache is
practically readonly. The rw argument of the prefetch intrinsic is
used to force vector operation even for an uniform case. On GFX12
scalar prefetch will be used anyway, it is still useful but it will
only bring data to L2.
2025-07-24 13:22:50 -07:00
Stanislav Mekhanoshin
9deb7f6062
[AMDGPU] gfx1250 vmem prefetch target intrinsics and builtins (#150466) 2025-07-24 12:13:59 -07:00
Changpeng Fang
473bc0d188
[AMDGPU] Support V_FMA_MIX*_BF16 instructions on gfx1250 (#150381)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-24 09:43:49 -07:00
Changpeng Fang
9a563b08e2
[AMDGPU] Support V_PK_MIN3/MAX3_NUM_F16 on gfx1250 (#150326)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-23 15:15:19 -07:00
Stanislav Mekhanoshin
2346968807
[AMDGPU] Add V_ADD|SUB|MUL_U64 gfx1250 opcodes (#150291) 2025-07-23 13:17:56 -07:00
Changpeng Fang
bc1f85d234
AMDGPU: Support packed bf16 instructions on gfx1250 (#150283)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-23 12:01:23 -07:00
Shilei Tian
7fc65569c1
[AMDGPU] Mark amdgcn_tanh as canonicalized (#150059)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-22 20:03:39 -04:00
Shilei Tian
e801a10b44
[AMDGPU] Add the code generation support for llvm.[sin/cos].bf16 (#149631)
This is a partial support because some other instructions have not been upstreamed yet.
2025-07-21 11:01:59 -04:00
Shilei Tian
ba81903196
[gfx1250][SDAG] Lower unsafe bf16 divisions (#149628)
Co-authored-by: Kosarev, Ivan <Ivan.Kosarev@amd.com>
2025-07-21 10:58:08 -04:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Fabian Ritter
daa6de37ba
[AMDGPU][SDAG] Add target-specific ISD::PTRADD combines (#143673)
This patch adds several (AMDGPU-)target-specific DAG combines for
ISD::PTRADD nodes that reproduce existing similar transforms for
ISD::ADD nodes. There is no functional change intended for the existing
target-specific PTRADD combine.

For SWDEV-516125.
2025-07-18 10:00:54 +02:00
Shoreshen
f761d73265
[AMDGPU] Add freeze for LowerSELECT (#148796)
Trying to solve https://github.com/llvm/llvm-project/issues/147635

Add freeze for legalizer when breaking i64 select to 2 i32 select.

Several tests changed, still need to investigate why.

---------

Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-18 13:29:33 +08:00
Kazu Hirata
2da59287aa
[Target] Remove unnecessary casts (NFC) (#149342)
getFunction().getParent() already returns Module *.
2025-07-17 15:24:25 -07:00
Pierre van Houtryve
7d52b72239
[AMDGPU] Compute GISel KnownBits for S_BFE instructions (#141588)
Next patches in the stack will emit them in the RegBankCombiner. With this, S_BFE instructions will hopefully interfere less with optimizations.
2025-07-16 09:56:45 +02:00
Stanislav Mekhanoshin
2d6534b7da
[AMDGPU] gfx1250 64-bit relocations and fixups (#148951) 2025-07-15 17:13:42 -07:00
Changpeng Fang
868793fa8e
AMDGPU: Support intrinsic selection for gfx1250 wmma instructions (#148957)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Co-authored-by: Shilei Tian <Shilei.Tian@amd.com>
2025-07-15 15:25:05 -07:00
Shilei Tian
23ac7b938d
[AMDGPU] Add support for v_sqrt_bf16 on gfx1250 (#148921)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-15 16:15:47 -04:00
Stanislav Mekhanoshin
a32040e483
[AMDGPU] Use 64-bit literals in codegen on gfx1250 (#148727) 2025-07-14 15:47:24 -07:00
Shilei Tian
d7ec80c897
[AMDGPU] Add support for v_tanh_bf16 on gfx1250 (#147425)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-14 16:30:18 -04:00
Kazu Hirata
f1791c0ae3
[AMDGPU] Remove unnecessary casts (NFC) (#148340)
getRegisterInfo() already returns const SIRegisterInfo *.

Likewise, getInstrInfo() already returns const SIInstrInfo *.
2025-07-12 11:28:41 -07:00
Boyao Wang
697beb3f17
[TargetLowering] Change getOptimalMemOpType and findOptimalMemOpLowering to take LLVM Context (#147664)
Add LLVM Context to getOptimalMemOpType and findOptimalMemOpLowering. So
that we can use EVT::getVectorVT to generate EVT type in
getOptimalMemOpType.

Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
2025-07-10 11:11:09 +08:00
Stanislav Mekhanoshin
d0a4af725e
[AMDGPU] Add FeatureIEEEMinimumMaximumInsts. NFCI. (#147594)
Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>
2025-07-08 14:32:44 -07:00
Changpeng Fang
5035d20dcb
AMDGPU: Implement ds_atomic_async_barrier_arrive_b64/ds_atomic_barrier_arrive_rtn_b64 (#146409)
These two instructions are supported by gfx1250. We define the
instructions and implement the corresponding intrinsic and builtin.

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-01 11:08:49 -07:00
Changpeng Fang
1f5f381920
AMDGPU: Implement intrinsic/builtins for gfx1250 load transpose instructions (#146289) 2025-06-29 14:33:31 -07:00
Fabian Ritter
215e61c088
[AMDGPU][SDAG] Add ISD::PTRADD DAG combines (#142739)
This patch focuses on generic DAG combines, plus an AMDGPU-target-specific one
that is closely connected.

The generic DAG combine is based on a part of PR #105669 by rgwott, which was
adapted from work by jrtc27, arichardson, davidchisnall in the CHERI/Morello
LLVM tree. I added some parts and removed several disjuncts from the
reassociation condition:
- `isNullConstant(X)`, since there are address spaces where 0 is a perfectly
  normal value that shouldn't be treated specially,
- `(YIsConstant && ZOneUse)` and `(N0OneUse && ZOneUse && !ZIsConstant)`, since
  they cause regressions in AMDGPU.

For SWDEV-516125.
2025-06-26 09:40:04 +02:00
Matt Arsenault
020fefb6af
AMDGPU: Avoid report_fatal_error in image intrinsic lowering (#145201) 2025-06-26 00:00:36 +09:00
Chinmay Deshpande
3413aa83f3
Revert "[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (… (#145580)
…#121465)"

This reverts commit 211bcf67aadb1175af382f55403ae759177281c7.
2025-06-24 16:10:27 -04:00
Matt Arsenault
48155f93dd
CodeGen: Emit error if getRegisterByName fails (#145194)
This avoids using report_fatal_error and standardizes the error
message in a subset of the error conditions.
2025-06-23 16:33:35 +09:00
Matt Arsenault
16607f6437
AMDGPU: Fix typo in argument allocation error message (#145265) 2025-06-23 16:26:10 +09:00
Aaditya
6a0593b0a3
[AMDGPU] Extend wave reduce intrinsics for i32 type (#126469)
Currently, wave reduction intrinsics are supported for `umin` and `umax`
operations for `i32` type only.
This patch extends support for the following operations: 
`add`, `sub`, `min`, `max`, `and`, `or`, `xor` for `i32` type.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-06-23 10:31:22 +05:30
Matt Arsenault
ed155ff9f2
AMDGPU: Avoid report_fatal_error on ds ordered intrinsics (#145202) 2025-06-23 13:09:09 +09:00
Matt Arsenault
584a2c2e7c
AMDGPU: Avoid report_fatal_error for reporting libcalls (#145134) 2025-06-22 23:10:18 +09:00
Matt Arsenault
f280d3b705
AMDGPU: Avoid report_fatal_error for getRegisterByName subtarget case (#145173) 2025-06-22 08:19:19 +09:00
Jay Foad
6e86b7e34b
[AMDGPU] Do not replace SALU floating point multiply with VALU-only ldexp (#145048) 2025-06-20 16:52:43 +01:00
Matt Arsenault
dd4776d429
AMDGPU: Remove AMDGPUInstrInfo class (#144984)
This was never constructed and only provided one static helper
function.
2025-06-20 18:26:56 +09:00
Nicolai Hähnle
3bee9ba015
AMDGPU/GFX12: Fix s_barrier_signal_isfirst for single-wave workgroups (#143634)
Barrier instructions are no-ops in single-wave workgroups. This includes
s_barrier_signal_isfirst, which will leave SCC unmodified.

Model this correctly (via an implicit use of SCC) and ensure SCC==1
before the barrier instruction (if the wave is the only one of the
workgroup, then it is the first).

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-06-19 11:22:49 -07:00
Matt Arsenault
c80282d333
AMDGPU: Directly select minimumnum/maximumnum with ieee_mode=0 (#141903)
The hardware min/max follow the IR rules with IEEE mode disabled,
so we can avoid the canonicalizes of the input. We lose the quieting
of a signaling nan if both inputs are nans, but we only require that
with strictfp.
2025-06-18 00:27:41 +09:00