1630 Commits

Author SHA1 Message Date
Shilei Tian
b170f17861
[AMDGPU] Add support for safe bfloat16 fdiv on targets with bf16 trans instructions (#154373)
Recent changes introduced custom lowering for bf16 fdiv on targets that
support bf16 trans instructions, but only covered the unsafe version.
This PR extends that support to the safe variant.

For the safe version, the op is lowered by converting to float,
performing the div in float, and converting the result back to bf16.
This matches the behavior on targets that don't support bf16 trans
instructions.
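
The extend, divide, narrow sequence described above can be modeled on the host roughly as follows. This is only a sketch: the helper names are hypothetical, bf16 is modeled as the upper 16 bits of a binary32 value, and round-to-nearest-even narrowing is an assumption rather than something the commit message states.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// bf16 is modeled here as the upper 16 bits of an IEEE-754 binary32 value.
static float bf16ToFloat(uint16_t B) {
  uint32_t Bits = uint32_t(B) << 16;
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Narrow with round-to-nearest-even (assumed rounding mode).
static uint16_t floatToBF16(float F) {
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  if (std::isnan(F))
    return uint16_t((Bits >> 16) | 0x0040); // keep sign, force a quiet NaN
  uint32_t Lsb = (Bits >> 16) & 1;
  Bits += 0x7FFFu + Lsb;
  return uint16_t(Bits >> 16);
}

// Hypothetical helper: extend both operands, divide in float, narrow back.
uint16_t safeBF16Div(uint16_t A, uint16_t B) {
  return floatToBF16(bf16ToFloat(A) / bf16ToFloat(B));
}

// Usage: 1.0 (0x3F80) divided by 2.0 (0x4000) is 0.5 (0x3F00).
void example() { assert(safeBF16Div(0x3F80, 0x4000) == 0x3F00); }
```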

Fixes SWDEV-550381.
2025-08-19 16:03:45 -04:00
Pierre van Houtryve
6f7c77fe90
[AMDGPU] Check noalias.addrspace in mayAccessScratchThroughFlat (#151319)
PR #149247 made the MD accessible by the backend so we can now leverage
it in the memory model. The first use case here is detecting if a flat op
can access scratch memory.
Benefits both the MemoryLegalizer and InsertWaitCnt.
2025-08-19 07:42:59 +02:00
Stanislav Mekhanoshin
c328c5d911
[AMDGPU] Combine to bf16 reciprocal square root. (#154185)
Co-authored-by: Ivan Kosarev <Ivan.Kosarev@amd.com>
2025-08-18 13:07:20 -07:00
Robert Imschweiler
d21feb5e66
[AMDGPU] Fix crash for inline-asm inputs of type MVT::Other (#153425) 2025-08-13 17:27:31 +02:00
Stanislav Mekhanoshin
d0ee82040c
[AMDGPU] Add s_barrier_init|join|leave instructions (#153296) 2025-08-12 15:07:07 -07:00
Matt Arsenault
ff53086924
AMDGPU: Add new VA inline asm constraint for AV registers (#152665)
Add a new constraint corresponding to the AV_* register classes
for operands which can allocate AGPRs or VGPRs. This applies
to loads and stores on gfx90a+, and srcA / srcB for MFMA instructions.

The error emitted on unsupported targets isn't ideal: it is
produced by the register allocator without a rationale, but it is
consistent with the existing errors.

I mostly want this for writing allocation tests.
2025-08-12 10:17:28 +09:00
Stanislav Mekhanoshin
ea14834966
[AMDGPU] Per-subtarget DPP instruction classification (#153096)
This is NFCI at this point.
2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin
10e146a716
[AMDGPU] Fix out of bound physreg tuple condition. NFC. (#152777)
The end register of the tuple shall be below the last existing
register. The check does not work on something like {v[255:256]}.
Overall it works correctly because it fails later at the
getMatchingSuperReg() call.
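
As an illustration of the condition described above, a tuple bound check might look like the sketch below. The function name and the 256-register file size are assumptions for the example, not the code touched by this commit.

```cpp
#include <cassert>

// Hypothetical model: a register file with NumRegs registers, indexed
// 0..NumRegs-1. A tuple v[Lo:Hi] is in range only if Lo <= Hi and Hi itself
// names an existing register.
constexpr bool isTupleInRange(unsigned Lo, unsigned Hi, unsigned NumRegs) {
  return Lo <= Hi && Hi < NumRegs;
}

// Usage, assuming 256 registers (v0..v255): v[254:255] is accepted,
// v[255:256] is rejected because its end register does not exist.
void example() {
  assert(isTupleInRange(254, 255, 256));
  assert(!isTupleInRange(255, 256, 256));
}
```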
2025-08-09 01:50:13 -07:00
Diana Picus
a910a6a8b5
[AMDGPU] AsmPrinter: Unify arg handling (#151672)
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much
repeats some of the logic from instruction selection.

This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.

This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
2025-08-08 12:00:37 +02:00
Stanislav Mekhanoshin
469863111f
[AMDGPU] Enable CodeGen for v_pk_fma_bf16 (#152578) 2025-08-07 16:19:59 -07:00
Stanislav Mekhanoshin
abc22f771e
[AMDGPU] Fix buffer addressing mode matching (#152584)
Starting in gfx1250, voffset and immoffset are zero-extended from 32 bits
to 45 bits before being added together.
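
The arithmetic can be sketched as below. The function names are hypothetical, and the 32-bit wrapping add is shown only for contrast; it is not a claim about how any particular earlier target behaves.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: zero-extend both 32-bit offsets, then add. The sum of
// two 32-bit values needs at most 33 bits, so it always fits in 45 bits.
uint64_t addOffsetsZext(uint32_t VOffset, uint32_t ImmOffset) {
  constexpr uint64_t Mask45 = (uint64_t(1) << 45) - 1;
  return (uint64_t(VOffset) + uint64_t(ImmOffset)) & Mask45;
}

// For contrast only: a plain 32-bit add wraps around on overflow.
uint32_t addOffsetsWrap32(uint32_t VOffset, uint32_t ImmOffset) {
  return VOffset + ImmOffset;
}

// The two models disagree exactly when the 32-bit add would carry out.
void example() {
  assert(addOffsetsWrap32(0xFFFFFFFFu, 1) == 0);
  assert(addOffsetsZext(0xFFFFFFFFu, 1) == 0x100000000ull);
}
```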
2025-08-07 14:23:41 -07:00
Stanislav Mekhanoshin
c2eddec4ff
[AMDGPU] System scope atomics are emulated over PCIe in gfx1250 (#152369)
HW will emulate unsupported PCIe atomics via CAS loop, we do not need to
expand these anymore.
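
For readers unfamiliar with the term, a CAS loop emulates a read-modify-write atomic by retrying a compare-and-swap until no other writer intervenes. The following host-side sketch uses `std::atomic` and an unsigned max as the operation; the function name and the choice of operation are illustrative assumptions, not what the hardware or the compiler emits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Emulate an atomic "fetch max" with a compare-and-swap loop.
// Returns the value observed immediately before the update.
uint32_t fetchMaxViaCAS(std::atomic<uint32_t> &Target, uint32_t Operand) {
  uint32_t Old = Target.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak stores the current value into Old,
  // so the loop simply recomputes the desired value and retries.
  while (!Target.compare_exchange_weak(Old, Old > Operand ? Old : Operand,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
  }
  return Old;
}

// Usage: raising 3 to 7 returns the previous value and stores the maximum.
void example() {
  std::atomic<uint32_t> X{3};
  assert(fetchMaxViaCAS(X, 7) == 3);
  assert(X.load() == 7);
}
```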
2025-08-06 13:08:12 -07:00
Stanislav Mekhanoshin
b8eb61adc9
[AMDGPU] Implement addrspacecast from flat <-> private on gfx1250 (#152218) 2025-08-05 16:25:23 -07:00
Matt Arsenault
1d7a0fa08a
AMDGPU: Move asm constraint physreg parsing to utils (#150903)
Also fixes an assertion on out of bound physical register
indexes.
2025-08-01 16:11:11 +09:00
paperchalice
8bacfb2538
[AMDGPU] Remove UnsafeFPMath uses (#151079)
Remove `UnsafeFPMath` uses in the AMDGPU backend. It blocks some bugfixes
related to clang, and the ultimate goal is to remove the `resetTargetOptions`
method in `TargetMachine`; see the FIXME in `resetTargetOptions`.
See also
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast

https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-07-31 17:36:57 +08:00
Kazu Hirata
c6a376371d
[AMDGPU] Remove an unnecessary cast (NFC) (#151279)
value() already returns uint64_t.
2025-07-30 07:29:57 -07:00
Stanislav Mekhanoshin
3dfd939a16
[AMDGPU] gfx1250 V_{MIN|MAX}_{I|U}64 opcodes (#151256) 2025-07-29 19:13:51 -07:00
Changpeng Fang
3b66d4a987
[AMDGPU] Support builtin/intrinsics for async loads/stores on gfx1250 (#151058) 2025-07-29 08:20:05 -07:00
Daniil Fukalov
e650c4b9ef
[NFC][AMDGPU] Move cmp+select arguments optimization to SIISelLowering. (#150929)
As requested in #148740.
2025-07-28 22:11:36 +02:00
Changpeng Fang
400ce1a3d3
[AMDGPU] Support AMDGPUClamp for bf16 on gfx1250 (#150663)
The scalar version uses V_MAX_BF16_PSEUDO, which is expanded to
V_PK_MAX_BF16 with unused high bits. Producing V_PK_MAX_BF16 directly
instead creates a problem with folding the clamp into other scalar
instructions due to incompatible clamp bits.

FIXME-TRUE16: enable bf16 clamp with true16

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-25 12:13:06 -07:00
Changpeng Fang
d7a38a94cd
[AMDGPU] Support builtin/intrinsics for load monitors on gfx1250 (#150540) 2025-07-24 16:23:33 -07:00
Stanislav Mekhanoshin
96e5eed92a
[AMDGPU] Select VMEM prefetch for llvm.prefetch on gfx1250 (#150493)
We have a choice to use a scalar or vector prefetch for a uniform
pointer. Since we do not have scalar stores, our scalar cache is
practically read-only. The rw argument of the prefetch intrinsic is
used to force a vector operation even for a uniform case. On GFX12
scalar prefetch will be used anyway; it is still useful, but it will
only bring data to L2.
2025-07-24 13:22:50 -07:00
Stanislav Mekhanoshin
9deb7f6062
[AMDGPU] gfx1250 vmem prefetch target intrinsics and builtins (#150466) 2025-07-24 12:13:59 -07:00
Changpeng Fang
473bc0d188
[AMDGPU] Support V_FMA_MIX*_BF16 instructions on gfx1250 (#150381)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-24 09:43:49 -07:00
Changpeng Fang
9a563b08e2
[AMDGPU] Support V_PK_MIN3/MAX3_NUM_F16 on gfx1250 (#150326)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-23 15:15:19 -07:00
Stanislav Mekhanoshin
2346968807
[AMDGPU] Add V_ADD|SUB|MUL_U64 gfx1250 opcodes (#150291) 2025-07-23 13:17:56 -07:00
Changpeng Fang
bc1f85d234
AMDGPU: Support packed bf16 instructions on gfx1250 (#150283)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-23 12:01:23 -07:00
Shilei Tian
7fc65569c1
[AMDGPU] Mark amdgcn_tanh as canonicalized (#150059)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-22 20:03:39 -04:00
Shilei Tian
e801a10b44
[AMDGPU] Add the code generation support for llvm.[sin/cos].bf16 (#149631)
This is partial support because some other instructions have not been upstreamed yet.
2025-07-21 11:01:59 -04:00
Shilei Tian
ba81903196
[gfx1250][SDAG] Lower unsafe bf16 divisions (#149628)
Co-authored-by: Kosarev, Ivan <Ivan.Kosarev@amd.com>
2025-07-21 10:58:08 -04:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
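
The EXEC handling described above can be pictured with a small host-side model. Everything here is a hypothetical simplification: a 64-lane wave is a single mask, `callWholeWave` stands in for the prologue and epilogue, and `%active` is modeled as the saved EXEC mask passed to the body.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical model of a 64-lane wave: one EXEC bit per lane.
struct Wave {
  uint64_t Exec;
};

// Model of entering and leaving a whole wave function.
void callWholeWave(Wave &W, const std::function<void(Wave &, uint64_t)> &Body) {
  uint64_t Saved = W.Exec; // prologue: remember the original EXEC
  W.Exec = ~uint64_t(0);   // prologue: EXEC = -1, all lanes enabled
  Body(W, Saved);          // body sees the original mask as %active
  W.Exec = Saved;          // epilogue: restore the original EXEC
}

// Usage: the body runs with every lane enabled; the caller's mask survives.
void example() {
  Wave W{0x0F};
  callWholeWave(W, [](Wave &In, uint64_t Active) {
    assert(In.Exec == ~uint64_t(0));
    assert(Active == 0x0F);
  });
  assert(W.Exec == 0x0F);
}
```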

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Fabian Ritter
daa6de37ba
[AMDGPU][SDAG] Add target-specific ISD::PTRADD combines (#143673)
This patch adds several (AMDGPU-)target-specific DAG combines for
ISD::PTRADD nodes that reproduce existing similar transforms for
ISD::ADD nodes. There is no functional change intended for the existing
target-specific PTRADD combine.

For SWDEV-516125.
2025-07-18 10:00:54 +02:00
Shoreshen
f761d73265
[AMDGPU] Add freeze for LowerSELECT (#148796)
Trying to solve https://github.com/llvm/llvm-project/issues/147635

Add a freeze in the legalizer when breaking an i64 select into two i32 selects.
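
The split itself can be sketched on the host as follows. The function name is hypothetical. C++ has no notion of poison, so the comment about the freeze reflects an assumption about its purpose: after the split the condition has two uses, and freezing it would make both uses observe the same value.

```cpp
#include <cassert>
#include <cstdint>

// Split a 64-bit select into two 32-bit selects on the same condition.
// In IR the condition would be frozen once and then used by both halves
// (assumed motivation for the freeze).
uint64_t select64ViaTwo32(bool Cond, uint64_t A, uint64_t B) {
  uint32_t Lo = Cond ? uint32_t(A) : uint32_t(B);
  uint32_t Hi = Cond ? uint32_t(A >> 32) : uint32_t(B >> 32);
  return (uint64_t(Hi) << 32) | Lo;
}

// Usage: the split form agrees with a direct 64-bit select.
void example() {
  assert(select64ViaTwo32(true, 0x1122334455667788ull, 0) ==
         0x1122334455667788ull);
}
```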

Several tests changed, still need to investigate why.

---------

Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-18 13:29:33 +08:00
Kazu Hirata
2da59287aa
[Target] Remove unnecessary casts (NFC) (#149342)
getFunction().getParent() already returns Module *.
2025-07-17 15:24:25 -07:00
Pierre van Houtryve
7d52b72239
[AMDGPU] Compute GISel KnownBits for S_BFE instructions (#141588)
Next patches in the stack will emit them in the RegBankCombiner. With this, S_BFE instructions will hopefully interfere less with optimizations.
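
As a rough illustration of what known bits can say about a bitfield extract, the sketch below models an unsigned extract with explicit offset and width. The helper names are hypothetical, and the model ignores the hardware operand encoding, the signed variant, and edge cases such as a zero width.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned bitfield extract: Width bits of Src starting at bit Offset.
uint32_t bfeU32(uint32_t Src, unsigned Offset, unsigned Width) {
  assert(Offset < 32 && Width > 0 && Width < 32);
  return (Src >> Offset) & ((uint32_t(1) << Width) - 1);
}

// Bits of the result that are zero for every possible Src: everything at or
// above bit Width.
uint32_t bfeU32KnownZero(unsigned Width) {
  assert(Width > 0 && Width < 32);
  return ~((uint32_t(1) << Width) - 1);
}
```

With this model, an 8-bit extract has its upper 24 bits known to be zero regardless of the source value, which is the kind of fact later combines can rely on.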
2025-07-16 09:56:45 +02:00
Stanislav Mekhanoshin
2d6534b7da
[AMDGPU] gfx1250 64-bit relocations and fixups (#148951) 2025-07-15 17:13:42 -07:00
Changpeng Fang
868793fa8e
AMDGPU: Support intrinsic selection for gfx1250 wmma instructions (#148957)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Co-authored-by: Shilei Tian <Shilei.Tian@amd.com>
2025-07-15 15:25:05 -07:00
Shilei Tian
23ac7b938d
[AMDGPU] Add support for v_sqrt_bf16 on gfx1250 (#148921)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-15 16:15:47 -04:00
Stanislav Mekhanoshin
a32040e483
[AMDGPU] Use 64-bit literals in codegen on gfx1250 (#148727) 2025-07-14 15:47:24 -07:00
Shilei Tian
d7ec80c897
[AMDGPU] Add support for v_tanh_bf16 on gfx1250 (#147425)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-14 16:30:18 -04:00
Kazu Hirata
f1791c0ae3
[AMDGPU] Remove unnecessary casts (NFC) (#148340)
getRegisterInfo() already returns const SIRegisterInfo *.

Likewise, getInstrInfo() already returns const SIInstrInfo *.
2025-07-12 11:28:41 -07:00
Boyao Wang
697beb3f17
[TargetLowering] Change getOptimalMemOpType and findOptimalMemOpLowering to take LLVM Context (#147664)
Add LLVM Context to getOptimalMemOpType and findOptimalMemOpLowering so
that we can use EVT::getVectorVT to generate an EVT type in
getOptimalMemOpType.

Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
2025-07-10 11:11:09 +08:00
Stanislav Mekhanoshin
d0a4af725e
[AMDGPU] Add FeatureIEEEMinimumMaximumInsts. NFCI. (#147594)
Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>
2025-07-08 14:32:44 -07:00
Changpeng Fang
5035d20dcb
AMDGPU: Implement ds_atomic_async_barrier_arrive_b64/ds_atomic_barrier_arrive_rtn_b64 (#146409)
These two instructions are supported by gfx1250. We define the
instructions and implement the corresponding intrinsic and builtin.

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-01 11:08:49 -07:00
Changpeng Fang
1f5f381920
AMDGPU: Implement intrinsic/builtins for gfx1250 load transpose instructions (#146289) 2025-06-29 14:33:31 -07:00
Fabian Ritter
215e61c088
[AMDGPU][SDAG] Add ISD::PTRADD DAG combines (#142739)
This patch focuses on generic DAG combines, plus an AMDGPU-target-specific one
that is closely connected.

The generic DAG combine is based on a part of PR #105669 by rgwott, which was
adapted from work by jrtc27, arichardson, davidchisnall in the CHERI/Morello
LLVM tree. I added some parts and removed several disjuncts from the
reassociation condition:
- `isNullConstant(X)`, since there are address spaces where 0 is a perfectly
  normal value that shouldn't be treated specially,
- `(YIsConstant && ZOneUse)` and `(N0OneUse && ZOneUse && !ZIsConstant)`, since
  they cause regressions in AMDGPU.
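
To make the idea of reassociating pointer additions concrete, the sketch below models `ptradd` as wrapping 64-bit address arithmetic and compares a nested form with one where the offsets are combined first. The specific shape shown is an assumption chosen for illustration; the actual combine and its guarding conditions are in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Model ptradd as wrapping arithmetic on a 64-bit address.
constexpr uint64_t ptradd(uint64_t Base, uint64_t Offset) {
  return Base + Offset;
}

// (ptradd (ptradd X, Y), Z)
constexpr uint64_t nested(uint64_t X, uint64_t Y, uint64_t Z) {
  return ptradd(ptradd(X, Y), Z);
}

// (ptradd X, (add Y, Z)): the two offsets are combined first.
constexpr uint64_t reassociated(uint64_t X, uint64_t Y, uint64_t Z) {
  return ptradd(X, Y + Z);
}

// Usage: both forms compute the same address, including when X is 0.
void example() {
  assert(nested(0x1000, 16, 32) == reassociated(0x1000, 16, 32));
  assert(nested(0, 16, 32) == reassociated(0, 16, 32));
}
```

The two forms are arithmetically equal, so the reassociation conditions in the combine are about profitability rather than correctness of the address value.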

For SWDEV-516125.
2025-06-26 09:40:04 +02:00
Matt Arsenault
020fefb6af
AMDGPU: Avoid report_fatal_error in image intrinsic lowering (#145201) 2025-06-26 00:00:36 +09:00
Chinmay Deshpande
3413aa83f3
Revert "[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (… (#145580)
…#121465)"

This reverts commit 211bcf67aadb1175af382f55403ae759177281c7.
2025-06-24 16:10:27 -04:00
Matt Arsenault
48155f93dd
CodeGen: Emit error if getRegisterByName fails (#145194)
This avoids using report_fatal_error and standardizes the error
message in a subset of the error conditions.
2025-06-23 16:33:35 +09:00
Matt Arsenault
16607f6437
AMDGPU: Fix typo in argument allocation error message (#145265) 2025-06-23 16:26:10 +09:00