1333 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
421557974a
[AMDGPU] Use glue for convergence tokens at call-like operations (#86766)
The earlier implementation on AMDGPU used explicit token operands at
SI_CALL and SI_CALL_ISEL. This is now replaced with CONVERGENCECTRL_GLUE
operands, with the following effects:

- The treatment of tokens at call-like operations is now consistent with
the treatment at intrinsics.
- Support for tail calls using implicit tokens at SI_TCRETURN "just
works".
- The extra parameter at call-like instructions is eliminated, thus
restoring those instructions and their handling to the original state.

The new glue node is placed after the existing glue node for the
outgoing call parameters, which seems to not interfere with selection of
the call-like nodes.
2024-04-01 10:51:13 +05:30
Thomas Symalla
256343a0e9
Revert "Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226" (#86273)
Reverts llvm/llvm-project#81394

This reverts commit 3ac243bc0d7922d083af2cf025247b5698556062.
It is not handling RSrc registers s0-s3 correctly. This leads to a
broken test, where it expects s0-s3 as function argument and uses it as
RSrc register as well.
We need to re-visit the patch, but apparently we only want to have s0-s3
as
argument registers if we don't need them as RSrc registers.
2024-03-26 11:01:08 +01:00
Changpeng Fang
350bda4419
AMDGPU: Rename intrinsics and remove f16/bf16 versions for load transpose (#86313)
Rename the intrinsics to close to the instruction mnemonic names:
Use global_load_tr_b64 and global_load_tr_b128 instead of
global_load_tr.

This patch also removes f16/bf16 versions of builtins/intrinsics. To
simplify the design, we should avoid enumerating all possible types in
implementing builtins. We can always use bitcast.
2024-03-25 16:55:22 -07:00
David Stuttard
75e528fdd9
[AMDGPU] Extend zero initialization of return values for TFE (#85759)
buffer_load instructions that use TFE also need to zero initialize
return values similar to how the image instructions currently work. Add
support for this with standard zero init of all results + zero init of
just TFE flag when enable-prt-strict-null subtarget feature is disabled.
2024-03-25 09:01:46 +00:00
Pierre van Houtryve
babbdad15b
[AMDGPU] Handle non-register operands for S_SUB/ADD_U64_PSEUDO (#86104)
This pseudo uses SSrc_b64 so it allows both an immediate or a register,
but the lowering crashed on immediate operands.
2024-03-25 09:23:40 +01:00
SahilPatidar
3ac243bc0d
Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226 (#81394)
Resolve #78226
2024-03-21 16:52:08 +05:30
Harald van Dijk
ceb744eb2f
[AMDGPU] Fix canonicalization of truncated values. (#83054)
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes #82937.
2024-03-13 12:08:39 +00:00
Arthur Eubanks
94c988bcfd [NFC] Remove unused parameter from shouldAssumeDSOLocal() 2024-03-11 19:48:17 +00:00
Shilei Tian
e963d0740e
[AMDGPU] Replace isInlinableLiteral16 with specific version (#84402)
The current implementation of `isInlinableLiteral16` assumes, a 16-bit
inlinable
literal is either an `i16` or a `fp16`. This is not always true because
of
`bf16`. However, we can't tell `fp16` and `bf16` apart by just looking
at the
value. This patch splits `isInlinableLiteral16` into three versions,
`i16`,
`fp16`, `bf16` respectively, and call the corresponding version.
2024-03-08 14:49:52 -05:00
Krzysztof Drewniak
6540f1635a
[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952)
This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type `ptr addrspace(7)` from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled
by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:
1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of
i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr operations
are used to preserve the expected memory layout

2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with
the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two
parts of a buffer fat pointer (the 128-bit address space 8 resource and
the 32-bit address space 6 offset) visible in the IR. This also impacts
the argument and return types of functions.

3. *Splitting* the resource and offset parts. All instructions that
produce or consume buffer fat pointers (like GEP or load) are rewritten
to produce or consume the resource and offset parts separately. For
example, GEP updates the offset part of the result and a load uses the
resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed IR
escapes this pass. (For operations like call, where the struct form is
needed, insertelement operations are inserted).

Compared to LGC's PatchBufferOp (

32cda89776/lgc/patch/PatchBufferOp.cpp
): this pass
- Also handles vectors of ptr addrspace(7)s
- Also handles function boundaries
- Includes the same uniform buffer optimization for loops and
conditionals
- Does *not* handle memcpy() and friends (this is future work)
- Does *not* break up large loads and stores into smaller parts. This
should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized). That
work is planned for a followup commit.
- Does *not* have special logic for handling divergent buffer
descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops later
in legalization.

As a final matter, this commit updates atomic expansion to treat buffer
operations analogously to global ones.

(One question for reviewers: is the new pass is the right place? Should
it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463
2024-03-06 09:49:58 -06:00
Mirko Brkušanin
1fd1f4c0e1
[AMDGPU] Handle amdgpu.last.use metadata (#83816)
Convert !amdgpu.last.use metadata into MachineMemOperand for last use
and handle it in SIMemoryLegalizer similar to nontemporal and volatile.
2024-03-06 16:33:52 +01:00
Joseph Huber
1fc5e50ceb
[AMDGPU] Implement 'llvm.get.fpenv' and 'llvm.set.fpenv' (#83906)
Summary:
This patch implements the LLVM floating point environment control
intrinsics and also exposes it through clang. We encode the floating
point environment as a 64-bit value that simply concatenates the values
of the mode registers and the current trap status. We only fetch the
bits relevant for floating point instructions. That is, rounding mode,
denormalization mode, ieee, dx10 clamp, debug, enabled traps, f16
overflow, and active exceptions.
2024-03-06 08:11:54 -06:00
Shilei Tian
e9c1dbb408 Revert "[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345)"
This reverts commit 530f0e64ec11327879c44f2fd55c7c28efdbaa2d because it breaks
downstream.
2024-03-06 08:42:54 -05:00
Sameer Sahasrabuddhe
60822637bf Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-06 12:19:32 +05:30
Martin Wehking
d35f2c439a
Remove constant local variable (#83850)
Remove isThisReturn, which always has the value false.
Replace its uses with false directly.
2024-03-06 00:53:09 +05:30
Mitch Phillips
f010b1bef4 Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785)""
This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.

Reason: Broke the sanitizer buildbots. See the comments at
https://github.com/llvm/llvm-project/pull/71785
for more information.
2024-03-04 17:05:34 +01:00
Shilei Tian
530f0e64ec
[AMDGPU] Replace isInlinableLiteral16 with specific version (#81345) 2024-03-04 08:40:42 -05:00
Sameer Sahasrabuddhe
c7fdd8c11e Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
Original commit 79889734b940356ab3381423c93ae06f22e772c9.
Perviously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-04 13:28:04 +05:30
Pierre van Houtryve
756166e342
[AMDGPU] Improve detection of non-null addrspacecast operands (#82311)
Use IR analysis to infer when an addrspacecast operand is nonnull, then
lower it to an intrinsic that the DAG can use to skip the null check.

I did this using an intrinsic as it's non-intrusive. An alternative
would have been to allow something like `!nonnull` on `addrspacecast`
then lower that to a custom opcode (or add an operand to the
addrspacecast MIR/DAG opcodes), but it's a lot of boilerplate for just
one target's use case IMO.

I'm hoping that when we switch to GISel that we can move all this logic
to the MIR level without losing info, but currently the DAG doesn't see
enough so we need to act in CGP.

Fixes: SWDEV-316445
2024-03-01 14:01:10 +01:00
Shilei Tian
bfcf7a0707
[AMDGPU] Remove hasAtomicFaddRtnForTy as it is not used anywhere (#82841) 2024-02-23 21:14:38 -05:00
Ivan Kosarev
dfa1d9b027
[AMDGPU][NFC] Have helpers to deal with encoding fields. (#82772)
These are hoped to provide more convenient and less error prone
facilities to encode and decode fields than manually defined constants
and functions.
2024-02-23 17:34:55 +00:00
Nick Anderson
c5bbf979ad
[AMDGPU] fixes mistake in #82018 (#82223)
fixes #81766 #82018
2024-02-21 13:12:03 -05:00
Sameer Sahasrabuddhe
a2afcd5721 Revert "Implement convergence control in MIR using SelectionDAG (#71785)"
This reverts commit 79889734b940356ab3381423c93ae06f22e772c9.

Encountered multiple buildbot failures.
2024-02-21 11:07:02 +05:30
Jie Fu
086280f4d1 [AMDGPU] Fix linking error of SIISelLowering.cpp.o (NFC)
ld.lld: error: undefined symbol: llvm::MachineOperand::dump() const
>>> referenced by SIISelLowering.cpp
2024-02-21 13:07:34 +08:00
Sameer Sahasrabuddhe
79889734b9
Implement convergence control in MIR using SelectionDAG (#71785)
LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-02-21 10:06:37 +05:30
Nick Anderson
767433ba88
[AMDGPU] fixes duplicate expressions in if stmnts in SIISelLowering.cpp (#82018)
fixes #81766
2024-02-18 21:19:15 +05:30
Shilei Tian
9c6a2de24b
[AMDGPU] Clean up functions for checking inline literals (#81282)
This patch cleans up functions for checking inline literals.
2024-02-15 12:11:51 -05:00
Joseph Huber
11fcae69db
[LLVM] Add __builtin_readsteadycounter intrinsic and builtin for realtime clocks (#81331)
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter that some targets have which
returns a fixed frequency clock that can be used to determine elapsed
time, this is different compared to the cycle counter which often has
variable frequency.

This patch only adds support for the NVPTX and AMDGPU targets.

This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
2024-02-13 10:06:25 -06:00
Austin Kerbow
4bcbeaed63
[AMDGPU] Enable kernel arg preloading with gfx90a (#81180)
Add a trap instruction to the beginning of the kernel prologue to handle
cases where preloading is attempted on HW loaded with incompatible
firmware.
2024-02-12 22:33:29 -08:00
Pranav Kant
c95693c746
[NFC][AMDGPU] Fix unused-variable warning (#81040)
This is only used in assert statement.
2024-02-07 13:21:01 -08:00
Jeffrey Byrnes
3115ad8980
[AMDGPU] Accept arbitrary sized sources in CalculateByteProvider (#70240)
Reland the original patch with additional commit containing fix for two
issues:

1. Attempting to bitcast using MVTs with no corresponding LLVM type.
getDWordFromOffset now works directly with the original vector to get
the corresponding elements given the DWordOffset.
2. Improper bit tracking in CalculateByteProvider for vector types using
certain ops. Previously, bit tracking for certain ops (e.g.
ISD::TRUNCATE) assumed operands were scalar types, which is not correct
since these ops have different semantics depending on vector / scalar.
CalculateByteProvider / CalculateSrcByte now exit on vector types,
handling which is a TODO.
2024-02-07 11:34:50 -08:00
Jay Foad
c2c650f62e
[AMDGPU] Stop combining arbitrary offsets into PAL relocs (#80034)
PAL uses ELF REL (not RELA) relocations which can only store a 32-bit
addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI
which require the upper 32 bits of a 64-bit address calculation to be
correct. This means that it is not safe to fold an arbitrary offset into
a GlobalAddressSDNode, so stop doing that.

In practice this is mostly a problem for small negative offsets which do
not work as expected because PAL treats the 32-bit addend as unsigned.
2024-01-31 10:28:23 +00:00
Kazu Hirata
8582d41789 [Target] Use SDValue::getConstantOperandVal (NFC) 2024-01-29 18:46:16 -08:00
Jay Foad
8b429fc3fe [AMDGPU] Update SITargetLowering::getAddrModeArguments (#78740)
Handle every intrinsic for which getTgtMemIntrinsic returns with
Info.ptrVal set to one of the intrinsic's operands. A bunch of these
cases were missing.
2024-01-29 15:03:26 +00:00
Kazu Hirata
ae46855f53 [Target] Use getConstantOperand (NFC) 2024-01-28 18:03:38 -08:00
Jay Foad
66c710ec9d
[AMDGPU] Do not bother adding reserved registers to liveins (#79436)
Tweak the implementation of llvm.amdgcn.wave.id to not add TTMP8 to the
function liveins.
2024-01-25 15:17:06 +00:00
Jay Foad
45d2d7757f
[AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325)
This is only valid on targets with architected SGPRs.
2024-01-25 07:48:06 +00:00
Jay Foad
fe9f3903f2
[AMDGPU] Update isLegalAddressingMode for GFX12 SMEM loads (#78728) 2024-01-24 21:04:43 +00:00
Jay Foad
70fc970378
[AMDGPU] Move architected SGPR implementation into isel (#79120) 2024-01-24 15:06:20 +00:00
Mirko Brkušanin
7fdf608cef
[AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795)
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>
2024-01-24 13:43:07 +01:00
Changpeng Fang
32073b8356
AMDGPU: Do not generate non-temporal hint when Load_Tr intrinsic did not specify it (#79104)
int_amdgcn_global_load_tr did not specify non-temporal load transpose,
thus we should
not genetrate the non-temporal hint for the load. We need to implement
getTgtMemIntrinsic
to create the corresponding MemSDNode. And we don't set the non-temporal
flag because
the intrinsic did not specify it.

NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.
2024-01-23 10:05:32 -08:00
Emma Pilkington
bc82cfb38d
[AMDGPU] Add an asm directive to track code_object_version (#76267)
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.

This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
2024-01-21 11:54:47 -05:00
Jay Foad
1abf2570b3 [AMDGPU] Make use of CPol::SWZ_* in SelectionDAG. NFC.
For GlobalISel this was already done in
AMDGPUInstructionSelector::selectBufferLoadLds.
2024-01-19 15:48:45 +00:00
Jay Foad
7017efa1a1 Fix typo "widended" 2024-01-19 13:50:26 +00:00
Mariusz Sikora
3e6589f21c
[AMDGPU][GFX12] Add 16 bit atomic fadd instructions (#75917)
- image_atomic_pk_add_f16
- image_atomic_pk_add_bf16
- ds_pk_add_bf16
- ds_pk_add_f16
- ds_pk_add_rtn_bf16
- ds_pk_add_rtn_f16
- flat_atomic_pk_add_f16
- flat_atomic_pk_add_bf16
- global_atomic_pk_add_f16
- global_atomic_pk_add_bf16
- buffer_atomic_pk_add_f16
- buffer_atomic_pk_add_bf16
2024-01-18 14:01:09 +01:00
Mariusz Sikora
c99da46fc1
[AMDGPU][GFX12] Add Atomic cond_sub_u32 (#76224)
Co-authored-by: Vang Thao <Vang.Thao@amd.com>
2024-01-17 19:23:42 +01:00
Matt Arsenault
af4f1766ae
AMDGPU: Allocate special SGPRs before user SGPR arguments (#78234) 2024-01-17 21:41:50 +07:00
Jay Foad
a3fc0f9d2b [AMDGPU] Add comments on SITargetLowering::widenLoad 2024-01-17 10:40:24 +00:00
Jay Foad
4a77414660
[AMDGPU] CodeGen for GFX12 8/16-bit SMEM loads (#77633) 2024-01-17 10:28:03 +00:00
Kazu Hirata
7528cf5ef2 [Target] Use getConstantOperandVal (NFC) 2024-01-14 00:53:29 -08:00