6927 Commits

Author SHA1 Message Date
Jun Wang
54470176af
[AMDGPU] Add inreg support for SGPR arguments (#67182)
Function parameters marked with inreg are supposed to be allocated to
SGPRs. However, for compute functions, this is ignored and function
parameters are allocated to VGPRs. This fix modifies CC_AMDGPU_Func in
AMDGPUCallingConv.td to use SGPRs if input arg is marked inreg.
---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2023-11-08 11:35:52 -08:00
Diana Picus
3b905a0be5 [AMDGPU] ISel for llvm.amdgcn.set.inactive.chain.arg
Add patterns to select int_amdgcn_set_inactive_chain_arg to
V_SET_INACTIVE.

This could probably use some more testing, but at least for simple cases
V_SET_INACTIVE seems to mostly work out of the box.

Differential Revision: https://reviews.llvm.org/D158605
2023-11-08 09:53:47 +01:00
Diana Picus
39830fea28 [AMDGPU][PEI] Set up SP for chain functions
Initialize the SP to 0 in the prologue of functions with the
`amdgpu_cs_chain` or `amdgpu_cs_chain_preserve` calling conventions, but
only if they need one (i.e. if they contain calls to `amdgpu_gfx`
functions or if they have stack objects).

Also make sure we don't try to realign the stack (since 0 is aligned
enough).

Differential Revision: https://reviews.llvm.org/D156413
2023-11-08 09:27:34 +01:00
Diana
1fa58c7790
[AMDGPU] Callee saves for amdgpu_cs_chain[_preserve] (#71526)
Teach prolog epilog insertion how to handle functions with the
amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions.

For amdgpu_cs_chain functions, we only need to preserve the inactive
lanes of VGPRs above v8, and only in the presence of calls via
@llvm.amdgcn.cs.chain.

For amdgpu_cs_chain_preserve functions, we will also need to preserve
the active lanes for registers above the last argument VGPR. AFAICT
there's no direct way to find out what the last argument VGPR is, so
instead the patch uses the fact that chain calls from
amdgpu_cs_chain_preserve functions can't use more VGPRs than the
caller's VGPR arguments. In other words, it removes the operands of
SI_CS_CHAIN_TC instructions from the list of callee saved registers.

For both calling conventions, registers v0-v7 never need to be saved and
restored, so we should never add them as WWM spills.

Differential Revision: https://reviews.llvm.org/D156412
2023-11-08 08:28:15 +01:00
Carl Ritson
af6ff98c53
[AMDGPU] Move WWM register pre-allocation to during regalloc (#70618)
Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This
saves recomputation of the virtual matrix and live reg map, with the
slight regression in O0 that live intervals and slot indexes must be
computed.
2023-11-08 11:54:28 +09:00
Pierre van Houtryve
5db63d29fd
[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505)
I assumed indexes were always ConstantInts, but that's not always the
case. They can be other things as well. We can easily handle that by
just emitting an add and let InstSimplify do the constant folding for
cases where it's really a ConstantInt.

Solves SWDEV-429935
2023-11-07 15:29:41 +01:00
Pierre van Houtryve
4428b01faa Reland: [AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-11-07 12:23:03 +01:00
Nikita Popov
17764d2c87
[IR] Remove FP cast constant expressions (#71408)
Remove support for the fptrunc, fpext, fptoui, fptosi, uitofp and sitofp
constant expressions. All places creating them have been removed
beforehand, so this just removes the APIs and uses of these constant
expressions in tests.

With this, the only remaining FP operation that still has constant
expression support is fcmp.

This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
2023-11-07 09:34:16 +01:00
Matt Arsenault
d34a10a47d
AMDGPU: Port AMDGPUAttributor to new pass manager (#71349) 2023-11-07 15:40:40 +09:00
Amara Emerson
6b69584660
[GlobalISel] Fall back for bf16 conversions. (#71470)
We don't support these correctly since we don't yet have FP types.
AMDGPU tests were silently miscompiling bf16 as if they were fp16.
2023-11-06 21:18:57 -08:00
Jay Foad
521ac12a25
[AMDGPU] Remove AMDGPUAsmPrinter::isBlockOnlyReachableByFallthrough (#71407)
The special handling for blocks ending with a long branch has been
unnecessary since D106445:
"[amdgpu] Add 64-bit PC support when expanding unconditional branches."
2023-11-06 16:29:52 +00:00
Jay Foad
1c6102d19b [AMDGPU] Regenerate checks for long-branch-reserve-register.ll 2023-11-06 15:33:23 +00:00
Nikita Popov
f9404a1b57 [AMDGPU] Regenerate test to fix failure 2023-11-06 15:42:02 +01:00
Valery Pykhtin
fe6893b1d8
Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. (#68714)
Improve selection of the following pattern:

bool cnd = ...
if (amdgcn.ballot(cnd) != 0) {
  ...
}

which means "execute _then_ if any lane has satisfied the _cnd_
condition".
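
The condition can be modelled in a few lines of Python. This is only an illustrative sketch: a wave is assumed to be a list of per-lane booleans, and `ballot` and `takes_then_branch` are hypothetical helper names, not LLVM APIs.

```python
def ballot(lane_conds):
    """Pack per-lane booleans into a bitmask, lane i -> bit i (illustrative model)."""
    mask = 0
    for lane, cond in enumerate(lane_conds):
        if cond:
            mask |= 1 << lane
    return mask


def takes_then_branch(lane_conds):
    # ballot(cnd) != 0 holds exactly when at least one lane satisfied cnd.
    return ballot(lane_conds) != 0
```

In this model the branch condition is equivalent to `any(lane_conds)`, which is why it can be selected as a uniform branch.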
2023-11-06 15:16:49 +01:00
sstipanovic
22a323e3db
[AMDGPU] Select v_lshl_add_u32 instead of v_mul_lo_u32 by constant (#71035)
Instead of: v_mul_lo_u32 v0, v0, 5 we should generate: v_lshl_add_u32
v0, v0, 2, v0.
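
The rewrite relies on the identity 5*x == (x << 2) + x modulo 2^32. A minimal Python check of that identity, where `mul_lo_u32` and `lshl_add_u32` are hypothetical models of the two instructions' arithmetic rather than real APIs:

```python
MASK32 = 0xFFFFFFFF


def mul_lo_u32(a, b):
    # Low 32 bits of the product.
    return (a * b) & MASK32


def lshl_add_u32(a, shift, c):
    # (a << shift) + c, truncated to 32 bits.
    return ((a << shift) + c) & MASK32
```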
2023-11-06 14:52:27 +01:00
Diana
7f5d59b38d
[AMDGPU] ISel for @llvm.amdgcn.cs.chain intrinsic (#68186)
The @llvm.amdgcn.cs.chain intrinsic is essentially a call. The call
parameters are bundled up into 2 intrinsic arguments, one for those that
should go in the SGPRs (the 3rd intrinsic argument), and one for those
that should go in the VGPRs (the 4th intrinsic argument). Both will
often be some kind of aggregate.

Both instruction selection frameworks have some internal representation
for intrinsics (G_INTRINSIC[_WITH_SIDE_EFFECTS] for GlobalISel,
ISD::INTRINSIC_[VOID|WITH_CHAIN] for DAGISel), but we can't use those
because aggregates are dissolved very early on during ISel and we'd lose
the inreg information. Therefore, this patch shortcircuits both the
IRTranslator and SelectionDAGBuilder to lower this intrinsic as a call
from the very start. It tries to use the existing infrastructure as much
as possible, by calling into the code for lowering tail calls.

This has already gone through a few rounds of review in Phab:

Differential Revision: https://reviews.llvm.org/D153761
2023-11-06 12:30:07 +01:00
Carl Ritson
19bfe08c7f Reapply [AMDGPU] Generate wwm-reserved.ll (NFC)
Fix target triple so address locations are host independent.
2023-11-06 13:26:06 +09:00
Jessica Del
6e4692c9ee
[AMDGPU] - Add s_wqm intrinsics (#71048)
Add intrinsics to generate `s_wqm_b32` and `s_wqm_b64`.

Support VGPR arguments by inserting a `v_readfirstlane`.
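
As a rough illustration of what the instruction computes, here is a Python model. It assumes the usual whole-quad-mode semantics (every group of four lanes with at least one bit set becomes fully set); the function name `s_wqm` is just a label for this sketch.

```python
def s_wqm(mask, bits=32):
    """Expand a lane mask to whole quads: any set bit in a quad sets all four."""
    result = 0
    for quad_start in range(0, bits, 4):
        if (mask >> quad_start) & 0xF:
            result |= 0xF << quad_start
    return result
```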
2023-11-03 14:48:59 +01:00
Nikita Popov
e4a4122eb6
[IR] Remove zext and sext constant expressions (#71040)
Remove support for zext and sext constant expressions. All places
creating them have been removed beforehand, so this just removes the
APIs and uses of these constant expressions in tests.

There is some additional cleanup that can be done on top of this, e.g.
we can remove the ZExtInst vs ZExtOperator footgun.

This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
2023-11-03 10:46:07 +01:00
Nico Weber
6acd1671e6 Revert "[AMDGPU] Generate wwm-reserved.ll (NFC)"
This reverts commit b3523d7e6d8834468cfcb66e629adbe17da90ea5.
Breaks tests on mac, see:
https://github.com/llvm/llvm-project/commit/b3523d7e6d88344#commitcomment-131547708
2023-11-02 14:55:41 -04:00
Jay Foad
b90cfe4601
[AMDGPU] New ttracedata intrinsics (#70235)
Add llvm.amdgcn.s.ttracedata and llvm.amdgcn.s.ttracedata.imm which map
directly to the corresponding instructions s_ttracedata and
s_ttracedata_imm. These are inherently whole-wave operations so any
non-uniform inputs are readfirstlaned.
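
A small Python model of what "readfirstlaned" means here. It assumes per-lane values are held in a list and the EXEC mask is an integer; falling back to lane 0 when no lane is active is an assumption of this sketch.

```python
def readfirstlane(values, exec_mask):
    """Return the value held by the lowest-numbered active lane."""
    for lane, value in enumerate(values):
        if (exec_mask >> lane) & 1:
            return value
    # Assumption of this sketch: with no active lanes, lane 0 is read.
    return values[0]
```

The result is a single scalar, which is what a whole-wave instruction needs as its operand.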
2023-11-02 10:35:15 +00:00
Jay Foad
65bad23e43 [AMDGPU] Fix test for #70532 (Implement moveToVALU for S_CSELECT_B64) 2023-11-02 10:31:02 +00:00
Jay Foad
1590cac494
[AMDGPU] Implement moveToVALU for S_CSELECT_B64 (#70352)
moveToVALU previously only handled S_CSELECT_B64 in the trivial case
where it was semantically equivalent to a copy. Implement the general
case using V_CNDMASK_B64_PSEUDO and implement post-RA expansion of
V_CNDMASK_B64_PSEUDO with immediate as well as register operands.
2023-11-02 10:08:09 +00:00
Jessica Del
41cf94e6b8
[AMDGPU] - Add s_quadmask intrinsics (#70804)
Add intrinsics to generate `s_quadmask_b32`
and `s_quadmask_b64`.

Support VGPR arguments by inserting a `v_readfirstlane`.
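
For illustration, a Python model of the quad-mask reduction. It assumes each group of four input bits is reduced to one output bit that is set when any bit of the group is set; `s_quadmask` is a name chosen for this sketch.

```python
def s_quadmask(mask, bits=32):
    """Reduce each quad of the input mask to a single bit (set if any lane is set)."""
    result = 0
    for quad in range(bits // 4):
        if (mask >> (4 * quad)) & 0xF:
            result |= 1 << quad
    return result
```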
2023-11-02 10:37:52 +01:00
Thomas Symalla
18839aec4e
[AMDGPU] Detect kills in register sets when trying to form V_CMPX instructions. (#68293)
During the SIOptimizeExecMasking pass, we try to form V_CMPX
instructions by detecting S_AND_SAVEEXEC and V_MOV instructions.
Generally, we require the input operand of the V_MOV, which is the input
operand to the to-be-formed V_CMPX, to be alive. This is forced by
clearing the kill flags on the operand after V_CMPX has been generated.

However, if we have a kill of a register set that contains said
register, this will not be detected by clearKillFlags.
With this change, possible additional kill-flag candidates will be
detected during the final call to findInstrBackwards and then, the kill
flag will be removed to keep all registers in the set alive.

Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
2023-11-02 10:36:27 +01:00
Carl Ritson
b3523d7e6d [AMDGPU] Generate wwm-reserved.ll (NFC) 2023-11-02 17:50:42 +09:00
Carl Ritson
0eb516817d
[AMDGPU] Remove dom tree requirements from SIWholeQuadMode pass (#71012)
SIWholeQuadMode preserves dominator and post dominator trees, but does
not require them.
2023-11-02 17:16:19 +09:00
Tobias Stadler
373c343a77 Reland: [GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND
Reland 3686a0b after fixing an exposed miscompile in #68840

Differential Revision: https://reviews.llvm.org/D159140
2023-11-02 00:18:19 +01:00
Valery Pykhtin
e808f8a616
[AMDGPU] GCNRegPressurePrinter pass to print GCNRegPressure values for testing. (#70031)
Using GCNDownwardRPTracker or GCNUpwardRPTracker, the pass collects register pressure values for a function and prints these values next to instructions. The output can be used to generate FileCheck rules in MIR tests.
2023-11-01 23:01:39 +01:00
Jay Foad
86f2e09250
[AMDGPU] Tweak handling of GlobalAddress operands in SI_PC_ADD_REL_OFFSET (#70960)
When SI_PC_ADD_REL_OFFSET is expanded to S_GETPC/S_ADD/S_ADDC, the
GlobalAddress operands have to be adjusted by 4 or 12 bytes to account
for the offset from the end of the S_GETPC instruction to the literal
operands. Do this all in SIInstrInfo::expandPostRAPseudo instead of
duplicating the adjustment code in both AMDGPULegalizerInfo and
SITargetLowering. NFCI.
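
The 4 and 12 byte figures can be derived with a little arithmetic. The sketch below assumes each of the two add instructions is a 4-byte instruction word followed by a 4-byte literal and that they directly follow S_GETPC; the constant and function names are hypothetical.

```python
ENCODING_BYTES = 4  # assumed size of the base instruction word
LITERAL_BYTES = 4   # assumed size of the trailing 32-bit literal


def literal_offsets_after_getpc(num_instructions=2):
    """Offsets of each literal, measured from the end of S_GETPC."""
    offsets = []
    position = 0  # S_GETPC yields the address of the instruction after it
    for _ in range(num_instructions):
        offsets.append(position + ENCODING_BYTES)
        position += ENCODING_BYTES + LITERAL_BYTES
    return offsets
```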
2023-11-01 19:48:30 +00:00
Simon Pilgrim
51d4ad6701 [AMDGPU] amdgpu-codegenprepare-idiv.ll - regenerate checks. NFC.
Reduces diffs in a future patch
2023-10-31 13:24:27 +00:00
Jay Foad
a6dabed348
[AMDGPU] Fix nondeterminism in SIFixSGPRCopies (#70644)
There are a couple of loops that iterate over V2SCopies. The iteration
order needs to be deterministic, otherwise we can call moveToVALU in
different orders, which causes temporary vregs to be allocated in
different orders, which can affect register allocation heuristics.
2023-10-31 11:47:42 +00:00
Jessica Del
b8d3ccdff1
[AMDGPU] - Add s_bitreplicate intrinsic (#69209)
Add intrinsic for s_bitreplicate. Lower to S_BITREPLICATE_B64_B32
machine instruction in both GISel and Selection DAG.

Support VGPR arguments by inserting a `v_readfirstlane`.
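
A Python model of the bit replication, assuming each of the 32 input bits is duplicated into two adjacent bits of the 64-bit result. The function name mirrors the machine instruction but is only a label for this sketch.

```python
def s_bitreplicate_b64_b32(value):
    """Duplicate every bit of a 32-bit value into two adjacent result bits."""
    result = 0
    for i in range(32):
        if (value >> i) & 1:
            result |= 0b11 << (2 * i)
    return result
```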
2023-10-31 11:26:45 +01:00
Craig Topper
9a7c26a399
[GISel] Restrict G_BSWAP to multiples of 16 bits. (#70245)
This is consistent with the IR verifier and SelectionDAG's getNode.

Update tests accordingly. I tried to keep some coverage of non-pow2 when
possible. X86 didn't like a G_UNMERGE_VALUES from s48 to 3 s16 that got
created when I tried s48.
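
To illustrate the restriction, here is a Python sketch of a byte swap that rejects widths that are not a multiple of 16 bits. The helper `bswap` is hypothetical and only models the width rule, not the GlobalISel implementation.

```python
def bswap(value, bits):
    """Reverse the byte order of an unsigned value of the given bit width."""
    if bits % 16 != 0:
        raise ValueError("bswap width must be a multiple of 16 bits")
    return int.from_bytes(value.to_bytes(bits // 8, "little"), "big")
```

Non-power-of-two widths such as 48 bits still pass the check, which matches the coverage the tests try to keep.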
2023-10-30 10:27:57 -07:00
Jay Foad
101008be83
[AMDGPU] CodeGen for 64-bit buffer atomic cmpswap intrinsics (#70475)
Implement codegen for:
llvm.amdgcn.raw.buffer.atomic.cmpswap.i64
llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.i64
2023-10-30 16:44:22 +00:00
Jessica Del
849297c97d
[AMDGPU][wmma] - Add tied wmma intrinsic (#69903)
These new intrinsics, `amdgcn_wmma_tied_f16_16x16x16_f16` and
`amdgcn_wmma_tied_bf16_16x16x16_bf16`,
explicitly tie the destination accumulator matrix to the input
accumulator matrix.

The `wmma_f16` and `wmma_bf16` intrinsics only write to 16 bits of the
32-bit destination VGPRs.
Which half is determined via the `op_sel` argument. The other half of
the destination registers remains unchanged.

In some cases however, we expect the destination to copy the other
halves from the input accumulator.
For instance, when packing two separate accumulator matrices into one.
In that case, the two matrices
are tied into the same registers, but separate halves. Then it is
important to copy the other matrix values
to the new destination.
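
The packing scenario can be sketched in Python for a single 32-bit destination register. This is a simplified model: it assumes `op_sel` false selects the low 16 bits and true selects the high 16 bits, and `write_half` and `pack_two_accumulators` are hypothetical names.

```python
def write_half(dst32, result16, op_sel):
    """Write a 16-bit result into one half of a 32-bit register, keeping the other half."""
    result16 &= 0xFFFF
    if op_sel:
        return (dst32 & 0x0000FFFF) | (result16 << 16)
    return (dst32 & 0xFFFF0000) | result16


def pack_two_accumulators(a16, b16):
    # The second write only preserves a16 because its destination is tied to
    # the register that already holds the first result.
    acc = write_half(0, a16, op_sel=False)
    return write_half(acc, b16, op_sel=True)
```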
2023-10-30 16:23:49 +01:00
Stanislav Mekhanoshin
fe8335babb
[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395)
This allows folding of 64-bit operands if they fit into 32 bits. Fixes
https://github.com/llvm/llvm-project/issues/67781
2023-10-30 08:12:28 -07:00
Stanislav Mekhanoshin
ee6d62db99
[AMDGPU] Prevent folding of the negative i32 literals as i64 (#70274)
We can use sign extended 64-bit literals, but only for signed operands.
At the moment we do not know if an operand is signed. Such operand will
be encoded as its low 32 bits and then either correctly sign extended or
incorrectly zero extended by HW.
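
The hazard can be shown with a short Python sketch. The helpers `sext32_to_64` and `zext32_to_64` are hypothetical and simply model the two possible hardware extensions of a 32-bit literal.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sext32_to_64(lit32):
    """Sign-extend a 32-bit literal to 64 bits."""
    lit32 &= MASK32
    if lit32 & 0x80000000:
        return lit32 | 0xFFFFFFFF00000000
    return lit32


def zext32_to_64(lit32):
    """Zero-extend a 32-bit literal to 64 bits."""
    return lit32 & MASK32
```

For the 64-bit value -1, the encoded low half is `0xFFFFFFFF`; sign extension recovers the original value while zero extension yields `0x00000000FFFFFFFF`.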
2023-10-30 08:07:43 -07:00
Simon Pilgrim
d96529af3c [DAG] Attempt shl narrowing in SimplifyDemandedBits (REAPPLIED)
If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext.

Followup to D146121

Reapplied - moved after the ShrinkDemandedOp call; reuse the existing KnownBits result; ensure that we only attempt this if all the upper bits are demanded; 547dc461225ba should address the remaining regressions that were noticed in the previous commit.

Differential Revision: https://reviews.llvm.org/D155472
2023-10-29 15:38:46 +00:00
Changpeng Fang
8ceb72ffe5
[AMDGPU] make v32i16/v32f16 legal (#70484)
Some upcoming intrinsics will be using these new types
2023-10-27 15:28:31 -07:00
Stanislav Mekhanoshin
d136432038
[AMDGPU] Remove unneeded implicit-def from shrink-i32-kimm.mir. NFC. (#70489) 2023-10-27 13:32:48 -07:00
Guozhi Wei
9a091de7fe [X86, Peephole] Enable FoldImmediate for X86
Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.

Also enhanced peephole by deleting identical instructions after FoldImmediate.

Differential Revision: https://reviews.llvm.org/D151848
2023-10-27 19:47:23 +00:00
Christudasan Devadasan
a0eb6b88f9
[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (#69924)
The insertion point determined by RA while attempting spills and liverange
split at the beginning of a block goes wrong at times, and the newly
inserted vector instructions are placed before the exec-mask restore
instruction, which is wrong. It occurs mainly due to the dependency on
isBasicBlockPrologue, which doesn't account for early inserted instructions
(spills and splits) during RA and so breaks the block prolog.

A better approach for deciding the insertion point should be worked out.
For now, improve the helper function to consider all possible early
insertions. This patch includes the spill instructions. The copies
associated with liverange split should also be included in the block
prolog.
2023-10-27 19:10:18 +05:30
Christudasan Devadasan
f9cd789658
[AMDGPU] Add pseudo instructions for SGPR spill to VGPR (#69923)
For a future patch, it is important that the lowered SGPR
spills are still recognized as spill instructions during regalloc.
Directly lowering them into V_WRITELANE/V_READLANE won't allow
us to attach the SPILL flag to their instructions.

This patch introduces the pseudo instructions with the SGPRSpill
flag set in their Desc. They will get lowered to equivalent
instructions later during post RA pseudo expansion.
2023-10-27 17:24:10 +05:30
Matt Arsenault
b8b491c9d7 AMDGPU: Add infinite looping testcase after subrange spilling change
This looped infinitely after d8127b2ba8a87a610851b9a462f2fc2526c36e37
2023-10-27 17:42:14 +09:00
Alex Richardson
e39f6c1844 [opt] Infer DataLayout from triple if not specified
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.

One thing that is not currently possible is differentiating a missing
datalayout from `target datalayout = ""` in the IR file, since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.

Differential Revision: https://reviews.llvm.org/D141060
2023-10-26 12:07:37 -07:00
Alexander Richardson
f118d474eb
[AMDGPU] Use alloca address space in rewrite-out-arguments.ll (#70269)
This is needed for the transform to fire with a correct data layout.
Pre-committing this change to keep the diff of D141060 smaller.
2023-10-26 15:08:58 +01:00
Christudasan Devadasan
bb2b7530ad [AMDGPU] precommit lit test for PR 69924. 2023-10-26 17:43:14 +05:30
Jay Foad
e9c4dc18bc Revert "[AMDGPU] Use S_CSELECT for uniform i1 ext (#69703)"
This reverts commit a1260b5209968c08886e3c6183aa793de8931578.

It was causing some Vulkan CTS failures.
2023-10-26 12:56:32 +01:00
Christudasan Devadasan
16fbc45f48 Revert "[AMDGPU] Cleanup hasUnwantedEffectsWhenEXECEmpty function (#70206)"
This reverts commit 7ce613fc77af092dd6e9db71ce3747b75bc5616e.
2023-10-26 17:04:28 +05:30