6937 Commits

Author SHA1 Message Date
Jay Foad
a4196666ac
[AMDGPU] Revert "Preliminary patch for divergence driven instruction selection. Operands Folding 1." (#71710)
This reverts commit 201f892b3b597f24287ab6a712a286e25a45a7d9.
2023-11-13 13:53:10 +00:00
Jay Foad
47f29043f0 [AMDGPU] Fix a GlobalISel RUN line
This was added in D149795 without actually enabling GlobalISel.
2023-11-13 11:30:15 +00:00
Carl Ritson
edc38a6cbd
[AMDGPU] Add option to pre-allocate SGPR spill VGPRs (#70626)
SGPR spill VGPRs are WWM registers so allow them to be allocated by
SIPreAllocateWWMRegs pass.
This intentionally prevents spilling of these VGPRs when enabled.
2023-11-13 12:21:18 +09:00
Carl Ritson
52b247b1d3
[PHIElimination] Handle subranges in LiveInterval updates (#69429)
Add subrange tracking and handling for LiveIntervals during PHI
elimination.
This requires extending MachineBasicBlock::SplitCriticalEdge to also
update subrange intervals.
2023-11-13 12:16:26 +09:00
Joseph Huber
a3bd87b100 [AMDGPU] Call the FINI_ARRAY destructors in the correct order (#71815)
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors in FINI_ARRAY are stored in the same order as
the constructors in the INIT_ARRAY section, so we need to traverse it in
reverse order.

Relanding after the revert in fe7b5e2cfcf6848287010291081f85fa1f6bb2ef
using the IR builder interface instead of ConstantExpr.
2023-11-10 11:01:02 -06:00
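The ordering issue described in the commit above can be illustrated with a toy model. This is only a sketch: plain Python lists stand in for the INIT_ARRAY and FINI_ARRAY sections, and `run_ctors_and_dtors` is a hypothetical helper, not code from the AMDGPU backend.

```python
def run_ctors_and_dtors(init_array, fini_array):
    """Call constructors front to back, then destructors back to front."""
    for ctor in init_array:
        ctor()
    # Both arrays are stored in the same order, so the destructors
    # have to be walked in reverse to undo construction correctly.
    for dtor in reversed(fini_array):
        dtor()


log = []
init_array = [lambda: log.append("ctor A"), lambda: log.append("ctor B")]
fini_array = [lambda: log.append("dtor A"), lambda: log.append("dtor B")]
run_ctors_and_dtors(init_array, fini_array)
print(log)
```

In this model the object constructed last ("B") is destroyed first, which is the behaviour the fix restores.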
Nikita Popov
fe7b5e2cfc Revert "[AMDGPU] Call the FINI_ARRAY destructors in the correct order (#71815)"
This reverts commit c1d5865a313d0a8a254b37c852bdd444453c0f73.

Introduces a new use of ConstantExpr::getAShr().
2023-11-10 17:01:06 +01:00
Joseph Huber
c1d5865a31
[AMDGPU] Call the FINI_ARRAY destructors in the correct order (#71815)
Summary:
The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY
sections to call all the global constructors in a single kernel.
Previously this mistakenly used the same iteration logic for both
arrays. The destructors in FINI_ARRAY are stored in the same order as
the constructors in the INIT_ARRAY section, so we need to traverse it in
reverse order.
2023-11-10 09:34:04 -06:00
Valery Pykhtin
87b8d94371
[AMDGPU] Fix GCNUpwardRPTracker. (#71186)
Fixed:

1. Maximum register pressure calculation at the instruction level. 
Previously max RP included both def and use of registers of an
instruction. Now maximum RP includes _uses_ and _early-clobber defs_.

2. Uses were incorrectly tracked and this resulted in a mismatch of
live-in set reported by LiveIntervals and tracked live reg set when the
beginning of the block is reached.

The interface has changed: moveMaxPressure is deprecated, and the
getMaxPressure and resetMaxPressure functions are added. The reset
function now seems more consistent.
2023-11-10 13:44:10 +01:00
Diana Picus
20e9e4f797 [AMDGPU] si-wqm: Skip only LiveMask COPY
si-wqm sometimes needs to save the LiveMask in the entry block. Later
on, while looking for a place to enter WQM/WWM, it unconditionally
skips over the first COPY instruction in the entry block. This is
incorrect for functions where the LiveMask doesn't need to be saved, and
therefore the first COPY is more likely a COPY from a function argument
and might need to be in some non-exact mode.

This patch fixes the issue by also checking that the source of the COPY
is the EXEC register.

This produces different code in 3 of the existing tests:

In wwm-reserved.ll, an SGPR copy is now inside the WWM area rather than
outside. This is benign.

In wave32.ll, we end up with an extra register copy. This is because
the first COPY in the block is now part of the WWM block, so
si-pre-allocate-wwm-regs will allocate a new register for its
destination (when it was outside of the WWM region, the register
allocator could just re-use the same register). We might be able to
improve this in si-pre-allocate-wwm-regs but I haven't looked into it.

The same thing happens in dual-source-blend-export.ll, but for that
one it's harder to see because of the scheduling changes. I've uploaded
the before/after si-wqm output for it here:
https://reviews.llvm.org/differential/diff/553445/

Differential Revision: https://reviews.llvm.org/D158841
2023-11-10 09:30:44 +01:00
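A minimal sketch of the check described in the si-wqm commit above, assuming a simplified instruction model. The `Instr` tuple, the register names, and `first_wqm_candidate_index` are hypothetical and exist only for illustration; the real pass works on MachineInstrs.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    opcode: str
    dst: str
    src: str


def first_wqm_candidate_index(entry_block):
    """Index of the first entry-block instruction that may enter WQM/WWM.

    Only a leading COPY whose source is EXEC (the LiveMask save) is
    skipped; any other leading COPY, such as a copy from a function
    argument, remains a candidate.
    """
    if entry_block:
        first = entry_block[0]
        if first.opcode == "COPY" and first.src == "EXEC":
            return 1
    return 0


saves_live_mask = [Instr("COPY", "%livemask", "EXEC"),
                   Instr("V_MOV_B32", "%1", "%0")]
copies_argument = [Instr("COPY", "%0", "VGPR0"),
                   Instr("V_MOV_B32", "%1", "%0")]
print(first_wqm_candidate_index(saves_live_mask))
print(first_wqm_candidate_index(copies_argument))
```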
Matt Arsenault
67c3cb4f6b AMDGPU: Use an explicit triple in test to avoid bot failures 2023-11-10 17:09:55 +09:00
Jun Wang
54470176af
[AMDGPU] Add inreg support for SGPR arguments (#67182)
Function parameters marked with inreg are supposed to be allocated to
SGPRs. However, for compute functions, this is ignored and function
parameters are allocated to VGPRs. This fix modifies CC_AMDGPU_Func in
AMDGPUCallingConv.td to use SGPRs if input arg is marked inreg.
---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2023-11-08 11:35:52 -08:00
Diana Picus
3b905a0be5 [AMDGPU] ISel for llvm.amdgcn.set.inactive.chain.arg
Add patterns to select int_amdgcn_set_inactive_chain_arg to
V_SET_INACTIVE.

This could probably use some more testing, but at least for simple cases
V_SET_INACTIVE seems to mostly work out of the box.

Differential Revision: https://reviews.llvm.org/D158605
2023-11-08 09:53:47 +01:00
Diana Picus
39830fea28 [AMDGPU][PEI] Set up SP for chain functions
Initialize the SP to 0 in the prologue of functions with the
`amdgpu_cs_chain` or `amdgpu_cs_chain_preserve` calling conventions, but
only if they need one (i.e. if they contain calls to `amdgpu_gfx`
functions or if they have stack objects).

Also make sure we don't try to realign the stack (since 0 is aligned
enough).

Differential Revision: https://reviews.llvm.org/D156413
2023-11-08 09:27:34 +01:00
Diana
1fa58c7790
[AMDGPU] Callee saves for amdgpu_cs_chain[_preserve] (#71526)
Teach prolog epilog insertion how to handle functions with the
amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions.

For amdgpu_cs_chain functions, we only need to preserve the inactive
lanes of VGPRs above v8, and only in the presence of calls via
@llvm.amdgcn.cs.chain.

For amdgpu_cs_chain_preserve functions, we will also need to preserve
the active lanes for registers above the last argument VGPR. AFAICT
there's no direct way to find out what the last argument VGPR is, so
instead the patch uses the fact that chain calls from
amdgpu_cs_chain_preserve functions can't use more VGPRs than the
caller's VGPR arguments. In other words, it removes the operands of
SI_CS_CHAIN_TC instructions from the list of callee saved registers.

For both calling conventions, registers v0-v7 never need to be saved and
restored, so we should never add them as WWM spills.

Differential Revision: https://reviews.llvm.org/D156412
2023-11-08 08:28:15 +01:00
Carl Ritson
af6ff98c53
[AMDGPU] Move WWM register pre-allocation to during regalloc (#70618)
Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This
saves recomputation of the virtual matrix and live reg map, with the
slight regression in O0 that live intervals and slot indexes must be
computed.
2023-11-08 11:54:28 +09:00
Pierre van Houtryve
5db63d29fd
[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505)
I assumed indexes were always ConstantInts, but that's not always the
case. They can be other things as well. We can easily handle that by
just emitting an add and letting InstSimplify do the constant folding for
cases where it's really a ConstantInt.

Solves SWDEV-429935
2023-11-07 15:29:41 +01:00
Pierre van Houtryve
4428b01faa Reland: [AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-11-07 12:23:03 +01:00
Nikita Popov
17764d2c87
[IR] Remove FP cast constant expressions (#71408)
Remove support for the fptrunc, fpext, fptoui, fptosi, uitofp and sitofp
constant expressions. All places creating them have been removed
beforehand, so this just removes the APIs and uses of these constant
expressions in tests.

With this, the only remaining FP operation that still has constant
expression support is fcmp.

This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
2023-11-07 09:34:16 +01:00
Matt Arsenault
d34a10a47d
AMDGPU: Port AMDGPUAttributor to new pass manager (#71349) 2023-11-07 15:40:40 +09:00
Amara Emerson
6b69584660
[GlobalISel] Fall back for bf16 conversions. (#71470)
We don't support these correctly since we don't yet have FP types.
AMDGPU tests were silently miscompiling bf16 as if they were fp16.
2023-11-06 21:18:57 -08:00
Jay Foad
521ac12a25
[AMDGPU] Remove AMDGPUAsmPrinter::isBlockOnlyReachableByFallthrough (#71407)
The special handling for blocks ending with a long branch has been
unnecessary since D106445:
"[amdgpu] Add 64-bit PC support when expanding unconditional branches."
2023-11-06 16:29:52 +00:00
Jay Foad
1c6102d19b [AMDGPU] Regenerate checks for long-branch-reserve-register.ll 2023-11-06 15:33:23 +00:00
Nikita Popov
f9404a1b57 [AMDGPU] Regenerate test to fix failure 2023-11-06 15:42:02 +01:00
Valery Pykhtin
fe6893b1d8
Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. (#68714)
Improve selection of the following pattern:

bool cnd = ...
if (amdgcn.ballot(cnd) != 0) {
  ...
}

which means "execute _then_ if any lane has satisfied the _cnd_
condition".
2023-11-06 15:16:49 +01:00
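The "any lane" meaning of the pattern in the commit above can be modelled in a few lines. This is a sketch under assumptions: lanes are a Python list of booleans, `exec_mask` is an integer bitmask of active lanes, and both function names are hypothetical.

```python
def ballot(cnd_per_lane, exec_mask):
    """Bitmask with bit i set when lane i is active and its condition holds."""
    mask = 0
    for lane, cnd in enumerate(cnd_per_lane):
        if cnd and (exec_mask >> lane) & 1:
            mask |= 1 << lane
    return mask


def take_then_branch(cnd_per_lane, exec_mask):
    """True when at least one active lane satisfied the condition."""
    return ballot(cnd_per_lane, exec_mask) != 0


print(bin(ballot([False, True, False, True], 0b1111)))
print(take_then_branch([True, False], 0b10))
```

In the second call the only lane whose condition holds is inactive, so the branch is not taken.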
sstipanovic
22a323e3db
[AMDGPU] Select v_lshl_add_u32 instead of v_mul_lo_u32 by constant (#71035)
Instead of: v_mul_lo_u32 v0, v0, 5 we should generate: v_lshl_add_u32
v0, v0, 2, v0.
2023-11-06 14:52:27 +01:00
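The rewrite in the commit above relies on the identity x * 5 = (x << 2) + x in 32-bit wrapping arithmetic. The following sketch checks that identity; the two Python functions are simplified models of the instructions and their names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def v_mul_lo_u32(a, b):
    """Low 32 bits of an unsigned multiply."""
    return (a * b) & MASK32


def v_lshl_add_u32(src0, shift, src2):
    """(src0 << shift) + src2, wrapped to 32 bits."""
    return ((src0 << (shift & 31)) + src2) & MASK32


samples = [0, 1, 7, 0x12345678, 0x40000000, 0xFFFFFFFF]
for x in samples:
    assert v_mul_lo_u32(x, 5) == v_lshl_add_u32(x, 2, x)
print("x * 5 matches (x << 2) + x for all samples")
```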
Diana
7f5d59b38d
[AMDGPU] ISel for @llvm.amdgcn.cs.chain intrinsic (#68186)
The @llvm.amdgcn.cs.chain intrinsic is essentially a call. The call
parameters are bundled up into 2 intrinsic arguments, one for those that
should go in the SGPRs (the 3rd intrinsic argument), and one for those
that should go in the VGPRs (the 4th intrinsic argument). Both will
often be some kind of aggregate.

Both instruction selection frameworks have some internal representation
for intrinsics (G_INTRINSIC[_WITH_SIDE_EFFECTS] for GlobalISel,
ISD::INTRINSIC_[VOID|WITH_CHAIN] for DAGISel), but we can't use those
because aggregates are dissolved very early on during ISel and we'd lose
the inreg information. Therefore, this patch shortcircuits both the
IRTranslator and SelectionDAGBuilder to lower this intrinsic as a call
from the very start. It tries to use the existing infrastructure as much
as possible, by calling into the code for lowering tail calls.

This has already gone through a few rounds of review in Phab:

Differential Revision: https://reviews.llvm.org/D153761
2023-11-06 12:30:07 +01:00
Carl Ritson
19bfe08c7f Reapply [AMDGPU] Generate wwm-reserved.ll (NFC)
Fix target triple so address locations are host independent.
2023-11-06 13:26:06 +09:00
Jessica Del
6e4692c9ee
[AMDGPU] - Add s_wqm intrinsics (#71048)
Add intrinsics to generate `s_wqm_b32` and `s_wqm_b64`.

Support VGPR arguments by inserting a `v_readfirstlane`.
2023-11-03 14:48:59 +01:00
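As a rough illustration of what a whole-quad-mode mask computes, here is a sketch that assumes the usual description of `s_wqm`: every group of four bits in the result is all ones if any bit of that group is set in the input. The function `s_wqm` below is a Python model, not the intrinsic itself.

```python
def s_wqm(mask, width=32):
    """Expand each 4-bit quad to all ones if any of its bits is set."""
    result = 0
    for quad_start in range(0, width, 4):
        if (mask >> quad_start) & 0xF:
            result |= 0xF << quad_start
    return result


print(hex(s_wqm(0x104)))
print(hex(s_wqm(0x80000000)))
```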
Nikita Popov
e4a4122eb6
[IR] Remove zext and sext constant expressions (#71040)
Remove support for zext and sext constant expressions. All places
creating them have been removed beforehand, so this just removes the
APIs and uses of these constant expressions in tests.

There is some additional cleanup that can be done on top of this, e.g.
we can remove the ZExtInst vs ZExtOperator footgun.

This is part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
2023-11-03 10:46:07 +01:00
Nico Weber
6acd1671e6 Revert "[AMDGPU] Generate wwm-reserved.ll (NFC)"
This reverts commit b3523d7e6d8834468cfcb66e629adbe17da90ea5.
Breaks tests on mac, see:
https://github.com/llvm/llvm-project/commit/b3523d7e6d88344#commitcomment-131547708
2023-11-02 14:55:41 -04:00
Jay Foad
b90cfe4601
[AMDGPU] New ttracedata intrinsics (#70235)
Add llvm.amdgcn.s.ttracedata and llvm.amdgcn.s.ttracedata.imm which map
directly to the corresponding instructions s_ttracedata and
s_ttracedata_imm. These are inherently whole-wave operations so any
non-uniform inputs are readfirstlaned.
2023-11-02 10:35:15 +00:00
Jay Foad
65bad23e43 [AMDGPU] Fix test for #70352 (Implement moveToVALU for S_CSELECT_B64) 2023-11-02 10:31:02 +00:00
Jay Foad
1590cac494
[AMDGPU] Implement moveToVALU for S_CSELECT_B64 (#70352)
moveToVALU previously only handled S_CSELECT_B64 in the trivial case
where it was semantically equivalent to a copy. Implement the general
case using V_CNDMASK_B64_PSEUDO and implement post-RA expansion of
V_CNDMASK_B64_PSEUDO with immediate as well as register operands.
2023-11-02 10:08:09 +00:00
Jessica Del
41cf94e6b8
[AMDGPU] - Add s_quadmask intrinsics (#70804)
Add intrinsics to generate `s_quadmask_b32`
and `s_quadmask_b64`.

Support VGPR arguments by inserting a `v_readfirstlane`.
2023-11-02 10:37:52 +01:00
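A small model of the quad mask operation may help here. It assumes the usual description of `s_quadmask`: bit i of the result is set when any bit of the i-th group of four input bits is set. The Python function is illustrative only.

```python
def s_quadmask(mask, width=32):
    """Reduce each 4-bit quad of the input to a single result bit."""
    result = 0
    for i in range(width // 4):
        if (mask >> (4 * i)) & 0xF:
            result |= 1 << i
    return result


print(bin(s_quadmask(0x104)))
print(hex(s_quadmask(0xFFFFFFFF)))
```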
Thomas Symalla
18839aec4e
[AMDGPU] Detect kills in register sets when trying to form V_CMPX instructions. (#68293)
During the SIOptimizeExecMasking pass, we try to form V_CMPX
instructions by detecting S_AND_SAVEEXEC and V_MOV instructions.
Generally, we require the input operand of the V_MOV, which is the input
operand to the to-be-formed V_CMPX, to be alive. This is forced by
clearing the kill flags on the operand after V_CMPX has been generated.

However, if we have a kill of a register set that contains said
register, this will not be detected by clearKillFlags.
With this change, possible additional kill-flag candidates are detected
during the final call to findInstrBackwards, and their kill flags are
then removed to keep all registers in the set live.

Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
2023-11-02 10:36:27 +01:00
Carl Ritson
b3523d7e6d [AMDGPU] Generate wwm-reserved.ll (NFC) 2023-11-02 17:50:42 +09:00
Carl Ritson
0eb516817d
[AMDGPU] Remove dom tree requirements from SIWholeQuadMode pass (#71012)
SIWholeQuadMode preserves dominator and post dominator trees, but does
not require them.
2023-11-02 17:16:19 +09:00
Tobias Stadler
373c343a77 Reland: [GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND
Reland 3686a0b after fixing an exposed miscompile in #68840

Differential Revision: https://reviews.llvm.org/D159140
2023-11-02 00:18:19 +01:00
Valery Pykhtin
e808f8a616
[AMDGPU] GCNRegPressurePrinter pass to print GCNRegPressure values for testing. (#70031)
Using GCNDownwardRPTracker or GCNUpwardRPTracker, the pass collects register pressure values for a function and prints them next to the instructions. The output can be used to generate FileCheck rules in MIR tests.
2023-11-01 23:01:39 +01:00
Jay Foad
86f2e09250
[AMDGPU] Tweak handling of GlobalAddress operands in SI_PC_ADD_REL_OFFSET (#70960)
When SI_PC_ADD_REL_OFFSET is expanded to S_GETPC/S_ADD/S_ADDC, the
GlobalAddress operands have to be adjusted by 4 or 12 bytes to account
for the offset from the end of the S_GETPC instruction to the literal
operands. Do this all in SIInstrInfo::expandPostRAPseudo instead of
duplicating the adjustment code in both AMDGPULegalizerInfo and
SITargetLowering. NFCI.
2023-11-01 19:48:30 +00:00
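The 4 and 12 byte adjustments mentioned in the commit above can be reproduced with simple arithmetic. This sketch rests on assumptions that are not stated in the commit: each instruction word is 4 bytes, each add carries one 4-byte literal directly after its instruction word, and the relocation is of the generic PC-relative form S + A - P. All names are hypothetical, and the split of the result into low and high halves is omitted.

```python
# Assumed encoding sizes (illustrative, not taken from the commit).
INSTR_WORD_BYTES = 4   # one 32-bit instruction word
LITERAL_BYTES = 4      # one 32-bit literal operand


def literal_offsets():
    """Offsets of the S_ADD and S_ADDC literals from the end of S_GETPC."""
    s_add_literal = INSTR_WORD_BYTES  # skip the S_ADD instruction word
    s_addc_literal = s_add_literal + LITERAL_BYTES + INSTR_WORD_BYTES
    return s_add_literal, s_addc_literal


def pc_relative_value(symbol, addend, place):
    """Generic PC-relative relocation: S + A - P."""
    return symbol + addend - place


getpc_end = 0x1000   # address S_GETPC yields in this model
symbol = 0x5000      # hypothetical address of the global
lo_off, hi_off = literal_offsets()
# Using the literal's own offset as the addend cancels the distance
# between the literal and the end of S_GETPC.
lo = pc_relative_value(symbol, lo_off, getpc_end + lo_off)
hi = pc_relative_value(symbol, hi_off, getpc_end + hi_off)
print(lo_off, hi_off, hex(lo), hex(hi))
```

With these addends both literals resolve to the distance from the end of S_GETPC to the symbol, which is what the following adds need.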
Simon Pilgrim
51d4ad6701 [AMDGPU] amdgpu-codegenprepare-idiv.ll - regenerate checks. NFC.
Reduces diffs in a future patch
2023-10-31 13:24:27 +00:00
Jay Foad
a6dabed348
[AMDGPU] Fix nondeterminism in SIFixSGPRCopies (#70644)
There are a couple of loops that iterate over V2SCopies. The iteration
order needs to be deterministic, otherwise we can call moveToVALU in
different orders, which causes temporary vregs to be allocated in
different orders, which can affect register allocation heuristics.
2023-10-31 11:47:42 +00:00
Jessica Del
b8d3ccdff1
[AMDGPU] - Add s_bitreplicate intrinsic (#69209)
Add intrinsic for s_bitreplicate. Lower to S_BITREPLICATE_B64_B32
machine instruction in both GISel and Selection DAG.

Support VGPR arguments by inserting a `v_readfirstlane`.
2023-10-31 11:26:45 +01:00
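For readers unfamiliar with the operation, here is a sketch of the bit replication, assuming the usual description of `s_bitreplicate_b64_b32`: each bit of the 32-bit input is doubled in the 64-bit result. The Python function is a model for illustration.

```python
def s_bitreplicate_b64_b32(src):
    """Duplicate every bit of a 32-bit value into a 64-bit result."""
    result = 0
    for i in range(32):
        if (src >> i) & 1:
            result |= 0b11 << (2 * i)
    return result


print(bin(s_bitreplicate_b64_b32(0b101)))
print(hex(s_bitreplicate_b64_b32(0x80000000)))
```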
Craig Topper
9a7c26a399
[GISel] Restrict G_BSWAP to multiples of 16 bits. (#70245)
This is consistent with the IR verifier and SelectionDAG's getNode.

Update tests accordingly. I tried to keep some coverage of non-pow2 when
possible. X86 didn't like a G_UNMERGE_VALUES from s48 to 3 s16 that got
created when I tried s48.
2023-10-30 10:27:57 -07:00
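The width restriction in the commit above follows from byte swapping needing an even number of bytes. A minimal sketch, with hypothetical helper names, that enforces the same rule on plain integers:

```python
def is_valid_bswap_width(bits):
    """A byte swap needs an even number of bytes: a multiple of 16 bits."""
    return bits > 0 and bits % 16 == 0


def bswap(value, bits):
    """Reverse the byte order of a `bits`-wide unsigned integer."""
    if not is_valid_bswap_width(bits):
        raise ValueError(f"bswap width must be a multiple of 16, got {bits}")
    value &= (1 << bits) - 1
    return int.from_bytes(value.to_bytes(bits // 8, "little"), "big")


print(hex(bswap(0x1234, 16)))
print(hex(bswap(0x112233445566, 48)))  # non-power-of-two width, still valid
print(is_valid_bswap_width(24))
```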
Jay Foad
101008be83
[AMDGPU] CodeGen for 64-bit buffer atomic cmpswap intrinsics (#70475)
Implement codegen for:
llvm.amdgcn.raw.buffer.atomic.cmpswap.i64
llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.buffer.atomic.cmpswap.i64
llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.i64
2023-10-30 16:44:22 +00:00
Jessica Del
849297c97d
[AMDGPU][wmma] - Add tied wmma intrinsic (#69903)
These new intrinsics, `amdgcn_wmma_tied_f16_16x16x16_f16` and
`amdgcn_wmma_tied_bf16_16x16x16_bf16`,
explicitly tie the destination accumulator matrix to the input
accumulator matrix.

The `wmma_f16` and `wmma_bf16` intrinsics only write to 16 bits of each
32-bit destination VGPR; which half is written is determined by the
`op_sel` argument. The other half of the destination registers remains
unchanged.

In some cases, however, we expect the destination to copy the other
halves from the input accumulator, for instance when packing two separate
accumulator matrices into one. In that case, the two matrices are tied
into the same registers, but in separate halves, so it is important to
copy the other matrix's values to the new destination.
2023-10-30 16:23:49 +01:00
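The half-register behaviour described in the commit above can be sketched on a single 32-bit value. This is a simplified model: `write_half` is a hypothetical helper, and the convention that `op_sel` 0 selects the low half and 1 the high half is an assumption made for the example.

```python
def write_half(dest32, value16, op_sel):
    """Write a 16-bit value into one half of a 32-bit register.

    The other half is preserved from `dest32`, which models the
    destination being tied to the input accumulator.
    """
    shift = 16 if op_sel else 0
    keep_mask = 0xFFFF << (16 - shift)  # mask of the half left untouched
    return (dest32 & keep_mask) | ((value16 & 0xFFFF) << shift)


# Pack two 16-bit results into one register, one half at a time.
reg = 0
reg = write_half(reg, 0x1234, op_sel=0)
reg = write_half(reg, 0xBEEF, op_sel=1)
print(hex(reg))
```

Without the tie, the second write would have no guarantee that the half written by the first one survives.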
Stanislav Mekhanoshin
fe8335babb
[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395)
This allows folding of 64-bit operands if they fit into 32 bits. Fixes
https://github.com/llvm/llvm-project/issues/67781
2023-10-30 08:12:28 -07:00
Stanislav Mekhanoshin
ee6d62db99
[AMDGPU] Prevent folding of the negative i32 literals as i64 (#70274)
We can use sign-extended 64-bit literals, but only for signed operands.
At the moment we do not know whether an operand is signed. Such an
operand will be encoded as its low 32 bits and then either correctly
sign-extended or incorrectly zero-extended by HW.
2023-10-30 08:07:43 -07:00
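The hazard described in the commit above can be seen by comparing the two ways a 32-bit literal may be widened. The sketch below is a plain integer model; `fits_in_32bit_literal` and its `operand_is_signed` parameter are hypothetical names used only for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sext32_to_64(low32):
    """Sign-extend a 32-bit pattern to 64 bits."""
    low32 &= MASK32
    if low32 & 0x80000000:
        return low32 | 0xFFFFFFFF00000000
    return low32


def fits_in_32bit_literal(imm64, operand_is_signed):
    """True if widening the low 32 bits reproduces the 64-bit value."""
    imm64 &= MASK64
    low32 = imm64 & MASK32
    widened = sext32_to_64(low32) if operand_is_signed else low32
    return widened == imm64


minus_one = -1 & MASK64
print(fits_in_32bit_literal(minus_one, operand_is_signed=True))
print(fits_in_32bit_literal(minus_one, operand_is_signed=False))
```

A negative value only survives the round trip when the hardware sign-extends, which is why folding is unsafe while the signedness of the operand is unknown.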
Simon Pilgrim
d96529af3c [DAG] Attempt shl narrowing in SimplifyDemandedBits (REAPPLIED)
If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext.

Followup to D146121

Reapplied - moved after the ShrinkDemandedOp call; reuse the existing KnownBits result; ensure that we only attempt this if all the upper bits are demanded; 547dc461225ba should address the remaining regressions that were noticed in the previous commit.

Differential Revision: https://reviews.llvm.org/D155472
2023-10-29 15:38:46 +00:00
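The narrowing described in the commit above is valid because the low half of a left shift depends only on the low half of the input. A minimal sketch for a 64-bit shift narrowed to 32 bits, assuming shift amounts below 32; the function names are illustrative and do not correspond to SelectionDAG APIs.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def wide_shl(x, amt):
    """64-bit left shift."""
    return (x << amt) & MASK64


def narrow_shl(x, amt):
    """Truncate to 32 bits, shift, then zero-extend back to 64 bits."""
    return ((x & MASK32) << amt) & MASK32


samples = [0, 1, 0x89ABCDEF, 0x0123456789ABCDEF, MASK64]
for x in samples:
    for amt in range(32):
        # Only the low 32 bits are compared: the upper half is assumed
        # to be undemanded by the users of the shift.
        assert wide_shl(x, amt) & MASK32 == narrow_shl(x, amt)
print("narrow and wide shifts agree on the low 32 bits")
```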
Changpeng Fang
8ceb72ffe5
[AMDGPU] make v32i16/v32f16 legal (#70484)
Some upcoming intrinsics will be using these new types
2023-10-27 15:28:31 -07:00