6113 Commits

Author SHA1 Message Date
Jeffrey Byrnes
1f08d3bc3a [AMDGPU] Further reduce attaching of implicit operands to spills
Extension of https://reviews.llvm.org/D141101 to further reduce the number of implicit operands we attach. The main benefit is to improve the capability of the post-ra scheduler and to reduce unneeded dependency resolution (e.g. inserting snops).

Unfortunately, if we naively minimize the number of implicit operands completely, we run into some regressions (e.g. dual_movs are replaced with multiple calls to v_mov). This is even more reason to switch to LiveRegUnits.

Nonetheless, this patch removes the operands which we can for free (more or less).

Change-Id: Ib4f409202b36bdbc59eed615bc2d19fa8bd8c057

Differential Revision: https://reviews.llvm.org/D141557

Change-Id: I8b039e3c0d39436b384083f8beb947ee1b1730b2
2023-01-19 14:31:07 -08:00
Stanislav Mekhanoshin
63e7e9c875 [AMDGPU] Treat WMMA the same as MFMA for sched_barrier
MFMA and WMMA are essentially the same thing, but appear on different ASICs.

Differential Revision: https://reviews.llvm.org/D142062
2023-01-19 10:52:31 -08:00
Stanislav Mekhanoshin
e7f080b359 [AMDGPU] Introduce separate register limit bias in scheduler
Current implementation abuses ErrorMargin to apply an additional
bias to VGPR and SGPR limits under a high register pressure. The
ErrorMargin exists to account for inaccuracies of the RP tracker
and not to tackle an excess pressure. Introduce separate bias for
this purpose and also make it different for SGPRs and VGPRs as we
may want to use different values in the future.

This is supposed to be NFC, however there is a subtle difference
when subtracting a margin overflows the limit. Doing two subtractions
makes it less probable, although manifests only in mir tests with
an artificially small register budget.

Differential Revision: https://reviews.llvm.org/D142051
2023-01-19 10:51:40 -08:00
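The overflow case mentioned in the e7f080b359 message can be illustrated with plain unsigned arithmetic. The following is a minimal sketch, not the code from the patch: the function names and the clamp-at-zero policy are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): subtract without wrapping below zero.
unsigned subtractClamped(unsigned Limit, unsigned Amount) {
  return Limit > Amount ? Limit - Amount : 0u;
}

// Naive form: one unsigned subtraction of the combined margin and bias.
// If ErrorMargin + Bias exceeds Limit, the result wraps to a huge value.
unsigned naiveLimit(unsigned Limit, unsigned ErrorMargin, unsigned Bias) {
  return Limit - (ErrorMargin + Bias);
}

// Two-step form: apply the error margin and the bias separately,
// clamping after each step so the limit can never wrap.
unsigned biasedLimit(unsigned Limit, unsigned ErrorMargin, unsigned Bias) {
  unsigned Result = subtractClamped(subtractClamped(Limit, ErrorMargin), Bias);
  assert(Result <= Limit && "a biased limit never exceeds the original");
  return Result;
}
```

With a small budget such as a limit of 4, a margin of 3 and a bias of 2, the naive form wraps while the clamped form yields 0, which is the kind of difference that would only show up with an artificially small register budget.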
Paul Kirth
557a5bc336 [codegen] Add StackFrameLayoutAnalysisPass
Issue #58168 describes the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. For simple code, it's easy to
understand the stack layout and where space is being allocated, but in
more complex programs, where code may be heavily inlined, unrolled, and
have duplicated code paths, it is no longer easy to manually inspect the
source program and understand where stack space can be attributed.

This patch implements a machine function pass that emits remarks with a
textual representation of stack slots, and also outputs any available
debug information to map source variables to those slots.

The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout`
to the compiler invocation. Like other remarks, the diagnostic
information can be saved to a file in a machine readable format by
adding -fsave-optimization-record.

Fixes: #58168

Reviewed By: nickdesaulniers, thegameg

Differential Revision: https://reviews.llvm.org/D135488
2023-01-19 01:51:14 +00:00
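To try the remark described in 557a5bc336, any function with stack-allocated locals will do. The sketch below is a hypothetical test input, not taken from the patch; the file name in the comment is an assumption, while the remark flag is the one named in the commit message.

```cpp
// Assumed invocation (the file name is illustrative):
//   clang++ -c frame.cpp -Rpass-analysis=stack-frame-layout
#include <cassert>

// Hypothetical function with a local array that needs stack space.
// Callers must pass N <= 64.
unsigned sumSquares(unsigned N) {
  assert(N <= 64 && "Scratch holds at most 64 entries");
  unsigned Scratch[64];
  for (unsigned I = 0; I < N; ++I)
    Scratch[I] = I * I;
  unsigned Sum = 0;
  for (unsigned I = 0; I < N; ++I)
    Sum += Scratch[I];
  return Sum;
}
```

Whether the array still occupies a stack slot after optimization depends on the optimization level chosen, so the remark text will vary between builds.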
Jeffrey Byrnes
f0e7ae085f [AMDGPU] Run autogen checks on test
Change-Id: I46f2ced9ceac592c2a93a00631014a806d4b0693
2023-01-18 16:12:18 -08:00
Nikita Popov
9ed2f14c87 [AsmParser] Remove typed pointer auto-detection
IR is now always parsed in opaque pointer mode, unless
-opaque-pointers=0 is explicitly given. There is no automatic
detection of typed pointers anymore.

The -opaque-pointers=0 option is added to any remaining IR tests
that haven't been migrated yet.

Differential Revision: https://reviews.llvm.org/D141912
2023-01-18 09:58:32 +01:00
Pierre van Houtryve
fd3300123d [CodeGen] Prevent overlapping subregs in getCoveringSubRegIndexes
If `getCoveringSubRegIndexes` returns a set of subregister indexes where some subregisters overlap others, it can create unsatisfiable copy bundles that eventually cause VirtRegRewriter to error out due to "cycles in copy bundle".

We can simply prevent this by making the algorithm skip over subregister indexes that would cause an overlap with already-covered lanes.

Note that in the case of AMDGPU, this problem is caused by the lack of subregister indexes for 13/14/15-register tuples. We have everything up until 12, then we have 16 and 32 but nothing between 12 and 16.
This means that the best candidate for the fewest copies when splitting a 29-register tuple was to copy (e.g.) 0-15 and 14-29, causing an overlap.
With this change, getCoveringSubRegIndexes will now prefer using something like 0-15, 16-28 and 1

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D141576
2023-01-18 03:50:17 -05:00
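The idea in fd3300123d of skipping candidates that overlap already-covered lanes can be sketched with bitmasks. This is a simplified greedy model, not the actual getCoveringSubRegIndexes implementation: the `LaneMask` alias, the helper names, and the candidate set (contiguous ranges of 1 to 12 or 16 registers) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using LaneMask = std::uint64_t; // one bit per register in the tuple

// Mask covering Count registers starting at register First.
LaneMask range(unsigned First, unsigned Count) {
  assert(Count < 64 && First + Count <= 64 && "range must fit in the mask");
  return ((LaneMask{1} << Count) - 1) << First;
}

unsigned countLanes(LaneMask M) {
  unsigned N = 0;
  for (; M != 0; M &= M - 1) // clear the lowest set bit each round
    ++N;
  return N;
}

// Hypothetical candidate set: sizes 1..12 and 16 at every offset below 32.
std::vector<LaneMask> exampleCandidates() {
  std::vector<LaneMask> Out;
  for (unsigned Size = 1; Size <= 16; ++Size) {
    if (Size > 12 && Size < 16)
      continue; // no 13/14/15-register candidates
    for (unsigned First = 0; First + Size <= 32; ++First)
      Out.push_back(range(First, Size));
  }
  return Out;
}

// Greedily pick the largest candidate that stays inside Target and does
// not overlap lanes that are already covered.
std::vector<LaneMask> coverWithoutOverlap(LaneMask Target,
                                          const std::vector<LaneMask> &Cands) {
  std::vector<LaneMask> Picked;
  LaneMask Covered = 0;
  while (Covered != Target) {
    LaneMask Best = 0;
    for (LaneMask C : Cands) {
      if ((C & ~Target) != 0 || (C & Covered) != 0)
        continue; // outside the target, or overlapping: skip
      if (countLanes(C) > countLanes(Best))
        Best = C;
    }
    if (Best == 0)
      return {}; // no candidate fits: the target cannot be covered
    Picked.push_back(Best);
    Covered |= Best;
  }
  return Picked;
}
```

For a 29-register target, this model picks three disjoint pieces instead of two overlapping ones, trading one extra copy for a bundle that has no overlapping lanes.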
Pierre van Houtryve
6a60a68e72 [AMDGPU] Precommit test for D141576
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D141903
2023-01-18 03:49:37 -05:00
Anshil Gandhi
5073a622a7 [MachineBasicBlock] Explicit FT branching param
Introduce a parameter in getFallThrough() to optionally
allow returning the fall through basic block in spite of
an explicit branch instruction to it. This parameter is
set to false by default.

Introduce getLogicalFallThrough() which calls
getFallThrough(false) to obtain the block while avoiding
insertion of a jump instruction to its immediate successor.

This patch also reverts the changes made by D134557 and
solves the case where a jump is inserted after another jump
(branch-relax-no-terminators.mir).

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D140790
2023-01-17 17:12:08 -07:00
Pierre van Houtryve
9bdfd7c8db [AMDGPU] Regenerate extend-phi-subrange-not-in-parent.mir 2023-01-16 02:29:24 -05:00
Matt Arsenault
ab6b48b711 DAG: Avoid stack lowering if bitcast has an illegal vector result type
A bitcast of <10 x i32> to <5 x i64> was ending up on the
stack. Instead of doing that, handle the case where the new type
doesn't evenly divide but the elements do. Extract the individual
elements and pad with undef.

Avoids stack usage for bitcasts involving <5 x i64>. In some of these
cases, later optimizations actually eliminated the stack objects but
left behind the unused temporary stack object to final emission.

Fixes: SWDEV-377548
2023-01-15 12:37:14 -05:00
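The element-wise approach described in ab6b48b711 can be pictured on plain integers. This is a hypothetical sketch, not the DAG legalizer code: it assumes little-endian element order, models undef as an empty `std::optional`, and takes the padded width as a caller-supplied parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// std::nullopt stands in for an undef element.
using Lane64 = std::optional<std::uint64_t>;

// Reinterpret pairs of 32-bit elements as 64-bit elements, then pad the
// result with undef elements up to PaddedWidth.
std::vector<Lane64> bitcastAndPad(const std::vector<std::uint32_t> &Src,
                                  std::size_t PaddedWidth) {
  assert(Src.size() % 2 == 0 && "need whole 64-bit elements");
  assert(PaddedWidth >= Src.size() / 2 && "padding cannot shrink the vector");
  std::vector<Lane64> Out;
  Out.reserve(PaddedWidth);
  for (std::size_t I = 0; I < Src.size(); I += 2) {
    std::uint64_t Lo = Src[I];
    std::uint64_t Hi = Src[I + 1];
    Out.push_back(Lo | (Hi << 32)); // low element first (little-endian)
  }
  Out.resize(PaddedWidth); // new elements are empty optionals, i.e. undef
  return Out;
}
```

Nothing in this model goes through memory, which mirrors the stated goal of keeping such bitcasts off the stack.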
Matt Arsenault
6ee5a1a090 GlobalISel: Enable CSE for G_SEXT_INREG 2023-01-15 11:38:30 -05:00
Matt Arsenault
e70ae0f46b DAG/GlobalISel: Fix broken/redundant setting of MODereferenceable
This was incorrectly setting dereferenceable on unaligned
operands. getLoadMemOperandFlags does the dereferenceability
check without alignment, and then both paths went on to check
isDereferenceableAndAlignedPointer. Make getLoadMemOperandFlags check
isDereferenceableAndAlignedPointer, and remove the second call.
2023-01-13 20:30:30 -05:00
Paul Kirth
fdc0bf6adc Revert "[codegen] Add StackFrameLayoutAnalysisPass"
This breaks on some AArch64 bots

This reverts commit 0a652c540556a118bbd9386ed3ab7fd9e60a9754.
2023-01-13 22:59:36 +00:00
Paul Kirth
0a652c5405 [codegen] Add StackFrameLayoutAnalysisPass
Issue #58168 describes the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. For simple code, it's easy to
understand the stack layout and where space is being allocated, but in
more complex programs, where code may be heavily inlined, unrolled, and
have duplicated code paths, it is no longer easy to manually inspect the
source program and understand where stack space can be attributed.

This patch implements a machine function pass that emits remarks with a
textual representation of stack slots, and also outputs any available
debug information to map source variables to those slots.

The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout`
to the compiler invocation. Like other remarks, the diagnostic
information can be saved to a file in a machine readable format by
adding -fsave-optimization-record.

Fixes: #58168

Reviewed By: nickdesaulniers, thegameg

Differential Revision: https://reviews.llvm.org/D135488
2023-01-13 20:52:48 +00:00
Matt Arsenault
0d87732a1f AMDGPU: Use getConstantStringInfo for printf format strings
Tolerates printf format strings that are indexed globals and fixes
asserting on non-null terminated strings.
2023-01-13 14:11:37 -05:00
Matt Arsenault
39af5cec8b AMDGPU: Fix format string indexes for existing llvm.printf.fmts
The index stored to the buffer is just an index into this named
metadata. It would be more robust to produce a private constant table,
and use a constant expression to index into it.
2023-01-13 13:18:27 -05:00
Matt Arsenault
689207e7ed AMDGPU: Some printf call edge case tests
Check printf printing printf, and printf passed to a function.
2023-01-13 12:32:04 -05:00
Matt Arsenault
0e7a83f502 AMDGPU: Don't expand printf users if printf is defined 2023-01-13 12:32:04 -05:00
Matt Arsenault
3718848653 AMDGPU/GlobalISel: Make regbankselect of implicit_def consistent with constants 2023-01-12 22:52:09 -05:00
Matt Arsenault
4d4894ab92 Partially reapply "AMDGPU: Invert handling of enqueued block detection"
This mostly reverts commit 270e96f435596449002fc89962595497481c8770.

Keep the attributor related changes around, but functionally restore
the old behavior as a workaround. Device enqueue goes back to not
working at -O0 with this version.
2023-01-12 15:02:16 -05:00
Matt Arsenault
e7cf44e723 Revert "[amdgpu] Change the RA to basic"
This reverts commit 28733d86cf7bf3e4e9667654ad6785aa8e21e04f.

This was a workaround for a bug which was fixed in
74ef03d38a59bb4da710a43dac189be3d01d0cd7
2023-01-12 08:32:02 -05:00
Jay Foad
4b89e8adda [AMDGPU] Temporarily disable FeatureBackOffBarrier for GFX11
Enabling this feature exposed some incorrect codegen, where a workgroup-
scope barrier fails to properly synchronise two waves from the same
workgroup running on different SIMDs of the same CU.

Disabling FeatureBackOffBarrier causes an s_waitcnt to be emitted before
the barrier which works around the problem.

Differential Revision: https://reviews.llvm.org/D141379
2023-01-11 17:47:56 +00:00
Thomas Symalla
2f38de3222 [NFC][AMDGPU] Pre-commit BFI test. 2023-01-11 14:24:06 +01:00
Ruiling Song
9119d9bfce AMDGPU/SIInsertWait: Skip dummy tied source
For D16 memory load instructions, the hardware usually only writes to half
of the 32-bit register, but we define the destination of the MachineIR
instruction as a 32-bit register. Without the extra tied source register,
LLVM would treat a previous write to the other half of the register as
dead, because a 32-bit destination register implies that the instruction
always overwrites the whole register. With the extra tied source, LLVM
sees a read of the register, so the previous write is not dead. However,
this dummy tied source introduces an unnecessary read-after-write
dependency. The change here is to bypass the tied source where it can be
skipped, thus avoiding an unnecessary s_waitcnt.

Reviewed by: foad

Differential Revision: https://reviews.llvm.org/D140537
2023-01-11 09:59:35 +08:00
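The reason a D16 load needs the old register value as an input at all can be shown with a small model. This is an illustrative sketch, not hardware documentation or LLVM code: the function names are hypothetical, and it assumes the untouched half of the register is preserved exactly.

```cpp
#include <cassert>
#include <cstdint>

// Model of a D16 load into the low half of a 32-bit register: the result
// depends on the old value because the high half is kept.
std::uint32_t loadD16Lo(std::uint32_t OldValue, std::uint16_t Loaded) {
  std::uint32_t Result = (OldValue & 0xFFFF0000u) | Loaded;
  assert((Result >> 16) == (OldValue >> 16) && "high half must be preserved");
  return Result;
}

// Model of a D16 load into the high half: the low half is kept.
std::uint32_t loadD16Hi(std::uint32_t OldValue, std::uint16_t Loaded) {
  return (OldValue & 0x0000FFFFu) | (std::uint32_t{Loaded} << 16);
}
```

In this model `OldValue` plays the role of the tied source: it keeps the earlier write alive for liveness purposes, even though the load itself never has to wait for it to be read.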
Ruiling Song
5d0ff923c3 AMDGPU: Promote array alloca if used by memmove/memcpy
Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D140599
2023-01-11 09:59:35 +08:00
Matt Arsenault
4fc07e1849 AMDGPU: Use constant and externally_initialized for block handle
The runtime initializes this.
2023-01-10 20:35:49 -05:00
Matt Arsenault
0cd3a39e95 AMDGPU: Fix opaque pointer handling for enqueued blocks, again 2023-01-10 20:35:48 -05:00
Matt Arsenault
6454391b31 AMDGPU/GlobalISel: Widen s1 SGPR constants during regbankselect
To unambiguously interpret these as 32-bit SGPRs, we need to widen
these to s32. This was selecting to a copy from a 64-bit SGPR to a
32-bit SGPR for wave64.
2023-01-10 14:45:23 -05:00
Matt Arsenault
4682039db0 AMDGPU: Don't assert on printf of half
The comment says fields should be 4-byte aligned, so just pass through
after conversion to integer. The conformance test lacks any testing of
half.
2023-01-10 14:13:23 -05:00
Matt Arsenault
fdda800ba3 AMDGPU: Fix opaque pointer and other bugs in printf of constant strings
Strip pointer casts to get to the global. Fixes not respecting indexed
constant strings. Tolerate non-null terminated and empty strings.
2023-01-10 13:39:44 -05:00
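Tolerating empty strings and strings without a terminating NUL, as fdda800ba3 describes, comes down to never reading past the known size of the initializer. The helper below is a hypothetical stand-alone sketch of that idea, not the LLVM API and not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: read a format string from a buffer of known size.
// Stops at the first NUL if there is one, otherwise takes the whole buffer.
std::string readFormatString(const char *Data, std::size_t Size) {
  assert((Data != nullptr || Size == 0) && "non-empty buffer needs storage");
  if (Size == 0)
    return std::string(); // empty initializer: empty format string
  const void *Nul = std::memchr(Data, '\0', Size);
  std::size_t Len =
      Nul ? static_cast<std::size_t>(static_cast<const char *>(Nul) - Data)
          : Size; // no terminator: use every byte we were given
  return std::string(Data, Len);
}
```

Because the length is bounded by `Size` rather than by a search for a terminator, an unterminated buffer yields a shorter-than-expected string at worst instead of an out-of-bounds read.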
Matt Arsenault
d8534e4e98 AMDGPU: Don't insert ptrtoint for printf lowering 2023-01-10 13:07:01 -05:00
Matt Arsenault
777b905bae AMDGPU: Stop trying to specially handle vector stores in printf lowering
This was broken for 1 element vectors and trying to create invalid
casts. We can directly store any type just fine, so don't bother with
this buggy conversion logic.
2023-01-10 13:07:01 -05:00
Jay Foad
e2e5d59236 [AMDGPU] Add GFX10/GFX11 wave64 test coverage in huge-private-buffer.ll 2023-01-10 14:17:09 +00:00
Jay Foad
a7c2121d03 [AMDGPU] Fix duplicate -verify-machineinstrs option 2023-01-10 14:07:09 +00:00
Jay Foad
fadacaa87a [AMDGPU] Add GFX11 test coverage for FeatureBackOffBarrier 2023-01-10 12:36:16 +00:00
Jay Foad
efc5cedb38 [AMDGPU] Regenerate checks in waitcnt-preexisting-vscnt.mir 2023-01-10 12:36:16 +00:00
Jessica Del
f33633f512 [AMDGPU] adding test for partially masked operands
This test checks whether the compiler behaves correctly when only
parts of an operand are masked. In this case, no optimization is
supposed to happen, since neither the upper nor the lower half is
fully masked. Therefore, none of the halves can be known to be zero.
The result is a regular multiplication.
2023-01-10 11:05:52 +01:00
Stanislav Mekhanoshin
c8ed36281a [AMDGPU] Cast sub-dword elements to i32 in concat_vectors
This produces better code by avoiding repacking in some cases.

Fixes: SWDEV-373436

Differential Revision: https://reviews.llvm.org/D141329
2023-01-09 15:35:49 -08:00
Jeffrey Byrnes
596c558155 [AMDGPU] More selectively attach implicit operands to agpr spills
Implicit def operands are needed when we spill partially undef super registers by each individual subregister. The implicit-def operands will allow us to lower spills without the verifier complaining. Currently, we are overzealously attaching implicit operands, when we really only need them on the first sub reg spill op. By more selectively attaching the implicit ops, we will free up some unneeded dependencies for the post-ra scheduler.

Moreover, this enables a previously incorrect optimization / resolves a correctness issue in indirectCopyToAGPR. When lowering AGPR copies on GFX908, we can improve CodeGen by reusing accvgpr_writes. However, we could not reliably determine which agprs accvgpr_writes actually define due to implicit-defs.

Differential Revision: https://reviews.llvm.org/D141101
2023-01-09 15:10:06 -08:00
Stanislav Mekhanoshin
d562d30fb5 [AMDGPU] More tests for vector_shuffle.packed.ll. NFC.
Pre-commit tests before the next patch. Subtest shuffle_v16f16_concat
exposes the problem with suboptimal lowering.
2023-01-09 14:57:45 -08:00
Matt Arsenault
270e96f435 Revert "AMDGPU: Invert handling of enqueued block detection"
This reverts commit 47288cc977fa31c44cc92b4e65044a5b75c2597e.

The runtime is having trouble with this at -O0 when the inputs are
always enabled.
2023-01-07 21:48:07 -05:00
Matt Arsenault
47554a0c73 AMDGPU: Use more accurate IR type for block handle
The device library uses this as a struct with a pointer sized integer
and 2 ints.
2023-01-06 21:23:28 -05:00
Matt Arsenault
b7587ca837 AMDGPU: Add more opencl printf tests 2023-01-06 21:23:14 -05:00
Matt Arsenault
47288cc977 AMDGPU: Invert handling of enqueued block detection
Invert the sense of the attribute and let the attributor figure this
out like everything else. If needed we can have the not-OpenCL
languages set amdgpu-no-default-queue and amdgpu-no-completion-action
up front so they never have to pay the cost.

There are also so many of these now, the offset use API should
probably consider all of them at once. Maybe they should merge into
one attribute with used fields. Having separate functions for each
field in AMDGPUBaseInfo is also not the greatest API (might as well
fix this when the patch to get the object version from the module
lands).
2023-01-06 21:16:08 -05:00
Matt Arsenault
0416883dc1 AMDGPU: Fix enqueue block lowering for opaque pointers
This was looking for a specific constant cast of the function, when
the type doesn't matter. Doesn't bother trying to handle typed
pointers, it will just assert.

Things probably don't work completely correctly if the block kernel
address is captured somewhere else, but that wouldn't work before
either. The uses should really be loads out of the handle, and the
handle initializer should contain the kernel address.
2023-01-06 21:15:39 -05:00
Matt Arsenault
4ce5400a3f AMDGPU: Convert enqueue-kernel.ll to opaque pointers
This demonstrates the pass is broken with them, the follow up change
will fix it.
2023-01-06 21:15:39 -05:00
Matt Arsenault
8723836358 AMDGPU: Add additional printf string tests
Test various inputs passed to %s.
2023-01-06 17:22:13 -05:00
Matt Arsenault
b4d44322d9 AMDGPU/GlobalISel: Add missing test for implicit_def regbankselect 2023-01-06 08:58:10 -05:00
Matt Arsenault
6fe85933d4 AMDGPU/GlobalISel: Add wave32 checks to bool test 2023-01-06 08:58:10 -05:00