llvm-project

Author	SHA1	Message	Date
Jeffrey Byrnes	1f08d3bc3a	[AMDGPU] Further reduce attaching of implicit operands to spills Extension of https://reviews.llvm.org/D141101 to even further reduce the number of implicit operands we attach. The main benefit is to improve the capability of the post-ra scheduler and to reduce unneeded dependency resolution (e.g. inserting snops). Unfortunately, we run into some regressions if we naively minimize the number of implicit operands completely (e.g. dual_movs are replaced with multiple calls to v_mov). This is even more reason to switch to LiveRegUnits. Nonetheless, this patch removes the operands which we can for free (more or less). Change-Id: Ib4f409202b36bdbc59eed615bc2d19fa8bd8c057 Differential Revision: https://reviews.llvm.org/D141557 Change-Id: I8b039e3c0d39436b384083f8beb947ee1b1730b2	2023-01-19 14:31:07 -08:00
Stanislav Mekhanoshin	63e7e9c875	[AMDGPU] Treat WMMA the same as MFMA for sched_barrier MFMA and WMMA are essentially the same thing, but appear on different ASICs. Differential Revision: https://reviews.llvm.org/D142062	2023-01-19 10:52:31 -08:00
Stanislav Mekhanoshin	e7f080b359	[AMDGPU] Introduce separate register limit bias in scheduler Current implementation abuses ErrorMargin to apply an additional bias to VGPR and SGPR limits under a high register pressure. The ErrorMargin exists to account for inaccuracies of the RP tracker and not to tackle an excess pressure. Introduce separate bias for this purpose and also make it different for SGPRs and VGPRs as we may want to use different values in the future. This is supposed to be NFC, however there is a subtle difference when subtracting a margin overflows the limit. Doing two subtractions makes it less probable, although manifests only in mir tests with an artificially small register budget. Differential Revision: https://reviews.llvm.org/D142051	2023-01-19 10:51:40 -08:00
Paul Kirth	557a5bc336	[codegen] Add StackFrameLayoutAnalysisPass Issue #58168 describes the difficulty diagnosing stack size issues identified by -Wframe-larger-than. For simple code, it's easy to understand the stack layout and where space is being allocated, but in more complex programs, where code may be heavily inlined, unrolled, and have duplicated code paths, it is no longer easy to manually inspect the source program and understand where stack space can be attributed. This patch implements a machine function pass that emits remarks with a textual representation of stack slots, and also outputs any available debug information to map source variables to those slots. The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout` to the compiler invocation. Like other remarks, the diagnostic information can be saved to a file in a machine-readable format by adding -fsave-optimization-record. Fixes: #58168 Reviewed By: nickdesaulniers, thegameg Differential Revision: https://reviews.llvm.org/D135488	2023-01-19 01:51:14 +00:00
Jeffrey Byrnes	f0e7ae085f	[AMDGPU] Run autogen checks on test Change-Id: I46f2ced9ceac592c2a93a00631014a806d4b0693	2023-01-18 16:12:18 -08:00
Nikita Popov	9ed2f14c87	[AsmParser] Remove typed pointer auto-detection IR is now always parsed in opaque pointer mode, unless -opaque-pointers=0 is explicitly given. There is no automatic detection of typed pointers anymore. The -opaque-pointers=0 option is added to any remaining IR tests that haven't been migrated yet. Differential Revision: https://reviews.llvm.org/D141912	2023-01-18 09:58:32 +01:00
Pierre van Houtryve	fd3300123d	[CodeGen] Prevent overlapping subregs in getCoveringSubRegIndexes If `getCoveringSubRegIndexes` returns a set of subregister indexes where some subregisters overlap others, it can create unsatisfiable copy bundles that eventually cause VirtRegRewriter to error out due to "cycles in copy bundle". We can simply prevent this by making the algorithm skip over subregister indexes that would cause an overlap with already-covered lanes. Note that in the case of AMDGPU, this problem is caused by the lack of subregister indexes for 13/14/15-register tuples. We have everything up until 12, then we have 16 and 32 but nothing between 12 and 16. This means that the best candidate to do the least amount of copies when splitting a 29-register tuple was to copy (e.g.) 0-15 and 14-29, causing an overlap. With this change, getCoveringSubRegIndexes will now prefer using something like 0-15, 16-28 and 1 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D141576	2023-01-18 03:50:17 -05:00
Pierre van Houtryve	6a60a68e72	[AMDGPU] Precommit test for D141576 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D141903	2023-01-18 03:49:37 -05:00
Anshil Gandhi	5073a622a7	[MachineBasicBlock] Explicit FT branching param Introduce a parameter in getFallThrough() to optionally allow returning the fall through basic block in spite of an explicit branch instruction to it. This parameter is set to false by default. Introduce getLogicalFallThrough() which calls getFallThrough(false) to obtain the block while avoiding insertion of a jump instruction to its immediate successor. This patch also reverts the changes made by D134557 and solves the case where a jump is inserted after another jump (branch-relax-no-terminators.mir). Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D140790	2023-01-17 17:12:08 -07:00
Pierre van Houtryve	9bdfd7c8db	[AMDGPU] Regenerate extend-phi-subrange-not-in-parent.mir	2023-01-16 02:29:24 -05:00
Matt Arsenault	ab6b48b711	DAG: Avoid stack lowering if bitcast has an illegal vector result type A bitcast of <10 x i32> to <5 x i64> was ending up on the stack. Instead of doing that, handle the case where the new type doesn't evenly divide but the elements do. Extract the individual elements and pad with undef. Avoids stack usage for bitcasts involving <5 x i64>. In some of these cases, later optimizations actually eliminated the stack objects but left behind the unused temporary stack object to final emission. Fixes: SWDEV-377548	2023-01-15 12:37:14 -05:00
Matt Arsenault	6ee5a1a090	GlobalISel: Enable CSE for G_SEXT_INREG	2023-01-15 11:38:30 -05:00
Matt Arsenault	e70ae0f46b	DAG/GlobalISel: Fix broken/redundant setting of MODereferenceable This was incorrectly setting dereferenceable on unaligned operands. getLoadMemOperandFlags does the dereferenceability check without alignment, and then both paths went on to check isDereferenceableAndAlignedPointer. Make getLoadMemOperandFlags check isDereferenceableAndAlignedPointer, and remove the second call.	2023-01-13 20:30:30 -05:00
Paul Kirth	fdc0bf6adc	Revert "[codegen] Add StackFrameLayoutAnalysisPass" This breaks on some AArch64 bots This reverts commit 0a652c540556a118bbd9386ed3ab7fd9e60a9754.	2023-01-13 22:59:36 +00:00
Paul Kirth	0a652c5405	[codegen] Add StackFrameLayoutAnalysisPass Issue #58168 describes the difficulty diagnosing stack size issues identified by -Wframe-larger-than. For simple code, it's easy to understand the stack layout and where space is being allocated, but in more complex programs, where code may be heavily inlined, unrolled, and have duplicated code paths, it is no longer easy to manually inspect the source program and understand where stack space can be attributed. This patch implements a machine function pass that emits remarks with a textual representation of stack slots, and also outputs any available debug information to map source variables to those slots. The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout` to the compiler invocation. Like other remarks, the diagnostic information can be saved to a file in a machine-readable format by adding -fsave-optimization-record. Fixes: #58168 Reviewed By: nickdesaulniers, thegameg Differential Revision: https://reviews.llvm.org/D135488	2023-01-13 20:52:48 +00:00
Matt Arsenault	0d87732a1f	AMDGPU: Use getConstantStringInfo for printf format strings Tolerates printf format strings that are indexed globals and fixes asserting on non-null-terminated strings.	2023-01-13 14:11:37 -05:00
Matt Arsenault	39af5cec8b	AMDGPU: Fix format string indexes for existing llvm.printf.fmts The index stored to the buffer is just an index into this named metadata. It would be more robust to produce a private constant table, and use a constant expression to index into it.	2023-01-13 13:18:27 -05:00
Matt Arsenault	689207e7ed	AMDGPU: Some printf call edge case tests Check printf printing printf, and printf passed to a function.	2023-01-13 12:32:04 -05:00
Matt Arsenault	0e7a83f502	AMDGPU: Don't expand printf users if printf is defined	2023-01-13 12:32:04 -05:00
Matt Arsenault	3718848653	AMDGPU/GlobalISel: Make regbankselect of implicit_def consistent with constants	2023-01-12 22:52:09 -05:00
Matt Arsenault	4d4894ab92	Partially reapply "AMDGPU: Invert handling of enqueued block detection" This mostly reverts commit 270e96f435596449002fc89962595497481c8770. Keep the attributor related changes around, but functionally restore the old behavior as a workaround. Device enqueue goes back to not working at -O0 with this version.	2023-01-12 15:02:16 -05:00
Matt Arsenault	e7cf44e723	Revert "[amdgpu] Change the RA to basic" This reverts commit 28733d86cf7bf3e4e9667654ad6785aa8e21e04f. This was a workaround for a bug which was fixed in 74ef03d38a59bb4da710a43dac189be3d01d0cd7	2023-01-12 08:32:02 -05:00
Jay Foad	4b89e8adda	[AMDGPU] Temporarily disable FeatureBackOffBarrier for GFX11 Enabling this feature exposed some incorrect codegen, where a workgroup-scope barrier fails to properly synchronise two waves from the same workgroup running on different SIMDs of the same CU. Disabling FeatureBackOffBarrier causes an s_waitcnt to be emitted before the barrier which works around the problem. Differential Revision: https://reviews.llvm.org/D141379	2023-01-11 17:47:56 +00:00
Thomas Symalla	2f38de3222	[NFC][AMDGPU] Pre-commit BFI test.	2023-01-11 14:24:06 +01:00
Ruiling Song	9119d9bfce	AMDGPU/SIInsertWait: Skip dummy tied source For D16 memory load instructions, the hardware usually only writes to half of the 32-bit register, but we define the destination register using a 32-bit register for the MachineIR instruction. Without the extra tied source register, the LLVM framework will think the previous write to the other half of the register is dead. This is because, by using a 32-bit register as the destination register, LLVM will think the instruction always overwrites the whole 32-bit register. By adding the extra tied source, LLVM will think we are reading the register, so the previous write to the register will not be dead. This dummy tied source introduces an unnecessary read-after-write dependency. The change here is to bypass the tied source that can be skipped, thus avoiding an unnecessary s_waitcnt. Reviewed by: foad Differential Revision: https://reviews.llvm.org/D140537	2023-01-11 09:59:35 +08:00
Ruiling Song	5d0ff923c3	AMDGPU: Promote array alloca if used by memmove/memcpy Reviewed by: arsenm Differential Revision: https://reviews.llvm.org/D140599	2023-01-11 09:59:35 +08:00
Matt Arsenault	4fc07e1849	AMDGPU: Use constant and externally_initialized for block handle The runtime initializes this.	2023-01-10 20:35:49 -05:00
Matt Arsenault	0cd3a39e95	AMDGPU: Fix opaque pointer handling for enqueued blocks, again	2023-01-10 20:35:48 -05:00
Matt Arsenault	6454391b31	AMDGPU/GlobalISel: Widen s1 SGPR constants during regbankselect To unambiguously interpret these as 32-bit SGPRs, we need to widen these to s32. This was selecting to a copy from a 64-bit SGPR to a 32-bit SGPR for wave64.	2023-01-10 14:45:23 -05:00
Matt Arsenault	4682039db0	AMDGPU: Don't assert on printf of half The comment says fields should be 4-byte aligned, so just pass through after conversion to integer. The conformance test lacks any testing of half.	2023-01-10 14:13:23 -05:00
Matt Arsenault	fdda800ba3	AMDGPU: Fix opaque pointer and other bugs in printf of constant strings Strip pointer casts to get to the global. Fixes not respecting indexed constant strings. Tolerate non-null terminated and empty strings.	2023-01-10 13:39:44 -05:00
Matt Arsenault	d8534e4e98	AMDGPU: Don't insert ptrtoint for printf lowering	2023-01-10 13:07:01 -05:00
Matt Arsenault	777b905bae	AMDGPU: Stop trying to specially handle vector stores in printf lowering This was broken for 1 element vectors and trying to create invalid casts. We can directly store any type just fine, so don't bother with this buggy conversion logic.	2023-01-10 13:07:01 -05:00
Jay Foad	e2e5d59236	[AMDGPU] Add GFX10/GFX11 wave64 test coverage in huge-private-buffer.ll	2023-01-10 14:17:09 +00:00
Jay Foad	a7c2121d03	[AMDGPU] Fix duplicate -verify-machineinstrs option	2023-01-10 14:07:09 +00:00
Jay Foad	fadacaa87a	[AMDGPU] Add GFX11 test coverage for FeatureBackOffBarrier	2023-01-10 12:36:16 +00:00
Jay Foad	efc5cedb38	[AMDGPU] Regenerate checks in waitcnt-preexisting-vscnt.mir	2023-01-10 12:36:16 +00:00
Jessica Del	f33633f512	[AMDGPU] adding test for partially masked operands This test is testing whether the compiler behaves correctly when only parts of an operand are masked. In this case, no optimization is supposed to happen, since neither the upper nor the lower half is fully masked. Therefore, none of the halves can be known to be zero. The result is a regular multiplication.	2023-01-10 11:05:52 +01:00
Stanislav Mekhanoshin	c8ed36281a	[AMDGPU] Cast sub-dword elements to i32 in concat_vectors This produces better code by avoiding repacking in some cases. Fixes: SWDEV-373436 Differential Revision: https://reviews.llvm.org/D141329	2023-01-09 15:35:49 -08:00
Jeffrey Byrnes	596c558155	[AMDGPU] More selectively attach implicit operands to agpr spills Implicit def operands are needed when we spill partially undef super registers by each individual subregister. The implicit-def operands will allow us to lower spills without the verifier complaining. Currently, we are overzealously attaching implicit operands, when we really only need them on the first sub reg spill op. By more selectively attaching the implicit ops, we will free up some unneeded dependencies for the post-ra scheduler. Moreover, this enables a previously incorrect optimization / resolves a correctness issue in indirectCopyToAGPR. When lowering AGPR copies on GFX908, we can improve CodeGen by reusing accvgpr_writes. However, we could not reliably determine which agprs accvgpr_writes actually define due to implicit-defs. Differential Revision: https://reviews.llvm.org/D141101	2023-01-09 15:10:06 -08:00
Stanislav Mekhanoshin	d562d30fb5	[AMDGPU] More tests for vector_shuffle.packed.ll. NFC. Pre-commit tests before the next patch. Subtest shuffle_v16f16_concat exposes the problem with suboptimal lowering.	2023-01-09 14:57:45 -08:00
Matt Arsenault	270e96f435	Revert "AMDGPU: Invert handling of enqueued block detection" This reverts commit 47288cc977fa31c44cc92b4e65044a5b75c2597e. The runtime is having trouble with this at -O0 when the inputs are always enabled.	2023-01-07 21:48:07 -05:00
Matt Arsenault	47554a0c73	AMDGPU: Use more accurate IR type for block handle The device library uses this as a struct with a pointer sized integer and 2 ints.	2023-01-06 21:23:28 -05:00
Matt Arsenault	b7587ca837	AMDGPU: Add more opencl printf tests	2023-01-06 21:23:14 -05:00
Matt Arsenault	47288cc977	AMDGPU: Invert handling of enqueued block detection Invert the sense of the attribute and let the attributor figure this out like everything else. If needed we can have the not-OpenCL languages set amdgpu-no-default-queue and amdgpu-no-completion-action up front so they never have to pay the cost. There are also so many of these now, the offset use API should probably consider all of them at once. Maybe they should merge into one attribute with used fields. Having separate functions for each field in AMDGPUBaseInfo is also not the greatest API (might as well fix this when the patch to get the object version from the module lands).	2023-01-06 21:16:08 -05:00
Matt Arsenault	0416883dc1	AMDGPU: Fix enqueue block lowering for opaque pointers This was looking for a specific constant cast of the function, when the type doesn't matter. Doesn't bother trying to handle typed pointers, it will just assert. Things probably don't work completely correctly if the block kernel address is captured somewhere else, but that wouldn't work before either. The uses should really be loads out of the handle, and the handle initializer should contain the kernel address.	2023-01-06 21:15:39 -05:00
Matt Arsenault	4ce5400a3f	AMDGPU: Convert enqueue-kernel.ll to opaque pointers This demonstrates the pass is broken with them, the follow up change will fix it.	2023-01-06 21:15:39 -05:00
Matt Arsenault	8723836358	AMDGPU: Add additional printf string tests Test various inputs passed to %s.	2023-01-06 17:22:13 -05:00
Matt Arsenault	b4d44322d9	AMDGPU/GlobalISel: Add missing test for implicit_def regbankselect	2023-01-06 08:58:10 -05:00
Matt Arsenault	6fe85933d4	AMDGPU/GlobalISel: Add wave32 checks to bool test	2023-01-06 08:58:10 -05:00
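
The overlap issue described in fd3300123d ([CodeGen] Prevent overlapping subregs in getCoveringSubRegIndexes) can be illustrated with a small greedy cover. This is only a sketch, not the LLVM implementation: the `(start, size)` lane model, the `SIZES` list, unrestricted start offsets, and the names `lane_mask` and `covering_indexes` are all assumptions made for illustration.

```python
# Hypothetical model: a subregister index is a (start, size) run of 32-bit lanes.
# Sizes follow the commit message (1..12, then 16 and 32); allowing any start
# offset is an assumption of this sketch.
SIZES = list(range(1, 13)) + [16, 32]


def lane_mask(start, size):
    """Bitmask with `size` consecutive bits set, beginning at bit `start`."""
    return ((1 << size) - 1) << start


def covering_indexes(num_lanes, allow_overlap):
    """Greedily pick (start, size) runs until lanes 0..num_lanes-1 are covered."""
    target = lane_mask(0, num_lanes)
    # Only consider runs that lie entirely inside the register tuple.
    candidates = [(s, n) for n in SIZES for s in range(num_lanes - n + 1)]
    covered = 0
    chosen = []
    while covered != target:
        best, best_gain = None, 0
        for start, size in candidates:
            mask = lane_mask(start, size)
            # The fix: skip any candidate that touches already-covered lanes.
            if not allow_overlap and mask & covered:
                continue
            gain = bin(mask & ~covered).count("1")
            if gain > best_gain:
                best, best_gain = (start, size), gain
        chosen.append(best)
        covered |= lane_mask(*best)
    return chosen


if __name__ == "__main__":
    print(covering_indexes(29, allow_overlap=True))
    print(covering_indexes(29, allow_overlap=False))
```

For a 29-lane tuple, the overlapping variant of this sketch needs only two runs, but they share lanes. The non-overlapping variant trades that for one extra, disjoint run.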


6113 Commits