llvm-project

Author	SHA1	Message	Date
Jay Foad	a4196666ac	[AMDGPU] Revert "Preliminary patch for divergence driven instruction selection. Operands Folding 1." (#71710 ) This reverts commit 201f892b3b597f24287ab6a712a286e25a45a7d9.	2023-11-13 13:53:10 +00:00
Jay Foad	47f29043f0	[AMDGPU] Fix a GlobalISel RUN line This was added in D149795 without actually enabling GlobalISel.	2023-11-13 11:30:15 +00:00
Carl Ritson	edc38a6cbd	[AMDGPU] Add option to pre-allocate SGPR spill VGPRs (#70626 ) SGPR spill VGPRs are WWM registers so allow them to be allocated by SIPreAllocateWWMRegs pass. This intentionally prevents spilling of these VGPRs when enabled.	2023-11-13 12:21:18 +09:00
Carl Ritson	52b247b1d3	[PHIElimination] Handle subranges in LiveInterval updates (#69429 ) Add subrange tracking and handling for LiveIntervals during PHI elimination. This requires extending MachineBasicBlock::SplitCriticalEdge to also update subrange intervals.	2023-11-13 12:16:26 +09:00
Joseph Huber	a3bd87b100	[AMDGPU] Call the `FINI_ARRAY` destructors in the correct order (#71815 ) Summary: The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY sections to call all the global constructors in a single kernel. Previously this mistakenly used the same iteration logic for both arrays. The destructors stored in FINI_ARRAY are stored in the same order as the ones in the INIT_ARRAY section so we need to traverse it in reverse order. Relanding after the revert in fe7b5e2cfcf6848287010291081f85fa1f6bb2ef using the IR builder interface instead of ConstantExpr.	2023-11-10 11:01:02 -06:00
Nikita Popov	fe7b5e2cfc	Revert "[AMDGPU] Call the `FINI_ARRAY` destructors in the correct order (#71815 )" This reverts commit c1d5865a313d0a8a254b37c852bdd444453c0f73. Introduces a new use of ConstantExpr::getAShr().	2023-11-10 17:01:06 +01:00
Joseph Huber	c1d5865a31	[AMDGPU] Call the `FINI_ARRAY` destructors in the correct order (#71815 ) Summary: The AMDGPU backend uses the linker-provided INIT_ARRAY and FINI_ARRAY sections to call all the global constructors in a single kernel. Previously this mistakenly used the same iteration logic for both arrays. The destructors stored in FINI_ARRAY are stored in the same order as the ones in the INIT_ARRAY section so we need to traverse it in reverse order.	2023-11-10 09:34:04 -06:00
Valery Pykhtin	87b8d94371	[AMDGPU] Fix GCNUpwardRPTracker. (#71186 ) Fixed: 1. Maximum register pressure calculation at the instruction level. Previously, max RP included both the defs and the uses of an instruction's registers. Now maximum RP includes _uses_ and _early-clobber defs_. 2. Uses were tracked incorrectly, which resulted in a mismatch between the live-in set reported by LiveIntervals and the tracked live reg set when the beginning of the block was reached. The interface has changed: moveMaxPressure is deprecated, and the getMaxPressure and resetMaxPressure functions are added. The reset function now seems more consistent.	2023-11-10 13:44:10 +01:00
Diana Picus	20e9e4f797	[AMDGPU] si-wqm: Skip only LiveMask COPY si-wqm sometimes needs to save the LiveMask in the entry block. Later on, while looking for a place to enter WQM/WWM, it unconditionally skips over the first COPY instruction in the entry block. This is incorrect for functions where the LiveMask doesn't need to be saved, and therefore the first COPY is more likely a COPY from a function argument and might need to be in some non-exact mode. This patch fixes the issue by also checking that the source of the COPY is the EXEC register. This produces different code in 3 of the existing tests: In wwm-reserved.ll, a SGPR copy is now inside the WWM area rather than outside. This is benign. In wave32.ll, we end up with an extra register copy. This is because the first COPY in the block is now part of the WWM block, so si-pre-allocate-wwm-regs will allocate a new register for its destination (when it was outside of the WWM region, the register allocator could just re-use the same register). We might be able to improve this in si-pre-allocate-wwm-regs but I haven't looked into it. The same thing happens in dual-source-blend-export.ll, but for that one it's harder to see because of the scheduling changes. I've uploaded the before/after si-wqm output for it here: https://reviews.llvm.org/differential/diff/553445/ Differential Revision: https://reviews.llvm.org/D158841	2023-11-10 09:30:44 +01:00
Matt Arsenault	67c3cb4f6b	AMDGPU: Use an explicit triple in test to avoid bot failures	2023-11-10 17:09:55 +09:00
Jun Wang	54470176af	[AMDGPU] Add inreg support for SGPR arguments (#67182 ) Function parameters marked with inreg are supposed to be allocated to SGPRs. However, for compute functions, this is ignored and function parameters are allocated to VGPRs. This fix modifies CC_AMDGPU_Func in AMDGPUCallingConv.td to use SGPRs if input arg is marked inreg. --------- Co-authored-by: Jun Wang <jun.wang7@amd.com>	2023-11-08 11:35:52 -08:00
Diana Picus	3b905a0be5	[AMDGPU] ISel for llvm.amdgcn.set.inactive.chain.arg Add patterns to select int_amdgcn_set_inactive_chain_arg to V_SET_INACTIVE. This could probably use some more testing, but at least for simple cases V_SET_INACTIVE seems to mostly work out of the box. Differential Revision: https://reviews.llvm.org/D158605	2023-11-08 09:53:47 +01:00
Diana Picus	39830fea28	[AMDGPU][PEI] Set up SP for chain functions Initialize the SP to 0 in the prologue of functions with the `amdgpu_cs_chain` or `amdgpu_cs_chain_preserve` calling conventions, but only if they need one (i.e. if they contain calls to `amdgpu_gfx` functions or if they have stack objects). Also make sure we don't try to realign the stack (since 0 is aligned enough). Differential Revision: https://reviews.llvm.org/D156413	2023-11-08 09:27:34 +01:00
Diana	1fa58c7790	[AMDGPU] Callee saves for amdgpu_cs_chain[_preserve] (#71526 ) Teach prolog epilog insertion how to handle functions with the amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. For amdgpu_cs_chain functions, we only need to preserve the inactive lanes of VGPRs above v8, and only in the presence of calls via @llvm.amdgcn.cs.chain. For amdgpu_cs_chain_preserve functions, we will also need to preserve the active lanes for registers above the last argument VGPR. AFAICT there's no direct way to find out what the last argument VGPR is, so instead the patch uses the fact that chain calls from amdgpu_cs_chain_preserve functions can't use more VGPRs than the caller's VGPR arguments. In other words, it removes the operands of SI_CS_CHAIN_TC instructions from the list of callee saved registers. For both calling conventions, registers v0-v7 never need to be saved and restored, so we should never add them as WWM spills. Differential Revision: https://reviews.llvm.org/D156412	2023-11-08 08:28:15 +01:00
Carl Ritson	af6ff98c53	[AMDGPU] Move WWM register pre-allocation to during regalloc (#70618 ) Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This saves recomputation of the virtual matrix and live reg map, with the slight regression in O0 that live intervals and slot indexes must be computed.	2023-11-08 11:54:28 +09:00
Pierre van Houtryve	5db63d29fd	[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505 ) I assumed indexes were always ConstantInts, but that's not always the case. They can be other things as well. We can easily handle that by just emitting an add and let InstSimplify do the constant folding for cases where it's really a ConstantInt. Solves SWDEV-429935	2023-11-07 15:29:41 +01:00
Pierre van Houtryve	4428b01faa	Reland: [AMDGPU] Remove Code Object V3 (#67118 ) V3 has been deprecated for a while as well, so it can safely be removed like V2 was removed. - [Clang] Set minimum code object version to 4 - [lld] Fix tests using code object v3 - Remove code object V3 from the AMDGPU backend, and delete or port v3 tests to v4. - Update docs to make it clear V3 can no longer be emitted.	2023-11-07 12:23:03 +01:00
Nikita Popov	17764d2c87	[IR] Remove FP cast constant expressions (#71408 ) Remove support for the fptrunc, fpext, fptoui, fptosi, uitofp and sitofp constant expressions. All places creating them have been removed beforehand, so this just removes the APIs and uses of these constant expressions in tests. With this, the only remaining FP operation that still has constant expression support is fcmp. This is part of https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.	2023-11-07 09:34:16 +01:00
Matt Arsenault	d34a10a47d	AMDGPU: Port AMDGPUAttributor to new pass manager (#71349 )	2023-11-07 15:40:40 +09:00
Amara Emerson	6b69584660	[GlobalISel] Fall back for bf16 conversions. (#71470 ) We don't support these correctly since we don't yet have FP types. AMDGPU tests were silently miscompiling bf16 as if they were fp16.	2023-11-06 21:18:57 -08:00
Jay Foad	521ac12a25	[AMDGPU] Remove AMDGPUAsmPrinter::isBlockOnlyReachableByFallthrough (#71407 ) The special handling for blocks ending with a long branch has been unnecessary since D106445: "[amdgpu] Add 64-bit PC support when expanding unconditional branches."	2023-11-06 16:29:52 +00:00
Jay Foad	1c6102d19b	[AMDGPU] Regenerate checks for long-branch-reserve-register.ll	2023-11-06 15:33:23 +00:00
Nikita Popov	f9404a1b57	[AMDGPU] Regenerate test to fix failure	2023-11-06 15:42:02 +01:00
Valery Pykhtin	fe6893b1d8	Improve selection of conditional branch on amdgcn.ballot!=0 condition in SelectionDAG. (#68714 ) Improve selection of the following pattern: bool cnd = ... if (amdgcn.ballot(cnd) != 0) { ... } which means "execute _then_ if any lane has satisfied the _cnd_ condition".	2023-11-06 15:16:49 +01:00
sstipanovic	22a323e3db	[AMDGPU] Select v_lshl_add_u32 instead of v_mul_lo_u32 by constant (#71035 ) Instead of: v_mul_lo_u32 v0, v0, 5 we should generate: v_lshl_add_u32 v0, v0, 2, v0.	2023-11-06 14:52:27 +01:00
Diana	7f5d59b38d	[AMDGPU] ISel for @llvm.amdgcn.cs.chain intrinsic (#68186 ) The @llvm.amdgcn.cs.chain intrinsic is essentially a call. The call parameters are bundled up into 2 intrinsic arguments, one for those that should go in the SGPRs (the 3rd intrinsic argument), and one for those that should go in the VGPRs (the 4th intrinsic argument). Both will often be some kind of aggregate. Both instruction selection frameworks have some internal representation for intrinsics (G_INTRINSIC[_WITH_SIDE_EFFECTS] for GlobalISel, ISD::INTRINSIC_[VOID|WITH_CHAIN] for DAGISel), but we can't use those because aggregates are dissolved very early on during ISel and we'd lose the inreg information. Therefore, this patch shortcircuits both the IRTranslator and SelectionDAGBuilder to lower this intrinsic as a call from the very start. It tries to use the existing infrastructure as much as possible, by calling into the code for lowering tail calls. This has already gone through a few rounds of review in Phab: Differential Revision: https://reviews.llvm.org/D153761	2023-11-06 12:30:07 +01:00
Carl Ritson	19bfe08c7f	Reapply [AMDGPU] Generate wwm-reserved.ll (NFC) Fix target triple so address locations are host independent.	2023-11-06 13:26:06 +09:00
Jessica Del	6e4692c9ee	[AMDGPU] - Add s_wqm intrinsics (#71048 ) Add intrinsics to generate `s_wqm_b32` and `s_wqm_b64`. Support VGPR arguments by inserting a `v_readfirstlane`.	2023-11-03 14:48:59 +01:00
Nikita Popov	e4a4122eb6	[IR] Remove zext and sext constant expressions (#71040 ) Remove support for zext and sext constant expressions. All places creating them have been removed beforehand, so this just removes the APIs and uses of these constant expressions in tests. There is some additional cleanup that can be done on top of this, e.g. we can remove the ZExtInst vs ZExtOperator footgun. This is part of https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.	2023-11-03 10:46:07 +01:00
Nico Weber	6acd1671e6	Revert "[AMDGPU] Generate wwm-reserved.ll (NFC)" This reverts commit b3523d7e6d8834468cfcb66e629adbe17da90ea5. Breaks tests on mac, see: https://github.com/llvm/llvm-project/commit/b3523d7e6d88344#commitcomment-131547708	2023-11-02 14:55:41 -04:00
Jay Foad	b90cfe4601	[AMDGPU] New ttracedata intrinsics (#70235 ) Add llvm.amdgcn.s.ttracedata and llvm.amdgcn.s.ttracedata.imm which map directly to the corresponding instructions s_ttracedata and s_ttracedata_imm. These are inherently whole-wave operations so any non-uniform inputs are readfirstlaned.	2023-11-02 10:35:15 +00:00
Jay Foad	65bad23e43	[AMDGPU] Fix test for #70532 (Implement moveToVALU for S_CSELECT_B64)	2023-11-02 10:31:02 +00:00
Jay Foad	1590cac494	[AMDGPU] Implement moveToVALU for S_CSELECT_B64 (#70352 ) moveToVALU previously only handled S_CSELECT_B64 in the trivial case where it was semantically equivalent to a copy. Implement the general case using V_CNDMASK_B64_PSEUDO and implement post-RA expansion of V_CNDMASK_B64_PSEUDO with immediate as well as register operands.	2023-11-02 10:08:09 +00:00
Jessica Del	41cf94e6b8	[AMDGPU] - Add s_quadmask intrinsics (#70804 ) Add intrinsics to generate `s_quadmask_b32` and `s_quadmask_b64`. Support VGPR arguments by inserting a `v_readfirstlane`.	2023-11-02 10:37:52 +01:00
Thomas Symalla	18839aec4e	[AMDGPU] Detect kills in register sets when trying to form V_CMPX instructions. (#68293 ) During the SIOptimizeExecMasking pass, we try to form V_CMPX instructions by detecting S_AND_SAVEEXEC and V_MOV instructions. Generally, we require the input operand of the V_MOV, which is the input operand to the to-be-formed V_CMPX, to be alive. This is forced by clearing the kill flags on the operand after V_CMPX has been generated. However, if we have a kill of a register set that contains said register, this will not be detected by clearKillFlags. With this change, possible additional kill-flag candidates will be detected during the final call to findInstrBackwards and then, the kill flag will be removed to keep all registers in the set alive. Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>	2023-11-02 10:36:27 +01:00
Carl Ritson	b3523d7e6d	[AMDGPU] Generate wwm-reserved.ll (NFC)	2023-11-02 17:50:42 +09:00
Carl Ritson	0eb516817d	[AMDGPU] Remove dom tree requirements from SIWholeQuadMode pass (#71012 ) SIWholeQuadMode preserves dominator and post dominator trees, but does not require them.	2023-11-02 17:16:19 +09:00
Tobias Stadler	373c343a77	Reland: [GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND Reland 3686a0b after fixing an exposed miscompile in #68840 Differential Revision: https://reviews.llvm.org/D159140	2023-11-02 00:18:19 +01:00
Valery Pykhtin	e808f8a616	[AMDGPU] GCNRegPressurePrinter pass to print GCNRegPressure values for testing. (#70031 ) Using GCNDownwardRPTracker or GCNUpwardRPTracker the pass collects register pressure values for a function and prints these values next to instructions. Output can be used to generate Filecheck rules in mir tests.	2023-11-01 23:01:39 +01:00
Jay Foad	86f2e09250	[AMDGPU] Tweak handling of GlobalAddress operands in SI_PC_ADD_REL_OFFSET (#70960 ) When SI_PC_ADD_REL_OFFSET is expanded to S_GETPC/S_ADD/S_ADDC, the GlobalAddress operands have to be adjusted by 4 or 12 bytes to account for the offset from the end of the S_GETPC instruction to the literal operands. Do this all in SIInstrInfo::expandPostRAPseudo instead of duplicating the adjustment code in both AMDGPULegalizerInfo and SITargetLowering. NFCI.	2023-11-01 19:48:30 +00:00
Simon Pilgrim	51d4ad6701	[AMDGPU] amdgpu-codegenprepare-idiv.ll - regenerate checks. NFC. Reduces diffs in a future patch	2023-10-31 13:24:27 +00:00
Jay Foad	a6dabed348	[AMDGPU] Fix nondeterminism in SIFixSGPRCopies (#70644 ) There are a couple of loops that iterate over V2SCopies. The iteration order needs to be deterministic, otherwise we can call moveToVALU in different orders, which causes temporary vregs to be allocated in different orders, which can affect register allocation heuristics.	2023-10-31 11:47:42 +00:00
Jessica Del	b8d3ccdff1	[AMDGPU] - Add s_bitreplicate intrinsic (#69209 ) Add intrinsic for s_bitreplicate. Lower to S_BITREPLICATE_B64_B32 machine instruction in both GISel and Selection DAG. Support VGPR arguments by inserting a `v_readfirstlane`.	2023-10-31 11:26:45 +01:00
Craig Topper	9a7c26a399	[GISel] Restrict G_BSWAP to multiples of 16 bits. (#70245 ) This is consistent with the IR verifier and SelectionDAG's getNode. Update tests accordingly. I tried to keep some coverage of non-pow2 when possible. X86 didn't like a G_UNMERGE_VALUES from s48 to 3 s16 that got created when I tried s48.	2023-10-30 10:27:57 -07:00
Jay Foad	101008be83	[AMDGPU] CodeGen for 64-bit buffer atomic cmpswap intrinsics (#70475 ) Implement codegen for: llvm.amdgcn.raw.buffer.atomic.cmpswap.i64 llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.i64 llvm.amdgcn.struct.buffer.atomic.cmpswap.i64 llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.i64	2023-10-30 16:44:22 +00:00
Jessica Del	849297c97d	[AMDGPU][wmma] - Add tied wmma intrinsic (#69903 ) These new intrinsics, `amdgcn_wmma_tied_f16_16x16x16_f16` and `amdgcn_wmma_tied_bf16_16x16x16_bf16`, explicitly tie the destination accumulator matrix to the input accumulator matrix. The `wmma_f16` and `wmma_bf16` intrinsics only write to 16 bits of each 32-bit destination VGPR. Which half is written is determined by the `op_sel` argument. The other half of the destination registers remains unchanged. In some cases, however, we expect the destination to copy the other halves from the input accumulator, for instance when packing two separate accumulator matrices into one. In that case, the two matrices are tied into the same registers, but in separate halves. Then it is important to copy the other matrix values to the new destination.	2023-10-30 16:23:49 +01:00
Stanislav Mekhanoshin	fe8335babb	[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395 ) This allows folding of 64-bit operands if fit into 32-bit. Fixes https://github.com/llvm/llvm-project/issues/67781	2023-10-30 08:12:28 -07:00
Stanislav Mekhanoshin	ee6d62db99	[AMDGPU] Prevent folding of the negative i32 literals as i64 (#70274 ) We can use sign extended 64-bit literals, but only for signed operands. At the moment we do not know if an operand is signed. Such operand will be encoded as its low 32 bits and then either correctly sign extended or incorrectly zero extended by HW.	2023-10-30 08:07:43 -07:00
Simon Pilgrim	d96529af3c	[DAG] Attempt shl narrowing in SimplifyDemandedBits (REAPPLIED) If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext. Followup to D146121 Reapplied - moved after the ShrinkDemandedOp call; reuse the existing KnownBits result; ensure that we only attempt this if all the upper bits are demanded; 547dc461225ba should address the remaining regressions that were noticed in the previous commit. Differential Revision: https://reviews.llvm.org/D155472	2023-10-29 15:38:46 +00:00
Changpeng Fang	8ceb72ffe5	[AMDGPU] make v32i16/v32f16 legal (#70484 ) Some upcoming intrinsics will be using these new types	2023-10-27 15:28:31 -07:00
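
Note on a3bd87b100 / c1d5865a31 (FINI_ARRAY ordering): the commit message says the FINI_ARRAY entries are stored in the same order as the INIT_ARRAY entries, so the destructors have to be walked backwards. The following is a minimal host-side C++ sketch of that traversal rule only. It is not the backend's IR lowering; `init_array`, `fini_array`, `run_lifecycle` and the four callbacks are hypothetical stand-ins for the linker-provided sections.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of INIT_ARRAY / FINI_ARRAY.
using Callback = void (*)();

// Records the order in which the callbacks actually ran.
static std::vector<std::string> g_log;

static void ctor_a() { g_log.push_back("ctor_a"); }
static void ctor_b() { g_log.push_back("ctor_b"); }
static void dtor_a() { g_log.push_back("dtor_a"); }
static void dtor_b() { g_log.push_back("dtor_b"); }

// Assumption for this sketch: both arrays are stored in the same order,
// as described in the commit message.
static const Callback init_array[] = {ctor_a, ctor_b};
static const Callback fini_array[] = {dtor_a, dtor_b};

// Runs constructors front to back, then destructors back to front,
// and returns the resulting call order.
std::vector<std::string> run_lifecycle() {
  g_log.clear();
  constexpr std::size_t n_init = sizeof(init_array) / sizeof(init_array[0]);
  constexpr std::size_t n_fini = sizeof(fini_array) / sizeof(fini_array[0]);

  for (std::size_t i = 0; i < n_init; ++i) {
    assert(init_array[i] != nullptr);
    init_array[i]();
  }
  // Reverse traversal: the last constructed object is destroyed first.
  for (std::size_t i = n_fini; i > 0; --i) {
    assert(fini_array[i - 1] != nullptr);
    fini_array[i - 1]();
  }
  return g_log;
}
```

Calling `run_lifecycle()` yields the order ctor_a, ctor_b, dtor_b, dtor_a. Reusing the forward loop for both arrays would instead run dtor_a before dtor_b, which is the bug the commit describes.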

6937 Commits