llvm-project

Author	SHA1	Message	Date
Craig Topper	fa630e7594	[RISCV][AMDGPU][TargetLowering] Special case overflow expansion for (uaddo X, 1). If we expand (uaddo X, 1) we previously expanded the overflow calculation as (X + 1) <u X. This potentially increases the live range of X and can prevent X+1 from reusing the register that previously held X. Since we're adding 1, overflow only occurs if X was UINT_MAX in which case (X+1) would be 0. So this patch adds a special case to expand the overflow calculation to (X+1) == 0. This seems to help with uaddo intrinsics that get introduced by CodeGenPrepare after LSR. Alternatively, we could block the uaddo transform in CodeGenPrepare for this case. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D122933	2022-04-01 13:14:10 -07:00
Venkata Ramanaiah Nalamothu	04fff547e2	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvment of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm, ronlieb Differential Revision: https://reviews.llvm.org/D114652	2022-03-09 12:18:02 +05:30
Carl Ritson	565af157ef	[AMDGPU] Extend pre-emit peephole to redundantly masked VCC Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate redundant S_AND operations against EXEC for V_CMP results in VCC. These occur after after register allocation when VCC has been selected as the comparison destination. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D120202	2022-02-25 10:18:31 +09:00
Ron Lieberman	09b53296cf	Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range" This reverts commit 9075009d1fd5f2bf9aa6c2f362d2993691a316b3. Failed amdgpu runtime buildbot # 3514	2021-12-22 11:39:28 -05:00
RamNalamothu	9075009d1f	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvment of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D114652	2021-12-22 20:51:12 +05:30
Austin Kerbow	da067ed569	[AMDGPU] Set most sched model resource's BufferSize to one Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retreaved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior. Practically, this means that the scheduler will trigger the 'STALL' heuristic more often. This type of change needs to be evaluated experimentally. Preliminary results are positive. Fixes: SWDEV-282962 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114777	2021-12-01 22:31:28 -08:00
RamNalamothu	18f9351223	[AMDGPU] Do not generate ELF symbols for the local branch target labels The compiler was generating symbols in the final code object for local branch target labels. This bloats the code object, slows down the loader, and is only used to simplify disassembly. Use '--symbolize-operands' with llvm-objdump to improve readability of the branch target operands in disassembly. Fixes: SWDEV-312223 Reviewed By: scott.linder Differential Revision: https://reviews.llvm.org/D114273	2021-11-20 10:32:41 +05:30
Jay Foad	30b27ecfc2	[AMDGPU] Use new opcode for indexed vgpr reads Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguously distinguished from regular V_MOV_B32_e32. Previously they were distinguished by looking for extra implicit operands but this is fragile because regular moves sometimes have extra implicit operands too: - either by accident, when instructions end up with duplicate implicit operands (see e.g. D100939) - or by design, when SIInstrInfo::copyPhysReg breaks a multi-dword copy into individual subreg mov instructions and adds implicit operands for the super-register. The effect of this is that SIInstrInfo::isFoldableCopy can be simplified and identifies more foldable copies. The test diffs show that more immediate 0 values have been folded as inline operands. SIInstrInfo::isReallyTriviallyReMaterializable could probably be simplified too but that is not part of this patch. Differential Revision: https://reviews.llvm.org/D114230	2021-11-19 13:08:11 +00:00
Jay Foad	a70bbb5f7a	[AMDGPU] Simplify 64-bit division/remainder expansion The old expansion open-coded a 64-bit addition in a strange way, by adding the high parts without carry-in from the low part, and then adding the carry back in later on. Fixing this saves a couple of instructions and makes the code much easier to understand. Differential Revision: https://reviews.llvm.org/D113679	2021-11-12 15:48:41 +00:00
Stanislav Mekhanoshin	c80d8a8cea	[AMDGPU] MachineLICM cannot hoist VALU MachineLoop::isLoopInvariant() returns false for all VALU because of the exec use. Check TII::isIgnorableUse() to allow hoisting. That unfortunately results in higher register consumption since MachineLICM does not adequately estimate pressure. Therefor I think it shall only be enabled after D107677 even though it does not depend on it. Differential Revision: https://reviews.llvm.org/D107859	2021-10-20 11:47:24 -07:00
Jay Foad	c885857e9d	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%. Differential Revision: https://reviews.llvm.org/D111646	2021-10-13 17:12:26 +01:00
Jay Foad	66ce1015af	Revert "[AMDGPU] Enable load clustering in the post-RA scheduler" This reverts commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec. It was committed by accident.	2021-10-12 16:19:35 +01:00
Jay Foad	66e13c7f43	[AMDGPU] Enable load clustering in the post-RA scheduler This has a couple of benefits: 1. It can sometimes fix clusters that got broken apart when the register allocator inserted a copy. 2. Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions. Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed: - The average length of each run of one or more load instructions increased by about 1%. - The number of runs of two or more load instructions increased by about 4%.	2021-10-12 16:09:04 +01:00
Joe Nash	3ce1b9631a	[AMDGPU] Switch PostRA sched to MachineSched Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D109536 Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde	2021-09-14 15:11:27 -04:00
alex-t	ed0f4415f0	[AMDGPU] Divergence-driven compare operations instruction selection Description: This change enables the compare operations to be selected to SALU/VALU form dependent of the SDNode divergence flag. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D106079	2021-08-25 18:30:49 +03:00
Matt Arsenault	d719f1c3cc	AMDGPU: Add alloc priority to global ranges The requested register class priorities weren't respected globally. Not sure why this is a target option, and not just the expected behavior (recently added in 1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation failure when many wide tuple spills are introduced. I think this is a workaround since I would not expect the allocation priority to be required, and only a performance hint. The allocator should be smarter about when only a subregister needs to be spilled and restored. This does regress a couple of degenerate store stress lit tests which shouldn't be too important.	2021-08-10 13:12:34 -04:00
Jay Foad	e6c364a624	[AMDGPU][SDag] Better lowering for 64-bit ctlz/cttz Differential Revision: https://reviews.llvm.org/D107546	2021-08-05 15:57:40 +01:00
Craig Topper	c23405174a	[DAGCombiner][AMDGPU] Canonicalize constants to the RHS of MULHU/MULHS. This allows special constants like to 0 to be recognized. It's also expected by isel patterns if a target had a mulh with immediate instructions. The commuting done by tablegen won't commute patterns with immediates since it expects DAGCombine to have done it. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D107486	2021-08-04 11:39:23 -07:00
Stanislav Mekhanoshin	4a3b055653	[AMDGPU] Fix flags of V_MOV_B64_PSEUDO In particular it was not rematerializable. Differential Revision: https://reviews.llvm.org/D105724	2021-07-09 12:49:28 -07:00
Stanislav Mekhanoshin	381ded345b	[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants This is to allow 64 bit constant rematerialization. If a constant is split into two separate moves initializing sub0 and sub1 like now RA cannot rematerizalize a 64 bit register. This gives 10-20% uplift in a set of huge apps heavily using double precession math. Fixes: SWDEV-292645 Differential Revision: https://reviews.llvm.org/D104874	2021-06-30 11:45:38 -07:00
Brendon Cahoon	3f7b7e7393	[AMDGPU] Update SCC defs to VCC when uses are changed to VCC The FixSGPRCopies pass converts instructions to VALU when removing illegal VGPR to SGPR copies. Instructions that use SCC are changed to use VCC instead. When that happens, the pass must also change instructions that define SCC to define VCC. The pass was not changing the SCC definition when an ADDC is converted due to a input that is a VGPR to SGPR copy. But, the initial ADD insruction, which define SCC, is not converted. This causes a compilation failure due to a use of an undefined physical register. This patch adds code that inserts the SCC definition in the MoveToVALU worklist when a SCC use is converted to a VCC use. Differential Revision: https://reviews.llvm.org/D102111	2021-05-14 18:05:05 -04:00
Jay Foad	ef443390a9	[AMDGPU] Remove MachineDCE after SIFoldOperands Remove the MachineDCE pass after the first SIFoldOperands pass now that SIFoldOperands deletes its own dead instructions. Reapply after fixing dependent change D100188. Differential Revision: https://reviews.llvm.org/D100189	2021-04-19 12:08:02 +01:00
Mitch Phillips	092f288d36	Revert "[AMDGPU] Remove MachineDCE after SIFoldOperands" This reverts commit 5a0117b2d0eaedffeeb393bd9915f11cccfe241b. Reason: Dependent change d19a42eba98fe853dd52f7dc89d8cd2727c7fc1c broke the ASan buildbots.	2021-04-09 15:47:44 -07:00
Jay Foad	5a0117b2d0	[AMDGPU] Remove MachineDCE after SIFoldOperands Remove the MachineDCE pass after the first SIFoldOperands pass now that SIFoldOperands deletes its own dead instructions. Differential Revision: https://reviews.llvm.org/D100189	2021-04-09 20:41:09 +01:00
Jay Foad	a4ced03d34	[AMDGPU] SIFoldOperands: eagerly delete dead copies This is cheap to implement, means less work for future passes like MachineDCE, and slightly improves the folding in some cases. Differential Revision: https://reviews.llvm.org/D100117	2021-04-09 13:52:54 +01:00
Simon Pilgrim	9d2df96407	[DAG] computeKnownBits - add ISD::MULHS/MULHU/SMUL_LOHI/UMUL_LOHI handling Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops. Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement. Differential Revision: https://reviews.llvm.org/D98857	2021-03-19 16:02:31 +00:00
Piotr Sobczak	62d8b8a225	Fix 64-bit copy to SCC Fix 64-bit copy to SCC by restricting the pattern resulting in such a copy to subtargets supporting 64-bit scalar compare, and mapping the copy to S_CMP_LG_U64. Before introducing the S_CSELECT pattern with explicit SCC (0045786f146e78afee49eee053dc29ebc842fee1), there was no need for handling 64-bit copy to SCC ($scc = COPY sreg_64). The proposed handling to read only the low bits was however based on a false premise that it is only one bit that matters, while in fact the copy source might be a vector of booleans and all bits need to be considered. The practical problem of mapping the 64-bit copy to SCC is that the natural instruction to use (S_CMP_LG_U64) is not available on old hardware. Fix it by restricting the problematic pattern to subtargets supporting the instruction (hasScalarCompareEq64). Differential Revision: https://reviews.llvm.org/D85207	2020-08-09 20:50:30 +02:00
Jay Foad	62fd7f767c	[MachineScheduler] Fix the TopDepth/BotHeightReduce latency heuristics tryLatency compares two sched candidates. For the top zone it prefers the one with lesser depth, but only if that depth is greater than the total latency of the instructions we've already scheduled -- otherwise its latency would be hidden and there would be no stall. Unfortunately it only tests the depth of one of the candidates. This can lead to situations where the TopDepthReduce heuristic does not kick in, but a lower priority heuristic chooses the other candidate, whose depth is greater than the already scheduled latency, which causes a stall. The fix is to apply the heuristic if the depth of either candidate is greater than the already scheduled latency. All this also applies to the BotHeightReduce heuristic in the bottom zone. Differential Revision: https://reviews.llvm.org/D72392	2020-07-17 11:02:13 +01:00
Piotr Sobczak	0045786f14	[AMDGPU] Select s_cselect Summary: Add patterns to select s_cselect in the isel. Handle more cases of implicit SCC accesses in si-fix-sgpr-copies to allow new patterns to work. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits Tags: #llvm Re-commit D81925 with a bugfix D82370. Differential Revision: https://reviews.llvm.org/D81925 Differential Revision: https://reviews.llvm.org/D82370	2020-06-25 10:38:23 +02:00
Matt Arsenault	778351df77	Revert "[AMDGPU] Enable compare operations to be selected by divergence" This reverts commit 521ac0b5cea02f629d035f807460affbb65ae7ad. Reported to break thousands of piglit tests.	2020-06-24 11:21:30 -04:00
alex-t	521ac0b5ce	[AMDGPU] Enable compare operations to be selected by divergence Summary: Details: This patch enables SETCC to be selected to S_CMP_* if uniform and V_CMP_* if divergent. Reviewers: rampitec, arsenm Reviewed By: rampitec Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D82194	2020-06-24 11:50:40 +03:00
Piotr Sobczak	6d9565d6d5	Revert "[AMDGPU] Select s_cselect" This caused some failures detected by the buildbot with expensive checks enabled. This reverts commit 4067de569f119a81419fbf2e79d5f3307dfdda5b.	2020-06-19 16:41:04 +02:00
Piotr Sobczak	4067de569f	[AMDGPU] Select s_cselect Summary: Add patterns to select s_cselect in the isel. Handle more cases of implicit SCC accesses in si-fix-sgpr-copies to allow new patterns to work. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, asbirlea, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81925	2020-06-19 16:17:46 +02:00
alex-t	5b898bddff	[AMDGPU] Enable carry out ADD/SUB operations divergence driven instruction selection. Summary: This change enables all kind of carry out ISD opcodes to be selected according to the node divergence. Reviewers: rampitec, arsenm, vpykhtin Reviewed By: rampitec Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D78091	2020-05-04 16:42:25 +03:00
Konstantin Pyzhov	72e8754916	[AMDGPU] Disable 'Skip Uniform Regions' optimization by default for AMDGPU. Reviewers: sameerds, dstuttard Differential Revision: https://reviews.llvm.org/D77228	2020-04-06 09:05:58 -04:00
Konstantin Pyzhov	51dc028314	Revert e1730cfeb3588f20dcf4a96b181ad52761666e52	2020-04-06 05:56:11 -04:00
Konstantin Pyzhov	e1730cfeb3	[AMDGPU] Disable 'Skip Uniform Regions' optimization by default for AMDGPU. Reviewers: sameerds, dstuttard Differential Revision: https://reviews.llvm.org/D77228	2020-04-06 05:10:37 -04:00
alex-t	6e34e71869	[AMDGPU] Enable divergence driven ISel for ADD/SUB i64 Summary: Currently we custom select add/sub with carry out to scalar form relying on later replacing them to vector form if necessary. This change enables custom selection code to take the divergence of adde/addc SDNodes into account and select the appropriate form in one step. Reviewers: arsenm, vpykhtin, rampitec Reviewed By: arsenm, vpykhtin Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa Differential Revision: https://reviews.llvm.org/D76371	2020-03-20 17:06:11 +03:00
Jay Foad	a46dba24fa	[AMDGPU] Extend macro fusion for ADDC and SUBB to SUBBREV Summary: There's a lot of test case churn but the overall effect is to increase the number of back-to-back v_sub,v_subbrev pairs, which can execute with no delay even on gfx10. Reviewers: arsenm, rampitec, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D75999	2020-03-11 17:59:21 +00:00
Jay Foad	43830790d7	[AMDGPU] Remove dubious logic in bidirectional list scheduler Summary: pickNodeBidirectional tried to compare the best top candidate and the best bottom candidate by examining TopCand.Reason and BotCand.Reason. This is unsound because, after calling pickNodeFromQueue, Cand.Reason does not reflect the most important reason why Cand was chosen. Rather it reflects the most recent reason why it beat some other potential candidate, which could have been for some low priority tie breaker reason. I have seen this cause problems where TopCand is a good candidate, but because TopCand.Reason is ORDER (which is very low priority) it is repeatedly ignored in favour of a mediocre BotCand. This is not how bidirectional scheduling is supposed to work. To fix this I changed the code to always compare TopCand and BotCand directly, like the generic implementation of pickNodeBidirectional does. This removes some uncommented AMDGPU-specific logic; if this logic turns out to be important then perhaps it could be moved into an override of tryCandidate instead. Graphics shader benchmarking on gfx10 shows a lot more positive than negative effects from this change. Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68338	2020-02-28 21:35:34 +00:00
Matt Arsenault	4bb0c8f91c	AMDGPU: Enable integer division bypass We probably want this, and I've meant to turn this on for a long time. SC actually emits a special case to early-out for a 1 denominator, which perhaps should also be considered.	2020-02-19 17:50:19 -05:00
Simon Pilgrim	4af8db317d	[AMDGPU] performCvtF32UByteNCombine - add SHL and SimplifyMultipleUseDemandedBits support This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits. Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively. We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer. Differential Revision: https://reviews.llvm.org/D74786	2020-02-19 11:45:57 +00:00
Tim Renouf	1e926a9f9c	[AMDGPU] Fix some tests that did not specify -mcpu Summary: This fixes some tests that did not specify -mcpu. Doing that disables all subtarget features, which gives behavior that (a) does not necessarily correspond to any actual target, and (b) can change as we add new subtarget features. Also added gfx1010 to memtime test. Differential Revision: https://reviews.llvm.org/D74594 Change-Id: I8c0fe4fa03e9a93ef8bb722cd42d22e064526309	2020-02-17 14:02:32 +00:00
Matt Arsenault	65dbdc329f	AMDGPU: Don't preserve analyses with div64 IR expansion The dominator tree needs to be updated, but that isn't handled now.	2020-02-14 20:06:02 -05:00
Matt Arsenault	34d9a16e54	AMDGPU: Add option to expand 64-bit integer division in IR I didn't realize we were already expanding 24/32-bit division here already. Use the available IntegerDivision utilities. This uses loops, so produces significantly smaller code than the inline DAG expansion. This now requires width reductions of 64-bit divisions before introducing the expanded loops. This helps work around missing legalization in GlobalISel for division, which are the only remaining core instructions that didn't work at all. I think this is plausibly a better implementation than exists in the DAG, although turning it on by default misses out on the constant value optimizations and also needs benchmarking.	2020-02-14 11:16:08 -08:00
Matt Arsenault	317fdcd09a	AMDGPU: Cleanup and generate 64-bit div tests Split out r600 tests, and try to be more consistent with coverage. Cover a few more cases for 24-bit optimization and constants.	2020-01-20 17:19:39 -05:00

46 Commits