llvm-project

Author	SHA1	Message	Date
Christudasan Devadasan	a3028239a7	Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs" This reverts commit 40ba0942e2ab1107f83aa5a0ee5ae2980bf47b1a.	2022-12-21 16:17:42 +05:30
Nikita Popov	bdf2fbba9c	[AMDGPU] Convert some tests to opaque pointers (NFC)	2022-12-19 12:41:13 +01:00
Christudasan Devadasan	40ba0942e2	[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs Currently, the custom SGPR spill lowering pass spills SGPRs into physical VGPR lanes and the remaining VGPRs are used by regalloc for vector regclass allocation. This imposes many restrictions that we ended up with unsuccessful SGPR spilling when there won't be enough VGPRs and we are forced to spill the leftover into memory during PEI. The custom spill handling during PEI has many edge cases and often breaks the compiler time to time. This patch implements spilling SGPRs into virtual VGPR lanes. Since we now split the register allocation for SGPRs and VGPRs, the virtual registers introduced for the spill lanes would get allocated automatically in the subsequent regalloc invocation for VGPRs. Spill to virtual registers will always be successful, even in the high-pressure situations, and hence it avoids most of the edge cases during PEI. We are now left with only the custom SGPR spills during PEI for special registers like the frame pointer which isn an unproblematic case. This patch also implements the whole wave spills which might occur if RA spills any live range of virtual registers involved in the whole wave operations. Earlier, we had been hand-picking registers for such machine operands. But now with SGPR spills into virtual VGPR lanes, we are exposing them to the allocator. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D124196	2022-12-17 11:56:32 +05:30
Christudasan Devadasan	29247824f5	[AMDGPU][SIFrameLowering] Use the right frame register in CSR spills Unlike the callee-saved VGPR spill instructions emitted by `PEI::spillCalleeSavedRegs`, the CS VGPR spills inserted during emitPrologue/emitEpilogue require the exec bits flipping to avoid clobbering the inactive lanes of VGPRs used for SGPR spilling. Currently, these spill instructions are referenced from the SP at function entry and when the callee performs a stack realignment, they ended up getting incorrect stack offsets. Even if we try to adjust the offsets, the FP-SP becomes a runtime entity with dynamic stack realignment and the offsets would still be inaccurate. To fix it, use FP as the frame base in the spill instructions whenever the function has FP. The offsets obtained for the CS objects would always be the right values from FP. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134949	2022-12-17 11:52:36 +05:30
Christudasan Devadasan	b25b4c0ab4	[AMDGPU] Separate out SGPR spills to VGPR lanes during PEI SILowerSGPRSpills pass handles the lowering of SGPR spills into VGPR lanes. Some SGPR spills are handled later during PEI. There is a common function used in both places to find the free VGPR lane. This patch eliminates that dependency to find the free VGPR by handling it separately for PEI. It is a prerequisite patch for a future work to allow SGPR spills to virtual VGPR lanes during SILowerSGPRSpills. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D124195	2022-12-17 11:49:41 +05:30
Ron Lieberman	ca856fff1c	Revert "enable code-object-version=5" very sorry wrong repo. This reverts commit d882ba7aeac4b496dccd1b10cb58bd691786b691.	2022-11-29 15:21:09 -06:00
Ron Lieberman	d882ba7aea	enable code-object-version=5	2022-11-29 15:11:57 -06:00
Guozhi Wei	11e86868c1	[MachineCSE] Allow CSE for instructions with ignorable operands Ignorable operands don't impact instruction's behavior, we can safely do CSE on the instruction. It is split from D130919. It has big impact to some AMDGPU test cases. For example in atomic_optimizations_raw_buffer.ll, when trying to check if the following instruction can be CSEed %37:vgpr_32 = V_MOV_B32_e32 0, implicit $exec Function isCallerPreservedOrConstPhysReg is called on operand "implicit $exec", this function is implemented as - return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| + return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| TII.isIgnorableUse(MO) \|\| (MRI.reservedRegsFrozen() && MRI.isConstantPhysReg(Reg)); Both TRI.isCallerPreservedPhysReg and MRI.isConstantPhysReg return false on this operand, so isCallerPreservedOrConstPhysReg is also false, it causes LLVM failed to CSE this instruction. With this patch TII.isIgnorableUse returns true for the operand $exec, so isCallerPreservedOrConstPhysReg also returns true, it causes this instruction to be CSEed with previous instruction %14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec So I got different result from here. AMDGPU's implementation of isIgnorableUse is bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const { // Any implicit use of exec by VALU is not a real register read. return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() && isVALU(MO.getParent()) && !resultDependsOnExec(MO.getParent()); } Since the operand $exec is not a real register read, my understanding is it's reasonable to do CSE on such instructions. Because more instructions are CSEed, so I get less instructions generated for these tests. Differential Revision: https://reviews.llvm.org/D137222	2022-11-14 19:34:59 +00:00
Matt Arsenault	7a84624079	AMDGPU: Make various vector undefs legal Surprisingly these were getting legalized to something zero initialized. This fixes an infinite loop when combining some vector types. Also fixes zero initializing some undef values. SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking for the legality of the output undefs they are replacing unused operations with. This resulted in turning vectors into undefs that were later re-legalized back into zero vectors.	2022-09-28 10:48:52 -04:00
Ruiling Song	a5676a3a7e	StructurizeCFG: Set Undef for non-predecessors in setPhiValues() During structurization process, we may place non-predecessor blocks between the predecessors of a block in the structurized CFG. Take the typical while-break case as an example: ``` /---A(v=...) \| / \ ^ B C \| \ /\| \---L \| \ / E (r = phi (v:C)...) ``` After structurization, the CFG would be look like: ``` /---A \| \|\ \| \| C \| \|/ \| F1 ^ \|\ \| \| B \| \|/ \| F2 \| \|\ \| \| L \ \|/ \--F3 \| E ``` We can see that block B is placed between the predecessors(C/L) of E. During phi reconstruction, to achieve the same sematics as before, we are reconstructing the PHIs as: F1: v1 = phi (v:C), (undef:A) F3: r = phi (v1:F2), ... But this is also saying that `v1` would be live through B, which is not quite necessary. The idea in the change is to say the incoming value from B is Undef for the PHI in E. With this change, the reconstructed PHI would be: F1: v1 = phi (v:C), (undef:A) F2: v2 = phi (v1:F1), (undef:B) F3: r = phi (v2:F2), ... Reviewed by: sameerds Differential Revision: https://reviews.llvm.org/D132450	2022-09-26 09:54:47 +08:00
Ruiling Song	40e9284f3c	StructurizeCFG: prefer reduced number of live values The instruction simplification will try to simplify the affected phis. In some cases, this might extend the liveness of values. For example: BB0: \| \ \| BB1 \| / BB2:phi (BB0, v), (BB1, undef) The phi in BB2 will be simplified to v as v dominates BB2, but this is increasing the number of active values in BB1. By setting CanUseUndef to false, we will not simplify the phi in this way, this would help register pressure. This is mandatory for the later change to help reducing VGPR pressure for AMDGPU. Reviewed by: foad, sameerds Differential Revision: https://reviews.llvm.org/D132449	2022-09-26 09:54:47 +08:00
Matt Arsenault	69153d6c0a	AMDGPU: Use GlobalPriority for largest register tuples Only do this for 16 and 32 register tuples, although we might want to extend to 8 tuples. It's incredibly expensive to spill these, and doing so majorly interferes with the ability to allocate anything else in the function. The lit tests show mostly sizeable improvements with a handful of tiny regressions with large vectors.	2022-09-15 11:45:02 -04:00
Matt Arsenault	da59fe7f15	AMDGPU: Fix test failure Forgot to commit regenerated test	2022-09-12 09:33:22 -04:00
Matt Arsenault	e30271169f	RegAllocGreedy: Try local instruction splitting with subranges This was only trying this to relax register class constraints, but this can also help if there are subranges involved. This solves a compilation failure for AMDGPU when there is high pressure created by large register tuples. If one virtual register is using most of the available budget, we need to be able to evict subranges. This solves the immediate failure, but this solution leaves a lot to be desired. In the relevant testcases, we have 32-element tuples but most of the uses are operations on 1 element subranges of it. What we're now getting is a spill and restore of the full 1024 bits and an extract of the used 32-bits. It would be far better if we introduced a copy to a new virtual register with a smaller register class and used narrower spills. Furthermore, we could probably do a better job if the allocator were to introduce new subranges where none previously existed in the highest pressure scenarios. The block and region splits should also try to split specific subranges out. The mve-vst3.ll test changes looks like noise to me, but instruction count increased by one. mve-vst4.ll looks like a solid improvement with several 16-byte spills eliminated. splitkit-copy-live-lanes.mir also shows a solid reduction in total spill count. This could use more tests but it's pretty tiring to come up with cases that fail on this.	2022-09-12 09:03:55 -04:00

14 Commits