llvm-project

Author	SHA1	Message	Date
Emma Pilkington	4490003a22	[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905 ) The previous name 'amdgpu_code_object_version', was misleading since this is really a property of the HSA OS. The new spelling also matches the asm directive I added in bc82cfb.	2024-03-06 09:51:48 -05:00
Jan Patrick Lehr	f661057865	Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234 ) Reverts llvm/llvm-project#79586 This broke the AMDGPU OpenMP Offload buildbot. The typical error message was that the GPU attempted to read beyong the largest legal address. Error message: AMDGPU fatal error 1: Received error in queue 0x7f8363f22000: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address.	2024-02-09 09:57:38 +01:00
alex-t	88e52511ca	[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586 ) This change implements synthesizing the private buffer resource descriptor in the kernel prolog instead of using the preloaded kernel argument.	2024-02-08 20:27:36 +01:00
Yaxun (Sam) Liu	1f3c30911c	[AMDGPU] Mark PC_ADD_REL_OFFSET rematerializable (#79674 ) Currently machine LICM hoist PC_ADD_REL_OFFSET out of loops, causes register pressure when function calls are deep in loops. This is a main cause of sgpr spill for programs containing large number of function calls in loops. This patch marks PC_ADD_REL_OFFSET as rematerializable, which eliminates sgpr spills due to function calls in loops.	2024-02-01 12:21:19 -05:00
Saiyedul Islam	e21b7e2143	[AMDGPU][NFC] Check more autogenerated llc tests for COV5 (#75219 ) Regenerate a few more llc tests to check for COV5 instead of the default ABI version.	2023-12-15 10:27:49 +05:30
Stanislav Mekhanoshin	fe8335babb	[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395 ) This allows folding of 64-bit operands if fit into 32-bit. Fixes https://github.com/llvm/llvm-project/issues/67781	2023-10-30 08:12:28 -07:00
Matthias Braun	e3cf80c5c1	BlockFrequencyInfoImpl: Avoid big numbers, increase precision for small spreads BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that: * Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room. * Spread the difference between hottest/coldest block as much as possible to increase precision. * If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.	2023-10-24 20:27:39 -07:00
Jingu Kang	ff68e43c81	[MachineLICM] Handle Subloops It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500. Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass handle subloops with only visiting outermost loop's blocks once. Differential Revision: https://reviews.llvm.org/D154205	2023-09-26 14:25:11 +01:00
Benjamin Kramer	3454cf67bd	Revert "[MachineLICM] Handle Subloops" This reverts commit 5ec9699c4d1f165364586d825baef434e2c110b4. It accesses MI after it has been hoisted.	2023-09-15 13:20:31 +02:00
Jingu Kang	5ec9699c4d	[MachineLICM] Handle Subloops Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass handle subloops with only visiting outermost loop's blocks once. Differential Revision: https://reviews.llvm.org/D154205	2023-09-14 18:07:31 +01:00
Saiyedul Islam	466a8149b3	Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410 )" (#66060 ) This reverts commit 0a8d17e79b02a92814a2a788d79df1f54d70ec3e.	2023-09-12 15:13:59 +05:30
Saiyedul Islam	0a8d17e79b	[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410 ) Also update LIT tests and docs. For more details, see https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata Reviewed By: arsenm, jhuber6 Github PR: #65410 Differential Revision: https://reviews.llvm.org/D129818	2023-09-12 13:53:31 +05:30
Matt Arsenault	4d42e8b5d1	Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting" This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd. The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work around the underlying problem with SUBREG_TO_REG.	2023-07-31 20:15:45 -04:00
Vitaly Buka	a496c8be6e	Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting" And dependent commits. Details in D150388. This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c. This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e. This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826. This reverts commit b7836d856206ec39509d42529f958c920368166b. No conflicts in the code, few tests had conflicts in autogenerated CHECKs: llvm/test/CodeGen/Thumb2/mve-float32regloops.ll llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll Reviewed By: alexfh Differential Revision: https://reviews.llvm.org/D156381	2023-07-26 22:13:32 -07:00
Jingu Kang	351b4c17dd	Revert "[MachineLICM] Handle Subloops" This reverts commit 50dd383d08670960540fecb4b48c0f0429fbfba3.	2023-07-20 17:12:25 +01:00
Jingu Kang	50dd383d08	[MachineLICM] Handle Subloops Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass handle subloops with only visiting outmost loop's blocks once. Differential Revision: https://reviews.llvm.org/D154205	2023-07-20 16:39:13 +01:00
Jingu Kang	62ed3ff4bb	Revert "[MachineLICM] Handle Subloops" This reverts commit 33e60484d750291e99301e29e60fe72c8fa48ccd.	2023-07-19 10:30:50 +01:00
Jingu Kang	33e60484d7	[MachineLICM] Handle Subloops MachineLICM pass handles inner loops only when outmost loop does not have unique predecessor. If the loop has preheader and there is loop invariant code, the invariant code can be hoisted to the preheader in general. This patch makes the pass handle inner loops in general. Differential Revision: https://reviews.llvm.org/D154205	2023-07-12 16:32:14 +01:00
Christudasan Devadasan	7a98f084c4	[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs Currently, the custom SGPR spill lowering pass spills SGPRs into physical VGPR lanes and the remaining VGPRs are used by regalloc for vector regclass allocation. This imposes many restrictions that we ended up with unsuccessful SGPR spilling when there won't be enough VGPRs and we are forced to spill the leftover into memory during PEI. The custom spill handling during PEI has many edge cases and often breaks the compiler time to time. This patch implements spilling SGPRs into virtual VGPR lanes. Since we now split the register allocation for SGPRs and VGPRs, the virtual registers introduced for the spill lanes would get allocated automatically in the subsequent regalloc invocation for VGPRs. Spill to virtual registers will always be successful, even in the high-pressure situations, and hence it avoids most of the edge cases during PEI. We are now left with only the custom SGPR spills during PEI for special registers like the frame pointer which is an unproblematic case. Differential Revision: https://reviews.llvm.org/D124196	2023-07-07 23:14:32 +05:30
pvanhout	d892521076	[AMDGPU] Break-up large PHIs for DAGISel DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen. This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases. This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users. It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function). This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs. Fixes SWDEV-321581 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D143731	2023-03-28 09:38:47 +02:00
Christudasan Devadasan	a3028239a7	Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs" This reverts commit 40ba0942e2ab1107f83aa5a0ee5ae2980bf47b1a.	2022-12-21 16:17:42 +05:30
Nikita Popov	bdf2fbba9c	[AMDGPU] Convert some tests to opaque pointers (NFC)	2022-12-19 12:41:13 +01:00
Christudasan Devadasan	40ba0942e2	[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs Currently, the custom SGPR spill lowering pass spills SGPRs into physical VGPR lanes and the remaining VGPRs are used by regalloc for vector regclass allocation. This imposes many restrictions that we ended up with unsuccessful SGPR spilling when there won't be enough VGPRs and we are forced to spill the leftover into memory during PEI. The custom spill handling during PEI has many edge cases and often breaks the compiler time to time. This patch implements spilling SGPRs into virtual VGPR lanes. Since we now split the register allocation for SGPRs and VGPRs, the virtual registers introduced for the spill lanes would get allocated automatically in the subsequent regalloc invocation for VGPRs. Spill to virtual registers will always be successful, even in the high-pressure situations, and hence it avoids most of the edge cases during PEI. We are now left with only the custom SGPR spills during PEI for special registers like the frame pointer which isn an unproblematic case. This patch also implements the whole wave spills which might occur if RA spills any live range of virtual registers involved in the whole wave operations. Earlier, we had been hand-picking registers for such machine operands. But now with SGPR spills into virtual VGPR lanes, we are exposing them to the allocator. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D124196	2022-12-17 11:56:32 +05:30
Christudasan Devadasan	29247824f5	[AMDGPU][SIFrameLowering] Use the right frame register in CSR spills Unlike the callee-saved VGPR spill instructions emitted by `PEI::spillCalleeSavedRegs`, the CS VGPR spills inserted during emitPrologue/emitEpilogue require the exec bits flipping to avoid clobbering the inactive lanes of VGPRs used for SGPR spilling. Currently, these spill instructions are referenced from the SP at function entry and when the callee performs a stack realignment, they ended up getting incorrect stack offsets. Even if we try to adjust the offsets, the FP-SP becomes a runtime entity with dynamic stack realignment and the offsets would still be inaccurate. To fix it, use FP as the frame base in the spill instructions whenever the function has FP. The offsets obtained for the CS objects would always be the right values from FP. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D134949	2022-12-17 11:52:36 +05:30
Christudasan Devadasan	b25b4c0ab4	[AMDGPU] Separate out SGPR spills to VGPR lanes during PEI SILowerSGPRSpills pass handles the lowering of SGPR spills into VGPR lanes. Some SGPR spills are handled later during PEI. There is a common function used in both places to find the free VGPR lane. This patch eliminates that dependency to find the free VGPR by handling it separately for PEI. It is a prerequisite patch for a future work to allow SGPR spills to virtual VGPR lanes during SILowerSGPRSpills. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D124195	2022-12-17 11:49:41 +05:30
Ron Lieberman	ca856fff1c	Revert "enable code-object-version=5" very sorry wrong repo. This reverts commit d882ba7aeac4b496dccd1b10cb58bd691786b691.	2022-11-29 15:21:09 -06:00
Ron Lieberman	d882ba7aea	enable code-object-version=5	2022-11-29 15:11:57 -06:00
Guozhi Wei	11e86868c1	[MachineCSE] Allow CSE for instructions with ignorable operands Ignorable operands don't impact instruction's behavior, we can safely do CSE on the instruction. It is split from D130919. It has big impact to some AMDGPU test cases. For example in atomic_optimizations_raw_buffer.ll, when trying to check if the following instruction can be CSEed %37:vgpr_32 = V_MOV_B32_e32 0, implicit $exec Function isCallerPreservedOrConstPhysReg is called on operand "implicit $exec", this function is implemented as - return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| + return TRI.isCallerPreservedPhysReg(Reg, MF) \|\| TII.isIgnorableUse(MO) \|\| (MRI.reservedRegsFrozen() && MRI.isConstantPhysReg(Reg)); Both TRI.isCallerPreservedPhysReg and MRI.isConstantPhysReg return false on this operand, so isCallerPreservedOrConstPhysReg is also false, it causes LLVM failed to CSE this instruction. With this patch TII.isIgnorableUse returns true for the operand $exec, so isCallerPreservedOrConstPhysReg also returns true, it causes this instruction to be CSEed with previous instruction %14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec So I got different result from here. AMDGPU's implementation of isIgnorableUse is bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const { // Any implicit use of exec by VALU is not a real register read. return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() && isVALU(MO.getParent()) && !resultDependsOnExec(MO.getParent()); } Since the operand $exec is not a real register read, my understanding is it's reasonable to do CSE on such instructions. Because more instructions are CSEed, so I get less instructions generated for these tests. Differential Revision: https://reviews.llvm.org/D137222	2022-11-14 19:34:59 +00:00
Matt Arsenault	7a84624079	AMDGPU: Make various vector undefs legal Surprisingly these were getting legalized to something zero initialized. This fixes an infinite loop when combining some vector types. Also fixes zero initializing some undef values. SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking for the legality of the output undefs they are replacing unused operations with. This resulted in turning vectors into undefs that were later re-legalized back into zero vectors.	2022-09-28 10:48:52 -04:00
Ruiling Song	a5676a3a7e	StructurizeCFG: Set Undef for non-predecessors in setPhiValues() During structurization process, we may place non-predecessor blocks between the predecessors of a block in the structurized CFG. Take the typical while-break case as an example: ``` /---A(v=...) \| / \ ^ B C \| \ /\| \---L \| \ / E (r = phi (v:C)...) ``` After structurization, the CFG would be look like: ``` /---A \| \|\ \| \| C \| \|/ \| F1 ^ \|\ \| \| B \| \|/ \| F2 \| \|\ \| \| L \ \|/ \--F3 \| E ``` We can see that block B is placed between the predecessors(C/L) of E. During phi reconstruction, to achieve the same sematics as before, we are reconstructing the PHIs as: F1: v1 = phi (v:C), (undef:A) F3: r = phi (v1:F2), ... But this is also saying that `v1` would be live through B, which is not quite necessary. The idea in the change is to say the incoming value from B is Undef for the PHI in E. With this change, the reconstructed PHI would be: F1: v1 = phi (v:C), (undef:A) F2: v2 = phi (v1:F1), (undef:B) F3: r = phi (v2:F2), ... Reviewed by: sameerds Differential Revision: https://reviews.llvm.org/D132450	2022-09-26 09:54:47 +08:00
Ruiling Song	40e9284f3c	StructurizeCFG: prefer reduced number of live values The instruction simplification will try to simplify the affected phis. In some cases, this might extend the liveness of values. For example: BB0: \| \ \| BB1 \| / BB2:phi (BB0, v), (BB1, undef) The phi in BB2 will be simplified to v as v dominates BB2, but this is increasing the number of active values in BB1. By setting CanUseUndef to false, we will not simplify the phi in this way, this would help register pressure. This is mandatory for the later change to help reducing VGPR pressure for AMDGPU. Reviewed by: foad, sameerds Differential Revision: https://reviews.llvm.org/D132449	2022-09-26 09:54:47 +08:00
Matt Arsenault	69153d6c0a	AMDGPU: Use GlobalPriority for largest register tuples Only do this for 16 and 32 register tuples, although we might want to extend to 8 tuples. It's incredibly expensive to spill these, and doing so majorly interferes with the ability to allocate anything else in the function. The lit tests show mostly sizeable improvements with a handful of tiny regressions with large vectors.	2022-09-15 11:45:02 -04:00
Matt Arsenault	da59fe7f15	AMDGPU: Fix test failure Forgot to commit regenerated test	2022-09-12 09:33:22 -04:00
Matt Arsenault	e30271169f	RegAllocGreedy: Try local instruction splitting with subranges This was only trying this to relax register class constraints, but this can also help if there are subranges involved. This solves a compilation failure for AMDGPU when there is high pressure created by large register tuples. If one virtual register is using most of the available budget, we need to be able to evict subranges. This solves the immediate failure, but this solution leaves a lot to be desired. In the relevant testcases, we have 32-element tuples but most of the uses are operations on 1 element subranges of it. What we're now getting is a spill and restore of the full 1024 bits and an extract of the used 32-bits. It would be far better if we introduced a copy to a new virtual register with a smaller register class and used narrower spills. Furthermore, we could probably do a better job if the allocator were to introduce new subranges where none previously existed in the highest pressure scenarios. The block and region splits should also try to split specific subranges out. The mve-vst3.ll test changes looks like noise to me, but instruction count increased by one. mve-vst4.ll looks like a solid improvement with several 16-byte spills eliminated. splitkit-copy-live-lanes.mir also shows a solid reduction in total spill count. This could use more tests but it's pretty tiring to come up with cases that fail on this.	2022-09-12 09:03:55 -04:00

34 Commits