llvm-project

Author	SHA1	Message	Date
Mirko Brkušanin	5879162f7f	[AMDGPU] CodeGen for GFX12 VBUFFER instructions (#75492)	2023-12-15 13:45:03 +01:00
Pranav Taneja	41507fe595	[GISel] Combine (Scalarize) vector load followed by an element extract.	2023-12-06 11:23:23 +05:30
Jay Foad	7b3bbd83c0	Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)" This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c. Reverted due to various buildbot failures.	2023-10-09 12:31:32 +01:00
Jay Foad	2501ae58e3	[CodeGen] Really renumber slot indexes before register allocation (#67038) PR #66334 tried to renumber slot indexes before register allocation, but the numbering was still affected by list entries for instructions which had been erased. Fix this to make the register allocator's live range length heuristics even less dependent on the history of how instructions have been added to and removed from SlotIndexes's maps.	2023-10-09 11:44:41 +01:00
Jay Foad	5aea839ab3	[AMDGPU] Switch to backwards scavenging in eliminateFrameIndex Frame index elimination runs backwards so we must use backwards scavenging. Otherwise, when a scavenged register is spilled, the scavenger will remember that the register is in use until the restore point, but it will never reach that restore point. The result is that in some cases it will keep scavenging different registers instead of reusing the same one. Differential Revision: https://reviews.llvm.org/D152394	2023-06-07 20:59:05 +01:00
Valery Pykhtin	342acfc9bb	[AMDGPU] Turn off pass to rewrite partially used virtual superregisters after RenameIndependentSubregs pass with registers of minimal size. This pass fails when the target register class for a subregister isn't known from the instruction description (e.g. COPY). Currently in this situation the RC is obtained using TargetRegisterInfo::getSubRegisterClass, but in general this does not work. To fix this, two things should be done: 1. Stop processing a subregister if the target register class is unknown (conservative approach). 2. Improve deduction of the subregister's target register class (i.e. by processing the COPY chain). I was going to implement point 1, but my tests use implicit operands for S_NOP, which have no associated target register class, so all tests fail. Therefore I decided to turn off the pass for now, then implement point 1 and fix my tests. Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D152291	2023-06-07 12:05:25 +02:00
Valery Pykhtin	8d0412ce9d	[AMDGPU] Add pass to rewrite partially used virtual superregisters after RenameIndependentSubregs pass with registers of minimal size. The main purpose of this is to simplify register pressure tracking, as after the pass there is no need to track subreg liveness anymore. On the other hand, this pass creates more possibilities for subreg-unaware code, as many of the subregs become ordinary registers. Interesting side effect: spill-vgpr.ll has lost a lot of spills. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D139732	2023-05-26 09:05:44 +02:00
Matt Arsenault	8e0fadda10	AMDGPU: Bulk update all GlobalISel tests to use opaque pointers	2022-11-28 11:51:36 -05:00
Alexander Timofeev	32bd75716c	PEI should be able to use backward walk in replaceFrameIndicesBackward. The backward register scavenger has correct register liveness information. PEI should leverage the backward register scavenger. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D137574	2022-11-18 15:57:34 +01:00
Fangrui Song	6c7666a408	Revert D137574 "PEI should be able to use backward walk in replaceFrameIndicesBackward." This reverts commit e05ce03cfa0b36e9b99149e21afcb1fc039df813. Caused asan use-after-poison in 4 DebugInfo/AMDGPU/ tests. Triggered in PEI::replaceFrameIndicesBackward called llvm::MachineInstr::getNumOperands	2022-11-15 19:19:46 +00:00
Alexander Timofeev	e05ce03cfa	PEI should be able to use backward walk in replaceFrameIndicesBackward. The backward register scavenger has correct register liveness information. PEI should leverage the backward register scavenger. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D137574	2022-11-15 15:20:25 +01:00
Matt Arsenault	69153d6c0a	AMDGPU: Use GlobalPriority for largest register tuples Only do this for 16 and 32 register tuples, although we might want to extend to 8 tuples. It's incredibly expensive to spill these, and doing so majorly interferes with the ability to allocate anything else in the function. The lit tests show mostly sizeable improvements with a handful of tiny regressions with large vectors.	2022-09-15 11:45:02 -04:00
Austin Kerbow	8470bf2b08	[AMDGPU] Do not reserve any VGPR for SGPR spills After the split register allocation changes in eebe841a47cb it is no longer necessary to reserve a VGPR before RA. This can also create bugs when IPRA is enabled since we cannot predict that a called function may not reserve any register if it does not have any SGPR spills. If that happens those functions may override reserved registers that are normally callee saved. Added a test to show this. Fixes: SWDEV-309900 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115551	2022-01-11 22:14:59 -08:00
Matt Arsenault	4a36e96c3f	RegAllocGreedy: Account for reserved registers in num regs heuristic This simple heuristic uses the estimated live range length combined with the number of registers in the class to switch which heuristic to use. This was taking the raw number of registers in the class, even though not all of them may be available. AMDGPU heavily relies on dynamically reserved numbers of registers based on user attributes to satisfy occupancy constraints, so the raw number is highly misleading. There are still a few problems here. In the original testcase that made me notice this, the live range size is incorrect after the scheduler rearranges instructions, since the instructions don't have the original InstrDist offsets. Additionally, I think it would be more appropriate to use the number of disjointly allocatable registers in the class. For the AMDGPU register tuples, there are a large number of registers in each tuple class, but only a small fraction can actually be allocated at the same time since they all overlap with each other. It seems we do not have a query that corresponds to the number of independently allocatable registers. Relatedly, I'm still debugging some allocation failures where overlapping tuples seem to not be handled correctly. The test changes are mostly noise. There are a handful of x86 tests that look like regressions with an additional spill, and a handful that now avoid a spill. The worst looking regression is likely test/Thumb2/mve-vld4.ll which introduces a few additional spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll shows a massive improvement by completely eliminating a large number of spills inside a loop.	2021-09-14 21:00:29 -04:00
Sebastian Neubauer	fbae34635d	[GlobalISel] Add combine for PTR_ADD with regbanks Combine two G_PTR_ADDs, but keep the register bank of the constant. That way, the combine can be used in post-regbank-select combines. Introduce two helper methods in CombinerHelper, getRegBank and setRegBank that get and set an optional register bank to a register. That way, they can be used before and after register bank selection. Differential Revision: https://reviews.llvm.org/D103326	2021-08-17 13:58:16 +02:00
Matt Arsenault	eebe841a47	RegAlloc: Allow targets to split register allocation AMDGPU normally spills SGPRs to VGPRs. Previously, since all register classes are handled at the same time, this was problematic. We don't know ahead of time how many registers will be needed to be reserved to handle the spilling. If no VGPRs were left for spilling, we would have to try to spill to memory. If the spilled SGPRs were required for exec mask manipulation, it is highly problematic because the lanes active at the point of spill are not necessarily the same as at the restore point. Avoid this problem by fully allocating SGPRs in a separate regalloc run from VGPRs. This way we know the exact number of VGPRs needed, and can reserve them for a second run. This fixes the most serious issues, but it is still possible using inline asm to make all VGPRs unavailable. Start erroring in the case where we ever would require memory for an SGPR spill. This is implemented by giving each regalloc pass a callback which reports if a register class should be handled or not. A few passes need some small changes to deal with leftover virtual registers. In the AMDGPU implementation, a new pass is introduced to take the place of PrologEpilogInserter for SGPR spills emitted during the first run. One disadvantage of this is currently StackSlotColoring is no longer used for SGPR spills. It would need to be run again, which will require more work. Error if the standard -regalloc option is used. Introduce new separate -sgpr-regalloc and -vgpr-regalloc flags, so the two runs can be controlled individually. PBQP is not currently supported, so this also prevents using the unhandled allocator.	2021-07-13 18:49:29 -04:00
Brendon Cahoon	f9f5d41545	[AMDGPU][GlobalISel] Legalize and select G_SBFX and G_UBFX Adds legalizer, register bank select, and instruction select support for G_SBFX and G_UBFX. These opcodes generate scalar or vector ALU bitfield extract instructions for AMDGPU. The instructions allow both constant or register values for the offset and width operands. The 32-bit scalar version is expanded to a sequence that combines the offset and width into a single register. There are no 64-bit vgpr bitfield extract instructions, so the operations are expanded to a sequence of instructions that implement the operation. If the width is a constant, then the 32-bit bitfield extract instructions are used. Moved the AArch64 specific code for creating G_SBFX to CombinerHelper.cpp so that it can be used by other targets. Only bitfield extracts with constant offset and width values are handled currently. Differential Revision: https://reviews.llvm.org/D100149	2021-06-28 09:06:44 -04:00
Sebastian Neubauer	96e1fcb1e0	[AMDGPU] Use s_add_i32 for address additions This allows converting the add instruction to s_addk_i32 and v_add_nc_u32 instead of needing v_add_co_u32 when converting to a VALU instruction. Differential Revision: https://reviews.llvm.org/D103322	2021-06-07 16:09:48 +02:00
Matt Arsenault	70cb57d7da	AMDGPU/GlobalISel: Improve private addressing mode matching This enables the look-through-copy to hack around not correctly regbankselecting constants to match the use bank.	2021-03-11 10:23:35 -05:00
Sebastian Neubauer	8b898b19a8	[AMDGPU] Remove unused tmp register The temporary register is only used to compute the frame pointer. The frame pointer is overwritten and not used in between, so we can reuse the frame pointer for the computation, saving one register. Differential Revision: https://reviews.llvm.org/D95865	2021-02-02 17:17:54 +01:00
Austin Kerbow	2291bd137d	[AMDGPU] Update subtarget features for new target ID support Support for XNACK and SRAMECC is not static on some GPUs. We must be able to differentiate between different scenarios for these dynamic subtarget features. The possible settings are: - Unsupported: The GPU has no support for XNACK/SRAMECC. - Any: Preference is unspecified. Use conservative settings that can run anywhere. - Off: Request support for XNACK/SRAMECC Off - On: Request support for XNACK/SRAMECC On GCNSubtarget will track the four options based on the following criteria. If the subtarget does not support XNACK/SRAMECC we say the setting is "Unsupported". If no subtarget features for XNACK/SRAMECC are requested we must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the feature string when initializing the subtarget, the settings are "On/Off". The defaults are updated to be conservatively correct, meaning if no setting for XNACK or SRAMECC is explicitly requested, defaults will be used which generate code that can be run anywhere. This corresponds to the "Any" setting. Differential Revision: https://reviews.llvm.org/D85882	2021-01-26 11:25:51 -08:00
Sebastian Neubauer	221fdedc69	[AMDGPU][GlobalISel] Fold flat vgpr + constant addresses Use getPtrBaseWithConstantOffset in selectFlatOffsetImpl to fold more vgpr+constant addresses. Differential Revision: https://reviews.llvm.org/D93692	2020-12-23 10:40:30 +01:00
Matt Arsenault	663f16684d	AMDGPU: Fix verifier error on killed spill of partially undef register This does unfortunately end up with extra waitcnts getting inserted that were avoided before. Ideally we would avoid the spills of these undef components in the first place.	2020-10-15 09:45:44 -04:00
Sebastian Neubauer	a343b9b032	Revert "[AMDGPU] Insert waitcnt after returning from call" This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6. According to michel.daenzer: "This completely broke the Mesa radeonsi driver on Navi 14. Xorg + xterm come up with major corruption & psychedelic colours."	2020-09-23 17:16:39 +02:00
Sebastian Neubauer	ca907bfb57	[AMDGPU] Insert waitcnt after returning from call When memory operations are outstanding on function calls, either the caller or the callee can insert a waitcnt to ensure that all reads are finished. Calls need some time to be executed, so if the callee inserts the waitcnt, filling the instruction buffer and waiting for memory will be interleaved, hiding some latency. This comes at the cost of having a waitcnt inside functions that may not be needed as no memory operations are outstanding. For function calls, this is already implemented. The same principle applies to returns: If the caller inserts a waitcnt after the call, the callee does not have to wait and the return and memory operation can be run in parallel. This commit implements waiting in the caller after returning from a function call. Differential Revision: https://reviews.llvm.org/D87674	2020-09-23 12:17:59 +02:00
Jay Foad	c799f873cb	[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530	2020-09-14 13:40:17 +01:00
QingShan Zhang	3359ea62ed	[Scheduling] Create the missing dependency edges for store cluster For a load cluster, we don't need to create the dependency edges (SUb->reg) from SUb to SUa, as they both depend on the base register "reg". [Diagram: SUa (Load 0(reg)) and SUb (Load 4(reg)), each with an edge to reg.] For a store cluster, however, we need to create the edge so that the instruction the store depends on is not scheduled between SUb and SUa. [Diagram: SUa (Store x 0(reg)) and SUb (Store y 4(reg)), each with an edge to reg; an edge into y is marked "Missing".] Reviewed By: evandro, arsenm, rampitec, foad, fhahn Differential Revision: https://reviews.llvm.org/D72031	2020-08-07 04:58:03 +00:00
Matt Arsenault	e00201539f	GlobalISel: Implement fewerElementsVector for G_EXTRACT_VECTOR_ELT Use the same basic strategy as LegalizeVectorTypes. Try to index into smaller pieces if there's a constant index, and otherwise fall back to a stack temporary.	2020-08-06 14:33:16 -04:00

28 Commits