llvm-project

Author	SHA1	Message	Date
Carl Ritson	f811482a74	[AMDGPU] SIWholeQuadMode: Ensure earliest WQM entry point for PS (#123266 ) Ensure shaders running WQM (PS) enter at the earliest point irrespective of WQM marking.	2025-01-19 15:50:33 +09:00
Piotr Sobczak	40fa7f5e8b	[AMDGPU] Fix computed kill mask (#122736 ) Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill lowering. This has the effect of AND'ing demote/kill condition with exec which is needed for proper live mask update. The S_XOR is inadequate because it may return true for lane with exec=0. This patch fixes an image corruption in game. I think the issue went unnoticed because demote/kill condition is often naturally dependent on exec, so AND'ing with exec is usually not required.	2025-01-14 10:00:40 +01:00
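The difference between the two mask computations described in #122736 can be illustrated with plain bit operations. This is a minimal sketch, not the pass's code. It assumes a 64-bit wave where bit i of each mask stands for lane i, and that the condition mask marks lanes that should stay alive. The helper names `killMaskXor` and `killMaskAndn2` are hypothetical, and the operand order in the real lowering is not given in the commit message, so it may differ.

```cpp
#include <cstdint>

// Hypothetical model of the old computation (S_XOR): a lane is flagged
// whenever its condition bit and its exec bit differ, which includes
// lanes where exec is 0 but the condition bit happens to be 1.
constexpr std::uint64_t killMaskXor(std::uint64_t keepCond, std::uint64_t exec) {
  return keepCond ^ exec;
}

// Hypothetical model of the new computation (S_ANDN2, i.e. S0 & ~S1):
// a lane is flagged only if it is in exec AND its condition bit is 0.
constexpr std::uint64_t killMaskAndn2(std::uint64_t keepCond, std::uint64_t exec) {
  return exec & ~keepCond;
}

// Lanes 1 and 2 are executing; the condition is set for lanes 2 and 3.
// XOR wrongly flags lane 3, which has exec = 0.
static_assert(killMaskXor(0b1100, 0b0110) == 0b1010, "XOR flags inactive lane 3");
static_assert(killMaskAndn2(0b1100, 0b0110) == 0b0010, "ANDN2 flags only lane 1");
```

When the condition is already zero outside exec, both forms give the same result, which is consistent with the commit's remark that the issue often goes unnoticed.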
paperchalice	1562b70eaf	Reapply "[DomTreeUpdater] Move critical edge splitting code to updater" (#119547 ) This relands commit #115111. Use traditional way to update post dominator tree, i.e. break critical edge splitting into insert, insert, delete sequence. When splitting critical edges, the post dominator tree may change its root node, and `setNewRoot` only works in normal dominator tree... See `6c7e5827ed/llvm/include/llvm/Support/GenericDomTree.h (L684-L687)`	2024-12-13 11:43:09 +08:00
paperchalice	553058f825	Revert "[DomTreeUpdater] Move critical edge splitting code to updater" (#119512 ) Reverts llvm/llvm-project#115111 Causes #119511	2024-12-11 14:25:17 +08:00
paperchalice	79047fac65	[DomTreeUpdater] Move critical edge splitting code to updater (#115111 ) Support critical edge splitting in dominator tree updater. Continue the work in #100856. Compile time check: https://llvm-compile-time-tracker.com/compare.php?from=87c35d782795b54911b3e3a91a5b738d4d870e55&to=42b3e5623a9ab4c3648564dc0926b36f3b438a3a&stat=instructions%3Au	2024-12-11 11:31:42 +08:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Diana Picus	3356208531	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512 ) This reverts commit `7792b4ae79`. The problem was a conflict with `e55d6f5ea2` "[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (https://github.com/llvm/llvm-project/pull/107889)" which changed the syntax of V_SET_INACTIVE (and thus made my MIR test crash). ...if only we had a merge queue.	2024-09-13 11:54:30 +02:00
Diana Picus	7792b4ae79	Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )"" (#108341 ) Reverts llvm/llvm-project#108173 si-init-whole-wave.mir crashes on some buildbots (although it passed both locally with sanitizers enabled and in pre-merge tests). Investigating.	2024-09-12 10:12:09 +02:00
Diana Picus	703ebca869	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )" (#108173 ) This reverts commit `c7a7767fca`. The buildbots failed because I removed a MI from its parent before updating LIS. This PR should fix that.	2024-09-12 09:11:41 +02:00
Jay Foad	e55d6f5ea2	[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889 ) Always generate v_cndmask_b32 instead of modifying exec around v_mov_b32. This is expected to be faster because modifying exec generally causes pipeline stalls.	2024-09-11 17:16:06 +01:00
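The per-lane select that a conditional-move style lowering performs for #107889 can be modeled in a few lines. This is only a sketch under stated assumptions: an illustrative 8-lane wave, an exec mask whose bit i is set when lane i is active, and a hypothetical function `setInactive`. It models the result value only, not the instruction selection or its timing.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative wave size (assumption); real waves have 32 or 64 lanes.
constexpr std::size_t kLanes = 8;
using LaneValues = std::array<std::uint32_t, kLanes>;

// Hypothetical model: active lanes keep their own value, inactive lanes
// receive the "inactive" value. One select per lane, with no change to
// the exec mask itself.
inline LaneValues setInactive(const LaneValues &activeVal,
                              const LaneValues &inactiveVal,
                              std::uint64_t exec) {
  LaneValues out{};
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    const bool active = ((exec >> lane) & 1u) != 0;
    out[lane] = active ? activeVal[lane] : inactiveVal[lane];
  }
  return out;
}
```

The alternative named in the commit message, flipping exec around a move, would produce the same values; the commit's argument is about avoiding the exec writes.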
Jay Foad	01967e2658	[AMDGPU] Shrink a live interval instead of recomputing it. NFCI. (#108171 )	2024-09-11 14:55:14 +01:00
Vitaly Buka	c7a7767fca	Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 ) Breaks bots, see #105822. Reverts llvm/llvm-project#105822	2024-09-10 09:51:43 -07:00
Diana Picus	44556e64f2	[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822 ) This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail. As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions): ``` entry: %func_exec = call i1 @llvm.amdgcn.init.whole.wave() br i1 %func_exec, label %func, label %tail func: ; ... stuff that should run with the actual EXEC mask br label %tail tail: ; ... stuff that runs with all the lanes enabled; ; can contain more than one basic block ``` It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch). The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now is expanded in si-wqm (which is where SI_INIT_EXEC is handled too); however the information that the function was conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog epilog insertion, where we can skip saving the inactive lanes for CSRs (since if the function started with all the lanes active, then there are no inactive lanes to preserve).	2024-09-10 13:24:53 +02:00
Jay Foad	1d44ecb9da	[AMDGPU] Remove unnecessary untieRegOperand (#107695 ) As far as I can tell, V_SET_INACTIVE has never had tied operands.	2024-09-08 08:21:19 +01:00
Carl Ritson	16cda01d22	[AMDGPU] V_SET_INACTIVE optimizations (#98864 ) Optimize V_SET_INACTIVE by allowing it to run in WWM. Hence WWM sections are not broken up for inactive lane setting. WWM V_SET_INACTIVE can typically be lowered to V_CNDMASK. Some cases require use of exec manipulation V_MOV as in the previous code. GFX9 sees a slight instruction count increase in edge cases due to its smaller constant bus. Additionally, avoid introducing exec manipulation and V_MOVs where a source of V_SET_INACTIVE is the destination. This is a common pattern as WWM register pre-allocation often assigns the same register.	2024-09-05 14:39:28 +09:00
Carl Ritson	3611c0b703	[AMDGPU] SIWholeQuadMode: avoid execz effects in exact regions (#101157 ) Exact mode regions within WQM may have EXEC=0 in divergent control flow. This occurs if a branch is only taken by helper lanes and an instruction requiring WQM disabling is encountered. The current code extends the exact region as far as possible; however, this can result in it including instructions with unwanted side effects at EXEC=0. In particular readfirstlane combined with scalar loads can produce invalid memory accesses in this circumstance. Work around this by shrinking exact regions to only the instructions requiring WQM disabling when unwanted side effects are present. Eventually we should branch over these regions when EXEC=0, but this requires visibility of CFG/divergence information not currently available.	2024-08-01 18:46:36 +09:00
Carl Ritson	8d28a4102b	[AMDGPU] Remove SIWholeQuadMode pass early exit (#98450 ) Merge the code bypass elements from the early exit into the main pass execution flow.	2024-07-17 19:38:23 +09:00
Jay Foad	5e338f1f4a	[AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC.	2024-07-17 08:27:35 +01:00
Carl Ritson	36984536be	[AMDGPU] SIWholeQuadMode: remove unnecessary map access (NFCI)	2024-07-15 13:51:18 +09:00
Jay Foad	d4e46f0e86	[AMDGPU] Fix machine verification failure from INIT_EXEC lowering (#98333 ) Fix machine verification failure from INIT_EXEC lowering since it was moved from SILowerControlFlow to SIWholeQuadMode in #94452.	2024-07-11 09:18:50 +01:00
Nikita Popov	6a907699d8	Revert "[CodeGen] Remove `applySplitCriticalEdges` in `MachineDominatorTree` (#97055 )" This reverts commit c5e5088033fed170068d818c54af6862e449b545. Causes large compile-time regressions.	2024-07-11 09:13:37 +02:00
paperchalice	c5e5088033	[CodeGen] Remove `applySplitCriticalEdges` in `MachineDominatorTree` (#97055 ) Summary: - Remove wrappers in `MachineDominatorTree`. - Remove `MachineDominatorTree` update code in `MachineBasicBlock::SplitCriticalEdge`. - Use `MachineDomTreeUpdater` in passes which call `MachineBasicBlock::SplitCriticalEdge` and preserve `MachineDominatorTreeWrapperPass` or CFG analyses. Commit abea99f65a97248974c02a5544eaf25fc4240056 introduced related methods in 2014. Now we have SemiNCA based dominator tree in 2017 and dominator tree updater, the solution adopted here seems a bit outdated.	2024-07-11 11:08:05 +08:00
paperchalice	abde52aa66	[CodeGen][NewPM] Port `LiveIntervals` to new pass manager (#98118 ) - Add `LiveIntervalsAnalysis`. - Add `LiveIntervalsPrinterPass`. - Use `LiveIntervalsWrapperPass` in legacy pass manager. - Use `std::unique_ptr` instead of raw pointer for `LICalc`, so destructor and default move constructor can handle it correctly. This would be the last analysis required by `PHIElimination`.	2024-07-10 19:34:48 +08:00
paperchalice	4010f894a1	[CodeGen][NewPM] Port `SlotIndexes` to new pass manager (#97941 ) - Add `SlotIndexesAnalysis`. - Add `SlotIndexesPrinterPass`. - Use `SlotIndexesWrapperPass` in legacy pass.	2024-07-09 12:09:11 +08:00
Carl Ritson	db096adba0	[AMDGPU] Remove SIWholeQuadMode pseudo wavemode optimization (#94133 ) This does not work correctly in divergent control flow. Can be replaced with a later exec mask manipulation optimizer. This reverts commit a3646ec1bc662e221c2a1d182987257c50958789.	2024-06-12 15:51:40 +09:00
paperchalice	4b24c2dfb5	[CodeGen][NewPM] Split `MachinePostDominators` into a concrete analysis result (#95113 ) `MachinePostDominators` version of #94571.	2024-06-12 14:29:22 +08:00
paperchalice	837dc542b1	[CodeGen][NewPM] Split `MachineDominatorTree` into a concrete analysis result (#94571 ) Prepare for new pass manager version of `MachineDominatorTreeAnalysis`. We may need a machine dominator tree version of `DomTreeUpdater` to handle `SplitCriticalEdge` in some CodeGen passes.	2024-06-11 21:27:14 +08:00
Jay Foad	df6750eaa8	[AMDGPU] Fix interaction between WQM and llvm.amdgcn.init.exec (#93680 ) Whole quad mode requires inserting a copy of the initial EXEC mask. In a function that also uses llvm.amdgcn.init.exec, insert the COPY after initializing EXEC.	2024-06-07 13:23:15 +01:00
Jay Foad	4c6dd70ec4	[AMDGPU] Move INIT_EXEC lowering from SILowerControlFlow to SIWholeQuadMode (#94452 ) NFCI; this just preserves SI_INIT_EXEC and SI_INIT_EXEC_FROM_INPUT instructions a little longer so that we can reliably identify them in SIWholeQuadMode.	2024-06-06 10:29:55 +01:00
Jay Foad	180448b13c	[AMDGPU] Reduce use of continue in SIWholeQuadMode. NFC. (#93659 )	2024-05-29 14:40:08 +01:00
Xu Zhang	f6d431f208	[CodeGen] Make the parameter TRI required in some functions. (#85968 ) Fixes #82659 There are some functions, such as `findRegisterDefOperandIdx` and `findRegisterDefOperand`, that have too many default parameters. As a result, we have encountered some issues due to the lack of TRI parameters, as shown in issue #82411. Following @RKSimon 's suggestion, this patch refactors 9 functions, including `{reads, kills, defines, modifies}Register`, `registerDefIsDead`, and `findRegister{UseOperandIdx, UseOperand, DefOperandIdx, DefOperand}`, adjusting the order of the TRI parameter and making it required. In addition, all the places that call these functions have also been updated correctly to ensure no additional impact. After this, the caller of these functions should explicitly know whether to pass the `TargetRegisterInfo` or just a `nullptr`.	2024-04-24 14:24:14 +01:00
Mirko Brkušanin	82e33d6203	[AMDGPU] Add VDSDIR instructions for GFX12 (#75197 )	2024-01-03 16:32:00 +01:00
Diana Picus	20e9e4f797	[AMDGPU] si-wqm: Skip only LiveMask COPY si-wqm sometimes needs to save the LiveMask in the entry block. Later on, while looking for a place to enter WQM/WWM, it unconditionally skips over the first COPY instruction in the entry block. This is incorrect for functions where the LiveMask doesn't need to be saved, and therefore the first COPY is more likely a COPY from a function argument and might need to be in some non-exact mode. This patch fixes the issue by also checking that the source of the COPY is the EXEC register. This produces different code in 3 of the existing tests: In wwm-reserved.ll, a SGPR copy is now inside the WWM area rather than outside. This is benign. In wave32.ll, we end up with an extra register copy. This is because the first COPY in the block is now part of the WWM block, so si-pre-allocate-wwm-regs will allocate a new register for its destination (when it was outside of the WWM region, the register allocator could just re-use the same register). We might be able to improve this in si-pre-allocate-wwm-regs but I haven't looked into it. The same thing happens in dual-source-blend-export.ll, but for that one it's harder to see because of the scheduling changes. I've uploaded the before/after si-wqm output for it here: https://reviews.llvm.org/differential/diff/553445/ Differential Revision: https://reviews.llvm.org/D158841	2023-11-10 09:30:44 +01:00
Carl Ritson	0eb516817d	[AMDGPU] Remove dom tree requirements from SIWholeQuadMode pass (#71012 ) SIWholeQuadMode preserves dominator and post dominator trees, but does not require them.	2023-11-02 17:16:19 +09:00
Jay Foad	da7892f729	[MC] Use regunits instead of MCRegUnitIterator. NFC. Differential Revision: https://reviews.llvm.org/D153122	2023-06-16 12:21:32 +01:00
Sergei Barannikov	aa2d0fbc30	[MC] Add MCRegisterInfo::regunits for iteration over register units Reviewed By: foad Differential Revision: https://reviews.llvm.org/D152098	2023-06-16 05:39:50 +03:00
Carl Ritson	6afc4b0629	[AMDGPU] WQM: Ensure exact mode placement before branches Fix for D151797 where the change accidentally allowed exit to exact mode between branch instructions. Reviewed By: dstuttard Differential Revision: https://reviews.llvm.org/D152228	2023-06-06 18:11:35 +09:00
Jay Foad	3030c03988	[AMDGPU] Make use of MachineInstr::all_defs and all_uses. NFCI.	2023-06-05 10:32:33 +01:00
Carl Ritson	2e87ed80b2	[AMDGPU] WQM: Allow insertion of exact mode transition as terminator Allow WQM pass to insert transitions to exact mode among block terminators, instead of forcing them to occur before terminators. This should not yield any functional change, but allows block splitting of control flow, such as that in D145329. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D151797	2023-06-02 14:00:54 +09:00
Mikael Holmen	cafb0991a2	[AMDGPU] Silence gcc warning [NFC] Without the fix gcc complains with `../lib/Target/AMDGPU/SIWholeQuadMode.cpp:1543: warning: enumeral and non-enumeral type in conditional expression [-Wextra]` on the expression `unsigned CopyOp = MI->getOperand(1).isReg() ? AMDGPU::COPY : TII->getMovOpcode(TRI->getRegClassForOperandReg(*MRI, MI->getOperand(0)));`	2023-05-26 10:17:59 +02:00
Jay Foad	9283c43ee2	[AMDGPU] Fix lowering of @llvm.amdgcn.set.inactive(imm, poison) If the second argument of V_SET_INACTIVE is undef/poison, SIWholeQuadMode lowered it to a COPY from the first argument, but that caused invalid MIR if the first argument was an immediate rather than a register. Fix this by lowering to a V_MOV instruction instead of a COPY. Fixes https://github.com/llvm/llvm-project/issues/62862 Differential Revision: https://reviews.llvm.org/D151105	2023-05-22 16:31:27 +01:00
Carl Ritson	5bc703f755	[AMDGPU] Replace getPhysRegClass with getPhysRegBaseClass Accelerate finding the base class for a physical register by building a static mapping table from physical registers to base classes using TableGen. Replace uses of SIRegisterInfo::getPhysRegClass with TargetRegisterInfo::getPhysRegBaseClass in order to use the computed table. Reviewed By: arsenm, foad Differential Revision: https://reviews.llvm.org/D139422	2022-12-20 16:22:14 +09:00
Jay Foad	6443c0ee02	[AMDGPU] Stop using make_pair and make_tuple. NFC. C++17 allows us to call constructors pair and tuple instead of helper functions make_pair and make_tuple. Differential Revision: https://reviews.llvm.org/D139828	2022-12-14 13:22:26 +00:00
Carl Ritson	a3646ec1bc	[AMDGPU] Add pseudo wavemode to optimize strict_wqm Strict WQM does not require a WQM transition if it occurs within an existing WQM section. This occurs heavily in GFX11 pixel shaders with LDS_PARAM_LOAD, which leads to unnecessary EXEC mask manipulation. To avoid these transitions, detect WQM -> Strict WQM -> WQM and substitute new ENTER_PSEUDO_WM/EXIT_PSEUDO_WM markers instead. These are treated similarly by the WWM register pre-allocation pass, but do not manipulate EXEC or use registers to save EXEC state. Reviewed By: piotr Differential Revision: https://reviews.llvm.org/D136813	2022-10-28 09:45:17 +09:00
Matt Arsenault	7834194837	TableGen: Introduce generated getSubRegisterClass function Currently there isn't a generic way to get a smaller register class that can be produced from a subregister of a larger class. Replaces a manually implemented version for AMDGPU. This will be used to improve subregister support in the allocator.	2022-09-12 09:03:37 -04:00
Ruiling Song	732eed40fd	[AMDGPU] Mark GFX11 dual source blend export as strict-wqm The instructions that generate the source of dual source blend export should run in strict-wqm. That is if any lane in a quad is active, we need to enable all four lanes of that quad to make the shuffling operation before exporting to dual source blend target work correctly. Differential Revision: https://reviews.llvm.org/D127981	2022-06-20 21:58:12 +01:00
Piotr Sobczak	29621c13ef	[AMDGPU] Tag GFX11 LDS loads as using strict_wqm LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad (if any pixel is enabled in the quad, data is written to all 4 pixels/threads in the quad). Tag LDS_PARAM_LOAD and LDS_DIRECT_LOAD as using strict_wqm to enforce this and avoid lane clobbering issues. Note that only the instruction itself is tagged. The implicit uses of these do not need to be set WQM. This reduces unnecessary WQM calculation of M0. Differential Revision: https://reviews.llvm.org/D127977	2022-06-20 21:58:12 +01:00
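The per-quad rule described in the two strict-wqm commits above (if any lane of a quad is enabled, all four lanes of that quad are enabled) can be written as a small mask transformation. This is a minimal sketch, assuming a 64-bit exec mask in which lanes 4k to 4k+3 form quad k. The function name `wholeQuadMask` is hypothetical and is not taken from the pass.

```cpp
#include <cstdint>

// Hypothetical helper: expand an exec mask so that every quad (group of
// four consecutive lanes) with at least one active lane becomes fully active.
constexpr std::uint64_t wholeQuadMask(std::uint64_t exec) {
  // Fold the four bits of each quad into the quad's lowest bit.
  std::uint64_t any = exec | (exec >> 1) | (exec >> 2) | (exec >> 3);
  any &= 0x1111111111111111ULL;
  // Spread each surviving low bit across its whole nibble (1 -> 0xF).
  return any * 0xF;
}

// Lane 0 (quad 0) and lane 9 (quad 2) are active; quads 0 and 2 fill in.
static_assert(wholeQuadMask(0x0201) == 0x0F0F, "quads 0 and 2 become fully active");
static_assert(wholeQuadMask(0) == 0, "no active lanes means no active quads");
```

The multiply cannot carry between nibbles because each nibble of `any` holds at most the value 1.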
Kazu Hirata	4271a1ff33	[llvm] Call *set::insert without checking membership first (NFC)	2022-06-18 10:17:22 -07:00
Shengchen Kan	37b378386e	[NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments	2022-03-16 20:25:42 +08:00
Ruiling Song	98dd390573	AMDGPU: Use removeAllRegUnitsForPhysReg() I met the issue here when working on something else. Actually we have already reserved EXEC, but it looks like the register coalescer is causing a sub-register of EXEC to appear in LiveIntervals. I have not looked deeper into why the register coalescer has such behavior, but removeAllRegUnitsForPhysReg() is the right way. Reviewed By: critson, foad, arsenm Differential Revision: https://reviews.llvm.org/D117014	2022-03-15 10:28:27 +08:00


126 Commits