llvm-project

Author	SHA1	Message	Date
Carl Ritson	a3a3e6997b	[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750 ) - Algorithm operates over whole IR to attempt to minimize waits. - Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.	2025-01-30 11:21:11 +09:00
Joel E. Denny	18f8106f31	[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944 ) This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs. By default, the pass runs at the end of LTO, and options like ``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang` command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in `llvm/test/Analysis/KernelInfo`.	2025-01-29 12:40:19 -05:00
Chaitanya	3c79a04cc2	[AMDGPU] Add amdgpu-sw-lower-lds pass to NPM codegen addIRPasses. (#124102 ) This PR adds amdgpu-sw-lower-lds pass to AMDGPUCodeGenPassBuilder::addIRPasses()	2025-01-24 11:15:30 +05:30
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies, i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
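The arithmetic in the justifying example of 6206f5444f can be checked with a small brute-force sketch. This is not the constant-time algorithm the commit implements: it simply tries every workgroup size in the range. The function names are hypothetical, the target parameters (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS usage) are taken from the commit message, and rounding waves-per-EU down is a simplifying assumption.

```python
def occupancy(group_size, wave_size=64, max_waves_per_eu=10, eus_per_cu=4):
    """Waves per EU achieved when the CU is filled with whole groups.

    Hypothetical helper: ignores LDS and register limits and rounds
    waves-per-EU down (an assumption, not taken from the LLVM sources).
    """
    waves_per_group = (group_size + wave_size - 1) // wave_size  # ceiling division
    wave_slots = max_waves_per_eu * eus_per_cu                   # 40 slots per CU here
    groups_per_cu = wave_slots // waves_per_group                # only whole groups fit
    return (groups_per_cu * waves_per_group) // eus_per_cu


def occupancy_range(min_size, max_size):
    """Brute-force (min, max) occupancy over every size in [min_size, max_size]."""
    values = [occupancy(size) for size in range(min_size, max_size + 1)]
    return min(values), max(values)


# Occupancy at the bounds of the flat workgroup size range.
print(occupancy(513), occupancy(1024))
# Occupancy at the interior sizes named in the commit message.
print(occupancy(640), occupancy(896))
# Achievable range over the whole [513, 1024] interval.
print(occupancy_range(513, 1024))
```

Under these assumptions the bounds 513 and 1024 give 9 and 8, while the interior sizes 640 and 896 give 10 and 7, which matches the [7,10] range stated in the commit message.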
Matt Arsenault	93d35ad5f5	AMDGPU: Delete FillMFMAShadowMutation (#123861 ) No test changes with this removed and it appears to be obsolete.	2025-01-22 22:41:25 +07:00
Akshat Oke	a343b8e595	[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695 )	2025-01-22 14:54:01 +05:30
Akshat Oke	9b6e8df896	[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592 ) Extends NPM pipeline support till PostRegAlloc passes (greedy is in the works)	2025-01-21 15:27:46 +05:30
Akshat Oke	96c4f978d0	[AMDGPU][NewPM] Port SIOptimizeExecMasking to NPM (#123572 )	2025-01-20 16:34:01 +05:30
Christudasan Devadasan	1797fb6b23	[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045 )	2025-01-16 11:06:38 +05:30
Akshat Oke	73b0e8a191	[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434 )	2025-01-13 17:52:30 +05:30
Akshat Oke	f431f93a77	[CodeGen][NewPM] Use proper NPM AtomicExpandPass in AMDGPU (#122086 ) `PassRegistry.def` already has this entry, but the dummy definition was being pulled instead. I couldn't reproduce the build failures that FIXME referenced, maybe the Dummy pass getting in the way was part of the cause.	2025-01-13 10:38:24 +05:30
Akshat Oke	7bf1cb702b	[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261 )	2025-01-13 10:11:40 +05:30
Austin Kerbow	2e5c298281	[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167 ) Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function. 1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes. 2. The backwards compatible entry point which starts at offset 0.	2025-01-10 11:39:02 -08:00
Kirill Stoimenov	e821f642fd	Revert "[AMDGPU][CodeGen] Do not backtrace invalid -regalloc param (#119687 )" Causes bot failure: https://lab.llvm.org/buildbot/#/builders/55/builds/4246/steps/11/logs/stdio This reverts commit 7a648554f886fbc043c4f3f58ca88f6c4535f2cf.	2024-12-14 03:47:53 +00:00
Akshat Oke	7a648554f8	[AMDGPU][CodeGen] Do not backtrace invalid -regalloc param (#119687 ) No need to generate a stack trace and a GitHub issue prompt on a wrongly set regalloc option.	2024-12-13 11:58:53 +05:30
Akshat Oke	0876c11cee	[AMDGPU] Parse wwm filter flag for regalloc fast (#119347 )	2024-12-12 13:51:02 +05:30
Shilei Tian	04269ea0e4	[AMDGPU] Re-enable closed-world assumption as an opt-in feature (#115371 ) Although the ABI (if one exists) doesn’t explicitly prohibit cross-code-object function calls—particularly since our loader can handle them—such calls are not actually allowed in any of the officially supported programming models. However, this limitation has some nuances. For instance, the loader can handle cross-code-object global variables, which complicates the situation further. Given this complexity, assuming a closed-world model at link time isn’t always safe. To address this, this PR introduces an option that enables this assumption, providing end users the flexibility to enable it for improved compiler optimizations. However, it is the user’s responsibility to ensure they do not violate this assumption.	2024-12-10 15:57:41 -05:00
Ruiling, Song	b33c807b39	[AMDGPU] Add MaxMemoryClauseSchedStrategy (#114957 ) Also expose an option to choose custom scheduler strategy: amdgpu-sched-strategy={max-ilp|max-memory-clause} This can be set through either function attribute or command line option. The major behaviors of the max memory clause schedule strategy include: 1. Try to cluster memory instructions more aggressively. 2. Try to schedule long latency loads earlier than short latency instructions. I tested locally against about 470 real shaders and got the following perf changes (only counting perf changes over +/-10%): About 15 shaders improved 10%~40%. Only 3 shaders dropped ~10%. (This was tested together with another change which increases the maximum clustered dword from 8 to 32). I will make another change to make that threshold configurable.	2024-12-09 10:07:27 +08:00
Petar Avramovic	fef54d0393	AMDGPU/GlobalISel: Add skeletons for new register bank select passes (#112862 ) New register bank select for AMDGPU will be split in two passes: - AMDGPURegBankSelect: select banks based on machine uniformity analysis - AMDGPURegBankLegalize: lower instructions that can't be inst-selected with register banks assigned by AMDGPURegBankSelect. AMDGPURegBankLegalize is similar to legalizer but with context of uniformity analysis. Does not change already assigned banks. Main goal of AMDGPURegBankLegalize is to provide high level table-like overview of how to lower generic instructions based on available target features and uniformity info (uniform vs divergent). See RegBankLegalizeRules. Summary of new features: At the moment register bank select assigns register bank to output register using simple algorithm: - one of the inputs is vgpr output is vgpr - all inputs are sgpr output is sgpr. When function does not contain divergent control flow propagating register banks like this works. In general, first point is still correct but second is not when function contains divergent control flow. Examples: - Phi with uniform inputs that go through divergent branch - Instruction with temporal divergent use. To fix this AMDGPURegBankSelect will use machine uniformity analysis to assign vgpr to each divergent and sgpr to each uniform instruction. But some instructions are only available on VALU (for example floating point instructions before gfx1150) and we need to assign vgpr to them. Since we are no longer propagating register banks we need to ensure that uniform instructions get their inputs in sgpr in some way. In AMDGPURegBankLegalize uniform instructions that are only available on VALU will be reassigned to vgpr on all operands and read-any-lane vgpr output to original sgpr output.	2024-12-03 16:02:00 -05:00
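The difference between the two bank-assignment rules described in fef54d0393 can be illustrated with a toy sketch. Both functions and the example phi are hypothetical and only model the rules as stated in the commit message; they are not code from the AMDGPU backend.

```python
def propagate_bank(input_banks):
    """Existing simple rule: any vgpr input gives vgpr, otherwise sgpr."""
    return "vgpr" if "vgpr" in input_banks else "sgpr"


def select_bank_by_uniformity(is_divergent):
    """Rule described for the new pass: divergent gives vgpr, uniform gives sgpr."""
    return "vgpr" if is_divergent else "sgpr"


# Hypothetical phi whose inputs are all uniform (sgpr) but which merges values
# across a divergent branch, so the phi result itself is divergent.
phi_inputs = ["sgpr", "sgpr"]
phi_is_divergent = True

old_bank = propagate_bank(phi_inputs)                   # sgpr: wrong for a divergent value
new_bank = select_bank_by_uniformity(phi_is_divergent)  # vgpr

print(old_bank, new_bank)
```

The sketch covers only the first example from the commit message (a phi with uniform inputs behind a divergent branch); it does not model temporal divergence or the VALU-only instructions that AMDGPURegBankLegalize handles.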
Christudasan Devadasan	c5ab28a42d	[AMDGPU][NewPM] Port SIOptimizeVGPRLiveRange pass to NPM. (#117686 )	2024-11-29 09:11:24 +05:30
Petar Avramovic	87503fa51c	Revert "AMDGPU/GlobalISel: Add stub custom regbankselect pass" (#113913 ) This reverts commit e9c49901a43f5b16c3df416460b7e4dbdd24ce03. Current AMDGPURegBankSelect does nothing different than RegBankSelect. Revert to using generic RegBankSelect in preparation for adding new regbankselect passes. New AMDGPURegBankSelect, that will use uniformity analysis for regbank select decisions, will not subclass RegBankSelect. Revert regression tests to use regbankselect since amdgpu-regbankselect will be used by new pass and behavior will be different.	2024-11-27 13:16:22 -05:00
Jay Foad	89cb0eefcb	[AMDGPU] Move GCNPreRAOptimizations after MachineScheduler (#116211 ) This is in preparation for adding a new optimization to the pass that cares about the order of instructions. The existing optimization does not care, so this just causes minor codegen differences.	2024-11-16 09:40:46 +00:00
Matin Raayai	bb3f5e1fed	Overhaul the TargetMachine and LLVMTargetMachine Classes (#111234 ) Following discussions in #110443, and the following earlier discussions in https://lists.llvm.org/pipermail/llvm-dev/2017-October/117907.html, https://reviews.llvm.org/D38482, https://reviews.llvm.org/D38489, this PR attempts to overhaul the `TargetMachine` and `LLVMTargetMachine` interface classes. More specifically: 1. Makes `TargetMachine` the only class implemented under `TargetMachine.h` in the `Target` library. 2. `TargetMachine` contains target-specific interface functions that relate to IR/CodeGen/MC constructs, whereas before (at least on paper) it was supposed to have only IR/MC constructs. Any Target that doesn't want to use the independent code generator simply does not implement them, and returns either `false` or `nullptr`. 3. Renames `LLVMTargetMachine` to `CodeGenCommonTMImpl`. This renaming aims to make the purpose of `LLVMTargetMachine` clearer. Its interface was moved under the CodeGen library, to further emphasize its usage in Targets that use CodeGen directly. 4. Makes `TargetMachine` the only interface used across LLVM and its projects. With these changes, `CodeGenCommonTMImpl` is simply a set of shared function implementations of `TargetMachine`, and CodeGen users don't need to static cast to `LLVMTargetMachine` every time they need a CodeGen-specific feature of the `TargetMachine`. 5. More importantly, does not change any requirements regarding library linking. cc @arsenm @aeubanks	2024-11-14 13:30:05 -08:00
Kazu Hirata	be187369a0	[AMDGPU] Remove unused includes (NFC) (#116154 ) Identified with misc-include-cleaner.	2024-11-13 21:10:03 -08:00
Jay Foad	2560505203	[AMDGPU] Reorder GCNPassConfig::addOptimizedRegAlloc. NFC. (#115873 ) This just makes it so that the added passes are mentioned in this function in the same order that they will appear in the final pass pipeline.	2024-11-13 14:38:23 +00:00
Akshat Oke	3495d04560	[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129 )	2024-11-05 13:17:25 +05:30
Shilei Tian	390300d9f4	[PassBuilder] Add `ThinOrFullLTOPhase` to optimizer pipeline (#114577 )	2024-11-03 23:25:29 -05:00
Shilei Tian	dc45ff1d2a	[PassBuilder] Add `ThinOrFullLTOPhase` to early simplication EP call backs (#114547 ) The early simplification pipeline is used in non-LTO mode and in the (Thin/Full)LTO pre-link stage. There are some passes that we want in non-LTO mode, but not at the LTO pre-link stage. The control is missing currently. This PR adds the support. To demonstrate the use, we only enable the internalization pass in non-LTO mode for AMDGPU because having it run in the pre-link stage causes some issues.	2024-11-03 23:24:10 -05:00
Shilei Tian	10a1ea9b53	[NFC][AMDGPU] Remove the empty FPM as well as the adaptor to MPM (#114558 )	2024-11-01 12:21:26 -04:00
Akshat Oke	ca32bd643b	[NewPM][AMDGPU] Port SIPreAllocateWWMRegs to NPM (#109939 )	2024-10-22 15:37:08 +05:30
Akshat Oke	6360652e9f	Reland [AMDGPU] Serialize WWM_REG vreg flag (#110229 ) (#112492 ) A reland but not an exact copy as `VRegInfo.Flags` from the parser is now an int8 instead of a vector; so only need to copy over the value.	2024-10-21 13:44:09 +05:30
Christudasan Devadasan	72a7b471de	[AMDGPU][NewPM] Fill out addILPOpts. (#108514 )	2024-10-16 13:30:46 +05:30
Christudasan Devadasan	488d3924dd	[CodeGen][NewPM] Port EarlyIfConversion pass to NPM. (#108508 )	2024-10-16 13:22:57 +05:30
Peter Collingbourne	3cab8827fd	Revert "[AMDGPU] Serialize WWM_REG vreg flag (#110229 )" This reverts commit bec839d8eed9dd13fa7eaffd50b28f8f913de2e2. Caused buildbot failures, e.g. https://lab.llvm.org/buildbot/#/builders/52/builds/2928	2024-10-15 13:18:43 -07:00
Akshat Oke	bec839d8ee	[AMDGPU] Serialize WWM_REG vreg flag (#110229 )	2024-10-14 14:37:21 +05:30
Akshat Oke	039e6f879c	[AMDGPU][NewPM] Fill out AMDGPU addMachineSSAOptimizations (#111658 ) Implement the addMachineSSAOptimizations passes for AMDGPU. Porting the other generic passes in this category is WIP.	2024-10-10 15:35:11 +05:30
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
vikashgu	870bdc6ea7	Reapply "[AMDGPU]Optimize SGPR spills (#93668 )" This reverts commit c2fc7f75f67039bb1ed577bc0edbd699a850cd9d, as the dependent patch that splits the vgpr regalloc pipeline solved the issue (#96353).	2024-10-03 09:47:15 +00:00
Christudasan Devadasan	ac0f64f06d	[AMDGPU] Split vgpr regalloc pipeline (#93526 ) Allocating wwm-registers and per-thread VGPR operands together imposes many challenges in the way the registers are reused during allocation. There are times when regalloc reuses the registers of regular VGPRs operations for wwm-operations in a small range leading to unwantedly clobbering their inactive lanes causing correctness issues that are hard to trace. This patch splits the VGPR allocation pipeline further to allocate wwm-registers first and the regular VGPR operands in a separate pipeline. The splitting would ensure that the physical registers used for wwm allocations won't take part in the next allocation pipeline to avoid any such clobbering.	2024-09-30 19:55:42 +05:30
Matt Arsenault	a87640c97e	AMDGPU: Fix assertion on load of vector of pointers (#110436 ) Fix InferAddressSpaces asserting on a load of a vector of flat pointers. Fixes #110433	2024-09-30 10:16:38 +04:00
Scott Egerton	396f677514	[AMDGPU] Remove unused VGPRSingleUseHintInsts feature (#109769 )	2024-09-24 10:58:00 +01:00
Akshat Oke	0b0874755d	[AMDGPU][NewPM] Port SILowerSGPRSpills to NPM (#108934 )	2024-09-21 09:59:36 +05:30
Akshat Oke	d2d78e584b	[NewPM][CodeGen] Port MachineLICM to NPM (#107376 )	2024-09-20 11:34:18 +05:30
Jay Foad	e03f427196	[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133 ) It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.	2024-09-19 16:16:38 +01:00
Diana Picus	3356208531	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512 ) This reverts commit `7792b4ae79`. The problem was a conflict with `e55d6f5ea2` "[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (https://github.com/llvm/llvm-project/pull/107889)" which changed the syntax of V_SET_INACTIVE (and thus made my MIR test crash). ...if only we had a merge queue.	2024-09-13 11:54:30 +02:00
Diana Picus	7792b4ae79	Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )"" (#108341 ) Reverts llvm/llvm-project#108173 si-init-whole-wave.mir crashes on some buildbots (although it passed both locally with sanitizers enabled and in pre-merge tests). Investigating.	2024-09-12 10:12:09 +02:00
Diana Picus	703ebca869	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )" (#108173 ) This reverts commit `c7a7767fca`. The buildbots failed because I removed a MI from its parent before updating LIS. This PR should fix that.	2024-09-12 09:11:41 +02:00
Akshat Oke	e1ee07d0ff	[AMDGPU][NewPM] Port SIPeepholeSDWA pass to NPM (#107049 )	2024-09-11 14:30:16 +04:00
Vitaly Buka	c7a7767fca	Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 ) Breaks bots, see #105822. Reverts llvm/llvm-project#105822	2024-09-10 09:51:43 -07:00
Diana Picus	44556e64f2	[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822 ) This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail. As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions):
```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```
It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch). The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now is expanded in si-wqm (which is where SI_INIT_EXEC is handled too); however the information that the function was conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog epilog insertion, where we can skip saving the inactive lanes for CSRs (since if the function started with all the lanes active, then there are no inactive lanes to preserve).	2024-09-10 13:24:53 +02:00


607 Commits