llvm-project

Author	SHA1	Message	Date
Jeffrey Byrnes	5da7179cb3	[AMDGPU] Reland: Add IR LiveReg type-based optimization	2024-07-03 09:26:19 -07:00
Vitaly Buka	3e53c97d33	Revert "[AMDGPU] Add IR LiveReg type-based optimization" (#97138 ) Part of #66838. https://lab.llvm.org/buildbot/#/builders/52/builds/404 https://lab.llvm.org/buildbot/#/builders/55/builds/358 https://lab.llvm.org/buildbot/#/builders/164/builds/518 This reverts commit ded956440739ae326a99cbaef18ce4362e972679.	2024-06-28 23:18:26 -07:00
Jeffrey Byrnes	ded9564407	[AMDGPU] Add IR LiveReg type-based optimization Change-Id: Ia0d11b79b8302e79247fe193ccabc0dad2d359a0	2024-06-28 15:01:39 -07:00
Nikita Popov	5cd0ba30f5	Reapply [IR] Lazily initialize the class to pass name mapping (NFC) (#96321 ) (#96462 ) On MSVC the `this` uses inside `decltype` require a lambda capture. On clang they result in an unused capture warning instead. Add the capture and suppress the warning with `(void)this`. ----- Initializing this map is somewhat expensive (especially for O0), so we currently only do it if certain flags are used. I would like to make use of it for crash dumps (#96078), where we don't know in advance whether it will be needed or not. This patch changes the initialization to a lazy approach, where a callback is registered that does the actual initialization. The callbacks will be run the first time the pass name is requested. This way there is no compile-time impact if the mapping is not used.	2024-06-24 15:00:11 +02:00
Nikita Popov	e5a41f0afc	Revert "[IR] Lazily initialize the class to pass name mapping (NFC) (#96321 )" My attempt to fix the Windows build made things worse, revert entirely for now. This reverts commit e7137f2fed5cfee822ae3c4c6d39188adb59a16c. This reverts commit 6eaf204dbb0a6a81cddfd02f625c130f7bb1aae5. This reverts commit 957dc4366dd2ce9d5d2991c3ad76bbf438e9954e.	2024-06-24 10:32:03 +02:00
Nikita Popov	957dc4366d	[IR] Lazily initialize the class to pass name mapping (NFC) (#96321 ) Initializing this map is somewhat expensive (especially for O0), so we currently only do it if certain flags are used. I would like to make use of it for crash dumps (#96078), where we don't know in advance whether it will be needed or not. This patch changes the initialization to a lazy approach, where a callback is registered that does the actual initialization. The callbacks will be run the first time the pass name is requested. This way there is no compile-time impact if the mapping is not used.	2024-06-24 09:40:09 +02:00
vg0204	c2fc7f75f6	Revert "[AMDGPU]Optimize SGPR spills (#93668 )" This reverts commit 4b9112e88a998ce620e4683548f2afd17cc5fe95. A separate issue (#96353) describing it has been opened to keep track of it.	2024-06-24 12:36:36 +05:30
Vikash Gupta	4b9112e88a	[AMDGPU]Optimize SGPR spills (#93668 ) This PR is dependent on [#93779](https://github.com/llvm/llvm-project/pull/93779). Currently, each SGPR spill is lowered into a distinct stack slot in the stack frame after the SGPR allocation phase. This patch therefore uses StackSlotColoring to share stack slots where possible, for stack frame indices whose SGPR spills occur in non-interfering regions. StackSlotColoring is introduced immediately after SGPR register allocation, to ensure that any further lowering happens on the optimally allocated stack slots, with certain flags to indicate the preservation of certain analysis results later to be used by RA of other register classes.	2024-06-18 20:55:31 +05:30
Pierre van Houtryve	d95b82c49a	[NFC][AMDGPU] Make AMDGPUSplitModule a ModulePass (#95773 ) It allows it to access TTI correctly, and opens the door to accessing more analysis in the future. I went back and forth between this, and also making the default SplitModule a Pass too to make it uniform, but I decided against it because it's just needless complications. Neither llvm-split or LTOBackend have a PM ready to use so we need to create one anyway. Let's keep all the mess hidden in the AMDGPU version for now to keep this change more self-contained.	2024-06-18 09:16:32 +02:00
paperchalice	1bc8b3258e	[NewPM][CodeGen] Port `regallocfast` to new pass manager (#94426 ) This pull request ports `regallocfast` to the new pass manager. It exposes the parameter `filter` to handle different register classes for AMDGPU. IIUC AMDGPU needs to allocate different register classes separately, so it needs to implement its own `--<reg-class>-regalloc`. Now users can use e.g. `-passes=regallocfast<filter=sgpr>` to allocate a specific register class. The command line option `--regalloc-npm` is still a work in progress; the plan is to reuse the syntax of passes, e.g. use `--regalloc-npm=regallocfast<filter=sgpr>,greedy<filter=vgpr>` to replace `--sgpr-regalloc` and `--vgpr-regalloc`.	2024-06-07 12:22:42 +08:00
Jon Chesterfield	8516f54e6a	[AMDGPU] Implement variadic functions by IR lowering (#93362 ) This is a mostly-target-independent variadic function optimisation and lowering pass. It is only enabled for AMDGPU in this initial commit. The purpose is to make C style variadic functions a zero cost abstraction. They are lowered to equivalent IR which is then amenable to other optimisations. This is inherently slightly target specific but much less so than one might expect - the C varargs interface heavily constrains the ABI design divergence. The pass is primarily tested from webassembly. This is because wasm has a straightforward variadic lowering strategy which coincides exactly with what this pass transforms code into and a struct passing convention with few cases to check. Adding further targets conventions is straightforward and elided from this patch primarily to simplify the review. Implemented in other branches are Linux X86, AMD64, AArch64 and NVPTX. Testing for targets that have existing lowering for va_arg from clang is most efficiently done by checking that clang | opt completely elides the variadic syntax from test cases. The lowering produces a struct for each call site which can be inspected to check the various alignment and indirections are correct. AMDGPU presently has no variadic support other than some ad hoc printf handling. Combined with the pass being inactive on all other targets landing this represents strict increase in capability with zero risk. Testing and refining will continue post commit. In addition to the compiler tests included here, a self contained x64 clang/musl toolchain was constructed using the "lowering" instead of the systemv ABI and used to build various C programs like lua and libxml2.	2024-06-06 10:44:53 +01:00
paperchalice	7652a59407	Reland "[NewPM][CodeGen] Port selection dag isel to new pass manager" (#94149 ) - Fix build with `EXPENSIVE_CHECKS` - Remove unused `PassName::ID` to resolve warning - Mark `~SelectionDAGISel` virtual so AArch64 backend can work properly	2024-06-04 08:10:58 +08:00
paperchalice	8917afaf0e	Revert "[NewPM][CodeGen] Port selection dag isel to new pass manager" (#94146 ) This reverts commit de37c06f01772e02465ccc9f538894c76d89a7a1 to de37c06f01772e02465ccc9f538894c76d89a7a1 It still breaks EXPENSIVE_CHECKS build. Sorry.	2024-06-02 14:31:52 +08:00
paperchalice	d2cdc8ab45	[NewPM][CodeGen] Port selection dag isel to new pass manager (#83567 ) Port selection dag isel to new pass manager. Only `AMDGPU` and `X86` support new pass version. `-verify-machineinstrs` in new pass manager belongs to verify instrumentation, it is enabled by default.	2024-06-02 09:12:33 +08:00
Pierre van Houtryve	43fd244b3d	Reland "[AMDGPU] Add AMDGPU-specific module splitting (#89245 )" (with fix for ubsan) This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issues (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking information about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.	2024-05-27 10:43:00 +02:00
Vitaly Buka	5a48223d1c	Revert "[AMDGPU] Add AMDGPU-specific module splitting" (#93275 ) Fails on https://lab.llvm.org/buildbot/#/builders/85/builds/24181 and https://lab.llvm.org/buildbot/#/builders/5/builds/43589 Reverts llvm/llvm-project#89245	2024-05-23 23:48:39 -07:00
Pierre van Houtryve	d7c3713000	[AMDGPU] Add AMDGPU-specific module splitting (#89245 ) This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issues (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking information about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.	2024-05-23 12:26:24 +02:00
paperchalice	b4ba3fe006	[NewPM][AMDGPU] Add CodeGenPassBuilder (#91040 ) In order to test SelectionDAG for target AMDGPU, we need CodeGenPassBuilder.	2024-05-19 15:57:02 +08:00
Matt Arsenault	bc3620d3a8	AMDGPU: Move libcall simplify into PeepholeEP (#88853 ) We were running this immediately on the incoming IR, which is still littered with temporary allocas obscuring trivial values. This needs to run after initial SROA to handle sincos insertion.	2024-04-17 08:50:14 +02:00
paperchalice	a2dfc9ac7d	[NewPM][AMDGPU] Add AMDGPUPassRegistry.def (#86095 ) Move the pass registry to a separate file, prepare for porting dag-isel.	2024-03-22 08:49:29 +08:00
Pierre van Houtryve	95a834a16c	(Reland) [AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#85626 ) Reland of #75333	2024-03-21 11:44:47 +01:00
pvanhout	3493438605	Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333 )" This reverts commit 9b98692eedb78aa106539c36ba02944f32cae1ff.	2024-03-18 11:18:57 +01:00
Pierre van Houtryve	9b98692eed	[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333 ) This change allows us to use `--lto-partitions` in some cases (not at all guaranteed it works perfectly), as LDS is lowered before the module is split for parallel codegen. We must run LowerLDS before splitting modules as it needs to see all callers of functions with LDS to properly lower them.	2024-03-18 09:09:43 +01:00
Krzysztof Drewniak	6540f1635a	[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952 ) This commit adds the -lower-buffer-fat-pointers pass, which is applicable to all AMDGCN compilations. The purpose of this pass is to remove the type `ptr addrspace(7)` from incoming IR. This must be done at the LLVM IR level because `ptr addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled by SelectionDAG. The detailed operation of the pass is described in comments, but, in summary, the removal proceeds by: 1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of i160 (including vectors and aggregates). This is needed because the in-register representation of these pointers will stop matching their in-memory representation in step 2, and so ptrtoint/inttoptr operations are used to preserve the expected memory layout 2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two parts of a buffer fat pointer (the 128-bit address space 8 resource and the 32-bit address space 6 offset) visible in the IR. This also impacts the argument and return types of functions. 3. Splitting the resource and offset parts. All instructions that produce or consume buffer fat pointers (like GEP or load) are rewritten to produce or consume the resource and offset parts separately. For example, GEP updates the offset part of the result and a load uses the resource and offset parts to populate the relevant llvm.amdgcn.raw.ptr.buffer.load intrinsic call. At the end of this process, the original mutated instructions are replaced by their new split counterparts, ensuring no invalidly-typed IR escapes this pass. (For operations like call, where the struct form is needed, insertelement operations are inserted). Compared to LGC's PatchBufferOp ( `32cda89776/lgc/patch/PatchBufferOp.cpp` ): this pass - Also handles vectors of ptr addrspace(7)s - Also handles function boundaries - Includes the same uniform buffer optimization for loops and conditionals - Does not handle memcpy() and friends (this is future work) - Does not break up large loads and stores into smaller parts. This should be handled by extending the legalization of .buffer.{load,store} to handle larger types by producing multiple instructions (the same way ordinary LOAD and STORE are legalized). That work is planned for a followup commit. - Does not have special logic for handling divergent buffer descriptors. The logic in LGC is, as far as I can tell, incorrect in general, and, per discussions with @nhaehnle, isn't widely used. Therefore, divergent descriptors are handled with waterfall loops later in legalization. As a final matter, this commit updates atomic expansion to treat buffer operations analogously to global ones. (One question for reviewers: is the new pass in the right place? Should it be later in the pipeline?) Differential Revision: https://reviews.llvm.org/D158463	2024-03-06 09:49:58 -06:00
Sameer Sahasrabuddhe	60822637bf	Restore "Implement convergence control in MIR using SelectionDAG (#71785 )" This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9. Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306. LLVM function calls carry convergence control tokens as operand bundles, where the tokens themselves are produced by convergence control intrinsics. This patch implements convergence control tokens in MIR as follows: 1. Introduce target-independent ISD opcodes and MIR opcodes for convergence control intrinsics. 2. Model token values as untyped virtual registers in MIR. The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a corresponding machine opcode with the same spelling. This glues the convergence control token to SDNodes that represent calls to intrinsics. The glued token is later translated to an implicit argument in the MIR. The lowering of calls to user-defined functions is target-specific. On AMDGPU, the convergence control operand bundle at a non-intrinsic call is translated to an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment converts this explicit argument to an implicit argument on the SI_CALL instruction.	2024-03-06 12:19:32 +05:30
Mitch Phillips	f010b1bef4	Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785 )"" This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9. Reason: Broke the sanitizer buildbots. See the comments at https://github.com/llvm/llvm-project/pull/71785 for more information.	2024-03-04 17:05:34 +01:00
Sameer Sahasrabuddhe	c7fdd8c11e	Restore "Implement convergence control in MIR using SelectionDAG (#71785 )" Original commit 79889734b940356ab3381423c93ae06f22e772c9. Previously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e. LLVM function calls carry convergence control tokens as operand bundles, where the tokens themselves are produced by convergence control intrinsics. This patch implements convergence control tokens in MIR as follows: 1. Introduce target-independent ISD opcodes and MIR opcodes for convergence control intrinsics. 2. Model token values as untyped virtual registers in MIR. The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a corresponding machine opcode with the same spelling. This glues the convergence control token to SDNodes that represent calls to intrinsics. The glued token is later translated to an implicit argument in the MIR. The lowering of calls to user-defined functions is target-specific. On AMDGPU, the convergence control operand bundle at a non-intrinsic call is translated to an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment converts this explicit argument to an implicit argument on the SI_CALL instruction.	2024-03-04 13:28:04 +05:30
Jeffrey Byrnes	cf1c97b2d2	[AMDGPU] Do not attempt to fallback to default mutations (#83208 ) IGLP itself will be in SavedMutations via mutations added during Scheduler creation, thus falling back results in reapplying IGLP. In PostRA scheduling, if we have multiple regions with IGLP instructions, then we may have infinite loop. Disable the feature for now.	2024-02-27 18:04:59 -08:00
Rishabh Bali	fe42e72db2	[CodeGen] Port AtomicExpand to new Pass Manager (#71220 ) Port the `atomicexpand` pass to the new Pass Manager. Fixes #64559	2024-02-25 18:42:22 +05:30
Jeffrey Byrnes	8f2bd8ae68	[AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (#81342 ) This implements the basic pipelining structure of exp/mfma interleaving for better extensibility. While it does have improved extensibility, there are controls which only enable it for DAGs with certain characteristics (matching the DAGs it has been designed against).	2024-02-23 17:13:20 -08:00
Sameer Sahasrabuddhe	a2afcd5721	Revert "Implement convergence control in MIR using SelectionDAG (#71785 )" This reverts commit 79889734b940356ab3381423c93ae06f22e772c9. Encountered multiple buildbot failures.	2024-02-21 11:07:02 +05:30
Sameer Sahasrabuddhe	79889734b9	Implement convergence control in MIR using SelectionDAG (#71785 ) LLVM function calls carry convergence control tokens as operand bundles, where the tokens themselves are produced by convergence control intrinsics. This patch implements convergence control tokens in MIR as follows: 1. Introduce target-independent ISD opcodes and MIR opcodes for convergence control intrinsics. 2. Model token values as untyped virtual registers in MIR. The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a corresponding machine opcode with the same spelling. This glues the convergence control token to SDNodes that represent calls to intrinsics. The glued token is later translated to an implicit argument in the MIR. The lowering of calls to user-defined functions is target-specific. On AMDGPU, the convergence control operand bundle at a non-intrinsic call is translated to an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment converts this explicit argument to an implicit argument on the SI_CALL instruction.	2024-02-21 10:06:37 +05:30
Matin Raayai	87e04b471e	Fix Passing TargetOptions by Value in TargetMachines for AMDGPU (#79866 ) `TargetOptions` is currently passed by value in AMDGPU targets, which makes unnecessary copies. This PR fixes this issue.	2024-02-01 09:50:44 -08:00
Mirko Brkušanin	1d286ad59b	[AMDGPU] Add mark last scratch load pass (#75512 )	2024-01-18 09:36:44 +01:00
paperchalice	ffb1f20e0d	[CodeGen] Add flag to populate target pass names (#76328 ) `print-pipeline-passes` can show target pass names.	2024-01-03 09:07:02 +08:00
Mariusz Sikora	a018c8cdbb	GFX12: Add LoopDataPrefetchPass (#75625 ) It is currently disabled by default. It will need experiments on a real HW to tune and decide on the profitability. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-19 08:32:16 +01:00
Jessica Del	32f9983c06	[AMDGPU] - Add address space for strided buffers (#74471 ) This is an experimental address space for strided buffers. These buffers can have structs as elements and a stride > 1. These pointers allow the indexed access in units of stride, i.e., they point at `buffer[index * stride]`. Thus, we can use the `idxen` modifier for buffer loads. We assign address space 9 to 192-bit buffer pointers which contain a 128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially, they are fat buffer pointers with an additional 32-bit index.	2023-12-15 15:49:25 +01:00
Valery Pykhtin	dd051295bc	[AMDGPU] Enable GCNRewritePartialRegUses pass by default. (#72975 ) Let's try once again after #69957 has landed.	2023-12-14 14:10:27 +01:00
Petar Avramovic	6892c175c5	AMDGPU/GlobalISel: add AMDGPUGlobalISelDivergenceLowering pass (#75340 ) Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will implement - selection of divergent i1 phis as lane mask phis, requires lane mask merging in some cases - lower uses of divergent i1 values outside of the cycle using lane mask merging - lowering of all cases of temporal divergence: - lower uses of uniform i1 values outside of the cycle using lane mask merging - lower uses of uniform non-i1 values outside of the cycle using a copy to vgpr inside of the cycle Add very detailed set of regression tests for cases mentioned above. patch 1 from: https://github.com/llvm/llvm-project/pull/73337	2023-12-13 16:42:56 +01:00
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00
Kazu Hirata	586ecdf205	[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956 ) This patch replaces uses of StringRef::{starts,ends}with with StringRef::{starts,ends}_with for consistency with std::{string,string_view}::{starts,ends}_with in C++20. I'm planning to deprecate and eventually remove StringRef::{starts,ends}with.	2023-12-11 21:01:36 -08:00
Jay Foad	28233b11ac	[AMDGPU] New AMDGPUInsertSingleUseVDST pass (#72388 ) Add support for emitting GFX11.5 s_singleuse_vdst instructions. This is a power saving feature whereby the compiler can annotate VALU instructions whose results are known to have only a single use, so the hardware can in some cases avoid writing the result back to VGPR RAM. To begin with the pass is disabled by default because of one missing feature: we need an exclusion list of opcodes that never qualify as single-use producers and/or consumers. A future patch will implement this and enable the pass by default. --------- Co-authored-by: Scott Egerton <scott.egerton@amd.com>	2023-11-24 10:23:06 +00:00
Carl Ritson	af6ff98c53	[AMDGPU] Move WWM register pre-allocation to during regalloc (#70618 ) Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This saves recomputation of the virtual matrix and live reg map, with the slight regression in O0 that live intervals and slot indexes must be computed.	2023-11-08 11:54:28 +09:00
Matt Arsenault	d34a10a47d	AMDGPU: Port AMDGPUAttributor to new pass manager (#71349 )	2023-11-07 15:40:40 +09:00
Valery Pykhtin	e808f8a616	[AMDGPU] GCNRegPressurePrinter pass to print GCNRegPressure values for testing. (#70031 ) Using GCNDownwardRPTracker or GCNUpwardRPTracker the pass collects register pressure values for a function and prints these values next to instructions. Output can be used to generate Filecheck rules in mir tests.	2023-11-01 23:01:39 +01:00
Alex Voicu	0ce6255a50	[HIP][LLVM][Opt] Add LLVM support for `hipstdpar` This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional: 1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform: - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail); - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call-chain) and record them; - After having done the above for all roots in the Module, we have computed the set of reachable functions, which is the union of roots and functions reachable from roots; - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module; 2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform: - Iterate all functions in a Module; - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available; - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time; - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter. Reviewed by: JonChesterfield, yaxunl Differential Revision: https://reviews.llvm.org/D155856	2023-10-12 11:26:48 +01:00
Alex Voicu	25935c384d	Revert "[HIP][LLVM][Opt] Add LLVM support for `hipstdpar`" This reverts commit c5bba7ea5a05f540948f76a189c880eb24a5e8c6.	2023-10-11 12:27:03 +01:00
Alex Voicu	c5bba7ea5a	[HIP][LLVM][Opt] Add LLVM support for `hipstdpar` This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional: 1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform: - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail); - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call-chain) and record them; - After having done the above for all roots in the Module, we have computed the set of reachable functions, which is the union of roots and functions reachable from roots; - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module; 2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform: - Iterate all functions in a Module; - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available; - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time; - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter. Reviewed by: JonChesterfield, yaxunl Differential Revision: https://reviews.llvm.org/D155856	2023-10-11 12:22:00 +01:00
Alex Voicu	98eda5dda7	Revert "[HIP][LLVM][Opt] Add LLVM support for `hipstdpar`" in order to address build breakage. This reverts commit 9b98ebb0eb43b005921926a622177f10e13b1ac6.	2023-10-10 12:16:10 +01:00
Alex Voicu	9b98ebb0eb	[HIP][LLVM][Opt] Add LLVM support for `hipstdpar` This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional: 1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform: - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail); - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call-chain) and record them; - After having done the above for all roots in the Module, we have computed the set of reachable functions, which is the union of roots and functions reachable from roots; - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module; 2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform: - Iterate all functions in a Module; - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available; - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time; - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter. Reviewed by: JonChesterfield, yaxunl Differential Revision: https://reviews.llvm.org/D155856	2023-10-10 12:02:05 +01:00

520 Commits