llvm-project

Author	SHA1	Message	Date
Shilei Tian	6e4105574e	[NFC][AMDGPU] Improve code introduced in #124607 (#124672 )	2025-01-27 22:57:16 -05:00
Shilei Tian	3b2b7ec07d	[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607 ) Fixes SWDEV-509327.	2025-01-27 17:30:50 -05:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Matt Arsenault	efe87fbc9d	AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144 )	2024-11-06 08:47:15 -08:00
Matt Arsenault	6d9fc1b846	AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113 ) This did not consider the IR change to allow a scalar base with a vector offset part. Reject any users that are not explicitly handled. In this situation we could handle the vector GEP, but that is a larger change. This just avoids the IR verifier error by rejecting it.	2024-10-29 22:14:24 -07:00
Jay Foad	85c17e4092	[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706 ) Convert many instances of: Fn = Intrinsic::getOrInsertDeclaration(...); CreateCall(Fn, ...) to the equivalent CreateIntrinsic call.	2024-10-17 16:20:43 +01:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation for adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e., just lookup, no creation).	2024-10-11 05:26:03 -07:00
Jeremy Morse	96f37ae453	[NFC] Use initial-stack-allocations for more data structures (#110544 ) This replaces some of the most frequent offenders of using a DenseMap that cause a malloc, where the typical element-count is small enough to fit in an initial stack allocation. Most of these are fairly obvious, one to highlight is the collectOffset method of GEP instructions: if there's a GEP, of course it's going to have at least one offset, but every time we've called collectOffset we end up calling malloc as well for the DenseMap in the MapVector.	2024-09-30 23:15:18 +01:00
Jay Foad	e03f427196	[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133 ) It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.	2024-09-19 16:16:38 +01:00
Matt Arsenault	df138625df	AMDGPU: Remove unnecessary pointer bitcast	2024-09-06 21:32:19 +04:00
Matt Arsenault	21cea3f3be	AMDGPU: Stop promoting allocas with addrspacecast users (#104051 ) We cannot promote this case unless we know the value is only observed through flat operations. We cannot analyze this through a call. PointerMayBeCaptured was an imprecise check for this. A callee with a nocapture attribute may still cast to private and observe the address space, so really we need a different notion of nocapture. I doubt this was of any use anyway. The promotable cases should have optimized out addrspacecast to begin earlier. Fixes #66669 Fixes #104035	2024-08-14 21:53:38 +04:00
Pierre van Houtryve	0e73bbd345	[AMDGPU][PromoteAlloca] Don't stop when an alloca is too big to promote (#93466 ) When I rewrote this, I made a mistake in the control flow. I thought we could just stop promoting if an alloca is too big to vectorize, but we can't. Other allocas in the list may be promotable and fit within the budget. Fixes SWDEV-455343	2024-05-28 08:05:50 +02:00
Shilei Tian	b4df0da9e8	[AMDGPU] Fix a potential wrong return value indicating whether a pass modifies a function (#88197 ) When the alloca is too big for vectorization, the function could have already been modified in previous iteration of the `for` loop.	2024-04-12 09:34:46 -04:00
Pierre van Houtryve	953c13b5c9	[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735 ) Update PromoteAllocaToVector so it considers the whole function before promoting allocas. Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca. Passed internal performance testing.	2024-03-19 11:49:22 +01:00
Pierre van Houtryve	5e379b63fc	[AMDGPU][PromoteAlloca] Drop bitcast handling (#85747 ) This is no longer needed with opaque pointers.	2024-03-19 10:36:12 +01:00
bcahoon	4cf8b298cf	[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597 ) The promote alloca to vector transformation assumes that the vector index is a constant value. If it is not a constant, then either an assert occurs or the transformation generates an incorrect index.	2024-03-05 08:18:17 -06:00
Pierre van Houtryve	4e958abf2f	[AMDGPU][PromoteAlloca] Support memsets to ptr allocas (#80678 ) Fixes #80366	2024-02-05 14:36:15 +01:00
Mariusz Sikora	9a41a80e76	[AMDGPU] Handle object size and bail if assume-like intrinsic is used in PromoteAllocaToVector (#68744 ) Attached test will cause crash without this change. We should not remove isAssumeLikeIntrinsic instruction if it is used by other instruction.	2023-12-20 07:47:49 +01:00
Mariusz Sikora	facead618b	[AMDGPU] PromoteAlloca - bail always if load/store is volatile (#73228 ) This change is addressing case where alloca size is the same as load/store size.	2023-11-28 12:01:35 +01:00
bcahoon	28b5054751	[AMDGPU] Fix PromoteAlloca size check of alloca for store (#72528 ) When storing a subvector, too many elements were written when the size of the alloca is smaller than the size of the vector store. This patch checks for the minimum of the alloca vector and the store vector to determine the number of elements to store.	2023-11-20 07:57:48 -06:00
Pierre van Houtryve	5db63d29fd	[AMDGPU] PromoteAlloca: Handle load/store subvectors using non-constant indexes (#71505 ) I assumed indexes were always ConstantInts, but that's not always the case. They can be other things as well. We can easily handle that by just emitting an add and let InstSimplify do the constant folding for cases where it's really a ConstantInt. Solves SWDEV-429935	2023-11-07 15:29:41 +01:00
Matt Arsenault	f7dcabe502	AMDGPU: Pass in TargetMachine to AMDGPULowerModuleLDSPass https://reviews.llvm.org/D157660	2023-09-02 12:02:36 -04:00
pvanhout	a8aabba587	[AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elements The previous condition was incorrect in some cases, like storing <2 x i32> into a double. If IndexVal was >0, we ended up never storing anything. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D156308	2023-07-26 13:21:21 +02:00
pvanhout	3cd4afce5b	[AMDGPU] Allow vector access types in PromoteAllocaToVector Depends on D152706 Solves SWDEV-408279 Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D155699	2023-07-25 07:44:48 +02:00
pvanhout	3890a3b113	[AMDGPU] Use SSAUpdater in PromoteAlloca This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly. Note PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D152706	2023-07-25 07:44:47 +02:00
Nikita Popov	7be7f23269	[llvm] Remove uses of getWithSamePointeeType() (NFC)	2023-07-18 12:07:09 +02:00
Youngsuk Kim	243f0566dc	[llvm] Replace uses of Type::getPointerTo (NFC) Partial progress towards removing in-tree uses of `Type::getPointerTo`, before we can deprecate the API. If the API is used solely to support an unnecessary bitcast, get rid of the bitcast as well. Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D153933	2023-06-28 09:21:34 -04:00
pvanhout	7007b99340	Revert "[AMDGPU] Use SSAUpdater in PromoteAlloca" This reverts commit 091bfa76db64fbe96d0e53d99b2068cc05f6aa16.	2023-06-28 11:14:17 +02:00
pvanhout	091bfa76db	[AMDGPU] Use SSAUpdater in PromoteAlloca This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly. Note PromoteAlloca is still reliant on SROA running first to canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass. Reviewed By: #amdgpu, arsenm Differential Revision: https://reviews.llvm.org/D152706	2023-06-28 08:12:22 +02:00
pvanhout	f104eb6e15	[AMDGPU] Reintroduce CC exception for non-inlined functions in Promote Alloca limits This is basically a partial revert of https://reviews.llvm.org/D145586 ( fd1d60873fdc ) D145586 was originally introduced to help with SWDEV-363662, and it did, but it also caused a 25% drop in performance in some MIOpen benchmarks where, it seems, functions are inlined more conservatively. This patch restores the pre-D145586 behavior for PromoteAlloca: functions with a non-entry CC have a 32 VGPRs threshold, but only if the function is not marked with "alwaysinline". A good amount of AMDGPU code makes use of the AMDGPUAlwaysInline pass anyway, so in our backend "alwaysinline" seems very common. This change does not affect SWDEV-363662 (the motivating issue for introducing D145586). Fixes SWDEV-399519 Reviewed By: rampitec, #amdgpu Differential Revision: https://reviews.llvm.org/D150551	2023-05-23 09:01:39 +02:00
pvanhout	047bf17eab	[AMDGPU] Add more verbose logs to PromoteAlloca More specifically make it more talkative when it's looking at the users of an alloca to promote it to a vector. A common failure point of the pass is unknown or weird users of the alloca. While debugging issues related to this pass one of the first things I usually did was to add logs to see how the users were being handled. Having such logs in directly seems to be a nice addition. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D148629	2023-04-19 11:30:14 +02:00
pvanhout	83ae2d3618	[AMDGPU] Refactor PromoteAlloca implementation We're getting a lot of mileage out of PromoteAlloca, and the pass had grown somewhat organically over the years. This patch attempts to clean up the implementation and restructure it. For instance, the exact same code path is now used for both promote alloca to LDS and promote alloca to vector - just with different parameters. This removes some redundancy here and there. I also reordered functions in a way that hopefully makes more sense (e.g. all of the pass API is in the same place) No functionality change is intended in the patch, but some checks were moved around so I'm not using the NFC tag. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D148526	2023-04-18 14:23:58 +02:00
pvanhout	fd1d60873f	[AMDGPU] Remove CC exception for Promote Alloca Limits Apparently it was used to work around some issue that has been fixed. Removing it helps with high scratch usage observed in some cases due to failed alloca promotion. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D145586	2023-04-13 08:48:34 +02:00
pvanhout	d7b4b76956	[AMDGPU] Handle memset users in PromoteAlloca Allows allocas with memset users to be promoted. This is intended to prevent patterns such as `memset(&alloca, 0, sizeof(alloca))` (which I think can be emitted by frontends) from preventing a vectorization of allocas. Fixes SWDEV-388784 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D146225	2023-03-28 15:01:55 +02:00
Nicolai Hähnle	10cef708a7	AMDGPU: Clean up LDS-related occupancy calculations Occupancy is expressed as waves per SIMD. This means that we need to take into account the number of SIMDs per "CU" or, to be more precise, the number of SIMDs over which a workgroup may be distributed. getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs into account at all. At the same time, we need to take into account that WGP mode offers access to a larger total amount of LDS, since this can affect how non-power-of-two LDS allocations are rounded. To make this work consistently, we distinguish between (available) local memory size and addressable local memory size (which is always limited by 64kB on gfx10+, even with WGP mode). This change results in a massive amount of test churn. A lot of it is caused by the fact that the default work group size is 1024, which means that (due to rounding effects) the default occupancy on older hardware is 8 instead of 10, which affects scheduling via register pressure estimates. I've adjusted most tests by just running the UTC tools, but in some cases I manually changed the work group size to 32 or 64 to make sure that work group size chunkiness has no effect. Differential Revision: https://reviews.llvm.org/D139468	2023-01-23 21:43:06 +01:00
Guillaume Chatelet	135f23d67b	Deprecate MemIntrinsicBase::getDestAlignment() and MemTransferBase::getSourceAlignment() Differential Revision: https://reviews.llvm.org/D141840	2023-01-16 14:22:03 +00:00
Ruiling Song	5d0ff923c3	AMDGPU: Promote array alloca if used by memmove/memcpy Reviewed by: arsenm Differential Revision: https://reviews.llvm.org/D140599	2023-01-11 09:59:35 +08:00
Matt Arsenault	687e0e205e	AMDGPU: Create alloca wide load/store with explicit alignment This was introducing transient UB by using the default alignment of a larger vector type.	2023-01-03 11:29:18 -05:00
Matt Arsenault	49caf70121	AMDGPU: Use cast instead of unchecked dyn_cast	2023-01-03 10:32:10 -05:00
Jay Foad	6443c0ee02	[AMDGPU] Stop using make_pair and make_tuple. NFC. C++17 allows us to call constructors pair and tuple instead of helper functions make_pair and make_tuple. Differential Revision: https://reviews.llvm.org/D139828	2022-12-14 13:22:26 +00:00
Kazu Hirata	20cde15415	[Target] Use std::nullopt instead of None (NFC) This patch mechanically replaces None with std::nullopt where the compiler would warn if None were deprecated. The intent is to reduce the amount of manual work required in migrating from Optional to std::optional. This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716	2022-12-02 20:36:06 -08:00
Matt Arsenault	3830e4e58c	AMDGPU: Create poison values instead of undef These placeholders don't care about the finer points on the difference between the two.	2022-11-16 14:47:24 -08:00
Kazu Hirata	e0039b8d6a	Use llvm::less_second (NFC)	2022-06-04 22:48:32 -07:00
Nikita Popov	3ed643ea76	[AMDGPUPromoteAlloca] Make compatible with opaque pointers This mainly changes the handling of bitcasts to not check the types being casted from/to -- we should only care about the actual load/store types. The GEP handling is also changed to not care about types, and just make sure that we get an offset corresponding to a vector element. This was a bit of a struggle for me, because this code seems to be pretty sensitive to small changes. The end result seems to produce strictly better results for the existing test coverage though, because we can now deal with more situations involving bitcasts. Differential Revision: https://reviews.llvm.org/D121371	2022-03-11 09:20:51 +01:00
Sebastian Neubauer	6527b2a4d5	[AMDGPU][NFC] Fix typos Fix some typos in the amdgpu backend. Differential Revision: https://reviews.llvm.org/D119235	2022-02-18 15:05:21 +01:00
serge-sans-paille	e188aae406	Cleanup header dependencies in LLVMCore Based on the output of include-what-you-use. This is a big chunk of changes. It is very likely to break downstream code unless they took a lot of care in avoiding hidden header dependencies, something the LLVM codebase doesn't do that well :-/ I've tried to summarize the biggest change below: - llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h - llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h - llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h - llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h - llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h - llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h - llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h And the usual count of preprocessed lines: $ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l before: 6400831 after: 6189948 200k lines less to process is not that bad ;-) Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D118652	2022-02-02 06:54:20 +01:00
Yaxun (Sam) Liu	15f54dd5e4	AMDGPU: Account for usage of HIP-style dynamic LDS Disable promote alloca to LDS when HIP-style dynamic LDS is used, since the size is unknown at compile time. Patch by: Siu Chi Chan Reviewed by: Matt Arsenault, Yaxun Liu Differential Revision: https://reviews.llvm.org/D117494	2022-01-19 13:05:29 -05:00
Arthur Eubanks	1172712f46	[NFC] Replace some deprecated getAlignment() calls with getAlign() Reviewed By: gchatelet Differential Revision: https://reviews.llvm.org/D115370	2021-12-09 08:43:19 -08:00
Neubauer, Sebastian	d1f45ed58f	[AMDGPU][NFC] Fix typos Differential Revision: https://reviews.llvm.org/D113672	2021-11-12 11:37:21 +01:00
Stanislav Mekhanoshin	cf74ef134c	[AMDGPU] Limit promote alloca max size in functions Non-entry functions have 32 caller saved VGPRs available. If we promote alloca to consume more registers we will have to spill CSRs. There is no reason to eliminate scratch access to get another scratch access instead. Differential Revision: https://reviews.llvm.org/D110372	2021-09-24 13:38:39 -07:00
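
The arithmetic in the message of 6206f5444f (#123748) can be checked with a small brute-force sketch. This is not the constant-time computation that commit describes, and it is not LLVM code. The function names and parameters below are hypothetical. The sketch assumes no LDS usage, the example target from the message (10 waves per EU, 4 EUs per CU, 64-wide waves), and ceiling rounding when converting waves per CU to waves per EU. That rounding choice is an assumption; it does not change the result for this example.

```python
def occupancy_for_group_size(group_size, wave_size=64, eus_per_cu=4,
                             max_waves_per_eu=10):
    """Waves per EU achieved when a CU is filled with equal-sized groups.

    Hypothetical helper: ignores LDS and register limits entirely.
    """
    # Waves needed to hold one workgroup (integer ceiling division).
    waves_per_group = (group_size + wave_size - 1) // wave_size
    # Total wave slots on the CU, e.g. 4 EUs * 10 waves = 40.
    wave_slots_per_cu = eus_per_cu * max_waves_per_eu
    # Only whole groups can be resident on the CU.
    groups_per_cu = wave_slots_per_cu // waves_per_group
    waves_on_cu = groups_per_cu * waves_per_group
    # Assumption: round up when spreading the waves over the EUs.
    return (waves_on_cu + eus_per_cu - 1) // eus_per_cu


def occupancy_range(min_group_size, max_group_size):
    """Brute-force (min, max) occupancy over every size in the flat range."""
    values = [occupancy_for_group_size(size)
              for size in range(min_group_size, max_group_size + 1)]
    return min(values), max(values)


if __name__ == "__main__":
    # The range bounds alone give 9 and 8 ...
    print(occupancy_for_group_size(513), occupancy_for_group_size(1024))
    # ... but sizes inside the range reach the true bounds.
    print(occupancy_range(513, 1024))
```

Under these assumptions, sizes 513 and 1024 give 9 and 8 waves per EU, while sizes 640 and 896 give 10 and 7, so the range over [513,1024] is (7, 10), matching the message.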

156 Commits