llvm-project

Author	SHA1	Message	Date
Jan Patrick Lehr	ec787501dc	Revert "[AMDGPU] Enable i8 GEP promotion for vector allocas" (#171087 ) Reverts llvm/llvm-project#166132 Broke libc on GPU tests. https://lab.llvm.org/buildbot/#/builders/10/builds/18635	2025-12-08 08:25:48 +00:00
Harrison Hao	6ec8c4351c	[AMDGPU] Enable i8 GEP promotion for vector allocas (#166132 ) This patch adds support for the pattern: ```llvm %index = select i1 %idx_sel, i32 0, i32 4 %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index ``` by scaling the byte offset to an element index (index >> log2(ElemSize)), allowing the vector element to be updated with insertelement instead of using scratch memory.	2025-12-08 12:13:09 +08:00
Nicolai Hähnle	8dee997a85	Reland "AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511 )" (#170956 ) Create more canonical code that may even lead to slightly better codegen.	2025-12-06 08:54:44 -08:00
Nicolai Hähnle	ee77c58e5b	Reland "AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510 )" (#170955 ) The second pass of promotion to vector can be quite simple. Reflect that simplicity in the code for better maintainability. v2: - don't put placeholders into the SSAUpdater, and add a test that shows the problem	2025-12-06 01:15:28 +00:00
Nicolai Hähnle	0e0ec4c348	Revert "AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510 )" This reverts commit 22a2c27a0aa0d3aa5d4222f6e766646166450543. Failure on clang-hip-vega20: https://lab.llvm.org/buildbot/#/builders/123/builds/31779	2025-12-05 13:23:05 -08:00
Nicolai Hähnle	de86696dba	Revert "AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511 )" This reverts commit f558c30146e51d5ef72bf3d4b3f0e86ca19e4b99. Failure on clang-hip-vega20: https://lab.llvm.org/buildbot/#/builders/123/builds/31779	2025-12-05 13:22:41 -08:00
Nicolai Hähnle	f558c30146	AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511 ) Create more canonical code that may even lead to slightly better codegen.	2025-12-05 12:54:57 -08:00
Nicolai Hähnle	22a2c27a0a	AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510 ) The second pass of promotion to vector can be quite simple. Reflect that simplicity in the code for better maintainability.	2025-12-05 12:54:25 -08:00
Nicolai Hähnle	3c5fd492d4	AMDGPU/PromoteAlloca: Extract getVectorTypeForAlloca helper (#170509 )	2025-12-03 22:24:07 -08:00
Jay Foad	72c69aefba	[AMDGPU] Make use of getFunction and getMF. NFC. (#167872 )	2025-11-14 11:00:57 +00:00
Fabian Ritter	7982980e07	[AMDGPUPromoteAlloca][NFC] Avoid unnecessary APInt/int64_t conversions (#157864 ) Follow-up to #157682	2025-09-12 09:51:55 +02:00
Fabian Ritter	5b81367960	[AMDGPU] Generate canonical additions in AMDGPUPromoteAlloca (#157810 ) When we know that one operand of an addition is a constant, we might as well put it on the right-hand side and avoid the work to canonicalize it in a later pass.	2025-09-10 14:46:46 +02:00
Fabian Ritter	b965f26538	[AMDGPU] Treat GEP offsets as signed in AMDGPUPromoteAlloca (#157682 ) AMDGPUPromoteAlloca can transform i32 GEP offsets that operate on allocas into i64 extractelement indices. Before this patch, negative GEP offsets would be zero-extended, leading to wrong extractelement indices with values around (2**32-1). This fixes failing LlvmLibcCharacterConverterUTF32To8Test tests for AMDGPU.	2025-09-10 11:32:14 +02:00
Carl Ritson	1f6648ccaa	[AMDGPU] AMDGPUPromoteAlloca: increase default max-regs to 32 (#155076 ) Increase promote-alloca-to-vector-max-regs to 32 from 16. This restores default promotion of 16 x double which was disabled by #127973. Fixes SWDEV-525817.	2025-08-26 09:30:16 +09:00
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Matt Arsenault	1cae21da47	AMDGPU: Remove legacy PM version of AMDGPUPromoteAllocaToVector (#144986 ) This is only run in the middle end with the new pass manager now, so garbage collect the old PM version.	2025-06-20 16:43:39 +09:00
zGoldthorpe	4692f0d344	Revert "[AMDGPU] Extended vector promotion to aggregate types." (#144366 ) Reverts llvm/llvm-project#143784 Patch fails some internal tests. Will investigate more thoroughly before attempting to remerge.	2025-06-16 11:06:18 -04:00
zGoldthorpe	79e06bf1ae	[AMDGPU] Extended vector promotion to aggregate types. (#143784 ) Extends the `amdgpu-promote-alloca-to-vector` pass to also promote aggregate types whose elements are all the same type to vector registers. The motivation for this extension was to account for IR generated by the frontend containing several singleton struct types containing vectors or vector-like elements, though the implementation is strictly more general.	2025-06-13 14:22:21 -04:00
Harrison Hao	1a7f5f5833	[AMDGPU] Promote nestedGEP allocas to vectors (#141199 ) Supports the `nestedGEP` pattern that appears when an alloca is first indexed as an array element and then shifted with a byte‑offset GEP: ```llvm %SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8 %row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j %elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4 %val = load i32, ptr addrspace(5) %elt1 ``` The pass folds the two levels of addressing into a single vector lane index and keeps the whole object in a VGPR: ```llvm %vec = freeze <20 x i32> poison ; alloca promote <20 x i32> %idx0 = mul i32 %j, 2 ; j * 2 %idx = add i32 %idx0, 1 ; j * 2 + 1 %val = extractelement <20 x i32> %vec, i32 %idx ``` This eliminates the scratch read.	2025-06-02 16:20:14 +08:00
Robert Imschweiler	dc29901efb	[AMDGPU] PromoteAlloca: handle out-of-bounds GEP for shufflevector (#139700 ) This LLVM defect was identified via the AMD Fuzzing project. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-05-21 15:28:30 +02:00
Lucas Ramirez	e377dc4d38	[AMDGPU] Max. WG size-induced occupancy limits max. waves/EU (#137807 ) The default maximum waves/EU returned by the family of `AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of waves/EU supported by the subtarget (only a valid occupancy range in "amdgpu-waves-per-eu" may lower that maximum). This ignores maximum achievable occupancy imposed by flat workgroup size and LDS usage, resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces a maximum higher than the one from `AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`. This limits the waves/EU range's maximum to the maximum achievable occupancy derived from flat workgroup sizes and LDS usage. This only has an impact on functions which restrict flat workgroup size with "amdgpu-flat-work-group-size", since the default range of flat workgroup sizes achieves the maximum number of waves/EU supported by the subtarget. Improvements to the handling of "amdgpu-waves-per-eu" are left for a follow up PR (e.g., I think the attribute should be able to lower the full range of waves/EU produced by these methods).	2025-05-01 13:22:23 +02:00
Fabian Ritter	cf188d650c	[AMDGPU] Avoid crashes for non-byte-sized types in PromoteAlloca (#134042 ) This patch addresses three problems when promoting allocas to vectors: - Element types with size < 1 byte in allocas with a vector type caused divisions by zero. - Element types whose size doesn't match their AllocSize hit an assertion. - Access types whose size doesn't match their AllocSize hit an assertion. With this patch, we do not attempt to promote affected allocas to vectors. In principle, we could handle these cases in PromoteAlloca, e.g., by truncating and extending elements from/to their allocation size. It's however unclear if we ever encounter such cases in practice, so that doesn't seem worth the added complexity. For SWDEV-511252	2025-04-14 09:13:54 +02:00
Rahul Joshi	74b7abf154	[IRBuilder] Add new overload for CreateIntrinsic (#131942 ) Add a new `CreateIntrinsic` overload with no `Types`, useful for creating calls to non-overloaded intrinsics that don't need additional mangling.	2025-03-31 08:10:34 -07:00
Kazu Hirata	71935281e0	[Target] Use *Set::insert_range (NFC) (#132140 ) DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently gained C++23-style insert_range. This patch replaces: Dest.insert(Src.begin(), Src.end()); with: Dest.insert_range(Src); This patch does not touch custom begin like succ_begin for now.	2025-03-20 09:09:30 -07:00
Carl Ritson	0e4116a6b9	[AMDGPU] Fix typing error in multi dimensional promote alloca (#131763 ) Fix type error when GEP uses i64 index introduced in #127973.	2025-03-19 08:17:04 +09:00
Matt Arsenault	c5fe075eaf	AMDGPU: Use freeze poison instead of undef in alloca promotion (#131285 ) Previously the value created to represent the uninitialized memory of the alloca was undef. Use freeze poison instead. Enables some optimization improvements (which need defeating in the limit tests), but also a few regressions. Seems to leave behind dead code in some cases too.	2025-03-18 17:27:02 +07:00
Shilei Tian	51c706c119	[NFC][AMDGPU] Replace direct arch comparison with `isAMDGCN()` (#131357 )	2025-03-14 14:21:44 -04:00
Carl Ritson	525d412cae	[AMDGPU] Fix typing error introduced in promote alloca change Fix type error when GEP uses i64 offset introduced in #127973.	2025-03-12 17:32:57 +09:00
Carl Ritson	d921bf233c	[AMDGPU] Extend promotion of alloca to vectors (#127973 ) * Add multi dimensional array support * Make maximum vector size tunable * Make ratio of VGPRs used for vector promotion tunable * Maximum array size now based on VGPR count (32b) instead of element count	2025-03-12 15:11:30 +09:00
Jay Foad	44607666b3	[AMDGPU] Simplify conditional expressions. NFC. (#129228 ) Simplify `cond ? val : false` to `cond && val` and similar.	2025-03-03 10:40:49 +00:00
Sumanth Gundapaneni	4c9e14b3ad	[AMDGPU] Update PromoteAlloca to handle GEPs with variable offset. (#122342 ) In case of variable offset of a GEP that can be optimized out, promote alloca is updated to use the refreshed index to avoid an assertion. Issue found by fuzzer. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-02-24 13:36:30 -06:00
Kazu Hirata	aa066e36f8	[AMDGPU] Avoid repeated hash lookups (NFC) (#126430 )	2025-02-09 13:34:28 -08:00
Shilei Tian	6e4105574e	[NFC][AMDGPU] Improve code introduced in #124607 (#124672 )	2025-01-27 22:57:16 -05:00
Shilei Tian	3b2b7ec07d	[AMDGPU] Handle invariant marks in `AMDGPUPromoteAllocaPass` (#124607 ) Fixes SWDEV-509327.	2025-01-27 17:30:50 -05:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Matt Arsenault	efe87fbc9d	AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144 )	2024-11-06 08:47:15 -08:00
Matt Arsenault	6d9fc1b846	AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113 ) This did not consider the IR change to allow a scalar base with a vector offset part. Reject any users that are not explicitly handled. In this situation we could handle the vector GEP, but that is a larger change. This just avoids the IR verifier error by rejecting it.	2024-10-29 22:14:24 -07:00
Jay Foad	85c17e4092	[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706 ) Convert many instances of: Fn = Intrinsic::getOrInsertDeclaration(...); CreateCall(Fn, ...) to the equivalent CreateIntrinsic call.	2024-10-17 16:20:43 +01:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).	2024-10-11 05:26:03 -07:00
Jeremy Morse	96f37ae453	[NFC] Use initial-stack-allocations for more data structures (#110544 ) This replaces some of the most frequent offenders of using a DenseMap that cause a malloc, where the typical element-count is small enough to fit in an initial stack allocation. Most of these are fairly obvious, one to highlight is the collectOffset method of GEP instructions: if there's a GEP, of course it's going to have at least one offset, but every time we've called collectOffset we end up calling malloc as well for the DenseMap in the MapVector.	2024-09-30 23:15:18 +01:00
Jay Foad	e03f427196	[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133 ) It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.	2024-09-19 16:16:38 +01:00
Matt Arsenault	df138625df	AMDGPU: Remove unnecessary pointer bitcast	2024-09-06 21:32:19 +04:00
Matt Arsenault	21cea3f3be	AMDGPU: Stop promoting allocas with addrspacecast users (#104051 ) We cannot promote this case unless we know the value is only observed through flat operations. We cannot analyze this through a call. PointerMayBeCaptured was an imprecise check for this. A callee with a nocapture attribute may still cast to private and observe the address space, so really we need a different notion of nocapture. I doubt this was of any use anyway. The promotable cases should have optimized out addrspacecast to begin earlier. Fixes #66669 Fixes #104035	2024-08-14 21:53:38 +04:00
Pierre van Houtryve	0e73bbd345	[AMDGPU][PromoteAlloca] Don't stop when an alloca is too big to promote (#93466 ) When I rewrote this, I made a mistake in the control flow. I thought we could just stop promoting if an alloca is too big to vectorize, but we can't. Other allocas in the list may be promotable and fit within the budget. Fixes SWDEV-455343	2024-05-28 08:05:50 +02:00
Shilei Tian	b4df0da9e8	[AMDGPU] Fix a potential wrong return value indicating whether a pass modifies a function (#88197 ) When the alloca is too big for vectorization, the function could have already been modified in previous iteration of the `for` loop.	2024-04-12 09:34:46 -04:00
Pierre van Houtryve	953c13b5c9	[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735 ) Update PromoteAllocaToVector so it considers the whole function before promoting allocas. Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca. Passed internal performance testing.	2024-03-19 11:49:22 +01:00
Pierre van Houtryve	5e379b63fc	[AMDGPU][PromoteAlloca] Drop bitcast handling (#85747 ) This is no longer needed with opaque pointers.	2024-03-19 10:36:12 +01:00
bcahoon	4cf8b298cf	[AMDGPU][PromoteAlloca] Correctly handle a variable vector index (#83597 ) The promote alloca to vector transformation assumes that the vector index is a constant value. If it is not a constant, then either an assert occurs or the transformation generates an incorrect index.	2024-03-05 08:18:17 -06:00
Pierre van Houtryve	4e958abf2f	[AMDGPU][PromoteAlloca] Support memsets to ptr allocas (#80678 ) Fixes #80366	2024-02-05 14:36:15 +01:00
Mariusz Sikora	9a41a80e76	[AMDGPU] Handle object size and bail if assume-like intrinsic is used in PromoteAllocaToVector (#68744 ) Attached test will cause crash without this change. We should not remove isAssumeLikeIntrinsic instruction if it is used by other instruction.	2023-12-20 07:47:49 +01:00
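
Several of the PromoteAlloca commits above (#141199, #166132, #157682) revolve around the same arithmetic: turning a row index plus a byte offset into a flat vector lane, and extending a 32-bit offset to 64 bits without losing its sign. The following is a minimal sketch of that arithmetic only, not the pass's actual code. The names `laneIndex`, `signExtend`, and `zeroExtend` are hypothetical, and the sketch assumes the byte offset is a whole multiple of the element size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the LLVM sources): fold an array row index
// and a byte offset within that row into one flat vector lane index.
// For [10 x <2 x i32>] with byte offset 4: lanesPerRow = 2, elemSize = 4,
// so the lane is j * 2 + 1, matching the IR shown in #141199.
constexpr int64_t laneIndex(int64_t row, int64_t lanesPerRow,
                            int64_t byteOffset, int64_t elemSize) {
  // Assumption: only offsets that are whole elements map onto a lane.
  assert(elemSize > 0 && byteOffset % elemSize == 0);
  return row * lanesPerRow + byteOffset / elemSize;
}

// Reinterpret the 32-bit pattern as signed before widening (the behavior
// #157682 describes as correct for GEP offsets).
constexpr int64_t signExtend(uint32_t bits) {
  return static_cast<int64_t>(static_cast<int32_t>(bits));
}

// Widen the raw 32-bit pattern as unsigned (the behavior #157682 fixes).
constexpr int64_t zeroExtend(uint32_t bits) {
  return static_cast<int64_t>(bits);
}

static_assert(laneIndex(3, 2, 4, 4) == 7, "j = 3 maps to lane 3 * 2 + 1");
static_assert(signExtend(0xFFFFFFFFu) == -1, "offset -1 stays -1");
static_assert(zeroExtend(0xFFFFFFFFu) == 4294967295LL, "becomes 2**32 - 1");
```

The last two assertions show why zero-extension produced indices around 2**32-1: the bit pattern of an i32 offset of -1 is read as a large positive number once widened without its sign.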

188 Commits
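
The worked example in #123748 (10 waves/EU, 4 EUs/CU, 64-wide waves, flat workgroup sizes in [513,1024], no LDS usage) can be checked with a brute-force scan. This is only an illustrative sketch under stated assumptions: the commit says the real implementation finds the bounds in constant time, whereas this version simply tries every size. The `Target` struct, its field names, and `occupancyRange` are hypothetical; LDS limits are ignored; and rounding waves per EU up is an assumption (rounding down gives the same range for this example).

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical target description; the field names are illustrative only.
struct Target {
  unsigned MaxWavesPerEU; // wave slots per execution unit
  unsigned EUsPerCU;      // execution units per compute unit
  unsigned WaveSize;      // work items per wave
};

// Brute-force sketch: returns {min, max} achievable waves per EU over all
// workgroup sizes in [MinSize, MaxSize]. LDS usage is ignored.
std::pair<unsigned, unsigned> occupancyRange(Target T, unsigned MinSize,
                                             unsigned MaxSize) {
  assert(MinSize >= 1 && MinSize <= MaxSize);
  const unsigned Slots = T.MaxWavesPerEU * T.EUsPerCU; // wave slots per CU
  unsigned Lo = T.MaxWavesPerEU, Hi = 0;
  for (unsigned Size = MinSize; Size <= MaxSize; ++Size) {
    // Waves needed by one group of this size (rounded up).
    unsigned WavesPerGroup = (Size + T.WaveSize - 1) / T.WaveSize;
    // Whole groups that fit on the CU at once.
    unsigned Groups = Slots / WavesPerGroup;
    unsigned WavesOnCU = Groups * WavesPerGroup;
    // Assumption: round up when waves do not divide evenly across EUs.
    unsigned PerEU = (WavesOnCU + T.EUsPerCU - 1) / T.EUsPerCU;
    Lo = std::min(Lo, PerEU);
    Hi = std::max(Hi, PerEU);
  }
  return {Lo, Hi};
}
```

Calling `occupancyRange({10, 4, 64}, 513, 1024)` reproduces the commit's [7,10] range, while restricting the scan to the single sizes 513 or 1024 gives the 9 and 8 that the range bounds yield directly.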