200 Commits

Author SHA1 Message Date
Jameson Nash
a6ceae48f5
[AMDGPU] Assert non-array alloca does have a size (#183834)
Refs
https://github.com/llvm/llvm-project/pull/179523/changes#r2851952141
2026-02-28 10:32:36 -05:00
Harrison Hao
1afd7d40af
[AMDGPU] Support i8/i16 GEP indices when promoting allocas to vectors (#175489)
Allow promote alloca to vector to form a vector element index from i8/i16
GEPs when the dynamic offset is known to be element size aligned.

Example:
```llvm
%alloca = alloca <3 x float>, addrspace(5)
%idx = select i1 %idx_select, i32 0, i32 4
%p = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %idx
```
Or:
```llvm
%alloca = alloca <3 x float>, addrspace(5)
%idx = select i1 %idx_select, i32 0, i32 2
%p = getelementptr inbounds i16, ptr addrspace(5) %alloca, i32 %idx
```
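
The index scaling can be sketched in plain C++ rather than LLVM IR. This is only an illustration under stated assumptions: the helper names `byteOffsetToElementIndex` and `gepIndexToElementIndex` are hypothetical, and the real pass reasons about the dynamic offset symbolically instead of on concrete values.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: map a byte offset into a vector alloca to an element
// index, or return nullopt when the offset is not element size aligned.
std::optional<uint32_t> byteOffsetToElementIndex(uint32_t byteOffset,
                                                 uint32_t elemSizeInBytes) {
  assert(elemSizeInBytes != 0 && "element size must be non-zero");
  if (byteOffset % elemSizeInBytes != 0)
    return std::nullopt; // offset does not address a single whole element
  return byteOffset / elemSizeInBytes;
}

// Hypothetical helper: the same mapping for a GEP whose source element type
// is i8 (gepElemSizeInBytes == 1) or i16 (gepElemSizeInBytes == 2).
std::optional<uint32_t> gepIndexToElementIndex(uint32_t gepIndex,
                                               uint32_t gepElemSizeInBytes,
                                               uint32_t vecElemSizeInBytes) {
  return byteOffsetToElementIndex(gepIndex * gepElemSizeInBytes,
                                  vecElemSizeInBytes);
}
```

With 4-byte float elements, the i8 index 4 and the i16 index 2 from the examples both land on element 1, while an i8 index of 3 would be rejected.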
2026-02-27 18:24:43 +08:00
Jameson Nash
bddc8e20bd
[AMDGPU] Replace getAllocatedType with getAllocationSize in PromoteAlloca (#179523)
Some progress towards using size-based APIs instead of unreliable
querying of alloca element types. The removal of the mis-accounting of
alignment to global variable size might have a minor functional impact
in edge cases where the overestimation of size used pushed it just over
the threshold to stop optimizing, and it wasn't already canonicalized by
an earlier pass.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-24 09:30:39 -05:00
Shilei Tian
a56993a694
[AMDGPU] Remove FeaturePromoteAlloca (#177636)
It looks like `+promote-alloca` is always enabled, and `-promote-alloca`
is simply used as a switch to toggle the pass.
2026-02-23 15:24:57 -05:00
Shilei Tian
70905e0afa
[RFC][IR] Remove Constant::isZeroValue (#181521)
`Constant::isZeroValue` currently behaves the same as
`Constant::isNullValue` for all types except floating-point, where it
additionally returns true for negative zero (`-0.0`). However, in
practice, almost all callers operate on integer/pointer types where the
two are equivalent, and the few FP-relevant callers have no meaningful
dependence on the `-0.0` behavior.

This PR removes `isZeroValue` to eliminate the confusing API. All
callers are changed to `isNullValue` with no test failures.

`isZeroValue` will be reintroduced in a future change with clearer
semantics: when null pointers may have non-zero bit patterns,
`isZeroValue` will check for bitwise-all-zeros, while `isNullValue` will
check for the semantic null (which may be non-zero).
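
The `-0.0` distinction can be seen in a small standalone C++ sketch. It assumes 32-bit IEEE-754 floats, and the function names are illustrative only; they are not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a float (assumes a 32-bit IEEE-754 float).
uint32_t bitsOf(float f) {
  static_assert(sizeof(uint32_t) == sizeof(float),
                "float is expected to be 32 bits");
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// Numeric zero: true for both +0.0 and -0.0, like the removed isZeroValue.
bool isNumericZero(float f) { return f == 0.0f; }

// All bits zero: true only for +0.0, because -0.0 has the sign bit set.
bool isAllZeroBits(float f) { return bitsOf(f) == 0; }
```

Here `isNumericZero(-0.0f)` holds while `isAllZeroBits(-0.0f)` does not, which is the only case where the two old queries disagreed.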
2026-02-15 12:06:42 -05:00
Steffen Larsen
c7408d17fa
[AMDGPU][SROA] Unify cast chain implementations (#177945)
The AMDGPU promote alloca pass is missing a conversion link when casting
between vectors of pointers and pointers or vectors of pointers with
different number of elements. This causes codegen to crash due to
invalid casts being generated. To address this, this commit adds the
missing conversion link.

In addition to this, the commit moves the common load/store cast logic
into a new function `createLoadStoreCastChain`.

---------

Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
Co-authored-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
2026-02-03 11:12:02 +00:00
Shilei Tian
786a20710d
[NFCI][AMDGPU] Use GET_SUBTARGETINFO_MACRO in GCNSubtarget.h and R600Subtarget.h (#177402)
We can finally get rid of the manually defined boolean variables, like
other targets. Even though most of them are now defined by macros, we
still need to add the entries.
2026-01-25 09:38:42 -05:00
Jameson Nash
d10b2b566a
[NFCI] replace getValueType with new getGlobalSize query (#177186)
Returns uint64_t to simplify callers. The goal is to eventually replace
getValueType with this query, which should return the known minimum
reference-able size, as provided (instead of a Type) during create.
Additionally the common isSized query would be replaced with an
isExactKnownSize query to test if that size is an exact definition.
2026-01-22 13:55:53 -05:00
Shilei Tian
4e74fba5b2
[AMDGPU] Fix a potential use-after-erase in AMDGPUPromoteAlloca pass (#174529)
In some cases, the placeholder itself can be used as the value for its
corresponding block in `SSAUpdater`, and later used as an incoming value
in another block in `GetValueInMiddleOfBlock`. If we erase it too early,
this can lead to a use-after-erase.
2026-01-08 11:44:16 -05:00
Kevin Choi
5897f276a5
[AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce excessive codegen (#171253)
Investigation revealed that a scalarized copy results in a long chain of
extract/insert elements, which can cause an explosion of generated temporaries
in the AMDGPU backend because there is no efficient representation for
extracting a subvector with a dynamic index. Using identity bitcasts can reduce
the number of extract/insert elements to 1 and produce much smaller, more
efficient generated code.

Credit: ruiling
2025-12-19 14:06:52 -05:00
macurtis-amd
e741cd88a1
AMDGPU/PromoteAlloca: Fix handling of users of multiple allocas (#172771)
With recent refactoring, LDS promotion worklists for all allocas are
populated upfront. In some cases, this results in a User in multiple
lists. Then as each list is processed, a User might get deleted via
removeFromParent, potentially leaving a dangling pointer in a subsequent
worklist.

Currently this only occurs for memcpy and memmove. Prior to refactoring,
these were handled by DeferredInstr, and were processed after the last
use of the then singular worklist.

This change moves processing of DeferredInstr to after all worklists
have been processed.
2025-12-18 08:41:21 -06:00
Nicolai Hähnle
e760d0619f
AMDGPU/PromoteAlloca: Refactor into analysis / commit phases (#170512)
This change is motivated by the overall goal of finding alternative ways
to promote allocas to VGPRs. The current solution is effectively limited
to allocas whose size matches a register class, and we can't keep adding
more register classes. We have some downstream work in this direction,
and I'm currently looking at cleaning that up to bring it upstream.

This refactor paves the way to adding a third way of promoting allocas,
on top of the existing alloca-to-vector and alloca-to-LDS. Much of the
analysis can be shared between the different promotion techniques.

Additionally, the idea behind splitting the pass into an analysis
phase and a commit phase is that it ought to allow us to more easily
make better "big picture" decisions about which allocas to promote how
in the future.
2025-12-12 01:24:38 +00:00
Jan Patrick Lehr
ec787501dc
Revert "[AMDGPU] Enable i8 GEP promotion for vector allocas" (#171087)
Reverts llvm/llvm-project#166132

Broke libc on GPU tests.
https://lab.llvm.org/buildbot/#/builders/10/builds/18635
2025-12-08 08:25:48 +00:00
Harrison Hao
6ec8c4351c
[AMDGPU] Enable i8 GEP promotion for vector allocas (#166132)
This patch adds support for the pattern:
```llvm
  %index = select i1 %idx_sel, i32 0, i32 4
  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index
```
by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of
using scratch memory.
2025-12-08 12:13:09 +08:00
Nicolai Hähnle
8dee997a85
Reland "AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511)" (#170956)
Create more canonical code that may even lead to slightly better
codegen.
2025-12-06 08:54:44 -08:00
Nicolai Hähnle
ee77c58e5b
Reland "AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510)" (#170955)
The second pass of promotion to vector can be quite simple. Reflect that
simplicity in the code for better maintainability.

v2:
- don't put placeholders into the SSAUpdater, and add a test that shows
the problem
2025-12-06 01:15:28 +00:00
Nicolai Hähnle
0e0ec4c348
Revert "AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510)"
This reverts commit 22a2c27a0aa0d3aa5d4222f6e766646166450543.

Failure on clang-hip-vega20: https://lab.llvm.org/buildbot/#/builders/123/builds/31779
2025-12-05 13:23:05 -08:00
Nicolai Hähnle
de86696dba
Revert "AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511)"
This reverts commit f558c30146e51d5ef72bf3d4b3f0e86ca19e4b99.

Failure on clang-hip-vega20: https://lab.llvm.org/buildbot/#/builders/123/builds/31779
2025-12-05 13:22:41 -08:00
Nicolai Hähnle
f558c30146
AMDGPU/PromoteAlloca: Always use i32 for indexing (#170511)
Create more canonical code that may even lead to slightly better
codegen.
2025-12-05 12:54:57 -08:00
Nicolai Hähnle
22a2c27a0a
AMDGPU/PromoteAlloca: Simplify how deferred loads work (#170510)
The second pass of promotion to vector can be quite simple. Reflect that
simplicity in the code for better maintainability.
2025-12-05 12:54:25 -08:00
Nicolai Hähnle
3c5fd492d4
AMDGPU/PromoteAlloca: Extract getVectorTypeForAlloca helper (#170509)
2025-12-03 22:24:07 -08:00
Jay Foad
72c69aefba
[AMDGPU] Make use of getFunction and getMF. NFC. (#167872)
2025-11-14 11:00:57 +00:00
Fabian Ritter
7982980e07
[AMDGPUPromoteAlloca][NFC] Avoid unnecessary APInt/int64_t conversions (#157864)
Follow-up to #157682
2025-09-12 09:51:55 +02:00
Fabian Ritter
5b81367960
[AMDGPU] Generate canonical additions in AMDGPUPromoteAlloca (#157810)
When we know that one operand of an addition is a constant, we might as
well put it on the right-hand side and avoid the work to canonicalize it
in a later pass.
2025-09-10 14:46:46 +02:00
Fabian Ritter
b965f26538
[AMDGPU] Treat GEP offsets as signed in AMDGPUPromoteAlloca (#157682)
AMDGPUPromoteAlloca can transform i32 GEP offsets that operate on
allocas into i64 extractelement indices. Before this patch, negative GEP
offsets would be zero-extended, leading to wrong extractelement indices
with values around (2**32-1).

This fixes failing LlvmLibcCharacterConverterUTF32To8Test tests for
AMDGPU.
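
The difference between the two widenings can be reproduced in a few lines of C++. This is a standalone sketch of the arithmetic only; the helper names are hypothetical and do not come from the pass.

```cpp
#include <cassert>
#include <cstdint>

// Widen a 32-bit offset to 64 bits by zero extension. For a negative offset
// this yields a large positive value, which is the wrong index.
uint64_t zeroExtend(int32_t offset) {
  return static_cast<uint64_t>(static_cast<uint32_t>(offset));
}

// Widen a 32-bit offset to 64 bits by sign extension, preserving its value.
int64_t signExtend(int32_t offset) { return static_cast<int64_t>(offset); }
```

For an offset of -1, `zeroExtend` produces 4294967295 (2**32-1), whereas `signExtend` keeps -1.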
2025-09-10 11:32:14 +02:00
Carl Ritson
1f6648ccaa
[AMDGPU] AMDGPUPromoteAlloca: increase default max-regs to 32 (#155076)
Increase promote-alloca-to-vector-max-regs to 32 from 16.
This restores default promotion of 16 x double which was disabled by
#127973.

Fixes SWDEV-525817.
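
The arithmetic behind the new default can be checked with a short sketch. It assumes the limit counts 32-bit registers, as the "VGPR count (32b)" wording of #127973 suggests; `regsNeeded` and `fitsBudget` are hypothetical names, not functions from the pass.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit registers needed for numElts elements of eltBits bits
// each, rounded up. Illustrative estimate only.
uint64_t regsNeeded(uint64_t numElts, uint64_t eltBits) {
  assert(eltBits != 0 && "element width must be non-zero");
  return (numElts * eltBits + 31) / 32;
}

// True when the vector fits within a budget of maxRegs 32-bit registers.
bool fitsBudget(uint64_t numElts, uint64_t eltBits, uint64_t maxRegs) {
  return regsNeeded(numElts, eltBits) <= maxRegs;
}
```

Under this model, 16 x double needs 32 registers, so it exceeds a budget of 16 but fits a budget of 32.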
2025-08-26 09:30:16 +09:00
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Matt Arsenault
1cae21da47
AMDGPU: Remove legacy PM version of AMDGPUPromoteAllocaToVector (#144986)
This is only run in the middle end with the new pass manager now,
so garbage collect the old PM version.
2025-06-20 16:43:39 +09:00
zGoldthorpe
4692f0d344
Revert "[AMDGPU] Extended vector promotion to aggregate types." (#144366)
Reverts llvm/llvm-project#143784

Patch fails some internal tests. Will investigate more thoroughly before
attempting to remerge.
2025-06-16 11:06:18 -04:00
zGoldthorpe
79e06bf1ae
[AMDGPU] Extended vector promotion to aggregate types. (#143784)
Extends the `amdgpu-promote-alloca-to-vector` pass to also promote
aggregate types whose elements are all the same type to vector
registers.

The motivation for this extension was to account for IR generated by the
frontend containing several singleton struct types containing vectors or
vector-like elements, though the implementation is strictly more
general.
2025-06-13 14:22:21 -04:00
Harrison Hao
1a7f5f5833
[AMDGPU] Promote nestedGEP allocas to vectors (#141199)
Supports the `nestedGEP` pattern that appears when an alloca is first
indexed as an array element and then shifted with a byte-offset GEP:

```llvm
  %SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
  %row  = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
  %elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
  %val  = load i32, ptr addrspace(5) %elt1
```

The pass folds the two levels of addressing into a single vector lane
index and keeps the whole object in a VGPR:

```llvm
  %vec  = freeze <20 x i32> poison              ; alloca promoted to <20 x i32>
  %idx0 = mul i32 %j, 2                         ; j * 2
  %idx  = add i32 %idx0, 1                      ; j * 2 + 1
  %val  = extractelement <20 x i32> %vec, i32 %idx
```

This eliminates the scratch read.
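
The index folding can also be written as ordinary integer arithmetic. The sketch below is an assumption-laden illustration, not the pass's code: `flattenLaneIndex` is a hypothetical helper, and it only handles offsets that are whole multiples of the lane size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fold an array row index and a trailing byte offset
// into one lane index of the flattened vector.
uint32_t flattenLaneIndex(uint32_t row, uint32_t rowSizeInBytes,
                          uint32_t byteOffset, uint32_t laneSizeInBytes) {
  assert(laneSizeInBytes != 0 && rowSizeInBytes % laneSizeInBytes == 0);
  assert(byteOffset % laneSizeInBytes == 0);
  const uint32_t lanesPerRow = rowSizeInBytes / laneSizeInBytes;
  return row * lanesPerRow + byteOffset / laneSizeInBytes;
}
```

For `[10 x <2 x i32>]` each row is 8 bytes and each lane is 4 bytes, so row `j` with byte offset 4 maps to lane `j * 2 + 1`, matching the IR above.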
2025-06-02 16:20:14 +08:00
Robert Imschweiler
dc29901efb
[AMDGPU] PromoteAlloca: handle out-of-bounds GEP for shufflevector (#139700)
This LLVM defect was identified via the AMD Fuzzing project.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-05-21 15:28:30 +02:00
Lucas Ramirez
e377dc4d38
[AMDGPU] Max. WG size-induced occupancy limits max. waves/EU (#137807)
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.

This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.

Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
2025-05-01 13:22:23 +02:00
Fabian Ritter
cf188d650c
[AMDGPU] Avoid crashes for non-byte-sized types in PromoteAlloca (#134042)
This patch addresses three problems when promoting allocas to vectors:
- Element types with size < 1 byte in allocas with a vector type caused
  divisions by zero.
- Element types whose size doesn't match their AllocSize hit an assertion.
- Access types whose size doesn't match their AllocSize hit an assertion.

With this patch, we do not attempt to promote affected allocas to vectors. In
principle, we could handle these cases in PromoteAlloca, e.g., by truncating
and extending elements from/to their allocation size. It's however unclear if
we ever encounter such cases in practice, so that doesn't seem worth the added
complexity.

For SWDEV-511252
2025-04-14 09:13:54 +02:00
Rahul Joshi
74b7abf154
[IRBuilder] Add new overload for CreateIntrinsic (#131942)
Add a new `CreateIntrinsic` overload with no `Types`, useful for
creating calls to non-overloaded intrinsics that don't need additional
mangling.
2025-03-31 08:10:34 -07:00
Kazu Hirata
71935281e0
[Target] Use *Set::insert_range (NFC) (#132140)
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range.  This patch replaces:

  Dest.insert(Src.begin(), Src.end());

with:

  Dest.insert_range(Src);

This patch does not touch custom begin like succ_begin for now.
2025-03-20 09:09:30 -07:00
Carl Ritson
0e4116a6b9
[AMDGPU] Fix typing error in multi dimensional promote alloca (#131763)
Fix type error when GEP uses i64 index introduced in #127973.
2025-03-19 08:17:04 +09:00
Matt Arsenault
c5fe075eaf
AMDGPU: Use freeze poison instead of undef in alloca promotion (#131285)
Previously the value created to represent the uninitialized memory
of the alloca was undef. Use freeze poison instead. Enables some
optimization improvements (which need defeating in the limit tests),
but also a few regressions. Seems to leave behind dead code in some
cases too.
2025-03-18 17:27:02 +07:00
Shilei Tian
51c706c119
[NFC][AMDGPU] Replace direct arch comparison with isAMDGCN() (#131357)
2025-03-14 14:21:44 -04:00
Carl Ritson
525d412cae
[AMDGPU] Fix typing error introduce in promote alloca change
Fix type error when GEP uses i64 offset introduced in #127973.
2025-03-12 17:32:57 +09:00
Carl Ritson
d921bf233c
[AMDGPU] Extend promotion of alloca to vectors (#127973)
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
2025-03-12 15:11:30 +09:00
Jay Foad
44607666b3
[AMDGPU] Simplify conditional expressions. NFC. (#129228)
Simplify `cond ? val : false` to `cond && val` and similar.
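
A minimal standalone check of the equivalence, with hypothetical function names, might look like this:

```cpp
#include <cassert>
#include <initializer_list>

// Original form of the expression.
bool withTernary(bool cond, bool val) { return cond ? val : false; }

// Simplified form of the expression.
bool withAnd(bool cond, bool val) { return cond && val; }

// Compare both forms on all four input combinations.
bool formsAgree() {
  for (bool cond : {false, true})
    for (bool val : {false, true})
      if (withTernary(cond, val) != withAnd(cond, val))
        return false;
  return true;
}
```

Both forms also evaluate `val` only when `cond` is true, so the rewrite preserves short-circuit behavior when `val` is an expression with side effects.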
2025-03-03 10:40:49 +00:00
Sumanth Gundapaneni
4c9e14b3ad
[AMDGPU] Update PromoteAlloca to handle GEPs with variable offset. (#122342)
When a GEP has a variable offset that can be optimized out, promote
alloca now uses the refreshed index, which avoids an assertion failure.

Issue found by fuzzer.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-02-24 13:36:30 -06:00
Kazu Hirata
aa066e36f8
[AMDGPU] Avoid repeated hash lookups (NFC) (#126430)
2025-02-09 13:34:28 -08:00
Shilei Tian
6e4105574e
[NFC][AMDGPU] Improve code introduced in #124607 (#124672)
2025-01-27 22:57:16 -05:00
Shilei Tian
3b2b7ec07d
[AMDGPU] Handle invariant marks in AMDGPUPromoteAllocaPass (#124607)
Fixes SWDEV-509327.
2025-01-27 17:30:50 -05:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies, i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU at any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
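
The numbers in this example can be reproduced with a brute-force sketch. It is not the constant-time computation the commit implements: it simply tries every group size in the range. It assumes no LDS usage and no separate cap on groups per CU, and rounding the per-EU count up is also an assumption. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Target parameters taken from the example: 64-wide waves, 10 waves per EU,
// 4 EUs per CU.
constexpr unsigned WaveSize = 64;
constexpr unsigned MaxWavesPerEU = 10;
constexpr unsigned EUsPerCU = 4;

// Waves per EU when the CU is filled with whole groups of groupSize items.
unsigned wavesPerEU(unsigned groupSize) {
  const unsigned slotsPerCU = MaxWavesPerEU * EUsPerCU;
  const unsigned wavesPerGroup = (groupSize + WaveSize - 1) / WaveSize;
  assert(groupSize > 0 && wavesPerGroup <= slotsPerCU);
  const unsigned groupsPerCU = slotsPerCU / wavesPerGroup;
  const unsigned wavesPerCU = groupsPerCU * wavesPerGroup;
  return (wavesPerCU + EUsPerCU - 1) / EUsPerCU; // rounding up is assumed
}

// Brute-force {min, max} occupancy over all sizes in [minSize, maxSize].
std::pair<unsigned, unsigned> occupancyRange(unsigned minSize,
                                             unsigned maxSize) {
  unsigned lo = wavesPerEU(minSize), hi = lo;
  for (unsigned s = minSize; s <= maxSize; ++s) {
    lo = std::min(lo, wavesPerEU(s));
    hi = std::max(hi, wavesPerEU(s));
  }
  return {lo, hi};
}
```

Under these assumptions, sizes 513, 640, 896 and 1024 give 9, 10, 7 and 8 waves per EU respectively, and the range [513,1024] yields [7,10].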

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Matt Arsenault
efe87fbc9d
AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144)
2024-11-06 08:47:15 -08:00
Matt Arsenault
6d9fc1b846
AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113)
This did not consider the IR change to allow a scalar base with a vector
offset part. Reject any users that are not explicitly handled.

In this situation we could handle the vector GEP, but that is a larger
change. This just avoids the IR verifier error by rejecting it.
2024-10-29 22:14:24 -07:00
Jay Foad
85c17e4092
[LLVM] Make more use of IRBuilder::CreateIntrinsic. NFC. (#112706)
Convert many instances of:
  Fn = Intrinsic::getOrInsertDeclaration(...);
  CreateCall(Fn, ...)
to the equivalent CreateIntrinsic call.
2024-10-17 16:20:43 +01:00