This patch adds support for the pattern:
```llvm
%index = select i1 %idx_sel, i32 0, i32 4
%elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index
```
by scaling the byte offset to an element index (index >>
log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of
using
scratch memory.
The second pass of promotion to vector can be quite simple. Reflect that
simplicity in the code for better maintainability.
v2:
- don't put placeholders into the SSAUpdater, and add a test that shows
the problem
When we know that one operand of an addition is a constant, we might was
well put it on the right-hand side and avoid the work to canonicalize it
in a later pass.
[AMDGPU] Treat GEP offsets as signed in AMDGPUPromoteAlloca
AMDGPUPromoteAlloca can transform i32 GEP offsets that operate on
allocas into i64 extractelement indices. Before this patch, negative GEP
offsets would be zero-extended, leading to wrong extractelement indices
with values around (2**32-1).
This fixes failing LlvmLibcCharacterConverterUTF32To8Test tests for
AMDGPU.
Increase promote-alloca-to-vector-max-regs to 32 from 16.
This restores default promotion of 16 x double which was disabled by
#127973.
Fixes SWDEV-525817.
Extends the `amdgpu-promote-alloca-to-vector` pass to also promote
aggregate types whose elements are all the same type to vector
registers.
The motivation for this extension was to account for IR generated by the
frontend containing several singleton struct types containing vectors or
vector-like elements, though the implementation is strictly more
general.
Supports the `nestedGEP`pattern that
appears when an alloca is first indexed as an array element and then
shifted with a byte‑offset GEP:
```llvm
%SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
%row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
%elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
%val = load i32, ptr addrspace(5) %elt1
```
The pass folds the two levels of addressing into a single vector lane
index and keeps the whole object in a VGPR:
```llvm
%vec = freeze <20 x i32> poison ; alloca promote <20 x i32>
%idx0 = mul i32 %j, 2 ; j * 2
%idx = add i32 %idx0, 1 ; j * 2 + 1
%val = extractelement <20 x i32> %vec, i32 %idx
```
This eliminates the scratch read.
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.
This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.
Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
This patch addresses three problems when promoting allocas to vectors:
- Element types with size < 1 byte in allocas with a vector type caused
divisions by zero.
- Element types whose size doesn't match their AllocSize hit an assertion.
- Access types whose size doesn't match their AllocSize hit an assertion.
With this patch, we do not attempt to promote affected allocas to vectors. In
principle, we could handle these cases in PromoteAlloca, e.g., by truncating
and extending elements from/to their allocation size. It's however unclear if
we ever encounter such cases in practice, so that doesn't seem worth the added
complexity.
For SWDEV-511252
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
Previously the value created to represent the uninitialized memory
of the alloca was undef. Use freeze poison instead. Enables some
optimization improvements (which need defeating in the limit tests),
but also a few regressions. Seems to leave behind dead code in some
cases too.
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
In case of variable offset of a GEP that can be optimized out, promote
alloca is updated to use the refereshed index to avoid an assertion.
Issue found by fuzzer.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.
Fixes#118220.
This did not consider the IR change to allow a scalar base with a vector
offset part. Reject any users that are not explicitly handled.
In this situation we could handle the vector GEP, but that is a larger
change. This just avoids the IR verifier error by rejecting it.
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
This replaces some of the most frequent offenders of using a DenseMap that
cause a malloc, where the typical element-count is small enough to fit in
an initial stack allocation.
Most of these are fairly obvious, one to highlight is the collectOffset
method of GEP instructions: if there's a GEP, of course it's going to have
at least one offset, but every time we've called collectOffset we end up
calling malloc as well for the DenseMap in the MapVector.
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
We cannot promote this case unless we know the value is only
observed through flat operations. We cannot analyze this through
a call. PointerMayBeCaptured was an imprecise check for this.
A callee with a nocapture attribute may still cast to private and
observe the address space, so really we need a different notion
of nocapture.
I doubt this was of any use anyway. The promotable cases should
have optimized out addrspacecast to begin earlier.
Fixes#66669Fixes#104035
When I rewrote this, I made a mistake in the control flow. I thought we
could just stop promoting if an alloca is too big to vectorize, but we
can't. Other allocas in the list may be promotable and fit within the
budget.
Fixes SWDEV-455343
Update PromoteAllocaToVector so it considers the whole function before promoting allocas.
Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca.
Passed internal performance testing.
The promote alloca to vector transformation assumes that the
vector index is a constant value. If it is not a constant, then
either an assert occurs or the tranformation generates an
incorrect index.