The test used to look all good, but actually not. The WeakVH just make
itself null after the pointed value being replaced. So a zero value was
used because VarIndex become null. The test checks looks all good.
Actually only the WeakTrackingVH have the ability to be updated to new
value.
Change the test slightly to make that using zero index is wrong.
The VarIndex might come from (like load) another alloca which maybe
promoted before. The value will replaced in this case. WeakVH correctly
handles this.
If `promoteAllocaUserToVector` returns the placeholder, it means the
instruction does not actually modify the alloca. we don't need to add
the placeholder as block available value for correctness. Instructions
appear afterwards in the the same block could still get the placeholder
as source value through GetCurVal() call. Instructions in other block
which access the alloca will be set up later when we really do
placeholder replacement.
This help simplify the placeholder replacement logic.
When we do alloca promotion, there might be cross references to the
values derived from different allocas. RAUW immediately during promotion
may fail to update the values cached in the AllocaAnalysis structure.
Solving the problem by postpone the value replacement, and also the
value deletion as well to make this possible.
Some progress towards using size-based APIs instead of unreliable
querying of alloca element types. The removal of the mis-accounting of
alignment to global variable size might have a minor functional impact
in edge cases where the overestimation of size used pushed it just over
the threshold to stop optimizing, and it wasn't already canonicalized by
an earlier pass.
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
`Constant::isZeroValue` currently behaves same as
`Constant::isNullValue` for all types except floating-point, where it
additionally returns true for negative zero (`-0.0`). However, in
practice, almost all callers operate on integer/pointer types where the
two are equivalent, and the few FP-relevant callers have no meaningful
dependence on the `-0.0` behavior.
This PR removes `isZeroValue` to eliminate the confusing API. All
callers are changed to `isNullValue` with no test failures.
`isZeroValue` will be reintroduced in a future change with clearer
semantics: when null pointers may have non-zero bit patterns,
`isZeroValue` will check for bitwise-all-zeros, while `isNullValue` will
check for the semantic null (which
may be non-zero).
The AMDGPU promote alloca pass is missing a conversion link when casting
between vectors of pointers and pointers or vectors of pointers with
different number of elements. This causes codegen to crash due to
invalid casts being generated. To address this, this commit adds the
missing conversion link.
In addition to this, the commit moves the common load/store cast logic
into a new function `createLoadStoreCastChain`.
---------
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
Co-authored-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
We can finally get rid of the manually defined boolean variables, like
other targets. Even though most of them are now defined by macros, we
still need to add the entries.
Returns uint64_t to simplify callers. The goal is eventually replace
getValueType with this query, which should return the known minimum
reference-able size, as provided (instead of a Type) during create.
Additionally the common isSized query would be replaced with an
isExactKnownSize query to test if that size is an exact definition.
In some cases, the placeholder itself can be used as the value for its
corresponding block in `SSAUpdater`, and later used as an incoming value
in another block in `GetValueInMiddleOfBlock`. If we erase it too early,
this can lead to a use-after-erase.
Investigation revealed that scalarized copy results in a long chain of
extract/insert elements which can explode in generated temps in the
AMDGPU backend as there is no efficient representation for extracting
subvector with dynamic index. Using identity bitcasts can reduce the
number of extract/insert elements down to 1 and produce much smaller,
efficient generated code.
Credit: ruiling
With recent refactoring, LDS promotion worklists for all allocas are
populated upfront. In some cases, this results in a User in multiple
lists. Then as each list is processed, a User might get deleted via
removeFromParent, potentially leaving a dangling pointer in a subsequent
worklist.
Currently this only occurs for memcpy and memmove. Prior to refactoring,
these were handled by DeferredInstr, and were processed after the last
use of the then singular worklist.
This change moves processing of DeferredInstr to after all worklists
have be processed.
This change is motivated by the overall goal of finding alternative ways
to promote allocas to VGPRs. The current solution is effectively limited
to allocas whose size matches a register class, and we can't keep adding
more register classes. We have some downstream work in this direction,
and I'm currently looking at cleaning that up to bring it upstream.
This refactor paves the way to adding a third way of promoting allocas,
on top of the existing alloca-to-vector and alloca-to-LDS. Much of the
analysis can be shared between the different promotion techniques.
Additionally, the idea behind splitting the pass into an analysis
phase and a commit phase is that it ought to allow us to more easily
make
better "big picture" decision about which allocas to promote how in the
future.
This patch adds support for the pattern:
```llvm
%index = select i1 %idx_sel, i32 0, i32 4
%elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index
```
by scaling the byte offset to an element index (index >>
log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of
using
scratch memory.
The second pass of promotion to vector can be quite simple. Reflect that
simplicity in the code for better maintainability.
v2:
- don't put placeholders into the SSAUpdater, and add a test that shows
the problem
When we know that one operand of an addition is a constant, we might was
well put it on the right-hand side and avoid the work to canonicalize it
in a later pass.
[AMDGPU] Treat GEP offsets as signed in AMDGPUPromoteAlloca
AMDGPUPromoteAlloca can transform i32 GEP offsets that operate on
allocas into i64 extractelement indices. Before this patch, negative GEP
offsets would be zero-extended, leading to wrong extractelement indices
with values around (2**32-1).
This fixes failing LlvmLibcCharacterConverterUTF32To8Test tests for
AMDGPU.
Increase promote-alloca-to-vector-max-regs to 32 from 16.
This restores default promotion of 16 x double which was disabled by
#127973.
Fixes SWDEV-525817.
Extends the `amdgpu-promote-alloca-to-vector` pass to also promote
aggregate types whose elements are all the same type to vector
registers.
The motivation for this extension was to account for IR generated by the
frontend containing several singleton struct types containing vectors or
vector-like elements, though the implementation is strictly more
general.
Supports the `nestedGEP`pattern that
appears when an alloca is first indexed as an array element and then
shifted with a byte‑offset GEP:
```llvm
%SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
%row = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
%elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
%val = load i32, ptr addrspace(5) %elt1
```
The pass folds the two levels of addressing into a single vector lane
index and keeps the whole object in a VGPR:
```llvm
%vec = freeze <20 x i32> poison ; alloca promote <20 x i32>
%idx0 = mul i32 %j, 2 ; j * 2
%idx = add i32 %idx0, 1 ; j * 2 + 1
%val = extractelement <20 x i32> %vec, i32 %idx
```
This eliminates the scratch read.
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.
This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.
Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
This patch addresses three problems when promoting allocas to vectors:
- Element types with size < 1 byte in allocas with a vector type caused
divisions by zero.
- Element types whose size doesn't match their AllocSize hit an assertion.
- Access types whose size doesn't match their AllocSize hit an assertion.
With this patch, we do not attempt to promote affected allocas to vectors. In
principle, we could handle these cases in PromoteAlloca, e.g., by truncating
and extending elements from/to their allocation size. It's however unclear if
we ever encounter such cases in practice, so that doesn't seem worth the added
complexity.
For SWDEV-511252
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
Previously the value created to represent the uninitialized memory
of the alloca was undef. Use freeze poison instead. Enables some
optimization improvements (which need defeating in the limit tests),
but also a few regressions. Seems to leave behind dead code in some
cases too.
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
In case of variable offset of a GEP that can be optimized out, promote
alloca is updated to use the refereshed index to avoid an assertion.
Issue found by fuzzer.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>