This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This PR is adding a boilerplate of CGSCC AMDGPUAttributor pass
(amdgpu-attributor-cgscc) by doing refactoring from the existing Module
AMDGPUAttributor pass (amdgpu-attributor).
CGSCC AMDGPUAttributor pass sets `AttributorConfig.IsModulePass =
false`, and make Attributor's `Functions` set contain only functions in
a SCC.
The main implementations of abstract attributes have not changed - NFC.
Subsequently, in future work some of the AMDGPU abstract attributors
might move to be handled by CGSCC pass.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The "uniform-work-group-size" function attribute previously took a
string value of "true" or "false". Since presence alone can convey the
"true" semantics and absence can convey "false", the value is
unnecessary.
This patch converts it to a valueless string attribute: presence
indicates true, absence indicates false. For backward compatibility,
auto-upgrade logic is added in both UpgradeAttributes (bitcode) and
UpgradeFunctionAttributes: if the old value is "true", the attribute is
kept without a value; if "false", the attribute is removed.
This is one of the string attributes that takes a boolean
value for no reason. There is no point in ever writing this
with an explicit false. Stop adding the noise and reporting
an unnecessary change.
This adds support to whitelist trap intrinsics while handling of
intrinsics with !nocallback. This fixes the reasons behind the previous
revert of #131759.
The attributor was exiting early whenever it saw intrinsics without the nocallback bit, so trap-only kernels lost all the inferred “no implicit arg” metadata and their amdgpu-agpr-alloc=0 guarantees. That conservative fallback broke certain workloads by forcing unnecessary implicit arguments and AGPR reservations. This patch allows the pass to recognize leaf-like trap intrinsics, so they no longer poison the analysis.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Calculating alignment for `make.buffer.rsrc` intrinsic. The logic is the
alignment on use of return value of `make.buffer.rsrc` should be capped
by the base operand's alignment of `make.buffer.rsrc`.
For example:
```ll
define float @foo(ptr addrspace(1) align X %ptr) {
%fat.ptr = call ptr addrspace(7) @llvm.amdgcn.make.buffer.rsrc.p7.p1(ptr addrspace(1) %ptr, i16 0, i32 C, i32 0)
%y = load float, ptr addrspace(7) %fat.ptr, align Y
ret float %y
}
```
We hopes that `Y = min(X, Y)`
---
After discussion, it seems improper for letting `Y = min(X, Y)` since it
contradict with the semantic of align on load.
So we would apply the origin behavior of align, which is letting `X` and
`Y` both equal to `max(X, Y)`
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
(#131759)
This isn't really the right check, we want to know that the intrinsic
does not perform a true function call to any code (in the module or
not). nocallback
appears to be the closest thing to this property we have now though.
Fixes theoretically
miscompiles with intrinsics like statepoint, which hide a call to a real
function.
Also do the same for inferring no-agpr usage.
(#162300)
This now tries to compute a lower bound on the number of registers
for individual inline asm uses. Also starts using AACallEdges
to handling indirect calls.
Apparently the IR verifier doesn't enforce the correct structure.
Also I do not know why we have this extra level of wrapper in the
intrinsic,
it just makes it harder to get at the string. I also do not know why
kokkos is using these intrinsics, but it shouldn't.
For now just try to compute the minimum number of AGPRs required
to allocate the asm. Leave the attributor changes to turn this
into an integer value for later.
We do not need this in the attributor, because `ST.getWavesPerEU`
accounts for both the waves-per-eu and flat-workgroup-size attributes.
If the waves-per-eu values are not valid, it drops them. In the
attributor, we only need to propagate the values without using
intermediate flat workgroup size values.
Fixes SWDEV-550257.
This patch introduces `AAAMDGPUUniformArgument` that can infer `inreg` function
argument attribute. The idea is, for a function argument, if the corresponding
call site arguments are always uniform, we can mark it as `inreg` thus pass it
via SGPR.
In addition, this AA is also able to propagate the inreg attribute if feasible.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This patch addresses clang-tidy's readability-const-return-type by
dropping const from the return type while switching to StringRef at
the same time because these functions just return string constants.
Currently, we use `AAAMDWavesPerEU` to iteratively update values based
on attributes from the associated function, potentially propagating
user-annotated values, along with `AAAMDFlatWorkGroupSize`. Similarly,
we have `AAAMDFlatWorkGroupSize`. However, since the value calculated
through the flat workgroup size always dominates the user annotation
(i.e., the attribute), running `AAAMDWavesPerEU` iteratively is
unnecessary if no user-annotated value exists.
This PR completely rewrites how the `amdgpu-waves-per-eu` attribute is
handled in `AMDGPUAttributor`. The key changes are as follows:
- `AAAMDFlatWorkGroupSize` remains unchanged.
- `AAAMDWavesPerEU` now only propagates user-annotated values.
- A new function is added to check and update `amdgpu-waves-per-eu`
based on the following rules:
- No waves per eu, no flat workgroup size: Assume a flat workgroup size
of `1,1024` and compute waves per eu based on this.
- No waves per eu, flat workgroup size exists: Use the provided flat
workgroup size to compute waves-per-eu.
- Waves per eu exists, no flat workgroup size: This is a tricky case. In
this PR, we assume a flat workgroup size of `1,1024`, but this can be
adjusted if a different approach is preferred. Alternatively, we could
directly use the user-annotated value.
- Both waves per eu and flat workgroup size exist: If there’s a
conflict, the value derived from the flat workgroup size takes
precedence over waves per eu.
This PR also updates the logic for merging two waves per eu pairs. The
current implementation, which uses `clampStateAndIndicateChange` to
compute a union, might not be ideal. If we think from ensure proper
resource allocation perspective, for instance, if one pair specifies a
minimum of 2 waves per eu, and another specifies a minimum of 4, we
should guarantee that 4 waves per eu can be supported, as failing to do
so could result in excessive resource allocation per wave. A similar
principle applies to the upper bound. Thus, the PR uses the following
approach for merging two pairs, `lo_a,up_a` and `lo_b,up_b`: `max(lo_a,
lo_b), max(up_a, up_b)`. This ensures that resource allocation adheres
to the stricter constraints from both inputs.
Fix#123092.
Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments which now
only check whether an arguments is marked inreg to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.
This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.
Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
This performs the minimal replacment of amdgpu-no-agpr to
amdgpu-agpr-alloc=0. Most of the test diffs are due to the new
attribute sorting later alphabetically.
We could do better by trying to perform range merging in the attributor,
and trying to pick non-0 values.
If a function has `amdgpu-flat-work-group-size`, honor it in `initialize` by
taking its value directly; otherwise, it uses the default range as a starting
point. We will no longer manipulate the known range, which can cause issues
because the known range is a "throttle" to the assumed range such that the
assumed range can't get widened properly in `updateImpl` if the known range is
not set properly for whatever reasons. Another benefit of not touching the known
range is, if we indicate pessimistic state, it also invalidates the AA such that
`manifest` will not be called. Since we honor the attribute, we don't want and
will not add any half-baked attribute added to a function.
Although the ABI (if one exists) doesn’t explicitly prohibit
cross-code-object function calls—particularly since our loader can
handle them—such calls are not actually allowed in any of the officially
supported programming models. However, this limitation has some nuances.
For instance, the loader can handle cross-code-object global variables,
which complicates the situation further.
Given this complexity, assuming a closed-world model at link time isn’t
always safe. To address this, this PR introduces an option that enables
this assumption, providing end users the flexibility to enable it for
improved compiler optimizations. However, it is the user’s
responsibility to ensure they do not violate this assumption.
The AMDGPUAnnotateKernelFeatures pass infers the "amdgpu-calls" and
"amdgpu-stack-objects" attributes, which are used to infer whether we
need to initialize flat scratch. This is, however, not precise. Instead,
we should use AMDGPUAttributor and infer amdgpu-no-flat-scratch-init on
kernels. Refer to https://github.com/llvm/llvm-project/issues/63586 .
Even though the Attributor framework will invalidate all its dependent
AAs after the current iteration, a dependent AA can still use the worst
state of a depending AA if it doesn't check the state of the depending
AA in current iteration.