When using the `amdgcn.init.whole.wave` intrinsic, we add dummy VGPR
arguments with the purpose of preserving their inactive lanes. The
pattern may look something like this:
```
entry:
call amdgcn.init.whole.wave
branch to shader or tail
shader:
$vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
actual code...
tail:
call amdgcn.cs.chain [...], implicit $vInactive
```
We should not report these VGPRs in the `.vgpr_count` metadata. This
patch achieves that goal by ignoring meta instructions and calls. This should
be safe since if those registers are actually used in any other context,
they will be counted there. The same reasoning applies in the general
case, so we don't explicitly check for the existence of `init.whole.wave`.
This is a reworked version of #133242, which was reverted in #144039
and split into smaller bits.
Simplify the code in AMDGPUResourceUsageAnalysis to rely more
on the TargetRegisterInfo for computing the number of used SGPRs and
AGPRs. This is a preliminary refactoring split out from #144855.
(While we could technically use TRI to compute the used number of VGPRs
at this point too, I'm leaving some of the original code in since for
VGPRs we're going to introduce some special cases).
…133242)"
This reverts commit 130080fab11cde5efcb338b77f5c3b31097df6e6 because it
causes issues in testcases similar to coalescer_remat.ll [1], i.e. when
we use a VGPR tuple but only write to its lower parts. The high VGPRs
would then not be included in the vgpr_count, and accessing them would
be an out of bounds violation.
[1]
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/coalescer_remat.ll
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.
In particular, the register counts are used:
1. When launching an entry function - in which case we're safe because
the register counts of the entry function will include the register
counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
because whenever we set the new VGPR allocation we take into account
the outgoing_vgpr_count set by the middle-end.
The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.
---------
Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
Converts AMDGPUResourceUsageAnalysis pass from Module to MachineFunction
pass. Moves function resource info propagation to to MC layer (through
helpers in AMDGPUMCResourceInfo) by generating MCExprs for every
function resource which the emitters have been prepped for.
Fixes https://github.com/llvm/llvm-project/issues/64863
Summary:
Previously this code only looked through a single level of aliases to
find the underlying function. This patch changes it to continue until it
finds the end. Aliases that form a cycle are illegal IR, so we shouldn't
need to worry about infinite loops.
Fixes https://github.com/llvm/llvm-project/issues/96812
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
Factor out and unify some common code that calculates and tracks the
number of user SGRPs.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D159439
There is no need to set a big default stack size for PAL code object indirect
calls. The driver knows the max recursion depth, so it can compute a more
accurate value from the minimum scratch size.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D150609
With architected SGPRs, workgroup IDs are passed into a compute shader
in TTMP registers. Allow for this in AMDGPUResourceUsageAnalysis instead
of failing an assertion.
Differential Revision: https://reviews.llvm.org/D148239
Summary:
This is part of the leftover work for https://reviews.llvm.org/D143138.
In this work, we pass code object version as an argument to initialize target ID
and use it for targetID dump.
Reviewers: arsenm
Differential Revision
https://reviews.llvm.org/D143293
Summary:
This patch introduces a mechanism to check the code object version from the module flag, This avoids checking from command line.
In case the module flag is missing, we use the current default code object version supported in the compiler.
For tools whose inputs are not IR, we may need other approach (directive, for example) to check the code
object version, That will be in a separate patch later.
For LIT tests update, we directly add module flag if there is only a single code object version associated with all checks in one file.
In cause of multiple code object version in one file, we use the "sed" method to "clone" the checks to achieve the goal.
Reviewer: arsenm
Differential Revision:
https://reviews.llvm.org/D14313
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.
Differential Revision: https://reviews.llvm.org/D139828
Makes the SRC_(SHARED|PRIVATE)_(BASE|LIMIT) registers 64 bit instead of 32.
They're still usable as 32 bit operands by using the _LO suffix.
Preparation for D137542
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D137767
This change sets
-amdgpu-assume-{external-call-stack-size | dynamic-stack-object-size}
options to zero by default for code object v5 and later. The runtime is
expected to adjust the scratch size if the amdhsa_uses_dynamic_stack bit
in the kernel descriptor is set.
Differential Revision: https://reviews.llvm.org/D128346
Since 814a0abccefdd2e52b1b507f21ce842b689dbedd, this would break if we
had a function in the module that becomes dead in any codegen IR
pass. The function wasn't deleted since it was initially used in dead
code, but is detached from the call graph and doesn't appear in the PO
traversal. Do a second walk over the module to populate the resources
of any functions which weren't already processed.
On gfx10+ null register can be used as both 32 and 64 bit operand.
Define a 64 bit version of the register to use during codegen.
Differential Revision: https://reviews.llvm.org/D127527
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callees.
When it was refactored into a module pass, this assumption no longer
always holds. This results in calls being erroneously identified as
indirect, and reserving private segment space for them. This results in
significantly slower kernel launch latency.
This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D126025
This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
This is more precise in the face of indirect calls and aliases, still
assuming the call target is defined somewhere in the current module.
This sometimes changes the order the functions are printed, and also
changes the point where context errors are printed relative to
stdout. This also likely has negative consequences for compile time
and memory usage.
Compute the theoretical register budget based on the IR function
signature/attributes, and use the global maximum register budgets for
unknown callees.
This should fix the kernel reported register usage in the presence of
indirect calls. The previous fix in
2b08f6af62afbf32e89a6a392dbafa92c62f7bdf was incorrect becauset it was
only taking the maximum in the known call graph, and missing something
that was either outside of it or codegened later.
This fixes a second case I discovered where calls to aliases also did
not work as expected. CallGraphAnalysis misses these, so functions
called through aliases were not codegened ahead of callers as
expected. CallGraphAnalysis should probably be fixed to understand
this case, and there's likely a bug with IPRA here. This fixes
numerous failures in the conformance test at -O0.
We were previously setting an ignored bit in the kernel headers. The
current behavior is to add the large amount on top of the statically
known size of a single stack frame. I'm not sure if we should just use
the large size as the entire reported size instead.
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.
Reviewed By: yaxunl, #amdgpu
Differential Revision: https://reviews.llvm.org/D109707
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.
Reviewed By: yaxunl, #amdgpu
Differential Revision: https://reviews.llvm.org/D109707
For PS shaders we can use the input SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR
registers
Calculate the number of VGPR registers used as input VGPRs based on these
registers rather than the arguments passed in (this conservatively always
allocates the maximum).
Differential Revision: https://reviews.llvm.org/D101633
Change-Id: Idf7c060cbbd5f7e3300102c55ecee3c07f209de6
First, collect the register usage in each function, then apply the
maximum register usage of all functions to functions with indirect
calls.
This is more accurate than guessing the maximum register usage without
looking at the actual usage.
As before, assume that indirect calls will hit a function in the
current module.
Differential Revision: https://reviews.llvm.org/D105839