SIMachineFunctionInfo has a scan of the function body for inline asm
which may use AGPRs, or callees in SIMachineFunctionInfo. Move this
into the attributor, so it actually works interprocedurally.
Could probably avoid most of the test churn if this bothered to avoid
adding this on subtargets without AGPRs. We should also probably
try to delete the MIR scan in usesAGPRs but it seems to be trickier
to eliminate.
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.
[1]
34c8b835b1
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.
This patch changes the spilling strategy by picking the VGPRs in the
reverse order, the highest available VGPR first and later after regalloc
shift them back to the lowest available range. With that, the initial
VGPRs would be available for allocation and possibility
of finding large number of contiguous registers will be more.
Physical VGPRs used for SGPR spills need to be tracked independent of
WWM reserved registers. The WWM reserved set contains extra registers
allocated during WWM pre-allocation pass.
This causes SGPR spills allocated after WWM pre-allocation to overlap
with WWM register usage, e.g. if frame pointer is spilt during
prologue/epilog insertion.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
Introduce a helper to check if a function is at the bottom of the stack,
i.e. if it's an entry function or a chain function.
This was suggested in #71913.
Switch to using immediate offsets instead of the SP register to access
objects on the current stack frame in chain functions. This means we no
longer need to reserve a SP register just for accesing stack objects and
it also allows us to set the SP (when one is actually needed) to the
stack size from the very beginning.
This only works if we use a FixedObject for the ScavengeFI, which is
what we do for entry functions anyway (and we generally want to keep
chain functions close to amdgpu_cs behaviour where we don't have a good
reason to diverge).
Teach prolog epilog insertion how to handle functions with the
amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions.
For amdgpu_cs_chain functions, we only need to preserve the inactive
lanes of VGPRs above v8, and only in the presence of calls via
@llvm.amdgcn.cs.chain.
For amdgpu_cs_chain_preserve functions, we will also need to preserve
the active lanes for registers above the last argument VGPR. AFAICT
there's no direct way to find out what the last argument VGPR is, so
instead the patch uses the fact that chain calls from
amdgpu_cs_chain_preserve functions can't use more VGPRs than the
caller's VGPR arguments. In other words, it removes the operands of
SI_CS_CHAIN_TC instructions from the list of callee saved registers.
For both calling conventions, registers v0-v7 never need to be saved and
restored, so we should never add them as WWM spills.
Differential Revision: https://reviews.llvm.org/D156412
This patch adds the DAG isel changes for kernel argument preloading.
These changes are not usable with older firmware but subsequent patches
in the series will make the codegen backwards compatible. This patch
should only be submitted alongside that subsequent patch.
Preloading here begins from the start of the kernel arguments until the
amount of arguments indicated by the CL flag
amdgpu-kernarg-preload-count.
Aggregates and arguments passed by-ref are not supported.
Special care for the alignment of the kernarg segment is needed as well
as consideration of the alignment of addressable SGPR tuples when we
cannot directly use misaligned large tuples that the arguments are
loaded to.
Reviewed By: bcahoon
Differential Revision: https://reviews.llvm.org/D158579
Factor out and unify some common code that calculates and tracks the
number of user SGRPs.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D159439
Lower formal arguments and returns for functions with the
`amdgpu_cs_chain` and `amdgpu_cs_chain_preserve` calling conventions:
* Put `inreg` arguments into SGPRs, starting at s0, and other arguments
into VGPRs, starting at v8. No arguments should end up on the stack, if
we don't have enough registers we should error out.
* Lower the return (which is always void) as an S_ENDPGM.
* Set the ScratchRSrc register to s48:51, as described in the docs.
* Set the SP to s32, matching amdgpu_gfx. This might be revisited in a
future patch.
Differential Revision: https://reviews.llvm.org/D153517
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.
The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.
Differential Revision: https://reviews.llvm.org/D124196
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.
This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D143759
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavanger may not find available registers, which causes a “did not find
scavenging index” assert to occur in assignRegToScavengingIndex.
In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.
Patch by Corbin Robeck. Thanks!
Differential Review: https://reviews.llvm.org/D149775
Currently, the codegen support for llvm.amdgcn.workgroup.id*
intrinsics are enabled only for compute kernels. In addition,
this patch enables their selection for compute shaders on
subtargets that have architected SGPRs.
Differential Revision: https://reviews.llvm.org/D145045
This fixes what I consider to be an API flaw I've tripped over
multiple times. The point this is constructed isn't well defined, so
depending on where this is first called, you can conclude different
information based on the MachineFunction. For example, the AMDGPU
implementation inspected the MachineFrameInfo on construction for the
stack objects and if the frame has calls. This kind of worked in
SelectionDAG which visited all allocas up front, but broke in
GlobalISel which hasn't visited any of the IR when arguments are
lowered.
I've run into similar problems before with the MIR parser and trying
to make use of other MachineFunction fields, so I think it's best to
just categorically disallow dependency on the MachineFunction state in
the constructor and to always construct this at the same time as the
MachineFunction itself.
A missing feature I still could use is a way to access an custom
analysis pass on the IR here.
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.
This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124196
In general, a callee is free to use a scratch register without
preserving its previous state. However, the VGPR used for SGPR
spilling can potentially have its inactive lanes overwritten by
the writelane instructions. When the function returns, it can
cause unexpected behavior if the VGPR value is not preserved
appropriately.
The current scheme to preserve the inactive lanes of such
scratch VGPRs is not done rightly. It preserves all lanes
and causes the outgoing values (if any) getting overwritten
by the epilog restores. It then corrupts the return value.
To avoid such situation with scratch VGPRs, this patch ensures
we preserve only their inactive lanes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134526
There is a lot of customization and eventually code duplication in the
frame lowering that handles special SGPR spills like the one needed for
the Frame Pointer. Incorporating any additional SGPR spill currently
makes it difficult during PEI. This patch introduces a new spill builder
to efficiently handle such spill requirements. Various spill methods are
special handled using a separate class.
Reviewed By: sebastian-ne, scott.linder
Differential Revision: https://reviews.llvm.org/D132436
SILowerSGPRSpills pass handles the lowering of SGPR spills
into VGPR lanes. Some SGPR spills are handled later during
PEI. There is a common function used in both places to find
the free VGPR lane. This patch eliminates that dependency to
find the free VGPR by handling it separately for PEI. It is a
prerequisite patch for a future work to allow SGPR spills to
virtual VGPR lanes during SILowerSGPRSpills.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124195
The custom VGPR spills inserted during frame lowering
maintain a separate list for WWM reserved registers.
Added them into WWMSpills that already tracks such
reserved registers. It unifies the spill insertion.
Reviewed By: nhaehnle, arsenm
Differential Revision: https://reviews.llvm.org/D124193
Since the writelane instruction used for SGPR spills can
modify inactive lanes, the callee must preserve the VGPR
this instruction modifies even if it was marked Caller-saved.
Reviewed By: arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D124192
Presently, there is an issue on MI100 (and probably other architecture) where the VGPR used for AGPR copies clobbers VGPR used for AGPR spill. AFAICT this is because in processFunctionBeforeFrameIndicesReplaced we think the VGPR register for AGPR spill is unused. This patch aims to correct that. This is a WIP while I work out issues with producing a good test. For now, I'm curious if this is generally a good / bad idea.
Differential Revision: https://reviews.llvm.org/D139673
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
Following up on the removal of BufferPSV in commit 43b86bf992 ("AMDGPU:
Remove BufferPseudoSourceValue")
It is unclear what exactly the right address space for images should be.
They seem morally closest to buffers, so that's what I went with. In
practical terms, address space 7 is better than address space 0 because
it can't alias with LDS.
Differential Revision: https://reviews.llvm.org/D138949
The use of a PSV for buffer intrinsics is misleading because it may be
misinterpreted as all buffer intrinsics accessing the same address in
memory, which is clearly not true.
Instead, build MachineMemOperands without a pointer value but with an
address space, so that address space-based alias analysis can still
work.
There is a lot of test churn because previously address space 4
(constant address space) was used as an address space for buffer
intrinsics. This doesn't make much sense and seems to have been an
accident -- see the change in
AMDGPUTargetMachine::getAddressSpaceForPseudoSourceKind.
Differential Revision: https://reviews.llvm.org/D138711
Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.
There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.
It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.
The intent is to emit a __const array of LDS addresses and index into it.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D125060
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to lowest unused VGPR if the one
exist.
Fixes SWDEV-330006.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123525
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.
I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.