Combine two G_PTR_ADDs, but keep the register bank of the constant.
That way, the combine can be used in post-regbank-select combines.
Introduce two helper methods in CombinerHelper, getRegBank and
setRegBank that get and set an optional register bank to a register.
That way, they can be used before and after register bank selection.
Differential Revision: https://reviews.llvm.org/D103326
AMDGPU normally spills SGPRs to VGPRs. Previously, since all register
classes are handled at the same time, this was problematic. We don't
know ahead of time how many registers will be needed to be reserved to
handle the spilling. If no VGPRs were left for spilling, we would have
to try to spill to memory. If the spilled SGPRs were required for exec
mask manipulation, it is highly problematic because the lanes active
at the point of spill are not necessarily the same as at the restore
point.
Avoid this problem by fully allocating SGPRs in a separate regalloc
run from VGPRs. This way we know the exact number of VGPRs needed, and
can reserve them for a second run. This fixes the most serious
issues, but it is still possible using inline asm to make all VGPRs
unavailable. Start erroring in the case where we ever would require
memory for an SGPR spill.
This is implemented by giving each regalloc pass a callback which
reports if a register class should be handled or not. A few passes
need some small changes to deal with leftover virtual registers.
In the AMDGPU implementation, a new pass is introduced to take the
place of PrologEpilogInserter for SGPR spills emitted during the first
run.
One disadvantage of this is currently StackSlotColoring is no longer
used for SGPR spills. It would need to be run again, which will
require more work.
Error if the standard -regalloc option is used. Introduce new separate
-sgpr-regalloc and -vgpr-regalloc flags, so the two runs can be
controlled individually. PBQB is not currently supported, so this also
prevents using the unhandled allocator.
Adds legalizer, register bank select, and instruction
select support for G_SBFX and G_UBFX. These opcodes generate
scalar or vector ALU bitfield extract instructions for
AMDGPU. The instructions allow both constant or register
values for the offset and width operands.
The 32-bit scalar version is expanded to a sequence that
combines the offset and width into a single register.
There are no 64-bit vgpr bitfield extract instructions, so the
operations are expanded to a sequence of instructions that
implement the operation. If the width is a constant,
then the 32-bit bitfield extract instructions are used.
Moved the AArch64 specific code for creating G_SBFX to
CombinerHelper.cpp so that it can be used by other targets.
Only bitfield extracts with constant offset and width values
are handled currently.
Differential Revision: https://reviews.llvm.org/D100149
This allows to convert the add instruction to s_addk_i32 and
v_add_nc_u32 instead of needing v_add_co_u32 when converting to a VALU
instruction.
Differential Revision: https://reviews.llvm.org/D103322
The temporary register is only used to compute the frame pointer.
The frame pointer is overwritten and not used in between, so we
can reuse the frame pointer for the computation, saving one register.
Differential Revision: https://reviews.llvm.org/D95865
Support for XNACK and SRAMECC is not static on some GPUs. We must be able
to differentiate between different scenarios for these dynamic subtarget
features.
The possible settings are:
- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off
- On: Request support for XNACK/SRAMECC On
GCNSubtarget will track the four options based on the following criteria. If
the subtarget does not support XNACK/SRAMECC we say the setting is
"Unsupported". If no subtarget features for XNACK/SRAMECC are requested we
must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the
feature string when initializing the subtarget, the settings are "On/Off".
The defaults are updated to be conservatively correct, meaning if no setting
for XNACK or SRAMECC is explicitly requested, defaults will be used which
generate code that can be run anywhere. This corresponds to the "Any" setting.
Differential Revision: https://reviews.llvm.org/D85882
This does unfortunately end up with extra waitcnts getting inserted
that were avoided before. Ideally we would avoid the spills of these
undef components in the first place.
This reverts commit ca907bfb57d8ad3ec3bcc2cff2abab7b1b933af6.
According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
When memory operations are outstanding on function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to be executed, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions that may not be needed as no memory operations
are outstanding.
For function calls, this is already implemented. The same principal
applies to returns: If the caller inserts a waitcnt after the call, the
callee does not have to wait and the return and memory operation can be
run in parallel.
This commit implements waiting in the caller after returning from a
function call.
Differential Revision: https://reviews.llvm.org/D87674
Clustering loads has caching benefits, but as far as I know there is no
advantage to clustering stores on any AMDGPU subtargets.
The disadvantage is that it tends to increase register pressure and
restricts scheduling freedom.
Differential Revision: https://reviews.llvm.org/D85530
If it is load cluster, we don't need to create the dependency edges(SUb->reg) from SUb to SUa
as they both depend on the base register "reg"
+-------+
+----> reg |
| +---+---+
| ^
| |
| |
| |
| +---+---+
| | SUa | Load 0(reg)
| +---+---+
| ^
| |
| |
| +---+---+
+----+ SUb | Load 4(reg)
+-------+
But if it is store cluster, we need to create it as follow shows to avoid the instruction store
depend on scheduled in-between SUb and SUa.
+-------+
+----> reg |
| +---+---+
| ^
| | Missing +-------+
| | +-------------------->+ y |
| | | +---+---+
| +---+-+-+ ^
| | SUa | Store x 0(reg) |
| +---+---+ |
| ^ |
| | +------------------------+
| | |
| +---+--++
+----+ SUb | Store y 4(reg)
+-------+
Reviewed By: evandro, arsenm, rampitec, foad, fhahn
Differential Revision: https://reviews.llvm.org/D72031
Use the same basic strategy as LegalizeVectorTypes. Try to index into
smaller pieces if there's a constant index, and otherwise fall back to
a stack temporary.