This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.
By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
Since 3494ee95902cef62f767489802e469c58a13ea04 APInt stopped to
implicitly truncate values, therefore it asserts on a big signed value
converted to (implicitly) unsigned APInt.
The change explicitly marks offset as a signed value.
This is so we can try to make use of v_pk_mov_b32 when available.
Note this currently has little observable effect. The combiner
will undo the common extract of shuffle pattern. The lack
of test changes should demonstrate this change is minimally
correct.
We should probably try to make better use of wider extracts in
even aligned cases, but I'm trying to avoid some really ugly
regalloc regressions in some MFMA tests. The DAG scheduler ends
up doing a worse job if we use vector extracts, resulting
in failure to do 3 address conversion of MFMAs.
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.
Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
As part of the "RemoveDIs" work to eliminate debug intrinsics, we're
replacing methods that use Instruction*'s as positions with iterators. A
number of these (such as getFirstNonPHIOrDbg) are sufficiently
infrequently used that we can just replace the pointer-returning version
with an iterator-returning version, hopefully without much/any
disruption.
Thus this patch has getFirstNonPHIOrDbg and
getFirstNonPHIOrDbgOrLifetime return an iterator, and updates all
call-sites. There are no concerns about the iterators returned being
converted to Instruction*'s and losing the debug-info bit: because the
methods skip debug intrinsics, the iterator head bit is always false
anyway.
A bulk commit of true16 support for v_cmpx_xx_f16 instructions
including:
v_cmpx_f_f16
v_cmpx_le_f16
v_cmpx_gt_f16
v_cmpx_lg_f16
v_cmpx_ge_f16
v_cmpx_o_f16
v_cmpx_u_f16
v_cmpx_nge_f16
v_cmpx_nlg_f16
v_cmpx_ngt_f16
v_cmpx_nle_f16
v_cmpx_neq_f16
v_cmpx_nlt_f16
v_cmpx_t_f16
v_cmpx_eq_f16 is not in this patch and will be added in the following
patch
Currently, the AMDGPU backend bumps the Stack Pointer
by fixed size offsets in the prolog of device functions, and
restores it by the same amount in the epilog.
Prolog:
sp += frameSize
Epilog:
sp -= frameSize
If a function has dynamic stack realignment,
Prolog:
sp += frameSize + max_alignment
Epilog:
sp -= frameSize + max_alignment
These calculations are not optimal in case of dynamic
stack realignment, and completely fail in case of
dynamic stack readjustment.
This patch uses the saved Frame Pointer to restore SP.
Prolog:
fp = sp
sp += frameSize
Epilog:
sp = fp
In case of dynamic stack realignment, SP is restored from
the saved Base Pointer.
Prolog:
fp = sp + (max_alignment - 1)
fp = fp & (-max_alignment)
bp = sp
sp += frameSize + max_alignment
Epilog:
sp = bp
(Note: The presence of BP has been enforced in case of any
dynamic stack realignment.)
---------
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Add IDs for bit width that cover multiple LLTs: B32 B64 etc.
"Predicate" wrapper class for bool predicate functions used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads to not change existing mir tests from old
regbankselect.
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"
Given LLTs on all operands after legalizer, some register bank
assignments require lowering while other do not.
Note: cases where all register bank assignments would require lowering
are lowered in legalizer.
AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- Goal of defining Rules it to provide high level table-like brief
overview of how to lower generic instructions based on available
target features and uniformity info (uniform vs divergent).
- Fast search of Rules, depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
all the same just for different combinations of types and banks.
Write custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
Names of IDs are meant to give brief description what lowering does
for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements lowering algorithms
Since this is the first patch that actually enables -new-reg-bank-select
here is the summary of regression tests that were added earlier:
- if instruction is uniform always select SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
is defined as uniform(AMDGPURegBankSelect) - missing temporal
divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
fixed in AMDGPURegBankSelect via another fix in machine uniformity
analysis.
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection all register operand of
G_ instructions have LLT and register banks exclusively. If they had
register class, reassign appropriate register bank.
Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non S1, values
Vcc - divergent S1 values(lane masks)
AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.
Exceptions when uniformity analysis does not work:
S32/S64 lane masks:
- need to end up with sgpr register class after instruction selection
- In most cases Uniformity analysis declares them as uniform
(forced by tablegen) resulting in sgpr S32/S64 reg bank
- When Uniformity analysis declares them as divergent (some phis),
use intrinsic lane mask analyzer to still assign sgpr register bank
temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside of the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that assigned bank is vgpr
Note: uniformity analysis does not consider that registers with vgpr def
are divergent (you can have uniform value in vgpr).
- TODO: implicit use of $exec could be implemented as indicator
that instruction is divergent
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.
Fixes#118220.
Fixes#123800
Extends LDS lowering by allowing it to discover transitive
indirect/escpaing references to LDS GVs.
For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8
%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8
define amdgpu_kernel void @offloading_kernel() {
store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
call void @call_unknown()
ret void
}
define void @call_unknown() {
%1 = alloca ptr, align 8
%2 = call i32 %1()
ret void
}
define void @indirectly_load_lds() {
call void @directly_load_lds()
ret void
}
define void @directly_load_lds() {
%2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
ret void
}
```
With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).
Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>
This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.
See the draft PR #123221 for an attempt to fix the actual issue.
---------
Co-authored-by: Frederik Harwath <fharwath@amd.com>
This patch is in preparation to enable setting the MachineInstr::MIFlag
flags, i.e. FrameSetup/FrameDestroy, on callee saved register
spill/reload instructions in prologue/epilogue. This eventually helps in
setting the prologue_end and epilogue_begin markers more accurately.
The DWARF Spec in "6.4 Call Frame Information" says:
The code that allocates space on the call frame stack and performs the
save
operation is called the subroutine’s prologue, and the code that
performs
the restore operation and deallocates the frame is called its epilogue.
which means the callee saved register spills and reloads are part of
prologue (a.k.a frame setup) and epilogue (a.k.a frame destruction),
respectively. And, IIUC, LLVM backend uses FrameSetup/FrameDestroy flags
to identify instructions that are part of call frame setup and
destruction.
In the trunk, while most targets consistently set
FrameSetup/FrameDestroy on save/restore call frame information (CFI)
instructions of callee saved registers, they do not consistently set
those flags on the actual callee saved register spill/reload
instructions.
I believe this patch provides a clean mechanism to set
FrameSetup/FrameDestroy flags on the actual callee saved register
spill/reload instructions as needed. And, by having default argument of
MachineInstr::NoFlags for Flags, this patch is a NFC.
With this patch, the targets have to just pass FrameSetup/FrameDestroy
flag to the storeRegToStackSlot/loadRegFromStackSlot calls from the
target derived spillCalleeSavedRegisters and restoreCalleeSavedRegisters
to set those flags on callee saved register spill/reload instructions.
Also, this patch makes it very easy to set the source line information
on callee saved register spill/reload instructions which is needed by
the DwarfDebug.cpp implementation to set prologue_end and epilogue_begin
markers more accurately.
As per DwarfDebug.cpp implementation:
prologue_end is the first known non-DBG_VALUE and non-FrameSetup
location
that marks the beginning of the function body
epilogue_begin is the first FrameDestroy location that has been seen in
the
epilogue basic block
With this patch, the targets have to just do the following to set the
source line information on callee saved register spill/reload
instructions, without hampering the LLVM's efforts to avoid adding
source line information on the artificial code generated by the
compiler.
<Foo>InstrInfo::storeRegToStackSlot() {
...
DebugLoc DL =
Flags & MachineInstr::FrameSetup ? DebugLoc() : MBB.findDebugLoc(I);
...
}
<Foo>InstrInfo::loadRegFromStackSlot() {
...
DebugLoc DL =
Flags & MachineInstr::FrameDestroy ? MBB.findDebugLoc(I) : DebugLoc();
...
}
While I understand this patch would break out-of-tree backend builds, I
think it is in the right direction.
One immediate use case that can benefit from this patch is fixing
#120553 becomes simpler.
This patch fixes:
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2792:14: error: comparison of
integers of different signs: 'unsigned int' and 'int'
[-Werror,-Wsign-compare]
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2797:14: error: comparison of
integers of different signs: 'unsigned int' and 'int'
[-Werror,-Wsign-compare]
This prevents legacy PM from mistakenly removing these analyses if
`SILowerWWMCopies` is the last user of them. (it removes dead analyses
after its last use)