Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPRs operations for wwm-operations in a small range
leading to unwantedly clobbering their inactive lanes
causing correctness issues that are hard to trace.
This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
This reverts commit
7792b4ae79.
The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).
...if only we had a merge queue.
Reverts llvm/llvm-project#108173
si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.
As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):
```
entry:
%func_exec = call i1 @llvm.amdgcn.init.whole.wave()
br i1 %func_exec, label %func, label %tail
func:
; ... stuff that should run with the actual EXEC mask
br label %tail
tail:
; ... stuff that runs with all the lanes enabled;
; can contain more than one basic block
```
It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).
The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
We can't assume closed world even in full LTO post-link stage. It is
only true
if we are building a "GPU executable". However, AMDGPU does support
"dyamic
library". I'm not aware of any approach to tell if it is relocatable
link when
we create the pass. For now let's revert the patch as it is currently
breaking things.
We can re-enable it once we can handle it correctly.
This PR introduces new pass "amdgpu-sw-lower-lds".
This pass lowers the local data store, LDS, uses in kernel and
non-kernel functions in module to use dynamically allocated global
memory. Packed LDS Layout is emulated in the global memory.
The lowered memory instructions from LDS to global memory are then
instrumented for address sanitizer, to catch addressing errors.
This pass only work when address sanitizer has been enabled and has
instrumented the IR. It identifies that IR has been instrumented using
"nosanitize_address" module flag.
For a kernel, LDS access can be static or dynamic which are direct
(accessed within kernel) and indirect (accessed through non-kernels).
**Replacement of Kernel LDS accesses:**
- All the LDS accesses corresponding to kernel will be packed together,
where all static LDS accesses will be allocated first and then dynamic
LDS follows. The total size with alignment is calculated. A new LDS
global will be created for the kernel called "SW LDS" and it will have
the attribute "amdgpu-lds-size" attached with value of the size
calculated. All the LDS accesses in the module will be replaced by GEP
with offset into the "Sw LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
the dynamic LDS. This will be marked used by kernel and will have
MD_absolue_symbol metadata set to total static LDS size, Since dynamic
LDS allocation starts after all static LDS allocation.
- A device global memory equal to the total LDS size will be allocated.
At the prologue of the kernel, a single work-item from the work-group,
does a "malloc" and stores the pointer of the allocation in "SW LDS". To
store the offsets corresponding to all LDS accesses, another global
variable is created which will be called "SW LDS metadata" in this pass.
- **SW LDS:**
It is LDS global of ptr type with name
"llvm.amdgcn.sw.lds.<kernel-name>".
- **SW LDS Metadata:**
It is of struct type, with n members. n equals the number of LDS globals
accessed by the kernel(direct and indirect). Each member of struct is
another struct of type {i32, i32, i32}. First member corresponds to
offset, second member corresponds to size of LDS global being replaced
and third represents the total aligned size. It will have name
"llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an
intializer with static LDS related offsets and sizes initialized. But
for dynamic LDS related entries, offsets will be intialized to previous
static LDS allocation end offset. Sizes for them will be zero initially.
These dynamic LDS offset and size values will be updated with in the
kernel, since kernel can read the dynamic LDS size allocation done at
runtime with query to "hidden_dynamic_lds_size" hidden kernel argument.
- At the epilogue of kernel, allocated memory would be made free by the
same single work-item.
**Replacement of non-kernel LDS accesses:**
- Multiple kernels can access the same non-kernel function. All the
kernels accessing LDS through non-kernels are sorted and assigned a
kernel-id. All the LDS globals accessed by non-kernels are sorted.
- This information is used to build two tables:
- **Base table:**
Base table will have single row, with elements of the row placed as per
kernel ID. Each element in the row corresponds to ptr of "SW LDS"
variable created for that kernel.
- **Offset table:**
Offset table will have multiple rows and columns. Rows are assumed to be
from 0 to (n-1). n is total number of kernels accessing the LDS through
non-kernels. Each row will have m elements. m is the total number of
unique LDS globals accessed by all non-kernels. Each element in the row
correspond to the ptr of the replacement of LDS global done by that
particular kernel.
- A LDS variable in non-kernel will be replaced based on the information
from base and offset tables. Based on kernel-id query, ptr of "SW LDS"
for that corresponding kernel is obtained from base table. The Offset
into the base "SW LDS" is obtained from corresponding element in offset
table. With this information, replacement value is obtained.
This breaks using -passes=atomic-expand (but only sometimes?).
Somehow an AtomicExpand pass ends up running without a TargetMachine,
despite always being constructed with one.
This is not complete, but gets AtomicExpand running. I was able
to get further than I expected; we're quite close to having all
the IR codegen passes ported.
This specific callback should now be at parity with the old
pass manager version. There are still some missing IR passes
before this point.
Also I don't understand the need for the RequiresAnalysisPass at the
end. SelectionDAG should just be using the uncached getResult?
This was much more difficult than I anticipated. The pass is
not in a good state, with poor test coverage. The legacy PM
does seem to be relying on maintaining the map state between
different SCCs, which seems bad. The pass is going out of its
way to avoid putting the attributes it introduces onto non-callee
functions. If it just added them, we could use them directly
instead of relying on the map, I would think.
The NewPM path uses a ModulePass; I'm not sure if we should be
using CGSCC here but there seems to be some missing infrastructure
to support backend defined ones.