If an undef subregister def is live into another block, we need to
maintain a physreg def to track the liveness of those lanes. This
would manifest a verifier error after branch folding, when the cloned
tail block use no longer had a def.
We need to detect interference with other assigned intervals to avoid
clobbering the undef lanes defined in other intervals, since the undef
def didn't count as interference. This is pretty ugly and adds a new
dependency on LiveRegMatrix, keeping it live for one more pass. It also
adds a lot of implicit operand spam (we really should have a better
representation for this).
There is a missing verifier check for this situation. Added an xfailed
test that demonstrates this. We may also be able to revert the changes
in 47d3cbcf842a036c20c3f1c74255cdfc213f41c2.
It might be better to insert an IMPLICIT_DEF before the instruction
rather than using the implicit-def operand.
Fixes#98474
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.
Part of SWDEV-455845.
Both are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, defined by the `MachineLICM`
and `EarlyMachineLICM` classes.
The `PreRegAlloc` flag used to be overwritten it based on MRI.isSSA(),
which is un-reliable due to how it is inferred by the MIRParser. I see
that we can now define isSSA in MIR (thanks @gargaroff ), meaning the
fix isn’t really needed anymore, but redefining that flag still feels
wrong.
Note that I'm looking into upstreaming more changes to MachineLICM, see
[the discourse
thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).
For image_atomic instructions with TFE, in some cases (e.g., when
dmask=3) the disassembler produces dst register with wrong size (e.g.,
image_atomic_smin v5, v1, s[8:15] dmask:0x3 tfe, instead of v[5:7]).
This patch fixes the VDataDwords values for image atomic instructions.
The IR lowering of memcpy/memmove intrinsics uses a target-specific type
for its load/store operations. So far, the loaded and stored addresses
are computed with GEPs based on this type. That is wrong if the
allocation size of the type differs from its store size: The width of
the accesses is determined by the store size, while the GEP stride is
determined by the allocation size. If the allocation size is greater
than the store size, some bytes are not copied/moved.
This patch changes the GEPs to use i8 addressing, with offsets based on
the type's store size. The correctness of the lowering therefore no
longer depends on the type's allocation size.
This is in support of PR #112332, which allows adjusting the memcpy loop
lowering type through a command line argument in the AMDGPU backend.
So far, isValidAddrSpaceCast only allows casts to the flat address
space and between the constant(32) address spaces. It does not allow
casting between the global and constant address spaces, even though they
alias. That affects, e.g., the lowering of memmoves from the constant to
the global address space in LowerMemIntrinsics, since that requires
aliasing address spaces to be castable.
This patch relaxes isValidAddrSpaceCast and allows such casts. It also
includes a memmove test that would crash with the previous
implementation because the memmove IR lowering would not be
applicable for the move from constant AS to global AS.
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
IMPLICIT_DEF inserted for a wwm-register at the
very first block or the predecessor block where
it is used for sgpr spilling can appear at a block
begin that requires spill-insertion during per-lane
VGPR regalloc phase. The presence of the IMPLICIT_DEF
currently breaks the BB prolog.
Fixes: SWDEV-490717
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
fp conversion V_CVT_F_F/V_CVT_F_U instructions true16 format were
previously implemented using fake16 profile.
With the MC support inplace, correct and support these instructions in
true16/fake16 format in CodeGen
Modify VOP3 profile and pesudo, and add encoding info for VOP3 True16
including DPP and DPP8 in true16 and fake16 format.
This patch applies true16/fake16 changes and asm/dasm changes to
V_ADD_NC_U16
V_ADD_NC_I16
V_SUB_NC_U16
V_SUB_NC_I16
Combine is needed to clear redundant ANDs with 1 that will be
created by reg-bank-select to clean-up high bits in register.
Fix replaceRegWith from CombinerHelper:
If copy had to be inserted, first create copy then delete MI.
If MI is deleted first insert point is not valid.
LDV could reorder reinserted fragment and non-fragment debug values for
the same variable (compared to the input order), potentially resulting
in stale values being presented.
For example, before:
DBG_VALUE 1001, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 0, 16)
DBG_VALUE 1002, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 16, 16)
DBG_VALUE %0, $noreg, !13, !DIExpression()
After (without this patch):
DBG_VALUE %stack.0, 0, !13, !DIExpression()
DBG_VALUE 1002, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 16, 16)
DBG_VALUE 1001, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 0, 16)
It would also reorder DBG_VALUEs for different variables. Although that
does not matter for the debug information output, it resulted in some
noise in before/after pass diffs.
This should hopefully align so that instruction referencing and
DBG_VALUE emit debug instructions in the same order (see the
sdag-salvage-add.ll change).
- V_SET_INACTIVE is always in WWM/WQM so can be treated like any other
operation in WWM/WQM.
- After encountering SI_SPILL_S32_TO_VGPR loop should bypass to avoid
double processing its defs.
[MIR] Serialize virtual register flags
This introduces target-specific vreg flag serialization. Flags are represented as `uint8_t` and the `TargetRegisterInfo` override provides methods `getVRegFlagValue` to deserialize and `getVRegFlagsOfReg` to serialize.
Both input and output of ballot are lane-masks:
result is lane-mask with 'S32/S64 LLT and SGPR bank'
input is lane-mask with 'S1 LLT and VCC reg bank'.
Ballot copies bits from input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in result for inactive lanes
for non-constant input.
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.
Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas, before, code assumed that the latter implies the
former at some places.
Part of SWDEV-455845.
Currently `AAAddressSpace` relies on identifying the address spaces of
all underlying objects. However, it might infer sub-optimal address
space when the underlying object is a function argument. In
`AMDGPUPromoteKernelArgumentsPass`, the promotion of a pointer kernel
argument is by adding a series of `addrspacecast` instructions (as shown
below), and hoping `InferAddressSpacePass` can pick it up and do the
rewriting accordingly.
Before promotion:
```
define amdgpu_kernel void @kernel(ptr %to_be_promoted) {
%val = load i32, ptr %to_be_promoted
...
ret void
}
```
After promotion:
```
define amdgpu_kernel void @kernel(ptr %to_be_promoted) {
%ptr.cast.0 = addrspace cast ptr % to_be_promoted to ptr addrspace(1)
%ptr.cast.1 = addrspace cast ptr addrspace(1) %ptr.cast.0 to ptr
# all the use of %to_be_promoted will use %ptr.cast.1
%val = load i32, ptr %ptr.cast.1
...
ret void
}
```
When `AAAddressSpace` analyzes the code after promotion, it will take
`%to_be_promoted` as the underlying object of `%ptr.cast.1`, and use its
address space (which is 0) as its final address space, thus simply do
nothing in `manifest`. The attributor framework will them eliminate the
address space cast from 0 to 1 and back to 0, and replace `%ptr.cast.1`
with `%to_be_promoted`, which basically reverts all changes by
`AMDGPUPromoteKernelArgumentsPass`.
IMHO I'm not sure if `AMDGPUPromoteKernelArgumentsPass` promotes the
argument in a proper way. To improve the handling of this case, this PR
adds an extra handling when iterating over all underlying objects. If an
underlying object is a function argument, it means it reaches a terminal
such that we can't futher deduce its underlying object further. In this
case, we check all uses of the argument. If they are all `addrspacecast`
instructions and their destination address spaces are same, we take the
destination address space.
Fixes: SWDEV-482640.
This adds `-disable-gisel-legality-check` to some gfx6 and gfx7 test
lines to prevent behavior mismatches between debug and release builds
The first attempted reapply was #111059
This reverts commit e075dcf7d270fd52dc837163ff24e8c872dfeb49.
Trying to codegen these on targets without the instructions should
fail to select. Not sure if all the predicates are correct. We had
a fake one disconnected to a feature which was always true.
Fixes: SWDEV-482274
This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings.
This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.
This was using addrspace 0 and 1 pointers interchangably. This works
out since they happen to use the same size, but consistently query
or use the correct one.
The current `selectVOP3PMadMixModsImpl` can produce `V_MAD_FIX_F32`
instruction
that violates constant bus restriction, while its `SelectionDAG`
counterpart
doesn't. The culprit is in the copy stripping while the `SelectionDAG`
version
only has a bitcast stripping. This PR simply aligns the two version.
With #93526 we split the regalloc pipeline further
to have a standalone allocation for wwm registers
and per-lane VGPRs. Currently the presence of the
wwm-spill reloads inserted at the bb-top limits the
isBasicPrologue function during the per-lane vgpr
regalloc to skip past the exec manipulation instruction
and ended up causing incorrect codegen. The wmm-spill
inserted during the wwm-regalloc pipeline should also
be included in the bb-prolog so that the per-lane vgpr
regalloc pipeline can identify the appropriate insertion
points for their spills and copies.
Hack around the register scavenger doing the wrong thing.
It does not find the result register as available in the
case the frame index add isn't also reading the dest register.
This is the quick fix for a regression where the scavenge would
create a broken spill of SGPR to memory. I believe this is still
broken for cases we cannot use the result register.
I'm confused about what position the scavenger iterator
is supposed to be in, and what RestoreAfter is for. The scavenger
is missing a full set of forward/backward APIs and there seems
to be an off by one somewhere.