We only emits v_mov_b32/64_dpp. Don't combine t16 instructions with mov
dpp. Update the test inputs to be legal.
It is future work to emit v_mov_b16_dpp, and then update GCNDPPCombine
to combine it with the 16-bit instructions.
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
of MCBinaryExpr with +/- operators and MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.
Examples:
- (SymA - SymB) + C now prints as SymA - SymB + C.
This updates the output of -fexperimental-relative-c++-abi-vtables for
AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8`
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
change primarily affecting AMDGPUMCExpr.
Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead
of s_or and s_xor . This patch is for i1 divergent. The ballot in the
test is added for the retrieval of lanemask. The control flow is needed
because the combiner can't pass through phi instructions.
Added a srl pattern for true16 flow. Changing right shift 16bit to a
reg_sequence
`srl vgpr32, 16 -> reg_sequence (vgpr32.hi16, 0)`
and finally it's lowered to two COPY
`vdst.lo16 = COPY vsrc.hi16`
`vdst.hi16 = COPY 0`
The benefits of this transform is allowing the following pass to
optimize out these copy.
Add tests to various targets covering rotate idioms where an 'ADD' node
is used to combine the halves instead of an 'OR'. Some of these cases
will be better optimized following #125612, while others are already
well optimized or do not have a valid fold to a rotate or funnel-shift.
This is consistent with what's done on GISel. This allows the register
coalescer to remove the redundant intermediate `s_mov_b32` instructions
by using `m0` directly as the result register.
If we cannot find any lds DMA instruction that is aliased by some load
from lds, we will still insert vmcnt(0). This is overly cautious since
handling inter-thread dependences is normally managed by the memory
model instead of the waitcnt pass, so this change updates the behavior
to be more inline with how other types of memory events are handled.
Follow up to e4284a7c70cd "[AMDGPU] 4-align SGPR triples".
Previously TTMP triples like ttmp[3:5] were aligned on a 3-TTMP boundary
which has no basis in hardware.
Aligning them on a 4-TTMP boundary matches what we do for SGPRs, which
reduces the number of extra register classes synthesized by TableGen,
bringing the total number down from 653 to 615.
The lowerKillI1 method wrongly handled cases where it inserted a
new S_BRANCH instruction when the kill was not the only terminator,
and then tried to split the block.
`SI_KILL_I1_TERMINATOR -1,0` doesn't have any effect. Instead of
lowering to an unconditional branch, we remove the instruction and
insert an unconditional branch only if the instruction is the last
terminator. No split is needed in this case (if the last terminator
has been reached, then the whole block was processed).
Also stop generating an unconditional branch in splitBlock: this
branch was redundant since TermMI is promoted to a
terminator that fallsthrough to the next block already.
Solves SWDEV-508819
It is an architectural requirement that there must be no outstanding GDS
instructions when an "always GDS" instruction is issued, and also that
an always GDS instruction must be allowed to complete.
Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to
(unconditionally) any always GDS instruction, and an additional S_NOP if
the subsequent wait was followed by S_ENDPGM.
Always GDS instructions are GWS instructions, DS_ORDERED_COUNT,
DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two as considered
always GDS as of this patch).
The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.
A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
/*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee
This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
is so that we don't risk other passes (particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).
Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).
For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.
---------
Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.
This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)
Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.
Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If
the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0`
before S_ENDPGM, but this may incur some performance cost. Therefore
it's better if the compiler proactively generates that instruction.
This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a
`S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.
This is a patch arising from AMD's fuzzing project.
In the test case, the scheduling algorithm decides to undo an attempted
schedule, but is unprepared to handle bundled instructions at that
point -- and those can arise via the expansion of intrinsics earlier
in compilation. The fix is to use the splice method instead of
remove/insert, since that can handle bundles properly.
Insert the start instruction directly into the map before the uses. This
prevents improperly re-visting sgpr->vgpr phi inputs multiple times
which
would trigger a use after free.
I don't particularly trust the iteration scheme here. This is also
unnecessarily revisting transitive users of a phi or reg_sequence for
every
input operand, but I will address that separately.
Fixes#130646. I also believe it fixes#130119, although that test fails
less consistently for me.
The language reference says about inbounds geps that "if the
getelementptr has any non-zero indices[...] [t]he base pointer has an in
bounds address of the allocated object that it is based on [and]
[d]uring the successive addition of offsets to the address, the
resulting pointer must remain in bounds of the allocated object at each
step."
If (gep inbounds p, (a + 5)) is translated to (gep [inbounds] (gep p,
a), 5) with p pointing to the beginning of an object and a=-4, as the
example in the comments suggests, that's the case for neither of the
resulting geps. Therefore, we need to clear the inbounds flag for both
geps.
We might want to use ValueTracking to check if a is known to be
non-negative to preserve the inbounds flags.
For the AMDGPU tests with scratch instructions, removing the unsound
inbounds flag means that AMDGPUDAGToDAGISel::isFlatScratchBaseLegal sees
no NUW flag at the pointer add, which prevents generation of scratch
instructions with immediate offsets.
For SWDEV-516125.
This represents a hardware mode supported only for wave32 compute
shaders. When enabled, we set the `.dynamic_vgpr_en` field of
`.compute_registers` to true in the PAL metadata.
This will be changed to use an attribute after downstream consumers
have been migrated.
Previously the value created to represent the uninitialized memory
of the alloca was undef. Use freeze poison instead. Enables some
optimization improvements (which need defeating in the limit tests),
but also a few regressions. Seems to leave behind dead code in some
cases too.
This patch replaces use of MachineRegisterInfo's liveIn check with the
machine basicBlock's liveIn. As the MRI's liveIn is inconsistent with
the entry MBB liveIns, when it comes to the machine verifier checks.
PS: Its an alternative solution with respect to #126926.