`getLit64Encoding` uses a different approach to determine whether 64-bit
literal encoding is used, which caused a size mismatch between the
`MachineInstr` and the `MCInst`.
For `!isValid32BitLiteral`, it is effectively `!(isInt<32>(Val) ||
isUInt<32>(Val))`, which is `!isInt<32>(Val) && !isUInt<32>(Val)`, but
in `getLit64Encoding`, it is `!isInt<32>(Val) || !isUInt<32>(Val)`.
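The gap between the two conditions can be checked with a small standalone sketch. `isInt32` and `isUInt32` are local stand-ins for `llvm::isInt<32>` and `llvm::isUInt<32>`, and the two predicate names are hypothetical labels for the conditions quoted above, not the actual LLVM functions.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for llvm::isInt<32> / llvm::isUInt<32> (assumption: the
// value is held as a signed 64-bit integer).
constexpr bool isInt32(int64_t V) { return V >= INT32_MIN && V <= INT32_MAX; }
constexpr bool isUInt32(int64_t V) { return V >= 0 && V <= UINT32_MAX; }

// Condition equivalent to !isValid32BitLiteral: neither 32-bit form fits.
constexpr bool needs64BitLiteral(int64_t V) {
  return !isInt32(V) && !isUInt32(V);
}

// Condition described for getLit64Encoding: at least one form does not fit.
constexpr bool mismatchedCheck(int64_t V) {
  return !isInt32(V) || !isUInt32(V);
}
```

The two predicates disagree for any value that fits exactly one of the two 32-bit forms, for example `0xFFFFFFFF` (unsigned only) or `-1` (signed only). Those are the values for which the two size computations would differ.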
The definition of the SSrc_b32 operand of V_INDIRECT_REG_READ_GPR_IDX_B32_V*
allows immediates, but the expansion logic currently handles only
registers. Expansion can therefore fail when, for example,
llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used
as an insertelement index.
- Add `MONonVolatile` MachineMemOperand flag.
- Set nv=1 on memory operations on GFX12.5 if the operation accesses a
constant address space, is an invariant load, or has the `MONonVolatile`
flag set.
In some of our use cases, the GPU runtime stores some data at the top of
the stack. It figures out where it's safe to store it by using the PAL
metadata generated by the backend, which includes the total stack size.
However, the metadata does not include the space reserved at the bottom
of the stack for the trap handler when CWSR is enabled in dynamic VGPR
mode. This space is reserved dynamically based on whether or not the
code is running on the compute queue. Therefore, the runtime needs a way
to take that into account.
Add support for `llvm.sponentry`, which should return the base of the
stack, skipping over any reserved areas. This allows us to keep this
computation in one place rather than duplicate it between the backend
and the runtime.
The implementation for functions that set up their own stack uses a
pseudo that is expanded to the same code sequence as that used in the
prolog to set up the stack in the first place.
In callable functions, we generate a fixed stack object and use that
instead, similar to the Arm/AArch64 approach. This wastes some stack
space but that's not a problem for now because we're not planning to use
this in callable functions yet.
This patch implements a custom printer/parser for the immediate operand
of s_wait_alu that prints/parses the decoded counter values.
Format:
```
.<counter1>_<value1>_<counter2>_<value2>
```
Example:
`s_wait_alu .VaVdst_1_VmVsrc_1`
which is equivalent to:
`s_wait_alu 8167`
Features:
- If a counter is at its maximum value, it is not printed.
- The parser reports an error if a counter is greater than or equal to
its maximum value.
- If all counters are disabled, we can use 'AllOff'.
- For now, numeric values are also accepted for backwards compatibility
with older MIR.
Note: This is similar to https://github.com/llvm/llvm-project/pull/96004
but for `s_wait_alu`.
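As a rough illustration of the printing rule, here is a minimal sketch that decodes an immediate and skips counters that are at their maximum. The field positions and widths are an assumption about the immediate layout, chosen so that `8167` decodes to the example above. Every counter name other than `VaVdst` and `VmVsrc`, and the exact spelling of the all-disabled form (including its leading dot), are guesses. `printWaitAlu` is a hypothetical name, not the function added by this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// ASSUMED layout of the 16-bit immediate: (printed name, bit shift, bit width).
// Only VaVdst and VmVsrc appear in the example; the other names are guesses.
struct Field {
  const char *Name;
  unsigned Shift;
  unsigned Width;
};

static const Field Fields[] = {
    {"VaVdst", 12, 4}, {"VaSdst", 9, 3}, {"VaSsrc", 8, 1}, {"HoldCnt", 7, 1},
    {"VmVsrc", 2, 3},  {"VaVcc", 1, 1},  {"SaSdst", 0, 1}};

// Hypothetical printer: emits ".<name>_<value>" pairs joined by '_'.
std::string printWaitAlu(uint16_t Imm) {
  std::string Out;
  for (const Field &F : Fields) {
    unsigned Max = (1u << F.Width) - 1;
    unsigned Val = (Imm >> F.Shift) & Max;
    if (Val == Max)
      continue; // a counter at its maximum value is not printed
    Out += Out.empty() ? "." : "_";
    Out += F.Name;
    Out += "_" + std::to_string(Val);
  }
  // Spelling of the all-disabled form is an assumption.
  return Out.empty() ? ".AllOff" : Out;
}
```

With this layout, `printWaitAlu(8167)` yields `.VaVdst_1_VmVsrc_1`, matching the example.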
The code in the test is causing a crash in `SIInstrInfo.cpp`
`fixImplicitOperands()` in `MI.implicit_operands()`:
```
for (auto &Op : MI.implicit_operands()) {
```
MachineInstr.h:
```
mop_range implicit_operands() {
=> return operands_impl().drop_front(getNumExplicitOperands());
}
```
We are trying to drop 1 operand from the operand list of `MI`, which is
empty. With an early return we no longer crash at that point and we get
a more meaningful error message:
```
*** Bad machine code: Too few operands ***
- function: missing_operand_crash
- basic block: %bb.0 (0x5a9d30ced988)
- instruction: S_WAITCNT_DEPCTR
1 operands expected, but 0 given.
```
The code is still crashing at a different location, but at least we are
getting an error message.
In practice when legalizeOperands is called on a PHI node, the result is
never an SGPR class and the operands are never subregs. Simplify the
code accordingly by using the result regclass for all the inputs. This
includes using an AV class where previously we picked either an AGPR or
VGPR class.
Convert:
```
s_add_u32 X, Y, 1
s_cmp_lg_i32 X, 0
```
to:
```
s_add_u32 X, Y, 1
<invert scc uses>
```
Also delete the compare when it is s_cmp_eq_i32 X, 0; in that case
inverting the scc uses is not necessary.
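The reasoning can be checked with a short sketch. It models SCC after `s_add_u32` as the unsigned carry-out of the addition, which for `Y + 1` is set exactly when the result wraps to zero. The function names are hypothetical and only model the three instructions above.

```cpp
#include <cassert>
#include <cstdint>

// SCC after s_add_u32 X, Y, 1: the unsigned carry-out of Y + 1.
constexpr bool sccAfterAdd(uint32_t Y) { return uint32_t(Y + 1) < Y; }

// SCC after s_cmp_lg_i32 X, 0 with X = Y + 1.
constexpr bool sccAfterCmpLg(uint32_t Y) { return uint32_t(Y + 1) != 0; }

// SCC after s_cmp_eq_i32 X, 0 with X = Y + 1.
constexpr bool sccAfterCmpEq(uint32_t Y) { return uint32_t(Y + 1) == 0; }
```

For every `Y`, `sccAfterAdd(Y)` is the inverse of `sccAfterCmpLg(Y)` and equal to `sccAfterCmpEq(Y)`, which is why only the `s_cmp_lg_i32` case needs its scc uses inverted.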
---------
Signed-off-by: John Lu <John.Lu@amd.com>
This change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.
If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
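For downstream code, the migration might look roughly like the sketch below. It is a standalone model, not LLVM's definition: the flag values are illustrative, only three flags are shown, and `hasRegState` is assumed to test whether any of the queried bits is set (for a single-flag query, "any" and "all" agree).

```cpp
#include <cassert>

// Standalone model of a flag enum class; numeric values are illustrative only.
enum class RegState : unsigned { Define = 0x2, Implicit = 0x4, Kill = 0x8 };

constexpr RegState operator|(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<unsigned>(A) |
                               static_cast<unsigned>(B));
}

constexpr RegState operator&(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<unsigned>(A) &
                               static_cast<unsigned>(B));
}

// Replaces `IsKill ? RegState::Kill : 0`, which no longer compiles because
// 0 does not convert to an enum class.
constexpr RegState getKillRegState(bool IsKill) {
  return IsKill ? RegState::Kill : RegState{};
}

// Replaces `(Flags & RegState::Kill) != 0`. ASSUMPTION: "any bit set".
constexpr bool hasRegState(RegState Flags, RegState Test) {
  return (Flags & Test) != RegState{};
}
```

A call site would then read `RegState Flags = RegState::Define | getKillRegState(IsKill);` followed by `hasRegState(Flags, RegState::Kill)`.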
`insertSimulatedTrap` was returning `HaltLoopBB` when the trap was in a
block with no successors and was the last instruction. Since
`HaltLoopBB` gets appended to the end of the function, `FinalizeISel`
would jump there and skip any intermediate blocks, leaving their pseudos
unexpanded.
Fix by returning `MBB.getNextNode()` unconditionally:
- After `splitAt()`: `getNextNode()` returns the split-off block
(`ContBB`)
- No split, `MBB` in middle: `getNextNode()` returns the next original
block
- No split, `MBB` was last: `getNextNode()` returns `HaltLoopBB` (just
pushed)
Since we always `push_back(HaltLoopBB)` before returning,
`getNextNode()` can never be `nullptr`: if `MBB` was the last block,
`HaltLoopBB` is now after it.
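The three cases can be modelled with a plain list of block names. This is only a sketch of the ordering argument: `blockAfterTrap` is a hypothetical helper, the block names are made up, and indexing the next element stands in for `MBB.getNextNode()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Models the function's block list as names and returns the block that
// follows MBB after the optional split and the HaltLoopBB append.
std::string blockAfterTrap(std::vector<std::string> Blocks, std::size_t MBBIdx,
                           bool Split) {
  if (Split) // splitAt() places the continuation block right after MBB
    Blocks.insert(Blocks.begin() + MBBIdx + 1, "ContBB");
  Blocks.push_back("HaltLoopBB"); // always appended at the end of the function
  return Blocks[MBBIdx + 1];      // stands in for MBB.getNextNode()
}
```

Because `HaltLoopBB` is appended before the lookup, the index `MBBIdx + 1` is always in range, mirroring the claim that `getNextNode()` cannot be `nullptr`.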
Fixes: SWDEV-572407
This PR handles `v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.
- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.
Fixes #94116.
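The two behaviours can be written as a pair of tiny helpers. This is a sketch of the packing described above, with hypothetical function names. The input is the raw 16-bit pattern of the fp16 inline constant.

```cpp
#include <cassert>
#include <cstdint>

// Pre-GFX11: (f16, 0), value in the low 16 bits, zero in the high 16 bits.
constexpr uint32_t expandPreGFX11(uint16_t F16) { return F16; }

// GFX11+: (f16, f16), value duplicated into both halves.
constexpr uint32_t expandGFX11Plus(uint16_t F16) {
  return (uint32_t(F16) << 16) | F16;
}
```

For the fp16 value 1.0 (bit pattern `0x3C00`), the first helper gives `0x00003C00` and the second gives `0x3C003C00`.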
This change prepares for making RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
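The reserved `0x1` value works because `true` implicitly converts to the unsigned value 1. A minimal sketch of the idea, with illustrative flag values and a hypothetical helper name:

```cpp
#include <cassert>

// Illustrative flag values; 0x1 is deliberately not assigned to any real flag.
constexpr unsigned ReservedBoolTrap = 0x1;
constexpr unsigned Define = 0x2;
constexpr unsigned Kill = 0x8;

// Returns true if the argument has the reserved bit set, which real flag
// combinations never do but an accidental `true` always does.
constexpr bool looksLikeBool(unsigned Flags) {
  return (Flags & ReservedBoolTrap) != 0;
}
```

Note that an accidental `false` converts to 0 and is indistinguishable from "no flags", so a check like this cannot catch it.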
This PR relands llvm/llvm-project#176091 (commit
1d616cdca3aba9d22f120888bb6b09b75ca90b92) which was reverted in
llvm/llvm-project#176190 (commit
6309cd8668fc2ae589f156b23f86821f4ce5b7ea).
Register classes of sources also have to be constrained to lo128.
This causes a few regressions in register coalescing in true16 mode,
but without the constraint verification fails.
Reverts llvm/llvm-project#176091
Reverting because some compilers were erroring on the call to
`Reg.isReg()` (which is not `constexpr`) in a `constexpr` function.
This change prepares for making RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
This preparatory patch introduces an additional argument to the target hook
loadRegFromStackSlot. This is essential for targets to handle subregister-specific
reloads in the future. See how this is used for the AMDGPU target in PR #175002.
The COPY inserted for liverange split during sgpr-regalloc
pipeline currently breaks the BB prolog during the subsequent
vgpr-regalloc phase while spilling and/or splitting the vector
liveranges. This patch fixes it by correctly including the
LR split instructions during sgpr-regalloc and wwm-regalloc
pipelines into the BB prolog.
At the moment the MIR tests are somewhat redundant. The waitcnt
one is needed to ensure we actually have a load, given we are
currently just emitting an error on ExternalSymbol. The asm printer
one is more redundant for the moment, since it's stressed by the IR
test. However I am planning to change the error path for the IR test,
so it will soon not be redundant.
A buildbot failure in https://github.com/llvm/llvm-project/pull/170323
when expensive checks were used highlighted that some of these patterns
were missing.
This patch adds `V_INDIRECT_REG_{READ/WRITE}_GPR_IDX` and
`V/S_INDIRECT_REG_WRITE_MOVREL` for `V6` and `V7` vector sizes.
BUF instructions can access the scratch address space, so
SIInsertWaitCnt needs to be able to track the SCRATCH_WRITE_ACCESS event
for such BUF instructions.
The release-vgprs.mir test had to be updated because BUF instructions
without an MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added an MMO
that touches global memory to keep the test result unchanged. I also
added a couple of test cases with no MMO to test the corrected behavior.
When SIMULATED_TRAP is at the end of a block with no successors,
insertSimulatedTrap incorrectly returns the original MBB despite adding
HaltLoopBB to the CFG.
EmitInstrWithCustomInserter detects CFG changes by comparing the
returned MBB with the original. When they match, it assumes no
modification occurred and skips MachineLoopInfo invalidation. This
causes stale loop information in subsequent passes, particularly when
using the NPM which relies on accurate invalidation signals.
Fix: Return HaltLoopBB to properly signal the CFG modification.
The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some
unused bits. Previously codegen would set these bits to 1, but setting
them to 0 matches the SP3 assembler behaviour better, which in turn
means that we can print them using the human readable SP3 syntax:
s_wait_alu 0xfffd ; unused bits set to 1
s_wait_alu 0xff9d ; unused bits set to 0
s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable
Note that the set of unused bits changed between GFX10.1 and GFX10.3.
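The relationship between the first two encodings can be expressed as a mask. The mask below is inferred purely from the two example values and so applies only to whichever target those examples were taken from. As noted, the set of unused bits is not the same on every target. The names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Inferred from the examples: 0xfffd and 0xff9d differ only in bits 5 and 6.
constexpr uint16_t UnusedBitsMask = 0xfffd ^ 0xff9d; // 0x0060

// Clears the unused bits, turning the old encoding into the new one.
constexpr uint16_t clearUnusedBits(uint16_t Imm) {
  return static_cast<uint16_t>(Imm & ~UnusedBitsMask);
}
```

Applying `clearUnusedBits` to `0xfffd` gives `0xff9d`, and applying it to an already-cleared value leaves it unchanged.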
We previously got a duplicate implicit $exec operand. It didn't really
hurt anything (other than being a slight drag on compile-time
performance). Still, let's keep things clean.
This is in preparation for future changes in AMDGPU that will make more
substantial use of bundles pre-RA. For now, simply test this with
degenerate (single-instruction) bundles.