In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed
`Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after
canonicalization).
Closes https://github.com/llvm/llvm-project/issues/189677.
Clear the flag. It fails verification if set, only convergent
operations may have NoConvergent flag. NFCI as it is now because
it just does not happen.
One of global flags in `resetTargetOptions`, users should use `nsz`
instead.
`fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` will have
regression when `fadd` is annotated with `nsz`.
This Change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.
If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates the bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32-bit.
Fixes SWDEV-567139.
Reduces memory usage compiling backend sources, most notably for
AMDGPU by ~98 MB per source on average.
AMDGPUGenRegisterInfo.inc is tens of megabytes in size now, and
is even larger downstream. At the same time, it is included in
nearly all backend sources, typically just for a small portion of
its content, resulting in compilation being unnecessarily
memory-hungry, which in turn stresses buildbots and wastes their
resources.
Splitting .inc files also helps avoiding extra ccache misses
where changes in .td files don't cause changes in all parts of
what previously was a single .inc file.
It is thought that rather than building on top of the current
single-output-file design of TableGen, e.g., using `split-file`,
it would be more preferable to recognise the need for multi-file
outputs and give it a proper first-class support directly in
TableGen.
This fixes a regression exposed after
445415219708f9539801018e03282049ca33e0e2.
This introduces a few small regressions for true16. There are more cases
where the value can propagate through subregister extracts which need
new handling. They're also small enough that perhaps there's a way to
avoid needing to deal with this case in the first place.
This removes special case processing in TargetInstrInfo::getRegClass to
fixup register operands which depending on the subtarget support AGPRs,
or require even aligned registers.
This regresses assembler diagnostics, which currently work by hackily
accepting invalid cases and then post-rejecting a validly parsed
instruction.
On the plus side this now emits a comment when disassembling unaligned
registers for targets with the alignment requirement.
With true16 mode v_mov_b16_t16 is added as new foldable copy inst, but
the src operand is in different index.
Use the correct src index for v_mov_b16_t16.
This is primarily to avoid folding a frame index materialized
into an SGPR into the pseudo; this would end up looking like:
%sreg = s_mov_b32 %stack.0
%av_32 = av_mov_b32_imm_pseudo %sreg
Which is not useful.
Match the check used for the b64 case. This is limited to the
pseudo to avoid regression due to gfx908's special case - it
is expecting to pass here with v_accvgpr_write_b32 for illegal
cases, and stay in the intermediate state with an sgpr input.
This avoids regressions in a future patch.
The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.
The purpose of this PR is to facilitate this PR:
https://github.com/llvm/llvm-project/pull/151033
Change shouldRewriteCopySrc to return the common register
class and expose it as a utility function. I've found myself
reproducing essentially the same logic in multiple places. The
purpose of this function is to jsut work through the API constraints
of which combination of register class and subreg indexes you have.
i.e. you need to use a different function if you have 0, 1, or 2
subregister indexes involved in a pair of copy-like operations.
This strengthens the check to ensure the new mov's source class
is compatible with the source register. This avoids using the register
sized based checks in getMovOpcode, which don't quite understand
AV superclasses correctly. As a side effect it also enables more folds
into true16 movs.
getMovOpcode should probably be deleted, or at least replaced
with class check based logic. In this particular case other
legality checks need to be mixed in with attempted IR changes,
so I didn't try to push all of that into the opcode selection.
With current codegen this only affects src_flat_scratch_base_lo/hi.
Co-authored-by: Jay Foad <Jay.Foad@amd.com>
Co-authored-by: Jay Foad <Jay.Foad@amd.com>
SS forms of SCRATCH_LOAD_DWORD do not support SCALE_OFFSET,
so if this bit is used SCRATCH_LOAD_DWORD_SADDR cannot be formed.
This generally shall not happen because FI is not supposed to
be scaled, but add this as a precaution.
We weren't fully respecting the type of a def of an immediate vs.
the type at the use point. Refactor the folding logic to track the
value to fold, as well as a subregister to apply to the underlying
value. This is similar to how PeepholeOpt tracks subregisters (though
only for pure copy-like instructions, no constants).
Fixes#139317
This helps avoid some regressions in a future patch. The or 0
pattern appears in the division tests because the reduce 64-bit
bit operation to a 32-bit one with half identity value is only
implemented for constants. We could fix that by using computeKnownBits.
Additionally the pattern disappears if I optimize the IR division
expansion, so that IR should probably be emitted more optimally in
the first place.
VOP3 instructions ignore opsel source modifiers, so a constant that
contains two different [B]F16 imms cannot be encoded into instruction
with an src opsel.
E.g. without the fix the following instructions
`s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>`
`v_cvt_scalef32_pk_fp8_f16 v0, s0, v2`
lose `2.0` imm and are folded into
`v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0`
Fixes SWDEV-531672
This was pre-filtering out a specific situation from being
added to the fold candidate list. The operand legality will
ultimately be checked with isOperandLegal before the fold is
performed, so I don't see the plus in pre-filtering this one
case.
No tests fail with this. I'm not sure I understand the comment,
there can't be any folding into an operand that had to already
be a constant. I tried different combinations of immediates to these
instructions but never hit the condition.
foldCopyToVGPROfScalarAddOfFrameIndex transforms s_adds whose results are copied
to vector registers into v_adds. We don't want to do that if foldInstOperand
(which so far runs later) can fold the sreg->vreg copy away.
This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex until after
foldInstOperand.
This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids
regressions in an upcoming patch to enable ISD::PTRADD nodes.
We need to consider the use instruction's intepretation of the bits,
not the defined immediate without use context. This will regress
some cases where we previously coud match f64 inline constants. We
can restore them by either using pseudo instructions to materialize f64
constants, or recognizing reg_sequence decomposed into 32-bit pieces for them
(which essentially means recognizing every other input is a 0).
Fixes#139908
This code clunkily tried to find a splat reg_sequence by
looking at every use of the reg_sequence, and then looking
back at the reg_sequence to see if it's a splat. Extract this
into a separate helper function to help clean this up. This now
parses whether the reg_sequence forms a splat once, and defers the
legal inline immediate check to the use check (which is really use
context dependent)
The one regression is in globalisel, which has an extra
copy that should have been separately folded out. It was getting
dealt with by the handling of foldable copies in tryToFoldACImm.
This is preparation for #139908 and #139317