This is a refinement for the existing hack. With this,
the default target will have neither wavefrontsize feature
present, unless it was explicitly specified. That is,
getWavefrontSize() == 64 no longer implies +wavefrontsize64.
getWavefrontSize() == 32 does imply +wavefrontsize32.
Continue to assume the value is 64 with no wavesize feature.
This maintains the codegenable property without any code
that directly cares about the wavesize needing to worry about it.
Introduce an isWaveSizeKnown helper to check if we know the
wavesize is accurate based on having one of the features explicitly
set, or a known target-cpu.
I'm not sure what's going on in wave_any.s. It's testing what
happens when both wavesizes are enabled, but this is treated
as an error in codegen. We now treat wave32 as the winning
case, so some cases that were previously printed as vcc are now
vcc_lo.
It's very expensive and doesn't achieve anything.
I one test I did, it saves almost 10s on a 2m23s build, bringing it down
to 2m15s using a downstream branch.
It's very expensive and doesn't achieve anything.
I one test I did, it saves almost 10s on a 2m23s build, bringing it down
to 2m15s using a downstream branch.
The current implementation of `isInlinableLiteral16` assumes, a 16-bit
inlinable
literal is either an `i16` or a `fp16`. This is not always true because
of
`bf16`. However, we can't tell `fp16` and `bf16` apart by just looking
at the
value. This patch splits `isInlinableLiteral16` into three versions,
`i16`,
`fp16`, `bf16` respectively, and call the corresponding version.
Endoding is VOP3P. Tagged as deep/machine learning instructions. i32
type (v4fp8 or v4bf8 packed in i32) is used for src0 and src1. src0 and
src1 have no src_modifiers. src2 is f32 and has src_modifiers: f32
fneg(neg_lo[2]) and f32 fabs(neg_hi[2]).
---------
Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com>
Consistently treat packed 16-bit operands as 32-bit values, because
that's really what they are. The attempt to treat them differently was
ultimately incorrect and lead to miscompiles, e.g. when using non-splat
constants such as (1, 0) as operands.
Recognize 32-bit float constants for i/u16 instructions. This is a bit
odd conceptually, but it matches HW behavior and SP3.
Remove isFoldableLiteralV216; there was too much magic in the dependency
between it and its use in SIFoldOperands. Instead, we now simply rely on
checking whether a constant is an inline constant, and trying a bunch of
permutations of the low and high halves. This is more obviously correct
and leads to some new cases where inline constants are used as shown by
tests.
Move the logic for switching packed add vs. sub into SIFoldOperands.
This has two benefits: all logic that optimizes for inline constants in
packed math is now in one place; and it applies to both SelectionDAG and
GISel paths.
Disable the use of opsel with v_dot* instructions on gfx11. They are
documented to ignore opsel on src0 and src1. It may be interesting to
re-enable to use of opsel on src2 as a future optimization.
A similar "proper" fix of what inline constants mean could potentially
be applied to unpacked 16-bit ops. However, it's less clear what the
benefit would be, and there are surely places where we'd have to
carefully audit whether values are properly sign- or zero-extended. It
is best to keep such a change separate.
Fixes: Corruption in FSR 2.0 (latent bug exposed by an LLPC change)
A 64-bit literal can be used as a 32-bit zero or sign extended operand.
In case of double zeroes are added to the low 32 bits. Currently asm
parser stores only high 32 bits of a double into an operand. To support
codegen as requested by the
https://github.com/llvm/llvm-project/issues/67781 we need to change the
representation to store a full 64-bit value so that codegen can simply
add immediates to an instruction.
There is some code to support compatibility with existing tests and asm
kernels. We allow to use short hex strings to represent only a high 32
bit of a double value as a valid literal.
Always output values of wait_vdst and wait_exp in assembly even when
they are zero.
While we normally avoid outputing default/zero parameters in assembly,
the values of these parameters still imply wait behaviour when zero.
Outputing zero values makes the intent more obvious to human readers,
and avoid any future ambiguity if we choose to change the defaults to
something other than zero.
Fixes#66383
Names '64BitDPP' and especially 'DPP64' were found misleading, and
DPP64 can easily be mixed with DPP16 and DPP8 while these are
different concepts. DPP16 and DPP8 refers to lanes where DPP64
refers to the operand size.
In fact the essential part here is that these instructions are
executed on the DP ALU, so rename the feature accordingly.
I have also found a bug in a check for these instructions, which is
fixed here and a common utility function is now used.
Differential Revision: https://reviews.llvm.org/D158465
This required two substantial changes:
1. Moving a `getRegBitWidth(TargetRegisterClass)` overload out of Utils
and into CodeGen
2. Passing the string function name to AMDGPUPALMetadata instead of the
MachineFunction
Other changes are minor or updates to accommodate the first two.
See issue #64166 for more information on the layering issue.
Differential Revision: https://reviews.llvm.org/D156486
decodeFPImmed creates immediate operand using register operand width,
but size of created immediate should correspond to OperandType for
RegisterOperand.
e.g. OPW128 could be used for RegisterOperands that use v2f64 v4f32
and v8f16. Each RegisterOperands would have different OperandType and
require that immediate is decoded using 64, 32 and 16 bit immediate
respectively.
decodeOperand_<RegClass> only provides width for register decoding,
introduce decodeOperand_<RegClass>_Imm<ImmWidth> that also provides
width for immediate decoding.
Refactor RegisterOperands:
- decoders get _Imm<ImmWidth> suffix in some cases
- removed unused RegisterOperands defined via multiclass
- use different RegisterOperand in a few places, new RegisterOperand's
decoder corresponds to the number of bits used for operand's encoding
Refactor decoder functions:
- add asserts for the size of encoding that will be decoded
- regroup them according to the method of decoding
decodeOperand_<RegClass> (register only, no immediate) decoders can now
create immediate of consistent size, use it for better diagnostic of
'invalid immediate'.
Differential Revision: https://reviews.llvm.org/D142636