For multipass instructions, an overlap between VDST and the SRC operands
would result in a HW race and undefined results.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
These instructions make non-standard use of the OPSEL bits to select
the destination byte to write. The src2_modifiers operand is used without
a corresponding src2 operand, by introducing a dummy src2.
OPSEL ASM Syntax: opsel:[a,b,c,d]
a & b are meaningless; c & d together decide which byte of the dst register to write.
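A minimal C sketch of the write-side byte selection described above (purely illustrative, not the compiler's implementation, and assuming the other dst bytes are preserved):

```c
#include <stdint.h>

/* Illustrative only: OPSEL[3:2] (c & d above) select which byte of the
   32-bit dst register receives the packed conversion result. */
static uint32_t write_dst_byte(uint32_t dst, uint8_t result, unsigned sel /* 0..3 */) {
    uint32_t mask = 0xFFu << (sel * 8);
    return (dst & ~mask) | ((uint32_t)result << (sel * 8));
}
```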
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4: opsel:[x,y,z]
where x & y, i.e. OPSEL[1:0], select which src byte to read.
Note: the conventional Inst{13}, i.e. OPSEL[2], is ignored in the asm syntax.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4: opsel:[x,y,z]
where x & y, i.e. OPSEL[1:0], select which src byte to read.
OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32: opsel:[a,b,c,d]
where c & d, i.e. OPSEL[3:2], select which dst byte to write.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
OPSEL[1:0] collectively decide which byte to read
from the src input.
The builtin takes an additional imm argument which
represents the index (valid values: [0:3]) of the src
byte to read. Out-of-bounds checks will be added in the next
patch.
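A minimal C sketch of the read-side selection (purely illustrative; it only shows how the imm/OPSEL[1:0] index maps to a byte of the 32-bit source):

```c
#include <stdint.h>

/* Illustrative only: the builtin's extra imm argument (0..3), i.e. OPSEL[1:0],
   selects which byte of the 32-bit source is converted. */
static uint8_t read_src_byte(uint32_t src, unsigned sel /* 0..3 */) {
    return (uint8_t)(src >> (sel * 8));
}
```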
OPSEL ASM Syntax: opsel:[x,y,z]
where,
opsel[x] = Inst{11} = src0_modifier{2}
opsel[y] = Inst{12} = src1_modifier{2}
opsel[z] = Inst{14} = src0_modifier{3}
Note: Inst{13}, i.e. OPSEL[2], is ignored in the
asm syntax, and opsel[z] is meaningless
for v_cvt_scalef32_f32_{fp|bf}8.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
In order to align with `svext` and NEON `vext`/`vextq`, this patch
changes the immediate argument of `svextq` so that it refers to elements
of the same size as those of the source vectors, rather than to bytes. The
[spec for this
intrinsic](https://github.com/ARM-software/acle/blob/main/main/acle.md#extq)
is ambiguous about the meaning of this argument; the issue was raised
after the GCC implementers of the ACLE arrived at a differing
interpretation.
For example (with the current implementation):
`svextq_f64(zn_f64, zm_f64, 1)` would, for each 128-bit segment of
`zn_f64`, concatenate the highest 15 bytes of that segment with the
first byte of the corresponding segment of `zm_f64`.
After this patch, the behavior of `svextq_f64(zn_f64, zm_f64, 1)` would
be, for each 128-bit vector segment of `zn_f64`, to concatenate the
higher doubleword of this segment with the lower doubleword of the
corresponding segment of `zm_f64`.
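A minimal C sketch of the new element-based interpretation, assuming a toolchain and target where the SVE2p1 `svextq` intrinsics are available (the function name is illustrative):

```c
#include <arm_sve.h>

/* After this patch: for each 128-bit segment, the result is the high
   doubleword of zn_f64's segment followed by the low doubleword of the
   corresponding segment of zm_f64. */
svfloat64_t extq_by_one_element(svfloat64_t zn_f64, svfloat64_t zm_f64) {
    return svextq_f64(zn_f64, zm_f64, 1); /* imm now counts f64 elements, not bytes */
}
```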
The range of the immediate argument in `svextq` would be modified such
that it is:
- [0,15] for `svextq_{s8,u8}`
- [0,7] for `svextq_{s16,u16,f16,bf16}`
- [0,3] for `svextq_{s32,u32,f32}`
- [0,1] for `svextq_{s64,u64,f64}`
This PR fixes the failing test `CodeGen/RISCV/compress-opt-select.ll`.
It started failing after the previously merged commit `[TTI][RISCV]
Unconditionally break critical edges to sink ADDI (PR #108889)`,
so the `compress-opt-select` test has been regenerated.
Two new options for clang:
-mdiv32: Use div.w[u] and mod.w[u] instructions with inputs not
sign-extended.
-mno-div32: Do not use div.w[u] and mod.w[u] instructions with inputs not
sign-extended.
The default is -mno-div32.
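A minimal C sketch of the kind of code affected (illustrative; the flags change code generation only, not the source):

```c
/* Under -mdiv32 the backend may assume div.w/mod.w give correct results even
   when the 32-bit operands are not sign-extended in their 64-bit registers,
   so the sign extension normally emitted before the division can be dropped. */
int quotient32(int a, int b) {
    return a / b; /* candidate for a direct div.w under -mdiv32 */
}
```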
All of these immediates are signed, as the surrounding comments
indicate. This fixes an assertion failure in
CodeGen/Generic/dag-combine-ossfuzz-crash.ll when run with a
powerpc-aix triple.
znver4 512-bit instructions run at half the rate of the 128/256-bit variants (still 1 uop, though).
Confirmed with Agner/uops.info
Noticed while triaging #110308 and #117579
Split extension/truncation patterns to simplify matching.
Fix patterns to consistently match SSE/AVX1/AVX2 variants as well.
Add some missing src/dst type variants - there should be no difference in scheduling; it's purely based on dst reg width.
Confirmed with Agner/uops.info
Noticed while triaging #110308
A static analysis tool found that ModuleCost could be zero, which would
cause a divide-by-zero when it is printed. This may be unreachable
in practice, but the fix is straightforward and unlikely to be a
performance concern.
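A minimal C sketch of the kind of guard involved (names are illustrative, not the actual LLVM code):

```c
#include <stdio.h>

/* Illustrative only: compute the percentage share only when ModuleCost is
   non-zero, so printing can never divide by zero. */
static void printCostShare(const char *Name, double FnCost, double ModuleCost) {
    if (ModuleCost != 0.0)
        printf("%s: %.2f%% of module cost\n", Name, 100.0 * FnCost / ModuleCost);
    else
        printf("%s: module cost is zero\n", Name);
}
```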
Following up on issue #89822, this patch adds the opportunity to use tail
calls in the machine outliner pass.
It also enables outlining patterns that use the X5 (T0) register.
Extend the optimization that converts s_barrier to wave_barrier (a nop)
when the number of work items is not larger than the wave size.
This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the version where s_barrier is used on gfx12 (and split later) already
has this optimization, but some front-ends may prefer to emit the split
intrinsics directly, which is what this patch addresses.
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
The encoding of the v_dot2c_f32_bf16 opcode is the same as that of v_mac_f32 on gfx90a,
both from the gfx9 series. This required a new DecoderNamespace, GFX950_DOT.
Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16.
All three instructions were part of the Dot9 instruction group in the compiler.
This patch splits the existing dot9 group (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16)
into a new dot9 group (v_dot2_f16_f16 and v_dot2_bf16_bf16) and a dot12 group (v_dot2_f32_bf16).
All necessary changes for gfx11 and gfx12 are updated to reflect this.
Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
Scale packed 16-component single-precision float vectors from
two source inputs using the exponent provided by the third
single-precision float input, then convert the values to a packed
32-component FP6 float value.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
These instructions make non-standard use of the OPSEL bits to select
the destination byte to write. The src2_modifiers operand is used without
a corresponding src2 operand, by introducing a dummy src2.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>