81365 Commits

Author SHA1 Message Date
Matt Arsenault
d9c4e9ffe7
AMDGPU: Verify f8f6f4 formats in assembler (#117826)
Verify the register widths of the corresponding operands match
the floating point format expected size.
2024-11-26 23:45:03 -05:00
Matt Arsenault
5615657209
AMDGPU: Builtin & CodeGen support for v_cvt_sr_{bf16|f16}_f32 instructions (#117824)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:37:05 -05:00
Matt Arsenault
62dc8f3069
AMDGPU: Add builtins & codegen support for bitop3_b{16|32} of gfx950. (#117823)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 23:33:07 -05:00
Matt Arsenault
142b33c58b
AMDGPU: Allocate different registers for vdst & src in v_cvt_scalef32* (#117822)
For multipass instructions, overlap between VDST and SRC
would result in a HW race and undefined results.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 23:29:11 -05:00
Matt Arsenault
265e209ceb
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_{bf8|fp8}_{f16|bf16|f32} (#117821)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:24:01 -05:00
Matt Arsenault
301c8e6047
AMDGPU: Add support for v_cvt_scalef32_sr instructions (#117820)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:20:16 -05:00
Brandon Wu
4a7dbede6b
[RISCV] Support svukte extension (#115657)
This is the extension for "Address-Independent Latency of User-Mode
Faults to Supervisor Addresses".
Spec: https://github.com/riscv/riscv-isa-manual/pull/1564,
https://lf-riscv.atlassian.net/browse/RVS-2977
The spec states that the `svukte` depends on `sv39`, but we don't have
`sv39` yet, so I didn't add it to the implied list.
2024-11-27 10:54:57 +08:00
Sam Clegg
ea58410d0f
[WebAssembly] Implement %llvm.thread.pointer intrinsic (#117817)
We can simply use the `__tls_base` global for this which is guaranteed
to be non-zero and unique per thread.

Fixes: #117433
2024-11-26 17:19:14 -08:00
Matt Arsenault
76715787f4
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_pk_fp4 instructions (#117798)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:59:14 -05:00
Matt Arsenault
f87cabea26
AMDGPU: MC support for v_cvt_scalef32_sr_{bf8|fp8}_{f16|bf16|f32} (#117797)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:52:09 -05:00
Matt Arsenault
34a8bb0da3
AMDGPU: MC support for v_cvt_sr_{f16|bf16}_f32 instructions (#117796)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:48:50 -05:00
Matt Arsenault
d3c103b80e
AMDGPU: MC support for V_CVT_SCALE_SR_FP4 instructions (#117795)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:41:52 -05:00
Matt Arsenault
c8ee1ee057
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk_fp4_{f|bf}16 for gfx950 (#117794)
These instructions have non-standard use of OPSEL bits to select
dest write byte. The src2_modifiers operand is used without having
its corresponding src2 operand by introducing dummy src2.

OPSEL ASM Syntax: opsel:[a,b,c,d]
a & b are meaningless; c & d together decide which byte to write in the dst reg.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:38:23 -05:00
Matt Arsenault
065dc93d96
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 for gfx950 (#117793)
OPSEL[0] selects src_word to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:35:18 -05:00
Matt Arsenault
991dcbc468
AMDGPU: Builtin & codegen support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 for gfx950 (#117747)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:30:04 -05:00
Matt Arsenault
0f4fcca546
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk32_f32_[fp|bf]6 for gfx950 (#117745)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:26:07 -05:00
Matt Arsenault
eeb76880f3
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{f|bf}16_fp4 for gfx950 (#117744)
OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:23:15 -05:00
Matt Arsenault
2b9e947d43
AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743)
OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d]
where, c & d i.e. OPSEL[3 : 2] selects which dst_byte to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:20:09 -05:00
Matt Arsenault
4527894143
Builtins & Codegen support for v_cvt_scalef32_pk_{fp|bf}8_{f|bf}16 for gfx950 (#117742)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:16:08 -05:00
Matt Arsenault
62584f32eb
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8|bf8} for gfx950 (#117741)
OPSEL[0] determines low/high 16 bits of src0 to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:12:18 -05:00
Craig Topper
43b6b78771
[RISCV][GISel] Use libcalls for f32/f64 G_FCMP without F/D extensions. (#117660)
LegalizerHelper only supported f128 libcalls and incorrectly assumed that
the destination register for the G_FCMP was s32.
2024-11-26 15:48:49 -08:00
Pradeep Kumar
e84614833e
[LLVM][NVPTX] Add support for div.full instruction (#116482)
This commit adds NVPTX support for div.full PTX instruction with test
under div.ll. [For more information, see PTX
ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-div)
2024-11-27 04:57:42 +05:30
Matt Arsenault
803bd812b1
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8|bf8}_f32 for gfx950 (#117740)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:57:09 -05:00
Matt Arsenault
815069c701
AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16|f32]_[bf8|fp8] (#117739)
OPSEL[1:0] collectively decide which byte to read
from src input.

The builtin takes an additional imm argument which
represents the index (with valid values: [0:3]) of the src
byte to read. Out-of-bounds checks will be added in the next
patch.

OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}

Note: Inst{13} i.e. OPSEL[2] is ignored in
asm syntax and opsel[z] is meaningless
for v_cvt_scalef32_f32_{fp|bf}8

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:54:10 -05:00
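The opsel-to-encoding mapping listed in commit 815069c701 can be illustrated with a minimal sketch. It only models the bit positions quoted in that commit message (Inst{11}, Inst{12}, Inst{14}); the helper name `opsel_to_inst_mask` is hypothetical and is not part of LLVM.

```python
# Bit positions taken from the commit message:
#   opsel[x] -> Inst{11}, opsel[y] -> Inst{12}, opsel[z] -> Inst{14}
# Inst{13} (the conventional OPSEL[2]) is skipped in the asm syntax.
OPSEL_INST_BITS = (11, 12, 14)


def opsel_to_inst_mask(opsel):
    """Return the instruction-word bit mask for an opsel:[x,y,z] list.

    Hypothetical helper for illustration only; it does not reproduce
    the assembler's actual implementation.
    """
    if len(opsel) != len(OPSEL_INST_BITS):
        raise ValueError("expected exactly three opsel values: [x, y, z]")
    mask = 0
    for value, bit in zip(opsel, OPSEL_INST_BITS):
        if value not in (0, 1):
            raise ValueError("each opsel value must be 0 or 1")
        mask |= value << bit
    return mask
```

For example, `opsel_to_inst_mask([1, 0, 0])` sets only bit 11, and no input can set bit 13.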
Matt Arsenault
7221bc74bc
AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738)
2024-11-26 14:51:05 -05:00
Matt Arsenault
f5e92eb04b
AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737)
2024-11-26 14:47:52 -05:00
Matt Arsenault
e57b327be2
AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634)
Select to minimum3/maximum3. Leave f16/v2f16 for later
since it's complicated by only having the vector version.
2024-11-26 14:44:09 -05:00
SpencerAbson
2a0162c019
[AArch64][SVE] Change the immediate argument in svextq (#115340)
In order to align with `svext` and NEON `vext`/`vextq`, this patch
changes the immediate argument in `svextq` such that it refers to elements
of the size of those of the source vector, rather than bytes. The [spec
for this
intrinsic](https://github.com/ARM-software/acle/blob/main/main/acle.md#extq)
is ambiguous about the meaning of this argument; the issue was raised
after the implementers of the ACLE in GCC interpreted it differently.

For example (with our current implementation):

`svextq_f64(zn_f64, zm_f64, 1)` would, for each 128-bit segment of
`zn_f64`, concatenate the highest 15 bytes of this segment with the
first byte of the corresponding segment of `zm_f64`.

After this patch, the behavior of `svextq_f64(zn_f64, zm_f64, 1)` would
be, for each 128-bit vector segment of `zn_f64`, to concatenate the
higher doubleword of this segment with the lower doubleword of the
corresponding segment of `zm_f64`.

The range of the immediate argument in `svextq` would be modified such
that it is:
- [0,15] for `svextq_{s8,u8}`
- [0,7] for `svextq_{s16,u16,f16,bf16}`
- [0,3] for `svextq_{s32,u32,f32}`
- [0,1] for `svextq_{s64,u64,f64}`
2024-11-26 16:50:51 +00:00
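The element-indexed semantics described in commit 2a0162c019 can be modelled on a single 128-bit segment. This is a minimal sketch, assuming each segment is represented as a Python list of elements; the function names `svextq_imm_range` and `extq_segment` are hypothetical and do not correspond to any ACLE or LLVM API.

```python
SEGMENT_BITS = 128  # svextq operates on each 128-bit segment independently


def svextq_imm_range(element_bits):
    """Return the inclusive (low, high) immediate range for an element width."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32 or 64 bits")
    return (0, SEGMENT_BITS // element_bits - 1)


def extq_segment(zn_seg, zm_seg, imm):
    """Model the post-patch behaviour on one segment.

    The result takes the elements of zn_seg starting at index imm and
    fills the remainder from the start of zm_seg. imm counts elements,
    not bytes.
    """
    n = len(zn_seg)
    if len(zm_seg) != n:
        raise ValueError("segments must hold the same number of elements")
    if not 0 <= imm < n:
        raise ValueError("immediate out of range for this element type")
    return (zn_seg + zm_seg)[imm:imm + n]
```

With two doublewords per segment, `extq_segment(["zn_lo", "zn_hi"], ["zm_lo", "zm_hi"], 1)` yields `["zn_hi", "zm_lo"]`, matching the `svextq_f64(zn_f64, zm_f64, 1)` description in the commit message.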
Mark Goncharov
80df56e03b
Reapply "[RISCV] Implement tail call optimization in machine outliner" (#117700)
This MR fixes the failing test `CodeGen/RISCV/compress-opt-select.ll`.

It failed because of the previously merged commit `[TTI][RISCV]
Unconditionally break critical edges to sink ADDI (PR #108889)`.

The `compress-opt-select` test has therefore been regenerated.
2024-11-26 23:39:45 +08:00
tangaac
f4379db496
[LoongArch] Support LA V1.1 feature that div.w[u] and mod.w[u] instructions with inputs not sign-extended. (#116764)
Two options for clang
-mdiv32: Use div.w[u] and mod.w[u] instructions with input not
sign-extended.
-mno-div32: Do not use div.w[u] and mod.w[u] instructions with input not
sign-extended.
The default is -mno-div32.
2024-11-26 21:57:29 +08:00
Nikita Popov
5322415f92
[PowerPC] Use getSignedConstant() in SelectOptimalAddrMode()
All of these immediates are signed, as the surrounding comments
indicate. This fixes an assertion failure in
CodeGen/Generic/dag-combine-ossfuzz-crash.ll when run with a
powerpc-aix triple.
2024-11-26 14:34:30 +01:00
Mehdi Amini
f94bd3c933
Revert "[RISCV] Implement tail call optimization in machine outliner" (#117710)
Reverts llvm/llvm-project#115297
Bots are broken
2024-11-26 13:45:47 +01:00
Simon Pilgrim
90df66455b
[MCA][X86] Fix throughput of (V)PMOV extension/truncation 512-bit instructions
znver4 512-bit instructions are half rate of 128/256-bit variants (still 1uop though)

Confirmed with Agner/uops.info

Noticed while triaging #110308 and #117579
2024-11-26 12:04:19 +00:00
Simon Pilgrim
45fdb77557
[MCA][X86] Cleanup znver4 instregex patterns for (V)PMOV extension/truncation instructions
Split extension/truncation patterns to simplify matching.

Fix patterns to consistently match SSE/AVX1/AVX2 variants as well.

Add some missing src/dst type variants - there should be no difference in scheduling, it's purely based on dst reg width.

Confirmed with Agner/uops.info

Noticed while triaging #110308
2024-11-26 10:56:27 +00:00
Fraser Cormack
3414993eaf
[AMDGPU][SplitModule] Fix potential divide by zero (#117602)
A static analysis tool found that ModuleCost could be zero, so would
perform divide by zero when being printed. Perhaps this is unreachable
in practice, but the fix is straightforward enough and unlikely to be a
performance concern.
2024-11-26 10:05:09 +00:00
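The kind of guard described in commit 3414993eaf can be sketched as follows. This is not the actual patch: the function name `cost_percentage` and the choice to report 0.0 for an empty module are assumptions made for illustration.

```python
def cost_percentage(part_cost, module_cost):
    """Return part_cost as a percentage of module_cost.

    Hypothetical helper: when module_cost is zero, return 0.0 instead
    of dividing, so that printing the ratio can never fault.
    """
    if module_cost == 0:
        return 0.0
    return 100.0 * part_cost / module_cost
```

The extra comparison is cheap, which is consistent with the commit's remark that the fix is unlikely to be a performance concern.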
Mark Goncharov
29062329f3
[RISCV] Implement tail call optimization in machine outliner (#115297)
Following up on issue #89822, this patch adds the ability to use tail calls
in the machine outliner pass.
It also enables outlining of patterns with the X5 (T0) register.
2024-11-26 12:30:37 +03:00
Piotr Sobczak
a96ec01e1a
[AMDGPU] Optimize out s_barrier_signal/_wait (#116993)
Extend the optimization that converts s_barrier to wave_barrier (nop)
when the number of work items is not larger than the wave size.

This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the version where s_barrier is used in gfx12 (and later split)
already has the optimization, but some front-ends may prefer to use the
split intrinsics directly, which is the case this patch addresses.
2024-11-26 10:04:32 +01:00
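The condition under which commit a96ec01e1a treats a barrier as removable can be written as a one-line predicate. This is a minimal sketch of the stated condition only; the name `barrier_is_redundant` is hypothetical, and the real pass works on intrinsics and function attributes rather than plain integers.

```python
def barrier_is_redundant(num_work_items, wave_size):
    """Return True when the workgroup fits in a single wave.

    Per the commit message, the barrier can be turned into a
    wave_barrier (nop) when the number of work items is not larger
    than the wave size.
    """
    if num_work_items <= 0 or wave_size <= 0:
        raise ValueError("sizes must be positive")
    return num_work_items <= wave_size
```

A 64-item workgroup on a wave64 target satisfies the predicate, while a 128-item workgroup does not.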
Craig Topper
bc282605df
[SelectionDAG] Require last operand of (STRICT_)FP_ROUND to be a TargetConstant. (#117639)
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
2024-11-25 21:36:33 -08:00
Matt Arsenault
ae719f0756
AMDGPU: Add minimum3/maximum3 pkf16 for gfx950 encodings (#117601)
2024-11-25 20:02:50 -08:00
Matt Arsenault
a5174de8c2
AMDGPU: Add encodings for minimum3/maximum3 f32 for gfx950 (#117600)
2024-11-25 19:59:00 -08:00
Matt Arsenault
7fc71f7909
AMDGPU: Support buffer_atomic_pk_add_bf16 for gfx950 (#117599)
Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:54:50 -08:00
Matt Arsenault
716364ebd6
AMDGPU: Add support for v_dot2c_f32_bf16 instruction for gfx950 (#117598)
The encoding of the v_dot2c_f32_bf16 opcode is the same as v_mac_f32 in gfx90a,
both from gfx9 series. This required a new decoderNameSpace GFX950_DOT.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:51:01 -08:00
Matt Arsenault
aa7eb5723c
AMDGPU: Add support for v_dot2_f32_bf16 instruction for gfx950 (#117597)
v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16.
All three instructions were part of Dot9 instructions in the compiler.

This patch will split existing dot9 (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16)
into new dot9 (v_dot2_f16_f16 and v_dot2_bf16_bf16), and dot12 (v_dot2_f32_bf16).

All necessary changes to gfx11 and gfx12 are updated to reflect this change.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:47:48 -08:00
Matt Arsenault
5d650a62a3
AMDGPU: Add support for v_ashr_pk_i8/u8_i32 instructions for gfx950 (#117596)
This patch adds assembly and builtin support for v_ashr_pk_i8/u8_i32
instructions.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:44:47 -08:00
Matt Arsenault
a87d484a97
AMDGPU: Support v_cvt_scalef32_2xpk16_{bf|fp}6_f32 for gfx950. (#117595)
Scale packed 16-component single-precision float vectors from
two source inputs using the exponent provided by the third
single-precision float input, then convert the values to a packed
32-component FP6 float value.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:41:12 -08:00
LiqinWeng
c3377af4c3
[RISCV][CostModel] add cost for cttz/ctlz under the non-zvbb (#117515)
2024-11-26 11:40:52 +08:00
Matt Arsenault
d727b6f777
AMDGPU: MC support for v_cvt_scalef32_pk_fp4_{f|bf}16 on gfx950. (#117594)
These instructions have non-standard use of OPSEL bits to select
dest write byte. The src2_modifiers operand is used without having
its corresponding src2 operand by introducing dummy src2.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:37:04 -08:00
Matt Arsenault
c767570eb1
AMDGPU: MC support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 of gfx950. (#117593)
OPSEL[0] selects src_word to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:30:07 -08:00
Matt Arsenault
22503a9df1
AMDGPU: Support v_cvt_scalef32_pk32_{bf|f}6_{bf|fp}16 for gfx950 (#117592)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:27:01 -08:00
Matt Arsenault
658db918fe
AMDGPU: MC support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 of gfx950. (#117591)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:23:58 -08:00