837 Commits

Author SHA1 Message Date
Florian Hahn
7954a0514b
[Clang] Enable -fpointer-tbaa by default. (#117244)
Support for more precise TBAA metadata has been added a while ago
(behind the -fpointer-tbaa flag). The more precise TBAA metadata allows
treating accesses of different pointer types as no-alias.

This helps to remove more redundant loads and stores in a number of
workloads.

Some highlights on the impact across llvm-test-suite's MultiSource,
SPEC2006 & SPEC2017 include:
 * +2% more NoAlias results for memory accesses
 * +3% more stores removed by DSE,
 * +4% more loops vectorized.

This closes a relatively big gap to GCC, which has been supporting
disambiguating based on pointer types for a long time.
(https://clang.godbolt.org/z/K7Wbhrz4q)

Pointer-TBAA support for pointers to builtin types has been added in
https://github.com/llvm/llvm-project/pull/76612.

Support for user-defined types has been added in
https://github.com/llvm/llvm-project/pull/110569.

There are 2 recent PRs with bug fixes for special cases uncovered during
testing:
 * https://github.com/llvm/llvm-project/pull/116991
 * https://github.com/llvm/llvm-project/pull/116596

PR: https://github.com/llvm/llvm-project/pull/117244
2024-12-04 20:55:18 +00:00
Matt Arsenault
e0f52538c9
AMDGPU: Change bitop3 intrinsic operand to i32 (#118647) 2024-12-04 15:44:04 -05:00
Shilei Tian
68bcba6d7a Revert "[AMDGPU] Use COV6 by default (#118515)"
This reverts commit 410cbe3cf28913cca2fc61b3437306b841d08172 because some
buildbots are not ready yet.
2024-12-03 20:17:06 -05:00
Shilei Tian
410cbe3cf2
[AMDGPU] Use COV6 by default (#118515) 2024-12-03 19:38:35 -05:00
Matt Arsenault
a796f597cd
AMDGPU: Allow f16/bf16 for DS_READ_TR16_B64 gfx950 builtins (#118297)
Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-12-02 14:40:36 -05:00
Matt Arsenault
a2c3e0c4cb
AMDGPU/clang: Add global_load_lds size check support for gfx950 (#117825)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:41:09 -05:00
Matt Arsenault
5615657209
AMDGPU: Builtin & CodeGen support for v_cvt_sr_{bf16|f16}_f32 instructions (#117824)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:37:05 -05:00
Matt Arsenault
62dc8f3069
AMDGPU: Add builtins & codegen support for bitop3_b{16|32} of gfx950. (#117823)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 23:33:07 -05:00
Matt Arsenault
265e209ceb
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_{bf8|fp8}_{f16|bf16|f32} (#117821)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:24:01 -05:00
Matt Arsenault
301c8e6047
AMDGPU: Add support for v_cvt_scalef32_sr instructions (#117820)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:20:16 -05:00
Matt Arsenault
76715787f4
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_pk_fp4 instructions (#117798)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:59:14 -05:00
Matt Arsenault
c8ee1ee057
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk_fp4_{f|bf}16 for gfx950 (#117794)
These instructions have non-standard use of OPSEL bits to select
dest write byte. The src2_modifiers operand is used without having
its corresponding src2 operand by introducing dummy src2.

OPSEL ASM OPSEL Syntax: opsel:[a,b,c,d]
a & b are meaningless, c & d together decides byte to write in dst reg.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:38:23 -05:00
Matt Arsenault
065dc93d96
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 for gfx950 (#117793)
OPSEL[0] selects src_word to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:35:18 -05:00
Matt Arsenault
991dcbc468
AMDGPU: Builtin & codegen support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 for gfx950 (#117747)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:30:04 -05:00
Matt Arsenault
0f4fcca546
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk32_f32_[fp|bf]6 for gfx950 (#117745)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:26:07 -05:00
Matt Arsenault
eeb76880f3
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{f|bf}16_fp4 for gfx950 (#117744)
OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:23:15 -05:00
Matt Arsenault
2b9e947d43
AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743)
OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d]
where, c & d i.e. OPSEL[3 : 2] selects which dst_byte  to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:20:09 -05:00
Matt Arsenault
4527894143
Builtins & Codegen support for v_cvt_scalef32_pk_{fp|bf}8_{f|bf}16 for gfx950 (#117742)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:16:08 -05:00
Matt Arsenault
62584f32eb
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8|bf8} for gfx950 (#117741)
OPSEL[0] determines low/high 16 bits of src0 to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:12:18 -05:00
Matt Arsenault
803bd812b1
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8|bf8}_f32 for gfx950 (#117740)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:57:09 -05:00
Matt Arsenault
815069c701
AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16|f32]_[bf8|fp8] (#117739)
OPSEL[1:0] collectively decide which byte to read
from src input.

Builtin takes additional imm argument which
represents index (with valid values:[0:3]) of src
byte read. Out of bounds checks will added in next
patch.

OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}

Note: Inst{13} i.e. OPSEL[2] is ignored in
asm syntax and opsel[z] is meaningless
for v_cvt_scalef32_f32_{fp|bf}8

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:54:10 -05:00
Matt Arsenault
7fc71f7909
AMDGPU: Support buffer_atomic_pk_add_bf16 for gfx950 (#117599)
Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:54:50 -08:00
Matt Arsenault
716364ebd6
AMDGPU: Add support for v_dot2c_f32_bf16 instruction for gfx950 (#117598)
The encoding of v_dot2c_f32_bf16 opcode is same as v_mac_f32 in gfx90a,
both from gfx9 series. This required a new decoderNameSpace GFX950_DOT.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:51:01 -08:00
Matt Arsenault
aa7eb5723c
AMDGPU: Add support for v_dot2_f32_bf16 instruction for gfx950 (#117597)
v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16.
All three instructions were part of Dot9 instructions in the compiler.

This patch will split existing dot9 (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16)
into new dot9 (v_dot2_f16_f16 and v_dot2_bf16_bf16), and dot12 (v_dot2_f32_bf16).

All necessary changes to gfx11 and gfx12 are updated to reflect this change.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:47:48 -08:00
Matt Arsenault
5d650a62a3
AMDGPU: Add support for v_ashr_pk_i8/u8_i32 instructions for gfx950 (#117596)
This patch adds assembly and builtin support for v_ashr_pk_i8/u8_i32
instructions.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 19:44:47 -08:00
Matt Arsenault
a87d484a97
AMDGPU: Support v_cvt_scalef32_2xpk16_{bf|fp}6_f32 for gfx950. (#117595)
Scale packed 16-component single-precision float vectors from
two  source inputs using the exponent provided by the third
single-precision float input, then convert the values to a packed
32-component FP6 float value.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:41:12 -08:00
Matt Arsenault
22503a9df1
AMDGPU: Support v_cvt_scalef32_pk32_{bf|f}6_{bf|fp}16 for gfx950 (#117592)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-25 19:27:01 -08:00
Matt Arsenault
e97fb2207e
AMDGPU: Add support for load transpose instructions for gfx950 (#117378)
This patch support for intrinsics in clang, as well as assembly
instructions in the backend.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 09:39:04 -08:00
Alex Voicu
48ec59c234
[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early (#114481)
Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that it's
concrete value is used throughout subsequent optimisation passes.
2024-11-25 10:29:50 +00:00
Matt Arsenault
d1cca3133a
AMDGPU: Add v_permlane16_swap_b32 and v_permlane32_swap_b32 for gfx950 (#117260)
This was a bit annoying because these introduce a new special case
encoding usage. op_sel is repurposed as a subset of dpp controls,
and is eligible for VOP3->VOP1 shrinking. For some reason fi also
uses an enum value, so we need to convert the raw boolean to 1 instead
of -1.

The 2 registers are swapped, so this has 2 defs. Ideally the builtin
would return a pair, but that's difficult so return a vector instead.
This would make a hypothetical builtin that supports v2f16 directly
uglier.
2024-11-22 20:12:50 -08:00
Matt Arsenault
7d544c64e3
AMDGPU: Add v_smfmac_f32_32x32x64_fp8_fp8 for gfx950 (#117259) 2024-11-22 12:11:06 -08:00
Matt Arsenault
90dc644d73
AMDGPU: Add v_smfmac_f32_32x32x32x64_fp8_bf8 for gfx950 (#117258) 2024-11-22 12:08:15 -08:00
Matt Arsenault
8d3435f8a1
AMDGPU: Add v_smfmac_f32_32x32x64_bf8_fp8 for gfx950 (#117257) 2024-11-22 12:02:18 -08:00
Matt Arsenault
8a5c24149d
AMDGPU: Add v_smfmac_f32_32x32x64_bf8_bf8 for gfx950 (#117256) 2024-11-22 11:59:06 -08:00
Matt Arsenault
836d2dcf60
AMDGPU: Add v_smfmac_f32_16x16x128_fp8_fp8 for gfx950 (#117235) 2024-11-21 17:06:06 -08:00
Matt Arsenault
33124910c9
AMDGPU: Add v_smfmac_f32_16x16x128_fp8_bf8 for gfx950 (#117234) 2024-11-21 17:03:03 -08:00
Matt Arsenault
3678f8a8aa
AMDGPU: Add v_smfmac_f32_16x16x128_bf8_fp8 for gfx950 (#117233) 2024-11-21 17:00:08 -08:00
Matt Arsenault
7baadb2a4e
AMDGPU: Add v_smfmac_f32_16x16x128_bf8_bf8 for gfx950 (#117232) 2024-11-21 16:57:01 -08:00
Matt Arsenault
3e6f3508ad
AMDGPU: Add v_smfmac_i32_32x32x64_i8 for gfx950 (#117214) 2024-11-21 15:01:03 -08:00
Matt Arsenault
8c53036146
AMDGPU: Add v_smfmac_i32_16x16x128_i8 for gfx950 (#117213) 2024-11-21 14:58:11 -08:00
Matt Arsenault
42dd114a46
AMDGPU: Add v_smfmac_f32_32x32x32_bf16 for gfx950 (#117212) 2024-11-21 14:52:11 -08:00
Matt Arsenault
95ddc1a63b
AMDGPU: Add v_smfmac_f32_16x16x64_bf16 for gfx950 (#117211) 2024-11-21 14:46:43 -08:00
Matt Arsenault
e50eaa2cf1
AMDGPU: Add v_smfmac_f32_32x32x32_f16 for gfx950 (#117205) 2024-11-21 14:43:33 -08:00
Matt Arsenault
2ab178820b
AMDGPU: Add v_smfmac_f32_16x16x64_f16 for gfx950 (#117202) 2024-11-21 14:40:30 -08:00
Matt Arsenault
1c47d67abc
AMDGPU: Add v_mfma_f32_16x16x32_bf16 for gfx950 (#117053) 2024-11-21 14:28:05 -08:00
Matt Arsenault
f4ed79b160
AMDGPU: Add v_mfma_i32_32x32x32_i8 for gfx950 (#117052) 2024-11-21 09:08:15 -08:00
Matt Arsenault
0a6e8741dd
AMDGPU: Shrink used number of registers for mfma scale based on format (#117047)
Currently the builtins assume you are using an 8-bit format that requires
an 8 element vector. We can shrink the number of registers if the format
requires 4 or 6.
2024-11-21 09:08:05 -08:00
Matt Arsenault
76b24640e5
AMDGPU: Add v_mfma_i32_16x16x64_i8 for gfx950 (#116728) 2024-11-21 09:02:12 -08:00
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00
Haopeng Liu
4d6e69143d
Add the initializes attribute inference (#117104)
reland https://github.com/llvm/llvm-project/pull/97373 after fixing
clang tests.

Confirmed with "ninja check-llvm" and "ninja check-clang"
2024-11-20 19:15:23 -08:00