llvm-project

Author	SHA1	Message	Date
Petar Avramovic	87503fa51c	Revert "AMDGPU/GlobalISel: Add stub custom regbankselect pass" (#113913 ) This reverts commit e9c49901a43f5b16c3df416460b7e4dbdd24ce03. Current AMDGPURegBankSelect does nothing different then RegBankSelect. Revert to using generic RegBankSelect in preparation for adding new regbankselect passes. New AMDGPURegBankSelect, that will use uniformity analysis for regbank select decisions, will not subclass RegBankSelect. Revert regression tests to use regbankselect since amdgpu-regbankselect will be used by new pass and behavior will be different.	2024-11-27 13:16:22 -05:00
Matt Arsenault	b4a16a78c2	AMDGPU: Match and Select BITOP3 on gfx950 (#117843 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-11-27 01:31:19 -05:00
Matt Arsenault	6934870a13	AMDGPU: Remove FeatureCvtFP8VOP1Bug from gfx950 (#117827 )	2024-11-27 01:28:09 -05:00
Matt Arsenault	5615657209	AMDGPU: Builtin & CodeGen support for v_cvt_sr_{bf16\|f16}_f32 instructions (#117824 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:37:05 -05:00
Matt Arsenault	62dc8f3069	AMDGPU: Add builtins & codegen support for bitop3_b{16\|32} of gfx950. (#117823 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 23:33:07 -05:00
Matt Arsenault	142b33c58b	AMDGPU: Allocate different registers for vdst & src in v_cvt_scalef32* (#117822 ) For multipass instructions, overlap on VDST and SRC’s would result in HW race & undefined results. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 23:29:11 -05:00
Matt Arsenault	265e209ceb	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_{bf8\|fp8}_{f16\|bf16\|f32} (#117821 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:24:01 -05:00
Matt Arsenault	301c8e6047	AMDGPU: Add support for v_cvt_scalef32_sr instructions (#117820 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:20:16 -05:00
Matt Arsenault	76715787f4	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_pk_fp4 instructions (#117798 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 19:59:14 -05:00
Matt Arsenault	c8ee1ee057	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk_fp4_{f\|bf}16 for gfx950 (#117794 ) These instructions have non-standard use of OPSEL bits to select dest write byte. The src2_modifiers operand is used without having its corresponding src2 operand by introducing dummy src2. OPSEL ASM OPSEL Syntax: opsel:[a,b,c,d] a & b are meaningless, c & d together decides byte to write in dst reg. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:38:23 -05:00
Matt Arsenault	065dc93d96	AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{bf\|f}16_{bf\|fp}8 for gfx950 (#117793 ) OPSEL[0] selects src_word to read. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:35:18 -05:00
Matt Arsenault	991dcbc468	AMDGPU: Builtin & codegen support for v_cvt_scalef32_pk32_{bf\|f}16_{bf\|fp}6 for gfx950 (#117747 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:30:04 -05:00
Matt Arsenault	0f4fcca546	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk32_f32_[fp\|bf]6 for gfx950 (#117745 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:26:07 -05:00
Matt Arsenault	eeb76880f3	AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{f\|bf}16_fp4 for gfx950 (#117744 ) OPSEL ASM Syntax for v_cvt_scalef32_pk_{f\|bf}16_fp4 : opsel:[x,y,z] where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read. Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:23:15 -05:00
Matt Arsenault	2b9e947d43	AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743 ) OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z] where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read. OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d] where, c & d i.e. OPSEL[3 : 2] selects which dst_byte to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:20:09 -05:00
Matt Arsenault	4527894143	Builtins & Codegen support for v_cvt_scalef32_pk_{fp\|bf}8_{f\|bf}16 for gfx950 (#117742 ) OPSEL[3] determines low/high 16 bits of word to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:16:08 -05:00
Matt Arsenault	62584f32eb	AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8\|bf8} for gfx950 (#117741 ) OPSEL[0] determines low/high 16 bits of src0 to read. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:12:18 -05:00
Matt Arsenault	803bd812b1	AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8\|bf8}_f32 for gfx950 (#117740 ) OPSEL[3] determines low/high 16 bits of word to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 14:57:09 -05:00
Matt Arsenault	815069c701	AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16\|f32]_[bf8\|fp8] (#117739 ) OPSEL[1:0] collectively decide which byte to read from src input. Builtin takes additional imm argument which represents index (with valid values:[0:3]) of src byte read. Out of bounds checks will added in next patch. OPSEL ASM Syntax: opsel:[x,y,z] where, opsel[x] = Inst{11} = src0_modifier{2} opsel[y] = Inst{12} = src1_modifier{2} opsel[z] = Inst{14} = src0_modifier{3} Note: Inst{13} i.e. OPSEL[2] is ignored in asm syntax and opsel[z] is meaningless for v_cvt_scalef32_f32_{fp\|bf}8 Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 14:54:10 -05:00
Matt Arsenault	7221bc74bc	AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738 )	2024-11-26 14:51:05 -05:00
Matt Arsenault	f5e92eb04b	AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737 )	2024-11-26 14:47:52 -05:00
Matt Arsenault	e57b327be2	AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634 ) Select to minimum3/maximum3. Leave f16/v2f16 for later since it's complicated by only having the vector version.	2024-11-26 14:44:09 -05:00
Matt Arsenault	5a3299a684	AMDGPU: Remove some -verify-machineinstrs from tests (#117736 ) We should leave these for EXPENSIVE_CHECKS builds. Some of these were near the top of slowest tests.	2024-11-26 12:59:15 -05:00
Piotr Sobczak	a96ec01e1a	[AMDGPU] Optimize out s_barrier_signal/_wait (#116993 ) Extend the optimization that converts s_barrier to wave_barrier (nop) when the number of work items is not larger than wave size. This handles the "split barrier" form of s_barrier where the barrier is represented by separate intrinsics (s_barrier_signal/s_barrier_wait). Note: the version where s_barrier is used in gfx12 (and later split) has the optimization already, but some front-ends may prefer to use split intrinsics and this is being addressed by the patch.	2024-11-26 10:04:32 +01:00
Matt Arsenault	7fc71f7909	AMDGPU: Support buffer_atomic_pk_add_bf16 for gfx950 (#117599 ) Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:54:50 -08:00
Matt Arsenault	716364ebd6	AMDGPU: Add support for v_dot2c_f32_bf16 instruction for gfx950 (#117598 ) The encoding of v_dot2c_f32_bf16 opcode is same as v_mac_f32 in gfx90a, both from gfx9 series. This required a new decoderNameSpace GFX950_DOT. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:51:01 -08:00
Matt Arsenault	aa7eb5723c	AMDGPU: Add support for v_dot2_f32_bf16 instruction for gfx950 (#117597 ) v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16. All three instructions were part of Dot9 instructions in the compiler. This patch will split existing dot9 (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16) into new dot9 (v_dot2_f16_f16 and v_dot2_bf16_bf16), and dot12 (v_dot2_f32_bf16). All necessary changes to gfx11 and gfx12 are updated to reflect this change. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:47:48 -08:00
Matt Arsenault	a87d484a97	AMDGPU: Support v_cvt_scalef32_2xpk16_{bf\|fp}6_f32 for gfx950. (#117595 ) Scale packed 16-component single-precision float vectors from two source inputs using the exponent provided by the third single-precision float input, then convert the values to a packed 32-component FP6 float value. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-25 19:41:12 -08:00
Matt Arsenault	22503a9df1	AMDGPU: Support v_cvt_scalef32_pk32_{bf\|f}6_{bf\|fp}16 for gfx950 (#117592 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-25 19:27:01 -08:00
Vikash Gupta	0a140c4248	[AMDGPU] Adds pre-commit test for fmul-select combine (#111107 ) This adds the f32/f64/f16/bf16 test cases for below pattern : `fmul x, select(y, A, B)` with just one use of select Inst above. It acts as pre-commit tests for dagCombining above pattern into cheaper ldexp in case of non-inlline 32 bit-constants. (#111109)	2024-11-25 10:03:31 -08:00
Matt Arsenault	e97fb2207e	AMDGPU: Add support for load transpose instructions for gfx950 (#117378 ) This patch support for intrinsics in clang, as well as assembly instructions in the backend. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 09:39:04 -08:00
Matt Arsenault	27a8afa3fc	AMDGPU: Handle gfx950 valu write vdst + permlane read hazard (#117287 )	2024-11-25 09:33:04 -08:00
Matt Arsenault	c3fe5ad6be	AMDGPU: Handle vcmpx+permalane gfx950 hazard (#117286 ) Confusingly, this is a different hazard to the one on gfx10 with a subtarget feature.	2024-11-25 09:27:53 -08:00
Matt Arsenault	3db4f5b0da	AMDGPU: Refine gfx950 xdl-write-vgpr hazard cases (#117285 ) The 2-pass XDL write VGPR, read by non-XDL SGEMM/DGEMM case was 1 wait state overly conservative. Previously, for gfx940, the XDL/non-XDL cases happened to have the same number of cycles in all cases. Now the XDL consumer case has an additional state for 2 pass sources.	2024-11-25 09:23:51 -08:00
Alex Voicu	48ec59c234	[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early (#114481 ) Fold `llvm.amdgcn.wavefrontsize` early, during InstCombine, so that it's concrete value is used throughout subsequent optimisation passes.	2024-11-25 10:29:50 +00:00
Matt Arsenault	85601fd78f	AMDGPU: Handle v_mfma_f64_16x16x4_f64 write VGPR read srca/srcb hazard change for gfx950 (#117284 ) Increase in wait states from 11 to 19. The index for smfmac counts as like srcA/srcB.	2024-11-22 20:30:06 -08:00
Matt Arsenault	db08d78c3e	AMDGPU: Handle v_mfma_f64_16x16x4_f64 srcc write VGPR hazard change for gfx950 (#117283 ) Read by sgemm/dgemm in srcc after v_mfma_f64_16x16x4_f64 increases from 9 to 17 wait states.	2024-11-22 20:26:58 -08:00
Matt Arsenault	8cb6c9907c	AMDGPU: Handle gfx950 XDL-write-overlapped-smfma-src-c wait state change (#117263 ) These have an additional wait state compared to gfx940.	2024-11-22 20:23:46 -08:00
Matt Arsenault	b078b882b9	AMDGPU: Handle gfx950 change in mfma_f64_16x16x4 + valu hazard (#117262 ) Increase from 11 wait states to 19	2024-11-22 20:20:23 -08:00
Matt Arsenault	33c2b20f3d	AMDGPU: Define new sched model for gfx950 (#117261 ) A few instructions changed rate.	2024-11-22 20:17:52 -08:00
Matt Arsenault	d1cca3133a	AMDGPU: Add v_permlane16_swap_b32 and v_permlane32_swap_b32 for gfx950 (#117260 ) This was a bit annoying because these introduce a new special case encoding usage. op_sel is repurposed as a subset of dpp controls, and is eligible for VOP3->VOP1 shrinking. For some reason fi also uses an enum value, so we need to convert the raw boolean to 1 instead of -1. The 2 registers are swapped, so this has 2 defs. Ideally the builtin would return a pair, but that's difficult so return a vector instead. This would make a hypothetical builtin that supports v2f16 directly uglier.	2024-11-22 20:12:50 -08:00
Matt Arsenault	7d544c64e3	AMDGPU: Add v_smfmac_f32_32x32x64_fp8_fp8 for gfx950 (#117259 )	2024-11-22 12:11:06 -08:00
Matt Arsenault	90dc644d73	AMDGPU: Add v_smfmac_f32_32x32x32x64_fp8_bf8 for gfx950 (#117258 )	2024-11-22 12:08:15 -08:00
Matt Arsenault	8d3435f8a1	AMDGPU: Add v_smfmac_f32_32x32x64_bf8_fp8 for gfx950 (#117257 )	2024-11-22 12:02:18 -08:00
Matt Arsenault	8a5c24149d	AMDGPU: Add v_smfmac_f32_32x32x64_bf8_bf8 for gfx950 (#117256 )	2024-11-22 11:59:06 -08:00
Brox Chen	4cc278587f	[AMDGPU][True16][MC] VOPC profile fake16 pseudo update (#113175 ) Update VOPC profile with VOP3 pseudo: 1. On GFX11+, v_cmp_class_f16 has src1 type f16 for literals, however it's semantically interpreted as an integer. Update VOPC class f16 profile from operand type f16, i16 to f16, f16, currently updating it for fake16 format, and will update t16 format in the following patch. 2. 16bit V_CMP_CLASS instructions (V_CMP_**_U/I/F16) are named with `t16`, but actually using 32 bit registers. Correct it by updating the pseudo definitions with useRealTrue16/useFakeTrue16 predicates and rename these `t16` instructions to `fake16`. 3. Update the inst select so that `t16`/`fake16` instructions are selected in true16/fake16 flow. 4. The mir test file are impacted for a name change of these impacted 16 bit V_CMP instructions, but non-functional change to emitted code	2024-11-22 12:12:13 -05:00
Matt Arsenault	836d2dcf60	AMDGPU: Add v_smfmac_f32_16x16x128_fp8_fp8 for gfx950 (#117235 )	2024-11-21 17:06:06 -08:00
Matt Arsenault	33124910c9	AMDGPU: Add v_smfmac_f32_16x16x128_fp8_bf8 for gfx950 (#117234 )	2024-11-21 17:03:03 -08:00
Matt Arsenault	3678f8a8aa	AMDGPU: Add v_smfmac_f32_16x16x128_bf8_fp8 for gfx950 (#117233 )	2024-11-21 17:00:08 -08:00
Matt Arsenault	7baadb2a4e	AMDGPU: Add v_smfmac_f32_16x16x128_bf8_bf8 for gfx950 (#117232 )	2024-11-21 16:57:01 -08:00

1 2 3 4 5 ...

8025 Commits