llvm-project

Author	SHA1	Message	Date
Thorsten Schütt	71ac1eb509	Revert "[GlobalISel] Combine [s,z]ext of undef into 0" (#118746 ) Reverts llvm/llvm-project#117439	2024-12-05 07:48:20 +01:00
Philip Reames	1ef9410a96	Revert "[AMDGPU] Infer amdgpu-no-flat-scratch-init attribute in AMDGPUAttributor (#94647 )" This reverts commit e6aec2c12095cc7debd1a8004c8535eef41f4c36. Commit breaks "ninja check-llvm" on x86 host.	2024-12-04 15:37:25 -08:00
Jun Wang	e6aec2c120	[AMDGPU] Infer amdgpu-no-flat-scratch-init attribute in AMDGPUAttributor (#94647 ) The AMDGPUAnnotateKernelFeatures pass infers the "amdgpu-calls" and "amdgpu-stack-objects" attributes, which are used to infer whether we need to initialize flat scratch. This is, however, not precise. Instead, we should use AMDGPUAttributor and infer amdgpu-no-flat-scratch-init on kernels. Refer to https://github.com/llvm/llvm-project/issues/63586 .	2024-12-04 14:10:15 -08:00
Matt Arsenault	e0f52538c9	AMDGPU: Change bitop3 intrinsic operand to i32 (#118647 )	2024-12-04 15:44:04 -05:00
Krzysztof Drewniak	87c21bf064	[AMDGPU] Preserve `noundef` and `range` during kernel argument loads (#118395 ) This commit ensures that noundef (which is frequently a prerequisite for other annotations) and range() annotations on kernel arguments are copied onto their corresponding load from the kernel argument structure.	2024-12-04 11:04:03 -06:00
Mariusz Sikora	455b4fd01a	[AMDGPU] Emit amdgcn.if.break in the same BB as amdgcn.loop (#118081 ) Before this change, if.break was placed in the wrong loop level, which resulted in accumulating values only from the last iteration of the inner loop.	2024-12-04 08:42:04 +01:00
Shilei Tian	68bcba6d7a	Revert "[AMDGPU] Use COV6 by default (#118515 )" This reverts commit 410cbe3cf28913cca2fc61b3437306b841d08172 because some buildbots are not ready yet.	2024-12-03 20:17:06 -05:00
Shilei Tian	410cbe3cf2	[AMDGPU] Use COV6 by default (#118515 )	2024-12-03 19:38:35 -05:00
Petar Avramovic	fef54d0393	AMDGPU/GlobalISel: Add skeletons for new register bank select passes (#112862 ) New register bank select for AMDGPU will be split into two passes: - AMDGPURegBankSelect: select banks based on machine uniformity analysis - AMDGPURegBankLegalize: lower instructions that can't be inst-selected with register banks assigned by AMDGPURegBankSelect. AMDGPURegBankLegalize is similar to the legalizer but with the context of uniformity analysis. It does not change already assigned banks. The main goal of AMDGPURegBankLegalize is to provide a high-level, table-like overview of how to lower generic instructions based on available target features and uniformity info (uniform vs divergent). See RegBankLegalizeRules. Summary of new features: At the moment, register bank select assigns a register bank to the output register using a simple algorithm: - if one of the inputs is vgpr, the output is vgpr - if all inputs are sgpr, the output is sgpr. When a function does not contain divergent control flow, propagating register banks like this works. In general, the first point is still correct, but the second is not when the function contains divergent control flow. Examples: - Phi with uniform inputs that go through a divergent branch - Instruction with temporal divergent use. To fix this, AMDGPURegBankSelect will use machine uniformity analysis to assign vgpr to each divergent and sgpr to each uniform instruction. But some instructions are only available on VALU (for example, floating point instructions before gfx1150), and we need to assign vgpr to them. Since we are no longer propagating register banks, we need to ensure that uniform instructions get their inputs in sgpr in some way. In AMDGPURegBankLegalize, uniform instructions that are only available on VALU will be reassigned to vgpr on all operands, with a read-any-lane from the vgpr output to the original sgpr output.	2024-12-03 16:02:00 -05:00
Thorsten Schütt	45162635bf	[GlobalISel] Combine [s,z]ext of undef into 0 (#117439 ) Alternative for https://github.com/llvm/llvm-project/pull/113764 It builds on a minimalistic approach with the legality check in match and a blind apply. The precise patterns are used for better compile-time and modularity. It also moves the pattern check into the combiner, whereas unary_undef_to_zero and propagate_undef_any_op rely on custom C++ code for pattern matching. Is there a limit on the number of patterns? G_ANYEXT of undef -> undef; G_SEXT of undef -> 0; G_ZEXT of undef -> 0. The combine is not a member of the post legalizer combiner for AArch64. Test: llvm/test/CodeGen/AArch64/GlobalISel/combine-cast.mir	2024-12-03 07:14:49 +01:00
Matt Arsenault	b80a157d12	AMDGPU: Add codegen support for gfx950 v_ashr_pk_i8/u8_i32 (#118304 ) Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-12-02 19:23:12 -05:00
Matt Arsenault	15676ec552	AMDGPU: Add support for V_CVT_PK_F16_F32 instruction for gfx950 (#118300 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-12-02 16:04:24 -05:00
Matt Arsenault	468fb5fc7e	RegisterCoalescer: Set undef on full register uses when coalescing implicit_def (#118321 ) Previously this would delete the IMPLICIT_DEF and not introduce the undef flag on the use operand. Fixes sub-issue found while reducing #109294	2024-12-02 14:43:04 -05:00
Matt Arsenault	a796f597cd	AMDGPU: Allow f16/bf16 for DS_READ_TR16_B64 gfx950 builtins (#118297 ) Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-12-02 14:40:36 -05:00
Brox Chen	94316dd541	[AMDGPU][True16][CodeGen] saddsat/ssubsat sdag for true16 format (#118245 ) saddsat and ssubsat SDAG codeGen pattern for True16 format	2024-12-02 14:20:59 -05:00
Brox Chen	40fb74a8ff	[AMDGPU][True16][CodeGen] V_MUL_LO_U16 true16 test (#118118 ) This is an NFC change. Update and enable V_MUL_LO_U16 codegen test for true16/fake16 flow	2024-12-02 10:09:02 -05:00
Matt Arsenault	39337ff2dc	AMDGPU: Handle cvt_scale F32/F16->F4/F8 gfx950 hazard (#117844 ) gfx950 SP changes doc says: No 4 clk forwarding on opcodes that convert from F32/F16->F8 or F32/F16->F4. Must insert a NOP or instruction writing some other destination VREG after a conversion to F4/F8 since it writes either low/high half or bytes. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com> Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>	2024-12-02 09:23:17 -05:00
Matt Arsenault	92ba7e3973	AMDGPU/GlobalISel: Do not try to form v_bitop3_b32 for SGPR results (#117940 )	2024-11-30 20:21:20 -05:00
Christudasan Devadasan	c5ab28a42d	[AMDGPU][NewPM] Port SIOptimizeVGPRLiveRange pass to NPM. (#117686 )	2024-11-29 09:11:24 +05:30
Matt Arsenault	26fd693b97	RegisterCoalescer: Fix creating full / empty subrange on undef subreg use (#117936 )	2024-11-28 11:12:19 -05:00
Petar Avramovic	87503fa51c	Revert "AMDGPU/GlobalISel: Add stub custom regbankselect pass" (#113913 ) This reverts commit e9c49901a43f5b16c3df416460b7e4dbdd24ce03. Current AMDGPURegBankSelect does nothing different than RegBankSelect. Revert to using generic RegBankSelect in preparation for adding new regbankselect passes. New AMDGPURegBankSelect, that will use uniformity analysis for regbank select decisions, will not subclass RegBankSelect. Revert regression tests to use regbankselect since amdgpu-regbankselect will be used by new pass and behavior will be different.	2024-11-27 13:16:22 -05:00
Matt Arsenault	b4a16a78c2	AMDGPU: Match and Select BITOP3 on gfx950 (#117843 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2024-11-27 01:31:19 -05:00
Matt Arsenault	6934870a13	AMDGPU: Remove FeatureCvtFP8VOP1Bug from gfx950 (#117827 )	2024-11-27 01:28:09 -05:00
Matt Arsenault	5615657209	AMDGPU: Builtin & CodeGen support for v_cvt_sr_{bf16|f16}_f32 instructions (#117824 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:37:05 -05:00
Matt Arsenault	62dc8f3069	AMDGPU: Add builtins & codegen support for bitop3_b{16|32} of gfx950. (#117823 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 23:33:07 -05:00
Matt Arsenault	142b33c58b	AMDGPU: Allocate different registers for vdst & src in v_cvt_scalef32* (#117822 ) For multipass instructions, overlap on VDST and SRC’s would result in HW race & undefined results. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 23:29:11 -05:00
Matt Arsenault	265e209ceb	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_{bf8|fp8}_{f16|bf16|f32} (#117821 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:24:01 -05:00
Matt Arsenault	301c8e6047	AMDGPU: Add support for v_cvt_scalef32_sr instructions (#117820 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 23:20:16 -05:00
Matt Arsenault	76715787f4	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_pk_fp4 instructions (#117798 ) Co-authored-by: Shilei Tian <shilei.tian@amd.com>	2024-11-26 19:59:14 -05:00
Matt Arsenault	c8ee1ee057	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk_fp4_{f|bf}16 for gfx950 (#117794 ) These instructions have non-standard use of OPSEL bits to select dest write byte. The src2_modifiers operand is used without having its corresponding src2 operand by introducing dummy src2. OPSEL ASM Syntax: opsel:[a,b,c,d] where a & b are meaningless and c & d together decide the byte to write in the dst reg. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:38:23 -05:00
Matt Arsenault	065dc93d96	AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 for gfx950 (#117793 ) OPSEL[0] selects src_word to read. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:35:18 -05:00
Matt Arsenault	991dcbc468	AMDGPU: Builtin & codegen support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 for gfx950 (#117747 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:30:04 -05:00
Matt Arsenault	0f4fcca546	AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk32_f32_[fp|bf]6 for gfx950 (#117745 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:26:07 -05:00
Matt Arsenault	eeb76880f3	AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{f|bf}16_fp4 for gfx950 (#117744 ) OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4 : opsel:[x,y,z] where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read. Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:23:15 -05:00
Matt Arsenault	2b9e947d43	AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743 ) OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z] where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read. OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d] where, c & d i.e. OPSEL[3 : 2] selects which dst_byte to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:20:09 -05:00
Matt Arsenault	4527894143	Builtins & Codegen support for v_cvt_scalef32_pk_{fp|bf}8_{f|bf}16 for gfx950 (#117742 ) OPSEL[3] determines low/high 16 bits of word to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:16:08 -05:00
Matt Arsenault	62584f32eb	AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8|bf8} for gfx950 (#117741 ) OPSEL[0] determines low/high 16 bits of src0 to read. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 19:12:18 -05:00
Matt Arsenault	803bd812b1	AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8|bf8}_f32 for gfx950 (#117740 ) OPSEL[3] determines low/high 16 bits of word to write. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 14:57:09 -05:00
Matt Arsenault	815069c701	AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16|f32]_[bf8|fp8] (#117739 ) OPSEL[1:0] collectively decide which byte to read from src input. Builtin takes additional imm argument which represents index (with valid values:[0:3]) of src byte read. Out of bounds checks will be added in the next patch. OPSEL ASM Syntax: opsel:[x,y,z] where, opsel[x] = Inst{11} = src0_modifier{2}, opsel[y] = Inst{12} = src1_modifier{2}, opsel[z] = Inst{14} = src0_modifier{3}. Note: Inst{13} i.e. OPSEL[2] is ignored in asm syntax and opsel[z] is meaningless for v_cvt_scalef32_f32_{fp|bf}8 Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-26 14:54:10 -05:00
Matt Arsenault	7221bc74bc	AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738 )	2024-11-26 14:51:05 -05:00
Matt Arsenault	f5e92eb04b	AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737 )	2024-11-26 14:47:52 -05:00
Matt Arsenault	e57b327be2	AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634 ) Select to minimum3/maximum3. Leave f16/v2f16 for later since it's complicated by only having the vector version.	2024-11-26 14:44:09 -05:00
Matt Arsenault	5a3299a684	AMDGPU: Remove some -verify-machineinstrs from tests (#117736 ) We should leave these for EXPENSIVE_CHECKS builds. Some of these were near the top of slowest tests.	2024-11-26 12:59:15 -05:00
Piotr Sobczak	a96ec01e1a	[AMDGPU] Optimize out s_barrier_signal/_wait (#116993 ) Extend the optimization that converts s_barrier to wave_barrier (nop) when the number of work items is not larger than wave size. This handles the "split barrier" form of s_barrier where the barrier is represented by separate intrinsics (s_barrier_signal/s_barrier_wait). Note: the version where s_barrier is used in gfx12 (and later split) has the optimization already, but some front-ends may prefer to use split intrinsics and this is being addressed by the patch.	2024-11-26 10:04:32 +01:00
Matt Arsenault	7fc71f7909	AMDGPU: Support buffer_atomic_pk_add_bf16 for gfx950 (#117599 ) Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:54:50 -08:00
Matt Arsenault	716364ebd6	AMDGPU: Add support for v_dot2c_f32_bf16 instruction for gfx950 (#117598 ) The encoding of v_dot2c_f32_bf16 opcode is same as v_mac_f32 in gfx90a, both from gfx9 series. This required a new decoderNameSpace GFX950_DOT. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:51:01 -08:00
Matt Arsenault	aa7eb5723c	AMDGPU: Add support for v_dot2_f32_bf16 instruction for gfx950 (#117597 ) v_dot2_f32_bf16 was added in gfx11 along with v_dot2_f16_f16 and v_dot2_bf16_bf16. All three instructions were part of Dot9 instructions in the compiler. This patch will split existing dot9 (v_dot2_f16_f16, v_dot2_bf16_bf16, v_dot2_f32_bf16) into new dot9 (v_dot2_f16_f16 and v_dot2_bf16_bf16), and dot12 (v_dot2_f32_bf16). All necessary changes to gfx11 and gfx12 are updated to reflect this change. Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>	2024-11-25 19:47:48 -08:00
Matt Arsenault	a87d484a97	AMDGPU: Support v_cvt_scalef32_2xpk16_{bf|fp}6_f32 for gfx950. (#117595 ) Scale packed 16-component single-precision float vectors from two source inputs using the exponent provided by the third single-precision float input, then convert the values to a packed 32-component FP6 float value. Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-25 19:41:12 -08:00
Matt Arsenault	22503a9df1	AMDGPU: Support v_cvt_scalef32_pk32_{bf|f}6_{bf|fp}16 for gfx950 (#117592 ) Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>	2024-11-25 19:27:01 -08:00
Vikash Gupta	0a140c4248	[AMDGPU] Adds pre-commit test for fmul-select combine (#111107 ) This adds the f32/f64/f16/bf16 test cases for the pattern `fmul x, select(y, A, B)` with just one use of the select instruction. It acts as pre-commit tests for DAG-combining the above pattern into a cheaper ldexp in the case of non-inline 32-bit constants. (#111109)	2024-11-25 10:03:31 -08:00

8045 Commits