This looks like a rather weird change, so let me explain why this isn't
as unreasonable as it looks. Let's start with the problem it's solving.
```
define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5
bb2:                                   ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5
bb5:                                   ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}
```
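For reference, the IR above corresponds roughly to C++ along these lines (an approximation for illustration, not taken from the PR):
```
int overlap_live_ranges(int *p, int x) {
  // p[1] is only loaded on the x == 1 path; otherwise the constant 13 is
  // returned, matching the phi in %bb5.
  return x == 1 ? p[1] : 13;
}
```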
Right now, we codegen this as:
```
li a3, 1
li a2, 13
bne a1, a3, .LBB0_2
lw a2, 4(a0)
.LBB0_2:
mv a0, a2
ret
```
In this example, we have two values which must be assigned to a0 per the
ABI (%arg and the return value). SelectionDAG ensures that all values
used in a successor phi are defined before exiting the predecessor block.
This creates an ADDI to materialize the immediate in the entry block.
Currently, this ADDI is not sunk into the tail block because we'd have
to split a critical edge to do so. Note that if our immediate were
anything large enough to require two instructions we *would* split this
critical edge.
Looking at other targets, we notice that they don't seem to have this
problem. They perform the sinking and tail duplication that we don't.
Why? Well, it turns out for AArch64 that this is entirely an accident of
the existence of the gpr32all register class. The immediate is
materialized into the gpr32 class, and then copied into the gpr32all
register class. The existence of that copy puts us right back into the
two-instruction case noted above.
This change essentially bypasses that emergent aspect of the AArch64
behavior, and implements the same "always sink immediates" behavior for
RISC-V as well.
We are using `PostMachineScheduler` instead of `PostRAScheduler`
since #68696.
The hook `getPostRAMutations` is only used in `PostRAScheduler` so
it is actually dead code for RISC-V now.
Add patterns to select 16b imulzu with -mapx-feature=zu, including
folding of zero-extends of the result. IsDesirableToPromoteOp is changed
to leave 16b multiplies by constant un-promoted, as imulzu will not
cause partial-write stalls.
A special case in type legalization wasn't accounting for the different
operand numbering between FLDEXP and STRICT_FLDEXP.
AArch64 already asked for STRICT_FLDEXP to be promoted, but had no test
for it.
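The difference comes down to the chain operand: strict FP nodes carry their chain as operand 0, which shifts the value and exponent operands by one. A sketch of the idea (illustrative only, not the actual legalizer code):
```
#include <utility>
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Pick out the value and exponent operands of an (STRICT_)FLDEXP node.
// The strict form has a chain at operand 0, shifting the others by one.
static std::pair<SDValue, SDValue> getFldexpOperands(SDNode *N) {
  bool IsStrict = N->getOpcode() == ISD::STRICT_FLDEXP;
  SDValue Val = N->getOperand(IsStrict ? 1 : 0);
  SDValue Exp = N->getOperand(IsStrict ? 2 : 1);
  return {Val, Exp};
}
```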
Introduces the `__builtin_hlsl_buffer_update_counter` clang builtin that is
used to implement the `IncrementCounter` and `DecrementCounter` methods
on `RWStructuredBuffer` and `RasterizerOrderedStructuredBuffer` (see
Note).
The builtin is translated to the LLVM intrinsic `llvm.dx.bufferUpdateCounter`
or `llvm.spv.bufferUpdateCounter`.
Introduces a `BuiltinTypeMethodBuilder` helper in `HLSLExternalSemaSource`
that enables adding methods to builtin types using a builder pattern like
this:
```
BuiltinTypeMethodBuilder(Sema, RecordBuilder, "MethodName", ReturnType)
.addParam("param_name", Type, InOutModifier)
.callBuiltin("builtin_name", { BuiltinParams })
.finalizeMethod();
```
Fixes #113513
[First version](llvm/llvm-project#114148) of this PR was reverted
because of a build break.
For
  add %reg1, name@GOTTPOFF(%rip), %reg2
  add name@GOTTPOFF(%rip), %reg1, %reg2
  {nf} add %reg1, name@GOTTPOFF(%rip), %reg2
  {nf} add name@GOTTPOFF(%rip), %reg1, %reg2
  {nf} add name@GOTTPOFF(%rip), %reg
add `R_X86_64_CODE_6_GOTTPOFF` = 50 if the instruction starts 6 bytes
before the relocation offset. It's similar to R_X86_64_GOTTPOFF.
The linker can treat `R_X86_64_CODE_6_GOTTPOFF` as `R_X86_64_GOTTPOFF`, or
convert the instructions above to
  add $name@tpoff, %reg1, %reg2
  add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg1, %reg2
  {nf} add $name@tpoff, %reg
if the first byte of the instruction at the relocation `offset - 6` is
`0xd5` (namely, it is encoded with a REX2 prefix), when possible.
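A rough sketch of the check a linker could apply, with hypothetical names (not lld's actual code):
```
#include <cstddef>
#include <cstdint>

// The candidate instruction is assumed to start 6 bytes before the relocation
// offset and to begin with the REX2 prefix byte 0xd5; only then may the
// GOTTPOFF load be rewritten into the TPOFF immediate form shown above.
static bool canRelaxCode6GotTpOff(const uint8_t *sectionData,
                                  size_t relocOffset) {
  if (relocOffset < 6)
    return false;
  return sectionData[relocOffset - 6] == 0xd5;
}
```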
Binutils patch:
5bc71c2a6b
Binutils mail thread:
https://sourceware.org/pipermail/binutils/2024-February/132351.html
ABI discussion:
https://groups.google.com/g/x86-64-abi/c/FhEZjCtDLFw/m/VHDjN4orAgAJ
Blog: https://kanrobert.github.io/rfc/All-about-APX-relocation
Currently, the ShaderFlagsAnalysis pass represents various module-level
properties as well as function-level properties of a DXIL module using a
single mask. However, one mask per function is needed for accurate
computation of the shader flags mask, such as for entry function metadata
creation.
This change introduces a structure that wraps a sorted vector of
function/shader-flag-mask pairs representing function properties,
instead of a single shader flag mask that represents module properties
and the properties of all functions. The result type of the
ShaderFlagsAnalysis pass is changed to the newly defined structure type
instead of a single shader flags mask.
This allows accurate computation of the shader flags of an entry function
(and of all functions in a library shader) for use during its metadata
generation (DXILTranslateMetadata pass) and for its feature flags in DX
container globals construction (DXContainerGlobals pass), based on the
shader flags masks of the functions. Note, however, that propagating such
callee-based shader flags mask computation is planned for a follow-on PR.
Consequently, this PR changes the shader flag mask computation in the
DXILTranslateMetadata and DXContainerGlobals passes to simply be the
union of the module flags and the shader flags of all functions, thereby
retaining the existing effect of using a single shader flag mask.
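A minimal sketch of what such a result type could look like (simplified, hypothetical types rather than the actual DXIL analysis code):
```
#include <cstdint>
#include <vector>

// Stand-in for the per-function shader flags result: each function gets its
// own mask, and consumers that still want one module-wide mask union them all.
struct FunctionShaderFlags {
  const void *Function; // stand-in for the llvm::Function pointer
  uint64_t Flags;       // shader flags derived from this function's body
};

struct ModuleShaderFlagsInfo {
  uint64_t ModuleFlags = 0;                       // module-level properties
  std::vector<FunctionShaderFlags> FunctionFlags; // kept sorted by function

  // Union of module flags and all per-function flags, matching the existing
  // behavior of DXILTranslateMetadata / DXContainerGlobals for now.
  uint64_t getCombinedMask() const {
    uint64_t Mask = ModuleFlags;
    for (const FunctionShaderFlags &FF : FunctionFlags)
      Mask |= FF.Flags;
    return Mask;
  }
};
```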
OPSEL ASM syntax for v_cvt_scalef32_pk_{f|bf}16_fp4: opsel:[x,y,z],
where x & y, i.e. OPSEL[1:0], select which src_byte to read.
Note: the conventional Inst{13}, i.e. OPSEL[2], is ignored in the asm syntax.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
We can move the logic from adjustStackForRVV into adjustReg, which
results in the remaining logic being trivially inlined into the two
callers and allows a duplicate copy of the same logic in
eliminateFrameIndex to be pruned.
In the DXIL CreateHandle and CreateHandleFromBinding ops, resource
bindings are indexed from the beginning of the binding space, not from
the binding itself.
Translate from an index into the binding to an index from the beginning
of the space when lowering to these operations.
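A minimal sketch of that index translation, with hypothetical names (the real change operates on DXIL resource bindings during lowering):
```
#include <cstdint>

// CreateHandle/CreateHandleFromBinding expect an index measured from the
// start of the binding space, so an index relative to the binding itself is
// offset by the binding's lower bound.
static uint32_t toSpaceRelativeIndex(uint32_t bindingLowerBound,
                                     uint32_t indexIntoBinding) {
  return bindingLowerBound + indexIntoBinding;
}
```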
OPSEL ASM syntax for v_cvt_scalef32_pk_f32_fp4: opsel:[x,y,z],
where x & y, i.e. OPSEL[1:0], select which src_byte to read.
OPSEL ASM syntax for v_cvt_scalef32_pk_fp4_f32: opsel:[a,b,c,d],
where c & d, i.e. OPSEL[3:2], select which dst_byte to write.
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
The case of a 2-pass XDL write to a VGPR that is read by a non-XDL
SGEMM/DGEMM was one wait state overly conservative. Previously, for
gfx940, the XDL and non-XDL cases happened to have the same number of
cycles in all cases. Now the XDL consumer case has an additional state
for 2-pass sources.
This is split off from #115274. There doesn't seem to be an easy way to
share this with getShuffleCost since that requires passing in a real
insert_element operand to get it to recognise it's a scalar splat.
We can't currently lower i1 vectors, so this returns an invalid cost for
them.
Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>
For IR like this:
  %icmp = icmp ult <4 x i32> %a, splat (i32 5)
  %res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp, we can take a similar approach
to what we already do for binary ops such as add, sub, etc. and convert
this into:
  %ext = extractelement <4 x i32> %a, i32 1
  %res = icmp ult i32 %ext, 5
For AArch64 targets at least, the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse the existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
Create signed constants using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().
This also touches some generic legalization code, which apparently is
only exercised by AMDGPU tests.
This will avoid assertion failures once we disable implicit truncation
in getConstant().
Inside adjustSubwordCmp() I ended up suppressing the issue with an
explicit cast, because this code deals with a mix of unsigned and signed
immediates.
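A plain C++ illustration of why negative immediates need the signed path (this is not the SelectionDAG API, just the range check in spirit):
```
#include <cassert>
#include <cstdint>

int main() {
  // getConstant() takes a uint64_t, so -1 arrives as 0xFFFFFFFFFFFFFFFF and
  // only "fits" an i32 constant if it is implicitly truncated.
  uint64_t AsUnsigned = static_cast<uint64_t>(-1);
  assert(!(AsUnsigned <= 0xFFFFFFFFu)); // does not fit 32 bits unsigned

  // Treated as a signed value, -1 is a perfectly valid i32 immediate, which
  // is what getSignedConstant() expresses.
  int64_t AsSigned = -1;
  assert(AsSigned >= INT32_MIN && AsSigned <= INT32_MAX);
  return 0;
}
```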
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
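The subset check boils down to something like this sketch (plain C++ with hypothetical names, not the actual BasicTTIImpl code):
```
#include <set>
#include <string>

// Inlining is compatible when every target feature the callee was compiled
// with is also enabled in the caller; otherwise the callee might rely on
// instructions the caller's function-level features do not guarantee.
static bool featuresAreSubset(const std::set<std::string> &CallerFeatures,
                              const std::set<std::string> &CalleeFeatures) {
  for (const std::string &Feature : CalleeFeatures)
    if (CallerFeatures.count(Feature) == 0)
      return false;
  return true;
}
```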
Based on https://github.com/intel/perfmon/issues/149, the documentation
is incorrect and the pfm counter names are actually correct. This patch
adjusts the Alder Lake scheduling model to match the performance counter
naming, i.e. the correct naming that will soon be reflected in the
optimization manual.
This fixes part of #117360.
Based on intel/perfmon#149, the documentation is incorrect and the pfm
counter names are actually correct. This patch adjusts the
SapphireRapids scheduling model to match the performance counter naming,
i.e. the correct naming that will soon be reflected in the optimization
manual.
This fixes part of #117360.
Add a complete IvyBridge schedule (included in the SandyBridge model; IvyBridge was the first to support F16C) and split the rr/rm schedules, as they usually have very different port usage.
Haswell/Broadwell use Port1, not Port0.
Confirmed with a mixture of Agner + uops.info comparisons.
The folded load variants almost never require Port5 for length-changing conversions (just for the SNB ymm cases), and don't typically use an extra uop for the load.
Confirmed with a mixture of Agner + uops.info comparisons.