`VPEVLBasedIVPHIRecipe` lowers to a scalar VPInstruction phi and generates a scalar phi. Like other phi recipes, it occupies only a scalar register.
This patch fixes the register usage for `VPEVLBasedIVPHIRecipe` from vector to scalar, which more closely matches the generated vector IR.
https://godbolt.org/z/6Mzd6W6ha shows that there are no register spills when choosing `<vscale x 16>`.
Note that this test is basically copied from AArch64.
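As a rough illustration, here is a minimal sketch of the scalar EVL-based IV in IR (a hypothetical kernel; the function name and VF of 16 are assumptions, not from the patch):
```llvm
; Sketch: the EVL-based IV is a plain scalar phi, so it costs one scalar
; register, unlike a widened IV which would occupy a vector register.
define void @evl_iv_sketch(i64 %n) {
entry:                                  ; assumes %n > 0
  br label %loop
loop:
  %evl.iv = phi i64 [ 0, %entry ], [ %evl.iv.next, %loop ]
  %remaining = sub i64 %n, %evl.iv
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 16, i1 true)
  %evl.zext = zext i32 %evl to i64
  %evl.iv.next = add i64 %evl.iv, %evl.zext
  %done = icmp eq i64 %evl.iv.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)
```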
SimplifyBranchConditionForVFAndUF only recognized canonical IVs and a few PHI recipes in the loop header. With more IV-step optimizations, the widened canonical IV can be replaced by a canonical VPWidenIntOrFpInduction, which the pass did not handle, causing regressions (missed simplifications).
This patch replaces canonical VPWidenIntOrFpInduction with a StepVector
in the vector preheader
since the vector loop region only executes once.
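A minimal sketch of the intended shape, assuming the llvm.stepvector intrinsic (formerly llvm.experimental.stepvector) and an nxv4i32 IV:
```llvm
; Sketch: the widened IV's single value <0, 1, 2, ...> is materialized
; once in the vector preheader instead of keeping a widened induction.
define <vscale x 4 x i32> @widened_iv_once() {
vector.ph:
  %step = call <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
  br label %vector.body
vector.body:                            ; executes exactly once
  ret <vscale x 4 x i32> %step
}
declare <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
```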
LoopPeel currently considers PHI nodes that become loop invariants
through peeling. However, in some cases, peeling transforms PHI nodes
into induction variables (IVs), potentially enabling further
optimizations such as loop vectorization. For example:
```c
// TSVC s292
int im = N - 1;
for (int i = 0; i < N; i++) {
  a[i] = b[i] + b[im];
  im = i;
}
```
In this case, peeling one iteration converts `im` into an IV, allowing
it to be handled by the loop vectorizer.
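A minimal IR sketch of the loop after peeling (hypothetical; the peeled first iteration, which reads b[N-1], is omitted for brevity):
```llvm
; After peeling, the former phi for im collapses to iv - 1: a plain IV
; that the loop vectorizer can handle.
define void @s292_peeled(ptr %a, ptr %b, i64 %n) {
entry:                                  ; assumes %n > 1
  br label %loop
loop:
  %iv = phi i64 [ 1, %entry ], [ %iv.next, %loop ]
  %im = add i64 %iv, -1                 ; was a phi before peeling
  %b.i = getelementptr inbounds float, ptr %b, i64 %iv
  %b.im = getelementptr inbounds float, ptr %b, i64 %im
  %x = load float, ptr %b.i, align 4
  %y = load float, ptr %b.im, align 4
  %sum = fadd float %x, %y
  %a.i = getelementptr inbounds float, ptr %a, i64 %iv
  store float %sum, ptr %a.i, align 4
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```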
This patch adds a new feature that peels loops when doing so converts PHIs into IVs. At the moment this feature is disabled by default.
Enabling it allows the above example to be vectorized. I measured on Neoverse V2 and observed a speedup of more than 60% (options: `-O3 -ffast-math -mcpu=neoverse-v2 -mllvm -enable-peeling-for-iv`).
This PR is taken over from #94900.
Related: #81851
After a485e0e, we may not set the vector trip count in
preparePlanForEpilogueVectorLoop if it is zero. We should not choose a
VF * UF that makes the main vector loop dead (i.e. vector trip count is
zero), but there are some cases where this can happen currently.
In those cases, set EPI.VectorTripCount to zero.
If FunctionAttrs infers additional attributes on a function, it also
invalidates analysis on callers of that function. The way it does this
right now limits this to calls with matching signature. However, the
function attributes will also be used when the signatures do not match.
Use getCalledOperand() to avoid a signature check.
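For illustration, a hedged sketch of the kind of call site this covers (hypothetical function names):
```llvm
; With opaque pointers the callee can be invoked with a mismatched
; signature. getCalledFunction() returns null here, but the attributes
; inferred for @g still describe this call, so the caller's analyses
; must be invalidated. getCalledOperand() still sees @g.
declare i64 @g(i64)

define void @caller() {
  call void @g(i32 0)        ; signature differs from @g's declaration
  ret void
}
```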
This is not a correctness fix, just improves analysis quality. I noticed
this due to
https://github.com/llvm/llvm-project/pull/144497#issuecomment-3199330709,
where LICM ends up with a stale MemoryDef that could be a MemoryUse
(which is a bug in LICM, but still non-optimal).
In streaming mode, the @llvm.aarch64.sme.cnts* and @llvm.aarch64.sve.cnt* intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic to @llvm.vscale(). This patch lowers the SME intrinsics similarly when in streaming mode.
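A hedged sketch of the lowering, assuming (as in streaming mode) that the streaming vector length in bytes is 16 * vscale:
```llvm
; @llvm.aarch64.sme.cntsb (SVL in bytes) rewritten in terms of
; @llvm.vscale, mirroring the existing SVE cntb lowering.
define i64 @cntsb_lowered() {
  %vscale = call i64 @llvm.vscale.i64()
  %svl.b = shl i64 %vscale, 4        ; 16 bytes per 128-bit granule
  ret i64 %svl.b
}
declare i64 @llvm.vscale.i64()
```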
This makes the optimization in optimizeStringLength for `strlen(gep @glob, %x)` -> `sub endof@glob, %x` a little more resilient, and a bit more correct for GEPs with non-array types.
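A small sketch of the transform on a hypothetical global:
```llvm
@glob = constant [8 x i8] c"abcdefg\00"

; strlen(gep @glob, %x) folds to (end-of-string offset) - %x, here
; 7 - %x, assuming %x stays within the string.
define i64 @fold_strlen(i64 %x) {
  %p = getelementptr inbounds [8 x i8], ptr @glob, i64 0, i64 %x
  %len = call i64 @strlen(ptr %p)
  ret i64 %len
}
declare i64 @strlen(ptr)
```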
SCCP can use PredicateInfo to constrain ranges based on assume and
branch conditions. Currently, this is only enabled during IPSCCP.
This enables it for SCCP as well, which runs after functions have
already been simplified, while IPSCCP runs pre-inline. To a large
degree, CVP already handles range-based optimizations, but SCCP is more
reliable for the cases it can handle. In particular, SCCP works reliably
inside loops, which is something that CVP struggles with due to LVI
cycles.
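A minimal example of the kind of fold this enables (sketch; SCCP with PredicateInfo can also do this inside loop bodies, where LVI cycles make CVP give up):
```llvm
; Under the branch, %x is known to be in [0, 100), so the second
; comparison folds to true.
define i1 @range_from_branch(i32 %x) {
entry:
  %c = icmp ult i32 %x, 100
  br i1 %c, label %then, label %else
then:
  %c2 = icmp ult i32 %x, 200       ; folds to true
  ret i1 %c2
else:
  ret i1 false
}
```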
I have made various optimizations to make PredicateInfo more efficient,
but unfortunately this still has significant compile-time cost (around
0.1-0.2%).
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new getReverseVectorInstrCost interface where the index is counted backwards from the end of the vector: given a vector with ElementCount EC, the extracted/inserted lane is EC - 1 - Index. For scalable vectors this lane is unknown at compile time. I've added an AArch64 hook that better represents the cost, and also a RISCV hook that maintains compatibility with the behaviour prior to this PR.
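To see why the last lane is special for scalable vectors, consider this sketch: the lane index depends on vscale, so it is not a compile-time constant.
```llvm
; Extracting the last lane of a scalable vector: the index is
; vscale * 4 - 1, unknown at compile time; on SVE this needs a
; whilelo + lastb sequence rather than a cheap fixed-index extract.
define i32 @extract_last(<vscale x 4 x i32> %v) {
  %vscale = call i64 @llvm.vscale.i64()
  %n = shl i64 %vscale, 2
  %last = sub i64 %n, 1
  %e = extractelement <vscale x 4 x i32> %v, i64 %last
  ret i32 %e
}
declare i64 @llvm.vscale.i64()
```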
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
It can happen that the call is originally created as a MemoryDef,
and then later transforms show it is actually read-only and could
be a MemoryUse -- however, this is not guaranteed to be reflected
in MSSA.
The check for ABI differences for inlined calls involves the caller, the
callee and the nested callee. Before inlining, the ABI is determined by
the target features of the callee. After inlining it is determined by
the caller. The features of the nested callee should never actually
matter.
PR #149247 made the metadata accessible to the backend, so we can now leverage it in the memory model. The first use case here is detecting whether a flat operation can access scratch memory.
This benefits both the MemoryLegalizer and InsertWaitCnt.
If we end up with an extract_element VPInstruction where both operands are live-ins, we will try to fold the live-ins even though the first operand is a vector whilst the live-in is scalar.
This fixes it by just returning the vector live-in instead of calling the folder, and removes the handling for insertelement where we aren't able to do the fold. From some quick testing we previously never hit this fold anyway, and were probably just missing test coverage.
Fixes #154045
Add a default-off option to the inline cost calculation to always inline all viable calls regardless of the cost/benefit and cost/threshold calculations.
For performance reasons, some users require that all calls be inlined.
Rather than forcing them to adjust the inlining threshold to an
arbitrarily high value, offer an option to inline all calls.
If ExtraAnalysis is requested, emit all remarks caused by unvectorizable instructions, instead of only the first.
This is in line with how other places handle DoExtraAnalysis, and it can be quite helpful to get information about all instructions in a loop that prevent vectorization.
This reverts commit e9de32fd159d30cfd6fcc861b57b7e99ec2742ab due to
multiple performance regressions observed across downstream Numba
benchmarks (https://github.com/llvm/llvm-project/issues/138509#issuecomment-3193855772).
While avoiding non-trivial unswitches on newly-cloned loops helps mitigate the pathological case reported in https://github.com/llvm/llvm-project/issues/138509, it can also make the IR less friendly to vectorization / loop canonicalization (in the reported test, previously no select with a loop-carried dependence existed in the new specialized loops), leading us to reconsider the abovementioned approach.
Updates SimplifyCFG to avoid jump threading through loop headers if
-keep-loops is requested. Canonical loop form requires a loop header
that dominates all blocks in the loop. If we thread through a header, we
risk breaking its domination of the loop. This change avoids this issue
by conservatively avoiding threading through headers entirely.
Fixes: https://github.com/llvm/llvm-project/issues/151144
The vector combiner processes all instructions as it first loops through the function, adding any newly added and deleted instructions to a worklist which is then processed once all nodes are done. This leaves extra uses in the graph while the initial processing is performed, leading to sub-optimal decisions being made for other combines. This changes it so that trivially dead instructions are removed immediately. The main change this requires is making sure iterator invalidation does not occur.
Specifically in the context of the once-stored transformation, GlobalOpt would strip all pointer casts unconditionally, even though addrspacecasts might be runtime operations. This manifested particularly on CHERI targets.
This patch was inspired by an existing change in CHERI LLVM (91afa60f17), but has been reimplemented with updated conventions and a testcase constructed from scratch.
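A sketch of the once-stored pattern in question (hypothetical; address space 200 stands in for a CHERI-style capability address space where addrspacecast is a runtime operation):
```llvm
@g = internal global ptr null

; @g is stored exactly once; when GlobalOpt forwards the stored value
; to loads of @g, it must not strip the addrspacecast.
define void @init(ptr addrspace(200) %p) {
  %c = addrspacecast ptr addrspace(200) %p to ptr
  store ptr %c, ptr @g
  ret void
}

define ptr @use() {
  %v = load ptr, ptr @g
  ret ptr %v
}
```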
Dissolving the hierarchical VPlan CFG and converting abstract to
concrete recipes can expose additional simplification opportunities.
Do a final run of simplifyRecipes before executing the VPlan.
If the copyable schedule data is created and the user is used several times in the user node, there is no need to count the same data for the same user several times; it should be included only once.
Fixes #153754
Fixes #153012
As we tolerate unfoldable constant expressions in `scalarizeOpOrCmp`, we
may fold
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) #0 {
entry:
  %158 = insertelement <2 x i64> <i64 5, i64 ptrtoint (ptr @val to i64)>, i64 %idx, i32 0
  %159 = or disjoint <2 x i64> splat (i64 2), %158
  store <2 x i64> %159, ptr %ptr2
  ret void
}
```
to
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) {
entry:
  %.scalar = or disjoint i64 2, %idx
  %0 = or <2 x i64> splat (i64 2), <i64 5, i64 ptrtoint (ptr @val to i64)>
  %1 = insertelement <2 x i64> %0, i64 %.scalar, i64 0
  store <2 x i64> %1, ptr %ptr2, align 16
  ret void
}
```
And it would be folded back in `foldInsExtBinop`, resulting in an
infinite loop.
This patch performs the scalarization only if InstSimplify can fold the constant expression.
Call `recordInliningWithCalleeDeleted` before dropping the contents of
the Callee. Otherwise the handlers don't have access to e.g. the
DebugLoc, so the Callee DebugLoc was missing in inlining remarks for
functions with internal linkage.
The test is the same as `optimization-remarks-passed-yaml.ll` except
that the function `foo` has internal linkage instead of external linkage.
Similarly to https://github.com/llvm/llvm-project/pull/131538, we can
also try and check if a predicate is known to wrap given the backedge
taken count.
For now, this just checks directly when we try to create predicated
AddRecs. This both helps to avoid spending compile-time on optimizations
where we know the predicate is false, and can also help to allow
additional vectorization (e.g. by deciding to scalarize memory accesses
when otherwise we would try to create a predicated AddRec with a
predicate that's always false).
The initial version is quite restricted, but can be extended in
follow-ups to cover more cases.
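As a concrete sketch (hypothetical trip count), an i8 AddRec with a backedge-taken count of 300 must wrap, so a no-wrap predicate on it is known false up front:
```llvm
; {0,+,1}<i8> over 301 iterations exceeds 256 values, so it must wrap;
; a predicated AddRec would carry an always-false predicate, and we can
; instead e.g. scalarize the access.
define void @wrapping_iv(ptr %p) {
entry:
  br label %loop
loop:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
  %iv.i8 = trunc i32 %iv to i8
  %idx = zext i8 %iv.i8 to i64
  %gep = getelementptr inbounds i8, ptr %p, i64 %idx
  store i8 0, ptr %gep
  %iv.next = add i32 %iv, 1
  %done = icmp eq i32 %iv.next, 301
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
```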
PR: https://github.com/llvm/llvm-project/pull/151134
This also fixes a bug introduced accidentally in #153651, whereby the `JumpTableToSwitch` pass would convert all the branch weights to 0 except for one. It didn't trip the test because `update_test_checks` wasn't run with `-check-globals`; it is now. This also surfaced that the direct calls promoted from the indirect call inherited the `VP` metadata, which should be dropped as it no longer applies.
Added support for LShr instructions as a base for copyable elements. Also added a simple analysis for selecting the best base instruction when multiple candidates are available.
Fixed scheduling after cancellation.
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/153393
In #149619, for the test `@dot_follow_modulo_spec_2`, constant folding the addition of two i32 values of 1073741824 (2^30) causes a signed overflow: the exact result 2147483648 (2^31) wraps to -2147483648 (-2^31), which triggers the UB sanitizer. This PR reapplies the previous PR, explicitly casting the addition operands to int64_t before performing the addition and then producing an int32 number via `Constant *C = get(cast<IntegerType>(Ty->getScalarType()), V, isSigned)`.
As fmul and fmadd are so similar, their performance characteristics tend
to be the same on most platforms, at least in terms of reciprocal
throughputs. Processors capable of performing a given number of fmul per
cycle can usually perform the same number of fma, with the extra add
being relatively simple on top. This patch makes the scores of the two
operations the same, which brings the throughput cost of a fma/fmuladd
to 2, and the latency to 3, which are the defaults for fmul.
Note that we might also want to change the throughput cost of an fmul to 1, as most processors have ample bandwidth for them, but the two should still stay in line with one another.
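For reference, the two shapes that now receive the same scores (sketch):
```llvm
; An unfused multiply-add and the fused equivalent: with this change
; both fmul and the fma-style @llvm.fmuladd are modelled with
; reciprocal throughput 2 and latency 3.
define double @mul_then_add(double %a, double %b, double %c) {
  %m = fmul double %a, %b
  %s = fadd double %m, %c
  ret double %s
}
define double @fused(double %a, double %b, double %c) {
  %r = call double @llvm.fmuladd.f64(double %a, double %b, double %c)
  ret double %r
}
declare double @llvm.fmuladd.f64(double, double, double)
```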
Emit safety guards for pointer accesses when cross-partition loads exist that have a corresponding store to the same address in a different partition. This emits the necessary runtime pointer checks for these accesses.
The test case was obtained from SuperTest, which SiFive runs regularly. We enabled LoopDistribution by default in our downstream compiler; this change was part of that enablement.
Directly emit shl instead of a multiply if VF * Step is a power-of-2. The
main motivation here is to prepare the code and test for directly
generating and expanding a SCEV expression of the minimum iteration
count. SCEVExpander will directly emit shl for multiplies with
powers-of-2.
InstCombine also performs this combine, so end-to-end this should effectively be NFC.
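A tiny sketch of the emitted computation, assuming VF * Step = 8 and a scalable VF:
```llvm
; The multiply by the power-of-two VF * Step is emitted directly as a
; shift, matching what SCEVExpander would produce.
define i64 @vf_times_step() {
  %vscale = call i64 @llvm.vscale.i64()
  %n = shl i64 %vscale, 3            ; instead of: mul i64 %vscale, 8
  ret i64 %n
}
declare i64 @llvm.vscale.i64()
```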
PR: https://github.com/llvm/llvm-project/pull/153495
If the indirect call target being recognized as a jump table has profile info, we can accurately synthesize the branch weights of the switch that replaces the indirect call.
Otherwise we insert the "unknown" `MD_prof` to indicate this is the best we can do here.
Part of Issue #147390
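A hedged sketch of the synthesized switch with profile (hypothetical table and weights; the first weight belongs to the default destination):
```llvm
; The indirect call's value profile maps directly onto the switch's
; branch weights; the out-of-table default gets weight 0 here.
define i32 @dispatch(i64 %idx) {
entry:
  switch i64 %idx, label %default [
    i64 0, label %case0
    i64 1, label %case1
  ], !prof !0
case0:
  ret i32 0
case1:
  ret i32 1
default:
  unreachable
}
!0 = !{!"branch_weights", i32 0, i32 90, i32 10}
```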