llvm-project

Author	SHA1	Message	Date
Florian Hahn	d486b76823	[AArch64] Unroll some loops with early-continues on Apple Silicon. (#118499) Try to runtime-unroll loops with early-continues depending on loop-varying loads; this helps with branch-prediction for the early-continues and can significantly improve performance for such loops. Builds on top of https://github.com/llvm/llvm-project/pull/118317. PR: https://github.com/llvm/llvm-project/pull/118499.	2024-12-22 13:10:54 +00:00
Vladi Krapp	f8d270474c	[ARM] Reduce loop unroll when low overhead branching is available (#120065) For processors with low overhead branching (LOB), runtime unrolling the innermost loop is often detrimental to performance. In these cases the loop remainder gets unrolled into a series of compare-and-jump blocks, which in deeply nested loops get executed multiple times, negating the benefits of LOB. This is particularly noticeable when the loop trip count of the innermost loop varies within the outer loop, such as in the case of triangular matrix decompositions. In these cases we will prefer to not unroll the innermost loop, with the intention for it to be executed as a low overhead loop.	2024-12-18 10:10:51 +00:00
Florian Hahn	0bb7bd4b4e	[AArch64] Runtime-unroll small load/store loops for Apple Silicon CPUs. (#118317) Add initial heuristics to selectively enable runtime unrolling for loops where doing so is expected to be highly beneficial on Apple Silicon CPUs. To start with, we try to runtime-unroll small, single block loops, if they have load/store dependencies, to expose more parallel memory access streams [1] and to improve instruction delivery [2]. We also explicitly avoid runtime-unrolling for loop structures that may limit the expected gains from runtime unrolling. Such loops include those with complex control flow (not innermost, multiple exits, or a large number of blocks), those whose trip count expansion is expensive, and those expected to execute a small number of iterations. Note that the heuristics here may be overly conservative and we err on the side of avoiding runtime unrolling rather than unroll excessively. They are all subject to further refinement. Across a large set of workloads, this increases the total number of unrolled loops by 2.9%. [1] 4.6.10 in Apple Silicon CPU Optimization Guide [2] 4.4.4 in Apple Silicon CPU Optimization Guide Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI changes. PR: https://github.com/llvm/llvm-project/pull/118317	2024-12-09 14:28:31 +00:00
VladiKrapp-Arm	bb3eb0ca0c	[ARM] Test unroll behaviour on machines with low overhead branching (#118692 ) Add test for existing loop unroll behaviour. Current behaviour is the single loop with fmul gets runtime unrolled by count of 4, with the loop remainder unrolled as the 3 for.body9.us.prol sections. This is quite a lot of compare and branch, negating the benefits of the low overhead loop mechanism.	2024-12-06 15:04:56 +00:00
Nikita Popov	f7685af4a5	[InstCombine] Move gep of phi fold into separate function This makes sure that an early return during this fold doesn't end up skipping later gep folds.	2024-12-05 15:20:56 +01:00
Nikita Popov	462cb3cd6c	[InstCombine] Infer nusw + nneg -> nuw for getelementptr (#111144 ) If the gep is nusw (usually via inbounds) and the offset is non-negative, we can infer nuw. Proof: https://alive2.llvm.org/ce/z/ihztLy	2024-12-05 14:36:40 +01:00
Florian Hahn	21d27b3aab	[LoopUnroll] Add tests for loop unrolling on Apple platforms. Add first set of tests where runtime unrolling can be highly beneficial on Apple Silicon CPUs.	2024-12-02 15:48:48 +00:00
Lee Wei	abb9f9fa06	[llvm] Remove `br i1 undef` from some regression tests [NFC] (#117112 ) This PR removes tests with `br i1 undef` under `llvm/tests/Transforms/Loop, Lower`.	2024-11-21 08:06:56 +00:00
Stephen Tozer	92e0fb0c94	[DebugInfo][LoopUnroll] Preserve DebugLocs on optimized cond branches (#114225 ) This patch fixes a simple error where as part of loop unrolling we optimize conditional loop-exiting branches into unconditional branches when we know that they will or won't exit the loop, but does not propagate the source location of the original branch to the new one. Found using https://github.com/llvm/llvm-project/pull/107279.	2024-11-08 16:52:30 +00:00
Yingwei Zheng	0b9f1cc024	[SCEV] Disallow simplifying phi(undef, X) to X (#115109 ) See the following case: ``` @GlobIntONE = global i32 0, align 4 define ptr @src() { entry: br label %for.body.peel.begin for.body.peel.begin: ; preds = %entry br label %for.body.peel for.body.peel: ; preds = %for.body.peel.begin br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel cleanup.loopexit.peel: ; preds = %for.body.peel br label %cleanup.peel cleanup.peel: ; preds = %cleanup.loopexit.peel, %for.body.peel %retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ] br i1 true, label %for.body.peel.next, label %cleanup7 for.body.peel.next: ; preds = %cleanup.peel br label %for.body.peel.next1 for.body.peel.next1: ; preds = %for.body.peel.next br label %entry.peel.newph entry.peel.newph: ; preds = %for.body.peel.next1 br label %for.body for.body: ; preds = %cleanup, %entry.peel.newph %retval.0 = phi ptr [ %retval.2.peel, %entry.peel.newph ], [ %retval.2, %cleanup ] br i1 false, label %cleanup, label %cleanup.loopexit cleanup.loopexit: ; preds = %for.body br label %cleanup cleanup: ; preds = %cleanup.loopexit, %for.body %retval.2 = phi ptr [ %retval.0, %for.body ], [ @GlobIntONE, %cleanup.loopexit ] br i1 false, label %for.body, label %cleanup7.loopexit cleanup7.loopexit: ; preds = %cleanup %retval.2.lcssa.ph = phi ptr [ %retval.2, %cleanup ] br label %cleanup7 cleanup7: ; preds = %cleanup7.loopexit, %cleanup.peel %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ] ret ptr %retval.2.lcssa } define ptr @tgt() { entry: br label %for.body.peel.begin for.body.peel.begin: ; preds = %entry br label %for.body.peel for.body.peel: ; preds = %for.body.peel.begin br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel cleanup.loopexit.peel: ; preds = %for.body.peel br label %cleanup.peel cleanup.peel: ; preds = %cleanup.loopexit.peel, %for.body.peel %retval.2.peel = phi ptr [ undef, 
%for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ] br i1 true, label %for.body.peel.next, label %cleanup7 for.body.peel.next: ; preds = %cleanup.peel br label %for.body.peel.next1 for.body.peel.next1: ; preds = %for.body.peel.next br label %entry.peel.newph entry.peel.newph: ; preds = %for.body.peel.next1 br label %for.body for.body: ; preds = %cleanup, %entry.peel.newph br i1 false, label %cleanup, label %cleanup.loopexit cleanup.loopexit: ; preds = %for.body br label %cleanup cleanup: ; preds = %cleanup.loopexit, %for.body br i1 false, label %for.body, label %cleanup7.loopexit cleanup7.loopexit: ; preds = %cleanup %retval.2.lcssa.ph = phi ptr [ %retval.2.peel, %cleanup ] br label %cleanup7 cleanup7: ; preds = %cleanup7.loopexit, %cleanup.peel %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ] ret ptr %retval.2.lcssa } ``` 1. `simplifyInstruction(%retval.2.peel)` returns `@GlobIntONE`. Thus, `ScalarEvolution::createNodeForPHI` returns SCEV expr `@GlobIntONE` for `%retval.2.peel`. 2. `SimplifyIndvar::replaceIVUserWithLoopInvariant` tries to replace the use of `%retval.2.peel` in `%retval.2.lcssa.ph` with `@GlobIntONE`. 3. `simplifyLoopAfterUnroll -> simplifyLoopIVs -> SCEVExpander::expand` reuses `%retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ]` to generate code for `@GlobIntONE`. It is incorrect. This patch disallows simplifying `phi(undef, X)` to `X` by setting `CanUseUndef` to false. Closes https://github.com/llvm/llvm-project/issues/114879.	2024-11-07 15:53:51 +08:00
Paul Walker	38fffa630e	[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548 )	2024-11-06 11:53:33 +00:00
Florian Hahn	2f7ccaf4a8	[SCEV] Add predicate in SolveLinEq to ensure B is a multiple of A. (#108777 ) This can help in cases where pointer alignment info is missing, e.g. https://github.com/llvm/llvm-project/pull/108210 The predicate is formed for the complex expression that's passed to SolveLinEquationWithOverflow and the checks could probably be pushed closer to the root nodes, which in some cases may be cheaper to check. PR: https://github.com/llvm/llvm-project/pull/108777	2024-09-28 14:19:57 +01:00
Nikita Popov	5bcc82d433	[LoopPeel] Fix LCSSA phi node invalidation In the test case, the BECount of the second loop uses %load, but we only have an LCSSA phi node for %add, so that is what gets invalidated. Use the forgetLcssaPhiWithNewPredecessor() API instead, which will invalidate the roots of the expression instead. Fixes https://github.com/llvm/llvm-project/issues/109333.	2024-09-20 17:01:41 +02:00
Nikita Popov	4ec4ac15ed	[SCEVExpander] Fix addrec cost model (#106704 ) The current isHighCostExpansion cost model for addrecs computes the cost for some kind of polynomial expansion that does not appear to have any relation to addrec expansion whatsoever. A literal expansion of an affine addrec is a phi and add (plus the expansion of start and step). For a non-affine addrec, we get another phi+add for each additional addrec nested in the step recurrence. This partially `fixes` https://github.com/llvm/llvm-project/issues/53205 (the runtime unroll test case in this PR).	2024-09-19 09:39:35 +02:00
Ganesh	02e4186d0b	[X86] AMD Zen 5 Initial enablement (#107964) This patch adds the basic skeleton enablement for AMD's next-generation zen5 CPUs.	2024-09-13 17:45:33 +01:00
Nikita Popov	52b879594f	[LoopUnroll] Avoid undef values in test (NFC) Avoid most of the code being optimized away as a result of optimization improvements.	2024-09-03 12:10:29 +02:00
Nikita Popov	fe1a1eee2f	[Tests] Regenerate test checks (NFC)	2024-09-03 11:42:47 +02:00
Nikita Popov	9edd998e10	[LoopUnroll] Add test for #53205 (NFC)	2024-08-29 16:43:56 +02:00
Nikita Popov	fe182ddf1f	[LoopUnrollAnalyzer] Use constant folding API for loads Use ConstantFoldLoadFromConst() instead of a partial re-implementation. This makes the code slightly more generic by not depending on the exact structure of the constant.	2024-08-28 11:53:25 +02:00
James Y Knight	dfeb3991fb	Remove the `x86_mmx` IR type. (#98505 ) It is now translated to `<1 x i64>`, which allows the removal of a bunch of special casing. This _incompatibly_ changes the ABI of any LLVM IR function with `x86_mmx` arguments or returns: instead of passing in mmx registers, they will now be passed via integer registers. However, the real-world incompatibility caused by this is expected to be minimal, because Clang never uses the x86_mmx type -- it lowers `__m64` to either `<1 x i64>` or `double`, depending on ABI. This change does _not_ eliminate the SelectionDAG `MVT::x86mmx` type. That type simply no longer corresponds to an IR type, and is used only by MMX intrinsics and inline-asm operands. Because SelectionDAGBuilder only knows how to generate the operands/results of intrinsics based on the IR type, it thus now generates the intrinsics with the type MVT::v1i64, instead of MVT::x86mmx. We need to fix this before the DAG LegalizeTypes, and thus have the X86 backend fix them up in DAGCombine. (This may be a short-lived hack, if all the MMX intrinsics can be removed in upcoming changes.) Works towards issue #98272.	2024-07-25 09:19:22 -04:00
v01dXYZ	cff8d716bd	[SCEV] forgetValue: support (with-overflow-inst op0, op1) (#98015 ) The use-def walk in forgetValue() was skipping instructions with non-SCEVable types. However, SCEV may look past with.overflow intrinsics returning aggregates. Fixes #97586.	2024-07-09 09:14:33 +02:00
Nikita Popov	37c736e035	[LoopUnroll] Use poison instead of undef for another preheader value	2024-06-25 12:17:20 +02:00
Nikita Popov	eeb0884e66	[LoopUnroll] Use poison instead of undef for preheader value	2024-06-25 12:09:58 +02:00
Stephen Tozer	094572701d	[RemoveDIs] Print IR with debug records by default (#91724 ) This patch makes the final major change of the RemoveDIs project, changing the default IR output from debug intrinsics to debug records. This is expected to break a large number of tests: every single one that tests for uses or declarations of debug intrinsics and does not explicitly disable writing records. If this patch has broken your downstream tests (or upstream tests on a configuration I wasn't able to run): 1. If you need to immediately unblock a build, pass `--write-experimental-debuginfo=false` to LLVM's option processing for all failing tests (remember to use `-mllvm` for clang/flang to forward arguments to LLVM). 2. For most test failures, the changes are trivial and mechanical, enough that they can be done by script; see the migration guide for a guide on how to do this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates 3. If any tests fail for reasons other than FileCheck check lines that need updating, such as assertion failures, that is most likely a real bug with this patch and should be reported as such. For more information, see the recent PSA: https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578	2024-06-14 15:07:27 +01:00
Sameer Sahasrabuddhe	e0ac087ff0	[LoopUnroll] Consider convergence control tokens when unrolling (#91715 ) - There is no restriction on a loop with controlled convergent operations when the relevant tokens are defined and used within the loop. - When a token defined outside a loop is used inside (also called a loop convergence heart), unrolling is allowed only in the absence of remainder or runtime checks. - When a token defined inside a loop is used outside, such a loop is said to be "extended". This loop can only be unrolled by also duplicating the extended part lying outside the loop. Such unrolling is disabled for now. - Clean up loop hearts: When unrolling a loop with a heart, duplicating the heart will introduce multiple static uses of a convergence control token in a cycle that does not contain its definition. This violates the static rules for tokens, and needs to be cleaned up into a single occurrence of the intrinsic. - Spell out the initializer for UnrollLoopOptions to improve readability. Original implementation [D85605] by Nicolai Haehnle <nicolai.haehnle@amd.com>.	2024-06-06 13:13:46 +05:30
Sergey Kachkov	f34dedbf44	[LoopPeel] Support min/max intrinsics in loop peeling (#93162) This patch adds processing of min/max intrinsics in LoopPeel in a similar way to how it is done for conditional statements: for min/max(IterVal, BoundVal) we peel iterations where IterVal < BoundVal for monotonically increasing IterVal; for monotonically decreasing IterVal we peel iterations where IterVal > BoundVal (strict comparison predicates are used to minimize the number of peeled iterations).	2024-05-31 13:58:10 +03:00
Sergey Kachkov	60a890d855	[LoopPeel] Add pre-commit test for min/max intrinsics	2024-05-31 13:06:08 +03:00
Simon Pilgrim	54e52aa5eb	[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340) The znver3/4 scheduler models have previously associated the LoopMicroOpBufferSize with the maximum size of their op caches, and when this led to quadratic complexity issues this was reduced to a value of 512 uops, based mainly on compilation time and not its effectiveness on runtime performance. From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the cpu has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we make use of the faster op cache rate (8/9 ops/cy). This patch proposes we instead cap the size of the LoopMicroOpBufferSize based off the maximum rate from the op cache (znver3 = 8op/cy, znver4 = 9op/cy) and the branch misprediction penalty from the opcache (~12cy) as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but does try to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler vs the size of a similarly named buffer in the cpu.	2024-05-16 14:44:00 +01:00
Dmitri Gribenko	83974a4b92	Revert "[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657 )" This reverts commit f0b3654701bde1cf7821d60698b42383edaff9f3. This commit triggers UB by reading an uninitialized variable. `UP.PartialThreshold` is used uninitialized in `getUnrollingPreferences()` when it is called from `LoopVectorizationPlanner::executePlan()`. In this case the `UP` variable is created on the stack and its fields are not initialized. ``` ==8802==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x557c0b081b99 in llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter) llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h #1 0x557c0b07a40c in llvm::TargetTransformInfo::Model<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter) llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2277:17 #2 0x557c0f5d69ee in llvm::TargetTransformInfo::getUnrollingPreferences(llvm::Loop, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter) const llvm-project/llvm/lib/Analysis/TargetTransformInfo.cpp:387:19 #3 0x557c0e6b96a0 in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree, bool, llvm::DenseMap<llvm::SCEV const, llvm::Value, llvm::DenseMapInfo<llvm::SCEV const, void>, llvm::detail::DenseMapPair<llvm::SCEV const, llvm::Value>> const) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7624:7 #4 0x557c0e6e4b63 in llvm::LoopVectorizePass::processLoop(llvm::Loop) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10253:13 #5 0x557c0e6f2429 in llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, 
llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo, llvm::TargetLibraryInfo, llvm::DemandedBits&, llvm::AssumptionCache&, llvm::LoopAccessInfoManager&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10344:30 #6 0x557c0e6f2f97 in llvm::LoopVectorizePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10383:9 [...] Uninitialized value was created by an allocation of 'UP' in the stack frame #0 0x557c0e6b961e in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree, bool, llvm::DenseMap<llvm::SCEV const, llvm::Value, llvm::DenseMapInfo<llvm::SCEV const, void>, llvm::detail::DenseMapPair<llvm::SCEV const, llvm::Value>> const) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7623:3 ```	2024-05-16 12:11:42 +02:00
Nikita Popov	f0b3654701	[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657) The znver3/znver4 scheduler models are outliers, specifying very large LoopMicroOpBufferSizes at 512, while typical values for other subtargets are on the order of ~50. Even if this information is micro-architecturally correct (*), this does not mean that we want to runtime unroll all loops to a size that completely fills the loop buffer. Unless this is the single hot loop in the entire application, the massive code size increase will bust the micro-op and instruction caches. Protect against this by clamping to the default PartialThreshold of 150, which is the same as the default full-unroll threshold and half the aggressive full-unroll threshold. Allowing more partial unrolling than full unrolling certainly does not make sense. (*) I strongly doubt that this is actually correct -- I believe this may derive from an incorrect reading of Agner Fog's micro-architecture guide. The number 4096 that was originally used here is the size of the general micro-op cache, not that of a loop buffer. A separate loop buffer is not listed for the Zen microarchitecture. Comparing this to the listing for Skylake, it has a 1536 micro-op buffer, but only a 64 micro-op loopback buffer, with a note that it's rarely fully utilized. Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.	2024-05-16 10:21:22 +09:00
chenlin	79643565a8	[LoopUnroll] Remove redundant debug instructions after blocks have been merged (#91246) Remove redundant debug instructions after blocks have been merged into their predecessor; this can reduce compile time in some cases. This change only addresses the loop unrolling case; other cases are not considered. "RemoveRedundantDbgInstrs" appears to be very time-consuming, so it is only called here after "Dest" has been merged into "Fold", which may be a more targeted solution. Fixes: https://github.com/llvm/llvm-project/issues/89073	2024-05-13 09:42:04 -07:00
Jorge Gorbe Moya	2cde0e2f97	Revert "[BasicBlockUtils] Remove redundant llvm.dbg instructions after blocks to reduce compile time (#89069 )" This reverts commit 2e3e0868748635b779ba89a772eae3664bd822e4. It caused quadratic slowdown at compilation time in some cases. See the comments in the original PR: https://github.com/llvm/llvm-project/pull/89069	2024-05-03 13:05:08 -07:00
annamthomas	46c2d93662	[StandardInstrumentation] Annotate loops with the function name (#90756 ) When analyzing pass debug output it is helpful to have the function name along with the loop name.	2024-05-03 14:13:59 -04:00
Florian Hahn	175d297102	[LoopUnroll] Add CSE to remove redundant loads after unrolling. (#83860 ) This patch adds loadCSE support to simplifyLoopAfterUnroll. It is based on EarlyCSE's implementation using ScopeHashTable and is using SCEV for accessed pointers to check to find redundant loads after unrolling. This applies to the late unroll pass only, for full unrolling those redundant loads will be cleaned up by the regular pipeline. The current approach constructs MSSA on-demand per-loop, but there is still small but notable compile-time impact: stage1-O3 +0.04% stage1-ReleaseThinLTO +0.06% stage1-ReleaseLTO-g +0.05% stage1-O0-g +0.02% stage2-O3 +0.09% stage2-O0-g +0.04% stage2-clang +0.02% https://llvm-compile-time-tracker.com/compare.php?from=c089fa5a729e217d0c0d4647656386dac1a1b135&to=ec7c0f27cb5c12b600d9adfc8543d131765ec7be&stat=instructions:u This benefits some workloads with runtime-unrolling disabled, where users use pragmas to force unrolling, as well as with runtime unrolling enabled. On SPEC/MultiSource, this removes a number of loads after unrolling on AArch64 with runtime unrolling enabled. 
``` External/S...te/526.blender_r/526.blender_r 96 MultiSourc...rks/mediabench/gsm/toast/toast 39 SingleSource/Benchmarks/Misc/ffbench 4 External/SPEC/CINT2006/403.gcc/403.gcc 18 MultiSourc.../Applications/JM/ldecod/ldecod 4 MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg 6 MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG 9 MultiSourc...e/Applications/ClamAV/clamscan 4 MultiSourc.../MallocBench/espresso/espresso 3 MultiSourc...dence-flt/LinearDependence-flt 2 MultiSourc...ch/office-ispell/office-ispell 4 MultiSourc...ch/consumer-jpeg/consumer-jpeg 6 MultiSourc...ench/security-sha/security-sha 11 MultiSourc...chmarks/McCat/04-bisect/bisect 3 SingleSour...tTests/2020-01-06-coverage-009 12 MultiSourc...ench/telecomm-gsm/telecomm-gsm 39 MultiSourc...lds-flt/CrossingThresholds-flt 24 MultiSourc...dence-dbl/LinearDependence-dbl 2 External/S...C/CINT2006/445.gobmk/445.gobmk 6 MultiSourc...enchmarks/mafft/pairlocalalign 53 External/S...31.deepsjeng_r/531.deepsjeng_r 3 External/S...rate/510.parest_r/510.parest_r 58 External/S...NT2006/464.h264ref/464.h264ref 29 External/S...NT2017rate/502.gcc_r/502.gcc_r 45 External/S...C/CINT2006/456.hmmer/456.hmmer 6 External/S...te/538.imagick_r/538.imagick_r 18 External/S.../CFP2006/447.dealII/447.dealII 4 MultiSourc...OE-ProxyApps-C++/miniFE/miniFE 12 External/S...2017rate/525.x264_r/525.x264_r 36 MultiSourc...Benchmarks/7zip/7zip-benchmark 33 MultiSourc...hmarks/ASC_Sequoia/AMGmk/AMGmk 2 MultiSourc...chmarks/VersaBench/8b10b/8b10b 1 MultiSourc.../Applications/JM/lencod/lencod 116 MultiSourc...lds-dbl/CrossingThresholds-dbl 24 MultiSource/Benchmarks/McCat/05-eks/eks 15 ``` PR: https://github.com/llvm/llvm-project/pull/83860	2024-05-02 11:01:24 +01:00
CL	2e3e086874	[BasicBlockUtils] Remove redundant llvm.dbg instructions after blocks to reduce compile time (#89069) This patch fixes the compile time for some cases. Before this change, some targets (riscv-64, ve) spent much more compile time on this case (https://godbolt.org/z/rrov17cTo). With this change, the compile time is reduced considerably. Fixes https://github.com/llvm/llvm-project/issues/89073 PR: https://github.com/llvm/llvm-project/pull/89069	2024-04-26 10:57:37 +01:00
Florian Hahn	01f8da908c	[LoopUnroll] Add tests for performing load CSE after unrolling. Precommit tests for https://github.com/llvm/llvm-project/pull/83860.	2024-04-24 12:00:16 +01:00
Graham Hunter	03f852f704	[AArch64] Improve cost model for legal subvec insert/extract (#81135) Currently we model subvector inserts and extracts as shuffles, potentially going as far as scalarizing. If the types are legal then they can just be simple zip/unzip operations, or possibly even no-ops. Change the cost to a relatively small one to ensure that simple loops featuring such operations between fixed and scalable vector types that are effectively the same at a given sve width can be unrolled and further optimized.	2024-03-04 16:17:01 +00:00
Vedant Paranjape	e209178d64	[SimplifyIndVar] LCSSA form is destroyed by simplifyLoopIVs, preserve it (#78696 ) In LoopUnroll, peelLoop is called on the loop. After the loop is peeled it calls simplifyLoopAfterUnroll on the loop. This call to simplifyLoopAfterUnroll doesn't preserve the LCSSA form of the parent loop and thus during the next call to peelLoop the LCSSA form is already broken. LoopPeel util takes in the PreserveLCSSA argument and it passes on the same argument to simplifyLoop which checks if the loop is in a valid LCSSA form, when (PreserveLCSSA = true). This causes an assert in simplifyLoop when (PreserveLCSSA = true), as during the last call LCSSA for the loop wasn't preserved, and thus crashes at the following assert. assert(L->isRecursivelyLCSSAForm(DT, LI) && "Requested to preserve LCSSA, but it's already broken."); Upon debugging, it is evident that simplifyLoopIVs call inside simplifyLoopAfterUnroll breaks the LCSSA form. This patch fixes llvm#77118, it checks if the replacement of IV Users with Loop Invariant preserves the LCSSA form. If it does not, it emits the required LCSSA Phi instructions.	2024-02-21 17:51:56 +05:30
Graham Hunter	ad78e210bd	[NFC][AArch64] Tests for guarding unrolling with scalable vec ins/ext (#81132 )	2024-02-19 09:47:49 +00:00
Sergey Kachkov	ffd79b3312	[LoopUnroll] Consider simplified operands while retrieving TTI instruction cost (#70929 ) Get more precise cost of instruction after LoopUnroll considering that some operands of it can be simplified, e.g. induction variable will be replaced by constant after full unrolling.	2024-02-06 17:01:38 +03:00
modiking	99ddd77ed9	[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll (#78648 ) Fixes [PR77842](https://github.com/llvm/llvm-project/issues/77842) where UBSAN causes pragma full unroll to try and unroll INT_MAX times. This sets a cap to make sure we don't attempt this and crash the compiler. Testing: ninja check-all with new test --------- Co-authored-by: Nikita Popov <github@npopov.com>	2024-02-05 17:01:00 -08:00
Nikita Popov	2d69827c5c	[Transforms] Convert tests to opaque pointers (NFC)	2024-02-05 11:57:34 +01:00
Nikita Popov	62ae7d976f	[LoopUnroll] Fix missing sign extension For integers larger than 64-bit, this would zero-extend a -1 value, instead of sign-extending it. Fixes https://github.com/llvm/llvm-project/issues/80289.	2024-02-01 16:08:25 +01:00
Nikita Popov	f7b05e055f	[LoopUnroll] Add test for #80289 (NFC)	2024-02-01 16:08:25 +01:00
Nikita Popov	90ba33099c	[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882 ) This patch canonicalizes getelementptr instructions with constant indices to use the `i8` source element type. This makes it easier for optimizations to recognize that two GEPs are identical, because they don't need to see past many different ways to express the same offset. This is a first step towards https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699. This is limited to constant GEPs only for now, as they have a clear canonical form, while we're not yet sure how exactly to deal with variable indices. The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives two representative examples of the kind of optimization improvement we expect from this change. In the first test SimplifyCFG can now realize that all switch branches are actually the same. In the second test it can convert it into simple arithmetic. These are representative of common optimization failures we see in Rust. Fixes https://github.com/llvm/llvm-project/issues/69841.	2024-01-24 15:25:29 +01:00
Yingwei Zheng	b7f50e13d8	[InstCombine] Improve `foldICmpWithDominatingICmp` with DomConditionCache (#75370 ) This patch uses affected values from DomConditionCache(introduced by #73662), instead of a cheap/incomplete check `getSinglePredecessor`.	2023-12-14 21:02:10 +08:00
XiangZhang	1d6a678591	[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" (#74703) Fix a loop unroll failure caused by branch folding. For example, SimplifyCFG folds loop branches, which then causes unrolling to fail for a "#pragma unroll" loop: ``` #pragma unroll for (int I = 0; I < ConstNum; ++I) { // folding "I < ConstNum" and "Cond2" if (Cond2) { break; } xxx loop body; } ``` The pragma unroll metadata only takes effect if there is an exact trip count, but not if there is only an upper bound on the trip count. This patch makes it work with an upper-bound trip count as well in shouldPragmaUnroll(). Loop unrolling is important on stack-sensitive devices (e.g. GPUs, which is why a lot of GPU code marks loops with "#pragma unroll"). It usually greatly simplifies the address (offset) calculations in the unrolled iterations, which enables many other optimizations, e.g. SROA, for these simplified addresses (eliminating the allocas for whole aggregates).	2023-12-08 19:43:10 +08:00
Philip Reames	ffb2af3ed6	[SCEVExpander] Attempt to reinfer flags dropped due to CSE (#72431) LSR uses SCEVExpander to generate induction formulas. The expander internally tries to reuse existing IR expressions. To do that, it needs to strip any poison generating flags (nsw, nuw, exact, nneg, etc.) which may not be valid for the newly added users. This is conservatively correct, but has the effect that LSR will strip nneg flags on zext instructions involved in trip counts in loop preheaders. To avoid this, this patch adjusts the expander to reinfer the flags on the CSE candidate if legal for all possible users. This should fix the regression reported in https://github.com/llvm/llvm-project/issues/71200. This should arguably be done inside canReuseInstruction instead, but doing it outside is more conservative compile time wise. Both canReuseInstruction and isGuaranteedNotToBePoison walk operand lists, so right now we are performing work which is roughly O(N^2) in the size of the operand graph. We should fix that before making the per operand step more expensive. My tentative plan is to land this, and then rework the code to sink the logic into more core interfaces.	2023-12-07 13:20:36 -08:00
Nikita Popov	eecb99c5f6	[Tests] Add disjoint flag to some tests (NFC) These tests rely on SCEV recognizing an "or" with no common bits as an "add". Add the disjoint flag to relevant or instructions in preparation for switching SCEV to use the flag instead of the ValueTracking query. The IR with disjoint flag matches what InstCombine would produce.	2023-12-05 14:09:36 +01:00
Joshua Cao	5602636835	[LoopPeel] Peel iterations based on and, or conditions (#73413) For example, this allows us to peel this loop with an `and`: ``` for (int i = 0; i < N; ++i) { if (i % 2 == 0 && i < 3) // can peel based on || as well f1(); f2(); } ``` into: ``` for (int i = 0; i < 3; ++i) { // peel three iterations if (i % 2 == 0) f1(); f2(); } for (int i = 3; i < N; ++i) f2(); ```	2023-12-02 11:24:02 -08:00
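
The peeling transformation described in the [LoopPeel] entry above (5602636835) can be checked by hand with a small source-level sketch. This is only an illustration of the before/after loops quoted in the commit message, not the pass itself: `f1` and `f2` are hypothetical stand-ins that record calls in a string, and the `peel` clamp for `n < 3` is an assumption added here so that the peeled form stays equivalent for short trip counts (the commit message's sketch implicitly assumes N >= 3).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for f1() and f2() from the commit message: each
// appends a marker so the two loop forms can be compared call for call.
static void f1(std::string &trace) { trace += 'a'; }
static void f2(std::string &trace) { trace += 'b'; }

// Original loop: `i % 2 == 0 && i < 3` is evaluated on every iteration.
std::string runOriginal(int n) {
  std::string trace;
  for (int i = 0; i < n; ++i) {
    if (i % 2 == 0 && i < 3)
      f1(trace);
    f2(trace);
  }
  return trace;
}

// Peeled form: the first three iterations are split off, so `i < 3` is
// known true in the peeled part and known false in the remaining loop.
std::string runPeeled(int n) {
  std::string trace;
  // Assumption added for this sketch: clamp the peel count when n < 3.
  const int peel = n < 3 ? n : 3;
  for (int i = 0; i < peel; ++i) {
    if (i % 2 == 0)
      f1(trace);
    f2(trace);
  }
  for (int i = 3; i < n; ++i)
    f2(trace);
  return trace;
}

// Compare both forms for every trip count in [0, maxN].
void checkPeelingEquivalence(int maxN) {
  for (int n = 0; n <= maxN; ++n)
    assert(runOriginal(n) == runPeeled(n));
}
```

Calling `checkPeelingEquivalence(10)` compares the two forms for trip counts 0 through 10. The remaining loop no longer contains the `i < 3` test at all, which is the point of peeling on one operand of the `and`.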

619 Commits
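
For the [X86] znver3/4 entry in the log (54e52aa5eb), the estimate in the message can be worked through numerically. The sketch below assumes one reading of that message, namely that the cap is roughly the op-cache delivery rate multiplied by the branch misprediction penalty; the function and constant names are hypothetical, and the values actually committed to the scheduler models are not shown in this log.

```cpp
#include <cassert>

// Assumed reading of the commit message: the useful unrolled size is about
// (op-cache rate in ops/cycle) * (misprediction penalty in cycles).
constexpr unsigned estimateLoopBufferCap(unsigned opsPerCycle,
                                         unsigned penaltyCycles) {
  return opsPerCycle * penaltyCycles;
}

// Figures quoted in the message: ~12 cycle penalty, 8 and 9 ops per cycle.
constexpr unsigned kMispredictPenaltyCycles = 12;
constexpr unsigned kZnver3OpsPerCycle = 8;
constexpr unsigned kZnver4OpsPerCycle = 9;

// Evaluate the estimate for both models and compare against the previous
// value of 512 uops mentioned in the message.
void checkEstimates() {
  const unsigned znver3 =
      estimateLoopBufferCap(kZnver3OpsPerCycle, kMispredictPenaltyCycles);
  const unsigned znver4 =
      estimateLoopBufferCap(kZnver4OpsPerCycle, kMispredictPenaltyCycles);
  assert(znver3 == 96);
  assert(znver4 == 108);
  assert(znver3 < 512 && znver4 < 512);
}
```

Under this reading both estimates come out roughly five times smaller than the earlier 512-uop setting, which is consistent with the message's aim of fewer, more useful unrolls.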