llvm-project

Author	SHA1	Message	Date
Luke Lau	61a0653cc6	[VPlan] Fix first-order splices without header mask not using EVL (#146672 ) This fixes a buildbot failure with EVL tail folding after #144666: https://lab.llvm.org/buildbot/#/builders/132/builds/1653 For a first-order recurrence to be correct with EVL tail folding we need to convert splices to vp splices with the EVL operand. Originally we did this by looking for users of the header mask and its users, and converting it in createEVLRecipe. However after #144666 a FOR splice might not actually use the header mask if it's based off e.g. an induction variable, and so we wouldn't pick it up in createEVLRecipe. This fixes this by converting FOR splices separately in a loop over all recipes in the plan, regardless of whether or not it uses the header mask. I think there was some conflation in createEVLRecipe between what was an optimisation and what was needed for correctness. Most of the transforms in it just exist to optimize the mask away and we should still emit correct code without them. So I've renamed it to make the separation clearer.	2025-07-03 16:55:00 +01:00
Luke Lau	ec25a0568c	[VPlan] Don't convert VPWidenSelectRecipes to vp.select in EVL transform (#146695 ) createEVLRecipe tries to optimise recipes that use the header mask by replacing them with their VP equivalents and setting the EVL, allowing the mask to be removed. However we currently also convert widened selects to vp.select even though they don't necessarily use the header mask. Unlike vp.merge a vp.select only makes the "unused" lanes past EVL poison, so it's not needed for correctness. In the same vein as #127180, this patch removes the transform for VPWidenSelectRecipes and keeps them as plain select instructions to allow for more optimisations. RISCVVLOptimizer will still be able to optimise away any VL toggles and we end up with better code generation across llvm-test-suite and SPEC CPU 2017.	2025-07-03 11:50:25 +01:00
Mel Chen	fcdb91e113	[VPlan] Remove redundant debug location setting in VPInterleaveRecipe::execute. nfc (#146670 ) Remove it since we already set debug loc in VPBasicBlock::executeRecipes.	2025-07-03 15:52:45 +08:00
Florian Hahn	4acdb8e14e	[VectorCombine] Scalarize extracts of ZExt if profitable. (#142976 ) Add a new scalarization transform that tries to convert extracts of a vector ZExt to a set of scalar shift and mask operations. This can be profitable if the cost of extracting is the same or higher than the cost of 2 scalar ops. This is the case on AArch64 for example. For AArch64,this shows up in a number of workloads, including av1aom, gmsh, minizinc and astc-encoder. PR: https://github.com/llvm/llvm-project/pull/142976	2025-07-03 08:49:32 +01:00
Mel Chen	3e370452fd	[VPlan] Early assert for unsupported interleaved access features. nfc (#146669 )	2025-07-03 15:31:53 +08:00
Luke Lau	7931a8f102	[VectorCombine] Scalarize vector intrinsics with scalar arguments (#146530 ) Some intrinsics like llvm.abs or llvm.powi have a scalar argument even when the overloaded type is a vector. This patch handles these in scalarizeOpOrCmp to allow scalarizing them. In the test the leftover vector powi isn't folded away to poison, this should be fixed in a separate patch.	2025-07-02 16:48:53 +01:00
David Sherwood	f575b18fdc	[LV] Add support for partial reductions without a binary op (#133922 ) Consider IR such as this: for.body: %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ] %accum = phi i32 [ 0, %entry ], [ %add, %for.body ] %gep.a = getelementptr i8, ptr %a, i64 %iv %load.a = load i8, ptr %gep.a, align 1 %ext.a = zext i8 %load.a to i32 %add = add i32 %ext.a, %accum %iv.next = add i64 %iv, 1 %exitcond.not = icmp eq i64 %iv.next, 1025 br i1 %exitcond.not, label %for.exit, label %for.body Conceptually we can vectorise this using partial reductions too, although the current loop vectoriser implementation requires the accumulation of a multiply. For AArch64 this is easily done with a udot or sdot with an identity operand, i.e. a vector of (i16 1). In order to do this I had to teach getScaledReductions that the accumulated value may come from a unary op, hence there is only one extension to consider. Similarly, I updated the vplan and AArch64 TTI cost model to understand the possible unary op. --------- Co-authored-by: Matt Devereau <matthew.devereau@arm.com>	2025-07-02 13:05:51 +01:00
Mel Chen	bc8dad1c7e	[VPlan] Emit VPVectorEndPointerRecipe for reverse interleave pointer adjustment (#144864 ) A reverse interleave access is essentially composed of multiple load/store operations with same negative stride, and their addresses are based on the last lane address of member 0 in the interleaved group. Currently, we already have VPVectorEndPointerRecipe for computing the last lane address of consecutive reverse memory accesses. This patch extends VPVectorEndPointerRecipe to support constant stride and extracts the reverse interleave group address adjustment from VPInterleaveRecipe::execute, replacing it with a VPVectorEndPointerRecipe. The final goal is to support interleaved accesses with EVL tail folding. Given that VPInterleaveRecipe is large and tightly coupled — combining both load and store, and embedding operations like reverse pointer adjustion (GEP), widen load/store, deinterleave/interleave, and reversal — breaking it down into smaller, dedicated recipes may allow VPlanTransforms::tryAddExplicitVectorLength to lower them into EVL-aware form more effectively. One foreseeable challenge is that VPlanTransforms::convertToConcreteRecipes currently runs after tryAddExplicitVectorLength, so decomposing VPInterleaveRecipe will likely need to happen earlier in the pipeline to be effective.	2025-07-02 18:16:02 +08:00
Simon Pilgrim	651c5208f8	VPlanRecipes.cpp - fix "'llvm::VPExpressionRecipe::computeCost': not all control paths return a value" MSVC warning. NFC.	2025-07-02 09:59:01 +01:00
Florian Hahn	863e17a5be	[VPlan] Make Phi operand for VPReductionPHIRecipe optional (NFC). VPReductionPHIRecipe doesn't rely on the underlying phi any longer, allow empty underlying values when cloning. NFC at the moment but will enable follow-up patches.	2025-07-01 22:49:27 +01:00
Florian Hahn	bcbc440712	[VPlan] Add missing VPWidenSelectto VPRecipeWithIRFlags::classof (NFC). Add missing entry to VPRecipeWithIRFlags. NFC currently as it is never called on VPWidenSelectRecipes currently.	2025-07-01 22:26:57 +01:00
Florian Hahn	829f2f2448	[VectorCombine] Mark function as changed if shuffle is created. 777d6b5de90b7e0 exposed a code path where a function is modified but not marked accordingly. Make sure we return true from foldShuffleFromReductions if only a shuffle has been inserted/replaced. Should fix https://lab.llvm.org/buildbot/#/builders/187/builds/7578.	2025-07-01 21:38:29 +01:00
Florian Hahn	777d6b5de9	[VectorCombine] Use InstSimplifyFolder to simplify instrs on creation. (#146350 ) Update VectorCombine to use InstSimplifyFolder to simplify redundant instructions on creation. PR: https://github.com/llvm/llvm-project/pull/146350	2025-07-01 20:55:51 +01:00
Florian Hahn	6b3d2b629c	[VPlan] Add VPExpressionRecipe, replacing extended reduction recipes. (#144281 ) This patch adds a new recipe to combine multiple recipes into an 'expression' recipe, which should be considered as single entity for cost-modeling and transforms. The recipe needs to be 'decomposed', i.e. replaced by its individual recipes before execute. This subsumes VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe and should make it easier to extend to include more types of bundled patterns, like e.g. extends folded into loads or various arithmetic instructions, if supported by the target. It allows avoiding re-creating the original recipes when converting to concrete recipes, together with removing the need to record various information. The current version of the patch still retains the original printing matching VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe, but this specialized print could be replaced with printing the bundled recipes directly. PR: https://github.com/llvm/llvm-project/pull/144281	2025-07-01 20:44:50 +01:00
David Sherwood	9b13dfdfbc	[LV] Use vscale for tuning to improve branch weight estimates (#144733 ) In addBranchWeightToMiddleTerminator we attempt to add branch weights to the middle block terminator. We pessimistically assume vscale=1, whereas we can improve the estimate by using the value of vscale used for tuning.	2025-07-01 13:23:38 +01:00
Luke Lau	4a2fa0847f	[VPlan] Support VPWidenIntOrFpInductionRecipes with EVL tail folding (#144666 ) Following on from #118638, this handles widened induction variables with EVL tail folding by setting the VF operand to be EVL, calculated in the vector body. We need to do this for correctness since with EVL tail folding the number of elements processed in the penultimate iteration may not be VF, but the runtime EVL, and we need take this into account when updating the backedge value. - Because the VF may now not be a live-in we need to move the insertion point to just after the VFs definition - We also need to avoid truncating it when it's the same size as the step type, previously this wasn't a problem for live-ins. - Also because the VF may be smaller than the IV type, since the EVL is always i32, we may need to zext it. On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and 42.8% more loops vectorized on SPEC CPU 2017	2025-07-01 12:29:24 +01:00
Florian Hahn	59a7185dd9	[VPlan] Truncate/Extend ComputeReductionResult at construction (NFC). (#141860 ) Instead of looking up the narrower reduction type via getRecurrenceType we can generate the needed extend directly at constructiond re-use the truncated value from the loop. PR: https://github.com/llvm/llvm-project/pull/141860	2025-06-30 22:39:17 +01:00
Florian Hahn	026aae7047	[VPlan] Infer reduction result types w/o accessing underlying phis.(NFC) Remove another use of the underlying IR phi.	2025-06-30 21:29:29 +01:00
Luke Lau	f01a7936be	[VPlan] Replace all uses of VF when EVL tail folding. NFCI (#146339 ) With EVL tail folding, any use of the VF live in should be replaced by the EVL. Otherwise, it should likely be directly emitted as a constant via VPTransformState::VF. This strengthens the EVL transformation by replacing all uses of VF with EVL and asserting that the only users are VPVectorEndPointerRecipe and VPScalarIVStepsRecipe, the latter of which is new. This should be NFC because even though we didn't previously replace the EVL of VPScalarIVStepsRecipe, it's only used when unrolling which we don't allow with EVL tail folding yet.	2025-06-30 13:47:38 +01:00
Florian Hahn	b822a32659	[VPlan] Fix crash when trying to narrow interleave group storing const. Use dyn_cast_null to handle the case where an interleave groups stores a constant in any of its lanes.	2025-06-29 21:29:12 +01:00
Florian Hahn	f5ed863176	Revert "[VPlan] Allow derived IVs and scalar-steps in narrowing interleave." This reverts commit 2787759ef2e41b19f8bfde06fe9a26b25d1f5834. This exposed a crash on some build bots. Revert to investigate.	2025-06-29 14:40:03 +01:00
Florian Hahn	2787759ef2	[VPlan] Allow derived IVs and scalar-steps in narrowing interleave. Both VPDerivedIVRecipe and VPScalarIVSteps recipe should be supported in narrowInterleaveGroups: * VPDerivedIVRecipe is based on the canonical IV and independent of VF, * VPScalarIVSteps takes the VF as operand, so it will be updated by narrowInterleaveGroup.	2025-06-29 13:18:51 +01:00
Florian Hahn	20fbbd7675	[LV] Add support for cmp reductions with decreasing IVs. (#140451 ) Similar to FindLastIV, add FindFirstIVSMin to support select (icmp(), x, y) reductions where one of x or y is a decreasing induction, producing a SMin reduction. It uses signed max as sentinel value. PR: https://github.com/llvm/llvm-project/pull/140451	2025-06-29 11:17:03 +01:00
Florian Hahn	bdb299a67e	[VPlan] Simplify code in single scalar transform code (NFC). Adjust code as suggested post-commit 3b7b95f78e2. `3b7b95f78e (r160997427)`	2025-06-28 22:53:14 +01:00
Florian Hahn	1949536494	[VPlan] Also visit VPBBs outside loop region when unrolling by VF. Make sure all VPBBs outside the top-level loop region and directly inside the region are visited; all those blocks may contain VPReplicateRecipes that need unrolling. This makes sure we unroll VPRepicateRecipes by VF if they are hoisted out of the loop, but cannot be converted to single scalar recipes yet.	2025-06-28 19:02:22 +01:00
vporpo	1eacdddc0c	[SandboxVec][SeedCollector][NFC] Replace cl::opt flags with constructor args (#143206 ) The `SeedCollector` class gets two new arguments: `CollectStores` and `CollectLoads`. These replace the `sbvec-collect-seeds` cl::opt flag. This is done to help with reusing the SeedCollector class in a future pass. The cl::opt flag is moved to the seed collection pass: Passes/SeedCollection.cpp	2025-06-27 12:27:25 -07:00
Florian Hahn	3b7b95f78e	[VPlan] Support VPWidenSelectRecipe in narrowToSingleScalar. VPWidenSelectRecipes are single scalars if all their operands are. Add support for narrowing them to a single scalar VPReplicateRecipe. This fixes a crash after https://github.com/llvm/llvm-project/pull/142433 (aa24029319083) when due to a replicate recipe not being converted to single-scalar being hoisted to the vector preheader.	2025-06-27 15:42:42 +01:00
Ramkumar Ramachandra	613804cca9	[LV] Improve code using [[maybe_unused]] (NFC) (#137138 )	2025-06-27 10:58:17 +01:00
David Sherwood	bf2b14acf3	[LV] Enable auto-vectorisation of loops with uncountable exits (#133099 ) Until now the feature to enable vectorisation of some early exit loops with uncountable exits was controlled under a flag, off by default. Now that we have efficient code generation for vectorising such loops (see PR #130766) and we still have some time from the next LLVM release it seems like a good time point to enable the feature by default. If any issues arise post-commit it can be easily reverted. Using this patch I built and ran the LLVM test suite successfully, which on neoverse-v1 led to the vectorisation of 114 additional early exit loops. I also built and ran SPEC2017 successfully for both neoverse-v1 and neoverse-v2.	2025-06-27 10:39:33 +01:00
David Sherwood	6f43754e9c	[LV] Disable interleaving via hints for uncountable early exit loops (#145877 ) Currently if the user enables interleaving during vectorisation of uncountable early exit loops via the interleave_count pragma and the enable-early-exit-vectorization option, it will miscompile. There is ongoing work to fix this, but for now it seems safer to ignore the hint until it is supported. --------- Co-authored-by: Paul Walker <paul.walker@arm.com>	2025-06-27 09:09:55 +01:00
Florian Hahn	ec62dee703	[VPlan] Handle FirstActiveLane when unrolling. (#145394 ) Currently FirstActiveLane is not handled correctly during unrolling. This is currently causing mis-compiles when vectorizing early-exit loops with interleaving forced. This patch updates handling of FirstActiveLane to be analogous to computing final reduction results: during unrolling, the created copies for its original operand are added as additional operands, and FirstActiveLane will always produce the index of the first active lane across all unrolled iterations. Note that some of the generated code is still incorrect, as we also need to handle ExtractElement with FirstActiveLane operands. I will share patches for those soon as well. PR: https://github.com/llvm/llvm-project/pull/145394	2025-06-27 08:44:57 +01:00
Florian Hahn	786ccb2c0e	[LV] Directly check if memory or SCEV check blocks are used (NFCI). Slightly simplify the logic to retrieve check blocks in GeneratedRTChecks, to prepare for additional refactoring.	2025-06-27 07:24:32 +01:00
Florian Hahn	772eb07567	[VPlan] Clarify comments after #145340 (NFC). Adjust comments as suggested post-commit for #145340.	2025-06-26 22:42:06 +01:00
Igor Kirillov	aeec2c6e48	[VPlan] Speed up VPSlotTracker by using ModuleSlotTracker (#139881 ) Currently, when VPSlotTracker is initialized with a VPlan, its assignName method calls printAsOperand on each underlying instruction. Each such call recomputes slot numbers for the entire function, leading to O(N × M) complexity, where M is the number of instructions in the loop and N is the number of instructions in the function. This results in slow debug output for large loops. For example, printing costs of all instructions becomes O(M² × N), which is especially painful when enabling verbose dumps. This patch improves debugging performance by caching slot numbers using ModuleSlotTracker. It avoids redundant recomputation and makes debug output significantly faster.	2025-06-26 22:40:48 +01:00
Narayan	d05634d5cd	[VectorCombine] Fold bitwise operations of bitcasts into bitcast of bitwise operation (#137322 ) Currently, LLVM fails to convert certain pblendvb intrinsics into select instructions when the blend mask is derived from complex boolean logic operations. This occurs even when the mask is ultimately based on sign-extended comparison results, preventing further optimization opportunities. Fixes #66513 --------- Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>	2025-06-26 14:57:22 +01:00
Florian Hahn	5b76cdba5a	[VPlan] Handle AnyOf when unrolling. (#145340 ) Currently AnyOf is not handled correctly during unrolling. This is currently causing mis-compiles when vectorizing early-exit loops with interleaving forced (even though selectInterleaveCount will currently only pick IC = 1, unless forced by the user). This patch updates handling of AnyOf to be analogous to computing final reduction results: during unrolling, the created copies for its original operand are added as additional operands, and AnyOf will always produce the reduced value across all unrolled iterations. Note that the generated code is still incorrect, as we also need to handle FirstActiveLane and ExtractElement with FirstActiveLane operands. I will share patches for those soon as well. PR: https://github.com/llvm/llvm-project/pull/145340	2025-06-26 14:19:38 +01:00
Florian Hahn	aa24029319	[VPlan] Unroll VPReplicateRecipe by VF. (#142433 ) Explicitly unroll VPReplicateRecipes outside replicate regions by VF, replacing them by VF single-scalar recipes. Extracts for operands are added as needed and the scalar results are combined to a vector using a new BuildVector VPInstruction. It also adds a few folds to simplify unnecessary extracts/BuildVectors. It also adds a BuildStructVector opcode for handling of calls that have struct return types. VPReplicateRecipe in replicate regions can will be unrolled as follow up, turing non-single-scalar VPReplicateRecipes into 'abstract', i.e. not executable. PR: https://github.com/llvm/llvm-project/pull/142433	2025-06-26 11:19:09 +01:00
Kazu Hirata	2a35414e98	[Transforms] Use range-based for loops (NFC) (#145252 ) Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-06-25 10:08:26 -07:00
LiqinWeng	4ac4726d00	[VPlan] Format some print forms.NFC (#144644 )	2025-06-25 16:14:50 +08:00
Florian Hahn	830b2c842e	[LV] Replace redundant ExtractLastElement of reduction result (NFC). Replace redundant ExtractLastElement VPInstructions early. This is NFC, as the VPInstruction computing the final result is vector-to-scalar, producing a single scalar already. This enables follow-up changes to model more aspects of reductions directly in VPlan.	2025-06-24 21:48:58 +01:00
Florian Hahn	c3e25e7fc4	[VPlan] Add VPInst::getNumOperandsForOpcode, use to verify in ctor (NFC) (#142284 ) Add a new getNumOperandsForOpcode helper to determine the number of operands from the opcode. For now, it is used to verify the number operands at VPInstruction construction. It returns -1 for a few opcodes where the number of operands cannot be determined (GEP, Switch, PHI, Call). This can also be used in a follow-up to determine if a VPInstruction is masked based on the number of arguments. PR: https://github.com/llvm/llvm-project/pull/142284	2025-06-24 20:39:35 +01:00
Ramkumar Ramachandra	bb8c42e859	[LV] Extend FindLastIV to unsigned case (#141752 ) Split the FindLastIV RecurKind into SMax and UMax variants, depending on the reduction op produced.	2025-06-23 15:27:49 +01:00
Simon Pilgrim	26390f22b8	[VectorCombine] foldShuffleOfShuffles - fold shuffle(shuffle(x,y),poison) length changing masks (#144690 ) The shuffle merging code assumes that the shuffle sources are all the same type, which fails if we've changed length and don't have 2 inner shuffles. We already handle length-changing shuffles if we do have 2 inner shuffles. This patch creates a fake "all poison" shuffle mask and reuses the other shuffle's sources, which can be safely used with the existing merge code. The alternative was a considerable refactor of the merge code to account for different vector widths...... Fixes #144656	2025-06-22 13:30:45 +01:00
Florian Hahn	58b939abe5	[VPlan] Support matching constants in narrowInterleaveGroups. Matching constants can trivially be broadcasted, allow them if the same constant is used for all recipes in a bundle.	2025-06-22 08:45:40 +01:00
Florian Hahn	9f7a155394	[VPlan] Update packScalarIntoVector to take and return wide value (NFC) Make the function more flexible in preparation for new users.	2025-06-21 18:03:14 +01:00
Florian Hahn	60d1276b0e	[VPlan] Pass operand index to canNarrowLoad. (NFC) Explicitly pass the operand we are checking to canNarrowLoad. This simplifies the check if the operands match across recipes and enables future optimizations.	2025-06-21 15:41:26 +01:00
David Green	77941eba7f	[CostModel] Add a DstTy to getShuffleCost (#141634 ) A shuffle will take two input vectors and a mask, to produce a new vector of size <MaskElts x SrcEltTy>. Historically it has been assumed that the SrcTy and the DstTy are the same for getShuffleCost, with that being relaxed in recent years. If the Tp passed to getShuffleCost is the SrcTy, then the DstTy can be calculated from the Mask elts and the src elt size, but the Mask is not always provided and the Tp is not reliably always the SrcTy. This has led to situations notably in the SLP vectorizer but also in the generic cost routines where assumption about how vectors will be legalized are built into the generic cost routines - for example whether they will widen or promote, with the cost modelling assuming they will widen but the default lowering to promote for integer vectors. This patch attempts to start improving that - it originally tried to alter more of the cost model but that too quickly became too many changes at once, so this patch just plumbs in a DstTy to getShuffleCost so that DstTy and SrcTy can be reliably distinguished. The callers of getShuffleCost have been updated to try and include a DstTy that is more accurate. Otherwise it tries to be fairly non-functional, keeping the SrcTy used as the primary type used in shuffle cost routines, only using DstTy where it was in the past (for InsertSubVector for example). Some asserts have been added that help to check for consistent values when a Mask and a DstTy are provided to getShuffleCost. Some of them took a while to get right, and some non-mask calls might still be incorrect. Hopefully this will provide a useful base to build more shuffles that alter size.	2025-06-21 12:29:29 +01:00
Florian Hahn	2f5d965bb5	[VPlan] Use EMIT-SCALAR when printing casts. Split off EMIT-SCALAR printing changes from already approved https://github.com/llvm/llvm-project/pull/140623. Currently all casts are single scalars, this brings printing in line with printing for other VPInstructions.	2025-06-21 10:23:53 +01:00
Florian Hahn	f8ffb4e7cd	[VPlan] Simplify ExtractLastElement(Broadcast(A)) -> A. Remove trivial ExtractLastElement VPInstructions.	2025-06-20 21:08:14 +01:00
Philip Reames	c103bbc836	[LV] Consider whether vscale is a known power of two for iteration check (#144963 ) Going mostly by the comment here - but it says "vscale is not necessarily a power-of-2". Both in tree targets have vscale as a power of two, and we have an existing TTI hook for that.	2025-06-20 11:37:27 -07:00

... 3 4 5 6 7 ...

6397 Commits