llvm-project

Author	SHA1	Message	Date
Vasileios Porpodas	55ce296d6f	[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost` Before this patch `Args` was used to pass a broadcast's arguments by SLP. This patch changes this. `Args` is now used for passing the operands of the shuffle. Differential Revision: https://reviews.llvm.org/D124202	2022-04-26 11:11:29 -07:00
Valery N Dmitriev	88b9e46fb5	[SLP] Steer for the best chance in tryToVectorize() when rooting with binary ops. tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer, specifically for binary and comparison operations. Order of making probes for various scalar pairs was defined by its implementation: the instruction operands, then climb over one operand if the instruction is its sole user and then perform same actions for another operand if previous attempts failed. Problem with this approach is that among these options we can have more than a single vectorizable tree candidate and it is not necessarily the one that encountered first. Trying to build vectorizable tree for each possible combination for just evaluation is expensive. But we already have lookahead heuristics mechanism which we use for finding best pick among operands of commutative instructions. It calculates cumulative score for candidates in two consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among several combinations. We only try one that looks as most promising for vectorization. Additional benefit is that we reduce total number of vectorization trees built for probes because we skip those looking non-profitable early. Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo) Differential Revision: https://reviews.llvm.org/D124309	2022-04-25 12:25:33 -07:00
David Green	9727c77d58	[NFC] Rename Instrinsic to Intrinsic	2022-04-25 18:13:23 +01:00
Valery N Dmitriev	edf7bed87b	[SLP][NFC] Outline lookahead heuristics into a separate helper class. Minor refactoring to reduce size of functional change D124309: look-ahead scoring routines pulled out of VLOperands and formed new LookAheadHeuristics helper class. Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo) Differential Revision: https://reviews.llvm.org/D124313	2022-04-22 18:59:08 -07:00
Vasileios Porpodas	889588ee97	[SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`. Replacing `unsigned` with `ElementCount` in the argument of `isLegalBroadcastLoad()`. This helps reduce the diff of a future SLP patch for AArch64.	2022-04-21 10:19:00 -07:00
Florian Hahn	bea69b232f	[VPlan] Initial modeling of middle block in VPlan. This patch extends the scope of VPlan to also include the exit (aka middle) block. For now, the exit block remains empty, but handling of exit values will subsequently be moved to VPlan, by adding recipes to model exit values in the exit block. As a first step, this will allow fixing #51366. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123457	2022-04-20 19:34:41 +01:00
Vasileios Porpodas	8d4b5e0833	[NFC][SLP] Improved description of getShallowScore() and getScoreAtLevelRec() Differential Revision: https://reviews.llvm.org/D124027	2022-04-19 12:15:36 -07:00
Florian Hahn	4026b718b8	[VPlan] Remove unused SCEV forward declaration (NFC).	2022-04-19 17:16:17 +02:00
Alexey Bataev	883571928c	Revert "[SLP]Improve reductions analysis and emission, part 1." This reverts commit 0e1f4d4d3cb08ff84df5adc4f5e41d0a2cebc53d to fix a crash reported in PR54976	2022-04-19 06:17:03 -07:00
Florian Hahn	a65f2730d2	[VPlan] Expand induction step in VPlan pre-header. This patch moves SCEV expansion of steps used by VPWidenIntOrFpInductionRecipes to the pre-header using VPExpandSCEVRecipe. This ensures that those steps are expanded while the CFG is in a valid state. Previously, SCEV expansion may happen during vector body code-generation, during which the CFG may be invalid, causing issues with SCEV expansion. Depends on D122095. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D122096	2022-04-19 13:06:39 +02:00
Vasileios Porpodas	b1333f03d9	Recommit "[SLP] Support internal users of splat loads" Code review: https://reviews.llvm.org/D121940 This reverts commit 359dbb0d3daa8295848a09ddd083c79f6851888e.	2022-04-18 15:58:01 -07:00
Vasileios Porpodas	359dbb0d3d	Revert "[SLP] Support internal users of splat loads" This reverts commit f8e1337115623cb879f734940fd9dfeb29a611e7.	2022-04-18 12:12:34 -07:00
Vasileios Porpodas	f8e1337115	[SLP] Support internal users of splat loads Until now we would only accept a broadcast load pattern if it is only used by a single vector of instructions. This patch relaxes this, and allows for the broadcast to have more than one user vector, as long as all of its uses are internal to the SLP graph and vectorized. Differential Revision: https://reviews.llvm.org/D121940	2022-04-18 11:59:44 -07:00
Florian Hahn	73f5d7d0d6	[VPlan] Handle equal address and store ops in onlyFirstLaneDemanded. With opaque pointers, the stored value and address can be the same. Previously the code in VPWidenMemoryInstructionRecipe::onlyFirstLaneDemanded incorrectly considers stores with matching store and pointer operands as only demanding the first lane, causing a crash.	2022-04-15 22:53:33 +02:00
Johannes Doerfert	1fb415fee9	[AMDGPU][FIX] Proper load-store-vectorizer result with opaque pointers The original code relied on the fact that we needed a bitcast instruction (for non constant base objects). With opaque pointers there might not be a bitcast. Always check if reordering is required instead. Fixes: https://github.com/llvm/llvm-project/issues/54896 Differential Revision: https://reviews.llvm.org/D123694	2022-04-15 13:42:46 -05:00
Florian Hahn	2c14cdf831	[VPlan] Turn external defs in Value -> VPValue mapping. This addresses an existing TODO by keeping a mapping of external IR Value * definitions wrapped in VPValues for use in a VPlan. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D123700	2022-04-14 12:03:09 +02:00
serge-sans-paille	fa5a4e1b95	[iwyu] Handle regressions in libLLVM header include Running iwyu-diff on LLVM codebase since a96638e50ef5 detected a few regressions, fixing them.	2022-04-13 20:53:19 +02:00
Alexey Bataev	0e1f4d4d3c	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with too many/few uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), too many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which will highly likely be vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-04-12 17:46:11 -07:00
Muhammad Omair Javaid	42ebfa8269	Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth" This reverts commit 64b6192e812977092242ae34d6eafdcd42fea39d. This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage: https://lab.llvm.org/buildbot/#/builders/176/builds/1515 llvm-tblgen crashes after applying this patch.	2022-04-13 04:53:07 +05:00
Florian Hahn	5f1eb74850	[VPlan] Place VPExpandSCEVRecipe in pre-header. After D121624 models the pre-header in VPlan, VPExpandSCEVRecipes can be placed there. This ensures SCEV expansion happens before modifying the CFG during VPlan execution, when CFG is incomplete. Depends on D121624. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D122095	2022-04-10 10:26:20 +02:00
Florian Hahn	256c6b0ba1	[VPlan] Model pre-header explicitly. This patch extends the scope of VPlan to also model the pre-header. The pre-header can be used to place recipes that should be code-gen'd outside the loop, like SCEV expansion. Depends on D121623. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121624	2022-04-09 14:19:47 +02:00
Florian Hahn	467dbcd9f1	[LV] Set debug loc after setting insert point. This fixes the code to actually use the location of the instruction, if available. Previously, SetInsertPoint would overwrite the insert point set from the instruction.	2022-04-08 20:34:40 +02:00
Florian Hahn	29fe998eaa	[VPlan] Preserve debug location when creating branch. Update createEmptyBasicBlock to preserve the debug location of the previous terminator.	2022-04-08 17:22:53 +02:00
Florian Hahn	4388c979da	[VPlan] Use vector.body as header name in VPlan native path. This brings the VPlan block naming in line with the naming of the generated basic blocks.	2022-04-07 10:31:12 +02:00
Jingu Kang	64b6192e81	[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth Set the maximum VF of AArch64 with 128 / the size of smallest type in loop. Differential Revision: https://reviews.llvm.org/D118979	2022-04-05 13:16:52 +01:00
Florian Hahn	1ff022e21b	[LV] Add vector.body block to parent loop during skeleton creation. When creating induction resume values, SCEV queries may rely on LoopInfo. Make sure vector.body gets added to the loop of the pre-header during skeleton construction. %vector.body will be moved to the vector preheader during VPlan execution. Fixes #54745.	2022-04-05 11:54:17 +01:00
Florian Hahn	1817c526e1	[VPlan] Update VPInterleavedAccessInfo to use getVectorLoopRegion. Update VPInterleavedAccessInfo to use the generic getVectorLoopRegion helper instead of relying on the entry block being the top-most vector loop region.	2022-04-04 10:26:39 +01:00
Florian Hahn	8cd1892725	[VPlan] Remember previous loop and reset vector loop. At the moment this is NFC, but will be needed once nested loops are also modeled as regions. Preparation for D123005.	2022-04-04 09:27:15 +01:00
Philip Reames	88de27e3fd	[LV] Handle non-integral types when considering interleave widening legality In general, anywhere we might need to insert a blind bitcast, we need to make sure the types are losslessly convertible. This fixes pr54634.	2022-04-03 20:16:20 -07:00
Florian Hahn	95b2aa511e	[VPlan] Set VPlan header block name to vector.body. This brings the VPlan block naming in line with the naming of the generated basic blocks.	2022-04-02 19:34:32 +01:00
Florian Hahn	f8101e4d68	Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts commit 14e3650f01d158f7e4117c353927a07ceebdd504. The issue causing the revert were fixed independently in a08c90a4023f and 14e5f9785c9c.	2022-04-01 16:53:39 +01:00
Florian Hahn	14e5f9785c	[LV] Add SCEV workaround from 80e8025 to epilogue vector code path. This was exposed by 14e3650f. The recommit of 14e3650f will hit the problematic code path requiring the workaround. test case that crashes without the workaround.	2022-04-01 15:14:47 +01:00
Florian Hahn	a08c90a402	[LV] Re-use TripCount from EPI.TripCount. During skeleton construction for the epilogue vector loop, generic helpers use getOrCreateTripCount, which will re-expand the trip count computation. Instead, re-use the TripCount created during main loop vectorization.	2022-04-01 13:47:34 +01:00
Florian Hahn	14e3650f01	Revert "Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)"" This reverts commit 8378a71b6cce611e01f42690713fd7b561ff3f30. It looks like this patch uncovered another issue, e.g. see https://lab.llvm.org/buildbot/#/builders/168/builds/5518	2022-03-31 19:00:48 +01:00
Florian Hahn	8378a71b6c	Recommit "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts the revert commit 2760cdc9c6. This version pulls in the code to create the vector loop object in VPlan from D121624. This is needed because otherwise existing LoopInfo verification will fail, as a loop block doesn't have in-loop successors now that we do not replace the branch. Now that we do not add new loops during skeleton construction, there's also no need to verify LI there.	2022-03-31 14:48:32 +01:00
Florian Hahn	2760cdc9c6	Revert "[LV] Remove unneeded createHeaderBranch.(NFCI)" This reverts commit 32bc83d11e19b8a8c15df81b32fde1f9f8c6156b. This is causing bots with expensive-checks to fail. Revert while I investigate.	2022-03-31 12:32:50 +01:00
Florian Hahn	32bc83d11e	[LV] Remove unneeded createHeaderBranch.(NFCI) The only remaining use was to get the exit block of the loop. Instead of relying on the loop, use the successor of VectorHeaderBB (LoopMiddleBlock) directly to set VPTransformState::CFG::ExitB Depends on D121621. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121623	2022-03-31 11:48:52 +01:00
Florian Hahn	2c494f0941	[VPlan] Remove unneeded Loop variable (NFC). Suggested in D121623. The remaining uses of L can be replaced, reducing the need for the variable.	2022-03-31 10:34:28 +01:00
David Green	b65267ca7b	[LV] Invalidate widening decisions after maximizing vector bandwidth When MaximizeVectorBandwidth is enabled, we can end up (via calls to collectUniformsAndScalars/setCostBasedWideningDecision through calculateRegisterUsage) making widening decisions before we have decided whether to fold the tail by masking. These decisions will be wrong if we later decided to fold the tail, for example when the trip count is very low. It will use incorrect costs for loads that should get masked, using standard memory operation costs instead. This still at the moment uses the EmulatedMaskMemRefHack costs (a bit unfortunately), but the old costs without this change were 1, leading to too optimistic vectorization. This slightly changes the way that the MaximizeVectorBandwidth option works to make it easier to test, always honouring the option if it is set. Differential Revision: https://reviews.llvm.org/D120215	2022-03-31 09:19:31 +01:00
Florian Hahn	e4543af4e6	[VPlan] Track current vector loop in VPTransformState (NFC). Instead of looking up the vector loop using the header, keep track of the current vector loop in VPTransformState. This removes the requirement for the vector header block being part of the loop up front. A follow-up patch will move the code to generate the Loop object for the vector loop to VPRegionBlock. Depends on D121619. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121621	2022-03-30 22:16:40 +01:00
Florian Hahn	e8673f2f20	[LV] Do not create separate latch block in VPlan::execute. Now that all dependencies on creating the latch block up-front have been removed, there is no need to create it early. Depends on D121618. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121619	2022-03-30 17:31:38 +01:00
Florian Hahn	8a4077fac0	[LV] Pass LoopHeaderBB directly to updateDominatorTree. (NFC) At the call site, we already know what the vector header block is. Pass it directly.	2022-03-30 13:11:20 +01:00
Florian Hahn	ecb4171dcb	[LV] Handle zero cost loops in selectInterleaveCount. In some case, like in the added test case, we can reach selectInterleaveCount with loops that actually have a cost of 0. Unfortunately a loop cost of 0 is also used to communicate that the cost has not been computed yet. To resolve the crash, bail out if the cost remains zero after computing it. This seems like the best option, as there are multiple code paths that return a cost of 0 to force a computation in selectInterleaveCount. Computing the cost at multiple places up front there would unnecessarily complicate the logic. Fixes #54413.	2022-03-29 22:52:43 +01:00
Florian Hahn	d1d3563278	[LV] Move code to place pointer induction increment to VPlan post-processing. This patch moves the code to set the correct incoming block for the backedge value to VPlan::execute. When generating the phi node, the backedge value is temporarily added using the pre-header as incoming block. The invalid phi node will be fixed up during VPlan::execute after main VPlan code generation. At the same time, the backedge value is also moved to the latch. This change removes the requirement to create the latch block up-front for VPWidenInductionPHIRecipe::execute, which in turn will enable modeling the pre-header in VPlan. Depends on D121617. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121618	2022-03-29 20:27:59 +01:00
Philip Reames	7d6e8f2a96	[slp] Delete dead scalar instructions feeding vectorized instructions If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well. This is purely for test output quality and readability. It should have no effect in any sane pipeline. Differential Revision: https://reviews.llvm.org/D122493	2022-03-28 20:10:13 -07:00
Florian Hahn	e7bf2ea934	[LV] Move code to place induction increment to VPlan post-processing. This patch moves the code to set the correct incoming block for the backedge value to VPlan::execute. When generating the phi node, the backedge value is temporarily added using the pre-header as incoming block. The invalid phi node will be fixed up during VPlan::execute after main VPlan code generation. At the same time, the backedge value is also moved to the latch. This change removes the requirement to create the latch block up-front for VPWidenIntOrFpInductionRecipe::execute, which in turn will enable modeling the pre-header in VPlan. As an alternative, the increment could be modeled as separate recipe, but that would require more work and a bit of redundant code, as we need to create the step-vector during VPWidenIntOrFpInductionRecipe::execute anyways, to create the values for different parts. Reviewed By: Ayal Differential Revision: https://reviews.llvm.org/D121617	2022-03-28 16:20:02 +01:00
Philip Reames	f80aaa675f	[SLP] Simplify eraseInstruction [NFC] This simplifies the implementation of eraseInstruction by moving the odd-replace-users-with-undef handling back to the only caller which uses it. This handling was not obviously correct, so add the asserts which make it clear why this is safe to do at all. The result is simpler code and stronger assertions.	2022-03-25 12:01:52 -07:00
Philip Reames	48cc9287f5	Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3) The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change. Original commit message follows: SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users. This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example: Before this patch: 704357 SLP - Number of calcDeps actions 699021 SLP - Number of schedule calls 5598 SLP - Number of ReSchedule actions 59 SLP - Number of ReScheduleOnFail actions 10084 SLP - Number of schedule resets 8523 SLP - Number of vector instructions generated After this patch: 102895 SLP - Number of calcDeps actions 161916 SLP - Number of schedule calls 5637 SLP - Number of ReSchedule actions 55 SLP - Number of ReScheduleOnFail actions 10083 SLP - Number of schedule resets 8403 SLP - Number of vector instructions generated I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore. The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass. For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch. Differential Revision: https://reviews.llvm.org/D118538	2022-03-25 10:39:23 -07:00
Philip Reames	ec858f0201	[SLP] Optimize stacksave dependence handling [NFC] After writing the commit message for 4b1bace28, realized that the mentioned optimization was rather straight forward. We already have the code for scanning a block during region initialization, we can simply keep track if we've seen a stacksave or stackrestore. If we haven't, none of these dependencies are relevant and we can avoid the relatively expensive scans entirely.	2022-03-25 10:04:10 -07:00
Philip Reames	a16308c282	[SLP] Explicit track required stacksave/alloca dependency (try 3) This is an extension of commit b7806c to handle one last case noticed in test changes for D118538. Again, this is thought to be a latent bug in the existing code, though this time I have not managed to reduce tests for the original algorithm. The prior attempt had failed to account for this case: %a = alloca i8 stacksave stackrestore store i8 0, i8* %a If we allow '%a' to reorder into the stacksave/restore region, then the alloca will be deallocated before the use. We will have taken a well defined program, and introduced a use-after-free bug. There's also an inverse case where the alloca originally follows the stackrestore, and we need to prevent the reordering it above the restore. Compile time wise, we potentially do an extra scan of the block for each alloca seen in a bundle. This is significantly more expensive than the stacksave rooted version and is why I'd tried to avoid this in the initial patch. There is room to optimize this (by essentially caching a "has stacksave" bit per block), but I'm leaving that to future work if it actually shows up in practice. Since allocas in bundles should be rare in practice, I suspect we can defer the complexity for a long while.	2022-03-25 10:04:10 -07:00


3535 Commits