llvm-project

Author	SHA1	Message	Date
Alexey Bataev	b17f607703	[SLP][NFC]Remove unnecessary std::optional around Factor value	2024-11-20 05:54:15 -08:00
Sjoerd Meijer	9bccf61f5f	[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385 ) Set the maximum interleaving factor to 4, aligning with the number of available SIMD pipelines. This increases the number of vector instructions in the vectorised loop body, enhancing performance during its execution. However, for very low iteration counts, the vectorised body might not execute at all, leaving only the epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which experienced a performance regression. To address this, the patch reduces the minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be vectorised and largely mitigating the regression.	2024-11-20 09:33:39 +00:00
vporpo	6e4821487f	[SandboxVec][DAG] Register callback for erase instr (#116742 ) This patch adds the callback registration logic in the DAG's constructor and the corresponding deregistration logic in the destructor. It also implements the code that makes sure that SchedBundle and DGNodes can be safely destroyed in any order.	2024-11-19 16:20:38 -08:00
Alexey Bataev	79682c4d57	[SLP]Check if the buildvector root is not a part of the graph before deletion If the buildvector root has no uses, it might be still needed as a part of the graph, so need to check that it is not a part of the graph before deletion. Fixes #116852	2024-11-19 11:31:40 -08:00
David Sherwood	12180717cb	[NFC][LoopVectorize] Introduce new getEstimatedRuntimeVF function (#116247 ) There are lots of places where we try to estimate the runtime vectorisation factor based on the getVScaleForTuning TTI hook. I've added a new getEstimatedRuntimeVF function and taught several places in the vectoriser to use this new function.	2024-11-19 12:38:11 +00:00
David Sherwood	3097c60928	[LoopVectorize][NFC] Rewrite tests to check output of vplan cost model (#113697 ) Currently it's very difficult to improve the cost model for tail-folded loops because as soon as you add a VPInstruction::computeCost function that adds the costs of instructions such as VPInstruction::ActiveLaneMask and VPInstruction::ExplicitVectorLength the assert in LoopVectorizationPlanner::computeBestVF fails for some tests. This is because the VF chosen by the legacy cost model doesn't match the vplan cost model. See PR #90191. This assert is currently making it difficult to improve the cost model. Hopefully we will be in a position to remove the assert soon, however in order to do that we have to fix up a whole bunch of tests that rely upon the legacy cost model output. I've tried my best to update these tests to use vplan output instead. There is still work needed for the VF=1 case because the vplan cost model is not printed out in this case. I've not attempted to fix those in this patch.	2024-11-19 08:55:39 +00:00
Alexey Bataev	ad9c0b369e	[SLP]Check if the gathered loads form full vector before attempting build it Need to check that the number of gathered loads in the slice forms the build vector to avoid compiler crash. Fixes #116691	2024-11-18 14:09:31 -08:00
Julian Nagele	a8538b9138	[LV] Vectorize Epilogues for loops with small VF but high IC (#108190 ) - Consider MainLoopVF * IC when determining whether Epilogue Vectorization is profitable - Allow the same VF for the Epilogue as for the main loop - Use an upper bound for the trip count of the Epilogue when choosing the Epilogue VF PR: https://github.com/llvm/llvm-project/pull/108190 --------- Co-authored-by: Florian Hahn <flo@fhahn.com>	2024-11-17 19:35:32 +00:00
vporpo	1be9827754	[SandboxVec][BottomUpVec] Implement packing of vectors (#116447 ) Up until now we could only support packing of scalar elements. This patch fixes this by implementing packing of vector elements, by generating extractelement and insertelement instruction pairs.	2024-11-15 16:12:22 -08:00
vporpo	3be3b33e57	[SandboxVec][BottomUpVec] Implement pack of scalars (#115549 ) This patch implements packing of scalar operands when the vectorizer decides to stop vectorizing. Packing is implemented with a sequence of InsertElement instructions. Packing vectors requires different instructions so it's implemented in a follow-up patch.	2024-11-15 14:45:17 -08:00
Alexey Bataev	f6e1d64458	[SLP]Enable interleaved stores support Enables interleaved stores, which results in better estimation for segmented stores for RISC-V Reviewers: preames, topperc, RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/115354	2024-11-15 11:01:57 -05:00
Alexey Bataev	af3295bd3d	[SLP]Enable splat ordering for loads Enables splat support for loads with lanes > 2 or number of operands > 2. Allows better detection of splats of loads and reduces the number of shuffles in some cases.
X86, AVX512, -O3+LTO
Metric: size..text
Program  results  results0  diff
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test  154867.00  156723.00  1.2%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  12467735.00  12468023.00  0.0%
Better vectorization quality
Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/115173	2024-11-15 10:29:43 -05:00
Stephen Tozer	caa9a82797	[DebugInfo][LoopVectorizer] Avoid dropping !dbg in optimizeForVFAndUF (#114243 ) Prior to this patch, optimizeForVFAndUF may optimize the conditional branch for a VPBasicblock to have a constant condition, but unnecessarily drops the DILocation attachment when it does so; this patch changes it to preserve the DILocation.	2024-11-14 09:33:46 +00:00
Luke Lau	050e2d325a	[LV] Remove assertions in IV overflow check (#115705 ) In #111310 an assert was added that for the IV overflow check used with tail folding, the overflow check is never known. However when applying the loop guards, it looks like it's possible that we might actually know the IV won't overflow: this occurs in 500.perlbench_r from SPEC CPU 2017 and triggers the assertion: Assertion failed: (!isIndvarOverflowCheckKnownFalse(Cost, VF * UF) && !SE.isKnownPredicate(CmpInst::getInversePredicate(ICmpInst::ICMP_ULT), TC2OverflowSCEV, SE.getSCEV(Step)) && "unexpectedly proved overflow check to be known"), function emitIterationCountCheck, file LoopVectorize.cpp, line 2501. There is a discrepancy between `isIndvarOverflowCheckKnownFalse` and the ICMP_ULT check, because the former uses `getSmallConstantMaxTripCount` which only takes into account trip counts that fit into 32 bits. There doesn't seem to be an easy way to make the assertion aware of this, so this PR just removes it for now. There are two potential follow up things from this PR: 1. We miss calculating the max trip count in `@trip_count_max_1024`, it looks like we might need to apply loop guards somewhere in `ScalarEvolution::computeExitLimitFromICmp` 2. In `@overflow_at_0`, if `%tc == 0` then the overflow check will always return false, even though it will overflow Fixes https://github.com/llvm/llvm-project/issues/115755	2024-11-14 17:04:49 +09:00
Luke Lau	9e77f59005	[LV] Account for vp_merge in out of loop EVL reductions in legacy cost model (#115903 ) In #101641, support for out of loop reductions with EVL tail folding was added by transforming selects to vp_merges in transformRecipestoEVLRecipes. Whilst the select was previously free, the vp_merge wasn't and incurs a cost on RISC-V with the VPlan cost model. But this diverged from the legacy cost model and caused the "VPlan cost model and legacy cost model disagreed" assertion to trigger when building 502.gcc_r from SPEC CPU 2017. Neither the select nor vp_merge recipes from the VPlan exist in the underlying instructions, so I thought it would make the most sense to fix this by adding the cost to the underlying phi instruction in getInstructionCost. It's worth noting that on RISC-V this vp_merge won't actually generate any instructions because the mask is all true, and will be folded away. So we should update the cost model at some point to reflect that.	2024-11-14 16:55:18 +09:00
Elvis Wang	b4d23cf685	[LV] Fix missing precomputeCosts() in emitInvalidCostRemarks(). (#114918 ) We should always update `SkipComputation`, which is set in `VPCostContext`, before the VPlan computes costs. This patch prevents the in-loop reduction assertion in `VPReductionRecipe::computeCost()` and other potential assertions in the partially implemented VPlan-based cost model.	2024-11-14 08:29:55 +08:00
Florian Hahn	98c4f4fce8	[LV] Remove IVEndValues, use resume value directly from fixed phi.(NFC) Use the IV resume/end values from the phis in the scalar header, instead of collecting them in a map. This removes some complexity from the code dealing with induction resume values. Analogous to 1edd22030 which did the same for reduction resume values.	2024-11-13 21:03:54 +00:00
Simon Pilgrim	0baa6a7272	[VectorCombine] foldShuffleOfShuffles - relax one-use of inner shuffles (#116062 ) Allow multi-use of either of the inner shuffles and account for that in the cost comparison.	2024-11-13 16:18:11 +00:00
Simon Pilgrim	1878b94568	[VectorCombine] isExtractExtractCheap - specify the extract/insert shuffle mask to improve shuffle costs (#114780 ) This shuffle mask is so focused, the cost model is very likely to be able to determine a specific (lower) cost	2024-11-13 12:31:39 +00:00
hanbeom	d942f5e13d	[VectorCombine] Combine extract/insert from vector into a shuffle (#115213 ) insert (DstVec, (extract SrcVec, ExtIdx), InsIdx) --> shuffle (DstVec, SrcVec, Mask) This commit combines extract/insert on a vector into Shuffle with vector.	2024-11-13 11:16:09 +00:00
Sushant Gokhale	9991ea28fc	[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible (#111479 ) In case of Neon, if there exists extractelement from lane != 0 such that 1. extractelement does not necessitate a move from vector_reg -> GPR 2. extractelement result feeds into fmul 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0 then the extractelement can be merged with fmul in the backend and it incurs no cost. e.g. ``` define double @foo(<2 x double> %a) { %1 = extractelement <2 x double> %a, i32 0 %2 = extractelement <2 x double> %a, i32 1 %res = fmul double %1, %2 ret double %res } ``` `%2` and `%res` can be merged in the backend to generate: `fmul d0, d0, v0.d[1]` The change was tested with SPEC FP(C/C++) on Neoverse-v2. Compile time impact: None Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.	2024-11-13 11:10:49 +05:30
Han-Kuan Chen	5a5502b9e1	[SLP] NFC. Use Value instead of template. (#115440 )	2024-11-13 11:58:19 +08:00
Sterling-Augustine	7ba864b592	[SandboxVectorizer] Register erase callback for seed collection (#115951 )	2024-11-12 16:03:27 -08:00
Alexey Bataev	058ac837bc	[SLP]Use generic createShuffle for buildvector Use generic createShuffle function, which know how to adjust the vectors correctly, to avoid compiler crash when trying to build a buildvector as a shuffle Fixes #115732	2024-11-11 10:49:39 -08:00
Kazu Hirata	2c0f463b25	[Vectorize] Simplify code with DenseMap::operator[] (NFC) (#115635 )	2024-11-10 07:24:47 -08:00
Florian Hahn	a5a1612deb	[VPlan] Consistently use DEBUG_TYPE loop-vectorize. This ensures debug messages in VPlan.cpp are included in the commonly used -debug-only=loop-vectorize.	2024-11-10 09:17:03 +00:00
Han-Kuan Chen	3cdd86bb47	[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#115417 )	2024-11-10 13:53:15 +08:00
Kazu Hirata	c236dbc343	[Vectorize] Simplify code with MapVector::operator[] (NFC) (#115592 )	2024-11-09 14:36:32 -08:00
Florian Hahn	ccb40b0b7a	[VPlan] Add insertOnEdge to VPBlockUtils (NFC). Add a new helper to insert a new VPBlockBase on an edge between 2 blocks. Suggested in https://github.com/llvm/llvm-project/pull/114292 and also useful for some existing code.	2024-11-09 21:19:39 +00:00
Florian Hahn	95eeae195e	[VPlan] Add PredIdx and SuccIdx arguments to connectBlocks (NFC). Add extra arguments to connectBlocks which allow selecting which existing predecessor/successor to update. This avoids having to disconnect blocks first unnecessarily. Suggested in https://github.com/llvm/llvm-project/pull/114292.	2024-11-09 17:18:40 +00:00
Simon Pilgrim	958e37cd1f	[VectorCombine] scalarizeBinopOrCmp - check for out of bounds element indices Fixes #115575	2024-11-09 16:00:03 +00:00
Noah Goldstein	8af5ae0648	[VPlan] Preserve IR flags when widening casts We have `nneg` for both `sext` and `uitofp`. Fixes #114856 Closes #115373	2024-11-08 17:21:05 -06:00
Alexey Bataev	26a9f3f590	[SLP][NFC]Cleanup getSameOpcode, return InstructionsState::invalid() for non-valid inputs Just a cleanup and related changes	2024-11-08 14:00:32 -08:00
Florian Hahn	8a7a7b5ffc	[VPlan] Remove unneeded code connecting blocks in VPBB:splitAt (NFC). insertBlockAfter already takes care of transferring successors. Remove unneeded code to transfer them manually.	2024-11-08 21:52:18 +00:00
Florian Hahn	144bdf3eb7	[VPlan] Also check if plan for best legacy VF contains simplifications. The plan for the VF chosen by the legacy cost model could also contain additional simplifications that cause cost differences. Also check if it contains simplifications. Fixes https://github.com/llvm/llvm-project/issues/114860.	2024-11-08 20:53:03 +00:00
vporpo	7dffc96a54	[SandboxVec][BottomUpVec] Clean up dead instructions (#115267 ) When scalars get replaced by vectors the original scalars may become dead. In that case erase them.	2024-11-08 12:50:53 -08:00
Kazu Hirata	bc7e5c2016	[SLP] Avoid repeated hash lookups (NFC) (#115428 )	2024-11-08 07:35:06 -08:00
Alexey Bataev	77bec78878	[SLP]Do not look for last instruction in schedule block for buildvectors If looking for the insertion point for the node and the node is a buildvector node, the compiler should not use scheduling info for such nodes, they may contain only partial info, which is not fully correct and may cause compiler crash. Fixes #114082	2024-11-08 06:55:29 -08:00
Alexey Bataev	62db1c8a07	[SLP]Better decision making on whether to try stores packs for vectorization Since the stores are sorted by distance, comparing the indices in the original array and exiting early if the index is less than the index of the last store is not always the best strategy. It is better to remove such stores explicitly, so the remaining stores can still be checked for a vectorization opportunity. Fixes #115008	2024-11-07 14:23:15 -08:00
Alexey Bataev	b7a8f5f4c9	[SLP][NFC]Exit early from attempt-to-reorder, if it is useless Adds early exits, which just save compile time. It can exit early if the total number of scalars is 2, or all scalars are constant, or the opcode is the same and not alternate. In this case reordering will not happen and the compiler can exit early to save compile time	2024-11-07 11:07:49 -08:00
Kazu Hirata	b02e5bc5b1	[Transforms] Remove unused includes (NFC) (#115263 ) Identified with misc-include-cleaner.	2024-11-07 10:58:58 -08:00
Kazu Hirata	22b4b1ab10	Revert "[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#114946 )" This reverts commit f58757b8dc167809b69ec00f9b5ab59281df0902. Failing buildbots: https://lab.llvm.org/buildbot/#/builders/174/builds/8058 https://lab.llvm.org/buildbot/#/builders/127/builds/1357	2024-11-07 10:43:11 -08:00
Han-Kuan Chen	f58757b8dc	[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#114946 )	2024-11-08 00:52:59 +08:00
Simon Pilgrim	4f24d0355a	[VectorCombine] Use explicit ExtractElementInst getVectorOperand/getIndexOperand accessors. NFC.	2024-11-07 13:53:25 +00:00
Simon Pilgrim	490e58a98e	Fix MSVC "not all control paths return a value" warning. NFC	2024-11-07 10:22:05 +00:00
vporpo	f7ef7b2ff7	[SandboxVec][Scheduler] Implement rescheduling (#115220 ) This patch adds support for re-scheduling already scheduled instructions. For now this will clear and rebuild the DAG, and will reschedule the code using the new DAG.	2024-11-06 20:59:49 -08:00
Han-Kuan Chen	c6091cdbed	[SLP][REVEC] Make shufflevector can be vectorized with ReorderIndices and ReuseShuffleIndices. (#114965 )	2024-11-07 11:04:34 +08:00
vporpo	5942a99f8b	[SandboxVec] Notify scheduler about new instructions (#115102 ) This patch registers the "createInstr" callback that notifies the scheduler about newly created instructions. This guarantees that all newly created instructions have a corresponding DAG node associated with them. Without this the pass crashes when the scheduler encounters the newly created vector instructions. This patch also changes the lifetime of the sandboxir Ctx variable in the SandboxVectorizer pass. It needs to be destroyed after the passes get destroyed. Without this change when components like the Scheduler get destroyed Ctx will have already been freed, which is not legal.	2024-11-06 13:26:14 -08:00
Alexey Bataev	9f3b6adb15	[SLP][NFC]Exit early if the graph is empty, NFC No need to check anything if the graph is empty, just exit early.	2024-11-06 08:33:14 -08:00
Alexey Bataev	76422385c3	[SLP]Support reordered buildvector nodes for better clustering Patch adds reordering of the buildvector nodes for better clustering of the compatible operations and future vectorization. Includes basic cost estimation and if the transformation is not profitable - reverts it.
AVX512, -O3+LTO
Metric: size..text
Program  results  results0  diff
test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  74565.00  75701.00  1.5%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  75773.00  76397.00  0.8%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  75773.00  76397.00  0.8%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2014462.00  2024494.00  0.5%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  395219.00  396979.00  0.4%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  857795.00  859667.00  0.2%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  800472.00  802440.00  0.2%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  590699.00  591403.00  0.1%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  203006.00  203102.00  0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test  42408.00  42424.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  12451575.00  12451927.00  0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396480.00  1396448.00  -0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396480.00  1396448.00  -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1047708.00  1047580.00  -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  111344.00  111328.00  -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087660.00  1087500.00  -0.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  280664.00  280616.00  -0.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  502646.00  502006.00  -0.1%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1033135.00  1031567.00  -0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2070917.00  2065845.00  -0.2%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2070917.00  2065845.00  -0.2%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test  33893.00  33797.00  -0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  39677.00  39549.00  -0.3%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  39674.00  39546.00  -0.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test  11560.00  11512.00  -0.4%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  653867.00  649275.00  -0.7%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  653867.00  649275.00  -0.7%
CINT2006/401.bzip2 - extra code vectorized
CINT2017rate/541.leela_r CINT2017speed/641.leela_s - function _ZN9FastBoard25get_pattern3_augment_specEiib not inlined anymore, better vectorization
CFP2017rate/510.parest_r - better vectorization
JM/ldecod - better vectorization
JM/lencod - same
CINT2006/464.h264ref - extra code vectorized
CFP2006/447.dealII - extra vector code
MiBench/consumer-lame - vectorized 2 loops previously scalar
DOE-ProxyApps-C/miniGMG - small changes
Benchmarks/7zip - extra code vectorized, better vectorization
CFP2017rate/526.blender_r - extra vectorization
CFP2017speed/638.imagick_s CFP2017rate/538.imagick_r - extra vectorization
MiBench/consumer-jpeg - extra vectorization
CINT2006/400.perlbench - extra vectorization
Prolangs-C/TimberWolfMC - small variations
Applications/sqlite3 - extra function vectorized and inlined
Benchmarks/tramp3d-v4 - extra code vectorized
CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - extra code vectorized, function digcpy gets vectorized and inlined
CINT2006/473.astar - extra code vectorized
MiBench/telecomm-gsm - extra code vectorized, better vector code
mediabench/gsm - same
MiBench/security-blowfish - extra code vectorized
CINT2017speed/625.x264_s CINT2017rate/525.x264_r - sub4x4_dct function vectorized and gets inlined
RISCV-V, SiFive-p670, O3+LTO
CFP2017rate/510.parest_r - extra vectorization
CFP2017rate/526.blender_r - extra vectorization
MiBench/consumer-lame - extra vectorized code
Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/114284	2024-11-06 10:51:15 -05:00
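
Note on d942f5e13d ([VectorCombine] Combine extract/insert from vector into a shuffle): the rewrite `insert (DstVec, (extract SrcVec, ExtIdx), InsIdx) --> shuffle (DstVec, SrcVec, Mask)` can be illustrated with a small model. This is only a sketch, not the pass's code: it assumes both vectors have the same length and models `shufflevector` as indexing into the concatenation of its two operands. The helper names `insert_extract_mask` and `shuffle` are hypothetical and do not appear in LLVM.

```python
def insert_extract_mask(num_elts, ext_idx, ins_idx):
    """Build a shuffle mask equivalent to inserting SrcVec[ext_idx] into DstVec[ins_idx].

    Assumption: DstVec and SrcVec both have num_elts lanes. Mask values
    0..num_elts-1 select from the first operand (DstVec); values
    num_elts..2*num_elts-1 select from the second operand (SrcVec).
    """
    mask = list(range(num_elts))        # identity: keep every DstVec lane
    mask[ins_idx] = num_elts + ext_idx  # except the inserted lane, taken from SrcVec
    return mask


def shuffle(first, second, mask):
    """Model of a two-operand shuffle: index into the concatenated operands."""
    concat = list(first) + list(second)
    return [concat[i] for i in mask]


dst = [10, 11, 12, 13]
src = [20, 21, 22, 23]

# Reference result: extract lane 2 of src, insert it into lane 1 of dst.
expected = list(dst)
expected[1] = src[2]

mask = insert_extract_mask(len(dst), ext_idx=2, ins_idx=1)
result = shuffle(dst, src, mask)
```

Here `mask` is `[0, 6, 2, 3]` and `result` equals `expected`, i.e. `[10, 22, 12, 13]`. Whether the pass actually performs the rewrite depends on its cost comparison, which this sketch does not model.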

5182 Commits