llvm-project

Author	SHA1	Message	Date
Alexey Bataev	cac60940b7	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-03 08:06:22 -07:00
Fangrui Song	df0f30dc36	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit 9980c9971892378ea82475e000de8df210a58e69. Caused assertion failures: https://reviews.llvm.org/D115462#3555350	2022-06-03 00:30:34 -07:00
Alexey Bataev	9980c99718	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-02 11:18:14 -07:00
Alexey Bataev	73020b4540	Revert "[SLP]Improve shuffles cost estimation where possible." This reverts commit fd5a6ce9dcb77b7821c95355d73af0b3b2020647 to fix a crash detected by a buildbot https://lab.llvm.org/buildbot/#/builders/179/builds/3805/steps/11/logs/stdio.	2022-06-01 15:44:51 -07:00
Alexey Bataev	fd5a6ce9dc	[SLP]Improve shuffles cost estimation where possible. Improved/fixed cost modeling for shuffles by providing masks, improved cost model for non-identity insertelements. Differential Revision: https://reviews.llvm.org/D115462	2022-06-01 11:01:37 -07:00
Alexey Bataev	120d52b0ef	[SLP]Fix PR55653: emit undefs where required, not poison. Need to handle a corner case correctly, if all elements are Undefs/Poisons, need to emit actual values, not just poisons. Differential Revision: https://reviews.llvm.org/D126298	2022-05-26 08:38:50 -07:00
Alexey Bataev	10f41a2147	[SLP]Fix PR55688: Miscompile due to incorrect nuw/nsw handling. Need to use all ReductionOps when propagating flags for the reduction ops, otherwise transformation is not correct. Plus, need to drop nuw/nsw flags. Differential Revision: https://reviews.llvm.org/D126371	2022-05-25 13:59:06 -07:00
Alexey Bataev	2ac5ebedea	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it allowed to reduce number of re-vectorization attempts (where we could try to vectorize buildector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-23 07:06:45 -07:00
Florian Hahn	aeb19817d6	Revert "[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly." This reverts commit fc9c59c355cb255446e571b4515b5e41a76503c4. The patch triggers an assertion when building SPEC on X86. Reduced reproducer shared at D107966. Also reverts follow-up commit 11a09af76d11ad5a9f1f95b561112af17ff81f80.	2022-05-21 21:00:01 +01:00
Alexey Bataev	fc9c59c355	[SLP]Do not emit extract elements for insertelements users, replace with shuffles directly. SLP vectorizer emits extracts for externally used vectorized scalars and estimates the cost for each such extract. But in many cases these scalars are input for insertelement instructions, forming buildvector, and instead of extractelement/insertelement pair we can emit/cost estimate shuffle(s) cost and generate series of shuffles, which can be further optimized. Tested using test-suite (+SPEC2017), the tests passed, SLP was able to generate/vectorize more instructions in many cases and it allowed to reduce number of re-vectorization attempts (where we could try to vectorize buildector insertelements again and again). Differential Revision: https://reviews.llvm.org/D107966	2022-05-20 05:58:09 -07:00
David Green	802e15c576	[SLP] Cluster ordering for loads Given a load without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded in fewer loads, meaning less overall shuffling (for example loading v4i8 clusters of a v16i8 as a single f32 loads, as opposed to multiple independent bytes loads and inserts). Differential Revision: https://reviews.llvm.org/D122145	2022-05-07 14:38:11 +01:00
David Green	2db46db54d	[SLP] Add tests for awkward load orders from SLP. NFC	2022-05-07 10:27:32 +01:00
David Green	6f81903e89	[LV][SLP] Mark fptosi_sat as vectorizable This adds fptosi_sat and fptoui_sat to the list of trivially vectorizable functions, mainly so that the loop vectorizer can vectorize the instruction. Marking them as trivially vectorizable also allows them to be SLP vectorized, and Scalarized. The signature of a fptosi_sat requires two type overrides (@llvm.fptosi.sat.v2i32.v2f32), unlike other intrinsics that often only take a single. This patch alters hasVectorInstrinsicOverloadedScalarOpd to isVectorIntrinsicWithOverloadTypeAtArg, so that it can mark the first operand of the intrinsic as a overloaded (but not scalar) operand. Differential Revision: https://reviews.llvm.org/D124358	2022-05-03 09:32:34 +01:00
Alexey Bataev	7ea03f0b4e	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with to many/less uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), to many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which highly likely will bed vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-05-02 12:03:58 -07:00
David Green	c7d39fd61a	[LV][SLP] Add tests for vectorizing fptoi_sat intrinsics. NFC	2022-05-02 15:11:44 +01:00
Valery N Dmitriev	88b9e46fb5	[SLP] Steer for the best chance in tryToVectorize() when rooting with binary ops. tryToVectorize() method implements one of searching paths for vectorizable tree roots in SLP vectorizer, specifically for binary and comparison operations. Order of making probes for various scalar pairs was defined by its implementation: the instruction operands, then climb over one operand if the instruction is its sole user and then perform same actions for another operand if previous attempts failed. Problem with this approach is that among these options we can have more than a single vectorizable tree candidate and it is not necessarily the one that encountered first. Trying to build vectorizable tree for each possible combination for just evaluation is expensive. But we already have lookahead heuristics mechanism which we use for finding best pick among operands of commutative instructions. It calculates cumulative score for candidates in two consecutive lanes. This patch introduces use of the heuristics for choosing the best pair among several combinations. We only try one that looks as most promising for vectorization. Additional benefit is that we reduce total number of vectorization trees built for probes because we skip those looking non-profitable early. Reviewed By: Alexey Bataev (ABataev), Vasileios Porpodas (vporpo) Differential Revision: https://reviews.llvm.org/D124309	2022-04-25 12:25:33 -07:00
Vasileios Porpodas	4e971efad4	Recommit "[SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64" This reverts commit 7052a0ad689b990265ec79bd2b0a7d6e8c131bfe.	2022-04-22 15:44:02 -07:00
Vasileios Porpodas	7052a0ad68	Revert "[SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64" This reverts commit 7ba702644bac6df166a02bbd692c1599a95a7c8b.	2022-04-22 08:24:04 -07:00
Vasileios Porpodas	7ba702644b	[SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64 The original patch (https://reviews.llvm.org/D121354) targets x86 and adjusts the lookahead score of splat loads as they can be done by the `movddup` instruction that combines the load and the broadcast and is cheap to execute. A similar issue shows up on AArch64. The `ld1r` instruction performs a broadcast load and is cheap to execute. This patch implements the TargetTransformInfo hooks for AArch64. Differential Revision: https://reviews.llvm.org/D123638	2022-04-22 07:29:58 -07:00
Vasileios Porpodas	ad12f468a3	[SLP][AArch64][NFC] Add test for a follow-up patch that fixes the lookahead cost of splat-loads for AArch64	2022-04-22 05:29:34 -07:00
Alexey Bataev	883571928c	Revert "[SLP]Improve reductions analysis and emission, part 1." This reverts commit 0e1f4d4d3cb08ff84df5adc4f5e41d0a2cebc53d to fix a crash reported in PR54976	2022-04-19 06:17:03 -07:00
Alban Bridonneau	8daffd1dfb	Fix SLP score for out of order contiguous loads SLP uses the distance between pointers to optimize the getShallowScore. However the current code misses the case where we are trying to vectorize for VF=4, and the distance between pointers is 2. In that case the returned score reflects the case of contiguous loads, when it's not actually contiguous. The attached unit tests have 5 loads, where the program order is not the same as the offset order in the GEPs. So, the choice of which 4 loads to bundle together matters. If we pick the first 4, then we can vectorize with VF=4. If we pick the last 4, then we can only vectorize with VF=2. This patch makes a more conservative choice, to consider all distances>1 to not be a case of contiguous load, and give those cases a lower score. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D123516	2022-04-19 11:58:01 +01:00
Alexey Bataev	0e1f4d4d3c	[SLP]Improve reductions analysis and emission, part 1. Currently SLP vectorizer walks through the instructions and selects 3 main classes of values: 1) reduction operations - instructions with same reduction opcode (add, mul, min/max, etc.), which build the reduction, 2) reduced values - instructions with the same opcodes, but different from the reduction opcode, 3) extra arguments - all other values, instructions from the different basic block rather than the root node, instructions with to many/less uses. This scheme is not very efficient. It excludes some instructions and all non-instruction values from the reductions (constants, proficient gathers), to many possibly reduced values are marked as extra arguments. Patch improves this process by introducing a bit extended analysis stage. During this stage, we still try to select 3 classes of the values: 1) reduction operations - same as before, 2) possibly reduced values - all instructions from the current block/non-instructions, which may build a vectorization tree, 3) extra arguments - instructions from the different basic blocks. Additionally, an extra sorting of the possibly reduced values occurs to build the scalar sequences which highly likely will bed vectorized, e.g. loads are grouped by the distance between them, constants are grouped together, cmp instructions are sorted by their compare types and predicates, extractelement instructions are sorted by the vector operand, etc. Also, these groups are reordered by their length so the longest group is the first in the list of the possibly reduced values. The vectorization process tries to emit the reductions for all these groups. These reductions, remaining non-vectorized possible reduced values and extra arguments are then combined into the final expression just like it was before. Differential Revision: https://reviews.llvm.org/D114171	2022-04-12 17:46:11 -07:00
Philip Reames	7d6e8f2a96	[slp] Delete dead scalar instructions feeding vectorized instructions If we vectorize a e.g. store, we leave around a bunch of getelementptrs for the individual scalar stores which we removed. We can go ahead and delete them as well. This is purely for test output quality and readability. It should have no effect in any sane pipeline. Differential Revision: https://reviews.llvm.org/D122493	2022-03-28 20:10:13 -07:00
Philip Reames	48cc9287f5	Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" (try 3) The original commit exposed several missing dependencies (e.g. latent bugs in SLP scheduling). Most of these were fixed over the weekend and have had several days to bake. The last was fixed this morning after being noticed in manual review of test changes yesterday. See the review thread for links to each change. Original commit message follows: SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users. This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example: Before this patch: 704357 SLP - Number of calcDeps actions 699021 SLP - Number of schedule calls 5598 SLP - Number of ReSchedule actions 59 SLP - Number of ReScheduleOnFail actions 10084 SLP - Number of schedule resets 8523 SLP - Number of vector instructions generated After this patch: 102895 SLP - Number of calcDeps actions 161916 SLP - Number of schedule calls 5637 SLP - Number of ReSchedule actions 55 SLP - Number of ReScheduleOnFail actions 10083 SLP - Number of schedule resets 8403 SLP - Number of vector instructions generated I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore. The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass. For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch. Differential Revision: https://reviews.llvm.org/D118538	2022-03-25 10:39:23 -07:00
Philip Reames	3abf8ebd9a	[slp][tests] Add missing function attributes SLP is currently assuming that control dependence in these cases is irrelevant. This is only valid if none of the lib-funcs involved can throw or infinite loop in the scalar forms. This appears to be true (or at least we infer the respective attributes) for the libfuncs I spot checked. This change is mostly for shrinking the diff on an upcoming patch.	2022-03-18 15:51:42 -07:00
Alexey Bataev	d65cc85977	[SLP]Do not schedule instructions with constants/argument/phi operands and external users. No need to schedule entry nodes where all instructions are not memory read/write instructions and their operands are either constants, or arguments, or phis, or instructions from others blocks, or their users are phis or from the other blocks. The resulting vector instructions can be placed at the beginning of the basic block without scheduling (if operands does not need to be scheduled) or at the end of the block (if users are outside of the block). It may save some compile time and scheduling resources. Differential Revision: https://reviews.llvm.org/D121121	2022-03-17 11:03:45 -07:00
Alexey Bataev	150ea76543	Revert "[SLP]Do not schedule instructions with constants/argument/phi operands and external users." This reverts commit 1eeb2bfe727323332800e8d390f2f8c63c953779 to fix a bug reported in https://reviews.llvm.org/D121121	2022-03-16 13:54:59 -07:00
Alexey Bataev	1eeb2bfe72	[SLP]Do not schedule instructions with constants/argument/phi operands and external users. No need to schedule entry nodes where all instructions are not memory read/write instructions and their operands are either constants, or arguments, or phis, or instructions from others blocks, or their users are phis or from the other blocks. The resulting vector instructions can be placed at the beginning of the basic block without scheduling (if operands does not need to be scheduled) or at the end of the block (if users are outside of the block). It may save some compile time and scheduling resources. Differential Revision: https://reviews.llvm.org/D121121	2022-03-16 06:05:43 -07:00
David Green	8bef17ed59	[AArch64][SLP] Add a test with mutual reductions. NFC	2022-03-09 21:46:57 +00:00
Philip Reames	deae979a2c	Revert "Reapply "[SLP] Schedule only sub-graph of vectorizable instructions""" This reverts commit 738042711bc08cde9135873200b1d088e6cf11c3. A second, apparently separate, issue has been reported on the original review.	2022-03-03 11:35:34 -08:00
David Green	65c0e45a37	[AArch64] Vector shifts cost 1 The costs of vector shifts was 2 as opposed to 1, as the nodes are marked custom. Fix this like the others and mark the nodes as cheap. Differential Revision: https://reviews.llvm.org/D120773	2022-03-03 10:42:57 +00:00
Philip Reames	738042711b	Reapply "[SLP] Schedule only sub-graph of vectorizable instructions"" Root issue which triggered the revert was fixed in 689bab. No changes in the reapplied patch. Original commit message follows: SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users. This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example: Before this patch: 704357 SLP - Number of calcDeps actions 699021 SLP - Number of schedule calls 5598 SLP - Number of ReSchedule actions 59 SLP - Number of ReScheduleOnFail actions 10084 SLP - Number of schedule resets 8523 SLP - Number of vector instructions generated After this patch: 102895 SLP - Number of calcDeps actions 161916 SLP - Number of schedule calls 5637 SLP - Number of ReSchedule actions 55 SLP - Number of ReScheduleOnFail actions 10083 SLP - Number of schedule resets 8403 SLP - Number of vector instructions generated I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore. The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass. For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch. Differential Revision: https://reviews.llvm.org/D118538	2022-03-02 10:47:20 -08:00
Arthur Eubanks	9c6250ee41	Revert "[SLP] Schedule only sub-graph of vectorizable instructions" This reverts commit 0539a26d91a1b7c74022fa9cf33bd7faca87544d. Causes a miscompile, see comments on D118538. Required updating bottom-to-top-reorder.ll.	2022-03-01 17:31:16 -08:00
Philip Reames	9392c0d4ef	Revert "[SLP] Remove cap on schedule window size" This reverts commit 6adf4b039e095224edbbecda5972e5e3353b53b6. Reverting while investigating https://github.com/llvm/llvm-project/issues/54029	2022-02-23 13:12:07 -08:00
Philip Reames	6adf4b039e	[SLP] Remove cap on schedule window size This cap was first added in 848c1aa45 (back in 2015). Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks. The algorithmic problem in scheduling has now been fixed in 0539a26d. In the meantime, the code has rotted fairly badly. Some intermediate refactoring caused the size to only be incremented if both iterators advance in the window search. This causes the size to be badly undercounted when near one end of a basic block. We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthogonal to the purpose of the test file. Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue. I tested locally with a large example, but even at its worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix. It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message. If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.	2022-02-23 08:27:45 -08:00
Philip Reames	0539a26d91	[SLP] Schedule only sub-graph of vectorizable instructions SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users. This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example: Before this patch: 704357 SLP - Number of calcDeps actions 699021 SLP - Number of schedule calls 5598 SLP - Number of ReSchedule actions 59 SLP - Number of ReScheduleOnFail actions 10084 SLP - Number of schedule resets 8523 SLP - Number of vector instructions generated After this patch: 102895 SLP - Number of calcDeps actions 161916 SLP - Number of schedule calls 5637 SLP - Number of ReSchedule actions 55 SLP - Number of ReScheduleOnFail actions 10083 SLP - Number of schedule resets 8403 SLP - Number of vector instructions generated I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore. The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass. For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch. Differential Revision: https://reviews.llvm.org/D118538	2022-02-22 10:15:55 -08:00
Alexey Bataev	802ceb8343	[SLP]Excluded external uses from the reordering estimation. Compiler adds the estimation for the external uses during operands reordering analysis, which makes it tend to prefer duplicates in the lanes rather than diamond/shuffled match in the graph. It changes the sizes of the vector operands and may prevent some vectorization. We don't need this kind of estimation for the analysis phase, because we just need to choose the most compatible instruction and it does not matter if it has external user or used in the non-matching lane. Instead, we count the number of unique instruction in the lane and see if the reassociation changes the number of unique scalars to be power of 2 or not. If we have power of 2 unique scalars in the lane, it is considered more profitable rather than having non-power-of-2 number of unique scalars. Metric: SLP.NumVectorInstructions test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test 70.00 86.00 22.9% test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 346.00 353.00 2.0% test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 346.00 353.00 2.0% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 235.00 239.00 1.7% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 235.00 239.00 1.7% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00 1.3% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00 1.2% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00 1.1% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00 1.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00 0.9% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00 0.3% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00 0.3% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00 0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00 0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 783.00 782.00 -0.1% test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 69.00 68.00 -1.4% test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 207.00 192.00 -7.2% test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 207.00 192.00 -7.2% test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 89.00 80.00 -10.1% test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 89.00 80.00 -10.1% test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test 260.00 215.00 -17.3% test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 256.00 211.00 -17.6% MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty the same. SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by one <4 x> load. External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and not inlined anymore. External/SPEC/CINT2017rate/541.leela_r - same. External/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in multi-block tree, the result is pretty the same. External/SPEC/CINT2017speed/631.deepsjeng_s - same. MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same as before. MultiSource/Benchmarks/MiBench/consumer-jpeg - same. Differential Revision: https://reviews.llvm.org/D116688	2022-02-03 06:50:06 -08:00
Philip Reames	15f7857412	[tests] Refresh autogen tests for SLP	2022-01-24 17:05:58 -08:00
Alexey Bataev	d130df544d	[SLP]Improve reordering for the nodes beeing used in alternate vectorization. No need to include the order of the scalars beeing used as part of the alternate vectorization into account when trying to reorder the whole graph. Such elements better to reorder in the following phase because the subtree still ends up in shuffle. Part of D116688, fixes the regression in D116690. Differential Revision: https://reviews.llvm.org/D116740	2022-01-06 11:18:57 -08:00
Alexey Bataev	7cb19fe493	[SLP]Initialize the lane with the given value instead of default 0. There is a bug in the reordering analysis stage. If the element with the given hash is not added to the map but has the same number of APOs and instructions with same parent, but different instruction opcode, it will be initialized with default values and then the counter is increased by 1. But the lane is not updated and defaults to 0 instead of the actual `Lane` value. It leads to the fact that the analysis is useless in many cases and defaults to lane 0 instead of actual lane with the minimum amount of APO operands. Differential Revision: https://reviews.llvm.org/D116690	2022-01-06 10:57:11 -08:00
Alexey Bataev	bd05376986	[SLP]Improve multinode analysis. Changes the preliminary multinode analysis: 1. Introduced scores for reversed loads/extractelements. 2. Improved shallow score calculation. 3. Lowered the cost of external uses (no need to consider it several times, just ones). 4. The initial lane for analysis is the one with the minimal possible reorderings. These changes in general shall reduce compile time and improve the reordering in many cases. Part of D57059. Differential Revision: https://reviews.llvm.org/D101109	2021-12-14 06:01:52 -08:00
Philip Reames	e6ad9ef4e7	[instcombine] Canonicalize constant index type to i64 for extractelement/insertelement The basic idea to this is that a) having a single canonical type makes CSE easier, and b) many of our transforms are inconsistent about which types we end up with based on visit order. I'm restricting this to constants as for non-constants, we'd have to decide whether the simplicity was worth extra instructions. For constants, there are no extra instructions. We chose the canonical type as i64 arbitrarily. We might consider changing this to something else in the future if we have cause. Differential Revision: https://reviews.llvm.org/D115387	2021-12-13 16:56:22 -08:00
Alexey Bataev	fc0aacf324	[SLP]Improve analysis/emission of vector operands for alternate nodes. Compiler has an analysis for perfect diamond matching but it does not support nodes with main/alternate opcodes. The problem is that the scalars themselves are different and might not match directly with other nodes, but operands and main/alternate opcodes might match and compiler might reuse some previously emitted vector instructions. Need to include this analysis in the cost model and actual vector instructions emission process. Differential Revision: https://reviews.llvm.org/D114101	2021-11-26 06:38:02 -08:00
Alexey Bataev	4675a1654c	Revert "[SLP]Improve analysis/emission of vector operands for alternate nodes." This reverts commit 496254cf802a21e1967b61dec48017b8ec831574 to fix compiler crashes reported in D114101#3152982.	2021-11-25 05:19:49 -08:00
Alexey Bataev	496254cf80	[SLP]Improve analysis/emission of vector operands for alternate nodes. Compiler has an analysis for perfect diamond matching but it does not support nodes with main/alternate opcodes. The problem is that the scalars themselves are different and might not match directly with other nodes, but operands and main/alternate opcodes might match and compiler might reuse some previously emitted vector instructions. Need to include this analysis in the cost model and actual vector instructions emission process. Differential Revision: https://reviews.llvm.org/D114101	2021-11-24 12:55:24 -08:00
Alexey Bataev	900cc1a226	[SLP]Improve cost of the gather nodes. No need to count the final shuffle cost for the constants, gathering of the constants is just a constant vector + extra inserts, if required. Differential Revision: https://reviews.llvm.org/D113770	2021-11-16 06:25:07 -08:00
Alexey Bataev	352c46e707	[SLP]Improve vectorization of split loads. Need to fix the cost estimation for split loads, since we look at the subregs already, no need to permute them, need just to estimate subregister insert, if it is smaller than the real register. Also, using split loads, it might be profitable already to vectorize smaller trees with gathering of the loads. Differential Revision: https://reviews.llvm.org/D107188	2021-11-12 06:13:22 -08:00
Alexey Bataev	07ef9f513f	[SLP]Improve/fix reordering of the gathered graph nodes. Gathered loads/extractelements/extractvalue instructions should be checked if they can represent a vector reordering node too and their order should be taken into account for better graph reordering analysis. Also, if the gather node has reused scalars, they must be reordered instead of the scalars themselves. Differential Revision: https://reviews.llvm.org/D112454	2021-10-28 05:45:09 -07:00
Alexey Bataev	f06e332982	Revert "[SLP]Improve/fix reordering of the gathered graph nodes." This reverts commit 64d1617d18cb8b6f9511d0eda481fc5a5d0ebddf to fix test non-stability.	2021-10-27 11:16:58 -07:00

241 Commits