llvm-project

Author	SHA1	Message	Date
LiqinWeng	98e5962b7c	[RISCV][CostModel] Add cost for fabs/fsqrt of type bf16/f16 (#118608 )	2025-01-10 17:22:51 +08:00
Shao-Ce SUN	369c61744a	[RISCV] Fix the cost of `llvm.vector.reduce.and` (#119160 ) I added some CodeGen test cases related to reduce. To maintain consistency, I also added cases for instructions like `vector.reduce.or`. For cases where `v1i1` type generates `VFIRST`, please refer to: https://reviews.llvm.org/D139512.	2025-01-10 10:10:42 +08:00
David Green	f22441c14d	[AArch64] Add sve div and rem tests. NFC	2025-01-09 08:46:13 +00:00
David Green	89483403c3	[AArch64] Add additional div and rem test coverage. NFC	2025-01-08 21:37:11 +00:00
David Green	a8dab1aa03	[AArch64] Add a subvector extract cost. (#121472 ) These can generally be emitted using an ext instruction or mov from the high half. The half half extracts can be free depending on the users, but that is not handled here, just the basic costs. It originally included all subvector extracts, but that was toned-down to just half-vector extracts to try and help the mid end not breakup high/low extracts without having the SLP vectorizer create a mess using other shuffles.	2025-01-08 08:13:07 +00:00
Simon Pilgrim	a5e129ccde	[CostModel][X86] getVectorInstrCost - correctly cost v4f32 insertelement into index 0 This is just the MOVSS instruction (SSE41 INSERTPS is still necessary for index != 0) This exposed an issue in VectorCombine::foldInsExtFNeg - we need to use the more general SK_PermuteTwoSrc shuffle kind to allow getShuffleCost to match other shuffle kinds (not just SK_Select).	2025-01-07 12:23:45 +00:00
Simon Pilgrim	5a7dfb4659	[CostModel][X86] Attempt to match v4f32 shuffles that map to MOVSS/INSERTPS instruction improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index0.	2025-01-07 11:31:44 +00:00
David Green	2db7b314da	[AArch64] Add BF16 fpext and fptrunc costs. (#119524 ) This expands the recently added fp16 fpext and fpround costs to bf16. Some of the costs are taken from the rough number of instructions needed, some are a little aspirational. https://godbolt.org/z/bGEEd1vsW	2025-01-07 09:39:08 +00:00
Simon Pilgrim	db88071a8b	[CostModel][X86] Attempt to match cheap v4f32 shuffles that map to SHUFPS instruction (#121778 ) Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.	2025-01-06 17:54:36 +00:00
Simon Pilgrim	7cdbde70fa	[CostModel][X86] getShuffleCost - use processShuffleMasks for all shuffle kinds to legal types (#120599 ) (#121760 ) Now that processShuffleMasks can correctly handle 2 src shuffles, we can completely remove the shuffle kind limits and correctly recognize the number of active subvectors per legalized shuffle - improveShuffleKindFromMask will determine the shuffle kind for each split subvector.	2025-01-06 13:32:55 +00:00
ShihPo Hung	366e836051	[RISCV][NFC] precommit test for fcmp with f16	2025-01-03 01:45:27 -08:00
David Green	3ddc9f06ae	[AArch64] Additional shuffle subvector-extract cost tests. NFC A Phase Ordering test for intrinsic shuffles is also added, showing a recent regression from vector combining.	2025-01-02 10:13:51 +00:00
David Green	641a786bed	[AArch64] Add codegen shuffle-select test. NFC This splits the shuffle-select CostModel test into a seperate CodeGen test and removes the codegen from the CostModel version. An extra fp16 test is added too.	2025-01-02 08:05:44 +00:00
Alexey Bataev	07d284d4eb	[SLP]Add cost estimation for gather node reshuffling Adds cost estimation for the variants of the permutations of the scalar values, used in gather nodes. Currently, SLP just unconditionally emits shuffles for the reused buildvectors, but in some cases better to leave them as buildvectors rather than shuffles, if the cost of such buildvectors is better. X86, AVX512, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 912998.00 913238.00 0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 203070.00 203102.00 0.0% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1396320.00 1396448.00 0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1396320.00 1396448.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309790.00 309678.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12477607.00 12470807.00 -0.1% CINT2006/445.gobmk - extra code vectorized MiBench/consumer-lame - small variations CFP2017speed/638.imagick_s CFP2017rate/538.imagick_r - extra vectorized code Benchmarks/Bullet - extra code vectorized CFP2017rate/526.blender_r - extra vector code RISC-V, sifive-p670, -O3+LTO CFP2006/433.milc - regressions, should be fixed by https://github.com/llvm/llvm-project/pull/115173 CFP2006/453.povray - extra vectorized code CFP2017rate/508.namd_r - better vector code CFP2017rate/510.parest_r - extra vectorized code SPEC/CFP2017rate - extra/better vector code CFP2017rate/526.blender_r - extra vectorized code CFP2017rate/538.imagick_r - extra vectorized code CINT2006/403.gcc - extra vectorized code CINT2006/445.gobmk - extra vectorized code CINT2006/464.h264ref - extra vectorized code CINT2006/483.xalancbmk - small variations CINT2017rate/525.x264_r - better vectorization Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/115201	2024-12-24 15:35:29 -05:00
Alexey Bataev	fbdf652d97	[RISCV][COST]Add several test with known vlen, NFC	2024-12-23 12:55:45 -08:00
Simon Pilgrim	611401c115	[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599 ) processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.	2024-12-20 10:39:45 +00:00
Simon Pilgrim	091448e3c1	Revert "[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types" (#120707 ) Reverts llvm/llvm-project#120599 - some recent tests are currently failing	2024-12-20 10:06:03 +00:00
Simon Pilgrim	81e63f9e0c	[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599 ) processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.	2024-12-20 09:55:11 +00:00
Simon Pilgrim	9bb1d0369c	[X86] getShuffleCost - when splitting shuffles, if a whole vector source is just copied we should treat this as free. (#120561 ) If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free. We already do something similar for broadcasts / whole subvector insertion + extraction - its purely an issue for register allocation.	2024-12-19 12:55:44 +00:00
David Sherwood	eaf482f012	[AArch64] Tweak truncate costs for some scalable vector types (#119542 ) == We were previously returning an invalid cost when truncating anything to <vscale x 2 x i1>, which is incorrect since we can generate perfectly good code for this. == The costs for truncating legal or unpacked types to predicates seemed overly optimistic. For example, when truncating <vscale x 8 x i16> to <vscale x 8 x i1> we typically do something like and z0.h, z0.h, #0x1 cmpne p0.h, p0/z, z0.h, #0 I guess it might depend upon whether the input value is generated in the same block or not and if we can avoid the inreg zero-extend. However, it feels safe to take the more conservative cost here. == The costs for some truncates such as trunc <vscale x 2 x i32> %a to <vscale x 2 x i16> were 1, whereas in actual fact they are free and no instructions are required. == Also, for this trunc <vscale x 8 x i32> %a to <vscale x 8 x i16> it's just a single uzp1 instruction so I reduced the cost to 1. In general, I've added costs for all cases where the destination type is legal or unpacked. One unfortunate side effect of this is the costs for some fixed-width truncates when using SVE now look too optimistic.	2024-12-19 10:07:41 +00:00
Simon Pilgrim	89e530a27c	[CostModel[X86] Update shuffle non-pow-2 tests to not analyse shuffle(undef,undef) Avoid shuffle patterns that can be folded away.	2024-12-16 18:32:13 +00:00
Fangrui Song	cd12922235	[test] Change llc -march= to -mtriple= Similar to 806761a7629df268c8aed49657aeccffa6bca449 -march= is error-prone when running on a host whose OS is different.	2024-12-15 13:08:02 -08:00
Elvis Wang	a674209432	[RISCV][TTI] Model the cost of insert/extractelt when the vector split into multiple register group and idx exceed single group. (#118401 ) This patch implements the cost when the size of the vector need to split into multiple groups and the index exceed single vector group. For extract element, we need to store split vectors to stack and load the target element. For insert element, we need to store split vectors to stack and store the target element and load vectors back. After this patch, the cost of insert/extract element will close to the generated assembly.	2024-12-12 09:26:19 +08:00
Ricardo Jesus	2fe30bc669	[AArch64] Add cost model for @experimental.vector.match (#118512 ) The base cost approximates the expansion code in SelectionDAGBuilder. For the AArch64 cases that don't need generic expansion, fixed-length search vectors have a higher cost than scalable vectors due to the extra instructions to convert the boolean mask.	2024-12-11 07:51:11 +00:00
David Green	2f18b5ef03	[AArch64] Add fpext and fpround costs (#119292 ) This adds some basic costs for fpext and fpround, many of which were already handled by the generic costing routines but this does make some adjustments for larger vector types that can use fcvtn+fcvtn2, as opposed to fcvtn+fcvtn+concat. These should now more closely match the codegen from https://godbolt.org/z/r3P9Mf8ez, for example.	2024-12-11 06:26:41 +00:00
Simon Pilgrim	4f933277a5	[CostModel][X86] Improve cost estimation of insert_subvector shuffle patterns of legalized types (#119363 ) In cases where the base/sub vector type in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free as the legalized vectors will not overlap. Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32). Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.	2024-12-10 16:28:56 +00:00
Shao-Ce SUN	95b6524e5c	[NFC] [RISCV] Add tests for `llvm.vector.reduce.*`	2024-12-10 10:20:21 +08:00
David Green	ca884009e4	[AArch64] Add test coverage of fp16 and bf16 fptrunc and fpext. NFC Some of the scalable tests have been split off to make the tests more managable. AArch64TTIImpl::getCastInstrCost is also formatted to avoid the need to fight against CI.	2024-12-09 23:41:18 +00:00
Lei Huang	a13ec9cd54	[PowerPC] Update data layout aligment of i128 to 16 (#118004 ) Fix 64-bit PowerPC part of https://github.com/llvm/llvm-project/issues/102783.	2024-12-09 18:02:24 -05:00
Alexey Bataev	b9aa155d26	[TTI][X86]Fix detection of the shuffles from the second shuffle operand only If the shuffle mask uses only indices from the second shuffle operand, processShuffleMasks function misses it currently, which prevents correct cost estimation in this corner case. To fix this, need to raise the limit to 2 * VF rather than just VF and adjust processing correspondingly. Will allow future improvements for 2 sources permutations. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/118972	2024-12-06 12:27:00 -05:00
LiqinWeng	46829e5430	[RISCV][CostModel] Correct the cost of some reductions (#118072 ) Reductions include: and/or/max/min	2024-12-04 17:26:54 +08:00
Dominik Steenken	866b9f43a0	[SystemZ] Add realistic cost estimates for vector reduction intrinsics (#118319 ) This PR adds more realistic cost estimates for these reduction intrinsics - `llvm.vector.reduce.umax` - `llvm.vector.reduce.umin` - `llvm.vector.reduce.smax` - `llvm.vector.reduce.smin` - `llvm.vector.reduce.fadd` - `llvm.vector.reduce.fmul` - `llvm.vector.reduce.fmax` - `llvm.vector.reduce.fmin` - `llvm.vector.reduce.fmaximum` - `llvm.vector.reduce.fminimum` - `llvm.vector.reduce.mul ` The pre-existing cost estimates for `llvm.vector.reduce.add` are moved to `getArithmeticReductionCosts` to reduce complexity in `getVectorIntrinsicInstrCost` and enable other passes, like the SLP vectorizer, to benefit from these updated calculations. These are not expected to provide noticable performance improvements and are rather provided for the sake of completeness and correctness. This PR is in draft mode pending benchmark confirmation of this. This also provides and/or updates cost tests for all of these intrinsics. This PR was co-authored by me and @JonPsson1 .	2024-12-03 17:08:51 +01:00
LiqinWeng	eb3f1aec6e	[TTI][RISCV] Implement cost of some intrinsics with LMUL (#117874 ) Intrinsics include: sadd_sat/ssub_sat/uadd_sat/usub_sat/fabs/fsqrt/cttz/ctlz/ctpop	2024-12-03 10:17:52 +08:00
Simon Pilgrim	94df95de6b	[TTI][X86] getShuffleCosts - for SK_PermuteTwoSrc, if the masks are known to be "inlane" no need to scale the costs by worst-case legalization (#117999 ) SK_PermuteTwoSrc legalization has to assume any of the legalised source registers could be referenced in split shuffles, but if we already know that each 128-bit lane only references elements from the same lane of the source operands, then this scaling won't occur. Hopefully this can help with #113356 without us having to get full processShuffleMasks canonicalization finished first.	2024-12-01 12:01:47 +00:00
David Green	fe04290482	[AArch64] Change the default vscale-for-tuning to 1. (#117174 ) Most AArch64 cpus outside of Neoverse V1 (256) and A64FX (512) have an SVE vector length of 128, and in environments like Android (where no mcpu option is common) we would expect all cpus to match. This patch changes the default vector length to 128 with -mcpu=generic, to match the most common case.	2024-11-29 17:41:05 +00:00
David Green	6f4b4f41ca	[AArch64] Remove LoopVectorizer/AArch64/scatter-cost.ll test. NFC This test checks the costs, not vectorization, so is better placed in the existing gather/scatter cost modelling tests. An extra neoverse-v2 check line has been added for both gathers and scatters.	2024-11-29 14:38:36 +00:00
David Green	d714b221c7	[AArch64] Guard against getRegisterBitWidth returning zero in vector instr cost. (#117749 ) If the getRegisterBitWidth is zero (such as in sme streaming functions), then we could hit a crash from using % RegWidth.	2024-11-29 04:01:03 +00:00
Simon Pilgrim	79dab3f5eb	[CostModel][X86] Add shuffle 'splat' tests for broadcasts of non-zero element index. As noticed on #115201 - its possible for SK_Broadcast to occur for non-zero element index which we don't currently handle.	2024-11-28 13:25:55 +00:00
LiqinWeng	c3377af4c3	[RISCV][CostModel] add cost for cttz/ctlz under the non-zvbb (#117515 )	2024-11-26 11:40:52 +08:00
Phoebe Wang	2ab84a60ff	[X86][FP16][BF16] Improve vectorization of fcmp (#116153 )	2024-11-26 08:17:48 +08:00
Luke Lau	15fadeb2aa	[RISCV] Add cost for @llvm.experimental.vp.splat (#117313 ) This is split off from #115274. There doesn't seem to be an easy way to share this with getShuffleCost since that requires passing in a real insert_element operand to get it to recognise it's a scalar splat. For i1 vectors we can't currently lower them so it returns an invalid cost. --------- Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>	2024-11-25 11:28:46 +01:00
LiqinWeng	db14010405	[RISCV][TTI] Implement cost of intrinsic abs with LMUL (#115813 )	2024-11-25 17:35:58 +08:00
LiqinWeng	48b13ca48b	[RISCV][CostModel] cost of vector cttz/ctlz under ZVBB (#115800 )	2024-11-24 09:18:18 +08:00
Luke Lau	1e897ed28d	[TTI][RISCV] Deduplicate type-based VP costing (#115983 ) We have a lot of code in RISCVTTIImpl::getIntrinsicInstrCost for vp intrinsics, which just forward the cost to the underlying non-vp cost function. However I just also noticed that there is generic code in BasicTTIImpl's getIntrinsicInstrCost that does the same thing, added in #67178. The only difference is that BasicTTIImpl doesn't yet handle it for type-based costing. There doesn't seem to be any reason that it can't since it's just inspecting the argument types. This shuffles the VP costing up to handle both regular and type-based costing, which allows us to deduplicate some of the VP specific costing in RISCVTTIImpl by delegating it to BasicTTIImpl.h. More of those nodes can be moved over to BasicTTIImpl.h later. It's not NFC since it picks up a couple of VP nodes that had slipped through the cracks. Future PRs can begin to move more of the code from RISCVTTIImpl to BasicTTIImpl.	2024-11-19 16:20:29 +08:00
LiqinWeng	5a12881514	[RISCV][Test] Add test for vp float arithmetic ops. NFC (#114516 )	2024-11-13 16:23:04 +08:00
LiqinWeng	9aa4f50ae4	[RISCV][TTI] Add vp.fneg intrinsic cost with functionalOP (#114378 )	2024-11-13 15:40:48 +08:00
Luke Lau	1294ddabbc	[RISCV] Add cost model tests for vp.{s,u}{min,max}. NFC	2024-11-13 14:32:44 +08:00
Sushant Gokhale	9991ea28fc	[CostModel][AArch64] Make extractelement, with fmul user, free whenev… (#111479 ) …er possible In case of Neon, if there exists extractelement from lane != 0 such that 1. extractelement does not necessitate a move from vector_reg -> GPR 2. extractelement result feeds into fmul 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0 then the extractelement can be merged with fmul in the backend and it incurs no cost. e.g. ``` define double @foo(<2 x double> %a) { %1 = extractelement <2 x double> %a, i32 0 %2 = extractelement <2 x double> %a, i32 1 %res = fmul double %1, %2 ret double %res } ``` `%2` and `%res` can be merged in the backend to generate: `fmul d0, d0, v0.d[1]` The change was tested with SPEC FP(C/C++) on Neoverse-v2. Compile time impact: None Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.	2024-11-13 11:10:49 +05:30
Elvis Wang	3431d133cc	[RISCV][TTI] Implement instruction cost for vp.reduce.* #114184 The VP variants simply return the same costs as the non-VP variants. This assumes that reductions are VL predicated, and that VL predication has no additional cost.	2024-11-12 10:01:35 -08:00
Philip Reames	6256f4b807	[CostModel][RISCV] Rename misleadingly named test file RVV intrinsics are the C intrinsic API; this file actually contains tests for the vp.* family of intrinsics.	2024-11-12 08:20:13 -08:00

1 2 3 4 5 ...

1796 Commits