llvm-project

Author	SHA1	Message	Date
Alexey Bataev	2f40145826	[RISCV][TTI]Use processShuffleMasks for cost estimations/actual per-register shuffles Patch adds usage of processShuffleMasks in TTI for RISCV. This function is already used for X86 shuffles estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE functions and in RISCV codegen. Patch allows better cost estimation for sparse masks and unifies cost/codegen between different targets/passes Reviewers: preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/118103	2025-01-29 07:18:55 -05:00
LiqinWeng	98e5962b7c	[RISCV][CostModel] Add cost for fabs/fsqrt of type bf16/f16 (#118608 )	2025-01-10 17:22:51 +08:00
Shao-Ce SUN	369c61744a	[RISCV] Fix the cost of `llvm.vector.reduce.and` (#119160 ) I added some CodeGen test cases related to reduce. To maintain consistency, I also added cases for instructions like `vector.reduce.or`. For cases where `v1i1` type generates `VFIRST`, please refer to: https://reviews.llvm.org/D139512.	2025-01-10 10:10:42 +08:00
Craig Topper	b11fe33aea	[RISCV] Correct the cost model for the i1 reduce.add and reduce.or. (#122349 ) reduce.add uses the same sequence as reduce.xor. reduce.or should use vmor not vmxor.	2025-01-09 18:05:22 -08:00
Pengcheng Wang	72db3f989e	[RISCV] Allow tail memcmp expansion (#121460 ) This optimization was introduced by #70469. Like AArch64, we allow tail expansions for 3 on RV32 and 3/5/6 on RV64. This can simplify the comparison and reduce the number of blocks.	2025-01-03 14:05:02 +08:00
Philip Reames	dd30aa83aa	[RISCV][TTI] Simplify compound check for readability [nfc] (#121504 ) I misread this check earlier today on a review, so restructure it to be easier to quickly scan.	2025-01-02 09:36:01 -08:00
Wang Pengcheng	5b5ef254a3	[RISCV] Fix typo: vmv.x.i -> vmv.v.i	2024-12-31 16:18:13 +08:00
Philipp van Kempen	f590963db8	[RISCV] Implement RISCVTTIImpl::getPreferredAddressingMode for HasVendorXCVmem (#120533 ) For a simple matmult kernel this heuristic reduces the length of the critical basic block from 15 to 12 instructions, resulting in a 20% speedup. Without heuristic: ``` 13688: 001b838b cv.lb t2, (s7), 0x1 1368c: 09cdbcab cv.lb s9, t3(s11) 13690: 089db62b cv.lb a2, s1(s11) 13694: 092dbdab cv.lb s11, s2(s11) 13698: 001d028b cv.lb t0, (s10), 0x1 1369c: 00f282b3 add t0, t0, a5 136a0: 9072b52b cv.mac a0, t0, t2 136a4: 9192bfab cv.mac t6, t0, s9 136a8: 90c2beab cv.mac t4, t0, a2 136ac: 91b2bf2b cv.mac t5, t0, s11 136b0: fffc0c13 addi s8, s8, -0x1 136b4: 018e0633 add a2, t3, s8 136b8: 91b2b0ab cv.mac ra, t0, s11 136bc: 000b8d93 mv s11, s7 136c0: fc0614e3 bnez a2, 0x13688 <muriscv_nn_vec_mat_mult_t_s8+0x2f0> #instrs = 15 ``` With heuristic: ``` 7bc0: 001c860b cv.lb a2, (s9), 0x1 7bc4: 001e0d0b cv.lb s10, (t3), 0x1 7bc8: 001e808b cv.lb ra, (t4), 0x1 7bcc: 0015038b cv.lb t2, (a0), 0x1 7bd0: 001c028b cv.lb t0, (s8), 0x1 7bd4: 00f282b3 add t0, t0, a5 7bd8: 90c2bfab cv.mac t6, t0, a2 7bdc: 91a2b92b cv.mac s2, t0, s10 7be0: 9012b5ab cv.mac a1, t0, ra 7be4: 9072b9ab cv.mac s3, t0, t2 7be8: 9072b72b cv.mac a4, t0, t2 7bec: fc851ae3 bne a0, s0, 0x7bc0 <muriscv_nn_vec_mat_mult_t_s8+0x338> #instrs = 12 improvement = 1 - 12/15 = 0.2 = 20% ```	2024-12-31 10:56:28 +08:00
LiqinWeng	a611d67601	[RISCV][TTI] Add llvm.fmuladd and llvm.vp.fmuladd into canSplatOperand (#119508 ) When the first or second operand of fmuladd is a splat, this helps fmuladd fold vv instructions into vf instructions.	2024-12-12 18:39:21 +08:00
Elvis Wang	a674209432	[RISCV][TTI] Model the cost of insert/extractelt when the vector split into multiple register group and idx exceed single group. (#118401 ) This patch implements the cost for the case where the vector is large enough that it must be split into multiple register groups and the index falls outside a single vector group. For extract element, we need to store the split vectors to the stack and load the target element. For insert element, we need to store the split vectors to the stack, store the target element, and load the vectors back. After this patch, the cost of insert/extract element will be close to the generated assembly.	2024-12-12 09:26:19 +08:00
Michael Maitland	34a076c46f	[RISCV][NFC] Don't set UnrollAndJamInnerLoopThreshold in getUnrollingPreferences (#118572 ) This has no effect since it's the default value used in llvm::gatherUnrollingPreferences.	2024-12-05 10:41:19 -05:00
LiqinWeng	46829e5430	[RISCV][CostModel] Correct the cost of some reductions (#118072 ) Reductions include: and/or/max/min	2024-12-04 17:26:54 +08:00
LiqinWeng	eb3f1aec6e	[TTI][RISCV] Implement cost of some intrinsics with LMUL (#117874 ) Intrinsics include: sadd_sat/ssub_sat/uadd_sat/usub_sat/fabs/fsqrt/cttz/ctlz/ctpop	2024-12-03 10:17:52 +08:00
LiqinWeng	ede570980a	[RISCV][TTI] Add llvm.vp.select into canSplatOperand. (#117982 ) When the second operand of llvm.vp.select is a splat, this helps llvm.vp.select fold vv instructions into vx instructions.	2024-12-02 17:39:47 +08:00
Luke Lau	df10f1c6f7	[RISCV] Use getRISCVInstructionCost for split cost in mask reductions. NFC This is effectively the same due to how the mask instructions have an LMUL of 1 and cost of 1, but matches how we use LT.first elsewhere in RISCVTargetTransformInfo.cpp by using it to multiply another instruction cost.	2024-12-02 17:15:53 +08:00
Jonas Paulsson	0ad6be1927	[SLPVectorizer, TargetTransformInfo, SystemZ] Improve SLP getGatherCost(). (#112491 ) As vector element loads are free on SystemZ, this patch improves the cost computation in getGatherCost() to reflect this. getScalarizationOverhead() gets an optional parameter which can hold the actual Values so that they in turn can be passed (by BasicTTIImpl) to getVectorInstrCost(). SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and typically return a 0 cost for it, with some exceptions.	2024-11-29 21:19:45 +01:00
LiqinWeng	b2d3cb1e75	[TTI][RISCV] Remove deduplicate type-based VP costing of VPReduction.NFC (#117708 ) Referred to: #115983	2024-11-29 12:10:05 +08:00
LiqinWeng	c3377af4c3	[RISCV][CostModel] add cost for cttz/ctlz under the non-zvbb (#117515 )	2024-11-26 11:40:52 +08:00
LiqinWeng	dd7aabf7c0	[TTI][RISCV] Deduplicate type-based VP costing of vpcmp/vpcast (#117520 ) Referred to: https://github.com/llvm/llvm-project/pull/115983	2024-11-26 10:49:24 +08:00
Luke Lau	15fadeb2aa	[RISCV] Add cost for @llvm.experimental.vp.splat (#117313 ) This is split off from #115274. There doesn't seem to be an easy way to share this with getShuffleCost since that requires passing in a real insert_element operand to get it to recognise it's a scalar splat. For i1 vectors we can't currently lower them so it returns an invalid cost. --------- Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>	2024-11-25 11:28:46 +01:00
LiqinWeng	db14010405	[RISCV][TTI] Implement cost of intrinsic abs with LMUL (#115813 )	2024-11-25 17:35:58 +08:00
hev	e26af0938c	[llvm] Add `BasicTTIImpl::areInlineCompatible` for target feature subset checks (#117493 ) This patch moves the `areInlineCompatible` implementation from multiple subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to the base class `BasicTTIImpl`. The new implementation checks whether the callee's target features are a subset of the caller's, enabling consistent behavior across targets. Subclasses now simply delegate to the base implementation, reducing code duplication and improving maintainability.	2024-11-25 11:22:49 +08:00
LiqinWeng	48b13ca48b	[RISCV][CostModel] cost of vector cttz/ctlz under ZVBB (#115800 )	2024-11-24 09:18:18 +08:00
Alexey Bataev	7523086a05	[SLP]Use getExtendedReduction cost and fix reduction cost calculations Patch uses getExtendedReduction for reductions of ext-based nodes + adds cost estimation for ctpop-kind reductions into basic implementation and RISCV-V specific vcpop cost estimation. Reviewers: RKSimon, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/117350	2024-11-22 16:12:53 -05:00
Luke Lau	1e897ed28d	[TTI][RISCV] Deduplicate type-based VP costing (#115983 ) We have a lot of code in RISCVTTIImpl::getIntrinsicInstrCost for vp intrinsics, which just forward the cost to the underlying non-vp cost function. However I just also noticed that there is generic code in BasicTTIImpl's getIntrinsicInstrCost that does the same thing, added in #67178. The only difference is that BasicTTIImpl doesn't yet handle it for type-based costing. There doesn't seem to be any reason that it can't since it's just inspecting the argument types. This shuffles the VP costing up to handle both regular and type-based costing, which allows us to deduplicate some of the VP specific costing in RISCVTTIImpl by delegating it to BasicTTIImpl.h. More of those nodes can be moved over to BasicTTIImpl.h later. It's not NFC since it picks up a couple of VP nodes that had slipped through the cracks. Future PRs can begin to move more of the code from RISCVTTIImpl to BasicTTIImpl.	2024-11-19 16:20:29 +08:00
LiqinWeng	9aa4f50ae4	[RISCV][TTI] Add vp.fneg intrinsic cost with functionalOP (#114378 )	2024-11-13 15:40:48 +08:00
Elvis Wang	3431d133cc	[RISCV][TTI] Implement instruction cost for vp.reduce.* #114184 The VP variants simply return the same costs as the non-VP variants. This assumes that reductions are VL predicated, and that VL predication has no additional cost.	2024-11-12 10:01:35 -08:00
Pengcheng Wang	7a5b040e20	[RISCV] Add initial support of memcmp expansion There are two passes that have dependency on the implementation of `TargetTransformInfo::enableMemCmpExpansion` : `MergeICmps` and `ExpandMemCmp`. This PR adds the initial implementation of `enableMemCmpExpansion` so that we can have some basic benefits from these two passes. We don't enable expansion when there is no unaligned access support currently because there are some issues about unaligned loads and stores in `ExpandMemcmp` pass. We should fix these issues and enable the expansion later. Vector case hasn't been tested as we don't generate inlined vector instructions for memcmp currently. Reviewers: preames, arcbbb, topperc, asb, dtcxzyw Reviewed By: topperc, preames Pull Request: https://github.com/llvm/llvm-project/pull/107548	2024-11-06 15:44:12 +08:00
Philip Reames	a905203b9e	[RISCV] Prefer strided load for interleave load with only one lane active (#115069 ) If only one of the elements is actually used, then we can legally use a strided load in place of the segment load. Doing so reduces vector register pressure, so if both segment and strided are believed to be element/segment at a time, then prefer the strided load variant. Note that I've seen the vectorizer emitting wide interleave loads to represent a strided load, so this does happen in practice. It doesn't matter much for small LMUL*NF, but at large NF can start causing problems in register allocation. Note that this patch only covers the fixed vector formation cases. In theory, we should do the same patch for scalable, but we can currently only represent NF2 in scalable IR, and NF2 is assumed to be optimized to better than segment-at-a-time by default, so there's currently nothing to do.	2024-11-05 16:15:20 -08:00
Luke Lau	beb12f92c7	[RISCV] Add +optimized-nfN-segment-load-store (#114414 ) This is a follow up to #111511, where after benchmarking we learnt that the Banana Pi F3 has fast segmented loads for not just NF=2, but also NF=3 and NF=4: https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput This adds tuning features to allow these segment loads and stores to be costed cheaper and enables it for the spacemit-x60. It also enables +optimized-nf2-segment-load-store by default in the generic tuning to maintain the previous behaviour when compiled without -mcpu or -mtune.	2024-11-04 06:43:58 +08:00
Luke Lau	9c7188871c	[RISCV] Cost ordered bf16/f16 w/ zvfhmin reductions as invalid (#114250 ) In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16 without zvfh, and marked the cost as invalid to prevent the vectorizers from emitting them. However it inadvertently didn't change the cost for ordered reductions, so this moves the check earlier to fix this. This also uses BasicTTIImpl instead which now assigns a valid but expensive cost for fixed-length vectors, which reflects how codegen will actually scalarize them.	2024-10-31 23:36:09 +08:00
Pengcheng Wang	18f0f70934	[RISCV] Support llvm.masked.expandload intrinsic (#101954 ) We can use `viota`+`vrgather` to synthesize `vdecompress` and lower expanding load to `vcpop`+`load`+`vdecompress`. And if `%mask` is all ones, we can lower expanding load to a normal unmasked load. Fixes #101914.	2024-10-31 20:03:58 +08:00
Elvis Wang	a8575c1459	[RISCV] Sink ordered reduction check into FAdd. NFC (#114180 )	2024-10-31 13:35:37 +08:00
Luke Lau	14045de250	[RISCV] Account for factor in interleave memory op costs (#111511 ) Currently we cost an interleaved memory op as if it were a load/store of the widened vector type, but this was undercosting in all cases when compared to the measured performance of today's hardware. On the x280 at NF=2 and spacemit-x60 at NF=2,3 and 4, a segmented load is carried out as a wide load and NF LMUL shuffle ops: https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput All other NFs go through a slow path. On the spacemit-x60 this is proportional to VLMAX * NF, and on the x280 proportional to the number of segments. This patch increases the cost by implementing a wide load + NF LMUL shuffle op cost for the lowest common denominator NF=2, and then a slower cost proportional to VL for the other NFs. In a follow up patch we can add a tuning flag to use the faster cost model for NF=3 and 4 on the spacemit-x60. Note that the FIXME about illegal vectors seems to have been fixed in #100436	2024-10-31 05:36:46 +08:00
Luke Lau	e989e31a47	[RISCV] Mark f16/bf16 lrint and llrint cost as invalid (#113924 ) We currently can't lower scalable vector lrint and llrint nodes for bf16 and f16, even with zvfh, and will crash. Mark the cost as invalid for now to prevent the vectorizers from emitting them. Note that we can actually lower fixed-length vectors fine by scalarizing them, but we were still undercosting these too so I've also included them. I presume there's an opportunity to improve the codegen later on.	2024-10-30 17:21:18 +02:00
Han-Kuan Chen	12bcea3292	[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#111459 ) reference: https://github.com/llvm/llvm-project/pull/110457	2024-10-18 20:16:56 +07:00
Elvis Wang	566012a64e	[RISCV][TTI] Implement instruction cost for vp_merge. (#112327 ) This patch implements the instruction cost for `vp_merge`, which will generate a similar instruction sequence to the `select` instruction.	2024-10-17 07:47:43 +08:00
Philip Reames	b3c687b4e9	[LV] Check early for supported interleave factors with scalable types [nfc] (#111592 ) Previously, the cost model was returning an invalid cost. This simply moves the check from one place to another. This is mostly to make the cost modeling code a bit easier to follow. --------- Co-authored-by: Mel Chen <mel.chen@sifive.com>	2024-10-15 07:37:46 -07:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Philip Reames	f11568bcb0	Revert "[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457 )" This reverts commit 554eaec63908ed20c35c8cc85304a3d44a63c634. Change was not approved when landed.	2024-10-07 11:31:57 -07:00
Han-Kuan Chen	554eaec639	[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457 )	2024-10-05 14:58:44 +08:00
Luke Lau	487686b82e	[SDAG][RISCV] Don't promote VP_REDUCE_{FADD,FMUL} (#111000 ) In https://reviews.llvm.org/D153848, promotion was added for a variety of f16 ops with zvfhmin, including VP reductions. However I don't believe it's correct to promote f16 fadd or fmul reductions to f32 since we need to round the intermediate results. Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two different results depending on whether we compiled with +zvfh or +zvfhmin, for example with a 3 element reduction: ; v9 = [0.1563, 5.97e-8, 0.00006104] ; zvfh vsetivli x0, 3, e16, m1, ta, ma vmv.v.i v8, 0 vfredosum.vs v8, v9, v8 vfmv.f.s fa0, v8 ; fa0 = 0.1563 ; zvfhmin vsetivli x0, 3, e16, m1, ta, ma vfwcvt.f.f.v v10, v9 vsetivli x0, 3, e32, m1, ta, ma vmv.v.i v8, 0 vfredosum.vs v8, v10, v8 vfmv.f.s fa0, v8 fcvt.h.s fa0, fa0 ; fa0 = 0.1564 This same thing happens with reassociative reductions e.g. vfredusum.vs, and this also applies for bf16. I couldn't find anything in the LangRef for reductions that suggest the excess precision is allowed. There may be something we can do in Clang with -fexcess-precision=fast, but I haven't looked into this yet. I presume the same precision issue occurs with fmul, but not with fmin/fmax/fminimum/fmaximum. I can't think of another way of lowering these other than scalarizing, and we can't scalarize scalable vectors, so this just removes the promotion and adjusts the cost model to return an invalid cost. (It looks like we also don't currently cost fmul reductions, so presumably they also have an invalid cost?) I think this should be enough to stop the loop vectorizer or SLP from emitting these intrinsics.	2024-10-04 00:17:45 +08:00
Philip Reames	50afafbf29	[RISCV][TTI] Adjust constant materialization cost for (z/s)ext from i1 (#110282 ) When we're lowering to a split sequence, we only need one materialization of the zero constant. Our codegen looks something like this: vmv.v.i v24, 0 vmerge.vim v8, v24, -1, v0 vmv1r.v v0, v16 vmerge.vim v16, v24, -1, v0 Note: Doing this specific case since it was pointed out in https://github.com/llvm/llvm-project/pull/110164#discussion_r1778268391, but it's worth noting that we have the same basic problem (over costing split operations with split invariant terms) at multiple places through this file.	2024-09-27 10:53:45 -07:00
Philip Reames	1a9569c4f0	[RISCV][TTI] Avoid an infinite recursion issue in getCastInstrCost (#110164 ) Calling into BasicTTI is not always safe. In particular, BasicTTI does not have a full legalization implementation (vector widening is missing), and falls back on scalarization. The problem is that scalarization for <N x i1> vectors is costed in terms of the cast API and we can end up in an infinite recursive cycle. The "right" fix for this would be to teach BasicTTI how to model the full legalization state machine, but several attempts at doing so have resulted in dead ends or undesirable cost changes for targets I don't understand. This patch instead papers over the issue by avoiding the call to the base class when dealing with an i1 source or dest. This doesn't necessarily produce correct costs, but it should at least return something semi-sensible and not crash. Fixes https://github.com/llvm/llvm-project/issues/108708	2024-09-27 07:47:09 -07:00
Philip Reames	d288574363	[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824 ) This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704, and extends the costing API for compares and selects to provide information about the operands passed in an analogous manner. This allows us to model the cost of materializing the vector constant, as some select-of-constants are significantly more expensive than others when you account for the cost of materializing the constants involved. This is a stepping stone towards fixing https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch will be required to utilize the new API.	2024-09-25 07:25:57 -07:00
Luke Lau	f43ad88ae1	[RISCV] Handle zvfhmin and zvfbfmin promotion to f32 in half arith costs (#108361 ) Arithmetic half or bfloat ops on zvfhmin and zvfbfmin respectively will be promoted and carried out in f32, so this updates getArithmeticInstrCost to check for this.	2024-09-25 18:50:16 +08:00
Philip Reames	0b524efa95	[RISCV][TTI] Reduce cost of a <N x i1> build_vector pattern (#109449 ) This is a follow up to 7f6bbb3. When lowering a <N x i1> build_vector, we currently choose to extend to i8, perform the build_vector there, and then truncate back in vector. Our costing on the other hand accounts for it as if we performed a vector extend, an insert, and a vector extract for every element. This significantly overestimates the cost. Note that we can likely do better in our build_vector lowering here by packing the bits in scalar, and doing a build_vector of the packed bits. Regardless, our costing should match our lowering.	2024-09-23 07:21:54 -07:00
Elvis Wang	80b44517f5	[RISCV][TTI] Add instruction cost for vp.select. (#109381 ) This patch makes the instruction cost for vp.select the same as its non-vp counterpart.	2024-09-23 15:06:04 +08:00
Philip Reames	7f6bbb3c4f	[RISCV][TTI] Reduce cost of a build_vector pattern (#108419 ) This change is actually two related changes, but they're very hard to meaningfully separate as the second balances the first, and yet doesn't do much good on its own. First, we can reduce the cost of a build_vector pattern. Our current costing for this defers to generic insertelement costing which isn't unreasonable, but also isn't correct. While inserting N elements requires N-1 slides and N vmv.s.x, doing the full build_vector only requires N vslide1down. (Note there are other cases that our build vector lowering can do more cheaply, this is simply the easiest upper bound which appears to be "good enough" for SLP costing purposes.) Second, we need to tell SLP that calls don't preserve vector registers. Without this, SLP will vectorize scalar code which performs e.g. 4 x float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the costing works out that this is in fact the optimal choice - except that we don't actually have a <2 x float> @exp, and unroll during DAG. This would be fine (or at least cost neutral) except that the libcall for the scalar @exp blows all vector registers. So the net effect is we added a bunch of spills that SLP had no idea about. Thankfully, AArch64 has a similar problem, and has taught SLP how to reason about spill cost once the right TTI hook is implemented. Now, for some implications... The SLP solution for spill costing has some inaccuracies. In particular, it basically just guesses whether an intrinsic will be lowered to a call or not, and can be wrong in both directions. It also has no mechanism to differentiate on calling convention. This has the effect of making partial vectorization (i.e. starting in scalar) more profitable. In practice, the major effect of this is to make it more likely that SLP will vectorize part of a tree in an intersecting forest, and then vectorize the remaining tree once those uses have been removed. 
This has the effect of biasing us slightly away from strided, or indexed loads during vectorization - because the scalar cost is more accurately modeled, and these instructions look relatively less profitable.	2024-09-20 08:34:36 -07:00
Elvis Wang	86ce8e4504	[RISCV][TTI] Fix potential crash of using dyn_cast() in getIntrinsicInstrCost() NFC. (#109379 ) This patch fixes a potential crash from using dyn_cast in `vp_cmp`, which is the same issue as #109313. Check whether the IntrinsicCostAttributes contains an underlying instruction first, and only then cast to VPCmpIntrinsic.	2024-09-20 22:45:53 +08:00
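
The precision argument in commit 487686b82e (why an f16 `fadd` reduction must not be promoted to f32) can be reproduced off-target. The following is a minimal sketch, not the LLVM lowering: it emulates ordered summation in Python, using `struct`'s half and single formats for rounding. The helper names (`to_f16`, `to_f32`, `ordered_sum`, `reduce_zvfh`, `reduce_zvfhmin_promoted`) are hypothetical, and the inputs are assumed to be the f16 values nearest to the decimals printed in that commit message.

```python
import struct

def to_f16(x: float) -> float:
    """Round to the nearest IEEE binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_f32(x: float) -> float:
    """Round to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ordered_sum(values, rnd, start=0.0):
    """Sum in order, rounding every intermediate result with `rnd`."""
    acc = rnd(start)
    for v in values:
        acc = rnd(acc + v)
    return acc

# Assumption: the vector v9 from the commit message, taken as the f16
# values nearest to the printed decimals.
V9 = [to_f16(x) for x in (0.1563, 5.97e-8, 0.00006104)]

def reduce_zvfh(values):
    """Reduction carried out natively in f16 (the +zvfh sequence)."""
    return ordered_sum(values, to_f16)

def reduce_zvfhmin_promoted(values):
    """Reduction widened to f32, then narrowed once at the end."""
    return to_f16(ordered_sum(values, to_f32))

if __name__ == "__main__":
    print("f16 reduction:     ", reduce_zvfh(V9))
    print("promoted reduction:", reduce_zvfhmin_promoted(V9))
```

Under these assumptions the f16 path returns 0.15625 and the promoted path returns the next f16 value up, consistent with the 0.1563 versus 0.1564 results quoted in the commit message. The difference comes from the f32 accumulator keeping low-order bits that the f16 accumulator rounds away at each step.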


314 Commits