261 Commits

Author SHA1 Message Date
Philip Reames
2e7c7d20d5
[RISCV][TTI] Adjust cost for extract/insert element when VLEN is known (#108595)
If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register. Our lowering performs
this subvector optimization.
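
A minimal sketch of the index reduction described above, written in Python for brevity. The helper name and its parameters are hypothetical and are not LLVM APIs; it assumes VLEN is known exactly and is a multiple of the element width.

```python
def index_within_register(index: int, vlen_bits: int, elt_bits: int) -> int:
    """Reduce an element index to its position inside one vector register.

    Hypothetical helper for illustration only: assumes an exactly known
    VLEN that is a multiple of the element width.
    """
    if elt_bits <= 0 or vlen_bits % elt_bits != 0:
        raise ValueError("VLEN must be a positive multiple of the element width")
    elts_per_reg = vlen_bits // elt_bits
    return index % elts_per_reg


# With VLEN=128 and 32-bit elements there are 4 elements per register,
# so element 9 of a multi-register vector is element 1 of its register.
print(index_within_register(9, 128, 32))  # prints 1
```

The point of the reduction is that a cost which depends on the index only needs to consider the position within a single register, not the position within the whole register group.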

A bit of context. This change may look a bit strange on its own given
we are currently *not* scaling insert/extract cost by LMUL. This costing
decision needs to change, but is very intertwined with SLP
profitability, and is thus a bit hard to adjust. I'm hoping that
https://github.com/llvm/llvm-project/pull/108419 will let me start to
untangle this. This change is basically a case of finding a subset I can
tackle before other dependencies are in place which does no real harm in
the meantime.
2024-09-17 08:43:40 -07:00
Luke Lau
41f1b467a2
[RISCV] Account for zvfhmin and zvfbfmin promotion in register usage (#108370)
A half with only zvfhmin or bfloat will end up getting promoted to an f32
for most instructions.

Unless the loop consists only of memory ops and permutation instructions
which don't need to be promoted (is this common?), we'll end up using double
the LMUL of what's currently being returned by getRegUsageForType.

Since this is used by the loop vectorizer, it seems better to be
conservative and assume that any usage of a zvfhmin half/bfloat will end
up being widened to an f32.
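
The doubling can be illustrated with a small model, assuming register usage is simply the type's size in bits divided by VLEN and rounded up. The function and its `promoted` flag are hypothetical and only stand in for the real getRegUsageForType logic.

```python
import math


def reg_usage(num_elts: int, elt_bits: int, vlen_bits: int, promoted: bool) -> int:
    """Estimate how many vector registers a fixed-length vector occupies.

    Illustrative model only. `promoted` stands for "f16 with only zvfhmin,
    or bf16", in which case every element is assumed to be widened to f32.
    """
    if promoted:
        elt_bits = 32
    return max(1, math.ceil(num_elts * elt_bits / vlen_bits))


# 16 x f16 at VLEN=128: two registers as-is, four once widened to f32.
print(reg_usage(16, 16, 128, promoted=False))  # prints 2
print(reg_usage(16, 16, 128, promoted=True))   # prints 4
```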
2024-09-17 13:50:19 +08:00
Elvis Wang
1b3e64a9d2
[RISCV][TTI] Add vp.cmp intrinsic cost with functionalOPC. (#107504)
This patch makes the instruction cost of VP compare intrinsics the same as
that of their non-VP counterparts.
2024-09-12 07:06:36 +08:00
Elvis Wang
845d8d909c
[RISCV][TTI] Add cost of typebased cast VPIntrinsics with functionalOPC. (#97797)
This patch makes the instruction cost of type-based cast VP intrinsics
the same as that of their non-VP counterparts.
This is a follow-up to
[#93435](https://github.com/llvm/llvm-project/pull/93435)
2024-09-05 13:05:01 +08:00
Shih-Po Hung
837ee5b46a
[RISCV][TTI] Scale the cost of FP-Int conversion with LMUL (#87506)
Widening/narrowing the source data type to match the destination data
type may require multiple steps.
To model the costs, this patch generates the interim type by following
the logic in RISCVTargetLowering::lowerVPFPIntConvOp.
2024-09-02 09:38:42 +08:00
Philip Reames
59f05b683d
[RISCV][TTI] Model cost for insert/extract into illegal types (#106440)
We'd previously just deferred to the base implementation, but that more
or less always returns 1. This underestimates the cost of the
insert/extract, biases the SLP vectorizer towards forming illegally
typed vectors, and underestimates the cost of scalarized operations
(like unaligned scatter/gather).
2024-08-29 09:45:47 -07:00
Maciej Gabka
95d2d1cba0
Move stepvector intrinsic out of experimental namespace (#98043)
This patch moves the stepvector intrinsic out of the experimental
namespace.

This intrinsic has existed in LLVM for several years now, and is widely used.
2024-08-28 12:48:20 +01:00
Alexey Bataev
2a50dac9fb
[RISCV][TTI] Fix the cost estimation for long select shuffle.
The code was completely broken. We need to iterate over the whole mask,
process the submasks correctly, check whether they form a full identity,
and adjust the indices correctly.

Fixes https://github.com/llvm/llvm-project/issues/106126
2024-08-26 17:27:52 -07:00
Philip Reames
424b87b8d6
[RISCV][TTI] Use legalized element types when costing casts (#105723)
This fixes a crash introduced by my
ac6e1fd0c089043fe60bd0040ba3cad884f00206.

I had failed to consider the case where a vector is truncated to an
illegal element type. The resulting intermediate VT wasn't an MVT and
we'd fail an assertion. Surprisingly, SLP does query illegal element
types in some cases.
2024-08-22 16:19:48 -07:00
LiqinWeng
abaa53199e
[RISCV] Implement RISCVTTIImpl::shouldConsiderAddressTypePromotion for RISCV (#102560)
This optimization helps reduce repeated calculations of base addresses
by extracting type extensions when the same base address is accessed
multiple times but its offset is a constant.
2024-08-15 10:37:04 +08:00
Philip Reames
ac6e1fd0c0
[RISCV][TTI] Cost non-power-of-two size changing casts (#101047)
For a cast with src and destination size being unequal, we were costing
the cast as if it were being scalarized, when in fact we can often
promote such cases to a wider legal type.

Note that for casts with equal size (i.e. bitcast, some fp<->i, and
ptrtoint) the generic logic in BasicTTI already assumed promotion. It
just doesn't handle the cast where source and destination are both
promoted to non-equal types.

This is analogous to d3fd28a, but with the same reasoning applied to
casts instead.
2024-08-13 14:58:16 -07:00
Jeremy Morse
bde243259b
Revert "[Asan] Provide TTI hook to provide memory reference information of target intrinsics. (#97070)"
This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9.
This reverts commit d3c9bb0cf811424dcb8c848cf06773dbdde19965.

A few buildbots trip up on asan-rvv-intrinsics.ll. I've also reverted
the follow-up commit d3c9bb0cf8.

https://lab.llvm.org/buildbot/#/builders/46/builds/2895
2024-08-08 12:26:05 +01:00
Yeting Kuo
e8ad87c7d0
[Asan] Provide TTI hook to provide memory reference information of target intrinsics. (#97070)
Previously asan considered target intrinsics as black boxes, so asan
could not instrument accurate checks. This patch provides TTI hooks to
let targets describe their intrinsic information to asan.

Note,
1. this patch renames InterestingMemoryOperand to MemoryRefInfo.
2. this patch does not support RVV indexed/segment load/store.
2024-08-08 13:40:26 +08:00
Craig Topper
ad80265874
[RISCV] Qualify all XCV predicates with !is64Bit. (#101074)
The tablegen patterns all have isRV32. I did not check if any of them
could naively support RV64.

Fixes #101067 and probably other bugs like it we haven't found yet.
2024-07-29 21:52:57 -07:00
Philip Reames
b66310f938
[RISCV][TTI] Split costing of [u/s]int_to_fp from fp_to_[u/s]int [nfc] (#101029)
The amount of code sharing between them is fairly small, and the split
version is much easier to read.
2024-07-29 09:32:36 -07:00
Philip Reames
d3fd28a134
[RISCV][TTI] Properly model odd vector sized LD/ST operations (#100436)
The motivation for this change is the costing of a LD or ST with nearly
power of 2 vectors (e.g. <3 x i32> or <7 x i32>) on V. There's an
experimental option in SLP to allow emitting these if the cost model
says they're profitable. This really helps with e.g. RGB vectors.

Our actual lowering for these depends on whether a wider container type
is known available. If so, we use a vle or vse on the wider type with a
restricted VL. If not, we split until a legal type is found, and then
apply the vle/vse on the sub-pieces.

This change is intentionally restricted to only the case where promotion
(widening w/VL predication) is involved. We appear to have at least one
bug in our splitting lowering (see discussion on review), and to avoid
exposing this more widely, I chose to not adjust costs for the splitting
case. The current splitting costing assumes scalarization (which is not
true of the actual lowering), but that has the effect of biasing
vectorization away from such cases strongly.

For the widening case, the true cost scales with the next largest legal
type. The default implementation assumes that such a type is scalarized.
Changing that brings our cost in line with our actual lowering decision.
Note that since scalarization is not possible for scalable types, the
prior costing falsely returned Invalid for that case.
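
A rough sketch of the widening case, assuming the wider container is the next power-of-two element count and that this container type is legal. The helper is hypothetical; the real decision depends on type legalization in the backend.

```python
from typing import Tuple


def widened_shape(num_elts: int) -> Tuple[int, int]:
    """Return (container element count, VL) for a widened vle/vse.

    Sketch only: the container is assumed to be the next power-of-two
    element count, and VL restricts the access to the original elements.
    """
    if num_elts <= 0:
        raise ValueError("vector must have at least one element")
    container = 1 << (num_elts - 1).bit_length()
    return container, num_elts


# <3 x i32> is accessed as a 4-element container with VL=3,
# <7 x i32> as an 8-element container with VL=7.
print(widened_shape(3))  # prints (4, 3)
print(widened_shape(7))  # prints (8, 7)
```

Under this model the cost follows the container type rather than a scalarized sequence, which is the adjustment the message describes.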
2024-07-26 12:52:20 -07:00
Luke Lau
58854facb3
[RISCV] Don't cost vector arithmetic fp ops as cheaper than scalar (#99594)
I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases rva22u64_v was
considerably slower.

One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.

It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.

This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.

Fixes #62576 (the last point there about scalar fsub/fmul)
2024-07-22 13:56:10 +08:00
Alex Bradbury
8687f7cd66
[RISCV] Support constant hoisting of immediate store values (#96073)
Previously getIntImmInstCost only calculated the cost of materialising
the argument of a store if it was the address. This means
ConstantHoisting's transformation wouldn't kick in for cases like
storing two values that require multiple instructions to materialise but
where one can be cheaply generated from the other (e.g. by an addition).

Two key changes were needed to avoid regressions when enabling this:
* Allowing constant materialisation cost to be calculated assuming
zeroes are free (as might happen if you had a 2*XLEN constant and one
half is zero).
* Avoiding constant hoisting if we have a misaligned store that's going
to be legalised to a sequence of narrower stores. I'm seeing cases
where hoisting the constant ends up with worse codegen in that case.

Out of caution and so as not to unexpectedly degrade other existing hoisting logic, FreeZeroes is used only for the new cost calculations for the store instruction. It would likely make sense to revisit this later.
2024-07-17 15:19:31 +01:00
Elvis Wang
4762f3bab0
[RISCV][TTI] Add cost of type based binOp VP intrinsics with functionalOPC. (#93435)
Intrinsics not supported in the backend will fall into BasicTTIImpl,
which will check if the VP intrinsic is a type based instruction.
All type based instructions will fall into
`getTypeBasedIntrinsicInstrCost()`, which doesn't support instructions
with scalable vector types.

This patch adds the instruction cost for type based binOp VP intrinsic
instructions in the backend to get the valid instruction costs.
The cost of type based binOp VP intrinsics will be same as their non-VP
counterpart.
2024-07-05 08:13:18 +08:00
Philip Reames
25b65be43d
[RISCV][LSR] Account for temporary register for base addition (#92296)
An LSR formula may require the addition of multiple base or scale
registers, and this sum reduction requires a temporary register to perform.
Since the formulas are independent, we only need one temporary,
regardless of the number of unique formulas. Each formula can reuse the
same temporary. A later CSE pass may come along and combine
sub-expressions, but then the register pressure would be that pass's
problem to consider.

This change fixes up the costing in the RISCV specific way, but this is
really a generic LSR problem. I just didn't feel like fighting with LSR
and dealing with all the various targets swinging slightly in hard to
reason about ways. This problem is more pronounced on RISCV than any
other target due to our lack of addressing modes.

This change is not hugely important on its own, but I have an upcoming
change to add support for shNadd in LSR which biases us fairly strongly
towards adding more "base adds". Without this change, we see net
regression due to the increase in register pressure which is not
accounted for.
2024-05-22 13:38:39 -07:00
Elvis Wang
b60e62896e
[RISCV][CostModel] Remove cost of icmp inst in icmp+select with SFB. (#91158)
With ShortForwardBranchOpt (SFB) or ConditionalMoveFusion, scalar
ICmp and scalar Select instructions will lower to SELECT_CC
and then to PseudoCCMOVGPR, which will generate a conditional
branch instruction and a move instruction.
The cost of scalar (ICmp + Select) = (0 + Select instruction cost)
2024-05-20 16:03:18 +08:00
Craig Topper
487b43cdc9
[RISCV] Pass subvector type to isLegalInterleavedAccessType in getInterleavedMemoryOpCost. (#91825)
isLegalInterleavedAccessType expects the subvector type, but
getInterleavedMemoryOpCost is called with the full vector type. So we
need to divide by Factor.
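
As a small illustration of the division, assuming a fixed-length interleaved group whose element count is a multiple of the factor (the helper name is hypothetical):

```python
def subvector_num_elts(full_num_elts: int, factor: int) -> int:
    """Element count of one de-interleaved member of an interleaved group.

    Illustration only: the legality check is assumed to want this
    per-member type, not the full interleaved vector type.
    """
    if factor <= 0 or full_num_elts % factor != 0:
        raise ValueError("element count must be a multiple of the factor")
    return full_num_elts // factor


# An interleaved access on <12 x i32> with Factor=3 works on <4 x i32> members.
print(subvector_num_elts(12, 3))  # prints 4
```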
2024-05-15 21:47:29 -07:00
Min-Yih Hsu
4c68de5a00
[RISCV][CostModel] Add cost model for experimental.cttz.elts (#91778)
The cost of `experimental.cttz.elts` in RISC-V equals the cost of
vfirst when the zero_is_poison argument is true. Otherwise, we add
additional costs of cmp + select to convert the -1 result from vfirst to
EVL.
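
The two cases can be modelled in a few lines. This is a behavioural sketch rather than the lowering itself: `vfirst` here is a plain Python stand-in for the instruction, and the vector length is taken to be the length of the mask.

```python
from typing import Sequence


def vfirst(mask: Sequence[bool]) -> int:
    """Index of the first set mask element, or -1 if none is set."""
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1


def cttz_elts(mask: Sequence[bool], zero_is_poison: bool) -> int:
    """Count trailing zero elements, modelled on top of vfirst."""
    first = vfirst(mask)
    if zero_is_poison:
        # An all-zero mask yields poison, so vfirst's result is used directly.
        return first
    # Otherwise the -1 result needs a compare and a select to become the length.
    return len(mask) if first < 0 else first


print(cttz_elts([False, False, True, False], zero_is_poison=False))  # prints 2
print(cttz_elts([False, False, False, False], zero_is_poison=False))  # prints 4
```

The extra compare and select in the second branch correspond to the additional cost the message mentions.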
2024-05-14 09:18:08 -07:00
Shih-Po Hung
22213d5883
Recommit [RISCV][TTI] Support fdiv/udiv/sdiv/srem/urem in getArithmeticInstrCost (#89170)
Insert a break to fix the implicit-fallthrough caught by sanitizer.

Original commit message:

This patch makes the following changes:
1. Support ISD FDIV/UDIV/SDIV/UREM/SREM
2. Classify instructions which cost the same
2024-05-12 20:10:51 -07:00
ShihPo Hung
d67c3a4b1f
Revert "[RISCV][TTI] Support fdiv/udiv/sdiv/srem/urem in getArithmeticInstrCost (#89170)"
This reverts commit ed16e7aac44f2024b45d8c6c9dc2817d77d0ea97.
2024-05-12 19:57:40 -07:00
Shih-Po Hung
ed16e7aac4
[RISCV][TTI] Support fdiv/udiv/sdiv/srem/urem in getArithmeticInstrCost (#89170)
This patch makes the following changes:
1. Support ISD FDIV/UDIV/SDIV/UREM/SREM
2. Classify instructions which cost the same
2024-05-13 09:47:57 +08:00
Mel Chen
3f1fef3699
[RISCV] Support interleaved accesses for scalable vector. (#90583)
Support for interleaved accesses for scalable vectors with a factor
of 2 is enabled in the vectorizer. Therefore, this patch removes the
restriction for scalable vectors with a factor of 2.
2024-05-03 21:56:31 +08:00
Shih-Po Hung
097b68ff06
[RISCV][TTI] Refine the cost of FCmp (#88833)
This patch introduces the following changes:
- Support all fp predicates
- Use the Val type to estimate the latency/throughput cost
- Assign a cost of 1 for mask operations as LMULCost for mask types
cannot be correctly estimated.
2024-04-18 09:44:31 +08:00
Shih-Po Hung
f3a8112d98
[RISCV][TTI] Scale the cost of ICmp with LMUL (#88235)
Use the Val type to estimate the instruction cost for ICmp.
2024-04-16 09:37:32 +08:00
Shih-Po Hung
3d985a6f1b
[RISCV][TTI] Scale the cost of Select with LMUL (#88098)
Use the Val type to estimate the instruction cost for SelectInst.
2024-04-10 14:18:15 +08:00
Shih-Po Hung
ee52add6cb
[RISCV][TTI] Implement cost of intrinsic active_lane_mask (#87931)
This patch uses the argument type to infer the LMUL cost for the index
generation, add, and comparison.
2024-04-10 10:08:33 +08:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. ST3 and ST4 are added; ST2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Alexey Bataev
413a66f339
[LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL. (#76172)
This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very
limited capacity via tail-folding and masked load/store/gather/scatter
intrinsics. However, this does not let architectures with active vector
length predication support take advantage of their capabilities.
Architectures with general masked predication support also can only take
advantage of predication on memory operations. By having the
Loop Vectorizer generate Vector Predication intrinsics, which (will)
provide a target-independent way to model predicated vector
instructions, these architectures can make better use of their
predication capabilities.

Our first approach (implemented in this patch) builds on top of the
existing tail-folding mechanism in the LV (just adds a new tail-folding
mode using EVL), but instead of generating masked intrinsics for memory
operations, it generates VP intrinsics for load/store instructions. The
patch adds a new VPlanTransforms to replace the wide header predicate
compare with EVL and updates codegen for load/stores to use VP
store/load with EVL.

Another important part of this approach is how the Explicit Vector Length
is computed. (VP intrinsics define this vector length parameter as
Explicit Vector Length (EVL)). We use an experimental intrinsic
`get_vector_length`, that can be lowered to architecture specific
instruction(s) to compute EVL.
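
The shape of an EVL-driven loop can be sketched as follows. This is a scalar Python model, not vectorizer output: `get_vector_length` is assumed here to return `min(remaining, VF)`, which is the simplest choice, while the real intrinsic may pick a different value.

```python
from typing import List, Tuple


def add_one_with_evl(data: List[int], vf: int) -> Tuple[List[int], List[int]]:
    """Process `data` in chunks of at most `vf` lanes, with no scalar tail.

    Returns the results and the EVL used on each iteration. The EVL
    computation is an assumption standing in for get_vector_length.
    """
    if vf <= 0:
        raise ValueError("VF must be positive")
    out: List[int] = []
    evls: List[int] = []
    i, n = 0, len(data)
    while i < n:
        evl = min(n - i, vf)              # assumed get_vector_length result
        chunk = data[i:i + evl]           # models a VP load of `evl` lanes
        out.extend(x + 1 for x in chunk)  # models a VP add plus a VP store
        evls.append(evl)
        i += evl
    return out, evls


result, evls = add_one_with_evl(list(range(10)), vf=4)
print(evls)  # prints [4, 4, 2]
```

The last iteration simply runs with a shorter EVL, which is what replaces the mask-based tail folding in this mode.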

This patch also adds a new recipe to emit instructions for computing EVL. Using
VPlan in this way will eventually help build and compare VPlans
corresponding to different strategies and alternatives.

Differential Revision: https://reviews.llvm.org/D99750
2024-04-04 18:30:17 -04:00
Shih-Po Hung
97523e5321
[RISCV][TTI] Scale the cost of intrinsic stepvector with LMUL (#87301)
Use the return type to measure the LMUL size for latency/throughput cost
2024-04-04 08:30:15 +08:00
Shih-Po Hung
d7a43a00fe
[RISCV][TTI] Scale the cost of trunc/fptrunc/fpext with LMUL (#87101)
Use the destination data type to measure the LMUL size for
latency/throughput cost
2024-04-02 09:30:51 +08:00
Shih-Po Hung
84f24c2daf
[RISCV][TTI] Scale the cost of intrinsic umin/umax/smin/smax with LMUL (#87245)
Use the return type to measure the LMUL size for throughput/latency cost
2024-04-02 09:26:27 +08:00
Shih-Po Hung
c7954ca312
Recommit "[RISCV] Refine cost on Min/Max reduction (#79402)" (#86480)
This is recommitted as the test and fix for
llvm.vector.reduce.fmaximum/fminimum are covered in #80553 and #80697
2024-04-01 14:44:10 +08:00
ShihPo Hung
aa2d5d5413
Recommit "[RISCV][TTI] Scale the cost of the sext/zext with LMUL (#86617)"
Changes in Recommit:
  Add an additional check on sign/zero extend to the same type.

Original message:
  Use the destination data type to measure the LMUL size for
  latency/throughput cost
2024-03-26 23:41:16 -07:00
Jianjian Guan
05a7b22a01
[RISCV] Add areInlineCompatible for riscv target (#86639)
Inline a callee if its target-features are a subset of the caller's
target-features.
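
The rule amounts to a subset test. A minimal model, using plain sets of feature names rather than LLVM's feature bitsets (the function name mirrors the hook, but this is not its implementation):

```python
from typing import Set


def are_inline_compatible(caller_features: Set[str], callee_features: Set[str]) -> bool:
    """True when every feature the callee relies on is enabled in the caller."""
    return callee_features <= caller_features


# A callee built with only "m" can be inlined into a caller with "m" and "v",
# but a callee that needs "v" cannot be inlined into a caller without it.
print(are_inline_compatible({"m", "v"}, {"m"}))  # prints True
print(are_inline_compatible({"m"}, {"m", "v"}))  # prints False
```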
2024-03-27 14:16:03 +08:00
ShihPo Hung
da3e58e74a
Revert "[RISCV][TTI] Scale the cost of the sext/zext with LMUL (#86617)"
This reverts commit 7545c635729a2055a429c5decd26a619a8d6e74b as it's
failing on the Linux bots.
2024-03-26 21:47:32 -07:00
Shih-Po Hung
7545c63572
[RISCV][TTI] Scale the cost of the sext/zext with LMUL (#86617)
Use the destination data type to measure the LMUL size for
latency/throughput cost
2024-03-27 10:58:17 +08:00
Craig Topper
2fbc40d36d
[RISCV] Split compound if statement to fix a crash.
We're not allowed to call getELEN when the vector extension
is not enabled. If we're looking at a vector type, isTypeLegal would
only return true if the vector extensions are enabled. So early out
for non-vector types before we call isTypeLegal and getELEN.
2024-03-26 11:53:17 -07:00
ShihPo Hung
5dc0c75aab
[RISCV][TTI] Fix missing return in the end of function
2024-03-25 23:32:18 -07:00
Shih-Po Hung
817f453aa5
[RISCV][TTI] Refactor getCastInstrCost to exit early (#86619)
To reduce indentation by using early returns, this patch hoists the
returns for illegal types and non-vector types earlier.

It should mostly be an NFC.
2024-03-26 14:15:40 +08:00
Shih-Po Hung
3cb024198f
[RISCV][CostModel] Estimate cost of llvm.vector.reduce.fmaximum/fminimum (#80697)
The ‘llvm.vector.reduce.fmaximum/fminimum.*’ intrinsics propagate NaNs
if any element of the vector is a NaN.
Following #79402, the patch adds the cost for NaN check (vmfne + vcpop)
2024-03-25 17:17:36 +08:00
Kolya Panchenko
aa68e2814d
[RISCV] Support llvm.masked.compressstore intrinsic (#83457)
The changeset enables lowering of `llvm.masked.compressstore(%data,
%ptr, %mask)` for RVV for fixed vector type into:
```
%0 = vcompress %data, %mask, %vl
%new_vl = vcpop %mask, %vl
vse %0, %ptr, %1, %new_vl
```
Such lowering is only possible when `%data` fits into available LMULs
and otherwise `llvm.masked.compressstore` is scalarized by
`ScalarizeMaskedMemIntrin` pass.
Even though section `15.8` of the RVV spec provides an alternative sequence
for compressstore, `vcompress + vcpop` should be the proper
canonical form to lower `llvm.masked.compressstore`. If a RISC-V target
finds the sequence from `15.8` better, a peephole optimization can
transform `vcompress + vcpop` into that sequence.
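
The intended effect of the three-step sequence can be modelled on plain Python lists. This is a behavioural sketch only: `memory` is a hypothetical flat list standing in for addressable storage, and `ptr` is an index into it.

```python
from typing import List, Sequence


def compressstore(data: Sequence[int], mask: Sequence[bool],
                  memory: List[int], ptr: int) -> int:
    """Store the mask-selected elements of `data` contiguously at `ptr`.

    Returns the number of elements written. Models vcompress + vcpop + vse.
    """
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    compressed = [d for d, m in zip(data, mask) if m]  # vcompress
    new_vl = sum(1 for m in mask if m)                 # vcpop
    if ptr < 0 or ptr + new_vl > len(memory):
        raise IndexError("store would run past the end of memory")
    memory[ptr:ptr + new_vl] = compressed[:new_vl]     # vse with VL = new_vl
    return new_vl


mem = [0] * 6
written = compressstore([10, 20, 30, 40], [True, False, True, True], mem, 1)
print(written, mem)  # prints 3 [0, 10, 30, 40, 0, 0]
```

Only `new_vl` elements are written, so memory past the compressed data is left untouched.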
2024-03-13 15:18:51 -04:00
Visoiu Mistrih Francis
eceb24c439
[RISCV] Hoist immediate addresses from loads/stores (#83644)
In case of loads/stores from an immediate address, avoid rematerializing
the constant for every block and allow consthoist to hoist it to the
entry block.
2024-03-05 22:41:56 -08:00
Shih-Po Hung
fb67dce1cb
[RISCV] Fix crash when unrolling loop containing vector instructions (#83384)
When MVT is not a vector type, TCK_CodeSize should return an invalid
cost. This patch adds a check in the beginning to make sure all cost
kinds return invalid costs consistently.

Before this patch, TCK_CodeSize returned a valid cost on scalar MVT but
other cost kinds didn't.

This fixes the issue #83294 where a loop contains vector instructions
and MVT is scalar after type legalization when the vector extension is
not enabled.
2024-03-02 12:33:55 +08:00
Shih-Po Hung
6ee9c8afbc
[RISCV][CostModel] Updates reduction and shuffle cost (#77342)
- Make `andi` cost 1 in SK_Broadcast
- Query the cost of VID_V, VRSUB_VX/VRSUB_VI which would scale with LMUL
2024-02-29 15:41:19 +08:00
Philip Reames
f037e709ca
[RISCV][TTI] Cost a subvector extract at a register boundary with exact vlen (#82405)
If we have exact vlen knowledge, we can figure out which indices
correspond to register boundaries. Our lowering uses this knowledge to
replace the vslidedown.vi with a sub-register extract. Our costs can
reflect that as well.
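
A sketch of the kind of check this implies, assuming the extract is free when its starting bit offset is a multiple of VLEN and the subvector fits in a single register. The helper and the exact condition are illustrative, not a copy of the TTI code.

```python
def is_subregister_extract(index: int, sub_num_elts: int,
                           elt_bits: int, vlen_bits: int) -> bool:
    """True if a subvector extract starts on a register boundary and fits
    in one register, under an exactly known VLEN (illustrative model)."""
    starts_on_boundary = (index * elt_bits) % vlen_bits == 0
    fits_in_register = sub_num_elts * elt_bits <= vlen_bits
    return starts_on_boundary and fits_in_register


# VLEN=128, i32 elements: extracting 4 elements at index 4 is the second
# register of the group; starting at index 2 would need a slide instead.
print(is_subregister_extract(4, 4, 32, 128))  # prints True
print(is_subregister_extract(2, 4, 32, 128))  # prints False
```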

This is another piece split off from
https://github.com/llvm/llvm-project/pull/80164

---------

Co-authored-by: Luke Lau <luke_lau@icloud.com>
2024-02-21 07:56:08 -08:00