166 Commits

Author SHA1 Message Date
Luke Lau
02bb33c3ce [RISCV] Check for alignment when lowering interleaved/deinterleaved loads/stores
As noted by @reames, we should be checking that the memory access is aligned to
the element size (or the unaligned vector memory access feature is enabled)
before lowering vlseg/vsseg intrinsics via the interleaved access pass.
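
The condition described above can be sketched as a small predicate. This is a toy illustration and not the code from the patch: the function name and its parameters are hypothetical, and it assumes alignments are powers of two, so an alignment of at least the element size implies element alignment.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): decides whether an interleaved
// access may be lowered to a vlseg/vsseg intrinsic.
bool canLowerInterleavedAccess(uint64_t AlignBytes, unsigned EltSizeBits,
                               bool HasUnalignedVectorMem) {
  // With the unaligned vector memory access feature, any alignment is fine.
  if (HasUnalignedVectorMem)
    return true;
  // Otherwise the access must be aligned to at least the element size.
  const uint64_t EltSizeBytes = EltSizeBits / 8;
  return AlignBytes >= EltSizeBytes;
}
```

For example, a byte-aligned access of i32 elements would be rejected unless the unaligned feature is enabled.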

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D154536
2023-07-07 15:34:24 +01:00
David Green
12025cef3e [CostModel] Use min/max intrinsics for vecreduce.min/max costs
This changes the costmodelling of the vecreduce.min/max nodes to use the costs
of the relevant min/max intrinsics instead of expanding them to compare and
selects. getMinMaxReductionCost has changed to take an Opcode for the
relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are
no longer needed.

A follow up patch will add some basic fminimum/fmaximum costmodelling.

Differential Revision: https://reviews.llvm.org/D153547
2023-07-04 15:02:30 +01:00
Luke Lau
d0d864f6f4 [SLP] Explicitly pass AccessTy to getGEPCost
Building on D149889, this patch updates SLP to pass the vector type as
the AccessTy to getGEPCost.
This should result in GEPs being costed more often instead
of being treated as foldable into the address mode and thus free, as
some architectures, notably RISC-V, do not have offset+reg addressing
modes for vector memory accesses.

Note that in SLP, GEPs are costed in two places: getPointersChainCost
and GetGEPCostDiff.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D153570
2023-06-29 18:42:24 +01:00
Luke Lau
a68dcd09e8 [TTI] Use users of GEP to guess access type in getGEPCost
Currently getGEPCost uses the target type of the GEP as a heuristic for
the type that will be accessed, to pass onto isLegalAddressingMode.
Targets use this to work out if a GEP can then be folded into the
load/store instruction that uses the GEP.
For example, on RISC-V loads and stores can have an offset added to a
base register folded into a single instruction, so the following GEP is
free:

%p = getelementptr i32, ptr %base, i32 42       ; getInstructionCost = 0
%x = load i32, ptr %p                           ; getInstructionCost = 1
------------------------------------------------------------------------
lw t0, 42(a0)

However vector loads and stores cannot have an offset folded into them,
so the following GEP is costed:

%p = getelementptr <2 x i32>, ptr %base, i32 42 ; getInstructionCost = 1
%x = load <2 x i32>, ptr %p                     ; getInstructionCost = 1
------------------------------------------------------------------------
addi  a0, a0, 42
vle32 v8, (a0)

The issue arises whenever there is a mismatch between the target type of
the GEP and the type that is actually accessed:

%p = getelementptr i32, ptr %base, i32 42       ; getInstructionCost = 0
%x = load <2 x i32>, ptr %p                     ; getInstructionCost = 1
------------------------------------------------------------------------
addi  a0, a0, 42
vle32 v8, (a0)

Even though this GEP will result in an add instruction, because TTI
thinks it's loading an i32, it will think it can be folded and not
charge for it.

The target type can become mismatched with the memory access during
transformations, notably during SLP where a scalar base pointer will
be reused to perform a vector load or store.

This patch adds an optional AccessType argument to getGEPCost which
allows the type of memory accessed by users to be passed in as a hint,
so that we can more accurately determine if the GEP can be folded into
its users.

If AccessType is not provided, getGEPCost falls back to the old
behaviour of using the PointeeType to guess the memory access type. This
can be revisited in a later patch.

Also for now, only GEPs with exactly one user use the access type hint.
Whilst we could look through all users and use all access types to
determine if we can fold the GEP, this patch avoids doing so to prevent
O(N) behaviour.
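
A minimal sketch of the hint and its fallback, using a toy model rather than the real TTI interfaces. The names `AccessKind`, `toyIsLegalAddressingMode` and `toyGEPCost` are hypothetical, and the target rule (scalar accesses fold an offset, vector accesses do not) is a simplification of the RISC-V behaviour described above.

```cpp
#include <optional>

// Toy stand-in for the accessed type: only scalar vs. vector matters here.
enum class AccessKind { Scalar, Vector };

// Toy target rule: scalar loads/stores can fold base+offset, vector ones
// cannot. A GEP without an offset is always trivially foldable.
bool toyIsLegalAddressingMode(AccessKind Access, bool HasOffset) {
  return !HasOffset || Access == AccessKind::Scalar;
}

// Returns 0 if the GEP is expected to fold into its user, 1 otherwise.
// If no access hint is given, fall back to the GEP's own pointee kind.
int toyGEPCost(AccessKind PointeeKind, bool HasOffset,
               std::optional<AccessKind> AccessHint = std::nullopt) {
  const AccessKind Access = AccessHint.value_or(PointeeKind);
  return toyIsLegalAddressingMode(Access, HasOffset) ? 0 : 1;
}
```

In this model, a GEP over i32 that feeds a vector load is free without the hint but costs 1 once the vector access is passed in, which is the mismatch the patch addresses.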

Differential Revision: https://reviews.llvm.org/D149889
2023-06-29 13:44:37 +01:00
Simon Pilgrim
defb5cd783 Fix MSVC "'std::max': no matching overloaded function found" error. NFCI.
2023-06-14 19:32:17 +01:00
Philip Reames
7f26c27e03 [RISCV] Enable SLP by default (when vectors are available)
I propose that we go ahead and enable SLP by default. Over the last few weeks, @luke and I have been working through codegen issues seen at small VLs from a couple of SPEC workloads. We still have a ways to go to get optimal codegen, but we're at the point where having a single configuration we're all tuning against is probably the right default.

As a bit of history, I introduced this TTI hook in a310637132 in August of last year to unblock enabling LoopVectorizer. At the time, we had a couple known issues: constant materialization, address generation, and a general lack of maturity of small fixed vector codegen. By now, each of these has had significant investment. I can't say any of them are completely fixed, but we're no longer seeing instances of them every place we look.

What we're mostly seeing at this point is a long tail of code gen opportunities, many involving build vectors, shuffles, and extract patterns. I have a couple patches up to continue iterating on those issues, but I don't think they need to be blockers for enabling SLP.

Differential Revision: https://reviews.llvm.org/D152750
2023-06-14 09:49:58 -07:00
Craig Topper
6ac2ce7d84 [RISCV] Introduce the concept of DLEN(data path width) into getLMULCost.
SiFive's x280 CPU has a vector unit that is VLEN/2 bits wide. This
means that LMUL=1 operations take 2 cycles to process all VLEN bits.

This patch adds a DLenFactor tuning parameter and applies it to
TuneSiFive7. getLMULCost has been updated to use this factor in
its calculations. I've added an x280 command line to one cost
model test to demonstrate the effect.
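
As a rough illustration of the effect, the cost can be thought of as the number of passes through a datapath that is VLEN / DLenFactor bits wide. The sketch below is a toy model under that assumption, not the actual getLMULCost; `ceilDiv` and `toyLMULCost` are hypothetical names.

```cpp
// Integer division rounding up.
constexpr unsigned ceilDiv(unsigned N, unsigned D) { return (N + D - 1) / D; }

// Toy model: passes needed to process VectorBits bits on a datapath that is
// VLen / DLenFactor bits wide. Assumes DLenFactor divides VLen.
constexpr unsigned toyLMULCost(unsigned VectorBits, unsigned VLen,
                               unsigned DLenFactor) {
  return ceilDiv(VectorBits, VLen / DLenFactor);
}
```

With VLEN=128, an LMUL=1 operation costs 1 when DLenFactor is 1 and 2 when DLenFactor is 2, matching the x280 description above.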

Reviewed By: arcbbb

Differential Revision: https://reviews.llvm.org/D152421
2023-06-13 16:09:25 -07:00
Graham Hunter
95bfb1902d [LV][AArch64] Allow (limited) interleaving for scalable vectors
This patch uses the (de)interleaving intrinsics introduced in
D141924 to handle vectorization of interleaving groups with a
factor of 2 for scalable vectors.

Reviewed By: fhahn, reames

Differential Revision: https://reviews.llvm.org/D145163
2023-06-09 11:42:10 +01:00
Luke Lau
c27a0b21c5 [SLP][RISCV] Account for offset folding in getPointersChainCost
For a GEP in a pointer chain, if:
1) a pointer chain is unit-strided
2) the base pointer wasn't folded and is sitting in a register somewhere
3) the distance between the GEP and the base pointer is small enough and
   can be folded into the addressing mode of the using load/store

Then we can exclude that GEP from the total cost of the pointer chain,
as it will likely be folded away.
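
The three conditions can be sketched with a toy cost function. This is not the hook added by the patch: the names are hypothetical, the chain is assumed to be unit-strided with the base pointer already in a register (conditions 1 and 2), and foldability is modelled only by the 12-bit signed immediate of RISC-V scalar loads and stores.

```cpp
#include <cstdint>
#include <vector>

// RISC-V scalar loads/stores encode a 12-bit signed immediate offset.
bool fitsInImm12(int64_t Offset) { return Offset >= -2048 && Offset <= 2047; }

// Toy model: counts the GEPs in a chain that still need an add instruction.
// OffsetsFromBase holds each GEP's byte distance from the base pointer.
unsigned toyPointerChainCost(const std::vector<int64_t> &OffsetsFromBase,
                             bool IsVectorAccess) {
  unsigned Cost = 0;
  for (int64_t Off : OffsetsFromBase) {
    // A zero offset needs no address computation at all.
    if (Off == 0)
      continue;
    // Vector accesses cannot fold an offset; scalar ones can if it fits.
    const bool Foldable = !IsVectorAccess && fitsInImm12(Off);
    if (!Foldable)
      ++Cost;
  }
  return Cost;
}
```

Under this model a scalar chain with small offsets is free, which is what makes the scalar cost cheaper relative to the vectorized alternative.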

In order to check if 3) holds, we need to know the type of memory access
being made by the users of the pointer chain. For that, we need to pass
along a new argument to getPointersChainCost. (Using the source pointer
type of the GEP isn't accurate, see https://reviews.llvm.org/D149889 for
more details).

Also note that 2) is currently an assumption, and could be modelled more
accurately.

This prevents some unprofitable cases from being SLP vectorized on
RISC-V by making the scalar costs cheaper and closer to the actual
codegen.

For now the getPointersChainCost hook is duplicated for RISC-V to prevent
disturbing other targets, but could be merged back in and shared with
other targets in a following patch.

Reviewed By: ABataev

Differential Revision: https://reviews.llvm.org/D149654
2023-05-22 13:55:30 +01:00
Craig Topper
ffa32cd11e [RISCV] Disable constant hoisting for multiply by a power of 2.
2023-05-20 19:20:49 -07:00
Craig Topper
1e6d069709 [RISCV] Simplify and improve getLMULCost.
Use divideCeil for fixed vectors which avoids the need for a std::max
with 1 and should be more correct for odd sized vectors if those
occur.
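
To illustrate the difference for odd sized vectors, here is a standalone comparison of the two formulas. The function names are hypothetical and the arithmetic is a simplification of what getLMULCost computes.

```cpp
#include <algorithm>

// Old style: truncating division, clamped so that small vectors cost 1.
unsigned costWithMax(unsigned Bits, unsigned VLen) {
  return std::max(1u, Bits / VLen);
}

// New style: rounding up already yields at least 1 for any non-zero size.
unsigned costWithDivideCeil(unsigned Bits, unsigned VLen) {
  return (Bits + VLen - 1) / VLen;
}
```

For a 192-bit vector with VLEN=128, the truncating version reports 1 register while the rounding version reports 2.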

Use conditional operator instead of an if/else.
2023-05-19 14:09:50 -07:00
Philip Reames
c501aa8843 [RISCV][TTI] Model shuffle mask materialization with correct index type
We were modeling these as if the index type was always e8, but the actual
lowering uses the data type width if legal.  We also weren't accounting for
i64 on xlen32 correctly.

Noticed via inspection while working on the shuffle/buildvec lowering.  Note
that this costing is also wrong in a more major way - we don't actually use
a constant pool load in many cases.  But that's a separate issue.
2023-05-05 12:26:21 -07:00
Yeting Kuo
35c877a6f0 [RISCV] Custom lower vector nearbyint and rint in RISC-V.
The patch lowers vector rint/nearbyint like vp.rint/nearbyint.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D148619
2023-04-19 11:07:23 +08:00
Simon Pilgrim
fb8038db73 [TTI] getExtendedReductionCost - replace std::optional<FastMathFlags> args with FastMathFlags
Followup to D148149 where it was noticed that the std::optional wrapper wasn't helping with anything (we can just use an empty FastMathFlags()).
2023-04-13 11:26:28 +01:00
Simon Pilgrim
9e30b87afb [TTI] getMinMaxReductionCost - add FastMathFlag argument
Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).

This will be necessary to correctly recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).

Differential Revision: https://reviews.llvm.org/D148149
2023-04-13 10:42:42 +01:00
Philip Reames
b0e0c1e46c [RISCV][TTI] Call improveShuffleKindFromMask like all the other backends
No test diff; noticed via inspection.
2023-04-12 17:43:36 -07:00
Philip Reames
27b6ddbf6e [RISCV] Speculative fix for issue reported against D147470 post commit
2023-04-05 17:25:42 -07:00
Philip Reames
0e6d7eceaa [RISCV][TTI] Cost model for SK_ExtractSubvector
Differential Revision: https://reviews.llvm.org/D147618
2023-04-05 09:56:43 -07:00
Philip Reames
37646a2c28 [RISCV] Account for LMUL in memory op costs
Generally, the cost of a memory op will scale with the number of vector registers accessed. Machines might exist which have a narrower memory access width than vector register width, but machines with a wider memory access width than vector register width seem unlikely.

I noticed this because we were preferring wide loads + deinterleaves on examples where the cost of a short gather (actually a strided load) would be better. Touching 8 vector registers instead of doing a 4 element gather is not a good tradeoff.
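
The tradeoff can be sketched with a toy comparison. Both cost functions are assumptions made for illustration (one unit per vector register touched by the wide load, one unit per element of the gather), and the names are hypothetical rather than taken from the patch.

```cpp
// Toy model: a wide load costs one unit per vector register it touches.
unsigned toyWideLoadCost(unsigned VectorBits, unsigned VLen) {
  return (VectorBits + VLen - 1) / VLen;
}

// Toy model: a gather (or strided load) costs one unit per element.
unsigned toyGatherCost(unsigned NumElts) { return NumElts; }

// True if the short gather is the cheaper choice under this model.
bool preferGather(unsigned VectorBits, unsigned VLen, unsigned NumElts) {
  return toyGatherCost(NumElts) < toyWideLoadCost(VectorBits, VLen);
}
```

With VLEN=128, a 1024-bit wide load touches 8 registers, so a 4 element gather wins once LMUL is accounted for.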

Differential Revision: https://reviews.llvm.org/D147470
2023-04-05 07:58:56 -07:00
Philip Reames
8865ed4dbb [RISCV][TTI] Cost SK_Transpose as a generic two element shuffle
This matches the actual lowering.  The previous costing was "as if" it had been fully scalarized.
2023-04-04 12:52:50 -07:00
Luke Lau
273f736fcc [RISCV] Add FIXME comment about expensive vector mem op costs
2023-04-04 16:45:54 +01:00
Luke Lau
971a4501f7 [RISCV] Model vlseg/vsseg in interleaved memory ops
If the legalized type is a legal interleaved access type (i.e. there's a
supported vlseg/vsseg instruction for it), the interleaved access pass
will pick any interleaved memory op (wide load + shuffles) and lower it
into a vlseg/vsseg intrinsic.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D146522
2023-04-04 15:05:14 +01:00
Philip Reames
57492b1eeb [RISCV] Cost model for general case of dual vector permute
The cost model was not accounting for the fact that we can generate a dual vrgather + an index expression sequence instead of scalarizing.

A couple cases to call out:

1) I did not model the difference between vrgather and vrgatherei16. The result is the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 (which is the limit of what this costs) has some room for improvement.
3) As indicated by the *regression* in reported cost for <2 x iN> vectors, our current vector lowering is missing support for a sub-case where scalarize-and-insert is actually faster than the generic fallback path.

Differential Revision: https://reviews.llvm.org/D147063
2023-03-29 07:36:35 -07:00
Philip Reames
4b7b612c5d [RISCV][TTI] Extract getConstantPoolLoadCost helper routine [nfc]
We had 3 copies of this code, and I am about to add a fourth.
2023-03-28 07:48:09 -07:00
Philip Reames
64f69e453e [RISCV] Cost model for general case of single vector permute
The cost model was not accounting for the fact that we can generate vrgather + an index expression.

Two cases to call out.
1) I did not model the difference between vrgather and vrgatherei16. The result is the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 (which is the limit of what this costs) has some room for improvement.

Differential Revision: https://reviews.llvm.org/D147000
2023-03-28 07:34:11 -07:00
Craig Topper
29463612d2 [RISCV] Replace RISCV -> RISC-V in comments. NFC
To be consistent with RISC-V branding guidelines
https://riscv.org/about/risc-v-branding-guidelines/
Think we should be using RISC-V where possible.

More patches will follow.

Reviewed By: asb

Differential Revision: https://reviews.llvm.org/D146449
2023-03-27 09:50:17 -07:00
Luke Lau
f23ea4cbd4 [RISCV] Model select and insertsubvector shuffle kinds
Selects get lowered to a vmerge with a mask, and insertsubvectors get
lowered to a vslideup.

Differential Revision: https://reviews.llvm.org/D146747
2023-03-24 17:30:32 +00:00
Luke Lau
8d16c6809a [RISCV] Increase default vectorizer LMUL to 2
After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Is suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead this just does the easy, pragmatic thing by changing the default to a safe value that doesn't affect register pressure significantly[1], but should increase throughput and unlock more interleaving.

[1] Register spilling when compiling sqlite at various levels of `-riscv-v-register-bit-width-lmul`:

LMUL=1    2573 spills
LMUL=2    2583 spills
LMUL=4    2819 spills
LMUL=8    3256 spills

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D143723
2023-03-23 10:33:50 +00:00
Luke Lau
b9238abe05 [RISCV] Enable interleaved access vectorization
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145155
2023-03-16 15:48:55 +00:00
Luke Lau
4e1ba0c518 [RISCV] Don't accidentally match deinterleave masks as interleaves
Consider a shuffle mask of <0, 2>:
This is one of two deinterleave masks to deinterleave a vector of 4
elements with factor 2.
Unfortunately, this is also technically an interleave mask, where
two subvectors of length 1 at indexes 0 and 2 will be interleaved.
This is because a mask can interleave non-contiguous subvectors:
e.g. <0, 6, 4, 1, 7, 5> on a vector of size 8:

```
<0 1 2 3 4 5 6 7> indices
 ^ ^     ^ ^ ^ ^
 0 0     2 2 1 1  deinterleaved subvector
```

This means that deinterleaving shuffles can accidentally be costed as
interleaves.
And it's incorrect in the context of interleaves, because the
only interleave shuffles we model at the moment are single permutation
shuffles, i.e. we are interleaving the first vector below and ignoring
the second:

shufflevector <2 x i32> %v0, <2 x i32> poison, <2 x i32> <i32 0, i32 2>

A mask of <0, 2> interleaves across both vectors.

The fix here is to set NumInputElts correctly: We were setting it to
twice the mask length, i.e. using both input vectors. But in fact we're
actually only using the first vector here, and isInterleaveMask actually
already has logic to ensure that the mask indices stay within the bounds
of the input vectors.
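
The role of NumInputElts can be seen in a simplified version of the check. This sketch is not ShuffleVectorInst::isInterleaveMask: the name is hypothetical, undef mask elements are not handled, and only the bounds logic described above is modelled.

```cpp
#include <vector>

// Simplified check: the mask interleaves Factor subvectors, each of which
// is a run of consecutive indices lying entirely within NumInputElts.
bool isSimpleInterleaveMask(const std::vector<int> &Mask, unsigned Factor,
                            unsigned NumInputElts) {
  if (Factor == 0 || Mask.empty() || Mask.size() % Factor != 0)
    return false;
  const unsigned LaneLen = static_cast<unsigned>(Mask.size()) / Factor;
  for (unsigned Lane = 0; Lane < Factor; ++Lane) {
    const int Start = Mask[Lane];
    if (Start < 0)
      return false;
    // The whole subvector must stay within the bounds of the inputs.
    if (static_cast<unsigned>(Start) + LaneLen > NumInputElts)
      return false;
    // Every Factor-th mask element must continue the run from Start.
    for (unsigned J = 1; J < LaneLen; ++J)
      if (Mask[J * Factor + Lane] != Start + static_cast<int>(J))
        return false;
  }
  return true;
}
```

With a mask of <0, 2> and a factor of 2, the check passes when NumInputElts is 4 (both input vectors) and fails when it is 2 (the first vector only), which is the behaviour change the fix relies on.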

This lacks a test case due to how we're unable to test deinterleave
shuffles (because they are length changing), but is covered in the tests
in D145155

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D146176
2023-03-16 15:48:51 +00:00
Luke Lau
fc220a1aa9 Revert "[RISCV] Enable interleaved access vectorization"
This reverts commit acc03ad10af4f379a644e3956cb9aca54e40696c.
2023-03-15 22:00:48 +00:00
Luke Lau
acc03ad10a [RISCV] Enable interleaved access vectorization
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145155
2023-03-15 21:56:30 +00:00
Philip Reames
4e3608bf29 [RISCV][TTI] Fix indentation and remove tabs [nfc]
2023-03-15 09:29:03 -07:00
Ben Shi
cb45be2b4f [RISCV][NFC] Combine identical switch cases in TTI
Reviewed By: craig.topper, asb

Differential Revision: https://reviews.llvm.org/D146008
2023-03-15 08:27:58 +08:00
Luke Lau
a9d9616c0d [RISCV][NFC] Share interleave mask checking logic
This adds two new methods to ShuffleVectorInst, isInterleave and
isInterleaveMask, so that the logic to check if a shuffle mask is an
interleave can be shared across the TTI, codegen and the interleaved
access pass.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D145971
2023-03-14 11:02:52 +00:00
Philip Reames
ca0cd670dc [RISCV] Improve SK_Reverse shuffle costs for fixed length vectors
As noted by @luke (https://reviews.llvm.org/D145953#inline-1409312), we were accounting for the cost of vector element size using vlenb whereas the expression can be constant folded for fixed length vectors.

Differential Revision: https://reviews.llvm.org/D145973
2023-03-13 15:17:42 -07:00
Philip Reames
64fc41ad82 [RISCV] Extend SK_Broadcast costing to scalable vectors
The existing scalable costing was just bad.  No LMUL cost, no i1 specific costing, etc.  We had updated the fixed cost model, but none of the code is actually fixed length specific.  Moving it down handles the scalable cases too.
2023-03-13 11:07:26 -07:00
Philip Reames
a37dfbb79c [RISCV] Fallback to scalable lowering costs for fixed length vectors
Fixed vector costs may be more precise, but the actual lowering will use scalable vectors if nothing better is available.  During review, we noticed a case where the fixed vector reverse cost model can be improved; that will follow separately.

Differential Revision: https://reviews.llvm.org/D145953
2023-03-13 10:07:57 -07:00
Philip Reames
cfcf274245 [RISCV] Inline and delete RISCVTTIImpl::getSpliceCost [nfc]
The code structure was copied from AArch64 which has a much more complicated splice cost model.
2023-03-13 08:55:32 -07:00
Philip Reames
21bca796d7 [RISCV] Use switch in RISCVTargetTransformInfo::getShuffleCost [nfc]
Refactoring in advance of a semantic change.
2023-03-13 08:40:47 -07:00
Luke Lau
c417266db5 [RISCV] Model interleave and deinterleave shuffles in cost model
Interleave and deinterleave shuffles are lowered by a more efficient
sequence if the element size is smaller than ELEN.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D145678
2023-03-10 01:10:00 +00:00
ShihPo Hung
fb661e2554 [CostModel][RISCV] Model code size cost for reduction
Since code-size cost doesn't scale linearly with LMUL,
this change is to separate it from throughput.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D142068
2023-03-05 17:58:45 -08:00
Kazu Hirata
a28b252d85 Use APInt::getSignificantBits instead of APInt::getMinSignedBits (NFC)
Note that getMinSignedBits has been soft-deprecated in favor of
getSignificantBits.
2023-02-19 23:56:52 -08:00
Kazu Hirata
e078201835 [Target] Use llvm::count{l,r}_{zero,one} (NFC)
2023-01-28 09:23:07 -08:00
Kazu Hirata
7a3e87298e [RISCV] Use llvm::bit_floor and std::clamp (NFC)
2023-01-28 00:49:38 -08:00
Philip Reames
a9871772a8 [RISCV][LSR] Treat number of instructions as dominant factor in LSR cost decisions
This matches the behavior from a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.

The motivation here is that our current cost heuristic treats number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.

The pointer increment version both decreases number of instructions in some loops, and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.

Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
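
The change in priority amounts to a different lexicographic ordering of the cost fields. The sketch below uses a reduced, hypothetical `ToyLSRCost` struct with only three fields; the real cost structure has more, and the function names are illustrative only.

```cpp
#include <tuple>

// Reduced stand-in for an LSR cost record (hypothetical field subset).
struct ToyLSRCost {
  unsigned Insns;
  unsigned NumRegs;
  unsigned AddRecCost;
};

// Instruction count is the dominant factor; registers break ties.
bool isLessInsnsFirst(const ToyLSRCost &A, const ToyLSRCost &B) {
  return std::tie(A.Insns, A.NumRegs, A.AddRecCost) <
         std::tie(B.Insns, B.NumRegs, B.AddRecCost);
}

// Register count is the dominant factor; instructions break ties.
bool isLessRegsFirst(const ToyLSRCost &A, const ToyLSRCost &B) {
  return std::tie(A.NumRegs, A.Insns, A.AddRecCost) <
         std::tie(B.NumRegs, B.Insns, B.AddRecCost);
}
```

A solution with 3 instructions and 5 registers beats one with 4 instructions and 4 registers under the first ordering but loses under the second, which is how a single extra register use can flip the LSR result.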

Differential Revision: https://reviews.llvm.org/D142227
2023-01-24 11:42:37 -08:00
ShihPo Hung
5fb3a57ea7 [Cost] Add CostKind to getVectorInstrCost and its related users
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
Likewise, getShuffleCost() cannot pass it to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().

To address this, this patch adds an argument CostKind to these
functions.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D142116
2023-01-21 05:29:24 -08:00
liqinweng
1f8746cc80 [RISCV][CostModel] Add half type support for the cost model of sqrt/fabs
1. Refactor for costs of sqrt/fabs
2. Add half type support for the cost model of sqrt/fabs

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D132908
2023-01-09 12:57:03 +08:00
liqinweng
f3408739da [RISCV][CostModel] Add cost model for integer abs
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D132999
2023-01-09 11:38:24 +08:00
Alexey Bataev
9b5f62685a [SLP]Fix cost of the broadcast buildvector/gather.
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.

Differential Revision: https://reviews.llvm.org/D140498
2023-01-06 09:25:05 -08:00