Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).
This will be necessary to correctly recognize fast/nnan fmax/fmin reductions which can avoid NaN handling. That will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).
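As an illustrative sketch (not taken from the patch), this is the kind of reduction whose fast-math flags should let the cost model skip NaN handling:
```
; fmax reduction carrying 'fast' (implies nnan), so no NaN-propagation fixup is needed
%max = call fast float @llvm.vector.reduce.fmax.v4f32(<4 x float> %v)
```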
Differential Revision: https://reviews.llvm.org/D148149
Generally, the cost of a memory op will scale with the number of vector registers accessed. Machines might exist which have a narrower memory access width than vector register width, but machines with a wider memory access width than vector register width seem unlikely.
I noticed this because we were preferring wide loads + deinterleaves in cases where a short gather (actually a strided load) would be cheaper. Touching 8 vector registers instead of doing a 4 element gather is not a good tradeoff.
Differential Revision: https://reviews.llvm.org/D147470
If the legalized type is a legal interleaved access type (i.e. there's a
supported vlseg/vsseg instruction for it), the interleaved access pass
will pick up any interleaved memory op (wide load + shuffles) and lower it
into a vlseg/vsseg intrinsic.
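As a rough sketch (the value names are made up), a factor-2 load of this shape is the kind of pattern the pass converts into a vlseg2-based intrinsic:
```
; wide load feeding two factor-2 deinterleave shuffles
%wide = load <8 x i32>, ptr %p
%even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%odd  = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
```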
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D146522
The cost model was not accounting for the fact that we can generate a dual vrgather + an index expression sequence instead of scalarizing.
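For illustration only (not a test from the patch), a general two-source permute of this shape is the case being costed:
```
; neither a slide nor a select pattern, so it needs a vrgather per source
%res = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 3, i32 6>
```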
A couple cases to call out:
1) I did not model the difference between vrgather and vrgatherei16. The result is that the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 elements (which is the limit of what this patch costs) has some room for improvement.
3) As indicated by the *regression* in reported cost for <2 x iN> vectors, our current vector lowering is missing support for a sub-case where scalarize-and-insert is actually faster than the generic fallback path.
Differential Revision: https://reviews.llvm.org/D147063
The cost model was not accounting for the fact that we can generate vrgather + an index expression.
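For illustration (again, not from the patch), a general single-source permute with an arbitrary constant mask is the case in question:
```
; arbitrary single-source permute: vrgather with a constant-pool index vector
%res = shufflevector <8 x i8> %v, <8 x i8> poison, <8 x i32> <i32 7, i32 3, i32 6, i32 1, i32 0, i32 5, i32 2, i32 4>
```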
Two cases to call out:
1) I did not model the difference between vrgather and vrgatherei16. The result is that the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 elements (which is the limit of what this patch costs) has some room for improvement.
Differential Revision: https://reviews.llvm.org/D147000
After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Is suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead, this just does the easy, pragmatic thing and changes the default to a safe value that doesn't affect register pressure significantly[1], but should increase throughput and unlock more interleaving.
[1] Register spilling when compiling sqlite at various levels of `-riscv-v-register-bit-width-lmul`:
LMUL=1 2573 spills
LMUL=2 2583 spills
LMUL=4 2819 spills
LMUL=8 3256 spills
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D143723
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
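As a sketch of what the vectorizer emits (names made up), a factor-2 interleaved store is an interleave shuffle feeding a single wide store:
```
; interleave two vectors, then store the combined vector contiguously
%ab = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
store <8 x i32> %ab, ptr %p
```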
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155
Consider a shuffle mask of <0, 2>:
This is one of two deinterleave masks to deinterleave a vector of 4
elements with factor 2.
Unfortunately, this is also technically an interleave mask, where
two subvectors of length 1 at indexes 0 and 2 will be interleaved.
This is because a mask can interleave non-contiguous subvectors:
e.g. <0, 6, 4, 1, 7, 5> on a vector of size 8:
```
<0 1 2 3 4 5 6 7> indices
 ^ ^     ^ ^ ^ ^
 0 0     2 2 1 1  deinterleaved subvector
```
This means that deinterleaving shuffles can accidentally be costed as
interleaves.
This is also incorrect in the context of interleaves, because the
only interleave shuffles we model at the moment are single-source
permutation shuffles, i.e. we are interleaving the first vector below
and ignoring the second:
shufflevector <2 x i32> %v0, <2 x i32> poison, <2 x i32> <i32 0, i32 2>
A mask of <0, 2> interleaves across both vectors.
The fix here is to set NumInputElts correctly: We were setting it to
twice the mask length, i.e. using both input vectors. But in fact we're
only using the first vector here, and isInterleaveMask already has
logic to ensure that the mask indices stay within the bounds of the
input vectors.
This lacks a test case because we're unable to test deinterleave
shuffles directly (they are length changing), but it is covered by the
tests in D145155.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D146176
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155
This adds two new methods to ShuffleVectorInst, isInterleave and
isInterleaveMask, so that the logic to check if a shuffle mask is an
interleave can be shared across the TTI, codegen and the interleaved
access pass.
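For reference, this is the kind of mask the shared helper is meant to recognize (illustrative example, not from the patch):
```
; factor-2 interleave of two <4 x i16> operands
%il = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
```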
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D145971
The existing scalable costing was just bad. No LMUL cost, no i1 specific costing, etc. We had updated the fixed cost model, but none of the code is actually fixed-length specific. Moving it down handles the scalable cases too.
Fixed vector costs may be more precise, but the actual lowering will use scalable vectors if nothing better is available. During review, we noticed a case where fixed vector reverse can be improved cost-model-wise; that will follow separately.
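For reference, the fixed-length vector reverse mentioned above is just a shuffle with a descending mask (illustrative only):
```
; fixed-length reverse; the scalable form goes through a reverse intrinsic instead
%rev = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
```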
Differential Revision: https://reviews.llvm.org/D145953
Interleave and deinterleave shuffles are lowered via a more efficient
sequence if the element size is smaller than ELEN.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145678
Since code-size cost doesn't scale linearly with LMUL,
this change separates it from the throughput cost.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D142068
This matches the behavior of a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.
The motivation here is that our current cost heuristic treats the number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example, consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.
The pointer increment version both decreases the number of instructions in some loops and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.
Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
Differential Revision: https://reviews.llvm.org/D142227
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
The same holds for getShuffleCost(), which cannot pass CostKind on to
getBroadcastShuffleOverhead(), getPermuteShuffleOverhead(),
getExtractSubvectorOverhead(), and getInsertSubvectorOverhead().
To address this, this patch adds a CostKind argument to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
1. Refactor the cost handling of sqrt/fabs.
2. Add half type support to the cost model of sqrt/fabs (see the example below).
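The half-type operations now covered look like this (illustrative calls, not taken from the patch's tests):
```
; scalar and vector half-precision intrinsics now costed
%s = call half @llvm.sqrt.f16(half %x)
%a = call <4 x half> @llvm.fabs.v4f16(<4 x half> %v)
```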
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D132908
Need to include the cost of the initial insertelement in the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into a poison/undef vector.
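For context, a broadcast in IR is an insertelement into poison followed by a zero-mask shuffle, so the insertelement's cost belongs in the total (sketch only):
```
; build the splat: the initial insertelement must be counted too
%ins   = insertelement <4 x i32> poison, i32 %x, i32 0
%splat = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
```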
Differential Revision: https://reviews.llvm.org/D140498
The patch also adds expandVPCTLZ and expandVPCTTZ to expand vp.ctlz/vp.cttz nodes,
and adds a cost model for vp.ctlz/vp.cttz.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D140370
The patch also adds expandVPCTPOP in TargetLowering to expand VP_CTPOP nodes.
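An example of the node being expanded, as it appears in IR (illustrative only):
```
; masked, EVL-predicated population count
%pop = call <4 x i32> @llvm.vp.ctpop.v4i32(<4 x i32> %x, <4 x i1> %m, i32 %evl)
```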
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D139920
The patch also adds a function, expandVPBITREVERSE, to expand ISD::VP_BITREVERSE nodes.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D139697
This reverts commit 7883e5b061bdbbe8bee5f479ebe911db5045b7e9.
The original commit was reverted because it didn't update test files after D136263
landed. The recommit fixes those.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D139509
The patch makes VectorLegalizer expand ISD::VP_FSHL and ISD::VP_FSHR to
achieve the desired codegen.
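An example of the nodes being expanded, as they appear in IR (illustrative only; operand order follows the VP funnel-shift intrinsics):
```
; EVL-predicated funnel shift left
%r = call <4 x i32> @llvm.vp.fshl.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i1> %m, i32 %evl)
```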
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D138379