Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).
This will be necessary to correct recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).
Differential Revision: https://reviews.llvm.org/D148149
Currently in LoopVectorize we avoid tail-folding if we can
prove the trip count is always a multiple of the maximum
fixed-width VF. This works because we know the vectoriser
only ever chooses a VF that is a power of 2. However, if
we are also considering scalable VFs then we conservatively
bail out of the optimisation because we don't know the value
of vscale, which could be an odd or prime number, etc.
This patch tries to enable the same optimisation for scalable
VFs by asking if vscale is known to be a power of 2. If so,
we can then query the maximum value of vscale and use the same
logic as we do for fixed-width VFs. I've also added a new TTI
hook called isVScaleKnownToBeAPowerOfTwo that does the same
thing as the existing TargetLowering hook.
Differential Revision: https://reviews.llvm.org/D146199
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155
The loop vectorizer supports generating interleaved loads and stores via
shuffle patterns for fixed length vectors.
This enables it for RISC-V, since interleaved shuffle patterns can be
lowered to vlseg/vsseg in https://reviews.llvm.org/D145022
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D145155
This work follows on from D142109 and addresses a possible regression
when we know the loop iteration counter cannot overflow.
When we know the overflow-check always evaluates to false, it's better to
use the other style of tail folding where it assumes a runtime check was
added, because that avoids having to calculate a modified trip-count.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D142894
It's less clear with scalable vectors than fixed length vectors that
interleaving exposes more ILP, as scalable vectors can be thought of a
sort of hardware form of interleaving, especially with larger LMULs.
This also addresses the unexpected additional unrolling that occurs when
using larger LMULs in the loop vectorizer.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144485
In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor.
This is based off of the approach used here: 8d36708507
The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch.
See https://reviews.llvm.org/D143723#4132349
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D144474
This NFC (intended) patch has several small changes:
* It renames PredicationStyle to TailFoldingStyle.
* It renames TTI.emitActiveLaneMask() to TTI.getPreferredTailFoldingStyle()
* Simplifies some of its uses in the LoopVectorizer
Rationale: To my surprise PredicationStyle::None did not mean 'no
predication', but rather 'no active lane mask intrinsic', such that the
predicate is created using a splat + compare with stepvector. The enum is
also highly specific to tail folding, so it seems better to name this
around that feature, i.e. 'tail folding style'.
This also makes it more amenable to extend it to other tail folding styles,
such as the one added in D142109.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D142887
We don't need to explicitly check ELEN. It's already handled by
isLegalElementTypeForRVV.
This is a leftover from before we had Zve32/64 extensions and used
a command line option for fixed vectors.
This matches the behavior from a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.
The motivation here is that our current cost heuristic treats number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.
The pointer increment version both decreases number of instructions in some loops, and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.
Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
Differential Revision: https://reviews.llvm.org/D142227
LoopUnroll estimates the loop size via getInstructionCost(),
but getInstructionCost() cannot pass CostKind to getVectorInstrCost().
And so does getShuffleCost() to getBroadcastShuffleOverhead(),
getPermuteShuffleOverhead(), getExtractSubvectorOverhead(),
and getInsertSubvectorOverhead().
To address this, this patch adds an argument CostKind to these
functions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D142116
If the subtarget does not support VInstructions, expand vp intrinscs to scalar instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D139706
Need to include the cost of the initial insertelement to the cost of the
broadcasts. Also, need to adjust the cost of the gather/buildvector if
the element is inserted into poison/undef vector.
Differential Revision: https://reviews.llvm.org/D140498
Most of VP intrinsics are implemented in RISC-V backends, but
vp.reduce.mul (element length > 1) does not yet. Legalizes vp.reduce.mul
using ExpandVectorPredication Pass.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D139721
The Zfhmin subset only has FLH, FSH, FMV.X.H, FMV.H.X, FCVT.S.H, and FCVT.H.S.
If the D extension is present, the FCVT.D.H and FCVT.H.D instructions are also included.
Since most instructions are not included for Zfhmin, so most operations are promoted.
The patch primarily about making f16 a legal type.
RISC-V ISA info:
https://wiki.riscv.org/display/HOME/Recently+Ratified+Extensions
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D139391
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
This patch adds basic broadcast shuffle costs in order to enable SLP vectorization.
And adds `getLMULCost` to consider reciprocal throughput for different LMUL.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D137276
Most of the code for this was taken from https://reviews.llvm.org/D133552, with one bug fix by me. I'm landing the plumbing so that we can focus on the cost model pieces in the review.
Epilogue loop vectorization is a feature in the vectorize intended to avoid running fully scalar code when the vector length of the main loop turns out to be either longer than the trip count of the actual loop, or with a huge remainder.
In practice, this feature appears to not have been well tuned. I honestly don't think it should be on by default at all, but it definitely shouldn't be on for RISCV. Note that other targets have also disabled it, but they've done so via disabling interleaving - which is, well, completely unrelated - and we don't want to do that for RISCV.
In the near term, many examples I'm seeing have terrible codegen for epilogue vectorization. We are greatly increasing code size for little value at reasonable VLEN values for small types. In the long term, the cases that epilogue vectorization are intended to handle are likely better handled via tail folding on RISCV.
As an aside, I also don't really trust the correctness of epilogue vectorization. The code structure is such that otherwise straight forward changes sometimes break only epilogue vectorization. The reuse of an existing vplan without careful validation opens significant room for nasty bugs. Given how rarely the code is exercised, that is not a good combination.
As such, this patch introduces a TTI hook, and completely disables epilogue vectorization on RISCV.
Differential Revision: https://reviews.llvm.org/D136695
My original intent had been to reuse this for arithmetic instructions as well, but due to the availability of a immediate splat encoding there, we will need different heuristics. So specialize the existing code for the store case.
This patch adds cost model for vector insert/extract element instructions. In RVV, we could use vector scalar move instruction to insert or extract the first element, and use vslide to move it. But for mask vector or i64 vector in i32 target, we need special instructions to make it.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D133007
This change implements a TTI query with the goal of disabling slp vectorization on RISCV. The current default configuration disables SLP already, but its current tied to the ability to lower fixed length vectors. Over in D131508, I want to enable fixed length vectors for purposes of LoopVectorizer, but preliminary analysis has revealed a couple of SLP specific issues we need to resolve before enabling it by default. This change exists to allow us to enable LV without SLP.
Differential Revision: https://reviews.llvm.org/D132680
This has the effect of exposing the power-of-two property for use in memory op costing, but no target actually uses it yet. The main point of this change is simple consistency with the recently changes getArithmeticInstrCost, and to remove the last (interface) use of OperandValueKind.
This is part of an ongoing transition to use OperandValueInfo which combines OperandValueKind and OperandValueProperties. This change adds some accessor methods and uses them to simplify backend code. The primary motivation of doing so is removing uses of the parameters so that an upcoming api change is less error prone.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
On known hardware, reductions, gather, and scatter operations have execution latencies which correlated with the vector length (VL) of the operation. Most other operations (e.g. simply arithmetic) don't correlated in this way, and instead essentially fixed cost as VL varies.
When I'd implemented initial scalable cost model support for reductions, gather, and scatter operations, I had used an upper bound on the statically unknown VL. The argument at the time was that this prevented falsely low costs, and biased the vectorizer away from generating bad (on some hardware) code. Unfortunately, practical experience shows we were a bit too effective at that goal, and the high costs defacto prevents vectorization using these constructs at all.
This patch reverses course, and ties the returned cost not to the maximum possible VL, but the VL which would correspond to VScaleForTuning. This parameter is the same one the vectorizer uses when normalizing loop costs, so the term effectively cancels out. The result is that the vectorizer now sees these constructs as comparable in cost to their fixed length variants.
This does introduce the possibility of the cost for these operations being a significant under estimate on platforms where actual VLEN is far from that implied by VScaleForTuning. On such platforms, we might make poor heuristic choices. Probably not in LV itself (due to the cancellation mentioned above), but possibly during e.g. lowering. I'm not currently aware of any concrete examples of this, but this patch does open a concern which did not previously exist.
Previously, we had the problem of overestimating costs causing the same problem on machines much closer to default values for vscale for tuning. With this patch, we still have that problem potentially if vscale for tuning is set high (manually), and then the code is run on a narrow VLEN machine.
Differential Revision: https://reviews.llvm.org/D131519
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.
Differential Revision: https://reviews.llvm.org/D129994
This change enables vectorization (using scalable vectorization only, fixed vectors are not yet enabled) for RISCV when vector instructions are available for the target configuration.
At this point, the resulting configuration should be both stable (e.g. no crashes), and profitable (i.e. few cases where scalar loops beat vector ones), but is not going to be particularly well tuned (i.e. we emit the best possible vector loop). The goal of this change is to align testing across organizations and ensure the default configuration matches what downstreams are using as closely as possible.
This exposes a large amount of code which hasn't otherwise been on by default, and thus may not have been fully exercised. Given that, having issues fall out is not unexpected. If you find issues, please make sure to include as much information as you can when reverting this change.
Differential Revision: https://reviews.llvm.org/D129013
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.
This patch adds support for using the active lane mask for control
flow by:
1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extract the first bit of this mask and use this bit for the
conditional branch.
I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
Differential Revision: https://reviews.llvm.org/D125301
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.
The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.
Differential Revision: https://reviews.llvm.org/D129221
This extends the existing cost model for reductions for scalable vectors.
The existing cost model assumes that reductions are roughly logarithmic in cost for unordered variants and linear for ordered ones. This change keeps that same basic model, and extends it out to the maximum number of elements a scalable vector could possibly have.
This results in costs which aren't terribly high for unordered reductions, but are for ordered ones. This seems about right; we want to strongly bias away from using scalable ordered reductions if the cost might be linear in VL.
Differential Revision: https://reviews.llvm.org/D127447
LoopVectorizer uses getVScaleForTuning for deciding how to discount the cost of a potential vector factor by the amount of work performed. Without the callback implemented, the vectorizer was defaulting to an estimated vscale of 1. This results in fixed vectorization looking falsely profitable (since it used the command line VLEN).
The test change is pretty limited since a) we don't have much coverage of the vectorizer with scalable vectors at all, and b) what little coverage we have mostly uses i64 element types. There's a separate issue with <vscale x 1 x i64> which prevents us from getting to this stage of costing, and thus only the one test explicitly written to avoid that is visible in the diff. However, this is actually a very wide impact change as it changes the practical vectorization result when both fixed and scalable is enabled to scalable.
As an aside, I think the vectorizer is at little too strongly biased towards scalable when both are legal, but we can explore that separately. For now, let's just get the cost model working the way it was intended.
Differential Revision: https://reviews.llvm.org/D128547
This brings us into alignment with AArch64, and in the process fixes a compiler crash bug in uniform store handling in the vectorizer.
Before the recent invalid cost bailout work, this would have also avoided crashes on invalid costs in some cases. I honestly think the vectorizer should gracefully bailout on uniform stores it can't use a scatter for, but it doesn't, so lets take the path of least resistance here. It's also possible that there are other vectorizer bugs AArch64 isn't seeing because of this hook; we don't want to be finding them either.
Differential Revision: https://reviews.llvm.org/D127514
i64 indices aren't supported on Zve32*. Scalarize gathers to prevent
generating illegal instructions.
Since InstCombine will aggressively canonicalize GEP indices to
pointer size, we're pretty much always going to have an i64 index.
Trying to predict when SelectionDAG will find a smaller index from
the TTI hook used by the ScalarizeMaskedMemIntrinPass seems fragile.
To optimize this we probably need an IR pass to rewrite it earlier.
Test RUN lines have also been added to make sure the strided load/store
optimization still works.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D127179