Patch adds usage of processShuffleMasks in TTI for RISCV. This function is already used for X86
shuffles estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE
functions and in RISCV codegen.
Patch allows better cost estimation for sparse masks and unifies
cost/codegen between different targets/passes
Reviewers: preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/118103
I added some CodeGen test cases related to reduce. To maintain
consistency, I also added cases for instructions like
`vector.reduce.or`.
For cases where `v1i1` type generates `VFIRST`, please refer to:
https://reviews.llvm.org/D139512.
This optimization was introduced by #70469.
Like AArch64, we allow tail expansions for 3 on RV32 and 3/5/6
on RV64.
This can simplify the comparison and reduce the number of blocks.
This patch implements the cost when the size of the vector need to split
into multiple groups and the index exceed single vector group.
For extract element, we need to store split vectors to stack and load
the target element.
For insert element, we need to store split vectors to stack and store
the target element and load vectors back.
After this patch, the cost of insert/extract element will close to the
generated assembly.
This is effectively the same due to how the mask instructions have an
LMUL of 1 and cost of 1, but matches how we use LT.first elsewhere in
RISCVTargetTransformInfo.cpp by using it to multiply another
instruction cost.
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.
getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().
SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
This is split off from #115274. There doesn't seem to be an easy way to
share this with getShuffleCost since that requires passing in a real
insert_element operand to get it to recognise it's a scalar splat.
For i1 vectors we can't currently lower them so it returns an invalid
cost.
---------
Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>
This patch moves the `areInlineCompatible` implementation from multiple
subclasses (`AArch64TTIImpl`, `RISCVTTIImpl`, `WebAssemblyTTIImpl`) to
the base class `BasicTTIImpl`. The new implementation checks whether the
callee's target features are a subset of the caller's, enabling
consistent behavior across targets. Subclasses now simply delegate to
the base implementation, reducing code duplication and improving
maintainability.
We have a lot of code in RISCVTTIImpl::getIntrinsicInstrCost for vp
intrinsics, which just forward the cost to the underlying non-vp cost
function.
However I just also noticed that there is generic code in BasicTTIImpl's
getIntrinsicInstrCost that does the same thing, added in #67178. The
only difference is that BasicTTIImpl doesn't yet handle it for
type-based costing. There doesn't seem to be any reason that it can't
since it's just inspecting the argument types.
This shuffles the VP costing up to handle both regular and type-based
costing, which allows us to deduplicate some of the VP specific costing
in RISCVTTIImpl by delegating it to BasicTTIImpl.h. More of those nodes
can be moved over to BasicTTIImpl.h later.
It's not NFC since it picks up a couple of VP nodes that had slipped
through the cracks. Future PRs can begin to move more of the code from
RISCVTTIImpl to BasicTTIImpl.
The VP variants simply return the same costs as the non-VP variants.
This assumes that reductions are VL predicated, and that VL predication
has no additional cost.
There are two passes that have dependency on the implementation
of `TargetTransformInfo::enableMemCmpExpansion` : `MergeICmps` and
`ExpandMemCmp`.
This PR adds the initial implementation of `enableMemCmpExpansion`
so that we can have some basic benefits from these two passes.
We don't enable expansion when there is no unaligned access support
currently because there are some issues about unaligned loads and
stores in `ExpandMemcmp` pass. We should fix these issues and enable
the expansion later.
Vector case hasn't been tested as we don't generate inlined vector
instructions for memcmp currently.
Reviewers: preames, arcbbb, topperc, asb, dtcxzyw
Reviewed By: topperc, preames
Pull Request: https://github.com/llvm/llvm-project/pull/107548
If only one of the elements is actually used, then we can legally use a
strided load in place of the segment load. Doing so reduces vector
register pressure, so if both segment and strided are believed to be
element/segment at a time, then prefer the strided load variant.
Note that I've seen the vectorizer emitting wide interleave loads to
represent a strided load, so this does happen in practice. It doesn't
matter much for small LMUL*NF, but at large NF can start causing
problems in register allocation.
Note that this patch only covers the fixed vector formation cases. In
theory, we should do the same patch for scalable, but we can currently
only represent NF2 in scalable IR, and NF2 is assumed to be optimized to
better than segment-at-a-time by default, so there's currently nothing
to do.
This is a follow up to #111511, where after benchmarking we learnt that
the Banana Pi F3 has fast segmented loads for not just NF=2, but also
NF=3 and NF=4:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput
This adds tuning features to allow these segment loads and stores to be
costed cheaper and enables it for the spacemit-x60.
It also enables +optimized-nf2-segment-load-store by default in the
generic tuning to maintain the previous behaviour when compiled without
-mcpu or -mtune.
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16
without zvfh, and marked the cost as invalid to prevent the vectorizers
from emitting them. However it inadvertently didn't change the cost for
ordered reductions, so this moves the check earlier to fix this.
This also uses BasicTTIImpl instead which now assigns a valid but
expensive cost for fixed-length vectors, which reflects how codegen will
actually scalarize them.
We can use `viota`+`vrgather` to synthesize `vdecompress` and lower
expanding load to `vcpop`+`load`+`vdecompress`.
And if `%mask` is all ones, we can lower expanding load to a normal
unmasked load.
Fixes#101914.
Currently we cost an interleaved memory op as if it were a load/store of
the widened vector type, but this was undercosting in all cases when
compared to the measured performance of todays hardware.
On the x280 at NF=2 and spacemit-x60 at NF=2,3 and 4, a segmented load
is carried out as a wide load and NF LMUL shuffle ops:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput
All other NFs go through a slow path. On the spacemit-x60 this is
proportional to VLMAX * NF, and on the x280 proportional to the number
of segments.
This patch increases the cost by implementing a wide load + NF LMUL
shuffle op cost for the lowest common denominator NF=2, and then a
slower cost proportional to VL for the other NFs.
In a follow up patch we can add a tuning flag to use the faster cost
model for NF=3 and 4 on the spacemit-x60.
Note that the FIXME about illegal vectors seems to have been fixed in
#100436
We currently can't lower scalable vector lrint and llrint nodes for bf16
and f16, even with zvfh, and will crash.
Mark the cost as invalid for now to prevent the vectorizers from
emitting them.
Note that we can actually lower fixed-length vectors fine by scalarizing
them, but we were still undercosting these too so I've also included
them. I presume there's an opportunity to improve the codegen later on.
Previously, the cost model was returning an invalid cost. This simply
moves the check from one place to another. This is mostly to make the
cost modeling code a bit easier to follow.
---------
Co-authored-by: Mel Chen <mel.chen@sifive.com>
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
In https://reviews.llvm.org/D153848, promotion was added for a variety
of f16 ops with zvfhmin, including VP reductions.
However I don't believe it's correct to promote f16 fadd or fmul
reductions to f32 since we need to round the intermediate results.
Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two
different results depending on whether we compiled with +zvfh or
+zvfhmin, for example with a 3 element reduction:
; v9 = [0.1563, 5.97e-8, 0.00006104]
; zvfh
vsetivli x0, 3, e16, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v9, v8
vfmv.f.s fa0, v8
; fa0 = 0.1563
; zvfhmin
vsetivli x0, 3, e16, m1, ta, ma
vfwcvt.f.f.v v10, v9
vsetivli x0, 3, e32, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v10, v8
vfmv.f.s fa0, v8
fcvt.h.s fa0, fa0
; fa0 = 0.1564
This same thing happens with reassociative reductions e.g. vfredusum.vs,
and this also applies for bf16.
I couldn't find anything in the LangRef for reductions that suggest the
excess precision is allowed. There may be something we can do in Clang
with -fexcess-precision=fast, but I haven't looked into this yet.
I presume the same precision issue occurs with fmul, but not with
fmin/fmax/fminimum/fmaximum.
I can't think of another way of lowering these other than scalarizing,
and we can't scalarize scalable vectors, so this just removes the
promotion and adjusts the cost model to return an invalid cost. (It
looks like we also don't currently cost fmul reductions, so presumably
they also have an invalid cost?)
I think this should be enough to stop the loop vectorizer or SLP from
emitting these intrinsics.
When we're lowering to a split sequence, we only need one
materialization of the zero constant. Our codegen looks something like
this:
vmv.v.i v24, 0
vmerge.vim v8, v24, -1, v0
vmv1r.v v0, v16
vmerge.vim v16, v24, -1, v0
Note: Doing this specific case since it was pointed out in
https://github.com/llvm/llvm-project/pull/110164#discussion_r1778268391,
but it's worth noting that we have the same basic problem (over costing
split operations with split invariant terms) at multiple places through
this file.
Calling into BasicTTI is not always safe. In particular, BasicTTI does
not have a full legalization implementation (vector widening is
missing), and falls back on scalarization. The problem is that
scalarization for <N x i1> vectors is cost in terms of the cast API and
we can end up in an infinite recursive cycle.
The "right" fix for this would be teach BasicTTI how to model the full
legalization state machine, but several attempts at doing so have
resulted in dead ends or undesirable cost changes for targets I don't
understand.
This patch instead papers over the issue by avoiding the call to the
base class when dealing with an i1 source or dest. This doesn't
necessarily produce correct costs, but it should at least return
something semi-sensible and not crash.
Fixes https://github.com/llvm/llvm-project/issues/108708
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
Arithmetic half or bfloat ops on zvfhmin and zvfbfmin respectively will
be promoted and carried out in f32, so this updates
getArithmeticInstrCost to check for this.
This is a follow up to 7f6bbb3. When lowering a <N x i1> build_vector,
we currently chose to extend to i8, perform the build_vector there, and
then truncate back in vector. Our costing on the other hand accounts for
it as if we performed a vector extend, an insert, and a vector extract
for every element. This significantly over estimates the cost.
Note that we can likely do better in our build_vector lowering here by
packing the bits in scalar, and doing a build_vector of the packed bits.
Regardless, our costing should match our lowering.
This change is actually two related changes, but they're very hard to
meaningfully separate as the second balances the first, and yet doesn't
do much good on it's own.
First, we can reduce the cost of a build_vector pattern. Our current
costing for this defers to generic insertelement costing which isn't
unreasonable, but also isn't correct. While inserting N elements
requires N-1 slides and N vmv.s.x, doing the full build_vector only
requires N vslide1down. (Note there are other cases that our build
vector lowering can do more cheaply, this is simply the easiest upper
bound which appears to be "good enough" for SLP costing purposes.)
Second, we need to tell SLP that calls don't preserve vector registers.
Without this, SLP will vectorize scalar code which performs e.g. 4 x
float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the
costing works out that this is in fact the optimal choice - except that
we don't actually have a <2 x float> @exp, and unroll during DAG. This
would be fine (or at least cost neutral) except that the libcall for the
scalar @exp blows all vector registers. So the net effect is we added a
bunch of spills that SLP had no idea about. Thankfully, AArch64 has a
similiar problem, and has taught SLP how to reason about spill cost once
the right TTI hook is implemented.
Now, for some implications...
The SLP solution for spill costing has some inaccuracies. In particular,
it basically just guesses whether a intrinsic will be lowered to a call
or not, and can be wrong in both directions. It also has no mechanism to
differentiate on calling convention.
This has the effect of making partial vectorization (i.e. starting in
scalar) more profitable. In practice, the major effect of this is to
make it more like SLP will vectorize part of a tree in an intersecting
forrest, and then vectorize the remaining tree once those uses have been
removed.
This has the effect of biasing us slightly away from strided, or indexed
loads during vectorization - because the scalar cost is more accurately
modeled, and these instructions look relevatively less profitable.
This patch fix the potential crash about using dyn_cast in `vp_cmp`
which is same as #109313.
Check if the IntrinsicCostAttrubute contains underlying instruction
first and cast to the VPCmpIntrinsic.