If we know an exact VLEN, then the index is effectively modulo the
number of elements in a single vector register. Our lowering performs
this subvector optimization.
A bit of context. This change may look a bit strange on it's own given
we are currently *not* scaling insert/extract cost by LMUL. This costing
decision needs to change, but is very intertwined with SLP
profitability, and is thus a bit hard to adjust. I'm hoping that
https://github.com/llvm/llvm-project/pull/108419 will let me start to
untangle this. This change is basically a case of finding a subset I can
tackle before other dependencies are in place which does no real harm in
the meantime.
A half with only zvfhmin or bfloat will end up getting promoted to a f32
for most instructions.
Unless the loop consists only of memory ops and permutation instructions
which don't need promoted (is this common?), we'll end up using double
the LMUL than what's currently being returned by getRegUsageForType.
Since this is used by the loop vectorizer, it seems better to be
conservative and assume that any usage of a zvfhmin half/bfloat will end
up being widened to a f32
Widening/narrowing the source data type to match the destination data
type may require multiple steps.
To model the costs, the patch generated the interim type by following
the logic in RISCVTargetLowering::lowerVPFPIntConvOp.
We'd previously just deferred to the base implementation, but that more
or less always returns 1. This underestimates the cost of the
insert/extract, biases the SLP vectorizer towards forming illegally
typed vectors, and underestimates the cost of scalarized operations
(like unaligned scatter/gather).
This patch is moving out stepvector intrinsic from the experimental
namespace.
This intrinsic exists in LLVM for several years now, and is widely used.
The code was broken completely. Need to iterate over the whole mask and
process the submasks correctly, check if they form full indentity and
adjust indices correctly.
Fixes https://github.com/llvm/llvm-project/issues/106126
This fixes a crash introduced by my
ac6e1fd0c089043fe60bd0040ba3cad884f00206.
I had failed to consider the case where a vector is truncated to an
illegal element type. The resulting intermediate VT wasn't an MVT and
we'd fail an assertion. Surprisingly, SLP does query illegal element
types in some cases.
This optimization helps reduce repeated calculations of base addresses
by extracting type extensions when the same base address is accessed
multiple times but its offset is a constant.
For a cast with src and destination size being unequal, we were costing
the cast as if it were being scalarized, when in fact we can often
promote such cases to a wider legal type.
Note that for casts with equal size (i.e. bitcast, some fp<->i, and
ptrtoint) the generic logic in BasicTTI already assumed promotion. It
just doesn't handle the cast where source and destination are both
promoted to non-equal types.
This is analogous to d3fd28a, but with the same reasoning applied to
casts instead.
This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9.
This reverts commit d3c9bb0cf811424dcb8c848cf06773dbdde19965.
A few buildbots trip up on asan-rvv-intrinsics.ll. I've also reverted
the follow-up commit d3c9bb0cf8.
https://lab.llvm.org/buildbot/#/builders/46/builds/2895
Previously asan considers target intrinsics as black boxes, so asan
could not instrument accurate check. This patch provide TTI hooks to
make targets describe their intrinsic informations to asan.
Note,
1. this patch renames InterestingMemoryOperand to MemoryRefInfo.
2. this patch does not support RVV indexed/segment load/store.
The tablegen patterns all have isRV32. I did not check if any of them
could naively support RV64.
Fixes#101067 and probably other bugs like it we haven't found yet.
The motivation for this change is the costing of a LD or ST with nearly
power of 2 vectors (e.g. <3 x i32> or <7 x i32>) on V. There's an
experimental option in SLP to allow emitting these if the cost model
says they're profitable. This really helps with e.g. RGB vectors.
Our actual lowering for these depends on whether a wider container type
is known available. If so, we use a vle or vse on the wider type with a
restricted VL. If not, we split until a legal type is found, and then
apply the vle/vse on the sub-pieces.
This change is intentionally restricted to only the case where promotion
(widening w/VL predication) is involved. We appear to have at least one
bug in our splitting lowering (see discussion on review), and to avoid
exposing this more widely, I chose to not adjust costs for the splitting
case. The current splitting costing assumes scalarization (which is not
true of the actual lowering), but that has the effect of biasing
vectorization away from such cases strongly.
For the widening case, the true cost scales with the next largest legal
type. The default implementation assumes that such a type is scalarized.
Changing that brings our cost in line with our actual lowering decision.
Note that since scalarization is not possible for scalable types, the
prior costing falsely returned Invalid for that case.
I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases that rva22u64_v was
considerably slower.
One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.
It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.
This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.
Fixes#62576 (the last point there about scalar fsub/fmul)
Previously getIntImmInstCost only calculated the cost of materialising
the argument of a store if it was the address. This means
ConstantHoisting's transformation wouldn't kick in for cases like
storing two values that require multiple instructions to materialise but
where one can be cheaply generated from the other (e.g. by an addition).
Two key changes were needed to avoid regressions when enabling this:
* Allowing constant materialisation cost to be calculated assuming
zeroes are free (as might happen if you had a 2*XLEN constant and one
half is zero).
* Avoiding constant hoisting if we have a misaligned store that's going
to be a legalised to a sequence of narrower stores. I'm seeing cases
where hoisting the constant ends up with worse codegen in that case.
Out of caution and so as not to unexpectedly degrade other existing hoisting logic, FreeZeroes is used only for the new cost calculations for the load instruction. It would likely make sense to revisit this later.
Intrinsics not supported in the backend will fall Into BasicTTIImpl,
which will check if the VP intrinsic is a type based instruction.
All type based instruction will fall into the
`getTypeBasedIntrinsicInstrCost()` which doesn't support instruction
with scalable vector type.
This patch adds the instruction cost for type based binOp VP intrinsic
instructions in the backend to get the valid instruction costs.
The cost of type based binOp VP intrinsics will be same as their non-VP
counterpart.
An LSR formula may require the addition of multiple base or scale
registers, this sum reduction requires a temporary register to perform.
Since the formulas are independent, we only need one temporary,
regardless of the number of unique formula. Each formula can reuse the
same temporary. A later CSE pass may come along and combine
sub-expressions - but then the register pressure would be that passes
problem to consider.
This change fixes up the costing in the RISCV specific way, but this is
really a generic LSR problem. I just didn't feel like fighting with LSR
and dealing with all the various targets swinging slightly in hard to
reason about ways. This problem is more pronounced on RISCV than any
other target due to our lack of addressing modes.
This change is not hugely important on it's own, but I have an upcoming
change to add support fo shNadd in LSR which biases us fairly strongly
towards adding more "base adds". Without this change, we see net
regression due to the increase in register pressure which is not
accounted for.
With ShortFowrardBranchOpt(SFB) or ConditionalMoveFusion, scalar
ICmp and scalar Select instructions will lower to SELECT_CC
and lower to PseudoCCMOVGPR which will generate a conditional
branch instruction and a move instruction.
The cost of scalar (ICmp + Select) = (0 + Select instruction cost)
isLegalInterleavedAccessType expects the subvector type, but
getInterleavedMemoryOpCost is called with the full vector type. So we
need to divide by Factor.
The cost of `experimental.cttz.elts` in RISC-V equals to the cost of
vfirst when the zero_is_poison argument is true. Otherwise, we add
additional costs of cmp + select to convert the -1 result from vfirst to
EVL.
Insert a break to fix the implicit-fallthrough caught by sanitizer.
Original commit message:
This patch made following changes:
1. Support ISD FDIV/UDIV/SDIV/UREM/SREM
2. Classify instructions which cost the same
The support for interleaved accesses for scalable vector with a factor
of 2 is enabled in vectorizer. Therefore, the patch removed the
restriction for scalable vector with a factor of 2.
This patch introduces following changes
- Support all fp predicates
- Use the Val type to estimate the latency/throughput cost
- Assign a cost of 1 for mask operations as LMULCost for mask types
cannot be correctly estimated.
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
This patch introduces generating VP intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication in a very
limited capacity via tail-folding and masked load/store/gather/scatter
intrinsics. However, this does not let architectures with active vector
length predication support take advantage of their capabilities.
Architectures with general masked predication support also can only take
advantage of predication on memory operations. By having a way for the
Loop Vectorizer to generate Vector Predication intrinsics, which (will)
provide a target-independent way to model predicated vector
instructions. These architectures can make better use of their
predication capabilities.
Our first approach (implemented in this patch) builds on top of the
existing tail-folding mechanism in the LV (just adds a new tail-folding
mode using EVL), but instead of generating masked intrinsics for memory
operations it generates VP intrinsics for loads/stores instructions. The
patch adds a new VPlanTransforms to replace the wide header predicate
compare with EVL and updates codegen for load/stores to use VP
store/load with EVL.
Other important part of this approach is how the Explicit Vector Length
is computed. (VP intrinsics define this vector length parameter as
Explicit Vector Length (EVL)). We use an experimental intrinsic
`get_vector_length`, that can be lowered to architecture specific
instruction(s) to compute EVL.
Also, added a new recipe to emit instructions for computing EVL. Using
VPlan in this way will eventually help build and compare VPlans
corresponding to different strategies and alternatives.
Differential Revision: https://reviews.llvm.org/D99750
Changes in Recommit:
Add an additional check on sign/zero extend to the same type.
Original message:
Use the destination data type to measure the LMUL size for
latency/throughput cost
We're not allowed to call getELEN when the vector extension
is not enabled. If we're looking at a vector type, isTypeLegal would
only return true if the vector extensions are enabled. So early out
for non-vector types before we call isTypeLegal and getELEN.
To reduce the indentation by using early returns, this patch hoist the
return for illegal type and non vector type earlier.
It should mostly be an NFC.
The ‘llvm.vector.reduce.fmaximum/fminimum.*’ intrinsics propagate NaNs
if any element of the vector is a NaN.
Following #79402, the patch adds the cost for NaN check (vmfne + vcpop)
The changeset enables lowering of `llvm.masked.compressstore(%data,
%ptr, %mask)` for RVV for fixed vector type into:
```
%0 = vcompress %data, %mask, %vl
%new_vl = vcpop %mask, %vl
vse %0, %ptr, %1, %new_vl
```
Such lowering is only possible when `%data` fits into available LMULs
and otherwise `llvm.masked.compressstore` is scalarized by
`ScalarizeMaskedMemIntrin` pass.
Even though RVV spec in the section `15.8` provide alternative sequence
for compressstore, use of `vcompress + vcpop` should be a proper
canonical form to lower `llvm.masked.compressstore`. If RISC-V target
find the sequence from `15.8` better, peephole optimization can
transform `vcompress + vcpop` into that sequence.
In case of loads/stores from an immediate address, avoid rematerializing
the constant for every block and allow consthoist to hoist it to the
entry block.
When MVT is not a vector type, TCK_CodeSize should return an invalid
cost. This patch adds a check in the beginning to make sure all cost
kinds return invalid costs consistently.
Before this patch, TCK_CodeSize returns a valid cost on scalar MVT but
other cost kinds doesn't.
This fixes the issue #83294 where a loop contains vector instructions
and MVT is scalar after type legalization when the vector extension is
not enabled,
If we have exact vlen knowledge, we can figure out which indices
correspond to register boundaries. Our lowering uses this knowledge to
replace the vslidedown.vi with a sub-register extract. Our costs can
reflect that as well.
This is another piece split off
https://github.com/llvm/llvm-project/pull/80164
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>