Currently a cttz.elts of e.g. nxv32i1 will get expanded to a reduction
of nxv32i64 or equivalent, but we can split it into two legal nxv16i1
cttz.elts once we have dedicated SelectionDAG nodes.
This implements the splitting for them the same way we implement type
splitting for vp.cttz.elts, i.e. check if the low result is VF, and if
so add it to the result of the high result. It also implements operand
type promotion for NEON which needs to promote i1 vectors to something
larger first.
We also need to move expansion into LegalizeVectorOps so it doesn't get
expanded before type legalization can do splitting. This uses
LegalizeVectorOps in case the scalar reduction type, which depends on
the minimum bitwidth needed to store the result, still needs type
promotion.
The TTI costs should be updated after this to reflect the more efficient
codegen, but that is deferred to another PR.
Largely a straight-forward replacement with occasional simplifcations.
For AMDGPU, I assumed that unconditional branches are always uniform and
therefore "simplified"/changed AMDGPUAnnotateUniformValues to only
annotate conditional branches.
Target-specific FastISel only selects conditional branches,
unconditional branches are already handled by the non-target-specific
code.
RISCVTTIImpl::getShuffleCost() crashes when querying the cost of a
reverse shufflevector on a target with the P-extension but without V/Zve
extensions. The SK_Reverse case calls
getContainerForFixedLengthVector(), which asserts hasVInstructions().
The P-extension uses fixed-width packed SIMD in GPRs, not RVV registers,
so V extension is typically not enabled.
Add an early return for P-extension fixed vectors in getShuffleCost,
consistent with the existing guards in getScalarizationOverhead,
getCastInstrCost, and getVectorInstrCost.
BranchInst currently represents both unconditional and conditional
branches. However, these are quite different operations that are often
handled separately. Therefore, split them into separate opcodes and
classes to allow distinguishing these operations in the type system.
Additionally, this also slightly improves compile-time performance.
This patch implements the cost of
llvm.experimental.vector.extract.last.active which will lower to:
vcpop.m a0, v0
beqz a0, exit # Return passthru when the entire lane is inactive.
vid v10, v0.t
vredmaxu.vs v10, v10, v10
vmv.v.s a0, v10
zext.b a0, a0
vslidedown v8, v8,
This patch also helps conditional-scalar-assignment (CSA) works for
scalable vector.
All known crashes have been fixed.
We do still need to work out how fixed length vectors are handled when V
and P are both enabled, but I don't think this option is the solution
for that.
On RISC-V, the interrupt attribute relates only to the prolog and epilog
of the attributed function (and has specific restrictions on the
function's signature). It does not change how that function calls other
functions, and when outlining, the outlined function must not have this
attribute.
This adds a target-independent hook to TTI so targets can choose which
attributes to propagate (by default all are propagated).
Fixes#149969
Currently vector splice intrinsics are costed through getShuffleCost
when the offset is fixed. When the offset is variable though we can't
use a shuffle mask so it currently returns invalid.
This implements the cost in RISCVTTIImpl::getIntrinsicInstrCost as the
cost of a slideup and a slidedown, which matches the codegen.
It also implements the type based cost whenever the offset argument
isn't available.
It may be possible to reduce the cost in future when one of the vector
operands is known to be poison, in which case we only generate a single
slideup or slidedown.
When `Zvabd` exists, `llvm.abs` is lowered to `vabs.v` so the cost
is 1.
Reviewers: mshockwave, topperc, lukel97, skachkov-sc, preames
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/180146
Fixes#176208. Scaled back version of #176515 that only affects the RISCV backend.
Only modifies the cost for cases when DIV is a legal operation.
Updates the cost for both Scalar and Vector types.
Used `TTI::TCC_Expensive` as suggested by
https://github.com/llvm/llvm-project/issues/176208#issuecomment-3760902537.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
AArch64 has enabled this in https://reviews.llvm.org/D138990, and
the measurement data still stands for RISCV in some cases.
And, similar optimization like #77284 is added too.
After this patch, the highly predictable branch will be converted
back to branches instead using selects.
This pass is disabled by default now, we can enable it by default
after more detailed investigation.
Reviewers: davemgreen, preames, dtcxzyw, lukel97, topperc, asb
Pull Request: https://github.com/llvm/llvm-project/pull/80124
Fixes#175909
We compute the cost of a gather/scatter by multiplying the cost of the
scalar element type memory op by the estimated number of elements. On
rv32 though a scalar i64 load costs 2, even if we have zve64x.
This causes the cost to diverge between a vector of f64 and vector of
i64, even though both are the same. This fixes it by just using
TTI::TCC_Basic as the scalar memory op cost. The element type is checked
to be legal at this point.
I think we have the same issue for the strided op cost, but we don't
have test coverage for it yet.
Some machines are able to make use of AUIPC + ADDI or LUI + ADDI fusion, make sure to consider that in the cost model for `RISCVTTIImpl::getConstantPoolLoadCost()`.
@llvm.experimental.vp.splat was originally added in #98731 in order to
prevent VL toggles when optimizing a zero strided load to a scalar load
+ splat on RISC-V: #101329
However, the need to explicitly set the VL operand has been superseded
by RISCVVLOptimizer which can infer this automatically based on the
splat's users, and the use of the vp.splat intrinsic was removed in
#170543.
Now that there are no users of @llvm.experimental.vp.splat internal to
LLVM and it's unlikely we will need it in future due to
RISCVVLOptimizer, this patch removes the intrinsic. I couldn't find any
publicly available out-of-tree users of the intrinsic with a quick
search on GitHub.
We have disabled unrolling for vectorized loops in #151525 but this
PR only checked the instruction type.
For some loops, there is no instruction with vector type but they
are still vector operations (just like the memset zero test in the
precommit test).
Here we check the operands as well to cover these cases.
A funnel shift with the same first two operands is a rotate. When
`Zbb/Zbkb` is enabled we can use the `ROL(W)/ROR(I)(W)` instruction to
represent this. Add cost model support for this.
Similar to https://github.com/llvm/llvm-project/pull/169335 for AArch64.
Following #165532, this patch moves scalarization‑cost computation into
BaseT::getMemIntrinsicCost and lets backends override it via their
getMemIntrinsicCost.
It also removes the masked/gather‑scatter/strided/expand‑compress
costing interfaces from TTIImpl.
Targets may keep them locally if needed.
Stacked on #170426 and #170436.
- Following #168029. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getStridedMemoryOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
bool VariableMask, Align Alignment,
TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);
+ InstructionCost getStridedMemoryOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with same
information as before.
- Following #168029. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
- Alignment, CostKind, Inst);
+ InstructionCost getGatherScatterOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with same
information as before.
This patch is a rework of #160470 (which was reverted).
With getMemIntrinsicCost() now available, we can re‑land the change and
reduce vp_load_ff boilerplate.
- Following #168029. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getExpandCompressMemoryOpCost(Opcode, DataTy,
- VariableMask, Alignment,
- CostKind, Inst);
+ InstructionCost getExpandCompressMemoryOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with same
information as before.
- Split from #165532. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getMaskedMemoryOpCost(Opcode, Src, Alignment,
- AddressSpace, CostKind);
+ InstructionCost getMaskedMemoryOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with the
same information as before.
- Follow-up: migrate gather/scatter, strided, and expand/compress cost
queries to the same attributes-based entry point.
This is the initial support of P extension codegen, it only includes
small part of instructions:
PADD_H, PADD_B,
PSADD_H, PSADD_B,
PAADD_H, PAADD_B,
PSADDU_H, PSADDU_B,
PAADDU_H, PAADDU_B,
PSUB_H, PSUB_B,
PDIF_H, PDIF_B,
PSSUB_H, PSSUB_B,
PASUB_H, PASUB_B,
PDIFU_H, PDIFU_B,
PSSUBU_H, PSSUBU_B,
PASUBU_H, PASUBU_B
We have a private extension which also uses the vector type in the
frontend. Our platform does not have the V extension, so it triggered
assertion failures from within getLMULCost().
It feels reasonable to check for V extension before assuming LMUL
exists.
RVV segment is an array of `SegNum` contingous elements. This patch
emulates RVV segment as a large integer with bitwidth equaled to `SegNum
* SEW`. The reason to not emulate RVV segment as some aggregated type is
that vector type should use primitive types as element types.
There is another approach is to create `SegNum` InterestingMemoryOperand
objects. It could avoid create pseudo types, but this approach also
generates large code for asan check.
Co-authored-by: Yeting Kuo <yeting.kuo@sifive.com>
This isn't an optimization change - IR transforms should have remove the
operands and replaced them with poison. However, I noticed the
non-canonical splat structure in a couple of llvm-reduce outputs. This
results in us creating extremely atypical IR which is quite misleading
about the true cause of what's going on. (Because the non-canonical
splat doesn't get sunk, we then prone whatever was actually holding it
outside the loop in the original example, eliminating insight as to the
true cause of whatever issue we're debugging.)
We can rewrite this to (srai(w)/srli X, C1) == C2 so the AND immediate
is free. This transform is done by performSETCCCombine in
RISCVISelLowering.cpp.
This fixes the opaque constant case mentioned in #157416.
This patch is based on https://github.com/llvm/llvm-project/pull/159713
This patch extends AddressSanitizer to support indexed/segment
instructions in RVV. It enables proper instrumentation for these memory
operations.
A new member, `MaybeOffset`, is added to `InterestingMemoryOperand` to
describe the offset between the base pointer and the actual memory
reference address.
Co-authored-by: Yeting Kuo <yeting.kuo@sifive.com>
Split out from #151300 to isolate TargetTransformInfo cost modelling for
fault-only-first loads from VPlan implementation details. This change
adds costing support for vp.load.ff independently of the VPlan work.
For now, model a vp.load.ff as cost-equivalent to a vp.load.
[Previously reverted due to failures on asan-rvv-intrinsics.ll, the test
case is riscv only and it is triggered by other target]
Reland [#157863](https://github.com/llvm/llvm-project/pull/157863), and
add `; REQUIRES: riscv-registered-target` in test case to skip the
configuration that doesn't register riscv target.
Previously asan considers target intrinsics as black boxes, so asan
could not instrument accurate check. This patch make
SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so
that TTI can make targets describe their intrinsic informations to asan.
Note,
1. This patch move InterestingMemoryOperand from Transforms to Analysis.
2. Extend MemIntrinsicInfo by adding a
SmallVector<InterestingMemoryOperand> member.
3. This patch does not support RVV indexed/segment load/store.
Previously asan considers target intrinsics as black boxes, so asan
could not instrument accurate check. This patch make
SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so
that TTI can make targets describe their intrinsic informations to asan.
Note,
1. This patch move InterestingMemoryOperand from Transforms to Analysis.
2. Extend MemIntrinsicInfo by adding a
SmallVector<InterestingMemoryOperand> member.
3. This patch does not support RVV indexed/segment load/store.
This patch implements the `getAddressComputationCost()` in RISCV TTI
which
make the gather/scatter with address calculation more expansive that
stride cost.
Note that the only user of `getAddressComputationCost()` with vector
type is in `VPWidenMemoryRecipe::computeCost()`. So this patch make some
LV tests changes.
I've checked the tests changes in LV and seems those changes can be
divided into two groups.
* gather/scatter with uniform vector ptr, seems can be optimized to
masked.load.
* can optimize to stride load/store.
----
After #155739 landed, the assertion (cost mis-aligned) is fixed.
I've tested llvm-test-suite w/ rva23u64 and rva23u64_zvl1024b locally
and no assertion occurred.
This patch implements the `getAddressComputationCost()` in RISCV TTI
which
make the gather/scatter with address calculation more expansive that
stride cost.
Note that the only user of `getAddressComputationCost()` with vector
type is in `VPWidenMemoryRecipe::computeCost()`. So this patch make some
LV tests changes.
I've checked the tests changes in LV and seems those changes can be
divided into two groups.
* gather/scatter with uniform vector ptr, seems can be optimized to
masked.load.
* can optimize to stride load/store.
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector. Suppose a vector has
a given ElementCount EC, the extracted/inserted lane would be
EC - 1 - Index. For scalable vectors this index is unknown at
compile time. I've added a AArch64 hook that better represents
the cost, and also a RISCV hook that maintains compatibility
with the behaviour prior to this PR.
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
If we have a floating point vector and no zve32f/zve64f/zve64d, we can
end up with an invalid type-legalization cost from
getTypeLegalizationCost.
Previously this triggered an assertion that the type must have been
legalized if the "legal" type is a vector, but in this case when it's
not possible to legalize the original type is spat back out.
This fixes it by just checking that the legalization cost is valid.
We don't have much testing for zve64x, so we may have other places in
the cost model with this issue.
Fixes#153008
Now that support for masked loads/stores of interleave groups has
landed, we can enable the loop vectorizer to generate masked interleave
access where applicable.
This improves vectorization in several ways:
* Internal predication support: This enables interleave group
vectorization for loops with internal control flow predication, provided
all members of the group share the same predicate. Gaps in interleave
groups are still not efficiently handled by masking, so masking for gaps
remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups
by using masking. Without this, vectorized loops with interleaves would
fall back to using separate gather/scatter accesses, which can be
significantly less efficient.
"[RISCV][TTI] Enable masked interleave access for scalable vector
(#149981)" was reverted by 5294793bdcf6ca142f7a0df897638bd4e85ed1a7 due
to triggering an assertion. The issue has been addressed in the patch
"[LV] Fix gap mask requirement for interleaved access (#151105)". On the
other hand, this patch also enable fixed-length masked interleave access
(#150624) since support for fixed-length has also been landed
992118cb4deab139ae384bb85f03225a9a21b008.
---------
Co-authored-by: Philip Reames <preames@rivosinc.com>
Adjust the unrolling preferences to unroll hand-vectorized code, as well
as the scalar remainder of a vectorized loop. Inspired by a similar
effort in AArch64: see #147420 and #151164.
Follow up on a review of bd66fd0 ([CostModel/RISCV] Fix costs of vector
[l](lrint|lround)) post-landing to fix a subtle problem with the cost
of vector [l](lrint|lround). We should use source LMUL in the case of
a narrowing op.
Co-authored-by: Luke Lau <luke@igalia.com>