`VPEVLBasedIVPHIRecipe` lowers to a scalar-phi VPInstruction and
generates a scalar phi, so it occupies only a scalar register, just
like other phi recipes.
This patch fixes the register usage for `VPEVLBasedIVPHIRecipe` from
vector to scalar, which matches the generated vector IR more closely.
https://godbolt.org/z/6Mzd6W6ha shows that there are no register spills
when choosing `<vscale x 16>`.
Note that this test is basically copied from AArch64.
SimplifyBranchConditionForVFAndUF only recognized canonical IVs and a
few PHI
recipes in the loop header. With more IV-step optimizations,
the canonical widen-canonical-iv can be replaced by a canonical
VPWidenIntOrFpInduction,
which the pass did not handle, causing regressions (missed
simplifications).
This patch replaces canonical VPWidenIntOrFpInduction with a StepVector
in the vector preheader
since the vector loop region only executes once.
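For intuition, here is a minimal plain-C++ sketch (illustrative only, not VPlan code) of what a step vector contains:
```cpp
#include <vector>

// A "step vector" for VF lanes holds lane index i in lane i. Because the
// vector loop region executes exactly once here, the widened canonical IV
// takes only this single value, so it can be materialized once in the
// vector preheader.
std::vector<unsigned> stepVector(unsigned VF) {
  std::vector<unsigned> V(VF);
  for (unsigned I = 0; I < VF; ++I)
    V[I] = I;
  return V;
}
```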
Remove the ArrayRef<const Value*> Args operand from
getOperandsScalarizationOverhead and require that the callers
de-duplicate arguments and filter constant operands.
Removing the Value *-based Args argument enables callers that have no
Value * operands available to use the function in a follow-up: computing
the scalarization cost directly for a VPlan recipe.
It also allows more accurate cost estimates in the future: for example,
when vectorizing a loop, we could also skip operands that are live-ins,
as those do not require scalarization either.
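As a rough sketch of the caller-side preparation this requires (stand-in types, not the actual LLVM code):
```cpp
#include <set>
#include <vector>

// Stand-in for an operand: Key identifies it, IsConstant marks operands
// that need no scalarization.
struct Operand {
  const void *Key;
  bool IsConstant;
};

// Callers now de-duplicate arguments and filter out constant operands
// before querying the scalarization overhead.
std::vector<Operand> prepareOperands(const std::vector<Operand> &Ops) {
  std::set<const void *> Seen;
  std::vector<Operand> Unique;
  for (const Operand &Op : Ops) {
    if (Op.IsConstant)
      continue; // constants do not require scalarization
    if (Seen.insert(Op.Key).second)
      Unique.push_back(Op); // keep first occurrence only
  }
  return Unique;
}
```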
PR: https://github.com/llvm/llvm-project/pull/154126
A number of recipes compute costs for the same opcodes for scalars or
vectors, depending on the recipe.
Move the common logic out to a helper in VPRecipeWithIRFlags, which is
then used by VPReplicateRecipe, VPWidenRecipe and VPInstruction.
This makes it easier to cover all relevant opcodes, without duplication.
PR: https://github.com/llvm/llvm-project/pull/153361
In setVectorizedCallDecision we attempt to calculate the scalar costs
of calls being vectorised, even for scalable VFs where we already know
the answer is Invalid. We can avoid doing unnecessary work by skipping
this completely for scalable vectors.
After a485e0e, we may not set the vector trip count in
preparePlanForEpilogueVectorLoop if it is zero. We should not choose a
VF * UF that makes the main vector loop dead (i.e. vector trip count is
zero), but there are some cases where this can happen currently.
In those cases, set EPI.VectorTripCount to zero.
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index counts
backwards from the end of the vector: given a vector with
ElementCount EC, the extracted/inserted lane is EC - 1 - Index.
For scalable vectors this lane is unknown at compile time. I've
added an AArch64 hook that better represents the cost, and also
a RISCV hook that maintains compatibility with the behaviour
prior to this PR.
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
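A standalone sketch of the index arithmetic (assumed semantics as described above, not the actual hook):
```cpp
#include <cstdint>
#include <optional>

// For a vector with ElementCount EC, reverse index I addresses lane
// EC - 1 - I. For scalable vectors the runtime element count is
// vscale * KnownMin, unknown at compile time, so the concrete lane
// cannot be computed statically.
std::optional<uint64_t> reverseLane(uint64_t KnownMinEC, bool Scalable,
                                    uint64_t Index) {
  if (Scalable)
    return std::nullopt; // lane is vscale * KnownMinEC - 1 - Index
  return KnownMinEC - 1 - Index;
}
```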
This is the first step in untangling the variable step transform and
header mask optimizations as described in #152541.
Currently we replace all VF users globally in the plan, including
VPVectorEndPointerRecipe. However, this leaves reversed loads and stores
in an incorrect state until they are adjusted in optimizeMaskToEVL.
This moves the VPVectorEndPointerRecipe transform so that it is updated
in lockstep with the actual load/store recipe.
One thought that crossed my mind was that VPInterleaveRecipe could also
use VPVectorEndPointerRecipe, in which case we would have also been
computing the wrong address because we don't transform it to an EVL
recipe which accounts for the reversed address.
If we end up with an extract_element VPInstruction where both operands
are live-ins, we will try to fold the live-ins even though the first
operand is a vector whilst the live-in is scalar.
This fixes it by just returning the vector live-in instead of calling
the folder, and removes the handling for insertelement where we aren't
able to do the fold. From some quick testing we previously never hit
this fold anyway, and were probably just missing test coverage.
Fixes #154045
Currently, VPInterleaveRecipe::execute does not support generating LLVM
IR for interleaved accesses that require a gap mask for scalable VFs.
It would be better to detect and prevent such groups from being
vectorized as interleaved accesses in
LoopVectorizationCostModel::interleavedAccessCanBeWidened, rather than
relying on the TTI function getInterleavedMemoryOpCost to return an
invalid cost.
Compute the cost of non-intrinsic, single-scalar calls directly in
VPReplicateRecipe::computeCost.
This starts moving call cost computations to VPlan, handling the
simplest case first.
Materialize Build(Struct)Vectors explicitly for VPReplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.
Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code-path altogether.
PR: https://github.com/llvm/llvm-project/pull/151487
If ExtraAnalysis is requested, emit all remarks caused by unvectorizable instructions - instead of only the first.
This is in line with how other places handle DoExtraAnalysis and it can be quite helpful to get info about all instructions in a loop that prevent vectorization.
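Schematically the control flow looks like this (a sketch with hypothetical helpers, not the actual LoopVectorize code):
```cpp
#include <vector>

struct Instr {};                                           // instruction stand-in
static bool isVectorizable(const Instr &) { return true; } // stand-in check
static void emitRemark(const Instr &) {}                   // stand-in remark sink

// With DoExtraAnalysis we keep scanning and report every blocking
// instruction instead of bailing out at the first one.
bool canVectorize(const std::vector<Instr> &LoopInsts, bool DoExtraAnalysis) {
  bool CanVectorize = true;
  for (const Instr &I : LoopInsts) {
    if (isVectorizable(I))
      continue;
    emitRemark(I); // one remark per offending instruction
    if (!DoExtraAnalysis)
      return false; // old behaviour: stop at the first
    CanVectorize = false;
  }
  return CanVectorize;
}
```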
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
```cpp
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N> {};
```
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
The vector combiner processes all instructions as it first loops
through the function, adding any newly added and deleted instructions to
a worklist which is then processed when all nodes are done. This leaves
extra uses in the graph while the initial processing is performed,
leading to sub-optimal decisions being made for other combines. This
changes it so that trivially dead instructions are removed immediately.
The main change this requires is to make sure iterator invalidation does
not occur.
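The iterator-safety point can be shown with a standalone sketch (std::list as a stand-in for the instruction list):
```cpp
#include <list>

// Advance the iterator before erasing the current element so the erase
// cannot invalidate the iterator still in use; the same care is needed
// when deleting trivially dead instructions during the walk.
void pruneDead(std::list<int> &Worklist) {
  for (auto It = Worklist.begin(); It != Worklist.end();) {
    auto Cur = It++; // advance first; erasing Cur leaves It valid
    if (*Cur == 0)   // stand-in for "trivially dead"
      Worklist.erase(Cur);
  }
}
```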
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Dissolving the hierarchical VPlan CFG and converting abstract to
concrete recipes can expose additional simplification opportunities.
Do a final run of simplifyRecipes before executing the VPlan.
If the copyable schedule data is created and the user is used several
times in the user node, there is no need to count the same data for the
same user several times; it needs to be included only once.
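The counting rule amounts to de-duplicating per (user, data) pair, roughly (plain C++ sketch, not the SLP vectorizer's code):
```cpp
#include <set>
#include <utility>

// Count each (user, data) pair once, even if the user references the
// same schedule data several times within its node.
struct OnceCounter {
  std::set<std::pair<const void *, const void *>> Seen;
  unsigned Count = 0;

  void add(const void *User, const void *Data) {
    if (Seen.insert({User, Data}).second)
      ++Count; // first occurrence of this user/data pair
  }
};
```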
Fixes #153754
Fixes #153012
As we tolerate unfoldable constant expressions in `scalarizeOpOrCmp`, we
may fold
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) #0 {
entry:
  %158 = insertelement <2 x i64> <i64 5, i64 ptrtoint (ptr @val to i64)>, i64 %idx, i32 0
  %159 = or disjoint <2 x i64> splat (i64 2), %158
  store <2 x i64> %159, ptr %ptr2
  ret void
}
```
to
```llvm
define void @bug(ptr %ptr1, ptr %ptr2, i64 %idx) {
entry:
  %.scalar = or disjoint i64 2, %idx
  %0 = or <2 x i64> splat (i64 2), <i64 5, i64 ptrtoint (ptr @val to i64)>
  %1 = insertelement <2 x i64> %0, i64 %.scalar, i64 0
  store <2 x i64> %1, ptr %ptr2, align 16
  ret void
}
```
And it would be folded back in `foldInsExtBinop`, resulting in an
infinite loop.
This patch only performs the scalarization if InstSimplify can fold the
constant expression.
When matching integers, `m_ConstantInt` is a convenient alternative to
`m_APInt` for matching unsigned 64-bit integers, allowing one to
simplify
```cpp
const APInt *IntC;
if (match(V, m_APInt(IntC))) {
  if (IntC->ule(UINT64_MAX)) {
    uint64_t Int = IntC->getZExtValue();
    // ...
  }
}
```
to
```cpp
uint64_t Int;
if (match(V, m_ConstantInt(Int))) {
  // ...
}
```
However, this simplification only holds if `V` has a scalar type.
Specifically, `m_APInt` also matches integer splats, but `m_ConstantInt`
does not.
This patch ensures that the matching behaviour of `m_ConstantInt`
parallels that of `m_APInt`, and also uses it in some obvious places.
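A standalone sketch of the intended semantics (lanes as a stand-in representation, not LLVM's PatternMatch):
```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Model a constant as its lanes; a scalar has one lane. The match
// succeeds for a scalar constant or a vector splat, mirroring how
// m_APInt treats splats.
std::optional<uint64_t> matchConstantInt(const std::vector<uint64_t> &Lanes) {
  if (Lanes.empty())
    return std::nullopt;
  for (uint64_t L : Lanes)
    if (L != Lanes.front())
      return std::nullopt; // not a splat
  return Lanes.front();
}
```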
Added support for LShr instructions as the base for copyable elements.
Also added a simple analysis for selecting the best base instruction
when multiple candidates are available.
Fixed scheduling after cancellation.
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/153393
Directly emit shl instead of a multiply if VF * Step is a power-of-2. The
main motivation here is to prepare the code and test for directly
generating and expanding a SCEV expression of the minimum iteration
count. SCEVExpander will directly emit shl for multiplies with
powers-of-2.
InstCombine also performs this combine, so end-to-end this should
effectively be NFC.
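The strength reduction itself is straightforward; a standalone sketch of the power-of-two case:
```cpp
#include <bit>
#include <cstdint>

// Multiply by a power-of-two constant via a left shift, mirroring what
// SCEVExpander emits for such multiplies.
uint64_t mulByConst(uint64_t X, uint64_t C) {
  if (std::has_single_bit(C))        // C is a non-zero power of two
    return X << std::countr_zero(C); // shl by log2(C)
  return X * C;
}
```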
PR: https://github.com/llvm/llvm-project/pull/153495
Add 3 new iterator ranges to VPPhiAccessors:
* incoming_values(): returns a range over the incoming values of a phi
* incoming_blocks(): returns a range over the incoming blocks of a phi
* incoming_values_and_blocks(): returns a range over pairs of incoming
values and blocks.
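A standalone sketch of the shape of these ranges (stand-in types, not VPlan's implementation):
```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Incoming values and blocks are parallel sequences, as in a phi; the
// zipped view pairs them positionally.
struct PhiSketch {
  std::vector<int> Values; // stand-ins for incoming values
  std::vector<int> Blocks; // stand-ins for incoming blocks

  const std::vector<int> &incoming_values() const { return Values; }
  const std::vector<int> &incoming_blocks() const { return Blocks; }
  std::vector<std::pair<int, int>> incoming_values_and_blocks() const {
    std::vector<std::pair<int, int>> Zipped;
    for (std::size_t I = 0; I < Values.size(); ++I)
      Zipped.emplace_back(Values[I], Blocks[I]);
    return Zipped;
  }
};
```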
Depends on https://github.com/llvm/llvm-project/pull/124838.
PR: https://github.com/llvm/llvm-project/pull/138472
This patch adds a cost kind to `getAddressComputationCost()` for #149955.
Note that this patch also removes all the default values in `getAddressComputationCost()`.
Instead of defining unary/binary/ternary/4-ary overloads of each matcher,
we can use parameter packs to support arbitrary numbers of operands.
This allows us to remove the explicit N-ary definitions for each
matcher.
We need to rewrite Recipe_match's constructor to use a parameter pack
too, otherwise we end up with ambiguous overloads.
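In self-contained form, the idea looks like this (illustrative names, not the actual Recipe_match):
```cpp
#include <tuple>

// One variadic template covers what previously needed separate
// unary/binary/ternary/4-ary definitions.
template <typename... OpTys> struct MatchN {
  std::tuple<OpTys...> Ops;
  explicit MatchN(OpTys... Operands) : Ops(Operands...) {}
  static constexpr unsigned NumOperands = sizeof...(OpTys);
};

// A single factory replaces the per-arity overloads; MatchN's constructor
// takes a pack itself so the factories do not become ambiguous.
template <typename... OpTys> MatchN<OpTys...> m_Node(OpTys... Operands) {
  return MatchN<OpTys...>(Operands...);
}

// m_Node(1), m_Node(1, 2) and m_Node(1, 2, 3, 4) all share one definition.
```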
After clearing the dependencies in copyable data, we need to recalculate
dependencies for the original ScheduleData if it can be marked as
control dependent.
Fixes #153289
Shift the replacement of the regular VPBB for vector.ph with the VPIRBB
wrapping the created IR block directly to skeleton creation, to be
consistent with how the scalar preheader is handled.
We almost always have only one header mask, except with the data
tail-folding style, i.e. with VPInstruction::ActiveLaneMask.
All we need to do is make sure to erase the old icmp-based header mask
when replacing it.
This reverts commit 1c7c8e3ad39957285524ff116d9a6aec0d9b62f9.
Recommit with a fix for the verifier error caused by EVL recipes.
Extra test coverage added in 6f939da60e.
Attempt to narrow a phi of shufflevector instructions where the two
incoming values have the same operands but different masks.
Related to #128938.
Co-authored-by: Leon Clark <leoclark@amd.com>