We can reuse isVectorIntrinsicWithScalarOpAtArg in VectorUtils to
determine if only the first lane will be used for a
VPWidenIntrinsicRecipe, provided that we also move the VP EVL operand
check into it.
This was needed by a local patch I was working on that created a
VPWidenIntrinsicRecipe with a VP intrinsic, and prevents the need to
update the scalar arguments in two places.
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" `
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86 specific handlings in these (overridden) methods,
and unfortunately the general preference of TTI.getScalarizationOverhead()
was dropped. If VL is available it should always be preferred to use
getScalarizationOverhead(), and this is indeed the case for SystemZ which
has a special insertion instruction that can insert two GPR64s.
Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost() which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.
This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.
Fixes: #135346
This addresses an existing TODO and simply moves the current code to add
canonical IV recipes to the initial skeleton construction, at the same
place where the corresponding region will be introduced.
Whenever calls were transformed to VP intrinsics with EVL tail folding
in #110412, this workaround was added in computeCost to avoid an
assertion when checking ICA.getArgs().
However it turned out that the actual arguments were never used and this
assertion was removed in #115983 afterwards, so it's now fine to leave
the arguments empty and use the type based cost instead.
The type based cost and value based cost are the same for these VP
intrinsics.
This was tested by adding back in the transformation code in #110412 and
checking that no assertions were still hit.
This NFC aims to simplify the control-flow and interfaces used in tryToFindDuplicates(). The point is to make it easier to understand where decisions for scalar de-duplication are made.
In particular:
- Limit indentation
- Rename some variables to better match their use case
- Always give consistent outputs for VL and ReuseShuffleIndices. This makes it possible to use the same code for building gather TreeEntry everywhere. This also allows to remove the TryToFindDuplicates lambda.
Currently if we try to create a VPInstructionWithType without a FMF via
VPBuilder::createNaryOp we will use the constructor that asserts
`assert(isFPMathOp() && "this op can't take fast-math flags");`.
This fixes it by checking if FMFs have a value, similar to the other
createNaryOp overloads.
This is needed by #129508
There is a bug in the computation and handling of MinBWs in the case of
VPWidenIntrinsicRecipe: a crash is observed in
VPlanTransforms::truncateToMinimalBitwidths due to a mismatch between
the number of recipes processed and the number of entries in MinBWs. Fix
handling of calls in llvm::computeMinimumValueSizes, and handle
VPWidenIntrinsicRecipe in truncateToMinimalBitwidths, fixing the bug.
Fixes#87407.
Move out the logic to prepare for vectorization to a separate transform,
before creating loop regions. This was discussed as follow-up
in https://github.com/llvm/llvm-project/pull/136455.
This just moves the existing code around slightly and will simplify
follow-up patches to include the exiting edges during initial VPlan
construction.
This reverts commit 8dd160f4767f971572eac065c8650d9202ff5bf9.
The recommit contains an adjustment to planContainsAdditionalSimplifications,
which considers changes to the original predicate for compares.
Original commit message:
Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.
Alive2 Proofs for FPCMP test changes: https://alive2.llvm.org/ce/z/WGDz9U
PR: https://github.com/llvm/llvm-project/pull/129430
Add a new reduction recurrence kind for reductions with
minimumnum/maximumnum. Such reductions can be vectorized without
nsz/nnans, same as reductions with maximum/minimum intrinsics.
Note that a new reduction kind is needed to make sure partial reductions
are also combined with minimumnum/maximumnum.
Note that the final reduction to a scalar value is performed with
vector.reduce.fmin/fmax. This should be fine, as the results of the
partial reductions with maximumnum/minimumnum silences any sNaNs.
In-loop and reductions in SLP are not supported yet, as there's no
reduction version of maximumnum/minimumnum yet and fmax may be
incorrect.
PR: https://github.com/llvm/llvm-project/pull/137335
I noticed this when working on a patch downstream, and I don't think
this is an issue upstream yet.
But if a VPWidenIntrinsicRecipe is created without an underlying
CallInst, e.g. in createEVLRecipe, it will crash if you try to clone it
because it assumes the CallInst always exists.
This fixes it by using the CallInst-less constructor in this case.
Simplify initial VPlan construction by not creating a separate
vector.latch block, which isn't needed and will get folded away later.
This has been suggested as independent clean-up multiple times.
Use replaceSuccessor/replacePredecessor in
insertBlockAfter/insertBlockBefore. This preserves the predecessor
order, which in turns is needed to not invalidate existing phi recipes.
At the moment this is NFC, but enables additional uses in the future.
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.
PR: https://github.com/llvm/llvm-project/pull/137030
Better to preserve the original order of the alternate nodes to avoid
inter-lane shuffling, select/insert subvector patterns provide better
perf.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/136329
Add a new helper to manage IR metadata that can be progated to generated
instructions for recipes.
This helps to remove a number of remaining uses of getUnderlyingInstr
during VPlan execution.
PR: https://github.com/llvm/llvm-project/pull/135272
Remove legacy ILV sinkScalarOperands, which is superseded by the
sinkScalarOperands VPlan transforms.
There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes doesn't support replicating. Those are pointer
inductions and blends.
We could probably improve this further, by allowing replication for more
recipes, but I don't think the extra complexity is warranted.
Depends on https://github.com/llvm/llvm-project/pull/136021.
PR: https://github.com/llvm/llvm-project/pull/136023
Add incoming exit phi operands during the initial VPlan construction.
This ensures all users are added to the initial VPlan and is also needed
in preparation to retaining exiting edges during initial construction.
PR: https://github.com/llvm/llvm-project/pull/136455
willGenerateVectors switches on opcodes of a recipe, but Histogram is
missing in the switch statement, which could cause a crash in some
cases. The crash was initially observed when developing another patch.
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that does in AMDGPU used value_or which has been replaced
by a isValid() check.
Need to estimate, which one is preferable, deinterleaved/segmented
loads or strided. Segmented loads can be combined, improving
the overall performance.
Reviewers: RKSimon, hiraditya
Reviewed By: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/135058
When we have the remaining unique scalar, that should be inserted into
non-poison vector and into non-zero position:
```
%vec1 = insertelement %vec, %v, pos1
%res = shuffle %vec1, poison, <0, 1, 2,..., pos1, pos1 + 1, ..., pos1,
...>
```
better to estimate if it is profitable to model it as is or model it as:
```
%bv = insertelement poison, %v, 0
%splat = shuffle %bv, poison, <poison, ..., 0, ..., 0, ...>
%res = shuffle %vec, %splat, <0, 1, 2,..., pos1 + VF, pos1 + 1, ...>
```
Reviewers: preames, hiraditya, RKSimon
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/136590
These seem to be the wrong way round, e.g. see the definition at
Instruction::mayReadFromMemory().
If an instruction only writes to memory then it's known to not read
memory, and so on.
Only noticed this when using VPWidenIntrinsicRecipe in a local patch and
wondered why it kept on getting DCEd despite the intrinsic writing to
memory.
This will likely not affect much with the current uses of the function,
but if we have getExtractWithExtendCost we can plumb CostKind through it
in the same way as other costmodel functions.
Some vector math routines, e.g. ArmPL, specify a particular
calling convention on the routines which can help improve
performance by specifying what registers have to be preserved
across the call.