This relands the reverted #120721 with a fix for cases where neither
reduction operand are the reduction phi. Only
63114239cc8d26225a0ef9920baacfc7cc00fc58 and
63114239cc8d26225a0ef9920baacfc7cc00fc58 are new on top of the reverted
PR.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
This patch simplifies select-based integer min/max reductions by
utilizing `llvm::getMinMaxReductionPredicate`, and generates
intrinsic-based min/max reductions by utilizing
`llvm::getMinMaxReductionIntrinsicOp`.
Use the existing VPlan-based analysis to identify recipes that only have
their first lane demanded and transform them to uniform recpliate
recipes. This simplifies the generated code in some places and prepares
for fixing https://github.com/llvm/llvm-project/issues/122496.
When estimating the cost of entries shuffles for buildvectors, need to
rebuild original mask, not a generated submask, used for subregisters
analysis.
Fixes#122430
Older version of msvc do not have great lambda support and are not able
to handle uses of class data or lambdas with implicit return types in
some cases. These minor changes improve the sources compatibility with
older msvc and don't hurt readability either.
When estimating the cost of entries shuffles for buildvectors, need to
rebuild original mask, not a generated submask, used for subregisters
analysis.
Fixes#122430
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
With this patch we switch from the temporary dummy seeds to actual seeds
provided by the seed collector.
The seeds get sliced and each slice is used as the starting point for
vectorization.
Currently we emit early-exit related debug messages/remarks even when
there is a single exit. Update to only check isVectorizableEarlyExitLoop
if there isn't a single exit block.
PR: https://github.com/llvm/llvm-project/pull/121994
This is a split-off from #109833 and only adds code relating to checking
if a struct-returning call can be vectorized.
This initial patch only allows the case where all users of the struct
return are `extractvalue` operations that can be widened.
```
%call = tail call { float, float } @foo(float %in_val)
%extract_a = extractvalue { float, float } %call, 0
%extract_b = extractvalue { float, float } %call, 1
```
Note: The tests require the VFABI changes from #119000 to pass.
This allows it to produce a more accurate cost for the shuffle, using
the more accurate calls to getShuffleCost in getInstructionCost. It
helps fix some of the regressions from vector combine a little while
ago, now that we have better subvector extract costs.
Need to check if the GEP bases are equal and return false early. Also,
need to return false if the lookup is too deep, considering bases equal
too. Fixes a crash in the assertion.
This just copies the same conservative definition from mayWriteToMemory,
and enables more VPInstructions to be hoisted out in LICM.
I think this should give more accurate costs, and I was able to build
llvm-test-suite without the legacy-vplan cost model assertion going off.
This patch introduces a new method:
void Vectorizer::mergeEquivalenceClasses(EquivalenceClassMap &EQClasses)
const;
The method is called at the end of
Vectorizer::collectEquivalenceClasses() and is needed to merge
equivalence classes that differ only by their underlying objects (UO1
and UO2), where UO1 is 1-level-indirection underlying base for UO2. This
situation arises due to the limited lookup depth used during the search
of underlying bases with llvm::getUnderlyingObject(ptr).
Using any fixed lookup depth can result into creation of multiple
equivalence classes that only differ by 1-level indirection bases.
The new approach merges equivalence classes if they have adjacent bases
(1-level indirection). If a series of equivalence classes form ladder
formed of 1-step/level indirections, they are all merged into a single
equivalence class. This provides more opportunities for the load-store
vectorizer to generate better vectors.
---------
Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
VPExpandSCEVRecipes must be placed in the entry and are alway uniform.
This fixes a crash by always identifying them as uniform, even if the
main vector loop region has been removed.
Fixes https://github.com/llvm/llvm-project/issues/121897.
This is just the MOVSS instruction (SSE41 INSERTPS is still necessary for index != 0)
This exposed an issue in VectorCombine::foldInsExtFNeg - we need to use the more general SK_PermuteTwoSrc shuffle kind to allow getShuffleCost to match other shuffle kinds (not just SK_Select).
Legalize extract-from-ends using uniform VPReplicateRecipe of wide
inductions to use regular VPReplicateRecipe, so the correct end value
is available.
Fixes https://github.com/llvm/llvm-project/issues/121745.
Move the debug output that prints out the selected VF from
selectVectorizationFactor -> computeBestVF. This means that the output
will still be written even after removing the assert for the legacy and
vplan cost models matching.
Update optimizeForVFAndUF to completely remove the vector loop region
when possible. At the moment, we cannot remove the region if it contains
* widened IVs: the recipe is needed to generate the step vector
* reductions: ComputeReductionResults requires the reduction phi recipe
for codegen.
Both cases can be addressed by more explicit modeling.
The patch also includes a number of updates to allow executing VPlans
without a vector loop region.
Depends on https://github.com/llvm/llvm-project/pull/110004
<u,1,u,u> 'inplace' single src shuffles can be treated as free identity shuffles - ignore any shuffle cost (similar to what we already do in other folds like foldShuffleOfShuffles) - eventually getShuffleCost should just return TCC_Free in these cases but in a lot of the targets' shuffle cost logic this currently ends up treated as a generic SK_PermuteSingleSrc.
We still want to generate the shuffle as it will help further shuffle folds with the additional PoisonMaskElem 'undemanded' elements.
Check the VPlan directly to determine if a VPValue is an optimiziable IV
or IV use instead of checking the underlying IR instructions.
Split off from https://github.com/llvm/llvm-project/pull/112147. This
refactoring enables moving IV end value creation from the legacy
fixupIVUsers to a VPlan-based transform.
There is one case we now won't optimize, that is IVs with subtracts and
non-constant steps. But as this is a minor optimization and doesn't
impact correctness, the benefits of performing the check in VPlan should
outweigh the missed case.
This fixes a crash when building SPEC CPU 2017 with EVL tail folding
when widening @llvm.log10 intrinsics.
@llvm.log10 and some other intrinsics don't have a corresponding VP
intrinsic, so this fixes the crash by removing the assert and bailing
instead.
Update wide induction increments to use the same step as the corresponding
wide induction. This enables detecting induction increments directly in
VPlan and removes redundant splats.
Remove logic to re-use the previous basic block for the vector pre
header from VPBasicBlock::execute. The preheader is now modeled as
VPIRBasicBlock, so the code is no longer needed.
Split off from https://github.com/llvm/llvm-project/pull/108378.
Some patterns (in particular horizontal style patterns) can end up with shuffles straddling both sides of a binop/cmp.
Where individually the folds aren't worth it, by merging the (oneuse) shuffles we can notably reduce the net instruction count and cost.
One of the final steps towards finally addressing #34072
As we're reducing the use count of the operands its more likely that they will now fold, as they were previously being prevented by a m_OneUse check, or the cost of retaining the extra instruction had been too high.
This is necessary for some upcoming patches, although the only change so far is instruction ordering as it allows some SSE folds of 256/512-bit with 128-bit subvectors to occur earlier in foldShuffleToIdentity as the subvector concats are free.
Reapplied with a fix for foldSingleElementStore/scalarizeLoadExtract which were replacing/removing memory operations - we need to ensure that the worklist is populated in the correct order so all users of the old memory operations are erased first, so there are no remaining users of the loads when its time to remove them as well.
Pulled out of #120984