#114901 exposed that foldExtractedCmps didn't account for non-commutative binops, and were disabled by 05e838f428555bcc4507bd37912da60ea9110ef6
This patch re-enables support for non-commutative binops by ensuring that the LHS/RHS arg order of the binop is retained.
With PR #88385 I am introducing support for vectorising more loops with
early exits that don't require a scalar epilogue. As such, if a loop
doesn't have a unique exit block it will not automatically imply we
require a scalar epilogue. Also, in all places in the code today where
we use the variable LoopExitBlock we actually mean the exit block from
the latch. Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock that allows the caller to determine the exit
block taken from the latch and use this instead of getUniqueExitBlock. I
also renamed LoopExitBlock to be LatchExitBlock. I feel this not only
better reflects how the variable is used today, but also prepares the
code for PR #88385.
While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e.
when we're not exiting from the latch block. This doesn't always imply
we have multiple exits, e.g. see the test in
Transforms/LoopVectorize/unroll_nonlatch.ll
where the latch unconditionally branches back to the only exiting block.
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.
This patch performs transformation to convert
select(header_mask, LHS, RHS)
into
vp.merge(all-true, LHS, RHS, EVL)
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.
TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
The fold needs to be adjusted to correctly track the LHS/RHS operands, which will take some refactoring, for now just disable the fold in this case.
Fixes#114901
Iteration runtime check confirms whether the trip count is greater than
VFxUF at least. Therefore, there is no need to adjust the
MinProfitableTripCount to VF if it is zero.
Retaining the original MinProfitableTripCount information is also
beneficial for supporting more profitable runtime checks in the future.
This moves the common logic to connect IRBBs created for a VPBB to their
predecessors in the VPlan CFG, making it easier to keep in sync in the
future.
Consider all possible reductions ops as being non-poisoning boolean
logical operations, which require freeze to be fully correct.
https://alive2.llvm.org/ce/z/TKWDMPFixes#114738
Provide computeCost implementations for all remaining sub-classes of
VPSingleDefRecipe. This pushes one of the last uses of getLegacyCost
directly to its user: VPReplicateRecipe.
This work is in preparation for PRs #112138 and #88385 where
the middle block is not guaranteed to be the immediate successor
to the region block. I've simply add new getMiddleBlock()
interfaces to VPlan that for now just return
cast<VPBasicBlock>(VectorRegion->getSingleSuccessor())
Once PR #112138 lands we'll need to do more work to discover
the middle block.
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.
Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.
PR: https://github.com/llvm/llvm-project/pull/109975
The code in EH and non-returning blocks can be skipped by the
vectorizer, since it does not add to the perfromance, just consumes
compile/link time.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/112221
If the instruction is vectorized and it is a part of the reduced values
gather/buildvector node, it should replaced in reduced operation
instructions before removal properly, to avoid compiler crash.
Fixes#114371
Infers member MayReadFromMemory, MayWriteToMemory, and
MayHaveSideEffects based on intrinsic attributes.
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
Use VPInstruction::ResumePhi to create phi nodes for reduction resume
values in the scalar preheader, similar to how ResumePhis are used for
first-order recurrence resume values after 9a5a8731e77.
This allows simplifying createAndCollectMergePhiForReduction to only
collect reduction resume phis when vectorizing epilogue loops and adding
extra incoming edges from the main vector loop. Updating phis for the
epilogue vector loops requires special attention, because additional
incoming values from the bypass blocks need to be added.
PR: https://github.com/llvm/llvm-project/pull/110004
If the list of scalars vectorized as the part of the same vector node,
no need to generate vector node again, it will be handled as part of
overlapping matching.
Fixes#113810
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.
Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
In some cases, VPWidenCastRecipes are created but not considered in the
legacy cost model, including truncates/extends when evaluating a reduction
in a smaller type. Return 0 for such casts for now, to avoid divergences
between VPlan and legacy cost models.
Fixes https://github.com/llvm/llvm-project/issues/113526.
If the scalars is used externally is in the root node, it may have
incorrect signedness info because of the conflict with the demanded bits
analysis. Need to perform exact signedness analysis and compute it
rather than rely on the precomputed value, which might be incorrect for
alternate zext/sext nodes.
Fixes#113520
Since SLP support "clusterization" of the non-load instructions, the
restriction for reduced values for loads only should be removed to avoid
compiler crash.
Fixes#113516
Need to consider undefs correctly, when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values, instead complex analysis should be
implemented to see if it is safe to do it.
Fixes#113425