This patch uses the ExtractFromEnd VPInstruction opcode
to extract the value of a FOR to be used as resume value for the ph in
the scalar loop.
It adds a new live-out that temporarily wraps the FOR phi in the scalar
loop. fixFixedOrderRecurrence will process live outs for fixed order
recurrence phis by creating a new phi node in the scalar preheader,
using the generated value for the live-out as incoming value from the
middle block and the original start value as incoming value for the
other edge. Creation of the phi in the preheader, as well as updating
the phi in the scalar loop will also be moved to VPlan in the future,
eventually retiring fixFixedOrderRecurrence
Depends on https://github.com/llvm/llvm-project/pull/93395
PR: https://github.com/llvm/llvm-project/pull/93396
Update LAA to use PSE::getSymbolicMaxBackedgeTakenCount which returns
the minimum of the countable exits.
When analyzing dependences and computing runtime checks, we need the
smallest upper bound on the number of iterations. In terms of memory
safety, it shouldn't matter if any uncomputable exits leave the loop,
as long as we prove that there are no dependences given the minimum of
the countable exits. The same should apply also for generating runtime
checks.
Note that this shifts the responsiblity of checking whether all exit
counts are computable or handling early-exits to the users of LAA.
Depends on https://github.com/llvm/llvm-project/pull/93498
PR: https://github.com/llvm/llvm-project/pull/93499
VPTypeAnalysis::inferScalarTypeForRecipe is missing the case for
VPInstruction::LogicalAnd, due to which the test
vplan-incomplete-cases.ll crashes. Add this missing case, and move the
test in vplan-infer-not-or-type.ll to vplan-incomplete-cases.ll, showing
correct codegen for trip-counts 2 and 3.
This patch introduces a new ExtractFromEnd VPInstruction opcode to
extract the value of a FOR for users outside the loop (i.e. in the
scalar loop's exits). This moves the first part of fixing first order
recurrences to VPlan, and removes some additional code to patch up
live-outs, which is now handled automatically.
The majority of test changes is due to changes in the order of which the
extracts are generated now. As we are now using VPTransformState to
generate the extracts, we may be able to re-use existing extracts in the
loop body in some cases. For scalable vectors, in some cases we now have
to compute the runtime VF twice, as each extract is now independent, but
those should be trivial to clean up for later passes (and in line with
other places in the code that also liberally re-compute runtime VFs).
PR: https://github.com/llvm/llvm-project/pull/93395
For interleave groups we only create a pointer for the start of the
interleave group, not all original loads/stores. Mark single-use ops
feeding interleave group mem ops as free when vectorizing.
- Prevent null dereference: if the Mask given to
`ShuffleInstructionBuilder::adjustExtracts()` is empty or all-poison,
then `VecBase` will be `nullptr` and the call to
`castToScalarTyElem(VecBase)` will dereference it. Add an assert
to guard against this.
- Prevent use of uninitialized scalar: in the unlikely event that
`CandidateVFs` is empty, then `AnyProfitableGraph` will be
uninitialized in `if` condition following the loop. (This seems like a
false-positive, but I submitted this change anyways as initializing
bools costs nothing and is generally good practice)
Now that the VPlan for the main vector loop gets cloned in the epilogue
vectorization code path, there optimizeForVFAndUF can be applied
unconditionally.
This patch adds a new special type of VPBasicBlock that wraps an
existing IR basic block. Recipes of the block get added before the
terminator of the wrapped IR basic block. Making it a subclass of
VPBasicBlock avoids duplicating various APIs to manage recipes in a
block, as well as makes sure the traversals filtering VPBasicBlocks
automatically apply as well.
Initially VPIRBasicBlock are only used for the pre-preheader (wrapping
the original preheader of the scalar loop).
As follow-up, this will be used to move more parts of the skeleton
inside VPlan, starting with the branch and condition in the middle
block.
Separated out of https://github.com/llvm/llvm-project/pull/92651
PR: https://github.com/llvm/llvm-project/pull/93398
Generalize LoopVectorizationPlanner::isMoreProfitable smoothly across
the fixed-vector and scalable-vector cases, taking the trip-count into
account, and fixing logical pitfalls that arise from a lack of
generality.
This removes the check that both operands of the original shuffle are
instructions, which is a relic from a previous version that held more
variables as Instructions.
Other than some additional checks needed for compare predicates and
selects with scalar condition operands, these are relatively simple
additions to what already exists.
This just adds splat constants, which can be treated like any other
splat which hopefully makes them very simple. It does not try to handle
more complex constant vectors yet, just the more common splats.
As a follow-up to b2f65e80, use the DTU to also update and preserve
the DT in the native path. This should also allow preserving SCEV in the
native path
PR: https://github.com/llvm/llvm-project/pull/93287
This implements the `nusw` and `nuw` flags for `getelementptr` as
proposed at
https://discourse.llvm.org/t/rfc-add-nusw-and-nuw-flags-for-getelementptr/78672.
The three possible flags are encapsulated in the new `GEPNoWrapFlags`
class. Currently this class has a ctor from bool, interpreted as the
InBounds flag. This ctor should be removed in the future, as code gets
migrated to handle all flags.
There are a few places annotated with `TODO(gep_nowrap)`, where I've had
to touch code but opted to not infer or precisely preserve the new
flags, so as to keep this as NFC as possible and make sure any changes
of that kind get test coverage when they are made.
The transform updates all users of inductions to work based on EVL,
instead
of the VF directly. At the moment, widened inductions cannot be updated,
so
bail out if the plan contains any.
This patch introduces a check before applying EVL transform. If any
recipes in loop rely on RuntimeVF, the plan is discarded.
In LoopVectorizationCostModel::getInstructionCost(), when the condition
canTruncateToMinimalBitwidth() is satisfied, for a trunc, the source
type is computed as the smallest type of the source vector and the
destination vector, and the destination type is computed as the largest
type of the instruction and destination type. This is clearly a logical
error, as the original source vector type could be smaller than the
original destination vector type, and the trunc semantics are broken
because we're attempting to widen.
Fixes#47665.
One of the previous patches introduced initial support for non-power-of-2
number of elements but some parts of the SLP vectorizer still were not
adjusted to handle the costs correctly. Patch fixes it by improving
analysis of the non-power-of-2 number of elements and fixes in the cost
of the extractelements instructions.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/93213
Use DTU to queue DominatorTree updates directly when connecting basic
blocks during VPlan execution.
This simplifies DT updates and will also automatically allow updating
the DT for the VPlan-native path as additional benefit in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/92525
If trying to find vector value in shuffling of the extractelements and
one of the vector values is undef value, need to generate real mask value
for such vector and either undef vector, or incoming second vector, if
non-poisonous.
Update VPBlendRecipe::execute to support generating code for first-lane
only. This fixes a crash in the newly added test
@test_not_first_lane_only_wide_compare_incoming_order_swapped.
There are cases where a vector value has some users that demand the
the single scalar value only (NeedsScalar), while other users demand the
vector value (see attached test cases). In those cases, the NeedsScalar
users should only demand the first lane.
Fixes https://github.com/llvm/llvm-project/issues/91883.
This is probably the most involved addition, as it tries to make use of
isTriviallyVectorizable with isVectorIntrinsicWithScalarOpAtArg to handle a
number of different intrinsics that are all lane-wise. Additional tests have
been added for some of the different intrinsics from
isVectorIntrinsicWithScalarOpAtArg / isVectorIntrinsicWithOverloadTypeAtArg.
VPEVLBasedIVPHIRecipe inherits from VPSingleDefRecipe. Add
VPEVLBasedIVPHISC to VPSingleDefRecipe::classof to make isa/dyn_cast &
co work as expected.
Split off https://github.com/llvm/llvm-project/pull/67934.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
The matcher m_Trunc() matches an Operator with a given Opcode, which
could either be an Instruction or ConstExpr.
VectorCombine::foldTruncFromReductions() incorrectly assumes that the
pattern matched is always an Instruction, and attempts a cast. Fix this.
Fixes#88796.
We only version unknown strides to 1. If the original type is i1, then
the sign of the extension matters. Properly extend the stride value
before replacing it.
Fixes https://github.com/llvm/llvm-project/issues/91369.
Still need to do the full analysis of the signedness of the values
rather than rely on Instruction opcode, if the opcode is SExt. Still may
produce unsigned result.