Now that the VPlan for the main vector loop gets cloned in the epilogue
vectorization code path, there optimizeForVFAndUF can be applied
unconditionally.
This patch adds a new special type of VPBasicBlock that wraps an
existing IR basic block. Recipes of the block get added before the
terminator of the wrapped IR basic block. Making it a subclass of
VPBasicBlock avoids duplicating various APIs to manage recipes in a
block, as well as makes sure the traversals filtering VPBasicBlocks
automatically apply as well.
Initially VPIRBasicBlock are only used for the pre-preheader (wrapping
the original preheader of the scalar loop).
As follow-up, this will be used to move more parts of the skeleton
inside VPlan, starting with the branch and condition in the middle
block.
Separated out of https://github.com/llvm/llvm-project/pull/92651
PR: https://github.com/llvm/llvm-project/pull/93398
Generalize LoopVectorizationPlanner::isMoreProfitable smoothly across
the fixed-vector and scalable-vector cases, taking the trip-count into
account, and fixing logical pitfalls that arise from a lack of
generality.
As a follow-up to b2f65e80, use the DTU to also update and preserve
the DT in the native path. This should also allow preserving SCEV in the
native path
PR: https://github.com/llvm/llvm-project/pull/93287
The transform updates all users of inductions to work based on EVL,
instead
of the VF directly. At the moment, widened inductions cannot be updated,
so
bail out if the plan contains any.
This patch introduces a check before applying EVL transform. If any
recipes in loop rely on RuntimeVF, the plan is discarded.
In LoopVectorizationCostModel::getInstructionCost(), when the condition
canTruncateToMinimalBitwidth() is satisfied, for a trunc, the source
type is computed as the smallest type of the source vector and the
destination vector, and the destination type is computed as the largest
type of the instruction and destination type. This is clearly a logical
error, as the original source vector type could be smaller than the
original destination vector type, and the trunc semantics are broken
because we're attempting to widen.
Fixes#47665.
Use DTU to queue DominatorTree updates directly when connecting basic
blocks during VPlan execution.
This simplifies DT updates and will also automatically allow updating
the DT for the VPlan-native path as additional benefit in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/92525
We only version unknown strides to 1. If the original type is i1, then
the sign of the extension matters. Properly extend the stride value
before replacing it.
Fixes https://github.com/llvm/llvm-project/issues/91369.
This reverts the revert commit c6e01627acf859.
This patch includes a fix for any-of reductions and epilogue
vectorization. Extra test coverage for the issue that caused the revert
has been added in bce3bfced5fe0b019 and an assertion has been added in
c7209cbb8be7a3c65813.
--------------------------------
Original commit message:
Update AnyOf reduction code generation to only keep track of the AnyOf
property in a boolean vector in the loop, only selecting either the new
or start value in the middle block.
The patch incorporates feedback from https://reviews.llvm.org/D153697.
This fixes the #62565, as now there aren't multiple uses of the
start/new values.
Fixes https://github.com/llvm/llvm-project/issues/62565
PR: https://github.com/llvm/llvm-project/pull/78304
Support for predicated vector reverse intrinsic was added some time ago.
Adds support for predicated reversed loads/stores in the loop
vectorizer.
Reviewers: fhahn
Reviewed By: fhahn
Pull Request: https://github.com/llvm/llvm-project/pull/88025
This patch adds an assert to createAndCollectMergePhiForReduction to
make sure there is a resume phi when vectorizing the epilogue loop. This
is needed to set the resume value from the main vector loop.
This assertion guards against the issue caused the revert of
https://github.com/llvm/llvm-project/pull/78304.
Replace relying on the underling CallInst for looking up the called
function and its types by instead adding the called function as operand,
in line with how called functions are handled in CallInst.
Operand bundles, metadata and fast-math flags are optionally used if
there's an underlying CallInst.
This enables creating VPWidenCallRecipes without requiring an underlying
IR instruction.
The conditional branch from the loop latch will be replaced by a
single branch controlling the loop, so there is no extra overhead from
scalarization. This improves the cost esimates in some cases.
This patch is moving out following intrinsics:
* vector.interleave2/deinterleave2
* vector.reverse
* vector.splice
from the experimental namespace.
All these intrinsics exist in LLVM for more than a year now, and are
widely used, so should not be considered as experimental.
If we vectorize a loop with multiple exits, all exiting branches should
be considered uniform, as the resulting loop will be controlled by the
canonical IV only. Previously we were overestimating the cost of values
contributing to the other exits.
Introduce new subclasses of VPWidenMemoryRecipe for VP
(vector-predicated) loads and stores to address multiple TODOs from
https://github.com/llvm/llvm-project/pull/76172
Note that the introduction of the new recipes also improves code-gen for
VP gather/scatters by removing the redundant header mask. With the new
approach, it is not sufficient to look at users of the widened canonical
IV to find all uses of the header mask.
In some cases, a widened IV is used instead of separately widening the
canonical IV. To handle that, first collect all VPValues representing header
masks (by looking at users of both the canonical IV and widened inductions
that are canonical) and then checking all users (recursively) of those header
masks.
Depends on https://github.com/llvm/llvm-project/pull/87411.
PR: https://github.com/llvm/llvm-project/pull/87816
When collecting loop scalars, LoopVectorize over-eagerly marks the
induction variable and its update as scalars after vectorization, even
if the induction variable update is a first-order recurrence. Guard the
process with this check, fixing a crash.
Fixes#72969.
In the process of collecting instructions to scalarize, LoopVectorize
uses faulty reasoning whereby it also adds instructions that will be
scalar after vectorization. If an instruction satisfies
isScalarAfterVectorization() for the given VF, it should not be appended
to InstsToScalarize. Add this extra guard, fixing a crash.
Fixes#55096.
This patch introduces a new VPWidenMemoryRecipe base class and distinct
sub-classes to model loads and stores.
This is a first step in an effort to simplify and modularize code
generation for widened loads and stores and enable adding further more
specialized memory recipes.
PR: https://github.com/llvm/llvm-project/pull/87411
Using fp type in the compiler is not the best idea, here it used with
the comparison for equal to 0 and may cause undefined behavior in some
cases.
Reviewers: fhahn
Reviewed By: fhahn
Pull Request: https://github.com/llvm/llvm-project/pull/87241
VPBlendRecipe does not use the first mask operand. Removing it allows
VPlan-based DCE to remove unused mask computations.
This also fixes#87410, where unused Not VPInstructions are considered
having only their first lane demanded, but some of their operands
providing a vector value due to other users.
Fixes https://github.com/llvm/llvm-project/issues/87410
PR: https://github.com/llvm/llvm-project/pull/87770
Instead of using ILV.useOrderedReductions during ::execute, instead
store the information at recipe construction.
Another step towards making recipe'::execute independent of legacy ILV.
This reverts the revert commit 589c7abb03448.
This patch includes a fix for any-of reductions and epilogue
vectorization. Extra test coverage for the issue that caused the revert
has been added in 399ff08e29d.
--------------------------------
Original commit message:
Update AnyOf reduction code generation to only keep track of the AnyOf
property in a boolean vector in the loop, only selecting either the new
or start value in the middle block.
The patch incorporates feedback from https://reviews.llvm.org/D153697.
This fixes the #62565, as now there aren't multiple uses of the
start/new values.
Fixes https://github.com/llvm/llvm-project/issues/62565
PR: https://github.com/llvm/llvm-project/pull/78304
This patch introduces generating VP intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication in a very
limited capacity via tail-folding and masked load/store/gather/scatter
intrinsics. However, this does not let architectures with active vector
length predication support take advantage of their capabilities.
Architectures with general masked predication support also can only take
advantage of predication on memory operations. By having a way for the
Loop Vectorizer to generate Vector Predication intrinsics, which (will)
provide a target-independent way to model predicated vector
instructions. These architectures can make better use of their
predication capabilities.
Our first approach (implemented in this patch) builds on top of the
existing tail-folding mechanism in the LV (just adds a new tail-folding
mode using EVL), but instead of generating masked intrinsics for memory
operations it generates VP intrinsics for loads/stores instructions. The
patch adds a new VPlanTransforms to replace the wide header predicate
compare with EVL and updates codegen for load/stores to use VP
store/load with EVL.
Other important part of this approach is how the Explicit Vector Length
is computed. (VP intrinsics define this vector length parameter as
Explicit Vector Length (EVL)). We use an experimental intrinsic
`get_vector_length`, that can be lowered to architecture specific
instruction(s) to compute EVL.
Also, added a new recipe to emit instructions for computing EVL. Using
VPlan in this way will eventually help build and compare VPlans
corresponding to different strategies and alternatives.
Differential Revision: https://reviews.llvm.org/D99750
Add a new PtrAdd opcode to VPInstruction that corresponds to
IRBuilder::CreatePtrAdd, which creates a GEP with source element type
i8.
This is then used to model scalarizing VPWidenPointerInductionRecipe by
introducing scalar-steps to model the index increment followed by a
PtrAdd.
Note that PtrAdd needs to be able to generate code for only the first
lane or for all lanes. This may warrant introducing a separate recipe
for scalarizing that can be created without relying on the underlying
IR.
Depends on https://github.com/llvm/llvm-project/pull/80271
PR: https://github.com/llvm/llvm-project/pull/83068
Instead of keeping a mapping of Inst->VPValues (of their corresponding
recipes) in VPlan's Value2VPValue mapping, keep it in VPRecipeBuilder
instead. After recently replacing the last user of this mapping after
initial construction, this mapping is only needed for recipe
construction (to map IR operands to VPValue operands).
By moving the mapping, VPlan's VPValue tracking can be simplified and
limited only to live-ins. It also allows removing disableValue2VPValue
and associated machinery & asserts.
PR: https://github.com/llvm/llvm-project/pull/84464
Instead of passing VPlan in a number of places, just store it directly
in VPRecipeBuilder. A single instance is only used for a single VPlan.
This simplifies the code and was suggested by @nikolaypanchenko in
https://github.com/llvm/llvm-project/pull/84464.
getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer
to compute the cost of frem, which becomes a call cost on AArch64 when
TLI has a vector library function.
Add tests that do SLP vectorization for code that contains 2x double and
4x float frem instructions.
As part of the RemoveDIs project we need LLVM to insert instructions using
iterators wherever possible, so that the iterators can carry a bit of
debug-info. This commit implements some of that by updating the contents of
llvm/lib/Transforms/Utils to always use iterator-versions of instruction
constructors.
There are two general flavours of update:
* Almost all call-sites just call getIterator on an instruction
* Several make use of an existing iterator (scenarios where the code is
actually significant for debug-info)
The underlying logic is that any call to getFirstInsertionPt or similar
APIs that identify the start of a block need to have that iterator passed
directly to the insertion function, without being converted to a bare
Instruction pointer along the way.
Noteworthy changes:
* FindInsertedValue now takes an optional iterator rather than an
instruction pointer, as we need to always insert with iterators,
* I've added a few iterator-taking versions of some value-tracking and
DomTree methods -- they just unwrap the iterator. These are purely
convenience methods to avoid extra syntax in some passes.
* A few calls to getNextNode become std::next instead (to keep in the
theme of using iterators for positions),
* SeparateConstOffsetFromGEP has it's insertion-position field changed.
Noteworthy because it's not a purely localised spelling change.
All this should be NFC.
Recent set of changes (PR #67725) in loop interleaving algorithm caused removal of the loop trip count threshold for allowing interleaving. Therefore configuration option interleave-small-loop-scalar-reduction is no longer needed.