Reverts llvm/llvm-project#124129, as it's currently causing a regression (see #124499). This avoids the regression until a proper fix can be added to getSpillCost.
Currently, SLP has two distinct storages that map vectorized
instructions to their corresponding vectorized TreeEntry nodes. This
makes lookup of the matching TreeEntries inefficient and makes it harder
to correctly track instructions associated with multiple nodes.
There is a plan to extend this support to instructions that require
scheduling, to allow support for copyable elements. Merging
ScalarToTreeEntry and MultiNodeScalars will reduce the maintenance
burden of the feature.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/124914
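For illustration only, the merged storage from the ScalarToTreeEntry/MultiNodeScalars change above can be pictured as a single map from a scalar to every tree entry that contains it. This is a minimal Python model, not the SLP implementation; `TreeEntry`, `ScalarToEntries`, `add` and `entries_for` are hypothetical names chosen for the sketch.

```python
from collections import defaultdict


class TreeEntry:
    """Hypothetical stand-in for an SLP tree node: an index plus its scalars."""

    def __init__(self, idx, scalars):
        self.idx = idx
        self.scalars = list(scalars)


class ScalarToEntries:
    """Single storage: each scalar maps to the list of entries containing it."""

    def __init__(self):
        self._map = defaultdict(list)

    def add(self, entry):
        # Register the entry under every scalar it contains, once per scalar.
        for scalar in entry.scalars:
            if entry not in self._map[scalar]:
                self._map[scalar].append(entry)

    def entries_for(self, scalar):
        # One lookup covers both the single-node and the multi-node case.
        return list(self._map.get(scalar, ()))
```

With one container, a scalar that belongs to several nodes needs no second lookup path: `entries_for` simply returns a longer list.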
Add new runPass helpers to run a VPlan transformation. This makes it
easier to add additional checks/functionality for each transform run. In
this patch, an option is added to run the verifier after each VPlan
transform.
Follow-ups will use the same helper to also support printing VPlans
after each transform.
Note that the verifier at the moment requires there to be a canonical IV
and vector loop region, so the final lowering transforms aren't run via
runPass yet.
PR: https://github.com/llvm/llvm-project/pull/123640
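The shape of the runPass helper described above can be sketched as a wrapper that runs one transform and then, if requested, a verifier. This is a Python model under stated assumptions: `run_pass`, `verify_each`, the toy list-of-strings plan and the two example functions are all hypothetical and do not mirror the actual VPlan signatures.

```python
def run_pass(transform, plan, *args, verify_each=False, verifier=None):
    """Run one transform on `plan`, then optionally verify the result."""
    result = transform(plan, *args)
    if verify_each and verifier is not None and not verifier(plan):
        name = getattr(transform, "__name__", repr(transform))
        raise RuntimeError(f"plan failed verification after {name}")
    return result


# Hypothetical toy plan: a list of recipe names.
def drop_dead_recipes(plan):
    """Example transform: remove recipes whose name starts with 'dead'."""
    plan[:] = [r for r in plan if not r.startswith("dead")]


def has_canonical_iv(plan):
    """Example verifier: the toy plan must still contain a canonical IV."""
    return "canonical-iv" in plan
```

Routing every transform through one helper gives a single place to hang further per-transform behaviour, such as printing.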
`sandboxir::Context` is defined at a pass-level scope with the
`SandboxVectorizerPass` class because the function pass manager `FPM`
object depends on it, and that is in pass-level scope to avoid
recreating the pass pipeline every single time `runOnFunction()` is
called.
This means that the Context's state lives on across function passes. The
problem is twofold:
(i) the LLVM IR to Sandbox IR map can grow very large including objects
from different functions, which is of no use to the vectorizer, as it's
a function-level pass.
(ii) this can result in stale data in the LLVM IR to Sandbox IR object
map, as other passes may delete LLVM IR objects.
To fix both issues this patch introduces a `Context::clear()` function
that clears the `LLVMValueToValueMap`.
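A minimal model of the lifetime issue and of `Context::clear()`, written in Python for brevity. The class layout, `get_or_create`, and the point at which `clear()` is called (at the end of each function run) are assumptions of this sketch, not a description of the actual pass.

```python
class Context:
    """Toy context: maps an LLVM IR object to its Sandbox IR wrapper."""

    def __init__(self):
        self.llvm_value_to_value = {}

    def get_or_create(self, llvm_value):
        if llvm_value not in self.llvm_value_to_value:
            self.llvm_value_to_value[llvm_value] = ("sandbox", llvm_value)
        return self.llvm_value_to_value[llvm_value]

    def clear(self):
        # Drop all mappings so nothing from this function outlives the run.
        self.llvm_value_to_value.clear()


class VectorizerPass:
    def __init__(self):
        # Created once at pass level and shared across function runs.
        self.ctx = Context()

    def run_on_function(self, instructions):
        try:
            wrapped = [self.ctx.get_or_create(i) for i in instructions]
            return len(wrapped)
        finally:
            # Assumed placement: clear after each function is processed.
            self.ctx.clear()
```

Clearing per function keeps the map bounded by the size of one function and removes the window in which another pass could delete an IR object that the map still references.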
Adds getNumberOfParts and uses it instead of similar code across the
code base. Also fixes the analysis of non-vectorizable types in
computeMinimumValueSizes.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/124774
MainOp must be included in the analysis of the instructions in
getSameOpcode, so that it is also checked against the requirements.
This prevents crashes during further analysis.
Change `getScaledReduction` to take an existing vector, rather than
creating and returning a new one each call.
Rename `getScaledReduction` to `getScaledReductions` to more accurately
reflect what it's now doing.
---------
Co-authored-by: Karlo Basioli <68535415+basioli-k@users.noreply.github.com>
This patch removes the assertion that checks for an existing function.
If one exists it will remove it and create a new one. This helps remove
a crash when a function declaration object already exists and we are
about to create a SandboxIR object for the definition.
We have two types of mask in SLP: a scalar mask and a vector mask.
When vectorizing four i32 additions into <4 x i32>, SLP creates a mask
of length 4.
When vectorizing four <2 x i32> additions into <8 x i32>, SLP also
creates a mask of length 4.
We refer to the first case as a scalar mask (because the mask element
represents a scalar, i32), and the second case as a vector mask (because
the mask element represents a vector, <2 x i32>).
At some point, we must convert the scalar mask into a vector mask
(otherwise, calling TTI cost functions or IRBuilderBase functions may
yield incorrect results).
Since both ShuffleCostEstimator and ShuffleInstructionBuilder can modify
the CommonMask, we have decided to perform the mask transformation only
within createShuffle. However, we do not store the transformed result,
as createShuffle may be called multiple times.
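One way to picture the mask transformation in the <2 x i32> example above: each of the four mask elements stands for a whole <2 x i32> value, so it expands into two consecutive element indices of the <8 x i32> vector. The Python sketch below is a model only; `expand_mask` is a hypothetical name, and using -1 for an undefined (poison) element is an assumption of the sketch.

```python
POISON = -1  # assumed marker for an undefined mask element


def expand_mask(mask, vec_elems):
    """Expand a mask whose elements each stand for `vec_elems` lanes.

    Index i becomes i*vec_elems, i*vec_elems + 1, ..., and a poison
    element becomes `vec_elems` poison lanes.
    """
    out = []
    for idx in mask:
        if idx == POISON:
            out.extend([POISON] * vec_elems)
        else:
            out.extend(idx * vec_elems + j for j in range(vec_elems))
    return out
```

For instance, the length-4 mask `[1, 0, 3, 2]` with two lanes per element expands to the length-8 mask `[2, 3, 0, 1, 6, 7, 4, 5]`. Because the function returns a fresh list and leaves its input untouched, it can be called repeatedly on the same CommonMask, which matches the decision not to store the transformed result.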
To finalise the "RemoveDIs" work removing debug intrinsics, we're
updating call sites that insert instructions to use iterators instead.
These changes are the ones where it's not immediately obvious that
just calling getIterator to fetch an iterator is correct, and one or two
places where more than one line needs to change.
Overall the same rule holds though: iterators generated for the start of
a block such as getFirstNonPHIIt need to be passed into insert/move
methods without being unwrapped/rewrapped, everything else can use
getIterator.
Live-ins don't need to be handled, other than adding to the exit phi
recipe. Do that early and assert that otherwise the exit value is
defined in the vector loop region.
This should enable simply skipping other exit values that do not need
further fixing, e.g. if handling the exit value from the early exit
directly in handleUncountableEarlyExit.
PR: https://github.com/llvm/llvm-project/pull/123819
Update HCFG construction to support multi-exit loops. If there is no
unique exit block, map the middle block of the initial plan to the exit
block from the latch.
This further unifies HCFG construction and prepares for also using it
to build an initial VPlan (VPlan0) for inner loops.
Effectively NFC as this isn't used on the default code path yet.
Update HCFG builder to preserve the original latch block of the initial
VPlan, ensuring there is always a latch.
It also skips creating the BranchOnCond for the latch of the top-level
loop, instead of removing it later. Exiting via the latch is controlled
by later recipes.
This further unifies HCFG construction and prepares for also using it
to build an initial VPlan (VPlan0) for inner loops.
This patch changes the functionality of `VecUtils::getLowest(Vals, BB)`
such that it filters out any instructions in `Vals` that are not in BB.
This is useful when Vals contains instructions from different BBs,
because in that case we are only interested in one BB.
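The filtering behaviour of `VecUtils::getLowest(Vals, BB)` described above can be modelled as follows. This is a Python sketch with hypothetical types: `Instr`, its `pos` field, and the reading of "lowest" as "latest in program order within the block" are assumptions of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    bb: str   # name of the parent basic block
    pos: int  # position inside the block; a larger value is lower down


def get_lowest(vals, bb):
    """Return the lowest instruction of `vals` located in `bb`, else None."""
    # Ignore non-instructions and instructions from other blocks.
    in_bb = [v for v in vals if isinstance(v, Instr) and v.bb == bb]
    if not in_bb:
        return None
    return max(in_bb, key=lambda v: v.pos)
```

Values from other blocks no longer force a null result; they are simply skipped.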
This patch implements a wrapper function for the LLVM IR verifier for
functions, and calls it (flag-guarded) within the bottom-up-vectorizer
for finding IR bugs as soon as they happen.
This patch implements cost modeling for Region. All instructions that
are added or removed get their cost counted in the Scoreboard. This is
used for checking if the region before or after a transformation is more
profitable.
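A rough model of the scoreboard idea from the Region cost-modeling patch above, in Python. The field names, the per-call `cost` argument, and the rule that removing an instruction which was itself added inside the region undoes its "after" cost are assumptions of this sketch rather than a description of the real class.

```python
class Scoreboard:
    """Tracks the cost of a region before and after a transformation."""

    def __init__(self):
        self.before_cost = 0  # cost of original instructions that were removed
        self.after_cost = 0   # cost of instructions added by the transformation
        self._added = {}      # instructions added in this region, with their cost

    def add(self, instr, cost):
        self._added[instr] = cost
        self.after_cost += cost

    def remove(self, instr, cost):
        if instr in self._added:
            # A temporary instruction: undo its contribution to the new code.
            self.after_cost -= self._added.pop(instr)
        else:
            # An original instruction: it counts towards the old code.
            self.before_cost += cost

    def is_profitable(self):
        return self.after_cost < self.before_cost
```

Replacing four scalar adds of cost 1 each with a single vector add of cost 1 gives `before_cost == 4` and `after_cost == 1`, so the transformed region is the more profitable one.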
This patch fixes a bug in the maintenance of the MemDGNode chain of the
DAG. Whenever we move a memory instruction, the DAG gets notified about
the move and maintains the chain of memory nodes. The bug was that if
the destination of the move was not a memory instruction, then the
memory node's next node would end up pointing to itself.
getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data to the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.
This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.
This is the reverse of #124129 where we weren't including argument data
when we could.
Fixes #63980
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago, but to ensure some type safety we'd like to
have all calls to getFirstNonPHI use the iterator-returning version.
This patch changes a bunch of call-sites calling getFirstNonPHI to use
getFirstNonPHIIt, which returns an iterator. All these call sites are
where it's obviously safe to fetch the iterator then dereference it. A
follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
getFirstNonPHI, but not before adding concise documentation of what
considerations are needed (very few).
---------
Co-authored-by: Stephen Tozer <Melamoto@gmail.com>
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago, but to ensure some type safety we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
Crossing BBs is not currently supported by the structures of the
vectorizer. This patch fixes instances where this was happening,
including:
- a walk of use-def operands that updates the UnscheduledSuccs counter,
- the dead instruction removal is now done per BB,
- the scheduler, which will reject bundles that cross BBs.
Introduced stack buffer overflow, see #120272.
`getScaledReduction` can return an empty vector, and there is no check
for that.
This reverts commit c9b7303b9b18129c4ee6b56aaa2a0a9f59be2d09.
This reverts commit caf0540b91b0fee31353dc7049ae836e0f814cff.
Chaining partial reductions, where multiple partial reductions share an
accumulator, allows more values to be combined as part of the reduction
without discarding the semantics of the partial reduction itself.
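To make the chaining of partial reductions concrete, here is a small numeric model in Python. `partial_reduce_add` is a hypothetical helper; which wide lanes fold into which accumulator lane is an arbitrary choice of this sketch, and only the overall sum is meant to be meaningful.

```python
def partial_reduce_add(acc, wide):
    """Fold a wide vector into a narrower accumulator, lane-wise.

    len(wide) must be a multiple of len(acc). The result has len(acc)
    lanes, and its total equals sum(acc) + sum(wide).
    """
    n = len(acc)
    if n == 0 or len(wide) % n != 0:
        raise ValueError("wide length must be a multiple of accumulator length")
    out = list(acc)
    for i, x in enumerate(wide):
        out[i % n] += x
    return out


def chained(acc, *wides):
    """Feed the result of each partial reduction into the next one."""
    for wide in wides:
        acc = partial_reduce_add(acc, wide)
    return acc
```

Chaining two four-lane inputs through a two-lane accumulator, `chained([0, 0], [1, 2, 3, 4], [10, 20, 30, 40])`, yields `[44, 66]`, whose total of 110 matches the sum of all inputs.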
We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)
Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.
Noticed while having yet another attempt at #63980
VecUtils::getLowest(Vals) returns the lowest instruction in the BB among Vals.
If the instructions are not in the same BB, or if none of them is an
instruction, it returns nullptr.
This patch implements the diamond pattern where we are vectorizing
toward the top of the diamond from both edges, but the second edge may
use elements from a different vector or just scalar values. This
requires some additional packing code (see lit test).
I've removed the HasUncountableEarlyExit variable, since we can
already determine whether or not a loop has an early exit by seeing
if we found an uncountable exit.
I have also deleted the old UncountableExitingBlocks and
UncountableExitBlocks lists and replaced them with a single
uncountable edge. This means we don't need to worry about keeping the
list entries in sync and makes it clear which exiting block
corresponds to which exit block.
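The data-structure change described above can be sketched as follows, in Python for brevity. `Edge`, `LoopLegality` and the method names are hypothetical; the point is only that one optional (exiting block, exit block) pair replaces two parallel lists plus a separate flag.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    exiting_block: str  # block inside the loop that branches out
    exit_block: str     # block outside the loop that it branches to


class LoopLegality:
    def __init__(self):
        # Replaces two parallel lists and a separate boolean.
        self.uncountable_edge: Optional[Edge] = None

    def set_uncountable_edge(self, exiting_block, exit_block):
        self.uncountable_edge = Edge(exiting_block, exit_block)

    def has_uncountable_early_exit(self):
        # Derived from the edge, so it cannot drift out of sync with it.
        return self.uncountable_edge is not None
```

Storing the pair as one value makes the correspondence between exiting block and exit block explicit, and there is no second list to keep aligned.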