My use case is simplifying the control flow generated by LoopVectorize
when vectorising loops whose trip count is a function of the runtime
vector length. This can be problematic because:
* CSE only runs before LoopVectorize, so it's common for an IR function
to include several calls to llvm.vscale(). (NOTE: Code generation will
typically remove the duplicates.)
* Pre-LoopVectorize instcombines will rewrite some multiplies as shifts.
This leads to a mismatch between the VL-based maths of the scalar loop
and that created for the vector loop, which prevents some obvious
simplifications.
SCEV does not suffer these issues because it effectively does CSE during
construction and shifts are represented as multiplies.
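The two SCEV properties mentioned above can be illustrated with a toy
model. This is only a sketch in Python, not LLVM's ScalarEvolution API:
the `ExprPool` class and its methods are hypothetical names invented for
the example. Nodes are uniqued on construction, and a shift by a
constant is stored as a multiply, so `vscale << 2` and `4 * vscale` end
up as the same node.

```python
# Toy model of SCEV-style uniquing; ExprPool is a hypothetical helper,
# not part of LLVM.
class ExprPool:
    def __init__(self):
        self._nodes = {}

    def _intern(self, key):
        # Return the one canonical node for a structural key.
        return self._nodes.setdefault(key, key)

    def const(self, value):
        return self._intern(("const", value))

    def vscale(self):
        return self._intern(("vscale",))

    def mul(self, a, b):
        # Sort operands so that a*b and b*a share one node.
        lhs, rhs = sorted((a, b), key=repr)
        return self._intern(("mul", lhs, rhs))

    def shl(self, a, amount):
        # A left shift by a constant is stored as a multiply by 2**amount.
        return self.mul(a, self.const(1 << amount))


pool = ExprPool()
scalar_loop_step = pool.shl(pool.vscale(), 2)               # vscale << 2
vector_loop_step = pool.mul(pool.const(4), pool.vscale())   # 4 * vscale
print(scalar_loop_step is vector_loop_step)
```

Because both expressions resolve to one object, a later comparison
between the scalar-loop and vector-loop maths becomes a pointer
comparison rather than a pattern match over IR.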
Always add pointers proved to be uniform via legal/SCEV to worklist.
This extends the existing logic to handle a few more pointers known to
be uniform.
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.
Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branches from entry blocks.
In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.
Depends on https://github.com/llvm/llvm-project/pull/153643.
PR: https://github.com/llvm/llvm-project/pull/154510
The motivation for this patch is to close the gap between the
VPlan-based CSE and the legacy CSE, to make it easier to remove the
legacy CSE. Before this patch, stubbing out the legacy CSE leads to 22
test failures, and after this patch, there are only 12 failures, and all
of them seem to have a single root cause:
VPlanTransforms::createInterleaveGroups() and
VPInterleaveGroup::execute(). The improvements from this patch are of
course welcome.
While developing the patch, a miscompile was found when GEP
source-element-types differ, and this has been fixed.
Co-authored-by: Florian Hahn <flo@fhahn.com>
Co-authored-by: Luke Lau <luke@igalia.com>
Update calculateRegisterUsageForPlan to track liveness of VPValues
instead of recipes. This gives slightly more accurate results for
recipes that define multiple values (e.g. VPInterleaveRecipe).
When tracking the liveness of recipes, all VPValues defined by a
VPInterleaveRecipe are considered alive until the last use of any of
them. When tracking the liveness of individual VPValues, we can
accurately track the individual values until their last use.
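The difference between the two tracking schemes can be seen in a small
model. The following Python sketch is an illustration only, not the code
in calculateRegisterUsageForPlan; the function name, the position
numbering and the rule that a value is live from just after its
definition up to and including its last use are all assumptions made for
the example.

```python
def max_live_values(recipes, last_use, per_value):
    """recipes: list of (position, [defined value names]).
    last_use: value name -> position of its last user."""
    intervals = []
    for position, values in recipes:
        if per_value:
            # Each value ends at its own last use.
            ends = [last_use[v] for v in values]
        else:
            # Recipe-level tracking: all values end at the latest last use.
            ends = [max(last_use[v] for v in values)] * len(values)
        intervals += [(position, end) for end in ends]
    horizon = max(end for _, end in intervals)
    return max(
        sum(1 for start, end in intervals if start < point <= end)
        for point in range(horizon + 1)
    )


# One recipe at position 0 defines both "a" and "b", like an interleave group.
recipes = [(0, ["a", "b"]), (2, ["c"]), (3, ["d"])]
last_use = {"a": 1, "b": 4, "c": 4, "d": 4}
print(max_live_values(recipes, last_use, per_value=False))
print(max_live_values(recipes, last_use, per_value=True))
```

In this example recipe-level tracking reports 4 simultaneously live
values, while per-value tracking reports 3, because "a" is released
after its last use at position 1.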
Note the changes in large-loop-rdx.ll and pr47437.ll. This patch
restores the original behavior before introducing VPlan-based liveness
tracking.
PR: https://github.com/llvm/llvm-project/pull/155301
Add extra test coverage for
https://github.com/llvm/llvm-project/pull/149706. The added loop should
be interleaved, after narrowing interleave groups, which requires moving
the transform earlier.
Track which ops have already been narrowed, to avoid narrowing the same
operation multiple times. Repeated narrowing will lead to incorrect
results, because we could first narrow from an interleave group -> wide
load, and then narrow the wide load -> single-scalar load.
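A toy model of the problem, as a hedged Python sketch: the `narrow` and
`narrow_operands` helpers and the string labels are hypothetical and do
not correspond to the actual narrowInterleaveGroups code. The same
operation is reached through two users; without a set of
already-narrowed ops it is narrowed twice.

```python
def narrow(kind):
    # Hypothetical narrowing step: each call moves one step down the chain.
    steps = {"interleave-group": "wide-load", "wide-load": "single-scalar-load"}
    return steps.get(kind, kind)


def narrow_operands(operands_per_user, track_narrowed):
    current = {}       # op name -> current kind
    narrowed = set()   # ops that were already narrowed once
    for user_operands in operands_per_user:
        for name, kind in user_operands:
            if track_narrowed and name in narrowed:
                continue
            current[name] = narrow(current.get(name, kind))
            narrowed.add(name)
    return current


# The same load %l is an operand of two different users.
users = [[("%l", "interleave-group")], [("%l", "interleave-group")]]
print(narrow_operands(users, track_narrowed=False))
print(narrow_operands(users, track_narrowed=True))
```

With tracking enabled, `%l` stops at the wide load, which is the
intended result.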
Fixes https://github.com/llvm/llvm-project/issues/156190.
Split GEPs that have more than one non-zero offset into two GEPs. This
is in preparation for the ptradd migration, which can only represent
GEPs with a single offset.
This also enables CSE and LICM of the common base.
The sve-epilog-vscale-fixed.ll test checks that a fixed-width epilogue
VF is preferred over a scalable one when costs are equal. This NFC patch
makes the TC in the test case unknown, to avoid the epilogue being
folded away by future LV changes.
Remove instcombine, simplifycfg and dce from some tests, as they make it
a bit more difficult to see the codegen coming out of LV and most
simplifications are already done on the VPlan-level.
Also modernizes some check lines.
In #149056 VF pruning was changed so that it only pruned VFs that
stemmed from MaxBandwidth being enabled.
However we always compute register pressure regardless of whether or not
max bandwidth is permitted for any VFs (via
`MaxPermissibleVFWithoutMaxBW`).
This skips the computation if not needed and renames the method for
clarity.
The diff in reg-usage.ll is due to the scalable VPlan not actually
having any maxbandwidth VFs, so I've changed it to check the
fixed-length VF instead, which is affected by maxbandwidth.
When costs are equal, prefer an epilogue with a fixed-width VF over a
scalable VF. That is helpful in cases like post-LTO vectorization, where
an epilogue with a fixed-width VF can be removed once we eventually know
that the trip count is less than the epilogue iterations.
This PR reassociates logical ands in order to enable more
simplifications.
The driving motivation for this is that with tail folding all blocks
inside the loop body will end up using the header mask. However this can
end up nestled deep within a chain of logical ands from other edges.
Typically the header mask will be a leaf nested in the LHS, e.g.
(headermask & y) & z. So pulling it out allows it to be simplified
further, e.g. allows it to be optimised away to VP intrinsics with EVL
tail folding.
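The rewrite described above can be sketched on a tiny expression tree.
This Python snippet is an illustration only, not the VPlan transform
itself; the tuple encoding and the `reassociate` helper are hypothetical.

```python
def reassociate(expr, header="headermask"):
    """Rewrite ((header & y) & z) into (header & (y & z)), bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = reassociate(lhs, header), reassociate(rhs, header)
    if (op == "and" and isinstance(lhs, tuple)
            and lhs[0] == "and" and lhs[1] == header):
        # Pull the header mask out so it becomes the outermost LHS.
        return ("and", header, ("and", lhs[2], rhs))
    return (op, lhs, rhs)


nested = ("and", ("and", ("and", "headermask", "y"), "z"), "w")
print(reassociate(nested))
```

After the rewrite the header mask is the left operand of the outermost
logical and, where a later simplification can match and remove it.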
Introduce a simple common-subexpression-elimination pass at the
VPlan-level, running late during the execution of the VPlan. The
long-term vision is to get rid of the legacy non-VPlan-based CSE routine
in LV, but this patch doesn't yet fully subsume it.
We currently only emit the branch weights for the epilogue
iteration count check if there was already branch weight
data for the scalar loop. However, the code makes no use
of the existing branch weight when estimating the
likelihood of taking a particular branch and so we can
just always add the branch weights regardless. These
hints should hopefully improve code generation.
This patch adds a new flag (-enable-wide-lane-mask) which allows
LoopVectorize to generate wider-than-VF active lane masks when it
is safe to do so (i.e. the mask is used for data and control flow).
The transform in extractFromWideActiveLaneMask creates vector
extracts from the first active lane mask in the header & loop body,
modifying the active lane mask phi operands to use the extracts.
An additional operand is passed to the ActiveLaneMask instruction,
the value of which is used as a multiplier of VF when generating the
mask.
By default this is 1, and is updated to UF by
extractFromWideActiveLaneMask.
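As a rough illustration of why the extracts are equivalent to separate
per-part masks, here is a Python sketch. It models a lane mask as a list
of booleans and ignores overflow; the helper names and the concrete
VF, UF and trip count values are assumptions made for the example, not
taken from the patch.

```python
def active_lane_mask(start, trip_count, lanes):
    # Lane i is active while start + i is still below the trip count.
    return [start + i < trip_count for i in range(lanes)]


def extract_parts(wide_mask, vf, uf):
    # Split a VF * UF wide mask into UF consecutive VF-wide parts.
    return [wide_mask[p * vf:(p + 1) * vf] for p in range(uf)]


VF, UF, TC, IV = 4, 2, 6, 0
wide = active_lane_mask(IV, TC, VF * UF)      # multiplier of VF is UF
parts = extract_parts(wide, VF, UF)
separate = [active_lane_mask(IV + p * VF, TC, VF) for p in range(UF)]
print(parts == separate)
```

Each extracted part matches the mask that would otherwise be computed
separately for that unrolled part.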
The motivation for this change is to improve interleaved loops when
SVE2.1 is available, where we can make use of the whilelo instruction
which returns a predicate pair.
This is based on a PR that was created by @momchil-velikov (#81140)
and contains tests which were added there.
GEPs are often in the form `gep [N x %T], ptr %p, i64 0, i64 %idx`.
Canonicalize these to `gep %T, ptr %p, i64 %idx`.
This enables transforms that only support one GEP index to work and
improves CSE.
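The two forms compute the same address because the leading zero index
contributes nothing. A small Python sketch of the byte-offset
arithmetic, with hypothetical helper names and `elem_size` standing in
for the allocation size of `%T`:

```python
def offset_array_gep(n, elem_size, idx):
    # gep [N x %T], ptr %p, i64 0, i64 %idx
    # The leading index steps over whole [N x %T] arrays, the second over %T.
    return 0 * (n * elem_size) + idx * elem_size


def offset_canonical_gep(elem_size, idx):
    # gep %T, ptr %p, i64 %idx
    return idx * elem_size


print(offset_array_gep(16, 4, 5) == offset_canonical_gep(4, 5))
```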
Various transforms were recently hardened to make sure they still work
without the leading index.
If a phi is widened with tail folding, all of its predecessors will have
a mask of the form
%x = logical-and %active-lane-mask, %foo
%y = logical-and %active-lane-mask, %bar
%z = logical-and %active-lane-mask, %baz
...
We can remove the common %active-lane-mask from all of these edge masks,
which in turn allows us to simplify a lot of VPBlendRecipes.
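The pattern match can be sketched as follows. This is a hypothetical
Python model, not the VPlan implementation: masks are tuples of the form
`("logical-and", lhs, rhs)` and `strip_common` is an invented helper.
The common operand is only dropped when every edge mask has it on the
LHS.

```python
def strip_common(edge_masks, common):
    # Only strip when every edge mask has the form (common AND rest).
    if all(m[0] == "logical-and" and m[1] == common for m in edge_masks):
        return [m[2] for m in edge_masks]
    return edge_masks


masks = [
    ("logical-and", "%active-lane-mask", "%foo"),
    ("logical-and", "%active-lane-mask", "%bar"),
    ("logical-and", "%active-lane-mask", "%baz"),
]
print(strip_common(masks, "%active-lane-mask"))
```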
In particular, it allows the header mask to be removed in selects with
EVL tail folding, improving RISC-V codegen on SPEC CPU 2017 for
525.x264_r, and supersedes #147243.
This also allows us to remove VPBlendRecipe and directly emit
VPInstruction::Select in another patch.
Update narrowInterleaveGroups to support scalable VFs. After the
transform, the vector loop will process a single iteration of the
original vector loop for fixed-width vectors and vscale iterations for
scalable vectors.
This patch adds a new VPlan-based addMinimumIterationCheck, which
replaces the ILV version for the non-epilogue case.
The VPlan-based version constructs a SCEV expression to compute the
minimum iterations and uses it to determine whether the check is known
true or false. Otherwise it creates a VPExpandSCEV recipe and emits a
compare-and-branch.
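The three outcomes can be modelled with a short Python sketch. It is
only an approximation under stated assumptions: the check is taken to be
`trip count < VF * UF` with a strict comparison, a required scalar
epilogue is ignored, and the known range of the trip count is passed in
directly instead of being derived from SCEV. The function name is
hypothetical.

```python
def minimum_iteration_check(tc_min, tc_max, vf, uf):
    """Classify the check `trip count < VF * UF` for a trip count
    known to lie in [tc_min, tc_max]."""
    step = vf * uf
    if tc_max < step:
        return "known-true"    # the vector loop is always skipped
    if tc_min >= step:
        return "known-false"   # the vector loop always runs at least once
    return "runtime-check"     # emit a compare-and-branch


print(minimum_iteration_check(16, 1024, vf=4, uf=2))
print(minimum_iteration_check(0, 1024, vf=4, uf=2))
```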
When using epilogue vectorization, we still need to create the minimum
trip-count-check during the legacy skeleton creation. The patch moves
the definitions out of ILV.
PR: https://github.com/llvm/llvm-project/pull/153643
This is about code readability. The operands in the disjunction forming
the combined predicate in `mergeConditionalStoreToAddress` could
sometimes be negated twice. This patch addresses that.
Two tests needed updating because they exposed the double negation and
now they don’t.
LoopVectorizationCostModel::expectedCost will only override the cost
returned by getInstructionCost when valid. This patch ensures we do
the same in VPCostContext::getLegacyCost, avoiding the "VPlan cost
model and legacy cost model disagreed" assert in the included test.
Extend [Specific]Cmp_match to handle floating-point compares, and
introduce m_Cmp that matches both integer and floating-point compares.
Use it in simplifyRecipe to match and simplify the general case of
compares. The change has necessitated a bugfix in
VPReplicateRecipe::execute.
SimplifyBranchConditionForVFAndUF only recognized canonical IVs and a
few PHI recipes in the loop header. With more IV-step optimizations,
the canonical widen-canonical-iv can be replaced by a canonical
VPWidenIntOrFpInduction, which the pass did not handle, causing
regressions (missed simplifications).
This patch replaces canonical VPWidenIntOrFpInduction with a StepVector
in the vector preheader since the vector loop region only executes once.
After a485e0e, we may not set the vector trip count in
preparePlanForEpilogueVectorLoop if it is zero. We should not choose a
VF * UF that makes the main vector loop dead (i.e. vector trip count is
zero), but there are some cases where this can happen currently.
In those cases, set EPI.VectorTripCount to zero.
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector. Suppose a vector has
a given ElementCount EC, the extracted/inserted lane would be
EC - 1 - Index. For scalable vectors this index is unknown at
compile time. I've added an AArch64 hook that better represents
the cost, and also a RISCV hook that maintains compatibility
with the behaviour prior to this PR.
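The reverse indexing rule can be checked with a few lines of Python.
This is a sketch of the arithmetic only, not the TTI interface; the
helper names are hypothetical, and `vscale` is supplied as a plain
integer even though it is unknown at compile time.

```python
def reverse_lane(element_count, index):
    # Lane addressed from the end of the vector: EC - 1 - Index.
    return element_count - 1 - index


def last_lane_scalable(known_min, vscale):
    # For <vscale x known_min x T>, EC = known_min * vscale at run time.
    return reverse_lane(known_min * vscale, 0)


known_min = 4
for vscale in (1, 2, 4):
    # known_min - 1 only names the last lane when vscale == 1.
    print(vscale, last_lane_scalable(known_min, vscale), known_min - 1)
```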
I've also taken the liberty of adding support in VPlan for
calculating the cost of VPInstruction::ExtractLastElement.