There's a pattern throughout LLVM of cl::opts being exported. That in
itself is probably a bit unfortunate, but what's especially bad about it
is that a lot of those symbols are in the global namespace. Move them
into the llvm namespace.
While doing this, I noticed some other variables in the global namespace
and moved them as well.
Extend replaceSymbolicStrides to also replace SCEVUnknowns in
VPExpandSCEVExprs using the information from StridesMaps.
This results in simpler SCEV expansions in some cases.
Make sure that we set the correct wrap flags when creating new
VPWidenCastRecipes for truncs and preserve the flags from the recipe
directly when cloning, to make sure they are not dropped.
Fixes https://github.com/llvm/llvm-project/issues/160396
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.
Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branches from entry blocks.
In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.
Depends on https://github.com/llvm/llvm-project/pull/153643.
PR: https://github.com/llvm/llvm-project/pull/154510
Follow up on 7fb3a91 ([PatternMatch] Introduce match functor) to
introduce the VPlanPatternMatch version of the match functor to shorten
some idioms.
Co-authored-by: Luke Lau <luke@igalia.com>
The motivation for this patch is to close the gap between the
VPlan-based CSE and the legacy CSE, to make it easier to remove the
legacy CSE. Before this patch, stubbing out the legacy CSE leads to 22
test failures, and after this patch, there are only 12 failures, and all
of them seem to have a single root cause:
VPlanTransforms::createInterleaveGroups() and
VPInterleaveGroup::execute(). The improvements from this patch are of
course welcome.
While developing the patch, a miscompile was found when GEP
source-element-types differ, and this has been fixed.
Co-authored-by: Florian Hahn <flo@fhahn.com>
Co-authored-by: Luke Lau <luke@igalia.com>
vputils::isSingleScalar(A) may return true for recipes that produce only
a single scalar value, but they could still end up as a vector
instruction, because the recipe could not be converted to a
single-scalar VPInstruction/VPReplicateRecipe.
For now, only apply the fold for recipes guaranteed to produce a single
value, i.e. single-scalar VPInstructions and VPReplicateRecipes.
Fixes https://github.com/llvm/llvm-project/issues/158319.
Extend replicateByVF added in #142433 (aa240293190) to also explicitly
unroll replicating VPInstructions.
Now the only remaining case where we replicate for all lanes is
VPReplicateRecipes in replicate regions.
PR: https://github.com/llvm/llvm-project/pull/155102
Track which ops already have been narrowed, to avoid narrowing the same
operation multiple times. Repeated narrowing will lead to incorrect
results, because we could first narrow from an interleave group -> wide
load, and then narrow the wide load -> single-scalar load.
Fixes https://github.com/llvm/llvm-project/issues/156190.
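The idea of tracking already-narrowed operations can be sketched outside
of LLVM as follows. This is a toy model only: Kind, Op and Narrower are
hypothetical names, and the operation is mutated in place, whereas the
real transform works on VPlan recipes.

```cpp
#include <cassert>
#include <unordered_set>

// Toy stand-ins for the three shapes mentioned above (hypothetical names).
enum class Kind { InterleaveGroup, WideLoad, SingleScalarLoad };

struct Op {
  Kind K;
};

// One narrowing step: interleave group -> wide load -> single-scalar load.
inline Kind narrowOnce(Kind K) {
  switch (K) {
  case Kind::InterleaveGroup:
    return Kind::WideLoad;
  case Kind::WideLoad:
    return Kind::SingleScalarLoad;
  case Kind::SingleScalarLoad:
    break;
  }
  return K;
}

// Remembers which ops were narrowed so that each op is narrowed at most once.
struct Narrower {
  std::unordered_set<const Op *> Narrowed;

  void narrow(Op &O) {
    // insert() reports whether the op was newly added; skip repeated visits.
    if (!Narrowed.insert(&O).second)
      return;
    O.K = narrowOnce(O.K);
  }
};
```

Without the set, visiting the same op through two users would apply
narrowOnce twice and turn an interleave group into a single-scalar load.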
The default values for DebugLocs in LoopVectorizer/VPlan were recently
updated from empty DebugLocs to DebugLoc::getUnknown, as part of the
DebugLoc Coverage Tracking work. However, there are some cases where we
also pass an explicit empty DebugLoc, in many cases as a filler
argument. This patch updates all of these to `getUnknown` for now, until
either valid locations or a suitable categorization can be assigned to
each instead.
This change is NFC outside of DebugLoc coverage tracking builds.
In simplifyBlends, when normalizing a blend recipe, the first mask that
is used only by the blend and is not all-false is chosen, and its
corresponding incoming value becomes the initial value, with the others
blended into it. At the same time, the mask that is chosen can be
eliminated. However, a multi-user mask might be used by a dead recipe,
which prevents this optimization. This patch moves removeDeadRecipes
before simplifyBlends to eliminate dead recipes, allowing simplifyBlends
to remove more dead masks.
Extracts technically use vectors rather than scalars, but if the operand
is a single scalar we do not need a vector, so they should not block
forming single scalars.
Following up from #150368, this moves folding common edge masks into
simplifyBlends.
One test in uniform-blend.ll ended up regressing but after looking at it
closely, it came from a weird (x && !x) edge mask. So I've just included
a simplification in this PR to fold that to false.
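As a quick sanity check of that fold, the following standalone sketch
enumerates both values of x. It is not LLVM code: edgeMask and
foldsToFalse are hypothetical names, masks are modelled as plain bools,
and poison is ignored.

```cpp
#include <cassert>

// Hypothetical stand-in for an edge mask of the form (x && !x).
// Plain bools only; poison semantics are not modelled.
constexpr bool edgeMask(bool X) { return X && !X; }

// True if the mask is false for every input, i.e. it can be folded to false.
constexpr bool foldsToFalse() { return !edgeMask(false) && !edgeMask(true); }

static_assert(foldsToFalse(), "x && !x is false for both values of x");
```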
This PR reassociates logical ands in order to enable more
simplifications.
The driving motivation for this is that with tail folding all blocks
inside the loop body will end up using the header mask. However this can
end up nestled deep within a chain of logical ands from other edges.
Typically the header mask will be a leaf nested in the LHS, e.g.
(headermask & y) & z. So pulling it out allows it to be simplified
further, e.g. allows it to be optimised away to VP intrinsics with EVL
tail folding.
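The reassociation described above can be illustrated with a toy
expression tree. This is a sketch under stated assumptions, not the VPlan
implementation: Expr, leaf, land, reassociate and print are hypothetical
helpers, and only the tree shape is modelled.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression: either a named leaf or a logical and of two operands.
struct Expr {
  std::string Name;               // non-empty for leaves
  std::shared_ptr<Expr> LHS, RHS; // both set for logical-and nodes
  bool isAnd() const { return LHS && RHS; }
};
using ExprPtr = std::shared_ptr<Expr>;

inline ExprPtr leaf(std::string N) {
  return std::make_shared<Expr>(Expr{std::move(N), nullptr, nullptr});
}
inline ExprPtr land(ExprPtr L, ExprPtr R) {
  return std::make_shared<Expr>(Expr{"", std::move(L), std::move(R)});
}

// Rewrite (x && y) && z into x && (y && z) until the LHS of the root is a
// leaf, so the leftmost leaf becomes a direct operand of the root.
inline ExprPtr reassociate(ExprPtr E) {
  while (E->isAnd() && E->LHS->isAnd())
    E = land(E->LHS->LHS, land(E->LHS->RHS, E->RHS));
  return E;
}

inline std::string print(const ExprPtr &E) {
  if (!E->isAnd())
    return E->Name;
  return "(" + print(E->LHS) + " && " + print(E->RHS) + ")";
}
```

For example, reassociating ((headermask && y) && z) yields
(headermask && (y && z)), where the header mask is a direct operand of
the outermost and.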
Introduce a simple common-subexpression-elimination pass at the
VPlan-level, running late during the execution of the VPlan. The
long-term vision is to get rid of the legacy non-VPlan-based cse routine
in LV, but this patch doesn't yet fully subsume it.
The InterleavedAccess pass already supports transforming
vector-predicated (vp) load/store intrinsics. With this patch, we start
enabling interleaved access under tail folding by EVL.
This patch introduces a new base class, VPInterleaveBase, and a concrete
class, VPInterleaveEVLRecipe. Both the existing VPInterleaveRecipe and
the new VPInterleaveEVLRecipe inherit from and implement
VPInterleaveBase.
Compared to VPInterleaveRecipe, VPInterleaveEVLRecipe adds an EVL
operand to emit vp.load/vp.store intrinsics.
Currently, tail folding by EVL is only supported for scalable
vectorization. Therefore, VPInterleaveEVLRecipe will only emit
interleave/deinterleave intrinsics. Reverse accesses are not yet
implemented, as masked reverse interleaved access under tail folding is
not yet supported.
Fixed #123201
This patch adds a new flag (-enable-wide-lane-mask) which allows
LoopVectorize to generate wider-than-VF active lane masks when it
is safe to do so (i.e. the mask is used for data and control flow).
The transform in extractFromWideActiveLaneMask creates vector
extracts from the first active lane mask in the header & loop body,
modifying the active lane mask phi operands to use the extracts.
An additional operand is passed to the ActiveLaneMask instruction,
the value of which is used as a multiplier of VF when generating the
mask.
By default this is 1, and is updated to UF by
extractFromWideActiveLaneMask.
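The relationship between the wide mask and the per-part extracts can be
sketched with a scalar model. This is an illustration only:
activeLaneMask and extractPart are hypothetical helpers, lane I is
assumed active iff Base + I < TripCount, and overflow is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of an active lane mask with VF * Multiplier lanes.
// Lane I is active iff Base + I < TripCount (overflow not modelled).
inline std::vector<bool> activeLaneMask(std::uint64_t Base,
                                        std::uint64_t TripCount, unsigned VF,
                                        unsigned Multiplier = 1) {
  std::vector<bool> Mask(static_cast<std::size_t>(VF) * Multiplier);
  for (std::size_t I = 0; I < Mask.size(); ++I)
    Mask[I] = Base + I < TripCount;
  return Mask;
}

// Extract the VF-wide sub-mask for unroll part Part from a wide mask.
inline std::vector<bool> extractPart(const std::vector<bool> &Wide,
                                     unsigned VF, unsigned Part) {
  auto Begin = Wide.begin() + static_cast<std::ptrdiff_t>(VF) * Part;
  return std::vector<bool>(Begin, Begin + VF);
}
```

With VF = 4 and a multiplier of 2, the extract for part 1 of the wide
mask starting at index 0 matches a VF-wide mask starting at index 4.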
The motivation for this change is to improve interleaved loops when
SVE2.1 is available, where we can make use of the whilelo instruction
which returns a predicate pair.
This is based on a PR that was created by @momchil-velikov (#81140)
and contains tests which were added there.
Update narrowInterleaveGroups to support scalable VFs. After the
transform, the vector loop will process a single iteration of the
original vector loop for fixed-width vectors and vscale iterations for
scalable vectors.
This patch adds a new VPlan-based addMinimumIterationCheck, which
replaces the ILV version for the non-epilogue case.
The VPlan-based version constructs a SCEV expression to compute the
minimum iterations and uses it to determine whether the check is known
to be true or false. Otherwise it creates a VPExpandSCEV recipe and
emits a compare-and-branch.
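The three outcomes can be sketched as below. This is a simplified model
and not the actual implementation: classifyMinIterCheck and CheckResult
are hypothetical names, the trip count is either a known constant or
unknown, and the check is assumed to be "trip count < minimum
iterations". The real code reasons on SCEV expressions instead.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class CheckResult { KnownTrue, KnownFalse, NeedsRuntimeCheck };

// Hypothetical model of the decision: fold the check when the trip count
// is known, otherwise fall back to a runtime compare-and-branch.
inline CheckResult classifyMinIterCheck(std::optional<std::uint64_t> TripCount,
                                        std::uint64_t MinIters) {
  if (!TripCount)
    return CheckResult::NeedsRuntimeCheck;
  return *TripCount < MinIters ? CheckResult::KnownTrue
                               : CheckResult::KnownFalse;
}
```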
When using epilogue vectorization, we still need to create the minimum
trip-count-check during the legacy skeleton creation. The patch moves
the definitions out of ILV.
PR: https://github.com/llvm/llvm-project/pull/153643
This changes the branch condition to use the AVL's backedge value
instead of the EVL-based IV.
This allows us to emit bnez on RISC-V and removes a use of the trip
count, which should reduce register pressure.
To match phis with VPlanPatternMatch I've had to relax the assert that
the number of operands must exactly match the pattern for the Phi
opcode, and I've copied over m_ZExtOrSelf from the LLVM IR
PatternMatch.h.
Fixes #151459
Extend [Specific]Cmp_match to handle floating-point compares, and
introduce m_Cmp that matches both integer and floating-point compares.
Use it in simplifyRecipe to match and simplify the general case of
compares. The change has necessitated a bugfix in
VPReplicateRecipe::execute.
Currently we only allow folding not (cmp eq) -> icmp ne if the not is
the only user of the compare.
However a common scenario is that some select might also use the
compare. We can still fold the not if we also swizzle the arms of the
selects.
This helps avoid regressions in #150368
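The equivalence behind swapping the select arms can be checked with a
scalar model. This is a sketch only: selectValue, Users, beforeFold and
afterFold are hypothetical names and integers stand in for the compared
values.

```cpp
#include <cassert>

// Scalar model of a select: picks TrueV when Cond holds, else FalseV.
constexpr int selectValue(bool Cond, int TrueV, int FalseV) {
  return Cond ? TrueV : FalseV;
}

// Values seen by the two users of the compare: the not and the select.
struct Users {
  bool NotResult;
  int SelectResult;
};

// Before the fold: cmp eq feeds both a not and a select.
constexpr Users beforeFold(int A, int B, int X, int Y) {
  bool Cmp = (A == B);
  return {!Cmp, selectValue(Cmp, X, Y)};
}

// After the fold: cmp ne replaces the not, and the select arms are swapped.
constexpr Users afterFold(int A, int B, int X, int Y) {
  bool Ne = (A != B);
  return {Ne, selectValue(Ne, Y, X)};
}

constexpr bool same(Users L, Users R) {
  return L.NotResult == R.NotResult && L.SelectResult == R.SelectResult;
}

static_assert(same(beforeFold(1, 1, 10, 20), afterFold(1, 1, 10, 20)), "");
static_assert(same(beforeFold(1, 2, 10, 20), afterFold(1, 2, 10, 20)), "");
```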
Move the logic to expand SCEVs directly to a late VPlan transform that
expands SCEVs in the entry block. This turns VPExpandSCEVRecipe into an
abstract recipe without execute, which clarifies how the recipe is
handled, i.e. it is not executed like regular recipes.
It also helps to simplify construction, as now scalar evolution isn't
required to be passed to the recipe.
SimplifyBranchConditionForVFAndUF only recognized canonical IVs and a
few PHI recipes in the loop header. With more IV-step optimizations,
the canonical widen-canonical-iv can be replaced by a canonical
VPWidenIntOrFpInduction, which the pass did not handle, causing
regressions (missed simplifications).
This patch replaces canonical VPWidenIntOrFpInduction with a StepVector
in the vector preheader since the vector loop region only executes once.
This is the first step in untangling the variable step transform and
header mask optimizations as described in #152541.
Currently we replace all VF users globally in the plan, including
VPVectorEndPointerRecipe. However this leaves reversed loads and stores
in an incorrect state until they are adjusted in optimizeMaskToEVL.
This moves the VPVectorEndPointerRecipe transform so that it is updated
in lockstep with the actual load/store recipe.
One thought that crossed my mind was that VPInterleaveRecipe could also
use VPVectorEndPointerRecipe, in which case we would have also been
computing the wrong address because we don't transform it to an EVL
recipe which accounts for the reversed address.
If we end up with an extract_element VPInstruction where both operands
are live-ins, we will try to fold the live-ins even though the first
operand is a vector whilst the live-in is scalar.
This fixes it by just returning the vector live-in instead of calling
the folder, and removes the handling for insertelement where we aren't
able to do the fold. From some quick testing we previously never hit
this fold anyway, and were probably just missing test coverage.
Fixes #154045