Currently the only way to enable the use of wide active lane masks is to pass
-enable-wide-lane-mask and force both interleaving & tail-folding with additional
flags. This patch changes selectInterleaveCount to consider interleaving if wide
lane masks were requested, although the feature remains off by default.
This hardens the code to check based on WideMember0's operands. This
ensures each call will go through the same check. Should be NFC
currently but needed when generalizing in follow-up patches.
narrowToSingleScalarRecipes can permit users that are WidenStore, or a
VPInstruction that has a suitable opcode. This is a generalization and
extension of the existing code.
Call getVectorTripCount first, and call getTripCount failing that, in
simplifyBranchConditionForVFAndUF, to simplify missed cases. While at
it, strip the dead check for a zero TC.
This patch removes the explicit Alignment parameter from
VPWidenLoadRecipe and VPWidenStoreRecipe constructors. Instead, these
recipes now directly retrieve the alignment from their
LoadInst/StoreInst.
Generalize VPWidenSelectRecipe codegen to consider single-scalar
conditions instead of just loop-invariant ones.
If the condition is a single-scalar, we can simply use a scalar
condition.
PR: https://github.com/llvm/llvm-project/pull/165506
Currently in optimizeMaskToEVL we convert every widened load, store or
reduction to a VP predicated recipe with EVL, regardless of whether or
not it uses the header mask.
So currently we have to be careful when working on other parts VPlan to
make sure that the EVL transform doesn't break or transform something
incorrectly, because it's not a semantics preserving transform.
Forgetting to do so has caused miscompiles before, like the case that
was fixed in #113667
This PR rewrites it to work in terms of pattern matching, so it now only
converts a recipe to a VP predicated recipe if it is exactly masked with
the header mask.
After this the transform should be a true optimisation and not change
any semantics, so it shouldn't miscompile things if other parts of VPlan
change.
This fixes#152541, and allows us to move addExplicitVectorLength into
tryToBuildVPlanWithVPRecipes in #153144
It also splits out the load/store transforms into separate patterns for
reversed and non-reversed, which should make #146525 easier to implement
and reason about.
BranchOnCount and BranchOnCond do not read memory, but cannot be moved.
Mark them as having side-effects, but not reading/writing memory, which
more accurately models that above. This allows removing some special
checking for branches both in the current code and future patches.
This follows similar reasoning as 45ce88758d24
(https://github.com/llvm/llvm-project/pull/159556):
LV does not preserve LCSSA, it constructs it just before processing a
loop to vectorize. Runtime check expressions are invariant to that loop,
so expanding them should not break LCSSA form for the loop we are about
to vectorize.
LV creates SCEV and memory runtime checks early on and then disconnects
the blocks temporarily. The patch fixes a mis-compile, where previously
LCSSA construction during SCEV expand may replace uses in currently
unreachable SCEV/memory check blocks.
Fixes https://github.com/llvm/llvm-project/issues/162512
PR: https://github.com/llvm/llvm-project/pull/165505
A reduction (including partial reductions) with a multiply of a constant
value can be bundled by first converting it from `reduce.add(mul(ext,
const))` to `reduce.add(mul(ext, ext(const)))` as long as it is safe to
extend the constant.
This PR adds such bundling by first truncating the constant to the
source type of the other extend, then extending it to the destination
type of the extend. The first truncate is necessary so that the types of
each extend's operand are then the same, and the call to
canConstantBeExtended proves that the extend following a truncate is
safe to do. The truncate is removed by optimisations.
This is a stacked PR, 1a and 1b can be merged in any order:
1a. https://github.com/llvm/llvm-project/pull/147302
1b. https://github.com/llvm/llvm-project/pull/163175
2. -> https://github.com/llvm/llvm-project/pull/162503
Factor out common code to determine legality of hoisting and sinking.
The patch has the side-effect of fixing an underlying bug, where a
load/store pair is reordered.
Add an member Alignment to VPWidenMemoryRecipe to store memory alignment
directly in the recipe. Update constructors, clone(), and relevant
methods to use this stored alignment instead of querying the IR
instruction. This allows VPWidenLoadRecipe/VPWidenStoreRecipe to be
constructed without relying on the original IR instruction in the
future.
InstSimplifyFolder can fold binary intrinsics, so take the opportunity
to unify code with getOpcodeOrIntrinsicID, and handle the case. The
additional handling of WidenGEP is non-functional, as the GEP is
simplified before it is widened, as the included test shows.
8d29d09309 exposed a crash due to incorrectly trying to handle masked
interleave recipes. For now, the current code does not support masked
interleave recipes. Bail out for them.
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Recipes in replicate regions implicitly depend on the region's
predicate. Limit CSE to recipes in the same block, when either recipe is
in a replicate region.
This allows handling VPPredInstPHIRecipe during CSE. If we perform CSE
on recipes inside a replicate region, we may end up with 2
VPPredInstPHIRecipes sharing the same operand. This is incompatible with
current VPPredInstPHIRecipe codegen, which re-sets the current value of
its operand in VPTransformState. This can cause crashes in the added
test cases.
Note that this patch only modifies ::isEqual to check for replicating
regions and not getHash, as CSE across replicating regions should be
uncommon.
Fixes https://github.com/llvm/llvm-project/issues/157314.
Fixes https://github.com/llvm/llvm-project/issues/161974.
PR: https://github.com/llvm/llvm-project/pull/162110
Recipes in licm are safe to hoist if the legality check passes, and the
recipe is guaranteed to execute; the single successor of the vector
preheader is the vector loop region. Clarify this in the code structure
and comments.
When narrowing stores of a single-scalar, we currently use
ExtractLastElement, which extracts the last element across all parts.
This is not correct if the store's address is not uniform across all
parts. If it is only uniform-per-part, the last lane per part must be
extracted. Add a new ExtractLastLanePerPart opcode to handle this
correctly. Most transforms apply to both ExtractLastElement and
ExtractLastLanePerPart, with the only difference being their treatment
during unrolling.
Fixes https://github.com/llvm/llvm-project/issues/162498.
PR: https://github.com/llvm/llvm-project/pull/163056
We can perform CSE on recipes that do not directly map to Instruction
opcodes. One example is VPVectorPointerRecipe. Currently this is handled
by supporting them in ::canHandle, but currently that means that we
return std::nullopt from getOpcodeOrIntrinsicID() for it. This currently
only works, because the only case we return std::nullopt and perform CSE
is VPVectorPointerRecipe. But that does not work if we support more such
recipes, like VPPredInstPHIRecipe
(https://github.com/llvm/llvm-project/pull/162110).
To fix this, return a custom opcode from getOpcodeOrIntrinsicID for
recipes like VPVectorPointerRecipe, using the VPDefID after all regular
instruction opcodes.
PR: https://github.com/llvm/llvm-project/pull/162267
VPWidenCastRecipes with Trunc opcodes where missing the correct OpType
for IR flags. Update createWidenCast to set the correct flags for
truncs, and use it consistenly.
Fixes https://github.com/llvm/llvm-project/issues/162374.
Instead of re-setting the start value of the canonical IV when
vectorizing the epilogue we can emit an Add VPInstruction to provide
canonical IV value, adjusted by the resume value from the main loop.
This is in preparation to make the canonical IV a VPValue defined by
loop regions. It ensures that the canonical IV always starts at 0.
PR: https://github.com/llvm/llvm-project/pull/161589