Enforce that all VPInstructions set the correct OpType of the VPIRFlags.
Flag mis-matches (e.g. VPInstruction Add without `OverflowingBinOp`
being set) can cause crashes (e.g. in CSE) or potentially mis-compiles.
Add a few helpers in VPBuilder to create common instructions with
correct flags.
PR: https://github.com/llvm/llvm-project/pull/179138
VPWidenActiveLaneMaskPHIRecipe does not have side-effects and also does
not access memory. Mark accordingly. This allows hoisting of some
invariant loads out of loops and also removing unused phi recipes in the
future.
In
llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll,
the hoisting makes vectorization profitable.
PR: https://github.com/llvm/llvm-project/pull/177886
This makes use of the llvm.vector.partial.reduce.fadd intrinsics added
in #163975 to handle the following with FDOT:
```
float32_t fdot(float16_t *src, int N) {
  float32_t sum = 0.0f;
  for (int i = 0; i < N; ++i)
    sum += src[i];
  return sum;
}
```
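As a rough scalar model of what a partial reduction does with the loop above, the sketch below folds chunks of 8 inputs into 4 accumulator lanes and performs the final horizontal sum once after the loop. The 8-to-4 shape, the `i % AccLanes` lane mapping, the name `partialReduceSum` and the use of `float` in place of `float16_t` are all assumptions made for illustration; they are not taken from the intrinsic's definition.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shapes: 8 narrow inputs are folded into 4 wide accumulators.
constexpr std::size_t InLanes = 8;
constexpr std::size_t AccLanes = 4;

// Scalar model of a loop that uses a partial reduction. `float` stands in
// for float16_t so the sketch builds with any C++ compiler.
float partialReduceSum(const std::vector<float> &src) {
  assert(src.size() % InLanes == 0 && "sketch has no scalar tail loop");
  std::array<float, AccLanes> acc{}; // zero-initialized accumulator lanes
  for (std::size_t base = 0; base < src.size(); base += InLanes)
    for (std::size_t i = 0; i < InLanes; ++i)
      acc[i % AccLanes] += src[base + i]; // one partial-reduce step
  // Final horizontal reduction, performed once after the loop.
  float sum = 0.0f;
  for (float lane : acc)
    sum += lane;
  return sum;
}
```

Note that the additions happen in a different order than in the scalar loop, so results may differ in the last bits for inputs that are not exactly representable.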
In some cases, we identify patterns as reductions, even though they can
be simplified to a non-reduction.
Mark VPReductionPHIRecipe as not reading from memory & not having
side-effects, to clean them up.
We also need to remove ComputeReductionResult VPInstructions with
live-in arguments. This means there is actually no reduction, and we
need to fold it to the live-in. Otherwise we would incorrectly reduce
the live-in.
PR: https://github.com/llvm/llvm-project/pull/176795
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to be estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Move SubclassID to VPRecipeBase, and store VPRecipeBase directly in
VPRecipeValue, instead of VPDef. This allows for some additional
simplifications and VPDef now just holds various helpers to deal with
removing and adding VPValues.
This reverts commit 16395da0ff577750571b99fe28281ce6fb6a3ae8.
PR: https://github.com/llvm/llvm-project/pull/174282
Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to model finding the first/last induction more
explicitly and more simply. This boils down to a min/max reduction,
followed by a compare against the sentinel value and a select.
PR: https://github.com/llvm/llvm-project/pull/176672
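A minimal scalar sketch of the idea, for the "find last" case: non-matching iterations contribute a sentinel, a max reduction picks the largest matching index, and a final compare + select maps the sentinel back to the start value. The function name, the predicate (`data[i] > threshold`) and the choice of `INT64_MIN` as the sentinel are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the last index i with data[i] > threshold, or `start` if no
// element matches. Hypothetical helper, not part of LLVM.
int64_t findLastIndexAbove(const std::vector<int> &data, int threshold,
                           int64_t start) {
  assert(data.size() <= static_cast<std::size_t>(
                            std::numeric_limits<int64_t>::max()) &&
         "indices must be representable as int64_t");
  const int64_t sentinel = std::numeric_limits<int64_t>::min();
  int64_t rdx = sentinel;
  for (std::size_t i = 0; i < data.size(); ++i) {
    // Non-matching iterations contribute the sentinel to the reduction.
    int64_t candidate =
        data[i] > threshold ? static_cast<int64_t>(i) : sentinel;
    rdx = std::max(rdx, candidate); // plain max reduction
  }
  // Explicit compare + select on the sentinel after the reduction.
  bool found = rdx != sentinel;
  return found ? rdx : start;
}
```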
Following on from #174693, this updates IRBuilder to allow variable
offsets, and splits the createVectorSplice function into two functions
for left and right splices.
We could preserve the existing createVectorSplice API but given there's
only one LLVM-internal user of it in the loop vectorizer, and the notion
of a negative offset doesn't exist in the intrinsics anymore, I've
removed it. Happy to add it back if reviewers prefer though.
I've also added unit tests since createVectorSpliceLeft has no coverage
otherwise.
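For readers unfamiliar with splices, here is a small element-wise model of the two directions. It assumes that a left splice by `offset` behaves like the old single intrinsic with a positive immediate, and a right splice like the old negative immediate; the helper names `spliceLeft` and `spliceRight` are hypothetical and are not the IRBuilder API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Concatenate a and b, then take a.size() elements starting at `start`.
static Vec window(const Vec &a, const Vec &b, std::size_t start) {
  Vec concat(a);
  concat.insert(concat.end(), b.begin(), b.end());
  auto first = concat.begin() + static_cast<std::ptrdiff_t>(start);
  return Vec(first, first + static_cast<std::ptrdiff_t>(a.size()));
}

// Assumed semantics: drop the first `offset` elements of a, refill from b.
Vec spliceLeft(const Vec &a, const Vec &b, std::size_t offset) {
  assert(a.size() == b.size() && offset <= a.size());
  return window(a, b, offset);
}

// Assumed semantics: keep the last `offset` elements of a, then the start of b.
Vec spliceRight(const Vec &a, const Vec &b, std::size_t offset) {
  assert(a.size() == b.size() && offset <= a.size());
  return window(a, b, a.size() - offset);
}
```

Because the offset is a plain unsigned count in both helpers, no negative value is needed to express the direction.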
If any of the operands of a VPReplicateRecipe have been
force-scalarized, then the legacy cost model skips the scalarization
overhead, but we cannot match this in the VPlan cost model.
Bail out for now in those very rare cases.
Fixes https://github.com/llvm/llvm-project/issues/176720.
VPlan transforms may invert logical AND/OR selects, which can impact
costs on targets where the select is not cheap but the boolean AND/OR is.
Also match the inverted logical AND/OR to improve the accuracy of the
cost estimation. This fixes the underlying cause of the cost
divergence between the legacy and VPlan-based cost models that led to
the revert of 01d34eb38fa058 in ed004cf42bf57c.
When `extract-lane` has only a single vector operand, it can be
simplified to `extractelement`.
This patch makes `extract-lane` generate a plain `extractelement` in
that case, to avoid generating unused IR.
This patch is mostly NFC; the unused IR would otherwise be removed by
later IR passes.
Following up on d5c11b9a24c84f, also handle min/max recurrence kinds in
::printFlags, so the proper kind is printed instead of icmp.
NFC modulo debug printing changes.
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Remove the artificial PhiR operand of ComputeReductionResult, which was
only used to look up recurrence kind, in-loop and ordered properties.
Instead, encode them as VPIRFlags as suggested by @ayalz in
https://github.com/llvm/llvm-project/pull/170223.
This addresses a TODO to make codegen for ComputeReductionResult
independent of looking up information from other recipes.
This is NFC w.r.t. codegen, the printing has been improved to include
the reduction type, and whether it is in-loop/ordered.
PR: https://github.com/llvm/llvm-project/pull/174026
This patch simplifies extract-lane(%lane_num, %X) to %X when %X is a
scalar value. Extracting from a scalar is redundant since there is only
one value to extract.
VPScalarIVStepsRecipe relies on APInt truncation in order to vectorize
blocks with a width greater than the maximum value the types of some of
their (changing) operands are able to hold (e.g., an i1 input with a
vector width of 4). Simply reenable implicit truncation in
ConstantInt::get() to cover this case.
Remove the helper function given it is only called in one place to
prevent accidentally using it elsewhere where we probably do not want
implicit truncation turned on.
This fixes another case that we saw after
acb78bde6fb613a9af2a604bc69fa744a8cee850 did not fix that issue, which
had the same stack trace. We still want to keep lane constants as
unsigned.
Somewhat similar to 6d1e7d4982fabc9e245897056a5425496df6a7a3.
This test case comes from a tensorflow/XLA compilation from a test case
in https://github.com/google-research/spherical-cnn.
a83c89495ba6fe0134dcaa02372c320cc7ff0dbf caused assertion failures here:
if we have a single-bit induction variable and two lanes (0 and 1),
then the second lane index (1) is out of bounds of what a signed
1-bit integer can hold. Lane indices are always >= 0 according to
VPlanHelpers.h:125, and the lane representation in this code is also
unsigned.
The test case comes from tensorflow/XLA.
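The range argument can be checked with a small standalone sketch: a signed 1-bit integer holds only -1 and 0, while an unsigned 1-bit integer holds 0 and 1. The helpers `fitsSigned` and `fitsUnsigned` are hypothetical and only mimic the kind of range check that fails here; they are not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>

// True if v is representable as a two's-complement signed integer of the
// given bit width.
bool fitsSigned(int64_t v, unsigned bits) {
  assert(bits >= 1 && bits <= 63 && "sketch supports 1..63 bits");
  int64_t lo = -(int64_t(1) << (bits - 1));
  int64_t hi = (int64_t(1) << (bits - 1)) - 1;
  return v >= lo && v <= hi;
}

// True if v is representable as an unsigned integer of the given bit width.
bool fitsUnsigned(uint64_t v, unsigned bits) {
  assert(bits >= 1 && bits <= 63 && "sketch supports 1..63 bits");
  return v < (uint64_t(1) << bits);
}
```

With `bits == 1`, lane index 1 is rejected by the signed check but accepted by the unsigned one, which is why lane constants are kept unsigned.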
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
Restore the all-operands-invariant handling in WidenGEP::execute prior
to 37f7b31 (Reland [VPlan] Handle WidenGEP in narrowToSingleScalars), as
crashes have been reported.
Fixes #173761.
The stride can be negative here, so we should use getSigned().
This avoids an assertion failure with
https://github.com/llvm/llvm-project/pull/171456. It also avoids a
miscompile if the index is >64-bit, but I don't think that can happen in
practice.
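To illustrate why the signedness matters for a negative stride, the sketch below widens a 32-bit stride to 64 bits in two ways. This is only an analogy at smaller bit widths, with hypothetical helper names; it does not use the LLVM constant APIs.

```cpp
#include <cassert>
#include <cstdint>

// Treats the stride's bit pattern as unsigned before widening, so a
// negative stride turns into a large positive value.
int64_t widenZeroExtended(int32_t stride) {
  return static_cast<int64_t>(static_cast<uint32_t>(stride));
}

// Preserves the sign when widening.
int64_t widenSignExtended(int32_t stride) {
  return static_cast<int64_t>(stride);
}

// Offset of element `index` when the stride is widened with its sign intact.
int64_t offsetSignExtended(int64_t index, int32_t stride) {
  assert(index >= 0 && index < (int64_t(1) << 31) &&
         "keep the product within int64_t range");
  return index * widenSignExtended(stride);
}
```

A stride of -4 stays -4 when sign-extended, but becomes 4294967292 when its bit pattern is zero-extended, which would step forward through memory instead of backward.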
All extra state has been removed from VPWidenSelectRecipe at this point.
There's no benefit of having a separate recipe and Select can easily be
handled by the existing VPWidenRecipe.
PR: https://github.com/llvm/llvm-project/pull/174234
Also handle missing PtrToAddrs and AddrSpaceCast in
getCostForRecipeWithOpcode.
This makes sure all cast opcodes are handled, fixing a crash on loops
replicating addrspacecast and ptrtoaddrs.
Move the logic to compute cast costs to getCostForRecipeWithOpcode and
use for VPReplicateRecipe.
This should match the costs computed by the legacy cost model for scalar
casts.
This PR introduces a new BranchOnTwoConds VPInstruction, that takes 2
boolean operands and must be placed in a block with 3 successors.
If condition I is true, branches to successor I, otherwise falls through
to check the next condition. If both conditions are false, branch to the
third successor.
This new branch recipe is used for early-exit loops, to simplify the
representation in VPlan initially, by avoiding the need for splitting the
middle block early on, in a way that preserves the single-exit block
property of regions. All exits still go through the latch block, but
they can go to more than 2 successors.
This idea was part of one of the original proposals for how to model
early exits in VPlan, but at that point in time, there was no good way
to handle this during code-gen, and we went with the early split-middle
block approach initially.
Now that we dissolve regions before ::execute, the new recipe can be
lowered nicely after regions have been removed, to a set of VPBBs and
BranchOnCond recipes. The initial lowering preserves the original
structure with the split middle blocks. Follow-ups will improve the
lowering to avoid this splitting, providing performance gains.
PR: https://github.com/llvm/llvm-project/pull/172750
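The branching rule described above can be written down as a tiny model that returns the index of the taken successor, together with an equivalent chain of two two-way branches, similar in spirit to the later lowering to BranchOnCond. Both function names are hypothetical, and the model ignores everything except the choice of successor.

```cpp
#include <cassert>
#include <cstddef>

// Index of the successor taken by a three-way branch on two conditions.
std::size_t branchOnTwoConds(std::size_t numSuccessors, bool cond0,
                             bool cond1) {
  assert(numSuccessors == 3 && "the block must have exactly 3 successors");
  if (cond0)
    return 0; // first condition wins, even if cond1 is also true
  if (cond1)
    return 1;
  return 2; // both conditions false
}

// The same decision expressed as two chained two-way branches.
std::size_t loweredToTwoBranches(bool cond0, bool cond1) {
  // First two-way branch: successor 0, or fall through to a new block.
  if (cond0)
    return 0;
  // Second two-way branch, placed in the fall-through block.
  return cond1 ? 1 : 2;
}
```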
This patch enables the vectorization of the llvm.frexp intrinsic.
Following the suggestion in #112408, frexp is moved from
isTriviallyScalarizable to isTriviallyVectorizable.
Fixes #112408.
This reverts commit f42af14073228 and re-applies
https://github.com/llvm/llvm-project/pull/172915.
It has an additional check if the condition is a live-in,
which makes sure we preserve the original behavior in that case.
This should fix the crash that caused the revert.
Original commit message:
Look up the predicate from the VPValue condition instead
of the underlying IR.
This improves cost modeling in some cases, e.g. when we can fold
operations like negations in compares. On AArch64, this leads to
additional vectorization in a few cases in practice.
Example lowering for the modified test case:
https://llvm.godbolt.org/z/6nc6jo5eG
getSCEVExprForVPValue is used to create SCEVs for expressions from the
original loop, which may be predicated. Use PSE to construct predicated
SCEVs if possible. This matches the legacy LV code behavior.
Currently should be NFC, but will enable migrating more SCEV/cost-based
computations to VPlan.
The patch requires exposing a new getPredicatedSCEV helper to
PredicatedScalarEvolution which just takes a SCEV, to avoid needing to
go through IR values, which isn't an option for getSCEVExprForVPValue.
Look up the predicate from the VPValue condition instead
of the underlying IR.
This improves cost modeling in some cases, e.g. when we can fold
operations like negations in compares. On AArch64, this leads to
additional vectorization in a few cases in practice.
Example lowering for the modified test case:
https://llvm.godbolt.org/z/6nc6jo5eG
PR: https://github.com/llvm/llvm-project/pull/172915
getAddressAccessSCEV previously had some restrictive checks that limited
pointer SCEV expressions passed to TTI to GEPs with operands that must
either be invariant or marked as inductions.
As a consequence, the check rejected things like `GEP %base, (%iv + 1)`,
while the SCEV for the GEP should be as easily analyzable as for `GEP
%base, %iv`, with the only difference being that the AddRec start is
adjusted by 1.
This patch changes the code to use a SCEV-based check, limiting the
address SCEV to be loop invariant, an affine AddRec (i.e. induction),
or an add expression of such operands or a sign-extended AddRec.
This catches all existing cases getAddressAccessSCEV caught, plus
additional ones like the cases mentioned above.
This means we pass address SCEVs in more cases, giving the backends a
better chance to make informed decisions. It also unifies the decision
of when to use an address SCEV between the legacy and VPlan-based cost
models.
The gather-cost.ll tests illustrate the impact. Previously they were
considered not profitable to vectorize because we failed to determine that
    %gep.src_data = getelementptr inbounds [1536 x float], ptr @src_data, i64 0, i64 %mul
has a relatively small constant stride.
There may be some rough edges in the cost models, where not passing
pointer SCEVs hid some incorrect modeling, but those issues should be
fixed in the target cost models if they surface.
PR: https://github.com/llvm/llvm-project/pull/171204
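The shape of the new check can be sketched on a toy expression tree. Everything below (the `Expr` type, its kinds, `make`, `isSupportedAddressExpr`) is a hypothetical stand-in and not LLVM's SCEV classes; in particular, letting add operands recurse is one reading of the description above, not a statement about the actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy expression node; a stand-in for illustration only.
struct Expr {
  enum Kind { Invariant, AffineAddRec, NonAffineAddRec, SignExtend, Add, Unknown };
  Kind kind;
  std::vector<std::shared_ptr<Expr>> ops;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr make(Expr::Kind k, std::vector<ExprPtr> ops = {}) {
  return std::make_shared<Expr>(Expr{k, std::move(ops)});
}

// Accepts loop-invariant expressions, affine AddRecs, sign-extended affine
// AddRecs, and adds whose operands are all accepted themselves.
bool isSupportedAddressExpr(const ExprPtr &e) {
  assert(e && "expression must not be null");
  switch (e->kind) {
  case Expr::Invariant:
  case Expr::AffineAddRec:
    return true;
  case Expr::SignExtend:
    return e->ops.size() == 1 && e->ops[0]->kind == Expr::AffineAddRec;
  case Expr::Add:
    if (e->ops.empty())
      return false;
    for (const ExprPtr &op : e->ops)
      if (!isSupportedAddressExpr(op))
        return false;
    return true;
  default:
    return false; // anything else is not passed on as an address expression
  }
}
```

In this model, `%base + (%iv + 1)` is an add of an invariant and an affine AddRec, so it is accepted just like `%base + %iv`.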
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
This reapplies #171846 with a test case and fix for a legacy cost-model
mismatch assertion.
In the previous version of the patch, we only considered the plan to
contain simplifications when it had a VPBlendRecipe and VF.isScalar()
was true.
However for some VPlans we may have a blend with only the first lane
used:
    BLEND ir<%phi> = ir<%foo.res> ir<%bar.res>/ir<%c>
    CLONE ir<%gep> = getelementptr ir<%p>, ir<%phi>
    vp<%5> = vector-pointer ir<%gep>
And in the legacy cost model we cost a blend as a phi if it's uniform:
    // If we know that this instruction will remain uniform, check the cost of
    // the scalar version.
    if (isUniformAfterVectorization(I, VF))
      VF = ElementCount::getFixed(1);
So this replaces the VF.isScalar() check with
vputils::onlyFirstLaneUsed, which matches how the VPlan cost model
mirrored the legacy model beforehand.
A VPInstruction::Select will also emit a scalar select for a vector VF
if only the first lane is used, so this also updates
VPBlendRecipe::computeCost to reflect that too.
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.