This patch extends the support added in #158088 to loops where the
assignment is non-speculatable (e.g. a conditional load or divide).
For example, the following loop can now be vectorized:
```
int simple_csa_int_load(
int* a, int* b, int default_val, int N, int threshold)
{
int result = default_val;
for (int i = 0; i < N; ++i)
if (a[i] > threshold)
result = b[i];
return result;
}
```
It does this by extending the recurrence matching from only looking for
selects, to include phis where all operands are the header phi, except
for one which can be an arbitrary value outside the recurrence.
---
Reverts llvm/llvm-project#180275 (original PR: #178862)
Additional type legalization for `ISD::VECTOR_FIND_LAST_ACTIVE` was
added in #180290, which should resolve the backend crashes on x86.
Enables support for marking overflow intrinsics `uadd`, `sadd`, `usub`,
`ssub`, `umul` and `smul` as trivially vectorizable.
Fixes#174617
---
This patch is a reland of #174835.
Reverts #179819
In some cases we decide to vectorise loops with first-order recurrences
using VF=1, IC>1. We then attempt to unroll a vplan in replicateByVF,
however when trying to erase the list of values from the parent we
trigger the following assert:
```
virtual llvm::VPRecipeValue::~VPRecipeValue(): Assertion `Users.empty()
&& "trying to delete a VPRecipeValue with remaining users"' failed.
```
The problem seems to stem from this code:
```
DefR->replaceUsesWithIf(LaneDefs[0], [DefR](VPUser &U, unsigned) {
return U.usesFirstLaneOnly(DefR);
});
```
since usesFirstLaneOnly returns false and we fail to replace uses of
DefR with LaneDefs[0]. Upon inspection the only VPUser objects that
return false are VPInstruction::FirstOrderRecurrenceSplice and
VPFirstOrderRecurrencePHIRecipe. Since the values are all scalar it's
simply not possible for us to be using anything other than the first
lane. I've fixed this by bailing out of replicateByVF early for plans with
only a scalar VF.
Fixes https://github.com/llvm/llvm-project/issues/179671
If a phi has fast math flags, we can propagate it to the widened select.
To do this, this patch makes VPPhi and VPBlendRecipe subclasses of
VPRecipeWithIRFlags, and propagates it through PlainCFGBuilder and
VPPredicator.
Alive2 proofs for some of the FMFs (it looks like it can't reason about
the full "fast" set yet)
nnan: https://alive2.llvm.org/ce/z/f0bRd4
nsz: https://alive2.llvm.org/ce/z/u9P96T
The actual motivation for this to eventually be able to move the special
casing for tail folding in
LoopVectorizationPlanner::addReductionResultComputation into the CFG in
#176143, which requires passing through FMFs.
There are some cases when PtrSCEV can be nullptr. Fall back to legacy
cost model, to not call isLoopInvariant with nullptr.
Fixes a crash after 0c4f8094939d2.
Update VPReplicateReicpe::computeCost to compute predicated load/store
costs directly, unless the pointer is uniform. In that case, the legacy
cost model uses a different logic, which will be migrated separately.
PR: https://github.com/llvm/llvm-project/pull/179129
Add a new VPInstruction opcode to compute the exiting value of an
induction variable after vectorization. This replaces the pattern of
extracting the last lane from the last part of the induction backedge
value when applicable.
This allows us to always use the pre-computed IV end value. It will also
allow unifying end value creation for both induction resume and exit
values.
PR: https://github.com/llvm/llvm-project/pull/175651
This patch extends the support added in #158088 to loops where the
assignment is non-speculatable (e.g. a conditional load or divide).
For example, the following loop can now be vectorized:
```
int simple_csa_int_load(
int* a, int* b, int default_val, int N, int threshold)
{
int result = default_val;
for (int i = 0; i < N; ++i)
if (a[i] > threshold)
result = b[i];
return result;
}
```
It does this by extending the recurrence matching from only looking for
selects, to include phis where all operands are the header phi, except
for one which can be an arbitrary value outside the recurrence.
Enforce that all VPInstructions set the correct OpType of the VPIRFlags.
Flag mis-matches (e.g. VPInstruction Add without `OverflowingBinOp`
being set) can cause crashes (e.g. in CSE) or potentially mis-compiles.
Add a few helpers in VPBuilder to create common instructions with
correct flags.
PR: https://github.com/llvm/llvm-project/pull/179138
VPWidenActiveLaneMaskPHIRecipe does not have side-effects and also does
not access memory. Mark accordingly. This allows hoisting of some
invariant loads out of loops and also removing unused phi recipes in the
future.
In
llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll,
the hoisting makes vectorization profitable.
PR: https://github.com/llvm/llvm-project/pull/177886
This makes use of the llvm.vector.partial.reduce.fadd intrinsics added
in #163975 to handle the following with FDOT:
```
float32_t fdot(float16_t *src, int N) {
float32_t sum = 0.0f;
for (int i=0; i<N; ++i)
sum += src[i];
return sum;
}
```
In some cases, we identify patterns as reductions, even though they can
be simplified to a non-reduction.
Mark VPReductionPHIRecipe as not reading from memory & not having
side-effects, to clean them up.
We also need to remove ComputeReductionResult VPInstructions with
live-in arguments. This means there is actually no reduction, and we
need to fold it to the live in. Otherwise we would incorrectly reduce
the live-in.
PR: https://github.com/llvm/llvm-project/pull/176795
This commit introduces the VectorInstrContext (VIC) infrastructure to
improve cost estimates for insert/extracts based on the context
instruction in which the insert/extract is used.
This is similar to CastContextHint, and allows providing context on how
the insert/extract is going to be used before creating IR. This is
useful in the LoopVectorizer, where costs need to estimated before
creating IR.
The new hint currently only replaces an existing check in AArch64,
but new uses will be introduced in follow-ups, including
https://github.com/llvm/llvm-project/pull/177201.
PR: https://github.com/llvm/llvm-project/pull/175982
Move SubclassID to VPRecipeBase, and store VPRecipeBase directly in
VPRecipeValue, instead of VPDef. This allows for some additional
simplifications and VPDef now just holds various helpers to deal with
removing and adding VPValues.
This reverts commit 16395da0ff577750571b99fe28281ce6fb6a3ae8.
PR: https://github.com/llvm/llvm-project/pull/174282
Replace ComputeFindIVResult with ComputeReductionResult + explicit
compare + select, to more explicitly and simpler model computing finding
the first/last induction, which boils down to a min/max reduction +
compare and select of the sentinel value.
PR: https://github.com/llvm/llvm-project/pull/176672
Following on from #174693, this updates IRBuilder to allow variable
offsets, and splits the createVectorSplice function into two functions
for left and right splices.
We could preserve the existing createVectorSplice API but given there's
only one LLVM-internal user of it in the loop vectorizer, and the notion
of a negative offset doesn't exist in the intrinsics anymore, I've
removed it. Happy to add it back if reviewers prefer though.
I've also added unit tests since createVectorSpliceLeft has no coverage
otherwise.
If any of the operands of a VPReplicateRecipe have been
force-scalarized, then the legacy cost model skips the scalarization
overhead, but we cannot match this in the VPlan cost model.
Bail out for now in those very rare cases.
Fixes https://github.com/llvm/llvm-project/issues/176720.
VPlan transforms may invert logical AND/OR selects, which can impact
costs on targets the select is not cheap but the boolean AND/OR is.
Also match the inverted logical AND/OR to improve accuracy of the
cost estimation and fixes the underlying issue for the cost
divergence between legacy and VPlan-based cost model that caused
the revert of 01d34eb38fa058 in ed004cf42bf57c.
When `extract-lane` only contains single vector operand. We can simplify
it to `extractelement`.
This patch makes `extract-lane` generate simple `extractelement` when it
only contains single vector operand to prevent unused IR generated.
This patch is mostly NFC, the unused IR should be removed in following
IR passes.
Following up to d5c11b9a24c84f, also handle min/max recurrence kinds in
::printFlags, so the proper kind is imprinted instead of icmp.
NFC modulo debug printing changes
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Remove the artificial PhiR operand of ComputeReductionResult, which was
only used to look up recurrence kind, in-loop and ordered properties.
Instead, encode them as VPIRFlags as suggested by @ayalz in
https://github.com/llvm/llvm-project/pull/170223.
This addresses a TODO to make codegen for ComputeReductionResult
independent of looking up information from other recipes.
This is NFC w.r.t. codegen, the printing has been improved to include
the reduction type, and whether it is in-loop/ordered.
PR: https://github.com/llvm/llvm-project/pull/174026
This patch simplifies extract-lane(%lane_num, %X) to %X when %X is a
scalar value. Extracting from a scalar is redundant since there is only
one value to extract.
VPScalarIVStepsRecipe relies on APInt truncation in order to vectorize
blocks with a width greater than the maximum value the types of some of
their (changing) operands are able to hold (e.g., an i1 input with a
vector width of 4). Simply reenable implicit truncation in
ConstantInt::get() to cover this case.
Remove the helper function given it is only called in one place to
prevent accidentally using it elsewhere where we probably do not want
implicit truncation turned on.
This fixes another case that we saw after
acb78bde6fb613a9af2a604bc69fa744a8cee850 did not fix that issue, which
had the same stack trace. We still want to keep lane constants as
unsigned.
Somewhat similar to 6d1e7d4982fabc9e245897056a5425496df6a7a3.
This test case comes from a tensorflow/XLA compilation from a test case
in https://github.com/google-research/spherical-cnn.
a83c89495ba6fe0134dcaa02372c320cc7ff0dbf caused assertion failures here
as if we have a single bit induction variable and two lanes (0 and 1),
then the second lane index (1) will be out of bounds of what a signed
1-bit integer can hold. Lane indices are always >0 according to
VPlanHelpers.h:125, and the lane representation in this code is also
unsigned.
The test case come from tensorflow/XLA.
This patch adds VPValue sub-classes for the different cases we currently
have:
* VPIRValue: A live-in VPValue that wraps an underlying IR value
* VPSymbolicValue: A symbolic VPValue not tied to an underlying value,
e.g. the vector trip count or VF VPValues
* VPRecipeValue: A VPValue defined by a VPDef/VPRecipeBase.
This has multiple benefits:
* clearer constructors for each kind of VPValue
* limited scope: for example allows moving VPDef member to VPRecipeValue,
reducing size of other VPValues.
* stricter type checking for member variables (e.g. using VPLiveIn in
the Value -> live-in map in VPlan, or using VPSymbolicValue for symbolic
member VPValues)
There probably are additional opportunities for cleanups as follow-ups.
PR: https://github.com/llvm/llvm-project/pull/172758
Restore the all-operands-invariant handling in WidenGEP::execute prior
to 37f7b31 (Reland [VPlan] Handle WidenGEP in narrowToSingleScalars), as
crashes have been reported.
Fixes#173761.
The stride can be negative here, so we should use getSigned().
This avoids an assertion failure with
https://github.com/llvm/llvm-project/pull/171456. It also avoids a
miscompile if the index is >64-bit, but I don't think that can happen in
practice.
All extra state has been removed from VPWidenSelectRecipe at this point.
There's no benefit of having a separate recipe and Select can easily be
handled by the existing VPWidenRecipe.
PR: https://github.com/llvm/llvm-project/pull/174234
Also handle missing PtrToAddrs and AddrSpaceCast in
getCostForRecipeWithOpcode.
This makes sure all cast opcodes are handled, fixing a crash on loops
replicating addrspacecast and ptrtoaddrs.
Move the logic to compute cast costs to getCostForRecipeWithOpcode and
use for VPReplicateRecipe.
This should match the costs computed by the legacy cost model for scalar
casts.