Update the logic in narrowToSingleScalar to allow narrowing even if not
all users use scalars, if at least one of the operands already needs
broadcasting.
In that case, there won't be any additional broadcasts introduced. This
should allow removing the special handling for stores, which can
introduce additional broadcasts currently.
Fixes https://github.com/llvm/llvm-project/issues/169668.
PR: https://github.com/llvm/llvm-project/pull/168246
In #160470, there is a discussion about the possibility of exploring a
general approach for handling memory intrinsics.
API changes:
- Remove getMaskedMemoryOpCost, getGatherScatterOpCost,
getExpandCompressMemoryOpCost, getStridedMemoryOpCost from
Analysis/TargetTransformInfo.
- Add getMemIntrinsicInstrCost.
In BasicTTIImpl, map intrinsic IDs to the existing target implementations
until the legacy TTI hooks are retired.
- masked_load/store → getMaskedMemoryOpCost
- masked_/vp_gather/scatter → getGatherScatterOpCost
- masked_expandload/compressstore → getExpandCompressMemoryOpCost
- experimental_vp_strided_{load,store} → getStridedMemoryOpCost
TODO: add support for vp_load_ff.
No functional change intended; costs continue to route to the same
target-specific hooks.
In some cases, VPWidenPointerInductions become only used by scalars after
legalizeAndOptimizeInductions was already run, for example due to
some VPlan optimizations.
Move the code to scalarize VPWidenPointerInductions to a helper and use
it if needed.
This fixes a crash after #148274 in the added test case.
Fixes https://github.com/llvm/llvm-project/issues/169780
This patch adds a new VPlan transformation to hoist predicated loads, if
we can prove they execute unconditionally, i.e. there are 2 predicated
loads to the same address with complementary masks. Then we are
guaranteed to execute one of them on each iteration, allowing us to
remove the mask.
The transform groups masked replicating loads by their address SCEV,
then checks if there are 2 loads with complementary masks. If that is the
case, we check if there are any writes that may alias the load address
in the blocks between the first and last load with the same address.
The transform operates after linearizing the CFG, but before
introducing replicate regions, which means this is just checking a chain
of consecutive blocks.
Currently this only uses noalias metadata to check for no-alias (using
the helpers added in https://github.com/llvm/llvm-project/pull/166247).
Then we create an unpredicated VPReplicateRecipe at the position of the
first load, then replace all users of the grouped loads with it.
Small Alive2 proof for hoisting with complementary masks:
https://alive2.llvm.org/ce/z/kUx742
PR: https://github.com/llvm/llvm-project/pull/168373
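As a rough illustration of the grouping step described above, here is a
standalone C++ sketch that models it outside of LLVM. MaskedLoad,
findHoistablePair, and the encoding of a mask as a predicate id plus a
negation bit are hypothetical names made up for this example; they are not
VPlan APIs. The alias check on the blocks between the two loads is left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a masked replicating load. AddrKey plays the role
// of the address SCEV. The mask is modelled as a predicate id plus a negation
// bit, so (P, false) and (P, true) are complementary masks.
struct MaskedLoad {
  std::string AddrKey;
  int PredId;
  bool Negated;
};

// Returns the indices (in program order) of two loads that share an address
// and have complementary masks, or {-1, -1} if no such pair exists.
std::pair<int, int> findHoistablePair(const std::vector<MaskedLoad> &Loads) {
  // Group load indices by their address key.
  std::map<std::string, std::vector<std::size_t>> Groups;
  for (std::size_t I = 0; I < Loads.size(); ++I)
    Groups[Loads[I].AddrKey].push_back(I);

  // Walk the loads in program order and look for a later load in the same
  // group whose mask is the complement of the current one.
  for (std::size_t I = 0; I < Loads.size(); ++I) {
    for (std::size_t J : Groups[Loads[I].AddrKey]) {
      if (J <= I)
        continue;
      if (Loads[I].PredId == Loads[J].PredId &&
          Loads[I].Negated != Loads[J].Negated)
        return std::make_pair(static_cast<int>(I), static_cast<int>(J));
    }
  }
  return std::make_pair(-1, -1);
}
```

In this model, a returned pair corresponds to the two loads that would be
replaced by a single unpredicated load at the position of the first one.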
In preparation to strip VPUnrollPartAccessor and unroll recipes
directly, strip unnecessary complication in getGEPIndexTy, as the unroll
part will no longer be available in follow-ups (see #168886 for
instance). The patch also helps by doing a mass test update up-front.
Narrowing the GEP index type conditionally does not yield any benefit,
and the change is non-functional in terms of emitted assembly. While at
it, avoid hard-coding address-space 0, and use the pointer operand's
address space to get the GEP index type.
Flags are now passed on construction/cloning. Remove unnecessary
transferFlags call, and make code independent of VPRecipeWithIRFlags, to
support additional recipes in the future.
VPVector(End)PointerRecipes are single-scalar if all their operands are.
This should be effectively NFC currently, but it should re-enable cost
checking for some more VPWidenMemoryRecipe after
https://github.com/llvm/llvm-project/pull/157387 as discovered by
John Brawn.
If the expected trip count is less than the VF, the vector loop will
only execute a single iteration. When that's the case, the cost of the
middle block has the same impact as the cost of the vector loop. Include
it in isOutsideLoopWorkProfitable to avoid vectorizing when the extra
work in the middle block makes it unprofitable.
Note that isOutsideLoopWorkProfitable already scales the cost of blocks
outside the vector region, but the patch restricts accounting for the
middle block to cases where VF <= ExpectedTC, to initially catch some
worst cases and avoid regressions.
This initial version should specifically avoid unprofitable tail-folding
for loops with low trip counts after re-applying
https://github.com/llvm/llvm-project/pull/149042.
PR: https://github.com/llvm/llvm-project/pull/168949
Changes: Fix a missed update to WidenGEP::usesFirstLaneOnly, and include a
reduced-case test that was previously hitting the new assert: the
underlying reason was that VPWidenGEP::usesScalars was too weak, and the
single-scalar WidenGEP was not narrowed by narrowToSingleScalarRecipes.
This allows us to strip a special case in VPWidenGEP::execute.
When interleaving a loop with an early exit, the parts before the active
lane will be all zero. Currently we emit @llvm.experimental.cttz.elts
with ZeroIsPoison=true for these parts, which means that they will
produce poison.
We don't see any miscompiles today on AArch64 because it has the same
lowering for cttz.elts regardless of ZeroIsPoison, but this may cause
issues on RISC-V when interleaving. This fixes it by setting
ZeroIsPoison=false.
The codegen is slightly worse on RISC-V when ZeroIsPoison=false and we
could potentially recover it by enabling it again when UF=1, but this is
left to another PR.
This is split off from #168738, where LastActiveLane can get expanded to
a FirstActiveLane with an all-zeroes mask.
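To make the all-zero case concrete, here is a small standalone C++ model of
the ZeroIsPoison=false behaviour. cttzElts and firstActiveLane are
hypothetical helpers written for this example, and the way the parts are
combined is an assumption for illustration, not necessarily what the
vectorizer emits.

```cpp
#include <cstddef>
#include <vector>

// Model of llvm.experimental.cttz.elts with ZeroIsPoison=false: counts the
// zero lanes starting at lane 0, and returns the lane count if no lane is set.
std::size_t cttzElts(const std::vector<bool> &Mask) {
  std::size_t N = 0;
  while (N < Mask.size() && !Mask[N])
    ++N;
  return N;
}

// First active lane across all interleaved parts. Parts in front of the
// active lane are all zero, so their count must be well defined (the lane
// count of the part) for the running offset to be usable.
std::size_t firstActiveLane(const std::vector<std::vector<bool>> &Parts) {
  std::size_t Lane = 0;
  for (const std::vector<bool> &Part : Parts) {
    std::size_t N = cttzElts(Part);
    if (N < Part.size())
      return Lane + N;
    Lane += Part.size();
  }
  return Lane; // No active lane in any part.
}
```

With two parts of four lanes each and the only set lane at index 1 of the
second part, the first part contributes its full lane count. That is exactly
the all-zero input for which ZeroIsPoison=true would have produced poison.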
Fix VPlan SLP check incorrectly bailing out for non-VPInstructions.
Starting from the beginning of the block will include canonical IVs,
which in turn are not VPInstructions. If we hit a non-VPInstruction, we
should conservatively treat it as potentially unvectorizable.
To keep the tests working as expected, refine mayRead/WriteFromMemory
for Load and GEP VPInstructions.
Extract the PreservesUniformity logic from isSingleScalar into a shared
static helper function. Update isUniformAcrossVFsAndUFs to use this
logic for VPWidenRecipe and VPInstruction, so that any opcode that
preserves uniformity is considered uniform-across-vf-and-uf if its
operands are.
This unifies the uniformity checking logic and makes it easier to extend
in the future.
This should effectively be NFC currently.
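The shape of the shared check can be sketched in standalone C++ as follows.
Op, Node, preservesUniformity, and isUniform are hypothetical names for this
example, and the set of opcodes treated as uniformity-preserving here is an
assumption, not the list used in VPlan.

```cpp
#include <vector>

// Hypothetical opcodes for a tiny expression DAG.
enum class Op { LiveIn, Add, Mul, Load, WidenIV };

struct Node {
  Op Opcode;
  std::vector<const Node *> Operands;
};

// Assumption for this sketch: plain arithmetic preserves uniformity, while
// loads and widened inductions do not.
bool preservesUniformity(Op O) { return O == Op::Add || O == Op::Mul; }

// A node is uniform if it is a live-in, or if its opcode preserves
// uniformity and all of its operands are uniform.
bool isUniform(const Node &N) {
  if (N.Opcode == Op::LiveIn)
    return true;
  if (!preservesUniformity(N.Opcode))
    return false;
  for (const Node *Operand : N.Operands)
    if (!isUniform(*Operand))
      return false;
  return true;
}
```

Keeping the opcode predicate in one helper means a new uniformity-preserving
opcode only has to be added in a single place.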
Create phi recipes for scalar resume value up front in addInitialSkeleton during initial construction. This will allow moving the remaining code dealing with resume values to VPlan transforms/construction.
PR: https://github.com/llvm/llvm-project/pull/166099
Only apply forced instruction costs to recipes with underlying values to
match the legacy cost model. A VPlan may have a number of additional
VPInstructions without underlying values that are not considered for its
cost, and assigning forced costs to them would incorrectly inflate its
cost.
This fixes a cost divergence between legacy and VPlan-based cost models
with forced instruction costs.
PR: https://github.com/llvm/llvm-project/pull/168372
Remove `VPWidenPointerInductionRecipe::IsScalarAfterVectorization` and
replace it with `onlyScalarValuesUsed`. This removes the need to carry
state from the legacy cost model through VPlan, and the VPlan-based
analysis gives more accurate results, avoiding a number of extracts.
PR: https://github.com/llvm/llvm-project/pull/168289
We need to check whether the non-schedulable phi parent node has unique
operands if the incoming node has copyables and the node is
commutative. Otherwise, the dependencies may be calculated
incorrectly.
Fixes #168589
A pattern of the form reduce.add(ext(mul)) is valid for a partial
reduction as long as the mul and its operands fulfill the requirements
of a normal partial reduction. The mul's extend operands will be
optimised to the wider extend, and we already have oneUse checks in
place to make sure the mul and operands can be modified safely.
1. -> https://github.com/llvm/llvm-project/pull/165536
2. https://github.com/llvm/llvm-project/pull/165543
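The arithmetic behind folding the outer extend into the mul's operand extends
can be checked with a small standalone C++ example. The concrete widths
(i8 inputs, an i16 mul, an i32 result) and the use of sign extension are
assumptions chosen for this illustration; viaNarrowMul, viaWideExt, and
agreeOnAllInputs are hypothetical names.

```cpp
#include <cstdint>

// ext(mul(ext(a), ext(b))): sign-extend i8 -> i16, multiply in i16, then
// sign-extend the product to i32. The i16 product cannot overflow, since
// |a * b| <= 128 * 128 = 16384.
int32_t viaNarrowMul(int8_t A, int8_t B) {
  int16_t Mul = static_cast<int16_t>(static_cast<int16_t>(A) *
                                     static_cast<int16_t>(B));
  return static_cast<int32_t>(Mul);
}

// mul(ext(a), ext(b)) with both operands extended directly to i32.
int32_t viaWideExt(int8_t A, int8_t B) {
  return static_cast<int32_t>(A) * static_cast<int32_t>(B);
}

// Exhaustively compares both forms over all pairs of i8 values.
bool agreeOnAllInputs() {
  for (int A = -128; A <= 127; ++A)
    for (int B = -128; B <= 127; ++B)
      if (viaNarrowMul(static_cast<int8_t>(A), static_cast<int8_t>(B)) !=
          viaWideExt(static_cast<int8_t>(A), static_cast<int8_t>(B)))
        return false;
  return true;
}
```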
Many of the def-use chain problems in the SLP vectorizer are related to
the fact that some nodes reuse the same instruction as their insertion
point. An insertion point is not an instruction, but the place between
instructions. To set it correctly, it is better to generate a pseudo
instruction immediately after the last instruction and use it as the
insertion point. This resolves the issues in most cases.
Fixes #168512, #168576
Given a set of pointers, check if they can be rearranged as follows (%s is a constant):
%b + 0 * %s + 0
%b + 0 * %s + 1
%b + 0 * %s + 2
...
%b + 0 * %s + w
%b + 1 * %s + 0
%b + 1 * %s + 1
%b + 1 * %s + 2
...
%b + 1 * %s + w
...
If the pointers can be rearranged in the above pattern, it means that the
memory can be accessed with strided loads of width `w` and stride `%s`.
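The rearrangement check can be sketched in standalone C++ on plain integer
offsets. This assumes the pointers have already been reduced to constant
element offsets relative to `%b`; StridedPattern and matchStridedRuns are
hypothetical names for this example and not part of the SLP vectorizer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct StridedPattern {
  bool Matches;
  long long Stride;   // Distance between the starts of consecutive runs.
  std::size_t RunLen; // Number of consecutive elements in each run.
};

// Checks whether the offsets can be reordered into equally long runs of
// consecutive elements whose starts are a constant stride apart.
StridedPattern matchStridedRuns(std::vector<long long> Offsets) {
  StridedPattern No = {false, 0, 0};
  if (Offsets.size() < 2)
    return No;
  std::sort(Offsets.begin(), Offsets.end());

  // The length of the first run of consecutive offsets fixes the run length.
  std::size_t RunLen = 1;
  while (RunLen < Offsets.size() &&
         Offsets[RunLen] == Offsets[RunLen - 1] + 1)
    ++RunLen;

  // A single contiguous run is not covered by this sketch, and all runs
  // must have the same length.
  if (RunLen == Offsets.size() || Offsets.size() % RunLen != 0)
    return No;
  // Reject duplicate offsets at the run boundary.
  if (Offsets[RunLen] == Offsets[RunLen - 1])
    return No;

  long long Stride = Offsets[RunLen] - Offsets[0];
  for (std::size_t I = 0; I < Offsets.size(); ++I) {
    long long Expected = Offsets[0] +
                         static_cast<long long>(I / RunLen) * Stride +
                         static_cast<long long>(I % RunLen);
    if (Offsets[I] != Expected)
      return No;
  }
  StridedPattern Yes = {true, Stride, RunLen};
  return Yes;
}
```

For the offsets 20, 0, 11, 1, 10, 21 this reports a match with stride 10 and
runs of 2 elements.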
- Move file-local functions out of the `llvm` or anonymous namespace and
make them static.
- Use namespace qualifier to define `BoUpSLP` class and several template
specializations.
#158690 plans on passing BFI as a lazy lambda to avoid computing
BlockFrequencyInfo when not needed.
In preparation for that, this PR removes BFI and PSI from some
constructors that aren't used. It also consolidates the two calls to
llvm::shouldOptimizeForSize so that the result is computed once and
passed where needed.
This also renames OptForSize in LoopVectorizationLegality to clarify
that it's to prevent runtime SCEV checks, see
https://reviews.llvm.org/D68082
Consider skipping scalable epilogue VFs when they are greater than
RemainingIterations, the same as for fixed VFs.
Exclude scalable RemainingIterations from that comparison, because
SCEV currently can't evaluate non-canonical vscale-based expressions.
- Split from #165532. This is a step toward a unified interface for
masked/gather-scatter/strided/expand-compress cost modeling.
- Replace the ad-hoc parameter list with a single attributes object.
API change:
```
- InstructionCost getMaskedMemoryOpCost(Opcode, Src, Alignment,
- AddressSpace, CostKind);
+ InstructionCost getMaskedMemoryOpCost(MemIntrinsicCostAttributes,
+ CostKind);
```
Notes:
- NFCI intended: callers populate MemIntrinsicCostAttributes with the
same information as before.
- Follow-up: migrate gather/scatter, strided, and expand/compress cost
queries to the same attributes-based entry point.
FCmp instructions have both a predicate and fast-math flags. Introduce a
new FCmp kind, that combines both to model this correctly in the current
system.
This should be NFC modulo VPlan printing which now includes the correct
fast-math flags.
Follow up on a CSE OpType-mismatch crash reported due to ef023cae388d
(Reland [VPlan] Expand WidenInt inductions with nuw/nsw), setting the
OpType correctly when returning from getFlagsFromIndDesc.
Update VPlan to populate VPIRFlags during VPInstruction construction and
use it when creating widened recipes, instead of constructing VPIRFlags
from the underlying IR instruction each time. The VPRecipeWithIRFlags
constructor taking an underlying instruction and setting the flags based
on it has been removed.
This centralizes initial VPIRFlags creation and ensures flags are
consistently available throughout VPlan transformations and makes sure
we don't accidentally re-add flags from the underlying instruction that
already got dropped during transformations.
Follow-up to https://github.com/llvm/llvm-project/pull/167253, which did
the same for VPIRMetadata.
Should be NFC w.r.t. the generated IR.
PR: https://github.com/llvm/llvm-project/pull/168450
This patch implements a transform to hoist single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.
This enables hoisting of loads we can prove do not alias any stores in
the loop due to memory runtime checks added during vectorization.
PR: https://github.com/llvm/llvm-project/pull/166247
Update VPlan to populate VPIRMetadata during VPInstruction construction
and use it when creating widened recipes, instead of constructing
VPIRMetadata from the underlying IR instruction each time.
This centralizes VPIRMetadata in VPInstructions and ensures metadata is
consistently available throughout VPlan transformations.
PR: https://github.com/llvm/llvm-project/pull/167253