The isTreeTinyAndNotFullyVectorizable check for 2-node trees
(insertelement root + gather child) was too aggressive: it rejected
trees even when LoadEntriesToVectorize was non-empty, preventing
gathered loads from being vectorized into masked loads, strided loads, etc.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/190040
The FMulAdd (CombinedVectorize) transformation in transformNodes() marks
an FMul child entry with zero cost, assuming it is fully absorbed into
the fmuladd intrinsic. However, when any FMul scalar has multiple uses
(e.g., also stored separately), the FMul must survive as a separate
node.
Reviewers: hiraditya, RKSimon, bababuck
Pull Request: https://github.com/llvm/llvm-project/pull/189692
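The condition described above can be sketched in isolation. This is a minimal model, not the code in transformNodes(): `ScalarFMul` and `canFoldFMulIntoFMulAdd` are hypothetical names, and only the use count of each fmul scalar is modeled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a scalar fmul; only its use count matters here.
struct ScalarFMul {
  unsigned NumUses;
};

// The fmul node may be given zero cost only if every scalar fmul feeds
// nothing but the fused multiply-add.
bool canFoldFMulIntoFMulAdd(const std::vector<ScalarFMul> &FMuls) {
  assert(!FMuls.empty() && "expected at least one fmul scalar");
  for (const ScalarFMul &M : FMuls)
    if (M.NumUses != 1)
      return false; // e.g. the product is also stored separately
  return true;
}
```

If the check fails, the fmul has to keep its own cost because it survives as a separate node.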
ReductionRoot was initialized to nullptr instead of the RdxRoot
parameter. This caused two ScaleCost calls (for MinBWs cast cost and
ReductionBitWidth resize cost) to pass nullptr as the user instruction,
and suppressed the "Reduction Cost" line in debug output. In practice
the scale factor is the same because the tree root's main op and the
reduction root share the same basic block, so this is NFC.
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/189994
Replace the DenseMap<Value*, Value*> TrackedToOrig with a SmallVector<Value*>
indexed in parallel with Candidates. This avoids hash-table overhead for the
tracked-value-to-original-value mapping in horizontal reduction processing.
Fixes #189686
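A minimal sketch of the parallel-vector idea, using std::vector instead of SmallVector so it builds without LLVM. `Value` is a stand-in type and `ReductionCandidates`, `add`, and `original` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Value = std::string; // hypothetical stand-in for llvm::Value

struct ReductionCandidates {
  std::vector<const Value *> Candidates;    // tracked values
  std::vector<const Value *> TrackedToOrig; // TrackedToOrig[I] is the original of Candidates[I]

  // Both vectors grow together, so an index is valid for both.
  void add(const Value *Tracked, const Value *Orig) {
    Candidates.push_back(Tracked);
    TrackedToOrig.push_back(Orig);
  }

  // Lookup is a plain index, with no hashing involved.
  const Value *original(std::size_t I) const {
    assert(I < TrackedToOrig.size() && "index out of range");
    return TrackedToOrig[I];
  }
};
```

The trade-off is that callers must already know the candidate's index; that holds when the code iterates over Candidates anyway.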
The truncating store analogue of #181104.
Adds `Alignment` and `AddrSpace` parameters to
`TargetLoweringBase::getTruncStoreAction` and dependents, and introduces
a `getCustomTruncStoreAction` hook for targets to customize legalization
behavior using this new information.
This change is fully backwards compatible from the target's point of
view, with `setTruncStoreAction` having identical functionality. The
change is purely additive.
If the trimming candidate subtree is rooted at an alternate-shuffle node
with binary ops and has the same cost as the buildvector node, it is
better to keep the buildvector node to avoid runtime perf regressions
from shuffle/extra-operation overhead that the cost model may
underestimate. Skip trimming if the subtree contains ExtractElement
nodes, since those operate on already-materialized vectors, which may
reduce vector-to-scalar code movement and give better perf.
Reviewers: hiraditya, bababuck, fhahn, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/188272
Check whether the potential bitcast/bswap-like construct is the root of
the reduction; otherwise it cannot represent a bitcast/bswap construct.
Fixes #189184
Refactor in preparation for #185964.
Much of this is a refactor to address these issues. Instead of iterating over one chain at a time, attempting all VFs for that given chain, we now iterate over VFs, trying each chain for the current VF.
Includes a fix for a use-after-free bug.
For commutative copyables, instruction operands are always placed on the
LHS and all other operands on the RHS. However, if a main instruction
has 2 instruction operands and its RHS operand is more compatible with
the LHS operands than its LHS operand is, these operands need to be
swapped for better analysis.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/185320
Adds initial support for spill/reload estimation. Currently, it just
walks the operands and calculates the number of registers they use. If
this number is greater than the total number of available registers, it
considers the first (full) groups as the candidates for spills/reloads.
Reviewers: hiraditya, RKSimon, bababuck
Pull Request: https://github.com/llvm/llvm-project/pull/187594
The operand info passed to getCmpSelInstrCost for Select instructions
was using operands 0 and 1 (condition and true value), but the API
expects info about the data operands (true and false values). For
selects, the data operands are at indices 1 and 2, not 0 and 1.
This led to the cost model receiving the condition's operand info
instead of the false arm's, potentially producing inaccurate cost
estimates.
Reviewers: bababuck, hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/188506
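The index mix-up is easy to see on a toy model of a select. The sketch below does not use the real getCmpSelInstrCost signature; `OperandKind`, `ToySelect`, and `dataOperandInfo` are hypothetical names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-operand cost information.
enum class OperandKind { AnyValue, UniformConstant };

// A select has three operands: [0] condition, [1] true value, [2] false value.
struct ToySelect {
  std::vector<OperandKind> Operands;
};

// Returns the info for the two data operands, i.e. indices 1 and 2.
// Using indices 0 and 1 would report the condition instead of the false arm.
std::pair<OperandKind, OperandKind> dataOperandInfo(const ToySelect &S) {
  assert(S.Operands.size() == 3 && "select has exactly three operands");
  return {S.Operands[1], S.Operands[2]};
}
```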
Refactor in preparation for adding strided store chain vectorization.
Instead of iterating over one chain at a time, attempting all VFs for that given chain, we now iterate over VFs, trying each chain for the current VF. This will allow us to handle chains that share elements.
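The change in iteration order can be sketched as a loop nest with the VF loop on the outside. `attemptOrder` is a hypothetical helper that only records the order of attempts; the actual vectorization attempt is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the (VF, chain index) pairs in the order they would be attempted:
// every chain is tried at the current VF before moving to the next VF.
std::vector<std::pair<unsigned, std::size_t>>
attemptOrder(const std::vector<unsigned> &VFs, std::size_t NumChains) {
  std::vector<std::pair<unsigned, std::size_t>> Order;
  for (unsigned VF : VFs) {
    assert(VF > 0 && "vector factor must be positive");
    for (std::size_t Chain = 0; Chain < NumChains; ++Chain)
      Order.emplace_back(VF, Chain);
  }
  return Order;
}
```

With VFs {8, 4} and two chains, both chains are tried at VF 8 before either is tried at VF 4.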
If the next candidate is an operand of one of the reduced value
candidates, that instruction should also be marked as a reduced value,
not a reduction operation, even if all other requirements are met.
This reduces compile time.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/188103
The ordered reduction support introduced in 94e366ef2060 can cause an
infinite loop when processing complex reduction chains. The worklist
algorithm re-adds instructions from PossibleOrderedReductionOps when
switching to ordered mode, but doesn't track which instructions have
already been processed. This allows instructions to be re-added and
processed multiple times, creating cycles.
Add a Visited set to track processed instructions and skip any that
have already been handled, preventing the infinite loop.
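A minimal sketch of a worklist guarded by a Visited set, assuming instructions are modeled as integer ids. `processWorklist` and `Succs` are hypothetical; `Succs[I]` lists the ids that get pushed back when `I` is processed, which may form a cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Processes the worklist starting at Start and returns how many distinct
// instructions were processed. Without the Visited check, a cycle in Succs
// would keep re-adding the same ids forever.
unsigned processWorklist(const std::vector<std::vector<int>> &Succs, int Start) {
  std::vector<int> Worklist{Start};
  std::unordered_set<int> Visited;
  unsigned NumProcessed = 0;
  while (!Worklist.empty()) {
    int I = Worklist.back();
    Worklist.pop_back();
    if (!Visited.insert(I).second)
      continue; // already handled, skip
    assert(I >= 0 && static_cast<std::size_t>(I) < Succs.size());
    ++NumProcessed;
    for (int S : Succs[I])
      Worklist.push_back(S);
  }
  return NumProcessed;
}
```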
Initially, the reduction root was detected using the last member of the UserIgnoreList set, which is unordered. It is better to use the reduction root explicitly to avoid non-determinism in the reduction parent block, which may cause incorrect scale factor estimation for the reduction cost.
If the constant values have more active bits than the other operand of
the compare requires, such constants should not be truncated, to avoid
a miscompilation.
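The active-bits check can be sketched for unsigned constants. This is a simplified model: signedness handling is left out, and `activeBits` and `canTruncateConstant` are hypothetical helpers rather than the APInt-based code in the vectorizer.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits needed to represent V as an unsigned value (0 for V == 0).
unsigned activeBits(std::uint64_t V) {
  unsigned Bits = 0;
  while (V != 0) {
    ++Bits;
    V >>= 1;
  }
  return Bits;
}

// A constant may be narrowed to TargetBitWidth only if no set bit is lost.
bool canTruncateConstant(std::uint64_t C, unsigned TargetBitWidth) {
  assert(TargetBitWidth > 0 && TargetBitWidth <= 64 && "unsupported width");
  return activeBits(C) <= TargetBitWidth;
}
```

For example, 255 fits in 8 bits, while 256 needs 9 and must keep its wider type.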
Update the matching between the original reduced values and their
vectorized matches after ordered reduction vectorization to avoid
a compiler crash.
This patch models ordered reductions as a series of extractelements for
the cases which cannot be modeled as unordered reductions.
Fixes #50590
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/182644
If the instruction state is alternate and/or contains instructions that
do not match directly, check whether it is better to represent such
operations as non-alternate with copyables.
To do this, compare the operands of the instructions in their
different representations and choose the best one for optimal
vectorization.
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/183777
shl-based reduced values in many cases serve as the root of a
bitcast/bswap-based transformation, but the analysis needs to be
improved for better matching.
This patch merges reduction candidates into a single reduced value
array if there are only 2 different candidate arrays, one of which has
only a single element and the other is a list of shl instructions. It
also sorts these shl instructions by their shift amount and merges them
with the single candidate if it is profitable to have a copyable
reduction.
The original support for copyables led to a regression in x264 on
RISC-V. This patch improves detection of the copyable candidates by
checking profitability more precisely and adds an extra check for
splitnode reduction, if it is profitable.
Fixes #184313
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/185697
If the current buildvector node is part of the combined nodes of the
matching candidate node, this matching candidate must be considered
non-matching to prevent a wrong def-use chain.
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/187491
Currently, the SLP vectorizer does not take loops and their trip counts
into account, which may lead to inefficient vectorization in some cases.
This patch adds loop nest-aware tree building and cost estimation.
When it comes to tree building, it now checks that the tree does not
span different loop nests. Nodes from other loop nests immediately
become buildvector nodes.
The cost model adds knowledge about the loop trip count. If it is
unknown, the default value is used, controlled by the
-slp-cost-loop-min-trip-count=<value> option. The cost of the vector
nodes in the loop is multiplied by the number of iterations (trip
count), because each vector node will be executed trip-count times.
This allows better cost estimation.
Original Reviewers:
jdenny-ornl, vporpo, hiraditya, RKSimon
Original PR: https://github.com/llvm/llvm-project/pull/150450
Recommit after revert in c7bd3062f1dac975cf9b706f457b3c55b4bf57ff and in 4e500bd0015042b0cd4b7c87b81caeea06072d24
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/187391
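The trip-count scaling in the cost model described above amounts to a single multiplication. The sketch below is an assumption-laden model, not the vectorizer's code: `scaledNodeCost` is a hypothetical name, costs are plain integers, and `DefaultMinTripCount` plays the role of the value set by -slp-cost-loop-min-trip-count.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Scales the cost of a vector node inside a loop by the loop's trip count.
// If the trip count is unknown, the default minimum trip count is used.
std::int64_t scaledNodeCost(std::int64_t NodeCost,
                            std::optional<unsigned> KnownTripCount,
                            unsigned DefaultMinTripCount) {
  unsigned TripCount = KnownTripCount.value_or(DefaultMinTripCount);
  assert(TripCount > 0 && "trip count must be positive");
  return NodeCost * static_cast<std::int64_t>(TripCount);
}
```

Negative costs (savings) are scaled the same way, so a profitable node in a hot loop counts for more than the same node outside the loop.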
Fix the checks for non-power-of-2 based bswaps by checking whether the
source type, not the target scalar type, is a power of 2. Also add
cost estimation for zext if the source type does not match the scalar type, and fix the final bitcast for the reduced values.
Fixes https://github.com/llvm/llvm-project/pull/184018#issuecomment-4053477562
Fix the checks for non-power-of-2 based bswaps by checking whether the
source type, not the target scalar type, is a power of 2. Also add
cost estimation for zext if the source type does not match the scalar type.
Fixes https://github.com/llvm/llvm-project/pull/184018#issuecomment-4053477562
When looking for a match for a gather/buildvector node, if its root is
the first node, which is also a buildvector/gather node and has no
state, skip the analysis for such nodes to prevent a compiler crash.
Fixes #185851
**Summary**
Fixes a miscompilation where commutative operations (e.g., or, and, mul)
with a left-hand side constant were incorrectly transformed into
non-commutative operations (e.g., shl, sub).
**The Problem**
In `BinOpSameOpcodeHelper::getOperand`, when a constant is at `Pos ==
0`, the helper was failing to swap operand order for new non-commutative
target opcodes. This resulted in inverted logic, such as transforming
`or 0, %x` into `shl 0, %x` (resulting in 0) instead of the correct `%x
<< 0`.
**The Fix**
The existing logic only protected the Sub opcode. This patch generalizes
the fix to all non-commutative instructions by using
`!Instruction::isCommutative(ToOpcode)`. This ensures that for any
directional operation, the variable is correctly placed on the LHS and
the constant on the RHS.
**Changes**
- SLPVectorizer.cpp: Replaced the specific Sub check with a general
  isCommutative check.
- Regression Test: Added lhs-constant-non-cummutative.ll to cover shl,
  sub, and ashr targets.
Fixes #185186
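The operand-ordering rule from the fix can be sketched on a toy opcode set. This is not `BinOpSameOpcodeHelper::getOperand`: `Opcode`, `isCommutative`, and `operandsFor` here are hypothetical stand-ins, operands are plain strings, and the conversion of the constant's value between opcodes is left out.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class Opcode { Or, And, Mul, Shl, Sub, AShr };

// Toy version of the commutativity query for the opcodes above.
bool isCommutative(Opcode Op) {
  switch (Op) {
  case Opcode::Or:
  case Opcode::And:
  case Opcode::Mul:
    return true;
  case Opcode::Shl:
  case Opcode::Sub:
  case Opcode::AShr:
    return false;
  }
  return false;
}

using Operands = std::pair<std::string, std::string>; // (LHS, RHS)

// Re-expresses an operation whose constant sits at ConstPos under ToOpcode.
// For any non-commutative target the variable goes on the LHS and the
// constant on the RHS, regardless of where the constant was originally.
Operands operandsFor(Opcode ToOpcode, const std::string &Var,
                     const std::string &Const, unsigned ConstPos) {
  assert(ConstPos < 2 && "binary operations have two operands");
  if (ConstPos == 0 && isCommutative(ToOpcode))
    return {Const, Var}; // original order is still valid
  return {Var, Const};
}
```

With this rule, `or 0, %x` re-expressed as a shl yields the operand order `%x, 0` rather than `0, %x`.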
Adds support for zero-extending the bitcasted/bswapped type to the
original type, if it is larger than the original scalar type.
Reviewers: hiraditya, RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/184018
Be careful when filling the mask for fully matched nodes: the masks may
differ in size.
Fixes a crash reported in test/Transforms/SLPVectorizer/X86/mask-size-less-common-mask.ll