Reverts llvm/llvm-project#124129 as its currently causing a regression at #124499 - avoids the regression until a proper fix can be added to getSpillCost
Currently, SLP has 2 distinct storages to manage mapping between
vectorized instructions and their corresponding vectorized TreeEntry
nodes. It leads to inefficient lookup for the matching TreeEntries and
makes it harder to correctly track instructions, associated with
multiple nodes.
There is a plan to extend this support for instructions, that require
scheduling, to allow support for copyable elements. Merging
ScalarToTreeEntry and MultiNodeScalars will allow reduce maintenance of
the feature
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/124914
Adds getNumberOfParts and uses it instead of similar code across code
base, fixes analysis of non-vectorizable types in
computeMinimumValueSizes.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/124774
Need to include MainOp into the analysis of the instructions in
getSameOpcode to be sure that it is checked for the requirements to
prevent crashes during further analysis.
We have two types of mask in SLP: a scalar mask and a vector mask.
When vectorizing four i32 additions into <4 x i32>, SLP creates a mask
of length 4.
When vectorizing four <2 x i32> additions into <8 x i32>, SLP also
creates a mask of length 4.
We refer to the first case as a scalar mask (because the mask element
represents a scalar, i32), and the second case as a vector mask (because
the mask element represents a vector, <4 x i32>).
At some point, we must convert the scalar mask into a vector mask
(otherwise, calling TTI cost functions or IRBuilderBase functions may
yield incorrect results).
Since both ShuffleCostEstimator and ShuffleInstructionBuilder can modify
the CommonMask, we have decided to perform the mask transformation only
within createShuffle. However, we do not store the transformed result,
as createShuffle may be called multiple times.
getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data to the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.
This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.
This is the reverse of #124129 where we weren't including argument data
when we could.
Fixes#63980
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)
Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.
Noticed while having yet another attempt at #63980
Added non-power-of-2 (but still whole registers) vectorization support
for nodes other than stores and reductions.
Reviewers: preames, RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/113356
When adding a node vector, when it is used already in the shuffle for
buildvector, need to calculate vector factor from all vector, not only
this single vector, to avoid incorrect result. Also, need to increase
stability of the reused entries detection to avoid mismatch in cost
estimation/codegen.
Fixes#123639
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
Added assert to check for underflow of ReduxWidth
modified: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
Source code analysis flagged the operation (ReduxWwidth - 1) as
potential underflow, since ReduxWidth is unsigned.
Realize that this should never happen if everything is working right,
but added an assert to check for it just in case.
Follow up on 4a0d53a (PatternMatch: migrate to CmpPredicate) to get rid
of the FIXME it introduced in SLPVectorizer: the FIXME is bad, and we'd
get no testable impact by using CmpPredicate::getMatching here.
When extracts are vectorized and it has some poison values instead of
instructions, need to correctly set the vectorized operand not as
poison, but as a main vector operand of the main extract instruction.
Fixes#122583
If the buildvector node fully matched another node, need to exclude
subvectors, when building final shuffle, just a shuffle of the original
node must be emitted.
Fixes#122584
This patch simplifies select-based integer min/max reductions by
utilizing `llvm::getMinMaxReductionPredicate`, and generates
intrinsic-based min/max reductions by utilizing
`llvm::getMinMaxReductionIntrinsicOp`.
When estimating the cost of entries shuffles for buildvectors, need to
rebuild original mask, not a generated submask, used for subregisters
analysis.
Fixes#122430
Older version of msvc do not have great lambda support and are not able
to handle uses of class data or lambdas with implicit return types in
some cases. These minor changes improve the sources compatibility with
older msvc and don't hurt readability either.
When estimating the cost of entries shuffles for buildvectors, need to
rebuild original mask, not a generated submask, used for subregisters
analysis.
Fixes#122430
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
Need to check if the GEP bases are equal and return false early. Also,
need to return false if the lookup is too deep, considering bases equal
too. Fixes a crash in the assertion.
If the VectorizedTree still may generate poisonous value, but it is not
the original operand of the reduction op, need to check if Res still the
operand, to generate correct code.
Fixes#114905
Adds cost estimation for the variants of the permutations of the scalar
values, used in gather nodes. Currently, SLP just unconditionally emits
shuffles for the reused buildvectors, but in some cases better to leave
them as buildvectors rather than shuffles, if the cost of such
buildvectors is better.
X86, AVX512, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 912998.00 913238.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 203070.00 203102.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1396320.00 1396448.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1396320.00 1396448.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309790.00 309678.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12477607.00 12470807.00 -0.1%
CINT2006/445.gobmk - extra code vectorized
MiBench/consumer-lame - small variations
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorized code
Benchmarks/Bullet - extra code vectorized
CFP2017rate/526.blender_r - extra vector code
RISC-V, sifive-p670, -O3+LTO
CFP2006/433.milc - regressions, should be fixed by https://github.com/llvm/llvm-project/pull/115173
CFP2006/453.povray - extra vectorized code
CFP2017rate/508.namd_r - better vector code
CFP2017rate/510.parest_r - extra vectorized code
SPEC/CFP2017rate - extra/better vector code
CFP2017rate/526.blender_r - extra vectorized code
CFP2017rate/538.imagick_r - extra vectorized code
CINT2006/403.gcc - extra vectorized code
CINT2006/445.gobmk - extra vectorized code
CINT2006/464.h264ref - extra vectorized code
CINT2006/483.xalancbmk - small variations
CINT2017rate/525.x264_r - better vectorization
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/115201
If several reduced values are combined and the first reduced value is
just the original reduced value of the bool logical op, need to freeze
it to prevent the propagation of the poison value.
Fixes#114905
If the operands of the icmp instructions has reduced bitwidth after
MinBitwidth analysis, need to drop samesign flag to preserve correctness
of the transformation.
Fixes#120823
- update `VectorUtils:isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specifiction of target specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
consistently name them
- move `isTriviallyScalarizable` to VectorUtils
- update all uses of the api and provide the TTI parameter
Resolves#117030