This ensures that all blocks created during VPlan execution are properly
added to an enclosing loop, if present.
Split off from https://github.com/llvm/llvm-project/pull/108378 and also
needed once more of the skeleton blocks are created directly via VPlan.
This also allows removing the custom logic for early-exit loop
vectorization added as part of
https://github.com/llvm/llvm-project/pull/117008.
As we're reducing the use count of the operands its more likely that they will now fold, as they were previously being prevented by a m_OneUse check, or the cost of retaining the extra instruction had been too high.
This is necessary for some upcoming patches, although the only change so far is instruction ordering as it allows some SSE folds of 256/512-bit with 128-bit subvectors to occur earlier in foldShuffleToIdentity as the subvector concats are free.
Pulled out of #120984
This patch changes the way blocks are managed by VPlan. Previously all
blocks reachable from entry would be cleaned up when a VPlan is
destroyed. With this patch, each VPlan keeps track of blocks created for
it in a list and this list is then used to delete all blocks in the list
when the VPlan is destroyed. To do so, block creation is funneled
through helpers in directly in VPlan.
The main advantage of doing so is it simplifies CFG transformations, as
those do not have to take care of deleting any blocks, just adjusting
the CFG. This helps to simplify
https://github.com/llvm/llvm-project/pull/108378 and
https://github.com/llvm/llvm-project/pull/106748.
This also simplifies handling of 'immutable' blocks a VPlan holds
references to, which at the moment only include the scalar header block.
PR: https://github.com/llvm/llvm-project/pull/120918
m_SpecificCmp allowed equivalent predicate+flags which don't necessarily work after being folded from "shuffle (cmpop), (cmpop)" into "cmpop (shuffle), (shuffle)"
Fixes#121110
Update C++ unit tests to use VPlanTestBase to construct initial VPlan,
using a constructor that creates the VP blocks directly in the
constructor.
Split off from and in preparation for
https://github.com/llvm/llvm-project/pull/120918.
Currently available intrinsics are only ld2/st2, which don't support interleaving factor > 2.
This patch teaches the LV to use ld2/st2 recursively to support high
interleaving factors.
If the VectorizedTree still may generate poisonous value, but it is not
the original operand of the reduction op, need to check if Res still the
operand, to generate correct code.
Fixes#114905
Adds cost estimation for the variants of the permutations of the scalar
values, used in gather nodes. Currently, SLP just unconditionally emits
shuffles for the reused buildvectors, but in some cases better to leave
them as buildvectors rather than shuffles, if the cost of such
buildvectors is better.
X86, AVX512, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 912998.00 913238.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 203070.00 203102.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1396320.00 1396448.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1396320.00 1396448.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309790.00 309678.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12477607.00 12470807.00 -0.1%
CINT2006/445.gobmk - extra code vectorized
MiBench/consumer-lame - small variations
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorized code
Benchmarks/Bullet - extra code vectorized
CFP2017rate/526.blender_r - extra vector code
RISC-V, sifive-p670, -O3+LTO
CFP2006/433.milc - regressions, should be fixed by https://github.com/llvm/llvm-project/pull/115173
CFP2006/453.povray - extra vectorized code
CFP2017rate/508.namd_r - better vector code
CFP2017rate/510.parest_r - extra vectorized code
SPEC/CFP2017rate - extra/better vector code
CFP2017rate/526.blender_r - extra vectorized code
CFP2017rate/538.imagick_r - extra vectorized code
CINT2006/403.gcc - extra vectorized code
CINT2006/445.gobmk - extra vectorized code
CINT2006/464.h264ref - extra vectorized code
CINT2006/483.xalancbmk - small variations
CINT2017rate/525.x264_r - better vectorization
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/115201
If several reduced values are combined and the first reduced value is
just the original reduced value of the bool logical op, need to freeze
it to prevent the propagation of the poison value.
Fixes#114905
This re-lands the reverted #92418
When the VF is small enough so that dividing the VF by the scaling
factor results in 1, the reduction phi execution thinks the VF is scalar
and sets the reduction's output as a scalar value, tripping assertions
expecting a vector value. The latest commit in this PR fixes that by
using `State.VF` in the scalar check, rather than the divided VF.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
If the operands of the icmp instructions has reduced bitwidth after
MinBitwidth analysis, need to drop samesign flag to preserve correctness
of the transformation.
Fixes#120823
We already canonicalized an undef base vector to the RHS to improve further folding, this extends this to improve the shuffle cost estimate of the single src shuffle
foldShuffleOfShuffles currently only folds unary shuffles to ensure we don't end up with a merged shuffle with more than 2 sources, but this prevented cases where both shuffles were sharing sources.
This patch generalizes the merge process to find up to 2 sources as it merges with the inner shuffles, it also moves the undef/poison handling stages into the merge loop as well.
Fixes#120764
Split DerivedIV simplification off from
https://github.com/llvm/llvm-project/pull/112145 and use to remove the
need for extra checks in createScalarIVSteps. Required an extra
simplification run after IV transforms.
This fixes some regressions from recent changes to vector combine in
#120216. It allows shuffleToIdentity to look through fp casts as other
casts, and makes sure mismatching vector types in splats and casts do
not block the transform, as only the lanes should matter.
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask
Original combining left the combine between vectors of different lengths as a TODO. this commit do that. (see
#[baab4aa1ba])
- update `VectorUtils:isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specifiction of target specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
consistently name them
- move `isTriviallyScalarizable` to VectorUtils
- update all uses of the api and provide the TTI parameter
Resolves#117030
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>