SubVectorsMask might be less than CommonMask, if the vectors with larger
number of elements are permuted or reused elements are used. Need to
consider this when estimation/building the vector to avoid compiler
crash
Fixes#117518
Patch allows to vector scalar instruction + poison values as if poisons
are instructions with the same opcode. It allows better vectorization of
the repeated values, reduces number of insertelement instructions and
serves as a base ground for copyable elements vectorization
AVX512, -O3 + LTO
JM/ldecod - better vector code
Applications/oggenc - better vectorization
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - better vector code
CFP2017rate/526.blender_r - better vector code
CFP2006/447.dealII - small variations
Benchmarks/Bullet - extra vector code
CFP2017rate/510.parest_r - better vectorization
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - extra vector code
Benchmarks/tramp3d-v4 - small variations
CFP2006/453.povray - extra vector code
JM/lencod - better vector code
CFP2017rate/511.povray_r - extra vector code
MemFunctions/MemFunctions - extra vector code
LoopVectorization/LoopVectorizationBenchmarks - extra vector code
XRay/FDRMode - extra vector code
XRay/ReturnReference - extra vector code
LCALS/SubsetCLambdaLoops - extra vector code
LCALS/SubsetCRawLoops - extra vector code
LCALS/SubsetARawLoops - extra vector code
LCALS/SubsetALambdaLoops - extra vector code
DOE-ProxyApps-C++/miniFE - extra vector code
LoopVectorization/LoopInterleavingBenchmarks - extra vector code
LCALS/SubsetBLambdaLoops - extra vector code
MicroBenchmarks/harris - extra vector code
ImageProcessing/Dither - extra vector code
MicroBenchmarks/SLPVectorization - extra vector code
ImageProcessing/Blur - extra vector code
ImageProcessing/Dilate - extra vector code
Builtins/Int128 - extra vector code
ImageProcessing/Interpolation - extra vector code
ImageProcessing/BilateralFiltering - extra vector code
ImageProcessing/AnisotropicDiffusion - extra vector code
MicroBenchmarks/LoopInterchange - extra code vectorized
LCALS/SubsetBRawLoops - extra code vectorized
CINT2006/464.h264ref - extra vectorization with wider vectors
CFP2017rate/508.namd_r - small variations, extra phis vectorized
CFP2006/444.namd - 2 2 x phi replaced by 4 x phi
DOE-ProxyApps-C/SimpleMOC - extra code vectorized
CINT2017rate/541.leela_r
CINT2017speed/641.leela_s - the function better vectorized and inlined
Benchmarks/Misc/oourafft - 2 4 x bit reductions replaced by 2 x vector code
FreeBench/fourinarow - better vectorization
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/115946
We don't want reorderTopToBottom to reorder ShuffleVectorInst (because
ShuffleVectorInst currently supports only a limited set of patterns).
Either we make ShuffleVectorInst support more patterns, or we let
ReorderIndices reorder the result of the vectorization of
ShuffleVectorInst. We choose the latter solution.
Currently sequences reduction_add(ext(<n x i1>)) are modeled as vector
extensions + reduction add, but later instcombiner transforms it into
ext(ctcpop(bitcast <n x i1> to int n)). Patch adds direct support for
this in SLP vectorizer, which enables better cost estimation.
AVX512, -O3+LTO
CINT2006/445.gobmk - extra vector code
Prolangs-C/bison - extra vector code
Benchmarks/NPB-serial/is - 16 x + 8 x reductions vectorized as 24
x reduction
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/116875
Only BinaryOperator and CastInst support alternate instruction. It
always returns false for TreeEntry::isAltShuffle if an instruction is
ExtractElementInst, ExtractValueInst, LoadInst, StoreInst or
InsertElementInst.
VPReverseVectorPointer relies on the runtime VF, but in DataWithEVL
tail-folding, EVL (which can be less than VF at runtime) should be used
instead.
This patch updates the logic to check the users of VF and replaces the
second operand if the user is VPReverseVectorPointer.
The any-of reduction contains phi and select instructions.
The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, `planContainsAdditionalSimplifications()`
will catch it and disable the assertion.
Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().
Only BinaryOperator and CastInst support alternate instruction. It
always returns false for TreeEntry::isAltShuffle if an instruction is
ExtractElementInst, ExtractValueInst, LoadInst, StoreInst or
InsertElementInst.
This changes allows target intrinsics to specify and overwrite overloaded types.
- Updates `ReplaceWithVecLib` to not provide TTI as there most probably won't be a use-case
- Updates `SLPVectorizer` to use available TTI
- Updates `VPTransformState` to pass down TTI
- Updates `VPlanRecipe` to use passed-down TTI
This change will let us add scalarization for `asdouble`: #114847
Currently sequences reduction_add(ext(<n x i1>)) are modeled as vector
extensions + reduction add, but later instcombiner transforms it into
ext(ctcpop(bitcast <n x i1> to int n)). Patch adds direct support for
this in SLP vectorizer, which enables better cost estimation.
AVX512, -O3+LTO
CINT2006/445.gobmk - extra vector code
Prolangs-C/bison - extra vector code
Benchmarks/NPB-serial/is - 16 x + 8 x reductions vectorized as 24
x reduction
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/116875
Load/Store isSimple is a necessary condition for VectorSeeds, but not
sufficient, so reverse the condition and return value, and continue the
check. Add relevant tests.
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
This patch adds the callback registration logic in the DAG's constructor
and the corresponding deregistration logic in the destructor. It also
implements the code that makes sure that SchedBundle and DGNodes can be
safely destroyed in any order.
If the buildvector root has no uses, it might be still needed as a part
of the graph, so need to check that it is not a part of the graph before
deletion.
Fixes#116852
There are lots of places where we try to estimate the runtime
vectorisation factor based on the getVScaleForTuning TTI hook.
I've added a new getEstimatedRuntimeVF function and taught
several places in the vectoriser to use this new function.
Currently it's very difficult to improve the cost model for tail-folded
loops because as soon as you add a VPInstruction::computeCost function
that adds the costs of instructions such as
VPInstruction::ActiveLaneMask
and VPInstruction::ExplicitVectorLength the assert in
LoopVectorizationPlanner::computeBestVF fails for some tests. This is
because the VF chosen by the legacy cost model doesn't match the vplan
cost model. See PR #90191. This assert is currently making it difficult
to improve the cost model.
Hopefully we will be in a position to remove the assert soon, however
in order to do that we have to fix up a whole bunch of tests that rely
upon the legacy cost model output. I've tried my best to update
these tests to use vplan output instead.
There is still work needed for the VF=1 case because the vplan cost
model is not printed out in this case. I've not attempted to fix those
in this patch.
- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF
PR: https://github.com/llvm/llvm-project/pull/108190
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
Up until now we could only support packing of scalar elements. This
patch fixes this by implementing packing of vector elements, by
generating extractelement and insertelement instruction pairs.
This patch implements packing of scalar operands when the vectorizer
decides to stop vectorizing. Packing is implemented with a sequence of
InsertElement instructions.
Packing vectors requires different instructions so it's implemented in a
follow-up patch.
Enables splat support for loads with lanes> 2 or number of operands> 2.
Allows better detect splats of loads and reduces number of shuffles in
some cases.
X86, AVX512, -O3+LTO
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 154867.00 156723.00 1.2%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12467735.00 12468023.00 0.0%
Better vectorization quality
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/115173
Prior to this patch, optimizeForVFAndUF may optimize the conditional
branch for a VPBasicblock to have a constant condition, but
unnecessarily drops the DILocation attachment when it does so; this
patch changes it to preserve the DILocation.
In #111310 an assert was added that for the IV overflow check used with
tail folding, the overflow check is never known.
However when applying the loop guards, it looks like it's possible that
we might actually know the IV won't overflow: this occurs in
500.perlbench_r from SPEC CPU 2017 and triggers the assertion:
Assertion failed: (!isIndvarOverflowCheckKnownFalse(Cost, VF * UF) &&
!SE.isKnownPredicate(CmpInst::getInversePredicate(ICmpInst::ICMP_ULT),
TC2OverflowSCEV, SE.getSCEV(Step)) && "unexpectedly proved overflow
check to be known"), function emitIterationCountCheck, file
LoopVectorize.cpp, line 2501.
There is a discrepancy between `isIndvarOverflowCheckKnownFalse` and the
ICMP_ULT check, because the former uses `getSmallConstantMaxTripCount`
which only takes into trip counts that fit into 32 bits. There doesn't
seem to be an easy way to make the assertion aware of this, so this PR
just removes it for now.
There are two potential follow up things from this PR:
1. We miss calculating the max trip count in `@trip_count_max_1024`, it
looks like we might need to apply loop guards somewhere in
`ScalarEvolution::computeExitLimitFromICmp`
2. In `@overflow_at_0`, if `%tc == 0` then we the overflow check will
always return false, even though it will overflow
Fixes https://github.com/llvm/llvm-project/issues/115755
In #101641, support for out of loop reductions with EVL tail folding was
added by transforming selects to vp_merges in
transformRecipestoEVLRecipes.
Whilst the select was previously free, the vp_merge wasn't and incurs a
cost on RISC-V with the VPlan cost model. But this diverged from the
legacy cost model and caused the "VPlan cost model and legacy cost model
disagreed" assertion to trigger when building 502.gcc_r from SPEC CPU
2017.
Neither the select nor vp_merge recipes from the VPlan exist in the
underlying instructions, so I thought it would make the most sense to
fix this by adding the cost to the underlying phi instruction in
getInstructionCost.
It's worth noting that on RISC-V this vp_merge won't actually generate
any instructions because the mask is all true, and will be folded away.
So we should update the cost model at some point to reflect that.
We should always update the `SkipComputation` which is set in
`VPCostContext` before VPlan compute costs.
This patch prevent the assertion of in-loop reduction in the
`VPReductionRecipe::computeCost()` and other potential assertions of
partially implemented VPlan-based cost model.
Use the IV resume/end values from the phis in the scalar header,
instead of collecting them in a map. This removes some complexity
from the code dealing with induction resume values.
Analogous to 1edd22030 which did the same for reduction resume values.