If the buildvector root has no uses, it might still be needed as a part
of the graph, so the compiler needs to check that it is not a part of
the graph before deleting it.
Fixes #116852
…nteger division
Scalarization is considered the last resort when vectorizing a bundle of
integer divisions. Currently, the cost estimate for scalarizing a vector
division can be considerably overestimated, as in this motivating test
case, where the vector cost should not deviate much from the scalar
cost.
A future patch will try to improve the scalarization cost.
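To illustrate where such an overestimate can come from, here is a minimal
sketch of a scalarization-style cost: one extract per operand lane, one
scalar division per lane, and one insert per result lane. The struct, the
function names, and the unit costs are hypothetical and are not taken from
the LLVM cost model.

```cpp
#include <cstdint>

// Hypothetical per-operation costs; real values come from the target's
// cost model and are not reproduced here.
struct UnitCosts {
  std::uint32_t ScalarDiv;  // cost of one scalar integer division
  std::uint32_t ExtractElt; // cost of extracting one lane from a vector
  std::uint32_t InsertElt;  // cost of inserting one lane into a vector
};

// Cost of NumLanes independent scalar divisions (the unvectorized code).
std::uint32_t scalarCost(const UnitCosts &C, std::uint32_t NumLanes) {
  return NumLanes * C.ScalarDiv;
}

// Cost of a vector division lowered by scalarization: each lane of both
// operands is extracted, divided, and inserted into the result vector.
std::uint32_t scalarizedVectorCost(const UnitCosts &C,
                                   std::uint32_t NumLanes) {
  return NumLanes * (2 * C.ExtractElt + C.ScalarDiv + C.InsertElt);
}
```

With the assumed costs `UnitCosts{1, 2, 2}` and 4 lanes, `scalarCost` gives 4
while `scalarizedVectorCost` gives 28, so the extract/insert overhead
dominates the estimate.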
Enables splat support for loads with lanes > 2 or number of operands > 2.
Allows better detection of splats of loads and reduces the number of
shuffles in some cases.
X86, AVX512, -O3+LTO
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 154867.00 156723.00 1.2%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12467735.00 12468023.00 0.0%
Better vectorization quality
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/115173
Use the generic createShuffle function, which knows how to adjust the
vectors correctly, to avoid a compiler crash when trying to build a
buildvector as a shuffle.
Fixes #115732
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294
- Return true for atan2 from isTriviallyVectorizable
- Add atan2 to VecFuncs.def for massv and accelerate libraries.
- Add atan2 to hasOptimizedCodeGen
- Add atan2 support in llvm/lib/Analysis/ValueTracking.cpp
llvm::getIntrinsicForCallSite and update vectorization tests
- Add atan2 name check to isLoweredToCall in
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
- Note: there's no test coverage for these names in isLoweredToCall, except that Transforms/TailCallElim/inf-recursion.ll is impacted by the "fabs" case
Thanks to @jroelofs for the atan2 accelerate veclib and associated test
additions, plus the hasOptimizedCodeGen addition.
Part of: Implement the atan2 HLSL Function #70096.
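As an illustration of the kind of source code the atan2 changes above are
aimed at, the following sketch applies `std::atan2` element-wise over two
arrays. Whether such a loop is actually vectorized depends on the target,
the compiler flags, and the available vector library; the function name
`atan2Array` is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise atan2 over two arrays. Each iteration is independent of the
// others, which makes the loop a candidate for vectorization when the
// compiler can treat the atan2 call as vectorizable.
std::vector<double> atan2Array(const std::vector<double> &Y,
                               const std::vector<double> &X) {
  // Only process the common prefix if the sizes differ.
  const std::size_t N = std::min(Y.size(), X.size());
  std::vector<double> Out(N);
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = std::atan2(Y[I], X[I]);
  return Out;
}
```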
When looking for the insertion point for a node that is a buildvector
node, the compiler should not use scheduling info for such nodes: they
may contain only partial info, which is not fully correct and may cause
a compiler crash.
Fixes #114082
Since the stores are sorted by distance, comparing the indices in the
original array and exiting early if the index is less than the index of
the last store is not always the best strategy. It is better to remove
such stores explicitly so that the vectorization opportunity can be
checked more thoroughly.
Fixes #115008
This is also split off from the zvfhmin/zvfbfmin
isLegalElementTypeForRVV work.
Enabling this will cause SLP and RISCVGatherScatterLowering to emit
@llvm.experimental.vp.strided.{load,store} intrinsics, and codegen
support for this was added in #109387 and #114750.
Consider all possible reduction ops as being non-poisoning boolean
logical operations, which require freeze to be fully correct.
https://alive2.llvm.org/ce/z/TKWDMP
Fixes #114738
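The need for freeze can be sketched with a small model of poison
propagation. This is only an illustration under stated assumptions:
`PoisonableBool` and the three helpers are hypothetical names, poison is
modeled as `std::nullopt`, and `freeze` is modeled as always picking
`false`, whereas real freeze may pick any fixed value.

```cpp
#include <optional>

// Hypothetical model of an i1 value: std::nullopt stands for poison.
using PoisonableBool = std::optional<bool>;

// Logical and, i.e. `select i1 A, i1 B, i1 false`: B is only observed when
// A is true, so a poison B does not leak when A is false.
PoisonableBool logicalAnd(PoisonableBool A, PoisonableBool B) {
  if (!A)
    return std::nullopt;
  return *A ? B : PoisonableBool(false);
}

// Bitwise `and i1 A, B`: poison in either operand poisons the result.
PoisonableBool bitwiseAnd(PoisonableBool A, PoisonableBool B) {
  if (!A || !B)
    return std::nullopt;
  return *A && *B;
}

// freeze: a poison input becomes some fixed, non-poison value. This model
// always picks false; a defined input is returned unchanged.
PoisonableBool freeze(PoisonableBool V) {
  return V ? V : PoisonableBool(false);
}
```

In this model, `logicalAnd(false, poison)` is `false`, but
`bitwiseAnd(false, poison)` is poison. Freezing the second operand first
restores the non-poisoning result, which is why a reduction built from
bitwise ops needs freeze to match the logical form.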
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fall back to worst-case shuffles that reference elements from any lane.
This patch detects shuffle masks that we know are "inlane" and enables us to assume a cheaper shuffle cost.
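The "inlane" property can be sketched as a standalone check. This is a
simplified illustration, not the code from the patch: the function name
`isInLaneShuffleMask` is hypothetical, negative mask values are assumed to
mean undefined lanes, and indices at or above the mask size are assumed to
refer to the second source vector.

```cpp
#include <cstddef>
#include <vector>

// Returns true if every defined mask element selects a source element from
// the same 128-bit lane as its destination position. Negative mask values
// are treated as undefined and ignored. Indices >= Mask.size() refer to the
// second source and are reduced modulo the vector length before comparing.
bool isInLaneShuffleMask(const std::vector<int> &Mask,
                         unsigned EltSizeInBits) {
  if (EltSizeInBits == 0 || 128 % EltSizeInBits != 0)
    return false;
  const std::size_t EltsPerLane = 128 / EltSizeInBits;
  const std::size_t NumElts = Mask.size();
  for (std::size_t I = 0; I < NumElts; ++I) {
    if (Mask[I] < 0)
      continue; // undefined lane, no constraint
    const std::size_t Src = static_cast<std::size_t>(Mask[I]) % NumElts;
    if (Src / EltsPerLane != I / EltsPerLane)
      return false; // element crosses a 128-bit lane boundary
  }
  return true;
}
```

For an 8 x i32 vector, the mask `{1, 0, 3, 2, 5, 4, 7, 6}` stays within each
128-bit lane, while `{4, 5, 6, 7, 0, 1, 2, 3}` swaps the two lanes and is
therefore rejected.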
The code in EH and non-returning blocks can be skipped by the
vectorizer, since it does not add to the performance and just consumes
compile/link time.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/112221
If the instruction is vectorized and is a part of the reduced values
gather/buildvector node, it should be replaced properly in the reduced
operation instructions before removal, to avoid a compiler crash.
Fixes #114371
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.
This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs being left to the fallback assuming
scalarization.
This is a follow-up to assertion issues reported for the changes in
#113195
Currently, we don't have many tests that show the SLP outcome for
integer divisions. This patch adds such tests.
In certain scenarios, vectorization is profitable for Neon. A future
patch will attempt to improve the cost model for these cases.
If the list of scalars is already vectorized as part of the same vector
node, there is no need to generate the vector node again; it will be
handled as part of overlapping matching.
Fixes #113810
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:
- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an
SSE register can hold 4xfloat converted/stored to 4xf16). This is
necessary because fp16 stores are not modeled as trunc-stores, and we
cannot mark direct Xxfp16 stores as legal since we generally expand fp16
operations.
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
conversion from/to fp16. Note that conversion from f64 to f16 is not
supported by an X86 instruction.
If a scalar that is used externally is in the root node, it may have
incorrect signedness info because of a conflict with the demanded bits
analysis. The compiler needs to perform exact signedness analysis and
compute it rather than rely on the precomputed value, which might be
incorrect for alternate zext/sext nodes.
Fixes #113520
Since SLP supports "clusterization" of non-load instructions, the
restriction of reduced values to loads only should be removed to avoid
a compiler crash.
Fixes #113516
Undefs need to be considered correctly when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values; instead, a complex analysis should
be implemented to see whether it is safe to do so.
Fixes #113425
If the graph is small and has a single buildvector node, all scalar
instructions must be from the same basic block to prevent a compiler
crash.
Fixes #113451
Enables initial non-power-of-2 support for reductions (the number of
elements must still form whole registers).
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)
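One way to read the "whole registers" requirement is sketched below. The
function name and its parameters are hypothetical, and the check is an
assumption about what the restriction means rather than the condition used
in the patch.

```cpp
#include <cstdint>

// Returns true if NumElts elements of EltSizeInBits bits each occupy a
// whole number of vector registers of RegSizeInBits bits. NumElts itself
// does not have to be a power of 2.
bool fillsWholeRegisters(unsigned NumElts, unsigned EltSizeInBits,
                         unsigned RegSizeInBits) {
  if (NumElts == 0 || EltSizeInBits == 0 || RegSizeInBits == 0)
    return false;
  const std::uint64_t TotalBits =
      static_cast<std::uint64_t>(NumElts) * EltSizeInBits;
  return TotalBits % RegSizeInBits == 0;
}
```

Under this reading, 12 x i32 with 128-bit registers qualifies (three full
registers) although 12 is not a power of 2, while 6 x i32 does not (one and
a half registers).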
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/112361
The root gather/buildvector node should be ignored when the SLP
vectorizer tries to find matching gather nodes that were vectorized
earlier. This node is definitely the last one in the pipeline and it
does not have users, so matching against it may cause a compiler crash.
Fixes #113143
This reverts commit 7f2e937469a8cec3fe977bf41ad2dfb9b4ce648a as it causes
regressions in the tests it modifies, and undoes what was added in #100653
(which itself was a fix for a previous regression).