Currently we emit gathers for scalars being vectorized in the tree as
a pair of extractelement/insertelement instructions. Instead we can try
to find all required vectors and emit shuffle vector instructions
directly, improving the code and reducing compile time.
Part of non-power-of-2 vectorization.
Differential Revision: https://reviews.llvm.org/D110978
Currently, MinBWs map uses Value* as a key and stores mapping for each
value to be demoted. It make is it hard to get the actual MinBWs value
for the buildvector scalars(constants), since same constant might be
used in different nodes with the different MinBWs values/decisions.
Also, it consumes extra memory for the vectorized values/instructions
from the same nodes.
Better to map actual nodes. It fixes the bitwidth data fetching for
buildvector scalars and improves memory consumption/analysis time for
other instructions.
No need to store optional boolean in the map, enough to store boolean
directly. Also, we can do preliminary check for instruction and if they
are not simple, mark as aliased without storing this result in the map.
If the gather node includes ordered loads only partially (not the whole
node consists of loads) and the other gathered scalar are not loads, and
no other dependency from other nodes is found, we still can improve the
cost of gather, if take into account the fact that these loads still can
be vectorized.
operand.
Need to copy the submask not to the very first part of the common
extractelements vector mask, but to the proper one to avoid wrong code
emission.
If the buildvector node contains extractelement, which vector operand
depends on vector node, need to check if the node is ready and use
vectorized value instead of the original vector operation.
SLP includes analysis for the minimum bitwidth, the actual integer
operations can be emitted. It allows to reduce register pressure and
improve perf. Currently, it includes only cost model and the next
transformation relies on InstructionCombiner. Better to do it directly
in SLP, it allows to reduce compile time and fix cost model issues.
SLP includes analysis for the minimum bitwidth, the actual integer
operations can be emitted. It allows to reduce register pressure and
improve perf. Currently, it includes only cost model and the next
transformation relies on InstructionCombiner. Better to do it directly
in SLP, it allows to reduce compile time and fix cost model issues.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.
Differential Revision: https://reviews.llvm.org/D148855
This causes asserts:
llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:10082:
Value *llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::adjustExtracts(
const TreeEntry *, MutableArrayRef<int>, unsigned int, bool &):
Assertion `Part == 0 && "Expected firs part."' failed.
See comment on the code review.
> Currently tryToGatherExtractElements function analyzes the whole vector,
> regrdless number of actual registers, used in this vector. It may
> prevent some optimizations, because per-register analysis may allow to
> simplify the final code by reusing more already emitted vectors and
> better shuffles.
>
> Differential Revision: https://reviews.llvm.org/D148855
This reverts commit 9dfdbd788707edc8c39eb2bff16004aba1f3586b.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.
Differential Revision: https://reviews.llvm.org/D148855
This reverts commit 3e6d7c6d983dd5896e3a03857584654eb1360fda.
That commit caused miscompilation of ffmpeg's libavcodec/vp9dsp_8bpp.o
on aarch64; the file still compiles correctly, but no longer produces
the right result - see https://reviews.llvm.org/D148855#4655968
for details.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.
Differential Revision: https://reviews.llvm.org/D148855
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.
Differential Revision: https://reviews.llvm.org/D148855
the middle of reduction ops.
Need to emit freeze instruction not only in the case, where the root is
bool logical op, but also if we reduce several scalars, but unable to
say precisely, if the root is bool logical op.
If the very first reduction operation is not bool logical op, but some
others are, still need to emit the boo logic op for all the extra
reduction operations to avoid incorrect poison propagation.
type.
Need to check, if the number of vector registers, returned by TTI, is
not greater than total number of mask element and not zero, before
trying to perform any operations. TTI still may return non-valid number
of registers.
Currently when building gather/buildvector node, we try to build nodes
shuffles without taking into account separate vector registers. We can
improve final codegen and the whole vectorization process by including
this info into the analysis and the vector code emission, allows to emit
better vectorized code.
Differential Revision: https://reviews.llvm.org/D149742
Currently when building gather/buildvector node, we try to build nodes
shuffles without taking into account separate vector registers. We can
improve final codegen and the whole vectorization process by including
this info into the analysis and the vector code emission, allows to emit
better vectorized code.
Differential Revision: https://reviews.llvm.org/D149742
C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with
C++20.
We could make llvm::erase more similar to std::erase by having it
return the number of elements removed, but I'm not doing that for now
because nobody seems to care about that in our code base.
Since there are only 50 occurrences of erase_value in our code base,
this patch replaces all of them with llvm::erase and deprecates
llvm::erase_value.