If at least one user of a gathered trunc'ed instruction is vectorized
and requires a wider type than the trunc node, such
gathers/buildvectors should not be optimized for a better bitwidth.
Since commit 82b800ecb35fb46881aa52000fa40b1b99aa654e addressed
issue #99327, we see a performance regression (13%) on some
Verilator-generated C++ code. This is because UsesLimit is set to 8,
which is too small for the Verilator-generated code. I analyzed the
UsesLimit values needed in [1] and found that UsesLimit should be at
least 64 to cover most of these cases. Thus, this patch increases
UsesLimit to 64.
Link: https://github.com/llvm/llvm-project/issues/99327#issuecomment-2236052879 [1]
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
The argument V may come from adjustExtracts, where it is the vector
operand of an ExtractElementInst. In addition, getTreeEntry has no entry
for it. The vector operand of an ExtractElementInst may have the type
<1 x Ty>, in which case ScalarTy and VecTy have the same number of
elements.
reference: https://github.com/llvm/llvm-project/issues/99411
BoUpSLP::buildExternalUses runs through all the users of the vectorized
scalars, which may take a significant amount of time if there are too
many users. Limit the analysis: if there are too many users, all of
them are replaced at once rather than individually.
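
The capped analysis described above can be sketched roughly as follows.
This is a standalone illustration, not the code in
BoUpSLP::buildExternalUses: Value, ReplaceMode, and classifyUsers are
hypothetical names, and the limit is passed in as a parameter rather
than read from any real option.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an IR value and its list of users.
struct Value {
  std::vector<int> Users; // user ids, illustrative only
};

enum class ReplaceMode { PerUser, AllUsers };

// If the scalar has more than UsesLimit users, skip the per-user walk and
// request that all uses be replaced at once. Otherwise record each user
// so it can be handled individually.
ReplaceMode classifyUsers(const Value &Scalar, std::size_t UsesLimit,
                          std::vector<int> &ExternalUsers) {
  if (Scalar.Users.size() > UsesLimit)
    return ReplaceMode::AllUsers;
  for (int U : Scalar.Users)
    ExternalUsers.push_back(U);
  return ReplaceMode::PerUser;
}
```

With a limit of 8, a scalar with three users is walked user by user,
while a scalar with a hundred users is handled in one step.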
If the gather node is trunc'ed, it is better to trunc the scalars and
then gather them rather than gather and then trunc. Trunc for scalars is
free in most cases.
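
As a rough illustration of why the two orders are interchangeable, the
sketch below models a gather of four i32 scalars truncated to i8 in
plain C++. The function names are hypothetical and the code only shows
that both orders produce the same lanes; it says nothing about the cost
model used by the vectorizer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Gather four i32 scalars into a wide vector, then truncate every lane.
std::array<std::uint8_t, 4> gatherThenTrunc(std::uint32_t A, std::uint32_t B,
                                            std::uint32_t C, std::uint32_t D) {
  std::array<std::uint32_t, 4> Wide = {A, B, C, D}; // wide buildvector
  std::array<std::uint8_t, 4> Narrow{};
  for (std::size_t I = 0; I < Wide.size(); ++I)
    Narrow[I] = static_cast<std::uint8_t>(Wide[I]); // per-lane trunc
  return Narrow;
}

// Truncate each scalar first, then gather the narrow values.
std::array<std::uint8_t, 4> truncThenGather(std::uint32_t A, std::uint32_t B,
                                            std::uint32_t C, std::uint32_t D) {
  return {static_cast<std::uint8_t>(A), static_cast<std::uint8_t>(B),
          static_cast<std::uint8_t>(C), static_cast<std::uint8_t>(D)};
}
```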
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99072
If MaxVFOnly for buildvector/buildvalue vectorization is set to true and
the total number of elements to vectorize is <= 2, it is better to try
to vectorize reductions first, which may produce a larger tree
(reductions require at least 4 elements to vectorize). Vectorization of
smaller buildvector/buildvalue sequences is attempted later, with
MaxVFOnly set to false.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/98957
The patch enables detection of minnum/maxnum patterns for floating-point
instructions represented as select/cmp. It also enables better cost
estimation for integer min/max patterns, since the compiler now
estimates the scalars separately.
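
For illustration, the select/cmp shape of a floating-point min/max looks
like the C++ below. selectMin and selectMax are hypothetical names; the
sketch only shows the source-level pattern and that it agrees with
std::fmin/std::fmax for ordinary (non-NaN) inputs. NaN handling can
differ between the two forms, and the conditions under which the patch
treats them as equivalent are not stated here.

```cpp
#include <cmath>

// Select/cmp form of a minimum: compare, then pick one operand.
double selectMin(double A, double B) { return A < B ? A : B; }

// Select/cmp form of a maximum.
double selectMax(double A, double B) { return A > B ? A : B; }
```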
Reviewers: nikic, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/98570
If the reduction operation is a select-based logical op and its
condition would be replaced by poison, it is better to replace it with a
non-poisoning constant instead, to prevent poison propagation in the
vector code.
Fixes https://github.com/llvm/llvm-project/issues/98838
SLP vectorizes scalar types into vector types. In the future, we will
try to make SLP vectorize vector types into wider vector types. We add
getWidenedType as a helper function. For example, SLP will make the
following code
%v0 = load i32, ptr %in0, align 4
%v1 = load i32, ptr %in1, align 4
%v2 = load i32, ptr %in2, align 4
%v3 = load i32, ptr %in3, align 4
into a load <4 x i32>. The ScalarTy is i32 and VF is 4. In the future,
SLP will make the following code
%v0 = load <4 x i32>, ptr %in0, align 4
%v1 = load <4 x i32>, ptr %in1, align 4
%v2 = load <4 x i32>, ptr %in2, align 4
%v3 = load <4 x i32>, ptr %in3, align 4
into a load <16 x i32>. The ScalarTy is <4 x i32> and VF is 4.
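
The lane arithmetic in the two examples above can be modelled with a
small sketch. SimpleType is a hypothetical stand-in for an LLVM type,
and this getWidenedType is a simplified model that borrows the helper's
name; it is not the signature or implementation used in the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM type: NumElts == 1 models a scalar,
// NumElts > 1 models a fixed vector of NumElts lanes of ElemBits each.
struct SimpleType {
  unsigned ElemBits;
  unsigned NumElts;
};

// Widen ScalarTy by VF: the element width is kept and the lane count is
// multiplied by VF.
SimpleType getWidenedType(SimpleType ScalarTy, unsigned VF) {
  assert(VF > 0 && "VF must be positive");
  return SimpleType{ScalarTy.ElemBits, ScalarTy.NumElts * VF};
}
```

Under this model, i32 with VF 4 gives 4 lanes of 32 bits, and
<4 x i32> with VF 4 gives 16 lanes of 32 bits.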
reference:
https://discourse.llvm.org/t/rfc-make-slp-vectorizer-revectorize-vector-instructions/79436
The patch tries to keep the original order of the instructions in the
reductions. Previously, the first two instructions were swapped, giving
a reversed order.
This is the first step towards supporting ordered reductions.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/98025
The "instruction" reordering mode should be selected only if there are
compatible instructions in other operands, which can be reordered.
Otherwise, it is better to select the splat reordering mode.
Metric: size..text

Program                                                                   size..text
                                                                          results      results0     diff
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12383340.00  12383324.00  -0.0%

Some 4x operations get replaced by 8x.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/97485
If the instruction is marked for deletion, it is better to drop all its
operands and mark them for deletion too (if allowed). This allows more
vectorizable patterns and generates fewer useless extractelement
instructions.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/97409
Allows better codegen through free resizing of small-VF vector operands
followed by regular shuffling of same-size operands, and simplifies the
code.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/97414
Currently the SLP vectorizer first tries to find reduction nodes and
then vectorizes buildvector sequences. It should instead try to
vectorize wide buildvector sequences first, then reductions, and only
then smaller buildvector sequences.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/96943
I'm not super familiar with this code, but it seems that we were just
missing a check.
The original code that triggered this did not have uselistorders, but
llvm-reduce created them, and the result reproduces the same issue in a
much more compact form.
Fixes https://github.com/llvm/llvm-project/issues/95016
Since `raw_string_ostream` doesn't own the string buffer, it is
desirable (in terms of memory safety) for users to directly reference
the string buffer rather than use `raw_string_ostream::str()`.
Work towards TODO comment to remove `raw_string_ostream::str()`.
The previous patch did not pass the list of extract indices by
reference, so the compiler just ignored them. Pass the indices by
reference and fix the per-register analysis.
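
The class of bug described above can be reproduced in a few lines. The
function names below are hypothetical and unrelated to the SLP code;
the sketch only shows how updates made through a by-value parameter are
lost, while a by-reference parameter makes them visible to the caller.

```cpp
#include <vector>

// By-value parameter: the push_back lands in a local copy and is
// discarded when the function returns.
void collectByValue(std::vector<int> Indices) { Indices.push_back(7); }

// By-reference parameter: the push_back modifies the caller's vector.
void collectByReference(std::vector<int> &Indices) { Indices.push_back(7); }
```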
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/96808
If the base node is signed but some values are unsigned, the whole node
should still be considered signed. Also, an extra bitwidth analysis
should be performed when estimating the minimal bitwidth.
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.