A dominance query of a block that is in a different function is
ill-defined, so assert that getNode() is only called for blocks that are
in the same function.
There are three cases where this behavior did occur. LoopFuse didn't
explicitly do this, but didn't invalidate the SCEV block dispositions,
leaving dangling pointers to freed basic blocks behind, causing
use-after-free. We do, however, want to be able to dereference basic
blocks inside the dominator tree, so that we can refer to them by a
number stored inside the basic block.
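The same-function assertion and the lookup by block number described above can be sketched roughly as follows. This is a minimal illustration with stand-in types, not the real LLVM classes: `FunctionStub`, `BlockStub`, `DomTreeSketch`, `ownsBlock`, and `setNode` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for a function and a basic block.
struct FunctionStub {};
struct BlockStub {
  const FunctionStub *Parent; // function that owns the block
  unsigned Number;            // per-function block number
};

// Hypothetical dominator-tree-like container keyed by block number.
class DomTreeSketch {
  const FunctionStub *Parent;
  std::vector<int> Nodes; // Nodes[N] is the node id for block number N

public:
  DomTreeSketch(const FunctionStub *F, unsigned NumBlocks)
      : Parent(F), Nodes(NumBlocks, -1) {}

  // True if a dominance query for BB is well-defined on this tree.
  bool ownsBlock(const BlockStub &BB) const { return BB.Parent == Parent; }

  void setNode(const BlockStub &BB, int Id) {
    assert(ownsBlock(BB) && "block belongs to a different function");
    Nodes.at(BB.Number) = Id;
  }

  int getNode(const BlockStub &BB) const {
    // BB is dereferenced here to read its number, so it must be a live block.
    assert(ownsBlock(BB) && "block belongs to a different function");
    return Nodes.at(BB.Number);
  }
};
```

Because the lookup reads a field of the block itself, a stale pointer to a freed block would be dereferenced, which is why dangling block pointers matter here.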
Reverts #102780
Reland #101198
Fixes #102784
Co-authored-by: Alexis Engelke <engelke@in.tum.de>
Add getShufflevectorNumGroups to vectorize shufflevector.
Currently, getShufflevectorNumGroups can only vectorize a limited set of
patterns (e.g., the masks of shufflevector use the elements of the source
in order).
In addition, ReuseShuffleIndices and ReorderIndices are not supported.
Currently the SLP vectorizer tries to keep only GEPs as scalar if they are
vectorized but used externally. The same approach can be used for all
scalar values. This patch tries to keep the original scalars if all their
operands remain scalar or externally used, the cost of the original scalar
is lower than the cost of the extractelement instruction, or if the number
of externally used scalars in the same entry is a power of 2. The last
criterion allows better revectorization for multiply used scalars.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/100904
This adds a combined vectorized node. It simplifies handling of the
combined nodes, like select/cmp, which can be reduced to min/max,
mul/add transformed to fma, etc. Improves cost model handling and may end
up with better codegen in the future (direct emission of the intrinsics).
Allow SLP optimization to progress in the presence of freeze
instructions. Prior to this commit, freeze instructions blocked SLP
optimization.
The following URL shows correctness of the addsub_freeze test:
https://alive2.llvm.org/ce/z/qm38oh
1. When REVEC is enabled, we need to expand vector types into scalar
types.
2. When REVEC is enabled, CreateInsertVector (and CreateExtractVector)
is used because the scalar type may be a FixedVectorType.
3. Since the mask indices which are used by processBuildVector expect
the source to be a scalar type, we need to transform the mask indices into
a form which can be used when REVEC is enabled. The transform is only
called when the mask is really used.
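One plausible reading of item 3 is that each mask index, which addresses a whole sub-vector when REVEC is enabled, is expanded into the indices of that sub-vector's elements. The sketch below illustrates that reading only; `expandMask` and `SubVecElts` are hypothetical names, and the actual transform in the patch may differ.

```cpp
#include <cassert>
#include <vector>

// Expands a mask whose indices address whole sub-vectors into a mask that
// addresses individual elements. Index I becomes I*N, I*N+1, ..., I*N+N-1,
// where N is the number of elements per sub-vector. Poison entries (-1)
// stay poison in every expanded lane.
std::vector<int> expandMask(const std::vector<int> &Mask,
                            unsigned SubVecElts) {
  assert(SubVecElts > 0 && "sub-vector must have at least one element");
  constexpr int PoisonIdx = -1;
  const int N = static_cast<int>(SubVecElts);
  std::vector<int> Out;
  Out.reserve(Mask.size() * SubVecElts);
  for (int Idx : Mask)
    for (int J = 0; J < N; ++J)
      Out.push_back(Idx == PoisonIdx ? PoisonIdx : Idx * N + J);
  return Out;
}
```

For example, `expandMask({1, 0}, 2)` yields `{2, 3, 0, 1}`: swapping two `<2 x Ty>` values becomes swapping two pairs of elements.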
If the reduced value was replaced by an extractelement instruction
during vectorization and we attempt to check whether this is so, we need
to check the tracked value, not the original (deleted) instruction.
Otherwise, the compiler may crash.
Fixes https://github.com/llvm/llvm-project/issues/102279
When store chains have the same value type ID and pointer type ID, they
may mix different sizes of values, such as i8 and i64. This can lead to
missed vectorization opportunities.
Currently the SLP vectorizer compares phi instructions by the type ID of
the compared instructions, which may fail for different integer types
with different sizes. The patch adds comparison by type sizes to fix
this.
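The idea of comparing by type size in addition to type ID can be sketched as below. The types here are simplified stand-ins, not LLVM's `Type` class; `TypeStub`, `typeLess`, and `sameTypeClass` are hypothetical names used only to show why the ID alone is not enough.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical, simplified type descriptor: all integer types share one
// ID, so the ID alone cannot separate i8 from i64.
enum class TypeID { Integer, Float, Pointer };
struct TypeStub {
  TypeID ID;
  unsigned SizeInBits;
};

// Orders by type ID first and by bit width second.
bool typeLess(const TypeStub &A, const TypeStub &B) {
  assert(A.SizeInBits > 0 && B.SizeInBits > 0 && "sized types only");
  return std::tie(A.ID, A.SizeInBits) < std::tie(B.ID, B.SizeInBits);
}

// Two types fall into the same group only if neither orders before the
// other, i.e. both the ID and the size match.
bool sameTypeClass(const TypeStub &A, const TypeStub &B) {
  return !typeLess(A, B) && !typeLess(B, A);
}
```

With only the ID as the key, an i8 and an i64 value would compare as equivalent; adding the size as a secondary key keeps them apart.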
The landingpad instruction must be the very first instruction after the
phi nodes, so we need to insert extractelement/shuffles after this
instruction.
Fixes https://github.com/llvm/llvm-project/issues/102187
Currently the SLP vectorizer compares cmp instructions by the type ID of
the compared operands, which may fail for different integer types, for
example, which have the same type ID but different sizes. The patch adds
comparison by type sizes to fix this.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/102132
If any pointer operand of the non-consecutive loads is an instruction
with a user that is not part of the current graph and thus requires
emission of an extractelement instruction, it is better to try to detect
whether the load sequence can be represented as a strided load, so that
extractelement instructions for the pointers are not required.
Reviewers: preames, RKSimon, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101668
If the graph includes only a strided-load node, the compiler should still
try to vectorize it.
Reviewers: RKSimon, preames, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101659
When trying to reuse an extractelement instruction, we need to check that
it is inserted in the proper position. Its original vector operand should
come before the new vector value; otherwise a new extractelement
instruction must be generated.
Fixes https://github.com/llvm/llvm-project/issues/101213
In order to enforce a strict-weak ordering, this patch clusters the
bases that are being sorted by the root - the first value in a gep
chain. The sorting is then performed in each cluster.
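A rough sketch of the clustering step, under simplifying assumptions: each base is described by the root of its GEP chain and its distance from that root. `BaseStub`, `Depth`, and `sortByRootClusters` are hypothetical names for this illustration, and the in-cluster sort key is an assumption, not the patch's comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of a base pointer: the root of its GEP chain
// and how many GEP steps separate it from that root.
struct BaseStub {
  std::string Name;
  std::string Root;
  unsigned Depth;
};

// Groups bases by root, then sorts each group on its own. Bases with
// different roots are never compared with each other, so the comparator
// only has to be a strict weak ordering within one cluster.
std::vector<BaseStub> sortByRootClusters(const std::vector<BaseStub> &Bases) {
  std::map<std::string, std::vector<BaseStub>> Clusters;
  for (const BaseStub &B : Bases)
    Clusters[B.Root].push_back(B);

  std::vector<BaseStub> Result;
  for (auto &Entry : Clusters) {
    std::vector<BaseStub> &Cluster = Entry.second;
    std::stable_sort(Cluster.begin(), Cluster.end(),
                     [](const BaseStub &L, const BaseStub &R) {
                       return L.Depth < R.Depth;
                     });
    Result.insert(Result.end(), Cluster.begin(), Cluster.end());
  }
  assert(Result.size() == Bases.size() && "no base may be dropped");
  return Result;
}
```

The clusters are emitted in map key order here, which is an arbitrary choice of this sketch. The point is that a comparison such as "derived from" is only meaningful between bases that share a root, so restricting the sort to one cluster avoids comparing unrelated elements.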
There is no need to handle extra arguments during the reductions anymore;
the compiler can now handle all reduced values and reduction operands
correctly, even if they are from different basic blocks.
This simplifies analysis, reduces compiler size, and improves overall
vectorization.
Metric: size..text
test-suite :: SingleSource/Benchmarks/Misc-C++/stepanov_container.test 16668.00 17148.00 2.9%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2389675.00 2418683.00 1.2%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253645.00 0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309678.00 309806.00 0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389203.00 389363.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 111120.00 111152.00 0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1039103.00 1039215.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1155883.00 1155963.00 0.0%
test-suite :: MicroBenchmarks/LoopVectorization/LoopInterleavingBenchmarks.test 276646.00 276662.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848691.00 848739.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1138604.00 1138636.00 0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 910201.00 910217.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12385484.00 12385628.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 773224.00 773192.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035148.00 1035084.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 98126.00 98094.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 97966.00 97934.00 -0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 167391.00 167215.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test 56685.00 56605.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test 56685.00 56605.00 -0.1%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-20050826-2.test 1302.00 1294.00 -0.6%
Misc-C++/stepanov_container - better code due to cost fixes.
483.xalancbmk - better code due to cost fixes.
ASCI_Purple/SMG2000 - better code due to cost fixes.
Benchmarks/Bullet - better vector code because of the cost.
JM/ldecod - extra code remains scalar, extra reduction vectorized
consumer-jpeg - extra code remains scalar because of the cost.
tramp3d-v4 - better vectorization because of cost fixes.
511.povray_r - better vectorization because of cost fixes.
LoopInterleavingBenchmarks - extra reductions are vectorized
JM/lencod - small changes in vector code because of extract cost fixes.
453.povray - small changes in vector code because of extract cost fixes.
445.gobmk - extra small reduction vectorized
526.blender_r - extra reduced scalars, better small reduction, small
changes in the vectorization because of the fixes for extracts cost
602.gcc_s
502.gcc_r - small changes in reductions vectorization because of the
fixes in the extract cost.
631.deepsjeng_s
623.xalancbmk_s - small changes in reductions vectorization because of
the fixes in the extract cost.
MallocBench/gs - extra code remains scalar because of extracts cost
alacconvert-encode - extra code remains scalar because of extracts cost
alacconvert-decode - extra code remains scalar because of extracts cost
GCC-C-execute-20050826-2 - extra reduction gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99923
This attempts to fix a regression from #98025, where the new order of
reduction nodes causes later passes to not be able to produce as nice
shuffles. The issue boils down to picking an order of [0 1 3 2] for
loaded v4i8 values, which meant later parts could not find a simpler
ordering for the shuffles given the legal nodes available in AArch64. If
instead we make sure they are ordered [0 1 2 3] then everything can fall
into place.
In order to produce a better order that is more likely to work in more
cases, this patch takes the existing clustered loads and sorts the base
pointers if there is an order between them, i.e., if `V2 == gep (V1, X)`
then V1 is sorted before V2.
If the load is part of a gather node and also part of a vectorized
subvector, we need to add the cost estimation for the non-vectorized
external uses.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99889
If at least one user of the gathered trunc'ed instruction is vectorized
and requires a wider type than the trunc node, such gathers/buildvectors
should not be optimized for better bitwidth.
Since commit 82b800ecb35fb46881aa52000fa40b1b99aa654e addressed issue
#99327, we see a performance regression (13%) on some Verilator-generated
C++ code. This is because the UsesLimit is set to 8, which is too small
for the Verilator-generated code. I have analyzed the need for the
UsesLimit from [1] and found that the UsesLimit should be at least 64 to
cover most of these cases. Thus, this patch increases the UsesLimit to 64.
Link: https://github.com/llvm/llvm-project/issues/99327#issuecomment-2236052879 [1]
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
The argument V may come from adjustExtracts, where it is the vector
operand of an ExtractElementInst. In addition, it does not exist in
getTreeEntry. The vector operand of ExtractElementInst may have a type of
<1 x Ty>, ensuring that the number of elements in ScalarTy and VecTy are
equal.
Reference: https://github.com/llvm/llvm-project/issues/99411
BoUpSLP::buildExternalUses runs through all the users of the vectorized
scalars, which may require a significant amount of time if there are too
many users. Limit the analysis: if there are too many users, all of them
are replaced at once rather than individually.
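The capped scan can be sketched as follows. This is a simplified model and not the code in BoUpSLP::buildExternalUses: `UseScan`, `scanUsers`, and the `Limit` parameter are hypothetical, and users are represented only by a flag saying whether they belong to the vectorized tree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result of scanning the users of one vectorized scalar.
struct UseScan {
  bool ReplaceAll;                // true: too many users, handle as a whole
  std::vector<int> ExternalUsers; // filled only when scanned individually
};

// Scans the users only if there are at most Limit of them. IsInternal[I]
// says whether user I is part of the vectorized tree; all other users
// count as external.
UseScan scanUsers(const std::vector<bool> &IsInternal, std::size_t Limit) {
  assert(Limit > 0 && "limit must be positive");
  if (IsInternal.size() > Limit)
    return {true, {}}; // bail out instead of visiting every user
  UseScan Scan{false, {}};
  for (std::size_t I = 0; I < IsInternal.size(); ++I)
    if (!IsInternal[I])
      Scan.ExternalUsers.push_back(static_cast<int>(I));
  return Scan;
}
```

The trade-off is precision against compile time: below the limit each external user is recorded separately, above it the scalar is treated as if every user needed the replacement.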
If the gather node is trunc'ed, it is better to trunc the scalars and then
gather them rather than gather and then trunc. Trunc for scalars is free
in most cases.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99072
If MaxVFOnly for buildvector/buildvalue vectorization is set to true and
the total number of elements to vectorize is <= 2, it is better to try to
vectorize reductions first, which may produce a larger tree (reductions
have a limit of at least 4 elements to vectorize). Smaller
buildvector/buildvalue sequences will be attempted later, with MaxVFOnly
set to false.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/98957
The patch enables detection of minnum/maxnum patterns for floating-point
instructions, represented as select/cmp. It also enables better cost
estimation for integer min/max patterns, since the compiler starts
to estimate the scalars separately.
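For orientation, the source-level shape of a select/cmp minimum and how it relates to a library minimum can be sketched as below. This is an illustration of the pattern only and says nothing about which conditions the patch checks; `selectMin`, `libMin`, and `agreeOnNumbers` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Minimum written as a compare followed by a select.
inline float selectMin(float A, float B) { return A < B ? A : B; }

// Library minimum: if exactly one operand is NaN, the other is returned.
inline float libMin(float A, float B) { return std::fmin(A, B); }

// For inputs without NaN the two forms produce equal values.
inline bool agreeOnNumbers(float A, float B) {
  assert(!std::isnan(A) && !std::isnan(B) && "sketch assumes no NaN inputs");
  return selectMin(A, B) == libMin(A, B);
}
```

The two forms differ once NaN is involved: `selectMin(1.0f, NAN)` returns NaN because the comparison is false, while `libMin(1.0f, NAN)` returns 1. Treating the select/cmp form as a min/max operation therefore depends on what is known about NaN inputs.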
Reviewers: nikic, RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/98570