The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. It performs a pretty similar (but large in
SLOCs) estimation, which is not always correct. Instead, this patch
implements clustering analysis and actual node allocation, with the full
analysis for the vectorized clustered scalars (not only loads, but also
some other instructions), correct cost estimation, and vector insert
instructions. This improves overall vectorization quality and
simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
with "[Vectorize] Fix warnings"
It introduced compiler crashes, see #104144.
This reverts commit 69332bb8995aef60d830406de12cb79a50390261 and
351f4a5593f1ef507708ec5eeca165b20add3340.
This patch fixes warnings of the form:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:9300:23: error: loop
variable '[E, Idx]' creates a copy from type 'const value_type' (aka
'const std::pair<const llvm::slpvectorizer::BoUpSLP::TreeEntry *,
unsigned int>') [-Werror,-Wrange-loop-construct]
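A minimal sketch of the kind of loop that triggers this warning, and the usual fix of binding by const reference. The `TreeEntry` struct, the `EntryAndIdx` alias, and `sumIndices` are hypothetical stand-ins and not the code from SLPVectorizer.cpp; only the shape of the loop matters here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for BoUpSLP::TreeEntry; only the loop shape matters.
struct TreeEntry {
  int Id;
};

using EntryAndIdx = std::pair<const TreeEntry *, unsigned>;

// Sums the indices while binding each pair by const reference, so no
// per-iteration copy of the pair is made.
unsigned sumIndices(const std::vector<EntryAndIdx> &Entries) {
  unsigned Sum = 0;
  // A by-value binding such as `for (const auto [E, Idx] : Entries)` copies
  // each pair, which is what -Wrange-loop-construct reports.
  for (const auto &[E, Idx] : Entries) {
    assert(E && "expected a non-null entry");
    Sum += Idx;
  }
  return Sum;
}
```

Structured bindings require C++17 or later.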
The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. It performs a pretty similar (but large in
SLOCs) estimation, which is not always correct. Instead, this patch
implements clustering analysis and actual node allocation, with the full
analysis for the vectorized clustered scalars (not only loads, but also
some other instructions), correct cost estimation, and vector insert
instructions. This improves overall vectorization quality and
simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
If the external user of the scalar to be extracted is in an
unreachable or landing pad block, we can skip counting its cost.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/105667
Before vectorizing the PHI nodes, the compiler sorts them by the opcodes
of their operands. If a scalar is replaced by an extractelement during
vectorization, it breaks this sorting and prevents some further
vectorization attempts. This patch tries to improve this by doing extra
analysis of the scalars and keeping them if the scalar is found to be
used in another (external) PHI node in the same block.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/103923
The operands of the phi nodes should be vectorized in the same order in
which they were created; otherwise the compiler may crash when trying
to correctly build dependencies for nodes with non-schedulable
instructions for gather/buildvector nodes.
Fixes https://github.com/llvm/llvm-project/issues/105120
Currently, the SLP schedule has two containers of `ScheduleData`:
`ExtraScheduleDataMap` and `ScheduleDataMap`. However, the
`ScheduleData` in `ExtraScheduleDataMap` is only used to indicate
whether the instruction is processed or not and does not participate in
the schedule, which is useless. `ScheduleDataMap` is sufficient for this
purpose. The `OpValue` member is used only in `ExtraScheduleDataMap`,
which is also useless.
If the scalars do not require scheduling and were already vectorized,
but in a different order, the compiler still tries to create a new node.
This may crash the compiler for the gathered operands. Instead, such
nodes need to be considered a full overlap, and the vectorized node just
reshuffled.
Fixes https://github.com/llvm/llvm-project/issues/104637
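The "full overlap, just reshuffle" idea above can be sketched as computing a permutation mask from the already vectorized order to the requested order. This is an illustration only: scalars are modelled as distinct ints, and `buildReshuffleMask` is a hypothetical helper, not a function from the SLP vectorizer.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Returns a shuffle mask M such that Requested[I] == Existing[M[I]], or
// std::nullopt if the two lists do not hold the same scalars.
// Assumption: scalars are modelled as ints and all values are distinct.
std::optional<std::vector<int>>
buildReshuffleMask(const std::vector<int> &Existing,
                   const std::vector<int> &Requested) {
  if (Existing.size() != Requested.size())
    return std::nullopt;
  std::vector<int> Mask;
  Mask.reserve(Requested.size());
  for (int Scalar : Requested) {
    auto It = std::find(Existing.begin(), Existing.end(), Scalar);
    if (It == Existing.end())
      return std::nullopt; // Not a full overlap: a new node would be needed.
    Mask.push_back(static_cast<int>(It - Existing.begin()));
  }
  return Mask;
}
```

For an existing node holding {10, 20, 30, 40} and a request for {30, 10, 40, 20}, the sketch yields the mask {2, 0, 3, 1}, so the existing vector can be reused through a single shuffle.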
The minbitwidth restrictions can be skipped only for immediate reduced
values; for other nodes we still need to check whether the external
users allow bitwidth reduction.
Fixes https://github.com/llvm/llvm-project/issues/104422
A dominance query of a block that is in a different function is
ill-defined, so assert that getNode() is only called for blocks that are
in the same function.
There are three cases, where this behavior did occur. LoopFuse didn't
explicitly do this, but didn't invalidate the SCEV block dispositions,
leaving dangling pointers to free'ed basic blocks behind, causing
use-after-free. We do, however, want to be able to dereference basic
blocks inside the dominator tree, so that we can refer to them by a
number stored inside the basic block.
Reverts #102780
Reland #101198
Fixes #102784
Co-authored-by: Alexis Engelke <engelke@in.tum.de>
Add getShufflevectorNumGroups to vectorize shufflevector.
Currently, getShufflevectorNumGroups can only vectorize a limited set of
patterns (e.g., the masks of the shufflevectors use the elements of the
source in order).
In addition, ReuseShuffleIndices and ReorderIndices are not supported.
Currently, the SLP vectorizer tries to keep only GEPs as scalars if they
are vectorized but used externally. The same approach can be used for
all scalar values. This patch tries to keep the original scalars if all
of their operands remain scalar or externally used, the cost of the
original scalar is lower than the cost of the extractelement
instruction, or the number of externally used scalars in the same entry
is a power of 2. The last criterion allows better revectorization for
multiply used scalars.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/100904
This adds a combined vectorized node. It simplifies handling of the
combined nodes, like select/cmp, which can be reduced to min/max,
mul/add transformed to fma, etc. Improves cost model handling and may
end up with better codegen in the future (direct emission of the
intrinsics).
Allow SLP optimization to progress in the presence of freeze
instructions. Prior to this commit, freeze instructions blocked SLP
optimization.
The following URL shows correctness of the addsub_freeze test:
https://alive2.llvm.org/ce/z/qm38oh
1. When REVEC is enabled, we need to expand vector types into scalar
types.
2. When REVEC is enabled, CreateInsertVector (and CreateExtractVector)
is used because the scalar type may be a FixedVectorType.
3. Since the mask indices which are used by processBuildVector expect
the source is scalar type, we need to transform the mask indices into a
form which can be used when REVEC is enabled. The transform is only
called when the mask is really used.
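As an illustration of item 3, the sketch below expands a mask whose indices count "scalars" into a mask whose indices count vector elements, assuming every scalar is itself a fixed vector of `NumElts` elements. `expandMask` and `PoisonElem` are hypothetical names used only for this example; the actual transform in the patch may differ in detail.

```cpp
#include <cassert>
#include <vector>

constexpr int PoisonElem = -1; // Hypothetical marker for a poison mask slot.

// Expands a mask written in units of "scalars" into a mask in units of vector
// elements, assuming every scalar is itself a vector of NumElts elements.
std::vector<int> expandMask(const std::vector<int> &Mask, unsigned NumElts) {
  assert(NumElts > 0 && "each scalar must have at least one element");
  std::vector<int> Expanded;
  Expanded.reserve(Mask.size() * NumElts);
  for (int Idx : Mask)
    for (unsigned J = 0; J < NumElts; ++J)
      Expanded.push_back(Idx == PoisonElem
                             ? PoisonElem
                             : Idx * static_cast<int>(NumElts) +
                                   static_cast<int>(J));
  return Expanded;
}
```

With two-element "scalars", the mask {1, 0} becomes {2, 3, 0, 1}: each original index selects a whole group of elements rather than a single one, and poison slots stay poison.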
If the reduced value was replaced by an extractelement instruction
during vectorization and we attempt to check whether this is so, we need
to check the tracked value, not the original (deleted) instruction.
Otherwise, the compiler may crash.
Fixes https://github.com/llvm/llvm-project/issues/102279
When store chains have the same value type ID and pointer type ID, they
may mix different sizes of values, such as i8 and i64. This can lead to
missed vectorization opportunities.
Currently, the SLP vectorizer compares phi instructions by the type ID
of the compared instructions, which may fail for different integer
types with different sizes. This patch adds a comparison by type size to
fix this.
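A minimal sketch of comparing by type ID first and by size second. `SimpleType` and `lessByIdAndSize` are hypothetical and only model the idea; they are not the comparator used in the SLP vectorizer.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical, simplified model of a type: a coarse kind plus a bit width.
struct SimpleType {
  int TypeID;        // e.g. one ID shared by all integer types
  unsigned SizeBits; // e.g. 8 for i8, 64 for i64
};

// Orders by type ID first and by size second, so i8 and i64 are not
// treated as equivalent even though they share a type ID.
bool lessByIdAndSize(const SimpleType &L, const SimpleType &R) {
  return std::tie(L.TypeID, L.SizeBits) < std::tie(R.TypeID, R.SizeBits);
}
```

Sorting with this comparator places all i8 values before all i64 values instead of leaving them interleaved in one equivalence class.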
The landingpad instruction must be the very first instruction after the
phi nodes, so we need to insert extractelement/shuffles after this
instruction.
Fixes https://github.com/llvm/llvm-project/issues/102187
Currently, the SLP vectorizer compares cmp instructions by the type ID
of the compared operands, which may fail for, e.g., different integer
types, which have the same type ID but different sizes. This patch adds
a comparison by type size to fix this.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/102132
If any pointer operand of the non-consecutive loads is an instruction
with a user that is not part of the current graph and thus requires
emission of an extractelement instruction, it is better to try to detect
whether the load sequence can be represented as a strided load, so that
extractelement instructions for the pointers are not required.
Reviewers: preames, RKSimon, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101668
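The strided-load detection mentioned in the message above can be illustrated with a simplified check, assuming the loads' byte offsets from a common base are known constants. `findConstantStride` is a hypothetical helper for this sketch and does not reflect how the vectorizer performs the analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the common stride if the byte offsets (relative to one base
// pointer, in load order) form an arithmetic progression with a non-zero
// step, and std::nullopt otherwise.
// Assumption: all offsets are compile-time constants.
std::optional<std::int64_t>
findConstantStride(const std::vector<std::int64_t> &Offsets) {
  if (Offsets.size() < 2)
    return std::nullopt;
  const std::int64_t Stride = Offsets[1] - Offsets[0];
  if (Stride == 0)
    return std::nullopt;
  for (std::size_t I = 2; I < Offsets.size(); ++I)
    if (Offsets[I] - Offsets[I - 1] != Stride)
      return std::nullopt;
  return Stride;
}
```

Offsets {0, 16, 32, 48} give a stride of 16, so one base pointer plus a stride describes every address and no per-lane pointer needs to be extracted. Offsets {0, 16, 40} have no common stride.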
If the graph includes only strided loads node, the compiler should still
try to vectorize it.
Reviewers: RKSimon, preames, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101659
When trying to reuse an extractelement instruction, we need to check
that it is inserted at the proper position. Its original vector operand
should come before the new vector value; otherwise a new extractelement
instruction must be generated.
Fixes https://github.com/llvm/llvm-project/issues/101213
In order to enforce a strict-weak ordering, this patch clusters the
bases that are being sorted by the root - the first value in a gep
chain. The sorting is then performed in each cluster.
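A sketch of the cluster-then-sort approach, under the assumption that offsets are only comparable between bases that share a root. `Base` and `clusterAndSort` are hypothetical names for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model of a pointer base: the root of its GEP chain and an
// offset from that root. Offsets are only comparable within one root.
struct Base {
  int Root;
  long Offset;
};

// Groups bases by root, then sorts each group by offset. The comparator is
// only ever applied to elements of one cluster, where it is a valid strict
// weak ordering.
std::map<int, std::vector<Base>>
clusterAndSort(const std::vector<Base> &Bases) {
  std::map<int, std::vector<Base>> Clusters;
  for (const Base &B : Bases)
    Clusters[B.Root].push_back(B);
  for (auto &Entry : Clusters)
    std::sort(Entry.second.begin(), Entry.second.end(),
              [](const Base &L, const Base &R) { return L.Offset < R.Offset; });
  return Clusters;
}
```

Passing a comparator that is not a strict weak ordering to `std::sort` is undefined behavior, which is why restricting comparisons to a single cluster matters.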
When trying to reuse an extractelement instruction, we need to check
that it is inserted at the proper position. Its original vector operand
should come before the new vector value; otherwise a new extractelement
instruction must be generated.
Fixes https://github.com/llvm/llvm-project/issues/101213
There is no need to handle extra arguments during the reductions
anymore: the compiler can now handle all reduced values and reduction
operands correctly, even if they are from different basic blocks.
Simplifies analysis, reduces compiler size, improves overall
vectorization.
Metric: size..text
test-suite :: SingleSource/Benchmarks/Misc-C++/stepanov_container.test 16668.00 17148.00 2.9%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2389675.00 2418683.00 1.2%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253645.00 0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309678.00 309806.00 0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389203.00 389363.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 111120.00 111152.00 0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1039103.00 1039215.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1155883.00 1155963.00 0.0%
test-suite :: MicroBenchmarks/LoopVectorization/LoopInterleavingBenchmarks.test 276646.00 276662.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848691.00 848739.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1138604.00 1138636.00 0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 910201.00 910217.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12385484.00 12385628.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 773224.00 773192.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035148.00 1035084.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 98126.00 98094.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 97966.00 97934.00 -0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 167391.00 167215.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test 56685.00 56605.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test 56685.00 56605.00 -0.1%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-20050826-2.test 1302.00 1294.00 -0.6%
Misc-C++/stepanov_container - better code due to cost fixes.
483.xalancbmk - better code due to cost fixes.
ASCI_Purple/SMG2000 - better code due to cost fixes.
Benchmarks/Bullet - better vector code because of the cost.
JM/ldecod - extra code remains scalar, extra reduction vectorized
consumer-jpeg - extra code remains scalar because of the cost.
tramp3d-v4 - better vectorization because of cost fixes.
511.povray_r - better vectorization because of cost fixes.
LoopInterleavingBenchmarks - extra reductions are vectorized
JM/lencod - small changes in vector code because of extract cost fixes.
453.povray - small changes in vector code because of extract cost fixes.
445.gobmk - extra small reduction vectorized
526.blender_r - extra reduced scalars, better small reduction, small
changes in the vectorization because of the fixes for extracts cost
602.gcc_s
502.gcc_r - small changes in reductions vectorization because of the
fixes in the extract cost.
631.deepsjeng_s
623.xalancbmk_s - small changes in reductions vectorization because of
the fixes in the extract cost.
MallocBench/gs - extra code remains scalar because of extracts cost
alacconvert-encode - extra code remains scalar because of extracts cost
alacconvert-decode - extra code remains scalar because of extracts cost
GCC-C-execute-20050826-2 - extra reduction gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99923