Before vectorizing the PHI nodes, the compiler sorts them by the opcodes of
their operands. If a scalar is replaced by an extractelement during
vectorization, this sorting is broken, which prevents some further
vectorization attempts. The patch tries to improve this by doing extra
analysis of the scalars and keeping them if it is found that such a scalar is
used in another (external) PHI node in the same block.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/103923
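Below is a minimal sketch, with a hypothetical helper name, of the extra
analysis described above; the in-tree patch is more involved, but the core
check is whether a vectorized scalar still feeds a PHI node in its own block:
```cpp
#include "llvm/IR/Instructions.h"

// Sketch only (not the actual patch code): a vectorized scalar is worth
// keeping if another PHI node in the same basic block still uses it, instead
// of rewriting all of its uses to an extractelement.
static bool hasExternalPHIUserInSameBlock(const llvm::Instruction *Scalar) {
  const llvm::BasicBlock *BB = Scalar->getParent();
  for (const llvm::User *U : Scalar->users())
    if (auto *Phi = llvm::dyn_cast<llvm::PHINode>(U))
      if (Phi->getParent() == BB)
        return true; // keep the scalar to preserve the PHI operand sorting
  return false;
}
```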
The operands of the phi nodes should be vectorized in the same order in which
they were created; otherwise the compiler may crash when trying to correctly
build dependencies for gather/buildvector nodes with non-schedulable
instructions.
Fixes https://github.com/llvm/llvm-project/issues/105120
Currently, the SLP scheduler has two containers of `ScheduleData`:
`ExtraScheduleDataMap` and `ScheduleDataMap`. However, the `ScheduleData` in
`ExtraScheduleDataMap` is used only to indicate whether an instruction has
been processed and does not participate in scheduling, so it is redundant;
`ScheduleDataMap` is sufficient for this purpose. The `OpValue` member is used
only by `ExtraScheduleDataMap`, so it is redundant as well.
If the scalars do not require scheduling and were already vectorized, but in
a different order, the compiler still tries to create a new node. This may
cause a compiler crash for the gathered operands. Instead, such nodes need to
be treated as a full overlap, and the already vectorized node simply
reshuffled.
Fixes https://github.com/llvm/llvm-project/issues/104637
The minbitwidth restrictions can be skipped only for immediate reduced
values; for other nodes we still need to check whether external users allow
bitwidth reduction.
Fixes https://github.com/llvm/llvm-project/issues/104422
A dominance query of a block that is in a different function is ill-defined,
so assert that getNode() is only called for blocks that are in the same
function. There are three cases where this behavior did occur. LoopFuse did
not do this explicitly, but it failed to invalidate the SCEV block
dispositions, leaving dangling pointers to freed basic blocks behind and
causing a use-after-free. We do, however, want to be able to dereference basic
blocks inside the dominator tree, so that we can refer to them by a number
stored inside the basic block.
Reverts #102780
Reland #101198
Fixes #102784
Co-authored-by: Alexis Engelke <engelke@in.tum.de>
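The following is an illustrative sketch of the kind of check being added; the
wrapper name is an assumption and not the in-tree code:
```cpp
#include "llvm/IR/Dominators.h"
#include <cassert>

// A dominance query for a block from another function is ill-defined, so the
// lookup asserts that the block belongs to the function the tree was built for.
const llvm::DomTreeNode *getNodeChecked(const llvm::DominatorTree &DT,
                                        const llvm::BasicBlock *BB) {
  assert(BB->getParent() == DT.getRoot()->getParent() &&
         "cannot query the dominator tree of a different function");
  return DT.getNode(BB);
}
```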
Add getShufflevectorNumGroups to vectorize shufflevector.
Currently, getShufflevectorNumGroups can only vectorize a limited pattern
(e.g., the masks of the shufflevector use the elements of the source in
order). In addition, ReuseShuffleIndices and ReorderIndices are not supported.
Currently the SLP vectorizer tries to keep only GEPs as scalars if they are
vectorized but used externally. The same approach can be used for all scalar
values. This patch tries to keep the original scalars if all of their operands
remain scalar or externally used, if the cost of the original scalar is lower
than the cost of the extractelement instruction, or if the number of
externally used scalars in the same entry is a power of 2. The last criterion
allows better revectorization for multiply-used scalars.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/100904
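A rough sketch of the three criteria listed above, with invented parameter
names; the real cost queries live inside the SLP vectorizer:
```cpp
#include "llvm/Support/InstructionCost.h"
#include "llvm/Support/MathExtras.h"

// Keep the original scalar instead of rewriting its uses to extractelement
// when any of the three criteria from the commit message holds.
bool shouldKeepOriginalScalar(bool AllOperandsScalarOrExternallyUsed,
                              llvm::InstructionCost ScalarCost,
                              llvm::InstructionCost ExtractCost,
                              unsigned NumExternallyUsedInEntry) {
  return AllOperandsScalarOrExternallyUsed ||  // operands stay scalar anyway
         ScalarCost < ExtractCost ||           // scalar cheaper than extract
         llvm::isPowerOf2_32(NumExternallyUsedInEntry); // eases revectorization
}
```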
This adds a combined vectorized node. It simplifies handling of combined
nodes, such as select/cmp reduced to min/max, mul/add transformed to fma, etc.
It improves cost model handling and may lead to better codegen in the future
(direct emission of the intrinsics).
Allow SLP optimization to progress in the presence of freeze instructions.
Prior to this commit, freeze instructions blocked SLP optimization.
The following URL shows correctness of the addsub_freeze test:
https://alive2.llvm.org/ce/z/qm38oh
1. When REVEC is enabled, we need to expand vector types into scalar types.
2. When REVEC is enabled, CreateInsertVector (and CreateExtractVector) is used
because the scalar type may be a FixedVectorType (see the sketch after this
list).
3. Since the mask indices used by processBuildVector expect the source to be a
scalar type, we need to transform the mask indices into a form that can be
used when REVEC is enabled. The transform is only applied when the mask is
actually used.
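A sketch of point 2, assuming the IRBuilder overload that takes the index as a
Value*; the helper and its parameters are hypothetical. When the "scalar" is
itself a FixedVectorType, the insertion has to be a subvector insert rather
than an insertelement:
```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

llvm::Value *insertLane(llvm::IRBuilderBase &Builder, llvm::Value *WideVec,
                        llvm::Value *Scalar, unsigned Lane) {
  if (auto *VecTy = llvm::dyn_cast<llvm::FixedVectorType>(Scalar->getType())) {
    // REVEC case: the "scalar" is a small vector, so insert it as a subvector.
    unsigned NumElts = VecTy->getNumElements();
    return Builder.CreateInsertVector(WideVec->getType(), WideVec, Scalar,
                                      Builder.getInt64(Lane * NumElts));
  }
  // Ordinary SLP case: insert a single element at the given lane.
  return Builder.CreateInsertElement(WideVec, Scalar, Lane);
}
```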
If the reduced value was replaced by an extractelement instruction during
vectorization, then when checking whether this is the case we need to check
the tracked value, not the original (deleted) instruction. Otherwise, the
compiler may crash.
Fixes https://github.com/llvm/llvm-project/issues/102279
When store chains have the same value type ID and pointer type ID, they may
mix values of different sizes, such as i8 and i64. This can lead to missed
vectorization opportunities.
Currently the SLP vectorizer compares phi instructions by the type ID of the
compared instructions, which may fail in the case of different integer types
with different sizes. The patch adds a comparison by type size to fix this.
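A hedged sketch of the idea (the helper name is made up; the analogous cmp fix
further below uses the same approach): i8 and i64 share the same type ID
(IntegerTyID), so the comparison also has to take the type size into account:
```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Type.h"

// Two types are only treated as equivalent if both the type ID and the
// bit width match, e.g. i8 and i64 agree on the ID but not on the size.
bool sameTypeIdAndSize(const llvm::DataLayout &DL, llvm::Type *A,
                       llvm::Type *B) {
  return A->getTypeID() == B->getTypeID() &&
         DL.getTypeSizeInBits(A) == DL.getTypeSizeInBits(B);
}
```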
A landingpad instruction must be the very first instruction after the phi
nodes, so the extractelement/shuffle instructions need to be inserted after
it.
Fixes https://github.com/llvm/llvm-project/issues/102187
Currently the SLP vectorizer compares cmp instructions by the type ID of the
compared operands, which may fail in the case of different integer types that
have the same type ID but different sizes. The patch adds a comparison by type
size to fix this.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/102132
If any pointer operand of the non-consecutive loads is an instruction with a
user that is not part of the current graph, and thus requires emission of an
extractelement instruction, it is better to try to detect whether the load
sequence can be represented as a strided load, in which case the
extractelement instructions for the pointers are not required.
Reviewers: preames, RKSimon, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101668
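A purely illustrative source pattern, not taken from the patch, of a
non-consecutive load sequence that can be modeled as a single strided load
instead of emitting extractelements for the pointers:
```cpp
// Four loads with a constant stride of two elements. On targets that support
// strided loads, the whole sequence can be treated as one strided load rather
// than four independent loads plus pointer extracts.
int sumEveryOther(const int *A) {
  return A[0] + A[2] + A[4] + A[6];
}
```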
If the graph includes only a strided-loads node, the compiler should still
try to vectorize it.
Reviewers: RKSimon, preames, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/101659
When trying to reuse an extractelement instruction, we need to check that it
is inserted at the proper position. Its original vector operand should come
before the new vector value; otherwise a new extractelement instruction must
be generated.
Fixes https://github.com/llvm/llvm-project/issues/101213
In order to enforce a strict weak ordering, this patch clusters the bases
that are being sorted by their root, i.e. the first value in a GEP chain. The
sorting is then performed within each cluster.
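A loose sketch of the clustering idea; the helper name and container choice
are assumptions, not the patch's code:
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Operator.h"

// Bucket the base pointers by the root of their GEP chain so that sorting
// only ever compares pointers within one cluster, which keeps the comparator
// a strict weak ordering.
llvm::MapVector<llvm::Value *, llvm::SmallVector<llvm::Value *, 4>>
clusterByGepRoot(llvm::ArrayRef<llvm::Value *> Bases) {
  llvm::MapVector<llvm::Value *, llvm::SmallVector<llvm::Value *, 4>> Clusters;
  for (llvm::Value *Base : Bases) {
    llvm::Value *Root = Base;
    while (auto *GEP = llvm::dyn_cast<llvm::GEPOperator>(Root))
      Root = GEP->getPointerOperand(); // walk back to the first value in chain
    Clusters[Root].push_back(Base);
  }
  return Clusters;
}
```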
When trying to reuse an extractelement instruction, we need to check that it
is inserted at the proper position. Its original vector operand should come
before the new vector value; otherwise a new extractelement instruction must
be generated.
Fixes https://github.com/llvm/llvm-project/issues/101213
There is no need to handle extra arguments during reductions anymore; the
compiler can now handle all reduced values and reduction operands correctly,
even if they are from different basic blocks.
This simplifies the analysis, reduces compiler size, and improves overall
vectorization.
Metric: size..text
test-suite :: SingleSource/Benchmarks/Misc-C++/stepanov_container.test 16668.00 17148.00 2.9%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2389675.00 2418683.00 1.2%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253645.00 0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 309678.00 309806.00 0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389203.00 389363.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 111120.00 111152.00 0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1039103.00 1039215.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1155883.00 1155963.00 0.0%
test-suite :: MicroBenchmarks/LoopVectorization/LoopInterleavingBenchmarks.test 276646.00 276662.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848691.00 848739.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1138604.00 1138636.00 0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 910201.00 910217.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12385484.00 12385628.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9667580.00 9667676.00 0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2856182.00 2856198.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 773224.00 773192.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035148.00 1035084.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 98126.00 98094.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 97966.00 97934.00 -0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 167391.00 167215.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test 56685.00 56605.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test 56685.00 56605.00 -0.1%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-20050826-2.test 1302.00 1294.00 -0.6%
Misc-C++/stepanov_container - better code due to cost fixes.
483.xalancbmk - better code due to cost fixes.
ASCI_Purple/SMG2000 - better code due to cost fixes.
Benchmarks/Bullet - better vector code because of the cost.
JM/ldecod - extra code remains scalar, extra reduction vectorized
consumer-jpeg - extra code remains scalar because of the cost.
tramp3d-v4 - better vectorization because of cost fixes.
511.povray_r - better vectorization because of cost fixes.
LoopInterleavingBenchmarks - extra reductions are vectorized
JM/lencod - small changes in vector code because of extract cost fixes.
453.povray - small changes in vector code because of extract cost fixes.
445.gobmk - extra small reduction vectorized
526.blender_r - extra reduced scalars, better small reduction, small
changes in the vectorization because of the fixes for extracts cost
602.gcc_s
502.gcc_r - small changes in reductions vectorization because of the
fixes in the extract cost.
631.deepsjeng_s
623.xalancbmk_s - small changes in reductions vectorization because of
the fixes in the extract cost.
MallocBench/gs - extra code remains scalar because of extracts cost
alacconvert-encode - extra code remains scalar because of extracts cost
alacconvert-decode - extra code remains scalar because of extracts cost
GCC-C-execute-20050826-2 - extra reduction gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99923
This attempts to fix a regression from #98025, where the new order of
reduction nodes prevents later passes from producing shuffles that are as
nice. The issue boils down to picking an order of [0 1 3 2] for loaded v4i8
values, which meant later parts could not find a simpler ordering for the
shuffles given the legal nodes available in AArch64. If instead we make sure
they are ordered [0 1 2 3], then everything can fall into place.
In order to produce a better order that is more likely to work in more cases,
this patch takes the existing clustered loads and sorts the base pointers if
there is an order between them, i.e. if `V2 == gep (V1, X)` then V1 is sorted
before V2.
If the load is part of a gather node and also part of the vectorized
subvector, we need to add the cost estimation for the non-vectorized external
uses.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/99889
If at least one user of the gathered truncated instruction is vectorized and
requires a wider type than the trunc node, such gathers/buildvectors should
not be optimized to a smaller bitwidth.
Since commit 82b800ecb35fb46881aa52000fa40b1b99aa654e addressed issue #99327,
we see a performance regression (13%) on some Verilator-generated C++ code.
This is because the UsesLimit is set to 8, which is too small for the
Verilator-generated code. I have analyzed the need for the UsesLimit from [1]
and found that the UsesLimit should be at least 64 to cover most of these
cases. Thus, this patch increases the UsesLimit to 64.
Link:
[1] https://github.com/llvm/llvm-project/issues/99327#issuecomment-2236052879
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
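As a hedged illustration of how such a uses limit typically acts as a
compile-time guard (the surrounding code is not the actual SLP implementation;
only the constant value mirrors this patch):
```cpp
#include "llvm/IR/Value.h"

static constexpr unsigned UsesLimit = 64; // raised from 8 by this patch

// Once a value has at least UsesLimit users, the more expensive per-use
// analysis is skipped to bound compile time.
bool hasTooManyUsesToAnalyze(const llvm::Value *V) {
  return V->hasNUsesOrMore(UsesLimit);
}
```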