1830 Commits

Author SHA1 Message Date
Alexey Bataev
20b2c9f10f [SLP][NFC]Use GatheredScalars vector instead of the original E->Scalars, NFC
GatheredScalars is a full copy of E->Scalars in these places and can
be safely used for now. Unifies the code across the function.
2024-08-14 08:29:38 -07:00
Alexey Bataev
d9b9ae6ba9 [SLP][NFC]Use transform nodes before building external uses, NFC.
In preparation for upcoming patches, this just moves the call to
the proper place, which is NFC for now.
2024-08-14 08:19:05 -07:00
Han-Kuan Chen
246f345152
[SLP][REVEC] Make CastInst support vector instructions. (#103216) 2024-08-13 23:52:32 +08:00
Han-Kuan Chen
6aad4918e8
[SLP][REVEC] Make MinBWs support vector instructions. (#103049)
If ScalarTy is FixedVectorType, it should remain as FixedVectorType.
2024-08-13 21:35:28 +08:00
Han-Kuan Chen
2256d00a14
[SLP][REVEC] Use VL.front()->getType() as ScalarTy. (#102437)
VL.front()->getType() may be FixedVectorType when revec is enabled.

Fix "Expected item in MinBWs.".
2024-08-13 19:53:45 +08:00
Han-Kuan Chen
875b551de7
[SLP][REVEC] Make computeMinimumValueSizes and collectValuesToDemote support vector instructions. (#103005) 2024-08-13 19:35:25 +08:00
Vitaly Buka
5ce47a5813
Reland "[Support] Assert that DomTree nodes share parent" (#102782)
A dominance query of a block that is in a different function is
ill-defined, so assert that getNode() is only called for blocks that are
in the same function.

There are three cases, where this behavior did occur. LoopFuse didn't
explicitly do this, but didn't invalidate the SCEV block dispositions,
leaving dangling pointers to free'ed basic blocks behind, causing
use-after-free. We do, however, want to be able to dereference basic
blocks inside the dominator tree, so that we can refer to them by a
number stored inside the basic block.

Reverts #102780
Reland #101198
Fixes #102784

Co-authored-by: Alexis Engelke <engelke@in.tum.de>
2024-08-13 11:56:02 +02:00
Han-Kuan Chen
b4b0c02306
[SLP][REVEC] Make tryToReduce and related functions support vector instructions. (#102327) 2024-08-13 11:44:23 +08:00
Han-Kuan Chen
70cf58e6c1
[SLP][REVEC] Make SLP vectorize shufflevector. (#102489)
Add getShufflevectorNumGroups to vectorize shufflevector.

Currently, getShufflevectorNumGroups can only vectorize limited patterns
(e.g., the masks of shufflevector use the elements of the source in
order).

In addition, ReuseShuffleIndices and ReorderIndices are not supported.
2024-08-13 11:19:29 +08:00
Alexey Bataev
ecbbe5b431
[SLP]Fix mask building for alternate node cost estimation (#102966)
Need to use the same functionality in the cost model as in codegen to
correctly build the shuffle mask and estimate the cost.
2024-08-12 17:26:56 -04:00
Alexey Bataev
b10ecfa914
[SLP]Represent externally used values as original scalars, if profitable.
Currently SLP vectorizer tries to keep only GEPs as scalar, if they are
vectorized but used externally. Same approach can be used for all scalar
values. This patch tries to keep original scalars if all its operands
remain scalar or externally used, the cost of the original scalar is
lower than the cost of the extractelement instruction, or if the number
of externally used scalars in the same entry is power of 2. Last
criterion allows better revectorization for multiply used scalars.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/100904
2024-08-12 10:15:02 -04:00
Alexey Bataev
34514ce09a [SLP][NFC]Use local getShuffleCost function across the code, NFC. 2024-08-12 06:49:53 -07:00
Alexey Bataev
2a05971de2 [SLP]Add index of the node to the short name output.
Improves debugging experience, does nothing with the functionality.
2024-08-08 08:57:14 -07:00
Han-Kuan Chen
7a4fc7491c
[SLP][REVEC] Fix insertelement has multiple uses. (#102329) 2024-08-08 23:23:10 +08:00
Alexey Bataev
7e7a439705
[SLP][NFC]Introduce CombinedVectorize nodes, NFC. (#99309)
This adds combined vectorized node. It simplifies handling of the
combined nodes, like select/cmp, which can be reduced to min/max,
mul/add transformed to fma, etc. Improves cost model handling and may end
up with better codegen in future (direct emission of the intrinsics).
2024-08-08 08:05:33 -04:00
Han-Kuan Chen
60ac34701e
[SLP][REVEC] Make getAltInstrMask and getGatherCost vectorize vector instructions. (#99461) 2024-08-08 10:39:01 +08:00
John McIver
bb82c79d3b
[SLP] Enable optimization of freeze instructions (#102217)
Allow SLP optimization to progress in the presence of freeze
instructions. Prior to this commit, freeze instructions blocked SLP
optimization.
    
The following URL shows correctness of the addsub_freeze test:
https://alive2.llvm.org/ce/z/qm38oh
2024-08-07 15:01:37 -04:00
Han-Kuan Chen
97743b8be8
[SLP][REVEC] Make ShuffleCostEstimator and ShuffleInstructionBuilder support vector instructions. (#99499)
1. When REVEC is enabled, we need to expand vector types into scalar
types.
2. When REVEC is enabled, CreateInsertVector (and CreateExtractVector)
is used because the scalar type may be a FixedVectorType.
3. Since the mask indices which are used by processBuildVector expect
the source is scalar type, we need to transform the mask indices into a
form which can be used when REVEC is enabled. The transform is only
called when the mask is really used.
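
Point 3 above can be illustrated with a minimal sketch. It assumes that every "scalar" is itself a fixed vector of `ElemsPerScalar` elements and that a negative index marks a poison lane; the helper name `expandMaskForRevec` is hypothetical and is not the in-tree function.

```cpp
#include <vector>

// Hypothetical helper, not the in-tree implementation: expands a shuffle mask
// written in units of whole "scalars" into a mask over individual elements,
// assuming every scalar is a fixed vector of ElemsPerScalar elements.
// A negative index is treated as a poison lane and stays -1 in every
// expanded position.
std::vector<int> expandMaskForRevec(const std::vector<int> &Mask,
                                    unsigned ElemsPerScalar) {
  std::vector<int> Expanded;
  Expanded.reserve(Mask.size() * ElemsPerScalar);
  for (int Idx : Mask) {
    for (unsigned I = 0; I < ElemsPerScalar; ++I) {
      if (Idx < 0)
        Expanded.push_back(-1);
      else
        Expanded.push_back(Idx * static_cast<int>(ElemsPerScalar) +
                           static_cast<int>(I));
    }
  }
  return Expanded;
}
```

With two-element "scalars", the mask {1, 0} would expand to {2, 3, 0, 1} under these assumptions.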
2024-08-07 23:47:57 +08:00
Alexey Bataev
441f94f4bd [SLP]Fix PR102279: check the tracked values for extractelements, not the original values
If the reduced value was replaced by the extractelement instruction
during vectorization and we attempt to check if this is so, need to
check the tracked value, not the original (deleted) instruction.
Otherwise, the compiler may crash.

Fixes https://github.com/llvm/llvm-project/issues/102279
2024-08-07 04:21:24 -07:00
tcwzxx
b64ec3c9fa
[SLP] The order of store chains needs to consider the size of the values. (#101810)
When store chains have the same value type ID and pointer type ID, they
may mix different sizes of values, such as i8 and i64. This can lead to
missed vectorization opportunities.
2024-08-07 11:01:53 +08:00
Alexey Bataev
af80d3a248
[SLP]Better sorting of phi instructions by comparing type sizes (#102188)
Currently the SLP vectorizer compares phi instructions by the type ID of
the compared instructions, which may fail for different integer types
with different sizes. This patch adds comparison by type sizes to fix
this.
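
A minimal sketch of the idea, assuming a simplified stand-in for an IR type (the `TypeInfo` struct and the function names are hypothetical): when two types share the same type ID, as all integer types do, the bit width is used as a tie-breaker so that i8 and i64 values do not end up interleaved.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an IR type: a kind identifier plus a bit width.
struct TypeInfo {
  unsigned TypeID;
  unsigned SizeInBits;
};

// Orders by type ID first and falls back to the size in bits, so that
// same-ID types of different widths form separate runs after sorting.
bool typeLess(const TypeInfo &A, const TypeInfo &B) {
  if (A.TypeID != B.TypeID)
    return A.TypeID < B.TypeID;
  return A.SizeInBits < B.SizeInBits;
}

// Stable sort keeps the original relative order of equal types.
void sortByType(std::vector<TypeInfo> &Types) {
  std::stable_sort(Types.begin(), Types.end(), typeLess);
}
```

Comparing by type ID alone would treat every entry of a mixed i8/i64 list as equivalent; the size tie-breaker is what groups equal-width candidates next to each other.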
2024-08-06 16:09:11 -04:00
Alexey Bataev
2601d6f189 [SLP]Fix PR102187: do not insert extractelement before landingpad instruction.
The landingpad instruction must be the very first instruction after the
phi nodes, so need to insert extractelement/shuffles after this instruction.
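
The placement rule can be sketched as follows, assuming a basic block is modelled as a flat list of instruction kinds (the `Kind` enum and `insertionIndex` are hypothetical names used only for illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction kinds for one basic block.
enum class Kind { Phi, LandingPad, Other };

// Returns the first index at which a new non-phi instruction may be placed:
// after all leading phi nodes and, if present, after the landingpad that
// immediately follows them.
std::size_t insertionIndex(const std::vector<Kind> &Block) {
  std::size_t I = 0;
  while (I < Block.size() && Block[I] == Kind::Phi)
    ++I;
  if (I < Block.size() && Block[I] == Kind::LandingPad)
    ++I;
  return I;
}
```

For a block of the form phi, phi, landingpad, other, this sketch yields index 3, so the new instruction lands after the landingpad rather than before it.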

Fixes https://github.com/llvm/llvm-project/issues/102187
2024-08-06 12:33:13 -07:00
Alexey Bataev
3c3ea7e751
[SLP]Better sorting of cmp instructions by comparing type sizes.
Currently the SLP vectorizer compares cmp instructions by the type ID of
the compared operands, which may fail for different integer types, for
example, which have the same type ID but different sizes. This patch adds
comparison by type sizes to fix this.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/102132
2024-08-06 11:03:36 -04:00
Alexey Bataev
daf4a06e5c
[SLP]Try detect strided loads, if any pointer op require extraction.
If any pointer operand of the non-consecutive loads is an instruction
with a user that is not part of the current graph and thus requires
emission of an extractelement instruction, it is better to try to detect
whether the load sequence can be represented as a strided load, so that
extractelement instructions for the pointers are not required.

Reviewers: preames, RKSimon, topperc

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/101668
2024-08-06 09:20:50 -04:00
Alexey Bataev
799fd3d87b
[SLP]Support vectorization of small strided loads only graph.
If the graph includes only strided load nodes, the compiler should still
try to vectorize it.

Reviewers: RKSimon, preames, topperc

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/101659
2024-08-05 12:51:10 -04:00
Kazu Hirata
b7146aed5b
[Transforms] Construct SmallVector with ArrayRef (NFC) (#101851) 2024-08-03 15:33:08 -07:00
Florian Hahn
edf46f365c
[SCEV] Use const SCEV * explicitly in more places.
Use const SCEV * explicitly in more places to prepare for
https://github.com/llvm/llvm-project/pull/91961. Split off as suggested.
2024-08-03 20:10:01 +01:00
Han-Kuan Chen
b5a7d3b6c2
[SLP][REVEC] Make Instruction::Select support vector instructions. (#100507) 2024-07-31 23:03:50 +08:00
Alexey Bataev
6b1d13761a [SLP]Fix PR101213: Reuse extractelement, only if its vector operand comes before new vector value.
When trying to reuse extractelement instruction, need to check that it
is inserted into proper position. Its original vector operand should
come before new vector value, otherwise new extractelement instruction
must be generated.

Fixes https://github.com/llvm/llvm-project/issues/101213
2024-07-30 16:02:46 -07:00
Alexey Bataev
a6ef0864e9 Revert "[SLP]Fix PR101213: Reuse extractelement, only if its vector operand comes before new vector value."
This reverts commit f70f1228035c9610de38e0e376afdacb647c4ad9 to fix the
crash reported by https://lab.llvm.org/buildbot/#/builders/133/builds/2456.
2024-07-30 15:11:35 -07:00
David Green
89b67a6400
[SLP] Cluster SortedBases before sorting. (#101144)
In order to enforce a strict-weak ordering, this patch clusters the
bases that are being sorted by the root - the first value in a gep
chain. The sorting is then performed in each cluster.
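
A minimal sketch of the cluster-then-sort idea, under stated assumptions: each base is reduced to a hypothetical `Root` identifier and a hypothetical in-cluster sort `Key`, neither of which is taken from the patch. Elements with different roots are never compared with each other, so the comparator only has to be a strict weak ordering within a single cluster.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical record: Root identifies the first value of a gep chain and
// Key is whatever the in-cluster comparison is based on.
struct Base {
  int Root;
  int Key;
};

// Groups bases by root (clusters keep their order of first appearance),
// then sorts each cluster independently.
std::vector<Base> clusterThenSort(const std::vector<Base> &Bases) {
  std::vector<std::pair<int, std::vector<Base>>> Clusters;
  for (const Base &B : Bases) {
    auto It = std::find_if(Clusters.begin(), Clusters.end(),
                           [&](const auto &C) { return C.first == B.Root; });
    if (It == Clusters.end())
      Clusters.emplace_back(B.Root, std::vector<Base>{B});
    else
      It->second.push_back(B);
  }
  std::vector<Base> Result;
  for (auto &C : Clusters) {
    std::stable_sort(C.second.begin(), C.second.end(),
                     [](const Base &L, const Base &R) { return L.Key < R.Key; });
    Result.insert(Result.end(), C.second.begin(), C.second.end());
  }
  return Result;
}
```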
2024-07-30 22:12:20 +01:00
Alexey Bataev
f70f122803 [SLP]Fix PR101213: Reuse extractelement, only if its vector operand comes before new vector value.
When trying to reuse extractelement instruction, need to check that it
is inserted into proper position. Its original vector operand should
come before new vector value, otherwise new extractelement instruction
must be generated.

Fixes https://github.com/llvm/llvm-project/issues/101213
2024-07-30 14:04:50 -07:00
Alexey Bataev
197f4a9051
[SLP]Remove ExtraArgs from reductions.
No need to handle extra arguments during the reductions anymore, the
compiler now can handle all reduced values and reduction operands
correctly, even if they are from different basic blocks.

Simplifies analysis, reduces compiler size, improves overall
vectorization.

Metric: size..text
test-suite :: SingleSource/Benchmarks/Misc-C++/stepanov_container.test    16668.00    17148.00  2.9%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2389675.00  2418683.00  1.2%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253645.00  0.1%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   309678.00   309806.00  0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   389203.00   389363.00  0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   111120.00   111152.00  0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1039103.00  1039215.00  0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1155883.00  1155963.00  0.0%
test-suite :: MicroBenchmarks/LoopVectorization/LoopInterleavingBenchmarks.test   276646.00   276662.00  0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848691.00   848739.00  0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1138604.00  1138636.00  0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   910201.00   910217.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12385484.00 12385628.00  0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  9667580.00  9667676.00  0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  9667580.00  9667676.00  0.0%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  2856182.00  2856198.00  0.0%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  2856182.00  2856198.00  0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   773224.00   773192.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035148.00  1035084.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    98126.00    98094.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    97966.00    97934.00 -0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   167391.00   167215.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test    56685.00    56605.00 -0.1%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test    56685.00    56605.00 -0.1%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-20050826-2.test     1302.00     1294.00 -0.6%

Misc-C++/stepanov_container - better code due to cost fixes.
483.xalancbmk - better code due to cost fixes.
ASCI_Purple/SMG2000 - better code due to cost fixes.
Benchmarks/Bullet - better vector code because of the cost.
JM/ldecod - extra code remain scalar, extra reduction vectorized
consumer-jpeg - extra code remain scalar because of the cost.
tramp3d-v4 - better vectorization because of cost fixes.
511.povray_r - better vectorization because of cost fixes.
LoopInterleavingBenchmarks - extra reductions are vectorized
JM/lencod - small changes in vector code because of extract cost fixes.
453.povray - small changes in vector code because of extract cost fixes.
445.gobmk - extra small reduction vectorized
526.blender_r - extra reduced scalars, better small reduction, small
changes in the vectorization because of the fixes for extracts cost
602.gcc_s
502.gcc_r - small changes in reductions vectorization because of the
fixes in the extract cost.
631.deepsjeng_s
623.xalancbmk_s - small changes in reductions vectorization because of
the fixes in the extract cost.
MallocBench/gs - extra code remain scalar because of extracts cost
alacconvert-encode - extra code remain scalar because of extracts cost
alacconvert-decode - extra code remain scalar because of extracts cost
GCC-C-execute-20050826-2 - extra reduction gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/99923
2024-07-29 13:23:56 -04:00
David Green
f2d2ae3f5a
[SLP] Order clustered load base pointers by ascending offsets (#100653)
This attempts to fix a regression from #98025, where the new order of
reduction nodes causes later passes to not be able to produce as nice
shuffles. The issue boils down to picking an order of [0 1 3 2] for
loaded v4i8 values, which meant later parts could not find a simpler
ordering for the shuffles given the legal nodes available in AArch64. If
instead we make sure they are ordered [0 1 2 3] then everything can fall
into place.

In order to produce a better order that is more likely to work in more
cases, this patch takes the existing clustered loads and sorts the base
pointers if there is an order between them, i.e. if `V2 == gep (V1, X)`
then V1 is sorted before V2.
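
One way to picture the ordering is sketched below. It assumes that every pointer is either a root or a gep of another pointer with a constant offset, that the chains are acyclic, and that all pointers share one root; the `Pointer` struct and both function names are hypothetical. Each pointer is resolved to its total offset from the root, and the resulting indices are ordered by ascending offset.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical model: Parent is the index of the pointer this one is a gep
// of (-1 for the root), Offset is the constant X in `gep (Parent, X)`.
struct Pointer {
  int Parent;
  long Offset;
};

// Walks the gep chain up to the root and sums the offsets on the way.
long offsetFromRoot(const std::vector<Pointer> &Ptrs, int Idx) {
  long Total = 0;
  while (Ptrs[Idx].Parent >= 0) {
    Total += Ptrs[Idx].Offset;
    Idx = Ptrs[Idx].Parent;
  }
  return Total;
}

// Returns the pointer indices ordered by ascending offset from the root.
std::vector<int> orderByOffset(const std::vector<Pointer> &Ptrs) {
  std::vector<int> Order(Ptrs.size());
  std::iota(Order.begin(), Order.end(), 0);
  std::stable_sort(Order.begin(), Order.end(), [&](int A, int B) {
    return offsetFromRoot(Ptrs, A) < offsetFromRoot(Ptrs, B);
  });
  return Order;
}
```

In this model a pointer reached through a non-negative offset never sorts before the pointer it is derived from, which matches the "V1 before V2" rule for that case.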
2024-07-27 11:18:56 +01:00
Alexey Bataev
1e1c8d1615
[SLP]Add external uses cost for the gathered loads.
If the load is a part of the gather node and also a part of the
vectorized subvector, need to add the estimation for the non-vectorized
external uses.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/99889
2024-07-26 11:09:44 -04:00
Han-Kuan Chen
5fc9502f19
[SLP] NFC. ShuffleInstructionBuilder::add V1->getType() is always a FixedVectorType. (#99842)
castToScalarTyElem has a cast<VectorType>(V->getType()).
2024-07-24 01:40:24 +08:00
Alexey Bataev
3cb82f49dc [SLP]Fix PR99899: Use canonical type instead of original vector of ptr.
Use adjusted canonical integer type instead of the original ptr type to
fix the crash in the TTI.
Fixes https://github.com/llvm/llvm-project/issues/99899
2024-07-22 13:05:12 -07:00
Alexey Bataev
f6e01b9ece [SLP]Do not trunc bv nodes, if the user is vectorized an requires wider type.
If at least one user of the gathered trunc'ed instruction is vectorized
and requires a wider type than the trunc node, such gathers/buildvectors
should not be optimized for better bitwidth.
2024-07-19 07:28:04 -07:00
Yangyu Chen
007aa6d1b2
[SLP] Increase UsesLimit to 64 (#99467)
Since commit 82b800ecb35fb46881aa52000fa40b1b99aa654e addressed the
issue #99327, we see some performance regression (13%) on some
verilator generated C++ code. This is because the UsesLimit is set to 8,
which is too small for the verilator generated code. I have analyzed the
need for the UsesLimit from [1] and found that the UsesLimit should be
at least 64 to cover most of these cases. Thus, this patch increases the
UsesLimit to 64.

Link: https://github.com/llvm/llvm-project/issues/99327#issuecomment-2236052879 [1]

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
2024-07-19 20:32:28 +08:00
Han-Kuan Chen
39bb244a16
[SLP][REVEC] Make Instruction::Call support vector instructions. (#99317) 2024-07-18 20:49:53 +08:00
Han-Kuan Chen
b634e057dd
[SLP][REVEC] Fix false assumption of the source for castToScalarTyElem. (#99424)
The argument V may come from adjustExtracts, which is the vector operand
of ExtractElementInst. In addition, it does not exist in getTreeEntry.

The vector operand of ExtractElementInst may have a type of <1 x Ty>,
ensuring that the number of elements in ScalarTy and VecTy are equal.

reference: https://github.com/llvm/llvm-project/issues/99411
2024-07-18 19:54:46 +08:00
Alexey Bataev
82b800ecb3 [SLP][NFC]Limit number of the external uses analysis, NFC.
BoUpSLP::buildExternalUses runs through all the users of the vectorized
scalars, which may require a significant amount of time if there are too
many users. Limit the analysis: if there are too many users, all of
them are replaced at once rather than individually.
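
The capped scan can be sketched roughly as below. This is an illustration under assumptions, not the code of BoUpSLP::buildExternalUses: users are modelled as one flag each (true when the user is part of the vectorized graph), and `ScanResult` and `scanUsers` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result of scanning the users of one vectorized scalar.
struct ScanResult {
  bool ReplaceAllUsers;                    // true when the cap was exceeded
  std::vector<std::size_t> ExternalUsers;  // indices of users outside the graph
};

// If the scalar has more users than UsesLimit, skip the per-user walk and
// request replacement of every use; otherwise record each external user.
ScanResult scanUsers(const std::vector<bool> &UserIsInGraph,
                     std::size_t UsesLimit) {
  ScanResult Result{false, {}};
  if (UserIsInGraph.size() > UsesLimit) {
    Result.ReplaceAllUsers = true;
    return Result;
  }
  for (std::size_t I = 0; I < UserIsInGraph.size(); ++I)
    if (!UserIsInGraph[I])
      Result.ExternalUsers.push_back(I);
  return Result;
}
```

A small cap keeps compile time bounded but takes the coarse path more often, which is the trade-off behind tuning such a limit.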
2024-07-17 14:12:22 -07:00
Alexey Bataev
c5c1bd164f [SLP]Improve minbitwidth analysis for trun'ed gather nodes.
If the gather node is trunc'ed, better to trunc scalars and then gather
them rather than gather and then trunc. Trunc for scalars is free in
most cases.
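
As a plain C++ analogy (not IR, and with hypothetical function names), the two orders of operation produce the same lanes; the difference is only where the truncation happens. Truncating the four scalars first means the gather already builds the narrow vector.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Truncate each scalar first, then build the narrow 4-lane vector.
std::array<std::uint8_t, 4> truncThenGather(std::uint32_t A, std::uint32_t B,
                                            std::uint32_t C, std::uint32_t D) {
  return {static_cast<std::uint8_t>(A), static_cast<std::uint8_t>(B),
          static_cast<std::uint8_t>(C), static_cast<std::uint8_t>(D)};
}

// Build the wide 4-lane vector first, then truncate every lane.
std::array<std::uint8_t, 4> gatherThenTrunc(std::uint32_t A, std::uint32_t B,
                                            std::uint32_t C, std::uint32_t D) {
  std::array<std::uint32_t, 4> Wide = {A, B, C, D};
  std::array<std::uint8_t, 4> Narrow = {};
  for (std::size_t I = 0; I < Wide.size(); ++I)
    Narrow[I] = static_cast<std::uint8_t>(Wide[I]);
  return Narrow;
}
```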

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/99072
2024-07-17 07:41:00 -07:00
Alexey Bataev
05b067b5f9 Revert "[SLP]Improve minbitwidth analysis for trun'ed gather nodes."
This reverts commit d3d2f9a4208eedbd2f372c34725ab61c3f4d3aed to fix
buildbot https://lab.llvm.org/buildbot/#/builders/92/builds/1880.
2024-07-17 07:31:27 -07:00
Alexey Bataev
d3d2f9a420 [SLP]Improve minbitwidth analysis for trun'ed gather nodes.
If the gather node is trunc'ed, better to trunc scalars and then gather
them rather than gather and then trunc. Trunc for scalars is free in
most cases.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/99072
2024-07-17 07:29:02 -07:00
Alexey Bataev
b05ccaf451 Revert "[SLP]Improve minbitwidth analysis for trun'ed gather nodes."
This reverts commit 6425f2d66740b84fc3027b649cd4baf660c384e8 to fix the
buildbot issues reported in https://lab.llvm.org/buildbot/#/builders/95/builds/1404.
2024-07-17 05:51:54 -07:00
Han-Kuan Chen
1813ffd6b2
[SLP][REVEC] Make SLP support revectorization (-slp-revec) and add simple test. (#98269)
This PR will make SLP support revectorization. Add an option -slp-revec
to control the functionality.

reference:

https://discourse.llvm.org/t/rfc-make-slp-vectorizer-revectorize-vector-instructions/79436
2024-07-17 20:14:12 +08:00
Alexey Bataev
6425f2d667
[SLP]Improve minbitwidth analysis for trun'ed gather nodes.
If the gather node is trunc'ed, better to trunc scalars and then gather
them rather than gather and then trunc. Trunc for scalars is free in
most cases.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/99072
2024-07-17 07:17:25 -04:00
Alexey Bataev
15915c06d5
[SLP]Do not vectorize small (<=2) buildvector/buildvalue sequences with MaxVF==true.
If MaxVFOnly for buildvector/buildvalue vectorization is set to true and the
total number of elements to vectorize is <= 2, it is better to try to
vectorize reductions first, which may produce a larger tree (reductions
have a limit of at least 4 elements to vectorize). Vectorization of the
smaller buildvector/buildvalue sequence will be attempted later,
with MaxVFOnly set to false.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/98957
2024-07-16 12:45:58 -04:00
Alexey Bataev
8ff233f4f1 [SLP]Correctly detect minnum/maxnum patterns for select/cmp operations on floats.
The patch enables detection of minnum/maxnum patterns for floating-point
instructions, represented as select/cmp. Also, enables better cost
estimation for integer min/max patterns since the compiler starts
to estimate the scalars separately.
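
For illustration only, the select/cmp shape being matched looks like the following in plain C++ (the function names are hypothetical). For ordinary inputs it agrees with `std::fmin`/`std::fmax`; NaN and signed-zero handling can differ between the two forms, and this sketch does not model the conditions under which the vectorizer treats them as equivalent.

```cpp
#include <cmath>

// select(cmp lt A, B), A, B: the scalar shape of a min pattern.
float selectMin(float A, float B) { return A < B ? A : B; }

// select(cmp gt A, B), A, B: the scalar shape of a max pattern.
float selectMax(float A, float B) { return A > B ? A : B; }
```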

Reviewers: nikic, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/98570
2024-07-16 09:42:08 -07:00