1988 Commits

Author SHA1 Message Date
Alexey Bataev
76422385c3
[SLP]Support reordered buildvector nodes for better clustering
The patch adds reordering of buildvector nodes to better cluster
compatible operations for later vectorization. It includes a basic cost
estimate and reverts the transformation if it is not profitable.

AVX512, -O3+LTO
Metric: size..text

Program                                                                          size..text
                                                                                       results     results0    diff
                        test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test    74565.00    75701.00  1.5%
                test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    75773.00    76397.00  0.8%
               test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2014462.00  2024494.00  0.5%
                         test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   395219.00   396979.00  0.4%
                         test-suite :: MultiSource/Applications/JM/lencod/lencod.test   857795.00   859667.00  0.2%
                    test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   800472.00   802440.00  0.2%
                       test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   590699.00   591403.00  0.1%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   203006.00   203102.00  0.0%
            test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    42408.00    42424.00  0.0%
            test-suite ::  External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12451575.00  12451927.00  0.0%
            test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396480.00  1396448.00 -0.0%
             test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396480.00  1396448.00 -0.0%
                        test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1047708.00  1047580.00 -0.0%
        test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   111344.00   111328.00 -0.0%
                test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087660.00  1087500.00 -0.0%
       test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   280664.00   280616.00 -0.0%
                          test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   502646.00   502006.00 -0.1%
                      test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1033135.00  1031567.00 -0.2%
        test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2070917.00  2065845.00 -0.2%
       test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2070917.00  2065845.00 -0.2%
                        test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    33893.00    33797.00 -0.3%
          test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39677.00    39549.00 -0.3%
                 test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39674.00    39546.00 -0.3%
test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test    11560.00    11512.00 -0.4%
                 test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   653867.00   649275.00 -0.7%
                  test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   653867.00   649275.00 -0.7%

CINT2006/401.bzip2 - extra code vectorized
CINT2017rate/541.leela_r
CINT2017speed/641.leela_s - function
_ZN9FastBoard25get_pattern3_augment_specEiib not inlined anymore, better
vectorization
CFP2017rate/510.parest_r - better vectorization
JM/ldecod - better vectorization
JM/lencod - same
CINT2006/464.h264ref - extra code vectorized
CFP2006/447.dealII - extra vector code
MiBench/consumer-lame - vectorized 2 loops previously scalar
DOE-ProxyApps-C/miniGMG - small changes
Benchmarks/7zip - extra code vectorized, better vectorization
CFP2017rate/526.blender_r - extra vectorization
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorization
MiBench/consumer-jpeg - extra vectorization
CINT2006/400.perlbench - extra vectorization
Prolangs-C/TimberWolfMC - small variations
Applications/sqlite3 - extra function vectorized and inlined
Benchmarks/tramp3d-v4 - extra code vectorized
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - extra code vectorized, function digcpy gets
vectorized and inlined
CINT2006/473.astar - extra code vectorized
MiBench/telecomm-gsm - extra code vectorized, better vector code
mediabench/gsm - same
MiBench/security-blowfish - extra code vectorized
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - sub4x4_dct function vectorized and gets
inlined

RISC-V, SiFive-p670, -O3+LTO

CFP2017rate/510.parest_r - extra vectorization
CFP2017rate/526.blender_r - extra vectorization
MiBench/consumer-lame - extra vectorized code

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/114284
2024-11-06 10:51:15 -05:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Alexey Bataev
c1cec8c0dc [SLP][NFC]Add a test with missed splat ordering for loads, NFC 2024-11-05 14:08:17 -08:00
Alexey Bataev
0c18def2c1 [SLP]Allow interleaving check only if it is less than number of elements
Need to check that the interleaving factor is less than the total number
of elements in the loads slice to handle it correctly and avoid a compiler crash.

Fixes report https://github.com/llvm/llvm-project/pull/112361#issuecomment-2457227670
2024-11-05 07:06:15 -08:00
Alexey Bataev
899336735a [SLP]Be more pessimistic about poisonous reductions
Consider all possible reduction ops as being non-poisoning boolean
logical operations, which require freeze to be fully correct.

https://alive2.llvm.org/ce/z/TKWDMP

Fixes #114738
2024-11-04 06:13:52 -08:00
Alexey Bataev
a15bf88d53 [SLP][NFC]Add a test with missing freeze instruction before reduction, NFC 2024-11-04 04:38:09 -08:00
Simon Pilgrim
ac1869aa70
[CostModel][X86] Add initial costs for non-lane-crossing one/two input shuffles (#114680)
Most of the x86 shuffle instructions operate within each 128-bit subvector lane, but our shuffle costs struggle to handle this and have to fall back to worst-case shuffles that reference elements from any lane.

This patch detects shuffle masks that we know are "inlane", which enables us to assume a cheaper shuffle cost.
2024-11-04 10:19:02 +00:00
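The "inlane" property described in the commit above can be illustrated with a small standalone check. This is only a sketch under stated assumptions, not the code from the patch: the helper name `isInLaneMask` is hypothetical, lanes are assumed to be 128 bits wide, and the mask follows the shufflevector convention (indices below the element count select from the first input, the next range selects from the second input, negative values are undefined lanes).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not the LLVM implementation): returns true if every
// defined mask element reads from the same 128-bit lane that it writes to.
// Indices in [0, NumElts) select from the first input, indices in
// [NumElts, 2 * NumElts) select from the second input, and negative
// indices mark undefined lanes that impose no constraint.
bool isInLaneMask(const std::vector<int> &Mask, unsigned EltBits) {
  assert(EltBits != 0 && 128 % EltBits == 0 && "element must divide a lane");
  const int NumElts = static_cast<int>(Mask.size());
  const int EltsPerLane = static_cast<int>(128 / EltBits);
  for (int I = 0; I < NumElts; ++I) {
    const int M = Mask[I];
    if (M < 0)
      continue; // undefined element, any source lane is acceptable
    assert(M < 2 * NumElts && "mask index out of range");
    const int SrcLane = (M % NumElts) / EltsPerLane; // lane within its input
    const int DstLane = I / EltsPerLane;
    if (SrcLane != DstLane)
      return false; // element crosses a 128-bit lane boundary
  }
  return true;
}
```

For an 8 x i32 vector (two 128-bit lanes), a mask that swaps adjacent elements stays in lane, while a mask that swaps the two halves crosses lanes and would still need the more pessimistic cost.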
Han-Kuan Chen
a795a18bba
[SLP][REVEC] VF should be scaled when ScalarTy is FixedVectorType. (#114551) 2024-11-02 03:03:52 +08:00
Han-Kuan Chen
e4aeeba84c
[SLP][REVEC] When ScalarTy is FixedVectorType, the insertion index should consider the number of elements of ScalarTy. (#114526) 2024-11-01 21:17:57 +08:00
Alexey Bataev
e05def081e
[SLP]Do not vectorize code in EH and non-returning blocks
The code in EH and non-returning blocks can be skipped by the
vectorizer, since it does not add to the performance and just consumes
compile/link time.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112221
2024-10-31 13:50:02 -04:00
Alexey Bataev
19a34dded7
[SLP]Do not account external uses in EH block and in non-returning blocks
No need to account for the cost of the external uses in EH and
non-returning basic blocks.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112045
2024-10-31 13:23:43 -04:00
Alexey Bataev
e7080fd735 [SLP]Extra check if the instruction marked for removal, must be replaced in reduction ops
If the instruction is vectorized and it is part of the reduced values
gather/buildvector node, it should be replaced in the reduction operation
instructions before removal, to avoid a compiler crash.

Fixes #114371
2024-10-31 09:59:35 -07:00
Matthias Braun
255e441613
X86: Do not return invalid cost for fp16 conversion (#114128)
Returning invalid instruction costs when converting from/to fp16 in
`X86TTIImpl::getCastInstrCost` when there is no hardware support
available was triggering asserts. This changes the code to return a
large (arbitrary) number to model the fact that libcalls are used to
implement the conversion.

This also simplifies the code by only reporting costs for the scalar
fp16 conversion; vectorized costs being left to the fallback assuming
scalarization.

This is a follow-up to assertion issues reported for the changes in
#113195
2024-10-29 17:16:17 -07:00
Sushant Gokhale
c9f01f699c
[SLP][AArch64][NFC] Add more tests for SLP vectorization of div (#113876)
Currently, we don't have many tests that show the SLP outcome for integer
divisions. This patch adds such tests.

In certain scenarios, for Neon, vectorization is profitable. An attempt
will be made in the future to improve the cost model for these cases.
2024-10-28 20:37:41 +05:30
Alexey Bataev
7152bf3bc8 [SLP]Do not create new vector node if scalars fully overlap with the existing one
If the list of scalars is vectorized as part of the same vector node,
there is no need to generate the vector node again; it will be handled
as part of overlapping matching.

Fixes #113810
2024-10-28 06:59:41 -07:00
Matthias Braun
054c23d78f
X86: Improve cost model of fp16 conversion (#113195)
Improve cost-modeling for x86 __fp16 conversions so the SLPVectorizer
transforms the patterns:

- Override `X86TTIImpl::getStoreMinimumVF` to report a minimum VF of 4 (an SSE
  register can hold 4xfloat converted/stored to 4xf16). This is necessary as
  fp16 stores are neither modeled as trunc-stores nor can we mark direct Xxfp16
  stores as legal (as we generally expand fp16 operations).
- Add missing cost entries to `X86TTIImpl::getCastInstrCost` for
  conversion from/to fp16. Note that conversion from f64 to f16 is not
  supported by an X86 instruction.
2024-10-25 16:22:24 -07:00
Jonas Paulsson
aba39c3974
[System] Precommit of test for #112491 (#113704) 2024-10-25 17:40:00 +02:00
Alexey Bataev
e914421d7f [SLP]Do correct signedness analysis for externally used scalars
If a scalar that is used externally is in the root node, it may have
incorrect signedness info because of a conflict with the demanded bits
analysis. Need to perform exact signedness analysis and compute it
rather than rely on the precomputed value, which might be incorrect for
alternate zext/sext nodes.

Fixes #113520
2024-10-24 08:59:24 -07:00
Alexey Bataev
d2e7ee77d3 [SLP]Do not check for clustered loads only
Since SLP supports "clusterization" of non-load instructions, the
loads-only restriction for reduced values should be removed to avoid a
compiler crash.

Fixes #113516
2024-10-24 08:16:42 -07:00
Alexey Bataev
cb5046da26 [SLP]Do not ignore undefs when trying to replace with "poisonous" shuffles
Need to consider undefs correctly, when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values, instead complex analysis should be
implemented to see if it is safe to do it.

Fixes #113425
2024-10-24 07:47:23 -07:00
Alexey Bataev
b65b2b4ab6 [SLP]Expand vector to the whole register size in extracts adjustment
Need to expand the number of elements to the whole register to correctly
process estimation and avoid compiler crash.

Fixes #113462
2024-10-23 12:04:40 -07:00
Alexey Bataev
a3508e0246 [SLP]Small buildvector only graph should contain scalars from same block
If the graph is small and has a single buildvector node, all scalar
instructions must be from the same basic block to prevent a compiler
crash.

Fixes #113451
2024-10-23 10:46:38 -07:00
Alexey Bataev
4b1b51ac52 [SLP]Initial non-power-of-2 support (but still whole register) for reductions
Enables initial non-power-of-2 support (but still requires number of
elements, forming whole registers) for reductions.
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112361
2024-10-21 12:25:39 -07:00
Alexey Bataev
9e03920cbf [SLP]Ignore root gather node, when searching for reuses
The root gather/buildvector node should be ignored when the SLP vectorizer
tries to find matching gather nodes that were vectorized earlier. This node
is definitely the last one in the pipeline and does not have users.
Matching against it may cause a compiler crash.

Fixes #113143
2024-10-21 09:16:16 -07:00
David Green
17ac10c28f Revert "[SLP]Initial non-power-of-2 support (but still whole register) for reductions"
This reverts commit 7f2e937469a8cec3fe977bf41ad2dfb9b4ce648a as it causes
regressions in the tests it modifies, and undoes what was added in #100653
(which itself was a fix for a previous regression).
2024-10-21 13:37:44 +01:00
Alexey Bataev
709abacdc3 [SLP]Check that operand of abs does not overflow before making it part of minbitwidth transformation
Need to check that the operand of the abs intrinsic can be safely
truncated before making it part of the minbitwidth transformation.

Fixes #112577
2024-10-18 13:56:19 -07:00
Alexey Bataev
825f9cb1b3 [SLP][NFC]Add a test with the incorrect casting of the abs argument, NFC 2024-10-18 13:44:57 -07:00
Alexey Bataev
e56e9dd8ad [SLP]Fix minbitwidth emission and analysis for freeze instruction
Need to add minbw emission and analysis for freeze instruction to fix
incorrect signedness propagation.

Fixes #112460
2024-10-18 13:36:37 -07:00
Alexey Bataev
4c4b93dcb9 [SLP][NFC]Add a test with the incorrect casting of freeze instruction operands, NFC 2024-10-18 13:29:18 -07:00
Alexey Bataev
7f2e937469 [SLP]Initial non-power-of-2 support (but still whole register) for reductions
Enables initial non-power-of-2 support (but still requires number of
elements, forming whole registers) for reductions.
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112361
2024-10-18 12:50:11 -07:00
Han-Kuan Chen
12bcea3292
[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#111459)
reference: https://github.com/llvm/llvm-project/pull/110457
2024-10-18 20:16:56 +07:00
Alexey Bataev
685bec722f Revert "[SLP]Initial non-power-of-2 support (but still whole register) for reductions"
This reverts commit 8287fa8e596d8fc8655c8df3bc99e068ad9f7d4b to
investigate and fix compile time regressions reported by https://llvm-compile-time-tracker.com/compare.php?from=ec78f0da0e9b1b8e2b2323e434ea742e272dd913&to=8287fa8e596d8fc8655c8df3bc99e068ad9f7d4b&stat=instructions:u
2024-10-15 12:59:44 -07:00
Alexey Bataev
8287fa8e59
[SLP]Initial non-power-of-2 support (but still whole register) for reductions
Enables initial non-power-of-2 support (but still requires number of
elements, forming whole registers) for reductions.
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/112361
2024-10-15 12:10:48 -04:00
Alexey Bataev
ab902ee54a [SLP][NFC]Replace more unreachable terminators by rets, NFC 2024-10-14 07:50:07 -07:00
Alexey Bataev
91a0fecf19 [SLP][NFC]Replace unreachable instructions by rets, NFC. 2024-10-14 07:00:56 -07:00
Alexey Bataev
f9bc00e4bb
[SLP]Initial support for interleaved loads
Adds initial support for interleaved loads, which allows
emission of segmented loads for RISCV RVV.

Vectorizes extra code for RISCV
CFP2006/447.dealII, CFP2006/453.povray,
CFP2017rate/510.parest_r, CFP2017rate/511.povray_r,
CFP2017rate/526.blender_r, CFP2017rate/538.imagick_r, CINT2006/403.gcc,
CINT2006/473.astar, CINT2017rate/502.gcc_r, CINT2017rate/525.x264_r

Reviewers: RKSimon, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/112042
2024-10-14 09:12:33 -04:00
Alexey Bataev
4b5018d231 [SLP]Track repeated reduced value as it might be vectorized
Need to track changes with the repeated reduced value, since it might be
vectorized in the next attempt for reduction vectorization, to correctly
generate the code and avoid compiler crash.

Fixes #111887
2024-10-10 13:41:56 -07:00
Alexey Bataev
f020bf1526
[SLP]Initial support for non-power-of-2 (but whole reg) vectorization for stores
Allows non-power-of-2 vectorization for stores, but still requires that
the number of vectorized elements forms full vector registers.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/111194
2024-10-09 15:22:44 -04:00
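The "non-power-of-2 but whole register" condition from the commit above can be illustrated with simple arithmetic. The sketch below is an assumption-laden simplification: the helper names are hypothetical, and the register width is passed in as a plain number, whereas the real vectorizer obtains such information from the target.

```cpp
#include <cassert>

// Hypothetical predicate: true if NumElts elements of EltBits bits each
// occupy a whole number of RegBits-wide vector registers.
bool fillsWholeRegisters(unsigned NumElts, unsigned EltBits,
                         unsigned RegBits) {
  assert(EltBits != 0 && RegBits != 0 && "widths must be non-zero");
  const unsigned long long TotalBits = 1ULL * NumElts * EltBits;
  return TotalBits != 0 && TotalBits % RegBits == 0;
}

// Classic bit trick: a power of two has exactly one bit set.
bool isPowerOf2(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }
```

With 128-bit registers, 12 x i32 is not a power of two but fills exactly three registers, so it would qualify under this simplified rule; 6 x i32 fills one and a half registers and would not.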
Alexey Bataev
9f3c55954e
[SLP]Fix loads sorting for loads from different basic blocks
The patch fixes the lookup for loads from different basic blocks. Originally,
the code checked if the main key (combined with the parent basic block) was
created, but did not include the key in LoadsMap. When the code looked for
the load pointer in LoadsMap, it skipped the check for the parent basic block
and could mix loads from different basic blocks (but with the same underlying
pointer). Currently, it does not lead to any issues, since later the code
compares parent basic blocks and sorts loads properly. But it increases
compile time.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/111521
2024-10-08 16:44:16 -04:00
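The idea behind the fix above, keeping the parent basic block in the lookup key so that loads from different blocks never share a bucket, can be sketched as follows. All names here are hypothetical stand-ins (plain integers replace basic blocks and pointers); this is not the structure of the real LoadsMap.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a load: an id, its parent block, and the
// underlying pointer it reads from (all modelled as plain integers).
struct Load {
  int Id;
  int Block;
  int BasePtr;
};

// The key includes the parent block, so loads that share an underlying
// pointer but live in different blocks land in different groups.
using Key = std::pair<int, int>; // (parent block, underlying pointer)

std::map<Key, std::vector<int>> groupLoads(const std::vector<Load> &Loads) {
  std::map<Key, std::vector<int>> Groups;
  for (const Load &L : Loads)
    Groups[Key{L.Block, L.BasePtr}].push_back(L.Id);
  return Groups;
}
```

Grouping by the combined key up front avoids mixing candidates that a later pass would only have to separate again, which is the compile-time cost the commit message refers to.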
Alexey Bataev
a65a5feb1a
[SLP]Improve masked loads vectorization, attempting gathered loads
If the vector of loads can be vectorized as a masked gather and there are
several other masked gather nodes, the compiler can check whether it is
possible to gather such nodes into a big consecutive/strided loads node,
which provides better performance.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/110151
2024-10-08 16:43:10 -04:00
Philip Reames
f11568bcb0 Revert "[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457)"
This reverts commit 554eaec63908ed20c35c8cc85304a3d44a63c634.  Change was not approved when landed.
2024-10-07 11:31:57 -07:00
Luke Lau
20864d2cf6
[ValueTypes][RISCV] Add v1bf16 type (#111112)
When trying to add RISC-V fadd reduction cost model tests for bf16, I
noticed a crash when the vector was of <1 x bfloat>.

It turns out that this was being scalarized because unlike f16/f32/f64,
there's no v1bf16 value type, and the existing cost model code assumed
that the legalized type would always be a vector.

This adds v1bf16 to bring bf16 in line with the other fp types.

It also adds some more RISC-V bf16 reduction tests which previously
crashed, including tests to ensure that SLP won't emit fadd/fmul
reductions for bf16 or f16 w/ zvfhmin after #111000.
2024-10-06 22:20:51 +08:00
Han-Kuan Chen
554eaec639
[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457) 2024-10-05 14:58:44 +08:00
Alexey Bataev
f74879cf0c
[SLP]Make PHICompare comparator follow weak strict ordering requirement
Reviewers: efriedma-quic

Reviewed By: efriedma-quic

Pull Request: https://github.com/llvm/llvm-project/pull/110529
2024-10-04 14:23:48 -04:00
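The commit above carries no description, so as background: standard sorting routines require a comparator that is a strict weak ordering (irreflexive, asymmetric, transitive, with transitive equivalence). A minimal sketch of a comparator that meets the requirement is shown below; the `Candidate` record and its fields are hypothetical and unrelated to the actual PHICompare logic.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical record with two sort keys.
struct Candidate {
  int TypeRank;
  int OpcodeRank;
};

// Lexicographic comparison through std::tie gives a strict weak ordering:
// it never reports an element as less than itself, and it cannot report
// both A < B and B < A. A comparator written with "<=" would break the
// first property and make std::sort/std::stable_sort undefined behaviour.
bool candidateLess(const Candidate &A, const Candidate &B) {
  return std::tie(A.TypeRank, A.OpcodeRank) <
         std::tie(B.TypeRank, B.OpcodeRank);
}

// Sorts candidates in place while keeping equal elements in input order.
void sortCandidates(std::vector<Candidate> &Candidates) {
  std::stable_sort(Candidates.begin(), Candidates.end(), candidateLess);
}
```

Comparing tuples of keys, instead of hand-writing nested conditions, is one common way to keep multi-key comparators consistent.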
Alexey Bataev
c0dfef878e [SLP][NFC]Add a test with potential non-power-of2 (but whole reg) vectorized stores 2024-10-04 11:22:55 -07:00
Alexey Bataev
d991e05452 [SLP]Fix compiler crash on vectorizing gathered loads with different types
Need to check not only parents, but also types for compatible loads,
when trying to build the vectorizable sequences.

Fixes crash reported in https://github.com/llvm/llvm-project/pull/107461#issuecomment-2392980214
2024-10-04 08:36:57 -07:00
Elvina Yakubova
15ee17c3ce
[SLP] Move more X86 tests to common directory (#111134)
Some of the tests from the X86 directory can be generalized to improve
coverage for other architectures (cont.)
2024-10-04 13:18:56 +01:00
Alexey Bataev
133c1224de [SLP]Fix a crash on accessing element with index -1 for reused mask with PoisonMaskElem
Need to check if the index from the ReuseShuffleIndices mask is not
equal to PoisonMaskElem before trying to access the element by index.
2024-10-03 08:24:05 -07:00
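The guard described in the commit above can be sketched in isolation. The helper `applyReuseMask` and the use of plain integers for scalars are assumptions made for illustration; only the sentinel value -1 for poison mask elements comes from the commit title.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel for a poison (undefined) mask lane; the commit title refers to
// this value as index -1.
constexpr int PoisonMaskElem = -1;

// Hypothetical helper: builds the reused vector by indexing Scalars with
// ReuseMask. Poison lanes are filled with PoisonValue instead of being
// used as an index, which would otherwise read element -1.
std::vector<int> applyReuseMask(const std::vector<int> &Scalars,
                                const std::vector<int> &ReuseMask,
                                int PoisonValue) {
  std::vector<int> Result;
  Result.reserve(ReuseMask.size());
  for (int Idx : ReuseMask) {
    if (Idx == PoisonMaskElem) {
      Result.push_back(PoisonValue); // skip the lookup for poison lanes
      continue;
    }
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < Scalars.size() &&
           "mask index out of range");
    Result.push_back(Scalars[static_cast<std::size_t>(Idx)]);
  }
  return Result;
}
```

Checking for the sentinel before indexing keeps the out-of-range access from ever being formed, rather than relying on the container to catch it.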
Alexey Bataev
c1b911c579 [SLP]Do correct signedness analysis for clustered nodes
Should get the signedness info from the original scalar instructions, if
possible, to correctly generate sext/zext instructions. Also, the
clustered node must be assigned a gather node user info to correctly
estimate its bitwidth/sign.
2024-10-02 12:56:49 -07:00
Alexey Bataev
848cb21ddc [SLP][NFC]Add a test with the incorrect signedness info for subvector 2024-10-02 12:06:08 -07:00