182 Commits

Author SHA1 Message Date
Alexey Bataev
1160994602 [SLP]Fix a crash for very long GEP chains
Need to check if the GEP bases are equal and return false early. Also,
need to return false if the lookup is too deep, considering bases equal
too. Fixes a crash in the assertion.
2025-01-08 06:47:41 -08:00
Alexey Bataev
07d284d4eb
[SLP]Add cost estimation for gather node reshuffling
Adds cost estimation for the variants of the permutations of the scalar
values, used in gather nodes. Currently, SLP just unconditionally emits
shuffles for the reused buildvectors, but in some cases better to leave
them as buildvectors rather than shuffles, if the cost of such
buildvectors is better.

X86, AVX512, -O3+LTO
Metric: size..text

Program                                                                        size..text
                                                                               results     results0    diff
                 test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   912998.00   913238.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   203070.00   203102.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396320.00  1396448.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396320.00  1396448.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   309790.00   309678.00 -0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12477607.00 12470807.00 -0.1%

CINT2006/445.gobmk - extra code vectorized
MiBench/consumer-lame - small variations
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorized code
Benchmarks/Bullet - extra code vectorized
CFP2017rate/526.blender_r - extra vector code

RISC-V, sifive-p670, -O3+LTO
CFP2006/433.milc - regressions, should be fixed by https://github.com/llvm/llvm-project/pull/115173
CFP2006/453.povray - extra vectorized code
CFP2017rate/508.namd_r - better vector code
CFP2017rate/510.parest_r - extra vectorized code
SPEC/CFP2017rate - extra/better vector code
CFP2017rate/526.blender_r - extra vectorized code
CFP2017rate/538.imagick_r - extra vectorized code
CINT2006/403.gcc - extra vectorized code
CINT2006/445.gobmk - extra vectorized code
CINT2006/464.h264ref - extra vectorized code
CINT2006/483.xalancbmk - small variations
CINT2017rate/525.x264_r - better vectorization

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/115201
2024-12-24 15:35:29 -05:00
Han-Kuan Chen
3133acf1fb Revert "[SLP] Make getSameOpcode support different instructions if they have same semantics. (#112181)"
This reverts commit 82204154b7bd1f8c487c94c7ef00399d776b29f0.
2024-12-12 20:38:31 -08:00
Han-Kuan Chen
82204154b7
[SLP] Make getSameOpcode support different instructions if they have same semantics. (#112181) 2024-12-13 12:06:10 +08:00
Han-Kuan Chen
2546ae4ed0
[SLP][REVEC] Fix the number of elements in the mask of a ShuffleVectorInst is not a power of 2. (#119689)
The following shufflevector should not be vectorized when
slp-vectorize-non-power-of-2 is enabled.

shufflevector <8 x float> %1, <8 x float> poison, <3 x i32> <i32 0, i32
1, i32 2>
shufflevector <8 x float> %1, <8 x float> poison, <3 x i32> <i32 4, i32
5, i32 6>
2024-12-13 02:22:41 +08:00
Alexey Bataev
7523086a05
[SLP]Use getExtendedReduction cost and fix reduction cost calculations
Patch uses getExtendedReduction for reductions of ext-based nodes + adds
cost estimation for ctpop-kind reductions into basic implementation and
RISCV-V specific vcpop cost estimation.

Reviewers: RKSimon, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/117350
2024-11-22 16:12:53 -05:00
Alexey Bataev
9c9e030fba [SLP][NFC]Add a test with the RISCV ctpop-based reduction 2024-11-22 09:25:00 -08:00
Alexey Bataev
58c8d73172 [SLP][NFC]Add a test with multi reductions, NFC 2024-11-21 09:48:19 -08:00
Alexey Bataev
f6e1d64458
[SLP]Enable interleaved stores support
Enables interaleaved stores, results in better estimation for segmented
stores for RISC-V

Reviewers: preames, topperc, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/115354
2024-11-15 11:01:57 -05:00
Alexey Bataev
af3295bd3d
[SLP]Enable splat ordering for loads
Enables splat support for loads with lanes> 2 or number of operands> 2.

Allows better detect splats of loads and reduces number of shuffles in
some cases.

X86, AVX512, -O3+LTO

Metric: size..text
                                                                          results     results0    diff
               test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   154867.00   156723.00  1.2%
 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12467735.00 12468023.00  0.0%

Better vectorization quality

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/115173
2024-11-15 10:29:43 -05:00
Han-Kuan Chen
3cdd86bb47
[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#115417) 2024-11-10 13:53:15 +08:00
Alexey Bataev
62db1c8a07 [SLP]Better decision making on whether to try stores packs for vectorization
Since the stores are sorted by distance, comparing the indices in the
original array and early exit, if the index is less than the index of
the last store, not always the best strategy. Better to remove such
stores explicitly to try better to check for the vectorization
opportunity.

Fixes #115008
2024-11-07 14:23:15 -08:00
Alexey Bataev
dec3839979 [SLP][NFC]Add a test with the missed vectorization opportunity for stores with same address 2024-11-07 13:53:23 -08:00
Kazu Hirata
22b4b1ab10 Revert "[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#114946)"
This reverts commit f58757b8dc167809b69ec00f9b5ab59281df0902.

Failing buildbots:
https://lab.llvm.org/buildbot/#/builders/174/builds/8058
https://lab.llvm.org/buildbot/#/builders/127/builds/1357
2024-11-07 10:43:11 -08:00
Han-Kuan Chen
f58757b8dc
[SLP][REVEC] Make GetMinMaxCost support FixedVectorType when REVEC is enabled. (#114946) 2024-11-08 00:52:59 +08:00
Alexey Bataev
79fd615759 [SLP][NFC]Add a test with the segmented loads, NFC 2024-11-07 07:08:24 -08:00
Luke Lau
343a810725
[RISCV] Allow f16/bf16 with zvfhmin/zvfbfmin as legal strided access (#115264)
This is also split off from the zvfhmin/zvfbfmin
isLegalElementTypeForRVV work.

Enabling this will cause SLP and RISCVGatherScatterLowering to emit
@llvm.experimental.vp.strided.{load,store} intrinsics, and codegen
support for this was added in #109387 and #114750.
2024-11-07 14:40:15 +08:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Alexey Bataev
c1cec8c0dc [SLP][NFC]Add a test with missed splat ordering for loads, NFC 2024-11-05 14:08:17 -08:00
Alexey Bataev
0c18def2c1 [SLP]Allow interleaving check only if it is less than number of elements
Need to check if the interleaving factor is less than total number of
elements in loads slice to handle it correctly and avoid compiler crash.

Fixes report https://github.com/llvm/llvm-project/pull/112361#issuecomment-2457227670
2024-11-05 07:06:15 -08:00
Han-Kuan Chen
a795a18bba
[SLP][REVEC] VF should be scaled when ScalarTy is FixedVectorType. (#114551) 2024-11-02 03:03:52 +08:00
Alexey Bataev
cb5046da26 [SLP]Do not ignore undefs when trying to replace with "poisonous" shuffles
Need to consider undefs correctly, when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values, instead complex analysis should be
implemented to see if it is safe to do it.

Fixes #113425
2024-10-24 07:47:23 -07:00
Han-Kuan Chen
12bcea3292
[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#111459)
reference: https://github.com/llvm/llvm-project/pull/110457
2024-10-18 20:16:56 +07:00
Alexey Bataev
f9bc00e4bb
[SLP]Initial support for interleaved loads
Adds initial support for interleaved loads, which allows
emission of segmented loads for RISCV RVV.

Vectorizes extra code for RISCV
CFP2006/447.dealII, CFP2006/453.povray,
CFP2017rate/510.parest_r, CFP2017rate/511.povray_r,
CFP2017rate/526.blender_r, CFP2017rate/538.imagick_r, CINT2006/403.gcc,
CINT2006/473.astar, CINT2017rate/502.gcc_r, CINT2017rate/525.x264_r

Reviewers: RKSimon, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/112042
2024-10-14 09:12:33 -04:00
Alexey Bataev
4b5018d231 [SLP]Track repeated reduced value as it might be vectorized
Need to track changes with the repeated reduced value, since it might be
vectorized in the next attempt for reduction vectorization, to correctly
generate the code and avoid compiler crash.

Fixes #111887
2024-10-10 13:41:56 -07:00
Alexey Bataev
a65a5feb1a
[SLP]Improve masked loads vectorization, attempting gathered loads
If the vector of loads can be vectorized as masked gather and there are
several other masked gather nodes, compiler can try to attempt to check,
if it possible to gather such nodes into big consecutive/strided loads
  node, which provide better performance.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/110151
2024-10-08 16:43:10 -04:00
Philip Reames
f11568bcb0 Revert "[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457)"
This reverts commit 554eaec63908ed20c35c8cc85304a3d44a63c634.  Change was not approved when landed.
2024-10-07 11:31:57 -07:00
Luke Lau
20864d2cf6
[ValueTypes][RISCV] Add v1bf16 type (#111112)
When trying to add RISC-V fadd reduction cost model tests for bf16, I
noticed a crash when the vector was of <1 x bfloat>.

It turns out that this was being scalarized because unlike f16/f32/f64,
there's no v1bf16 value type, and the existing cost model code assumed
that the legalized type would always be a vector.

This adds v1bf16 to bring bf16 in line with the other fp types.

It also adds some more RISC-V bf16 reduction tests which previously
crashed, including tests to ensure that SLP won't emit fadd/fmul
reductions for bf16 or f16 w/ zvfhmin after #111000.
2024-10-06 22:20:51 +08:00
Han-Kuan Chen
554eaec639
[RISCV][TTI] Recognize CONCAT_VECTORS if a shufflevector mask is multiple insert subvector. (#110457) 2024-10-05 14:58:44 +08:00
Alexey Bataev
b16e694948
[SLP]Try to keep operand of external casts as scalars, if profitable
If the cost of original scalar instruction + cast is better than the
extractelement from the vector cast instruction, better to keep original
scalar instructions, where possible

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/110537
2024-10-01 13:35:42 -04:00
Han-Kuan Chen
061762933b
[SLP][REVEC] Fix cost model for getBuildVectorCost with FixedVectorType ScalarTy. (#110073)
BoUpSLP::gather always use CreateInsertVector for FixedVectorType
ScalarTy.
2024-09-30 21:51:12 +08:00
Philip Reames
556ec4a726
[SLP] Pass operand info to getCmpSelInstrInfo (#109998)
Depending on the constant, selects with constant arms can have highly
varying cost. This adjusts SLP to use the new API introduced in
d2885743.

Fixes https://github.com/llvm/llvm-project/issues/109466.
2024-09-25 08:17:55 -07:00
Philip Reames
bcbdf7ad6b [RISCV][TTI/SLP] Add test coverage for select of constants costing
Provides coverage for an upcoming change which accounts for the cost
of materializing the vector constants in the vector select.
2024-09-24 08:15:40 -07:00
Han-Kuan Chen
00629752e6
[SLP][REVEC] Fix cost model for getGatherCost with FixedVectorType ScalarTy. (#109369) 2024-09-24 08:08:35 +08:00
Philip Reames
7f6bbb3c4f
[RISCV][TTI] Reduce cost of a build_vector pattern (#108419)
This change is actually two related changes, but they're very hard to
meaningfully separate as the second balances the first, and yet doesn't
do much good on it's own.

First, we can reduce the cost of a build_vector pattern. Our current
costing for this defers to generic insertelement costing which isn't
unreasonable, but also isn't correct. While inserting N elements
requires N-1 slides and N vmv.s.x, doing the full build_vector only
requires N vslide1down. (Note there are other cases that our build
vector lowering can do more cheaply, this is simply the easiest upper
bound which appears to be "good enough" for SLP costing purposes.)

Second, we need to tell SLP that calls don't preserve vector registers.
Without this, SLP will vectorize scalar code which performs e.g. 4 x
float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the
costing works out that this is in fact the optimal choice - except that
we don't actually have a <2 x float> @exp, and unroll during DAG. This
would be fine (or at least cost neutral) except that the libcall for the
scalar @exp blows all vector registers. So the net effect is we added a
bunch of spills that SLP had no idea about. Thankfully, AArch64 has a
similiar problem, and has taught SLP how to reason about spill cost once
the right TTI hook is implemented.

Now, for some implications...

The SLP solution for spill costing has some inaccuracies. In particular,
it basically just guesses whether a intrinsic will be lowered to a call
or not, and can be wrong in both directions. It also has no mechanism to
differentiate on calling convention.

This has the effect of making partial vectorization (i.e. starting in
scalar) more profitable. In practice, the major effect of this is to
make it more like SLP will vectorize part of a tree in an intersecting
forrest, and then vectorize the remaining tree once those uses have been
removed.

This has the effect of biasing us slightly away from strided, or indexed
loads during vectorization - because the scalar cost is more accurately
modeled, and these instructions look relevatively less profitable.
2024-09-20 08:34:36 -07:00
Philip Reames
fa8b737a81 [SLP][RISCV] Add test for 3 element build vector feeding reduce
Our costs for build vectors are currently a bit off which inhibits
vectorization.  Fix forthcoming.
2024-09-11 15:56:53 -07:00
Philip Reames
7910812414 [SLP] Regen a test to pick up naming changes 2024-09-11 15:46:57 -07:00
Philip Reames
247d3ea843 [SLP] Expand non-power-of-two bailout in TryToFindDuplicates
This fixes a crash noticed when doing a downstream merge.  The
test case has been reduced, and is included in this commit.

The existing bailout for non-power-of-two vectors in TryToFindDuplicates
did not consider the case where the list being vectorized had no
root node.  This allowed reshuffled scalars to slip through to code
which does not yet expect to handle it.

This was an existing bug (likely introduced by my ed03070e), but
made easier to hit by 63e8a1b1
2024-09-05 13:51:11 -07:00
Philip Reames
63e8a1b16f
[SLP] Enable reordering for non-power-of-two vectors (#106638)
This change tries to enable vector reordering during vectorization for
non-power-of-two vectors. Specifically, my goal is to be able to
vectorize reductions whose operands appear in other than identity order.
(i.e. a[1] + a[0] + a[2]). Our standard pass pipeline, Reassociation
effectively canonicalizes towards this form. So for reduction
vectorization to be wildly applicable, we need this feature.

This change enables the use of a non-empty ReorderIndices structure -
which is effectively required for out of order loads or gathers - while
leaving the ReuseShuffleIndices mechanism unused and disabled. If I've
understood the code structure, the former is used when describing
implicit shuffles required by the vectorization strategy (i.e. loading
elements 0,1,3,2 in the order 0,1,2,3 and then shuffling later), while
the later is used when trying to optimize explode/buildvectors (called
gathers in this code).

I audited all the code enabled by this change, but can't claim to
deeply understand most of it. I added a couple of bailouts in places
which appeared to be difficult to audit and optional optimizations. I've
tried to do so in the least risky way I can, but am not completely
confident in this change. Careful review appreciated.
2024-09-05 07:52:27 -07:00
Alexey Bataev
a724f9a7e5 [SLP][NFC]Make whole reg non-power-2 test for x86 and aarch64 along with risc-v 2024-09-04 10:11:05 -07:00
Alexey Bataev
571c8c2c88 Revert "[SLP]Initial support for non-power-of-2 (but still whole register) number of elements in operands."
This reverts commit a3ea90ffbbe47d9a1b3eab03324f09d7b8e0dcb3 after the
post commit review. The number of parts is calculated incorrectly.
2024-09-03 11:02:07 -07:00
Alexey Bataev
884d7c137a Revert "[SLP]Check for the whole vector vectorization in unique scalars analysis"
This reverts commit b74e09cb20e6218320013b54c9ba2f5c069d44b9 after
post-commit review. The number of parts is calculated incorrectly.
2024-09-03 11:02:07 -07:00
Philip Reames
2c7786e94a
Prefer use of 0.0 over -0.0 for fadd reductions w/nsz (in IR) (#106770)
This is a follow up to 924907bc6, and is mostly motivated by consistency
but does include one additional optimization. In general, we prefer 0.0
over -0.0 as the identity value for an fadd. We use that value in
several places, but don't in others. So, let's be consistent and use the
same identity (when nsz allows) everywhere.

This creates a bunch of test churn, but due to 924907bc6, most of that
churn doesn't actually indicate a change in codegen. The exception is
that this change enables the use of 0.0 for nsz, but *not* reasoc, fadd
reductions. Or said differently, it allows the neutral value of an
ordered fadd reduction to be 0.0.
2024-09-03 09:16:37 -07:00
Alexey Bataev
b74e09cb20 [SLP]Check for the whole vector vectorization in unique scalars analysis
Need to check that thr whole number of register is attempted to
vectorize before actually trying to build the node to avoid compiler
crash.
2024-09-03 06:19:21 -07:00
Alexey Bataev
a3ea90ffbb [SLP]Initial support for non-power-of-2 (but still whole register) number of elements in operands.
Patch adds basic support for non-power-of-2 number of elements in
operands. The patch still requires that this number addresses whole
registers.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/106449
2024-08-31 08:14:49 -07:00
Martin Storsjö
9e86d4f2ed Revert "[SLP]Initial support for non-power-of-2 (but still whole register) number of elements in operands."
This reverts commit 6ab07d71174982e5cb95420ee4df01347333c342.

This commit caused failed asserts, see
https://github.com/llvm/llvm-project/pull/106449.
2024-08-31 14:53:08 +03:00
Alexey Bataev
6ab07d7117
[SLP]Initial support for non-power-of-2 (but still whole register) number of elements in operands.
Patch adds basic support for non-power-of-2 number of elements in
operands. The patch still requires that this number addresses whole
registers.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/106449
2024-08-30 14:50:34 -04:00
Simon Pilgrim
ceb613a8be [RISCV] Add full test coverage for acos/asin/atan and cosh/sinh/tanh intrinsics to support #106584 2024-08-30 14:01:15 +01:00
Philip Reames
22ba351108 [RISCV][SLP] Test for <3 x Ty> reductions which require reordering
These tests show a vectorizable reduction where the order of the
reduction has been adjusted so that profitable vectorization requires
a reordering of the computation.   We currently have no reordering
in SLP for non-power-of-two vectors, so this doesn't work.

Note that due to reassociation performed in the standard pipeline,
this is actually the canonical form for a reduction reaching SLP.
2024-08-29 11:42:09 -07:00
Alexey Bataev
be7014e95a [SLP][NFC]Add a test with non-power-of-2 (but still whole vector) operands. 2024-08-28 10:08:20 -07:00