286 Commits

Author SHA1 Message Date
David Green
f000c053bf [VectorCombine] Add test coverage to shuffleToIdentity for fp casts. NFC 2024-12-19 21:19:06 +00:00
Simon Pilgrim
fbc18b85d6
Revert "[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different" (#120422)
Reverts llvm/llvm-project#115209 - investigating a reported regression
2024-12-18 13:32:53 +00:00
hanbeom
b7a8d9584c
[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different (#115209)
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask

Original combining left the combine between vectors of different lengths as a TODO.
2024-12-18 07:47:42 +00:00
Simon Pilgrim
5287299f88
[VectorCombine] foldShuffleOfBinops - prefer same cost fold if it reduces instruction count (#120216)
We don't fold "shuffle (binop), (binop)" -> "binop (shuffle), (shuffle)" if the old/new costs are equal, but we can relax this if either new shuffle will constant fold as it will reduce instruction count.
2024-12-17 18:10:20 +00:00
Simon Pilgrim
df2356b475 [X86] getShuffleCost - ensure we treat constant folded shuffles as free 2024-12-17 09:01:54 +00:00
Simon Pilgrim
8217c2eaef
[VectorCombine] foldShuffleOfBinops - extend to handle icmp/fcmp ops as well (#120075)
Extend binary instructions matching to match compare instructions + predicate as well.
2024-12-16 17:23:04 +00:00
Simon Pilgrim
8dd27d4569 [VectorCombine] Add test coverage for shuffle(cmp,cmp) fold patterns 2024-12-16 12:52:38 +00:00
Simon Pilgrim
cc54a0ce56
[VectorCombine] vectorizeLoadInsert - only fold when inserting into a poison vector (#119906)
We have corresponding poison tests in the "-inseltpoison.ll" sibling test files.

Fixes #119900
2024-12-14 11:56:12 +00:00
Simon Pilgrim
86779da52b
[VectorCombine] Fold "(or (zext (bitcast X)), (shl (zext (bitcast Y)), C))" -> "(bitcast (concat X, Y))" MOVMSK bool mask style patterns (#119695)
Mask/Bool vectors are often bitcast to/from scalar integers, in particular when concatenating mask results, often this is due to the difficulties of working with vector of bools on C/C++. On x86 this typically involves the MOVMSK/KMOV instructions.

To concatenate bool masks, these are typically cast to scalars, which are then zero-extended, shifted and OR'd together.

This patch attempts to match these scalar concatenation patterns and convert them to vector shuffles instead. This in turn often assists with further vector combines, depending on the cost model.

Reapplied patch from #119559 - fixed use after free issue.

Fixes #111431
2024-12-12 13:45:10 +00:00
Simon Pilgrim
625ec7ec89 [VectorCombine] Move concat-boolmasks.ll tests to be VectorCombine only
Suggested on #119559
2024-12-12 11:02:01 +00:00
Simon Pilgrim
673c324ae3 [VectorCombine] foldInsExtVectorToShuffle - canonicalize new shuffle(undef,x) -> shuffle(x,undef).
foldInsExtVectorToShuffle is likely to be inserting into an undef value, so make sure we've canonicalized this to the RHS in the folded shuffle to help further VectorCombine folds.

Minor tweak to help #34072
2024-12-11 16:09:13 +00:00
Simon Pilgrim
4f933277a5
[CostModel][X86] Improve cost estimation of insert_subvector shuffle patterns of legalized types (#119363)
In cases where the base/sub vector type in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free as the legalized vectors will not overlap.

Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32).

Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.
2024-12-10 16:28:56 +00:00
Simon Pilgrim
05b907f66b
[VectorCombine] foldShuffleOfShuffles - allow fold with only single shuffle operand. (#119354)
foldShuffleOfShuffles already handles "shuffle (shuffle x, undef), (shuffle y, undef)" patterns, this patch relaxes the requirement so it can handle cases where only a single operand is a shuffle (and the other can be any other value and will be kept in place).

Fixes #86068
2024-12-10 13:10:25 +00:00
Simon Pilgrim
9ab016f1ee [VectorCombine] Add test coverage for #86068 2024-12-10 10:00:56 +00:00
Paul Walker
56c091ea71
[LLVM][IR] Use splat syntax when printing ConstantExpr based splats. (#116856)
This brings the printing of scalable vector constant splats inline with
their fixed length counterparts.
2024-11-21 11:21:12 +00:00
Luke Lau
1e897ed28d
[TTI][RISCV] Deduplicate type-based VP costing (#115983)
We have a lot of code in RISCVTTIImpl::getIntrinsicInstrCost for vp
intrinsics, which just forward the cost to the underlying non-vp cost
function.

However I just also noticed that there is generic code in BasicTTIImpl's
getIntrinsicInstrCost that does the same thing, added in #67178. The
only difference is that BasicTTIImpl doesn't yet handle it for
type-based costing. There doesn't seem to be any reason that it can't
since it's just inspecting the argument types.

This shuffles the VP costing up to handle both regular and type-based
costing, which allows us to deduplicate some of the VP specific costing
in RISCVTTIImpl by delegating it to BasicTTIImpl.h. More of those nodes
can be moved over to BasicTTIImpl.h later.

It's not NFC since it picks up a couple of VP nodes that had slipped
through the cracks. Future PRs can begin to move more of the code from
RISCVTTIImpl to BasicTTIImpl.
2024-11-19 16:20:29 +08:00
Simon Pilgrim
0baa6a7272
[VectorCombine] foldShuffleOfShuffles - relax one-use of inner shuffles (#116062)
Allow multi-use of either of the inner shuffles and account for that in the cost comparison.
2024-11-13 16:18:11 +00:00
Simon Pilgrim
1878b94568
[VectorCombine] isExtractExtractCheap - specify the extract/insert shuffle mask to improve shuffle costs (#114780)
This shuffle mask is so focused, the cost model is very likely to be able to determine a specific (lower) cost
2024-11-13 12:31:39 +00:00
hanbeom
d942f5e13d
[VectorCombine] Combine extract/insert from vector into a shuffle (#115213)
insert (DstVec, (extract SrcVec, ExtIdx), InsIdx) --> shuffle (DstVec, SrcVec, Mask)

This commit combines extract/insert on a vector into Shuffle with vector.
2024-11-13 11:16:09 +00:00
Simon Pilgrim
958e37cd1f [VectorCombine] scalarizeBinopOrCmp - check for out of bounds element indices
Fixes #115575
2024-11-09 16:00:03 +00:00
Simon Pilgrim
6fb2a6044f [VectorCombine] Add test coverage for #115575 2024-11-09 16:00:03 +00:00
Simon Pilgrim
e3a0775651 [VectorCombine] foldExtractedCmps - (re-)enable fold on non-commutative binops
#114901 exposed that foldExtractedCmps didn't account for non-commutative binops, and were disabled by 05e838f428555bcc4507bd37912da60ea9110ef6

This patch re-enables support for non-commutative binops by ensuring that the LHS/RHS arg order of the binop is retained.
2024-11-06 12:10:31 +00:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Simon Pilgrim
d8354d63db [VectorCombine] Extend test coverage for #114901 with commuted test case 2024-11-06 11:27:05 +00:00
Simon Pilgrim
05e838f428 [VectorCombine] foldExtractedCmps - disable fold on non-commutative binops
The fold needs to be adjusted to correctly track the LHS/RHS operands, which will take some refactoring, for now just disable the fold in this case.

Fixes #114901
2024-11-05 11:42:30 +00:00
Simon Pilgrim
a88be11eef [VectorCombine] Add test coverage for #114901 2024-11-05 11:42:30 +00:00
Simon Pilgrim
92af82a48d
[VectorCombine] Fold "shuffle (binop (shuffle, shuffle)), undef" --> "binop (shuffle), (shuffle)" (#114101)
Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

Fixes #94546
Fixes #49736
2024-10-31 10:58:09 +00:00
Simon Pilgrim
80c8ecd565 [VectorCombine] Add baseline "shuffle (binop (shuffle, shuffle)), undef" tests for #114101 2024-10-30 13:42:58 +00:00
Ramkumar Ramachandra
1f919aa778
VectorCombine: lift one-use limitation in foldExtractedCmps (#110902)
There are artificial one-use limitations on foldExtractedCmps. Adjust
the costs to account for multi-use, and strip the one-use matcher,
lifting the limitations.
2024-10-10 14:10:41 +01:00
David Green
c136d3237a
[VectorCombine] Do not try to operate on OperandBundles. (#111635)
This bails out if we see an intrinsic with an operand bundle on it, to
make sure we don't process the bundles incorrectly.

Fixes #110382.
2024-10-09 16:20:03 +01:00
Philip Reames
0b524efa95
[RISCV][TTI] Reduce cost of a <N x i1> build_vector pattern (#109449)
This is a follow up to 7f6bbb3. When lowering a <N x i1> build_vector,
we currently chose to extend to i8, perform the build_vector there, and
then truncate back in vector. Our costing on the other hand accounts for
it as if we performed a vector extend, an insert, and a vector extract
for every element. This significantly over estimates the cost.

Note that we can likely do better in our build_vector lowering here by
packing the bits in scalar, and doing a build_vector of the packed bits.
Regardless, our costing should match our lowering.
2024-09-23 07:21:54 -07:00
Yingwei Zheng
87663fdab9
[VectorCombine] Don't shrink lshr if the shamt is not less than bitwidth (#108705)
Consider the following case:
```
define <2 x i32> @test(<2 x i64> %vec.ind16, <2 x i32> %broadcast.splat20) {
  %19 = icmp eq <2 x i64> %vec.ind16, zeroinitializer
  %20 = zext <2 x i1> %19 to <2 x i32>
  %21 = lshr <2 x i32> %20, %broadcast.splat20
  ret <2 x i32> %21
}
```
After https://github.com/llvm/llvm-project/pull/104606, we shrink the
lshr into:
```
define <2 x i32> @test(<2 x i64> %vec.ind16, <2 x i32> %broadcast.splat20) {
  %1 = icmp eq <2 x i64> %vec.ind16, zeroinitializer
  %2 = trunc <2 x i32> %broadcast.splat20 to <2 x i1>
  %3 = lshr <2 x i1> %1, %2
  %4 = zext <2 x i1> %3 to <2 x i32>
  ret <2 x i32> %4
}
```
It is incorrect since `lshr i1 X, 1` returns `poison`.
This patch adds additional check on the shamt operand. The lshr will get
shrunk iff we ensure that the shamt is less than bitwidth of the smaller
type. As `computeKnownBits(&I, *DL).countMaxActiveBits() > BW` always
evaluates to true for `lshr(zext(X), Y)`, this check will only apply to
bitwise logical instructions.

Alive2: https://alive2.llvm.org/ce/z/j_RmTa
Fixes https://github.com/llvm/llvm-project/issues/108698.
2024-09-15 18:38:06 +08:00
Igor Kirillov
1b57cbcf25
[VectorCombine] Refactor Insertion Point setting in shrinkType (#108398) 2024-09-13 10:03:31 +01:00
Igor Kirillov
958a337132
[VectorCombine] Fix trunc generated between PHINodes (#108228) 2024-09-12 10:20:56 +01:00
Han-Kuan Chen
0ccc6092d2
[VectorCombine] Add foldShuffleOfIntrinsics. (#106502) 2024-09-10 21:10:09 +08:00
Igor Kirillov
bf694841f5
[VectorCombine] Add type shrinking and zext propagation for fixed-width vector types (#104606)
Check that `binop(zext(value)`, other) is possible and profitable to transform
into: `zext(binop(value, trunc(other)))`.
When CPU architecture has illegal scalar type iX, but vector type <N * iX> is
legal, scalar expressions before vectorisation may be extended to a legal
type iY. This extension could result in underutilization of vector lanes,
as more lanes could be used at one instruction with the lower type.
Vectorisers may not always recognize opportunities for type shrinking, and
this patch aims to address that limitation.
2024-09-10 10:09:03 +01:00
David Spickett
d1f3a92ea9 Revert "[AArch64] Remove special-case inserted shuffle cost."
This reverts commit 19b785b7334d01354e8430634bab3c3341c671ca.

My bisect must have been wrong because they're still failing,
and there are follow ups to this that would need unpicking anyway.
2024-07-29 11:24:39 +00:00
David Spickett
19b785b733 Revert "[AArch64] Remove special-case inserted shuffle cost."
This reverts commit f67fa3be4db68afc08c7f3d9523f1533fa5687b7.

Caused test suite failures on AArch64:
https://lab.llvm.org/buildbot/#/builders/17/builds/1349
2024-07-29 11:03:25 +00:00
Vitaly Buka
497e2e8cf8
[NFC][VectorCombine] Add negative sanitizer tests (#100832)
They are already work as expected.
2024-07-26 17:20:57 -07:00
David Green
f67fa3be4d [AArch64] Remove special-case inserted shuffle cost.
This special case tried to measure if the shuffle vector will be multiple
inserts into an existing vector, with one of the lanes already in-place. If so
it reduces the cost by 1 to to represent it will can insert n-1 vector lanes.
This isn't always true though as the original vector may need to be moved to a
new value to start inserting new values into it, if other values from the
original are still needed.

This didn't effect performance much when I tried it, but should hopefully start
to address a regression we see from differences in SLP vectorization lane
orders.
2024-07-25 17:46:48 +01:00
Luke Lau
58854facb3
[RISCV] Don't cost vector arithmetic fp ops as cheaper than scalar (#99594)
I was comparing some SPEC CPU 2017 benchmarks across rva22u64 and
rva22u64_v, and noticed that in a few cases that rva22u64_v was
considerably slower.

One of them was 519.lbm_r, which has a large loop that was being
unprofitably vectorized. It has an if/else in the loop which requires
large amounts of predication when vectorized, but despite the loop
vectorizer taking this into account the vector cost came out as cheaper
than the scalar.

It looks like the reason for this is because we cost scalar floating
point ops as 2, but their vector equivalents as 1 (for LMUL 1). This
comes from how we use BasicTTIImpl for scalars which treats floats as
twice as expensive as integers.

This patch doubles the cost of vector floating point arithmetic ops so
that they're at least as expensive as their scalar counterparts, which
gives a 13% speedup on 519.lbm_r at -O3 on the spacemit-x60.

Fixes #62576 (the last point there about scalar fsub/fmul)
2024-07-22 13:56:10 +08:00
Philip Reames
ded35c0c3a
[vectorcombine] Pull sext/zext through reduce.or/and/xor (#99548)
This extends the existing foldTruncFromReductions transform to handle
sext and zext as well. This is only legal for the bitwise reductions
(and/or/xor) and not the arithmetic ones (add, mul). Use the same
costing decision to drive whether we do the transform.
2024-07-18 13:56:40 -07:00
Philip Reames
5e8cd29d62 [RISCV] Add coverage for vector combine reduce(cast x) transformation
This covers both the existing trunc transform - basically checking
that it performs sanely with the RISCV cost model - and a planned
change to handle sext/zext as well.
2024-07-18 11:39:25 -07:00
Simon Pilgrim
da286c8bf6
[VectorCombine] foldShuffleToIdentity - peek through bitcasts to see if they come from the same value to form identity sequence (#98334)
Workaround until I can get #96884 fixed properly - when trying to find identity sequences, peek through any bitcasts to see if the values all came from the same source. We don't run CSE frequently enough to merge all the bitcasts that we end up with.
2024-07-15 21:36:23 +01:00
Simon Pilgrim
eb656ea6d7 [VectorCombine] Add vectorcombine specific test coverage for #98334
Don't rely on phaseordering tests alone
2024-07-15 11:14:43 +01:00
Simon Pilgrim
ef5b1ec0dd [VectorCombine] foldShuffleToIdentity - ensure casts have the same source type 2024-07-09 13:10:11 +01:00
Simon Pilgrim
7e054c33d4 [VectorCombine] foldShuffleOfCastops - don't restrict to oneuse but compare total costs instead
Some casts (especially bitcasts but others as well) are incredibly cheap (or free), so don't limit the shuffle(cast(x),cast(y)) -> cast(shuffle(x,y)) to oneuse cases, but instead compare the total before/after costs of possibly repeating some casts.
2024-07-08 14:57:51 +01:00
Elvis Wang
4762f3bab0
[RISCV][TTI] Add cost of type based binOp VP intrinsics with functionalOPC. (#93435)
Intrinsics not supported in the backend will fall Into BasicTTIImpl,
which will check if the VP intrinsic is a type based instruction.
All type based instruction will fall into the
`getTypeBasedIntrinsicInstrCost()` which doesn't support instruction
with scalable vector type.

This patch adds the instruction cost for type based binOp VP intrinsic
instructions in the backend to get the valid instruction costs.
The cost of type based binOp VP intrinsics will be same as their non-VP
counterpart.
2024-07-05 08:13:18 +08:00
David Green
76c8e1d857 [VectorCombine] Guard against the lane zero select predicate being scalar
All but the first lane was being checked, but this could leave the first lane
with a scalar select predicate. This just extends the check to make sure the
types are all the same
2024-06-28 17:27:16 +01:00
David Green
efa8463ab9
[VectorCombine] Add free concats to shuffleToIdentity. (#94954)
This is another relatively small adjustment to shuffleToIdentity, which
has had a few knock-one effects to need a few more changes. It attempts
to detect free concats, that will be legalized to multiple vector
operations. For example if the lanes are '[a[0], a[1], b[0], b[1]]' and
a and b are v2f64 under aarch64.

In order to do this:
- isFreeConcat detects whether the input has piece-wise identities from
multiple inputs that can become a concat.
- A tree of concat shuffles is created to concatenate the input values
into a single vector. This is a little different to most other inputs as
there are created from multiple values that are being combined together,
and we cannot rely on the Lane0 insert location always being valid.
- The insert location is changed to the original location instead of
updating per item, which ensure it is valid due to the order that we
visit and create items.
2024-06-25 07:55:08 +01:00