197 Commits

Ramkumar Ramachandra
5187482fd0
IR: handle FP predicates in CmpPredicate::getMatching (#122924)
CmpPredicate::getMatching implicitly assumes that both predicates are
integer-predicates, and this has led to a crash being reported in
VectorCombine after e409204 (VectorCombine: teach foldExtractedCmps
about samesign). FP predicates are simple enough to handle as there is
never any samesign information associated with them: hence handle them
in CmpPredicate::getMatching, fixing the VectorCombine crash and
guarding against future incorrect usages.
2025-01-14 18:17:07 +00:00
Simon Pilgrim
87750c9de4 [VectorCombine] foldPermuteOfBinops - match identity shuffles only if they match the destination type
Fixes regression identified after #122118
2025-01-14 16:09:50 +00:00
Ramkumar Ramachandra
e409204a89
VectorCombine: teach foldExtractedCmps about samesign (#122883)
Follow up on 4a0d53a (PatternMatch: migrate to CmpPredicate) to get rid
of one of the FIXMEs it introduced by replacing a predicate comparison
with CmpPredicate::getMatching.
2025-01-14 12:04:14 +00:00
Simon Pilgrim
0bf1591d01
[VectorCombine] foldPermuteOfBinops - fold "shuffle (binop (shuffle, other)), undef" --> "binop (shuffle), (shuffle)". (#122118)
foldPermuteOfBinops currently requires both binop operands to be oneuse shuffles to fold the shuffles across the binop, but there will be cases where it's still profitable to fold across the binop with only one foldable shuffle.
2025-01-14 10:43:22 +00:00
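A minimal before/after sketch of the relaxed fold in LLVM IR (hypothetical types and masks, not taken from the patch's tests) - note that only one binop operand is a shuffle here; the other is used as-is:

```llvm
; Before: the outer permute sits on a binop fed by a single oneuse shuffle.
define <4 x float> @before(<4 x float> %x, <4 x float> %y) {
  %s = shufflevector <4 x float> %x, <4 x float> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
  %b = fadd <4 x float> %s, %y
  %r = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  ret <4 x float> %r
}

; After: "binop (shuffle), (shuffle)" - the outer permute is composed into
; the existing shuffle of %x and a new shuffle of %y.
define <4 x float> @after(<4 x float> %x, <4 x float> %y) {
  %sx = shufflevector <4 x float> %x, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %sy = shufflevector <4 x float> %y, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %r = fadd <4 x float> %sx, %sy
  ret <4 x float> %r
}
```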
Simon Pilgrim
5a7dfb4659 [CostModel][X86] Attempt to match v4f32 shuffles that map to MOVSS/INSERTPS instruction
improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index0.
2025-01-07 11:31:44 +00:00
Simon Pilgrim
4a42658c1b [VectorCombine][X86] shuffle-of-cmps.ll - tweak shuf_fcmp_oeq_v4i32 shuffle to be not so cheap
An upcoming patch will recognise this as a cheap INSERTPS shuffle - alter the shuffle to ensure the 2 x FCMP is still cheaper on SSE4 targets
2025-01-07 11:07:48 +00:00
Simon Pilgrim
db88071a8b
[CostModel][X86] Attempt to match cheap v4f32 shuffles that map to SHUFPS instruction (#121778)
Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.
2025-01-06 17:54:36 +00:00
Simon Pilgrim
7cdbde70fa
[CostModel][X86] getShuffleCost - use processShuffleMasks for all shuffle kinds to legal types (#120599) (#121760)
Now that processShuffleMasks can correctly handle 2 src shuffles, we can completely remove the shuffle kind limits and correctly recognize the number of active subvectors per legalized shuffle - improveShuffleKindFromMask will determine the shuffle kind for each split subvector.
2025-01-06 13:32:55 +00:00
Simon Pilgrim
054e7c5971 [VectorCombine] foldInsExtVectorToShuffle - ignore shuffle costs for 'identity' insertion masks
<u,1,u,u> 'inplace' single src shuffles can be treated as free identity shuffles - ignore any shuffle cost (similar to what we already do in other folds like foldShuffleOfShuffles). Eventually getShuffleCost should just return TCC_Free in these cases, but in a lot of the targets' shuffle cost logic this currently ends up treated as a generic SK_PermuteSingleSrc.

We still want to generate the shuffle as it will help further shuffle folds with the additional PoisonMaskElem 'undemanded' elements.
2025-01-05 13:02:31 +00:00
Simon Pilgrim
e3ec5a7286
[VectorCombine] foldShuffleOfBinops - fold shuffle(binop(shuffle(x),shuffle(z)),binop(shuffle(y),shuffle(w)) -> binop(shuffle(x,z),shuffle(y,w)) (#120984)
Some patterns (in particular horizontal style patterns) can end up with shuffles straddling both sides of a binop/cmp.

Where individually the folds aren't worth it, by merging the (oneuse) shuffles we can notably reduce the net instruction count and cost.

One of the final steps towards finally addressing #34072
2025-01-03 10:29:07 +00:00
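A hypothetical before/after sketch of this fold in LLVM IR (masks chosen for illustration, not taken from the patch) - five shuffles and two fadds collapse to two shuffles and one fadd:

```llvm
; Before: one-use shuffles straddle both sides of the two fadds,
; and the fadd results are shuffled again.
define <4 x float> @before(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
  %sx = shufflevector <4 x float> %x, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %sz = shufflevector <4 x float> %z, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %sy = shufflevector <4 x float> %y, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %sw = shufflevector <4 x float> %w, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %b0 = fadd <4 x float> %sx, %sz
  %b1 = fadd <4 x float> %sy, %sw
  %r  = shufflevector <4 x float> %b0, <4 x float> %b1, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  ret <4 x float> %r
}

; After: the shuffles are merged across the binop.
define <4 x float> @after(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
  %lhs = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 3, i32 2, i32 7, i32 6>
  %rhs = shufflevector <4 x float> %z, <4 x float> %w, <4 x i32> <i32 3, i32 2, i32 7, i32 6>
  %r   = fadd <4 x float> %lhs, %rhs
  ret <4 x float> %r
}
```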
Simon Pilgrim
035e64c0ec [VectorCombine] eraseInstruction - ensure we reattempt to fold other users of an erased instruction's operands (REAPPLIED)
As we're reducing the use count of the operands, it's more likely that they will now fold, as they were previously being prevented by a m_OneUse check, or the cost of retaining the extra instruction had been too high.

This is necessary for some upcoming patches, although the only change so far is instruction ordering as it allows some SSE folds of 256/512-bit with 128-bit subvectors to occur earlier in foldShuffleToIdentity as the subvector concats are free.

Reapplied with a fix for foldSingleElementStore/scalarizeLoadExtract, which were replacing/removing memory operations - we need to ensure that the worklist is populated in the correct order, with all users of the old memory operations erased first, so there are no remaining users of the loads when it's time to remove them as well.

Pulled out of #120984
2025-01-02 18:19:02 +00:00
Simon Pilgrim
5ed6229019 [VectorCombine] Add scalarizeLoadExtract infinite loop test from #120984 regression
scalarizeLoadExtract replaces instructions up the use list, which can result in users being added back to the VectorCombine worklist when they should really be erased first.
2025-01-02 17:23:12 +00:00
Simon Pilgrim
d5a96eb125 Revert af83093933ca73bc82c33130f8bda9f1ae54aae2 "[VectorCombine] eraseInstruction - ensure we reattempt to fold other users of an erased instruction's operands"
Reports of hung builds, but I don't have time to investigate at the moment.
2024-12-30 21:20:56 +00:00
Simon Pilgrim
af83093933 [VectorCombine] eraseInstruction - ensure we reattempt to fold other users of an erased instruction's operands
As we're reducing the use count of the operands, it's more likely that they will now fold, as they were previously being prevented by a m_OneUse check, or the cost of retaining the extra instruction had been too high.

This is necessary for some upcoming patches, although the only change so far is instruction ordering as it allows some SSE folds of 256/512-bit with 128-bit subvectors to occur earlier in foldShuffleToIdentity as the subvector concats are free.

Pulled out of #120984
2024-12-30 17:52:42 +00:00
Simon Pilgrim
f2f02b21cd [VectorCombine] foldShuffleOfBinops - only accept exact matching cmp predicates
m_SpecificCmp allowed equivalent predicate+flags combinations which don't necessarily remain valid after being folded from "shuffle (cmpop), (cmpop)" into "cmpop (shuffle), (shuffle)"

Fixes #121110
2024-12-28 09:21:31 +00:00
Simon Pilgrim
f68dbbbd57 [VectorCombine] Add test coverage for #121110 2024-12-28 09:21:31 +00:00
Simon Pilgrim
e3f8c229f5 [VectorCombine] foldInsExtVectorToShuffle - inserting into a poison base vector can be modelled as a single src shuffle
We already canonicalize an undef base vector to the RHS to improve further folding; this patch extends that to improve the shuffle cost estimate of the single src shuffle
2024-12-23 15:49:17 +00:00
hanbeom
ff93ca7d6c
[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different (#120461)
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask

The original combine left the case of vectors with different lengths as a TODO; this commit handles it. (see
#[baab4aa1ba])
2024-12-20 10:44:49 +00:00
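A hypothetical instance of the different-length case in LLVM IR (a <2 x float> source negated into a <4 x float> destination; values chosen for illustration):

```llvm
; Before: scalar fneg between an extract from the narrow source and an
; insert into the wider destination.
define <4 x float> @before(<4 x float> %dst, <2 x float> %src) {
  %e = extractelement <2 x float> %src, i64 1
  %n = fneg float %e
  %r = insertelement <4 x float> %dst, float %n, i64 1
  ret <4 x float> %r
}

; After: vector fneg, widened to the destination length with SrcMask,
; then blended into %dst (lane 1 takes element 4+1 of the widened fneg).
define <4 x float> @after(<4 x float> %dst, <2 x float> %src) {
  %neg = fneg <2 x float> %src
  %wide = shufflevector <2 x float> %neg, <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
  %r = shufflevector <4 x float> %dst, <4 x float> %wide, <4 x i32> <i32 0, i32 5, i32 2, i32 3>
  ret <4 x float> %r
}
```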
Simon Pilgrim
611401c115 [CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 10:39:45 +00:00
Simon Pilgrim
091448e3c1
Revert "[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types" (#120707)
Reverts llvm/llvm-project#120599 - some recent tests are currently failing
2024-12-20 10:06:03 +00:00
Simon Pilgrim
81e63f9e0c
[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 09:55:11 +00:00
Simon Pilgrim
fbc18b85d6
Revert "[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different" (#120422)
Reverts llvm/llvm-project#115209 - investigating a reported regression
2024-12-18 13:32:53 +00:00
hanbeom
b7a8d9584c
[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different (#115209)
insertelt DestVec, (fneg (extractelt SrcVec, Index)), Index
-> shuffle DestVec, (shuffle (fneg SrcVec), poison, SrcMask), Mask

The original combine left the case of vectors with different lengths as a TODO.
2024-12-18 07:47:42 +00:00
Simon Pilgrim
5287299f88
[VectorCombine] foldShuffleOfBinops - prefer same cost fold if it reduces instruction count (#120216)
We don't fold "shuffle (binop), (binop)" -> "binop (shuffle), (shuffle)" if the old/new costs are equal, but we can relax this if either new shuffle will constant fold as it will reduce instruction count.
2024-12-17 18:10:20 +00:00
Simon Pilgrim
df2356b475 [X86] getShuffleCost - ensure we treat constant folded shuffles as free 2024-12-17 09:01:54 +00:00
Simon Pilgrim
8217c2eaef
[VectorCombine] foldShuffleOfBinops - extend to handle icmp/fcmp ops as well (#120075)
Extend binary instructions matching to match compare instructions + predicate as well.
2024-12-16 17:23:04 +00:00
Simon Pilgrim
8dd27d4569 [VectorCombine] Add test coverage for shuffle(cmp,cmp) fold patterns 2024-12-16 12:52:38 +00:00
Simon Pilgrim
cc54a0ce56
[VectorCombine] vectorizeLoadInsert - only fold when inserting into a poison vector (#119906)
We have corresponding poison tests in the "-inseltpoison.ll" sibling test files.

Fixes #119900
2024-12-14 11:56:12 +00:00
Simon Pilgrim
86779da52b
[VectorCombine] Fold "(or (zext (bitcast X)), (shl (zext (bitcast Y)), C))" -> "(bitcast (concat X, Y))" MOVMSK bool mask style patterns (#119695)
Mask/Bool vectors are often bitcast to/from scalar integers, in particular when concatenating mask results; often this is due to the difficulties of working with vectors of bools in C/C++. On x86 this typically involves the MOVMSK/KMOV instructions.

To concatenate bool masks, these are typically cast to scalars, which are then zero-extended, shifted and OR'd together.

This patch attempts to match these scalar concatenation patterns and convert them to vector shuffles instead. This in turn often assists with further vector combines, depending on the cost model.

Reapplied patch from #119559 - fixed use after free issue.

Fixes #111431
2024-12-12 13:45:10 +00:00
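A minimal LLVM IR sketch of the matched pattern (hypothetical <4 x i1> masks; the patch handles the general zext/shl/or chains):

```llvm
; Before: two bool masks are bitcast to scalars, zero-extended,
; shifted and OR'd together to concatenate them.
define i8 @before(<4 x i1> %x, <4 x i1> %y) {
  %bx = bitcast <4 x i1> %x to i4
  %by = bitcast <4 x i1> %y to i4
  %zx = zext i4 %bx to i8
  %zy = zext i4 %by to i8
  %sh = shl i8 %zy, 4
  %r  = or i8 %zx, %sh
  ret i8 %r
}

; After: concatenate the masks with a shuffle, then do a single bitcast
; (element 0 lands in bit 0 of the scalar).
define i8 @after(<4 x i1> %x, <4 x i1> %y) {
  %cat = shufflevector <4 x i1> %x, <4 x i1> %y, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %r = bitcast <8 x i1> %cat to i8
  ret i8 %r
}
```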
Simon Pilgrim
625ec7ec89 [VectorCombine] Move concat-boolmasks.ll tests to be VectorCombine only
Suggested on #119559
2024-12-12 11:02:01 +00:00
Simon Pilgrim
673c324ae3 [VectorCombine] foldInsExtVectorToShuffle - canonicalize new shuffle(undef,x) -> shuffle(x,undef).
foldInsExtVectorToShuffle is likely to be inserting into an undef value, so make sure we've canonicalized this to the RHS in the folded shuffle to help further VectorCombine folds.

Minor tweak to help #34072
2024-12-11 16:09:13 +00:00
Simon Pilgrim
4f933277a5
[CostModel][X86] Improve cost estimation of insert_subvector shuffle patterns of legalized types (#119363)
In cases where the base/sub vector type in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free as the legalized vectors will not overlap.

Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32).

Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.
2024-12-10 16:28:56 +00:00
Simon Pilgrim
05b907f66b
[VectorCombine] foldShuffleOfShuffles - allow fold with only single shuffle operand. (#119354)
foldShuffleOfShuffles already handles "shuffle (shuffle x, undef), (shuffle y, undef)" patterns, this patch relaxes the requirement so it can handle cases where only a single operand is a shuffle (and the other can be any other value and will be kept in place).

Fixes #86068
2024-12-10 13:10:25 +00:00
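A hypothetical LLVM IR sketch of the relaxed case - only the first operand of the outer shuffle is itself a shuffle, and its mask is composed into the outer mask:

```llvm
; Before: "shuffle (shuffle x, poison), y".
define <4 x i32> @before(<4 x i32> %x, <4 x i32> %y) {
  %s = shufflevector <4 x i32> %x, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %r = shufflevector <4 x i32> %s, <4 x i32> %y, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x i32> %r
}

; After: one shuffle - the inner mask is folded into the outer one,
; and %y is used directly in place.
define <4 x i32> @after(<4 x i32> %x, <4 x i32> %y) {
  %r = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 3, i32 4, i32 2, i32 5>
  ret <4 x i32> %r
}
```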
Simon Pilgrim
9ab016f1ee [VectorCombine] Add test coverage for #86068 2024-12-10 10:00:56 +00:00
Simon Pilgrim
1878b94568
[VectorCombine] isExtractExtractCheap - specify the extract/insert shuffle mask to improve shuffle costs (#114780)
This shuffle mask is so focused that the cost model is very likely to be able to determine a specific (lower) cost.
2024-11-13 12:31:39 +00:00
hanbeom
d942f5e13d
[VectorCombine] Combine extract/insert from vector into a shuffle (#115213)
insert (DstVec, (extract SrcVec, ExtIdx), InsIdx) --> shuffle (DstVec, SrcVec, Mask)

This commit combines an extract/insert pair on a vector into a shuffle with that vector.
2024-11-13 11:16:09 +00:00
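A concrete instance of the fold in LLVM IR (hypothetical indices: ExtIdx = 2, InsIdx = 0):

```llvm
; Before: insert (DstVec, (extract SrcVec, 2), 0).
define <4 x i32> @before(<4 x i32> %dst, <4 x i32> %src) {
  %e = extractelement <4 x i32> %src, i64 2
  %r = insertelement <4 x i32> %dst, i32 %e, i64 0
  ret <4 x i32> %r
}

; After: one shuffle - lane 0 comes from %src element 2 (mask index 4+2),
; the remaining lanes from %dst.
define <4 x i32> @after(<4 x i32> %dst, <4 x i32> %src) {
  %r = shufflevector <4 x i32> %dst, <4 x i32> %src, <4 x i32> <i32 6, i32 1, i32 2, i32 3>
  ret <4 x i32> %r
}
```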
Simon Pilgrim
958e37cd1f [VectorCombine] scalarizeBinopOrCmp - check for out of bounds element indices
Fixes #115575
2024-11-09 16:00:03 +00:00
Simon Pilgrim
6fb2a6044f [VectorCombine] Add test coverage for #115575 2024-11-09 16:00:03 +00:00
Simon Pilgrim
e3a0775651 [VectorCombine] foldExtractedCmps - (re-)enable fold on non-commutative binops
#114901 exposed that foldExtractedCmps didn't account for non-commutative binops, and the fold was disabled by 05e838f428555bcc4507bd37912da60ea9110ef6

This patch re-enables support for non-commutative binops by ensuring that the LHS/RHS arg order of the binop is retained.
2024-11-06 12:10:31 +00:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Simon Pilgrim
d8354d63db [VectorCombine] Extend test coverage for #114901 with commuted test case 2024-11-06 11:27:05 +00:00
Simon Pilgrim
05e838f428 [VectorCombine] foldExtractedCmps - disable fold on non-commutative binops
The fold needs to be adjusted to correctly track the LHS/RHS operands, which will take some refactoring, for now just disable the fold in this case.

Fixes #114901
2024-11-05 11:42:30 +00:00
Simon Pilgrim
a88be11eef [VectorCombine] Add test coverage for #114901 2024-11-05 11:42:30 +00:00
Simon Pilgrim
92af82a48d
[VectorCombine] Fold "shuffle (binop (shuffle, shuffle)), undef" --> "binop (shuffle), (shuffle)" (#114101)
Add foldPermuteOfBinops - to fold a permute (single source shuffle) through a binary op that is being fed by other shuffles.

Fixes #94546
Fixes #49736
2024-10-31 10:58:09 +00:00
Simon Pilgrim
80c8ecd565 [VectorCombine] Add baseline "shuffle (binop (shuffle, shuffle)), undef" tests for #114101 2024-10-30 13:42:58 +00:00
Ramkumar Ramachandra
1f919aa778
VectorCombine: lift one-use limitation in foldExtractedCmps (#110902)
There are artificial one-use limitations on foldExtractedCmps. Adjust
the costs to account for multi-use, and strip the one-use matcher,
lifting the limitations.
2024-10-10 14:10:41 +01:00
Han-Kuan Chen
0ccc6092d2
[VectorCombine] Add foldShuffleOfIntrinsics. (#106502) 2024-09-10 21:10:09 +08:00
Vitaly Buka
497e2e8cf8
[NFC][VectorCombine] Add negative sanitizer tests (#100832)
They already work as expected.
2024-07-26 17:20:57 -07:00
Simon Pilgrim
7e054c33d4 [VectorCombine] foldShuffleOfCastops - don't restrict to oneuse but compare total costs instead
Some casts (especially bitcasts, but others as well) are incredibly cheap (or free), so don't limit the shuffle(cast(x),cast(y)) -> cast(shuffle(x,y)) fold to oneuse cases; instead compare the total before/after costs of possibly repeating some casts.
2024-07-08 14:57:51 +01:00
Simon Pilgrim
22530e7985 [CostModel][X86] Update vXi8 mul costs for AVX512BW/AVX2/AVX1/SSE
Later levels were inheriting some of the worst case costs from SSE/AVX1 etc.

Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts

Cleanup prep work for #90748
2024-06-16 07:27:35 +01:00