113 Commits

Author SHA1 Message Date
Simon Pilgrim
156f83adc2 [X86] combineVectorTruncation - use PACKUSDW(BLENDW(X,0),BLENDW(Y,0)) for v8i32->v8i16 truncation
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
2022-01-30 20:07:04 +00:00
Roman Lebedev
c1a36ba002
[DAGCombine][X86][ARM] EXTRACT_SUBVECTOR(VECTOR_SHUFFLE(?,?,Mask)) -> VECTOR_SHUFFLE(EXTRACT_SUBVECTOR(?, ?), EXTRACT_SUBVECTOR(?, ?), Mask')
In most test changes this allows us to drop some broadcasts/shuffles.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104156
2021-12-13 20:03:44 +03:00
Simon Pilgrim
55e4cd8485 [X86][AVX2] Recognise 256-bit truncation shuffles and mask 256-bit source
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.

With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
2021-11-07 21:24:55 +00:00
Roman Lebedev
cf9b1f7a0e
[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants
Currently, X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess, this has been fine so far.

But at least on AMD Zen 3, while per-line variable shuffles (e.g. `VPSHUFB`)
are as fast as as shuffles with fixed/immediate mask,
while lane-crossing shuffles, e.g. `VPERMPS` is performing worse.

So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.

Differential Revision: https://reviews.llvm.org/D103274
2021-06-01 10:39:36 +03:00
Roman Lebedev
bafbec8535
[NFC][X86][Codegen] Re-autogenerate check lines in a few tests to remove noise from future changes 2021-05-27 20:29:50 +03:00
Simon Pilgrim
c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode
2021-03-26 17:23:54 +00:00
Simon Pilgrim
36e3c6c841 [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering as it makes it easier to use any sign/zero-extended bits in the truncated bits with PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now be all resolved with combineHorizOpWithShuffle and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.

Differential Revision: https://reviews.llvm.org/D96609
2021-03-25 10:34:34 +00:00
Simon Pilgrim
13f2aee783 [X86][AVX] Generalize vperm2f128/vperm2i128 patterns to support all legal 256-bit vector types
Remove bitcasts to/from v4x64 types through vperm2f128/vperm2i128 ops to help improve shuffle combining and demanded vector elts folding.
2021-01-25 15:35:36 +00:00
Simon Pilgrim
69bc0990a9 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE (REAPPLIED).
Add DemandedElts support inside the TRUNCATE analysis.

REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d

Differential Revision: https://reviews.llvm.org/D56387
2021-01-21 13:01:34 +00:00
Hans Wennborg
a51226057f Revert "[DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE"
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.

> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387

This reverts commit cad4275d697c601761e0819863f487def73c67f8.
2021-01-20 20:06:55 +01:00
Simon Pilgrim
cad4275d69 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE
Add DemandedElts support inside the TRUNCATE analysis.

Differential Revision: https://reviews.llvm.org/D56387
2021-01-20 15:39:58 +00:00
Simon Pilgrim
5626adcd6b [X86][SSE] combineVectorSignBitsTruncation - fold trunc(srl(x,c)) -> packss(sra(x,c))
If a srl doesn't introduce any sign bits into the truncated result, then replace with a sra to let us use a PACKSS truncation - fixes a regression noticed in D56387 on pre-SSE41 targets that don't have PACKUSDW.
2021-01-19 11:04:13 +00:00
Simon Pilgrim
4214ca9614 [X86][AVX] Attempt to fold vpermf128(op(x,i),op(y,i)) -> op(vpermf128(x,y),i)
If vpermf128/vpermi128 is acting on 2 similar 'inlane' ops, then try to perform the vpermf128 first which will allow us to merge the ops.

This will help us fix one of the regressions in D56387
2021-01-11 16:59:25 +00:00
Simon Pilgrim
283036394e [X86][SSE] combineVectorTruncation - enable (pre-SSSE3) vXi16->vXi8 truncation.
Shuffle combining can now handle this output, and by performing this early in combineVectorTruncation we avoid a scalarization that caused a regression on D87502.
2020-09-24 15:51:36 +01:00
Simon Pilgrim
21d02dc595 [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - add general shuffle combining support
This patch uses partial DemandedElts masks to further simplify target shuffle chains and finally starts making target shuffle combining part of SimplifyDemandedBits/SimplifyDemandedVectorElts.

We already manage this for Depth == 0 cases, where combineX86ShuffleChain would early-out if the shuffle combined to the same op, but the patch generalizes this by manipulating the depth handling of combineX86ShufflesRecursively - calling with a new Depth = 0 and reducing the maximum shuffle combine depth accordingly.

Differential Revision: https://reviews.llvm.org/D66004
2020-09-02 09:24:46 +01:00
Simon Pilgrim
d2057a8015 [X86][AVX] Lower v16i8/v8i16 binary shuffles using VTRUNC/TRUNCATE
This patch adds lowerShuffleWithVTRUNC to handle basic binary shuffles that can be lowered either as a pure ISD::TRUNCATE or a X86ISD::VTRUNC (with undef/zero values in the remaining upper elements).

We concat the binary sources together into a single 256-bit source vector. To avoid regressions we perform this after we've tried to lower with PACKS/PACKUS which typically does a cleaner job than a concat.

For non-AVX512VL cases we have to canonicalize VTRUNC cases to use a 512-bit source vectors (inserting undefs/zeros in the upper elements as necessary), truncate and then (possibly) extract the 128-bit result.

This should address the last regressions in D66004

Differential Revision: https://reviews.llvm.org/D86093
2020-08-18 11:11:58 +01:00
Simon Pilgrim
0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but broadcast cases was a simpler win.
2020-07-29 11:32:44 +01:00
Craig Topper
f8f007e378 [X86] Consistently use 128 as the PSHUFB/VPPERM index for zero
Bit 7 of the index controls zeroing, the other bits are ignored when bit 7 is set. Shuffle lowering was using 128 and shuffle combining was using 255. Seems like we should be consistent.

This patch changes shuffle combining to use 128 to match lowering.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D83587
2020-07-12 10:52:43 -07:00
Simon Pilgrim
71f342d6c3 [X86][AVX] Fold PACK(LOSUBVECTOR(SHUFFLE(X)),HISUBVECTOR(SHUFFLE(X))) -> SHUFFLE(PACK(LOSUBVECTOR(X),HISUBVECTOR(X)))
Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed.

This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles.

A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.
2020-07-04 13:54:30 +01:00
Simon Pilgrim
fb9f9dc318 [X86][SSE] Add SimplifyDemandedVectorEltsForTargetShuffle to handle target shuffle variable masks
Pulled out from the ongoing work on D66004, currently we don't do a good job of simplifying variable shuffle masks that have already lowered to constant pool entries.

This patch adds SimplifyDemandedVectorEltsForTargetShuffle (a custom x86 helper) to first try SimplifyDemandedVectorElts (which we already do) and then constant pool simplification to help mark undefined elements.

To prevent lowering/combines infinite loops, we only handle basic constant pool loads instead of creating new BUILD_VECTOR nodes for lowering - e.g. we don't try to convert them to broadcast/vzext_load - there might be some benefit to this but if so I'd rather we come up with some way to reuse existing code than reimplement a lot of BUILD_VECTOR code.

Differential Revision: https://reviews.llvm.org/D81791
2020-06-21 11:16:07 +01:00
Simon Pilgrim
3079e51858 [X86][SSE] Generalize shuffle(HORIZOP,HORIZOP) -> HORIZOP combine
Our existing combine allows to merge the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.

This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
2020-04-05 12:09:19 +01:00
Simon Pilgrim
e5e719d885 [X86][SSE] lowerV8I16Shuffle - lower compaction shuffles using PACKUSDW(PBLENDW,PBLENDW) on SSE41+
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.

This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.

Noticed in codegen while dealing with PR31443.
2020-04-04 13:08:25 +01:00
Simon Pilgrim
ad36491ebb [X86] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on all targets
Extends rG9d1721ce3926 to support AVX2+ targets.
2020-03-26 20:46:24 +00:00
Simon Pilgrim
39a52a19ed [X86] lowerV16I8Shuffle - create v8i16 mask for PACKUS(AND(),AND()) patterns.
We can improve computeKnownBits results by avoiding excess bitcasts.

For this pattern we were doing:

  (v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))

By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on.

This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
2020-03-26 19:59:57 +00:00
Simon Pilgrim
9d1721ce39 [X86][SSE] Prefer PACKUS(AND(),AND()) to SHUFFLE(PSHUFB(),PSHUFB()) on pre-AVX2 targets
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.

The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.

We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
2020-03-26 15:47:43 +00:00
Simon Pilgrim
c6e5531f9b [X86][AVX] Combine shuffles to TRUNCATE/VTRUNC patterns
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.

We could probably extend this further to handle non-VLX truncation cases.
2020-03-25 17:41:51 +00:00
Simon Pilgrim
4ceade0428 [X86] Combine concat(shufps,shufps) -> shufps(concat,concat)
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.

Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
2020-03-21 12:44:10 +00:00
Simon Pilgrim
e71fb46a8f [TargetLowering] SimplifyDemandedVectorElts - add DemandedElts mask to ISD::BITCAST SimplifyDemandedBits call.
This fixes most of the regressions introduced in the rG4bc6f6332028 bugfix. The vector-trunc.ll issue should be fixed by D66004.
2020-03-10 13:39:10 +00:00
Simon Pilgrim
18c19441d1 [X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.

At some point soon we're going to have make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
2020-03-10 10:44:28 +00:00
Craig Topper
18e8d02e8c [X86] Pass v32i16/v64i8 in zmm registers on KNL target.
gcc and icc pass these types in zmm registers in zmm registers.

This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to 256-bit
integer registers on AVX1 where we just split all the operations.

Fixes PR42957

Differential Revision: https://reviews.llvm.org/D66708

llvm-svn: 370495
2019-08-30 17:35:08 +00:00
Craig Topper
3dae6347da Recommit r368079 "[X86] Remove uses of the -x86-experimental-vector-widening-legalization flag from test/CodeGen/X86/"
llvm-svn: 368184
2019-08-07 16:33:37 +00:00
Craig Topper
8b5f2ab2a4 Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.

Original commit message:

This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 368183
2019-08-07 16:24:26 +00:00
Mitch Phillips
bd0d97e1c4 Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default."
This reverts commit 3de33245d2c992c9e0af60372043540b60f3a810.

This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368107
2019-08-06 23:00:43 +00:00
Mitch Phillips
2f908c1436 Revert "[X86] Remove uses of the -x86-experimental-vector-widening-legalization flag from test/CodeGen/X86/"
This reverts commit 3f572c7b8405f36993ec8a226dcddd57283a7c1e.

The MSan sanitizer buildbot was broken by rL367901. This commit
(rL368079) depends on the broken commit that need to be reverted, and
thus itself is being reverted.

See https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368106
2019-08-06 23:00:30 +00:00
Craig Topper
3f572c7b84 [X86] Remove uses of the -x86-experimental-vector-widening-legalization flag from test/CodeGen/X86/
This flag is now the default behavior so we no longer need to
set it in tests.

Some redundant tests have been removed after verifying we have
an equivalent test that didn't use the flag.

llvm-svn: 368079
2019-08-06 20:12:20 +00:00
Craig Topper
3de33245d2 [X86] Enable -x86-experimental-vector-widening-legalization by default.
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 367901
2019-08-05 18:25:36 +00:00
Craig Topper
510e6fadaa [X86] When using AND+PACKUS in lowerV16I8Shuffle, generate the build vector directly in v16i8 with the correct 0x00 or 0xFF elements rather than using another VT and bitcasting it.
The build_vector will become a constant pool load. By using the
desired type initially, it ensures we don't generate a bitcast
of the constant pool load which will need to be folded with
the load.

While experimenting with another patch, I noticed that when the
load type and the constant pool type don't match, then
SimplifyDemandedBits can't handle it. While we should probably
fix that, this was a simple way to fix the issue I saw.

llvm-svn: 366732
2019-07-22 19:58:49 +00:00
Craig Topper
e8da65c698 [X86] Turn v16i16->v16i8 truncate+store into a any_extend+truncstore if we avx512f, but not avx512bw.
Ideally we'd be able to represent this truncate as a any_extend to
v16i32 and a truncate, but SelectionDAG doens't know how to not
fold those together.

We have isel patterns to use a vpmovzxwd+vpdmovdb for the truncate,
but we aren't able to simultaneously fold the load and the store
from the isel pattern. By pulling the truncate into the store we
can successfully hide it from the DAG combiner. Then we can isel
pattern match the truncstore and load+any_extend separately.

llvm-svn: 364163
2019-06-23 23:51:21 +00:00
Sanjay Patel
606eb2367f [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is a reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 362524
2019-06-04 16:40:04 +00:00
Sanjay Patel
f7980e727f Revert "[x86] split 256-bit store of concatenated vectors"
This reverts commit d5a8637072f4c556b88156bd2f6237a2ead47d31.

Most likely suspect for this bot failure:
http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/9684

llvm-svn: 361850
2019-05-28 17:37:58 +00:00
Sanjay Patel
d5a8637072 [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is the reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 361822
2019-05-28 13:54:17 +00:00
Simon Pilgrim
b0f51266b8 [X86][AVX] Fold concat(packus(),packus()) -> packus(concat(),concat()) (PR34773)
Basic "revectorization" combine, we can probably do more opcodes here but it can be a tricky cost-benefit depending on where the subvectors came from - but this case helps shuffle combining.

llvm-svn: 360134
2019-05-07 11:17:39 +00:00
Simon Pilgrim
22d1476bfa [X86][AVX] Combine non-lane crossing binary shuffles using X86ISD::VPERMV3
Some of the combines might be further improved if we lower more shuffles with X86ISD::VPERMV3 directly, instead of waiting to combine the results.

llvm-svn: 359400
2019-04-28 14:31:01 +00:00
Simon Pilgrim
4171a91e92 [X86] combineVectorTruncationWithPACKUS - remove split/concatenation of mask
combineVectorTruncationWithPACKUS is currently splitting the upper bit bit masking into 128-bit subregs and then concatenating them back together.

This was originally done to avoid regressions that caused existing subregs to be concatenated to the larger type just for the AND masking before being extracted again. This was fixed by @spatel (notably rL303997 and rL347356).

This also lets SimplifyDemandedBits do some further improvements before it hits the recursive depth limit.

My only annoyance with this is that we were broadcasting some xmm masks but we seem to have lost them by moving to ymm - but that's a known issue as the logic in lowerBuildVectorAsBroadcast isn't great.

Differential Revision: https://reviews.llvm.org/D60375#inline-539623

llvm-svn: 358692
2019-04-18 17:23:09 +00:00
Sanjay Patel
fff628274d [x86] split more v8f32/v8i32 shuffles in lowering
Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.

This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.

Differential Revision: https://reviews.llvm.org/D58181

llvm-svn: 354279
2019-02-18 16:46:12 +00:00
Simon Pilgrim
e95550f508 [X86][AVX] Add VMOVDDUP-VPBROADCASTQ execution domain mapping
Noticed in D57514.

Differential Revision: https://reviews.llvm.org/D57519

llvm-svn: 352922
2019-02-01 21:41:30 +00:00
Simon Pilgrim
9f4dea8c06 [X86] Add VPSLLI/VPSRLI ((X >>u C1) << C2) SimplifyDemandedBits combine
Repeat of the generic SimplifyDemandedBits shift combine

llvm-svn: 350399
2019-01-04 15:43:43 +00:00
Craig Topper
ec096a1dae [X86] Custom type legalize v2i32/v4i16/v8i8->i64 bitcasts in 64-bit mode similar to what's done when the destination is f64.
The generic legalizer will fall back to a stack spill that uses a truncating store. That store will get expanded into a shuffle and non-truncating store on pre-avx512 targets. Once that happens the stack store/load pair will be combined away leaving behind the shuffle and bitcasts. On avx512 targets the truncating store is legal so doesn't get folded away.

By custom legalizing it we can avoid this churn and maybe produce better code.

llvm-svn: 348085
2018-12-02 05:46:48 +00:00
Craig Topper
bc8148f7b0 [X86] Lower v16i16->v8i16 truncate using an 'and' with 255, an extract_subvector, and a packuswb instruction.
Summary: This is an improvement over the two pshufbs and punpcklqdq we'd get otherwise.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D54671

llvm-svn: 347171
2018-11-18 17:59:28 +00:00
Sanjay Patel
0a515595a7 [x86] allow vector load narrowing with multi-use values
This is a long-awaited follow-up suggested in D33578. Since then, we've picked up even more
opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs.
Apart from 2-3 strange cases, these are all wins.

I've structured this to be no-functional-change-intended for any target except for x86
because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those
targets have existing regression tests (4, 4, 10 files respectively) that would be
affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show
any regression test diffs. The trade-off is deciding if an extra vector load is better
than a single wide load + extract_subvector.

For x86, this is almost always better (on paper at least) because we often can fold
loads into subsequent ops and not increase the official instruction count. There's also
some unknown -- but potentially large -- benefit from using narrower vector ops if wide
ops are implemented with multiple uops and/or frequency throttling is avoided.

Differential Revision: https://reviews.llvm.org/D54073

llvm-svn: 346595
2018-11-10 20:05:31 +00:00