146 Commits

Author SHA1 Message Date
Simon Pilgrim
69ffa7be3b
[X86] X86FixupVectorConstants - load+zero vector constants that can be stored in a truncated form (#80428)
Further develops the vsextload support added in #79815 / b5d35feacb7246573c6a4ab2bddc4919a4228ed5 - reduces the size of the vector constant by storing it in the constant pool in a truncated form, and zero-extend it as part of the load.
2024-02-05 12:17:58 +00:00
Simon Pilgrim
b5d35feacb
[X86] X86FixupVectorConstants - load+sign-extend vector constants that can be stored in a truncated form (#79815)
Reduce the size of the vector constant by storing it in the constant pool in a truncated form, and sign-extend it as part of the load.

I've extended the existing FixupConstant functionality to support these sext constant rebuilds - we still select the smallest stored constant entry and prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips.

I intend to add the matching load+zero-extend handling in a future PR, but that requires some alterations to the existing MC shuffle comments handling first.
2024-02-02 11:28:58 +00:00
Simon Pilgrim
11276563c8 [X86] X86DAGToDAGISel - attempt to merge XMM/YMM loads with YMM/ZMM loads of the same ptr (#73126)
If we are loading the same ptr at different vector widths, then reuse the largest load and just extract the low subvector.

Unlike the equivalent VBROADCAST_LOAD/SUBV_BROADCAST_LOAD folds which can occur in DAG, we have to wait until DAGISel otherwise we can hit infinite loops if constant folding recreates the original constant value.

This is mainly useful for better constant sharing.
2023-11-27 10:26:26 +00:00
Simon Pilgrim
381efa4960 Revert rG67275263b3b781a "[X86] X86DAGToDAGISel - attempt to merge XMM/YMM loads with YMM/ZMM loads of the same ptr (#73126)"
Missed an issue that we were calling continue from within the for loop - fixed version incoming shortly.
2023-11-23 16:50:58 +00:00
Simon Pilgrim
67275263b3
[X86] X86DAGToDAGISel - attempt to merge XMM/YMM loads with YMM/ZMM loads of the same ptr (#73126)
If we are loading the same ptr at different vector widths, then reuse the larger load and just extract the low subvector.

Unlike the equivalent VBROADCAST_LOAD/SUBV_BROADCAST_LOAD folds which can occur in DAG, we have to wait until DAGISel otherwise we can hit infinite loops if constant folding recreates the original constant value.

This is mainly useful for better constant sharing.
2023-11-23 14:10:23 +00:00
Jay Foad
7b3bbd83c0 Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)"
This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c.

Reverted due to various buildbot failures.
2023-10-09 12:31:32 +01:00
Jay Foad
2501ae58e3
[CodeGen] Really renumber slot indexes before register allocation (#67038)
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
2023-10-09 11:44:41 +01:00
Simon Pilgrim
6cf8bde056 [X86] getFauxShuffleMask - add SIGN_EXTEND_VECTOR_INREG handling for all-signbits sources
Add suport for shuffle combines (via combineEXTEND_VECTOR_INREG) to begin from SIGN_EXTEND_VECTOR_INREG nodes
2023-07-19 14:32:34 +01:00
Simon Pilgrim
5bc836422e [X86] LowerEXTEND_VECTOR_INREG - add sign_extend_vector_inreg fast path for all-signbits source values
If the source operand is already all-signbits we don't need to create the sign extended elements - just splat the source element to the destination element width
2023-07-19 10:13:08 +01:00
Simon Pilgrim
fd2de54920 [X86] Canonicalize vXi64 SIGN_EXTEND_INREG vXi1 to use v2Xi32 splatted shifts instead
If somehow a vXi64 bool sign_extend_inreg pattern has been lowered to vector shifts (without PSRAQ support), then try to canonicalize to vXi32 shifts to improve likelihood of value tracking being able to fold them away.

Using a PSLLQ and bitcasted PSRAD node make it very difficult for later fold to recover from this.
2023-07-17 10:18:03 +01:00
Simon Pilgrim
34961c600d [X86] LowerTRUNCATE - attempt to use PACKSS/PACKUS on AVX512 targets if the truncation source is concatenating from smaller subvectors
Don't just use AVX512 truncation ops if PACKSS/PACKUS can do this more cheaply
2023-06-29 15:27:41 +01:00
Simon Pilgrim
0f8e0f4228 [X86] lowerBuildVectorAsBroadcast - broadcast Constant of original (BuildVector) element size
Noticed in D150143/D150526 - we currently create scalar Constant values using the broadcast instruction width, which might be wider than the original build vector width, making it tricky to recognise the original constant bits data.

If we have widened the broadcast value, its much more useful for asm comments if we create a ConstantVector with the original element data, add that to the constant-pool and load that with the same (wider) broadcast instruction.
2023-05-27 14:05:44 +01:00
Simon Pilgrim
c3bf6d20ac [X86] Fold PSHUF(VSHIFT(X,Y)) -> VSHIFT(PSHUF(X),Y)
PSHUFD/PSHUFLW/PSHUFHW can act as a vector move / folded load, notably helping simplify pre-AVX cases in particular.

This is a much milder alternative to refactoring canonicalizeShuffleWithBinOps to support SSE shifts nodes.
2023-04-22 20:02:27 +01:00
Noah Goldstein
69a322fed1 Add new pass X86FixupInstTuning for fixing up machine-instruction selection.
There are a variety of cases where we want more control over the exact
instruction emitted. This commit creates a new pass to fixup
instructions after the DAG has been lowered. The pass is only meant to
replace instructions that are guranteed to be interchangable, not to
do analysis for special cases.

Handling these instruction changes in in X86ISelLowering of
X86ISelDAGToDAG isn't ideal, as its liable to either break existing
patterns that expected a certain instruction or generate infinite
loops.

As well, operating as the MachineInstruction level allows us to access
scheduling/code size information for making the decisions.

Currently only implements `{v}permilps` -> `{v}shufps/{v}shufd` but
more transforms can be added.

Differential Revision: https://reviews.llvm.org/D143787
2023-02-27 18:53:25 -06:00
Simon Pilgrim
78739fdb4d [DAG] Enable combineShiftOfShiftedLogic folds after type legalization
This was disabled to prevent regressions, which appear to be just occurring on AMDGPU (at least in our current lit tests), which I've addressed by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.

Fixes #57872

Differential Revision: https://reviews.llvm.org/D136042
2022-10-29 12:30:04 +01:00
Sanjay Patel
f0dd12ec5c [x86] use zero-extending load of a byte outside of loops too (2nd try)
The first attempt missed changing test files for tools
(update_llc_test_checks.py).

Original commit message:

This implements the main suggested change from issue #56498.
Using the shorter (non-extending) instruction with only
-Oz ("minsize") rather than -Os ("optsize") is left as a
possible follow-up.

As noted in the bug report, the zero-extending load may have
shorter latency/better throughput across a wide range of x86
micro-arches, and it avoids a potential false dependency.
The cost is an extra instruction byte.

This could cause perf ups and downs from secondary effects,
but I don't think it is possible to account for those in
advance, and that will likely also depend on exact micro-arch.
This does bring LLVM x86 codegen more in line with existing
gcc codegen, so if problems are exposed they are more likely
to occur for both compilers.

Differential Revision: https://reviews.llvm.org/D129775
2022-07-19 21:27:08 -04:00
Sanjay Patel
95401b0153 Revert "[x86] use zero-extending load of a byte outside of loops too"
This reverts commit 9d1ea1774c51c44ddf0b5065bf600919988d7015.
There are tests of update_llc_tests_checks.py that missed being updated.
2022-07-19 17:37:22 -04:00
Sanjay Patel
9d1ea1774c [x86] use zero-extending load of a byte outside of loops too
This implements the main suggested change from issue #56498.
Using the shorter (non-extending) instruction with only
-Oz ("minsize") rather than -Os ("optsize") is left as a
possible follow-up.

As noted in the bug report, the zero-extending load may have
shorter latency/better throughput across a wide range of x86
micro-arches, and it avoids a potential false dependency.
The cost is an extra instruction byte.

This could cause perf ups and downs from secondary effects,
but I don't think it is possible to account for those in
advance, and that will likely also depend on exact micro-arch.
This does bring LLVM x86 codegen more in line with existing
gcc codegen, so if problems are exposed they are more likely
to occur for both compilers.

Differential Revision: https://reviews.llvm.org/D129775
2022-07-19 16:43:47 -04:00
Nikita Popov
2f448bf509 [X86] Migrate tests to use opaque pointers (NFC)
Test updates were performed using:
https://gist.github.com/nikic/98357b71fd67756b0f064c9517b62a34

These are only the test updates where the test passed without
further modification (which is almost all of them, as the backend
is largely pointer-type agnostic).
2022-06-22 14:38:25 +02:00
Matt Arsenault
4a36e96c3f RegAllocGreedy: Account for reserved registers in num regs heuristic
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.

There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.

The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
2021-09-14 21:00:29 -04:00
Eli Friedman
bdd55b2f18 Fix the default alignment of i1 vectors.
Currently, the default alignment is much larger than the actual size of
the vector in memory.  Fix this to use a sane default.

For SVE, temporarily remove lowering of load/store operations for
predicates with less than 16 elements. The layout the backend was
assuming for SVE predicates with less than 16 elements doesn't agree
with the frontend. More work probably needs to be done here.

This change is, strictly speaking, not backwards-compatible at the
bitcode level. But probably nobody is actually depending on that; i1
vectors in memory are rare, and the code that does use them probably
ends up forcing the alignment to something sane anyway.  If we think
this is a concern, I can restrict this to scalable vectors for now
(where it's actually causing issues for me at the moment).

Differential Revision: https://reviews.llvm.org/D88994
2021-07-31 14:09:59 -07:00
Roman Lebedev
0aef747b84
[NFC][X86][Codegen] Megacommit: mass-regenerate all check lines that were already autogenerated
The motivation is that the update script has at least two deviations
(`<...>@GOT`/`<...>@PLT`/ and not hiding pointer arithmetics) from
what pretty much all the checklines were generated with,
and most of the tests are still not updated, so each time one of the
non-up-to-date tests is updated to see the effect of the code change,
there is a lot of noise. Instead of having to deal with that each
time, let's just deal with everything at once.

This has been done via:
```
cd llvm-project/llvm/test/CodeGen/X86
grep -rl "; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py" | xargs -L1 <...>/llvm-project/llvm/utils/update_llc_test_checks.py --llc-binary <...>/llvm-project/build/bin/llc
```

Not all tests were regenerated, however.
2021-06-11 23:57:02 +03:00
Craig Topper
0248e24071 [X86][update_llc_test_checks] Use a less greedy regular expression for replacing constant pool labels in tests.
While working on D97208 I noticed that these greedy regular
expressions prevent tests from failing when (%rip) appears after
a constant pool label when it didn't before.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D99460
2021-03-28 11:39:46 -07:00
Simon Pilgrim
9fde88c3e2 [X86][AVX] splitIntVSETCC - handle separate (canonicalized) SETCC operands
LowerVSETCC calls splitIntVSETCC after canonicalizing certain patterns, in particular (X & CPow2 != 0) -> (X & CPow2 == CPow2).

Unfortunately if we're splitting for AVX1/non-AVX512BW cases, we lose these canonicalizations as we call the split with the original SetCC node, and when the split nodes are later lowered in LowerVSETCC the patterns are lost behind extract_subvector etc. But if we pass the canonicalized operands for splitting we retain the optimizations.

Differential Revision: https://reviews.llvm.org/D99256
2021-03-25 10:18:44 +00:00
Simon Pilgrim
0765d78a41 [X86] vector-sext.ll - replace X32 check prefix with X86. NFC.
We typically use X32 for gnux32 triples
2020-11-17 12:39:47 +00:00
Simon Pilgrim
0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but broadcast cases was a simpler win.
2020-07-29 11:32:44 +01:00
Simon Pilgrim
17eafe0841 [X86][SSE] lowerV2I64Shuffle - use undef elements in PSHUFD mask widening
If we lower a v2i64 shuffle to PSHUFD, we currently clamp undef elements to 0, (elements 0,1 of the v4i32) which can result in the shuffle referencing more elements of the source vector than expected, affecting later shuffle combines and KnownBits/SimplifyDemanded calls.

By ensuring we widen the undef mask element we allow getV4X86ShuffleImm8 to use inline elements as the default, which are more likely to fold.
2020-07-26 16:04:22 +01:00
LemonBoy
6d103ca855 [SelectionDAG] Unify scalarizeVectorLoad and VectorLegalizer::ExpandLoad
The two code paths have the same goal, legalizing a load of a non-byte-sized vector by loading the "flattened" representation in memory, slicing off each single element and then building a vector out of those pieces.

The technique employed by `ExpandLoad`  is slightly more convoluted and produces slightly better codegen on ARM, AMDGPU and x86 but suffers from some bugs (D78480) and is wrong for BE machines.

Differential Revision: https://reviews.llvm.org/D79096
2020-05-02 15:18:10 -07:00
Craig Topper
8dfb9627b7 [X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead.
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.

This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.

Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.

Differential Revision: https://reviews.llvm.org/D76212
2020-04-15 12:17:18 -07:00
Craig Topper
3dcc0db15e [X86] Teach combineToExtendBoolVectorInReg to create opportunities for using broadcast load instructions.
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.

Previously we any_extended in the scalar domain before inserting.

This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.

This recovers regressions from 07d68c24aa19483e44db4336b0935b00a5d69949
and 9fcd212e2f678fdbdf304399a1e58ca490dc54d1 without relying on
alignment of the load.

Differential Revision: https://reviews.llvm.org/D75835
2020-03-09 11:26:12 -07:00
Craig Topper
07d68c24aa [X86] Remove isel patterns that matched vXi16 X86VBroadcast with i8->i16 aextload input.
This was selecting VBROADCASTW which turned the 8-bit load into
a 16-bit load if it happened to be 2 byte aligned.

I have a plan to fix the regression with a follow up patch
which I'll post shortly.
2020-03-08 19:16:24 -07:00
Craig Topper
3c4e635593 [X86] Always emit an integer vbroadcast_load from lowerBuildVectorAsBroadcast regardless of AVX vs AVX2
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.

Differential Revision: https://reviews.llvm.org/D75413
2020-03-03 10:39:11 -08:00
Craig Topper
9fcd212e2f [X86] Remove isel patterns from broadcast of loadi32.
We already combine non extending loads with broadcasts in DAG
combine. All these patterns are picking up is the aligned extload
special case. But the only lit test we have that exercsises it is
using v8i1 load that datalayout is reporting align 8 for. That
seems generous. So without a realistic test case I don't think
there is much value in these patterns.
2020-02-28 16:39:27 -08:00
Simon Pilgrim
651fa669a2 [TargetLowering] SimplifyDemandedBits ANY_EXTEND/ANY_EXTEND_VECTOR_INREG multi-use handling
Call SimplifyMultipleUseDemandedBits to peek through extended source args with multiple uses
2020-01-21 14:07:19 +00:00
Simon Pilgrim
b5088aa944 [X86][SSE] lowerV16I8Shuffle - tryToWidenViaDuplication - undef unpack args
tryToWidenViaDuplication lowers using the shuffle_v8i16(unpack_v16i8(shuffle_v8i16(x),shuffle_v8i16(x))) pattern, but the unpack only needs the even/odd 16i8 args if the original v16i8 shuffle mask references the even/odd elements - which isn't true for many extension style shuffles.

llvm-svn: 375342
2019-10-19 13:18:02 +00:00
Craig Topper
18e8d02e8c [X86] Pass v32i16/v64i8 in zmm registers on KNL target.
gcc and icc pass these types in zmm registers in zmm registers.

This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to 256-bit
integer registers on AVX1 where we just split all the operations.

Fixes PR42957

Differential Revision: https://reviews.llvm.org/D66708

llvm-svn: 370495
2019-08-30 17:35:08 +00:00
Craig Topper
8b5f2ab2a4 Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
The assert that caused this to be reverted should be fixed now.

Original commit message:

This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 368183
2019-08-07 16:24:26 +00:00
Mitch Phillips
bd0d97e1c4 Revert "[X86] Enable -x86-experimental-vector-widening-legalization by default."
This reverts commit 3de33245d2c992c9e0af60372043540b60f3a810.

This commit broke the MSan buildbots. See
https://reviews.llvm.org/rL367901 for more information.

llvm-svn: 368107
2019-08-06 23:00:43 +00:00
Craig Topper
3de33245d2 [X86] Enable -x86-experimental-vector-widening-legalization by default.
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.

This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.

Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.

llvm-svn: 367901
2019-08-05 18:25:36 +00:00
Simon Pilgrim
fde766de4b [X86][AVX1] Combine concat_vectors(pshufd(x,c),pshufd(y,c)) -> vpermilps(concat_vectors(x,y),c)
Bitcast v4i32 to v8f32 and back again - it might be worth adding isel patterns for X86PShufd v8i32 on AVX1 targets like we did for X86Blendi to avoid the bitcasts?

llvm-svn: 365125
2019-07-04 10:17:10 +00:00
Simon Pilgrim
32aac1727a [X86][SSE] Improve bool vector extload (PR26091)
We already have good codegen for (vXiY *ext(vXi1 bitcast(iX))) cases, this patch uses it for loads of vXi1 types as well - changing the load into a iX integer load, and bitcasting so that combineToExtendBoolVectorInReg can then use it.

Differential Revision: https://reviews.llvm.org/D62449

llvm-svn: 362081
2019-05-30 10:25:20 +00:00
Simon Pilgrim
40fa52b174 [X86] lowerBuildVectorToBitOp - support build_vector(shift()) -> shift(build_vector(),C)
Commonly occurs in sign-extension cases

llvm-svn: 361706
2019-05-25 18:02:17 +00:00
Simon Pilgrim
34d5a74b03 [X86][SSE] vector-sext - cleanup prefix lists
Add X32-SSE common prefix to merge some checks

llvm-svn: 361702
2019-05-25 16:33:17 +00:00
Craig Topper
424417da79 [X86] Use (SUBREG_TO_REG (MOV32rm)) for extloadi64i8/extloadi64i16 when the load is 4 byte aligned or better and not volatile.
Summary:
Previously we would use MOVZXrm8/MOVZXrm16, but those are longer encodings.

This is similar to what we do in the loadi32 predicate.

Reviewers: RKSimon, spatel

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60341

llvm-svn: 357875
2019-04-07 19:19:44 +00:00
Sanjay Patel
665a385035 [DAGCombiner] fold sext into decrement
This is a sibling to rL357178 that I noticed we'd hit if we chose
an alternate transform in D59818.

  %z = zext i8 %x to i32
  %dec = add i32 %z, -1
  %r = sext i32 %dec to i64
  =>
  %z2 = zext i8 %x to i64
  %r = add i64 %z2, -1

https://rise4fun.com/Alive/kPP

The x86 vector diffs show a slight regression, so there's a chance
that we should limit this and the previous transform to scalars.

But given that we allowed vectors before, I'm matching that behavior
here. We should change both transforms together if that's the right
thing to do.

llvm-svn: 357254
2019-03-29 13:49:08 +00:00
Sanjay Patel
881bcbe094 [x86] add tests for decrement+sext; NFC
llvm-svn: 357251
2019-03-29 13:34:48 +00:00
Sanjay Patel
ffa8d3def7 [DAGCombiner] fold sext into negation
As noted in D59818:
  %z = zext i8 %x to i32
  %neg = sub i32 0, %z
  %r = sext i32 %neg to i64
  =>
  %z2 = zext i8 %x to i64
  %r = sub i64 0, %z2

https://rise4fun.com/Alive/KzSR

llvm-svn: 357178
2019-03-28 15:46:02 +00:00
Sanjay Patel
e781528278 [x86] add vector test for sext of negate; NFC
llvm-svn: 357177
2019-03-28 15:30:09 +00:00
Simon Pilgrim
65165d54bb [X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW
llvm-svn: 356270
2019-03-15 16:16:49 +00:00
Craig Topper
a9697f24cf [X86] Enable custom splitting of v8i64/v16i32 sext/zext for avx/avx2 when input type will be promoted by the type legalize to 128-bits.
If the the input type will be promoted to 128 bits its better to put a sign_extend_inreg/and in the 128 bit register before the split occurs. Otherwise we end up doing it on each half in the wider register.

Some of the overflow arithmetic tests are regressions, but I think we can make some improvement using getSetccResultType in DAG combine and/or type legalization.

llvm-svn: 354709
2019-02-23 00:35:02 +00:00