8239 Commits

Author SHA1 Message Date
serge-sans-paille
38818b60c5 Move from llvm::makeArrayRef to ArrayRef deduction guides - llvm/ part
Use deduction guides instead of helper functions.

The only non-automatic changes have been:

1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There were a few similar situations across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).
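
Items 1 and 2 above can be reproduced outside LLVM. The following is a minimal sketch, assuming a hypothetical `Span` class that only mimics the two relevant `ArrayRef` constructors; `Span`, `Record`, `emptySpanSize` and `recordSize` are illustrative names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for llvm::ArrayRef, reduced to the two constructors
// that matter here. It is not the real class.
template <typename T> class Span {
  const T *Data;
  std::size_t Length;

public:
  Span(const T *D, std::size_t L) : Data(D), Length(L) {}
  Span(const T *Begin, const T *End)
      : Data(Begin), Length(static_cast<std::size_t>(End - Begin)) {
    assert(Begin <= End && "range must not be reversed");
  }
  const T *data() const { return Data; }
  std::size_t size() const { return Length; }
};

// Explicit deduction guides mirroring the two constructors.
template <typename T> Span(const T *, std::size_t) -> Span<T>;
template <typename T> Span(const T *, const T *) -> Span<T>;

// Hypothetical consumer type, standing in for something like CVSymbol.
struct Record {
  std::size_t Size;
  explicit Record(Span<std::uint8_t> Bytes) : Size(Bytes.size()) {}
};

inline std::size_t emptySpanSize(const std::uint8_t *Ptr) {
  // Span(Ptr, 0) would not compile: once Span<uint8_t> is deduced, the
  // literal 0 converts equally well to std::size_t and to a null pointer,
  // so both constructors are viable and the call is ambiguous.
  return Span(Ptr, (std::size_t)0).size();
}

inline std::size_t recordSize(const std::uint8_t *Ptr, std::size_t N) {
  Span<std::uint8_t> Storage(Ptr, N);
  // Braces keep this an object definition; with parentheses,
  // `Record R(Span(Storage));` can be parsed as a function declaration.
  Record R{Span(Storage)};
  return R.Size;
}
```

With a function template such as makeArrayRef, the pointer-pair overload is dropped during template argument deduction, which is why the ambiguity only shows up once construction goes through the class constructors.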

Per reviewers' comments, some useless makeArrayRef calls have been removed in the process.

This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.

Differential Revision: https://reviews.llvm.org/D140955
2023-01-05 14:11:08 +01:00
Roman Lebedev
dbce1110f1 [NFC][DAG] Move getOpcode_EXTEND*() helpers from X86 into SelectionDAG
To be used in an upcoming patch.
2023-01-05 01:12:30 +03:00
Roman Lebedev
e4b260efb2 [Codegen][X86] LowerBUILD_VECTOR(): improve lowering w/ multiple FREEZE-UNDEF ops
While we have great handling for UNDEF operands,
FREEZE-UNDEF operands are effectively normal operands.

We are better off "interleaving" such BUILD_VECTORS into a blend
between a splat of FREEZE-UNDEF, and "thawed" source BUILD_VECTOR,
both of which are more natural for us to handle.

Refs. f738ab9075 (r95017306)
2023-01-04 21:16:11 +03:00
Thomas Köppe
82be8a1d2b [X86] Emit RIP-relative access to local function in PIC medium code model
Currently, the medium code model for x86_64 emits position-dependent relocations (R_X86_64_64) for local functions, regardless of PIC or no-PIC mode. (In general, this means that code compiled with the medium model cannot be linked into a position-independent executable.)

Example:

```
static int g(int n) {
  return 2 * n + 3;
}

void f(int(**p)(int)) {
  *p = g;
}
```

This results in:

```
Disassembly of section .text:

0000000000000000 <f>:
       0: 48 b8 00 00 00 00 00 00 00 00	movabs	rax, 0x0
       a: 48 89 07                     	mov	qword ptr [rdi], rax
       d: c3                           	ret
```

```
Relocation section '.rela.text' at offset 0xf0 contains 1 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000002  0000000200000001 R_X86_64_64            0000000000000000 .text + 10
```

This patch changes the behaviour to unconditionally emit a RIP-relative access, both in PIC and non-PIC mode. This fixes PIC mode, and is perhaps an improvement in non-PIC mode, too, since it results in a shorter instruction. A 32-bit relocation should suffice since the medium memory model demands that all code fit within 2GiB.
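
The 32-bit claim can be checked with a little arithmetic. This is a minimal sketch, assuming a hypothetical helper `fitsInRel32` that takes the target address and the address of the byte following the referencing instruction; it only models the range check and is not linker or LLVM code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true if `Target` is reachable from `NextIP`
// (the address just past the referencing instruction) with a signed 32-bit
// displacement, as a RIP-relative operand requires.
inline bool fitsInRel32(std::uint64_t Target, std::uint64_t NextIP) {
  // Unsigned subtraction wraps; reinterpret the result as a signed distance.
  const std::int64_t Delta = static_cast<std::int64_t>(Target - NextIP);
  return Delta >= std::numeric_limits<std::int32_t>::min() &&
         Delta <= std::numeric_limits<std::int32_t>::max();
}
```

If all code lies within one 2GiB window, any two code addresses differ by less than 2^31, so the check succeeds for every local function reference. For comparison, the `movabs` above is 10 bytes long, whereas a `lea rax, [rip + disp32]` would typically encode in 7.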

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D140593
2022-12-28 11:14:39 -08:00
Evgenii Kudriashov
15dd5ed96c [X86] Support ANDNP combine through vector_shuffle
Combine
```
   and (vector_shuffle<Z,...,Z>
            (insert_vector_elt undef, (xor X, -1), Z), undef), Y
   ->
   andnp (vector_shuffle<Z,...,Z>
              (insert_vector_elt undef, X, Z), undef), Y
```
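
The combine relies on a per-lane identity: splatting `~X` and ANDing with Y gives the same result as splatting X and applying ANDNP. A minimal sketch with a hypothetical 4-lane model follows; `Vec4`, `splat`, `andLanes` and `andnp` are illustrative names, not SelectionDAG nodes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;

// Broadcast one scalar into every lane, modelling
// vector_shuffle<Z,...,Z>(insert_vector_elt undef, S, Z).
inline Vec4 splat(std::uint32_t S) { return {S, S, S, S}; }

// Plain per-lane AND.
inline Vec4 andLanes(const Vec4 &A, const Vec4 &B) {
  return {A[0] & B[0], A[1] & B[1], A[2] & B[2], A[3] & B[3]};
}

// ANDNP semantics: the first operand is inverted, then ANDed, per lane.
inline Vec4 andnp(const Vec4 &A, const Vec4 &B) {
  return {~A[0] & B[0], ~A[1] & B[1], ~A[2] & B[2], ~A[3] & B[3]};
}
```

For any X and Y, `andLanes(splat(~X), Y)` equals `andnp(splat(X), Y)`, so the explicit xor with -1 can be dropped.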

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D138521
2022-12-22 16:55:14 +08:00
Craig Topper
eeb8de9363 [X86] Replace getOperand calls with an existing variable. NFC 2022-12-20 19:27:11 -08:00
Qiu Chaofan
a40ef656d8 [Intrinsic] Rename flt.rounds intrinsic to get.rounding
Address the inconsistency between FLT_ROUNDS_ and SET_ROUNDING SDAG
node. Rename FLT_ROUNDS_ to GET_ROUNDING and add llvm.get.rounding
intrinsic to replace flt.rounds.
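
For readers unfamiliar with the value being queried, here is a minimal sketch in portable C++, assuming the renamed intrinsic keeps the C `FLT_ROUNDS` encoding of its predecessor. The helper name `currentRoundingAsFltRounds` is hypothetical.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: maps the current <cfenv> rounding mode onto the
// FLT_ROUNDS encoding from the C standard:
//   0 = toward zero, 1 = to nearest, 2 = toward +inf, 3 = toward -inf,
//  -1 = indeterminable.
inline int currentRoundingAsFltRounds() {
  switch (std::fegetround()) {
  case FE_TOWARDZERO:
    return 0;
  case FE_TONEAREST:
    return 1;
  case FE_UPWARD:
    return 2;
  case FE_DOWNWARD:
    return 3;
  default:
    return -1;
  }
}
```

The getter pairs naturally with `std::fesetround`, which is the same get/set symmetry the rename is after.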

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D139507
2022-12-19 15:22:39 +08:00
Simon Pilgrim
37c3b83bd8 [X86] combineBitcastvxi1 - handle boolmask sign-extension through vselect
See if we can freely sign-extend both sources of a vselect operand, also handle allones constant build vectors (easily rematerializable and used in the test case).

Fixes #59526
2022-12-15 16:40:44 +00:00
Matt Arsenault
c16a58b36c Attributes: Add function getter to parse integer string attributes
The most common case for string attributes parses them as integers. We
don't have a convenient way to do this, and as a result we have
inconsistent missing attribute and invalid attribute handling
scattered around. We also have inconsistent radix usage to
getAsInteger; some places use the default 0 and others use base 10.
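
The radix inconsistency is easy to demonstrate with the standard library. This is a minimal sketch, assuming a hypothetical helper `parseIntAttr` built on `std::stoll`, whose base 0 auto-detects a prefix in the same spirit as a radix of 0; it is not the getter added by this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Hypothetical helper: parses an attribute value as an integer and returns
// std::nullopt for a missing (empty) or invalid value, so that both cases
// are handled in one place.
inline std::optional<long long> parseIntAttr(const std::string &Value,
                                             int Radix) {
  if (Value.empty())
    return std::nullopt;
  try {
    std::size_t Consumed = 0;
    const long long Result = std::stoll(Value, &Consumed, Radix);
    // Reject trailing characters such as the "x10" left over from "0x10".
    if (Consumed != Value.size())
      return std::nullopt;
    return Result;
  } catch (const std::exception &) {
    // std::stoll throws on unparsable or out-of-range input.
    return std::nullopt;
  }
}
```

With this helper, `parseIntAttr("010", 0)` yields 8 while `parseIntAttr("010", 10)` yields 10, which is the kind of divergence a single getter avoids.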

Update a few of the uses, but there are quite a lot of these.
2022-12-14 13:12:35 -05:00
Simon Pilgrim
463910ab2a [X86] Don't fold scalar_to_vector(i64 C) -> vzext_movl(scalar_to_vector(i32 C))
Fixes constant-folding infinite loop reported by @uabelho on rG5ca77541446d
2022-12-14 12:11:06 +00:00
Simon Pilgrim
4f41ea2016 [X86] lowerShuffleAsVTRUNC - bit shift the offset elements into place instead of shuffle
This helps avoid issues on non-BWI targets which can end up splitting the shuffles to 2 x 256-bit bitshifts of a smaller scalar width
2022-12-14 11:41:14 +00:00
Simon Pilgrim
b3eaf40166 [X86] lowerShuffleAsVTRUNC - improve detection of cheap/free vector concatenation
Handle the case where the lo/hi subvectors are a split load.
2022-12-14 10:49:44 +00:00
Phoebe Wang
57f71dccd3 [NFC] Fix duplicated Src 2022-12-13 22:44:28 +08:00
Simon Pilgrim
4177e6cd4f [X86] lowerShuffleAsVTRUNC - support offseted truncations
Extend the <0,Scale,2*Scale,..> pattern to allow for a fixed offset <Offset,Offset+Scale,Offset+2*Scale,..> pattern, which will lower to a single additional bitshift/pshufd.
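
The extended pattern can be sketched as a mask check. This is a minimal illustration, assuming a hypothetical function `isOffsetTruncationMask`, that negative mask entries mean undef, and that the offset is restricted to [0, Scale); it is not the matcher used in lowerShuffleAsVTRUNC.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matcher: returns true if Mask has the form
// <Offset, Offset+Scale, Offset+2*Scale, ...>.
// Assumption: negative entries stand for undef lanes and match anything.
inline bool isOffsetTruncationMask(const std::vector<int> &Mask, int Scale,
                                   int Offset) {
  if (Scale <= 0 || Offset < 0 || Offset >= Scale)
    return false;
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    if (Mask[I] < 0)
      continue; // undef lane
    if (Mask[I] != Offset + static_cast<int>(I) * Scale)
      return false;
  }
  return true;
}
```

With Offset equal to 0 this reduces to the original <0,Scale,2*Scale,..> pattern.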

At the moment I've limited this to cases where the LHS/RHS operands are concatenated for free, but this is only to avoid a couple of regressions that should be easily addressable in followups.
2022-12-13 14:00:35 +00:00
Kazu Hirata
20cde15415 [Target] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
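
As an illustration of the spelling being migrated to, here is a minimal sketch using only the standard library. The function `findFirstSetBit` is a hypothetical example, not code touched by this patch.

```cpp
#include <cassert>
#include <optional>

// Hypothetical example: index of the lowest set bit, or no value for 0.
inline std::optional<unsigned> findFirstSetBit(unsigned Value) {
  if (Value == 0)
    return std::nullopt; // the llvm::Optional spelling was `return None;`
  unsigned Index = 0;
  while ((Value & 1u) == 0) {
    Value >>= 1;
    ++Index;
  }
  return Index;
}
```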

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-02 20:36:06 -08:00
Krzysztof Parzyszek
864aaa21b4 TargetLowering: convert Optional to std::optional 2022-12-01 16:19:10 -08:00
Phoebe Wang
54ebf1c4a1 [X86][FP16] Do not combine fminnum/fmaxnum for FP16 emulation
When FP16 is emulated, we lack native fmin/fmax instruction support.

Fixes #59258

Reviewed By: skan, spatel

Differential Revision: https://reviews.llvm.org/D139078
2022-12-01 23:24:40 +08:00
Freddy Ye
89f36dd8f3 [X86] Add ExpandLargeFpConvert Pass and enable for X86
As stated in
https://discourse.llvm.org/t/rfc-llc-add-expandlargeintfpconvert-pass-for-fp-int-conversion-of-large-bitint/65528,
this implementation is very similar to ExpandLargeDivRem, which expands
‘fptoui .. to’, ‘fptosi .. to’, ‘uitofp .. to’, ‘sitofp .. to’ instructions
with a bitwidth above a threshold into auto-generated functions. This is
useful for targets like x86_64 that cannot lower fp conversions with more
than 128 bits. The expanded code is derived from the IR generated from
`compiler-rt/lib/builtins/floattidf.c`, `compiler-rt/lib/builtins/fixdfti.c`,
etc.

Corner cases:
1. For fp16: there are no related builtins in compiler-rt, so I
mainly used the fp32 <-> fp16 lib calls to implement it.
2. For fp80: this pass is soft fp emulation, and no fp80 instructions can
help with this problem. I recommend that users deprecate this usage. For now, the
implementation uses fp128 as the temporary conversion type and inserts
fptrunc/ext at top/end of the function.
3. For bf16: as the clang FE currently doesn't support bf16 arithmetic operations
(convert to int, float, +, -, *, ...), this patch doesn't consider bf16 for
now.
4. For unsigned FPToI: both the default hardware behavior and libgcc
ignore the "returns 0 for negative input" spec, so this pass follows the
same convention for unsigned FPToI. See this example:
https://gcc.godbolt.org/z/bnv3jqW1M

The end-to-end tests are uploaded at https://reviews.llvm.org/D138261

Reviewed By: LuoYuanke, mgehre-amd

Differential Revision: https://reviews.llvm.org/D137241
2022-12-01 13:47:43 +08:00
Simon Pilgrim
c757780c62 [X86] lowerShuffleAsDecomposedShuffleMerge - try to match unpck(permute(x),permute(y)) for v4i32/v2i64 shuffles
We're using lowerShuffleAsPermuteAndUnpack, which can probably be improved to handle 256/512-bit types pretty easily.

First step towards trying to address the poor vector-shuffle-sse4a.ll pre-SSSE3 codegen mentioned on D127115
2022-11-25 16:24:56 +00:00
Simon Pilgrim
38275ab1b3 [X86] Move lowerShuffleAsPermuteAndUnpack earlier in the source next to similar helpers. NFC.
I'm currently investigating using this inside lowerShuffleAsDecomposedShuffleMerge
2022-11-25 14:56:38 +00:00
Simon Pilgrim
6fd0ae39be [X86] combineScalarAndWithMaskSetcc - handle (concat_vectors (and (vYi1 setcc, vYi1 x), undef)) patterns
If one of the AND operands is a setcc then we're implicitly zeroing the upper mask bits

Similar pattern to regressions identified in D127115 (masked comparisons)
2022-11-25 11:16:24 +00:00
Simon Pilgrim
dbe2f44316 [X86] combineScalarAndWithMaskSetcc - optionally peek through (oneuse) any_extend node
Extend pass to handle: (and (any_extend (bitcast (vXi1 (concat_vectors (vYi1 setcc), undef,)))), C)

Fixes several regressions identified in D127115
2022-11-24 16:26:35 +00:00
Phoebe Wang
7218103bca [X86] Use lock add/sub/or/and/xor for cases that we only care about the EFLAGS (negated cases)
This fixes #58685

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D138428
2022-11-23 09:39:04 +08:00
Davide Italiano
0c011335c9 [X86] Don't lower f16->f80 fpext to libcall on darwin.
We don't provide __extendhfxf2, and only have the soft-float
__extendhfsf2 in compiler-rt.  This only changed recently with
655ba9c8a1d2, so this patch reverts back to the previous behavior.

However, the f80->f16 fptrunc is not easily implementable without
the compiler-rt __truncxfhf2, but that has always been true, and
isn't an immediate regression.

Patch by Ahmed Bougacha.

rdar://102194995
2022-11-22 12:32:22 -08:00
Phoebe Wang
b39b76f2ef [X86] Allow no X87 on 32-bit
This patch is an alternative to D100091. It solves the problems in `f80` type lowering.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D137946
2022-11-22 10:47:47 +08:00
Benjamin Kramer
e2bff1e489 [X86] Fix atomic rmw intrinsic expansion for non-opaque pointers
This is a bit annoying, but there are still users out there that got
broken by this (this time it was numba). We need to keep some barebones
support around until non-opaque pointers are completely gone.
2022-11-20 15:39:30 +01:00
Phoebe Wang
510e5fba16 [X86] Use lock or/and/xor for cases that we only care about the EFLAGS
This is a follow-up to D137711 to fix the rest of #58685.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D138294
2022-11-20 10:42:48 +08:00
Phoebe Wang
d558255650 [X86] Use lock add/sub for cases that we only care about the EFLAGS
This fixes #36373, #36905 and part of #58685.
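
Source patterns of the kind this targets look roughly like the following. This is a minimal sketch with hypothetical function names; whether a given compiler version actually emits `lock sub`/`lock add` rather than `lock xadd` for them depends on the version and flags and is not asserted here.

```cpp
#include <atomic>
#include <cassert>

// Only "did the new value reach zero?" is observed, so the old value
// returned by fetch_sub is never needed in a register; the flags set by the
// locked subtraction carry enough information.
inline bool releaseRef(std::atomic<int> &Refs) {
  return Refs.fetch_sub(1, std::memory_order_acq_rel) - 1 == 0;
}

// Same idea for an addition: only the sign of the new value is used.
inline bool addAndTestNegative(std::atomic<int> &Counter, int Delta) {
  return Counter.fetch_add(Delta, std::memory_order_seq_cst) + Delta < 0;
}
```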

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D137711
2022-11-18 21:43:47 +08:00
Stanislav Mekhanoshin
bcaf31ec3f [AMDGPU] Allow finer grain control of an unaligned access speed
A target can report whether a misaligned access is 'fast', as defined
by the target. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.

A target can still define it as it wants and the direct translation
of the current code uses 0 and 1 for current false and true. This
makes the change an NFC.

A subsequent patch will start using the actual speed value in
the load/store vectorizer to check that a vectorized access is
not just fast, but also not slower than before.
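
The shape of the interface change can be sketched as follows. Everything here is hypothetical: the function names, the parameters and the speed policy are made up for illustration and do not reflect the real allowsMisalignedMemoryAccesses signatures or any target's behavior.

```cpp
#include <cassert>

// Hypothetical target hook. Returns whether the misaligned access is
// allowed; *Speed receives 0 for "slow" and larger values for faster
// accesses (previously this out-parameter would have been a bool).
inline bool allowsMisalignedAccess(unsigned SizeInBits, unsigned AlignInBytes,
                                   unsigned *Speed) {
  if (SizeInBits > 128) {
    if (Speed)
      *Speed = 0;
    return false;
  }
  if (Speed)
    *Speed = (AlignInBytes >= 4) ? 2 : 1; // made-up policy for illustration
  return true;
}

// Hypothetical vectorizer-side check: the wide access must be fast at all
// and must not be slower than the accesses it replaces.
inline bool vectorizedAccessIsNotSlower(unsigned ScalarSpeed,
                                        unsigned VectorSpeed) {
  return VectorSpeed != 0 && VectorSpeed >= ScalarSpeed;
}
```

A target that keeps returning only 0 and 1 behaves exactly as with the old boolean, which is why the change itself is an NFC.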

Differential Revision: https://reviews.llvm.org/D124217
2022-11-17 09:23:53 -08:00
Simon Pilgrim
ff252e6b13 [X86] combineConcatVectorOps - don't concat(vselect,vselect) if the concatenated selection mask isn't legal
One of the crash regression tests now exposes an existing issue with SelectionDAG::simplifySelect not folding vselect with constant masks

Fixes #59003
2022-11-16 11:49:14 +00:00
Tim Northover
2bcf51c7f8 X86: call fp16-conversion functions soft-float on Darwin.
We've been shipping implementations of these with a soft-float ABI since MacOS
10.10 in 2014 and there's evidence they're in binaries now, so we can't easily
switch to %xmm0.

This emits special libcalls with casts in place to restore the soft-float ABI
for __truncdfhf2, __truncsfhf2, and __extendhfsf2.
2022-11-10 10:00:01 +00:00
Nathan James
6aa050a690 Reland "[llvm][NFC] Use c++17 style variable type traits"
This reverts commit 632a389f96355cbe7ed8fa7b8d2ed6267c92457c.

This relands commit
1834a310d060d55748ca38d4ae0482864c2047d8.

Differential Revision: https://reviews.llvm.org/D137493
2022-11-08 14:15:15 +00:00
Nathan James
632a389f96 Revert "[llvm][NFC] Use c++17 style variable type traits"
This reverts commit 1834a310d060d55748ca38d4ae0482864c2047d8.
2022-11-08 13:11:41 +00:00
Nathan James
1834a310d0 [llvm][NFC] Use c++17 style variable type traits
This was done as a test for D137302 and it makes sense to push these changes
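
For reference, the two spellings are interchangeable. A minimal sketch with hypothetical helper names `isIntegralOld` and `isIntegralNew`:

```cpp
#include <cassert>
#include <type_traits>

// Pre-C++17 spelling: go through the nested ::value member.
template <typename T> constexpr bool isIntegralOld() {
  return std::is_integral<T>::value;
}

// C++17 spelling: use the _v variable template.
template <typename T> constexpr bool isIntegralNew() {
  return std::is_integral_v<T>;
}

static_assert(isIntegralOld<int>() == isIntegralNew<int>());
static_assert(isIntegralOld<float>() == isIntegralNew<float>());
```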

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D137493
2022-11-08 12:22:52 +00:00
Simon Pilgrim
90ec51a9ab [X86] combineConcatVectorOps - fold 512-bit concat(GF2P8AFFINEQB(x,y,c),GF2P8AFFINEQB(z,w,c)) -> GF2P8AFFINEQB(concat(x,z),concat(y,w),c)
Now that D137036 has landed, we just need AVX512F support to generate 512-bit GF2P8AFFINEQB ops
2022-11-01 12:06:46 +00:00
Freddy Ye
aee2a35ac4 [X86] Add AVX-NE-CONVERT instructions.
For more details about these instructions, please refer to the latest ISE document: https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D135930
2022-10-31 23:39:38 +08:00
Simon Pilgrim
b172c7e193 [X86] combineConcatVectorOps - fold concat(GF2P8AFFINEQB(x,y,c),GF2P8AFFINEQB(z,w,c)) -> GF2P8AFFINEQB(concat(x,z),concat(y,w),c)
Pulled out of D137026
2022-10-31 12:27:57 +00:00
Freddy Ye
23f02693ec [X86] Add AVX-VNNI-INT8 instructions.
For more details about these instructions, please refer to the latest ISE document: https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Reviewed By: pengfei, skan

Differential Revision: https://reviews.llvm.org/D135938
2022-10-28 10:39:54 +08:00
Phoebe Wang
b51b90d6e2 [X86][1/2] SUPPORT RAO-INT
For more details about these instructions, please refer to the latest ISE document: https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Initially authored by Liu Chen (@LiuChen3)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D135951
2022-10-27 17:20:07 +08:00
Simon Pilgrim
ed1b0da557 [X86] combineConcatVectorOps - fold v4i64/v8x32 concat(broadcast(),broadcast()) -> permilps(concat())
Extend the existing v4f64 fold to handle v4i64/v8f32/v8i32 as well

Fixes #58585
2022-10-25 15:37:42 +01:00
Simon Pilgrim
c4051b2606 [X86] Fold vbroadcast(bitcast(vbroadcast(src))) -> bitcast(vbroadcast(vbroadcast(src)))
If the inner broadcast scalar type is smaller/same width as the outer broadcast scalar type then we can broadcast using the same inner type directly. Works for vbroadcast_load as well.
2022-10-25 14:03:43 +01:00
Freddy Ye
fdac4c4e92 [X86] Add CMPCCXADD instructions.
For more details about these instructions, please refer to the latest ISE document: https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Reviewed By: pengfei, skan

Differential Revision: https://reviews.llvm.org/D135933
2022-10-25 14:33:39 +08:00
Simon Pilgrim
4e8f847676 [X86][AVX512] Fold extract_element(bitcast(<X x i1>) -> bitcast(extract_subvector())
On AVX512, extract legal bool vectors as bool subvectors before bitcasting to scalars to avoid spilling to stack.

This helps Rust, which internally represents bool vectors as bool arrays

It also exposes more missed opportunities to use the KADD instruction to add masks together before moving to gpr

Fixes #58546
2022-10-23 14:47:24 +01:00
Simon Pilgrim
c175d880a4 [X86] Add freeze(pshufd/permilps(x,imm)) -> pshufd/permilps(freeze(x),imm) folding
Add X86 isGuaranteedNotToBeUndefOrPoisonForTargetNode / canCreateUndefOrPoisonForTargetNode overrides and add X86ISD::PSHUFD/VPERMILPI handling.
2022-10-23 10:39:12 +01:00
Kazu Hirata
5bb00cd309 [llvm] Use llvm::is_contained (NFC) 2022-10-22 08:57:37 -07:00
Xiang1 Zhang
661881d436 [X86] Add AMX-FP16 instructions.
Differential Revision: https://reviews.llvm.org/D135941
2022-10-22 08:05:22 +08:00
Simon Pilgrim
5ca7754144 [X86] Fold scalar_to_vector(i64 zext(x)) -> bitcast(vzext_movl(scalar_to_vector(i32 x)))
Extends existing anyextend fold to make use of the implicit zero-extension of the movd instruction

This also helps replace some nasty xmm->gpr->xmm traffic with a shuffle pattern instead

Noticed while looking at D130953
2022-10-21 10:40:13 +01:00
Phoebe Wang
bc1819389f [X86][RFC] Using __bf16 for AVX512_BF16 intrinsics
This is an alternative to D120395 and D120411.

Previously we used `__bfloat16` as a typedef of `unsigned short`. The
name may give users the impression that it is a brand new type representing
BF16, so they may use it in arithmetic operations, and we don't have
a good way to block that.

To solve the problem, we introduced `__bf16` to the X86 psABI and landed the
support in Clang by D130964. Now we can solve the problem by switching the
intrinsics to the new type.

Reviewed By: LuoYuanke, RKSimon

Differential Revision: https://reviews.llvm.org/D132329
2022-10-19 23:47:04 +08:00
Han Zhu
d0d48a91f8 [X86] Lower vector interleave into unpck and perm
[This Godbolt link](https://godbolt.org/z/s17Kv1s9T) shows different codegen between clang and gcc for a transpose operation.

clang result:
```
        vmovdqu xmm0, xmmword ptr [rcx + rax]
        vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
        vmovdqu xmm2, xmmword ptr [r8 + rax]
        vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
        vpunpckhbw      xmm4, xmm2, xmm0
        vpunpcklbw      xmm0, xmm2, xmm0
        vpunpcklbw      xmm2, xmm3, xmm1
        vpunpckhbw      xmm1, xmm3, xmm1
        vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
        vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
        vmovdqu xmmword ptr [rdi + 2*rax], xmm0
        vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4
```
gcc result:
```
        vmovdqu ymm3, YMMWORD PTR [rdi+rax]
        vpunpcklbw      ymm1, ymm3, YMMWORD PTR [rsi+rax]
        vpunpckhbw      ymm0, ymm3, YMMWORD PTR [rsi+rax]
        vperm2i128      ymm2, ymm1, ymm0, 32
        vperm2i128      ymm1, ymm1, ymm0, 49
        vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
        vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1
```
clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.

The loop vectorizer generates the following shufflevector intrinsic:
```
%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
```
which is lowered to SelectionDAG:
```
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7
```

So far this `vector_shuffle` is good enough for us to pattern-match and transform, but as we go down the SelectionDAG pipeline, it gets split into smaller shuffles. During dagcombine1, the shuffle is split by `foldShuffleOfConcatUndefs`.
```
  // shuffle (concat X, undef), (concat Y, undef), Mask -->
  // concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1
```

With `foldShuffleOfConcatUndefs` commented out, the vector is still split later by the type legalizer, which comes after dagcombine1, because v64i8 is not a legal type in AVX2 (64 * 8 = 512 bits while ymm = 256 bits). There doesn't seem to be a good way to avoid this split. Lowering the `vector_shuffle` into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to go with pattern-matching a pair of vector shuffles later in the SelectionDAG pipeline, as part of `lowerV32I8Shuffle`.

The code looks at the two operands of the first shuffle it encounters, iterates through the users of the operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that's currently being lowered (have ISel modify the DAG), and replaces the other shuffle in place.
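
The mask shapes involved can be sketched as follows. This is a minimal model of the masks only, assuming hypothetical helpers `interleaveMask`, `halfInterleaveMask` and `areConsecutiveInterleaves`; it does not walk DAG users or emit unpck/perm nodes as the patch does.

```cpp
#include <cassert>
#include <vector>

// Full interleave mask of two N-element vectors: <0, N, 1, N+1, ...>.
inline std::vector<int> interleaveMask(int N) {
  std::vector<int> Mask;
  Mask.reserve(2 * N);
  for (int I = 0; I < N; ++I) {
    Mask.push_back(I);
    Mask.push_back(N + I);
  }
  return Mask;
}

// Mask of one N-element half of that interleave (Half 0 = low, 1 = high),
// i.e. the shape of the two shuffles left after the split shown above.
inline std::vector<int> halfInterleaveMask(int N, int Half) {
  std::vector<int> Mask;
  Mask.reserve(N);
  const int Begin = Half * (N / 2);
  for (int I = Begin; I < Begin + N / 2; ++I) {
    Mask.push_back(I);
    Mask.push_back(N + I);
  }
  return Mask;
}

// True if Lo and Hi are the low and high halves of one full interleave.
inline bool areConsecutiveInterleaves(const std::vector<int> &Lo,
                                      const std::vector<int> &Hi) {
  const int N = static_cast<int>(Lo.size());
  return N > 0 && N % 2 == 0 && Lo == halfInterleaveMask(N, 0) &&
         Hi == halfInterleaveMask(N, 1);
}
```

For N = 32, the two half masks are <0,32,...,15,47> and <16,48,...,31,63>, matching t19 and t20 above, and their concatenation is the original 64-element interleave.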

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D134477
2022-10-17 11:39:27 -07:00
Xiang1 Zhang
aad013de41 [InlineAsm][bugfix] Correct function addressing in inline asm
In the Linux PIC model, there are 4 cases of value/label addressing:
Case 1: Function call or label jmp inside the module.
Case 2: Data access (such as global variable, static variable) inside the module.
Case 3: Function call or label jmp outside the module.
Case 4: Data access (such as global variable) outside the module.

Because the current LLVM inline asm architecture is designed not to "recognize" the asm
code, it is quite troublesome to treat memory addressing differently for the
same value/address used in different instructions.
For example, in the PIC model, a function call may go through the PLT or be directly PC-relative,
but lea/mov of a function address may use the GOT.

This patch fixes/refines case 1 and case 2 in inline asm.
Because inline asm currently doesn't support jmp to an outside label, this patch
mainly focuses on fixing the function call addressing bugs in inline asm.

Reviewed By: Pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D133914
2022-10-14 09:47:26 +08:00