52796 Commits

Author SHA1 Message Date
Amara Emerson
eaab3245d4
[GlobalISel] Add constant folding support for G_FMA/G_FMAD in the combiner. (#65659) 2023-09-09 16:32:02 +08:00
liqin.weng
1b622fff44 [VP] IR expansion for inttoptr/ptrtoint
Add basic handling for VP ops that can expand to cast intrinsics

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D159478
2023-09-09 15:34:46 +08:00
Thomas
0a7a926007
[NVPTX] Make i16x2 a native type and add supported vec instructions (#65799)
recommit https://github.com/llvm/llvm-project/pull/65432 with minor bug
fix for bitcasts
2023-09-08 13:44:58 -07:00
Philip Reames
0a298277d3 [RISCV] Add a negative case for strided load matching
This covers the bug identified in review of pr 65777.
2023-09-08 13:33:52 -07:00
Dmitri Gribenko
b3a14cac4f Revert "[NVPTX] Make i16x2 a native type and add supported vec instructions (#65432)"
This reverts commit db5d845c73ee2d64f1a5bab3fc72edece9e3a7ba.

As per PR discussion "Looks like we've missed lowering of bitcasts
between v2f16 and v2i16 and it breaks XLA."
2023-09-08 19:28:15 +02:00
Philip Reames
5ee7dc04de
[RISCV] Match gather(splat(ptr)) as zero strided load (#65769)
We were already handling the case where the broadcast was being done via
a GEP, but we hadn't handled the case of a broadcast via a shuffle.
2023-09-08 09:37:52 -07:00
Philip Reames
8dd87a5f57 [RISCV] Add gather test coverage for non-intptr index widths
Note that these are non-canonical.  At IR, we will generally canonicalize to the intptrty width form if knownbits allows.
2023-09-08 08:53:19 -07:00
David Stuttard
8c03239934
[AMDGPU] New intrinsic void llvm.amdgcn.s.nop(i16) (#65757)
This allows front ends to insert s_nops - this is most often when a
delay less
than s_sleep 1 is required.
2023-09-08 16:24:10 +01:00
Jay Foad
8669a9f93a
[AMDGPU] Cope with SelectionDAG::UpdateNodeOperands returning a different SDNode (#65765)
SITargetLowering::adjustWritemask calls SelectionDAG::UpdateNodeOperands
to update an EXTRACT_SUBREG node in-place to refer to a new IMAGE_LOAD
instruction, before we delete the old IMAGE_LOAD instruction. But in
UpdateNodeOperands can do CSE on the fly and return a different
EXTRACT_SUBREG node, so the original EXTRACT_SUBREG node would still
exist and would refer to the old deleted IMAGE_LOAD instruction. This
caused errors like:

t31: v3i32,ch = <<Deleted Node!>> # D:1
This target-independent node should have been selected!
UNREACHABLE executed at lib/CodeGen/SelectionDAG/InstrEmitter.cpp:1209!

Fix it by detecting the CSE case and replacing all uses of the original
EXTRACT_SUBREG node with the CSE'd one.

Recommit with a fix for a use-after-free bug in the first version of
this patch (#65340) which was caught by asan.
2023-09-08 16:16:02 +01:00
Phoebe Wang
24194090e1 [X86][RFC] Add new option -m[no-]evex512 to disable ZMM and 64-bit mask instructions for AVX512 features
This is an alternative of D157485 and a pre-feature to support AVX10.

AVX10 Architecture Specification: https://cdrdv2.intel.com/v1/dl/getContent/784267
AVX10 Technical Paper: https://cdrdv2.intel.com/v1/dl/getContent/784343
RFC: https://discourse.llvm.org/t/rfc-design-for-avx10-feature-support/72661

Based on the feedbacks from LLVM and GCC community, we have agreed to
start from supporting `-m[no-]evex512` on existing AVX512 features.
The option `-mno-evex512` can be used with `-mavx512xxx` to build
binaries that can run on both legacy AVX512 targets and AVX10-256.

There're still arguments about what's the expected behavior when this
option as well as `-mavx512xxx` used together with `-mavx10.1-256`. We
decided to defer the support of `-mavx10.1` after we made consensus.
Or furthermore, we start from supporting AVX10.2 and not providing any
AVX10.1 options.

Reviewed By: RKSimon, skan

Differential Revision: https://reviews.llvm.org/D159250
2023-09-08 22:47:22 +08:00
Amara Emerson
730e8f659d [AArch64][GlobalISel] Fix global offset folding combine inserting MIs into wrong place.
Was causing use-before-def issues. Not sure how it remained undetected for so long.
2023-09-08 06:28:12 -07:00
Jay Foad
dd5af895bb
[AMDGPU] Mark S_NOP as having side effects (#65745)
This prevents S_NOP from being rescheduled past other (side-effecting)
instructions, which is useful because it is generally used to introduce
a short delay or to avoid hazards. Currently this only affects MIR tests
because the compiler itself only inserts nops in PostRAHazardRecognizer
which runs after all scheduling.
2023-09-08 14:05:56 +01:00
Phoebe Wang
da1eb886c4
[X86] Do not check alignment for VINSERTPS (#65721)
We don't have alignment constraint in AVX instructions.
2023-09-08 19:23:43 +08:00
Qiu Chaofan
d4d0b5eaab Fix MIR failure after b922a362 2023-09-08 16:33:45 +08:00
David Green
a82c106e57 [ARM] Change CRC predicate to just HasCRC
This removes the backend requirement for crc instructions on HasV8, relying on
just HasCRC instead. This should allow them to be selected with ArmV7 + crc,
making them more usable whilst hopefully not making them incorrectly generated
(they only come from intrinsics, and HasCRC usually requires HasV8). This is
how most other instructions are specified.
2023-09-08 09:02:15 +01:00
Qiu Chaofan
b922a36211 [PowerPC] Define SchedModel for Power8
PowerPC subtargets prior to Power9 use the 'legacy' itinerary way to
provide scheduling information. This patch re-writes the tablegen file
to define the scheduling information in the new SchedModel way, which
can bring improvements to some benchmarks.

Reviewed By: shchenz

Differential Revision: https://reviews.llvm.org/D154488
2023-09-08 15:43:21 +08:00
Phoebe Wang
2e44b07e24
[X86] Do not directly fold for VINSERTPS (#65718)
We have already customized folding for VINSERTPS by 7e6606f4f1, which do
the folding when alignment >= 4 bytes.

We cannot arbitrarily fold it like others because we need to calculate
the source offset.
2023-09-08 15:35:44 +08:00
bzEq
d9efcb54c9
[PEI][PowerPC] Fix false alarm of stack size limit (#65559)
PPC64 allows stack size up to ((2^63)-1) bytes. Currently llc reports
```
warning: stack frame size (4294967568) exceeds limit (4294967295) in function 'main'
```
if the stack allocated is larger than 4G.
2023-09-08 15:16:00 +08:00
Phoebe Wang
11c3b979e6 [X86][NFC] Add a test case to show wrong memory folding for vinsertps 2023-09-08 14:30:33 +08:00
Nicolai Hähnle
2eb767c9e1
AMDGPU: Scratch instructions are trivially disjoint from SMEM and buffer instructions (#65287)
Scratch instructions are always in addrspace(5), which can only alias
with flat (and itself). SMEM and buffer instructions can never reference
those address spaces, so they are trivially disjoint.
2023-09-08 07:43:36 +02:00
Amy Kwan
3f46e5453d [AIX][TLS] Produce a faster local-exec access sequence with -maix-small-local-exec-tls (And optimize when load/store offsets are 0)
This patch utilizes the -maix-small-local-exec-tls option added in
D155544 to produce a faster access sequence for the local-exec TLS
model, where loading from the TOC can be avoided.

The patch either produces an addi/la with a displacement off of r13
(the thread pointer) when the address is calculated, or it produces an
addi/la followed by a load/store when the address is calculated and
used for further accesses.

This patch also optimizes this sequence a bit more where we can remove
the addi/la when the load/store offset is 0. A follow up patch will
be posted to account for when the load/store offset is non-zero, and
currently in these situations we keep the addi/la that precedes the
load/store.

Furthermore, this access sequence is only performed for TLS variables
that are less than ~32KB in size.

Differential Revision: https://reviews.llvm.org/D155600
2023-09-07 20:05:29 -05:00
Amy Kwan
8bdbee8aaa [AIX][TLS] Add target attribute for -maix-small-local-exec-tls option.
This patch adds a target attribute for an AIX-specific option that
informs the compiler that it can use a faster access sequence for the
local-exec TLS model (formally named aix-small-local-exec-tls).

The Clang portion of this option is in D155544.
The initial implementation to generate the faster access sequence is in
D155600.

Differential Revision: https://reviews.llvm.org/D156203
2023-09-07 20:05:29 -05:00
Philip Reames
b4a99f1cd6
[RISCV] Lower constant build_vectors with few non-sign bits via vsext (#65648)
If we have a build_vector such as [i64 0, i64 3, i64 1, i64 2], we
instead lower this as vsext([i8 0, i8 3, i8 1, i8 2]). For vectors with
4 or fewer elements, the resulting narrow vector can be generated via
scalar materialization.

For shuffles which get lowered to vrgathers, constant build_vectors of
small constants are idiomatic. As such, this change covers all shuffles
with an output type of 4 or less.

I deliberately started narrow here. I think it makes sense to expand
this to longer vectors, but we need a more robust profit model on the
recursive expansion. It's questionable if we want to do the zsext if
we're going to generate a constant pool load for the narrower type
anyways.

One possibility for future exploration is to allow the narrower VT to be
less than 8 bits. We can't use vsext for that, but we could use
something analogous to our widening interleave lowering with some extra
shifts and ands.
2023-09-07 16:01:16 -07:00
Jeffrey Byrnes
5044531afd
[AMDGPU] Teach CalculateByteProvider about AMDGPUISD::PERM (#65547)
As a standalone patch, it has limited effect. However, it is necessary
as it supports upcoming commits.
2023-09-07 15:13:42 -07:00
stefanp-ibm
0a4a8bec34
[PowerPC] Turn string pooling on by default. (#65628)
This patch turns the string pooling pass on by default. Some tests are
updated as required.
2023-09-07 16:49:31 -04:00
Jeffrey Byrnes
7fda1b74be [AMDGPU]: Allow combining into v_dot4
Differential Revision: https://reviews.llvm.org/D155995

Change-Id: I794f540217f0f84141338757b41b1be0493c7207
2023-09-07 12:58:48 -07:00
Philip Reames
063524e35a [RISCV] Add coverage for missing gather/scatter combines 2023-09-07 11:57:30 -07:00
Wael Yehia
11d5c7bd28 [AIX] Add threadId and use nanosecond timestamp in sinit/sterm symbols
With ThinLTO, when compiling SPEC 2017 omnetpp_r with -threads=4, two
small modules can end up with the same timestamp in their sinit symbols
when calculating time in seconds, creating duplicate definitions.

This patch uses a timestamp in nanoseconds.
Because the race can be between threads, embed the thread ID as well.

Reviewed By: xingxue, daltenty

Differential Revision: https://reviews.llvm.org/D159319
2023-09-07 17:46:41 +00:00
Amy Kwan
f94f85348d Revert "[AIX][TLS] Generate .extern and .ref references to __tls_get_addr for local-exec accesses."
This reverts commit f0b2f6954101c9052763a99a1e7ac135770e779a.
The implementation is incorrect and breaks compiling local-exec programs.
2023-09-07 12:10:37 -05:00
esmeyi
b85a9b3093 [PowerPC] Try to use less instructions to materialize 64-bit constant when High32=Low32.
Summary: Materialization a 64-bit constant with High32=Low32 only requires 2 instructions instead of 3 when Low32 can be materialized in 1 instruction.

Reviewed By: qiucf

Differential Revision: https://reviews.llvm.org/D158495
2023-09-07 13:03:17 -04:00
Amara Emerson
1cc9f626cb
[GlobalISel] Add constant-folding of FP binops to combiner. (#65230) 2023-09-07 19:33:35 +03:00
Vladislav Dzhidzhoev
c6ae7df999 [AArch64][GlobalISel] Regenerate arm64-ld1.ll 2023-09-07 17:48:15 +02:00
Stefan Pintilie
84e2fd7ee4 [PowerPC] Add a pass to merge all of the constant global arrays into one pool.
On PowerPC the number of TOC entries must be kept low for large
applications. In order to reduce the number of constant global arrays
we can pool them into one structure and then access them as the base
address of that structure plus some offset. The constant global arrays
may be arrays of `i8` which are constant strings but they may also be
arrays of `i32, i64, etc...`.

Reviewed By: lei, amyk

Differential Revision: https://reviews.llvm.org/D155730
2023-09-07 11:14:56 -04:00
Phoebe Wang
0856efbf88 Revert "[X86][RFC] Add new option -m[no-]evex512 to disable ZMM and 64-bit mask instructions for AVX512 features"
This reverts commit 7dd48cc24de2d54d40527432cbee8a9d97a8a4f7.

Causing buildbot failure.
2023-09-07 21:59:01 +08:00
Phoebe Wang
7dd48cc24d [X86][RFC] Add new option -m[no-]evex512 to disable ZMM and 64-bit mask instructions for AVX512 features
This is an alternative of D157485 and a pre-feature to support AVX10.

AVX10 Architecture Specification: https://cdrdv2.intel.com/v1/dl/getContent/784267
AVX10 Technical Paper: https://cdrdv2.intel.com/v1/dl/getContent/784343
RFC: https://discourse.llvm.org/t/rfc-design-for-avx10-feature-support/72661

Based on the feedbacks from LLVM and GCC community, we have agreed to
start from supporting `-m[no-]evex512` on existing AVX512 features.
The option `-mno-evex512` can be used with `-mavx512xxx` to build
binaries that can run on both legacy AVX512 targets and AVX10-256.

There're still arguments about what's the expected behavior when this
option as well as `-mavx512xxx` used together with `-mavx10.1-256`. We
decided to defer the support of `-mavx10.1` after we made consensus.
Or furthermore, we start from supporting AVX10.2 and not providing any
AVX10.1 options.

Reviewed By: RKSimon, skan

Differential Revision: https://reviews.llvm.org/D159250
2023-09-07 21:38:35 +08:00
Stefan Pintilie
492c1f3d7c [PowerPC] Merge rotate and clear into single instruction.
This patch tries to catch a codegen opportunity where the rotate and
mask can be merged into a single RLDCL instruction.

Reviewed By: lei, amyk

Differential Revision: https://reviews.llvm.org/D158328
2023-09-07 09:25:41 -04:00
Vladislav Dzhidzhoev
0de6baab91 [AArch64][GlobalISel] Look through COPY and G_BITCAST while selecting fcvtl2 (fpext)
It tackles some regressions introduced in
https://reviews.llvm.org/D144670.
2023-09-07 14:08:20 +02:00
pvanhout
69036eb735 [AMDGPU] Fix code-size-estimate.mir test
Expensive-checks was failing on it.
2023-09-07 14:04:12 +02:00
Pierre van Houtryve
30955c9d22
[AMDGPU] Fix V_MOV_B32_indirect inst size (#65584)
This inst lowers to a normal v_mov_b32 so it's not zero-sized, but has a
size of 4.

Solves SWDEV-416337
2023-09-07 13:12:58 +02:00
Tuan Chuong Goh
b7a305deca [AArch64][GlobalISel] Optimise Combine Funnel Shift
Combine any funnel shift with a shift amount of 0 to a copy.
Modulo is applied to shift amount if it is larger than the
instruction's bitwidth.

Differential Revision: https://reviews.llvm.org/D157591
2023-09-07 11:58:12 +01:00
Vladislav Dzhidzhoev
1d0d2dfce7 [GlobalISel] Fold G_SHUFFLE_VECTOR with a single element mask to G_EXTRACT_VECTOR_ELT
It introduces minor regression in arm64-vcvt_f.ll, which will be fixed
later.
2023-09-07 12:03:04 +02:00
Mohamed Atef
741c127817 [SelectionDAG] Add computeOverflowForSignedMul / computeOverflowForUnsignedMul overflow handlers
Support signed multiplication
Support unsigned multiplication

Differential Revision: https://reviews.llvm.org/D159406
2023-09-07 10:03:18 +01:00
Jianjian Guan
fab2594968
[RISCV][NFC] Remove unused checkline (#65560) 2023-09-07 16:24:31 +08:00
Thomas
db5d845c73
[NVPTX] Make i16x2 a native type and add supported vec instructions (#65432)
On sm_90 some instructions now support i16x2 which allows hardware to
execute more efficiently add, min and max instructions.

In order to support that we need to make i16x2 a native type in the
backend. This does the necessary changes to make i16x2 a native type and
adds support for the instructions natively supporting i16x2.

This caused a negative test in nvptx slp to start passing. Changed the
test to a positive one as the IR is correctly vectorized.
2023-09-06 21:59:13 -07:00
Daniel Hoekwater
866ae69cfa [AArch64] [BranchRelaxation] Optimize for hot code size in AArch64 branch relaxation
On AArch64, it is safe to let the linker handle relaxation of
unconditional branches; in most cases, the destination is within range,
and the linker doesn't need to do anything. If the linker does insert
fixup code, it clobbers the x16 inter-procedural register, so x16 must
be available across the branch before linking. If x16 isn't available,
but some other register is, we can relax the branch either by spilling
x16 OR using the free register for a manually-inserted indirect branch.

This patch builds on D145211. While that patch is for correctness, this
one is for performance of the common case. As noted in
https://reviews.llvm.org/D145211#4537173, we can trust the linker to
relax cross-section unconditional branches across which x16 is
available.

Programs that use machine function splitting care most about the
performance of hot code at the expense of the performance of cold code,
so we prioritize minimizing hot code size.

Here's a breakdown of the cases:

   Hot -> Cold [x16 is free across the branch]
     Do nothing; let the linker relax the branch.

   Cold -> Hot [x16 is free across the branch]
     Do nothing; let the linker relax the branch.

   Hot -> Cold [x16 used across the branch, but there is a free register]
     Spill x16; let the linker relax the branch.

     Spilling requires fewer instructions than manually inserting an
     indirect branch.

   Cold -> Hot [x16 used across the branch, but there is a free register]
     Manually insert an indirect branch.

     Spilling would require adding a restore block in the hot section.

   Hot -> Cold [No free regs]
     Spill x16; let the linker relax the branch.

   Cold -> Hot [No free regs]
     Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
     Ex:
       [Hot section]
       func.hot:
         ... hot code...
       func.restore:
         ... restore x16 ...
         B func.hot

       [Cold section]
         func.cold:
         ... spill x16 ...
         B func.restore

     Putting the restore block at the end of the function instead of
     just before the destination increases the cost of executing the
     store, but it avoids putting cold code in the middle of hot code.
     Since the restore is very rarely taken, this is a worthwhile
     tradeoff.

Differential Revision: https://reviews.llvm.org/D156767
2023-09-06 20:44:40 +00:00
Florian Mayer
42a1d16179 Revert "[AMDGPU] Cope with SelectionDAG::UpdateNodeOperands returning a different SDNode (#65340)"
This reverts commit 11171d81aeafb0c2818f288900423e366a2787fc.

Broke ASAN bot.
2023-09-06 13:16:55 -07:00
Vladislav Dzhidzhoev
c39edd7b53 [AArch64][GlobalISel] Regenerate prelegalizercombiner-shuffle-vector.mir 2023-09-06 18:38:13 +02:00
Craig Topper
bb810d8fa0 [RISCV] Disable machine verifier in gisel-commandline-option.ll. NFC
Hopefully this fixes the expensive checks build.
2023-09-06 09:32:32 -07:00
Simon Pilgrim
e4d0e12099 [DAG] Fold (shl (sext (add_nsw x, c1)), c2) -> (add (shl (sext x), c2), c1 << c2) (REAPPLIED)
Assuming the ADD is nsw then it may be sign-extended to merge with a SHL op in a similar fold to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.

This is most useful for helping to expose address math for X86, but has also touched several aarch64 test cases as well.

Alive2: https://alive2.llvm.org/ce/z/2UpSbJ

Differential Revision: https://reviews.llvm.org/D159198
2023-09-06 13:19:42 +01:00
Jay Foad
11171d81ae
[AMDGPU] Cope with SelectionDAG::UpdateNodeOperands returning a different SDNode (#65340)
SITargetLowering::adjustWritemask calls SelectionDAG::UpdateNodeOperands
to update an EXTRACT_SUBREG node in-place to refer to a new IMAGE_LOAD
instruction, before we delete the old IMAGE_LOAD instruction. But in
UpdateNodeOperands can do CSE on the fly and return a different
EXTRACT_SUBREG node, so the original EXTRACT_SUBREG node would still
exist and would refer to the old deleted IMAGE_LOAD instruction. This
caused errors like:

t31: v3i32,ch = <<Deleted Node!>> # D:1
This target-independent node should have been selected!
UNREACHABLE executed at lib/CodeGen/SelectionDAG/InstrEmitter.cpp:1209!

Fix it by detecting the CSE case and replacing all uses of the original
EXTRACT_SUBREG node with the CSE'd one.
2023-09-06 12:51:44 +01:00