This reverts commit db5d845c73ee2d64f1a5bab3fc72edece9e3a7ba.
As per PR discussion "Looks like we've missed lowering of bitcasts
between v2f16 and v2i16 and it breaks XLA."
SITargetLowering::adjustWritemask calls SelectionDAG::UpdateNodeOperands
to update an EXTRACT_SUBREG node in-place to refer to a new IMAGE_LOAD
instruction, before we delete the old IMAGE_LOAD instruction. But
UpdateNodeOperands can do CSE on the fly and return a different
EXTRACT_SUBREG node, so the original EXTRACT_SUBREG node would still
exist and would refer to the old deleted IMAGE_LOAD instruction. This
caused errors like:
t31: v3i32,ch = <<Deleted Node!>> # D:1
This target-independent node should have been selected!
UNREACHABLE executed at lib/CodeGen/SelectionDAG/InstrEmitter.cpp:1209!
Fix it by detecting the CSE case and replacing all uses of the original
EXTRACT_SUBREG node with the CSE'd one.
Recommit with a fix for a use-after-free bug in the first version of
this patch (#65340) which was caught by asan.
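The failure mode and the fix can be illustrated with a toy model. This is a minimal sketch, not the SelectionDAG API: `ToyDAG`, `update_node_operands`, `replace_all_uses` and `rewrite_extract` are hypothetical names, and the CSE map is simplified to a dictionary keyed on opcode and operand identity.

```python
class Node:
    def __init__(self, opcode, operands=()):
        self.opcode = opcode
        self.operands = tuple(operands)


class ToyDAG:
    """Toy DAG with a CSE map keyed on (opcode, operand identities)."""

    def __init__(self):
        self.nodes = []
        self.cse = {}

    def _key(self, opcode, operands):
        return (opcode, tuple(id(op) for op in operands))

    def get_node(self, opcode, operands=()):
        key = self._key(opcode, operands)
        if key not in self.cse:
            node = Node(opcode, operands)
            self.cse[key] = node
            self.nodes.append(node)
        return self.cse[key]

    def update_node_operands(self, node, operands):
        key = self._key(node.opcode, operands)
        existing = self.cse.get(key)
        if existing is not None and existing is not node:
            # CSE hit: an equivalent node already exists, so `node`
            # itself is left untouched and still has its old operands.
            return existing
        del self.cse[self._key(node.opcode, node.operands)]
        node.operands = tuple(operands)
        self.cse[key] = node
        return node

    def replace_all_uses(self, old, new):
        # Toy version: users are rewritten without being re-keyed.
        for user in self.nodes:
            user.operands = tuple(new if op is old else op
                                  for op in user.operands)


def rewrite_extract(dag, extract, new_load):
    """Point `extract` at `new_load`, handling the CSE case."""
    updated = dag.update_node_operands(extract, (new_load,))
    if updated is not extract:
        dag.replace_all_uses(extract, updated)
    return updated


dag = ToyDAG()
old_load = dag.get_node("OLD_IMAGE_LOAD")
new_load = dag.get_node("NEW_IMAGE_LOAD")
extract_old = dag.get_node("EXTRACT_SUBREG", (old_load,))
extract_new = dag.get_node("EXTRACT_SUBREG", (new_load,))
user = dag.get_node("USER", (extract_old,))
result = rewrite_extract(dag, extract_old, new_load)
```

In this model `extract_old` keeps its operand on the old load after the update, which is why its users have to be redirected to the returned node before the old load is deleted.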
This is an alternative to D157485 and a prerequisite for supporting AVX10.
AVX10 Architecture Specification: https://cdrdv2.intel.com/v1/dl/getContent/784267
AVX10 Technical Paper: https://cdrdv2.intel.com/v1/dl/getContent/784343
RFC: https://discourse.llvm.org/t/rfc-design-for-avx10-feature-support/72661
Based on feedback from the LLVM and GCC communities, we have agreed to
start by supporting `-m[no-]evex512` on existing AVX512 features.
The option `-mno-evex512` can be used with `-mavx512xxx` to build
binaries that can run on both legacy AVX512 targets and AVX10-256.
There is still disagreement about the expected behavior when this
option and `-mavx512xxx` are used together with `-mavx10.1-256`. We
decided to defer support for `-mavx10.1` until consensus is reached.
Alternatively, we may start with AVX10.2 support and not provide any
AVX10.1 options.
Reviewed By: RKSimon, skan
Differential Revision: https://reviews.llvm.org/D159250
This prevents S_NOP from being rescheduled past other (side-effecting)
instructions, which is useful because it is generally used to introduce
a short delay or to avoid hazards. Currently this only affects MIR tests
because the compiler itself only inserts nops in PostRAHazardRecognizer
which runs after all scheduling.
This removes the backend requirement of HasV8 for crc instructions, relying on
just HasCRC instead. This should allow them to be selected with ArmV7 + crc,
making them more usable whilst hopefully not causing them to be generated
incorrectly (they only come from intrinsics, and HasCRC usually requires HasV8).
This is how most other instructions are specified.
PowerPC subtargets prior to Power9 use the 'legacy' itinerary way to
provide scheduling information. This patch re-writes the tablegen file
to define the scheduling information in the new SchedModel way, which
can bring improvements to some benchmarks.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D154488
We have already customized folding for VINSERTPS in 7e6606f4f1, which does
the folding when alignment >= 4 bytes.
We cannot arbitrarily fold it like others because we need to calculate
the source offset.
PPC64 allows a stack size of up to ((2^63)-1) bytes. Currently llc reports
```
warning: stack frame size (4294967568) exceeds limit (4294967295) in function 'main'
```
if the allocated stack is larger than 4 GiB.
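The numbers in the warning can be checked directly. A small sketch (the constant and function names are illustrative only): the reported limit is 2^32-1, and the frame in the example is 272 bytes past the 4 GiB boundary while still far below the PPC64 maximum.

```python
UINT32_MAX = 2**32 - 1        # limit quoted in the warning
PPC64_MAX_STACK = 2**63 - 1   # limit stated for PPC64

def exceeds_limit(frame_size, limit):
    """Return True if the frame size is larger than the given limit."""
    return frame_size > limit

frame_size = 4294967568       # value from the warning text
bytes_past_4gib = frame_size - 2**32
```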
Scratch instructions are always in addrspace(5), which can only alias
with flat (and itself). SMEM and buffer instructions can never reference
those address spaces, so they are trivially disjoint.
This patch utilizes the -maix-small-local-exec-tls option added in
D155544 to produce a faster access sequence for the local-exec TLS
model, where loading from the TOC can be avoided.
The patch either produces an addi/la with a displacement off of r13
(the thread pointer) when the address is calculated, or it produces an
addi/la followed by a load/store when the address is calculated and
used for further accesses.
This patch also optimizes the sequence further by removing the addi/la
when the load/store offset is 0. A follow-up patch will be posted to
handle non-zero load/store offsets; for now, those cases keep the
addi/la that precedes the load/store.
Furthermore, this access sequence is only performed for TLS variables
that are less than ~32KB in size.
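A rough sketch of the selection logic described above. It assumes that the ~32KB bound corresponds to a signed 16-bit displacement from r13; the function names and the pseudo-instruction strings are hypothetical and are not what the backend emits.

```python
def fits_signed_16(displacement):
    """Assumed eligibility check: displacement fits a signed 16-bit field."""
    return -(1 << 15) <= displacement < (1 << 15)

def local_exec_access(var_offset, access_offset):
    """Return a pseudo-instruction list, or None to use the regular sequence."""
    if not fits_signed_16(var_offset):
        return None
    if access_offset == 0:
        # The addi/la is folded away: access directly off the thread pointer.
        return [f"load/store {var_offset}(r13)"]
    # Non-zero access offset: keep the addi/la that precedes the load/store.
    return [f"addi/la tmp, {var_offset}(r13)",
            f"load/store {access_offset}(tmp)"]
```

For example, `local_exec_access(16, 0)` yields a single access off r13, while `local_exec_access(16, 8)` keeps the address computation.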
Differential Revision: https://reviews.llvm.org/D155600
This patch adds a target attribute for an AIX-specific option that
informs the compiler that it can use a faster access sequence for the
local-exec TLS model (formally named aix-small-local-exec-tls).
The Clang portion of this option is in D155544.
The initial implementation to generate the faster access sequence is in
D155600.
Differential Revision: https://reviews.llvm.org/D156203
If we have a build_vector such as [i64 0, i64 3, i64 1, i64 2], we
instead lower this as vsext([i8 0, i8 3, i8 1, i8 2]). For vectors with
4 or fewer elements, the resulting narrow vector can be generated via
scalar materialization.
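The narrowing idea can be sketched outside the backend. This is a model of the arithmetic only, with hypothetical helper names: find the narrowest signed element width that holds every constant, then check that sign extension recovers the original values.

```python
def narrowest_signed_width(values, widths=(8, 16, 32, 64)):
    """Smallest width in `widths` whose signed range holds all values."""
    for width in widths:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        if all(lo <= v <= hi for v in values):
            return width
    raise ValueError("values do not fit in any candidate width")

def sign_extend(value, from_bits):
    """Interpret the low `from_bits` bits of `value` as a signed integer."""
    value &= (1 << from_bits) - 1
    return value - (1 << from_bits) if value >> (from_bits - 1) else value

elements = [0, 3, 1, 2]                       # the i64 build_vector above
width = narrowest_signed_width(elements)      # 8 for this example
narrow = [v & ((1 << width) - 1) for v in elements]
widened = [sign_extend(v, width) for v in narrow]
```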
For shuffles which get lowered to vrgathers, constant build_vectors of
small constants are idiomatic. As such, this change covers all shuffles
with an output type of 4 or fewer elements.
I deliberately started narrow here. I think it makes sense to expand
this to longer vectors, but we need a more robust profit model on the
recursive expansion. It's questionable whether we want to do the extend if
we're going to generate a constant pool load for the narrower type
anyway.
One possibility for future exploration is to allow the narrower VT to be
less than 8 bits. We can't use vsext for that, but we could use
something analogous to our widening interleave lowering with some extra
shifts and ands.
With ThinLTO, when compiling SPEC 2017 omnetpp_r with -threads=4, two
small modules can end up with the same timestamp in their sinit symbols
when calculating time in seconds, creating duplicate definitions.
This patch uses a timestamp in nanoseconds.
Because the race can be between threads, embed the thread ID as well.
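To see why second granularity collides and why the thread ID is still needed, consider this small sketch. The suffix format and function names are made up for illustration and do not reflect the actual symbol naming.

```python
def suffix_seconds(time_ns):
    """Old scheme (illustrative): timestamp truncated to whole seconds."""
    return str(time_ns // 1_000_000_000)

def suffix_ns_with_thread(time_ns, thread_id):
    """New scheme (illustrative): nanosecond timestamp plus thread ID."""
    return f"{time_ns}_{thread_id}"

# Two small modules finishing 800 ns apart within the same second.
t_a = 1_700_000_000 * 1_000_000_000 + 100
t_b = 1_700_000_000 * 1_000_000_000 + 900
```

Here `suffix_seconds(t_a)` and `suffix_seconds(t_b)` are identical, while the nanosecond form differs; two threads that happen to read the same nanosecond value are still separated by the thread ID.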
Reviewed By: xingxue, daltenty
Differential Revision: https://reviews.llvm.org/D159319
Materializing a 64-bit constant with High32=Low32 requires only 2 instructions instead of 3 when Low32 can be materialized in 1 instruction.
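The property being exploited is easy to state in code. A minimal sketch with hypothetical helper names: once the low 32 bits are in a register, one further step that copies them into the high half rebuilds the full constant. Which instruction performs that step is not stated here, so the sketch models only the value.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def halves_match(value):
    """True if the high and low 32-bit halves of a 64-bit value are equal."""
    value &= MASK64
    return (value >> 32) == (value & MASK32)

def duplicate_low32(low32):
    """Rebuild the 64-bit constant from its low half (second step)."""
    low32 &= MASK32
    return ((low32 << 32) | low32) & MASK64
```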
Reviewed By: qiucf
Differential Revision: https://reviews.llvm.org/D158495
On PowerPC the number of TOC entries must be kept low for large
applications. In order to reduce the number of constant global arrays
we can pool them into one structure and then access them as the base
address of that structure plus some offset. The constant global arrays
may be arrays of `i8` which are constant strings but they may also be
arrays of `i32`, `i64`, etc.
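The pooling idea amounts to a layout computation: one base address, plus a per-array offset. The sketch below is illustrative only. It assumes each array is aligned to its element size and keeps the input order; the real pass may lay the pool out differently.

```python
def pool_arrays(arrays):
    """Lay out arrays in one pool.

    `arrays` maps a name to (element_size_bytes, element_count).
    Returns (offsets, total_size) where offsets maps name -> byte offset.
    """
    offsets = {}
    offset = 0
    for name, (elem_size, count) in arrays.items():
        # Round up to the next multiple of the element size (assumed alignment).
        offset = (offset + elem_size - 1) // elem_size * elem_size
        offsets[name] = offset
        offset += elem_size * count
    return offsets, offset

offsets, total = pool_arrays({
    "str": (1, 6),     # six i8 elements, e.g. a short constant string
    "ints": (4, 3),    # three i32 elements
    "longs": (8, 2),   # two i64 elements
})
```

With this layout a single TOC entry for the pool replaces three separate entries, and each array is reached as base plus its offset.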
Reviewed By: lei, amyk
Differential Revision: https://reviews.llvm.org/D155730
This patch tries to catch a codegen opportunity where the rotate and
mask can be merged into a single RLDCL instruction.
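As a reference for what the merged instruction computes, here is a small model of a rotate followed by a clear-left mask. The function names are illustrative, and the model follows the usual description of RLDCL: rotate the doubleword left, then clear the `mb` most significant bits.

```python
MASK64 = (1 << 64) - 1

def rotl64(value, amount):
    """Rotate a 64-bit value left by `amount` (taken modulo 64)."""
    amount &= 63
    value &= MASK64
    return ((value << amount) | (value >> (64 - amount))) & MASK64

def rldcl(value, amount, mb):
    """Rotate left, then keep only the low (64 - mb) bits."""
    return rotl64(value, amount) & (MASK64 >> mb)
```

A separate rotate and an AND with a mask of the form `MASK64 >> mb` give the same result as the single modeled operation.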
Reviewed By: lei, amyk
Differential Revision: https://reviews.llvm.org/D158328
Combine any funnel shift with a shift amount of 0 to a copy.
Modulo is applied to shift amount if it is larger than the
instruction's bitwidth.
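A reference model of funnel shifts shows why a zero amount is a copy. This sketch follows the LLVM intrinsic semantics, with the shift amount taken modulo the bit width; the function names mirror the intrinsics but this is plain Python, not compiler code.

```python
def fshl(a, b, amount, bits=32):
    """Funnel shift left: high `bits` of (a:b << amount)."""
    amount %= bits
    mask = (1 << bits) - 1
    concat = ((a & mask) << bits) | (b & mask)
    return ((concat << amount) >> bits) & mask

def fshr(a, b, amount, bits=32):
    """Funnel shift right: low `bits` of (a:b >> amount)."""
    amount %= bits
    mask = (1 << bits) - 1
    concat = ((a & mask) << bits) | (b & mask)
    return (concat >> amount) & mask
```

With an amount of 0, or any multiple of the bit width, `fshl` returns its first operand and `fshr` returns its second, so either can be replaced by a copy.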
Differential Revision: https://reviews.llvm.org/D157591
On sm_90 some instructions now support i16x2, which allows the hardware to
execute add, min and max instructions more efficiently.
In order to support that we need to make i16x2 a native type in the
backend. This does the necessary changes to make i16x2 a native type and
adds support for the instructions natively supporting i16x2.
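For readers unfamiliar with packed types, the following sketch models what a lane-wise i16x2 add computes on a 32-bit value. The helper names are hypothetical, and placing lane 0 in the low half is an assumption of the sketch.

```python
def pack_i16x2(lane0, lane1):
    """Pack two 16-bit lanes into one 32-bit word (lane0 in the low half)."""
    return ((lane1 & 0xFFFF) << 16) | (lane0 & 0xFFFF)

def unpack_i16x2(word):
    """Split a 32-bit word into its two 16-bit lanes."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

def add_i16x2(a, b):
    """Lane-wise add: each 16-bit lane wraps independently."""
    a0, a1 = unpack_i16x2(a)
    b0, b1 = unpack_i16x2(b)
    return pack_i16x2(a0 + b0, a1 + b1)
```

The lane-wise result differs from a plain 32-bit add whenever the low lane overflows, because no carry crosses into the high lane.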
This caused a negative test in nvptx slp to start passing. Changed the
test to a positive one as the IR is correctly vectorized.
On AArch64, it is safe to let the linker handle relaxation of
unconditional branches; in most cases, the destination is within range,
and the linker doesn't need to do anything. If the linker does insert
fixup code, it clobbers the x16 inter-procedural register, so x16 must
be available across the branch before linking. If x16 isn't available,
but some other register is, we can relax the branch either by spilling
x16 OR using the free register for a manually-inserted indirect branch.
This patch builds on D145211. While that patch is for correctness, this
one is for performance of the common case. As noted in
https://reviews.llvm.org/D145211#4537173, we can trust the linker to
relax cross-section unconditional branches across which x16 is
available.
Programs that use machine function splitting care most about the
performance of hot code at the expense of the performance of cold code,
so we prioritize minimizing hot code size.
Here's a breakdown of the cases:
  Hot -> Cold [x16 is free across the branch]
    Do nothing; let the linker relax the branch.
  Cold -> Hot [x16 is free across the branch]
    Do nothing; let the linker relax the branch.
  Hot -> Cold [x16 used across the branch, but there is a free register]
    Spill x16; let the linker relax the branch.
    Spilling requires fewer instructions than manually inserting an
    indirect branch.
  Cold -> Hot [x16 used across the branch, but there is a free register]
    Manually insert an indirect branch.
    Spilling would require adding a restore block in the hot section.
  Hot -> Cold [No free regs]
    Spill x16; let the linker relax the branch.
  Cold -> Hot [No free regs]
    Spill x16 and put the restore block at the end of the hot function;
    let the linker relax the branch.
    Ex:
      [Hot section]
      func.hot:
        ... hot code...
      func.restore:
        ... restore x16 ...
        B func.hot
      [Cold section]
      func.cold:
        ... spill x16 ...
        B func.restore
    Putting the restore block at the end of the function instead of
    just before the destination increases the cost of executing the
    restore, but it avoids putting cold code in the middle of hot code.
    Since the restore is very rarely taken, this is a worthwhile
    tradeoff.
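The case breakdown above can be condensed into a small decision function. This is a sketch of the policy as described in this message, not the pass itself; the function name and the strategy labels are invented for illustration.

```python
def relax_strategy(src, dst, x16_free, has_free_reg):
    """Pick a relaxation strategy for a cross-section unconditional branch.

    `src` and `dst` are "hot" or "cold"; the two flags describe register
    availability across the branch.
    """
    if x16_free:
        return "linker"                      # do nothing, linker relaxes
    if src == "cold" and dst == "hot":
        if has_free_reg:
            return "indirect-branch"         # avoid a restore block in hot code
        return "spill-x16-restore-at-end-of-hot-function"
    return "spill-x16"                       # Hot -> Cold, x16 in use
```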
Differential Revision: https://reviews.llvm.org/D156767
If the ADD is nsw, it can be sign-extended and merged with a SHL op, in a fold similar to the existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold.
This is most useful for helping to expose address math for X86, but it has also touched several aarch64 test cases.
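The arithmetic behind both folds can be checked with a small model. The existing fold holds in wrapping arithmetic; the sign-extended form shown here, (shl (sext (add x, c1)), c2) -> (add (shl (sext x), c2), (sext c1) << c2), is an assumed reading of this message, and all function names are illustrative. The Alive2 link remains the authoritative proof.

```python
def wrap(value, bits):
    """Truncate to an unsigned `bits`-wide value."""
    return value & ((1 << bits) - 1)

def sext(value, from_bits):
    """Interpret the low `from_bits` bits as a signed integer."""
    value = wrap(value, from_bits)
    return value - (1 << from_bits) if value >> (from_bits - 1) else value

def shl_of_add(x, c1, c2, bits=32):
    return wrap(wrap(x + c1, bits) << c2, bits)

def add_of_shl(x, c1, c2, bits=32):
    return wrap(wrap(x << c2, bits) + wrap(c1 << c2, bits), bits)

def shl_of_sext_add(x, c1, c2, narrow=32, wide=64):
    return wrap(sext(wrap(x + c1, narrow), narrow) << c2, wide)

def add_of_shl_sext(x, c1, c2, narrow=32, wide=64):
    return wrap((sext(x, narrow) << c2) + (sext(c1, narrow) << c2), wide)

def add_is_nsw(x, c1, bits=32):
    """True if the signed add does not overflow in `bits` bits."""
    total = sext(x, bits) + sext(c1, bits)
    return -(1 << (bits - 1)) <= total < (1 << (bits - 1))
```

When the add overflows in the narrow type, for example x = 0x7FFFFFFF and c1 = 1, the two sign-extended forms disagree, which is why the nsw flag is required.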
Alive2: https://alive2.llvm.org/ce/z/2UpSbJ
Differential Revision: https://reviews.llvm.org/D159198