64918 Commits

Author SHA1 Message Date
Benjamin Kramer
8b8e8704ce [PowerPC] Fix a nullptr dereference
LiMI1/LiMI2 can be null, so don't call a method on them before checking.
Found by ubsan.
2021-11-16 23:52:42 +01:00
Victor Huang
ae27ca9a67 [PowerPC] PPC backend optimization on conditional trap intrustions
This patch adds PPC back end optimization to analyze the arguments of a
conditional trap instruction to execute one of the following:
1. Delete it if never trap
2. Replace it if always trap
3. Otherwise keep it

Reviewed By: nemanjai, amyk, PowerPC

Differential revision: https://reviews.llvm.org/D111434
2021-11-16 13:11:57 -06:00
Kazu Hirata
ee0133dc6d [llvm] Use range-for loops (NFC) 2021-11-16 09:01:56 -08:00
David Sherwood
4607459022 [AArch64] Fix TypeSize->uint64_t implicit conversion in AArch64ISelLowering::hasAndNot
For now I've just changed the code to only return true from
AArch64ISelLowering::hasAndNot if the vector is fixed-length.
Once we have the right patterns or DAG combines to use bic/bif
we can also enable this for SVE.

Test added here:

  CodeGen/AArch64/vselect-constants.ll

Differential Revision: https://reviews.llvm.org/D113994
2021-11-16 16:25:16 +00:00
Jon Chesterfield
30b29db7c7 [amdgpu] Don't crash on empty global ctor/dtor
Global ctor/dtor can be an empty array, which is a Constant not a
ConstantArray. The cast<ConstantArray> therefore asserts / crashes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D113800
2021-11-16 14:36:08 +00:00
Matt Devereau
f526c600c0 [AArch64][SVE] Instcombine SVE LD1/ST1 to stock LLVM IR
InstCombine AArch64 LD1/ST1 to llvm.masked.load/llvm.masked.store
and LD1/ST1 to load/store when a ptrue all predicate pattern operand
is present.

This allows existing IR optimizations such as dead-load removal to
occur.

Differential Revision: https://reviews.llvm.org/D113489
2021-11-16 11:10:23 +00:00
jacquesguan
6405e8b584 [RISCV] Refactor some rvv instructions' definition with foreach.
Simplify rvv instructions that use eew in their mnemonic and encoding with foreach. And fix a scheduling bug.

Differential Revision: https://reviews.llvm.org/D113453
2021-11-16 15:20:45 +08:00
Serguei Katkov
0ecb12a27f [STATEPOINT] Force implicit-def for lr register.
STATEPOINT instruction behavior is similar to call instruction.
In aarch64 BL instruction implicitly define lr register, so
STATEPOINT instruction should do the same.
However STATEPOINT is a general pseudo instruction and I could not find
a way to override list of implicit defs for specific target.

So this patch post processes inserting STATEPOINT instruction by
adding implisit dead def for lr.

Reviewers: reames, loicottet, ostannard
Reviewed By: reames
Subscribers: danilaml, hiraditya, kristof.beyls, llvm-commits, yrouban
Differential Revision: https://reviews.llvm.org/D111114
2021-11-16 12:52:00 +07:00
Kazu Hirata
7f00806a6a [llvm] Use make_early_inc_range (NFC) 2021-11-15 21:28:46 -08:00
Amara Emerson
dc84770d55 [GlobalISel] Add a store-merging optimization pass and enable for AArch64.
This is a first attempt at a constant value consecutive store merging pass,
a counterpart to the DAGCombiner's store merging optimization.

The high level goals of this pass:

* Have a simple and efficient algorithm. As close to linear time as we can get.
  Thus, prioritizing scalability of the algorithm over merging every corner case
  we can find. The DAGCombiner's store merging code has been the source of
  compile time and complexity issues in the past and I wanted to avoid that.
* Don't introduce any new data structures for ordering memory operations. In MIR,
  we don't have the concept of chains like we do in the DAG, and the instruction
  order is stricter than enforcing ordering with graph edges. Although I
  considered adding something similar, I couldn't justify the overhead.

The pass is current split into 3 main parts. The main store merging code focuses
on identifying candidate stores and managing the candidate group that's under
consideration for merging. Analyzing addressing of stores is a potentially
complex part and for now there's just a basic implementation to identify easy
cases. Finally, the other main bit of complexity is the alias analysis, which
tries to follow the same logic as the DAG's AA.

Currently this implementation only supports merging of constant stores. Stores
of arbitrary variables are technically possible with a very small change, but
the DAG chooses not to do this. Doing so here makes most code worse since
there's extra overhead in merging values into wider registers.

On AArch64 -Os, this optimization results in very minor savings on CTMark.

Differential Revision: https://reviews.llvm.org/D109131
2021-11-15 21:10:39 -08:00
Craig Topper
391b0ba603 [RISCV] Override TargetLowering::hasAndNot for Zbb.
Differential Revision: https://reviews.llvm.org/D113937
2021-11-15 18:44:07 -08:00
Matt Arsenault
659887b405 AMDGPU: Mark prolog/epilog SCC defs as dead
A future change will add SCC liveness checks. Since we are still
relying on forward register scavenging, add dead flags to avoid
spuriously detecting SCC as live.
2021-11-15 21:35:06 -05:00
Ben Shi
4c3d916c4b [RISCV] Optimize immediate materialisation with SH*ADD
Use LUI+SH*ADD+ADDI to compose specific immediates.

Reviewed By: craig.topper, luismarques

Differential Revision: https://reviews.llvm.org/D113568
2021-11-15 23:34:28 +00:00
Jonas Paulsson
1c3ef9ef4a [SystemZ] Support symbolic displacements.
This patch adds support for symbolic displacements, e.g. like 'lg %r0,
sym(%r1)', which is done using relocations. This is needed to compile the
kernel without disabling the integrated assembler.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D113341
2021-11-15 16:46:31 -05:00
Fangrui Song
fee52fe0ad [X86] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off build. NFC 2021-11-15 13:10:47 -08:00
Lei Huang
f50c6c1718 [PowerPC] Fix 32bit vector insert instructions for ISA3.1
The platform independent ISD::INSERT_VECTOR_ELT take a element index,
but vins* instructions take a byte index. Update 32bit td patterns for
vector insert to handle the element index accordingly.

Since vector insert for non constant index are supported in
ISA3.1, there is no need to use platform specific ISD node,
PPCISD::VECINSERT.  Update td pattern to directly use
ISD::INSERT_VECTOR_ELT instead.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D113802
2021-11-15 14:36:39 -06:00
Simon Pilgrim
441de2536b [X86] Add generic splitVectorOp helper. NFC
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.

Avoid code duplication and makes it easier to split other vector instruction forms in the future.
2021-11-15 17:59:23 +00:00
Craig Topper
f59307bfdc [RISCV] Teach needVSETVLIPHI to handle mask register instructions.
This handles the case where the mask register instruction input
comes from a Phi of vsetvlis. If the VLMAX is the same as the VLMAX
required by the mask register instruction, we can avoid a vsetvli.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D113204
2021-11-15 09:57:28 -08:00
Simon Pilgrim
fc7c1cebbc [X86] LowerFunnelShift - pull out repeated EltSizeInBits variable. NFC. 2021-11-15 17:11:44 +00:00
Sanjay Patel
3d01507c2d [x86] fold vector (X > -1) & Y to shift+andn (2nd try)
The first try at this patch ( bf5748a1af0d ) was reverted ( 5be64d416481 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.

This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.

A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.

Original commit message:

and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y

This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).

We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.

Differential Revision: https://reviews.llvm.org/D113603
2021-11-15 11:09:32 -05:00
Roman Lebedev
5c7255fe3a
[X86][Costmodel] getReplicationShuffleCost(): promote 8 bit-wide elements to 32 bit when no AVX512VBMI
Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type,
so this change does affect vectorization. In the end, it will ask about i1.

We should also try to promote to i16 if we have AVX512BW, i'll do that in a follow-up.
All costs here look good, i've added the missing truncation costs in preparatory patches.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113853
2021-11-15 19:04:02 +03:00
Roman Lebedev
a468c39c90
[X86][Costmodel] trunc v32i16 to v64i8 can appear after legalization, cost is same as for trunc v32i16 to v32i8
Some of the costs get larger here,
but i suppose that makes sense since we'd previously query
scalarization costs that may not be really representative of the reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev
9e57d9b09d
[X86][Costmodel] trunc v8i64 to v16i8/v32i8/v64i8 can appear after legalization, cost is same as for trunc v8i64 to v8i8
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113851
2021-11-15 19:04:02 +03:00
Roman Lebedev
0116c708c6
[X86][Costmodel] trunc v16i32 to v32i8/v64i8 can appear after legalization, cost is same as for trunc v16i32 to v16i8
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that i'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113850
2021-11-15 19:04:02 +03:00
Simon Pilgrim
ea9e6aa423 [X86] getAVX512Node() - find constant broadcasts to encourage load-folding
If an operand is a bitcasted or widended constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.

I've refactored getAVX512Node so that VLX targets can make better use of this as well.

NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.

Differential Revision: https://reviews.llvm.org/D113845
2021-11-15 15:52:03 +00:00
Hans Wennborg
5be64d4164 Revert "[x86] fold vector (X > -1) & Y to shift+andn"
This casued assertion failures:

  llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
  void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
  Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
  && "Cannot use this version of ReplaceAllUsesWith!"' failed.

See comment on the code review.

(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
 manually due to other changes having landed after the reverted one.)

> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603

This reverts commit bf5748a1af0d2f6f9396d9dc6ac89d15de41eee7.
2021-11-15 12:35:49 +01:00
Dmitry Preobrazhensky
91f4650ebb [AMDGPU][MC][GFX10] Corrected global_atomic_fcmpswap*
Corrected src data size of global_atomic_fcmpswap and global_atomic_fcmpswap_x2 opcodes.

Differential Revision: https://reviews.llvm.org/D113746
2021-11-15 12:51:12 +03:00
David Green
4c3bfdc7f1 [ARM] Fix GatherScatter AddLikeOr condition 2021-11-15 09:44:41 +00:00
Peter Waller
599ea3e73f [AArch64][SVE] Break false dependencies for inactive lanes of FP unary operations
Follow up to D105889, covering instructions using sve_fp_2op_p_zd_HSD:
frintn, frintp, frintm, frintz, frinta, frintx, frinti, frecpx and
fsqrt.

Reviewed By: bsmith

Differential Revision: https://reviews.llvm.org/D113485
2021-11-15 09:15:21 +00:00
Simon Moll
7cf887b950 [VE] Fix SDNode user loop after efa896e5f7
Rewriting SDNode user loops broke VEISelLowering (commit efa896e5f7).
This fixes it.
2021-11-15 09:53:09 +01:00
Kazu Hirata
d243cbf8ea [llvm] Use isa instead of dyn_cast (NFC) 2021-11-14 19:40:46 -08:00
Kazu Hirata
a84a401f7e [AMDGPU] Remove selectStoreIntrinsic (NFC)
The last use was removed on Jan 13, 2020 in commit
533d650e947a2f7216a315aeb8c79ac1d4740e5f.
2021-11-14 19:40:44 -08:00
Chen Zheng
eec9ca622c [PowerPC] guard update form prepare with non-const increment with option
Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D113471
2021-11-15 02:16:46 +00:00
Simon Pilgrim
c3a772fdf5 [X86] Add getPack helper
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.

It also optionally packs the upper half instead of the lower half.
2021-11-14 21:27:15 +00:00
Koakuma
3e0f3041cc [SPARC] Zero-extend the operands when doing UMULO on 64-bit integers
On SPARC, S/UMULO operation on 64-bit integers works by extending the value to 128-bit, then doing a multiplication and checking the upper half of the result.
This makes UMULO works correctly by putting a zero in the upper half rather than doing a sign extension.

Reviewed By: LemonBoy

Differential Revision: https://reviews.llvm.org/D110555
2021-11-14 19:59:52 +01:00
Roman Lebedev
4dd2f0446c
[X86][Costmodel] getReplicationShuffleCost(): promote 16 bit-wide elements to 32 bit when no AVX512BW
The basic idea is simple, if we don't have native shuffle for this element type,
then we must have native shuffle for wider element type,
so promote, replicate, demote.

I believe, asking `getCastInstrCost(Instruction::Trunc` is correct semantically,
case in point `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in
2 * XMM, that then will be packed into 1 * YMM,
and it should count the cost of said packing,
not just the truncations.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113609
2021-11-14 20:01:38 +03:00
Roman Lebedev
e876698a5d
[NFC][TTI] getReplicationShuffleCost(): s/Replicated/Dst/
'Replicated' is mouthful and somewhat ambigious,
while 'destination' is pretty self-explanatory.
2021-11-14 20:01:38 +03:00
Roman Lebedev
b283961012
[X86][Costmodel] trunc v8i64 to v16i16/v32i16 can appear after legalization, cost is same as for trunc v8i64 to v8i16
Same as D113842, but for i64

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113843
2021-11-14 18:41:38 +03:00
Roman Lebedev
a5f2fdca99
[X86][Costmodel] trunc v16i32 to v32i16 can appear after legalization, cost is same as for trunc v16i32 to v16i16
This was noticed in D113609, hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Simon Pilgrim
f4143ffed7 [X86] Widen 128/256-bit VPTERNLOG patterns to 512-bit on non-VLX targets
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.

Fixes regressions in D113192

Differential Revision: https://reviews.llvm.org/D113827
2021-11-14 13:40:53 +00:00
Kazu Hirata
7505b7045f [llvm] Use GetElementPtrInst::indices (NFC) 2021-11-13 21:43:28 -08:00
Kazu Hirata
609ccbb240 [PowerPC] Use SDNode::uses (NFC) 2021-11-13 08:34:22 -08:00
Simon Pilgrim
a310cbae02 [X86] Add getAVX512Node helper. NFC.
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).

I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
2021-11-13 13:59:42 +00:00
Craig Topper
82bc6a094e [X86] Promote f16 STRICT_FROUND to f32 and call libc.
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D113817
2021-11-12 21:37:03 -08:00
Kazu Hirata
efa896e5f7 [Target] Use SDNode::uses (NFC) 2021-11-12 21:23:04 -08:00
Phoebe Wang
e49fcfc7cd [X86][ABI] Change the alignment of f80 in 32-bit calling convention to meet with different data layout
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113739
2021-11-13 10:00:34 +08:00
Craig Topper
02bed66cd5 [RISCV] Improve codegen for i32 udiv/urem by constant on RV64.
The division by constant optimization often produces constants that
are uimm32, but not simm32. These constants require 3 or 4 instructions
to materialize without Zba.

Since these instructions are often used by a multiply with a LHS
that needs to be zero extended with an AND, we can switch the MUL
to a MULHU by shifting both inputs left by 32. Once we shift the
constant left, the upper 32 bits no longer need to be 0 so constant
materialization is free to use LUI+ADDIW. This reduces the constant
materialization from 4 instructions to 3 in some cases while also
reducing the zero extend of the LHS from 2 shifts to 1.

Differential Revision: https://reviews.llvm.org/D113805
2021-11-12 14:49:10 -08:00
Simon Pilgrim
6bb71738e2 [X86] convertShiftLeftToScale - improve vXi8 constant handling
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.

We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
2021-11-12 16:48:10 +00:00
Jay Foad
a70bbb5f7a [AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
2021-11-12 15:48:41 +00:00
Kerry McLaughlin
7647822156 [AArch64][SVE] Remove i1 type from isElementTypeLegalForScalableVector
`collectElementTypesForWidening` collects the types of load, store and
reduction Phis in a loop. These types are later checked using
`isElementTypeLegalForScalableVector` to prevent vectorisation of
loops with instruction types that are unsupported.

This patch removes i1 from the list of types supported for scalable
vectors. This fixes an assert ("Cannot yet scalarize uniform stores") in
`setCostBasedWideningDecision` when we have a loop containing a uniform
i1 store and a scalable VF, which we cannot create a scatter for.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D113680
2021-11-12 14:24:38 +00:00