32463 Commits

Author SHA1 Message Date
Sanjay Patel
0cf0be993c [InstCombine] fix operands of shouldChangeType() for casted phi transform
This is a bug noted in the recent D72733 and seen
in the similar transform just above the changed source code.

I added tests with illegal types and zexts to show the bug -
we could transform legal phi ops to illegal, etc. I did not add
tests with trunc because we won't see any diffs on those patterns.
That is because InstCombiner::SliceUpIllegalIntegerPHI() appears to
do those transforms independently of datalayout. It can also create
more casts than are present in existing code.

There are some existing regression tests that do not include a
datalayout that would be altered by this fix. I assumed that the
lack of a datalayout in those regression files is an oversight, so
I added the minimal layout (make i32 legal) necessary to preserve
behavior on those tests.

Differential Revision: https://reviews.llvm.org/D73907
2020-02-04 07:45:48 -05:00
David Green
362d00e051 [ARM][VecReduce] Force expand vector_reduce_fmin
Under MVE, we do not have any lowering for fminimum, which a
vector_reduce_fmin without NoNan will be expanded into. As with the
other recent patches, force this to expand in the pre-isel pass. Note
that Neon lowering would be OK because the scalar fminimum uses the
vector VMIN instruction, but is probably better to just rely on the
scalar operations, which is what is done here.

Also fixes what appears to be the reversal of INF vs -INF in the
vector_reduce_fmin widening code.
2020-02-04 09:36:59 +00:00
Jessica Paquette
9effe38b22 [AArch64][GlobalISel] Fold G_XOR into TB(N)Z bit calculation
This ports the existing case for G_XOR from `getTestBitOperand` in
AArch64ISelLowering into GlobalISel.

The idea is to flip between TBZ and TBNZ while walking through G_XORs.

Let's say we have

```
tbz (xor x, c), b
```

Let's say the `b`-th bit in `c` is 1. Then

- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 0.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 1.

So, then

```
tbz (xor x, c), b == tbnz x, b
```

Let's say the `b`-th bit in `c` is 0. Then

- If the `b`-th bit in `x` is 1, the `b`-th bit in `(xor x, c)` is 1.
- If the `b`-th bit in `x` is 0, then the `b`-th bit in `(xor x, c)` is 0.

So, then

```
tbz (xor x, c), b == tbz x, b
```

Differential Revision: https://reviews.llvm.org/D73929
2020-02-03 15:22:24 -08:00
Jay Foad
2252cac694 [ANDGPU] getMemOperandsWithOffset: support BUF non-stack-access instructions with resource but no vaddr
Summary:
This enables clustering for many more BUF instructions.

Reviewers: rampitec, arsenm, nhaehnle

Subscribers: jvesely, wdng, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73868
2020-02-03 22:49:30 +00:00
Jessica Paquette
37910fd0e1 [AArch64][GlobalISel] Fold G_SHL into TB(N)Z bit calculation
This implements the following optimization:

```
(tbz (shl x, c), b) -> (tbz x, b-c)
```

Which appears in `getTestBitOperand` in AArch64ISelLowering.cpp.

If we test bit `b` of `shl x, c`, we can fold away the `shl` by looking `c` bits
to the right of `b` in `x` when this fits in the type. So, we can just test the
`b-c`th bit.

Differential Revision: https://reviews.llvm.org/D73924
2020-02-03 14:27:08 -08:00
Matt Arsenault
7d3aace3f5 AMDGPU: Add flag to control mem intrinsic expansion
GlobalISel doesn't implement the expansion for these yet, so add a
flag to force expanding these so it's possible to avoid these for a
while.
2020-02-03 14:26:01 -08:00
David Green
d05e4ff4af [ARM] MVE vector reduction fadd and fmul tests. NFC 2020-02-03 22:03:56 +00:00
Matt Arsenault
cb7b661d3d AMDGPU: Analyze divergence of inline asm 2020-02-03 12:42:16 -08:00
Matt Arsenault
2758ae41ae AMDGPU/GlobalISel: Allow selecting s128 load/stores 2020-02-03 12:28:08 -08:00
Matt Arsenault
726446a009 AMDGPU: Fix splitting wide f32 s.buffer.load intrinsics
This would witch f32 to i32, and produce an invald concat_vectors from
i32 pieces to an f32 vector.
2020-02-03 12:28:08 -08:00
David Tenty
77e71c5217 [AIX] Don't use a zero fill with a second parameter
Summary:
The AIX assembler .space directive can't take a second non-zero argument to fill
with. But LLVM emitFill currently assumes it can. We add a flag to the AsmInfo
to check if non-zero fill is supported, and if we can't zerofill non-zero values
we just splat the .byte directives.

Reviewers: stevewan, sfertile, DiggerLin, jasonliu, Xiangling_L

Reviewed By: jasonliu

Subscribers: Xiangling_L, wuzish, nemanjai, hiraditya, kbarton, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73554
2020-02-03 15:16:08 -05:00
Jessica Paquette
2bd46444d7 [AArch64][GlobalISel] Walk through G_AND in TB(N)Z bit calculation
Given

```
tb(n)z (and x, m), b
```

Where the `b`-th bit of `m` is 1,

```
tb(n)z (and x, m), b == tb(n)z x, b
```

So, we can walk past a `G_AND` in this case.

Also add test/CodeGen/AArch64/GlobalISel/opt-fold-and-tbz-tbnz.mir to test this.

Differential Revision: https://reviews.llvm.org/D73790
2020-02-03 11:53:47 -08:00
Amara Emerson
b911b99052 [AArch64][GlobalISel] Don't reconvert to p0 in convertPtrAddToAdd().
convertPtrAddToAdd improved overall code size and quality by a significant amount,
but on -O0 we generate some cross-class copies due to the fact that we emitted
G_PTRTOINT and G_INTTOPTR around the G_ADD. Unfortunately at -O0 we don't run any
register coalescing, so these cross class copies end up escaping as moves, and
we ended up regressing 3 benchmarks on CTMark (though still a winner overall).

This patch changes the lowering to instead directly emit the G_ADD into the
destination register, and then force changes the dest LLT to s64 from p0. This
should be ok, as all uses of the register should now be selected and therefore
the LLT doesn't matter for the users. It does however matter for the importer
patterns, which will fail to select a G_ADD if there's a p0 LLT.

I'm not able to get rid of the G_PTRTOINT on the source yet however. We can't
use the same trick of breaking the type system since that could break the
selection of the defining instruction. Thus with -O0 we still end up with a
cross class copy on source.

Code size improvements on -O0:
Program                                         baseline      new         diff
 test-suite :: CTMark/Bullet/bullet.test        965520       949164      -1.7%
 test-suite...TMark/7zip/7zip-benchmark.test    1069456      1052600     -1.6%
 test-suite...ark/tramp3d-v4/tramp3d-v4.test    1213692      1199804     -1.1%
 test-suite...:: CTMark/sqlite3/sqlite3.test    421680       419736      -0.5%
 test-suite...-typeset/consumer-typeset.test    837076       833380      -0.4%
 test-suite :: CTMark/lencod/lencod.test        799712       796976      -0.3%
 test-suite...:: CTMark/ClamAV/clamscan.test    688264       686132      -0.3%
 test-suite :: CTMark/kimwitu++/kc.test         1002344      999648      -0.3%
 test-suite...Mark/mafft/pairlocalalign.test    422296       421768      -0.1%
 test-suite :: CTMark/SPASS/SPASS.test          656792       656532      -0.0%
 Geomean difference                                                      -0.6%

Differential Revision: https://reviews.llvm.org/D73910
2020-02-03 11:50:22 -08:00
Matt Arsenault
cd7650c186 GlobalISel: Implement fewerElementsVector for G_SEXT_INREG
Start using a new strategy with a combination of merge and unmerges.

This allows scalarizing before lowering, which in cases like
<2 x s128> avoids producing giant illegal shifts.
2020-02-03 11:47:33 -08:00
Nikita Popov
1cc4f8d172 [ARM] Expand vector reduction intrinsics on soft float
Followup to D73135. If the target doesn't have hard float (default
for ARM), then we assert when trying to soften the result of vector
reduction intrinsics. This patch marks these for expansion as well.
(A bit odd to use vectors on a target without hard float ... but
that's where you end up if you expose target-independent vector types.)

Differential Revision: https://reviews.llvm.org/D73854
2020-02-03 18:49:12 +01:00
Jay Foad
05297b7cbe [AMDGPU] getMemOperandsWithOffset: add resource operand for BUF instructions
Summary:
This prevents unwanted clustering of BUF instructions with the same
vaddr but different resource descriptors.

Reviewers: rampitec, arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73867
2020-02-03 17:06:09 +00:00
Simon Pilgrim
61621f826a [TargetLowering] SimplifyDemandedBits - add basic KnownBits ZEXTLoad handling
We have to be careful in SimplifyDemandedBits with loads in case we attempt to combine back to a constant (which then gets turned into a constant pool load again), but we can at least set the upper KnownBits for a ZEXTLoad to zero.
2020-02-03 16:50:04 +00:00
Simon Pilgrim
8c0e715eb2 [X86] BEXTR SimplifyDemandedBitsForTargetNode - length == 0 -> result = 0 2020-02-03 16:50:03 +00:00
Kazushi (Jam) Marukawa
be9fe6aa8b [VE] (fp)trunc+store & load+(fp)ext isel
Summary: load+sext/zext/fpext and (fp)trunc+store isel legalization and tests

Reviewers: arsenm, craig.topper, rengolin, k-ishizaka

Reviewed By: arsenm

Subscribers: merge_guards_bot, wdng, hiraditya, llvm-commits

Tags: #ve, #llvm

Differential Revision: https://reviews.llvm.org/D73774
2020-02-03 16:55:44 +01:00
Simon Pilgrim
8ead5df0b1 [X86] computeKnownBitsForTargetNode - add BEXTR support (PR39153)
Add a KnownBits::extractBits helper
2020-02-03 15:43:59 +00:00
Kazushi (Jam) Marukawa
07c9f7574d [VE] vaarg functions callers and callees
Summary: Isel patterns and tests for vaarg functions as callers and callees.

Reviewers: arsenm, rengolin, k-ishizaka

Subscribers: merge_guards_bot, wdng, hiraditya, llvm-commits

Tags: #ve, #llvm

Differential Revision: https://reviews.llvm.org/D73710
2020-02-03 16:26:44 +01:00
Simon Pilgrim
241c9a50b4 [X86] Add some initial BEXTR combine tests 2020-02-03 15:16:40 +00:00
Matt Arsenault
00b22df71d AMDGPU: Fix extra type mangling on llvm.amdgcn.if.break
These have to be the same mask type.
2020-02-03 07:02:05 -08:00
John Brawn
68cf574857 [FPEnv][AArch64] Add lowering of f128 STRICT_FSETCC
These get lowered to function calls, like the non-strict versions.

Differential Revision: https://reviews.llvm.org/D73784
2020-02-03 14:39:16 +00:00
Matt Arsenault
95a9b828f3 AMDGPU/GlobalISel: Fix mem size in test
This wasn't intended to tests an extload.
2020-02-03 05:41:14 -08:00
John Brawn
b37d59353f [FPEnv][ARM] Add lowering of STRICT_FSETCC and STRICT_FSETCCS
These can be lowered to code sequences using CMPFP and CMPFPE which then get
selected to VCMP and VCMPE. The implementation isn't fully correct, as the chain
operand isn't handled correctly, but resolving that looks like it would involve
changes around FPSCR-handling instructions and how the FPSCR is modelled.

The fp-intrinsics test was already testing some of this but as the entire test
was being XFAILed it wasn't noticed. Un-XFAIL the test and instead leave the
cases where we aren't generating the right instruction sequences as FIXME.

Differential Revision: https://reviews.llvm.org/D73194
2020-02-03 12:59:12 +00:00
Simon Tatham
961530fdc9 [ARM,MVE] Fix vreinterpretq in big-endian mode.
Summary:
In big-endian MVE, the simple vector load/store instructions (i.e.
both contiguous and non-widening) don't all store the bytes of a
register to memory in the same order: it matters whether you did a
VSTRB.8, VSTRH.16 or VSTRW.32. Put another way, the in-register
formats of different vector types relate to each other in a different
way from the in-memory formats.

So, if you want to 'bitcast' or 'reinterpret' one vector type as
another, you have to carefully specify which you mean: did you want to
reinterpret the //register// format of one type as that of the other,
or the //memory// format?

The ACLE `vreinterpretq` intrinsics are specified to reinterpret the
register format. But I had implemented them as LLVM IR bitcast, which
is specified for all types as a reinterpretation of the memory format.
So a `vreinterpretq` intrinsic, applied to values already in registers,
would code-generate incorrectly if compiled big-endian: instead of
emitting no code, it would emit a `vrev`.

To fix this, I've introduced a new IR intrinsic to perform a
register-format reinterpretation: `@llvm.arm.mve.vreinterpretq`. It's
implemented by a trivial isel pattern that expects the input in an
MQPR register, and just returns it unchanged.

In the clang codegen, I only emit this new intrinsic where it's
actually needed: I prefer a bitcast wherever it will have the right
effect, because LLVM understands bitcasts better. So we still generate
bitcasts in little-endian mode, and even in big-endian when you're
casting between two vector types with the same lane size.

For testing, I've moved all the codegen tests of vreinterpretq out
into their own file, so that they can have a different set of RUN
lines to check both big- and little-endian.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73786
2020-02-03 11:20:06 +00:00
Simon Tatham
f8d4afc49a [ARM,MVE] Add intrinsics for v[id]dupq and v[id]wdupq.
Summary:
These instructions generate a vector of consecutive elements starting
from a given base value and incrementing by 1, 2, 4 or 8. The `wdup`
versions also wrap the values back to zero when they reach a given
limit value. The instruction updates the scalar base register so that
another use of the same instruction will continue the sequence from
where the previous one left off.

At the IR level, I've represented these instructions as a family of
target-specific intrinsics with two return values (the constructed
vector and the updated base). The user-facing ACLE API provides a set
of intrinsics that throw away the written-back base and another set
that receive it as a pointer so they can update it, plus the usual
predicated versions.

Because the intrinsics return two values (as do the underlying
instructions), the isel has to be done in C++.

This is the first family of MVE intrinsics that use the `imm_1248`
immediate type in the clang Tablegen framework, so naturally, I found
I'd given it the wrong C integer type. Also added some tests of the
check that the immediate has a legal value, because this is the first
time those particular checks have been exercised.

Finally, I also had to fix a bug in MveEmitter which failed an
assertion when I nested two `seq` nodes (the inner one used to extract
the two values from the pair returned by the IR intrinsic, and the
outer one put on by the predication multiclass).

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73357
2020-02-03 11:20:06 +00:00
Simon Tatham
cf7e98e6f7 [ARM,MVE] Add intrinsics for vdupq.
Summary:
The unpredicated case of this is trivial: the clang codegen just makes
a vector splat of the input, and LLVM isel is already prepared to
handle that. For the predicated version, I've generated a `select`
between the same vector splat and the `inactive` input parameter, and
added new Tablegen isel rules to match that pattern into a predicated
`MVE_VDUP` instruction.

Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard

Reviewed By: dmgreen

Subscribers: kristof.beyls, hiraditya, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D73356
2020-02-03 11:20:06 +00:00
Guillaume Chatelet
75d9994a51 Fix broken invariant
Summary:
A Copy with a source that is zeros is the same as a Set of zeros.
This fixes the invariant that SrcAlign should always be non-null.

Reviewers: courbet

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73791
2020-02-03 11:01:05 +01:00
Jay Foad
97d9a76afc [AMDGPU] Don't remove short branches over kills
Summary:
D68092 introduced a new SIRemoveShortExecBranches optimization pass and
broke some graphics shaders. The problem is that it was removing
branches over KILL pseudo instructions, and the fix is to explicitly
check for that in mustRetainExeczBranch.

Reviewers: critson, arsenm, nhaehnle, cdevadas, hakzsam

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73771
2020-02-03 09:26:52 +00:00
Craig Topper
ee85415dbb [X86] Use MVT::f80 for the result type of the FLD used to convert from SSE register to X87 register in FP_TO_INTHelper. 2020-02-02 13:24:37 -08:00
Craig Topper
246262671f [X86] Cleanup the lrint/llrint/lround/llround tests a bit.
We don't need tests for truncating the result. There's nothing
special about those truncates.

We can test llrint/llround for 64-bit and 32-bit targets in the same file.
Same with lrint/lround with i32 result result. lrint/lround with
64-bit result should only occur on a 64-bit target.

Add some missing tests for f80 conversions.
2020-02-02 11:01:05 -08:00
Simon Pilgrim
547a94ffa1 Regenerate bitcast test for upcoming patch. 2020-02-02 18:27:44 +00:00
Simon Pilgrim
0c78b64696 [X86][SSE] Add bitcast <128 x i1> %1 to <2 x i64> test case 2020-02-02 18:00:09 +00:00
Simon Pilgrim
17e91b7dd2 [X86][SSE] combineBitcastvxi1 - add pre-AVX512 v64i1 handling 2020-02-02 18:00:09 +00:00
Fangrui Song
5a56a25b0b [CodeGenPrepare] Make TargetPassConfig required
The code paths in the absence of TargetMachine, TargetLowering or
TargetRegisterInfo are poorly tested. As rL285987 said, requiring
TargetPassConfig allows us to delete many (untested) checks littered
everywhere.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D73754
2020-02-02 09:28:45 -08:00
David Green
d50e188a07 Revert "[ARM][MVE] VPT Blocks: findVCMPToFoldIntoVPS"
This reverts commit e34801c8e6df and the followup due to multiple
problems.

I've tried to keep the tests and RDA parts where possible, as those
still seem useful.
2020-02-02 13:24:05 +00:00
Fangrui Song
5932f7b8f2 [PatchableFunction] Use an empty DebugLoc
The current FirstMI.getDebugLoc() is actually null in almost all cases.
If it isn't, the generated .loc will be considered initial. The .loc
will have the prologue_end flag and terminate the prologue prematurely.

Also use an overload of BuildMI that will not prepend
PATCHABLE_FUNCTION_ENTRY to a MachineInstr bundle.
2020-02-01 14:12:06 -08:00
Nicolai Hähnle
ba8110161d AMDGPU/GFX10: Fix NSA reassign pass when operands are undef
Summary:
Virtual registers that are undef have an empty LiveInterval at this
point, which means beginIndex() and endIndex() cannot be used. We
only need those indices to determine the range in which to scan for
affected other NSA instructions, and undef operands cannot contribute
to that range.

Reviewers: arsenm, rampitec, mareko

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73831
2020-02-01 22:41:40 +01:00
Craig Topper
a57dd66d5e [X86] In X86FastEmitSSESelect, fall back to SelectionDAG if the inputs to the compare can't be found in registers.
We were checking that the original Value * for the compare operands
were null. But that can never happen.

I believe we intended to check for 0 registers here instead.

Fixes PR44749.
2020-02-01 12:24:55 -08:00
Craig Topper
943b5561d6 [LegalizeTypes][X86] Add a new strategy for type legalizing f16 type that softens it to i16, but promotes to f32 around arithmetic ops.
This is based on this llvm-dev thread http://lists.llvm.org/pipermail/llvm-dev/2019-December/137521.html

The current strategy for f16 is to promote type to float every except where the specific width is required like loads, stores, and bitcasts. This results in rounding occurring in odd places instead of immediately after arithmetic operations. This interacts in weird ways with the __fp16 type in clang which is a storage only type where arithmetic is always promoted to float. InstCombine can remove some fpext/fptruncs around such arithmetic and turn it into arithmetic on half. This wouldn't be so bad if SelectionDAG was able to put those fpext/fpround back in when it promotes.

It is also not obvious how to handle to make the existing strategy work with STRICT fp. We need to use STRICT versions of the conversions which require chain operands. But if the conversions are created for a bitcast, there is no place to get an appropriate chain from.

This patch implements a different strategy where conversions are emitted directly around arithmetic operations. And otherwise its passed around as an i16 including in arguments and return values. This can result in more conversions between arithmetic operations, but is closer to matching the IR the frontend generates for __fp16. And it will allow us to use the chain from constrained arithmetic nodes to link the STRICT_FP_TO_FP16/STRICT_FP16_TO_FP that will need to be added. I've set it up so that each target can opt into the new behavior. Converting all the targets myself was more than I was able to handle.

Differential Revision: https://reviews.llvm.org/D73749
2020-02-01 11:21:04 -08:00
Alex Richardson
24ee9c8496 Don't mark MIPS TRAP as isTerminator
This was causing machine verifier errors when compiling libunwind.

Reviewed By: atanasyan
Differential Revision: https://reviews.llvm.org/D73648
2020-02-01 15:50:22 +00:00
Matt Arsenault
c0b12916a7 AMDGPU/GlobalISel: Use more wide vector load/stores
This improves the type breakdown for some large vectors. For example,
we now get a <4 x s32> and s32 store instead of 5 s32 stores for
<5 x s32>.
2020-02-01 10:47:21 -05:00
Matt Arsenault
e3117e5c30 AMDGPU/GlobalISel: Improve legalization of wide stores
This fixes legalizations of global stores > 128-bits. It seems work is
needed on how this split actually occurs. For example, we get the
right code for s160, with an s128 and s32 load, but get 5 s32 loads
for <5 x s32>.
2020-02-01 10:47:03 -05:00
Matt Arsenault
bc101ffd77 GlobalISel: Support widening unmerge results with pointer source 2020-02-01 10:47:03 -05:00
Matt Arsenault
98aaed2980 AMDGPU/GlobalISel: Fix forming G_TRUNC with vcc result
This somehow got lost when I fixed the boolean handling.
2020-01-31 20:29:41 -05:00
Matt Arsenault
c28f1faaff AMDGPU: Switch some tests to use generated checks
Control flow tests are particularly annoying, and it's probably better
to be have comprehensive check lines for them.
2020-01-31 20:29:41 -05:00
Jessica Paquette
b9bf9305d1 [AArch64][GlobalISel] Walk through G_TRUNC in getTestBitReg
When you encounter a G_TRUNC, you are moving from a larger type to a smaller
type.

Asking for the i-th bit on a larger value is the same as asking for the i-th
bit on a smaller value.

So, we should always be able to walk through G_TRUNC when computing the bit
for a TB(N)Z.

Differential Revision: https://reviews.llvm.org/D73748
2020-01-31 11:09:55 -08:00
Simon Pilgrim
8fbc7fd567 [DAG] SimplifyMultipleUseDemandedBits - peek through unused ISD::INSERT_SUBVECTOR subvectors
If we don't demand any elements of the inserted subvector then just skip it.
2020-01-31 18:57:22 +00:00