11543 Commits

Author SHA1 Message Date
David Green
f37e132263 [ARM] Add VFP lowering for fptosi.sat
This extends D107865 to the VFP insructions, lowering llvm.fptosi.sat
and llvm.fptoui.sat to VCVT instructions that inherently perform the
saturate.

Differential Revision: https://reviews.llvm.org/D107866
2021-09-03 18:11:08 +01:00
David Green
9cb8f4d1ad [ARM] Add a tail-predication loop predicate register
The semantics of tail predication loops means that the value of LR as an
instruction is executed determines the predicate. In other words:

mov r3, #3
DLSTP lr, r3        // Start tail predication, lr==3
VADD.s32 q0, q1, q2 // Lanes 0,1 and 2 are updated in q0.
mov lr, #1
VADD.s32 q0, q1, q2 // Only first lane is updated.

This means that the value of lr cannot be spilled and re-used in tail
predication regions without potentially altering the behaviour of the
program. More lanes than required could be stored, for example, and in
the case of a gather those lanes might not have been setup, leading to
alignment exceptions.

This patch adds a new lr predicate operand to MVE instructions in order
to keep a reference to the lr that they use as a tail predicate. It will
usually hold the zeroreg meaning not predicated, being set to the LR phi
value in the MVETPAndVPTOptimisationsPass. This will prevent it from
being spilled anywhere that it needs to be used.

A lot of tests needed updating.

Differential Revision: https://reviews.llvm.org/D107638
2021-09-02 13:42:58 +01:00
David Green
49476a4d66 [ARM] Add MVE lowering for fptosi.sat
This adds lowering of the llvm.fptosi.sat and llvm.fptoui.sat intinsics,
selecting a VCVT instruction which under MVE will inherently perform the
saturate.

Differential Revision: https://reviews.llvm.org/D107865
2021-09-01 22:38:47 +01:00
Philip Reames
29fa37ec9f [SCEV] If max BTC is zero, then so is the exact BTC [2 of 2]
This extends D108921 into a generic rule applied to constructing ExitLimits along all paths. The remaining paths (primarily howFarToZero) don't have the same reasoning about UB sensitivity as the howManyLessThan ones did. Instead, the remain cause for max counts being more precise than exact counts is that we apply context sensitive loop guards on the max path, and not on the exact path. That choice is mildly suspect, but out of scope of this patch.

The MVETailPredication.cpp change deserves a bit of explanation. We were previously figuring out that two SCEVs happened to be equal because the happened to be identical. When we optimized one with context sensitive information, but not the other, we lost the ability to prove them equal. So, cover this case by subtracting and then applying loop guards again. Without this, we see changes in test/CodeGen/Thumb2/mve-blockplacement.ll

Differential Revision: https://reviews.llvm.org/D109015
2021-09-01 11:51:48 -07:00
David Green
22c384129e [ARM] Add missing validForTailPredication for VMINNM/VMAXNM
Apparently this was missing, preventing the generation of tail
predication loops containing VMINNM, VMAXNM, VMINNMA and VMAXNMA.
2021-08-31 18:19:03 +01:00
Simon Wallis
f417b660ee [Arm] Add assert in T2 Imm7s code emitter
Add assert to provoke failure in object file output, not just in disassembly output.

Reviewed By: yroux

Differential Revision: https://reviews.llvm.org/D107259
2021-08-31 08:16:48 +01:00
David Green
efa340fbd2 [ARM] Workaround tailpredication min/max costmodel
The min/max intrinsics are not yet canonical, but when they are the tail
predications analysis will change from treating them like icmp to
treating them like intrinsics. Unfortunately, they can currently produce
better code by not being tail predicated thanks to the vectorizer picking
higher VF's and the backend folding to better instructions (especially
for saturate patterns). In the long run we will need to improve the
vectorizers cost modelling, recognizing the instruction directly, but in
the meantime this treats min/max as before to prevent performance
regressions.
2021-08-30 19:19:51 +01:00
Nikita Popov
0529e2e018 [InstrInfo] Use 64-bit immediates for analyzeCompare() (NFCI)
The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInst() as well. This avoids truncation for
targets that support immediates larger 32-bit. In particular, we
can avoid the bugprone value normalization hack in the AArch64
target.

This is a followup to D108076.

Differential Revision: https://reviews.llvm.org/D108875
2021-08-30 19:46:04 +02:00
Nick Desaulniers
5c91b98c5d [ARMISelLowering] avoid emitting libcalls to __mulodi4()
__has_builtin(__builtin_mul_overflow) returns true for 32b ARM targets,
but Clang is deferring to compiler RT when encountering `long long`
types. This breaks sanitizer builds of the Linux kernel that are using
__builtin_mul_overflow with these types for these targets.

If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.

This will still need to be worked around in the Linux kernel in order to
continue to support allmodconfig builds of the Linux kernel for this
target with older releases of clang.

Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://github.com/ClangBuiltLinux/linux/issues/1438

Reviewed By: rengolin

Differential Revision: https://reviews.llvm.org/D108842
2021-08-27 15:14:47 -07:00
Martin Storsjö
039b469b85 [ARM] Allow using ';' as asm statement separator in MSVC mode
This does the same as D96259, but for ARM, just like AArch64,
using the same comment char as for ELF and MinGW mode.

As the assembly input/output of LLVM is GAS style, trying to
match what MS armasm.exe does isn't needed (because the comment
char used is the least concern when it comes to that; all
directives differ too). If a separate armasm compatible mode
is implemented, it can use its own comment style (just like
llvm-ml implements MS ml.exe compatible assembly parsing).

This fixes building compiler-rt assembly files for ARM in MSVC
mode.

The updated testcase literals-comments.s was only intended to
make sure that '#' isn't interpreted as a comment char.

Differential Revision: https://reviews.llvm.org/D107251
2021-08-24 11:01:49 +03:00
David Green
605489d593 [ARM] Fix VQDMULH fold for scalar smin
Add a variant of mve-vqdmulh tests that uses min/max intrinsics
directly, including a scalar test that shows it misbehaving for min
intrinsics and a fix for the combine to prevent it from misbehaving.
2021-08-21 16:33:18 +01:00
Arthur Eubanks
46cf82532c [NFC] Replace Function handling of attributes with less confusing calls
To avoid magic constants and confusing indexes.
2021-08-17 21:05:40 -07:00
Simon Pilgrim
1e770f0388 [ARM] ARMDAGToDAGISel::tryReadRegister/tryWriteRegister - don't dereference dyn_cast<> results.
dyn_cast<> can return nullptr if the cast is illegal, use cast<> instead which will assert that the cast is correct.

Fixes static analyser warnings.
2021-08-17 18:40:59 +01:00
David Green
52e0cf9d61 [ARM] Enable subreg liveness
This enables subreg liveness in the arm backend when MVE is present,
which allows the register allocator to detect when subregister are
alive/dead, compared to only acting on full registers. This can helps
produce better code on MVE with the way MQPR registers are made up of
SPR registers, but is especially helpful for MQQPR and MQQQQPR
registers, where there are very few "registers" available and being able
to split them up into subregs can help produce much better code.

Differential Revision: https://reviews.llvm.org/D107642
2021-08-17 14:10:33 +01:00
David Green
62e892fa2d [ARM] Add MQQPR and MQQQQPR spill and reload pseudo instructions
As a part of D107642, this adds pseudo instructions for MQQPR and
MQQQQPR register classes, that can spill and reloads entire registers
whilst keeping them combined, not splitting them into multiple D subregs
that a VLDMIA/VSTMIA would use. This can help certain analyses, and
helps to prevent verifier issues with subreg liveness.
2021-08-17 13:51:34 +01:00
David Green
9236dea255 [ARM] Create MQQPR and MQQQQPR register classes
Similar to the MQPR register class as the MVE equivalent to QPR, this
adds MQQPR and MQQQQPR register classes for the MVE equivalents of QQPR
and QQQQPR registers. The MVE MQPR seemed have worked out quite well,
and adding MQQPR and MQQQQPR allows us to a little more accurately
specify the number of registers, calculating register pressure limits a
little better.

Differential Revision: https://reviews.llvm.org/D107463
2021-08-16 22:58:12 +01:00
Simon Pilgrim
d6fe8d37c6 [DAG] Fold concat_vectors(concat_vectors(x,y),concat_vectors(a,b)) -> concat_vectors(x,y,a,b)
Follow-up to D107068, attempt to fold nested concat_vectors/undefs, as long as both the vector and inner subvector types are legal.

This exposed the same issue in ARM's MVE LowerCONCAT_VECTORS_i1 (raised as PR51365) and AArch64's performConcatVectorsCombine which both assumed concat_vectors only took 2 subvector operands.

Differential Revision: https://reviews.llvm.org/D107597
2021-08-16 16:06:54 +01:00
Arthur Eubanks
92ce6db9ee [NFC] Rename AttributeList::hasFnAttribute() -> hasFnAttr()
This is more consistent with similar methods.
2021-08-13 11:09:18 -07:00
David Green
ae9a346ef8 [ARM] Fix DAG combine loop in reduction distribution
Given a constant operand, the MVE and DAGCombine combines could fight,
each redistributing in the opposite order. Add a guard to the MVE
vecreduce distribution to prevent that.
2021-08-12 16:37:39 +01:00
David Green
8c50b5fbfe [ARM] Add extra debug messages for validating live outs. NFC
We are running into more and more cases where the liveouts of low
overhead loops do not validate. Add some extra debug messages to make it
clearer why.
2021-08-11 10:35:53 +01:00
David Green
c140ff493e [ARM] Change a couple of instances of LiveRegs.contains to !LiveRegs.available
This changes a couple of calls to LiveRegs.contains to
!LiveRegs.available, one in Thumb1FrameLoweringInfo (which modifies a
test to look more correct to me, given r7 should be the frame pointer so
is not available), and another in the ARMLoadStoreOptimizer, that I
don't have a test for, it was just found by inspection.

Differential Revision: https://reviews.llvm.org/D107454
2021-08-10 09:53:26 +01:00
Amara Emerson
2b067e3335 Change TargetLowering::canMergeStoresTo() to take a MF instead of DAG.
DAG is unnecessary and we need this hook to implement store merging on GlobalISel too.
2021-08-06 12:57:53 -07:00
David Green
77e8f4eeee [ARM] Define ComplexPatternFuncMutatesDAG
Some of the Arm complex pattern functions call canExtractShiftFromMul,
which can modify the DAG in-place. For this to be valid and handled
successfully we need to define ComplexPatternFuncMutatesDAG.

Differential Revision: https://reviews.llvm.org/D107476
2021-08-06 17:35:11 +01:00
Simon Pilgrim
dbce6a8d9d [ARM] Fold insert_subvector to concat_vectors
D107068 fixed the same problem on aarch64 but the arm variant wasn't exposed in existing test coverage.

I've copied the arm64-neon-copy tests (and stripped the intrinsic test from it) for testing on arm neon builds as well.
2021-08-06 11:21:31 +01:00
Amara Emerson
4fee756c75 Delete copy-ctor of MachineFrameInfo.
I just hit a nasty bug when writing a unit test after calling MF->getFrameInfo()
without declaring the variable as a reference.

Deleting the copy-constructor also showed a place in the ARM backend which was
doing the same thing, albeit it didn't impact correctness there from the looks of it.
2021-08-05 23:24:37 -07:00
Igor Kudrin
2c14798ead [ARM][llvm-objdump] Annotate PC-relative memory operands of VLDR instructions
This extends D105979 and adds support for VLDR instructions.

Differential Revision: https://reviews.llvm.org/D105980
2021-08-05 14:11:11 +07:00
Igor Kudrin
ddbe812bcc [ARM][llvm-objdump] Annotate PC-relative memory operands
This implements `MCInstrAnalysis::evaluateMemoryOperandAddress()` for
Arm so that the disassembler can print the target address of memory
operands that use PC+immediate addressing.

Differential Revision: https://reviews.llvm.org/D105979
2021-08-05 14:11:11 +07:00
Tomas Matheson
40650f27b5 [ARM][atomicrmw] Fix CMP_SWAP_32 expand assert
This assert is intended to ensure that the high registers are not
selected when it is passed to one of the thumb UXT instructions. However
it was triggering even for 32 bit where no UXT instruction is emitted.

Fixes PR51313.

Differential Revision: https://reviews.llvm.org/D107363
2021-08-04 15:02:02 +01:00
Arthur Eubanks
ad25344620 [MC][CodeGen] Emit constant pools earlier
Previously we would emit constant pool entries for ldr inline asm at the
very end of AsmPrinter::doFinalization(). However, if we're emitting
dwarf aranges, that would end all sections with aranges. Then if we have
constant pool entries to be emitted in those same sections, we'd hit an
assert that the section has already been ended.

We want to emit constant pool entries before emitting dwarf aranges.
This patch splits out arm32/64's constant pool entry emission into its
own MCTargetStreamer virtual method.

Fixes PR51208

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D107314
2021-08-03 20:55:31 -07:00
Roman Lebedev
6f6e9a867f
[BasicTTIImpl][LoopUnroll] getUnrollingPreferences(): emit ORE remark when advising against unrolling due to a call in a loop
I'm not sure this is the best way to approach this,
but the situation is rather not very detectable unless we explicitly call it out when refusing to advise to unroll.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D107271
2021-08-03 00:57:26 +03:00
David Green
c423a586a7 [ARM] Remove setPreservesCFG from ARMBlockPlacement
As of 28293918409dd3a5a it no longer preserves the CFG, needing to
split blocks in order to add DLS instructions.
2021-08-02 14:15:45 +01:00
Simon Pilgrim
0579050116 Fix MSVC signed/unsigned comparison warning. NFCI. 2021-08-02 11:23:43 +01:00
David Green
2829391840 [ARM] Revert WLSTP to DLSTP if the target block is out of range
If the block target for a WLSTP instruction is known to be out of range,
and cannot be fixed by the ARMBlockPlacementPass, we can relax it to a
DLSTP (and cmp/branch) to still allow the creation of tail predicated
loops. That is what this patch does, adding extra revert code to the
fallback path of ARMBlockPlacementPass.

Due to the code produced when reverting, this creates a DLSTP between a
Bcc and a Br. As a DLS isn't necessarily a terminator we need to split
the block to move the DLS/Br into.

Differential Revision: https://reviews.llvm.org/D104709
2021-08-02 10:59:52 +01:00
David Green
15a1d7e839 [ARM] Switch order of creating VADDV and VMLAV.
It can be beneficial to attempt to try the larger VMLAV patterns before
VADDV, in case both may match the same code.
2021-07-31 16:28:52 +01:00
David Green
69cdadddec [ARM] Distribute reductions based on ascending load offset
This distributes reductions based on the relative offset of loads, if
one is found from their operands. Given chains of reductions this will
then sort them in ascending load order, which in turn can help simple
prefetches latch on to increasing strides more easily.

Differential Revision: https://reviews.llvm.org/D106569
2021-07-30 19:50:07 +01:00
David Green
532d05b714 [ARM] Attempt to distribute reductions
This adds a combine for adds of reductions, distributing them so that
they occur sequentially to enable better use of accumulating VADDVA
instructions. It combines:
  add(X, add(vecreduce(Y), vecreduce(Z))) ->
    add(add(X, vecreduce(Y)), vecreduce(Z))
and
  add(add(A, reduce(B)), add(C, reduce(D))) ->
    add(add(add(A, C), reduce(B)), reduce(D))

These together distribute the add's so that more reductions can be
selected to VADDVA.

Differential Revision: https://reviews.llvm.org/D106532
2021-07-30 14:48:31 +01:00
David Green
4b56306762 [ARM] Turn vecreduce_add(add(x, y)) into vecreduce(x) + vecreduce(y)
Under MVE we can use VADDV/VADDVA's to perform integer add reductions,
so it can be beneficial to use more reductions than summing subvectors
and reducing once. Especially for VMLAV/VMLAVA the mul can be
incorporated into the reduction, producing less instructions.

Some of the test cases currently get larger due to extra integer adds,
but will be improved in a followup patch.

Differential Revision: https://reviews.llvm.org/D106531
2021-07-30 10:10:41 +01:00
David Green
d4a2daa919 [ARM] Define a couple more ssub indexes. NFC
Same as 91bd3ad128f7b3b28bd98242e9a5df214eb04eea, this doesn't really
change anything but gives the registers better names than the ones
tablegen would define. And fills in the missing gaps.
2021-07-29 23:00:35 +01:00
David Green
41cedb1c9a [LV][ARM] Tighten up MLA reduction costing
This makes a couple of changes to the costing of MLA reduction patterns,
to more accurately cost various patterns that can come up from
vectorization.

 - The Arm implementation of getExtendedAddReductionCost is altered to
   only provide costs for legal or smaller types. Larger than legal types
   need to be split, which currently does not work very well, especially
   for predicated reductions where the predicate may be legal but needs to
   be split. Currently we limit it to legal or smaller input types.
 - The getReductionPatternCost has learnt that reduce(ext(mul(ext, ext))
   is a pattern that can come up, and can be treated the same as
   reduce(mul(ext, ext)) providing the extension types match.
 - And it has been adjusted to not count the ext in reduce(mul(ext, ext))
   as part of a reduce(mul) pattern.

Together these changes help to more accurately cost the mla reductions
in cases such as where the extend types don't match or the extend
opcodes are different, picking better vector factors that don't result
in expanded reductions.

Differential Revision: https://reviews.llvm.org/D106166
2021-07-28 12:50:58 +01:00
Anirudh Prasad
a8cfa4b9bd [SystemZ][z/OS] Initial code to generate assembly files on z/OS
- This patch consists of the bare basic code needed in order to generate some assembly for the z/OS target.
- Only the .text and the .bss sections are added for now.
- The relevant MCSectionGOFF/Symbol interfaces have been added. This enables us to print out the GOFF machine code sections.
- This patch enables us to add simple lit tests wherever possible, and contribute to the testing coverage for the z/OS target
- Further improvements and additions will be made in future patches.

Reviewed By: tmatheson

Differential Revision: https://reviews.llvm.org/D106380
2021-07-27 11:29:15 -04:00
David Green
54c91c0c74 [ARM] Implement isLoad/StoreFromStackSlot for MVE stack stores accesses
This implements the isLoadFromStackSlot and isStoreToStackSlot for MVE
MVE_VSTRWU32 and MVE_VLDRWU32 functions. They behave the same as many
other loads/stores, expecting a FI in Op1 and zero offset in Op2. At the
same time this alters VLDR_P0_off and VSTR_P0_off to use the same code
too, as they too should be returning VPR in Op0, take a FI in Op1 and
zero offset in Op2.

Differential Revision: https://reviews.llvm.org/D106797
2021-07-27 09:11:58 +01:00
David Green
010f8e3057 [ARM] Ensure correct regclass in distributing postinc
The register class required for some MVE loads/stores is more
constrained than the register we use when creating postinc. Make sure we
constrain the register class to keep the code correct.
2021-07-26 14:26:38 +01:00
David Sherwood
0aff1798b5 [Analysis] Add simple cost model for strict (in-order) reductions
I have added a new FastMathFlags parameter to getArithmeticReductionCost
to indicate what type of reduction we are performing:

  1. Tree-wise. This is the typical fast-math reduction that involves
  continually splitting a vector up into halves and adding each
  half together until we get a scalar result. This is the default
  behaviour for integers, whereas for floating point we only do this
  if reassociation is allowed.
  2. Ordered. This now allows us to estimate the cost of performing
  a strict vector reduction by treating it as a series of scalar
  operations in lane order. This is the case when FP reassociation
  is not permitted. For scalable vectors this is more difficult
  because at compile time we do not know how many lanes there are,
  and so we use the worst case maximum vscale value.

I have also fixed getTypeBasedIntrinsicInstrCost to pass in the
FastMathFlags, which meant fixing up some X86 tests where we always
assumed the vector.reduce.fadd/mul intrinsics were 'fast'.

New tests have been added here:

  Analysis/CostModel/AArch64/reduce-fadd.ll
  Analysis/CostModel/AArch64/sve-intrinsics.ll
  Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D105432
2021-07-26 10:26:06 +01:00
David Green
ba42f6a4b5 [ARM] Pass SelectionDAG to methods that dont require DCI. NFC
In these methods DCI is never used, only the DAG from it. Pass the DAG
directly, cleaning up the code a little.
2021-07-21 22:11:09 +01:00
Tim Northover
19d2e42be2 ARM: don't return by popping PC if we have to adjust the stack afterwards.
In mandatory tail calling conventions we might have to deallocate stack
space used by our arguments before return. This happens after popping
CSRs, so the pop cannot be turned into the return itself in this case.

The else branch here was already a nop, so removing it as a tidy-up.
2021-07-21 09:35:14 +01:00
David Green
5561ad8b36 [ARM] Remove PromotedBitwiseVT for NEON types
This removes the promotion of NEON AND, OR and XOR nodes to v2i32/v4i32,
treating them the same as the AArch64 and MVE backends where we just add
the relevant patterns for each legal type. This prevents a lot of
bitcasts from being added to the DAG, which have the potential to make
optimizations more difficult. It does mean adding extra patterns, and
some codegen can change due to the types now being legal, not promoted.

Differential Revision: https://reviews.llvm.org/D105588
2021-07-19 16:36:33 +01:00
David Green
eb1e95dbdf [ARM] Extend more reductions during lowering
This relaxes the VMLAV and VADDV reduction recognition code to handle
smaller than legal types, extending them as needed. That was already
handled for some reductions, this extends it to more types in a more
generic way. If a smaller than legal value is found it is extended to
the legal type as needed.

Differential Revision: https://reviews.llvm.org/D106051
2021-07-19 08:58:03 +01:00
David Green
5acddf5b09 [ARM] Lower non-extended small gathers via truncated gathers.
Corollary to 1113e06821e6baffc84b8caf96a28bf62e6d28dc this allows us to
match gather that dont produce a full vector width results. They use an
extended gather which is truncated back to the original type.
2021-07-17 22:38:31 +01:00
David Green
ad8e75caa2 [ARM] Fix for matching reductions that are both sext and zext.
Fix a silly mistake that was not making sure that _both_ operands were
the correct extend code.
2021-07-16 23:11:42 +01:00
Mehdi Amini
76374573ce Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer
We can build it with -Werror=global-constructors now. This helps
in situation where libSupport is embedded as a shared library,
potential with dlopen/dlclose scenario, and when command-line
parsing or other facilities may not be involved. Avoiding the
implicit construction of these cl::opt can avoid double-registration
issues and other kind of behavior.

Reviewed By: lattner, jpienaar

Differential Revision: https://reviews.llvm.org/D105959
2021-07-16 07:38:16 +00:00