In this optimisation, the Chain and Glue from the original CopyFromReg
was being lost by this optimisation, which resulted in miscompiles.
This fix just ensures that the input chains are correctly updated, and
that any any users are also updated with the new chain from the new
CopyFromReg.
Fixes#60510.
Differential Revision: https://reviews.llvm.org/D143713
After https://reviews.llvm.org/rGff4027d152d0 and
https://reviews.llvm.org/rG7d15212b8c0c we saw crashes in SelectionDAG
when trying to use these constraints when you don't have the fp16 or
bf16 extensions.
However, it is still possible to move 16-bit floating point values into
the right place in S registers with a normal `vmov`, even if we don't
have fp16 instructions we can use within the inline assembly string.
This patch therefore fixes the crash.
I think the reason we weren't getting this crash before is because I
think the __fp16 and __bf16 types got an error diagnostic in the Clang
frontend when you didn't have the right architectural extensions to use
them. This restriction was recently relaxed.
The approach for bf16 needs a bit more explanation. Exactly how BF16 is
legalized was changed in rGb769eb02b526e3966847351e15d283514c2ec767 -
effectively, whether you have the right instructions to get a bf16 value
into/out of a S register with MoveTo/FromHPR depends on hasFullFP16, but
whether you use a HPR for a value of type MVT::bf16 depends on hasBF16.
This is why the tests are not changed by `+bf16` vs `-bf16`, but I've
left both sets of RUN lines in case this changes in the future.
Test Changes:
- Added more testing for testing inline asm (the core part)
- fp16-promote.ll and pr47454.ll show improvements where unnecessary
fp16-fp32 up/down-casts are no longer emitted. This results in fewer
libcalls where those casts would be done with a libcall.
- aes-erratum-fix.ll is fairly noisy, and I need to revisit this test so
that the IR is more minimal than it is right now, because most of the
changes in this commit do not relate to what AES is actually trying to
verify.
Differential Revision: https://reviews.llvm.org/D143711
This function was added for ARM targets, but aligning global/stack pointer
arguments passed to memcpy/memmove/memset can improve code size and
performance for all targets that don't have fast unaligned accesses.
This adds a generic implementation that adjusts the alignment to pointer
size if unaligned accesses are slow.
Review D134168 suggests that this significantly improves performance on
synthetic benchmarks such as Dhrystone on RV32 as it avoids memcpy() calls.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D134282
This looks for vaddv(shuffle) or vmlav(shuffle, shuffle), with a shuffle where
all the lanes are used once. Due to the reduction being commutative the shuffle
can be removed.
Differential Revision: https://reviews.llvm.org/D143382
This removes the FlattenVectorShuffle that folds shuffles through certain
binops. This is now handled by generic DAG combines for all but ARMISD::VQDMULH
where a PerformVQDMULHCombine is added to compensate. It pushes identical
shuffles down through the operation, in a similar way to the other combines in
DAG.
Armv8.1-M can be configured to support the integer subset of the MVE
vector instructions, and no floating point. In that situation, the FP
and vector registers still exist, and so do the load, store and move
instructions that transfer data in and out of them. So there's no
reason the hard floating point ABI can't be supported, and you might
reasonably want to use it, for the sake of intrinsics-based code
passing explicit MVE vector types between functions.
But the selection of the hard float ABI in the backend was gated on
Subtarget->hasVFP2Base(), which is false in the case of integer MVE
and no FP.
As a result, you'd silently get the soft float ABI even if you
deliberately tried to select it, e.g. with clang options such as
--target=arm-none-eabi -mfloat-abi=hard -march=armv8.1m.main+nofp+mve
The hard float ABI should have been gated on the weaker condition
Subtarget->hasFPRegs(), because the only requirement for being able to
pass arguments in the FP registers is that the registers themselves
should exist.
I haven't added a new test, because changing the existing
CodeGen/Thumb2/float-ops.ll test seemed sufficient. But I've added a
comment explaining why the results are expected to be what they are.
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D142703
Change MCInstrDesc::operands to return an ArrayRef so we can easily use
it everywhere instead of the (IMHO ugly) opInfo_begin and opInfo_end.
A future patch will remove opInfo_begin and opInfo_end.
Also use it instead of raw access to the OpInfo pointer. A future patch
will remove this pointer.
Differential Revision: https://reviews.llvm.org/D142213
The existing lowering of i1 vector shuffle was only considering
single-source shuffles, always assuming the second was undef. This
extends that to properly handle both operands.
I'm helping with the remaining regressions on D127115, and one of my candidate fixes caused some regressions with MVE interleaved shuffles due to poor handling of 'truncation' style shuffle masks (0,2,4,6,...).
This patch attempts to use the ARMISD::MVETRUNC node to handle these cases, based off existing code in LowerTruncate.
It handles both (0,2,4,6,...) and (1,3,5,7,....) 'top' style patterns (assuming no endian problems). I shift down the 'top' patterns - a basic search of ARM docs suggests MVE has some top/bottom truncation/narrowing instructions but I don't seem to be able to get them to be used.
Differential Revision: https://reviews.llvm.org/D141791
https://reviews.llvm.org/D140493 is going to teach SROA how to promote allocas
that have variably-indexed loads. That does bring up questions of cost model,
since that requires creating wide shifts.
Indeed, our legalization for them is not optimal.
We either split it into parts, or lower it into a libcall.
But if the shift amount is by a multiple of CHAR_BIT,
we can also legalize it throught stack.
The basic idea is very simple:
1. Get a stack slot 2x the width of the shift type
2. store the value we are shifting into one half of the slot
3. pad the other half of the slot. for logical shifts, with zero, for arithmetic shift with signbit
4. index into the slot (starting from the base half into which we spilled, either upwards or downwards)
5. load
6. split loaded integer
This works for both little-endian and big-endian machines:
https://alive2.llvm.org/ce/z/YNVwd5
And better yet, if the original shift amount was not a multiple of CHAR_BIT,
we can just shift by that remainder afterwards: https://alive2.llvm.org/ce/z/pz5G-K
I think, if we are going perform shift->shift-by-parts expansion more than once,
we should instead go through stack, which is what this patch does.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D140638
Use deduction guides instead of helper functions.
The only non-automatic changes have been:
1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There was a few similar situation across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).
Per reviewers' comment, some useless makeArrayRef have been removed in the process.
This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.
Differential Revision: https://reviews.llvm.org/D140955
Address the inconsistency between FLT_ROUNDS_ and SET_ROUNDING SDAG
node. Rename FLT_ROUNDS_ to GET_ROUNDING and add llvm.get.rounding
intrinsic to replace flt.rounds.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D139507
A target can return if a misaligned access is 'fast' as defined
by the target or not. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.
A target can still define it as it wants and the direct translation
of the current code uses 0 and 1 for current false and true. This
makes the change an NFC.
Subsequent patch will start using an actual value of speed in
the load/store vectorizer to compare if a vectorized access going
to be not just fast, but not slower than before.
Differential Revision: https://reviews.llvm.org/D124217
Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.
Differential Revision: https://reviews.llvm.org/D114174
The instruction icmp ule <4 x i32> %0, zeroinitializer will usually be
simplified to icmp eq <4 x i32> %0, zeroinitializer. It is not
guaranteed though, and the code for lowering vector compares could pick
the wrong form of the instruction if this happened. I've tried to make
the code more explicit about the supported conditions.
This fixes NEON being unable to select VCMPZ with HS conditions, and
fixes some incorrect MVE patterns.
Fixes#58514.
Differential Revision: https://reviews.llvm.org/D136447
fp16 and bf16 values can be used in GCC's inline assembly using the "w"
constraint, which means "VFP floating-point registers d0-d31" - fp16 and
bf16 values are stored in S registers (which alias the D registers).
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 'w' constraint.
Differential Revision: https://reviews.llvm.org/D135662
fp16 and bf16 values can be used in GCC's inline assembly using the "t"
constraint, which means "VFP floating-point registers s0-s31" - fp16 and
bf16 values are stored in S registers too.
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 't' constraint.
Fixes#57753
Differential Revision: https://reviews.llvm.org/D134553
The `CodeGenPrepare` pass can sink bitwise `and` used by compare to
zero into the basic blocks where the users are. This operation is
guarded by lowering hook, which is disabled for ARM. In the ARM
architecture versions from v7-M up these two operations can be folded
into `tst rN, #imm` instruction. Sinking of `and` can also enable
the cmov-to-bfi DAG combiner.
This patch fixes some benchmark regressions caused
by https://reviews.llvm.org/D129370 as well scoring slightly better overall.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D134360
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduced amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
For remainder:
If (1 << (Bitwidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (Bitwidth / 2) urem. If (BitWidth /2) is a legal integer
type, this urem will be expand by DAGCombiner using multiply by magic
constant. We do have to take into account that adding high and low
together can produce a carry, making it a (BitWidth / 2)+1 bit number.
So we need to also add back in the carry from the first addition.
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
This is based on the section "Remainder by Summing Digits" in
Hacker's delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
does urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
Also remove new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
Differential Revision: https://reviews.llvm.org/D133429
When the only ADR instruction we have is the 16-bit thumb one then all
constant pool entries need to be 4-byte aligned, as tADR has an offset
that's a multiple of 4.
It looks like previously there happened to be no situations in which
we encountered a constant pool entry with alignment less than 4, so
failing to do this didn't cause any problems, but the expansion of
cttz to a table added by D128911 does use a constant pool with
alignment 1, so we now need to handle it correctly.
Differential Revision: https://reviews.llvm.org/D133199
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.
For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.
Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.
Differential Revision: https://reviews.llvm.org/D132520
TragetLowering had two last InstructionCost related `getTypeLegalizationCost()`
and `getScalingFactorCost()` members, but all other costs are processed in TTI.
E.g. it is not comfortable to use other TTI members in these two functions
overrided in a target.
Minor refactoring: `getTypeLegalizationCost()` now doesn't need DataLayout
parameter - it was always passed from TTI.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D117723
ROPI/RWPI are not supported with LOAD_STACK_GUARD currently.
Reviewed By: nickdesaulniers, rengolin
Differential Revision: https://reviews.llvm.org/D131427
This adds a +atomic-32 target feature, which instructs LLVM to assume
that lock-free 32-bit atomics are available for this target, even
if they usually wouldn't be.
If only atomic loads/stores are used, then this won't emit libcalls.
If atomic CAS is used, then the user is responsible for providing
any necessary __sync implementations (e.g. by masking interrupts
for single-core privileged use cases).
See https://reviews.llvm.org/D120026#3674333 for context on this
change. The tl;dr is that the thumbv6m target in Rust has
historically made atomic load/store only available, which is
incompatible with the change from D120026, which switched these to
use libatomic.
Differential Revision: https://reviews.llvm.org/D130480