In https://reviews.llvm.org/D159196 we avoided stack-slot scavenging
when no FP was available. But in the case where the FP is available
we should actually prefer using the FP over the BP.
This change affects more than just SME, but it should be a general
improvement: any slot above the (address pointed to by the) FP
is always closer to the FP than to the BP, so it makes sense to favour
the FP for addressing it whenever the FP is available.
This also fixes the issue for SME where this is not just preferred
but required.
An fadd reduction with
1. the fast flag set, and
2. a power-of-2 number of elements in the input vector
results in a series of faddp instructions. The faddp instruction has
latency/throughput identical to the fadd instruction and hence we set
relative cost=1 for faddp as well.
The change didn't show any regression with SPEC17-FP (C/C++) or
llvm-test-suite on Neoverse-V2.
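Roughly the shape of code this cost change models (illustrative only, not from the patch): a fast-math float sum reduction that the vectorizer turns into vector fadds plus a trailing faddp sequence.
```
// Illustrative: under fast-math the accumulation may be reassociated into
// vector fadds followed by faddp instructions, whose cost is now 1 each.
float sum(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i]; // reassociation allowed under fast-math
  return s;
}
```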
These fp128 G_FNEG operations should be treated more like G_FABS, where the
operation is lowered to simple integer arithmetic. All other operations are the
same between the two ActionDefinitionsBuilders.
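As a source-level sketch (assuming an AAPCS64 target, where long double is the IEEE 128-bit type), the negation below is an fp128 G_FNEG and can be lowered by flipping the sign bit with integer arithmetic, mirroring G_FABS.
```
// On AAPCS64, long double is IEEE binary128 (fp128). Negation only needs the
// sign bit flipped, so no libcall is required.
long double negate_fp128(long double x) {
  return -x;
}
```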
Originally I tried splitting these features in the compiler with
https://github.com/llvm/llvm-project/pull/101712, but we decided to lump
these features together in the ACLE specification (see
https://github.com/ARM-software/acle/pull/346). Since there are no
hardware implementations out there which implement ls64 without ls64_v
or ls64_accdata, this shouldn't be a regression for feature detection.
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In the future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
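A minimal sketch of the style change (the helper below is hypothetical, not code from the patch):
```
#include "llvm/ADT/ArrayRef.h"

// Hypothetical helper taking an ArrayRef.
static int sumAll(llvm::ArrayRef<int> Vals) {
  int S = 0;
  for (int V : Vals)
    S += V;
  return S;
}

int demo() {
  return sumAll({}); // preferred over sumAll(std::nullopt) after this patch
}
```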
The ldr instructions implicitly zero any upper lanes, so we can use them
for insert(zerovec, load, 0) patterns. Likewise insert(undef, load, 0)
or scalar_to_reg can reuse the scalar loads as the top bits are undef.
This patch makes sure there are patterns for each type and for each of
the normal, unaligned, roW and roX addressing modes.
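For example (purely illustrative, written with Neon intrinsics), inserting a loaded scalar into lane 0 of a zero vector is exactly the insert(zerovec, load, 0) case:
```
#include <arm_neon.h>

// Because the scalar ldr implicitly zeroes the upper lanes, this can lower to
// a single "ldr s0, [x0]" instead of a load plus an insert into a zero vector.
float32x4_t load_into_lane0(const float *p) {
  return vsetq_lane_f32(*p, vdupq_n_f32(0.0f), 0);
}
```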
This extends the existing patterns for addp to 64-bit outputs with a single
input. Whilst the general pattern is similar to the 128-bit patterns
(add(uzp1(extract_lo, extract_hi), uzp2(extract_lo, extract_hi))), by this late
stage other optimisations have turned the first uzp1 into a trunc and
the second into an extract(uzp2) with undef.
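A sketch of the shape being matched, written with generic vector extensions (illustrative; after legalization the actual DAG looks as described above):
```
typedef int v4i32 __attribute__((vector_size(16)));
typedef int v2i32 __attribute__((vector_size(8)));

// Pairwise add of a 4 x i32 vector into a 2 x i32 result: the even/odd
// deinterleave plus add is the add(uzp1, uzp2) form that can become a single
// 64-bit-output addp.
v2i32 pairwise_add(v4i32 v) {
  v2i32 even = __builtin_shufflevector(v, v, 0, 2); // uzp1 of the two halves
  v2i32 odd  = __builtin_shufflevector(v, v, 1, 3); // uzp2 of the two halves
  return even + odd;
}
```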
Fixes #109108
This commit adds the missing support for varargs in the instruction
selection pass for AAPCS. Previously we only implemented this for
Darwin.
The implementation follows AAPCS and SelectionDAG's LowerAAPCS_VASTART.
It resolves all VA_START fallbacks in RAJAperf, llvm-test-suite, and
SPEC CPU2017. These benchmarks now compile and pass without fallbacks
due to varargs.
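For reference, a minimal varargs function of the kind that previously caused a GlobalISel fallback on AAPCS targets:
```
#include <cstdarg>

// Any use of va_start like this used to fall back to SelectionDAG when
// targeting AAPCS; it is now handled in GlobalISel as well.
int sum_ints(int count, ...) {
  va_list ap;
  va_start(ap, count);
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}
```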
---------
Co-authored-by: Madhur Amilkanthwar <madhura@nvidia.com>
Using the LSR or LSL aliases of UBFM can be faster on some CPUs, so it
is worth changing 64-bit UBFM instructions that are equivalent to 32-bit
LSR/LSL operations into their 32-bit variants.
This change folds the following patterns:
* If `Imms == 31` and `Immr <= Imms`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr, Imms`
* If `Immr == Imms + 33`:
`UBFMXri %0, Immr, Imms` -> `UBFMWri %0.sub_32, Immr - 32, Imms`
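As a worked instance of the first pattern (values chosen for illustration): `UBFMXri %x, 2, 31` extracts bits [31:2], which depend only on the low 32 bits of `%x`, so it can be rewritten as `UBFMWri %x.sub_32, 2, 31`, i.e. the `lsr w, w, #2` alias.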
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
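A sketch of the arithmetic expansion, written with Neon intrinsics (illustrative; the actual lowering works on generic nodes):
```
#include <arm_neon.h>

// Per-lane signed three-way compare: the compare masks are 0 or all-ones
// (i.e. -1), so scmp(a, b) is simply (a < b) - (a > b), with no selects.
int32x4_t scmp_v4i32(int32x4_t a, int32x4_t b) {
  int32x4_t lt = vreinterpretq_s32_u32(vcltq_s32(a, b)); // -1 where a < b
  int32x4_t gt = vreinterpretq_s32_u32(vcgtq_s32(a, b)); // -1 where a > b
  return vsubq_s32(lt, gt);                              // -1, 0 or +1
}
```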
This is an implementation of the saturating fp to int conversions for
GlobalISel. On AArch64 the conversion instructions already work this way,
producing saturating results. LegalizerHelper::lowerFPTOINT_SAT is
ported from SDAG.
AArch64 has a lot of existing tests for fptosi_sat, covering a wide
range of types. I have tried to make most of them work all at once, but
a few fall back due to other missing features such as f128 handling for
min/max.
When the AArch64LoadStoreOptimizer pass merges an SP update with a
load/store instruction and needs to adjust unwind information, it must either:
* create the merged instruction at the location of the SP update
(so no CFI instructions are moved), or
* only move a CFI instruction if the move would not reorder it across
other CFI instructions
If neither of the above is possible, don't perform the optimisation.
Update the predicate protecting bfloat instructions to only reference
FEAT_SVE_B16B16, which matches the specification.
Rename and move instruction classes to match the names of the encoding
groups the bfloat arithmetic instructions belong to.
GCC compiles the built-in function `__builtin_bswap16` to the ARM
instruction rev16, which reverses the byte order of 16-bit data. Clang,
on the other hand, compiles the same built-in function to e.g.
```
rev w8, w0
lsr w0, w8, #16
```
i.e. it performs a byte reversal of the 32-bit register (which moves the
lower half, containing the 16-bit data, to the upper half) and then
right-shifts the reversed 16-bit data back into the lower half of the
register.
We can improve Clang codegen by generating `rev16` instead of `rev` and
`lsr`, like GCC.
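For reference, the affected source pattern:
```
// With this change the 16-bit byte swap lowers to a single rev16, matching
// GCC, instead of a 32-bit rev followed by an lsr.
unsigned short swap16(unsigned short x) {
  return __builtin_bswap16(x);
}
```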
This patch implements the intrinsics of the form
floatNxM_t vamin[q]_fN(floatNxM_t vn, floatNxM_t vm);
floatNxM_t vamax[q]_fN(floatNxM_t vn, floatNxM_t vm);
as defined in https://github.com/ARM-software/acle/pull/324
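A usage sketch instantiating the prototype form above for the 128-bit f32 case (assumes a compiler and target with the FEAT_FAMINMAX support added here):
```
#include <arm_neon.h>

// vaminq_f32 is the [q]/f32 instance of the vamin form above: the per-lane
// minimum of the absolute values of the two inputs. Requires the faminmax
// target feature to be enabled.
float32x4_t absolute_minimum(float32x4_t vn, float32x4_t vm) {
  return vaminq_f32(vn, vm);
}
```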
---------
Co-authored-by: Hassnaa Hamdi <hassnaa.hamdi@arm.com>
After #107201 and #107367 the codegen for zext(ld4) can use and / shift
to extract the lanes out of the original vector's elements. This avoids
the need for the expensive ld4 operations, so can lead to performance
improvements over using the interleaving loads and ushll.
This patch stops the generation of ld4 for uitofp(ld4) that would become
uitofp(zext(ld4)). It doesn't handle zext yet to make sure that widening
instructions like mull and addl are not adversely affected.
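Roughly the shape of source affected (illustrative): a strided u8 load whose values are converted to float, which previously vectorized to ld4 + ushll + ucvtf.
```
// Every 4th byte converted to float: uitofp(zext(i8)) of a deinterleaved
// load. With this change the lanes are extracted with and/shift from plain
// vector loads rather than via ld4.
void every_fourth_to_float(float *dst, const unsigned char *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (float)src[4 * i];
}
```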
Generating a tbl instruction for the zext in an expression like
mul(zext(i8), sext) is not optimal.
Instead, allowing later optimisations to generate smull(zext, sext)
performs some of the type extensions implicitly and is faster.
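An illustrative source shape for the expression in question, with one zero-extended and one sign-extended operand of a widening multiply:
```
// mul(zext(i8), sext(i8)): widening the unsigned operand with tbl would block
// forming smull(zext, sext), which does the extensions implicitly.
void widening_mul(int *dst, const unsigned char *a, const signed char *b,
                  int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = (int)a[i] * (int)b[i];
}
```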
This PR adds lowering for fixed-width <4 x i32> and <2 x i32> partial
reductions to a dot product when Neon and the dot product feature are
available.
The work is by Max Beck-Jones (@DevM-uk).
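The classic source shape that now maps onto a dot product (illustrative):
```
// An i8 x i8 multiply accumulated into an i32 sum: the vectorizer expresses
// this as a fixed-width partial reduction, which can now lower to udot when
// Neon and the dot-product feature are available.
int dot_u8(const unsigned char *a, const unsigned char *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}
```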
Similar to #107201, this comes up from the lowering of zext of
deinterleaving shuffles. Patterns such as ext(extract_subvector(uzp(a,
b))) can be converted to a simple and, performing the extract/zext from a
uzp1. A uzp2 can be handled with an extra shift, and due to the existing
legalization there may already be an and / shift in between, which can be
combined in too.
Mostly this reduces the instruction count or increases the amount of
parallelism in the sequence.
This removes a redundant 'COPY' instruction that #81716 probably forgot
to remove.
This redundant COPY led to an issue because code in
LiveRangeSplitting expects the instruction emitted by
`loadRegFromStackSlot` to be an instruction that accesses memory, which
isn't the case for the COPY instruction.
NOTE: There are no dedicated SVE instructions, but bf16->f32 is just a
left shift because the two types share the same exponent range, and from
there other convert instructions can be used.
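A scalar sketch of why the left shift is enough (bfloat16 is the top half of the f32 encoding):
```
#include <cstdint>
#include <cstring>

// bf16 -> f32: bfloat16 shares f32's exponent range and is simply the high
// 16 bits of the f32 encoding, so widening is a 16-bit left shift of the raw
// bits (the low mantissa bits become zero).
float bf16_to_f32(uint16_t bf16_bits) {
  uint32_t f32_bits = (uint32_t)bf16_bits << 16;
  float result;
  std::memcpy(&result, &f32_bits, sizeof result);
  return result;
}
```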
This patch fixes incorrect usage of the scalar+immediate variant of ld1/st1
instructions during stack allocation, caused by
[c4bac7f](c4bac7f7dc).
That commit used ld1/st1 even when the stack offset was outside the
immediate range of these instructions, producing invalid assembly, and it
also used incorrect offsets when using ld1/st1.
This is part 1 of a few patches that are intended to take deinterleaving
shuffles with masks like `[0,4,8,12]`, where the shuffle is
zero-extended to a larger size, and optimize away the deinterleave. In
this case it converts them to `and(uzp1, mask)`, where the `uzp1` acts
upon the elements in the larger type size to get the lanes into the
correct positions and the `and` performs the zext. It performs the
combine fairly late, on the legalized types, so that uitofp operations
that are converted to uitofp(zext(..)) will also be handled.
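A sketch of the pattern, written with clang vector builtins (illustrative):
```
typedef unsigned char v16u8 __attribute__((vector_size(16)));
typedef unsigned char v4u8 __attribute__((vector_size(4)));
typedef unsigned int v4u32 __attribute__((vector_size(16)));

// Take every 4th byte and zero-extend it to i32: a [0,4,8,12] deinterleaving
// shuffle feeding a zext. After this patch the shuffle becomes
// and(uzp1, mask) on the wider elements instead of a real deinterleave.
v4u32 deinterleave_zext(v16u8 v) {
  v4u8 lanes = __builtin_shufflevector(v, v, 0, 4, 8, 12);
  return __builtin_convertvector(lanes, v4u32);
}
```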
Fix incorrect use of AArch64ISD::UZP1/UUNPK{HI,LO} in:
* AArch64TargetLowering::LowerDIV
* AArch64TargetLowering::LowerINSERT_SUBVECTOR
The latter highlighted DAG combines that relied on broken behaviour,
which this patch also fixes.