The patch extends AArch64TTIImpl::instCombineIntrinsic to simplify
llvm.aarch64.neon.f{min,max}nm(a, a) -> a.
This helps with simplifying code written using the ACLE, e.g.
see https://godbolt.org/z/jYxsoc89c
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D125234
This adds some extra costs for reverse shuffles under AArch64, filling
in the i16/f16/i8 gaps in the cost model.
Differential Revision: https://reviews.llvm.org/D124786
This builds on top of the target-independent cost model added in D124269
to add aarch64 specific costs for fptoui_sat and fptosi_sat intrinsics.
For many common types they will be legal instructions as the AArch64
instructions will saturate naturally. For unsupported pairs of integer
and floating point types, an additional min/max clamp is needed.
Differential Revision: https://reviews.llvm.org/D124357
Given a larger-than-legal shuffle mask, the final codegen will split
into multiple sub-vectors. This attempts to model that in
AArch64TTIImpl::getShuffleCost, splitting masks up according to the size
of the legalized vectors. If the sub-masks have at most 2 input sources
we can call getShuffleCost on them and sum the costs, to get a more
accurate final cost for the entire shuffle. The call to
improveShuffleKindFromMask helps to improve the shuffle kind for the
sub-mask cost call.
Differential Revision: https://reviews.llvm.org/D123414
Given a shuffle with 4 elements size 16 or 32, we can use the costs
directly from the PerfectShuffle tables to get a slightly more accurate
cost for the resulting shuffle.
Differential Revision: https://reviews.llvm.org/D123409
Before this patch `Args` was used to pass a broadcat's arguments by SLP.
This patch changes this. `Args` is now used for passing the operands of
the shuffle.
Differential Revision: https://reviews.llvm.org/D124202
The original patch (https://reviews.llvm.org/D121354) targets x86 and adjusts
the lookahead score of splat loads ad they can be done by the `movddup`
instruction that combines the load and the broadcast and is cheap to execute.
A similar issue shows up on AArch64. The `ld1r` instruction performs a broadcast
load and is cheap to execute.
This patch implements the TargetTransformInfo hooks for AArch64.
Differential Revision: https://reviews.llvm.org/D123638
This reverts commit 64b6192e812977092242ae34d6eafdcd42fea39d.
This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage:
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.
An insert subvector under aarch64 can often be done as a single lane mov
operation. For example a v4i8 inserted into a v16i8 is a s-reg mov, so
long as the index is a multiple of 4. This teaches the cost model that,
using code copied over from the X86 backend.
Some of the costs (v16i16_4_0) are still high because they get matched
as a SK_Select, not an SK_InsertSubvector. D120879 has some codegen
tests for inserting subvectors, which I were added as
llvm/test/CodeGen/AArch64/insert-subvector.ll.
Differential Revision: https://reviews.llvm.org/D120880
The cost of a v2i64 multiply was special cased in D92208 as scalarized
into 4*extract + 2*insert + 2*mul. Scalarizing to/from gpr registers are
expensive though, and the cost wasn't high enough to prevent vectorizing
in places where it can be detrimental for performance. This increases it
so that the costs of copying to/from GPRs is increased to 2 each, with
the total cost increasing to 14. So long as umull/smull are handled
correctly (as in D123006) this seems to lead to better vectorization
factors and better performance.
Differential Revision: https://reviews.llvm.org/D123007
A vector mul(sext, sext) or mul(zext, zext) will be code generated as a
single smull or umull instruction. This most notably effects v2i64
multiplies, which are otherwise not legal and need to be expanded.
The oneuse check has also been slightly changed, as it is already
checked from the use of isWideningInstruction in getCastInstrCost.
Differential Revision: https://reviews.llvm.org/D123006
InstCombine llvm.aarch64.sve.sel to select. This allows an existing instCombine
added in 20b0fa91c9ee to fire.
Differential Revision: https://reviews.llvm.org/D121792
Currently the cost model under-estimates the cost of certain
FP16 conversions.
This patch updates getCastInstrCost to return more accurate costs for
the cases improved in c2ed9fd05479.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D113700
The costs of vector shifts was 2 as opposed to 1, as the nodes are
marked custom. Fix this like the others and mark the nodes as cheap.
Differential Revision: https://reviews.llvm.org/D120773
In Clang we can attach TBAA metadata based on the load/store intrinsics
based on the operation's element type.
This also contains changes to InstCombine where the AArch64-specific
intrinsics are transformed into generic LLVM load/store operations,
to ensure that all metadata is transferred to the new instruction.
There will be some further work after this patch to also emit TBAA
metadata for SVE's gather/scatter- and struct load/store intrinsics.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D119319
The current cost-model overestimates the cost of vector compares &
selects for ordered floating point compares. This patch fixes that by
extending the existing logic for integer predicates.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D118256
We have some bitcasts which we know will be simplified,
so their cost is zero.
Reviewed By: david-arm, sdesmalen
Differential Revision: https://reviews.llvm.org/D118019
Generated code resulted in redundant aarch64.sve.convert.to.svbool
calls for AArch64 Binary Operations. Narrow the more precise operands
instead of widening the less precise operands
Differential Revision: https://reviews.llvm.org/D116730
A lot of neon intrinsics work lane-wise, meaning that non-demanded
elements in and not demanded out. This teaches that to
AArch64TTIImpl::simplifyDemandedVectorEltsIntrinsic for some simple
single-input truncate intrinsics, which can help remove unnecessary
instructions.
Differential Revision: https://reviews.llvm.org/D117097
If we are inserting into or extracting from a scalable vector we do
not know the number of elements at runtime, so we can only let the
index wrap for fixed-length vectors.
Tests added here:
Analysis/CostModel/AArch64/sve-insert-extract.ll
Differential Revision: https://reviews.llvm.org/D117099
Similar to D116732, this adds basic scalar sadd_with_overflow,
uadd_with_overflow, ssub_with_overflow and usub_with_overflow costs for
aarch64, which are usually quite efficiently lowered.
Differential Revision: https://reviews.llvm.org/D116734
This adds some AArch64 specific smul_with_overflow and umul_with_overflow
costs, overriding the default costs. The code generation for these mul
with overflow intrinsics is usually better than the default expansion on
AArch64. The costs come from https://godbolt.org/z/zEzYhMWqo with various
types, or llvm/test/CodeGen/AArch64/arm64-xaluo.ll.
Differential Revision: https://reviews.llvm.org/D116732
This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.
Differential Revision: https://reviews.llvm.org/D115143
This patch extends the "is all active predicate" check to cover
cases where the predicate is casted but in a way that doesn't
change its "all active" status.
Differential Revision: https://reviews.llvm.org/D115047
We can not swap multiplicand and multiplier because the sve intrinsics
are predicated. Imagine lanes in vectors having the following values:
pg = 0
multiplicand = 1 (from dup)
multiplier = 2
The resulting value should be 1, but if we swap multiplicand and multiplier it will become 2,
which is incorrect.
Differential Revision: https://reviews.llvm.org/D114577
InstCombine AArch64 LD1/ST1 to llvm.masked.load/llvm.masked.store
and LD1/ST1 to load/store when a ptrue all predicate pattern operand
is present.
This allows existing IR optimizations such as dead-load removal to
occur.
Differential Revision: https://reviews.llvm.org/D113489