420 Commits

Author SHA1 Message Date
David Green
f0e79d9152 [AArch64] Add a cost for identity shuffles.
These are mostly handled at a higher level when costing shuffles, but some
masks can end up being identity or concat masks which we can treat as free.
2024-04-09 17:16:14 +01:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Il-Capitano
308ed0233a
[Intrinsics] Make patchpoint.i64 generic on its return type (#85911)
Currently patchpoints can only have two result types, `void` and `i64`.
This limits the result to general purpose registers.
This patch makes `patchpoint.i64` an overloadable intrinsic, allowing
result values that can fit in a single register (e.g. integers,
pointers, floats).
2024-03-26 19:08:52 +05:30
Jeremy Morse
b9d83eff25
[NFC][RemoveDIs] Use iterators for insertion at various call-sites (#84736)
These are the last remaining "trivial" changes to passes that use
Instruction pointers for insertion. All of this should be NFC, it's just
changing the spelling of how we identify a position.

In one or two locations, I'm also switching uses of getNextNode etc to
using std::next with iterators. This too should be NFC.

---------

Merged by: Stephen Tozer <stephen.tozer@sony.com>
2024-03-19 16:36:29 +00:00
Graham Hunter
56abb8d535 [AArch64] Be stricter about insert/extract index
Post-commit fixup patch for a request on
https://github.com/llvm/llvm-project/pull/81135
2024-03-05 09:50:32 +00:00
Graham Hunter
03f852f704
[AArch64] Improve cost model for legal subvec insert/extract (#81135)
Currently we model subvector inserts and extracts as shuffles,
potentially going as far as scalarizing. If the types are legal then
they can just be simple zip/unzip operations, or possible even no-ops.
Change the cost to a relatively small one to ensure that simple loops
featuring such operations between fixed and scalable vector types that
are effectively the same at a given sve width can be unrolled and
further optimized.
2024-03-04 16:17:01 +00:00
Paschalis Mpeis
bbdc62e718
[AArch64][CostModel] Improve scalar frem cost (#80423)
In AArch64 the cost of scalar frem is the cost of a call to 'fmod'.
2024-02-23 09:29:45 +00:00
Usman Nadeem
267d6b5ed2
[AArch64][SVE] Instcombine uzp1/reinterpret svbool to use vector.insert (#81069)
Concatenating two predictes using uzp1 after converting to double length
using sve.convert.to/from.svbool is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
https://github.com/llvm/llvm-project/pull/78623/

Combine this pattern to use `llvm.vector.insert` to concatenate and get
rid of convert to/from svbools.
2024-02-15 10:40:09 -08:00
Alexey Bataev
7bc079c852
[TTI]Fallback to SingleSrcPermute shuffle kind, if no direct estimation for
extract subvector.

Many targets do not have cost for extractsubvector shuffle kind, but
have the costs for single source permute. If there are no costs
estimation for extractsubvector, better to switchto single source
permute for better cost estimation.

Reviewers: RKSimon, davemgreen, arsenm

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/79837
2024-02-12 07:09:49 -05:00
Sander de Smalen
d313614b60
[AArch64] Replace LLVM IR function attributes for PSTATE.ZA. (#79166)
Since https://github.com/ARM-software/acle/pull/276 the ACLE
defines attributes to better describe the use of a given SME state.

Previously the attributes merely described the possibility of it being
'shared' or 'preserved', whereas the new attributes have more semantics
and also describe how the data flows through the program.

For ZT0 we already had to add new LLVM IR attributes:
* aarch64_new_zt0
* aarch64_in_zt0
* aarch64_out_zt0
* aarch64_inout_zt0
* aarch64_preserves_zt0

We have now done the same for ZA, such that we add:
* aarch64_new_za       (previously `aarch64_pstate_za_new`)
* aarch64_in_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_out_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_inout_za (more specific variation of
`aarch64_pstate_za_shared`)
* aarch64_preserves_za (previously `aarch64_pstate_za_shared,
aarch64_pstate_za_preserved`)

This explicitly removes 'pstate' from the name, because with SME2 and
the new ACLE attributes there is a difference between "sharing ZA"
(sharing
the ZA matrix register with the caller) and "sharing PSTATE.ZA" (sharing
either the ZA or ZT0 register, both part of PSTATE.ZA with the caller).
2024-02-01 13:37:37 +00:00
Sander de Smalen
3abf55a68c
[AArch64][SME] Fix inlining bug introduced in #78703 (#79994)
Calling a `__arm_locally_streaming` function from a function that
is not a streaming-SVE function would lead to incorrect inlining.

The issue didn't surface because the tests were not testing what
they were supposed to test.
2024-01-31 11:38:29 +00:00
David Green
a2d68b4bec
[SelectOpt] Add handling for Select-like operations. (#77284)
Some operations behave like selects. For example `or(zext(c), y)` is the
same as select(c, y|1, y)` and instcombine can canonicalize the select
to the or form. These operations can still be worthwhile converting to
branch as opposed to keeping as a select or or instruction.

This patch attempts to add some basic handling for them, creating a
SelectLike abstraction in the select optimization pass. The backend can
opt into handling `or(zext(c),x)` as a select if it could be profitable,
and the select optimization pass attempts to handle them in much the
same way as a `select(c, x|1, x)`. The Or(x, 1) may need to be added as
a new instruction, generated as the or is converted to branches.

This helps fix a regression from selects being converted to or's
recently.
2024-01-22 23:46:58 +00:00
Sander de Smalen
5f41cef58f
[AArch64] NFC: Simplify discombobulating 'requiresSMChange' interface (#78703)
Having it return a `std::optional<bool>` is unnecessarily confusing.
This patch changes it to a simple 'bool'.

This patch also removes the 'BodyOverridesInterface' operand because
there is only a single use for this which is easily rewritten.
2024-01-19 16:15:38 +00:00
Florian Hahn
e473daa779
[AArch64] Improve cost computations for odd vector mem ops. (#78181)
Improve cost computaton for odd vector mem ops by breaking them down
into smaller power-of-2 parts and sum up the cost of those parts.

This fixes the current cost estimates, which for most parts
underestimated the cos, due to using getTypeLegalizationCost, which
widens to the next power-of-2 in a single step in most cases. This
doesn't reflect the actual cost.

See https://llvm.godbolt.org/z/vMsnxMf1v for codegen for the tests.

Note that there is a special case for v3i8, for which current codegen is
pretty bad, due to automatic widening to v4i8, which in turn requires
the conversion to go through memory ops in the stack. I am planning on
fixing that as a follow-up, but I am not yet sure where to best fix
this.

At the moment, there are almost no cases in which such vector operations
will be generated automatically. The motivating case is non-power-of-2
SLP vectorization: https://github.com/llvm/llvm-project/pull/77790

PR: https://github.com/llvm/llvm-project/pull/78181
2024-01-17 21:32:06 +00:00
Mark Harley
adfd13157d
[AArch64][SVE] Add optimisation for SVE intrinsics with no active lanes (#73964)
This patch introduces optimisations for SVE intrinsic function calls
which have all false predicates.
2024-01-10 11:56:52 +00:00
Sander de Smalen
81b7f115fb
[llvm][TypeSize] Fix addition/subtraction in TypeSize. (#72979)
It seems TypeSize is currently broken in the sense that:

  TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)

without failing its assert that explicitly tests for this case:

  assert(LHS.Scalable == RHS.Scalable && ...);

The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.

This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.

The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
2023-11-22 08:52:53 +00:00
Sander de Smalen
00a831421f
[AArch64][SME] Extend Inliner cost-model with custom penalty for calls. (#68416)
This is a stacked PR following on from #68415 

This patch has two purposes:
(1) It tries to make inlining more likely when it can avoid a
streaming-mode change.
(2) It avoids inlining when inlining causes more streaming-mode changes.

An example of (1) is:
```
  void streaming_compatible_bar(void);

  void foo(void) __arm_streaming {
    /* other code */
    streaming_compatible_bar();
    /* other code */
  }

  void f(void) {
    foo();            // expensive streaming mode change
  }

  ->

  void f(void) {
    /* other code */
    streaming_compatible_bar();
    /* other code */
  }
```
where it wouldn't have inlined the function when foo would be a
non-streaming function.

An example of (2) is:
```
  void streaming_bar(void) __arm_streaming;

  void foo(void) __arm_streaming {
    streaming_bar();
    streaming_bar();
  }

  void f(void) {
    foo();            // expensive streaming mode change
  }

  -> (do not inline into)

  void f(void) {
    streaming_bar();  // these are now two expensive streaming mode changes
    streaming_bar();
  }```
2023-10-31 10:28:40 +00:00
Igor Kirillov
849f963e31
[CodeGen] Improve ExpandMemCmp for more efficient non-register aligned sizes handling (#70469)
* Enhanced the logic of ExpandMemCmp pass to merge contiguous
subsequences
  in LoadSequence, based on sizes allowed in `AllowedTailExpansions`.
* This enhancement seeks to minimize the number of basic blocks and
produce
  optimized code when using memcmp with non-register aligned sizes.
* Enable this feature for AArch64 with memcmp sizes modulo 8 equal to
  3, 5, and 6.

Reapplication of #69942 after fixing a bug
2023-10-30 18:40:48 +00:00
Sander de Smalen
6d30bc0085
[AArch64][SME] Allow inlining when streaming-mode attributes dont match up. (#68415)
The use-case here is to support things like:

  int foo(int x, int y) __arm_streaming { return std::max<int>(x, y); }

where the call to non-streaming `std::max<int>(x, y)` can be safely
inlined into the streaming function.

This is a first step and will need further work to allow more cases
(e.g. more finegrained analysis of the function calls to ensure they
don't result in any incompatible instructions for the requested mode).
2023-10-30 10:47:07 +00:00
Antonio Frighetto
138e6c1c86 [AArch64][TTI] Improve LegalVF when gather loads are scalarized
After determining the cost of loads that could not be coalesced into
`VectorizedLoads` in SLP, computing the cost of a gather-vectorized
load is carried out. Favour a potentially high valid cost when the
type of a group of loads, whose type is a vector of size dependent
upon `VF`, may be legalized into a scalar value.

Fixes: https://github.com/llvm/llvm-project/issues/68953.
2023-10-27 20:22:54 +02:00
Igor Kirillov
deb429e5b0 Revert "[CodeGen] Improve ExpandMemCmp for more efficient non-register aligned sizes handling (#69942)"
This reverts commit 9bcb30d31813bbdea6b65789f64aed3f0e58d507.
2023-10-27 14:12:45 +00:00
Igor Kirillov
9bcb30d318
[CodeGen] Improve ExpandMemCmp for more efficient non-register aligned sizes handling (#69942)
* Enhanced the logic of ExpandMemCmp pass to merge contiguous
subsequences
  in LoadSequence, based on sizes allowed in `AllowedTailExpansions`.
* This enhancement seeks to minimize the number of basic blocks and
produce optimized code when using memcmp with non-register aligned sizes.
* Enable this feature for AArch64 with memcmp sizes modulo 8 equal to
  3, 5, and 6.
2023-10-27 12:41:08 +01:00
Fangrui Song
8e247b8f47 Replace TypeSize::{getFixed,getScalable} with canonical TypeSize::{Fixed,Scalable}. NFC 2023-10-27 00:30:41 -07:00
KAWASHIMA Takahiro
926173c614
[AArch64] Prevent argument promotion of vector with size > 128 bits (#70034)
This patch prevents argument promotion from promoting pointers to
fixed-length vector types larger than 128 bits like `<8 x float>` into
the values of the pointees.

Such vector types are used for SVE VLS but there is no ABI for SVE VLS
arguments and the backend cannot lower such value arguments.

Fixes #69147
2023-10-26 14:51:20 +09:00
zhongyunde 00443407
bf90ffb9b4 [SVE][InstCombine] Delete redundante sel instructions with ptrue
svsel(pture, x, y) => x. depend on D121792
Reviewed By: paulwalker-arm, david-arm
2023-10-13 09:20:36 +08:00
Alexey Bataev
263a00fa91 [COST][AARCH64]Fix crash in cost calculation for shuffles.
Need to take the mask size as number of elements, not the number of
elements of the original fixed vector. Otherwise, the compiler may
crash.
2023-10-02 07:49:03 -07:00
David Sherwood
fad69a5009
[Analysis][SVE] Improve cost model for some extending masked loads (#65957)
When performing a masked load of an unpacked SVE vector type, i.e.
nxv8i8, followed by a zero- or sign-extend to an illegal wide type
such as nxv8i32 we typically end up with a combination of an
extending masked load and pair(s) of uunpklo/hi or sunpklo/hi
instructions. For example, see test @masked_sload_8i8_8i32 in file

  CodeGen/AArch64/sve-masked-ldst-sext.ll

where

  %aval = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8(...
  %aext = sext <vscale x 8 x i8> %aval to <vscale x 8 x i32>

gets lowered to

  ld1sb { z1.h }, ...
  sunpklo z0.s, z1.h
  sunpkhi z1.s, z1.h

Currently the cost for the 'sext' operation in the example above is
1, whereas this patch changes it to 2 to reflect the pair of
instructions required. Similarly, when doing a masked load of a
nxv8i8 and extending to nxv8i64 the cost is changed to 6 to reflect
the 6 unpacks required.
2023-10-02 10:50:56 +01:00
zhongyunde
f41223eeca [AArch64][SVE2] Delete an unused parameter for isExtPartOfAvgExpr, NFC
Depend on D157628, which set the cost of extends 0 because they will fold into
the s/urhadd.

Reviewed By: kmclaughlin
Differential Revision: https://reviews.llvm.org/D159273
2023-09-01 23:40:52 +08:00
Sander de Smalen
7e815dd76d [AArch64][SME] Create new interface for isSVEAvailable.
When a function is compiled to be in Streaming(-compatible) mode, the full
set of SVE instructions may not be available. This patch adds an interface
to query that and changes the codegen for FADDA (not legal in Streaming-SVE
mode) to instead be expanded for fixed-length vectors, or otherwise not to
code-generate for scalable vectors.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D156109
2023-09-01 12:00:36 +00:00
Sander de Smalen
0a32a999ae [AArch64][SME] NFC: Rename hasNewZAInterface to hasNewZABody.
__arm_new_za is a declaration attribution, not a type attribute,
and is therefore not part of the interface of a function.
2023-08-30 13:14:42 +00:00
Kerry McLaughlin
9a98ab589a [AArch64][SVE2] Change the cost of extends with S/URHADD to 0
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.

Reviewed By: david-arm, dtemirbulatov

Differential Revision: https://reviews.llvm.org/D157628
2023-08-29 12:24:47 +00:00
Alexey Bataev
9a207578ac [TTI]Add InsertSubvector pattern in improveShuffleKindFromMask().
It improves shuffle instructions estimation and improves vectorization
outcome.

Differential Revision: https://reviews.llvm.org/D157425
2023-08-18 13:47:01 -07:00
Kerry McLaughlin
5d814b3848 Revert "[AArch64][SVE2] Change the cost of extends with S/URHADD to 0"
This reverts commit dda2cd2505301aa626fcd3e8dea2a447227d00ca.
2023-08-14 10:44:13 +00:00
Kerry McLaughlin
dda2cd2505 [AArch64][SVE2] Change the cost of extends with S/URHADD to 0
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.

Reviewed By: dtemirbulatov

Differential Revision: https://reviews.llvm.org/D157628
2023-08-14 10:32:06 +00:00
Mel Chen
425e9e81a0 [LV] Rename the Select[I|F]Cmp reduction pattern to [I|F]AnyOf. (NFC)
Regarding this NFC change, please refer to the discussion in this thread. https://reviews.llvm.org/D150851#4467261

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D155786
2023-08-03 00:37:19 -07:00
David Green
2a859b2014 [AArch64] Change the cost of vector insert/extract to 2
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.

This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).

The lower insert/extract cost will reduce the cost of insert, extracts,
shuffling and scalarization. The adjustments of ScalaizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.

We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.

Differential Revision: https://reviews.llvm.org/D155459
2023-07-28 21:26:50 +01:00
David Green
8da62b865f [AArch64] Basic vector bswap costs
This adds some basic vector bswap costs, providing the type is supported.

Differential Revision: https://reviews.llvm.org/D155806
2023-07-21 08:48:53 +01:00
Sander de Smalen
08fd44b300 [AArch64] Force streaming-compatible codegen when attributes are set.
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.

Reviewed By: paulwalker-arm, david-arm

Differential Revision: https://reviews.llvm.org/D155428
2023-07-18 10:26:00 +00:00
Sander de Smalen
ec6af93d02 [AArch64] NFC: Replace 'forceStreamingCompatibleSVE' with 'isNeonAvailable'.
The AArch64Subtarget interface 'isNeonAvailable' is more appropriate going
forward, as we may also want to generate 'streaming SVE' code (not just
'streaming-compatible SVE' code), but here we must still make sure not to
use NEON instructions which are invalid in streaming SVE mode.
2023-07-17 08:24:10 +00:00
David Green
1712ae6709 [AArch64] Improve cost of umull from known bits
As in D140287, we can now generate umull from mul(zext(x), y) in cases where we
know that the top bits of y are zero. This teaches that to the cost model,
adjusting how isWideningInstruction detects mul operations that can extend both
operands. This helps for constants and other cases where the operands of the
mul are known to be extended, but not directly extends.

Differential Revision: https://reviews.llvm.org/D154936
2023-07-12 13:13:06 +01:00
Tuan Chuong Goh
e36dd3ea8a [AArch64] Fix cost modelling for SVE Min/Max Intrinsics
Add more legal types for SMIN, SMAX, UMIN, UMAX in cost modelling for AArch64

Differential Revision: https://reviews.llvm.org/D154622
2023-07-12 07:46:12 +01:00
Youngsuk Kim
f69b9b7cce [llvm] Remove uses of Type::getPointerTo() (NFC)
Partial progress towards removing in-tree uses of `getPointerTo()`,
by employing the following options:

* Drop the call entirely if the sole purpose of it is to support
  a no-op bitcast (remove the no-op bitcast as well).

* Replace with `PointerType::get()`/`PointerType::getUnqual()`.

Also, remove no-op function `EmitBitCastOfLValueToProperType()`.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D154392
2023-07-08 13:05:58 -04:00
David Green
12025cef3e [CostModel] Use min/max intrinsics for vecreduce.min/max costs
This changes the costmodelling of the vecreduce.min/max nodes to use the costs
of the relevant min/max intrinsics instead of expanding them to compare and
selects. The getMinMaxReductionCost have changed to take a Opcode for the
relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are
no longer needed.

A follow up patch will add some basic fminimum/fmaximum costmodelling.

Differential Revision: https://reviews.llvm.org/D153547
2023-07-04 15:02:30 +01:00
David Green
5106b221c8 [AArch64] Treat the icmp in icmp(and(..), 0) as free
As in https://godbolt.org/z/4dafd9Geq, the icmp from an And may use an Ands to
set flags, meaning the icmp is free.

This could also be done for add/sub, but those patterns often happen in the
induction variable of a loop, making them quite performance sensitive.

Differential Revision: https://reviews.llvm.org/D153611
2023-07-01 21:59:54 +01:00
Igor Kirillov
17bde328d6 [LV] Add mask support for vectorizing interleaved groups
This patch extends LoopVectorize to handle the vectorization of interleaved
memory accesses with scalable vectors when mask is required or/and predicated
tail folding is enabled.

Differential Revision: https://reviews.llvm.org/D152258
2023-06-29 17:50:56 +00:00
Jolanta Jensen
5cd16e2cb7 [NFC SVE ACLE] Remove IR combines that no longer apply.
Remove IR combines that no longer apply after the SVE merging
intrinsics taking an all active predicate, have been canonicalised
to their equivalent undef (_u) variants.

Differential Revision: https://reviews.llvm.org/D153415
2023-06-22 10:26:20 +00:00
Jolanta Jensen
ecb07f481b [SVE ACLE] Implement IR combines to convert intrinsics used for _m C/C++ builtins
This patch implements IR combines to convert intrinsics used for _m C/C++ builtins
which take an all active predicate to their equivalent _u intrinsic.

Differential Revision: https://reviews.llvm.org/D152005
2023-06-21 10:35:13 +00:00
Zhongyunde
cb353dc74e [LV] Add cost model for simd vector select instructions of type float
For simd vector selects, use cmeq + bsl for v2f32/v4f32/v2f64, so their cost are cheep.
Fix https://github.com/llvm/llvm-project/issues/63082

Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D152523
2023-06-20 13:12:19 +08:00
Paul Walker
b7287a82d3 [SVE][AArch64TTI] Fix invalid mla combine that miscomputes the value of inactive lanes.
Consider: add(pg, a, mul_u(pg, b, c))

Although the multiply's inactive lanes are undefined, they don't
contribute to the final result.  The overall result of the inactive
lanes come from "a" and thus the above is another form of mla
rather than mla_u.
2023-06-18 13:07:03 +01:00
Paul Walker
c7c71aa123 [NFC][AArch64TTI] Breakout add/sub combines into discrete functions. 2023-06-18 13:07:03 +01:00