320 Commits

Author SHA1 Message Date
Krzysztof Parzyszek
86fe4dfdb6 TargetTransformInfo: convert Optional to std::optional
Recommit: added missing "#include <cstdint>".
2022-12-02 11:42:15 -08:00
Krzysztof Parzyszek
4e12d1836a Revert "TargetTransformInfo: convert Optional to std::optional"
This reverts commit b83711248cb12639e7ef7303cfbb4452b4067e85.

Some buildbots are failing.
2022-12-02 11:34:04 -08:00
Krzysztof Parzyszek
b83711248c TargetTransformInfo: convert Optional to std::optional 2022-12-02 11:27:12 -08:00
David Green
f2a92db29e [AArch64] Don't treat SVE scalable extends as free widening instructions
The logic in isWideningInstruction handles instructions like uaddw and
smull, where 'add(x, zext(y))' or 'mul(sext(x), sext(y))' can be
converted to single instructions, making the extends free. This doesn't
apply the same to SVE instructions though.
https://godbolt.org/z/695d3nhGd

(There are instructions like SMULLT/B, but they require top/bottom lane
interleaving. That is similar to MVE instructions, which required a
special pass to perform the lane interleaving).

This patch just bails out of the call to isWideningInstruction if the
vector is scalable, getting a more accurate cost.

Differential Revision: https://reviews.llvm.org/D138591
2022-11-30 13:09:48 +00:00
Nicola Lancellotti
49cd18c55e Revert "[AArch64] Canonicalize ZERO_EXTEND to VSELECT"
This reverts commit 43fe14c056458501990c3db2788f67268d1bdf38.
2022-11-28 16:37:30 +00:00
Zain Jaffal
6e4cea55f0 [AArch64] Fix cost model for udiv instruction when one of the operands is a uniform constant
Currently the model over estimates the cost of a udiv instruction with one constant. The correct cost for a udiv instruction is
insert_cost * extract_cost * num_elements

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D135991
2022-11-28 10:38:17 +02:00
Manuel Brito
1e55d5b1f2 Use poison instead of undef as placeholder for vector construction [NFC]
Differential Revision: https://reviews.llvm.org/D138450
2022-11-21 18:43:23 +00:00
Bradley Smith
daf1a1f690 [AArch64][SVE] Add instcombine to convert ptest.last/first to ptest.any
This allow for better optimization later in the backend.

This fixes the remaining missed optimizations in D137717.

Depends on D137930

Differential Revision: https://reviews.llvm.org/D137947
2022-11-15 15:59:21 +00:00
Cullen Rhodes
50621169ae [AArch64][SVE] Extend PTEST_ANY(X=OP(PG,...), X) -> PTEST_ANY(PG, X)) instcombine
Extend above instcombine added in D134946 to cover more flag-setting
instructions.

Reviewed By: peterwaller-arm

Differential Revision: https://reviews.llvm.org/D136438
2022-11-04 08:58:15 +00:00
Sander de Smalen
137459aff6 [AArch64][SME] Disable (SLP|Loop)Vectorizer when function may be executed in streaming mode.
When the SME attributes tell that a function is or may be executed in Streaming
SVE mode, we currently need to be conservative and disable _any_ vectorization
(fixed or scalable) because the code-generator does not yet support generating
streaming-compatible code.

Scalable auto-vec will be gradually enabled in the future when we have
confidence that the loop-vectorizer won't use any SVE or NEON instructions
that are illegal in Streaming SVE mode.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D135950
2022-10-19 16:42:20 +00:00
Nicola Lancellotti
43fe14c056 [AArch64] Canonicalize ZERO_EXTEND to VSELECT
Differential Revision: https://reviews.llvm.org/D135596
2022-10-17 15:42:46 +01:00
Cullen Rhodes
388cacb341 [AArch64][SVE] Add instcombine for PTEST_ANY(X=OP(PG,...), X) -> PTEST_ANY(PG, X))
Given this is an OR reduction the two are equivalent and later
optimizations (AArch64InstrInfo::optimizePTestInstr) may rewrite the
sequence to use the flag-setting variant of instruction X, to remove the
PTEST altogether.

Reviewed By: paulwalker-arm, bsmith

Differential Revision: https://reviews.llvm.org/D134946
2022-10-12 09:14:08 +00:00
Florian Hahn
eba84971ae
Revert "[AARCH64][CostModel] Modified the cost of mask vector load/store"
This reverts commit 1c62af3e23cab41074f7ce0ba86a93bea82b99b9.

The commit causes the test below to fail. Revert for now to get the bots
back to green.

Failing test:
lvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll
2022-09-28 15:35:13 +01:00
liqinweng
1c62af3e23 [AARCH64][CostModel] Modified the cost of mask vector load/store
Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D134413
2022-09-28 19:40:29 +08:00
Hassnaa Hamdi
181f200a1c [NFC]: AArch64-SVE
modify some comments
2022-09-23 12:07:31 +00:00
Caroline Concatto
5431bf27bd [AArch64]Remove svget/svset/svcreate from llvm
This patch removes the aarch64 instrinsic svget/svset/svcreate from llvm.
It also implements the InstCombine for vector.extract that used to be in svget.

Depends on: D131547

Differential Revision: https://reviews.llvm.org/D131548
2022-09-23 10:48:43 +01:00
Hassnaa Hamdi
f2072e0ae0 [AArh64-SVE]: Improve cost model for div/udiv/mul 128-bit vector operations
Differential Revision: https://reviews.llvm.org/D132477
2022-09-22 16:50:55 +00:00
David Sherwood
64bef3d568 [AArch64][SME] Disable inlining when SME attributes require smstart/smstop or lazy-save.
Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
when the callee's function body is executed in a different streaming mode than
its caller. This is needed because function calls are the boundaries for
streaming mode changes.

More details about the SME attributes and design can be found
in D131562.

Differential Revision: https://reviews.llvm.org/D131581
2022-09-21 09:35:47 +01:00
Mingming Liu
8aa800614b [AArch64][CostModel] Detects that {extract,insert}-element at lane 0 has the same cost as the other lane for vector instructions in the IR.
Currently, {extract,insert}-element has zero cost at lane 0 [1]. However, there is a cost (by fmov instruction [2], or ext/ins instruction) to move values from SIMD registers to GPR registers, when the element is used explicitly as integers.

See https://godbolt.org/z/faPE1nTn8, when fmov is generated for d* register -> x* register conversion.

Implementation-wise, add a private method `AArch64TTIImpl::getVectorInstrCostHelper` as a helper function. This way, instruction-based method could share the core logic (e.g.,
returning zero cost if type is legalized to scalar).

[1] 2cf320d41e/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (L1853)
[2] 2cf320d41e/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L8150-L8157)

Differential Revision: https://reviews.llvm.org/D128302
2022-09-09 09:47:30 -07:00
David Green
3875c38adf [AArch64] Fix formatting of the Shuffle Cost tables. NFC 2022-09-08 19:54:12 +01:00
liqinweng
723245bfac [AARCH64][COST] Improve cost of reverse shuffles for AArch64
Update the comments for reverse shuffles and add tests

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D132730
2022-09-08 18:55:49 +08:00
Eli Friedman
b219a9c0a2 [CostModel][AArch64] Fix ctpop intrinsic cost when NEON is disabled.
If we don't have NEON, we use the generic fallback, which takes 12
instructions. Make sure the costs reflect that.

(On a related note, we could optimize the generic fallback a bit. It
currently uses sequences like lsr+and+add; if we use and+lsr+add
instead, we can fold the lsr into the add.)

Differential Revision: https://reviews.llvm.org/D133154
2022-09-02 15:17:55 -07:00
Mingming Liu
242203d254 [AArch64][TTI] Add cost table entry for trunc over vector of integers.
1) Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
2) Without this, an IR instruction like trunc <8 x i16> %v to <8 x i8> is considered free and might be sinked to other basic blocks. As a result, the sinked 'trunc' is in a different basic block with its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
3) It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction to compute the operand of trunc could be faster (e.g., throughput) than the instruction corresponding to "trunc (bin-vector-op". For instance in [3], sinking %1 (as trunc operand) into bb.1 and bb.2 means to replace 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilize v1 pipeline), which is not necessarily good, especially since ushr result needs to be preserved for store operation in bb.0. Meanwhile, it's too optimistic (for CodeGenPrepare pass) to assume machine-cse will always be able to de-dup shrn from various basic blocks into one shrn.

[1] For {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4472).
    For concat (trunc, trunc) -> uzip1, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L5428-L5437)
[2] examples
    - trunc(umin(X, 255)) -> UQXTRN v8i8 (and other {u,s}x{min,max} pattern for v8i16 operands) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4515-L4528)
    - trunc (AArch64vlshr v8i16, imm) -> SHRNv8i8 (same missed for SHRNv2i32) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L6743-L6748)
[3]
    ---
    ; instruction latency / throughput / pipeline on `neoverse-n1`
    bb.0:
      %1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4>   ; ushr, latency 2, throughput 1, pipeline V1
      %2 = trunc <8 x i16> %1 to <8 x i8>  ; xtn, latency 2, throughput 2, pipeline V
      %3 = store <8 x i8> %1, ptr %addr
      br cond i1 cond, label bb.1, label bb.2

    bb.1:
      %4 = trunc <8 x i16> %1 to <8 x i8> ; xtn

    bb.2:
      %5 = trunc <8 x i16> %1 to <8 x i8> ; xtn
    ---

Differential Revision: https://reviews.llvm.org/D132784
2022-09-02 10:06:55 -07:00
Paul Walker
3bb228729f [CostModel][SVE] Correct cost model of SK_Splice shuffles for <vscale x 1 x Ty> vector types.
AArch64TTIImpl::getSpliceCost() is now used more aggressively and
LNT (MultiSource/Benchmarks/mafft) exposed a failure case for
<vscale x 1 x i1>. I've tested other element types and whilst they
can be costed they cannot be code generated, so this patch returns
InstructionCost::getInvalid() for all cases.
2022-08-26 16:06:01 +01:00
Philip Reames
c9608d57b8 [TTI] Plumb through OperandValueInfo in getMemoryOpCost [NFC]
This has the effect of exposing the power-of-two property for use in memory op costing, but no target actually uses it yet.  The main point of this change is simple consistency with the recently changes getArithmeticInstrCost, and to remove the last (interface) use of OperandValueKind.
2022-08-23 07:55:42 -07:00
Philip Reames
104fa367ee [TTI] Use OperandValueInfo in getArithmeticInstrCost implementation [NFC]
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.

This is the change which motivated the whole sequence which preceeded it.  In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact.  This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.

I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance.  For instance, every parameter which changes type in this change also changes name.  This was intentional to make sure that every call site possible effected must show up in the diff.  This let me audit each one closely.
2022-08-22 15:16:39 -07:00
Philip Reames
478cf94378 [X86][AArch64][WebAsm][RISCV] Query operand properties instead of using enums directly [nfc]
This is part of an ongoing transition to use OperandValueInfo which combines OperandValueKind and OperandValueProperties.  This change adds some accessor methods and uses them to simplify backend code.  The primary motivation of doing so is removing uses of the parameters so that an upcoming api change is less error prone.
2022-08-22 13:37:59 -07:00
David Green
0cf9e47f27 [AArch64] Add SK_Splice fixed-width costs
A fixed length SK_Splice shuffle vector is lowered to a Ext under
AArch64, which should have a cost of 1.

Differential Revision: https://reviews.llvm.org/D132299
2022-08-22 12:44:57 +01:00
Simon Pilgrim
5263155d5b [CostModel] Add CostKind argument to getShuffleCost
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.

Differential Revision: https://reviews.llvm.org/D132287
2022-08-21 10:54:51 +01:00
Alexey Bataev
d53e245951 [COST][NFC]Introduce OperandValueKind in getMemoryOpCost, NFC.
Added OperandValueKind OpdInfo parameter to getMemoryOpCost functions to
better estimate cost with immediate values.

Part of D126885.
2022-08-19 07:33:00 -07:00
Florian Hahn
b8709a9d03
[LV] Support fixed order recurrences.
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous, all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.

At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D119661
2022-08-18 19:15:52 +01:00
Daniil Fukalov
7ed3d81333 [NFCI] Move cost estimation from TargetLowering to TargetTransformInfo.
TragetLowering had two last InstructionCost related `getTypeLegalizationCost()`
and `getScalingFactorCost()` members, but all other costs are processed in TTI.

E.g. it is not comfortable to use other TTI members in these two functions
overrided in a target.

Minor refactoring: `getTypeLegalizationCost()` now doesn't need DataLayout
parameter - it was always passed from TTI.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D117723
2022-08-18 00:38:55 +03:00
Fangrui Song
de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
David Sherwood
4ef9cb6c17 [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D128342
2022-08-02 09:52:33 +01:00
Vasileios Porpodas
f669030373 [TTI][AArch64][SLP] Sets the cost of an ADD reduction 2xi64 to 2.
2xi64 is the legalized type for wide reductions (like 16xi64) and setting the
cost to 2 makes `load-reduce` and `load-zext-reduce` patterns profitable.

The few performance measurments that I did on an aarch64 machine confirm that
these patterns are actually faster when vectorized.

Differential Revision: https://reviews.llvm.org/D130740
2022-08-01 13:03:14 -07:00
chendewen
7eeb468ae5 [Aarch64] Add cost for missing extensions.
This patch adds a cost estimate for some missing sign extensions.
ref: https://reviews.llvm.org/D14730

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D130565
2022-07-28 17:34:00 +08:00
David Sherwood
f15b6b2907 [AArch64] Add target hook for preferPredicateOverEpilogue
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:

1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.

Currently the default option is "disabled", but this will be
changed in a later patch.

I've added new tests to show the options behave as expected here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Differential Revision: https://reviews.llvm.org/D129560
2022-07-21 17:20:06 +01:00
Cullen Rhodes
7c3cda551a [AArch64][SVE] Prefer SIMD&FP variant of clast[ab]
The scalar variant with GPR source/dest has considerably higher latency
than the SIMD&FP scalar variant across a variety of micro-architectures:

  Core           Scalar    SIMD&FP
  --------------------------------
  Neoverse V1     9 cyc      3 cyc
  Neoverse N2     8 cyc      3 cyc
  Cortex A510     8 cyc      4 cyc
  A64FX          29 cyc      6 cyc
2022-07-13 08:53:36 +00:00
Bradley Smith
a83aa33d1b [IR] Move vector.insert/vector.extract out of experimental namespace
These intrinsics are now fundemental for SVE code generation and have been
present for a year and a half, hence move them out of the experimental
namespace.

Differential Revision: https://reviews.llvm.org/D127976
2022-06-27 10:48:45 +00:00
David Green
fb4d3d238f [AArch64] Remove unnecessary funnel shift sve costs.
D127680 added some unnecessary funnel shift costs for AArch64 to "match
the legacy behaviour". The default costs are closer to the correct
values and line up with the scalar/neon costs better. Remove the lines
again to clean up the code, they can be added back at a later date with
better values if needed.
2022-06-21 12:21:37 +01:00
Philip Reames
db85345f2d [BasicTTI] Allow generic handling of scalable vector fshr/fshl
This change removes an explicit scalable vector bailout for fshl and fshr. This bailout was added in 60e4698b9aba8, when sinking a unconditional bailout for all intrinsics into selected cases. Its not clear if the bailout was originally unneeded, or if our cost model infrastructure has simply matured in the meantime. Either way, the generic code appears to handle scalable vectors without issue.

Note that the RISC-V cost model changes here aren't particularly interesting. They do probably better match the current lowering, but the main point is to have coverage of the BasicTTI path and simply show lack of crashing.

AArch64 costing was changed to preserve legacy behavior.  There will most likely be an upcoming change to use the generic costs there too, but I didn't want to make that change not being particularly familiar with the target.

Differential Revision: https://reviews.llvm.org/D127680
2022-06-20 10:38:51 -07:00
Tiehu Zhang
b329156f4f [AArch64][LV] AArch64 does not prefer vectorized addressing
TTI::prefersVectorizedAddressing() try to vectorize the addresses that lead to loads.
For aarch64, only gather/scatter (supported by SVE) can deal with vectors of addresses.
This patch specializes the hook for AArch64, to return true only when we enable SVE.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D124612
2022-06-17 18:32:50 +08:00
Jingu Kang
bb82f74612 Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth""
This reverts commit 42ebfa8269470e6b1fe2de996d3f1db6d142e16a.

The commmit from https://reviews.llvm.org/D125918 has fixed the stage 2 build
failure.

Differential Revision: https://reviews.llvm.org/D118979
2022-05-23 16:15:45 +01:00
Bradley Smith
5f4541fefb [AArch64][SVE] Convert SRSHL to LSL when the fed from an ABS intrinsic
Differential Revision: https://reviews.llvm.org/D125233
2022-05-19 14:07:59 +00:00
Florian Hahn
17a73992dd
[AArch64] Remove redundant f{min,max}nm intrinsics.
The patch extends AArch64TTIImpl::instCombineIntrinsic to simplify
llvm.aarch64.neon.f{min,max}nm(a, a) -> a.

This helps with simplifying code written using the ACLE, e.g.
see https://godbolt.org/z/jYxsoc89c

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D125234
2022-05-10 19:57:43 +01:00
David Green
dccc69a38d [AArch64] Add extra reverse costs.
This adds some extra costs for reverse shuffles under AArch64, filling
in the i16/f16/i8 gaps in the cost model.

Differential Revision: https://reviews.llvm.org/D124786
2022-05-06 18:23:36 +01:00
David Green
2dcb2d8562 [AArch64] Cost modelling for fptoi_sat
This builds on top of the target-independent cost model added in D124269
to add aarch64 specific costs for fptoui_sat and fptosi_sat intrinsics.
For many common types they will be legal instructions as the AArch64
instructions will saturate naturally. For unsupported pairs of integer
and floating point types, an additional min/max clamp is needed.

Differential Revision: https://reviews.llvm.org/D124357
2022-05-02 11:36:05 +01:00
David Kreitzer
6918a15f43 Test commit. Fixed a typo in a comment. 2022-04-29 16:18:09 -07:00
David Green
46cef9a82d [AArch64] Attempt to fix bots by ensuring legalized type is a vector 2022-04-27 15:36:15 +01:00
David Green
8e2a0e61f5 [AArch64] Break up larger shuffle-masks into legal sizes in getShuffleCost
Given a larger-than-legal shuffle mask, the final codegen will split
into multiple sub-vectors. This attempts to model that in
AArch64TTIImpl::getShuffleCost, splitting masks up according to the size
of the legalized vectors. If the sub-masks have at most 2 input sources
we can call getShuffleCost on them and sum the costs, to get a more
accurate final cost for the entire shuffle. The call to
improveShuffleKindFromMask helps to improve the shuffle kind for the
sub-mask cost call.

Differential Revision: https://reviews.llvm.org/D123414
2022-04-27 13:51:50 +01:00