1796 Commits

Author SHA1 Message Date
LiqinWeng
98e5962b7c
[RISCV][CostModel] Add cost for fabs/fsqrt of type bf16/f16 (#118608) 2025-01-10 17:22:51 +08:00
Shao-Ce SUN
369c61744a
[RISCV] Fix the cost of llvm.vector.reduce.and (#119160)
I added some CodeGen test cases related to reduce. To maintain
consistency, I also added cases for instructions like
`vector.reduce.or`.

For cases where `v1i1` type generates `VFIRST`, please refer to:
https://reviews.llvm.org/D139512.
2025-01-10 10:10:42 +08:00
David Green
f22441c14d [AArch64] Add sve div and rem tests. NFC 2025-01-09 08:46:13 +00:00
David Green
89483403c3 [AArch64] Add additional div and rem test coverage. NFC 2025-01-08 21:37:11 +00:00
David Green
a8dab1aa03
[AArch64] Add a subvector extract cost. (#121472)
These can generally be emitted using an ext instruction or mov from the
high half. The high-half extracts can be free depending on the users,
but that is not handled here, just the basic costs. This originally
included all subvector extracts, but was toned down to just
half-vector extracts to help the mid end avoid breaking up high/low
extracts, without the SLP vectorizer creating a mess using other
shuffles.
2025-01-08 08:13:07 +00:00
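The half-vector special case above can be sketched as a toy cost function (illustration only, not LLVM's TTI API; the names and unit costs are invented for this example):

```python
def subvector_extract_cost(vec_len, sub_len, index):
    """Toy model of the commit's idea: extracting exactly the low or high
    half of a vector is a single ext/mov (cost 1); anything else falls
    back to a pessimistic per-element extract."""
    is_half_extract = sub_len * 2 == vec_len and index in (0, sub_len)
    if is_half_extract:
        return 1  # one ext / mov from the high half (may even be free for some users)
    return sub_len  # fallback: extract each element individually

print(subvector_extract_cost(4, 2, 2))  # high-half extract -> 1
print(subvector_extract_cost(4, 2, 1))  # unaligned extract -> 2
```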
Simon Pilgrim
a5e129ccde [CostModel][X86] getVectorInstrCost - correctly cost v4f32 insertelement into index 0
This is just the MOVSS instruction (SSE41 INSERTPS is still necessary for index != 0)

This exposed an issue in VectorCombine::foldInsExtFNeg - we need to use the more general SK_PermuteTwoSrc shuffle kind to allow getShuffleCost to match other shuffle kinds (not just SK_Select).
2025-01-07 12:23:45 +00:00
Simon Pilgrim
5a7dfb4659 [CostModel][X86] Attempt to match v4f32 shuffles that map to MOVSS/INSERTPS instruction
improveShuffleKindFromMask matches this as a SK_InsertSubvector of a v1f32 (which legalises to f32) into a v4f32 base vector, making it easy to recognise. MOVSS is limited to index0.
2025-01-07 11:31:44 +00:00
David Green
2db7b314da
[AArch64] Add BF16 fpext and fptrunc costs. (#119524)
This expands the recently added fp16 fpext and fpround costs to bf16.
Some of the costs are taken from the rough number of instructions
needed, some are a little aspirational.
https://godbolt.org/z/bGEEd1vsW
2025-01-07 09:39:08 +00:00
Simon Pilgrim
db88071a8b
[CostModel][X86] Attempt to match cheap v4f32 shuffles that map to SHUFPS instruction (#121778)
Avoid always assuming the worst for v4f32 2 input shuffles, and match the SHUFPS pattern where possible - each pair of output elements must come from the same source register.
2025-01-06 17:54:36 +00:00
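The SHUFPS constraint ("each pair of output elements must come from the same source register") can be checked with a small predicate; this is an illustrative sketch, not the actual X86 cost-model code:

```python
def matches_shufps(mask):
    """For a v4f32 two-source shuffle (indices 0-3 = src0, 4-7 = src1,
    -1 = undef), SHUFPS requires output elements 0-1 to come from a
    single register and elements 2-3 to come from a single register."""
    def src(i):
        return i // 4  # which source register an element index refers to
    lo_srcs = {src(m) for m in mask[:2] if m >= 0}
    hi_srcs = {src(m) for m in mask[2:] if m >= 0}
    return len(lo_srcs) <= 1 and len(hi_srcs) <= 1

print(matches_shufps([0, 1, 4, 5]))  # True: lo pair from src0, hi pair from src1
print(matches_shufps([0, 4, 1, 5]))  # False: lo pair mixes both sources
```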
Simon Pilgrim
7cdbde70fa
[CostModel][X86] getShuffleCost - use processShuffleMasks for all shuffle kinds to legal types (#120599) (#121760)
Now that processShuffleMasks can correctly handle 2 src shuffles, we can completely remove the shuffle kind limits and correctly recognize the number of active subvectors per legalized shuffle - improveShuffleKindFromMask will determine the shuffle kind for each split subvector.
2025-01-06 13:32:55 +00:00
ShihPo Hung
366e836051 [RISCV][NFC] precommit test for fcmp with f16 2025-01-03 01:45:27 -08:00
David Green
3ddc9f06ae [AArch64] Additional shuffle subvector-extract cost tests. NFC
A Phase Ordering test for intrinsic shuffles is also added, showing a recent
regression from vector combining.
2025-01-02 10:13:51 +00:00
David Green
641a786bed [AArch64] Add codegen shuffle-select test. NFC
This splits the shuffle-select CostModel test into a separate CodeGen test and
removes the codegen from the CostModel version. An extra fp16 test is added
too.
2025-01-02 08:05:44 +00:00
Alexey Bataev
07d284d4eb
[SLP]Add cost estimation for gather node reshuffling
Adds cost estimation for the variants of the permutations of the scalar
values used in gather nodes. Currently, SLP just unconditionally emits
shuffles for the reused buildvectors, but in some cases it is better to
leave them as buildvectors rather than shuffles, if the cost of such
buildvectors is lower.

X86, AVX512, -O3+LTO
Metric: size..text

Program                                                                        size..text
                                                                               results     results0    diff
                 test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   912998.00   913238.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   203070.00   203102.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1396320.00  1396448.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1396320.00  1396448.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   309790.00   309678.00 -0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12477607.00 12470807.00 -0.1%

CINT2006/445.gobmk - extra code vectorized
MiBench/consumer-lame - small variations
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vectorized code
Benchmarks/Bullet - extra code vectorized
CFP2017rate/526.blender_r - extra vector code

RISC-V, sifive-p670, -O3+LTO
CFP2006/433.milc - regressions, should be fixed by https://github.com/llvm/llvm-project/pull/115173
CFP2006/453.povray - extra vectorized code
CFP2017rate/508.namd_r - better vector code
CFP2017rate/510.parest_r - extra vectorized code
SPEC/CFP2017rate - extra/better vector code
CFP2017rate/526.blender_r - extra vectorized code
CFP2017rate/538.imagick_r - extra vectorized code
CINT2006/403.gcc - extra vectorized code
CINT2006/445.gobmk - extra vectorized code
CINT2006/464.h264ref - extra vectorized code
CINT2006/483.xalancbmk - small variations
CINT2017rate/525.x264_r - better vectorization

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/115201
2024-12-24 15:35:29 -05:00
Alexey Bataev
fbdf652d97 [RISCV][COST]Add several test with known vlen, NFC 2024-12-23 12:55:45 -08:00
Simon Pilgrim
611401c115 [CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 10:39:45 +00:00
Simon Pilgrim
091448e3c1
Revert "[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types" (#120707)
Reverts llvm/llvm-project#120599 - some recent tests are currently failing
2024-12-20 10:06:03 +00:00
Simon Pilgrim
81e63f9e0c
[CostModel][X86] getShuffleCost - use processShuffleMasks to split SK_PermuteTwoSrc shuffles to legal types (#120599)
processShuffleMasks can now correctly handle 2 src shuffles, so we can use the existing SK_PermuteSingleSrc splitting cost logic to handle SK_PermuteTwoSrc as well and correctly recognise the number of active subvectors per legalised shuffle.
2024-12-20 09:55:11 +00:00
Simon Pilgrim
9bb1d0369c
[X86] getShuffleCost - when splitting shuffles, if a whole vector source is just copied we should treat this as free. (#120561)
If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free.

We already do something similar for broadcasts / whole subvector insertion + extraction - it's purely an issue for register allocation.
2024-12-19 12:55:44 +00:00
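The "whole vector copied is free" check amounts to spotting an identity run over one legalized register in a split submask; a minimal sketch (invented helper name, not the LLVM code):

```python
def submask_is_free_copy(sub, legal_w):
    """True when the submask is an identity run over a single legalized
    subvector (e.g. [4,5,6,7] with legal_w=4): the whole register is
    copied unchanged, so the commit above costs it as free."""
    first = sub[0]
    if first < 0 or first % legal_w != 0:
        return False  # undef lead or misaligned start: not a whole-register copy
    return all(m == first + i for i, m in enumerate(sub))

print(submask_is_free_copy([4, 5, 6, 7], 4))  # True: whole-register copy, free
print(submask_is_free_copy([5, 4, 7, 6], 4))  # False: needs a real permute
```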
David Sherwood
eaf482f012
[AArch64] Tweak truncate costs for some scalable vector types (#119542)
== We were previously returning an invalid cost when truncating
anything to <vscale x 2 x i1>, which is incorrect since we can
generate perfectly good code for this.

== The costs for truncating legal or unpacked types to predicates
seemed overly optimistic. For example, when truncating
<vscale x 8 x i16> to <vscale x 8 x i1> we typically do
something like

  and z0.h, z0.h, #0x1
  cmpne   p0.h, p0/z, z0.h, #0

I guess it might depend upon whether the input value is
generated in the same block or not and if we can avoid the
inreg zero-extend. However, it feels safe to take the more
conservative cost here.

== The costs for some truncates such as

  trunc <vscale x 2 x i32> %a to <vscale x 2 x i16>

were 1, whereas in actual fact they are free and no instructions
are required.

== Also, for this

  trunc <vscale x 8 x i32> %a to <vscale x 8 x i16>

it's just a single uzp1 instruction so I reduced the cost to 1.

In general, I've added costs for all cases where the destination
type is legal or unpacked. One unfortunate side effect of this
is the costs for some fixed-width truncates when using SVE now
look too optimistic.
2024-12-19 10:07:41 +00:00
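The cases spelled out above can be summarised as a small cost table; the entries paraphrase the commit message (the table itself is an illustration, not the actual AArch64 cost-table code):

```python
# (source type, destination type) -> assumed instruction count, per the
# commit's reasoning for SVE truncates.
SVE_TRUNC_COST = {
    ("nxv8i16", "nxv8i1"): 2,   # and z0.h, #0x1  +  cmpne p0.h
    ("nxv2i32", "nxv2i16"): 0,  # free: no instructions required
    ("nxv8i32", "nxv8i16"): 1,  # a single uzp1
}

print(SVE_TRUNC_COST[("nxv8i16", "nxv8i1")])
```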
Simon Pilgrim
89e530a27c [CostModel][X86] Update shuffle non-pow-2 tests to not analyse shuffle(undef,undef)
Avoid shuffle patterns that can be folded away.
2024-12-16 18:32:13 +00:00
Fangrui Song
cd12922235 [test] Change llc -march= to -mtriple=
Similar to 806761a7629df268c8aed49657aeccffa6bca449

-march= is error-prone when running on a host whose OS is different.
2024-12-15 13:08:02 -08:00
Elvis Wang
a674209432
[RISCV][TTI] Model the cost of insert/extractelement when the vector splits into multiple register groups and the idx exceeds a single group. (#118401)
This patch implements the cost for the case where the vector needs to be
split into multiple register groups and the index exceeds a single group.

For extract element, we need to store split vectors to stack and load
the target element.
For insert element, we need to store split vectors to stack and store
the target element and load vectors back.

After this patch, the cost of insert/extract element will be close to the
generated assembly.
2024-12-12 09:26:19 +08:00
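The stack round-trip described above can be modelled with simple arithmetic; the per-operation unit costs below are placeholders for illustration, not the RVV cost model's actual numbers:

```python
def extractelement_cost(num_groups, store=1, load=1):
    """Extract past the first register group: spill every group to the
    stack, then load the one target element back."""
    return num_groups * store + load

def insertelement_cost(num_groups, store=1, load=1):
    """Insert past the first group: spill every group, store the scalar
    into its slot, then reload all the groups."""
    return num_groups * store + store + num_groups * load

print(extractelement_cost(4))  # 4 group stores + 1 element load -> 5
print(insertelement_cost(4))   # 4 stores + 1 scalar store + 4 loads -> 9
```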
Ricardo Jesus
2fe30bc669
[AArch64] Add cost model for @experimental.vector.match (#118512)
The base cost approximates the expansion code in SelectionDAGBuilder.

For the AArch64 cases that don't need generic expansion, fixed-length
search vectors have a higher cost than scalable vectors due to the extra
instructions to convert the boolean mask.
2024-12-11 07:51:11 +00:00
David Green
2f18b5ef03
[AArch64] Add fpext and fpround costs (#119292)
This adds some basic costs for fpext and fpround, many of which were
already handled by the generic costing routines but this does make some
adjustments for larger vector types that can use fcvtn+fcvtn2, as
opposed to fcvtn+fcvtn+concat.

These should now more closely match the codegen from
https://godbolt.org/z/r3P9Mf8ez, for example.
2024-12-11 06:26:41 +00:00
Simon Pilgrim
4f933277a5
[CostModel][X86] Improve cost estimation of insert_subvector shuffle patterns of legalized types (#119363)
In cases where the base/sub vector type in an insert_subvector pattern legalize to the same width through splitting, we can assume that the shuffle becomes free as the legalized vectors will not overlap.

Note this isn't true if the vectors have been widened during legalization (e.g. v2f32 insertion into v4f32 would legalize to v4f32 into v4f32).

Noticed while working on adding processShuffleMasks handling for SK_PermuteTwoSrc.
2024-12-10 16:28:56 +00:00
Shao-Ce SUN
95b6524e5c [NFC] [RISCV] Add tests for llvm.vector.reduce.* 2024-12-10 10:20:21 +08:00
David Green
ca884009e4 [AArch64] Add test coverage of fp16 and bf16 fptrunc and fpext. NFC
Some of the scalable tests have been split off to make the tests more
manageable. AArch64TTIImpl::getCastInstrCost is also formatted to avoid the need
to fight against CI.
2024-12-09 23:41:18 +00:00
Lei Huang
a13ec9cd54
[PowerPC] Update data layout alignment of i128 to 16 (#118004)
Fix 64-bit PowerPC part of
https://github.com/llvm/llvm-project/issues/102783.
2024-12-09 18:02:24 -05:00
Alexey Bataev
b9aa155d26
[TTI][X86]Fix detection of the shuffles from the second shuffle operand only
If the shuffle mask uses only indices from the second shuffle operand,
the processShuffleMasks function currently misses it, which prevents
correct cost estimation in this corner case. To fix this, we need to
raise the limit to 2 * VF rather than just VF and adjust the processing
correspondingly. This will allow future improvements for two-source
permutations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/118972
2024-12-06 12:27:00 -05:00
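The corner case is easy to state as a predicate over the mask; a minimal sketch of the detection (invented function name, not the processShuffleMasks code):

```python
def uses_only_second_source(mask, vf):
    """A two-source shuffle mask indexes 0..vf-1 into the first operand
    and vf..2*vf-1 into the second (-1 = undef), so recognising a
    second-source-only shuffle requires scanning up to 2*VF, not VF."""
    used = [m for m in mask if m >= 0]
    return bool(used) and all(vf <= m < 2 * vf for m in used)

print(uses_only_second_source([4, 5, 6, 7], 4))  # True: all from operand 2
print(uses_only_second_source([0, 5, 6, 7], 4))  # False: index 0 is operand 1
```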
LiqinWeng
46829e5430
[RISCV][CostModel] Correct the cost of some reductions (#118072)
Reductions include: and/or/max/min
2024-12-04 17:26:54 +08:00
Dominik Steenken
866b9f43a0
[SystemZ] Add realistic cost estimates for vector reduction intrinsics (#118319)
This PR adds more realistic cost estimates for these reduction
intrinsics

- `llvm.vector.reduce.umax`
- `llvm.vector.reduce.umin`
- `llvm.vector.reduce.smax`
- `llvm.vector.reduce.smin`
- `llvm.vector.reduce.fadd`
- `llvm.vector.reduce.fmul`
- `llvm.vector.reduce.fmax`
- `llvm.vector.reduce.fmin`
- `llvm.vector.reduce.fmaximum`
- `llvm.vector.reduce.fminimum`
- `llvm.vector.reduce.mul`
The pre-existing cost estimates for `llvm.vector.reduce.add` are moved
to `getArithmeticReductionCosts` to reduce complexity in
`getVectorIntrinsicInstrCost` and enable other passes, like the SLP
vectorizer, to benefit from these updated calculations.

These are not expected to provide noticeable performance improvements and
are rather provided for the sake of completeness and correctness. This
PR is in draft mode pending benchmark confirmation of this.

This also provides and/or updates cost tests for all of these
intrinsics.

This PR was co-authored by me and @JonPsson1 .
2024-12-03 17:08:51 +01:00
LiqinWeng
eb3f1aec6e
[TTI][RISCV] Implement cost of some intrinsics with LMUL (#117874)
Intrinsics include:
sadd_sat/ssub_sat/uadd_sat/usub_sat/fabs/fsqrt/cttz/ctlz/ctpop
2024-12-03 10:17:52 +08:00
Simon Pilgrim
94df95de6b
[TTI][X86] getShuffleCosts - for SK_PermuteTwoSrc, if the masks are known to be "inlane" no need to scale the costs by worst-case legalization (#117999)
SK_PermuteTwoSrc legalization has to assume any of the legalised source registers could be referenced in split shuffles, but if we already know that each 128-bit lane only references elements from the same lane of the source operands, then this scaling won't occur.

Hopefully this can help with #113356 without us having to get full processShuffleMasks canonicalization finished first.
2024-12-01 12:01:47 +00:00
David Green
fe04290482
[AArch64] Change the default vscale-for-tuning to 1. (#117174)
Most AArch64 cpus outside of Neoverse V1 (256) and A64FX (512) have an
SVE vector length of 128, and in environments like Android (where no
mcpu option is common) we would expect all cpus to match. This patch
changes the default vector length to 128 with -mcpu=generic, to match
the most common case.
2024-11-29 17:41:05 +00:00
David Green
6f4b4f41ca [AArch64] Remove LoopVectorizer/AArch64/scatter-cost.ll test. NFC
This test checks the costs, not vectorization, so is better placed in the
existing gather/scatter cost modelling tests. An extra neoverse-v2 check line
has been added for both gathers and scatters.
2024-11-29 14:38:36 +00:00
David Green
d714b221c7
[AArch64] Guard against getRegisterBitWidth returning zero in vector instr cost. (#117749)
If the getRegisterBitWidth is zero (such as in sme streaming functions),
then we could hit a crash from using % RegWidth.
2024-11-29 04:01:03 +00:00
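The fix is a guard before the modulo/division; the spirit of it, as a hedged sketch (helper name and fallback are invented for illustration):

```python
def lanes_per_register(reg_bit_width, elt_bits):
    """getRegisterBitWidth can legitimately return 0 (e.g. in SME
    streaming functions), so guard before dividing by it instead of
    crashing; the caller falls back to a default cost on None."""
    if reg_bit_width == 0:
        return None
    return reg_bit_width // elt_bits

print(lanes_per_register(128, 32))  # 4 lanes in a 128-bit register
print(lanes_per_register(0, 32))    # None: caller uses a fallback cost
```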
Simon Pilgrim
79dab3f5eb [CostModel][X86] Add shuffle 'splat' tests for broadcasts of non-zero element index.
As noticed on #115201 - it's possible for SK_Broadcast to occur for a non-zero element index, which we don't currently handle.
2024-11-28 13:25:55 +00:00
LiqinWeng
c3377af4c3
[RISCV][CostModel] add cost for cttz/ctlz under the non-zvbb (#117515) 2024-11-26 11:40:52 +08:00
Phoebe Wang
2ab84a60ff
[X86][FP16][BF16] Improve vectorization of fcmp (#116153) 2024-11-26 08:17:48 +08:00
Luke Lau
15fadeb2aa
[RISCV] Add cost for @llvm.experimental.vp.splat (#117313)
This is split off from #115274. There doesn't seem to be an easy way to
share this with getShuffleCost since that requires passing in a real
insert_element operand to get it to recognise it's a scalar splat.

For i1 vectors we can't currently lower them so it returns an invalid
cost.

---------

Co-authored-by: Shih-Po Hung <shihpo.hung@sifive.com>
2024-11-25 11:28:46 +01:00
LiqinWeng
db14010405
[RISCV][TTI] Implement cost of intrinsic abs with LMUL (#115813) 2024-11-25 17:35:58 +08:00
LiqinWeng
48b13ca48b
[RISCV][CostModel] cost of vector cttz/ctlz under ZVBB (#115800) 2024-11-24 09:18:18 +08:00
Luke Lau
1e897ed28d
[TTI][RISCV] Deduplicate type-based VP costing (#115983)
We have a lot of code in RISCVTTIImpl::getIntrinsicInstrCost for vp
intrinsics, which just forwards the cost to the underlying non-vp cost
function.

However, I also noticed that there is generic code in BasicTTIImpl's
getIntrinsicInstrCost that does the same thing, added in #67178. The
only difference is that BasicTTIImpl doesn't yet handle it for
type-based costing. There doesn't seem to be any reason that it can't
since it's just inspecting the argument types.

This shuffles the VP costing up to handle both regular and type-based
costing, which allows us to deduplicate some of the VP specific costing
in RISCVTTIImpl by delegating it to BasicTTIImpl.h. More of those nodes
can be moved over to BasicTTIImpl.h later.

It's not NFC since it picks up a couple of VP nodes that had slipped
through the cracks. Future PRs can begin to move more of the code from
RISCVTTIImpl to BasicTTIImpl.
2024-11-19 16:20:29 +08:00
LiqinWeng
5a12881514
[RISCV][Test] Add test for vp float arithmetic ops. NFC (#114516) 2024-11-13 16:23:04 +08:00
LiqinWeng
9aa4f50ae4
[RISCV][TTI] Add vp.fneg intrinsic cost with functionalOP (#114378) 2024-11-13 15:40:48 +08:00
Luke Lau
1294ddabbc [RISCV] Add cost model tests for vp.{s,u}{min,max}. NFC 2024-11-13 14:32:44 +08:00
Sushant Gokhale
9991ea28fc
[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible (#111479)

In case of Neon, if there exists an extractelement from lane != 0 such that
  1. the extractelement does not necessitate a move from vector_reg -> GPR,
  2. the extractelement result feeds into fmul, and
  3. the other operand of fmul is a scalar or an extractelement from lane 0
     (or a lane equivalent to 0),
then the extractelement can be merged with fmul in the backend and it
incurs no cost.

  e.g.
  ```
  define double @foo(<2 x double> %a) {
    %1 = extractelement <2 x double> %a, i32 0
    %2 = extractelement <2 x double> %a, i32 1
    %res = fmul double %1, %2
    ret double %res
  }
  ```
  `%2` and `%res` can be merged in the backend to generate:
  `fmul    d0, d0, v0.d[1]`

The change was tested with SPEC FP(C/C++) on Neoverse-v2. 
**Compile time impact**: None
**Performance impact**: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
2024-11-13 11:10:49 +05:30
Elvis Wang
3431d133cc [RISCV][TTI] Implement instruction cost for vp.reduce.* (#114184)
The VP variants simply return the same costs as the non-VP variants.
This assumes that reductions are VL predicated, and that VL predication
has no additional cost.
2024-11-12 10:01:35 -08:00
Philip Reames
6256f4b807 [CostModel][RISCV] Rename misleadingly named test file
RVV intrinsics are the C intrinsic API; this file actually contains tests
for the vp.* family of intrinsics.
2024-11-12 08:20:13 -08:00