507 Commits

Author SHA1 Message Date
Joel E. Denny
18f8106f31
[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944)
This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
2025-01-29 12:40:19 -05:00
Alexandros Lamprineas
831527a5ef
[FMV][GlobalOpt] Statically resolve calls to versioned functions. (#87939)
To deduce whether the optimization is legal we need to compare the target
features between caller and callee versions. The criteria for bypassing
the resolver are the following:

 * If the callee's feature set is a subset of the caller's feature set,
   then the callee is a candidate for direct call.

 * Among such candidates the one of highest priority is the best match
   and it shall be picked, unless there is a version of the callee with
   higher priority than the best match which cannot be picked from a
   higher priority caller (directly or through the resolver).

 * For every higher priority callee version than the best match, there
   is a higher priority caller version whose feature set availability
   is implied by the callee's feature set.

Example:

Callers and Callees are ordered in decreasing priority.
The arrows indicate successful call redirections.

  Caller        Callee      Explanation
=========================================================================
mops+sve2 --+--> mops       all the callee versions are subsets of the
            |               caller but mops has the highest priority
            |
     mops --+    sve2       between mops and default callees, mops wins

      sve        sve        between sve and default callees, sve wins
                            but sve2 does not have a high priority caller

  default -----> default    sve (callee) implies sve (caller),
                            sve2(callee) implies sve (caller),
                            mops(callee) implies mops(caller)
2025-01-17 10:49:43 +00:00
Sam Tebbs
795e35a653
Reland "[LoopVectorizer] Add support for partial reductions" with non-phi operand fix. (#121744)
This relands the reverted #120721 with a fix for cases where neither
reduction operand are the reduction phi. Only
63114239cc8d26225a0ef9920baacfc7cc00fc58 and
63114239cc8d26225a0ef9920baacfc7cc00fc58 are new on top of the reverted
PR.

---------

Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
2025-01-13 11:20:35 +00:00
Zequan Wu
4d8f9594b2 Revert "Reland "[LoopVectorizer] Add support for partial reductions" (#120721)"
This reverts commit c858bf620c3ab2a4db53e84b9365b553c3ad1aa6 as it casuse optimization crash on -O2, see https://github.com/llvm/llvm-project/pull/120721#issuecomment-2563192057
2024-12-27 11:51:54 -08:00
Sam Tebbs
c858bf620c
Reland "[LoopVectorizer] Add support for partial reductions" (#120721)
This re-lands the reverted #92418 

When the VF is small enough so that dividing the VF by the scaling
factor results in 1, the reduction phi execution thinks the VF is scalar
and sets the reduction's output as a scalar value, tripping assertions
expecting a vector value. The latest commit in this PR fixes that by
using `State.VF` in the scalar check, rather than the divided VF.

---------

Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
2024-12-24 12:08:17 +00:00
Florian Hahn
5f096fd221
Revert "[LoopVectorizer] Add support for partial reductions (#92418)"
This reverts commit 060d62b48aeb5080ffcae1dc56e41a06c6f56701.

It looks like this is triggering an assertion when build llvm-test-suite
on ARM64 macOS.

Reproducer from MultiSource/Benchmarks/Ptrdist/bc/number.c

    target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
    target triple = "arm64-apple-macosx15.0.0"

    define void @test(i64 %idx.neg, i8 %0) #0 {
    entry:
      br label %while.body

    while.body:                                       ; preds = %while.body, %entry
      %n1ptr.0.idx131 = phi i64 [ %n1ptr.0.add, %while.body ], [ %idx.neg, %entry ]
      %n2ptr.0.idx130 = phi i64 [ %n2ptr.0.add, %while.body ], [ 0, %entry ]
      %sum.1129 = phi i64 [ %add99, %while.body ], [ 0, %entry ]
      %n1ptr.0.add = add i64 %n1ptr.0.idx131, 1
      %conv = sext i8 %0 to i64
      %n2ptr.0.add = add i64 %n2ptr.0.idx130, 1
      %1 = load i8, ptr null, align 1
      %conv97 = sext i8 %1 to i64
      %mul = mul i64 %conv97, %conv
      %add99 = add i64 %mul, %sum.1129
      %cmp94 = icmp ugt i64 %n1ptr.0.idx131, 0
      %cmp95 = icmp ne i64 %n2ptr.0.idx130, -1
      %2 = and i1 %cmp94, %cmp95
      br i1 %2, label %while.body, label %while.end.loopexit

    while.end.loopexit:                               ; preds = %while.body
      %add99.lcssa = phi i64 [ %add99, %while.body ]
      ret void
    }

    attributes #0 = { "target-cpu"="apple-m1" }

> opt -p loop-vectorize
Assertion failed: ((VF.isScalar() || V->getType()->isVectorTy()) && "scalar values must be stored as (0, 0)"), function set, file VPlan.h, line 284.
2024-12-19 21:46:51 +00:00
Finn Plummer
45c01e8a33
[NFC][TargetTransformInfo][VectorUtils] Consolidate isVectorIntrinsic... api (#117635)
- update `VectorUtils:isVectorIntrinsicWithScalarOpAtArg` to use TTI for
all uses, to allow specifiction of target specific intrinsics
- add TTI to the `isVectorIntrinsicWithStructReturnOverloadAtField` api
- update TTI api to provide `isTargetIntrinsicWith...` functions and
  consistently name them
- move `isTriviallyScalarizable` to VectorUtils
  
- update all uses of the api and provide the TTI parameter

Resolves #117030
2024-12-19 11:54:26 -08:00
Nicholas Guy
060d62b48a
[LoopVectorizer] Add support for partial reductions (#92418)
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.

---------

Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
2024-12-19 11:42:40 +00:00
Jonas Paulsson
0ad6be1927
[SLPVectorizer, TargetTransformInfo, SystemZ] Improve SLP getGatherCost(). (#112491)
As vector element loads are free on SystemZ, this patch improves the cost
computation in getGatherCost() to reflect this.

getScalarizationOverhead() gets an optional parameter which can hold the actual
Values so that they in turn can be passed (by BasicTTIImpl) to
getVectorInstrCost().

SystemZTTIImpl::getVectorInstrCost() will now recognize a LoadInst and
typically return a 0 cost for it, with some exceptions.
2024-11-29 21:19:45 +01:00
Finn Plummer
8663b8777e
[NFC][VectorUtils][TargetTransformInfo] Add isVectorIntrinsicWithOverloadTypeAtArg api (#114849)
This changes allows target intrinsics to specify and overwrite overloaded types.

- Updates `ReplaceWithVecLib` to not provide TTI as there most probably won't be a use-case
- Updates `SLPVectorizer` to use available TTI
- Updates `VPTransformState` to pass down TTI
- Updates `VPlanRecipe` to use passed-down TTI

This change will let us add scalarization for `asdouble`:  #114847
2024-11-21 11:04:25 -08:00
Sjoerd Meijer
9bccf61f5f
[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385)
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
2024-11-20 09:33:39 +00:00
Sushant Gokhale
9991ea28fc
[CostModel][AArch64] Make extractelement, with fmul user, free whenev… (#111479)
…er possible

In case of Neon, if there exists extractelement from lane != 0 such that
  1. extractelement does not necessitate a move from vector_reg -> GPR
  2. extractelement result feeds into fmul
3. Other operand of fmul is a scalar or extractelement from lane 0 or
lane equivalent to 0
then the extractelement can be merged with fmul in the backend and it
incurs no cost.

  e.g. 
  ```
define double @foo(<2 x double> %a) { 
    %1 = extractelement <2 x double> %a, i32 0 
    %2 = extractelement <2 x double> %a, i32 1
    %res = fmul double %1, %2    
    ret double %res
  }
```
  `%2` and `%res` can be merged in the backend to generate:
  `fmul    d0, d0, v0.d[1]`

The change was tested with SPEC FP(C/C++) on Neoverse-v2. 
**Compile time impact**: None
**Performance impact**: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
2024-11-13 11:10:49 +05:30
Kazu Hirata
236fda550d
[Analysis] Remove unused includes (NFC) (#114936)
Identified with misc-include-cleaner.
2024-11-05 19:11:34 -08:00
Nashe Mncube
e37d736def
Recommit: [llvm][ARM][GlobalOpt]Add widen global arrays pass (#113289)
This is a recommit of #107120 . The original PR was approved but failed
buildbot. The newly added tests should only be run for compilers that
support the ARM target. This has been resolved by adding a config file
for these tests.

- Pass optimizes memcpy's by padding out destinations and sources to a
  full word to make ARM backend generate full word loads instead of
  loading a single byte (ldrb) and/or half word (ldrh). Only pads
  destination when it's a stack allocated constant size array and source
  when it's constant string. Heuristic to decide whether to pad or not
  is very basic and could be improved to allow more examples to be
  padded.
- Pass works at the midend level
2024-10-24 10:12:01 +01:00
Nashe Mncube
370fd74361
Revert "[llvm][ARM]Add widen global arrays pass" (#112701)
Reverts llvm/llvm-project#107120 

Unexpected build failures in post-commit pipelines. Needs investigation
2024-10-17 13:38:01 +01:00
Nashe Mncube
ab90d2793c
[llvm][ARM]Add widen global arrays pass (#107120)
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make backend generate full word loads instead of loading a
single byte (ldrb) and/or half word (ldrh). Only pads destination when
it's a stack allocated constant size array and source when it's constant
array. Heuristic to decide whether to pad or not is very basic and could
be improved to allow more examples to be padded.
- Pass works within GlobalOpt but is disabled by default on all targets
except ARM.
2024-10-17 11:56:00 +01:00
Alexey Bataev
f9bc00e4bb
[SLP]Initial support for interleaved loads
Adds initial support for interleaved loads, which allows
emission of segmented loads for RISCV RVV.

Vectorizes extra code for RISCV
CFP2006/447.dealII, CFP2006/453.povray,
CFP2017rate/510.parest_r, CFP2017rate/511.povray_r,
CFP2017rate/526.blender_r, CFP2017rate/538.imagick_r, CINT2006/403.gcc,
CINT2006/473.astar, CINT2017rate/502.gcc_r, CINT2017rate/525.x264_r

Reviewers: RKSimon, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/112042
2024-10-14 09:12:33 -04:00
Tim Renouf
76007138f4
[LLVM] New NoDivergenceSource function attribute (#111832)
A call to a function that has this attribute is not a source of
divergence, as used by UniformityAnalysis. That allows a front-end to
use known-name calls as an instruction extension mechanism (e.g.
https://github.com/GPUOpen-Drivers/llvm-dialects ) without such a call
being a source of divergence.
2024-10-12 09:34:45 +01:00
Shilei Tian
e34e27f198
[TTI][AMDGPU] Allow targets to adjust LastCallToStaticBonus via getInliningLastCallToStaticBonus (#111311)
Currently we will not be able to inline a large function even if it only
has one live use because the inline cost is still very high after
applying `LastCallToStaticBonus`, which is a constant. This could
significantly impact the performance because CSR spill is very
expensive.

This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to
allow targets to customize this value.

Fixes SWDEV-471398.
2024-10-11 10:19:54 -04:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Farzon Lotfi
63a0a81e73
[NFC][Scalarizer][TargetTransformInfo] Add isTargetIntrinsicWithScalarOpAtArg api (#111441)
This change allows target intrinsics can have scalar args

fixes [111440](https://github.com/llvm/llvm-project/issues/111440)

This change will let us add scalarization for WaveReadLaneAt:
https://github.com/llvm/llvm-project/pull/111010
2024-10-07 19:57:07 -04:00
Philip Reames
d288574363
[TTI][RISCV] Model cost of loading constants arms of selects and compares (#109824)
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.

This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
2024-09-25 07:25:57 -07:00
Farzon Lotfi
0f97b4824a
[Scalarizer][DirectX] Add support for scalarization of Target intrinsics (#108776)
Since we are using the Scalarizer pass in the backend we needed a way to
allow this pass to operate on Target intrinsics.
We achieved this by adding `TargetTransformInfo ` to the Scalarizer
pass. This allowed us to call a function available to the DirectX
backend to know if an intrinsic is a target intrinsic that should be
scalarized.
2024-09-17 11:35:42 -04:00
Philip Reames
27a62ec72a
[LSR] Split the -lsr-term-fold transformation into it's own pass (#104234)
This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV.  Splitting it out makes
it easier to test.
    
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
    
Integration wise, I switched from using TTI to using a pass configuration
variable.  This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.
2024-08-17 18:34:23 -07:00
Jeremy Morse
bde243259b Revert "[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070)"
This reverts commit e8ad87c7d06afe8f5dde2e4c7f13c314cb3a99e9.
This reverts commit d3c9bb0cf811424dcb8c848cf06773dbdde19965.

A few buildbots trip up on asan-rvv-intrinsics.ll. I've also reverted
the follow-up commit d3c9bb0cf8.

https://lab.llvm.org/buildbot/#/builders/46/builds/2895
2024-08-08 12:26:05 +01:00
Yeting Kuo
e8ad87c7d0
[Asan] Provide TTI hook to provide memory reference infromation of target intrinsics. (#97070)
Previously asan considers target intrinsics as black boxes, so asan
could not instrument accurate check. This patch provide TTI hooks to
make targets describe their intrinsic informations to asan.

Note,
1. this patch renames InterestingMemoryOperand to MemoryRefInfo.
2. this patch does not support RVV indexed/segment load/store.
2024-08-08 13:40:26 +08:00
Fabian Ritter
9e462b7ea2
[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984)
...and also in TTI::getMemcpyLoopResidualLoweringType.
2024-07-29 13:40:53 +02:00
Tianqing Wang
3d494bfc7f
[SimplifyCFG] Increase budget for FoldTwoEntryPHINode() if the branch is unpredictable. (#98495)
The `!unpredictable` metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.

A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
2024-07-23 07:47:21 +08:00
Sjoerd Meijer
c5329c827a
[LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (#95819)
For the Neoverse V2 we would like to prefer fixed width over scalable
    vectorisation if the cost-model assigns an equal cost to both for certain
    loops. This improves 7 kernels from TSVC-2 and several production kernels by
    about 2x, and does not affect SPEC21017 INT and FP. This also adds a new TTI
    hook that can steer the loop vectorizater to preferring fixed width
    vectorization, which can be set per CPU. For now, this is only enabled for the
    Neoverse V2.

    There are 3 reasons why preferring NEON might be better in the case the
    cost-model is a tie and the SVE vector size is the same as NEON (128-bit):
    architectural reasons, micro-architecture reasons, and SVE codegen reasons. The
    latter will be improved over time, so the more important reasons are the former
    two. I.e., (micro) architecture reason is the use of LPD/STP instructions which
    are not available in SVE2 and it avoids predication.

    For what it is worth: this codegen strategy to generate more NEON is inline
    with GCC's codegen strategy, which is actually even more aggressive in
    generating NEON when no predication is required. We could be smarter about the
    decision making, but this seems to be a first good step in the right direction,
    and we can always revise this later (for example make the target hook more
    general).
2024-07-17 10:46:28 +01:00
Sam Parker
d28ed29d6b
[TTI][WebAssembly] Pairwise reduction expansion (#93948)
WebAssembly doesn't support horizontal operations nor does it have a way
of expressing fast-math or reassoc flags, so runtimes are currently
unable to use pairwise operations when generating code from the existing
shuffle patterns.

This patch allows the backend to select which, arbitary, shuffle pattern
to be used per reduction intrinsic. The default behaviour is the same as
the existing, which is by splitting the vector into a top and bottom
half. The other pattern introduced is for a pairwise shuffle.

WebAssembly enables pairwise reductions for int/fp add/sub.
2024-07-17 09:21:52 +01:00
Nikita Popov
9df71d7673
[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the
current `getParent()->getDataLayout()` pattern.
2024-06-28 08:36:49 +02:00
Shengchen Kan
15fc801cf0
[X86][CodeGen] Support hoisting load/store with conditional faulting (#96720)
1. Add TTI interface for conditional load/store.
2. Mark 1 x i16/i32/i64 masked load/store legal so that it's not
   legalized in pass scalarize-masked-mem-intrin.
3. Visit 1 x i16/i32/i64 masked load/store to build a target-specific
   CLOAD/CSTORE node to avoid error in
   `DAGTypeLegalizer::ScalarizeVectorResult`.
4. Combine DAG to simplify the nodes for CLOAD/CSTORE.
5. Lower CLOAD/CSTORE to CFCMOV by pattern match.

This is CodeGen part of #95515
2024-06-27 17:01:55 +08:00
Kazu Hirata
1462605ab0
[Analysis] Use range-based for loops (NFC) (#96587) 2024-06-25 06:57:30 -07:00
Alex Bradbury
5a20141539
[LSR] Provide TTI hook to enable dropping solutions deemed to be unprofitable (#89924)
<https://reviews.llvm.org/D126043> introduced a flag to drop solutions
if deemed unprofitable. As noted there, introducing a TTI hook enables
backends to individually opt into this behaviour.

This will be used by #89927.
2024-06-05 14:40:58 +02:00
Tyler Lanphear
e8dd4df72b
[NFC][TTI] Mark getReplicationShuffleCost() as const (#92194) 2024-05-22 09:44:10 -07:00
Graham Hunter
fbb37e9606
[AArch64] Add an all-in-one histogram intrinsic
Based on discussion from
https://discourse.llvm.org/t/rfc-vectorization-support-for-histogram-count-operations/74788

Current interface is:

llvm.experimental.histogram(<vecty> ptrs, <intty> inc_amount, <vecty> mask)

The integer type used by 'inc_amount' needs to match the type of the buckets in memory.

The intrinsic covers the following operations:
  * Gather load
  * histogram on the elements of 'ptrs'
  * multiply the histogram results by 'inc_amount'
  * add the result of the multiply to the values loaded by the gather
  * scatter store the results of the add

Supports lowering to histcnt instructions for AArch64 targets, and scalarization for all others at present.
2024-05-13 11:35:28 +01:00
Graham Hunter
2e8d815596
[TTI] Support scalable offsets in getScalingFactorCost (#88113)
Part of the work to support vscale-relative immediates in LSR.
2024-05-10 11:22:11 +01:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Graham Hunter
36a3f8f647
[TTI][TLI][AArch64] Support scalable immediates with isLegalAddImmediate (#84173)
Adds a second parameter (default to 0) to isLegalAddImmediate, to
represent a scalable immediate.

Extends the AArch64 implementation to match immediates based on what addvl and inc[h|w|d] support.
2024-03-20 10:28:46 +00:00
Graham Hunter
cd768ec983
[AArch64] Support scalable offsets with isLegalAddressingMode (#83255)
Allows us to indicate that an addressing mode featuring a
vscale-relative immediate offset is supported.
2024-03-20 10:13:20 +00:00
Paschalis Mpeis
f795d1a8b1
[AArch64][LV][SLP] Vectorizers use call cost for vectorized frem (#82488)
getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer
to compute the cost of frem, which becomes a call cost on AArch64 when
TLI has a vector library function.

Add tests that do SLP vectorization for code that contains 2x double and
4x float frem instructions.
2024-03-14 17:20:29 +00:00
Kolya Panchenko
889d99a50f
[TTI] Add alignment argument to TTI for compress/expand support (#83516)
Since `llvm.compressstore` and `llvm.expandload` do require memory
access, it's essential for some target to check if alignment is good to
be able to lower them to target-specific instructions
2024-03-05 20:33:56 -05:00
Alexey Bataev
8ad14b6d90
[TTI]Add support for strided loads/stores.
Added basic legality check and cost estimation functions for strided loads and stores.

These interfaces will be built upon in https://github.com/llvm/llvm-project/pull/80310.

Reviewers: preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/80329
2024-02-01 16:07:38 -05:00
David Green
a2d68b4bec
[SelectOpt] Add handling for Select-like operations. (#77284)
Some operations behave like selects. For example `or(zext(c), y)` is the
same as select(c, y|1, y)` and instcombine can canonicalize the select
to the or form. These operations can still be worthwhile converting to
branch as opposed to keeping as a select or or instruction.

This patch attempts to add some basic handling for them, creating a
SelectLike abstraction in the select optimization pass. The backend can
opt into handling `or(zext(c),x)` as a select if it could be profitable,
and the select optimization pass attempts to handle them in much the
same way as a `select(c, x|1, x)`. The Or(x, 1) may need to be added as
a new instruction, generated as the or is converted to branches.

This helps fix a regression from selects being converted to or's
recently.
2024-01-22 23:46:58 +00:00
David Sherwood
c7148467fc
[AArch64] Add an AArch64 pass for loop idiom transformations (#72273)
We have added a new pass that looks for loops such as the following:

```
  while (i != max_len)
      if (a[i] != b[i])
          break;

  ... use index i ...
```

Although similar to a memcmp, this is slightly different because instead
of returning the difference between the values of the first non-matching
pair of bytes, it returns the index of the first mismatch. As such, we
are not able to lower this to a memcmp call.

The new pass can now spot such idioms and transform them into a
specialised predicated loop that gives a significant performance
improvement for AArch64. It is intended as a stop-gap solution until
this can be handled by the vectoriser, which doesn't currently deal with
early exits.

This specialised loop makes use of a generic intrinsic that counts the
trailing zero elements in a predicate vector. This was added in
https://reviews.llvm.org/D159283 and for SVE we end up with brkb & incp
instructions.

Although we have added this pass only for AArch64, it was written in a
generic way so that in theory it could be used by other targets.
Currently the pass requires scalable vector support and needs to know
the minimum page size for the target, however it's possible to make it
work for fixed-width vectors too. Also, the llvm.experimental.cttz.elts
intrinsic used by the pass has generic lowering, but can be made
efficient for targets with instructions similar to SVE's brkb, cntp and
incp.

Original version of patch was posted on Phabricator:

 https://reviews.llvm.org/D158291

Patch co-authored by Kerry McLaughlin (@kmclaughlin-arm) and David
Sherwood (@david-arm)

See the original discussion on Discourse:

https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383
2024-01-09 11:29:28 +00:00
Alexey Bataev
5096501082 [SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)
SLP/TTI do not know about the cost estimation for addsub pattern,
supported by X86. Previously the support for pattern detection was added
(seeTTI::isLegalAltInstr), but the cost still did not estimated
properly.
2023-12-28 05:04:04 -08:00
Douglas Yung
fb981e6b4b Revert "[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)"
This reverts commit bc8c4bbd7973ab9527a78a20000aecde9bed652d.

Change is failing to build on several bots:
- https://lab.llvm.org/buildbot/#/builders/127/builds/60184
- https://lab.llvm.org/buildbot/#/builders/123/builds/23709
- https://lab.llvm.org/buildbot/#/builders/216/builds/32302
2023-12-27 23:52:04 -08:00
Alexey Bataev
bc8c4bbd79
[SLP][TTI][X86]Add addsub pattern cost estimation. (#76461)
SLP/TTI do not know about the cost estimation for addsub pattern,
supported by X86. Previously the support for pattern detection was added
(seeTTI::isLegalAltInstr), but the cost still did not estimated
properly.
2023-12-27 15:57:21 -05:00
Paul Walker
930b5b52ff
[ConstantHoisting] Add a TTI hook to prevent hoisting. (#69004)
Code generation can sometimes simplify expensive operations when
an operand is constant.  An example of this is divides on AArch64
where they can be rewritten using a cheaper sequence of multiplies
and subtracts.  Doing this is often better than hoisting expensive
constants which are likely to be hoisted by MachineLICM anyway.
2023-12-13 17:20:36 +00:00
Philip Reames
e947f95337 [LSR][TTI][RISCV] Enable terminator folding for RISC-V
If looking for a miscompile revert candidate, look here!

The transform being enabled prefers comparing to a loop invariant
exit value for a secondary IV over using an otherwise dead primary
IV.  This increases register pressure (by requiring the exit value
to be live through the loop), but reduces the number of instructions
within the loop by one.

On RISC-V which has a large number of scalar registers, this is
generally a profitable transform.  We loose the ability to use a beqz
on what is typically a count down IV, and pay the cost of computing
the exit value on the secondary IV in the loop preheader, but save
an add or sub in the loop body.  For anything except an extremely
short running loop, or one with extreme register pressure, this is
profitable.  On spec2017, we see a 0.42% geomean improvement in
dynamic icount, with no individual workload regressing by more than
0.25%.

Code size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for a uncompressible beq.  We also add instructions
in the preheader.  Net result is a slight regression overall, but
neutral or better inside the loop.

Previous versions of this transform had numerous cornercase correctness
bugs.  All of them ones I can spot by inspection have been fixed, and I
have run this through all of spec2017, but there may be further issues
lurking.  Adding uses to an IV is a fraught thing to do given poison
semantics, so this transform is somewhat inherently risky.

This patch is a reworked version of D134893 by @eop.  That patch has
been abandoned since May, so I picked it up, reworked it a bit, and
am landing it.
2023-11-29 12:04:06 -08:00