258 Commits

Author SHA1 Message Date
Fabian Ritter
a4fd3dba6e
[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.

This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.

The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.
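
As a rough illustration of the idea (not the code the patch emits), the sketch below copies with a main loop that issues several fixed-size load/store pairs per iteration and a byte-wise residual loop. The names `copyWide`, `ChunkBytes` and `Unroll`, and the values chosen for them, are assumptions made for this example only; the patch's actual default unroll factor is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical values for illustration; not the defaults chosen by the patch.
constexpr std::size_t ChunkBytes = 16; // one "legal" access in this sketch
constexpr std::size_t Unroll = 4;      // load/store pairs per loop iteration

// Copies Len bytes from Src to Dst (non-overlapping buffers assumed).
inline void copyWide(unsigned char *Dst, const unsigned char *Src,
                     std::size_t Len) {
  assert(Dst != nullptr && Src != nullptr);
  constexpr std::size_t Step = ChunkBytes * Unroll;
  std::size_t I = 0;
  // Main loop: each iteration moves Unroll chunks, standing in for one
  // too-wide access that legalization would split into legal accesses.
  for (; I + Step <= Len; I += Step)
    for (std::size_t J = 0; J < Unroll; ++J)
      std::memcpy(Dst + I + J * ChunkBytes, Src + I + J * ChunkBytes,
                  ChunkBytes);
  // Residual loop: remaining bytes that do not fill a whole iteration.
  for (; I < Len; ++I)
    Dst[I] = Src[I];
}
```

With a small dynamic length, nearly all of the work would fall into the residual loop, which matches the message's reason for restricting the change to large constant lengths.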
2024-10-28 09:04:19 +01:00
Shilei Tian
e34e27f198
[TTI][AMDGPU] Allow targets to adjust LastCallToStaticBonus via getInliningLastCallToStaticBonus (#111311)
Currently we will not be able to inline a large function even if it only
has one live use because the inline cost is still very high after
applying `LastCallToStaticBonus`, which is a constant. This could
significantly impact the performance because CSR spill is very
expensive.

This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to
allow targets to customize this value.

Fixes SWDEV-471398.
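
The shape of such a hook can be sketched as follows. This is a simplified stand-in, not the real TTI interface: the classes `TargetInfoBase` and `ExpensiveCallTarget`, the helper `shouldInlineLastCall`, and all numeric values are hypothetical and only show how a per-target override can change the outcome for a large callee.

```cpp
#include <cassert>

// Hypothetical stand-in for the target-independent interface.
struct TargetInfoBase {
  virtual ~TargetInfoBase() = default;
  // Default bonus: a fixed constant (placeholder value for this sketch).
  virtual int getInliningLastCallToStaticBonus() const { return 15000; }
};

// Hypothetical target on which calls are very expensive, so the bonus for
// inlining the last call to a static function is scaled up.
struct ExpensiveCallTarget : TargetInfoBase {
  int getInliningLastCallToStaticBonus() const override {
    return 10 * TargetInfoBase::getInliningLastCallToStaticBonus();
  }
};

// Simplified decision: inline if the cost after the bonus is below Threshold.
inline bool shouldInlineLastCall(const TargetInfoBase &TTI, int Cost,
                                 int Threshold) {
  return Cost - TTI.getInliningLastCallToStaticBonus() < Threshold;
}
```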
2024-10-11 10:19:54 -04:00
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e., just lookup, no creation).
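
The difference between the two behaviours can be shown with a toy symbol table. `ToyModule` and the two free functions below are hypothetical and do not use the LLVM API; they only contrast "look up or create" with "look up only".

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy module: maps declaration names to integer ids.
struct ToyModule {
  std::map<std::string, int> Decls;
  int NextId = 0;
};

// Look up Name and create the declaration if it is missing (mutates M).
inline int getOrInsertDeclaration(ToyModule &M, const std::string &Name) {
  auto It = M.Decls.find(Name);
  if (It != M.Decls.end())
    return It->second;
  return M.Decls.emplace(Name, M.NextId++).first->second;
}

// Pure lookup: returns nullptr if Name was never declared (M is unchanged).
inline const int *getDeclaration(const ToyModule &M, const std::string &Name) {
  auto It = M.Decls.find(Name);
  return It == M.Decls.end() ? nullptr : &It->second;
}
```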
2024-10-11 05:26:03 -07:00
Fabian Ritter
173c68239d
[AMDGPU] Enable unaligned scratch accesses (#110219)
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas, before, code assumed that the latter implies the
former at some places.

Part of SWDEV-455845.
2024-10-11 08:50:49 +02:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Matt Arsenault
c198f775cd
AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642)
These have been replaced with atomicrmw
2024-10-09 09:27:28 +04:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Luke Drummond
3be955abbc [NFC] Remove dead code
There's an early exit branch a couple of lines earlier for `MVT ==
f64`. Convert to an assert rather than using the duplicate ternary here.
This silences an opinionated static analyser that's been bugging me.
2024-08-26 12:59:41 +01:00
Matt Arsenault
ee08d9cba5
AMDGPU: Remove global/flat atomic fadd intrinsics (#97051)
These have been replaced with atomicrmw.
2024-08-22 23:27:33 +04:00
Matt Arsenault
cdadc2eb9e
AMDGPU: Correct costs of saturating add/sub intrinsics (#100808)
These are directly legal with fast instructions.
2024-08-09 12:55:15 +04:00
Matt Arsenault
d7824fab6e
TTI: Check legalization cost of abs nodes (#100523) 2024-08-09 12:51:05 +04:00
Matt Arsenault
e7630a0d60
AMDGPU: Improve cost handling of canonicalize (#101479) 2024-08-01 19:02:20 +04:00
Matt Arsenault
524795926b
AMDGPU: Enable vectorization of v2f16 copysign (#100799) 2024-07-30 08:48:13 +04:00
Matt Arsenault
4ed66cb4e1
AMDGPU: Improve cost handling of fma/fmuladd (#100798)
We were overcounting the cost of fast f32 FMA. Also address todo
and handle fmuladd (which I'm just assuming lowers to FMA, the slow FMA
expansion is about as fast on slow targets anyway).
2024-07-30 08:45:07 +04:00
Fabian Ritter
9e462b7ea2
[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984)
...and also in TTI::getMemcpyLoopResidualLoweringType.
2024-07-29 13:40:53 +02:00
Nikita Popov
9df71d7673
[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the
current `getParent()->getDataLayout()` pattern.
2024-06-28 08:36:49 +02:00
Nikita Popov
2d209d964a
[IR] Add getDataLayout() helpers to BasicBlock and Instruction (#96902)
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...

`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
2024-06-27 16:38:15 +02:00
Matt Arsenault
4477ff6836
AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739)
These have been replaced with atomicrmw.
2024-06-27 15:35:24 +02:00
Matt Arsenault
70c8b9c24a
AMDGPU: Remove ds atomic fadd intrinsics (#95396)
These have been replaced with atomicrmw fadd
2024-06-23 10:30:20 +02:00
Jeffrey Byrnes
ea43a30899
[AMDGPU] Vectorize more 16 bit shuffles (#90648)
In the case of larger vectors, we should still prefer the vectorized
version (i.e. shufflevector vs extract/insert chains).

In arithmetic chains, vectorization results in chains of packed math
instructions (as opposed to unpack/repack & scalarized arithmetic):
https://godbolt.org/z/c5onaf6G5

In chains with PHIs, vectorization again removes the unnecessary pack /
repack code around BBs: https://godbolt.org/z/vz7zYzvhs
2024-05-21 09:21:36 -07:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Alexey Bataev
7bc079c852
[TTI]Fallback to SingleSrcPermute shuffle kind, if no direct estimation for
extract subvector.

Many targets do not have costs for the extractsubvector shuffle kind, but
do have costs for single source permute. If there is no cost estimate
for extractsubvector, it is better to switch to single source permute
for better cost estimation.

Reviewers: RKSimon, davemgreen, arsenm

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/79837
2024-02-12 07:09:49 -05:00
Mariusz Sikora
a018c8cdbb
GFX12: Add LoopDataPrefetchPass (#75625)
It is currently disabled by default. It will need experiments on real
hardware to tune it and decide on its profitability.

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-19 08:32:16 +01:00
Jessica Del
32f9983c06
[AMDGPU] - Add address space for strided buffers (#74471)
This is an experimental address space for strided buffers. These buffers
can have structs as elements and a stride > 1. These pointers allow
indexed access in units of stride, i.e., they point at
`buffer[index * stride]`. Thus, we can use the `idxen` modifier for
buffer loads.

We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
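
A minimal sketch of the layout described above, assuming the three components are simply packed back to back. The struct `StridedBufferPtr`, its field order, and the helper `byteOffset` are hypothetical; in particular, adding the offset on top of `index * stride` is an assumption of this example rather than something the message states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory picture of a 192-bit strided buffer pointer.
struct StridedBufferPtr {
  std::uint32_t Descriptor[4]; // 128-bit buffer descriptor (opaque here)
  std::uint32_t Offset;        // 32-bit offset
  std::uint32_t Index;         // 32-bit index, counted in units of stride
};
static_assert(sizeof(StridedBufferPtr) * 8 == 192,
              "128-bit descriptor + 32-bit offset + 32-bit index");

// Byte position relative to the buffer base: Index * StrideBytes selects the
// element, and (by assumption) Offset selects a byte within it.
inline std::uint64_t byteOffset(const StridedBufferPtr &P,
                                std::uint32_t StrideBytes) {
  return static_cast<std::uint64_t>(P.Index) * StrideBytes + P.Offset;
}
```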
2023-12-15 15:49:25 +01:00
Mirko Brkušanin
07a6d73664
[AMDGPU] CodeGen for GFX12 VFLAT, VSCRATCH and VGLOBAL instructions (#75493) 2023-12-15 15:01:40 +01:00
Piotr Sobczak
fac093dd08
[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-13 13:52:40 +01:00
Kazu Hirata
586ecdf205
[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956)
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.

I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
2023-12-11 21:01:36 -08:00
Sander de Smalen
81b7f115fb
[llvm][TypeSize] Fix addition/subtraction in TypeSize. (#72979)
It seems TypeSize is currently broken in the sense that:

  TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)

without failing its assert that explicitly tests for this case:

  assert(LHS.Scalable == RHS.Scalable && ...);

The reason this fails is that `Scalable` is a static method of class
TypeSize, and LHS and RHS are both objects of class TypeSize. So this is
evaluating if the pointer to the function Scalable == the pointer to the
function Scalable, which is always true because LHS and RHS have the
same class.

This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`, so that it no longer clashes with the variable in
FixedOrScalableQuantity.

The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
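
The pitfall can be reproduced in a few lines outside LLVM. In the sketch below, `ToySize`, `sameKindBuggy` and `sameKindFixed` are hypothetical names; the struct is a flattened stand-in for the real class hierarchy, and the state is held in a differently named member (`IsScalable`) so that the example compiles without the inheritance involved in the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: a static factory named Scalable next to the state.
struct ToySize {
  std::uint64_t MinValue;
  bool IsScalable;
  static ToySize Fixed(std::uint64_t N) { return ToySize{N, false}; }
  static ToySize Scalable(std::uint64_t N) { return ToySize{N, true}; }
};

// Buggy check: L.Scalable and R.Scalable both name the same static member
// function, so this compares a function address with itself. Compilers may
// warn about it, but it compiles and is always true.
inline bool sameKindBuggy(const ToySize &L, const ToySize &R) {
  return L.Scalable == R.Scalable;
}

// Intended check: compare the per-object state.
inline bool sameKindFixed(const ToySize &L, const ToySize &R) {
  return L.IsScalable == R.IsScalable;
}
```

Renaming the factories to verb phrases, as the patch does, removes the name clash so that the identifier in the assertion can only refer to the state.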
2023-11-22 08:52:53 +00:00
Fangrui Song
8e247b8f47 Replace TypeSize::{getFixed,getScalable} with canonical TypeSize::{Fixed,Scalable}. NFC 2023-10-27 00:30:41 -07:00
Jay Foad
eca2fcbdeb
[AMDGPU] Fix cost of fast unsafe f32 fdiv (#68988) 2023-10-15 12:25:36 +01:00
Alexey Bataev
9a207578ac [TTI]Add InsertSubvector pattern in improveShuffleKindFromMask().
It improves shuffle instructions estimation and improves vectorization
outcome.

Differential Revision: https://reviews.llvm.org/D157425
2023-08-18 13:47:01 -07:00
Matt Arsenault
fbe4ff8149 AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
2023-07-11 15:14:52 -04:00
Matt Arsenault
9df70e4a4d AMDGPU: Fix not applying the correct default memcpy expansion threshold
Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got
renamed at some point in the process and the target implementation was
left behind.

Fixes: SWDEV-407329
2023-07-06 12:14:14 -04:00
David Green
12025cef3e [CostModel] Use min/max intrinsics for vecreduce.min/max costs
This changes the cost modelling of the vecreduce.min/max nodes to use the costs
of the relevant min/max intrinsics instead of expanding them to compares and
selects. getMinMaxReductionCost has changed to take an Opcode for the
relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are
no longer needed.

A follow up patch will add some basic fminimum/fmaximum costmodelling.

Differential Revision: https://reviews.llvm.org/D153547
2023-07-04 15:02:30 +01:00
Juan Manuel MARTINEZ CAAMAÑO
dd1df099ae [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2)
Before this patch, the compiler gave a bump to the inline-threshold
when the total size of the allocas passed as arguments to the
callee was below 256 bytes.
This heuristic ignores that some of these allocas could have been removed
by SROA if inlining was applied.

Ideally, this bonus would be attributed to the threshold once the
size of all the allocas that could not be handled by SROA is known:
at the end of the InlineCost analysis.
However, we may never reach this point if the inline-cost analysis exits
early when the inline cost goes over the threshold mid-analysis.

This patch proposes:
* Attribute the bonus in the inline-threshold when allocas are passed
  as arguments (regardless of their total size).
* Assign a cost to each alloca proportional to its size,
  such that the cost of all the allocas cancels the bonus.
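
One possible reading of this scheme, as a sketch: each alloca is charged a share of the bonus proportional to its size. The function `allocaCosts`, the formula `Bonus * Size / TotalSize`, and the numbers in the usage note are assumptions of this example and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cost split: cost_i = Bonus * Size_i / TotalSize.
// Integer division may leave the sum slightly below Bonus.
inline std::vector<std::uint64_t>
allocaCosts(const std::vector<std::uint64_t> &Sizes, std::uint64_t Bonus) {
  std::uint64_t Total = 0;
  for (std::uint64_t S : Sizes)
    Total += S;
  std::vector<std::uint64_t> Costs;
  Costs.reserve(Sizes.size());
  for (std::uint64_t S : Sizes)
    Costs.push_back(Total != 0 ? Bonus * S / Total : 0);
  return Costs;
}
```

For sizes 64, 64 and 128 bytes and a bonus of 1000, the costs come out as 250, 250 and 500. If SROA can remove an alloca once inlining is applied, its cost would be dropped again and the corresponding share of the bonus would remain.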

Potential problems:
* This patch assumes that removing alloca instructions with SROA is
  always profitable. This may not be the case if the total size of the
  allocas is still too big to be promoted to registers/LDS.
* Redundant calls to getTotalAllocaSize
* Awkwardly, the threshold attributed contributes to the single-bb and
  vector bonus.

Reviewed By: scchan

Differential Revision: https://reviews.llvm.org/D149741
2023-06-29 09:49:16 +02:00
Matt Arsenault
53fb907df4 AMDGPU: Special case uniformity info for single lane workgroups
Constructors/destructors and OpenMP make use of single lane groups
in some cases.
2023-06-28 07:25:48 -04:00
Matt Arsenault
92ee60b66f AMDGPU: Drop and upgrade llvm.amdgcn.atomic.inc/dec to atomicrmw 2023-06-21 21:20:26 -04:00
Matt Arsenault
12c12c5fe0 TTI: Add function to hasBranchDivergence
It may be possible to contextually ignore divergence in a function if
it's known to run single threaded.
2023-06-16 18:47:40 -04:00
Matt Arsenault
3c848194f2 CodeGen: Expand memory intrinsics in PreISelIntrinsicLowering
Expand large or unknown size memory intrinsics into loops in the
default lowering pipeline if the target doesn't have the corresponding
libfunc. Previously AMDGPU had a custom pass which existed to call the
expansion utilities.

With a default no-libcall option, we can remove the libfunc checks in
LoopIdiomRecognize for these, which never made any sense. This also
provides a path to lifting the immarg restriction on
llvm.memcpy.inline.

There seems to be a bug where TLI reports functions as available if
you use -march and not -mtriple.
2023-06-09 21:04:37 -04:00
Alexander Timofeev
bad4de1ae7 Don't disable loop unroll for vectorized loops on AMDGPU target
We've got a performance regression after https://reviews.llvm.org/D115261.
Despite the loop being vectorized, unrolling is still required.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D149281
2023-05-25 22:54:41 +02:00
Juan Manuel MARTINEZ CAAMAÑO
33da608ecc [AMDGPU][InferAddressSpaces] Only rewrite address-spaces that can be trivially casted to flat for llvm.amdgcn.flat.atomic.{fadd,fmax,fmin}
The intrinsic @llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} can only be
selected for flat address spaces (constant, flat and global).

This patch restricts the cases over which GCNTTIImpl::rewriteIntrinsicWithAddressSpace
rewrites the intrinsic.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D149938
2023-05-16 17:32:58 +02:00
Krzysztof Drewniak
f0415f2a45 Re-land "[AMDGPU] Define data layout entries for buffers"
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.

This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.

Differential Revision: https://reviews.llvm.org/D149776
2023-05-03 19:43:56 +00:00
Krzysztof Drewniak
3f2fbe92d0 Revert "[AMDGPU] Define data layout entries for buffers"
This reverts commit f9c1ede2543b37fabe9f2d8f8fed5073c475d850.

Differential Revision: https://reviews.llvm.org/D149758
2023-05-03 16:11:00 +00:00
Krzysztof Drewniak
f9c1ede254 [AMDGPU] Define data layout entries for buffers
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.

The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.

The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. These ptr
addrspace(8) pointers are 128 bits long (with the same alignment).
They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.

Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.

This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.

Depends on D143437

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D145441
2023-05-03 15:25:58 +00:00
Simon Pilgrim
9e30b87afb [TTI] getMinMaxReductionCost - add FastMathFlag argument
Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>).

This will be necessary to correctly recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6).

Differential Revision: https://reviews.llvm.org/D148149
2023-04-13 10:42:42 +01:00
pvanhout
9580bebd47 [AMDGPU] Remove use-legacy-divergence-analysis
This option has been a no-op since we switched to UA.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D146937
2023-03-27 11:30:24 +02:00
Jay Foad
dcb834843e [AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.

Differential Revision: https://reviews.llvm.org/D144650
2023-02-23 16:38:15 +00:00
Luke Lau
b02b1e0ed6 [LV][NFC] Use ElementCount for getMaxInterleaveFactor
In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor.
This is based off of the approach used here: 8d36708507

The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch.
See https://reviews.llvm.org/D143723#4132349

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D144474
2023-02-22 10:15:05 +00:00
Janek van Oirschot
e3515ba381 Reapply "[AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing function arguments through the stack"
Reapplies 142c28ffa1323e9a8d53200a22c80d5d778e0d0f as part of D140242 which got reverted due to amdgpu openmp test failures.

This diff fixes said failures by eliding most of `adjustInliningThresholdUsingCallee` for indirect calls as the callee function is unavailable for indirect calls.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D143498
2023-02-13 12:17:43 +00:00
Janek van Oirschot
1beba44526 Revert "[AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing function arguments through the stack"
This reverts commit 142c28ffa1323e9a8d53200a22c80d5d778e0d0f.
2023-02-03 19:13:57 +00:00