281 Commits

Author SHA1 Message Date
Giuseppe Rossini
e94a0b300a
[AMDGPU] Fix vector legalization for bf16 valu ops (#158439)
Add v4,v8,v16,v32 legalizations for the following operations:
- `FADD`
- `FMUL`
- `FMA`
- `FCANONICALIZE`
2025-09-25 10:07:11 +01:00
Krzysztof Drewniak
9470113495
[AMDGPU] Mark workitem IDs uniform in more cases (#152581)
This fixes an old FIXME, where (workitem ID X) / (wavefrront size) would
never be marked uniform if it was possible that there would be Y and Z
dimensions. Now, so long as the required size of the X dimension is a
power of 2, dividing that dimension by the wavefront size creates a
uniform value.

Furthermore, if the required launch size of the X dimension is a power
of 2 that's at least the wavefront size, the Y and Z workitem IDs are
now marked uniform.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-08-29 01:21:04 -05:00
Shilei Tian
351b38f266
[AMDGPU] Mark address space cast from private to flat as divergent if target supports globally addressable scratch (#152376)
Globally addressable scratch is a new feature introduced in gfx1250.
However, this feature changes how scratch space is mapped into the flat
aperture, making address space casts from private to flat no longer
uniform.
2025-08-06 17:08:56 -04:00
paperchalice
8bacfb2538
[AMDGPU] Remove UnsafeFPMath uses (#151079)
Remove `UnsafeFPMath` in AMDGPU part, it blocks some bugfixes related to
clang and the ultimate goal is to remove `resetTargetOptions` method in
`TargetMachine`, see FIXME in `resetTargetOptions`.
See also
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast

https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-07-31 17:36:57 +08:00
macurtis-amd
cff4a00d3f
AMDGPU: Fix runtime unrolling when cascaded GEPs present (#147700)
Cascaded GEP (i.e. GEP of GEP) are not handled when determining if it is
ok to runtime unroll loops.

This change simply uses `getUnderlyingObjects` to look through cascaded
GEPs.
2025-07-10 03:44:04 -05:00
Gheorghe-Teodor Bercea
3df36a2b18
[AMDGPU] Enable vectorization of i8 values. (#134934)
This patch adjusts the cost model to account for the ability of the
AMDGPU optimizer to group together i8 values into i32 values.

Co-authored-by: Erich Keane <ekeane@nvidia.com>
2025-06-26 19:15:31 -04:00
David Green
77941eba7f
[CostModel] Add a DstTy to getShuffleCost (#141634)
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations notably in the SLP
vectorizer but also in the generic cost routines where assumption about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen but the default lowering to promote for integer
vectors.

This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).

Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
2025-06-21 12:29:29 +01:00
Matt Arsenault
a9811340b7
AMDGPU: Report special input intrinsics as free (#141948) 2025-06-18 08:24:58 +09:00
Matt Arsenault
54015f36c6
AMDGPU: Cost model for minimumnum/maximumnum (#141946) 2025-06-18 08:19:06 +09:00
Matt Arsenault
af65cb68f5
AMDGPU: Move fpenvIEEEMode into TTI (#141945) 2025-06-18 08:13:57 +09:00
Matt Arsenault
3800a83160
AMDGPU: Reduce cost of f64 copysign (#141944)
The real implementation is 1 real instruction plus a constant
materialize. Call that a 1, it's not a real f64 operation.
2025-06-18 08:10:53 +09:00
Matt Arsenault
c9b2816388
AMDGPU: Fix cost model for 16-bit operations on gfx8 (#141943)
We should only divide the number of pieces to fit the packed instructions
if we actually have pk instructions. This increases the cost of copysign,
but is closer to the current codegen output. It could be much cheaper
than it is now.
2025-06-18 08:07:03 +09:00
Ramkumar Ramachandra
b40e4ceaa6
[ValueTracking] Make Depth last default arg (NFC) (#142384)
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing a easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
2025-06-03 17:12:24 +01:00
Krzysztof Drewniak
13c467b2cd
[AMDGPU] Add make.buffer.rsrc to InferAddressSpaces (#140770)
make.buffer.rsrc can be subjected to address space inference. There's
not _currently_ a reason to have this, but we might as well handle this
in case it comes up.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-05-20 16:13:37 -07:00
Krzysztof Drewniak
4bdd116b80
[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)
This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
2025-05-19 07:15:04 -07:00
David Green
abd2c07e39
[CostModel] Make Op0 and Op1 const in getVectorInstrCost. NFC (#137631)
This does not alter much at the moment, but allows const pointers to be
passed as Op0 and Op1, simplifying later patches
2025-05-01 15:55:08 +01:00
Jay Foad
886f1199f0
[AMDGPU] Use variadic isa<>. NFC. (#137016) 2025-04-24 08:19:09 +01:00
David Green
98b6f8dc69
[CostModel] Remove optional from InstructionCost::getValue() (#135596)
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that does in AMDGPU used value_or which has been replaced
by a isValid() check.
2025-04-23 07:46:27 +01:00
Sergei Barannikov
3334c3597d
[TTI] Fix discrepancies in prototypes between interface and implementations (NFCI) (#136655)
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.

Pull Request: https://github.com/llvm/llvm-project/pull/136655
2025-04-22 11:40:12 +03:00
Sergei Barannikov
0014b49482
[TTI] Make all interface methods const (NFCI) (#136598)
Making `TargetTransformInfo::Model::Impl` `const` makes sure all
interface methods are `const`, in `BasicTTIImpl`, its bases, and in all
derived classes.

Pull Request: https://github.com/llvm/llvm-project/pull/136598
2025-04-22 06:27:29 +03:00
Sergei Barannikov
e0c1e23b99
[TTI] Constify BasicTTIImplBase::thisT() (NFCI) (#136575)
The main change is making `thisT` method `const`, the rest of the
changes is fixing compilation errors (*).

(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled in to derived classes using the `using` directive, and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need  to make all hiding
overloads `const`. See the second commit in this PR.

Pull Request: https://github.com/llvm/llvm-project/pull/136575
2025-04-21 21:42:40 +03:00
Fabian Ritter
b95a6c750c
[AMDGPU] Remove special cases in TTI::getMemcpyLoop(Residual)LoweringType (#125507)
These special cases limit the width of memory operations we use for
lowering memcpy/memmove when the pointer arguments are 2-aligned or in
the LDS/GDS.

I found that performance in microbenchmarks on gfx90a, gfx1030, and
gfx1100 is better without this limitation.
2025-02-04 08:18:24 +01:00
Joel E. Denny
18f8106f31
[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944)
This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
2025-01-29 12:40:19 -05:00
Fabian Ritter
a4fd3dba6e
[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.

This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.

The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.
2024-10-28 09:04:19 +01:00
Shilei Tian
e34e27f198
[TTI][AMDGPU] Allow targets to adjust LastCallToStaticBonus via getInliningLastCallToStaticBonus (#111311)
Currently we will not be able to inline a large function even if it only
has one live use because the inline cost is still very high after
applying `LastCallToStaticBonus`, which is a constant. This could
significantly impact the performance because CSR spill is very
expensive.

This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to
allow targets to customize this value.

Fixes SWDEV-471398.
2024-10-11 10:19:54 -04:00
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
2024-10-11 05:26:03 -07:00
Fabian Ritter
173c68239d
[AMDGPU] Enable unaligned scratch accesses (#110219)
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas, before, code assumed that the latter implies the
former at some places.

Part of SWDEV-455845.
2024-10-11 08:50:49 +02:00
Jeffrey Byrnes
853c43d04a
[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
2024-10-09 14:30:09 -07:00
Matt Arsenault
c198f775cd
AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642)
These have been replaced with atomicrmw
2024-10-09 09:27:28 +04:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Luke Drummond
3be955abbc [NFC] Remove dead code
There's an early exit branch a couple of lines earlier for `MVT ==
f64`. Convert to an assert rather than using the duplicate ternary here.
This silences an opinionated static analyser that's been bugging me.
2024-08-26 12:59:41 +01:00
Matt Arsenault
ee08d9cba5
AMDGPU: Remove global/flat atomic fadd intrinics (#97051)
These have been replaced with atomicrmw.
2024-08-22 23:27:33 +04:00
Matt Arsenault
cdadc2eb9e
AMDGPU: Correct costs of saturating add/sub intrinsics (#100808)
These are directly legal with fast instructions.
2024-08-09 12:55:15 +04:00
Matt Arsenault
d7824fab6e
TTI: Check legalization cost of abs nodes (#100523) 2024-08-09 12:51:05 +04:00
Matt Arsenault
e7630a0d60
AMDGPU: Improve cost handling of canonicalize (#101479) 2024-08-01 19:02:20 +04:00
Matt Arsenault
524795926b
AMDGPU: Enable vectorization of v2f16 copysign (#100799) 2024-07-30 08:48:13 +04:00
Matt Arsenault
4ed66cb4e1
AMDGPU: Improve cost handling of fma/fmuladd (#100798)
We were overcounting the cost of fast f32 FMA. Also address todo
and handle fmuladd (which I'm just assuming lowers to FMA, the slow FMA
expansion is about as fast on slow targets anyway).
2024-07-30 08:45:07 +04:00
Fabian Ritter
9e462b7ea2
[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984)
...and also in TTI::getMemcpyLoopResidualLoweringType.
2024-07-29 13:40:53 +02:00
Nikita Popov
9df71d7673
[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the
current `getParent()->getDataLayout()` pattern.
2024-06-28 08:36:49 +02:00
Nikita Popov
2d209d964a
[IR] Add getDataLayout() helpers to BasicBlock and Instruction (#96902)
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...

`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
2024-06-27 16:38:15 +02:00
Matt Arsenault
4477ff6836
AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739)
These have been replaced with atomicrmw.
2024-06-27 15:35:24 +02:00
Matt Arsenault
70c8b9c24a
AMDGPU: Remove ds atomic fadd intrinsics (#95396)
These have been replaced with atomicrmw fadd
2024-06-23 10:30:20 +02:00
Jeffrey Byrnes
ea43a30899
[AMDGPU] Vectorize more 16 bit shuffles (#90648)
In the case of larger vectors, we should still prefer the vectorized
version (i.e. shufflevector vs extract/insert chains).

In arithmetic chains, vectorization results in chains of packed math
instructions (as opposed to unpack/repack & scalarized arithmetic):
https://godbolt.org/z/c5onaf6G5

In chains with PHIs, vectorization again removes the unnecessary pack /
repack code around BBs: https://godbolt.org/z/vz7zYzvhs
2024-05-21 09:21:36 -07:00
David Green
4ac2721e51
[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934)
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.

It should help fix some of the regressions from #87510.
2024-04-09 16:36:08 +01:00
Alexey Bataev
7bc079c852
[TTI]Fallback to SingleSrcPermute shuffle kind, if no direct estimation for
extract subvector.

Many targets do not have cost for extractsubvector shuffle kind, but
have the costs for single source permute. If there are no costs
estimation for extractsubvector, better to switchto single source
permute for better cost estimation.

Reviewers: RKSimon, davemgreen, arsenm

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/79837
2024-02-12 07:09:49 -05:00
Mariusz Sikora
a018c8cdbb
GFX12: Add LoopDataPrefetchPass (#75625)
It is currently disabled by default. It will need experiments on a real
HW to tune and decide on the profitability.

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-19 08:32:16 +01:00
Jessica Del
32f9983c06
[AMDGPU] - Add address space for strided buffers (#74471)
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.

We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
2023-12-15 15:49:25 +01:00
Mirko Brkušanin
07a6d73664
[AMDGPU] CodeGen for GFX12 VFLAT, VSCRATCH and VGLOBAL instructions (#75493) 2023-12-15 15:01:40 +01:00
Piotr Sobczak
fac093dd08
[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-13 13:52:40 +01:00
Kazu Hirata
586ecdf205
[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956)
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.

I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
2023-12-11 21:01:36 -08:00