304 Commits

Author SHA1 Message Date
Brox Chen
2b2b580c8d
[AMDGPU][CodeGen][True16] Track waitcnt as vgpr32 instead of vgpr16 for D16 Instructions in GFX11 (#157795)
It seems that a VMEM access to the hi/lo half could interfere with the other half.
Track waitcnt for the vgpr32 instead of the vgpr16 for 16-bit registers on GFX11.

---------

Co-authored-by: Joe Nash <joseph.nash@amd.com>
2025-09-17 10:09:06 -04:00
Krzysztof Drewniak
9470113495
[AMDGPU] Mark workitem IDs uniform in more cases (#152581)
This fixes an old FIXME, where (workitem ID X) / (wavefront size) would
never be marked uniform if it was possible that there would be Y and Z
dimensions. Now, so long as the required size of the X dimension is a
power of 2, dividing that dimension by the wavefront size creates a
uniform value.

Furthermore, if the required launch size of the X dimension is a power
of 2 that's at least the wavefront size, the Y and Z workitem IDs are
now marked uniform.
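
A brute-force check can illustrate the rule above. This is only a sketch, not the LLVM implementation: it assumes workitems are packed into waves in flattened order with X varying fastest, and the names `WorkitemId` and `isUniformPerWave` are made up for illustration.

```cpp
#include <functional>

// Hypothetical helper type: the three workitem IDs of one lane.
struct WorkitemId {
  unsigned X, Y, Z;
};

// Returns true if Fn yields the same value for every workitem of each wave.
// Assumption: waves are formed from consecutive flattened IDs, X fastest.
bool isUniformPerWave(unsigned SizeX, unsigned SizeY, unsigned SizeZ,
                      unsigned WaveSize,
                      const std::function<unsigned(WorkitemId)> &Fn) {
  unsigned Total = SizeX * SizeY * SizeZ;
  for (unsigned WaveStart = 0; WaveStart < Total; WaveStart += WaveSize) {
    bool First = true;
    unsigned Expected = 0;
    for (unsigned Flat = WaveStart;
         Flat < WaveStart + WaveSize && Flat < Total; ++Flat) {
      // Recover the 3D ID from the flattened ID.
      WorkitemId Id{Flat % SizeX, (Flat / SizeX) % SizeY,
                    Flat / (SizeX * SizeY)};
      unsigned Value = Fn(Id);
      if (First) {
        Expected = Value;
        First = false;
      } else if (Value != Expected) {
        return false; // Two lanes of the same wave disagree.
      }
    }
  }
  return true;
}
```

Under these assumptions, a 128x2x2 workgroup with 64-wide waves reports `X / 64`, `Y` and `Z` as uniform. A 96x2x1 workgroup does not report `X / 64` as uniform, and a 32x4x1 workgroup does not report `Y` as uniform, since one wave then spans two rows.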

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-08-29 01:21:04 -05:00
Shilei Tian
ca03045d7f
[AMDGPU][Attributor] Remove final update of waves-per-eu after the attributor run (#155246)
We do not need this in the attributor, because `ST.getWavesPerEU`
accounts for both the waves-per-eu and flat-workgroup-size attributes.
If the waves-per-eu values are not valid, it drops them. In the
attributor, we only need to propagate the values without using
intermediate flat workgroup size values.

Fixes SWDEV-550257.
2025-08-27 14:11:03 -04:00
Shilei Tian
7c9c331eb3
[NFC][AMDGPU] Remove redundant code in AMDGPUSubtarget::getWavesPerEU (#155201) 2025-08-25 08:22:00 -04:00
Lucas Ramirez
1b34722302
[AMDGPU] Fix computation of waves/EU maximum (#140921)
This fixes an issue in the waves/EU range calculation wherein, if the
`amdgpu-waves-per-eu` attribute exists and is valid, the entire
attribute may be spuriously and completely ignored if workgroup sizes
and LDS usage restrict the maximum achievable occupancy below the
subtarget maximum. In such cases, we should still honor the requested
minimum number of waves/EU, even if the requested maximum is higher than
the actually achievable maximum (but still within subtarget
specification).

As such, the added unit test `empty_at_least_2_lds_limited`'s waves/EU
range should be [2,4] after this patch, when it is currently [1,4] (i.e.,
as if `amdgpu-waves-per-eu` was not specified at all).

Before e377dc4 the default maximum waves/EU was always set to the
subtarget maximum, trivially avoiding the issue.
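
The [2,4] outcome described above can be sketched as a small clamping function. This is an illustration under stated assumptions, not the code in `AMDGPUSubtarget::getWavesPerEU`: the function name, the `Range` alias, and the handling of a requested minimum that exceeds the achievable maximum are all hypothetical.

```cpp
#include <algorithm>
#include <utility>

using Range = std::pair<unsigned, unsigned>; // {min, max} waves/EU

// Achievable: range implied by workgroup sizes and LDS usage.
// Requested:  range from the amdgpu-waves-per-eu attribute.
Range applyRequestedWavesPerEU(Range Achievable, Range Requested,
                               unsigned SubtargetMax) {
  // The request is valid only if it is well-formed and within the
  // subtarget specification.
  bool Valid = Requested.first >= 1 &&
               Requested.first <= Requested.second &&
               Requested.second <= SubtargetMax;
  // Assumption: a request whose minimum cannot be reached is ignored too.
  if (!Valid || Requested.first > Achievable.second)
    return Achievable;
  // Honor the requested minimum even when the requested maximum is higher
  // than what is actually achievable.
  return {std::max(Achievable.first, Requested.first),
          std::min(Achievable.second, Requested.second)};
}
```

With an achievable range of [1,4], a subtarget maximum of 10 and a request of [2,10], this returns [2,4] rather than discarding the request.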
2025-05-22 02:45:59 +02:00
Shilei Tian
578741b5e8
[AMDGPU][Attributor] Rework update of AAAMDWavesPerEU (#123995)
Currently, we use `AAAMDWavesPerEU` to iteratively update values based
on attributes from the associated function, potentially propagating
user-annotated values, alongside the analogous `AAAMDFlatWorkGroupSize`.
However, since the value calculated through the flat workgroup size
always dominates the user annotation (i.e., the attribute), running
`AAAMDWavesPerEU` iteratively is unnecessary if no user-annotated value
exists.

This PR completely rewrites how the `amdgpu-waves-per-eu` attribute is
handled in `AMDGPUAttributor`. The key changes are as follows:

- `AAAMDFlatWorkGroupSize` remains unchanged.
- `AAAMDWavesPerEU` now only propagates user-annotated values.
- A new function is added to check and update `amdgpu-waves-per-eu`
  based on the following rules:
  - No waves per eu, no flat workgroup size: Assume a flat workgroup size
    of `1,1024` and compute waves per eu based on this.
  - No waves per eu, flat workgroup size exists: Use the provided flat
    workgroup size to compute waves-per-eu.
  - Waves per eu exists, no flat workgroup size: This is a tricky case. In
    this PR, we assume a flat workgroup size of `1,1024`, but this can be
    adjusted if a different approach is preferred. Alternatively, we could
    directly use the user-annotated value.
  - Both waves per eu and flat workgroup size exist: If there’s a
    conflict, the value derived from the flat workgroup size takes
    precedence over waves per eu.

This PR also updates the logic for merging two waves per eu pairs. The
current implementation, which uses `clampStateAndIndicateChange` to
compute a union, might not be ideal. From the perspective of ensuring
proper resource allocation, if one pair specifies a minimum of 2 waves
per eu and another specifies a minimum of 4, we should guarantee that 4
waves per eu can be supported, as failing to do so could result in
excessive resource allocation per wave. A similar principle applies to
the upper bound. Thus, the PR uses the following approach for merging
two pairs, `lo_a,up_a` and `lo_b,up_b`: `max(lo_a, lo_b), max(up_a,
up_b)`. This ensures that resource allocation adheres to the stricter
constraints from both inputs.
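
The merge rule stated above is small enough to write out directly. The following is a minimal sketch; the `WavesPerEU` alias and the function name are hypothetical and do not come from `AMDGPUAttributor`.

```cpp
#include <algorithm>
#include <utility>

using WavesPerEU = std::pair<unsigned, unsigned>; // {lower, upper}

// Merge two pairs as described in the text:
// max(lo_a, lo_b), max(up_a, up_b).
WavesPerEU mergeWavesPerEU(WavesPerEU A, WavesPerEU B) {
  return {std::max(A.first, B.first), std::max(A.second, B.second)};
}
```

For example, merging `2,8` with `4,10` gives `4,10`, so the result guarantees the larger minimum of 4 waves per eu.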

Fix #123092.
2025-05-17 01:01:09 -04:00
Lucas Ramirez
e377dc4d38
[AMDGPU] Max. WG size-induced occupancy limits max. waves/EU (#137807)
The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.

This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.

Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
2025-05-01 13:22:23 +02:00
Shilei Tian
51c706c119
[NFC][AMDGPU] Replace direct arch comparison with isAMDGCN() (#131357) 2025-03-14 14:21:44 -04:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU at any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
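
The example above can be reproduced by brute force. This sketch is not the constant-time algorithm from the patch; it simply tries every workgroup size, using the example's parameters (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS usage). The function names are hypothetical, and rounding up when converting waves per CU to waves per EU is an assumption (rounding down gives the same range for this example).

```cpp
#include <algorithm>
#include <utility>

constexpr unsigned WaveSize = 64;
constexpr unsigned EUsPerCU = 4;
constexpr unsigned MaxWavesPerEU = 10;

// Waves per EU when every workgroup on the CU has GroupSize workitems.
// Assumes a group never needs more waves than the CU has slots.
unsigned occupancyForGroupSize(unsigned GroupSize) {
  unsigned WavesPerGroup = (GroupSize + WaveSize - 1) / WaveSize;
  unsigned WaveSlotsPerCU = MaxWavesPerEU * EUsPerCU;
  // Only whole groups can be resident on the CU.
  unsigned GroupsPerCU = WaveSlotsPerCU / WavesPerGroup;
  unsigned WavesPerCU = GroupsPerCU * WavesPerGroup;
  // Assumption: round up when spreading the waves across the EUs.
  return (WavesPerCU + EUsPerCU - 1) / EUsPerCU;
}

// Minimum and maximum occupancy over all sizes in [MinSize, MaxSize].
std::pair<unsigned, unsigned> occupancyRange(unsigned MinSize,
                                             unsigned MaxSize) {
  unsigned Lo = MaxWavesPerEU, Hi = 0;
  for (unsigned Size = MinSize; Size <= MaxSize; ++Size) {
    unsigned Occ = occupancyForGroupSize(Size);
    Lo = std::min(Lo, Occ);
    Hi = std::max(Hi, Occ);
  }
  return {Lo, Hi};
}
```

Here `occupancyForGroupSize` gives 9 for 513 items, 10 for 640, 8 for 1024 and 7 for 896, and `occupancyRange(513, 1024)` gives the pair 7, 10.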

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Matt Arsenault
009368f130
AMDGPU: Mark grid size loads with range metadata (#113019)
Only handles the v5 case.
2024-12-09 11:01:55 -05:00
Kazu Hirata
be187369a0
[AMDGPU] Remove unused includes (NFC) (#116154)
Identified with misc-include-cleaner.
2024-11-13 21:10:03 -08:00
Matt Arsenault
0b40f97929
AMDGPU: Treat uint32_max as the default value for amdgpu-max-num-workgroups (#113751)
0 does not make sense as a value for this to be, much less the default.
Also stop emitting each individual field if it is the default, rather than
if any element was the default. Also fix the name of the test since it didn't
exactly match the real attribute name.
2024-11-05 12:50:44 -08:00
Austin Kerbow
c4d89203f3
[AMDGPU] Support preloading hidden kernel arguments (#98861)
Adds hidden kernel arguments to the function signature and marks them
inreg if they should be preloaded into user SGPRs. The normal kernarg
preloading logic then takes over with some additional checks for the
correct implicitarg_ptr alignment.

Special care is needed so that metadata for the hidden arguments is not
added twice when generating the code object.
2024-10-06 17:44:33 -07:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Jay Foad
a6bae5cb37
[AMDGPU] Split GCNSubtarget into its own file. NFC. (#105525) 2024-08-21 19:11:02 +01:00
Jay Foad
63fae3ed65
[AMDGPU] clang-tidy: no else after return etc. NFC. (#99298) 2024-07-17 21:11:00 +01:00
Jay Foad
f10a78b7e4 [AMDGPU] clang-tidy: use std::make_unique. NFC. 2024-07-17 07:58:09 +01:00
Jay Foad
fdb669b0dc [AMDGPU] clang-format: pass Triple by value and std::move it. NFC. 2024-07-16 15:52:58 +01:00
Stanislav Mekhanoshin
b132dd41eb
[AMDGPU] Remove wavefrontsize feature from GFX10+ (#98400)
A processor definition shall not include a default feature which may be
switched off by a different wave size. This avoids having to write
-mattr=-wavefrontsize32,+wavefrontsize64 in tests.
2024-07-16 01:02:25 -07:00
Craig Topper
c97a8e4bcf [AMDGPU] Remove unneeded static_cast from GCNSubtarget constructor. NFC
RegBankInfo is a std::unique_ptr<AMDGPURegisterBankInfo> so we don't
need the cast.
2024-07-10 12:54:17 -07:00
Nikita Popov
9df71d7673
[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919)
Similar to https://github.com/llvm/llvm-project/pull/96902, this adds
`getDataLayout()` helpers to Function and GlobalValue, replacing the
current `getParent()->getDataLayout()` pattern.
2024-06-28 08:36:49 +02:00
Nicolai Hähnle
7e9b49f6b8
AMDGPU: Add plumbing for private segment size argument (#96445)
The actual size of scratch/private is determined at dispatch time, so
add more plumbing to request it. Will be used in subsequent change.
2024-06-25 16:20:51 +02:00
Andreas Jonson
cc19374afa
[AMDGPU] Swap range metadata to attribute for workitem id. (#94871)
Swap out range metadata to range attribute for calls to be able to
deprecate range metadata on calls in the future.
2024-06-09 10:29:50 -04:00
Janek van Oirschot
d86b68afd7
MCExpr-ify SIProgramInfo (#88257)
Convert members in SIProgramInfo affected by variables provided by AMDGPUResourceUsageAnalysis into MCExprs.
2024-05-09 13:02:32 +01:00
Jay Foad
6eb9e214b3
RFC: [AMDGPU] Check subtarget features for consistency (#86957)
Implement GCNSubtarget::checkSubtargetFeatures as a canonical place to
check subtarget features for consistency and diagnose any
inconsistencies. To start with, the implementation just checks that
either wavefrontsize32 or wavefrontsize64 is selected.

checkSubtargetFeatures is called at the start of instruction selection.
This is pretty arbitrary. It is just a convenient point at which we have
access to the subtarget that we're going to use for codegenning a
particular function.
2024-05-09 11:37:28 +01:00
David Green
b24af43fdf
[AArch64] Improve scheduling latency into Bundles (#86310)
By default the scheduling info of instructions into a BUNDLE are given a
latency of 0 as they operate on the implicit register of the bundle.
This modifies that for AArch64 so that the latency is adjusted to use
the latency from the instruction in the bundle instead. This essentially
assumes that the bundled instructions are executed in a single cycle,
which for AArch64 is probably OK considering they are mostly used for
MOVPFX bundles, where this can help create slightly better scheduling
especially for in-order cores.
2024-04-12 10:57:01 +01:00
Jay Foad
9c58f3a234
[AMDGPU] Fix implicit $vcc operands after parsing MIR (#87781)
MIParser checks that implicit operands match the instruction definition,
so they have to be $vcc even in wave32 mode. Use the mirFileLoaded hook
to fix them after MIParser's checks, converting them to $vcc_lo which is
what the rest of CodeGen expects.

This is all just extending the fixImplicitOperands hack which was
introduced with GFX10, but at least it makes it possible to write a MIR
test which creates the same instructions that normal CodeGen would
generate.
2024-04-09 09:10:45 +01:00
Jun Wang
c4e517f59c
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Emma Pilkington
bc82cfb38d
[AMDGPU] Add an asm directive to track code_object_version (#76267)
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.

This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
2024-01-21 11:54:47 -05:00
Mariusz Sikora
a97028ac51
[AMDGPU] Update VOP instructions for GFX12 (#74853)
Co-authored-by: Mirko Brkusanin <Mirko.Brkusanin@amd.com>
2023-12-12 11:38:24 +01:00
Mirko Brkušanin
f5868cb6a6
[AMDGPU][MC] Add GFX12 VIMAGE and VSAMPLE encodings (#74062) 2023-12-04 13:04:42 +01:00
Austin Kerbow
0455596e1e [AMDGPU] Add DAG ISel support for preloaded kernel arguments
This patch adds the DAG isel changes for kernel argument preloading.
These changes are not usable with older firmware but subsequent patches
in the series will make the codegen backwards compatible. This patch
should only be submitted alongside that subsequent patch.

Preloading here begins at the start of the kernel arguments and continues
for the number of arguments indicated by the CL flag
amdgpu-kernarg-preload-count.

Aggregates and arguments passed by-ref are not supported.

Special care for the alignment of the kernarg segment is needed as well
as consideration of the alignment of addressable SGPR tuples when we
cannot directly use misaligned large tuples that the arguments are
loaded to.

Reviewed By: bcahoon

Differential Revision: https://reviews.llvm.org/D158579
2023-09-25 09:32:59 -07:00
Ivan Kosarev
bea56b0bc0 [AMDGPU] Have a subtarget feature to control use of real True16 instructions.
Real True16 instructions are as they are defined in the ISA. Fake True16
instructions are identical to real ones except that they take 32-bit
registers as operands and always use their low halves.

Reviewed By: Joe_Nash

Differential Revision: https://reviews.llvm.org/D156100
2023-09-22 10:47:13 +01:00
Simon Pilgrim
47a9cd0343 [AMDGPU] Remove constexpr from getNumUserSGPRForField/getMaxNumPreloadedSGPRs to appease older gcc builds
Older versions of gcc wouldn't accept the constexpr getNumUserSGPRForField (introduced in D159439 / 343be5132e2831d85) as it couldn't treat the llvm_unreachable call as constexpr
2023-09-13 12:19:28 +01:00
Austin Kerbow
343be5132e [AMDGPU] Add utilities to track number of user SGPRs. NFC.
Factor out and unify some common code that calculates and tracks the
number of user SGPRs.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D159439
2023-09-12 08:52:30 -07:00
Matt Arsenault
53fb907df4 AMDGPU: Special case uniformity info for single lane workgroups
Constructors/destructors and OpenMP make use of single lane groups
in some cases.
2023-06-28 07:25:48 -04:00
Matt Arsenault
b9c6d9e6c3 AMDGPU: Propagate amdgpu-waves-per-eu with attributor
This will do a value range merging down the callgraph, unlike the
current pass which can only propagate values to undecorated functions
from a kernel.

This one is a bit weird due to the interaction with the implied range
from amdgpu-flat-workgroup-size. At the default group range of 1,1024,
the minimum implied bounds is 4 so this ends up introducing the
attribute on undecorated functions. We could probably simplify this by
ignoring it and propagating the raw values. The subtarget interaction
and the interaction with amdgpu-flat-workgroup-size only really clamp
invalid values (plus the lower bound doesn't seem to do anything as
far as I can tell anyway).
2023-06-16 15:04:08 -04:00
pvanhout
ecbd37d5a3 [AMDGPU] Port no-hsa-graphic-shaders.ll to code object V4
Split from D146023

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D152432
2023-06-09 09:07:53 +02:00
Matt Arsenault
3d0350b762 AMDGPU: Add MF independent version of getImplicitParameterOffset 2023-06-07 08:26:31 -04:00
Changpeng Fang
7ca3444fba AMDGPU: Use module flag to get code object version at IR level follow-up
Summary:
  This is part of the leftover work for https://reviews.llvm.org/D143138.
In this work, we pass code object version as an argument to initialize target ID
and use it for targetID dump.

Reviewers: arsenm

Differential Revision
  https://reviews.llvm.org/D143293
2023-02-10 11:16:38 -08:00
Changpeng Fang
54cf69c9d5 AMDGPU: Use module flag to get code object version at IR level
Summary:
  This patch introduces a mechanism to check the code object version from the module flag. This avoids checking from the command line.
In case the module flag is missing, we use the current default code object version supported in the compiler.

For tools whose inputs are not IR, we may need another approach (a directive, for example) to check the code
object version. That will be in a separate patch later.

For the LIT test updates, we directly add the module flag if there is only a single code object version associated with all checks in one file.
In case of multiple code object versions in one file, we use the "sed" method to "clone" the checks to achieve the goal.

Reviewer: arsenm

Differential Revision:
  https://reviews.llvm.org/D14313
2023-02-02 18:57:26 -08:00
Nicolai Hähnle
10cef708a7 AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.

getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.

At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).

This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.

Differential Revision: https://reviews.llvm.org/D139468
2023-01-23 21:43:06 +01:00
Nicolai Hähnle
84610a82a1 AMDGPU: Add AMDGPUSubtarget::getEUsPerCU()
We will use this for more accurate occupancy computations. Note that
IsaInfo takes WGP mode vs. CU mode into account on gfx10+.

Differential Revision: https://reviews.llvm.org/D139467
2023-01-23 21:43:05 +01:00
Matt Arsenault
c16a58b36c Attributes: Add function getter to parse integer string attributes
The most common case for string attributes parses them as integers. We
don't have a convenient way to do this, and as a result we have
inconsistent missing attribute and invalid attribute handling
scattered around. We also have inconsistent radix usage to
getAsInteger; some places use the default 0 and others use base 10.

Update a few of the uses, but there are quite a lot of these.
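
As a standalone illustration of the idea (one helper with one policy for missing and invalid values), here is a minimal sketch that does not use LLVM's attribute classes. The function name is hypothetical, and the choice of base 10 and of returning the default on any parse failure are assumptions made for this sketch.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Value is the attribute's string, or std::nullopt if it is missing.
// Returns Default if the attribute is missing or is not a plain
// base-10 unsigned integer.
uint64_t parseIntegerAttrOr(std::optional<std::string_view> Value,
                            uint64_t Default) {
  if (!Value)
    return Default;
  uint64_t Result = 0;
  const char *Begin = Value->data();
  const char *End = Begin + Value->size();
  auto [Ptr, Ec] = std::from_chars(Begin, End, Result, 10);
  // Reject parse errors and trailing characters such as "4x".
  if (Ec != std::errc() || Ptr != End)
    return Default;
  return Result;
}
```

Routing every caller through one such helper is what makes the handling of missing and malformed attributes consistent.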
2022-12-14 13:12:35 -05:00
Jay Foad
6443c0ee02 [AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.
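
A minimal sketch of the pattern, using C++17 class template argument deduction; the function names are hypothetical and not taken from the patch.

```cpp
#include <tuple>
#include <utility>

// Before C++17 this would be: return std::make_pair(Min, Max);
auto makeRange(unsigned Min, unsigned Max) { return std::pair(Min, Max); }

// Before C++17 this would be: return std::make_tuple(X, Y, Z);
auto makeDims(unsigned X, unsigned Y, unsigned Z) {
  return std::tuple(X, Y, Z);
}
```

The deduced types are `std::pair<unsigned, unsigned>` and `std::tuple<unsigned, unsigned, unsigned>`, the same as with the helper functions.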

Differential Revision: https://reviews.llvm.org/D139828
2022-12-14 13:22:26 +00:00
Valery Pykhtin
d09d834bb9 [AMDGPU] Fix GCNSubtarget::getMinNumVGPRs, add unit test to check consistency between GCNSubtarget's getMinNumVGPRs, getMaxNumVGPRs and getOccupancyWithNumVGPRs.
```
  /// \returns Minimum number of VGPRs that meets given number of waves per
  /// execution unit requirement supported by the subtarget.
  unsigned getMinNumVGPRs(unsigned WavesPerEU) const;

  /// \returns Maximum number of VGPRs that meets given number of waves per
  /// execution unit requirement supported by the subtarget.
  unsigned getMaxNumVGPRs(unsigned WavesPerEU) const;

  /// Return the maximum number of waves per SIMD for kernels using \p VGPRs
  /// VGPRs
  unsigned getOccupancyWithNumVGPRs(unsigned VGPRs) const;
```

While working on RP tracking issues I noticed that getMinNumVGPRs returns incorrect
values: the problem is the large VGPR granule sizes on GFX10+ architectures. Some of the
occupancies aren't reachable because they require the same number of VGPR granules as others.
For example, 19 waves occupancy on gfx1010 requires the same number of granules as 20 waves,
so the resulting occupancy would be 20.

SGPRs have the same issue and even have an inconsistency between getMaxNumSGPRs and getOccupancyWithNumSGPRs.
It will be addressed in the next patch.

Legend:
  # MinVGPR and MaxVGPR are values returned by getMinNumVGPRs and getMaxNumVGPRs for a given Occ.
  # (ONumber) is the value returned by getOccupancyWithNumVGPRs for a given MinVGPR or MaxVGPR.
  # R means range problem: MinVGPR should be less than MaxVGPR and both should refer to the same occupancy.

Unit test output without the fix:
```
./build/unittests/Target/AMDGPU/AMDGPUTests --gtest_filter=AMDGPU.TestVGPRLimitsPerOccupancy --print-cpu-reg-limits

 gfx90a gfx940:
Occ    MinVGPR        MaxVGPR
  8        0 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     80  (O6)
  5       81 (O5)     96  (O5)
  4       97 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      257 (O1)     512 (O1)

 gfx600 gfx600 gfx601 gfx601 gfx601 gfx602 gfx602 gfx602 gfx700 gfx700 gfx701 gfx701 gfx702 gfx703 gfx703 gfx703 gfx704 gfx704 gfx705 gfx801 gfx801 gfx802 gfx802 gfx802 gfx803 gfx803 gfx803 gfx803 gfx805 gfx805 gfx810 gfx810 gfx900 gfx902 gfx904 gfx906 gfx908 gfx909 gfx90c:
Occ    MinVGPR        MaxVGPR
 10        0 (O10)    24  (O10)
  9       25 (O9)     28  (O9)
  8       29 (O8)     32  (O8)
  7       33 (O7)     36  (O7)
  6       37 (O6)     40  (O6)
  5       41 (O5)     48  (O5)
  4       49 (O4)     64  (O4)
  3       65 (O3)     84  (O3)
  2       85 (O2)     128 (O2)
  1      129 (O1)     256 (O1)

 gfx1030w64 gfx1031w64 gfx1032w64 gfx1033w64 gfx1034w64 gfx1035w64 gfx1036w64 gfx1102w64 gfx1103w64:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    32  (O16)
 15       33 (O12) R  32  (O16)
 14       33 (O12) R  32  (O16)
 13       33 (O12) R  32  (O16)
 12       33 (O12)    40  (O12)
 11       41 (O10) R  40  (O12)
 10       41 (O10)    48  (O10)
  9       49 (O9)     56  (O9)
  8       57 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     80  (O6)
  5       81 (O5)     96  (O5)
  4       97 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      256 (O2) R   256 (O2)

 gfx1100w64 gfx1101w64:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    48  (O16)
 15       49 (O12) R  48  (O16)
 14       49 (O12) R  48  (O16)
 13       49 (O12) R  48  (O16)
 12       49 (O12)    60  (O12)
 11       61 (O10) R  60  (O12)
 10       61 (O10)    72  (O10)
  9       73 (O9)     84  (O9)
  8       85 (O8)     96  (O8)
  7       97 (O7)     108 (O7)
  6      109 (O6)     120 (O6)
  5      121 (O5)     144 (O5)
  4      145 (O4)     192 (O4)
  3      193 (O3)     252 (O3)
  2      253 (O2)     256 (O2)
  1      256 (O2) R   256 (O2)

 gfx1030w32 gfx1031w32 gfx1032w32 gfx1033w32 gfx1034w32 gfx1035w32 gfx1036w32 gfx1102w32 gfx1103w32:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    64  (O16)
 15       65 (O12) R  64  (O16)
 14       65 (O12) R  64  (O16)
 13       65 (O12) R  64  (O16)
 12       65 (O12)    80  (O12)
 11       81 (O10) R  80  (O12)
 10       81 (O10)    96  (O10)
  9       97 (O9)     112 (O9)
  8      113 (O8)     128 (O8)
  7      129 (O7)     144 (O7)
  6      145 (O6)     160 (O6)
  5      161 (O5)     192 (O5)
  4      193 (O4)     256 (O4)
  3      256 (O4) R   256 (O4)
  2      256 (O4) R   256 (O4)
  1      256 (O4) R   256 (O4)

 gfx1100w32 gfx1101w32:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    96  (O16)
 15       97 (O12) R  96  (O16)
 14       97 (O12) R  96  (O16)
 13       97 (O12) R  96  (O16)
 12       97 (O12)    120 (O12)
 11      121 (O10) R  120 (O12)
 10      121 (O10)    144 (O10)
  9      145 (O9)     168 (O9)
  8      169 (O8)     192 (O8)
  7      193 (O7)     216 (O7)
  6      217 (O6)     240 (O6)
  5      241 (O5)     256 (O5)
  4      256 (O5) R   256 (O5)
  3      256 (O5) R   256 (O5)
  2      256 (O5) R   256 (O5)
  1      256 (O5) R   256 (O5)

 gfx1010w64 gfx1011w64 gfx1012w64 gfx1013w64:
Occ    MinVGPR        MaxVGPR
 20        0 (O20)    24  (O20)
 19       25 (O18) R  24  (O20)
 18       25 (O18)    28  (O18)
 17       29 (O16) R  28  (O18)
 16       29 (O16)    32  (O16)
 15       33 (O14) R  32  (O16)
 14       33 (O14)    36  (O14)
 13       37 (O12) R  36  (O14)
 12       37 (O12)    40  (O12)
 11       41 (O11)    44  (O11)
 10       45 (O10)    48  (O10)
  9       49 (O9)     56  (O9)
  8       57 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     84  (O6)
  5       85 (O5)     100 (O5)
  4      101 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      256 (O2) R   256 (O2)

 gfx1010w32 gfx1011w32 gfx1012w32 gfx1013w32:
Occ    MinVGPR        MaxVGPR
 20        0 (O20)    48  (O20)
 19       49 (O18) R  48  (O20)
 18       49 (O18)    56  (O18)
 17       57 (O16) R  56  (O18)
 16       57 (O16)    64  (O16)
 15       65 (O14) R  64  (O16)
 14       65 (O14)    72  (O14)
 13       73 (O12) R  72  (O14)
 12       73 (O12)    80  (O12)
 11       81 (O11)    88  (O11)
 10       89 (O10)    96  (O10)
  9       97 (O9)     112 (O9)
  8      113 (O8)     128 (O8)
  7      129 (O7)     144 (O7)
  6      145 (O6)     168 (O6)
  5      169 (O5)     200 (O5)
  4      201 (O4)     256 (O4)
  3      256 (O4) R   256 (O4)
  2      256 (O4) R   256 (O4)
  1      256 (O4) R   256 (O4)
```

After the fix:
```
 gfx90a gfx940:
Occ    MinVGPR        MaxVGPR
  8        0 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     80  (O6)
  5       81 (O5)     96  (O5)
  4       97 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      257 (O1)     512 (O1)

 gfx600 gfx600 gfx601 gfx601 gfx601 gfx602 gfx602 gfx602 gfx700 gfx700 gfx701 gfx701 gfx702 gfx703 gfx703 gfx703 gfx704 gfx704 gfx705 gfx801 gfx801 gfx802 gfx802 gfx802 gfx803 gfx803 gfx803 gfx803 gfx805 gfx805 gfx810 gfx810 gfx900 gfx902 gfx904 gfx906 gfx908 gfx909 gfx90c:
Occ    MinVGPR        MaxVGPR
 10        0 (O10)    24  (O10)
  9       25 (O9)     28  (O9)
  8       29 (O8)     32  (O8)
  7       33 (O7)     36  (O7)
  6       37 (O6)     40  (O6)
  5       41 (O5)     48  (O5)
  4       49 (O4)     64  (O4)
  3       65 (O3)     84  (O3)
  2       85 (O2)     128 (O2)
  1      129 (O1)     256 (O1)

 gfx1030w64 gfx1031w64 gfx1032w64 gfx1033w64 gfx1034w64 gfx1035w64 gfx1036w64 gfx1102w64 gfx1103w64:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    32  (O16)
 15        0 (O16)    32  (O16)
 14        0 (O16)    32  (O16)
 13        0 (O16)    32  (O16)
 12       33 (O12)    40  (O12)
 11       33 (O12)    40  (O12)
 10       41 (O10)    48  (O10)
  9       49 (O9)     56  (O9)
  8       57 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     80  (O6)
  5       81 (O5)     96  (O5)
  4       97 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      169 (O2)     256 (O2)

 gfx1100w64 gfx1101w64:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    48  (O16)
 15        0 (O16)    48  (O16)
 14        0 (O16)    48  (O16)
 13        0 (O16)    48  (O16)
 12       49 (O12)    60  (O12)
 11       49 (O12)    60  (O12)
 10       61 (O10)    72  (O10)
  9       73 (O9)     84  (O9)
  8       85 (O8)     96  (O8)
  7       97 (O7)     108 (O7)
  6      109 (O6)     120 (O6)
  5      121 (O5)     144 (O5)
  4      145 (O4)     192 (O4)
  3      193 (O3)     252 (O3)
  2      253 (O2)     256 (O2)
  1      253 (O2)     256 (O2)

 gfx1030w32 gfx1031w32 gfx1032w32 gfx1033w32 gfx1034w32 gfx1035w32 gfx1036w32 gfx1102w32 gfx1103w32:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    64  (O16)
 15        0 (O16)    64  (O16)
 14        0 (O16)    64  (O16)
 13        0 (O16)    64  (O16)
 12       65 (O12)    80  (O12)
 11       65 (O12)    80  (O12)
 10       81 (O10)    96  (O10)
  9       97 (O9)     112 (O9)
  8      113 (O8)     128 (O8)
  7      129 (O7)     144 (O7)
  6      145 (O6)     160 (O6)
  5      161 (O5)     192 (O5)
  4      193 (O4)     256 (O4)
  3      193 (O4)     256 (O4)
  2      193 (O4)     256 (O4)
  1      193 (O4)     256 (O4)

 gfx1100w32 gfx1101w32:
Occ    MinVGPR        MaxVGPR
 16        0 (O16)    96  (O16)
 15        0 (O16)    96  (O16)
 14        0 (O16)    96  (O16)
 13        0 (O16)    96  (O16)
 12       97 (O12)    120 (O12)
 11       97 (O12)    120 (O12)
 10      121 (O10)    144 (O10)
  9      145 (O9)     168 (O9)
  8      169 (O8)     192 (O8)
  7      193 (O7)     216 (O7)
  6      217 (O6)     240 (O6)
  5      241 (O5)     256 (O5)
  4      241 (O5)     256 (O5)
  3      241 (O5)     256 (O5)
  2      241 (O5)     256 (O5)
  1      241 (O5)     256 (O5)

 gfx1010w64 gfx1011w64 gfx1012w64 gfx1013w64:
Occ    MinVGPR        MaxVGPR
 20        0 (O20)    24  (O20)
 19        0 (O20)    24  (O20)
 18       25 (O18)    28  (O18)
 17       25 (O18)    28  (O18)
 16       29 (O16)    32  (O16)
 15       29 (O16)    32  (O16)
 14       33 (O14)    36  (O14)
 13       33 (O14)    36  (O14)
 12       37 (O12)    40  (O12)
 11       41 (O11)    44  (O11)
 10       45 (O10)    48  (O10)
  9       49 (O9)     56  (O9)
  8       57 (O8)     64  (O8)
  7       65 (O7)     72  (O7)
  6       73 (O6)     84  (O6)
  5       85 (O5)     100 (O5)
  4      101 (O4)     128 (O4)
  3      129 (O3)     168 (O3)
  2      169 (O2)     256 (O2)
  1      169 (O2)     256 (O2)

 gfx1010w32 gfx1011w32 gfx1012w32 gfx1013w32:
Occ    MinVGPR        MaxVGPR
 20        0 (O20)    48  (O20)
 19        0 (O20)    48  (O20)
 18       49 (O18)    56  (O18)
 17       49 (O18)    56  (O18)
 16       57 (O16)    64  (O16)
 15       57 (O16)    64  (O16)
 14       65 (O14)    72  (O14)
 13       65 (O14)    72  (O14)
 12       73 (O12)    80  (O12)
 11       81 (O11)    88  (O11)
 10       89 (O10)    96  (O10)
  9       97 (O9)     112 (O9)
  8      113 (O8)     128 (O8)
  7      129 (O7)     144 (O7)
  6      145 (O6)     168 (O6)
  5      169 (O5)     200 (O5)
  4      201 (O4)     256 (O4)
  3      201 (O4)     256 (O4)
  2      201 (O4)     256 (O4)
  1      201 (O4)     256 (O4)
```
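
The gfx1010 wave64 rows of the "after the fix" table can be reproduced with a small model. This is a sketch, not the GCNSubtarget code: the constants (512 VGPRs in total, 256 addressable, a granule of 4, at most 20 waves) are assumptions inferred from the table, and the minimum is found by a linear search rather than by whatever the patch does.

```cpp
#include <algorithm>

// Assumed gfx1010 wave64 parameters, chosen to match the table above.
constexpr unsigned TotalVGPRs = 512;
constexpr unsigned AddressableVGPRs = 256;
constexpr unsigned Granule = 4;
constexpr unsigned MaxWaves = 20;

// Occupancy reached by a kernel using VGPRs registers per wave.
unsigned occupancyWithNumVGPRs(unsigned VGPRs) {
  // Allocation is rounded up to whole granules (at least one granule).
  unsigned Rounded =
      std::max((VGPRs + Granule - 1) / Granule * Granule, Granule);
  return std::min(TotalVGPRs / Rounded, MaxWaves);
}

// Largest VGPR count that still allows WavesPerEU waves.
unsigned maxNumVGPRs(unsigned WavesPerEU) {
  unsigned PerWave = TotalVGPRs / WavesPerEU / Granule * Granule;
  return std::min(PerWave, AddressableVGPRs);
}

// Smallest VGPR count in the same occupancy class as maxNumVGPRs, so an
// unreachable occupancy (e.g. 19) shares its range with the reachable one.
unsigned minNumVGPRs(unsigned WavesPerEU) {
  unsigned Reached = occupancyWithNumVGPRs(maxNumVGPRs(WavesPerEU));
  unsigned VGPRs = 0;
  while (occupancyWithNumVGPRs(VGPRs) != Reached)
    ++VGPRs;
  return VGPRs;
}
```

In this model an occupancy of 19 gets the range 0 to 24 (the same as 20), 17 gets 25 to 28, and 1 gets 169 to 256, matching the corresponding table rows.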

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D138443
2022-12-06 09:14:49 +01:00
Kazu Hirata
20cde15415 [Target] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-02 20:36:06 -08:00
Carl Ritson
266b5dbc5d [AMDGPU] Add MIMG NSA threshold configuration attribute
Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D134780
2022-09-28 20:03:18 +09:00
Fangrui Song
de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
Jon Chesterfield
3a20597776 [amdgpu] Implement lds kernel id intrinsic
Implement an intrinsic for use in lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.

There are a number of implicit arguments accessed by intrinsics already,
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.

In the general case, variables must be placed at different addresses
so that they can be compactly allocated, so an indirect function call
needs some means of determining where a given variable was allocated.
It is sufficient to claim an arbitrary SGPR into which the kernel writes
an integer (in this implementation, based on metadata associated with
that kernel) and which is then passed on to indirect call sites.

The intent is to emit a __const array of LDS addresses and index into it.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D125060
2022-07-19 17:46:19 +01:00