7 Commits

Author SHA1 Message Date
Emma Pilkington
4490003a22
[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name, 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
pvanhout
89e91e4c0c [AMDGPU] Remove post-PromoteAlloca SROA run
PromoteAlloca now uses SSAUpdater, so it no longer needs SROA to clean up after it.

Internal testing shows no noticeable performance impact.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156398
2023-08-11 08:29:21 +02:00
Corbin Robeck
7a4968b5a3 [AMDGPU] Add dynamic stack bit info to kernel-resource-usage Rpass output
In code object 5 (https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata) the AMDGPU backend added the .uses_dynamic_stack bit to the kernel metadata to identify kernels whose stack usage cannot be determined at compile time (mainly indirect function calls and recursion). This patch adds this information to the output of the kernel-resource-usage remarks.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156040

Author:    Corbin Robeck <corbin.robeck@amd.com>
2023-07-25 12:20:13 -07:00
Austin Kerbow
864a2b25be [AMDGPU] Reserve extra SGPR blocks with XNACK "any" TID Setting
ASMPrinter was relying on feature bits to set up extra SGPRs in the kernel
descriptor for the xnack_mask. This was broken for the dynamic XNACK "any" TID
setting, which could cause user SGPRs to be clobbered if the number of SGPRs
reserved was near a granulated block boundary.

When XNACK was enabled, this worked correctly in the ASMParser, which meant some
kernels were only failing without "-save-temps".
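
The block-boundary effect described above can be illustrated with a small sketch. This is not the backend's code: the granule size of 8, the two SGPRs assumed for xnack_mask, and the helper names are hypothetical values chosen only to show how the rounding behaves.

```python
# Illustrative only: GRANULE and XNACK_MASK_SGPRS are assumed values,
# not taken from the AMDGPU backend.
GRANULE = 8            # assumed SGPR allocation granule
XNACK_MASK_SGPRS = 2   # assumed extra SGPRs reserved for xnack_mask


def granulated_blocks(num_sgprs: int, granule: int = GRANULE) -> int:
    """Round an SGPR count up to whole granule-sized blocks (at least one)."""
    return max(1, -(-num_sgprs // granule))


def reserved_blocks(num_sgprs: int, reserve_xnack_mask: bool) -> int:
    """Blocks recorded for a kernel, with or without room for xnack_mask."""
    extra = XNACK_MASK_SGPRS if reserve_xnack_mask else 0
    return granulated_blocks(num_sgprs + extra)


# Far from a boundary the extra SGPRs change nothing.
print(reserved_blocks(10, False), reserved_blocks(10, True))  # 2 2
# Near a boundary they spill into another block, so omitting them
# under-reserves registers.
print(reserved_blocks(15, False), reserved_blocks(15, True))  # 2 3
```

Under these assumptions, a kernel using 15 SGPRs needs a third block once xnack_mask is counted, which is the kind of case where a descriptor that skips the reservation could leave too few registers.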

Fixes: SWDEV-382764

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D145401
2023-03-17 20:26:23 -07:00
Nicolai Hähnle
10cef708a7 AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.

getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.

At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).

This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
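
A simplified sketch of the rounding effect described above, assuming 4 SIMDs per CU, 10 waves per SIMD, a wave size of 64, and 64 kB of LDS. The function name and parameters are hypothetical; this is not the code of getOccupancyWithLocalMemSize, and it ignores other hardware limits such as the maximum number of workgroups per CU.

```python
def occupancy_waves_per_simd(lds_bytes, workgroup_size, *,
                             lds_available=65536, wave_size=64,
                             simds=4, max_waves_per_simd=10):
    """Estimate occupancy in waves per SIMD (all defaults are assumptions)."""
    # Waves needed by one workgroup, rounded up.
    waves_per_group = -(-workgroup_size // wave_size)
    # Whole workgroups that fit into the wave slots of all SIMDs together.
    groups = (max_waves_per_simd * simds) // waves_per_group
    # LDS can limit the number of resident workgroups further.
    if lds_bytes > 0:
        groups = min(groups, lds_available // lds_bytes)
    if groups == 0:
        return 0
    # Spread the resident waves over the SIMDs the workgroups occupy.
    waves = groups * waves_per_group
    return min(max_waves_per_simd, -(-waves // simds))


print(occupancy_waves_per_simd(0, 1024))  # 8: only two 16-wave groups fit
print(occupancy_waves_per_simd(0, 64))    # 10: single-wave groups fill every slot
```

With a work group size of 1024, each group needs 16 waves, so only two groups fit in the 40 wave slots and occupancy drops to 8. With a size of 64 the chunkiness disappears and the estimate reaches 10.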

Differential Revision: https://reviews.llvm.org/D139468
2023-01-23 21:43:06 +01:00
Nikita Popov
bdf2fbba9c [AMDGPU] Convert some tests to opaque pointers (NFC) 2022-12-19 12:41:13 +01:00
Vang Thao
67357739c6 [AMDGPU] Add remarks to output some resource usage
Add analysis remarks to output kernel name, register usage, occupancy,
scratch usage, spills, and LDS information.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D123878
2022-07-15 11:01:53 -07:00