llvm-project

Author	SHA1	Message	Date
Matt Arsenault	273e8d85fe	DiagnosticInfo: Fix missing LLVM_LIFETIME_BOUND on Twine arguments (#190331 ) Fix use after free errors in DiagnosticInfoResourceLimit uses.	2026-04-03 11:08:00 +00:00
Jay Foad	0b49adc32c	[AMDGPU] Rename AMDGPUMachineFunction to AMDGPUMachineFunctionInfo. NFC. (#187276 ) This is derived from MachineFunctionInfo not MachineFunction.	2026-03-18 20:29:47 +00:00
michaelselehov	cb3fbe921b	[AMDGPU] Set preferred function alignment based on icache geometry (#183064 ) Non-entry functions were unconditionally aligned to 4 bytes with no architecture-specific preferred alignment, and setAlignment() was used instead of ensureAlignment(), overwriting any explicit IR attributes. Add instruction cache line size and fetch alignment data to GCNSubtarget for each generation (GFX9: 64B/32B, GFX10: 64B/4B, GFX11+: 128B/4B). Use this to call setPrefFunctionAlignment() in SITargetLowering, aligning non-entry functions to the cache line size by default. Change setAlignment to ensureAlignment in AMDGPUAsmPrinter so explicit IR align attributes are respected. Empirical thread trace analysis on gfx942, gfx1030, gfx1100, and gfx1200 showed that only GFX9 exhibits measurable fetch stalls when functions cross the 32-byte fetch window boundary. GFX10+ showed no alignment sensitivity. A hidden option -amdgpu-align-functions-for-fetch-only is provided to use the fetch granularity instead of cache line size. Assisted-by: Claude Opus	2026-03-11 07:57:37 -04:00
Mirko Brkušanin	d0f50d5574	[AMDGPU] Remove DX10_CLAMP and IEEE bits from gfx1170 (#182107 ) Add `DX10ClampAndIEEEMode` feature and set it for every subtarget prior to gfx1170	2026-03-04 12:16:41 +01:00
Jay Foad	b39247c391	[AMDGPU] Fix typo "PGRM" in variable name. NFC. (#184104 )	2026-03-02 12:20:30 +00:00
David Stuttard	cd68939326	[AMDGPU] Add attribute for FWD_PROGRESS (#181675 ) Added an attribute for FWD_PROGRESS that allows it to be turned off for some shaders.	2026-02-26 11:35:36 +00:00
Mariusz Sikora	3c0f5045e1	[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567 ) For now list of features is based on gfx12 and gfx1250 --------- Co-authored-by: Jay Foad <jay.foad@amd.com>	2026-01-26 14:16:36 +01:00
Jameson Nash	d10b2b566a	[NFCI] replace getValueType with new getGlobalSize query (#177186 ) Returns uint64_t to simplify callers. The goal is eventually replace getValueType with this query, which should return the known minimum reference-able size, as provided (instead of a Type) during create. Additionally the common isSized query would be replaced with an isExactKnownSize query to test if that size is an exact definition.	2026-01-22 13:55:53 -05:00
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) We are getting pretty close to use `GET_SUBTARGETINFO_MACRO` in the header with this cleanup.	2026-01-22 09:07:15 -05:00
PMylon	a992f29451	[AMDGPU] Emit amdgpu.max_num_named_barrier resource symbol (#169851 )	2025-12-04 12:48:23 -08:00
Changpeng Fang	5f38ae4a77	[AMDGPU] update LDS block size for gfx1250 (#167614 ) LDS block size should be 2048 bytes (512 dwords) based on current spec.	2025-11-17 16:03:47 -08:00
Craig Topper	72b02c7b37	[AMDGPU] Fix layering violations in AMDGPUMCExpr.cpp. NFC (#168242 ) AMDGPUMCExpr lives in the MC layer it should not depend on Function.h or GCNSubtarget.h Move the function that needed GCNSubtarget to the one file that called it.	2025-11-17 09:13:55 -08:00
Pierre van Houtryve	dcaa29c8ed	Revert "[AMDGPU][gfx1250] Add `cu-store` subtarget feature (#150588 )" (#157639 ) This reverts commit be17791f2624f22b3ed24a2539406164a379125d. This is not necessary for gfx1250 anymore.	2025-09-10 10:20:59 +02:00
Ana Mihajlovic	c4885849ad	[AMDGPU] Fix hw stage metadata setting for unsigned values (#154502 )	2025-09-02 10:42:11 +02:00
Shoreshen	7fff93db50	[AMDGPU] Set GRANULATED_WAVEFRONT_SGPR_COUNT of compute_pgm_rsrc1 to 0 for gfx10+ (#154666 ) According to `llvm-project/llvm/docs/AMDGPUUsage.rst::L5212` the `GRANULATED_WAVEFRONT_SGPR_COUNT`, which is `compute_pgm_rsrc1[6:9]` has to be 0 for gfx10+ arch --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-08-27 09:48:42 +08:00
Stanislav Mekhanoshin	6bccd967b1	[AMDGPU] Do not assert on non-zero COMPUTE_PGM_RSRC3 on gfx1250. NFCI (#155498 ) COMPUTE_PGM_RSRC3 does exist on gfx1250, we are just not using it yet.	2025-08-26 14:42:48 -07:00
Gang Chen	60dbde69cd	[AMDGPU] report named barrier cnt part2 (#154588 )	2025-08-20 12:00:45 -07:00
Kazu Hirata	fab0860685	[AMDGPU] Remove an unnecessary cast (NFC) (#154470 ) getAddressableLocalMemorySize() already returns unsigned.	2025-08-19 22:45:12 -07:00
Gang Chen	ef68d1587d	[AMDGPU] upstream barrier count reporting part1 (#154409 )	2025-08-19 16:42:31 -07:00
Tim Renouf	f279c47cb3	AMDGPU gfx12: Add _dvgpr$ symbols for dynamic VGPRs (#148251 ) For each function with the AMDGPU_CS_Chain calling convention, with dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the function symbol, plus an offset encoding one less than the number of VGPR blocks used by the function (16 VGPRs per block, no more than 128) in bits 5..3 of the symbol value. This is used by a front-end to have functions that are chained rather than called, and a dispatcher that dynamically resizes the VGPR count before dispatching to a function.	2025-08-15 16:33:06 +01:00
Stanislav Mekhanoshin	57c1e01e48	[AMDGPU] Don't allow wgp mode on gfx1250 (#153680 ) - gfx1250 only supports cu mode	2025-08-14 15:16:56 -07:00
Stanislav Mekhanoshin	49f2093477	[AMDGPU] Increase LDS to 320K on gfx1250 (#153645 )	2025-08-14 12:52:00 -07:00
Diana Picus	a910a6a8b5	[AMDGPU] AsmPrinter: Unify arg handling (#151672 ) When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware. At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection. This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly. This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).	2025-08-08 12:00:37 +02:00
Tim Renouf	e99c565cd2	MC,AMDGPU: Don't pad .text with s_code_end if it would otherwise be empty (#147980 ) We don't want that padding in a module that only contains data, not code. Also fix MCSection::hasInstructions() so it works with the asm streamer too.	2025-08-06 13:25:45 +01:00
Pierre van Houtryve	be17791f26	[AMDGPU][gfx1250] Add `cu-store` subtarget feature (#150588 ) Determines whether we can use `SCOPE_CU` stores (on by default), or whether all stores must be done at `SCOPE_SE` minimum.	2025-07-29 11:38:43 +02:00
Jay Foad	1c49ce676c	[AMDGPU] Enable FWD_PROGRESS bit for GFX10+ on PAL (#139895 ) Performance testing shows no significant gains or losses on graphics workloads, so this is mostly to make the behavior consistent across all supported OSes instead of special-casing HSA.	2025-07-21 17:29:06 +01:00
Vikram Hegde	1ccd779324	[AMDGPU][NewPM] Port "AMDGPUResourceUsageAnalysis" to NPM (#130959 )	2025-07-10 13:35:43 +05:30
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Matt Arsenault	7031280218	AMDGPU: Use reportFatalUsageError for unsupported code object version (#145133 )	2025-06-21 13:01:18 +09:00
Andrew Rogers	19658d1474	[llvm] annotate interfaces in llvm/Target for DLL export (#143615 ) ## Purpose This patch is one in a series of code-mods that annotate LLVM’s public interface for export. This patch annotates the `llvm/Target` library. These annotations currently have no meaningful impact on the LLVM build; however, they are a prerequisite to support an LLVM Windows DLL (shared library) build. ## Background This effort is tracked in #109483. Additional context is provided in [this discourse](https://discourse.llvm.org/t/psa-annotating-llvm-public-interface/85307), and documentation for `LLVM_ABI` and related annotations is found in the LLVM repo [here](https://github.com/llvm/llvm-project/blob/main/llvm/docs/InterfaceExportAnnotations.rst). A sub-set of these changes were generated automatically using the [Interface Definition Scanner (IDS)](https://github.com/compnerd/ids) tool, followed by formatting with `git clang-format`. The bulk of this change is manual additions of `LLVM_ABI` to `LLVMInitializeX` functions defined in .cpp files under llvm/lib/Target. Adding `LLVM_ABI` to the function implementation is required here because they do not `#include "llvm/Support/TargetSelect.h"`, which contains the declarations for these functions and was already updated with `LLVM_ABI` in a previous patch. I considered patching these files with `#include "llvm/Support/TargetSelect.h"` instead, but since TargetSelect.h is a large file with a bunch of preprocessor x-macro stuff in it I was concerned it would unnecessarily impact compile times. In addition, a number of unit tests under llvm/unittests/Target required additional dependencies to make them build correctly against the LLVM DLL on Windows using MSVC. ## Validation Local builds and tests to validate cross-platform compatibility. This included llvm, clang, and lldb on the following configurations: - Windows with MSVC - Windows with Clang - Linux with GCC - Linux with Clang - Darwin with Clang	2025-06-17 13:28:45 -07:00
Diana Picus	a5cbd2ab0b	Revert "[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#… (#144039 ) …133242)" This reverts commit 130080fab11cde5efcb338b77f5c3b31097df6e6 because it causes issues in testcases similar to coalescer_remat.ll [1], i.e. when we use a VGPR tuple but only write to its lower parts. The high VGPRs would then not be included in the vgpr_count, and accessing them would be an out of bounds violation. [1] https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/coalescer_remat.ll	2025-06-13 12:48:24 +02:00
Diana Picus	130080fab1	[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242 ) Don't count register uses when determining the maximum number of registers used by a function. Count only the defs. This is really an underestimate of the true register usage, but in practice that's not a problem because if a function uses a register, then it has either defined it earlier, or some other function that executed before has defined it. In particular, the register counts are used: 1. When launching an entry function - in which case we're safe because the register counts of the entry function will include the register counts of all callees. 2. At function boundaries in dynamic VGPR mode. In this case it's safe because whenever we set the new VGPR allocation we take into account the outgoing_vgpr_count set by the middle-end. The main advantage of doing this is that the artificial VGPR arguments used only for preserving the inactive lanes when using the llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This enables us to allocate only the registers we need in dynamic VGPR mode. --------- Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>	2025-06-03 11:20:48 +02:00
Matthias Braun	675cb70641	Register assembly printer passes (#138348 ) Register assembly printer passes in the pass registry. This makes it possible to use `llc -start-before=<target>-asm-printer ...` in tests. Adds a `char &ID` parameter to the AssemblyPrinter constructor to allow targets to use the `INITIALIZE_PASS` macros and register the pass in the pass registry. This currently has a default parameter so it won't break any targets that have not been updated.	2025-05-06 18:01:17 -07:00
Nikita Popov	b492ec5899	[ErrorHandling] Add reportFatalInternalError + reportFatalUsageError (NFC) (#138251 ) This implements the result of the discussion at: https://discourse.llvm.org/t/rfc-report-fatal-error-and-the-default-value-of-gencrashdialog/73587 There are two different use cases for report_fatal_error, so replace it with two functions reportFatalInternalError() and reportFatalUsageError(). The former indicates a bug in LLVM and generates a crash dialog. The latter does not. The names have been suggested by rnk and people seemed to like them. This replaces a lot of the usages that passed an explicit value for GenCrashDiag. I did not bulk replace remaining report_fatal_error usage -- they probably require case by case review for which function to use.	2025-05-05 12:10:03 +02:00
Diana Picus	72c3c30452	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055 ) The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).	2025-03-19 13:49:19 +01:00
Diana Picus	0a21ef9536	[AMDGPU] Add SubtargetFeature for dynamic VGPR mode (#130030 ) This represents a hardware mode supported only for wave32 compute shaders. When enabled, we set the `.dynamic_vgpr_en` field of `.compute_registers` to true in the PAL metadata. This will be changed to use an attribute after downstream consumers have been migrated.	2025-03-18 11:48:01 +01:00
Alex Voicu	c1fabd681f	[llvm][AMDGPU] Enable FWD_PROGRESS bit for GFX10+ (#128367 ) From GFX10 onwards it is possible to employ benevolent scheduling of waves. This patch unconditionally enables, for the `amdhsa` OS, the bit which controls that capability, as it is beneficial for algorithms that rely on more complex concurrent coordination and it is generally performance neutral otherwise.	2025-03-17 23:17:46 +00:00
Stanislav Mekhanoshin	6c9a9d9fe2	[AMDGPU] Set inst_pref_size to maximum (#126981 ) On gfx11 and gfx12 set initial instruction prefetch size to a minimum of kernel size and maximum allowed value. Fixes: SWDEV-513122	2025-03-03 10:40:31 -08:00
Stanislav Mekhanoshin	2479479285	[AMDGPU] Extend ComputePGMRSrc3 to gfx10+. NFCI. (#129289 ) ComputePGMRSrc3 exists since gfx90a and gfx10+. Current code only expects gfx90a. This is NFCI since we do not fill it on gfx10+ yet.	2025-03-03 08:22:15 -08:00
Stanislav Mekhanoshin	d19187f5fe	[AMDGPU] Move into SIProgramInfo and cache getFunctionCodeSize. NFCI. (#127111 ) This moves function as is, improvements to the estimate go into a subsequent patch.	2025-02-17 18:22:48 -08:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Janek van Oirschot	82944595fa	[AMDGPU] Change scope of resource usage info symbols (#114810 ) Change scope of resource usage info MC symbols to align with the function linkage type	2025-01-21 13:10:06 +00:00
Austin Kerbow	2e5c298281	[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167 ) Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function. 1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes. 2. The backwards compatible entry point which starts at offset 0.	2025-01-10 11:39:02 -08:00
Shilei Tian	86734c8577	[NFC][AMDGPU] Remove redundant code in `AMDGPUAsmPrinter.cpp`	2024-11-20 15:08:26 -05:00
Matt Arsenault	5a556d55fb	AMDGPU: Increase the LDS size to support to 160 KB for gfx950 (#116309 )	2024-11-18 10:48:56 -08:00
Shilei Tian	6548b6354d	Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403 )" This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.	2024-11-08 20:21:16 -05:00
Janek van Oirschot	7f60f1312a	[AMDGPU] Fix resource usage information for unnamed functions (#115320 ) Resource usage information would try to overwrite unnamed functions if there are multiple within the same compilation unit. This aims to either use the `MCSymbol` assigned to the unnamed function (i.e., `CurrentFnSym`), or, rematerialize the `MCSymbol` for the unnamed function.	2024-11-07 18:24:54 +00:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Thomas Symalla	b95d50e5d8	Add and call `AMDGPUMCResourceInfo::reset` method (#110818 ) When compiling multiple pipelines, the `MCRegisterInfo` instance in `AMDGPUAsmPrinter` gets re-used even after finalization, so it calls `finalize()` multiple times. Add a reset method and call it in `AMDGPUAsmPrinter::doFinalization`. Different approach would be to make it a `unique_ptr`. --------- Co-authored-by: Thomas Symalla <tsymalla@amd.com>	2024-10-02 14:17:01 +02:00
Janek van Oirschot	c897c13dde	[AMDGPU] Convert AMDGPUResourceUsageAnalysis pass from Module to MF pass (#102913 ) Converts AMDGPUResourceUsageAnalysis pass from Module to MachineFunction pass. Moves function resource info propagation to the MC layer (through helpers in AMDGPUMCResourceInfo) by generating MCExprs for every function resource which the emitters have been prepped for. Fixes https://github.com/llvm/llvm-project/issues/64863	2024-09-30 11:43:34 +01:00
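
Note on 6206f5444f: the worked example in that commit message (10 waves per EU, 4 EUs per CU, 64-wide waves, no LDS usage, flat workgroup size range [513,1024]) can be checked with a brute-force sketch. This is not the constant-time method the commit describes, and it is not LLVM code. The function names, the parameter defaults, and the choice to round waves per EU down are assumptions made for illustration; limits other than wave slots (LDS, registers, a cap on groups per CU) are ignored.

```python
def occupancy_per_eu(group_size, wave_size=64, eus_per_cu=4, max_waves_per_eu=10):
    """Waves per EU when a CU is filled with whole groups of `group_size` items.

    Hypothetical helper: only wave slots limit how many groups fit on the CU.
    """
    # Ceiling division: a partially filled wave still occupies a wave slot.
    waves_per_group = -(-group_size // wave_size)
    wave_slots_on_cu = eus_per_cu * max_waves_per_eu
    # Only whole groups can be resident on the CU.
    groups_on_cu = wave_slots_on_cu // waves_per_group
    # Assumption: round down when waves do not divide evenly across the EUs.
    return (groups_on_cu * waves_per_group) // eus_per_cu


def occupancy_range(min_size, max_size):
    """Brute-force (min, max) occupancy over every size in [min_size, max_size]."""
    occupancies = [occupancy_per_eu(s) for s in range(min_size, max_size + 1)]
    return min(occupancies), max(occupancies)


if __name__ == "__main__":
    for size in (513, 640, 896, 1024):
        print(size, occupancy_per_eu(size))
    print(occupancy_range(513, 1024))
```

Under these assumptions the two bounds of the size range give 9 and 8 waves per EU, while sizes inside the range (640 and 896) give 10 and 7, matching the [7,10] range stated in the commit message.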


397 Commits