llvm-project

Author	SHA1	Message	Date
Gheorghe-Teodor Bercea	d1dc843c18	[AMDGPU] Enable sinking of free vector ops that will be folded into their uses (#162580 ) Sinking ShuffleVectors / ExtractElement / InsertElement into user blocks can help enable SDAG combines by providing visibility to the values instead of emitting CopyTo/FromRegs. The sink IR pass disables sinking into loops, so this PR extends the CodeGenPrepare target hook shouldSinkOperands. Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com> --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2026-02-09 14:14:31 -05:00
Fabian Ritter	d24a6754ce	[LowerMemIntrinsics] Optimize memset lowering (#169040 ) This patch changes the memset lowering to match the optimized memcpy lowering. The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred memory access type. If that type is larger than a byte, the memset is lowered into two loops: a main loop that stores a sufficiently wide vector splat of the SetValue with the preferred memory access type and a residual loop that covers the remaining bytes individually. If the memset size is statically known, the residual loop is replaced by a sequence of stores. This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by around 7-20x. I'm planning similar treatment for memset.pattern as a follow-up PR. For SWDEV-543208.	2026-02-04 13:35:13 +01:00
Florian Hahn	b794baf8e7	[TTI] Add VectorInstrContext for context-aware insert/extract costs. (#175982 ) This commit introduces the VectorInstrContext (VIC) infrastructure to improve cost estimates for insert/extracts based on the context instruction in which the insert/extract is used. This is similar to CastContextHint, and allows providing context on how the insert/extract is going to be used before creating IR. This is useful in the LoopVectorizer, where costs need to be estimated before creating IR. The new hint currently only replaces an existing check in AArch64, but new uses will be introduced in follow-ups, including https://github.com/llvm/llvm-project/pull/177201. PR: https://github.com/llvm/llvm-project/pull/175982	2026-01-27 16:30:29 +00:00
Shilei Tian	4b1cfc5d7c	[NFCI][AMDGPU] Final touch before moving to `GET_SUBTARGETINFO_MACRO` (#177401 )	2026-01-22 17:33:17 +00:00
Shilei Tian	2692f5ed53	[NFCI][AMDGPU] Convert more `SubtargetFeatures` to use `AMDGPUSubtargetFeature` and X-macros (#177256 ) Extend the X-macro pattern to eliminate boilerplate for additional subtarget features. This reduces ~50 lines of repetitive member declarations and getter definitions.	2026-01-21 18:03:32 -05:00
Jameson Nash	ba2bd3fbba	Use AllocaInst::getAllocationSize instead of manual size calculations (#176486 ) Replace patterns that manually compute allocation sizes by multiplying getTypeAllocSize(getAllocatedType()) by the array size with calls to the getAllocationSize(DL) API, which handles this correctly and concisely, returning nullopt for VLAs. This fixes several places that were not accounting for array allocations when computing sizes, simplifies code that was doing this manually, and adds some explicit isFixed checks where implied convert was being used. This PR is because now that we have opaque pointers, I hate that some AllocaInst still has type information being consumed by some passes instead of just using the size, since passes rarely handle that type information well or correctly. I hope this will grow into a sequence of commits to slowly eliminate uses of getAllocatedType from AllocaInst. And similarly later to remove type information from GlobalValue too (it can be replaced with just dereferenceable bytes, similar to arguments). Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 09:55:52 -05:00
Robert Imschweiler	71cc38736c	[InferAddressSpaces] Handle unconverted ptrmask (#140802 ) In case a ptrmask cannot be converted to the new address space due to an unknown mask value, this needs to be detected and an addrspacecast is needed to not hinder a future use of the unconverted return value of ptrmask. Otherwise, users of this value will become invalid by receiving a nullptr as an operand. This LLVM defect was identified via the AMD Fuzzing project. (See https://reviews.llvm.org/D80129 for an explanation of why some ptrmasks are impossible to convert to other addrspaces.)	2026-01-14 09:10:05 +01:00
Ramkumar Ramachandra	d69335bac9	[LLVM] Clean up code using [not_]equal_to (NFC) (#175824 ) Use llvm::[not_]equal_to landed in d2a521750 ([ADT] Introduce bind_{front,back}, [not_]equal_to, #175056) across LLVM for cleaner code.	2026-01-13 21:19:39 +00:00
Pankaj Dwivedi	000279dcad	[NFC][TTI] Introduce getInstructionUniformity API for uniformity analysis (#168903 ) This patch introduces a new TargetTransformInfo hook `getInstructionUniformity()` that provides a unified interface for querying target-specific uniformity information about instructions and values. The new hook returns an `InstructionUniformity` enum with three values: - Default: Result is uniform if all operands are uniform (standard propagation) - AlwaysUniform: Result is always uniform regardless of operands - NeverUniform: Result can never be assumed uniform This API wraps the existing `isAlwaysUniform()` and `isSourceOfDivergence()` hooks, providing a single entry point for uniformity queries. Both LLVM IR-level (via TTI) and MIR-level (via TargetInstrInfo) uniformity analysis have been updated to use the new hook. Target implementations: - AMDGPU: Wraps existing `isAlwaysUniform()` and `isSourceOfDivergence()` hooks - NVPTX: Wraps existing `isSourceOfDivergence()` hook This is an NFC change - all implementations return conservative defaults or wrap existing functionality. Ref patch:https://github.com/llvm/llvm-project/pull/137639	2025-12-10 23:14:34 +05:30
Nicolai Hähnle	69589dd2c0	AMDGPU: Improve getShuffleCost accuracy for 8- and 16-bit shuffles (#168818 ) These shuffles can always be implemented using v_perm_b32, and so this rewrites the analysis from the perspective of "how many v_perm_b32s does it take to assemble each register of the result?" The test changes in Transforms/SLPVectorizer/reduction.ll are reasonable: VI (gfx8) has native f16 math, but not packed math.	2025-11-21 19:33:13 +00:00
Giuseppe Rossini	e94a0b300a	[AMDGPU] Fix vector legalization for bf16 valu ops (#158439 ) Add v4,v8,v16,v32 legalizations for the following operations: - `FADD` - `FMUL` - `FMA` - `FCANONICALIZE`	2025-09-25 10:07:11 +01:00
Krzysztof Drewniak	9470113495	[AMDGPU] Mark workitem IDs uniform in more cases (#152581 ) This fixes an old FIXME, where (workitem ID X) / (wavefront size) would never be marked uniform if it was possible that there would be Y and Z dimensions. Now, so long as the required size of the X dimension is a power of 2, dividing that dimension by the wavefront size creates a uniform value. Furthermore, if the required launch size of the X dimension is a power of 2 that's at least the wavefront size, the Y and Z workitem IDs are now marked uniform. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-08-29 01:21:04 -05:00
Shilei Tian	351b38f266	[AMDGPU] Mark address space cast from private to flat as divergent if target supports globally addressable scratch (#152376 ) Globally addressable scratch is a new feature introduced in gfx1250. However, this feature changes how scratch space is mapped into the flat aperture, making address space casts from private to flat no longer uniform.	2025-08-06 17:08:56 -04:00
paperchalice	8bacfb2538	[AMDGPU] Remove `UnsafeFPMath` uses (#151079 ) Remove `UnsafeFPMath` in AMDGPU part, it blocks some bugfixes related to clang and the ultimate goal is to remove `resetTargetOptions` method in `TargetMachine`, see FIXME in `resetTargetOptions`. See also https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-07-31 17:36:57 +08:00
macurtis-amd	cff4a00d3f	AMDGPU: Fix runtime unrolling when cascaded GEPs present (#147700 ) Cascaded GEPs (i.e. a GEP of a GEP) are not handled when determining whether it is OK to runtime unroll loops. This change simply uses `getUnderlyingObjects` to look through cascaded GEPs.	2025-07-10 03:44:04 -05:00
Gheorghe-Teodor Bercea	3df36a2b18	[AMDGPU] Enable vectorization of i8 values. (#134934 ) This patch adjusts the cost model to account for the ability of the AMDGPU optimizer to group together i8 values into i32 values. Co-authored-by: Erich Keane <ekeane@nvidia.com>	2025-06-26 19:15:31 -04:00
David Green	77941eba7f	[CostModel] Add a DstTy to getShuffleCost (#141634 ) A shuffle will take two input vectors and a mask, to produce a new vector of size <MaskElts x SrcEltTy>. Historically it has been assumed that the SrcTy and the DstTy are the same for getShuffleCost, with that being relaxed in recent years. If the Tp passed to getShuffleCost is the SrcTy, then the DstTy can be calculated from the Mask elts and the src elt size, but the Mask is not always provided and the Tp is not reliably always the SrcTy. This has led to situations notably in the SLP vectorizer but also in the generic cost routines where assumption about how vectors will be legalized are built into the generic cost routines - for example whether they will widen or promote, with the cost modelling assuming they will widen but the default lowering to promote for integer vectors. This patch attempts to start improving that - it originally tried to alter more of the cost model but that too quickly became too many changes at once, so this patch just plumbs in a DstTy to getShuffleCost so that DstTy and SrcTy can be reliably distinguished. The callers of getShuffleCost have been updated to try and include a DstTy that is more accurate. Otherwise it tries to be fairly non-functional, keeping the SrcTy used as the primary type used in shuffle cost routines, only using DstTy where it was in the past (for InsertSubVector for example). Some asserts have been added that help to check for consistent values when a Mask and a DstTy are provided to getShuffleCost. Some of them took a while to get right, and some non-mask calls might still be incorrect. Hopefully this will provide a useful base to build more shuffles that alter size.	2025-06-21 12:29:29 +01:00
Matt Arsenault	a9811340b7	AMDGPU: Report special input intrinsics as free (#141948 )	2025-06-18 08:24:58 +09:00
Matt Arsenault	54015f36c6	AMDGPU: Cost model for minimumnum/maximumnum (#141946 )	2025-06-18 08:19:06 +09:00
Matt Arsenault	af65cb68f5	AMDGPU: Move fpenvIEEEMode into TTI (#141945 )	2025-06-18 08:13:57 +09:00
Matt Arsenault	3800a83160	AMDGPU: Reduce cost of f64 copysign (#141944 ) The real implementation is 1 real instruction plus a constant materialize. Call that a 1, it's not a real f64 operation.	2025-06-18 08:10:53 +09:00
Matt Arsenault	c9b2816388	AMDGPU: Fix cost model for 16-bit operations on gfx8 (#141943 ) We should only divide the number of pieces to fit the packed instructions if we actually have pk instructions. This increases the cost of copysign, but is closer to the current codegen output. It could be much cheaper than it is now.	2025-06-18 08:07:03 +09:00
Ramkumar Ramachandra	b40e4ceaa6	[ValueTracking] Make Depth last default arg (NFC) (#142384 ) Having a finite Depth (or recursion limit) for computeKnownBits is very limiting, but is currently a load-bearing necessity, as all KnownBits are recomputed on each call and there is no caching. As a prerequisite for an effort to remove the recursion limit altogether, either using a clever caching technique, or writing an easily-invalidable KnownBits analysis, make the Depth argument in APIs in ValueTracking uniformly the last argument with a default value. This would aid in removing the argument when the time comes, as many callers that currently pass 0 explicitly are now updated to omit the argument altogether.	2025-06-03 17:12:24 +01:00
Krzysztof Drewniak	13c467b2cd	[AMDGPU] Add make.buffer.rsrc to InferAddressSpaces (#140770 ) make.buffer.rsrc can be subjected to address space inference. There's not _currently_ a reason to have this, but we might as well handle this in case it comes up. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-05-20 16:13:37 -07:00
Krzysztof Drewniak	4bdd116b80	[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425 ) This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation. This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features). It also plumbs the intrinsic through to clang.	2025-05-19 07:15:04 -07:00
David Green	abd2c07e39	[CostModel] Make Op0 and Op1 const in getVectorInstrCost. NFC (#137631 ) This does not alter much at the moment, but allows const pointers to be passed as Op0 and Op1, simplifying later patches	2025-05-01 15:55:08 +01:00
Jay Foad	886f1199f0	[AMDGPU] Use variadic isa<>. NFC. (#137016 )	2025-04-24 08:19:09 +01:00
David Green	98b6f8dc69	[CostModel] Remove optional from InstructionCost::getValue() (#135596 ) InstructionCost is already an optional value, containing an Invalid state that can be checked with isValid(). There is little point in returning another optional from getValue(). Most uses do not make use of it being a std::optional, dereferencing the value directly (either isValid has been checked previously or the Cost is assumed to be valid). The one case that does in AMDGPU used value_or which has been replaced by a isValid() check.	2025-04-23 07:46:27 +01:00
Sergei Barannikov	3334c3597d	[TTI] Fix discrepancies in prototypes between interface and implementations (NFCI) (#136655 ) These are not diagnosed because implementations hide the methods of the base class rather than overriding them. This works as long as a hiding function is callable with the same arguments as the same function from the base class. Pull Request: https://github.com/llvm/llvm-project/pull/136655	2025-04-22 11:40:12 +03:00
Sergei Barannikov	0014b49482	[TTI] Make all interface methods const (NFCI) (#136598 ) Making `TargetTransformInfo::Model::Impl` `const` makes sure all interface methods are `const`, in `BasicTTIImpl`, its bases, and in all derived classes. Pull Request: https://github.com/llvm/llvm-project/pull/136598	2025-04-22 06:27:29 +03:00
Sergei Barannikov	e0c1e23b99	[TTI] Constify BasicTTIImplBase::thisT() (NFCI) (#136575 ) The main change is making the `thisT` method `const`; the rest of the changes fix compilation errors. Note: there are two tricky methods, `getVectorInstrCost()` and `getIntImmCost()`. They have several overloads; some of these overloads are typically pulled in to derived classes using the `using` directive, and then hidden by methods in the derived class. The compiler does not complain if the hiding methods are not marked as `const`, which means that clients will use the methods from the base class. If after this change your target fails cost model tests, this must be the reason. To resolve the issue you need to make all hiding overloads `const`. See the second commit in this PR. Pull Request: https://github.com/llvm/llvm-project/pull/136575	2025-04-21 21:42:40 +03:00
Fabian Ritter	b95a6c750c	[AMDGPU] Remove special cases in TTI::getMemcpyLoop(Residual)LoweringType (#125507 ) These special cases limit the width of memory operations we use for lowering memcpy/memmove when the pointer arguments are 2-aligned or in the LDS/GDS. I found that performance in microbenchmarks on gfx90a, gfx1030, and gfx1100 is better without this limitation.	2025-02-04 08:18:24 +01:00
Joel E. Denny	18f8106f31	[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944 ) This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs. By default, the pass runs at the end of LTO, and options like ``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang` command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in `llvm/test/Analysis/KernelInfo`.	2025-01-29 12:40:19 -05:00
Fabian Ritter	a4fd3dba6e	[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332 ) When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs. This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length was smaller or slightly larger than what an unrolled iteration copies. The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks. Part of SWDEV-455845.	2024-10-28 09:04:19 +01:00
Shilei Tian	e34e27f198	[TTI][AMDGPU] Allow targets to adjust `LastCallToStaticBonus` via `getInliningLastCallToStaticBonus` (#111311 ) Currently we will not be able to inline a large function even if it only has one live use because the inline cost is still very high after applying `LastCallToStaticBonus`, which is a constant. This could significantly impact the performance because CSR spill is very expensive. This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to allow targets to customize this value. Fixes SWDEV-471398.	2024-10-11 10:19:54 -04:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).	2024-10-11 05:26:03 -07:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Luke Drummond	3be955abbc	[NFC] Remove dead code There's an early exit branch a couple of lines earlier for `MVT == f64`. Convert to an assert rather than using the duplicate ternary here. This silences an opinionated static analyser that's been bugging me.	2024-08-26 12:59:41 +01:00
Matt Arsenault	ee08d9cba5	AMDGPU: Remove global/flat atomic fadd intrinsics (#97051 ) These have been replaced with atomicrmw.	2024-08-22 23:27:33 +04:00
Matt Arsenault	cdadc2eb9e	AMDGPU: Correct costs of saturating add/sub intrinsics (#100808 ) These are directly legal with fast instructions.	2024-08-09 12:55:15 +04:00
Matt Arsenault	d7824fab6e	TTI: Check legalization cost of abs nodes (#100523 )	2024-08-09 12:51:05 +04:00
Matt Arsenault	e7630a0d60	AMDGPU: Improve cost handling of canonicalize (#101479 )	2024-08-01 19:02:20 +04:00
Matt Arsenault	524795926b	AMDGPU: Enable vectorization of v2f16 copysign (#100799 )	2024-07-30 08:48:13 +04:00
Matt Arsenault	4ed66cb4e1	AMDGPU: Improve cost handling of fma/fmuladd (#100798 ) We were overcounting the cost of fast f32 FMA. Also address todo and handle fmuladd (which I'm just assuming lowers to FMA, the slow FMA expansion is about as fast on slow targets anyway).	2024-07-30 08:45:07 +04:00
Fabian Ritter	9e462b7ea2	[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984 ) ...and also in TTI::getMemcpyLoopResidualLoweringType.	2024-07-29 13:40:53 +02:00
Nikita Popov	9df71d7673	[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919 ) Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.	2024-06-28 08:36:49 +02:00
Nikita Popov	2d209d964a	[IR] Add getDataLayout() helpers to BasicBlock and Instruction (#96902 ) This is a helper to avoid writing `getModule()->getDataLayout()`. I regularly try to use this method only to remember it doesn't exist... `getModule()->getDataLayout()` is also a common (the most common?) reason why code has to include the Module.h header.	2024-06-27 16:38:15 +02:00
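
Note on d24a6754ce ([LowerMemIntrinsics] Optimize memset lowering): the two-loop shape described in that message can be sketched in plain C++. This is only an illustration under assumptions, not the code the pass emits: `wideMemset` is a hypothetical name, and a 64-bit integer stands in for whatever type TTI.getMemcpyLoopLoweringType would prefer on a given target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only; not LLVM's LowerMemIntrinsics code.
// Fills `size` bytes at `dst` with `value`, using a wide main loop followed
// by a byte-wise residual loop.
void wideMemset(unsigned char *dst, unsigned char value, std::size_t size) {
  using Wide = std::uint64_t; // assumed stand-in for the preferred access type
  constexpr std::size_t Width = sizeof(Wide);

  // Splat the byte across the wide type: 0xAB becomes 0xABABABABABABABAB.
  const Wide splat = static_cast<Wide>(value) * 0x0101010101010101ULL;

  // Main loop: one wide store per iteration.
  const std::size_t mainBytes = size - (size % Width);
  for (std::size_t i = 0; i < mainBytes; i += Width)
    std::memcpy(dst + i, &splat, Width); // memcpy keeps unaligned stores well-defined

  // Residual loop: the remaining (size % Width) bytes are stored individually.
  for (std::size_t i = mainBytes; i < size; ++i)
    dst[i] = value;
}
```

When the size is a compile-time constant, the commit message says the residual loop is replaced by a fixed sequence of stores; the sketch keeps the loop form for brevity.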


291 Commits
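
Note on e0c1e23b99 ([TTI] Constify BasicTTIImplBase::thisT()): the pitfall described in that message, where a non-`const` method in a derived class silently fails to hide a `const` base overload pulled in with `using`, can be reproduced in a few lines. The class and method names below (`CostBase`, `StaleTarget`, `FixedTarget`, `getCost`, `query`) are hypothetical and only loosely modelled on the CRTP pattern the message mentions; this is a sketch, not LLVM's code.

```cpp
// Hypothetical CRTP base; names are illustrative, not LLVM's.
template <typename Derived> struct CostBase {
  const Derived *thisT() const { return static_cast<const Derived *>(this); }

  // Two overloads, as with the "tricky" methods in the commit message.
  int getCost(int /*Opcode*/) const { return 1; }
  int getCost(int /*Opcode*/, int /*Index*/) const { return 2; }

  // Dispatches through the const thisT(), so only const members are viable.
  int query(int Opcode) const { return thisT()->getCost(Opcode); }
};

struct StaleTarget : CostBase<StaleTarget> {
  using CostBase<StaleTarget>::getCost;
  // Not const: differs in cv-qualification from the base overload, so it
  // does not hide it. Calls on a const object fall back to the base version.
  int getCost(int /*Opcode*/) { return 10; }
};

struct FixedTarget : CostBase<FixedTarget> {
  using CostBase<FixedTarget>::getCost;
  // const: same signature as the base overload, so it hides it as intended.
  int getCost(int /*Opcode*/) const { return 10; }
};
```

With these definitions, `StaleTarget{}.query(0)` compiles without any diagnostic but reaches the base implementation, while `FixedTarget{}.query(0)` reaches the target's own method, which matches the fix the message recommends: make every hiding overload `const`.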