llvm-project

Author	SHA1	Message	Date
David Green	abd2c07e39	[CostModel] Make Op0 and Op1 const in getVectorInstrCost. NFC (#137631 ) This does not alter much at the moment, but allows const pointers to be passed as Op0 and Op1, simplifying later patches	2025-05-01 15:55:08 +01:00
Jay Foad	886f1199f0	[AMDGPU] Use variadic isa<>. NFC. (#137016 )	2025-04-24 08:19:09 +01:00
David Green	98b6f8dc69	[CostModel] Remove optional from InstructionCost::getValue() (#135596 ) InstructionCost is already an optional value, containing an Invalid state that can be checked with isValid(). There is little point in returning another optional from getValue(). Most uses do not make use of it being a std::optional, dereferencing the value directly (either isValid has been checked previously or the Cost is assumed to be valid). The one case that does in AMDGPU used value_or which has been replaced by a isValid() check.	2025-04-23 07:46:27 +01:00
Sergei Barannikov	3334c3597d	[TTI] Fix discrepancies in prototypes between interface and implementations (NFCI) (#136655 ) These are not diagnosed because implementations hide the methods of the base class rather than overriding them. This works as long as a hiding function is callable with the same arguments as the same function from the base class. Pull Request: https://github.com/llvm/llvm-project/pull/136655	2025-04-22 11:40:12 +03:00
Sergei Barannikov	0014b49482	[TTI] Make all interface methods const (NFCI) (#136598 ) Making `TargetTransformInfo::Model::Impl` `const` makes sure all interface methods are `const`, in `BasicTTIImpl`, its bases, and in all derived classes. Pull Request: https://github.com/llvm/llvm-project/pull/136598	2025-04-22 06:27:29 +03:00
Sergei Barannikov	e0c1e23b99	[TTI] Constify BasicTTIImplBase::thisT() (NFCI) (#136575 ) The main change is making `thisT` method `const`, the rest of the changes is fixing compilation errors. There are two tricky methods, `getVectorInstrCost()` and `getIntImmCost()`. They have several overloads; some of these overloads are typically pulled in to derived classes using the `using` directive, and then hidden by methods in the derived class. The compiler does not complain if the hiding methods are not marked as `const`, which means that clients will use the methods from the base class. If after this change your target fails cost model tests, this must be the reason. To resolve the issue you need to make all hiding overloads `const`. See the second commit in this PR. Pull Request: https://github.com/llvm/llvm-project/pull/136575	2025-04-21 21:42:40 +03:00
Fabian Ritter	b95a6c750c	[AMDGPU] Remove special cases in TTI::getMemcpyLoop(Residual)LoweringType (#125507 ) These special cases limit the width of memory operations we use for lowering memcpy/memmove when the pointer arguments are 2-aligned or in the LDS/GDS. I found that performance in microbenchmarks on gfx90a, gfx1030, and gfx1100 is better without this limitation.	2025-02-04 08:18:24 +01:00
Joel E. Denny	18f8106f31	[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944 ) This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs. By default, the pass runs at the end of LTO, and options like ``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang` command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in `llvm/test/Analysis/KernelInfo`.	2025-01-29 12:40:19 -05:00
Fabian Ritter	a4fd3dba6e	[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332 ) When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs. This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length was smaller or slightly larger than what an unrolled iteration copies. The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks. Part of SWDEV-455845.	2024-10-28 09:04:19 +01:00
Shilei Tian	e34e27f198	[TTI][AMDGPU] Allow targets to adjust `LastCallToStaticBonus` via `getInliningLastCallToStaticBonus` (#111311 ) Currently we will not be able to inline a large function even if it only has one live use because the inline cost is still very high after applying `LastCallToStaticBonus`, which is a constant. This could significantly impact the performance because CSR spill is very expensive. This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to allow targets to customize this value. Fixes SWDEV-471398.	2024-10-11 10:19:54 -04:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).	2024-10-11 05:26:03 -07:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Luke Drummond	3be955abbc	[NFC] Remove dead code There's an early exit branch a couple of lines earlier for `MVT == f64`. Convert to an assert rather than using the duplicate ternary here. This silences an opinionated static analyser that's been bugging me.	2024-08-26 12:59:41 +01:00
Matt Arsenault	ee08d9cba5	AMDGPU: Remove global/flat atomic fadd intrinsics (#97051 ) These have been replaced with atomicrmw.	2024-08-22 23:27:33 +04:00
Matt Arsenault	cdadc2eb9e	AMDGPU: Correct costs of saturating add/sub intrinsics (#100808 ) These are directly legal with fast instructions.	2024-08-09 12:55:15 +04:00
Matt Arsenault	d7824fab6e	TTI: Check legalization cost of abs nodes (#100523 )	2024-08-09 12:51:05 +04:00
Matt Arsenault	e7630a0d60	AMDGPU: Improve cost handling of canonicalize (#101479 )	2024-08-01 19:02:20 +04:00
Matt Arsenault	524795926b	AMDGPU: Enable vectorization of v2f16 copysign (#100799 )	2024-07-30 08:48:13 +04:00
Matt Arsenault	4ed66cb4e1	AMDGPU: Improve cost handling of fma/fmuladd (#100798 ) We were overcounting the cost of fast f32 FMA. Also address todo and handle fmuladd (which I'm just assuming lowers to FMA, the slow FMA expansion is about as fast on slow targets anyway).	2024-07-30 08:45:07 +04:00
Fabian Ritter	9e462b7ea2	[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984 ) ...and also in TTI::getMemcpyLoopResidualLoweringType.	2024-07-29 13:40:53 +02:00
Nikita Popov	9df71d7673	[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919 ) Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.	2024-06-28 08:36:49 +02:00
Nikita Popov	2d209d964a	[IR] Add getDataLayout() helpers to BasicBlock and Instruction (#96902 ) This is a helper to avoid writing `getModule()->getDataLayout()`. I regularly try to use this method only to remember it doesn't exist... `getModule()->getDataLayout()` is also a common (the most common?) reason why code has to include the Module.h header.	2024-06-27 16:38:15 +02:00
Matt Arsenault	4477ff6836	AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739 ) These have been replaced with atomicrmw.	2024-06-27 15:35:24 +02:00
Matt Arsenault	70c8b9c24a	AMDGPU: Remove ds atomic fadd intrinsics (#95396 ) These have been replaced with atomicrmw fadd	2024-06-23 10:30:20 +02:00
Jeffrey Byrnes	ea43a30899	[AMDGPU] Vectorize more 16 bit shuffles (#90648 ) In the case of larger vectors, we should still prefer the vectorized version (i.e. shufflevector vs extract/insert chains). In arithmetic chains, vectorization results in chains of packed math instructions (as opposed to unpack/repack & scalarized arithmetic): https://godbolt.org/z/c5onaf6G5 In chains with PHIs, vectorization again removes the unnecessary pack / repack code around BBs: https://godbolt.org/z/vz7zYzvhs	2024-05-21 09:21:36 -07:00
David Green	4ac2721e51	[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934 ) This tries to add some costs for the shuffle in a ST3/ST4 instruction, which are represented in LLVM IR as store(interleaving shuffle). In order to detect the store, it needs to add a CxtI context instruction to check the users of the shuffle. LD3 and LD4 are added, LD2 should be a zip1 shuffle, which will be added in another patch. It should help fix some of the regressions from #87510.	2024-04-09 16:36:08 +01:00
Alexey Bataev	7bc079c852	[TTI]Fallback to SingleSrcPermute shuffle kind, if no direct estimation for extract subvector. Many targets do not have cost for extractsubvector shuffle kind, but have the costs for single source permute. If there are no costs estimation for extractsubvector, better to switch to single source permute for better cost estimation. Reviewers: RKSimon, davemgreen, arsenm Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/79837	2024-02-12 07:09:49 -05:00
Mariusz Sikora	a018c8cdbb	GFX12: Add LoopDataPrefetchPass (#75625 ) It is currently disabled by default. It will need experiments on a real HW to tune and decide on the profitability. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-19 08:32:16 +01:00
Jessica Del	32f9983c06	[AMDGPU] - Add address space for strided buffers (#74471 ) This is an experimental address space for strided buffers. These buffers can have structs as elements and a stride > 1. These pointers allow the indexed access in units of stride, i.e., they point at `buffer[index * stride]`. Thus, we can use the `idxen` modifier for buffer loads. We assign address space 9 to 192-bit buffer pointers which contain a 128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially, they are fat buffer pointers with an additional 32-bit index.	2023-12-15 15:49:25 +01:00
Mirko Brkušanin	07a6d73664	[AMDGPU] CodeGen for GFX12 VFLAT, VSCRATCH and VGLOBAL instructions (#75493 )	2023-12-15 15:01:40 +01:00
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00
Kazu Hirata	586ecdf205	[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956 ) This patch replaces uses of StringRef::{starts,ends}with with StringRef::{starts,ends}_with for consistency with std::{string,string_view}::{starts,ends}_with in C++20. I'm planning to deprecate and eventually remove StringRef::{starts,ends}with.	2023-12-11 21:01:36 -08:00
Sander de Smalen	81b7f115fb	[llvm][TypeSize] Fix addition/subtraction in TypeSize. (#72979 ) It seems TypeSize is currently broken in the sense that: TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8) without failing its assert that explicitly tests for this case: assert(LHS.Scalable == RHS.Scalable && ...); The reason this fails is that `Scalable` is a static method of class TypeSize, and LHS and RHS are both objects of class TypeSize. So this is evaluating if the pointer to the function Scalable == the pointer to the function Scalable, which is always true because LHS and RHS have the same class. This patch fixes the issue by renaming `TypeSize::Scalable` -> `TypeSize::getScalable`, as well as `TypeSize::Fixed` to `TypeSize::getFixed`, so that it no longer clashes with the variable in FixedOrScalableQuantity. The new methods now also better match the coding standard, which specifies that: * Variable names should be nouns (as they represent state) * Function names should be verb phrases (as they represent actions)	2023-11-22 08:52:53 +00:00
Fangrui Song	8e247b8f47	Replace TypeSize::{getFixed,getScalable} with canonical TypeSize::{Fixed,Scalable}. NFC	2023-10-27 00:30:41 -07:00
Jay Foad	eca2fcbdeb	[AMDGPU] Fix cost of fast unsafe f32 fdiv (#68988 )	2023-10-15 12:25:36 +01:00
Alexey Bataev	9a207578ac	[TTI]Add InsertSubvector pattern in improveShuffleKindFromMask(). It improves shuffle instructions estimation and improves vectorization outcome. Differential Revision: https://reviews.llvm.org/D157425	2023-08-18 13:47:01 -07:00
Matt Arsenault	fbe4ff8149	AMDGPU: Partially fix not respecting dynamic denormal mode The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.	2023-07-11 15:14:52 -04:00
Matt Arsenault	9df70e4a4d	AMDGPU: Fix not applying the correct default memcpy expansion threshold Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got renamed at some point in the process and the target implementation was left behind. Fixes: SWDEV-407329	2023-07-06 12:14:14 -04:00
David Green	12025cef3e	[CostModel] Use min/max intrinsics for vecreduce.min/max costs This changes the costmodelling of the vecreduce.min/max nodes to use the costs of the relevant min/max intrinsics instead of expanding them to compare and selects. The getMinMaxReductionCost have changed to take a Opcode for the relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are no longer needed. A follow up patch will add some basic fminimum/fmaximum costmodelling. Differential Revision: https://reviews.llvm.org/D153547	2023-07-04 15:02:30 +01:00
Juan Manuel MARTINEZ CAAMAÑO	dd1df099ae	[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2) Before this patch, the compiler gave a bump to the inline-threshold when the total size of the allocas passed as arguments to the callee was below 256 bytes. This heuristic ignores that some of these allocas could have been removed by SROA if inlining was applied. Ideally, this bonus would be attributed to the threshold once the size of all the allocas that could not be handled by SROA is known: at the end of the InlineCost analysis. However, we may never reach this point if the inline-cost analysis exits early when the inline cost goes over the threshold mid-analysis. This patch proposes: * Attribute the bonus in the inline-threshold when allocas are passed as arguments (regardless of their total size). * Assigns a cost to each alloca proportional to its size, such that the cost of all the allocas cancels the bonus. Potential problems: * This patch assumes that removing alloca instructions with SROA is always profitable. This may not be the case if the total size of the allocas is still too big to be promoted to registers/LDS. * Redundant calls to getTotalAllocaSize * Awkwardly, the threshold attributed contributes to the single-bb and vector bonus. Reviewed By: scchan Differential Revision: https://reviews.llvm.org/D149741	2023-06-29 09:49:16 +02:00
Matt Arsenault	53fb907df4	AMDGPU: Special case uniformity info for single lane workgroups Constructors/destructors and OpenMP make use of single lane groups in some cases.	2023-06-28 07:25:48 -04:00
Matt Arsenault	92ee60b66f	AMDGPU: Drop and upgrade llvm.amdgcn.atomic.inc/dec to atomicrmw	2023-06-21 21:20:26 -04:00
Matt Arsenault	12c12c5fe0	TTI: Add function to hasBranchDivergence It may be possible to contextually ignore divergence in a function if it's known to run single threaded.	2023-06-16 18:47:40 -04:00
Matt Arsenault	3c848194f2	CodeGen: Expand memory intrinsics in PreISelIntrinsicLowering Expand large or unknown size memory intrinsics into loops in the default lowering pipeline if the target doesn't have the corresponding libfunc. Previously AMDGPU had a custom pass which existed to call the expansion utilities. With a default no-libcall option, we can remove the libfunc checks in LoopIdiomRecognize for these, which never made any sense. This also provides a path to lifting the immarg restriction on llvm.memcpy.inline. There seems to be a bug where TLI reports functions as available if you use -march and not -mtriple.	2023-06-09 21:04:37 -04:00
Alexander Timofeev	bad4de1ae7	Don't disable loop unroll for vectorized loops on AMDGPU target We've got a performance regression after the https://reviews.llvm.org/D115261. Despite the loop being vectorized unroll is still required. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D149281	2023-05-25 22:54:41 +02:00
Juan Manuel MARTINEZ CAAMAÑO	33da608ecc	[AMDGPU][InferAddressSpaces] Only rewrite address-spaces that can be trivially casted to flat for llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} The intrinsic @llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} can only be selected for flat address spaces (constant, flat and global). This patch restricts the cases over which GCNTTIImpl::rewriteIntrinsicWithAddressSpace rewrites the intrinsic. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D149938	2023-05-16 17:32:58 +02:00
Krzysztof Drewniak	f0415f2a45	Re-land "[AMDGPU] Define data layout entries for buffers"" Re-land D145441 with data layout upgrade code fixed to not break OpenMP. This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27. Differential Revision: https://reviews.llvm.org/D149776	2023-05-03 19:43:56 +00:00
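
The method-hiding pitfall described in the e0c1e23b99 entry can be sketched in isolation. This is a minimal illustration under stated assumptions, not LLVM code: `BaseImpl`, `StaleTarget`, `FixedTarget`, `getCost`, `query`, and `checkExamples` are hypothetical names standing in for a CRTP base such as `BasicTTIImplBase`, a target's implementation, and overloaded methods such as `getVectorInstrCost()`. It shows that a non-`const` method in the derived class does not hide a `const` overload pulled in with `using`, so a call through a `const` reference silently reaches the base version.

```cpp
#include <cassert>

// Hypothetical CRTP base, loosely modelled on the pattern the commit describes.
template <typename T> struct BaseImpl {
  // Once thisT() is const, the derived object is only reachable as const.
  const T &thisT() const { return static_cast<const T &>(*this); }

  // Two overloads sharing one name, both const.
  int getCost(int) const { return 1; }
  int getCost(int, int) const { return 10; }

  // Client entry point: dispatches through thisT().
  int query() const { return thisT().getCost(0); }
};

// Target that was not updated: its overload is still non-const.
struct StaleTarget : BaseImpl<StaleTarget> {
  using BaseImpl<StaleTarget>::getCost; // pull in all base overloads
  // Differs from the base overload in cv-qualification, so it does not hide
  // it. On a const object only the base version is viable, with no diagnostic.
  int getCost(int) { return 2; }
};

// Target with the fix applied: the hiding overload is const.
struct FixedTarget : BaseImpl<FixedTarget> {
  using BaseImpl<FixedTarget>::getCost;
  int getCost(int) const { return 2; } // now hides the base overload
};

// Small self-check of the behaviour sketched above.
inline void checkExamples() {
  assert(StaleTarget().query() == 1); // base implementation was used
  assert(FixedTarget().query() == 2); // target implementation was used
}
```

In this sketch the stale target compiles cleanly yet returns the base cost, which matches the commit's advice to mark every hiding overload `const` if cost model tests start failing.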

266 Commits