llvm-project

Author	SHA1	Message	Date
Fabian Ritter	a4fd3dba6e	[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332 ) When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs. This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length was smaller or slightly larger than what an unrolled iteration copies. The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks. Part of SWDEV-455845.	2024-10-28 09:04:19 +01:00
Shilei Tian	e34e27f198	[TTI][AMDGPU] Allow targets to adjust `LastCallToStaticBonus` via `getInliningLastCallToStaticBonus` (#111311 ) Currently we will not be able to inline a large function even if it only has one live use because the inline cost is still very high after applying `LastCallToStaticBonus`, which is a constant. This could significantly impact the performance because CSR spill is very expensive. This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to allow targets to customize this value. Fixes SWDEV-471398.	2024-10-11 10:19:54 -04:00
Rahul Joshi	fa789dffb1	[NFC] Rename `Intrinsic::getDeclaration` to `getOrInsertDeclaration` (#111752 ) Rename the function to reflect its correct behavior and to be consistent with `Module::getOrInsertFunction`. This is also in preparation of adding a new `Intrinsic::getDeclaration` that will have behavior similar to `Module::getFunction` (i.e, just lookup, no creation).	2024-10-11 05:26:03 -07:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jeffrey Byrnes	853c43d04a	[TTI] NFC: Port TLI.shouldSinkOperands to TTI (#110564 ) Porting to TTI provides direct access to the instruction cost model, which can enable instruction cost based sinking without introducing code duplication.	2024-10-09 14:30:09 -07:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Luke Drummond	3be955abbc	[NFC] Remove dead code There's an early exit branch a couple of lines earlier for `MVT == f64`. Convert to an assert rather than using the duplicate ternary here. This silences an opinionated static analyser that's been bugging me.	2024-08-26 12:59:41 +01:00
Matt Arsenault	ee08d9cba5	AMDGPU: Remove global/flat atomic fadd intrinsics (#97051 ) These have been replaced with atomicrmw.	2024-08-22 23:27:33 +04:00
Matt Arsenault	cdadc2eb9e	AMDGPU: Correct costs of saturating add/sub intrinsics (#100808 ) These are directly legal with fast instructions.	2024-08-09 12:55:15 +04:00
Matt Arsenault	d7824fab6e	TTI: Check legalization cost of abs nodes (#100523 )	2024-08-09 12:51:05 +04:00
Matt Arsenault	e7630a0d60	AMDGPU: Improve cost handling of canonicalize (#101479 )	2024-08-01 19:02:20 +04:00
Matt Arsenault	524795926b	AMDGPU: Enable vectorization of v2f16 copysign (#100799 )	2024-07-30 08:48:13 +04:00
Matt Arsenault	4ed66cb4e1	AMDGPU: Improve cost handling of fma/fmuladd (#100798 ) We were overcounting the cost of fast f32 FMA. Also address todo and handle fmuladd (which I'm just assuming lowers to FMA, the slow FMA expansion is about as fast on slow targets anyway).	2024-07-30 08:45:07 +04:00
Fabian Ritter	9e462b7ea2	[LowerMemIntrinsics][NFC] Use Align in TTI::getMemcpyLoopLoweringType (#100984 ) ...and also in TTI::getMemcpyLoopResidualLoweringType.	2024-07-29 13:40:53 +02:00
Nikita Popov	9df71d7673	[IR] Add getDataLayout() helpers to Function and GlobalValue (#96919 ) Similar to https://github.com/llvm/llvm-project/pull/96902, this adds `getDataLayout()` helpers to Function and GlobalValue, replacing the current `getParent()->getDataLayout()` pattern.	2024-06-28 08:36:49 +02:00
Nikita Popov	2d209d964a	[IR] Add getDataLayout() helpers to BasicBlock and Instruction (#96902 ) This is a helper to avoid writing `getModule()->getDataLayout()`. I regularly try to use this method only to remember it doesn't exist... `getModule()->getDataLayout()` is also a common (the most common?) reason why code has to include the Module.h header.	2024-06-27 16:38:15 +02:00
Matt Arsenault	4477ff6836	AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739 ) These have been replaced with atomicrmw.	2024-06-27 15:35:24 +02:00
Matt Arsenault	70c8b9c24a	AMDGPU: Remove ds atomic fadd intrinsics (#95396 ) These have been replaced with atomicrmw fadd	2024-06-23 10:30:20 +02:00
Jeffrey Byrnes	ea43a30899	[AMDGPU] Vectorize more 16 bit shuffles (#90648 ) In the case of larger vectors, we should still prefer the vectorized version (i.e. shufflevector vs extract/insert chains). In arithmetic chains, vectorization results in chains of packed math instructions (as opposed to unpack/repack & scalarized arithmetic): https://godbolt.org/z/c5onaf6G5 In chains with PHIs, vectorization again removes the unnecessary pack / repack code around BBs: https://godbolt.org/z/vz7zYzvhs	2024-05-21 09:21:36 -07:00
David Green	4ac2721e51	[AArch64] Add costs for ST3 and ST4 instructions, modelled as store(shuffle). (#87934 ) This tries to add some costs for the shuffle in a ST3/ST4 instruction, which are represented in LLVM IR as store(interleaving shuffle). In order to detect the store, it needs to add a CxtI context instruction to check the users of the shuffle. LD3 and LD4 are added, LD2 should be a zip1 shuffle, which will be added in another patch. It should help fix some of the regressions from #87510.	2024-04-09 16:36:08 +01:00
Alexey Bataev	7bc079c852	[TTI]Fallback to SingleSrcPermute shuffle kind, if no direct estimation for extract subvector. Many targets do not have cost for extractsubvector shuffle kind, but have the costs for single source permute. If there is no cost estimation for extractsubvector, better to switch to single source permute for better cost estimation. Reviewers: RKSimon, davemgreen, arsenm Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/79837	2024-02-12 07:09:49 -05:00
Mariusz Sikora	a018c8cdbb	GFX12: Add LoopDataPrefetchPass (#75625 ) It is currently disabled by default. It will need experiments on a real HW to tune and decide on the profitability. --------- Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-19 08:32:16 +01:00
Jessica Del	32f9983c06	[AMDGPU] - Add address space for strided buffers (#74471 ) This is an experimental address space for strided buffers. These buffers can have structs as elements and a stride > 1. These pointers allow the indexed access in units of stride, i.e., they point at `buffer[index * stride]`. Thus, we can use the `idxen` modifier for buffer loads. We assign address space 9 to 192-bit buffer pointers which contain a 128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially, they are fat buffer pointers with an additional 32-bit index.	2023-12-15 15:49:25 +01:00
Mirko Brkušanin	07a6d73664	[AMDGPU] CodeGen for GFX12 VFLAT, VSCRATCH and VGLOBAL instructions (#75493 )	2023-12-15 15:01:40 +01:00
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00
Kazu Hirata	586ecdf205	[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956 ) This patch replaces uses of StringRef::{starts,ends}with with StringRef::{starts,ends}_with for consistency with std::{string,string_view}::{starts,ends}_with in C++20. I'm planning to deprecate and eventually remove StringRef::{starts,ends}with.	2023-12-11 21:01:36 -08:00
Sander de Smalen	81b7f115fb	[llvm][TypeSize] Fix addition/subtraction in TypeSize. (#72979 ) It seems TypeSize is currently broken in the sense that: TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8) without failing its assert that explicitly tests for this case: assert(LHS.Scalable == RHS.Scalable && ...); The reason this fails is that `Scalable` is a static method of class TypeSize, and LHS and RHS are both objects of class TypeSize. So this is evaluating if the pointer to the function Scalable == the pointer to the function Scalable, which is always true because LHS and RHS have the same class. This patch fixes the issue by renaming `TypeSize::Scalable` -> `TypeSize::getScalable`, as well as `TypeSize::Fixed` to `TypeSize::getFixed`, so that it no longer clashes with the variable in FixedOrScalableQuantity. The new methods now also better match the coding standard, which specifies that: * Variable names should be nouns (as they represent state) * Function names should be verb phrases (as they represent actions)	2023-11-22 08:52:53 +00:00
Fangrui Song	8e247b8f47	Replace TypeSize::{getFixed,getScalable} with canonical TypeSize::{Fixed,Scalable}. NFC	2023-10-27 00:30:41 -07:00
Jay Foad	eca2fcbdeb	[AMDGPU] Fix cost of fast unsafe f32 fdiv (#68988 )	2023-10-15 12:25:36 +01:00
Alexey Bataev	9a207578ac	[TTI]Add InsertSubvector pattern in improveShuffleKindFromMask(). It improves shuffle instructions estimation and improves vectorization outcome. Differential Revision: https://reviews.llvm.org/D157425	2023-08-18 13:47:01 -07:00
Matt Arsenault	fbe4ff8149	AMDGPU: Partially fix not respecting dynamic denormal mode The most notable issue was producing v_mad_f32 in functions with the dynamic mode, since it just ignores the mode. fdiv lowering is still somewhat broken because it involves a mode switch and we need to query the original mode.	2023-07-11 15:14:52 -04:00
Matt Arsenault	9df70e4a4d	AMDGPU: Fix not applying the correct default memcpy expansion threshold Fixes 3c848194f28decca41b7362f9dd35d4939797724. The TTI hook name got renamed at some point in the process and the target implementation was left behind. Fixes: SWDEV-407329	2023-07-06 12:14:14 -04:00
David Green	12025cef3e	[CostModel] Use min/max intrinsics for vecreduce.min/max costs This changes the costmodelling of the vecreduce.min/max nodes to use the costs of the relevant min/max intrinsics instead of expanding them to compare and selects. The getMinMaxReductionCost has changed to take an Opcode for the relevant intrinsic, dropping the IsUnsigned and CondTy parameters as they are no longer needed. A follow up patch will add some basic fminimum/fmaximum costmodelling. Differential Revision: https://reviews.llvm.org/D153547	2023-07-04 15:02:30 +01:00
Juan Manuel MARTINEZ CAAMAÑO	dd1df099ae	[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2) Before this patch, the compiler gave a bump to the inline-threshold when the total size of the allocas passed as arguments to the callee was below 256 bytes. This heuristic ignores that some of these allocas could have been removed by SROA if inlining was applied. Ideally, this bonus would be attributed to the threshold once the size of all the allocas that could not be handled by SROA is known: at the end of the InlineCost analysis. However, we may never reach this point if the inline-cost analysis exits early when the inline cost goes over the threshold mid-analysis. This patch proposes: * Attribute the bonus in the inline-threshold when allocas are passed as arguments (regardless of their total size). * Assign a cost to each alloca proportional to its size, such that the cost of all the allocas cancels the bonus. Potential problems: * This patch assumes that removing alloca instructions with SROA is always profitable. This may not be the case if the total size of the allocas is still too big to be promoted to registers/LDS. * Redundant calls to getTotalAllocaSize * Awkwardly, the threshold attributed contributes to the single-bb and vector bonus. Reviewed By: scchan Differential Revision: https://reviews.llvm.org/D149741	2023-06-29 09:49:16 +02:00
Matt Arsenault	53fb907df4	AMDGPU: Special case uniformity info for single lane workgroups Constructors/destructors and OpenMP make use of single lane groups in some cases.	2023-06-28 07:25:48 -04:00
Matt Arsenault	92ee60b66f	AMDGPU: Drop and upgrade llvm.amdgcn.atomic.inc/dec to atomicrmw	2023-06-21 21:20:26 -04:00
Matt Arsenault	12c12c5fe0	TTI: Add function to hasBranchDivergence It may be possible to contextually ignore divergence in a function if it's known to run single threaded.	2023-06-16 18:47:40 -04:00
Matt Arsenault	3c848194f2	CodeGen: Expand memory intrinsics in PreISelIntrinsicLowering Expand large or unknown size memory intrinsics into loops in the default lowering pipeline if the target doesn't have the corresponding libfunc. Previously AMDGPU had a custom pass which existed to call the expansion utilities. With a default no-libcall option, we can remove the libfunc checks in LoopIdiomRecognize for these, which never made any sense. This also provides a path to lifting the immarg restriction on llvm.memcpy.inline. There seems to be a bug where TLI reports functions as available if you use -march and not -mtriple.	2023-06-09 21:04:37 -04:00
Alexander Timofeev	bad4de1ae7	Don't disable loop unroll for vectorized loops on AMDGPU target We've got a performance regression after the https://reviews.llvm.org/D115261. Despite the loop being vectorized unroll is still required. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D149281	2023-05-25 22:54:41 +02:00
Juan Manuel MARTINEZ CAAMAÑO	33da608ecc	[AMDGPU][InferAddressSpaces] Only rewrite address-spaces that can be trivially casted to flat for llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} The intrinsic @llvm.amdgcn.flat.atomic.{fadd,fmax,fmin} can only be selected for flat address spaces (constant, flat and global). This patch restricts the cases over which GCNTTIImpl::rewriteIntrinsicWithAddressSpace rewrites the intrinsic. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D149938	2023-05-16 17:32:58 +02:00
Krzysztof Drewniak	f0415f2a45	Re-land "[AMDGPU] Define data layout entries for buffers" Re-land D145441 with data layout upgrade code fixed to not break OpenMP. This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27. Differential Revision: https://reviews.llvm.org/D149776	2023-05-03 19:43:56 +00:00
Krzysztof Drewniak	3f2fbe92d0	Revert "[AMDGPU] Define data layout entries for buffers" This reverts commit f9c1ede2543b37fabe9f2d8f8fed5073c475d850. Differential Revision: https://reviews.llvm.org/D149758	2023-05-03 16:11:00 +00:00
Krzysztof Drewniak	f9c1ede254	[AMDGPU] Define data layout entries for buffers Per discussion at https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798, we define two new address spaces for AMDGCN targets. The first is address space 7, a non-integral address space (which was already in the data layout) that has 160-bit pointers (which are 256-bit aligned) and uses a 32-bit offset. These pointers combine a 128-bit buffer descriptor and a 32-bit offset, and will be usable with normal LLVM operations (load, store, GEP). However, they will be rewritten out of existence before code generation. The second of these is address space 8, the address space for "buffer resources". These will be used to represent the resource arguments to buffer instructions, and new buffer intrinsics will be defined that take them instead of <4 x i32> as resource arguments. ptr addrspace(8). These pointers are 128-bits long (with the same alignment). They must not be used as the arguments to getelementptr or otherwise used in address computations, since they can have arbitrarily complex inherent addressing semantics that can't be represented in LLVM. Even though, like their address space 7 cousins, these pointers have deterministic ptrtoint/inttoptr semantics, they are defined to be non-integral in order to prevent optimizations that rely on pointers being a [0, [addr_max]] value from applying to them. Future work includes: - Defining new buffer intrinsics that take ptr addrspace(8) resources. - A late rewrite to turn address space 7 operations into buffer intrinsics and offset computations. This commit also updates the "fallback address space" for buffer intrinsics to the buffer resource, and updates the alias analysis table. Depends on D143437 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D145441	2023-05-03 15:25:58 +00:00
Simon Pilgrim	9e30b87afb	[TTI] getMinMaxReductionCost - add FastMathFlag argument Similar to the getArithmeticReductionCost / getExtendedReductionCost calls (which really don't need to use std::optional<>). This will be necessary to correctly recognize fast/nnan fmax/fmul reductions which can avoid nan handling - which will allow us to remove the fmax/fmin special case in X86TTIImpl::getMinMaxCost and use getIntrinsicInstrCost like we do for integer reductions (63c3895327839ba5b57f5b99ec9e888abf976ac6). Differential Revision: https://reviews.llvm.org/D148149	2023-04-13 10:42:42 +01:00
pvanhout	9580bebd47	[AMDGPU] Remove `use-legacy-divergence-analysis` This option has been a no-op since we switched to UA. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D146937	2023-03-27 11:30:24 +02:00
Jay Foad	dcb834843e	[AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC. This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies future changes to make some of it depend on the subtarget. Differential Revision: https://reviews.llvm.org/D144650	2023-02-23 16:38:15 +00:00
Luke Lau	b02b1e0ed6	[LV][NFC] Use ElementCount for getMaxInterleaveFactor In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor. This is based off of the approach used here: `8d36708507` The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch. See https://reviews.llvm.org/D143723#4132349 Reviewed By: reames Differential Revision: https://reviews.llvm.org/D144474	2023-02-22 10:15:05 +00:00
Janek van Oirschot	e3515ba381	Reapply "[AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing function arguments through the stack" Reapplies 142c28ffa1323e9a8d53200a22c80d5d778e0d0f as part of D140242 which got reverted due to amdgpu openmp test failures. This diff fixes said failures by eliding most of `adjustInliningThresholdUsingCallee` for indirect calls as the callee function is unavailable for indirect calls. Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D143498	2023-02-13 12:17:43 +00:00
Janek van Oirschot	1beba44526	Revert "[AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing function arguments through the stack" This reverts commit 142c28ffa1323e9a8d53200a22c80d5d778e0d0f.	2023-02-03 19:13:57 +00:00


258 Commits