llvm-project

Author	SHA1	Message	Date
pvanhout	f5399e89a2	Remove trailing whitespaces in AMDGPUUsage.rst	2024-02-12 09:30:10 +01:00
Corbin Robeck	fcb59203c8	[AMDGPU][DOC] Add MI200 Names to AMDGPUUsage Doc (#81252 )	2024-02-09 10:05:26 -05:00
Jan Patrick Lehr	f661057865	Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234 ) Reverts llvm/llvm-project#79586 This broke the AMDGPU OpenMP Offload buildbot. The typical error message was that the GPU attempted to read beyong the largest legal address. Error message: AMDGPU fatal error 1: Received error in queue 0x7f8363f22000: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address.	2024-02-09 09:57:38 +01:00
alex-t	88e52511ca	[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586 ) This change implements synthesizing the private buffer resource descriptor in the kernel prolog instead of using the preloaded kernel argument.	2024-02-08 20:27:36 +01:00
Saiyedul Islam	082f87c9d4	[AMDGPU] Change default AMDHSA Code Object version to 5 (#79038 ) Also update LIT tests and docs. For more details, see https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata Corresponding llvm-objdump AMDGPU lit tests are updated in a follow-up PR.	2024-01-23 17:08:18 +05:30
Konstantin Zhuravlyov	726d940586	AMDGPU/Docs: Add link to MI300 Instruction Set Architecture (#78777 )	2024-01-22 10:32:35 -05:00
Emma Pilkington	bc82cfb38d	[AMDGPU] Add an asm directive to track code_object_version (#76267 ) Named '.amdhsa_code_object_version'. This directive sets the e_ident[ABIVERSION] in the ELF header, and should be used as the assumed COV for the rest of the asm file. This commit also weakens the --amdhsa-code-object-version CL flag. Previously, the CL flag took precedence over the IR flag. Now the IR flag/asm directive take precedence over the CL flag. This is implemented by merging a few COV-checking functions in AMDGPUBaseInfo.h.	2024-01-21 11:54:47 -05:00
Jay Foad	9ca36932b5	[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186 )	2024-01-18 10:23:27 +00:00
Mariusz Sikora	c99da46fc1	[AMDGPU][GFX12] Add Atomic cond_sub_u32 (#76224 ) Co-authored-by: Vang Thao <Vang.Thao@amd.com>	2024-01-17 19:23:42 +01:00
Chaitanya	9803de0e8e	[AMDGPU] Add dynamic LDS size implicit kernel argument to CO-v5 (#65273 ) "hidden_dynamic_lds_size" argument will be added in the reserved section at offset 120 of the implicit argument layout. Add "isDynamicLDSUsed" flag to AMDGPUMachineFunction to identify if a function uses dynamic LDS. hidden argument will be added in below cases: - LDS global is used in the kernel. - Kernel calls a function which uses LDS global. - LDS pointer is passed as argument to kernel itself.	2024-01-04 19:05:12 +05:30
Jay Foad	c01e844a7e	[AMDGPU] Update compute program resource registers for GFX12 (#75911 ) Co-authored-by: Konstantin Zhuravlyov <kzhuravl@amd.com>	2024-01-02 13:24:42 +00:00
Jeffrey Byrnes	f1156fb622	[AMDGPU][IGLP]: Add SchedGroupMask::TRANS (#75416 ) Makes constructing SchedGroups of this type easier, and provides ability to create them with __builtin_amdgcn_sched_group_barrier	2023-12-19 16:54:18 -08:00
Jessica Del	32f9983c06	[AMDGPU] - Add address space for strided buffers (#74471 ) This is an experimental address space for strided buffers. These buffers can have structs as elements and a stride > 1. These pointers allow the indexed access in units of stride, i.e., they point at `buffer[index * stride]`. Thus, we can use the `idxen` modifier for buffer loads. We assign address space 9 to 192-bit buffer pointers which contain a 128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially, they are fat buffer pointers with an additional 32-bit index.	2023-12-15 15:49:25 +01:00
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00
Michael Halkenhaeuser	19fa27605c	[NFC][docs] Add AMDGPU documentation for `LIBOMPTARGET_STACK_SIZE` Add documentation w.r.t. changes by #72606, which allows to set the dynamic callstack size.	2023-11-28 14:09:42 -05:00
Jay Foad	a1b3b78c55	[AMDGPU] Clarify description of _HI relocation types (#73663 ) Clarify how the addend is used in _HI relocation types like R_AMDGPU_ABS32_HI based on the current behaviour of the Mesa and AMDPAL ELF loaders. This affects Mesa and AMDPAL because they use REL relocation records, so the addend for these types is the 32-bit literal value from the instruction being relocated. AMDHSA is not affected because it uses RELA relocation records which have a 64-bit addend.	2023-11-28 17:06:11 +00:00
Jay Foad	82a5708c7a	[AMDGPU] Document that PAL uses Elf64_Rel relocation records (#73648 )	2023-11-28 14:37:10 +00:00
Jay Foad	cf1e0c0b07	[AMDGPU] Define new targets gfx1200 and gfx1201 (#73133 ) Define target names and ELF numbers for new GFX12 targets gfx1200 and gfx1201. For now they behave identically to GFX11.	2023-11-23 16:44:05 +00:00
Pierre van Houtryve	4428b01faa	Reland: [AMDGPU] Remove Code Object V3 (#67118 ) V3 has been deprecated for a while as well, so it can safely be removed like V2 was removed. - [Clang] Set minimum code object version to 4 - [lld] Fix tests using code object v3 - Remove code object V3 from the AMDGPU backend, and delete or port v3 tests to v4. - Update docs to make it clear V3 can no longer be emitted.	2023-11-07 12:23:03 +01:00
bcahoon	f4b54f799a	[AMDGPU] Add documentation for scheduler intrinsics (#69854 ) Adding sched_barrier, sched_group_barrier, and iglp_opt.	2023-11-02 18:47:45 -05:00
Konstantin Zhuravlyov	8b36a19b3f	AMDGPU/Docs: Memory model updates for GFX940, GFX941, GFX942 (#71091 ) - Update memory model sequences for GFX940, GFX941, GFX942 to match implementation - Re-title "Memory Model GFX940" to "Memory Model GFX942" Co-authored with @t-tye Change-Id: I82f1707b7c3e010ce1fe8207fcca18c4570057a3 Co-authored-by: Konstantin Zhuravlyov <kzhuravl@amd.com>	2023-11-02 14:48:21 -04:00
Konstantin Zhuravlyov	8b61ef0925	AMDGPU/Docs: Add links to instruction descriptions for gfx941, gfx942 (#70941 ) Co-authored-by: Konstantin Zhuravlyov <kzhuravl@amd.com>	2023-11-01 14:14:48 -04:00
Austin Kerbow	d681461098	[AMDGPU] Add doc updates for kernarg preloading (#67516 )	2023-10-19 13:43:35 -07:00
pvanhout	868abf0961	Revert "[AMDGPU] Remove Code Object V3 (#67118 )" This reverts commit 544d91280c26fd5f7acd70eac4d667863562f4cc.	2023-10-18 12:55:36 +02:00
Pierre van Houtryve	544d91280c	[AMDGPU] Remove Code Object V3 (#67118 ) V3 has been deprecated for a while as well, so it can safely be removed like V2 was removed. - [Clang] Set minimum code object version to 4 - [lld] Fix tests using code object v3 - Remove code object V3 from the AMDGPU backend, and delete or port v3 tests to v4. - Update docs to make it clear V3 can no longer be emitted.	2023-10-16 08:21:48 +02:00
Pierre van Houtryve	fe2f67e4ba	[AMDGPU] Remove Code Object V2 (#65715 ) Code Object V2 has been deprecated for more than a year now. We can safely remove it from LLVM. - [clang] Remove support for the `-mcode-object-version=2` option. - [lld] Remove/refactor tests that were still using COV2 - [llvm] Update AMDGPUUsage.rst - Code Object V2 docs are left for informational purposes because those code objects may still be supported by the runtime/loaders for a while. - [AMDGPU] Remove COV2 emission capabilities. - [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2 - [AMDGPU] Update all tests that were still using COV2 - They are either deleted or ported directly to code object v4 (as v3 is also planned to be removed soon).	2023-09-21 12:00:45 +02:00
Saiyedul Islam	466a8149b3	Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410 )" (#66060 ) This reverts commit 0a8d17e79b02a92814a2a788d79df1f54d70ec3e.	2023-09-12 15:13:59 +05:30
Saiyedul Islam	0a8d17e79b	[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410 ) Also update LIT tests and docs. For more details, see https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata Reviewed By: arsenm, jhuber6 Github PR: #65410 Differential Revision: https://reviews.llvm.org/D129818	2023-09-12 13:53:31 +05:30
Matt Arsenault	17bd80601e	AMDGPU: Implement llvm.get.fpmode Currently s_getreg_b32 is missing the possible mode use. Really we need separate pseudos for mode-only accesses, but leave this as a pre-existing issue. https://reviews.llvm.org/D152710	2023-09-10 10:19:19 +03:00
Matt Arsenault	5f8ee45d5a	AMDGPU: Implement llvm.get.rounding There are really two rounding modes, so only return the standard values if both modes are the same. Otherwise, return a bitmask representing the two modes. Annoyingly the register doesn't use the same values as FLT_ROUNDS. Use a simple integer table we can shift into to convert. https://reviews.llvm.org/D153158	2023-08-30 14:06:13 -04:00
Kazu Hirata	3a14993fa4	Fix typos in documentation	2023-08-27 00:18:14 -07:00
Jeffrey Byrnes	3ba8dabbf3	[AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11 This provides a uniform way to lower into the relevant instructions across all generations. Differential Revision: https://reviews.llvm.org/D158468 Change-Id: I1f7ba4b15ee470738535cf1c7d177a11fc471e43	2023-08-25 11:45:55 -07:00
Diana Picus	71eb8c07dd	[AMDGPU] Update amdgpu_cs_chain_preserve docs. NFC We no longer allow calls to functions with the `amdgpu_gfx` calling convention from functions with the `amdgpu_cs_chain_preserve` calling convention. See D153517. Also mention that we can't have a chain call from amdgpu_cs_chain_preserve using more VGPRs than it has received. Differential Revision: https://reviews.llvm.org/D156408	2023-08-24 10:17:00 +02:00
Matt Arsenault	81b278e613	AMDGPU: Fix fast f32 exp2 Mirror of the previous log changes, OpenCL conformance doesn't like interpreting afn as ignore denormal handling but was previously hidden by flag dropping.	2023-08-15 10:48:46 -04:00
Matt Arsenault	e09b3593ba	AMDGPU: Fix fast math log2 f32 Apparently afn doesn't allow you to drop the denormal handling according to OpenCL conformance. This was hidden by losing the flags during the library linking process. Fast log is still broken and needs more work. https://reviews.llvm.org/D157936	2023-08-15 10:48:46 -04:00
Changpeng Fang	93eb6c8662	{AMDGPU] [NFC] Correct a typo in docs	2023-08-11 16:47:01 -07:00
Changpeng Fang	d77c62053c	[clang][AMDGPU]: Don't use byval for struct arguments in function ABI Summary: Byval requires allocating additional stack space, and always requires an implicit copy to be inserted in codegen, where it can be difficult to optimize. In this work, we use byref/IndirectAliased promotion method instead of byval with the implicit copy semantics. Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D155986	2023-08-11 16:37:42 -07:00
Matt Arsenault	9a53f5f5c4	AMDGPU: Handle llvm.stacksave and llvm.stackrestore Not sure if the only valid use is to have stackrestore directly consume stacksave outputs or not. Handled exactly like a regular stack pointer so all the edge cases theoretically should work. https://reviews.llvm.org/D156669	2023-08-11 10:25:01 -04:00
Matt Arsenault	395cd33ba8	AMDGPU: Remove trailing whitespace from documentation	2023-07-25 07:54:11 -04:00
Matt Arsenault	e3fd8f83a8	AMDGPU: Correctly expand f64 sqrt intrinsic rocm-device-libs and llpc were avoiding using f64 sqrt intrinsics in favor of their own expansions. Port the expansion into the backend. Both of these users should be updated to call the intrinsic instead. The library and llpc expansions are slightly different. llpc uses an ldexp to do the scale; the library uses a multiply. Use ldexp to do the scale instead of the multiply. I believe v_ldexp_f64 and v_mul_f64 are always the same number of cycles, but it's cheaper to materialize the 32-bit integer constant than the 64-bit double constant. The libraries have another fast version of sqrt which will be handled separately. I am tempted to do this in an IR expansion instead. In the IR we could take advantage of computeKnownFPClass to avoid the 0-or-inf argument check.	2023-07-25 07:54:11 -04:00
Pravin Jagtap	c48ed93cf8	[AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic. When input to intrinsic is uniform value, reduced value is same as input whereas if input value is divergent we need to iterate over all active lanes of WaveFront to perform the reduction. The control flow for a `loop` has been set up, which iterates over `only` active lanes to perform reduction. Introduced WAVE_REDUCE_UMIN_PSEUDO_U32 and WAVE_REDUCE_UMAX_PSEUDO_U32 Pseudos which are lowered Post-ISel (in `EmitInstrWithCustomInserter `). Reviewed By: arsenm, #amdgpu Differential Revision: https://reviews.llvm.org/D154858	2023-07-24 00:06:00 -04:00
Jay Foad	92542f2a40	[AMDGPU] Add targets gfx1150 and gfx1151 This is the target definition only. Currently they are treated the same as GFX 11.0.x. Differential Revision: https://reviews.llvm.org/D155429	2023-07-17 13:06:12 +01:00
Jon Chesterfield	6043d4dfec	[amdgpu] Accept an optional max to amdgpu-lds-size attribute for use in PromoteAlloca	2023-07-15 21:37:21 +01:00
Jon Chesterfield	74e928a081	[amdgpu][lds] Remove recalculation of LDS frame from backend Do the LDS frame calculation once, in the IR pass, instead of repeating the work in the backend. Prior to this patch: The IR lowering pass sets up a per-kernel LDS frame and annotates the variables with absolute_symbol metadata so that the assembler can build lookup tables out of it. There is a fragile association between kernel functions and named structs which is used to recompute the frame layout in the backend, with fatal_errors catching inconsistencies in the second calculation. After this patch: The IR lowering pass additionally sets a frame size attribute on kernels. The backend uses the same absolute_symbol metadata that the assembler uses to place objects within that frame size. Deleted the now dead allocation code from the backend. Left for a later cleanup: - enabling lowering for anonymous functions - removing the elide-module-lds attribute (test churn, it's not used by llc any more) - adjusting the dynamic alignment check to not use symbol names Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D155190	2023-07-13 23:54:38 +01:00
Matt Arsenault	30fd35f59c	AMDGPU: Add some notes about amdgpu-flat-work-group-size	2023-07-07 19:02:46 -04:00
Matt Arsenault	5491666248	AMDGPU: Correctly lower llvm.exp.f32 The library expansion has too many paths for all the permutations of DAZ, unsafe and the 3 exp functions. It's easier to expand it in the backend when we know all of these things. The library currently misses the no-infinity check on the overflow, which this handles optimizing out. Some of the <3 x half> fast tests regress due to vector widening dropping flags which will be fixed separately. Apparently there is no exp10 intrinsic, but there should be. Adds some deadish code in preparation for adding one while I'm following along with the current library expansion.	2023-07-05 17:23:49 -04:00
Matt Arsenault	ed556a1ad5	AMDGPU: Correctly lower llvm.exp2.f32 Previously this did a fast math expansion only.	2023-07-05 17:23:48 -04:00
Matt Arsenault	4e15f378ee	AMDGPU: Correctly lower llvm.log.f32 and llvm.log10.f32 Previously we expanded these in a fast-math way and the device libraries were relying on this behavior. The libraries have a pending change to switch to the new target intrinsic. Unlike the library version, this takes advantage of no-infinities on the result overflow check.	2023-07-05 15:30:35 -04:00
Matt Arsenault	003b58f65b	IR: Add llvm.frexp intrinsic Add an intrinsic which returns the two pieces as multiple return values. Alternatively could introduce a pair of intrinsics to separately return the fractional and exponent parts. AMDGPU has native instructions to return the two halves, but could use some generic legalization and optimization handling. For example, we should be able to handle legalization of f16 on older targets, and for bf16. Additionally antique targets need a hardware workaround which would be better handled in the backend rather than in library code where it is now.	2023-06-28 14:50:16 -04:00
Matt Arsenault	89ccfa1b39	AMDGPU: Use correct lowering for llvm.log2.f32 We previously directly codegened to v_log_f32, which is broken for denormals. The lowering isn't complicated, you simply need to scale denormal inputs and adjust the result. Note log and log10 are still not accurate enough, and will be fixed separately.	2023-06-23 08:37:37 -04:00

1 2 3 4 5 ...

414 Commits