llvm-project

Author	SHA1	Message	Date
alex-t	88e52511ca	[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586 ) This change implements synthesizing the private buffer resource descriptor in the kernel prolog instead of using the preloaded kernel argument.	2024-02-08 20:27:36 +01:00
Ivan Kosarev	7d19dc50de	[AMDGPU][True16] Support VOP3 source DPP operands. (#80892 )	2024-02-08 16:23:00 +00:00
Pierre van Houtryve	9ff3b82948	[AMDGPU] Revert Metadata Version Upgrade (#80995 ) Metadata is still 1.2, not 1.3 after V6. I thought that amdhsa.version mapped to the COV version but it's separate, and there are no MD changes in V6, hence it doesn't need to be updated.	2024-02-08 08:30:59 +01:00
Jeffrey Byrnes	3115ad8980	[AMDGPU] Accept arbitrary sized sources in CalculateByteProvider (#70240 ) Reland the original patch with additional commit containing fix for two issues: 1. Attempting to bitcast using MVTs with no corresponding LLVM type. getDWordFromOffset now works directly with the original vector to get the corresponding elements given the DWordOffset. 2. Improper bit tracking in CalculateByteProvider for vector types using certain ops. Previously, bit tracking for certain ops (e.g. ISD::TRUNCATE) assumed operands were scalar types, which is not correct since these ops have different semantics depending on vector / scalar. CalculateByteProvider / CalculateSrcByte now exit on vector types, handling which is a TODO.	2024-02-07 11:34:50 -08:00
Carl Ritson	7d508eb5d3	Revert "[AMDGPU] Add pal metadata 3.0 support to callable pal funcs (#67104 )" This reverts commit d6c7253d32e4bdff619c39708170f1c1fa01ff95. Change causing CTS failures due to incomplete metadata.	2024-02-07 17:09:56 +09:00
Carl Ritson	9bda1de0b6	[TwoAddressInstruction] Propagate undef flags for partial defs (#79286 ) If part of a register (lowered from REG_SEQUENCE) is undefined then we should propagate undef flags to uses of those lanes. This is only performed when live intervals are present as it requires live intervals to correctly match uses to defs, and the primary goal is to allow precise computation of subrange intervals.	2024-02-07 16:46:00 +09:00
choikwa	e5638c5a00	[AMDGPU] Use correct number of bits needed for div/rem shrinking (#80622 ) There was an error where a dividend of type i64 with only 32 bits actually used fell into the path that assumes only 24 bits are used. Check that the AtLeast field is used correctly when using computeNumSignBits and add the necessary extend/trunc for the 32-bit path. Regolden and update testcases. @jrbyrnes @bcahoon @arsenm @rampitec	2024-02-06 21:32:28 +05:30
David Stuttard	d6c7253d32	[AMDGPU] Add pal metadata 3.0 support to callable pal funcs (#67104 ) PAL Metadata 3.0 introduces an explicit structure in metadata for the programmable registers written out by the compiler backend. The previous approach used opaque registers which can change between different architectures and required encoding the bitfield information in the backend, which may change between versions. This change is an extension of the previously added support, which only handled entry functions. This adds support for all functions. The change also includes some refactoring to separate common code.	2024-02-06 15:34:36 +00:00
Thorsten Schütt	364f781344	[GlobalIsel] Combine logic of icmps (#77855 ) Inspired by InstCombinerImpl::foldAndOrOfICmpsUsingRanges with some adaptations to MIR.	2024-02-06 15:58:02 +01:00
Simon Pilgrim	b8cdc2638e	[DAG] visitCTPOP - if only the upper half of the ctpop operand is zero then see if it's profitable to only count the lower half. (#80473 )	2024-02-06 12:19:31 +00:00
Matt Arsenault	42b5b720ca	AMDGPU/GlobalISel: Fix not running -global-isel in global isel test	2024-02-06 14:55:48 +05:30
Stanislav Mekhanoshin	ea9276d47e	[AMDGPU] GlobalISel for f8 conversions (#80503 )	2024-02-05 09:41:37 -08:00
Stanislav Mekhanoshin	d0b5d32ce6	[AMDGPU] Fixed byte_sel of v_cvt_f32_bf8/v_cvt_f32_fp8 (#80502 ) Opsel bits are swapped. Actual byte select table (Byte -> OPSEL): 0 -> 0, 1 -> 2, 2 -> 1, 3 -> 3.	2024-02-05 09:35:01 -08:00
Kevin P. Neal	d15c454bed	[FPEnv][AMDGPU] Correct strictfp tests. Correct AMDGPU strictfp tests to follow the rules documented in the LangRef: https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics These tests needed the strictfp attribute added to function calls and some declarations. Some of the tests now pass with D146845, others get farther along and fail with D146845. The tests revealed that further work is required in mostly AMDGPU atomics to get the tests passing. Since I was here anyway I removed the strictfp attribute from some constrained intrinsic declarations. They have this attribute by default. Test changes verified with D146845.	2024-02-05 09:29:31 -05:00
Matt Arsenault	a5d206df79	AMDGPU: Set max supported div/rem size to 64 (#80669 ) This enables IR expansion for i128 divisions. The vector case is still broken because ExpandLargeDivRem doesn't try to handle them. Fixes: SWDEV-426193	2024-02-05 19:09:38 +05:30
Pierre van Houtryve	4e958abf2f	[AMDGPU][PromoteAlloca] Support memsets to ptr allocas (#80678 ) Fixes #80366	2024-02-05 14:36:15 +01:00
Petar Avramovic	06f711a906	AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#80003 ) Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper. Use machine uniformity analysis to find divergent i1 phis and select them as lane mask phis in same way SILowerI1Copies select VReg_1 phis. Note that divergent i1 phis include phis created by LCSSA and all cases of uses outside of cycle are actually covered by "lowering LCSSA phis". GlobalISel lane masks are registers with sgpr register class and S1 LLT. TODO: General goal is that instructions created in this pass are fully instruction-selected so that selection of lane mask phis is not split across multiple passes. patch 3 from: https://github.com/llvm/llvm-project/pull/73337	2024-02-05 14:07:01 +01:00
Christudasan Devadasan	89ec940b4a	[AMDGPU] Insert spill codes for the SGPRs used for EXEC copy (#79428 ) The SGPR registers used for preserving EXEC mask while lowering the whole-wave register spills and copies should be preserved at the prolog and epilog if they are in the CSR range. It isn't happening when there is only wwm-copy lowered and there are no wwm-spills. This patch addresses that problem.	2024-02-05 18:32:23 +05:30
Nikita Popov	00a4e248dc	[AMDGPU] Convert tests to opaque pointers (NFC)	2024-02-05 12:42:23 +01:00
Pierre van Houtryve	500846d2f5	[AMDGPU] Introduce Code Object V6 (#76954 ) Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same as V5 except a new "generic version" flag can be present in EFLAGS. This is related to new generic targets that'll be added in a follow-up patch. It's also likely V6 will have new changes (possibly new metadata entries) added later. Docs changes are part of the follow-up patch #76955	2024-02-05 08:19:53 +01:00
Simon Pilgrim	faeb3d1f10	[AMDGPU] Regenerate ctpop64.ll test checks	2024-02-02 18:08:15 +00:00
Yaxun (Sam) Liu	1f3c30911c	[AMDGPU] Mark PC_ADD_REL_OFFSET rematerializable (#79674 ) Currently machine LICM hoists PC_ADD_REL_OFFSET out of loops, which increases register pressure when function calls are deep in loops. This is a main cause of sgpr spills for programs containing a large number of function calls in loops. This patch marks PC_ADD_REL_OFFSET as rematerializable, which eliminates sgpr spills due to function calls in loops.	2024-02-01 12:21:19 -05:00
Quentin Dian	112fba974c	[MIRPrinter] Don't print line break when there is no instructions (NFC) (#80147 ) Per #80143, we can remove the extra line break when there is no instruction.	2024-02-01 22:10:52 +08:00
Joseph Huber	f956e7fbf1	[AMDGPU] Prefer `s_memtime` for `readcyclecounter` on GFX10 (#80211 ) Summary: The old `s_memtime` instruction was supported until the GFX10 architecture. Although this instruction has a higher latency than the new shader counter, it's much more usable as a processor clock as it is a full 64-bit counter. The new shader counter is only a 20-bit counter, which makes it difficult to use as a standard cycle counter as it will overflow in a few milliseconds. This patch suggests preferring `s_memtime` for this intrinsic if it is still available.	2024-02-01 07:19:57 -06:00
Quentin Dian	b7738e275d	[MIRPrinter] Don't print space when there is no successor (#80143 ) Extra space causes the checks generated by update_mir_test_checks to be unavailable. ``` # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4 # RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s --- name: foo body: | ; CHECK-LABEL: name: foo ; CHECK: bb.0: ; CHECK-NEXT: successors: ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.1: ; CHECK-NEXT: RET 0, $eax bb.0: successors: bb.1: RET 0, $eax ... ``` The failure log is as follows: ``` llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match ; CHECK-NEXT: {{ $}} ^ <stdin>:21:13: note: 'next' match was here successors: ^ <stdin>:21:13: note: previous match ended here successors: ```	2024-01-31 22:35:41 +08:00
Jay Foad	c2c650f62e	[AMDGPU] Stop combining arbitrary offsets into PAL relocs (#80034 ) PAL uses ELF REL (not RELA) relocations which can only store a 32-bit addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI which require the upper 32 bits of a 64-bit address calculation to be correct. This means that it is not safe to fold an arbitrary offset into a GlobalAddressSDNode, so stop doing that. In practice this is mostly a problem for small negative offsets which do not work as expected because PAL treats the 32-bit addend as unsigned.	2024-01-31 10:28:23 +00:00
Changpeng Fang	3564666fe1	[AMDGPU]: Fix type signatures for wmma intrinsics, NFC (#80087 ) Make the wmma intrinsic type signatures canonical. We need a type signature as long as the type is not fixed. However, when an argument's type matches a previous argument's type, we do not need the signature for this argument. This patch fixes three general cases: 1. add missing signatures 2. remove signatures for matching arguments 3. reorder the signatures -- return type signature should always appear first	2024-01-30 23:17:35 -08:00
Pierre van Houtryve	ce72f78f37	[AMDGPU] Fix mul combine for MUL24 (#79110 ) MUL24 can now return a i64 for i32 operands, but the combine was never updated to handle this case. Extend the operand when rewriting the ADD to handle it. Fixes SWDEV-436654	2024-01-29 16:37:20 +01:00
Krzysztof Drewniak	63fe80fb18	[SeperateConstOffsetFromGEP] Handle `or disjoint` flags (#76997 ) This commit extends separate-const-offset-from-gep to look at the newly-added `disjoint` flag on `or` instructions so as to preserve additional opportunities for optimization. The tests were pre-committed in #76972.	2024-01-26 09:56:06 -06:00
Diana Picus	46dd8acf36	[AMDGPU] Fix typos. NFC	2024-01-26 12:04:58 +01:00
Jay Foad	c5d59fe1b2	[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX11.5 (#79460 ) The hardware bug only affects GFX11.0.x.	2024-01-25 16:28:49 +00:00
Jay Foad	45d2d7757f	[AMDGPU] New llvm.amdgcn.wave.id intrinsic (#79325 ) This is only valid on targets with architected SGPRs.	2024-01-25 07:48:06 +00:00
Jay Foad	70fc970378	[AMDGPU] Move architected SGPR implementation into isel (#79120 )	2024-01-24 15:06:20 +00:00
Nikita Popov	90ba33099c	[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882 ) This patch canonicalizes getelementptr instructions with constant indices to use the `i8` source element type. This makes it easier for optimizations to recognize that two GEPs are identical, because they don't need to see past many different ways to express the same offset. This is a first step towards https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699. This is limited to constant GEPs only for now, as they have a clear canonical form, while we're not yet sure how exactly to deal with variable indices. The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives two representative examples of the kind of optimization improvement we expect from this change. In the first test SimplifyCFG can now realize that all switch branches are actually the same. In the second test it can convert it into simple arithmetic. These are representative of common optimization failures we see in Rust. Fixes https://github.com/llvm/llvm-project/issues/69841.	2024-01-24 15:25:29 +01:00
Mirko Brkušanin	7fdf608cef	[AMDGPU] Add GFX12 WMMA and SWMMAC instructions (#77795 ) Co-authored-by: Petar Avramovic <Petar.Avramovic@amd.com> Co-authored-by: Piotr Sobczak <piotr.sobczak@amd.com>	2024-01-24 13:43:07 +01:00
Mariusz Sikora	cfddb59be2	[AMDGPU][GFX12] VOP encoding and codegen - add support for v_cvt fp8/… (#78414 ) …bf8 instructions Add VOP1, VOP1_DPP8, VOP1_DPP16, VOP3, VOP3_DPP8, VOP3_DPP16 instructions that were supported on GFX940 (MI300): - V_CVT_F32_FP8 - V_CVT_F32_BF8 - V_CVT_PK_F32_FP8 - V_CVT_PK_F32_BF8 - V_CVT_PK_FP8_F32 - V_CVT_PK_BF8_F32 - V_CVT_SR_FP8_F32 - V_CVT_SR_BF8_F32 --------- Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com> Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>	2024-01-24 12:21:15 +01:00
Petar Avramovic	c46109d0d7	Revert "AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis" (#79274 ) Reverts llvm/llvm-project#78482	2024-01-24 12:18:34 +01:00
Petar Avramovic	149ed9d2c5	AMDGPU: update GFX11 wmma hazards (#76143 ) One V_NOP or unrelated VALU instruction in between is required for correctness when matrix A or B of current WMMA instruction overlaps with matrix D of previous WMMA instruction. Remaining cases of WMMA operand overlaps are handled by the hardware and do not require handling in hazard recognizer. Hardware may stall in cases where: - matrix C of current WMMA instruction overlaps with matrix D of previous WMMA instruction - VALU instruction reads matrix D of previous WMMA instruction - matrix A,B or C of WMMA instruction reads result of previous VALU instruction	2024-01-24 12:00:35 +01:00
Petar Avramovic	91ddcba83a	AMDGPU/GlobalISelDivergenceLowering: select divergent i1 phis (#78482 ) Implement PhiLoweringHelper for GlobalISel in DivergenceLoweringHelper. Use machine uniformity analysis to find divergent i1 phis and select them as lane mask phis in same way SILowerI1Copies select VReg_1 phis. Note that divergent i1 phis include phis created by LCSSA and all cases of uses outside of cycle are actually covered by "lowering LCSSA phis". GlobalISel lane masks are registers with sgpr register class and S1 LLT. TODO: General goal is that instructions created in this pass are fully instruction-selected so that selection of lane mask phis is not split across multiple passes. patch 3 from: https://github.com/llvm/llvm-project/pull/73337	2024-01-24 11:58:32 +01:00
Christudasan Devadasan	230c13d59d	[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669 ) CSR SGPR spilling currently uses the earliest available physical VGPRs. This imposes high register pressure when trying to allocate large VGPR tuples within the default register budget. This patch changes the spilling strategy by picking the VGPRs in reverse order, the highest available VGPR first, and later, after regalloc, shifting them back to the lowest available range. With that, the initial VGPRs remain available for allocation and the chance of finding a large number of contiguous registers is higher.	2024-01-24 07:08:43 +05:30
Changpeng Fang	32073b8356	AMDGPU: Do not generate non-temporal hint when Load_Tr intrinsic did not specify it (#79104 ) int_amdgcn_global_load_tr did not specify non-temporal load transpose, thus we should not generate the non-temporal hint for the load. We need to implement getTgtMemIntrinsic to create the corresponding MemSDNode. And we don't set the non-temporal flag because the intrinsic did not specify it. NOTE: We need to implement getTgtMemIntrinsic for any memory intrinsics.	2024-01-23 10:05:32 -08:00
Jay Foad	6cf37dd504	[AMDGPU] Enable architected SGPRs for GFX12 (#79160 )	2024-01-23 16:36:30 +00:00
Mirko Brkušanin	6bb7d515c3	[AMDGPU] Properly check op_sel in GCNDPPCombine (#79122 )	2024-01-23 17:21:16 +01:00
Pierre van Houtryve	42b0884238	[AMDGPU] Handle V_PERMLANE64_B32 in fixVcmpxPermlaneHazards (#79125 ) Fixes #78856	2024-01-23 13:10:58 +01:00
Carl Ritson	4db4d7f282	[AMDGPU] SILowerSGPRSpills: do not update MRI reserve registers (#77888 ) VGPRs used for spilling do not require explicit reservation with MRI. freezeReservedRegs() executed before register allocation ensures these are placed in the reserve set. The only pass after SILowerSGPRSpills is SIPreAllocateWWMRegs which explicitly tests for interference before register allocation so should not reuse a WWM VGPR holding spill data. reserveReg prevents calculation of correct liveness for physical registers which could be used to extend SIPreAllocateWWMRegs.	2024-01-23 10:49:26 +09:00
Emma Pilkington	4897b9888f	[AMDGPU] Make a few more tests default COV agnostic (#78926 )	2024-01-22 11:22:57 -05:00
Jeremy Morse	52a8bed426	[DebugInfo][RemoveDIs] Adjust AMDGPU passes to work with DPValues (#78736 ) This patch tweaks two AMDGPU passes to use iterators rather than instruction pointers for expressing an insertion point. This is needed to accurately support DPValues, the non-instruction storage object for debug-info. Two tests were sensitive to this change (variable assignments were being put in the wrong place), and I've added extra run-lines with the "try new debug-info..." flag. These get tested on our public buildbot to ensure they continue to work accurately.	2024-01-22 14:25:08 +00:00
Pierre van Houtryve	ac296b696c	[AMDGPU] Drop verify from SIMemoryLegalizer tests (#78697 ) SIMemoryLegalizer tests were slow, with most of them taking 4.5 to 5.3s to complete and that's on a fast machine. I also recall seeing them in the slowest tests list on build bots. This removes the verify-machineinstrs option from these tests to speed them up, bringing the slowest test down to +-2s. Verifier still runs in EXPENSIVE_CHECKS builds.	2024-01-22 10:31:37 +01:00
Emma Pilkington	bc82cfb38d	[AMDGPU] Add an asm directive to track code_object_version (#76267 ) Named '.amdhsa_code_object_version'. This directive sets the e_ident[ABIVERSION] in the ELF header, and should be used as the assumed COV for the rest of the asm file. This commit also weakens the --amdhsa-code-object-version CL flag. Previously, the CL flag took precedence over the IR flag. Now the IR flag/asm directive take precedence over the CL flag. This is implemented by merging a few COV-checking functions in AMDGPUBaseInfo.h.	2024-01-21 11:54:47 -05:00
Jay Foad	63d7ca924f	[AMDGPU] Add GFX12 llvm.amdgcn.s.wait.*cnt intrinsics (#78723 )	2024-01-20 11:44:42 +00:00
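
The PAL relocation entry above (c2c650f62e) describes why small negative offsets break when a 32-bit REL addend is treated as unsigned. The following is a minimal sketch of that arithmetic only, not the linker's or PAL's actual code; the function names and the symbol address are hypothetical and chosen for illustration.

```python
MASK32 = 0xFFFFFFFF

def hi32_exact(symbol_addr: int, offset: int) -> int:
    """Upper 32 bits of the full 64-bit address symbol_addr + offset."""
    return ((symbol_addr + offset) >> 32) & MASK32

def hi32_with_unsigned_addend(symbol_addr: int, offset: int) -> int:
    """Upper 32 bits when the offset is stored as a 32-bit addend
    and read back as an unsigned value (simplified model)."""
    addend = offset & MASK32  # only 32 bits fit in the instruction field
    return ((symbol_addr + addend) >> 32) & MASK32

# Hypothetical symbol address just above a 4 GiB boundary.
sym = 0x0000_0001_0000_0010

for off in (8, -4):
    print(off, hex(hi32_exact(sym, off)), hex(hi32_with_unsigned_addend(sym, off)))
```

In this model a positive offset gives the same upper half either way, while an offset of -4 becomes an addend of 0xFFFFFFFC and pushes the upper half one too high, which is consistent with the commit's decision to stop folding arbitrary offsets.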


7164 Commits