llvm-project

Author	SHA1	Message	Date
Shilei Tian	a74659445d	[AMDGPU] Skip terminators when forcing emit zero flag (#112116 ) When forcing emit zero, we need to skip terminators of a MBB; otherwise the terminator list of the MBB would be broken.	2024-10-14 11:46:18 -04:00
Akshat Oke	dbfca24b99	[MIR] Serialize virtual register flags (#110228 ) This introduces target-specific vreg flag serialization. Flags are represented as `uint8_t` and the `TargetRegisterInfo` override provides methods `getVRegFlagValue` to deserialize and `getVRegFlagsOfReg` to serialize.	2024-10-14 14:19:53 +05:30
Janek van Oirschot	50866e84d1	Revert "[AMDGPU] Avoid resource propagation for recursion through multiple functions" (#112013 ) Reverts llvm/llvm-project#111004	2024-10-11 17:10:28 +01:00
Juan Manuel Martinez Caamaño	2d5f3b0a61	[AMDGPU][SIPreEmitPeephole] mustRetainExeczBranch: use BranchProbability and TargetSchedmodel (#109818 ) Remove s_cbranch_execnz branches if the transformation is profitable according to `BranchProbability` and `TargetSchedmodel`.	2024-10-11 17:45:59 +02:00
Janek van Oirschot	67160c5ab5	[AMDGPU] Avoid resource propagation for recursion through multiple functions (#111004 ) Avoid constructing recursive MCExpr definitions when multiple functions cause a recursion. Fixes #110863	2024-10-11 16:42:50 +01:00
Matt Arsenault	14705a912f	CodeGen: Remove redundant REQUIRES registered-target from tests (#111982 ) These are already in target specific test directories.	2024-10-11 16:16:12 +04:00
Petar Avramovic	7b0d56be1d	AMDGPU/GlobalISel: Fix inst-selection of ballot (#109986 ) Both input and output of ballot are lane-masks: result is lane-mask with 'S32/S64 LLT and SGPR bank'; input is lane-mask with 'S1 LLT and VCC reg bank'. Ballot copies bits from input lane-mask for all active lanes and puts 0 for inactive lanes. GlobalISel did not set 0 in result for inactive lanes for non-constant input.	2024-10-11 11:40:27 +02:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jay Foad	62b3a4bc70	[AMDGPU] Improve codegen for s_barrier_init (#111866 )	2024-10-10 19:40:02 +01:00
Mikhail Goncharov	8a849a2a56	Revert "Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" v2 (#111708 )" This reverts commit 4b4a0d419c81b8b12a7dbb33dae1f7e9be91a88f. New test fails on buildbots https://lab.llvm.org/buildbot/#/builders/63/builds/2039 https://lab.llvm.org/buildbot/#/builders/127/builds/1055	2024-10-10 13:37:44 +02:00
Matt Arsenault	c36f902372	AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (#111720 ) Fixes missing m0 initialization for pre-gfx9 targets with local extending loads.	2024-10-10 14:01:49 +04:00
Shilei Tian	5a74a4a667	[Attributor] Take the address space from addrspacecast directly (#108258 ) Currently `AAAddressSpace` relies on identifying the address spaces of all underlying objects. However, it might infer a sub-optimal address space when the underlying object is a function argument. In `AMDGPUPromoteKernelArgumentsPass`, a pointer kernel argument is promoted by adding a series of `addrspacecast` instructions (as shown below), in the hope that `InferAddressSpacePass` can pick it up and do the rewriting accordingly. Before promotion: ``` define amdgpu_kernel void @kernel(ptr %to_be_promoted) { %val = load i32, ptr %to_be_promoted ... ret void } ``` After promotion: ``` define amdgpu_kernel void @kernel(ptr %to_be_promoted) { %ptr.cast.0 = addrspacecast ptr %to_be_promoted to ptr addrspace(1) %ptr.cast.1 = addrspacecast ptr addrspace(1) %ptr.cast.0 to ptr # all the use of %to_be_promoted will use %ptr.cast.1 %val = load i32, ptr %ptr.cast.1 ... ret void } ``` When `AAAddressSpace` analyzes the code after promotion, it will take `%to_be_promoted` as the underlying object of `%ptr.cast.1`, and use its address space (which is 0) as its final address space, thus simply doing nothing in `manifest`. The attributor framework will then eliminate the address space cast from 0 to 1 and back to 0, and replace `%ptr.cast.1` with `%to_be_promoted`, which basically reverts all changes by `AMDGPUPromoteKernelArgumentsPass`. IMHO I'm not sure if `AMDGPUPromoteKernelArgumentsPass` promotes the argument in a proper way. To improve the handling of this case, this PR adds an extra check when iterating over all underlying objects. If an underlying object is a function argument, it means it reaches a terminal such that we can't deduce its underlying object any further. In this case, we check all uses of the argument. If they are all `addrspacecast` instructions and their destination address spaces are the same, we take the destination address space. Fixes: SWDEV-482640.	2024-10-09 22:51:07 -04:00
Jeffrey Byrnes	a31d0b2e2b	[AMDGPU] Remove some lit check lines Change-Id: I77e72d23d41095b8fcc47996d8004f9e264968de	2024-10-09 16:12:31 -07:00
Krzysztof Drewniak	4b4a0d419c	Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" v2 (#111708 ) This adds `-disable-gisel-legality-check` to some gfx6 and gfx7 test lines to prevent behavior mismatches between debug and release builds The first attempted reapply was #111059 This reverts commit e075dcf7d270fd52dc837163ff24e8c872dfeb49.	2024-10-09 17:11:41 -05:00
Matt Arsenault	a075e785b8	AMDGPU: Fix incorrectly selecting fp8/bf8 conversion intrinsics (#107291 ) Trying to codegen these on targets without the instructions should fail to select. Not sure if all the predicates are correct. We had a fake one disconnected to a feature which was always true. Fixes: SWDEV-482274	2024-10-09 21:38:47 +04:00
Jeffrey Byrnes	17bc959961	[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090 ) This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings. This feature is off by default to ease the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers, leading to unnecessary compile time.	2024-10-09 09:54:11 -07:00
Matt Arsenault	e85fcb7631	AMDGPU: Add instruction flags when lowering ctor/dtor (#111652 ) These should be well behaved address computations.	2024-10-09 18:03:35 +04:00
Matt Arsenault	1e357cde48	AMDGPU: Use pointer types more consistently (#111651 ) This was using addrspace 0 and 1 pointers interchangeably. This works out since they happen to use the same size, but consistently query or use the correct one.	2024-10-09 17:23:50 +04:00
Matt Arsenault	890e481358	AMDGPU: Regenerate test checks	2024-10-09 16:39:15 +04:00
Matt Arsenault	671cbcf642	AMDGPU: Add baseline tests for gep flag handling (#110814 ) We need to know the address computation won't overflow on older subtargets to match the addressing mode of stack instructions.	2024-10-09 13:48:01 +04:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Shilei Tian	88a239d292	[AMDGPU] Adopt new lowering sequence for `fdiv16` (#109295 ) The current lowering of `fdiv16` can generate an incorrectly rounded result in some cases. The new sequence was provided by the HW team, as shown below written in C++. ``` half fdiv(half a, half b) { float a32 = float(a); float b32 = float(b); float r32 = 1.0f / b32; float q32 = a32 * r32; float e32 = -b32 * q32 + a32; q32 = e32 * r32 + q32; e32 = -b32 * q32 + a32; float tmp = e32 * r32; uint32_t tmp32 = std::bit_cast<uint32_t>(tmp); tmp32 = tmp32 & 0xff800000; tmp = std::bit_cast<float>(tmp32); q32 = tmp + q32; half q16 = half(q32); q16 = div_fixup_f16(q16); return q16; } ``` Fixes SWDEV-477608.	2024-10-08 09:49:20 -04:00
Shilei Tian	48ac846fbc	[AMDGPU][GlobalISel] Align `selectVOP3PMadMixModsImpl` with the `SelectionDAG` counterpart (#110168 ) The current `selectVOP3PMadMixModsImpl` can produce a `V_MAD_MIX_F32` instruction that violates the constant bus restriction, while its `SelectionDAG` counterpart doesn't. The culprit is the copy stripping, whereas the `SelectionDAG` version only has bitcast stripping. This PR simply aligns the two versions.	2024-10-08 09:41:24 -04:00
Christudasan Devadasan	6636f32615	[AMDGPU] Include WWM register spill into BB Prolog (#111496 ) With #93526 we split the regalloc pipeline further to have a standalone allocation for wwm registers and per-lane VGPRs. Currently the presence of the wwm-spill reloads inserted at the bb-top prevents the isBasicPrologue function, during the per-lane vgpr regalloc, from skipping past the exec manipulation instruction, which ended up causing incorrect codegen. The wwm-spill inserted during the wwm-regalloc pipeline should also be included in the bb-prolog so that the per-lane vgpr regalloc pipeline can identify the appropriate insertion points for their spills and copies.	2024-10-08 15:13:12 +05:30
Matt Arsenault	5f94b0cbdd	AMDGPU: Try to reuse dest reg for s_add_i32 frame indexes (#111201 ) Hack around the register scavenger doing the wrong thing. It does not find the result register as available in the case the frame index add isn't also reading the dest register. This is the quick fix for a regression where the scavenge would create a broken spill of SGPR to memory. I believe this is still broken for cases we cannot use the result register. I'm confused about what position the scavenger iterator is supposed to be in, and what RestoreAfter is for. The scavenger is missing a full set of forward/backward APIs and there seems to be an off by one somewhere.	2024-10-07 18:01:24 +04:00
Paul Walker	02dd6b1014	[LLVM][CodeGen] Add lowering for scalable vector bfloat operations. (#109803 ) Specifically: fabs, fadd, fceil, fdiv, ffloor, fma, fmax, fmaxnm, fmin, fminnm, fmul, fnearbyint, fneg, frint, fround, froundeven, fsub, fsqrt & ftrunc	2024-10-07 13:01:59 +01:00
Pierre van Houtryve	924a64a348	[AMDGPU] Only emit SCOPE_SYS global_wb (#110636 ) global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness. I was initially optimistic they would be very cheap no-ops but they can actually be quite expensive so let's avoid them.	2024-10-07 07:35:31 +02:00
Austin Kerbow	c4d89203f3	[AMDGPU] Support preloading hidden kernel arguments (#98861 ) Adds hidden kernel arguments to the function signature and marks them inreg if they should be preloaded into user SGPRs. The normal kernarg preloading logic then takes over with some additional checks for the correct implicitarg_ptr alignment. Special care is needed so that metadata for the hidden arguments is not added twice when generating the code object.	2024-10-06 17:44:33 -07:00
NAKAMURA Takumi	e075dcf7d2	Revert "Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" (#111059 )" This reverts commit 98a15c7b0c6ec129d371f0c121dbe9396c4f5609. (llvmorg-20-init-8051-g98a15c7b0c6e)	2024-10-06 10:50:51 +09:00
Yaxun (Sam) Liu	3b88805ca2	[AMDGPU] Fix SDWA commuting (#106920 ) SDWA insts lack a reverse opcode, which causes them to be treated as commutable with the default reverse opcode, i.e. their own opcode. As a result, SDWA F16 sub A, B and sub B, A are merged by machine CSE. The correct behavior is to merge sub A, B and subrev B, A instead of sub B, A. This issue caused failures in rocFFT tests. Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted. Verified that this fixes the rocFFT test failures.	2024-10-04 15:53:40 -04:00
Krzysztof Drewniak	98a15c7b0c	Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" (#111059 ) This reverts commit 650c41aad2eb43c634a05b2b5799a0c13a73b92f. The test failures appear to be from conflicts with other PRs that landed around this time.	2024-10-04 12:33:26 -05:00
Matt Arsenault	e5a0c30e4a	AMDGPU: Work around machine verifier failure with convergence tokens Apparently any function with convergence tokens will fail the machine verifier after register allocation. The existing codegen tests for tokens use stop-before, and do not run to the end. Work around this by splitting out tests with convergence tokens. Fixes EXPENSIVE_CHECKS bot failures after c08d7b3de7409aecadd7f9edfe0f3a1ce28a6374 and 428ae0f12e29eff1ddcaf59bdcce904ec056963e	2024-10-04 16:59:23 +04:00
Matt Arsenault	428ae0f12e	AMDGPU: Do not tail call if an inreg argument requires waterfalling (#111002 ) If we have a divergent value passed to an outgoing inreg argument, the call needs to be executed in a waterfall loop and thus cannot be tail called. The waterfall handling of arbitrary calls is broken on the selectiondag path, so some of these cases still hit an error later. I also noticed the argument evaluation code in isEligibleForTailCallOptimization is not correctly accounting for implicit argument assignments. It also seems inreg codegen is generally broken; we are assigning arguments to the reserved private resource descriptor.	2024-10-04 00:04:02 +04:00
Matt Arsenault	c08d7b3de7	AMDGPU: Fix verifier error on tail call target in vgprs (#110984 ) We allow tail calls of known uniform function pointers. This would produce a verifier error if the uniform value is in VGPRs. Insert readfirstlanes just in case this occurs, which will fold out later if it is unnecessary. GlobalISel should need a similar fix, but it currently does not attempt tail calls of indirect calls. Fixes #107447 Fixes subissue of #110930	2024-10-03 21:50:56 +04:00
Juan Manuel Martinez Caamaño	b9bb77f0c7	[AMDGPU] Update branch-condition-and.ll to auto-generated checks (#110860 )	2024-10-03 17:12:31 +02:00
vikashgu	870bdc6ea7	Reapply "[AMDGPU]Optimize SGPR spills (#93668 )" This reverts commit c2fc7f75f67039bb1ed577bc0edbd699a850cd9d, as the dependent patch about the split vgpr regalloc pipeline solved the issue (#96353).	2024-10-03 09:47:15 +00:00
NAKAMURA Takumi	650c41aad2	Revert "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" Some builders have been failing tests. ``` Failed Tests (2): LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-load-global-old-legalization.mir LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-load-local.mir ``` This reverts commit ae5bd2a9f292037c605b2ec0ee31200581bd8701. (llvmorg-20-init-7805-gae5bd2a9f292)	2024-10-03 15:38:34 +09:00
Krzysztof Drewniak	ae5bd2a9f2	[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 ) Certain pointer address spaces were not being correctly handled by the GlobalISel lowering for buffer_load and buffer_store. 1. ptr addrspace(1) and addrspace(4) did not have rewrite patterns defined for them, while p0 did, since those pointer types weren't in the list of types that was iterated to form the patterns. 2. Vectors of pointers need to be bitcast to vectors of the corresponding scalars, since there doesn't seem to be a good way to define the rewrite patterns for buffer_load/store of those types. The need to bitcast vectors of pointers was also revealed to affect ordinary `G_LOAD` and `G_STORE` in some cases, so `shouldBitcastLoadStore()` has been fixed to handle it properly.	2024-10-02 13:46:56 -05:00
Matt Arsenault	f2eeb3dc7b	AMDGPU: Handle v_add* in eliminateFrameIndex (#102346 )	2024-10-02 21:19:45 +04:00
Matt Arsenault	4cd1f9ac9f	AMDGPU: Add baseline test for frame index folding (#110737 ) We currently can increase the instruction count when a frame index requires materialization.	2024-10-02 19:47:32 +04:00
Matt Arsenault	187dcd8e22	DAG: Preserve disjoint flag when emitting final instructions (#110795 )	2024-10-02 19:37:04 +04:00
Janek van Oirschot	e35319524a	[AMDGPU] Fix stack size metadata for functions with direct and indirect calls (#110828 ) When a function has an external call, it should still use the stack sizes of direct, known, calls to calculate its own stack size	2024-10-02 14:52:52 +01:00
Matt Arsenault	30ea7e859b	AMDGPU: Preserve flags in regbankselect when splitting or	2024-10-02 12:08:21 +04:00
Matt Arsenault	203d5158c3	AMDGPU/GlobalISel: Preserve flags when splitting select RegBankSelect was losing flags on selects.	2024-10-02 12:08:17 +04:00
Matt Arsenault	2d3119c3d9	AMDGPU: Add more tests for frame index code quality There are also some bugs with sgpr constraints.	2024-10-01 21:42:34 +04:00
Matt Arsenault	f61abee01a	AMDGPU: Add missing tests for local stack alloc s_add_i32 handling None of these tested the case where the non-frame index operand was a register.	2024-10-01 19:49:18 +04:00
Fabian Ritter	16ba126a14	[AMDGPU][GlobalISel][NFC] Use amdhsa target for flat/private tests (#110672 ) As a proxy criterion, mesa targets have unaligned-access-mode (which determines whether the hardware allows unaligned memory accesses) not set whereas amdhsa targets do. This PR changes tests to use amdhsa instead of mesa and inserts additional checks with unaligned-access-mode unset explicitly. This is in preparation for PR #110219, which will generate different code depending on the unaligned-access-mode.	2024-10-01 17:04:18 +02:00
Matt Arsenault	8395b3f60f	AMDGPU: Mark scc dead when materialized frame base registers	2024-10-01 18:54:18 +04:00
Brox Chen	2672037e36	[AMDGPU][True16][MC] Support VOP3 only instructions with true16 and fake16 (#109891 ) Update VOP3 only instructions with true16 and fake16 formats. This patch includes instructions: V_MUL_LO_U16 V_MAX_U16 V_MAX_I16 V_MIN_U16 V_MIN_I16 V_LSHLREV_B16 V_LSHRREV_B16 V_ASHRREV_I16	2024-10-01 09:25:36 -04:00
Fabian Ritter	3ba4092c06	[AMDGPU] Check vector sizes for physical register constraints in inline asm (#109955 ) For register constraints that require specific register ranges, the width of the range should match the type of the associated parameter/return value. With this PR, we error out when that is not the case. Previously, these cases would hit assertions or llvm_unreachables. The handling of register constraints that require only a single register remains more lenient to allow narrower non-vector types for the associated IR values. For example, constraining an i16 or i8 value to a 32-bit register is still allowed. Fixes #101190. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2024-10-01 10:29:35 +02:00


7854 Commits