llvm-project

Author	SHA1	Message	Date
Jay Foad	a156362e93	[AMDGPU] Fix machine verification failure after SIFoldOperandsImpl::tryFoldOMod (#113544 ) Fixes #54201	2024-10-29 14:59:37 +00:00
Matt Arsenault	88e23eb2cf	DAG: Fix legalization of vector addrspacecasts (#113964 )	2024-10-29 08:08:50 -05:00
Matt Arsenault	1ceccbb0dd	VirtRegRewriter: Add implicit register defs for live out undef lanes (#112679 ) If an undef subregister def is live into another block, we need to maintain a physreg def to track the liveness of those lanes. This would manifest a verifier error after branch folding, when the cloned tail block use no longer had a def. We need to detect interference with other assigned intervals to avoid clobbering the undef lanes defined in other intervals, since the undef def didn't count as interference. This is pretty ugly and adds a new dependency on LiveRegMatrix, keeping it live for one more pass. It also adds a lot of implicit operand spam (we really should have a better representation for this). There is a missing verifier check for this situation. Added an xfailed test that demonstrates this. We may also be able to revert the changes in 47d3cbcf842a036c20c3f1c74255cdfc213f41c2. It might be better to insert an IMPLICIT_DEF before the instruction rather than using the implicit-def operand. Fixes #98474	2024-10-28 17:33:53 -07:00
Fabian Ritter	a4fd3dba6e	[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332 ) When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs. This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length was smaller or slightly larger than what an unrolled iteration copies. The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks. Part of SWDEV-455845.	2024-10-28 09:04:19 +01:00
Gaëtan Bossu	a0c318938a	[CodeGen][NFC] Properly split MachineLICM and EarlyMachineLICM (#113573 ) Both are based on MachineLICMBase, and the functionality there is "switched" based on a PreRegAlloc flag. This commit is simply about trusting the original value of that flag, defined by the `MachineLICM` and `EarlyMachineLICM` classes. The `PreRegAlloc` flag used to be overwritten based on MRI.isSSA(), which is unreliable due to how it is inferred by the MIRParser. I see that we can now define isSSA in MIR (thanks @gargaroff ), meaning the fix isn’t really needed anymore, but redefining that flag still feels wrong. Note that I'm looking into upstreaming more changes to MachineLICM, see [the discourse thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).	2024-10-25 11:19:22 -07:00
Jun Wang	19b0453361	[AMDGPU][MC] Fix disassembler problem for image_atomic with TFE (#112622 ) For image_atomic instructions with TFE, in some cases (e.g., when dmask=3) the disassembler produces dst register with wrong size (e.g., image_atomic_smin v5, v1, s[8:15] dmask:0x3 tfe, instead of v[5:7]). This patch fixes the VDataDwords values for image atomic instructions.	2024-10-24 16:19:18 -07:00
Carl Ritson	076aac59ac	[AMDGPU] Add a new target for gfx1153 (#113138 )	2024-10-23 12:56:58 +09:00
Janek van Oirschot	a18826d75c	[AMDGPU] Create local KnownBits in case DenseMap gets invalidated (#111568 ) KnownBits retrieved from DenseMap may invalidate if insertion requires a (re)growth. Fixes https://github.com/llvm/llvm-project/issues/110930	2024-10-22 16:05:07 +01:00
Fabian Ritter	4c697f7037	[LowerMemIntrinsics] Use i8 GEPs in memcpy/memmove lowering (#112707 ) The IR lowering of memcpy/memmove intrinsics uses a target-specific type for its load/store operations. So far, the loaded and stored addresses are computed with GEPs based on this type. That is wrong if the allocation size of the type differs from its store size: The width of the accesses is determined by the store size, while the GEP stride is determined by the allocation size. If the allocation size is greater than the store size, some bytes are not copied/moved. This patch changes the GEPs to use i8 addressing, with offsets based on the type's store size. The correctness of the lowering therefore no longer depends on the type's allocation size. This is in support of PR #112332, which allows adjusting the memcpy loop lowering type through a command line argument in the AMDGPU backend.	2024-10-22 16:48:50 +02:00
Akshat Oke	ca32bd643b	[NewPM][AMDGPU] Port SIPreAllocateWWMRegs to NPM (#109939 )	2024-10-22 15:37:08 +05:30
Akshat Oke	f8cb526076	[AMDGPU] Add tests for SIPreAllocateWWMRegs (#109963 )	2024-10-22 15:33:46 +05:30
Fabian Ritter	69abfd3141	[AMDGPU] Allow casts between the Global and Constant Addr Spaces in isValidAddrSpaceCast (#112493 ) So far, isValidAddrSpaceCast only allows casts to the flat address space and between the constant(32) address spaces. It does not allow casting between the global and constant address spaces, even though they alias. That affects, e.g., the lowering of memmoves from the constant to the global address space in LowerMemIntrinsics, since that requires aliasing address spaces to be castable. This patch relaxes isValidAddrSpaceCast and allows such casts. It also includes a memmove test that would crash with the previous implementation because the memmove IR lowering would not be applicable for the move from constant AS to global AS.	2024-10-22 09:33:21 +02:00
Stanislav Mekhanoshin	3277c7cd28	[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765 ) MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's been identified this message is only really needed for VGPR limited kernels. A kernel becomes VGPR limited if a total number of VGPRs per SIMD / number of used VGPRs is more than a number of wave slots.	2024-10-21 09:39:52 -07:00
Christudasan Devadasan	3c5cea650d	[AMDGPU]: Add implicit-def to the BB prolog (#112872 ) IMPLICIT_DEF inserted for a wwm-register at the very first block or the predecessor block where it is used for sgpr spilling can appear at a block begin that requires spill-insertion during per-lane VGPR regalloc phase. The presence of the IMPLICIT_DEF currently breaks the BB prolog. Fixes: SWDEV-490717	2024-10-21 13:21:16 +05:30
Matt Arsenault	ef91cd3f01	AMDGPU: Handle folding frame indexes into add with immediate (#110738 )	2024-10-19 12:33:03 -07:00
Alex Rønne Petersen	ad4a582fd9	[llvm] Consistently respect `naked` fn attribute in `TargetFrameLowering::hasFP()` (#106014 ) Some targets (e.g. PPC and Hexagon) already did this. I think it's best to do this consistently so that frontend authors don't run into inconsistent results when they emit `naked` functions. For example, in Zig, we had to change our emit code to also set `frame-pointer=none` to get reliable results across targets. Note: I don't have commit access.	2024-10-18 09:35:42 +04:00
Shilei Tian	92663defb1	[NFC][AMDGPU] Auto-generate check lines for some test cases (#112426 ) - `llvm/test/CodeGen/AMDGPU/andorbitset.ll` - `llvm/test/CodeGen/AMDGPU/andorxorinvimm.ll` - `llvm/test/CodeGen/AMDGPU/fabs.f64.ll` - `llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.ll` - `llvm/test/CodeGen/AMDGPU/s_mulk_i32.ll`	2024-10-17 10:55:29 -04:00
Brox Chen	35e937b4de	[AMDGPU][True16][CodeGen] fp conversion in true/fake16 format (#101678 ) The true16 format of the fp conversion instructions V_CVT_F_F/V_CVT_F_U was previously implemented using the fake16 profile. With the MC support in place, correct and support these instructions in true16/fake16 format in CodeGen	2024-10-16 12:26:01 -04:00
Brox Chen	7b4c8b35d4	[AMDGPU][True16][MC] VOP3 profile in True16 format (#109031 ) Modify VOP3 profile and pseudo, and add encoding info for VOP3 True16 including DPP and DPP8 in true16 and fake16 format. This patch applies true16/fake16 changes and asm/dasm changes to V_ADD_NC_U16 V_ADD_NC_I16 V_SUB_NC_U16 V_SUB_NC_I16	2024-10-16 10:27:44 -04:00
Petar Avramovic	14d006c53c	AMDGPU/GlobalISel: Run redundant_and combine in RegBankCombiner (#112353 ) Combine is needed to clear redundant ANDs with 1 that will be created by reg-bank-select to clean-up high bits in register. Fix replaceRegWith from CombinerHelper: If copy had to be inserted, first create copy then delete MI. If MI is deleted first insert point is not valid.	2024-10-16 09:43:16 +02:00
Matt Arsenault	b0a25468fa	AMDGPU: Add baseline tests for flat-may-alias private atomic expansions (#109406 )	2024-10-15 22:29:24 +04:00
David Stenberg	97861981cc	[LiveDebugVariables] Fix a DBG_VALUE reordering issue (#111124 ) LDV could reorder reinserted fragment and non-fragment debug values for the same variable (compared to the input order), potentially resulting in stale values being presented. For example, before: DBG_VALUE 1001, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 0, 16) DBG_VALUE 1002, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 16, 16) DBG_VALUE %0, $noreg, !13, !DIExpression() After (without this patch): DBG_VALUE %stack.0, 0, !13, !DIExpression() DBG_VALUE 1002, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 16, 16) DBG_VALUE 1001, $noreg, !13, !DIExpression(DW_OP_LLVM_fragment, 0, 16) It would also reorder DBG_VALUEs for different variables. Although that does not matter for the debug information output, it resulted in some noise in before/after pass diffs. This should hopefully align so that instruction referencing and DBG_VALUE emit debug instructions in the same order (see the sdag-salvage-add.ll change).	2024-10-15 11:36:24 +02:00
Yingwei Zheng	8d8bb4032b	[Verifier] Verify attribute `denormal-fp-math[-f32]` (#112310 ) Some typos are also fixed. Address https://github.com/llvm/llvm-project/pull/112067#pullrequestreview-2363722447.	2024-10-15 17:32:16 +08:00
David Green	04546a0dd6	[GlobalISel] Support vector G_UNMERGE_VALUES in computeKnownBits. (#112172 ) This adds computeKnownBits support for vector->vector G_UNMERGE_VALUES, grabbing the known bits with an adjusted DemandedElts mask.	2024-10-15 08:23:05 +01:00
Carl Ritson	784230b850	[AMDGPU] Tidy SIPreAllocateWWMRegs after recent changes (NFCI) (#111967 ) - V_SET_INACTIVE is always in WWM/WQM so can be treated like any other operation in WWM/WQM. - After encountering SI_SPILL_S32_TO_VGPR loop should bypass to avoid double processing its defs.	2024-10-15 11:48:22 +09:00
Shilei Tian	a74659445d	[AMDGPU] Skip terminators when forcing emit zero flag (#112116 ) When forcing emit zero, we need to skip terminators of a MBB; otherwise the terminator list of the MBB would be broken.	2024-10-14 11:46:18 -04:00
Akshat Oke	dbfca24b99	[MIR] Serialize virtual register flags (#110228 ) [MIR] Serialize virtual register flags This introduces target-specific vreg flag serialization. Flags are represented as `uint8_t` and the `TargetRegisterInfo` override provides methods `getVRegFlagValue` to deserialize and `getVRegFlagsOfReg` to serialize.	2024-10-14 14:19:53 +05:30
Janek van Oirschot	50866e84d1	Revert "[AMDGPU] Avoid resource propagation for recursion through multiple functions" (#112013 ) Reverts llvm/llvm-project#111004	2024-10-11 17:10:28 +01:00
Juan Manuel Martinez Caamaño	2d5f3b0a61	[AMDGPU][SIPreEmitPeephole] mustRetainExeczBranch: use BranchProbability and TargetSchedmodel (#109818 ) Remove s_cbranch_execnz branches if the transformation is profitable according to `BranchProbability` and `TargetSchedmodel`.	2024-10-11 17:45:59 +02:00
Janek van Oirschot	67160c5ab5	[AMDGPU] Avoid resource propagation for recursion through multiple functions (#111004 ) Avoid constructing recursive MCExpr definitions when multiple functions cause a recursion. Fixes #110863	2024-10-11 16:42:50 +01:00
Matt Arsenault	14705a912f	CodeGen: Remove redundant REQUIRES registered-target from tests (#111982 ) These are already in target specific test directories.	2024-10-11 16:16:12 +04:00
Petar Avramovic	7b0d56be1d	AMDGPU/GlobalISel: Fix inst-selection of ballot (#109986 ) Both input and output of ballot are lane-masks: result is lane-mask with 'S32/S64 LLT and SGPR bank' input is lane-mask with 'S1 LLT and VCC reg bank'. Ballot copies bits from input lane-mask for all active lanes and puts 0 for inactive lanes. GlobalISel did not set 0 in result for inactive lanes for non-constant input.	2024-10-11 11:40:27 +02:00
Fabian Ritter	173c68239d	[AMDGPU] Enable unaligned scratch accesses (#110219 ) This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics. Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code assumed that the latter implies the former at some places. Part of SWDEV-455845.	2024-10-11 08:50:49 +02:00
Jay Foad	62b3a4bc70	[AMDGPU] Improve codegen for s_barrier_init (#111866 )	2024-10-10 19:40:02 +01:00
Mikhail Goncharov	8a849a2a56	Revert "Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" v2 (#111708 )" This reverts commit 4b4a0d419c81b8b12a7dbb33dae1f7e9be91a88f. New test fails on buildbots https://lab.llvm.org/buildbot/#/builders/63/builds/2039 https://lab.llvm.org/buildbot/#/builders/127/builds/1055	2024-10-10 13:37:44 +02:00
Matt Arsenault	c36f902372	AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (#111720 ) Fixes missing m0 initialize for pre-gfx9 targets with local extending loads.	2024-10-10 14:01:49 +04:00
Shilei Tian	5a74a4a667	[Attributor] Take the address space from addrspacecast directly (#108258 ) Currently `AAAddressSpace` relies on identifying the address spaces of all underlying objects. However, it might infer a sub-optimal address space when the underlying object is a function argument. In `AMDGPUPromoteKernelArgumentsPass`, the promotion of a pointer kernel argument is done by adding a series of `addrspacecast` instructions (as shown below), and hoping `InferAddressSpacePass` can pick it up and do the rewriting accordingly. Before promotion: ``` define amdgpu_kernel void @kernel(ptr %to_be_promoted) { %val = load i32, ptr %to_be_promoted ... ret void } ``` After promotion: ``` define amdgpu_kernel void @kernel(ptr %to_be_promoted) { %ptr.cast.0 = addrspacecast ptr %to_be_promoted to ptr addrspace(1) %ptr.cast.1 = addrspacecast ptr addrspace(1) %ptr.cast.0 to ptr # all the use of %to_be_promoted will use %ptr.cast.1 %val = load i32, ptr %ptr.cast.1 ... ret void } ``` When `AAAddressSpace` analyzes the code after promotion, it will take `%to_be_promoted` as the underlying object of `%ptr.cast.1`, and use its address space (which is 0) as its final address space, thus simply doing nothing in `manifest`. The attributor framework will then eliminate the address space cast from 0 to 1 and back to 0, and replace `%ptr.cast.1` with `%to_be_promoted`, which basically reverts all changes by `AMDGPUPromoteKernelArgumentsPass`. IMHO I'm not sure if `AMDGPUPromoteKernelArgumentsPass` promotes the argument in a proper way. To improve the handling of this case, this PR adds extra handling when iterating over all underlying objects. If an underlying object is a function argument, it means it reaches a terminal such that we can't further deduce its underlying object. In this case, we check all uses of the argument. If they are all `addrspacecast` instructions and their destination address spaces are the same, we take the destination address space. Fixes: SWDEV-482640.	2024-10-09 22:51:07 -04:00
Jeffrey Byrnes	a31d0b2e2b	[AMDGPU] Remove some lit check lines Change-Id: I77e72d23d41095b8fcc47996d8004f9e264968de	2024-10-09 16:12:31 -07:00
Krzysztof Drewniak	4b4a0d419c	Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714 )" v2 (#111708 ) This adds `-disable-gisel-legality-check` to some gfx6 and gfx7 test lines to prevent behavior mismatches between debug and release builds The first attempted reapply was #111059 This reverts commit e075dcf7d270fd52dc837163ff24e8c872dfeb49.	2024-10-09 17:11:41 -05:00
Matt Arsenault	a075e785b8	AMDGPU: Fix incorrectly selecting fp8/bf8 conversion intrinsics (#107291 ) Trying to codegen these on targets without the instructions should fail to select. Not sure if all the predicates are correct. We had a fake one disconnected to a feature which was always true. Fixes: SWDEV-482274	2024-10-09 21:38:47 +04:00
Jeffrey Byrnes	17bc959961	[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090 ) This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings. This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.	2024-10-09 09:54:11 -07:00
Matt Arsenault	e85fcb7631	AMDGPU: Add instruction flags when lowering ctor/dtor (#111652 ) These should be well behaved address computations.	2024-10-09 18:03:35 +04:00
Matt Arsenault	1e357cde48	AMDGPU: Use pointer types more consistently (#111651 ) This was using addrspace 0 and 1 pointers interchangeably. This works out since they happen to use the same size, but consistently query or use the correct one.	2024-10-09 17:23:50 +04:00
Matt Arsenault	890e481358	AMDGPU: Regenerate test checks	2024-10-09 16:39:15 +04:00
Matt Arsenault	671cbcf642	AMDGPU: Add baseline tests for gep flag handling (#110814 ) We need to know the address computation won't overflow on older subtargets to match the addressing mode of stack instructions.	2024-10-09 13:48:01 +04:00
Matt Arsenault	c198f775cd	AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642 ) These have been replaced with atomicrmw	2024-10-09 09:27:28 +04:00
Shilei Tian	88a239d292	[AMDGPU] Adopt new lowering sequence for `fdiv16` (#109295 ) The current lowering of `fdiv16` can generate incorrectly rounded result in some cases. The new sequence was provided by the HW team, as shown below written in C++. ``` half fdiv(half a, half b) { float a32 = float(a); float b32 = float(b); float r32 = 1.0f / b32; float q32 = a32 * r32; float e32 = -b32 * q32 + a32; q32 = e32 * r32 + q32; e32 = -b32 * q32 + a32; float tmp = e32 * r32; uint32_t tmp32 = std::bit_cast<uint32_t>(tmp); tmp32 = tmp32 & 0xff800000; tmp = std::bit_cast<float>(tmp32); q32 = tmp + q32; half q16 = half(q32); q16 = div_fixup_f16(q16); return q16; } ``` Fixes SWDEV-477608.	2024-10-08 09:49:20 -04:00
Shilei Tian	48ac846fbc	[AMDGPU][GlobalISel] Align `selectVOP3PMadMixModsImpl` with the `SelectionDAG` counterpart (#110168 ) The current `selectVOP3PMadMixModsImpl` can produce `V_MAD_FIX_F32` instruction that violates constant bus restriction, while its `SelectionDAG` counterpart doesn't. The culprit is in the copy stripping while the `SelectionDAG` version only has a bitcast stripping. This PR simply aligns the two versions.	2024-10-08 09:41:24 -04:00
Christudasan Devadasan	6636f32615	[AMDGPU] Include WWM register spill into BB Prolog (#111496 ) With #93526 we split the regalloc pipeline further to have a standalone allocation for wwm registers and per-lane VGPRs. Currently the presence of the wwm-spill reloads inserted at the bb-top limits the isBasicPrologue function during the per-lane vgpr regalloc to skip past the exec manipulation instruction and ended up causing incorrect codegen. The wwm-spill inserted during the wwm-regalloc pipeline should also be included in the bb-prolog so that the per-lane vgpr regalloc pipeline can identify the appropriate insertion points for their spills and copies.	2024-10-08 15:13:12 +05:30
Matt Arsenault	5f94b0cbdd	AMDGPU: Try to reuse dest reg for s_add_i32 frame indexes (#111201 ) Hack around the register scavenger doing the wrong thing. It does not find the result register as available in the case the frame index add isn't also reading the dest register. This is the quick fix for a regression where the scavenge would create a broken spill of SGPR to memory. I believe this is still broken for cases we cannot use the result register. I'm confused about what position the scavenger iterator is supposed to be in, and what RestoreAfter is for. The scavenger is missing a full set of forward/backward APIs and there seems to be an off by one somewhere.	2024-10-07 18:01:24 +04:00
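
Note on 4c697f7037 ([LowerMemIntrinsics] Use i8 GEPs in memcpy/memmove lowering): the following is a minimal C++ sketch, not code from the patch, of why a copy loop skips bytes when each access is as wide as the type's store size but the address advances by the larger allocation size. The helper names `loopCopy` and `countUncopied`, and the example sizes (6-byte store size, 8-byte allocation size), are assumptions chosen for illustration. The loop bound is simplified compared with the real lowering.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model (hypothetical, not LLVM code): copy `src` in chunks of
// `accessBytes`, advancing the address by `strideBytes` per iteration.
// `strideBytes` must be greater than zero.
std::vector<std::uint8_t> loopCopy(const std::vector<std::uint8_t> &src,
                                   std::size_t accessBytes,
                                   std::size_t strideBytes) {
  std::vector<std::uint8_t> dst(src.size(), 0);
  for (std::size_t off = 0; off + accessBytes <= src.size();
       off += strideBytes)
    std::memcpy(dst.data() + off, src.data() + off, accessBytes);
  return dst;
}

// Count the bytes of `dst` that differ from `src`, i.e. bytes the loop
// never copied (assuming `src` contains no zero bytes).
std::size_t countUncopied(const std::vector<std::uint8_t> &src,
                          const std::vector<std::uint8_t> &dst) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < src.size(); ++i)
    n += (src[i] != dst[i]) ? 1 : 0;
  return n;
}
```

For a 24-byte source filled with a nonzero value, `loopCopy(src, 6, 8)` leaves 6 bytes uncopied (the two-byte gap after each access), while `loopCopy(src, 6, 6)` copies every byte. The second call corresponds to the byte-addressed offsets based on the store size that the commit message describes.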

7879 Commits