llvm-project

Author	SHA1	Message	Date
Vikram Hegde	225fc4f356	[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218 ) The intention is to use a "copy" instead of a "sub" to handle the high parts of 64-bit multiply for this specific case. This unlocks copy prop use cases where the copy can be reused by later multiply+add sequences if possible. Fixes: SWDEV-487672, SWDEV-487669	2025-01-17 11:09:39 +05:30
Matt Arsenault	7475f0a345	DAG: Avoid forming shufflevector from a single extract_vector_elt (#122672 ) This avoids regressions in a future AMDGPU commit. Previously we would have a build_vector (extract_vector_elt x), undef with free access to the elements bloated into a shuffle of one element + undef, which has much worse combine support than the extract. Alternatively could check aggressivelyPreferBuildVectorSources, but I'm not sure it's really different than isExtractVecEltCheap.	2025-01-17 08:44:43 +07:00
Matt Arsenault	ca95519704	AMDGPU: Implement isExtractVecEltCheap (#122460 ) Once again we have excessive TLI hooks with bad defaults. Permit this for 32-bit element vectors, which are just use-different-register. We should permit 16-bit vectors as cheap with legal packed instructions, but I see some mixed improvements and regressions that need investigation.	2025-01-17 08:38:01 +07:00
Matt Arsenault	4431106630	DAG: Fix vector bin op scalarize defining a partially undef vector (#122459 ) This avoids some of the pending regressions after AMDGPU implements isExtractVecEltCheap. In a case like shl <value, undef>, splat k, because the second operand was fully defined, we would fall through and use the splat value for the first operand, losing the undef high bits. This would result in an additional instruction to handle the high bits. Add some reduced testcases for different opcodes for one of the regressions.	2025-01-17 08:34:03 +07:00
Brox Chen	8a0c2e7567	[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736 ) Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and fake16 flow. Since we are replacing `v_cndmask_b16` with `v_cndmask_b16_t16/fake16`, we have to at least update the fake16 CodeGen to get the CodeGen tests passing. For this case, we have to update true16 and fake16 together, otherwise some of the true16 tests will fail.	2025-01-16 17:18:28 -05:00
Christudasan Devadasan	1797fb6b23	[AMDGPU][NewPM] Port SILowerControlFlow pass into NPM. (#123045 )	2025-01-16 11:06:38 +05:30
jofrn	c8bbbaa5c7	[SelectionDAG][AMDGPU] Negative offset when selecting scratch sv offsets (#122251 ) APInt will fail when given a negative offset. SelectScratchSVAddr utilizes this function and can be given a negative offset as well, so this change modifies it to use APSInt instead.	2025-01-15 06:56:28 -05:00
Mariusz Sikora	b3924cb9ec	[AMDGPU] Set Convergent property for image.(getlod/sample*) intrinsics which use WQM (#122908 ) This change adds the IntrConvergent property to the image.getlod intrinsic and to several image.sample intrinsics. All image.sample intrinsics apart from LOD(_L), Level 0(_LZ), Derivative(_D) will be marked as Convergent.	2025-01-15 10:23:28 +01:00
Shoreshen	b665dddd70	[AMDGPU] Add tests for v_sat_pk_u8_i16 codegen (#122438 ) Preparation for #121124. This PR pre-commits the tests from [PR](https://github.com/llvm/llvm-project/pull/121124), which adds selection patterns for the instruction `v_sat_pk`, so that the test changes before and after that commit are visible. Pre-commit tests PR for #121124 : Add selection patterns for instruction `v_sat_pk`	2025-01-14 19:26:46 -05:00
Brox Chen	f1b1c7f3c1	[AMDGPU][True16][CodeGen] Undo sub(x,c) to add in true16 flow (#118854 ) Undo the sub x, c -> add x, -c canonicalization in true16 flow. This duplicates the pattern from fake16 and implements the same pattern in true16 format.	2025-01-14 10:57:33 -05:00
Brox Chen	5e26ff35c1	[AMDGPU][True16][MC] true16 for v_cmp_lt_f16 (#122499 ) True16 format for v_cmp_lt_f16. Update VOPC t16 and fake16 pseudo.	2025-01-14 10:03:36 -05:00
Acim Maravic	cc3aab580b	[AMDGPU] Handle nontemporal and amdgpu.last.use metadata in amdgpu-lower-buffer-fat-pointers (#120139 )	2025-01-14 11:22:20 +01:00
Piotr Sobczak	40fa7f5e8b	[AMDGPU] Fix computed kill mask (#122736 ) Replace S_XOR with S_ANDN2 when computing the kill mask in demote/kill lowering. This has the effect of AND'ing the demote/kill condition with exec, which is needed for a proper live mask update. The S_XOR is inadequate because it may return true for a lane with exec=0. This patch fixes an image corruption in a game. I think the issue went unnoticed because the demote/kill condition is often naturally dependent on exec, so AND'ing with exec is usually not required.	2025-01-14 10:00:40 +01:00
Brox Chen	0f3aeca16f	[AMDGPU][True16][CodeGen] Update and/or/xor codegen pattern for i16 (#121835 ) In true16 flow, remove and/or/xor 32bit patterns for i16	2025-01-13 16:48:00 -05:00
Brox Chen	26e13091ea	[AMDGPU][True16][CodeGen] true16 codegen pattern for v_pack_b32_f16 (#121988 ) true16 codegen pattern for v_pack_b32_f16	2025-01-13 12:26:36 -05:00
Matt Arsenault	f4598194b5	DAG: Fold bitcast of scalar_to_vector to anyext (#122660 ) scalar_to_vector is difficult to make appear and test, but I found one case where this makes an observable difference. It fires more often than this in the test suite, but most of them have no net result in the final code. This helps reduce regressions in a future commit.	2025-01-13 19:38:58 +07:00
Matt Arsenault	e9a55770dc	AMDGPU: Add gfx9 run line to scalar_to_vector test (#122659 )	2025-01-13 19:35:56 +07:00
Akshat Oke	73b0e8a191	[AMDGPU][NewPM] Port AMDGPUOpenCLEnqueuedBlockLowering to NPM (#122434 )	2025-01-13 17:52:30 +05:30
Akshat Oke	7bf1cb702b	[AMDGPU][NewPM] Port AMDGPURemoveIncompatibleFunctions to NPM (#122261 )	2025-01-13 10:11:40 +05:30
Shilei Tian	f15da5fb78	[AMDGPU] Fix an invalid cast in `AMDGPULateCodeGenPrepare::visitLoadInst` (#122494 ) Fixes: SWDEV-507695	2025-01-12 23:40:25 -05:00
Austin Kerbow	657fb4433e	[AMDGPU] Add target hook to isGlobalMemoryObject (#112781 ) We want special handling for IGLP instructions in the scheduler but they should still be treated like they have side effects by other passes. Add a target hook to the ScheduleDAGInstrs DAG builder so that we have more control over this.	2025-01-11 09:57:57 -08:00
Austin Kerbow	2e5c298281	[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167 ) Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function. 1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes. 2. The backwards compatible entry point which starts at offset 0.	2025-01-10 11:39:02 -08:00
Matt Arsenault	7ebf0df409	AMDGPU: Test gfx940 mfma intrinsics on gfx950 This requires splitting the xf32 cases into a separate file	2025-01-10 23:16:25 +07:00
Mirko Brkušanin	3def49cb64	[AMDGPU] Remove s_wakeup_barrier instruction (#122277 )	2025-01-10 11:30:22 +01:00
Nikita Popov	eeac0ffaf4	Revert "[MachineLICM] Use `RegisterClassInfo::getRegPressureSetLimit` (#119826 )" This reverts commit b4e17d4a314ed87ff6b40b4b05397d4b25b6636a. This causes a large compile-time regression.	2025-01-10 09:05:06 +01:00
Jakub Chlanda	01a7d4e26b	[AMDGPU] Allow selection of BITOP3 for some 2 opcodes and B32 cases (#122267 ) This came up in downstream static analysis as dead code. Admittedly, it depends on what the intention was when checking for [`if (NumOpcodes == 2 && IsB32)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3792C3-L3792C32) and I took a guess that for certain cases the selection should take place. If that's incorrect, that whole if statement can be removed, as it comes after a check for: [`if (NumOpcodes < 4)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3788)	2025-01-10 07:49:11 +01:00
Chinmay Deshpande	211bcf67aa	[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (#121465 )	2025-01-10 09:05:41 +05:30
Brox Chen	222ff18608	[AMDGPU][True16][CodeGen] Update codegen pattern for v_med3_f16 (#121992 ) true16 codegen pattern for v_med3_f16	2025-01-09 13:40:13 -05:00
Matt Arsenault	d2b78c646b	AMDGPU: Custom lower bf16 shuffles (#122252 ) We already custom lower the other 16-bit element type shuffles.	2025-01-09 21:37:27 +07:00
Pengcheng Wang	b4e17d4a31	[MachineLICM] Use `RegisterClassInfo::getRegPressureSetLimit` (#119826 ) `RegisterClassInfo::getRegPressureSetLimit` is a wrapper of `TargetRegisterInfo::getRegPressureSetLimit` with some logic to adjust the limit by removing reserved registers. It seems that we shouldn't use `TargetRegisterInfo::getRegPressureSetLimit` directly, just as the comment "This limit must be adjusted dynamically for reserved registers" says. Separate from https://github.com/llvm/llvm-project/pull/118787	2025-01-09 21:05:52 +08:00
Chinmay Deshpande	659cd2a48a	[NFC][AMDGPU] Pre-commit tests for IR variant - isFMAFasterThanFMulAdd (#121925 )	2025-01-09 15:51:37 +05:30
Matt Arsenault	09583dec15	AMDGPU: Reduce 64-bit add width if low bits are known 0 (#122049 ) If one of the inputs has all 0 bits, the low part cannot carry and we can just pass through the original value. Add case: https://alive2.llvm.org/ce/z/TNc7hf Sub case: https://alive2.llvm.org/ce/z/AjH2-J We could do this in the general case with computeKnownBits, but add is so common this could be potentially expensive for something which will fire infrequently. One potential concern is this could break the 64-bit add we expect to see for addressing mode matching, but these constants shouldn't appear often in addressing expressions. One test for large offset expressions changes but isn't worse. Fixes https://github.com/ROCm/llvm-project/issues/237	2025-01-08 22:33:54 +07:00
Matt Arsenault	637641840d	AMDGPU: Add baseline test for add64 with constant test (#122048 ) Add baseline test for 64-bit adds when the low half of an operand is known 0.	2025-01-08 22:30:04 +07:00
Changpeng Fang	68694259b2	AMDGPU: Use getSignedTargetConstant for ImmOffset in SelectScratchSVAddr (#121978 ) ImmOffset is signed and we will hit an assert with negative ImmOffset when getTargetConstant is used. Fixes: SWDEV-506453	2025-01-07 12:02:18 -08:00
Brox Chen	49357b22db	[AMDGPU][True16][CodeGen] true16 codegen pattern for v_med3_u/i16 (#121850 ) True16 codegen pattern for v_med3_u/i16	2025-01-07 13:18:28 -05:00
Matt Arsenault	7899572c88	AMDGPU: Forcibly disable verifier in test The test added in f6365a47a1ad9ab6d432f6e40d14a11419e21282 fails the verifier for the reason noted in the comment, but we need to skip the verifier error in EXPENSIVE_CHECKS builds	2025-01-07 22:46:46 +07:00
Brox Chen	d0812dbbff	[AMDGPU][True16][MC] true16 for v_minmax/maxmin_f16 and v_minmax/maxmin_num_f16 (#120617 ) True16 support for v_minmax/maxmin_f16 (GFX11) and v_minmax/maxmin_num_f16 (GFX12). These insts are updated at the same time since we are replacing `v_minmax/maxmin_f16` with `v_minmax/maxmin_fake16_f16`, while `v_minmax/maxmin_num_f16` are alias insts and share the same CodeGen pattern. Added a GFX12 runline in minmax.ll in fake16 flow.	2025-01-07 10:27:54 -05:00
bcahoon	17c8c1c509	[AMDGPU] Do not fold into v_accvgpr_mov/write/read (#120475 ) In SIFoldOperands, leave copies for moving between agpr and vgpr registers. The register coalescer is able to handle the copies more efficiently than v_accvgpr_mov, v_accvgpr_write, and v_accvgpr_read. Otherwise, the compiler generates unnecessary instructions such as v_accvgpr_mov a0, a0.	2025-01-07 09:25:01 -06:00
choikwa	8d2e611802	[AMDGPU] Calculate getDivNumBits' AtLeast using bitwidth (#121758 ) Previously in shrinkDivRem64, it used fixed value 32 for AtLeast which meant that <64bit divisions would be rejected from shrinking since logic depended only on number of sign bits. I.e. 'idiv i48 %0, %1' would return 24 for number of sign bits if %0,%1 both had 24 division bits, and was rejected.	2025-01-07 01:31:09 -05:00
Matt Arsenault	8c0483bba2	RegisterCoalescer: Fix assert on remat to copy-to-physreg with subregs (#121734 ) Do not try to rematerialize a super-register def used by a subregister extract copy into a copy to a physical register if the other pieces of the full physreg are live at the rematerialization point. It would insert the super-register def at the rematerialization point, and assert since the other half of the register was already live. This is analogous to the undef subregister def handling above, which handled the virtual register case. Fixes #120970	2025-01-07 12:22:23 +07:00
Matt Arsenault	93e63460a2	RegAllocGreedy: Un-disable test in expensive_checks builds This reverts a8f3ebaf11c3745e5123054776eb71755d16f2f9. You need to use -verify-regalloc to get a MachineVerifier run with LiveIntervals, otherwise cases not covered by the basic liveness implementation in the verifier are passed through (which covers most use of undefined subrange errors).	2025-01-07 12:21:21 +07:00
Matt Arsenault	a8f3ebaf11	AMDGPU: Mark test as XFAIL in expensive_checks builds One of the tests added in 93220e7e06473a11bf48fee26bcea16cc527e5dc fails the machine verifier after allocation, but this is a separate issue.	2025-01-07 08:47:59 +07:00
Matt Arsenault	f6365a47a1	AMDGPU: Fix assert on physreg MUBUF rsrc operand (#120815 ) The stack case uses a physical register and should not ordinarily reach here, but strange things happen at -O0. The testcase still errors because we do not yet attempt to handle arbitrary dynamic sized allocas. Fixes: SWDEV-503538	2025-01-07 08:11:05 +07:00
Brox Chen	ce831a231a	[AMDGPU][True16][MC] true16 for v_fma_f16 (#119477 ) Support true16 format for v_fma_f16 in MC. Since we are replacing v_fma_f16 with v_fma_f16_t16/v_fma_f16_fake16 post-GFX11, we have to update the CodeGen pattern for v_fma_f16_fake16 to get the CodeGen tests passing. No pattern is modified or created; v_fma_f16 is just replaced with the fake16 format.	2025-01-06 15:02:04 -05:00
Emma Pilkington	dc0e258fe4	[AMDGPU] Remove Dwarf encodings for subregisters (#117891 ) Previously, registers and subregisters mapped to the same Dwarf encoding. We don't really have any way to refer to subregisters directly from Dwarf, the expression emitter should instead use DW_OPs to stencil out the subregister from the whole register. This was also confusing tools that need to map back to the llvm reg (e.g. dwarfdump), since getLLVMRegNum() would arbitrarily return the _LO16 register.	2025-01-06 14:51:16 -05:00
Matt Arsenault	93220e7e06	RegAllocGreedy: Fix use after free during last chance recoloring (#120697 ) Last chance recoloring can delete the current fixed interval during recursive assignment of interfering live intervals. Check if the virtual register value was assigned before attempting the unassignment, as is done in other scenarios. This relies on the fact that we do not recycle virtual register numbers. I have only seen this occur in error situations where the allocation will fail, but I think this can theoretically happen in working allocations. This feels very brute force, but I've spent over a week debugging this and this is what works without any lit regressions. The surprising piece to me was that unspillable live ranges may be spilled, and a number of tests rely on optimizations occurring on them. My other attempts to fix this mostly revolved around not identifying unspillable live ranges as snippet copies. I've also discovered we're making some unproductive live range splits with subranges. If we avoid such splits, some of the unspillable copies disappear, but mandating that be precise to fix a use after free doesn't sound right.	2025-01-06 23:12:55 +07:00
Phoebe Wang	1547382033	[X86] Support lowering of FMINIMUMNUM/FMAXIMUMNUM (#121464 )	2025-01-06 21:28:58 +08:00
Vikash Gupta	fd6f8b3ce3	[AMDGPU] [GlobalIsel] Combine Fmul with Select into ldexp instruction. (#120104 ) This combine pattern performs the transformations below. fmul x, select(y, A, B) -> fldexp (x, select i32 (y, a, b)) fmul x, select(y, -A, -B) -> fldexp ((fneg x), select i32 (y, a, b)) where A=2^a and B=2^b; a and b are integers. It is a follow-up PR to implement the above combine for GlobalISel, as the corresponding DAG combine has already been done for SelectionDAG ISel (#111109)	2025-01-06 17:42:38 +05:30
Aaditya	0bd1c87996	[AMDGPU] Support divergent sized dynamic alloca (#121148 ) Currently, AMDGPU backend can handle uniform-sized dynamic allocas. This patch extends support for divergent-sized dynamic allocas. When the size argument of a dynamic alloca is divergent, a wave-wide reduction is performed to get the required stack space. `@llvm.amdgcn.wave.reduce.umax` is used to perform the wave reduction. Dynamic allocas are not completely supported yet, as the stack is not properly restored on function exit. This patch doesn't attempt to address the aforementioned issue. Note: Compiler already Zero-Extends or Truncates all other types(of alloca size arg) to i32.	2025-01-06 12:28:24 +07:00
Matt Arsenault	d34f7ead88	DAG: Fix assuming f16 is the only 16-bit fp type in concat vector combine (#121637 ) This would see if there are mixed integer and FP types and pick an equivalently sized FP type to use as the vector element type, and only cast if there were mixed integers. We need to insert a cast if the types are mixed, which may include different FP types. Fixes #121601	2025-01-06 10:38:54 +07:00
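
Commit 09583dec15 above reasons that a 64-bit add cannot carry out of its low half when one operand's low 32 bits are known to be zero, so the low half of the other operand passes through unchanged and only the high halves need adding. The following is a minimal Python sketch of that arithmetic only; the function names and the explicit 32/32 split are illustrative assumptions and do not reflect how the backend implements the combine.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_full(a, b):
    """Reference: ordinary 64-bit wrapping add."""
    return (a + b) & MASK64


def add64_low_zero(a, b):
    """Hypothetical narrowed add, valid only when b's low 32 bits are 0.

    The low half of the result is a's low half (adding 0 cannot carry),
    and the high half is a 32-bit wrapping add of the two high halves.
    """
    if b & MASK32 != 0:
        raise ValueError("low 32 bits of b must be zero")
    lo = a & MASK32
    hi = ((a >> 32) + (b >> 32)) & MASK32
    return (hi << 32) | lo


# Compare the narrowed form against the full add on a few sample values.
samples = [
    (0x0000000000000000, 0x0000000100000000),
    (0x00000000FFFFFFFF, 0x0000000100000000),
    (0xFFFFFFFFFFFFFFFF, 0x0000000100000000),
    (0x123456789ABCDEF0, 0xFEDCBA9800000000),
]
for a, b in samples:
    assert add64_low_zero(a, b) == add64_full(a, b)
```

The same reasoning would apply to the sub case mentioned in the commit message, with the high halves subtracted instead of added when the subtrahend's low half is zero.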

8163 Commits