llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	ddbdc9a86e	[AMDGPU] Add baseline test to show spilling of wmma scale. NFC (#168163) This is to show the spilling of WMMA scale values, which are limited to the low 256 VGPRs. We have free registers, but RA allocates the low 256 first.	2025-11-19 10:24:06 -08:00
LU-JOHN	b79a665f71	[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE. (#168546) Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-19 09:02:03 -06:00
Fabian Ritter	125af56867	[AMDGPU][SDAG] Only fold flat offsets if they are inbounds PTRADDs (#165427) For flat memory instructions where the address is supplied as a base address register with an immediate offset, the memory aperture test ignores the immediate offset. Currently, SDISel does not respect that, which leads to miscompilations where valid input programs crash when the address computation relies on the immediate offset to get the base address in the proper memory aperture. Global or scratch instructions are not affected. This patch only selects flat instructions with immediate offsets from PTRADD address computations with the inbounds flag: If the PTRADD does not leave the bounds of the allocated object, it cannot leave the bounds of the memory aperture and is therefore safe to handle with an immediate offset. Affected tests: - CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them. - CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen. - CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds addresses for accesses to be based on potentially OOB addresses used for prefetching. - I think the remaining ones suffer from the limited preservation of the inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out in PR #165424 and the fact that `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by repeatedly splitting them in half. Legalizing a V32S32 memory access therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a single ptradd, but the involved transformations generally cannot preserve the inbounds flag (even though it would be valid in this case).
Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353 Analogous PR for GISel (merged): #153001 Fixes SWDEV-516125.	2025-11-19 11:48:12 +01:00
Carl Ritson	711a295479	[AMDGPU] Ignore wavefront barrier latency during scheduling DAG mutation (#168500 ) Do not add latency for wavefront and singlethread scope fences during barrier latency DAG mutation. These scopes do not typically introduce any latency and adjusting schedules based on them significantly impacts latency hiding.	2025-11-19 17:49:14 +09:00
Anshil Gandhi	5ee95f48b8	[AMDGPU][GlobalISel] Add regbankselect rules for G_FSHR (#159818 )	2025-11-18 22:57:59 -05:00
Shoreshen	52a58a4193	[AMDGPU] Adding instruction specific features (#167809 )	2025-11-19 11:06:00 +08:00
Shilei Tian	6665642ce4	[AMDGPU] Don't fold an i64 immediate value if it can't be replicated from its lower 32-bit (#168458 ) On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal) and replicates the bits to both channels. In this case, we should not fold an immediate value if it can't be replicated from its lower 32-bit. Fixes SWDEV-567139.	2025-11-18 17:11:10 -05:00
Robert Imschweiler	0b82415c59	[AMDGPU] Consider FLAT instructions for VMEM hazard detection (#137170 ) In general, "Flat instructions look at the per-workitem address and determine for each work item if the target memory address is in global, private or scratch memory." (RDNA2 ISA) That means that FLAT instructions need to be considered for VMEM hazards even without "specific segment". Also, LDS DMA should be considered for LDS hazard detection. See also #137148	2025-11-18 18:41:04 +01:00
vangthao95	cb5812982d	[AMDGPU][GlobalISel] Add RegBankLegalize support for G_IS_FPCLASS (#167575 )	2025-11-18 09:00:57 -08:00
Changpeng Fang	5f38ae4a77	[AMDGPU] update LDS block size for gfx1250 (#167614 ) LDS block size should be 2048 bytes (512 dwords) based on current spec.	2025-11-17 16:03:47 -08:00
vangthao95	4dd2796070	[AMDGPU][GlobalISel] Add RegBankLegalize support for G_FMUL (#167847 )	2025-11-17 08:45:49 -08:00
Ryan Cowan	d65be16ab6	[AArch64][GlobalISel] Add combine for build_vector(unmerge, unmerge, undef, undef) (#165539 ) This PR adds a new combine to the `post-legalizer-combiner` pass. The new combine checks for vectors being unmerged and subsequently padded with `G_IMPLICIT_DEF` values by building a new vector. If such a case is found, the vector being unmerged is instead just concatenated with a `G_IMPLICIT_DEF` that is as wide as the vector being unmerged. This removes unnecessary `mov` instructions in a few places.	2025-11-17 15:55:40 +00:00
Fabian Ritter	550522d07e	[AMDGPU][NFC] Mark GEPs in flat offset folding tests as inbounds (#165426 ) This is in preparation for a patch that will only fold offsets into flat instructions if their addition is inbounds. Marking the GEPs inbounds here means that their output won't change with the later patch. Basically a retry of the very similar PR #131994, as part of an updated stack of PRs. For SWDEV-516125.	2025-11-17 12:17:26 +01:00
David Green	22968f5b4a	[DAG] Add strictfp implicit def reg after metadata. (#168282 ) This prevents a machine verifier error, where it "Expected implicit register after groups". Fixes #158661	2025-11-17 10:57:21 +00:00
Chaitanya	49d5bb0ad0	[AMDGPU] Add amdgpu-lower-exec-sync pass to lower named-barrier globals (#165692) This PR introduces the `amdgpu-lower-exec-sync` pass, which specifically lowers named-barrier LDS globals introduced by #114550. Changes include: - Moving the logic of lowering named-barrier LDS globals from the `amdgpu-lower-module-lds` pass to this new pass. - Adding the pass to the pipeline and removing the existing lowering logic for named-barrier LDS in `amdgpu-lower-module-lds`. See #161827 for discussion on this topic.	2025-11-17 10:08:40 +05:30
Sergei Barannikov	900c517919	[AMDGPU] TableGen-erate SDNode descriptions (#168248) This allows SDNodes to be validated against their expected type profiles and reduces the number of changes required to add a new node. Autogenerated node names start with "AMDGPUISD::", hence the changes in the tests. The few nodes defined in R600.td are not imported because TableGen processes AMDGPU.td, which doesn't include R600.td. Ideally, we would have two sets of nodes, but that would require careful reorganization of td files since some nodes are shared between AMDGPU/R600. Not sure if it is something worth looking into. Some nodes fail validation; those are listed in `AMDGPUSelectionDAGInfo::verifyTargetNode()`. Part of #119709. Pull Request: https://github.com/llvm/llvm-project/pull/168248	2025-11-17 00:10:07 +00:00
ronlieb	6d5f87fc42	Revert "DAG: Allow select ptr combine for non-0 address spaces" (#168292 ) Reverts llvm/llvm-project#167909	2025-11-16 18:35:51 -05:00
LU-JOHN	9fa15ef916	[AMDGPU] When shrinking and/or to bitset, remove implicit scc def (#168128 ) When shrinking and/or to bitset remove leftover implicit scc def. bitset* instructions do not set scc. Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-15 09:21:43 -06:00
Matt Arsenault	fbf74b2553	AMDGPU: Select vector reg class for divergent build_vector (#168169 ) The main improvement is to the mfma tests. There are some mild regressions scattered around, and a few major ones. The worst regressions are in some of the bitcast tests; these are cases where the SGPR argument list runs out and uses VGPRs, and the copies-from-VGPR are misidentified as divergent. Most of the shufflevector tests are also regressions. These end up with cleaner MIR, but then get poor regalloc decisions.	2025-11-14 21:53:39 -08:00
Matt Arsenault	9fecebf97b	AMDGPU: Consider isVGPRImm when forming constant from build_vector (#168168 ) This probably should have turned into a regular integer constant earlier. This is to defend against future regressions.	2025-11-14 21:42:26 -08:00
Matt Arsenault	d8f6e108da	AMDGPU: Use vgpr to implement divergent i32->i64 anyext (#168167 ) Handle this for consistency with the zext case.	2025-11-15 04:58:25 +00:00
Matt Arsenault	0fa6a67a42	AMDGPU: Use v_mov_b32 to implement divergent zext i32->i64 (#168166 ) Some cases are relying on SIFixSGPRCopies to force VALU reg_sequence inputs with SGPR inputs to use all VGPR inputs, but this doesn't always happen if the reg_sequence isn't invalid. Make sure we use a vgpr up-front here so we don't rely on something later.	2025-11-14 20:19:24 -08:00
Shilei Tian	72a6ae6844	[AMDGPU] Fix wrong MSB encoding for V_FMAMK instructions (#168107 ) These instructions use `src0`, `imm`, `src1` as operand. Fixes SWDEV-566579.	2025-11-14 22:50:17 +00:00
Gang Chen	a407d02752	Revert "[Transform][LoadStoreVectorizer] allow redundant in Chain (#163019)" (#168105) This reverts commit 92e5608ffa6ff39ac3707f29418cc9482471f5d9.	2025-11-14 11:49:09 -08:00
Matt Arsenault	b2f12331ab	AMDGPU: Fix verifier error when waterfall call target is in AV register (#168017 )	2025-11-14 09:49:40 -08:00
Matt Arsenault	cfc74dddef	AMDGPU: Constrain readfirstlane operand when writing to m0 (#168004 ) Fixes another verifier error after introducing AV registers. Also fixes not clearing the subregister index if there was one.	2025-11-14 17:18:43 +00:00
Matt Arsenault	c6ee2d9860	AMDGPU: Constrain readfirstlane operand to vgpr_32 (#168001 )	2025-11-14 08:40:18 -08:00
LU-JOHN	b67e465b49	[AMDGPU] Ensure SCC is not live before shrinking to s_bitset* (#167907) Ensure SCC is not live before shrinking s_and/s_or instructions to s_bitset*. Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-14 10:22:15 -06:00
Brox Chen	3d4156700e	[AMDGPU][True16][CodeGen] lower flat_d16_saddr_t16 to saddr inst (#166603) In true16 mode, D16 insts are lowered to a pseudo t16 first, and then lowered to hi/lo insts in MC lowering using the D16T16 table. However, the D16T16 table selects both `flat_load_d16_t16` and `flat_load_d16_t16_saddr` to `flat_load_d16_(hi)_b16`, which is wrong. The saddr pseudo inst `flat_load_d16_t16_saddr` should be selected to the saddr hi/lo inst. The global/scratch cases are correct; flat seems to be the only one with this issue.	2025-11-14 09:47:05 -05:00
Alexander Belyaev	7ee0e0f956	Revert "[LICM] Sink unused l-invariant loads in preheader. #157559 " This reverts commit 469702c5d5cc4fa18c3a962afb971950a084f373. https://github.com/llvm/llvm-project/issues/168048	2025-11-14 14:51:33 +01:00
Pierre van Houtryve	31b7f1fa0b	[GlobalISel] Add support for value/constants as inline asm memory operand (#161501 ) InlineAsmLowering rejected inline assembly with memory reference inputs if the values passed to the inline asm weren't pointers. The DAG lowering however handled them just fine. This patch updates InlineAsmLowering to store such values on the stack, and then use the stack pointer as the "indirect" version of the operand.	2025-11-14 10:34:38 +01:00
Gang Chen	92e5608ffa	[Transform][LoadStoreVectorizer] allow redundant in Chain (#163019 ) This can absorb redundant loads when forming vector load. Can be used to fix the situation created by VectorCombine. See: https://discourse.llvm.org/t/what-is-the-purpose-of-vectorizeloadinsert-in-the-vectorcombine-pass/88532	2025-11-13 12:19:29 -08:00
Matt Arsenault	e5f499f48f	DAG: Allow select ptr combine for non-0 address spaces (#167909 )	2025-11-13 18:58:08 +00:00
Matt Arsenault	c7019c7eda	AMDGPU: Really use AV classes by default for vector classes (#166483) Update getRegClassFor to use AV classes in place of VGPRs for gfx90a-gfx950. There are a handful of regressions. Most are enabling unprofitable rematerialization, which reduces register count by 1 but adds an unnecessary instruction.	2025-11-13 18:54:02 +00:00
Ryan Mitchell	5e4505d562	[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization (#164357 ) Refactor the XCnt optimization checks so that they can be checked when applying a pre-existing waitcnt. This removes unnecessary xcnt waits when taking a loop backedge.	2025-11-13 18:43:12 +00:00
Matt Arsenault	ac27b2455a	AMDGPU: Add baseline test for load-select to load select of pointer combine (#167908 )	2025-11-13 10:17:32 -08:00
Matt Arsenault	7ff4cd4da8	AMDGPU: Start to use AV classes for unknown vector class (#166482) Use AGPR+VGPR superclasses for gfx90a+. The type used for the class should be the broadest possible class, to be contextually restricted later. InstrEmitter clamps these to the common subclass of the context use instructions, so we're best off using the broadest possible class for all types. Note this does very little because we only use VGPR classes for FP types (though this doesn't particularly make any sense), and we legalize normal loads and stores to integer.	2025-11-13 09:28:33 -08:00
Fei Peng	f67409c3ec	Redesign Straight-Line Strength Reduction (SLSR) (#162930 ) This PR implements parts of https://github.com/llvm/llvm-project/issues/162376 - Broader equivalence than constant index deltas: - Add Base-delta and Stride-delta matching for Add and GEP forms using ScalarEvolution deltas. - Reuse enabled for both constant and variable deltas when an available IR value dominates the user. - Dominance-aware dictionary instead of linear scans: - Tuple-keyed candidate dictionary grouped by basic block. - Walk the immediate-dominator chain to find the nearest dominating basis quickly and deterministically. - Simple cost model and best-rewrite selection: - Score candidate expressions and rewrites; select the highest-profit rewrite per instruction. - Skip rewriting when expressions are already foldable or high-efficiency. - Path compression for better ILP: - Compress chains of rewrites to a deeper dominating basis when a constant delta exists along the path, reducing dependent bumps on critical paths. - Dependency-aware rewrite ordering: - Build a dependency graph (basis, stride, variable delta producers) and rewrite in topological order. - This dependency graph will be needed by the next PR that adds partial strength reduction.	2025-11-13 08:08:52 -06:00
Mariusz Sikora	4cd836181f	[AMDGPU] Lower S_ABSDIFF_I32 to VALU instructions (#167691 ) Added support for lowering the scalar S_ABSDIFF_I32 instruction to equivalent VALU operations.	2025-11-13 14:35:44 +01:00
Nicolai Hähnle	66366599a9	CodeGen/AMDGPU: Allow 3-address conversion of bundled instructions (#166213 ) This is in preparation for future changes in AMDGPU that will make more substantial use of bundles pre-RA. For now, simply test this with degenerate (single-instruction) bundles.	2025-11-12 22:04:46 +00:00
Matt Arsenault	24be0ba39b	DAG: Fix assert on nofpclass call with aggregate return (#167725 )	2025-11-12 18:12:20 +00:00
Dhruva Chakrabarti	58ac95db60	[AMDGPU] Avoid changing minOccupancy if unclustered schedule was not run for any region. (#162025 ) During init of unclustered schedule stage, minOccupancy may be temporarily increased. But subsequently, if none of the regions are scheduled because they don't meet the conditions of initGCNRegion, minOccupancy remains incorrectly set. This patch avoids this incorrectness by delaying the change of minOccupancy until a region is about to be scheduled.	2025-11-12 10:11:22 -08:00
Jay Foad	5e4f177142	[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs (#166779 )	2025-11-12 09:44:08 +00:00
Chinmay Deshpande	79d9ae7a77	[AMDGPU][GISel] Add RegBankLegalize support for G_AMDGPU_WAVE_ADDRESS (#167456 )	2025-11-11 16:37:42 -08:00
Matt Arsenault	441e511522	AMDGPU: Update register class numbers in test (#167601 )	2025-11-11 15:30:41 -08:00
Matt Arsenault	2bf92787df	AMDGPU: Start using RegClassByHwMode for wavesize operands (#159884) This eliminates the pseudo register classes used to hack the wave register class, which are now replaced with RegClassByHwMode, so most of the diff is from register class ID renumbering.	2025-11-11 15:07:59 -08:00
Matt Arsenault	2308d16fdc	AMDGPU: Regenerate test checks after bbde79278 (#167590 ) Merge chasing latest versions of bulk test updates	2025-11-11 14:22:24 -08:00
Matt Arsenault	bbde792786	AMDGPU: Relax shouldCoalesce to allow more register tuple widening (#166475) Allow widening up to 128-bit registers or if the new register class is at least as large as one of the existing register classes. This was artificially limiting. In particular this was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements. The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a11015 addressed the allocation failure concern by rejecting coalescing if there are no available registers. The original change in a4e63ead4b didn't include a realistic testcase to judge if this is harmful for pressure. I would expect any issues from this to be garden-variety subreg handling issues. We could use more dynamic state information here if it really is an issue. I get the best results by removing this override completely. This is a smaller step for patch splitting purposes.	2025-11-11 13:50:57 -08:00
Akash Dutta	b8add3710d	[AMDGPU] Add pattern to select scalar ops for fshr with uniform operands (#165295) Reasoning behind proposed change. This helps us move away from selecting v_alignbits for fshr with uniform operands. V_ALIGNBIT is defined in the ISA as: D0.u32 = 32'U(({ S0.u32, S1.u32 } >> S2.u32[4 : 0]) & 0xffffffffLL) Note: S0 carries the MSBs and S1 carries the LSBs of the value being aligned. I interpret that as: concat(S0, S1) >> S2[4:0], with the 0xffffffff mask returning the lower 32 bits. fshr: fshr i32 %src0, i32 %src1, i32 %src2 Where: concat(%src0, %src1) represents the 64-bit value formed by %src0 as the high 32 bits and %src1 as the low 32 bits. %src2 is the shift amount. Only the lower 32 bits are returned. So these two are identical. So, I can expand the V_ALIGNBIT through bit manipulation as: Concat: S1 | (S0 << 32) Shift: ((S1 | (S0 << 32)) >> S2) Break the shift: (S1 >> S2) | (S0 << (32 - S2)) The proposed pattern does exactly this. Additionally, src2 in the fshr pattern must be in the range 0–31; if the shift is ≥32, hardware semantics differ and it must be handled with extra instructions. The extra S_ANDs limit the shift amount to its low 5 bits.	2025-11-11 13:14:48 -06:00
vangthao95	5eb8d290dc	[AMDGPU][GlobalISel] Add RegBankLegalize support for G_BLOCK_ADDR and G_GLOBAL_VALUE (#165340 )	2025-11-11 09:31:59 -08:00
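
The fshr expansion described in commit b8add3710d above can be checked with a small model. This is a minimal sketch in Python, not the selection pattern from the patch: the function names `fshr_reference` and `fshr_split` are hypothetical, and masking to 32 bits stands in for register width. It only illustrates that the 64-bit concat-and-shift form and the split form (S1 >> n) | (S0 << (32 - n)) agree once the shift amount is limited to its low 5 bits.

```python
MASK32 = 0xFFFFFFFF  # models a 32-bit register


def fshr_reference(s0, s1, s2):
    """Low 32 bits of concat(s0, s1) >> (s2 & 0x1F); s0 is the high half."""
    amount = s2 & 0x1F
    wide = ((s0 & MASK32) << 32) | (s1 & MASK32)
    return (wide >> amount) & MASK32


def fshr_split(s0, s1, s2):
    """Split form: (s1 >> n) | (s0 << (32 - n)), truncated to 32 bits."""
    amount = s2 & 0x1F
    low = (s1 & MASK32) >> amount
    # For amount == 0 this shifts left by 32; the mask then yields 0 here.
    # A real 32-bit shift instruction may behave differently in that case.
    high = ((s0 & MASK32) << (32 - amount)) & MASK32
    return low | high


if __name__ == "__main__":
    samples = [0x0, 0x1, 0x12345678, 0x9ABCDEF0, 0xFFFFFFFF]
    for s0 in samples:
        for s1 in samples:
            for s2 in range(64):
                assert fshr_reference(s0, s1, s2) == fshr_split(s0, s1, s2)
    print(hex(fshr_reference(0x12345678, 0x9ABCDEF0, 8)))
```

The zero-shift case works in this model only because Python integers are unbounded and the result is masked afterwards; how the actual pattern treats a shift amount of 0 is not stated in the commit message.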


9591 Commits