8516 Commits

Author SHA1 Message Date
Brox Chen
a61cc1b99a
[AMDGPU][True16][CodeGen] Skip combineDpp with t16 instructions (#128918)
We only emits v_mov_b32/64_dpp. Don't combine t16 instructions with mov
dpp. Update the test inputs to be legal.

It is future work to emit v_mov_b16_dpp, and then update GCNDPPCombine
to combine it with the 16-bit instructions.
2025-03-31 10:18:25 -04:00
Simon Pilgrim
9b32f3d096
[DAG] visitEXTRACT_SUBVECTOR - don't return early on failure of EXTRACT_SUBVECTOR(INSERT_SUBVECTOR()) -> BITCAST fold (#133695)
Always allow later folds to try to match as well.
2025-03-31 14:32:43 +01:00
Fangrui Song
04a67528d3
[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses (#133674)
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
of MCBinaryExpr with +/- operators and MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.

Examples:

- (SymA - SymB) + C now prints as SymA - SymB + C.
  This updates the output of -fexperimental-relative-c++-abi-vtables for
  AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8`
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
  change primarily affecting AMDGPUMCExpr.
2025-03-30 22:03:14 -07:00
Ana Mihajlovic
8c7550132f
[AMDGPU] Unused sdst writing to null (#133229)
Unused sdst writing to null to avoid a false VALU->SALU dependency
stall. This requires using the VOP3 encoding.
2025-03-28 18:12:34 +01:00
LU-JOHN
827f2ad643
AMDGPU: Convert vector 64-bit shl to 32-bit if shift amt >= 32 (#132964)
Convert vector 64-bit shl to 32-bit if shift amt is known to be >= 32.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-03-28 23:46:35 +07:00
Ana Mihajlovic
f7a034d400
[AMDGPU] (x or y) xor -1 -> x nor y (#130264)
Added pattern so s_nor is selected for ((i1 x or i1 y) xor -1) instead
of s_or and s_xor . This patch is for i1 divergent. The ballot in the
test is added for the retrieval of lanemask. The control flow is needed
because the combiner can't pass through phi instructions.
2025-03-28 11:20:17 +01:00
Brox Chen
06411399fb
[AMDGPU][True16][CodeGen] srl pattern for true16 mode (#132987)
Added a srl pattern for true16 flow. Changing right shift 16bit to a
reg_sequence
`srl vgpr32, 16 -> reg_sequence (vgpr32.hi16,  0)`

and finally it's lowered to two COPY
`vdst.lo16 = COPY vsrc.hi16`
`vdst.hi16 = COPY 0`

The benefits of this transform is allowing the following pass to
optimize out these copy.
2025-03-26 18:38:20 -04:00
Alex MacLean
672c51c9cb
[SDAG][tests] add some test cases covering an add-based rotate (#132842)
Add tests to various targets covering rotate idioms where an 'ADD' node
is used to combine the halves instead of an 'OR'. Some of these cases
will be better optimized following #125612, while others are already
well optimized or do not have a valid fold to a rotate or funnel-shift.
2025-03-26 09:47:28 -07:00
Akshat Oke
719b029c16
[AMDGPU][NPM] Port SILateBranchLowering to NPM (#130063) 2025-03-26 19:28:19 +05:30
Jeffrey Byrnes
e5641f6584
[AMDGPU] Autogen checks for mfma-loop.ll (#133004)
Needed for a RegisterCoalescing patch
2025-03-25 15:24:40 -07:00
Juan Manuel Martinez Caamaño
2f8d699845
[AMDGPU][SelectionDAG] Use COPY instead of S_MOV_B32 to assign values to M0 (#132957)
This is consistent with what's done on GISel. This allows the register
coalescer to remove the redundant intermediate `s_mov_b32` instructions
by using `m0` directly as the result register.
2025-03-25 19:05:43 +01:00
Jeffrey Byrnes
25938389c0
[AMDGPU] Autogen checks for agpr-csr.ll (#132959)
Needed for a RegisterCoalescer patch
2025-03-25 10:28:35 -07:00
LU-JOHN
70aeb89094
Calculate KnownBits from Metadata correctly for vector loads (#128908)
Calculate KnownBits correctly from metadata for vector loads.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-03-25 22:46:30 +07:00
Akshat Oke
f8e908a0ed
[AMDGPU][NPM] Port SIInsertHardClauses to NPM (#130062) 2025-03-25 15:33:32 +05:30
Austin Kerbow
e75f586b81
[AMDGPU] Relax lds dma waitcnt with no aliasing pair (#131842)
If we cannot find any lds DMA instruction that is aliased by some load
from lds, we will still insert vmcnt(0). This is overly cautious since
handling inter-thread dependences is normally managed by the memory
model instead of the waitcnt pass, so this change updates the behavior
to be more inline with how other types of memory events are handled.
2025-03-24 10:38:47 -07:00
Jay Foad
02ed65912e
[AMDGPU] 4-align TTMP triples (#132759)
Follow up to e4284a7c70cd "[AMDGPU] 4-align SGPR triples".

Previously TTMP triples like ttmp[3:5] were aligned on a 3-TTMP boundary
which has no basis in hardware.

Aligning them on a 4-TTMP boundary matches what we do for SGPRs, which
reduces the number of extra register classes synthesized by TableGen,
bringing the total number down from 653 to 615.
2025-03-24 17:11:39 +00:00
Ana Mihajlovic
cdea46cc8c
[AMDGPU] Add pattern for inverse.ballot.i64 Wave32 (#132770) 2025-03-24 17:30:02 +01:00
Akshat Oke
f10dc76f03
[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061) 2025-03-24 21:36:45 +05:30
Juan Manuel Martinez Caamaño
5634e7e2f0
[AMDGCN][SIWholeQuadMode] Rework splitBlock/lowerKillI1/lowerKillF32 to handle case when SI_KILL_I1_TERMINATOR -1 0 is not the unique terminator
The lowerKillI1 method wrongly handled cases where it inserted a
new S_BRANCH instruction when the kill was not the only terminator,
and then tried to split the block.

`SI_KILL_I1_TERMINATOR -1,0` doesn't have any effect. Instead of
lowering to an unconditional branch, we remove the instruction and
insert an unconditional branch only if the instruction is the last
terminator. No split is needed in this case (if the last terminator
has been reached, then the whole block was processed).

Also stop generating an unconditional branch in splitBlock: this
branch was redundant since TermMI is promoted to a
terminator that fallsthrough to the next block already.

Solves SWDEV-508819
2025-03-24 15:57:08 +01:00
Pierre van Houtryve
c457c88951
[GlobalISel] Combine (sext (trunc x)) to (sext_inreg x) (#131622)
Split from #131312
2025-03-24 09:32:04 +01:00
Pierre van Houtryve
6e3c24fc0a
[DAG] Combine (sext (sext_in_reg x)) to (sext_in_reg (any_extend x)) (#132386) 2025-03-24 09:31:02 +01:00
Shoreshen
054e0b41a8
[AMDGPU] Add all type for bitcast on VReg_512 (#131775)
Add all types pattern for bitcast on VReg_512
2025-03-24 11:52:10 +08:00
Shilei Tian
f1ac2afe21
Reapply "[AMDGPU] Use COV6 by default (#118515)" (#130963)
This reverts commit 68bcba6d7a1cc18996c0bcb7c62267c62d2040d0.
2025-03-21 15:26:45 -04:00
Stephen Thomas
2e3fa4ba9e
[AMDGPU] Insert before and after instructions that always use GDS (#131338)
It is an architectural requirement that there must be no outstanding GDS
instructions when an "always GDS" instruction is issued, and also that
an always GDS instruction must be allowed to complete.

Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to
(unconditionally) any always GDS instruction, and an additional S_NOP if
the subsequent wait was followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT,
DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two as considered
always GDS as of this patch).
2025-03-21 09:33:04 +00:00
Brox Chen
55d3a55cc1
[AMDGPU][True16][CodeGen]disable true16 on fneg test (#132221)
This is a NFC change.

Revert the failed test case in
https://github.com/llvm/llvm-project/pull/131206
2025-03-20 10:32:49 -04:00
Diana Picus
e17b3cdfb3
[AMDGPU] Dynamic VGPR support for llvm.amdgcn.cs.chain (#130094)
The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.

A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
  /*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee

This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
is so that we don't risk other passes (particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).

Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).

For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.

---------

Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-03-20 08:38:04 +01:00
Jeffrey Byrnes
6cc918089a
[AMDGPU] Autogen checks for mfma-no-register-aliasing.ll (#132117)
For an upcoming RegisterCoalescer PR
2025-03-19 16:02:51 -07:00
Mariusz Sikora
4f5ccf22fa
[AMDGPU] Support image_bvh8_intersect_ray instruction and intrinsic. (#130041)
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-03-19 16:08:08 +01:00
Emma Pilkington
3eddb992d0
[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167)
Fixes SWDEV-514946
2025-03-19 09:31:54 -04:00
Diana Picus
72c3c30452
[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)

Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.

Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
2025-03-19 13:49:19 +01:00
Matt Arsenault
5b6b4fdb4b
DAG: Fix promote of half freeze (#131844) 2025-03-19 18:30:34 +07:00
Diana Picus
8a53324aa5
[AMDGPU] Deallocate VGPRs before exiting in dynamic VGPR mode (#130037)
In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If
the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0`
before S_ENDPGM, but this may incur some performance cost. Therefore
it's better if the compiler proactively generates that instruction.

This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a
`S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.
2025-03-19 09:00:36 +01:00
Shoreshen
b907920058
[AMDGPU] auto-generate file check line for amdgcn.bitcast.ll (#131955)
Replace check lines by auto-generated
2025-03-19 15:40:58 +08:00
Mariusz Sikora
575fde0995
[AMDGPU] Add intrinsic and MI for image_bvh_dual_intersect_ray (#130038)
- Add llvm.amdgcn.image.bvh.dual.intersect.ray intrinsic and
image_bvh_dual_intersect_ray machine instruction.
- Add llvm_v10i32_ty and llvm_v10f32_ty

---------

Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>
2025-03-19 07:35:09 +01:00
Akshat Oke
6cc23faaac
[AMDGPU][NPM] Port AMDGPUMarkLastScratchLoad to NPM (#131738)
This finishes all passes for the optimized regalloc path.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-03-19 09:27:05 +05:30
Matt Arsenault
5ac680c5bf
AMDGPU: Add more freeze codegen tests (#131843) 2025-03-19 10:17:59 +07:00
Matt Arsenault
428e3a27c3
AMDGPU: Fix attributor not handling all trap intrinsics (#131758) 2025-03-19 10:17:28 +07:00
Matt Arsenault
2e39533e50
AMDGPU: Fix broken check prefix and degraded cov4 test coverage (#131757) 2025-03-19 08:30:05 +07:00
Matt Arsenault
6f44be97d0
IR: Make llvm.fake.use a DefaultAttrsIntrinsic (#131743)
This shouldn't be special and is just an ordinary sideeffect.
2025-03-19 08:29:04 +07:00
Carl Ritson
0e4116a6b9
[AMDGPU] Fix typing error in multi dimensional promote alloca (#131763)
Fix type error when GEP uses i64 index introduced in #127973.
2025-03-19 08:17:04 +09:00
Julian Brown
84909d7977 [AMDGCN] Allow unscheduling of bundled insns
This is a patch arising from AMD's fuzzing project.

In the test case, the scheduling algorithm decides to undo an attempted
schedule, but is unprepared to handle bundled instructions at that
point -- and those can arise via the expansion of intrinsics earlier
in compilation.  The fix is to use the splice method instead of
remove/insert, since that can handle bundles properly.
2025-03-18 11:56:51 -05:00
Matt Arsenault
dea5aa73fa
AMDGPU: Move insertion into V2SCopies map (#130776)
Insert the start instruction directly into the map before the uses. This
prevents improperly re-visting sgpr->vgpr phi inputs multiple times
which
would trigger a use after free.

I don't particularly trust the iteration scheme here. This is also
unnecessarily revisting transitive users of a phi or reg_sequence for
every
input operand, but I will address that separately.

Fixes #130646. I also believe it fixes #130119, although that test fails
less consistently for me.
2025-03-18 23:28:49 +07:00
Fabian Ritter
332f060363
[SeparateConstOffsetFromGEP] Don't set unsound inbounds flag (#130616)
The language reference says about inbounds geps that "if the
getelementptr has any non-zero indices[...] [t]he base pointer has an in
bounds address of the allocated object that it is based on [and]
[d]uring the successive addition of offsets to the address, the
resulting pointer must remain in bounds of the allocated object at each
step."

If (gep inbounds p, (a + 5)) is translated to (gep [inbounds] (gep p,
a), 5) with p pointing to the beginning of an object and a=-4, as the
example in the comments suggests, that's the case for neither of the
resulting geps. Therefore, we need to clear the inbounds flag for both
geps.

We might want to use ValueTracking to check if a is known to be
non-negative to preserve the inbounds flags.

For the AMDGPU tests with scratch instructions, removing the unsound
inbounds flag means that AMDGPUDAGToDAGISel::isFlatScratchBaseLegal sees
no NUW flag at the pointer add, which prevents generation of scratch
instructions with immediate offsets.

For SWDEV-516125.
2025-03-18 12:30:20 +01:00
Diana Picus
0a21ef9536
[AMDGPU] Add SubtargetFeature for dynamic VGPR mode (#130030)
This represents a hardware mode supported only for wave32 compute
shaders. When enabled, we set the `.dynamic_vgpr_en` field of
`.compute_registers` to true in the PAL metadata.

This will be changed to use an attribute after downstream consumers
have been migrated.
2025-03-18 11:48:01 +01:00
Matt Arsenault
c5fe075eaf
AMDGPU: Use freeze poison instead of undef in alloca promotion (#131285)
Previously the value created to represent the uninitialized memory
of the alloca was undef. Use freeze poison instead. Enables some
optimization improvements (which need defeating in the limit tests),
but also a few regressions. Seems to leave behind dead code in some
cases too.
2025-03-18 17:27:02 +07:00
Vikash Gupta
bdb63208b4
[AMDGPU][CodeGen] Using MBB's liveIn check in tandem with MCRegAliasIterator in SILowerSGPRSpills (#129848)
This patch replaces use of MachineRegisterInfo's liveIn check with the
machine basicBlock's liveIn. As the MRI's liveIn is inconsistent with
the entry MBB liveIns, when it comes to the machine verifier checks.

PS: Its an alternative solution with respect to #126926.
2025-03-18 10:51:07 +05:30
Jim Lin
00cad3ed22
[SDAG] Handle extract_subvector in isKnownNeverNaN (#131581)
Propagate nnan across extract_subvector.
2025-03-18 09:37:16 +08:00
Matt Arsenault
092e25571c AMDGPU: Add REQUIRES: asserts to machine pass violation test
We should promote this to a proper error and not llvm_unreachable
2025-03-18 07:31:50 +07:00
Tim Gymnich
887cf1f8ce
[AMDGPU][GlobalISel] Enable vector reductions (#131413)
- Enable llvm vector reductions for AMDGPU.

fixes https://github.com/llvm/llvm-project/issues/114816
2025-03-17 14:25:30 -07:00
Shilei Tian
e2c43ba981
[NFC][AMDGPU] Auto generate check lines for llvm/test/CodeGen/AMDGPU/packed-fp32.ll (#131629) 2025-03-17 11:42:03 -04:00