8529 Commits

Author SHA1 Message Date
Sterling-Augustine
7514225052
Use a more proper idiom for "the output file doesn't matter". NFC. (#134280)
As in the description. Follow up to PR #134179.
2025-04-03 10:24:10 -07:00
Brox Chen
bf388f8a43
[AMDGPU][True16][CodeGen] legalize operands when move16bit SALU to VALU (#133985)
This is a follow up PR from
https://github.com/llvm/llvm-project/pull/132089.

When a V2S copy and its useMI are lowered to VALU, this patch checks
whether the newly generated VALU instruction is a true16 instruction
and, if so, adds subreg access on all operands where necessary.

An example MIR looks like:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:sreg_32 = COPY %1:vgpr_32
%3:sreg_32 = S_FLOOR_F16 %2:sreg_32, ...
```
This is currently lowered to:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1:vgpr_32, 0, 0, 0 ...
```
After this patch:
```
%1:vgpr_32 = V_CVT_F32_U32_e64 %0:vgpr_32, 0, 0 ...
%2:vgpr_16 = V_FLOOR_F16_t16_e64 0, %1.lo16:vgpr_32, 0, 0, 0 ...
```
2025-04-03 12:26:41 -04:00
Krzysztof Drewniak
f23bb530cf
[AMDGPULowerBufferFatPointers] Use InstSimplifyFolder during rewrites (#134137)
This PR updates AMDGPULowerBufferFatPointers to use the
InstSimplifyFolder when creating IR during buffer fat pointer lowering.

This shouldn't cause any large functional changes and might improve the
quality of the generated code.
2025-04-03 10:12:18 -05:00
Sterling-Augustine
f68a5185d0
Allow this test to pass when the source is on a read-only filesystem (#134179)
llc attempts to create an empty file in the current directory, but it
can't do that on a read-only file system. Send that empty-output to
stdout, which prevents this failure.
2025-04-02 16:49:57 -07:00
Brox Chen
066787b9bd
[AMDGPU][True16][CodeGen] fold clamp update for true16 (#128919)
Check through COPY for possible clamp folding for v_mad_mixhi_f16 isel
2025-04-02 17:10:53 -04:00
Brox Chen
fb0e7b5f16
[AMDGPU][True16][CodeGen] Implement sgpr folding in true16 (#128929)
16-bit SGPRs are not implemented yet. For now, allow 32-bit SGPRs to be
folded into True16 instructions that take 16-bit values. Also use
sgpr_32 when an Imm is copied to sgpr_lo16 so that it can be folded
further. This improves generated code quality.
2025-04-02 16:08:26 -04:00
Juan Manuel Martinez Caamaño
beae0e9f1a
[AMDGPU] Use a target feature to enable __builtin_amdgcn_global_load_lds on gfx9/10 (#133055)
This patch introduces the `vmem-to-lds-load-insts` target feature, which
can be used to enable builtins `__builtin_amdgcn_global_load_lds` and
`__builtin_amdgcn_raw_ptr_buffer_load_lds` on platforms which have this
feature.

This feature is only available on gfx9/10.

A limitation of using a common target feature for both builtins is that
`__builtin_amdgcn_raw_ptr_buffer_load_lds` could otherwise have been
made available on gfx6, gfx7, and gfx8.
2025-04-02 20:00:09 +02:00
Juan Manuel Martinez Caamaño
0375ef07c3
[Clang][AMDGPU] Add __builtin_amdgcn_cvt_off_f32_i4 (#133741)
This built-in maps to `V_CVT_OFF_F32_I4` which treats its input as
a 4-bit signed integer and returns `0.0625f * src`.
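
The arithmetic described above can be sketched in Python. This is only an illustration of the stated semantics: the helper name `cvt_off_f32_i4` and the choice to ignore everything except the low 4 bits of the input are assumptions, not details taken from the patch.

```python
def cvt_off_f32_i4(src: int) -> float:
    """Model of the described semantics: 0.0625 * (src as a signed 4-bit integer)."""
    nibble = src & 0xF  # assumption: only the low 4 bits of the input matter
    # Sign-extend from 4 bits: 0x8..0xF map to -8..-1.
    signed = nibble - 16 if nibble & 0x8 else nibble
    return 0.0625 * signed


for value in (0x0, 0x7, 0x8, 0xF):
    print(f"{value:#x} -> {cvt_off_f32_i4(value)}")
```

Under these assumptions the result always lies in the range [-0.5, 0.4375].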

SWDEV-518861
2025-04-02 19:51:40 +02:00
Akshat Oke
a13a51b91f
[AMDGPU][NPM] Port AMDGPUSetWavePriority to NPM (#130064)
2025-04-02 16:28:05 +05:30
Brox Chen
dd1d41f833
[AMDGPU][True16][CodeGen] fix moveToVALU with proper subreg access in true16 (#132089)
There are V2S copies between vgpr16 and sgpr32 in true16 mode. This is
caused by vgpr16 and sgpr32 both being selectable for a 16-bit src in ISel.

When a V2S copy and its useMI are lowered to VALU, this patch:
1. Checks whether the newly generated VALU is used by a true16 inst, and
adds subreg access if necessary.
2. Legalizes the V2S copy by replacing it with subreg_to_reg.

An example MIR looks like:
```
%2:sgpr_32 = COPY %1:vgpr_16
%3:sgpr_32 = S_OR_B32 %2:sgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:sgpr_32, ...
```
This is currently lowered to:
```
%2:vgpr_32 = COPY %1:vgpr_16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3:vgpr_32, ...
```
After this patch:
```
%2:vgpr_32 = SUBREG_TO_REG 0, %1:vgpr_16, lo16
%3:vgpr_32 = V_OR_B32 %2:vgpr_32, ...
%4:vgpr_16 = V_ADD_F16_t16 %3.lo16:vgpr_32, ...
```
2025-04-01 12:40:18 -04:00
Shoreshen
7f14b2a9eb
Revert "[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable" (#133880)
Reverts llvm/llvm-project#130577
2025-04-01 17:37:02 +08:00
Valery Pykhtin
af0b0ce665
[AMDGPU] Fix SIFoldOperandsImpl::tryFoldZeroHighBits when met non-reg src1 operand. (#133761)
This happens when a constant is propagated to a V_AND 0xFFFF, reg
instruction.

Fixes failures like:

```
llc: /github/llvm-project/llvm/include/llvm/CodeGen/MachineOperand.h:366: llvm::Register llvm::MachineOperand::getReg() const: Assertion `isReg() && "This is not a register operand!"' failed.
Stack dump:
0.      Program arguments: /github/llvm-project/build/Debug/bin/llc -mtriple=amdgcn -mcpu=gfx1101 -verify-machineinstrs -run-pass si-fold-operands /github/llvm-project/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir -o -
1.      Running pass 'Function Pass Manager' on module '/github/llvm-project/llvm/test/CodeGen/AMDGPU/fold-zero-high-bits-skips-non-reg.mir'.
2.      Running pass 'SI Fold Operands' on function '@test_tryFoldZeroHighBits_skips_nonreg'
...
#12 0x00007f5a55005cfc llvm::MachineOperand::getReg() const /github/llvm-project/llvm/include/llvm/CodeGen/MachineOperand.h:0:5
#13 0x00007f5a555c6bf5 (anonymous namespace)::SIFoldOperandsImpl::tryFoldZeroHighBits(llvm::MachineInstr&) const /github/llvm-project/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:1459:36
#14 0x00007f5a555c63ad (anonymous namespace)::SIFoldOperandsImpl::run(llvm::MachineFunction&) /github/llvm-project/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:2455:11
#15 0x00007f5a555c6780 (anonymous namespace)::SIFoldOperandsLegacy::runOnMachineFunction
```
2025-04-01 10:27:58 +02:00
Shoreshen
145b4a3950
[AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (#130577)
For Add, Sub, Mul with Int64 type, if profitable, then do:
1. Trunc operands to Int32 type
2. Apply 32 bit Add/Sub/Mul
3. Zext to Int64 type
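
The three steps above can be sketched in Python. The commit message does not spell out when the narrowing is valid or profitable, so this sketch assumes the simplest case: the operands and the 64-bit result all fit in 32 unsigned bits. The helper names `narrow_binop` and `wide_binop` are illustrative only.

```python
import operator

MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def narrow_binop(op, a: int, b: int) -> int:
    """Trunc both i64 operands to i32, apply the 32-bit op, then zext to i64."""
    a32 = a & MASK32             # step 1: trunc
    b32 = b & MASK32
    r32 = op(a32, b32) & MASK32  # step 2: 32-bit add/sub/mul, wrapping modulo 2**32
    return r32                   # step 3: zext keeps the upper 32 bits at zero


def wide_binop(op, a: int, b: int) -> int:
    """Reference 64-bit operation, wrapping modulo 2**64."""
    return op(a, b) & MASK64


# Equivalent when nothing overflows 32 bits.
print(narrow_binop(operator.mul, 1000, 1000) == wide_binop(operator.mul, 1000, 1000))
# Not equivalent when the true result needs more than 32 bits.
print(narrow_binop(operator.add, MASK32, 1) == wide_binop(operator.add, MASK32, 1))
```

The second comparison shows why the transform needs a guard: without a bound on the result, the narrowed form drops the carry into the upper half.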
2025-04-01 11:18:17 +08:00
Brox Chen
a61cc1b99a
[AMDGPU][True16][CodeGen] Skip combineDpp with t16 instructions (#128918)
We only emit v_mov_b32/64_dpp. Don't combine t16 instructions with mov
dpp. Update the test inputs to be legal.

It is future work to emit v_mov_b16_dpp, and then update GCNDPPCombine
to combine it with the 16-bit instructions.
2025-03-31 10:18:25 -04:00
Simon Pilgrim
9b32f3d096
[DAG] visitEXTRACT_SUBVECTOR - don't return early on failure of EXTRACT_SUBVECTOR(INSERT_SUBVECTOR()) -> BITCAST fold (#133695)
Always allow later folds to try to match as well.
2025-03-31 14:32:43 +01:00
Fangrui Song
04a67528d3
[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses (#133674)
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
of MCBinaryExpr with +/- operators and MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.

Examples:

- (SymA - SymB) + C now prints as SymA - SymB + C.
  This updates the output of -fexperimental-relative-c++-abi-vtables for
  AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8`
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
  change primarily affecting AMDGPUMCExpr.
2025-03-30 22:03:14 -07:00
Ana Mihajlovic
8c7550132f
[AMDGPU] Unused sdst writing to null (#133229)
Write an unused sdst to null to avoid a false VALU->SALU dependency
stall. This requires using the VOP3 encoding.
2025-03-28 18:12:34 +01:00
LU-JOHN
827f2ad643
AMDGPU: Convert vector 64-bit shl to 32-bit if shift amt >= 32 (#132964)
Convert vector 64-bit shl to 32-bit if shift amt is known to be >= 32.
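
A minimal Python sketch of why this narrowing is sound: when the shift amount is at least 32, the low half of the result is zero and the high half depends only on the low half of the input. The function name `shl64_via_shl32` is hypothetical and models plain integers, not the actual DAG nodes.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def shl64_via_shl32(x: int, amt: int) -> int:
    """64-bit shl for 32 <= amt < 64 using a single 32-bit shift."""
    if not 32 <= amt < 64:
        raise ValueError("sketch only covers shift amounts in [32, 64)")
    lo = x & MASK32                          # only the low input half survives
    hi_result = (lo << (amt - 32)) & MASK32  # 32-bit shl by (amt - 32)
    return hi_result << 32                   # low half of the result is zero


x = 0x0123_4567_89AB_CDEF
for amt in (32, 40, 63):
    print(amt, shl64_via_shl32(x, amt) == (x << amt) & MASK64)
```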

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-03-28 23:46:35 +07:00
Ana Mihajlovic
f7a034d400
[AMDGPU] (x or y) xor -1 -> x nor y (#130264)
Add a pattern so that s_nor is selected for ((i1 x or i1 y) xor -1)
instead of s_or followed by s_xor. This patch covers the divergent i1
case. The ballot in the test is there to retrieve the lanemask. The
control flow is needed because the combiner can't look through phi
instructions.
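
The identity behind the pattern can be checked with a short Python sketch. It assumes divergent i1 values are held as 32-bit lane masks (wave32, one bit per lane); the mask width and the function names are assumptions made for illustration.

```python
LANE_MASK = 0xFFFF_FFFF  # assumption: wave32, one bit per lane


def or_then_xor(x: int, y: int) -> int:
    """(x or y) xor -1, as two operations (s_or followed by s_xor)."""
    return ((x | y) ^ -1) & LANE_MASK


def nor(x: int, y: int) -> int:
    """x nor y, as a single operation (s_nor)."""
    return ~(x | y) & LANE_MASK


x, y = 0x0000_FF00, 0x00F0_00F0
print(hex(or_then_xor(x, y)), hex(nor(x, y)))
```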
2025-03-28 11:20:17 +01:00
Brox Chen
06411399fb
[AMDGPU][True16][CodeGen] srl pattern for true16 mode (#132987)
Add an srl pattern for the true16 flow. A right shift by 16 is changed
to a reg_sequence:
`srl vgpr32, 16 -> reg_sequence (vgpr32.hi16,  0)`

which is finally lowered to two COPYs:
`vdst.lo16 = COPY vsrc.hi16`
`vdst.hi16 = COPY 0`

The benefit of this transform is that it allows later passes to
optimize out these copies.
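
The equivalence between the shift and the two half-register copies can be sketched in Python. The function name `srl16_as_reg_sequence` is hypothetical, and registers are modelled as plain 32-bit integers.

```python
def srl16_as_reg_sequence(src: int) -> int:
    """Model `srl src, 16` as two 16-bit half-register copies."""
    src_hi16 = (src >> 16) & 0xFFFF
    dst_lo16 = src_hi16  # vdst.lo16 = COPY vsrc.hi16
    dst_hi16 = 0         # vdst.hi16 = COPY 0
    return (dst_hi16 << 16) | dst_lo16


value = 0xDEAD_BEEF
print(hex(srl16_as_reg_sequence(value)), hex(value >> 16))
```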
2025-03-26 18:38:20 -04:00
Alex MacLean
672c51c9cb
[SDAG][tests] add some test cases covering an add-based rotate (#132842)
Add tests to various targets covering rotate idioms where an 'ADD' node
is used to combine the halves instead of an 'OR'. Some of these cases
will be better optimized following #125612, while others are already
well optimized or do not have a valid fold to a rotate or funnel-shift.
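
For context, a small Python sketch of the idiom these tests cover. When the two shifted halves occupy disjoint bits, adding them produces the same value as OR-ing them, so the ADD form is still a rotate. The function names are illustrative, and the sketch assumes a 32-bit value with a rotate amount strictly between 0 and 32.

```python
MASK32 = 0xFFFF_FFFF


def rotl32_or(x: int, n: int) -> int:
    """Classic rotate-left idiom: the halves are combined with OR."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl32_add(x: int, n: int) -> int:
    """Same idiom with ADD: the halves share no set bits, so no carries occur."""
    return ((x << n) + (x >> (32 - n))) & MASK32


x = 0x8000_0001
print(hex(rotl32_or(x, 4)), hex(rotl32_add(x, 4)))
```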
2025-03-26 09:47:28 -07:00
Akshat Oke
719b029c16
[AMDGPU][NPM] Port SILateBranchLowering to NPM (#130063)
2025-03-26 19:28:19 +05:30
Jeffrey Byrnes
e5641f6584
[AMDGPU] Autogen checks for mfma-loop.ll (#133004)
Needed for a RegisterCoalescing patch
2025-03-25 15:24:40 -07:00
Juan Manuel Martinez Caamaño
2f8d699845
[AMDGPU][SelectionDAG] Use COPY instead of S_MOV_B32 to assign values to M0 (#132957)
This is consistent with what's done on GISel. This allows the register
coalescer to remove the redundant intermediate `s_mov_b32` instructions
by using `m0` directly as the result register.
2025-03-25 19:05:43 +01:00
Jeffrey Byrnes
25938389c0
[AMDGPU] Autogen checks for agpr-csr.ll (#132959)
Needed for a RegisterCoalescer patch
2025-03-25 10:28:35 -07:00
LU-JOHN
70aeb89094
Calculate KnownBits from Metadata correctly for vector loads (#128908)
Calculate KnownBits correctly from metadata for vector loads.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-03-25 22:46:30 +07:00
Akshat Oke
f8e908a0ed
[AMDGPU][NPM] Port SIInsertHardClauses to NPM (#130062)
2025-03-25 15:33:32 +05:30
Austin Kerbow
e75f586b81
[AMDGPU] Relax lds dma waitcnt with no aliasing pair (#131842)
If we cannot find any lds DMA instruction that is aliased by some load
from lds, we will still insert vmcnt(0). This is overly cautious since
handling inter-thread dependences is normally managed by the memory
model instead of the waitcnt pass, so this change updates the behavior
to be more in line with how other types of memory events are handled.
2025-03-24 10:38:47 -07:00
Jay Foad
02ed65912e
[AMDGPU] 4-align TTMP triples (#132759)
Follow up to e4284a7c70cd "[AMDGPU] 4-align SGPR triples".

Previously TTMP triples like ttmp[3:5] were aligned on a 3-TTMP boundary
which has no basis in hardware.

Aligning them on a 4-TTMP boundary matches what we do for SGPRs, which
reduces the number of extra register classes synthesized by TableGen,
bringing the total number down from 653 to 615.
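
To make the alignment change concrete, here is a small Python sketch that enumerates the triples allowed under each alignment. It assumes 16 trap temporaries (ttmp0 to ttmp15) purely for illustration, and it does not model how TableGen derives the register class counts quoted above.

```python
NUM_TTMPS = 16  # assumption for illustration: ttmp0..ttmp15


def aligned_triples(alignment: int, count: int = NUM_TTMPS) -> list[str]:
    """List the 3-register tuples whose first register is a multiple of `alignment`."""
    return [f"ttmp[{start}:{start + 2}]"
            for start in range(0, count - 2, alignment)]


print("3-aligned:", aligned_triples(3))
print("4-aligned:", aligned_triples(4))
```

With 4-alignment, a triple such as ttmp[3:5] is no longer a valid tuple.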
2025-03-24 17:11:39 +00:00
Ana Mihajlovic
cdea46cc8c
[AMDGPU] Add pattern for inverse.ballot.i64 Wave32 (#132770)
2025-03-24 17:30:02 +01:00
Akshat Oke
f10dc76f03
[AMDGPU][NPM] Port SIInsertWaitcnts to NPM (#130061)
2025-03-24 21:36:45 +05:30
Juan Manuel Martinez Caamaño
5634e7e2f0
[AMDGCN][SIWholeQuadMode] Rework splitBlock/lowerKillI1/lowerKillF32 to handle case when SI_KILL_I1_TERMINATOR -1 0 is not the unique terminator
The lowerKillI1 method wrongly handled cases where it inserted a
new S_BRANCH instruction when the kill was not the only terminator,
and then tried to split the block.

`SI_KILL_I1_TERMINATOR -1,0` doesn't have any effect. Instead of
lowering to an unconditional branch, we remove the instruction and
insert an unconditional branch only if the instruction is the last
terminator. No split is needed in this case (if the last terminator
has been reached, then the whole block was processed).

Also stop generating an unconditional branch in splitBlock: this
branch was redundant since TermMI is promoted to a
terminator that already falls through to the next block.

Solves SWDEV-508819
2025-03-24 15:57:08 +01:00
Pierre van Houtryve
c457c88951
[GlobalISel] Combine (sext (trunc x)) to (sext_inreg x) (#131622)
Split from #131312
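
A Python sketch of the equivalence in the title, under the assumption that the sext produces the same type that x already has (32 bits here) and that the truncated type is 8 bits. The helpers `trunc`, `sext`, and `sext_inreg` are simplified integer models, not the GlobalISel opcodes themselves.

```python
def trunc(value: int, to_bits: int) -> int:
    """Keep only the low `to_bits` bits."""
    return value & ((1 << to_bits) - 1)


def sext(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a `from_bits`-wide value to `to_bits` bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


def sext_inreg(value: int, from_bits: int, width: int) -> int:
    """Sign-extend the low `from_bits` bits in place, keeping the full width."""
    return sext(value, from_bits, width)


x = 0x1234_5680
print(hex(sext(trunc(x, 8), 8, 32)), hex(sext_inreg(x, 8, 32)))
```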
2025-03-24 09:32:04 +01:00
Pierre van Houtryve
6e3c24fc0a
[DAG] Combine (sext (sext_in_reg x)) to (sext_in_reg (any_extend x)) (#132386)
2025-03-24 09:31:02 +01:00
Shoreshen
054e0b41a8
[AMDGPU] Add all type for bitcast on VReg_512 (#131775)
Add patterns for all types for bitcast on VReg_512.
2025-03-24 11:52:10 +08:00
Shilei Tian
f1ac2afe21
Reapply "[AMDGPU] Use COV6 by default (#118515)" (#130963)
This reverts commit 68bcba6d7a1cc18996c0bcb7c62267c62d2040d0.
2025-03-21 15:26:45 -04:00
Stephen Thomas
2e3fa4ba9e
[AMDGPU] Insert before and after instructions that always use GDS (#131338)
It is an architectural requirement that there must be no outstanding GDS
instructions when an "always GDS" instruction is issued, and also that
an always GDS instruction must be allowed to complete.

Insert waits on DScnt/LGKMcnt prior to (if necessary) and subsequent to
(unconditionally) any always GDS instruction, and an additional S_NOP if
the subsequent wait was followed by S_ENDPGM.

Always GDS instructions are GWS instructions, DS_ORDERED_COUNT,
DS_ADD_GS_REG_RTN, and DS_SUB_GS_REG_RTN (the latter two are considered
always GDS as of this patch).
2025-03-21 09:33:04 +00:00
Brox Chen
55d3a55cc1
[AMDGPU][True16][CodeGen]disable true16 on fneg test (#132221)
This is an NFC change.

Revert the failed test case in
https://github.com/llvm/llvm-project/pull/131206
2025-03-20 10:32:49 -04:00
Diana Picus
e17b3cdfb3
[AMDGPU] Dynamic VGPR support for llvm.amdgcn.cs.chain (#130094)
The llvm.amdgcn.cs.chain intrinsic has a 'flags' operand which may
indicate that we want to reallocate the VGPRs before performing the
call.

A call with the following arguments:
```
llvm.amdgcn.cs.chain %callee, %exec, %sgpr_args, %vgpr_args,
  /*flags*/0x1, %num_vgprs, %fallback_exec, %fallback_callee
```
is supposed to do the following:
- copy the SGPR and VGPR args into their respective registers
- try to change the VGPR allocation
- if the allocation has succeeded, set EXEC to %exec and jump to
%callee, otherwise set EXEC to %fallback_exec and jump to
%fallback_callee
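
The selection step in the last bullet can be sketched in Python. The function `cs_chain_select` and the boolean `alloc_succeeded` are hypothetical stand-ins for the outcome of the VGPR reallocation; the sketch models only which EXEC mask and callee are chosen, not the register copies or the jump.

```python
def cs_chain_select(alloc_succeeded: bool,
                    exec_mask: int, callee: str,
                    fallback_exec: int, fallback_callee: str) -> tuple[int, str]:
    """Pick the (EXEC, callee) pair depending on whether the allocation succeeded."""
    if alloc_succeeded:
        return exec_mask, callee
    return fallback_exec, fallback_callee


print(cs_chain_select(True, 0xFFFF_FFFF, "callee", 0x1, "fallback_callee"))
print(cs_chain_select(False, 0xFFFF_FFFF, "callee", 0x1, "fallback_callee"))
```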

This patch implements the dynamic VGPR behaviour by generating an
S_ALLOC_VGPR followed by S_CSELECT_B32/64 instructions for the EXEC and
callee. The rest of the call sequence is left undisturbed (i.e.
identical to the case where the flags are 0 and we don't use dynamic
VGPRs). We achieve this by introducing some new pseudos
(SI_CS_CHAIN_TC_Wn_DVGPR) which are expanded in the SILateBranchLowering
pass, just like the simpler SI_CS_CHAIN_TC_Wn pseudos. The main reason
is so that we don't risk other passes (particularly the PostRA
scheduler) introducing instructions between the S_ALLOC_VGPR and the
jump. Such instructions might end up using VGPRs that have been
deallocated, or the wrong EXEC mask. Once the whole backend treats
S_ALLOC_VGPR and changes to EXEC as barriers for instructions that use
VGPRs, we could in principle move the expansion earlier (but in the
absence of a good reason for that my personal preference is to keep it
later in order to make debugging easier).

Since the expansion happens after register allocation, we're careful to
select constants to immediate operands instead of letting ISel generate
S_MOVs which could interfere with register allocation (i.e. make it look
like we need more registers than we actually do).

For GFX12, S_ALLOC_VGPR only works in wave32 mode, so we bail out during
ISel in wave64 mode. However, we can define the pseudos for wave64 too
so it's easy to handle if future generations support it.

---------

Co-authored-by: Ana Mihajlovic <Ana.Mihajlovic@amd.com>
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-03-20 08:38:04 +01:00
Jeffrey Byrnes
6cc918089a
[AMDGPU] Autogen checks for mfma-no-register-aliasing.ll (#132117)
For an upcoming RegisterCoalescer PR
2025-03-19 16:02:51 -07:00
Mariusz Sikora
4f5ccf22fa
[AMDGPU] Support image_bvh8_intersect_ray instruction and intrinsic. (#130041)
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-03-19 16:08:08 +01:00
Emma Pilkington
3eddb992d0
[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167)
Fixes SWDEV-514946
2025-03-19 09:31:54 -04:00
Diana Picus
72c3c30452
[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)

Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.
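
The register count in the note above works out as follows. The per-lane byte figure additionally assumes 4 bytes per VGPR per lane (32-bit registers); that constant and all names in the sketch are illustrative assumptions rather than values taken from the patch.

```python
MAX_DYNAMIC_VGPRS = 128      # theoretical maximum for blocks of 16 registers
PREALLOCATED_VGPRS = 16      # first block, allocated by the fixed function hardware
BYTES_PER_VGPR_PER_LANE = 4  # assumption: each VGPR holds 32 bits per lane

vgprs_in_scratch = MAX_DYNAMIC_VGPRS - PREALLOCATED_VGPRS
bytes_per_lane = vgprs_in_scratch * BYTES_PER_VGPR_PER_LANE

print(f"VGPRs saved to scratch: {vgprs_in_scratch}")
print(f"scratch bytes per lane (under the 4-byte assumption): {bytes_per_lane}")
```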

Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
2025-03-19 13:49:19 +01:00
Matt Arsenault
5b6b4fdb4b
DAG: Fix promote of half freeze (#131844)
2025-03-19 18:30:34 +07:00
Diana Picus
8a53324aa5
[AMDGPU] Deallocate VGPRs before exiting in dynamic VGPR mode (#130037)
In dynamic VGPR mode, Waves must deallocate all VGPRs before exiting. If
the shader program does not do this, hardware inserts `S_ALLOC_VGPR 0`
before S_ENDPGM, but this may incur some performance cost. Therefore
it's better if the compiler proactively generates that instruction.

This patch extends `si-insert-waitcnts` to deallocate the VGPRs via a
`S_ALLOC_VGPR 0` before any `S_ENDPGM` when in dynamic VGPR mode.
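
The effect of the transformation can be sketched on a toy instruction list in Python. Instructions are modelled as plain strings and the function `insert_vgpr_dealloc` is hypothetical; the real pass works on MachineInstrs inside `si-insert-waitcnts`.

```python
def insert_vgpr_dealloc(instructions: list[str], dynamic_vgpr_mode: bool) -> list[str]:
    """Insert `S_ALLOC_VGPR 0` before every S_ENDPGM when in dynamic VGPR mode."""
    if not dynamic_vgpr_mode:
        return list(instructions)
    result = []
    for inst in instructions:
        # Match on the mnemonic only, so operands after S_ENDPGM are ignored.
        if inst.split()[:1] == ["S_ENDPGM"]:
            result.append("S_ALLOC_VGPR 0")
        result.append(inst)
    return result


program = ["V_MOV_B32 v0, 0", "S_ENDPGM 0"]
print(insert_vgpr_dealloc(program, dynamic_vgpr_mode=True))
```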
2025-03-19 09:00:36 +01:00
Shoreshen
b907920058
[AMDGPU] auto-generate file check line for amdgcn.bitcast.ll (#131955)
Replace check lines with auto-generated ones.
2025-03-19 15:40:58 +08:00
Mariusz Sikora
575fde0995
[AMDGPU] Add intrinsic and MI for image_bvh_dual_intersect_ray (#130038)
- Add llvm.amdgcn.image.bvh.dual.intersect.ray intrinsic and
image_bvh_dual_intersect_ray machine instruction.
- Add llvm_v10i32_ty and llvm_v10f32_ty

---------

Co-authored-by: Mateja Marjanovic <mateja.marjanovic@amd.com>
2025-03-19 07:35:09 +01:00
Akshat Oke
6cc23faaac
[AMDGPU][NPM] Port AMDGPUMarkLastScratchLoad to NPM (#131738)
This finishes all passes for the optimized regalloc path.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-03-19 09:27:05 +05:30
Matt Arsenault
5ac680c5bf
AMDGPU: Add more freeze codegen tests (#131843)
2025-03-19 10:17:59 +07:00
Matt Arsenault
428e3a27c3
AMDGPU: Fix attributor not handling all trap intrinsics (#131758)
2025-03-19 10:17:28 +07:00