483 Commits

Author SHA1 Message Date
Mirko Brkušanin
3def49cb64
[AMDGPU] Remove s_wakeup_barrier instruction (#122277) 2025-01-10 11:30:22 +01:00
Jakub Chlanda
01a7d4e26b
[AMDGPU] Allow selection of BITOP3 for some 2 opcodes and B32 cases (#122267)
This came up in downstream static analysis as dead code.

Admittedly, it depends on what the intention was when checking for [`if
(NumOpcodes == 2 &&
IsB32)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3792C3-L3792C32),
and I took a guess that the selection should take place for certain
cases.

If that's incorrect, that whole if statement can be removed, since it
comes after a check for [`if (NumOpcodes <
4)`](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp#L3788).
2025-01-10 07:49:11 +01:00
Brox Chen
c744ed53a8
[AMDGPU][True16][MC] disable incorrect VOPC t16 instruction (#120271)
The current VOPC t16 instructions are not implemented with the correct
t16 pseudo. Thus the current t16/fake16 instructions are all in fake16
format.

The plan is to remove the incorrect t16 instructions and refactor them.
The first step, done in this patch, is to remove them. The next step
will be to update the t16/fake16 pseudos to the correct format and add
back the true16 instructions one by one in upcoming patches.
2025-01-03 11:58:04 -05:00
Matt Arsenault
92ba7e3973
AMDGPU/GlobalISel: Do not try to form v_bitop3_b32 for SGPR results (#117940) 2024-11-30 20:21:20 -05:00
Matt Arsenault
b4a16a78c2
AMDGPU: Match and Select BITOP3 on gfx950 (#117843)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2024-11-27 01:31:19 -05:00
Matt Arsenault
2b9e947d43
AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743)
OPSEL ASM syntax for v_cvt_scalef32_pk_f32_fp4: opsel:[x,y,z],
where x & y, i.e. OPSEL[1:0], select which src_byte to read.

OPSEL ASM syntax for v_cvt_scalef32_pk_fp4_f32: opsel:[a,b,c,d],
where c & d, i.e. OPSEL[3:2], select which dst_byte to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:20:09 -05:00
Matt Arsenault
62584f32eb
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8|bf8} for gfx950 (#117741)
OPSEL[0] determines low/high 16 bits of src0 to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:12:18 -05:00
Matt Arsenault
803bd812b1
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8|bf8}_f32 for gfx950 (#117740)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:57:09 -05:00
Matt Arsenault
815069c701
AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16|f32]_[bf8|fp8] (#117739)
OPSEL[1:0] collectively decide which byte to read
from the src input.

The builtin takes an additional imm argument which
represents the index (valid values: [0:3]) of the src
byte to read. Out-of-bounds checks will be added in the
next patch.

OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}

Note: Inst{13}, i.e. OPSEL[2], is ignored in the
asm syntax, and opsel[z] is meaningless
for v_cvt_scalef32_f32_{fp|bf}8.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:54:10 -05:00
Piotr Sobczak
a96ec01e1a
[AMDGPU] Optimize out s_barrier_signal/_wait (#116993)
Extend the optimization that converts s_barrier to wave_barrier (nop)
when the number of work items is not larger than wave size.
    
This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the gfx12 path where s_barrier is emitted and later split already
has this optimization, but some front-ends may prefer to emit the split
intrinsics directly; that case is what this patch addresses.
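
For reference, a minimal LLVM IR sketch of the split-barrier form this
optimization targets (the intrinsic signatures and the -1 immediate are
assumptions written from memory, not taken from the patch):

```
; When the workgroup is known to fit in a single wave, the signal/wait pair
; below can be dropped (i.e. turned into a wave_barrier-style no-op).
; Signatures here are assumed and may not match the exact definitions.
declare void @llvm.amdgcn.s.barrier.signal(i32 immarg)
declare void @llvm.amdgcn.s.barrier.wait(i16 immarg)

define amdgpu_kernel void @sync_within_wave() {
  call void @llvm.amdgcn.s.barrier.signal(i32 -1) ; -1 assumed to mean the default workgroup barrier
  call void @llvm.amdgcn.s.barrier.wait(i16 -1)
  ret void
}
```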
2024-11-26 10:04:32 +01:00
Matt Arsenault
d1cca3133a
AMDGPU: Add v_permlane16_swap_b32 and v_permlane32_swap_b32 for gfx950 (#117260)
This was a bit annoying because these introduce a new special case
encoding usage. op_sel is repurposed as a subset of dpp controls,
and is eligible for VOP3->VOP1 shrinking. For some reason fi also
uses an enum value, so we need to convert the raw boolean to 1 instead
of -1.

The 2 registers are swapped, so this has 2 defs. Ideally the builtin
would return a pair, but that's difficult so return a vector instead.
This would make a hypothetical builtin that supports v2f16 directly
uglier.
2024-11-22 20:12:50 -08:00
Matt Arsenault
7d544c64e3
AMDGPU: Add v_smfmac_f32_32x32x64_fp8_fp8 for gfx950 (#117259) 2024-11-22 12:11:06 -08:00
Matt Arsenault
90dc644d73
AMDGPU: Add v_smfmac_f32_32x32x64_fp8_bf8 for gfx950 (#117258) 2024-11-22 12:08:15 -08:00
Matt Arsenault
8d3435f8a1
AMDGPU: Add v_smfmac_f32_32x32x64_bf8_fp8 for gfx950 (#117257) 2024-11-22 12:02:18 -08:00
Matt Arsenault
8a5c24149d
AMDGPU: Add v_smfmac_f32_32x32x64_bf8_bf8 for gfx950 (#117256) 2024-11-22 11:59:06 -08:00
Brox Chen
4cc278587f
[AMDGPU][True16][MC] VOPC profile fake16 pseudo update (#113175)
Update VOPC profile with VOP3 pseudo:

1. On GFX11+, v_cmp_class_f16 has src1 type f16 for literals; however,
it is semantically interpreted as an integer. Update the VOPC class f16
profile from operand types f16, i16 to f16, f16. This patch updates the
fake16 format; the t16 format will be updated in a following patch.
2. 16-bit V_CMP_CLASS instructions (V_CMP_**_U/I/F16) are named with
`t16` but actually use 32-bit registers. Correct this by updating the
pseudo definitions with useRealTrue16/useFakeTrue16 predicates and
renaming these `t16` instructions to `fake16`.
3. Update the instruction selector so that `t16`/`fake16` instructions
are selected in the true16/fake16 flow.
4. The MIR test files are affected by the renaming of these 16-bit
V_CMP instructions, but there is no functional change to the emitted code.
2024-11-22 12:12:13 -05:00
Matt Arsenault
836d2dcf60
AMDGPU: Add v_smfmac_f32_16x16x128_fp8_fp8 for gfx950 (#117235) 2024-11-21 17:06:06 -08:00
Matt Arsenault
33124910c9
AMDGPU: Add v_smfmac_f32_16x16x128_fp8_bf8 for gfx950 (#117234) 2024-11-21 17:03:03 -08:00
Matt Arsenault
3678f8a8aa
AMDGPU: Add v_smfmac_f32_16x16x128_bf8_fp8 for gfx950 (#117233) 2024-11-21 17:00:08 -08:00
Matt Arsenault
7baadb2a4e
AMDGPU: Add v_smfmac_f32_16x16x128_bf8_bf8 for gfx950 (#117232) 2024-11-21 16:57:01 -08:00
Matt Arsenault
3e6f3508ad
AMDGPU: Add v_smfmac_i32_32x32x64_i8 for gfx950 (#117214) 2024-11-21 15:01:03 -08:00
Matt Arsenault
8c53036146
AMDGPU: Add v_smfmac_i32_16x16x128_i8 for gfx950 (#117213) 2024-11-21 14:58:11 -08:00
Matt Arsenault
42dd114a46
AMDGPU: Add v_smfmac_f32_32x32x32_bf16 for gfx950 (#117212) 2024-11-21 14:52:11 -08:00
Matt Arsenault
95ddc1a63b
AMDGPU: Add v_smfmac_f32_16x16x64_bf16 for gfx950 (#117211) 2024-11-21 14:46:43 -08:00
Matt Arsenault
e50eaa2cf1
AMDGPU: Add v_smfmac_f32_32x32x32_f16 for gfx950 (#117205) 2024-11-21 14:43:33 -08:00
Matt Arsenault
2ab178820b
AMDGPU: Add v_smfmac_f32_16x16x64_f16 for gfx950 (#117202) 2024-11-21 14:40:30 -08:00
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00
Jay Foad
ade0750e35
[AMDGPU] Fix some cache policy checks for GFX12+ (#116396)
Fix coding errors found by inspection and check that the swz bit still
serves to prevent merging of buffer loads/stores on GFX12+.
2024-11-21 08:22:59 +00:00
Matt Arsenault
927032807d
AMDGPU: Handle gfx950 96/128-bit buffer_load_lds (#116681)
Enforcing this limit in the clang builtin will come later.
2024-11-18 22:01:56 -08:00
Matt Arsenault
50224bd5ba
AMDGPU: Handle gfx950 global_load_lds_* instructions (#116680)
Define global_load_lds_dwordx3 and global_load_lds_dwordx4.
Oddly, it seems dwordx2 was skipped.
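
For context, a minimal LLVM IR sketch of the underlying intrinsic; the
meanings of the trailing operands (size in bytes, offset, aux bits) and
the size value for dwordx4 are assumptions, not taken from the patch:

```
; llvm.amdgcn.global.load.lds copies data from global memory directly into
; LDS. The last three operands are assumed to be size-in-bytes, offset, and
; aux/cache-policy bits; a size of 16 would correspond to dwordx4.
declare void @llvm.amdgcn.global.load.lds(ptr addrspace(1), ptr addrspace(3), i32 immarg, i32 immarg, i32 immarg)

define amdgpu_kernel void @copy_dwordx4_to_lds(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr) {
  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 16, i32 0, i32 0)
  ret void
}
```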
2024-11-18 21:58:02 -08:00
Matt Arsenault
9eefa922f8
AMDGPU/GlobalISel: Remove getVRegDef null checks in selector (#115530)
We should be able to assume every virtual register is defined.
2024-11-11 12:58:06 -08:00
Kazu Hirata
10b80ff0cc
[Target] Migrate away from PointerUnion::{is,get,dyn_cast} (NFC) (#115623)
Note that PointerUnion::{is,get,dyn_cast} have been soft deprecated in
PointerUnion.h:

  // FIXME: Replace the uses of is(), get() and dyn_cast() with
  //        isa<T>, cast<T> and the llvm::dyn_cast<T>
2024-11-09 17:22:57 -08:00
Gang Chen
8c752900dd
[AMDGPU] modify named barrier builtins and intrinsics (#114550)
Use a local pointer type to represent the named barrier in the builtin
and intrinsic. This makes the definitions more user-friendly because
users do not need to worry about the hardware ID assignment. This
approach is also more in line with other popular GPU programming
languages. Named barriers should be represented as global variables in
addrspace(3) in LLVM IR. The compiler assigns special LDS offsets for
those variables during the AMDGPULowerModuleLDS pass, and those
addresses are converted to hw barrier IDs during instruction selection.
The rest of the instruction-selection changes are primarily due to the
intrinsic-definition changes.
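
A rough IR sketch of the shape described above; the barrier type, the
intrinsic name, and its signature below are illustrative assumptions,
not the exact definitions from this patch:

```
; A named barrier as an addrspace(3) (LDS) global; AMDGPULowerModuleLDS
; assigns it a special LDS offset, and instruction selection turns the
; address into a hw barrier ID. Names and types here are illustrative only.
@named.bar = internal addrspace(3) global target("amdgcn.named.barrier", 0) poison

define amdgpu_kernel void @use_named_barrier() {
  ; hypothetical pointer-based signal intrinsic taking the barrier variable
  call void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3) @named.bar, i32 0)
  ret void
}

declare void @llvm.amdgcn.s.barrier.signal.var(ptr addrspace(3), i32)
```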
2024-11-06 10:37:22 -08:00
Rahul Joshi
fa789dffb1
[NFC] Rename Intrinsic::getDeclaration to getOrInsertDeclaration (#111752)
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation for
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e., just lookup, no creation).
2024-10-11 05:26:03 -07:00
Petar Avramovic
7b0d56be1d
AMDGPU/GlobalISel: Fix inst-selection of ballot (#109986)
Both the input and output of ballot are lane masks: the result is a
lane mask with S32/S64 LLT and SGPR bank, and the input is a lane mask
with S1 LLT and VCC reg bank. Ballot copies bits from the input lane
mask for all active lanes and puts 0 for inactive lanes. GlobalISel did
not set 0 in the result for inactive lanes when the input was
non-constant.
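
A minimal IR sketch of the problematic case (wave64 assumed), using the
existing llvm.amdgcn.ballot intrinsic:

```
; Ballot with a non-constant, divergent input: the result must have 0 bits
; for inactive lanes, which GlobalISel previously failed to guarantee.
declare i64 @llvm.amdgcn.ballot.i64(i1)

define i64 @lanes_with_positive_input(i32 %x) {
  %cond = icmp sgt i32 %x, 0                        ; non-constant, per-lane condition
  %mask = call i64 @llvm.amdgcn.ballot.i64(i1 %cond)
  ret i64 %mask
}
```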
2024-10-11 11:40:27 +02:00
Matt Arsenault
c36f902372
AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (#111720)
Fixes missing m0 initialization for pre-gfx9 targets with local
extending loads.
2024-10-10 14:01:49 +04:00
Shilei Tian
48ac846fbc
[AMDGPU][GlobalISel] Align selectVOP3PMadMixModsImpl with the SelectionDAG counterpart (#110168)
The current `selectVOP3PMadMixModsImpl` can produce a `V_MAD_MIX_F32`
instruction that violates the constant bus restriction, while its
`SelectionDAG` counterpart doesn't. The culprit is in the copy
stripping, while the `SelectionDAG` version only has bitcast stripping.
This PR simply aligns the two versions.
2024-10-08 09:41:24 -04:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Nikita Popov
cee0bf9626
[AMDGPU] Use Lo_32 and Hi_32 helpers (NFC) (#109413) 2024-09-20 14:35:38 +02:00
Diana Picus
3356208531
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)
This reverts commit
7792b4ae79.

The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).

...if only we had a merge queue.
2024-09-13 11:54:30 +02:00
Diana Picus
7792b4ae79
Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)"" (#108341)
Reverts llvm/llvm-project#108173

si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
2024-09-12 10:12:09 +02:00
Diana Picus
703ebca869
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)
This reverts commit
c7a7767fca.

The buildbots failed because I removed a MI from its parent before
updating LIS. This PR should fix that.
2024-09-12 09:11:41 +02:00
Vitaly Buka
c7a7767fca
Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)
Breaks bots, see #105822.

Reverts llvm/llvm-project#105822
2024-09-10 09:51:43 -07:00
Diana Picus
44556e64f2
[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
2024-09-10 13:24:53 +02:00
Stanislav Mekhanoshin
0745219d4a
[AMDGPU] Add target intrinsic for s_buffer_prefetch_data (#107293) 2024-09-06 11:41:21 -07:00
Stanislav Mekhanoshin
bd840a4004
[AMDGPU] Add target intrinsic for s_prefetch_data (#107133) 2024-09-05 15:14:31 -07:00
Changpeng Fang
26b0bef192
AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761)
Use a GCNPat instead of custom lowering to select instructions for the
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" translates the rounding modes to the
corresponding ones that the hardware recognizes.
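
A minimal IR sketch of the intrinsic being selected; the mangled name
and the rounding-mode string are written from memory and may need
adjusting:

```
; llvm.fptrunc.round: an fptrunc with an explicit rounding mode, now matched
; by a GCNPat only for modes the hardware supports.
declare half @llvm.fptrunc.round.f16.f32(float, metadata)

define half @trunc_round_down(float %x) {
  %r = call half @llvm.fptrunc.round.f16.f32(float %x, metadata !"round.downward")
  ret half %r
}
```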
2024-08-29 11:43:58 -07:00
Juan Manuel Martinez Caamaño
cbf34a5f77
[AMDGPU] Remove dead pass: AMDGPUMachineCFGStructurizer (#105645) 2024-08-23 14:06:17 +02:00
Brox Chen
afd42fb303
[AMDGPU][True16][CodeGen] Support AND/OR/XOR and LDEXP True16 format (#102620)
Support AND/OR/XOR true16 and LDEXP true/fake16 format.

These instructions were previously implemented with the fake16 profile;
this fixes the implementation.

Added an RA hint so that when a 16-bit register is used in a 32-bit
instruction, the register is used directly without an extra 16-bit
move.

---------

Co-authored-by: guochen2 <guochen2@amd.com>
2024-08-13 12:23:39 -04:00
Simon Pilgrim
11ba72e651
[KnownBits] Add KnownBits::add and KnownBits::sub helper wrappers. (#99468) 2024-08-12 10:21:28 +01:00