10135 Commits

Author SHA1 Message Date
Craig Topper
dd3edc8365
[CodeGen] Add Register::stackSlotIndex(). Replace uses of Register::stackSlot2Index. NFC (#125028) 2025-01-29 23:02:07 -08:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Joel E. Denny
18f8106f31
[KernelInfo] Implement new LLVM IR pass for GPU code analysis (#102944)
This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics is to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
2025-01-29 12:40:19 -05:00
Konstantina Mitropoulou
9adc99bcc5
[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028)
- **[NFC] Use GCNPat instead of Pat.**
- **[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point
branches.**

---------

Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>
2025-01-29 09:00:40 -08:00
Juan Manuel Martinez Caamaño
0c63ec5347
[NFC][SIWholeQuadMode] Remove redundant arguments (#124930) 2025-01-29 16:33:15 +01:00
Juan Manuel Martinez Caamaño
2e43f39223
[NFC][SIWholeQuadMode] Perform less lookups (#124927) 2025-01-29 15:36:54 +01:00
Acim Maravic
3a29dfe37c
[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616) 2025-01-29 14:04:10 +01:00
Ivan Kosarev
983562d8c5
[AMDGPU][NFC] Simplify t16/fake16 TableGen definitions. (#122693)
Infer mnemonics from the names of the records.
2025-01-29 12:46:05 +00:00
Akshat Oke
71edfd6230
[AMDGPU][NewPM] Sketch out a AMDGPUPassRegistry skeleton (#124785)
Add a dummy pass skeleton list to help track the progress in porting
passes to NPM.
2025-01-29 13:26:50 +05:30
Daniil Fukalov
68d90cff58
[AMDGPU][GlobalISel] Fix assert on APInt creation. (#124608)
Since 3494ee95902cef62f767489802e469c58a13ea04, APInt no longer implicitly
truncates values, so it asserts when a big signed value is (implicitly)
converted to an unsigned APInt.

The change explicitly marks offset as a signed value.
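
The failure mode can be illustrated without LLVM. The sketch below is not the APInt code; `fitsUnsigned` and `fitsSigned` are hypothetical helpers that only model the "does this value fit in N bits" check, assuming a negative offset is the kind of big signed value the message refers to.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: does Val, read as an unsigned number, fit in NumBits?
bool fitsUnsigned(uint64_t Val, unsigned NumBits) {
  assert(NumBits >= 1 && NumBits <= 64 && "unsupported width");
  return NumBits == 64 || (Val >> NumBits) == 0;
}

// Hypothetical helper: does Val, read as a signed number, fit in NumBits?
bool fitsSigned(int64_t Val, unsigned NumBits) {
  assert(NumBits >= 1 && NumBits <= 64 && "unsupported width");
  if (NumBits == 64)
    return true;
  const int64_t Min = -(int64_t(1) << (NumBits - 1));
  const int64_t Max = (int64_t(1) << (NumBits - 1)) - 1;
  return Val >= Min && Val <= Max;
}
```

A negative 64-bit offset has all of its high bits set, so it fails the unsigned check for a 32-bit width while passing the signed one. That is why stating the signedness explicitly avoids the assertion.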
2025-01-28 15:53:17 +01:00
Akshat Oke
42432ada8e
[AMDGPU][NFC] Sort AMDGPUPassRegistry entries alphabetically (#124544) 2025-01-28 11:25:56 +05:30
Matt Arsenault
cc97653d53
AMDGPU: Custom lower 32-bit element shuffles (#123711)
This is so we can try to make use of v_pk_mov_b32 when available.
Note this currently has little observable effect. The combiner
will undo the common extract of shuffle pattern. The lack
of test changes should demonstrate this change is minimally
correct.

We should probably try to make better use of wider extracts in
even aligned cases, but I'm trying to avoid some really ugly
regalloc regressions in some MFMA tests. The DAG scheduler ends
up doing a worse job if we use vector extracts, resulting
in failure to do 3 address conversion of MFMAs.
2025-01-28 11:17:10 +07:00
Shilei Tian
6e4105574e
[NFC][AMDGPU] Improve code introduced in #124607 (#124672) 2025-01-27 22:57:16 -05:00
Shilei Tian
3b2b7ec07d
[AMDGPU] Handle invariant marks in AMDGPUPromoteAllocaPass (#124607)
Fixes SWDEV-509327.
2025-01-27 17:30:50 -05:00
Brox Chen
5d1c596ab4
[AMDGPU][True16][MC] true16 for minimummaximum/max/min/max3/min3 (#124184)
true16 support for gfx12 instructions including:

v_minimummaximum_f16
v_maximumminimum_f16
v_maximum_f16
v_minimum_f16
v_maximum3_f16
v_minimum3_f16
2025-01-27 16:52:59 -05:00
Jeffrey Byrnes
e77d428e46
[AMDGPU] Do not remat instructions with PhysReg uses (#124366)
This blocks rematerialization during scheduling if the instruction has a
non-accepted PhysReg use.

Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
2025-01-27 10:50:06 -08:00
Brox Chen
d1139b32d2
[AMDGPU][True16][CodeGen] true16 codegen pats for v_mad_u16 (#124000)
true16 codegen pats for v_mad_u16 (mul+add)
2025-01-27 13:47:17 -05:00
Jeremy Morse
81d18ad864
[NFC][DebugInfo] Make some block-start-position methods return iterators (#124287)
As part of the "RemoveDIs" work to eliminate debug intrinsics, we're
replacing methods that use Instruction*'s as positions with iterators. A
number of these (such as getFirstNonPHIOrDbg) are sufficiently
infrequently used that we can just replace the pointer-returning version
with an iterator-returning version, hopefully without much/any
disruption.

Thus this patch has getFirstNonPHIOrDbg and
getFirstNonPHIOrDbgOrLifetime return an iterator, and updates all
call-sites. There are no concerns about the iterators returned being
converted to Instruction*'s and losing the debug-info bit: because the
methods skip debug intrinsics, the iterator head bit is always false
anyway.
2025-01-27 16:27:54 +00:00
Brox Chen
62340ff8d8
[AMDGPU][True16][MC] true16 for v_cmpx_xx_f16 (#123419)
A bulk commit of true16 support for v_cmpx_xx_f16 instructions
including:

v_cmpx_f_f16
v_cmpx_le_f16
v_cmpx_gt_f16
v_cmpx_lg_f16
v_cmpx_ge_f16
v_cmpx_o_f16
v_cmpx_u_f16
v_cmpx_nge_f16
v_cmpx_nlg_f16
v_cmpx_ngt_f16
v_cmpx_nle_f16
v_cmpx_neq_f16
v_cmpx_nlt_f16
v_cmpx_t_f16

v_cmpx_eq_f16 is not in this patch and will be added in the following
patch
2025-01-27 10:12:20 -05:00
Craig Topper
f46eb14309
[AMDGPU] Replace unsigned with Register in SIMachineScheduler. NFC
Some of these may eventually need to be VirtRegOrUnit.
2025-01-26 00:26:00 -08:00
Brox Chen
241e5d8c5c
[AMDGPU][True16][MC] true16 for v_cmpx_eq_f16 (#124038)
True16 format for v_cmpx_eq_f16.

Also cleaned up some stray gfx11 check lines in the gfx12 dasm test
2025-01-24 18:15:40 -05:00
Brox Chen
ec66c4af09
[AMDGPU][True16][CodeGen] true16 codegen pattern for f16 canonicalize (#122000)
true16 codegen pattern for f16 canonicalize
2025-01-24 10:44:00 -05:00
Aaditya
11b0401926
[AMDGPU] Restore SP from saved-FP or saved-BP (#124007)
Currently, the AMDGPU backend bumps the Stack Pointer 
by fixed size offsets in the prolog of device functions, and 
restores it by the same amount in the epilog.
Prolog:
sp += frameSize

Epilog:
sp -= frameSize

If a function has dynamic stack realignment,
Prolog:
sp += frameSize + max_alignment

Epilog:
sp -= frameSize + max_alignment

These calculations are not optimal in case of dynamic 
stack realignment, and completely fail in case of 
dynamic stack readjustment.
This patch uses the saved Frame Pointer to restore SP. 
Prolog:
fp = sp
sp += frameSize

Epilog:
sp = fp

In case of dynamic stack realignment, SP is restored from 
the saved Base Pointer. 
Prolog:
fp = sp + (max_alignment - 1)
fp = fp & (-max_alignment)
bp = sp
sp += frameSize + max_alignment

Epilog:
sp = bp

(Note: The presence of BP has been enforced in case of any 
dynamic stack realignment.)
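
The pointer arithmetic above can be checked with a small model. This is a sketch only: `Frame` and the four helpers are hypothetical names, registers are plain integers, the stack grows upward as in the pseudocode, and saving and restoring the caller's FP/BP is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register file holding only the three pointers discussed above.
struct Frame {
  uint64_t SP, FP, BP;
};

// Without realignment: fp = sp; sp += frameSize.
void prolog(Frame &F, uint64_t FrameSize) {
  F.FP = F.SP;
  F.SP += FrameSize;
}

// Epilog restores SP from FP, independent of what happened to SP in between.
void epilog(Frame &F) { F.SP = F.FP; }

// With realignment: FP is the aligned frame base, BP remembers the incoming SP.
void prologRealign(Frame &F, uint64_t FrameSize, uint64_t MaxAlign) {
  assert(MaxAlign != 0 && (MaxAlign & (MaxAlign - 1)) == 0 &&
         "alignment must be a power of two");
  // ~(MaxAlign - 1) is the same mask as -MaxAlign in two's complement.
  F.FP = (F.SP + (MaxAlign - 1)) & ~(MaxAlign - 1);
  F.BP = F.SP;
  F.SP += FrameSize + MaxAlign;
}

// Epilog restores SP from BP.
void epilogRealign(Frame &F) { F.SP = F.BP; }
```

Because the epilog copies a saved value instead of subtracting a constant, any extra adjustment made to SP inside the function body (for example a dynamically sized allocation) is undone as well.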

---------

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-24 19:13:40 +05:30
Petar Avramovic
4831fa8632
AMDGPU/GlobalISel: RegBankLegalize rules for load (#112882)
Add IDs for bit width that cover multiple LLTs: B32 B64 etc.
"Predicate" wrapper class for bool predicate functions used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads to not change existing mir tests from old
regbankselect.
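
A minimal sketch of how such a combinable predicate wrapper could look follows. It is not the class from this patch: `FakeInst` and its fields are hypothetical stand-ins for whatever the real predicates inspect, and only the `&&`, `||` and `!` composition idea is taken from the description above.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for the instruction a rule is matched against.
struct FakeInst {
  unsigned SizeInBits;
  bool IsUniform;
};

// Wraps a bool-returning function so that predicates compose with &&, || and !.
class Predicate {
  std::function<bool(const FakeInst &)> Fn;

public:
  Predicate(std::function<bool(const FakeInst &)> F) : Fn(std::move(F)) {}

  bool operator()(const FakeInst &I) const { return Fn(I); }

  Predicate operator&&(const Predicate &RHS) const {
    Predicate LHS = *this;
    return Predicate(
        [LHS, RHS](const FakeInst &I) { return LHS(I) && RHS(I); });
  }

  Predicate operator||(const Predicate &RHS) const {
    Predicate LHS = *this;
    return Predicate(
        [LHS, RHS](const FakeInst &I) { return LHS(I) || RHS(I); });
  }

  Predicate operator!() const {
    Predicate Inner = *this;
    return Predicate([Inner](const FakeInst &I) { return !Inner(I); });
  }
};
```

With two leaf predicates `IsB32` and `IsUniform`, a rule condition can then be written as `IsB32 && !IsUniform`. Building the combined predicate evaluates nothing; the lambdas still short-circuit when the rule is finally applied to an instruction.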
2025-01-24 12:36:41 +01:00
Petar Avramovic
0ee037b861
AMDGPU/GlobalISel: AMDGPURegBankLegalize (#112864)
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"

Given LLTs on all operands after legalizer, some register bank
assignments require lowering while others do not.
Note: cases where all register bank assignments would require lowering
are lowered in legalizer.

AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- Goal of defining Rules is to provide high level table-like brief
  overview of how to lower generic instructions based on available
  target features and uniformity info (uniform vs divergent).
- Fast search of Rules, depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
  all the same just for different combinations of types and banks.
  Write custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
  Names of IDs are meant to give brief description what lowering does
  for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements lowering algorithms

Since this is the first patch that actually enables -new-reg-bank-select
here is the summary of regression tests that were added earlier:
- if instruction is uniform always select SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
  G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
  SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
  is defined as uniform(AMDGPURegBankSelect) - missing temporal
  divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
  fixed in AMDGPURegBankSelect via another fix in machine uniformity
  analysis.
2025-01-24 12:12:45 +01:00
Jeremy Morse
8e70273509
[NFC][DebugInfo] Use iterator moveBefore at many call-sites (#123583)
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; however, to ensure some type safety, we'd like to
have all calls to moveBefore use iterators.

This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.

We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
2025-01-24 10:53:11 +00:00
Petar Avramovic
f8a56df36e
AMDGPU/GlobalISel: AMDGPURegBankSelect (#112863)
Assign register banks to virtual registers. Does not use generic
RegBankSelect. After register bank selection all register operands of
G_ instructions have LLT and register banks exclusively. If they had
register class, reassign appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non S1, values
Vcc  - divergent S1 values(lane masks)
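
The basic three-way mapping can be written as a small function. This is a sketch of the default rule only: `ValueInfo` and `pickBank` are hypothetical names, and the lane-mask and temporal-divergence exceptions described in this message are deliberately ignored.

```cpp
// The three banks named in the list above.
enum class RegBank { Sgpr, Vgpr, Vcc };

// Hypothetical summary of what uniformity analysis reports for one value.
struct ValueInfo {
  bool IsDivergent;
  unsigned SizeInBits;
};

// Default rule: uniform -> Sgpr, divergent S1 -> Vcc, other divergent -> Vgpr.
RegBank pickBank(const ValueInfo &V) {
  if (!V.IsDivergent)
    return RegBank::Sgpr;
  return V.SizeInBits == 1 ? RegBank::Vcc : RegBank::Vgpr;
}
```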

AMDGPURegBankSelect does not consider available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:
S32/S64 lane masks:
- need to end up with sgpr register class after instruction selection
- In most cases Uniformity analysis declares them as uniform
  (forced by tablegen) resulting in sgpr S32/S64 reg bank
- When Uniformity analysis declares them as divergent (some phis),
  use intrinsic lane mask analyzer to still assign sgpr register bank
temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside of the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that assigned bank is vgpr
Note: uniformity analysis does not consider that registers with vgpr def
are divergent (you can have uniform value in vgpr).
- TODO: implicit use of $exec could be implemented as indicator
  that instruction is divergent
2025-01-24 11:06:02 +01:00
Frederik Harwath
bfd9bc2745
[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#124131)
This PR reapplies the changes from PR #123942 which had to be reverted
because of a test failure. The test has been adjusted.
2025-01-24 09:12:32 +01:00
Chaitanya
3c79a04cc2
[AMDGPU] Add amdgpu-sw-lower-lds pass to NPM codegen addIRPasses. (#124102)
This PR adds amdgpu-sw-lower-lds pass to
AMDGPUCodeGenPassBuilder::addIRPasses()
2025-01-24 11:15:30 +05:30
Acim Maravic
7ddeea3598
[LLVM][AMDGPU] MC support for ds_bpermute_fi_b32 (#124108)
Added assembler/disassembler support for ds_bpermute_fi_b32 instruction,
as well as tests.
2025-01-23 17:55:00 +01:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither bound of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies, i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU at any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
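
The numbers in this example can be reproduced with a brute-force sketch. It is not the constant-time computation from this commit: it simply tries every group size in the range. `TargetModel`, `occupancyForGroupSize` and `occupancyRange` are hypothetical names, LDS and per-CU group limits are ignored, and rounding the per-EU figure down is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical target description; defaults are the values from the example.
struct TargetModel {
  unsigned MaxWavesPerEU = 10;
  unsigned EUsPerCU = 4;
  unsigned WaveSize = 64;
};

// Waves per EU when a CU is filled with as many whole groups as fit.
unsigned occupancyForGroupSize(const TargetModel &T, unsigned GroupSize) {
  assert(GroupSize > 0 && "group size must be positive");
  unsigned WavesPerGroup = (GroupSize + T.WaveSize - 1) / T.WaveSize;
  unsigned WaveSlotsPerCU = T.MaxWavesPerEU * T.EUsPerCU;
  unsigned GroupsPerCU = WaveSlotsPerCU / WavesPerGroup;
  // Assumption: round down when the waves do not divide evenly across EUs.
  return GroupsPerCU * WavesPerGroup / T.EUsPerCU;
}

// Brute-force {min, max} occupancy over every size in [MinSize, MaxSize].
std::pair<unsigned, unsigned> occupancyRange(const TargetModel &T,
                                             unsigned MinSize,
                                             unsigned MaxSize) {
  assert(MinSize <= MaxSize && "empty workgroup size range");
  unsigned Lo = occupancyForGroupSize(T, MinSize);
  unsigned Hi = Lo;
  for (unsigned Size = MinSize + 1; Size <= MaxSize; ++Size) {
    unsigned Occ = occupancyForGroupSize(T, Size);
    Lo = std::min(Lo, Occ);
    Hi = std::max(Hi, Occ);
  }
  return {Lo, Hi};
}
```

Under these assumptions, sizes 513 and 1024 give 9 and 8 waves per EU, while 640 and 896 give 10 and 7, so the scan over [513,1024] yields the range [7,10] stated above.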

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Nico Weber
99d450e9f5
Revert "[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)"
This reverts commit 6fdaaafd89d7cbc15dafe3ebf1aa3235d148aaab.
Breaks check-llvm, see
https://github.com/llvm/llvm-project/pull/123942#issuecomment-2609861953
2025-01-23 09:19:42 -05:00
Matt Arsenault
e28e93550a
AMDGPU: Make vector_shuffle legal for v2i32 with v_pk_mov_b32 (#123684)
For VALU shuffles, this saves an instruction in some case.
2025-01-23 20:58:02 +07:00
Kareem Ergawy
ff55c9bc63
[llvm][amdgpu] Handle indirect refs to LDS GVs during LDS lowering (#124089)
Fixes #123800

Extends LDS lowering by allowing it to discover transitive
indirect/escaping references to LDS GVs.

For example, given the following input:
```llvm
@lds_item_to_indirectly_load = internal addrspace(3) global ptr undef, align 8

%store_type = type { i32, ptr }
@place_to_store_indirect_caller = internal addrspace(3) global %store_type undef, align 8

define amdgpu_kernel void @offloading_kernel() {
  store ptr @indirectly_load_lds, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) @place_to_store_indirect_caller, i32 0), align 8
  call void @call_unknown()
  ret void
}

define void @call_unknown() {
  %1 = alloca ptr, align 8
  %2 = call i32 %1()
  ret void
}

define void @indirectly_load_lds() {
  call void @directly_load_lds()
  ret void
}

define void @directly_load_lds() {
  %2 = load ptr, ptr addrspace(3) @lds_item_to_indirectly_load, align 8
  ret void
}

```

With the above input, prior to this patch, LDS lowering failed to lower
the reference to `@lds_item_to_indirectly_load` because:
1. it is indirectly called by a function whose address is taken in the
kernel.
2. we did not check if the kernel indirectly makes any calls to unknown
functions (we only checked the direct calls).
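
The reachability question behind these two points can be sketched on a toy call graph. This is a simplified model, not the LDS lowering code: `FuncInfo`, `Module` and `reachableLDS` are hypothetical names, and the rule "an unknown call may reach every address-taken function" is an assumption made for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-function summary.
struct FuncInfo {
  std::vector<std::string> DirectCallees;
  std::set<std::string> LDSUses; // LDS globals referenced directly
  bool HasUnknownCall = false;   // calls through a pointer
  bool AddressTaken = false;     // address escapes somewhere in the module
};
using Module = std::map<std::string, FuncInfo>;

// Collect every LDS global a kernel may reach, directly or transitively.
std::set<std::string> reachableLDS(const Module &M, const std::string &Kernel) {
  std::set<std::string> Visited, LDS;
  std::vector<std::string> Worklist{Kernel};
  bool AddedAddressTaken = false;
  while (!Worklist.empty()) {
    std::string Name = Worklist.back();
    Worklist.pop_back();
    if (!Visited.insert(Name).second)
      continue; // already processed
    auto It = M.find(Name);
    if (It == M.end())
      continue; // declaration only, nothing to inspect
    const FuncInfo &F = It->second;
    LDS.insert(F.LDSUses.begin(), F.LDSUses.end());
    for (const std::string &Callee : F.DirectCallees)
      Worklist.push_back(Callee);
    // Assumption: an unknown call anywhere on the path may target any
    // address-taken function, so all of them become reachable.
    if (F.HasUnknownCall && !AddedAddressTaken) {
      AddedAddressTaken = true;
      for (const auto &Entry : M)
        if (Entry.second.AddressTaken)
          Worklist.push_back(Entry.first);
    }
  }
  return LDS;
}
```

Modelling the input above, the kernel calls `call_unknown` directly, `call_unknown` makes the unknown call, and `indirectly_load_lds` is address-taken, so `lds_item_to_indirectly_load` ends up in the kernel's reachable set. Checking for unknown calls only in the kernel itself would miss it.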

Co-authored-by: Jon Chesterfield <jonathan.chesterfield@amd.com>
2025-01-23 14:53:11 +01:00
Frederik Harwath
6fdaaafd89
[AMDGPU] SIPeepholeSDWA: Disable on existing SDWA instructions (#123942)
This is meant as a short-term workaround for an invalid conversion in
this pass that occurs because existing SDWA selections are not correctly
taken into account during the conversion.

See the draft PR #123221 for an attempt to fix the actual issue.

---------

Co-authored-by: Frederik Harwath <fharwath@amd.com>
2025-01-23 14:32:01 +01:00
Brox Chen
18e9d3dbe5
[AMDGPU][True16][MC] true16 for v_cmpx_xx_u/i16 (#123424)
A bulk commit of true16 support for v_cmpx_xx_i/u16 instructions
including:

v_cmpx_lt_i16
v_cmpx_eq_i16
v_cmpx_le_i16
v_cmpx_gt_i16
v_cmpx_ne_i16
v_cmpx_ge_i16
v_cmpx_lt_u16
v_cmpx_eq_u16
v_cmpx_le_u16
v_cmpx_gt_u16
v_cmpx_ne_u16
v_cmpx_ge_u16
2025-01-22 15:57:16 -05:00
Brox Chen
1cf0af3d32
[AMDGPU][True16][MC] true16 for v_cmpx_class_f16 (#123251)
True16 format for v_cmpx_class_f16. Update VOPCX_CLASS t16 and fake16
pseudo.
2025-01-22 15:56:58 -05:00
Craig Topper
9e6494c0fb
[CodeGen] Rename RegisterMaskPair to VRegMaskOrUnit. NFC (#123799)
This holds a physical register unit or virtual register and mask.

While I was here I've used emplace_back and removed an unneeded use of a
template.
2025-01-22 09:11:22 -08:00
Matt Arsenault
93d35ad5f5
AMDGPU: Delete FillMFMAShadowMutation (#123861)
No test changes with this removed and it appears to
be obsolete.
2025-01-22 22:41:25 +07:00
Akshat Oke
a343b8e595
[AMDGPU][NewPM] Port SILowerWWMCopies to NPM (#123695) 2025-01-22 14:54:01 +05:30
Venkata Ramanaiah Nalamothu
f7d8336a2f
[llvm] Pass MachineInstr flags to storeRegToStackSlot/loadRegFromStackSlot (NFC) (#120622)
This patch is in preparation to enable setting the MachineInstr::MIFlag
flags, i.e. FrameSetup/FrameDestroy, on callee saved register
spill/reload instructions in prologue/epilogue. This eventually helps in
setting the prologue_end and epilogue_begin markers more accurately.

The DWARF Spec in "6.4 Call Frame Information" says:

The code that allocates space on the call frame stack and performs the
save operation is called the subroutine’s prologue, and the code that
performs the restore operation and deallocates the frame is called its
epilogue.

which means the callee saved register spills and reloads are part of
prologue (a.k.a frame setup) and epilogue (a.k.a frame destruction),
respectively. And, IIUC, LLVM backend uses FrameSetup/FrameDestroy flags
to identify instructions that are part of call frame setup and
destruction.

In the trunk, while most targets consistently set
FrameSetup/FrameDestroy on save/restore call frame information (CFI)
instructions of callee saved registers, they do not consistently set
those flags on the actual callee saved register spill/reload
instructions.

I believe this patch provides a clean mechanism to set
FrameSetup/FrameDestroy flags on the actual callee saved register
spill/reload instructions as needed. And, by having default argument of
MachineInstr::NoFlags for Flags, this patch is a NFC.

With this patch, the targets have to just pass FrameSetup/FrameDestroy
flag to the storeRegToStackSlot/loadRegFromStackSlot calls from the
target derived spillCalleeSavedRegisters and restoreCalleeSavedRegisters
to set those flags on callee saved register spill/reload instructions.

Also, this patch makes it very easy to set the source line information
on callee saved register spill/reload instructions which is needed by
the DwarfDebug.cpp implementation to set prologue_end and epilogue_begin
markers more accurately.

As per DwarfDebug.cpp implementation:

prologue_end is the first known non-DBG_VALUE and non-FrameSetup location
    that marks the beginning of the function body

epilogue_begin is the first FrameDestroy location that has been seen in the
    epilogue basic block
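
To see why the flag matters for prologue_end, the rule quoted above can be modelled on a toy instruction list. This is a sketch, not the DwarfDebug.cpp code: `FakeMI` and `findPrologueEnd` are hypothetical, and a line number of 0 stands for "no source location".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, heavily reduced machine instruction.
struct FakeMI {
  bool IsDebugValue;
  bool IsFrameSetup;
  unsigned Line; // 0 means "no source location"
};

// Index of the first instruction that has a location and is neither a
// DBG_VALUE nor marked FrameSetup; Insts.size() if there is none.
std::size_t findPrologueEnd(const std::vector<FakeMI> &Insts) {
  for (std::size_t I = 0; I < Insts.size(); ++I) {
    const FakeMI &MI = Insts[I];
    if (!MI.IsDebugValue && !MI.IsFrameSetup && MI.Line != 0)
      return I;
  }
  return Insts.size();
}
```

In this model, a callee-saved register spill that carries a source line but no FrameSetup flag is picked as prologue_end. Once the spill is flagged, the search skips it and lands on the first instruction of the function body.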

With this patch, the targets have to just do the following to set the
source line information on callee saved register spill/reload
instructions, without hampering the LLVM's efforts to avoid adding
source line information on the artificial code generated by the
compiler.

    <Foo>InstrInfo::storeRegToStackSlot() {
    ...
      DebugLoc DL =
          Flags & MachineInstr::FrameSetup ? DebugLoc() : MBB.findDebugLoc(I);
    ...
    }

    <Foo>InstrInfo::loadRegFromStackSlot() {
    ...
      DebugLoc DL =
          Flags & MachineInstr::FrameDestroy ? MBB.findDebugLoc(I) : DebugLoc();
    ...
    }

While I understand this patch would break out-of-tree backend builds, I
think it is in the right direction.

One immediate use case that benefits from this patch: fixing #120553
becomes simpler.
2025-01-22 13:36:39 +05:30
Kazu Hirata
ceaaa2b9ae
[AMDGPU] Fix warnings
This patch fixes:

  llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2792:14: error: comparison of
  integers of different signs: 'unsigned int' and 'int'
  [-Werror,-Wsign-compare]

  llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:2797:14: error: comparison of
  integers of different signs: 'unsigned int' and 'int'
  [-Werror,-Wsign-compare]
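
For readers unfamiliar with this diagnostic, the sketch below shows the general pattern and one common way to silence it. It is not the change made in SIInstrInfo.cpp, which this message does not describe; `indexBelowLimit` is a hypothetical helper.

```cpp
// Hypothetical helper comparing an unsigned index against a signed limit.
bool indexBelowLimit(unsigned Idx, int Limit) {
  // Writing `Idx < Limit` here would compare 'unsigned int' with 'int' and
  // trigger -Wsign-compare. Handle non-positive limits first, then compare
  // two unsigned values.
  if (Limit <= 0)
    return false;
  return Idx < static_cast<unsigned>(Limit);
}
```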
2025-01-21 20:24:30 -08:00
Shoreshen
7c58d6363a
[AMDGPU] Add commute for some VOP3 inst (#121326)
Add commute for some VOP3 instructions, allow commute when both operands
are inline constants, and adjust tests.

Fixes #111205
2025-01-22 11:08:26 +07:00
Shoreshen
e8811ad3cc
[AMDGPU] Fix unreachable reg bit width (#122107)
Add register class bit width for SReg_256_XNULL and SReg_128_XNULL
2025-01-22 10:05:47 +07:00
Brox Chen
e1c1e74a6f
[AMDGPU][True16][MC] true16 for v_cmp_class_f16 (#122984)
True16 format for v_cmp_class_f16. Update VOPC_CLASS t16 and fake16
pseudo.
2025-01-21 10:07:14 -05:00
Brox Chen
70632f9566
[AMDGPU][True16][MC] true16 for v_cmp_xx_f16 (#122943)
A bulk commit of true16 support for v_cmp_xx_f16 instructions including:
v_cmp_f_f16
v_cmp_eq_f16
v_cmp_le_f16
v_cmp_gt_f16
v_cmp_lg_f16
v_cmp_ge_f16
v_cmp_o_f16
v_cmp_u_f16
v_cmp_nge_f16
v_cmp_nlg_f16
v_cmp_ngt_f16
v_cmp_nle_f16
v_cmp_neq_f16
v_cmp_nlt_f16
v_cmp_t_f16

Added a GFX12 runline for fcmp.f16
2025-01-21 10:06:22 -05:00
Chinmay Deshpande
9ca1323de1
[AMDGPU] Fix crash due to missing check for FLAT instructions that dont use vector registers when computing VALU hazard (#123627) 2025-01-21 05:50:58 -08:00
Janek van Oirschot
82944595fa
[AMDGPU] Change scope of resource usage info symbols (#114810)
Change scope of resource usage info MC symbols to align with the function linkage type
2025-01-21 13:10:06 +00:00
Akshat Oke
7acad6893b
[AMDGPU][CodeGen] SILowerWWMCopies: Declare used analyses (#123710)
This prevents legacy PM from mistakenly removing these analyses if
`SILowerWWMCopies` is the last user of them. (it removes dead analyses
after its last use)
2025-01-21 15:33:20 +05:30
Akshat Oke
9b6e8df896
[AMDGPU][NewPM] Port SIFixVGPRCopies to NPM (#123592)
Extends NPM pipeline support till PostRegAlloc passes (greedy is in the
works)
2025-01-21 15:27:46 +05:30