717 Commits

Author SHA1 Message Date
Mirko Brkusanin
926746d22a [AMDGPU][GFX11] Legalize and select partial NSA MIMG instructions
If more registers are needed for VAddr then the NSA format allows then the
final register can act as a contigous set of remaining addresses. Update
legalizer to pack register for this new format and allow instruction
selection to use NSA encoding when number of addresses exceeds max size.
Also update SIShrinkInstructions to handle partial NSA.

Differential Revision: https://reviews.llvm.org/D144034
2023-02-23 13:33:34 +01:00
Piotr Sobczak
a3d7b3121c [AMDGPU][NFC] Add getMaxMUBUFImmOffset
Replace magic constant 4095 with the function getMaxMUBUFImmOffset().

Differential Revision: https://reviews.llvm.org/D144623
2023-02-23 11:29:59 +01:00
Jay Foad
c9f4df57ca [AMDGPU] Move splitMUBUFOffset from AMDGPUBaseInfo to SIInstrInfo
Moving this out of AMDGPUBaseInfo enforces that AMDGPUBaseInfo should
not be calling into GCNSubtarget.

Differential Revision: https://reviews.llvm.org/D144564
2023-02-22 16:19:05 +00:00
Yashwant Singh
cde2f330b3 [AMDGPU] Introduce never uniform bit field in tablegen
IsNeverUniform can be set to 1 to mark instructions which are
inherently never-uniform/divergent. Enabling this bit in
Writelane instruction for now. To be extended to all required
instructions.

Reviewed By: arsenm, sameerds, #amdgpu

Differential Revision: https://reviews.llvm.org/D143154
2023-02-08 11:45:48 +05:30
Yashwant Singh
422d379de2 [AMDGPU] Use tablegen to list uniform intrinsics
Right now we do opcode wise matching to identify uniform/non-divergent
AMDGPU intrinsics. It is duplicated at 2 places once at IR level uniformity analysis
and at MIR level. Moving them to single tablegen table for consistency and adding
and API rapper to access them.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D142961
2023-01-31 17:44:40 +05:30
Matt Arsenault
4002576156 AMDGPU/GlobalISel: Partially fix getGenericInstructionUniformity
This was broken for the common case of instructions which are uniform
if their inputs are uniform. This is broken for control flow intrinsics
since the API currently does not express which result operand is in question.

This generates failures in just about every intrinsic test when uniformity
analysis is performed without this.
2023-01-30 15:47:18 -04:00
Matt Arsenault
490e348e67 AMDGPU: Partially fix machine uniformity for inline asm
This was assuming virtual registers only, and asserting on physical.
This was also ignoring AGPRs, and only considering VGPRs.

Reporting the instruction as uniform or not is conceptually wrong,
this should be reported per-operand. An inline asm statement could
include uniform and non-uniform components. This should report
purely for the register defs and ignore the uses.

Fixes asserting on most of the inline asm tests when uniformity
analysis is used.
2023-01-30 15:47:18 -04:00
Matt Arsenault
17ce615c78 AMDGPU: Fix null dereference in getInstructionUniformity
This was failing when it couldn't find an allocatable class
for special physical register inputs (like $mode), which are all
scalars.

This avoids numerous test failures when regbankselect is updated
to use uniformity analysis.
2023-01-30 15:47:17 -04:00
Kazu Hirata
e078201835 [Target] Use llvm::count{l,r}_{zero,one} (NFC) 2023-01-28 09:23:07 -08:00
Jay Foad
073401e59c [MC] Define and use MCInstrDesc implicit_uses and implicit_defs. NFC.
The new methods return a range for easier iteration. Use them everywhere
instead of getImplicitUses, getNumImplicitUses, getImplicitDefs and
getNumImplicitDefs. A future patch will remove the old methods.

In some use cases the new methods are less efficient because they always
have to scan the whole uses/defs array to count its length, but that
will be fixed in a future patch by storing the number of implicit
uses/defs explicitly in MCInstrDesc. At that point there will be no need
to 0-terminate the arrays.

Differential Revision: https://reviews.llvm.org/D142215
2023-01-23 14:44:58 +00:00
Jay Foad
245e3dd948 [MC] Do not copy MCInstrDescs. NFC.
Avoid copying MCInstrDesc instances because a future patch will change
them to find their implicit operands and operand info array based on
their own "this" pointer, so it will only work for MCInstrDescs in the
TargetInsts table, not for a copy of an MCInstrDesc at a different
address.

Differential Revision: https://reviews.llvm.org/D142214
2023-01-23 11:55:49 +00:00
Jay Foad
768aed1378 [MC] Make more use of MCInstrDesc::operands. NFC.
Change MCInstrDesc::operands to return an ArrayRef so we can easily use
it everywhere instead of the (IMHO ugly) opInfo_begin and opInfo_end.
A future patch will remove opInfo_begin and opInfo_end.

Also use it instead of raw access to the OpInfo pointer. A future patch
will remove this pointer.

Differential Revision: https://reviews.llvm.org/D142213
2023-01-23 11:31:41 +00:00
Kazu Hirata
caa99a01f5 Use llvm::popcount instead of llvm::countPopulation(NFC) 2023-01-22 12:48:51 -08:00
Jeffrey Byrnes
1f08d3bc3a [AMDGPU] Further reduce attaching of implicit operands to spills
Extension of https://reviews.llvm.org/D141101 to even further reduce the amount of implicit operands we attach. The main benefit is to improve cability of post-ra scheduler, and reduce unneeded dependency resolution (e.g. inserting snops).

Unfortunately, we run into regressions if we completely minimize the amount implicit operands (naively), we run into some regressions (e.g. dual_movs are replaced with multiple calls to v_mov). This is even more reason to switch to LiveRegUnits.

Nonetheless, this patch removes the operands which we can for free (more or less).

Change-Id: Ib4f409202b36bdbc59eed615bc2d19fa8bd8c057

Differential Revision: https://reviews.llvm.org/D141557

Change-Id: I8b039e3c0d39436b384083f8beb947ee1b1730b2
2023-01-19 14:31:07 -08:00
Jay Foad
f460c66581 [AMDGPU] Simplify getNumFlatOffsetBits. NFC.
Previously we considered this field to be either N-bit unsigned or
N+1-bit signed, depending on the instruction. I think it's conceptually
simpler to say that the field is always N+1-bit signed, but some
instructions do not allow negative values.

Differential Revision: https://reviews.llvm.org/D140883
2023-01-12 10:40:36 +00:00
Ruiling Song
cce24b6af0 AMDGPU: Remove IsSourceOfDivergence check
This bit is not set/reserved in td file. Let's remove it for now,
we can always add it back if we need it.

Reviewed by: foad

Differential Revision: https://reviews.llvm.org/D141223
2023-01-11 09:59:35 +08:00
Jeffrey Byrnes
596c558155 [AMDGPU] More selectively attach implicit operands to agpr spills
Implicit def operands are needed when we spill partially undef super registers by each individual subregister. The implicit-def operands will allow us to lower spills without the verifier complaining. Currently, we are overzeously attaching implicit operands, when we really only need them on the first sub reg spill op. By more selectively attached the implicit ops, we will free up some unneeded dependencies for the post-ra scheduler.

Moreover, this enables a previously incorrect optimization / resolves a correctness issue in indirectCopyToAGPR. When lowering AGPR copies on GFX908, we can improve CodeGen by reusing accvgpr_writes. However, we could not reliably determine which agprs accvgpr_writes actually define due to implicit-defs.

Differential Revision: https://reviews.llvm.org/D141101
2023-01-09 15:10:06 -08:00
Ivan Kosarev
2d945ef864 [AMDGPU][NFC] Rename GFX10A16 operands.
They do not seem to be GFX10-specific anymore. Also renames the
corresponding feature.

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D141069
2023-01-09 17:18:46 +00:00
serge-sans-paille
38818b60c5
Move from llvm::makeArrayRef to ArrayRef deduction guides - llvm/ part
Use deduction guides instead of helper functions.

The only non-automatic changes have been:

1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There was a few similar situation across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).

Per reviewers' comment, some useless makeArrayRef have been removed in the process.

This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.

Differential Revision: https://reviews.llvm.org/D140955
2023-01-05 14:11:08 +01:00
Christudasan Devadasan
a3028239a7 Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs"
This reverts commit 40ba0942e2ab1107f83aa5a0ee5ae2980bf47b1a.
2022-12-21 16:17:42 +05:30
Carl Ritson
5bc703f755 [AMDGPU] Replace getPhysRegClass with getPhysRegBaseClass
Accelerate finding the base class for a physical register by
building a statically mapping table from physical registers
to base classes using TableGen.

Replace uses of SIRegisterInfo::getPhysRegClass with
TargetRegisterInfo::getPhysRegBaseClass in order to use
the computed table.

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D139422
2022-12-20 16:22:14 +09:00
Sameer Sahasrabuddhe
475ce4c200 RFC: Uniformity Analysis for Irreducible Control Flow
Uniformity analysis is a generalization of divergence analysis to
include irreducible control flow:

  1. The proposed spec presents a notion of "maximal convergence" that
     captures the existing convention of converging threads at the
     headers of natual loops.

  2. Maximal convergence is then extended to irreducible cycles. The
     identity of irreducible cycles is determined by the choices made
     in a depth-first traversal of the control flow graph. Uniformity
     analysis uses criteria that depend only on closed paths and not
     cycles, to determine maximal convergence. This makes it a
     conservative analysis that is independent of the effect of DFS on
     CycleInfo.

  3. The analysis is implemented as a template that can be
     instantiated for both LLVM IR and Machine IR.

Validation:
  - passes existing tests for divergence analysis
  - passes new tests with irreducible control flow
  - passes equivalent tests in MIR and GMIR

Based on concepts originally outlined by
Nicolai Haehnle <nicolai.haehnle@amd.com>

With contributions from Ruiling Song <ruiling.song@amd.com> and
Jay Foad <jay.foad@amd.com>.

Support for GMIR and lit tests for GMIR/MIR added by
Yashwant Singh <yashwant.singh@amd.com>.

Differential Revision: https://reviews.llvm.org/D130746
2022-12-20 07:22:24 +05:30
Christudasan Devadasan
40ba0942e2 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.

This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124196
2022-12-17 11:56:32 +05:30
Christudasan Devadasan
b5efec4b27 [CodeGen] Additional Register argument to storeRegToStackSlot/loadRegFromStackSlot
With D134950, targets get notified when a virtual register is created and/or
cloned. Targets can do the needful with the delegate callback. AMDGPU propagates
the virtual register flags maintained in the target file itself. They are useful
to identify a certain type of machine operands while inserting spill stores and
reloads. Since RegAllocFast spills the physical register itself, there is no way
its virtual register can be mapped back to retrieve the flags. It can be solved
by passing the virtual register as an additional argument. This argument has no
use when the spill interfaces are called during the greedy allocator or even the
PrologEpilogInserter and can pass a null register in such cases.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D138656
2022-12-17 11:55:34 +05:30
Jay Foad
6443c0ee02 [AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.

Differential Revision: https://reviews.llvm.org/D139828
2022-12-14 13:22:26 +00:00
Joe Nash
bbfbec94b1 [AMDGPU] Enable OMod on more VOP3 instructions
OMod was disabled if OpSel was enabled, but that restriction is more
specific than necessary. Any VOP3 with float operands can use OMod.

On GFX11, FMAC_F16_e64 can use op_sel.
Previously, SIFoldOperands and convertToThreeAddress were accidentally correct when
they reinterpreted the zero OMod operand on V_FMAC_F16_e64 as the OpSel operand on
V_FMA_F16_gfx9_e64. Now we explicitly add op_sel if required.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139469
2022-12-07 13:30:33 -05:00
Piotr Sobczak
1e3abd82b9 [AMDGPU] Fix wide spills
Update spill code to account for new vector types with
bit widths: 288, 320, 352, 384.

Related to D138205.

Differential Revision: https://reviews.llvm.org/D139203
2022-12-07 10:23:52 +01:00
Mateja Marjanovic
595a08847a [AMDGPU] Add support for new LLVM vector types
Add VReg, AReg and SReg on AMDGPU for bit widths: 288, 320, 352 and 384.

Differential Revision: https://reviews.llvm.org/D138205
2022-11-29 17:02:04 +01:00
Jay Foad
49762162ea [AMDGPU] Remove isLiteralConstant and isLiteralConstantLike
isLiteralConstant and isLiteralConstantLike were similar to
!isInlineConstant with slight differences like handling isReg operands.

To avoid a profusion of similar functions with undocumented differences,
this patch removes all the isLiteralConstant* variants. Callers are responsible
for handling the isReg case.

Differential Revision: https://reviews.llvm.org/D125759
2022-11-17 16:45:48 +00:00
Pierre van Houtryve
7425077e31 [AMDGPU] Add & use hasNamedOperand, NFC
In a lot of places, we were just calling `getNamedOperandIdx` to check if the result was != or == to -1.
This is fine in itself, but it's verbose and doesn't make the intention clear, IMHO. I added a `hasNamedOperand` and replaced all cases I could find with regexes and manually.

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D137540
2022-11-08 07:57:21 +00:00
Jay Foad
ea60545b0e [AMDGPU] Create new instructions in SIInstrInfo::moveToVALU
Create new VALU instructions in moveToVALU instead of mutating the
existing SALU instruction. This makes it easier to add extra operands so
we can convert to the VOP3 form of VALU instructions.

NFCI but it does have the minor side effect of removing duplicate
implicit operands that were present on the original SALU if they are
default implicit operands for the VALU.

Differential Revision: https://reviews.llvm.org/D137324
2022-11-04 07:21:11 +00:00
Kazu Hirata
b0e0cdf1c7 [AMDGPU] Fix a warning
This patch fixes:

  llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:7383: warning: enumerated
  mismatch in conditional expression:
  ‘llvm::AMDGPU::UfmtGFX11::UnifiedFormat’ vs
  ‘llvm::AMDGPU::UfmtGFX10::UnifiedFormat’
2022-10-30 13:09:59 -07:00
Jay Foad
9bb1e21f07 [AMDGPU] Clean up calls to MachineOperand::setIsDead and friends. NFC. 2022-10-28 10:44:08 +01:00
Carl Ritson
a3646ec1bc [AMDGPU] Add pseudo wavemode to optimize strict_wqm
Strict WQM does not require a WQM transistion if it occurs within
an existing WQM section.
This occurs heavily in GFX11 pixel shaders with LDS_PARAM_LOAD.
Which leads to unnecessary EXEC mask manipulation.

To avoid these transitions, detect WQM -> Strict WQM -> WQM
and substitute new ENTER_PSEUDO_WM/EXIT_PSEUDO_WM markers instead.
These are treat similarly by WWM register pre-allocation pass,
but do not manipulate EXEC or use registers to save EXEC state.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D136813
2022-10-28 09:45:17 +09:00
Jay Foad
191d70f2f5 [AMDGPU] Use Register in more places in SIInstrInfo. NFC.
Also avoid using AMDGPU::NoRegister when it's not neeeded.
2022-10-25 15:04:58 +01:00
Joe Nash
b982ba2a6e [AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C
Due to the encoding changes in GFX11, we had a hack in place that
    disables the use of VGPRs above 128. This patch removes the need for
    that hack.

    We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
    operands of VOP1, VOP2, and VOPC instructions. This register class only has the
    low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
    VOP2, and VOPC instructions are correctly limited to use the first 128
    VGPRs, while the other instructions can freely use all 256.

    We introduce new pseduo-instructions used on GFX11 which have the suffix
    t16 (True 16) to use the VGPR_32_Lo128 register class.

Reviewed By: foad, rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D133723
2022-09-20 09:56:28 -04:00
Ruiling Song
0404aafbe3 AMDGPU: Factor out hasDivergentBranch(). NFC
This is helpful for detecting whether a block ends with divergent branch
in passes before lowering the pseudo control flow instructions.

Differential Revision: https://reviews.llvm.org/D133184
2022-09-14 13:27:21 +08:00
Matt Arsenault
7834194837 TableGen: Introduce generated getSubRegisterClass function
Currently there isn't a generic way to get a smaller register class
that can be produced from a subregister of a larger class. Replaces a
manually implemented version for AMDGPU. This will be used to improve
subregister support in the allocator.
2022-09-12 09:03:37 -04:00
Jay Foad
afa0ed33df [AMDGPU] Fix shrinking of F16 FMA on newer subtargets
D125803 introduced shrinking of F16 FMA to FMAAK/FMAMK in
SIShrinkInstructions (useful on GFX10+ where VOP3 instructions may have
a literal operand) but failed to handle the V_FMA_F16_gfx9_e64 form of
the opcode which is used on GFX9+.

Differential Revision: https://reviews.llvm.org/D133489
2022-09-08 16:41:04 +01:00
Kazu Hirata
21de2888a4 Use llvm::is_contained (NFC) 2022-08-27 09:53:11 -07:00
Fangrui Song
de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
Kazu Hirata
ba0407ba86 [llvm] Use range-based for loops (NFC) 2022-08-07 00:16:21 -07:00
Kazu Hirata
d0ec61c9ff [Target] Remove unused forward declarations (NFC) 2022-08-07 00:16:16 -07:00
Jay Foad
c24d68fff1 [AMDGPU] Take advantage of VOP3 literals in convertToThreeAddress
This improves a corner case where v_fmac can be converted to v_fma on
GFX10+ even if it has a literal operand.

Differential Revision: https://reviews.llvm.org/D130992
2022-08-02 17:27:11 +01:00
Stanislav Mekhanoshin
68901fdbeb [AMDGPU] Consider S_SETPRIO a scheduling boundary
The instruction is used to modify wave priority with the intent
to affect VALU execution and currently we can reschedule VALU
around it since that VALU does not have side effects.

Differential Revision: https://reviews.llvm.org/D130654
2022-07-27 11:50:23 -07:00
Joe Nash
b28bb8cc9c [AMDGPU] Remove old operand from VOPC DPP
For most DPP instructions, the old operand stores the value that was in
the current lane before the DPP operation, and is tied to the
destination. For VOPC DPP, this is unnecessary and incorrect.

There appears to have been a latent bug related to D122737 with
SIInstrInfo::isOperandLegal. If you checked if a register operand was legal
when the InstructionDesc expected an immediate, it reported that is valid.
Its fix is necessary for and tested in this patch.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D130040
2022-07-19 09:35:05 -04:00
Matt Arsenault
8d0383eb69 CodeGen: Remove AliasAnalysis from regalloc
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.

Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.

Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
2022-07-18 17:23:41 -04:00
Ivan Kosarev
432cbd7827 [AMDGPU][CodeGen] Support (register + immediate) SMRD offsets.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D129381
2022-07-18 11:29:31 +01:00
Jay Foad
e45aa230ad [AMDGPU] Update LiveVariables after killing an immediate def
D114999 added code to kill an immediate def if it was folded into its
only use by convertToThreeAddress. This patch updates LiveVariables when
that happens in order to fix verification failures exposed by D129213.

Differential Revision: https://reviews.llvm.org/D129661
2022-07-14 10:49:41 +01:00
Joe Nash
d1af09ad96 [AMDGPU] gfx11 Generate VOPD Instructions
We form VOPD  instructions in the GCNCreateVOPD pass by combining
back-to-back component instructions. There are strict register
constraints for creating a legal VOPD, namely that the matching operands
(e.g. src0x and src0y, src1x and src1y) must be in different register
banks. We add a PostRA scheduler
mutation to put possible VOPD components back-to-back.

Depends on D128442, D128270

Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D128656
2022-07-05 09:18:19 -04:00