34 Commits

Author SHA1 Message Date
Emma Pilkington
4490003a22
[AMDGPU] Rename COV module flag to amdhsa_code_object_version (#79905)
The previous name 'amdgpu_code_object_version', was misleading since
this is really a property of the HSA OS. The new spelling also matches
the asm directive I added in bc82cfb.
2024-03-06 09:51:48 -05:00
Jan Patrick Lehr
f661057865
Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234)
Reverts llvm/llvm-project#79586

This broke the AMDGPU OpenMP Offload buildbot.
The typical error message was that the GPU attempted to read beyong the
largest legal address.

Error message:
AMDGPU fatal error 1: Received error in queue 0x7f8363f22000:
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to
access memory beyond the largest legal address.
2024-02-09 09:57:38 +01:00
alex-t
88e52511ca
[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586)
This change implements synthesizing the private buffer resource
descriptor in the kernel prolog instead of using the preloaded kernel
argument.
2024-02-08 20:27:36 +01:00
Yaxun (Sam) Liu
1f3c30911c
[AMDGPU] Mark PC_ADD_REL_OFFSET rematerializable (#79674)
Currently machine LICM hoist PC_ADD_REL_OFFSET out of loops, causes
register pressure when function calls are deep in loops. This is a main
cause of sgpr spill for programs containing large number of function
calls in loops.

This patch marks PC_ADD_REL_OFFSET as rematerializable, which eliminates
sgpr spills due to function calls in loops.
2024-02-01 12:21:19 -05:00
Saiyedul Islam
e21b7e2143
[AMDGPU][NFC] Check more autogenerated llc tests for COV5 (#75219)
Regenerate a few more llc tests to check for COV5 instead of the default
ABI version.
2023-12-15 10:27:49 +05:30
Stanislav Mekhanoshin
fe8335babb
[AMDGPU] Select 64-bit imm moves if can be encoded as 32 bit operand (#70395)
This allows folding of 64-bit operands if fit into 32-bit. Fixes
https://github.com/llvm/llvm-project/issues/67781
2023-10-30 08:12:28 -07:00
Matthias Braun
e3cf80c5c1
BlockFrequencyInfoImpl: Avoid big numbers, increase precision for small spreads
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:

* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
2023-10-24 20:27:39 -07:00
Jingu Kang
ff68e43c81 [MachineLICM] Handle Subloops
It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.

Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outermost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-09-26 14:25:11 +01:00
Benjamin Kramer
3454cf67bd Revert "[MachineLICM] Handle Subloops"
This reverts commit 5ec9699c4d1f165364586d825baef434e2c110b4. It
accesses MI after it has been hoisted.
2023-09-15 13:20:31 +02:00
Jingu Kang
5ec9699c4d [MachineLICM] Handle Subloops
Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outermost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-09-14 18:07:31 +01:00
Saiyedul Islam
466a8149b3
Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)" (#66060)
This reverts commit 0a8d17e79b02a92814a2a788d79df1f54d70ec3e.
2023-09-12 15:13:59 +05:30
Saiyedul Islam
0a8d17e79b
[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Reviewed By: arsenm, jhuber6

Github PR: #65410

Differential Revision: https://reviews.llvm.org/D129818
2023-09-12 13:53:31 +05:30
Matt Arsenault
4d42e8b5d1 Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.

The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
2023-07-31 20:15:45 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Jingu Kang
351b4c17dd Revert "[MachineLICM] Handle Subloops"
This reverts commit 50dd383d08670960540fecb4b48c0f0429fbfba3.
2023-07-20 17:12:25 +01:00
Jingu Kang
50dd383d08 [MachineLICM] Handle Subloops
Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outmost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-07-20 16:39:13 +01:00
Jingu Kang
62ed3ff4bb Revert "[MachineLICM] Handle Subloops"
This reverts commit 33e60484d750291e99301e29e60fe72c8fa48ccd.
2023-07-19 10:30:50 +01:00
Jingu Kang
33e60484d7 [MachineLICM] Handle Subloops
MachineLICM pass handles inner loops only when outmost loop does not have unique
predecessor. If the loop has preheader and there is loop invariant code, the
invariant code can be hoisted to the preheader in general. This patch makes the
pass handle inner loops in general.

Differential Revision: https://reviews.llvm.org/D154205
2023-07-12 16:32:14 +01:00
Christudasan Devadasan
7a98f084c4 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

Differential Revision: https://reviews.llvm.org/D124196
2023-07-07 23:14:32 +05:30
pvanhout
d892521076 [AMDGPU] Break-up large PHIs for DAGISel
DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users.
It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function).

This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.

Fixes SWDEV-321581

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D143731
2023-03-28 09:38:47 +02:00
Christudasan Devadasan
a3028239a7 Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs"
This reverts commit 40ba0942e2ab1107f83aa5a0ee5ae2980bf47b1a.
2022-12-21 16:17:42 +05:30
Nikita Popov
bdf2fbba9c [AMDGPU] Convert some tests to opaque pointers (NFC) 2022-12-19 12:41:13 +01:00
Christudasan Devadasan
40ba0942e2 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.

This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124196
2022-12-17 11:56:32 +05:30
Christudasan Devadasan
29247824f5 [AMDGPU][SIFrameLowering] Use the right frame register in CSR spills
Unlike the callee-saved VGPR spill instructions emitted by
`PEI::spillCalleeSavedRegs`, the CS VGPR spills inserted during
emitPrologue/emitEpilogue require the exec bits flipping to avoid
clobbering the inactive lanes of VGPRs used for SGPR spilling.
Currently, these spill instructions are referenced from the SP at
function entry and when the callee performs a stack realignment,
they ended up getting incorrect stack offsets. Even if we try to
adjust the offsets, the FP-SP becomes a runtime entity with dynamic
stack realignment and the offsets would still be inaccurate.

To fix it, use FP as the frame base in the spill instructions
whenever the function has FP. The offsets obtained for the CS
objects would always be the right values from FP.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134949
2022-12-17 11:52:36 +05:30
Christudasan Devadasan
b25b4c0ab4 [AMDGPU] Separate out SGPR spills to VGPR lanes during PEI
SILowerSGPRSpills pass handles the lowering of SGPR spills
into VGPR lanes. Some SGPR spills are handled later during
PEI. There is a common function used in both places to find
the free VGPR lane. This patch eliminates that dependency to
find the free VGPR by handling it separately for PEI. It is a
prerequisite patch for a future work to allow SGPR spills to
virtual VGPR lanes during SILowerSGPRSpills.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124195
2022-12-17 11:49:41 +05:30
Ron Lieberman
ca856fff1c Revert "enable code-object-version=5"
very sorry wrong repo.

This reverts commit d882ba7aeac4b496dccd1b10cb58bd691786b691.
2022-11-29 15:21:09 -06:00
Ron Lieberman
d882ba7aea enable code-object-version=5 2022-11-29 15:11:57 -06:00
Guozhi Wei
11e86868c1 [MachineCSE] Allow CSE for instructions with ignorable operands
Ignorable operands don't impact instruction's behavior, we can safely do CSE on
the instruction.

It is split from D130919. It has big impact to some AMDGPU test cases.
For example in atomic_optimizations_raw_buffer.ll, when trying to check if the
following instruction can be CSEed

  %37:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

Function isCallerPreservedOrConstPhysReg is called on operand "implicit $exec",
this function is implemented as

  -  return TRI.isCallerPreservedPhysReg(Reg, MF) ||
  +  return TRI.isCallerPreservedPhysReg(Reg, MF) || TII.isIgnorableUse(MO) ||
            (MRI.reservedRegsFrozen() && MRI.isConstantPhysReg(Reg));

Both TRI.isCallerPreservedPhysReg and MRI.isConstantPhysReg return false on this
operand, so isCallerPreservedOrConstPhysReg is also false, it causes LLVM failed
to CSE this instruction.

With this patch TII.isIgnorableUse returns true for the operand $exec, so
isCallerPreservedOrConstPhysReg also returns true, it causes this instruction to
be CSEed with previous instruction

  %14:vgpr_32 = V_MOV_B32_e32 0, implicit $exec

So I got different result from here. AMDGPU's implementation of isIgnorableUse
is

  bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
    // Any implicit use of exec by VALU is not a real register read.
    return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
           isVALU(*MO.getParent()) && !resultDependsOnExec(*MO.getParent());
  }

Since the operand $exec is not a real register read, my understanding is it's
reasonable to do CSE on such instructions.

Because more instructions are CSEed, so I get less instructions generated for
these tests.

Differential Revision: https://reviews.llvm.org/D137222
2022-11-14 19:34:59 +00:00
Matt Arsenault
7a84624079 AMDGPU: Make various vector undefs legal
Surprisingly these were getting legalized to something
zero initialized.

This fixes an infinite loop when combining some vector types.
Also fixes zero initializing some undef values.

SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking
for the legality of the output undefs they are replacing unused
operations with. This resulted in turning vectors into undefs
that were later re-legalized back into zero vectors.
2022-09-28 10:48:52 -04:00
Ruiling Song
a5676a3a7e StructurizeCFG: Set Undef for non-predecessors in setPhiValues()
During structurization process, we may place non-predecessor blocks
between the predecessors of a block in the structurized CFG. Take
the typical while-break case as an example:
```
 /---A(v=...)
 |  / \
 ^ B   C
 |  \ /|
 \---L |
     \ /
      E (r = phi (v:C)...)
```
After structurization, the CFG would be look like:
```
 /---A
 |   |\
 |   | C
 |   |/
 |   F1
 ^   |\
 |   | B
 |   |/
 |   F2
 |   |\
 |   | L
 \   |/
  \--F3
     |
     E
```
We can see that block B is placed between the predecessors(C/L) of E.
During phi reconstruction, to achieve the same sematics as before, we
are reconstructing the PHIs as:
  F1: v1 = phi (v:C), (undef:A)
  F3: r = phi (v1:F2), ...
But this is also saying that `v1` would be live through B, which is not
quite necessary. The idea in the change is to say the incoming value
from B is Undef for the PHI in E. With this change, the reconstructed
PHI would be:
  F1: v1 = phi (v:C), (undef:A)
  F2: v2 = phi (v1:F1), (undef:B)
  F3: r  = phi (v2:F2), ...

Reviewed by: sameerds

Differential Revision: https://reviews.llvm.org/D132450
2022-09-26 09:54:47 +08:00
Ruiling Song
40e9284f3c StructurizeCFG: prefer reduced number of live values
The instruction simplification will try to simplify the affected phis.
In some cases, this might extend the liveness of values. For example:

  BB0:
   | \
   | BB1
   | /
  BB2:phi (BB0, v), (BB1, undef)

The phi in BB2 will be simplified to v as v dominates BB2, but this is
increasing the number of active values in BB1. By setting CanUseUndef
to false, we will not simplify the phi in this way, this would help
register pressure. This is mandatory for the later change to help
reducing VGPR pressure for AMDGPU.

Reviewed by: foad, sameerds

Differential Revision: https://reviews.llvm.org/D132449
2022-09-26 09:54:47 +08:00
Matt Arsenault
69153d6c0a AMDGPU: Use GlobalPriority for largest register tuples
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.

It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.

The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
2022-09-15 11:45:02 -04:00
Matt Arsenault
da59fe7f15 AMDGPU: Fix test failure
Forgot to commit regenerated test
2022-09-12 09:33:22 -04:00
Matt Arsenault
e30271169f RegAllocGreedy: Try local instruction splitting with subranges
This was only trying this to relax register class constraints, but
this can also help if there are subranges involved.

This solves a compilation failure for AMDGPU when there is high
pressure created by large register tuples. If one virtual register is
using most of the available budget, we need to be able to evict
subranges.

This solves the immediate failure, but this solution leaves a lot to
be desired. In the relevant testcases, we have 32-element tuples but
most of the uses are operations on 1 element subranges of it. What
we're now getting is a spill and restore of the full 1024 bits and an
extract of the used 32-bits. It would be far better if we introduced a
copy to a new virtual register with a smaller register class and used
narrower spills.

Furthermore, we could probably do a better job if the allocator were
to introduce new subranges where none previously existed in the
highest pressure scenarios. The block and region splits should also
try to split specific subranges out.

The mve-vst3.ll test changes looks like noise to me, but instruction
count increased by one. mve-vst4.ll looks like a solid improvement
with several 16-byte spills eliminated. splitkit-copy-live-lanes.mir
also shows a solid reduction in total spill count.

This could use more tests but it's pretty tiring to come up with cases
that fail on this.
2022-09-12 09:03:55 -04:00