6024 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
9c1b82599d [AAPointerInfo] handle multiple offsets in PHI
Previously reverted in 8b446ea2ba39e406bcf940ea35d6efb4bb9afe95

Reapplying because this commit is NOT DEPENDENT on the reverted commit
fc21f2d7bae2e0be630470cc7ca9323ed5859892, which broke the ASAN buildbot.
See https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.

The arguments to a PHI may represent a recurrence by eventually using the output
of the PHI itself. This is now handled by checking for cycles in the control
flow. If a PHI is not in a recurrence, it is now able to report multiple offsets
instead of conservatively reporting unknown.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138991
2022-12-18 10:51:20 +05:30
Christudasan Devadasan
40ba0942e2 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.

This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124196
2022-12-17 11:56:32 +05:30
Christudasan Devadasan
29247824f5 [AMDGPU][SIFrameLowering] Use the right frame register in CSR spills
Unlike the callee-saved VGPR spill instructions emitted by
`PEI::spillCalleeSavedRegs`, the CS VGPR spills inserted during
emitPrologue/emitEpilogue require the exec bits flipping to avoid
clobbering the inactive lanes of VGPRs used for SGPR spilling.
Currently, these spill instructions are referenced from the SP at
function entry and when the callee performs a stack realignment,
they ended up getting incorrect stack offsets. Even if we try to
adjust the offsets, the FP-SP becomes a runtime entity with dynamic
stack realignment and the offsets would still be inaccurate.

To fix it, use FP as the frame base in the spill instructions
whenever the function has FP. The offsets obtained for the CS
objects would always be the right values from FP.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134949
2022-12-17 11:52:36 +05:30
Christudasan Devadasan
7a72a93580 [AMDGPU] Preserve only the inactive lanes of scratch vgprs
In general, a callee is free to use a scratch register without
preserving its previous state.  However, the VGPR used for SGPR
spilling can potentially have its inactive lanes overwritten by
the writelane instructions. When the function returns, it can
cause unexpected behavior if the VGPR value is not preserved
appropriately.

The current scheme to preserve the inactive lanes of such
scratch VGPRs is not done rightly. It preserves all lanes
and causes the outgoing values (if any) getting overwritten
by the epilog restores. It then corrupts the return value.

To avoid such situation with scratch VGPRs, this patch ensures
we preserve only their inactive lanes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D134526
2022-12-17 11:51:43 +05:30
Christudasan Devadasan
20a940f1e2 [AMDGPU][SIFrameLowering] Unify PEI SGPR spill saves and restores
There is a lot of customization and eventually code duplication in the
frame lowering that handles special SGPR spills like the one needed for
the Frame Pointer. Incorporating any additional SGPR spill currently
makes it difficult during PEI. This patch introduces a new spill builder
to efficiently handle such spill requirements. Various spill methods are
special handled using a separate class.

Reviewed By: sebastian-ne, scott.linder

Differential Revision: https://reviews.llvm.org/D132436
2022-12-17 11:50:25 +05:30
Christudasan Devadasan
b25b4c0ab4 [AMDGPU] Separate out SGPR spills to VGPR lanes during PEI
SILowerSGPRSpills pass handles the lowering of SGPR spills
into VGPR lanes. Some SGPR spills are handled later during
PEI. There is a common function used in both places to find
the free VGPR lane. This patch eliminates that dependency to
find the free VGPR by handling it separately for PEI. It is a
prerequisite patch for a future work to allow SGPR spills to
virtual VGPR lanes during SILowerSGPRSpills.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124195
2022-12-17 11:49:41 +05:30
Christudasan Devadasan
5ebe91fcb2 [AMDGPU] Correctly set IsKill flag for VGPR spills in the prolog
We always assume the vector register is dead or killed while
inserting the VGPR spills in the prolog. It is not always
true. Used the entry block liveIn data while setting the flag.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124194
2022-12-17 11:48:44 +05:30
Christudasan Devadasan
5692a7e84e [AMDGPU] Callee must always spill writelane VGPRs
Since the writelane instruction used for SGPR spills can
modify inactive lanes, the callee must preserve the VGPR
this instruction modifies even if it was marked Caller-saved.

Reviewed By: arsenm, nhaehnle

Differential Revision: https://reviews.llvm.org/D124192
2022-12-17 11:11:42 +05:30
Mitch Phillips
525d6c54b5 Revert "[AAPointerInfo] handle multiple offsets in PHI"
This reverts commit 88db516af69619d4326edea37e52fc7321c33bb5.

Reason: This change is dependent on a commit that needs to be rolled
back because it broke the ASan buildbot. See
https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.
2022-12-16 17:55:48 -08:00
Mitch Phillips
7928a6387f Revert "Revert "[AAPointerInfo] handle multiple offsets in PHI""
This reverts commit 12696d302d146ffe616eecab3feceba9d29be2db.

Reason: This change is dependent on a commit that needs to be rolled
back because it broke the ASan buildbot. See
https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.
2022-12-16 17:55:38 -08:00
Mitch Phillips
8b446ea2ba Revert "[AAPointerInfo] handle multiple offsets in PHI"
This reverts commit 179ed8871101cd197e0a719a3629cd5077b1a999.

Reason: This change is dependent on a commit that needs to be rolled
back because it broke the ASan buildbot. See
https://reviews.llvm.org/rGfc21f2d7bae2e0be630470cc7ca9323ed5859892 for
more information.
2022-12-16 17:54:44 -08:00
Jeffrey Byrnes
4d2faf043b [AMDGPU][SIFrameLowering] Mark VGPR used for AGPR spills as reserved
Presently, there is an issue on MI100 (and probably other architecture) where the VGPR used for AGPR copies clobbers VGPR used for AGPR spill. AFAICT this is because in processFunctionBeforeFrameIndicesReplaced we think the VGPR register for AGPR spill is unused. This patch aims to correct that. This is a WIP while I work out issues with producing a good test. For now, I'm curious if this is generally a good / bad idea.

Differential Revision: https://reviews.llvm.org/D139673
2022-12-16 12:00:51 -08:00
Roman Lebedev
96d3c82645
Revert "[SROA] isVectorPromotionViable(): memory intrinsics operate on vectors of bytes (take 3)"
While the PPC litte-endian miscompile did get addressed
by https://reviews.llvm.org/D140046
the PPV big-endian bots are still unhappy.
https://lab.llvm.org/buildbot/#/builders/93/builds/12560

This reverts commit 7bd358bcb4e358b4351c69e02ef76939e08acdc7.
2022-12-16 22:58:41 +03:00
Roman Lebedev
cfd594f8bb
[SROA] isVectorPromotionViable(): memory intrinsics operate on vectors of bytes (take 3)
* This is a recommit of 3c4d2a03968ccf5889bacffe02d6fa2443b0260f,
* which was reverted in 25f01d593ce296078f57e872778b77d074ae5888,
  because it exposed a miscompile in PPC backend,  which was resolved
  in https://reviews.llvm.org/D140089 / cb3f415cd2019df7d14683842198bc4b7a492bc5.
* which was a recommit of cf624b23bc5d5a6161706d1663def49380ff816a,
* which was reverted in 5cfc22cafe3f2465e0bb324f8daba82ffcabd0df,
  because the cut-off on the number of vector elements was not low enough,
  and it triggered both SDAG SDNode operand number assertions,
  5and caused compile time explosions in some cases.

Let's try with something really *REALLY* conservative first,
just to get somewhere, and try to bump it later.

FIXME: should this respect TTI reg width * num vec regs?

Original commit message:

Now, there's a big caveat here - these bytes
are abstract bytes, not the i8 we have in LLVM,
so strictly speaking this is not exactly legal,
see e.g. https://github.com/AliveToolkit/alive2/issues/860
^ the "bytes" "could" have been a pointer,
and loading it as an integer inserts an implicit ptrtoint.

But at the same time,
InstCombine's `InstCombinerImpl::SimplifyAnyMemTransfer()`
would expand a memtransfer of 1/2/4/8 bytes
into integer-typed load+store,
so this isn't exactly a new problem.

Note that in memory, poison is byte-wise,
so we really can't widen elements,
but SROA seems to be inconsistent here.

Fixes #59116.
2022-12-16 19:27:38 +03:00
Jay Foad
e2bcdca527 [AMDGPU] Generate permlane test checks 2022-12-16 11:30:01 +00:00
Matt Arsenault
b5edd522d1 AMDGPU/GlobalISel: Do not create readfirstlane with non-s32 type
We should probably handle any 32-bit type here, but the intrinsic
definition and selection pattern currently do not. Avoids a few lit
tests failures when switched on by default.
2022-12-15 21:44:07 -05:00
Alexander Timofeev
2877b87666 [AMDGPU] Lower VGPR to physical SGPR COPY to S_MOV_B32 if VGPR contains the compile time constant
Sometimes we have a constant value loaded to VGPR. In case we further
need to rematrerialize it in the physical scalar register we may avoid VGPR to
SGPR copy replacing it with S_MOV_B32.

Reviewed By: JonChesterfield, arsenm

Differential Revision: https://reviews.llvm.org/D139874
2022-12-16 00:38:10 +01:00
Christudasan Devadasan
229c466bc8 [AMDGPU] Test fixup
Changing cast_lds_gv into a kernel function to
lower the LDS usage appropriately. The LDS lowering
is currently won't happen for orphan device functions.
2022-12-15 23:36:55 +05:30
Ron Lieberman
38f1abef86 Revert "[SelectionDAG] Do not second-guess alignment for alloca"
Breaks amdgpu buildbot https://lab.llvm.org/buildbot/#/builders/193
 23491

This reverts commit ffedf47d8b793e07317f82f9c2a5f5425ebb71ad.
2022-12-15 10:55:18 -06:00
Andrew Savonichev
ffedf47d8b [SelectionDAG] Do not second-guess alignment for alloca
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.

The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.

Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).

Differential Revision: https://reviews.llvm.org/D135462
2022-12-15 18:18:12 +03:00
Juan Manuel MARTINEZ CAAMAÑO
4d852374b1 [DAGCombine] Fix always true condition in combineShiftToMULH
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D139550
2022-12-15 13:04:42 +01:00
Sameer Sahasrabuddhe
179ed88711 [AAPointerInfo] handle multiple offsets in PHI
Previously reverted in 12696d302d146ffe616eecab3feceba9d29be2db

The arguments to a PHI may represent a recurrence by eventually using the output
of the PHI itself. This is now handled by checking for cycles in the control
flow. If a PHI is not in a recurrence, it is now able to report multiple offsets
instead of conservatively reporting unknown.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138991
2022-12-15 12:23:50 +05:30
Sameer Sahasrabuddhe
12696d302d Revert "[AAPointerInfo] handle multiple offsets in PHI"
This reverts commit 88db516af69619d4326edea37e52fc7321c33bb5.
2022-12-15 10:14:39 +05:30
Sameer Sahasrabuddhe
88db516af6 [AAPointerInfo] handle multiple offsets in PHI
The arguments to a PHI may represent a recurrence by eventually using the output
of the PHI itself. This is now handled by checking for cycles in the control
flow. If a PHI is not in a recurrence, it is now able to report multiple offsets
instead of conservatively reporting unknown.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138991
2022-12-15 08:48:38 +05:30
Matt Arsenault
fb639b0ce0 AMDGPU: Update test 2022-12-14 16:47:25 -05:00
Matt Arsenault
c16a58b36c Attributes: Add function getter to parse integer string attributes
The most common case for string attributes parses them as integers. We
don't have a convenient way to do this, and as a result we have
inconsistent missing attribute and invalid attribute handling
scattered around. We also have inconsistent radix usage to
getAsInteger; some places use the default 0 and others use base 10.

Update a few of the uses, but there are quite a lot of these.
2022-12-14 13:12:35 -05:00
Jay Foad
113aafbf23 [AMDGPU] Clean up SReg classes
Remove unused LO16 classes SReg_LO16_XM0_XEXEC, SReg_LO16_XEXEC_HI and
SReg_LO16_XM0.

Simplify the definition of SReg_32.

Add SReg_32_XEXEC and use it to improve SReg_1_XEXEC which previously
excluded M0 for no good reason.

Improve SReg_1 which previously excluded EXEC_HI for no good reason.

Differential Revision: https://reviews.llvm.org/D140012
2022-12-14 15:57:56 +00:00
Sameer Sahasrabuddhe
6a2305484e [AAPointerInfo] track multiple constant offsets for each use
An expression of the form `gep(base, select(pred, const1, const2))` can result
in a set of offsets instead of just one. PointerInfo can now track these sets
instead of conservatively modeling them as Unknown. In general, AAPointerInfo
now uses AAPotentialConstantValues to examine the operands of the GEP.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138646
2022-12-13 22:27:25 +05:30
Pierre van Houtryve
9fa46200ea [AMDGPU] Add .workgroup_processor_mode to v5 MD
Adds Workgroup Processor Mode (WGP) to the HSA Metadata for Code Object v5/GFX10+.
The field is already present as an asm directive and in the compute program resource register but is also needed in the MD.

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D139931
2022-12-13 10:44:52 -05:00
Pierre van Houtryve
678d8946ba [AMDGPU] Add bf16 storage support
- [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
  - Add Sema & CodeGen tests cases.
  - Also add cases that D138651 would have covered as this patch replaces it.
- [AMDGPU] Add BF16 storage-only support
  - Support legalization/dealing with bf16 operations in DAGIsel.
  - bf16 as a type remains illegal and is represented as i16 for storage purposes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139398
2022-12-13 10:34:26 -05:00
Matt Arsenault
4f63f9739c AMDGPU: Add sanity test if amdgcn.device.{init|fini} already exists 2022-12-12 22:18:37 -05:00
Matt Arsenault
d647e252b8 InstSimplify: Add basic folding of llvm.is.fpclass intrinsic
Copied from the existing llvm.amdgcn.class handling; eventually I will
fold that to the generic intrinsic when legal. The tests should
probably move into an instsimplify only test.
2022-12-12 21:54:04 -05:00
Doru Bercea
aea5980e26 Emit CAS loop for min/max atomics. 2022-12-12 11:42:30 -06:00
Jay Foad
81084bfa2c [AMDGPU] Make use of !listremove. NFCI.
This only affects the order of implicit operands in some MIR tests.

Differential Revision: https://reviews.llvm.org/D139829
2022-12-12 17:01:04 +00:00
Sameer Sahasrabuddhe
2fdeb27790 Revert "[AAPointerInfo] track multiple constant offsets for each use"
Assertion fired in openmp-offload-amdgpu-runtime:
https://lab.llvm.org/buildbot/#/builders/193/builds/23177

This reverts commit c2a0baad1fbb21fe111fef83ec93c2d7923b9b0c.
2022-12-12 15:39:18 +05:30
Nikita Popov
243acd5dcb [BasicAA] Remove support for PhiValues analysis
BasicAA currently has an optional dependency on the PhiValues
analysis. However, at least with our current pipeline setup, we
never actually make use of it. It's possible that this used to work
with the legacy pass manager, but I'm not sure of that either.

Given that this analysis has not actually been in use for a long
time, and nobody noticed or complained, I think we should drop
support for it and focus on one code path. It is worth noting that
analysis quality for the non-PhiValues case has significantly
improved in the meantime.

If we really wanted to make use of PhiValues, the right way would
probably be to pass it in via AAQI in places we want to use it,
rather than using an optional pass manager dependency (which are
an unpredictable PITA and should really only ever be used for
analyses that are only preserved and not used).

Differential Revision: https://reviews.llvm.org/D139719
2022-12-12 09:47:30 +01:00
Sameer Sahasrabuddhe
c2a0baad1f [AAPointerInfo] track multiple constant offsets for each use
An expression of the form `gep(base, select(pred, const1, const2))` can result
in a set of offsets instead of just one. PointerInfo can now track these sets
instead of conservatively modeling them as Unknown. In general, AAPointerInfo
now uses AAPotentialConstantValues to examine the operands of the GEP.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D138646
2022-12-12 13:36:45 +05:30
Austin Kerbow
f9c76a1198 [AMDGPU] Update MFMASmallGemmOpt with better performing stategy
Based on experiments this does better with target small GEMM kernels.

Reviewed By: jrbyrnes

Differential Revision: https://reviews.llvm.org/D139227
2022-12-09 19:03:51 -08:00
Ariel Burton
6b2829dd87 Allow epilogue_begin to be emitted when generating DWARF
We identify epilogue code by looking for instructions tagged
with FrameDestroy.

A function may have more than one epilogue, e.g., because of early
returns or code duplicated during optimization. We need only track
the current block, and emit epilogie_begin at most once per block.

We reduce the number of entries in the line table by combining
epilogue_begin with other flags instead of emitting a separate
entry just for epilogue_begin.

Reviewed By: dblaikie, aprantl

Differential Revision: https://reviews.llvm.org/D133376
2022-12-09 20:17:37 +00:00
Matt Arsenault
9cc0779c4e AMDGPU: Erase llvm.global_ctors/global_dtors after lowering
We should be able to run the pass multiple times without breaking
anything. If we still need to track these for some reason, we could
replace with new entries for the kernels.
2022-12-09 14:25:32 -05:00
Matt Arsenault
f23f26032d AMDGPU: Port AMDGPUCtorDtorLowering to new PM 2022-12-09 13:43:38 -05:00
Matt Arsenault
41c96e9483 AMDGPU: Fix not emitting code for exotic constructor types
This was simply ignoring any entries that weren't direct function
calls. This really should have been erroring on anything
unexpected. We should be able to handle calling just about anything
these days, so just call anything.
2022-12-09 13:22:12 -05:00
Pierre van Houtryve
3612d9eaac [GISel] Rework trunc/shl combine in a generic trunc/shift combine
This combine only handled left shifts, but now it can handle right shifts as well. It handles right shifts conservatively and only truncates them to the size returned by TLI.

AMDGPU benefits from always lowering shifts to 32 bits for instance, but AArch64 would rather keep them at 64 bits.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D136319
2022-12-09 04:46:45 -05:00
Roman Lebedev
d4c4bd6b20
[NFC] Port codegen AMDGPU tests that invoke opt to -passes= syntax 2022-12-09 01:04:46 +03:00
Roman Lebedev
b1a9584818
[opt] Disincentivize new tests from using old pass syntax
Over the past day or so, i've took a large swing at our tests,
and reduced the number of tests that were still using the old syntax
from ~1800 to just 200.

Left to handle: (as it is seen in this patch)
* Transforms/LSR
* Transforms/CGP
* Transforms/TypePromotion
* Transforms/HardwareLoops
* Analysis/*
* some misc.

I think this is the right point to start actively refusing
to honor the old syntax, except for the old tests,
to prevent the old syntax from creeping back in.

Thus, let's add temporary default-off flag,
and if it is not passed refuse to accept old syntax.
The tests that still need porting are annotated with this flag.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D139647
2022-12-08 23:54:03 +03:00
Sebastian Neubauer
1fe65d866c [AMDGPU] Add test that spills WWM CSRs twice
Add a test to show a deficit in the current wwm/spilling code that
creates double saves and restores for v40 and v41.

This case came up in D124193.

Differential Revision: https://reviews.llvm.org/D139626
2022-12-08 14:25:41 +01:00
Bjorn Pettersson
51ee10747d [test] Remove duplicate RUN lines
A few more that I missed in commit 3528e63d89305907b3d6e.

There could be more duplicates remaining, since I've only focused
on exactly duplicated "RUN: opt" lines (ignoring multi line RUN
lines ending with '\').
2022-12-08 12:47:24 +01:00
Johannes Doerfert
f6e3a89cc0 [AMDGPU] Annotate the intrinsics to be default and nocallback
Differential Revision: https://reviews.llvm.org/D135155
2022-12-07 14:25:25 -08:00
Jon Chesterfield
d77ae7f251 [amdgpu] Reimplement LDS lowering
Renames the current lowering scheme to "module" and introduces two new
ones, "kernel" and "table", plus a "hybrid" that chooses between those three
on a per-variable basis.

Unit tests are set up to pass with the default lowering of "module" or "hybrid"
with this patch defaulting to "module", which will be a less dramatic codegen
change relative to the current. This reflects the sparsity of test coverage for
the table lowering method. Hybrid is better than module in every respect and
will be default in a subsequent patch.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139433
2022-12-07 22:02:54 +00:00
Joe Nash
bbfbec94b1 [AMDGPU] Enable OMod on more VOP3 instructions
OMod was disabled if OpSel was enabled, but that restriction is more
specific than necessary. Any VOP3 with float operands can use OMod.

On GFX11, FMAC_F16_e64 can use op_sel.
Previously, SIFoldOperands and convertToThreeAddress were accidentally correct when
they reinterpreted the zero OMod operand on V_FMAC_F16_e64 as the OpSel operand on
V_FMA_F16_gfx9_e64. Now we explicitly add op_sel if required.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139469
2022-12-07 13:30:33 -05:00