6337 Commits

Author SHA1 Message Date
Ronak Chauhan
5f0b92e580 [AMDGPU] Also consider global and scratch instructions when flushing vmcnt counter in loop preheader
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D149332
2023-05-05 21:12:10 +05:30
Nicolai Hähnle
ef13308b26 AMDGPU/SDAG: Improve {extract,insert}_subvector lowering for 16-bit vectors
v2:
- simplify the escape to TableGen patterns

Differential Revision: https://reviews.llvm.org/D149841
2023-05-05 10:55:18 +02:00
Nicolai Hähnle
909095a880 AMDGPU: Precommit test showing codegen weakness
The code sequence on gfx9 has a lot of useless v_bfi instructions.

Differential Revision: https://reviews.llvm.org/D149840
2023-05-04 14:11:04 +02:00
Krzysztof Drewniak
fc05b7f0d0 [AMDGPU] Add gfx940 to fp64 atomic tests in global ISel
This changes the test in GlobalISel, which makes it match the test
elsewhere.

Differential Revision: https://reviews.llvm.org/D149795
2023-05-03 22:40:16 +00:00
Krzysztof Drewniak
f0415f2a45 Re-land "[AMDGPU] Define data layout entries for buffers"
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.

This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.

Differential Revision: https://reviews.llvm.org/D149776
2023-05-03 19:43:56 +00:00
Krzysztof Drewniak
3f2fbe92d0 Revert "[AMDGPU] Define data layout entries for buffers"
This reverts commit f9c1ede2543b37fabe9f2d8f8fed5073c475d850.

Differential Revision: https://reviews.llvm.org/D149758
2023-05-03 16:11:00 +00:00
Mateja Marjanovic
cf76074a36 [AMDGPU][GlobalISel] Check exact width in get*ClassForBitWidth and widen if necessary
Instead of checking whether the given bit width is less than or equal to the bit width of an existing RegClass,
check whether it is exactly equal.

For LLVM vector types that don't have a corresponding Register Class, widen them during legalization.
That goes for G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_BUILD_VECTOR.
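
The exact-width lookup and the widening fallback described above can be sketched in isolation. This is a minimal standalone illustration, not the LLVM implementation: the width table and the names `kClassWidths`, `findExactClassWidth` and `widenToClassWidth` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <optional>

// Hypothetical, illustrative list of register class widths in bits (sorted).
// The real set of AMDGPU register classes is larger and bank-specific.
constexpr std::array<unsigned, 6> kClassWidths = {32, 64, 96, 128, 256, 512};

// Exact-match lookup: succeeds only if some class has precisely this width.
std::optional<unsigned> findExactClassWidth(unsigned bits) {
  for (unsigned w : kClassWidths)
    if (w == bits)
      return w;
  return std::nullopt;
}

// Widening step for types without a matching class: pick the nearest
// width in the table that is at least as large as the request.
std::optional<unsigned> widenToClassWidth(unsigned bits) {
  assert(bits > 0 && "bit width must be positive");
  for (unsigned w : kClassWidths)
    if (w >= bits)
      return w;
  return std::nullopt; // wider than any class in the table
}
```

With this table, a 160-bit vector has no exact class, so it would be widened to 256 bits before the lookup is retried.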

Differential revision: https://reviews.llvm.org/D148096
Reviewers: foad, arsenm
2023-05-03 17:32:24 +02:00
Mateja Marjanovic
6175ec0bb6 Revert "[AMDGPU][GlobalISel] Widen the vector operand in G_BUILD/INSERT/EXTRACT_VECTOR"
This reverts commit b25c7cafcbe1b52ea2d1ff5e5c2f13674b5f297d.
2023-05-03 17:28:01 +02:00
Krzysztof Drewniak
f9c1ede254 [AMDGPU] Define data layout entries for buffers
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.

The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.

The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. These ptr
addrspace(8) pointers are 128 bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
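
The sizes and alignments quoted above can be checked with a small host-side model. This is only a sketch: the struct names `BufferResource` and `BufferFatPointer` and the helper `payloadBits` are hypothetical and do not correspond to any LLVM type.

```cpp
#include <cstdint>

// Illustrative model of an address space 8 value: a 128-bit buffer
// resource with 128-bit alignment.
struct alignas(16) BufferResource {
  std::uint32_t words[4];
};

// Illustrative model of an address space 7 value: a 128-bit resource plus a
// 32-bit offset (160 bits of payload), aligned to 256 bits.
struct alignas(32) BufferFatPointer {
  BufferResource resource;
  std::uint32_t offset;
};

// Number of meaningful bits carried by the address space 7 model.
constexpr unsigned payloadBits() {
  return 8 * (sizeof(BufferResource) + sizeof(std::uint32_t));
}

static_assert(sizeof(BufferResource) * 8 == 128, "resource is 128 bits");
static_assert(payloadBits() == 160, "descriptor plus offset is 160 bits");
static_assert(alignof(BufferFatPointer) * 8 == 256, "256-bit alignment");
```

In this model the 256-bit alignment pads the 160-bit payload to 32 bytes of storage.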

Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.

This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.

Depends on D143437

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D145441
2023-05-03 15:25:58 +00:00
Mateja Marjanovic
b25c7cafcb [AMDGPU][GlobalISel] Widen the vector operand in G_BUILD/INSERT/EXTRACT_VECTOR
Widen the vector operand type in G_BUILD_VECTOR, G_INSERT_VECTOR_ELT,
G_EXTRACT_VECTOR_ELT to the nearest larger RegClass.
2023-05-03 17:14:38 +02:00
pvanhout
415956fe7e [llvm-readobj][AMDGPU] Bypass MD verification for PAL
Small split change from D146023.

Migrate elf-notes to v4 and fix llvm-readobj to work with PAL metadata.

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D146119
2023-05-03 08:45:24 +02:00
Joseph Huber
a1da746157 [AMDGPU] Place global constructors in .init_array and .fini_array
For the GPU, we emit external kernels that call the initializers and
constructors. However, with a persistent kernel such as the `_start`
kernel of the `libc` project, we could call constructors the standard
way instead. This patch adds new global variables containing
pointers to the constructors to be called. If these are placed in the
`.init_array` and `.fini_array` sections, then the backend will handle
them specially. The linker will then provide the `__init_array_*` and
`__fini_array_*` symbols to traverse them. An implementation would look
like this.

```
extern uintptr_t __init_array_start[];
extern uintptr_t __init_array_end[];
extern uintptr_t __fini_array_start[];
extern uintptr_t __fini_array_end[];

using InitCallback = void(int, char **, char **);
using FiniCallback = void(void);

extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_start(int argc, char **argv, char **envp) {
  uint64_t init_array_size = __init_array_end - __init_array_start;
  for (uint64_t i = 0; i < init_array_size; ++i)
    reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, envp);
  uint64_t fini_array_size = __fini_array_end - __fini_array_start;
  for (uint64_t i = 0; i < fini_array_size; ++i)
    reinterpret_cast<FiniCallback *>(__fini_array_start[i])();
}
```

Reviewed By: yaxunl

Differential Revision: https://reviews.llvm.org/D149340
2023-04-29 08:40:19 -05:00
Jay Foad
56af0e913c [EarlyCSE] Do not CSE convergent calls in different basic blocks
"convergent" is documented as meaning that the call cannot be made
control-dependent on more values, but in practice we also require that
it cannot be made control-dependent on fewer values, e.g. it cannot be
hoisted out of the body of an "if" statement.

In code like this, if we allow CSE to combine the two calls:

  x = convergent_call();
  if (cond) {
    y = convergent_call();
    use y;
  }

then we get this:

  x = convergent_call();
  if (cond) {
    use x;
  }

This is conceptually equivalent to moving the second call out of the
body of the "if", up to the location of the first call, so it should be
disallowed.
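
Why the reuse is wrong can be seen in a small CPU-side simulation. This is a sketch under stated assumptions: `convergentCall` is a hypothetical stand-in for a cross-lane operation whose result is the number of lanes executing it together; it is not an LLVM or GPU API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a convergent operation: the result depends on
// which lanes execute it together. Here it counts the active lanes.
int convergentCall(const std::vector<bool> &activeLanes) {
  int count = 0;
  for (bool active : activeLanes)
    if (active)
      ++count;
  return count;
}

// Returns the value observed by the use inside the `if`, either for the
// original code (call repeated under the branch) or after the invalid CSE
// (result of the first call reused).
int valueUsedInsideIf(const std::vector<bool> &cond, bool applyCSE) {
  assert(!cond.empty() && "need at least one lane");
  std::vector<bool> allLanes(cond.size(), true);
  int x = convergentCall(allLanes); // executed by every lane
  if (applyCSE)
    return x;                  // "use x": computed with the wrong lane set
  return convergentCall(cond); // only lanes where cond holds are active
}
```

For a four-lane wave where `cond` holds in two lanes, the original code observes 2 inside the `if`, while the CSE'd code observes 4.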

Differential Revision: https://reviews.llvm.org/D149348
2023-04-28 14:50:48 +01:00
Jay Foad
5534d1d834 [CSE] Precommit an AMDGPU test case for D149348
Differential Revision: https://reviews.llvm.org/D149349
2023-04-28 14:50:48 +01:00
Jeffrey Byrnes
7f0a881e6c [AMDGPU] Track liveins for max-ilp-sched-strategy
Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintain consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins.

Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate).

Differential Revision: https://reviews.llvm.org/D149358
2023-04-27 16:45:45 -07:00
Changpeng Fang
1ab8b9ae15 AMDGPU: Define sub-class of SGPR_64 for tail call return
Summary:
  Registers for tail call return should not be clobbered by callee.
So we need a sub-class of SGPR_64 (excluding callee saved registers (CSR)) to hold
the tail call return address.

Because GFX and C calling conventions have different CSR, we need to define
the sub-class separately. This work is an extension of D147096 with the
consideration of GFX calling convention.

Based on the calling conventions, different instructions will be selected with
different sub-classes of SGPR_64 as the input.

Reviewers: arsenm, cdevadas and sebastian-ne

Differential Revision: https://reviews.llvm.org/D148824
2023-04-27 10:45:11 -07:00
skc7
e016fb57b3 [AMDGPU] Legalize soffset of buffer instructions. Use Waterfall loop logic.
Legalize soffset of buffer instructions using waterfall loop.
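
A waterfall loop handles an operand that must be uniform (such as a scalar offset) but holds different values in different lanes: it repeatedly takes the value of the first lane that is still pending, executes the operation for every lane sharing that value, and retires those lanes. The following is a minimal CPU-side sketch of that idea, not the generated machine code; the helper name `waterfall` and its callback signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Runs op(uniformValue, lane) for every lane, grouping lanes by value so
// that each iteration works with a single uniform value. Returns the number
// of loop iterations that were needed.
std::size_t
waterfall(const std::vector<std::uint32_t> &perLaneValue,
          const std::function<void(std::uint32_t, std::size_t)> &op) {
  assert(op && "callback must be callable");
  std::vector<bool> pending(perLaneValue.size(), true); // models the exec mask
  std::size_t iterations = 0;
  for (;;) {
    // Find the first pending lane (models reading the first active lane).
    std::size_t first = 0;
    while (first < pending.size() && !pending[first])
      ++first;
    if (first == pending.size())
      break; // every lane has been handled
    const std::uint32_t uniform = perLaneValue[first];
    // Execute for all pending lanes that share this value, then retire them.
    for (std::size_t lane = first; lane < pending.size(); ++lane) {
      if (pending[lane] && perLaneValue[lane] == uniform) {
        op(uniform, lane);
        pending[lane] = false;
      }
    }
    ++iterations;
  }
  return iterations;
}
```

The loop runs once per distinct value, so an operand that is already uniform costs a single iteration.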

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D141030
2023-04-27 19:36:50 +05:30
ManuelJBrito
8b56da5e9f [IR] Change shufflevector undef mask to poison
With this patch an undefined mask in a shufflevector will be printed as poison.
This change is done to support the new shufflevector semantics
for undefined mask elements.

Differential Revision: https://reviews.llvm.org/D149210
2023-04-27 14:41:10 +01:00
Jay Foad
47d3cbcf84 [BranchFolder] Skip redundant IMPLICIT_DEFs of subregs
Differential Revision: https://reviews.llvm.org/D148509
2023-04-27 09:40:06 +01:00
Jay Foad
12b70ad68c [BranchFolder] Precommit AMDGPU test case for D148509
2023-04-27 09:40:06 +01:00
Nicolai Hähnle
1e63f8272e AMDGPU: Fix an assertion in SIOptimizeVGPRLiveRange
As the comment notes, the shader results in an INSERT_SUBREG with
"undef" (dead) operand in the Endif block. The same can happen with
REG_SEQUENCE. The register is considered dead from a liveness
analysis perspective. The correct thing to do seems to be nothing:
we keep the undef use of the register, the register allocator should
still be able to take the liveness into account correctly.

Differential Revision: https://reviews.llvm.org/D149161
2023-04-27 09:39:44 +02:00
Matt Arsenault
bbc7b30fbf AMDGPU: Remove invalid testcase for enqueue kernel
The call didn't have the right calling convention, but calls to
kernels are supposed to be illegal anyway.
2023-04-26 17:25:30 -04:00
Jay Foad
22516593ae [AMDGPU] Add GFX11 ds_min_f32 / ds_max_f32 tests
2023-04-26 17:09:12 +01:00
Joe Nash
f8ec7a0944 [AMDGPU] Delete test for illegal v_cndmask_b16_dpp
There are no VOP2 or VOP2 with dpp forms of v_cndmask_b16. Delete the
test. NFC.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D149184
2023-04-26 09:50:44 -04:00
Janek van Oirschot
124acb7ca3 [AMDGPU] Fix negative offset values interpretation in getMemOperandsWithOffset for DS
The offset values may result in an erroneous scheduling of a load before write for a memory location if the offset values are represented as negative values in MIR, despite actually being unsigned values. This representation in MIR happens as SelectionDAG::getConstant could go through APInt to represent the encoding which assumes the MSB of the encoding as a sign-bit, regardless of whether it is supposed to be a signed value. The 8-bit negative (interpreted) value gets cast to an unsigned 32 bit value in getMemOperandsWithOffset used for comparisons in areMemAccessesTriviallyDisjoint eventually leading to an erroneous schedule in the machine scheduler.
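
The effect of the misinterpretation can be reproduced in isolation. This is a simplified sketch, not the LLVM code: `triviallyDisjoint`, `offsetViaSigned` and `offsetViaUnsigned` are hypothetical helpers, and the 8-bit encodings 0x7F and 0x80 are chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a disjointness test: two accesses of `width`
// units at unsigned offsets a and b do not overlap.
bool triviallyDisjoint(std::uint32_t a, std::uint32_t b, std::uint32_t width) {
  assert(width > 0 && "access width must be positive");
  const std::uint32_t lo = a < b ? a : b;
  const std::uint32_t hi = a < b ? b : a;
  return hi - lo >= width;
}

// Buggy path: the 8-bit encoding is read as a signed value and then
// converted to 32 bits, so 0x80 becomes 0xFFFFFF80 instead of 128.
std::uint32_t offsetViaSigned(std::uint8_t encoding) {
  return static_cast<std::uint32_t>(static_cast<std::int8_t>(encoding));
}

// Intended path: the encoding is an unsigned value and is zero-extended.
std::uint32_t offsetViaUnsigned(std::uint8_t encoding) {
  return static_cast<std::uint32_t>(encoding);
}
```

With offsets 0x7F and 0x80 and a width of 2, the zero-extended values (127 and 128) overlap, whereas the sign-extended value for 0x80 is so far away that the two accesses look disjoint, which would permit the reordering described above.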

Reviewed By: arsenm, foad

Differential Revision: https://reviews.llvm.org/D149080
2023-04-26 14:10:25 +01:00
OCHyams
2b3c13b716 [DebugInfo] Treat empty metadata operands the same as undef operands in SelectionDAG
Without this patch SelectionDAG silently drops dbg.values using `!{}` operands.

Related to https://discourse.llvm.org/t/auto-undef-debug-uses-of-a-deleted-value

Reviewed By: StephenTozer

Differential Revision: https://reviews.llvm.org/D140990
2023-04-26 09:03:07 +01:00
Jay Foad
0c13e0b748 [AMDGPU] Do not handle _SGPR SMEM instructions in SILoadStoreOptimizer
After D147334 we never select _SGPR forms of SMEM instructions on
subtargets that also support the _SGPR_IMM form, so there is no need to
handle them here.

Differential Revision: https://reviews.llvm.org/D149139
2023-04-25 15:40:13 +01:00
Matt Arsenault
2fce50e8f5 AMDGPU: Fix assertion with multiple uses of f64 fneg of select
A bitcast needs to be inserted back to the original type. Just
skip the multiple use case for a safer quick fix. Handling
the multiple use case seems to be beneficial in some but not
all cases.
2023-04-20 10:15:18 -04:00
Matt Arsenault
99d4c722e3 AMDGPU: Really invert handling of enqueued block detection
Remove the broken call graph analysis in the block enqueue lowering
pass. The previous iteration was reverted due to a runtime bug when
the completion action was unconditionally enabled.
2023-04-20 06:58:24 -04:00
Pravin Jagtap
21a69bdb66 [NewPM][AMDGPU] Port amdgpu-atomic-optimizer
Reviewed By: arsenm, sameerds, gandhi21299

Differential Revision: https://reviews.llvm.org/D148628
2023-04-20 00:27:47 -04:00
Jay Foad
e1ae0e2b7d [AMDGPU] Fix some check prefixes
2023-04-19 16:15:14 +01:00
Jay Foad
141c476a36 [AMDGPU] Remove unused check lines from tests
2023-04-19 16:15:14 +01:00
Jay Foad
43b035b483 [AMDGPU] Remove unused check lines from GlobalISel IR tests
2023-04-19 15:13:02 +01:00
Jay Foad
d9ed0dee0c [AMDGPU] Remove unused check lines from GlobalISel MIR tests
2023-04-19 15:03:32 +01:00
Jay Foad
bf4dc4381e [AMDGPU] Don't transform illegal intrinsics to V_ILLEGAL
This reverts parts of D123693. The functionality of allowing unsupported
intrinsics to select has been superseded by D139000 "Remove function
with incompatible features".

Retain assembler/disassembler support for v_illegal on GFX10+ only,
where it is documented.

Differential Revision: https://reviews.llvm.org/D148127
2023-04-19 09:59:46 +01:00
Chen Zheng
3f4055dec4 [GlobalISelEmitter] handle operand without MVT/class
There are some patterns in td files without MVT/class set
for some operands in target pattern that are from the source
pattern. This prevents GlobalISelEmitter from adding them as
a valid rule, because the target child operand is an
unsupported kind operand. For now, for a leaf child, only
IntInit and DefInit are handled in GlobalISelEmitter.

This issue can be worked around by adding MVT/class to the
patterns in the td files, like the workarounds for patterns
anyext and setcc in PPCInstrInfo.td in D140878.

To avoid adding the same workarounds for other patterns in
td files, this patch tries to handle the UnsetInit case in
GlobalISelEmitter.

Adding the new handling allows us to remove the workarounds
in the td files and also generates many selection rules for
PPC target.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D141247
2023-04-19 07:00:57 +00:00
chenglin.bi
9356097206 Revert "[AMDGPU] Ressociate patterns with sub to use SALU"
The patch causes an infinite loop because of DAGCombiner's canonicalization:
  // (x + C) - y  ->  (x - y) + C
  // y - (x + C)  ->  (y - x) - C
  // (x - C) - y  ->  (x - y) - C
  // (C - x) - y  ->  C - (x + y)

This reverts commit b3529b5bf3ba2cd7f38665de16450afefb263c9b.
2023-04-19 11:15:14 +08:00
David Stuttard
4d54565436 [AMDGPU] Remove unnecessary assert
Also remove the function attributes from the test. For PAL based shaders this isn't required.

Differential Revision: https://reviews.llvm.org/D148625
2023-04-18 13:41:38 +01:00
pvanhout
ec82188451 [AMDGPU] Do not crash on agpr_hi16 in AMDGPUResourceUsageAnalysis
Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D148438
2023-04-18 13:53:01 +02:00
Kriti Gupta
a3dfa4e083 [test] Remove occurrences of br undef in CodeGen/AMDGPU tests
Differential Revision: https://reviews.llvm.org/D148041
2023-04-18 08:47:29 +01:00
chenglin.bi
b3529b5bf3 [AMDGPU] Ressociate patterns with sub to use SALU
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D148463
2023-04-18 06:49:41 +08:00
Jay Foad
2be3189601 [AMDGPU] Don't select _SGPR forms of SMEM instructions on GFX9+
On GFX9+, SMEM instructions have an _SGPR_IMM form which is strictly
more powerful than the _SGPR form. It simplifies codegen if we always
select the _SGPR_IMM form with an immediate offset of 0 instead of the
_SGPR form.

Note that this patch just makes minimal changes to the selection
patterns to prove the concept. Further simplifications are possible to
reduce the number of selection patterns.

On GFX9 the _SGPR form of the Real instruction is still required for
assembly/disassembly but on GFX10+ it can be removed completely.

Differential Revision: https://reviews.llvm.org/D147334
2023-04-17 16:23:30 +01:00
pvanhout
ae77aceba5 [Analysis] Remove DA & LegacyDA
UniformityAnalysis offers all of the same features and much more, so there is no reason left to use the legacy DAs.
See RFC: https://discourse.llvm.org/t/rfc-deprecate-divergenceanalysis-legacydivergenceanalysis/69538

- Remove LegacyDivergenceAnalysis.h/.cpp
- Remove DivergenceAnalysis.h/.cpp + Unit tests
- Remove SyncDependenceAnalysis - it was not a real registered analysis and was only used by DAs
- Remove/adjust references to the passes in the docs where applicable
- Remove TTI hook associated with those passes.
- Move tests to UniformityAnalysis folder.
  - Remove RUN lines for the DA, leave only the UA ones.
- Some tests had to be adjusted/removed depending on how they used the legacy DAs.

Reviewed By: foad, sameerds

Differential Revision: https://reviews.llvm.org/D148116
2023-04-17 09:01:22 +02:00
Jay Foad
6b5067a81a [AMDGPU] Don't assert that image intrinsics are supported
Unsupported intrinsics should give a regular "cannot select" error.

Differential Revision: https://reviews.llvm.org/D148147
2023-04-16 19:54:55 +01:00
pvanhout
b3b3cb2d2f [AMDGPU] Less aggressively break large PHIs
In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).

This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.

Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D147786
2023-04-14 15:41:26 +02:00
David Stuttard
fc83f1de5d [AMDGPU] Add backend support for new PAL ELF Metadata 3.0
PAL Metadata 3.0 introduces an explicit structure in metadata for the
programmable registers written out by the compiler backend.
This replaces the use of opaque registers, which can change between
architectures and require the backend to encode bitfield information
that may itself change between versions.

This is the initial minimal implementation that enables the use of PAL Metadata
3.0.

The change itself should be NFC for non-PAL, although the way RSRC2 register is
handled has been changed slightly.

The test is fairly minimal, but checks that the metadata format looks as
expected and verifies a couple of special cases such as tgid_[xyz]_en handling
and PsInputAddr/Ena which also change to explicit fields.

Differential Revision: https://reviews.llvm.org/D147143
2023-04-14 09:57:13 +01:00
Diana Picus
b9ba05360e [AMDGPU] Don't S_MOV_B32 into $scc
The peephole optimizer tries to replace
```
%n:sgpr_32 = S_MOV_B32 x
$scc = COPY %n
```
with a `S_MOV_B32` directly into `$scc`.

This crashes because `S_MOV_B32` cannot take `$scc` as input.

We currently generate code like this from GlobalISel when lowering a
G_BRCOND with a constant condition. We should probably look into
removing this kind of branch altogether, but until then we should at
least not crash.

This patch fixes the issue by making sure we don't apply the peephole
optimization when trying to move into a physical register that
doesn't belong to the correct register class.

Differential Revision: https://reviews.llvm.org/D148117
2023-04-14 10:24:43 +02:00
Jay Foad
2d39f5b5cd [AMDGPU] Allow use of TTMP registers in AMDGPUResourceUsageAnalysis
With architected SGPRs, workgroup IDs are passed into a compute shader
in TTMP registers. Allow for this in AMDGPUResourceUsageAnalysis instead
of failing an assertion.

Differential Revision: https://reviews.llvm.org/D148239
2023-04-13 16:56:22 +01:00
pvanhout
fd1d60873f [AMDGPU] Remove CC exception for Promote Alloca Limits
Apparently it was used to work around some issue that has been fixed.
Removing it helps with high scratch usage observed in some cases due to failed alloca promotion.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D145586
2023-04-13 08:48:34 +02:00
Sebastian Neubauer
fee3980df5 [AMDGPU] Fix amdgpu_gfx tail-call test
The inreg argument prevented the tail call optimization from kicking in.
Remove the inreg, so this test actually uses a tail call.

Note that it now uses s[4:5] for the return address, which is invalid,
because these registers are supposed to be callee-save.
D147096 tried to fix that problem for the C calling convention.

Differential Revision: https://reviews.llvm.org/D148119
2023-04-12 16:15:09 +02:00