For the GPU, we emit external kernels that call the initializers and
constructors, however if we had a persistent kernel like in the `_start`
kernel for the `libc` project, we could initialize the standard way of
calling constructors. This patch adds new global variables containing
pointers to the constructors to be called. If these are placed in the
`.init_array` and `.fini_array` sections, then the backend will handle
them specially. The linker will then provide the `__init_array_` and
`__fini_array_` sections to traverse them. An implementation would look
like this.
```
extern uintptr_t __init_array_start[];
extern uintptr_t __init_array_end[];
extern uintptr_t __fini_array_start[];
extern uintptr_t __fini_array_end[];
using InitCallback = void(int, char **, char **);
using FiniCallback = void(void);
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
_start(int argc, char **argv, char **envp) {
uint64_t init_array_size = __init_array_end - __init_array_start;
for (uint64_t i = 0; i < init_array_size; ++i)
reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
uint64_t fini_array_size = __fini_array_end - __fini_array_start;
for (uint64_t i = 0; i < fini_array_size; ++i)
reinterpret_cast<FiniCallback *>(__fini_array_start[i])();
}
```
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D149340
"convergent" is documented as meaning that the call cannot be made
control-dependent on more values, but in practice we also require that
it cannot be made control-dependent on fewer values, e.g. it cannot be
hoisted out of the body of an "if" statement.
In code like this, if we allow CSE to combine the two calls:
x = convergent_call();
if (cond) {
y = convergent_call();
use y;
}
then we get this:
x = convergent_call();
if (cond) {
use x;
}
This is conceptually equivalent to moving the second call out of the
body of the "if", up to the location of the first call, so it should be
disallowed.
Differential Revision: https://reviews.llvm.org/D149348
Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintin consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins.
Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate).
Differential Revision: https://reviews.llvm.org/D149358
Summary:
Registers for tail call return should not be clobbered by callee.
So we need a sub-class of SGPR_64 (excluding callee saved registers (CSR)) to hold
the tail call return address.
Because GFX and C calling conventions have different CSR, we need to define
the sub-class separately. This work is an extension of D147096 with the
consideration of GFX calling convention.
Based on the calling conventions, different instructions will be selected with
different sub-class of SGPR_64 as the input.
Reviewers: arsenm, cdevadas and sebastian-ne
Differential Revision: https://reviews.llvm.org/D148824
With this patch an undefined mask in a shufflevector will be printed as poison.
This change is done to support the new shufflevector semantics
for undefined mask elements.
Differential Revision: https://reviews.llvm.org/D149210
As the comment notes, the shader results in an INSERT_SUBREG with
"undef" (dead) operand in the Endif block. The same can happen with
REG_SEQUENCE. The register is considered dead from a liveness
analysis perspective. The correct thing to do seems to be nothing:
we keep the undef use of the register, the register allocator should
still be able to take the liveness into account correctly.
Differential Revision: https://reviews.llvm.org/D149161
There are no VOP2 or VOP2 with dpp forms of v_cndmask_b16. Delete the
test. NFC.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D149184
The offset values may result in an erroneous scheduling of a load before write for a memory location if the offset values are represented as negative values in MIR, despite actually being unsigned values. This representation in MIR happens as SelectionDAG::getConstant could go through APInt to represent the encoding which assumes the MSB of the encoding as a sign-bit, regardless of whether it is supposed to be a signed value. The 8-bit negative (interpreted) value gets cast to an unsigned 32 bit value in getMemOperandsWithOffset used for comparisons in areMemAccessesTriviallyDisjoint eventually leading to an erroneous schedule in the machine scheduler.
Reviewed By: arsenm, foad
Differential Revision: https://reviews.llvm.org/D149080
After D147334 we never select _SGPR forms of SMEM instructions on
subtargets that also support the _SGPR_IMM form, so there is no need to
handle them here.
Differential Revision: https://reviews.llvm.org/D149139
A bitcast needs to be inserted back to the original type. Just
skip the multiple use case for a safer quick fix. Handling
the multiple use case seems to be beneficial in some but not
all cases.
Remove the broken call graph analysis in the block enqueue lowering
pass. The previous iteration was reverted due to a runtime bug when
the completion action was unconditionally enabled.
This reverts parts of D123693. The functionality of allowing unsupported
intrinsics to select has been superseded by D139000 "Remove function
with incompatible features".
Retain assembler/disassembler support for v_illegal on GFX10+ only,
where it is documented.
Differential Revision: https://reviews.llvm.org/D148127
There are some patterns in td files without MVT/class set
for some operands in target pattern that are from the source
pattern. This prevents GlobalISelEmitter from adding them as
a valid rule, because the target child operand is an
unsupported kind operand. For now, for a leaf child, only
IntInit and DefInit are handled in GlobalISelEmitter.
This issue can be workaround by adding MVT/class to the
patterns in the td files, like the workarounds for patterns
anyext and setcc in PPCInstrInfo.td in D140878.
To avoid adding the same workarounds for other patterns in
td files, this patch tries to handle the UnsetInit case in
GlobalISelEmitter.
Adding the new handling allows us to remove the workarounds
in the td files and also generates many selection rules for
PPC target.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D141247
The patch will caused dead loop because of DAGCombiner's canonicalization:
// (x + C) - y -> (x - y) + C
// y - (x + C) -> (y - x) - C
// (x - C) - y -> (x - y) - C
// (C - x) - y -> C - (x + y)
This reverts commit b3529b5bf3ba2cd7f38665de16450afefb263c9b.
On GFX9+, SMEM instructions have an _SGPR_IMM form which is strictly
more powerful than the _SGPR form. It simplifies codegen if we always
select the _SGPR_IMM form with an immediate offset of 0 instead of the
_SGPR form.
Note that this patch just makes minimal changes to the selection
patterns to prove the concept. Further simplifications are possible to
reduced the number of selection patterns.
On GFX9 the _SGPR form of the Real instruction is still required for
assembly/disassembly but on GFX10+ it can be removed completely.
Differential Revision: https://reviews.llvm.org/D147334
UniformityAnalysis offers all of the same features and much more, there is no reason left to use the legacy DAs.
See RFC: https://discourse.llvm.org/t/rfc-deprecate-divergenceanalysis-legacydivergenceanalysis/69538
- Remove LegacyDivergenceAnalysis.h/.cpp
- Remove DivergenceAnalysis.h/.cpp + Unit tests
- Remove SyncDependenceAnalysis - it was not a real registered analysis and was only used by DAs
- Remove/adjust references to the passes in the docs where applicable
- Remove TTI hook associated with those passes.
- Move tests to UniformityAnalysis folder.
- Remove RUN lines for the DA, leave only the UA ones.
- Some tests had to be adjusted/removed depending on how they used the legacy DAs.
Reviewed By: foad, sameerds
Differential Revision: https://reviews.llvm.org/D148116
In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).
This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.
Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147786
PAL Metadata 3.0 introduces an explicit structure in metadata for the
programmable registers written out by the compiler backend.
Rather than using opaque registers which can change between different
architectures and requires encoding the bitfield information in the backend,
which may change between versions.
This is the initial minimal implementation that enables the use of PAL Metadata
3.0.
The change itself should be NFC for non-PAL, although the way RSRC2 register is
handled has been changed slightly.
The test is fairly minimal, but checks that the metadata format looks as
expected and verifies a couple of special cases such as tgid_[xyz]_en handling
and PsInputAddr/Ena which also change to explicit fields.
Differential Revision: https://reviews.llvm.org/D147143
The peephole optimizer tries to replace
```
%n:sgpr_32 = S_MOV_B32 x
$scc = COPY %n
```
with a `S_MOV_B32` directly into `$scc`.
This crashes because `S_MOV_B32` cannot take `$scc` as input.
We currently generate code like this from GlobalISel when lowering a
G_BRCOND with a constant condition. We should probably look into
removing this kind of branch altogether, but until then we should at
least not crash.
This patch fixes the issue by making sure we don't apply the peephole
optimization when trying to move into a physical register that
doesn't belong to the correct register class.
Differential Revision: https://reviews.llvm.org/D148117
With architected SGPRs, workgroup IDs are passed into a compute shader
in TTMP registers. Allow for this in AMDGPUResourceUsageAnalysis instead
of failing an assertion.
Differential Revision: https://reviews.llvm.org/D148239
Apparently it was used to work around some issue that has been fixed.
Removing it helps with high scratch usage observed in some cases due to failed alloca promotion.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D145586
The inreg argument prevented the tail call optimization to kick in.
Remove the inreg, so this test actually uses a tail call.
Note that it now uses s[4:5] for the return address, which is invalid,
because these registers are supposed to be callee-save.
D147096 tried to fix that problem for the C calling convention.
Differential Revision: https://reviews.llvm.org/D148119
Fix the test introduced in D136592 which appeared to have a few check lines
containing an un-checked prefix "GISEL-GFX".
Also canonicalize the other prefixes to minimize churn if SDag and GISel
diverge.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147958
The math libraries have a lot of code that performs
manual sign bit operations by bitcasting doubles to int2
and doing bithacking on them. This is a bad canonical form
we should rewrite to use high level sign operations directly
on double. To avoid codegen regressions, we need to do a better
job moving fnegs to operate only on the high 32-bits.
This is only halfway to fixing the real case.
The GFX11 NGG Streamout Instructions perform atomic operations on
dedicated registers. At the moment, they lack machine memory operands,
which causes the si-memory-legalizer pass to treat them conservatively
and introduce several unnecessary waits and cache invalidations.
This patch introduces a new address space to represent these special
registers and teaches instruction selection to add memory operands with
this new address space to DS_ADD/SUB_GS_REG_RTN.
Since this address space is meant to be compiler-internal, we move it
up a bit from the other address spaces and give it the number 128.
According to the LLVM Language Reference, address space numbers can go
all the way up to 2^24, but I'm not sure how well this is supported in
practice [1], so using a smaller number seems safer.
[1] 0107513fe7/llvm/utils/TableGen/IntrinsicEmitter.cpp (L401)
Differential Revision: https://reviews.llvm.org/D146031
Use an `OptimizationRemark` for them even though it's not really an
optimization. It just integrates better with the other diagnostics
(enabling is easy with `-pass-remark`).
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D147703
Summary:
This is to avoid using the callee saved registers for the return address
of the tail call return instruction.
Reviewers:
arsenm, cdevadas
Differential Revision:
https://reviews.llvm.org/D147096
The test was added in D69953, but since D67794 landed a bit later it no
longer catches the bug. Update the test to be representative again.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D146934