- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.
- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
[GlobalISel] Implement convergence control tokens and intrinsics in GMIR
In the IR translator, convert the LLVM token type to LLT::token(), which is an
alias for the s0 type. These show up as implicit uses on convergent operations.
Differential Revision: https://reviews.llvm.org/D158147
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.
[1]
34c8b835b1
Extra space causes the checks generated by update_mir_test_checks to be
unavailable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
Similar to 806761a7629df268c8aed49657aeccffa6bca449.
For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.
Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.
This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:
```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
We were allowing extra immediate arguments, and only bothering to check
if registers were implicit or not.
Also consolidate extra operand checks in verifier, to make this
testable. We had 3 different places checking if you were trying to build
an instruction with more operands than allowed by the definition. We had
an assertion in addOperand, a direct check in the MIRParser to avoid the
assertion, and the machine verifier checks. Remove the assert and parser
check so the verifier can provide a consistent verification experience,
which will also handle instructions modified in place.
This enables -regalloc=greedy to memfold spillable inline asm
MachineOperands.
Because no instruction selection framework marks MachineOperands as
spillable, no language frontend can observe functional changes from this
patch. That will change once instruction selection frameworks are
updated.
Link: https://github.com/llvm/llvm-project/issues/20571
1. Map R16-R31 to DWARF registers 130-145.
2. Make R16-R31 caller-saved registers.
3. Make R16-31 allocatable only when feature EGPR is supported
4. Make R16-31 availabe for instructions in legacy maps 0/1 and EVEX
space, except XSAVE*/XRSTOR
RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
Explanations for some seemingly unrelated changes:
inline-asm-registers.mir, statepoint-invoke-ra-enter-at-end.mir:
The immediate (TargetInstrInfo.cpp:1612) used for the regdef/reguse is
the encoding for the register
class in the enum generated by tablegen. This encoding will change
any time a new register class is added. Since the number is part
of the input, this means it can become stale.
seh-directive-errors.s:
R16-R31 makes ".seh_pushreg 17" legal
musttail-varargs.ll:
It seems some LLVM passes use the number of registers rather the number
of allocatable registers as heuristic.
This PR is to reland #67702 after #70222 in order to reduce some
compile-time regression when EGPR is not used.
The `YAMLParser.h` header file claims support for YAML 1.2 with a few
deviations, but our plain scalar parsing failed to parse some valid YAML
according to the spec. This change puts us more in compliance with the
YAML spec, now letting us parse plain scalars containing additional
special characters in cases where they are not ambiguous.
1. Map R16-R31 to DWARF registers 130-145.
2. Make R16-R31 caller-saved registers.
3. Make R16-31 allocatable only when feature EGPR is supported
4. Make R16-31 availabe for instructions in legacy maps 0/1 and EVEX
space, except XSAVE*/XRSTOR
RFC:
https://discourse.llvm.org/t/rfc-design-for-apx-feature-egpr-and-ndd-support/73031/4
Explanations for some seemingly unrelated changes:
inline-asm-registers.mir, statepoint-invoke-ra-enter-at-end.mir:
The immediate (TargetInstrInfo.cpp:1612) used for the regdef/reguse is
the encoding for the register
class in the enum generated by tablegen. This encoding will change
any time a new register class is added. Since the number is part
of the input, this means it can become stale.
seh-directive-errors.s:
R16-R31 makes ".seh_pushreg 17" legal
musttail-varargs.ll:
It seems some LLVM passes use the number of registers rather the number
of allocatable registers as heuristic.
To simplify handling PAuth in the machine outliner, introduce a
separate AArch64PointerAuth pass that is executed after both
Prologue/Epilogue Inserter and Machine Outliner passes.
After moving to AArch64PointerAuth, signLR and authenticateLR are
not used outside of their class anymore, so make them private and
simplify accordingly.
The new pass is added via AArch64PassConfig::addPostBBSections(),
so that it can change the code size before branch relaxation occurs.
AArch64BranchTargets is placed there too, so it can take into account
any PACI(A|B)SP instructions and not excessively add BTIs at the start
of functions.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D159357
This will represent functions with the amdgpu_cs_chain or
amdgpu_cs_chain_preserve calling conventions.
Differential Revision: https://reviews.llvm.org/D156410
Some opcodes in MIR are defined to be convergent by the target by setting
IsConvergent in the corresponding TD file. For example, in AMDGPU, the opcodes
G_SI_CALL and G_INTRINSIC* are marked as convergent. But this is too
conservative, since calls to functions that do not execute convergent operations
should not be marked convergent. This information is available in LLVM IR.
The new flag MIFlag::NoConvergent now allows the IR translator to mark an
instruction as not performing any convergent operations. It is relevant only on
occurrences of opcodes that are marked isConvergent in the target.
Differential Revision: https://reviews.llvm.org/D157475
MachineVerifier does not check that DBG_VALUE, DBG_VALUE_LIST and
DBG_INSTR_REF have the expected number of operands, so printing them
(e.g. with -print-after-all) should not crash.
Differential Revision: https://reviews.llvm.org/D156226
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.
The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
Record the call frame size on entry to each basic block. This is usually
zero except when a basic block has been split in the middle of a call
sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D156113
And dependent commits.
Details in D150388.
This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.
No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh
Differential Revision: https://reviews.llvm.org/D156381
ProgrammersManual.html says
> StringMap iteration order, however, is not guaranteed to be deterministic, so any uses which require that should instead use a std::map.
This patch makes -DLLVM_REVERSE_ITERATION=on (currently
-DLLVM_ENABLE_REVERSE_ITERATION=on works as well) shuffle StringMap
iteration order (actually flipping the hash so that elements not in the
same bucket are reversed) to catch violations, similar to D35043 for
DenseMap. This should help change the hash function (e.g., D142862,
D155781).
With a lot of fixes, there are still some violations. This patch
implements the "reverse_iteration" lit feature to skip such tests.
Eventually we should remove this feature.
`ninja check-{llvm,clang,clang-tools}` are clean with
`#define LLVM_ENABLE_REVERSE_ITERATION 1`.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D155789
Record the SP adjustment on entry to each basic block. This is almost
always zero except on targets like ARM which can split a basic block in
the middle of a call sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D154281
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.
Differential Revision: https://reviews.llvm.org/D124196
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.
This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D143759
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavanger may not find available registers, which causes a “did not find
scavenging index” assert to occur in assignRegToScavengingIndex.
In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.
Patch by Corbin Robeck. Thanks!
Differential Review: https://reviews.llvm.org/D149775
1. Remove the existing code that would encode the constant offsets (if
there were any) on buffer intrinsic operations onto their
`MachineMemOperand`s. As far as I can tell, this use of `offset` has
no substantial impact on the generated code, especially since the same
reasoning is performed by areMemAccessesTriviallyDisjoint().
2. When a buffer resource intrinsic takes a pointer argument as the
base resource/descriptor, place that memory argument in the value
field of the MachineMemOperand attached to that intrinsic.
This is more conservative than what would be produced by more typical
LLVM code using GEP, as the Value (for alias analysis purposes)
corresponding to accessing buffer[0] and buffer[1] is the same.
However, the target-specific analysis of disjoint offsets covers a lot
of the simple usecases.
Despite this limitation, the new buffer intrinsics, combined with
LLVM's existing pointer annotations, allow for non-trivial
optimizations, as seen in the new tests, where marking two buffer
descriptors "noalias" allows merging together loads and stores in a
"load from A, modify loaded value, store to B" sequence, which would
not be possible previously.
Depends on D147547
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D148184
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.
The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.
One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.
In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.
This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.
Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.
Depends on D145441
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D147547
This caused compiler assertions, see comment on
https://reviews.llvm.org/D150107.
This also reverts the dependent follow-up change:
> [X86] Remove patterns for ADD/AND/OR/SUB/XOR/CMP with immediate 8 and optimize during MC lowering, NFCI
>
> This is follow-up of D150107.
>
> In addition, the function `X86::optimizeToFixedRegisterOrShortImmediateForm` can be
> shared with project bolt and eliminates the code in X86InstrRelaxTables.cpp.
>
> Differential Revision: https://reviews.llvm.org/D150949
This reverts commit 2ef8ae134828876ab3ebda4a81bb2df7b095d030 and
5586bc539acb26cb94e461438de01a5080513401.
This is follow-up of D150107.
In addition, the function `X86::optimizeToFixedRegisterOrShortImmediateForm` can be
shared with project bolt and eliminates the code in X86InstrRelaxTables.cpp.
Differential Revision: https://reviews.llvm.org/D150949
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
This commit implements the serialization and deserialization of the Machine
Function's EntryValueObjects.
Depends on D149879, D149778
Differential Revision: https://reviews.llvm.org/D149880
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.
This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.
Differential Revision: https://reviews.llvm.org/D149776
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.
The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.
The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. ptr
addrspace(8). These pointers are 128-bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.
This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.
Depends on D143437
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D145441
Fix some bugs and reland e4c1dfed38370b4 and 614c63bec6d67c.
1. Run argument stack rebase pass before the reserved physical register
is finalized.
2. Add LEA pseudo instruction to prevent the instruction being
eliminated.
3. Don't support X32.
The base pointer register is reserved by compiler when there is
dynamic size alloca and stack realign in a function. However the
base pointer register is not defined in X86 ABI, so user can use
this register in inline assembly. The inline assembly would
clobber base pointer register without being awared by user. This
patch is to create extra prolog to save the stack pointer to a
scratch register and use this register to reference argument from
stack. For some calling convention (e.g. regcall), there may be
few scratch register.
Below is the example code for such case.
```
extern int bar(void *p);
long long foo(size_t size, char c, int id) {
__attribute__((__aligned__(64))) int a;
char *p = (char *)alloca(size);
asm volatile ("nop"::"S"(405):);
asm volatile ("movl %0, %1"::"r"(id), "m"(a):);
p[2] = 8;
memset(p, c, size);
return bar(p);
}
```
And below prolog/epilog will be emit for this case.
```
leal 4(%esp), %ebx
.cfi_def_cfa %ebx, 0
andl $-128, %esp
pushl -4(%ebx)
...
leal 4(%ebx), %esp
.cfi_def_cfa %esp, 4
```
Differential Revision: https://reviews.llvm.org/D145650
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.
getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.
At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).
This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
Differential Revision: https://reviews.llvm.org/D139468