Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.
A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
Resource usage information would try to overwrite unnamed functions if
there are multiple within the same compilation unit. This aims to either
use the `MCSymbol` assigned to the unnamed function (i.e.,
`CurrentFnSym`), or, rematerialize the `MCSymbol` for the unnamed
function.
When compiling multiple pipelines, the `MCRegisterInfo` instance in
`AMDGPUAsmPrinter` gets re-used even after finalization, so it calls
`finalize()` multiple times.
Add a reset method and call it in
`AMDGPUAsmPrinter::doFinalization`.
Different approach would be to make it a `unique_ptr`.
---------
Co-authored-by: Thomas Symalla <tsymalla@amd.com>
Converts AMDGPUResourceUsageAnalysis pass from Module to MachineFunction
pass. Moves function resource info propagation to to MC layer (through
helpers in AMDGPUMCResourceInfo) by generating MCExprs for every
function resource which the emitters have been prepped for.
Fixes https://github.com/llvm/llvm-project/issues/64863
Unlike with implicitly preloaded data UserSGPRs firmware is unable to
handle cases where SGPRs for kernel arguments contain preloaded data but
not are not explicitly referenced in the kernel. We need to include
these preloaded SGPRs in the GRANULATED_WAVEFRONT_SGPR_COUNT calculation
to not clobber SGPRs in adjacent waves.
Walks over the MCExpr and uses KnownBits to deduce whether an expression
is known and if so, prints said known value. Should support the most
common MCExpr cases for AMDGPU metadata.
Summary:
This pass is used to get helpful information about the kernel resources
without needing to insepct the binary. However, it currently prints on
every function. These values will always be zero, so it's just spam on
the terminal, at best an indication that a function wasn't internalized
/ optimized out. This patch makes it only print for kernels to make it
more useful in practice.
Some of our custom expressions are not variadic and there seems to be
little benefit in mentioning the variadic nature of expression nodes in
the name anyway.
Allows MCExprs as passed values to PALMetadata. Also adds related
`DelayedMCExpr` classes which serve as a pseudo-fixup to resolve MCExprs
as late as possible (i.e., right before emit through string or blob,
where they should be resolvable).
Commit 6c0665e22174d474050e85ca367424f6e02476be
(https://reviews.llvm.org/D45164) enabled certain constant expression
evaluation for `MCObjectStreamer` at parse time (e.g. `.if` directives,
see llvm/test/MC/AsmParser/assembler-expressions.s).
`getUseAssemblerInfoForParsing` was added to make `clang -c` handling
inline assembly similar to `MCAsmStreamer` (e.g. `llvm-mc -filetype=asm`),
where such expression folding (related to
`AttemptToFoldSymbolOffsetDifference`) is unavailable.
I believe this is overly conservative. We can make some parse-time
expression folding work for `clang -c` even if `clang -S` would still
report an error, a MCAsmStreamer issue (we cannot print `.if`
directives) that should not restrict the functionality of
MCObjectStreamer.
```
% cat b.cc
asm(R"(
.pushsection .text,"ax"
.globl _start; _start: ret
.if . -_start == 1
ret
.endif
.popsection
)");
% gcc -S b.cc && gcc -c b.cc
% clang -S -fno-integrated-as b.cc # succeeded
% clang -c b.cc # succeeded with this patch
% clang -S b.cc # still failed
<inline asm>:4:5: error: expected absolute expression
4 | .if . -_start == 1
| ^
1 error generated.
```
However, removing `getUseAssemblerInfoForParsing` would make
MCDwarfFrameEmitter::Emit (for .eh_frame FDE) slow (~4% compile time
regression for sqlite3.c amalgamation) due to expensive
`AttemptToFoldSymbolOffsetDifference`. For now, make
`UseAssemblerInfoForParsing` false in MCDwarfFrameEmitter::Emit.
Close#62520
Link: https://discourse.llvm.org/t/rfc-clang-assembly-object-equivalence-for-files-with-inline-assembly/78841
Pull Request: https://github.com/llvm/llvm-project/pull/91082
This reverts commit 03c53c69a367008da689f0d2940e2197eb4a955c.
This causes very large compile-time regressions in some cases,
e.g. sqlite3 at O0 regresses by 5%.
Commit 6c0665e22174d474050e85ca367424f6e02476be
(https://reviews.llvm.org/D45164) enabled certain constant expression
evaluation for `MCObjectStreamer` at parse time (e.g. `.if` directives,
see llvm/test/MC/AsmParser/assembler-expressions.s).
`getUseAssemblerInfoForParsing` was added to make `clang -c` handling
inline assembly similar to `MCAsmStreamer` (e.g. `llvm-mc -filetype=asm`),
where such expression folding (related to
`AttemptToFoldSymbolOffsetDifference`) is unavailable.
I believe this is overly conservative. We can make some parse-time
expression folding work for `clang -c` even if `clang -S` would still
report an error, a MCAsmStreamer issue (we cannot print `.if`
directives) that should not restrict the functionality of
MCObjectStreamer.
```
% cat b.cc
asm(R"(
.pushsection .text,"ax"
.globl _start; _start: ret
.if . -_start == 1
ret
.endif
.popsection
)");
% gcc -S b.cc && gcc -c b.cc
% clang -S -fno-integrated-as b.cc # succeeded
% clang -c b.cc # succeeded with this patch
% clang -S b.cc # still failed
<inline asm>:4:5: error: expected absolute expression
4 | .if . -_start == 1
| ^
1 error generated.
```
Close#62520
Link: https://discourse.llvm.org/t/rfc-clang-assembly-object-equivalence-for-files-with-inline-assembly/78841
Pull Request: https://github.com/llvm/llvm-project/pull/91082
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
38 under llvm/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
Rename getNumVGPRBlocks to getEncodedNumVGPRBlocks, to clarify that it's
using the encoding granule. This is used to program the hardware. In
practice, the hardware will use the alloc granule instead, so this patch
also adds a new helper, getAllocatedNumVGPRBlocks, which can be useful
when driving heuristics.
These generic targets include multiple GPUs and will, in the future,
provide a way to build once and run on multiple GPU, at the cost of less
optimization opportunities.
Note that this is just doing the compiler side of things, device libs an
runtimes/loader/etc. don't know about these targets yet, so none of them
actually work in practice right now. This is just the initial commit to
make LLVM aware of them.
This contains the documentation changes for both this change and #76954
as well.
PAL Metadata 3.0 introduces an explicit structure in metadata for the
programmable registers written out by the compiler backend.
The previous approach used opaque registers which can change between different
architectures and required encoding the bitfield information in the backend,
which may change between versions.
This change is an extension the previously added support - which only handled
entry functions. This adds support for all functions.
The change also includes some re-factoring to separate common code.
Introduce Code Object V6 in Clang, LLD, Flang and LLVM. This is the same
as V5 except a new "generic version" flag can be present in EFLAGS. This
is related to new generic targets that'll be added in a follow-up patch.
It's also likely V6 will have new changes (possibly new metadata
entries) added later.
Docs change are part of the follow-up patch #76955
Named '.amdhsa_code_object_version'. This directive sets the
e_ident[ABIVERSION] in the ELF header, and should be used as the assumed
COV for the rest of the asm file.
This commit also weakens the --amdhsa-code-object-version CL flag.
Previously, the CL flag took precedence over the IR flag. Now the IR
flag/asm directive take precedence over the CL flag. This is implemented
by merging a few COV-checking functions in AMDGPUBaseInfo.h.
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
The special handling for blocks ending with a long branch has been
unnecessary since D106445:
"[amdgpu] Add 64-bit PC support when expanding unconditional branches."
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.
- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
This patch adds the DAG isel changes for kernel argument preloading.
These changes are not usable with older firmware but subsequent patches
in the series will make the codegen backwards compatible. This patch
should only be submitted alongside that subsequent patch.
Preloading here begins from the start of the kernel arguments until the
amount of arguments indicated by the CL flag
amdgpu-kernarg-preload-count.
Aggregates and arguments passed by-ref are not supported.
Special care for the alignment of the kernarg segment is needed as well
as consideration of the alignment of addressable SGPR tuples when we
cannot directly use misaligned large tuples that the arguments are
loaded to.
Reviewed By: bcahoon
Differential Revision: https://reviews.llvm.org/D158579
Code Object V2 has been deprecated for more than a year now. We can
safely remove it from LLVM.
- [clang] Remove support for the `-mcode-object-version=2` option.
- [lld] Remove/refactor tests that were still using COV2
- [llvm] Update AMDGPUUsage.rst
- Code Object V2 docs are left for informational purposes because those
code objects may still be supported by the runtime/loaders for a while.
- [AMDGPU] Remove COV2 emission capabilities.
- [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2
- [AMDGPU] Update all tests that were still using COV2 - They are either
deleted or ported directly to code object v4 (as v3 is also planned to
be removed soon).