Shaders compiled with DXC/LLPC generate these relocations, and even if
that changes in the future we still want to handle existing binaries. The
friction of supporting this and the long-term maintenance cost both seem
very low, considering that other targets like ARM already support both
REL and RELA static relocations behind the same interface.
Update the guideline to reduce the chance of miscompilation/performance
regression.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
Co-authored-by: Antonio Frighetto <me@antoniofrighetto.com>
Some data members are only part of a class definition in a Debug build,
e.g. `LVObject::ID`. If `debuginfologicalview` is used as a library,
`NDEBUG` cannot be used for this purpose, as this preprocessor macro may
have a different definition in a downstream project, which in turn
triggers an ODR violation. Fix it by:
- Making `LVObject::ID` an unconditional data member.
- Making `LVObject::dump()` non-virtual. Rationale: `virtual` is not
needed (and it calls `print()`, which is virtual anyway).
Fixes #139098.
When no profile is provided, but the new --empty-profile option is
specified, the export/report/show commands now emit coverage data
equivalent to that obtained from a profile with all zero counters
("baseline coverage").
This is useful for build systems (e.g. Bazel) that can track coverage
information for each build target, even those that are never linked into
tests and thus don't have runtime coverage data recorded. By merging in
baseline coverage, lines in files that aren't linked into tests are
correctly reported as uncovered.
Reland with fixes to `CoverageMappingTest.cpp`.
Reverts llvm/llvm-project#144121
Reverts llvm/llvm-project#117910
```
/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/unittests/ProfileData/CoverageMappingTest.cpp
/home/buildbots/llvm-external-buildbots/workers/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/unittests/ProfileData/CoverageMappingTest.cpp:281:28: error: 'std::reference_wrapper' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
281 | std::make_optional(std::reference_wrapper(*ProfileReader));
| ^
/usr/lib/gcc/ppc64le-redhat-linux/8/../../../../include/c++/8/bits/refwrap.h:289:11: note: add a deduction guide to suppress this warning
289 | class reference_wrapper
| ^
```
This PR fixes a bug in GatherToLDSOpLowering: we were using the
MemRefType of the source for the destination. Additionally, some related
typos are corrected.
CC: @krzysz00 @umangyadav @lialan
When no profile is provided, but the new --empty-profile option is
specified, the export/report/show commands now emit coverage data
equivalent to that obtained from a profile with all zero counters
("baseline coverage").
This is useful for build systems (e.g. Bazel) that can track coverage
information for each build target, even those that are never linked into
tests and thus don't have runtime coverage data recorded. By merging in
baseline coverage, lines in files that aren't linked into tests are
correctly reported as uncovered.
This patch is part of a series that adds origin-tracking to the debugify
source location coverage checks, allowing us to report symbolized stack
traces of the points where missing source locations appear.
This patch adds the configuration needed to enable this feature,
in the form of a new CMake setting that enables a flag in
`llvm-config.h`; this is not an entirely new CMake flag, but a new
value, `COVERAGE_AND_ORIGIN`, for the existing flag
`LLVM_ENABLE_DEBUGLOC_COVERAGE_TRACKING`. This patch contains
documentation, but no actual implementation for the flag itself.
…133242)"
This reverts commit 130080fab11cde5efcb338b77f5c3b31097df6e6 because it
causes issues in testcases similar to coalescer_remat.ll [1], i.e. when
we use a VGPR tuple but only write to its lower parts. The high VGPRs
would then not be included in the vgpr_count, and accessing them would
be an out-of-bounds violation.
[1]
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/coalescer_remat.ll
The following GitHub organizations were merged into the ROCm org:
* ROCm-Developer-Tools
* RadeonOpenCompute
* ROCmSoftwarePlatform
Ensure that all hyperlinks to the old organizations now point to the new
organization at https://github.com/ROCm.
This patch extends the TMA G2S intrinsics with support for
cta_group::1/2, available from Blackwell onwards.
The existing intrinsics are auto-upgraded with a default
value of '0' for the `cta_group` flag operand.
* lit tests are added for all combinations of the newer variants.
* Negative tests are added to validate the error handling
when the value of the `cta_group` flag falls out of range.
* The generated PTX is verified with a 12.8 ptxas executable.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This is a CLI tool that tests the conformance of LLVM's Mustache
implementation against the public Mustache spec, hosted at
https://github.com/mustache/spec. This is a revised version of the
patches in #111487.
Co-authored-by: Peter Chou <peter.chou@mail.utoronto.ca>
This patch extends the llvm.histogram intrinsic to support additional
update operations beyond the existing add. Specifically, the new
supported operations are:
* umax: unsigned maximum
* umin: unsigned minimum
* uadd.sat: unsigned saturated addition
Based on the discussion from:
https://discourse.llvm.org/t/rfc-expanding-the-experimental-histogram-intrinsic/84673
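For illustration, a minimal LLVM IR sketch of the histogram intrinsic and one of the new variants; the spelling of the new variant is an assumption that follows the existing `llvm.experimental.vector.histogram.add` naming scheme:
```
; Existing form: masked histogram update that adds the increment to each bucket.
declare void @llvm.experimental.vector.histogram.add.v8p0.i32(<8 x ptr>, i32, <8 x i1>)
; Assumed spelling of one of the new variants (unsigned maximum).
declare void @llvm.experimental.vector.histogram.umax.v8p0.i32(<8 x ptr>, i32, <8 x i1>)

define void @update(<8 x ptr> %buckets, <8 x i1> %mask) {
  call void @llvm.experimental.vector.histogram.add.v8p0.i32(<8 x ptr> %buckets, i32 1, <8 x i1> %mask)
  call void @llvm.experimental.vector.histogram.umax.v8p0.i32(<8 x ptr> %buckets, i32 42, <8 x i1> %mask)
  ret void
}
```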
This change supports target extension types in vectors. It
allows sized target extension types to opt in to being a valid vector
element.
Allowing target extension types as vector elements will allow backends
to use vector operations such as `insertelement` and `extractelement` on
their target types with minimal changes.
RFC:
https://discourse.llvm.org/t/rfc-supporting-sized-target-extension-types-in-vector/86431
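A hedged sketch of what this enables for vectors of target extension types; the type name `target("example.ty")` is purely illustrative, and whether a given sized target type opts in is decided by the target:
```
; Vector whose elements are a sized target extension type.
define target("example.ty") @get_lane(<4 x target("example.ty")> %v, target("example.ty") %x) {
  %v2 = insertelement <4 x target("example.ty")> %v, target("example.ty") %x, i32 0
  %e = extractelement <4 x target("example.ty")> %v2, i32 2
  ret target("example.ty") %e
}
```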
This pull request adds support for parsing the source language in both
DWARF and CodeView. Specifically:
- The `LVSourceLanguage` class is introduced to represent any language
supported by any of the debug info representations.
- Update `LVDWARFReader.cpp` and `LVCodeViewVisitor.cpp` to parse the
source language where applicable. Added a new `=Language` attribute;
`getAttributeLanguage()` is used internally to control whether this
information is printed.
If a test took more than one attempt to complete, show the number of attempts and the maximum allowed attempts, e.g. `2 of 4 attempts`, inside the `<progress info>` (see [TEST RUN OUTPUT FORMAT](https://llvm.org/docs/CommandGuide/lit.html#test-run-output-format)).
NOTE: Additionally, this is a fixup for #141851, where the tests were not quite right. `max-retries-per-test/allow-retries-test_retry_attempts/test.py` was added but never used there. Now we're calling it. To correlate the test output with the test script more easily, I've used higher maximum retry counts.
This is a fixup for #141851 and removes `=` from all
options with additional arguments.
Before, 14 out of 22 options with arguments used `=` and 7 didn't.
Add an `alloc-variant-zeroed` function attribute which can be used to
inform folding of allocation+memset. This addresses
https://github.com/rust-lang/rust/issues/104847, where LLVM does not
know how to perform this transformation for non-C languages.
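A minimal sketch of the intended use, assuming the attribute value names the allocator's zero-initializing variant; the allocator names here are hypothetical:
```
declare ptr @my_alloc(i64) allockind("alloc,uninitialized") "alloc-family"="my_alloc" "alloc-variant-zeroed"="my_alloc_zeroed"
declare ptr @my_alloc_zeroed(i64) allockind("alloc,zeroed") "alloc-family"="my_alloc"
declare void @llvm.memset.p0.i64(ptr, i8, i64, i1)

define ptr @alloc_zeroed_buffer(i64 %n) {
  %p = call ptr @my_alloc(i64 %n)
  call void @llvm.memset.p0.i64(ptr %p, i8 0, i64 %n, i1 false)
  ; With the attribute, the allocation+memset pair above may be folded into:
  ;   %p = call ptr @my_alloc_zeroed(i64 %n)
  ret ptr %p
}
```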
Co-authored-by: Jamie <jamie@osec.io>
On x86, the `*l` libcalls are for 80-bit extended precision. `fp128`
needs to use the `*f128` libcalls instead.
Add a few missing ones, especially for FP min/max.
Also use the `f128` libcalls on x86-32. I believe the situation there is
the same as on x86-64.
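As a hedged illustration of the distinction (exact libcall names depend on the target runtime), fp128 min/max should now lower along these lines:
```
; On x86, this should lower to the f128 libcall (e.g. fminf128)
; rather than the 80-bit long double one (fminl).
declare fp128 @llvm.minnum.f128(fp128, fp128)

define fp128 @min_f128(fp128 %a, fp128 %b) {
  %r = call fp128 @llvm.minnum.f128(fp128 %a, fp128 %b)
  ret fp128 %r
}
```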
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.
In particular, the register counts are used:
1. When launching an entry function - in which case we're safe because
the register counts of the entry function will include the register
counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
because whenever we set the new VGPR allocation we take into account
the outgoing_vgpr_count set by the middle-end.
The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.
---------
Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or
more labels / basic blocks in the containing function which the assembly
code might jump to. If you're also compiling with branch target
enforcement via BTI, then previously listing a basic block as a possible
jump destination of an asm goto would cause a BTI instruction to be
placed at the start of the block, in case the assembly code used an
_indirect_ branch instruction (i.e. to a destination address read from a
register) to jump to that location. Now it doesn't do that any more:
branches to destination labels from the assembly code are assumed to be
direct branches (to a relative offset encoded in the instruction), which
don't require a BTI at their destination.
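For reference, a minimal IR sketch of the situation (not taken from the patch): the callbr lists %target as a possible destination, and after this change no BTI is required at %target because branches to it from the asm are assumed to be direct.
```
define void @f() {
entry:
  callbr void asm sideeffect "nop", "!i"()
          to label %fallthrough [label %target]

fallthrough:
  ret void

target:
  ret void
}
```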
This change was proposed in https://discourse.llvm.org/t/85845 and there
seemed to be no disagreement. The rationale is:
1. it brings clang's handling of asm goto in Arm and AArch64 in line
with gcc's, which didn't generate BTIs at the target labels in the first
place.
2. it improves performance in the Linux kernel, which uses a lot of 'asm
goto' in which the assembly language just contains a NOP, and the
label's address is saved elsewhere to let the kernel self-modify at run
time to swap between the original NOP and a direct branch to the label.
This allows hot code paths to be instrumented for debugging, at only the
cost of a NOP when the instrumentation is turned off, instead of the
larger cost of an indirect branch. In this situation a BTI is
unnecessary (if the branch happens, it's direct) and, since the code
paths are hot, also a noticeable performance hit.
Implementation:
`SelectionDAGBuilder::visitCallBr` is the place where 'asm goto' target
labels are handled. It calls `setIsInlineAsmBrIndirectTarget()` on each
target `MachineBasicBlock`. Previously it also called
`setMachineBlockAddressTaken()`, which made `hasAddressTaken()` return
true, which caused a BTI to be added in the Arm backends.
Now `visitCallBr` doesn't call `setMachineBlockAddressTaken()` any more
on asm goto targets, but `hasAddressTaken()` also checks the flag set by
`setIsInlineAsmBrIndirectTarget()`. So call sites that were using
`hasAddressTaken()` don't need to be modified. But the Arm backends
don't call `hasAddressTaken()` any more: instead they test two more
specific query functions that cover all the reasons `hasAddressTaken()`
might have returned true _except_ being an asm goto target.
Testing:
The new test `AArch64/callbr-asm-label-bti.ll` is testing the actual
change, where it expects not to see a `bti` instruction after
`[[LABEL]]`. The rest of the test changes are all churn, due to the
flags on basic blocks changing. Actual output code hasn't changed in any
of the existing tests, only comments and diagnostics.
Further work:
`RISCVIndirectBranchTracking.cpp` and `X86IndirectBranchTracking.cpp`
also call `hasAddressTaken()` in a way that might benefit from using the
same more specific check I've put in `ARMBranchTargets.cpp` and
`AArch64BranchTargets.cpp`. But I'm not sure of that, so in this commit
I've only changed the Arm backends, and left those alone.
This PR removes the need to know the dimension of the embeddings when invoking `Embedder::Create`. Having the `Dimension` parameter introduces complexity in downstream consumers.
(Tracking issue - #141817)
When packaging LLVM we've seen arbitrary tests fail.
It happened sporadically, and most of the time the tests
do work if they are run a second time on the next day.
The tests themselves were always different and we didn't
know ahead of time which ones we wanted to re-run.
That's why we filter out a lot of `libomp` and `libarcher` tests [1].
This change allows us to set
`LIT_OPTS="--max-retries-per-test=12"`
when running any "check-XXX" build target. Then any lit test
will be re-run at most 12 times, unless there's an `ALLOW_RETRIES:`
in one of the test scripts that specifies a different value
than `12`. `12` is just an example here; any positive integer
will work.
Please note that this only adds the possibility to re-run
lit tests. It does not actually do so until the caller specifies
`--max-retries-per-test=<POSITIVE_INT>`, either on a call to `lit` or
in `LIT_OPTS`.
Also note that one can still use `ALLOW_RETRIES:` in test scripts,
and it will always rule over `--max-retries-per-test`. When
`--max-retries-per-test` is set too low but
`config.test_retry_attempts` is high enough, that works as well.
Any option in the list below overrules its predecessor:
* `--max-retries-per-test`
* `config.test_retry_attempts`
* `ALLOW_RETRIES` keyword
Of the above options to re-run tests, `--max-retries-per-test` is the
only one that doesn't require a change in the test scripts or the test
config.
[1]:
https://src.fedoraproject.org/rpms/llvm/blob/rawhide/f/llvm.spec#_2326
Downstream PR to make use of the `--max-retries-per-test` option:
https://src.fedoraproject.org/rpms/llvm/pull-request/434
Downstream ticket: https://issues.redhat.com/browse/LLVM-145
Currently, users have to invoke two APIs: `computeEmbeddings()` followed
by getters to access the embeddings. This PR refactors the code to
reduce this *stateful* access pattern. Users can now directly invoke the
getters; internally, the getters compute the embeddings.
This adds [de]interleave intrinsics for factors 4, 6, and 8, so that every
interleaved memory operation supported by the in-tree targets can be
represented by a single intrinsic.
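For example, factor 4 would be exposed roughly as follows; this is a sketch, and the mangled suffixes are assumed to follow the existing interleave2/3/5/7 scheme:
```
; Interleave four scalable vectors into one wide vector.
declare <vscale x 16 x i32> @llvm.vector.interleave4.nxv16i32(
    <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>)

; Split one wide vector back into its four de-interleaved parts.
declare { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> }
    @llvm.vector.deinterleave4.nxv16i32(<vscale x 16 x i32>)
```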
For context, [de]interleaves of fixed-length vectors are represented by
a series of shufflevectors. The intrinsics are needed for scalable
vectors, and we don't currently scalably vectorize all possible factors
of interleave groups supported by RISC-V/AArch64.
The underlying reason for this is that higher factors are currently
represented by interleaving multiple interleaves themselves, which made
sense at the time in the discussion in
https://github.com/llvm/llvm-project/pull/89018.
But after trying to integrate these for higher factors on RISC-V I think
we should revisit this design choice:
- Matching these in InterleavedAccessPass is non-trivial: We currently
only support factors that are a power of 2, and detecting this requires
a good chunk of code
- The shufflevector masks used for [de]interleaves of fixed-length
vectors are much easier to pattern match as they are strided patterns,
but for the intrinsics it's much more complicated to match as the
structure is a tree.
- Unlike shufflevectors, there's no optimisation that happens on
[de]interleave2 intrinsics
- For non-power-of-2 factors, e.g. 6, there are multiple possible ways a
[de]interleave could be represented; see the discussion in #139373
- We already have intrinsics for factors 2, 3, 5, and 7, so by avoiding
4, 6, and 8 we're not really saving much
By representing these higher factors as interleaves of interleaves, we can
in theory support arbitrarily high interleave factors. However, I'm not
sure this is actually needed in practice: SVE only has instructions
for factors 2, 3, and 4, whilst RVV only supports up to factor 8.
This patch would make it much easier to support scalable interleaved
accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as
the loop vectorizer and InterleavedAccessPass wouldn't need to
construct and match trees of interleaves.
For interleave factors above 8, for which there are no hardware memory
operations to match in the InterleavedAccessPass, we can still keep the
wide load + recursive interleaving in the loop vectorizer.
This relands #141031
This change ensures generated SPIR-V is valid and passes machine
verification:
```
*** Bad machine code: inconsistent constant size ***
- function: foo
- basic block: %bb.1 entry (0x9ec9298)
- instruction: %12:iid(s8) = G_CONSTANT i4 1
```
That is done by promoting `G_CONSTANT` instructions with small integer
types (e.g., `i4`) to `i8` if no extensions for "special" integer types
are enabled.
This patch adds initial support for Integrated Distributed ThinLTO
(DTLTO) in LLVM, which manages distribution internally during the
traditional link step. This enables compatibility with any build
system that supports in-process ThinLTO. In contrast, existing
approaches to distributed ThinLTO, which split the thin-link
(--thinlto-index-only), backend compilation, and final link into
separate steps, require build system support, e.g. Bazel.
This patch implements the core DTLTO mechanism, which enables
delegation of ThinLTO backend jobs to an external process (the
distributor). The distributor can then manage job distribution through
systems like Incredibuild. A generic JSON interface is used to
communicate with the distributor, allowing for the creation of new
distributors (and thus integration with different distribution
systems) without modifying LLVM.
Please see llvm/docs/dtlto.rst for more details.
RFC: https://discourse.llvm.org/t/rfc-integrated-distributed-thinlto/69641
Design Review: https://github.com/llvm/llvm-project/pull/126654
Utilize the new extensions to the LLVM Offloading API to extend
llvm-objdump to handle dumping fatbin offload bundles generated by HIP.
This extension to llvm-objdump adds the option --offload-fatbin.
Specifying this option will take the input object/executable and extract
all offload fatbin bundle entries into distinct code object files with
names reflecting the source file name combined with the Bundle Entry ID.
Users can also use the --arch-name option to filter offload fatbin
bundle entries by their target triple.
---------
Co-authored-by: dsalinas <dsalinas@MKM-L1-DSALINAS.amd.com>
Note: This relands #140615, adding a ".count" suffix to the non-".all"
variants.
Our current support for barrier intrinsics is confusing and
incomplete, with multiple intrinsics mapping to the same instruction and
intrinsic names not clearly conveying their semantics. Further, we
lack support for some variants. This change unifies the IR
representation into a single, consistently named set of intrinsics:
- llvm.nvvm.barrier.cta.sync.aligned.all(i32)
- llvm.nvvm.barrier.cta.sync.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.aligned.count(i32, i32)
- llvm.nvvm.barrier.cta.sync.all(i32)
- llvm.nvvm.barrier.cta.sync.count(i32, i32)
- llvm.nvvm.barrier.cta.arrive.count(i32, i32)
The following Auto-Upgrade rules are used to maintain compatibility with
IR using the legacy intrinsics:
* llvm.nvvm.barrier0 --> llvm.nvvm.barrier.cta.sync.aligned.all(0)
* llvm.nvvm.barrier.n --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.bar.sync --> llvm.nvvm.barrier.cta.sync.aligned.all(x)
* llvm.nvvm.barrier --> llvm.nvvm.barrier.cta.sync.aligned.count(x, y)
* llvm.nvvm.barrier.sync --> llvm.nvvm.barrier.cta.sync.all(x)
* llvm.nvvm.barrier.sync.cnt --> llvm.nvvm.barrier.cta.sync.count(x, y)
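A short usage sketch of the unified intrinsics; the barrier ID and thread count values are illustrative only:
```
define void @kernel_sync() {
  ; Equivalent of the legacy llvm.nvvm.barrier0: aligned full-CTA sync on barrier 0.
  call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0)
  ; Arrive on barrier 1 with an expected count of 64 threads, without waiting.
  call void @llvm.nvvm.barrier.cta.arrive.count(i32 1, i32 64)
  ret void
}

declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32)
declare void @llvm.nvvm.barrier.cta.arrive.count(i32, i32)
```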