This reverts commit 14cd1339318b16e08c1363ec6896bd7d1e4ae281. The
buildbot failure seems to have been a cmake issue which has been
discussed in more detail in this Discourse post:
https://discourse.llvm.org/t/cmake-doesnt-regenerate-all-tablegen-target-files/87901
If any buildbots fail to select arbitrary intrinsics with this patch,
it's worth considering using clean builds with ccache instead of
incremental builds, as recommended here:
https://llvm.org/docs/HowToAddABuilder.html#:~:text=Use%20CCache%20and%20NOT%20incremental%20builds
The original commit message for this patch:
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Tail calls are handled in a future patch.
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Unspeakable horrors happen around calls from whole wave functions; the
plan is to improve the handling of caller/callee-saved registers in
a future patch.
Tail calls are also handled in a future patch.
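For illustration, a sketch of the intended use (the exact intrinsic
signature and mangling may differ from what is shown here):

```llvm
; the callee uses the whole wave calling convention and receives the
; original EXEC mask as its first argument
declare amdgpu_gfx_whole_wave i32 @good_callee(i1 %active, i32 %x)

define amdgpu_gfx i32 @caller(i32 %x) {
  ; %active is not passed in explicitly; the remaining arguments must
  ; match the callee's signature
  %r = call i32 (ptr, ...) @llvm.amdgcn.call.whole.wave(ptr @good_callee, i32 %x)
  ret i32 %r
}
```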
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.
An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
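For illustration, a sketch of the new metadata based on the description
above (the second form shows a loop for which no estimate could be
computed):

```llvm
; on the loop's latch branch
br i1 %cond, label %loop, label %exit, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.estimated_trip_count", i32 8} ; estimated from branch_weights
!2 = !{!"llvm.loop.estimated_trip_count"}        ; metadata present, value omitted
```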
lifetime.start and lifetime.end are primarily intended for use on
allocas, to enable stack coloring and other liveness optimizations. This
is necessary because all (static) allocas are hoisted into the entry
block, so lifetime markers are the only way to convey the actual
lifetimes.
However, lifetime.start and lifetime.end are currently *allowed* to be
used on non-alloca pointers. We don't actually do this in practice, but
the mere fact that it is possible breaks the core purpose of the
lifetime markers, which is stack coloring of allocas. Stack coloring can
only work correctly if all lifetime markers for an alloca are
analyzable.
* If a lifetime marker may operate on multiple allocas via a select/phi,
we don't know which lifetime actually starts/ends and handle it
incorrectly (https://github.com/llvm/llvm-project/issues/104776); see
the IR sketch after this list.
* Stack coloring operates on the assumption that all lifetime markers
are visible, and not, for example, hidden behind a function call or
escaped pointer. It's not possible to change this, as part of the
purpose of lifetime markers is that they work even in the presence of
escaped pointers, where simple use analysis is insufficient.
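A minimal sketch of the select case from the first bullet:

```llvm
%a = alloca i64
%b = alloca i64
%p = select i1 %c, ptr %a, ptr %b
; which alloca's lifetime ends here? Stack coloring cannot tell, and
; previously handled this incorrectly (issue #104776)
call void @llvm.lifetime.end.p0(i64 8, ptr %p)
```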
I don't think there is any way to have coherent semantics for lifetime
markers on allocas, while also permitting them on arbitrary pointer
values.
This PR restricts lifetimes to operate on allocas only. As a followup, I
will also drop the size argument, which is superfluous if we always
operate on an alloca. (This change also renders various code handling
lifetime markers on non-allocas dead. I plan to clean up that kind of
code after dropping the size argument as well.)
In practice, I've only found a few places that currently produce
lifetimes on non-allocas:
* CoroEarly replaces the promise alloca with the result of an intrinsic,
which will later be replaced back with an alloca. I think this is the
only place where there is some legitimate loss of functionality, but I
don't think this is particularly important (I don't think we'd expect
the promise in a coroutine to admit useful lifetime optimization.)
* SafeStack moves unsafe allocas onto a separate frame. We can safely
drop lifetimes here, as SafeStack performs its own stack coloring.
* Similarly for AddressSanitizer, which also moves allocas into separate
memory.
* LSR sometimes replaces the lifetime argument with a GEP chain of the
alloca (where the offsets ultimately cancel out). This is just
unnecessary. (Fixed separately in
https://github.com/llvm/llvm-project/pull/149492.)
* InferAddrSpaces sometimes makes lifetimes operate on an addrspacecast
of an alloca. I don't think this is necessary.
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.
Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used; the active lanes only
for the CSRs.
At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
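A sketch of what this looks like at the IR level (using the amdgpu_gfx
flavour mentioned above):

```llvm
define amdgpu_gfx i32 @whole_wave_fn(i1 %active, i32 %x) {
  ; %active is mapped to the EXEC mask on entry; the inactive lanes of
  ; %x contain poison
  %r = add i32 %x, 1
  ; the return value is likewise poison in the inactive lanes
  ret i32 %r
}
```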
This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
a SReg_1 representing `%active`, which needs to be passed into
SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
special handling of %active. Since the return may be in a different
basic block, it's difficult to add the virtual reg for %active to
SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
marks any used VGPRs as WWM registers, which are then spilled and
restored with the usual logic.
Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
Update isDereferenceableAndAlignedPointer to make use of dereferenceable
assumptions with variable sizes via SCEV.
To do so, factor out the logic for checking dereferenceability via an
assumption into a helper, and use SE to check whether the access size is
less than the dereferenceable size.
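For illustration, the kind of IR this enables (a sketch; the load is
only known dereferenceable when SCEV can prove the accessed range fits
in `%n` bytes):

```llvm
define i32 @load_via_assume(ptr %p, i64 %n, i64 %i) {
  ; variable-size dereferenceability; previously only constant sizes
  ; could be used
  call void @llvm.assume(i1 true) [ "dereferenceable"(ptr %p, i64 %n), "align"(ptr %p, i64 4) ]
  %gep = getelementptr i8, ptr %p, i64 %i
  %v = load i32, ptr %gep ; ok if SCEV proves %i + 4 <= %n
  ret i32 %v
}
```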
PR: https://github.com/llvm/llvm-project/pull/128436
This applies the same check as llvm.vector.splice which checks that the immediate is in the range [-VL, VL-1] where VL is the minimum vector length. If vscale_range is available, the lower bound is used to increase the known minimum vector length for this check. This ensures the immediate is in range for any possible value of vscale that satisfies the vscale_range.
Add the `dead_on_return` attribute, which is meant to be taken advantage
of by the frontend, and states that the memory pointed to by the argument
is dead upon function return. As with `byval`, it is supposed to be
used for passing aggregates by value. The difference lies in the ABI:
`byval` implies that the pointer is explicitly passed as an argument to
the callee (during codegen the copy is emitted as per the byval
contract), whereas a `dead_on_return`-marked argument implies that the
copy already exists in the IR, is located at a specific stack offset
within the caller, and that this memory will not be read again by the
caller once the callee returns (any read before it is written again
yields poison).
RFC: https://discourse.llvm.org/t/rfc-add-dead-on-return-attribute/86871.
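A sketch of the intended IR pattern (the struct and function names are
illustrative):

```llvm
%struct.S = type { i64, i64 }

declare void @callee(ptr dead_on_return)

define void @caller(ptr %src) {
  ; the by-value copy already exists in the caller's IR
  %tmp = alloca %struct.S
  call void @llvm.memcpy.p0.p0.i64(ptr %tmp, ptr %src, i64 16, i1 false)
  call void @callee(ptr dead_on_return %tmp)
  ; %tmp is not read again here; any read before writing yields poison
  ret void
}
```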
This PR is part of https://discourse.llvm.org/t/rfc-profile-information-propagation-unittesting/73595
In a slight departure from the RFC, instead of a brand-new `MD_prof_unknown` kind, this adds a first operand to the `MD_prof` metadata. This makes it easy to replace with valid metadata, since there is only one `MD_prof`; otherwise, sites inserting valid `MD_prof` would also have to check for and remove the `unknown` one.
The patch just introduces the notion and fixes the verifier accordingly. Existing APIs working with (esp. reading) `MD_prof` will be updated subsequently.
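Illustratively, a sketch of the two forms (the exact operands may differ
as the follow-up API work lands):

```llvm
br i1 %c, label %then, label %else, !prof !0 ; known profile
br i1 %d, label %a, label %b, !prof !1       ; explicitly unknown profile

!0 = !{!"branch_weights", i32 7, i32 93}
!1 = !{!"unknown"}
```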
For some reason, some of the checks for specific assume bundle elements
exit early if the check passes, meaning we don't verify other entries.
Replace the early returns with early continues.
This also requires removing some tests that are currently rejected. They will
be added back as part of https://github.com/llvm/llvm-project/pull/128436.
PR: https://github.com/llvm/llvm-project/pull/145586
Also fix the LangRef to match the implementation. This was checking
against the alloca address space size rather than that of the default
address space.
The check was also more permissive than the LangRef. The error
check permitted any size less than the pointer size; follow the
stricter wording of the LangRef.
This patch attaches the range attribute to the setmaxnreg
and fence.proxy.tensormap.* intrinsics. The range checks
are now handled generically in the Verifier, so this patch
removes the per-intrinsic error handling for range checks
from the Verifier.
This patch also adds more tests covering these cases.
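For example, roughly (the bounds shown are illustrative, not necessarily
the exact values used by the patch):

```llvm
; the range is encoded on the argument and verified generically
declare void @llvm.nvvm.setmaxnreg.inc.sync.aligned.u32(i32 range(i32 24, 257))
```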
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
By chance, two things have prevented the autoupgrade path being
exercised much so far:
* LLParser setting the debug-info mode to "old" on seeing intrinsics,
* The test in AutoUpgrade.cpp wanting to upgrade into a "new" debug-info
block.
In practice, this appears to mean this code path hasn't seen the various
invalid inputs that can come its way. This commit does a number of
things:
* Tolerates the various illegal inputs that can be written with
debug-intrinsics, which must be tolerated until the Verifier runs,
* Makes printing of illegal/null DbgRecord fields succeed,
* Localises Verifier errors to the function/block where the error is,
* Makes tests that now see debug records print debug-record errors,
Plus a few new tests for other intrinsic-to-debug-record failure modes
I found. There are also two edge cases:
* Some of the unit tests switch back and forth between intrinsic and
record modes at will; I've deleted coverage and some assertions to
tolerate this as intrinsic support is now Gone (TM),
* In sroa-extract-bits.ll, the order of debug records flips. This is
because the autoupgrader upgrades in the opposite order to the basic
block conversion routines... which doesn't change the record order, but
_does_ change the use list order in Metadata! This should (TM) have no
consequence to the correctness of LLVM, but will change the order of
various records and the order of DWARF record output too.
I tried to reduce this patch to a smaller collection of changes, but
they're all intertwined, sorry.
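For reference, the autoupgrade path in question converts the intrinsic
form into the record form, roughly:

```llvm
; old intrinsic form, now autoupgraded:
call void @llvm.dbg.value(metadata i32 %x, metadata !10, metadata !DIExpression()), !dbg !20
; new debug record form:
#dbg_value(i32 %x, !10, !DIExpression(), !20)
```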
This change is to support target extension types in vectors. The change
allows sized target extension types to opt in to being a valid vector
element.
Allowing target extension types as vector elements will allow backends
to use vector operations such as `insertelement` and `extractelement` on
their target types with minimal changes.
RFC:
https://discourse.llvm.org/t/rfc-supporting-sized-target-extension-types-in-vector/86431
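For example, something like the following becomes legal for a sized
target extension type that opts in (the type name here is hypothetical):

```llvm
define void @f(<4 x target("example.ty")> %v, target("example.ty") %e) {
  %v2 = insertelement <4 x target("example.ty")> %v, target("example.ty") %e, i32 0
  %e2 = extractelement <4 x target("example.ty")> %v2, i32 1
  ret void
}
```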
Add an `alloc-variant-zeroed` function attribute which can be used to
inform the folding of allocation+memset pairs. This addresses
https://github.com/rust-lang/rust/issues/104847, where LLVM does not
know how to perform this transformation for non-C languages.
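A sketch of how a frontend might use it (the function names are
illustrative):

```llvm
; @my_zalloc returns zero-initialized memory, so LLVM may fold
; a call to @my_malloc followed by a zeroing memset into @my_zalloc
declare ptr @my_malloc(i64) "alloc-variant-zeroed"="my_zalloc"
declare ptr @my_zalloc(i64)
```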
Co-authored-by: Jamie <jamie@osec.io>
Before this commit, having a convergence token on a non-convergent call
was considered to be an error.
This commit relaxes this requirement and allows convergence tokens to be
present on non-convergent calls.
When such a token is present, it has no effect, as the underlying call
is non-convergent.
This allows passes like DCE to strip the `convergent` attribute from
functions for which all convergent operations have been stripped. When
this happens, a convergence token can still exist in the call-site,
causing the verifier to complain.
Alternatives have been considered in #134863 and #134844.
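Concretely, IR like the following is now accepted even though `@g` is
not convergent (a sketch):

```llvm
define void @f() convergent {
  %t = call token @llvm.experimental.convergence.entry()
  ; previously a verifier error because @g lacks the convergent
  ; attribute; now allowed, and the bundle simply has no effect
  call void @g() [ "convergencectrl"(token %t) ]
  ret void
}

declare void @g()
```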
While external globals can be unsized, I don't think an unsized
initializer makes sense.
It seems like the backend currently ends up treating this as a zero-size
global. If people want that behavior, they should declare it as such.
We currently accept label arguments to inline asm calls. This support
predates both blockaddresses and callbr and is only covered by one X86
test. Remove it in favor of callbr (or at least blockaddress, though
that cannot guarantee correct codegen, just like using block labels
directly can't).
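The callbr replacement uses the standard asm-goto form:

```llvm
define void @foo() {
entry:
  ; the target block is a label constraint rather than a call argument
  callbr void asm "jmp ${0:l}", "!i"() to label %fallthrough [label %indirect]

fallthrough:
  ret void

indirect:
  ret void
}
```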
I didn't bother implementing bitcode upgrade support for this, but I can
add it if desired.
This PR updates the `Verifier` to enforce that `alloca` instructions on
AMDGPU must be in AS5. This prevents hitting a misleading backend error
like "unable to select FrameIndex," which makes it look like a backend
bug when it's actually an IR-level issue.
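That is, on AMDGPU:

```llvm
; rejected by the verifier: alloca in the default address space
%bad = alloca i32
; accepted: allocas must be in the private address space (AS5)
%ok = alloca i32, addrspace(5)
```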
In #132722 spills of ZT0 were disabled around all SME ABI routines to
avoid a case where ZT0 is spilled before ZA is enabled (resulting in a
crash).
It turns out that the ABI does not promise that routines will preserve
ZT0 (however in practice they do), so generally disabling ZT0 spills for
ABI routines is not correct.
The case where a crash was possible was "aarch64_new_zt0" functions with
ZA disabled on entry and a ZT0 spill around __arm_tpidr2_save. In this
case, ZT0 will be undefined at the call to __arm_tpidr2_save, so this
patch avoids the ZT0 spill by marking the callsite with
"aarch64_zt0_undef". This attribute only applies to callsites and marks
that at the point the call is made ZT0 is not defined, so does not need
preserving.
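A sketch of the resulting callsite:

```llvm
; ZT0 is undefined at this point, so no ZT0 spill/fill is emitted
; around the call
call void @__arm_tpidr2_save() "aarch64_zt0_undef"
```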
Fixes #126417.
Currently, assignment tracking recognizes allocas, stores, and mem
intrinsics as valid instructions to tag with DIAssignID, with allocas
representing the allocation for a variable and the others representing
instructions that may assign to the variable. There are other intrinsics
that can perform these assignments however, and if we transform a store
instruction into one of these intrinsics and correctly transfer the
DIAssignID over, this results in a verifier error. The
AssignmentTrackingAnalysis pass also does not know how to handle these
intrinsics if they are untagged, as it does not know how to extract
assignment information (base address, offset, size) from them.
This patch adds _some_ support for some intrinsics that may perform
assignments: masked store/scatter, and vp store/strided store/scatter.
This patch does not add support for extracting assignment information
from these, as they may store with either non-constant size or to
non-contiguous blocks of memory; instead it adds support for recognizing
untagged stores with "unknown" assignment info, for which we assume that
the memory location of the associated variable should not be used, as we
can't determine which fragments of it should or should not be used.
In principle, it should be possible to handle the more complex cases
mentioned above, but it would require more substantial changes to
AssignmentTrackingAnalysis, and it is mostly only needed as a fallback
if the DIAssignID is not preserved on these alternative stores.
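For example, a masked store can now carry a `DIAssignID` (a sketch; the
store is treated as an assignment with unknown assignment info):

```llvm
call void @llvm.masked.store.v4i32.p0(<4 x i32> %v, ptr %dst, i32 4, <4 x i1> %mask), !DIAssignID !5

!5 = distinct !DIAssignID()
```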
With -fpatchable-function-entry (or the patchable_function_entry
function attribute), we emit records of patchable entry locations to the
__patchable_function_entries section. Add an additional parameter to the
command line option that allows one to specify a different default
section name for the records, and an identical parameter to the function
attribute that allows one to override the section used.
The main use case for this change is the Linux kernel using prefix NOPs
for ftrace, and thus depending on __patchable_function_entries to locate
traceable functions. Functions that are not traceable currently disable
entry NOPs using the function attribute, but this creates a
compatibility issue with -fsanitize=kcfi, which expects all indirectly
callable functions to have a type hash prefix at the same offset from
the function entry.
Adding a section parameter would allow the kernel to distinguish between
traceable and non-traceable functions by adding entry records to
separate sections while maintaining a stable function prefix layout for
all functions. LKML discussion:
https://lore.kernel.org/lkml/Y1QEzk%2FA41PKLEPe@hirez.programming.kicks-ass.net/
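At the IR level this might look roughly as follows (the spelling of the
section attribute here is illustrative, not necessarily the exact name
used by the patch):

```llvm
define void @traced() #0 {
  ret void
}

; hypothetical attribute spelling for the section override
attributes #0 = { "patchable-function-entry"="5" "patchable-function-entry-section"="__patchable_function_entries_traced" }
```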
I initially thought starting with a more narrow definition and later
expanding would make more sense. But as pointed out in review for PR
#129220, this restriction is generating additional unnecessary work.
This patch alters the intrinsic to accept patterns of any type. Future
patches will update LoopIdiomRecognize and PreISelIntrinsicLowering to
take advantage of this. The verifier will complain if an unsized type is
used. I've additionally taken the opportunity to remove a comment from
the LangRef about some bit widths potentially not being supported by the
target. I don't think this is any more true than it is for arbitrary
width loads/stores, which don't carry a similar warning as far as I can
see.
A verifier check ensures that only sized types are used for the pattern.
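For instance, a non-integer pattern type becomes acceptable (a sketch,
assuming the experimental intrinsic name and the usual overloaded-name
mangling):

```llvm
; fill %count elements at %dst with a float pattern
call void @llvm.experimental.memset.pattern.p0.f32.i64(ptr %dst, float %pat, i64 %count, i1 false)
```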
The `llvm.wasm.throw` intrinsic can throw, but it was not invokable. I'm
not sure what the rationale was when it was first written that way, but
I think at least in Emscripten's C++ exception support with the Wasm
port of libunwind, `__builtin_wasm_throw`, which is lowered down to
`llvm.wasm.throw`, is used only within `_Unwind_RaiseException`, which
is a one-liner and thus does not need an `invoke`:
720e97f76d/system/lib/libunwind/src/Unwind-wasm.c (L69)
(`_Unwind_RaiseException` is called by `__cxa_throw`, which is generated
by the `throw` C++ keyword)
But this does not address other direct uses of the builtin in C++,
whose use I'm not sure about but which is not prohibited. Also, other
language frontends may need to use the builtin in functions that have
`try`-`catch`es or destructors.
This makes `llvm.wasm.throw` invokable in the backend. To do that, this
adds a custom lowering routine to `SelectionDAGBuilder::visitInvoke`,
like we did for `llvm.wasm.rethrow`.
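With this change, IR like the following can now be selected (a sketch;
tag 0 is the C++ exception tag):

```llvm
invoke void @llvm.wasm.throw(i32 0, ptr %exn)
        to label %cont unwind label %catch.dispatch
```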
This does not generate `invoke`s for `__builtin_wasm_throw` yet, which
will be done by a follow-up PR.
Addresses #124710.
Check that the input register is an inreg argument to the parent
function. (See the comment in `IntrinsicsAMDGPU.td`.)
This LLVM defect was identified via the AMD Fuzzing project.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Make sure that this intrinsic is being followed by unreachable.
This LLVM defect was identified via the AMD Fuzzing project.
Thanks to @rovka for helping me solve this issue!
This removes some erroneous debug info from tests, which should address
the test failures that showed up when this was previously committed.
This reverts commit 6716ce8b641f0e42e2343e1694ee578b027be0c4.
This came up recently in a nodebug case on CodeView that caused a null
entry in elements and crashed LLVM.
The original clang fix to avoid generating IR like this: 504dd577675e8c85cdc8525990a7c8b517a38a89
Verify that the format is valid and that the type is one of the
expected i32 vectors. Also verify that the vector types used at least
cover the requirements of the corresponding format operand.
Improve the information printed when -memprof-report-hinted-sizes is
enabled. Now print the full context hash computed from the original
profile, similar to what we do when reporting matching statistics. This
will make it easier to correlate with the profile.
Note that the full context hash must be computed at profile match time
and saved in the metadata and summary, because we may trim the context
during matching when it isn't needed for distinguishing hotness.
Similarly, due to the context trimming, we may have more than one full
context id and total size pair per MIB in the metadata and summary,
which now get a list of these pairs.
Remove the old aggregate size from the metadata and summary support.
One other change from the prior support is that we no longer write the
size information into the combined index for the LTO backends, which
don't use it; this reduces unnecessary bloat in distributed index
files.
Relands 7ff3a9acd84654c9ec2939f45ba27f162ae7fbc3 after regenerating the
test case.
Supersedes the draft PR #94992, taking a different approach following
feedback:
* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time
constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline
As discussed in the [RFC
thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496),
the intent is that the intrinsic will be lowered to loops, a sequence of
stores, or libcalls depending on the expected cost and availability of
libcalls on the target. Right now, there's just a single lowering path
that aims to handle all cases. My intent would be to follow up with
additional PRs that add additional optimisations when possible (e.g.
when libcalls are available, when arguments are known to be constant
etc).
This reverts commit 7ff3a9acd84654c9ec2939f45ba27f162ae7fbc3.
Recent scheduling changes mean tests need to be re-generated. Reverting
to green while I do that.
Supersedes the draft PR #94992, taking a different approach following
feedback:
* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time
constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline
As discussed in the [RFC
thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496),
the intent is that the intrinsic will be lowered to loops, a sequence of
stores, or libcalls depending on the expected cost and availability of
libcalls on the target. Right now, there's just a single lowering path
that aims to handle all cases. My intent would be to follow up with
additional PRs that add additional optimisations when possible (e.g.
when libcalls are available, when arguments are known to be constant
etc).
As defined in LangRef, branching on `undef` is undefined behavior.
This PR aims to remove undefined behavior from the tests, as UB tests
break Alive2 and may be the root cause of future optimization breakage.
Here's an Alive2 proof for one of the examples:
https://alive2.llvm.org/ce/z/TncxhP
StructType::setBody is the only mechanism that can potentially create
recursion in the type system. Add a runtime check that it is not
actually used to create recursion.
If the check fails, report an error from LLParser, BitcodeReader and
IRLinker. In all other cases assert that the check succeeds.
In future StructType::setBody will be removed in favor of specifying the
body when the type is created, so any performance hit from this runtime
check will be temporary.
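For example, IR such as the following is now rejected uniformly instead
of each consumer having to cope with it (a sketch):

```llvm
; %rec directly contains itself; giving it this body would create a
; recursive type
%rec = type { i32, %rec }
```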