This introduces a new `ptrtoaddr` instruction which is similar to
`ptrtoint` but has two differences:
1) Unlike `ptrtoint`, `ptrtoaddr` does not capture provenance
2) `ptrtoaddr` only extracts (and then extends/truncates) the low
index-width bits of the pointer
For most architectures, difference 2) does not matter since index (address)
width and pointer representation width are the same, but this does make a
difference for architectures that have pointers that aren't just plain
integer addresses such as AMDGPU fat pointers or CHERI capabilities.
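A minimal sketch of the two instructions side by side, assuming a 64-bit
index width:
```
; ptrtoint captures provenance and exposes all representation bits;
; ptrtoaddr yields only the low index-width bits, without capturing
; provenance.
%int  = ptrtoint ptr %p to i64
%addr = ptrtoaddr ptr %p to i64
```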
This commit introduces textual and bitcode IR support as well as basic code
generation, but optimization passes do not handle the new instruction yet
so it may result in worse code than using ptrtoint. Follow-up changes will
update capture tracking, etc. for the new instruction.
RFC: https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54
Reviewed By: nikic
Pull Request: https://github.com/llvm/llvm-project/pull/139357
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
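For example (the alloca and its size are illustrative):
```
%a = alloca [8 x i8]
; Before: explicit size argument.
call void @llvm.lifetime.start.p0(i64 8, ptr %a)
; After: the size is implied by the alloca.
call void @llvm.lifetime.start.p0(ptr %a)
```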
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
This slightly relaxes the invariant established in #149310, by also
allowing the lifetime argument to be poison. This is to support the
typical pattern of RAUWing with poison when removing an instruction.
It's worth noting that this does not require any conservative
assumptions: lifetimes with poison arguments can simply be skipped.
Fixes https://github.com/llvm/llvm-project/issues/151119.
This patch hyphenates words that are used as adjectives, such as:
- architecture specific
- human readable
- implementation defined
- language independent
- language specific
- machine readable
- machine specific
- target independent
- target specific
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.
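For illustration, the metadata on a loop might look like this (the exact
operand encoding is my assumption; 8 is an example estimate):
```
loop.latch:
  br i1 %cond, label %loop.header, label %exit, !llvm.loop !0

!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.estimated_trip_count", i32 8}
```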
An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
Split out from https://github.com/llvm/llvm-project/pull/150248:
Specify that the argument of lifetime.start/lifetime.end is ignored and
will be removed in the future.
Remove lifetime size handling from SDAG. The size was previously
discarded during isel, so was always ignored for stack coloring anyway.
Where necessary, obtain the size of the full frame index.
This enables the (reasonably common) pattern of using `mmap` to reserve
but not actually map a wide range of pages, and then only adding in more
pages as memory is actually needed. Effectively, that region of memory
is one big allocated object for LLVM, but crucially, that allocated
object *changes its size*.
Having an allocated object grow seems entirely compatible with what LLVM
optimizations assume, *except* that when LLVM sees an `alloca` or
similar instruction, it will assume that a pointer that has been
`getelementptr inbounds` by more than the size of the allocated object
cannot alias that `alloca`. But for allocated objects that are created
e.g. by `mmap`, where LLVM does not know their size, this cannot happen
anyway.
The other main point to be concerned about is having a `getelementptr
inbounds` that is moved up across an operation that grows an allocated
object: this should be legal as `getelementptr` is freely reorderable.
We achieve that by saying that for allocated objects that change their
size, "inbounds" means "inbounds of their maximal size", not "inbounds
of their current size".
It would be nice to also allow shrinking allocations (e.g. by
`munmap`ing pages at the end), but that is trickier. Consider an
example like this (sketched in IR after the list):
- load 4 bytes from `ptr`
- call some function
- load 1 byte from `ptr`
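A minimal IR sketch of that pattern (`@some_fn` is a placeholder):
```
%v0 = load i32, ptr %p      ; 4-byte load
call void @some_fn()        ; could this call shrink the allocation?
%v1 = load i8, ptr %p       ; 1-byte load
```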
Right now, LLVM could argue that since `ptr` clearly has not been
deallocated, there must be at least 4 bytes of dereferenceable memory
behind `ptr` after the call. If allocations can shrink, this kind of
reasoning is no longer valid. I don't know if LLVM actually does
reasoning like that -- I think it should not, since I think it should be
possible to have allocations that shrink -- but to remain conservative I
am not proposing that as part of this patch.
lifetime.start and lifetime.end are primarily intended for use on
allocas, to enable stack coloring and other liveness optimizations. This
is necessary because all (static) allocas are hoisted into the entry
block, so lifetime markers are the only way to convey the actual
lifetimes.
However, lifetime.start and lifetime.end are currently *allowed* to be
used on non-alloca pointers. We don't actually do this in practice, but
just the mere fact that this is possible breaks the core purpose of the
lifetime markers, which is stack coloring of allocas. Stack coloring can
only work correctly if all lifetime markers for an alloca are
analyzable.
* If a lifetime marker may operate on multiple allocas via a select/phi,
we don't know which lifetime actually starts/ends and handle it
incorrectly (https://github.com/llvm/llvm-project/issues/104776); see
the sketch after this list.
* Stack coloring operates on the assumption that all lifetime markers
are visible, and not, for example, hidden behind a function call or
escaped pointer. It's not possible to change this, as part of the
purpose of lifetime markers is that they work even in the presence of
escaped pointers, where simple use analysis is insufficient.
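For example (using the current form with a size argument), stack
coloring cannot tell which lifetime starts here:
```
%a = alloca i64
%b = alloca i64
%p = select i1 %c, ptr %a, ptr %b
call void @llvm.lifetime.start.p0(i64 8, ptr %p) ; starts %a or %b?
```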
I don't think there is any way to have coherent semantics for lifetime
markers on allocas, while also permitting them on arbitrary pointer
values.
This PR restricts lifetimes to operate on allocas only. As a followup, I
will also drop the size argument, which is superfluous if we always
operate on an alloca. (This change also renders dead various code that
handles lifetime markers on non-alloca pointers. I plan to clean up that
kind of code after dropping the size argument as well.)
In practice, I've only found a few places that currently produce
lifetimes on non-allocas:
* CoroEarly replaces the promise alloca with the result of an intrinsic,
which will later be replaced back with an alloca. I think this is the
only place where there is some legitimate loss of functionality, but I
don't think this is particularly important (I don't think we'd expect
the promise in a coroutine to admit useful lifetime optimization.)
* SafeStack moves unsafe allocas onto a separate frame. We can safely
drop lifetimes here, as SafeStack performs its own stack coloring.
* AddressSanitizer is similar: it also moves allocas into separate
memory.
* LSR sometimes replaces the lifetime argument with a GEP chain of the
alloca (where the offsets ultimately cancel out). This is just
unnecessary. (Fixed separately in
https://github.com/llvm/llvm-project/pull/149492.)
* InferAddrSpaces sometimes makes lifetimes operate on an addrspacecast
of an alloca. I don't think this is necessary.
This fixes llvm.experimental.constrained.lrint and friends to use the
same semantics as llvm.lrint and friends, as defined in
ba5c26da7ce85dbdcee3d964282e5f0981792702.
According to the current LangRef, atomics of sizes larger than a
target-dependent limit are non-conformant IR. Presumably, that size
limit is `TargetLoweringBase::getMaxAtomicSizeInBitsSupported()`. As a
consequence, one would not even know whether IR is valid without
instantiating the Target backend.
To get around this, Clang's CGAtomic uses a constant "16 bytes" for the
maximum supported atomic size. The verifier only checks the power-of-two
requirement.
Per a discussion with jyknight, the intention is rather that
AtomicExpandPass will just lower everything larger than the
target-supported atomic sizes to a libcall (such as `__atomic_load`).
According to this interpretation, the size limit need only be known by
the lowering and does not affect the IR specification.
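Under this interpretation, an oversized atomic like the following is
valid IR, and AtomicExpandPass turns it into a libcall when the target
cannot lower it inline (a sketch):
```
%v = load atomic i256, ptr %p seq_cst, align 32
```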
The original "target-specific size limit" had been added in
59b66883eacbc62a09c09f08bcbfdce7af46cf31. The LangRef change is needed
for #134455 because otherwise frontends need to pass a TargetLowering
object to the helper functions just to know what the target-specific
limit is.
This also changes the LangRef for atomicrmw. Are there libatomic
fallbacks for these? If not, LLVM-IR validity still depends on
instantiating the actual backend. There are also some intrinsics such as
`llvm.memcpy.element.unordered.atomic` that have this constraint but do
not change in this PR.
RFC on discourse:
https://discourse.llvm.org/t/rfc-debug-info-for-coroutine-suspension-locations-take-2/86606
With this commit, we add `DILabel` debug info to the resume points of a
coroutine. Those labels can be used by debugging scripts to figure out
the exact line and column at which a coroutine was suspended, by looking
up the current `__coro_index` value inside the coroutine's frame and
then searching for the corresponding label inside the coroutine's resume
function.
The DWARF information generated for such a label looks like:
```
0x00000f71: DW_TAG_label
DW_AT_name ("__coro_resume_1")
DW_AT_decl_file ("generator-example.cpp")
DW_AT_decl_line (5)
DW_AT_decl_column (3)
DW_AT_artificial (true)
DW_AT_LLVM_coro_suspend_idx (0x01)
DW_AT_low_pc (0x00000000000019be)
```
The labels can be mapped to their corresponding `__coro_index` values
either via their naming convention `__coro_resume_<N>` or via the new
`DW_AT_LLVM_coro_suspend_idx` attribute. In gdb, those line numbers can
be looked up using `info line -function my_coroutine -label
__coro_resume_1`. LLDB unfortunately does not understand DW_TAG_label
debug information yet.
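At the IR level, such a label might be described roughly like this (the
exact DILabel field names are my assumption, inferred from the DWARF
attributes above):
```
!10 = !DILabel(scope: !5, name: "__coro_resume_1", file: !1, line: 5,
               column: 3, isArtificial: true, coroSuspendIdx: 0)
```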
Given this is an artificial compiler-generated label, I applied the
DW_AT_artificial attribute to it. The DWARFv5 standard only allows that
attribute on type and variable definitions, but this is a natural
extension and was also blessed in the RFC on discourse.
Also, this commit adds `DW_AT_decl_column` to labels, not only for
coroutines but also for normal C and C++ labels. While not strictly
necessary, I am doing so now because it would be harder to do later
without breaking the binary LLVM-IR format.
Drive-by fixes: while reading the existing test cases to understand how
to write my own, I fixed a couple of small typos and improved some
comments.
Add the `dead_on_return` attribute, which is meant to be taken advantage
of by frontends and states that the memory pointed to by the argument is
dead upon function return. As with `byval`, it is intended for passing
aggregates by value. The difference lies in the ABI: `byval` implies
that the pointer is explicitly passed as an argument to the callee
(during codegen the copy is emitted as per the byval contract), whereas
a `dead_on_return`-marked argument implies that the copy already exists
in the IR and is located at a specific stack offset within the caller;
the caller will not read this memory again after the callee returns, and
any read of it before it is written again yields poison.
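A minimal sketch of the attribute on a declaration:
```
; The caller will not read %agg's memory after the call returns.
declare void @consume(ptr dead_on_return %agg)
```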
RFC: https://discourse.llvm.org/t/rfc-add-dead-on-return-attribute/86871.
Also fix the LangRef to match the implementation: the check used the
pointer size of the alloca address space rather than that of the default
address space.
The check was also more permissive than the LangRef. The error
check permitted any size less than the pointer size; follow the
stricter wording of the LangRef.
This patch extends the llvm.histogram intrinsic to support additional
update operations beyond the existing add. Specifically, the new
supported operations are:
* umax: unsigned maximum
* umin: unsigned minimum
* uadd.sat: unsigned saturated addition
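For example, a masked unsigned-maximum histogram update might look like
this (the concrete type-mangling suffix is my assumption):
```
call void @llvm.experimental.vector.histogram.umax.nxv2p0.i64(
    <vscale x 2 x ptr> %buckets, i64 %val, <vscale x 2 x i1> %mask)
```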
Based on the discussion from:
https://discourse.llvm.org/t/rfc-expanding-the-experimental-histogram-intrinsic/84673
This change adds support for target extension types in vectors, by
allowing sized target extension types to opt in to being valid vector
elements.
Allowing target extension types as vector elements will allow backends
to use vector operations such as `insertelement` and `extractelement` on
their target types with minimal changes.
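As a sketch, with `target("example.token")` standing in for some sized
target extension type:
```
%v1 = insertelement <4 x target("example.token")> %v0,
          target("example.token") %x, i32 0
%e  = extractelement <4 x target("example.token")> %v1, i32 2
```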
RFC:
https://discourse.llvm.org/t/rfc-supporting-sized-target-extension-types-in-vector/86431
Add an `alloc-variant-zeroed` function attribute which can be used to
enable folding of allocation+memset-zero sequences. This addresses
https://github.com/rust-lang/rust/issues/104847, where LLVM does not
know how to perform this transformation for non-C languages.
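A sketch of the intended use, with hypothetical allocator names:
```
; "alloc-variant-zeroed" names a variant returning zeroed memory, so
; @my_alloc followed by a zero memset can fold into @my_alloc_zeroed.
declare ptr @my_alloc(i64) "alloc-variant-zeroed"="my_alloc_zeroed"
declare ptr @my_alloc_zeroed(i64)
```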
Co-authored-by: Jamie <jamie@osec.io>
In 'asm goto' statements ('callbr' in LLVM IR), you can specify one or
more labels / basic blocks in the containing function which the assembly
code might jump to. If you're also compiling with branch target
enforcement via BTI, then previously listing a basic block as a possible
jump destination of an asm goto would cause a BTI instruction to be
placed at the start of the block, in case the assembly code used an
_indirect_ branch instruction (i.e. to a destination address read from a
register) to jump to that location. Now it doesn't do that any more:
branches to destination labels from the assembly code are assumed to be
direct branches (to a relative offset encoded in the instruction), which
don't require a BTI at their destination.
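In LLVM IR, such a statement is a `callbr` whose extra destinations are
the possible jump targets (a minimal sketch):
```
define i32 @f() {
entry:
  ; 'asm goto' with a NOP body and one destination label.
  callbr void asm sideeffect "nop", "!i"()
      to label %fallthrough [label %patch_target]
fallthrough:
  ret i32 0
patch_target:
  ret i32 1
}
```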
This change was proposed in https://discourse.llvm.org/t/85845 and there
seemed to be no disagreement. The rationale is:
1. it brings clang's handling of asm goto in Arm and AArch64 in line
with gcc's, which didn't generate BTIs at the target labels in the first
place.
2. it improves performance in the Linux kernel, which uses a lot of 'asm
goto' in which the assembly language just contains a NOP, and the
label's address is saved elsewhere to let the kernel self-modify at run
time to swap between the original NOP and a direct branch to the label.
This allows hot code paths to be instrumented for debugging, at only the
cost of a NOP when the instrumentation is turned off, instead of the
larger cost of an indirect branch. In this situation a BTI is
unnecessary (if the branch happens, it's direct), and since the code
paths are hot, its cost is also noticeable.
Implementation:
`SelectionDAGBuilder::visitCallBr` is the place where 'asm goto' target
labels are handled. It calls `setIsInlineAsmBrIndirectTarget()` on each
target `MachineBasicBlock`. Previously it also called
`setMachineBlockAddressTaken()`, which made `hasAddressTaken()` return
true, which caused a BTI to be added in the Arm backends.
Now `visitCallBr` doesn't call `setMachineBlockAddressTaken()` any more
on asm goto targets, but `hasAddressTaken()` also checks the flag set by
`setIsInlineAsmBrIndirectTarget()`. So call sites that were using
`hasAddressTaken()` don't need to be modified. But the Arm backends
don't call `hasAddressTaken()` any more: instead they test two more
specific query functions that cover all the reasons `hasAddressTaken()`
might have returned true _except_ being an asm goto target.
Testing:
The new test `AArch64/callbr-asm-label-bti.ll` is testing the actual
change, where it expects not to see a `bti` instruction after
`[[LABEL]]`. The rest of the test changes are all churn, due to the
flags on basic blocks changing. Actual output code hasn't changed in any
of the existing tests, only comments and diagnostics.
Further work:
`RISCVIndirectBranchTracking.cpp` and `X86IndirectBranchTracking.cpp`
also call `hasAddressTaken()` in a way that might benefit from using the
same more specific check I've put in `ARMBranchTargets.cpp` and
`AArch64BranchTargets.cpp`. But I'm not sure of that, so in this commit
I've only changed the Arm backends, and left those alone.
This adds [de]interleave intrinsics for factors of 4,6,8, so that every
interleaved memory operation supported by the in-tree targets can be
represented by a single intrinsic.
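For instance, a factor-4 interleave of scalable vectors can now be
expressed directly (the exact name mangling is my assumption):
```
%v = call <vscale x 8 x i32> @llvm.vector.interleave4.nxv8i32(
         <vscale x 2 x i32> %a, <vscale x 2 x i32> %b,
         <vscale x 2 x i32> %c, <vscale x 2 x i32> %d)
```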
For context, [de]interleaves of fixed-length vectors are represented by
a series of shufflevectors. The intrinsics are needed for scalable
vectors, and we don't currently scalably vectorize all possible factors
of interleave groups supported by RISC-V/AArch64.
The underlying reason for this is that higher factors are currently
represented by interleaving multiple interleaves themselves, which made
sense at the time in the discussion in
https://github.com/llvm/llvm-project/pull/89018.
But after trying to integrate these for higher factors on RISC-V I think
we should revisit this design choice:
- Matching these in InterleavedAccessPass is non-trivial: we currently
only support factors that are a power of 2, and detecting this requires
a good chunk of code.
- The shufflevector masks used for [de]interleaves of fixed-length
vectors are much easier to pattern match, as they are strided patterns,
but the intrinsics are much more complicated to match as the structure
is a tree.
- Unlike shufflevectors, no optimisation happens on [de]interleave2
intrinsics.
- For non-power-of-2 factors, e.g. 6, there are multiple possible ways a
[de]interleave could be represented; see the discussion in #139373.
- We already have intrinsics for factors 2, 3, 5 and 7, so by avoiding
4, 6 and 8 we're not really saving much.
By representing these higher factors as interleaved interleaves, we can
in theory support arbitrarily high interleave factors. However I'm not
sure this is actually needed in practice: SVE only has instructions
for factors 2,3,4, whilst RVV only supports up to factor 8.
This patch would make it much easier to support scalable interleaved
accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as
the loop vectorizer and InterleavedAccessPass wouldn't need to
construct and match trees of interleaves.
For interleave factors above 8, for which there are no hardware memory
operations to match in the InterleavedAccessPass, we can still keep the
wide load + recursive interleaving in the loop vectorizer.
Some hardware (for example, certain AVR chips) has peripheral registers
mapped to data space address 0. Although a volatile load/store on
`ptr null` already generates expected code, the wording in the LangRef
makes operations on null seem like undefined behavior in all cases. This
commit adds a comment that, for volatile operations, it may be defined
behavior to access the address null, if the architecture permits it. The
intended use case is MMIO registers with hard-coded addresses that
include bit-value 0. A simple CodeGen test is included for AVR, as an
architecture known to have this quirk, that does `load volatile` and
`store volatile` to `ptr null`, expecting to generate `lds <reg>, 0` and
`sts 0, <reg>`.
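The tested pattern, roughly:
```
; On AVR, expected to lower to 'lds <reg>, 0' and 'sts 0, <reg>'.
%v = load volatile i8, ptr null
store volatile i8 %v, ptr null
```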
See [this
thread](https://rust-lang.zulipchat.com/#narrow/channel/213817-t-lang/topic/Adding.20the.20possibility.20of.20volatile.20access.20to.20address.200)
and [the
RFC](https://discourse.llvm.org/t/rfc-volatile-access-to-non-dereferenceable-memory-may-be-well-defined/86303)
for discussion and context.
This function can be used to retrieve the number of bits that can be used
for arithmetic in a given address space (i.e. the range of the address
space). For most in-tree targets this should not make any difference
but differentiating between the size of a pointer in bits and the address
range is extremely important e.g. for CHERI-enabled targets, where pointers
carry additional metadata such as bounds and permissions and only a subset
of the pointer bits is used as the address.
The address size is defined to be the same as the index size.
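In datalayout terms this is the index-size field of the pointer
specification; a hypothetical CHERI-style fragment:
```
; 128-bit pointer representation, 64-bit index (= address) width in
; address space 200.
target datalayout = "p200:128:128:128:64"
```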
We considered adding a separate property, since targets exist where
indexing and address range actually use different sizes (AMDGPU fat
pointers with a 160-bit representation, 48-bit address and 32-bit
index), but for the purposes of LLVM semantics, differentiating them
does not add much value and introduces a lot of complexity in ensuring
the correct bits are used. See the reasoning by @nikic in
https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/38
and
https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/49
Originally uploaded as https://reviews.llvm.org/D135158
Reviewed By: davidchisnall, krzysz00
Pull Request: https://github.com/llvm/llvm-project/pull/139347
Make this consistent with other operations with respect to signaling
NaN quieting. This was specifying that quieting is required, which is
true for IEEE. As with other IR operations, signaling NaN quieting is
now possible but optional in the case where there are two NaN inputs.
This permits directly selecting the intrinsic to the hardware
instruction in the default floating-point environment for shaders.
Thread-local globals live, by default, in the default globals address
space, which may not be 0, so we need to overload @llvm.thread.pointer
to support other address spaces, and use the default globals address
space in Clang.
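A sketch of the overloaded form (the mangling suffix is my assumption):
```
declare ptr addrspace(1) @llvm.thread.pointer.p1()
```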
Currently, each variant in the variant part of a structure type can only
contain a single member. This was sufficient for Rust, where each
variant is represented as its own type.
However, this isn't really enough for Ada, where a variant can have
multiple members.
This patch adds support for this scenario. This is done by allowing the
use of DW_TAG_variant by DICompositeType, and then changing the DWARF
generator to recognize when a DIDerivedType representing a variant holds
one of these. In this case, the fields from the DW_TAG_variant are
inlined into the variant, like so:
```
<4><7d>: Abbrev Number: 9 (DW_TAG_variant)
<7e> DW_AT_discr_value : 74
<5><7f>: Abbrev Number: 7 (DW_TAG_member)
<80> DW_AT_name : (indirect string, offset: 0x43): field0
<84> DW_AT_type : <0xa7>
<88> DW_AT_alignment : 8
<89> DW_AT_data_member_location: 0
<5><8a>: Abbrev Number: 7 (DW_TAG_member)
<8b> DW_AT_name : (indirect string, offset: 0x4a): field1
<8f> DW_AT_type : <0xa7>
<93> DW_AT_alignment : 8
<94> DW_AT_data_member_location: 8
```
Note that the intermediate DIDerivedType is still needed in this
situation, because that is where the discriminants are stored.
While external globals can be unsized, I don't think an unsized
initializer makes sense.
It seems like the backend currently ends up treating this as a zero-size
global. If people want that behavior, they should declare it as such.
This patch adds support for LLVM IR atomicrmw `fmaximum` and `fminimum`
instructions.
These mirror the `llvm.maximum.*` and `llvm.minimum.*` intrinsics, but
are atomic and use IEEE 754-2019 handling for NaNs, which differs from
`fmax` and `fmin`. See:
https://llvm.org/docs/LangRef.html#llvm-minimum-intrinsic
for more details.
Future changes will allow this LLVM IR to be lowered to specialised
assembler instructions on suitable targets, such as AArch64.
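For example, a sketch of the new instruction form:
```
%old = atomicrmw fmaximum ptr %p, float %val seq_cst
```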
So far, the Language Reference said that the alloca address space from
the datalayout is used if no explicit address space is provided, which
is not what the LLParser and the AsmWriter implement. This patch adjusts
the documentation to match the implementation: The default AS 0 is used
if none is explicitly specified.
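For example, assuming a datalayout with "A5":
```
%a = alloca i32               ; still address space 0
%b = alloca i32, addrspace(5) ; explicit alloca address space
```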
This is an alternative to PR #135786, which would change the parser's
behavior to match the Language Reference instead.
- _Float16 is now accepted by Clang.
- The half IR type is fully handled by the backend.
- These values are passed in FP registers and converted to/from float around
each operation.
- Compiler-rt conversion functions are now built for s390x, including
the previously missing extendhfdf2.
Fixes #50374