The check for ABI differences for inlined calls involves the caller, the
callee, and the nested callee. Before inlining, the ABI is determined by
the target features of the callee; after inlining, it is determined by
those of the caller. The features of the nested callee should never
actually matter.
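Roughly, the situation looks like this (illustrative IR only; the feature
names are arbitrary and not taken from the patch):
```llvm
define void @caller() #0 {
  call void @callee()
  ret void
}

define void @callee() #1 {
  ; Before inlining, this call is made from @callee, so its ABI follows
  ; @callee's target features; after inlining @callee into @caller, it
  ; follows @caller's features. @nested's own features never enter the check.
  call void @nested()
  ret void
}

declare void @nested() #1

attributes #0 = { "target-features"="+sse2" }
attributes #1 = { "target-features"="+sse2,+avx" }
```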
Add a default-off option to the inline cost calculation to always inline
all viable calls, regardless of the cost/benefit and cost/threshold
calculations.
For performance reasons, some users require that all calls be inlined.
Rather than forcing them to adjust the inlining threshold to an
arbitrarily high value, offer an option to inline all calls.
Call `recordInliningWithCalleeDeleted` before dropping the contents of
the Callee. Otherwise the handlers don't have access to, e.g., the
DebugLoc, which is why the Callee DebugLoc was missing from inlining
remarks for functions with internal linkage.
The test is the same as `optimization-remarks-passed-yaml.ll` except
that the function `foo` has internal linkage instead of external linkage.
This, along with IntrReadMem, means that the intrinsic only reads memory
through the given argument pointer and its derivatives. This allows passes
like the inliner to attach alias.scope metadata to the call instruction,
since they can see that no other memory is accessed.
Discovered via SWDEV-543741
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
We'll remove the size estimator afterwards; this change is just to get the
`ml-*` build bots green after the aforementioned PR.
We never used the size estimator again after the initial DQN-based
training. Should we want to do so again, we now have IR2Vec, which the old
estimator was approximating in functionality.
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
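For illustration, a minimal sketch of the intended IR shape after this
change (assuming the post-change intrinsic signature without the size
operand):
```llvm
declare void @llvm.lifetime.start.p0(ptr)
declare void @llvm.lifetime.end.p0(ptr)

define void @f() {
  %p = alloca i32
  ; Previously: call void @llvm.lifetime.start.p0(i64 4, ptr %p)
  ; The size is now implied by the alloca itself.
  call void @llvm.lifetime.start.p0(ptr %p)
  store i32 0, ptr %p
  call void @llvm.lifetime.end.p0(ptr %p)
  ret void
}
```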
This reverts https://github.com/llvm/llvm-project/pull/132976.
The PR incorrectly claimed that this makes inlining more liberal,
referencing the string comparison in TargetTransformInfoImpl.h.
However, the implementation that actually applies is the one in
BasicTTIImpl.h, which performs a feature subset comparison. As such,
this regressed inlining, most concerningly of functions without +vector
into functions with +vector.
Revert the change to restore the previous behavior.
Without this change, the following test would fail to compile
with `-march=armv8-a+sme`:
```
void func1(const svuint32_t *in, svuint32_t *out) {
  [&]() __arm_streaming { *out = *in; }();
}
```
But in general, it's probably better never to inline
streaming functions into non-streaming functions, because
they will have been marked as 'streaming' for a reason
by the user.
Do not omit check lines for any functions, to avoid spurious diffs
on regeneration. Also update to a newer UTC version which properly
generates the metadata checks.
lifetime.start and lifetime.end are primarily intended for use on
allocas, to enable stack coloring and other liveness optimizations. This
is necessary because all (static) allocas are hoisted into the entry
block, so lifetime markers are the only way to convey the actual
lifetimes.
However, lifetime.start and lifetime.end are currently *allowed* to be
used on non-alloca pointers. We don't actually do this in practice, but
just the mere fact that this is possible breaks the core purpose of the
lifetime markers, which is stack coloring of allocas. Stack coloring can
only work correctly if all lifetime markers for an alloca are
analyzable.
* If a lifetime marker may operate on multiple allocas via a select/phi,
we don't know which lifetime actually starts/ends and handle it
incorrectly (https://github.com/llvm/llvm-project/issues/104776); see the
sketch after this list.
* Stack coloring operates on the assumption that all lifetime markers
are visible, and not, for example, hidden behind a function call or
escaped pointer. It's not possible to change this, as part of the
purpose of lifetime markers is that they work even in the presence of
escaped pointers, where simple use analysis is insufficient.
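A minimal sketch of the previously-allowed pattern from the first bullet
(hypothetical IR; the size operand still exists at this point):
```llvm
declare void @llvm.lifetime.start.p0(i64, ptr)

define void @f(i1 %c) {
  %a = alloca i32
  %b = alloca i32
  %p = select i1 %c, ptr %a, ptr %b
  ; Which alloca's lifetime starts here depends on %c, so stack coloring
  ; cannot attribute this marker to a single alloca.
  call void @llvm.lifetime.start.p0(i64 4, ptr %p)
  ret void
}
```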
I don't think there is any way to have coherent semantics for lifetime
markers on allocas, while also permitting them on arbitrary pointer
values.
This PR restricts lifetimes to operate on allocas only. As a followup, I
will also drop the size argument, which is superfluous if we always
operate on an alloca. (This change also renders various code handling
lifetime markers on non-allocas dead. I plan to clean up that kind of
code after dropping the size argument as well.)
In practice, I've only found a few places that currently produce
lifetimes on non-allocas:
* CoroEarly replaces the promise alloca with the result of an intrinsic,
which will later be replaced back with an alloca. I think this is the
only place where there is some legitimate loss of functionality, but I
don't think this is particularly important (I don't think we'd expect
the promise in a coroutine to admit useful lifetime optimization.)
* SafeStack moves unsafe allocas onto a separate frame. We can safely
drop lifetimes here, as SafeStack performs its own stack coloring.
* Similarly, AddressSanitizer also moves allocas into separate
memory.
* LSR sometimes replaces the lifetime argument with a GEP chain of the
alloca (where the offsets ultimately cancel out). This is just
unnecessary. (Fixed separately in
https://github.com/llvm/llvm-project/pull/149492.)
* InferAddrSpaces sometimes makes lifetimes operate on an addrspacecast
of an alloca. I don't think this is necessary.
When we rebuild the call site tries after inlining of an allocation with
MD_memprof metadata, we don't want to reapply the discarding of small
non-cold contexts (under -memprof-callsite-cold-threshold=): either we
have no context size info at all (without -memprof-report-hinted-sizes
or another option that causes us to keep it as metadata), or, even with
that information in the metadata, we have imperfect information at that
point because we have already discarded some contexts during matching.
The first case was even worse because we didn't guard our check by
whether the number of cold bytes was 0, leading to very aggressive
pruning during post-inline metadata rebuilding without the context size
information.
Inliner currently treats every "call asm" IR instruction as a single
instruction regardless of how many instructions the inline assembly may
contain. This may underestimate the cost of inlining for a callee
containing long inline assembly. Besides, we may need to assign a higher
cost to instructions in inline assembly since they cannot be analyzed
and optimized by the compiler.
This PR introduces a new option, `-inline-asm-instr-cost` (zero by
default), which controls the cost of inline assembly instructions in the
inliner's cost-benefit analysis.
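As a rough sketch (hypothetical IR), a callee whose single `call asm` hides
several machine instructions; by default the cost model charges it like any
other single instruction, and the new option lets that be weighted more
heavily:
```llvm
define void @callee() {
  ; A single IR-level "call asm" that expands to four machine instructions.
  call void asm sideeffect "nop\0Anop\0Anop\0Anop", ""()
  ret void
}

define void @caller() {
  ; With the default extra cost of zero per asm instruction, @callee looks
  ; trivially cheap here; -inline-asm-instr-cost lets targets charge more.
  call void @callee()
  ret void
}
```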
Motivation: When using libc++, `std::bitset<64>::count()` doesn't
optimize to a single popcount instruction on AArch64, because we fail to
inline the library code completely. Inlining fails, because the internal
bit_iterator struct is passed as a [2 x i64] %arg value on AArch64. The
value is built using insertvalue instructions and only one of the array
entries is constant. If we know that this entry is constant, we can
prove that half the function becomes dead. However, InlineCost only
considers operands for simplification if they are Constants, which %arg
is not. Without this simplification the function is too expensive to
inline.
Therefore, we had to teach InlineCost to support non-Constant simplified values
(PR #145083). Now, we enable this for extractvalue, because we want to simplify
the extractvalue with the insertvalues from the caller function. This is enough to
get bitset::count fully optimized.
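A reduced sketch of the pattern (hypothetical IR, not the actual libc++
code): the caller builds the `[2 x i64]` aggregate with `insertvalue`, one
lane being constant, and the callee's `extractvalue` of that lane can now
be simplified across the call boundary:
```llvm
define i64 @callee([2 x i64] %arg) {
entry:
  ; With @caller below, this extractvalue simplifies to the constant 0,
  ; which makes the %slow path dead and the callee cheap enough to inline.
  %bit = extractvalue [2 x i64] %arg, 1
  %cmp = icmp eq i64 %bit, 0
  br i1 %cmp, label %fast, label %slow

fast:
  ret i64 0

slow:
  %v = extractvalue [2 x i64] %arg, 0
  ret i64 %v
}

define i64 @caller(i64 %p) {
  %a0 = insertvalue [2 x i64] poison, i64 %p, 0
  %a1 = insertvalue [2 x i64] %a0, i64 0, 1 ; the constant lane
  %r = call i64 @callee([2 x i64] %a1)
  ret i64 %r
}
```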
There are similar opportunities we can explore for BinOps in the future
(e.g. cmp eq %arg1, %arg2 when the caller passes the same value into
both arguments), but we need to be careful here, because InstSimplify
isn't completely safe to use with operands owned by different functions.
This depth limits a linear search (rather than the usual potentially
exponential one) and is not particularly important for compile-time in
practice.
The change in #137297 is going to increase the length of GEP chains, so
I'd like to increase this limit a bit to reduce the chance of
regressions (https://github.com/dtcxzyw/llvm-opt-benchmark/pull/2419
showed a 13% increase in SearchLimitReached). There is no particular
significance to the new value of 10.
Compile-time is neutral.
- Consider inlining a recursive function of depth 1 only when
the caller is the function itself, instead of inlining it
for each callsite, so that we avoid redundant work.
- Use CondContext instead of DomTree for better compilation time.
The context disambiguation code already emits remarks when hinting
allocations (by adding hotness attributes) during cloning. However,
we did not yet emit remarks when applying the hotness attributes during
building of the metadata (during matching and again after inlining).
Add remarks when we apply the hint attributes for these
non-context-sensitive allocations.
When determining whether an escape source may alias with a noalias
argument, only take provenance captures into account. If only the
address of the argument was captured, an access through the escape
source is not legal.
Currently, `InlineCostFeatureIndex::NumberOfFeatures` results in an
index in the middle of the feature vector, so the default inlining
decision is not set correctly and another feature is overwritten instead.
`FeatureIndex::NumberOfFeatures` is the last index of the feature
vector, to which the default inlining decision gets appended when
enabled.
Adds support for MSVC's undocumented `/funcoverride` flag, which marks
functions as being replaceable by the Windows kernel loader. This is
used to allow functions to be upgraded depending on the capabilities of
the current processor (e.g., the kernel can be built with the naive
implementation of a function, but that function can be replaced at boot
with one that uses SIMD instructions if the processor supports them).
For each marked function we need to generate:
* An undefined symbol named `<name>_$fo$`.
* A defined symbol `<name>_$fo_default$` that points to the `.data`
section (anywhere in the data section, it is assumed to be zero sized).
* An `/ALTERNATENAME` linker directive that points from `<name>_$fo$` to
`<name>_$fo_default$`.
This is used by the MSVC linker to generate the appropriate metadata in
the Dynamic Value Relocation Table.
Marked functions must never be inlined (otherwise those inline sites
can't be replaced).
Note that I've chosen to implement this in AsmPrinter as there was no
way to create a `GlobalVariable` for `<name>_$fo$` that would result in
a symbol being emitted (as nothing consumes it and it has no
initializer). I tried to have `llvm.used` and `llvm.compiler.used` point
to it, but this didn't help.
Within LLVM I referred to this feature as "loader replaceable" as
"function override" already has a different meaning to C++ developers...
I also took the opportunity to extract the feature symbol generation
code used by both AArch64 and X86 into a common function in AsmPrinter.
Recursive functions are generally not inlined to avoid issues
like infinite inlining or excessive code expansion. However,
this conservative approach misses opportunities for optimization in
cases where a recursive call is guaranteed to execute only once.
This patch detects a scenario where a guarding branch condition of a recursive
call will become false after the first iteration of the recursive function.
If such a condition is met, and the recursion depth is confirmed to be one,
the Inliner will now consider this recursive function for inlining.
A new test case (`test/Transforms/Inline/inline-recursive-fn.ll`)
has been added to verify this behaviour.
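A minimal sketch of the shape being detected (hypothetical IR, not the
added test itself):
```llvm
define void @f(i1 %entered) {
entry:
  br i1 %entered, label %work, label %exit

work:
  ; The guarding condition is constant-false at the recursive call site, so
  ; the recursion depth is one and the call may now be considered for inlining.
  call void @f(i1 false)
  br label %exit

exit:
  ret void
}
```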
Since e39f6c1844fab59c638d8059a6cf139adb42279a, opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present, to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.
Reviewed By: shiltian, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/137921
Previously the inliner always produced a memcpy with alignment 1 for the
src and destination, leading to potentially suboptimal codegen.
Since the Src ptr alignment is only available through the CallBase, it
has to be passed to HandleByValArgumentInit. The Dst alignment is already
known, so it doesn't have to be passed along.
If there is no specified Src alignment, my changes cause the ptr to have
no align data attached instead of align 1 as before (see
inline-tail.ll). I believe this is fine, but since I'm a first-time
contributor, please confirm.
My changes are already covered by 4 existing regression tests, so I did
not add any additional ones.
The example from #45778 now results in:
```llvm
; opt -S -passes=inline,instcombine,sroa,instcombine test.ll
define dso_local i32 @test(ptr %t) {
entry:
  %.sroa.0.0.copyload = load ptr, ptr %t, align 8 ; this used to be align 1 in the original issue
  %arrayidx.i = getelementptr inbounds nuw i8, ptr %.sroa.0.0.copyload, i64 24
  %0 = load i32, ptr %arrayidx.i, align 4
  ret i32 %0
}
```
Fixes #45778.
During inlining, we may opportunistically simplify conditional branches
(incl. switches) to unconditional branches if, after inlining, their
destination is fixed. While we do this, we should propagate any
DILocation attached to the original branch to the simplified branch,
which this patch enables.
Found using https://github.com/llvm/llvm-project/pull/107279.
As part of inlining an invoke instruction, we may replace an inlined
resume instruction with a simple branch to the landing pad block. When
this happens, we should also propagate the resume's DILocation to this
branch, which this patch enables.
Found using https://github.com/llvm/llvm-project/pull/107279.
## What?
Implement `areInlineCompatible` for the SystemZ target using
FeatureBitset comparison.
## Why?
The default implementation in `TargetTransformInfoImpl.h` makes a string
comparison and only inlines when the target-cpu and the target-features
for caller and callee are the same. We are missing out on optimizations
when the callee has a subset of features of the caller.
## How?
Get the FeatureBitset of the caller and callee and check whether the
callee's features are a subset of, or equal to, the caller's. The
implementation is similar to ARM, PowerPC, etc.
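A rough sketch of the case that now inlines (illustrative IR; the feature
names are real SystemZ features but are not taken from the added tests):
```llvm
define void @callee() #0 {
  ret void
}

define void @caller() #1 {
  ; @callee needs only a subset of @caller's features, so areInlineCompatible
  ; now allows inlining this call.
  call void @callee()
  ret void
}

attributes #0 = { "target-features"="+vector" }
attributes #1 = { "target-features"="+vector,+vector-enhancements-1" }
```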
## Testing?
Test cases check for when the callee is a subset of the caller, when
it's not a subset, and when both are equal.
This is a follow-up to https://github.com/llvm/llvm-project/pull/130210.
The EphemeralValuesAnalysis pass used to return an EphemeralValuesCache
object, which held the ephemeral values, collected them lazily, and
supported invalidation via its `clear()` function.
This patch removes the EphemeralValuesCache class completely and instead
returns the SmallVector containing the ephemeral values.
This patch does two things:
1. It implements an ephemeral values cache analysis pass that collects the ephemeral values of a function and caches them for fast lookups. The collection of the ephemeral values is done lazily when the user calls `EphemeralValuesCache::ephValues()`.
2. It adds caching of ephemeral values using the `EphemeralValuesCache` to speed up `CallAnalyzer::analyze()`. Without caching, this can take a long time to run in cases where the function contains a large number of `@llvm.assume()` calls and a large number of callsites; the time is spent in `collectEphemeralValues()` (see the sketch below).
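For reference, a minimal sketch of what "ephemeral" means here (hypothetical
IR): values that only feed `@llvm.assume` emit no real code, so the cost
analysis must skip them, and recollecting them for every callsite is what
the cache avoids.
```llvm
declare void @llvm.assume(i1)

define void @f(i32 %x) {
  ; %cmp is ephemeral: its only use is the assume, so it produces no code
  ; and should not count towards the inline cost.
  %cmp = icmp sgt i32 %x, 0
  call void @llvm.assume(i1 %cmp)
  ret void
}
```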
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt-in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, the non-intrinsic format is the default and has been on for more
than a year, so there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
Since the default implementation swap in
https://github.com/llvm/llvm-project/pull/117493,
`areInlineCompatible` checks whether the callee's features are a subset of
the caller's features. This is not a safe assumption in general on PPC. We
fall back to checking for strict feature set equality for now, and will see
what improvements we can make.
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
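For illustration, the attribute rewrite itself (a sketch; the declarations
are made up):
```llvm
; Old spelling, still accepted and upgraded by the IR readers:
declare void @old(ptr nocapture %p)

; New spelling emitted after this change:
declare void @new(ptr captures(none) %p)
```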
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
This is a followup to #117750. Currently, AlwaysInline only invalidates
analyses at the end, by returning that no analyses are preserved.
However, this means that analyses fetched during inlining may be
outdated. The aforementioned PR exposed this issue.
Instead, bring the logic closer to what the normal inliner does, by
directly invalidating the caller in FAM. This should make sure that we
don't receive any outdated analyses even if they are fetched during
inlining.
Also drop the BFI updating entirely -- there's no point in doing it if
we're going to invalidate everything anyway.
This reapplies #119138 with a defensive fix for the assertion failure
when building libcxx.
Unfortunately the failure does not reproduce on my machine, so I am not
able to extract a test case.
The key insight for the fix comes from Jessica Clarke, who observes that
`VTablePtr` may, in fact,
not be a pointer on return from `FindAvailableLoadedValue`.
Co-authored-by: Alexander Richardson <alexander.richardson@cl.cam.ac.uk>
When the size is an appropriate constant, __memcpy_chk will turn into a
memcpy that gets folded away by InstCombine. Therefore this patch avoids
counting these as calls for purposes of inlining costs.
This is only really relevant on platforms whose headers redirect memcpy
to __memcpy_chk (such as Darwin). On platforms that use intrinsics,
memcpy and similar functions are already exempt from call penalties.
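A sketch of the pattern in question (illustrative IR; the declaration
mirrors the usual __memcpy_chk signature):
```llvm
declare ptr @__memcpy_chk(ptr, ptr, i64, i64)

define void @copy16(ptr %dst, ptr %src) {
  ; The length (16) is a constant no larger than the known object size (16),
  ; so this is expected to become a plain memcpy and fold away; it is no
  ; longer charged as a call by the inline cost model.
  call ptr @__memcpy_chk(ptr %dst, ptr %src, i64 16, i64 16)
  ret void
}
```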