The inliner currently treats every "call asm" IR instruction as a single
instruction, regardless of how many instructions the inline assembly
contains. This can underestimate the cost of inlining a callee that
contains long inline assembly. In addition, we may need to assign a
higher cost to instructions in inline assembly, since they cannot be
analyzed and optimized by the compiler.
This PR introduces a new option, `-inline-asm-instr-cost` (zero by
default), which controls the cost of inline-assembly instructions in the
inliner's cost-benefit analysis.
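For illustration only (not part of the patch): the following callee
contains a single "call asm" IR instruction that expands to several
machine instructions, which the previous cost model counted as one.
```
// Illustrative C++ example: one "call asm" IR instruction, but three
// machine instructions at the call site.
inline void three_nops() {
  asm volatile("nop\n\t"
               "nop\n\t"
               "nop");
}
```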
Motivation: When using libc++, `std::bitset<64>::count()` doesn't
optimize to a single popcount instruction on AArch64, because we fail to
inline the library code completely. Inlining fails because the internal
bit_iterator struct is passed as a [2 x i64] %arg value on AArch64. The
value is built using insertvalue instructions and only one of the array
entries is constant. If we know that this entry is constant, we can
prove that half the function becomes dead. However, InlineCost only
considers operands for simplification if they are Constants, which %arg
is not. Without this simplification the function is too expensive to
inline.
Therefore, we had to teach InlineCost to support non-Constant simplified values
(PR #145083). Now, we enable this for extractvalue, because we want to simplify
the extractvalue with the insertvalues from the caller function. This is enough to
get bitset::count fully optimized.
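For reference, the motivating case boils down to the following (assuming
libc++ on AArch64, as described above); with the inlining fix it should
compile down to a single popcount sequence rather than a partially
inlined library call.
```
#include <bitset>
#include <cstdint>

int popcount64(uint64_t v) {
  return std::bitset<64>(v).count();
}
```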
There are similar opportunities we can explore for BinOps in the future
(e.g. cmp eq %arg1, %arg2 when the caller passes the same value into
both arguments), but we need to be careful here, because InstSimplify
isn't completely safe to use with operands owned by different functions.
Allow mapping callee Values to arbitrary (non-Constant) simplified
values. The simplified values can also originate from the caller. This
enables us to simplify instructions in the callee with instructions from
the caller.
The first use case for this is simplifying extractvalues (PR #145054).
- Consider inlining a recursive function of depth 1 only when the caller
is the function itself, rather than inlining it at every call site, so
that we avoid redundant work.
- Use CondContext instead of DomTree for better compilation time.
Adds support for MSVC's undocumented `/funcoverride` flag, which marks
functions as being replaceable by the Windows kernel loader. This is
used to allow functions to be upgraded depending on the capabilities of
the current processor (e.g., the kernel can be built with the naive
implementation of a function, but that function can be replaced at boot
with one that uses SIMD instructions if the processor supports them).
For each marked function we need to generate:
* An undefined symbol named `<name>_$fo$`.
* A defined symbol `<name>_$fo_default$` that points to the `.data`
section (anywhere in the data section; it is assumed to be zero-sized).
* An `/ALTERNATENAME` linker directive that points from `<name>_$fo$` to
`<name>_$fo_default$`.
This is used by the MSVC linker to generate the appropriate metadata in
the Dynamic Value Relocation Table.
Marked functions must never be inlined (otherwise those inlined call
sites can't be replaced).
Note that I've chosen to implement this in AsmPrinter as there was no
way to create a `GlobalVariable` for `<name>_$fo$` that would result in
a symbol being emitted (as nothing consumes it and it has no
initializer). I tried to have `llvm.used` and `llvm.compiler.used` point
to it, but this didn't help.
Within LLVM I referred to this feature as "loader replaceable", since
"function override" already has a different meaning to C++ developers...
I also took the opportunity to extract the feature symbol generation
code used by both AArch64 and X86 into a common function in AsmPrinter.
Recursive functions are generally not inlined to avoid issues
like infinite inlining or excessive code expansion. However,
this conservative approach misses opportunities for optimization in
cases where a recursive call is guaranteed to execute only once.
This patch detects a scenario where a guarding branch condition of a recursive
call will become false after the first iteration of the recursive function.
If such a condition is met, and the recursion depth is confirmed to be one,
the Inliner will now consider this recursive function for inlining.
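As a source-level illustration of the pattern (hypothetical example, not
taken from the patch's test case): the guard becomes false inside the
recursive call, so the recursion depth is provably one.
```
// 'first_pass' is false at the recursive call site, so the recursion can
// only go one level deep and the call becomes a candidate for inlining.
static int sum_with_header(const int *data, int n, bool first_pass) {
  int total = 0;
  if (first_pass)
    total += sum_with_header(data, n, /*first_pass=*/false);
  for (int i = 0; i < n; ++i)
    total += data[i];
  return total;
}
```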
A new test case (`test/Transforms/Inline/inline-recursive-fn.ll`)
has been added to verify this behaviour.
Query the correct TTI for the current target instead of constructing
some random default one. Also query the pass manager for
ProfileSummaryInfo.
This should only change the printing, not the actual result.
We can use *Set::insert_range to collapse:

  for (auto Elem : Range)
    Set.insert(Elem);

down to:

  Set.insert_range(Range);

In some cases, we can further fold that into the set declaration.
This is a follow-up to https://github.com/llvm/llvm-project/pull/130210.
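A concrete instance of the fold, assuming one of the LLVM set types that
gained `insert_range` (the container and element type here are purely
illustrative):
```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseSet.h"

// Bulk-insert the whole range instead of looping element by element.
void collect(llvm::DenseSet<int> &Set, llvm::ArrayRef<int> Range) {
  Set.insert_range(Range);
}
```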
The EphemeralValuesAnalysis pass used to return an EphemeralValuesCache
object, which held the ephemeral values, collected them lazily, and
supported invalidation via the `clear()` function.
This patch removes the EphemeralValuesCache class completely and instead
returns the SmallVector containing the ephemeral values.
This patch does two things:
1. It implements an ephemeral values cache analysis pass that collects the ephemeral values of a function and caches them for fast lookups. The collection of the ephemeral values is done lazily when the user calls `EphemeralValuesCache::ephValues()`.
2. It adds caching of ephemeral values using the `EphemeralValuesCache` to speed up `CallAnalyzer::analyze()`. Without caching, this can take a long time to run when the function contains a large number of `@llvm.assume()` calls and a large number of callsites. The time is spent in `collectEphemeralValues()` (see the illustration below).
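For context, an illustrative source-level example (not from the patch):
values that exist only to feed an assumption are "ephemeral" and should
not be charged by the inline-cost analysis.
```
// The comparison 'divisor > 0' feeds only the assume; it is ephemeral and
// should not count toward the callee's inline cost.
inline int scaled(int x, int divisor) {
  __builtin_assume(divisor > 0);
  return x / divisor;
}
```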
When the size is an appropriate constant, __memcpy_chk will turn into a
memcpy that gets folded away by InstCombine. Therefore this patch avoids
counting these as calls for purposes of inlining costs.
This is only really relevant on platforms whose headers redirect memcpy
to __memcpy_chk (such as Darwin). On platforms that use intrinsics,
memcpy and similar functions are already exempt from call penalties.
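A hypothetical source-level illustration of the case described above:
```
// With a constant size, this fortified call lowers to a plain memcpy that
// InstCombine folds away, so it should not be charged as a real call.
void copy16(char *dst, const char *src) {
  __builtin___memcpy_chk(dst, src, 16, __builtin_object_size(dst, 0));
}
```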
Previously, InlineCostAnnotationPrinter only printed the inline cost for
call instructions. I don't think there is any reason not to analyze an
invoke and its callee, so this patch adds such support.
Currently we are not able to inline a large function even if it only has
one live use, because the inline cost is still very high after applying
`LastCallToStaticBonus`, which is a constant. This can significantly
impact performance because CSR spills are very expensive.
This PR adds a new function `getInliningLastCallToStaticBonus` to TTI to
allow targets to customize this value.
Fixes SWDEV-471398.
The constructor initializes `*this` with `M->getDataLayout()`, which
is effectively the same as calling the copy constructor.
There does not seem to be a case where a copy would be necessary.
Pull Request: https://github.com/llvm/llvm-project/pull/102841
Use ICmpInst::compare() where possible, and ConstantFoldCompareInstOperands()
in other places. This only changes places where either the fold is
guaranteed to succeed, or the code doesn't use the resulting compare if
we fail to fold.
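A minimal sketch of the first pattern, evaluating an integer predicate
directly on APInts (this path is guaranteed to succeed):
```
#include "llvm/ADT/APInt.h"
#include "llvm/IR/Instructions.h"

// Evaluate 'A slt B' directly instead of materializing a constant expression.
bool isSignedLess(const llvm::APInt &A, const llvm::APInt &B) {
  return llvm::ICmpInst::compare(A, B, llvm::ICmpInst::ICMP_SLT);
}
```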
#66457 made InlineCost use cost-benefit analysis by default, which caused
a 0.4-0.5% performance regression on multiple internal workloads. See the
discussion at https://github.com/llvm/llvm-project/pull/66457. This pull
request reverts it.
Co-authored-by: helloguo <helloguo@meta.com>
Vectors are always bit-packed and don't respect the elements' alignment
requirements. This is different from arrays. This means offsets of
vector GEPs need to be computed differently than offsets of array GEPs.
This PR fixes many places that rely on the incorrect pattern of always
using `DL.getTypeAllocSize(GTI.getIndexedType())`. We replace these with
uses of `GTI.getSequentialElementStride(DL)`, a new helper function added
in this PR.
This changes behavior for GEPs into vectors whose element types have a
(bit) size and alloc size that differ. This includes two cases:
* Types with a bit size that is not a multiple of a byte, e.g. i1.
GEPs into such vectors are questionable to begin with, as some elements
are not even addressable.
* Overaligned types, e.g. i16 with 32-bit alignment.
Existing tests are unaffected, but a miscompilation of a new test is fixed.
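A minimal sketch of the corrected pattern (the return type is left to
`auto` so it mirrors whatever the helper returns):
```
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"

// Per-element stride for a sequential (array or vector) GEP step.
// Old, incorrect for bit-packed vectors:
//   DL.getTypeAllocSize(GTI.getIndexedType())
auto elementStride(llvm::gep_type_iterator GTI, const llvm::DataLayout &DL) {
  return GTI.getSequentialElementStride(DL);
}
```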
---------
Co-authored-by: Nikita Popov <github@npopov.com>
This is a stacked PR following on from #68415.
This patch has two purposes:
(1) It tries to make inlining more likely when it can avoid a
streaming-mode change.
(2) It avoids inlining when inlining causes more streaming-mode changes.
An example of (1) is:
```
void streaming_compatible_bar(void);
void foo(void) __arm_streaming {
  /* other code */
  streaming_compatible_bar();
  /* other code */
}
void f(void) {
  foo(); // expensive streaming mode change
}
->
void f(void) {
  /* other code */
  streaming_compatible_bar();
  /* other code */
}
```
whereas it wouldn't have inlined the function if foo were a
non-streaming function.
An example of (2) is:
```
void streaming_bar(void) __arm_streaming;
void foo(void) __arm_streaming {
  streaming_bar();
  streaming_bar();
}
void f(void) {
  foo(); // expensive streaming mode change
}
-> (do not inline into)
void f(void) {
  streaming_bar(); // these are now two expensive streaming mode changes
  streaming_bar();
}
```
- The motivation is to expose tunable knobs to control the aggressiveness of inlining for different backends (e.g., machines with different icache sizes, and workloads with different icache/iTLB PMU counters). Tuning inline aggressiveness shows a small (~+0.3%) but stable improvement on workloads/hardware that are more frontend bound.
- Both multipliers can be overridden from the command line.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D153154
Multiplying a raw block frequency by an integer carries a high risk
of overflow.
- Add `BlockFrequency::mul`, which returns a `std::optional` holding the
product, or `std::nullopt` to indicate overflow (see the sketch below).
- Fix two instances where overflow was likely.
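A minimal usage sketch of the overflow-checked multiply (assuming it
takes an integer factor, as described above):
```
#include "llvm/Support/BlockFrequency.h"
#include <cstdint>
#include <optional>

// Returns std::nullopt instead of silently wrapping on overflow.
std::optional<llvm::BlockFrequency> scaledFreq(llvm::BlockFrequency Freq,
                                               uint64_t Factor) {
  return Freq.mul(Factor);
}
```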
On AMDGPU, alloca instructions have a penalty that can be avoided when
SROA is applied after inlining.
This patch introduces the default implementation of
TargetTransformInfo::getCallerAllocaCost.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D149740
When we inline a callee into a caller, the compiler needs to make sure
that the caller supports a superset of instruction sets that the
callee is allowed to use. Normally, we check for the compatibility of
target features via functionsHaveCompatibleAttributes, but that
happens after we decide to honor the call site attribute
Attribute::AlwaysInline. If the caller contains a call marked with
Attribute::AlwaysInline, which can happen with
__attribute__((flatten)) placed on the caller, the caller could end up
with code that cannot be lowered to assembly code.
This patch fixes the problem by checking the target feature
compatibility before we honor Attribute::AlwaysInline.
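A hypothetical source-level illustration of the failure mode (x86 target
attributes chosen for the example; the names are made up):
```
#include <immintrin.h>

// wide_kernel requires avx512f; driver is compiled without it. flatten marks
// the call always-inline, so without the target-feature compatibility check
// the 512-bit body would be inlined into a caller that cannot lower it.
__attribute__((target("avx512f"))) static void wide_kernel(float *p) {
  __m512 v = _mm512_loadu_ps(p);
  v = _mm512_add_ps(v, _mm512_set1_ps(1.0f));
  _mm512_storeu_ps(p, v);
}

__attribute__((flatten)) void driver(float *p) {
  wide_kernel(p);
}
```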
Fixes https://github.com/llvm/llvm-project/issues/62664
Differential Revision: https://reviews.llvm.org/D150396
!make.implicit metadata attached to a branch means it will very likely
be eliminated (together with the associated cmp instruction).
Reviewed By: apilipenko
Differential Revision: https://reviews.llvm.org/D149747
This aligns the inlining macros more closely with how the regalloc
macros are defined.
- Explicitly specify the dtype/shape
- Remove separate names for python/C++
- Add docstring for inline cost features
Differential Revision: https://reviews.llvm.org/D149384
Instead, use ConstantFoldSelectInstruction(), which returns nullptr if the
select cannot be folded (the case where a constant expression would
otherwise have been produced).
In preparation for removing select constant expressions.