Only accumulate the codesize increase of functions that are actually
specialized, rather than for every candidate specialization that we
analyse.
This fixes a subtle bug where prior analysis of candidate
specializations that were deemed unprofitable could prevent subsequent
profitable candidates from being recognised.
Enable specialization on literal constant arguments by default in
Function Specialization.
---------
Co-authored-by: Alexandros Lamprineas <alexandros.lamprineas@arm.com>
We can turn:
```
%add = add i8 %arg, C1
%and = and i8 %add, C2
%cmp = icmp eq i1 %and, C3
```
into:
```
%and = and i8 %arg, C2
%cmp = icmp eq i1 %and, (C3 - C1) & C2
```
This is only worth doing if the sequence is the sole user of the addition
operation.
Motivation example:
```
> cat test.cpp
extern "C" [[gnu::weak]] void f() {}
void alias() __attribute__((alias("f")));
int main() { auto p = alias; p(); }
> clang test.cpp -fsanitize=cfi-icall -flto=thin -fuse-ld=lld
> ./a.out
[1] 1868 illegal hardware instruction ./a.out
```
If the address of a function was only taken through its alias, the
function was not considered exported and therefore was not included
in the CFI jumptable. This resulted in `@llvm.type.test()` being lowered
to `false`, and consequently the indirect call to the function was
eventually optimized to `ubsantrap()`.
Use VPInstruction::ResumePhi to create phi nodes for reduction resume
values in the scalar preheader, similar to how ResumePhis are used for
first-order recurrence resume values after 9a5a8731e77.
This allows simplifying createAndCollectMergePhiForReduction to only
collect reduction resume phis when vectorizing epilogue loops and adding
extra incoming edges from the main vector loop. Updating phis for the
epilogue vector loops requires special attention, because additional
incoming values from the bypass blocks need to be added.
PR: https://github.com/llvm/llvm-project/pull/110004
This patch adds support for cold function coverage instrumentation based
on sampling PGO counts. The major motivation is to detect dead functions
for the services that are optimized with sampling PGO. If a function is
covered by sampling profile count (e.g., those with an entry count > 0),
we choose to skip instrumenting those functions, which significantly
reduces the instrumentation overhead.
More details about the implementation and flags:
- Added a flag `--pgo-instrument-cold-function-only` in
`PGOInstrumentation.cpp` as the main switch to control skipping the
instrumentation.
- Built the extra instrumentation passes(a bundle of passes in
`addPGOInstrPasses`) under sampling PGO pipeline. This is controlled by
`--instrument-cold-function-only-path` flag.
- Added a driver flag `-fprofile-generate-cold-function-coverage`:
- 1) Config the flags in one place, i,e. adding
`--instrument-cold-function-only-path=<...>` and
`--pgo-function-entry-coverage`. Note that the instrumentation file path
is passed through `--instrument-sample-cold-function-path`, because we
cannot use the `PGOOptions.ProfileFile` as it's already used by
`-fprofile-sample-use=<...>`.
- 2) makes linker to link `compiler_rt.profile` lib(see
[ToolChain.cpp#L1125-L1131](https://github.com/llvm/llvm-project/blob/main/clang/lib/Driver/ToolChain.cpp#L1125-L1131)
).
- Added a flag(`--pgo-cold-instrument-entry-threshold`) to config entry
count to determine cold function.
Overall, the full command is like:
```
clang++ -O2 -fprofile-generate-cold-function-coverage=<...> -fprofile-sample-use=<...> code.cc -o code
```
If the list of scalars vectorized as the part of the same vector node,
no need to generate vector node again, it will be handled as part of
overlapping matching.
Fixes#113810
We left some easy opportunities for further simplifications.
log2(trunc(x)) is simply trunc(log2(x)). This is safe if we know that
trunc is NUW because it means that the truncation didn't drop any bits.
It is also safe if the caller is OK with zero as a possible answer.
log2(x >>u y) is simply `log2(x) - y`.
log2(x & y) is a funny one. It comes up when doing something like:
```
unsigned int f(unsigned int x, unsigned int y) {
unsigned char a = 1u << x;
return y / a;
}
```
LLVM would canonicalize this to:
```
%shl = shl nuw i32 1, %x
%conv1 = and i32 %shl, 255
%div = udiv i32 %y, %conv1
```
In cases like these, we can ignore the mask entirely.
This is equivalent to `y >> x`.
The issue fixed in PR113337 exposed a bug in the comparisons done in
allocTypesMatch, which compares a vector of alloc types to those in the
given vector of Edges. The form of std::equal used, which didn't provide
the end iterator for the Edges vector, will iterate through as many
entries in the Edges vector as in the InAllocTypes vector, which can
fail if there are fewer entries in the Edges vector, because we may
dereference a bogus Edge pointer. This function is called twice, once
for the Node, with its callee edges, in which case the number of edges
should always match the number of entries in allocTypesMatch, which is
computed from the Node's callee edges. It was also called for Node's
clones, and it turns out that after cloning and edge modifications done
for other allocations, the number of callee edges in Node and its clones
may no longer match. In some cases, more common with memprof ICP before
the PR113337, the number of clone edges can be smaller leading to a bad
dereference. I found for a large application even before adding memprof
ICP support we sometimes call this with fewer entries in the clone's
callee edges, but were getting lucky as they had allocation type None,
and we didn't end up attempting to dereference the bad edge pointer.
Fix this by passing Edges.end() to std::equal, which means std::equal
will fail if the number of entries in the 2 vectors are not equal.
However, this is too conservative, as clone edges may have been added or
removed since it was initially cloned, and in fact can be wrong as we
may not be comparing allocation types corresponding to the same callee.
Therefore, a couple of enhancements are made to avoid regressing and
improve the checking and cloning:
- Don't bother calling the alloc type comparison when the clone and the
Node's alloc type for the current allocation are precise (have a
single allocation type) and are the same (which is guaranteed by an
earlier check, and an assert is added to confirm that). In that case
we can trivially determine that the clone can be used.
- Split the alloc type matching handling into a separate function for
the clone case. In that case, for each of the InAllocType entries,
attempt to find and compare to the clone callee edge with the same
callee as the corresponding original node callee.
To create a test case I needed to take a spec application (xalancbmk),
and repeatedly apply random hot/cold-ness to the memprof contexts
when building, until I hit the problematic case. I then reduced that
full LTO IR using llvm-reduce and then manually.
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.
Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
# What
This PR renames the newly-introduced llvm attribute
`sanitize_realtime_unsafe` to `sanitize_realtime_blocking`. Likewise,
sibling variables such as `SanitizeRealtimeUnsafe` are renamed to
`SanitizeRealtimeBlocking` respectively. There are no other functional
changes.
# Why?
- There are a number of problems that can cause a function to be
real-time "unsafe",
- we wish to communicate what problems rtsan detects and *why* they're
unsafe, and
- a generic "unsafe" attribute is, in our opinion, too broad a net -
which may lead to future implementations that need extra contextual
information passed through them in order to communicate meaningful
reasons to users.
- We want to avoid this situation and make the runtime library boundary
API/ABI as simple as possible, and
- we believe that restricting the scope of attributes to names like
`sanitize_realtime_blocking` is an effective means of doing so.
We also feel that the symmetry between `[[clang::blocking]]` and
`sanitize_realtime_blocking` is easier to follow as a developer.
# Concerns
- I'm aware that the LLVM attribute `sanitize_realtime_unsafe` has been
part of the tree for a few weeks now (introduced here:
https://github.com/llvm/llvm-project/pull/106754). Given that it hasn't
been released in version 20 yet, am I correct in considering this to not
be a breaking change?
In some cases, VPWidenCastRecipes are created but not considered in the
legacy cost model, including truncates/extends when evaluating a reduction
in a smaller type. Return 0 for such casts for now, to avoid divergences
between VPlan and legacy cost models.
Fixes https://github.com/llvm/llvm-project/issues/113526.
The recent change to add support for cloning indirect calls
inadvertantly caused duplicate edges to be created between the same
caller/callee pair. This is due to the new moveCalleeEdgeToNewCaller
not properly guarding the addition of a new edge (ironically I was
testing for that in an assertion, but failed to handle that case
specially otherwise). Now simply move the context ids over to any
existing edge.
This issue in turn led to some assumptions in cloning being violated,
resulting in a later crash.
Add a test for this case to checkNode.
Type::isScalableTy and StructType::containsScalableVectorType failed to
detect some cases of structs containing scalable vectors because
containsScalableVectorType did not call back into isScalableTy to check
the element types. Fix this, which requires sharing the same Visited set
in both functions. Also change the external API so that callers are
never required to pass in a Visited set, and normalize the naming to
isScalableTy.
This patch implements an approach to communicate errors between the
OMPIRBuilder and its users. It introduces `llvm::Error` and
`llvm::Expected` objects to replace the values returned by callbacks
passed to `OMPIRBuilder` codegen functions. These functions then check
the result for errors when callbacks are called and forward them back to
the caller, which has the flexibility to recover, exit cleanly or dump a
stack trace.
This prevents a failed callback to leave the IR in an invalid state and
still continue the codegen process, triggering unrelated assertions or
segmentation faults. In the case of MLIR to LLVM IR translation of the
'omp' dialect, this change results in the compiler emitting errors and
exiting early instead of triggering a crash for not-yet-implemented
errors. The behavior in Clang and openmp-opt stays unchanged, since
callbacks will continue always returning 'success'.
CTMark #113200 size overhead was 5.3%, now it's 4.7%.
The patch affects only signed integers.
https://alive2.llvm.org/ce/z/Lv5hyi
* The patch replaces code which extracted sign bit,
maximized/minimized it, then packed it back, with
simple sign bit flip. The another way to think about
transformation is as a subtraction of MIN_SINT from
A/B. Then we map MIN_SINT to 0, 0 to -MIN_SINT, and
MAX_SINT to MAX_UINT.
* Then to maximize/minimize A/B we don't need
to extract sign bit, we can apply shadow the
same way as to other bits.
* After sign bit flip, we had to switch to unsigned
version of the predicates.
* After change above getHighestPossibleValue/getLowestPossibleValue
became very similar, so we can combine into a single function.
* Because the function does sign bit flip and
requires unsigned predicates used for returned values,
there is no point in keeping it as a member of class,
to hide, we switch to function local lambda.
If the scalars is used externally is in the root node, it may have
incorrect signedness info because of the conflict with the demanded bits
analysis. Need to perform exact signedness analysis and compute it
rather than rely on the precomputed value, which might be incorrect for
alternate zext/sext nodes.
Fixes#113520
Since SLP support "clusterization" of the non-load instructions, the
restriction for reduced values for loads only should be removed to avoid
compiler crash.
Fixes#113516
Need to consider undefs correctly, when trying to replace them with
potentially poisonous values in shuffles. Such elements should not be
silently replaced by poison values, instead complex analysis should be
implemented to see if it is safe to do it.
Fixes#113425
This is a recommit of #107120 . The original PR was approved but failed
buildbot. The newly added tests should only be run for compilers that
support the ARM target. This has been resolved by adding a config file
for these tests.
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make ARM backend generate full word loads instead of
loading a single byte (ldrb) and/or half word (ldrh). Only pads
destination when it's a stack allocated constant size array and source
when it's constant string. Heuristic to decide whether to pad or not
is very basic and could be improved to allow more examples to be
padded.
- Pass works at the midend level
This option applies for _import_ WPD (i.e., when `DevirtModule` pass
de-virtualizes according to an imported summary, in ThinLTO backend
pipeline). It's meant for debugging (e.g., bisection).
This is part of the effort to support for enabling plugins on windows by
adding better support for building llvm as a DLL. The export macros used
here were added in #96630
Since shared library symbols aren't deduplicated across multiple
libraries on windows like Linux we have to manually explicitly import
and export `Any::TypeId` template instantiations for the uses of
`llvm::Any` in the LLVM codebase to support LLVM Windows shared library
builds.
This change ensures that external code, including LLVM's own tests, can
use PassManager callbacks when LLVM is built as a DLL.
I also removed the only use of llvm::Any for LoopNest that only existed
in debug code and there also doesn't seem to be any code creating
`Any<LoopNest>`