We can't assume that operand 0 is the negated operand because
the matcher handles "fsub -0.0, X" (and also +0.0 with FMF).
By capturing the extract within the match, we avoid the bug
and make the transform more robust (can't assume that this
pass will only see canonical IR).
Relative to the previous attempt, this is rebased over the
InstSimplify fix in ac74e7a7806480a000c9a3502405c3dedd8810de,
which addresses the miscompile reported in PR58401.
-----
foldOpIntoPhi() currently only folds operations into the phi if all
but one operands constant-fold. The two exceptions to this are freeze
and select, where we allow more general simplification.
This patch makes foldOpIntoPhi() generally simplification based and
removes all the instruction-specific logic. We just try to simplify
the instruction for each operand, and for the (potentially) one
non-simplified operand, we move it into the new block with adjusted
operands.
This fixes https://github.com/llvm/llvm-project/issues/57448, which
was my original motivation for the change.
Differential Revision: https://reviews.llvm.org/D134954
LoopFlatten has been in the code base off by default for years, but this
enables it to run by default. Downstream this has been running for
years, so it has been exposed to quite some code. Then around the time
we switched to the NPM, several fixes went in related to updating the
MemorySSA state and we moved it to a loop pass manager, which both
helped preventing rerunning certain analysis passes, and thus helped a
bit with compile-times.
About compile-times, adding a pass isn't free, but this should see only
very minor increases. The pass is relatively simple and there shouldn't
be anything algorithmically expensive because all it does is looking at
inner/outer loops and it checks assumptions on loop increments and
indices. If we see increases, I expect this to mainly come from
invalidation of analysis info, and perhaps subsequent passes to trigger
and do more. Despite its simplicity/restrictions, it triggers in most
code-bases, which makes it worth to enable this by default.
Differential Revision: https://reviews.llvm.org/D109958
Another alternative to fix the thread identification problem in
coroutines.
We plan to fix this problem by unifying memory effecting attributes. See
https://discourse.llvm.org/t/rfc-unify-memory-effect-attributes/65579.
But it may be a long-term project. And it is a pity that the coroutines
can't resume in different threads for years. So this one is temporary
fix. It may cause unnecessary performance regression for coroutines. But
correctness are more important. And this one is planned to be reverted
after we are able to unify the memory effecting attributes actually.
Reviewed By: jdoerfert, rjmccall
Differential Revision: https://reviews.llvm.org/D135550
Instead of duplicating the existing decomposition code for GEP indices
just use the existing code by calling the existing decompose function on
the index expression and multiply the result's coefficients by the scale of
the index.
This both reduces code duplication and generalizes the pattern we can
handle.
If both operands of an `add nsw` are known positive, it can be treated
the same as `add nuw` and added to the unsigned system.
https://alive2.llvm.org/ce/z/6gprff
If the divisor is a power-of-2 or negative-power-of-2 and the dividend
is known to have >= trailing zeros than the divisor, the division is exact:
https://alive2.llvm.org/ce/z/UGBksM (general proof)
https://alive2.llvm.org/ce/z/D4yPS- (examples based on regression tests)
This isn't the most direct optimization (we could create ashr in these
examples instead of relying on existing folds for exact divides), but
it's possible that there's a more general constraint than just a pow2
divisor, so this might be extended in the future.
This should solve issue #58348.
Differential Revision: https://reviews.llvm.org/D135970
This should be functionally equivalent - both calls are thin
wrappers around computeKnownBits(). We'll probably want to use
known-bits directly in follow-up patches because that could
determine "exact" for example (see issue #58348).
Support decomposition for `mul/shl nuw` with constant operand for unsigned
queries. Those expressions should not wrap in the unsigned sense and can
be added directly to the unsigned system.
makeLoopInvariant may recursively move its operands to make them
invariant, before moving the passed in instruction. Those recursively
moved instructions are currently missed when invalidating block and loop
dispositions.
To address this, move the invalidation code to Loop::makeLoopInvariant.
Fixes#58314.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D135909
`ProvenanceAnalysis::related()` was assuming that the order of parameters for `relatedCheck()` was not affecting
the result but this was not the case when both parameters were `PHINode`s.
Due to this assumption `ProvenanceAnalysis::related()` was ordering the parameters based on pointer value which resulted in
non-deterministic behavior.
To address this change `relatedPHI()` so that it gives the same result independent of the parameter order.
rdar://100325456
Differential Revision: https://reviews.llvm.org/D135376
Instead of checking whether a map entry exists to decide if we should
initialize it or add to it, we can rely on the map entry being constructed
and initialized to 0 before the addition happens.
For the std::max case, I've made a reference to the map entry to
avoid looking it up twice.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D135977
avoiding an assertion.
A BB with a nonzero count, whose successor blocks all have 0 counts, could
cause an assertion. Don't create any branch weights in this case.
Reviewed By: xur
Differential Revision: https://reviews.llvm.org/D134203
(A | B) & ~(A & B) --> A ^ B
https://alive2.llvm.org/ce/z/qpFMns
We already have the equivalent fold for real
logic instructions, but this pattern may occur
with selects too.
This is part of solving issue #58313.
This was reverted because it was breaking when targeting Darwin which
tried to export these symbols which are now hidden. It should be safe
to just stop attempting to export these symbols in the clang driver,
though Apple folks will need to change their TAPI allow list described
in the commit where these symbols were originally exported
f538018562
Bug: https://github.com/llvm/llvm-project/issues/58265
Differential Revision: https://reviews.llvm.org/D135340
Move logic to check and replace conditions to a helper function. This
isolates the code, allows using early returns, reduces the
indentation and simplifies eliminateConstraints.
When generating masked gathers nodes, SLP vectorizer accounts the cost
of the GEPs for loads as part of the scalar-vector transformation cost
estimation. But it does not do it for vectorized loads/stores, while it
may completely remove some of the GEPs completely. Because of this in
some cases masked gather operation can be much more profitable rather
than regular vectorization (masked-gather cost + vector GEP - scalar
loads + GEPs comparing to vectorized loads - scalar loads).
Added the analysis of the removed scalarGEPs for vectorized load/store nodes for better cost estimation.
Differential Revision: https://reviews.llvm.org/D135282
This reverts commit b05f5b90a12098660a4fc16da0b4d421ddfe14e2.
There are thread sanitizer buildbot failures in simple_stack.c.
I think that's because this ended up affecting the handling of
volatile accesses to allocas. Reverting for now.
Don't add argmem if the pointer is clearly not an argument (e.g.
a global). I don't think this makes a difference right now, but
gives more obvious results with D135780.
Limit pointer decomposition to pointers with index sizes of at most 64
bits. int64_t is used for coefficients, so as long as the index size <=
64 bits we should be able to represent all pointer offsets.
Pointer decomposition is limited to inbounds GEPs, so if a index
computation would overflow the result is poison, so it doesn't matter
that the coefficient overflows.
This allows replacing MulOverflow with regular multiplications.
We have to account for accesses to argument memory via captures.
I don't think there's any way to make this produce incorrect
results right now (because as soon as "other" is set, we lose the
ability to infer argmemonly), but this avoids incorrect results
once we have more precise representation.
The code for inferring memory attributes on arguments claims that
inalloca/preallocated arguments are always clobbered:
d71ad41080/llvm/lib/Transforms/IPO/FunctionAttrs.cpp (L640-L642)
However, we would still infer memory attributes for the whole
function without taking this into account, so we could still end
up inferring readnone for the function. This adds an argument
clobber if there are any inalloca/preallocated arguments.
Differential Revision: https://reviews.llvm.org/D135783
Add a check (can be disabled via a flag) that the pipeline we generate is actually parsable.
Can be disabled because we don't expect to handle every pass in -print-pipeline-passes.
Fixes#58280.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D135703
(X << Z) / (Y << Z) --> X / Y
https://alive2.llvm.org/ce/z/CLKzqT
This requires a surprising "nuw" constraint because we have
to guard against immediate UB via signed-div overflow with
-1 divisor.
This extends 008a89037a49ca0d9 and is another transform
derived from issue #58137.
Need to set the insertpoint for extractelement to point to the first
instruction in the node to avoid possible crash during external uses
combine process. Without it we may endup with the incorrect
transformation.
Differential Revision: https://reviews.llvm.org/D135591
(X << Z) / (Y << Z) --> X / Y
https://alive2.llvm.org/ce/z/E5eaxU
This fixes the motivating example from issue #58137,
but it is not the most general transform. We should
probably also convert left-shift in the divisor to
right-shift in the dividend for that, but that exposes
another missed canonicalization for shifts and adds.
Freeze instruction in some cases makes codegen worse, so need to be very
careful when emitting it. Instead improve analysis in isUndefVector
function to generate mask of unused elements and use it in the analysis.
Differential Revision: https://reviews.llvm.org/D135382
optimizeInductions may leave dead recipes which can prevent sinking.
Sinking on the other hand should not introduce new dead recipes, so
clean up dead recipes before sinking.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133762
They have been scattered over the code. For better structuring, perform
them in one place. Potential CT drop is possible because we collect exit
blocks twice, but it's small price to pay for much better code structure.
These are semantically two different stages, but were entwined in the
old implementation. Now cost computation does not do legality checks,
and they all are done beforehead.
For a local linkage GlobalObject in a non-prevailing COMDAT, it remains defined while its
leader has been made available_externally. This violates the COMDAT rule that
its members must be retained or discarded as a unit.
To fix this, update the regular LTO change D34803 to track local linkage
GlobalValues, and port the code to ThinLTO (GlobalAliases are not handled.)
This fixes two problems.
(a) `__cxx_global_var_init` in a non-prevailing COMDAT group used to
linger around (unreferenced, hence benign), and is now correctly discarded.
```
int foo();
inline int v = foo();
```
(b) Fix https://github.com/llvm/llvm-project/issues/58215:
as a size optimization, we place private `__profd_` in a COMDAT with a
`__profc_` key. When FuncImport.cpp makes `__profc_` available_externally due to
a non-prevailing COMDAT, `__profd_` incorrectly remains private. This change
makes the `__profd_` available_externally.
```
cat > c.h <<'eof'
extern void bar();
inline __attribute__((noinline)) void foo() {}
eof
cat > m1.cc <<'eof'
int main() {
bar();
foo();
}
eof
cat > m2.cc <<'eof'
__attribute__((noinline)) void bar() {
foo();
}
eof
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto=thin -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
```
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D135427
((Op1 * X) / Y) / Op1 --> X / Y
https://alive2.llvm.org/ce/z/JYxWjA
InstSimplify handles the more basic mul+div pattern with
shared operand, but we don't seem to have any reassociation
folds to handle cases where the common op is further away.
This is a generalization of 9cff4711ac72 and another
transform derived from issue #58137.