As a follow-on to #113686, this breaks the recursion between phi nodes
that have p1 = phi(x, p2) and p2 = phi(y, p1). The knownFPClass can be
calculated from the classes of p1 and p2.
Relands #114356. Compared to the last version, this patch only merges
poison-generating/nsz flags from the select to fix LV regression in
`llvm/test/Transforms/PhaseOrdering/AArch64/predicated-reduction.ll`.
Given a recursive phi with select:
    %p = phi [ 0, entry ], [ %sel, loop ]
    %sel = select %c, %other, %p
The fp state can be calculated using the knowledge that the select/phi
pair can only be the initial state (0 here) or from %other. This adds a
short-cut into computeKnownFPClass for PHI to detect that the select is
recursive back to the phi, and if so use the state from the other
operand.
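The reasoning above can be sketched outside of LLVM as follows. This is a minimal model, not the actual computeKnownFPClass code: the `FPClass` enum, `ClassSet` and `recursivePhiClasses` are hypothetical names, and the class set is a simplified stand-in for LLVM's FP class bitmask.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for an FP class bitmask.
enum FPClass : std::uint8_t {
  kNone = 0,
  kNaN = 1 << 0,
  kInf = 1 << 1,
  kNormal = 1 << 2,
  kSubnormal = 1 << 3,
  kZero = 1 << 4,
};

// A set of classes a value may belong to, stored as a bitmask.
using ClassSet = std::uint8_t;

// For
//   %p   = phi [ init, entry ], [ %sel, loop ]
//   %sel = select %c, %other, %p
// the phi can only ever hold the initial value or a value that came
// from %other, so its possible classes are the union of the two sets.
// The recursive %p operand of the select contributes nothing new.
constexpr ClassSet recursivePhiClasses(ClassSet init, ClassSet other) {
  return static_cast<ClassSet>(init | other);
}
```

For example, with an initial value of 0 (`kZero`) and an `%other` known to be a normal number, the result is `kZero | kNormal`, which excludes NaN and infinity.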
This helps to address a regression from #83200.
This reverts commit 7f2e937469a8cec3fe977bf41ad2dfb9b4ce648a as it causes
regressions in the tests it modifies, and undoes what was added in #100653
(which itself was a fix for a previous regression).
Enables initial non-power-of-2 support for reductions (the number of
elements must still form whole registers).
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/112361
- **[Inliner] Add tests for propagating more parameter attributes; NFC**
- **[Inliner] Propagate more attributes to params when inlining**
Add support for propagating:
- `dereferenceable`
- `dereferenceable_or_null`
- `align`
- `nonnull`
- `range`
These are only propagated if the parameter to the to-be-inlined callsite
matches the exact parameter used in the to-be-inlined function.
Update VPInterleaveRecipe to always use the pointer to member 0 as
pointer argument. This in many cases helps to remove unneeded index
adjustments and simplifies VPInterleaveRecipe::execute.
In some rare cases, the address of member 0 does not dominate the insert
position of the interleave group. In those cases a PtrAdd VPInstruction
is emitted to compute the address of member 0 based on the address of
the insert position. Alternatively we could hoist the recipe computing
the address of member 0.
Alive2: https://alive2.llvm.org/ce/z/NJgBPL
The motivating case of this patch is to emit `andn` on RISC-V with zbb
for expressions like `(sub 63, X) & 63`.
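One way to see why an and-not form is available for the motivating expression: modulo 2^64, `63 - X` equals `~X + 64`, and adding 64 does not change the low six bits, so `(63 - X) & 63` equals `~X & 63`. The sketch below checks this on a sample of inputs; the function names are hypothetical and this is not the code from the patch.

```cpp
#include <cstdint>

// The original expression: (sub 63, X) & 63.
constexpr std::uint64_t subThenMask(std::uint64_t x) { return (63 - x) & 63; }

// The and-not form: ~X & 63, which maps to a single andn on RISC-V with zbb.
constexpr std::uint64_t andnForm(std::uint64_t x) { return ~x & 63; }

// Compares both forms on small values and a few edge cases.
bool formsAgree() {
  for (std::uint64_t x = 0; x < 4096; ++x)
    if (subThenMask(x) != andnForm(x))
      return false;
  const std::uint64_t edges[] = {UINT64_MAX, UINT64_MAX - 63, 1ull << 63};
  for (std::uint64_t x : edges)
    if (subThenMask(x) != andnForm(x))
      return false;
  return true;
}
```

The Alive2 link above remains the actual proof; this only illustrates the identity on concrete values.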
…ntElimination
The ArgumentPromotion and DeadArgumentElimination passes can change
function signatures while the function name stays the same as before the
transformation. This makes tracing with bpf programs hard, since users
tend to rely on the function signature in the source. See discussion [1]
for details.
This patch adds a suffix to functions whose signatures are changed. The
suffix lets users know that the function signature has changed and that
they need to inspect the IR or binary to find the modified signature
before tracing those functions.
The suffix for ArgumentPromotion is ".argprom" and the suffixes for
DeadArgumentElimination are ".argelim" and ".retelim". The suffix also
gives users a hint about what kind of transformation has been done.
With this patch, I built a recent linux kernel with full LTO enabled. I
got 4 functions with only argpromotion like
```
set_track_update.argelim.argprom
pmd_trans_huge_lock.argprom
...
```
I got 1058 functions with only deadargelim like
```
process_bit0.argelim
pci_io_ecs_init.argelim
...
```
I got 3 functions with both argpromotion and deadargelim
```
set_track_update.argelim.argprom
zero_pud_populate.argelim.argprom
zero_pmd_populate.argelim.argprom
```
[1] https://github.com/llvm/llvm-project/issues/104678
This is a follow up to 924907bc6, and is mostly motivated by consistency
but does include one additional optimization. In general, we prefer 0.0
over -0.0 as the identity value for an fadd. We use that value in
several places, but don't in others. So, let's be consistent and use the
same identity (when nsz allows) everywhere.
This creates a bunch of test churn, but due to 924907bc6, most of that
churn doesn't actually indicate a change in codegen. The exception is
that this change enables the use of 0.0 for nsz, but *not* reassoc, fadd
reductions. Or said differently, it allows the neutral value of an
ordered fadd reduction to be 0.0.
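The reason nsz matters here can be shown with plain IEEE-754 arithmetic. In the sketch below (hypothetical helper names, not LLVM code), -0.0 is an exact identity for addition, while 0.0 turns a -0.0 input into +0.0. That difference is only in the sign of zero, which nsz allows us to ignore.

```cpp
#include <cmath>

// True when a and b are equal and also agree on the sign of zero.
bool sameValueAndSign(double a, double b) {
  return a == b && std::signbit(a) == std::signbit(b);
}

// Returns true if `identity + x` reproduces x exactly, including the
// sign of zero, for a few sample inputs.
bool isExactFAddIdentity(double identity) {
  const double samples[] = {0.0, -0.0, 1.5, -2.25};
  for (double x : samples)
    if (!sameValueAndSign(identity + x, x))
      return false;
  return true;
}
```

Here `isExactFAddIdentity(-0.0)` holds, while `isExactFAddIdentity(0.0)` fails only on the `-0.0` sample. This assumes default floating-point semantics (no fast-math) when compiling the sketch.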
This patch replaces all dominated uses of condition with true/false to
improve context-sensitive optimizations. It eliminates a bunch of
branches in llvm-opt-benchmark.
As a side effect, it may introduce new phi nodes in some corner cases.
See the following case:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
  br i1 %cond, label %bb1, label %bb2
bb1:
  br i1 %cmp, label %if.then, label %if.else
if.then:
  br label %bb2
if.else:
  br label %bb2
bb2:
  %res = phi i1 [ %cmp, %entry ], [ %cmp, %if.then ], [ %cmp, %if.else ]
  ret i1 %res
}
```
It will be simplified into:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
  br i1 %cond, label %bb1, label %bb2
bb1:
  br i1 %cmp, label %if.then, label %if.else
if.then:
  br label %bb2
if.else:
  br label %bb2
bb2:
  %res = phi i1 [ %cmp, %entry ], [ true, %if.then ], [ false, %if.else ]
  ret i1 %res
}
```
I am planning to fix this late in the pipeline or in CGP, since this
problem existed before the patch.
This is the SimplifyCFG part of
https://github.com/llvm/llvm-project/pull/95515
In this PR, we support hoisting load/store with conditional faulting in
`SimplifyCFGOpt::speculativelyExecuteBB` to eliminate conditional
branches.
This is for cases like
```
void test (int a, int *b) {
  if (a)
    *b = a;
}
```
In the following patches, we will support the hoist in
`SimplifyCFGOpt::hoistCommonCodeFromSuccessors`.
That is for cases like
```
void test (int a, int *c, int *d) {
  if (a)
    *c = a;
  else
    *d = a;
}
```
The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. The SLP vectorizer performs a pretty similar
(but large in SLOCs) estimation, which is not always correct. Instead,
this patch implements clustering analysis and actual node allocation with
the full analysis for the vectorized clustered scalars (not only loads,
but also some other instructions), with the correct cost estimation and
vector insert instructions. This improves overall vectorization quality
and simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
with "[Vectorize] Fix warnings"
It introduced compiler crashes, see #104144.
This reverts commit 69332bb8995aef60d830406de12cb79a50390261 and
351f4a5593f1ef507708ec5eeca165b20add3340.
The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. The SLP vectorizer performs a pretty similar
(but large in SLOCs) estimation, which is not always correct. Instead,
this patch implements clustering analysis and actual node allocation with
the full analysis for the vectorized clustered scalars (not only loads,
but also some other instructions), with the correct cost estimation and
vector insert instructions. This improves overall vectorization quality
and simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
Whilst dealing with review comments on
https://github.com/llvm/llvm-project/pull/96752
I discovered that SCEV does not know about the dereferenceable attribute
on function arguments so I have updated getRangeRef to make use of it
by calling getPointerDereferenceableBytes.
The idea behind this canonicalization is that it allows us to handle fewer
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.
However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.
The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.
For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
This patch adds the aforementioned fold to InstCombine. This pattern is
produced after naive implementations of 3-way comparison in high-level
languages are transformed into LLVM IR and then optimized.
Proofs: https://alive2.llvm.org/ce/z/w4QLq_
Update createEdgeMask to create masks where the terminator in Src is a
switch. We need to handle 2 separate cases:
1. Dst is not the default destination. Dst is reached if any of the
cases with destination == Dst are taken. Join the conditions for each
case where destination == Dst using a logical OR.
2. Dst is the default destination. Dst is reached if none of the cases
with destination != Dst are taken. Join the conditions for each case
where the destination is != Dst using a logical OR and negate it.
Edge masks are created for every destination of cases and/or
default when requesting a mask where the source is a switch.
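The two cases above can be sketched on scalar values as follows. This is a simplified model, not the VPlan code: `SwitchCase`, `edgeTaken` and `exampleCases` are hypothetical names, and destinations are plain integers standing in for basic blocks.

```cpp
#include <vector>

// Hypothetical model of one switch case: a case value and its destination.
struct SwitchCase {
  int value;
  int dest;
};

// Returns whether the edge Src -> dst is taken for condition value `cond`.
bool edgeTaken(int cond, const std::vector<SwitchCase> &cases,
               int defaultDest, int dst) {
  if (dst != defaultDest) {
    // Case 1: OR together the conditions of all cases that jump to dst.
    bool any = false;
    for (const SwitchCase &c : cases)
      if (c.dest == dst)
        any = any || (cond == c.value);
    return any;
  }
  // Case 2: dst is the default destination. OR together the conditions of
  // all cases that jump somewhere else, then negate.
  bool anyOther = false;
  for (const SwitchCase &c : cases)
    if (c.dest != dst)
      anyOther = anyOther || (cond == c.value);
  return !anyOther;
}

// Example: cases 1 and 2 go to block 10, case 3 goes to block 20,
// case 4 explicitly goes to block 30, which is also the default.
std::vector<SwitchCase> exampleCases() {
  return {{1, 10}, {2, 10}, {3, 20}, {4, 30}};
}
```

With a default destination of 30, `edgeTaken(2, exampleCases(), 30, 10)` holds via case 1 of the list, while `edgeTaken(4, exampleCases(), 30, 30)` holds via case 2 because no case with a different destination matches.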
Fixes https://github.com/llvm/llvm-project/issues/48188.
PR: https://github.com/llvm/llvm-project/pull/99808
MustExec has special logic to determine whether the first loop iteration
will always be executed, by simplifying the IV comparison with the start
value. Currently, this code assumes that the IV is on the LHS of the
comparison, but this is not guaranteed. Make sure it handles the
commuted variant as well.
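A rough sketch of handling both operand orders, assuming a simplified integer model: when the IV is on the RHS, the predicate is swapped before substituting the start value. `Pred`, `IVCompare` and `holdsOnFirstIteration` are hypothetical names and not the MustExec API.

```cpp
// Hypothetical signed comparison predicates.
enum class Pred { LT, LE, GT, GE, EQ, NE };

// Predicate giving the same result with operands swapped:
// (a P b) == (b swapped(P) a).
Pred swapped(Pred p) {
  switch (p) {
  case Pred::LT: return Pred::GT;
  case Pred::LE: return Pred::GE;
  case Pred::GT: return Pred::LT;
  case Pred::GE: return Pred::LE;
  case Pred::EQ: return Pred::EQ;
  case Pred::NE: return Pred::NE;
  }
  return p;
}

bool evaluate(Pred p, long lhs, long rhs) {
  switch (p) {
  case Pred::LT: return lhs < rhs;
  case Pred::LE: return lhs <= rhs;
  case Pred::GT: return lhs > rhs;
  case Pred::GE: return lhs >= rhs;
  case Pred::EQ: return lhs == rhs;
  case Pred::NE: return lhs != rhs;
  }
  return false;
}

// A comparison between the IV and a loop-invariant value, where the IV
// may appear on either side.
struct IVCompare {
  Pred pred;
  bool ivOnLhs;
  long other;
};

// Evaluates the comparison for the first iteration by substituting the
// start value for the IV, normalizing so that the IV is on the LHS.
bool holdsOnFirstIteration(const IVCompare &c, long start) {
  Pred p = c.ivOnLhs ? c.pred : swapped(c.pred);
  return evaluate(p, start, c.other);
}
```

With a start value of 0, both `iv < 10` (`{Pred::LT, true, 10}`) and its commuted form `10 > iv` (`{Pred::GT, false, 10}`) evaluate to true.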
The changed PhaseOrdering test previously performed peeling to make the
loads dereferenceable -- as a side effect, this also reduced the exit
count by one, avoiding the awkward <= MAX case.
Now we know up-front that the loads are dereferenceable and can be simply
hoisted. As such, we retain the original exit count and now have to
handle it by widening the exit count calculation to i128. This is a
regression, but at least it preserves the vectorization, which was the
original goal. I'm not sure what else can be done about that test.
This attempts to fix a regression from #98025, where the new order of
reduction nodes causes later passes to not be able to produce as nice
shuffles. The issue boils down to picking an order of [0 1 3 2] for
loaded v4i8 values, which meant later parts could not find a simpler
ordering for the shuffles given the legal nodes available in AArch64. If
instead we make sure they are ordered [0 1 2 3] then everything can fall
into place.
In order to produce a better order that is more likely to work in more
cases, this patch takes the existing clustered loads and sorts the base
pointers if there is an order between them, i.e., if `V2 == gep (V1, X)`
then V1 is sorted before V2.
Summary:
This pass expands variadic functions into non-variadic function calls
according to the target ABI. Currently, this is used as the lowering for
the NVPTX and AMDGPU targets.
This pass is currently only run late in the target's backend. However,
during LTO we want to run it before the inliner pass so that the
expanded functions can be inlined using standard heuristics. This pass
is a no-op for unsupported targets, so this won't apply to any code that
isn't already using it.
Workaround until I can get #96884 fixed properly - when trying to find identity sequences, peek through any bitcasts to see if the values all came from the same source. We don't run CSE frequently enough to merge all the bitcasts that we end up with.
Constmerge can fold switch jump tables, possibly making functions
identical again. It can help mergefunc.
On the other hand, the opposite seems unlikely.
Fixes https://github.com/llvm/llvm-project/issues/92201.
This patch moves branch condition creation to enter the scalar epilogue
loop to VPlan. Modeling the branch in the middle block also requires
modeling the successor blocks. This is done using the recently
introduced VPIRBasicBlock.
Note that the middle.block is still created as part of the skeleton and
then patched in during VPlan execution. Unfortunately the skeleton needs
to create the middle.block early on, as it is also used for induction
resume value creation and is also needed to properly update the
dominator tree during skeleton creation.
After this patch lands, I plan to move induction resume value and phi
node creation in the scalar preheader to VPlan. Once that is done, we
should be able to create the middle.block in VPlan directly.
This is a re-worked version based on the earlier
https://reviews.llvm.org/D150398 and the main change is the use of
VPIRBasicBlock.
Depends on https://github.com/llvm/llvm-project/pull/92525
PR: https://github.com/llvm/llvm-project/pull/92651
We already handle blendv(x,y,bitcast(sext(m))) -> select(m,x,y) cases, but this adds support for peeking through one-use shuffles as well. VectorCombine should already have canonicalized the IR to shuffle(bitcast(...)) for us.
The particular use case is where we have split generic 256/512-bit code to use target-specific blendv intrinsics (e.g. AVX1 spoofing AVX2 256-bit ops).
Fixes #58895
It's expected that the sequence `return X > 0.0 ? X : -X`, compiled with
-Ofast, produces the fabs intrinsic. However, at this point, LLVM is
unable to do so.
The above sequence goes through the following transformations during the
pass pipeline:
1) The SROA pass generates the phi node. Here, it does not infer the
fast-math flags on the phi node, unlike what the clang frontend typically
does.
2) The phi node eventually gets translated into a select instruction.
Because of the missing no-signed-zeros (nsz) fast-math flag on the select
instruction, the InstCombine pass fails to fold the sequence into the fabs
intrinsic.
This patch, as a part of SROA, tries to propagate the nsz fast-math flag
to the phi node using the function attribute, enabling this folding.
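To illustrate why nsz is required for this fold, the sketch below compares the select form with `std::fabs` under default (non-fast-math) semantics. The helper names are hypothetical. The two only disagree for an input of +0.0, where the select form returns -0.0.

```cpp
#include <cmath>

// The source-level pattern from the description above.
double selectAbs(double x) { return x > 0.0 ? x : -x; }

// True when the select form and fabs agree, including the sign of zero.
bool agreesWithFabs(double x) {
  double a = selectAbs(x);
  double b = std::fabs(x);
  return a == b && std::signbit(a) == std::signbit(b);
}
```

`agreesWithFabs` holds for inputs such as 2.5, -2.5 and -0.0, but not for +0.0: `0.0 > 0.0` is false, so the select form yields `-0.0` while fabs yields `+0.0`. With nsz that difference may be ignored, which is what makes the fold legal.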
Closes #51601
Co-authored-by: Sushant Gokhale <sgokhale@nvidia.com>