As a follow-on to 113686, this breaks the recursion between phi nodes
that have p1 = phi(x, p2) and p2 = phi(y, p1). The knownFPClass can be
calculated from the classes of p1 and p2.
Relands #114356. Compared to the last version, this patch only merges
poison-generating/nsz flags from the select to fix LV regression in
`llvm/test/Transforms/PhaseOrdering/AArch64/predicated-reduction.ll`.
Given a recursive phi with select:
%p = phi [ 0, entry ], [ %sel, loop]
%sel = select %c, %other, %p
The fp state can be calculated using the knowledge that the select/phi
pair can only be the initial state (0 here) or from %other. This adds a
short-cut into computeKnownFPClass for PHI to detect that the select is
recursive back to the phi, and if so use the state from the other
operand.
This helps to address a regression from #83200.
This reverts commit 7f2e937469a8cec3fe977bf41ad2dfb9b4ce648a as it causes
regressions in the tests it modifies, and undoes what was added in #100653
(which itself was a fix for a previous regression).
Enables initial non-power-of-2 support (but still requires number of
elements, forming whole registers) for reductions.
Enables extra vectorization for
MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and
CFP2017rate/526.blender_r (checked for SSE2)
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/112361
Update VPInterleaveRecipe to always use the pointer to member 0 as
pointer argument. This in many cases helps to remove unneeded index
adjustments and simplifies VPInterleaveRecipe::execute.
In some rare cases, the address of member 0 does not dominate the insert
position of the interleave group. In those cases a PtrAdd VPInstruction
is emitted to compute the address of member 0 based on the address of
the insert position. Alternatively we could hoist the recipe computing
the address of member 0.
This patch replaces all dominated uses of condition with true/false to
improve context-sensitive optimizations. It eliminates a bunch of
branches in llvm-opt-benchmark.
As a side effect, it may introduce new phi nodes in some corner cases.
See the following case:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
br i1 %cond, label %bb1, label %bb2
bb1:
br i1 %cmp, label %if.then, label %if.else
if.then:
br %bb2
if.else:
br %bb2
bb2:
%res = phi i1 [%cmp, %entry], [%cmp, %if.then], [%cmp, %if.else]
ret i1 %res
}
```
It will be simplified into:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
br i1 %cond, label %bb1, label %bb2
bb1:
br i1 %cmp, label %if.then, label %if.else
if.then:
br %bb2
if.else:
br %bb2
bb2:
%res = phi i1 [%cmp, %entry], [true, %if.then], [false, %if.else]
ret i1 %res
}
```
I am planning to fix this in late pipeline/CGP since this problem exists
before the patch.
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
with "[Vectorize] Fix warnings"
It introduced compiler crashes, see #104144.
This reverts commit 69332bb8995aef60d830406de12cb79a50390261 and
351f4a5593f1ef507708ec5eeca165b20add3340.
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/104144
The idea behind this canonicalization is that it allows us to handle less
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.
However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.
The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.
For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
MustExec has special logic to determine whether the first loop iteration
will always be executed, by simplifying the IV comparison with the start
value. Currently, this code assumes that the IV is on the LHS of the
comparison, but this is not guaranteed. Make sure it handles the
commuted variant as well.
The changed PhaseOrdering test previously performed peeling to make the
loads dereferenceable -- as a side effect, this also reduced the exit
count by one, avoiding the awkward <= MAX case.
Now we know up-front the the loads are dereferenceable and can be simply
hoisted. As such, we retain the original exit count and now have to
handle it by widening the exit count calculation to i128. This is a
regression, but at least it preserves the vectorization, which was the
original goal. I'm not sure what else can be done about that test.
This attempts to fix a regression from #98025, where the new order of
reduction nodes causes later passes to not be able to produce as nice
shuffles. The issue boils down to picking an order of [0 1 3 2] for
loaded v4i8 values, which meant later parts could not find a simpler
ordering for the shuffles given the legal nodes available in AArch64. If
instead we make sure they are ordered [0 1 2 3] then everything can fall
into place.
In order to produce a better order that is more likely to work in more
cases, this patch takes the existing clustered loads and sort the base
pointers if there is an order between them. i.e if `V2 == gep (V1, X)`
then V1 is sorted before V2.
This patch moves branch condition creation to enter the scalar epilogue
loop to VPlan. Modeling the branch in the middle block also requires
modeling the successor blocks. This is done using the recently
introduced VPIRBasicBlock.
Note that the middle.block is still created as part of the skeleton and
then patched in during VPlan execution. Unfortunately the skeleton needs
to create the middle.block early on, as it is also used for induction
resume value creation and is also needed to properly update the
dominator tree during skeleton creation.
After this patch lands, I plan to move induction resume value and phi
node creation in the scalar preheader to VPlan. Once that is done, we
should be able to create the middle.block in VPlan directly.
This is a re-worked version based on the earlier
https://reviews.llvm.org/D150398 and the main change is the use of
VPIRBasicBlock.
Depends on https://github.com/llvm/llvm-project/pull/92525
PR: https://github.com/llvm/llvm-project/pull/92651
This is another relatively small adjustment to shuffleToIdentity, which
has had a few knock-one effects to need a few more changes. It attempts
to detect free concats, that will be legalized to multiple vector
operations. For example if the lanes are '[a[0], a[1], b[0], b[1]]' and
a and b are v2f64 under aarch64.
In order to do this:
- isFreeConcat detects whether the input has piece-wise identities from
multiple inputs that can become a concat.
- A tree of concat shuffles is created to concatenate the input values
into a single vector. This is a little different to most other inputs as
there are created from multiple values that are being combined together,
and we cannot rely on the Lane0 insert location always being valid.
- The insert location is changed to the original location instead of
updating per item, which ensure it is valid due to the order that we
visit and create items.
Use VPIRBasicBlock to wrap the middle block and implement patching up
branches in predecessors in VPIRBasicBlock::execute. The IR middle block
is only created after skeleton creation. Initially a regular
VPBasicBlock is created, which will later be replaced by a
VPIRBasicBlock once the middle IR basic block has been created.
Note that this slightly changes the order of instructions created in the
middle block; code generated by recipe execution in the middle block
will now be inserted before the terminator (and in between the compare
to used by the terminator). The original order will be restored in
https://github.com/llvm/llvm-project/pull/92651.
PR: https://github.com/llvm/llvm-project/pull/95816
…f weights" #95136
Reverts #95060, and relands #86609, with the unintended code generation
changes addressed.
This patch implements the changes to LLVM IR discussed in
https://discourse.llvm.org/t/rfc-update-branch-weights-metadata-to-allow-tracking-branch-weight-origins/75032
In this patch, we add an optional field to MD_prof meatdata nodes for
branch weights, which can be used to distinguish weights added from
llvm.expect* intrinsics from those added via other methods, e.g. from
profiles or inserted by the compiler.
One of the major motivations, is for use with MisExpect diagnostics,
which need to know if branch_weight metadata originates from an
llvm.expect intrinsic. Without that information, we end up checking
branch weights multiple times in the case if ThinLTO + SampleProfiling,
leading to some inaccuracy in how we report MisExpect related
diagnostics to users.
Since we change the format of MD_prof metadata in a fundamental way, we
need to update code handling branch weights in a number of places.
We also update the lang ref for branch weights to reflect the change.
This patch implements the changes to LLVM IR discussed in
https://discourse.llvm.org/t/rfc-update-branch-weights-metadata-to-allow-tracking-branch-weight-origins/75032
In this patch, we add an optional field to MD_prof metadata nodes for
branch weights, which can be used to distinguish weights added from
`llvm.expect*` intrinsics from those added via other methods, e.g.
from profiles or inserted by the compiler.
One of the major motivations, is for use with MisExpect diagnostics,
which need to know if branch_weight metadata originates from an
llvm.expect intrinsic. Without that information, we end up checking
branch weights multiple times in the case if ThinLTO + SampleProfiling,
leading to some inaccuracy in how we report MisExpect related
diagnostics to users.
Since we change the format of MD_prof metadata in a fundamental way, we
need to update code handling branch weights in a number of places.
We also update the lang ref for branch weights to reflect the change.
This patch simplifies `sdiv` to `udiv` by preserving the `nsw` flag for
`(X | Op01C) + Op1C --> X + (Op01C + Op1C)` if the sum of `Op01C` and
`Op1C` will not overflow, and preserves the `nuw` flag unconditionally.
Alive2 Proofs (provided by @nikic): https://alive2.llvm.org/ce/z/nrdCZT,
https://alive2.llvm.org/ce/z/YnJHnH
Use getStartAndEndForAccess to compute the start and end of both src
and sink (factored out to helper in bce3680f45b57f). If they do not
overlap (i.e. SrcEnd <= SinkStart || SinkEnd <= SrcStart), there is no
dependence, regardless of stride.
PR: https://github.com/llvm/llvm-project/pull/92307
This patch adds a basic version of a combine that attempts to remove
shuffles that when combined simplify away to an identity shuffle. For
example:
%ab = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 3,
i32 2, i32 1, i32 0>
%at = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 7,
i32 6, i32 5, i32 4>
%abt = fneg <4 x half> %at
%abb = fneg <4 x half> %ab
%r = shufflevector <4 x half> %abt, <4 x half> %abb, <8 x i32> <i32 7,
i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
By looking through the shuffles and fneg, it can be simplified to:
%r = fneg <8 x half> %a
The code tracks each lane starting from the original shuffle, keeping a
track of a vector of {src, idx}. As we propagate up through the
instructions we will either look through intermediate instructions
(binops and unops) or see a collections of lanes that all have the same
src and incrementing idx (an identity). We can also see a single value
with identical lanes, which we can treat like a splat.
Only the basic version is added here, handling identities, splats,
binops and unops. In follow-up patches other instructions can be added
such as constants, intrinsics, cmp/sel and zext/sext/trunc.
This patch adds loadCSE support to simplifyLoopAfterUnroll. It is based
on EarlyCSE's implementation using ScopeHashTable and is using SCEV for
accessed pointers to check to find redundant loads after unrolling.
This applies to the late unroll pass only, for full unrolling those
redundant loads will be cleaned up by the regular pipeline.
The current approach constructs MSSA on-demand per-loop, but there is
still small but notable compile-time impact:
stage1-O3 +0.04%
stage1-ReleaseThinLTO +0.06%
stage1-ReleaseLTO-g +0.05%
stage1-O0-g +0.02%
stage2-O3 +0.09%
stage2-O0-g +0.04%
stage2-clang +0.02%
https://llvm-compile-time-tracker.com/compare.php?from=c089fa5a729e217d0c0d4647656386dac1a1b135&to=ec7c0f27cb5c12b600d9adfc8543d131765ec7be&stat=instructions:u
This benefits some workloads with runtime-unrolling disabled,
where users use pragmas to force unrolling, as well as with
runtime unrolling enabled.
On SPEC/MultiSource, this removes a number of loads after unrolling
on AArch64 with runtime unrolling enabled.
```
External/S...te/526.blender_r/526.blender_r 96
MultiSourc...rks/mediabench/gsm/toast/toast 39
SingleSource/Benchmarks/Misc/ffbench 4
External/SPEC/CINT2006/403.gcc/403.gcc 18
MultiSourc.../Applications/JM/ldecod/ldecod 4
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg 6
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG 9
MultiSourc...e/Applications/ClamAV/clamscan 4
MultiSourc.../MallocBench/espresso/espresso 3
MultiSourc...dence-flt/LinearDependence-flt 2
MultiSourc...ch/office-ispell/office-ispell 4
MultiSourc...ch/consumer-jpeg/consumer-jpeg 6
MultiSourc...ench/security-sha/security-sha 11
MultiSourc...chmarks/McCat/04-bisect/bisect 3
SingleSour...tTests/2020-01-06-coverage-009 12
MultiSourc...ench/telecomm-gsm/telecomm-gsm 39
MultiSourc...lds-flt/CrossingThresholds-flt 24
MultiSourc...dence-dbl/LinearDependence-dbl 2
External/S...C/CINT2006/445.gobmk/445.gobmk 6
MultiSourc...enchmarks/mafft/pairlocalalign 53
External/S...31.deepsjeng_r/531.deepsjeng_r 3
External/S...rate/510.parest_r/510.parest_r 58
External/S...NT2006/464.h264ref/464.h264ref 29
External/S...NT2017rate/502.gcc_r/502.gcc_r 45
External/S...C/CINT2006/456.hmmer/456.hmmer 6
External/S...te/538.imagick_r/538.imagick_r 18
External/S.../CFP2006/447.dealII/447.dealII 4
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE 12
External/S...2017rate/525.x264_r/525.x264_r 36
MultiSourc...Benchmarks/7zip/7zip-benchmark 33
MultiSourc...hmarks/ASC_Sequoia/AMGmk/AMGmk 2
MultiSourc...chmarks/VersaBench/8b10b/8b10b 1
MultiSourc.../Applications/JM/lencod/lencod 116
MultiSourc...lds-dbl/CrossingThresholds-dbl 24
MultiSource/Benchmarks/McCat/05-eks/eks 15
```
PR: https://github.com/llvm/llvm-project/pull/83860
Another step towards cleaning up shuffles that have been split, often across bitcasts between SSE intrinsic.
Strip shuffles entirely if we fold to an identity shuffle.
Folding a `select` into a floating point binary operators can only be
done if the result is preserved for both case. In particular, if the
other operand of the `select` can be a NaN, then the transformation
won't preserve the result value.
We are replacing a narrow IV increment with a wider one. If the original
(narrow) increment did not wrap, the wider one should not wrap either.
Set the flags to be the union of both wide increment and original
increment; this ensures we preserve flags SCEV could infer for the wider
increment.
Fixes https://github.com/llvm/llvm-project/issues/71517.
A set of microbenchmarks (https://github.com/llvm/llvm-test-suite/pull/26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
As suggested as follow-up in
https://github.com/llvm/llvm-project/pull/72164, manage inbounds via
VPRecipeWithIRFlags.
Note that in some cases we can now preserve inbounds in a few more
cases.
Move vector pointer generation to a separate VPVectorPointerRecipe.
This untangles address computation from the memory recipes future
and is also needed to enable explicit unrolling in VPlan.
https://github.com/llvm/llvm-project/pull/72164
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.
One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.
Differential Revision: https://reviews.llvm.org/D141060