This patch reverts rGae739aefd7473517d3f08b5c8d08a66c7f469198 to address performance regressions reported by our [CI](https://github.com/dtcxzyw/llvm-ci/issues/137) after rG2ec1d0f427c7822540352c0c14d057e7bfe4f77b.
For example:
```
define ptr @const_gep_chain(ptr %p, i64 %a) {
%p1 = getelementptr inbounds i8, ptr %p, i64 %a
%p2 = getelementptr inbounds i8, ptr %p1, i64 1
%p3 = getelementptr inbounds i8, ptr %p2, i64 2
%p4 = getelementptr inbounds i8, ptr %p3, i64 3
ret ptr %p4
}
```
The last three GEPs will not be folded since rG2ec1d0f427c7822540352c0c14d057e7bfe4f77b.
I think it is appropriate to remove this code because there is no compile-time regression reported in our benchmarks.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D149240
This is folloup to b5595836, which missed the
Replacemen variable.
Before b5595836 the code assumed that alloca
ptrs are not tagged so tagging is implemented
as simple OR.
So this patch completes support of tagged SP
by passing untagged alloca pointers into
tagPointer.
Otherwise we could produce `br <2x i1>` which are of course not legal.
```
Branch condition is not 'i1' type!
br <2 x i1> %cond.fr1, label %entry.split.us, label %entry.split
%cond.fr1 = freeze <2 x i1> %cond
LLVM ERROR: Broken module found, compilation aborted!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/vchuravy/builds/llvm/bin/opt -passes=simple-loop-unswitch<nontrivial> -S
```
Fixes change introduced by https://reviews.llvm.org/D138526
Reviewed By: caojoshua
Differential Revision: https://reviews.llvm.org/D149560
The comment is stale, as UserVF is handled before selectVectorizationFactor
is called. Clarify the comment by remove the mention of UserVF.
Suggested as independent improvement in D143938.
LVP operates on the loop it stores in TheLoop. Use it instead of the
argument, to be in line with other member functions.
Suggested as independent improvement in D143938.
The function checks if there's a plan with the specified VF. Update the
comment to match the implementation.
Pointed out as independent improvement in D143938.
Move calls of collect* helpers closer to where the cost-model is used.
Should help simplifying D142669 & D142670.
Differential Revision: https://reviews.llvm.org/D142674
The old LoopUnswitch pass unswitched selects, but the changes were never
ported to the new SimpleLoopUnswitch.
We unswitch by turning:
```
S = select %cond, %a, %b
```
into:
```
head:
br %cond, label %then, label %tail
then:
br label %tail
tail:
S = phi [ %a, %then ], [ %b, %head ]
```
Unswitch selects are always nontrivial, since the successors do not exit
the loop and the loop body always needs to be cloned.
Differential Revision: https://reviews.llvm.org/D138526
Co-authored-by: Sergey Kachkov <sergey.kachkov@syntacore.com>
Fixes assert with pointers with different address spaces. We
could keep looking through addrspacecast, but it would require
checking for null handling of the access address space.
Fixes#62384
c55fffe replaced fcmp with fpclass.
```
declare i1 @llvm.is.fpclass(<fptype> <op>, i32 <test>)
declare <N x i1> @llvm.is.fpclass(<vector-fptype> <op>, i32 <test>)
```
Perfect fix will require checking bits of <op> corresponding to <test>
argument. For now just propagate shadow without reporting before
intrinsic. Still existing handling of fcmp is also simple OR, so it's
not making it worse.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D149491
Switch to the just updated versions of the API in tcmalloc that change
the name of the hot cold paramter to a reserved identifier __hot_cold_t.
This was based on feedback from Richard Smith, as I also need to add
some follow-on handling to clang so they are annotated properly.
Differential Revision: https://reviews.llvm.org/D149475
Part 2 of https://reviews.llvm.org/D147456
Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular:
- Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles.
- Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped.
**The input of the algorithm:**
- IR locations: the anchor is the callee name of direct callsite.
- Profile locations: the anchor is the call target name for `BodySample`s or inlinee's profile name for `CallsiteSamples`.
The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor.
For example: say location `1.2(foo)` refers to a callsite at `1.2` with callee name `foo` and `1.3` refers to a non-directcall location `1.3`.
```
// The current build source code:
int main() {
1. ...
2. foo();
3. ...
4 ...
5. ...
6. bar();
7. ...
}
```
IR locations are populated and simplified as: `[1, 2(foo), 3, 5, 6(bar), 7]`.
```
; The "stale" profile:
main:350:1
1: 1
2: 3
3: 100 foo:100
4: 2
7: 2
8: 200 bar:200
9: 30
```
Profile locations are populated and simplified as `[1, 2, 3(foo), 4, 7, 8(bar), 9]`
**Matching heuristic:**
- Match all the anchors in lexical order first.
- Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor.
So the example above is matched like:
```
[1, 2(foo), 3, 5, 6(bar), 7]
| | | | | |
[1, 2, 3(foo), 4, 7, 8(bar), 9]
```
3 -> 4 matching is based on anchor `foo`, 5 -> 7 matching is based on anchor `bar`.
The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9].
For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into `IRToProfileLocationMap`(see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples(`distributeIRToProfileLocationMap`)
**Clang-self build benchmark: **
Current build version: clang-10
The profiled version: clang-9
Results compared to a refresh profile(collected profile on clang-10) and to be fair, we invalidated new functions' profiles(both refresh and stale profile use the same profile list).
1) Regression to using refresh profile with this off : -3.93%
2) Regression to using refresh profile with this on : -1.1%
So this algorithm can recover ~72% of the regression.
**Internal(Meta) large-scale services.**
we saw one real instance of a 3 week stale profile., it delivered a ~1.8% win.
**Notes or future work:**
- Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO since pseudo-probe and line-number are shared the LineLocation structure.
- The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of regression(with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future.
- Profile call target name are only available when the call is hit by samples, the missing anchor might mislead the matching, this can be mitigated in llvm-profgen to generate the call target for the zero samples.
- This doesn't handle function name mismatch, we plan to solve it in future.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D147545
AutoFDO/CSSPGO often has to deal with stale profiles collected on binaries built from several revisions behind release. It’s likely to get incorrect profile annotations using the stale profile, which results in unstable or low performing binaries. Currently for source location based profile, once a code change causes a profile mismatch, all the locations afterward are mismatched, the affected samples or inlining info are lost. If we can provide a matching framework to reuse parts of the mismatched profile - aka incremental PGO, it will make PGO more stable, also increase the optimization coverage and boost the performance of binary.
This patch is the part 1 of stale profile matching, summary of the implementation:
- Added a structure for the matching result:`LocToLocMap`, which is a location to location map meaning the location of current build is matched to the location of the previous build(to be used to query the “stale” profile).
- In order to use the matching results for sample query, we need to pass them to all the location queries. For code cleanliness, we added a new pointer field(`IRToProfileLocationMap`) to `FunctionSamples`.
- Added a wrapper(`mapIRLocToProfileLoc`) for the query to the location, the location from input IR will be remapped to the matched profile location.
- Added a new switch `--salvage-stale-profile`.
- Some refactoring for the staleness detection.
Test case is in part 2 with the matching algorithm.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147456
This makes it explicit that these functions work with instructions, and avoids
calling them if the operand is not an instruction.
Differential Revision: https://reviews.llvm.org/D149465
"convergent" is documented as meaning that the call cannot be made
control-dependent on more values, but in practice we also require that
it cannot be made control-dependent on fewer values, e.g. it cannot be
hoisted out of the body of an "if" statement.
In code like this, if we allow CSE to combine the two calls:
x = convergent_call();
if (cond) {
y = convergent_call();
use y;
}
then we get this:
x = convergent_call();
if (cond) {
use x;
}
This is conceptually equivalent to moving the second call out of the
body of the "if", up to the location of the first call, so it should be
disallowed.
Differential Revision: https://reviews.llvm.org/D149348
D37076 makes LICM duplicate instructions into exit blocks if the
instruction is free. For GEPs, the motivation appears to be that
this allows the GEP to be folded into addressing modes, while
non-foldable users outside the loop might prevent this. TBH I don't
think LICM is the place to do this (why doesn't CGP apply this
heuristic itself?) but at least I understand the motivation.
However, the transform is also applied to all other "free"
instructions, which are just that (removed during lowering and not
"folded" in some way). For such instructions, this transform seems
somewhere between useless, counter-productive (undoing CSE/GVN) and
actively incorrect. For example, this transform can duplicate freeze
instructions, which is illegal.
This patch limits the transform to just foldable GEPs, though we
might want to drop it from LICM entirely as a followup.
This is a small compile-time improvement, because querying TTI cost
model for every single instruction is expensive.
Differential Revision: https://reviews.llvm.org/D149136
The entry to the plan is the preheader of the vector loop and
guaranteed to be a VPBasicBlock. Make sure this is the case by
adjusting the type.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D149005
We already invalidate each individual instruction for which LCSSA
is formed in formLCSSAForInstructions(), so I don't see a reason
why we would need to invalidate the entire loop on top of that.
I believe we also no longer need the instruction-level invalidation
now that SCEV looks through LCSSA phis, but I'll leave that for a
separate patch, as it's less obvious.
Differential Revision: https://reviews.llvm.org/D149331
D129370 introduced the idea that hoisting could skip over non-matching
instructions and continue to look for matching (hoistable) instructions,
but certain types of mismatch still aborted the whole hoisting attempt.
Fix this by splitting out some of the instruction matching checks into a
helper function.
Also forbid hoisting allocas past stacksave/stackrestore, completing the
fix started in D133730, to avoid regressing tests.
Differential Revision: https://reviews.llvm.org/D149365
This makes the logic for referenced globals reusable for import criteria
that don't use thresholds - in fact, we currently didn't consider any
thresholds when importing.
Differential Revision: https://reviews.llvm.org/D149298
Following the change in shufflevector semantics,
poison will be used to represent undefined elements in shufflevector masks.
Differential Revision: https://reviews.llvm.org/D149256
If two nodes share the same value, which is replaced in one of the
nodes, need to automatically replace same value in all nodes. Btter to
use WeakTrackingVH for this to fix compiler crash.
This commit cleans up some parts of the PGO instrumentation. Most
importantly, it removes a template parameter shadowing of a class name
that could lead to confusion.
Reviewed By: gysit
Differential Revision: https://reviews.llvm.org/D149324
This commit moves the CFGMST.h file into the include directory. The
implemented algorithm is can be helpful for downstream projects that
want to use the PGO data in a non-standard way.
Reviewed By: gysit
Differential Revision: https://reviews.llvm.org/D149336