Problem 1: There is a typo that reassigns `BlockWorkList` to
`EHPadWorkList` when attempting to remove `RemBB` from the work lists.
Problem 2: `Chain->UnscheduledPredecessors == 0` is an incorrect way to
check whether `RemBB` is enqueued. The root cause is the postponed
removal of already scheduled blocks from the `WorkList` in
`selectBestCandidateBlock`. The bug occurs in the following scenario:
* `FunctionChain` is being processed with non-zero
`UnscheduledPredecessors`
* Block `B'` is added to the `BlockWorkList`
* Block `B'` is chosen as the best successor (`selectBestSuccessor`) for
some other block and added to the `Chain`
* Block `B'` is removed by the tail duplicator.
`RemovalCallback` erroneously does not erase `B'` from `BlockWorkList`,
because the `UnscheduledPredecessors` value of `FunctionChain` is not zero
(and it is allowed to be non-zero).
The proposed solution is to always clean up the work lists when a block is
deleted by the tail duplicator.
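A minimal sketch of that cleanup, reusing the names above (`BlockWorkList`, `EHPadWorkList`, `RemBB`); the callback plumbing in MachineBlockPlacement.cpp is omitted, and the `llvm::erase` spelling is only an assumption about how the erasure is written:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"

using namespace llvm;

// Sketch: unconditionally drop the deleted block from both work lists; do not
// gate this on Chain->UnscheduledPredecessors, and make sure the second erase
// really targets EHPadWorkList (the original typo hit BlockWorkList twice).
static void eraseFromWorkLists(MachineBasicBlock *RemBB,
                               SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
                               SmallVectorImpl<MachineBasicBlock *> &EHPadWorkList) {
  llvm::erase(BlockWorkList, RemBB);
  llvm::erase(EHPadWorkList, RemBB);
}
```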
MachineBlockPlacement has quadratic runtime in the number of
predecessors: in some situations, for an edge, all predecessors of the
successor are considered.
Limit the number of considered predecessors to bound compile time for
large functions.
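A simplified stand-in for the bounded scan (the real check lives in hasBetterLayoutPredecessor, and the limit is exposed via a command-line option whose exact name is not spelled out here):

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"

using namespace llvm;

// Sketch: stop scanning a successor's predecessors once the limit is reached,
// trading a little layout precision for bounded compile time.
static bool scanPredecessorsBounded(const MachineBasicBlock &Succ,
                                    unsigned MaxPreds) {
  unsigned Seen = 0;
  for (const MachineBasicBlock *Pred : Succ.predecessors()) {
    if (++Seen > MaxPreds)
      return true; // Give up: conservatively assume a better pred exists.
    // ... real profitability checks on Pred would go here ...
    (void)Pred;
  }
  return false;
}
```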
Pull Request: https://github.com/llvm/llvm-project/pull/142584
https://github.com/llvm/llvm-project/pull/109711 disables
`buildCFGChains()` when `-apply-ext-tsp-for-size` is used to improve
codesize. Tail merging can change the layout and normally requires
`buildCFGChains()` to be called again, but we want to prevent this when
optimizing for codesize. We saw a slight size improvement on large
binaries with this change. If `-apply-ext-tsp-for-size` is not used,
this should be an NFC.
* Do not use profile data when flipping a branch condition when
optimizing for size. This should improve outlining and ICF due to more
uniform instruction sequences.
* Refactor `optimizeBranches()` to use early `continue`s (see the sketch below)
* Use the correct debug location for `insertBranch()`
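A hedged sketch of the early-continue shape referenced above; the names and the analyzeBranch-based skeleton are illustrative, not the actual `optimizeBranches()` body:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/TargetInstrInfo.h"

using namespace llvm;

// Guard clauses replace the old nested conditionals so each block's handling
// reads top to bottom.
static void optimizeBranchesSketch(MachineFunction &MF,
                                   const TargetInstrInfo &TII) {
  for (MachineBasicBlock &MBB : MF) {
    MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
    SmallVector<MachineOperand, 4> Cond;
    if (TII.analyzeBranch(MBB, TBB, FBB, Cond))
      continue; // Terminators we cannot analyze.
    if (Cond.empty() || !TBB || !FBB)
      continue; // Nothing conditional to flip.
    // ... decide whether to flip the condition (ignoring profile data when
    // optimizing for size) and re-insert the branch with the terminator's
    // debug location ...
  }
}
```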
Rather than unconditionally running `F->verify()` when asserts are enabled,
run machine IR verification only in LIT tests.
Swap `CHECK-PERF` and `CHECK-SIZE` in `code_placement_ext_tsp_large.ll`.
Remove `={0,1,true,false}` from flags in tests.
This is an implementation of a new "size-aware" machine block placement. The
idea is to reorder blocks so that the number of fall-through jumps is
maximized.
Observe that profile data is ignored for this optimization, and it is applied
only for instances with hasOptSize()=true.
This strategy has two benefits:
(i) it eliminates jump instructions, which results in smaller text size;
(ii) we avoid using profile data while reordering blocks, which yields more
"uniform" functions, thus helping ICF and the machine outliner/merger.
For large (mobile) apps, the size benefits of (i) and (ii) are roughly the
same; combined, they provide up to 0.5% uncompressed and up to 1% compressed
size savings on top of the current solution.
The optimization is turned off by default.
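A rough sketch of how the gating might look, assuming the `-apply-ext-tsp-for-size` flag mentioned earlier in this log controls the feature; the variable name and the exact location of the check are assumptions:

```cpp
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Assumed spelling of the flag; the real option lives with the pass.
static cl::opt<bool> ApplyExtTspForSize(
    "apply-ext-tsp-for-size", cl::Hidden, cl::init(false),
    cl::desc("Use size-aware ext-tsp block placement"));

// Off by default; only functions compiled for size opt in, and profile data
// is deliberately not consulted by this path.
static bool shouldUseSizeAwarePlacement(const MachineFunction &MF) {
  return ApplyExtTspForSize && MF.getFunction().hasOptSize();
}
```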
The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove
it.
Use `SmallVector` since the arrays are expected to be small, as they
represent MBBs.
This produces far too much terminal output, particularly for the
instruction reduction. Since it doesn't consider the liveness of
the instructions it's deleting, it produces quite a lot of verifier
errors.
Machine block placement might remove nodes from the function but does
not update the dominator tree accordingly. Instead of renumbering (which
might crash due to accessing removed blocks), set the domtree to null to
make it clear that it is invalid at this point.
Fixup of #102107.
The dominator tree gained an optimization to use block numbers instead
of a DenseMap to store blocks. Given that machine basic blocks already
have numbers, expose these via appropriate GraphTraits. For debugging,
block number epochs are added to MachineFunction -- this greatly helps
in finding uses of block numbers after RenumberBlocks().
In a few cases where dominator trees are preserved across renumberings,
the dominator tree is updated to use the new numbers.
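An illustrative sketch (not the DominatorTree code itself) of the general idea: per-block data indexed by the existing block numbers instead of a pointer-keyed DenseMap, which is exactly the kind of index that goes stale after RenumberBlocks():

```cpp
#include "llvm/CodeGen/MachineFunction.h"
#include <vector>

using namespace llvm;

// Build a table indexed by MBB number; MF.getNumBlockIDs() bounds the numbers
// currently in use, and MBB.getNumber() gives each block's index.
static std::vector<unsigned> countSuccessorsByNumber(const MachineFunction &MF) {
  std::vector<unsigned> SuccCount(MF.getNumBlockIDs(), 0);
  for (const MachineBasicBlock &MBB : MF)
    SuccCount[MBB.getNumber()] = MBB.succ_size();
  return SuccCount;
}
```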
The extTSP-based basic block layout algorithm improves the performance
of the generated code, but unfortunately it has a super-linear time
complexity. This leads to extremely long compilation times for certain
relatively rare kinds of autogenerated code.
This patch adds an `-mllvm` flag to optionally restrict extTSP only to
functions smaller than a specified threshold. While commit
bcdc0477319a26fd8dcdde5ace3bdd6743599f44 added a knob to limit the
maximum chain size, it's still possible that for certain huge functions
the number of chains is very large, leading to a quadratic behaviour in
ExtTSPImpl::mergeChainPairs.
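A hypothetical sketch of such a knob; the actual flag name and threshold semantics in the patch may differ:

```cpp
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical spelling of the new -mllvm option.
static cl::opt<unsigned> ExtTspMaxFunctionSize(
    "ext-tsp-max-function-size", cl::Hidden, cl::init(0),
    cl::desc("Only run extTSP on functions with fewer blocks than this "
             "(0 means no limit)"));

// Huge functions fall back to the default layout, avoiding the quadratic
// behaviour of ExtTSPImpl::mergeChainPairs when there are many chains.
static bool useExtTspForFunction(const MachineFunction &MF) {
  return ExtTspMaxFunctionSize == 0 || MF.size() < ExtTspMaxFunctionSize;
}
```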
PR #91843 changed the algorithm used to find the next unplaced block so
that it iterates through the blocks in BlockFilter instead of iterating
through the blocks in the function and checking if they are in the block
filter. Unfortunately this sometimes results in a different block
ordering being chosen, as the order of blocks in BlockFilter comes from
the order in MachineLoopInfo, and in some cases this differs from the
order they are in the function. This can also give an end result that
has worse performance.
Fix this by making collectLoopBlockSet place blocks in its output in the
order that they are in the function.
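A simplified sketch of the fix, with illustrative names rather than the exact ones in MachineBlockPlacement.cpp: membership still comes from the loop, but the emission order comes from the function:

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineLoopInfo.h"

using namespace llvm;

static SmallVector<MachineBasicBlock *, 16>
collectLoopBlocksInFunctionOrder(const MachineLoop &L, MachineFunction &MF) {
  // Membership test built from the loop's block list.
  SmallPtrSet<const MachineBasicBlock *, 16> InLoop(L.block_begin(),
                                                    L.block_end());
  // Emit blocks in function order, independent of MachineLoopInfo order.
  SmallVector<MachineBasicBlock *, 16> Ordered;
  for (MachineBasicBlock &MBB : MF)
    if (InLoop.count(&MBB))
      Ordered.push_back(&MBB);
  return Ordered;
}
```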
- Add `MachineBlockFrequencyAnalysis`.
- Add `MachineBlockFrequencyPrinterPass`.
- Use `MachineBlockFrequencyInfoWrapperPass` in legacy pass manager.
- `LazyMachineBlockFrequencyInfo::print` is empty; drop it as part of the new
pass manager migration.
This reverts commit ab58b6d58edf6a7c8881044fc716ca435d7a0156.
In `CodeGen/Generic/MachineBranchProb.ll`, `llc` crashed with dumped MIR
when targeting PowerPC. Move test to `llc/new-pm`, which is X86
specific.
In MachineBlockPlacement, the function getFirstUnplacedBlock is
inefficient because in most cases (for a typical loop CFG), this function
fails to find a candidate, and its complexity becomes O(#(loops in
function) * #(blocks in function)). This makes the compilation of very
long functions slow. This update reduces it to O(k * #(blocks in
function)) where k is the maximum loop nesting depth, by iterating
through the BlockFilter instead.
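A simplified sketch of the cheaper scan; the real getFirstUnplacedBlock takes more state, but the key change is iterating the filter rather than the whole function:

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/CodeGen/MachineBasicBlock.h"

using namespace llvm;

// Walk the (much smaller) block filter and return the first block that has
// not yet been placed into a chain; nullptr if everything is placed.
static MachineBasicBlock *
firstUnplacedInFilter(ArrayRef<MachineBasicBlock *> BlockFilter,
                      const SmallPtrSetImpl<const MachineBasicBlock *> &Placed) {
  for (MachineBasicBlock *MBB : BlockFilter)
    if (!Placed.count(MBB))
      return MBB;
  return nullptr;
}
```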
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
53 under llvm/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
This patch adds backend consumption of a new loop metadata:
!1 = !{!"llvm.loop.align", i32 64}
which is generated from clang's new loop attribute:
[[clang::code_align()]]
clang patch: #70762
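For reference, a small example of the attribute in source form; aligning the loop below to 64 bytes is what produces the metadata shown above:

```cpp
// The loop start is aligned to a 64-byte boundary.
void hot_loop(int *a, int n) {
  [[clang::code_align(64)]]
  for (int i = 0; i < n; ++i)
    a[i] += 1;
}
```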
C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with
C++20.
We could make llvm::erase more similar to std::erase by having it
return the number of elements removed, but I'm not doing that for now
because nobody seems to care about that in our code base.
Since there are only 50 occurrences of erase_value in our code base,
this patch replaces all of them with llvm::erase and deprecates
llvm::erase_value.
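A tiny before/after usage example:

```cpp
#include "llvm/ADT/STLExtras.h"
#include <vector>

void dropTwos(std::vector<int> &V) {
  // Before: llvm::erase_value(V, 2);   (now deprecated)
  // After: matches the C++20 std::erase spelling.
  llvm::erase(V, 2);
}
```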
- Refactor the (Machine)BlockFrequencyInfo::printBlockFreq functions
into a `PrintBlockFreq()` function returning a `Printable` object. This
simplifies usage as it can be directly piped to a `raw_ostream` like
`dbgs() << PrintBlockFreq(MBFI, Freq) << '\n';`.
- Previously there was an interesting behavior where
`BlockFrequencyInfoImpl` stores frequencies both as a `Scaled64` number
and as a `uint64_t`. Most algorithms use the `BlockFrequency`
abstraction with the integers, but the print function for basic blocks
printed the `Scaled64` number, potentially showing higher accuracy than
was used by the algorithm. This changes things to only print
`BlockFrequency` values.
- Replace some instances of `dbgs() << Freq.getFrequency()` with the new
function.
The `BlockFrequency` class abstracts `uint64_t` frequency values. Use it
more consistently in various APIs and disable implicit conversion to
make usage more consistent and explicit.
- Use `BlockFrequency Freq` parameter for `setBlockFreq`,
`getProfileCountFromFreq` and `setBlockFreqAndScale` functions.
- Return `BlockFrequency` in `getEntryFreq()` functions.
- While at it, change some `const BlockFrequency &Freq` parameters to
plain `BlockFrequency Freq`.
- Mark `BlockFrequency(uint64_t)` constructor as explicit.
- Add missing `BlockFrequency::operator!=`.
- Remove `uint64_t BlockFrequency::getMaxFrequency()`.
- Add `BlockFrequency BlockFrequency::max()` function.
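A small usage sketch reflecting the API shape after these changes (explicit construction, pass-by-value parameters, and `operator!=`):

```cpp
#include "llvm/Support/BlockFrequency.h"

using namespace llvm;

// Implicit conversion from uint64_t no longer compiles, so frequencies are
// constructed explicitly and passed by value.
static bool isHotterThanEntry(BlockFrequency Freq, BlockFrequency EntryFreq) {
  if (Freq != BlockFrequency(0))
    return Freq.getFrequency() > EntryFreq.getFrequency();
  return false;
}
```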
* Place types and functions in the llvm::codelayout namespace
* Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t> to a struct and utilize structured bindings.
It is not conventional to use the "T" suffix for structure types.
* Remove a redundant copy in ChainT::merge.
* Change {ExtTSPImpl,CDSortImpl}::run to use return value instead of an output parameter
* Rename applyCDSLayout to computeCacheDirectedLayout: (a) avoid the rare
abbreviation "CDS" (cache-directed sort); (b) "compute" is more conventional
for the specific use case
* Change the parameter types from std::vector to ArrayRef so that
SmallVector arguments can be used.
* Similarly, rename applyExtTspLayout to computeExtTspLayout.
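A sketch of the EdgeCountT change in isolation; the struct and field names here are illustrative, not necessarily the exact ones in CodeLayout.h:

```cpp
#include <cstdint>
#include <vector>

// Replaces the old pair<pair<uint64_t, uint64_t>, uint64_t> shape.
struct EdgeCountSketch {
  uint64_t Src;
  uint64_t Dst;
  uint64_t Count;
};

static uint64_t totalEdgeWeight(const std::vector<EdgeCountSketch> &Edges) {
  uint64_t Sum = 0;
  // Structured bindings read better than Edge.first.second-style access.
  for (const auto &[Src, Dst, Count] : Edges) {
    (void)Src;
    (void)Dst;
    Sum += Count;
  }
  return Sum;
}
```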
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D159526
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
Sometimes LLVM generates a branch to a return instruction, as in PR63227.
This is because, in MachineBlockPlacement::canTailDuplicateUnplacedPreds,
we avoid duplicating a BB into another already placed BB to prevent destroying
the computed layout. But if the successor BB is a return block, duplicating it
will only reduce taken branches without hurting any other branches.
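A simplified sketch of the relaxed check (not the actual canTailDuplicateUnplacedPreds code):

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"

using namespace llvm;

// Duplicating into an already placed predecessor is normally rejected to keep
// the computed layout intact, but a successor that just returns is cheap to
// copy and only removes a taken branch.
static bool mayDuplicateIntoPlacedPred(const MachineBasicBlock &Succ,
                                       bool PredAlreadyPlaced) {
  if (!PredAlreadyPlaced)
    return true;                // Layout not fixed yet: the usual case.
  return Succ.isReturnBlock();  // New exception for return blocks.
}
```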
Differential Revision: https://reviews.llvm.org/D153093
This change initializes the TSI, LI, DT, PSI, and ORE pointer fields of the SelectOptimize class to nullptr.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D148303
Use case:
- When block layout is visualized after MBP pass, the basic blocks are labeled in layout order; meanwhile blocks could be numbered in a different order.
- As a result, it's hard to map between the graph and pass output. With this option on, the basic blocks are renumbered in function layout order.
This option is only useful when a function is to be visualized (i.e., when view options are on), so it is intended for debugging only.
Use https://godbolt.org/z/5WTW36bMr as an example:
- As MBP pass output (shown in godbolt output window), `func2` is in a basic block numbered `2` (`bb.2`), and `func1` is in a basic block numbered `3` (`bb.3`);
`bb.3` is a block with higher block frequency than `bb.2`, and `bb.3` is placed before `bb.2` in the function layout.
- Use [1] to get the dot graph (graph uploaded in [2]), the blocks are re-numbered.
- `func1` is in the 'if.end' block, and labeled `1` in the visualized dot; `func2` is in the 'if.then' block, and labeled `3` --> the labeled number and bb number won't map.
- [[ b5626ae975/llvm/lib/CodeGen/MachineBlockFrequencyInfo.cpp (L127) | DOTGraphTraits<MachineBlockFrequencyInfo *>::getNodeLabel ]] is where the labeled numbers are derived from the function layout order, and it is [[ a8d93783f3/llvm/include/llvm/Support/GraphWriter.h (L209) | called by the graph writer ]].
So calling 'MachineFunction::RenumberBlocks' makes the labeled number (in the dot graph) and the block number (in the pass output) consistent with each other.
[1] `./bin/clang++ -O3 -S -mllvm -view-block-layout-with-bfi=count -mllvm -view-bfi-func-name=_Z9func_loopv -mllvm -print-after=block-placement -mllvm -filter-print-funcs=_Z9func_loopv test.c`
[2] {F25201785}
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D137467
The diff modifies the ext-tsp code layout algorithm in the following ways:
(i) fixes merging of cold block chains (this is a port of D129397);
(ii) adjusts the cost model utilized for optimization;
(iii) adjusts some APIs so that the implementation can be used in BOLT; this is
a prerequisite for D129895.
The only non-trivial change is (ii). Here we introduce different weights for
conditional and unconditional branches in the cost model. Based on the new model
it is slightly more important to increase the number of "fall-through
unconditional" jumps, which makes sense, as placing two blocks with an
unconditional jump next to each other reduces the number of jump instructions in
the generated code. Experimentally, this has a mild impact on performance;
I've seen up to a 0.2%-0.3% perf win on some benchmarks.
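An illustrative-only sketch of the cost-model idea in (ii); the weights below are made up, not the constants used in CodeLayout.cpp:

```cpp
// The contribution of a jump depends on whether it falls through and whether
// the branch is conditional, with "fall-through unconditional" weighted
// highest so adjacent placement removes the jump instruction entirely.
static double jumpScore(bool IsFallThrough, bool IsConditional,
                        double ExecutionCount) {
  double Weight;
  if (IsFallThrough)
    Weight = IsConditional ? 1.0 : 1.05; // favor unconditional fall-throughs
  else
    Weight = IsConditional ? 0.1 : 0.05; // taken jumps score far lower
  return Weight * ExecutionCount;
}
```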
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D129893
This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang and many LLVM
tests; see the comments on https://reviews.llvm.org/D121169