llvm-project

Author	SHA1	Message	Date
Kazu Hirata	4f1b98bf69	[Utils] Remove an unnecessary cast (NFC) (#157023 ) MergedCounts is of type double.	2025-09-05 14:33:17 +08:00
Iris Shi	bdf03fcff3	Revert "[llvm][NFC] Use `llvm::sort()`" (#140668 )	2025-05-20 11:27:03 +08:00
Iris Shi	061a7699f3	[llvm][NFC] Use `llvm::sort()` (#140335 )	2025-05-17 14:49:46 +08:00
Kazu Hirata	06cb7b1e14	[Transforms] Use llvm::append_range (NFC) (#133650 )	2025-03-30 12:21:59 -07:00
Ellis Hoag	fbec1c2a08	[NFC][CodeLayout] Remove unused parameter (#110145 ) The `NodeCounts` parameter of `calcExtTspScore()` is unused, so remove it. Use `SmallVector` since arrays are expected to be small since they represent MBBs.	2024-09-26 10:28:06 -07:00
Kazu Hirata	fef144cebb	Revert "[llvm] Use llvm::sort (NFC) (#96434 )" This reverts commit 05d167fc201b4f2e96108be0d682f6800a70c23d. Reverting the patch fixes the following under EXPENSIVE_CHECKS: LLVM :: CodeGen/AMDGPU/sched-group-barrier-pipeline-solver.mir LLVM :: CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir LLVM :: CodeGen/PowerPC/aix-xcoff-used-with-stringpool.ll LLVM :: CodeGen/PowerPC/merge-string-used-by-metadata.mir LLVM :: CodeGen/PowerPC/mergeable-string-pool-large.ll LLVM :: CodeGen/PowerPC/mergeable-string-pool-pass-only.mir LLVM :: CodeGen/PowerPC/mergeable-string-pool.ll	2024-06-25 11:18:40 -07:00
Kazu Hirata	05d167fc20	[llvm] Use llvm::sort (NFC) (#96434 )	2024-06-23 10:38:51 -07:00
spupyrev	cebc837937	[CodeLayout] Pre-process execution counts before layout (#70501 ) BOLT fails to process binaries in non-LBR mode, as some blocks marked as having a zero execution count. Adjusting code layout to process such blocks without assertions. This is NFC for all other use cases.	2023-11-02 12:08:33 -07:00
spupyrev	f61179f812	[CodeLayout] Changed option names cds to cdsort (#69668 ) Renaming cds-> cdsort for consistency. This is NFC unless somebody uses older names	2023-10-26 18:10:30 -07:00
Alina Sbirlea	d0584e248d	[CodeLayout] Update to resolve Wdangling warning. Change cc2fbc648d7babbfa612f4f5eda3160212ef6ca7 introduced -Wdangling warning, use temporaries to resolve. llvm/lib/Transforms/Utils/CodeLayout.cpp:764:27: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling] 764 \| std::minmax(ChainPred->density(), ChainSucc->density()); llvm/lib/Transforms/Utils/CodeLayout.cpp:764:49: error: temporary whose address is used as value of local variable '[minDensity, maxDensity]' will be destroyed at the end of the full-expression [-Werror,-Wdangling] 764 \| std::minmax(ChainPred->density(), ChainSucc->density());	2023-10-25 11:31:48 -07:00
spupyrev	cc2fbc648d	[CodeLayout] Faster basic block reordering, ext-tsp (#68617 ) Aggressive inlining might produce huge functions with >10K of basic blocks. Since BFI treats _all_ blocks and jumps as "hot" having non-negative (but perhaps small) weight, the current implementation can be slow, taking minutes to produce an layout. This change introduces a few modifications that significantly (up to 50x on some instances) speeds up the computation. Some notable changes: - reduced the maximum chain size to 512 (from the prior 4096); - introduced MaxMergeDensityRatio param to avoid merging chains with very different densities; - dropped a couple of params that seem unnecessary. Looking at some "offline" metrics (e.g., the number of created fall-throughs), there shouldn't be problems; in fact, I do see some metrics go up. But it might be hard/impossible to measure perf difference for such small changes. I did test the performance clang-14 binary and do not record a perf or i-cache-related differences. My 5 benchmarks, with ext-tsp runtime (the lower the better) and "tsp-score" (the higher the better). Before: - benchmark 1: num functions: 13,047 reordering running time is 2.4 seconds score: 125503458 (128.3102%) - benchmark 2: num functions: 16,438 reordering running time is 3.4 seconds score: 12613997277 (129.7495%) - benchmark 3: num functions: 12,359 reordering running time is 1.9 seconds score: 1315881613 (105.8991%) - benchmark 4: num functions: 96,588 reordering running time is 7.3 seconds score: 89513906284 (100.3413%) - benchmark 5: num functions: 1 reordering running time is 372 seconds score: 21292505965077 (99.9979%) - benchmark 6: num functions: 71,155 reordering running time is 314 seconds score: 29795381626270671437824 (102.7519%) After: - benchmark 1: reordering running time is 2.2 seconds score: 125510418 (128.3130%) - benchmark 2: reordering running time is 2.6 seconds score: 12614502162 (129.7525%) - benchmark 3: reordering running time is 1.6 seconds score: 1315938168 (105.9024%) - benchmark 4: reordering running time is 4.9 seconds score: 89518095837 (100.3454%) - benchmark 5: reordering running time is 4.8 seconds score: 21292295939119 (99.9971%) - benchmark 6: reordering running time is 104 seconds score: 29796710925310302879744 (102.7565%)	2023-10-25 07:52:26 -07:00
Kazu Hirata	f9306f6de3	[ADT] Rename llvm::erase_value to llvm::erase (NFC) (#70156 ) C++20 comes with std::erase to erase a value from std::vector. This patch renames llvm::erase_value to llvm::erase for consistency with C++20. We could make llvm::erase more similar to std::erase by having it return the number of elements removed, but I'm not doing that for now because nobody seems to care about that in our code base. Since there are only 50 occurrences of erase_value in our code base, this patch replaces all of them with llvm::erase and deprecates llvm::erase_value.	2023-10-24 23:03:13 -07:00
Fangrui Song	a24418375a	[CodeLayout] cache-directed sort: limit max chain size (#69039 ) When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cdsort-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cdsort-max-chain-size=64: 1.321813 * -mllvm -cdsort-max-chain-size=128: 2.030425 * -mllvm -cdsort-max-chain-size=256: 2.927684 * -mllvm -cdsort-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.	2023-10-22 16:50:03 -07:00
Fangrui Song	beffc821e8	[CodeLayout] CDSortImpl: remove HotChains and remove linear-time erase_value from mergeChains (#69276 ) After mergeChainPairs initializes a priority queue, HotChains is unused except a HotChains.size() use in LLVM_DEBUG. Optimize it out.	2023-10-17 18:37:18 -07:00
Fangrui Song	9dbfd5828e	[CodeLayout] CDSortImpl: remove two conditions that cannot trigger. NFC	2023-10-13 16:13:41 -07:00
spupyrev	3d75c7c11b	[CodeLayout] Fixing initialization of empty ranges (#68917 ) Fixing libc++'s consistency checks, by eliminating ranges of singular iterators.	2023-10-13 09:35:56 -07:00
spupyrev	b90fcafcd6	[CodeLayout][NFC] Using MergedVector to avoid extra vector allocations (#68724 ) Using a wrapper (MergedVector) around vectors to avoid extra vector allocations. Plus a few edits in the comments.	2023-10-12 15:07:43 -07:00
spupyrev	5b39d8d3db	Revert "[CodeLayout] Faster basic block reordering, ext-tsp (#68275 )" This reverts commit 0a7bf3aad692c5bb591cac605a19980b00325d50.	2023-10-06 12:04:04 -07:00
spupyrev	0a7bf3aad6	[CodeLayout] Faster basic block reordering, ext-tsp (#68275 ) Aggressive inlining might produce huge functions with >10K of basic blocks. Since BFI treats _all_ blocks and jumps as "hot" having non-negative (but perhaps small) weight, the current implementation can be slow, taking minutes to produce an layout. This change introduces a few modifications that significantly (up to 50x on some instances) speeds up the computation. Some notable changes: - reduced the maximum chain size to 512 (from the prior 4096); - introeuced MaxMergeDensityRatio param to avoid merging chains with very differen densities; - dropped a couple of params that seem unnecessary. Looking at some "offline" metrics (e.g., the number of created fall-throughs), there shouldn't be problems; in fact, I do see some metrics go up. But it might be hard/impossible to measure perf difference for such small changes. I did test the performance clang-14 binary and do not record a perf or i-cache-related differences. My 5 benchmarks, with ext-tsp runtime (the lower the better) and "tsp-score" (the higher the better). Before: - benchmark 1: reordering running time is 2486 milliseconds score: 125503458 (128.3102%) - benchmark 2: reordering running time is 3443 milliseconds score: 12613997277 (129.7495%) - benchmark 2: reordering running time is 1978 milliseconds score: 1315881613 (105.8991%) - benchmark 4: reordering running time is 7364 milliseconds score: 89513906284 (100.3413%) - benchmark 5: reordering running time is 372605 milliseconds score: 21292505965077 (99.9979%) After: - benchmark 1: reordering running time is 2498 milliseconds score: 125510418 (128.3173%) - benchmark 2: reordering running time is 3201 milliseconds score: 12614502162 (129.7547%) - benchmark 3: reordering running time is 2137 milliseconds score: 1315938168 (105.9036%) - benchmark 4: reordering running time is 6242 milliseconds score: 89518095837 (100.3460%) - benchmark 5: reordering running time is 5819 milliseconds score: 21292295939119 (99.9969%)	2023-10-06 11:58:25 -07:00
Fangrui Song	6b8d04c23d	[CodeLayout] Refactor std::vector uses, namespace, and EdgeCountT. NFC * Place types and functions in the llvm::codelayout namespace * Change EdgeCountT from pair<pair<uint64_t, uint64_t>, uint64_t> to a struct and utilize structured bindings. It is not conventional to use the "T" suffix for structure types. * Remove a redundant copy in ChainT::merge. * Change {ExtTSPImpl,CDSortImpl}::run to use return value instead of an output parameter * Rename applyCDSLayout to computeCacheDirectedLayout: (a) avoid rare abbreviation "CDS" (cache-directed sort) (b) "compute" is more conventional for the specific use case * Change the parameter types from std::vector to ArrayRef so that SmallVector arguments can be used. * Similarly, rename applyExtTspLayout to computeExtTspLayout. Reviewed By: Amir Differential Revision: https://reviews.llvm.org/D159526	2023-09-21 13:13:03 -07:00
Fangrui Song	af935cf0ee	[CodeLayout] Fix X1_Y_X2 and Y_X2_X1 testing for jumps from Y (#66592 ) The CHECK2 test in code_placement_ext_tsp_large.ll now has the same result as the CHECK test: when chain(0,2,3,4,1) is merged with chain(8), the result is now chain(0,2,3,4,8,1). Ideally we should have test coverage for -ext-tsp-chain-split-threshold=1, but it seems challenging to craft one. Perhaps the default value of -ext-tsp-chain-split-threshold can be decreased as the -ext-tsp-enable-chain-split-along-jumps heuristic is now more powerful.	2023-09-18 22:50:17 -07:00
spupyrev	b4b42bd652	Cleaning up unreachable code in CodeLayout - removing an unreachable instruction from the code (earlier code merge bug); - silencing "unused variable" warnings. Reviewed By: rahmanl Differential Revision: https://reviews.llvm.org/D158859	2023-08-28 09:22:20 -07:00
spupyrev	bc59faa863	A new code layout algorithm for function reordering [2/3] We are bringing a new algorithm for function layout (reordering) based on the call graph (extracted from a profile data). The algorithm is an improvement of top of a known heuristic, C^3. It tries to co-locate hot and frequently executed together functions in the resulting ordering. Unlike C^3, it explores a larger search space and have an objective closely tied to the performance of instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort. The algorithm can be used at the linking or post-linking (e.g., BOLT) stage. The algorithm shares some similarities with C^3 and an approach for basic block reordering (ext-tsp). It works with chains (ordered lists) of functions. Initially all chains are isolated functions. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the objective, which is a weighted combination of frequency-based and distance-based locality. That is, we try to co-locate hot functions together (so they can share the cache lines) and functions frequently executed together. The merging process stops when there is only one chain left, or when merging does not improve the objective. In the latter case, the remaining chains are sorted by density in the decreasing order. Complexity We regularly apply the algorithm for large data-center binaries containing 10K+ (hot) functions, and the algorithm takes only a few seconds. For some extreme cases with 100K-1M nodes, the runtime is within minutes. Perf-impact We extensively tested the implementation extensively on a benchmark of isolated binaries and prod services. The impact is measurable for "larger" binaries that are front-end bound: the cpu time improvement (on top of C^3) is in the range of [0% .. 1%], which is a result of a reduced i-TLB miss rate (by up to 20%) and i-cache miss rate (up to 5%). Reviewed By: rahmanl Differential Revision: https://reviews.llvm.org/D152834	2023-07-27 09:20:53 -07:00
spupyrev	a7e13a99c2	A new code layout algorithm for function reordering [1/3] We are brining a new algorithm for function layout (reordering) based on the call graph (extracted from a profile data). The algorithm is an improvement of top of a known heuristic, C^3. It tries to co-locate hot and frequently executed together functions in the resulting ordering. Unlike C^3, it explores a larger search space and have an objective closely tied to the performance of instruction and i-TLB caches. Hence, the name CDS = Cache-Directed Sort. The algorithm can be used at the linking or post-linking (e.g., BOLT) stage. This diff modifies the existing data structures to facilitate the implementation (down the stack). This is a no-op change. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D152833	2023-07-07 07:25:25 -07:00
Fangrui Song	1e6921131a	Move global namespace cl::opt inside llvm::	2023-02-14 00:09:44 -08:00
Yevgeny Rouban	e351821088	Fix compilation of CodeLayout.cpp for MacOS llvm/lib/Transforms/Utils/CodeLayout.cpp uses std::abs() with double argument, which is provided by cmath header, which is not explicitly included into CodeLayout.cpp. The implicit include in llvm/include/llvm/Support/MathExtras.h was removed in commit 16544cbe64b81a50800a88296ef37f4873a37b25 Inserting explicit include of cmath into CodeLayout.cpp in order to fix build on MacOS. Committed on behalf of alsemenov (Aleksei Semenov) Reviewed By: thieta Differential Revision: https://reviews.llvm.org/D135072	2022-10-03 21:47:43 +07:00
Kazu Hirata	56ea4f9bd3	[Transforms] Qualify auto in range-based for loops (NFC) Identified with readability-qualified-auto.	2022-08-27 21:21:02 -07:00
spupyrev	8d5b694da1	extending code layout alg The diff modifies ext-tsp code layout algorithm in the following ways: (i) fixes merging of cold block chains (this is a port of D129397); (ii) adjusts the cost model utilized for optimization; (iii) adjusts some APIs so that the implementation can be used in BOLT; this is a prerequisite for D129895. The only non-trivial change is (ii). Here we introduce different weights for conditional and unconditional branches in the cost model. Based on the new model it is slightly more important to increase the number of "fall-through unconditional" jumps, which makes sense, as placing two blocks with an unconditional jump next to each other reduces the number of jump instructions in the generated code. Experimentally, this makes a mild impact on the performance; I've seen up to 0.2%-0.3% perf win on some benchmarks. Reviewed By: hoy Differential Revision: https://reviews.llvm.org/D129893	2022-08-24 09:40:25 -07:00
Kazu Hirata	50724716cd	[Transforms] Qualify auto in range-based for loops (NFC) Identified with readability-qualified-auto.	2022-08-14 12:51:58 -07:00
Kazu Hirata	448c466636	Use llvm::erase_value (NFC)	2022-08-13 12:55:50 -07:00
Kazu Hirata	a2d4501718	[llvm] Fix comment typos (NFC)	2022-08-07 00:16:14 -07:00
Fangrui Song	36c7d79dc4	Remove unneeded cl::ZeroOrMore for cl::opt options Similar to 557efc9a8b68628c2c944678c6471dac30ed9e8e. This commit handles options where cl::ZeroOrMore is more than one line below cl::opt.	2022-06-04 00:10:42 -07:00
spupyrev	bcdc047731	speeding up ext-tsp for huge instances Differential Revision: https://reviews.llvm.org/D120780	2022-03-02 07:17:48 -08:00
serge-sans-paille	a494ae43be	Cleanup includes: TransformsUtils Estimation on the impact on preprocessor output: before: 1065307662 after: 1064800684 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D120741	2022-03-01 21:00:07 +01:00
Kazu Hirata	3a3cb929ab	[llvm] Use = default (NFC)	2022-02-06 22:18:35 -08:00
Hongtao Yu	dee058c670	[CSSPGO] Turn on ext-tsp by default for CSSPGO. I'm seeing ext-tsp helps CSSPGO for our intern large benchmarks so I'm turning on it for CSSPGO. For non-CS AutoFDO, ext-tsp doesn't seem to help, probably because of lower profile counts quality. Reviewed By: wenlei Differential Revision: https://reviews.llvm.org/D119048	2022-02-04 19:46:44 -08:00
spupyrev	f573f6866e	ext-tsp basic block layout A new basic block ordering improving existing MachineBlockPlacement. The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality and thus processor I-cache utilization. This is achieved via increasing the number of fall-through jumps and co-locating frequently executed nodes together. The name follows the underlying optimization problem, Extended-TSP, which is a generalization of classical (maximum) Traveling Salesmen Problem. The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. Initially all chains are isolated basic blocks. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the ExtTSP value, which models how i-cache "friendly" a specific chain is. A pair of chains giving the maximum gain is merged into a new chain. The procedure stops when there is only one chain left, or when merging does not increase ExtTSP. In the latter case, the remaining chains are sorted by density in decreasing order. An important aspect is the way two chains are merged. Unlike earlier algorithms (e.g., based on the approach of Pettis-Hansen), two chains, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast. Differential Revision: https://reviews.llvm.org/D113424	2021-12-07 07:31:10 -08:00
Nico Weber	3678326d28	Revert "ext-tsp basic block layout" This reverts commit c68f71eb37c2b6ffcf29e865d443a910e73083bd. Breaks tests on arm hosts, see comments on https://reviews.llvm.org/D113424	2021-12-06 19:08:20 -05:00
spupyrev	c68f71eb37	ext-tsp basic block layout A new basic block ordering improving existing MachineBlockPlacement. The algorithm tries to find a layout of nodes (basic blocks) of a given CFG optimizing jump locality and thus processor I-cache utilization. This is achieved via increasing the number of fall-through jumps and co-locating frequently executed nodes together. The name follows the underlying optimization problem, Extended-TSP, which is a generalization of classical (maximum) Traveling Salesmen Problem. The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. Initially all chains are isolated basic blocks. On every iteration, we pick a pair of chains whose merging yields the biggest increase in the ExtTSP value, which models how i-cache "friendly" a specific chain is. A pair of chains giving the maximum gain is merged into a new chain. The procedure stops when there is only one chain left, or when merging does not increase ExtTSP. In the latter case, the remaining chains are sorted by density in decreasing order. An important aspect is the way two chains are merged. Unlike earlier algorithms (e.g., based on the approach of Pettis-Hansen), two chains, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast. Differential Revision: https://reviews.llvm.org/D113424	2021-12-06 08:56:39 -08:00

39 Commits