shylie/llvm-project

Commit Graph

Author

SHA1

Message

Date

Snehasish Kumar

24bf6ff4e0

[llvm] Update default cutoff threshold for machine function splitter.

Based on internal testing at Google we found that setting the profile
summary cutoff threshold to 999950 yields the best results in terms of
itlb and icache metrics (as observed on Intel CPUs).
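As a rough illustration of what a cutoff such as 999950 means, the sketch below treats the cutoff as parts per million of the total profile count and finds the smallest count among the hottest blocks that together cover that share. The function names `count_threshold` and `is_cold` and the sample counts are hypothetical and are not taken from the LLVM sources; this is a minimal sketch of the idea, not the pass's implementation.

```python
def count_threshold(block_counts, cutoff_ppm):
    """Smallest count among the hottest blocks that together cover
    cutoff_ppm / 1_000_000 of the total profile count."""
    total = sum(block_counts)
    running = 0
    for count in sorted(block_counts, reverse=True):
        running += count
        # Integer comparison avoids floating-point rounding.
        if running * 1_000_000 >= total * cutoff_ppm:
            return count
    return 0


def is_cold(count, threshold):
    """A block is cold if it has no count (None) or falls below the threshold."""
    return count is None or count < threshold


if __name__ == "__main__":
    counts = [1_000_000, 500, 40, 10, 0]  # made-up block counts
    for cutoff in (999999, 999950, 990000):
        t = count_threshold(counts, cutoff)
        cold = [c for c in counts if is_cold(c, t)]
        print(cutoff, "threshold:", t, "cold blocks:", len(cold))
```

In this toy input, lowering the cutoff raises the count threshold, so more blocks are classified as cold. That matches the direction of the size-% column in the tables, where smaller cutoffs split out more bytes.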

*default* = Split out code if no profile count available for block
*size-%*  = The fraction of bytes split out of .text and .text.hot
*itlb*    = Misses per kilo instructions (MPKI) for itlb
*icache*  = Misses per kilo instructions (MPKI) for L1 icache
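The MPKI metric defined above is a simple ratio. A minimal sketch follows; the helper name `mpki` and the sample counter values are made up for illustration.

```python
def mpki(misses, instructions):
    """Misses per kilo instructions: misses for every 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000 / instructions


if __name__ == "__main__":
    # Hypothetical counter readings: 2464 L1 icache misses over one million instructions.
    print(mpki(2464, 1_000_000))
```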

Search1

| cutoff  | size-%  | itlb      | icache  |
|---------|---------|-----------|---------|
| default | 42.5861 | 0.0822151 | 2.46363 |
|  999999 | 44.9350 | 0.0767194 | 2.44416 |
|  999950 | 50.0660 |  0.075744 |  2.4091 |
|  999500 | 56.9158 |  0.082564 |  2.4188 |
|  995000 | 63.8625 | 0.0814927 | 2.42832 |
|  990000 | 71.7314 |  0.106906 | 2.57785 |

Search2

| cutoff  | size-% | itlb     | icache  |
|---------|--------|----------|---------|
| default | 2.8845 | 0.626712 | 4.73245 |
|  999999 | 3.3291 | 0.602309 | 4.70045 |
|  999950 | 3.8577 | 0.587842 | 4.71632 |
|  999500 | 4.4170 |  0.63577 | 4.68351 |
|  995000 | 5.1020 | 0.657969 | 4.82272 |
|  990000 | 5.7153 | 0.719122 | 5.39496 |
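To make the comparison in the two tables concrete, the sketch below computes the relative change of the 999950 row against the default row. The values are copied from the tables; the helper names are hypothetical. With these inputs the itlb MPKI drops by close to 8% in Search1 and about 6% in Search2, while the icache MPKI changes by a few percent or less.

```python
# (itlb MPKI, icache MPKI) copied from the tables above.
SEARCH1 = {"default": (0.0822151, 2.46363), "999950": (0.075744, 2.4091)}
SEARCH2 = {"default": (0.626712, 4.73245), "999950": (0.587842, 4.71632)}


def percent_change(new, old):
    """Relative change of new versus old, in percent (negative means fewer misses)."""
    return (new - old) / old * 100.0


def changes_vs_default(table, cutoff):
    """Return (itlb change %, icache change %) of one cutoff row against the default row."""
    base, row = table["default"], table[cutoff]
    return tuple(percent_change(n, o) for n, o in zip(row, base))


if __name__ == "__main__":
    for name, table in (("Search1", SEARCH1), ("Search2", SEARCH2)):
        itlb, icache = changes_vs_default(table, "999950")
        print(f"{name}: itlb {itlb:+.2f}%, icache {icache:+.2f}%")
```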

Differential Revision: https://reviews.llvm.org/D89085

2020-10-14 12:48:10 -07:00

Snehasish Kumar

94faadaca4

[llvm][CodeGen] Machine Function Splitter

We introduce a codegen optimization pass that splits functions into hot and
cold parts. This pass leverages the basic block sections feature recently
introduced in LLVM from the Propeller project. The pass targets
functions with profile coverage, identifies cold blocks and moves them
to a separate section. The linker groups all cold blocks across
functions together, decreasing fragmentation and improving icache and
itlb utilization.
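The idea described above can be sketched as a partition over basic blocks. This is a simplified model under stated assumptions, not the pass itself: the `Block` type, the `split_layout` function, and the fixed count threshold are hypothetical, and details of the real pass such as how it treats entry blocks are ignored.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    function: str
    name: str
    count: Optional[int]  # None means no profile count for this block


def split_layout(blocks, threshold):
    """Return (hot, cold) block lists.

    Cold blocks from every profiled function land in one list, modelling
    a linker that groups them together in a separate section.
    """
    # Only functions with at least one profiled block are candidates for splitting.
    profiled = {b.function for b in blocks if b.count is not None}
    hot, cold = [], []
    for b in blocks:
        if b.function not in profiled:
            hot.append(b)  # no profile coverage: leave the function whole
        elif b.count is None or b.count < threshold:
            cold.append(b)
        else:
            hot.append(b)
    return hot, cold


if __name__ == "__main__":
    blocks = [
        Block("f", "entry", 1000),
        Block("f", "error", 0),
        Block("g", "entry", 800),
        Block("g", "rare", None),
        Block("h", "entry", None),  # h has no profile at all
    ]
    hot, cold = split_layout(blocks, threshold=1)
    print("hot: ", [(b.function, b.name) for b in hot])
    print("cold:", [(b.function, b.name) for b in cold])
```

Keeping the cold blocks of `f` and `g` adjacent is what reduces fragmentation in this model: the hot list stays dense, so fewer pages and cache lines are needed to hold the code that actually runs.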

We evaluated the Machine Function Splitter pass on clang bootstrap and
SPECInt 2017.

For clang bootstrap we observe a mean 2.33% runtime improvement with a
~32% reduction in itlb and stlb misses. Additionally, L1 icache misses
fell by 9.5% and L2 instruction misses fell by 20%.

For SPECInt we report the change in IntRate for the C/C++
benchmarks. All benchmarks apart from mcf and x264 improve, on average
by 0.6%, with the maximum for deepsjeng at 1.6%.

Benchmark		% Change
500.perlbench_r		 0.78
502.gcc_r		 0.82
505.mcf_r		-0.30
520.omnetpp_r		 0.18
523.xalancbmk_r		 0.37
525.x264_r		-0.46
531.deepsjeng_r		 1.61
541.leela_r		 0.83
557.xz_r		 0.15
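The per-benchmark numbers above can be summarized with a few lines of arithmetic. The sketch below copies the values from the table and prints plain arithmetic means; the commit message does not state how its average was computed, so these figures are for comparison only.

```python
# Percent change in IntRate, copied from the table above.
CHANGES = {
    "500.perlbench_r": 0.78,
    "502.gcc_r": 0.82,
    "505.mcf_r": -0.30,
    "520.omnetpp_r": 0.18,
    "523.xalancbmk_r": 0.37,
    "525.x264_r": -0.46,
    "531.deepsjeng_r": 1.61,
    "541.leela_r": 0.83,
    "557.xz_r": 0.15,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(changes):
    """Return (regressing benchmarks, mean over all, mean over improving, best benchmark)."""
    regressed = sorted(name for name, v in changes.items() if v < 0)
    improved = [v for v in changes.values() if v > 0]
    best = max(changes, key=changes.get)
    return regressed, mean(changes.values()), mean(improved), best


if __name__ == "__main__":
    regressed, mean_all, mean_improved, best = summarize(CHANGES)
    print("regressed:", regressed)
    print(f"mean over all nine: {mean_all:.2f}%")
    print(f"mean over improving benchmarks: {mean_improved:.2f}%")
    print("largest gain:", best, CHANGES[best])
```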

Differential Revision: https://reviews.llvm.org/D85368

2020-08-28 11:10:14 -07:00

2 Commits