Aggressive inlining might produce huge functions with more than 10K
basic blocks. Since BFI treats _all_ blocks and jumps as "hot", having
non-negative (but perhaps small) weight, the current implementation can
be slow, taking minutes to produce a layout. This change introduces a
few modifications that significantly (up to 50x on some instances)
speed up the computation. Some notable changes:
- reduced the maximum chain size to 512 (from the prior 4096);
- introduced MaxMergeDensityRatio param to avoid merging chains with
very different densities;
- dropped a couple of params that seem unnecessary.
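A minimal sketch of how a density-ratio guard such as MaxMergeDensityRatio could work. It assumes that a chain's density is its execution count divided by its size; the `Chain` class, the helper name, and the threshold value are hypothetical and not taken from the patch.

```python
from dataclasses import dataclass

# Hypothetical threshold, chosen only for illustration.
MAX_MERGE_DENSITY_RATIO = 100.0


@dataclass
class Chain:
    execution_count: float  # total weight of the blocks in the chain
    size: int               # total size of the chain in bytes

    def density(self) -> float:
        # Assumed definition: weight per byte of code.
        return self.execution_count / self.size


def densities_compatible(a: Chain, b: Chain,
                         max_ratio: float = MAX_MERGE_DENSITY_RATIO) -> bool:
    """Return True if the two chains are similar enough in density to merge."""
    lo, hi = sorted((a.density(), b.density()))
    # Written as a product so that a zero density needs no special case.
    return hi <= max_ratio * lo
```

With such a guard, a very hot chain is never considered for merging with a very cold one, which prunes many candidate pairs early.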
Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure perf
difference for such small changes. I did test the performance of a clang-14
binary and did not record any perf or i-cache-related differences.
My 6 benchmarks, with ext-tsp runtime (the lower the better) and
"tsp-score" (the higher the better):
**Before**:
- benchmark 1:
num functions: 13,047
reordering running time is 2.4 seconds
score: 125503458 (128.3102%)
- benchmark 2:
num functions: 16,438
reordering running time is 3.4 seconds
score: 12613997277 (129.7495%)
- benchmark 3:
num functions: 12,359
reordering running time is 1.9 seconds
score: 1315881613 (105.8991%)
- benchmark 4:
num functions: 96,588
reordering running time is 7.3 seconds
score: 89513906284 (100.3413%)
- benchmark 5:
num functions: 1
reordering running time is 372 seconds
score: 21292505965077 (99.9979%)
- benchmark 6:
num functions: 71,155
reordering running time is 314 seconds
score: 29795381626270671437824 (102.7519%)
**After**:
- benchmark 1:
reordering running time is 2.2 seconds
score: 125510418 (128.3130%)
- benchmark 2:
reordering running time is 2.6 seconds
score: 12614502162 (129.7525%)
- benchmark 3:
reordering running time is 1.6 seconds
score: 1315938168 (105.9024%)
- benchmark 4:
reordering running time is 4.9 seconds
score: 89518095837 (100.3454%)
- benchmark 5:
reordering running time is 4.8 seconds
score: 21292295939119 (99.9971%)
- benchmark 6:
reordering running time is 104 seconds
score: 29796710925310302879744 (102.7565%)
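The per-benchmark speedup can be derived from the running times listed above. The short script below only does that division; the dictionary names are arbitrary.

```python
# Reordering running times in seconds, copied from the Before/After lists.
before = {1: 2.4, 2: 3.4, 3: 1.9, 4: 7.3, 5: 372.0, 6: 314.0}
after = {1: 2.2, 2: 2.6, 3: 1.6, 4: 4.9, 5: 4.8, 6: 104.0}

# Speedup is the old running time divided by the new one.
speedup = {n: before[n] / after[n] for n in before}

for n, ratio in speedup.items():
    print(f"benchmark {n}: {ratio:.2f}x")
```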
Avoids a crash in the D152928 patch due to a reduction pattern appearing after legalization
We can probably extend this further to avoid truncating to sub-128-bit vXi8 (and then calling WidenToV16I8) entirely, but we can't currently hit other cases.
Even though we're only interested in the X64 codegen for the first test, it's much easier to maintain if we just let the update script generate the codegen checks for X86 as well.
Currently, the SLS hardening pass is run before the machine outliner,
which means that the outliner creates new functions and calls which do
not have the SLS hardening applied.
The fix for this is to move the SLS passes to after the outliner, as has
recently been done for the return address signing pass.
This also avoids a bug where the SLS hardening pass emits code with
instructions after a return, which the outliner doesn't correctly
handle.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D158511
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64-bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that we:
* Avoid big numbers close to UINT64_MAX, so that users do not overflow/saturate when adding multiple frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between the hottest and coldest block as much as possible to increase precision.
* Lose precision at the lower end if the hot/cold spread cannot be represented, but keep the frequencies at the upper end differentiable for hot blocks.
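The three goals above can be illustrated with a small model of the conversion. This is a sketch rather than the actual BFI code: it uses `Fraction` in place of Scaled64, pins the hottest block at 2^(64-10), and clamps small results to 1. The function name and the clamping rule are assumptions.

```python
from fractions import Fraction

TOTAL_BITS = 64
SLACK_BITS = 10  # topmost bits left unused, as described above


def to_integer_frequencies(freqs):
    """Map positive rational frequencies to integers with 10 bits of headroom."""
    top = Fraction(2) ** (TOTAL_BITS - SLACK_BITS)
    # Scale so that the hottest block lands exactly on 2**54.
    factor = top / max(freqs)
    # Assumed rule: clamp to 1 so cold blocks never get frequency 0.
    # If the spread is too wide, precision is lost here, at the cold end.
    return [max(1, int(f * factor)) for f in freqs]


narrow = to_integer_frequencies([Fraction(1, 8), Fraction(1), Fraction(16)])
wide = to_integer_frequencies([Fraction(1, 2**80), Fraction(1, 2**79), Fraction(1)])
```

In `narrow` all three blocks stay distinguishable. In `wide` the two coldest blocks collapse to the same value while the hottest block keeps its full resolution.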
Some passes have the limitation that they only support simple terminators:
branch/unreachable/return. Right now, they ask the pass manager to add
the LowerSwitch pass to eliminate `switch`. Let's manage this kind of pass
dependency ourselves. Also add assertions to the related passes.
This patch also includes:
- Remove legacy non_imm12 PatLeaf from RISCVInstrInfoZb.td
- Implement a custom GlobalISel operand renderer for TrailingZeros
SDNodeXForm
It checks for copies of subregs, but it checks the destination, where a
subreg may never happen in SSA. It misses the subreg check on the source
and happily produces S_MOV_B64 out of a subreg COPY.
The affected test should never have been formed in the first place
because the pass runs in SSA, where copies into a subreg shall never
happen.
Currently, we specify that the ptrmask intrinsic allows the mask to have
any size, which will be zero-extended or truncated to the pointer size.
However, what the semantics of the specified GEP expansion actually imply is
that the mask is only meaningful up to the pointer type *index* size --
any higher bits of the pointer will always be preserved. In other words,
the mask gets 1-extended from the index size to the pointer size. This
is also the behavior we want for CHERI architectures.
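The 1-extension can be made concrete with a small integer model. This is only an illustration of the semantics described above, not LLVM code; the function name and its parameters are hypothetical.

```python
def ptrmask(ptr: int, mask: int, ptr_bits: int, index_bits: int) -> int:
    """Model of ptrmask where the mask has the pointer's index width."""
    assert 0 < index_bits <= ptr_bits
    assert 0 <= ptr < (1 << ptr_bits)
    assert 0 <= mask < (1 << index_bits)
    # 1-extend the mask: every bit above the index width is set, so the
    # high bits of the pointer are always preserved.
    high_ones = ((1 << ptr_bits) - 1) & ~((1 << index_bits) - 1)
    return ptr & (mask | high_ones)


# 64-bit pointer with a 32-bit index: only the low 32 bits can be cleared.
masked = ptrmask(0xAAAABBBB12345678, 0xFFFFFFF0, ptr_bits=64, index_bits=32)
```

When the index size equals the pointer size, `high_ones` is zero and the operation reduces to a plain bitwise AND.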
This PR makes two changes:
* It spells out the interaction with the pointer type index size more
explicitly.
* It requires that the mask matches the pointer type index size. The
intention here is to make handling of this intrinsic more robust, to
avoid accidental mix-ups of pointer size and index size in code
generating this intrinsic. If a zero-extend or truncate of the mask is
desired, it should just be done explicitly in IR. This also cuts down on
the amount of testing we have to do, and the things transforms need to
check for.
As far as I can tell, we don't actually support pointers with different
index type size at the SDAG level, so I'm just asserting the sizes match
there for now. Out-of-tree targets using different index sizes may need
to adjust that code.
This allows working with e.g. v8i8 / v16i8 sources.
It is generally useful, but is primarily beneficial when allowing e.g. v8i8s to be passed to branches directly through registers. As such, this is the first in a series of patches to enable that work. However, it affects https://reviews.llvm.org/D155995, so it has been implemented on top of that.
Differential Revision: https://reviews.llvm.org/D159036
Change-Id: Idfcb57dacd0c32cab040fe4dd4ac2ec762750664
.. and move bitcast from a constant for integer-based types into a
better suited location. It solves the mystery of why we sometimes used
`mov.u32` and sometimes `mov.b32` for loading constants. Now they should
all use `.b32`.
eaf85b9c28 "[AMDGPU] Select VGPR versions of MFMA if possible" prevents
the compiler from reserving AGPRs if a kernel has no inline asm
explicitly using AGPRs, no calls, and runs at least 2 waves with not
more than 256 VGPRs. This, in turn, makes it impossible to allocate AGPRs
when they are needed. As a result, regalloc fails if we have an MAI
instruction that has at least one AGPR operand.
This change checks if we have AGPRs and forces operands to VGPR if we do
not have them.
---------
Co-authored-by: Alexander Timofeev <alexander.timofeev@amd.com>
D142966 made it so that st2 that do not start at element 0 use zip2
instead of st2. This extends that to any 64-bit store that has a nearby
store that can better become an STP operation, which is expected to have a
higher throughput. It searches up to 20 instructions away for a store to
p+16 or p-16.
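The bounded search could look roughly like the sketch below. It models instructions as simple tuples and scans forward only; the `Inst` type, the helper name, and the scan direction are assumptions made for illustration, not details of the patch.

```python
from typing import NamedTuple, Sequence

MAX_LOOKUP_DISTANCE = 20  # search limit from the description above


class Inst(NamedTuple):
    opcode: str  # e.g. "store", "load", "add"
    base: str    # symbolic base pointer, "" if not a memory access
    offset: int  # byte offset from the base pointer


def has_pairable_store(insts: Sequence[Inst], idx: int) -> bool:
    """Check for a store to p+16 or p-16 within the lookup window after idx."""
    here = insts[idx]
    window = insts[idx + 1 : idx + 1 + MAX_LOOKUP_DISTANCE]
    return any(
        other.opcode == "store"
        and other.base == here.base
        and abs(other.offset - here.offset) == 16
        for other in window
    )
```

Capping the distance keeps the check cheap, at the cost of missing candidates that sit further away.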
Use BatchAA with EarliestEscapeInfo instead of callCapturesBefore() in
MemDepAnalysis. The advantage of this is that it will also take
not-captured-before information into account for non-calls (see
test_store_before_capture for a representative example), and that this
is a cached analysis. The disadvantage is that EII is slightly less
precise than full CapturedBefore analysis.
In practice the impact is positive, with gvn.NumGVNLoad going from 22022
to 22808 on test-suite.
The impact on compile time is also positive, mainly in the ThinLTO
configuration.
Remove bad test for >2x XLen scalar. Don't restrict struct returns if they aren't homogeneous.
Original commit message:
Types larger than 2*XLen are passed indirectly which is not supported
yet. Currently, we will incorrectly pass X10 multiple times.
We currently shrink the type of vmv_s_x_vl to LMUL=1 when its passthru is
undef, to avoid constraining the register allocator, since the instruction
ignores LMUL.
This patch relaxes this for non-undef passthrus, which occur when lowering
insert_vector_elt.
Function foldRedundantCopy records COPY instructions in CopyMIs and uses
them later. But other optimizations may delete or modify them. So before
using a recorded instruction, we should check that it still exists and is
still a COPY instruction.
A new ComplexPattern `AddrRegImmLsb00000` is added, which is like
`AddrRegImm` except that if the least significant 5 bits of the offset
aren't all zeros, we fall back to offset 0.
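The fallback rule can be sketched as a small selection function. This models only the low-5-bit check: any immediate range check that `AddrRegImm` performs is left out, and the function name and the value-based modelling of the register operand are assumptions.

```python
def select_addr_reg_imm_lsb00000(base: int, offset: int):
    """Return (register value, immediate) for an address of base + offset."""
    if offset & 0b11111 == 0:
        # Low 5 bits are all zero: the offset can be folded as the immediate.
        return base, offset
    # Otherwise fall back to offset 0 and keep the full address in the register.
    return base + offset, 0


folded = select_addr_reg_imm_lsb00000(0x1000, 64)
fallback = select_addr_reg_imm_lsb00000(0x1000, 36)
```

In both cases the register value plus the immediate still equals the original address, so the fallback changes only how the address is encoded.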
In https://reviews.llvm.org/D125075, we switched to using
FastPreTileConfig at O0 and abandoned X86PreAMXConfigPass.
We can now safely remove the code related to X86PreAMXConfigPass.