The improvements in 63917e1 / #70796 do not check for memory
barriers/unmodelled sideeffects, which means we may incorrectly hoist
loads across memory barriers.
Fix this by checking any machine instruction in the loop is a load-fold
barrier.
PR: https://github.com/llvm/llvm-project/pull/116987
Both are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, defined by the `MachineLICM`
and `EarlyMachineLICM` classes.
The `PreRegAlloc` flag used to be overwritten it based on MRI.isSSA(),
which is un-reliable due to how it is inferred by the MIRParser. I see
that we can now define isSSA in MIR (thanks @gargaroff ), meaning the
fix isn’t really needed anymore, but redefining that flag still feels
wrong.
Note that I'm looking into upstreaming more changes to MachineLICM, see
[the discourse
thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).
This time with 100% more building unit tests. Original commit message follows.
[NFC] Switch a number of DenseMaps to SmallDenseMaps for speedup (#109417)
If we use SmallDenseMaps instead of DenseMaps at these locations,
we get a substantial speedup because there's less spurious malloc
traffic. Discovered by instrumenting DenseMap with some accounting
code, then selecting sites where we'll get the most bang for our buck.
If we use SmallDenseMaps instead of DenseMaps at these locations,
we get a substantial speedup because there's less spurious malloc
traffic. Discovered by instrumenting DenseMap with some accounting
code, then selecting sites where we'll get the most bang for our buck.
- Add `MachineBlockFrequencyAnalysis`.
- Add `MachineBlockFrequencyPrinterPass`.
- Use `MachineBlockFrequencyInfoWrapperPass` in legacy pass manager.
- `LazyMachineBlockFrequencyInfo::print` is empty, drop it due to new
pass manager migration.
Summary:
- Remove wrappers in `MachineDominatorTree`.
- Remove `MachineDominatorTree` update code in
`MachineBasicBlock::SplitCriticalEdge`.
- Use `MachineDomTreeUpdater` in passes which call
`MachineBasicBlock::SplitCriticalEdge` and preserve
`MachineDominatorTreeWrapperPass` or CFG analyses.
Commit abea99f65a97248974c02a5544eaf25fc4240056 introduced related
methods in 2014. Now we have SemiNCA based dominator tree in 2017 and
dominator tree updater, the solution adopted here seems a bit outdated.
Reverts the behavior introduced by 770393b while keeping the refactored
code.
Fixes a miscompile on AArch64, at the cost of a small regression on
AMDGPU.
#96146 opened to investigate the issue
All values are small so no reason to ever use SmallSet really. In large
programs we'll end up using std::set which is extremely slow compared to
DenseSet. This brings a decent speedup to the pass in large programs.
Prepare for new pass manager version of `MachineDominatorTreeAnalysis`.
We may need a machine dominator tree version of `DomTreeUpdater` to
handle `SplitCriticalEdge` in some CodeGen passes.
Those BitVectors get expensive on targets like AMDGPU with thousands of
registers, and RegAliasIterator is also expensive.
We can move all liveness calculations to use RegUnits instead to speed
it up for targets where RegAliasIterator is expensive, like AMDGPU.
On targets where RegAliasIterator is cheap, this alternative can be a little more expensive, but I believe the tradeoff is worth it.
Previously, we just check if the source is a virtual register and
this prevents some potential hoists.
We can see some improvements in AArch64/RISCV tests.
befa925acac8fd6a9266e introduced preliminary hoisting of COPY
instructions when the user of the COPY is inside the same loop. That
optimization appeared to be too aggressive and hoisted too many COPY's
greatly increasing register pressure causing performance regressions for
AMDGPU target.
This is intended to fix the regression by hoisting COPY instruction only
if either:
- User of COPY can be hoisted (other args are invariant)
or
- Hoisting COPY doesn't bring high register pressure
When hoisting an invariant load, we should not combine it with an
existing load through common subexpression elimination (CSE). This is
because there might be memory-changing instructions between the existing
load and the end of the block entering the loop.
Fixes https://github.com/llvm/llvm-project/issues/72855
When there is a COPY instruction in the loop with other uses, we want to
hoist the COPY, which in turn leads to the users being hoisted as well.
Co-authored-by David Green : David.Green@arm.com
Sometimes, loads can appear in a loop after the LICM pass is executed
the final time. For example, ExpandMemCmp pass creates loads in a loop,
and one of the operands may be an invariant address.
This patch extends the pre-regalloc stage MachineLICM by allowing to
hoist invariant loads from loops that don't have any stores or calls
and allows load reorderings.
Skip LICM if PHI belongs to the current loop, e.g. is in the
loop's header. This prevents LICM from bailing for CFGs like
L1:
R = LoopInvariant // can be LICM'd
BR L1
L2:
PHI(R, ..)
BR L2
When hosting a loop invariant instruction the resulting register must be
live in
all the basic blocks of the loop body and the killed flags of the
register must
be cleared.
Before this patch killed flags of subregister to a hoisted superregister
was not
cleared in the loop body.
This was found in an out of tree target, but the testcase
mlicm-stack-write-check.mir was modified to trigger the case.
It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.
Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outermost loop's blocks once.
Differential Revision: https://reviews.llvm.org/D154205
This fixes https://github.com/llvm/llvm-project/issues/60766
With MSVC style exception-handling (funclets), no registers are
alive when entering the funclet so they must be reloaded from the
stack. MachineLICM can sometimes hoist such reloads out of the
funclet which is not correct, the register will have been clobbered
when entering the funclet. This can happen in any loop that
contains a try-catch.
This has been tested on x86_64-pc-window-msvc. I'm not sure if
funclets work the same on the other windows archs.
Reviewed By: rnk, arsenm
Differential Revision: https://reviews.llvm.org/D153337
MachineLICM pass handles inner loops only when outmost loop does not have unique
predecessor. If the loop has preheader and there is loop invariant code, the
invariant code can be hoisted to the preheader in general. This patch makes the
pass handle inner loops in general.
Differential Revision: https://reviews.llvm.org/D154205
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.
Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.
Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
Add a shouldHoist method to TargetInstrInfo which is queried by
MachineLICM to override hoisting decisions for a given target.
This mirrors functionality provided by shouldSink.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D118773