When computing the viable cycles for scheduling an instruction,
`computeStart` used to include special-case logic to handle loop-carried
dependencies. This special handling was necessary because loop-carried
dependencies were represented by reversed forward-direction edges in the
DAG. Now that we have the DDG, which explicitly models loop-carried
dependencies, this special handling is no longer required. As a first
step towards completely removing `isLoopCarriedDep`, this patch
eliminates the special-case logic from `computeStart` and some related
functions.
Split off from https://github.com/llvm/llvm-project/pull/135148
As with loads and stores, instructions that may trigger floating-point
exceptions must not be reordered across a barrier instruction. This
patch adds the missing loop-carried dependencies between such
instructions and the barrier, preventing reordering that could
previously occur. As in #174391, the implementation is based on that
of `ScheduleDAGInstrs::buildSchedGraph`.
Split off from #135148
Loads and stores must not be reordered across barrier instructions.
However, MachinePipeliner could reorder them because loop-carried
dependencies from loads/stores to a barrier instruction were not
considered. The same problem exists for barrier-to-barrier
dependencies. This patch adds handling for those cases. The
implementation is based on that of `ScheduleDAGInstrs::buildSchedGraph`.
Split off from https://github.com/llvm/llvm-project/pull/135148
- Remove pass initialization calls from pass constructors.
- For some passes, add the initialization to `initializeCodeGen` or
`initializeGlobalISel`.
- Remove redundant initializations from llc and X86 target for some
passes.
In loop-carried dependence analysis of MachinePipeliner, there is
special handling for a specific case, referred to as a "cheap check".
This check is not sound and sometimes misses dependencies. If there is
no significant performance regression, this special logic should be
deleted.
Split off from https://github.com/llvm/llvm-project/pull/135148
This change corrects the scheduling relationship between dependent PHI
nodes. Previously, the implementation treated SU1 as the successor of
SU0. In reality, SU0 should depend on SU1, not the other way around.
The incorrect ordering could cause SU0 to be scheduled before SU1, which
leads to invalid IR: subsequent instructions may reference values that
have not yet been defined.
%3:intregs = PHI %21:intregs, %bb.6, %7:intregs, %bb.1 - SU0
%7:intregs = PHI %21:intregs, %bb.6, %13:intregs, %bb.1 - SU1
%27:intregs = A2_zxtb %3:intregs - SU2
%13:intregs = C2_muxri %45:predregs, 0, %46:intreg
Co-Authored by: Sumanth Gundapaneni
Use it in `printVRegOrUnit()`, `getPressureSets()`/`PSetIterator`,
and in functions/classes dealing with register pressure.
Static type checking revealed several bugs, mainly in MachinePipeliner.
I'm not very familiar with this pass, so I left a bunch of FIXMEs.
There is one bug in `findUseBetween()` in RegisterPressure.cpp, also
annotated with a FIXME.
The dependency analysis in MachinePipeliner checks dependencies for
every pair of store instructions in the target basic block. This means
the time complexity of the analysis is `O(N^2)`, where `N` is the number
of store instructions. Therefore, compilation time can become very long
when there are too many store instructions.
To mitigate this, this patch introduces logic to count the number of store
instructions at the beginning of the pipeliner and bail out if it
exceeds the threshold. The default value of the threshold should be
large enough. Thus, in most practical cases where the pipeliner is
beneficial, this patch should not cause any performance regression.
Related issue: #150262
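A minimal sketch of the bail-out idea described above, not the code from the patch. `Instr`, `hasTooManyStores`, and the `Threshold` parameter are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a machine instruction; only the property
// relevant to this sketch is modeled.
struct Instr {
  bool MayStore;
};

// Returns true if the block contains more stores than Threshold, in which
// case the quadratic pairwise analysis would be skipped. The function name
// and the way the threshold is passed are assumptions, not LLVM API.
bool hasTooManyStores(const std::vector<Instr> &Block, std::size_t Threshold) {
  std::size_t NumStores = 0;
  for (const Instr &I : Block) {
    if (!I.MayStore)
      continue;
    // Stop counting as soon as the limit is exceeded.
    if (++NumStores > Threshold)
      return true;
  }
  return false;
}
```

The count is a single linear pass, so the check itself stays cheap even for blocks where the pairwise analysis would be expensive.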
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
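The "redirection" mentioned above is a partial specialization on pointer element types. The sketch below reproduces the pattern with toy classes backed by `std::set`; `ToySet` and `ToyPtrSet` are hypothetical stand-ins, not the LLVM containers.

```cpp
#include <cassert>
#include <set>
#include <type_traits>

// Toy pointer set; stands in for the pointer-optimized container.
template <typename PtrT, unsigned N> class ToyPtrSet {
  std::set<PtrT> Elems;

public:
  bool insert(PtrT P) { return Elems.insert(P).second; }
  bool contains(PtrT P) const { return Elems.count(P) != 0; }
};

// Toy general-purpose set (primary template).
template <typename T, unsigned N> class ToySet {
  std::set<T> Elems;

public:
  bool insert(const T &V) { return Elems.insert(V).second; }
  bool contains(const T &V) const { return Elems.count(V) != 0; }
};

// Partial specialization: for pointer element types, ToySet is just a
// thin subclass of ToyPtrSet. This mirrors the shape of the quoted code.
template <typename PointeeType, unsigned N>
class ToySet<PointeeType *, N> : public ToyPtrSet<PointeeType *, N> {};

// The pointer instantiation derives from ToyPtrSet; the int one does not.
static_assert(std::is_base_of<ToyPtrSet<int *, 4>, ToySet<int *, 4>>::value,
              "pointer specialization should derive from ToyPtrSet");
static_assert(!std::is_base_of<ToyPtrSet<int, 4>, ToySet<int, 4>>::value,
              "primary template should not derive from ToyPtrSet");
```

Because the specialization adds nothing of its own, spelling the pointer-set type directly at the use site names the class that is actually instantiated.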
This patch fixes a bug introduced in #145878. A dependency was added in
the wrong direction, causing an assertion failure due to broken
topological order.
This patch adds an additional validation step to ensure that the
generated schedule does not violate loop-carried memory dependencies.
Prior to this patch, incorrect schedules could be produced due to the
lack of checks for the following types of dependencies:
- load-to-store backward (from bottom to top within the BB) dependencies
- store-to-load dependencies
- store-to-store dependencies
One possible solution to this issue is to add these dependencies
directly to the dependency graph, although doing so may lead to
performance degradation. In addition, no known cases of incorrect code
generation caused by these missing dependencies have been observed in
practice. Given these factors, this patch introduces a post-scheduling
validation phase to check for such previously missed dependencies,
instead of adding them to the graph before searching for a schedule.
Since no actual problems have been identified so far, it is likely that
most generated schedules are already valid. Therefore, this additional
validation is not expected to cause performance degradation in practice.
Split off from #135148 .
The remaining tasks are as follows:
- Address other missing loop-carried dependencies (e.g., output
dependencies between physical registers, barrier instructions, and
instructions that may raise floating-point exceptions)
- Remove code that is currently retained to maintain the existing
behavior but is probably unnecessary.
- Eliminate `SwingSchedulerDAG::isLoopCarriedDep` and use
`SwingSchedulerDDG` to traverse edges after the dependency analysis part.
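To illustrate the post-scheduling validation idea described above, here is a minimal sketch that checks the usual modulo-scheduling constraint for loop-carried edges. It is not the code from the patch: `LoopCarriedEdge` and `scheduleRespectsEdges` are hypothetical names, and the per-edge latency and distance fields are assumptions about how such an edge might be modeled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical loop-carried edge: Dst in iteration i + Distance must not
// start before Src in iteration i has had Latency cycles.
struct LoopCarriedEdge {
  unsigned Src;
  unsigned Dst;
  int Latency;
  int Distance;
};

// Cycle[N] is the cycle assigned to node N within a single iteration.
// Returns false if any edge is violated by the schedule.
bool scheduleRespectsEdges(const std::vector<int> &Cycle, int II,
                           const std::vector<LoopCarriedEdge> &Edges) {
  assert(II > 0 && "initiation interval must be positive");
  for (const LoopCarriedEdge &E : Edges) {
    // Iteration i + Distance starts Distance * II cycles after iteration i.
    if (Cycle[E.Dst] + E.Distance * II < Cycle[E.Src] + E.Latency)
      return false;
  }
  return true;
}
```

A schedule that fails such a check would be rejected after the fact, which leaves the dependency graph used during the schedule search unchanged.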
In MachinePipeliner, loop-carried memory dependencies are represented by
a DAG, which complicates the logic and causes some necessary
dependencies to be missing. This patch introduces a new class to manage
loop-carried memory dependencies to simplify the logic. The ultimate
goal is to add the currently missing dependencies; this patch is a first
step toward that and is not intended to change current behavior. This
patch also adds new tests that show the missed dependencies, which
should be fixed in the future.
Split off from #135148
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This patch moves a process in `addLoopCarriedDependences` that checks
for a loop-carried dependency between two instructions to another
function. This patch is preliminary to a later patch and is not intended
to change current behavior.
Split off from #135148
MachinePipeliner uses AliasAnalysis to collect loop-carried memory
dependencies. To analyze loop-carried dependencies, we need to
explicitly tell AliasAnalysis that the values may come from different
iterations. Before this patch, MachinePipeliner didn't do this, so some
loop-carried dependencies might be missed. For example, in the following
case, there is a loop-carried dependency from the load to the store, but
it wasn't considered.
```
def @f(ptr noalias %p0, ptr noalias %p1) {
entry:
br label %body
body:
%idx0 = phi ptr [ %p0, %entry ], [ %p1, %body ]
%idx1 = phi ptr [ %p1, %entry ], [ %p0, %body ]
%v0 = load %idx0
...
store %v1, %idx1
...
}
```
Further, the handling of the underlying objects was not sound. If there
is no information about memory operands (i.e., `memoperands()` is
empty), it must be handled conservatively. However, MachinePipeliner
used a dummy value (namely `UnknownValue`). It was distinguished from
other "known" objects, causing necessary dependencies to be missed.
(NOTE: in such cases, `buildSchedGraph` adds non-loop-carried
dependencies correctly, so perhaps a critical problem has not occurred.)
This patch fixes the above problems. The fix introduces false
dependencies that didn't exist before. Therefore, this patch also
introduces additional alias checks with the underlying objects.
Split off from #135148
There was a case where `normalizeNonPipelinedInstructions` didn't
schedule unpipelineable instructions correctly, which could generate
illegal code. This patch fixes this issue by rejecting the schedule if
we fail to insert the unpipelineable instructions at stage 0.
Here is a part of the debug output for `sms-unpipeline-insts3.mir`
before applying this patch.
```
SU(0): %27:gpr32 = PHI %21:gpr32all, %bb.3, %28:gpr32all, %bb.4
Successors:
SU(14): Data Latency=0 Reg=%27
SU(15): Anti Latency=1
...
SU(14): %41:gpr32 = ADDWrr %27:gpr32, %12:gpr32common
Predecessors:
SU(0): Data Latency=0 Reg=%27
SU(16): Ord Latency=0 Artificial
Successors:
SU(15): Data Latency=1 Reg=%41
SU(15): %28:gpr32all = COPY %41:gpr32
Predecessors:
SU(14): Data Latency=1 Reg=%41
SU(0): Anti Latency=1
SU(16): %30:ppr = WHILELO_PWW_S %27:gpr32, %15:gpr32, implicit-def $nzcv
Predecessors:
SU(0): Data Latency=0 Reg=%27
Successors:
SU(14): Ord Latency=0 Artificial
...
Do not pipeline SU(16)
Do not pipeline SU(1)
Do not pipeline SU(0)
Do not pipeline SU(15)
Do not pipeline SU(14)
SU(0) is not pipelined; moving from cycle 19 to 0 Instr: ...
SU(1) is not pipelined; moving from cycle 10 to 0 Instr: ...
SU(15) is not pipelined; moving from cycle 28 to 19 Instr: ...
SU(16) is not pipelined; moving from cycle 19 to 0 Instr: ...
Schedule Found? 1 (II=10)
...
cycle 9 (1) (14) %41:gpr32 = ADDWrr %27:gpr32, %12:gpr32common
cycle 9 (1) (15) %28:gpr32all = COPY %41:gpr32
```
The SUs are traversed in the order of the original basic block, so in
this case the new cycle of each instruction is determined in the order
`SU(0)`, `SU(1)`, `SU(14)`, `SU(15)`, `SU(16)`. Since there is an
artificial dependence from `SU(16)` to `SU(14)`, which contradicts
the original SU order, the new cycle of `SU(14)` must be greater than or
equal to the cycle of `SU(16)` at that time. This results in the failure
to schedule `SU(14)` at stage 0. For now, we reject the schedule in
such cases.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
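The shape of the replacement can be shown with a toy container. `ToyRangeSet` below is a hypothetical wrapper around `std::set`, written only to illustrate how a range-based `insert_range` relates to the iterator-pair `insert`; it is not one of the LLVM containers named above.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Toy set wrapper used only for illustration.
template <typename T> class ToyRangeSet {
  std::set<T> Elems;

public:
  // Iterator-pair form, as in the code being replaced.
  template <typename IterT> void insert(IterT First, IterT Last) {
    Elems.insert(First, Last);
  }

  // Range form: takes the whole range and forwards to the iterator-pair
  // overload, so the caller no longer spells begin()/end().
  template <typename RangeT> void insert_range(const RangeT &R) {
    insert(std::begin(R), std::end(R));
  }

  std::size_t size() const { return Elems.size(); }
  bool contains(const T &V) const { return Elems.count(V) != 0; }
};
```

With this shape, `Dest.insert(Src.begin(), Src.end())` and `Dest.insert_range(Src)` insert the same elements.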
The previous implementation had false positive/negative cases in the
analysis of loop-carried dependencies.
A missed dependency was caused by incorrect analysis of address
increments. This is fixed by strict analysis of recursive definitions.
See added test swp-carried-dep4.mir.
Excessive dependency detection is fixed by improving the formula
for determining the overlap of the address ranges to be accessed. See
added test swp-carried-dep5.mir.
The createSIMachineScheduler & createPostMachineScheduler
target hooks are currently placed in the PassConfig interface.
Move them out to TargetMachine so that both the legacy and
the new pass manager can use them.
…e Graph
In MachinePipeliner, a DAG class is used to represent the Data
Dependence Graph. A Data Dependence Graph generally contains cycles, so
it's not appropriate to use DAG classes. In fact, some "hacks" are used
to express back-edges in the current implementation. This patch adds a
new class to provide a better interface for manipulating dependencies.
Our approach is as follows:
- To build the graph, we use the ScheduleDAGInstrs class as it is,
because it has powerful functions and the current implementation depends
heavily on it.
- After the graph construction is finished (i.e., during scheduling), we
use the new class DataDependenceGraph to manipulate the dependencies.
Since we don't change the dependencies during scheduling, the new class
only provides functions to read them. Also, this patch is just a
refactoring, i.e., scheduling results should not change with or without
this patch.
We used to skip fixed registers, but that is not enough because some
registers are unusable at run time, such as registers reserved by
`-ffixed-xxx` options.
We now use reserved registers instead so that the estimated
pressure is more accurate.
`RegisterClassInfo::getRegPressureSetLimit` is a wrapper around
`TargetRegisterInfo::getRegPressureSetLimit` with some logic to
adjust the limit by removing reserved registers.
It seems that we shouldn't use
`TargetRegisterInfo::getRegPressureSetLimit` directly, as the comment
"This limit must be adjusted dynamically for reserved registers" says.
Thus we should use `RegisterClassInfo::getRegPressureSetLimit` and
remove the replicated code.
Separate from https://github.com/llvm/llvm-project/pull/118787
The code was passing a physical register directly to getPressureSets
which expects a register unit.
Fix this by looping over the register units and calling getPressureSets
for each of them.
Found while trying to add a RegisterUnit class to stop storing register
units in `Register`. 0 is a valid register unit but not a valid
Register.
dependencies in same cycle
Dependency checks were insufficient when reordering instructions with
physical register dependencies (i.e. Anti/Output dependencies). This
could result in generating incorrect code.
- Add `LiveIntervalsAnalysis`.
- Add `LiveIntervalsPrinterPass`.
- Use `LiveIntervalsWrapperPass` in legacy pass manager.
- Use `std::unique_ptr` instead of raw pointer for `LICalc`, so
destructor and default move constructor can handle it correctly.
This would be the last analysis required by `PHIElimination`.
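The `std::unique_ptr` point can be illustrated with a small stand-alone class. This is a sketch under assumed names: `RangeCalc` and `Intervals` are hypothetical stand-ins for the helper and its owner, not the LLVM classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical helper object owned by the analysis result.
struct RangeCalc {
  int NumUpdates = 0;
};

// Hypothetical owner. With a unique_ptr member, the implicit destructor
// frees the helper, and the defaulted move operations transfer ownership
// without any hand-written cleanup.
class Intervals {
  std::unique_ptr<RangeCalc> Calc;

public:
  Intervals() : Calc(std::make_unique<RangeCalc>()) {}
  Intervals(Intervals &&) = default;
  Intervals &operator=(Intervals &&) = default;

  bool hasCalc() const { return Calc != nullptr; }
};
```

With a raw owning pointer, a defaulted move constructor would copy the pointer and leave two objects believing they own the same helper; with `std::unique_ptr` the moved-from object holds a null pointer.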
…ain cases" (#97246)
This reverts commit e6a961dbef773b16bda2cebc4bf9f3d1e0da42fc.
There is no difference from the original change. I re-ran the failed
test and it passed. So the failure wasn't caused by this change.
test result: https://lab.llvm.org/buildbot/#/builders/176/builds/585
when scheduling
When scheduling an instruction, if some of its predecessors and some of
its successors are both already scheduled, `SchedStart` isn't
taken into account. This may result in generating incorrect code. This
patch fixes the problem. Also, this patch merges `SchedStart` into
`EarlyStart` (same for `SchedEnd`).
Fixes https://github.com/llvm/llvm-project/issues/93936
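As a rough illustration of how a scheduling window can be bounded by both already-scheduled predecessors and successors, the sketch below uses the textbook swing modulo scheduling bounds. It is not the code from this patch: `ScheduledNeighbor`, `Window`, and `computeWindow` are hypothetical names, and the per-edge latency and distance fields are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical description of an already-scheduled neighbor and the edge
// that connects it to the instruction being placed.
struct ScheduledNeighbor {
  int Cycle;    // cycle assigned to the neighbor
  int Latency;  // latency of the connecting edge
  int Distance; // iteration distance of the connecting edge
};

struct Window {
  int EarlyStart; // lower bound derived from scheduled predecessors
  int LateStart;  // upper bound derived from scheduled successors
};

// Both bounds are always computed, so neither side is ignored when
// predecessors and successors are scheduled at the same time.
Window computeWindow(const std::vector<ScheduledNeighbor> &Preds,
                     const std::vector<ScheduledNeighbor> &Succs, int II) {
  assert(II > 0 && "initiation interval must be positive");
  Window W = {INT_MIN, INT_MAX};
  for (const ScheduledNeighbor &P : Preds)
    W.EarlyStart =
        std::max(W.EarlyStart, P.Cycle + P.Latency - P.Distance * II);
  for (const ScheduledNeighbor &S : Succs)
    W.LateStart =
        std::min(W.LateStart, S.Cycle - S.Latency + S.Distance * II);
  return W;
}
```

If `EarlyStart` ends up greater than `LateStart`, the window is empty and the instruction cannot be placed at the current II.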
This commit implements the Window Scheduler as described in the RFC:
https://discourse.llvm.org/t/rfc-window-scheduling-algorithm-for-machinepipeliner-in-llvm/74718
This Window Scheduler implements the window algorithm designed by
Steven Muchnick in the book "Advanced Compiler Design And
Implementation", with some improvements:
1. Make three copies of the loop kernel and construct the corresponding DAG
to identify dependencies between MIs;
2. Use a heuristic algorithm to obtain a set of window offsets.
The window algorithm is equivalent to a modulo scheduling algorithm with a
stage of 2. It is mainly applied on targets where hardware resource
conflicts are severe and the SMS algorithm often fails.
On our own DSA, this window algorithm can typically achieve a
performance improvement of over 10%.
Co-authored-by: Kai Yan <aklkaiyan@tencent.com>
Co-authored-by: Ran Xiao <lennyxiao@tencent.com>
Modulo variable expansion is a technique that resolves overlap of
variable lifetimes by unrolling. The existing implementation resolves it
by making copies with move instructions for processors with ordinary
registers such as Arm and x86. This method may result in a very large
number of move instructions, which can cause performance problems.
Modulo variable expansion is enabled by specifying -pipeliner-mve-cg. A
backend must implement some newly defined interfaces in
PipelinerLoopInfo. They were implemented for AArch64.
Discourse thread:
https://discourse.llvm.org/t/implementing-modulo-variable-expansion-for-machinepipeliner
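For intuition, the unroll factor in the textbook formulation of modulo variable expansion is the largest number of iterations during which any single value stays live. The sketch below computes that number; it is an illustration of the general technique, not the AArch64 implementation, and `computeUnrollCount` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Lifetimes holds, for each loop variable, the number of cycles its value
// stays live. A value that lives longer than II cycles overlaps with the
// same variable from a later iteration, so the kernel must be copied
// ceil(Lifetime / II) times for the copies to use distinct registers.
int computeUnrollCount(const std::vector<int> &Lifetimes, int II) {
  assert(II > 0 && "initiation interval must be positive");
  int Count = 1; // at least one copy of the kernel
  for (int L : Lifetimes) {
    assert(L >= 0 && "lifetimes are non-negative");
    Count = std::max(Count, (L + II - 1) / II); // integer ceiling division
  }
  return Count;
}
```

For example, with lifetimes of 3, 7, and 4 cycles and an II of 3, the longest-lived value spans three iterations, so three kernel copies are needed.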
Prepare for new pass manager version of `MachineDominatorTreeAnalysis`.
We may need a machine dominator tree version of `DomTreeUpdater` to
handle `SplitCriticalEdge` in some CodeGen passes.
By default, the scheduling info for instructions in a BUNDLE gives them a
latency of 0, as they operate on the implicit register of the bundle.
This patch changes that for AArch64 so that the latency is taken from
the instruction in the bundle instead. This essentially
assumes that the bundled instructions are executed in a single cycle,
which for AArch64 is probably OK considering they are mostly used for
MOVPFX bundles, where this can help create slightly better scheduling,
especially for in-order cores.