Remove four global variables that are set but never read to fix
-Wunused-but-set-global warnings:
- `MFMAChainLength` in AMDGPUIGroupLP.cpp
- `Wide` in llvm-objdump.cpp
- `SaveTemps` in ClangSYCLLinker.cpp
- `DeprecatedDriverCommand` in ClangScanDeps.cpp
Follow up to #178342
In the greedy pipeline solver, the group cost is found using the
addEdges function and the edges must be removed from the DAG after
processing each group. The best group edges are then reinserted using
the same function. This repeats the costly reachability checks inside
the function which become problematic for pipelines with many
SchedGroups.
The algorithm is changed to remember the best group edges instead of
recomputing them. Additionally, SchedGroup::tryAddEdge is refactored to
avoid a redundant cycle check which is already performed by DAG->addEdge.
There are three overloaded SchedGroup::initSchedGroup functions, two of
which are only used for specific types of SchedGroups, namely
SCHED_BARRIER and SCHED_GROUP_BARRIER. This seems to have a led to some
confusion since the different functions perform checks which are not
needed for their intended restricted use cases. Furthermore, there are
several wrong comments surrounding those functions.
Simplify the functions and inline the actual initialization parts of the
SCHED_BARRIER and SCHED_GROUP_BARRIER variants at their only call sites.
Extract a function that finds the candidate SUnits for a given
SchedGroup and use this instead of initSchedGroup. Fix comments.
llvm::count_if calls std::count_if which returns a difference_type.
difference_type is always signed but is never going to be a negative
value when used as the result of count_if.
This resulted in warnings in our 32-bit Arm builds like:
```
AMDGPUIGroupLP.cpp:1050:20: warning: comparison of integers of different signs:
'typename iterator_traits<const SDep *>::difference_type' (aka 'int') and 'unsigned int' [-Wsign-compare]
1050 | if (SuccSize >= Size)
| ~~~~~~~~ ^ ~~~~
```
I presume these warnings are not generated in 64-bit builds because
unsigned is 32-bit even for 64-bit platforms and there is no risk in
extending 32-bit unsigned into 64-bit signed.
To fix the warning I've changed the type of SuccSize to unsigned, and
the assignment acts like a static_cast into that type.
This assert was failing in a fuzzing test. I consulted with @jrbyrnes
who said:
The MFMASmallGemmSingleWaveOpt::apply() method is invoked if and only if
the user has inserted an intrinsic llvm.amdgcn.iglp.opt(i32 1) into
their source code. This intrinsic applies a highly specialized DAG
mutation to result in specific scheduling for a specific set of kernels.
These assertions are really just confirming that the characteristics of
the kernel match what is expected (i.e. The kernels are similar to the
ones this DAG mutation strategy were designed against).
However, if we apply this DAG mutation to kernels for which is was not
designed, then we may not find the types of instructions we are looking
for, and may end up with empty caches.
I think it should be fine to just return false if the cache is empty
instead of the assert.
We want special handing for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
IGLP itself will be in SavedMutations via mutations added during
Scheduler creation, thus falling back results in reapplying IGLP.
In PostRA scheduling, if we have multiple regions with IGLP
instructions, then we may have infinite loop.
Disable the feature for now.
This implements the basic pipelining structure of exp/mfma interleaving
for better extensibility. While it does have improved extensibility,
there are controls which only enable it for DAGs with certain
characteristics (matching the DAGs it has been designed against).
We can also re-enter IGLP mutation via later `SchedStage`s in the
`GCNMaxOccupancySchedStrategy` . This is sort of NFC in that there is no
changed behavior for the only current client of `IsReentry`
I believe this assert was trying to check that 3 variables were equal to
0.
I think it instead got interpreted as ((DSWCount == DSWWithPermCount) ==
DSWWithSharedVMEMCount) == 0 I guess (DSWCount == DSWWithPermCount) was
true because both counts were 0. Then true got compared to
DSWWithSharedVMEMCount, and since DSWWithSharedVMEMCount is 0, that
compare was false. And then that false compared equal to the final 0.
Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose.
This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline.
Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).
The sync pipeline should always contain the candidate ID. If it doesn't
something's gone awry. assert on that.
Reviewed by: jrbyrnes
Differential Revision: https://reviews.llvm.org/D158845
This adds the IGLP strategy for single-wave gemms. The SchedGroup pipeline is laid out in multiple phases, with each phase corresponding to a distinct pattern present in gemm kernels. The resilience of the optimization is dependent upon IR (as seen by pre-RA scheduling) continuing to have these patterns (as defined by instruction class and dependencies) in their current relative ordering.
The kernels of interest have these specific phases:
NT: 1, 2a, 2c
NN: 1, 2a, 2b
TT: 1, 2b, 2c
TN: 1, 2b
The general approach taken was to have a long SchedGroup pipeline. In this way the scheduler will have less capability of doing the wrong thing. In order to resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver.
Differential Revision: https://reviews.llvm.org/D149773
Change-Id: I1a35962a95b4bdf740602b8f110d3297c6fb9d96
Currently the PipelineSolver processes SchedGroups in bottom up manner. However, there is no compelling reason to require this. Providing the option to toggle this affords greater experimentation capability, and make usage a bit more intuitive. Importantly, it makes designing rules much easier.
Differential Revision: https://reviews.llvm.org/D149393
Change-Id: Ic4abd3408f9faa105c0eef72eab7873d46083ee4