
Relands #104763 with:
- Fixes for the EXPENSIVE_CHECKS test failure (caused by the sorting operator failing if the input is shuffled first)
- A fix for broken proposal selection
- c3cb27370af40e491446164840766478d3258429 included

Original commit description below

---

Major rewrite of the AMDGPUSplitModule pass in order to better support it long-term.

Highlights:
- Removal of the "SML" logging system in favor of just using CL options and LLVM_DEBUG, like any other pass in LLVM.
  - The SML system started from good intentions, but it was too flawed and messy to be of any real use. It was also a real pain to use and made the code more annoying to maintain.
- Graph-based module representation with DOTGraph printing support
  - The graph represents the module accurately, with bidirectional, typed edges between nodes (a node usually represents one function).
  - Nodes are assigned IDs starting from 0, which allows us to represent a set of nodes as a BitVector. This makes comparing two sets of nodes to find common dependencies a trivial task. Merging two clusters of nodes together is also trivial.
- No more defaulting to "P0" for external calls
  - Roots that can reach non-copyable dependencies (such as external calls) are now grouped together in a single "cluster" that can go into any partition.
- No more defaulting to "P0" for indirect calls
- New representation for module splitting proposals that can be graded and compared.
- Graph-search algorithm that can explore multiple branches/assignments for a cluster of functions, up to a maximum depth.
  - With the default max depth of 8, we can create up to 256 proposals to try and find the best one.
  - We can still fall back to a greedy approach upon reaching max depth. That greedy approach uses almost identical heuristics to the previous version of the pass.

All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other types of relations between nodes, etc.

I also designed the graph representation and the splitting strategies to be as fast as possible, and it seems to have paid off. Some quick tests showed that we spend pretty much all of our time in the CloneModule function, with the actual splitting logic being <1% of the runtime.
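The BitVector point above is easy to illustrate. Below is a minimal, self-contained sketch; `NodeSet` is a hypothetical stand-in that mirrors a small subset of `llvm::BitVector`'s interface (`set`, `test`, `anyCommon`, `operator|=`, `count`) on top of `std::vector<std::uint64_t>`, and is not the pass's actual code. Because node IDs are dense and start at 0, finding a common dependency reduces to a word-wise AND, and merging two clusters is a word-wise OR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llvm::BitVector: bit I is set when the node whose
// ID is I belongs to the set. Both operands of a binary operation must have
// been created with the same node count.
class NodeSet {
public:
  explicit NodeSet(std::size_t NumNodes) : Words((NumNodes + 63) / 64, 0) {}

  void set(std::size_t ID) { Words[ID / 64] |= std::uint64_t{1} << (ID % 64); }

  bool test(std::size_t ID) const {
    return (Words[ID / 64] >> (ID % 64)) & 1U;
  }

  // Common dependencies: word-wise AND, stopping at the first shared bit.
  bool anyCommon(const NodeSet &Other) const {
    assert(Words.size() == Other.Words.size() && "mismatched set sizes");
    for (std::size_t I = 0; I < Words.size(); ++I)
      if (Words[I] & Other.Words[I])
        return true;
    return false;
  }

  // Merging two clusters: word-wise OR into this set.
  NodeSet &operator|=(const NodeSet &Other) {
    assert(Words.size() == Other.Words.size() && "mismatched set sizes");
    for (std::size_t I = 0; I < Words.size(); ++I)
      Words[I] |= Other.Words[I];
    return *this;
  }

  // Number of nodes in the set (clears the lowest set bit until none remain).
  std::size_t count() const {
    std::size_t N = 0;
    for (std::uint64_t W : Words)
      for (; W != 0; W &= W - 1)
        ++N;
    return N;
  }

private:
  std::vector<std::uint64_t> Words;
};
```

For example, with an illustrative numbering of the test functions below (B=1, C=2, HelperB=5, HelperC=6, HelperD=7), B's direct dependencies {1, 5, 6, 7} and C's {2, 6} share HelperC, and merging them yields a five-node cluster.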
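The description only states that the search explores several assignments per cluster up to a maximum depth and then falls back to a greedy approach. The sketch below is a hypothetical illustration of that shape, not the pass's heuristics: it assumes each cluster has an integer cost, grades a proposal by its most expensive partition (lower is better), and branches two ways per cluster (the two least-loaded partitions). With two branches per level, a max depth of 8 yields at most 2^8 = 256 complete proposals, which is consistent with the figure quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical proposal model: PartOf[Cluster] is the chosen partition and
// Load[P] the summed cost of partition P. Graded by the heaviest partition.
struct Proposal {
  std::vector<std::size_t> PartOf;
  std::vector<std::uint64_t> Load;
  std::uint64_t grade() const {
    return *std::max_element(Load.begin(), Load.end());
  }
};

// Partition indices sorted by load (ties keep index order), truncated to K.
static std::vector<std::size_t>
leastLoaded(const std::vector<std::uint64_t> &Load, std::size_t K) {
  std::vector<std::size_t> Idx(Load.size());
  for (std::size_t I = 0; I < Idx.size(); ++I)
    Idx[I] = I;
  std::stable_sort(Idx.begin(), Idx.end(), [&](std::size_t A, std::size_t B) {
    return Load[A] < Load[B];
  });
  Idx.resize(std::min(K, Idx.size()));
  return Idx;
}

struct SearchState {
  const std::vector<std::uint64_t> &Costs;
  std::size_t MaxDepth;
  Proposal Best;
  bool HasBest;
  std::size_t NumProposals; // complete proposals graded so far

  void consider(const Proposal &P) {
    ++NumProposals;
    if (!HasBest || P.grade() < Best.grade()) {
      Best = P;
      HasBest = true;
    }
  }

  // One search level per cluster: branch until MaxDepth, then go greedy.
  void explore(Proposal &Cur, std::size_t Cluster) {
    if (Cluster == Costs.size()) {
      consider(Cur);
      return;
    }
    if (Cluster == MaxDepth) {
      // Greedy fallback: each remaining cluster goes to the lightest partition.
      Proposal Greedy = Cur;
      for (std::size_t C = Cluster; C < Costs.size(); ++C) {
        std::size_t P = leastLoaded(Greedy.Load, 1)[0];
        Greedy.PartOf[C] = P;
        Greedy.Load[P] += Costs[C];
      }
      consider(Greedy);
      return;
    }
    for (std::size_t P : leastLoaded(Cur.Load, 2)) {
      Cur.PartOf[Cluster] = P;
      Cur.Load[P] += Costs[Cluster];
      explore(Cur, Cluster + 1);
      Cur.Load[P] -= Costs[Cluster]; // undo before trying the next branch
    }
  }
};

Proposal findBestProposal(const std::vector<std::uint64_t> &Costs,
                          std::size_t NumParts, std::size_t MaxDepth,
                          std::size_t *NumProposals = nullptr) {
  assert(NumParts > 0);
  SearchState S{Costs, MaxDepth, Proposal{}, false, 0};
  Proposal Start{std::vector<std::size_t>(Costs.size(), 0),
                 std::vector<std::uint64_t>(NumParts, 0)};
  S.explore(Start, 0);
  if (NumProposals)
    *NumProposals = S.NumProposals;
  return S.Best;
}
```

Undoing the load after each recursive call keeps a single working proposal instead of copying one per branch; only the greedy fallback and the best-so-far proposal are copied.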
77 lines
1.9 KiB
LLVM
; RUN: llvm-split -o %t %s -j 3 -mtriple amdgcn-amd-amdhsa
; RUN: llvm-dis -o - %t0 | FileCheck --check-prefix=CHECK0 --implicit-check-not=define %s
; RUN: llvm-dis -o - %t1 | FileCheck --check-prefix=CHECK1 --implicit-check-not=define %s
; RUN: llvm-dis -o - %t2 | FileCheck --check-prefix=CHECK2 --implicit-check-not=define %s

; We have 4 kernels:
; - Each kernel has an internal helper
; - @A and @B's helpers do an indirect call.
;
; A/B are grouped into the same partition (P2 here, see CHECK2), alongside
; all helpers that have their address taken.
; The other kernels can still go into separate partitions.
;
; Note that dependency discovery shouldn't stop upon finding an
; indirect call. HelperC/D should also end up in P2 as they
; are dependencies of HelperB.

; CHECK0: define internal void @HelperD
; CHECK0: define amdgpu_kernel void @D

; CHECK1: define internal void @HelperC
; CHECK1: define amdgpu_kernel void @C

; CHECK2: define hidden void @HelperA
; CHECK2: define hidden void @HelperB
; CHECK2: define hidden void @CallCandidate
; CHECK2: define internal void @HelperC
; CHECK2: define internal void @HelperD
; CHECK2: define amdgpu_kernel void @A
; CHECK2: define amdgpu_kernel void @B

@addrthief = global [3 x ptr] [ptr @HelperA, ptr @HelperB, ptr @CallCandidate]

define internal void @HelperA(ptr %call) {
  call void %call()
  ret void
}

define internal void @HelperB(ptr %call) {
  call void @HelperC()
  call void %call()
  call void @HelperD()
  ret void
}

define internal void @CallCandidate() {
  ret void
}

define internal void @HelperC() {
  ret void
}

define internal void @HelperD() {
  ret void
}

define amdgpu_kernel void @A(ptr %call) {
  call void @HelperA(ptr %call)
  ret void
}

define amdgpu_kernel void @B(ptr %call) {
  call void @HelperB(ptr %call)
  ret void
}

define amdgpu_kernel void @C() {
  call void @HelperC()
  ret void
}

define amdgpu_kernel void @D() {
  call void @HelperD()
  ret void
}
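To make the test's dependency-discovery rule concrete, here is a minimal C++ sketch over a hypothetical model of the call graph above (it is not the pass's API). It assumes, as the test comment suggests, that an indirect call may reach any address-taken function (here, the three functions stored in `@addrthief`), and it keeps walking a node's direct callees even when that node also makes an indirect call, which is how HelperC and HelperD are still found as dependencies of HelperB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node model. IDs used by buildTestGraph():
// 0 A, 1 B, 2 C, 3 D, 4 HelperA, 5 HelperB, 6 HelperC, 7 HelperD,
// 8 CallCandidate.
struct Node {
  std::vector<std::size_t> DirectCallees;
  bool HasIndirectCall = false;
  bool AddressTaken = false;
};

// Returns a mask of every node reachable from Root. An indirect call is
// treated as a possible call to every address-taken function; it adds edges
// rather than ending the walk, so direct callees are always visited too.
std::vector<bool> collectDependencies(const std::vector<Node> &G,
                                      std::size_t Root) {
  assert(Root < G.size());
  std::vector<std::size_t> AddressTaken;
  for (std::size_t I = 0; I < G.size(); ++I)
    if (G[I].AddressTaken)
      AddressTaken.push_back(I);

  std::vector<bool> Seen(G.size(), false);
  std::vector<std::size_t> Worklist{Root};
  Seen[Root] = true;
  while (!Worklist.empty()) {
    std::size_t N = Worklist.back();
    Worklist.pop_back();
    auto Visit = [&](std::size_t Callee) {
      if (!Seen[Callee]) {
        Seen[Callee] = true;
        Worklist.push_back(Callee);
      }
    };
    for (std::size_t Callee : G[N].DirectCallees)
      Visit(Callee);
    if (G[N].HasIndirectCall)
      for (std::size_t Callee : AddressTaken)
        Visit(Callee);
  }
  return Seen;
}

// Mirrors the IR above: direct calls, the two indirect calls, and @addrthief.
std::vector<Node> buildTestGraph() {
  std::vector<Node> G(9);
  G[0].DirectCallees = {4};    // A -> HelperA
  G[1].DirectCallees = {5};    // B -> HelperB
  G[2].DirectCallees = {6};    // C -> HelperC
  G[3].DirectCallees = {7};    // D -> HelperD
  G[4].HasIndirectCall = true; // HelperA: call void %call()
  G[5].DirectCallees = {6, 7}; // HelperB -> HelperC, HelperD
  G[5].HasIndirectCall = true; // HelperB: call void %call()
  G[4].AddressTaken = G[5].AddressTaken = G[8].AddressTaken = true;
  return G;
}
```

Under these assumptions, A and B both reach HelperA, HelperB, CallCandidate, HelperC and HelperD, which is consistent with the CHECK2 lines placing them in one partition, while C and D only need their own internal helper.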