This fixes a bug where a zero-cost instruction was hoisted to the nearest
common dominator even though the hoisted instruction's operands did not
dominate that common dominator, which produced poison values.
When zero-cost instructions were hoisted, the simplifyHoistedPhi function
set incoming phi values that did not dominate their use, causing runtime
failures. The rebuildSSA function then replaced these values with poison.
This commit fixes the issue.
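The fix amounts to a legality check before hoisting. A minimal sketch, assuming a toy dominator tree stored as a map from each block to its immediate dominator; `dominates` and `canHoistTo` are hypothetical helpers, not StructurizeCFG's actual functions:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical toy dominator tree: block -> immediate dominator, with the
// entry block mapped to "". Not LLVM's DominatorTree.
using IDomMap = std::map<std::string, std::string>;

// Returns true if block A dominates block B (every block dominates itself).
bool dominates(const IDomMap &IDom, const std::string &A, std::string B) {
  while (!B.empty()) {
    if (B == A)
      return true;
    auto It = IDom.find(B);
    if (It == IDom.end())
      return false;
    B = It->second; // walk one step up the dominator tree
  }
  return false;
}

// An instruction may be hoisted to the end of Target only if every operand
// is defined in a block that dominates Target; otherwise the moved
// instruction would use a value that is not available there.
bool canHoistTo(const IDomMap &IDom, const std::string &Target,
                const std::vector<std::string> &OperandDefBlocks) {
  for (const std::string &DefBB : OperandDefBlocks)
    if (!dominates(IDom, DefBB, Target))
      return false;
  return true;
}
```

With `then` and `else` both immediately dominated by `if`, an instruction whose operand is defined in `else` cannot be hoisted to `if`, while one whose operands come from `entry` can.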
We just replaced SmallSet<T *, N> with SmallPtrSet<T *, N>, bypassing
the redirection found in SmallSet.h. With that, we no longer need to
include SmallSet.h in many files.
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
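The redirection relies on C++ partial template specialization. A minimal, self-contained sketch, assuming simplified stand-ins that wrap standard containers (the real `llvm::SmallSet` and `llvm::SmallPtrSet` also keep up to N elements inline, which is omitted here):

```cpp
#include <set>
#include <type_traits>
#include <unordered_set>

// Simplified, hypothetical stand-in for llvm::SmallPtrSet (no inline storage).
template <typename PtrType, unsigned N>
class SmallPtrSet : public std::unordered_set<PtrType> {};

// Primary template: a general-purpose set for any element type.
template <typename T, unsigned N>
class SmallSet : public std::set<T> {};

// Partial specialization for pointer element types, mirroring SmallSet.h:
// SmallSet<T *, N> simply inherits everything from SmallPtrSet<T *, N>.
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType *, N> : public SmallPtrSet<PointeeType *, N> {};

// Pointer sets go through the redirection; other element types do not.
static_assert(std::is_base_of<SmallPtrSet<int *, 8>, SmallSet<int *, 8>>::value,
              "SmallSet<T *, N> should derive from SmallPtrSet<T *, N>");
static_assert(!std::is_base_of<SmallPtrSet<int *, 8>, SmallSet<int, 8>>::value,
              "SmallSet<int, N> should use the primary template");
```

Writing `SmallPtrSet<T *, N>` directly names the same implementation without routing through the specialization, which is what the patch does at each use site.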
…hi values (#139605)"
This relands commit b11523b494b with a fix for the llvm-buildbot failures
"clang-hip-vega20" and "openmp-offload-amdgpu-runtime-2". The reland
fixes them by preventing the phi node from being hoisted.
Original PR description:
The order of the if and else blocks can introduce unnecessary VGPR copies.
Consider an if-else where the phi's incoming value from the 'Else' block
contains only zero-cost instructions, and the 'Then' block modifies some
value. Before structurization there is no interference when coalescing,
because only one value is live at any point. However, in the structurized
CFG the 'Then' value is live in the 'Else' block because of the path
if→flow→else, which leads to additional VGPR copies.
This patch addresses the issue by:
- Identifying PHI nodes with zero-cost incoming values from the Else
block and hoisting those values to the nearest common dominator of the
Then and Else blocks.
- Updating Flow PHI nodes by replacing poison entries (on the if→flow
edge) with the correct hoisted values.
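Finding the hoisting target is a nearest-common-dominator query. A minimal sketch, assuming a toy dominator tree given as a block-to-immediate-dominator map; `nearestCommonDominator` and `elseValuePlacement` are hypothetical names (LLVM's `DominatorTree` offers `findNearestCommonDominator` for the same query):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical toy dominator tree: block -> immediate dominator ("" for entry).
using IDomMap = std::map<std::string, std::string>;

// Collect A and all of its dominators by walking up the idom chain.
static std::set<std::string> dominatorsOf(const IDomMap &IDom, std::string A) {
  std::set<std::string> Doms;
  while (!A.empty()) {
    Doms.insert(A);
    auto It = IDom.find(A);
    A = It == IDom.end() ? "" : It->second;
  }
  return Doms;
}

// The nearest common dominator of A and B is the first block on B's idom
// chain that also dominates A. Returns "" if they share no dominator.
std::string nearestCommonDominator(const IDomMap &IDom, const std::string &A,
                                   std::string B) {
  std::set<std::string> DomsOfA = dominatorsOf(IDom, A);
  while (!B.empty() && !DomsOfA.count(B)) {
    auto It = IDom.find(B);
    B = It == IDom.end() ? "" : It->second;
  }
  return B;
}

// Where the Else-side incoming value of a merge phi should live: hoisted to
// the common dominator of Then and Else if it is zero-cost, otherwise left
// in Else. Operand dominance must still be checked separately.
std::string elseValuePlacement(const IDomMap &IDom, const std::string &Then,
                               const std::string &Else, bool ZeroCost) {
  return ZeroCost ? nearestCommonDominator(IDom, Then, Else) : Else;
}
```

Once the value lives in the common dominator, the Flow phi can take it on the if→flow edge instead of poison, so only one copy is live across the Else block.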
Flow blocks are generated code that doesn't correspond to any location
in the source, so in principle they should have empty DebugLocs. In
practice, setting these DebugLocs leads to redundant is_stmts being
generated after #108251, causing stepping test failures in the ROCm GDB
test suite.
Fixes SWDEV-502134
We can use *Set::insert_range to collapse:
for (auto Elem : Range)
  Set.insert(Elem.first);
down to:
Set.insert_range(llvm::make_first_range(Range));
In some cases, we can further fold that into the set declaration.
Previously, the TermDL (BB terminator → DebugLoc) map was initialized at
the start of processing each function's region, creating entries for the
entire function. This could be inefficient for large functions.
This patch improves performance by creating map entries only when
needed: when a terminator is being killed or when a flow block is
created. Additionally, entries are removed immediately after use,
preventing unnecessary map growth and ensuring DebugLocs are not
"retracked."
A mapless variant was also explored, but due to limited familiarity with
the structurizer, it was not pursued further.
In the cases I measured, this change improves performance by 2-3×.
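A minimal sketch of the lazily populated map, assuming blocks are identified by name and a DebugLoc is modeled as an integer; `TermDLCache` and its methods are hypothetical, not the structurizer's code:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a DebugLoc.
using DebugLocId = unsigned;

// Entries are created only when a terminator is killed and are erased as
// soon as they are consumed, so the map never covers the whole function.
class TermDLCache {
  std::unordered_map<std::string, DebugLocId> TermDL;

public:
  // Record the location of a terminator that is about to be deleted.
  // If BB already has an entry, the first recorded location is kept.
  void recordKilledTerminator(const std::string &BB, DebugLocId DL) {
    TermDL.emplace(BB, DL);
  }

  // Fetch and forget the saved location so it cannot be "retracked" later.
  std::optional<DebugLocId> take(const std::string &BB) {
    auto It = TermDL.find(BB);
    if (It == TermDL.end())
      return std::nullopt;
    DebugLocId DL = It->second;
    TermDL.erase(It);
    return DL;
  }

  std::size_t size() const { return TermDL.size(); }
};
```

The cost now scales with the number of terminators actually rewritten rather than with the size of the function.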
This just makes it more obvious that having Parent as the single
predecessor is a special case, instead of checking for it in the middle
of a loop that finds the nearest common dominator of multiple
predecessors.
Currently `StructurizeCFG` drops `branch_weights` metadata.
This metadata can be generated from user annotations in the source code
like:
```cpp
if (...) [[likely]] {
}
```
After investigating more while-break cases, I think we should try to
optimize the way we reconstruct phi nodes. Previously, we reconstructed
each phi node separately, but this is not optimal. For example:
```
header:
%v.1 = phi float [ %v, %entry ], [ %v.2, %latch ]
br i1 %cc, label %if, label %latch
if:
%v.if = fadd float %v.1, 1.0
br i1 %cc2, label %latch, label %exit
latch:
%v.2 = phi float [ %v.if, %if ], [ %v.1, %header ]
br i1 %cc3, label %exit, label %header
exit:
%v.3 = phi float [ %v.2, %latch ], [ %v.if, %if ]
```
For this case, we have different copies of value `v`, but there is at
most one copy of value `v` alive at any program point shown above.
The existing SSA reconstruction uses the incoming values from the old
deleted phi. Below is a possible output after SSA reconstruction.
```
header:
%v.1 = phi float [ %v, %entry ], [ %v.loop, %flow1 ]
br i1 %cc, label %if, label %flow
if:
%v.if = fadd float %v.1, 1.0
br label %flow
flow:
%v.exit.if = phi float [ %v.if, %if ], [ undef, %header ]
%v.latch = phi float [ %v.if, %if ], [ %v.1, %header ]
latch:
br label %flow1
flow1:
%v.loop = phi float [ %v.latch, %latch ], [ undef, %flow ]
%v.exit = phi float [ %v.latch, %latch ], [ %v.exit.if, %flow ]
exit:
%v.3 = phi float [ %v.exit, %flow1 ]
```
If we look closely, to reconstruct `v.1`, `v.2`, and `v.3`, we end up
with two simultaneous copies of `v` alive at `flow` and `flow1`.
We rely heavily on the register coalescer to coalesce them together,
but it may not always manage to because of the complexity of the phi
chain.
On the other hand, since there is only one copy of `v` alive at any
program point before the transform, why not simplify the phi network
as much as we can? Look at the incoming values of these PHIs:
```
      header  if    latch
v.1:  --      --    v.2
v.2:  v.1     v.if  --
v.3:  --      v.if  v.2
```
If we let them share the same incoming values for these three incoming
blocks, there would be only one live copy of `v` at any program point
after SSA reconstruction. Something like:
```
header:
%v.1 = phi float [ %v, %entry ], [ %v.2, %flow1 ]
br i1 %cc, label %if, label %flow
if:
%v.if = fadd float %v.1, 1.0
br label %flow
flow:
%v.2 = phi float [ %v.if, %if ], [ %v.1, %header ]
latch:
br label %flow1
flow1:
...
exit:
%v.3 = phi float [ %v.2, %flow1 ]
```
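A minimal sketch of the merging idea, assuming each deleted phi is summarized as a map from incoming block to incoming value, and that a phi is offered for merging with a phi it uses as an incoming value. `PhiMerger` and `mergeIfCompatible` are hypothetical names, and the union-find is a stand-in for whatever grouping structure the pass uses:

```cpp
#include <map>
#include <string>

// Hypothetical model: incoming block name -> incoming value name.
using IncomingMap = std::map<std::string, std::string>;

struct PhiMerger {
  std::map<std::string, std::string> Parent;   // phi -> union-find parent
  std::map<std::string, IncomingMap> Incoming; // class leader -> merged values

  // Every phi must be registered before it is merged.
  void addPhi(const std::string &Phi, const IncomingMap &Values) {
    Parent[Phi] = Phi;
    Incoming[Phi] = Values;
  }

  std::string leader(std::string Phi) {
    while (Parent[Phi] != Phi)
      Phi = Parent[Phi];
    return Phi;
  }

  // Merge the classes of A and B only if, for every incoming block both
  // have an entry for, they agree on the value. Returns whether they merged.
  bool mergeIfCompatible(const std::string &A, const std::string &B) {
    std::string LA = leader(A), LB = leader(B);
    if (LA == LB)
      return true;
    IncomingMap Merged = Incoming[LA];
    for (const auto &[BB, V] : Incoming[LB]) {
      auto It = Merged.find(BB);
      if (It != Merged.end() && It->second != V)
        return false; // conflicting values: keep the phis separate
      Merged.emplace(BB, V);
    }
    Incoming[LA] = Merged; // the whole class now shares one incoming set
    Incoming.erase(LB);
    Parent[LB] = LA;
    return true;
  }
};
```

On the table above, merging `v.2` with `v.1` and then `v.3` with `v.2` succeeds, leaving one class whose shared incoming values are `v.1` from `header`, `v.if` from `if`, and `v.2` from `latch`.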
Some passes only support simple terminators: branch, unreachable, and
return. Currently, they ask the pass manager to add the LowerSwitch pass
to eliminate `switch`. Let's manage this kind of pass dependency
ourselves, and add assertions in the related passes.
Small oversight in https://reviews.llvm.org/D145688: the pass's dependency was not updated to reflect the change to UA.
Also, change DivergenceAnalysis to UniformityAnalysis in a comment. That way, StructurizeCFG only refers to UA and not DA anymore.
During structurization process, we may place non-predecessor blocks
between the predecessors of a block in the structurized CFG. Take
the typical while-break case as an example:
```
/---A(v=...)
|  / \
^ B   C
|  \ /|
\---L |
     \|
      E (r = phi (v:C)...)
```
After structurization, the CFG would look like:
```
/---A
|   |\
|   | C
|   |/
|   F1
^   |\
|   | B
|   |/
|   F2
|   |\
|   | L
\   |/
 \--F3
    |
    E
```
We can see that block B is placed between the predecessors (C and L) of
E. During phi reconstruction, to achieve the same semantics as before, we
reconstruct the PHIs as:
F1: v1 = phi (v:C), (undef:A)
F3: r = phi (v1:F2), ...
But this also means that `v1` would be live through B, which is not
necessary. The idea of this change is to treat the incoming value from B
as undef for the PHI in E. With this change, the reconstructed PHI
would be:
F1: v1 = phi (v:C), (undef:A)
F2: v2 = phi (v1:F1), (undef:B)
F3: r = phi (v2:F2), ...
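A toy sketch of the new rule, assuming the linearized while-break CFG is a chain of flow blocks where F<i> joins the previous flow block (or entry block A for F1) with one side block. `buildFlowPhis` is a hypothetical helper, and `w` stands for the value from L, which the text above elides as "...":

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: Sides[i] is the side block joined by flow block
// F(i+1); ExitIncoming maps each original predecessor of the exit block to
// the value it provides. The last flow phi is named Result.
std::vector<std::string>
buildFlowPhis(const std::vector<std::string> &Sides,
              const std::map<std::string, std::string> &ExitIncoming,
              const std::string &Entry, const std::string &Result) {
  std::vector<std::string> Phis;
  std::string Prev = "undef"; // nothing produced yet on the entry edge
  std::string PrevBlock = Entry;
  for (std::size_t I = 0; I < Sides.size(); ++I) {
    const std::string &Side = Sides[I];
    // A side block that is not a predecessor of the exit contributes undef
    // instead of keeping the earlier value alive through it.
    auto It = ExitIncoming.find(Side);
    std::string SideVal = It != ExitIncoming.end() ? It->second : "undef";
    std::string Flow = "F" + std::to_string(I + 1);
    std::string Name =
        I + 1 == Sides.size() ? Result : "v" + std::to_string(I + 1);
    Phis.push_back(Flow + ": " + Name + " = phi [" + Prev + ", " + PrevBlock +
                   "], [" + SideVal + ", " + Side + "]");
    Prev = Name;
    PrevBlock = Flow;
  }
  return Phis;
}
```

For side blocks C, B, L with values `v` from C and `w` from L, this yields the same three phis as the listing above, with B contributing undef in F2.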
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D132450
The instruction simplification will try to simplify the affected phis.
In some cases, this might extend the liveness of values. For example:
BB0:
| \
| BB1
| /
BB2:phi (BB0, v), (BB1, undef)
The phi in BB2 will be simplified to v because v dominates BB2, but this
increases the number of live values in BB1. Setting CanUseUndef to false
prevents this simplification, which helps register pressure. This is
required by a later change that reduces VGPR pressure for AMDGPU.
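A toy model of the choice CanUseUndef controls, assuming a phi is a list of value names with `"undef"` marking undefined inputs. `simplifyPhi` is a hypothetical function, not InstructionSimplify's interface, and the dominance requirement mentioned above is not modeled:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy: returns the single value the phi can be replaced with,
// or nullopt if it must stay a phi. The dominance check is omitted.
std::optional<std::string> simplifyPhi(const std::vector<std::string> &Incoming,
                                       bool CanUseUndef) {
  std::optional<std::string> Common;
  for (const std::string &V : Incoming) {
    // With CanUseUndef, undef may be assumed to equal any value, so it is
    // skipped; without it, undef is treated as just another distinct value.
    if (CanUseUndef && V == "undef")
      continue;
    if (Common && *Common != V)
      return std::nullopt;
    Common = V;
  }
  return Common;
}
```

With `CanUseUndef = false`, `phi (BB0, v), (BB1, undef)` stays a phi, so `v` is not forced to be live in BB1.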
Reviewed by: foad, sameerds
Differential Revision: https://reviews.llvm.org/D132449
This reverts commit f1b05a0a2bbbea160002be709f8a1c59de366761.
Reverting due to issues identified in testing: the transformation is
incorrect for blocks that contain convergent instructions.
StructurizeCFG linearizes the successors of branching basic block
by adding Flow blocks to record the true/false path for branches
and back edges. This patch reduces the number of Phi values needed
to capture the control flow path by improving the basic block
ordering.
Previously, StructurizeCFG placed loop exit blocks outside of the loop.
It set a boolean value to indicate the path taken, and the live values
of every exit block extended to after the loop. For loops with a large
number of exit blocks, this created a huge number of values to maintain,
which increased compilation time and register pressure. This is
especially a problem with ASAN, which adds early exits to blocks with
unreachable instructions for each instrumented check in the loop.
In specific cases, this patch reduces the number of values needed
after the loop by moving the exit block into the loop. This is
done for blocks that have a single predecessor and single successor
by moving the block to appear just after the predecessor.
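A simplified sketch of the reordering rule, assuming the block order and CFG edges are plain string maps; `moveAfterSinglePred` is a hypothetical name. It ignores the convergent-instruction problem that led to the revert, so it is illustrative only:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical CFG model: block -> list of predecessors or successors.
using Edges = std::map<std::string, std::vector<std::string>>;

// Number of edges recorded for BB (0 if BB has no entry).
static std::size_t countOf(const Edges &E, const std::string &BB) {
  auto It = E.find(BB);
  return It == E.end() ? 0 : It->second.size();
}

// Move every block with exactly one predecessor and one successor so that
// it appears immediately after its predecessor in the block order.
std::vector<std::string> moveAfterSinglePred(std::vector<std::string> Order,
                                             const Edges &Preds,
                                             const Edges &Succs) {
  const std::vector<std::string> Original = Order;
  for (const std::string &BB : Original) {
    if (countOf(Preds, BB) != 1 || countOf(Succs, BB) != 1)
      continue;
    const std::string &Pred = Preds.at(BB).front();
    if (Pred == BB ||
        std::find(Order.begin(), Order.end(), Pred) == Order.end())
      continue; // self-loop or predecessor outside the ordered region
    Order.erase(std::find(Order.begin(), Order.end(), BB));
    // Look the predecessor up again: erase() invalidated earlier iterators.
    auto PredIt = std::find(Order.begin(), Order.end(), Pred);
    Order.insert(PredIt + 1, BB);
  }
  return Order;
}
```

For a loop H→A→{C, X} where C returns to H and X exits to E, the exit block X ends up right after A, inside the loop body.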
Differential Revision: https://reviews.llvm.org/D123231
Clang-format InstructionSimplify and convert all "FunctionName"s to
"functionName". This patch touches a lot of files, but it completes the
cleanup of InstructionSimplify in one commit.
This is the alternative to the less invasive clang-format only patch: D126783
Reviewed By: spatel, rengolin
Differential Revision: https://reviews.llvm.org/D126889
D118623 added code to fold not-of-compare into a compare
with the inverted predicate, if the compare had no other
uses. This relies on accurate use lists in the IR but it
was run before setPhiValues, when some phi inputs are still
stored in a data structure on the side, instead of being
real uses in the IR. The effect was that a phi that should
be using the original compare result would now get an
inverted result instead.
Fix this by moving simplifyConditions after setPhiValues.
Differential Revision: https://reviews.llvm.org/D120312
In some cases StructurizeCFG inserts i1 xor instructions to invert
predicates. Add a quick loop to clean these up afterwards if we can get
away with modifying an existing compare instruction instead.
(StructurizeCFG is generally run late in the pipeline so instcombine
does not clean them up for us.)
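The cleanup boils down to inverting a compare predicate when the xor is the compare's only user. A self-contained sketch with a hypothetical predicate enum and use count (in LLVM, `CmpInst::getInversePredicate` performs the inversion); the single-use check is also why the fix in D120312 had to run after setPhiValues, once the use lists are accurate:

```cpp
// Hypothetical subset of integer compare predicates.
enum class Pred { EQ, NE, SLT, SGE, SGT, SLE };

// Logical negation of a predicate: !(a P b) == (a inverse(P) b).
Pred inverse(Pred P) {
  switch (P) {
  case Pred::EQ: return Pred::NE;
  case Pred::NE: return Pred::EQ;
  case Pred::SLT: return Pred::SGE;
  case Pred::SGE: return Pred::SLT;
  case Pred::SGT: return Pred::SLE;
  case Pred::SLE: return Pred::SGT;
  }
  return P; // unreachable for valid enumerators
}

bool evaluate(Pred P, int A, int B) {
  switch (P) {
  case Pred::EQ: return A == B;
  case Pred::NE: return A != B;
  case Pred::SLT: return A < B;
  case Pred::SGE: return A >= B;
  case Pred::SGT: return A > B;
  case Pred::SLE: return A <= B;
  }
  return false;
}

struct Compare {
  Pred P;
  int NumUses; // users of the compare result, including the xor
};

// Fold `xor (icmp P a, b), true` into the compare when the xor is its only
// user: flip the predicate in place; the xor's users would then be pointed
// at the compare and the xor erased (not modeled here).
bool foldNotOfCompare(Compare &C) {
  if (C.NumUses != 1)
    return false; // other users still need the original result
  C.P = inverse(C.P);
  return true;
}
```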
Differential Revision: https://reviews.llvm.org/D118623