Allow a duplicate basic block with multiple predecessors to the
jump table to be simplified, by considering that the same basic
block may appear in more switch cases.
This PR removes tests with `br i1 undef` under
`llvm/tests/Transforms/ObjCARC, Reassociate, SCCP, SLPVectorizer...`.
After this PR, I'll continue to fix tests under `llvm/tests/CodeGen`,
which has more UB tests than `llvm/tests/Transforms`.
This should act like range.
Previously ConstantRangeList assumed a 64-bit range. Now query from the
actual entries. This also means that the empty range has no bitwidth, so
move asserts to avoid checking the bitwidth of empty ranges.
This is a follow up of #96878 to support hoisting load/store from BBs
have the same predecessor, if load/store are the only instructions and
the branch is unpredictable, e.g.:
```
void test (int a, int *c, int *d) {
if (a)
*c = a;
else
*d = a;
}
```
Currently when we merge invokes as part of SimplifyCFG we apply a merge
of the invoke DILocations to the merged invoke. We also insert an
unconditional branch to the merged invoke at the positions previously
occupied by the original invokes; as this branch is part of the
substitution for the invoke it has replaced, we should propagate the
original invoke DebugLoc to it.
I noticed that the two C functions emitted different IR:
```
int switch_duplicate_arms(int switch_val, int v, int w) {
switch (switch_val) {
default:
break;
case 0:
w = v;
break;
case 1:
w = v;
break;
}
return w;
}
int if_duplicate_arms(int switch_val, int v, int w) {
if (switch_val == 0)
w = v;
else if (switch_val == 1)
w = v;
return v0;
}
```
We generate IR that looks like this:
```
define i32 @switch_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
switch i32 %1, label %7 [
i32 0, label %5
i32 1, label %6
]
5:
br label %7
6:
br label %7
7:
%8 = phi i32 [ %3, %4 ], [ %2, %6 ], [ %2, %5 ]
ret i32 %8
}
define i32 @if_duplicate_arms(i32 %0, i32 %1, i32 %2, i32 %3) {
%5 = icmp ult i32 %1, 2
%6 = select i1 %5, i32 %2, i32 %3
ret i32 %6
}
```
For `switch_duplicate_arms`, taking case 0 and 1 are the same since %5
and %6
branch to the same location and the incoming values for %8 are the same
from
those blocks. We could remove one on the duplicate switch targets and
update
the switch with the single target.
On RISC-V, prior to this patch, we generate the following code:
```
switch_duplicate_arms:
li a4, 1
beq a1, a4, .LBB0_2
mv a0, a3
bnez a1, .LBB0_3
.LBB0_2:
mv a0, a2
.LBB0_3:
ret
if_duplicate_arms:
li a4, 2
mv a0, a2
bltu a1, a4, .LBB1_2
mv a0, a3
.LBB1_2:
ret
```
After this patch, the O3 code is optimized to the icmp + select pair,
which
gives us the same code gen as `if_duplicate_arms`, as desired. This
results
is one less branch instruction in the final assembly.
This may help with both code size and further switch simplification. I
found
that this patch causes no significant impact to spec2006/int/ref and
spec2017/intrate/ref.
---------
Co-authored-by: Min Hsu <min@myhsu.dev>
This is intended to solve a problem with lowering atomics in
OpenMP and C++ common to AMDGPU and NVPTX.
In OpenCL and CUDA, it is undefined behavior for an atomic instruction
to modify an object in thread private memory. In OpenMP, it is defined.
Correspondingly, the hardware does not handle this correctly. For
AMDGPU,
32-bit atomics work and 64-bit atomics are silently dropped. We
therefore
need to codegen this by inserting a runtime address space check,
performing
the private case without atomics, and fallback to issuing the real
atomic
otherwise. This metadata allows us to avoid this extra check and branch.
Handle this by introducing metadata intended to be applied to atomicrmw,
indicating they cannot access the forbidden address space.
Prior impl would fail if the number of attribute sets on the two calls
wasn't the same which is unnecessary as long as we aren't throwing
away and must-preserve attrs.
Closes#110896
Some (many) attributes can safely be dropped to enable sinking. For
example removing `nonnull` on a return/param can't affect correctness.
Closes#109472
SimplifyCFG store speculation currently has some homegrown code to check
for a writable object, handling the alloca special case only.
Switch it to use the generic isWritableObject() API, which means that we
also support byval arguments, allocator return values, and writable
arguments.
I've adjusted isWritableObject() to also check for the noalias attribute
when handling writable. Otherwise, I don't think that we can generalize
from at-entry writability. This was not relevant for previous uses of
the function, because they'd already require noalias for other reasons
anyway.
Now in the simplifycfg and jumpthreading passes, we will remove the
empty blocks (blocks only have phis and an unconditional branch).
However, in some cases, this will increase size of the IR and slow down
the compile of other passes dramatically. For example, we have the
following CFG:
1. BB1 has 100 predecessors, and unconditionally branches to BB2 (does
not have any other instructions).
2. BB2 has 100 phis.
Then in this case, if we remove BB1, for every phi in BB2, we need to
increase 99 entries (replace the incoming edge from BB1 with 100 edges
from its predecessors). Then in total, we will increase 9900 phi
entries, which can slow down the compile time for many other passes.
Therefore, in this change, we add a check to see whether removing the
empty blocks will increase lots of phi entries. Now, the threshold is
1000 (can be controlled by the command line option
`max-phi-entries-increase-after-removing-empty-block`), which means that
we will not remove an empty block if it will increase the total number
of phi entries by 1000. This threshold is conservative and for most of
the cases, we will not have such a large phi. So, this will only be
triggered in some unusual IRs.
This is supposed to test multiplication of the linear multiplifier
with the largest value it can be multiplied with. However, if
we truncate TableSize-1 here, it might not actually be the largest
value. I think in practice this still works out, because in cases
where we'd truncate the value here we'd also fail the NonMonotonic
check. But to match the intent of the code, we should treat the
truncating case as overflowing.
If we can sink the a load/store, but not the gep producing its pointer
operand, don't sink the load/store either. This may prevent the gep from
being folded into an addressing mode, and may also negatively affect
further analysis.
Fixes https://github.com/llvm/llvm-project/issues/96838.
If a dereferenceability fact is provided through `!dereferenceable` (or
similar), it may only hold on the given control flow path. When we use
`isSafeToSpeculativelyExecute()` to check multiple instructions, we
might make use of `!dereferenceable` information that does not hold at
the speculation target. This doesn't happen when speculating
instructions one by one, because `!dereferenceable` will be dropped
while speculating.
Fix this by checking whether the instruction with `!dereferenceable`
dominates the context instruction. If this is not the case, it means we
are speculating, and cannot guarantee that it holds at the speculation
target.
Fixes https://github.com/llvm/llvm-project/issues/108854.
Pass speculation target and assumption cache to
isSafeToSpeculativelyExecute() calls.
This allows speculating based on dereferenceable/align assumptions, but
the primary motivation here is to avoid regressions from planned changes
to fix https://github.com/llvm/llvm-project/issues/108854.
Same we way mark a path unreachable if it may cause a nullptr
dereference, div/rem by zero or signed div/rem of INT_MIN by -1 cause
immediate UB.
Closes#109008
This is simplifycfg part of
https://github.com/llvm/llvm-project/pull/95515
In this PR, we support hoisting load/store with conditional faulting in
`SimplifyCFGOpt::speculativelyExecuteBB` to eliminate conditional
branches.
This is for cases like
```
void test (int a, int *b) {
if (a)
*b = a;
}
```
In the following patches, we will support the hoist in
`SimplifyCFGOpt::hoistCommonCodeFromSuccessors`.
That is for cases like
```
void test (int a, int *c, int *d) {
if (a)
*c = a;
else
*d = a;
}
```
This is a followup to https://github.com/llvm/llvm-project/pull/104579
to remove the limitation on sinking loads/stores of allocas entirely,
even if this would introduce a phi node.
Nowadays, SROA supports speculating load/store over select/phi.
Additionally, SimplifyCFG with sinking only runs at the end of the
function simplification pipeline, after SROA. I checked that the two
tests modified here still successfully SROA after the SimplifyCFG
transform.
We should, however, keep the limitation on lifetime intrinsics. SROA
does not have speculation support for these, and I've also found that
the way these are handled in the backend is very problematic
(https://github.com/llvm/llvm-project/issues/104776), so I think we
should leave them alone.
SimplifyCFG sinking currently does not sink loads/stores of allocas,
because historically SROA was unable to handle the resulting IR. Since
then, SROA both learned to speculate loads/stores over selects and phis,
*and* SimplifyCFG sinking has been deferred to the end of the function
simplification pipeline, which means that SROA happens before it.
As such, I believe that this workaround should no longer be necessary.
Given how sensitive SimplifyCFG sinking seems to be, this patch takes a
very conservative step towards removing this, by allowing sinking if we
don't actually need to form a phi over the pointer argument.
This fixes https://github.com/llvm/llvm-project/issues/104567, where
sinking a store to an escaped alloca allows converting a switch into
arithmetic.
- This patch skips the threading on known values if the target has
divergent branch.
- So far, threading on known values is skipped when the basic block has
covergent calls. However, even without convergent calls, if that
condition is divergent, threading duplicates the execution of that
block threaded and hence results in lower performance. E.g.,
```
BB1:
if (cond) BB3, BB2
BB2:
// work2
br BB3
BB3:
// work3
if (cond) BB5, BB4
BB4:
// work4
br BB5
BB5:
```
after threading,
```
BB1:
if (cond) BB3', BB2'
BB2':
// work3
br BB5
BB3':
// work2
// work3
// work4
br BB5
BB5:
```
After threading, work3 is executed twice if 'cond' is a divergent one.
Reviewers: yxsamliu, nikic
Pull Request: https://github.com/llvm/llvm-project/pull/100185