This reverts commit fdb87640ee2be63af9b0e0cd943cb13d79686a03, and thus
re-enables terminator folding for RISCV. The reported miscompile has
been fixed in f5dd70c58277d925710e5a7c25c86d7565cc3c6c.
This is a partial revert of e947f953370abe8ffc8713b8f3250a3ec39599fe.
It caused a miscompile in downstream testing.
Spoke with Philip offline. We believe the issue is that LSR needs to
make sure the Step of the other AddRec is non-zero. Reverting until
Philip is back from vacation.
If looking for a miscompile revert candidate, look here!
The transform being enabled prefers comparing to a loop invariant
exit value for a secondary IV over using an otherwise dead primary
IV. This increases register pressure (by requiring the exit value
to be live through the loop), but reduces the number of instructions
within the loop by one.
On RISC-V which has a large number of scalar registers, this is
generally a profitable transform. We loose the ability to use a beqz
on what is typically a count down IV, and pay the cost of computing
the exit value on the secondary IV in the loop preheader, but save
an add or sub in the loop body. For anything except an extremely
short running loop, or one with extreme register pressure, this is
profitable. On spec2017, we see a 0.42% geomean improvement in
dynamic icount, with no individual workload regressing by more than
0.25%.
Code size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for a uncompressible beq. We also add instructions
in the preheader. Net result is a slight regression overall, but
neutral or better inside the loop.
Previous versions of this transform had numerous cornercase correctness
bugs. All of them ones I can spot by inspection have been fixed, and I
have run this through all of spec2017, but there may be further issues
lurking. Adding uses to an IV is a fraught thing to do given poison
semantics, so this transform is somewhat inherently risky.
This patch is a reworked version of D134893 by @eop. That patch has
been abandoned since May, so I picked it up, reworked it a bit, and
am landing it.
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
Like what has been done in AArch64 (D125335).
We enable this under `-O2` to show the codegen diffs here but we
may only do this under `-O3` like AArch64.
There are two cases that we may produce these eliminable copies:
1. ISel of `FrameIndex`. Like `rvv/fixed-vectors-calling-conv.ll`.
2. Tail duplication. Like `select-optimize-multiple.ll`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144535
The vsetvli insertion pass can replace it with MU if needed by
a using instruction. The vsetvli insertion pass will not convert
MU to MA so we need to start at MA.
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D143790
This matches the behavior from a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.
The motivation here is that our current cost heuristic treats number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.
The pointer increment version both decreases number of instructions in some loops, and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.
Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
Differential Revision: https://reviews.llvm.org/D142227
The LSR may suggest less profitable transformation to the loop. This
patch adds check to prevent LSR from generating worse code than what
we already have.
Since LSR affects nearly all targets, the patch is guarded by the
option 'lsr-drop-solution' and default as disable for now.
The next step should be extending an TTI interface to allow target(s)
to enable this enhancememnt.
Debug log is added to remind user of such choice to skip the LSR
solution.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D126043
Changes since initial commit:
* Wrapping a pointer in an SCEV unknown hides the base, and SCEV is only able to compute a subtraction when the bases are known to be equal. This results in a SCEVCouldNotCompute flowing forward and triggering asserts. Test case added in d767b392.
* isLoopInvariant returns true for instructions outside the loop, but not necessarily *above* the loop. Since this code is allowed to visit uses of an IV outside of a loop, we have to make sure the operands of the compare are both invariant and dominating the header. Test case added in 2aed3cdb.
Original commit message follows...
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
Follow up to 3bc09c7da5 - remove a fixme I forgot to remove, and add test cases showing remaining work.
Note that scaled vscales show up in vectorized code from a couple of sources:
* Element types smaller than vector block size (i.e. everything under i64)
* Unrolling
* LMUL > 1
The largest scaling we can currently have is 256 (e8 in every possible vector register). More practically useful scales are in the 2-16 range.
Motivation here is to unblock LSRs ability to use ICmpZero uses - the major effect of which is to enable count down IVs. The test changes reflect this goal, but the potential impact is much broader since this isn't a change in LSR at all.
SCEVExpander needs(*) to prove that expanding the expression is safe anywhere the SCEV expression is valid. In general, we can't expand any node which might fault (or exhibit UB) unless we can either a) prove it won't fault, or b) guard the faulting case. We'd been allowing non-zero constants here; this change extends it to non-zero values.
vscale is never zero. This is already implemented in ValueTracking, and this change just adds the same logic in SCEV's range computation (which in turn drives isKnownNonZero). We should common up some logic here, but let's do that in separate changes.
(*) As an aside, "needs" is such an interesting word here. First, we don't actually need to guard this at all; we could choose to emit a select for the RHS of ever udiv and remove this code entirely. Secondly, the property being checked here is way too strong. What the client actually needs is to expand the SCEV at some particular point in some particular loop. In the examples, the original urem dominates that loop and yet we completely ignore that information when analyzing legality. I don't plan to actively pursue either direction, just noting it for future reference.
Differential Revision: https://reviews.llvm.org/D129710
For the moment, we're pretty conservative here. My motivating case is the vscale one (as that is idiomatic for scalable vectorized loops on RISCV). There are two obvious approaches to fixing this, and I tried to add reasonable coverage for both even though I'll likely only fix one.