This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV. Splitting it out makes
it easier to test.
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
Integration wise, I switched from using TTI to using a pass configuration
variable. This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.
Currently before forwarding an AVL we do a simple non-exhaustive check
to see if its LiveInterval is extendable. But we also need to check for
this when we're extending an AVL's LiveInterval via merging the
VSETVLIInfos in transferBefore with equally zero AVLs.
Rather than trying to conservatively prevent these cases, this inserts a
copy of the AVL instead if we don't know we'll be able to extend it.
This is likely to be more robust, and even if the extra copy is
undesirable these cases should be rare in practice.
This avoids some cases where LSR produces results that lead to very poor
codegen. There's a chance we'll see minor degradations for some inputs
in the case that our metrics say the found solution is worse, but in
reality it's better than the starting point.
Per the review thread, at least one vendor has been enabling this by
defualt for some time and found overall it's an improvement. As such,
we'll enable by default and aim to fix any as-yet-unknown regressions
in-tree.
This fixes a crash found when compiling OpenBLAS with -mllvm
-verify-machineinstrs.
When we "forward" the AVL from the output of a vsetvli, we might have to
extend the LiveInterval of the AVL to where insert the new vsetvli.
Most of the time we are able to extend the LiveInterval because there's
only one val num (definition) for the register. But PHI elimination can
assign multiple values to the same register, in which case we end up
clobbering a different val num when extending:
%x = PseudoVSETVLI %avl, ...
%avl = ADDI ...
%v = PseudoVADD ..., avl=%x
; %avl is forwarded to PseudoVADD:
%x = PseudoVSETVLI %avl, ...
%avl = ADDI ...
%v = PseudoVADD ..., avl=%avl
Here there's no way to extend the %avl from the vsetvli since %avl is
redefined, i.e. we have two val nums.
This fixes it by only forwarding it when we have exactly one val num,
where it should be safe to extend it.
If the AVL in a VSETVLIInfo is the output VL of a vsetvli with the same
VLMAX, we treat it as the AVL of said vsetvli.
This allows us to remove a true dependency as well as treating
VSETVLIInfos as equal in more places and avoid toggles.
We do this in two places, needVSETVLI and computeInfoForInstr. However
we don't do this in computeInfoForInstr's vsetvli equivalent,
getInfoForVSETVLI.
We also have a restriction only in computeInfoForInstr that the AVL
can't be a register as we want to avoid extending live ranges.
This patch does two interlinked things:
1) It adds this AVL "peeking" to getInfoForVSETVLI
2) It relaxes the constraint that the AVL can't be a register in
computeInfoForInstr, since it removes a use of the output VL which can
actually reduce register pressure. E.g. see the diff in
@vector_init_vsetvli_N and @test6
Now that getInfoForVSETVLI and computeInfoForInstr are consistent, we
can remove the check in needVSETVLI.
We also need to update how we update LiveIntervals in insertVSETVLI, as
we can now end up needing to extend the LiveRange of the AVL across
blocks.
This reverts commit fdb87640ee2be63af9b0e0cd943cb13d79686a03, and thus
re-enables terminator folding for RISCV. The reported miscompile has
been fixed in f5dd70c58277d925710e5a7c25c86d7565cc3c6c.
This is a partial revert of e947f953370abe8ffc8713b8f3250a3ec39599fe.
It caused a miscompile in downstream testing.
Spoke with Philip offline. We believe the issue is that LSR needs to
make sure the Step of the other AddRec is non-zero. Reverting until
Philip is back from vacation.
If looking for a miscompile revert candidate, look here!
The transform being enabled prefers comparing to a loop invariant
exit value for a secondary IV over using an otherwise dead primary
IV. This increases register pressure (by requiring the exit value
to be live through the loop), but reduces the number of instructions
within the loop by one.
On RISC-V which has a large number of scalar registers, this is
generally a profitable transform. We loose the ability to use a beqz
on what is typically a count down IV, and pay the cost of computing
the exit value on the secondary IV in the loop preheader, but save
an add or sub in the loop body. For anything except an extremely
short running loop, or one with extreme register pressure, this is
profitable. On spec2017, we see a 0.42% geomean improvement in
dynamic icount, with no individual workload regressing by more than
0.25%.
Code size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for a uncompressible beq. We also add instructions
in the preheader. Net result is a slight regression overall, but
neutral or better inside the loop.
Previous versions of this transform had numerous cornercase correctness
bugs. All of them ones I can spot by inspection have been fixed, and I
have run this through all of spec2017, but there may be further issues
lurking. Adding uses to an IV is a fraught thing to do given poison
semantics, so this transform is somewhat inherently risky.
This patch is a reworked version of D134893 by @eop. That patch has
been abandoned since May, so I picked it up, reworked it a bit, and
am landing it.
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
Like what has been done in AArch64 (D125335).
We enable this under `-O2` to show the codegen diffs here but we
may only do this under `-O3` like AArch64.
There are two cases that we may produce these eliminable copies:
1. ISel of `FrameIndex`. Like `rvv/fixed-vectors-calling-conv.ll`.
2. Tail duplication. Like `select-optimize-multiple.ll`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144535
The vsetvli insertion pass can replace it with MU if needed by
a using instruction. The vsetvli insertion pass will not convert
MU to MA so we need to start at MA.
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D143790
This matches the behavior from a number of other targets, including e.g. X86. This does have the effect of increasing register pressure slightly, but we have a relative abundance of registers in the ISA compared to other targets which use the same heuristic.
The motivation here is that our current cost heuristic treats number of registers as the dominant cost. As a result, an extra use outside of a loop can radically change the LSR result. As an example consider test4 from the recently added test/Transforms/LoopStrengthReduce/RISCV/lsr-cost-compare.ll. Without a use outside the loop (see test3), we convert the IV into a pointer increment. With one, we leave the gep in place.
The pointer increment version both decreases number of instructions in some loops, and creates parallel chains of computation (i.e. decreases critical path depth). Both are generally profitable.
Arguably, we should really be using a more sophisticated model here - such as e.g. using profile information or explicitly modeling parallelism gains. However, as a practical matter starting with the same mild hack that other targets have used seems reasonable.
Differential Revision: https://reviews.llvm.org/D142227
The LSR may suggest less profitable transformation to the loop. This
patch adds check to prevent LSR from generating worse code than what
we already have.
Since LSR affects nearly all targets, the patch is guarded by the
option 'lsr-drop-solution' and default as disable for now.
The next step should be extending an TTI interface to allow target(s)
to enable this enhancememnt.
Debug log is added to remind user of such choice to skip the LSR
solution.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D126043
Changes since initial commit:
* Wrapping a pointer in an SCEV unknown hides the base, and SCEV is only able to compute a subtraction when the bases are known to be equal. This results in a SCEVCouldNotCompute flowing forward and triggering asserts. Test case added in d767b392.
* isLoopInvariant returns true for instructions outside the loop, but not necessarily *above* the loop. Since this code is allowed to visit uses of an IV outside of a loop, we have to make sure the operands of the compare are both invariant and dominating the header. Test case added in 2aed3cdb.
Original commit message follows...
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
The ICmpZero matching is checking to see if the expression is loop invariant per SCEV and expandable. This allows expressions inside the loop which can be made loop invariant to be seamlessly expanded, but is overly conservative for expressions which already *are* loop invariant.
As a simple justification for why this is correct, consider a loop invariant urem as RHS vs an alternate function with that same urem wrapped inside a helper call. Why would it be legal to match the later, but not the former?
Differential Revision: https://reviews.llvm.org/D129793
Follow up to 3bc09c7da5 - remove a fixme I forgot to remove, and add test cases showing remaining work.
Note that scaled vscales show up in vectorized code from a couple of sources:
* Element types smaller than vector block size (i.e. everything under i64)
* Unrolling
* LMUL > 1
The largest scaling we can currently have is 256 (e8 in every possible vector register). More practically useful scales are in the 2-16 range.
Motivation here is to unblock LSRs ability to use ICmpZero uses - the major effect of which is to enable count down IVs. The test changes reflect this goal, but the potential impact is much broader since this isn't a change in LSR at all.
SCEVExpander needs(*) to prove that expanding the expression is safe anywhere the SCEV expression is valid. In general, we can't expand any node which might fault (or exhibit UB) unless we can either a) prove it won't fault, or b) guard the faulting case. We'd been allowing non-zero constants here; this change extends it to non-zero values.
vscale is never zero. This is already implemented in ValueTracking, and this change just adds the same logic in SCEV's range computation (which in turn drives isKnownNonZero). We should common up some logic here, but let's do that in separate changes.
(*) As an aside, "needs" is such an interesting word here. First, we don't actually need to guard this at all; we could choose to emit a select for the RHS of ever udiv and remove this code entirely. Secondly, the property being checked here is way too strong. What the client actually needs is to expand the SCEV at some particular point in some particular loop. In the examples, the original urem dominates that loop and yet we completely ignore that information when analyzing legality. I don't plan to actively pursue either direction, just noting it for future reference.
Differential Revision: https://reviews.llvm.org/D129710
For the moment, we're pretty conservative here. My motivating case is the vscale one (as that is idiomatic for scalable vectorized loops on RISCV). There are two obvious approaches to fixing this, and I tried to add reasonable coverage for both even though I'll likely only fix one.