Comments attached to the `ScaledReg` field of `struct Formula` explains
that, `ScaledReg` must be non-null when `Scale` is non-zero.
This fixes up a code path where this invariant is violated. Also, add an
assert to ensure this invariant holds true.
Without this patch, compiler aborts with the attached test case.
Fixes#76504
This macros is always defined: either 0 or 1. The correct pattern is to
use #if.
Re-apply #110185 with more fixes for debug build with the ABI breaking
checks disabled.
Motivating example: https://godbolt.org/z/eb97zrxhx
Here we have 2 induction variables in the loop: one is corresponding to
i variable (add rdx, 4), the other - to res (add rax, 2). The second
induction variable can be removed by rewriteLoopExitValues() method
(final value of res at loop exit is unroll_iter * -2); however, this
doesn't happen because we have duplicated LCSSA phi nodes at loop exit:
```
; Preheader:
for.body.preheader.new: ; preds = %for.body.preheader
%unroll_iter = and i64 %N, -4
br label %for.body
; Loop:
for.body: ; preds = %for.body, %for.body.preheader.new
%lsr.iv = phi i64 [ %lsr.iv.next, %for.body ], [ 0, %for.body.preheader.new ]
%i.07 = phi i64 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
%inc.3 = add nuw i64 %i.07, 4
%lsr.iv.next = add nsw i64 %lsr.iv, -2
%niter.ncmp.3.not = icmp eq i64 %unroll_iter, %inc.3
br i1 %niter.ncmp.3.not, label %for.end.loopexit.unr-lcssa.loopexit, label %for.body, !llvm.loop !7
; Exit blocks
for.end.loopexit.unr-lcssa.loopexit: ; preds = %for.body
%inc.3.lcssa = phi i64 [ %inc.3, %for.body ]
%lsr.iv.next.lcssa11 = phi i64 [ %lsr.iv.next, %for.body ]
%lsr.iv.next.lcssa = phi i64 [ %lsr.iv.next, %for.body ]
br label %for.end.loopexit.unr-lcssa
```
rewriteLoopExitValues requires %lsr.iv.next value to have only 2 uses:
one in LCSSA phi node, the other - in induction phi node. Here we have 3
uses of this value because of duplicated lcssa nodes, so the transform
doesn't apply and leads to an extra add operation inside the loop. The
proposed solution is to accumulate inserted instructions that will
require LCSSA form update into SetVector and then call
formLCSSAForInstructions for this SetVector once, so the same
instructions don't process twice.
Reland fixes the issue with preserve-lcssa.ll test: it fails in the situation
when x86_64-unknown-linux-gnu target is unavailable in opt. The changes are
moved into separate duplicated-phis.ll test with explicit x86 target requirement
to fix bots which are not building this target.
Motivating example: https://godbolt.org/z/eb97zrxhx
Here we have 2 induction variables in the loop: one is corresponding to
i variable (add rdx, 4), the other - to res (add rax, 2). The second
induction variable can be removed by rewriteLoopExitValues() method
(final value of res at loop exit is unroll_iter * -2); however, this
doesn't happen because we have duplicated LCSSA phi nodes at loop exit:
```
; Preheader:
for.body.preheader.new: ; preds = %for.body.preheader
%unroll_iter = and i64 %N, -4
br label %for.body
; Loop:
for.body: ; preds = %for.body, %for.body.preheader.new
%lsr.iv = phi i64 [ %lsr.iv.next, %for.body ], [ 0, %for.body.preheader.new ]
%i.07 = phi i64 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
%inc.3 = add nuw i64 %i.07, 4
%lsr.iv.next = add nsw i64 %lsr.iv, -2
%niter.ncmp.3.not = icmp eq i64 %unroll_iter, %inc.3
br i1 %niter.ncmp.3.not, label %for.end.loopexit.unr-lcssa.loopexit, label %for.body, !llvm.loop !7
; Exit blocks
for.end.loopexit.unr-lcssa.loopexit: ; preds = %for.body
%inc.3.lcssa = phi i64 [ %inc.3, %for.body ]
%lsr.iv.next.lcssa11 = phi i64 [ %lsr.iv.next, %for.body ]
%lsr.iv.next.lcssa = phi i64 [ %lsr.iv.next, %for.body ]
br label %for.end.loopexit.unr-lcssa
```
rewriteLoopExitValues requires %lsr.iv.next value to have only 2 uses:
one in LCSSA phi node, the other - in induction phi node. Here we have 3
uses of this value because of duplicated lcssa nodes, so the transform
doesn't apply and leads to an extra add operation inside the loop. The
proposed solution is to accumulate inserted instructions that will
require LCSSA form update into SetVector and then call
formLCSSAForInstructions for this SetVector once, so the same
instructions don't process twice.
This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV. Splitting it out makes
it easier to test.
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
Integration wise, I switched from using TTI to using a pass configuration
variable. This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.
Somewhat confusingly a `SCEVMulExpr` is a `SCEVNAryExpr`, so can have
> 2 operands. Previously, the vscale immediate matching did not check
the number of operands of the `SCEVMulExpr`, so would ignore any
operands after the first two.
This led to incorrect codegen (and results) for ArmSME in IREE
(https://github.com/iree-org/iree), which sometimes addresses things
that are a `vscale * vscale` multiple away. The test added with this
change shows an example reduced from IREE. The second write should
be offset from the first `16 * vscale * vscale` (* 4 bytes), however,
previously LSR dropped the second vscale and instead offset the write by
`#4, mul vl`, which is an offset of `16 * vscale` (* 4 bytes).
Fix#97510 .
Note that, for the new phi instruction `NewPH`, which replaces the old
phi `PH` and the cast `ShadowUse`, I choose to propagate the debug
location of `PH` to it, because the cast is eliminated according to the
optimization semantics.
Extends LoopStrengthReduce to recognize immediates multiplied by vscale, and query the current target for whether they are legal offsets for memory operations or adds.
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
LSR will generate chains of related instructions with a known increment
between them. With SVE, in the case of the test case, this can include
increments like 'vscale * 16 + 8'. The idea of this patch is if we have
a '+8' increment already calculated in the chain, we can generate a
(legal) '+ vscale*16' addressing mode from it, allowing us to use the
'[x16, #1, mul vl]' addressing mode instructions.
In order to do this we keep track of the known 'bases' when generating
chains in GenerateIVChain, checking for each if the accumulated
increment expression from the base neatly folds into a legal addressing
mode. If they do not we fall back to the existing LeftOverExpr, whether
it is legal or not.
This is mostly orthogonal to #88124, dealing with the generation of
chains as opposed to rest of LSR. The existing vscale addressing mode
work has greatly helped compared to the last time I looked at this,
allowing us to check that the addressing modes are indeed legal.
<https://reviews.llvm.org/D126043> introduced a flag to drop solutions
if deemed unprofitable. As noted there, introducing a TTI hook enables
backends to individually opt into this behaviour.
This will be used by #89927.
This is the major rename patch that prior patches have built towards.
The DPValue class is being renamed to DbgVariableRecord, which reflects
the updated terminology for the "final" implementation of the RemoveDI
feature. This is a pure string substitution + clang-format patch. The
only manual component of this patch was determining where to perform
these string substitutions: `DPValue` and `DPV` are almost exclusively
used for DbgRecords, *except* for:
- llvm/lib/target, where 'DP' is used to mean double-precision, and so
appears as part of .td files and in variable names. NB: There is a
single existing use of `DPValue` here that refers to debug info, which
I've manually updated.
- llvm/tools/gold, where 'LDPV' is used as a prefix for symbol
visibility enums.
Outside of these places, I've applied several basic string
substitutions, with the intent that they only affect DbgRecord-related
identifiers; I've checked them as I went through to verify this, with
reasonable confidence that there are no unintended changes that slipped
through the cracks. The substitutions applied are all case-sensitive,
and are applied in the order shown:
```
DPValue -> DbgVariableRecord
DPVal -> DbgVarRec
DPV -> DVR
```
Following the previous rename patches, it should be the case that there
are no instances of any of these strings that are meant to refer to the
general case of DbgRecords, or anything other than the DPValue class.
The idea behind this patch is therefore that pure string substitution is
correct in all cases as long as these assumptions hold.
This patch changes DPValue::filter to be a non-member method
filterDbgVars. There are two reasons for this: firstly, the name of
DPValue is about to change to DbgVariableRecord, which will result in
every `for` loop that uses DPValue::filter to require a line break. This
is a small thing, but it makes the rename patch more difficult to
review, and is just generally more awkward for what is a fairly common
loop. Secondly, the intent is to later break up the DPValue class into
subclasses, at which point it would be better to have a non-member
function that allows template arguments for the cases we want to filter
with greater specificity.
As part of the effort to rename the DbgRecord classes, this patch
renames the widely-used functions that operate on DbgRecords but refer
to DbgValues or DPValues in their names to refer to DbgRecords instead;
all such functions are defined in one of `BasicBlock.h`,
`Instruction.h`, and `DebugProgramInstruction.h`.
This patch explicitly does not change the names of any comments or
variables, except for where they use the exact name of one of the
renamed functions. The reason for this is reviewability; this patch can
be trivially examined to determine that the only changes are direct
string substitutions and any results from clang-format responding to the
changed line lengths. Future patches will cover renaming variables and
comments, and then renaming the classes themselves.
As part of the RemoveDIs project we need LLVM to insert instructions using
iterators wherever possible, so that the iterators can carry a bit of
debug-info. This commit implements some of that by updating the contents of
llvm/lib/Transforms/Utils to always use iterator-versions of instruction
constructors.
There are two general flavours of update:
* Almost all call-sites just call getIterator on an instruction
* Several make use of an existing iterator (scenarios where the code is
actually significant for debug-info)
The underlying logic is that any call to getFirstInsertionPt or similar
APIs that identify the start of a block need to have that iterator passed
directly to the insertion function, without being converted to a bare
Instruction pointer along the way.
Noteworthy changes:
* FindInsertedValue now takes an optional iterator rather than an
instruction pointer, as we need to always insert with iterators,
* I've added a few iterator-taking versions of some value-tracking and
DomTree methods -- they just unwrap the iterator. These are purely
convenience methods to avoid extra syntax in some passes.
* A few calls to getNextNode become std::next instead (to keep in the
theme of using iterators for positions),
* SeparateConstOffsetFromGEP has it's insertion-position field changed.
Noteworthy because it's not a purely localised spelling change.
All this should be NFC.
If the phi node found by matchSimpleRecurrence is not from the current
loop, then isAlmostDeadIV panics. With this patch we bail out early.
Signed-off-by: Patrick O'Neill <patrick@rivosinc.com>
---------
Signed-off-by: Patrick O'Neill <patrick@rivosinc.com>
Patch 1 of 3 to add llvm.dbg.label support to the RemoveDIs project. The
patch stack adds a new base class
-> 1. Add DbgRecord base class for DPValue and the not-yet-added
DPLabel class.
2. Add the DPLabel class.
3. Enable dbg.label conversion and add support to passes.
Patches 1 and 2 are NFC.
In the near future we also will rename DPValue to DbgVariableRecord and
DPLabel to DbgLabelRecord, at which point we'll overhaul the function
names too. The name DPLabel keeps things consistent for now.
Follow up to https://github.com/llvm/llvm-project/pull/74747
This change extends the previously added fixed expansion threshold by
scaling down the cost allowed for an expansion for a loop with either a
small known trip count or a profile which indicates the trip count is
likely small. The goal here is to improve code generation for a loop
nest where the outer loop has a high trip count, and the inner loop runs
only a handful of iterations.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
This is a follow up to an item I noted in my submission comment for
e947f95. I don't have a real world example where this is triggering
unprofitably, but avoiding the transform when we estimate the loop to be
short running from profiling seems quite reasonable. It's also now come
up as a possibility in a regression twice in two days, so I'd like to
get this in to close out the possibility if nothing else.
The original review dropped the threshold for short trip count loops. I
will return to that in a separate review if this lands.
This patch trivially extends support for DbgValueInst recovery to
DPValues in LoopStrengthReduce; they are handled identically, so this is
mostly done by reusing the DbgValueInst code (using templates or
auto-parameter lambdas to reduce actual code duplication).
The term folding logic needs to prove that the induction variable does
not cycle through the same set of values so that testing for the value
of the IV on the exiting iteration is guaranteed to trigger only on that
iteration. The prior code checked the no-self-wrap property on the IV,
but this is insufficient as a zero step is trivially no-self-wrap per
SCEV's definition but does repeat the same series of values.
In the current form, this has the effect of basically disabling lsr's
term-folding for all non-constant strides. This is still a net
improvement as we've disabled term-folding entirely, so being able to
enable it for constant strides is still a net improvement.
As future work, there's two SCEV weakness worth investigating.
First sext (or i32 %a, 1) to i64 does not return true for
isKnownNonZero. This is because we check only the unsigned range in that
query. We could either do query pushdown, or check the signed range as
well. I tried the second locally and it has very broad impact - i.e. we
have a bunch of missing optimizations here.
Second, zext (or i32 %a, 1) to i64 as the increment to the IV in
expensive_expand_short_tc causes the addrec to no longer be provably
no-self-wrap. I didn't investigate this so it might be necessary, but
the loop structure is such that I find this result surprising.
If looking for a miscompile revert candidate, look here!
The transform being enabled prefers comparing to a loop invariant
exit value for a secondary IV over using an otherwise dead primary
IV. This increases register pressure (by requiring the exit value
to be live through the loop), but reduces the number of instructions
within the loop by one.
On RISC-V which has a large number of scalar registers, this is
generally a profitable transform. We loose the ability to use a beqz
on what is typically a count down IV, and pay the cost of computing
the exit value on the secondary IV in the loop preheader, but save
an add or sub in the loop body. For anything except an extremely
short running loop, or one with extreme register pressure, this is
profitable. On spec2017, we see a 0.42% geomean improvement in
dynamic icount, with no individual workload regressing by more than
0.25%.
Code size wise, we trade a (possibly compressible) beqz and a (possibly
compressible) addi for a uncompressible beq. We also add instructions
in the preheader. Net result is a slight regression overall, but
neutral or better inside the loop.
Previous versions of this transform had numerous cornercase correctness
bugs. All of them ones I can spot by inspection have been fixed, and I
have run this through all of spec2017, but there may be further issues
lurking. Adding uses to an IV is a fraught thing to do given poison
semantics, so this transform is somewhat inherently risky.
This patch is a reworked version of D134893 by @eop. That patch has
been abandoned since May, so I picked it up, reworked it a bit, and
am landing it.
Extra uses for variables outside the loop can mess with the generation
of postinc variables. This patch alters the collection of loop invariant
fixups in LSR when the target is optimizing for PostInc, to exclude the
collection of these extra uses. It is expected that the variable can be
rematerialized, which will lead to a more optimal sequence of
instructions in the loop.
In CollectLoopInvariantFixupsAndFormulae(), LSR looks at users
outside the loop. E.g. if we have an addrec based on %base, and
%base is also used outside the loop, then we have to keep it in a
register anyway, which may make it more profitable to use
%base + %idx style addressing.
This reasoning doesn't hold up when the base is a constant, because
the constant can be rematerialized. The lsr-memcpy.ll test regressed
when enabling opaque pointers, because inttoptr (i64 6442450944 to ptr)
now also has a use outside the loop (previously it didn't due to a
pointer type difference), and that extra "use" results in worse use
of addressing modes in the loop. However, the use outside the loop
actually gets rematerialized, so the alleged register saving does
not occur.
The same reasoning also applies to other types of constants, such
as global variable references.
Differential Revision: https://reviews.llvm.org/D155073
Move the logic added in 3a57152d85e1 to normalizeForPostIncUse to catch
additional un-invertable cases. This fixes another mis-compile pointed
out by @peixin in D153004.
getExpr is missing a check to make sure the result is invertible.
This can lead to incorrect results, so return nullptr in those cases
like in other places in IVUsers.
Fixes#62660.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D153202
There are multiple places in the code where the type of memory being accessed from an instruction needs to be obtained, including an upcoming patch to improve GEP cost modeling. This deduplicates the logic between them. It's not strictly NFC as EarlyCSE/LoopStrengthReduce may catch more intrinsics now.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D150583
This reverts the revert commit 1797ab36efc9c90c921cd725831f8c3f6a7125a2.
The recommitted version now checks the PostIncLoopSets for all fixups
and returns nullptr if the result doesn't match for all fixups.
This reverts commit abfeda5af329b5889db709ff74506e20e0b569e9.
and fe19036e1266d2a90b44725c82b898134906e4c3.
The added assertion triggers during clang bootstrap builds. Revert while
I investigate.
GenerateTruncates at the moment creates extends/truncates for post-inc
uses of normalized expressions. For example, if an add rec of the form
{1,+,-1} is used outside the loop, the normalized form will use {1,+,-1}
instead of {0,+,-1}. When naively sign-extending the normalized
expression, it will get extended incorrectly to {1,+,-1} for the wider
type, if the backedge-taken count of the loop is 1.
To address this, the patch updates GenerateTruncates to check if the
LSRUse contains any fixups with PostIncLoops. If that's the case, first
de-normalize the expression, then perform the extend/truncate, then
normalize again.
There may be other places where similar checks are needed and the helper
can be generalized for those cases. I'd not be surprised if other subtle
mis-compiles are caused by this.
Fixes#38847.
Fixes#58039.
Fixes#62852.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D153004
SCEVExpander keeps track of all instructions it inserted. However,
it currently misses some phi nodes created during LCSSA construction.
Fix this by collecting these into another argument.
This also removes the IRBuilder argument, which was added for
essentially the same purpose, but only handles the root LCSSA nodes,
not those inserted by SSAUpdater.
This was reported as a regression on D149344, but the reduced test
case also reproduces without it.
Differential Revision: https://reviews.llvm.org/D150681