608 Commits

Author SHA1 Message Date
Florian Hahn
2f7ccaf4a8
[SCEV] Add predicate in SolveLinEq to ensure B is a multiple of A. (#108777)
This can help in cases where pointer alignment info is missing, e.g.
https://github.com/llvm/llvm-project/pull/108210

The predicate is formed for the complex expression that's passed to
SolveLinEquationWithOverflow and the checks could probably be pushed
closer to the root nodes, which in some cases may be cheaper to check.


PR: https://github.com/llvm/llvm-project/pull/108777
2024-09-28 14:19:57 +01:00
Nikita Popov
5bcc82d433 [LoopPeel] Fix LCSSA phi node invalidation
In the test case, the BECount of the second loop uses %load,
but we only have an LCSSA phi node for %add, so that is what
gets invalidated. Use the forgetLcssaPhiWithNewPredecessor()
API instead, which will invalidate the roots of the expression
instead.

Fixes https://github.com/llvm/llvm-project/issues/109333.
2024-09-20 17:01:41 +02:00
Nikita Popov
4ec4ac15ed
[SCEVExpander] Fix addrec cost model (#106704)
The current isHighCostExpansion cost model for addrecs computes the cost
for some kind of polynomial expansion that does not appear to have any
relation to addrec expansion whatsoever.

A literal expansion of an affine addrec is a phi and add (plus the
expansion of start and step). For a non-affine addrec, we get another
phi+add for each additional addrec nested in the step recurrence.

This partially `fixes` https://github.com/llvm/llvm-project/issues/53205
(the runtime unroll test case in this PR).
2024-09-19 09:39:35 +02:00
Ganesh
02e4186d0b
[X86] AMD Zen 5 Initial enablement (#107964)
This patch enables the basic skeleton enablement of AMD next gen zen5 CPUs.
2024-09-13 17:45:33 +01:00
Nikita Popov
52b879594f [LoopUnroll] Avoid undef values in test (NFC)
Avoid most of the code being optimized away as a result of
optimization improvements.
2024-09-03 12:10:29 +02:00
Nikita Popov
fe1a1eee2f [Tests] Regenerate test checks (NFC) 2024-09-03 11:42:47 +02:00
Nikita Popov
9edd998e10 [LoopUnroll] Add test for #53205 (NFC) 2024-08-29 16:43:56 +02:00
Nikita Popov
fe182ddf1f [LoopUnrollAnalyzer] Use constant folding API for loads
Use ConstantFoldLoadFromConst() instead of a partial re-implementation.
This makes the code slightly more generic by not depending on the
exact structure of the constant.
2024-08-28 11:53:25 +02:00
James Y Knight
dfeb3991fb
Remove the x86_mmx IR type. (#98505)
It is now translated to `<1 x i64>`, which allows the removal of a bunch
of special casing.

This _incompatibly_ changes the ABI of any LLVM IR function with
`x86_mmx` arguments or returns: instead of passing in mmx registers,
they will now be passed via integer registers. However, the real-world
incompatibility caused by this is expected to be minimal, because Clang
never uses the x86_mmx type -- it lowers `__m64` to either `<1 x i64>`
or `double`, depending on ABI.

This change does _not_ eliminate the SelectionDAG `MVT::x86mmx` type.
That type simply no longer corresponds to an IR type, and is used only
by MMX intrinsics and inline-asm operands.

Because SelectionDAGBuilder only knows how to generate the
operands/results of intrinsics based on the IR type, it thus now
generates the intrinsics with the type MVT::v1i64, instead of
MVT::x86mmx. We need to fix this before the DAG LegalizeTypes, and thus
have the X86 backend fix them up in DAGCombine. (This may be a
short-lived hack, if all the MMX intrinsics can be removed in upcoming
changes.)

Works towards issue #98272.
2024-07-25 09:19:22 -04:00
v01dXYZ
cff8d716bd
[SCEV] forgetValue: support (with-overflow-inst op0, op1) (#98015)
The use-def walk in forgetValue() was skipping instructions with
non-SCEVable types. However, SCEV may look past with.overflow
intrinsics returning aggregates.

Fixes #97586.
2024-07-09 09:14:33 +02:00
Nikita Popov
37c736e035 [LoopUnroll] Use poison instead of undef for another preheader value 2024-06-25 12:17:20 +02:00
Nikita Popov
eeb0884e66 [LoopUnroll] Use poison instead of undef for preheader value 2024-06-25 12:09:58 +02:00
Stephen Tozer
094572701d
[RemoveDIs] Print IR with debug records by default (#91724)
This patch makes the final major change of the RemoveDIs project, changing the
default IR output from debug intrinsics to debug records. This is expected to
break a large number of tests: every single one that tests for uses or
declarations of debug intrinsics and does not explicitly disable writing
records. 

If this patch has broken your downstream tests (or upstream tests on a
configuration I wasn't able to run):
1. If you need to immediately unblock a build, pass
`--write-experimental-debuginfo=false` to LLVM's option processing for all
failing tests (remember to use `-mllvm` for clang/flang to forward arguments to
LLVM).
2. For most test failures, the changes are trivial and mechanical, enough that
they can be done by script; see the migration guide for a guide on how to do
this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates
3. If any tests fail for reasons other than FileCheck check lines that need
updating, such as assertion failures, that is most likely a real bug with this
patch and should be reported as such.

For more information, see the recent PSA:
https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578
2024-06-14 15:07:27 +01:00
Sameer Sahasrabuddhe
e0ac087ff0
[LoopUnroll] Consider convergence control tokens when unrolling (#91715)
- There is no restriction on a loop with controlled convergent
operations when
  the relevant tokens are defined and used within the loop.

- When a token defined outside a loop is used inside (also called a loop
convergence heart), unrolling is allowed only in the absence of
remainder or
  runtime checks.

- When a token defined inside a loop is used outside, such a loop is
said to be
"extended". This loop can only be unrolled by also duplicating the
extended part
  lying outside the loop. Such unrolling is disabled for now.

- Clean up loop hearts: When unrolling a loop with a heart, duplicating
the
heart will introduce multiple static uses of a convergence control token
in a
cycle that does not contain its definition. This violates the static
rules for
tokens, and needs to be cleaned up into a single occurrence of the
intrinsic.

- Spell out the initializer for UnrollLoopOptions to improve
readability.


Original implementation [D85605] by Nicolai Haehnle
<nicolai.haehnle@amd.com>.
2024-06-06 13:13:46 +05:30
Sergey Kachkov
f34dedbf44
[LoopPeel] Support min/max intrinsics in loop peeling (#93162)
This patch adds processing of min/max intrinsics in LoopPeel in the
similar way as it was done for conditional statements: for
min/max(IterVal, BoundVal) we peel iterations where IterVal < BoundVal
for monotonically increasing IterVal; for monotonically decreasing
IterVal we peel iterations where IterVal > BoundVal (strict comparision
predicates are used to minimize number of peeled iterations).
2024-05-31 13:58:10 +03:00
Sergey Kachkov
60a890d855 [LoopPeel] Add pre-commit test for min/max intrinsics 2024-05-31 13:06:08 +03:00
Simon Pilgrim
54e52aa5eb
[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340)
The znver3/4 scheduler models have previously associated the LoopMicroOpBufferSize with the maximum size of their op caches, and when this led to quadratic complexity issues this were reduced to a value of 512 uops, based mainly on compilation time and not its effectiveness on runtime performance.

From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the cpu has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we make use of the faster op cache rate (8/9 ops/cy).

This patch proposes we instead cap the size of the LoopMicroOpBufferSize based off the maximum rate from the op cache (znver3 = 8op/cy, znver4 = 9op/cy) and the branch misprediction penalty from the opcache (~12cy) as a estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but does try to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler vs the size of a similar naming buffer in the cpu.
2024-05-16 14:44:00 +01:00
Dmitri Gribenko
83974a4b92 Revert "[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657)"
This reverts commit f0b3654701bde1cf7821d60698b42383edaff9f3.

This commit triggers UB by reading an uninitialized variable.

`UP.PartialThreshold` is used uninitialized in `getUnrollingPreferences()` when
it is called from `LoopVectorizationPlanner::executePlan()`. In this case the
`UP` variable is created on the stack and its fields are not initialized.

```
==8802==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x557c0b081b99 in llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h
    #1 0x557c0b07a40c in llvm::TargetTransformInfo::Model<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2277:17
    #2 0x557c0f5d69ee in llvm::TargetTransformInfo::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) const llvm-project/llvm/lib/Analysis/TargetTransformInfo.cpp:387:19
    #3 0x557c0e6b96a0 in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7624:7
    #4 0x557c0e6e4b63 in llvm::LoopVectorizePass::processLoop(llvm::Loop*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10253:13
    #5 0x557c0e6f2429 in llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo*, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AssumptionCache&, llvm::LoopAccessInfoManager&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10344:30
    #6 0x557c0e6f2f97 in llvm::LoopVectorizePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10383:9

[...]

  Uninitialized value was created by an allocation of 'UP' in the stack frame
    #0 0x557c0e6b961e in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7623:3
```
2024-05-16 12:11:42 +02:00
Nikita Popov
f0b3654701
[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (#67657)
The znver3/znver4 scheduler models are outliers, specifying very large
LoopMicroOpBufferSizes at 512, while typical values for other subtargets
are on the order of ~50. Even if this information is
micro-architecturally correct (*), this does not mean that we want to
runtime unroll all loops to a size that completely fills the loop
buffer. Unless this is the single hot loop in the entire application,
the massive code size increase will bust the micro-op and instruction
caches.

Protect against this by clamping to the default PartialThreshold of 150,
which is the same as the default full-unroll threshold and half the
aggressive full-unroll threshold. Allowing more partial unrolling than
full unrolling certainly does not make sense.

(*) I strongly doubt that this is actually correct -- I believe this may
derive from an incorrect reading of Agner Fog's micro-architecture
guide. The number 4096 that was originally used here is the size of the
general micro-op cache, not that of a loop buffer. A separate loop
buffer is not listed for the Zen microarchitecture. Comparing this to
the listing for Skylake, it has a 1536 micro-op buffer, but only a 64
micro-op loopback buffer, with a note that it's rarely fully utilized.
Our scheduling model specifies LoopMicroOpBufferSize of 50 in that case.
2024-05-16 10:21:22 +09:00
chenlin
79643565a8
[LoopUnroll] Remove redundant debug instructions after blocks have been merged (#91246)
Remove redundant debug instructions after blocks have been merged into
the predecessor, It can reduce some compile time in some cases.

This change only fixes the situation of loop unrolling, and other
situations are not considered. "RemoveRedundantDbgInstrs" seems to be
very time-consuming. Thus, we just add here after the "Dest" has been
merged into the "Fold", this may be a more targeted solution!!!

fixes: https://github.com/llvm/llvm-project/issues/89073
2024-05-13 09:42:04 -07:00
Jorge Gorbe Moya
2cde0e2f97 Revert "[BasicBlockUtils] Remove redundant llvm.dbg instructions after blocks to reduce compile time (#89069)"
This reverts commit 2e3e0868748635b779ba89a772eae3664bd822e4. It caused
quadratic slowdown at compilation time in some cases. See the comments
in the original PR: https://github.com/llvm/llvm-project/pull/89069
2024-05-03 13:05:08 -07:00
annamthomas
46c2d93662
[StandardInstrumentation] Annotate loops with the function name (#90756)
When analyzing pass debug output it is helpful to have the function name
along with the loop name.
2024-05-03 14:13:59 -04:00
Florian Hahn
175d297102
[LoopUnroll] Add CSE to remove redundant loads after unrolling. (#83860)
This patch adds loadCSE support to simplifyLoopAfterUnroll. It is based
on EarlyCSE's implementation using ScopeHashTable and is using SCEV for
accessed pointers to check to find redundant loads after unrolling.

This applies to the late unroll pass only, for full unrolling those
redundant loads will be cleaned up by the regular pipeline.

The current approach constructs MSSA on-demand per-loop, but there is
still small but notable compile-time impact:

stage1-O3  +0.04%
stage1-ReleaseThinLTO +0.06%
stage1-ReleaseLTO-g +0.05%
stage1-O0-g +0.02%
stage2-O3 +0.09%
stage2-O0-g +0.04%
stage2-clang +0.02%


https://llvm-compile-time-tracker.com/compare.php?from=c089fa5a729e217d0c0d4647656386dac1a1b135&to=ec7c0f27cb5c12b600d9adfc8543d131765ec7be&stat=instructions:u

This benefits some workloads with runtime-unrolling disabled,
where users use pragmas to force unrolling, as well as with
runtime unrolling enabled.

On SPEC/MultiSource, this removes a number of loads after unrolling
on AArch64 with runtime unrolling enabled.

```
External/S...te/526.blender_r/526.blender_r    96
MultiSourc...rks/mediabench/gsm/toast/toast    39
SingleSource/Benchmarks/Misc/ffbench            4
External/SPEC/CINT2006/403.gcc/403.gcc         18
MultiSourc.../Applications/JM/ldecod/ldecod     4
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg     6
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG     9
MultiSourc...e/Applications/ClamAV/clamscan     4
MultiSourc.../MallocBench/espresso/espresso     3
MultiSourc...dence-flt/LinearDependence-flt     2
MultiSourc...ch/office-ispell/office-ispell     4
MultiSourc...ch/consumer-jpeg/consumer-jpeg     6
MultiSourc...ench/security-sha/security-sha    11
MultiSourc...chmarks/McCat/04-bisect/bisect     3
SingleSour...tTests/2020-01-06-coverage-009    12
MultiSourc...ench/telecomm-gsm/telecomm-gsm    39
MultiSourc...lds-flt/CrossingThresholds-flt    24
MultiSourc...dence-dbl/LinearDependence-dbl     2
External/S...C/CINT2006/445.gobmk/445.gobmk     6
MultiSourc...enchmarks/mafft/pairlocalalign    53
External/S...31.deepsjeng_r/531.deepsjeng_r     3
External/S...rate/510.parest_r/510.parest_r    58
External/S...NT2006/464.h264ref/464.h264ref    29
External/S...NT2017rate/502.gcc_r/502.gcc_r    45
External/S...C/CINT2006/456.hmmer/456.hmmer     6
External/S...te/538.imagick_r/538.imagick_r    18
External/S.../CFP2006/447.dealII/447.dealII     4
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE    12
External/S...2017rate/525.x264_r/525.x264_r    36
MultiSourc...Benchmarks/7zip/7zip-benchmark    33
MultiSourc...hmarks/ASC_Sequoia/AMGmk/AMGmk     2
MultiSourc...chmarks/VersaBench/8b10b/8b10b     1
MultiSourc.../Applications/JM/lencod/lencod   116
MultiSourc...lds-dbl/CrossingThresholds-dbl    24
MultiSource/Benchmarks/McCat/05-eks/eks        15
```

PR: https://github.com/llvm/llvm-project/pull/83860
2024-05-02 11:01:24 +01:00
CL
2e3e086874
[BasicBlockUtils] Remove redundant llvm.dbg instructions after blocks to reduce compile time (#89069)
this patch is to fix the compile time for some cases, before this
change, some targets (riscv-64, ve) will spend much more compile time on
this case (https://godbolt.org/z/rrov17cTo). With this change, the
compile time was reduced a lot.

Fixes https://github.com/llvm/llvm-project/issues/89073

PR: https://github.com/llvm/llvm-project/pull/89069
2024-04-26 10:57:37 +01:00
Florian Hahn
01f8da908c
[LoopUnroll] Add tests for performing load CSE after unrolling.
Precommit tests for https://github.com/llvm/llvm-project/pull/83860.
2024-04-24 12:00:16 +01:00
Graham Hunter
03f852f704
[AArch64] Improve cost model for legal subvec insert/extract (#81135)
Currently we model subvector inserts and extracts as shuffles,
potentially going as far as scalarizing. If the types are legal then
they can just be simple zip/unzip operations, or possible even no-ops.
Change the cost to a relatively small one to ensure that simple loops
featuring such operations between fixed and scalable vector types that
are effectively the same at a given sve width can be unrolled and
further optimized.
2024-03-04 16:17:01 +00:00
Vedant Paranjape
e209178d64
[SimplifyIndVar] LCSSA form is destroyed by simplifyLoopIVs, preserve it (#78696)
In LoopUnroll, peelLoop is called on the loop. After the loop is peeled
it calls simplifyLoopAfterUnroll on the loop. This call to
simplifyLoopAfterUnroll doesn't preserve the LCSSA form of the parent
loop and thus during the next call to peelLoop the LCSSA form is already
broken.

LoopPeel util takes in the PreserveLCSSA argument and it passes
on the same argument to simplifyLoop which checks if the loop is in a
valid LCSSA form, when (PreserveLCSSA = true).

This causes an assert in simplifyLoop when (PreserveLCSSA = true), as
during the last call LCSSA for the loop wasn't preserved, and thus
crashes at the following assert.

assert(L->isRecursivelyLCSSAForm(*DT, *LI) &&
            "Requested to preserve LCSSA, but it's already broken.");

Upon debugging, it is evident that simplifyLoopIVs call inside
simplifyLoopAfterUnroll breaks the LCSSA form. This patch fixes
llvm#77118, it checks if the replacement of IV Users with Loop Invariant
preserves the LCSSA form. If it does not, it emits the required LCSSA
Phi instructions.
2024-02-21 17:51:56 +05:30
Graham Hunter
ad78e210bd
[NFC][AArch64] Tests for guarding unrolling with scalable vec ins/ext (#81132) 2024-02-19 09:47:49 +00:00
Sergey Kachkov
ffd79b3312
[LoopUnroll] Consider simplified operands while retrieving TTI instruction cost (#70929)
Get more precise cost of instruction after LoopUnroll considering that
some operands of it can be simplified, e.g. induction variable will be
replaced by constant after full unrolling.
2024-02-06 17:01:38 +03:00
modiking
99ddd77ed9
[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll (#78648)
Fixes [PR77842](https://github.com/llvm/llvm-project/issues/77842) where
UBSAN causes pragma full unroll to try and unroll INT_MAX times. This
sets a cap to make sure we don't attempt this and crash the compiler.

Testing:
ninja check-all with new test

---------

Co-authored-by: Nikita Popov <github@npopov.com>
2024-02-05 17:01:00 -08:00
Nikita Popov
2d69827c5c [Transforms] Convert tests to opaque pointers (NFC) 2024-02-05 11:57:34 +01:00
Nikita Popov
62ae7d976f [LoopUnroll] Fix missing sign extension
For integers larger than 64-bit, this would zero-extend a -1
value, instead of sign-extending it.

Fixes https://github.com/llvm/llvm-project/issues/80289.
2024-02-01 16:08:25 +01:00
Nikita Popov
f7b05e055f [LoopUnroll] Add test for #80289 (NFC) 2024-02-01 16:08:25 +01:00
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
Yingwei Zheng
b7f50e13d8
[InstCombine] Improve foldICmpWithDominatingICmp with DomConditionCache (#75370)
This patch uses affected values from DomConditionCache(introduced by #73662), instead of a cheap/incomplete check `getSinglePredecessor`.
2023-12-14 21:02:10 +08:00
XiangZhang
1d6a678591
[LoopUnroll] Make use of MaxTripCount for loops with "#pragma unroll" (#74703)
Fix loop unroll fail caused by branches folding.

For example:
SimplifyCFG foldloop branches then cause loop unroll failed for "#program unroll" loop.
```
#program unroll
for (int I = 0; I < ConstNum; ++I) { // folding "I < ConstNum" and "Cond2"
  if (Cond2) {
  break;
  }
  xxx loop body;
}
```

The pragma unroll metadata only takes effect if there is an exact trip
count, but not if there is an upper bound trip count. This patch make it
work with an upper bound trip count as well in shouldPragmaUnroll().

Loop unroll is important in stack nervous devices (e.g. GPU, and that is
why a lot of GPU code mark loop with "#program unroll").
It usually much simplify the address (offset) calculations in old
iterations, then we can do a lot of others optimizations, e.g, SROA, for
these simplifed address (escape alloca the whole aggregates).
2023-12-08 19:43:10 +08:00
Philip Reames
ffb2af3ed6
[SCEVExpander] Attempt to reinfer flags dropped due to CSE (#72431)
LSR uses SCEVExpander to generate induction formulas. The expander
internally tries to reuse existing IR expressions. To do that, it needs
to strip any poison generating flags (nsw, nuw, exact, nneg, etc..)
which may not be valid for the newly added users.

This is conservatively correct, but has the effect that LSR will strip
nneg flags on zext instructions involved in trip counts in loop
preheaders. To avoid this, this patch adjusts the expanded to reinfer
the flags on the CSE candidate if legal for all possible users.

This should fix the regression reported in
https://github.com/llvm/llvm-project/issues/71200.

This should arguably be done inside canReuseInstruction instead, but
doing it outside is more conservative compile time wise. Both
canReuseInstruction and isGuaranteedNotToBePoison walk operand lists, so
right now we are performing work which is roughly O(N^2) in the size of
the operand graph. We should fix that before making the per operand step
more expensive. My tenative plan is to land this, and then rework the
code to sink the logic into more core interfaces.
2023-12-07 13:20:36 -08:00
Nikita Popov
eecb99c5f6 [Tests] Add disjoint flag to some tests (NFC)
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
2023-12-05 14:09:36 +01:00
Joshua Cao
5602636835
[LoopPeel] Peel iterations based on and, or conditions (#73413)
For example, this allows us to peel this loop with a `and`:
```
for (int i = 0; i < N; ++i) {
  if (i % 2 == 0 && i < 3) // can peel based on || as well
    f1();
  f2();
```
into:
```
for (int i = 0; i < 3; ++i) { // peel three iterations
  if (i % 2 == 0)
    f1();
  f2();
}
for (int i = 3; i < N; ++i)
  f2();
```
2023-12-02 11:24:02 -08:00
Jeremy Morse
d2d9dc8eb4
[DebugInfo][RemoveDIs] Make debugify pass convert to/from RemoveDIs mode (#73251)
Debugify is extremely useful as a testing and debugging tool, and a good
number of LLVM-IR transform tests use it. We need it to support "new"
non-instruction debug-info to get test coverage, but it's not important
enough to completely convert right now (and it'd be a large
undertaking). Thus: convert to/from dbg.value/DPValue mode on entry and
exit of the pass, which gives us the functionality without any further
work. The cost is compile-time, but again this is only happening during
tests.

Tested by: the large set of debugify tests enabled here. Note the
InstCombine test (cast-mul-select.ll) that hasn't been fully enabled:
this is because there's a debug-info sinking piece of code there that
hasn't been instrumented.
2023-11-29 13:19:50 +00:00
Nikita Popov
a86bce6577 [LoopPeel] Regenerate test checks (NFC) 2023-11-28 15:42:07 +01:00
Craig Topper
03d4a9d94d
[InstCombine] Set disjoint flag when turning Add into Or. (#72702)
The disjoint flag was recently added to IR in #72583
2023-11-27 12:54:11 -08:00
Jeremy Morse
c672ba7dde
[DebugInfo][RemoveDIs] Instrument inliner for non-instr debug-info (#72884)
With intrinsics representing debug-info, we just clone all the
intrinsics when inlining a function and don't think about it any
further. With non-instruction debug-info however we need to be a bit
more careful and manually move the debug-info from one place to another.
For the most part, this means keeping a "cursor" during block cloning of
where we last copied debug-info from, and performing debug-info copying
whenever we successfully clone another instruction.

There are several utilities in LLVM for doing this, all of which now
need to manually call cloneDebugInfo. The testing story for this is not
well covered as we could rely on normal instruction-cloning mechanisms
to do all the hard stuff. Thus, I've added a few tests to explicitly
test dbg.value behaviours, ahead of them becoming not-instructions.
2023-11-26 21:24:29 +00:00
Jeremy Morse
59fab22642 [DebugInfo][RemoveDIs] Support cloning and remapping DPValues (#72546)
This patch adds support for CloneBasicBlock duplicating the DPValues
attached to instructions, and adds facilities to remap them into their new
context. The plumbing to achieve this is fairly straightforwards and
mechanical.

I've also added illustrative uses to LoopUnrollRuntime, SimpleLoopUnswitch
and SimplifyCFG. The former only updates for the epilogue right now so I've
added CHECK lines just for the end of an unrolled loop (further updates
coming later). SimpleLoopUnswitch had no debug-info tests so I've added a
new one. The two modified parts of SimplifyCFG are covered by the two
modified SimplifyCFG tests.

These are scenarios where we have to do extra cloning for copying of
DPValues because they're no longer instructions, and remap them too.
2023-11-24 15:17:32 +00:00
Ramkumar Ramachandra
4c01a58008
update_analyze_test_checks: support output from LAA (#67584)
update_analyze_test_checks.py is an invaluable tool in updating tests.
Unfortunately, it only supports output from the CostModel,
ScalarEvolution, and LoopVectorize analyses. Many LoopAccessAnalysis
tests use hand-crafted CHECK lines, and it is moreover tedious to
generate these CHECK lines, as the output fom the analysis is not
stable, and requires the test-writer to hand-craft FileCheck matches.
Alleviate this pain, and support output from:

  $ opt -passes='print<loop-accesses>'

This patch includes several non-trivial changes including:
- Preserving whitespace at the beginning of the line, so that the LAA
output can be properly indented.
- Regexes matching the unstable output, which is basically a pointer
address hex.
- Separating is_analyze from preserve_names clearly, as the former was
formerly used as an overload for the latter.

To demonstate the utility of this patch, several tests in
LoopAccessAnalysis have been auto-generated by
update_analyze_test_checks.py.
2023-10-31 14:33:53 +00:00
Aleksandr Popov
e8d5db206c
[LoopPeeling] Fix weights updating of peeled off branches (#70094)
In https://reviews.llvm.org/D64235 a new algorithm has been introduced
for updating the branch weights of latch blocks and their copies.

It increases the probability of going to the exit block for each next
peel iteration, calculating weights by (F - I * E, E), where:
- F is a weight of the edge from latch to header.
- E is a weight of the edge from latch to exit.
- I is a number of peeling iteration.

E.g: Let's say the latch branch weights are (100,300) and the estimated
trip count is 4. If we peel off all 4 iterations the weights of the
copied branches will be:
0: (100,300)
1: (100,200)
2: (100,100)
3: (100,1)

https://godbolt.org/z/93KnoEsT6

So we make the original loop almost unreachable from the 3rd peeled copy
according to the profile data. But that's only true if the profiling
data is accurate.
Underestimated trip count can lead to a performance issues with the
register allocator, which may decide to spill intervals inside the loop
assuming it's unreachable.

Since we don't know how accurate the profiling data is, it seems better
to set neutral 1/1 weights on the last peeled latch branch. After this
change, the weights in the example above will look like this:
0: (100,300)
1: (100,200)
2: (100,100)
3: (100,100)

Co-authored-by: Aleksandr Popov <apopov@azul.com>
2023-10-31 14:02:42 +01:00
David Green
75b3c3d267
[ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. (#69709)
For MVE tail predicated loops, better code can be generated by keeping
the loop whole than to unroll to an upper bound, which requires the
expansion of active lane masks that can be difficult to generate good
code for. This patch disables UpperBound unrolling when we find a
active_lane_mask in the loop.
2023-10-31 09:51:30 +00:00
Alex Richardson
e39f6c1844 [opt] Infer DataLayout from triple if not specified
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.

One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.

Differential Revision: https://reviews.llvm.org/D141060
2023-10-26 12:07:37 -07:00
Alex Richardson
e86d6a43f0 Regenerate test checks for tests affected by D141060 2023-10-04 10:51:35 -07:00
Nikita Popov
4d5525e0cb [LoopUnroll] Add tests for excessive znver3 unrolls (NFC) 2023-09-28 12:10:38 +02:00