This patch introduces a new function:
AArch64Subtarget::getVScaleForTuning
that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:
1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".
For CPUs of type "generic" I have assumed that vscale=2.
New tests added here:
Analysis/CostModel/AArch64/sve-gather.ll
Analysis/CostModel/AArch64/sve-scatter.ll
Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll
Differential Revision: https://reviews.llvm.org/D110259
This patch supplements missing test case for D107353.
- Fix wrong descriptions in 64-bit mode test case
- Added testcase under 32-bit mode
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D108507
At the moment, rewriteLoopExitValue forgets the current phi node in the
loop that collects phis to rewrite. A few lines after the value is
forgotten, SCEV is used again to analyze incoming values and
potentially expand SCEV expression. This means that another SCEV is
created for PN, before the IR is actually updated in the next loop.
This leads to accessing invalid cached expression in combination with
D71539.
PN should only be changed once the actual incoming exit value is set in
the next loop. Moving invalidation there should ensure that PN is
invalidated in all relevant cases.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D111495
bitcast (inselt (bitcast X), Y, 0) --> or (and X, MaskC), (zext Y)
https://alive2.llvm.org/ce/z/Ux-662
Similar to D111082 / db231ebdb07f :
We want to avoid relatively opaque vector ops on types that are
likely supported by the backend as scalar integers. The bitwise
logic ops are more likely to allow further combining.
We probably want to generalize this to allow a shift too, but
that would oppose instcombine's general rule of not creating
extra instructions, so that's left as a potential follow-up.
Alternatively, we could do that transform in VectorCombine
with the help of the TTI cost model.
This is part of solving:
https://llvm.org/PR52057
This patch fixes a crash when despeculating ctlz/cttz intrinsics with
scalable-vector types. It is not safe to speculatively get the size of
the vector type in bits in case the vector type is not a fixed-length type. As
it happens this isn't required as vector types are skipped anyway.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112141
Removed/replaced RUN lines using legacy PM syntax in favor of using
-passes in lit tests for Float2Int, MetaRenamer, StripDeadPrototypes
and StripSymbols.
To guarantee convergence of the algorithm each optimization step should decrease number of instructions when IR is modified. This property is not held in this test case. The problem is that SCEV Expander may do "unexpected" reassociation what results in creation of new min/max chains and introduction of extra instructions. As a result on each step we indefinitely optimize back and forth.
The solution is to restrict SCEV Expander to perform uncontrolled reassociations by means of "Unknown" expressions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D112060
This is trivial. It was left out of the original review only because we had multiple copies of the same code in review at the same time, and keeping them in sync was easiest if the structure was kept in sync.
This patch duplicates a bit of logic we apply to comparisons encountered during the IV users walk to conditions which feed exit conditions. Why? simplifyAndExtend has a very limited list of users it walks. In particular, in the examples is stops at the zext and never visits the icmp. (Because we can't fold the zext to an addrec yet in SCEV.) Being willing to visit when we haven't simplified regresses multiple tests (seemingly because of less optimal results when computing trip counts).
Note that this can be trivially extended to multiple exiting blocks. I'm leaving that to a future patch (solely to cut down on the number of versions of the same code in review at once.)
Differential Revision: https://reviews.llvm.org/D111896
Using BPI within loop predication is non-trivial because BPI is only
preserved lossily in loop pass manager (one fix exposed by lossy
preservation is up for review at D111448). However, since loop
predication is only used in downstream pipelines, it is hard to keep BPI
from breaking for incomplete state with upstream changes in BPI.
Also, correctly preserving BPI for all loop passes is a non-trivial
undertaking (D110438 does this lossily), while the benefit of using it
in loop predication isn't clear.
In this patch, we rely on profile metadata to get almost similar benefit as
BPI, without actually using the complete heuristics provided by BPI.
This avoids the compile time explosion we tried to fix with D110438 and
also avoids fragile bugs because BPI can be lossy in loop passes
(D111448).
Reviewed-By: asbirlea, apilipenko
Differential Revision: https://reviews.llvm.org/D111668
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.
Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.
Tests need to be updated to not use `opt -O# -pass-name`.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D112036
Need to follow the order of the reused scalars from the
ReuseShuffleIndices mask rather than rely on the natural order.
Differential Revision: https://reviews.llvm.org/D111898
The goal is to allow grafting an inline tree from Clang or GCC into a new compilation without affecting other functions. For GCC, we're doing this by extracting the inline tree from dwarf information and generating the equivalent remarks.
This allows easier side-by-side asm analysis and a trial way to see if a particular inlining setup provides benefits by itself.
Testing:
ninja check-all
Reviewed By: wenlei, mtrofin
Differential Revision: https://reviews.llvm.org/D110658
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.
The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is not reason to track FirstInst and return
it.
Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.
Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.
It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.
Fixes PR47437
Differential Revision: https://reviews.llvm.org/D111938
And another attempt to start untangling this ball of threads around gather.
There's `TTI::prefersVectorizedAddressing()`hoop, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally can not deal with vectors of addresses,
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.
This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D111546
This removes an over-specified fold. The more general transform
was added with:
727e642e970d
There's a difference on an existing test that shows a potentially
unnecessary use limit on an icmp fold.
That fold is in InstCombinerImpl::foldICmpSubConstant(), and IIRC
there was some back-and-forth on it and similar folds because they
could cause analysis/passes (SCEV, LSR?) to miss optimizations.
Differential Revision: https://reviews.llvm.org/D111410
(iN X s>> (N-1)) & Y --> (X < 0) ? Y : 0
https://alive2.llvm.org/ce/z/qeYhdz
I was looking at a missing abs() transform and found my way to this
generalization of an existing fold that was added with D67799.
As discussed in that review, we want to make sure codegen handles
this difference well, and for all of the targets/types that I
spot-checked, it looks good.
I am leaving the existing fold in place in this commit because
it covers a potentially missing icmp fold, but I plan to remove
that as a follow-up commit as suggested during review.
Differential Revision: https://reviews.llvm.org/D111410
This patch adds a pass option to only run transforms that scalarize
vector operations and do not create new vector instructions.
When running VectorCombine early in the pipeline introducing new vector
operations can have negative effects, like blocking loop or SLP
vectorization. To avoid regressions, restrict the early VectorCombine
run (when using -enable-matrix) to only perform scalarization and not
introduce new vector operations.
This is done as option to the pass directly, which is then set when
adding the pass to the pipeline. This is done for the new pass manager
only.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D111800
Add lshr (sext i1 X to iN), C --> select (X, -1 >> C, 0) case. This expands
C == N-1 case to arbitrary C.
Fixes PR52078.
Reviewed By: spatel, RKSimon, lebedev.ri
Differential Revision: https://reviews.llvm.org/D111330
This patch teaches SCEV two implication rules:
x <u y && y >=s 0 --> x <s y,
x <s y && y <s 0 --> x <u y.
And all equivalents with signs/parts swapped.
Differential Revision: https://reviews.llvm.org/D110517
Reviewed By: nikic
Fix a bug when getInlineCost incorrectly returns a
cost/threshold pair instead of an explicit never inline.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D111687
Need to check that either Idx is UndefMaskElem and value is UndefValue
or Idx is valid and value is the same as the scalar value in the node.
Differential Revision: https://reviews.llvm.org/D111802
This shows the transform side of D109457, but also lets us try other approaches to the same problem. The common trend to all is that we need to explicit reason about UB to disallow possibility of infinite loops.
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)
Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)
As we have established over the course of last ~70 patches, (wow)
`BaseT::getInterleavedMemoryOpCos()` is absolutely bogus,
it is usually almost an order of magnitude overestimation,
so i would claim that we should at least use the hardcoded costs
of fully interleaved load groups.
We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D111174