SCCP can use PredicateInfo to constrain ranges based on assume and
branch conditions. Currently, this is only enabled during IPSCCP.
This enables it for SCCP as well, which runs after functions have
already been simplified, while IPSCCP runs pre-inline. To a large
degree, CVP already handles range-based optimizations, but SCCP is more
reliable for the cases it can handle. In particular, SCCP works reliably
inside loops, which is something that CVP struggles with due to LVI
cycles.
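A minimal sketch of the kind of fold this enables (hypothetical IR, not taken from the patch): the branch condition constrains %x inside the taken block, so the second comparison folds to true.
```
define i32 @f(i32 %x) {
entry:
  %cmp = icmp ult i32 %x, 8
  br i1 %cmp, label %then, label %else

then:
  ; PredicateInfo constrains %x to [0, 8) here, so SCCP folds %cmp2 to true.
  %cmp2 = icmp ult i32 %x, 16
  %r = zext i1 %cmp2 to i32
  ret i32 %r

else:
  ret i32 0
}
```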
I have made various optimizations to make PredicateInfo more efficient,
but unfortunately this still has significant compile-time cost (around
0.1-0.2%).
This reverts commit e9de32fd159d30cfd6fcc861b57b7e99ec2742ab due to
multiple performance regressions observed across downstream Numba
benchmarks (https://github.com/llvm/llvm-project/issues/138509#issuecomment-3193855772).
While avoiding non-trivial unswitches on newly-cloned loops helps
mitigate the pathological case reported in https://github.com/llvm/llvm-project/issues/138509,
it may also make the IR less friendly to vectorization and loop
canonicalization (in the reported test, the new specialized loops
previously contained no select with a loop-carried dependence),
prompting us to reconsider the abovementioned approach.
The vector combiner will process all instructions as it first loops
through the function, adding any newly added and deleted instructions to
a worklist which is then processed when all nodes are done. This leaves
extra uses in the graph while the initial processing is performed, leading
to sub-optimal decisions being made for other combines. This changes it
so that trivially dead instructions are removed immediately. The main
change this requires is to make sure iterator invalidation does not
occur.
Directly emit shl instead of a multiply if VF * Step is a power-of-2. The
main motivation here is to prepare the code and test for directly
generating and expanding a SCEV expression of the minimum iteration
count. SCEVExpander will directly emit shl for multiplies with
powers-of-2.
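For example (hypothetical operand %n, assuming VF * Step is 8), the emitted check would use a shift rather than a multiply:
```
; instead of %0 = mul i64 %n, 8
%0 = shl i64 %n, 3
```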
InstCombine will also perform this combine, so end-to-end this should
effectively be NFC.
PR: https://github.com/llvm/llvm-project/pull/153495
Reopen #128938.
Attempt to shrink the size of vector loads where only some of the
incoming lanes are used for rebroadcasts in shufflevector instructions.
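A minimal sketch of the idea (hypothetical IR): only lane 0 of the wide load feeds the broadcast, so the load can be narrowed.
```
%v  = load <8 x float>, ptr %p, align 32
%bc = shufflevector <8 x float> %v, <8 x float> poison, <8 x i32> zeroinitializer
; ->
%v1  = load <1 x float>, ptr %p, align 32
%bc1 = shufflevector <1 x float> %v1, <1 x float> poison, <8 x i32> zeroinitializer
```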
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
We can derive and upgrade the alignment of loads/stores using other
well-aligned loads/stores. This optimization does a single forward pass through
each basic block and uses the alignment and offset of loads/stores to
derive the best possible alignment for a base pointer, caching the
result. If it encounters another load/store based on that pointer, it
tries to upgrade the alignment. The optimization must be a forward pass within a basic
block because control flow and exception throwing can impact alignment guarantees.
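A minimal sketch (hypothetical IR): the store proves %p is 16-aligned, so the load at offset 4 from the same base can be upgraded.
```
store i32 0, ptr %p, align 16
%q = getelementptr inbounds i8, ptr %p, i64 4
%v = load i32, ptr %q, align 1    ; can be upgraded to align 4
```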
---------
Co-authored-by: Nikita Popov <github@npopov.com>
Materialize the vector trip count computation using VPInstruction
instead of directly creating IR. This is one of the last few steps
needed to model the full vector skeleton in VPlan. It also simplifies
vector-trip count computations for scalable vectors, as we can re-use
the UF x VF computation.
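Roughly, the computation being materialized is the following (sketch with a hypothetical trip count %n and a fixed VF x UF of 8):
```
%n.mod.vf = urem i64 %n, 8
%vec.tc   = sub i64 %n, %n.mod.vf
```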
PR: https://github.com/llvm/llvm-project/pull/151925
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
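For illustration (assuming the updated intrinsics simply drop the leading size operand):
```
%a = alloca [16 x i8]
; before:
call void @llvm.lifetime.start.p0(i64 16, ptr %a)
; after:
call void @llvm.lifetime.start.p0(ptr %a)
```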
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
LICM tries to reassociate GEPs in order to hoist an invariant GEP.
Currently, it also does this in the case where the GEP has a constant
offset.
This is usually undesirable. From a back-end perspective, constant GEPs
are usually free because they can be folded into addressing modes, so
this just increases register pressure. From a middle-end perspective,
keeping constant offsets last in the chain makes it easier to analyze
the relationship between multiple GEPs on the same base, especially
after CSE.
The worst that can happen here is if we start with something like
```
loop {
  p + 4*x
  p + 4*x + 1
  p + 4*x + 2
  p + 4*x + 3
}
```
And LICM converts it into:
```
p.1 = p + 1
p.2 = p + 2
p.3 = p + 3
loop {
  p + 4*x
  p.1 + 4*x
  p.2 + 4*x
  p.3 + 4*x
}
```
Which is much worse than leaving it for CSE to convert to:
```
loop {
  p2 = p + 4*x
  p2 + 1
  p2 + 2
  p2 + 3
}
```
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.
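A sketch of what the ingested metadata might look like (assuming the metadata carries a single integer operand; latch branch weights of 1 for the exit and 31 for the back edge give an estimate of about 32):
```
  br i1 %exit.cond, label %exit, label %loop.body, !prof !0, !llvm.loop !1

!0 = !{!"branch_weights", i32 1, i32 31}
!1 = distinct !{!1, !2}
!2 = !{!"llvm.loop.estimated_trip_count", i32 32}
```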
An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
Similar to #150639, this fixes the AggressiveInstCombine fold that converts
lookup tables to cttz instructions when the GEP types are not array types, i.e.
`gep i16 @glob, i64 %idx` instead of `gep [64 x i16] @glob, i64 0, i64 %idx`.
https://github.com/llvm/llvm-project/pull/147026 will enable sub
reductions, which require that the phi value is the first operand since
they aren't commutative. This re-orders the operands when executing
reductions, which actually matches other existing code in
VPReductionRecipe::execute.
After PR #136329, shuffle indices may differ, which can cause the
existing cost-based logic to miss optimisation opportunities for
binop/shuffle sequences.
This patch improves the cost model in foldSelectShuffle to more
accurately assess costs, recognising when certain duplicate shuffles do
not require actual instructions.
Additionally, in break-even cases, this change introduces a check for
whether the pattern ultimately feeds into a vector reduction, allowing
the transform to proceed when it is likely to be profitable overall.
Track newly-cloned loops coming from unswitching non-trivial invariant
conditions, so as to prevent conditions in such cloned blocks from
being unswitched again.
Fixes: https://github.com/llvm/llvm-project/issues/138509.
Since the costs were added, the codegen for i8/i16 and/or/xor reductions
has improved. This updates the cost model to produce the same costs in
terms of the number of instructions.
This preserves the nuw/nsw flags on widened truncs by checking for
TruncInst in the VPIRFlags constructor.
The motivation for this is to be able to fold away some redundant truncs
feeding into uitofps (or potentially narrow the inductions feeding them).
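For example (hypothetical IR), keeping nuw on the widened trunc allows folds like:
```
%t = trunc nuw i64 %iv to i32
%f = uitofp i32 %t to float
; nuw guarantees no set bits are dropped, so this can become:
%f.wide = uitofp i64 %iv to float
```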
Add a new fold to instcombine to move SExt/ZExt across identity
shuffles, applying the cast after the shuffle. This sinks extends and
can enable more general additional folding of both shuffles (and
related instructions) and extends. If backends prefer to split things up
and perform the casts first, the extends can be hoisted again, for example
in VectorCombine.
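A rough sketch of the shape of the fold (hypothetical IR, using a length-changing identity mask):
```
%e = zext <2 x i8> %x to <2 x i32>
%s = shufflevector <2 x i32> %e, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
; ->
%s.narrow = shufflevector <2 x i8> %x, <2 x i8> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
%e.wide   = zext <4 x i8> %s.narrow to <4 x i32>
```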
A larger example is included in the load_i32_zext_to_v4i32 test. The wider
extend is easier to compute an accurate cost for and targets (like
AArch64) can lower a single wider extend more efficiently than multiple
separate extends.
This is a generalization of a VectorCombine version
(https://github.com/llvm/llvm-project/pull/141109) as suggested by
@preames.
PR: https://github.com/llvm/llvm-project/pull/146901
More closely match improveShuffleKindFromMask's shuffle ordering by
trying to match SK_InsertSubvector shuffle patterns before SK_Select
- both can match many of the same patterns, but it's much easier to
recognise when a SK_InsertSubvector can be converted to SK_Select than
vice-versa.
Another step towards #145335 - which I'm hoping will allow us to
generalise improveShuffleKindFromMask and remove getInstructionCost's
shuffle matching entirely.
Currently, LLVM fails to convert certain pblendvb intrinsics into select
instructions when the blend mask is derived from complex boolean logic
operations. This occurs even when the mask is ultimately based on
sign-extended comparison results, preventing further optimization
opportunities.
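An illustrative case (hypothetical IR; per the SSE4.1 blendv semantics, lanes whose mask sign bit is set take the second source):
```
%m1 = sext <16 x i1> %c1 to <16 x i8>
%m2 = sext <16 x i1> %c2 to <16 x i8>
%mask = and <16 x i8> %m1, %m2
%blend = call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %a, <16 x i8> %b, <16 x i8> %mask)
; can become:
%c = and <16 x i1> %c1, %c2
%sel = select <16 x i1> %c, <16 x i8> %b, <16 x i8> %a
```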
Fixes #66513
---------
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
This puts the onus on the caller to ensure the result type is big enough.
In the unlikely event a cropped result is required, explicitly
truncate a safe value.
This patch fixes cost estimation for extractelements from non-power-of-2
vectors, defined as subvector extracts. In this case the subvector size
might not be adjusted to a whole register size, so we need to take the
minimum of the whole vector size and the actual difference to prevent a
compiler crash.
Fixes #143513
It will run twice in the non-LTO pipeline at `O1` or higher. In the LTO post-link pipeline, it will run once at `O2` or higher, since inlining and SROA don't run at `O1`.
Better to preserve the original order of the alternate nodes to avoid
inter-lane shuffling; select/insert subvector patterns provide better
perf.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/136329
While sinking loop-invariant instructions from the preheader to
the exit block, we were skipping instructions because the instruction
iterator was decremented twice.
getSameOpcode may in some cases consider 2 compares as having the same
opcode, even though previously they were considered alternate. This can
happen because getSameOpcode loses info about previous instructions
and their states. We need to use the isAlternateInstruction function
instead for the correct analysis.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/133769
After unrolling, there may be additional simplifications that can be
applied. One example is removing SCALAR-STEPS for the first part where
only the first lane is demanded.
This removes redundant adds of 0 from a large number of tests (~200),
many of which I am still working on updating.
In preparation for removing redundant WideIV steps added in
https://github.com/llvm/llvm-project/pull/119284.
PR: https://github.com/llvm/llvm-project/pull/123655
Closes #115683.
An overflow arithmetic instruction plus an extractvalue are usually generated
when a division is being replaced, but the zero check may still be
there. In that case, hoist these two instructions out of the basic
block, and let later optimizations take care of the unnecessary zero
checks.
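A sketch of the pattern (hypothetical IR): the intrinsic and its extractvalue sit in a block guarded by a divisor zero check that the replaced division no longer needs; hoisting them into the predecessor is safe because the intrinsic is speculatable, and later passes can then clean up the check.
```
  %b.zero = icmp eq i64 %b, 0
  br i1 %b.zero, label %cont, label %guarded

guarded:
  %mul = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %a, i64 %b)
  %ov  = extractvalue { i64, i1 } %mul, 1
  br label %cont
```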
In the previous code (#128032), the destination vector was passed as the
getShuffleCost argument. Because the shuffle mask indexes into the
elements of the two source vectors, its maximum value is twice the
size of the source vector. This causes a problem if the destination
vector is smaller than the source vector and the mask specifies an index
that exceeds the size of the destination vector.
Fix the problem by correcting the previous code, which was using the wrong
argument in the cost calculation.
Fixes #130250
Relative to the previous attempt this includes two fixes:
* Adjust callCapturesBefore() to not skip captures(ret: address,
provenance) arguments, as these will not count as a capture
at the call-site.
* When visiting uses during stack slot optimization, don't skip
the ModRef check for passthru captures. Calls can both modref
and be passthru for captures.
------
This extends CaptureTracking to support inferring non-trivial
CaptureInfos. The focus of this patch is to only support FunctionAttrs,
other users of CaptureTracking will be updated in followups.
The key API changes here are:
* DetermineUseCaptureKind() now returns a UseCaptureInfo where the UseCC
component specifies what is captured at that Use and the ResultCC
component specifies what may be captured via the return value of the
User. Usually only one or the other will be used (corresponding to
previous MAY_CAPTURE or PASSTHROUGH results), but both may be set for
call captures.
* The CaptureTracking::captures() extension point is passed this
  UseCaptureInfo as well and can then decide what to do with it by
  returning an Action, which is one of:
  * Stop: stop traversal.
  * ContinueIgnoringReturn: continue traversal but don't follow the
    instruction return value.
  * Continue: continue traversal and follow the instruction return
    value if it has additional CaptureComponents.
For now, this patch retains the (unsound) special logic for comparison
of null with a dereferenceable pointer. I'd like to switch key code to
take advantage of address/address_is_null before dropping it.
This PR mainly intends to introduce necessary API changes and basic
inference support, there are various possible improvements marked with
TODOs.