The current canonicalization of extractelement(cast) requires that the
CastInst has only one use. However, when that use occurs inside a loop,
it still satisfies this condition, even though the cast is effectively
used multiple times, once per iteration, rather than truly being used
once.
```cpp
} else if (auto *CI = dyn_cast<CastInst>(I)) {
// Canonicalize extractelement(cast) -> cast(extractelement).
// Bitcasts can change the number of vector elements, and they cost
// nothing.
if (CI->hasOneUse() && (CI->getOpcode() != Instruction::BitCast)){
```
Before
```llvm
%34 = fptosi <4 x float> %33 to <4 x i32>
;/loop{
%40 = extractelement <4 x i32> %34, i32 %36
```
After
```llvm
;/loop{
%37 = extractelement <4 x float> %30, i32 %32
%38 = fptosi float %37 to i32
```
After canonicalization, for this particular example, it no longer uses a single instruction to cast the entire vector at once, but instead performs the cast for every element separately, which is less performant.
Ideally, we would like to check if the cast instruction **has one use and that this use is not called inside a loop**. However, InstCombine/InstCombineVectorOps.cpp does not provide utilities like `LoopInfo` to check that. It might be possible to approximate this by analyzing basic block successors or by building a dominance tree, but that may be a costly solution.
A solution to prevent this optimization could be to check if the index is an immediate value and if the use is inside the same basic block as the cast instruction:
```cpp
if (CI->hasOneUse() && (CI->getOpcode() != Instruction::BitCast)) {
Instruction *U = cast<Instruction>(*CI->user_begin());
if (U->getParent() == CI->getParent() || isa<ConstantInt>(Index)){
```
Fixes: https://github.com/llvm/llvm-project/issues/165793
The LangRef current defines out-of-range stepvector values as poison.
This property is at odds with both the expansion used for fixed-length
vectors and the equivalent ISD node, both of which implicitly truncate
out-of-range values.
InstCombine phi scalarization would always create a new binary op with
the phi as the first operand, which is not correct for non-commutable
binary ops such as sub. This fix preserves the original binary op
ordering in the new binary op and adds a test for this behavior.
Currently, this transformation can produce silently incorrect IR, and in
the case of the added test, would optimize it out entirely.
Each call to `findemandedEltsBySingleUser()` returns a new APInt that
must be OR'd with the current APInt. For large vectors with many uses
this can be slow, if the total number of operations is `{# uses} x {size
of vector}`. Instead or OR'ing, use `setBit()` on the passed-in APInt.
Closes https://github.com/llvm/llvm-project/issues/160507.
Note: Replacing other users except for `ExtElt` is a bit strange to me.
I tried to only replace `ExtElt` with a new extractelement, but it
caused regressions on `widen_extract2/3`.
In visitShuffleVectorInst there's an if block that's meant to turn
shufflevector followed by bitcast into extractelement where possible.
It assumes that there will never be bitcasts performed on vectors of ptr
as such operations are almost always illegal, and ptrtoint instructions
should be used instead.
There is however an edge case where a bitcast instruction can be
performed on a vector of type `<1 x ptr>` to turn it into type `ptr`
In this edge case, the code initializes the variable `VecBitWidth` to 0.
Then, when iterating over users that are bitcasts, an attempt is made to
create a vector of size 0, which triggers and assert.
This commit changes initialization of `VecBitWidth` to use datalayout to
find the the size of the vector instead of getPrimitiveSizeInBits method
which results in 0 for ptr and vectors of ptr.
Extracting any element from a subvector starting at index 0 is
equivalent to extracting from the original vector, i.e.
extract_elt(vector_extract(x, 0), y) -> extract_elt(x, y)
Add a new fold to instcombine to move SExt/ZExt across identity
shuffles, applying the cast after the shuffle. This sinks extends and
can enable more general additional folding of both shuffles (and
related instructions) and extends. If backends prefer splitting up doing
casts first, the extends can be hoisted again in VectorCombine for
example.
A larger example is included in the load_i32_zext_to_v4i32. The wider
extend is easier to compute an accurate cost for and targets (like
AArch64) can lower a single wider extend more efficiently than multiple
separate extends.
This is a generalization of a VectorCombine version
(https://github.com/llvm/llvm-project/pull/141109) as suggested by
@preames.
PR: https://github.com/llvm/llvm-project/pull/146901
The change adds a new instcombine pattern, and associated test, for
patterns like this:
```
%3 = shufflevector <2 x float> %1, <2 x float> poison, <4 x i32> zeroinitializer
%4 = extractelement <4 x float> %3, i64 %idx
```
The shufflevector has a splat, or broadcast, mask, so the extractelement
simply must be the first element of %1, so we transform this to
```
%2 = extractelement <2 x float> %1, i64 0
```
This canonicalizes fneg/fabs (shuffle X, poison, mask) -> shuffle
(fneg/fabs X), posion, mask
This undoes part of b331a7ebc1e02f9939d1a4a1509e7eb6cdda3d38 and
a8f13dbdeb31be37ee15b5febb7cc2137bbece67, but keeps the binary shuffle
case i.e. shuffle fneg, fneg, mask.
By pulling out the shuffle we bring it inline with the same
canonicalisation we perform on binary ops and intrinsics, which the
original commit acknowledges it goes in the opposite direction.
However nowadays VectorCombine is more powerful and can do more
optimisations when the shuffle is pulled out, so I think we should
revisit this. In particular we get more shuffles folded and can perform
scalarization.
Having a finite Depth (or recursion limit) for computeKnownBits is very
limiting, but is currently a load-bearing necessity, as all KnownBits
are recomputed on each call and there is no caching. As a prerequisite
for an effort to remove the recursion limit altogether, either using a
clever caching technique, or writing a easily-invalidable KnownBits
analysis, make the Depth argument in APIs in ValueTracking uniformly the
last argument with a default value. This would aid in removing the
argument when the time comes, as many callers that currently pass 0
explicitly are now updated to omit the argument altogether.
If we are attempting to combine shuffle+bitcast but the bitcast is
pairable with a subsequent bitcast, we should not fold the shuffle as
doing so can block further simplifications.
The motivation for this is a long-standing regression affecting SIMDe on
AArch64, introduced indirectly by the AlwaysInliner (1a2e77cf). Some
reproducers:
* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
foldAggregateConstructionIntoAggregateReuse iterates over the entries of
SourceAggregates and the order of inserted instructions depends on the
order of the iterator. Using a regular DenseMap can lead to
non-deterministic value naming/numbering.
I don't think it can actually impact the generated binary, but it makes
diffing IR more difficult.
PR: https://github.com/llvm/llvm-project/pull/132564
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
With the introduction of CmpPredicate in 51a895a (IR: introduce struct
with CmpInst::Predicate and samesign), PatternMatch is one of the first
key pieces of infrastructure that must be updated to match a CmpInst
respecting samesign information. Implement this change to Cmp-matchers.
This is a preparatory step in migrating the codebase over to
CmpPredicate. Since we no functional changes are desired at this stage,
we have chosen not to migrate CmpPredicate::operator==(CmpPredicate)
calls to use CmpPredicate::getMatching(), as that would have visible
impact on tests that are not yet written: instead, we call
CmpPredicate::operator==(Predicate), preserving the old behavior, while
also inserting a few FIXME comments for follow-ups.
We would like to optimize situations of the form that happen after loop
vectorization+SROA:
```
loop:
%phi = phi zeroinitializer, %interleaved
%deinterleave_a = shufflevector %phi, poison ; pick half of the lanes
%deinterleave_b = shufflevector %phi, posion ; pick remaining lanes
... %a = ... %b = ...
%interleaved = shufflevector %a, %b ; interleave lanes of a+b
```
where the interleave and de-interleave shuffle operations cancel each
other out.
This could be handled by `foldOpPhi` but does not currently work because
it does
not proceed when there are multiple uses of the `Phi` operation.
This extends `foldOpPhi` to allow multiple `shufflevector` uses when
they are
shown to simplify for all `Phi` input values.
Function foldAggregateConstructionIntoAggregateReuse can fold
insertvalue(phi(extractvalue(src1), extractvalue(src2)))
into
phi(src1, src2)
when we can find source aggregates in all predecessors.
This patch extends it to handle following case
insertvalue(phi(extractvalue(src1), elm2))
into
phi(src1, insertvalue(elm2))
with the condition that the predecessor without source aggregate has
only one successor.
This fixes a missed optimization caused by the `foldBitcastExtElt`
pattern interfering with other combine patterns. In the case I was
hitting, we have IR that combines two vectors into a new larger vector
by extracting elements and inserting them into the new vector.
```llvm
define <4 x half> @bitcast_extract_insert_to_shuffle(i32 %a, i32 %b) {
%avec = bitcast i32 %a to <2 x half>
%a0 = extractelement <2 x half> %avec, i32 0
%a1 = extractelement <2 x half> %avec, i32 1
%bvec = bitcast i32 %b to <2 x half>
%b0 = extractelement <2 x half> %bvec, i32 0
%b1 = extractelement <2 x half> %bvec, i32 1
%ins0 = insertelement <4 x half> undef, half %a0, i32 0
%ins1 = insertelement <4 x half> %ins0, half %a1, i32 1
%ins2 = insertelement <4 x half> %ins1, half %b0, i32 2
%ins3 = insertelement <4 x half> %ins2, half %b1, i32 3
ret <4 x half> %ins3
}
```
With the current behavior, `InstCombine` converts each vector extract
sequence to
```llvm
%tmp = trunc i32 %a to i16
%a0 = bitcast i16 %tmp to half
%a1 = extractelement <2 x half> %avec, i32 1
```
where the extraction of `%a0` is now done by truncating the original
integer. While on it's own this is fairly reasonable, in this case it
also blocks the pattern which converts `extractelement` -
`insertelement` into shuffles which gives the overall simpler result:
```llvm
define <4 x half> @bitcast_extract_insert_to_shuffle(i32 %a, i32 %b) {
%avec = bitcast i32 %a to <2 x half>
%bvec = bitcast i32 %b to <2 x half>
%ins3 = shufflevector <2 x half> %avec, <2 x half> %bvec, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
ret <4 x half> %ins3
}
```
In this PR I fix the conflict by obeying the `hasOneUse` check even if
there is no shift instruction required. In these cases we can't remove
the vector completely, so the pattern has less benefit anyway.
Also fwiw, I think dropping the `hasOneUse` check for the 0th element
might have been a mistake in the first place. Looking at
535c5d56a7
the commit message only mentions loosening the `isDesirableIntType`
requirement and doesn't mention changing the `hasOneUse` check at all.
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
This replaces some uses of isSafeToSpeculativelyExecute() with
isSafeToSpeculativelyExecuteWithVariableReplaced(), in cases where we
are guarding against operand changes rather plain speculation.
I believe that this is NFC with the current implementation of the
function (as it only does something different from loads), but this
makes us more defensive against future generalizations.
This patch is moving out stepvector intrinsic from the experimental
namespace.
This intrinsic exists in LLVM for several years now, and is widely used.
If the binop is not speculatable, and the extract index is out of
range, then scalarizing will perform the operation on a poison
operand, resulting in immediate UB, instead of the previous
poison result.
Fixes https://github.com/llvm/llvm-project/issues/97053.
Using the target number of vector elements, scaleShuffleMaskElts will try to use narrowShuffleMaskElts/widenShuffleMaskElts to scale the shuffle mask accordingly.
Working on #58895 I didn't want to create yet another case where we have to handle both re-scaling cases.
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
This transform works on single-source shuffles, which require that
the second operand is poison, not undef. Otherwise we may convert
undef to poison.
Fixes https://github.com/llvm/llvm-project/issues/92887.
Folding a select-like `shufflevector` into a floating point binary
operators can only be done if the result is preserved for both case. In
particular, if the common operand of the `shufflevector` and the
floating point binary operator can be a NaN, then the transformation
won't preserve the result value.
As part of the RemoveDIs project we need LLVM to insert instructions using
iterators wherever possible, so that the iterators can carry a bit of
debug-info. This commit implements some of that by updating the contents of
llvm/lib/Transforms/Utils to always use iterator-versions of instruction
constructors.
There are two general flavours of update:
* Almost all call-sites just call getIterator on an instruction
* Several make use of an existing iterator (scenarios where the code is
actually significant for debug-info)
The underlying logic is that any call to getFirstInsertionPt or similar
APIs that identify the start of a block need to have that iterator passed
directly to the insertion function, without being converted to a bare
Instruction pointer along the way.
Noteworthy changes:
* FindInsertedValue now takes an optional iterator rather than an
instruction pointer, as we need to always insert with iterators,
* I've added a few iterator-taking versions of some value-tracking and
DomTree methods -- they just unwrap the iterator. These are purely
convenience methods to avoid extra syntax in some passes.
* A few calls to getNextNode become std::next instead (to keep in the
theme of using iterators for positions),
* SeparateConstOffsetFromGEP has it's insertion-position field changed.
Noteworthy because it's not a purely localised spelling change.
All this should be NFC.
Otherwise we may replace undef with poison.
Note that a lot of tests regressing here already have variants
that use poison instead of undef (often in a separate
inseltpoison file), which is why I'm not adjusting them to the
new pattern.