This is another relatively small adjustment to shuffleToIdentity, which
has had a few knock-on effects requiring a few more changes. It attempts
to detect free concats that will be legalized to multiple vector
operations, for example if the lanes are [a[0], a[1], b[0], b[1]] and
a and b are v2f64 under AArch64.
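As a reduced sketch of the kind of input this targets (value and function names are illustrative, not from the patch):
define <4 x double> @reverse_halves(<2 x double> %a, <2 x double> %b) {
  %ar = shufflevector <2 x double> %a, <2 x double> poison, <2 x i32> <i32 1, i32 0>
  %br = shufflevector <2 x double> %b, <2 x double> poison, <2 x i32> <i32 1, i32 0>
  %fa = fneg <2 x double> %ar
  %fb = fneg <2 x double> %br
  %r = shufflevector <2 x double> %fa, <2 x double> %fb, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
  ret <4 x double> %r
}
The leaf lanes here are [a[0], a[1], b[0], b[1]], so the shuffles can be replaced by a single free concat plus one wide fneg:
  %ab = shufflevector <2 x double> %a, <2 x double> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %r = fneg <4 x double> %ab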
In order to do this:
- isFreeConcat detects whether the input has piece-wise identities from
multiple inputs that can become a concat.
- A tree of concat shuffles is created to concatenate the input values
into a single vector. This is a little different from most other inputs,
as they are created from multiple values that are being combined
together, and we cannot rely on the Lane0 insert location always being
valid.
- The insert location is changed to the original location instead of
being updated per item, which ensures it is valid due to the order in
which we visit and create items.
When looking up through shuffles, a Value can act as multiple different leaf
types (for example, an identity from one position and a splat from another).
We currently detect this by recalculating which type of leaf it is when
generating, but as more leaf types are added (#94954) this does not scale well.
This patch switches to using the Use, not the Value, to more accurately detect
which leaf type each Use should have.
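As a hypothetical illustration of the ambiguity, the same Value can appear as two different leaf kinds through its two Uses in a single shuffle:
  %s = shufflevector <4 x i32> %x, <4 x i32> %x, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 4, i32 4, i32 4>
The first Use of %x is an identity leaf, while the second Use is a splat of lane 0, so classifying the Value alone cannot give a single answer.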
The `VectorCombine::foldShuffleToIdentity` fold does not preserve fast-math
flags when folding the shuffle, leading to unexpected vectorized results
and missed optimizations with FMA instructions.
We can conservatively take the maximal legal set of fast-math flags, i.e. the
flags common to the instructions being combined, whenever we fold shuffles to
identity, to enable further optimizations in the backend.
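For instance (a reduced sketch; value names are illustrative):
define <8 x float> @fmf(<8 x float> %a, <8 x float> %b) {
  %a0 = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %a1 = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %b0 = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %b1 = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %s0 = fadd fast <4 x float> %a0, %b0
  %s1 = fadd fast <4 x float> %a1, %b1
  %r = shufflevector <4 x float> %s0, <4 x float> %s1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %r
}
This should fold to %r = fadd fast <8 x float> %a, %b, rather than dropping the fast flag.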
Co-authored-by: Henry Jiang <henry.jiang1@ibm.com>
This removes the check that both operands of the original shuffle are
instructions, a relic from a previous version that kept more of the
values involved as Instructions.
Other than some additional checks needed for compare predicates and
selects with scalar condition operands, these are relatively simple
additions to what already exists.
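As a reduced sketch of the select case (value names illustrative), both compares must use the same predicate to be merged:
define <8 x float> @minlike(<8 x float> %a, <8 x float> %b) {
  %a0 = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %a1 = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %b0 = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %b1 = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %c0 = fcmp olt <4 x float> %a0, %b0
  %c1 = fcmp olt <4 x float> %a1, %b1
  %s0 = select <4 x i1> %c0, <4 x float> %a0, <4 x float> %b0
  %s1 = select <4 x i1> %c1, <4 x float> %a1, <4 x float> %b1
  %r = shufflevector <4 x float> %s0, <4 x float> %s1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %r
}
which folds to:
  %c = fcmp olt <8 x float> %a, %b
  %r = select <8 x i1> %c, <8 x float> %a, <8 x float> %b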
This just adds splat constants, which can be treated like any other
splat, hopefully keeping them very simple. It does not try to handle
more complex constant vectors yet, just the more common splats.
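For example (illustrative), a splat constant operand is treated like any other splat leaf:
  %x0 = shufflevector <8 x float> %x, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %x1 = shufflevector <8 x float> %x, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %m0 = fmul <4 x float> %x0, <float 2.0, float 2.0, float 2.0, float 2.0>
  %m1 = fmul <4 x float> %x1, <float 2.0, float 2.0, float 2.0, float 2.0>
  %r = shufflevector <4 x float> %m0, <4 x float> %m1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
folds to a single fmul of %x by a wider 2.0 splat.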
This is probably the most involved addition, as it tries to make use of
isTriviallyVectorizable with isVectorIntrinsicWithScalarOpAtArg to handle a
number of different intrinsics that are all lane-wise. Additional tests have
been added for some of the different intrinsics from
isVectorIntrinsicWithScalarOpAtArg / isVectorIntrinsicWithOverloadTypeAtArg.
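As a reduced illustration, a lane-wise intrinsic such as llvm.fabs can be widened through the shuffles:
  %x0 = shufflevector <8 x half> %x, <8 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %x1 = shufflevector <8 x half> %x, <8 x half> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %f0 = call <4 x half> @llvm.fabs.v4f16(<4 x half> %x0)
  %f1 = call <4 x half> @llvm.fabs.v4f16(<4 x half> %x1)
  %r = shufflevector <4 x half> %f0, <4 x half> %f1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
folds to %r = call <8 x half> @llvm.fabs.v8f16(<8 x half> %x). For intrinsics with a scalar operand, such as the i32 power operand of llvm.powi, that operand must be identical on each half and is kept as-is.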
The matcher m_Trunc() matches an Operator with a given Opcode, which
could be either an Instruction or a ConstantExpr.
VectorCombine::foldTruncFromReductions() incorrectly assumes that the
pattern matched is always an Instruction, and attempts a cast. Fix this.
Fixes #88796.
The shuffleToIdentity fold needs to be a bit more careful about the difference
between call instructions and intrinsics. The second can be handled, but the
first should result in bailing out. This patch also adds some extra intrinsic
tests from #91000.
Fixes #91078
This patch adds a basic version of a combine that attempts to remove
shuffles that when combined simplify away to an identity shuffle. For
example:
%ab = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%at = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 7, i32 6, i32 5, i32 4>
%abt = fneg <4 x half> %at
%abb = fneg <4 x half> %ab
%r = shufflevector <4 x half> %abt, <4 x half> %abb, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
By looking through the shuffles and fneg, it can be simplified to:
%r = fneg <8 x half> %a
The code tracks each lane starting from the original shuffle, keeping
track of a vector of {src, idx}. As we propagate up through the
instructions we will either look through intermediate instructions
(binops and unops) or see a collection of lanes that all have the same
src and incrementing idx (an identity). We can also see a single value
with identical lanes, which we can treat like a splat.
Only the basic version is added here, handling identities, splats,
binops and unops. In follow-up patches other instructions can be added
such as constants, intrinsics, cmp/sel and zext/sext/trunc.
Ensure the getShuffleCost arguments/instruction args are populated - minor extension to #88743 to help improve shuffle costs for certain corner cases (e.g. shuffles of loads)
Another step towards cleaning up shuffles that have been split, often across bitcasts between SSE intrinsics.
Strip shuffles entirely if we fold to an identity shuffle.
Prior to #85863, the required parameters of llvm::isKnownNonZero were
Value and DataLayout. After, they are Value, Depth, and SimplifyQuery,
where SimplifyQuery is implicitly constructible from DataLayout. The
change to move Depth before SimplifyQuery needed callers to be updated
unnecessarily, and as commented in #85863, we actually want Depth to be
after SimplifyQuery anyway so that it can be defaulted and the caller
does not need to specify it.
Don't just assert that the src/dst vector element counts are multiples of one another - in general IR, non-multiple counts can actually occur.
Reported by @mikaelholmen
When creating cast(shuffle(x,y)) we were only adding the cast() to the worklist, not the new shuffle, preventing recursive combines.
foldShuffleOfBinops is also failing to do this, but I still need to add test coverage for this.
Only fold bitcast(shuffle(x,y)) -> shuffle(bitcast(x),bitcast(y)) if we won't actually increase the number of bitcasts (i.e. x or y is already bitcasted from the correct type).
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
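For reference, a hypothetical reduced example of the store(interleaving shuffle) pattern that can become an ST3:
define void @st3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, ptr %p) {
  %ab = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %cp = shufflevector <4 x i32> %c, <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %i = shufflevector <8 x i32> %ab, <8 x i32> %cp, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
  store <12 x i32> %i, ptr %p
  ret void
}
Passing the shuffle's context instruction lets the cost model see that its user is the store.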
This optimization, much like the existing foldShuffleOfBinops, can cause a
lot of regressions. Add a quick debug message to make the costs more
obvious.
Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.
Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'".
Reapplied with a clang codegen test fix.
Further prep work for #67803
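A sketch of the generalized fold with a scaled mask (illustrative types, little-endian lane numbering):
  %s = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  %r = bitcast <4 x i32> %s to <8 x i16>
becomes:
  %a16 = bitcast <4 x i32> %a to <8 x i16>
  %b16 = bitcast <4 x i32> %b to <8 x i16>
  %r = shufflevector <8 x i16> %a16, <8 x i16> %b16, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 2, i32 3, i32 10, i32 11>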
This is part of a series of small patches to compute shuffle masks for
the couple of cases where we call getShuffleCost without one. My goal is
to add an invariant that all calls to getShuffleCost for fixed length
vectors have a mask.
Note that this code appears to be reachable with scalable vectors, and
thus we have to only pass a non-empty mask when the number of elements
is precisely known.
Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.
If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.
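For example (a reduced sketch), for an add reduction the truncation can be deferred until after the reduction:
  %t = trunc <8 x i64> %x to <8 x i32>
  %r = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %t)
becomes:
  %r64 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %x)
  %r = trunc i64 %r64 to i32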
Fixes https://github.com/llvm/llvm-project/issues/81469
Part of the "RemoveDIs" project to remove debug intrinsics requires
passing block-positions around in iterators rather than as instruction
pointers, allowing some debug-info to reside in BasicBlock::iterator.
This means getInsertionPointAfterDef has to return an iterator, and as
it can return 'no instruction', that means returning an optional iterator.
This patch changes the signature for getInsertionPtAfterDef and then
patches up the various places that use it to handle the different type.
This would overall be an NFC patch, however in
InstCombinerImpl::freezeOtherUses I've started skipping any debug
intrinsics at the returned insert-position. This should not have any
_meaningful_ effect on the compiler output: at worst it means variable
assignments that are skipped will now cover the freeze instruction and
anything inserted before it, which should be inconsequential.
Sadly: this makes the function signature ugly. This is probably the
ugliest piece of fallout for the "RemoveDIs" work, but it serves the
overall purpose of improving compile times and not allowing `-g` to
affect compiler output, so should be worthwhile in the end.
The whole point of the GenericDomTree.h vs
GenericDomTreeConstruction.h distinction is that the latter only
needs to be included in the source file and not the header.
When getSplatOp returns nullptr, the intrinsic cannot be scalarized.
This patch fixes a crash from trying to scalarize the VPIntrinsic when
getSplatOp returns nullptr, and includes a test case.
This fixes https://github.com/llvm/llvm-project/issues/72034.
When dealing with a truncating shuffle, we can end up in a situation
where the type passed to getShuffleCost is the type of the result of the
shuffle, and the mask references an element which is out of bounds of
the result vector.
If dealing with truncating shuffles, pass the type of the input vectors
to `getShuffleCost()` in order to avoid an out-of-bounds assertion.
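A reduced illustration: in the truncating shuffle below, mask element 8 refers to the first lane of %y, which is out of bounds of the <2 x i8> result type but not of the <8 x i8> input type:
  %s = shufflevector <8 x i8> %x, <8 x i8> %y, <2 x i32> <i32 0, i32 8>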
Previously we were just matching against a fixed list of VP intrinsics
that we knew couldn't be speculated, but we can reuse the logic in
isSafeToSpeculativelyExecuteWithOpcode. This also allows speculation in
more cases, e.g. when the divisor is known to be non-zero.
Unfortunately we can't reuse the exact same function call for VP
intrinsics with functional intrinsics instead of opcodes, because
isSafeToSpeculativelyExecute needs an instruction that already exists.
So this just copies the logic by peeking into the function attributes of
the intrinsic.
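For instance (a hypothetical reduced case), a vp.udiv whose divisor is a known-non-zero splat can now be treated as speculatable:
  %d = call <4 x i32> @llvm.vp.udiv.v4i32(<4 x i32> %x, <4 x i32> <i32 7, i32 7, i32 7, i32 7>, <4 x i1> %m, i32 %evl)
Since the divisor is never zero, executing the division on all lanes regardless of the mask and EVL cannot trap.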
Allow length changing shuffle masks in the "bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'" fold.
It also exposes some poor shuffle mask detection for extract/insert subvector cases inside improveShuffleKindFromMask
First stage towards addressing Issue #67803
We should check whether the element type is non-byte-sized, not
the vector type. For types like <32 x i1> the whole type is
byte-sized, but the individual elements (that we scalarize to)
are not.
Fixes https://github.com/llvm/llvm-project/issues/67060.