If the mask of a (fixed-vector) deinterleaved load is assembled by a
`vector.interleaveN` intrinsic, any intrinsic arguments that are
all-zeros are treated as gaps.
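As a rough, hedged sketch (the factor, types, and value names here are
invented for illustration), a factor-3 mask whose last field is a gap could
be built like this:
```
; the all-zeros third argument marks field 2 as a gap
%mask = call <12 x i1> @llvm.vector.interleave3.v12i1(<4 x i1> %m, <4 x i1> %m, <4 x i1> zeroinitializer)
```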
For a deinterleaved masked.load / vp.load, if its mask, `%c`, is
synthesized by the following snippet:
```
%m = shufflevector %s, poison, <0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3>
%g = <1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0>
%c = and %m, %g
```
Then we know that `%g` is the gap mask and `%s` is the mask for each
field / component. This patch teaches the InterleavedAccess pass to
recognize such patterns.
This extends the fixed vector lowering to support the case where the
mask is formed via a shufflevector idiom.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
This completes the basic support for masked.load and masked.store in
InterleavedAccess. The backend support was already added via the
intrinsic lowering path and the common code structure (in RISCV at least).
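For reference, a minimal sketch of the fixed-vector shape involved (the
width, element type, factor-2 layout, and names like %p / %imask are
illustrative only):
```
%v = call <8 x i32> @llvm.masked.load.v8i32.p0(ptr %p, i32 4, <8 x i1> %imask, <8 x i32> poison)
%f0 = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%f1 = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
```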
Note that this isn't enough to enable this in LV yet. We still need support
for recognizing an interleaved mask via a shufflevector in getMask.
Follow-up to 28417e64, and the whole line of work started with 4b81dc7.
This change merges the handling for VPStore - currently in
lowerInterleavedVPStore - into the existing dedicated routine used in
the shuffle lowering path. This removes the last use of
lowerInterleavedVPStore, so it can be deleted.
This contains two functional changes.
First, like in 28417e64, merging support for vp.store exposes the
strided store optimization for code using vp.store.
Second, it seems the strided store case had a significant missed
optimization. We were performing the strided store at the full
unit-strided store type width (i.e. LMUL) rather than reducing it to match
the input width. This became obvious when I tried to use the mask
created by the helper routine, as it caused a type incompatibility.
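To make that concrete, here is a hedged, hand-written sketch (the stride,
mask names, and types are illustrative, not the exact IR the pass emits):
only one field of a factor-2 group carries data, so the store can be
performed at the narrow input width as a strided store:
```
; before: the store is as wide as the full interleaved type; %m8 masks off the gap lanes
%wide = shufflevector <4 x i32> %x, <4 x i32> poison, <8 x i32> <i32 0, i32 poison, i32 1, i32 poison, i32 2, i32 poison, i32 3, i32 poison>
call void @llvm.vp.store.v8i32.p0(<8 x i32> %wide, ptr %p, <8 x i1> %m8, i32 8)

; after: a strided store sized to the <4 x i32> input
call void @llvm.experimental.vp.strided.store.v4i32.p0.i64(<4 x i32> %x, ptr %p, i64 8, <4 x i1> %m4, i32 4)
```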
Normally, I'd try not to include an optimization in an API rework, but
structuring the code to both be correct for vp.store and not optimize
the existing case turned out to be more involved than seemed worthwhile. I
could pull this part out as a pre-change, but it's a bit awkward on its
own as it turns out to be somewhat of a half step towards the possible
optimization; the full optimization is complex with the old code
structure.
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
The point of this change is simply to show that the constant check was
not required for correctness. The mixed intrinsic and shuffle tests are
added purely to exercise the code. An upcoming change will add support
for shuffle matching in getMask to support non-constant fixed vector
cases.
This is the masked.store side to the masked.load support added in
881b3fd.
With this change, we support masked.load and masked.store via the
intrinsic lowering path used primarily with scalable vectors. An
upcoming change will extend the fixed vector (i.e. shufflevector) paths
in the same manner.
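A minimal sketch of the store side of the intrinsic path (factor 2; the
element types and names %a, %b, %p, %mask are made up for illustration):
```
%v = call <vscale x 8 x i32> @llvm.vector.interleave2.nxv8i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b)
call void @llvm.masked.store.nxv8i32.p0(<vscale x 8 x i32> %v, ptr %p, i32 4, <vscale x 8 x i1> %mask)
```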
1) Rename argument II to something slightly more descriptive since we have
more than one IntrinsicInst flowing through.
2) Perform a checked dyn_cast early to eliminate two casts later in each
routine.
This builds on the whole series of recent API reworks to implement
support for deinterleaveN of masked.load. The goal is to be able to
enable masked interleave groups in the vectorizer once all the codegen
and costing pieces are in place.
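Roughly, the pattern being matched looks like the following sketch (factor 2
and the types/names are just for illustration):
```
%wide = call <vscale x 8 x i32> @llvm.masked.load.nxv8i32.p0(ptr %p, i32 4, <vscale x 8 x i1> %mask, <vscale x 8 x i32> poison)
%dei = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %wide)
```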
I considered including the shuffle path support in this review as well
(since the RISCV target specific stuff should be common), but decided to
separate it into its own review just to focus attention on one thing at
a time.
This continues in the direction started by commit 4b81dc7. It
essentially merges the handling for VPLoad - currently in
lowerInterleavedVPLoad - into the existing dedicated routine. This
removes the last use of the dedicated lowerInterleavedVPLoad, so it can
be deleted.
This isn't quite NFC, as the main callback has support for the strided
load optimization whereas the VPLoad specific version didn't. So this
adds the ability to form a strided load for a vp.load deinterleave where
only one shuffle is used.
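As a hedged sketch of what that enables (the pointer adjustment, stride,
and types below are illustrative, not the exact IR emitted):
```
; before: factor-2 vp.load where only the odd field is used
%wide = call <8 x i32> @llvm.vp.load.v8i32.p0(ptr %p, <8 x i1> %m8, i32 8)
%odd = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

; after: load just that field with a stride of one full group
%p1 = getelementptr i32, ptr %p, i64 1
%odd.strided = call <4 x i32> @llvm.experimental.vp.strided.load.v4i32.p0.i64(ptr %p1, i64 8, <4 x i1> %m4, i32 4)
```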
This continues in the direction started by commit 4b81dc7. It
essentially merges the handling for VPStore - currently in
lowerInterleavedVPStore, which is shared between shuffle and intrinsic
based interleaves - into the existing dedicated routine.
There are cases where InstCombine / InstSimplify might sink extractvalue
instructions that use a deinterleave intrinsic into successor blocks,
which prevents InterleavedAccess from kicking in because the current
pattern requires the deinterleave intrinsic to be used by extractvalue
instructions in the same block. However, this requirement is a bit too
strict, since we could simply replace the users of the deinterleave
intrinsic with whatever is generated by the target TLI hooks.
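For example (a hand-written sketch; the block and value names are made up),
the sunk form looks like:
```
entry:
  %wide = load <vscale x 8 x i32>, ptr %p
  %dei = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %wide)
  br label %use

use:                                   ; the extractvalues were sunk here
  %f0 = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %dei, 0
  %f1 = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %dei, 1
```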
This essentially merges the handling for VPLoad - currently in
lowerInterleavedVPLoad, which is shared between shuffle and intrinsic
based interleaves - into the existing dedicated routine.
My plan is that, if we like this factoring, I'll do the same for
the intrinsic store paths, and then remove the excess generality from
the shuffle paths since we don't need to support both modes in the
shared VPLoad/Store callbacks. We can probably even fold the VP versions
into the non-VP shuffle variants in the analogous way.
Factoring out and combining `isInterleaveIntrinsic`,
`isDeinterleaveIntrinsic`, and `getIntrinsicFactor` into
`getInterleaveIntrinsicFactor` and `getDeinterleaveIntrinsicFactor`
inside VectorUtils.
NFC.
As noted in post commit review, the API change here was not required.
I'd apparently confused myself when teasing apart patches from my
development branch.
For the fixed vector cases, we already support this, but the
deinterleave intrinsic cases (primarily used by scalable vectors) didn't.
Supporting it requires plumbing through the Factor separately from the
extracts, as there can now be fewer extracts than the Factor. Note that
the fixed vector path handles this slightly differently - it uses the
shuffle and indices scheme to achieve the same thing.
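A hedged sketch of the scalable case with gaps (assuming the single
factor-4 intrinsic; the types are illustrative): the factor is 4, but only
two of the fields are ever extracted:
```
%dei = call { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave4.nxv16i32(<vscale x 16 x i32> %wide)
; fields 1 and 3 are never extracted, i.e. they are gaps
%f0 = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> } %dei, 0
%f2 = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> } %dei, 2
```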
Now that the loop vectorizer emits just a single
llvm.vector.[de]interleaveN intrinsic after #141865, we can remove the
need to recognise recursively [de]interleaved intrinsics.
No in-tree target currently has instructions to emit an interleaved
access with a factor > 8, and I'm not aware of any other passes that
will emit recursive interleave patterns, so this code is effectively
dead.
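For reference, a sketch of the two forms (types chosen only for
illustration): the single intrinsic the vectorizer now emits versus the
recursive tree that no longer needs matching:
```
; single-intrinsic form
%d = call { <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave4.nxv16i32(<vscale x 16 x i32> %wide)

; recursive form (now effectively dead)
%half = call { <vscale x 8 x i32>, <vscale x 8 x i32> } @llvm.vector.deinterleave2.nxv16i32(<vscale x 16 x i32> %wide)
%h0 = extractvalue { <vscale x 8 x i32>, <vscale x 8 x i32> } %half, 0
%d0 = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %h0)
; ...and likewise for the other half
```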
Some tests have been converted from the recursive form to a single
intrinsic, and some others were deleted that are no longer needed, e.g.
to do with the recursive tree.
This closes off the work started in #139893.
This teaches the interleaved access pass to lower the intrinsics for
factors 4, 6 and 8 added in #139893 to target intrinsics.
Because factors 4 and 8 could either have been recursively
[de]interleaved or have just been a single intrinsic, we need to check
that it's the former before reshuffling the values via
interleaveLeafValues.
After this patch, we can teach the loop vectorizer to emit a single
interleave intrinsic for factors 2 through to 8, and then we can remove
the recursive interleaving matching in interleaved access pass.
This adds support for lowering deinterleave and interleave intrinsics
for factors 3, 5 and 7 into target specific memory intrinsics.
Notably this doesn't add support for handling higher factors constructed
from interleaving interleave intrinsics, e.g. factor 6 from interleave3
+ interleave2.
I initially tried this but it became very complex very quickly. For
example, because there are now multiple factors involved,
interleaveLeafValues is no longer symmetric between interleaving and
deinterleaving. There are then also two ways of representing a factor 6
deinterleave: it can be done either as 1 deinterleave3 and 3
deinterleave2s, or as 1 deinterleave2 and 2 deinterleave3s.
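A small fixed-vector sketch of the two equivalent decompositions (the
widths are chosen only to make the arithmetic visible):
```
; option A: 1 deinterleave3, then deinterleave2 each result (3 deinterleave2s total)
%t = call { <6 x i32>, <6 x i32>, <6 x i32> } @llvm.vector.deinterleave3.v18i32(<18 x i32> %wide)
%t0 = extractvalue { <6 x i32>, <6 x i32>, <6 x i32> } %t, 0
%a = call { <3 x i32>, <3 x i32> } @llvm.vector.deinterleave2.v6i32(<6 x i32> %t0)
; ...repeated for the other two results

; option B: 1 deinterleave2, then deinterleave3 each half (2 deinterleave3s total)
%u = call { <9 x i32>, <9 x i32> } @llvm.vector.deinterleave2.v18i32(<18 x i32> %wide)
%u0 = extractvalue { <9 x i32>, <9 x i32> } %u, 0
%b = call { <3 x i32>, <3 x i32>, <3 x i32> } @llvm.vector.deinterleave3.v9i32(<9 x i32> %u0)
; ...repeated for the other half
```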
I'm not sure the complexity of supporting arbitrary factors is warranted
given how we only need to support a small number of factors currently:
SVE only needs factors 2,3,4 whilst RVV only needs 2,3,4,5,6,7,8.
My preference would be to just add an interleave6 and deinterleave6
intrinsic to avoid all this ambiguity, but I'll defer this discussion to
a later patch.
Teach InterleavedAccessPass to recognize vp.load + shufflevector and
shufflevector + vp.store. Note that this patch only adds RISC-V support
to actually lower this pattern. The vp.load/vp.store in this pattern
require a constant mask.
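The load side of the shape being recognized, as a rough sketch (constant
all-ones mask, factor 2, fixed EVL; all illustrative):
```
%wide = call <8 x i32> @llvm.vp.load.v8i32.p0(ptr %p, <8 x i1> <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, i32 8)
%even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%odd = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
```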
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load
Upon recognizing these patterns, IA will collect the interleaved /
deinterleaved operands and delegate them over to their respective
newly-added TLI hooks.
For RISC-V, these patterns are lowered into segmented loads/stores.
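For instance, the store side of the pattern looks roughly like this sketch
(factor 2; the types and names %a, %b, %p, %m, %evl are illustrative):
```
%v = call <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b)
call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %p, <vscale x 4 x i1> %m, i32 %evl)
```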
Right now we only recognize power-of-two (de)interleave cases, in which
(de)interleave4/8 are synthesized from a tree of (de)interleave2 operations.
---------
Co-authored-by: Nikolay Panchenko <nicholas.panchenko@gmail.com>
Previously, AArch64 used pattern matching to support
llvm.vector.(de)interleave of 2 and 4; RISC-V only supported
(de)interleave of 2.
This patch consolidates the logic in these two targets by factoring out
the common factor calculations into the InterleavedAccess pass.
- [AArch64]: TargetLowering is updated to spot load/store (de)interleave4-like sequences using PatternMatch,
and emit the equivalent sve.ld4 and sve.st4 intrinsics.
This patch moves the following intrinsics out of the experimental
namespace:
* vector.interleave2/deinterleave2
* vector.reverse
* vector.splice
All these intrinsics have existed in LLVM for more than a year and are
widely used, so they should no longer be considered experimental.
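In IR terms this is just a rename, e.g. (sketch, with illustrative types):
```
; before
%v = call <vscale x 4 x i32> @llvm.experimental.vector.interleave2.nxv4i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b)
; after
%v = call <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b)
```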
Similar to #87934, this adds costs to the shuffles in a canonical LD3/LD4
pattern, which are represented in LLVM as deinterleaving-shuffle(load). This
likely has less effect at the moment than the ST3/ST4 costs as instcombine will
perform certain transforms without considering the cost.
These are the last remaining "trivial" changes to passes that use
Instruction pointers for insertion. All of this should be NFC, it's just
changing the spelling of how we identify a position.
In one or two locations, I'm also switching uses of getNextNode etc to
using std::next with iterators. This too should be NFC.
---------
Merged by: Stephen Tozer <stephen.tozer@sony.com>
If a load instruction qualifies to be optimized by the InterleavedAccess
pass but also has a dead binop instruction, this will lead to a crash.
The binop instruction will not be deleted, because normally it would be
deleted through its users, but it has none. Later on, deleting the load
instruction will fail because it still has uses.
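A minimal sketch of the failure shape (the names and widths are invented
for illustration):
```
%v = load <8 x i32>, ptr %p
%dead = add <8 x i32> %v, %v   ; no users, so it is never erased via its users
%f0 = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%f1 = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
; after lowering the shuffles, %dead still uses %v, so deleting the load fails
```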
The InterleavedAccess pass currently matches (de)interleaving
shufflevector instructions with loads or stores, and calls into
target lowering to generate ldN or stN instructions.
Since we can't use shufflevector for scalable vectors (besides a
splat with zeroinitializer), we have interleave2 and deinterleave2
intrinsics. This patch extends InterleavedAccess to recognize those
intrinsics and if possible replace them with ld2/st2.
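As a sketch of the load side (using the current, non-experimental intrinsic
spelling and illustrative types):
```
%wide = load <vscale x 8 x i32>, ptr %p
%dei = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %wide)
; replaced with a target ld2-style instruction when possible
```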
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D146218
It is expected that shuffles that we hoist through binops only have a single
vector operand, the other being undef/poison. The checks for
isDeInterleaveMaskOfFactor check that all the elements come from inside the
first vector, but with non-canonical shuffles the second operand could still
have a value. Add a quick check to make sure it is UndefValue as expected, to
make sure we don't run into problems with BinOpShuffles not using BinOps.
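To illustrate (a hand-written sketch), the canonical and non-canonical
forms differ only in the second shuffle operand:
```
; canonical: second operand is poison, the mask reads only the first vector
%a = shufflevector <8 x i32> %x, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
; non-canonical: the mask still reads only %x, but the second operand is a real value
%b = shufflevector <8 x i32> %x, <8 x i32> %y, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
```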
Fixes #61749
Differential Revision: https://reviews.llvm.org/D147306
This adds two new methods to ShuffleVectorInst, isInterleave and
isInterleaveMask, so that the logic to check if a shuffle mask is an
interleave can be shared across the TTI, codegen and the interleaved
access pass.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D145971
D89489 added some logic to the interleaved access pass to attempt to
undo the folding of shuffles into binops that instcombine performs. If
early-cse is run too, the binops may be commoned into a single operation
with multiple shuffle uses. It is still profitable to reverse the
transform though, so long as all the uses are shuffles.
Differential Revision: https://reviews.llvm.org/D129419
This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
Neither of these passes modify the CFG, allowing us to preserve DomTree
and LoopInfo across them by using setPreservesCFG.
Differential Revision: https://reviews.llvm.org/D110161
The InterleavedAccess pass will convert shuffle(binop(load, load)) to
binop(shuffle(load), shuffle(load)), in order to create more
interleaving load patterns (VLD2/3/4) that might have been messed up by
instcombine. As shown in D104247, we were failing to copy IR flags to the
new instruction, though; they should be kept the same as on the
original instruction.
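For example (a sketch with made-up values; `fast` stands in for whatever
flags the original binop carried):
```
; before
%l0 = load <8 x float>, ptr %p0
%l1 = load <8 x float>, ptr %p1
%sum = fadd fast <8 x float> %l0, %l1
%s = shufflevector <8 x float> %sum, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

; after: the shuffle moves to the loads, and the new fadd must keep the fast flags
%s0 = shufflevector <8 x float> %l0, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%s1 = shufflevector <8 x float> %l1, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%r = fadd fast <4 x float> %s0, %s1
```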
Differential Revision: https://reviews.llvm.org/D104255
This patch is a part of D93817 and makes transformations in CodeGen use poison for shufflevector/insertelement's initial vector element.
The change in CodeGenPrepare.cpp is fine because the mask of the shufflevector should always be zero.
It doesn't touch the second element (which is poison).
The change in InterleavedAccessPass.cpp is also fine because the mask is of the form <a, a+m, a+2m, ..., a+km> where a+km is smaller than
the size of the first vector operand.
This is guaranteed by the caller of replaceBinOpShuffles, which is lowerInterleavedLoad.
It calls isDeInterleaveMask and isDeInterleaveMaskOfFactor to check that the mask has the desired form.
isDeInterleaveMask has the check that a+km is smaller than the vector size.
To check my understanding, I added an assertion and added a test to show that this optimization doesn't fire in such a case.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D94056