(This is a re-apply for what was 8374d42. The bug there was fairly
major - despite the comments and review description, the code was
using each register in the source register group, not only the first
register. This was completely wrong.)
This is a continuation of the work started in
https://github.com/llvm/llvm-project/pull/125735 to lower selected VLA
shuffles in linear m1 components instead of generating O(LMUL^2) or
O(LMUL*Log2(LMUL)) high LMUL shuffles.
This pattern focuses on shuffles where all the elements being used
across the entire destination register group come from a single register
in the source register group. Such cases come up fairly frequently via
e.g. spread(N) and repeat(N) idioms.
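As a rough illustration of the pattern being matched (a Python model, not code from this patch), the check below treats a shuffle mask as a list of source element indices. The name `single_source_register` and the `elts_per_reg` parameter are hypothetical, and the repeat/spread masks in the usage note are my assumption of what those idioms look like.
```python
from typing import Optional, Sequence

UNDEF = -1  # marks an undefined (don't-care) lane in the mask


def single_source_register(mask: Sequence[int], elts_per_reg: int) -> Optional[int]:
    """Return the index of the one source register that every defined lane
    reads from, or None if the lanes read from zero or several registers."""
    regs = {idx // elts_per_reg for idx in mask if idx != UNDEF}
    if len(regs) != 1:
        return None
    return regs.pop()
```
With 4 elements per register, a repeat-style mask such as `[0, 0, 1, 1, 2, 2, 3, 3]` and a spread-style mask such as `[0, -1, 1, -1, 2, -1, 3, -1]` both read only register 0, while the identity mask over 8 elements does not qualify.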
One subtlety to this patch is the handling of the index vector for
vrgatherei16.vv. Because the index and source registers can have
different EEW, the index vector for the Nth chunk of the destination is
not guaranteed to be register aligned. In fact, it is common for e.g. an
EEW=64 shuffle to have EEW=16 indices which are four chunks per source
register. Given this, we have to pay a cost for extracting these chunks
into the low position before performing each shuffle.
I'd initially expressed this as a naive extract sub-vector for each data
parallel piece. However, at high LMUL, this quickly caused register
pressure problems since we could at worst need 4x the temporary
registers for the index. Instead, this patch uses a repeating slidedown
chained from previous iterations. This increases critical path by at
worst 3 slides (SEW=64 is the worst case), but reduces register pressure
to at worst 2x - and only if the original index vector is reused
elsewhere. I view this as arguably a bit of a workaround (since our
scheduling should have done better with the plain extract variant), but
a probably necessary one.
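To make the index handling concrete, here is a small Python model (an illustration only, with hypothetical names) comparing the plain extract form against the chained slidedown form. It assumes the index length is a multiple of `idx_per_reg` and that `idx_per_reg` is a multiple of `chunk`.
```python
def slidedown(reg, amount):
    """Model a slidedown on one register: lane i takes lane i + amount.
    Lanes shifted in from past the end are modelled as None."""
    return reg[amount:] + [None] * amount


def index_chunks_extract(index, chunk):
    """Plain variant: one independent sub-vector extract per destination chunk."""
    return [list(index[i:i + chunk]) for i in range(0, len(index), chunk)]


def index_chunks_chained(index, idx_per_reg, chunk):
    """Chained variant: within each index register, every chunk after the
    first is produced by sliding the previous result down by `chunk`."""
    chunks = []
    for base in range(0, len(index), idx_per_reg):
        cur = list(index[base:base + idx_per_reg])
        for i in range(idx_per_reg // chunk):
            if i != 0:
                cur = slidedown(cur, chunk)  # reuses the previous temporary
            chunks.append(cur[:chunk])       # low `chunk` lanes feed the gather
    return chunks
```
For example, with 8 index elements per register and 2 data elements per register (the EEW=16 index, EEW=64 data case), each index register yields four chunks at the cost of three chained slides, and both variants produce the same chunks.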
(With a fix to recently added code.)
Implement the first TODO from #125735, and do some minor code cleanup
using the same style as the recently landed strict prefix case.
Found using https://github.com/codespell-project/codespell
```
codespell RISCV --write-changes \
--ignore-words-list=FPR,fpr,VAs,ORE,WorstCase,hart,sie,MIs,FLE,fle,CarryIn,vor,OLT,VILL,vill,bu,pass-thru
```
With the test changes.
Original message:
The Zicond version of this requires an li instruction and an
additional register.
Without Zicond we match this in a DAGCombine on RISCVISD::SELECT_CC.
This PR has 2 commits. I'll pre-commit the test change if this looks
good.
If we're lowering a shuffle to a vrgather (or vcompress), and we know
that a prefix of the operation can be done while producing the same
(defined) lanes, do the operation with a narrower LMUL.
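A simplified way to picture the prefix check for a single-source shuffle (a Python sketch with hypothetical names, not the logic in this patch): find the last lane or source element that matters, then round the register count up to a power of two, since register groups come in sizes of 1, 2, 4, or 8.
```python
UNDEF = -1  # marks an undefined (don't-care) lane in the mask


def narrowed_lmul(mask, elts_per_reg):
    """Smallest power-of-two register count whose prefix covers every defined
    result lane and every source element those lanes read."""
    last = -1
    for lane, idx in enumerate(mask):
        if idx != UNDEF:
            last = max(last, lane, idx)
    regs = last // elts_per_reg + 1  # 0 when every lane is undefined
    lmul = 1
    while lmul < regs:
        lmul *= 2
    return lmul
```
With 4 elements per register, a 16-lane mask `[5, 4, 3, 2, 1, 0]` followed by ten undefined lanes only needs the first two registers, so in this model the operation could run at m2 rather than m4.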
This enables DAG combines to form this mask. Reverse is generally linear
in LMUL so this is reasonable, and results in better codegen for the 2
source variants.
For <= m1, the change is only slightly profitable if at all. We trade
some mask creation and an extract vrsub for a vslideup.vi. This is
likely roughly neutral. At >= m2, this is distinctly profitable as
generic DAG pushes the reverse into the two operands. We effectively
already did this for one operand, but the other was hitting a full
O(LMUL^2) shuffle. Moving that to be an O(LMUL/2) operation is a big win.
These three intrinsics are similar to llvm.vector.(de)interleave2 but
work with 3/5/7 vector operands or results.
For RISC-V, it's important to have them in order to support segmented
load/store with factor of 2 to 8: factor of 2/4/8 can be synthesized
from (de)interleave2; factor of 6 can be synthesized from factor of 2
and 3; factor 5 and 7 have their own intrinsics added by this patch.
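The factor synthesis above can be checked with a small Python model (illustrative only; the helper names are hypothetical and operate on plain lists rather than vectors):
```python
def deinterleave(vec, factor):
    """Field k collects elements k, k + factor, k + 2 * factor, ..."""
    return [vec[k::factor] for k in range(factor)]


def deinterleave4(vec):
    """Factor 4 as a tree of factor-2 deinterleaves."""
    evens, odds = deinterleave(vec, 2)
    ee, eo = deinterleave(evens, 2)  # original fields 0 and 2
    oe, oo = deinterleave(odds, 2)   # original fields 1 and 3
    return [ee, oe, eo, oo]


def deinterleave6(vec):
    """Factor 6 as a factor-2 deinterleave followed by two factor-3 ones."""
    evens, odds = deinterleave(vec, 2)
    e = deinterleave(evens, 3)  # original fields 0, 2, 4
    o = deinterleave(odds, 3)   # original fields 1, 3, 5
    return [e[0], o[0], e[1], o[1], e[2], o[2]]
```
Both composed forms agree with a direct `deinterleave(vec, 4)` or `deinterleave(vec, 6)`; the only subtlety is the order in which the intermediate results are stitched back together. Factors 3, 5, and 7 are prime, so they have no such decomposition.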
This patch only adds codegen support for these intrinsics; we still need
to teach the vectorizer to generate them and teach
InterleavedAccessPass to use them.
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
The APInt constructor asserts if bits are set past the size of the APInt
unless it is signed. This currently fails on RV32 because more than XLen
bits are set.
If the input contains an odd number of shuffled vectors, the last 2
shuffles are shuffled with the same first vector. We need to handle this
situation correctly: when the first vector is requested for the first
time, extract it from the source vector; when it is requested the
second time, reuse the previous result. The second vector should be
extracted in both cases.
Fixes #125269
Reviewers: topperc, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/125693
This avoids lowering scalable fp_extends that don't need multiple
extends (i.e. f16->f32, f32->f64) to _vl nodes, but converts them back
during DAG preprocessing so we don't need to add any more patterns.
Keeping the nodes in their generic SDNode form matches more splat
patterns.
High LMUL shuffles are expensive on typical SIMD implementations.
Without exact vector length knowledge, we struggle to map elements
within the vector to the register within the vector register group.
However, there are some patterns where we can perform a vector length
agnostic (VLA) shuffle by leveraging knowledge of the pattern performed
even without the ability to map individual elements to registers. An
existing in tree example is vector reverse.
This patch introduces another such case. Specifically, if we have a
shuffle where a local rearrangement of elements is happening within
a 128b (really zvlNb) chunk, and we're applying the same pattern to each
chunk, we can decompose a high LMUL shuffle into a linear number of m1
shuffles. We take advantage of the fact the tail of the operation is
undefined, and repeat the pattern for all elements in the source
register group - not just the ones the fixed vector type covers.
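For intuition, the matching condition can be modelled in a few lines of Python (a sketch under my own assumptions, with hypothetical names; the patch itself works on SelectionDAG shuffle masks). The model checks that no lane crosses its chunk boundary and that every chunk agrees on the same local pattern, with undefined lanes matching anything.
```python
UNDEF = -1  # marks an undefined (don't-care) lane in the mask


def repeated_chunk_pattern(mask, chunk_elts):
    """Return the local pattern shared by every chunk, or None if some lane
    reads across a chunk boundary or two chunks disagree."""
    pattern = [UNDEF] * chunk_elts
    for lane, idx in enumerate(mask):
        if idx == UNDEF:
            continue
        base = lane - lane % chunk_elts
        if not base <= idx < base + chunk_elts:
            return None  # element comes from a different chunk
        local = idx - base
        slot = lane % chunk_elts
        if pattern[slot] not in (UNDEF, local):
            return None  # this chunk uses a different local pattern
        pattern[slot] = local
    return pattern
```
For instance, `[1, 0, 3, 2, 5, 4, 7, 6]` with 4 elements per chunk yields the local pattern `[1, 0, 3, 2]`, which can then be applied to each m1 register independently.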
This is an optimization for typical SIMD vrgather designs, but could be
a pessimization on hardware for which vrgather's execution cost is not
independent of the runtime VL.
Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load
Upon recognizing these patterns, IA will collect the interleaved /
deinterleaved operands and delegate them over to their respective
newly-added TLI hooks.
For RISC-V, these patterns are lowered into segmented loads/stores.
Right now we only recognize power-of-two (de)interleave cases, in which
(de)interleave4/8 are synthesized from a tree of (de)interleave2.
---------
Co-authored-by: Nikolay Panchenko <nicholas.panchenko@gmail.com>
I have been unsuccessful at further reducing the test. The
failure requires a shuffle with 2 scalable->fixed extracts with
the same source. 0 is the only valid index for a scalable->fixed
extract so the 2 sources must be the same extract. Shuffles with
the same source are aggressively canonicalized to a unary shuffle.
So it requires the extracts to become identical through other
optimizations without the shuffle being canonicalized before it is
lowered.
Fixes #125306.
This adds a VP version of an existing DAG combine. I've put it in
RISCVISelLowering since we would need to add an ISD::VP_AVGCEIL opcode
otherwise.
This pattern appears in 525.264_r.
If we have a (vselect c, a+b, a-b), we can combine this to a+(vselect c,
b, -b). That by itself isn't hugely profitable, but if we reverse the
select, we get a form which matches a masked vrsub.vi with zero. The
result is that we can use a masked vrsub *before* the add instead of a
masked add or sub. This doesn't change the critical path (since we
already had the pass through on the masked second op), but does reduce
register pressure since a, b, and (a+b) don't need to all be alive at
once.
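The rewrite can be sanity-checked lane by lane with a small Python model (illustrative only; the function names are hypothetical, and `masked_vrsub_zero` is my model of a masked reverse-subtract from zero whose inactive lanes pass `b` through):
```python
def original(c, a, b):
    """(vselect c, a+b, a-b), evaluated per lane."""
    return [x + y if m else x - y for m, x, y in zip(c, a, b)]


def masked_vrsub_zero(mask, b):
    """Active lanes compute 0 - b; inactive lanes pass b through unchanged."""
    return [0 - y if m else y for m, y in zip(mask, b)]


def rewritten(c, a, b):
    """a + (vselect !c, -b, b): negate b under the reversed mask, then add."""
    not_c = [not m for m in c]
    t = masked_vrsub_zero(not_c, b)
    return [x + y for x, y in zip(a, t)]
```
Python integers do not wrap, but the identity a - b == a + (0 - b) also holds under two's complement wraparound, so the model carries over to fixed-width lanes. Note that only `a` and the negated-or-not `b` need to be live at the final add.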
In addition to the vselect form, we can also see the same pattern with a
vector_shuffle encoding the vselect. I explored canonicalizing these to
vselects instead, but that exposes several unrelated missing combines.
Previously, AArch64 used pattern matching to support
llvm.vector.(de)interleave of 2 and 4; RISC-V only supported
(de)interleave of 2.
This patch consolidates the logic in these two targets by factoring out
the common factor calculations into InterleavedAccessPass.
This is a follow-up to #117878 and allows the usage of vrgather if the index
we are accessing in VT is constant and within bounds.
This patch replaces the previous behavior of bailing out if the length of the
search vector is greater than the vector of elements we are searching for.
Since matchSplatAsGather works on EXTRACT_VECTOR_ELT, and we know the index
from which the element is extracted, we only need to check if we are doing an
insert from a larger vector into a smaller one, in which case we do an
extract instead.
Co-authored-by: Luke Lau <luke_lau@icloud.com>
Co-authored-by: Philip Reames <preames@rivosinc.com>
I have a particular user downstream who likes to write shuffles in terms
of unions involving _BitInt(128) types. This isn't completely crazy
because there's a bunch of code in the wild which was written with SSE
in mind, so 128 bits is a common data fragment size.
The problem is that generic lowering scalarizes this to ELEN, and we end
up with really terrible extract/insert sequences if the i128 shuffle is
between other (non-i128) operations.
I explored trying to do this via generic lowering infrastructure, and
frankly got lost. Doing this as a target specific DAG combine is a bit ugly -
really, there's nothing hugely target specific here - but oh well. If
reviewers prefer, I could probably phrase this as a generic DAG combine,
but I'm not sure that's hugely better. If reviewers have a strong
preference on how to handle this, let me know, but I may need a bit of
help.
A couple notes:
* The argument passing weirdness is due to a missing combine to turn a
build_vector of adjacent i64 loads back into a vector load. I'm a bit
surprised we don't get that, but the isel output clearly has the
build_vector at i64.
* The splat case I plan to revisit in another patch. That's a relatively
common pattern, and the fact I have to scalarize that to avoid an
infinite loop is non-ideal.
This patch uses processShuffleMasks for codegen in
lowerShuffleViaVRegSplitting. This function is already used for X86
shuffle estimation and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE,
so using it here unifies the code.
Reviewers: topperc, wangpc-pp, lukel97, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/121765
This takes inspiration from AArch64 which does the same thing to assist
with zip/trn/etc. Doing this recursion unconditionally when the mask
allows is slightly questionable, but seems to work out okay in practice.
As a bit of context, it's helpful to realize that we have existing logic
in both DAGCombine and InstCombine which mutates the element width in
an analogous manner. However, that code has two restrictions which
prevent it from handling the motivating cases here. First, it only
triggers if there is a bitcast involving a different element type.
Second, the matcher used considers a partially undef wide element to be
a non-match. I considered trying to relax those assumptions, but the
information loss for undef in mid-level opt seemed more likely to open a
can of worms than I wanted.
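To illustrate what treating a partially undef wide element as a match means, here is a Python sketch of one widening step (my own model with hypothetical names, not the matcher added by this patch). A pair of narrow lanes forms one wide lane if the defined halves are consistent with a single even-aligned source pair.
```python
UNDEF = -1  # marks an undefined (don't-care) lane in the mask


def widen_mask(mask):
    """Return the mask at twice the element width, or None if some pair of
    lanes cannot be expressed as a single wide element."""
    if len(mask) % 2 != 0:
        return None
    wide = []
    for i in range(0, len(mask), 2):
        lo, hi = mask[i], mask[i + 1]
        if lo == UNDEF and hi == UNDEF:
            wide.append(UNDEF)
        elif lo == UNDEF:
            if hi % 2 != 1:
                return None  # defined high half must be an odd source lane
            wide.append(hi // 2)
        elif hi == UNDEF:
            if lo % 2 != 0:
                return None  # defined low half must be an even source lane
            wide.append(lo // 2)
        elif lo % 2 == 0 and hi == lo + 1:
            wide.append(lo // 2)
        else:
            return None
    return wide
```
Here `[2, -1, -1, 1]` widens to `[1, 0]` even though both wide elements are only half defined, which is the case a stricter matcher would reject. Applying the step repeatedly until it returns None gives the recursive behavior described above.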
Every call should have a regmask operand to indicate what registers are
preserved or clobbered by the call. VirtRegRewriter uses this to tell
MachineRegisterInfo what registers are clobbered by a function. If the
mask isn't present the registers potentially clobbered by a tail called
function aren't counted. I have checked ARM, AArch64, and X86 and they
all have a regmask operand on their tail calls.
I believe this fixes an issue I'm seeing with IPRA.
This fixes a regression from #101294 by checking if we might be
clobbering a sh{1,2,3}add pattern.
Only do this if the underlying add isn't going to be folded away into an
address offset.
This reverts commit b8952d4b1b0c73bf39d6440ad3166a088ced563f.
spec x264 fails to build in all VLS configurations, with the assertion
failure: clang: ../llvm-project/llvm/lib/Target/RISCV/RISCVISelLowering.cpp:5246: llvm::SDValue lowerShuffleViaVRegSplitting(llvm::ShuffleVectorSDNode*, llvm::SelectionDAG&, const llvm::RISCVSubtarget&): Assertion `RegCnt == NumOfDestRegs && "Whole vector must be processed"' failed.
I can reduce a failing piece of IR, but the failure appears pretty
broad, so I suspect any reasonable VLS build will hit it.
This patch uses processShuffleMasks for codegen in
lowerShuffleViaVRegSplitting. This function is already used for X86
shuffle estimation and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE,
so using it here unifies the code.
Reviewers: preames, topperc, lukel97, wangpc-pp
Reviewed By: wangpc-pp
Pull Request: https://github.com/llvm/llvm-project/pull/120803
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.