The motivation for this is to allow us to match strided accesses that
are emitted from the loop vectorizer with EVL tail folding (see #122232).
In these loops the step isn't loop-invariant and is instead based on
@llvm.experimental.get.vector.length.
We can relax this restriction as long as we make sure to construct the
updates after the definition inside the loop, instead of in the preheader.
I presume the restriction was previously added so that the step would
dominate the insertion point in the preheader. I can't think of a reason
why it wouldn't be safe to calculate it in the loop otherwise.
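For illustration, here is a rough sketch (not taken from the patch; the names
and constants are made up) of the shape of an EVL tail-folded loop where the
step depends on @llvm.experimental.get.vector.length and so isn't available in
the preheader:

```llvm
declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)

define void @evl_loop(ptr %p, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %remaining = sub i64 %n, %iv
  ; The EVL (and hence any step derived from it) is only defined here,
  ; inside the loop, so updates based on it must be constructed after
  ; this point rather than in the preheader.
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 2, i1 true)
  %evl.zext = zext i32 %evl to i64
  ; ... strided access using a step derived from %evl ...
  %iv.next = add i64 %iv, %evl.zext
  %done = icmp uge i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```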
I have a particular user downstream who likes to write shuffles in terms
of unions involving _BitInt(128) types. This isn't completely crazy
because there's a bunch of code in the wild which was written with SSE
in mind, so 128 bits is a common data fragment size.
The problem is that generic lowering scalarizes this to ELEN, and we end
up with really terrible extract/insert sequences if the i128 shuffle is
between other (non-i128) operations.
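As a rough illustration (this exact IR isn't from the patch), the kind of
shuffle in question looks something like:

```llvm
; A shuffle over i128 elements, e.g. from code manipulating unions of
; _BitInt(128) values. Today generic lowering scalarizes the i128
; elements down to ELEN-sized pieces, giving long extract/insert chains.
define <2 x i128> @swap(<2 x i128> %v) {
  %s = shufflevector <2 x i128> %v, <2 x i128> poison, <2 x i32> <i32 1, i32 0>
  ret <2 x i128> %s
}
```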
I explored trying to do this via generic lowering infrastructure, and
frankly got lost. Doing this as target-specific DAG lowering is a bit ugly -
really, there's nothing hugely target-specific here - but oh well. If
reviewers prefer, I could probably phrase this as a generic DAG combine,
but I'm not sure that's hugely better. If reviewers have a strong
preference on how to handle this, let me know, but I may need a bit of
help.
A couple notes:
* The argument passing weirdness is due to a missing combine to turn a
build_vector of adjacent i64 loads back into a vector load. I'm a bit
surprised we don't get that, but the isel output clearly has the
build_vector at i64.
* I plan to revisit the splat case in another patch. That's a relatively
common pattern, and the fact that I have to scalarize it to avoid an
infinite loop is non-ideal.
For .wv widening instructions, when checking whether the operand is vs1
or vs2 we take into account whether or not the instruction has a
passthru. For tied pseudos, though, the passthru is the vs2, and we
weren't taking this into account.
We were previously checking a combination of the vector policy op and
the opcode to determine if we needed to skip copying the passthru from
a masked pseudo to an unmasked pseudo.
However, we can just do this by checking
RISCVII::isFirstDefTiedToFirstUse, which is a proxy for whether or not
a pseudo has a passthru operand.
This should hopefully remove the need for the changes in #123106.
This adds support for lowering llvm.vp.{gather,scatter}s to
experimental.vp.strided.{load,store}.
This will help us handle strided accesses with EVL tail folding that are
emitted from the loop vectorizer, but note that it's still not enough.
We will also need to handle the vector step not being loop-invariant
(i.e. produced by @llvm.experimental.get.vector.length) in a future patch.
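As a sketch of the idea (the types, names, and stride constant below are
illustrative, not taken from the patch), a vp.gather over a strided pointer
vector:

```llvm
declare <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
declare <vscale x 2 x i32> @llvm.vp.gather.nxv2i32.nxv2p0(<vscale x 2 x ptr>, <vscale x 2 x i1>, i32)
declare <vscale x 2 x i32> @llvm.experimental.vp.strided.load.nxv2i32.p0.i64(ptr, i64, <vscale x 2 x i1>, i32)

define <vscale x 2 x i32> @gather(ptr %base, <vscale x 2 x i1> %m, i32 %evl) {
  ; Pointer vector with a constant 8-byte stride between lanes.
  %step = call <vscale x 2 x i64> @llvm.stepvector.nxv2i64()
  %ptrs = getelementptr i64, ptr %base, <vscale x 2 x i64> %step
  %v = call <vscale x 2 x i32> @llvm.vp.gather.nxv2i32.nxv2p0(<vscale x 2 x ptr> %ptrs, <vscale x 2 x i1> %m, i32 %evl)
  ret <vscale x 2 x i32> %v
}

; ... which can instead be lowered to, roughly:
;   %v = call <vscale x 2 x i32> @llvm.experimental.vp.strided.load.nxv2i32.p0.i64(
;            ptr %base, i64 8, <vscale x 2 x i1> %m, i32 %evl)
```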
This patch adds usage of processShuffleMasks in codegen, in
lowerShuffleViaVRegSplitting. This function is already used for X86
shuffle cost estimation and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE;
this unifies the code.
Reviewers: topperc, wangpc-pp, lukel97, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/121765
This extension adds eleven instructions to accelerate interrupt
servicing.
The current spec can be found at:
https://github.com/quic/riscv-unified-db/releases/latest
This patch adds assembler-only support.
---------
Co-authored-by: Harsh Chandel <hchandel@qti.qualcomm.com>
Use the probe loop structure to allocate vector stack objects as well.
We add the pseudo instruction RISCV::PROBED_STACKALLOC_RVV to
differentiate it from the normal loop.
This takes inspiration from AArch64 which does the same thing to assist
with zip/trn/etc. Doing this recursion unconditionally when the mask
allows is slightly questionable, but seems to work out okay in practice.
As a bit of context, it's helpful to realize that we have existing logic
in both DAGCombine and InstCombine which mutates the element width of
shuffles in an analogous manner. However, that code has two restrictions
which prevent it from handling the motivating cases here. First, it only
triggers if there is a bitcast involving a different element type.
Second, the matcher used considers a partially undef wide element to be
a non-match. I considered trying to relax those assumptions, but the
information loss for undef in mid-level opt seemed more likely to open a
can of worms than I wanted.
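As a concrete (illustrative, not from the patch) example of the element-width
rewrite being discussed: a shuffle that moves narrow elements around in
adjacent pairs can be re-expressed with half as many wider elements:

```llvm
; The pairwise i8 shuffle below is equivalent to bitcasting to <4 x i16>
; and shuffling with the mask <1, 0, 3, 2>.
define <8 x i8> @pairwise_swap(<8 x i8> %v) {
  %s = shufflevector <8 x i8> %v, <8 x i8> poison,
       <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
  ret <8 x i8> %s
}
```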
Many instructions assign all or a subset of Inst{6-2} to Imm{4-0}. Make
this the default. Subsets of Inst{6-2} can be overridden as needed by
derived classes/records, as we already do with Inst{12} in a few places.
I added some CodeGen test cases related to reduce. To maintain
consistency, I also added cases for instructions like
`vector.reduce.or`.
For cases where the `v1i1` type generates `VFIRST`, please refer to:
https://reviews.llvm.org/D139512.
Some masked pseudos like PseudoVCPOP_M_B8_MASK don't have a passthru,
but in the masked->unmasked peephole we assumed the masked pseudo always
had one.
This checks for a passthru first and fixes #122245.
Every call should have a regmask operand to indicate what registers are
preserved or clobbered by the call. VirtRegRewriter uses this to tell
MachineRegisterInfo what registers are clobbered by a function. If the
mask isn't present, the registers potentially clobbered by a tail-called
function aren't counted. I have checked ARM, AArch64, and X86, and they
all have a regmask operand on their tail calls.
I believe this fixes an issue I'm seeing with IPRA.
We can just use a std::optional to wrap the operand info instead. The
state field is confusing as we have a "partially known" state where EEW
is known and EMUL is nullopt, but it's still "Known".
All but one of the cases in tree today have EMUL = (EEW/SEW) * LMUL.
Repeating this each time is verbose and introduces opportunity for error.
(For instance, the comment associated with vwmul.vv was out of sync with
the code for the same.)
Introduce getOperandLog2EEW and move most complexity to it. Then
introduce getOperandInfo as a wrapper around it, and special-case the
one case which requires it.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
When these instructions are marked nofpexcept, we can optimize them.
There are some added toggles in the output, likely because other
nofpexcept FP instructions are not part of isSupportedInstr yet. In the
future we may want to avoid marking an instruction as supported in
isSupportedInstr if any of its FP users are missing nofpexcept, to avoid
the added toggles. However, we seem to get some GPRs back as a result of
this change, which may outweigh the cost of the extra toggles.
The plan is to follow this patch up with added support for more FP
instructions in the same way. The instructions in this patch are a
natural starting point because they allow us to test with integer
instructions which have good support already.
Custom lowering for s32 G_ADD/SUB to help match SelectionDAG better.
Specifically, on RV64 an s32 add is produced as an add plus a sext of
the output; this allows a couple of patterns to sign extend with fewer
instructions, and allows the generation of addiw, subw, and negw to
reduce the instructions required to load values.
Log2_ceil_i32 in rvzbb.ll shows a more obvious improvement case.
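For a trivial (illustrative) example of the kind of pattern this should
improve:

```llvm
; On RV64, an i32 negate like this should now select to something like
; negw rather than a sub followed by a separate sign extend.
define signext i32 @neg(i32 signext %x) {
  %r = sub i32 0, %x
  ret i32 %r
}
```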
Reductions are weird because some of their operands are vector registers
but only the first lane is read. For these operands, we do not need to
check that the EEW and EMUL ratios match; the EEWs, however, do need to
match.
Also tested with Ubuntu on SiFive's HiFive Premier P550 board.
Curiously, latency is reported as ~1.5 for basic scalar arithmetic,
scalar mul is ~3.5, and div is ~36.5. This is 0.5 cycles higher than I
expect.