A lot of fixed-length custom lowerings just involve inserting the
operands into a scalable container and extracting the result out, and
lowerToScalableOp already does this.
We just need to teach it to handle operands with different element types
(but same vector element count), and we can reuse it for
vselect/zext/sext/setcc/fcopysign.
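For example (an illustration, not from the patch), both ops below mix element types across operands while keeping the element count, so a single insert-into-container / op / extract-from-container round trip handles them:

```llvm
; setcc consumes i32 operands and produces an i1 result; the select then mixes
; the i1 condition with i32 data. Every vector involved has 4 elements.
define <4 x i32> @f(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
  %cmp = icmp slt <4 x i32> %a, %b
  %sel = select <4 x i1> %cmp, <4 x i32> %a, <4 x i32> %c
  ret <4 x i32> %sel
}
```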
This reverts commit fe4f6c1a58ab4f00a88a97af01000b6783b573ee, but leaves
the tests that were added.
The original commit mistakenly assumed that if regular bf16/f16 loads
and stores could be lowered without zvfbfmin/zvfhmin, then so too could
masked loads/stores and gathers/scatters.
However, SelectionDAG can't actually type-legalize masked loads/stores;
that has to be done earlier, in ScalarizeMaskedMemIntrinPass.
This was causing crashes on IREE because we now returned true for
isLegalMaskedLoadStore.
The original intent of this was to remove a discrepancy in the loop
vectorizer tests whenever predication was enabled, but this has gone
away after 92d09245d61dce80d3e68a27cc34d5fc6f062c93. So I don't think we
need to reapply this patch.
When vectorizing with predication, some loops that were previously
vectorized without zvfhmin/zvfbfmin will no longer be vectorized, because
the masked load/store or gather/scatter cost comes back as invalid.
This is due to a discrepancy: for these costs we check
isLegalElementTypeForRVV, but for regular memory accesses we don't.
But for bf16 and f16 vectors we don't actually need the extension
support for loads and stores, so this adds a new function which takes
this into account.
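For example (hedged illustration), the plain copy below only moves 16-bit lanes and should be costable without zvfbfmin, while the masked form still requires the extension:

```llvm
; Fine without zvfbfmin: the load and store only move bits.
define void @copy(ptr %dst, ptr %src) {
  %v = load <vscale x 4 x bfloat>, ptr %src, align 2
  store <vscale x 4 x bfloat> %v, ptr %dst, align 2
  ret void
}

; Still needs the extension: the masked form can't be scalarized late for
; scalable vectors.
declare <vscale x 4 x bfloat> @llvm.masked.load.nxv4bf16.p0(ptr, i32, <vscale x 4 x i1>, <vscale x 4 x bfloat>)
```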
For regular memory accesses we should probably also e.g. return an
invalid cost for i64 elements on zve32x, but it doesn't look like we
have tests for this yet.
We also should probably not be vectorizing these bf16/f16 loops to begin
with if we don't have zvfhmin/zvfbfmin and zfhmin/zfbfmin. I think this
is due to the scalar costs being too cheap. I've added tests for this in
a100f6367205c6a909d68027af6a8675a8091bd9 to fix in another patch.
We use `lh` to load 2 bytes from memory into a GPR, then OR this GPR with
-65536 (0xffff0000) to set the upper bits and emulate the NaN-boxing
behavior, and then move the value from the GPR to an FPR using `fmv.w.x`.
To move the value back from the FPR to a GPR, we use `fmv.x.w`, and
finally `sh` is used to store the lower 2 bytes back to memory.
If zfh is enabled at the same time, we can just use flh/fsh to load/store
bf16 directly.
Follow-up to 28417e64; the whole line of work started with 4b81dc7.
This change merges the handling for VPStore - currently in
lowerInterleavedVPStore - into the existing dedicated routine used in
the shuffle lowering path. This removes the last use of
lowerInterleavedVPStore, so we can delete it.
This contains two functional changes.
First, like in 28417e64, merging support for vp.store exposes the
strided store optimization for code using vp.store.
Second, it seems the strided store case had a significant missed
optimization: we were performing the strided store at the full width of
the unit-strided store type (i.e. at full LMUL) rather than reducing it
to match the input width. This became obvious when I tried to use the
mask created by the helper routine, as it caused a type incompatibility.
Normally, I'd try not to include an optimization in an API rework, but
structuring the code to both be correct for vp.store and not optimize
the existing case turned out to be more involved than seemed worthwhile.
I could pull this part out as a pre-change, but it's a bit awkward on its
own, as it turns out to be somewhat of a half step toward the possible
optimization; the full optimization is complex with the old code
structure.
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
This continues in the direction started by commit 4b81dc7. We
essentially merge the handling for VPLoad - currently in
lowerInterleavedVPLoad - into the existing dedicated routine. This
removes the last use of the dedicated lowerInterleavedVPLoad, so we can
delete it.
This isn't quite NFC, as the main callback supports the strided load
optimization whereas the VPLoad-specific version didn't. So this adds
the ability to form a strided load for a vp.load deinterleave where only
one of the shuffles is used.
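For example (hedged), when only one field of the deinterleave is used, the whole access can become a strided load (here, every other i32 with a stride of 8 bytes) rather than a segment load:

```llvm
define <4 x i32> @evens(ptr %p, i32 %evl) {
  %wide = call <8 x i32> @llvm.vp.load.v8i32.p0(ptr %p, <8 x i1> splat (i1 true), i32 %evl)
  ; Only the even field is used; the odd lanes are dead.
  %even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i32> %even
}
declare <8 x i32> @llvm.vp.load.v8i32.p0(ptr, <8 x i1>, i32)
```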
This continues in the direction started by commit 4b81dc7. We
essentially merge the handling for VPStore - currently in
lowerInterleavedVPStore, which is shared between shuffle- and
intrinsic-based interleaves - into the existing dedicated routine.
There are cases where InstCombine / InstSimplify might sink extractvalue
instructions that use a deinterleave intrinsic into successor blocks,
which prevents InterleavedAccess from kicking in because the current
pattern requires the deinterleave intrinsic to be used by extractvalue
instructions. However, this requirement is a bit too strict, since we
could simply replace the users of the deinterleave intrinsic with
whatever the target TLI hooks generate.
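A sketch of the shape in question (hedged, not lifted from a test):

```llvm
; The extractvalue has been sunk into a successor block, so requiring the
; deinterleave to be used by extractvalues in place fails, even though
; replacing the users with the TLI-generated values would still be fine.
define <vscale x 4 x i32> @f(ptr %p, i1 %c) {
entry:
  %wide = load <vscale x 8 x i32>, ptr %p, align 4
  %di = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %wide)
  br i1 %c, label %use, label %exit

use:
  %even = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %di, 0
  ret <vscale x 4 x i32> %even

exit:
  ret <vscale x 4 x i32> zeroinitializer
}
declare { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32>)
```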
The change implements custom lowering of `get_fpmode`, `set_fpmode` and
`reset_fpmode` for the RISC-V target. The implementation is aligned with
the functions `fegetmode` and `fesetmode` in GLIBC.
This essentially merges the handling for VPLoad - currently in
lowerInterleavedVPLoad which is shared between shuffle and intrinsic
based interleaves - into the existing dedicated routine.
My plan, if we like this factoring, is to do the same for the intrinsic
store paths, and then remove the excess generality from the shuffle
paths, since we don't need to support both modes in the shared
VPLoad/Store callbacks. We can probably even fold the VP versions into
the non-VP shuffle variants in the analogous way.
There have been discussions on splitting RISCVISelLowering.cpp. I think
the InterleavedAccess-related TLI hooks would be some of the low-hanging
fruit, as they're relatively isolated and X86 already does this.
NFC.
Add an LLVMContext parameter to getOptimalMemOpType and
findOptimalMemOpLowering, so that we can use EVT::getVectorVT to create
vector EVTs in getOptimalMemOpType.
Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
As noted in post commit review, the API change here was not required.
I'd apparently confused myself when teasing apart patches from my
development branch.
For the fixed vector cases, we already support this, but the
deinterleave intrinsic cases (primarily used by scalable vectors) didn't.
Supporting it requires plumbing through the Factor separately from the
extracts, as there can now be fewer extracts than the Factor. Note that
the fixed vector path handles this slightly differently - it uses the
shuffle and indices scheme to achieve the same thing.
Put one copy on RISCVTargetLowering as a static function so that both
locations can use it, and rename the method to getM1VT for slightly
improved readability.
This change adds new parameters to the method
`shouldFoldSelectWithIdentityConstant()`. The method now takes the
opcode of the select node and the non-identity operand of the select
node. To gain access to the appropriate arguments, the call of
`shouldFoldSelectWithIdentityConstant()` is moved after all other checks
have been performed. Moreover, this change adjusts the precondition of
the fold so that it would work for `SELECT` nodes in addition to
`VSELECT` nodes.
No functional change is intended because all implementations of
`shouldFoldSelectWithIdentityConstant()` are adjusted such that they
restrict the fold to a `VSELECT` node; the same restriction as before.
The rationale for this change is to enable more fine-grained decisions
about when to revert the InstCombine canonicalization of
`(select c (binop x y) y)` to `(binop (select c x idc) y)` in the
backends.
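Concretely, for add the identity constant is 0, so the pair of forms involved looks like this (a hedged IR-level sketch; the hook itself operates on DAG nodes):

```llvm
; The canonical InstCombine output: (binop (select c, x, idc), y) ...
define i32 @canonical(i1 %c, i32 %x, i32 %y) {
  %s = select i1 %c, i32 %x, i32 0
  %r = add i32 %s, %y
  ret i32 %r
}

; ... which a backend may want to rewrite back to (select c, (binop x, y), y).
define i32 @reverted(i1 %c, i32 %x, i32 %y) {
  %bo = add i32 %x, %y
  %r = select i1 %c, i32 %bo, i32 %y
  ret i32 %r
}
```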
The semantics of PARTIAL_REDUCE_SMLA with an i32 result element and i8
sources correspond to vqdot. Analogously, PARTIAL_REDUCE_UMLA
corresponds to vqdotu. There is currently no vqdotsu equivalent.
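For reference, a hedged example of the IR-level partial reduction that maps to vqdot once both inputs are sign-extended from i8:

```llvm
; Each i32 lane of %acc accumulates the dot product of four i8 pairs,
; matching vqdot's semantics.
define <vscale x 4 x i32> @qdot(<vscale x 4 x i32> %acc, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
  %ae = sext <vscale x 16 x i8> %a to <vscale x 16 x i32>
  %be = sext <vscale x 16 x i8> %b to <vscale x 16 x i32>
  %mul = mul <vscale x 16 x i32> %ae, %be
  %r = call <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32> %acc, <vscale x 16 x i32> %mul)
  ret <vscale x 4 x i32> %r
}
declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 4 x i32>, <vscale x 16 x i32>)
```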
This patch is a starting place. We can extend this quite a bit more; I
plan to take a look at the fixed vector lowering, the TTI hook to drive
the loop vectorizer, and integrating the reduction-based lowering I'd
added for zvqdotq into this flow.
This commit moves RISC-V to auto-generate its target-specific SDNode
types. The biggest change is that SDNodes can now be validated against
their expected type profiles, and that we don't need to edit several
different files when declaring a new one.
This takes Sergei's work in #119709 and "finishes" it - by moving the
final five RISCVISD opcodes into tablegen (including defining their
types), and by ensuring the tablegen has expected closing scope
comments.
Co-authored-by: Sergei Barannikov <barannikov88@gmail.com>
Teach InterleavedAccessPass to recognize vp.load + shufflevector and
shufflevector + vp.store, though this patch only adds the RISC-V support
to actually lower this pattern. The vp.load/vp.store in this pattern
require a constant mask.
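The recognized shape looks roughly like this (hedged example; note the all-ones constant mask):

```llvm
define <4 x i32> @sum_fields(ptr %p, i32 %evl) {
  %wide = call <8 x i32> @llvm.vp.load.v8i32.p0(ptr %p, <8 x i1> splat (i1 true), i32 %evl)
  ; Factor-2 deinterleave expressed as two strided shuffles.
  %even = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %sum = add <4 x i32> %even, %odd
  ret <4 x i32> %sum
}
declare <8 x i32> @llvm.vp.load.v8i32.p0(ptr, <8 x i1>, i32)
```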
This patch adds pattern matching for the basic usages of the dot product
instructions introduced by the experimental zvqdotq extension. It
specifically only handles the case where the pattern feeds an i32 sum
reduction, as we need to reassociate the reduction tree to use these
instructions.
The vecreduce_add (sext) and vecreduce_add (zext) cases are included
mostly to exercise the VX matchers. For the generic matching, we fail to
match due to a combine-ordering issue which results in the bitcast being
separated from the splat.
I chose to do this lowering as an early combine so as to avoid having to
integrate the entire logic into the reduction lowering flow. In
particular, that would get a lot more complicated as we extend this to
handle add-trees feeding the reductions.
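The basic shape being matched is (hedged example):

```llvm
; An i32 sum reduction fed by a widening multiply of sign-extended i8
; operands; with zvqdotq the multiply+reduce can use vqdot.vv.
define i32 @dot(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
  %ae = sext <vscale x 16 x i8> %a to <vscale x 16 x i32>
  %be = sext <vscale x 16 x i8> %b to <vscale x 16 x i32>
  %mul = mul <vscale x 16 x i32> %ae, %be
  %red = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> %mul)
  ret i32 %red
}
declare i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32>)
```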
These instructions are included in XRivosVisni. They perform a scalar
insert into a vector and a scalar extract from a vector, each with a
potentially non-zero index. They're very analogous to vmv.s.x and
vmv.x.s, respectively.
The instructions do have a couple restrictions:
1) Only constant indices are supported, with a uimm5 format.
2) There are no FP variants.
One important property of these instructions is that their throughput
and latency are expected to be LMUL independent.
If XRivosVizip is available, the ri.vzip2a and ri.vzip2b instructions
can be used to perform an interleave shuffle. This patch only affects
the intrinsic lowering (and thus scalable vectors). Fixed vectors go
through shuffle lowering, and the zip2a (but not zip2b) case is already
handled there.
If XRivosVizip is available, ri.vunzip2a and ri.vunzip2b can be used to
lower the concatenation-plus-register-deinterleave shuffle. This patch
only affects the intrinsic lowering (and thus scalable vectors, because
fixed vectors go through shuffle lowering).
Note that this patch is restricted to e64 for staging purposes only. e64
is obviously profitable (i.e. we remove a vcompress). At e32 and below,
our alternative is a vnsrl instead, and we need a bit more complexity
around lowering with fractional LMUL before the ri.vunzip2a/b versions
become always profitable. I'll post the follow-up change once this
lands.
If the vrgather.vi is preceded by a vmv.v.x which writes a superset of
the lanes written by the vrgather, and the vrgather has no passthru, then
the vrgather has no semantic effect.
This is the start of a mini-series of patches around rewriting
vrgather.vi/vx preceded by vmv.v.x, vfmv.v.f, vmv.s.x, etc., starting
with the simplest, but also lowest-impact, case.
One point I'd like a second opinion on is the out-of-bounds semantics
change. As far as I can tell, all the indices are in bounds by
construction. The doc change is there as much because I couldn't figure
out how to test the alternative as for any other reason.
This implements initial code generation support for a subset of the
XRivosVizip extension. Specifically, this adds support for vzipeven,
vzipodd, and vzip2a, but not vzip2b, vunzip2a, or vunzip2b. The others
will follow in separate patches.
One review note: The zipeven/zipodd matchers were recently rewritten to
better match upstream style, so careful review there would be
appreciated. The matchers don't yet support type coercion to wider
types. This will be done in a future patch.
This patch improves DAGCombiner's handling of potential store merges by
detecting function calls between loads and stores. When a function call
exists in the chain between a load and its corresponding store, we avoid
merging these stores if the resulting spill would be unprofitable.
We had to implement a hook on TLI, since TTI is unavailable in
DAGCombine. Currently, it's only enabled for RISC-V.
This is the DAG equivalent of PR #129258
This change adds support for `qci-nest` and `qci-nonest` interrupt
attribute values. Both of these are machine-mode interrupts, which use
instructions in Xqciint to push and pop A- and T-registers (and a few
others) from the stack.
In particular:
- `qci-nonest` uses `qc.c.mienter` to save registers at the start of the
function, and uses `qc.c.mileaveret` to restore those registers and
return from the interrupt.
- `qci-nest` uses `qc.c.mienter.nest` to save registers at the start of
the function, and uses `qc.c.mileaveret` to restore those registers and
return from the interrupt.
- `qc.c.mienter` and `qc.c.mienter.nest` both push registers ra, s0
(fp), t0-t6, and a0-a7 onto the stack (as well as some CSRs for the
interrupt context). The difference between them is that
`qc.c.mienter.nest` re-enables M-mode interrupts.
- `qc.c.mileaveret` will restore the registers that were saved by
`qc.c.mienter(.nest)`, and return from the interrupt.
These work for both standard M-mode interrupts and the non-maskable
interrupt CSRs added by Xqciint.
The `qc.c.mienter`, `qc.c.mienter.nest` and `qc.c.mileaveret`
instructions are compatible with push and pop instructions, inasmuch
as they (mostly) only spill the A- and T-registers, so we can use the
`Zcmp` or `Xqccmp` instructions to spill the S-registers. This
combination (`qci-(no)nest` and `Xqccmp`/`Zcmp`) is not implemented in
this change.
The `qc.c.mienter(.nest)` instructions use a specific register storage
order so that, when frame pointers are enabled, the frame pointer
convention's linked list is preserved past the current interrupt handler
and into the interrupted code and frames.
Co-authored-by: Pankaj Gode <quic_pgode@quicinc.com>
The VLMUL and policy enums originally lived in RISCVBaseInfo.h in the
backend which is where everything else in the RISCVII namespace is
defined.
RISCVTargetParser.h is used by much more of the compiler and it
doesn't really make sense to have 2 different namespaces exposed.
These enums are both associated with VTYPE so using the RISCVVType
namespace seems like a good home for them.
Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load
Upon recognizing these patterns, IA will collect the interleaved /
deinterleaved operands and delegate them over to their respective
newly-added TLI hooks.
For RISC-V, these patterns are lowered into segmented loads/stores.
Right now we only recognize power-of-two (de)interleave cases, in which
(de)interleave4/8 are synthesized from a tree of (de)interleave2.
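A factor-2 example of the load side (hedged; the segment-load mnemonic depends on the element width):

```llvm
; Recognized and lowered to a segment load (e.g. vlseg2e32) for RISC-V.
define { <vscale x 4 x i32>, <vscale x 4 x i32> } @de(ptr %p, i32 %evl) {
  %wide = call <vscale x 8 x i32> @llvm.vp.load.nxv8i32.p0(ptr %p, <vscale x 8 x i1> splat (i1 true), i32 %evl)
  %di = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %wide)
  ret { <vscale x 4 x i32>, <vscale x 4 x i32> } %di
}
declare <vscale x 8 x i32> @llvm.vp.load.nxv8i32.p0(ptr, <vscale x 8 x i1>, i32)
declare { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32>)
```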
---------
Co-authored-by: Nikolay Panchenko <nicholas.panchenko@gmail.com>
Previously, AArch64 used pattern matching to support
llvm.vector.(de)interleave of factors 2 and 4, while RISC-V only
supported a factor of 2.
This patch consolidates the logic in these two targets by factoring out
the common factor calculations into the InterleavedAccess pass.
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode` (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
The default legalization uses vmslt with a vector of XLen to compute a
mask. This doesn't work if the type isn't legal. For fixed vectors it
will scalarize. For scalable vectors it crashes the compiler.
This patch uses an alternate strategy that promotes the i1 vector to an
i8 vector and does the merge. I don't claim this to be the best
lowering. I wrote it quickly almost 3 years ago when a crash was
reported in our downstream.
Fixes #120405.
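A case of the crashing shape (hedged; the exact types in the issue may differ):

```llvm
; The default legalization would need an XLen-element compare vector
; (<vscale x 16 x i64> on rv64), which is not a legal RVV type.
define <vscale x 16 x i1> @merge(<vscale x 16 x i1> %m, <vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 %evl) {
  %r = call <vscale x 16 x i1> @llvm.vp.merge.nxv16i1(<vscale x 16 x i1> %m, <vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 %evl)
  ret <vscale x 16 x i1> %r
}
declare <vscale x 16 x i1> @llvm.vp.merge.nxv16i1(<vscale x 16 x i1>, <vscale x 16 x i1>, <vscale x 16 x i1>, i32)
```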
Lowering to load-acquire/store-release for RISC-V Zalasr.
Currently this uses the psABI lowerings for WMO load-acquire/store-release
(which are identical to A.7). These are incompatible with the A.6
lowerings currently used by LLVM. This should be OK for now, since Zalasr
is behind the experimental-extensions flag, but it needs to be fixed
before Zalasr moves out from behind that flag.
For TSO, it uses the standard Ztso mappings, except for lowering seq_cst
loads/stores to load-acquire/store-release; I had Andrea review that.
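For instance (hedged; the instruction choice is illustrative), an IR acquire load that previously lowered to `lw` plus fences can now use the Zalasr load-acquire:

```llvm
; With Zalasr: selects to a load-acquire (lw.aq). Without it: the
; fence-based WMO mapping from the psABI.
define i32 @acq(ptr %p) {
  %v = load atomic i32, ptr %p acquire, align 4
  ret i32 %v
}
```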
Enable `-fstack-clash-protection` for RISC-V and stack probing for
function prologues.
We probe the stack by creating a loop that allocates and probes the
stack in ProbeSize chunks.
We emit an unrolled probe loop for small allocations and a
variable-length probe loop for bigger ones.
I want to use this function for GISel too, so Type * is a better common
interface. All of the callers already convert EVT to Type * as needed
by calling lowering anyway.
Instead of directly lowering to vnsrl_vl and having custom pattern
matching for that case, we can just lower to a (legal) shift and
truncate, and let generic pattern matching produce the vnsrl.
The major motivation for this is that I'm going to reuse this logic to
handle e.g. deinterleave4 w/ i8 result.
The test changes aren't particularly interesting. They're minor code
improvements - I think because we do slightly better with the
insert_subvector patterns, but that's mostly irrelevant.
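The underlying equivalence, sketched in IR (hedged): taking every other byte lane is just a bitcast to wider elements, a shift, and a truncate, which generic combines already select to vnsrl:

```llvm
; Odd i8 lanes == high bytes of the i16 reinterpretation (little endian).
define <vscale x 8 x i8> @odds(<vscale x 16 x i8> %v) {
  %w = bitcast <vscale x 16 x i8> %v to <vscale x 8 x i16>
  %hi = lshr <vscale x 8 x i16> %w, splat (i16 8)
  %t = trunc <vscale x 8 x i16> %hi to <vscale x 8 x i8>
  ret <vscale x 8 x i8> %t
}
```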
This is an alternative fix for #81192. This allows the SelectionDAG
scheduler to be able to find a representative register class for i32 on
RV64. The representative register class is the super register class with
the largest spill size that is also legal. The default implementation of
findRepresentativeClass only works for legal types which i32 is not for
RV64.
I did some investigation of why tablegen uses i32 in output patterns on
RV64. It appears it comes down to a function called
ForceArbitraryInstResultType that picks a type for the output
pattern when the isel pattern isn't specific enough. I believe it picks
the smallest (lowest-numbered) type to resolve the conflict.
A similar issue occurs for f16 and bf16 which both use the FPR16
register class. If the isel pattern doesn't specify, tablegen may find
both f16 and bf16 and may pick bf16 from Zfh pattern when Zfbfmin isn't
present. Since bf16 isn't legal in that case, findRepresentativeClass
will fail.
For i8, i16, i32, this patch calls the base class with XLenVT to get the
representative class since XLenVT is always legal.
For bf16/f16, we call the base class with f32 since all of the f16/bf16
extensions depend on either F or Zfinx which will make f32 a legal type.
The final representative register class further depends on whether D or
Zdinx is also enabled, but that should be handled by the default
implementation.
This patch adds support for getting even-odd general purpose register
pairs into and out of inline assembly using the `R` constraint as
proposed in riscv-non-isa/riscv-c-api-doc#92
There are a few different pieces to this patch, each of which need their
own explanation.
- Renames the Register Class used for f64 values on rv32i_zdinx from
`GPRPair*` to `GPRF64Pair*`. These register classes are kept broadly
unmodified, as their primary value type is used for type inference
over selection patterns. This rename affects quite a lot of files.
- Adds new `GPRPair*` register classes which will be used for `R`
constraints and for instructions that need an even-odd GPR pair. This
new type is used for `amocas.d.*`(rv32) and `amocas.q.*`(rv64) in
Zacas, instead of the `GPRF64Pair` class being used before.
- Marks the new `GPRPair` class as legal for holding `MVT::Untyped`.
Two new RISCVISD node types are added for creating and destructuring a
pair - `BuildGPRPair` and `SplitGPRPair` - which are introduced when
bitcasting to/from the pair type and `untyped`.
- Adds functionality to `splitValueIntoRegisterParts` and
`joinRegisterPartsIntoValue` to handle changing `i<2*xlen>` MVTs into
`untyped` pairs.
- Adds an override for `getNumRegisters` to ensure that `i<2*xlen>`
values, when going to/from inline assembly, only allocate one (pair)
register (they would otherwise allocate two). This is due to a bug in
SelectionDAGBuilder.cpp which other backends also work around.
- Ensures that Clang understands that `R` is a valid inline assembly
constraint.
- This also allows `R` to be used for `f64` types on `rv32_zdinx`
architectures, where doubles are stored in a GPR pair.
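An IR-level sketch of the constraint in use on rv32 (hedged; the empty asm string is a placeholder):

```llvm
; %x is i64, so "R" places it in an even/odd GPR pair; the "0" ties the
; input to the same pair as the output.
define i64 @roundtrip(i64 %x) {
  %r = call i64 asm sideeffect "", "=R,0"(i64 %x)
  ret i64 %r
}
```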
This intrinsic was introduced by #92289 and currently we just expand
it for RISC-V.
This patch adds custom lowering for this intrinsic, simply mapping it
to the `vcompress` instruction.
Fixes #113242.
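A minimal example (hedged) that now selects to `vcompress.vm` instead of the generic expansion:

```llvm
define <vscale x 4 x i32> @compress(<vscale x 4 x i32> %v, <vscale x 4 x i1> %m) {
  %r = call <vscale x 4 x i32> @llvm.experimental.vector.compress.nxv4i32(<vscale x 4 x i32> %v, <vscale x 4 x i1> %m, <vscale x 4 x i32> poison)
  ret <vscale x 4 x i32> %r
}
declare <vscale x 4 x i32> @llvm.experimental.vector.compress.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i1>, <vscale x 4 x i32>)
```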