A disjoint OR can be converted to XOR, and an XOR+NOT is XNOR. Idea
taken from #147279.
I changed the existing xnor pattern to have the not on the outside
instead of the inside. These are equivalent for xor since xor is
associative. Tablegen was already generating multiple variants
of the isel pattern using associativity.
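A minimal IR sketch of the combined transform (function name hypothetical, assuming Zbb provides xnor):
```
define i32 @not_of_disjoint_or(i32 %a, i32 %b) {
  ; disjoint means %a and %b share no set bits, so the or is also an xor
  %or = or disjoint i32 %a, %b
  ; not of an xor is an xnor, so this can select to a single xnor
  %not = xor i32 %or, -1
  ret i32 %not
}
```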
There are some issues here. The disjoint flag isn't preserved
through type legalization. I was hoping we could recover it
manually for the masked merge cases, but that doesn't work either.
This patch adds a new `GetVTypeMinimalPredicates` for the `f16` operations
supported by `Zvfhmin`, splitting the type predicates into minimal support
and full compute support. This is a refactoring patch in preparation for
implementing vector compute support for bf16 (Zvfbfa), so that
`GetVTypePredicates` can check whether the `bf16` type requires the
`Zvfbfa` extension.
This commit moves RISC-V to auto-generate its target-specific SDNode
types. The biggest changes are that SDNodes can now be validated against
their expected type profiles and that we no longer need to edit several
different files when declaring a new one.
This takes Sergei's work in #119709 and "finishes" it by moving the
final five RISCVISD opcodes into tablegen (including defining their
types), and by ensuring the tablegen has the expected closing scope
comments.
Co-authored-by: Sergei Barannikov <barannikov88@gmail.com>
This is the fixed-length equivalent of #136716.
The pattern we need to match is ({s,z}ext_vl (or_vl disjoint a, b)).
This only allows or_vls with an undef passthru, which lets us ignore
its mask and VL and just take them from the {s,z}ext_vl.
A riscv_or_vl_is_add_oneuse PatFrag is added to mirror or_is_add in
RISCVInstrInfo.td.
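As a sketch, the fixed-length IR this lets us match looks something like (function name hypothetical):
```
define <4 x i32> @zext_disjoint_or(<4 x i16> %a, <4 x i16> %b) {
  ; the disjoint or behaves like an add, so the zext of it can
  ; select to a widening add such as vwaddu.vv
  %or = or disjoint <4 x i16> %a, %b
  %ext = zext <4 x i16> %or to <4 x i32>
  ret <4 x i32> %ext
}
```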
We already had this transform for the intrinsics, but hadn't added it
for either fixed-length or scalable vectors coming from normal IR.
For the record, the fact we have three different sets of patterns here
really is quite ugly.
Add a policy operand to set the tail agnostic policy instead of using
ForceTailAgnostic. The masked to unmasked transforms had to be updated
to drop the policy operand when converting to unmasked.
This is another attempt at #88496 to keep mask operands in SSA after
instruction selection.
Previously we selected the mask operands into vmv0, a singleton register
class with exactly one register, V0.
But the register allocator doesn't really support singleton register
classes and we ran into errors like "ran out of registers during
register allocation in function".
This patch avoids that by introducing a pass just before register
allocation that converts any use of vmv0 to a copy to $v0, i.e. what
isel does today.
That way the register allocator doesn't need to deal with the singleton
register class, but we get the benefits of having the mask registers in
SSA throughout the backend:
- This allows RISCVVLOptimizer to reduce the VLs of instructions that
define mask registers
- It enables CSE and code sinking in more places
- It removes the need to peek through mask copies in RISCVISelDAGToDAG
and to keep track of V0 defs in RISCVVectorPeephole
This patch initially eliminates uses of vmv0s after RISCVVectorPeephole
to keep the diff to a minimum, and a follow up patch will move it past
the other MachineInstr SSA passes.
Note that it doesn't try to remove any defs of vmv0, as we shouldn't
have any instructions with vmv0 outputs.
As a further follow up, we can move the elimination pass to after phi
elimination and outside of SSA, which would unblock the pre-RA scheduler
around masked pseudos. This might also help the issue that
RISCVVectorMaskDAGMutation tries to solve.
Because the fpext has a single-use constraint on it, we can't match
cases where it's used for both operands.
Introduce a new PatFrag that allows multiple uses on a single user and
use it for the binary patterns, and some ternary patterns.
(For some of the ternary patterns there is a fneg that counts as a
separate user; we still need to handle these.)
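A minimal sketch of the newly matchable shape (function name hypothetical): the extend has two uses, but both belong to a single user, so it can still fold into a widening op like vfwmul.vv:
```
define <vscale x 2 x double> @square(<vscale x 2 x float> %a) {
  ; one fpext feeds both operands of one fmul; the old single-use
  ; check rejected this, the new PatFrag accepts it
  %ext = fpext <vscale x 2 x float> %a to <vscale x 2 x double>
  %mul = fmul <vscale x 2 x double> %ext, %ext
  ret <vscale x 2 x double> %mul
}
```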
Because riscv_fpextend_vl doesn't have a passthru operand the tail
elements are undef, so we can treat them as if they were active.
Relaxing this allows us to match widening reductions where the fpextend
isn't a VP intrinsic.
This same reasoning is already used for riscv_fpextend_vl in
RISCVInstrInfoVSDPatterns.td.
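For example, something like the following (a sketch, function name hypothetical) can now fold the extend into a widening reduction:
```
define double @widening_red(<vscale x 2 x float> %v, double %acc) {
  ; a plain (non-VP) fpext: its tail elements are undef, so they can
  ; be treated as active and the extend folded into the reduction
  %e = fpext <vscale x 2 x float> %v to <vscale x 2 x double>
  %r = call double @llvm.vector.reduce.fadd.nxv2f64(double %acc, <vscale x 2 x double> %e)
  ret double %r
}
```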
This avoids lowering scalable fp_extends that don't need multiple
extends (i.e. f16->f32, f32->f64) to _vl nodes, but converts them back
during DAG preprocessing so we don't need to add any more patterns.
Keeping the nodes in their generic SDNode form matches more splat
patterns.
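For instance, a single-step extend like this (a sketch, function name hypothetical) now stays generic until DAG preprocessing:
```
define <vscale x 4 x float> @single_step(<vscale x 4 x half> %v) {
  ; f16->f32 needs only one extend, so it is kept as a generic
  ; fp_extend and only rewritten to a _vl node in preprocessing
  %e = fpext <vscale x 4 x half> %v to <vscale x 4 x float>
  ret <vscale x 4 x float> %e
}
```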
Make SplatPat_simm5_plus1 responsible for decrementing the immediate
instead of requiring the DecImm SDNodeXForm to be applied afterwards.
This allows better sharing of tablegen classes.
Replace LMUL suffixes with _B1, _B2, etc. This matches what we do
for other mask only instructions like VCPOP_M, VFIRST_M, VMSBF_M,
VLM, VSM, etc.
Now all pseudoinstructions that use Log2SEW=0 will be consistently
named.
Instead of directly lowering to vnsrl_vl and having custom pattern
matching for that case, we can just lower to a (legal) shift and
truncate, and let generic pattern matching produce the vnsrl.
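For example (a sketch, function name hypothetical), extracting the high halves via shift and truncate:
```
define <4 x i8> @high_halves(<4 x i16> %v) {
  ; a legal shift followed by a truncate; the generic patterns
  ; match this pair as a single vnsrl.wi
  %s = lshr <4 x i16> %v, splat (i16 8)
  %t = trunc <4 x i16> %s to <4 x i8>
  ret <4 x i8> %t
}
```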
The major motivation for this is that I'm going to reuse this logic to
handle e.g. deinterleave4 with an i8 result.
The test changes aren't particularly interesting. They're minor code
improvements - I think because we do slightly better with the
insert_subvector patterns, but that's mostly irrelevant.
When lowering fixed length f16 insert_subvector nodes at index 0 we
crashed with zvfhmin because we couldn't select vmv_v_v_vl.
This was due to the predicates requiring full zvfh, even though we only
need zve32x. Use the integer VTypeInfo instead, similar to
VPatSlideVL_VX_VI.
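A sketch of the kind of case that crashed (function name hypothetical, assuming zvfhmin without zvfh):
```
define <8 x half> @insert_at_zero(<8 x half> %v, <2 x half> %sub) {
  ; inserting a fixed-length f16 subvector at index 0 previously
  ; failed to select vmv_v_v_vl under zvfhmin
  %r = call <8 x half> @llvm.vector.insert.v8f16.v2f16(<8 x half> %v, <2 x half> %sub, i64 0)
  ret <8 x half> %r
}
```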
The extract_subvector tests aren't related but were just added for
consistency with the insert_subvector tests.
These pseudos used to be handled by a CustomInserter that inserted the
rounding mode change for vector ceil, floor, etc. At some point they
were changed to use the InsertReadWriteCSR pass instead of the custom
inserter. I believe that makes them redundant with the pseudos used by
the RVV intrinsics with a rounding mode operand.
Previously we crashed because we had no lowering for f16/bf16 scalable
vectors.
Because the lowering uses vrgather_vv_vl, we need to add bf16 patterns
for it.
selectFPImm previously handled cases where an FPImm could be
materialized in an integer register.
We can generalize this to cases where a value was in an integer register
and then copied to a scalar FP register to be used by a vector
instruction.
In the affected test, the call lowering code used up all of the FP
argument registers and started using GPRs. Now we use integer vector
instructions to consume those GPRs instead of moving them to scalar FP
first.
This adds VL patterns for vfwmaccbf16.vv so that we can handle fixed
length vectors.
It does this by teaching combineOp_VLToVWOp_VL to emit
RISCVISD::VFWMADD_VL for bf16. The change in getOrCreateExtendedOp is
needed because getNarrowType is based on the bitwidth and so returns
f16; we need to explicitly check for bf16.
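As a sketch (function name hypothetical, assuming Zvfbfwma), the fixed-length shape this enables:
```
define <4 x float> @bf16_widening_fma(<4 x bfloat> %a, <4 x bfloat> %b, <4 x float> %acc) {
  ; extends of both multiplicands feeding an fma can now select
  ; to vfwmaccbf16.vv for fixed-length vectors
  %ea = fpext <4 x bfloat> %a to <4 x float>
  %eb = fpext <4 x bfloat> %b to <4 x float>
  %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %ea, <4 x float> %eb, <4 x float> %acc)
  ret <4 x float> %r
}
```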
Note that the .vf patterns don't work yet, since the build_vector splat
gets lowered to a (vmv_v_x_vl (fmv_x_anyexth x)) instead of a vfmv.v.f,
which SplatFP doesn't pick up, see #106637.
In #71501 we removed the LMUL variants for vmv.s.x and vmv.x.s because
they ignore register groups, so this patch does the same for their
floating point equivalents.
We don't need to add any extra patterns for extractelt in
RISCVInstrInfoVSDPatterns.td because in lowerEXTRACT_VECTOR_ELT we make
sure that the node is narrowed down to LMUL 1.
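For instance (a sketch, function name hypothetical):
```
define half @extract_first(<vscale x 8 x half> %v) {
  ; lowerEXTRACT_VECTOR_ELT narrows the source to LMUL 1 first,
  ; so only the LMUL 1 vfmv.f.s pattern is needed
  %e = extractelement <vscale x 8 x half> %v, i64 0
  ret half %e
}
```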
These new opcodes drop the shift amount, rounding mode, and passthru,
making them exactly like TRUNCATE_VECTOR_VL. The shift amount, rounding
mode, and passthru are added in isel patterns, similar to how we
translate TRUNCATE_VECTOR_VL to vnsrl with a shift of 0.
This should simplify #99418 a little.
We can shuffle vXf16 vectors just like vXi16 vectors. We don't need any
FP instructions. Update the vrgather and vslide patterns to only check
the predicates based on the equivalent integer type. If we
use the FP type it will check Zvfh and block Zvfhmin.
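For example (a sketch, function name hypothetical), a pure lane shuffle of f16 elements:
```
define <4 x half> @reverse(<4 x half> %v) {
  ; no f16 arithmetic is involved, so the predicates of the
  ; equivalent integer type suffice and Zvfhmin is enough
  %s = shufflevector <4 x half> %v, <4 x half> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x half> %s
}
```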
These are probably not the only patterns that need to be fixed, but the
test from the bug report no longer crashes.
Fixes #97477
We try not to let bf16 splats through to isel, but constant folding
allows FP constants to get through. Thankfully we can handle those
using vmv.v.i or vmv.v.x.
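For example (a sketch, function name hypothetical, assuming Zvfbfmin makes the type legal):
```
define <vscale x 2 x bfloat> @splat_one() {
  ; a constant-folded bf16 splat; the bit pattern of 1.0 (0x3F80)
  ; can be put in a GPR and broadcast with vmv.v.x
  ret <vscale x 2 x bfloat> splat (bfloat 1.0)
}
```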
We had specific patterns for riscv_vfmv_v_f_vl in both RISCVInstrInfoVVLPatterns.td
and RISCVInstrInfoVSDPatterns.td.
The RISCVInstrInfoVSDPatterns.td patterns could only match if the
RISCVInstrInfoVVLPatterns.td patterns failed. As far as I can tell this
would only happen if the predicate didn't match. Tweak the predicate
so the RISCVInstrInfoVVLPatterns.td patterns can match in more cases.
This isn't used by clang and isn't in the rvv-intrinsic-doc.
The instruction requires Zvfh.
If the F register passed to the instruction isn't NaN-boxed correctly,
the instruction will generate the wrong NaN, so it isn't a generic
FPR16-to-vector-register move instruction.
Similar to #93596, this moves the signed vnclip patterns into DAG
combine.
This will allow us to support more than one level of truncate in a
future patch.
I plan to add support for multiple layers of vnclipu. For example,
i32->i8 using 2 vnclipu instructions. First clipping to 65535, then
clipping to 255. Similar for signed vnclip.
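A sketch of the kind of IR this aims to handle (function name hypothetical):
```
define <vscale x 4 x i8> @trunc_sat_u8u32(<vscale x 4 x i32> %v) {
  ; an unsigned saturate to 255 followed by a two-step truncate;
  ; the plan is two vnclipu instructions, i32->i16 then i16->i8
  %min = call <vscale x 4 x i32> @llvm.umin.nxv4i32(<vscale x 4 x i32> %v, <vscale x 4 x i32> splat (i32 255))
  %t = trunc <vscale x 4 x i32> %min to <vscale x 4 x i8>
  ret <vscale x 4 x i8> %t
}
```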
This scales poorly if we need to add patterns with 2 or 3 truncates.
Instead, move the code to DAGCombiner with new ISD opcodes to represent
VCLIP(U).
This patch just moves the existing patterns into DAG combine. Support
for multiple truncates will come as a follow-up. A similar patch series
will be made for the signed vnclip.
I think the behaviors are the same, assuming the following correctly describes them.
AVGFLOORS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1 before truncating to the original bit width.
This is vaadd with rdn rounding mode.
AVGCEILS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1. If the bit shifted out is 1, it adds 1 to
the shifted value, then truncates to the original bit width. This is
vaadd with rnu rounding mode.
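As an illustration (a sketch, function name hypothetical), the IR shape DAGCombine recognizes as AVGFLOORS:
```
define <vscale x 4 x i8> @avg_floor(<vscale x 4 x i8> %a, <vscale x 4 x i8> %b) {
  ; sign extend, add, arithmetic shift right by 1, truncate:
  ; this becomes AVGFLOORS and then vaadd with rdn
  %ea = sext <vscale x 4 x i8> %a to <vscale x 4 x i16>
  %eb = sext <vscale x 4 x i8> %b to <vscale x 4 x i16>
  %add = add <vscale x 4 x i16> %ea, %eb
  %sh = ashr <vscale x 4 x i16> %add, splat (i16 1)
  %t = trunc <vscale x 4 x i16> %sh to <vscale x 4 x i8>
  ret <vscale x 4 x i8> %t
}
```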
I think this wasn't implemented previously because there was some
confusion about what average means. Some may expect average to round
towards zero, but there is no way to do that in RISC-V or with the
SelectionDAG nodes. Related issue
https://github.com/riscv/riscv-v-spec/issues/935
Similar to #75145, but for scalable vectors.
Specifically, this patch works for the below optimization case:
## Source Code
```
define void @trunc_sat_i8i16_maxmin(ptr %x, ptr %y) {
%1 = load <vscale x 4 x i16>, ptr %x, align 16
%2 = tail call <vscale x 4 x i16> @llvm.smax.nxv4i16(<vscale x 4 x i16> %1, <vscale x 4 x i16> splat (i16 -128))
%3 = tail call <vscale x 4 x i16> @llvm.smin.nxv4i16(<vscale x 4 x i16> %2, <vscale x 4 x i16> splat (i16 127))
%4 = trunc <vscale x 4 x i16> %3 to <vscale x 4 x i8>
store <vscale x 4 x i8> %4, ptr %y, align 8
ret void
}
```
## Before this patch
[Compiler Explorer](https://godbolt.org/z/EKc9eGvo8)
```
trunc_sat_i8i16_maxmin:
vl1re16.v v8, (a0)
li a0, -128
vsetvli a2, zero, e16, m1, ta, ma
vmax.vx v8, v8, a0
li a0, 127
vmin.vx v8, v8, a0
vsetvli zero, zero, e8, mf2, ta, ma
vnsrl.wi v8, v8, 0
vse8.v v8, (a1)
ret
```
## After this patch
```
trunc_sat_i8i16_maxmin:
vl1re16.v v8, (a0)
vsetvli a2, zero, e8, mf2, ta, ma
vnclip.wi v8, v8, 0
vse8.v v8, (a1)
ret
```