We can use `viota`+`vrgather` to synthesize `vdecompress` and lower
expanding load to `vcpop`+`load`+`vdecompress`.
And if `%mask` is all ones, we can lower expanding load to a normal
unmasked load.
Fixes#101914.
In https://reviews.llvm.org/D147608 we added custom lowering for
integers, but inadvertently also marked it as custom for scalable FP
vectors despite not handling it.
This adds handling for floats and marks it as custom lowered for
fixed-length FP vectors too.
Note that this doesn't handle bf16 or f16 vectors that would need
promotion, but these scalar_to_vector nodes seem to be emitted when
expanding them.
This intrinsic was introduced by #92289 and currently we just expand
it for RISC-V.
This patch adds custom lowering for this intrinsic and simply maps
it to `vcompress` instruction.
Fixes#113242.
I'm not sure if this fix is required, but I've written the patch anyway.
This does not cause test changes, but we haven't got tests that try to
use all 32 registers in inline assembly.
Broadly, for GPRs, we made the explicit choice that `r` constraints
would never attempt to use `x0`, because `x0` isn't really usable like
the other GPRs. I believe the same thing applies to `Zhinx`, `Zfinx` and
`Zdinx` because they should not be allocating operands to `x0` either,
so this patch introduces new `NoX0` classes for `GPRF16` and `GPRF32`
registers, and uses them with inline assembly. There is also a
`GPRPairNoX0` for the `Zdinx` case on rv32, avoiding use of the `x0`
pair which has different behaviour to the other GPR pairs.
This change implements support for the `cr` and `cf` register
constraints (which allocate a RVC GPR or RVC FPR respectively), and the
`N` modifier (which prints the raw encoding of a register rather than
the name).
The intention behind these additions is to make it easier to use inline
assembly when assembling raw instructions that are not supported by the
compiler, for instance when experimenting with new instructions or when
supporting proprietary extensions outside the toolchain.
These implement part of my proposal in riscv-non-isa/riscv-c-api-doc#92
As part of the implementation, I felt there was not enough coverage of
inline assembly and the "in X" floating-point extensions, so I have
added more regression tests around these configurations.
This is implementation is based on what the X86 target does but
emitting the instructions that GCC emits for rv64.
---------
Co-authored-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com>
This fixes all the places that hit the new assertion added in
https://github.com/llvm/llvm-project/pull/106524 in tests. That is,
cases where the value passed to the APInt constructor is not an N-bit
signed/unsigned integer, where N is the bit width and signedness is
determined by the isSigned flag.
The fixes either set the correct value for isSigned, set the
implicitTrunc flag, or perform more calculations inside APInt.
Note that the assertion is currently still disabled by default, so this
patch is mostly NFC.
For regular floating point types we mark these as expanded on scalable
vectors so they're not legal in the cost model, so this does the same
for f16 w/ zvfhmin and bf16.
The aim is to have the same set of promotions on fixed-length bf16
vectors as on fixed-length f16 vectors, and then deduplicate them
similarly to what was done for scalable vectors.
It looks like fneg/fabs/fcopysign end up getting expanded because fsub
is now legal, and the default operation action must be expand.
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
It turns out that {s,u}int_to_fp nodes get their operation action from
their operand's type, not the result type, so we don't need to set it
for fp16 or bf16. vp_{s,u}int_to_fp uses the result type though so we
need to keep it.
This also means that we can lower int_to_fp for fixed length bf16
vectors already, so this adds tests for that.
The cost model test changes are due to BasicTTIImpl's getCastInstrCost
not taking into account that int_to_fp needs its legal type swapped.
This can be fixed in a later patch, but its worth noting that the
affected types in the tests currently crash when lowered anyway (due to
them needing split at LMUL > 8)
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
In https://reviews.llvm.org/D153848, promotion was added for a variety
of f16 ops with zvfhmin, including VP reductions.
However I don't believe it's correct to promote f16 fadd or fmul
reductions to f32 since we need to round the intermediate results.
Today if we lower @llvm.vp.reduce.fadd.nxv1f16 on RISC-V, we'll get two
different results depending on whether we compiled with +zvfh or
+zvfhmin, for example with a 3 element reduction:
; v9 = [0.1563, 5.97e-8, 0.00006104]
; zvfh
vsetivli x0, 3, e16, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v9, v8
vfmv.f.s fa0, v8
; fa0 = 0.1563
; zvfhmin
vsetivli x0, 3, e16, m1, ta, ma
vfwcvt.f.f.v v10, v9
vsetivli x0, 3, e32, m1, ta, ma
vmv.v.i v8, 0
vfredosum.vs v8, v10, v8
vfmv.f.s fa0, v8
fcvt.h.s fa0, fa0
; fa0 = 0.1564
This same thing happens with reassociative reductions e.g. vfredusum.vs,
and this also applies for bf16.
I couldn't find anything in the LangRef for reductions that suggest the
excess precision is allowed. There may be something we can do in Clang
with -fexcess-precision=fast, but I haven't looked into this yet.
I presume the same precision issue occurs with fmul, but not with
fmin/fmax/fminimum/fmaximum.
I can't think of another way of lowering these other than scalarizing,
and we can't scalarize scalable vectors, so this just removes the
promotion and adjusts the cost model to return an invalid cost. (It
looks like we also don't currently cost fmul reductions, so presumably
they also have an invalid cost?)
I think this should be enough to stop the loop vectorizer or SLP from
emitting these intrinsics.
This is the dual of #110144, but doesn't handle the case when the scalar
type is illegal i.e. no zfhmin/zfbfmin. It looks like softening isn't
yet implemented for insert_vector_elt operands and it will crash during
type legalization, so I've left that configuration out of the tests.
Previously we crashed because we had no lowering for f16/bf16 scalable
vectors.
Because the lowering uses vrgather_vv_vl, we need to add bf16 patterns
for it.
Previously we used an empty MachinePointerInfo. I checked a few other
targets like X86, ARM, and AArch64 and they all appear to use correct
MachinePointerInfo.
Support for large code model are added recently, and sementically
direct calls are lowered to an indirect branch with a constant pool target.
By default it does not use the x7 register and this is suboptimal with
Zicfilp because it introduces landing pad check, which is unnecessary
since the constant pool is read-only and unlikely to be tampered.
Change direct calls and tail calls to use x7 as the scratch
register (a.k.a. software guarded branch in the CFI spec)
When the scalar type is illegal, it gets softened during type
legalization and gets lowered as an integer.
However with zfhmin/zfbfmin the type is now legal and it passes through
type legalization where it crashes because we didn't have any custom
lowering or patterns for it.
This handles said case via the existing custom lowering to a vslidedown
and vfmv.f.s.
It also handles the case where we only have zvfhmin/zvfbfmin and don't
have vfmv.f.s, in which case we need to extract it to a GPR and then use
fmv.h.x.
Fixes#110126
-Reduce the scope of some variables.
-Use getArgOperand instead of getOperand to get intrinsic operands.
-Use initialize_list instead of a SmallVector.
-Remove wide VectorType variable that is only used to check fixed vs
scalable. We can use the narrow VectorType for that.
We can lower f16/bf16 memory ops without promotion through the existing
custom lowering.
Some of the zero strided VP loads get combined to a VP splat, so we need
to also handle the lowering for that for f16/bf16 w/ zvfhmin/zvfbfmin.
This patch copies the lowering from ISD::SPLAT_VECTOR over to
lowerScalarSplat which is used by the VP splat lowering.
This patch fixes:
llvm/lib/Target/RISCV/RISCVISelLowering.cpp:10479:12: error:
variable 'SubRegIdx' set but not used
[-Werror,-Wunused-but-set-variable]
Create the equivalent INSERT_SUBVECTOR/EXTRACT_SUBVECTOR instead.
When we tried porting this to global isel, we noticed that subreg
operations are created early. We aren't able to do this until
instruction selection in global isel.
For SelectionDAG, it makes sense to use insert/extract_subvector as the
canonical form for these operations pre-isel. If it had come into
SelectionDAG as a insert/extract_subvector we would have kept it in that
form.
Use a setne with 0 instead of a trunc. We know we zero extended the node
so we can get by with a non-zero check only. The truncate lowering
doesn't know that we zero extended so has to mask the lsb.
I don't think DAG combine sees the trunc before we lower it to RISCVISD
nodes so we don't get a chance to use computeKnownBits to remove the
AND.
For f16 with zvfhmin, we promote most ops and VP ops to f32. This does
the same for bf16 with zvfbfmin, so the two fp types should now be in
sync.
There are a few places in the custom lowering where we need to check for
a LMUL 8 f16/bf16 vector that can't be promoted and must be split, this
extracts that out into isPromotedOpNeedingSplit.
In a follow up NFC we can deduplicate the code that sets up the
promotions.
This adds zvfhmin test coverage for fceil, ffloor, fnearbyint, frint,
fround and froundeven and splits them at nxv32f16 to avoid crashing,
similarly to what we do for other nodes that we promote.
This also sets ftrunc to promote which was previously missing. We
already promote the VP version of it, vp_froundtozero.
Marking it as promoted affects some of the cost model tests since
they're no longer expanded.
This patch adds support for the missing STRICT_UINT_TO_FP and
STRICT_SINT_TO_FP for riscv and adds a test case for rv32 which was
previously crashing.
The code is in line with how other strict_* nodes are handled
(e.g., getting op(1) instead of op(0) when it's a strict node, as op(0)
in a strict node is the entry token).
We currently make sure to check that if folding an op to an f16 widening
op that we have zvfh. We need to do the same for bf16 vectors, but with
the further restriction that we can only combine vfmadd_vl to vfwmadd_vl
(to get vfwmaccbf16.v{v,f}).
The added test case currently crashes because we try to fold an add to a
bf16 widening add, which doesn't exist in zvfbfmin or zvfbfwma
This moves the checks into the extension support checks to keep it one
place.
We need to insert a insert_subvector or extract_subvector which feels
pretty custom.
This should make it easier to support fixed vector arguments for GISel.