The simplest way is:
1. Save `vtype` to a scalar register.
2. Insert a `vsetvli`.
3. Use segment load/store.
4. Restore `vtype` via `vsetvl`.
But `vsetvl` is usually slow, so this PR does not take that approach.
Instead, we use wider whole register load/store instructions if the register
encoding is aligned. We have done the same optimization for COPY in
https://github.com/llvm/llvm-project/pull/84455.
We found this suboptimal implementation while porting some video codec
kernels using RVV intrinsics.
This keeps it closer to the other legality checks like the FP exceptions
check.
It also means that isSupportedInstr only needs to check the opcode,
which allows it to be replaced with a TSFlags-based check in a later
patch.
That is, on all targets except ARM and AArch64.
This field used to be required due to a bug, which was fixed long ago
by 23423c0ea8d414e56081cb6a13bd8b2cc91513a9.
This patch implements pages 15-17 from
jhauser.us/RISCV/ext-P/RVP-instrEncodings-015.pdf
Documentation:
jhauser.us/RISCV/ext-P/RVP-baseInstrs-014.pdf
jhauser.us/RISCV/ext-P/RVP-instrEncodings-015.pdf
Add trailing newlines to the following files to comply with POSIX
standards:
- llvm/lib/Target/RISCV/RISCVInstrInfoXSpacemiT.td
- llvm/test/MC/RISCV/xsmtvdot-invalid.s
- llvm/test/MC/RISCV/xsmtvdot-valid.s
Closes #151706
Previously, we folded `(vfmv.s.f (extract_subvector X, 0))` into X when
X's type was the same as `vfmv.s.f`'s result type. This patch generalizes
the fold to an insert_subvector when X is narrower and an
extract_subvector when X is wider.
Co-authored-by: Craig Topper <craig.topper@sifive.com>
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
scalable vectors the runtime element count can exceed the known
minimum, so the last lane is not at a compile-time-known index;
for SVE at least, extracting it is non-trivial as it requires
the use of whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index counts
backwards from the end of the vector: given a vector with
ElementCount EC, the extracted/inserted lane is EC - 1 - Index.
For scalable vectors that lane is unknown at compile time. I've
added an AArch64 hook that better represents the cost, and also
a RISCV hook that maintains compatibility with the behaviour
prior to this PR.
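To make the indexing convention concrete, here is a minimal sketch (the
helper below is illustrative only, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>

// Reverse index I addresses lane EC - 1 - I, so I == 0 is always the last
// lane. For a scalable vector, EC = KnownMinEC * vscale is only known at
// runtime, which is why the forward lane number can't be a compile-time
// constant.
uint64_t reverseToForwardLane(uint64_t EC, uint64_t ReverseIdx) {
  assert(ReverseIdx < EC && "reverse index out of range");
  return EC - 1 - ReverseIdx; // e.g. EC == 8, ReverseIdx == 0 -> lane 7
}
```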
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
InputArg/OutputArg now contain the OrigTy, so use that directly
instead of trying to recover it.
CC_RISCV is now *nearly* a normal CC assignment function. However, it
still differs by having an IsRet flag.
Generate QC_INSB/QC_INSBI from `or (and X, MaskImm), OrImm` iff the
value being inserted only sets known zero bits. This is based on a
similar DAG-to-DAG transform done in `AArch64`.
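For illustration, a source-level shape that produces this DAG; the
constants are made up, and the real transform additionally checks that the
inserted value only sets bits known to be zero after the AND:

```cpp
#include <cstdint>

// (x & ~0x0FF0) clears bits [11:4]; OR-ing 0x0A30 afterwards only sets bits
// inside the cleared field, so the expression as a whole behaves as a
// bit-field insert -- the shape this transform looks for.
uint32_t insertFieldExample(uint32_t x) {
  return (x & ~0x0FF0u) | 0x0A30u;
}
```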
This is a more general form of the recently added isel pattern
(seteq (i64 (and GPR:$rs1, 0x8000000000000000)), 0)
-> (XORI (i64 (SRLI GPR:$rs1, 63)), 1)
We can use a right shift for any AND mask that is a negated power
of 2, but for every other constant we need to use seqz instead of
xori. I don't think there is a benefit to xori over seqz, as neither
is compressible.
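To make the generalization concrete, here is a hedged source-level example
(the constant is chosen only for illustration):

```cpp
#include <cstdint>

// 0xFFFFFFFFFFFFF000 is a negated power of 2 (~0xFFF), so the masked
// compare against zero is equivalent to (x >> 12) == 0 and can be lowered
// as srli + seqz instead of materializing the mask constant.
bool upperBitsZero(uint64_t x) {
  return (x & 0xFFFFFFFFFFFFF000ull) == 0;
}
```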
We already do this transform from target independent code when the setcc
constant is a non-zero subset of the AND mask that is not a legal icmp
immediate.
I don't believe any of these patterns comparing MSBs to 0 are
canonical according to InstCombine. The canonical form is (X < 4096).
I'm curious if these appear during SelectionDAG and if so, how.
My goal here was just to remove the special case isel patterns.
This helps the 3 vendor extensions that make sext_inreg i1 legal.
I'm delaying this until after LegalizeDAG since we normally have
sext_inreg i1 up until LegalizeDAG turns it into and+neg.
I also delayed the recently added (sext_inreg (xor (setcc), -1), i1)
combine, though the xor isn't likely to appear before LegalizeDAG anyway.
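For reference, a tiny sketch of the and+neg form that LegalizeDAG expands
sext_inreg i1 into (illustrative, not code from the patch):

```cpp
#include <cstdint>

// Sign-extend from bit 0: keep only bit 0 and negate, mapping 0 -> 0 and
// 1 -> -1 (all ones), which is exactly what sext_inreg i1 computes.
int64_t sextFromI1(int64_t x) {
  return -(x & 1);
}
```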
Resolve the TODO: on RV32, when constructing the double-precision
constant `+0.0` for `s64`, `BuildPairF64Pseudo` can be optimized to use
the `fcvt.d.w` instruction to generate the result directly.
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N> {};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
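As a small illustration of the interchangeability (sketch only):

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"

void example(int *P) {
  llvm::SmallSet<int *, 8> A;    // resolves to SmallPtrSet via the specialization above
  llvm::SmallPtrSet<int *, 8> B; // what this patch spells out directly
  A.insert(P);
  B.insert(P);
  (void)A.contains(P);
  (void)B.contains(P);
}
```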
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, as touching them would likely cause
platform- or compiler-specific build failures.
pli.h and pli.w both accept signed immediates, so pli.b should too. But
unlike those instructions, pli.b doesn't do any extension, so it's OK to
accept an unsigned immediate as well.
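A one-line illustration of why accepting both spellings is harmless,
assuming the instruction simply replicates the low 8 bits of the immediate
into each byte element:

```cpp
#include <cstdint>

// -1 and 255 share the same 8-bit pattern, so without sign or zero
// extension the signed and unsigned spellings produce identical bytes.
static_assert(static_cast<uint8_t>(-1) == 0xFF, "same byte pattern");
```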
Currently we have a switch statement that checks if a vector instruction
may read elements past VL. However, it doesn't account for
instructions in vendor extensions.
Handling all possible vendor instructions will result in quite a lot of
opcodes being added, so I've created a new TSFlag that we can declare in
TableGen, and added it to the existing instruction definitions.
I've tried to be as conservative as possible here: all SiFive vendor vector
instructions should be covered by the flag, as well as all of
XRivosVizip, and ri.vextract from XRivosVisni.
For now this should be NFC because, coincidentally, these instructions
aren't handled in getOperandInfo, so RISCVVLOptimizer should currently
avoid touching them despite them being liberally handled in
getMinimumVLForUser.
However, in an upcoming patch we'll need to also bail in
getMinimumVLForUser, so this prepares for it.
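A rough sketch of the intended shape of the flag check; the name, bit
position, and helper below are hypothetical and not taken from the patch:

```cpp
#include <cstdint>

namespace sketch {
// Hypothetical TSFlags bit; the real flag is declared in TableGen and may
// use a different name and position.
enum : uint64_t { ReadsPastVLMask = 1ULL << 40 };

// With a flag, any instruction whose TableGen definition sets it is
// recognized here, instead of listing every vendor opcode in a switch.
inline bool readsPastVL(uint64_t TSFlags) {
  return (TSFlags & ReadsPastVLMask) != 0;
}
} // namespace sketch
```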
What most code wants to know is the direction, and we have to decode the
opcode to figure that out. Instead, pass the direction around as a bool
and convert it to an opcode when we create the merge instruction.
If we're moving the second copy before another instruction that reads
the copied register, we need to clear the kill flag on the combined
move.
Fixes #153598.
This patch adds CodeGen support for the qc.insbi and qc.insb instructions
defined in the Qualcomm uC Xqcibm extension. qc.insbi and qc.insb
insert bits into the destination register from an immediate and a register
operand, respectively.
A sequence of `xor`, `and` & `xor` is, under appropriate conditions,
converted to `qc.insbi` or `qc.insb`, depending on the immediate's value.
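For illustration, the classic xor/and/xor bit-merge this sequence
corresponds to at the source level (illustrative only; the actual
conditions are checked on the DAG):

```cpp
#include <cstdint>

// Insert the bits of Field selected by Mask into X: where Mask is 1 the
// result takes Field's bit, elsewhere it keeps X's bit. The xor, and, xor
// operations are the sequence described above.
uint32_t insertBits(uint32_t X, uint32_t Field, uint32_t Mask) {
  return X ^ ((X ^ Field) & Mask);
}
```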
These instructions are the shift by immediate and saturate by immediate
instructions from the top half of page 9 of
https://jhauser.us/RISCV/ext-P/RVP-instrEncodings-015.pdf
I've also improved the CHECK lines in the invalid tests to check the line
and column number of the diagnostic.
Co-authored-by: realqhc <caiqihan021@hotmail.com>
Follow-up PR to #153071, adding the remaining zvbb instructions
(VBREV8_V and VREV8_V), plus the zvbc instructions (VCLMUL_VV, VCLMUL_VX,
VCLMULH_VV, VCLMULH_VX).
Godbolt example: https://godbolt.org/z/ThdfP475a
In the example, a single-element vse is used to store the reduction result
instead of a scalar store ([this optimization was introduced by this
patch](https://reviews.llvm.org/D109482)). However, vmv.x.s can't be
eliminated here because it has other uses (e.g. CopyToReg), so it seems
more profitable to use a scalar store (we already have the store value in
a scalar register, and can save one vsetvli which is likely to be required
for the single-element vse). The proposed solution is to do this transform
only if vmv.x.s has a single use (in the store instruction).
The check should be about unsigned 16-bit immediates, not signed ones.
This is not a bug per se, as the old codegen was correct for the
uint16_max case; it just didn't end up using `qc.e.bgeui`, which we
would prefer it did.
This PR adds support for the following instructions to the RISC-V
VLOptimizer: vandn.vx, vandn.vv, vbrev.v, vclz.v, vcpop.v, vctz.v,
vror.vi, vror.vx, vror.vv, vrol.vx, vrol.vv.