These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
There's a long comment explaining this approach in RISCVInstrInfoXqci.td
This change also fixes some problems when fixups can be resolved for `qc.e.li` and `qc.li`.
Instead of allowing a parsed MCInst to have either a uimm10 or a simm10,
always render as simm10. This avoids a mismatch between parsed MCInst
and disassembled MCInst when a uimm10 value is used.
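For illustration, the two operand types only disagree on how the same
10-bit encoding gets printed; a minimal standalone sketch of the
arithmetic (the helper name is made up, this is not the MCInst printing
code):

```cpp
#include <cstdint>
#include <cstdio>

// Interpret the low 10 bits of an encoding as a signed value, mirroring
// an simm10 rendering.
static int64_t signExtend10(uint64_t Encoding) {
  uint64_t Low = Encoding & 0x3FF;             // keep the 10-bit field
  return (Low & 0x200) ? int64_t(Low) - 0x400  // sign bit set: negative
                       : int64_t(Low);
}

int main() {
  // 0x3FF is a valid uimm10 (1023) but renders as -1 when treated as
  // simm10; always choosing the signed rendering keeps the round trip
  // between parsing and disassembly stable.
  std::printf("%lld\n", static_cast<long long>(signExtend10(0x3FF))); // -1
  std::printf("%lld\n", static_cast<long long>(signExtend10(0x1FF))); // 511
  return 0;
}
```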
We have some 48-bit instructions in the `Xqci` spec that currently
cannot be compressed to their 32-bit variants due to the constraint in
`CompressInstEmitter` on destination instruction operands not being
allowed to mismatch with the DAG operands.
For example, the `QC_E_ADDI` instruction can be compressed to the `ADDI`
instruction when the immediate fits in a signed 12-bit range, but this is
currently not possible since the `QC_E_ADDI` instruction has `GPRNoX0`
register operands while the `ADDI` instruction has `GPR` register
operands, leading to an operand type validation error.
I think we can remove the check that only source instruction operands
can mismatch with the corresponding DAG operands and rely on the fact
that we check if the DAG register operand type is a subclass of the
instruction register operand type.
We don't have any tests that show why this AddedComplexity is needed.
ImmLeafs are automatically ranked higher than register operands so there
is no ambiguity with the base ISA here.
There have been discussions on splitting RISCVISelLowering.cpp. I think
the InterleavedAccess-related TLI hooks would be some of the low-hanging
fruit, as they're relatively isolated and X86 is already doing it.
NFC.
Previously we just assumed that no instruction that needed to be moved
would have an implicit def, but vnclip pseudos will.
We can still try to move them, but we need to check that no instructions
in between read or write the physical register.
Fixes #147986
These were removed in #147830 because they ignored that these
instructions operate on bytes. This patch adds them back with tests,
including a test for the byte boundary issue.
I separated out the commits to show the bad optimization we get if we
don't round Bits up to the nearest byte.
This reverts commit aee21c368b41cd5f7765a31b9dbe77f2bffadd4e.
As noted in
<https://github.com/llvm/llvm-project/pull/146855#issuecomment-3061784904>,
this causes compile errors for several RVV configurations:
fatal error: error in backend: SmallVector unable to grow. Requested capacity (4294967296) is larger than maximum value for size type (4294967295)
If there are multiple mask producers followed by multiple masked
consumers, a move (vmv* v0, vx) may be generated to save the mask.
By moving the mask's producer after the mask's use, the spill can be
eliminated and the move can be removed.
Add basic isel patterns for the multiply-accumulate QC.MULIADD
instruction.
While most cases work with just the TD file pattern, there are a few
cases which need to be handled in ISelLowering depending on the immediate
we are multiplying by:
- imm + 1, imm - 1, 1 - imm, -1 - imm is a power of 2 --> these become
  slli and add/sub
- immediate is 2^n - 2^m --> this becomes (add/sub (shl X, C1), (shl X, C2))
- imm - 2, imm - 4, imm - 6 is a power of 2 --> these use shxadd when
  zba is enabled
For the above conditions, the patch does not decompose the mul if Xqciac
is present. There could be cases where this is not beneficial, which I
plan to address in follow-up patches.
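As a rough sanity check of the arithmetic behind these decompositions
(the constants below are illustrative only, not taken from the patch):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  int64_t X = 12345;

  // imm + 1 is a power of 2 (imm = 7): x*7 == (x << 3) - x --> slli + sub.
  assert(X * 7 == (X << 3) - X);
  // imm - 1 is a power of 2 (imm = 9): x*9 == (x << 3) + x --> slli + add.
  assert(X * 9 == (X << 3) + X);
  // 1 - imm is a power of 2 (imm = -7): x*-7 == x - (x << 3).
  assert(X * -7 == X - (X << 3));
  // imm == 2^n - 2^m (imm = 24 = 32 - 8): x*24 == (x << 5) - (x << 3).
  assert(X * 24 == (X << 5) - (X << 3));
  // imm - 2 is a power of 2 (imm = 10 = 8 + 2): x*10 == (x << 3) + (x << 1),
  // i.e. an slli followed by a shift-and-add when Zba is available.
  assert(X * 10 == (X << 3) + (X << 1));
  return 0;
}
```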
Lower it just like the vector [l]lrint, using vfcvt, with the right
rounding mode. Updating costs to account for this custom-lowering is
left to a companion patch.
Currently we have slightly different costing for the vp and non-vp
version of the rounding intrinsics.
We can delete this code and use the generic BasicTTIImpl code for the vp
intrinsics which falls back to the non-vp versions.
I'm not sure if the zvfh costing is correct; this should probably be
fixed in a follow-up patch. At the moment the non-vp cost is more
important since it is what the loop vectorizer will use.
Move the costing to the generic implementation in BasicTTIImpl since it
just falls back to the non-vp costing.
Also pass through the OperandValueInfo if using value-based costing, but
I don't believe this affects the result for any in-tree target
currently.
There will be more schedule definitions for vendor extensions, and we
would need to add these `UnsupportedSchedXXX` definitions to existing
models every time we add new schedule definitions.
In practice, each vendor will barely implement other vendors'
extensions, so we can package these definitions into one.
Add an LLVMContext parameter to getOptimalMemOpType and
findOptimalMemOpLowering so that we can use EVT::getVectorVT to generate
an EVT in getOptimalMemOpType.
Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
These instructions operate on bytes, so we need to round the demanded
bits up to the nearest byte, which we aren't doing. I think we forgot to
update this when we changed from hasAllWUsers to hasNBitUsers.
We don't have any test case for these instructions, so remove them until
we can put together a test.
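For reference, the eventual fix is just the usual round-up-to-a-multiple
arithmetic; a minimal sketch under that assumption (the helper name is
hypothetical, not the in-tree code):

```cpp
#include <cassert>

// Round a demanded-bits count up to the next multiple of 8, since these
// instructions consume whole bytes at a time.
static unsigned roundBitsUpToByte(unsigned Bits) {
  return (Bits + 7) & ~7u;
}

int main() {
  assert(roundBitsUpToByte(1) == 8);
  assert(roundBitsUpToByte(8) == 8);
  assert(roundBitsUpToByte(9) == 16);   // crosses a byte boundary
  assert(roundBitsUpToByte(17) == 24);
  return 0;
}
```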
As noted in post commit review, the API change here was not required.
I'd apparently confused myself when teasing apart patches from my
development branch.
For the fixed vector cases, we already support this, but the
deinterleave intrinsic cases (primarily used by scalable vectors) didn't.
Supporting it requires plumbing through the Factor separately from the
extracts, as there can now be fewer extracts than the Factor. Note that
the fixed vector path handles this slightly differently - it uses the
shuffle and indices scheme to achieve the same thing.
XSfvqmaccdod/qoq and XSfvfwmaccqqq are SiFive's small-size matrix
multiplication extensions. This patch adds scheduling info for their
instructions along with six new SchedReadWrites.
This patch adds scheduling data for the XSfvfnrclipxfqf instructions,
which narrow/clip FP32 data to INT8 according to the value range
specified by a scalar register. Three new SchedReadWrites are
introduced.
Extend lowerVectorXRINT to also do a FP_EXTEND_VL when the source
element type is [b]f16, and wire up this custom-promote. Updating the
cost-model to not give these an invalid cost is left to a companion
patch.
A disjoint OR can be converted to an XOR, and an XOR+NOT is an XNOR.
Idea taken from #147279.
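For concreteness, the identity this relies on (the disjoint flag asserts
that the operands share no set bits):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // Disjoint operands: no bit is set in both a and b.
  uint32_t a = 0x00F0, b = 0x0F00;
  assert((a & b) == 0);

  // For disjoint operands, OR and XOR produce the same result...
  assert((a | b) == (a ^ b));

  // ...so not(or disjoint a, b) == not(xor a, b), i.e. an xnor.
  assert(~(a | b) == ~(a ^ b));
  return 0;
}
```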
I changed the existing xnor pattern to have the not on the outside
instead of the inside. These are equivalent for xor since xor is
associative. Tablegen was already generating multiple variants
of the isel pattern using associativity.
There are some issues here. The disjoint flag isn't preserved
through type legalization. I was hoping we could recover it
manually for the masked merge cases, but that doesn't work either.
All of these are guaranteed to be MemSDNodes; the only intrinsics that
aren't are vlm and vsm. We should add those to
RISCVTargetLowering::getTgtMemIntrinsic to fix that.
When vectorizing a loop with a fixed-order recurrence we use a splice,
which gets lowered to a vslidedown and vslideup pair.
However, with the way we lower it today, we end up with extra vl toggles
in the loop, especially with EVL tail folding, e.g.:
.LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
sub a5, a2, a3
sh2add a6, a3, a1
zext.w a7, a4
vsetvli a4, a5, e8, mf2, ta, ma
vle32.v v10, (a6)
addi a7, a7, -1
vsetivli zero, 1, e32, m2, ta, ma
vslidedown.vx v8, v8, a7
sh2add a6, a3, a0
vsetvli zero, a5, e32, m2, ta, ma
vslideup.vi v8, v10, 1
vadd.vv v8, v10, v8
add a3, a3, a4
vse32.v v8, (a6)
vmv2r.v v8, v10
bne a3, a2, .LBB0_5
Because the vslideup overwrites all but UpOffset elements from the
vslidedown, we currently set the vslidedown's AVL to said offset.
But in the vslideup we use either VLMAX or the EVL, which causes a
toggle.
This patch increases the AVL of the vslidedown so it matches the
vslideup, even though the extra elements are overwritten, to avoid the
toggle.
A new tuning feature +vl-dependent-latency has been added which keeps
the old behaviour for microarchitectures that dynamically dispatch uops
based on vl, e.g. sifive-x280.
+vl-dependent-latency can be reused for the recently proposed Ovlt
optimization directive if/when it's ratified:
https://lists.riscv.org/g/tech-privileged/message/2487
If we wanted to aggressively optimise for vl at the expense of
introducing more toggles we could probably look at doing this in
RISCVVLOptimizer.