This patch adds CodeGen support for the qc.insbi and qc.insb instructions
defined in the Qualcomm uC Xqcibm extension. qc.insbi and qc.insb
insert bits into the destination register from an immediate operand and a
register operand, respectively.
Under the appropriate conditions, a sequence of `xor`, `and`, and `xor`
is converted to either `qc.insbi` or `qc.insb`, with the choice depending
on the immediate's value.
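For reference, the idiom being matched is the classic masked-merge form of a bit insert. A minimal sketch (my own illustration, not the backend code; the helper name and the assumption that `width` < 32 are mine):

```cpp
#include <cstdint>

// Insert the low `width` bits of `val` into `dst` at bit position `shamt`
// using the xor/and/xor form that the patch recognizes.
uint32_t insertBits(uint32_t dst, uint32_t val, unsigned width, unsigned shamt) {
  uint32_t mask = ((1u << width) - 1u) << shamt;   // field being replaced (assumes width < 32)
  return dst ^ ((dst ^ (val << shamt)) & mask);    // same as (dst & ~mask) | ((val << shamt) & mask)
}
```

When `val` is a constant the result can become `qc.insbi`; otherwise `qc.insb` applies.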
- Only fold if the ADD can be folded into all uses.
- Don't reassociate an ADDI if the shl+add can be a shXadd or similar
instruction (see the sketch after this list).
- Only reassociate a single ADDI. If there are 2 ADDIs, it's the same
number of instructions as shl+add. If there are more than 2, it would
increase the instruction count compared to folding the ADDIs into the
loads/stores.
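For context on the shXadd bullet above, a rough sketch of the kind of address computation that already maps to a single Zba-style instruction (illustrative C++, not the selection code):

```cpp
#include <cstdint>

// With Zba, (idx << 2) + base is a single sh2add, so reassociating an ADDI
// into this expression would not save anything.
uint64_t sh2addLike(uint64_t idx, uint64_t base) {
  return (idx << 2) + base;
}
```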
Rename UnwrapShl->SelectShl. Make it only responsible for matching
a SHL by a constant.
Handle the fallback case of reg+reg with no scale outside of SelectShl.
Reorder the checks so the RHS is checked for a shift first. The base pointer
is most likely on the LHS, and it's very unlikely both operands are shifts.
This is preparation for adding better costing decisions to this code.
The transformation done in #147349 was incorrect since we were not
passing the input node of the `OR` instruction to the `QC.INSBI`
instruction, leading to the generated instruction doing the wrong thing.
To fix this, we first needed to make the output register of `QC.INSBI`
both an input and an output.
The code produced after the above fix will need a copy (mv) to preserve
the register input to the OR instruction if it has more than one use,
making the transformation net neutral (6-byte `QC.E.ORI/ORAI` vs.
2-byte `C.MV` + 4-byte `QC.INSBI`). Avoid doing the transformation if
the input register to the OR instruction has more than one use.
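A minimal sketch of the added guard (illustrative; the helper name is made up and this is not the exact patch code):

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"

// Bail out when the register feeding the OR has other users: preserving it
// would require a C.MV, and the QC.INSBI form would no longer be smaller.
static bool worthFoldingIntoQCInsbi(llvm::SDValue OrInput) {
  return OrInput.hasOneUse();
}
```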
A disjoint OR can be converted to XOR, and an XOR+NOT is XNOR. Idea
taken from #147279.
I changed the existing xnor pattern to have the not on the outside
instead of the inside. These are equivalent for xor since xor is
associative. Tablegen was already generating multiple variants
of the isel pattern using associativity.
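A quick illustration of the identities involved (my own sketch):

```cpp
#include <cassert>
#include <cstdint>

void checkIdentities(uint32_t a, uint32_t b) {
  if ((a & b) == 0)                  // "disjoint" OR: no bit set in both operands
    assert((a | b) == (a ^ b));      // so the OR can be rewritten as XOR
  assert(~(a ^ b) == (~a ^ b));      // not-outside and not-inside XNOR forms agree
}
```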
There are some issues here. The disjoint flag isn't preserved
through type legalization. I was hoping we could recover it
manually for the masked merge cases, but that doesn't work either.
All of these are guaranteed to be MemSDNodes. The only intrinsics that
aren't are vlm and vsm. We should add those to
RISCVTargetLowering::getTgtMemIntrinsic to fix that.
When the immediate to the ORI is a ShiftedMask_32 that does not fit in
12 bits, we can use the QC.INSBI instruction instead. We do not do this
for cases where the ORI can be replaced with a BSETI, since these can be
compressed when the Xqcibm extension (which QC.INSBI is a part of) is
enabled.
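A rough sketch of the check described above, assuming LLVM's MathExtras helpers (the function itself is hypothetical):

```cpp
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// Pick QC.INSBI when the ORI immediate is a shifted mask of more than one bit
// that doesn't fit in a signed 12-bit immediate; single-bit masks stay as
// BSETI since that form can be compressed when Xqcibm is enabled.
static bool shouldUseQCInsbi(uint32_t Imm) {
  return isShiftedMask_32(Imm) && !isInt<12>(static_cast<int32_t>(Imm)) &&
         !isPowerOf2_32(Imm);
}
```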
This moves the peephole that folds vmerges into their operands into
RISCVVectorPeephole. This will also allow us to eventually commute
instructions to allow folding; see #141885 and #70042.
Most of the test diffs are due to the slight change in instruction
ordering.
For now doPeepholeMaskedRVV is kept even though it's a duplicate of
RISCVVectorPeephole::convertToUnmasked, to minimize the diff. I plan on
removing it in a separate patch as it causes some instructions to be
shuffled around.
Similarly, this runs foldVMergeToMask before the other peepholes to
minimize the diff for now.
rvv-peephole-vmerge-vops-mir.ll was replaced with a dedicated
vmerge-peephole.mir test.
This code handled load/store addresses that are a SHL instruction. That
seems very unlikely to occur unless you're accessing an array that
starts at address 0. I'm not even sure if you can represent that in
LLVM IR.
This patch is similar to #142737
The XAndesPerf extension includes signed bitfield extraction
instruction `NDS.BFOS`, which can extract the bits from 0 to Len - 1,
place them starting at bit Lsb, zero-fill the bits from 0 to Lsb - 1,
and sign-extend the result.
When Lsb == Msb, it is a special case where the Lsb will be set to 0
instead of being equal to the Msb.
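As a rough model of this behaviour, a sketch based on the description above (XLEN = 32 assumed; written from this text, not from the Andes spec):

```cpp
#include <cstdint>

// Place bits Len-1..0 of rs1 at bit Lsb, zero the bits below Lsb, and
// sign-extend from the top bit of the placed field (assumes Lsb + Len <= 32).
int32_t bfosInsertLike(uint32_t rs1, unsigned Lsb, unsigned Len) {
  uint32_t field = rs1 & ((1u << Len) - 1u);
  uint32_t placed = field << Lsb;
  unsigned top = Lsb + Len;                       // bit just above the placed field
  return (int32_t)(placed << (32 - top)) >> (32 - top);
}
```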
The Xqcibm Bit Manipulation extension has the `qc.ext` instruction that
can extract a subset of bits from the source register to the destination
register.
Unlike the corresponding instructions in `XTHeadbb` and `XAndesPerf`
which extract the bits between `Msb` and `Lsb`, the `qc.ext` instruction
extracts `width` bits from an offset that is determined by the `shamt`.
The Xqcibm Bit Manipulation extension has the `qc.extu` instruction that
can extract a subset of bits from the source register to the destination
register.
Unlike the corresponding instructions in XTHeadbb and XAndesPerf which
extract the bits between `Msb` and `Lsb`, the `qc.extu` instruction
extracts `width` bits from an offset that is determined by the `shamt`.
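As a rough model of both instructions (my reading of the description, with XLEN = 32 and 0 < width < 32 assumed; not taken from the Xqcibm spec text):

```cpp
#include <cstdint>

// qc.extu-style extraction: take `width` bits of rs1 starting at bit `shamt`
// and zero-extend them.
uint32_t extuLike(uint32_t rs1, unsigned width, unsigned shamt) {
  return (rs1 >> shamt) & ((1u << width) - 1u);
}

// qc.ext-style extraction: the same field, sign-extended from its top bit.
int32_t extLike(uint32_t rs1, unsigned width, unsigned shamt) {
  uint32_t field = (rs1 >> shamt) & ((1u << width) - 1u);
  return (int32_t)(field << (32 - width)) >> (32 - width);
}
```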
The XAndesPerf extension includes unsigned bitfield extraction
instruction `NDS.BFOZ`, which can extract the bits from 0 to Len - 1,
place them starting at bit Msb, and zero-fill the remaining bits.
This patch handles the cases where Msb < Lsb for `NDS.BFOZ`.
Instruction Syntax:
  nds.bfoz Rd, Rs1, Msb, Lsb
The operation is:
  if Msb < Lsb:
    Lenm1 = Lsb - Msb;
    Rd[Lsb:Msb] = Rs1[Lenm1:0];
    if (Lsb < (XLen - 1)) Rd[XLen-1:Lsb+1] = 0;
    Rd[Msb-1:0] = 0;
When Len == 1, it is a special case where the Msb is set to 0 instead of
being equal to the Lsb.
The XAndesPerf extension includes signed bitfield extraction
instruction `NDS.BFOS`, which can extract the bits from LSB to MSB,
place them starting at bit 0, and sign-extend the result.
The testcase includes the two patterns that can be selected as
signed bitfield extracts: `ashr+shl` and `ashr+sext_inreg`.
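For illustration, the kind of source-level code that produces the `ashr+shl` form (the bit positions here are arbitrary and mine):

```cpp
#include <cstdint>

// Extract bits 11..4 of x as a signed field: shift the field's top bit up to
// bit 31, then arithmetic-shift back down, which sign-extends the result.
int32_t signedExtract(int32_t x) {
  return (int32_t)((uint32_t)x << 20) >> 24;
}
```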
The XAndesPerf extension includes unsigned bitfield extraction
instruction `NDS.BFOZ`, which can extract the bits from LSB to MSB,
place them starting at bit 0, and zero-extend the result.
The testcase includes the three patterns that can be selected as
unsigned bitfield extracts: `and`, `and+lshr` and `lshr+and`.
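For illustration, the three idioms look roughly like this (masks and shift amounts are arbitrary):

```cpp
#include <cstdint>

uint32_t justAnd(uint32_t x)  { return x & 0xfff; }          // and
uint32_t andLshr(uint32_t x)  { return (x & 0xff0) >> 4; }   // and+lshr
uint32_t lshrAnd(uint32_t x)  { return (x >> 4) & 0xff; }    // lshr+and
```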
This avoids a bunch of complexity around making sure the offset doesn't
exceed 4093, so we can add 4 after splitting later. By splitting early,
the split loads/stores will get selected independently.
There's a bit of follow-up work to do, particularly around splitting a
constant pool load. Overall I think this is cleaner, with fewer edge
cases.
Don't split i64 load/store when we have Zilsd.
In the future, we should enhance the LoadStoreOptimizer pass to do
this, but this is a good starting point. Even if we support it in
LoadStoreOptimizer, we might still want this for volatile loads/stores
to guarantee the use of Zilsd.
Use the QC_E_ADDI instruction when the constant is not a signed 12-bit
value but is a signed 26-bit one. Don't use it if a single LUI
instruction can be used instead.
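A sketch of the immediate classification this implies (hypothetical helper; the exact LUI condition in particular is my assumption):

```cpp
#include "llvm/Support/MathExtras.h"
using namespace llvm;

static bool shouldUseQcEAddi(int64_t Imm) {
  if (isInt<12>(Imm))                        // a plain ADDI already handles it
    return false;
  if ((Imm & 0xFFF) == 0 && isInt<32>(Imm))  // a single LUI can materialize it
    return false;
  return isInt<26>(Imm);                     // QC.E.ADDI takes a signed 26-bit immediate
}
```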
This commit moves RISC-V to auto-generate its target-specific SDNode
types. The biggest change is that SDNodes can now be validated against
their expected type profiles, and that we don't need to edit several
different files when declaring a new one.
This takes Sergei's work in #119709 and "finishes" it - by moving the
final five RISCVISD opcodes into tablegen (including defining their
types), and by ensuring the tablegen has expected closing scope
comments.
Co-authored-by: Sergei Barannikov <barannikov88@gmail.com>
According to the
[spec](https://github.com/XUANTIE-RV/thead-extension-spec/releases/tag/2.3.0),
the operation of `th.ext rd, rs1, msb, lsb` is
reg[rd] := sign_extend(reg[rs1][msb:lsb])
The spec doesn't specify the behavior when lsb is greater than msb.
I don't think lsb can be greater than msb, so if the shift-right
amount is greater than msb, we can set lsb equal to msb to extract the
bit rs1[msb] and sign-extend it.
This puts the base before the offset to match the order we use for base
ISA where the offset is an immediate.
I'm investigating using sub-operands for the base ISA loads and stores
too, so having a consistent operand order will allow more sharing.
We'd already had this transform for the intrinsics, but hadn't added it
for either fixed-length or scalable vectors coming from normal IR.
For the record, the fact that we have three different sets of patterns
here really is quite ugly.