The RISC-V vector crypto extensions have been ratified. This patch
updates the Clang and LLVM support for these extensions to be
non-experimental, while leaving the C intrinsics as experimental since
the C intrinsics are not yet standardized.
Co-authored-by: Brandon Wu <brandon.wu@sifive.com>
The motivation of this change is simply to reduce test duplication. As
can be seen in the (massive) test delta, we have many tests whose output
differ only due to the use of addi on rv32 vs addiw on rv64 when the
high bits are don't care.
As an aside, we don't need to worry about the non-zero immediate
restriction on the compressed variants because we're not directly
forming the compressed variants. If we happen to get a zero immediate
for the ADDI, then either a later optimization will strip the useless
instruction or the encoder is responsible for not compressing the
instruction.
RegAllocGreedy uses SlotIndexes::getApproxInstrDistance to approximate
the length of a live range for its heuristics. Renumbering all slot
indexes with the default instruction distance ensures that this estimate
will be as accurate as possible, and will not depend on the history of
how instructions have been added to and removed from SlotIndexes's maps.
This also means that enabling -early-live-intervals, which runs the
SlotIndexes analysis earlier, will not cause large amounts of churn due
to different register allocator decisions.
We can use a 32-bit splat and bitcast to i64 vector.
This only handles the case where we are using vlmax so that the new
vl is cheap to compute. This could be generalized to double the VL.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158879
In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same block case, not the cross block case. This lead to the performance regression reported in https://github.com/llvm/llvm-project/issues/64282.
This change is a slightly ugly hack to side step the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers.
We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility.
Differential Revision: https://reviews.llvm.org/D156909
We only attempted to determine KnownBits for uniform constant shift amounts, but ComputeKnownBits is able to handle some non-uniform cases as well that we can use as a fallback.
This patch adds pseudos and SDNode patterns for vbrev.v, vrev8.v, vclz.v,
vctz.v and vcpop.v.
I've only added them for integer element types so far since we're lacking tests
for floats.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D155216
This change continues with the line of work discussed in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295.
This change targets all the pseudos used in loads (unit, strided, segmented, fault first, and their combinations). As with previous changes in the series, we replace the existing TA and TU forms with a single unified pseudo with a passthru (which may be implicit_def) and a policy operand.
One quirk is that I went ahead and treated the unmasked mask load instruction (vlm) the same way. We need the pass thru operand to model tail undefined, but since the instruction is unconditionally agnostic and the instruction has no mask, the policy operand is arguably unneeded. I kept it mostly for consistency sake.
Another quirk worth highlighting is that segment loads require a bit of dedicated handling. Surprisingly, we don't have IMPLICIT_DEF nodes of the right types, and attempting to use them results in some odd looking codegen and a few crashes. Instead, I left the REG_SEQUENCE form, and extended InsertVSETVLI to recognize the complex undefs. Arguably, we should probably revisit the handling of undef reg_sequence nodes here, but I'm hoping to side step that in this patch.
As before, we see codegen changes (some improvements and some regressions) due to scheduling differences caused by the extra implicit_def instructions. I did have to delete one register allocation regression test as I couldn't figure out how to meaningfully update it. I spent a significant amount of time trying, and finally gave up.
Differential Revision: https://reviews.llvm.org/D154141
There is an issue: https://github.com/llvm/llvm-project/issues/63515
The issue is because when expanding SPLAT_VECTOR_SPLIT_I64_VL node, only memoperand is used to create dependency.
However in ScheduleDAGNodes, dependency is checked with chain only, and breaks order of store/load instructions.
I think in llvm.bitreverse.nxv2i64 intrinsic SPLAT_VECTOR_SPLIT_I64_VL nodes are parallel processed,
so no chain should be add to these nodes.
Using temporary in expanding SPLAT_VECTOR_SPLIT_I64_VL node can keep vlse instruction get correct value
no matter order of store instructions is changed.
Differential Revision: https://reviews.llvm.org/D153743
This continues towards the goal spelled out in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295. This patch switches all the binary operations (no widen, no narrow, but both int and FP) to use the _TU + implicit_def passthrough form. Change is mechanical.
This only changes the unmasked variants. Masked variants will still go through doPeepholeMaskedRVV and end up in the unsuffixed/TA form. Fixing that will be a separate change.
Differential Revision: https://reviews.llvm.org/D152940
Where C is a simm32.
This costs an extra temporary register, but avoids a constant pool.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D152236
The MC layer instructions have the correct register classes, and
the pseudos don't have any additional operands. So there doesn't
seem to be any reason for them to exist.
The pseudos were incorrectly going through code in RISCVMCInstLower
that converted LMUL>1 register classes to LMUL1 register class.
This makes the MCInst technically malformed, and prevented the
vl2r.v, vl4r.v, and vl8r.v InstAliases from matching. This accounts
for all of the .ll test diffs.
Differential Revision: https://reviews.llvm.org/D139511
Cannonical frame address after RVV stack adjustment is sp + StackSize +
RVVStackSize * vlenb, and since vlenb is unknown at compile-time (but it
is a constant for particular HW implementation), emit
.cfi_def_cfa_expression so libunwind can read VLENB CSR register at
run-time and obtain correct frame address.
Fixes https://github.com/llvm/llvm-project/issues/58356 (but additional
run-time support for reading CSR may be required)
Differential Revision: https://reviews.llvm.org/D136263
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions
whenever found by scanning the MF once while keeping track of register
definitions in a map. These instructions are typically immediate loads
resulting from rematerialization, and address loads emitted by target in
eliminateFrameInde().
This is enabled by default, but a target could easily disable it by means of
'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it
is done by looking at phys-regs, but still quite effective. It would be
desirable to improve other parts of CodeGen and avoid these redundant
instructions in the first place, but there are no ideas for this yet.
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb
We can reuse constants if we use SRL followed by AND and AND followed by SHL.
Similar was done to bitreverse previously.
Differential Revision: https://reviews.llvm.org/D138045
This changes the default value used for mask policy from mask undisturbed to mask agnostic. In hardware, there may be a minor preference for ta/ma, but since this is only going to apply to instructions which don't use the mask policy bit, this is functionally mostly a nop. The main value is to make future changes to using MA when legal for masked instructions easier to review by reducing test churn.
The prior code was motivated by a desire to minimize state transitions between masked and unmasked code. This patch achieves the same effect using the demanded field logic (landed in afb45ff), and there are no regressions I spotted in the test diffs. (Given the size, I have only been able to skim.) I do want to call out that regressions are possible here; the demanded analysis only works on a block local scope right now, so e.g. a tight loop mixing masked and unmasked computation might see an extra vsetvli or two.
Differential Revision: https://reviews.llvm.org/D133803
RVV doesn't have immediate field for memory addressing. Currently
we build MachineInstructions in PEI to computing stack offset for
RVV load store instructions. These instructions were added too late to
can be optimized by CSE, LICM... passes.
This patch makes FrameIndex SDNodes can't be matched in RVV Load Store
instruction selection patterns. So that the FrameIndex SDNodes would be
selected as `ADDI GPR, targetframeindex`.
There are 2 advantages for such change:
1. Stack objects address computing can be optimized by machine function
passes.
2. Since the ADDI instruction's destination register can be used as a
temp register, we can save an emergency spill slot.
Differential Revision: https://reviews.llvm.org/D128187
For large integers (for example, magic numbers generated by
TargetLowering::BuildSDIV when dividing by constant), we may
need about 4~8 instructions to build them.
In the same time, it just takes two instructions to load
constants (with extra cycles to access memory), so it may be
profitable to put these integers into constant pool.
Reviewed By: asb, craig.topper
Differential Revision: https://reviews.llvm.org/D114950
Add an alias of `addi [x], zero, imm` to generate pseudo
instruction li, which makes assembly mush more readable.
For existed tests, users can update them by running script
`llvm/utils/update_llc_test_checks.py`.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D112692