If VLEN is exactly known, we may be able to use the vsetivli encoding
instead of the vsetvli a0, zero, <vtype> encoding. This slightly reduces
register pressure.
This builds on 632f1c5, but reverses course a bit. It turns out to be
quite complicated to canonicalize from VLMAX to immediate early because
the sentinel value is widely used in tablegen patterns without knowledge
of LMUL. Instead, we canonicalize towards the VLMAX representation, and
then pick the immediate form during insertion since we have the LMUL
information there.
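As a rough illustration of the check involved (a sketch only, not the code in this patch): with an exactly known VLEN, VLMAX is VLEN * LMUL / SEW, and the immediate form is usable only if that value fits the 5-bit unsigned AVL field of vsetivli. The helper name and the fraction-based LMUL parameters below are hypothetical.

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper, not the LLVM API. LMUL is passed as the fraction
// lmulNum/lmulDen (e.g. 1/2 for mf2, 4/1 for m4).
// Returns VLMAX if it can be encoded as a vsetivli immediate.
std::optional<unsigned> vlmaxAsVsetivliImm(unsigned vlenBits, unsigned sewBits,
                                           unsigned lmulNum, unsigned lmulDen) {
  assert(vlenBits != 0 && sewBits != 0 && lmulNum != 0 && lmulDen != 0 &&
         "invalid configuration");
  // VLMAX = VLEN * LMUL / SEW, kept as a fraction to avoid truncation.
  unsigned numer = vlenBits * lmulNum;
  unsigned denom = sewBits * lmulDen;
  // A non-integral result means this SEW/LMUL pair holds no full element.
  if (numer % denom != 0)
    return std::nullopt;
  unsigned vlmax = numer / denom;
  // vsetivli carries the AVL in a 5-bit unsigned immediate (0..31).
  if (vlmax > 31)
    return std::nullopt;
  return vlmax;
}
```

For example, VLEN=128 with e32/m1 gives VLMAX=4, which fits the immediate, while e8/m8 gives VLMAX=128, which does not and would keep the register form.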
Within InsertVSETVLI, this could reasonably fit in a couple of places. If
reviewers want me to e.g. move it to emission, let me know. Doing so may
require a bit of extra code to e.g. handle comparisons of the two forms,
but shouldn't be too complicated.
When doing our backwards walk, we were not handling the case where the
AVL was defined by a register whose definition was an ADDI xN, x0,
<imm>. Doing so (as we already do in the forward pass) allows us to
prune a few more transitions.
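A minimal sketch of the idea, using toy types rather than MachineInstr (all names here are hypothetical): a register AVL whose definition is ADDI xN, x0, <imm> can be treated as the constant <imm>, so two such AVLs compare equal when their immediates match.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy stand-in for a machine instruction; not LLVM's MachineInstr.
enum class Opcode { ADDI, LW, Other };

struct ToyInstr {
  Opcode op;
  unsigned rs1; // source register number; 0 means x0
  int64_t imm;
};

// If `def` is ADDI xN, x0, imm, return imm; otherwise the AVL is opaque.
std::optional<int64_t> constantAVL(const ToyInstr &def) {
  if (def.op != Opcode::ADDI || def.rs1 != 0)
    return std::nullopt;
  assert(def.imm >= -2048 && def.imm <= 2047 &&
         "ADDI takes a 12-bit signed immediate");
  return def.imm;
}

// Two register AVLs are known equal if both definitions are the same constant.
bool sameConstantAVL(const ToyInstr &a, const ToyInstr &b) {
  std::optional<int64_t> ca = constantAVL(a);
  std::optional<int64_t> cb = constantAVL(b);
  return ca.has_value() && cb.has_value() && *ca == *cb;
}
```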
This refactors the logic in transferBefore so that we're moving in the
direction of "keep the existing Info, only change what is needed".
For the sake of review there are two commits in this PR: the former is
needed to make the latter an NFC commit. Neither introduces any test
diffs, but the former is not technically NFC, which is why I did not
precommit it.
- [RISCV] Preserve AVL when previous info is ratio only in
transferBefore
- [RISCV] Don't change AVL if only zeroness is demanded. NFC
Previously we bailed if we encountered a pseudo without a VL op, i.e.
vmv.x.s, which prevented us from preserving VL and VTYPE. It looks like
this was copied over from a time when this code was operating on the
MachineInstrs in place; see https://reviews.llvm.org/D127870.
However, because we no longer mutate the MIs, we can just get rid of
this early exit, which allows us to preserve VL and VTYPE when dealing
with vmv.x.s.
transferBefore currently takes an incoming state and an instruction,
computes the new state needed for the instruction, and then modifies
that new state to be more similar to the incoming state.
This patch reverses the approach by instead taking the incoming state
and modifying only the bits that are demanded by the instruction.
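To make the "keep the incoming state, change only what is demanded" direction concrete, here is a simplified sketch. The struct layout, field set, and function name are assumptions for illustration; the real VSETVLIInfo and DemandedFields track more than this (e.g. the SEW/LMUL ratio).

```cpp
#include <cassert>

// Toy models, not LLVM's VSETVLIInfo / DemandedFields.
struct ToyState {
  unsigned avl;
  unsigned sew;
  unsigned lmulX8; // LMUL in eighths: 1 = mf8, 8 = m1, 64 = m8
  bool tailAgnostic;
  bool maskAgnostic;
};

struct ToyDemanded {
  bool vl;
  bool sew;
  bool lmul;
  bool tailPolicy;
  bool maskPolicy;
};

// Start from the incoming state and overwrite only the fields that the
// instruction actually demands; everything else is carried through.
ToyState transferDemanded(ToyState incoming, const ToyState &required,
                          const ToyDemanded &demanded) {
  assert(required.sew != 0 && required.lmulX8 != 0 && "invalid required state");
  if (demanded.vl)
    incoming.avl = required.avl;
  if (demanded.sew)
    incoming.sew = required.sew;
  if (demanded.lmul)
    incoming.lmulX8 = required.lmulX8;
  if (demanded.tailPolicy)
    incoming.tailAgnostic = required.tailAgnostic;
  if (demanded.maskPolicy)
    incoming.maskAgnostic = required.maskAgnostic;
  return incoming;
}
```

With this shape, an instruction that demands nothing leaves the incoming state untouched, which is the property the refactoring is after.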
There is an optimisation in transferBefore where, if a VSETVLIInfo uses
the AVL of a defining vsetvli, it uses that vsetvli's AVL provided VLMAX
is the same.
This patch moves it out of transferBefore and up into
computeInfoForInstr to show how it isn't affected by the other
optimisations in transferBefore, and to simplify the control flow by
removing an early return.
This should make #72352 easier to reason about.
The property we're explicitly looking for is whether or not MI only cares about
VL zeroness and not VL itself, so we can just use DemandedFields for this. This
should simplify an upcoming change in #72352.
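For readers unfamiliar with "VL zeroness", a small sketch of the notion, with hypothetical toy types: an instruction that only cares whether VL is zero cannot distinguish two AVLs that are both known non-zero (or both known zero), even if their exact values differ.

```cpp
#include <cassert>
#include <optional>

// Toy AVL model, not LLVM's: either the VLMAX sentinel, a known immediate,
// or an opaque register value (no immediate and not VLMAX).
struct ToyAVL {
  bool isVLMAX;
  std::optional<unsigned> imm;
};

ToyAVL vlmaxAVL() { return ToyAVL{true, std::nullopt}; }
ToyAVL immAVL(unsigned value) { return ToyAVL{false, value}; }
ToyAVL regAVL() { return ToyAVL{false, std::nullopt}; }

// VLMAX is at least 1, and a non-zero immediate AVL yields a non-zero VL.
bool knownNonZero(const ToyAVL &a) {
  return a.isVLMAX || (a.imm.has_value() && *a.imm > 0);
}

bool knownZero(const ToyAVL &a) {
  return !a.isVLMAX && a.imm.has_value() && *a.imm == 0;
}

// True if an instruction demanding only VL zeroness sees no difference.
bool sameZeroness(const ToyAVL &a, const ToyAVL &b) {
  return (knownNonZero(a) && knownNonZero(b)) || (knownZero(a) && knownZero(b));
}
```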
Extend our PRE logic to cover non-immediate AVL values. This covers
large constant AVLs (which must be materialized in registers), and may
help some code written explicitly with intrinsics.
Looking at the existing code, I can't entirely figure out why I thought
we needed VL == AVL to perform the PRE. My best guess is that I was
worried about the VLMAX < AVL < 2 * VLMAX case, but the spec explicitly
says that vsetvli must be deterministic on any particular AVL value.
That case was, possibly by accident, covering another legality
precondition. Specifically, by only returning true for immediate and
VLMAX AVL values, we didn't encounter the case where the AVL was a
register and that register wasn't available in the predecessor (e.g. if
AVL is a load in the MBB block itself).
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
For instructions like vmv.s.x and friends where we don't care about LMUL
or the SEW/LMUL ratio, we can change the LMUL in its state so that it
has the same SEW/LMUL ratio as the previous state. This allows us to
avoid more VL toggles later down the line (i.e. use vsetvli zero, zero,
which requires that the SEW/LMUL ratio must be the same).
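The arithmetic behind picking such an LMUL can be sketched as follows. This is an illustrative helper with a made-up name and an LMUL-in-eighths encoding chosen for the example, not the code in the patch: given the previous SEW and LMUL and the new SEW, solve for the LMUL that keeps SEW/LMUL unchanged, and give up if it falls outside mf8..m8.

```cpp
#include <cassert>
#include <optional>

// LMUL is stored in eighths: 1 = mf8, 2 = mf4, 4 = mf2, 8 = m1, ..., 64 = m8.
// Hypothetical helper: returns the LMUL (in eighths) that gives `newSew` the
// same SEW/LMUL ratio as the previous state, if such an LMUL exists.
std::optional<unsigned> lmulX8ForSameRatio(unsigned prevSew,
                                           unsigned prevLmulX8,
                                           unsigned newSew) {
  assert(prevSew != 0 && prevLmulX8 != 0 && newSew != 0 && "invalid input");
  // prevSew / prevLmul == newSew / newLmul
  //   => newLmulX8 = newSew * prevLmulX8 / prevSew
  unsigned numer = newSew * prevLmulX8;
  if (numer % prevSew != 0)
    return std::nullopt; // would need an LMUL smaller than mf8
  unsigned newLmulX8 = numer / prevSew;
  if (newLmulX8 > 64)
    return std::nullopt; // would need an LMUL larger than m8
  return newLmulX8;
}
```

For instance, after an e32/m1 state, a vmv.s.x that wants e8 can use mf4 and keep the ratio at 32, whereas after e64/mf2 there is no legal LMUL for e8 with the same ratio.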
This is an alternative approach to the idea in #69259, but note that
they don't catch exactly the same test cases.
This way the compiler can tell us about missing cases if we add a new value
to this enum. Amusingly, the first time I landed this, I had indeed forgotten
a switch case, and the build bots were quite happy to remind me of such.
A bit debatable since we could extract it from the MachineFunction (and
thus the MachineInstr), but we have the same pattern for MachineFunction
associated structure already for TII and MRI.
This reverts commit 20fc8e8df20e165d1c632bc80a0cebce2dc158f7. As pointed out in review of the mentioned follow up patch, this gets the predicate wrong. We need not simply VL being unchanged, but VLMAX being unchanged. Given that, the code structure I'd introduced here is simply confusing.
When AMOs are used to implement parallel reduction operations, typically the return value would be discarded.
This patch adds a peephole pass `RISCVDeadRegisterDefinitions`. It rewrites `rd` to `x0` when `rd` is marked as dead.
It may improve the register allocation and reduce pipeline hazards on CPUs without register renaming and OOO.
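The core rewrite can be pictured with a toy model (hypothetical types and names; the actual pass works on MachineInstr operands and has additional legality checks not shown here):

```cpp
#include <cassert>

// Toy instruction with a single destination register; not LLVM's types.
struct ToyInst {
  unsigned rd;   // destination register number; 0 means x0
  bool rdIsDead; // true if the result is never read
};

// If the destination is marked dead, redirect the write to x0 so that no
// allocatable register is clobbered; otherwise leave the instruction alone.
ToyInst rewriteDeadDef(ToyInst inst) {
  assert(inst.rd < 32 && "RISC-V has 32 integer registers");
  if (inst.rdIsDead)
    inst.rd = 0;
  return inst;
}
```

Applied to something like an amoadd.w whose result is unused, this turns the destination into x0 while leaving the memory side effect intact.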
Comparison with GCC: https://godbolt.org/z/bKaxnEcec
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158759
In D158086, we prevented floating point scalar moves and splats from
fusing with a vsetvli that has a different SEW. This patch relaxes that
constraint as far as possible by introducing a new SEW demand type,
SEWGreaterThanOrEqualAndLessThan64, which allows fusing with a larger
SEW but not with SEW=64.
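A sketch of how such a demand level might be checked. Only SEWGreaterThanOrEqualAndLessThan64 is named in this patch; the other enumerator names and the function below are assumptions made for the example.

```cpp
#include <cassert>

// Toy version of the SEW demand levels, ordered from weakest to strongest.
enum class SEWDemand {
  None,
  GreaterThanOrEqualAndLessThan64,
  GreaterThanOrEqual,
  Equal
};

// Can an instruction that wanted `curSew` run under a vtype with `newSew`?
// Note the argument order matters: the relation is not symmetric.
bool sewCompatible(SEWDemand demand, unsigned curSew, unsigned newSew) {
  assert(curSew != 0 && newSew != 0 && "SEW must be non-zero");
  switch (demand) {
  case SEWDemand::None:
    return true;
  case SEWDemand::GreaterThanOrEqualAndLessThan64:
    // A larger SEW is fine, but never SEW=64 (e.g. FP under zve64f).
    return newSew >= curSew && newSew < 64;
  case SEWDemand::GreaterThanOrEqual:
    return newSew >= curSew;
  case SEWDemand::Equal:
    return newSew == curSew;
  }
  return false; // not reached for valid enumerators
}
```

Under this sketch, an e16 floating point scalar move may share an e32 vsetvli but not an e64 one.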
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D158177
In practice, this field is only used as a return value from
computeVLVTYPEChanges. Add a reference parameter to
computeVLVTYPEChanges to return its info.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158902
This updates the backwards mutation code to handle the case where the previous vset was in vl-preserving (x0, x0) form, but that VL was never used before the next vset which changes the VL. Since this requires writing both VL operands, eliminate the restriction on removing GPR producing vsetv as well. (The register will now be written by the earlier vsetv.)
Differential Revision: https://reviews.llvm.org/D158019
Scalar move and splat instructions only demand that SEW is greater than or
equal to their own needs, but a floating point vector instruction with SEW=64
is not always valid even when SEW=64 itself is valid, because of a special
configuration: zve64f.
So we need to check that a floating point vector instruction with SEW=64 is
valid when computing the demands of floating point scalar move and splat
instructions.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158086
vmv.x.s and vmv.f.s are unconditional. They read the low element of a vector
register (not vector group), and function even when VL=0 or VSTART>0. As such,
they are don't care with respect to both VL and LMUL.
We'd previously had handling in the forward pass only via the NoRegister
mechanism. (The only instructions with SEW but without VL are these extracts.)
This patch moves that handling into getDemanded so that the backwards pass
benefits as well.
Differential Revision: https://reviews.llvm.org/D157991
We were defaulting to VL=0 when we didn't otherwise have a vsetv
nearby. Instead, let's use VL=1. VL=0 is very much a corner case
in hardware, so let's avoid it if we can.
Differential Revision: https://reviews.llvm.org/D158015
In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same block case, not the cross block case. This led to the performance regression reported in https://github.com/llvm/llvm-project/issues/64282.
This change is a slightly ugly hack to side step the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers.
We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility.
Differential Revision: https://reviews.llvm.org/D156909
This change continues with the line of work discussed in https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295.
This change targets all the pseudos used in loads (unit, strided, segmented, fault first, and their combinations). As with previous changes in the series, we replace the existing TA and TU forms with a single unified pseudo with a passthru (which may be implicit_def) and a policy operand.
One quirk is that I went ahead and treated the unmasked mask load instruction (vlm) the same way. We need the passthru operand to model tail undefined, but since the instruction is unconditionally agnostic and the instruction has no mask, the policy operand is arguably unneeded. I kept it mostly for consistency's sake.
Another quirk worth highlighting is that segment loads require a bit of dedicated handling. Surprisingly, we don't have IMPLICIT_DEF nodes of the right types, and attempting to use them results in some odd looking codegen and a few crashes. Instead, I left the REG_SEQUENCE form, and extended InsertVSETVLI to recognize the complex undefs. Arguably, we should probably revisit the handling of undef reg_sequence nodes here, but I'm hoping to side step that in this patch.
As before, we see codegen changes (some improvements and some regressions) due to scheduling differences caused by the extra implicit_def instructions. I did have to delete one register allocation regression test as I couldn't figure out how to meaningfully update it. I spent a significant amount of time trying, and finally gave up.
Differential Revision: https://reviews.llvm.org/D154141
A vmv.v.i/x splats the immediate to all active lanes. For the active lanes, this is the same as vmv.s.x which inserts one scalar into the low lane. If we can ignore all the inactive lanes (because they are known undefined), then the two are semantically equivalent. We already reason about compatible VL/VTYPE combinations for vmv.s.x, so apply the same logic to vmv.v.i.
Unlike a vmv.s.x, we do need to be careful not to increase LMUL. A splat instruction is probably linear in LMUL, so restrict this to LMUL1.
Differential Revision: https://reviews.llvm.org/D152845
We already have several places in this code which reason about whether the inactive lanes are defined, and are about to add one more in D151653. Let's go ahead and common the code so that we don't have the same concept repeating in multiple places.
Differential Revision: https://reviews.llvm.org/D152844
If a vmv.s.x pseudo has an undef passthru operand, then we're free to use
whatever tail policy we want for VL > 1. We previously relaxed the tail
policy for this but only when we could also expand the SEW.
This patch changes it to relax the tail policy even if the SEW can't be
expanded and removes a few more toggles, as well as fully moving the
vmv.s.x logic into getDemanded.
vmv.s.x/vfmv.s.f instructions that only write to the first destination
element can use any SEW greater than or equal to its original SEW,
provided that it's writing to an implicit_def operand where we can
clobber the other lanes.
We were already handling this in needVSETVLI, which meant that when
scanning the instructions from top to bottom we could detect this and
avoid the toggle:
vsetivli zero, 4, e64, mf2, ta, ma
li a0, 11
vsetivli zero, 1, e8, mf8, ta, ma
vmv.s.x v0, a0
->
vsetivli zero, 4, e64, mf2, ta, ma
li a0, 11
vmv.s.x v0, a0
The issue that this patch aims to solve arises when the vmv.s.x is
the first vector instruction in the block and doesn't have any prior
predecessor info:
entry_bb:
li a0, 11
; No previous state here: forced to set VL/VTYPE
vsetivli zero, 1, e8, mf8, ta, ma
vmv.s.x v0, a0
vsetivli zero, 4, e16, mf2, ta, ma
vmerge.vvm v8, v9, v8, v0
doLocalPostpass can work backwards from bottom to top and work out if
an earlier vsetvli can be mutated to avoid a toggle. It uses
DemandedFields and getDemanded for this, which previously didn't take
into account the possibility of going to a larger SEW.
A previous patch consolidated the vmv.s.x logic from needVSETVLI
into getDemanded, and this patch removes the gate around it so that
doLocalPostpass can now delete vsetvlis like in the scenario below:
entry_bb:
li a0, 11
; Previous vsetivli mutated: second one deleted
vsetivli zero, 4, e16, mf2, ta, ma
vmv.s.x v0, a0
vmerge.vvm v8, v9, v8, v0
Differential Revision: https://reviews.llvm.org/D151561
This patch moves the logic that checks whether vmv.s.x's SEW can be
expanded into getDemanded, so that it can be shared by both the
top-to-bottom and bottom-to-top passes.
It adds a third option for SEW in DemandedFields, weaker than demanded
but stronger than not demanded, which states that the new SEW must be
greater than or equal to the current SEW.
Note that we now need to take care of the order of operands in
areCompatibleVTYPEs as the relation is no longer commutative.
A later patch will remove the gating on the bottom-to-top pass
(doLocalPostpass) and another one will relax the demands on the tail
policy further.
vmv.s.x and friends that only write to the first destination element can
use any SEW greater than or equal to its original SEW, provided that
it's writing to an implicit_def operand where we can clobber the other
lanes.
We were already handling this in needVSETVLI, which meant that when
scanning the instructions from top to bottom we could detect this and
avoid the toggle:
```
vsetivli zero, 4, e64, mf2, ta, ma
li a0, 11
vsetivli zero, 1, e8, mf8, ta, ma
vmv.s.x v0, a0
->
vsetivli zero, 4, e64, mf2, ta, ma
li a0, 11
vmv.s.x v0, a0
```
The issue that this patch aims to solve arises when vmv.s.x is the
first vector instruction in the block and doesn't have any prior
predecessor info:
```
entry_bb:
li a0, 11
; No previous state here: forced to set VL/VTYPE
vsetivli zero, 1, e8, mf8, ta, ma
vmv.s.x v0, a0
vsetivli zero, 4, e16, mf2, ta, ma
vmerge.vvm v8, v9, v8, v0
```
doLocalPostpass can work backwards from bottom to top and work out if
an earlier vsetvli can be mutated to avoid a toggle. It uses
DemandedFields and getDemanded for this, which previously didn't take
into account the possibility of going to a larger SEW.
This patch adds a third option for SEW in DemandedFields, weaker than
demanded but stronger than not demanded, which states that the new SEW
must be greater than or equal to the current SEW.
We can then use this option to move that vmv.s.x specific logic from
needVSETVLI into getDemanded, making it available for both phase 2 and
3, i.e. we can now mutate the earlier vsetivli going from bottom to top:
```
entry_bb:
li a0, 11
; Previous vsetivli mutated: second one deleted
vsetivli zero, 4, e16, mf2, ta, ma
vmv.s.x v0, a0
vmerge.vvm v8, v9, v8, v0
```
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D151561
The immediate field on the vsetivli is fairly limited. For larger vectors, we end up having to materialize a constant in a register. We hadn't plumbed the infrastructure to treat such materialized constants as constants for purpose of vsetvli elimination.
I only bothered to handle LI. We could extend this to LUI sequences, but well, 2048 elements is probably enough for all practical fixed length vector codegen. :)
The test delta does point out a related problem. At LMUL8, we see increased register allocation pressure, and we should probably either a) address register allocation remat, or b) be less aggressive about eliminating vsetvlis at high lmul. Note that high LMUL code is not generated much by default.
Differential Revision: https://reviews.llvm.org/D151212
The original change had a bug where it allowed SEW mutation. This is wrong in multiple ways, but an easy example is that the slide amount is in units of SEW, and thus that changing SEW changes the slide offset.
I'd reverted this in 33314693 intending to rework the patch more substantially because, in addition to the bug, I'd noticed a potential opportunity to increase scope. After implementing that variant and realizing it triggered nowhere, I decided to go back to the prior patch with the minimal fix.
Note there's no separate test case for the fix. This is because we already had multiple, and I just didn't realize the impact of the original test diff. Adding one more test would have been unlikely to catch that human error.
Original commit message..
Noticed this while looking at some SLP output. If we have an extractelement, we're probably using a slidedown into a destination with no contents. Given this, we can allow the slideup to use a larger VL and clobber tail elements of the destination vector. Doing this allows us to avoid vsetvli toggles in many fixed length vector examples.
Differential Revision: https://reviews.llvm.org/D148834
Noticed this while looking at some SLP output. If we have an extractelement, we're probably using a slidedown into a destination with no contents. Given this, we can allow the slideup to use a larger VL and clobber tail elements of the destination vector. Doing this allows us to avoid vsetvli toggles in many fixed length vector examples.
Differential Revision: https://reviews.llvm.org/D148834
If the AVL is a virtual register defined by a vsetvli with the same
vlmax we need and the previous vsetvli we saw in the data flow also
has that vlmax, we can use the x0, x0 form when we insert a vsetvli.
Not only does this avoid an update of the VL physical register, but
it may allow doLocalPostpass to completely remove the inserted vsetvli
by rewriting the vtype of the previous vsetvli.
Differential Revision: https://reviews.llvm.org/D148735