This is mostly a copy of the AArch64PostLegalizerLoweringPass, except it
removes all of the AArch64 combines.
This pass allows us to lower instructions after the generic
post-legalization combiner has had a chance to run.
We will be adding combines to this pass in future patches.
This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV. Splitting it out makes
it easier to test.
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
Integration wise, I switched from using TTI to using a pass configuration
variable. This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.
This patch implements simple landing pad labels ([pr]). When Zicfilp
enabled, this patch inserts `lpad 0` at the beginning of basic blocks
which are possible to be landed by indirect jumps.
This patch also supports option riscv-landing-pad-label to make users
cpable to set nonzero fixed labels. Using nonzero fixed label force
setting t2 before indirect jumps. It's less portable but more strict
than original implementation.
[pr]: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/417
[CodeGen] Change the prototype of regalloc filter function
Change the prototype of the filter function so that we can
filter not just by RegClass. We need to implement more
complicated filter based upon some other info associated
with each register.
Patch provided by: Gang Chen (gangc@amd.com)
Given an AVL that's computed from vlenb, if it's equal to VLMAX then we
can replace it with the VLMAX sentinel value.
The main motiviation is to be able to express an EVL of VLMAX in VP
intrinsics whilst emitting vsetvli a0, zero, so that we can replace
llvm.riscv.masked.strided.{load,store} with their VP counterparts.
This is done in RISCVVectorPeephole (previously RISCVFoldMasks, renamed
to account for the fact that it no longer just folds masks) instead of
SelectionDAG since there are multiple places places where VP nodes are
lowered that would have need to have been handled.
This also avoids doing it in RISCVInsertVSETVLI as it's much harder to
lookup the value of the AVL, and in RISCVVectorPeephole we can take
advantage of DeadMachineInstrElim to remove any leftover
PseudoReadVLENBs.
Machine Copy Propagation Pass may enlarge branch relaxation distance by
breaking generation of compressed insts. This commit moves Machine Copy
Propagation Pass before Branch relaxation pass so the results of Branch
relaxation pass won't be affected by Machine Copy Propagation Pass.
- Fix build with `EXPENSIVE_CHECKS`
- Remove unused `PassName::ID` to resolve warning
- Mark `~SelectionDAGISel` virtual so AArch64 backend can work properly
This reverts commit de37c06f01772e02465ccc9f538894c76d89a7a1 to
de37c06f01772e02465ccc9f538894c76d89a7a1
It still breaks EXPENSIVE_CHECKS build. Sorry.
Port selection dag isel to new pass manager.
Only `AMDGPU` and `X86` support new pass version. `-verify-machineinstrs` in new pass manager belongs to verify instrumentation, it is enabled by default.
We no longer need to separate the passes now that #70549 is landed and
this will unblock #89089.
It's not strictly NFC because it will move coalescing before register
allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this
makes it closer to the original behaviour.
This patch try to get rid of vsetvl implict vl/vtype def-use chain and
improve the register allocation quality by moving the vsetvl insertion
pass after RVV register allocation
It will gain the benefit for the following optimization from
1. unblock scheduler's constraints by removing vl/vtype def-use chain
2. Support RVV re-materialization
3. Support partial spill
This patch add a new option `-riscv-vsetvl-after-rvv-regalloc=<1|0>` to
control this feature and default set as disable.
As noted in
https://github.com/llvm/llvm-project/pull/91440#discussion_r1601976425,
if the pass pipeline stops early because of -stop-after any allocated
passes added with insertPass will not be freed if they haven't already
been added.
This was showing up as a failure on the address sanitizer buildbots. We
can fix it by instead passing the pass ID instead so that allocation is
deferred.
Split off from #70549, this patch moves RISCVInsertVSETVLI to after phi
elimination where we exit SSA and need to move to LiveVariables.
The motivation for splitting this off is to avoid the large scheduling
diffs from moving completely to after regalloc, and instead focus on
converting the pass to work on LiveIntervals.
The two main changes required are updating VSETVLIInfo to store VNInfos
instead of MachineInstrs, which allows us to still check for PHI defs in
needVSETVLIPHI, and fixing up the live intervals of any AVL operands
after inserting new instructions.
On O3 the pass is inserted after the register coalescer, otherwise we
end up with a bunch of COPYs around eliminated PHIs that trip up
needVSETVLIPHI.
Co-authored-by: Piyou Chen <piyou.chen@sifive.com>
This further splits off #91440 to inch RISCVInsertVSETVLI closer to post
vector regalloc.
As noted in #91440, most of the diffs are from moving vsetvli insertion
after the vxrm/csr insertion passes, but these are getting conflated
with the changes from moving to LiveIntervals.
One idea was that we could try and remove some of these diffs by
manually moving back the vsetvlis past the vxrm/csr instructions. But
this meant having to touch up the LiveIntervals again which seemed to
lead to even more diffs.
This instead just moves RISCVInsertVSETVLI after RISCVInsertReadWriteCSR
and RISCVInsertWriteVXRM so we can isolate those changes.
Currently RISCVDeadRegisterDefinitions runs after vsetvli insertion, but
in #70549 vsetvli insertion runs after vector regalloc and as a result
we no longer convert some vsetvli a0, a0s to vsetvli x0, a0. This patch
moves it to after vector regalloc, but before scalar regalloc so we
still get the benefits of reducing register pressure.
The original commit was calling shrinkToUses on an interval for a virtual
register whose def was erased. This fixes it by calling shrinkToUses first
and removing the interval if we erase the old VL def.
This patch splits off part of the work to move vsetvli insertion to post
regalloc in #70549.
The doLocalPostpass operates outside of RISCVInsertVSETVLI's dataflow,
so we can move it to its own pass. We can then move it to post vector
regalloc which should be a smaller change.
A couple of things that are different from #70549:
- This manually fixes up the LiveIntervals rather than recomputing it
via createAndComputeVirtRegInterval. I'm not sure if there's much of a
difference with either.
- For the postpass it's sufficient enough to just check isUndef() in
hasUndefinedMergeOp, i.e. we don't need to lookup the def in VNInfo.
Running on llvm-test-suite and SPEC CPU 2017 there aren't any changes in
the number of vsetvlis removed. There are some minor scheduling diffs as
well as extra spills and less spills in some cases (caused by transient
vsetvlis existing between RISCVInsertVSETVLI and RISCVCoalesceVSETVLI
when vec regalloc happens), but they are minor and should go away once
we finish moving the rest of RISCVInsertVSETVLI.
We could also potentially turn off this pass for unoptimised builds.
Split vector and scalar regalloc has been enabled by default for 5
months now since d0a39e617ba301a76d28e2d82e1f657999c9dcfb, and shipped
with 18.1.0. I haven't heard of any issues with it so far, so this
proposes to remove the flag to reduce the number of configurations we
have to support.
There are two places in tree that use these helpers and there will
be more future usages.
Reviewers: asb, BeMg, lukel97
Reviewed By: BeMg, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/84144
When using Greedy Register Allocation, there are times where
early-clobber values are ignored, and assigned the same register. This
is illeagal behaviour for these intructions. To get around this, using
Pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
meets the ARM Architecture Reference Manual and matches the defined
behaviour.
This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM Architecture. Doing this will
ensure early-clobber restraints are followed when using the ARM
Architecture. Making the pass target independent will also open up
possibility that support other architectures can be added in the future.
AArch64 has had it enabled since late November, so hopefully the main
issues have been resolved.
I see a small reduction in dynamic instruction count on every benchmark
in specint2017. The best improvement was 0.3% so nothing amazing.
This pass looks for unsigned icmps that have illegal types and tries
to widen the use/def graph to improve the placement of the zero
extends that type legalization would need to insert.
I've explicitly disabled it for i32 by adding a check for
isSExtCheaperThanZExt to the pass.
The generated code isn't perfect, but my data shows a net
dynamic instruction count improvement on spec2017 for both base and
Zba+Zbb+Zbs.
We convert existed macro fusions to TableGen.
Bacause `Fusion` depend on `Instruction` definitions which is defined
below `RISCVFeatures.td`, so we recommend user to add fusion features
when defining new processor.
This commit includes the necessary changes to clang and LLVM to support
codegen of `RVE` and the `ilp32e`/`lp64e` ABIs.
The differences between `RVE` and `RVI` are:
* `RVE` reduces the integer register count to 16(x0-x16).
* The ABI should be `ilp32e` for 32 bits and `lp64e` for 64 bits.
`RVE` can be combined with all current standard extensions.
The central changes in ilp32e/lp64e ABI, compared to ilp32/lp64 are:
* Only 6 integer argument registers (rather than 8).
* Only 2 callee-saved registers (rather than 12).
* A Stack Alignment of 32bits (rather than 128bits).
* ilp32e isn't compatible with D ISA extension.
If `ilp32e` or `lp64` is used with an ISA that has any of the registers
x16-x31 and f0-f31, then these registers are considered temporaries.
To be compatible with the implementation of ilp32e in GCC, we don't use
aligned registers to pass variadic arguments and set stack alignment\
to 4-bytes for types with length of 2*XLEN.
FastCC is also supported on RVE, while GHC isn't since there is only one
avaiable register.
Differential Revision: https://reviews.llvm.org/D70401
Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for al targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the selectiondag
scheduler as well was the machinescheduler, as the sdag scheduler does
seem to allow this reordering.
This patch adds a parameter that can control the behaviour on a
per-target basis.
Split out from #73789.
This patch make riscv-split-regalloc as true by default.
It will not affect the codegen result if it vector register allocation
doesn't exist. If there is the vector register allocation, it may affect
the non-rvv register LiveInterval's segment/weight. It will make the
allocation in a different order.
This adds minimal support for load clustering, but disables it by
default. The intent is to iterate on the precise heuristic and the
question of turning this on by default in a separate PR. Although
previous discussion indicates hope that the MachineScheduler would
replace most uses of the SelectionDAG scheduler, it does seem most
targets aren't using MachineScheduler load clustering right now:
PPC+AArch64 seem to just use it to help with paired load/store formation
and although AMDGPU uses it for general clustering it also implements
ShouldScheduleLoadsNear for the SelectionDAG scheduler's clustering.
Enable this flow by -riscv-split-regalloc=1 (default disable), and could
designate specific allocator to RVV by
-riscv-rvv-regalloc=<fast|basic|greedy>
It uses the RegClass filter function to decide which regclass need to be
processed.
This patch is pre-requirement for supporting PostRA vsetvl insertion
pass.
This adds a new pass to insert VXRM writes for vector instructions. With
the goal of avoiding redundant writes.
The pass does 2 dataflow algorithms. The first is a forward data flow to
calculate where a VXRM value is available. The second is a backwards
dataflow to determine where a VXRM value is anticipated.
Finally, we use the results of these two dataflows to insert VXRM writes
where a value is anticipated, but not available.
The pass does not split critical edges so we aren't always able to
eliminate all redundancy.
The pass will only insert vxrm writes on paths that always require it.
We currently have three postprocess peephole optimisations for vector
pseudos:
1) Masked pseudo with all ones mask -> unmasked pseudo
2) Merge vmerge pseudo into operand pseudo's mask
3) vmerge pseudo with all ones mask -> vmv.v.v pseudo
This patch aims to move these peepholes out of SelectionDAG and into a
separate RISCVFoldMasks MachineFunction pass.
There are a few motivations for doing this:
* The current SelectionDAG implementation operates on MachineSDNodes,
which are essentially MachineInstrs but require a bunch of logic to
reason about chain and glue operands. The RISCVII::has*Op helper
functions also don't exactly line up with the SDNode operands. Mutating
these pseudos and their operands in place becomes a good bit easier at
the MachineInstr level. For example, we would no longer need to check
for cycles in the DAG during performCombineVMergeAndVOps.
* Although it's further down the line, moving this code out of
SelectionDAG allows it to be reused by GlobalISel later on.
* In performCombineVMergeAndVOps, it may be possible to commute the
operands to enable folding in more cases (see
test/CodeGen/RISCV/rvv/vmadd-vp.ll). There is existing machinery to
commute operands in TII::commuteInstruction, but it's implemented on
MachineInstrs.
The pass runs straight after ISel, before any of the other machine SSA
optimization passes run. This is so that dead-mi-elimination can mop up
any vmsets that are no longer used (but if preferred we could try and
erase them from inside RISCVFoldMasks itself). This also means that
these peepholes are no longer run at codegen -O0, so this patch isn't
strictly NFC.
Only the performVMergeToVMv peephole is refactored in this patch, the
remaining two would be implemented later. And as noted by @preames, it
should be possible to move doPeepholeSExtW out of SelectionDAG as well.
Rematerialization during register allocation is currently limited to a
single instruction with no inputs.
This patch introduces a pseudoinstruction that represents the
materialization of a constant. I've started with a sequence of 2
instructions for now, which covers at least the common LUI+ADDI(W) case.
This instruction will be expanded into real instructions immediately
after register allocation using a new pass. This gives the post-RA
scheduler a chance to separate the 2 instructions to improve ILP.
I believe this matches the approach used by AArch64.
Unfortunately, this loses some CSE opportunies when an LUI value is used
by multiple constants with different LSBs.
This feature is off by default and a new backend command line option is
added to enable it for testing.
This avoids the spill and reloads reported in #69586.
This uses the recently introduced sink-and-fold support in MachineSink.
https://reviews.llvm.org/D152828
This enables folding ADDI into load/store addresses.
Enabling by default will be a separate PR.
When AMOs are used to implement parallel reduction operations, typically the return value would be discarded.
This patch adds a peephole pass `RISCVDeadRegisterDefinitions`. It rewrites `rd` to `x0` when `rd` is marked as dead.
It may improve the register allocation and reduce pipeline hazards on CPUs without register renaming and OOO.
Comparison with GCC: https://godbolt.org/z/bKaxnEcec
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158759
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same block case, not the cross block case. This lead to the performance regression reported in https://github.com/llvm/llvm-project/issues/64282.
This change is a slightly ugly hack to side step the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers.
We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility.
Differential Revision: https://reviews.llvm.org/D156909
With `-fsanitize=kcfi` (Kernel Control-Flow Integrity), Clang emits
"kcfi" operand bundles to indirect call instructions. Similarly to
the target-specific lowering added in D119296, implement KCFI operand
bundle lowering for RISC-V.
This patch disables the generic KCFI pass for RISC-V in Clang, and
adds the KCFI machine function pass in `RISCVPassConfig::addPreSched`
to emit target-specific `KCFI_CHECK` pseudo instructions before calls
that have KCFI operand bundles. The machine function pass also bundles
the instructions to ensure we emit the checks immediately before the
calls, which is not possible with the generic pass.
`KCFI_CHECK` instructions are lowered in `RISCVAsmPrinter` to a
contiguous code sequence that traps if the expected hash in the
operand bundle doesn't match the hash before the target function
address. This patch emits an `ebreak` instruction for error handling
to match the Linux kernel's `BUG()` implementation. Just like for X86,
we also emit trap locations to a `.kcfi_traps` section to support
error handling, as we cannot embed additional information to the trap
instruction itself.
Relands commit 62fa708ceb027713b386c7e0efda994f8bdc27e2 with fixed
tests.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D148385