This follows up #115495 by enabling merging of external globals by
default. That had been left as a next step in order to keep the
previous change more incremental and to make it easier to narrow down
any identified regressions.
Enabling merging of external globals matches what Arm does (for
non-Mach-O targets), though AArch64 doesn't, as there were [some
concerns](https://reviews.llvm.org/D61947) that it might cause regressions in
some cases.
See https://github.com/llvm/llvm-project/pull/117880 for benchmark figures and discussion.
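As a rough, hedged illustration of the transformation the GlobalMerge pass performs (illustrative C++, not code from this patch; the variable names are made up):

```c++
// Two small globals, each of which normally needs its own address
// materialization (e.g. a lui/addi pair) at every use:
int counter_a;
int counter_b;

// Conceptually, GlobalMerge lays such globals out together so a single base
// address can be shared and each access becomes base + small offset:
struct MergedGlobals {
  int counter_a;
  int counter_b;
} merged_globals;

int sum() { return merged_globals.counter_a + merged_globals.counter_b; }
```

With this change the same treatment is applied to externally visible globals by default.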
Enable `-fstack-clash-protection` for RISC-V and stack probing for function
prologues.
We probe the stack by creating a loop that allocates and probes the stack
in ProbeSize chunks.
We emit an unrolled probe loop for small allocations and a variable-length
probe loop for bigger ones.
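As a hedged sketch of the probing scheme (plain C++ modelling the shape of the emitted sequence, not the backend code; `kProbeSize` is an illustrative constant):

```c++
#include <cstddef>

constexpr std::size_t kProbeSize = 4096; // illustrative; typically a page

// Conceptual model of the variable-length probe loop used for large
// allocations: move the stack pointer down one chunk at a time and store
// into each chunk so a guard page cannot be skipped over. For small
// allocations the same sequence is emitted unrolled instead of as a loop.
void probeStack(char *&SP, std::size_t AllocSize) {
  char *Limit = SP - AllocSize;
  while (static_cast<std::size_t>(SP - Limit) >= kProbeSize) {
    SP -= kProbeSize;
    *SP = 0; // probe the newly allocated chunk
  }
  SP = Limit; // allocate the remainder smaller than kProbeSize
}
```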
Here we add a scheduling mutation to pre-RA scheduling, which will
add an artificial dependency edge between a mask producer and the
nearest previous instruction that uses the V0 register.
This prevents the live intervals of mask registers from overlapping, and
as a consequence we can reduce some spills/moves.
From the test changes, we can see some improvements and also some
regressions (more vtype toggles).
Partially fixes #113489.
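A heavily simplified, hedged sketch of what such a mutation can look like (this is not the patch's code; it treats "defines V0" as a stand-in for the mask producer, and the include providing `RISCV::V0` is assumed):

```c++
#include "MCTargetDesc/RISCVMCTargetDesc.h" // assumed path for RISCV::V0
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/ScheduleDAGInstrs.h"
#include "llvm/CodeGen/ScheduleDAGMutation.h"

using namespace llvm;

// When an SUnit writes V0, add an artificial edge from the nearest earlier
// SUnit that reads V0, so the new mask definition is kept after the last use
// of the previous mask and their live intervals don't overlap.
struct MaskOrderingMutation : public ScheduleDAGMutation {
  void apply(ScheduleDAGInstrs *DAG) override {
    SUnit *LastV0User = nullptr;
    for (SUnit &SU : DAG->SUnits) {
      MachineInstr *MI = SU.getInstr();
      if (!MI)
        continue;
      if (LastV0User && MI->definesRegister(RISCV::V0, DAG->TRI))
        DAG->addEdge(&SU, SDep(LastV0User, SDep::Artificial));
      if (MI->readsRegister(RISCV::V0, DAG->TRI))
        LastV0User = &SU;
    }
  }
};
```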
From the discussion at the round-table at the RISC-V Summit it was clear
people see cases where global merging would help. So the direction of
enabling it by default and iteratively working to enable it in more
cases or to improve the heuristics seems sensible. This patch tries to
make a minimal step in that direction.
Following discussions in #110443, and the earlier discussions
in https://lists.llvm.org/pipermail/llvm-dev/2017-October/117907.html,
https://reviews.llvm.org/D38482, https://reviews.llvm.org/D38489, this
PR attempts to overhaul the `TargetMachine` and `LLVMTargetMachine`
interface classes. More specifically:
1. Makes `TargetMachine` the only class implemented under
`TargetMachine.h` in the `Target` library.
2. `TargetMachine` contains target-specific interface functions that
relate to IR/CodeGen/MC constructs, whereas before (at least on paper)
it was supposed to have only IR/MC constructs. Any Target that doesn't
want to use the independent code generator simply does not implement
them, and returns either `false` or `nullptr`.
3. Renames `LLVMTargetMachine` to `CodeGenCommonTMImpl`. This renaming
aims to make the purpose of `LLVMTargetMachine` clearer. Its interface
was moved under the CodeGen library, to further emphasize its usage in
Targets that use CodeGen directly.
4. Makes `TargetMachine` the only interface used across LLVM and its
projects. With these changes, `CodeGenCommonTMImpl` is simply a set of
shared function implementations of `TargetMachine`, and CodeGen users
don't need to `static_cast` to `LLVMTargetMachine` every time they need a
CodeGen-specific feature of the `TargetMachine` (see the sketch after this list).
5. More importantly, does not change any requirements regarding library
linking.
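As a hedged, self-contained sketch of point 4 (mock types only; `emitsMachineCode` is an invented hook, not a real LLVM API):

```c++
// Mock of the reorganised interface: the CodeGen-specific query lives on the
// base class, and targets that don't use the common code generator keep the
// default implementation returning false/nullptr.
struct TargetMachineMock {
  virtual bool emitsMachineCode() const { return false; }
  virtual ~TargetMachineMock() = default;
};

struct CodeGenCommonTMImplMock : TargetMachineMock {
  bool emitsMachineCode() const override { return true; }
};

// Users now query the single interface directly, instead of first writing
//   static_cast<LLVMTargetMachine &>(TM)
// to reach CodeGen-only functionality.
bool usesCodeGen(const TargetMachineMock &TM) { return TM.emitsMachineCode(); }
```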
cc @arsenm @aeubanks
AArch64 left this disabled after seeing some cases of slightly worse
codegen that weren't tracked down, so I suggest that, as a path to
incrementally moving towards enabling globals merging, we follow suit and
evaluate turning it on later.
This patch disables merging of external globals, but also adds a flag to
override that. This reduces churn in test cases, simplifies benchmarking
runs, and this flag can be removed later.
A follow-on PR enables the globals merging pass by default (and as it's
based on this commit, merging of external globals is disabled just as
it is for AArch64).
#73789 added load clustering and #73796 tried to add store clustering.
If post-RA machine scheduling is used, load/store clusters formed during
machine scheduling may be broken. In order to solve this, add
load/store clustering to the post-RA machine scheduler.
Now that LLVM 19.1.0 has been out for a while with post-vector-RA
vsetvli insertion enabled by default, this proposes to remove the flag
that restores the old pre-RA behaviour so we only have one configuration
going forward.
That flag was mainly meant as a fallback in case users ran into issues,
but I haven't seen anything reported so far.
The purpose of this optimization is to make the VL argument, for
instructions that have a VL argument, as small as possible. This is
implemented by visiting each instruction in reverse order and checking,
if it has a VL argument, whether that VL can be reduced.
By putting this pass before VSETVLI insertion, we see three kinds of
changes to generated code:
1. Eliminate VSETVLI instructions
2. Reduce the VL toggle on VSETVLI instructions that also change vtype
3. Reduce the VL set by a VSETVLI instruction
The list of supported instructions is currently whitelisted for safety.
In the future, we could add more instructions to `isSupportedInstr` to
support even more VL optimization.
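To make the reverse walk described above more concrete, here is a hedged, very simplified model in plain C++ (toy data structures, not the pass's real ones; it assumes each user reads only its own VL lanes of an operand):

```c++
#include <algorithm>
#include <vector>

// Toy model of a vector instruction with a VL operand and a list of users.
struct VInst {
  unsigned VL = 0;                // current VL operand
  std::vector<VInst *> Users;     // instructions that read the result

  // Simplification: assume each user reads only its first `VL` result lanes.
  unsigned demandedVL() const {
    unsigned Demanded = 0;
    for (const VInst *U : Users)
      Demanded = std::max(Demanded, U->VL);
    return Demanded;
  }
};

// Visit instructions in reverse order and shrink VL wherever no user reads
// the extra lanes; vsetvli insertion then has fewer constraints to satisfy.
void reduceVL(std::vector<VInst *> &BlockInReverseOrder) {
  for (VInst *MI : BlockInReverseOrder)
    if (!MI->Users.empty())
      MI->VL = std::min(MI->VL, MI->demandedVL());
}
```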
We originally wrote this pass because vector GEP instructions do not
take a VL, which leads us to emit code that uses VL=VLMAX to implement
GEP in the RISC-V backend. As a result, some of the vector instructions
will write to lanes, specifically between the intended VL and VLMAX,
that will never be read. As an alternative to this pass, we considered
adding a vector predicated GEP instruction, but this would not fit well
into the intrinsic type system since GEP has a variable number of
arguments, each with arbitrary types. The second approach we considered
was to put this pass after VSETVLI insertion, but we found that it was
more difficult to recognize optimization opportunities, especially
across basic block boundaries -- the data flow analysis was also a bit
more expensive and complex.
While this pass solves the GEP problem, we have expanded it to handle
more cases of VL optimization, and there is opportunity for the analysis
to be improved to enable even more optimization. We have a few follow up
patches to post, but figured this would be a good start.
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Co-authored-by: Kito Cheng <kito.cheng@sifive.com>
This reverts commit 64972834c193632cbc47e54c0f0c721636b077e6.
Based on the discussions in #108991 that happened post merge, we have decided
to remove this pass in favor of generating `RISCV::G_*` opcodes in the legalizer.
We may reconsider moving that code elsewhere in the future so that we can do
a better job during generic combines. We don't feel that doing it in instruction
selection is the right decision today. Firstly, it requires us to manually
do regbankselect on the newly introduced instructions. Secondly, it is more
difficult to test since the test output will contain whatever `RISCV::G_*`
instructions select to (instead of `RISCV::G_*`).
My personal opinion is that the legalizer pass can be split into an early
legalizer and a late legalizer, both before regbankselect. The first legalizer
would not introduce target specific generic opcodes and the generic combiner
would run after it. The second legalizer would introduce the target specific
generic opcodes. I think this approach is better than the lowerer because the
legalizer guarantees that whatever we lower to is legal, and apparently because
it is more performant compared to the lowerer (although I'm not sure how
true this is).
A recent atomics ABI change / fix requires that for the "A6C" and "A6S"
atomics ABIs (i.e. both of those supported by LLVM currently), an
additional fence is inserted for an atomic_compare_exchange with seq_cst
failure ordering.
<https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/445>
This isn't trivial to support through the hooks used by AtomicExpandPass
because that pass assumes that when fences are inserted, the original
atomics ordering information can be removed from the instruction. Rather
than try to change and complicate that API, this patch implements the
needed fence insertion through a small special purpose pass.
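As a hedged, user-level illustration of the affected pattern (ordinary C++; the extra fence is inserted by the backend, nothing changes at the source level):

```c++
#include <atomic>

// A compare-exchange whose *failure* ordering is seq_cst. Under the updated
// "A6C"/"A6S" atomics mappings the backend now emits an additional leading
// fence for this case.
bool tryUpdate(std::atomic<int> &X, int Expected, int Desired) {
  return X.compare_exchange_strong(Expected, Desired,
                                   std::memory_order_seq_cst,  // success
                                   std::memory_order_seq_cst); // failure
}
```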
This is mostly a copy of the AArch64PostLegalizerLoweringPass, except it
removes all of the AArch64 combines.
This pass allows us to lower instructions after the generic
post-legalization combiner has had a chance to run.
We will be adding combines to this pass in future patches.
This transformation doesn't actually use any of the internal state of
LSR and recomputes all information from SCEV. Splitting it out makes
it easier to test.
Note that long term I would like to write a version of this transform
which *is* integrated with LSR's solver, but if that happens, we'll
just delete the extra pass.
Integration-wise, I switched from using TTI to using a pass configuration
variable. This seems slightly more idiomatic, and means we don't run
the extra logic on any target other than RISCV.
This patch implements simple landing pad labels ([pr]). When Zicfilp is
enabled, this patch inserts `lpad 0` at the beginning of basic blocks
that can be landed on by indirect jumps.
This patch also supports the option `riscv-landing-pad-label` so that users
can set a nonzero fixed label. Using a nonzero fixed label forces setting
t2 before indirect jumps. It's less portable but stricter than the
original implementation.
[pr]: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/417
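A hedged illustration of the kind of source that produces landing-pad targets (plain C++; the `lpad` insertion itself happens in the backend and is not visible in source):

```c++
// An indirect call: with Zicfilp enabled, the entry block of whatever
// function `Callback` points to is a potential indirect-jump target, so the
// backend emits `lpad 0` (or the fixed nonzero label when
// riscv-landing-pad-label is set) at its start.
void callThrough(void (*Callback)(int), int Arg) {
  Callback(Arg); // indirect call via a function pointer
}
```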
[CodeGen] Change the prototype of regalloc filter function
Change the prototype of the filter function so that we can
filter not just by RegClass. We need to implement more
complicated filters based upon some other info associated
with each register.
Patch provided by: Gang Chen (gangc@amd.com)
Given an AVL that's computed from vlenb, if it's equal to VLMAX then we
can replace it with the VLMAX sentinel value.
The main motivation is to be able to express an EVL of VLMAX in VP
intrinsics whilst emitting vsetvli a0, zero, so that we can replace
llvm.riscv.masked.strided.{load,store} with their VP counterparts.
This is done in RISCVVectorPeephole (previously RISCVFoldMasks, renamed
to account for the fact that it no longer just folds masks) instead of
SelectionDAG since there are multiple places where VP nodes are
lowered that would have needed to be handled.
This also avoids doing it in RISCVInsertVSETVLI as it's much harder to
look up the value of the AVL, and in RISCVVectorPeephole we can take
advantage of DeadMachineInstrElim to remove any leftover
PseudoReadVLENBs.
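As a hedged arithmetic illustration of when a vlenb-derived AVL equals VLMAX (not code from the pass):

```c++
// vlenb = VLEN / 8 (the vector register length in bytes), and for integral
// LMUL, VLMAX = VLEN / SEW * LMUL. So for e.g. SEW = 64, LMUL = 1:
//   VLMAX = VLEN / 64 = vlenb >> 3,
// meaning an AVL computed as (vlenb >> 3) can be replaced with the VLMAX
// sentinel and emitted as `vsetvli a0, zero, e64, m1, ...`.
unsigned vlmax(unsigned VLENInBits, unsigned SEW, unsigned LMUL) {
  return VLENInBits / SEW * LMUL;
}
```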
The Machine Copy Propagation pass may enlarge the branch relaxation distance by
breaking the generation of compressed instructions. This commit moves the
Machine Copy Propagation pass before the Branch Relaxation pass so that the
results of branch relaxation won't be affected by machine copy propagation.
- Fix build with `EXPENSIVE_CHECKS`
- Remove unused `PassName::ID` to resolve warning
- Mark `~SelectionDAGISel` virtual so AArch64 backend can work properly
This reverts commit de37c06f01772e02465ccc9f538894c76d89a7a1 to
de37c06f01772e02465ccc9f538894c76d89a7a1
It still breaks EXPENSIVE_CHECKS build. Sorry.
Port SelectionDAG ISel to the new pass manager.
Only `AMDGPU` and `X86` support the new pass version. In the new pass manager, `-verify-machineinstrs` is handled by the verify instrumentation, which is enabled by default.
We no longer need to separate the passes now that #70549 has landed, and
this will unblock #89089.
It's not strictly NFC because it will move coalescing before register
allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this
makes it closer to the original behaviour.
This patch tries to get rid of the vsetvli implicit vl/vtype def-use chain and
improve register allocation quality by moving the vsetvli insertion
pass after RVV register allocation.
It enables the following optimizations:
1. Unblock the scheduler's constraints by removing the vl/vtype def-use chain
2. Support RVV re-materialization
3. Support partial spills
This patch adds a new option `-riscv-vsetvl-after-rvv-regalloc=<1|0>` to
control this feature, which defaults to disabled.
As noted in
https://github.com/llvm/llvm-project/pull/91440#discussion_r1601976425,
if the pass pipeline stops early because of -stop-after any allocated
passes added with insertPass will not be freed if they haven't already
been added.
This was showing up as a failure on the address sanitizer buildbots. We
can fix it by passing the pass ID instead so that allocation is
deferred.
Split off from #70549, this patch moves RISCVInsertVSETVLI to after phi
elimination where we exit SSA and need to move to LiveVariables.
The motivation for splitting this off is to avoid the large scheduling
diffs from moving completely to after regalloc, and instead focus on
converting the pass to work on LiveIntervals.
The two main changes required are updating VSETVLIInfo to store VNInfos
instead of MachineInstrs, which allows us to still check for PHI defs in
needVSETVLIPHI, and fixing up the live intervals of any AVL operands
after inserting new instructions.
On O3 the pass is inserted after the register coalescer, otherwise we
end up with a bunch of COPYs around eliminated PHIs that trip up
needVSETVLIPHI.
Co-authored-by: Piyou Chen <piyou.chen@sifive.com>
This further splits off #91440 to inch RISCVInsertVSETVLI closer to post
vector regalloc.
As noted in #91440, most of the diffs are from moving vsetvli insertion
after the vxrm/csr insertion passes, but these are getting conflated
with the changes from moving to LiveIntervals.
One idea was that we could try and remove some of these diffs by
manually moving back the vsetvlis past the vxrm/csr instructions. But
this meant having to touch up the LiveIntervals again which seemed to
lead to even more diffs.
This instead just moves RISCVInsertVSETVLI after RISCVInsertReadWriteCSR
and RISCVInsertWriteVXRM so we can isolate those changes.
Currently RISCVDeadRegisterDefinitions runs after vsetvli insertion, but
in #70549 vsetvli insertion runs after vector regalloc and as a result
we no longer convert some `vsetvli a0, a0` instructions to `vsetvli x0, a0`. This patch
moves it to after vector regalloc, but before scalar regalloc so we
still get the benefits of reducing register pressure.
The original commit was calling shrinkToUses on an interval for a virtual
register whose def was erased. This fixes it by calling shrinkToUses first
and removing the interval if we erase the old VL def.
This patch splits off part of the work to move vsetvli insertion to post
regalloc in #70549.
The doLocalPostpass function operates outside of RISCVInsertVSETVLI's dataflow,
so we can move it to its own pass. We can then move it to post vector
regalloc which should be a smaller change.
A couple of things that are different from #70549:
- This manually fixes up the LiveIntervals rather than recomputing it
via createAndComputeVirtRegInterval. I'm not sure if there's much of a
difference with either.
- For the postpass it's sufficient to just check isUndef() in
hasUndefinedMergeOp, i.e. we don't need to look up the def in VNInfo.
Running on llvm-test-suite and SPEC CPU 2017 there aren't any changes in
the number of vsetvlis removed. There are some minor scheduling diffs as
well as extra spills in some cases and fewer in others (caused by transient
vsetvlis existing between RISCVInsertVSETVLI and RISCVCoalesceVSETVLI
when vec regalloc happens), but they are minor and should go away once
we finish moving the rest of RISCVInsertVSETVLI.
We could also potentially turn off this pass for unoptimised builds.
Split vector and scalar regalloc has been enabled by default for 5
months now since d0a39e617ba301a76d28e2d82e1f657999c9dcfb, and shipped
with 18.1.0. I haven't heard of any issues with it so far, so this
proposes to remove the flag to reduce the number of configurations we
have to support.
There are two places in-tree that use these helpers, and there will
be more usages in the future.
Reviewers: asb, BeMg, lukel97
Reviewed By: BeMg, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/84144
When using Greedy Register Allocation, there are times when
early-clobber values are ignored and assigned the same register. This
is illegal behaviour for these instructions. To get around this, using
Pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
meets the ARM Architecture Reference Manual and matches the defined
behaviour.
This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM Architecture. Doing this will
ensure early-clobber constraints are followed when using the ARM
Architecture. Making the pass target independent will also open up the
possibility of adding support for other architectures in the future.
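For context, a hedged illustration of the early-clobber constraint being protected, written with GCC/Clang extended inline asm (RISC-V assembly assumed in the template; not code from the patch):

```c++
// "=&r" marks Out as early-clobber: it is written before all inputs have been
// read, so the register allocator must not give it the same register as A or
// B. The pseudo instructions described above give such defs a definition that
// Greedy respects when assigning registers.
int addAThenTwiceB(int A, int B) {
  int Out;
  asm("add %0, %1, %2\n\t"
      "add %0, %0, %2"
      : "=&r"(Out)
      : "r"(A), "r"(B));
  return Out; // A + 2*B
}
```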
AArch64 has had it enabled since late November, so hopefully the main
issues have been resolved.
I see a small reduction in dynamic instruction count on every benchmark
in specint2017. The best improvement was 0.3% so nothing amazing.
This pass looks for unsigned icmps that have illegal types and tries
to widen the use/def graph to improve the placement of the zero
extends that type legalization would need to insert.
I've explicitly disabled it for i32 by adding a check for
isSExtCheaperThanZExt to the pass.
The generated code isn't perfect, but my data shows a net
dynamic instruction count improvement on spec2017 for both base and
Zba+Zbb+Zbs.
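A hedged example of the kind of source pattern the pass targets (illustrative only; not taken from the patch or its tests):

```c++
// An unsigned comparison on a type that is illegal for RV64 (i16 here).
// Without promotion, type legalization has to zero-extend around the compare;
// the pass widens the whole use/def graph up front so the zero extension can
// be placed (or folded) more profitably.
bool inRange(unsigned short X, unsigned short Lo) {
  return static_cast<unsigned short>(X - Lo) < 100u; // unsigned icmp on i16
}
```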
We convert existing macro fusions to TableGen.
Because `Fusion` depends on `Instruction` definitions, which are defined
below `RISCVFeatures.td`, we recommend users add fusion features when
defining a new processor.
This commit includes the necessary changes to clang and LLVM to support
codegen of `RVE` and the `ilp32e`/`lp64e` ABIs.
The differences between `RVE` and `RVI` are:
* `RVE` reduces the integer register count to 16 (x0-x15).
* The ABI should be `ilp32e` for 32 bits and `lp64e` for 64 bits.
`RVE` can be combined with all current standard extensions.
The central changes in the ilp32e/lp64e ABIs, compared to ilp32/lp64, are:
* Only 6 integer argument registers (rather than 8); see the example after this list.
* Only 2 callee-saved registers (rather than 12).
* A stack alignment of 32 bits (rather than 128 bits).
* ilp32e isn't compatible with the D ISA extension.
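A hedged illustration of the first bullet (valid C++; nothing here is from the patch):

```c++
// Under ilp32e/lp64e only a0-a5 are integer argument registers, so for a call
// to this function the last two integer arguments are passed on the stack,
// whereas ilp32/lp64 would still pass them in a6/a7.
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h) {
  return a + b + c + d + e + f + g + h;
}
```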
If `ilp32e` or `lp64e` is used with an ISA that has any of the registers
x16-x31 and f0-f31, then these registers are considered temporaries.
To be compatible with the implementation of ilp32e in GCC, we don't use
aligned registers to pass variadic arguments, and we set the stack alignment
to 4 bytes for types with a length of 2*XLEN.
FastCC is also supported on RVE, while GHC isn't since there is only one
available register.
Differential Revision: https://reviews.llvm.org/D70401
Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for all targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the SelectionDAG
scheduler as well as in the MachineScheduler, as the SelectionDAG scheduler
does seem to allow this reordering.
This patch adds a parameter that can control the behaviour on a
per-target basis.
Split out from #73789.