The `RISCVIndirectBranchTracking` pass inserts the `lpad` instruction and
can change basic block alignment, so it should not run after branch
relaxation, as the adjusted offsets may exceed the branch range.
This is meant as preparation for PR #130988 "[AMDGPU] Implement IR
expansion for frem instruction", which implements the expansion of
another instruction in this pass. The more general name seems more
appropriate given that change, and quite reasonable even without it.
This is a follow-up to #125026 that keeps mask operands in virtual
register form for as long as possible throughout the backend.
The diffs in this patch are from MachineCSE/MachineSink/RISCVVLOptimizer
kicking in.
The invariant that the mask COPY never has a subreg no longer holds
after MachineCSE (it coalesces some copies), so it needed to be relaxed.
This is another attempt at #88496 to keep mask operands in SSA after
instruction selection.
Previously we selected the mask operands into vmv0, a singleton register
class with exactly one register, V0.
But the register allocator doesn't really support singleton register
classes and we ran into errors like "ran out of registers during
register allocation in function".
This patch avoids that by introducing a pass just before register
allocation that converts any use of vmv0 into a copy to $v0, i.e. what
isel does today (sketched after the list below).
That way the register allocator doesn't need to deal with the singleton
register class, but we get the benefits of having the mask registers in
SSA throughout the backend:
- This allows RISCVVLOptimizer to reduce the VLs of instructions that
define mask registers
- It enables CSE and code sinking in more places
- It removes the need to peek through mask copies in RISCVISelDAGToDAG
and keep track of V0 defs in RISCVVectorPeephole
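For illustration, a rough MIR-level sketch of the rewrite the new pass
performs (the pseudo names are real masked pseudos, but the operand
lists are abbreviated and illustrative):

    # Before: the mask stays in a vmv0-constrained virtual register, in SSA.
    %mask:vmv0 = PseudoVMSEQ_VV_M1 ...
    %res:vrnov0 = PseudoVADD_VV_M1_MASK %passthru, %x, %y, %mask, ...
    # After the elimination pass: each vmv0 use becomes a copy through $v0,
    # which is what isel emits today.
    %mask:vr = PseudoVMSEQ_VV_M1 ...
    $v0 = COPY %mask
    %res:vrnov0 = PseudoVADD_VV_M1_MASK %passthru, %x, %y, $v0, ...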
This patch initially eliminates uses of vmv0 after RISCVVectorPeephole
to keep the diff to a minimum, and a follow-up patch will move it past
the other MachineInstr SSA passes.
Note that it doesn't try to remove any defs of vmv0, as no instructions
should produce vmv0 outputs.
As a further follow up, we can move the elimination pass to after phi
elimination and outside of SSA, which would unblock the pre-RA scheduler
around masked pseudos. This might also help the issue that
RISCVVectorMaskDAGMutation tries to solve.
A recent atomics ABI change/fix requires that, for the "A6C" and "A6S"
atomics ABIs (i.e. both of those currently supported by LLVM), an
additional fence is inserted for an atomic_compare_exchange with seq_cst
failure ordering.
<https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/445>
This isn't trivial to support through the hooks used by AtomicExpandPass,
because that pass assumes that when fences are inserted, the original
atomic ordering information can be removed from the instruction. Rather
than try to change and complicate that API, this patch implements the
needed fence insertion through a small special-purpose pass.
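For reference, a minimal IR example of the case the new pass handles
(names are illustrative):

    define i1 @cas(ptr %p, i32 %expected, i32 %desired) {
      %pair = cmpxchg ptr %p, i32 %expected, i32 %desired seq_cst seq_cst
      %ok = extractvalue { i32, i1 } %pair, 1
      ret i1 %ok
    }

Here the second (failure) ordering is seq_cst, which is what triggers the
additional fence under the updated ABI.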
This patch is part of a set of patches that add an `-fextend-lifetimes`
flag to clang, which extends the lifetimes of local variables and
parameters for improved debuggability. In addition to that flag, the
patch series adds a pragma to selectively disable `-fextend-lifetimes`,
and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes`
for `this` pointers only. All changes and tests in these patches were
written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer)
has handled review and merging. The extend-lifetimes flag is intended to
eventually be enabled by `-Og`, as discussed in the RFC
here:
https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850
This patch implements a new intrinsic instruction in LLVM,
`llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand
and has no effect other than "using" its operand, to ensure that its
operand remains live until after the fake use. This patch does not emit
fake uses anywhere; the next patch in this sequence causes them to be
emitted from the clang frontend, such that for each variable (or `this`) a
fake.use operand is inserted at the end of that variable's scope, using
that variable's value. This patch covers everything post-frontend, which
is largely just the basic plumbing for a new intrinsic/instruction,
along with a few steps to preserve the fake uses through optimizations
(such as moving them ahead of a tail call or translating them through
SROA).
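A minimal IR sketch of the intrinsic in use (the declaration shape shown
here is illustrative; the authoritative form is in the patch itself):

    declare void @llvm.fake.use(...)

    define i32 @f(i32 %x) {
      %y = add i32 %x, 1
      ; %x has no further real uses, but the fake use keeps it live to here
      call void (...) @llvm.fake.use(i32 %x)
      ret i32 %y
    }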
Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>
This patch implements simple landing pad labels ([pr]). When Zicfilp is
enabled, this patch inserts `lpad 0` at the beginning of basic blocks
that may be reached by indirect jumps.
This patch also supports the option riscv-landing-pad-label to let users
set a nonzero fixed label. Using a nonzero fixed label forces setting t2
before indirect jumps. It's less portable but stricter than the original
implementation.
[pr]: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/417
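A small assembly sketch of the two schemes (the label value 7 is
illustrative):

    # Default (label 0): any indirect jump may land here.
    func:
        lpad    0
        ...
    # With -riscv-landing-pad-label=7, the caller must load the expected
    # label into t2 before the indirect call, and the landing pad checks it:
        lui     t2, 7
        jalr    ra, 0(a0)
    ...
    callee:
        lpad    7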
Currently, the LowerConstantIntrinsics pass does an RPO traversal of
every function... only to find that many functions don't have constant
intrinsics (is.constant, objectsize). In the CodeGen pipeline, there is
already a pre-isel intrinsic lowering pass, which iterates over
intrinsic declarations and lowers all users. Call
lowerConstantIntrinsics from this pass to avoid the extra iteration over
the entire IR and the RPO traversal.
(Reapplying with corrected commit message)
We recently moved RISCVInsertVSETVLI from before vector register allocation
to after vector register allocation. When doing so, we added an unconditional
dependency on LiveIntervals - even at O0, where LiveIntervals hadn't previously
run. As reported in #93587, this was apparently not safe to do.
This change makes LiveIntervals optional, and adjusts all the update code to
only run when LiveIntervals is present. The only real tricky part of this
change is the abstract state tracking in the dataflow. We need to represent
a "register w/unknown definition" state - but only when we don't have
LiveIntervals.
This adjusts the abstract state definition so that the AVLIsReg state can
represent either a register + valno, or a register + unknown definition.
With LiveIntervals, we have an exact definition for each AVL use. Without
LiveIntervals, we treat the definition of a register AVL as being unknown.
The key semantic change is that we now have a state in the lattice for which
something is known about the AVL value, but for which two identical lattice
elements do *not* necessarily represent the same AVL value at runtime.
Previously, the only case which could result in such an unknown AVL was the
fully unknown state (where VTYPE is also fully unknown). This requires a
small adjustment to hasSameAVL and lattice state equality to draw this
important distinction.
The net effect of this patch is that we remove the LiveIntervals dependency
at O0, and O0 code quality will regress for cases involving register AVL values.
In practice, this means we pessimize code written with intrinsics at O0.
This patch is an alternative to #93796 and #94340. It is very directly
inspired by review conversation around them, and thus should be considered
coauthored by Luke.
Stacked on https://github.com/llvm/llvm-project/pull/94658.
We recently moved RISCVInsertVSETVLI from before vector register
allocation to after vector register allocation. When doing so, we added
an unconditional dependency on LiveIntervals - even at O0 where
LiveIntervals hadn't previously run. As reported in #93587, this was
apparently not safe to do.
This change makes LiveIntervals optional, and adjusts all the update
code to only run when LiveIntervals is present. The only real tricky
part of this change is the abstract state tracking in the dataflow. We
need to represent a "register w/unknown definition" state - but only
when we don't have LiveIntervals.
This adjusts the abstract state definition so that the AVLIsReg state can
represent either a register + valno, or a register + unknown definition.
With LiveIntervals, we have an exact definition for each AVL use.
Without LiveIntervals, we treat the definition of a register AVL as
being unknown.
The key semantic change is that we now have a state in the lattice for
which something is known about the AVL value, but for which two
identical lattice elements do *not* necessarily represent the same AVL
value at runtime. Previously, the only case which could result in such
an unknown AVL was the fully unknown state (where VTYPE is also fully
unknown). This requires a small adjustment to hasSameAVL and lattice
state equality to draw this important distinction.
The net effect of this patch is that we remove the LiveIntervals
dependency at O0, and O0 code quality will regress for cases involving
register AVL values.
This patch is an alternative to
https://github.com/llvm/llvm-project/pull/93796 and
https://github.com/llvm/llvm-project/pull/94340. It is very directly
inspired by review conversation around them, and thus should be
considered coauthored by Luke.
We no longer need to separate the passes now that #70549 is landed and
this will unblock #89089.
It's not strictly NFC because it will move coalescing before register
allocation when -riscv-vsetvl-after-rvv-regalloc is disabled. But this
makes it closer to the original behaviour.
This reverts commit 8cc8e5d6c6ac9bfc888f3449f7e424678deae8c2.
This reverts commit dae55c89835347a353619f506ee5c8f8a2c136a7.
Causes major compile-time regressions for unoptimized builds.
Prior to this patch, when using -fthinlto-index=, the ObjCARCContractPass isn't run prior to CodeGen, and instruction selection fails on IR containing ARC intrinsics. This patch is motivated by that use case.
The pass was previously added in the various places codegen is performed. This patch adds the pass to the default codegen pipeline, makes sure it bails immediately if no ARC intrinsics are found, and removes the ad hoc scheduling of the pass.
Co-authored-by: Nuri Amari <nuriamari@fb.com>
This patch tries to get rid of the vsetvli implicit vl/vtype def-use chain
and improve register allocation quality by moving the vsetvli insertion
pass after RVV register allocation.
This gains the following benefits:
1. Unblocks the scheduler's constraints by removing the vl/vtype def-use chain
2. Supports RVV re-materialization
3. Supports partial spill
This patch adds a new option `-riscv-vsetvl-after-rvv-regalloc=<1|0>` to
control this feature; it defaults to disabled.
Split off from #70549, this patch moves RISCVInsertVSETVLI to after phi
elimination where we exit SSA and need to move to LiveVariables.
The motivation for splitting this off is to avoid the large scheduling
diffs from moving completely to after regalloc, and instead focus on
converting the pass to work on LiveIntervals.
The two main changes required are updating VSETVLIInfo to store VNInfos
instead of MachineInstrs, which allows us to still check for PHI defs in
needVSETVLIPHI, and fixing up the live intervals of any AVL operands
after inserting new instructions.
On O3 the pass is inserted after the register coalescer, otherwise we
end up with a bunch of COPYs around eliminated PHIs that trip up
needVSETVLIPHI.
Co-authored-by: Piyou Chen <piyou.chen@sifive.com>
This further splits off #91440 to inch RISCVInsertVSETVLI closer to post
vector regalloc.
As noted in #91440, most of the diffs are from moving vsetvli insertion
after the vxrm/csr insertion passes, but these are getting conflated
with the changes from moving to LiveIntervals.
One idea was that we could try and remove some of these diffs by
manually moving back the vsetvlis past the vxrm/csr instructions. But
this meant having to touch up the LiveIntervals again which seemed to
lead to even more diffs.
This instead just moves RISCVInsertVSETVLI after RISCVInsertReadWriteCSR
and RISCVInsertWriteVXRM so we can isolate those changes.
The original commit was calling shrinkToUses on an interval for a virtual
register whose def was erased. This fixes it by calling shrinkToUses first
and removing the interval if we erase the old VL def.
This patch splits off part of the work to move vsetvli insertion to post
regalloc in #70549.
doLocalPostpass operates outside of RISCVInsertVSETVLI's dataflow,
so we can move it to its own pass. We can then move it to post vector
regalloc which should be a smaller change.
A couple of things that are different from #70549:
- This manually fixes up the LiveIntervals rather than recomputing it
via createAndComputeVirtRegInterval. I'm not sure if there's much of a
difference with either.
- For the postpass it's sufficient to just check isUndef() in
hasUndefinedMergeOp, i.e. we don't need to look up the def in VNInfo.
Running on llvm-test-suite and SPEC CPU 2017 there aren't any changes in
the number of vsetvlis removed. There are some minor scheduling diffs, as
well as extra spills in some cases and fewer in others (caused by
transient vsetvlis existing between RISCVInsertVSETVLI and
RISCVCoalesceVSETVLI when vector regalloc happens), but they are minor
and should go away once we finish moving the rest of RISCVInsertVSETVLI.
We could also potentially turn off this pass for unoptimised builds.
When using Greedy Register Allocation, there are times when
early-clobber values are ignored and assigned the same register. This is
illegal behaviour for these instructions. To get around this, using
pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
conforms to the ARM Architecture Reference Manual and matches the
defined behaviour.
This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM architecture. Doing this will
ensure early-clobber constraints are followed when targeting ARM. Making
the pass target independent also opens up the possibility of adding
support for other architectures in the future.
This patch makes riscv-split-regalloc true by default.
It will not affect the codegen result if vector register allocation
doesn't occur. If vector register allocation does occur, it may affect
the segments/weights of the non-RVV registers' LiveIntervals, which can
make the allocation happen in a different order.
This adds a new pass to insert VXRM writes for vector instructions, with
the goal of avoiding redundant writes.
The pass runs two dataflow analyses. The first is a forward dataflow to
calculate where a VXRM value is available. The second is a backwards
dataflow to determine where a VXRM value is anticipated.
Finally, we use the results of these two dataflows to insert VXRM writes
where a value is anticipated, but not available.
The pass does not split critical edges, so we aren't always able to
eliminate all redundancy.
The pass will only insert vxrm writes on paths that always require it.
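As a small example of the redundancy being targeted (the rounding-mode
value 0 is illustrative):

    # Conservative placement writes vxrm in front of every fixed-point
    # instruction:
        csrwi    vxrm, 0
        vaadd.vv v8, v8, v9
        csrwi    vxrm, 0
        vaadd.vv v10, v10, v11
    # With this pass, the second write is dropped because vxrm=0 is already
    # available on every path reaching the second vaadd:
        csrwi    vxrm, 0
        vaadd.vv v8, v8, v9
        vaadd.vv v10, v10, v11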
Rematerialization during register allocation is currently limited to a
single instruction with no inputs.
This patch introduces a pseudoinstruction that represents the
materialization of a constant. I've started with a sequence of 2
instructions for now, which covers at least the common LUI+ADDI(W) case.
This instruction will be expanded into real instructions immediately
after register allocation using a new pass. This gives the post-RA
scheduler a chance to separate the 2 instructions to improve ILP.
I believe this matches the approach used by AArch64.
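For example, the common two-instruction case looks like this once the
pseudo is expanded after regalloc (the constant is chosen for
illustration):

    # materialize 0x12345678
    lui   a0, 0x12345        # a0 = 0x12345000
    addiw a0, a0, 0x678      # a0 = 0x12345678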
Unfortunately, this loses some CSE opportunities when an LUI value is used
by multiple constants with different LSBs.
This feature is off by default and a new backend command line option is
added to enable it for testing.
This avoids the spill and reloads reported in #69586.
In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same block case, not the cross block case. This led to the performance regression reported in https://github.com/llvm/llvm-project/issues/64282.
This change is a slightly ugly hack to side step the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers.
We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility.
Differential Revision: https://reviews.llvm.org/D156909
With `-fsanitize=kcfi` (Kernel Control-Flow Integrity), Clang emits
"kcfi" operand bundles to indirect call instructions. Similarly to
the target-specific lowering added in D119296, implement KCFI operand
bundle lowering for RISC-V.
This patch disables the generic KCFI pass for RISC-V in Clang, and
adds the KCFI machine function pass in `RISCVPassConfig::addPreSched`
to emit target-specific `KCFI_CHECK` pseudo instructions before calls
that have KCFI operand bundles. The machine function pass also bundles
the instructions to ensure we emit the checks immediately before the
calls, which is not possible with the generic pass.
`KCFI_CHECK` instructions are lowered in `RISCVAsmPrinter` to a
contiguous code sequence that traps if the expected hash in the
operand bundle doesn't match the hash before the target function
address. This patch emits an `ebreak` instruction for error handling
to match the Linux kernel's `BUG()` implementation. Just like for X86,
we also emit trap locations to a `.kcfi_traps` section to support
error handling, as we cannot embed additional information into the trap
instruction itself.
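Roughly, the emitted check has the following shape (registers, offset,
and the hash value are illustrative, not the exact lowering):

    lw    t1, -4(a0)          # type hash stored in front of the callee
    lui   t2, 0x12345         # expected hash from the operand bundle,
    addiw t2, t2, 0x678       # materialized into a scratch register
    beq   t1, t2, .Lcheck_ok  # hashes match: proceed with the call
    ebreak                    # mismatch: trap (recorded in .kcfi_traps)
    .Lcheck_ok:
    jalr  ra, 0(a0)           # the original indirect call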
Relands commit 62fa708ceb027713b386c7e0efda994f8bdc27e2 with fixed
tests.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D148385
With `-fsanitize=kcfi` (Kernel Control-Flow Integrity), Clang emits
"kcfi" operand bundles to indirect call instructions. Similarly to
the target-specific lowering added in D119296, implement KCFI operand
bundle lowering for RISC-V.
This patch disables the generic KCFI pass for RISC-V in Clang, and
adds the KCFI machine function pass in `RISCVPassConfig::addPreSched`
to emit target-specific `KCFI_CHECK` pseudo instructions before calls
that have KCFI operand bundles. The machine function pass also bundles
the instructions to ensure we emit the checks immediately before the
calls, which is not possible with the generic pass.
`KCFI_CHECK` instructions are lowered in `RISCVAsmPrinter` to a
contiguous code sequence that traps if the expected hash in the
operand bundle doesn't match the hash before the target function
address. This patch emits an `ebreak` instruction for error handling
to match the Linux kernel's `BUG()` implementation. Just like for X86,
we also emit trap locations to a `.kcfi_traps` section to support
error handling, as we cannot embed additional information into the trap
instruction itself.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D148385
Depends on D151395.
This is the 2nd patch of the patch-set. For the cover letter of the
patch-set, please check out D151395. This patch originates from
D121376.
This commit models vxrm by adding an immediate operand to the intrinsics
and machine instructions of the RVV fixed-point instructions `vaadd`,
`vaaddu`, `vasub`, and `vasubu`. This commit only covers intrinsics of
the four instructions; subsequent patches of the patch-set will do the
same to other RVV fixed-point instructions.
The current naive approach is to have a write to vxrm inserted before
every fixed-point instruction. This is done by the newly added pass
`RISCVInsertReadWriteCSR`. The pass is named in more general terms
because we will also model the rounding mode for the RVV floating-point
instructions. The approach will be improved in the future by applying
partial redundancy elimination algorithms to it.
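For illustration, the naive insertion produces a vxrm write in front of
each fixed-point instruction, matching the rounding-mode immediate
carried by its intrinsic (the mode values below are illustrative):

    csrwi    vxrm, 0          # rnu requested by the following vaadd
    vaadd.vv v8, v9, v10
    csrwi    vxrm, 2          # rdn requested by the following vasub
    vasub.vv v11, v12, v13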
The original LLVM intrinsics and machine instructions that do not model
the rounding mode (take `vaadd` as an example) are not removed in this
patch. That is, `int.riscv.vaadd.*` co-exists with
`int.riscv.vaadd.rm.*` after this patch. The next patch will add C
intrinsics of vaadd with an additional operand that models the control
of the rounding mode; in that patch, `int.riscv.vaadd.rm.*` will
replace `int.riscv.vaadd.*`.
Authored-by: ShihPo Hung <shihpo.hung@sifive.com>
Co-Authored-by: eop Chen <eop.chen@sifive.com>
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D151396
Rather than having a separate pass to add the hint instructions,
emit them directly into the streamer during asm printing.
Reviewed By: BeMg, kito-cheng
Differential Revision: https://reviews.llvm.org/D149511
Follow-up for 4ece50737d5385fb80cfa23f5297d1111f8eed39 (D142027).
Assignment Tracking Analysis now always runs and is skipped internally if
assignment tracking is disabled. Update these tests to expect to see the
pass run.
Buildbot failure: https://lab.llvm.org/buildbot/#/builders/57/builds/24094
Issue #58168 describes the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. For simple code, it's easy to
understand the stack layout and where space is being allocated, but in
more complex programs, where code may be heavily inlined, unrolled, and
have duplicated code paths, it is no longer easy to manually inspect the
source program and understand where stack space can be attributed.
This patch implements a machine function pass that emits remarks with a
textual representation of stack slots, and also outputs any available
debug information to map source variables to those slots.
The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout`
to the compiler invocation. Like other remarks, the diagnostic
information can be saved to a file in a machine-readable format by
adding -fsave-optimization-record.
Fixes: #58168
Reviewed By: nickdesaulniers, thegameg
Differential Revision: https://reviews.llvm.org/D135488
Issue #58168 describes the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. For simple code, it's easy to
understand the stack layout and where space is being allocated, but in
more complex programs, where code may be heavily inlined, unrolled, and
have duplicated code paths, it is no longer easy to manually inspect the
source program and understand where stack space can be attributed.
This patch implements a machine function pass that emits remarks with a
textual representation of stack slots, and also outputs any available
debug information to map source variables to those slots.
The new behavior can be used by adding `-Rpass-analysis=stack-frame-layout`
to the compiler invocation. Like other remarks, the diagnostic
information can be saved to a file in a machine-readable format by
adding -fsave-optimization-record.
Fixes: #58168
Reviewed By: nickdesaulniers, thegameg
Differential Revision: https://reviews.llvm.org/D135488
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it's set, an additional element:
(start-pc, size, features, stack-args-size)
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
As stated in
https://discourse.llvm.org/t/rfc-llc-add-expandlargeintfpconvert-pass-for-fp-int-conversion-of-large-bitint/65528,
this implementation is very similar to ExpandLargeDivRem, which expands
`fptoui .. to`, `fptosi .. to`, `uitofp .. to`, and `sitofp .. to`
instructions with a bitwidth above a threshold into auto-generated
functions. This is useful for targets like x86_64 that cannot lower fp
conversions with more than 128 bits. The expanded code is modeled on the
IR generated by `compiler-rt/lib/builtins/floattidf.c`,
`compiler-rt/lib/builtins/fixdfti.c`, etc.
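A minimal example of an instruction this pass expands (the bit width is
chosen for illustration):

    define i256 @wide_fptoui(double %x) {
      %r = fptoui double %x to i256
      ret i256 %r
    }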
Corner cases:
1. For fp16: as there are no related builtins in compiler-rt, the
implementation mainly uses the fp32 <-> fp16 libcalls.
2. For fp80: this pass is soft-fp emulation and no fp80 instructions can
help here, so I recommend users avoid this usage. For now, the
implementation uses fp128 as the temporary conversion type and inserts
fptrunc/ext at the top/end of the function.
3. For bf16: as the clang FE currently doesn't support bf16 arithmetic
operations (convert to int, float, +, -, *, ...), this patch doesn't
consider bf16 for now.
4. For unsigned FPToI: since both the default hardware behavior and
libgcc ignore the "returns 0 for negative input" spec, this pass follows
the old way and ignores it for unsigned FPToI. See this example:
https://gcc.godbolt.org/z/bnv3jqW1M
The end-to-end tests are uploaded at https://reviews.llvm.org/D138261
Reviewed By: LuoYuanke, mgehre-amd
Differential Revision: https://reviews.llvm.org/D137241
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it's set, an additional element:
(start-pc, size, features, stack-args-size)
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
Originally, `OptLevel` wasn't passed into the `MachineFunctionPass`.
This let the default parameter of `SelectionDAGISel`, which is
`CodeGenOpt::Default`, be passed in. OptLevelChanger captures the
optimization level from the parameter rather than the value within
`TargetMachine`, so the optimization level could be unintentionally
overwritten if a value other than `CodeGenOpt::Default` was passed.
This patch fixes this by passing the optimization level rather
than using the default value.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D126641
When optimizing for size, this pass searches for instructions that are
prevented from being compressed by one of the following:
1. The use of a single uncompressed register.
2. A base register + offset where the offset is too large to be
compressed and the base register may or may not already be compressed.
In the first case, if there is a compressed register available, then the
uncompressed register is copied to the compressed register and its uses
replaced. This is only done if there are enough uses that code size
would be improved.
In the second case, if a compressed register is available, then the
original base register is copied and adjusted such that:
new_base_register = base_register + adjustment
base_register + large_offset = new_base_register + small_offset
and the uses of the base register are replaced with the new base
register. Again this is only done if there are enough uses for code size
to be improved.
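A sketch of the second case (registers, offsets, and the adjustment are
illustrative):

    # Before: the offsets are too large for the compressed encodings, so
    # none of these loads compress.
        lw   a0, 400(sp)
        lw   a1, 404(sp)
        lw   a2, 408(sp)
    # After: copy and adjust the base so the remaining offsets fit
    # (new_base_register = sp + 384).
        addi a5, sp, 384
        c.lw a0, 16(a5)
        c.lw a1, 20(a5)
        c.lw a2, 24(a5)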
This pass was authored by Lewis Revill, with large offset optimization
added by Craig Blackmore.
Differential Revision: https://reviews.llvm.org/D92105