This patch adds basic support for `MachinePipeliner` and disables
it by default.
The functionality should be correct; all llvm-test-suite tests
pass.
This looks like a rather weird change, so let me explain why this isn't
as unreasonable as it looks. Let's start with the problem it's solving.
```
define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5

bb2:                                              ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5

bb5:                                              ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}
```
Right now, we codegen this as:
```
li a3, 1
li a2, 13
bne a1, a3, .LBB0_2
lw a2, 4(a0)
.LBB0_2:
mv a0, a2
ret
```
In this example, we have two values which must be assigned to a0 per the
ABI (%arg and the return value). SelectionDAG ensures that all values
used in a successor phi are defined before exiting the predecessor block.
This creates an ADDI to materialize the immediate in the entry block.
Currently, this ADDI is not sunk into the tail block because we'd have
to split a critical edge to do so. Note that if our immediate were
anything large enough to require two instructions we *would* split this
critical edge.
Looking at other targets, we notice that they don't seem to have this
problem. They perform the sinking and tail duplication that we don't.
Why? Well, it turns out for AArch64 that this is entirely an accident of
the existence of the gpr32all register class. The immediate is
materialized into the gpr32 class, and then copied into the gpr32all
register class. The existence of that copy puts us right back into the
two-instruction case noted above.
This change essentially just bypasses this emergent aspect of the
AArch64 behavior, and implements the same "always sink immediates"
behavior for RISC-V as well.
The original goal of this pass was to focus on vector operations with
VLMAX. However, users often utilize only part of the result, and such
usage may come from the vectorizer.
We found that relaxing this constraint can capture more optimization
opportunities, such as non-power-of-2 code generation and vector
operation sequences with different VLs.
---------
Co-authored-by: Kito Cheng <kito.cheng@sifive.com>
The InitUndef pass currently uses target-specific pseudo instructions,
with one pseudo per register class.
Instead, add a generic pseudo instruction, which can be used by all
targets and register classes.
Previously, for vector peepholes that fold based on VL, we checked if the
VLMAX was the same as a proxy for checking that the EEWs were the same. This
only worked at LMUL >= 1, where the EMULs of the Src output and the user's
input had to be the same because the register classes needed to match.
At fractional LMULs we would have incorrectly folded something like
this:
%x:vr = PseudoVADD_VV_MF4 $noreg, $noreg, $noreg, 4, 4 /* e16 */, 0
%y:vr = PseudoVMV_V_V_MF8 $noreg, %x, 4, 3 /* e8 */, 0
This models the EEW of the destination operands of vector instructions
with a TSFlag, which is enough to fix the incorrect folding.
There's some overlap with the TargetOverlapConstraintType and
IsRVVWideningReduction. If we model the source operands as well we may
be able to subsume them.
This patch prepares the NFC groundwork for global outlining using
CGData, which will follow
https://github.com/llvm/llvm-project/pull/90074.
- The `MinRepeats` parameter is now explicitly passed to the
`getOutliningCandidateInfo` function, rather than relying on a default
value of 2 (see the sketch after this list). For local outlining, the
minimum number of repetitions is typically 2, but for global outlining
(mentioned above), we will optimistically create a single `Candidate`
for each `OutlinedFunction` if stable hashes match a specific code
sequence. This parameter is adjusted accordingly in global outlining
scenarios.
- I have also switched `OutlinedFunction` to `unique_ptr` to ensure
safe and efficient memory management within `FunctionList`, avoiding
unnecessary implicit copies.
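A rough sketch of the reworked hook (the return type and parameter names here follow the description above; the exact in-tree signature may differ slightly):
```cpp
// Sketch only: MinRepeats is now an explicit argument instead of an implicit
// constant of 2, and the result is owned via unique_ptr.
virtual std::unique_ptr<outliner::OutlinedFunction>
getOutliningCandidateInfo(std::vector<outliner::Candidate> &RepeatedSequenceLocs,
                          unsigned MinRepeats) const;

// Local outlining keeps the old behaviour of requiring at least two repeats:
//   auto OF = TII->getOutliningCandidateInfo(Candidates, /*MinRepeats=*/2);
// Global outlining can pass a smaller value (e.g. 1) and optimistically create
// a single Candidate per OutlinedFunction when stable hashes match.
```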
This depends on https://github.com/llvm/llvm-project/pull/101461.
This is a patch for
https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753.
The renamable flag is useful during MachineCopyPropagation, but it is
dropped by lowerCopy in some cases.
This patch introduces extra arguments to pass the renamable flag to
copyPhysReg.
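Concretely, the hook gains two trailing flags; a sketch (the default values are an assumption on my part):
```cpp
// The two extra booleans forward the renamable state of the destination and
// source operands of the COPY to the target's copyPhysReg implementation.
virtual void copyPhysReg(MachineBasicBlock &MBB,
                         MachineBasicBlock::iterator MI, const DebugLoc &DL,
                         MCRegister DestReg, MCRegister SrcReg, bool KillSrc,
                         bool RenamableDest = false,
                         bool RenamableSrc = false) const;
```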
As noted in
https://github.com/llvm/llvm-project/pull/100367/files#r1695442138,
RISCVMaskedPseudoInfo currently stores two things, whether or not a
masked pseudo has an unmasked variant, and whether or not it's
element-wise.
These are separate things, so this patch splits the latter out into the
underlying instruction's TSFlags to help make the semantics of #100367
more clear.
To the best of my knowledge the only non-element-wise instructions in V
are:
- vredsum.vs and other reductions
- vcompress.vm
- vms*f.m
- vcpop.m and vfirst.m
- viota.m
In vector crypto, the instructions that operate on element groups are
conservatively marked (this might be fine to relax later, given that VLs that
are not a multiple of the EGS are reserved), as are the SiFive extensions and
XTHeadVdot.
This is just like AArch64.
Changing the threshold to 6 will increase the code size, but will
also decrease the number of unconditional branches. CPUs with wide fetch/issue
units can benefit from it.
The value 6 may be debatable; we could set it to `SchedModel.IssueWidth` instead.
When folding, we currently check if the pseudo's result is not lanewise
(e.g. vredsum.vs or viota.m) and bail if we're changing the mask.
However, we also need to check the AVL.
This patch bails if the AVL changed for these pseudos, and also renames
the pseudo table property to be more explicit.
This adds initial support for rematerializing vector instructions,
starting with vid.v since it's simple and has the fewest
operands. It has one passthru operand, which we need to check is
undefined. It also has an AVL operand, but it's fine to rematerialize
with it because it's scalar and register allocation is split between
vector and scalar.
RISCVInsertVSETVLI can still happen before vector regalloc if
-riscv-vsetvl-after-rvv-regalloc is false, so this makes sure that we
only rematerialize after regalloc by checking for the implicit uses that
are added.
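A condensed sketch of the check described above (helper and opcode names follow my reading of the existing RISC-V APIs; the in-tree version may differ in detail):
```cpp
bool RISCVInstrInfo::isReallyTriviallyReMaterializable(
    const MachineInstr &MI) const {
  switch (RISCV::getRVVMCOpcode(MI.getOpcode())) {
  case RISCV::VID_V:
    // The passthru (operand 1) must be undef, and the pseudo must not yet
    // carry the implicit VL/VTYPE uses that RISCVInsertVSETVLI adds, i.e.
    // vsetvli insertion must be scheduled to run after rvv regalloc.
    if (MI.getOperand(1).isUndef() &&
        !MI.hasRegisterImplicitUseOperand(RISCV::VL) &&
        !MI.hasRegisterImplicitUseOperand(RISCV::VTYPE))
      return true;
    break;
  }
  return TargetInstrInfo::isReallyTriviallyReMaterializable(MI);
}
```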
We split target-dependent MachineCombiner patterns into their
target-specific folders.
This makes MachineCombiner much more target-independent.
Reviewers:
davemgreen, asavonic, rotateright, RKSimon, lukel97, LuoYuanke, topperc, mshockwave, asi-sc
Reviewed By: topperc, mshockwave
Pull Request: https://github.com/llvm/llvm-project/pull/87991
This restructures the code to make it more obvious that most of
getVLENFactoredAmount is just a generic multiply with an immediate, and
prepares for a couple of upcoming enhancements to this code.
Note that I plan to switch mulImm to early return, but decided I'd do
that as a separate commit to keep this diff readable.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
This TSFlags was introduced by https://reviews.llvm.org/D108767.
A base class for all RISC-V register classes is added, and we store
IsVRegClass/VLMul/NF in its TSFlags, with helpers added to retrieve them.
This removes some lines of code, and I think there will be more uses.
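Roughly, the helpers look like this (a sketch based on the description; the bit layout and helper spellings are assumptions, not the exact in-tree code):
```cpp
namespace RISCVRI {
// Assumed TSFlags layout for RISC-V register classes:
//   [0]    IsVRegClass
//   [3:1]  VLMul
//   [6:4]  NF (stored as NF - 1)
static inline bool isVRegClass(uint64_t TSFlags) { return TSFlags & 1; }
static inline unsigned getVLMul(uint64_t TSFlags) { return (TSFlags >> 1) & 7; }
static inline unsigned getNF(uint64_t TSFlags) {
  return ((TSFlags >> 4) & 7) + 1;
}
} // namespace RISCVRI

// Usage: query the register class instead of keeping hard-coded lists, e.g.
//   if (RISCVRI::isVRegClass(RC->TSFlags)) ...
```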
Reviewers: preames, topperc
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/84894
This is another part of #70452 which makes getMemOperandsWithOffsetWidth
use a LocationSize for Width, as opposed to the unsigned it currently
uses. The advantage on its own is not very large if
getMemOperandsWithOffsetWidth usually uses known sizes, but if the
values can come from an MMO it can help be more accurate in case they
are Unknown (and, in the future, scalable).
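The practical difference is that `LocationSize` can represent an unknown size explicitly (illustrative snippet; `clusterByBytes` is a made-up caller):
```cpp
// An unsigned Width has to smuggle "unknown" through a magic value;
// LocationSize carries it explicitly and can later model scalable sizes too.
LocationSize Known = LocationSize::precise(4);               // a 4-byte access
LocationSize Unknown = LocationSize::beforeOrAfterPointer(); // size not known
if (Known.hasValue())
  clusterByBytes(Known.getValue());
```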
When using Greedy Register Allocation, there are times where
early-clobber values are ignored and assigned the same register. This
is illegal behaviour for these instructions. To get around this, using
Pseudo instructions for early-clobber registers gives them a definition
and allows Greedy to assign them to a different register. This then
meets the ARM Architecture Reference Manual and matches the defined
behaviour.
This patch takes the existing RISC-V patch and makes it target
independent, then adds support for the ARM architecture. Doing this will
ensure early-clobber constraints are followed when using the ARM
architecture. Making the pass target independent also opens up the
possibility of adding support for other architectures in the future.
This helper function handles common cases where we can determine a
constant value is being defined in a register. Although it looks like
codegen changes are possible due to this being called in
PeepholeOptimizer, my main motivation is to use this in
describeLoadedValue.
These are picked up from getMemOperandsWithOffsetWidth but weren't then
being passed through to shouldClusterMemOps, which forces backends to
collect the information again if they want to use the kind of heuristics
typically used for the similar shouldScheduleLoadsNear function (e.g.
checking the offset is within 1 cache line).
This patch just adds the parameters, but doesn't attempt to use them.
There is potential to use them in the current PPC and AArch64
shouldClusterMemOps implementation, and I intend to use the offset in
the heuristic for RISC-V. I've left these for future patches in the
interest of being as incremental as possible.
As noted in the review and in an inline FIXME, an ElementCount-style abstraction may later be used to condense these two parameters to one argument. ElementCount isn't quite suitable as it doesn't support negative offsets.
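The hook's shape after this change is roughly the following (a sketch; the in-tree parameter list may differ slightly):
```cpp
// The Offset*/OffsetIsScalable* parameters are simply forwarded from what
// getMemOperandsWithOffsetWidth already computed for each memory operation.
virtual bool shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
                                 int64_t Offset1, bool OffsetIsScalable1,
                                 ArrayRef<const MachineOperand *> BaseOps2,
                                 int64_t Offset2, bool OffsetIsScalable2,
                                 unsigned NumLoads, unsigned NumBytes) const;
```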
This adds minimal support for load clustering, but disables it by
default. The intent is to iterate on the precise heuristic and the
question of turning this on by default in a separate PR. Although
previous discussion indicates hope that the MachineScheduler would
replace most uses of the SelectionDAG scheduler, it does seem most
targets aren't using MachineScheduler load clustering right now:
PPC+AArch64 seem to just use it to help with paired load/store formation,
and although AMDGPU uses it for general clustering, it also implements
ShouldScheduleLoadsNear for the SelectionDAG scheduler's clustering.
This hook is called by the default implementation of
getMemOperandWithOffset and by the load/store clustering code in the
MachineScheduler though this isn't enabled by default and is not yet
enabled for RISC-V. Only return true for queries on scalar loads/stores
for now (this is a conservative starting point, and vector load/store
can be handled in a follow-on patch).
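A condensed sketch of that conservative starting point (details abridged; the in-tree implementation performs more checks):
```cpp
bool RISCVInstrInfo::getMemOperandWithOffsetWidth(
    const MachineInstr &LdSt, const MachineOperand *&BaseReg, int64_t &Offset,
    unsigned &Width, const TargetRegisterInfo *TRI) const {
  if (!LdSt.mayLoadOrStore())
    return false;
  // Only handle simple scalar accesses of the form "reg + imm" with a single
  // memory operand; vector loads/stores are left for a follow-on patch.
  if (LdSt.getNumExplicitOperands() != 3 || !LdSt.getOperand(1).isReg() ||
      !LdSt.getOperand(2).isImm() || !LdSt.hasOneMemOperand())
    return false;
  Width = (*LdSt.memoperands_begin())->getSize();
  BaseReg = &LdSt.getOperand(1);
  Offset = LdSt.getOperand(2).getImm();
  return true;
}
```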
Don't blindly copy the original flags from the pre-reassociated
instructions.
This copied the integer poison flags, which are not safe to preserve
after reassociation.
For the FP flags, I think we should only keep the intersection of
the flags. Override setSpecialOperandAttr to do this.
Fixes #72777.
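A minimal sketch of the intended override (the flag mask is illustrative; the in-tree code may intersect the flag sets differently):
```cpp
void RISCVInstrInfo::setSpecialOperandAttr(MachineInstr &OldMI1,
                                           MachineInstr &OldMI2,
                                           MachineInstr &NewMI1,
                                           MachineInstr &NewMI2) const {
  // Keep only the fast-math flags present on *both* original instructions and
  // drop the integer poison flags (nuw/nsw/exact), which are not safe to
  // preserve across reassociation.
  unsigned FMFlags = MachineInstr::FmNoNans | MachineInstr::FmNoInfs |
                     MachineInstr::FmNsz | MachineInstr::FmArcp |
                     MachineInstr::FmContract | MachineInstr::FmAfn |
                     MachineInstr::FmReassoc;
  unsigned Intersected = OldMI1.getFlags() & OldMI2.getFlags() & FMFlags;
  NewMI1.setFlags(Intersected);
  NewMI2.setFlags(Intersected);
}
```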
This hook is called by the target-independent implementation of
TargetInstrInfo::describeLoadedValue. I've opted to test it via a C++
unit test, which although fiddly to set up seems the right way to test a
function with such clear intended semantics (rather than testing the
impact indirectly).
isAddImmediate will never recognise ADDIW as an add immediate which I
_think_ is conservatively correct, as the caller may not understand its
semantics vs ADDI.
Note that although the doc comment for isAddImmediate specifies its
behaviour solely in terms of physical registers, none of the current
in-tree implementations (including this one) bail out on virtual
registers (see #72357).
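A sketch of the implementation being described (close to how I'd expect it to read in-tree, but treat the details as illustrative):
```cpp
std::optional<RegImmPair>
RISCVInstrInfo::isAddImmediate(const MachineInstr &MI, Register Reg) const {
  // Only describe MI if it defines the register we were asked about.
  const MachineOperand &Op0 = MI.getOperand(0);
  if (!Op0.isReg() || Op0.getReg() != Reg)
    return std::nullopt;
  // ADDIW is deliberately not recognised: callers may not understand its
  // sign-extending semantics, so only plain ADDI is reported.
  if (MI.getOpcode() == RISCV::ADDI && MI.getOperand(1).isReg() &&
      MI.getOperand(2).isImm())
    return RegImmPair{MI.getOperand(1).getReg(), MI.getOperand(2).getImm()};
  return std::nullopt;
}
```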
DAGCombiner, as well as InstCombine, tend to canonicalize GE/LE into
GT/LT, namely:
```
X >= C --> X > (C - 1)
```
This sometimes generates off-by-one constants that could have been CSE'd
with surrounding constants.
Instead of changing such canonicalization, this patch tries to swap
those branch conditions post-isel, in the hope of resurfacing more
constant CSE opportunities. More specifically, it performs the following
optimization:
For two constants C0 and C1 from
```
li Y, C0
li Z, C1
```
To remove the redundant `li Y, C0`:
1. if C1 = C0 + 1 we can turn:
   (a) blt Y, X -> bge X, Z
   (b) bge Y, X -> blt X, Z
2. if C1 = C0 - 1 we can turn:
   (a) blt X, Y -> bge Z, X
   (b) bge X, Y -> blt Z, X
This optimization will be done by PeepholeOptimizer through
RISCVInstrInfo::optimizeCondBranch.
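For instance, with C0 = 5 and C1 = 6 (register assignment is illustrative):
```
# before: two nearby constants
li   a1, 5
li   a2, 6
blt  a1, a0, .LBB0_2      # taken when a0 > 5

# after: C1 = C0 + 1, so "blt Y, X" becomes "bge X, Z"
li   a2, 6
bge  a0, a2, .LBB0_2      # taken when a0 >= 6 (equivalent); li a1, 5 is now dead
```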
Call this method directly from each vector case with the correct
arguments. This allows us to treat each type of copy as its own
special case and not pass variables to a common merge point. This
is similar to how AArch64 is structured.
I think I can reduce the number of operands to this new method, but
I'll do that as a follow up.
This uses the recently introduced sink-and-fold support in MachineSink.
https://reviews.llvm.org/D152828
This enables folding ADDI into load/store addresses.
Enabling by default will be a separate PR.
The handling for vector pseudos in hasAllNBitUsers is duplicated across
RISCVISelDAGToDAG and RISCVOptWInstrs. This deduplicates it between the two,
with the common denominator between the two call sites being the opcode and
SEW: we need to handle extracting these separately since one operates at the
SelectionDAG level and the other at the MachineInstr level.
With `-fsanitize=kcfi` (Kernel Control-Flow Integrity), Clang emits
"kcfi" operand bundles to indirect call instructions. Similarly to
the target-specific lowering added in D119296, implement KCFI operand
bundle lowering for RISC-V.
This patch disables the generic KCFI pass for RISC-V in Clang, and
adds the KCFI machine function pass in `RISCVPassConfig::addPreSched`
to emit target-specific `KCFI_CHECK` pseudo instructions before calls
that have KCFI operand bundles. The machine function pass also bundles
the instructions to ensure we emit the checks immediately before the
calls, which is not possible with the generic pass.
`KCFI_CHECK` instructions are lowered in `RISCVAsmPrinter` to a
contiguous code sequence that traps if the expected hash in the
operand bundle doesn't match the hash before the target function
address. This patch emits an `ebreak` instruction for error handling
to match the Linux kernel's `BUG()` implementation. Just like for X86,
we also emit trap locations to a `.kcfi_traps` section to support
error handling, as we cannot embed additional information into the trap
instruction itself.
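For reference, the emitted check has roughly the following shape (registers, offset, and hash value are purely illustrative; the real sequence is produced by `RISCVAsmPrinter` when lowering `KCFI_CHECK`):
```
        lw    t1, -4(a0)          # load the type hash stored before the callee
        lui   t2, 0x12345         # materialize the expected hash from the
        addi  t2, t2, 0x678       # operand bundle (0x12345678 here)
        beq   t1, t2, .Lcheck_ok  # hashes match: proceed to the call
.Ltrap:
        ebreak                    # mismatch: trap (location recorded in .kcfi_traps)
.Lcheck_ok:
        jalr  a0                  # the protected indirect call
```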
Relands commit 62fa708ceb027713b386c7e0efda994f8bdc27e2 with fixed
tests.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D148385
This commit implements the two NTLH intrinsic functions.
```
type __riscv_ntl_load (type *ptr, int domain);
void __riscv_ntl_store (type *ptr, type val, int domain);
```
```
enum {
  __RISCV_NTLH_INNERMOST_PRIVATE = 2,
  __RISCV_NTLH_ALL_PRIVATE,
  __RISCV_NTLH_INNERMOST_SHARED,
  __RISCV_NTLH_ALL
};
```
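As a usage example (assuming the intrinsics are exposed through the `riscv_ntlh.h` header, as I believe Clang does):
```c
#include <riscv_ntlh.h>

int stream_copy_elem(int *src, int *dst) {
  /* Each access carries a non-temporal hint for the chosen domain and is
     lowered to the matching Zihintntl instruction (ntl.p1, ntl.pall, ntl.s1,
     ntl.all) placed in front of the load/store. */
  int v = __riscv_ntl_load(src, __RISCV_NTLH_INNERMOST_PRIVATE);
  __riscv_ntl_store(dst, v, __RISCV_NTLH_ALL);
  return v;
}
```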
We encode the non-temporal domain into MachineMemOperand flags.
1. Create the RISC-V built-in functions with custom semantic checking.
2. Assume the domain argument is a compile-time constant,
and encode it as LLVM IR metadata (a nontemporal node).
3. Encode the domain value as two MachineMemOperand TargetMMOFlag bits.
4. Select the corresponding NTLH instruction according to the MachineMemOperand TargetMMOFlag.
Currently, it supports scalar type and fixed-length vector type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D143364
It was only in RISCVInstrInfo because it was used by 2 passes, but those
passes have been merged in D147173.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D147174
I think it's good practice to avoid having default ctors unless they're really
valid/useful. For OutlinedFunction the default ctor was used to represent a
bail-out value for getOutliningCandidateInfo(), so I changed the API to return
an optional<OutlinedFunction> instead, which seems a tad cleaner.
Differential Revision: https://reviews.llvm.org/D146375
D145471 added overrides of the other signature to return MemBytes,
but shouldn't have removed these overrides.
These signatures will now call the MemBytes signature and ignore
the MemBytes. This matches X86.
Reference: https://reviews.llvm.org/D44782
After https://reviews.llvm.org/D130302, LW+SEXT.B can be folded into LB
as a partial reload from a stack slot. This produced incorrect optimization
results from `StackSlotColoring`, because it was not given the exact number of
bytes loaded from the stack: LB+SW were misinterpreted as a full reload/store
of the stack slot without the sign extension, so the SW was considered a
redundant store.
The testcase is copied from llvm/test/CodeGen/X86/pr30821.mir.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D145471
For in-order cores, MachineCombiner makes better decisions when the critical
path is calculated only for the current basic block and does not take into
account other blocks from the trace.
This patch adds a virtual method to TargetInstrInfo to allow each target to
decide which strategy to use.
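A sketch of the hook and a possible target override (the hook and enum spellings follow my understanding of the final patch; treat them as illustrative):
```cpp
// In TargetInstrInfo: default to the existing whole-trace behaviour.
virtual MachineTraceStrategy getMachineCombinerTraceStrategy() const {
  return MachineTraceStrategy::TS_MinInstrCount;
}

// In a target: restrict the critical path to the local block for in-order
// cores, where the whole-trace estimate is often misleading.
MachineTraceStrategy RISCVInstrInfo::getMachineCombinerTraceStrategy() const {
  return STI.getSchedModel().isOutOfOrder()
             ? MachineTraceStrategy::TS_MinInstrCount
             : MachineTraceStrategy::TS_Local;
}
```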
Depends on D140541
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D140542
The motivation behind this patch is to unify some of the outliner logic across architectures. This looks nicer in general and makes fixing [issues like this](https://reviews.llvm.org/D124707#3483805) easier.
There are some notable changes here:
1. `isMetaInstruction()` is used directly instead of checking for specific meta-instructions like `IMPLICIT_DEF` or `KILL`. This was already done in the RISC-V implementation, but other architectures still did hardcoded checks.
- As an exception to this, CFI instructions are explicitly delegated to the target because RISC-V has different handling for those.
2. `isTargetIndex()` checks are replaced with an assert; none of the architectures supported actually use `MO_TargetIndex` at this point in time.
3. `isCFIIndex()` and `isFI()` checks are also replaced with asserts, since these operands should not exist in [any context](https://reviews.llvm.org/D122635#3447214) at this stage in the pipeline.
Reviewed by: paquette
Differential Revision: https://reviews.llvm.org/D125072
Move to RISCVInstrInfo since we need RISCVSubtarget now.
Instead of asking if only the lower 32 bits are used we can now
ask if the lower N bits are used. This will be needed by a future
patch.
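The generalized query looks roughly like this (sketch; the exact in-tree signature may differ):
```cpp
// Do all users of MI's result only read its low NBits bits?
bool RISCVInstrInfo::hasAllNBitUsers(const MachineInstr &MI,
                                     const MachineRegisterInfo &MRI,
                                     unsigned NBits) const;

// The old W-instruction question becomes the NBits == 32 special case:
//   bool OnlyLow32Used = TII->hasAllNBitUsers(MI, MRI, /*NBits=*/32);
```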