This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
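
A minimal sketch of what the rename looks like from a downstream caller's point of view. The enums below are stand-ins defined locally rather than taken from the LLVM headers, and `describe` is a hypothetical helper that exists only for this illustration.

```cpp
#include <cassert>
#include <string>

// Local stand-ins for the scoped enums; the real definitions live in the
// LLVM headers and may differ from this sketch.
enum class CodeGenOptLevel { None, Less, Default, Aggressive };
enum class CodeGenFileType { AssemblyFile, ObjectFile, Null };

// Hypothetical helper. Because both parameters are scoped enums, passing a
// plain integer or swapping the two arguments fails to compile instead of
// converting silently.
inline std::string describe(CodeGenOptLevel Opt, CodeGenFileType FT) {
  std::string S = (Opt == CodeGenOptLevel::Aggressive) ? "aggressive" : "other";
  S += (FT == CodeGenFileType::ObjectFile) ? "/object" : "/non-object";
  return S;
}
```

A caller that used to write `CodeGenOpt::Aggressive` and `CGFT_ObjectFile` would now write `describe(CodeGenOptLevel::Aggressive, CodeGenFileType::ObjectFile)`.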
On AArch64, it is safe to let the linker handle relaxation of
unconditional branches; in most cases, the destination is within range,
and the linker doesn't need to do anything. If the linker does insert
fixup code, it clobbers the x16 inter-procedural register, so x16 must
be available across the branch before linking. If x16 isn't available,
but some other register is, we can relax the branch either by spilling
x16 OR using the free register for a manually-inserted indirect branch.
This patch builds on D145211. While that patch is for correctness, this
one is for performance of the common case. As noted in
https://reviews.llvm.org/D145211#4537173, we can trust the linker to
relax cross-section unconditional branches across which x16 is
available.
Programs that use machine function splitting care most about the
performance of hot code at the expense of the performance of cold code,
so we prioritize minimizing hot code size.
Here's a breakdown of the cases:
Hot -> Cold [x16 is free across the branch]
Do nothing; let the linker relax the branch.
Cold -> Hot [x16 is free across the branch]
Do nothing; let the linker relax the branch.
Hot -> Cold [x16 used across the branch, but there is a free register]
Spill x16; let the linker relax the branch.
Spilling requires fewer instructions than manually inserting an
indirect branch.
Cold -> Hot [x16 used across the branch, but there is a free register]
Manually insert an indirect branch.
Spilling would require adding a restore block in the hot section.
Hot -> Cold [No free regs]
Spill x16; let the linker relax the branch.
Cold -> Hot [No free regs]
Spill x16 and put the restore block at the end of the hot function; let the linker relax the branch.
Ex:
[Hot section]
func.hot:
... hot code...
func.restore:
... restore x16 ...
B func.hot
[Cold section]
func.cold:
... spill x16 ...
B func.restore
Putting the restore block at the end of the function instead of
just before the destination increases the cost of executing the
restore, but it avoids putting cold code in the middle of hot code.
Since the restore is very rarely taken, this is a worthwhile
tradeoff.
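
The breakdown of cases above can be read as a small decision function. This is only a sketch of the table, not the code in the patch; the enum and function names are hypothetical.

```cpp
#include <cassert>

enum class Direction { HotToCold, ColdToHot };

enum class Strategy {
  LetLinkerRelax,       // do nothing; x16 is free for the linker's fixup code
  SpillX16,             // spill x16, then let the linker relax the branch
  IndirectBranch,       // manually insert an indirect branch via a free register
  SpillX16RestoreAtEnd  // spill x16; restore block goes at the end of the hot function
};

// Hypothetical helper mirroring the six cases listed in the message.
inline Strategy chooseStrategy(Direction Dir, bool X16Free, bool OtherRegFree) {
  if (X16Free)
    return Strategy::LetLinkerRelax;
  if (Dir == Direction::HotToCold)
    return Strategy::SpillX16; // same choice with or without a free register
  return OtherRegFree ? Strategy::IndirectBranch
                      : Strategy::SpillX16RestoreAtEnd;
}
```

Note that `OtherRegFree` only matters for cold-to-hot branches: for hot-to-cold branches the spill is preferred either way because it keeps the hot code smaller.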
Differential Revision: https://reviews.llvm.org/D156767
This is intended to be a non-functional change. This patch removes
OBSCURE_COPY in favour of using `forceDisableTriviallyReMaterializable`.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D159194
This is a way to prevent the register allocator from inserting instructions
which behave differently for different runtime vector-lengths, inside a
call-sequence which changes the streaming-SVE mode before/after the call.
I've considered using BUNDLEs in Machine IR, but found that using this is
not possible for a few reasons:
* Most passes don't look inside BUNDLEs, but some passes would need to
look inside these call-sequence bundles, for example the PrologEpilog
pass (to remove the CALLSEQSTART/END), a PostRA pass to remove COPY
instructions, or the AArch64PseudoExpand pass.
* Within the streaming-mode-changing call sequence, one of the instructions
is a CALLSEQEND. The corresponding CALLSEQSTART (AArch64::ADJCALLSTACKDOWN)
is outside this sequence. This means we'd end up with a BUNDLE that has
[SMSTART, COPY, BL, ADJCALLSTACKUP, COPY, SMSTOP]. The MachineVerifier
doesn't accept this, and we also can't move the CALLSEQSTART into the
call sequence.
Maybe in the future we could model this differently by modelling
the runtime vector-length as a value that's used by certain operations
(similar to e.g. NZCV flags) and clobbered by SMSTART/SMSTOP, such that the
register allocator can consider these as actual dependences and avoid
rematerialization. For now we just want to address the immediate problem.
Reviewed By: paulwalker-arm, aemerson
Differential Revision: https://reviews.llvm.org/D159193
This patch contains a few changes:
* It changes the alignment of the strided/contiguous ZPR2/ZPR4 registers to
128-bits. This is important, because when we spill these registers to the
stack, the address doesn't need to be 256/512 bits aligned because we
split the single-store/reload pseudo instruction up into multiple
STR_ZXI/LDR_ZXI (single vector store/load) instructions, which only
require a 128-bit alignment. Additionally, an alignment larger than the
stack-alignment is not supported for scalable vectors.
* It adds support for these register classes in storeRegToStackSlot,
loadRegFromStackSlot and copyPhysReg.
* It adds tests only for the strided forms. There is no need to also
test the contiguous forms, because registers such as z2_z3 or
z4_z5_z6_z7 are also part of the regular ZPR2 and ZPR4 register classes,
respectively, which are already covered and tested.
Reviewed By: dtemirbulatov
Differential Revision: https://reviews.llvm.org/D159189
Machine function splitting + branch relaxation currently don't properly
handle inline asm goto blocks that conditionally branch to cold goto
labels. While such inline asm is technically invalid, machine
function splitting is the only thing that exposes it as such.
Since machine function splitting doesn't help too much in these
circumstances anyway, disable it for asm goto blocks and their targets.
Differential Revision: https://reviews.llvm.org/D158647
Jump tables on AArch64 are label-relative rather than table-relative, so
having jump table destinations that are in different sections causes
problems with relocation. Jump table lookups have a max range of 1MB, so
all destinations must be in the same section as the lookup code. Both of
these restrictions can be mitigated with some careful and complex logic,
but doing so doesn't gain a huge performance benefit.
Efficiently ensuring jump tables are correct and can be compressed on
AArch64 is a TODO item. In the meantime, don't split blocks that can
cause problems.
Differential Revision: https://reviews.llvm.org/D157124
Because unconditional branch relaxation on AArch64 grows the stack to
spill a register, splitting a function would cause the red zone to be
overwritten. Explicitly disable MFS for such functions.
Differential Revision: https://reviews.llvm.org/D157127
When matching FNEG patterns for the MachineCombiner we need to check for
opcodes first, before trying to extract a register from an operand.
Otherwise handling of instructions with non-register operands causes the
compiler to crash.
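
A rough sketch of the ordering issue, using simplified stand-in types rather than MachineInstr and MachineOperand. All names below are hypothetical; the point is only that the opcode test happens before any operand is treated as a register.

```cpp
#include <cassert>
#include <vector>

enum class Opcode { FNEG, FMADD, MOVImm };

// Stand-in operand: either a register or an immediate.
struct Operand {
  bool IsReg;
  unsigned Value; // register number or immediate value
  unsigned getReg() const {
    assert(IsReg && "operand is not a register");
    return Value;
  }
};

struct MInstr {
  Opcode Op;
  std::vector<Operand> Operands;
};

// Check the opcode first. Only an FNEG is known to carry a register source
// operand, so getReg() is never reached for instructions such as MOVImm.
inline bool isFNegOfReg(const MInstr &MI, unsigned SrcReg) {
  if (MI.Op != Opcode::FNEG)
    return false;
  return MI.Operands.size() == 2 && MI.Operands[1].getReg() == SrcReg;
}
```

With the checks in the opposite order, the `MOVImm` case would hit the assertion in `getReg()`, which is the kind of crash the message describes.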
Differential Revision: https://reviews.llvm.org/D158473
Because the code layout is not known during compilation, the distance of
cross-section jumps is not knowable at compile-time. Because of this, we
should assume that any cross-sectional jumps are out of range. This
assumption is necessary for machine function splitting on AArch64, which
introduces cross-section branches in the middle of functions. The linker
relaxes out-of-range unconditional branches, but it clobbers X16 to do
so; it doesn't relax conditional branches, which must be manually
relaxed by the compiler.
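
The assumption described above could be sketched as follows. The `Block` type, the section IDs, and the caller-supplied displacement limit are all hypothetical; the real range check depends on the branch encoding.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a basic block: which section it lives in and its offset
// within that section.
struct Block {
  int SectionID;
  std::int64_t Offset;
};

// Returns true only when the branch is known to reach its destination.
// Cross-section distances are unknown until link time, so they are
// conservatively treated as out of range.
inline bool isKnownInRange(const Block &Src, const Block &Dest,
                           std::int64_t MaxDisplacement) {
  if (Src.SectionID != Dest.SectionID)
    return false;
  const std::int64_t Distance = Dest.Offset - Src.Offset;
  return Distance <= MaxDisplacement && Distance >= -MaxDisplacement;
}
```

A branch for which this returns false is a candidate for relaxation, even if the two sections happen to end up adjacent in the final layout.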
Differential Revision: https://reviews.llvm.org/D145211
In the machine outliner implementation for AArch64, `signOutlinedFunction()`
reimplements signing the LR value in prologue and authenticating it in
epilogue of the outlined function. This patch factors out `signLR()` and
`authenticateLR()` functions from AArch64FrameLowering code and reuses
them in `signOutlinedFunction()`.
The `mergeOutliningCandidateAttributes()` outliner callback is
introduced as well to further unify signing and authentication of the LR
value.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D157320
PATCHABLE_TYPED_EVENT_CALL and PATCHABLE_EVENT_CALL are pseudo
instructions that expand to XRay sleds, so getInstSizeInBytes
should reflect the size of the sleds, not the pseudo-instructions.
Differential Revision: https://reviews.llvm.org/D156272
This patch optimizes a pair of LDRSWpre and LDRSWui (or LDURSWi)
instructions into a single LDPSWpre instruction. This is a missing case
in D99272.
MIR test cases in D152564 are updated to verify the optimization.
Differential Revision: https://reviews.llvm.org/D152407
The AArch64Subtarget interface 'isNeonAvailable' is more appropriate going
forward, as we may also want to generate 'streaming SVE' code (not just
'streaming-compatible SVE' code), but here we must still make sure not to
use NEON instructions which are invalid in streaming SVE mode.
to help debug and report better diagnostics for functions like
relaxDwarfCallFrameFragment (D153167).
In MCStreamer, some emitCFI* functions already take a SMLoc argument. Add a
SMLoc argument to the remaining functions that generate a MCCFIInstruction.
Sometimes a developer would like to have more control over cmov vs branch. We have unpredictable metadata in LLVM IR, but currently it is ignored by the X86 backend. Propagate this metadata and avoid cmov->branch conversion in X86CmovConversion for cmov with this metadata.
Example:
```
int MaxIndex(int n, int *a) {
  int t = 0;
  for (int i = 1; i < n; i++) {
    // cmov is converted to branch by X86CmovConversion
    if (a[i] > a[t]) t = i;
  }
  return t;
}
int MaxIndex2(int n, int *a) {
  int t = 0;
  for (int i = 1; i < n; i++) {
    // cmov is preserved
    if (__builtin_unpredictable(a[i] > a[t])) t = i;
  }
  return t;
}
```
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D118118
Emit FNMADD instead of FNEG(FMADD) for optimization levels
above Oz when fast-math flags (nsz+contract) permit it.
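
A small numeric sketch of why a no-signed-zeros flag is relevant here. It models FNMADD as `-(A*B) - C` with a single rounding, which is an assumption about the instruction's semantics made for this illustration, and it uses `std::fma` in place of the machine instructions.

```cpp
#include <cassert>
#include <cmath>

// FNEG(FMADD): negate the result of a fused multiply-add.
inline double fnegOfFmadd(double A, double B, double C) {
  return -std::fma(A, B, C);
}

// FNMADD as modelled in this sketch: -(A*B) - C, rounded once.
inline double fnmadd(double A, double B, double C) {
  return std::fma(-A, B, -C);
}
```

For ordinary inputs the two agree, e.g. both give `-10.0` for `(2, 3, 4)`. For `(1, 1, -1)` the exact result is zero: negating the fused result yields `-0.0`, while the modelled FNMADD yields `+0.0`. The values compare equal but differ in sign, so the rewrite is only safe when signed zeros may be ignored.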
Differential Revision: https://reviews.llvm.org/D149260
PATCHABLE_* instructions expand to up to 36-byte
sleds. Updating the size of PATCHABLE instructions
causes them to be outlined, so we need to add a
check to prevent the outliner from considering
basic blocks that contain PATCHABLE instructions.
Differential Revision: https://reviews.llvm.org/D147982
The motivating example is in https://godbolt.org/z/45nbdYMK9
- For this example, `subs` is generated for the good case; `sub` followed by `cmp` is generated for the bad case. Since signed overflow is undefined behavior in C/C++ (indicated as `nsw` flag in LLVM IR), `subs` should be generated for the good case as well.
This patch relaxes one restriction from "quit optimization when V is used" to "continue if MI produces poison value when signed overflow occurs". This is not meant to be C/C++ specific since it looks at the 'NoSWrap' MachineInstr flag.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D146820
I think it's good practice to avoid having default ctors unless they're really
valid/useful. For OutlinedFunction the default ctor was used to represent a
bail-out value for getOutliningCandidateInfo(), so I changed the API to return
an optional<OutlinedFunction> instead which seems a tad cleaner.
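
A minimal sketch of the pattern, with a stand-in type instead of the real OutlinedFunction. The struct, its fields, the bail-out condition, and the overhead value are all hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Stand-in for an outlined-function description. There is deliberately no
// default constructor, so an "empty" value cannot be created by accident.
struct OutlinedFunctionSketch {
  unsigned SequenceSize;
  unsigned FrameOverhead;
  OutlinedFunctionSketch(unsigned Size, unsigned Overhead)
      : SequenceSize(Size), FrameOverhead(Overhead) {}
};

// Hypothetical callback: bail out with std::nullopt instead of returning a
// default-constructed object when there are too few candidates.
inline std::optional<OutlinedFunctionSketch>
getCandidateInfoSketch(const std::vector<unsigned> &CandidateSizes) {
  if (CandidateSizes.size() < 2)
    return std::nullopt;
  return OutlinedFunctionSketch(CandidateSizes.front(), /*Overhead=*/4);
}
```

Callers then test the result explicitly, e.g. `if (auto OF = getCandidateInfoSketch(Sizes)) use(*OF);`, rather than inspecting a sentinel object.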
Differential Revision: https://reviews.llvm.org/D146375
STG, STZG, ST2G, and STZ2G are the only load/store instructions whose
names append 'Offset' to denote the offset format. All other load/store
instructions use 'i' as the suffix. Unless there is a special reason for
the exception, we should make the naming consistent.
Differential Revision: https://reviews.llvm.org/D141819
The motivation behind this patch is to unify some of the outliner logic across architectures. This looks nicer in general and makes fixing [issues like this](https://reviews.llvm.org/D124707#3483805) easier.
There are some notable changes here:
1. `isMetaInstruction()` is used directly instead of checking for specific meta-instructions like `IMPLICIT_DEF` or `KILL`. This was already done in the RISC-V implementation, but other architectures still did hardcoded checks.
- As an exception to this, CFI instructions are explicitly delegated to the target because RISC-V has different handling for those.
2. `isTargetIndex()` checks are replaced with an assert; none of the architectures supported actually use `MO_TargetIndex` at this point in time.
3. `isCFIIndex()` and `isFI()` checks are also replaced with asserts, since these operands should not exist in [any context](https://reviews.llvm.org/D122635#3447214) at this stage in the pipeline.
Reviewed by: paquette
Differential Revision: https://reviews.llvm.org/D125072
Recommit with bug fixes + added testcases to the outliner. Also adds some
debug output.
We found a case in the Swift benchmarks where the MachineOutliner introduces
about a 20% compile time overhead in comparison to building without the
MachineOutliner.
The origin of this slowdown is that the benchmark has long blocks which incur
lots of LRU checks for lots of candidates.
Imagine a case like this:
```
bb:
i1
i2
i3
...
i123456
```
Now imagine that all of the outlining candidates appear early in the block, and
that something like, say, NZCV is defined at the end of the block.
The outliner has to check liveness for certain registers across all candidates,
because outlining from areas where those registers are used is unsafe at call
boundaries.
This is fairly wasteful because in the previously-described case, the outlining
candidates will never appear in an area where those registers are live.
To avoid this, precalculate areas where we will consider outlining from.
Anything outside of these areas is mapped to illegal and not included in the
outlining search space. This allows us to reduce the size of the outliner's
suffix tree as well, giving us a potential memory win.
By precalculating areas, we can also optimize other checks too, like whether
or not LR is live across an outlining candidate.
Doing all of this gives about a 16% compile-time improvement on this case.
This is likely useful for other targets (e.g. ARM + RISCV) as well, but for now,
this only implements the AArch64 path. The original "is the MBB safe" method
still works as before.
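
The precalculation idea might be sketched like this for a single tracked register (say, the flags). This is a simplified model, not the AArch64 implementation: the instruction type, the single-register liveness, and the definition of a "safe" instruction are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in instruction: does it read or write the tracked register?
struct InstrSketch {
  bool UsesFlags;
  bool DefsFlags;
};

using Range = std::pair<std::size_t, std::size_t>; // half-open [first, second)

// One backward walk computes liveness; one forward walk collects maximal
// runs of instructions around which the tracked register is dead.
inline std::vector<Range> safeRanges(const std::vector<InstrSketch> &Block,
                                     bool FlagsLiveOut) {
  const std::size_t N = Block.size();
  std::vector<bool> Safe(N, false);
  bool LiveAfter = FlagsLiveOut;
  for (std::size_t I = N; I-- > 0;) {
    const bool LiveBefore =
        Block[I].UsesFlags || (LiveAfter && !Block[I].DefsFlags);
    Safe[I] = !LiveBefore && !LiveAfter;
    LiveAfter = LiveBefore;
  }
  std::vector<Range> Ranges;
  std::size_t Start = 0;
  bool InRange = false;
  for (std::size_t I = 0; I < N; ++I) {
    if (Safe[I] && !InRange) {
      Start = I;
      InRange = true;
    } else if (!Safe[I] && InRange) {
      Ranges.emplace_back(Start, I);
      InRange = false;
    }
  }
  if (InRange)
    Ranges.emplace_back(Start, N);
  return Ranges;
}

// Example block: two plain instructions, a flag def, a flag use, one plain.
inline std::vector<InstrSketch> exampleBlock() {
  return {{false, false}, {false, false}, {false, true},
          {true, false},  {false, false}};
}
```

For `exampleBlock()` with the flags dead on exit, the safe ranges are `[0, 2)` and `[4, 5)`. Candidates are then only searched for inside those ranges, so the per-candidate liveness query is replaced by one pass per block.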
The Linux kernel sets SCTLR_EL1.BT0 and BT1 to 1 unconditionally, which
makes PACIASP equivalent to BTI C + PACIA LR,SP.
Use the shorter instruction sequence by default.
I'm not aware of anyone who needs the opposite. They are welcome to
revert to the current behavior under a subtarget feature or an
environment check.
This reverts commit 571c8c5263a79293aaadae07b11feb36726eaf53.
Differential Revision: https://reviews.llvm.org/D141978
Use deduction guides instead of helper functions.
The only non-automatic changes have been:
1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There were a few similar situations across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).
Per reviewers' comment, some useless makeArrayRef have been removed in the process.
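
A sketch of the first two points using a tiny stand-in class, since the real ArrayRef is not reproduced here. `Span` and the two helper functions are hypothetical, and the example relies on C++17 class template argument deduction from constructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in with the same constructor shapes the message mentions.
template <typename T> class Span {
  const T *Data = nullptr;
  std::size_t Length = 0;

public:
  Span(const T *data, std::size_t length) : Data(data), Length(length) {}
  Span(const T *begin, const T *end)
      : Data(begin), Length(static_cast<std::size_t>(end - begin)) {}
  Span(const std::vector<T> &vec) : Data(vec.data()), Length(vec.size()) {}
  const T *data() const { return Data; }
  std::size_t size() const { return Length; }
};

inline std::size_t emptyViewSize(const std::uint8_t *Ptr) {
  // A bare literal 0 converts equally well to size_t and to a null pointer,
  // so it would match both two-argument constructors. The cast picks one.
  Span View(Ptr, static_cast<std::size_t>(0));
  return View.size();
}

inline std::size_t vectorViewSize(const std::vector<int> &Storage) {
  // Brace initialization sidesteps the parse where `T x(Span(Storage));`
  // is read as a function declaration.
  Span View{Storage};
  return View.size();
}
```

In both helpers the element type is deduced from the constructor arguments, which is what lets the `make...` helper functions go away.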
This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.
Differential Revision: https://reviews.llvm.org/D140955
`-mcpu=` in `llvm/test/CodeGen/AArch64/machine-combiner.ll` is changed
to `neoverse-n2` to use FP16 and SVE/SVE2 instructions. As a result,
register allocation and/or instruction scheduling change slightly
and some existing `CHECK` lines need to be updated.
Differential Revision: https://reviews.llvm.org/D139809
With D134950, targets get notified when a virtual register is created and/or
cloned. Targets can respond as needed in the delegate callback. AMDGPU propagates
the virtual register flags maintained in the target file itself. They are useful
to identify a certain type of machine operands while inserting spill stores and
reloads. Since RegAllocFast spills the physical register itself, there is no way
its virtual register can be mapped back to retrieve the flags. It can be solved
by passing the virtual register as an additional argument. This argument has no
use when the spill interfaces are called during the greedy allocator or even the
PrologEpilogInserter and can pass a null register in such cases.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D138656
Before this change, the order of instructions in `case` labels was
inconsistent: alphabetical for FP instructions but a different
order for integer instructions. This commit changes the order to
1) instruction set (base/FP/SIMD), 2) mnemonic, 3) element type.
I believe this change makes it consistent, improves understandability,
and makes it easy to add/remove a group of instructions.
Differential Revision: https://reviews.llvm.org/D139607