If the add has more than one use, then applying the transformation
won't cause it to be removed, so we can end up applying it again,
causing an infinite loop.
Differential Revision: https://reviews.llvm.org/D129361
Currently, for vectorised loops that use the get.active.lane.mask
intrinsic, we only use the mask for predicated vector operations,
such as masked loads and stores. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, this is inefficient on targets where it is
cheap to use the mask itself to control the loop.
This patch adds support for using the active lane mask for control
flow by:
1. Generating the active lane mask for the next iteration of the
vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.
2. Extracting the first bit of this mask and using it for the
conditional branch.
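As a rough sketch of the result (value names here are illustrative,
not taken from the patch), the vector loop latch becomes something
like:
vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %mask = phi <4 x i1> [ %entry.mask, %vector.ph ], [ %mask.next, %vector.body ]
  ; ... masked loads/stores predicated on %mask ...
  %index.next = add i64 %index, 4
  ; (1) the active lane mask for the next iteration
  %mask.next = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %index.next, i64 %trip.count)
  ; (2) branch on its first lane
  %cont = extractelement <4 x i1> %mask.next, i32 0
  br i1 %cont, label %vector.body, label %middle.block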
I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
Differential Revision: https://reviews.llvm.org/D125301
These three subtarget features are meant to control whether MVE
instructions take 1, 2 or 4 architectural beats. The mve1beat feature
is described as "Model MVE instructions as a 1 beat per tick
architecture", meaning an MVE instruction will execute over 4 cycles.
mve4beat is the opposite, where all 4 beats of an MVE instruction
execute in a single cycle. The costs for the two were backwards,
though, not matching the cycle counts as they should. This patch swaps
the two costs to bring them in line with expectations.
Differential Revision: https://reviews.llvm.org/D129141
This patch adds support for Arm's Cortex-M85 CPU. The Cortex-M85 CPU is
an Arm v8.1m Mainline CPU, with optional support for MVE and PACBTI,
both of which are enabled by default.
Parts have been co-authored by Mark Murray, Alexandros Lamprineas and
David Green.
Differential Revision: https://reviews.llvm.org/D128415
Currently, an AAPCS-compliant frame record is not always created for
functions when it should be. Although a consistent frame record might
not be required in some cases, there are still scenarios where
applications may want to make use of the call hierarchy made available
through it.
In order to enable the use of AAPCS-compliant frame records whilst
keeping backwards compatibility, this patch introduces a new
command-line option (`-mframe-chain=[none|aapcs|aapcs+leaf]`) for the
AArch32 and Thumb backends. The option allows users to explicitly
select when to use it, and is also useful to ensure the extra overhead
of the frame records is only incurred when necessary, in particular
for Thumb targets.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D125094
Deciding to load an arbitrary global based on whether the entire module is
being built for long calls is pretty clearly spurious, and in fact the existing
indirect logic is sufficient.
Running iwyu-diff on the LLVM codebase since fb67d683db46dfd88da09d99
detected a few regressions; this fixes them.
The impact on preprocessed output is negligible: -4k lines.
This fixes the combining of constant vector GEP operands in the
optimization of MVE gather/scatter addresses, when opaque pointers are
enabled. As opaque pointers reduce the number of bitcasts between geps,
more can be folded than before. This can cause problems if the index
types are now different between the two geps.
This fixes that by making sure each constant is scaled appropriately,
which has the effect of transforming the geps to have a scale of 1,
changing [r0, q0, uxtw #1] gathers to [r0, q0] with a larger q0. This
allows the use of a simpler instruction that doesn't need the extra
uxtw.
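As an illustrative sketch of the problem (hand-written IR, not from
the patch's tests): with opaque pointers, two geps with different
source element types can now be folded, so the constant below is in
i8 units while %offs is in i16 units, and must be scaled before
merging:
%ptrs = getelementptr i16, ptr %base, <4 x i32> %offs
%byteptrs = getelementptr i8, <4 x ptr> %ptrs, i32 2
%g = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> %byteptrs, i32 2, <4 x i1> %mask, <4 x i16> undef)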
Differential Revision: https://reviews.llvm.org/D127733
Although this doesn't usually come up, the BaseAccess of a distributed
postinc can have a PHI among its uses. This doesn't need the usual
dominance check, as we will dominate along the phi edge, allowing us
to still create a postinc load/store.
Differential Revision: https://reviews.llvm.org/D127676
This reverts commit 7625e01d661644a560884057755d48a0da8b77b4 and
dependent cbcce82ef6b512d97e92a319a75a03e997c844e1.
Commit 7625e01d661644a560884057755d48a0da8b77b4 causes some new codegen test
failures under asan, e.g., CodeGen/ARM/execute-only.ll:
https://lab.llvm.org/buildbot/#/builders/5/builds/24659/steps/15/logs/stdio.
Currently, an AAPCS-compliant frame record is not always created for
functions when it should be. Although a consistent frame record might
not be required in some cases, there are still scenarios where
applications may want to make use of the call hierarchy made available
through it.
In order to enable the use of AAPCS-compliant frame records whilst
keeping backwards compatibility, this patch introduces a new
command-line option (`-mframe-chain=[none|aapcs|aapcs+leaf]`) for the
AArch32 and Thumb backends. The option allows users to explicitly
select when to use it, and is also useful to ensure the extra overhead
of the frame records is only incurred when necessary, in particular
for Thumb targets.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D125094
Similarly to what's done in both ARM's and AArch64's frame lowering code,
this updates Thumb1FrameLowering to use the FrameDestroy Machine
Instruction flag to identify instructions inserted as part of the epilog
instead of relying on assumptions about specific machine instructions.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D126285
In the same spirit as D73543, and in reply to https://reviews.llvm.org/D126768#3549920, this patch adds support for `__builtin_memset_inline`.
The idea is to get support from the compiler to easily write efficient memory function implementations.
This patch could be split in two:
- one for the LLVM part adding the `llvm.memset.inline.*` intrinsics.
- and another one for the Clang part providing the intrinsic as a builtin.
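At the IR level, the new intrinsic looks much like a regular memset
call, except that the length must be a compile-time constant and the
lowering is guaranteed to stay inline rather than become a libcall. A
minimal sketch (assuming opaque-pointer name mangling):
call void @llvm.memset.inline.p0.i64(ptr %dst, i8 0, i64 32, i1 false)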
Differential Revision: https://reviews.llvm.org/D126903
We already have patterns for matching fadd(select(..., -0.0)),
but an upcoming patch will lead to patterns using +0.0 as the
identity instead of -0.0. I'm adding support for these patterns
now to avoid any regressions for MVE.
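For illustration (a hand-written sketch, not lifted from the patch's
tests), the two pattern shapes are:
; existing form: -0.0 is the exact identity for fadd
%s0 = select <4 x i1> %m, <4 x float> %x, <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>
%r0 = fadd <4 x float> %acc, %s0
; new form: +0.0 used as the identity instead
%s1 = select <4 x i1> %m, <4 x float> %x, <4 x float> zeroinitializer
%r1 = fadd <4 x float> %acc, %s1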
Differential Revision: https://reviews.llvm.org/D127275
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
The MVE shuffle costing for VREV instructions was incorrectly assuming
that legalized vector types remain vectors. Add a quick check to
ensure they are indeed vectors before attempting to get the number of
elements.
The directive name is not useful because the next line replicates the error line
which includes the directive. The prevailing style uses "expected newline".
We intentionally disable Thumb2SizeReduction for SEH
prologues/epilogues, to avoid needing to guess what will happen with
the instructions in a potential future pass in frame lowering.
But for this specific case, where we know we can express the
intent with a narrow instruction, change to that instruction form
directly in frame lowering.
Differential Revision: https://reviews.llvm.org/D126949
We intentionally disable Thumb2SizeReduction for SEH
prologues/epilogues, to avoid needing to guess what will happen with
the instructions in a potential future pass in frame lowering.
But for this specific case, where we know we can express the
intent with a narrow instruction, change to that instruction form
directly in frame lowering.
Differential Revision: https://reviews.llvm.org/D126948
For functions that require restoring SP from FP (e.g. those that need
to align the stack, or that have variable-sized allocations), the
prologue and epilogue previously looked like this:
push {r4-r5, r11, lr}
add r11, sp, #8
...
sub r4, r11, #8
mov sp, r4
pop {r4-r5, r11, pc}
This is problematic, because this unwinding operation (restoring sp
from r11 - offset) can't be expressed with the SEH unwind opcodes
(probably because this unwind procedure doesn't map exactly to
individual instructions; note the detour via r4 in the epilogue too).
To make unwinding work, the GPR push is split into two; the first one
pushing all other registers, and the second one pushing r11+lr, so that
r11 can be set pointing at this spot on the stack:
push {r4-r5}
push {r11, lr}
mov r11, sp
...
mov sp, r11
pop {r11, lr}
pop {r4-r5}
bx lr
For the same setup, MSVC generates code that uses two registers;
r11 still pointing at the {r11,lr} pair, but a separate register
used for restoring the stack at the end:
push {r4-r5, r7, r11, lr}
add r11, sp, #12
mov r7, sp
...
mov sp, r7
pop {r4-r5, r7, r11, pc}
For cases with clobbered float/vector registers, they are pushed
after the GPRs, before the {r11,lr} pair.
Differential Revision: https://reviews.llvm.org/D125649
Skip inserting regular CFI instructions if using WinCFI.
This is based a fair amount on the corresponding ARM64 implementation,
but instead of trying to insert the SEH opcodes one by one where
we generate other prolog/epilog instructions, we try to walk over the
whole prolog/epilog range and insert them. This is done because in
many cases, the exact number of instructions inserted is abstracted
away in deeper layers.
For some cases, we manually insert specific SEH opcodes directly where
instructions are generated, where the automatic mapping of instructions
to SEH opcodes doesn't hold up (e.g. for __chkstk stack probes).
Skip Thumb2SizeReduction for SEH prologs/epilogs, and force
tail calls to wide instructions (just like on MachO), to make sure
that the unwind info actually matches the width of the final
instructions, without heuristics about what later passes will do.
Mark SEH instructions as scheduling boundaries, to make sure that they
aren't reordered away from the instruction they describe by
PostRAScheduler.
Mark the SEH instructions with the NoMerge flag, to avoid doing
tail merging of functions that have multiple epilogs that all end
with the same sequence of "b <other>; .seh_nop_w, .seh_endepilogue".
Differential Revision: https://reviews.llvm.org/D125648
It's a fairly common issue that the generating code incorrectly marks
instructions as narrow or wide; check that the instruction lengths
add up to the expected value, and error out if they don't. This allows
catching code generation bugs.
Also check that prologs and epilogs are properly terminated, to
catch other code generation issues.
Differential Revision: https://reviews.llvm.org/D125647
This includes .seh_* directives for generating it from assembly.
It is designed fairly similarly to the ARM64 handling.
For .seh_handler directives, such as
".seh_handler __C_specific_handler, @except" (which is supported
on x86_64 and aarch64 so far), the "@except" bit doesn't work in
ARM assembly, as '@' is used as a comment character (on all current
platforms).
Allow using '%' instead of '@' for this purpose. This convention
is used by GAS in similar contexts already,
e.g. [1]:
Note on targets where the @ character is the start of a comment
(eg ARM) then another character is used instead. For example the
ARM port uses the % character.
In practice, this unfortunately means that all such .seh_handler
directives will need ifdefs for ARM.
Contrary to ARM64, on ARM, it's quite common that we can't evaluate
e.g. the function length at this point, due to instructions whose
length is finalized later. (Also, inline jump tables end with
a ".p2align 1".)
If unable to evaluate the function length immediately, emit
it as an MCExpr instead. If we were to implement splitting the unwind
info for a function (which isn't implemented for ARM64 yet either),
we wouldn't know at this point whether we need to split it, though.
Avoid calling getFrameIndexOffset() on an unset
FuncInfo.UnwindHelpFrameIdx, to avoid triggering asserts in the
preexisting testcase CodeGen/ARM/Windows/wineh-basic.ll. (Once
MSVC exception handling is fully implemented, those changes
can be reverted.)
[1] https://sourceware.org/binutils/docs/as/Section.html#Section
Differential Revision: https://reviews.llvm.org/D125645