Using segmented stacks with execute-only mostly works, but we need to
use the correct movi32 opcode in 6-M, and there's one place where for
thumb1 (i.e. 6-M and 8-M.base) a constant pool was unconditionally
used which needed to be fixed.
Differential Revision: https://reviews.llvm.org/D156339
llvm-objdump -d will be changed to not display mapping symbols by
default (D156190).
Add --show-all-symbols to make the intent clearer and prevent test
adjustment with the new behavior.
Record the call frame size on entry to each basic block. This is usually
zero except when a basic block has been split in the middle of a call
sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D156113
Refactor to use BasicBlockUtils functions and make life easier for
a subsequent patch for updating the dominator tree.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D154053
Currently when compiling for an execute-only target without movt then
EmitStructByval will generate a constant pool load which isn't
compatible with execute-only. Handle this by emitting tMOVi32imm,
and also simplify the existing movt handling by emitting t2MOVi32imm
or MOVi32imm.
Differential Revision: https://reviews.llvm.org/D154944
The expansion of the various MOVi32imm pseudo-instructions works by
splitting the operand into components (either halfwords or bytes) and
emitting instructions to combine those components into the final
result. When the operand is an immediate with some components being
zero this can result in pointless instructions that just add zero.
Avoid this by restructuring things so that a separate function handles
splitting the operand into components, then don't emit the component
if it is a zero immediate. This is straightforward for movw/movt,
where we just don't emit the movt if it's zero, but the thumb1
expansion using mov/add/lsl is more complex, as even when we don't
emit a given byte we still need to get the shift correct.
Differential Revision: https://reviews.llvm.org/D154943
This is a reduced version of one of the tests that was broken by the
original commit of D154281 "[CodeGen] Store SP adjustment in
MachineBasicBlock. NFCI.".
Differential Revision: https://reviews.llvm.org/D155471
In most places where TransferImpOps is currently used we just have one
machine instruction, so it's doing the same thing as copyImplicitOps
anyway. In those cases where we have more than one machine
instruction the destination is written to in each instruction so any
implicit defs should appear on all of them (and we shouldn't see any
implicit refs as these pseudo-instruction don't have any register
inputs), meaning the current use of TransferImpOps is incorrect and
we should be using copyImplicitOps on all of the generated
instructions.
Differential Revision: https://reviews.llvm.org/D155301
In change https://reviews.llvm.org/D152790, it was discovered that the
alignment requirement calculation for LDRD/STRD codegen was suboptimal
and the calculation for volatile loads and stores was adjusted.
This change here adopts the calculation for the remaining non-volatile
occurances.
Recommitting after undefined behavior fix in D155093.
Differential Revision: https://reviews.llvm.org/D153800
Record the SP adjustment on entry to each basic block. This is almost
always zero except on targets like ARM which can split a basic block in
the middle of a call sequence.
This simplifies PEI::replaceFrameIndices which previously had to visit
basic blocks in a specific order and had special handling for
unreachable blocks. More importantly it paves the way for an equally
simple implementation of a backwards version of replaceFrameIndices,
which is required to fully convert PrologEpilogInserter to backwards
register scavenging, which is preferred because it does not rely on
accurate kill flags.
Differential Revision: https://reviews.llvm.org/D154281
Currently when compiling for an execute-only target without movt then
EmitStructByval will generate a constant pool load which isn't
compatible with execute-only. Handle this by emitting tMOVi32imm,
and also simplify the existing movt handling by emitting t2MOVi32imm
or MOVi32imm.
Differential Revision: https://reviews.llvm.org/D154944
The expansion of the various MOVi32imm pseudo-instructions works by
splitting the operand into components (either halfwords or bytes) and
emitting instructions to combine those components into the final
result. When the operand is an immediate with some components being
zero this can result in pointless instructions that just add zero.
Avoid this by restructuring things so that a separate function handles
splitting the operand into components, then don't emit the component
if it is a zero immediate. This is straightforward for movw/movt,
where we just don't emit the movt if it's zero, but the thumb1
expansion using mov/add/lsl is more complex, as even when we don't
emit a given byte we still need to get the shift correct.
Differential Revision: https://reviews.llvm.org/D154943
Mark the tMOVi32imm pseudo instr as killing the flags register.
The pseudo instruction expands to a sequence of 7 movs/lsls/adds
instructions, which are all Thumb-1 flag setting instructions.
For a test case, take an existing arm test which checks for
"Don't CSE a cmp across a call that clobbers CPSR."
and retarget it at thumbv6m execute-only.
Reviewed By: stuij
Differential Revision: https://reviews.llvm.org/D154845
Change-Id: I8f8209fbc40a833f8875629937b9606c1e2c021d
Currently in LowerConstantFP, when we compile for execute-only (XO) we don't
check what architecture we're compiling for (v6m=< or >v6m). We shouldn't get
here for v6m, so put in an assert.
Reviewed By: simonwallis2, dmgreen
Differential Revision: https://reviews.llvm.org/D154506
In llvm/test/CodeGen/ARM/large-stack.ll, the C in FileCheck wasn't
uppercased. This wasn't spotted in development as MacOS's HFS+ fs is apparently
often configured case-insensitive.
The ARM backend codebase is dotted with places where armv6-m will generate
constant pools. Now that we can generate execute-only code for armv6-m, we need
to make sure we use the movs/lsls/adds/lsls/adds/lsls/adds pattern instead of
these.
Big stacks is one of the obvious places. In this patch we take care of two
sites:
1. take care of big stacks in prologue/epilogue
2. take care of save/tSTRspi nodes, which implicitly fixes
emitThumbRegPlusImmInReg which is used in several frame lowering fns
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D154233
This reverts commit 92a9c30c61da7f973d55cd84fade424159b9cac9.
This has caused a test failure in the 2nd stage of Linaro's
Arm 32 bit buildbots.
LLVM::simplified-template-names.s
7: error: Simplified template DW_AT_name could not be reconstituted:
check:10'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
8: original: f3<unsigned char, (unsigned char)'\x00'>
check:10'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9: reconstituted: f3<unsigned char, (unsigned char)'\x7f'>
check:10'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I suspect a load/store is slightly off.
In change https://reviews.llvm.org/D152790, it was discovered that the
alignment requirement calculation for LDRD/STRD codegen was suboptimal
and the calculation for volatile loads and stores was adjusted.
This change here adopts the calculation for the remaining non-volatile
occurances.
Differential Revision: https://reviews.llvm.org/D153800
Fix https://github.com/llvm/llvm-project/issues/63579
```
% cat a.c
void foo() {}
% clang --target=arm-none-eabi -mthumb -mno-unaligned-access -fsanitize=kcfi a.c -S -o - | grep p2align
.p2align 1
% clang --target=armv6m-none-eabi -fsanitize=function a.c -S -o - | grep p2align
.p2align 1
```
Ensure that -fsanitize={function,kcfi} instrumented functions are aligned by at
least 4, so that loading the type hash before the function label will not cause
a misaligned access. This is especially important for -mno-unaligned-access
configurations that don't set `setMinFunctionAlignment` to 4 or greater.
With this patch, the generated assembly for the examples above will contain `.p2align 2`
before the type hash.
If `__attribute__((aligned(N)))` or `-falign-functions=N` is specified, the
larger alignment will be used.
Reviewed By: simon_tatham, samitolvanen
Differential Revision: https://reviews.llvm.org/D154125
The ExpandLibcallResult result was a bitcast and not the direct call
result, so we couldn't find the chain. Use the new separate chain
return value instead.
If we have a store of a load with no other uses in between it, it's
considered dead and is removed. So sometimes when legalizing a fixed
length vector store of an insert, we end up producing better code
through scalarization than without.
An example is the follow below:
%a = load <4 x i64>, ptr %x
%b = insertelement <4 x i64> %a, i64 %y, i32 2
store <4 x i64> %b, ptr %x
If this is scalarized, then DAGCombine successfully removes 3 of the 4
stores which are considered dead, and on RISC-V we get:
sd a1, 16(a0)
However if we make the vector type legal (-mattr=+v), then we lose the
optimisation because we don't scalarize it.
This patch attempts to recover the optimisation for vectors by
identifying patterns where we store a load with a single insert
inbetween, replacing it with a scalar store of the inserted element.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D152276
When we only have a 16-bit pc-relative branch instruction we generate
a table of address for a jump table. Currently this is placed inline,
but this won't work with execute-only memory. In this case generate
the jump table out-of-line.
Differential Revision: https://reviews.llvm.org/D153774
Recently eXecute Only (XO) codegen was also allowed for armv6-M. Previously this
was only implemented for ~armv7+, effectively if MOVW/MOVT is
available. Regarding long calls, we remove the check for MOVW/MOVT when
generating code for XO, which already was redundant as in the subtarget
initialization we already check if XO is valid for the target. And targets that
generate valid XO code should be able to handle the (wrapper globaladdress)
node.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D153782
The CPSR registers ops of the instructions constructed in ExpandTMOV32BitImm
were marked as kill, instead of define. Best to use the pre-existing
t1CondCodeOp fn to construct CPSRs.
Reviewed By: simonwallis2
Differential Revision: https://reviews.llvm.org/D153763
No longer conservatively assume a load/store accesses the stack when we
can prove that we did not compute any stack-relative address up to this
point in the program.
We do this in a cheap not-quite-a-dataflow-analysis: Assume
`NoStackAddressUsed` when all predecessors of a block already guarantee
it. Process blocks in reverse post order to guarantee that except for
loop headers we have processed all predecessors of a block before
processing the block itself. For loops we accept the conservative answer
as they are unlikely to be shrink-wrappable anyway.
Differential Revision: https://reviews.llvm.org/D152213
Switch and update some tests to use `update_llc_test_checks` to reduce
clutter in upcoming change.
Differential Revision: https://reviews.llvm.org/D152215
Volatile loads/stores of i64 are lowered to LDRD/STRD on ARMv5TE.
However, these instructions require the addresses to be aligned.
Unaligned loads/stores therefore should be ignored by this handling.
Differential Revision: https://reviews.llvm.org/D152790
The current implementation tries to handle the high and low halves
separately, but that's less efficient in most cases; use a wide SETCC
instead.
Differential Revision: https://reviews.llvm.org/D151358
Temporarily disabling the execute-only tests. We recently added codegen for
armv6-m, which is still in heavy development (D152795).
Disabling the tests while we're figuring out what's going on is probably the
least disruptive option, as a patch dependent on it also already landed.
[ARM] generate armv6m eXecute Only (XO) code for immediates, globals
Previously eXecute Only (XO) support was implemented for targets that support
MOVW/MOVT (~armv7+). See: https://reviews.llvm.org/D27449
XO prevents the compiler from generating data accesses to code sections. This
patch implements XO codegen for armv6-M, which does not support MOVW/MOVT, and
must resort to the following general pattern to avoid loads:
movs r3, :upper8_15:foo
lsls r3, #8
adds r3, :upper0_7:foo
lsls r3, #8
adds r3, :lower8_15:foo
lsls r3, #8
adds r3, :lower0_7:foo
ldr r3, [r3]
This is equivalent to the code pattern generated by GCC.
The above relocations are new to LLVM and have been implemented in a parent
patch: https://reviews.llvm.org/D149443.
This patch limits itself to implementing codegen for this pattern and enabling
XO for armv6-M in the backend.
Separate patches will follow for:
- switch tables
- replacing specific loads from constant islands which are spread out over the
ARM backend codebase. Amongst others: FastISel, call lowering, stack frames.
Reviewed By: john.brawn
Differential Revision: https://reviews.llvm.org/D152795
The `__DATA,xray_instr_map` section has label differences like
`.quad Lxray_sled_0-Ltmp0` that is represented as a pair of UNSIGNED and SUBTRACTOR relocations.
LLVM integrated assembler attempts to rewrite A-B into A-B'+offset where B' can
be included in the symbol table. B' is called an atom and should be a
non-temporary symbol in the same section. However, since `xray_instr_map` does
not define a non-temporary symbol, the SUBTRACTOR relocation will have no
associated symbol, and its `r_extern` value will be 0. Therefore, we will see
linker errors like:
error: SUBTRACTOR relocation must be extern at offset 0 of __DATA,xray_instr_map in a.o
To fix this issue, we need to define a non-temporary symbol in the section. We
can accomplish this by renaming `Lxray_sleds_start0` to `lxray_sleds_start0`
("L" to "l").
`lxray_sleds_start0` serves as the atom for this dead-strippable subsection.
With the `S_ATTR_LIVE_SUPPORT` attribute, `ld -dead_strip` will retain
subsections that reference live functions.
Special thanks to Oleksii Lozovskyi for reporting the issue and providing
initial analysis.
Differential Revision: https://reviews.llvm.org/D153239
Commit ec77747fbdca901e0fded58f940dae62e0f6b726 regenerated the check lines
without being very careful about which lines were updated. This attempts to fix
them to make sure the V7 and V8 lines are emitted as needed.
As mentioned by commit c5d38924dc6688c15b3fa133abeb3626e8f0767c (Apr 2020),
PC-relative entries avoid dynamic relocations and can therefore make the
section read-only.
This is similar to D78082 and D78590. We cannot commit to support
compiler/runtime built at different versions, so just don't play with versions.
For Mach-O support (incomplete yet), we use non-temporary `lxray_fn_idx[0-9]+`
symbols. Label differences are represented as a pair of UNSIGNED and SUBTRACTOR
relocations. The SUBTRACTOR external relocation requires r_extern==1 (needs to
reference a symbol table entry) which can be satisfied by `lxray_fn_idx[0-9]+`.
A `lxray_fn_idx[0-9]+` symbol also serves as the atom for this dead-strippable
section (follow-up to commit b9a134aa629de23a1dcf4be32e946e4e308fc64d).
Differential Revision: https://reviews.llvm.org/D152661
Add the `S_ATTR_LIVE_SUPPORT` attribute to the sections so that `ld -dead_strip`
will retain subsections that reference live functions, once we we add linker
private "l" symbols as atoms.
AArch64 has five system registers intended to be useful as thread
pointers: one for each exception level which is RW at that level and
inaccessible to lower ones, and the special TPIDRRO_EL0 which is
readable but not writable at EL0. AArch32 has three, corresponding to
the AArch64 ones that aren't specific to EL2 or EL3.
Currently clang supports only a subset of these registers, and not
even a consistent subset between AArch64 and AArch32:
- For AArch64, clang permits you to choose between the four TPIDR_ELn
thread registers, but not the fifth one, TPIDRRO_EL0.
- In AArch32, on the other hand, the //only// thread register you can
choose (apart from 'none, use a function call') is TPIDRURO, which
corresponds to (the bottom 32 bits of) AArch64's TPIDRRO_EL0.
So there is no thread register that you can currently use in both
targets!
For custom and bare-metal purposes, users might very reasonably want
to use any of these thread registers. There's no reason they shouldn't
all be supported as options, even if the default choices follow
existing practice on typical operating systems.
This commit extends the range of values acceptable to the `-mtp=`
clang option, so that you can specify any of these registers by (the
lower-case version of) their official names in the ArmARM:
- For AArch64: tpidr_el0, tpidrro_el0, tpidr_el1, tpidr_el2, tpidr_el3
- For AArch32: tpidrurw, tpidruro, tpidrprw
All existing values of the option are still supported and behave the
same as before. Defaults are also unchanged. No command line that
worked already should change behaviour as a result of this.
The new values for the `-mtp=` option have been agreed with Arm's gcc
developers (although I don't know whether they plan to implement them
in the near future).
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D152433
Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D127115