Fixes the "use after poison" issue introduced by #121516 (see
<https://github.com/llvm/llvm-project/pull/121516#issuecomment-2585912395>).
The root cause is that #121516 introduced "Called Global" information
for call instructions, modeled on how "Call Site" information is stored
in the machine function; however, it did not extend the existing
copy/move/erase operations for call site information to also handle it.
The fix is to rename and update the existing copy/move/erase functions
so they also take care of Called Global info.
1. When two (or more) nodes are glued, the DAG scheduler will always
schedule them as one piece, i.e. it will not allow any instructions to
be scheduled between them. It does so because glued nodes usually have
an implicit physical register dependency between them, and an
intervening node could clobber that register. When such nodes are
emitted into machine IR, they also end up stuck together, e.g.:
```
%9:gpr = MOVsrl_glue killed %8, implicit-def $cpsr
%10:gpr = RRX %3, implicit $cpsr
```
2. If a node has a Glue result, SelectionDAG will not try to CSE that
node. If it did, it would break the implicit physical register
dependency. In practice this means that if a node with a Glue result has
multiple uses, it has to be duplicated before each use. This is the
reason `ARMTargetLowering::duplicateCmp` exists.
With a normal data dependency, dependent nodes can be scheduled around
freely. If there is a physical register dependency between nodes, the
physical register is copied to/from a virtual register, allowing other
nodes to be scheduled between them. The resulting machine IR might look
like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = COPY $cpsr
%11:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%12:gpr = BICri killed %11, -2147483648, 14 /* CC::al */, $noreg, $noreg
$cpsr = COPY %10
%13:gpr = RRX %3, implicit $cpsr
```
The two copies are likely to be eliminated by the register coalescer,
given that there are no instructions between them that clobber this
physical register. If the copies are unwanted in the first place (they
could be expensive or impossible), the DAG scheduler will try to avoid
inserting them wherever possible, and the resulting machine IR will look
like this:
```
%9:gpr = LSRs1 killed %8, implicit-def $cpsr
%10:gpr = ORRrsi killed %9, %3, 242, 14 /* CC::al */, $noreg, $noreg
%11:gpr = BICri killed %10, -2147483648, 14 /* CC::al */, $noreg, $noreg
%12:gpr = RRX %3, implicit $cpsr
```
On ARM, arithmetic operations and LSLS already use the new data flow
approach. This patch extends it to include 1-bit shifts.
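For illustration, a 64-bit logical shift right by one is the kind of
pattern that now uses a plain CPSR data dependency rather than glue (the
source-level example and register assignment below are a sketch, not
output taken from the patch):
```
/* unsigned long long f(unsigned long long x) { return x >> 1; } */
/* low word in r0, high word in r1 (AAPCS) */
lsrs    r1, r1, #1      /* shift the high word; carry := old bit 0 */
rrx     r0, r0          /* rotate the carry into bit 31 of the low word */
```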
Pull Request: https://github.com/llvm/llvm-project/pull/116547
When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLSTM and
the VLLDM instructions.
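A simplified sketch of the affected non-secure call sequence (register
choices and the surrounding register-clearing code are illustrative, not
the exact sequence the compiler emits):
```
vmov    s0, s0          /* the fix: forces CONTROL_S.SFPA to be set */
vlstm   sp              /* saves the FP registers instead of being a no-op */
blxns   r4              /* call into non-secure state */
vmov    r0, s0          /* copy the FP return value to a GPR (sets SFPA) */
vlldm   sp              /* restores what VLSTM wrote, not stale stack memory */
```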
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
When expanding a VBSP pseudo into VMOV; VBSL, if the first reg was
killed in the BSP then the kill flags could be incorrectly copied to the
mov (vorr) and the vbsl. Drop the kill flags.
Note that this sometimes comes up when all the operands of the VBSP are
the same, which can be optimized separately.
Otherwise, the enum defaults to 'int'. Update a few places that used
'int' for registers that now need to change to avoid a signed/unsigned
compare warning.
I was hoping this would allow us to remove the 'int' comparison
operators in Register.h and MCRegister.h, but compares with literal 0
still need them.
This test case was failing to compile with a "ran out of registers
during register allocation" error at -O0. This was because CMP_SWAP_64
has 3 operands which must each be an even-odd register pair, plus two
other GPR operands. All of the def operands are also early-clobber, so
registers can't be shared between uses and defs. Because the function
has an over-aligned alloca it needs frame and base pointers, so r6 and
r11 are both reserved. That leaves r0/r1, r2/r3, r4/r5 and r8/r9 as the
only valid register pairs, and if the two individual GPR operands happen
to get allocated to registers in different pairs then only 2 pairs will
be available for the three GPRPair operands.
To fix this, I've merged the two GPR operands into a single GPRPair
operand. This means that the instruction now has 4 GPRPair operands,
which can always be allocated without relying on luck. This does
constrain register allocation a bit more, but this pseudo instruction is
only used at -O0, so I don't think that's a problem.
When compiling for thumbv8.1m with +pacbti and making an indirect tail
call, the compiler was free to put the function pointer into R12.
This is incorrect because R12 is restored to contain the authentication
code for the caller's return address.
This patch excludes R12 from the set of registers the compiler can put
the function pointer in.
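A sketch of the broken sequence (the callee-saved register layout is
simplified; only the use of r12 matters here):
```
pop     {r7, r12, lr}      /* r12 := the saved return address authentication code */
aut     r12, lr, sp        /* authenticate the return address */
bx      r12                /* wrong: the tail call needs the function pointer here */
```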
Fixes https://github.com/llvm/llvm-project/issues/75998
Re-land 634b0243b8f7acc85af4f16b70e91d86ded4dc83.
T1 allows for an optional register list; if present, the register list
must be {d0-d15}.
T2 defines a mandatory register list; the register list must be {d0-d31}.
The requirements for T1/T2 are as follows:
             T1                         T2
Require:     v8-M.Main, secure state    v8.1-M.Main, secure state
16 D Regs    valid                      valid
32 D Regs    UNDEFINED                  valid
No D Regs    NOP                        NOP
The expansion of the various MOVi32imm pseudo-instructions works by
splitting the operand into components (either halfwords or bytes) and
emitting instructions to combine those components into the final
result. When the operand is an immediate with some components being
zero this can result in pointless instructions that just add zero.
Avoid this by restructuring things so that a separate function handles
splitting the operand into components, then don't emit the component
if it is a zero immediate. This is straightforward for movw/movt,
where we just don't emit the movt if it's zero, but the thumb1
expansion using mov/add/lsl is more complex, as even when we don't
emit a given byte we still need to get the shift correct.
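For example, with the thumb1 mov/add/lsl pattern (the immediate value
and register are chosen purely for illustration):
```
/* loading 0x12003400 */
/* before: every byte component is emitted, including the zero ones */
movs    r0, #0x12
lsls    r0, #8
adds    r0, #0x00
lsls    r0, #8
adds    r0, #0x34
lsls    r0, #8
adds    r0, #0x00
/* after: zero components are skipped and the shift amounts are folded together */
movs    r0, #0x12
lsls    r0, #16
adds    r0, #0x34
lsls    r0, #8
```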
Differential Revision: https://reviews.llvm.org/D154943
In most places where TransferImpOps is currently used we just have one
machine instruction, so it's doing the same thing as copyImplicitOps
anyway. In those cases where we have more than one machine
instruction, the destination is written to by each instruction, so any
implicit defs should appear on all of them (and we shouldn't see any
implicit uses, as these pseudo-instructions don't have any register
inputs), meaning the current use of TransferImpOps is incorrect and
we should be using copyImplicitOps on all of the generated
instructions.
Differential Revision: https://reviews.llvm.org/D155301
When built without LLVM_ENABLE_ASSERTIONS we can get a warning in
ARMExpandPseudoInsts.cpp due to a variable only being used inside
an LLVM_DEBUG statement. Fix this with a dummy use, like we do
elsewhere.
When we only have a 16-bit pc-relative branch instruction we generate
a table of addresses for a jump table. Currently this is placed inline,
but this won't work with execute-only memory, so in that case generate
the jump table out-of-line.
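As a rough sketch of the difference (labels, directives and the branch
sequence itself are illustrative only):
```
/* inline: the address table lives in the code section, which an
   execute-only region does not allow us to read as data */
        ...
.LJTI0_0:
        .long   .LBB0_2
        .long   .LBB0_3

/* out-of-line: the same table is emitted into a separate, readable section */
        .section        .rodata
.LJTI0_0:
        .long   .LBB0_2
        .long   .LBB0_3
```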
Differential Revision: https://reviews.llvm.org/D153774
The CPSR register operands of the instructions constructed in
ExpandTMOV32BitImm were marked as kill instead of def. It's best to use
the pre-existing t1CondCodeOp function to construct the CPSR operands.
Reviewed By: simonwallis2
Differential Revision: https://reviews.llvm.org/D153763
[ARM] generate armv6m eXecute Only (XO) code for immediates, globals
Previously eXecute Only (XO) support was implemented for targets that support
MOVW/MOVT (~armv7+). See: https://reviews.llvm.org/D27449
XO prevents the compiler from generating data accesses to code sections. This
patch implements XO codegen for armv6-M, which does not support MOVW/MOVT and
must instead use the following general pattern to avoid constant-pool loads:
movs r3, :upper8_15:foo
lsls r3, #8
adds r3, :upper0_7:foo
lsls r3, #8
adds r3, :lower8_15:foo
lsls r3, #8
adds r3, :lower0_7:foo
ldr r3, [r3]
This is equivalent to the code pattern generated by GCC.
The above relocations are new to LLVM and have been implemented in a parent
patch: https://reviews.llvm.org/D149443.
This patch limits itself to implementing codegen for this pattern and enabling
XO for armv6-M in the backend.
Separate patches will follow for:
- switch tables
- replacing specific loads from constant islands which are spread out over the
ARM backend codebase. Amongst others: FastISel, call lowering, stack frames.
Reviewed By: john.brawn
Differential Revision: https://reviews.llvm.org/D152795
In D79537, `nomerge` was made to only apply to non-tail calls. This fixes it by also applying it to tail calls.
For ARM, I only made the new MI inherit the flag for `TCRETURNdi` and `TCRETURNri`, because that's where tail calls get replaced. Not sure if there are any other places that need it.
Fixes #61545.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D146749
This patch makes two notable changes to the MIR debug info representation,
which result in different MIR output but identical final DWARF output (NFC
w.r.t. the full compilation). The two changes are:
* The introduction of a new MachineOperand type, MO_DbgInstrRef, which
consists of two unsigned numbers that are used to index an instruction
and an output operand within that instruction, having a meaning
identical to the first two operands of the current DBG_INSTR_REF
instruction. This operand is only used in DBG_INSTR_REF (see below).
* A change in syntax for the DBG_INSTR_REF instruction, shuffling the
operands to make it resemble DBG_VALUE_LIST instead of DBG_VALUE,
and replacing the first two operands with a single MO_DbgInstrRef-type
operand.
This patch is the first of a set that will allow DBG_INSTR_REF
instructions to refer to multiple machine locations in the same manner
as DBG_VALUE_LIST.
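As a rough before/after sketch of the syntax change (the metadata node
numbers and the referenced instruction are invented for illustration):
```
DBG_INSTR_REF 1, 0, !123, !DIExpression(), debug-location !456
  -->
DBG_INSTR_REF !123, !DIExpression(DW_OP_LLVM_arg, 0), dbg-instr-ref(1, 0), debug-location !456
```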
Reviewed By: jmorse
Differential Revision: https://reviews.llvm.org/D129372
Given a patch like D129506, using instructions not valid for the current
target feature set becomes an error. This fixes an issue in
ARMExpandPseudo::ExpandCMP_SWAP where Thumb2 compares were used in
Thumb1Only code, such as thumbv8m.baseline targets.
Differential Revision: https://reviews.llvm.org/D129695
Skip inserting regular CFI instructions if using WinCFI.
This is based a fair amount on the corresponding ARM64 implementation,
but instead of trying to insert the SEH opcodes one by one where
we generate other prolog/epilog instructions, we try to walk over the
whole prolog/epilog range and insert them. This is done because in
many cases, the exact number of instructions inserted is abstracted
away deeper.
For some cases, we manually insert specific SEH opcodes directly where
instructions are generated, where the automatic mapping of instructions
to SEH opcodes doesn't hold up (e.g. for __chkstk stack probes).
Skip Thumb2SizeReduction for SEH prologs/epilogs, and force
tail calls to wide instructions (just like on MachO), to make sure
that the unwind info actually matches the width of the final
instructions, without heuristics about what later passes will do.
Mark SEH instructions as scheduling boundaries, to make sure that they
aren't reordered away from the instruction they describe by
PostRAScheduler.
Mark the SEH instructions with the NoMerge flag, to avoid doing
tail merging of functions that have multiple epilogs that all end
with the same sequence of "b <other>; .seh_nop_w, .seh_endepilogue".
Differential Revision: https://reviews.llvm.org/D125648
This adds an extra check to ARMBaseInstrInfo::verifyInstruction to
verify the offsets used in addressing mode immediates using
isLegalAddressImm. Some tests needed fixing up as a result, adjusting
the opcode created from CMSE stack adjustments.
Differential Revision: https://reviews.llvm.org/D114939
This patch implements PAC return address signing for armv8-m. It roughly
accomplishes the following things:
- PAC and AUT instructions are generated.
- They're part of the stack frame setup, so that shrink-wrapping can move them
inwards to cover only part of a function
- The auth code generated by PAC is saved across subroutine calls so that AUT
can find it again to check
- PAC is emitted before stacking registers (so that the SP it signs is the one
on function entry).
- The new pseudo-register ra_auth_code is mentioned in the DWARF frame data
- With CMSE also in use: PAC is emitted before stacking FPCXTNS, and AUT
validates the corresponding value of SP
- Emit correct unwind information when PAC is replaced by PACBTI
- Handle tail calls correctly
Some notes:
We make the assembler accept the `.save {ra_auth_code}` directive that is
emitted by the compiler when it saves a register that contains a
return address authentication code.
For EHABI we need to have the `FrameSetup` flag on the instruction and
handle the `t2PACBTI` opcode (identically to `t2PAC`), so we can emit
`.save {ra_auth_code}`, instead of `.save {r12}`.
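For illustration, the overall prologue/epilogue shape this produces looks
roughly like the following (register list, directive placement and
instruction scheduling are simplified):
```
func:
        pac     r12, lr, sp        /* compute the auth code from LR and the SP on entry */
        push    {r7, r12, lr}      /* the auth code is stacked alongside the return address */
        .save   {r7, ra_auth_code, lr}
        ...
        pop     {r7, r12, lr}
        aut     r12, lr, sp        /* verify the restored return address */
        bx      lr
```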
For PACBTI-M, the instruction which computes the return address PAC should use
the SP value before the adjustment for the argument register save area (used
for variadic functions and when a parameter is split between stack and
registers), but at the same time it should come after the instruction that
saves FPCXT when compiling a CMSE entry function.
This patch moves the varargs SP adjustment after the FPCXT save (they are never
enabled at the same time), so in a following patch handling of the `PAC`
instruction can be placed between them.
Epilogue emission code adjusted in a similar manner.
PACBTI-M code generation should not emit any instructions for architectures
v6-m, v8-m.base, and for A- and R-class cores. A diagnostic message for such
cases will be handled separately by a future ticket.
A note on tail calls:
If the called function has four arguments that occupy registers `r0`-`r3`, the
only option for holding the function pointer itself is `r12`, but this register
is used to keep the PAC during the function prologue/epilogue, which would
clobber the function pointer.
When we do the tail call we need the five registers (`r0`-`r3` and `r12`) to
keep six values - the four function arguments, the function pointer and the PAC,
which is obviously impossible.
One option would be to authenticate the return address before all callee-saved
registers are restored, so we have a scratch register to temporarily keep the
value of `r12`. The issue with this approach is that it violates a fundamental
invariant that PAC is computed using CFA as a modifier. It would also mean using
separate instructions to pop `lr` and the rest of the callee-saved registers,
which would offset the advantages of doing a tail call.
Instead, this patch disables indirect tail calls when the called function takes
four or more arguments and return address signing and authentication is enabled
for the caller function, conservatively assuming the caller function would spill
LR.
This patch is part of a series that adds support for the PACBTI-M extension of
the Armv8.1-M architecture, as detailed here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/armv8-1-m-pointer-authentication-and-branch-target-identification-extension
The PACBTI-M specification can be found in the Armv8-M Architecture Reference
Manual:
https://developer.arm.com/documentation/ddi0553/latest
The following people contributed to this patch:
- Momchil Velikov
- Ties Stuij
Reviewed By: danielkiss
Differential Revision: https://reviews.llvm.org/D112429
We can't use the existing pseudo ARM::tLDRLIT_ga_pcrel for loading the
stack guard for PIC code that references the GOT, since arm-pseudo may
expand this to the narrow tLDRpci rather than the wider t2LDRpci.
Create a new pseudo, t2LDRLIT_ga_pcrel, and expand it to t2LDRpci.
Fixes: https://bugs.chromium.org/p/chromium/issues/detail?id=1270361
Reviewed By: ardb
Differential Revision: https://reviews.llvm.org/D114762
Recently a vulnerability was found in the implementation of the VLLDM
instruction in the Arm Cortex-M33, Cortex-M35P and Cortex-M55. If the
VLLDM instruction is abandoned due to an exception when it is partially
completed, it is possible for a subsequent non-secure handler to access
and modify the partially restored register values. This vulnerability is
identified as CVE-2021-35465.
The mitigation sequence varies between v8-m and v8.1-m as follows:
v8-m.main
---------
mrs r5, control
tst r5, #8 /* CONTROL_S.SFPA */
it ne
.inst.w 0xeeb00a40 /* vmovne s0, s0 */
1:
vlldm sp /* Lazy restore of d0-d16 and FPSCR. */
v8.1-m.main
-----------
vscclrm {vpr} /* Clear VPR. */
vlldm sp /* Lazy restore of d0-d16 and FPSCR. */
More details on
developer.arm.com/support/arm-security-updates/vlldm-instruction-security-vulnerability
Differential Revision: https://reviews.llvm.org/D109157
When expanding the non-secure call instruction we emit code to clear the
secure floating-point registers only if the targeted architecture has
floating-point support. The potential problem is when the source code
containing non-secure calls is built with -mfloat-abi=soft but some other
part of the system has been built with -mfloat-abi=softfp (soft and softfp
are compatible as they use the same procedure call standard). In this case
floating-point registers could leak to non-secure state, as the non-secure
call expansion won't have cleared them, on the assumption that no floating
point has been used.
Differential Revision: https://reviews.llvm.org/D109153
As a part of D107642, this adds pseudo instructions for the MQQPR and
MQQQQPR register classes that can spill and reload entire registers
whilst keeping them combined, rather than splitting them into the
multiple D subregs that a VLDMIA/VSTMIA would use. This can help certain
analyses, and helps to prevent verifier issues with subreg liveness.
This assert is intended to ensure that high registers are not selected
when a value is passed to one of the Thumb UXT instructions. However, it
was triggering even for 32-bit values, where no UXT instruction is
emitted.
Fixes PR51313.
Differential Revision: https://reviews.llvm.org/D107363
Based on the same for AArch64: 4751cadcca45984d7671e594ce95aed8fe030bf1
At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.
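The problem being avoided looks roughly like this (registers and the
spill slot are illustrative):
```
.Lretry:
        ldrex   r0, [r4]           /* start an exclusive access */
        str     r0, [sp, #8]       /* -O0 spill inserted by the fast register allocator */
        adds    r0, r0, #1
        strex   r1, r0, [r4]       /* can now fail every time: the spill store may have
                                      cleared the exclusive monitor */
        cmp     r1, #0
        bne     .Lretry            /* potential infinite loop */
```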
Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support. However, the 4-byte encodings
are not present for ARM v8-M Baseline. To enable this, two new pseudos are
added for Thumb which are only valid for v8mbase, tCMP_SWAP_8 and
tCMP_SWAP_16.
The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (adding another implementation of atomic operations and more
divergence between backends) I have chosen to follow the approach taken in
D101163.
Differential Revision: https://reviews.llvm.org/D101898
Depends on D101912
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Originally submitted as 3338290c187b254ad071f4b9cbf2ddb2623cefc0.
Reverted in c7df6b1223d88dfd15248fbf7b7b83dacad22ae3.
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164