When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLLDM and
VLSTM instruction.
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
This teaches dagcombiner to fold:
`(asr (add nsw x, y), 1) -> (avgfloors x, y)`
`(lsr (add nuw x, y), 1) -> (avgflooru x, y)`
as well the combine them to a ceil variant:
`(avgfloors (add nsw x, y), 1) -> (avgceils x, y)`
`(avgflooru (add nuw x, y), 1) -> (avgceilu x, y)`
iff valid for the target.
Removes some of the ARM MVE patterns that are now dead code.
It adds the avg opcodes to `IsQRMVEInstruction` as to preserve the
immediate splatting as before.
This adds the `llvm.sincos` intrinsic, legalization, and lowering.
The `llvm.sincos` intrinsic takes a floating-point value and returns
both the sine and cosine (as a struct).
```
declare { float, float } @llvm.sincos.f32(float %Val)
declare { double, double } @llvm.sincos.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincos.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincos.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincos.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincos.v4f32(<4 x float> %Val)
```
The lowering is built on top of the existing FSINCOS ISD node, with
additional type legalization to allow for f16, f128, and vector values.
With -fomit-frame-pointer, even if we set up a frame pointer for other
reasons (e.g. variable-sized or over-aligned stack allocations), we
don't need to create an ABI-compliant frame record. This means that we
can save all of the general-purpose registers in one push, instead of
splitting it to ensure that the frame pointer and link register are
adjacent on the stack, saving two instructions per function.
We don't need to copy byval arguments to tail calls via a temporary, if
we can prove that we are not copying from the outgoing argument area.
This patch does this when the source if the argument is one of:
* Memory in the local stack frame, which can't be used for tail-call
arguments.
* A global variable.
We can also avoid doing the copy completely if the source and
destination are the same memory location, which is the case when the
caller and callee have the same signature, and pass some arguments
through unmodified.
When passing byval arguments to tail-calls, we need to store them into
the stack memory in which this the caller received it's arguments. If
any of the outgoing arguments are forwarded from incoming byval
arguments, then the source of the copy is from the same stack memory.
This can result in the copy corrupting a value which is still to be
read.
The fix is to first make a copy of the outgoing byval arguments in local
stack space, and then copy them to their final location. This fixes the
correctness issue, but results in extra copying, which could be
optimised.
Byval arguments which are passed partially in registers get stored into
the local stack frame, but it is valid to tail-call them because the
part which gets spilled is always re-loaded into registers before doing
the tail-call, so it's OK for the spill area to be deallocated.
The ARM backend was checking that the outgoing values for a tail-call
matched the incoming argument values of the caller. This isn't
necessary, because the caller can change the values in both registers
and the stack before doing the tail-call. The actual limitation is that
the callee can't need more stack space for it's arguments than the
caller does.
This is needed for code using the musttail attribute, as well as
enabling tail calls as an optimisation in more cases.
There are lots of reasons a call might not be eligible for tail-call
optimisation, this adds debug trace to help understand the compiler's
decisions here.
When using AAPCS-compliant frame chains with PACBTI return address
signing, there ware a number of bugs in the generation of the frame
pointer and function prologues. The most obvious was that we sometimes
would modify r11 before pushing it to the stack, so it wasn't preserved
as required by the PCS. We also sometimes did not push R11 and LR
adjacent to one another on the stack, or used R11 as a frame pointer
without pointing it at the saved value of R11, both of which are
required to have an AAPCS compliant frame chain.
The original work of this patch was done by James Westwood, reviewed as
#82801 and #81249, with some tidy-ups done by Mark Murray and myself.
This is a recommit of #107120 . The original PR was approved but failed
buildbot. The newly added tests should only be run for compilers that
support the ARM target. This has been resolved by adding a config file
for these tests.
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make ARM backend generate full word loads instead of
loading a single byte (ldrb) and/or half word (ldrh). Only pads
destination when it's a stack allocated constant size array and source
when it's constant string. Heuristic to decide whether to pad or not
is very basic and could be improved to allow more examples to be
padded.
- Pass works at the midend level
Fixes#113154
The encodings used for llvm.trap() on ARM were all marked as barriers
and terminators. This lead to stack frame destroy code being inserted
before the trap if the trap was the last thing in the function and it
had no return statement.
```
void fn() {
volatile int i = 0;
__builtin_trap();
}
```
Produced:
```
fn:
push {r11, lr} << stack frame create
<...>
mov sp, r11
pop {r11, lr} << stack frame destroy
.inst 0xe7ffdefe << trap
bx lr
```
All the other targets don't mark them this way, instead they mark them
with isTrap. I've changed ARM to do this, which fixes the code
generation:
```
fn:
push {r11, lr} << stack frame create
<...>
.inst 0xe7ffdefe << trap
mov sp, r11
pop {r11, lr} << stack frame destroy
bx lr
```
I've updated the existing trap test to force the need for a stack frame,
then check that the instruction immediately after the trap is resetting
the stack pointer.
debugtrap was already working but I've added the same checks for it
anyway.
**Change relanded after feedback on failures and improvements to the
check of the addend. Original PR #111970**
Changes from original patch:
- The value that is being checked has changed, it is now correctly
checking any Addend for the instruction, rather than the Value. The
addend is kept within the Target data structure from my investigation.
- Removed changes to the following tests due to the original behaviour
being correct, and my original patch causing unexpected errors
- llvm/test/MC/ARM/Windows/mov32t-range.s
- llvm/test/MC/MachO/ARM/thumb2-movw-fixup.s
As per the ARM ABI, the MOVT and MOVW instructions should have addends
that fall within a 16bit signed range. LLVM does not check this so it is
possible to use addends that are beyond the accepted range. These
addends are silently truncated.
A new check is added to ensure the addend falls within the expected
range, rejecting an addend that falls outside with an error.
Information relating to the ABI requirements can be found here:
https://github.com/ARM-software/abi-aa/blob/main/aaelf32/aaelf32.rst#addends-and-pc-bias-compensation
The previous behavior could be harmful in some edge cases, such as
emitting a call to `fma()` in the `fma()` implementation itself.
Do this by just being more accurate in `isFMAFasterThanFMulAndFAdd()`.
This was already done for PowerPC; this commit just extends that to Arm,
z/Arch, and x86. MIPS and SPARC already got it right, but I added tests
for them too, for good measure.
Note: I don't have commit access.
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
Add support for using a thread-local variable with a specified offset
for holding the stack guard canary value. This supports both 32- and 64-
bit PowerPC targets.
This mirrors changes from #108942 but targeting PowerPC instead of
RISCV. Because both of these PRs modify the same driver functions, this
series is stack on top of the RISC-V one.
---------
Signed-off-by: Keith Packard <keithp@keithp.com>
- Pass optimizes memcpy's by padding out destinations and sources to a
full word to make backend generate full word loads instead of loading a
single byte (ldrb) and/or half word (ldrh). Only pads destination when
it's a stack allocated constant size array and source when it's constant
array. Heuristic to decide whether to pad or not is very basic and could
be improved to allow more examples to be padded.
- Pass works within GlobalOpt but is disabled by default on all targets
except ARM.
The register list in vscclrm is unusual in three ways:
* The encoded size can be zero, meaning the list contains only vpr.
* Double-precision registers past d15 are permitted even when the
subtarget doesn't have them, they are instead ignored when the
instruction executes.
* The single-precision variant allows double-precision registers d16
onwards, which are encoded as a pair of single-precision registers.
Fixing this also incidentally changes a vlldm/vlstm error message: when
the first register is in the range d16-d31 we now get the "operand must
be exactly..." error instead of "register expected".
When using AAPCS-compliant frame chains with PACBTI return address
signing, there ware a number of bugs in the generation of the frame
pointer and function prologues. The most obvious was that we sometimes
would modify r11 before pushing it to the stack, so it wasn't preserved
as required by the PCS. We also sometimes did not push R11 and LR
adjacent to one another on the stack, or used R11 as a frame pointer
without pointing it at the saved value of R11, both of which are
required to have an AAPCS compliant frame chain.
The original work of this patch was done by James Westwood, reviewed as
#82801 and #81249, with some tidy-ups done by Mark Murray and myself.
This fixes all the places that hit the new assertion added in
https://github.com/llvm/llvm-project/pull/106524 in tests. That is,
cases where the value passed to the APInt constructor is not an N-bit
signed/unsigned integer, where N is the bit width and signedness is
determined by the isSigned flag.
The fixes either set the correct value for isSigned, set the
implicitTrunc flag, or perform more calculations inside APInt.
Note that the assertion is currently still disabled by default, so this
patch is mostly NFC.
… (#111970)"
I was made aware of breakages in Windows/ARM, so reverting while I
investigate.
This reverts commit f3aebe623b49b7ae14d0f0996999114aac052e4b.
Gentoo is planning to introduce a `*t64` suffix for triples that will be
used by 32-bit platforms that use 64-bit `time_t`. Add support for
parsing and accepting these triples, and while at it make clang
automatically enable the necessary glibc feature macros when this suffix
is used.
An open question is whether we can backport this to LLVM 19.x. After
all, adding new triplets to Triple sounds like an ABI change — though I
suppose we can minimize the risk of breaking something if we move new
enum values to the very end.
ARM uses complex encoding of immediate values using small number of
bits. As a result, some values cannot be represented as immediate
operands, they need to be synthesized in a register. This change
implements legalization of such constants with loading values from
constant pool.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Previously, any value could be used for the MOVT and MOVW instructions,
however the ARM ABI dictates that the addend should be a signed 16 bit
value. To ensure this is followed, the Assembler will now check that
when using these instructions, the addend is a 16bit signed value, and
throw an error if this is not the case.
Information relating to the ABI requirements can be found here:
https://github.com/ARM-software/abi-aa/blob/main/aaelf32/aaelf32.rst#addends-and-pc-bias-compensation
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.