In the unlikely case where the stack size is greater than 4GB, we may run into
the situation where the local stack size and the callee saved registers stack
size get combined incorrectly when restoring the callee saved registers. This
happens because the stack size in shouldCombineCSRLocalStackBumpInEpilogue
is represented as an 'unsigned', but is passed in as an 'int64_t'. We end up with
something like
$fp, $lr = frame-destroy LDPXi $sp, 536870912
This change just makes 'shouldCombineCSRLocalStackBumpInEpilogue' match
'shouldCombineCSRLocalStackBump' where 'StackBumpBytes' is an 'uint64_t'
This is already defined for each register class in AArch64RegisterInfo,
not hardcoding it here makes these values easier to change (perhaps
based on hardware mode).
When generating function epilogues using AArch64 stack tagging, we can
fold an SP update into the tag-setting loop. The loop tags 32 bytes at a
time using ST2G, so the final SP update might be done either by a post
indexed STG which tags the final 16 bytes of the tagged region, or by an
ADD/SUB instruction after the loop. However, we were only considering
the range of the ADD/SUB instructions when deciding whether to do this,
and the valid immediate range for STG is slightly lower when the offset
is positive, because it is a signed immediate, and must include the
extra 16 bytes being tagged.
When merging STG instructions used for AArch64 stack tagging, we were
stopping on reaching a load or store instruction, but not calls, so it
was possible for an STG to be moved past a call to memcpy.
This test case (reduced from fuzzer-generated C code) was the result of
StackColoring merging allocas A and B into one stack slot, and
StackSafetyAnalysis proving that B does not need tagging, so we end up
with tagged and untagged objects in the same stack slot. The tagged
object (A) is live first, so it is important that it's memory is
restored to the background tag before it gets reused to hold B.
The `-aarch64-stack-hazard-size=<val>` option disables register paring
(as the hazard padding may mean the offset is too large for STP/LDP).
This broke setting the frame-pointer offset, as the code to find the
frame record looked for a (FP, LR) register pair.
This patch resolves this by looking for FP, LR as two unpaired registers
when hazard padding is enabled.
Previously, with `-fzero-call-used-regs` clang/LLVM would incorrectly
emit Neon instructions in streaming functions, and streaming-compatible
functions without SVE.
With this change:
* In streaming functions, Z/p registers will be zeroed
* In streaming compatible functions w/o SVE, D registers will be zeroed
- (As Neon vector instructions are illegal including `movi v..`)
This patch aims to reduce the include used by AArch64ISelLowering, allowing it
to be included by unittests so that they can reference the AArch64ISD nodes.
It:
- Moves the inclusion of AArch64SMEAttributes.h to the uses.
- Moves LowerPtrAuthGlobalAddressStatically to a static function, so that
AArch64PACKey is not required in the header.
- Moves the definitions of getExceptionPointerRegister to the cpp file, to
remove the reference of AArch64::X0.
This is defined by the `-aarch64-streaming-hazard-size` option or its
alias `-aarch64-stack-hazard-size` (the original name). It has been
renamed to be more general as this option will (for the time being) be
used to detect if the current target has streaming mode memory hazards.
---------
Co-authored-by: Hari Limaye <hari.limaye@arm.com>
As part of FEAT_PAuthLR, a new DWARF Frame Instruction was introduced,
`DW_CFA_AARCH64_negate_ra_state_with_pc`. This instructs Libunwind that
the PC has been used with the signing instruction. This change includes
three commits
- Libunwind support for the newly introduced DWARF Instruction
- CodeGen Support for the DWARF Instructions
- Reversing the changes made in #96377. Due to
`DW_CFA_AARCH64_negate_ra_state_with_pc`'s requirements to be placed
immediately after the signing instruction, this would mean the CFI
Instruction location was not consistent with the generated location when
not using FEAT_PAuthLR. The commit reverses the changes and makes the
location consistent across the different branch protection options.
While this does have a code size effect, this is a negligible one.
For the ABI information, see here:
853286c7ab/aadwarf64/aadwarf64.rst (id23)
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.
Note: I don't have commit access.
The iterator passed to `fixupCalleeSaveRestoreStackOffset` may be
incorrect when it tries to skip over the instructions that get the
current value of 'vg', when there is a 'rdsvl' instruction straight
after the prologue. That's because it doesn't check that the instruction
is still a 'frame-setup' instruction.
…me lowering
SME instructions can only be used in streaming mode. PTRUE for
predicated counter and the ld/st pair can be used when:
sve2.1 is available or
sme2 available in function in streaming mode.
Previously the frame lowering only checking if sme2 available when
building the machine instruction.
This fix checks if sme2 is available and is subtarget in streaming mode
In https://reviews.llvm.org/D159196 we avoided stackslot scavenging
when there was no FP available. But in the case where FP is available
we need to actually prefer using the FP over the BP.
This change affects more than just SME, but it should be a general
improvement, since any slot above the (address pointed to by) FP
is always closer to FP than BP, so it makes sense to always favour
using the FP to address it when the FP is available.
This also fixes the issue for SME where this is not just preferred
but required.
This patch fixes incorrect usage of scalar+immediate variant of ld1/st1
instructions during stack allocation caused by
[c4bac7f](c4bac7f7dc).
This commit used ld1/st1 even when stack offset was outside of immediate
range for this instruction, producing invalid assembly. This commit was also using incorrect offsets when using ld1/st1.
For the tests I just added +sve instead of what actual hardware has, which is only SME,
since otherwise all the test functions need to be marked as streaming mode.
rdar://121864771
On Darwin we don't have any hardware that has SVE support, only SME.
Therefore we don't need to save VG for unwinders and can safely omit it.
This also fixes crashes introduced since this feature landed since Darwin's
compact unwind code can't handle the presence of VG anyway.
rdar://131072344
Emit an optimization remark when objects in the stack frame may cause
hazards in a streaming mode function. The analysis requires either the
`aarch64-stack-hazard-size` or `aarch64-stack-hazard-remark-size` flag
to be set by the user, with the former flag taking precedence.
This patch tries to clean up some of the existing values in
getMemOpInfo. All values should now be in bytes (not bits), and the
MinOffset/MaxOffset are now always represented unscaled (the immediate
that will be present in the final instruction).
Although I could not find a place where it altered codegen, the offset
of a post-index instruction will be 0, not scale*imm. A
IsPostIndexLdStOpcode method has been added to try and make sure that
case is handled properly.
StackFrameLayoutAnalysis currently calculates SP-relative offsets in a
target-independent way via MachineFrameInfo offsets. This is incorrect
for some Targets, e.g. AArch64, when there are scalable vector stack
slots.
This patch adds a virtual function to TargetFrameLowering to provide
offsets from SP, with a default implementation matching what is
currently used in StackFrameLayoutAnalysis, and refactors
StackFrameLayoutAnalysis to use this function. Only non-zero scalable
offsets are output by the analysis pass.
An implementation of this function is added for AArch64 targets, which
aims to provide correct SP offsets in most cases.
Under some SME contexts, a coprocessor with its own separate cache will
be used for FPR operations. This can create hazards if the CPU and the
SME unit try to access the same area of memory, including if the access
is to an area of the stack.
To try to alleviate that, this patch attempts to introduce extra padding
into the stack frame between FP and GPR accesses, controlled by the
StackHazardSize option. Without changing the layout of the stack frame,
a stack object of the right size is added between GPR and FPR CSRs.
Another is added to the stack objects section, and stack objects are
sorted so that FPR > Hazard padding slot > GPRs (where possible).
Unfortunately some things are not handled well (VLA area, FPR arguments
on the stack, object with both GPR and FPR accesses), but if those are
controlled by the user then the entire stack frame becomes GPR at the
start/end with FPR in the middle, surrounded by Hazard padding. This can
greatly help reduce something that can be difficult for the user to
control themselves.
The current implementation is opt-in through an
-aarch64-stack-hazard-size flag, and should have no effect if the option
is unset. In the long run the implementation might change (for example
using more base pointers to separate in more cases, re-enabling ldp/stp
using an extra register, etc), but this gets at least something for
people to use in llvm-19 if they need it. The only change whilst the
option is unset will be a fix for making sure the stack increment is
added at the right place when it cannot be converted to postinc
(++MBBI). I believe without extra padding that can not normally be
reached.
This re-introduces the effective behaviour that was reverted in
7ad481e76c9bee5b9895ebfa0fdb52f31cb7de77.
This time we're not using the same mechanism, exposing another
reservation feature
that prevents only regalloc from using the register, but not for other
required uses
like ABIs.
This also fixes a consequent issue with reserving LR, which is that
frame lowering
was only adding live-in flags for non-reserved regs. This would cause
issues later
since the outliner needs accurate flags to determine when LR needs to be
preserved.
rdar://131313095
These registers include:
- X19, used by LLVM as the base pointer
- X15 on Windows, where it is used for stack allocation. It can still be
used on Linux/Darwin.
- Adjust FrameLowering scratch register code to not assume X9 is
available if the calling convention is preserve_nonecc. The code will
then pick an unused register as scratch, and allow X9 to continue being
used for argument passing.
If a function requires any streaming-mode change, the vector granule
value must be stored to the stack and unwind info must also describe the
save of VG to this location.
This patch adds VG to the list of callee-saved registers and increases
the
callee-saved stack size if the function requires streaming-mode changes.
A new type is added to RegPairInfo, which is also used to skip restoring
the register used to spill the VG value in the epilogue.
See
https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst
In the reverted change, the order of the IR was dependent on the host
compiler, because we inserted instructions in arguments to functions.
Fix that, and also fix another problem with the test.
This reverts commit 3313f28897a87ec313ec0b52ef71c14d3b9ff652.
https://github.com/llvm/llvm-project/pull/79940 put calls to
recomputeLiveIns into
a loop, to repeatedly call the function until the computation converges.
However,
this repeats a lot of code. This changes moves the loop into a function
to simplify
the handling.
Note that this changes the order in which recomputeLiveIns is called.
For example,
```
bool anyChange = false;
do {
anyChange = recomputeLiveIns(*ExitMBB) || recomputeLiveIns(*LoopMBB);
} while (anyChange);
```
only begins to recompute the live-ins for LoopMBB after the computation
for ExitMBB
has converged. With this change, all basic blocks have a recomputation
of the live-ins
for each loop iteration. This can result in less or more calls,
depending on the
situation.
This patch removes the `-reverse-csr-restore-seq` option from
AArch64FrameLowering, since this is no longer used.
This patch was reverted because of a crash in PR#79623.
Merging it back as it was fixed in PR#82492.
This is needed by PR#77665[1] that uses a P-register while restoring
Z-registers.
The reverse for SVE register restore in the epilogue was added to
guarantee performance, but further work was done to improve sve frame
restore and besides that the schedule also may change the order of the
restore, undoing the reverse restore.
This also fix the problem reported in (PR #79623) on Windows with
std::reverse and .base().
[1]https://github.com/llvm/llvm-project/pull/77665
Certain stack probing sequences might clobber flags, then we can't use a
block as a prologue if the flags register is a live-in on entry to that
block.