243 Commits

Author SHA1 Message Date
Harrison Hao
8c14d3f44f
[MISched] Use SchedRegion in overrideSchedPolicy and overridePostRASchedPolicy (#149297)
This patch updates `overrideSchedPolicy` and `overridePostRASchedPolicy` to take
a `SchedRegion` parameter instead of just `NumRegionInstrs`. This provides access
to both the instruction range and the parent `MachineBasicBlock`, which enables
looking up function-level attributes.

With this change, targets can select the post-RA scheduling direction per
function using a function attribute. For example:

```cpp
void overridePostRASchedPolicy(MachineSchedPolicy &Policy,
                               const SchedRegion &Region) const {
  const Function &F = Region.RegionBegin->getMF()->getFunction();
  Attribute Attr = F.getFnAttribute("amdgpu-post-ra-direction");
  ...
}
```
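
A hedged sketch of how the elided body might continue: `Policy.OnlyTopDown` and
`Policy.OnlyBottomUp` are the existing `MachineSchedPolicy` direction knobs, while
the attribute value strings here are illustrative assumptions, not part of the patch.

```cpp
// Hypothetical continuation (illustrative): map the attribute string
// onto the existing MachineSchedPolicy direction flags.
if (Attr.isValid()) {
  StringRef Direction = Attr.getValueAsString();
  if (Direction == "topdown")
    Policy.OnlyTopDown = true;
  else if (Direction == "bottomup")
    Policy.OnlyBottomUp = true;
}
```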
2025-07-22 15:55:12 +08:00
Sjoerd Meijer
d26106dbf0
[AArch64] Set the cache line size to 64 for the V2 and V3. (#148213)
This sets the cache line size to 64 for the Neoverse V2 and V3. I've
tested this with loop-interchange: it doesn't result in extra
compile time, but it does enable a lot more interchange.
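
A hedged sketch of the kind of change involved, assuming the usual per-CPU
tuning switch in `AArch64Subtarget::initializeProperties()` (a sketch, not the
verbatim patch):

```cpp
// Sketch: set the subtarget cache line size for these cores so that
// TTI::getCacheLineSize() reports 64 (e.g. to loop-interchange).
void AArch64Subtarget::initializeProperties() {
  switch (ARMProcFamily) {
  case NeoverseV2:
  case NeoverseV3:
    CacheLineSize = 64;
    break;
  default:
    break;
  }
}
```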
2025-07-15 14:59:18 +01:00
Peter Collingbourne
2197671109
AArch64: Relax x16/x17 constraint on AUT in certain cases.
On most operating systems, the x16 and x17 registers are not special,
so there is no benefit, and only a code size cost, to constraining AUT to
only using them. Therefore, adjust the backend to only use the AUT pseudo
(renamed AUTx16x17 for clarity) on Darwin platforms. All other platforms
use an unconstrained variant of the pseudo, AUTxMxN, for selection.

Reviewers: ahmedbougacha, kovdan01, atrosinenko

Reviewed By: atrosinenko

Pull Request: https://github.com/llvm/llvm-project/pull/132857
2025-07-09 13:46:44 -07:00
Guy David
bb1f5c3189
[AArch64] Lower jump table cases threshold to 10 (#143632)
Previous stabs at this setting
(https://github.com/llvm/llvm-project/pull/71166) hypertuned it for
SPEC2017, but Clang's own compilation can benefit from a slightly lower
threshold, yielding a 0.3% improvement in compile time, while still not
regressing SPEC.
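
For reference, a hedged sketch of the threshold knob being lowered (consistent
with the option referenced by the "Set min jump table entries to 13" commit
further down; the exact declaration is an assumption, not the verbatim patch):

```cpp
// Sketch: the AArch64 jump-table threshold with its new default of 10.
static cl::opt<unsigned> AArch64MinimumJumpTableEntries(
    "aarch64-min-jump-table-entries", cl::init(10), cl::Hidden,
    cl::desc("Set minimum number of entries to use a jump table on AArch64"));
```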

Most notable beneficiaries of this change are:
 - `llvm::Instruction::getNumSuccessors` (11 cases)
 - `llvm::Instruction::getSuccessor` (11 cases)
 
Test Suite with a bootstrapped build:
```
Tests: 4316
Metric: compile_time

Program                                       compile_time             
                                              lhs          rhs    diff 
SingleSour...ce/UnitTests/SignlessTypes/div     0.02         0.02  3.0%
SingleSour.../UnitTests/SignlessTypes/cast2     0.02         0.02  2.8%
SingleSource/Benchmarks/Misc/flops-4            0.02         0.02  1.9%
SingleSour...ebra/solvers/cholesky/cholesky     0.05         0.05  1.8%
SingleSour...tTests/2020-01-06-coverage-006     0.02         0.02  1.7%
SingleSour...ce/Benchmarks/Stanford/FloatMM     0.03         0.03  1.7%
SingleSour...9-04-16-BitfieldInitialization     0.02         0.02  1.7%
SingleSour...nitTests/2003-07-08-BitOpsTest     0.02         0.02  1.7%
MultiSourc...marks/Prolangs-C++/vcirc/vcirc     0.02         0.02  1.6%
MultiSourc...Prolangs-C/fixoutput/fixoutput     0.05         0.05  1.5%
SingleSour...h/stencils/jacobi-1d/jacobi-1d     0.04         0.04  1.4%
MultiSourc...rks/Prolangs-C++/office/office     0.28         0.28  1.4%
SingleSour...arks/Adobe-C++/functionobjects     0.39         0.40  1.3%
SingleSour...Tests/2003-10-29-ScalarReplBug     0.02         0.02  1.2%
SingleSour...arks/Adobe-C++/stepanov_vector     0.41         0.42  1.2%
                           Geomean difference                     -0.3%
      compile_time                         
l/r            lhs          rhs        diff
count  4316.000000  4316.000000  469.000000
mean   0.057747     0.057595    -0.003034  
std    0.544528     0.543139     0.007625  
min    0.000000     0.000000    -0.035294  
25%    0.000000     0.000000    -0.007006  
50%    0.000000     0.000000    -0.003257  
75%    0.000000     0.000000     0.000000  
max    18.295300    18.252500    0.030151
```
2025-06-19 01:53:36 +03:00
Ties Stuij
269f5fe91e
[AARCH64] Add support for Cortex-A320 (#139055)
This patch adds initial support for the recently announced Armv9
Cortex-A320 processor.

For more information, including the Technical Reference Manual, see:
https://developer.arm.com/Processors/Cortex-A320

---------

Co-authored-by: Oliver Stannard <oliver.stannard@arm.com>
2025-05-09 16:24:48 +01:00
Ricardo Jesus
847e46ca01
[AArch64] Add initial support for -mcpu=olympus. (#132368)
This patch adds support for the NVIDIA Olympus core.

This does not add any special tuning decisions, and those may come
later.
2025-03-25 08:09:04 +00:00
Kazu Hirata
41b76119ec
[llvm] Use range constructors for *Set (NFC) (#132636) 2025-03-23 15:50:34 -07:00
Kazu Hirata
71935281e0
[Target] Use *Set::insert_range (NFC) (#132140)
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range.  This patch replaces:

  Dest.insert(Src.begin(), Src.end());

with:

  Dest.insert_range(Src);

This patch does not touch custom begin like succ_begin for now.
2025-03-20 09:09:30 -07:00
David Green
6424abcd6c
[AArch64] Enable AvoidLDAPUR for cpu=generic between armv8.4 and armv9.3. (#125261)
As added in #124274, CPUs in this range can suffer from performance
issues with ldapur. As the gain from ldar->ldapr is expected to be
greater than the minor gain from ldapr->ldapur, this opts to avoid the
instruction under the default -mcpu=generic when the -march is less than
armv8.8 / armv9.3.

I renamed AArch64Subtarget::Others to AArch64Subtarget::Generic to be
clearer what it means.
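
A hedged sketch of the gating this describes (the helper and feature-getter
names are assumptions, not the verbatim patch):

```cpp
// Sketch: besides CPUs that set FeatureAvoidLDAPUR explicitly, apply the
// tuning for -mcpu=generic when the architecture is in the affected
// v8.4..v8.7 / v9.0..v9.2 window.
bool AArch64Subtarget::avoidLDAPUR() const {
  if (hasAvoidLDAPUR())
    return true;
  return ARMProcFamily == Generic && hasV8_4aOps() && !hasV8_8aOps();
}
```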
2025-02-07 10:16:57 +00:00
Benjamin Maxwell
82c6b8f7bb
[AArch64][SME] Spill p-regs as z-regs when streaming hazards are possible (#123752)
This patch adds a new option, `-aarch64-enable-zpr-predicate-spills`
(disabled by default), which replaces predicate spills with vector
spills in streaming[-compatible] functions.

For example:

```
str	p8, [sp, #7, mul vl]            // 2-byte Folded Spill
// ...
ldr	p8, [sp, #7, mul vl]            // 2-byte Folded Reload
```

Becomes:

```
mov	z0.b, p8/z, #1
str	z0, [sp]                        // 16-byte Folded Spill
// ...
ldr	z0, [sp]                        // 16-byte Folded Reload
ptrue	p4.b
cmpne	p8.b, p4/z, z0.b, #0
```

This is done to avoid streaming memory hazards between FPR/vector and
predicate spills, which currently occupy the same stack area even when
the `-aarch64-stack-hazard-size` flag is set.

This is implemented with two new pseudos, SPILL_PPR_TO_ZPR_SLOT_PSEUDO
and FILL_PPR_FROM_ZPR_SLOT_PSEUDO. The expansion of these pseudos
handles scavenging the required registers (z0 in the above example) and,
in the worst case, spilling a register to an emergency stack slot. The
condition flags are also preserved around the `cmpne` in case they are
live at the expansion point.
2025-02-03 12:04:46 +00:00
Benjamin Maxwell
6e1ea7e5a7
[AArch64] Set the default streaming hazard size to 1024 for +sme,+sve (#123753)
The default for all other feature combinations remains at zero (i.e. no
streaming hazards). This value may be adjusted in the future (e.g. based
on the processor family); for now, it is set conservatively.
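
A hedged sketch of the default selection (the option and member names are
assumptions; `-aarch64-streaming-hazard-size` is the flag described in the
getStreamingHazardSize() commit below):

```cpp
// Sketch: an explicit -aarch64-streaming-hazard-size wins; otherwise the
// default is 1024 bytes for +sme,+sve and 0 (no hazards) elsewhere.
unsigned StreamingHazardSize =
    AArch64StreamingHazardSize.getNumOccurrences() > 0
        ? AArch64StreamingHazardSize
        : (hasSME() && hasSVE()) ? 1024 : 0;
```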
2025-01-22 09:08:17 +00:00
Sander de Smalen
61510b51c3 Revert "[AArch64] Enable subreg liveness tracking by default."
This reverts commit 9c319d5bb40785c969d2af76535ca62448dfafa7.

Some issues were discovered with the bootstrap builds, which appear to
have been caused by this commit. I'm reverting to investigate.
2024-12-12 17:22:15 +00:00
Sander de Smalen
9c319d5bb4 [AArch64] Enable subreg liveness tracking by default.
Internal testing didn't flag any functional or performance regressions.
2024-12-12 16:05:49 +00:00
Kinoshita Kotaro
a1197a2ca8
[AArch64] Add initial support for FUJITSU-MONAKA (#118432)
This patch adds initial support for FUJITSU-MONAKA CPU (-mcpu=fujitsu-monaka).

The scheduling model will be corrected in the future.
2024-12-09 09:56:02 +09:00
Sjoerd Meijer
9bccf61f5f
[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385)
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
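
A hedged sketch of the subtarget side, assuming the same per-CPU tuning switch
in `AArch64Subtarget::initializeProperties()` sketched earlier for the
cache-line change (the epilogue-VF adjustment is a separate vectorizer knob,
not shown):

```cpp
// Sketch (fragment of the per-CPU switch in initializeProperties()):
case NeoverseV2:
case NeoverseV3:
  // Four SIMD pipelines -> interleave the vectorised loop up to 4x.
  MaxInterleaveFactor = 4;
  break;
```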
2024-11-20 09:33:39 +00:00
Anatoly Trosinenko
44076c9822
[AArch64][PAC] Move emission of LR checks in tail calls to AsmPrinter (#110705)
Move the emission of the checks performed on the authenticated LR value
during tail calls to AArch64AsmPrinter class, so that different checker
sequences can be reused by pseudo instructions expanded there.
This adds one more option to the AuthCheckMethod enumeration: the generic
XPAC variant, which is not restricted to checking the LR register.
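
A hedged sketch of the resulting enumeration (the pre-existing variant names
are assumptions based on the checks described in the "Check authenticated LR
value during tail call" commit below; XPAC is the new generic variant):

```cpp
// Sketch: ways to check an authenticated LR value before a tail call.
enum class AuthCheckMethod {
  None,          // no check performed
  DummyLoad,     // dereference LR; assumes executable pages are readable
  HighBitsNoTBI, // compare the pointer's high bits (assumes no TBI)
  XPAC,          // new: generic variant, not restricted to LR
};
```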
2024-11-12 18:27:19 +03:00
David Green
1d6d073fbb [AArch64] Remove FeatureUseScalarIncVL
FeatureUseScalarIncVL is a tuning feature, used to control whether addvl or
add+cnt is used. It was previously added as a dependency for FeatureSVE2, an
architecture feature, but this can be seen as a layering violation. The main
disadvantage is that -use-scalar-inc-vl cannot be used without disabling sve2
and all dependent features.

This patch replaces that with an option which, if unset, defaults to hasSVE ||
hasSME, and is otherwise overridden by the given value. The hope is that no
CPUs will rely on the tuning feature (or we can re-add it if needed).
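
A hedged sketch of the described behaviour (the option and accessor names are
assumptions, not the verbatim patch):

```cpp
// Sketch: the command-line option overrides the default, which otherwise
// follows the architecture features as described above.
bool AArch64Subtarget::useScalarIncVL() const {
  if (UseScalarIncVL.getNumOccurrences() > 0)
    return UseScalarIncVL;
  return hasSVE() || hasSME();
}
```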
2024-11-10 14:51:55 +00:00
Sander de Smalen
5192cb772a [AArch64] Add hidden option to enable subreg liveness tracking.
Subreg liveness tracking is disabled by default for now until all issues
are ironed out. This option allows the feature to be used in tests.
2024-10-30 17:09:56 +00:00
Benjamin Maxwell
ddd463be7e
[AArch64] Add getStreamingHazardSize() to AArch64Subtarget (#113679)
This is defined by the `-aarch64-streaming-hazard-size` option or its
alias `-aarch64-stack-hazard-size` (the original name). It has been
renamed to be more general as this option will (for the time being) be
used to detect if the current target has streaming mode memory hazards.

---------

Co-authored-by: Hari Limaye <hari.limaye@arm.com>
2024-10-28 13:01:22 +00:00
Madhur Amilkanthwar
b73771cf0f
[AArch64] Increase scatter overhead on Neoverse-V2 (#101296)
This patch increases the scatter overhead on Neoverse-V2 to 13. This
enables vectorization of the s128 kernel from the TSVC_2 test suite,
boosting its performance by about 40%. SPEC 17, RAJAPerf, and Spatter
are unaffected by this patch.

It also includes some minor refactoring of the gather-related code.
2024-08-14 10:12:40 +05:30
Daniil Kovalev
56fd2472d8
[PAC] Sign LR with B key for non-leaf functions with ptrauth-returns attr (#100552)
For the pauthtest ABI, there is a bunch of ptrauth-* options, including
ptrauth-returns. Use the "ptrauth-returns" function attribute to indicate
the need for LR signing with the B key for non-leaf functions, avoiding
"sign-return-address" and "sign-return-address-key", which were
originally designed for pac-ret.

Co-authored-by: Ahmed Bougacha <ahmed@bougacha.org>
Co-authored-by: Anatoly Trosinenko <atrosinenko@accesssoftek.com>
2024-07-25 22:21:03 +03:00
Ahmed Bougacha
b8721fa0af
[AArch64][PAC] Sign block addresses used in indirectbr. (#97647)
Enabled in clang using:
    -fptrauth-indirect-gotos

and at the IR level using function attribute:
    "ptrauth-indirect-gotos"

Signing uses IA and a per-function integer discriminator. The
discriminator isn't ABI-visible, and is currently:
    ptrauth_string_discriminator("<function_name> blockaddress")

A sufficiently sophisticated frontend could benefit from per-indirectbr
discrimination, which would need additional machinery, such as allowing
"ptrauth" bundles on indirectbr. For our purposes, the simple scheme
above is sufficient.
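
A hedged sketch of how such a discriminator can be computed on the LLVM side
(`getPointerAuthStableSipHash` is the stable string-hash helper in
`llvm/Support/SipHash.h`; treat its exact use here as an assumption):

```cpp
// Sketch: derive the per-function, non-ABI-visible discriminator from
// the function name, per the scheme described above.
std::string DiscStr = (MF.getName() + " blockaddress").str();
uint16_t Disc = getPointerAuthStableSipHash(DiscStr);
```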

This approach doesn't support subtracting label addresses and using
the result as offsets, because each label address is signed.
Pointer arithmetic on signed pointers corrupts the signature bits,
and because label address expressions aren't typed beyond void*,
we can't do anything reliably intelligent on the arithmetic exprs.
Not signing addresses when used to form offsets would allow
easily hijacking control flow by overwriting the offset.

This diagnoses the basic cases (`&&lbl2 - &&lbl1`) in the frontend,
while we evaluate either alternative implementations (e.g., lowering
blockaddress to a bb number, and indirectbr to a checked jump-table),
or better diagnostics (both at the frontend level and on unencodable
IR constants).
2024-07-22 21:24:39 -07:00
Jon Roelofs
2b33591386
[llvm][AArch64] Support -mcpu=apple-m4 (#95478) 2024-06-14 17:24:45 -07:00
Jonathan Thackray
e80c59556d
[AArch64] Add support for Cortex-A725 and Cortex-X925 (#95214)
Cortex-A725 and Cortex-X925 are Armv9.2 AArch64 CPUs.

Technical Reference Manual for Cortex-A725:
   https://developer.arm.com/documentation/107652/latest

Technical Reference Manual for Cortex-X925:
   https://developer.arm.com/documentation/102807/latest
2024-06-13 00:00:57 +01:00
Wei Zhao
6b9753a0ec
[AArch64] Add support for Qualcomm Oryon processor (#91022)
Oryon is an ARM V8 AArch64 CPU from Qualcomm.

---------

Co-authored-by: Wei Zhao <wezhao@qti.qualcomm.com>
2024-06-06 07:27:50 -07:00
Sander de Smalen
1015f51dd9
[AArch64] NFC: Rename -force-streaming-compatible-sve to -force-streaming-compatible (#92774)
The behaviour of the flag should be equivalent to
__arm_streaming_compatible.

At the moment, the name suggests that '-force-streaming-compatible-sve'
on its own (i.e. without specifying `+sve`) enables the compiler to use
the streaming-compatible subset of SVE instructions, but the semantics
merely are that the function can be called with either PSTATE.SM=0 or
PSTATE.SM=1.
2024-05-22 07:58:54 +01:00
Jonathan Thackray
e50a857fb1
[AArch64] Add support for Cortex-R82AE and improve Cortex-R82 (#90440) 2024-04-30 14:15:01 +01:00
Jonathan Thackray
a670cdadca
[AArch64] Add support for Neoverse-N3, Neoverse-V3 and Neoverse-V3AE (#90143)
Neoverse-N3, Neoverse-V3 and Neoverse-V3AE are Armv9.2 AArch64 CPUs.

Technical Reference Manual for Neoverse-N3:
   https://developer.arm.com/documentation/107997/latest/

Technical Reference Manual for Neoverse-V3:
   https://developer.arm.com/documentation/107734/latest/

Technical Reference Manual for Neoverse-V3AE:
   https://developer.arm.com/documentation/101595/latest/
2024-04-26 13:04:35 +01:00
Tomas Matheson
71c5964f5c
[ARM][AArch64] autogenerate header file for TargetParser from Target tablegen files (#88378)
Introduce a mechanism to share data between the ARM and AArch64 backends and
TargetParser, to reduce duplication of code. This is similar to the current
RISC-V implementation.

The target tablegen file (in this case `ARM.td` or `AArch64.td`) is
processed during building of `TargetParser` to generate the following
files in the build tree:
 - `build/include/llvm/TargetParser/ARMTargetParserDef.inc`
 - `build/include/llvm/TargetParser/AArch64TargetParserDef.inc`

For now, the use of these generated files is limited to files _outside_
of `TargetParser`. The main reason for this is that the modifications to
`TargetParser` will require additional data added to the tablegen files,
which I want to split into separate PRs.
2024-04-24 09:18:36 +01:00
David Green
b24af43fdf
[AArch64] Improve scheduling latency into Bundles (#86310)
By default, the scheduling info of instructions inside a BUNDLE is given a
latency of 0, as they operate on the implicit register of the bundle.
This modifies that for AArch64 so that the latency is instead taken from
the instruction inside the bundle. This essentially assumes that the
bundled instructions execute in a single cycle, which is probably OK for
AArch64 considering bundles are mostly used for MOVPRFX, where this can
create slightly better scheduling, especially for in-order cores.
2024-04-12 10:57:01 +01:00
Arthur Eubanks
94c988bcfd [NFC] Remove unused parameter from shouldAssumeDSOLocal() 2024-03-11 19:48:17 +00:00
Jonathan Thackray
8160139136
Add support for Arm Cortex A78AE CPU (#84485)

Technical Reference Manual for Arm Cortex A78AE:
   https://developer.arm.com/documentation/101779/0003

Fixes #84450
2024-03-08 16:11:36 +00:00
Fangrui Song
201572e34b
[AArch64] Implement -fno-plt for SelectionDAG/GlobalISel
Clang sets the nonlazybind attribute for certain ObjC features. The
AArch64 SelectionDAG implementation for non-intrinsic calls
(46e36f0953aabb5e5cd00ed8d296d60f9f71b424) is behind a cl option.

GCC implements -fno-plt for a few ELF targets. In Clang, -fno-plt also
sets the nonlazybind attribute. For SelectionDAG, make the cl option not
affect ELF so that non-intrinsic calls to a dso_preemptable function use
GOT. Adjust AArch64TargetLowering::LowerCall to handle intrinsic calls.

For FastISel, change `fastLowerCall` to bail out when a call is due to
-fno-plt.

For GlobalISel, handle non-intrinsic calls in CallLowering::lowerCall
and intrinsic calls in AArch64CallLowering::lowerCall (where the
target-independent CallLowering::lowerCall is not called).
The GlobalISel test in `call-rv-marker.ll` is therefore updated.

Note: the current -fno-plt -fpic implementation does not use GOT for a
preemptable function.

Link: #78275

Pull Request: https://github.com/llvm/llvm-project/pull/78890
2024-03-05 13:55:29 -08:00
Philipp Tomsich
fbba818a78
[AArch64] Add the Ampere1B core (#81297)
The Ampere1B is Ampere's third-generation core implementing a
superscalar, out-of-order microarchitecture with nested virtualization,
speculative side-channel mitigation and architectural support for
defense against ROP/JOP style software attacks.

Ampere1B is an ARMv8.7+ implementation, adding support for the FEAT
WFxT, FEAT CSSC, FEAT PAN3 and FEAT AFP extensions. It also includes all
features of the second-generation Ampere1A, such as the Memory Tagging
Extension and SM3/SM4 cryptography instructions.
2024-02-09 15:22:09 -08:00
Yuta Mukai
70eab122bc
[AArch64][MachinePipeliner] Add pipeliner support for AArch64 (#79589)
Add AArch64 implementations for the interfaces of MachinePipeliner pass.
The pass is disabled by default for AArch64. It is enabled by specifying
--aarch64-enable-pipeliner.

Five tests in llvm-test-suite show a performance improvement of more than
5% on a Neoverse V1 processor.

| test | improvement |
| ---------------------------------------------------------------- | -----------:|
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test | 16% |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-flt.test | 16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test | 14% |
| SingleSource/Benchmarks/Misc/flops-5.test | 13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test | 6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip, flags for pipelining: -mllvm
-aarch64-enable-pipeliner -mllvm
-pipeliner-max-stages=100 -mllvm -pipeliner-max-mii=100 -mllvm
-pipeliner-enable-copytophi=0)

On the other hand, there are cases of significant performance
degradation. Algorithm improvements and adding the option/pragma will be
needed in the future.
2024-02-02 10:33:44 +09:00
Eli Friedman
a6065f0fa5
Arm64EC entry/exit thunks, consolidated. (#79067)
This combines the previously posted patches with some additional work
I've done to more closely match MSVC output.

Most of the important logic here is implemented in
AArch64Arm64ECCallLowering, whose purpose is to take "normal" IR we'd
generate for other targets and produce most of the Arm64EC-specific bits:
generating thunks, mangling symbols, generating aliases, and generating
the .hybmp$x table. This is all done late for a few reasons: to
consolidate the logic as much as possible, and to ensure the IR exposed
to optimization passes doesn't contain complex arm64ec-specific
constructs.

The other changes are supporting changes, to handle the new constructs
generated by that pass.

There's a global llvm.arm64ec.symbolmap representing the .hybmp$x
entries for the thunks. This gets handled directly by the AsmPrinter
because it needs symbol indexes that aren't available before that.

There are two new calling conventions used to represent calls to and
from thunks: ARM64EC_Thunk_X64 and ARM64EC_Thunk_Native. There are a few
changes to handle the associated exception-handling info,
SEH_SaveAnyRegQP and SEH_SaveAnyRegQPX.

I've intentionally left out handling for structs with small
non-power-of-two sizes, because that's easily separated out. The rest of
my current work is here. I squashed my current patches because they were
split in ways that didn't really make sense. Maybe I could split out
some bits, but it's hard to meaningfully test most of the parts
independently.

Thanks to @dpaoliello for extensive testing and suggestions.

(Originally posted as https://reviews.llvm.org/D157547 .)
2024-01-22 21:28:07 -08:00
Tim Northover
10d6d5f224
AArch64: add support for currently released Apple CPUs. (#73499)
These are still v8.6a and have no real changes as far as LLVM cares, so
it's mostly just a copy/paste job.
2023-11-29 09:51:42 +00:00
Matthew Devereau
cdf6693f07
[AArch64][SME] Add support for sme-fa64 (#70809) 2023-11-20 08:37:52 +00:00
Jonathan Thackray
066c4524bc
[AArch64] Add support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs (#72395)
Cortex-A520, Cortex-A720 and Cortex-X4 are Armv9.2 AArch64 CPUs.

Technical Reference Manual for Cortex-A520:
   https://developer.arm.com/documentation/102517/latest/

Technical Reference Manual for Cortex-A720:
   https://developer.arm.com/documentation/102530/latest/

Technical Reference Manual for Cortex-X4:
   https://developer.arm.com/documentation/102484/latest/

Patch co-authored by: Sivan Shani <sivan.shani@arm.com>
2023-11-16 22:08:58 +00:00
David Sherwood
bdc0afc871
[CodeGen][AArch64] Set min jump table entries to 13 for AArch64 targets (#71166)
There are some workloads that are negatively impacted by using jump
tables when the number of entries is small. The SPEC2017 perlbench
benchmark is one example of this, where increasing the threshold to
around 13 gives a ~1.5% improvement on neoverse-v1. I chose the minimum
threshold based on empirical evidence rather than science, and just
manually increased the threshold until I got the best performance
without impacting other workloads. For neoverse-v1 I saw around ~0.2%
improvement in the SPEC2017 integer geomean, and no overall change for
neoverse-n1. If we find issues with this threshold later on we can
always revisit this.

The most significant SPEC2017 score changes on neoverse-v1 were:

500.perlbench_r: +1.6%
520.omnetpp_r: +0.6%

and the rest saw changes < 0.5%.

I updated CodeGen/AArch64/min-jump-table.ll to reflect the new
threshold. For most of the affected tests I manually set the min number
of entries back to 4 on the RUN line because the tests seem to rely upon
this behaviour.
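
For illustration, a hedged sketch of such a RUN-line override (the flag
spelling is an assumption, consistent with the AArch64 threshold option
described above):

```
; RUN: llc -mtriple=aarch64 -aarch64-min-jump-table-entries=4 < %s | FileCheck %s
```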
2023-11-14 13:00:28 +00:00
Anatoly Trosinenko
1d2b558265 [AArch64][PAC] Check authenticated LR value during tail call
When performing a tail call, check the value of LR register after
authentication to prevent the callee from signing and spilling an
untrusted value. This commit implements a few variants of check,
more can be added later.

If it is safe to assume that executable pages are always readable,
LR can be checked just by dereferencing the LR value via LDR.

As an alternative, LR can be checked as follows:

    ; lowered AUT* instruction
    ; <some variant of check that LR contains a valid address>
    b.cond break_block
  ret_block:
    ; lowered TCRETURN
  break_block:
    brk 0xc471

As the existing methods either break compatibility with execute-only
memory mappings or can degrade performance, they are disabled by
default and can be explicitly enabled with a command-line option.

Individual subtargets can opt-in to use one of the available methods
by updating AArch64FrameLowering::getAuthenticatedLRCheckMethod().

Reviewed By: kristof.beyls

Differential Revision: https://reviews.llvm.org/D156716
2023-10-11 17:38:17 +03:00
Sander de Smalen
7e815dd76d [AArch64][SME] Create new interface for isSVEAvailable.
When a function is compiled to be in Streaming(-compatible) mode, the full
set of SVE instructions may not be available. This patch adds an interface
to query that, and changes the codegen for FADDA (not legal in Streaming-SVE
mode) so that it is instead expanded for fixed-length vectors, and not
code-generated at all for scalable vectors.
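
A hedged sketch of the new interface (the body is an assumption based on the
description above):

```cpp
// Sketch: the full SVE instruction set is only usable when SVE is present
// and the function is not (possibly) in streaming mode.
bool AArch64Subtarget::isSVEAvailable() const {
  return hasSVE() && !isStreaming() && !isStreamingCompatible();
}
```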

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D156109
2023-09-01 12:00:36 +00:00
Sander de Smalen
08fd44b300 [AArch64] Force streaming-compatible codegen when attributes are set.
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.

Reviewed By: paulwalker-arm, david-arm

Differential Revision: https://reviews.llvm.org/D155428
2023-07-18 10:26:00 +00:00
Sander de Smalen
ec6af93d02 [AArch64] NFC: Replace 'forceStreamingCompatibleSVE' with 'isNeonAvailable'.
The AArch64Subtarget interface 'isNeonAvailable' is more appropriate going
forward, as we may also want to generate 'streaming SVE' code (not just
'streaming-compatible SVE' code), but here we must still make sure not to
use NEON instructions which are invalid in streaming SVE mode.
2023-07-17 08:24:10 +00:00
David Sherwood
c7dbe326df [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1
This patch enables the tail-folding of simple loops by default
when targeting the neoverse-v1 CPU. Simple loops exclude those
with recurrences or reductions or loops that are reversed.

New tests have been added here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

In terms of SPEC2017 only one benchmark is really affected when
building with "-Ofast -mcpu=neoverse-v1 -flto", which is
(+ faster, - slower):

525.x264: +7.0%

Differential Revision: https://reviews.llvm.org/D130618
2023-05-18 10:35:57 +00:00
Archibald Elliott
4679d7a26a [NFC][ARM][AArch64] Cleanup TargetParser includes
llvm/TargetParser/TargetParser.h now only includes AMDGPU-specific
functionality, the ARM- and AArch64-specific functionality is in other
headers.
2023-03-03 16:24:55 +00:00
Archibald Elliott
8e3d7cf5de [NFC][TargetParser] Remove llvm/Support/TargetParser.h 2023-02-07 11:08:21 +00:00
Archibald Elliott
8c712296fb [NFC][TargetParser] Remove llvm/Support/AArch64TargetParser.h
Removes the forwarding header `llvm/Support/AArch64TargetParser.h`.

I am proposing to do this for all the forwarding headers left after
rGf09cf34d00625e57dea5317a3ac0412c07292148 - for each header:
- Update all relevant in-tree includes
- Remove the forwarding Header

Differential Revision: https://reviews.llvm.org/D140999
2023-02-03 17:34:01 +00:00
Guillaume Chatelet
d6e0ff6074 [NFC] Migrate aarch64 alignment to Align 2023-02-03 16:29:11 +00:00
Mitch Phillips
486729ce06 Re-land: [MTE] Add AArch64GlobalsTagging Pass
Adds an IR pass for -fsanitize=memtag-globals. This pass goes over the
tag-capable global variables, and replaces them with a tagged global
variable of the same contents. This new global variable will have its
size and alignment adjusted if necessary so that they're both a multiple
of the tag granule size (16 bytes).
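
A hedged sketch of the size/alignment adjustment (the variable names are
illustrative, not the verbatim patch):

```cpp
// Sketch: round the tagged global's size and alignment up to the MTE tag
// granule so that no two globals share a granule.
constexpr uint64_t TagGranuleSize = 16;
uint64_t NewSize = alignTo(OriginalSize, TagGranuleSize);
Align NewAlign = std::max(OriginalAlign, Align(TagGranuleSize));
```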

Global merge must also be suppressed for tagged globals, as each global
variable must have a unique tag. This could be relaxed in the future:
globals that are identical in size, alignment, and content could be
merged. The major problem comes from tail- or head-merging, which, if
left unchecked, could produce partially-overlapping global variables with
different memory tags, leading to crashes at runtime.

Reviewed By: fmayer, eugenis

Differential Revision: https://reviews.llvm.org/D133392
2023-01-31 13:03:37 -08:00