Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This affects, for example, cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
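As an illustrative sketch (not taken from the patch), a loop with a low trip count can end up executing only the epilogue when the main vectorised body needs VF × interleave-count iterations per step:

```cpp
// Illustrative only: with a vectorisation factor of 4 and an interleave count
// of 4, the main vectorised body consumes 16 iterations per step, so for
// n < 16 it never runs and only the (possibly vectorised) epilogue executes.
void scale(float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i)  // e.g. n == 10 at run time
    a[i] = b[i] * 2.0f;
}
```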
Move the emission of the checks performed on the authenticated LR value
during tail calls into the AArch64AsmPrinter class, so that different checker
sequences can be reused by pseudo instructions expanded there.
This adds one more option to the AuthCheckMethod enumeration: a generic
XPAC-based variant that is not restricted to checking the LR register.
FeatureUseScalarIncVL is a tuning feature used to control whether addvl or
add+cnt is used. It was previously added as a dependency of FeatureSVE2, an
architecture feature, but this can be seen as a layering violation. The main
disadvantage is that -use-scalar-inc-vl cannot be used without disabling sve2
and all dependent features.
This patch replaces that with an option that, if unset, defaults to hasSVE ||
hasSME, but is otherwise overridden by the option. The hope is that no CPUs
will rely on the tuning feature (or we can re-add it if needed).
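A rough illustration (ACLE intrinsics, not taken from the patch) of the vector-length-dependent increment this tuning choice affects:

```cpp
#include <arm_sve.h>

// Illustrative only: advance a pointer by one SVE vector's worth of floats.
// The backend can materialise the increment either with addvl/inc-style
// instructions or with a cnt followed by an add; that is the choice this
// tuning feature controls.
void advance(float *&p) {
  p += svcntw();  // number of 32-bit lanes in an SVE vector
}
```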
This is defined by the `-aarch64-streaming-hazard-size` option or its
alias `-aarch64-stack-hazard-size` (the original name). It has been
renamed to be more general as this option will (for the time being) be
used to detect if the current target has streaming mode memory hazards.
---------
Co-authored-by: Hari Limaye <hari.limaye@arm.com>
This patch increases the scatter overhead on Neoverse-V2 to 13. This
benefits the s128 kernel from the TSVC_2 test suite.
SPEC 2017, RAJAPerf, and Spatter are unaffected by this patch.
This patch improves the performance of the s128 kernel from the TSVC test
suite by about 40% by enabling its vectorization. It also includes minor
refactoring of the gather-related code.
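As a generic illustration (not necessarily the exact s128 kernel), this is the kind of strided-access loop whose vectorisation hinges on the modelled gather/scatter cost:

```cpp
// Illustrative only: when vectorised, the strided loads and stores become
// gathers and scatters, so the gather/scatter overhead in the cost model
// decides whether vectorisation is considered profitable.
void strided_update(float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i)
    a[2 * i] = b[2 * i] + 1.0f;
}
```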
For the pauthtest ABI, there are a number of ptrauth-* options, including
ptrauth-returns. Use the "ptrauth-returns" function attribute to indicate the
need for LR signing with the B key for non-leaf functions, to avoid using
"sign-return-address" and "sign-return-address-key", which were
originally designed for pac-ret.
Co-authored-by: Ahmed Bougacha <ahmed@bougacha.org>
Co-authored-by: Anatoly Trosinenko <atrosinenko@accesssoftek.com>
Enabled in clang using:
-fptrauth-indirect-gotos
and at the IR level using function attribute:
"ptrauth-indirect-gotos"
Signing uses IA and a per-function integer discriminator. The
discriminator isn't ABI-visible, and is currently:
ptrauth_string_discriminator("<function_name> blockaddress")
A sufficiently sophisticated frontend could benefit from per-indirectbr
discrimination, which would need additional machinery, such as allowing
"ptrauth" bundles on indirectbr. For our purposes, the simple scheme
above is sufficient.
This approach doesn't support subtracting label addresses and using
the result as offsets, because each label address is signed.
Pointer arithmetic on signed pointers corrupts the signature bits,
and because label address expressions aren't typed beyond void*,
we can't do anything reliably intelligent on the arithmetic exprs.
Not signing addresses when used to form offsets would allow
easily hijacking control flow by overwriting the offset.
This diagnoses the basic cases (`&&lbl2 - &&lbl1`) in the frontend,
while we evaluate either alternative implementations (e.g., lowering
blockaddress to a bb number, and indirectbr to a checked jump-table),
or better diagnostics (both at the frontend level and on unencodable
IR constants).
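For illustration (GNU labels-as-values, not code from the patch), the supported pattern and the diagnosed subtraction look roughly like this:

```cpp
// Illustrative only (GNU labels-as-values extension).
void dispatch(int which) {
  // With -fptrauth-indirect-gotos, these label addresses are signed.
  void *targets[] = {&&handle_a, &&handle_b};
  goto *targets[which];  // the signed address is authenticated at the indirect goto
handle_a:
  return;
handle_b:
  // long off = &&handle_b - &&handle_a;  // diagnosed: arithmetic on signed label addresses
  return;
}
```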
The behaviour of the flag should be equivalent to
__arm_streaming_compatible.
At the moment, the name suggests that '-force-streaming-compatible-sve'
on its own (i.e. without specifying `+sve`) enables the compiler to use
the streaming-compatible subset of SVE instructions, but the semantics
are merely that the function can be called with either PSTATE.SM=0 or
PSTATE.SM=1.
Introduce a mechanism to share data between the ARM and AArch64 backends and
TargetParser, to reduce duplication of code. This is similar to the current
RISC-V implementation.
The target tablegen file (in this case `ARM.td` or `AArch64.td`) is
processed during building of `TargetParser` to generate the following
files in the build tree:
- `build/include/llvm/TargetParser/ARMTargetParserDef.inc`
- `build/include/llvm/TargetParser/AArch64TargetParserDef.inc`
For now, the use of these generated files is limited to files _outside_
of `TargetParser`. The main reason for this is that the modifications to
`TargetParser` will require additional data added to the tablegen files,
which I want to split into separate PRs.
By default, the scheduling info of instructions inside a BUNDLE gives them a
latency of 0, as they operate on the implicit register of the bundle.
This patch adjusts that for AArch64 so that the latency is taken from the
instruction inside the bundle instead. This essentially assumes that the
bundled instructions are executed in a single cycle, which for AArch64 is
probably OK considering they are mostly used for MOVPRFX bundles, where this
can help create slightly better scheduling, especially for in-order cores.
Clang sets the nonlazybind attribute for certain ObjC features. The
AArch64 SelectionDAG implementation for non-intrinsic calls
(46e36f0953aabb5e5cd00ed8d296d60f9f71b424) is behind a cl option.
GCC implements -fno-plt for a few ELF targets. In Clang, -fno-plt also
sets the nonlazybind attribute. For SelectionDAG, make the cl option not
affect ELF so that non-intrinsic calls to a dso_preemptable function use
the GOT. Adjust AArch64TargetLowering::LowerCall to handle intrinsic calls.
For FastISel, change `fastLowerCall` to bail out when a call is due to
-fno-plt.
For GlobalISel, handle non-intrinsic calls in CallLowering::lowerCall
and intrinsic calls in AArch64CallLowering::lowerCall (where the
target-independent CallLowering::lowerCall is not called).
The GlobalISel test in `call-rv-marker.ll` is therefore updated.
Note: the current -fno-plt -fpic implementation does not use GOT for a
preemptable function.
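A rough illustration (not from the patch) of the intended behaviour on ELF:

```cpp
// Illustrative only. Compiled with:
//   clang++ --target=aarch64-linux-gnu -fPIC -fno-plt -O2 -c caller.cpp
// -fno-plt makes Clang add the nonlazybind attribute to the declaration, and
// the call to the dso_preemptable function is then lowered through the GOT
// (typically adrp/ldr followed by blr) instead of a direct bl via the PLT.
extern "C" void external_fn();  // hypothetical function defined in a shared library

void caller() {
  external_fn();
}
```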
Link: #78275
Pull Request: https://github.com/llvm/llvm-project/pull/78890
The Ampere1B is Ampere's third-generation core implementing a
superscalar, out-of-order microarchitecture with nested virtualization,
speculative side-channel mitigation and architectural support for
defense against ROP/JOP style software attacks.
Ampere1B is an ARMv8.7+ implementation, adding support for the FEAT_WFxT,
FEAT_CSSC, FEAT_PAN3 and FEAT_AFP extensions. It also includes all
features of the second-generation Ampere1A, such as the Memory Tagging
Extension and SM3/SM4 cryptography instructions.
Add AArch64 implementations for the interfaces of the MachinePipeliner pass.
The pass is disabled by default for AArch64. It is enabled by specifying
--aarch64-enable-pipeliner.
Five tests in llvm-test-suite show a performance improvement of more than 5%
on a Neoverse V1 processor:
| test | improvement |
| --- | ---: |
| MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test | 16% |
| MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test | 16% |
| SingleSource/Benchmarks/Adobe-C++/loop_unroll.test | 14% |
| SingleSource/Benchmarks/Misc/flops-5.test | 13% |
| SingleSource/Benchmarks/BenchmarkGame/spectral-norm.test | 6% |

(base flags: -mcpu=neoverse-v1 -O3 -mrecip; flags for pipelining: -mllvm
-aarch64-enable-pipeliner -mllvm -pipeliner-max-stages=100 -mllvm
-pipeliner-max-mii=100 -mllvm -pipeliner-enable-copytophi=0)
On the other hand, there are cases of significant performance degradation.
Algorithm improvements, and the addition of an option/pragma to control
pipelining, will be needed in the future.
This combines the previously posted patches with some additional work
I've done to more closely match MSVC output.
Most of the important logic here is implemented in
AArch64Arm64ECCallLowering. The purpose of that pass is to take the "normal"
IR we'd generate for other targets and generate most of the Arm64EC-specific
bits:
generating thunks, mangling symbols, generating aliases, and generating
the .hybmp$x table. This is all done late for a few reasons: to
consolidate the logic as much as possible, and to ensure the IR exposed
to optimization passes doesn't contain complex arm64ec-specific
constructs.
The other changes are supporting changes, to handle the new constructs
generated by that pass.
There's a global llvm.arm64ec.symbolmap representing the .hybmp$x
entries for the thunks. This gets handled directly by the AsmPrinter
because it needs symbol indexes that aren't available before that.
There are two new calling conventions used to represent calls to and
from thunks: ARM64EC_Thunk_X64 and ARM64EC_Thunk_Native. There are a few
changes to handle the associated exception-handling info,
SEH_SaveAnyRegQP and SEH_SaveAnyRegQPX.
I've intentionally left out handling for structs with small
non-power-of-two sizes, because that's easily separated out. The rest of
my current work is here. I squashed my current patches because they were
split in ways that didn't really make sense. Maybe I could split out
some bits, but it's hard to meaningfully test most of the parts
independently.
Thanks to @dpaoliello for extensive testing and suggestions.
(Originally posted as https://reviews.llvm.org/D157547 .)
There are some workloads that are negatively impacted by using jump
tables when the number of entries is small. The SPEC2017 perlbench
benchmark is one example of this, where increasing the threshold to
around 13 gives a ~1.5% improvement on neoverse-v1. I chose the minimum
threshold based on empirical evidence rather than science, and just
manually increased the threshold until I got the best performance
without impacting other workloads. For neoverse-v1 I saw around ~0.2%
improvement in the SPEC2017 integer geomean, and no overall change for
neoverse-n1. If we find issues with this threshold later on we can
always revisit this.
The most significant SPEC2017 score changes on neoverse-v1 were:
500.perlbench_r: +1.6%
520.omnetpp_r: +0.6%
and the rest saw changes < 0.5%.
I updated CodeGen/AArch64/min-jump-table.ll to reflect the new
threshold. For most of the affected tests I manually set the min number
of entries back to 4 on the RUN line because the tests seem to rely upon
this behaviour.
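For illustration (not part of the patch), a switch with fewer cases than the new threshold is now lowered as compare/branch sequences rather than a jump table:

```cpp
// Illustrative only: with the minimum jump-table entry count raised to around
// 13, this eight-case switch no longer uses a table load plus indirect branch.
void handle0(); void handle1(); void handle2(); void handle3();
void handle4(); void handle5(); void handle6(); void handle7();

void dispatch(int v) {
  switch (v) {
  case 0: handle0(); break;
  case 1: handle1(); break;
  case 2: handle2(); break;
  case 3: handle3(); break;
  case 4: handle4(); break;
  case 5: handle5(); break;
  case 6: handle6(); break;
  case 7: handle7(); break;
  }
}
```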
When performing a tail call, check the value of the LR register after
authentication to prevent the callee from signing and spilling an
untrusted value. This commit implements a few variants of the check;
more can be added later.
If it is safe to assume that executable pages are always readable,
LR can be checked just by dereferencing the LR value via LDR.
As an alternative, LR can be checked as follows:
; lowered AUT* instruction
; <some variant of check that LR contains a valid address>
b.cond break_block
ret_block:
; lowered TCRETURN
break_block:
brk 0xc471
As the existing methods either break compatibility with execute-only
memory mappings or can degrade performance, they are disabled by
default and can be explicitly enabled with a command line option.
Individual subtargets can opt-in to use one of the available methods
by updating AArch64FrameLowering::getAuthenticatedLRCheckMethod().
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D156716
When a function is compiled to be in Streaming(-compatible) mode, the full
set of SVE instructions may not be available. This patch adds an interface
to query that and changes the codegen for FADDA (not legal in Streaming-SVE
mode) so that it is instead expanded for fixed-length vectors, or otherwise
not code-generated for scalable vectors.
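For context, an illustrative example (not from the patch) of the strictly ordered reduction that FADDA implements when fast-math reassociation is not allowed:

```cpp
// Illustrative only: without fast-math the additions must stay in source
// order; SVE can vectorise this with FADDA, which is not legal in streaming
// mode, so there the fixed-length form is expanded and the scalable form is
// not code-generated.
double ordered_sum(const double *a, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```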
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D156109
Before this patch, the only way to generate streaming-compatible code
was to use the `-force-streaming-compatible-sve` flag, but the compiler
should also avoid the use of instructions invalid in streaming mode
when a function has the aarch64_pstate_sm_enabled/compatible attribute.
Reviewed By: paulwalker-arm, david-arm
Differential Revision: https://reviews.llvm.org/D155428
The AArch64Subtarget interface 'isNeonAvailable' is more appropriate going
forward, as we may also want to generate 'streaming SVE' code (not just
'streaming-compatible SVE' code), but here we must still make sure not to
use NEON instructions which are invalid in streaming SVE mode.
This patch enables the tail-folding of simple loops by default
when targeting the neoverse-v1 CPU. Simple loops exclude those with
recurrences, reductions, or reversed iteration.
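An illustrative example (not from the patch) of a loop that counts as simple and is now tail-folded by default on neoverse-v1:

```cpp
// Illustrative only: no reductions, recurrences or reversed iteration, so the
// vectoriser can fold the remainder into a predicated vector body instead of
// emitting a separate scalar tail loop.
void axpy(float *a, const float *b, float c, int n) {
  for (int i = 0; i < n; ++i)
    a[i] += c * b[i];
}
```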
New tests have been added here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
In terms of SPEC2017 only one benchmark is really affected when
building with "-Ofast -mcpu=neoverse-v1 -flto", which is
(+ faster, - slower):
525.x264: +7.0%
Differential Revision: https://reviews.llvm.org/D130618
Removes the forwarding header `llvm/Support/AArch64TargetParser.h`.
I am proposing to do this for all the forwarding headers left after
rGf09cf34d00625e57dea5317a3ac0412c07292148 - for each header:
- Update all relevant in-tree includes
- Remove the forwarding Header
Differential Revision: https://reviews.llvm.org/D140999
Adds an IR pass for -fsanitize=memtag-globals. This pass goes over the
tag-capable global variables, and replaces them with a tagged global
variable of the same contents. This new global variable will have its
size and alignment adjusted if necessary so that they're both a multiple
of the tag granule size (16 bytes).
Global merge must also be suppressed for tagged globals, as each global
variable must have a unique tag. This can possibly be relaxed in future;
globals that are identical in size, alignment, and content can possibly
be merged. The major problem comes from tail- or head-merging, which if
left unchecked, could have partially-overlapping global variables with
different memory tags, leading to crashes at runtime.
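As a rough sketch (a hypothetical helper, not the pass's actual code), the size adjustment amounts to rounding up to the MTE granule:

```cpp
#include <cstdint>

// Hypothetical illustration of the adjustment the pass performs: the size
// (and alignment) of a tag-capable global is rounded up to the 16-byte tag
// granule so the whole object can be covered by its own tag.
constexpr uint64_t kTagGranule = 16;

constexpr uint64_t roundUpToGranule(uint64_t Bytes) {
  return (Bytes + kTagGranule - 1) & ~(kTagGranule - 1);
}

static_assert(roundUpToGranule(20) == 32, "a 20-byte global is padded to 32 bytes");
```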
Reviewed By: fmayer, eugenis
Differential Revision: https://reviews.llvm.org/D133392
This reverts commit 4edfcff71e150770675a19576f698c7bbe788ee2.
Broke the non-aarch64-containing target builds.
https://reviews.llvm.org/D133392 has more context.
Adds an IR pass for -fsanitize=memtag-globals. This pass goes over the
tag-capable global variables, and replaces them with a tagged global
variable of the same contents. This new global variable will have its
size and alignment adjusted if necessary so that they're both a multiple
of the tag granule size (16 bytes).
Global merge must also be suppressed for tagged globals, as each global
variable must have a unique tag. This can possibly be relaxed in future;
globals that are identical in size, alignment, and content can possibly
be merged. The major problem comes from tail- or head-merging, which if
left unchecked, could have partially-overlapping global variables with
different memory tags, leading to crashes at runtime.
Reviewed By: fmayer, eugenis
Differential Revision: https://reviews.llvm.org/D133392
The Ampere1A core improves on the Ampere1 with key differences being:
* memory tagging is supported
* SM3/SM4 are supported
* adds a new fusion pair for A+B+1 and A-B-1 (added in a later commit)
Depends on D142395
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D142396
This adds a little hasSVEorSME() helper and, as an NFC, updates existing
code to use it. The assertions in get[Min|Max]SVEVectorSizeInBits() are
also now corrected to use hasSVEorSME() rather than just hasSVE().
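A minimal sketch of the helper's shape (illustrative, not the exact in-tree code):

```cpp
// Hypothetical stand-in for the subtarget: hasSVEorSME() merely unifies the
// two existing feature checks so call sites and assertions need only one query.
struct SubtargetSketch {
  bool HasSVE = false;
  bool HasSME = false;
  bool hasSVE() const { return HasSVE; }
  bool hasSME() const { return HasSME; }
  bool hasSVEorSME() const { return hasSVE() || hasSME(); }
};
```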
Differential Revision: https://reviews.llvm.org/D138575
In the AArch64 backend, X30 is named LR and X29 is named FP, so the following
code in AArch64Subtarget::AArch64Subtarget cannot recognize these two
registers:
  for (unsigned i = 0; i < 31; ++i) {
    // getName() yields "LR"/"FP" for X30/X29 rather than "X30"/"X29", so a
    // reserved-register list containing those names never matches here.
    if (ReservedRegNames.count(TRI->getName(AArch64::X0 + i)))
      ReserveXRegisterForRA.set(i);
  }
This patch adds code to explicitly handle these two registers.
Differential Revision: https://reviews.llvm.org/D137810
Arm64EC has two different ways to refer to dllimport'ed functions in an
object file. One is using the usual __imp_ prefix, the other is using an
Arm64EC-specific prefix __imp_aux_. As far as I can tell, if a function
is in an x64 DLL, __imp_aux_ refers to the actual x64 address, while
__imp_ points to some linker-generated code that calls the exit thunk.
So __imp_aux_ is used to refer to the address in non-call contexts,
while __imp_ is used for calls to avoid the indirect call checker.
There's one twist to this, though: if an object refers to a symbol using
the __imp_aux_ prefix, the object file's symbol table must also contain
the symbol with the usual __imp_ prefix. The symbol doesn't actually
have to be used anywhere, it just has to exist; otherwise, the linker's
symbol lookup in x64 import libraries doesn't work correctly. Currently,
this is handled by emitting a .globl __imp_foo directive; we could try
to design some better way to handle this.
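For illustration (hypothetical source, reflecting the description above), the two kinds of reference:

```cpp
// Illustrative only, for an Arm64EC target.
__declspec(dllimport) void foo();

// Non-call use: refers to the address via __imp_aux_foo; the object must then
// also contain the plain __imp_foo symbol for the linker's import lookup.
void (*addr)() = &foo;

// Call: goes through __imp_foo, avoiding the indirect call checker.
void call_it() { foo(); }
```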
One minor quirk I haven't figured out: apparently, in Arm64EC mode, MSVC
prefers to use a linker-synthesized stub to call dllimport'ed functions,
instead of branching directly. The linker stub appears to do the same
thing that inline code would do, so not sure if it's just a code-size
optimization, or if the synthesized stub can actually do something other
than just load from the import table in some circumstances.
Differential Revision: https://reviews.llvm.org/D136202
When the SME attributes tell that a function is or may be executed in Streaming
SVE mode, we currently need to be conservative and disable _any_ vectorization
(fixed or scalable) because the code-generator does not yet support generating
streaming-compatible code.
Scalable auto-vec will be gradually enabled in the future when we have
confidence that the loop-vectorizer won't use any SVE or NEON instructions
that are illegal in Streaming SVE mode.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D135950
Add a compile-time flag for enabling streaming mode.
When streaming mode is enabled, lower basic loads and stores of fixed-width
vectors to generate code that is compatible with streaming mode.
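An illustrative example (not from the patch) of the fixed-width loads and stores this affects:

```cpp
#include <cstdint>

// Illustrative only: a plain 128-bit fixed-width vector copy. When compiled
// for streaming mode, the load and store must be lowered with instructions
// that remain legal in streaming mode rather than the usual NEON lowering.
typedef int32_t v4i32 __attribute__((vector_size(16)));

void copy_vec(v4i32 *dst, const v4i32 *src) {
  *dst = *src;
}
```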
Differential Revision: https://reviews.llvm.org/D133433
They're roughly ARMv8.6. This works in the .td file, but in
AArch64TargetParser.def, marking them v8.6 brings in support for the SM4
cryptographic instructions, which we don't actually have. So on the
TargetParser side they're marked as v8.5, with the extra features (BF16 and
I8MM) added manually.
Finally, A16 supports the HCX extension in addition to v8.6. This has no
TargetParser implications.