This patch aims to reduce the include used by AArch64ISelLowering, allowing it
to be included by unittests so that they can reference the AArch64ISD nodes.
It:
- Moves the inclusion of AArch64SMEAttributes.h to the uses.
- Moves LowerPtrAuthGlobalAddressStatically to a static function, so that
AArch64PACKey is not required in the header.
- Moves the definitions of getExceptionPointerRegister to the cpp file, to
remove the reference of AArch64::X0.
Previously, the cost model was returning an invalid cost. This simply
moves the check from one place to another. This is mostly to make the
cost modeling code a bit easier to follow.
---------
Co-authored-by: Mel Chen <mel.chen@sifive.com>
Rename the function to reflect its correct behavior and to be consistent
with `Module::getOrInsertFunction`. This is also in preparation of
adding a new `Intrinsic::getDeclaration` that will have behavior similar
to `Module::getFunction` (i.e, just lookup, no creation).
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
Affected intrinsics:
llvm.aarch64.sve.fcvt.bf16f32
llvm.aarch64.sve.fcvtnt.bf16f32
The named intrinsics took a predicate based on the smallest element type
when it should be based on the largest. The intrinsics have been replace
by v2 equivalents and affected code ported to use them.
Patch includes changes to getSVEPredicateBitCast() that ensure the
generated code for the auto-upgraded old intrinsics is unchanged.
The "narrowing top" convert instructions leave the bottom half of active
elements untouched and thus the first paramater of their associated
intrinsic remains live even when there are no inactive lanes.
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
fadd reduction with
1. Fast flag set
2. No of elements in input vector is power of 2 results in series of
faddp instructions. faddp instruction has latency/throughput identical
to fadd instruction and hence, we set relative cost=1 for faddp as well.
The change didn't show any regression with SPEC17-FP(C/C++),
llvm-test-suite on Neoverse-V2.
This patch is moving out stepvector intrinsic from the experimental
namespace.
This intrinsic exists in LLVM for several years now, and is widely used.
A fneg(fmul(..)) or fmul(fneg(..)) can be folded into a fnmul under
AArch64. https://clang.godbolt.org/z/znPj34Mae
This discounts the cost of the fneg in such patterns to be free.
This patch increases scatter overhead on Neoverse-V2 to 13. This
benefits s128 kernel from TSVC_2 test suite.
SPEC 17, RAJAPerf, and Sptter are unaffected by this patch.
This patch boosts s128 kernel's performance from TSVC test suite by about
40% as this enables vectorization. Also, handle minor code refactoring
for gather related part.
The code-generator is currently not able to handle scalable vectors of
<vscale x 1 x eltty>. The usual "fix" for this until it is supported is
to mark the costs of loads/stores with an invalid cost, preventing the
vectorizer from vectorizing at those factors. But on rare occasions
loops do not contain load/stores, only reductions.
So whilst this is still unsupported return an invalid cost to avoid
selecting vscale x 1 VFs. The cost of a reduction is not currently used
by the vectorizer so this adds the cost to the add/mul/and/or/xor or
min/max that should feed the reduction. It includes reduction costs
too, for completeness. This change will be removed when code-generation
for these types is sufficiently reliable.
Fixes#99760
The command line option enable-scalable-autovec-in-streaming-mode is
used to enable scalable vectors but the same check is missing from
enableScalableVectorization, which is blocking auto-vectorisation.
Inlining may result in different behaviour when the callee clobbers ZT0,
because normally the call-site will have code to preserve ZT0. When
inlining the function this code to preserve ZT0 will no longer be
emitted, and so the resulting behaviour of the program is changed.
This reverts commit 19b785b7334d01354e8430634bab3c3341c671ca.
My bisect must have been wrong because they're still failing,
and there are follow ups to this that would need unpicking anyway.
Most of these bring the costs in line with the code generation. The f16 costs
without FullFP16 are usually converted to f32. Extended v2f32->v2f64 vectors
similarly use fcvtl + fcvt. As a backup we use the costs similar to the target
independent code, which should give a relatively high cost.
This special case tried to measure if the shuffle vector will be multiple
inserts into an existing vector, with one of the lanes already in-place. If so
it reduces the cost by 1 to to represent it will can insert n-1 vector lanes.
This isn't always true though as the original vector may need to be moved to a
new value to start inserting new values into it, if other values from the
original are still needed.
This didn't effect performance much when I tried it, but should hopefully start
to address a regression we see from differences in SLP vectorization lane
orders.
We should not call `getVectorElementType` on the result MVT from
`getTypeLegalizationCost` when we don't if the legal type is a
vector. This is the case when the type needs to be legalized
using scalarization.
This is just an initial cost, making it invalid for any target which
doesn't specifically return a cost for now. Also adds an AArch64
specific cost check.
We will need to improve that later, e.g. by returning a scalarization
cost for generic targets and possibly introducing a new TTI method, at
least once LoopVectorize has changed it's cost model. The reason is
that the histogram intrinsic also effectively contains a gather and
scatter, and we will need details of the addressing to determine an
appropriate cost for that.
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
Previously isElementTypeLegalForScalableVector returned false for i1
types, which also prevented vectorisation of loops with i1 reductions.
This is overkill - we only need to disable vectorisation for loads
and/or stores of i1 types. I've added i1 as a legal type, but changed
the cost model to return an invalid cost for loads and stores.
I've not added any new tests for these, because the original conditions
were wrong (they did not consider streaming mode) and we have tests for
the positive cases.
At the moment, vectorization is only enabled in streaming(-compatible)
mode when enabled through an option. But the interfaces should check
more than just 'hasSVE()', because a function with +sme in streaming
mode should also vectorize with the option enabled.
Additionally, a streaming-compatible function should only be able to use
fixed-length autovec if SVE is available, otherwise the vector code will
be scalarised by the backend.
Adds an AArch64-specific version of isLSRCostLess, changing the relative
importance of the various terms from the formulae being evaluated.
This has been split out from my vscale-aware LSR work, see the RFC for
reference:
https://discourse.llvm.org/t/rfc-vscale-aware-loopstrengthreduce/77131
This removes the GISel versions of isREVMask, isTRNMask, isUZPMask and
isZipMask. They are combined with the existing versions from SDAG into
AArch64PerfectShuffle.h.
When SVE is available we can lower calls to get.active.lane.mask using
the SVE whilelo instruction, however in practice since vXi1 types are
not legal for NEON we often end up expanding the predicate into a vector
of integers, e.g. v4i1 -> v4i32. This usually happens when we have to
keep the predicate live out of the block, for example when the predicate
is the incoming value to a PHI node in a tail-folded vector loop.
Currently in such cases the intrinsic call has a cost of 1, which is far
too low when considering the extra instructions required to expand the
predicate. This patch fixes that by basing the cost on the number of
lane moves required for expansion. This is required for a follow-on
patch that adds the cost of the intrinsic call to the vectorisation cost
model, so that we can teach the vectoriser to make better choices.
Similar to #87934, this adds costs to the shuffles in a canonical LD3/LD4
pattern, which are represented in LLVM as deinterleaving-shuffle(load). This
likely has less effect at the moment than the ST3/ST4 costs as instcombine will
perform certain transforms without considering the cost.
This tries to add some costs for the shuffle in a ST3/ST4 instruction,
which are represented in LLVM IR as store(interleaving shuffle). In
order to detect the store, it needs to add a CxtI context instruction to
check the users of the shuffle. LD3 and LD4 are added, LD2 should be a
zip1 shuffle, which will be added in another patch.
It should help fix some of the regressions from #87510.
Currently patchpoints can only have two result types, `void` and `i64`.
This limits the result to general purpose registers.
This patch makes `patchpoint.i64` an overloadable intrinsic, allowing
result values that can fit in a single register (e.g. integers,
pointers, floats).
These are the last remaining "trivial" changes to passes that use
Instruction pointers for insertion. All of this should be NFC, it's just
changing the spelling of how we identify a position.
In one or two locations, I'm also switching uses of getNextNode etc to
using std::next with iterators. This too should be NFC.
---------
Merged by: Stephen Tozer <stephen.tozer@sony.com>