In streaming mode, the @llvm.aarch64.sme.cnts and @llvm.aarch64.sve.cnt
intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic
to @llvm.vscale(). This patch lowers the SME intrinsic similarly when in
streaming mode.
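As an illustration (not part of the patch), the equivalence is visible at the
ACLE level: inside a streaming function the SVE and SME counting builtins
return the same value (compiled with SME enabled, e.g. `-march=armv8-a+sme`):
```cpp
#include <arm_sme.h>
#include <stdint.h>

// svcntw() goes through @llvm.aarch64.sve.cntw and svcntsw() through
// @llvm.aarch64.sme.cntsw; in streaming mode both count the streaming
// vector length, i.e. vscale * 4 for 32-bit elements.
uint64_t count_words(void) __arm_streaming {
  uint64_t sve_count = svcntw();   // SVE counting intrinsic
  uint64_t sme_count = svcntsw();  // SME counting intrinsic
  return sve_count == sme_count ? sve_count : 0;  // always equal here
}
```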
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is counted
backwards from the end of the vector: for a vector with
ElementCount EC, the extracted/inserted lane is EC - 1 - Index.
For scalable vectors that lane is unknown at compile time. I've
added an AArch64 hook that better represents the cost, and also
a RISCV hook that maintains compatibility with the behaviour
prior to this PR.
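A minimal sketch of the indexing convention (the helper name is illustrative
and not part of the new interface):
```cpp
#include <cstdint>

// Reverse index Idx refers to lane EC - 1 - Idx of a vector with element
// count EC; Idx == 0 is the last lane. For scalable vectors EC is only a
// known minimum, so the concrete lane is a runtime value.
uint64_t laneFromReverseIndex(uint64_t EC, uint64_t Idx) {
  return EC - 1 - Idx;
}
```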
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
As fmul and fmadd are so similar, their performance characteristics tend
to be the same on most platforms, at least in terms of reciprocal
throughputs. Processors capable of performing a given number of fmul per
cycle can usually perform the same number of fma, with the extra add
being relatively simple on top. This patch makes the scores of the two
operations the same, which brings the throughput cost of an fma/fmuladd
to 2 and the latency to 3, which are the defaults for fmul.
Note that we might also want to change the throughput cost of an fmul to
1, as most processors have ample bandwidth for them, but the two should
still stay in line with one another.
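For reference (not part of the patch), the two operations whose costs are
being aligned:
```cpp
#include <cmath>

// On AArch64 std::fma lowers to fmadd; with this change it is costed the
// same as a plain fmul: throughput 2, latency 3.
float mul(float a, float b)              { return a * b; }             // fmul
float mul_add(float a, float b, float c) { return std::fma(a, b, c); } // fmadd
```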
This patch adds a cost kind to `getAddressComputationCost()` for #149955.
Note that this patch also removes all the default values in `getAddressComputationCost()`.
Enhance the heuristics in `getAppleRuntimeUnrollPreferences` to let a
few more loops be unrolled.
Specifically, this patch adjusts two checks:
I. Raise the loop size budget from 8 to 10
II. Include immediate in-loop users of loaded values in the load/store
dependencies predicate
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
PR: https://github.com/llvm/llvm-project/pull/149358
This updates everywhere we emit or check an SME routine to use
RuntimeLibcalls to get the function name and calling convention.
Note: RuntimeLibcallEmitter had some issues with emitting non-unique
variable names for sets of libcalls, so I tweaked the output to avoid
the need for variables.
In some places we were passing the type of the value being accessed, and
in other cases the type of the pointer for the access.
The most "involved" user is
LoopVectorizationCostModel::getMemInstScalarizationCost, which is the
only call site that passes in the SCEV, and it passes along the pointer
type.
This changes call sites to consistently pass the pointer type, and
renames the arguments to clarify this.
No target actually inspects the type passed beyond checking whether or
not it is a vector, so this shouldn't have an effect.
This extracts the code for modelling an fp16 operation as
`fptrunc(fpop(fpext,fpext))` into a new function named
getFP16BF16PromoteCost so that it can be reused by the
arithmetic instructions. The function takes a lambda to
calculate the cost of the operation with the promoted type.
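For illustration (not part of the patch), the promotion pattern being costed
looks like this at the source level:
```cpp
// Without native fp16 arithmetic, an _Float16 add is modelled as
// fptrunc(fadd(fpext a, fpext b)): extend both operands to float, add,
// then truncate the result back to half.
_Float16 add_h(_Float16 a, _Float16 b) {
  return a + b;
}
```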
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse-grained, as
it does not take into account the differences between scalar and vector
compares. This PR extends the interface to take an EVT to allow finer
control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
Certain fcmp predicates need to be expanded into multiple operations
that are then or'd together. This adds some more accurate cost modelling
for them
based on the predicate. Unsupported operations are given the cost of a
libcall and the latency is set to 2 as that seemed to be fairly common
between different CPUs.
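As an illustration (not from the patch), the ordered not-equal predicate is
one such case: with no single matching compare instruction it becomes two
compares whose results are or'd together:
```cpp
// Semantics of fcmp "one" (ordered, not equal): false if either operand is
// NaN, true when both are ordered and differ; built from two compares + or.
bool fcmp_one(float a, float b) {
  return (a < b) || (a > b);
}
```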
This prevents them from generating Invalid costs, as generating the
instructions seems to work fine with and without +bf16. The costs are
mostly taken from the number of instructions (minus ptrue and constants).
Without this change, the following test would fail to compile
with `-march=armv8-a+sme`:
```
void func1(const svuint32_t *in, svuint32_t *out) {
  [&]() __arm_streaming { *out = *in; }();
}
```
But in general, it's probably better never to inline
streaming functions into non-streaming functions, because
they will have been marked as 'streaming' for a reason
by the user.
#147420 changed the unrolling preferences to permit unrolling of
non-auto-vectorized loops by checking for the isvectorized attribute.
However, when a loop is vectorized this attribute is put on both the
vector loop and the scalar epilogue, so the change prevented the scalar
epilogue from being unrolled.
Restore the previous behaviour of unrolling the scalar epilogue by
checking both for the isvectorized attribute and vector instructions in
the loop.
FMV priority is the value returned by a polymorphic function. On RISC-V
and X86 targets a 32-bit value is enough. On AArch64 we currently need
64 bits and we will soon exceed that. APInt seems to be a suitable
replacement for uint64_t, presumably with minimal compile-time overhead.
It allows bit manipulation, comparison, and variable bit widths.
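A minimal sketch (values and widths are illustrative) of the kind of use
APInt supports here:
```cpp
#include "llvm/ADT/APInt.h"

// An APInt priority can be wider than 64 bits while still supporting the
// bit manipulation and comparisons the FMV priority computation needs.
llvm::APInt makePriority(unsigned NumBits, unsigned FeatureBit) {
  llvm::APInt Priority(NumBits, 0);  // e.g. NumBits > 64 once AArch64 needs it
  Priority.setBit(FeatureBit);       // mark one feature's priority bit
  return Priority;
}
```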
Since the costs were added, the codegen for i8/i16 and/or/xor reductions
has improved. This updates the cost model so that the costs match the
number of instructions now generated.
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.
This meant that passes like LoopVectorize and SLPVectorizer, which are
likely the main users of the interface, would underestimate the cost of
inserts/extracts that move data between GPR and vector registers, which
have non-trivial costs.
The patch removes the special case and only returns costs of zero for
lane 0 if there is no need to transfer between integer and vector
registers.
This impacts a number of SLP tests, and most of them look like general
improvements. I think the change should make things more accurate for
any AArch64 target, but if not it could also just be made Apple CPU
specific.
I am seeing +2% end-to-end improvements on SLP-heavy workloads.
PR: https://github.com/llvm/llvm-project/pull/146526
The code here probably needs to change to handle types more uniformly,
but this patch prevents it from trying to use a simple type where one
does not exist.
Fixes #148438.
This patch permits loops with vector instructions to be unrolled.
Today there is an early exit in `getUnrollingPreferences()` of AArch64
targets if a vector instruction is observed in any of the loop blocks.
This patch fixes that so common loops like this one get a chance to be
unrolled:
void saxpy (float * dst, const float * src, const float a, const int len) {
  float32x4_t * vdst = (float32x4_t *)dst;
  float32x4_t * vsrc = (float32x4_t *)src;
  float32x4_t vk = vdupq_n_f32(a);
  for (int i = 0; i < (len >> 2); i++)
  {
    vdst[i] = vaddq_f32(vdst[i], vmulq_f32(vsrc[i], vk));
  }
}
Auto-vectorized loops are still not unrolled, unless they were not
interleaved when vectorized.
The provided test case shows the enhancement on top of runtime/partial
unrolling, depending on the CPU.
PR: https://github.com/llvm/llvm-project/pull/147420
This removes the CostKind == TCK_RecipThroughput limitation from
getCmpSelInstrCost, allowing it to return more accurate costs for CodeSize and
Lat / SizeLat. Especially for larger vectors under CodeSize, the returned costs
are currently 1, not the legalization cost.
getOrCreateResultFromMemIntrinsic can modify the current function by
inserting new instructions without EarlyCSE keeping track of the
changes.
Introduce a new CanCreate argument, and update the function to only
create new instructions when CanCreate is true. Use it when appropriate.
Fixes https://github.com/llvm/llvm-project/issues/145183
Consider IR such as this:
for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
%accum = phi i32 [ 0, %entry ], [ %add, %for.body ]
%gep.a = getelementptr i8, ptr %a, i64 %iv
%load.a = load i8, ptr %gep.a, align 1
%ext.a = zext i8 %load.a to i32
%add = add i32 %ext.a, %accum
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1025
br i1 %exitcond.not, label %for.exit, label %for.body
Conceptually we can vectorise this using partial reductions too,
although the current loop vectoriser implementation requires the
accumulation of a multiply. For AArch64 this is easily done with
a udot or sdot with an identity operand, i.e. a vector of (i16 1).
In order to do this I had to teach getScaledReductions that the
accumulated value may come from a unary op, hence there is only
one extension to consider. Similarly, I updated the vplan and
AArch64 TTI cost model to understand the possible unary op.
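For reference (not part of the patch), the IR above is a plain widening sum;
treating the all-ones vector as the dot-product operand is what makes the
udot/sdot lowering applicable:
```cpp
#include <cstdint>

// acc += zext(a[i]) is sum(zext(a[i])), which equals dot(a, <1, 1, ..., 1>),
// so a dot-product instruction with an identity (all-ones) operand can
// accumulate it even though the source loop contains no multiply.
uint32_t sum_bytes(const uint8_t *a) {
  uint32_t acc = 0;
  for (uint64_t i = 0; i < 1025; ++i)
    acc += a[i];
  return acc;
}
```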
---------
Co-authored-by: Matt Devereau <matthew.devereau@arm.com>
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions about
how vectors will be legalized are built into the generic cost routines -
for example whether they will widen or promote, with the cost modelling
assuming they will widen even though the default lowering promotes
integer vectors.
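A small sketch of that relationship (the helper name is illustrative, not
from the patch):
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"

// With a mask available, the shuffle destination type is
// <MaskElts x SrcEltTy>, which need not match the source vector type.
static llvm::VectorType *shuffleDstTyFromMask(llvm::VectorType *SrcTy,
                                              llvm::ArrayRef<int> Mask) {
  return llvm::VectorType::get(
      SrcTy->getElementType(), static_cast<unsigned>(Mask.size()),
      /*Scalable=*/llvm::isa<llvm::ScalableVectorType>(SrcTy));
}
```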
This patch attempts to start improving that - it originally tried to
alter more of the cost model, but that quickly became too many changes
at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
Currently the loop vectorizer can only vectorize interleave groups for
power-of-2 factors at scalable VFs, by recursively nesting
[de]interleave2 intrinsics.
However after https://github.com/llvm/llvm-project/pull/124825 and
#139893, we now have [de]interleave intrinsics for all factors up to 8,
which is enough to support all types of segmented loads and stores on
RISC-V.
Now that the interleaved access pass has been taught to lower these in
#139373 and #141512, this patch teaches the loop vectorizer to emit
these intrinsics for factors up to 8, which enables scalable
vectorization for non-power-of-2 factors.
As far as I'm aware, no in-tree target will vectorize a scalable
interleave group above factor 8 because the maximum interleave factor is
capped at 4 on AArch64 and 8 on RISC-V, and the
`-max-interleave-group-factor` CLI option defaults to 8, so the
recursive [de]interleaving code has been removed for now.
Factors of 3 with scalable VFs are also turned off in AArch64 since
there's no lowering for [de]interleave3 just yet either.
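As an illustration (not from the patch), a non-power-of-2 interleave group,
here with factor 3, of the kind that scalable-VF vectorization can now handle
on targets with the corresponding lowering (e.g. RISC-V):
```cpp
#include <cstdint>

// A factor-3 interleave group: three members sharing one stride-3 access
// pattern. With [de]interleave intrinsics for factors up to 8, such groups
// no longer need a power-of-2 factor to be vectorized at scalable VFs.
void scale_rgb(uint8_t *px, int n, int s) {
  for (int i = 0; i < n; ++i) {
    px[3 * i + 0] = (uint8_t)((px[3 * i + 0] * s) >> 8);  // member 0
    px[3 * i + 1] = (uint8_t)((px[3 * i + 1] * s) >> 8);  // member 1
    px[3 * i + 2] = (uint8_t)((px[3 * i + 2] * s) >> 8);  // member 2
  }
}
```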
Negated powers of 2 have similar codegen to their positive counterparts
when lowering sdiv (and identical codegen in the case of srem). For
sdiv, the lowering just negates the result at the end, so the codegen is
otherwise the same.
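A small illustration of that observation (not from the patch):
```cpp
// Dividing by a negated power of two is the power-of-two division followed
// by a negate, and the remainder is identical to the positive-divisor case.
int sdiv_neg8(int x) { return x / -8; }  // lowers like -(x / 8)
int srem_neg8(int x) { return x % -8; }  // lowers exactly like x % 8
```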
CreateVScale took a scaling parameter that had a single use outside of
IRBuilder with all other callers having to create a redundant
ConstantInt. To work around this some code preferred to use
CreateIntrinsic directly.
This patch simplifies CreateVScale to return a call to the llvm.vscale()
intrinsic and nothing more. As well as simplifying the existing call
sites I've also migrated the uses of CreateIntrinsic.
Whilst IRBuilder used CreateVScale's scaling parameter as part of the
implementations of CreateElementCount and CreateTypeSize, I have
follow-on work to switch them to the NUW variety and thus they would
stop using CreateVScale's scaling as well. To prepare for this I have
moved the multiplication and constant folding into the implementations
of CreateElementCount and CreateTypeSize.
As a final step I have replaced some callers of CreateVScale with
CreateElementCount where it's clear from the code they wanted the
latter.
Given a larger-than-legal shuffle we will split into multiple sub-parts.
This adds a check to the computed costs of sub-shuffles so that repeated
sequences are not accounted for multiple times. This especially reduces
the cost of broadcasts/splats.
SMECallAttrs is a new helper class that holds all the SMEAttrs for a
call. The interfaces to query actions needed for the call (e.g. change
streaming mode) have been moved to the SMECallAttrs class.
The main motivation for this change is to make the split between the
caller, callee, and callsite attributes more apparent.
Before this change, we would always merge callsite and callee
attributes. The main reason to do this was to handle indirect calls,
however, we also occasionally used callsite attributes on direct calls
in tests (mainly to avoid creating multiple function declarations). With
this patch, we now explicitly handle indirect calls and disallow
incompatible attributes on direct calls (so this patch is not entirely
an NFC).
Same as #137239, but with a change to avoid inferring SME attributes for
function definitions. This allows stubbing the SME ABI routines in C/C++
(and matches the old behaviour).
The intent of this code is to split larger vectors into smaller shuffles, but
it is currently triggering on some small vector types. Limit it to vectors
larger than 128 bits.
This patch combines uxt[bhw] intrinsics to and_u when the governing
predicate is all-true or the passthrough is undef (e.g. in cases of
"unknown" merging). This improves codegen as the latter can be
emitted as AND immediate instructions.
For example, given:
```cpp
svuint64_t foo(svuint64_t x) {
return svextb_z(svptrue_b64(), x);
}
```
Currently:
```gas
foo:
ptrue p0.d
movi v1.2d, #0000000000000000
uxtb z0.d, p0/m, z0.d
ret
```
Becomes:
```gas
foo:
and z0.d, z0.d, #0xff
ret
```