In streaming mode, the @llvm.aarch64.sme.cnts and @llvm.aarch64.sve.cnt
intrinsics are equivalent. For SVE, cnt* is lowered in instCombineIntrinsic
to @llvm.vscale(). This patch lowers the SME intrinsics similarly when in
streaming mode.
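A minimal sketch of the equivalence at the source level (assuming an
SME-enabled compiler and the arm_sme.h ACLE interface; the function name is
illustrative):
```cpp
#include <arm_sme.h>

// In a streaming function the SME count (svcntsw) and the SVE count (svcntw)
// return the same value, so both can be lowered to a multiple of vscale.
uint64_t elems_per_vector(void) __arm_streaming {
  return svcntsw();  // same result as svcntw() while in streaming mode
}
```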
This puts the onus on the caller to ensure the result type is big enough.
In the unlikely event a truncated result is required, the caller should
explicitly truncate a safe value.
We currently combine (AES (EOR (A, B)), 0) into (AES A, B) for Neon
intrinsics when the zero operand appears in the RHS of the AES
instruction.
This patch extends the combine to support the SVE AES intrinsics and the
case where the zero operand appears in the LHS of the AES instruction.
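For illustration, a sketch of the SVE case (assuming an SVE2-AES target; the
function name is illustrative):
```cpp
#include <arm_sve.h>

// AESE xors its two inputs before SubBytes/ShiftRows, so the explicit EOR
// against a zero key operand is redundant and the pair folds to
// aese(state, key). The same holds when the zero is the LHS operand.
svuint8_t aes_round(svuint8_t state, svuint8_t key) {
  return svaese_u8(sveor_u8_x(svptrue_b8(), state, key), svdup_n_u8(0));
}
```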
This patch combines uxt[bhw] intrinsics to and_u when the governing
predicate is all-true or the passthrough is undef (e.g. in cases of
"unknown" merging). This improves code generation because and_u can be
emitted as an AND immediate instruction.
For example, given:
```cpp
svuint64_t foo(svuint64_t x) {
  return svextb_z(svptrue_b64(), x);
}
```
Currently:
```gas
foo:
        ptrue   p0.d
        movi    v1.2d, #0000000000000000
        uxtb    z0.d, p0/m, z0.d
        ret
```
Becomes:
```gas
foo:
        and     z0.d, z0.d, #0xff
        ret
```
The SVE intrinsics support shift amounts greater than or equal to the
element type's bit length, essentially saturating the shift amount to the
bit length. However, the IR instructions consider this undefined behaviour
that results in poison. To account for this we now ignore the result of
simplifications that produce poison. This allows existing code to be used
to simplify the shifts but does mean:
1) We don't simplify cases like "svlsl_s32(x, splat(32)) => 0".
2) We no longer constant fold cases like "svadd(poison, X) => poison".
For (1) we would need dedicated target-specific combines anyway, and the
result of (2) is not specified by the ACLE, so replicating the LLVM IR
behaviour might be confusing to ACLE writers.
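For reference, case (1) corresponds to source like the following sketch (the
function name is illustrative):
```cpp
#include <arm_sve.h>

// The ACLE saturates the shift amount to the element width, so every active
// lane here is 0. The equivalent IR 'shl' by 32 on i32 is poison instead,
// which is why the generic simplification cannot be reused for this case.
svint32_t shift_out_all_bits(svint32_t x) {
  return svlsl_s32_x(svptrue_b32(), x, svdup_n_u32(32));
}
```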
This is the subset of binops (mul and fmul are already enabled) whose
behaviour fully aligns with the equivalent SVE intrinsic. The omissions are
the integer divides and shifts, which are defined to return poison for
values where the intrinsics have a defined result. These will be covered in
a separate PR.
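As a rough illustration of the folding this enables (the function name is
illustrative):
```cpp
#include <arm_sve.h>

// With the binop simplifications enabled, an all-constant predicated add
// like this can be constant folded to a splat of 5 at the IR level.
svint32_t const_add(void) {
  return svadd_s32_x(svptrue_b32(), svdup_n_s32(2), svdup_n_s32(3));
}
```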
SVE operations such as predicated loads are canonicalized to LLVM masked
loads, and likewise canonicalizing ptrue(all) to splat(1) creates further
optimization opportunities for generic LLVM IR passes.
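A small sketch of the kind of code that benefits (the function name is
illustrative):
```cpp
#include <arm_sve.h>

// The predicated load is canonicalized to an LLVM masked load; once
// ptrue(all) is canonicalized to splat(1), generic passes can treat it as
// an ordinary unconditional vector load.
svfloat32_t load_all(const float *p) {
  return svld1_f32(svptrue_b32(), p);
}
```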
After https://github.com/llvm/llvm-project/issues/126928 it is now
possible to rewrite the existing combines, which mostly only handle cases
where an operand is an identity value, to use existing simplify code and
unlock general constant folding.
Affected intrinsics:
llvm.aarch64.sve.fcvt.bf16f32
llvm.aarch64.sve.fcvtnt.bf16f32
The named intrinsics took a predicate based on the smallest element type
when it should have been based on the largest. The intrinsics have been
replaced by v2 equivalents and the affected code ported to use them.
The patch includes changes to getSVEPredicateBitCast() that ensure the
generated code for the auto-upgraded old intrinsics is unchanged.
The "narrowing top" convert instructions leave the bottom half of active
elements untouched and thus the first paramater of their associated
intrinsic remains live even when there are no inactive lanes.
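For example, using the bf16 narrowing-top convert (a sketch; the function
name is illustrative):
```cpp
#include <arm_sve.h>

// Even with an all-true predicate, 'even' stays live: the bottom (even)
// elements of the result are taken from it and only the top (odd) elements
// are written by the conversion.
svbfloat16_t cvtnt_top(svbfloat16_t even, svfloat32_t op) {
  return svcvtnt_bf16_f32_m(even, svptrue_b32(), op);
}
```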
This is mostly NFC but some output does change due to consistently
inserting into poison rather than undef and using i64 as the index
type for inserts.
Concatenating two predicates using uzp1, after converting them to double
length via sve.convert.to/from.svbool, is optimized poorly in the backend,
resulting in additional `and` instructions to zero the lanes. See
https://github.com/llvm/llvm-project/pull/78623/
Combine this pattern to use `llvm.vector.insert` for the concatenation and
get rid of the converts to/from svbool.
Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver.
We want to avoid it elsewhere as well. Just use the common "aarch64" without
other triple components.
This patch implements IR combines to convert the intrinsics used for _m
C/C++ builtins, when they take an all-active predicate, to their equivalent
_u intrinsics.
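For example (a sketch; the function name is illustrative):
```cpp
#include <arm_sve.h>

// With an all-true governing predicate there are no inactive lanes to merge,
// so the _m builtin's intrinsic can be rewritten to the equivalent _u form,
// giving instruction selection more freedom.
svfloat32_t scale(svfloat32_t a, svfloat32_t b) {
  return svmul_f32_m(svptrue_b32(), a, b);
}
```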
Differential Revision: https://reviews.llvm.org/D152005
Consider: add(pg, a, mul_u(pg, b, c))
Although the multiply's inactive lanes are undefined, they don't
contribute to the final result: the result for the inactive lanes comes
from "a", and thus the above is another form of mla rather than mla_u.
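In ACLE terms the pattern corresponds to something like the following sketch
(names are illustrative):
```cpp
#include <arm_sve.h>

// The inner svmul_x corresponds to the mul_u above. Its inactive lanes are
// irrelevant because the outer merging add takes its inactive lanes from
// 'a', so the pair can still be fused into a single predicated mla.
svint32_t fused(svbool_t pg, svint32_t a, svint32_t b, svint32_t c) {
  return svadd_s32_m(pg, a, svmul_s32_x(pg, b, c));
}
```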
This patch extends the existing IR combines for fmul, fsub and fadd that
rely on an all-active predicate so that they also apply to the equivalent
undef (_u) intrinsics.
Differential Revision: https://reviews.llvm.org/D150768
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you have made
changes to a Python file, the best way to handle that is to run
`git checkout --ours <yourfile>` and then reformat it with `black`.
If you run into any problems, post to Discourse about it and we will try
to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762