This is another step in aligning addTypeForStreamingSVE with addTypeForFixedLengthSVE,
which also improves code quality for extending loads and truncating stores.
Reviewed By: hassnaa-arm
Differential Revision: https://reviews.llvm.org/D141266
Ensure the negation required when lowering negative power-of-two
divides uses the scalable vector container type with the fixed
length result extracted from it.
Fixes: #59647
Differential Revision: https://reviews.llvm.org/D140563
By default expand all operations, then change to Custom/Legal if needed.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D141068
Use deduction guides instead of helper functions.
The only non-automatic changes have been:
1. ArrayRef(some_uint8_pointer, 0) needs to be changed into ArrayRef(some_uint8_pointer, (size_t)0) to avoid an ambiguous call with ArrayRef((uint8_t*), (uint8_t*))
2. CVSymbol sym(makeArrayRef(symStorage)); needed to be rewritten as CVSymbol sym{ArrayRef(symStorage)}; otherwise the compiler is confused and thinks we have a (bad) function prototype. There were a few similar situations across the codebase.
3. ADL doesn't seem to work the same for deduction-guides and functions, so at some point the llvm namespace must be explicitly stated.
4. The "reference mode" of makeArrayRef(ArrayRef<T> &) that acts as no-op is not supported (a constructor cannot achieve that).
Per reviewers' comment, some useless makeArrayRef have been removed in the process.
This is a follow-up to https://reviews.llvm.org/D140896 that introduced
the deduction guides.
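The first two points above can be illustrated with a small sketch. `MiniArrayRef` and `ByteRecord` are hypothetical stand-ins written for this example (they are not the LLVM classes), and the code assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, stripped-down stand-in for llvm::ArrayRef.
template <typename T> class MiniArrayRef {
  const T *Data = nullptr;
  std::size_t Length = 0;

public:
  MiniArrayRef(const T *Data, std::size_t Length)
      : Data(Data), Length(Length) {}
  MiniArrayRef(const T *Begin, const T *End)
      : Data(Begin), Length(static_cast<std::size_t>(End - Begin)) {}
  MiniArrayRef(const std::vector<T> &Vec)
      : Data(Vec.data()), Length(Vec.size()) {}

  std::size_t size() const { return Length; }
  const T &operator[](std::size_t I) const {
    assert(I < Length && "index out of range");
    return Data[I];
  }
};

// Deduction guides take over the job of a makeArrayRef()-style helper.
template <typename T> MiniArrayRef(const T *, std::size_t) -> MiniArrayRef<T>;
template <typename T> MiniArrayRef(const T *, const T *) -> MiniArrayRef<T>;
template <typename T> MiniArrayRef(const std::vector<T> &) -> MiniArrayRef<T>;

// Hypothetical consumer, playing the role CVSymbol plays in the text.
struct ByteRecord {
  MiniArrayRef<std::uint8_t> Bytes;
  explicit ByteRecord(MiniArrayRef<std::uint8_t> B) : Bytes(B) {}
};

std::size_t emptyViewSize(const std::uint8_t *Ptr) {
  // MiniArrayRef(Ptr, 0) would be ambiguous here: the literal 0 converts
  // equally well to std::size_t and to a null pointer.
  MiniArrayRef View(Ptr, static_cast<std::size_t>(0));
  return View.size();
}

std::size_t recordSize(const std::vector<std::uint8_t> &Storage) {
  // "ByteRecord Rec(MiniArrayRef(Storage));" would be parsed as a function
  // declaration, so braces are used instead.
  ByteRecord Rec{MiniArrayRef(Storage)};
  return Rec.Bytes.size();
}
```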
Differential Revision: https://reviews.llvm.org/D140955
The baseline legalization for `ISD::ZERO_EXTEND_VECTOR_INREG`
(`VectorLegalizer::ExpandZERO_EXTEND_VECTOR_INREG`)
blends in the zeros, but as mentioned e.g.
in b4bd0a404fe26071dab0854dfd9767974909c7c4,
there is no such thing for AArch64.
So some of the shuffles that would be nicely lowered
by `LowerVECTOR_SHUFFLE()`, e.g. into `ZIP1`,
would now be unrecognizable after round-tripping
through `ISD::ZERO_EXTEND_VECTOR_INREG` recognition & legalization.
The most obvious solution is to just custom-lower
`ISD::ZERO_EXTEND_VECTOR_INREG` as the `ZIP1`-with-zeros,
like it would have been originally in that test case.
This adds some tablegen patterns for RSHRN, which performs a rounding
shift with narrow. This is similar to the existing SHRN patterns with an
extra addition to perform the rounding, that adds 1<<(shift-1) before
the right shift. Because the round immediate and the shift amount are
tied, it goes via a ComplexPattern that uses a SelectRoundingVLShr
method to perform the selection checks.
aarch64_neon_rshrn intrinsics are expanded into the equivalent sequence
of instructions (trunc(shr(add(x, 1<<(shift-1)), shift))) so that they can
be converted back into RSHRN. This also allows us to match raddhn through
the adjusted patterns that previously used aarch64_neon_rshrn.
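As a scalar illustration of the expanded sequence, the sketch below models one lane narrowing from 16 to 8 bits. The function names are made up for this example; it models the arithmetic only, not the tablegen patterns.

```cpp
#include <cassert>
#include <cstdint>

// One lane of the expanded sequence trunc(shr(add(x, 1 << (shift - 1)), shift))
// for a 16-bit to 8-bit narrow.
std::uint8_t rshrnLane(std::uint16_t X, unsigned Shift) {
  assert(Shift >= 1 && Shift <= 8 && "shift range for a 16 -> 8 bit narrow");
  std::uint16_t Round = static_cast<std::uint16_t>(1u << (Shift - 1));
  // The add wraps at 16 bits; the bit that is lost would be discarded by the
  // truncation anyway because Shift <= 8.
  std::uint16_t Sum = static_cast<std::uint16_t>(X + Round);
  return static_cast<std::uint8_t>(Sum >> Shift);
}

// Plain (non-rounding) shift-right-narrow, for comparison.
std::uint8_t shrnLane(std::uint16_t X, unsigned Shift) {
  assert(Shift >= 1 && Shift <= 8 && "shift range for a 16 -> 8 bit narrow");
  return static_cast<std::uint8_t>(X >> Shift);
}
```

For example, with x = 0x00FF and a shift of 4, the rounding form yields 16 while the plain form yields 15.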
Differential Revision: https://reviews.llvm.org/D140297
This adds a simple fold of TRUNCATE(AArch64ISD::DUP) -> AArch64ISD::DUP,
which can help generate more optimal UMULL sequences, and seems useful
in general.
Differential Revision: https://reviews.llvm.org/D140289
If a load is consumed by a single splat, don't consider indexed loads.
This is an alternative implementation to D138581.
Depends on D139637.
Differential Revision: https://reviews.llvm.org/D139850
Given mul(zext(a), b), we can convert to a umull so long as we know that
the top bits of b are zero. This uses MaskedValueIsZero to detect that
case for NEON UMULL patterns.
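A scalar sketch of why the fold is sound, for one 32-bit lane widened to 64 bits. `topHalfIsZero` is a hypothetical stand-in for the known-bits query; the real check is done by MaskedValueIsZero on the DAG.

```cpp
#include <cassert>
#include <cstdint>

// One UMULL lane: 32 x 32 -> 64 bit unsigned widening multiply.
std::uint64_t umullLane(std::uint32_t A, std::uint32_t B) {
  return static_cast<std::uint64_t>(A) * static_cast<std::uint64_t>(B);
}

// Hypothetical stand-in for the known-bits check on b.
bool topHalfIsZero(std::uint64_t B) { return (B >> 32) == 0; }

// The original form: mul(zext(a), b) in 64 bits.
std::uint64_t mulZextWide(std::uint32_t A, std::uint64_t B) {
  return static_cast<std::uint64_t>(A) * B;
}

// Uses the widening multiply only when the top bits of b are known zero.
std::uint64_t mulViaUmullIfLegal(std::uint32_t A, std::uint64_t B) {
  if (topHalfIsZero(B))
    return umullLane(A, static_cast<std::uint32_t>(B));
  return mulZextWide(A, B);
}
```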
Differential Revision: https://reviews.llvm.org/D140287
Address the inconsistency between FLT_ROUNDS_ and SET_ROUNDING SDAG
node. Rename FLT_ROUNDS_ to GET_ROUNDING and add llvm.get.rounding
intrinsic to replace flt.rounds.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D139507
This allows folding a WHILEop with constant operands into a PTRUE
instruction when the given range fits a predicate format. This change also
fixes the unsigned overflow error introduced in D137547 for WHILELO lowering.
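A rough sketch of the idea for WHILELO, under stated assumptions: the fold is modelled as mapping the number of active lanes onto one of the SVE VLn predicate patterns (VL1 to VL8, VL16, VL32, VL64, VL128, VL256), limited to the lanes guaranteed by the minimum 128-bit vector. The function names are hypothetical, and the `X >= Y` guard is only one way to avoid unsigned wrap-around, not necessarily how the patch does it.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns N if an SVE "VLN" predicate pattern exists for exactly N lanes.
std::optional<unsigned> vlPatternFor(std::uint64_t NumActive) {
  if (NumActive >= 1 && NumActive <= 8)
    return static_cast<unsigned>(NumActive);
  switch (NumActive) {
  case 16:
  case 32:
  case 64:
  case 128:
  case 256:
    return static_cast<unsigned>(NumActive);
  default:
    return std::nullopt;
  }
}

// Models whilelo(X, Y) with constant operands: lane i is active while
// X + i < Y (unsigned). Returns the VL pattern a PTRUE could use, if any.
std::optional<unsigned> foldWhileLoToPtrue(std::uint64_t X, std::uint64_t Y,
                                           unsigned ElementBits) {
  assert((ElementBits == 8 || ElementBits == 16 || ElementBits == 32 ||
          ElementBits == 64) && "unsupported element size");
  if (X >= Y)
    return std::nullopt; // No active lanes; also keeps Y - X from wrapping.
  std::uint64_t NumActive = Y - X;
  unsigned MinElements = 128 / ElementBits; // Lanes every SVE vector has.
  if (NumActive > MinElements)
    return std::nullopt;
  return vlPatternFor(NumActive);
}
```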
Differential Revision: https://reviews.llvm.org/D139068
This change:
- Modifies the ACLE code to allow the new SLC value (3) for the prefetch
target.
- Introduces a new intrinsic, @llvm.aarch64.prefetch which matches the
PRFM family instructions much more closely, and can represent all
values for the PRFM immediate.
The target-independent @llvm.prefetch intrinsic does not have enough
information for us to be able to lower to it from the ACLE intrinsics
correctly.
- Lowers the ACLE calls to the new intrinsic on AArch64 (the ARM
lowering is unchanged).
- Implements code generation for the new intrinsic in both SelectionDAG
and GlobalISel. We specifically choose to continue to support lowering
the target-independent @llvm.prefetch intrinsic so that other
frontends can continue to use it.
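For reference, the 5-bit PRFM immediate packs a prefetch type, a target and a policy. The sketch below encodes it from those three fields; the constant and function names are invented for this example and do not reflect the operands of the new intrinsic.

```cpp
#include <cassert>

// Field values of the PRFM <prfop> immediate.
const unsigned PLD = 0, PLI = 1, PST = 2;        // type, bits [4:3]
const unsigned L1 = 0, L2 = 1, L3 = 2, SLC = 3;  // target, bits [2:1]
const unsigned KEEP = 0, STRM = 1;               // policy, bit [0]

// Packs the three fields into the 5-bit immediate.
unsigned encodePrfop(unsigned Type, unsigned Target, unsigned Policy) {
  assert(Type <= PST && "type values above PST are not named prefetch ops");
  assert(Target <= SLC && "target is a 2-bit field");
  assert(Policy <= STRM && "policy is a 1-bit field");
  return (Type << 3) | (Target << 1) | Policy;
}
```

With this layout the SLC target is simply the previously unused target value 3, for example PLDSLCKEEP encodes as 6.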
Differential Revision: https://reviews.llvm.org/D139443
[AArch64] The patch for lowering trunc instructions to 'tbl' for (8|16)xi32 -> (8|16)xi8 conversions in https://reviews.llvm.org/D133495 is extended to support trunc to 'tbl' lowering for (8|16) x i64 to (8|16) x i8.
A runtime microbenchmark for these transformations is added in https://reviews.llvm.org/D136274
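The idea behind the lowering can be modelled at byte level: truncating i64 to i8 keeps the least significant byte of each element, which a table lookup can select directly. This is a simplified sketch with hypothetical function names; it assumes little-endian lane layout and uses one large table, whereas a real TBL reads at most four 16-byte registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte-wise table lookup; out-of-range indices produce 0, as TBL does.
std::vector<std::uint8_t> tblLookup(const std::vector<std::uint8_t> &Table,
                                    const std::vector<std::uint8_t> &Indices) {
  std::vector<std::uint8_t> Result;
  Result.reserve(Indices.size());
  for (std::uint8_t Index : Indices) {
    std::uint8_t Byte = 0;
    if (Index < Table.size())
      Byte = Table[Index];
    Result.push_back(Byte);
  }
  return Result;
}

// Truncates each i64 element to i8 by selecting byte 0 of every element.
std::vector<std::uint8_t> truncI64ToI8(const std::vector<std::uint64_t> &In) {
  assert(In.size() <= 16 && "byte indices must fit in 8 bits in this model");
  std::vector<std::uint8_t> Table;
  std::vector<std::uint8_t> Indices;
  for (std::size_t Elt = 0; Elt < In.size(); ++Elt) {
    // Lay the element out as little-endian bytes.
    for (unsigned Byte = 0; Byte < 8; ++Byte)
      Table.push_back(static_cast<std::uint8_t>(In[Elt] >> (8 * Byte)));
    // The low byte of element Elt lives at byte offset Elt * 8.
    Indices.push_back(static_cast<std::uint8_t>(Elt * 8));
  }
  return tblLookup(Table, Indices);
}
```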
Reviewed by: fhahn, t.p.northover
Differential Revision: https://reviews.llvm.org/D135229
This covers 128-bit loads and atomicrmw operations without a single native
instruction. Using CAS has a better chance of succeeding under high
contention on some systems.
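For readers unfamiliar with the shape of such an expansion, here is a source-level analogue of a read-modify-write built from compare-and-swap, using nand as an example of an operation that may lack a single native instruction. It is written against std::atomic with a hypothetical function name and is not the code the backend emits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic nand expressed as a compare-and-swap loop; returns the old value.
std::uint32_t fetchNand(std::atomic<std::uint32_t> &Target,
                        std::uint32_t Operand) {
  std::uint32_t Old = Target.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak reloads Old with the current value, so
  // the new value is recomputed from fresh data on every attempt.
  while (!Target.compare_exchange_weak(Old, ~(Old & Operand),
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
  }
  return Old;
}
```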
This adds basic HADD and RHADD support for SVE, by marking the AVGFLOOR
and AVGCEIL as custom and converting those to HADD_PRED/RHADD_PRED
AArch64 nodes. Both the existing intrinsics and the _PRED nodes are then
lowered to the _ZPmZ instructions.
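As a reminder of the arithmetic involved, the sketch below models the unsigned forms of the two averaging operations for one 8-bit lane. The function names are made up for this example; the key point is that the sum is formed in a wider type so it cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Halving add (AVGFLOOR): (a + b) >> 1 without intermediate overflow.
std::uint8_t avgFloorU8(std::uint8_t A, std::uint8_t B) {
  unsigned Sum = static_cast<unsigned>(A) + static_cast<unsigned>(B);
  return static_cast<std::uint8_t>(Sum >> 1);
}

// Rounding halving add (AVGCEIL): (a + b + 1) >> 1 without overflow.
std::uint8_t avgCeilU8(std::uint8_t A, std::uint8_t B) {
  unsigned Sum = static_cast<unsigned>(A) + static_cast<unsigned>(B) + 1u;
  return static_cast<std::uint8_t>(Sum >> 1);
}
```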
Differential Revision: https://reviews.llvm.org/D131875
This is a recommit of f9e0390751cb5eefbbbc191f851c52422acacab1
The previous commit failed to handle cases where the zero-extended operand is an extended `BUILD_VECTOR`.
We don't replace a zext operand with a sext operand to select smull if any operand is a `BUILD_VECTOR`.
Original commit message:
We can safely replace a `zext` instruction with `sext` if the top bit is zero. This is useful because we can select `smull` when both operands are sign extended.
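A scalar sketch of the observation for a 16-bit lane widened to 32 bits. All function names are hypothetical and only model the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

std::int32_t zextI16(std::uint16_t V) { return V; }

// Sign extension written out explicitly to stay portable.
std::int32_t sextI16(std::uint16_t V) {
  std::int32_t Wide = V;
  return (V & 0x8000u) ? Wide - 0x10000 : Wide;
}

bool topBitIsZero(std::uint16_t V) { return (V & 0x8000u) == 0; }

// mul(zext(a), sext(b)): the mixed form that cannot use smull directly.
std::int32_t mulZextSext(std::uint16_t A, std::uint16_t B) {
  return zextI16(A) * sextI16(B);
}

// One smull lane: both operands are sign extended.
std::int32_t smullLane(std::uint16_t A, std::uint16_t B) {
  return sextI16(A) * sextI16(B);
}
```

When the top bit of the zero-extended operand is clear, both forms agree; when it is set, they differ, which is why the check is needed.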
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D134711
The prior code worked before SVE DIV was enabled for 128-bit vectors.
With 128-bit vectors, when run on a 256-bit machine, it would split and
do a signed unpack, but this resulted in one full vector and one empty
vector with a half-sized predicate. The effect was that only half the
elements were treated correctly.
The fix is to bisect the vector, sign extend, do the division, truncate
and then concat.
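The sequence of steps in the fix can be sketched on plain arrays. This models a 16 x i8 division with each half widened to 32 bits; the type and function names are hypothetical, and treating division by zero as 0 is an assumption made here to mirror the hardware divide and avoid undefined behaviour in C++.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec16xi8 = std::array<std::int8_t, 16>;
using Vec8xi32 = std::array<std::int32_t, 8>;

// Sign extends the eight lanes starting at Begin.
Vec8xi32 signExtendHalf(const Vec16xi8 &V, std::size_t Begin) {
  assert(Begin == 0 || Begin == 8);
  Vec8xi32 Wide{};
  for (std::size_t I = 0; I < 8; ++I)
    Wide[I] = V[Begin + I];
  return Wide;
}

// Wide signed division; a zero divisor yields 0 (assumption, see lead-in).
Vec8xi32 sdivWide(const Vec8xi32 &A, const Vec8xi32 &B) {
  Vec8xi32 Q{};
  for (std::size_t I = 0; I < 8; ++I)
    Q[I] = B[I] == 0 ? 0 : A[I] / B[I];
  return Q;
}

// Bisect, sign extend, divide, truncate, then concatenate.
Vec16xi8 sdivBisected(const Vec16xi8 &A, const Vec16xi8 &B) {
  Vec16xi8 Result{};
  for (std::size_t Half = 0; Half < 2; ++Half) {
    std::size_t Begin = Half * 8;
    Vec8xi32 Q = sdivWide(signExtendHalf(A, Begin), signExtendHalf(B, Begin));
    for (std::size_t I = 0; I < 8; ++I)
      Result[Begin + I] = static_cast<std::int8_t>(Q[I]); // truncate
  }
  return Result;
}
```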
Fixes #59357.
Differential Revision: https://reviews.llvm.org/D139618
As suggested in D12425, it would be better for the readcyclecounter
function on the ARM architecture to use the CNTVCT_EL0 register
(Counter-timer Virtual Count register) instead of the PMCCNTR_EL0
(Performance Monitors Cycle Count Register), because PMCCNTR_EL0 is a
PMU register which, depending on the configuration, might always
return zeroes and is not guaranteed to always increase.
Differential Revision: https://reviews.llvm.org/D136999
Add support for ZExt lowering for destination types beyond the existing support for (8|16) x i32.
The patch for lowering zext instructions to 'tbl' for (8|16)xi8 -> (8|16)xi32 conversions in https://reviews.llvm.org/D120571 is extended to support zext to 'tbl' lowering for Y x i8 to Y x i8X where X > 2 and X < 8; that is, any number of vector elements and any destination element type whose size is a multiple of 8 and lies between 16 & 64 is allowed for this transformation.
Related microbenchmarks are in https://reviews.llvm.org/D136274 & https://reviews.llvm.org/D138059
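The zero extension can be modelled as a byte shuffle in which every destination element takes one source byte followed by zero bytes; a table lookup supplies the zeros through out-of-range indices. The sketch below does this for i8 to i32, with hypothetical function names and a little-endian lane layout assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte-wise table lookup; out-of-range indices produce 0, as TBL does.
std::vector<std::uint8_t> lookupBytes(const std::vector<std::uint8_t> &Table,
                                      const std::vector<std::uint8_t> &Idx) {
  std::vector<std::uint8_t> Result;
  for (std::uint8_t Index : Idx) {
    std::uint8_t Byte = 0;
    if (Index < Table.size())
      Byte = Table[Index];
    Result.push_back(Byte);
  }
  return Result;
}

// Zero extends each i8 element to i32 using only a byte lookup.
std::vector<std::uint32_t> zextI8ToI32(const std::vector<std::uint8_t> &In) {
  assert(In.size() <= 16 && "models a single 16-byte table register");
  const std::uint8_t ZeroLane = 0xFF; // Out of range, so it reads as zero.
  std::vector<std::uint8_t> Indices;
  for (std::size_t Elt = 0; Elt < In.size(); ++Elt) {
    Indices.push_back(static_cast<std::uint8_t>(Elt)); // low byte
    Indices.insert(Indices.end(), 3, ZeroLane);        // three zero bytes
  }
  std::vector<std::uint8_t> Bytes = lookupBytes(In, Indices);
  std::vector<std::uint32_t> Out;
  for (std::size_t I = 0; I < Bytes.size(); I += 4) {
    std::uint32_t Value = 0;
    for (unsigned B = 0; B < 4; ++B)
      Value |= static_cast<std::uint32_t>(Bytes[I + B]) << (8 * B);
    Out.push_back(Value);
  }
  return Out;
}
```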
Differential Revision: https://reviews.llvm.org/D136722
We can safely replace a `zext` instruction with `sext` if the top bit is zero. This is useful because we can select `smull` when both operands are sign extended.
Reviewed By: fhahn, dmgreen
Differential Revision: https://reviews.llvm.org/D134711
Reversing double-words within a quad-word is possible using the REVD instruction
when SVE2p1 is enabled.
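In data terms, the operation swaps the two 64-bit halves of every 128-bit chunk. A minimal model follows, with a hypothetical function name and with predication ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Swaps the two double-words inside each quad-word of the vector.
std::vector<std::uint64_t> reverseDoubleWords(std::vector<std::uint64_t> V) {
  assert(V.size() % 2 == 0 && "expects whole 128-bit quad-words");
  for (std::size_t I = 0; I + 1 < V.size(); I += 2)
    std::swap(V[I], V[I + 1]);
  return V;
}
```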
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D139119
This only contains the SelectionDAG implementation. GlobalISel to
follow.
The broad approach is:
- Introduce new builtins for 128-bit wide instructions.
- Lower these to @llvm.read_register.i128/@llvm.write_register.i128
- Introduce target-specific ISD nodes which have legal operands (two
i64s rather than an i128). These are named AArch64::{MRRS, MSRR} to
match the instructions they are for. These are a little complex as
they need to match the "shape" of what they're replacing or the
legaliser complains.
- Select these using the existing tryReadRegister/tryWriteRegister to
share the MDString parsing code, and introduce additional code to
ensure these are selected into the right MRRS/MSRR instructions. What
makes this hard is ensuring that the two i64s end up in an XSeqPair
register pair, because SelectionDAG doesn't care that much about
register classes if it can avoid doing so.
The main change to existing code is the reorganisation of
tryReadRegister and tryWriteRegister to try to keep the string parsing
code separate from the instruction creating code.
This also includes the changes to clang to define and use the ACLE
feature macro named `__ARM_FEATURE_SYSREG128`.
Contributors:
Sam Elliott
Lucas Prates
Differential Revision: https://reviews.llvm.org/D139086
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
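A small illustration of what migrated code looks like when written directly against the standard type. `parseDigit` is a hypothetical function for this example and assumes C++17.

```cpp
#include <cassert>
#include <optional>

// Returns the numeric value of a decimal digit, or an empty optional.
std::optional<unsigned> parseDigit(char C) {
  if (C < '0' || C > '9')
    return std::nullopt; // Where older code would have written "return None;".
  return static_cast<unsigned>(C - '0');
}
```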
These are the two places where we explicitly want to use cnt in
SelectionDAG when feature CSSC is available: ISD::popcnt and ISD::parity.
For both, we need to make sure we're emitting optimized code for i32 (and
lower), i64 and i128. The best option is of course using the GPR CNT
instruction. If we don't have CSSC, but we do have NEON, we'll use the
floating point CNT. If all else fails, we'll fall back on the general GPR
popcnt and parity implementations.
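For reference, both operations reduce to a 64-bit population count, including the i128 case, which can be handled as two 64-bit halves. The sketch below is a portable scalar model with made-up function names, not the selected instruction sequence.

```cpp
#include <cassert>
#include <cstdint>

// Portable population count: clears the lowest set bit until none remain.
unsigned popcount64(std::uint64_t V) {
  unsigned Count = 0;
  while (V != 0) {
    V &= V - 1;
    ++Count;
  }
  return Count;
}

// An i128 count is the sum of the counts of its two halves.
unsigned popcount128(std::uint64_t Lo, std::uint64_t Hi) {
  return popcount64(Lo) + popcount64(Hi);
}

// Parity is the low bit of the population count.
unsigned parity64(std::uint64_t V) { return popcount64(V) & 1u; }

// For i128 the halves can be XORed first, since XOR preserves parity.
unsigned parity128(std::uint64_t Lo, std::uint64_t Hi) {
  return popcount64(Lo ^ Hi) & 1u;
}
```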
spec:
https://developer.arm.com/documentation/ddi0602/2022-09/Base-Instructions/CNT--Count-bits-
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D138808
To generate code compatible with streaming mode:
- enable custom lowering for TruncStore to avoid crashing
during legalizing TruncStore for non-integer vectors.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D138720
To generate code compatible with streaming mode:
- enable expanding ISD::SETUEQ to avoid custom-lowering setcc to setcc_merge_zero,
which causes a crash during instruction selection because there is no pattern matching it.
- Testing files:
- fp-compares.ll
Differential Revision: https://reviews.llvm.org/D138670
As apparent in the newly-added test, provided in:
cf624b23bc (commitcomment-90836329),
we should be more careful with handling wider vectors,
or we will assert later on.
This allows DemandedBits to see that the SVE count intrinsics (CNTB,
CNTH, CNTW, CNTD) sans multiplier will only ever produce small
positive integers. The maximum value you could get here is 256, which
is CNTB on a machine with a 2048-bit vector size (the maximum for SVE).
Using this, various redundant operations (zexts, sexts, ands, ors, etc.)
can be eliminated.
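The bound can be checked with a small model of the element counts. The function names are hypothetical, and the vector length is assumed to be a multiple of 128 bits between 128 and 2048.

```cpp
#include <cassert>

// Number of elements of the given size in one SVE vector, i.e. what
// CNTB/CNTH/CNTW/CNTD return for element sizes 8/16/32/64 bits.
unsigned countElements(unsigned VectorBits, unsigned ElementBits) {
  assert(VectorBits >= 128 && VectorBits <= 2048 && VectorBits % 128 == 0 &&
         "SVE vector lengths assumed for this sketch");
  assert((ElementBits == 8 || ElementBits == 16 || ElementBits == 32 ||
          ElementBits == 64) && "unsupported element size");
  return VectorBits / ElementBits;
}

// A mask that keeps the low 9 bits never changes the count, because the
// largest possible value is 256.
unsigned maskedCount(unsigned VectorBits, unsigned ElementBits) {
  return countElements(VectorBits, ElementBits) & 0x1FFu;
}
```

Since the result always fits in 9 bits, operations such as the mask above change nothing and can be dropped.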
Differential Revision: https://reviews.llvm.org/D138424