The RISC-V vector crypto extensions have been ratified. This patch
updates the Clang and LLVM support for these extensions to be
non-experimental, while leaving the C intrinsics as experimental since
the C intrinsics are not yet standardized.
Co-authored-by: Brandon Wu <brandon.wu@sifive.com>
These tests rely on SCEV recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
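As a minimal illustration of why a disjoint "or" can be treated as an "add" (the helper name `is_disjoint` is hypothetical and is not an LLVM API): when two values share no set bits, no carries can occur, so both operations yield the same result.

```python
def is_disjoint(a: int, b: int) -> bool:
    """True when a and b have no set bits in common."""
    return (a & b) == 0

# Example: an aligned base plus a small offset that fits in the cleared low bits.
base, offset = 0b1010_0000, 0b0000_0101
disjoint_or = base | offset
plain_add = base + offset

# With overlapping bits the two operations differ, because the add carries.
overlap_or = 3 | 1
overlap_add = 3 + 1
```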
Avoids infinite loop issues in some upcoming patches to help D152928. x86 sees a number of regressions that are addressed by extending SimplifyDemandedVectorEltsForTargetNode to cover more binop opcodes.
After determining the cost of loads that could not be coalesced into
`VectorizedLoads` in SLP, computing the cost of a gather-vectorized
load is carried out. Favour a potentially high valid cost when the
type of a group of loads, whose type is a vector of size dependent
upon `VF`, may be legalized into a scalar value.
Fixes: https://github.com/llvm/llvm-project/issues/68953.
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.
One thing that is currently not possible is to differentiate a missing
datalayout from an explicit `target datalayout = ""` in the IR file, since
the current APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.
Differential Revision: https://reviews.llvm.org/D141060
This was reverted in commit 0abaf3caee88ae74def2c7000aff8e61b24634bb
(#67178).
This version of the patch includes a fix for the failure, which was caused
by vp-reductions having an extra start value argument which the non-vp
counterparts did not have.
The issue #55208 noticed that std::rint is vectorized by the
SLPVectorizer, but a very similar function, std::lrint, is not.
std::lrint corresponds to ISD::LRINT in the SelectionDAG, and
std::llrint is a familiar cousin corresponding to ISD::LLRINT. Now,
neither ISD::LRINT nor ISD::LLRINT has a corresponding vector variant,
and the LangRef makes this clear in the documentation of llvm.lrint.*
and llvm.llrint.*.
This patch extends the LangRef to include vector variants of
llvm.lrint.* and llvm.llrint.*, and lays the necessary ground-work of
scalarizing it for all targets. However, this patch would be devoid of
motivation unless we show the utility of these new vector variants.
Hence, the RISCV target has been chosen to implement a custom lowering
to the vfcvt.x.f.v instruction. The patch also includes a CostModel for
RISCV, and a trivial follow-up can potentially enable the SLPVectorizer
to vectorize std::lrint and std::llrint, fixing #55208.
The patch includes tests, obviously for the RISCV target, but also for
the X86, AArch64, and PowerPC targets to justify the addition of the
vector variants to the LangRef.
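A rough picture of what scalarizing a vector lrint means, as a sketch only: each lane is rounded independently. The function names are hypothetical, and the sketch assumes the default round-to-nearest-even rounding mode, which Python's built-in round() also uses for floats.

```python
def lrint_scalar(x: float) -> int:
    """Round to the nearest integer, with ties going to the even integer."""
    return round(x)

def lrint_vector_scalarized(lanes):
    """Apply the scalar operation lane by lane, as a scalarized lowering would."""
    return [lrint_scalar(x) for x in lanes]

result = lrint_vector_scalarized([0.5, 1.5, 2.5, -1.2])
```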
This reverts commit fc865c20345860f394448c228054beafc22a1d4d.
Triggering assert on X86:
```
iree-compile: /work/third_party/llvm-project/llvm/include/llvm/Support/Casting.h:662: decltype(auto) llvm::dyn_cast(From *) [To = llvm::PointerType, From = llvm::Type]: Assertion `detail::isPresent(Val) && "dyn_cast on a non-existent value"' failed.
```
See PR for comments and full stack trace.
On RISCV, only a few VPIntrinsics have their cost modeled by the
VectorIntrinsicCostTable. Even so, none of those entries consider LMUL.
All other VPIntrinsics do not have meaningful modeling.
This patch models the cost of a VPIntrinsic as the cost of its non-VP
counterpart. It is possible that the VP intrinsic is cheaper than the
non-VP version depending on VL. On RISCV, this may be due to two reasons
(if the instruction is part of a loop):
1. A smaller VL can be used on the last iteration of the loop.
2. The VP instruction may avoid a scalar remainder loop.
I have left this as a TODO since I think this change puts us on the
right path of modeling the cost of a VPInstruction, and it isn't
entirely clear to me how much of a discount we should give to a
known VL<VLMAX or what to do when VL is unknown at compile time.
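To make the two reasons above concrete, here is a small sketch of the VL used on each iteration of a VL-predicated loop. It assumes a simplified model in which every iteration uses VL = min(remaining elements, VLMAX); the function name `vl_schedule` is hypothetical.

```python
def vl_schedule(trip_count: int, vlmax: int):
    """Return the VL used on each vector iteration under the simplified model."""
    vls = []
    remaining = trip_count
    while remaining > 0:
        vl = min(remaining, vlmax)
        vls.append(vl)
        remaining -= vl
    return vls

# 19 elements with VLMAX = 8: the last iteration runs with a smaller VL,
# and no scalar remainder loop is needed for the 3 leftover elements.
schedule = vl_schedule(19, 8)
```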
Add initial half/bfloat broadcast shuffles test coverage (more to follow)
Fixes #68117, which was stuck in a loop between getting scalarized insert/extract costs for the shuffle and then trying to convert a bfloat insert into a shuffle again.
When performing a masked load of an unpacked SVE vector type, i.e.
nxv8i8, followed by a zero- or sign-extend to an illegal wide type
such as nxv8i32 we typically end up with a combination of an
extending masked load and pair(s) of uunpklo/hi or sunpklo/hi
instructions. For example, see test @masked_sload_8i8_8i32 in file
CodeGen/AArch64/sve-masked-ldst-sext.ll, where
    %aval = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8(...
    %aext = sext <vscale x 8 x i8> %aval to <vscale x 8 x i32>
gets lowered to
    ld1sb { z1.h }, ...
    sunpklo z0.s, z1.h
    sunpkhi z1.s, z1.h
Currently the cost for the 'sext' operation in the example above is
1, whereas this patch changes it to 2 to reflect the pair of
instructions required. Similarly, when doing a masked load of a
nxv8i8 and extending to nxv8i64 the cost is changed to 6 to reflect
the 6 unpacks required.
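The counts of 2 and 6 can be reproduced with a small sketch. It assumes the extending load already widens nxv8i8 to 16-bit lanes in a single register, and that each further doubling of the element width needs one lo and one hi unpack per source register. The function name `unpack_count` is hypothetical and this is not the TTI implementation.

```python
def unpack_count(loaded_bits: int, dest_bits: int) -> int:
    """Count lo/hi unpack instructions needed to widen from loaded_bits to dest_bits."""
    count = 0
    registers = 1          # the extending load fills one register
    bits = loaded_bits
    while bits < dest_bits:
        count += 2 * registers   # one lo and one hi unpack per source register
        registers *= 2           # the widened data now occupies twice the registers
        bits *= 2
    return count

cost_to_i32 = unpack_count(16, 32)
cost_to_i64 = unpack_count(16, 64)
```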
Under RISCV experimental-zvbb, vector variants of llvm.ctpop lower to a
single instruction: vcpop. The cost-model does not check for the ZVBB
extension, and always associates a high cost to vector variants of
llvm.ctpop. Fix this defect.
Vector ctpop only exists under ZVBB, but ZVBB is unaccounted for in the
cost-model of ctpop. Document this defect with an additional RUN line in
the test for ctpop, showing identical costs with/without ZVBB. A
follow-up patch could fix this defect.
This patch fixes the compilation time issue of matrix-types-spec test
from test-suite.
Reproduction of the problem:
```
clang++ -DNDEBUG --target=riscv64-linux-gnu --sysroot=<sysroot path> --gcc-toolchain=<gcc path> -O2 -fenable-matrix <test-suite-path>/SingleSource/UnitTests/matrix-types-spec.cpp
```
On my machine, compilation takes 50.44s. In comparison, the same test
with RVV (-march=rv64gcv) compiles in 3.06s, and for x86-64 target it
takes 1.71s. It turns out that the main issue is the unrolling of a loop in
the multiplySpec function that has extractelements with a non-constant index:
```
for.body9.i: ; preds = %for.body9.i, %for.cond6.preheader.i
%indvars.iv.i92 = phi i64 [ 0, %for.cond6.preheader.i ], [ %indvars.iv.next.i93, %for.body9.i ]
%Elt.033.i = phi double [ 0.000000e+00, %for.cond6.preheader.i ], [ %80, %for.body9.i ]
%77 = mul nuw nsw i64 %indvars.iv.i92, 25
%78 = add nuw nsw i64 %77, %indvars.iv39.i91
%matrixext.i = extractelement <475 x double> %62, i64 %78
%79 = add nuw nsw i64 %indvars.iv.i92, %74
%matrixext13.i = extractelement <209 x double> %73, i64 %79
%80 = tail call double @llvm.fmuladd.f64(double %matrixext.i, double %matrixext13.i, double %Elt.033.i)
%indvars.iv.next.i93 = add nuw nsw i64 %indvars.iv.i92, 1
%exitcond.not.i94 = icmp eq i64 %indvars.iv.next.i93, 19
br i1 %exitcond.not.i94, label %for.cond.cleanup8.i, label %for.body9.i, !llvm.loop !21
```
When RVV is supported, extractelement/insertelement with non-constant
index can be lowered quite efficiently with vslidedown/vslideup;
otherwise it's lowered via stack memory operations, i.e. for
extractelement each vector element is stored on stack and then the
needed element is loaded back; for insertelement it stores all vector
elements, rewrites the required element value and then loads the vector
back. Currently the cost of such an expensive operation is estimated as
zero, so loop unroll processes the loop more aggressively. The proper
estimation of cost (in a way like in X86 target) prohibits unrolling of
this loop and fixes compilation time (2.77s on my machine).
This patch implements getCFInstrCost TTI hook that mostly affects
LoopVectorizer decisions. It sets zero cost for PHI nodes and zero
throughput cost for branches (assuming that branches are likely to
be predicted). The implementation is similar to X86/AArch64/PowerPC
targets and reduces loop cost by excluding induction PHIs/loop latch
branches, which in turn leads to selecting smaller vectorization
factor.
There are several typos in fround.ll, presumably caused by copy-pasting,
where there is a strange nvx5* type. From the surrounding code, it is
clear that this was intended to be nvx4*. Fix these typos.
Make codegen emit correctly rounded sqrt by default.
Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.
https://reviews.llvm.org/D158129
This adds some basic and/or/xor reduction costs for NEON/MVE, handling them
like other reductions where vector operations are used to reduce to legal
sizes, followed by an optional VREV+VAND/VORR/VEOR step and scalarization from
there.
This adds some basic smin/smax/umin/umax reduction costs for MVE/NEON, similar
to the existing Add reduction costs. They follow the same style as Add
reductions, but include a higher cost as the costs tend to be dependent on the
element size for vminv/vmaxv. These costs may not be precise, but will be more
in line than the default that extracts each element.
Similar to the other reductions, this changes the cost of fmin/fmax reductions
under MVE/NEON to perform vector operations until the types need to be
scalarized. The fp16 vectors can perform a VREV+FMIN/FMAX to skip a step of the
reduction, and otherwise need lanewise extract from the top lanes.
This adds some basic fadd/fmul reduction costs for MVE/NEON. It reduces by
halving the vector size until it gets scalarized, with some additional costs
for fp16 which may require extracting the top lanes.
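The halving scheme can be sketched as follows. This is an illustration of the reduction shape only, not the cost code; the name `halving_reduce` and the `scalarize_at` parameter are hypothetical, and a power-of-two lane count is assumed.

```python
import functools
import operator

def halving_reduce(lanes, op, scalarize_at=2):
    """Combine the upper half into the lower half until scalarize_at lanes remain,
    then finish with scalar operations. Returns (result, number of vector steps)."""
    lanes = list(lanes)
    if len(lanes) == 0 or len(lanes) & (len(lanes) - 1):
        raise ValueError("lane count must be a non-zero power of two")
    vector_steps = 0
    while len(lanes) > scalarize_at:
        half = len(lanes) // 2
        lanes = [op(lanes[i], lanes[i + half]) for i in range(half)]
        vector_steps += 1
    return functools.reduce(op, lanes), vector_steps

total, steps = halving_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], operator.add)
```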
Differential Revision: https://reviews.llvm.org/D159367
In particular, high LMULs, constant offsets within high LMUL, and types which require splitting. Note that most of these are way off with current lowering.
When SVE2 is enabled, we can combine an add of 1, add & shift right by 1
to a single s/urhadd instruction. If the operands to the adds are extended,
these extends will fold into the s/urhadd and their costs should be 0.
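A small sketch of the rounding halving add shows why the extends matter. The helper names are hypothetical; `urhadd` models the unsigned case on 8-bit lanes by computing in a wider type, while `narrow_rhadd` shows what goes wrong if the add is done at the narrow width.

```python
def urhadd(a: int, b: int, bits: int = 8) -> int:
    """Unsigned rounding halving add: (a + b + 1) >> 1 without intermediate overflow."""
    mask = (1 << bits) - 1
    return ((a & mask) + (b & mask) + 1) >> 1

def narrow_rhadd(a: int, b: int, bits: int = 8) -> int:
    """The same expression evaluated at the narrow width, where the sum can wrap."""
    mask = (1 << bits) - 1
    return (((a & mask) + (b & mask) + 1) & mask) >> 1

wide = urhadd(255, 255)
wrapped = narrow_rhadd(255, 255)
```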
Reviewed By: david-arm, dtemirbulatov
Differential Revision: https://reviews.llvm.org/D157628
Refresh of the generic scheduling model to use A510 instead of A55.
Main benefits are to the little core, and introducing SVE scheduling information.
Changes tested on various OoO cores, no performance degradation is seen.
Differential Revision: https://reviews.llvm.org/D156799
The subtarget was unconditionally reporting that SVE was to be used to
lower vectors when Neon was unavailable, even when SVE itself was
unavailable. This decision leads other parts of the compiler to crash,
e.g., when querying SVE vector sizes.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D158179
Try to avoid some unprofitable predication on PPC. Recognize in the cost model that computing on i1 values will require extra mask or compare operation.
Differential Revision: https://reviews.llvm.org/D155876
In SystemZTTIImpl::getMemoryOpCost, the call to getNumberOfParts will
run type legalization, which can't handle structs. So before that, we
check for an unknown value type and forward to BaseT, just like many
other targets do in this situation.
https://bugzilla.redhat.com/show_bug.cgi?id=2224885
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D156379
The cost of vector instructions has always been high under AArch64, in order to
add a high cost for inserts/extracts, shuffles and scalarization. This is a
conservative approach to limit the scope of unusual SLP vectorization where the
codegen ends up being quite poor, but has always been higher than the correct
costs would be for any specific core.
This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a
generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is
also overridden for integer vector at the same time, to remove the effect of
lane 0 being considered free for integer vectors (something that should only be
true for float when scalarizing).
The lower insert/extract cost will reduce the cost of insert, extracts,
shuffling and scalarization. The adjustments of ScalarizationOverhead will
increase the cost on integer, especially for small vectors. The end result will
be lower cost for float and long-integer types, some higher cost for some
smaller vectors. This, along with the raw insert/extract cost being lower, will
generally mean more vectorization from the Loop and SLP vectorizer.
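The interaction between the two changes can be pictured with a deliberately simplified model, which is an assumption for illustration and not the actual TTI code: the overhead is the per-lane cost times the number of lanes that are charged, with lane 0 optionally free. The function name is hypothetical.

```python
def scalarization_overhead(num_lanes: int, lane_cost: int, lane0_free: bool) -> int:
    """Simplified model: charge lane_cost per lane, optionally skipping lane 0."""
    charged = num_lanes - 1 if lane0_free else num_lanes
    return charged * lane_cost

# Integer vectors before (cost 3, lane 0 free) and after (cost 2, every lane charged).
before = {n: scalarization_overhead(n, 3, True) for n in (2, 4, 16)}
after = {n: scalarization_overhead(n, 2, False) for n in (2, 4, 16)}
```

Under this model a 2-lane integer vector gets more expensive while wider vectors get cheaper, which is the direction described above.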
We may end up regretting this, as that vectorization is not always profitable.
In all the benchmarking I have done this is generally an improvement in the
overall performance, and I've attempted to address the places where it wasn't
with other costmodel adjustments.
Differential Revision: https://reviews.llvm.org/D155459
As noted on #63980 rotate by immediate amounts is much cheaper than variable amounts.
This still needs to be expanded to vector rotate cases, and we need to add reasonable funnel-shift costs as well (very tricky as there's a huge range in CPU behaviour for these).
rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.
The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.
Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.
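The scaling idea can be sketched as below. This shows only the ldexp-based scaling around a square root, not the expansion itself; the threshold and exponent values are illustrative assumptions, the function name is hypothetical, and the input is assumed to be non-negative.

```python
import math

def scaled_sqrt(x: float, threshold: float = 2.0 ** -767, up: int = 256) -> float:
    """Scale very small inputs up by 2**up, take the root, scale back by 2**-(up/2)."""
    needs_scale = x < threshold
    scaled = math.ldexp(x, up) if needs_scale else x   # integer exponent, no double constant
    root = math.sqrt(scaled)
    return math.ldexp(root, -(up // 2)) if needs_scale else root

tiny = scaled_sqrt(5e-324)
# ldexp by an integer and multiplication by a power of two give the same value here.
same_scale = math.ldexp(1e-300, 256) == 1e-300 * 2.0 ** 256
```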
The libraries have another fast version of sqrt which will
be handled separately.
I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.
Unlike fmaxnum and fminnum, these operations propagate nan and
consider -0.0 to be less than +0.0.
Without Zfa, we don't have a single instruction for this. The
lowering I've used forces the other input to nan if one input
is a nan. If both inputs are nan, they get swapped. Then use
the fmax or fmin instruction.
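The lowering described above can be modeled in a few lines. This is a sketch under assumptions: `riscv_fmax` is a hypothetical stand-in for an fmax that returns the non-nan operand when only one input is nan and orders -0.0 below +0.0, and `fmaximum` applies the nan-forcing step before calling it.

```python
import math

def riscv_fmax(a: float, b: float) -> float:
    """Model of a max that ignores a single nan and treats -0.0 as less than +0.0."""
    if math.isnan(a) and math.isnan(b):
        return math.nan
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b:  # covers the +0.0 / -0.0 pair
        return a if math.copysign(1.0, a) > 0 else b
    return a if a > b else b

def fmaximum(a: float, b: float) -> float:
    """Force the other input to nan if one input is nan, then take the max."""
    new_a = b if math.isnan(b) else a
    new_b = a if math.isnan(a) else b
    return riscv_fmax(new_a, new_b)
```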
New ISD nodes are needed because fmaxnum/fminnum do not define
the order of -0.0 and +0.0.
This lowering ensures the snans are quieted (though that is probably not
required in the default environment). It also ensures non-canonical nans
are canonicalized, though I'm also not sure that's needed.
Another option could be to use fmax/fmin and then overwrite the
result based on the inputs being nan, but I'm not sure we can do
that with any less code.
Future work will handle nonans FMF, and the case where
we can prove the input isn't nan.
This does fix the crash in #64022, but we need to do more work
to avoid scalarization.
Reviewed By: fakepaper56
Differential Revision: https://reviews.llvm.org/D156069
vrgather.vv across multiple vector registers (i.e. LMUL > 1) requires all to all data movement. This includes two conceptual sets of changes:
For permutes, we were modeling these as being linear in LMUL.
For reverse, we were modeling them as being fixed cost in LMUL.
Both were wrong, and have been adjusted to O(LMUL^2). Noticed via code inspection while looking at something else.
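The difference between the old and new models can be sketched as below. The function names are hypothetical and the constants are placeholders; the point is only the growth rate, since each of the LMUL destination registers may need data from each of the LMUL source registers.

```python
def linear_cost(lmul: int) -> int:
    """Old permute model: cost grows linearly with LMUL."""
    return lmul

def quadratic_cost(lmul: int) -> int:
    """Adjusted model: all-to-all data movement grows as LMUL squared."""
    return lmul * lmul

comparison = {lmul: (linear_cost(lmul), quadratic_cost(lmul)) for lmul in (1, 2, 4, 8)}
```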
It's worth asking whether we should be lowering reverse to something other than a vrgather at high LMULs. That shuffle is quite expensive. (Future work)
Differential Revision: https://reviews.llvm.org/D152019
As in D140287, we can now generate umull from mul(zext(x), y) in cases where we
know that the top bits of y are zero. This teaches that to the cost model,
adjusting how isWideningInstruction detects mul operations that can extend both
operands. This helps for constants and other cases where the operands of the
mul are known to be extended, but not directly extends.
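A quick check of the underlying identity, as a sketch with hypothetical helper names: a 64-bit multiply of a zero-extended 32-bit value by y equals a 32x32->64 widening multiply whenever the top 32 bits of y are zero, and can differ otherwise.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul64(a: int, b: int) -> int:
    """64-bit wrapping multiply."""
    return (a * b) & MASK64

def umull(a: int, b: int) -> int:
    """Widening multiply of the low 32 bits of each operand."""
    return (a & MASK32) * (b & MASK32)

def top_bits_zero(y: int) -> bool:
    """True when the upper 32 bits of a 64-bit value are known to be zero."""
    return (y >> 32) == 0

x = 0xFFFFFFFF             # stands in for zext(x) from a 32-bit value
y_ok = 0x12345678          # top 32 bits are zero
y_bad = (1 << 32) | 1      # top 32 bits are not zero
```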
Differential Revision: https://reviews.llvm.org/D154936