This is a fix for a subset of legalization problems around 64 bit indices on
rv32 targets. For RV32+V, we were using the wrong mask type for the manual
truncation lowering for fixed length vectors. Instead, just use the generic
TRUNCATE node, and let it be lowered as needed.
Note that legalization is still broken for rv32+zve32. That appears to be
a different issue.
For RVV, if we want to perform an i8 or i16 element-wise vector
arithmetic right shift in a C/C++ program, the value to be shifted is
first sign extended to i32, and the shift amount is zero extended to
i32. The shift is then performed with a vsra.vv instruction and followed
by a truncate to get the final result. That pattern is later expanded
to a series of "vsetvli" and "vnsrl" instructions, because the RVV spec
only supports 2 * SEW -> SEW truncates. For vectors, however, the shift
amount can instead be computed as smin(Y, ScalarSizeInBits(Y) - 1), so
the shift can be performed directly in the narrow type. Also, for the
vsra instruction, we only care about the low lg2(SEW) bits of the shift
amount.
- Alive2: https://alive2.llvm.org/ce/z/u3-Zdr
- C++ Test cases : https://gcc.godbolt.org/z/q1qE7fbha
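As a rough sanity check of the equivalence above (not part of the patch),
the following Python sketch exhaustively compares the widened i8 shift
against a narrow shift with a clamped amount. The helper names are
hypothetical, and shift amounts are restricted to 0..31 on the assumption
that anything larger would be poison for the i32 shift.

```python
def sext8(x):
    """Interpret the low 8 bits of x as a signed i8 value."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x


def widened_ashr(x, y):
    """sext to i32, arithmetic shift right, then truncate back to i8."""
    return sext8(sext8(x) >> y)


def narrow_ashr(x, y):
    """Arithmetic shift right on i8 with the amount clamped to 7."""
    return sext8(sext8(x) >> min(y, 7))


def check_all():
    """Compare both forms for every i8 value and every amount in 0..31."""
    return all(widened_ashr(x, y) == narrow_ashr(x, y)
               for x in range(256) for y in range(32))
```

The same argument applies to i16 with a clamp of 15.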
This patch adjusts the legality check for riscv to use `cpop/cpopw` since `isOperationLegal(ISD::CTPOP, MVT::i32)` returns false on rv64gc_zbb.
Clang vs gcc: https://godbolt.org/z/rc3s4hjPh
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D156390
If the data type is larger than e16, and requires more than an LMUL1 register
class, prefer the use of vrgatherei16. This has three major benefits:
1) Less work needed to evaluate the constant for e.g. vid sequences. Remember
that arithmetic generally scales linearly with LMUL.
2) Less register pressure. In particular, the source and indices registers
*can* overlap so using a smaller index can significantly help at m8.
3) Smaller constants. We've got a bunch of tricks for materializing small
constants, and if needed, can use a EEW=16 load.
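To make the register pressure point concrete, here is a small sketch of
the size of the index register group, using the RVV rule
EMUL = (EEW / SEW) * LMUL. The function name is hypothetical; it is only
meant to illustrate the arithmetic, not how the backend computes it.

```python
from fractions import Fraction


def index_emul(sew, lmul, index_eew):
    """Register group size (EMUL) of the index operand."""
    return Fraction(index_eew, sew) * lmul


# vrgather.vv uses index EEW == SEW; vrgatherei16.vv uses index EEW == 16.
vrgather_index_regs = index_emul(sew=64, lmul=8, index_eew=64)
vrgatherei16_index_regs = index_emul(sew=64, lmul=8, index_eew=16)
```

For e64 at m8 the index group shrinks from 8 registers to 2.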
If we have a gather or a scatter whose index describes a permutation of the
lanes, we can lower this as a shuffle + a unit strided memory operation. For
RISCV, this replaces an indexed load/store with a unit strided memory operation
and a vrgather (at worst).
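A minimal model of the gather case, assuming indices are expressed in
elements rather than bytes and that memory is a flat Python list. All
function names here are illustrative only.

```python
def gather(memory, base, indices):
    """Indexed load: one independent access per lane."""
    return [memory[base + i] for i in indices]


def is_lane_permutation(indices):
    """True if the indices hit each of the first N elements exactly once."""
    return sorted(indices) == list(range(len(indices)))


def gather_as_shuffle(memory, base, indices):
    """Unit strided load followed by a vrgather-style shuffle."""
    if not is_lane_permutation(indices):
        raise ValueError("index is not a permutation of the lanes")
    loaded = memory[base:base + len(indices)]   # unit strided load
    return [loaded[i] for i in indices]         # shuffle of the loaded lanes
```

The scatter case is the mirror image: shuffle first, then do a unit
strided store.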
I did not bother to implement the vp.scatter and vp.gather variants of these
transforms because they'd only be legal when EVL was VLMAX. Given that, they
should have been transformed to the non-vp variants anyways. I haven't checked
to see if they actually are.
In D154687, we added a transform to narrow indexed load/store indices of the
form (shl (zext), C). We can move this into a generic transform over the
target independent nodes instead, and pick up the fixed vector cases with no
additional work required. This is an alternative to D158163.
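The reasoning behind the narrowing can be sketched as follows: a value
zero extended from N bits and shifted left by C needs at most N + C bits,
so any index type at least that wide is sufficient. The helper names and
the list of candidate widths are assumptions for illustration; the actual
combine may pick the type differently.

```python
def min_index_bits(src_bits, shift):
    """Bits needed to hold (zext iN x) << shift without losing information."""
    return src_bits + shift


def narrowed_index_width(src_bits, shift, widths=(8, 16, 32, 64)):
    """Smallest candidate element width that can hold the shifted index."""
    needed = min_index_bits(src_bits, shift)
    for width in widths:
        if width >= needed:
            return width
    raise ValueError("no candidate width is large enough")
```

For example, (shl (zext i8 x), 2) fits in an i16 index.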
Performing this transform points out that we weren't eliminating zero_extends
via the generic DAG combine. Adjust the (existing) callbacks so that we
do.
This change *removes* the existing transform on the target specific intrinsic
nodes. If anyone has a use case this impacts, please speak up.
Note: Reviewed as part of a stack of changes in PR# 66405.
If the index type is greater or equal to XLEN, then signed and unsigned
are the same. Canonicalize towards unsigned to simplify an upcoming transform.
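The claim can be checked with a small model of the address computation,
assuming addresses wrap modulo 2^XLEN. The function below is a
hypothetical sketch, not code from the backend.

```python
XLEN = 64


def effective_address(base, index, index_bits, signed):
    """base + index, with the index interpreted as signed or unsigned."""
    idx = index & ((1 << index_bits) - 1)
    if signed and idx >> (index_bits - 1):
        idx -= 1 << index_bits          # reinterpret as a negative value
    return (base + idx) % (1 << XLEN)   # addresses wrap at XLEN bits
```

With a 64 bit index the two interpretations give the same address; with a
32 bit index they do not, which is why the canonicalization is limited to
index types of at least XLEN bits.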
Note: Reviewed as part of a stack of changes in PR# 66405.
We have the analogous case in the single insert path. The reasoning here is that if the original VL fits in LMUL1, we'd prefer to clobber a few extra dead lanes than to force two VL toggles. VTYPE toggles are generally cheaper than VL toggles.
reland [InlineAsm] wrap ConstraintCode in enum class NFC (#66003)
This reverts commit ee643b706be2b6bef9980b25cc9cc988dab94bb5.
Fix up build failures in targets I missed in #66003
Kept as 3 commits for reviewers to see better what's changed. Will
squash when merging.
- reland [InlineAsm] wrap ConstraintCode in enum class NFC (#66003)
- fix all the targets I missed in #66003
- fix off by one found by llvm/test/CodeGen/SystemZ/inline-asm-addr.ll
This reverts commit 2ca4d136124d151216aac77a0403dcb5c5835bcd.
Also revert the followup, "[InlineAsm] fix botched merge conflict resolution"
This reverts commit 8b9bf3a9f715ee5dce96eb1194441850c3663da1.
There were SystemZ and Mips build errors, too many to fix forward.
Similar to
commit 2fad6e69851e ("[InlineAsm] wrap Kind in enum class NFC")
Fix the TODOs added in
commit 93bd428742f9 ("[InlineAsm] refactor InlineAsm class NFC
(#65649)")
Instead of switching on type before and after common code, use a helper function. This matches the style of DAGCombine.cpp more closely, and makes porting candidate changes from one place to the other much easier.
Add a DAG combine to form a masked.load from a masked_strided_load
intrinsic with stride equal to element size. This covers a couple of
extra test cases, and allows us to simplify and common up some existing
code on the concat_vector(load, ...) to strided load transform.
This is the first in a mini-patch series to try and generalize our
strided load and gather matching to handle more cases, and common up
different approaches to the same problems in different places.
This is the extract side of D159332. The goal is to avoid non-linear costing on patterns where an entire vector is split back into scalars. This is an idiomatic pattern for SLP.
Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique extracts, each with a cost linear in LMUL, the overall cost is O(LMUL^2) * VLEN/ETYPE. To avoid the degenerate case, fall back to the stack if we're beyond LMUL2.
There's a subtlety here. For this to work, we're *relying* on an optimization in LegalizeDAG which tries to reuse the stack slot from a previous extract. In practice, this appears to trigger for patterns within a block, but if we ended up with an explode idiom split across multiple blocks, we'd still be in quadratic territory. I don't think that variant is fixable within SDAG.
It's tempting to think we can do better than going through the stack, but well, I haven't found it yet if it exists. Here are the results for sifive-x280 on all the variants I wrote (all 16 x i64 with V):
output/sifive-x280/linear_decomp_with_slidedown.mca:Total Cycles: 20703
output/sifive-x280/linear_decomp_with_vrgather.mca:Total Cycles: 23903
output/sifive-x280/naive_linear_with_slidedown.mca:Total Cycles: 21604
output/sifive-x280/naive_linear_with_vrgather.mca:Total Cycles: 22804
output/sifive-x280/recursive_decomp_with_slidedown.mca:Total Cycles: 15204
output/sifive-x280/recursive_decomp_with_vrgather.mca:Total Cycles: 18404
output/sifive-x280/stack_by_vreg.mca:Total Cycles: 12104
output/sifive-x280/stack_element_by_element.mca:Total Cycles: 4304
I am deliberately excluding scalable vectors. It functionally works, but frankly, the code quality for an idiomatic explode loop is so terrible either way that it felt better to leave that for future work.
Differential Revision: https://reviews.llvm.org/D159375
As noted in
https://github.com/llvm/llvm-project/pull/65392#discussion_r1316259471,
when lowering an extract of a fixed length vector from another vector,
we don't need to perform the vslidedown on the full vector type. Instead
we can extract the smallest subregister that contains the subvector to
be extracted and perform the vslidedown with a smaller LMUL. E.g., with
+Zvl128b:
v2i64 = extract_subvector nxv4i64, 2
is currently lowered as
vsetivli zero, 2, e64, m4, ta, ma
vslidedown.vi v8, v8, 2
This patch shrinks the vslidedown to LMUL=2:
vsetivli zero, 2, e64, m2, ta, ma
vslidedown.vi v8, v8, 2
Because we know that there's at least 128*2=256 bits in v8 at LMUL=2,
and we only need the first 256 bits to extract a v2i64 at index 2.
lowerEXTRACT_VECTOR_ELT already has this logic, so this extracts it out
and reuses it.
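The LMUL choice in the example above can be sketched as follows. This is
a simplified model that assumes only the minimum VLEN is known and
ignores fractional LMULs; the function and parameter names are
hypothetical.

```python
def smallest_lmul_for_extract(elt_bits, num_elts, index, min_vlen,
                              container_lmul):
    """Smallest power-of-two LMUL whose registers cover the subvector."""
    last_bit = (index + num_elts) * elt_bits   # bits that must be reachable
    lmul = 1
    while lmul * min_vlen < last_bit and lmul < container_lmul:
        lmul *= 2
    return lmul


# v2i64 = extract_subvector nxv4i64, 2 with +Zvl128b (container is m4).
lmul = smallest_lmul_for_extract(elt_bits=64, num_elts=2, index=2,
                                 min_vlen=128, container_lmul=4)
```

Here the subvector ends at bit 256, so two 128 bit registers (LMUL=2) are
enough.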
I've split this out into a separate PR rather than include it in #65392,
with the hope that we'll be able to generalize it later.
This patch refactors extract_subvector lowering to lower to
extract_subreg directly, and to shortcut whenever the index is 0 when
extracting a scalable vector. This doesn't change any of the existing
behaviour, but makes an upcoming patch that extends the scalable path
slightly easier to read.
As mentioned in TODOs from D159332. This PR doesn't actually
common up that copy of the code because doing so is not NFC - due to
DLEN. Fixing that will be a future PR.
If we have a build_vector such as [i64 0, i64 3, i64 1, i64 2], we
instead lower this as vsext([i8 0, i8 3, i8 1, i8 2]). For vectors with
4 or fewer elements, the resulting narrow vector can be generated via
scalar materialization.
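As an illustration of the idea (not the actual lowering code), the sketch
below packs a 4 element build_vector into a single 32 bit scalar of i8
lanes and then sign extends each lane back out, mimicking what vsext
would produce. Lane 0 is assumed to live in the low byte, and all names
are hypothetical.

```python
def fits_in_i8(value):
    """True if the element survives a round trip through i8."""
    return -128 <= value <= 127


def pack_as_i8_scalar(elts):
    """Pack up to four i8 lanes into one scalar, lane 0 in the low byte."""
    if len(elts) > 4 or not all(fits_in_i8(v) for v in elts):
        raise ValueError("not a candidate for the narrow lowering")
    scalar = 0
    for lane, value in enumerate(elts):
        scalar |= (value & 0xFF) << (8 * lane)
    return scalar


def vsext_from_scalar(scalar, num_elts):
    """Sign extend each packed i8 lane back to a full width element."""
    out = []
    for lane in range(num_elts):
        byte = (scalar >> (8 * lane)) & 0xFF
        out.append(byte - 0x100 if byte & 0x80 else byte)
    return out
```

For [0, 3, 1, 2] the packed scalar is 0x02010300, which is cheap to
materialize.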
For shuffles which get lowered to vrgathers, constant build_vectors of
small constants are idiomatic. As such, this change covers all shuffles
with an output type of 4 or fewer elements.
I deliberately started narrow here. I think it makes sense to expand
this to longer vectors, but we need a more robust profit model on the
recursive expansion. It's questionable if we want to do the vsext if
we're going to generate a constant pool load for the narrower type
anyways.
One possibility for future exploration is to allow the narrower VT to be
less than 8 bits. We can't use vsext for that, but we could use
something analogous to our widening interleave lowering with some extra
shifts and ands.
Each vslide1down operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique inserts, each with a cost linear in LMUL, the overall cost is O(VL*LMUL). Since VL is a linear function of LMUL, this means the current lowering is quadratic in both LMUL and VL. To avoid the degenerate case, fall back to the stack if the cost is more than a fixed (linear) threshold.
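A toy model of the argument above, assuming each vslide1down costs
exactly LMUL abstract units and that the vector is fully populated at the
minimum VLEN. The unit costs are made up for illustration and are not
llvm-mca numbers.

```python
def insert_count(lmul, vlen=128, sew=64):
    """Number of elements (and so inserts) in a full vector at this LMUL."""
    return lmul * vlen // sew


def slide_lowering_cost(lmul, vlen=128, sew=64):
    """One slide per element, each slide assumed to cost `lmul` units."""
    return insert_count(lmul, vlen, sew) * lmul


costs = {lmul: slide_lowering_cost(lmul) for lmul in (1, 2, 4, 8)}
```

In this model each doubling of LMUL quadruples the cost, which is the
quadratic behaviour the threshold is meant to cap.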
For context, here's the sifive-x280 llvm-mca results for the current lowering and stack based lowering for each LMUL (using e64). Assumes code was compiled for V (i.e. zvl128b).
buildvector_m1_via_stack.mca:Total Cycles: 1904
buildvector_m2_via_stack.mca:Total Cycles: 2104
buildvector_m4_via_stack.mca:Total Cycles: 2504
buildvector_m8_via_stack.mca:Total Cycles: 3304
buildvector_m1_via_vslide1down.mca:Total Cycles: 804
buildvector_m2_via_vslide1down.mca:Total Cycles: 1604
buildvector_m4_via_vslide1down.mca:Total Cycles: 6400
buildvector_m8_via_vslide1down.mca:Total Cycles: 25599
There are other schemes we could use to cap the cost. The next best is recursive decomposition of the vector into smaller LMULs. That's still quadratic, but with a better constant. However, stack based seems to cost better on all LMULs, so we can just go with the simpler scheme.
Arguably, this patch is fixing a regression introduced with my D149667 as before that change, we'd always fallback to the stack, and thus didn't have the non-linearity.
Differential Revision: https://reviews.llvm.org/D159332
Now that the codegen for the expanded ISD::ROTL sequence has been improved,
it's probably profitable to lower a shuffle that's a rotate to the
vsll+vsrl+vor sequence to avoid a vrgather where possible, even if we don't
have the vror instruction.
This patch relaxes the restriction on ISD::ROTL being legal in
lowerVECTOR_SHUFFLEAsRotate. It also attempts to do the lowering twice: Once
if zvbb is enabled before any of the interleave/deinterleave/vmerge lowerings,
and a second time unconditionally just before it falls back to the vrgather.
This way it doesn't interfere with any of the above patterns that may be more
profitable than the expanded ISD::ROTL sequence.
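For reference, the expanded rotate is the usual shift/shift/or identity.
A minimal sketch on a single element, with hypothetical names and the
RVV instruction each step corresponds to noted in comments:

```python
def rotl(x, amount, bits):
    """Rotate the low `bits` bits of x left by `amount`."""
    mask = (1 << bits) - 1
    amount %= bits
    left = (x << amount) & mask                     # vsll
    right = (x & mask) >> ((bits - amount) % bits)  # vsrl
    return left | right                             # vor
```

The vector lowering does the same thing per element, so three simple
instructions stand in for the vrgather.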
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D159353
We currently have log, log2, log10, exp and exp2 intrinsics. Add exp10
to fix this asymmetry. AMDGPU already has most of the code for f32
exp10 expansion implemented alongside exp, so the current
implementation is duplicating nearly identical effort between the
compiler and library which is inconvenient.
https://reviews.llvm.org/D157871
If the high and low 32 bits are the same, we try to use
(ADD X, (SLLI X, 32)) but that only works if bit 31 is clear since
the low 32 bits will be sign extended.
If we have Zba we can use add.uw to zero the sign extended bits.
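The difference between the two sequences can be modelled on 64 bit
integers. This is only a sketch: X is assumed to have been materialized
as the sign extended low 32 bits, and the Python helper names are
hypothetical stand-ins for the instructions.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Low 32 bits of x, sign extended to a 64 bit register value."""
    x &= 0xFFFFFFFF
    return (x - (1 << 32) if x & 0x80000000 else x) & MASK64


def slli(x, shamt):
    return (x << shamt) & MASK64


def add(a, b):
    return (a + b) & MASK64


def add_uw(rs1, rs2):
    """Zba add.uw: rs2 plus the zero extended low 32 bits of rs1."""
    return ((rs1 & 0xFFFFFFFF) + rs2) & MASK64


def materialize_with_add(lo):
    x = sext32(lo)
    return add(x, slli(x, 32))


def materialize_with_add_uw(lo):
    x = sext32(lo)
    return add_uw(x, slli(x, 32))
```

With bit 31 set, the plain ADD picks up the sign extension bits and
corrupts the high half, while add.uw still produces the intended
constant.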
Reviewed By: reames, wangpc
Differential Revision: https://reviews.llvm.org/D159253
A shuffle of v256i1 with a large enough minimum vlen might make it through type
legalization and into lowering. In this case, zvl1024b was enough. The
bitreverse shuffle lowering would then try to convert this to a v1i256 type
which is invalid (v1i128 exists though, which is why the existing v128i1 tests
were fine).
This patch checks to make sure that the new type is not only legal but also
valid.
Reviewed By: craig.topper, reames
Differential Revision: https://reviews.llvm.org/D159215
Now that DAG.getConstant uses splat_vector_parts if needed on RV32, we can use
it directly without having to manually lower to a vmv_v_x_vl.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D159287
This re-implements the special casing we had in lowerScalarSplat as a DAG combine. As can be seen in the tests, this ends up triggering in a bunch more cases.
The semantically interesting bit of this change is the use of the implicit truncate semantics for when XLEN > SEW. We'd already been doing this for vmv.v.x, but this change extends e.g. the constant matching to make the same assumption about vmv.s.x. Per my reading of the specification, this should be fine, and if anything, is more obviously true of vmv.s.x than vmv.v.x.
Differential Revision: https://reviews.llvm.org/D158874
We'd discussed this in the original set of patches months ago, but decided against it. I think we should reverse ourselves here as the code is significantly more readable, and we do pick up cases we'd missed by not calling the appropriate helper routine.
Differential Revision: https://reviews.llvm.org/D158854
A rotate of 8 bits of an e16 vector in either direction is equivalent to a
byteswap, i.e. vrev8. There is a generic combine on ISD::ROT{L,R} to
canonicalize these rotations to byteswaps, but on fixed vectors they are
legalized before they have the chance to be combined. This patch teaches the
rotate vector_shuffle lowering to emit these rotations as byteswaps to match
the scalable vector behaviour.
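The equivalence is easy to check exhaustively for a single e16 element.
The helpers below are illustrative only:

```python
def rotl16(x, amount):
    """Rotate a 16 bit value left."""
    amount %= 16
    return ((x << amount) | (x >> (16 - amount))) & 0xFFFF


def rotr16(x, amount):
    """Rotate a 16 bit value right, expressed via rotl16."""
    return rotl16(x, (16 - amount) % 16)


def bswap16(x):
    """Swap the two bytes of a 16 bit value."""
    return ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)


def rotate_by_8_is_bswap():
    return all(rotl16(x, 8) == rotr16(x, 8) == bswap16(x)
               for x in range(1 << 16))
```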
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158195
Given a shuffle mask like <3, 0, 1, 2, 7, 4, 5, 6> for v8i8, we can
reinterpret it as a shuffle of v2i32 where the two i32s are bit rotated, and
lower it as a vror.vi (if legal with zvbb enabled).
We also need to make sure that the larger element type is a valid SEW, hence
the tests for zve32x.
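The reinterpretation can be checked with a small model, assuming
little-endian lane layout and the usual shuffle semantics
(result[i] = src[mask[i]]). Names are hypothetical. Note that the mask
above corresponds to a rotate left by 8 on each i32, which is the same
as a rotate right by 24.

```python
def shuffle(src, mask):
    """Single-source vector_shuffle: result[i] = src[mask[i]]."""
    return [src[i] for i in mask]


def to_i32_lanes(byte_lanes):
    """Reinterpret a list of i8 lanes as little-endian i32 lanes."""
    return [int.from_bytes(bytes(byte_lanes[i:i + 4]), "little")
            for i in range(0, len(byte_lanes), 4)]


def from_i32_lanes(lanes):
    """Reinterpret i32 lanes back into i8 lanes."""
    out = []
    for lane in lanes:
        out.extend(lane.to_bytes(4, "little"))
    return out


def rotl32(x, amount):
    amount %= 32
    return ((x << amount) | (x >> (32 - amount))) & 0xFFFFFFFF


def shuffle_via_rotate(src):
    """Model of the v2i32 rotate that replaces the v8i8 shuffle."""
    return from_i32_lanes([rotl32(lane, 8) for lane in to_i32_lanes(src)])
```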
X86 already did this, so I've extracted the logic for it and put it inside
ShuffleVectorSDNode so it could be reused by RISC-V. I originally tried to add
this as a generic combine in DAGCombiner.cpp, but it ended up causing worse
codegen on X86 and PPC.
Reviewed By: reames, pengfei
Differential Revision: https://reviews.llvm.org/D157417
If doubling the VL will fit in a vsetivli, use it. It will be cheap
to change and cheap to change back.
This improves codegen from D158896.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158896
We can use a 32-bit splat and bitcast to i64 vector.
This only handles the case where we are using vlmax so that the new
vl is cheap to compute. This could be generalized to double the VL.
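A sketch of why this works when the high and low halves of the i64
constant are equal: splatting the 32 bit value over twice as many e32
elements produces the same bytes as the i64 splat. The function name is
hypothetical and the element layout is assumed to be little-endian.

```python
import struct


def splat_i64_via_e32(lo, vl):
    """Splat `lo` over 2*vl e32 lanes and reinterpret them as vl e64 lanes."""
    e32_lanes = [lo & 0xFFFFFFFF] * (2 * vl)
    raw = struct.pack("<%dI" % len(e32_lanes), *e32_lanes)
    return list(struct.unpack("<%dQ" % vl, raw))
```

For example, splatting 0x89ABCDEF this way yields 0x89ABCDEF89ABCDEF in
every i64 lane.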
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158879
There was quite a bit of duplication between splatPartsI64WithVL
and the scalable vector handling in lowerSPLAT_VECTOR_PARTS, but
scalable vector had one additional case. Move that case to
splatPartsI64WithVL which improves some fixed vector tests.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D158876
FCLASS_VL was added in D151176, but there is still no vp.fpclass. This patch adds support for vp.fpclass.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D152993
When lowering a splat_vector_parts, if the hi bits are undefined then we can
splat the lo bits without having to check if it's going to be sign extended or
not, because those bits will be undefined anyway.
I've handled it for both fixed and scalable vectors, but there's no diff
on the scalable vror tests, since the hi bits aren't combined away to
undef in SimplifyDemanded for scalable vectors. I'm not sure why that is.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158625
At some point a merge operand was added to the binary vl ops, so this combine
was using the mask for the VL. This causes a crash when trying to
select the vmv_v_x_vl, which showed up locally when messing about with
selectVSplat, but thankfully in ToT the vmv_v_x_vl gets pattern matched
away into the .vx and .vi operands every time, so there's no noticeable
change.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D158634
For most fp16 vector ops, we can promote them to fp32 vectors when zvfhmin is enabled but zvfh is not.
But for nxv32f16, we need to split it first since nxv32f32 is not a valid MVT.
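The constraint can be illustrated with the LMUL arithmetic for scalable
types, assuming 64 bits per vscale block (my understanding of
RVVBitsPerBlock) and a maximum LMUL of 8. The helper names are
hypothetical and this is not the legalizer's actual logic.

```python
from fractions import Fraction

RVV_BITS_PER_BLOCK = 64   # assumed block size for scalable RVV types
MAX_LMUL = 8


def lmul_of(min_elts, elt_bits):
    """LMUL implied by a scalable type nxv<min_elts> x <elt_bits>."""
    return Fraction(min_elts * elt_bits, RVV_BITS_PER_BLOCK)


def promoted_parts(min_elts):
    """Element counts of the f32 vectors used to promote an f16 op."""
    if lmul_of(min_elts, 32) <= MAX_LMUL:
        return [min_elts]                 # promote directly
    return [min_elts // 2] * 2            # split first, then promote halves
```

nxv32f16 is already LMUL=8, so its f32 counterpart would need LMUL=16;
splitting into two nxv16f16 halves keeps each promoted half at LMUL=8.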
Reviewed By: michaelmaitland
Differential Revision: https://reviews.llvm.org/D153848
This extends the concat_vector of loads to strided_load transform to handle a reversed index pattern. The previous code expected indexing of the form (a0, a1+S, a2+S,...). However, we can also see indexing of the form (a1+S, a2+S, a3+S, .., aS). This form is a strided load starting at address aN + S*(n-1) with stride -S.
Note that this is also fixing what looks to be a bug in the memory location reasoning for the forward strided case. A strided load with negative stride accesses eltsize bytes past the base ptr, and then bytes *before* the base ptr. (That is, the range should extend from before the base ptr to after the base ptr.)
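A small model of the reversed case, assuming addresses index a flat
Python list in units of elements rather than bytes. The function names
are illustrative; `last_addr` plays the role of aN above.

```python
def strided_load(memory, base, stride, count):
    """Load `count` elements starting at `base`, stepping by `stride`."""
    return [memory[base + i * stride] for i in range(count)]


def concat_of_reversed_loads(memory, last_addr, stride, count):
    """concat_vector whose k-th operand sits at last_addr + stride*(count-1-k)."""
    return [memory[last_addr + stride * (count - 1 - k)]
            for k in range(count)]


def as_negative_strided_load(memory, last_addr, stride, count):
    """Same result as one strided load with a negative stride."""
    return strided_load(memory, last_addr + stride * (count - 1),
                        -stride, count)
```

Note that the load starts at the highest address and walks downward,
which is why the accessed range extends before the base pointer.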
Differential Revision: https://reviews.llvm.org/D157886
If we have a known (or bounded) index which definitely fits in a smaller LMUL register group size, we can reduce the LMUL of the slide and extract instructions. This loosens constraints on register allocation, and allows the hardware to do less work, at the potential cost of some additional VTYPE toggles. In practice, we appear (after prior patches) to do a decent job of eliminating the additional VTYPE toggles in most cases.
Differential Revision: https://reviews.llvm.org/D158460
Preparation for developing a new rounding mode insertion algorithm
that is going to be different between them since VXRM doesn't need
to be saved/restored.
This also unifies the FRM handling in RISCVISelLowering.cpp between
scalar and vector.
Fixes outdated comments in RISCVAsmPrinter and sorts the predicate
function by the reverse order of the operands being skipped.
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D158326
clang recently started checking for INT64_MIN being passed to 64-bit std::abs.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D158304