This builds on bdc41106ee48dce59c500c9a3957af947f30c8c3.
This change completes the migration to a recursive shuffle lowering
strategy: when we encounter an unknown two-argument shuffle, we
lower each operand as a single-source permute, and then use a vselect
(i.e. a vmerge) to combine the results. Code quality relies on
the post-isel combine, which will aggressively fold that vmerge back into
the materialization of the second operand if possible.
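As a rough illustration of the decomposition described above, the following standalone sketch splits a two-source shuffle mask into two single-source permute masks plus a select mask. It is not the code in the RISC-V backend; `SplitShuffle` and `splitTwoSourceMask` are hypothetical names, and the mask convention (indices into the concatenation of both operands, -1 for undef) is assumed to follow the usual shufflevector convention.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type: two single-source permutes plus a select mask.
struct SplitShuffle {
  std::vector<int> LHSMask;    // permute of the first operand, -1 is undef
  std::vector<int> RHSMask;    // permute of the second operand, -1 is undef
  std::vector<bool> SelectLHS; // true: take the lane from the permuted LHS
};

// Mask entries index the concatenation of two N-element sources:
// [0, N) picks from the first operand, [N, 2N) from the second, -1 is undef.
SplitShuffle splitTwoSourceMask(const std::vector<int> &Mask) {
  const int N = static_cast<int>(Mask.size());
  SplitShuffle S{std::vector<int>(N, -1), std::vector<int>(N, -1),
                 std::vector<bool>(N, true)};
  for (int I = 0; I < N; ++I) {
    const int M = Mask[I];
    assert(M >= -1 && M < 2 * N && "mask index out of range");
    if (M < 0)
      continue; // undef lane: either side may supply it
    if (M < N) {
      S.LHSMask[I] = M;
    } else {
      S.RHSMask[I] = M - N;
      S.SelectLHS[I] = false;
    }
  }
  return S;
}
```

For the mask `{0, 5, 2, 7}` this yields an LHS permute `{0, -1, 2, -1}`, an RHS permute `{-1, 1, -1, 3}`, and a select mask that alternates between the two sides.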
Note: The change includes only the most immediately obvious of the
stylistic cleanup. There's a bunch of code movement that this enables
that I'll do as a separate patch as rolling it into this creates an
unreadable diff.
This patch allows VCIX instructions that have side effects to be
reordered with memory and other side-effecting instructions. However, we
don't want VCIX instructions to be reordered with each other, so we
propose a dummy register called VCIX_STATE and make these instructions
implicitly define and use it.
This patch adds basic TLSDESC support in the RISC-V backend.
Specifically, we add new relocation types for TLSDESC, as prescribed in
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/373, and add a
new pseudo instruction to simplify code generation.
This patch does not try to optimize the local dynamic case, which can be
improved in separate patches.
Linker side changes will also be handled separately.
The current implementation is only enabled when passing the new
`-enable-tlsdesc` codegen flag.
This is the first step towards an alternate shuffle lowering design for
the general two vector argument case. The goal is to leverage the
existing lowering for single vector permutes to avoid as many of the
vrgathers as possible - even if we do need the other.
This patch handles only the first argument, and is arguably a slightly
weird half-step. However, the test changes from the full two argument
recurse patch are a lot harder to reason about. Taking this half step
gives much more easily reviewable changes, and is thus worthwhile. I
intend to post the patch for the second argument once this has landed.
If we have a shuffle which is larger than m1, we may be able to split it
into a series of individual m1 shuffles. This patch starts with the
subcase where the mask allows a 1-to-1 mapping from source register to
destination register - each with a possible permutation of their own. We
can potentially extend this later, though in practice this seems to
already catch a number of the most interesting cases.
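The subcase above can be sketched as a check on the shuffle mask. This is one reading of the description, not the in-tree implementation: it assumes a single-source mask, treats -1 as undef, and requires only that each destination register read from exactly one source register. `RegPermute` and `splitPerRegister` are hypothetical names, and `ElemsPerReg` stands for the number of elements in one m1 register.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-destination-register description.
struct RegPermute {
  int SrcReg = -1;           // source register index, -1 if all lanes are undef
  std::vector<int> LaneMask; // lane indices within SrcReg, -1 is undef
};

// Returns true and fills Out if every destination register reads from a
// single source register; returns false otherwise.
bool splitPerRegister(const std::vector<int> &Mask, int ElemsPerReg,
                      std::vector<RegPermute> &Out) {
  assert(ElemsPerReg > 0 &&
         static_cast<int>(Mask.size()) % ElemsPerReg == 0);
  const int NumRegs = static_cast<int>(Mask.size()) / ElemsPerReg;
  std::vector<RegPermute> Result(NumRegs);
  for (int R = 0; R < NumRegs; ++R) {
    Result[R].LaneMask.assign(ElemsPerReg, -1);
    for (int L = 0; L < ElemsPerReg; ++L) {
      const int M = Mask[R * ElemsPerReg + L];
      if (M < 0)
        continue; // undef lane places no constraint on the source register
      const int Src = M / ElemsPerReg;
      if (Result[R].SrcReg == -1)
        Result[R].SrcReg = Src;
      else if (Result[R].SrcReg != Src)
        return false; // lanes of this register come from two sources
      Result[R].LaneMask[L] = M % ElemsPerReg;
    }
  }
  Out = Result;
  return true;
}
```

With four elements per register, the mask `{1, 0, 3, 2, 7, 6, 5, 4}` splits into two independent per-register permutes, while an interleave such as `{0, 4, 1, 5, 2, 6, 3, 7}` is rejected.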
Ensure that getVLENFactoredAmount does not fail when the scale amount
requires the use of a non-trivial multiplication but the M extension is
not enabled. In such a case, perform the multiplication using shifts and
adds.
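A minimal sketch of the shift-and-add idea follows. It works on plain integers rather than emitting machine instructions, and `mulByShiftAdd` is a hypothetical name; the actual sequence emitted by getVLENFactoredAmount may be structured differently.

```cpp
#include <cstdint>

// Multiplies Value by Factor using only shifts and adds: one shifted copy of
// Value is accumulated for each set bit of Factor.
uint64_t mulByShiftAdd(uint64_t Value, uint64_t Factor) {
  uint64_t Result = 0;
  for (unsigned Bit = 0; Factor != 0; ++Bit, Factor >>= 1)
    if (Factor & 1)
      Result += Value << Bit;
  return Result;
}
```

For example, a factor of 24 (binary 11000) becomes `(Value << 3) + (Value << 4)`.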
Although there are predicated versions of minnum/maxnum, the ones for
minimum/maximum are currently missing. This patch introduces these
intrinsics and implements their lowering to RISC-V.
For inline asm with memory operands, we can merge the offset into
the second operand of memory constraint operands.
Differential Revision: https://reviews.llvm.org/D158062
We want to know the upper 33 bits of the And Input are zero. SExt
only guarantees they are the same.
We originally checked for SExt or ZExt when we were using
isImpliedByDomCondition because a ZExt may have been changed to SExt
before we visited the And.
We are no longer using isImpliedByDomCondition so we can only look
for zext with the nneg flag.
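The distinction between the two extensions can be shown on concrete values. The sketch below models a 32-to-64-bit extension with plain integers; all function names are hypothetical, and the conversion through `int32_t` assumes two's complement behaviour.

```cpp
#include <cstdint>

// Zero extension: the upper 32 bits of the result are always zero.
uint64_t zext32(uint32_t V) { return static_cast<uint64_t>(V); }

// Sign extension: the upper 32 bits are copies of bit 31.
uint64_t sext32(uint32_t V) {
  return static_cast<uint64_t>(static_cast<int64_t>(static_cast<int32_t>(V)));
}

// True if bits [63:31] of V are all zero.
bool upper33Zero(uint64_t V) { return (V >> 31) == 0; }

// True if bits [63:31] of V are all equal, which is all sext guarantees.
bool upper33Same(uint64_t V) {
  const uint64_t Top = V >> 31;
  return Top == 0 || Top == 0x1FFFFFFFFull;
}
```

`sext32(0x80000000)` has its upper 33 bits equal but not zero. A zext only gives zero upper 33 bits when bit 31 of the input is also clear, which is what the nneg flag asserts.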
While here, switch to PatternMatch to simplify the code.
Fixes #78783
If the element type of the vector we're extracting from doesn't match the type we're
inserting into, we can't directly insert or extract the subvector.
This adds minimal support for 7 new unprivileged extensions that were
defined as a part of the RISC-V Profiles specification here:
https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc#7-new-isa-extensions
* Ziccif: Main memory supports instruction fetch with atomicity requirement
* Ziccrse: Main memory supports forward progress on LR/SC sequences
* Ziccamoa: Main memory supports all atomics in A
* Zicclsm: Main memory supports misaligned loads/stores
* Za64rs: Reservation set size of 64 bytes
* Za128rs: Reservation set size of 128 bytes
* Zic64b: Cache block size is 64 bytes
As stated in the specification, these extensions don't add any new
features but describe existing features. So this patch only adds parsing
and subtarget features.
LLVM vector reduction intrinsics return a scalar result, but on RISC-V
vector reduction instructions write the result in the first element of a
vector register. So when a reduction in a loop uses a scalar phi, we end
up with unnecessary scalar moves:
```
loop:
vfmv.s.f v10, fa0
vfredosum.vs v8, v8, v10
vfmv.f.s fa0, v8
```
This mainly affects ordered fadd reductions, which have a scalar accumulator
operand.
This tries to vectorize any scalar phis that feed into a fadd reduction
in RISCVCodeGenPrepare, converting:
```
loop:
%phi = phi float [ ..., %entry ], [ %acc, %loop]
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi, <vscale x 2 x float> %vec)
```
to
```
loop:
%phi = phi <vscale x 2 x float> [ ..., %entry ], [ %acc.vec, %loop]
%phi.scalar = extractelement <vscale x 2 x float> %phi, i64 0
%acc = call float @llvm.vector.reduce.fadd.nxv2f32(float %phi.scalar, <vscale x 2 x float> %vec)
%acc.vec = insertelement <vscale x 2 x float> poison, float %acc, i64 0
```
This eliminates the scalar -> vector -> scalar crossing during
instruction selection.
This patch was originally introduced in PR #72340, but was reverted due
to a bug that performed an invalid extension combine.
Specifically, we resolve the case reported in
https://github.com/llvm/llvm-project/pull/72340#issuecomment-1874810998:
```
define <vscale x 1 x i32> @foo(<vscale x 1 x i1> %x, <vscale x 1 x i2> %y) {
%a = zext <vscale x 1 x i1> %x to <vscale x 1 x i32>
%b = zext <vscale x 1 x i2> %y to <vscale x 1 x i32>
%c = add <vscale x 1 x i32> %a, %b
ret <vscale x 1 x i32> %c
}
```
The previous patch didn't check whether the semantics of `ISD::SIGN_EXTEND`
and `ISD::ZERO_EXTEND` are equivalent to `vsext.vf2` or `vzext.vf2`
(i.e. it did not ensure the SEW condition on widening Vector Arithmetic
Instructions).
Thanks to @topperc for pointing out this bug.
## The original description
This PR mainly aims at resolving the below missed-optimization case,
while it could also be considered as an extension of the previous patch
https://reviews.llvm.org/D133739?id=
### Missed-Optimization Case
Compiler Explorer: https://godbolt.org/z/GzWzP7Pfh
### Source Code:
```
define <vscale x 2 x i16> @multiple_users(ptr %x, ptr %y, ptr %z) {
%a = load <vscale x 2 x i8>, ptr %x
%b = load <vscale x 2 x i8>, ptr %y
%b2 = load <vscale x 2 x i8>, ptr %z
%c = sext <vscale x 2 x i8> %a to <vscale x 2 x i16>
%d = sext <vscale x 2 x i8> %b to <vscale x 2 x i16>
%d2 = sext <vscale x 2 x i8> %b2 to <vscale x 2 x i16>
%e = mul <vscale x 2 x i16> %c, %d
%f = add <vscale x 2 x i16> %c, %d2
%g = sub <vscale x 2 x i16> %c, %d2
%h = or <vscale x 2 x i16> %e, %f
%i = or <vscale x 2 x i16> %h, %g
ret <vscale x 2 x i16> %i
}
```
### Before This Patch
```
# %bb.0:
vsetvli a3, zero, e16, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
vsext.vf2 v11, v8
vsext.vf2 v8, v9
vsext.vf2 v9, v10
vmul.vv v8, v11, v8
vadd.vv v10, v11, v9
vsub.vv v9, v11, v9
vor.vv v8, v8, v10
vor.vv v8, v8, v9
ret
```
### After This Patch
```
# %bb.0:
vsetvli a3, zero, e8, mf4, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vle8.v v10, (a2)
vwmul.vv v11, v8, v9
vwadd.vv v9, v8, v10
vwsub.vv v12, v8, v10
vsetvli zero, zero, e16, mf2, ta, ma
vor.vv v8, v11, v9
vor.vv v8, v8, v12
ret
```
We can see Add/Sub/Mul are combined with the Sign Extension.
### Relation to the Patch D133739
The patch D133739 introduced an optimization for folding `ADD_VL` /
`SUB_VL` / `MUL_VL` with `VSEXT_VL` / `VZEXT_VL`. However, that patch did
not consider scalable (non-fixed-length) vectors, so this PR
can also be considered an extension of D133739.
If VLEN is exactly known, we may be able to use the vsetivli encoding
instead of the vsetvli a0, zero, <vtype> encoding. This slightly reduces
register pressure.
This builds on 632f1c5, but reverses course a bit. It turns out to be
quite complicated to canonicalize from VLMAX to immediate early because
the sentinel value is widely used in tablegen patterns without knowledge
of LMUL. Instead, we canonicalize towards the VLMAX representation, and
then pick the immediate form during insertion since we have the LMUL
information there.
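The decision made at insertion time can be sketched as follows, assuming VLMAX = LMUL * VLEN / SEW and that vsetivli encodes its AVL as a 5-bit unsigned immediate. The names `LMul`, `computeVLMax`, and `fitsVSETIVLI` are hypothetical and do not correspond to functions in RISCVInsertVSETVLI.

```cpp
#include <cassert>

// LMUL as a fraction so that mf2/mf4/mf8 can be represented exactly.
struct LMul {
  unsigned Num;
  unsigned Den;
};

// VLMAX for an exactly known VLEN (in bits), element width, and LMUL.
unsigned computeVLMax(unsigned VLenBits, unsigned SEWBits, LMul L) {
  assert(SEWBits != 0 && L.Den != 0 && "SEW and LMUL denominator must be nonzero");
  return VLenBits * L.Num / (SEWBits * L.Den);
}

// vsetivli can only encode AVL values in the range [0, 31].
bool fitsVSETIVLI(unsigned AVL) { return AVL < 32; }
```

With VLEN=128, an e32/m1 configuration gives VLMAX=4, which fits the immediate form, whereas e8/m8 gives VLMAX=128 and still needs the register form.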
Within InsertVSETVLI, this could reasonably fit in a couple of places. If
reviewers want me to e.g. move it to emission, let me know. Doing so may
require a bit of extra code to e.g. handle comparisons of the two forms,
but shouldn't be too complicated.
Relying on ComputeKnownBits to find a splat is causing miscompilations where a shift of zero is being assumed to give zero, but further simplification leads to a shift of zero by undef, resulting in an unexpected undef value.
Fixes #78109
This commit includes the necessary changes to clang and LLVM to support
codegen of `RVE` and the `ilp32e`/`lp64e` ABIs.
The differences between `RVE` and `RVI` are:
* `RVE` reduces the integer register count to 16 (x0-x15).
* The ABI should be `ilp32e` for 32 bits and `lp64e` for 64 bits.
`RVE` can be combined with all current standard extensions.
The central changes in ilp32e/lp64e ABI, compared to ilp32/lp64 are:
* Only 6 integer argument registers (rather than 8).
* Only 2 callee-saved registers (rather than 12).
* A Stack Alignment of 32 bits (rather than 128 bits).
* ilp32e isn't compatible with the D ISA extension.
If `ilp32e` or `lp64e` is used with an ISA that has any of the registers
x16-x31 and f0-f31, then these registers are considered temporaries.
To be compatible with the implementation of ilp32e in GCC, we don't use
aligned registers to pass variadic arguments and set stack alignment
to 4 bytes for types with length of 2*XLEN.
FastCC is also supported on RVE, while GHC isn't since there is only one
available register.
Differential Revision: https://reviews.llvm.org/D70401
Reordering based on the sort order of the MemOpInfo array was disabled
in <https://reviews.llvm.org/D72706>. However, it's not clear this is
desirable for all targets. It also makes it more difficult to compare the
incremental benefit of enabling load clustering in the selectiondag
scheduler as well as the machinescheduler, as the sdag scheduler does
seem to allow this reordering.
This patch adds a parameter that can control the behaviour on a
per-target basis.
Split out from #73789.
vmv.s.x and vmv.x.s ignore LMUL, so we can replace the PseudoVMV_S_X_MX
and PseudoVMV_X_S_MX with just one pseudo each. These pseudos use the VR
register class (just like the actual instruction), so we now only have
TableGen patterns for vectors of LMUL <= 1.
We now rely on the existing combines that shrink LMUL down to 1 for
vmv_s_x_vl (and vfmv_s_f_vl). We could look into removing these combines
later and just inserting the nodes with the correct type in a later
patch.
The test diff is due to the fact that a PseudoVMV_S_X/PseudoVMV_X_S no
longer carries any information about LMUL, so if it's the only vector
pseudo instruction in a block then it now defaults to LMUL=1.
InstCombine will add the disjoint flag to these or instructions. This patch
adds them to the tests so that it matches the input RISCVGatherScatterLowering
will receive in practice, allowing us to rely on said disjoint flag:
https://github.com/llvm/llvm-project/pull/77800#discussion_r1449231844
This patch adds support for the disjoint flag in the non-recursive case,
as well as adding an additional check for it in the recursive case. Note
that haveNoCommonBitsSet should be equivalent to having the disjoint
flag set, and the check can be removed in a follow-up patch.
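The property both checks express is that an `or` whose operands share no set bits computes the same value as an `add`. The sketch below demonstrates this on concrete integers; `noCommonBits` and `orAsAdd` are hypothetical helpers and only stand in for the idea behind haveNoCommonBitsSet and the disjoint flag, not for their implementations.

```cpp
#include <cstdint>

// True if A and B have no set bit in common.
bool noCommonBits(uint64_t A, uint64_t B) { return (A & B) == 0; }

// If A and B are disjoint, stores A + B in Sum and returns true; in that case
// Sum is guaranteed to equal A | B because no bit position produces a carry.
bool orAsAdd(uint64_t A, uint64_t B, uint64_t &Sum) {
  if (!noCommonBits(A, B))
    return false;
  Sum = A + B;
  return true;
}
```

For instance, `(8 * I) | 3` equals `8 * I + 3` for any `I`, since the low three bits of `8 * I` are zero, while `6 | 3` is 7 and differs from `6 + 3`.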
Co-authored-by: Philip Reames <preames@rivosinc.com>
Currently vfmv.s.f intrinsics are directly selected to their pseudos via
a tablegen pattern in RISCVInstrInfoVPseudos.td, whereas the other move
instructions (vmv.s.x/vmv.v.x/vmv.v.f etc.) first get lowered to their
corresponding VL SDNode, then get selected from a pattern in
RISCVInstrInfoVVLPatterns.td.
This patch brings vfmv.s.f in line with the other move instructions.
Split out from #71501, where we did this to preserve the behaviour of
selecting vmv_s_x for VFMV_S_F_VL for small enough immediates.
This reverts commit fdb87640ee2be63af9b0e0cd943cb13d79686a03, and thus
re-enables terminator folding for RISCV. The reported miscompile has
been fixed in f5dd70c58277d925710e5a7c25c86d7565cc3c6c.
This already gets converted to a strided intrinsic because we currently call
haveNoCommonBitsSet when checking or instructions, but an upcoming patch will
change this logic and we want to preserve this case.
Note that this IR is in the form that comes from instcombine. The splats need
to be inline constexprs, otherwise isSplatValue() will fail. (It can't
currently handle splats where the shufflevector is an instruction, and the
insertelement is a constexpr.)
This adds new isel patterns for Zacas that take priority over the
pseudoinstructions we use for the A extension.
Support for 2x XLen types will come in a separate patch since they need
to be done differently.
* Add IRTranslate tests for ADD, SUB, AND, OR, and XOR with scalable
vector types to show that they work as expected.
* Legalize G_ADD, G_SUB, G_AND, G_OR, and G_XOR of scalable vector
type for the RISC-V vector extension.
Similar to #76550, but for `ISD::AVGCEILU`.
Specifically, this patch aims to use `vaaddu` with rounding mode rnu
(i.e. `vxrm[1:0] = 0b00`) for `ISD::AVGCEILU`.
### Source code
```
define <vscale x 8 x i8> @vaaddu_vv_nxv8i8_ceil(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y) {
%xzv = zext <vscale x 8 x i8> %x to <vscale x 8 x i16>
%yzv = zext <vscale x 8 x i8> %y to <vscale x 8 x i16>
%add = add nuw nsw <vscale x 8 x i16> %xzv, %yzv
%one = insertelement <vscale x 8 x i16> poison, i16 1, i32 0
%splat = shufflevector <vscale x 8 x i16> %one, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
%add1 = add nuw nsw <vscale x 8 x i16> %add, %splat
%div = lshr <vscale x 8 x i16> %add1, %splat
%ret = trunc <vscale x 8 x i16> %div to <vscale x 8 x i8>
ret <vscale x 8 x i8> %ret
}
```
### Before this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
vwaddu.vv v10, v8, v9
vsetvli zero, zero, e16, m2, ta, ma
vadd.vi v10, v10, 1
vsetvli zero, zero, e8, m1, ta, ma
vnsrl.wi v8, v10, 1
ret
```
### After this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
csrwi vxrm, 0
vaaddu.vv v8, v8, v9
ret
```
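To see why the rounding mode matters, the following sketch models an 8-bit `vaaddu` on scalars. It assumes the fixed-point rounding increments defined in the RVV specification for a right shift by one bit; `VXRM`, `roundingIncrement`, and `vaaddu8` are hypothetical names used only for this model.

```cpp
#include <cstdint>

// vxrm[1:0] encodings of the fixed-point rounding modes.
enum class VXRM : unsigned { RNU = 0, RNE = 1, RDN = 2, ROD = 3 };

// Rounding increment added after shifting V right by one bit.
unsigned roundingIncrement(unsigned V, VXRM Mode) {
  const unsigned Bit0 = V & 1;        // the bit that is shifted out
  const unsigned Bit1 = (V >> 1) & 1; // the new least significant bit
  switch (Mode) {
  case VXRM::RNU: return Bit0;              // round to nearest, ties up
  case VXRM::RNE: return Bit0 & Bit1;       // round to nearest, ties to even
  case VXRM::RDN: return 0;                 // truncate
  case VXRM::ROD: return Bit0 & (Bit1 ^ 1); // round to odd
  }
  return 0;
}

// Scalar model of vaaddu on 8-bit elements: the sum is formed with one extra
// bit of precision, shifted right by one, and then rounded.
uint8_t vaaddu8(uint8_t X, uint8_t Y, VXRM Mode) {
  const unsigned Sum = static_cast<unsigned>(X) + static_cast<unsigned>(Y);
  return static_cast<uint8_t>((Sum >> 1) + roundingIncrement(Sum, Mode));
}
```

Under this model, rnu computes `(X + Y + 1) >> 1`, the ceiling average, and rdn computes `(X + Y) >> 1`, the floor average.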
Regarding
```
.option norelax
j label
.option relax
// relaxable instructions
// For assembly input, RISCVAsmParser::ParseInstruction will set ForceRelocs (https://reviews.llvm.org/D46423).
// For direct object emission, ForceRelocs is not set after https://github.com/llvm/llvm-project/pull/73721
label:
```
The J instruction needs a relocation to ensure the target is correct
after linker relaxation. This is related to a limitation in the assembler:
RISCVAsmBackend::shouldForceRelocation decides upfront whether a
relocation is needed, instead of checking more information (whether
there are relaxable fragments in between).
Despite the limitation, `j label` used to produce a relocation in direct object
emission mode, but this was broken by #73721 due to the shouldForceRelocation
limitation.
Add a workaround to RISCVTargetELFStreamer to emulate the previous
behavior.
Link: https://github.com/ClangBuiltLinux/linux/issues/1965
When both the SHF_LINK_ORDER and SHF_GROUP flags are set, GNU assembler from
2.35 onwards (https://sourceware.org/PR25381, https://sourceware.org/binutils/docs/as/Section.html) parses the
SHF_LINK_ORDER argument before the section group name, unlike us.
This is unfortunate, but does not matter because the `.section` flag `o`
is a niche feature only used by compiler instrumentations, not adopted
by hand-written assembly, and using both flags is extremely rare. Let's
just match GNU assembler. There is another benefit: we now support
zero-flag section group with the SHF_LINK_ORDER flag, for which there was
previously no syntax.
While here, print 'G' after 'o' to be clear that the 'G' argument is
parsed after the 'o' argument. To make the diff smaller, we don't print
'G' after 'w' in the absence of 'o' for now.
This patch aims to use `vaaddu` with rounding mode rdn (i.e. `vxrm[1:0] =
0b10`) for `ISD::AVGFLOORU`.
### Source code
```
define <8 x i8> @vaaddu_auto(ptr %x, ptr %y, ptr %z) {
%xv = load <8 x i8>, ptr %x, align 2
%yv = load <8 x i8>, ptr %y, align 2
%xzv = zext <8 x i8> %xv to <8 x i16>
%yzv = zext <8 x i8> %yv to <8 x i16>
%add = add nuw nsw <8 x i16> %xzv, %yzv
%div = lshr <8 x i16> %add, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%ret = trunc <8 x i16> %div to <8 x i8>
ret <8 x i8> %ret
}
```
### Before this patch
```
vaaddu_auto:
vsetivli zero, 8, e8, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vwaddu.vv v10, v8, v9
vnsrl.wi v8, v10, 1
ret
```
### After this patch
```
vaaddu_auto:
vsetivli zero, 8, e8, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
csrwi vxrm, 2
vaaddu.vv v8, v8, v9
ret
```
### Note on signed averaging addition
Based on the rvv spec, there is also a variant for signed averaging
addition called `vaadd`.
But AFAIU, no matter in which rounding mode, we cannot achieve the
semantic of signed averaging addition through `vaadd`.
Thus this patch only introduces `vaaddu`.
sifive-p450 supports a very restricted version of the short forward
branch optimization from the sifive-7-series.
For sifive-p450, a branch over a single c.mv can be macrofused as a
conditional move operation. Due to encoding restrictions on c.mv, we
can't conditionally move from X0. That would require c.li instead.
Since their values are small enough ([-1, 65535] & [0, 65535],
respectively) to fit into signed 32 bits, any sext (or downcasting +
sext) will be redundant. Hence mark them as SignExtendingOpW.
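The range argument above can be checked with a small scalar model. The helper names are hypothetical, and the conversion through `int32_t` assumes two's complement behaviour.

```cpp
#include <cstdint>

// Sign-extends the low 32 bits of V to 64 bits.
int64_t signExtendLow32(int64_t V) {
  return static_cast<int64_t>(static_cast<int32_t>(static_cast<uint32_t>(V)));
}

// True when every value in [Lo, Hi] is unchanged by signExtendLow32, i.e.
// when the whole range fits in a signed 32-bit integer.
bool sextIsNoOp(int64_t Lo, int64_t Hi) {
  return Lo >= INT32_MIN && Hi <= INT32_MAX;
}
```

Both quoted ranges satisfy `sextIsNoOp`, so sign-extending a result from 32 bits leaves it unchanged.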