This change adds support for two SiFive vendor-specific interrupt attribute values in clang:
- "SiFive-CLIC-preemptible"
- "SiFive-CLIC-stack-swap"
These can be given together, and can be combined with "machine", but
cannot be combined with any other interrupt attribute values.
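For illustration, a handler using both values might be declared as below (a sketch only; it assumes the clang interrupt attribute accepts the two values together, as described above, and the handler name and body are made up):
```c++
// Hypothetical handler combining both SiFive CLIC attribute values.
__attribute__((interrupt("SiFive-CLIC-preemptible", "SiFive-CLIC-stack-swap")))
void clic_irq_handler(void) {
  // handler body
}
```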
These are handled primarily in RISCVFrameLowering:
- "SiFive-CLIC-stack-swap" entails swapping `sp` with `sf.mscratchcsw`
at function entry and exit, which holds the trap stack pointer.
- "SiFive-CLIC-preemptible" entails saving `mcause` and `mepc` before
re-enabling interrupts using `mstatus`. To save these, `s0` and `s1`
are first spilled to the stack, and then the values are read into
these registers. If these registers are used in the function, their
values will be spilled a second time onto the stack with the generic
callee-saved-register handling. At the end of the function interrupts
are disabled again before `mepc` and `mcause` are restored.
This change also adds support for the following two experimental
extensions, which only contain CSRs:
- XSfsclic - for SiFive's CLIC Supervisor-Mode CSRs
- XSfmclic - for SiFive's CLIC Machine-Mode CSRs
The latter is needed for interrupt support.
The CFI information for this implementation is not correct, but I'd
prefer to correct this in a follow-up. While it's unlikely anyone wants
to unwind through a handler, the CFI information is also used by
debuggers so it would be good to get it right.
Co-authored-by: Ana Pazos <apazos@quicinc.com>
These instructions are included in XRivosVisni. They perform a scalar
insert into a vector and a scalar extract from a vector, each with a
potentially non-zero index. They're very analogous to vmv.s.x and
vmv.x.s, respectively.
The instructions do have a couple of restrictions:
1) Only constant indices are supported, via a uimm5 format.
2) There are no FP variants.
One important property of these instructions is that their throughput
and latency are expected to be LMUL independent.
This handles combining fixed-length disjoint ors to vwadd[u].wv, as was
done for scalable vectors in #86929.
vwadd[u].vv patterns need to be handled separately in a later patch
because the extends are sunk; see #136716.
This is a reland of #99752 with the bug fixed (see test diff in the
third commit in this PR).
All `popcount` libcalls return `int`, but `ISD::CTPOP` returns the type
of the argument, which can be wider than `int`. The fix is to make the
DAG legalizer pass the correct return type to `makeLibCall` and sign-extend
the result afterwards.
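For reference, the intended semantics look roughly like the following sketch (not the legalizer code itself; the `__popcountdi2` prototype is assumed from the usual compiler-rt/libgcc helper):
```c++
#include <cstdint>

// Assumed prototype of the 64-bit popcount libcall; note it returns 'int'.
extern "C" int __popcountdi2(uint64_t x);

// The i64 CTPOP result is the sign-extended 'int' libcall result, which is
// what the fixed legalization now produces.
uint64_t ctpop_i64_via_libcall(uint64_t x) {
  int narrow = __popcountdi2(x);
  return static_cast<uint64_t>(static_cast<int64_t>(narrow));
}
```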
Original commit message:
The main change is adding CTPOP to `RuntimeLibcalls.def` to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.
Pull Request: https://github.com/llvm/llvm-project/pull/101786
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional and dereference the value directly (either
isValid() has been checked previously or the cost is assumed to be valid).
The one case that did, in AMDGPU, used value_or(), which has been replaced
by an isValid() check.
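A sketch of the resulting call-site pattern (the helper below is illustrative, not code from this change, and assumes the post-change getValue() signature):
```c++
#include "llvm/Support/InstructionCost.h"
#include <cstdint>

// Callers that previously wrote Cost.getValue().value_or(0) now guard with
// isValid() and read the value directly.
int64_t costOrZero(llvm::InstructionCost Cost) {
  return Cost.isValid() ? static_cast<int64_t>(Cost.getValue()) : 0;
}
```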
This is a continuation of 22d5890c and adds the necessary logic to
handle SEW!=64 profitably. The interesting case is needing to handle
e.g. a single m1 value which is split via extract_subvector into two
operands, and to form that back into a single m1 operation instead of
letting the vslidedown-by-vlenb/constant sequence be generated. This is
analogous to getSingleShuffleSrc for vnsrl, and we can share a bunch of
code.
If XRivosVizip is available, the ri.vzip2a and ri.vzip2b instructions
can be used to perform an interleave shuffle. This patch only affects the
intrinsic lowering (and thus scalable vectors). Fixed vectors go through
shuffle lowering, and the zip2a (but not zip2b) case is already handled
there.
If XRivosVizip is available, the ri.vunzip2a and ri.vunzip2b instructions
can be used to perform the concatenation-and-register-deinterleave
shuffle. This patch only affects the intrinsic lowering (and thus scalable
vectors, because fixed vectors go through shuffle lowering).
Note that this patch is restricted to e64 for staging purposes only. e64
is obviously profitable (i.e. we remove a vcompress). At e32 and below,
our alternative is a vnsrl instead, and we need a bit more complexity
around lowering with fractional LMUL before the ri.vunzip2a/b versions
become always profitable. I'll post the follow-up change once this
lands.
This change removes the uint64_t constructor on LocationSize,
preventing implicit conversion, and fixes up the APIs that used it to
adapt to the change. Note that I'm adding a couple of explicit conversion
points
on routines where passing in a fixed offset as an integer seems likely
to have well understood semantics.
We had an unfortunate case which arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize, which works
just fine for fixed values but loses information and fails assertions
if the TypeSize was scalable. This change breaks the first link in that
implicit conversion chain since that seemed to be the easier one.
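A sketch of what an explicit conversion point looks like at a call site (the wrapper function is illustrative):
```c++
#include "llvm/Analysis/MemoryLocation.h"
#include <cstdint>

// With the implicit uint64_t constructor removed, a fixed byte count must be
// wrapped explicitly; previously the raw integer converted silently.
llvm::MemoryLocation fixedSizeLoc(const llvm::Value *Ptr, uint64_t Bytes) {
  return llvm::MemoryLocation(Ptr, llvm::LocationSize::precise(Bytes));
}
```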
If we have a build vector which could be either a splat or a scalar
insert, prefer the scalar insert. At high LMUL, this reduces vector
register pressure (locally, the use will likely still be aligned), and
the amount of work performed for the splat.
This extends the DAG combine introduced in 336b2909 to handle the case
where the prior value is defined by a vmv.s.x instead of a vmv.v.x. If
the vrgather splats the single source element and has no passthru, we
can replace it with a vmv.v.x, which will in turn usually get folded
into a vmerge if a select follows.
There is another branch instruction that also takes an immediate operand,
but its immediate specifies which bit to test for being set or clear. We
only check whether operand2 is an immediate here, so there is no way to
distinguish between the two.
So add new CondCodes COND_CV_BEQIMM/COND_CV_BNEIMM so that we know which
kind of immediate branch instruction was matched in the Select_* pseudos.
Extend the transform introduced in 336b290 to vfmv.v.f. This is fairly
trivial and would have been in the original commit except I hadn't
written the FP tests yet.
If the vrgather.vi is preceded by a vfmv.v.f which writes a superset of
the lanes written by the vrgather, and the vrgather has no passthru, then
the vrgather has no semantic effect.
If the vrgather.vi is preceded by a vmv.v.x which writes a superset of
the lanes written by the vrgather, and the vrgather has no passthru, then
the vrgather has no semantic effect.
This is the start of a mini-series of patches around rewriting
vrgather.vi/vx preceded by vmv.v.x, vfmv.v.f, vmv.s.x, etc., starting
with the simplest, but also lowest-impact, case.
One point I'd like a second opinion on is the out-of-bounds semantic
change. As far as I can tell, all the indices are in bounds by
construction. The doc change is there as much because I couldn't figure
out how to test the alternative as for any other reason.
This can be done with a vrgather.vi/vx, and (possibly) a register move.
The alternative is to do a vrgather.vv with a full width index vector.
We'd already caught the two-operand forms of this shuffle; this patch
specifically handles the single-operand form. Unfortunately that form only
exists in the abstract; it would be nice if we canonicalized shuffles in
some way, wouldn't it?
This is a follow-up to f8ee58a3c and improves code generation for the
XRivosVizip extension.
If we have a slide pair which could be a zipeven or zipodd if the
shuffle was widened, widen the shuffle and then mask the zipeven or
zipodd.
This is basically working around a matching-order issue; we match
the slide-pair variants before trying widening. I considered whether we
should just widen slide pairs without any consideration of the zip
idioms, but the resulting codegen changes look mostly like churn and
show no clear evidence of profitability.
The element type i64 of the BUILD_VECTOR is not legal on RV32, which
means we don't catch the VID pattern once i64 has been legalized.
So try to custom lower it to VID during type legalization.
Fixes https://github.com/llvm/llvm-project/issues/134126.
The matching code was previously written as if we were mutating the
indices to replace undef elements with preferred values, but the actual
lowering code just took a prefix of the index vector. This resulted in
us using undef indices for lanes which should have been defined, and thus
in incorrect codegen.
Longer term, we probably should rewrite the mask, but this seemed like
an easier tactical fix.
InstCombine will combine this zext of an icmp, where the source has a
single bit set, into a lshr plus trunc
(`InstCombinerImpl::transformZExtICmp`):
```llvm
define <vscale x 1 x i8> @f(<vscale x 1 x i64> %x) {
%1 = and <vscale x 1 x i64> %x, splat (i64 8)
%2 = icmp ne <vscale x 1 x i64> %1, splat (i64 0)
%3 = zext <vscale x 1 x i1> %2 to <vscale x 1 x i8>
ret <vscale x 1 x i8> %3
}
```
into:
```llvm
define <vscale x 1 x i8> @reverse_zexticmp_i64(<vscale x 1 x i64> %x) {
%1 = trunc <vscale x 1 x i64> %x to <vscale x 1 x i8>
%2 = lshr <vscale x 1 x i8> %1, splat (i8 2)
%3 = and <vscale x 1 x i8> %2, splat (i8 1)
ret <vscale x 1 x i8> %3
}
```
In a loop, this ends up being unprofitable for RISC-V because the
codegen now goes from:
```asm
f: # @f
.cfi_startproc
# %bb.0:
vsetvli a0, zero, e64, m1, ta, ma
vand.vi v8, v8, 8
vmsne.vi v0, v8, 0
vsetvli zero, zero, e8, mf8, ta, ma
vmv.v.i v8, 0
vmerge.vim v8, v8, 1, v0
ret
```
To a series of narrowing vnsrl.wis:
```asm
f: # @f
.cfi_startproc
# %bb.0:
vsetvli a0, zero, e64, m1, ta, ma
vand.vi v8, v8, 8
vsetvli zero, zero, e32, mf2, ta, ma
vnsrl.wi v8, v8, 3
vsetvli zero, zero, e16, mf4, ta, ma
vnsrl.wi v8, v8, 0
vsetvli zero, zero, e8, mf8, ta, ma
vnsrl.wi v8, v8, 0
ret
```
In the original form, the vmv.v.i is loop invariant and is hoisted out,
and the vmerge.vim usually gets folded away into a masked instruction,
so you usually just end up with a vsetvli + vmsne.vi.
The truncate requires multiple instructions and introduces a vtype
toggle for each one, and is measurably slower on the BPI-F3.
This reverses the transform in RISCVISelLowering for truncations greater
than twice the bitwidth, i.e. it keeps single vnsrl.wis.
Fixes #132245
Previously we only marked fixed-length vector extracts as cheap, so this
extends it to any extract at index 0, which should just be a subreg
extract.
This allows extracts of i1 vectors, and of scalable vectors, to be
considered for DAG combines.
This causes some slight improvements with large legalized fixed-length
vectors, but the underlying motivation for this is to actually prevent
an unprofitable DAG combine on a scalable vector in an upcoming patch.
Fixes #130510.
In RISCV, modify the folding of (X ^ Y == 0) -> (X == Y) to account for
cases where the (X ^ Y) will be re-used.
If a constant is being used for the XOR before a branch, ensure that it
is small enough to fit within a 12-bit immediate field. Otherwise, the
equality check is more efficient than the check against 0; see the
following:
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
xor a0, a0, a1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
beq a0, a1, .LBB0_2
# %bb.1:
xor a0, a0, a1
ret
.LBB0_2:
```
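At the source level, the listings above correspond roughly to a pattern like this (illustrative only; 21845 is the 0x5555 constant from the assembly, which does not fit a 12-bit immediate):
```c++
// The XOR result is re-used on the fall-through path, so comparing against
// the constant directly (beq) is cheaper than folding to a beqz against the
// XOR result.
unsigned check(unsigned x) {
  unsigned y = x ^ 21845u; // needs lui+addiw to materialize
  if (y == 0)
    return 1;
  return y; // (X ^ Y) re-used here
}
```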
Similarly, if the XOR is between 1 and a one-bit (i1-sized) value, we
should still fold away the XOR, since that comparison can be optimized as
a comparison against 0.
```
# %bb.0:
slt a0, a0, a1
xor a0, a0, 1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
slt a0, a0, a1
bnez a0, .LBB0_2
# %bb.1:
xor a0, a0, 1
ret
.LBB0_2:
```
One question about my code: I used a hard-coded value for the width of a
RISC-V ALU immediate. Do you know of a way that I can gather this from
the `context`? I was unable to devise one.
For example, in the following situation:
%6:gpr = SLLI %2:gpr, 2
%7:gpr = ADDI killed %6:gpr, 24
%8:gpr = ADD %0:gpr, %7:gpr
If we swap the two add instructions, we can merge the shift and add. The
final code will look something like this:
%7 = SH2ADD %0, %2
%8 = ADDI %7, 24
lib/Target/RISCV/RISCVISelLowering.cpp:4629:26: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare]
4629 | for (unsigned i = 0; i != NumElts; ++i) {
| ~ ^ ~~~~~~~
1 error generated.
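For context, the warning and the usual shape of such a fix look like this (a sketch; `NumElts` stands in for the variable from the diagnostic, and the surrounding function is made up):
```c++
// -Wsign-compare fires when an unsigned induction variable is compared with
// a signed bound; giving both sides the same signedness silences it.
void visitElements(int NumElts) {
  for (unsigned i = 0, e = static_cast<unsigned>(NumElts); i != e; ++i) {
    // ... per-element work ...
  }
}
```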
This implements initial code generation support for a subset of the
XRivosVizip extension. Specifically, this adds support for vzipeven,
vzipodd, and vzip2a, but not vzip2b, vunzip2a, or vunzip2b. The others
will follow in separate patches.
One review note: The zipeven/zipodd matchers were recently rewritten to
better match upstream style, so careful review there would be
appreciated. The matchers don't yet support type coercion to wider
types. This will be done in a future patch.
v2048i1 is an MVT, but v2048i8 is not, so we don't support i8 vectors
with more than 1024 elements. Lowering a v2048i1 shufflevector would
require promoting to v2048i8. Since v2048i8 isn't legal and isn't an
MVT, this leads to a crash.
To fix the crash, this patch makes v2048i1 an illegal type.
When we're trying to lower `extractelement + splat` with
vrgather.vi/.vx, we should also check the legality of the source vector
type of the `extractelement`, as the entire transformation assumes legal
types.
Fixes #133020
Given this shuffle:
```
shufflevector <8 x i8> %0, <8 x i8> %1, <8 x i32> <i32 0, i32 4, i32 8, i32 12, i32 undef, i32 undef, i32 undef, i32 undef>
```
#127272 lowers it with a bunch of vnsrl. If we describe the result in
terms of the shuffle mask, we expect:
```
<0, 4, 8, 12, u, u, u, u>
```
but we actually got:
```
<0, 4, u, u, 8, 12, u, u>
```
for factors larger than 2. This is caused by `CONCAT_VECTORS` on
incorrect (sub)vector types. This patch fixes the issue by
building an aggregate vector with the correct subvector types.
Fixes #132071
The vmv.s.x instruction copies the scalar integer register to element 0
of the destination vector register. If SEW < XLEN, the least-significant
bits are copied and the upper XLEN-SEW bits are ignored.
Co-authored-by: yanming <ming.yan@terapines.com>
The prior logic was reasoning in terms of vsetivli immediates, but using
the vmv.v.x is strongly profitable for high LMUL cases. The key
difference is that the vmv.v.x form is rematerializable during
register allocation, and the vsle form is not.
This change uses the vlmax form of the vsetvli for all cases where the
2 x size can't be encoded as a vsetivli. This has the effect of increasing
VL more than necessary across the vmv.v.x, which could in theory be
problematic performance-wise on some hardware. We can revisit (or
add a tune flag) if this turns out to be noteworthy.
This change adds support for `qci-nest` and `qci-nonest` interrupt
attribute values. Both of these are machine-mode interrupts, which use
instructions in Xqciint to push and pop the A- and T-registers (and a
few others) to and from the stack.
In particular:
- `qci-nonest` uses `qc.c.mienter` to save registers at the start of the
function, and uses `qc.c.mileaveret` to restore those registers and
return from the interrupt.
- `qci-nest` uses `qc.c.mienter.nest` to save registers at the start of
the function, and uses `qc.c.mileaveret` to restore those registers and
return from the interrupt.
- `qc.c.mienter` and `qc.c.mienter.nest` both push registers ra, s0
(fp), t0-t6, and a0-a7 onto the stack (as well as some CSRs for the
interrupt context). The difference between these is that
`qc.c.mienter.nest` re-enables M-mode interrupts.
- `qc.c.mileaveret` will restore the registers that were saved by
`qc.c.mienter(.nest)`, and return from the interrupt.
These work for both standard M-mode interrupts and the non-maskable
interrupt CSRs added by Xqciint.
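For illustration, a handler using one of these values might be declared as follows (the function name and body are made up):
```c++
// Machine-mode handler: clang saves/restores state with qc.c.mienter.nest and
// qc.c.mileaveret when Xqciint is enabled; use "qci-nonest" to avoid
// re-enabling interrupts.
__attribute__((interrupt("qci-nest")))
void external_irq_handler(void) {
  // handler body
}
```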
The `qc.c.mienter`, `qc.c.mienter.nest` and `qc.c.mileaveret`
instructions are compatible with push and pop instructions, in as much
as they (mostly) only spill the A- and T-registers, so we can use the
`Zcmp` or `Xqccmp` instructions to spill the S-registers. This
combination (`qci-(no)nest` and `Xqccmp`/`Zcmp`) is not implemented in
this change.
The `qc.c.mienter(.nest)` instructions have a specific register storage
order so that, if frame pointers are enabled, the frame-pointer
convention's linked list is preserved past the current interrupt handler
and into the interrupted code and its frames.
Co-authored-by: Pankaj Gode <quic_pgode@quicinc.com>
If we have the shuffle mask <1, u, u, u, 2, u, u, u> with factor 4, we
should have the shuffle mask <1, 2> for lane 0 and <u, u> for lane 1,
and so on. Since we use createSequentialMask to create the shuffle mask,
the shuffle mask for lane 1 would be <u, 0> (derived from <u, u+1>). This
leads to poor code generation.
These were left over from when Craig removed
`__attribute__((interrupt("user")))` support in
05d0caef6081e1a6cb23a5a5afe43dc82e8ca558.
The tests change "interrupt"="user" into "interrupt"="machine" as they
are still intended to be interrupt tests. ISelLowering will now reject
"interrupt"="user". The docs no longer mention "user" as a possible
interrupt attribute argument.
This patch combines (iN vector.reduce.add (zext vXi1 A to vXiN)) into a
vcpop.m instruction (similarly to the bitcast + ctpop pattern). It can be
useful for counting the number of set bits in scalable vector types,
which can't be expressed with bitcast + ctpop (this was previously
discussed here: https://github.com/llvm/llvm-project/pull/74294).
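A scalar reference for why the combine is sound (illustrative; not the DAG combine itself):
```c++
#include <cstddef>
#include <cstdint>

// Summing the zero-extended elements of an i1 vector equals the number of set
// bits in the mask, which is exactly what vcpop.m computes.
uint64_t reduceAddOfZextI1(const bool *Mask, size_t N) {
  uint64_t Sum = 0;
  for (size_t I = 0; I != N; ++I)
    Sum += Mask[I] ? 1u : 0u; // zext i1 -> iN, then add-reduce
  return Sum;
}
```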
This patch adds a function attribute `riscv_vls_cc` for the RISC-V VLS
calling convention, which takes zero or one argument. The argument is the
`ABI_VLEN`, which is the `VLEN` used for passing the fixed-vector
arguments: each fixed-vector argument is wrapped as a scalable (VLA)
vector using the `ABI_VLEN` and handled with the corresponding mechanism.
The range of `ABI_VLEN` is [32, 65536]; if not specified, the default
value is 128.
Here is an example of VLS argument passing:
Non-VLS call:
```
void original_call(__attribute__((vector_size(16))) int arg) {}
=>
define void @original_call(i128 noundef %arg) {
entry:
...
ret void
}
```
VLS call:
```
void __attribute__((riscv_vls_cc(256))) vls_call(__attribute__((vector_size(16))) int arg) {}
=>
define riscv_vls_cc void @vls_call(<vscale x 1 x i32> %arg) {
entry:
...
ret void
}
```
The first, non-VLS call passes the generic 16-byte vector argument as a
flattened integer.
In contrast, the VLS call uses `ABI_VLEN=256`, which wraps the vector to
<vscale x 1 x i32>, where the number of scalable vector elements is
calculated by: `ORIG_ELTS * RVV_BITS_PER_BLOCK / ABI_VLEN`.
Note: ORIG_ELTS = Vector Size / Type Size = 128 / 32 = 4.
PsABI PR: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/418
C-API PR: https://github.com/riscv-non-isa/riscv-c-api-doc/pull/68
With a fix for fully undef masks. These can't reach the lowering code, but
can reach the costing code via e.g. SLP.
This change adds the TTI costing corresponding to the recently added
isMaskedSlidePair lowering for vector shuffles. However, since the
existing costing code hadn't covered slideup, slidedown, or the
(now removed) isElementRotate, the impact is larger in scope than just
that new lowering.
---------
Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
Co-authored-by: Luke Lau <luke_lau@icloud.com>