Now that we have matching for vqdot in its basic variants, we can
extend the matcher to handle reduction trees instead of individual
reductions. This is important as we canonicalize reductions by
performing a tree in the vector domain before the root reduction
instruction.
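To illustrate, the canonical form being matched looks roughly like the
following at the IR level (a sketch with arbitrary types, not a test
taken from the patch):
```llvm
; The two partial products are combined with a vector add, and only the
; root of the tree is reduced to a scalar.
declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

define i32 @dot_tree(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {
  %a.ext = sext <16 x i8> %a to <16 x i32>
  %b.ext = sext <16 x i8> %b to <16 x i32>
  %mul0 = mul <16 x i32> %a.ext, %b.ext
  %c.ext = sext <16 x i8> %c to <16 x i32>
  %d.ext = sext <16 x i8> %d to <16 x i32>
  %mul1 = mul <16 x i32> %c.ext, %d.ext
  %tree = add <16 x i32> %mul0, %mul1
  %sum = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %tree)
  ret i32 %sum
}
```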
The particular approach taken here has the unfortunate implication that
non-matches visit the entire reduction tree once for each time the
reduction root is visited in the DAG. While conceptually problematic for
compile time, this is probably fine in practice as we should only visit
the root once per pass of DAGCombine. I don't really see a better
solution - suggestions welcome.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
As with the recently added subvector variants, provide the unsigned
index operand to simplify a bunch of code.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
Note that this change is possibly not NFC. The prior routines used
getConstant with XLenVT. The new wrappers will use getVectorIdxConstant
instead. Digging through the code, the type used for the index will be
the pointer-width integer type from the DataLayout. For typical RV32 and
RV64 configurations the pointer will be of equal width to XLEN, but you
could have a 32-bit pointer on an RV64 machine.
Follow up to 6e654caab, use the new routines in more places. Note that
I've excluded from this patch any case which uses a getConstant index
instead of a getVectorIdxConstant index just to minimize room for
error. I'll get those in a separate follow up.
RISCVVectorPeepholePass would replace instructions that have an all-ones
mask with their unmasked variants, so there isn't really a point in
keeping separate versions of the intrinsics.
Note that `riscv.segN.load/store.mask` does not take pointer type (i.e.
address space) as part of its overloading type signature, because RISC-V
doesn't really use address spaces other than the default one.
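As a sketch of the resulting signature (the exact mangling below is
assumed for illustration, not copied from the patch):
```llvm
; hypothetical mangling: the overload suffix covers the vector and VL
; types but, per the note above, not the pointer's address space
declare { <8 x i32>, <8 x i32> }
  @llvm.riscv.seg2.load.mask.v8i32.i64(ptr, <8 x i1>, i64)
```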
Mechanical change to introduce the new wrappers, and add enough users to
make the usage pattern clear. Once this lands, I'm going to do a further
pass to adjust more callsites as separate changes.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
Teach InterleavedAccessPass to recognize vp.load + shufflevector and
shufflevector + vp.store, though this patch only adds the RISC-V support
to actually lower this pattern. The vp.load/vp.store in this pattern
require a constant mask.
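For reference, a rough sketch of the load-side pattern at the IR level
(names and types here are illustrative, not from the patch):
```llvm
declare <8 x i32> @llvm.vp.load.v8i32.p0(ptr, <8 x i1>, i32)

define <4 x i32> @load_evens(ptr %p, i32 %evl) {
  ; the mask is a constant (all-true here), which is what makes this
  ; vp.load + shufflevector pair eligible for the pass
  %wide = call <8 x i32> @llvm.vp.load.v8i32.p0(ptr %p, <8 x i1> splat (i1 true), i32 %evl)
  %evens = shufflevector <8 x i32> %wide, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i32> %evens
}
```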
This patch adds pattern matching for the basic usages of the dot product
instructions introduced by the experimental zvqdotq extension. It
specifically only handles the case where the pattern is feeding an i32
sum reduction, as we need to reassociate the reduction tree to use these
instructions.
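At the IR level, the basic shape being matched looks roughly like this
(a sketch; the types are illustrative):
```llvm
declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

; a sext/sext multiply feeding an i32 sum reduction
define i32 @dot(<16 x i8> %a, <16 x i8> %b) {
  %a.ext = sext <16 x i8> %a to <16 x i32>
  %b.ext = sext <16 x i8> %b to <16 x i32>
  %mul = mul <16 x i32> %a.ext, %b.ext
  %sum = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %mul)
  ret i32 %sum
}
```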
The vecreduce_add (sext) and vecreduce_add (zext) cases are included
mostly to exercise the VX matchers. For the generic matching, we fail
to match due to a combine ordering issue which results in the bitcast
being separated from the splat.
I chose to do this lowering as an early combine so as to avoid having to
integrate the entire logic into the reduction lowering flow. In
particular, that would get a lot more complicated as we extend this to
handle add-trees feeding the reductions.
This implements the result of the discussion at:
https://discourse.llvm.org/t/rfc-report-fatal-error-and-the-default-value-of-gencrashdialog/73587
There are two different use cases for report_fatal_error, so replace it
with two functions reportFatalInternalError() and
reportFatalUsageError(). The former indicates a bug in LLVM and
generates a crash dialog. The latter does not. The names were suggested
by rnk and people seemed to like them.
This replaces a lot of the usages that passed an explicit value for
GenCrashDiag. I did not bulk replace the remaining report_fatal_error
usages -- they probably require case-by-case review for which function
to use.
Extends changes from
[ff687af](ff687af04f).
Fixes https://github.com/llvm/llvm-project/issues/131476.
This patch adds a DAG combine to replace an `AND` of an `ATOMIC_LOAD`
with a full-bit mask (e.g. `0xFF`, `0xFFFF`, etc.) which is generated as
a result of `(zext (atomic_load))`, by a zero-extended load, provided
the atomic operation is monotonic or weaker.
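For illustration, IR along these lines produces the
`(zext (atomic_load))` in question (a sketch):
```llvm
define i32 @load_byte(ptr %p) {
  ; monotonic (or weaker) ordering is required for the fold
  %v = load atomic i8, ptr %p monotonic, align 1
  %z = zext i8 %v to i32
  ret i32 %z
}
```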
This change adds support for two SiFive vendor attributes in clang:
- "SiFive-CLIC-preemptible"
- "SiFive-CLIC-stack-swap"
These can be given together, and can be combined with "machine", but
cannot be combined with any other interrupt attribute values.
These are handled primarily in RISCVFrameLowering:
- "SiFive-CLIC-stack-swap" entails swapping `sp` with `sf.mscratchcsw`
at function entry and exit, which holds the trap stack pointer.
- "SiFive-CLIC-preemptible" entails saving `mcause` and `mepc` before
re-enabling interrupts using `mstatus`. To save these, `s0` and `s1`
are first spilled to the stack, and then the values are read into
these registers. If these registers are used in the function, their
values will be spilled a second time onto the stack with the generic
callee-saved-register handling. At the end of the function interrupts
are disabled again before `mepc` and `mcause` are restored.
This change also adds support for the following two experimental
extensions, which only contain CSRs:
- XSfsclic - for SiFive's CLIC Supervisor-Mode CSRs
- XSfmclic - for SiFive's CLIC Machine-Mode CSRs
The latter is needed for interrupt support.
The CFI information for this implementation is not correct, but I'd
prefer to correct this in a follow-up. While it's unlikely anyone wants
to unwind through a handler, the CFI information is also used by
debuggers so it would be good to get it right.
Co-authored-by: Ana Pazos <apazos@quicinc.com>
These instructions are included in XRivosVisni. They perform a scalar
insert into a vector (with a potentially non-zero index) and a scalar
extract from a vector (with a potentially non-zero index) respectively.
They're very analogous to vmv.s.x and vmv.x.s respectively.
The instructions do have a couple restrictions:
1) Only constant indices are supported, with a uimm5 format.
2) There are no FP variants.
One important property of these instructions is that their throughput
and latency are expected to be LMUL independent.
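For illustration, IR like the following sketch is a natural candidate
for the scalar-extract instruction, given the constant-index
restriction:
```llvm
define i32 @extract(<vscale x 4 x i32> %v) {
  ; constant index, within the uimm5 range
  %e = extractelement <vscale x 4 x i32> %v, i64 3
  ret i32 %e
}
```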
This handles combining fixed-length disjoint ors to vwadd[u].wv, as was
done for scalable vectors in #86929.
vwadd[u].vv patterns need to be handled with a separate pattern in a
follow-up patch due to the extends being sunk; see #136716.
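A sketch of the fixed-length input at the IR level (illustrative types;
the `disjoint` flag is what lets the or be treated as an add):
```llvm
define <4 x i64> @f(<4 x i64> %x, <4 x i32> %y) {
  %y.ext = zext <4 x i32> %y to <4 x i64>
  ; disjoint: no bit is set in both operands, so this or is equivalent
  ; to an add and can be selected as vwaddu.wv
  %or = or disjoint <4 x i64> %x, %y.ext
  ret <4 x i64> %or
}
```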
This is a reland of #99752 with the bug fixed (see test diff in the
third commit in this PR).
All `popcount` libcalls return `int`, but `ISD::CTPOP` returns the type
of the argument, which can be wider than `int`. The fix is to make the
DAG legalizer pass the correct return type to `makeLibCall` and
sign-extend the result afterwards.
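For example (a sketch, assuming a target that uses the LibCall action
for CTPOP), the i64 result below comes back from a `popcount` libcall
returning `int`, so it must be sign-extended to i64 rather than used
directly:
```llvm
declare i64 @llvm.ctpop.i64(i64)

define i64 @pop(i64 %x) {
  %c = call i64 @llvm.ctpop.i64(i64 %x)
  ret i64 %c
}
```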
Original commit message:
The main change is adding CTPOP to `RuntimeLibcalls.def` to allow
targets to use LibCall action for CTPOP. DAG legalizers are changed
accordingly.
Pull Request: https://github.com/llvm/llvm-project/pull/101786
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that does, in AMDGPU, used value_or, which has been
replaced by an isValid() check.
This is a continuation from 22d5890c and adds the necessary logic to
handle SEW!=64 profitably. The interesting case is needing to handle
e.g. a single m1 which is split via extract_subvector into two operands,
and forming that back into a single m1 operation - instead of letting the
vslidedown by vlenb/Constant sequence be generated. This is analogous to
the getSingleShuffleSrc for vnsrl, and we can share a bunch of code.
If XRivosVizip is available, the ri.vzip2a and ri.vzip2b instructions
can be used to perform an interleave shuffle. This patch only affects
the intrinsic lowering (and thus scalable vectors). Fixed vectors go
through shuffle lowering, and the zip2a (but not zip2b) case is already
handled there.
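For example, intrinsic IR along these lines (a sketch) can now use the
zip instructions:
```llvm
declare <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32>, <vscale x 2 x i32>)

define <vscale x 4 x i32> @interleave(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b) {
  %r = call <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b)
  ret <vscale x 4 x i32> %r
}
```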
If XRivosVizip is available, the ri.vunzip2a and ri.vunzip2b
instructions can be used to perform the concatenation and register
deinterleave shuffle. This patch only affects the intrinsic lowering
(and thus scalable vectors, because fixed vectors go through shuffle
lowering).
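For example, intrinsic IR along these lines (a sketch, using e64 to
match the staging restriction noted below) is what this lowering now
handles:
```llvm
declare { <vscale x 2 x i64>, <vscale x 2 x i64> } @llvm.vector.deinterleave2.nxv4i64(<vscale x 4 x i64>)

define { <vscale x 2 x i64>, <vscale x 2 x i64> } @deinterleave(<vscale x 4 x i64> %v) {
  %r = call { <vscale x 2 x i64>, <vscale x 2 x i64> } @llvm.vector.deinterleave2.nxv4i64(<vscale x 4 x i64> %v)
  ret { <vscale x 2 x i64>, <vscale x 2 x i64> } %r
}
```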
Note that this patch is restricted to e64 for staging purposes only. e64
is obviously profitable (i.e. we remove a vcompress). At e32 and below,
our alternative is a vnsrl instead, and we need a bit more complexity
around lowering with fractional LMUL before the ri.vunzip2a/b versions
become always profitable. I'll post the follow-up change once this
lands.
This change removes the uint64_t constructor on LocationSize
preventing implicit conversion, and fixes up the APIs using it to adapt to
the change. Note that I'm adding a couple of explicit conversion points
on routines where passing in a fixed offset as an integer seems likely
to have well understood semantics.
We had an unfortunate case which arose if you tried to pass a TypeSize
value to a parameter of LocationSize type. We'd find the implicit
conversion path through TypeSize -> uint64_t -> LocationSize which works
just fine for fixed values, but loses information and fails assertions
if the TypeSize was scalable. This change breaks the first link in that
implicit conversion chain since that seemed to be the easier one.
If we have a build vector which could be either a splat or a scalar
insert, prefer the scalar insert. At high LMUL, this reduces both the
vector register pressure (locally, the use will likely still be aligned)
and the amount of work performed for the splat.
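A sketch of the ambiguous case: in the IR below only lane 0 is defined,
so the build_vector can be lowered either as a splat (vmv.v.x) or as a
scalar insert (vmv.s.x); we now prefer the latter:
```llvm
define <8 x i64> @only_lane0(i64 %x) {
  ; lanes 1..7 are poison, so either lowering is valid
  %v = insertelement <8 x i64> poison, i64 %x, i64 0
  ret <8 x i64> %v
}
```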
This extends the DAG combine introduced in 336b2909 to handle the case
where the prior value is defined by a vmv.s.x instead of a vmv.v.x. If
the vrgather splats the single source element and has no passthru, we
can replace it with a vmv.v.x - which will in turn usually get folded
into a vmerge if a select follows.
There is another branch instruction that also takes an immediate
operand, but there the immediate specifies which bit to test for being
set or clear. We only check whether operand 2 is an immediate here, so
there is no way to distinguish between the two.
So add new CondCodes COND_CV_BEQIMM/COND_CV_BNEIMM so that we know which
kind of immediate branch instruction was matched in the Select_* pseudos.
Extend the transform introduced in 336b290 to vfmv.v.f. This is fairly
trivial and would have been in the original commit except I hadn't
written the FP tests yet.
If the vrgather.vi is preceded by a vfmv.v.f which writes a superset of
the lanes written by the vrgather, and the vrgather has no passthru, then
the vrgather has no semantic effect.
If the vrgather.vi is preceded by a vmv.v.x which writes a superset of
the lanes written by the vrgather, and the vrgather has no passthru, then
the vrgather has no semantic effect.
This is the start of a mini-series of patches around rewriting
vrgather.vi/vx preceded by vmv.v.x, vfmv.v.f, vmv.s.x, etc., starting
with the simplest, but also lowest-impact, case.
One point I'd like a second opinion on is the out-of-bounds semantic
change. As far as I can tell, all the indices are in bounds by
construction. The doc change is as much because I couldn't figure out
how to test the alternative as anything else.
This can be done with a vrgather.vi/vx, and (possibly) a register move.
The alternative is to do a vrgather.vv with a full width index vector.
We'd already caught the two-operand form of this shuffle; this patch
specifically handles the single-operand form.
Unfortunately that is only true in the abstract; it would be nice if we
canonicalized shuffles in some way, wouldn't it?
This is a follow up to f8ee58a3c, and improves code generation for the
XRivosVizip extension.
If we have a slide pair which could be a zipeven or zipodd if the
shuffle was widened, widen the shuffle and then mask the zipeven or
zipodd.
This is basically working around an order-of-matching issue; we match
the slide pair variants before trying widening. I considered whether we
should just widen slide pairs without any consideration of the zip
idioms, but the resulting codegen changes look mostly like churn, and
have no clear evidence of profitability.
The element type i64 of the BUILD_VECTOR is not legal on RV32, so we
don't catch the VID pattern once it has been legalized for i64. Instead,
try to custom lower it to VID during type legalization.
Fixes https://github.com/llvm/llvm-project/issues/134126.
The matching code was previously written as if we were mutating the
indices to replace undef elements with preferred values, but the actual
lowering code just took a prefix of the index vector. This resulted in
us using undef indices for lanes which should have been defined,
resulting in incorrect codegen.
Longer term, we probably should rewrite the mask, but this seemed like
an easier tactical fix.
InstCombine will combine this zext of an icmp where the source has a
single bit set into an lshr plus trunc
(`InstCombinerImpl::transformZExtICmp`):
```llvm
define <vscale x 1 x i8> @f(<vscale x 1 x i64> %x) {
%1 = and <vscale x 1 x i64> %x, splat (i64 8)
%2 = icmp ne <vscale x 1 x i64> %1, splat (i64 0)
%3 = zext <vscale x 1 x i1> %2 to <vscale x 1 x i8>
ret <vscale x 1 x i8> %3
}
```

into:
```llvm
define <vscale x 1 x i8> @reverse_zexticmp_i64(<vscale x 1 x i64> %x) {
%1 = trunc <vscale x 1 x i64> %x to <vscale x 1 x i8>
%2 = lshr <vscale x 1 x i8> %1, splat (i8 2)
%3 = and <vscale x 1 x i8> %2, splat (i8 1)
ret <vscale x 1 x i8> %3
}
```
In a loop, this ends up being unprofitable for RISC-V because the
codegen now goes from:
```asm
f: # @f
.cfi_startproc
# %bb.0:
vsetvli a0, zero, e64, m1, ta, ma
vand.vi v8, v8, 8
vmsne.vi v0, v8, 0
vsetvli zero, zero, e8, mf8, ta, ma
vmv.v.i v8, 0
vmerge.vim v8, v8, 1, v0
ret
```
To a series of narrowing vnsrl.wis:
```asm
f: # @f
.cfi_startproc
# %bb.0:
vsetvli a0, zero, e64, m1, ta, ma
vand.vi v8, v8, 8
vsetvli zero, zero, e32, mf2, ta, ma
vnsrl.wi v8, v8, 3
vsetvli zero, zero, e16, mf4, ta, ma
vnsrl.wi v8, v8, 0
vsetvli zero, zero, e8, mf8, ta, ma
vnsrl.wi v8, v8, 0
ret
```
In the original form, the vmv.v.i is loop invariant and is hoisted out,
and the vmerge.vim usually gets folded away into a masked instruction,
so you usually just end up with a vsetvli + vmsne.vi.
The truncate requires multiple instructions and introduces a vtype
toggle for each one, and is measurably slower on the BPI-F3.
This reverses the transform in RISCVISelLowering for truncations
greater than twice the bitwidth, i.e. it keeps single vnsrl.wis.
Fixes #132245
Previously we only marked fixed-length vector extracts as cheap, so this
extends it to any extract at index 0, which should just be a subreg
extract.
This allows extracts of i1 vectors, and scalable vectors generally, to
be considered for DAG combines.
This causes some slight improvements with large legalized fixed-length
vectors, but the underlying motivation for this is to actually prevent
an unprofitable DAG combine on a scalable vector in an upcoming patch.
Fixes #130510.
In RISCV, modify the folding of (X ^ Y == 0) -> (X == Y) to account for
cases where the (X ^ Y) will be re-used.
If a constant is being used for the XOR before a branch, ensure that it
is small enough to fit within a 12-bit immediate field. Otherwise, the
equality check is more efficient than the check against 0; see the
following:
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
xor a0, a0, a1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
lui a1, 5
addiw a1, a1, 1365
beq a0, a1, .LBB0_2
# %bb.1:
xor a0, a0, a1
ret
.LBB0_2:
```
Similarly, if the XOR is between 1 and a size-one integer, we should
still fold away the XOR since that comparison can be optimized as a
comparison against 0.
```
# %bb.0:
slt a0, a0, a1
xor a0, a0, 1
beqz a0, .LBB0_2
# %bb.1:
ret
.LBB0_2:
```
```
# %bb.0:
slt a0, a0, a1
bnez a0, .LBB0_2
# %bb.1:
xor a0, a0, 1
ret
.LBB0_2:
```
One question about my code: I used a hard-coded value for the width of
a RISCV ALU immediate. Do you know of a way I can obtain this from the
`context`? I was unable to devise one.
For example, for the following situation:
```
%6:gpr = SLLI %2:gpr, 2
%7:gpr = ADDI killed %6:gpr, 24
%8:gpr = ADD %0:gpr, %7:gpr
```
If we swap the two add instructions we can merge the shift and add. The
final code will look something like this:
```
%7 = SH2ADD %0, %2
%8 = ADDI %7, 24
```