This takes inspiration from AArch64 which does the same thing to assist
with zip/trn/etc.. Doing this recursion unconditionally when the mask
allows is slightly questionable, but seems to work out okay in practice.
As a bit of context, it's helpful to realize that we have existing logic
in both DAGCombine and InstCombine which mutates the element width of in
an analogous manner. However, that code has two restriction which
prevent it from handling the motivating cases here. First, it only
triggers if there is a bitcast involving a different element type.
Second, the matcher used considers a partially undef wide element to be
a non-match. I considered trying to relax those assumptions, but the
information loss for undef in mid-level opt seemed more likely to open a
can of worms than I wanted.
Every call should have regmask operand to indicate what registers are
preserved or clobbered by the call. VirtRegRewriter uses this to tell
MachineRegisterInfo what registers are clobbered by a function. If the
mask isn't present the registers potentially clobbered by a tail called
function aren't counted. I have checked ARM, AArch64, and X86 and they
all have a regmask operand on their tail calls.
I believe this fixes an issue I'm seeing with IPRA.
This fixes a regression from #101294 by checking if we might be
clobbering a sh{1,2,3}add pattern.
Only do this is the underlying add isn't going to be folded away into an
address offset.
This reverts commit b8952d4b1b0c73bf39d6440ad3166a088ced563f.
spec x264 fails to build in all VLS configurations, with the assertion
failure: clang: ../llvm-project/llvm/lib/Target/RISCV/RISCVISelLowering.cpp:5246: llvm::SDValue lowerShuffleViaVRegSplitting(llvm::ShuffleVectorSDNode*, llvm::SelectionDAG&, const llvm::RISCVSubtarget&): Assertion `RegCnt == NumOfDestRegs && "Whole vector must be processed"' failed.
I can reduce a failing piece of IR, but the failure appears pretty
broad, so I suspect any reasonable vls build will hit it.
Patch adds usage of processShuffleMasks in in codegen
in lowerShuffleViaVRegSplitting. This function is already used for X86
shuffles estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE
functions, unifies the code.
Reviewers: preames, topperc, lukel97, wangpc-pp
Reviewed By: wangpc-pp
Pull Request: https://github.com/llvm/llvm-project/pull/120803
With this change, targets are no longer required to put memory / strict-fp opcodes after special
`ISD::FIRST_TARGET_MEMORY_OPCODE`/`ISD::FIRST_TARGET_STRICTFP_OPCODE` markers.
This will also allow autogenerating `isTargetMemoryOpcode`/`isTargetStrictFPOpcode (#119709).
Pull Request: https://github.com/llvm/llvm-project/pull/119969
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.
We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.
Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long beeen annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
The default legalization uses vmslt with a vector of XLen to compute a
mask. This doesn't work if the type isn't legal. For fixed vectors it
will scalarize. For scalable vectors it crashes the compiler.
This patch uses an alternate strategy that promotes the i1 vector to an
i8 vector and does the merge. I don't claim this to be the best
lowering. I wrote it quickly almost 3 years ago when a crash was
reported in our downstream.
Fixes#120405.
This is a follow up to 2af2634 to use vmv.v.x of i8 constants instead of
the prior vid/vand/vmsne sequence. The advantage of the vmv.v.x sequence
is that it's always m1 (so cheaper at high LMUL), and can be
rematerialized by the register allocator if needed to locally reduce
register pressure.
Lowering to load-acquire/store-release for RISCV Zalasr.
Currently uses the psABI lowerings for WMO load-acquire/store-release
(which are identical to A.7). These are incompatable with the A.6
lowerings currently used by LLVM. This should be OK for now since Zalasr
is behind the enable experimental extensions flag, but needs to be fixed
before it is removed from that.
For TSO, it uses the standard Ztso mappings except for lowering seq_cst
loads/store to load-acquire/store-release, I had Andrea review that.
In x264, there's a few kernels with shuffles like this:
%41 = add nsw <16 x i32> %39, %40
%42 = sub nsw <16 x i32> %39, %40
%43 = shufflevector <16 x i32> %41, <16 x i32> %42, <16 x i32> <i32 11,
i32 15, i32 7, i32 3, i32 26, i32 30, i32 22, i32 18, i32 9, i32 13, i32
5, i32 1, i32 24, i32 28, i32 20, i32 16>
Because this is a complex two-source shuffle, this will get lowered as
two vrgather.vvs that are blended together.
vadd.vv v20, v16, v12
vsub.vv v12, v16, v12
vrgatherei16.vv v24, v20, v10
vrgatherei16.vv v24, v12, v16, v0.t
However the indices coming from each source are disjoint, so we can
blend the two together and perform a single source shuffle instead:
%41 = add nsw <16 x i32> %39, %40
%42 = sub nsw <16 x i32> %39, %40
%43 = select <0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1> %41, %42
%44 = shufflevector <16 x i32> %43, <16 x i32> poison, <16 x i32> <i32
11, i32 15, i32 7, i32 3, i32 10, i32 14, i32 6, i32 2, i32 9, i32 13,
i32 5, i32 1, i32 8, i32 12, i32 4, i32 0>
The select will likely get merged into the preceding instruction, and
then we only have to do one vrgather.vv:
vadd.vv v20, v16, v12
vsub.vv v20, v16, v12, v0.t
vrgatherei16.vv v24, v20, v10
This patch bails if either of the sources are a broadcast/splat/identity
shuffle, since that will usually already have some sort of cheaper
lowering.
This improves performance on 525.x264_r by 4.12% with -O3 -flto
-march=rva22u64_v on the spacemit-x60.
Enable `-fstack-clash-protection` for RISCV and stack probe for function
prologues.
We probe the stack by creating a loop that allocates and probe the stack
in ProbeSize chunks.
We emit an unrolled probe loop for small allocations and emit a variable
length probe loop for bigger ones.
[Reverts d57892a2a153ab71a796f07e39d939eae6910c21]
For IR like this:
%icmp = icmp ult <4 x i32> %a, splat (i32 5)
%res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such add, sub, etc. and convert
this into
%ext = extractelement <4 x i32> %a, i32 1
%res = icmp ult i32 %ext, 5
For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
---------
Co-authored-by: Paul Walker <paul.walker@arm.com>
If we're performing a segment store and all but one of the segments are
undefined, that's equivalent to performing a strided store of the one
active segment.
This is the store side of a905203b. As before, this only covers fixed
vectors.
This fixes
https://discourse.llvm.org/t/fixed-register-being-spill-and-restored-in-clang/83058.
We need to do it in `MachineRegisterInfo::getCalleeSavedRegs` instead of
`RISCVRegisterInfo::getCalleeSavedRegs` since the MF argument of
`TargetRegisterInfo:::getCalleeSavedRegs` is `const`, so we can't call
`MF->getRegInfo().disableCalleeSavedRegister` there.
So to put it in `MachineRegisterInfo::getCalleeSavedRegs`, we move
`isRegisterReservedByUser` into `TargetSubtargetInfo`.
For a spread with an element type small enough, we can use a zext and
shift to perform the shuffle. For e8, this covers spread(2,4,8), and for
e16 covers spread(2,4). Note that spread(2) is already covered by the
existing interleave logic, and is simply listed for completeness in the
prior description.
This is a prep patch for improving spread(4,8) shuffles. I also think it
improves the readability of the existing code, but the primary
motivation is simply staging work.
This patch updates the matchSplatAsGather function so we can handle vectors of different sizes. The goal is to improve the code gen for @llvm.experimental.vector.match on RISCV.
Currently, we use a scalar extract and splat instead of vrgather, and the patch changes that.
A spread(2) shuffle is just a interleave with an undef lane. The
existing lowering was reusing the even lane for the undef value. This
was entirely legal, but non-optimal.
For IR like this:
%icmp = icmp ult <4 x i32> %a, splat (i32 5)
%res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such add, sub, etc. and convert
this into
%ext = extractelement <4 x i32> %a, i32 1
%res = icmp ult i32 %ext, 5
For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
I want to use this function for GISel too so Type * is a better common
interface. All of the callers already convert EVT to Type * as needed
by calling lowering anyway.
It doesn't make sense to add a new generic ISD to handle riscv tuple
type. Instead we use `SPLAT_VECTOR` for ISD and further lower to
`VMV_V_X`.
Note: If there's `visitSPLAT_VECTOR` in generic DAG combiner, it needs
to skip riscv vector tuple type.
Stack on https://github.com/llvm/llvm-project/pull/114329
We can extend the existing SHL+TRUNC lowering used for deinterleave2 for
deinterleave4, and deinterleave8 when the result types are small enough
to allow the shift to be legal. On RV64, this means i8 and i16 results
for deinterleave4 and i8 results for deinterleave8.
This is analogous to febbf91 which added shuffle lowering using
vcompress; we can do the same thing in the deinterleave2 lowering path
which is used for scalable vectors.
Note that we can further improve this for high lmul usage by adjusting
how we materialize the mask (whose result is at most m1 with a known bit
pattern). I am deliberately staging the work so that the changes to
reduce register pressure are more easily evaluated on their own merit.
Instead of directly lowering to vnsrl_vl and having custom pattern
matching for that case, we can just lower to a (legal) shift and
truncate, and let generic pattern matching produce the vnsrl.
The major motivation for this is that I'm going to reuse this logic to
handle e.g. deinterleave4 w/ i8 result.
The test changes aren't particularly interesting. They're minor code
improvements - I think because we do slightly better with the
insert_subvector patterns, but that's mostly irrelevant.
This change matches a subset of vcompress patterns during shuffle
lowering. The subset implemented requires a contiguous prefix of
demanded elements followed by undefs. This subset was chosen for two
reasons: 1) which elements to spurious demand is a non-obvious problem,
and 2) my first several attempts at implementing the general case were
buggy. I decided to go with the simple case to start with.
vcompress scales better with LMUL than a general vrgather, and at least
the SpaceMit X60, has higher throughput even at m1. It also has the
advantage of requiring smaller vector constants at one bit per element
as opposed to vrgather which is a minimum of 8 bits per element. The
downside to using vcompress is that we can't fold a vselect into it, as
there is no masked vcompress variant.
For reference, here are the relevant throughputs from camel-cdr's data
table on BP3 (X60):
vrgather.vv v8,v16,v24 4.0 16.0 64.0 256.0
vcompress.vm v8,v16,v24 3.0 10.0 36.0 136.
vmerge.vvm v8,v16,v24,v0 2.0 4.0 8.0 16.0
The largest concern with the extra vmerge is that we locally increase
register pressure. If we do have masking, we also have a passthru,
without the ability to fold that into the vcompress, we need to keep it
alive a bit longer. This can hurt at e.g. m8 where we have very few
architectural registers. As compared with the vrgather.vv sequence, this
is only one additional m1 VREG - since we no longer need the index
vector. It compares slightly worse against vrgatherie16.vv which can use
index vectors smaller than other operands. Note that we could
potentially fold the vmerge if only tail elements are being preserved; I
haven't investigated this.
It is unfortunately hard given our current lowering structure to know if
we're emitting a shuffle where masking will follow. Thankfully, it
doesn't seem to show up much in practice, so I think we can probably
ignore it.
This patch only handles single source compress idioms at the moment.
This is an effort to avoid interacting with other patches on review for
changing how we canonicalize length changing shuffles.
A special case in type legalization wasn't accounting for different
operand numbering between FLDEXP and STRICT_FLDEXP.
AArch64 already asked STRICT_FLDEXP to be promoted, but had no test for
it.
For IR like this:
%icmp = icmp ult <4 x i32> %a, splat (i32 5)
%res = extractelement <4 x i1> %icmp, i32 1
where there is only one use of %icmp we can take a similar approach
to what we already do for binary ops such add, sub, etc. and convert
this into
%ext = extractelement <4 x i32> %a, i32 1
%res = icmp ult i32 %ext, 5
For AArch64 targets at least the scalar boolean result will almost
certainly need to be in a GPR anyway, since it will probably be
used by branches for control flow. I've tried to reuse existing code
in scalarizeExtractedBinop to also work for setcc.
NOTE: The optimisations don't apply for tests such as
extract_icmp_v4i32_splat_rhs in the file
CodeGen/AArch64/extract-vector-cmp.ll
because scalarizeExtractedBinOp only works if one of the input
operands is a constant.
This only handles the simplest case where vXi1 is a legal vector type.
If the vector type isn't legal we need to go through type legalization,
but the pattern gets much harder to recognize after that. Either because
ctpop gets expanded due to Zbb not being enabled, or the bitcast
becoming a bitcast+extractelt, or the ctpop being split into multiple
ctpops and adds, etc.
As suggested by Craig, this tries to merge the two sets of register
classes created in #112983, GPRPair* and GPRF64Pair*.
- I added some explicit annotations to `RISCVInstrInfoD.td` which fixed
the type inference issues I was seeing from tablegen for select
patterns.
- I've had to make the behaviour of `splitValueIntoRegisterParts` and
`joinRegisterPartsIntoValue` cover more cases, because you cannot
bitcast to/from untyped (the bitcast would otherwise have been inserted
automatically by TargetLowering code).
- I apparently didn't need to change `getNumRegisters` again, which
continues to tell me there's a bug in the code for tied inputs. I added
some more test coverage of this case but it didn't seem to help find the
asserts I was finding before - I think the difference is between the
default behaviour for integers which doesn't apply to floats.
- There's still a difference between BuildGPRPair and BuildPairF64 (and
the same for SplitGPRPair and SplitF64). I'm not happy with this, I
think it's quite confusing, as they're very similar, just differing in
whether they give a `untyped` or a `f64`. I haven't really worked out
how the DAGCombiner copes if one meets the other, I know we have some of
this for the f64 variants already, but they're a lot more complex than
the GPRPair variants anyway.
This is an alternative fix for #81192. This allows the SelectionDAG
scheduler to be able to find a representative register class for i32 on
RV64. The representative register class is the super register class with
the largest spill size that is also legal. The default implementation of
findRepresentativeClass only works for legal types which i32 is not for
RV64.
I did some investigation of why tablegen uses i32 in output patterns on
RV64. It appears it comes down to a function called
ForceArbitraryInstResultType that picks a type for the output
pattern when the isel pattern isn't specific enough. I believe it picks
the smallest type(lowested numbered) to resolve the conflict.
A similar issue occurs for f16 and bf16 which both use the FPR16
register class. If the isel pattern doesn't specify, tablegen may find
both f16 and bf16 and may pick bf16 from Zfh pattern when Zfbfmin isn't
present. Since bf16 isn't legal in that case, findRepresentativeClass
will fail.
For i8, i16, i32, this patch calls the base class with XLenVT to get the
representative class since XLenVT is always legal.
For bf16/f16, we call the base class with f32 since all of the f16/bf16
extensions depend on either F or Zfinx which will make f32 a legal type.
The final representative register class further depends on whether D or
Zdinx is also enabled, but that should be handled by the default
implementation.
This patch adds support for getting even-odd general purpose register
pairs into and out of inline assembly using the `R` constraint as
proposed in riscv-non-isa/riscv-c-api-doc#92
There are a few different pieces to this patch, each of which need their
own explanation.
- Renames the Register Class used for f64 values on rv32i_zdinx from
`GPRPair*` to `GPRF64Pair*`. These register classes are kept broadly
unmodified, as their primary value type is used for type inference
over selection patterns. This rename affects quite a lot of files.
- Adds new `GPRPair*` register classes which will be used for `R`
constraints and for instructions that need an even-odd GPR pair. This
new type is used for `amocas.d.*`(rv32) and `amocas.q.*`(rv64) in
Zacas, instead of the `GPRF64Pair` class being used before.
- Marks the new `GPRPair` class legal as for holding a `MVT::Untyped`.
Two new RISCVISD node types are added for creating and destructing a
pair - `BuildGPRPair` and `SplitGPRPair`, and are introduced when
bitcasting to/from the pair type and `untyped`.
- Adds functionality to `splitValueIntoRegisterParts` and
`joinRegisterPartsIntoValue` to handle changing `i<2*xlen>` MVTs into
`untyped` pairs.
- Adds an override for `getNumRegisters` to ensure that `i<2*xlen>`
values, when going to/from inline assembly, only allocate one (pair)
register (they would otherwise allocate two). This is due to a bug in
SelectionDAGBuilder.cpp which other backends also work around.
- Ensures that Clang understands that `R` is a valid inline assembly
constraint.
- This also allows `R` to be used for `f64` types on `rv32_zdinx`
architectures, where doubles are stored in a GPR pair.