ISD::VP_MERGE treats the false operand as the source for elements past
VL. The vmerge instruction encodes 3 registers and treats the vd
register as the source for the tail.
This patch adds a new ISD opcode that models the tail source explicitly.
During lowering we copy the false operand to this operand.
I think we can merge RISCVISD::VSELECT_VL with this new opcode by using
an UNDEF passthru, but I'll save that for another patch.
IR intrinsics were already defined, but no codegen support had been
added.
I extracted this code from our downstream. Some of it may have come from
https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi/ originally.
PR #75576 and #75735 update some implies in
llvm/lib/Support/RISCVISAInfo.cpp, but both of them miss the subtarget
feature part.
This patch still preserve predicate HasStdExtZfhOrZfhmin and
HasStdExtZhinxOrZhinxmin, since they could make error message more
readable. ( Users might not know that zfh implies zfhmin.)
If we're lowering a fixed length vector load or store which happens to
exactly VLEN in size (when VLEN is exactly known), we can use a whole
register load or store instead of the unit strided variants. This
doesn't require a vsetvli in some cases, allows additional flexibility
of vsetvli cases in others, and doesn't have a runtime dependency on the
value of VL.
If we know the exact VLEN, then we can tell if the AVL for particular
operation is equivalent to the vsetvli xN, zero, <vtype> encoding. Using
this encoding is better than having to materialize an immediate in a
register, but worse than being able to use the vsetivli zero, imm,
<type> encoding.
Middle end up optimizations can speculate away the short circuit
behavior of C/C++ && and ||. Using i1 and/or or logical select
instructions and a single branch.
SelectionDAGBuilder can turn i1 and/or/select back into multiple
branches, but this is disabled when jump is expensive.
RISC-V can use slt(u)(i) to evaluate a condition into any GPR which
makes us better than other targets that use a flag register. RISC-V also
has single instruction compare and branch. So its not clear from a code
size perspective that using compare+and/or is better.
If the full condition is dependent on multiple loads, using a logic
delays the branch resolution until all the loads are resolved even if
there is a cheap condition that makes the loads unnecessary.
PowerPC and Lanai are the only CPU targets that use setJumpIsExpensive.
NVPTX and AMDGPU also use it but they are GPU targets. PowerPC appears
to have a MachineIR pass that turns AND/OR of CR bits into multiple
branches. I don't know anything about Lanai and their reason for using
setJumpIsExpensive.
I think the decision to use logic vs branches is much more nuanced than
this big hammer. So I propose to make RISC-V match other CPU targets.
Anyone who wants the old behavior can still pass -mllvm
-jump-is-expensive=true.
Previously we allocated one object for each GPR. We also allocated the
same offset twice, once to save for VASTART and then again for the first
register in the save loop.
This patch uses a single object for all the registers and shares this
with VASTART. This is more consistent with other targets like AArch64
and ARM.
I've removed the setValue(nullptr) from the memory operand now. Having a
single object makes me a lot more comfortable about alias analysis being
able to see what is going on. This led to the scheduling changes in
push-pop-popret.ll and vararg.ll.
The computation we use for computing the size already returns 0 when all
registers are allocated. We don't need an if to set it to 0.
Use the size being 0 to check for whether we need to spill registers or
not.
I have another change I want to make to this code, but this change
seemed to stand on its own. I left the curly braces since I need them
for the other change.
When we'd originally added unaligned-scalar-mem and
unaligned-vector-mem, they were separated into two parts under the
theory that some processor might implement one, but not the other. At
the moment, we don't have evidence of such a processor. The C/C++ level
interface, and the clang driver command lines have settled on a single
unaligned flag which indicates both scalar and vector support unaligned.
Given that, let's remove the test matrix complexity for a set of
configurations which don't appear useful.
Given these are internal feature names, I don't think we need to provide
any forward compatibility. Anyone disagree?
Note: The immediate trigger for this patch was finding another case
where the unaligned-vector-mem wasn't being properly serialized to IR
from clang which resulted in problems reproducing assembly from clang's
-emit-llvm feature. Instead of fixing this, I decided getting rid of the
complexity was the better approach.
This patch contains two related combines:
1) If we have an scalar vector insert into the result of a
concat_vector,
sink the insert into the operand of the concat.
2) If we have a insert of a scalar binop into a vector binop of the
same opcode and the RHS of both are constant, perform the insert
and then the binop.
The common theme to both is pushing inserts closer to the sources of the
computation graph. The goal is to enable forming vector bin ops from
inserts of scalar binops at the end of another vector.
For RISCV specifically, the concat_vector transform will push inserts to
smaller vectors. This will have the effect of reducing lmul for the
vslides, and usually doesn't require an additional vsetvli since
the source vectors are already working in the narrower VL. I tried
that one as a target independent combine first, and it doesn't appear
profitable on all targets.
This is only one approach to the problem. Another idea would be to
aggressively form build_vectors and subvector inserts from the
individual scalar inserts, and then have a transform which sunk a
subvector_insert down through the concat. The advantage of the alternate
approach is that we expose parallelism in the insert sequence, even if
the source vector isn't a concat_vector. If reviewers are okay with it,
I'd like to start with this approach, and then explore that direction in
a follow up patch.
Found while investigating whether we still need to stop DAG combiner
from turning (i64 (sext (i32 X))) into zext when i32 is known non
negative.
No test case because I still need to find fixes for some other issues
before I can remove the code from DAGCombiner.
PR #72978 disabled transformation (select_cc seteq (and x, C), 0, 0, A)
-> (and (sra(shl x)), A) for better Zicond codegen. It still enables the
combine when C is not fit into 12-bits. This patch disables the combine
when Zbs enabled.
Currently, llvm can transform (setcc ne (and x, c)) to (bexti x,
log2(c)) where c is power of 2.
This patch transform (select (setcc ne (and x, c)), T, F) into (select
(setcc eq (and x, c)), F, T).
It is benefit to the case c is not fit to 12-bits.
If we have a constant index and a known vlen, then we can identify which
registers out of a register group is being accessed. Given this, we can
reuse the (slightly generalized) existing handling for working on
sub-register groups. This results in all constant index extracts with
known vlen becoming m1 operations.
One bit of weirdness to highlight and explain: the existing code uses
the VL from the original vector type, not the inner vector type. This is
correct because the inner register group must be smaller than the
original (possibly fixed length) vector type. Overall, this seems to a
reasonable codegen tradeoff as it biases us towards immediate AVLs,
which avoids needing the vsetvli form which clobbers a GPR for no real
purpose. The downside is that for large fixed length vectors, we end up
materializing an immediate in register for little value. We should
probably generalize this idea and try to optimize the large fixed length
vector case, but that can be done in separate work.
If we have a high LMUL build_vector and a known exact VLEN, we can
decompose the build_vector into one build_vector per register in the
register group. Doing so requires exact knowledge of which elements
correspond to each register in the register group, and thus an exact
VLEN must be known.
Since we no longer have operations which are linear (or worse) in LMUL,
this also allows us to lower all build_vectors without resorting to
going through the stack.
This is the first in a planned patch series to teach our vector lowering
how to exploit register boundaries in LMUL>1 types when VLEN is known to
be an exact constant. This corresponds to code compiled by clang with
the -mrvv-vector-bits=zvl option.
For extract_vector_elt, if we have a constant index and a known vlen,
then we can identify which register out of a register group is being
accessed. Given this, we can do a sub-register extract for that
register, and then shift any remaining index.
This results in all constant index extracts becoming m1 operations, and
thus eliminates the complexity concern for explode-vector idioms at high
lmul.
It seems TypeSize is currently broken in the sense that:
TypeSize::Fixed(4) + TypeSize::Scalable(4) => TypeSize::Fixed(8)
without failing its assert that explicitly tests for this case:
assert(LHS.Scalable == RHS.Scalable && ...);
The reason this fails is that `Scalable` is a static method of class
TypeSize,
and LHS and RHS are both objects of class TypeSize. So this is
evaluating
if the pointer to the function Scalable == the pointer to the function
Scalable,
which is always true because LHS and RHS have the same class.
This patch fixes the issue by renaming `TypeSize::Scalable` ->
`TypeSize::getScalable`, as well as `TypeSize::Fixed` to
`TypeSize::getFixed`,
so that it no longer clashes with the variable in
FixedOrScalableQuantity.
The new methods now also better match the coding standard, which
specifies that:
* Variable names should be nouns (as they represent state)
* Function names should be verb phrases (as they represent actions)
DAGCombiner folds (select_cc seteq (and x, y), 0, 0, A) to (and (sra
(shl x)) A) where y has a single bit set. Previously, DAGCombiner relies
on `shouldAvoidTransformToShift` to decide when to do the combine, but
`shouldAvoidTransformToShift` is only about shift cost. This patch
introuduces a specific hook to decide when to do the combine and disable
the combine when Zicond enabled and AndMask <= 1024.
If we are using entirely vslide1downs to initialize an otherwise undef
vector, we end up with an implicit_def as the source of the first
vslide1down. This register has to be allocated, and creates false
dependencies with surrounding code.
Instead, start our sequence with a vmv.v.x in the hopes of creating a
dependency breaking idiom. Unfortunately, it's not clear this will
actually work as due to the VL=0 special case for T.A. the hardware has
to work pretty hard to recognize that the vmv.v.x actually has no source
dependence. I don't think we can reasonable expect all hardware to have
optimized this case, but I also don't see any downside in prefering it.
Adds GraalVM calling conventions. The only difference with the default calling conventions is that GraalVM reserves two registers for the heap base and the thread. Since the registers are then accessed by name, getRegisterByName has to be updated accordingly.
This patch implements the calling conventions only for X86, AArch64 and RISC-V.
For X86, the reserved registers are X14 and X15. For AArch64, they are X27 and X28. For RISC-V, they are X23 and X27.
This patch has been used by the LLVM backend of GraalVM's Native Image project in production for around 4 months with no major issues.
Differential Revision: https://reviews.llvm.org/D151107
Allocate a register of the correct register class for inline asm
constraint "r" when used for FP values with -Zfinx/-Zdinx.
---------
Co-authored-by: Nemanja Ivanovic <nemanja@synopsys.com>
We already have the pseudo's for lowering these as MI nodes with
rounding mode operands, and the generic FRM insertion pass. Doing the
insertion later in the backend allows SSA level passes to avoid
reasoning about physical register copies, and happens to produce better
code in practice. The later is mostly an accident of our insertion
order; we happen to place the frm write after the vsetvli, and it's very
common for a register to be killed at the vsetvli. End result is that we
get slightly better scalar register allocation.
I'm a bit unclear on the history here. I was surprised to find this code
in ISEL lowering at all, but am also surprised once I found it that all
the patterns and pseudos seem to already exist. My best guess is that
maybe we didn't do all the possible cleanup after introducing the
HasRoundMode mechanism?
Previously getDefaultScalableVLOps called getDefaultVLOps. getDefaultVLOps
also handles fixed vectors so had to then check if it was fixed
or scalable.
Since getDefaultScalableVLOps know the type is scalable, it makes
sense for it to contain the scalable case directly and have
getDefaultVLOps call it for the scalable case.
This patch lowers `sdiv x, +/-2**k` to `add + select + shift` when the
short forward branch optimization is enabled. The latter inst seq
performs faster than the seq generated by target-independent
DAGCombiner. This algorithm is described in ***Hacker's Delight***.
This patch also removes duplicate logic in the X86 and AArch64 backend.
But we cannot do this for the PowerPC backend since it generates a
special instruction `addze`.
We can match this directly in isel with the i32 type being legal.
The generic DAG combine will unpromote part of the pattern and
prevent it from being matched in isel.
This is similar to vector.reverse, but only reverses the first EVL
elements.
I extracted this code from our downstream. Some of it may have come from
https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi/ originally.
This will select i32 operations directly to W instructions without
custom nodes. Hopefully this can allow us to be less dependent on
hasAllNBitUsers to recover i32 operations in RISCVISelDAGToDAG.cpp.
This support is enabled with a command line option that is off by
default.
Generated code is still not optimal.
I've duplicated many test cases for this, but its not complete. Enabling this runs all existing lit tests without crashing.