- [AArch64]: TargetLowering is updated to spot load/store
(de)interleave4-like sequences using PatternMatch, and emit equivalent
sve.ld4 and sve.st4 intrinsics.
It doesn't matter which extend we use to promote the operands. Use
whatever is the most efficient.
The custom handler for RISC-V used SIGN_EXTEND when the Zbb
extension is enabled, so we no longer need that.
This has received no development work in a while and is slowly bit
rotting as new extensions are added.
At the moment, I don't think this is viable without adding a new
invariant that 32-bit values are always in sign-extended form, like
Mips64 does. We are very dependent on computeKnownBits and
ComputeNumSignBits in SelectionDAG to remove sign extends created for
ABI reasons. If we can't propagate sign bit information through 64-bit
values in SelectionDAG, we can't effectively clean up those extends.
We already had a DAG combine for (sra (sext_inreg (shl X, C1), i32), C2)
-> (sra (shl X, C1+32), C2+32) that we used for RV64. This patch
generalizes it to other sext_inregs for both RV32 and RV64.
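For reference, here's a small standalone check of the underlying
identity, assuming C1 and C2 are both below 32 so the adjusted shift
amounts stay in range (an illustration of the algebra, not the
combine's exact legality condition):

```cpp
// before: (sra (sext_inreg (shl X, C1), i32), C2)
// after:  (sra (shl X, C1+32), C2+32)
#include <cassert>
#include <cstdint>

int64_t before(int64_t X, unsigned C1, unsigned C2) {
  int64_t Shl = (int64_t)((uint64_t)X << C1); // shl X, C1
  int64_t Sext = (int64_t)(int32_t)Shl;       // sext_inreg ..., i32
  return Sext >> C2;                          // sra ..., C2 (arithmetic)
}

int64_t after(int64_t X, unsigned C1, unsigned C2) {
  return (int64_t)((uint64_t)X << (C1 + 32)) >> (C2 + 32);
}

int main() {
  for (int64_t X : {0LL, 1LL, -1LL, 0x12345678LL, -0x7ffff000LL})
    for (unsigned C1 = 0; C1 < 32; ++C1)
      for (unsigned C2 = 0; C2 < 32; ++C2)
        assert(before(X, C1, C2) == after(X, C1, C2));
}
```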
Fixes #101040.
Previously, we created a vsetvlimax intrinsic. Using X0 simplifies the
code and enables some optimizations to kick in when the exact value of
vlmax is known.
None of our addressing modes support a scalable offset. I could not
figure out how to get LSR to actually try such a formula, but let's
be defensive and explicitly prevent this case from being considered
a valid address mode match.
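The gist of the defensive check is something like the sketch below;
AddrModeParts stands in for LLVM's addressing-mode description here,
and the field names are my assumption rather than the literal code of
the patch:

```cpp
#include <cstdint>

// Stand-in for the addressing-mode query's description of a candidate
// formula (base register, fixed offset, vscale-scaled offset, scale).
struct AddrModeParts {
  int64_t BaseOffs = 0;       // fixed byte offset
  int64_t ScalableOffset = 0; // offset of the form vscale * ScalableOffset
  bool HasBaseReg = false;
  int64_t Scale = 0;
};

bool isLegalRVAddressingMode(const AddrModeParts &AM) {
  // No RISC-V addressing mode can encode a vscale-scaled offset, so never
  // accept such a formula as a valid address-mode match.
  if (AM.ScalableOffset != 0)
    return false;
  // ... the existing reg + signed 12-bit immediate checks stay as-is ...
  return AM.BaseOffs >= -2048 && AM.BaseOffs <= 2047;
}
```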
The tablegen patterns all have isRV32. I did not check if any of them
could naively support RV64.
Fixes #101067 and probably other bugs like it that we haven't found yet.
At zvl1024b, we may have legal fixed-length vectors where a vid.v would
overflow at i8, e.g. <512 x i8>.
When lowering constant build_vectors, isSimpleVIDSequence used uint64_t
to model the vid.v sequence, which meant it didn't account for the fact
that it could overflow in these larger types.
This patch fixes it by modelling the sequence with an SEW-wide APInt, so
if it does overflow, the loop that checks/calculates the addend will
detect it and bail.
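A standalone illustration of the difference (not the code from the
patch): an APInt sized to the SEW reports the wrap that a uint64_t
model silently ignores.

```cpp
#include "llvm/ADT/APInt.h"
#include <cassert>
using namespace llvm;

int main() {
  // Modelling a vid.v sequence for <512 x i8>: the SEW-wide APInt notices
  // the wrap past 255, while a uint64_t accumulator never would.
  unsigned SEW = 8;
  APInt Elt(SEW, 0);
  APInt Step(SEW, 1);
  bool Overflowed = false;
  for (unsigned I = 0; I < 512 && !Overflowed; ++I)
    Elt = Elt.uadd_ov(Step, Overflowed); // flags the wrap, so we can bail

  assert(Overflowed && "i8 cannot hold all 512 indices");

  uint64_t Wide = 0;
  for (unsigned I = 0; I < 512; ++I)
    Wide += 1; // silently exceeds the 8-bit range the hardware will use
  assert(Wide == 512);
}
```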
Fixes #99729
This patch folds `fmul X, (fcopysign 1.0, Y)` into `fsgnjx X, Y`. This
pattern exists in some graphics applications/math libraries.
Alive2: https://alive2.llvm.org/ce/z/epyL33
Since fpimm +1.0 is lowered to a load from the constant pool after
OpLegalization, I have to introduce a new RISCVISD node FSGNJX and fold
this pattern in DAGCombine.
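The scalar identity being used is just that multiplying by
copysign(1.0, Y) only ever flips X's sign according to Y's sign, i.e. a
sign-injection XOR; a quick standalone illustration (ignoring the sNaN
corner cases the Alive2 proof addresses):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Model of fsgnjx: the result is x with its sign bit XORed with y's sign bit.
double fsgnjx(double x, double y) {
  uint64_t xb, yb;
  std::memcpy(&xb, &x, 8);
  std::memcpy(&yb, &y, 8);
  uint64_t rb = xb ^ (yb & 0x8000000000000000ULL);
  double r;
  std::memcpy(&r, &rb, 8);
  return r;
}

int main() {
  for (double x : {1.5, -2.25, 0.0, -0.0})
    for (double y : {3.0, -7.5, 0.0, -0.0}) {
      double mul = x * std::copysign(1.0, y);
      assert(fsgnjx(x, y) == mul &&
             std::signbit(fsgnjx(x, y)) == std::signbit(mul));
    }
}
```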
Closes https://github.com/dtcxzyw/llvm-opt-benchmark/issues/1072.
We can teach RISCVVectorPeephole to detect when an AVL is equal to the
VLMAX when the exact VLEN is known and use the VLMAX sentinel instead,
and in doing so remove the need for getVLOp in RISCVISelLowering. This
keeps all the VLMAX logic in one place.
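Roughly speaking (the names here are illustrative, not the peephole's
actual code): once the exact VLEN is known, VLMAX for a given SEW/LMUL
is a compile-time constant, so an immediate AVL equal to it can be
replaced with the VLMAX sentinel.

```cpp
#include <cassert>

// VLMAX = (VLEN / SEW) * LMUL, with LMUL expressed as a fraction.
unsigned computeVLMax(unsigned VLenBits, unsigned SEW, unsigned LMulNumer,
                      unsigned LMulDenom) {
  return VLenBits / SEW * LMulNumer / LMulDenom;
}

int main() {
  // With VLEN=128: SEW=32, LMUL=2 gives VLMAX=8, so an AVL of exactly 8
  // can use the X0/VLMAX encoding instead of a register or immediate.
  assert(computeVLMax(128, 32, 2, 1) == 8);
  // Fractional LMUL works the same way: SEW=64, LMUL=1/2 -> VLMAX=1.
  assert(computeVLMax(128, 64, 1, 2) == 1);
}
```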
RISCVGatherScatterLowering is the last user of
riscv_masked_strided_{load,store} after #98131 and #98112, so this patch
changes it to emit the VP equivalent instead. This allows us to remove
the masked_strided intrinsics so we only have one lowering path.
riscv_masked_strided_{load,store} didn't have AVL operands and were
always VLMAX, so this passes in the fixed or scalable element count to
the EVL instead, which RISCVVectorPeephole should now convert to VLMAX
after #97800.
For loads we also use a vp_select to get passthru (mask undisturbed)
behaviour.
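Very roughly, the replacement looks like the sketch below (my
reconstruction, not the patch's literal code): the EVL is the fixed or
scalable element count, and a vp.select on the same mask and EVL
restores the passthru value in the masked-off lanes.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
using namespace llvm;

// Sketch: emit llvm.experimental.vp.strided.load with EVL = element count,
// then blend the passthru back in with llvm.vp.select.
Value *emitStridedLoad(IRBuilder<> &B, VectorType *VecTy, Value *Ptr,
                       Value *Stride, Value *Mask, Value *Passthru) {
  // For fixed vectors this is a constant; for scalable vectors it is
  // vscale * MinNumElts, which RISCVVectorPeephole can turn into VLMAX.
  Value *EVL = B.CreateElementCount(B.getInt32Ty(), VecTy->getElementCount());
  Value *Load = B.CreateIntrinsic(
      Intrinsic::experimental_vp_strided_load,
      {VecTy, Ptr->getType(), Stride->getType()}, {Ptr, Stride, Mask, EVL});
  // Masked-off lanes of the VP load are poison, so select the passthru back
  // in to recover the old intrinsic's mask-undisturbed behaviour.
  return B.CreateIntrinsic(Intrinsic::vp_select, {VecTy},
                           {Mask, Load, Passthru, EVL});
}
```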
These new opcodes drop the shift amount, rounding mode, and passthru,
making them exactly like TRUNCATE_VECTOR_VL. The shift amount, rounding
mode, and passthru are added in isel patterns, similar to how we
translate TRUNCATE_VECTOR_VL to vnsrl with a shift of 0.
This should simplify #99418 a little.
With correct test update.
Original message:
We were only checking that the node from the worklist is a supported
root. We weren't checking the strategy or any of its operands unless it
was the original node. For any other node, we just rechecked the
original node's strategy and operands.
The effect of this is that we don't do all of the transformations at
once. Instead, when there were multiple possible nodes to transform we
would only do them as each node was visited by the main DAG combine
worklist.
The test shows a case where we widened an instruction without removing
all of the uses of the vsext. The sext is shared by one node that shares
another sext node with the root and another node that doesn't share
anything with the root.
An nxvXi64 VMV_V_X_VL on RV32 sign extends its 32-bit input to 64 bits.
If that input is positive, the sign extend can also be considered a
zero extend.
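This is the usual observation that sign- and zero-extension agree
whenever the sign bit is known to be clear; a trivial standalone check:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // For non-negative 32-bit inputs, sext and zext to 64 bits are identical,
  // so the implicit sign extend of vmv.v.x can double as a zero extend.
  for (int32_t X : {0, 1, 42, 0x7fffffff}) {
    uint64_t Sext = (uint64_t)(int64_t)X;  // sign extend
    uint64_t Zext = (uint64_t)(uint32_t)X; // zero extend
    assert(Sext == Zext);
  }
  // With the sign bit set they differ, hence the positivity requirement.
  assert((uint64_t)(int64_t)INT32_MIN != (uint64_t)(uint32_t)INT32_MIN);
}
```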
Well, not quite that simple. We can tail-call memset since it returns
its first argument, but bzero doesn't do that, and therefore we can end
up miscompiling.
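Concretely, a made-up example of the difference: a caller returning the
destination pointer can legally tail-call memset because memset hands
the pointer back, but the same rewrite with bzero loses the return
value.

```cpp
#include <cstddef>
#include <cstring>
#include <strings.h>

void *zero_ok(void *p, size_t n) {
  // Fine as a tail call: memset returns its first argument, so the
  // caller's return value is still correct.
  return memset(p, 0, n);
}

void *zero_bad(void *p, size_t n) {
  // Must NOT become a tail call: bzero returns void, so turning this into
  // "jump to bzero" would drop the "return p" below.
  bzero(p, n);
  return p;
}
```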
This patch also refactors the logic out of isInTailCallPosition() into
the callers. As a result, memcpy and memmove are also modified to do the
same thing for consistency.
rdar://131419786
Like #93321, this patch also tries to resolve the conflicting use of x7
between fastcc and Zicfilp, but this patch removes x7 from fastcc
directly. Its purpose is to reduce the code complexity of #93321, and we
also found that it increases instruction count by at most 0.02% on most
benchmarks and might even benefit some of them.
The _VL nodes are only used with scalable vectors so we don't need
to check that.
It doesn't matter if Zvfhmin is enabled. All that really matters is
whether Zvfh is.
Doing so allows one side to fold entirely into the mask applied to the
other recursive call (or a vmerge.vv at worst). This is a generalization
of the existing IsSelect case (both operands are selects), so I removed
that code in the process.
This actually started as an attempt to remove the IsSelect bit, as I'd
thought it was fully redundant with the recursive formulation, but
digging into test deltas revealed that we depended on that to catch the
majority of the identity cases, and that in turn we were missing some
cases where only RHS was an identity.
Summary:
These Libcalls represent which functions are available to the backend.
If a runtime call is not available, the target sets the name to
`nullptr`. Currently, this logic is spread around the various targets.
This patch pulls all of the locations that disable libcalls into the
initializer. This patch is effectively NFC.
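For context, this is the kind of scattered per-target code being
consolidated; an illustrative fragment from a hypothetical target's
TargetLowering constructor (the RTLIB entries are examples, not the
exact set touched by the patch):

```cpp
// Mark runtime calls the target cannot emit as unavailable by clearing
// their names; the patch moves decisions like these into the
// RuntimeLibcalls initializer instead of each target's constructor.
setLibcallName(RTLIB::MULO_I128, nullptr); // e.g. no __muloti4 helper
setLibcallName(RTLIB::SHL_I128, nullptr);  // e.g. no 128-bit shift helper
```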
The motivation behind this patch is that currently the LTO handling uses
the list of all runtime calls to determine which functions cannot be
internalized and must be extracted from static libraries. We do not want
this to happen for libcalls that are not emitted by the backend. A
follow-up patch will move out this logic so the LTO pass can know which
rtlib calls are actually used by the backend.
This combine is a duplication of the transform in
RISCVGatherScatterLowering but at the SelectionDAG level, so similarly
to #98111 we can replace the use of riscv_masked_strided_load with a VP
strided load.
Unlike #98111 we don't require #97800 or #97798 since it only operates
on fixed vectors with a non-zero stride.
Marking SPLAT_VECTOR as Custom enables generic DAGCombine to turn
BUILD_VECTOR into SPLAT_VECTOR. We need to custom type legalize BUILD_VECTOR
without Zfhmin since we don't have the scalar f16 type. If we allow
SPLAT_VECTOR to be formed, we'll need to custom type legalize it too.
The easiest fix is to only enable SPLAT_VECTOR with Zvfhmin+Zfhmin.
There's still an issue that we need to properly support BUILD_VECTOR
with Zvfhmin+Zfhmin.
Should fix the new case reported in #97849.
I've also changed the predicates to Zfhmin instead of ZfhminOrZhinxmin
since Zhinx isn't compatible with Zvfhmin.
In 03d4332, we extended build_vector lowering to pack elements into the
largest size which doesn't exceed either ELEN or XLEN. The zbkb
extension - ratified under scalar crypto, but otherwise not really
connected to crypto per se - adds the packh, packw, and pack
instructions. These instructions are designed for exactly this pairwise
packing.
I ended up choosing to directly lower to machine nodes. A combination of
the slightly non-uniform semantics of these instructions (packw *sign*
extends the result, whereas packh *zero* extends it) and our generic
dag canonicalization (which sinks shl through or nodes) makes pattern
matching these tricky and not particularly robust. Another alternative
was to have an ISD node for them, but that didn't seem to add much in
practice.
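For reference, the pairwise packing these instructions perform, written
out as plain scalar code (semantics as I read them from the Zbkb spec;
RV64 register width assumed):

```cpp
#include <cassert>
#include <cstdint>

// packh: concatenate the low bytes of rs1 (low) and rs2 (high), zero-extended.
uint64_t packh(uint64_t rs1, uint64_t rs2) {
  return (rs1 & 0xff) | ((rs2 & 0xff) << 8);
}

// packw (RV64): concatenate the low 16 bits of rs1 and rs2 into a 32-bit
// value, then *sign*-extend it to 64 bits.
uint64_t packw(uint64_t rs1, uint64_t rs2) {
  uint32_t Lo32 = (uint32_t)(rs1 & 0xffff) | ((uint32_t)(rs2 & 0xffff) << 16);
  return (uint64_t)(int64_t)(int32_t)Lo32;
}

// pack (RV64): concatenate the low 32 bits of rs1 and rs2.
uint64_t pack(uint64_t rs1, uint64_t rs2) {
  return (rs1 & 0xffffffff) | ((rs2 & 0xffffffff) << 32);
}

int main() {
  assert(packh(0x11, 0xff22) == 0x2211);                  // zero extended
  assert(packw(0x1234, 0x8765) == 0xffffffff87651234ULL); // sign extended
  assert(pack(0x89abcdef, 0x01234567) == 0x0123456789abcdefULL);
}
```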
With the tag merging in place, we can safely make +seq-cst-trailing-fence
the default, according to the recommendation in
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-atomic.adoc
This patch changes the default for the feature flag and moves to more
consistent naming with respect to existing features.
This was reverted with https://github.com/llvm/llvm-project/pull/84597
because ld.bfd would segfault on unknown RISC-V attributes. Now that
attribute emission is guarded by a backend flag,
`--riscv-abi-attributes`, this should be safe to reland, since it won't
introduce ABI tags unless the user opts into them.
Our worst case build_vector lowering is a serial chain of vslide1down.vx
operations which creates a serial dependency chain through a relatively
high latency operation. We can instead pack together elements into
ELEN-sized chunks, and move them from the integer domain to the vector
domain in a single operation.
This reduces the length of the serial chain on the vector side, and
costs at most three scalar instructions per element. This is a win for
all cores when the sum of the latencies of the scalar instructions is
less than the vslide1down.vx being replaced, and is particularly
profitable for out-of-order cores which can overlap the scalar
computation.
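As an illustration of the scalar side of this (not the lowering
itself): packing four i16 elements into a single 64-bit chunk costs
roughly a zero-extend, a shift, and an or per element, and the whole
chunk then takes one vslide1down.vx instead of four.

```cpp
#include <cassert>
#include <cstdint>

uint64_t packChunk(int16_t E0, int16_t E1, int16_t E2, int16_t E3) {
  uint64_t Chunk = (uint16_t)E0;         // zext
  Chunk |= (uint64_t)(uint16_t)E1 << 16; // zext + shift + or
  Chunk |= (uint64_t)(uint16_t)E2 << 32; // zext + shift + or
  Chunk |= (uint64_t)(uint16_t)E3 << 48; // zext + shift + or
  return Chunk;
}

int main() {
  assert(packChunk(0x1111, 0x2222, 0x3333, -1) == 0xffff333322221111ULL);
}
```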
This patch is restricted to configurations with zba and zbb. Without
both, the zero extend might require two instructions, which would bring
the total scalar instructions per element to 4. zba and zbb are both
present in the rva22u64 baseline, which is looking to be quite common
for hardware in practice; we could extend this to systems without
bitmanip with a bit of extra effort.
If we don't have Zfhmin, we will call `SoftPromoteHalfOperand` on the
BUILD_VECTOR. This operation is not supported by the generic code.
Instead, custom lower to a vXi16 BUILD_VECTOR using bitcasts.
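The lowering is roughly along these lines (a sketch of the approach
described above, not the exact code): bitcast each f16 operand to i16,
build a vXi16 BUILD_VECTOR, and bitcast the result back.

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Sketch: without Zfhmin there is no legal scalar f16 to soft-promote, so
// build the vector in the integer domain and bitcast back at the end.
static SDValue lowerF16BuildVectorViaI16(SDValue Op, SelectionDAG &DAG) {
  SDLoc DL(Op);
  MVT VT = Op.getSimpleValueType();               // e.g. v4f16
  MVT IVT = VT.changeVectorElementType(MVT::i16); // v4i16
  SmallVector<SDValue, 8> Ops;
  for (SDValue Elt : Op->op_values())
    Ops.push_back(DAG.getBitcast(MVT::i16, Elt)); // f16 -> i16 per element
  SDValue IVec = DAG.getBuildVector(IVT, DL, Ops);
  return DAG.getBitcast(VT, IVec);                // v4i16 -> v4f16
}
```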
Fixes #97849.