We can shuffle vXf16 vectors just like vXi16 vectors; no FP instructions
are needed. Update the vrgather and vslide patterns to check only the
predicates of the equivalent integer type. If we use the FP type, the
predicate check requires Zvfh and blocks Zvfhmin.
These are probably not the only patterns that need to be fixed, but the
test from the bug report no longer crashes.
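For illustration, a shuffle of this shape (a hypothetical reduction, not
the exact test from the report) should now select integer
vrgather/vslide instructions under Zvfhmin:

```llvm
; Reversing a fixed f16 vector is pure data movement; with +zvfhmin and
; without +zvfh it should use the integer shuffle patterns.
define <4 x half> @reverse_v4f16(<4 x half> %v) {
  %s = shufflevector <4 x half> %v, <4 x half> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  ret <4 x half> %s
}
```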
Fixes #97477
Implement XCValu intrinsics for CV32E40P according to the specification.
This commit is part of a patch-set to upstream the vendor specific
extensions of CV32E40P that need LLVM intrinsics to implement Clang
builtins.
Contributors: @CharKeaney, @ChunyuLiao, @jeremybennett, @lewis-revill,
@NandniJamnadas, @PaoloS02, @serkm, @simonpcook, @xingmingjie.
The splat we generate after the load doesn't use the extended bits, so it
shouldn't matter which extend type we use.
EXTLOAD is lowered as SEXTLOAD on every element type except i8.
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
This applies if only bits 8, 16, 24, 32, etc. can be non-zero. This is
the form (mul X, 255) is decomposed to, and that decomposition happens
early, before the RISC-V DAG combine runs.
This patch does not support types larger than XLen, so i64 on rv32
fails to generate two orc.b instructions. It might have worked if the
mul hadn't been decomposed before it was expanded.
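A hypothetical IR-level illustration (the mask is mine, picked to
restrict which bits can be set):

```llvm
; %x can only have bits 8, 16, and 24 set, so each byte of %x is 0 or 1.
; Multiplying by 255, i.e. computing (%x << 8) - %x, turns each set bit
; into a 0xFF byte, which is exactly what orc.b produces for this input.
define i32 @orcb_via_mul(i32 %a) {
  %x = and i32 %a, 16843008 ; 0x01010100
  %m = mul i32 %x, 255
  ret i32 %m
}
```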
Partial fix for #96595.
As far as I can tell, this pull request was not approved and did not go
through an RFC on Discourse.
This reverts commit 89881480030f48f83af668175b70a9798edca2fb.
This reverts commit 225d8fc8eb24fb797154c1ef6dcbe5ba033142da.
Currently, the behavior of llvm.minnum differs across platforms when
one operand is sNaN. When we compare sNaN vs NUM:
- ARM/AArch64/PowerPC: follow IEEE754-2008's minNum and return qNaN.
- RISC-V/Hexagon: follow IEEE754-2019's minimumNumber and return NUM.
- X86: returns NUM, but does not match IEEE754-2019's minimumNumber,
  as +0.0 is not always treated as greater than -0.0.
- MIPS/LoongArch/Generic: return NUM.
- LIBCALL: returns qNaN.
So, let's introduce llvm.minimumnum/llvm.maximumnum, which always
follow IEEE754-2019's minimumNumber/maximumNumber.
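For illustration, a sketch of the new intrinsics in use (my example,
not taken from the patch):

```llvm
declare float @llvm.minimumnum.f32(float, float)
declare float @llvm.maximumnum.f32(float, float)

; With an sNaN operand these consistently return the number, and -0.0
; is consistently ordered less than +0.0.
define float @clamp01(float %x) {
  %lo = call float @llvm.maximumnum.f32(float %x, float 0.0)
  %hi = call float @llvm.minimumnum.f32(float %lo, float 1.0)
  ret float %hi
}
```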
Half-fix: #93033
This is a three-instruction expansion, and does not depend on zba, so
most of the test changes are in base RV32/64I configurations.
With zba, this gets immediates such as 14, 28, 30, 56, 60, and 62,
which aren't covered by our other expansions.
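For example (my arithmetic; the exact expansion shape may differ),
X * 14 can be rewritten as (X << 4) - (X << 1):

```llvm
; Two shifts and a subtract: three instructions, no zba needed.
define i64 @mul14(i64 %x) {
  %a = shl i64 %x, 4  ; X * 16
  %b = shl i64 %x, 1  ; X * 2
  %r = sub i64 %a, %b ; X * 16 - X * 2 == X * 14
  ret i64 %r
}
```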
This change is a preliminary step to support trampolines on RISC-V. Trampolines are used by flang to implement obtaining the address of an internal program (i.e., a nested function in Fortran parlance).
In this change we lower `llvm.clear_cache` intrinsic on glibc targets to
`__riscv_flush_icache` which is what GCC is currently doing for Linux targets.
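A minimal IR example of the intrinsic this lowers:

```llvm
declare void @llvm.clear_cache(ptr, ptr)

define void @flush_range(ptr %begin, ptr %end) {
  ; On RISC-V glibc targets this now lowers to a call to
  ; __riscv_flush_icache, matching GCC on Linux.
  call void @llvm.clear_cache(ptr %begin, ptr %end)
  ret void
}
```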
The instructions are only defined to operate on f16 data. If the
scalar FPR isn't properly NaN-boxed, these instructions will create an
f16 NaN, not a bf16 NaN, in the vector register.
The custom lowering converts to f32, splats as f32, then narrows the
vector to bf16. None of that requires Zvfhmin.
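A hypothetical example of the kind of splat this covers:

```llvm
; Splatting a bf16 scalar: the custom lowering widens to f32, splats,
; then narrows, so no Zvfh/Zvfhmin is needed.
define <vscale x 4 x bfloat> @splat_bf16(bfloat %x) {
  %head = insertelement <vscale x 4 x bfloat> poison, bfloat %x, i32 0
  %splat = shufflevector <vscale x 4 x bfloat> %head, <vscale x 4 x bfloat> poison, <vscale x 4 x i32> zeroinitializer
  ret <vscale x 4 x bfloat> %splat
}
```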
Add new bf16 test files without Zvfh/Zvfhmin in their RUN lines. I will
remove the bf16 tests from other files in a follow-up patch.
This pattern is an obscured way to express saturating a signed value
into a smaller unsigned value.
If (setltu X, 256) is true, then the value is already in the desired
range so we can pick X. If it's false, we select (sext (setgt X, 0))
which is 0 for negative values and all ones for positive values. The all
ones value when truncated to the final type will still be all ones like
we want.
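Putting that together, the pattern looks roughly like this in IR (a
sketch, with i32 saturated into i8 as the assumed types):

```llvm
define i8 @sat_u8(i32 %x) {
  %in_range = icmp ult i32 %x, 256
  %pos = icmp sgt i32 %x, 0
  %all_ones = sext i1 %pos to i32 ; 0 for negative X, -1 otherwise
  %sel = select i1 %in_range, i32 %x, i32 %all_ones
  %r = trunc i32 %sel to i8       ; the -1 truncates to 255
  ret i8 %r
}
```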
If the smax removed all negative numbers, then we can treat the smin
like a umin.
If the smin and smax are in the other order we can swap them and use a
vnclipu as long as the smax constant is smaller than the smin constant.
This is based on similar code from X86's detectUSatPattern.
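A sketch of the kind of input, with assumed fixed-width types:

```llvm
declare <4 x i32> @llvm.smax.v4i32(<4 x i32>, <4 x i32>)
declare <4 x i32> @llvm.smin.v4i32(<4 x i32>, <4 x i32>)

; smax clamps out the negatives, so the smin behaves like a umin and
; the pair plus the truncate can become a single vnclipu.
define <4 x i8> @trunc_usat(<4 x i32> %x) {
  %lo = call <4 x i32> @llvm.smax.v4i32(<4 x i32> %x, <4 x i32> zeroinitializer)
  %hi = call <4 x i32> @llvm.smin.v4i32(<4 x i32> %lo, <4 x i32> <i32 255, i32 255, i32 255, i32 255>)
  %r = trunc <4 x i32> %hi to <4 x i8>
  ret <4 x i8> %r
}
```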
Similar to #93596, this moves the signed vnclip patterns into DAG
combine.
This will allow us to support more than one level of truncate in a
future patch.
I plan to add support for multiple layers of vnclipu. For example,
i32->i8 using 2 vnclipu instructions. First clipping to 65535, then
clipping to 255. Similar for signed vnclip.
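As a sketch of where this is headed (my example, assumed types):

```llvm
declare <4 x i32> @llvm.umin.v4i32(<4 x i32>, <4 x i32>)
declare <4 x i16> @llvm.umin.v4i16(<4 x i16>, <4 x i16>)

; i32 -> i8: clamp to 65535 and truncate to i16, then clamp to 255 and
; truncate to i8; each clamp+truncate maps to one vnclipu.
define <4 x i8> @trunc_usat_2step(<4 x i32> %x) {
  %c1 = call <4 x i32> @llvm.umin.v4i32(<4 x i32> %x, <4 x i32> <i32 65535, i32 65535, i32 65535, i32 65535>)
  %t1 = trunc <4 x i32> %c1 to <4 x i16>
  %c2 = call <4 x i16> @llvm.umin.v4i16(<4 x i16> %t1, <4 x i16> <i16 255, i16 255, i16 255, i16 255>)
  %t2 = trunc <4 x i16> %c2 to <4 x i8>
  ret <4 x i8> %t2
}
```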
This scales poorly if we need to add patterns with 2 or 3 truncates.
Instead, move the code to DAGCombiner with new ISD opcodes to represent
VCLIP(U).
This patch just moves the existing patterns into DAG combine. Support
for multiple truncates will come as a follow-up. A similar patch series
will be made for the signed vnclip.
We checked the VL and mask of any additional TRUNCATE_VECTOR_VL
nodes we peek through, but not the outermost.
This moves the check to the outer node and then verifies all the
additional nodes have the same VL and Mask.
Stacked on #93574
Similar for i16 and i64 elements for both fixed and scalable vectors.
This reduces the number of vector instructions, but increases vl/vtype
toggles.
This reduces some code in 525.x264_r from SPEC2017. In that usage, the
vectors are fixed-length with a small number of elements, so vsetivli
can be used.
This is similar to `performMulVectorCmpZeroCombine` from AArch64.
The elements that aren't sNaNs need to get passed through this fadd
instruction unchanged. With the agnostic mask policy they might be
forced to all ones.
I think the behaviors are the same, assuming this description of them
is accurate.
AVGFLOORS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1 before truncating to the original bit width.
This is vaadd with rdn rounding mode.
AVGCEILS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1. If the bit shifted out is 1, it adds 1 to
the shifted value. Then truncates to the original bit width. This is vaadd
with rnu rounding mode.
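In IR terms, the AVGFLOORS input pattern is roughly (a sketch with i8
elements assumed):

```llvm
; AVGFLOORS: trunc(((sext x) + (sext y)) >> 1), i.e. vaadd with rdn.
; For AVGCEILS/rnu, add 1 before the shift so the result rounds up.
define <4 x i8> @avg_floor_s(<4 x i8> %x, <4 x i8> %y) {
  %xe = sext <4 x i8> %x to <4 x i16>
  %ye = sext <4 x i8> %y to <4 x i16>
  %sum = add <4 x i16> %xe, %ye
  %sh = ashr <4 x i16> %sum, <i16 1, i16 1, i16 1, i16 1>
  %r = trunc <4 x i16> %sh to <4 x i8>
  ret <4 x i8> %r
}
```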
I think this wasn't implemented previously because there was some
confusion about what average means. Some may expect average to round
towards zero, but there is no way to do that in RISC-V or with the
SelectionDAG nodes. Related issue
https://github.com/riscv/riscv-v-spec/issues/935
This patch fixes:
llvm/lib/Target/RISCV/RISCVISelLowering.cpp:19848:11: error:
enumeration value 'SW_GUARDED_BRIND' not handled in switch
[-Werror,-Wswitch]
Doing so avoids negative interactions with other combines which don't
know the shl_add is a single instruction. From the commit log, we've had
several combine loops already.
This was originally posted as part of #88791, where a bug was pointed
out. That bug was fixed by #89789 which hits the same issue from another
angle. To confirm the fix, I included the reduced test case here.
Android supports per-thread stack protectors that are individually
managed and initialized, which can provide stronger protections than
using the global stack protector cookie. This patch matches the
convention for other architectures targeting Android platforms.
The vlseg and vsseg intrinsic functions are not overloaded on pointer
type, so they cannot handle non-default address spaces.
This fixes an error we see after #90583.
This moves our last major category of tablegen-driven multiply strength
reduction into the post-legalize combine framework. The one slightly
tricky bit is making sure that we use a leading shl if we can form a
tricky bit is making sure that we use a leading shl if we can form a
slli.uw, and trailing shl otherwise. Having the trailing shl is critical
for shNadd matching, and folding any following sext.w.
As can be seen in the TD deltas, this allows us to kill off both the
actual multiply patterns and the explicit (add (mul X, C), Y) patterns.
The latter are now handled by the generic shNadd matching code, with
the exception of the THead-only C=200 case, because we don't (yet) have
a multiply expansion with two shNadd + a shift.
---------
Co-authored-by: Yingwei Zheng <dtcxzyw@qq.com>
This is the insert_subvector equivalent to #79949, where we can avoid
sliding up by the full LMUL amount if we know the exact subregister the
subvector will be inserted into.
This mirrors the lowerEXTRACT_SUBVECTOR changes in that we handle this
in two parts:
- We handle fixed length subvector types by converting the subvector to
a scalable vector. But unlike EXTRACT_SUBVECTOR, we may also need to
convert the vector being inserted into too.
- Whenever we don't need a vslideup because either the subvector fits
exactly into a vector register group *or* the vector is undef, we need
to emit an insert_subreg ourselves because RISCVISelDAGToDAG::Select
doesn't correctly handle fixed length subvectors yet: see d7a28f7ad
A subvector exactly fits into a vector register group if its size is a
known multiple of the size of a vector register. To help reason about
this, this patch adds a new overload of TypeSize::isKnownMultipleOf for
scalable-to-scalable comparisons.
I've left RISCVISelDAGToDAG::Select untouched for now (minus relaxing an
invariant), so that the insert_subvector and extract_subvector code
paths are the same.
We should teach it to properly handle fixed length subvectors in a
follow-up patch, so that the "exact subregister" logic is handled in
one place instead of being spread across both RISCVISelDAGToDAG.cpp and
RISCVISelLowering.cpp.
We can think of this as two separate combines
(czero_eqz x, (setne y, 0)) -> (czero_eqz x, y)
and
(czero_eqz x, x) -> x
Similarly, the (czero_nez x, (seteq x, 0)) -> x combine can be broken into
(czero_nez x, (seteq y, 0)) -> (czero_eqz x, y)
and
(czero_eqz x, x) -> x
isel already does the (czero_eqz x, (setne y, 0)) -> (czero_eqz x, y)
and (czero_nez x, (seteq y, 0)) -> (czero_eqz x, y) combines, but doing
them early could expose other opportunities.
This patch moves the following intrinsics out of the experimental
namespace:
* vector.interleave2/deinterleave2
* vector.reverse
* vector.splice
All of these intrinsics have existed in LLVM for more than a year and
are widely used, so they should no longer be considered experimental.
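For reference, the intrinsics under their non-experimental names
(fixed-width instantiations shown; treat the exact manglings as
illustrative):

```llvm
declare <8 x i32> @llvm.vector.interleave2.v8i32(<4 x i32>, <4 x i32>)
declare { <4 x i32>, <4 x i32> } @llvm.vector.deinterleave2.v8i32(<8 x i32>)
declare <4 x i32> @llvm.vector.reverse.v4i32(<4 x i32>)
declare <4 x i32> @llvm.vector.splice.v4i32(<4 x i32>, <4 x i32>, i32)
```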
According to the LangRef, llvm.maximum/minimum have -0.0 < +0.0
semantics and propagate NaN.
Expand the nodes on targets that don't support the operation, by adding
an extra check for NaN and using is_fpclass to check the signs of
zeros.
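A minimal example of the semantics involved (illustrative):

```llvm
declare float @llvm.maximum.f32(float, float)

define float @fmax(float %x, float %y) {
  ; Returns the larger value, propagates NaN, and orders -0.0 < +0.0.
  ; Targets without a native instruction now get an expansion with an
  ; extra NaN check plus llvm.is.fpclass to distinguish the zero signs.
  %r = call float @llvm.maximum.f32(float %x, float %y)
  ret float %r
}
```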