Godbolt example: https://godbolt.org/z/ThdfP475a
In the example, a single-element vse is used to store the reduction result
instead of a scalar store ([this optimization was introduced by this
patch](https://reviews.llvm.org/D109482)). However, vmv.x.s can't be
eliminated here because it has other uses (e.g. CopyToReg), so it seems
more profitable to use a scalar store: we already have the store value in a
scalar register, and we save one vsetvli which is likely to be required
for the single-element vse. The proposed solution is to only do this
transform if vmv.x.s has a single use (in the store instruction).
The check should be about unsigned 16-bit immediates, not signed ones.
This is not a bug per se, as the old codegen was correct for the
uint16_max case; it just didn't end up using `qc.e.bgeui`, which we
would prefer it did.
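A minimal, self-contained sketch of the intended range check, written without LLVM's isUInt/isInt helpers (the helper names and values below are purely illustrative):

```cpp
#include <cstdint>

// qc.e.bgeui takes an unsigned immediate, so the check should accept the
// full 0..65535 range rather than the signed -32768..32767 range.
constexpr bool fitsUnsigned16(int64_t Imm) { return Imm >= 0 && Imm <= 0xFFFF; }
constexpr bool fitsSigned16(int64_t Imm) { return Imm >= -32768 && Imm <= 32767; }

static_assert(fitsUnsigned16(65535));  // uint16_max: should select qc.e.bgeui
static_assert(!fitsSigned16(65535));   // the old signed check rejected it, so
                                       // qc.e.bgeui wasn't used
```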
The information about whether a specific argument is vararg or fixed is
currently stored separately from all the other argument information,
which lives in ArgFlags. This means that it is not accessible from
CCAssign, and backends have developed all kinds of workarounds to access
it anyway.
Move this information to ArgFlags to make it directly available in all
relevant places.
I've opted to invert this and store it as IsVarArg, as I think this both
makes the meaning more obvious and provides a better default
(IsVarArg=false).
A lot of fixed-length custom lowerings just involve inserting the
operands into a scalable container and extracting the result out, and
lowerToScalableOp already does this.
We just need to teach it to handle operands with different element types
(but same vector element count), and we can reuse it for
vselect/zext/sext/setcc/fcopysign.
This reverts commit fe4f6c1a58ab4f00a88a97af01000b6783b573ee, but leaves
the tests that were added.
The original commit mistakenly assumed that if regular bf16/f16 loads
and stores could be lowered without zvfbfmin/zvfhmin, then so too could
masked loads/stores and gathers/scatters.
However, SelectionDAG can't actually type-legalize masked loads/stores;
that needs to be done in ScalarizeMaskedMemIntrinPass.
This was causing crashes on IREE because we now returned true for
isLegalMaskedLoadStore.
The original intent of this was to remove a discrepancy in the loop
vectorizer tests whenever predication was enabled, but this has gone
away after 92d09245d61dce80d3e68a27cc34d5fc6f062c93. So I don't think we
need to reapply this patch.
This is a rewrite of the current strided store optimization to be a DAG
combine. This allows it to kick in slightly more broadly, in particular
for the scalable lowering paths.
This reworks an existing optimization on the fixed vector (shuffle
based) deinterleave lowering into a DAG combine. This has the effect of
making it kick in much more widely - in particular on the deinterleave
intrinsic (i.e. scalable) path, deinterleaveN (without load) lowering,
but also the intrinsic lowering paths.
When vectorizing with predication, some loops that were previously
vectorized without zvfhmin/zvfbfmin will no longer be vectorized because
the masked load/store or gather/scatter cost is returned as invalid.
This is due to a discrepancy where for these costs we check
isLegalElementTypeForRVV but for regular memory accesses we don't.
But for bf16 and f16 vectors we don't actually need the extension
support for loads and stores, so this adds a new function which takes
this into account.
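A self-contained sketch of the distinction being drawn here (the enum and helper names are illustrative, not the actual TLI/TTI API):

```cpp
enum class EltTy { I8, I16, I32, F16, BF16, F32 };

// Roughly what isLegalElementTypeForRVV checks: f16/bf16 need the
// zvfhmin/zvfbfmin conversion extensions (other details elided).
bool legalElementType(EltTy T, bool HasZvfhmin, bool HasZvfbfmin) {
  if (T == EltTy::F16)
    return HasZvfhmin;
  if (T == EltTy::BF16)
    return HasZvfbfmin;
  return true;
}

// The load/store-only variant this patch adds: plain and masked loads/stores
// of f16/bf16 just move bits, so they remain legal without those extensions.
bool legalLoadStoreElementType(EltTy T) {
  (void)T;
  return true; // f16/bf16 accepted regardless (sketch; ELEN checks omitted)
}
```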
For regular memory accesses we should probably also e.g. return an
invalid cost for i64 elements on zve32x, but it doesn't look like we
have tests for this yet.
We also should probably not be vectorizing these bf16/f16 loops to begin
with if we don't have zvfhmin/zvfbfmin and zfhmin/zfbfmin. I think this
is due to the scalar costs being too cheap. I've added tests for this in
a100f6367205c6a909d68027af6a8675a8091bd9 to fix in another patch.
We use `lh` to load 2 bytes from memory into a GPR, OR this GPR with
-65536 (0xffff0000) to emulate the nan-boxing behavior, and then move the
value from the GPR to an FPR using `fmv.w.x`.
To move the value back from the FPR to a GPR we use `fmv.x.w`, and
finally `sh` is used to store the lower 2 bytes back to memory.
If zfh is enabled at the same time, we can just use flh/fsh to load/store
bf16 directly.
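A self-contained illustration of the bit pattern being emulated (this mirrors what the GPR sequence achieves; it is not the lowering code itself):

```cpp
#include <cstdint>

// NaN-boxing: a 16-bit value held in a 32-bit FP register must have its
// upper 16 bits set to all ones, which is what combining the `lh` result
// with 0xffff0000 (-65536) achieves before the `fmv.w.x`.
uint32_t nanBoxBF16(uint16_t Bits) { return 0xFFFF0000u | Bits; }

// Storing back only needs the low 16 bits, which is exactly what `sh` keeps.
uint16_t unboxBF16(uint32_t Boxed) { return static_cast<uint16_t>(Boxed); }
```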
Spotted while reviewing #150211. If we're multiplying by -3 in i32,
MulAmt contains 4,294,967,293 since we zero extend to uint64_t. Adding 3
to this gives 0x100000000, which is a power of 2, and the log2 of that is
32, but we can't shift left by 32 in an i32.
Detect this case and skip the transform. We could use 0, but we don't
handle the case for i64 so this seemed more consistent.
Normally we don't hit this case because decomposeMulByConstant handles
it, but that's disabled by Xqciac. And after #150211 the code in
expandMul is now unreachable for this case.
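The arithmetic spelled out as compile-time checks (values illustrative):

```cpp
#include <cstdint>

// -3 as an i32 immediate, zero-extended into the uint64_t MulAmt:
constexpr uint64_t MulAmt = static_cast<uint32_t>(-3); // 0xFFFFFFFD = 4294967293
static_assert(MulAmt + 3 == 0x100000000ULL);           // a power of 2 with Log2 == 32,
                                                       // not a valid i32 shift amount
```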
There was a mismatch between isFPImmLegal and the cases that are handled
by lowerConstantFP. isFPImmLegal didn't check for the case where we
support `fli` of a negated constant (and so can lower to fli+fneg). This
has very minimal impact (42 insertions, 47 deletions across an
rva22u64_zfa llvm-test-suite build including SPEC CPU 2017) but is added
here for completeness.
See the PR thread https://github.com/llvm/llvm-project/pull/149075 for further discussion about the degree to which isFPImmLegal and lowerConstantFP are consistent. We ultimately agreed it makes sense to add fli+fneg, but there may be other future cases where it doesn't make sense to match.
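A rough, self-contained sketch of the relaxed legality rule (the constant table below is only an illustrative subset of the Zfa `fli` immediates, and the helper names are made up):

```cpp
#include <initializer_list>

// With Zfa, `fli` materializes constants from a fixed table. lowerConstantFP
// also handles the negation of any table entry via fli+fneg, so isFPImmLegal
// should accept both forms.
bool isFLITableConstant(double D) {
  for (double C : {0.25, 0.5, 1.0, 2.0, 4.0, 8.0}) // illustrative subset
    if (D == C)
      return true;
  return false;
}

bool isFPImmLegalSketch(double D) {
  return isFLITableConstant(D) || isFLITableConstant(-D); // fli, or fli+fneg
}
```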
The change implements custom lowering of `get_fpmode`, `set_fpmode` and
`reset_fpmode` for the RISC-V target. The implementation is aligned with
the `fegetmode` and `fesetmode` functions in glibc.
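For reference, a minimal usage sketch of the glibc counterparts (`fegetmode`/`fesetmode` are GNU extensions declared in `<fenv.h>`; the mapping to the intrinsics is semantic, not the generated code):

```cpp
#include <fenv.h>

void fpModeRoundTrip() {
  femode_t Saved;
  fegetmode(&Saved);      // what get_fpmode corresponds to
  fesetmode(FE_DFL_MODE); // reset_fpmode: restore the default dynamic FP mode
  // ... code that assumes the default rounding/exception-mode settings ...
  fesetmode(&Saved);      // set_fpmode
}
```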
XAndesBFHCvt provides two builtin functions for converting between
float and bf16. Users can use them to convert bf16 values loaded from
memory to float, perform arithmetic operations, and then convert the
results back to bf16 and store them to memory.
The load/store and move operations for bf16 will be handled in a later
patch.
Follow up on the work from e5bc7e7d, and extend it to the lowering used
for interleave and deinterleave when we can't combine with a nearby
memory operation.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
There have been discussions on splitting RISCVISelLowering.cpp. I think
the InterleavedAccess-related TLI hooks would be some of the low-hanging
fruit, as they're relatively isolated and also because X86 is already
doing this.
NFC.
Add basic isel patterns for the multiply-accumulate QC.MULIADD
instruction.
While most cases work with just the TD file pattern, there are a few
cases which need to be handled in ISelLowering, depending on the
immediate we are multiplying with:
- imm + 1, imm - 1, 1 - imm, or -1 - imm is a power of 2 --> these become
slli and add/sub
- the immediate is 2^n - 2^m --> this becomes (add/sub (shl X, C1), (shl X,
C2))
- imm - 2, imm - 4, or imm - 6 is a power of 2 --> these use shxadd when
zba is enabled
The patch does not decompose mul for the above conditions if Xqciac is
present. There could be cases where this is not beneficial, which I plan
to address in follow-up patches.
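The identities behind the listed cases, as compile-time checks with example immediates (values illustrative; the actual selection logic lives in ISelLowering):

```cpp
static_assert(5 * 7  == (5 << 3) - 5);        // imm + 1 is a power of 2: slli + sub
static_assert(5 * 9  == (5 << 3) + 5);        // imm - 1 is a power of 2: slli + add
static_assert(5 * -7 == 5 - (5 << 3));        // 1 - imm is a power of 2
static_assert(5 * -9 == -(5 << 3) - 5);       // -1 - imm is a power of 2
static_assert(5 * 24 == (5 << 5) - (5 << 3)); // imm = 2^5 - 2^3: two shifts + sub
static_assert(5 * 10 == (5 << 1) + (5 << 3)); // imm - 2 is a power of 2: sh1add of a
                                              // shifted value when zba is enabled
```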
Lower it just like the vector [l]lrint, using vfcvt with the right
rounding mode. Updating costs to account for this custom lowering is
left to a companion patch.
Add an LLVMContext parameter to getOptimalMemOpType and
findOptimalMemOpLowering so that we can use EVT::getVectorVT to
construct vector EVTs in getOptimalMemOpType.
Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
As noted in post commit review, the API change here was not required.
I'd apparently confused myself when teasing apart patches from my
development branch.
For the fixed vector cases, we already support this, but the
deinterleave intrinsic cases (primarily used by scalable vectors) didn't.
Supporting it requires plumbing through the Factor separately from the
extracts, as there can now be fewer extracts than the Factor. Note that
the fixed vector path handles this slightly differently - it uses the
shuffle and indices scheme to achieve the same thing.
Extend lowerVectorXRINT to also do an FP_EXTEND_VL when the source
element type is [b]f16, and wire up this custom promotion. Updating the
cost model to not give these an invalid cost is left to a companion
patch.
When vectorizing a loop with a fixed-order recurrence we use a splice,
which gets lowered to a vslidedown and vslideup pair.
However, with the way we lower it today we end up with extra vl toggles
in the loop, especially with EVL tail folding, e.g.:
```asm
.LBB0_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sub     a5, a2, a3
        sh2add  a6, a3, a1
        zext.w  a7, a4
        vsetvli a4, a5, e8, mf2, ta, ma
        vle32.v v10, (a6)
        addi    a7, a7, -1
        vsetivli        zero, 1, e32, m2, ta, ma
        vslidedown.vx   v8, v8, a7
        sh2add  a6, a3, a0
        vsetvli zero, a5, e32, m2, ta, ma
        vslideup.vi     v8, v10, 1
        vadd.vv v8, v10, v8
        add     a3, a3, a4
        vse32.v v8, (a6)
        vmv2r.v v8, v10
        bne     a3, a2, .LBB0_5
```
Because the vslideup overwrites all but UpOffset elements from the
vslidedown, we currently set the vslidedown's AVL to said offset.
But in the vslideup we use either VLMAX or the EVL, which causes a
toggle.
This patch increases the AVL of the vslidedown so it matches the
vslideup, even if the extra elements are overwritten, to avoid the toggle.
A new tuning feature +vl-dependent-latency has been added which keeps
the old behaviour for microarchitectures that dynamically dispatch uops
based on vl, e.g. sifive-x280.
+vl-dependent-latency can be reused for the recently proposed Ovlt
optimization directive if/when it's ratified:
https://lists.riscv.org/g/tech-privileged/message/2487
If we wanted to aggressively optimize for vl at the expense of
introducing more toggles, we could probably look at doing this in
RISCVVLOptimizer.
XAndesVPackFPH can actually be used independently without requiring
Zvfhmin. Therefore, remove the implicit Zvfhmin requirement from
XAndesVPackFPH and have it imply just the F extension, which is
sufficient.