Lower it just like the vector [l]lrint, using vfcvt, with the right
rounding mode. Updating costs to account for this custom-lowering is
left to a companion patch.
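A minimal sketch of the lowering, assuming the existing lowerVectorXRINT shape (node name taken from the in-tree RISCVISD namespace; operand order approximate). Since lrint/llrint round according to the current environment, the plain vfcvt form that reads the dynamic frm is the right fit:
```
// Hedged sketch: convert the FP source to integer using the dynamic
// rounding mode (frm), matching lrint/llrint semantics.
SDValue Res = DAG.getNode(RISCVISD::VFCVT_X_F_VL, DL, DstContainerVT,
                          Src, Mask, VL);
```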
Add an LLVMContext parameter to getOptimalMemOpType and
findOptimalMemOpLowering so that getOptimalMemOpType can use
EVT::getVectorVT to build vector EVTs.
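For illustration, a hedged sketch of what the extra parameter enables (hook signature approximate; MinVectorMemOpSize is a hypothetical threshold, not a real flag):
```
// With an LLVMContext in hand, the target hook can return a vector EVT
// even when no simple MVT fits the requested size.
EVT RISCVTargetLowering::getOptimalMemOpType(
    LLVMContext &Context, const MemOp &Op,
    const AttributeList &FuncAttributes) const {
  if (Op.size() >= MinVectorMemOpSize) // hypothetical threshold
    return EVT::getVectorVT(Context, MVT::i8, Op.size());
  return MVT::Other;
}
```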
Related to [#146673](https://github.com/llvm/llvm-project/pull/146673).
As noted in post-commit review, the API change here was not required.
I'd apparently confused myself when teasing apart patches from my
development branch.
For the fixed vector cases, we already support this, but the
deinterleave intrinsic cases (primarily used by scalable vectors) didn't.
Supporting it requires plumbing through the Factor separately from the
extracts, as there can now be fewer extracts than the Factor. Note that
the fixed vector path handles this slightly differently - it uses the
shuffle and indices scheme to achieve the same thing.
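A hypothetical sketch of the plumbing (DeinterleaveValues, useSegmentLoadField, and the surrounding shape are illustrative names, not the actual API):
```
// The factor is passed alongside the extracted values because dead
// results leave null slots, so it can no longer be inferred from the
// number of live extracts.
void lowerDeinterleaveSketch(ArrayRef<Value *> DeinterleaveValues,
                             unsigned Factor) {
  assert(DeinterleaveValues.size() == Factor && "one slot per field");
  for (unsigned I = 0; I != Factor; ++I)
    if (Value *V = DeinterleaveValues[I]) // dead results are null
      useSegmentLoadField(V, I);          // hypothetical helper
}
```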
Extend lowerVectorXRINT to also emit an FP_EXTEND_VL when the source
element type is [b]f16, and wire up this custom promotion. Updating the
cost-model to not give these an invalid cost is left to a companion
patch.
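A minimal sketch of the promotion step, assuming it sits inside lowerVectorXRINT (names approximate):
```
// Widen (b)f16 sources to f32 first; the existing f32/f64 conversion
// path then takes over.
if (SrcEltVT == MVT::f16 || SrcEltVT == MVT::bf16) {
  MVT WideVT = SrcContainerVT.changeVectorElementType(MVT::f32);
  Src = DAG.getNode(RISCVISD::FP_EXTEND_VL, DL, WideVT, Src, Mask, VL);
}
```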
When vectorizing a loop with a fixed-order recurrence we use a splice,
which gets lowered to a vslidedown and vslideup pair.
However, the way we lower it today ends up with extra vl toggles in the
loop, especially with EVL tail folding, e.g.:
```
.LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
sub a5, a2, a3
sh2add a6, a3, a1
zext.w a7, a4
vsetvli a4, a5, e8, mf2, ta, ma
vle32.v v10, (a6)
addi a7, a7, -1
vsetivli zero, 1, e32, m2, ta, ma
vslidedown.vx v8, v8, a7
sh2add a6, a3, a0
vsetvli zero, a5, e32, m2, ta, ma
vslideup.vi v8, v10, 1
vadd.vv v8, v10, v8
add a3, a3, a4
vse32.v v8, (a6)
vmv2r.v v8, v10
bne a3, a2, .LBB0_5
```
Because the vslideup overwrites all but UpOffset elements from the
vslidedown, we currently set the vslidedown's AVL to that offset.
But the vslideup uses either VLMAX or the EVL, which causes a toggle.
This patch increases the AVL of the vslidedown so it matches the
vslideup's, even though the extra elements are immediately overwritten,
to avoid the toggle.
A new tuning feature +vl-dependent-latency has been added which keeps
the old behaviour for microarchitectures that dynamically dispatch uops
based on vl, e.g. sifive-x280.
+vl-dependent-latency can be reused for the recently proposed Ovlt
optimization directive if/when it's ratified:
https://lists.riscv.org/g/tech-privileged/message/2487
If we wanted to aggressively optimise for vl at the expense of
introducing more toggles we could probably look at doing this in
RISCVVLOptimizer.
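A hedged sketch of the resulting AVL choice (the subtarget helper is named after the new feature; exact spelling may differ):
```
// Keep the minimal AVL only when latency actually depends on vl;
// otherwise match the vslideup's AVL to avoid a vsetvli toggle.
SDValue DownAVL = Subtarget.hasVLDependentLatency()
                      ? DAG.getConstant(UpOffset, DL, XLenVT)
                      : EVL;
```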
XAndesVPackFPH can actually be used independently, without requiring
Zvfhmin. Therefore, we remove the implicitly required Zvfhmin extension
from XAndesVPackFPH and make it imply only the F extension, which is
sufficient.
Always try to fold freeze(op(...)) -> op(freeze(),freeze(),freeze(),...).
This patch proposes dropping the opt-in list of opcodes that are allowed to push a freeze through the op to freeze all of its operands, propagating freezes up through the tree towards the roots.
I'm struggling to find a strong reason for this limit apart from the DAG freeze handling having been immature for so long; as we've improved coverage in canCreateUndefOrPoison/isGuaranteedNotToBeUndefOrPoison, the regressions no longer look severe.
Hopefully this will help some of the regression issues in #143102 etc.
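The combine itself is small; a minimal sketch of its shape, built on the existing SelectionDAG helpers (hedged, not the verbatim DAGCombiner code):
```
// Freeze every maybe-poison operand of N0, then rebuild N0 so that
// freeze(N0) can be replaced with the refrozen node.
SmallVector<SDValue, 4> Ops;
for (SDValue Op : N0->op_values())
  Ops.push_back(DAG.isGuaranteedNotToBeUndefOrPoison(Op)
                    ? Op
                    : DAG.getFreeze(Op));
return DAG.getNode(N0.getOpcode(), SDLoc(N0), N0->getVTList(), Ops);
```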
Make the fixed-vector lowering of ISD::[L]LRINT use the custom-lowering
routine, lowerVectorXRINT, and fix issues in lowerVectorXRINT related to
this new functionality.
Based on the comments and tests, we only want to call
EmitLoweredCascadedSelect on selects of FP registers.
Every time we add a new branch-with-immediate opcode, we've had to
exclude it here.
This patch switches to checking that the comparison operands are both
registers so branch on immediate is automatically excluded.
This wasn't scalable and made the RISCVCC enum effectively just
a different way of spelling the branch opcodes.
This patch reduces RISCVCC back down to 6 enum values. The primary user
is select pseudoinstructions, which now share the same encoding across
all vendor extensions. The select opcode and condition code are used to
determine the branch opcode when expanding the pseudo.
The Cond SmallVector returned by analyzeBranch now contains the branch
opcode instead of the RISCVCC. reverseBranchCondition now works directly
on opcodes. getOppositeBranchCondition is also retained.
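For reference, a hedged sketch of the expansion-time mapping (the six condition codes are the ones this patch keeps; the dispatch below is illustrative):
```
// Map the condition code stored in the select pseudo to a base-ISA
// branch opcode; vendor select pseudos would pick their own branches.
static unsigned getBranchOpcodeForCC(RISCVCC::CondCode CC) {
  switch (CC) {
  case RISCVCC::COND_EQ:  return RISCV::BEQ;
  case RISCVCC::COND_NE:  return RISCV::BNE;
  case RISCVCC::COND_LT:  return RISCV::BLT;
  case RISCVCC::COND_GE:  return RISCV::BGE;
  case RISCVCC::COND_LTU: return RISCV::BLTU;
  case RISCVCC::COND_GEU: return RISCV::BGEU;
  default:
    llvm_unreachable("unexpected condition code");
  }
}
```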
Stacked on #145622
As of 20b5728b7b1ccc4509a316efb270d46cc9526d69, C always enables Zca, so
the check `C || Zca` is equivalent to just checking for `Zca`.
This replaces any uses of `HasStdExtCOrZca` with a new `HasStdExtZca`
(with the same assembler description, to avoid changes in error
messages), and simplifies everywhere where C++ needed to check for
either C or Zca.
The Subtarget function is just deprecated for the moment.
Reland with the proper co-author attribution.
Original message:
We need to pass the operand of LLA to GetSupportedConstantPool.
This replaces #142292, with the test from there added as a pre-commit
for both medlow and pic.
Co-authored-by: Carl Nettelblad carl.nettelblad@rapidity-space.com
I happened to notice that when legalizing get.active.lane.mask with
large vectors we were materializing via constant pool instead of just
shifting by a constant.
We should probably be doing a full cost comparison of the different
lowering strategies as opposed to our current ad hoc heuristics, but the
few cases this regresses seem pretty minor. (Given the reduction in
vsetvli toggles, they might not be regressions at all.)
---------
Co-authored-by: Craig Topper <craig.topper@sifive.com>
We can convert non-power-of-2 types into extended value types
and then they will be widened.
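For illustration (hedged; Context is an LLVMContext assumed in scope): a non-power-of-2 load size simply becomes a non-simple integer EVT, which type legalization widens for us.
```
// e.g. a 3-byte chunk: i24 is a valid extended value type with no MVT,
// and the legalizer widens i24 operations to i32.
EVT PartVT = EVT::getIntegerVT(Context, 24);
assert(!PartVT.isSimple() && "i24 has no simple MVT; it will be widened");
```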
Reviewers: lukel97
Reviewed By: lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/114971
Put one copy on RISCVTargetLowering as a static function so that both
locations can use it, and rename the method to getM1VT for slightly
improved readability.
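A minimal sketch of the helper, mirroring what the two existing copies computed:
```
// getM1VT: the scalable container type with the same element type whose
// known minimum size is one vector register (LMUL = 1).
static MVT getM1VT(MVT VT) {
  unsigned EltSizeInBits = VT.getVectorElementType().getSizeInBits();
  assert(EltSizeInBits <= 64 && "unexpected element size");
  return MVT::getScalableVectorVT(VT.getVectorElementType(),
                                  RISCV::RVVBitsPerBlock / EltSizeInBits);
}
```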
See #143580 for the MR with the test commit.
Performs the following transformations:
(select c, c1, t) -> (add (czero_nez (t - c1), c), c1)
(select c, t, c1) -> (add (czero_eqz (t - c1), c), c1)
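A scalar model of why the first transform is sound (illustrative C++, not the backend code; the czero_eqz case is symmetric):
```
#include <cstdint>

// czero.nez rd, rs1, rs2:  rd = (rs2 != 0) ? 0 : rs1
// Models (select c, c1, t) -> (add (czero_nez (t - c1), c), c1).
int64_t select_via_czero_nez(int64_t c, int64_t c1, int64_t t) {
  int64_t diff = t - c1;
  int64_t z = (c != 0) ? 0 : diff; // czero.nez
  return z + c1;                   // c != 0 -> c1; c == 0 -> t
}
```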
This patch adds the support of generating vector instructions for
`memcmp`. This implementation is inspired by X86's.
We convert integer comparisons (eq/ne only) into vector comparisons
and do a vector reduction to get the result.
The range of supported load sizes is (XLEN, VLEN * LMUL8] and
non-power-of-2 types are not supported.
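A scalar model of the eq/ne expansion (illustrative; the backend emits vector loads, a lane-wise compare, and a reduction instead):
```
#include <cstddef>
#include <cstdint>

// Combine lane-wise differences, then reduce to a single bit.
bool memcmp_eq_model(const uint8_t *A, const uint8_t *B, size_t N) {
  uint8_t Acc = 0;
  for (size_t I = 0; I < N; ++I)
    Acc |= A[I] ^ B[I]; // any differing byte makes Acc non-zero
  return Acc == 0;
}
```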
Fixes #143294.
Reviewers: lukel97, asb, preames, topperc, dtcxzyw
Reviewed By: topperc, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/114517
This involves a codegen regression at the moment due to the issue
described in 443cdd0b, but this aligns the lowering paths for this case
and makes it less likely future bugs go undetected.
We have recently added the partial_reduce_smla and partial_reduce_umla
nodes to represent Acc += ext(a) * ext(b) where the two extends have to
have the same source type and the same extend kind.
For riscv64 w/zvqdotq, we have the vqdot and vqdotu instructions which
correspond to the existing nodes, but we also have vqdotsu, which
represents the case where the two extends are sign and zero respectively
(i.e. not the same kind of extend).
This patch adds a partial_reduce_sumla node which has sign extension for
A, and zero extension for B. The addition is somewhat mechanical.
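A scalar model of the new node's semantics (illustrative):
```
#include <cstdint>

// partial_reduce_sumla: Acc += sext(A[i]) * zext(B[i]); with zvqdotq
// this maps to vqdotsu (A sign-extended, B zero-extended).
int32_t sumla_model(int32_t Acc, const int8_t A[4], const uint8_t B[4]) {
  for (int I = 0; I < 4; ++I)
    Acc += int32_t(A[I]) * int32_t(B[I]); // sext * zext
  return Acc;
}
```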
Trampoline will use an alternative sequence when branch CFI is on.
The stack of the test is organized as follows:
```
56 $ra
44 $a0 f
36 $a1 p
32 00038067 jalr t2
28 010e3e03 ld t3, 16(t3)
24 018e3383 ld t2, 24(t3)
20 00000e17 auipc t3, 0
sp+16 00000023 lpad 0
```