This implementation is based on what the X86 target does, but emits the
instructions that GCC emits for rv64.
---------
Co-authored-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com>
Porting to TTI provides direct access to the instruction cost model,
which can enable instruction cost based sinking without introducing code
duplication.
Support for the large code model was added recently; semantically, direct
calls are lowered to an indirect branch with a constant pool target.
By default this does not use the x7 register, which is suboptimal with
Zicfilp because it introduces a landing pad check that is unnecessary
since the constant pool is read-only and unlikely to be tampered with.
Change direct calls and tail calls to use x7 as the scratch
register (a.k.a. a software-guarded branch in the CFI spec).
A recent atomics ABI change/fix requires that for the "A6C" and "A6S"
atomics ABIs (i.e. both of those supported by LLVM currently), an
additional fence is inserted for an atomic_compare_exchange with seq_cst
failure ordering.
<https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/445>
This isn't trivial to support through the hooks used by AtomicExpandPass
because that pass assumes that when fences are inserted, the original
atomics ordering information can be removed from the instruction. Rather
than try to change and complicate that API, this patch implements the
needed fence insertion through a small special purpose pass.
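For reference, a minimal IR example of the affected case (function name is illustrative):
```
; cmpxchg with seq_cst failure ordering: under the A6C/A6S ABIs the new pass
; inserts the additional fence required by the psABI change.
define i32 @cas_seq_cst_failure(ptr %p, i32 %expected, i32 %desired) {
  %pair = cmpxchg ptr %p, i32 %expected, i32 %desired seq_cst seq_cst
  %old = extractvalue { i32, i1 } %pair, 0
  ret i32 %old
}
```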
Most of the constants fli can generate are positive numbers. We can use
fli+fneg to generate their negative versions.
Previously, we considered such negative constants as "legal" and let
isel generate the fli+fneg. However, it is useful to expose the fneg to
DAG combines to fold with fadd to produce fsub or with fma to produce
fnmadd, fnmsub, or fmsub.
This patch moves the fneg creation to lowering so that the fneg will be
visible to the last DAG combine.
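A hedged sketch of the kind of IR that benefits (the constant -0.5 is just one example whose magnitude fli can materialize):
```
; +0.5 is an fli-representable immediate. Lowering -0.5 as fneg(fli 0.5) and
; keeping the fneg visible lets DAG combine fold it with the fadd (e.g. into
; an fsub of +0.5) instead of materializing the negative constant separately.
define double @fadd_negative_fli_imm(double %x) {
  %r = fadd double %x, -5.000000e-01
  ret double %r
}
```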
I might move the rest of the Zfa handling from isel to lowering as a
follow-up.
Fixes #107772.
These are used by both SelectionDAG and GlobalISel and are separate from
RISCVTargetLowering.
Having a separate file matches how other targets are structured, though
other targets generate most of their calling convention code through tablegen.
I moved the `CC_RISCV` functions from the `llvm::RISCV` namespace to
`llvm::`. That's what the tablegen code on other targets does and the
functions already have RISCV in their name. `RISCVCCAssignFn` is moved
from `RISCVTargetLowering` to the `llvm` namespace.
The first mask vector operand is supposed to be assigned to V0. No other
vector types will be assigned to V0. We don't need to pre-assign; we can
just try V0 first for any mask vectors during normal processing.
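For example (a sketch, not taken from the patch), the first mask argument below is simply tried against V0 during normal argument assignment:
```
; %m is the first mask vector argument and is assigned to v0; the data
; vector arguments use the regular vector argument registers.
define <vscale x 4 x i32> @select_masked(<vscale x 4 x i1> %m, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
  %r = select <vscale x 4 x i1> %m, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b
  ret <vscale x 4 x i32> %r
}
```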
Use isel patterns on regular FP_ROUND. For double->bf16 we need
to emit two instructions. Note the double->bf16 conversion does
double rounding, but I don't know a good way to fix that.
I don't think we need this node. We can isel fp_extend directly.
fp_extend to f64 requires two instructions, but we can emit them with an
isel pattern.
I have not removed RISCVISD::FP_ROUND_BF16 because f64->bf16 needs more
work to fix the double rounding.
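Hedged IR examples of the two conversions discussed (function names are illustrative):
```
; f64 -> bf16 needs two instructions (f64 -> f32, then f32 -> bf16) and
; currently double-rounds.
define bfloat @round_f64_to_bf16(double %x) {
  %r = fptrunc double %x to bfloat
  ret bfloat %r
}

; bf16 -> f64 also needs two instructions (bf16 -> f32, then f32 -> f64),
; selected directly from fp_extend.
define double @extend_bf16_to_f64(bfloat %x) {
  %r = fpext bfloat %x to double
  ret double %r
}
```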
This patch handles target lowering and calling convention.
For target lowering, the vector tuple type, previously represented as
multiple scalable vectors, is now changed to a single `MVT`; each `MVT` has
a corresponding register class.
Loads/stores of vector tuples are handled the same way, but need additional
vector insert/extract instructions to access the sub-register groups.
Inline assembly constraint for vector tuple type can directly be modeled
as "vr" which is identical to normal vector registers.
For the calling convention, we no longer need an alternative algorithm to
handle register allocation; this makes the code easier to maintain and
read.
Stacked on https://github.com/llvm/llvm-project/pull/97994
- [AArch64]: TargetLowering is updated to spot load/store (de)interleave4-like sequences using PatternMatch,
and emit equivalent sve.ld4 and sve.st4 intrinsics.
This patch folds `fmul X, (fcopysign 1.0, Y)` into `fsgnjx X, Y`. This
pattern exists in some graphics applications/math libraries.
Alive2: https://alive2.llvm.org/ce/z/epyL33
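A minimal IR form of the pattern (illustrative function name):
```
; fmul X, (fcopysign 1.0, Y) --> fsgnjx X, Y
define double @fsgnjx_pattern(double %x, double %y) {
  %sign = call double @llvm.copysign.f64(double 1.000000e+00, double %y)
  %r = fmul double %x, %sign
  ret double %r
}
declare double @llvm.copysign.f64(double, double)
```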
Since fpimm +1.0 is lowered to a load from constant pool after
OpLegalization, I have to introduce a new RISCVISD node FSGNJX and fold
this pattern in DAGCombine.
Closes https://github.com/dtcxzyw/llvm-opt-benchmark/issues/1072.
These new opcodes drop the shift amount, rounding mode, and passthru,
making them exactly like TRUNCATE_VECTOR_VL. The shift amount, rounding
mode, and passthru are added in isel patterns similar to how we
translate TRUNCATE_VECTOR_VL to vnsrl with a shift of 0.
This should simplify #99418 a little.
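For context, a plain vector truncate such as the one below is already selected as a vnsrl with a shift amount of 0; the new opcodes let the narrowing shifts share that path:
```
; Selected as vnsrl.wi vd, vs, 0, i.e. a narrowing shift by zero.
define <vscale x 4 x i8> @trunc_nxv4i16_to_nxv4i8(<vscale x 4 x i16> %x) {
  %r = trunc <vscale x 4 x i16> %x to <vscale x 4 x i8>
  ret <vscale x 4 x i8> %r
}
```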
This change is a preliminary step to support trampolines on RISC-V. Trampolines are used by flang to implement obtaining the address of an internal program (i.e., a nested function in Fortran parlance).
In this change we lower `llvm.clear_cache` intrinsic on glibc targets to
`__riscv_flush_icache` which is what GCC is currently doing for Linux targets.
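A hedged sketch of the IR involved:
```
; On glibc Linux targets this now lowers to a call to __riscv_flush_icache.
define void @flush_range(ptr %begin, ptr %end) {
  call void @llvm.clear_cache(ptr %begin, ptr %end)
  ret void
}
declare void @llvm.clear_cache(ptr, ptr)
```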
I plan to add support for multiple layers of vnclipu. For example,
i32->i8 using 2 vnclipu instructions. First clipping to 65535, then
clipping to 255. Similar for signed vnclip.
This scales poorly if we need to add patterns with 2 or 3 truncates.
Instead, move the code to DAGCombiner with new ISD opcodes to represent
VCLIP(U).
This patch just moves the existing patterns into DAG combine. Support
for multiple truncates will be added as a follow-up. A similar patch
series will be made for the signed vnclip.
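For reference, a hedged example of the single-layer unsigned pattern being moved (the planned i32->i8 case would chain two clips):
```
; Unsigned saturating truncate i16 -> i8: umin with 255 followed by trunc,
; which maps to one vnclipu. An i32 -> i8 version would clip to 65535 and
; then to 255 using two vnclipu instructions.
define <vscale x 4 x i8> @usat_trunc_nxv4i16(<vscale x 4 x i16> %x) {
  %elt = insertelement <vscale x 4 x i16> poison, i16 255, i32 0
  %maxval = shufflevector <vscale x 4 x i16> %elt, <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
  %clamped = call <vscale x 4 x i16> @llvm.umin.nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %maxval)
  %r = trunc <vscale x 4 x i16> %clamped to <vscale x 4 x i8>
  ret <vscale x 4 x i8> %r
}
declare <vscale x 4 x i16> @llvm.umin.nxv4i16(<vscale x 4 x i16>, <vscale x 4 x i16>)
```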
I think the behaviors are the same, if the following correctly describes them.
AVGFLOORS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1 before truncating to the original bit width.
This is vaadd with rdn rounding mode.
AVGCEILS sign extends the inputs by 1 bit, adds them, then does an
arithmetic shift right by 1. If the bit shifted out is 1, it adds 1 to
the shifted value. Then truncates to the original bit width. This is vaadd
with rnu rounding mode.
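A signed counterpart of the unsigned examples further below, matching the AVGFLOORS description (illustrative function name):
```
; Signed floor average: sign extend by one bit, add, arithmetic shift right
; by 1, truncate back. This is vaadd with the rdn rounding mode.
define <vscale x 8 x i8> @vaadd_floor(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y) {
  %xs = sext <vscale x 8 x i8> %x to <vscale x 8 x i16>
  %ys = sext <vscale x 8 x i8> %y to <vscale x 8 x i16>
  %add = add nsw <vscale x 8 x i16> %xs, %ys
  %elt = insertelement <vscale x 8 x i16> poison, i16 1, i32 0
  %one = shufflevector <vscale x 8 x i16> %elt, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
  %shift = ashr <vscale x 8 x i16> %add, %one
  %r = trunc <vscale x 8 x i16> %shift to <vscale x 8 x i8>
  ret <vscale x 8 x i8> %r
}
```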
I think this wasn't implemented previously because there was some
confusion about what average means. Some may expect average to round
towards zero, but there is no way to do that in RISC-V or with the
SelectionDAG nodes. Related issue
https://github.com/riscv/riscv-v-spec/issues/935
The cost of `experimental.cttz.elts` on RISC-V equals the cost of
vfirst when the zero_is_poison argument is true. Otherwise, we add
additional costs of cmp + select to convert the -1 result from vfirst to
EVL.
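For reference, a hedged example of the intrinsic being costed:
```
; With zero_is_poison = true this is a single vfirst. With false (as here),
; the -1 returned by vfirst for an all-zero mask additionally needs a
; compare and a select to produce the EVL-style result.
define i32 @trailing_inactive_elts(<vscale x 4 x i1> %mask) {
  %r = call i32 @llvm.experimental.cttz.elts.i32.nxv4i1(<vscale x 4 x i1> %mask, i1 false)
  ret i32 %r
}
declare i32 @llvm.experimental.cttz.elts.i32.nxv4i1(<vscale x 4 x i1>, i1 immarg)
```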
Changes since original commit:
* Rebase over improved test coverage for theadba
* Revert change to use TargetConstant as it appears to prevent the uimm2
clause from matching in the XTheadBa patterns.
* Fix an order of operands bug in the THeadBa pattern visible in the new
test coverage.
Original commit message follows:
This implements a RISCV specific version of the SHL_ADD node proposed in
https://github.com/llvm/llvm-project/pull/88791.
If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.
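For instance (a sketch, not a test from the patch), the multiply below becomes a single zba instruction:
```
; mul %x, 9 == (%x << 3) + %x, i.e. sh3add with zba (or th.addsl with XTheadBa).
define i64 @mul9(i64 %x) {
  %r = mul i64 %x, 9
  ret i64 %r
}
```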
The majority of this change is moving two sets of patterns out of
tablegen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.
On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
cases. I don't believe this matters.
* I'm removing the one-use restriction on the multiply. This restriction
doesn't really make sense to me, and the test changes appear positive.
This reverts commit 5a7c80ca58c628fab80aa4f95bb6d18598c70c80. Noticed failures
with the following command:
$ llc -mtriple=riscv64 -mattr=+m,+xtheadba -verify-machineinstrs < test/CodeGen/RISCV/rv64zba.ll
I think I know the cause and will likely reland with a fix tomorrow.
This implements a RISCV specific version of the SHL_ADD node proposed in
https://github.com/llvm/llvm-project/pull/88791.
If that lands, the infrastructure from this patch should seamlessly
switch over to the generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.
The majority of this change is moving two sets of patterns out of
tablegen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.
On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
cases. I don't believe this matters.
* I'm removing the one-use restriction on the multiply. This restriction
doesn't really make sense to me, and the test changes appear positive.
Bug fix: Handle RVV return types in the calling convention correctly.
Return values are handled in the same way as function arguments.
One thing to mention is that if a type can be broken down into homogeneous
vector types, e.g. {<vscale x 4 x i32>, {<vscale x 4 x i32>, <vscale x 4 x i32>}},
it is considered a vector tuple type and needs to be handled by the tuple
type rule.
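As a hedged illustration, a function returning such a type (built here from its parts with insertvalue):
```
; The nested struct flattens to three <vscale x 4 x i32> values, so the
; return is treated as a vector tuple and assigned consecutive vector
; registers by the tuple rule.
define { <vscale x 4 x i32>, { <vscale x 4 x i32>, <vscale x 4 x i32> } } @ret_homogeneous(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, <vscale x 4 x i32> %c) {
  %t0 = insertvalue { <vscale x 4 x i32>, { <vscale x 4 x i32>, <vscale x 4 x i32> } } poison, <vscale x 4 x i32> %a, 0
  %t1 = insertvalue { <vscale x 4 x i32>, { <vscale x 4 x i32>, <vscale x 4 x i32> } } %t0, <vscale x 4 x i32> %b, 1, 0
  %t2 = insertvalue { <vscale x 4 x i32>, { <vscale x 4 x i32>, <vscale x 4 x i32> } } %t1, <vscale x 4 x i32> %c, 1, 1
  ret { <vscale x 4 x i32>, { <vscale x 4 x i32>, <vscale x 4 x i32> } } %t2
}
```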
For experimental.cttz.elts, we can use a vfirst instruction, but we need
to correct the result if the input vector can be all zeros: in that case
cttz.elts returns the vector length while vfirst returns -1.
Integer RISCVISD::SELECT_CC doesn't create poison. If none of the
operands are poison, the result is not poison.
This allows ISD::FREEZE to be hoisted above RISCVISD::SELECT_CC.
This intrinsic was introduced by #81331, which is a lot like
`llvm.readcyclecounter`.
For the RISCV implementation, we rename `ReadCycleWide` pseudo to
`ReadCounterWide` and make it accept two operands (the low and high
parts of the counter). As for legalization and lowering parts, we
reuse the code of `ISD::READCYCLECOUNTER` (make it able to handle
both intrinsics), and we use `time` CSR for `ISD::READSTEADYCOUNTER`.
Tests using Clang builtins were run on real hardware and it works
as expected.
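A minimal IR use of the new intrinsic:
```
; On RISC-V this reads the time CSR, reusing the READCYCLECOUNTER
; legalization and the (renamed) ReadCounterWide pseudo on RV32.
define i64 @steady_counter() {
  %t = call i64 @llvm.readsteadycounter()
  ret i64 %t
}
declare i64 @llvm.readsteadycounter()
```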
Reviewers: asb, MaskRay, dtcxzyw, preames, topperc, jhuber6
Reviewed By: jhuber6, asb, MaskRay, dtcxzyw
Pull Request: https://github.com/llvm/llvm-project/pull/82322
This patch allows VCIX instructions that have side effects to be reordered
with memory and other side-effecting instructions. However, we don't want
VCIX instructions to be reordered with each other, so we propose a dummy
register called VCIX_STATE and make these instructions implicitly define
and use it.
This patch adds basic TLSDESC support in the RISC-V backend.
Specifically, we add new relocation types for TLSDESC, as prescribed in
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/373, and add a
new pseudo instruction to simplify code generation.
This patch does not try to optimize the local dynamic case, which can be
improved in separate patches.
Linker side changes will also be handled separately.
The current implementation is only enabled when passing the new
`-enable-tlsdesc` codegen flag.
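A hedged example of IR that exercises the new lowering (when built with -enable-tlsdesc):
```
; With -enable-tlsdesc, this thread-local access is emitted using the new
; TLSDESC relocations and pseudo instead of the traditional TLS sequence.
@tls_counter = thread_local global i32 0

define i32 @read_tls_counter() {
  %v = load i32, ptr @tls_counter
  ret i32 %v
}
```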
This commit includes the necessary changes to clang and LLVM to support
codegen of `RVE` and the `ilp32e`/`lp64e` ABIs.
The differences between `RVE` and `RVI` are:
* `RVE` reduces the integer register count to 16 (x0-x15).
* The ABI should be `ilp32e` for 32 bits and `lp64e` for 64 bits.
`RVE` can be combined with all current standard extensions.
The central changes in the ilp32e/lp64e ABIs, compared to ilp32/lp64, are:
* Only 6 integer argument registers (rather than 8).
* Only 2 callee-saved registers (rather than 12).
* A stack alignment of 32 bits (rather than 128 bits).
* ilp32e isn't compatible with the D ISA extension.
If `ilp32e` or `lp64e` is used with an ISA that has any of the registers
x16-x31 and f0-f31, then these registers are considered temporaries.
To be compatible with the implementation of ilp32e in GCC, we don't use
aligned registers to pass variadic arguments, and we set the stack alignment
to 4 bytes for types with a length of 2*XLEN.
FastCC is also supported on RVE, while GHC isn't since there is only one
available register.
Differential Revision: https://reviews.llvm.org/D70401
Similar to #76550, but for `ISD::AVGCEILU`.
Specifically, this patch aims to use `vaaddu` with rounding mode rnu
(i.e. `vxrm[1:0] = 0b00`) for `ISD::AVGCEILU`.
### Source code
```
define <vscale x 8 x i8> @vaaddu_vv_nxv8i8_ceil(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y) {
%xzv = zext <vscale x 8 x i8> %x to <vscale x 8 x i16>
%yzv = zext <vscale x 8 x i8> %y to <vscale x 8 x i16>
%add = add nuw nsw <vscale x 8 x i16> %xzv, %yzv
%one = insertelement <vscale x 8 x i16> poison, i16 1, i32 0
%splat = shufflevector <vscale x 8 x i16> %one, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
%add1 = add nuw nsw <vscale x 8 x i16> %add, %splat
%div = lshr <vscale x 8 x i16> %add1, %splat
%ret = trunc <vscale x 8 x i16> %div to <vscale x 8 x i8>
ret <vscale x 8 x i8> %ret
}
```
### Before this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
vwaddu.vv v10, v8, v9
vsetvli zero, zero, e16, m2, ta, ma
vadd.vi v10, v10, 1
vsetvli zero, zero, e8, m1, ta, ma
vnsrl.wi v8, v10, 1
ret
```
### After this patch
```
vaaddu_vv_nxv8i8_ceil:
vsetvli a0, zero, e8, m1, ta, ma
csrwi vxrm, 0
vaaddu.vv v8, v8, v9
ret
```
This patch aims to use `vaaddu` with rounding mode rdn (i.e. `vxrm[1:0] =
0b10`) for `ISD::AVGFLOORU`.
### Source code
```
define <8 x i8> @vaaddu_auto(ptr %x, ptr %y, ptr %z) {
%xv = load <8 x i8>, ptr %x, align 2
%yv = load <8 x i8>, ptr %y, align 2
%xzv = zext <8 x i8> %xv to <8 x i16>
%yzv = zext <8 x i8> %yv to <8 x i16>
%add = add nuw nsw <8 x i16> %xzv, %yzv
%div = lshr <8 x i16> %add, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%ret = trunc <8 x i16> %div to <8 x i8>
ret <8 x i8> %ret
}
```
### Before this patch
```
vaaddu_auto:
vsetivli zero, 8, e8, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
vwaddu.vv v10, v8, v9
vnsrl.wi v8, v10, 1
ret
```
### After this patch
```
vaaddu_auto:
vsetivli zero, 8, e8, mf2, ta, ma
vle8.v v8, (a0)
vle8.v v9, (a1)
csrwi vxrm, 2
vaaddu.vv v8, v8, v9
ret
```
### Note on signed averaging addition
Based on the rvv spec, there is also a variant for signed averaging
addition called `vaadd`.
But AFAIU, no matter which rounding mode is used, we cannot achieve the
semantics of signed averaging addition through `vaadd`.
Thus this patch only introduces `vaaddu`.
As far as I can tell, if getIndexedAddressParts received an ISD::SUB, the
constant would be negated, so `IsInc` should be set to true since the
SUB was effectively converted to an ADD. This means we should never use
PRE_DEC/POST_DEC.
No tests are affected because DAGCombine aggressively turns SUB with a
constant into ADD, so no lit test has a SUB reach getIndexedAddressParts.
Instruction cost for CodeSize and Latency/RecipThroughput can be very
different. Considering the diversity of CostKind and vendor-specific
cost, and how they are spread across various TTI functions, it's
becoming quite a challenge to handle. This patch adds an interface
getRISCVInstructionCost to address it.
We can use RISCVISD::VMERGE_VL with an undef passthru operand.
I had to rewrite the FMA patterns to handle both undef and non-undef
cases so we can get the tail policy.