First part of tt-ascalon-d8 scheduling model, only containing scalar
ops. Scheduling for vector instructions will be added in a follow-up
patch.
---------
Co-authored-by: Anton Blanchard <antonb@tenstorrent.com>
Co-authored-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com>
This patch introduces a scheduling model for the MIPS p8700, an
out-of-order
RISC-V processor. The model includes pipelines for the following units:
- 2 Integer Arithmetic/Logical Units (ALU and AL2)
- Multiply/Divide Unit (MDU)
- Branch Unit (CTI)
- Load/Store Unit (LSU)
- Short Floating-Point Pipe (FPUS)
- Long Floating-Point Pipe (FPUL)
For additional details, refer to the official product page:
https://mips.com/products/hardware/p8700/.
Also adds `UnsupportedSchedZfhmin` to handle cases like
`WriteFCvtF16ToF32` that
previously caused build failures.
The results differ on different platforms so it is really hard to
determine a common default value.
Tune info for postra scheduling direction is added and CPUs can
set their own preferable postra scheduling direction.
The P8700 is a high-performance processor from MIPS designed to meet the
demands of modern workloads, offering exceptional scalability and
efficiency. It builds on MIPS's established architectural strengths
while introducing enhancements that set it apart. For more details, you
can check out the official product page here:
https://mips.com/products/hardware/p8700/.
Scheduling model will be added in a separate commit/PR.
Change the intersect for the anticipated algorithm to ignore unknown
when anticipating. This effectively allows VXRM writes speculatively
because it could do a VXRM write even when there's branches where VXRM
is unneeded.
The importance of this change is because VXRM writes causes pipeline
flushes in some micro-architectures and so it makes sense to allow more
aggressive hoisting even if it causes some degradation for the slow
path.
An example is this code:
```
typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst, int i_dst_stride,
uint8_t *src1, int i_src1_stride,
uint8_t *src2, int i_src2_stride,
int i_width, int i_height )
{
for( int y = 0; y < i_height; y++ )
{
for( int x = 0; x < i_width; x++ )
dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
dst += i_dst_stride;
src1 += i_src1_stride;
src2 += i_src2_stride;
}
}
```
With this patch, the code above generates a hoisting VXRM writes out of
the outer loop.
This reverts commit b36fcf4f493ad9d30455e178076d91be99f3a7d8.
This reverts commit c11b6b1b8af7454b35eef342162dc2cddf54b4de.
This reverts commit 775148f2367600f90d28684549865ee9ea2f11be.
multiple bot build breakages, e.g. https://lab.llvm.org/buildbot/#/builders/3/builds/8076
Ascalon is an out-of-order CPU core from Tenstorrent. Overview:
https://tenstorrent.com/ip/tt-ascalon
Adding 8-wide version, -mcpu=tt-ascalon-d8. Scheduling model will be
added in a separate PR.
---------
Co-authored-by: Anton Blanchard <antonb@tenstorrent.com>
This is a follow up to #111511, where after benchmarking we learnt that
the Banana Pi F3 has fast segmented loads for not just NF=2, but also
NF=3 and NF=4:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput
This adds tuning features to allow these segment loads and stores to be
costed cheaper and enables it for the spacemit-x60.
It also enables +optimized-nf2-segment-load-store by default in the
generic tuning to maintain the previous behaviour when compiled without
-mcpu or -mtune.
Syntacore SCR7 is rv64imafdcv_zba_zbb_zbc_zbs_zkn.
Scheduling model for RVV will be added later.
Overview: https://syntacore.com/products/scr7
---------
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
Co-authored-by: Elena Lepilkina <elena.lepilkina@syntacore.com>
Syntacore SCR7 is a high-performance Linux-capable RISC-V processor
core.
The core has rv64imafdcv_zba_zbb_zbc_zbs_zkn march.
Overview: https://syntacore.com/products/scr7
Scheduling model will be added in a subsequent PR.
---------
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
Co-authored-by: Elena Lepilkina <elena.lepilkina@syntacore.com>
Luke Wren's Hazard3 is a configurable, open-source 32-bit RISC-V core.
The core's source code and docs are available on github:
https://github.com/wren6991/hazard3
This is the RISC-V core used in the RP2350, a recently announced SoC by
Raspberry Pi (which also contains Arm cores):
https://datasheets.raspberrypi.com/rp2350/rp2350-datasheet.pdf
We have agreed to name this `-mcpu` option `rp2350-hazard3`, and it
reflects exactly the options configured in the RP2350 chips. Notably,
the Zbc is not configured, and nor is B because the `misa.B` bit is not
either.
Syntacore SCR4 is a microcontroller-class processor core that has much
in common with SCR3, but also supports F and D extensions.
Overview: https://syntacore.com/products/scr4
Syntacore SCR5 is an entry-level Linux-capable 32/64-bit RISC-V
processor core which scheduling model almost match SCR4.
Overview: https://syntacore.com/products/scr5
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
Syntacore SCR5 is an entry-level Linux-capable 32/64-bit RISC-V
processor core.
Overview: https://syntacore.com/products/scr5
Scheduling model will be added in a subsequent PR.
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
This matches sifive-p470.
RVA22U64Features includes the Zicntr extension which was not present for
these CPUs before. I believe that was a mistake due to weird history of
the Zicntr extension. I've updated the p470 test accordingly since this
was missed there too.
This is an OOO core that has a vector unit. For more information see
https://www.sifive.com/cores/performance-p450-470.
Use the existing P400 scheduler model. This model is missing accurate
vector scheduling support, but it will be added in a follow up patch.
Other tunings can come in future patches too.
Syntacore SCR4 is a microcontroller-class processor core that has much
in common with SCR3. The most significant difference for compilers is F
and D extensions support. Overview: https://syntacore.com/products/scr4
Two CPUs are added:
* 'syntacore-scr4-rv32' -- rv32imfdc
* 'syntacore-scr4-rv64' -- rv64imafdc
Scheduling model will be added in a separate PR.
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
This is just like AArch64.
Changing the threshold to 6 will increase the code size, but will
also decrease unconditional branches. CPUs with wide fetch/issue units
can benefit from it.
The value 6 may be debatable, we can set it to `SchedModel.IssueWidth`.
Add all the implied feature directly to the SiFive7 CPUs tuning feature
list instead.
The implication is dangerous because explicitly disalbing any of the
implied features through the command line would also clear the SiFive7
feature bit.
Some cores implement an optimization for a strided load with an x0
stride, which results in fewer memory operations being performed then
implied by VL since all address are the same. It seems to be the case
that this is the case only for a minority of available implementations.
We know that sifive-x280 does, but sifive-p670 and spacemit-x60 both do
not.
(To be more precise, measurements on the x60 appear to indicate that a
stride of x0 has similar latency to a non-zero stride, and that both
are about twice a vleN.v. I'm taking this to mean the x0
case is not optimized.)
We had an existing flag by which a processor could opt out of this
assumption but no upstream users. Instead of adding this flag to the
p670 and x60, this patch reverses the default and adds the opt-in flag
only to the x280.
Syntacore SCR3 is a microcontroller-class processor core. Overview:
https://syntacore.com/products/scr3
This PR introduces two CPUs:
* 'syntacore-scr3-rv32' which is rv32imc
* 'syntacore-scr3-rv64' which is rv64imac
---------
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
On RISC-V, there are a few ways to control whether the
PostMachineScheduler is enabled. If `-enable-post-misched` is passed or
passed with a value of true, then the PostMachineScheduler is enabled.
If it is passed with a value of false then the PostMachineScheduler is
disabled. If the option is not passed at all, then
`RISCVSubtarget::enablePostRAMachineScheduler` decides whether the pass
should be enabled or not. `TargetSubtargetInfo::enablePostRAScheduler`
and `TargetSubtargetInfo::enablePostRAMachineScheduler` who check the
SchedModel value are not called by RISC-V backend.
`RISCVSubtarget::enablePostRAMachineScheduler` currently checks if the
active scheduler model sets `PostRAScheduler`. If it is set to true by
the scheduler model, then the pass is enabled. If it is not set to true
by the scheduler model, then the value of `UsePostRAScheduler` subtarget
feature is used.
I argue that the RISC-V backend should not use `PostRAScheduler` field
of the scheduler model to control whether the PostMachineScheduler is
enabled for the following reasons:
1. No other targets use this value to control whether
PostMachineScheduler is enabled. They only use it to check whether the
legacy PostRASchedulerList scheduler is enabled.
2. We can add the `UsePostRAScheduler` feature to the processor
definition in RISCVProcessors.td to tie a processor to whether the pass
should be enabled by default. This makes the feature and the sched model
field redundant.
3. Since these options are redundant, we should prefer the feature,
since we can set `+` and `-` on the feature, but the value of the
scheduler cannot be controlled on the command line.
4. Keeping both options allows us to set the feature and the scheduler
model value to conflicting values. Although the scheduler model value
will win out, it feels awkward to allow it.
This is largely a revert of commit
e81796671890b59c110f8e41adc7ca26f8484d20.
As #88029 shows, there exists hardware that only supports unaligned
scalar.
I'm leaving how this gets exposed to the clang interface to a future
patch.
This is currently being implied in RISCVISAInfo.cpp. Make it explicit.
I'm planning to move all extension information to RISCVFeatures.td and
have tablegen create the tables for RISCVISAInfo.cpp. This requires
making the creation of RISCVTargetParserDef.inc in tablegen independent
of RISCVISAInfo.cpp. So we need an accurate extension list for CPUs in
tablegen.
This PR includes an initial scheduler model shows improvement on
multiple workloads over NoSchedModel and SiFive7Model for sifive-p670.
We plan on making significant changes to this model in the future so
that it is more accurate. This patch would close
https://github.com/llvm/llvm-project/pull/80612.
sifive-p450 supports a very restricted version of the short forward
branch optimization from the sifive-7-series.
For sifive-p450, a branch over a single c.mv can be macrofused as a
conditional move operation. Due to encoding restrictions on c.mv, we
can't conditionally move from X0. That would require c.li instead.
This reverts 0d3eee33f262402562a1ff28106dbb2f59031bdb and
4c37d30e22ae655394c8b3a7e292c06d393b9b44.
XSfcie is not an official SiFive extension name. It stands for
SiFive Custom Instruction Extension, which is mentioned in the S76
manual, but then elsewhere in the manual says it is not supported
for S76.
LLVM had various instructions and CSRs listed as part of this
extension, but as far as SiFive is concerned, none of them are part
of it. There are no documented extension names for these instructions
and CSRs either externally or internally.
If these are important to LLVM users, I can facilitate creating
extension names for them and have them documented. For now I'm
removing everything.
Unfortunately, these instructions and CSRs are in LLVM 17 so this
is an incompatible change.
Support was added for the following fusions:
auipc-addi, slli-srli, ld-add
Some parts of the code became repetative, so small refactoring of
existing lui-addi fusion was done.