These are some macrofusions that are used internally in Ventana in a
not-yet-upstreamed processor. Figured it would be good to contribute
them ahead of the processor to allow the community to also use them in
their own processors, while also alleviating our own downstream upkeep.
The macrofusions being added are, considering load =
lb,lh,lw,ld,lbu,lhu,lwu:
- bfext (slli+srli)
- auipc+load
- lui+load
- add(.uw)+load
- addi+load
- shXadd(.uw)+load, where X=1,2,3
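As a rough illustration of the pairs listed above, the following Python sketch checks whether two adjacent instructions match one of them. It is a simplified model, not the LLVM implementation: the `Inst` record and `is_fusion_candidate` are hypothetical names, and the requirement that the second instruction reads the first one's result is an assumption about how such fusions are usually constrained.

```python
from collections import namedtuple

# Hypothetical, simplified instruction record: opcode, destination register,
# and the source registers it reads.
Inst = namedtuple("Inst", ["opcode", "dest", "srcs"])

LOADS = {"lb", "lh", "lw", "ld", "lbu", "lhu", "lwu"}

# First instructions that may fuse with a following load, per the list above.
LOAD_PREFIXES = {
    "auipc", "lui", "add", "add.uw", "addi",
    "sh1add", "sh2add", "sh3add",
    "sh1add.uw", "sh2add.uw", "sh3add.uw",
}


def is_fusion_candidate(first, second):
    """Return True if (first, second) matches one of the listed pairs.

    Assumption: the second instruction must consume the first one's result.
    """
    if first.dest not in second.srcs:
        return False
    if first.opcode == "slli" and second.opcode == "srli":
        return True  # bfext
    return first.opcode in LOAD_PREFIXES and second.opcode in LOADS
```

For example, `sh2add a0, a1, a2` followed by `lw a3, 0(a0)` matches, while an `addi` whose result the load does not use does not.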
Some processors benefit more from store clustering than load clustering,
and vice-versa, depending on factors that are exclusive to each one
(e.g. macrofusions implemented).
Likewise, certain optimizations benefit more from misched clustering
than postRA clustering. Macrofusions are again an example: in a
processor with store pair macrofusions, like the veyron-v1, it is
observed that misched clustering increases the amount of macrofusions
more than postRA clustering. This of course isn't necessarily true for
other processors, but it shows that processors can benefit from a more
fine-grained control of clustering mutations, and each one is able to do
it differently.
Add 4 new subtarget features that deprecate the existing
riscv-misched-load-store-clustering and
riscv-postmisched-load-store-clustering
options:
- disable-misched-load-clustering and disable-misched-store-clustering:
disable load/store clustering during misched;
- disable-postmisched-load-clustering and
disable-postmisched-store-clustering:
disable load/store clustering during PostRA.
Note that the new subtarget features disable specific stages of the
default clustering settings. The default per se (load and store
clustering for both misched and PostRA) is left untouched.
Disable all clustering but misched-store-clustering for the veyron-v1
processor using the new features.
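The relationship between the four features and the clustering that stays enabled can be sketched as follows. This is only a Python model of the behaviour described above: the feature names are taken from this message, but `enabled_clustering` and the `(stage, kind)` tuples are hypothetical and do not mirror the actual subtarget code.

```python
# Each (stage, kind) of clustering is on by default and is switched off by
# the corresponding "disable-..." subtarget feature.
DISABLE_FEATURES = {
    ("misched", "load"): "disable-misched-load-clustering",
    ("misched", "store"): "disable-misched-store-clustering",
    ("postmisched", "load"): "disable-postmisched-load-clustering",
    ("postmisched", "store"): "disable-postmisched-store-clustering",
}


def enabled_clustering(features):
    """Return the set of (stage, kind) pairs still enabled for a feature set."""
    return {
        key for key, feature in DISABLE_FEATURES.items()
        if feature not in features
    }


# veyron-v1 as described above: everything but misched store clustering is
# disabled.
VEYRON_V1_FEATURES = {
    "disable-misched-load-clustering",
    "disable-postmisched-load-clustering",
    "disable-postmisched-store-clustering",
}
```

With no features set, all four combinations remain enabled, which matches the untouched default.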
When vectorizing a loop with a fixed-order recurrence we use a splice,
which gets lowered to a vslidedown and vslideup pair.
However with the way we lower it today we end up with extra vl toggles
in the loop, especially with EVL tail folding, e.g.:
.LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
sub a5, a2, a3
sh2add a6, a3, a1
zext.w a7, a4
vsetvli a4, a5, e8, mf2, ta, ma
vle32.v v10, (a6)
addi a7, a7, -1
vsetivli zero, 1, e32, m2, ta, ma
vslidedown.vx v8, v8, a7
sh2add a6, a3, a0
vsetvli zero, a5, e32, m2, ta, ma
vslideup.vi v8, v10, 1
vadd.vv v8, v10, v8
add a3, a3, a4
vse32.v v8, (a6)
vmv2r.v v8, v10
bne a3, a2, .LBB0_5
Because the vslideup overwrites all but UpOffset elements from the
vslidedown, we currently set the vslidedown's AVL to said offset.
But in the vslideup we use either VLMAX or the EVL which causes a
toggle.
This patch increases the AVL of the vslidedown so that it matches the
vslideup's, even though the extra elements are overwritten, to avoid the
toggle.
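To see why the larger AVL is harmless, here is a small Python model of the two slides. It is a sketch under stated assumptions: `VLMAX = 8` is arbitrary, tail elements are modelled as undisturbed (the real code uses tail-agnostic), and `splice_last` is a hypothetical helper, not an LLVM function.

```python
VLMAX = 8  # illustrative register length in elements (assumption)


def vslidedown(vd, vs2, offset, vl):
    """vd[i] = vs2[i + offset] for i < vl; reads past VLMAX yield 0.

    Elements at or beyond vl are kept as-is (tail modelled as undisturbed).
    """
    out = list(vd)
    for i in range(vl):
        src = i + offset
        out[i] = vs2[src] if src < VLMAX else 0
    return out


def vslideup(vd, vs2, offset, vl):
    """vd[i] = vs2[i - offset] for offset <= i < vl; vd[0:offset] is kept."""
    out = list(vd)
    for i in range(offset, vl):
        out[i] = vs2[i - offset]
    return out


def splice_last(prev, cur, prev_len, vl, slidedown_vl):
    """Last active element of prev, followed by the leading elements of cur."""
    tmp = vslidedown(prev, prev, prev_len - 1, slidedown_vl)
    return vslideup(tmp, cur, 1, vl)
```

Calling `splice_last` with `slidedown_vl=1` (the old lowering) and with `slidedown_vl=vl` (the new one) gives the same first `vl` elements, because the vslideup overwrites everything from index 1 up to `vl`.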
A new tuning feature +vl-dependent-latency has been added which keeps
the old behaviour for microarchitectures that dynamically dispatch uops
based on vl, e.g. sifive-x280.
+vl-dependent-latency can be reused for the recently proposed Ovlt
optimization directive if/when it's ratified:
https://lists.riscv.org/g/tech-privileged/message/2487
If we wanted to aggressively optimise for vl at the expense of
introducing more toggles we could probably look at doing this in
RISCVVLOptimizer.
This patch adds the scheduling model for sifive-x390. X390 is a dual
issue in-order CPU. It has two scalar and two vector pipes, with
VLEN=1024 and DLEN=512.
Co-authored-by: Michael Maitland <michaeltmaitland@gmail.com>
The Andes CPU is configurable with optional extensions. The minimal
required extension set does not include the `B` and `Zbc` extensions, so
we decided to remove them.
This patch implements the scheduling model for IMAFD and the Zb
extensions. The latency and throughput of all instructions, except
load/store, are measured by llvm-exegesis.
The scheduling model for V and other extensions will be added in a
follow-up patch.
This patch adds an initial scheduler model for the SpacemiT-X60,
including latency for scalar instructions only.
The scheduler is based on the documented characteristics of the C908,
which the SpacemiT-X60 is believed to be based on, and provides the
expected latency for several instructions. I ran a probe to confirm all
of these values and to get the latency of instructions not provided by
the C908 documentation (e.g., double floating-point instructions).
For load and store instructions, the C908 documentation says the latency
is >= 3 for load and 1 for store. I tried a few combinations of values
until I got the current values of 5 and 3, which yield the best results.
Although the X60 does appear to support multiple issue for at least some
floating point instructions, this model assumes single issue as
increasing it reduces the gains below.
This patch gives a geomean improvement of ~4% on SPEC CPU 2017 for both
rva22u64 and rva22u64_v, with some benchmarks improving up to 18%
(508.namd_r). There were a couple of execution time regressions, but
only in noisy benchmarks (523.xalancbmk_r and 510.parest_r).
* rva22u64: https://lnt.lukelau.me/db_default/v4/nts/507?compare_to=405
(compares a55f7275 to the baseline 8286b804)
* rva22u64_v:
https://lnt.lukelau.me/db_default/v4/nts/474?compare_to=404 (compares
a55f7275 to the baseline 8286b804)
This initial scheduling model is strongly focused on providing
sufficient definitions to provide improved performance for the
SpacemiT-X60. Further incremental gains may be possible through a much
more detailed microarchitectural analysis, but that is left to future
work.
Further scheduling definitions for RVV can be added in a future PR.
We add a generic out-of-order CPU model here just like what GCC
has done.
People may use this model to evaluate some optimizations, and more
importantly, people can use this model as a template to customize
their own CPU models.
The design (units, cycles, ...) of this model is random so don't
take it seriously.
This reverts commit 9cc8442a2b438962883bbbfd8ff62ad4b1a2b95d.
This reverts commit 859c871184bdfdebb47b5c7ec5e59348e0534e0b.
A performance regression was reported on the original review. There appears
to have been an unexpected interaction here. Reverting during investigation.
This change introduces a default schedule model for the RISCV target
which leaves everything unchanged except the MicroOpBufferSize. The
default value of this flag in NoSched is 0. Both configurations
represent in-order cores (i.e. no reorder window); the difference
between them comes down to whether heuristics other than latency are
allowed to apply. (Implementation details below)
I left the processor models which explicitly set MicroOpBufferSize=0
unchanged in this patch, but strongly suspect we should change those
too. Honestly, I think the LLVM wide default for this flag should be
changed, but don't have the energy to manage the updates for all
targets.
Implementation wise, the effect of this change is that schedule units
which are ready to run *except that* one of their predecessors may not
have completed yet are added to the Available list, not the Pending one.
The result of this is that it becomes possible to choose to schedule a
node before its ready cycle if the heuristics prefer. This is
essentially choosing to insert a resource stall instead of e.g.
increasing register pressure.
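The queue selection described above can be modelled in a few lines of Python. This is a deliberately reduced sketch: `release_node` is a hypothetical name, and the real scheduler also considers hazard checks and a limit on the ready list, which are omitted here.

```python
def release_node(ready_cycle, curr_cycle, micro_op_buffer_size):
    """Decide which queue a released schedule unit goes to.

    With MicroOpBufferSize == 0 a unit whose ready cycle is still in the
    future has to wait in Pending. With any nonzero buffer size it goes
    straight to Available, so the heuristics may pick it early and accept
    a stall.
    """
    is_buffered = micro_op_buffer_size != 0
    if not is_buffered and ready_cycle > curr_cycle:
        return "Pending"
    return "Available"
```

A unit that becomes ready at cycle 5 while the scheduler is at cycle 3 lands in Pending with a buffer size of 0 and in Available with any nonzero size.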
Note that I was initially concerned there might be a correctness aspect
(as in some kind of exposed pipeline design), but the generic scheduler
doesn't seem to know how to insert noop instructions. Without that, a
program wouldn't be guaranteed to schedule on an exposed pipeline
depending on the program and schedule model in question.
The effect of this is that we sometimes prefer register pressure in
codegen results. This is mostly churn (or small wins) on scalar because
we have many more registers, but is of major importance on vector -
particularly high LMUL - because we effectively have many fewer
registers and the relative cost of spilling is much higher. This is a
significant improvement on high LMUL code quality for default rva23u
configurations - or any non -mcpu vector configuration for that matter.
Fixes #107532
P550 falls between P450 and P650. It has 1 additional FEX pipe over
P450. Mul and cpop latency are 3 instead of 2.
I've set the MicroOpBufferSize to 96 instead of 56 based on the ROB size
measurement from
https://chipsandcheese.com/p/inside-sifives-p550-microarchitecture I
believe we set this value too low for P450 and P650 and should update
them in a separate PR.
First part of tt-ascalon-d8 scheduling model, only containing scalar
ops. Scheduling for vector instructions will be added in a follow-up
patch.
---------
Co-authored-by: Anton Blanchard <antonb@tenstorrent.com>
Co-authored-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com>
This patch introduces a scheduling model for the MIPS p8700, an
out-of-order RISC-V processor. The model includes pipelines for the
following units:
- 2 Integer Arithmetic/Logical Units (ALU and AL2)
- Multiply/Divide Unit (MDU)
- Branch Unit (CTI)
- Load/Store Unit (LSU)
- Short Floating-Point Pipe (FPUS)
- Long Floating-Point Pipe (FPUL)
For additional details, refer to the official product page:
https://mips.com/products/hardware/p8700/.
Also adds `UnsupportedSchedZfhmin` to handle cases like
`WriteFCvtF16ToF32` that previously caused build failures.
The results differ on different platforms so it is really hard to
determine a common default value.
Tune info for postra scheduling direction is added and CPUs can
set their own preferable postra scheduling direction.
The P8700 is a high-performance processor from MIPS designed to meet the
demands of modern workloads, offering exceptional scalability and
efficiency. It builds on MIPS's established architectural strengths
while introducing enhancements that set it apart. For more details, you
can check out the official product page here:
https://mips.com/products/hardware/p8700/.
Scheduling model will be added in a separate commit/PR.
Change the intersect in the anticipation algorithm to ignore unknown
values when anticipating. This effectively allows speculative VXRM
writes, because a VXRM write may now be inserted even when there are
branches on which VXRM is not needed.
This change matters because VXRM writes cause pipeline flushes in some
micro-architectures, so it makes sense to allow more aggressive
hoisting even if it causes some degradation on the slow path.
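The difference between a conservative intersect and one that ignores unknown values can be sketched in Python as below. The state names and both function names are hypothetical, and the conservative variant is only an assumed baseline for comparison. The sketch shows the lattice idea only, not the pass's actual data structures.

```python
UNINIT = "uninitialized"  # no information yet (e.g. first predecessor)
UNKNOWN = "unknown"       # VXRM requirement not known on this path


def intersect(a, b):
    """Conservative meet: any unknown input or mismatch gives unknown."""
    if b == UNINIT:
        return a
    if a == UNINIT:
        return b
    if a == UNKNOWN or b == UNKNOWN:
        return UNKNOWN
    return a if a == b else UNKNOWN


def intersect_anticipated(a, b):
    """Meet for anticipation that ignores unknown inputs.

    A path that does not care about VXRM no longer blocks hoisting a write
    needed on another path. Two different static values still conflict.
    """
    if b == UNINIT:
        return a
    if a == UNINIT:
        return b
    if a == UNKNOWN:
        return b
    if b == UNKNOWN:
        return a
    return a if a == b else UNKNOWN
```

With one successor anticipating `rnu` and another reporting unknown, the conservative meet yields unknown while the relaxed one yields `rnu`, which is what permits the speculative write.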
An example is this code:
```
typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst, int i_dst_stride,
uint8_t *src1, int i_src1_stride,
uint8_t *src2, int i_src2_stride,
int i_width, int i_height )
{
for( int y = 0; y < i_height; y++ )
{
for( int x = 0; x < i_width; x++ )
dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
dst += i_dst_stride;
src1 += i_src1_stride;
src2 += i_src2_stride;
}
}
```
With this patch, the VXRM write in the code above is hoisted out of
the outer loop.
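For context on why this loop involves VXRM at all: the expression `( src1[x] + src2[x] + 1 ) >> 1` is an averaging add with round-to-nearest-up, which is presumably lowered to an RVV averaging add that depends on the rounding mode. The Python check below only verifies the arithmetic identity. `roundoff_rnu` follows the fixed-point rounding definition in the vector specification, while the function names themselves are made up for this sketch.

```python
def roundoff_rnu(v, d):
    """Shift v right by d bits, rounding to nearest with ties rounded up.

    This adds back bit d-1 of v, i.e. the most significant discarded bit.
    """
    if d == 0:
        return v
    return (v >> d) + ((v >> (d - 1)) & 1)


def averaging_add_rnu(a, b):
    """Unsigned averaging add of two values using rnu rounding."""
    return roundoff_rnu(a + b, 1)


def c_expression(a, b):
    """The expression from the inner loop of the example above."""
    return (a + b + 1) >> 1
```

The two forms agree for every pair of `uint8_t` inputs, and the result always fits back into 8 bits.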
This reverts commit b36fcf4f493ad9d30455e178076d91be99f3a7d8.
This reverts commit c11b6b1b8af7454b35eef342162dc2cddf54b4de.
This reverts commit 775148f2367600f90d28684549865ee9ea2f11be.
Reverted due to multiple bot build breakages, e.g. https://lab.llvm.org/buildbot/#/builders/3/builds/8076
Ascalon is an out-of-order CPU core from Tenstorrent. Overview:
https://tenstorrent.com/ip/tt-ascalon
Adding 8-wide version, -mcpu=tt-ascalon-d8. Scheduling model will be
added in a separate PR.
---------
Co-authored-by: Anton Blanchard <antonb@tenstorrent.com>
This is a follow up to #111511, where after benchmarking we learnt that
the Banana Pi F3 has fast segmented loads for not just NF=2, but also
NF=3 and NF=4:
https://github.com/preames/bp3-microarch#vlseg_lmul_x_sew_throughput
This adds tuning features to allow these segment loads and stores to be
costed cheaper and enables it for the spacemit-x60.
It also enables +optimized-nf2-segment-load-store by default in the
generic tuning to maintain the previous behaviour when compiled without
-mcpu or -mtune.
Syntacore SCR7 is rv64imafdcv_zba_zbb_zbc_zbs_zkn.
Scheduling model for RVV will be added later.
Overview: https://syntacore.com/products/scr7
---------
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
Co-authored-by: Elena Lepilkina <elena.lepilkina@syntacore.com>
Syntacore SCR7 is a high-performance Linux-capable RISC-V processor
core.
The core has rv64imafdcv_zba_zbb_zbc_zbs_zkn march.
Overview: https://syntacore.com/products/scr7
Scheduling model will be added in a subsequent PR.
---------
Co-authored-by: Dmitrii Petrov <dmitrii.petrov@syntacore.com>
Co-authored-by: Anton Afanasyev <anton.afanasyev@syntacore.com>
Co-authored-by: Elena Lepilkina <elena.lepilkina@syntacore.com>
Luke Wren's Hazard3 is a configurable, open-source 32-bit RISC-V core.
The core's source code and docs are available on github:
https://github.com/wren6991/hazard3
This is the RISC-V core used in the RP2350, a recently announced SoC by
Raspberry Pi (which also contains Arm cores):
https://datasheets.raspberrypi.com/rp2350/rp2350-datasheet.pdf
We have agreed to name this `-mcpu` option `rp2350-hazard3`, and it
reflects exactly the options configured in the RP2350 chips. Notably,
Zbc is not configured, and neither is B, since the `misa.B` bit is not
configured either.