Don't call raw_string_ostream::flush(), which is essentially a no-op.
As specified in the docs, raw_string_ostream is always unbuffered.
(See 65b13610a5226b84889b923bae884ba395ad084d for further reference.)
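A minimal usage sketch (the function and string names are illustrative): the underlying std::string is updated on every write, so flush() adds nothing.

```cpp
#include "llvm/Support/raw_ostream.h"
#include <string>

// Illustrative helper: format a value into a std::string through
// raw_string_ostream without ever calling flush().
static std::string formatValue(int V) {
  std::string Buf;
  llvm::raw_string_ostream OS(Buf);
  OS << "value = " << V;
  // raw_string_ostream is always unbuffered, so Buf already holds the
  // formatted text here; OS.flush() would be a no-op.
  return Buf;
}
```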
GNU ld will error when encountering a pcrel_lo whose corresponding
pcrel_hi is in a different section. [1] introduced a check that helps
prevent this issue by disallowing outlining in a few circumstances.
However, we can also hit this same issue when outlining from functions
with prefixes ("hot"/"unlikely"/"unknown" from profile information, for
example) as the outlined function might not have the same prefix,
possibly resulting in a "paired" pcrel_lo and pcrel_hi ending up in
different sections.
To prevent this issue, take a similar approach to [1] and additionally
disallow outlining when we see a pcrel_lo and the function has a section
prefix.
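A rough sketch of the kind of guard this describes, assuming it sits in the RISC-V outlining hooks; the control flow is illustrative, not the exact patch:

```cpp
// Illustrative fragment, not the actual change: while classifying an
// instruction for outlining, reject candidates that use a %pcrel_lo when
// the enclosing function carries a section prefix, since the outlined
// function could land in a different section than the paired %pcrel_hi.
if (MI.getMF()->getFunction().getSectionPrefix()) {
  for (const MachineOperand &MO : MI.operands())
    if (MO.getTargetFlags() == RISCVII::MO_PCREL_LO)
      return outliner::InstrType::Illegal;
}
```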
[1] 96c85f80f0
Fixes #107520
Continuing with #107993 and #108007, this handles the last of the main
rematerializable vector instructions.
There's an extra spill in one of the test cases, but it's likely noise
from the spill weights and isn't an issue in practice.
Even though vmv.v.x has a non-constant scalar operand, we can still
rematerialize it because we have split register allocation between
vectors and scalars.
InlineSpiller will check to make sure that the scalar operand is live at
the point where the rematerialization occurs, so this won't extend any
scalar live ranges. However, this also means we may not be able to
rematerialize in some cases, as shown in @vmv.v.x_needs_extended.
It might be worthwhile teaching InlineSpiller to extend scalar live
ranges in a future patch. I experimented with this locally and it
reduced spills on 531.deepsjeng_r by a further 3%.
This continues the line of work started in #97520, and gives a 2.5%
reduction in the number of spills on SPEC CPU 2017.
Program              regalloc.NumSpills
                         lhs        rhs    diff
605.mcf_s             141.00     141.00    0.0%
505.mcf_r             141.00     141.00    0.0%
519.lbm_r              73.00      73.00    0.0%
619.lbm_s              68.00      68.00    0.0%
631.deepsjeng_s       354.00     353.00   -0.3%
531.deepsjeng_r       354.00     353.00   -0.3%
625.x264_s           1896.00    1886.00   -0.5%
525.x264_r           1896.00    1886.00   -0.5%
508.namd_r           6665.00    6598.00   -1.0%
644.nab_s             761.00     753.00   -1.1%
544.nab_r             761.00     753.00   -1.1%
638.imagick_s        4287.00    4181.00   -2.5%
538.imagick_r        4287.00    4181.00   -2.5%
602.gcc_s           12771.00   12450.00   -2.5%
502.gcc_r           12771.00   12450.00   -2.5%
510.parest_r        43876.00   42740.00   -2.6%
500.perlbench_r      4297.00    4179.00   -2.7%
600.perlbench_s      4297.00    4179.00   -2.7%
526.blender_r       13503.00   13103.00   -3.0%
511.povray_r         2006.00    1937.00   -3.4%
620.omnetpp_s         984.00     946.00   -3.9%
520.omnetpp_r         984.00     946.00   -3.9%
657.xz_s              302.00     289.00   -4.3%
557.xz_r              302.00     289.00   -4.3%
541.leela_r           378.00     356.00   -5.8%
641.leela_s           378.00     356.00   -5.8%
623.xalancbmk_s      1646.00    1548.00   -6.0%
523.xalancbmk_r      1646.00    1548.00   -6.0%
Geomean difference                        -2.5%
I initially held off submitting this patch because it surprisingly
introduced a lot of spills in the test diffs, but after #107290 the
vmv.v.i instructions that caused them are now gone.
The gist is that marking vmv.v.i as spillable decreased its spill
weight, which actually resulted in more m8 registers getting evicted and
spilled during register allocation.
The SPEC results show this isn't an issue in practice though, and I plan
on posting a separate patch to explain this in more detail.
Previously, for vector peepholes that fold based on VL, we checked whether
the VLMAX was the same as a proxy for checking that the EEWs were the
same. This only worked at LMUL >= 1, where the EMULs of the Src output and
the user's input had to be the same because their register classes needed
to match.
At fractional LMULs we would have incorrectly folded something like
this:
%x:vr = PseudoVADD_VV_MF4 $noreg, $noreg, $noreg, 4, 4 /* e16 */, 0
%y:vr = PseudoVMV_V_V_MF8 $noreg, %x, 4, 3 /* e8 */, 0
This models the EEW of the destination operands of vector instructions
with a TSFlag, which is enough to fix the incorrect folding.
There's some overlap with the TargetOverlapConstraintType and
IsRVVWideningReduction. If we model the source operands as well we may
be able to subsume them.
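As a hedged sketch of the resulting check, with `getDestEEW` standing in as a hypothetical accessor for the new TSFlag (the real helper may be named differently):

```cpp
// Hypothetical accessor name; illustrates the idea only. Compare the EEW
// that Src writes with the EEW the user instruction writes instead of
// comparing VLMAX, so the MF4/e16 -> MF8/e8 case above no longer folds.
unsigned SrcEEW  = getDestEEW(Src->getDesc().TSFlags);  // hypothetical helper
unsigned UserEEW = getDestEEW(MI.getDesc().TSFlags);    // hypothetical helper
if (SrcEEW != UserEEW)
  return false;
```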
This patch prepares the NFC groundwork for global outlining using
CGData, which will follow
https://github.com/llvm/llvm-project/pull/90074.
- The `MinRepeats` parameter is now explicitly passed to the
`getOutliningCandidateInfo` function, rather than relying on a default
value of 2. For local outlining, the minimum number of repetitions is
typically 2, but for global outlining (mentioned above), we will
optimistically create a single `Candidate` for each `OutlinedFunction`
if stable hashes match a specific code sequence. This parameter is
adjusted accordingly in global outlining scenarios (see the interface
sketch after this list).
- I have also switched `OutlinedFunction` to be held via `unique_ptr` to
ensure safe and efficient memory management within `FunctionList`,
avoiding unnecessary implicit copies.
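A rough sketch of the resulting interface, paraphrased from the description above rather than copied from the tree (the parameter list is simplified):

```cpp
// Paraphrased/simplified, not the exact upstream signature: MinRepeats is
// now an explicit argument (2 for local outlining, possibly 1 for global
// outlining), and each OutlinedFunction is returned through a unique_ptr
// so FunctionList never makes implicit copies.
std::optional<std::unique_ptr<outliner::OutlinedFunction>>
getOutliningCandidateInfo(std::vector<outliner::Candidate> &RepeatedSequenceLocs,
                          unsigned MinRepeats) const;
```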
This depends on https://github.com/llvm/llvm-project/pull/101461.
This is a patch for
https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753.
The renamable flag is useful during MachineCopyPropagation, but it will be
dropped after lowerCopy in some cases.
This patch introduces extra arguments to pass the renamable flag to
copyPhysReg.
This is just like AArch64.
Changing the threshold to 6 will increase the code size, but will
also decrease unconditional branches. CPUs with wide fetch/issue units
can benefit from it.
The value 6 may be debatable; we could instead set it to `SchedModel.IssueWidth`.
This extension consists of 8 additional 16-bit compressed forms for
existing standard load/store opcodes.
These opcodes are found in some RISC-V microcontrollers from WCH /
Nanjing Qinheng Microelectronics.
As discussed in the Discourse forums, this uses extension and opcode names
that are incompatible with the vendor binary toolchain. The chosen names
instead follow the conventions for other vendor extensions listed on the
"riscv-non-isa" project.
This adds initial support for rematerializing vector instructions,
starting with vid.v since it's simple and has the fewest
operands. It has one passthru operand which we need to check is
undefined. It also has an AVL operand, but it's fine to rematerialize
with it because it's scalar and register allocation is split between
vector and scalar.
RISCVInsertVSETVLI can still happen before vector regalloc if
-riscv-vsetvl-after-rvv-regalloc is false, so this makes sure that we
only rematerialize after regalloc by checking for the implicit uses that
are added.
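A hedged sketch of that guard, assuming it simply looks for the implicit uses that RISCVInsertVSETVLI adds (the helper name is illustrative):

```cpp
// Illustrative helper, not the exact upstream code: once the implicit
// VL/VTYPE uses added by RISCVInsertVSETVLI are present, the pass has
// already run and rematerializing vid.v is safe.
static bool hasVSETVLIImplicitUses(const llvm::MachineInstr &MI) {
  return MI.hasRegisterImplicitUseOperand(llvm::RISCV::VL) &&
         MI.hasRegisterImplicitUseOperand(llvm::RISCV::VTYPE);
}
```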
Fixes #82659
There are some functions, such as `findRegisterDefOperandIdx` and `findRegisterDefOperand`, that have too many default parameters. As a result, we have encountered some issues due to the lack of TRI parameters, as shown in issue #82411.
Following @RKSimon 's suggestion, this patch refactors 9 functions, including `{reads, kills, defines, modifies}Register`, `registerDefIsDead`, and `findRegister{UseOperandIdx, UseOperand, DefOperandIdx, DefOperand}`, adjusting the order of the TRI parameter and making it required. In addition, all the places that call these functions have also been updated correctly to ensure no additional impact.
After this, the caller of these functions should explicitly know whether to pass the `TargetRegisterInfo` or just a `nullptr`.
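An illustrative call-site update under this change (the register and variables are arbitrary): callers now state explicitly whether sub- and super-register aliases should be considered.

```cpp
// Illustrative only. Before: MI.definesRegister(RISCV::X10); // TRI defaulted
bool DefsX10Exact   = MI.definesRegister(RISCV::X10, /*TRI=*/nullptr); // exact register only
bool DefsX10Aliases = MI.definesRegister(RISCV::X10, TRI);             // include sub/super regs
```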
This opens the door to reusing reassociation optimizations on
target-specific binary operations with non-standard operand lists.
This is effectively an NFC.
CASE_VFMA_OPCODE_VV and CASE_VFMA_CHANGE_OPCODE_VV need to match up if we
are to avoid "Unexpected opcode" errors, but in CASE_VFMA_CHANGE_OPCODE_VV,
CASE_VFMA_CHANGE_OPCODE_LMULS_MF2 had mistakenly been used instead of
CASE_VFMA_CHANGE_OPCODE_LMULS_MF4.
This PR includes:
* vadd.vv/vand.vv/vor.vv/vxor.vv
* vmseq.vv/vmsne.vv
* vmin.vv/vminu.vv/vmax.vv/vmaxu.vv
* vmul.vv/vmulh.vv/vmulhu.vv
* vwadd.vv/vwaddu.vv
* vwmul.vv/vwmulu
* vwmacc.vv/vwmaccu.vv
* vadc.vvm
There are no test changes; I may add some later.
Fixes part of #64422
Reviewers: michaelmaitland, preames, lukel97, topperc, asb
Reviewed By: topperc, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/88379
We split target-dependent MachineCombiner patterns out into their
respective target folders.
This makes MachineCombiner much more target-independent.
Reviewers: davemgreen, asavonic, rotateright, RKSimon, lukel97, LuoYuanke, topperc, mshockwave, asi-sc
Reviewed By: topperc, mshockwave
Pull Request: https://github.com/llvm/llvm-project/pull/87991
This improves a pattern that occurs in 531.deepsjeng_r, reducing the
dynamic instruction count by 0.5%.
This may be possible to improve in SelectionDAG, but given the special
cases around shXadd formation, it's not obvious it can be done in a
robust way without adding multiple special cases.
I've used a GEP with 2 indices because that most closely resembles the
motivating case. Most of the test cases are the simplest GEP case. One
test has a logical right shift on an index which is closer to the
deepsjeng code. This requires special handling in isel to reverse a
DAGCombiner canonicalization that turns a pair of shifts into (srl (and
X, C1), C2).
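A hedged illustration of the kind of source pattern being targeted (not the actual deepsjeng code): a two-index GEP whose outer index comes from a logical right shift.

```cpp
// Illustrative only: the shifted row index combined with the GEP scaling is
// canonicalized by DAGCombiner into a shift-and-mask form such as
// (srl (and X, C1), C2), which isel must now see through to form shXadd.
int lookup(const int table[][16], unsigned x, unsigned col) {
  return table[x >> 3][col];
}
```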
This restructures the code to make it more obvious that most of
getVLENFactoredAmount is just a generic multiply by an immediate, and
prepares for a couple of upcoming enhancements to this code.
Note that I plan to switch mulImm to early return, but decided I'd do
that as a separate commit to keep this diff readable.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
The TSFlags field of RegisterClass was introduced by
https://reviews.llvm.org/D108767.
This adds a base class for all RISC-V RegisterClasses, stores
IsVRegClass/VLMul/NF in TSFlags, and adds helpers to retrieve them.
This reduces some lines of code, and I think there will be more uses.
Reviewers: preames, topperc
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/84894
When the encodings of register tuples are aligned, we can use a copy
with a larger LMUL to reduce the number of copies.
Reviewers: preames, topperc, lukel97
Reviewed By: topperc, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/84455
This is part of #70452 that changes the type used for the external
interface of MMO to LocationSize as opposed to uint64_t. This means the
constructors take LocationSize, and convert ~UINT64_C(0) to
LocationSize::beforeOrAfter(). The getSize methods return a
LocationSize.
This allows us to be more precise with unknown sizes, not accidentally
treating them as unsigned values, and in the future should allow us to
add proper scalable vector support but none of that is included in this
patch. It should mostly be an NFC.
Global ISel is still expected to use the underlying LLT as it needs, and
is not expected to see unknown sizes for generic operations. Most of
the changes are hopefully fairly mechanical, adding a lot of getValue()
calls and protecting them with hasValue() where needed.
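A small usage sketch under the new interface (variable names are illustrative): code that needs a raw byte count must check for a known size first instead of assuming a uint64_t.

```cpp
// Illustrative only: MMO->getSize() now returns a LocationSize rather than
// a uint64_t, so an unknown size has to be handled explicitly.
llvm::LocationSize Size = MMO->getSize();
if (!Size.hasValue())
  return false;                     // unknown size: stay conservative
uint64_t Bytes = Size.getValue();   // known, fixed size in bytes
```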
This is another part of #70452 which makes getMemOperandsWithOffsetWidth
use a LocationSize for Width, as opposed to the unsigned it currently
uses. The advantages on its own are not super high if
getMemOperandsWithOffsetWidth usually uses known sizes, but if the
values can come from an MMO it can help be more accurate in case they
are Unknown (and in the future, scalable).
Instead of initializing the accumulator to 0, initialize it on first
assignment with a mv from the register that holds VLENB << ShiftAmount.
Fix a missing kill flag on the final Add.
I have no real interest in this case, just an easy optimization I
noticed.
This was trying to rewrite a branch that uses X0 to a branch that uses a
register produced by LI of 1 or -1. Using X0 is free so there is no
reason to rewrite it. Doing so would just extend the live range of the
LI register increasing register pressure.
In practice this might not have triggered often because we were calling
MRI.hasOneUse on X0. I'm not sure what that returns for a physical
register.
If it isn't virtual, we may extend the live range of the physical
register past where it is valid, for example across a call.
Found while trying to enable -riscv-enable-sink-fold which enables some
copy propagation in machine sink that led to ADDIs with physical
register destinations.
When a branch target is too far away we need to emit an indirect branch.
We scavenge a register for this since we don't know we need one until
after register allocation.
Jumps using X1 and X5 as the source are hints to the hardware to pop the
return-address stack. We should avoid using them for jumps that
aren't a return or a tail call.