This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
```
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
```
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
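For illustration, a minimal before/after of the mechanical change (`Node` is a placeholder pointee type):
```
#include "llvm/ADT/SmallPtrSet.h"

struct Node; // any pointee type

// Before (relies on the SmallSet -> SmallPtrSet redirection):
//   llvm::SmallSet<Node *, 8> Seen;
// After (names the pointer-set type directly):
llvm::SmallPtrSet<Node *, 8> Seen;
```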
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
pli.h and pli.w both accept signed immediates, so pli.b should too. But
unlike those instructions, pli.b doesn't do any extension, so it's OK to
accept an unsigned immediate as well.
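A minimal sketch of the resulting immediate check, assuming the operand predicate simply accepts either 8-bit encoding (the helper name is illustrative):
```
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// pli.b performs no extension, so both signed and unsigned 8-bit
// immediates can be accepted.
bool isValidPLIBImm(int64_t Imm) {
  return isInt<8>(Imm) || isUInt<8>(Imm);
}
```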
Currently we have a switch statement that checks if a vector instruction
may read elements past VL. However, it currently doesn't account for
instructions in vendor extensions.
Handling all possible vendor instructions will result in quite a lot of
opcodes being added, so I've created a new TSFlag that we can declare in
TableGen, and added it to the existing instruction definitions.
I've tried to be as conservative as possible here: all SiFive vendor vector
instructions should be covered by the flag, as well as all of
XRivosVizip, and ri.vextract from XRivosVisni.
For now this should be NFC because coincidentally, these instructions
aren't handled in getOperandInfo, so RISCVVLOptimizer should currently
avoid touching them despite them being liberally handled in
getMinimumVLForUser.
However in an upcoming patch we'll need to also bail in
getMinimumVLForUser, so this prepares for it.
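A rough sketch of how such a TSFlags query might look (the flag name and bit position are illustrative, not the actual encoding):
```
#include "llvm/MC/MCInstrDesc.h"
#include <cstdint>

// Illustrative flag; the real bit is declared in TableGen alongside the
// other RISC-V TSFlags.
enum : uint64_t { ReadsPastVLMask = 1ULL << 0 };

// Replaces the opcode switch: any instruction tagged in TableGen is
// caught, including vendor instructions.
bool readsPastVL(const llvm::MCInstrDesc &Desc) {
  return Desc.TSFlags & ReadsPastVLMask;
}
```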
What most code wants to know is the direction, and we have to decode the
opcode to figure that out. Instead, pass the direction around as a bool
and convert it to an opcode when we create the merge instruction.
If we're moving the second copy before another instruction that reads
the copied register, we need to clear the kill flag on the combined
move.
Fixes #153598.
This patch adds CodeGen support for qc.insbi and qc.insb instructions
defined in the Qualcomm uC Xqcibm extension. qc.insbi and qc.insb
insert bits into the destination register from an immediate and a
register operand, respectively.
A sequence of `xor`, `and` & `xor`, under the appropriate conditions,
is converted to `qc.insbi` or `qc.insb`, with the choice between the two
depending on the immediate's value.
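In C terms, the transform recognizes the classic masked bit-insert idiom; a hedged sketch (field names are illustrative):
```
#include <cstdint>

// What qc.insb computes, roughly: insert the low Width bits of Src into
// Rd starting at bit Shamt.
uint32_t insb(uint32_t Rd, uint32_t Src, unsigned Width, unsigned Shamt) {
  uint32_t Mask = ((Width == 32 ? 0u : (1u << Width)) - 1u) << Shamt;
  return (Rd & ~Mask) | ((Src << Shamt) & Mask);
}

// The matched source idiom: ((x ^ y) & Mask) ^ y selects bits from x
// where Mask is set and from y elsewhere, i.e. it inserts x's bits into y.
uint32_t xorAndXor(uint32_t x, uint32_t y, uint32_t Mask) {
  return ((x ^ y) & Mask) ^ y;
}
```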
These instructions are the shift by immediate and saturate by immediate
instructions from the top half of page 9 of
https://jhauser.us/RISCV/ext-P/RVP-instrEncodings-015.pdf
I've also improved the CHECK lines in the invalid tests to check the
line and column numbers of the diagnostics.
Co-authored-by: realqhc <caiqihan021@hotmail.com>
Follow-up PR to #153071, adding the remaining zvbb instructions
(VBREV8_V and VREV8_V), plus the zvbc instructions (VCLMUL_VV,
VCLMUL_VX, VCLMULH_VV, VCLMULH_VX).
Godbolt example: https://godbolt.org/z/ThdfP475a
In the example, a single-element vse is used to store the reduction
result instead of a scalar store ([this optimization was introduced by
this patch](https://reviews.llvm.org/D109482)). However, vmv.x.s can't
be eliminated here because it has other uses (e.g. CopyToReg), so it
seems more profitable to use the scalar store: we already have the store
value in a scalar register, and we can save one vsetvli, which is likely
to be required for the single-element vse. The proposed solution is to
do this transform only if vmv.x.s has a single use (the store
instruction).
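A hedged sketch of the guard (SelectionDAG API shape; names are illustrative):
```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Only prefer the single-element vse when the extracted scalar
// (vmv.x.s) feeds the store and nothing else.
bool shouldUseVectorStore(SDValue ExtractedScalar) {
  return ExtractedScalar.hasOneUse(); // the store is the sole user
}
```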
The check should be about unsigned 16-bit immediates, not signed ones.
This is not a bug per se, as the old codegen was correct for the
uint16_max case; it just didn't end up using `qc.e.bgeui`, which we
would prefer it did.
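In LLVM terms, the fix amounts to something like (illustrative sketch):
```
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// The qc.e.bgeui immediate is unsigned, so the range test must be
// isUInt<16>, not isInt<16>.
bool isValidBranchImm(int64_t C) { return isUInt<16>(C); }
```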
This PR adds support for the following instructions to the RISC-V
VLOptimizer: vandn.vx, vandn.vv, vbrev.v, vclz.v, vcpop.v, vctz.v,
vror.vi, vror.vx, vror.vv, vrol.vx, vrol.vv.
Span-dependent instructions on RISC-V interact in a complex manner with
linker relaxation. The span-dependent assembler algorithm implemented in
LLVM has to start with the smallest version of an instruction and then
only make it larger, so we compress instructions before emitting them to
the streamer.
When the instruction is streamed, the information that the instruction
(or rather, the fixup on the instruction) is linker relaxable must be
accurate, even though the assembler relaxation process may transform a
not-linker-relaxable instruction/fixup into one that is linker
relaxable, for instance `c.jal` becoming `qc.e.jal`, or `bne` getting
turned into `beq; jal` (the `jal` is linker relaxable).
In order for this to work, the following things have to happen:
- Any instruction/fixup which might be relaxed to a linker-relaxable
instruction/fixup, gets marked as `RelaxCandidate = true` in
RISCVMCCodeEmitter.
- In RISCVAsmBackend, when emitting the `R_RISCV_RELAX` relocation, we
have to check that the relocation/fixup kind is one that may need a
relax relocation, as well as that it is marked as linker relaxable (the
latter will not be set if relaxation is disabled); see the sketch after
this list.
- Linker Relaxable instructions streamed to a Relaxable fragment need to
mark the fragment and its section as linker relaxable.
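A hedged sketch of the second point above (the helper names are illustrative, not the actual RISCVAsmBackend API):
```
#include "llvm/MC/MCFixup.h"
using namespace llvm;

// Illustrative helpers, not the actual API.
bool kindMayNeedRelaxRelocation(MCFixupKind Kind);
bool isMarkedLinkerRelaxable(const MCFixup &Fixup);

// Emit R_RISCV_RELAX only when both conditions hold; the second is
// false whenever relaxation is disabled.
bool shouldEmitRelaxRelocation(const MCFixup &Fixup) {
  return kindMayNeedRelaxRelocation(Fixup.getKind()) &&
         isMarkedLinkerRelaxable(Fixup);
}
```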
I also added more debug output for Sections/Fixups which are marked
Linker Relaxable.
This results in more relocations when these PC-relative fixups cross an
instruction with a fixup that is resolved as not linker-relaxable but
caused the fragment to be marked linker relaxable at streaming time
(i.e. `c.j`).
Fixes: #150071
If we have a floating point vector and no zve32f/zve64f/zve64d, we can
end up with an invalid type-legalization cost from
getTypeLegalizationCost.
Previously this triggered an assertion that the type must have been
legalized if the "legal" type is a vector, but in this case, when it's
not possible to legalize, the original type is spat back out.
This fixes it by just checking that the legalization cost is valid.
We don't have much testing for zve64x, so we may have other places in
the cost model with this issue.
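A hedged sketch of the fix (context elided; LT comes from getTypeLegalizationCost, which returns a {cost, legalized MVT} pair):
```
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
// The cost is Invalid when the type cannot be legalized, e.g. an FP
// vector without zve32f/zve64f/zve64d; bail out instead of asserting.
if (!LT.first.isValid())
  return InstructionCost::getInvalid();
```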
Fixes #153008
Similar to the PACKH+PACK pattern for RV32. We can end up with the
shift left by 32 needed by our PACK pattern hidden behind an OR that
packs two halfwords.
It is common to have ABI requirements for illegal types: For example,
two i64 argument parts that originally came from an fp128 argument may
have a different call ABI than ones that came from an i128 argument.
The current calling convention lowering does not provide access to this
information, so backends come up with various hacks to support it (like
additional pre-analysis cached in CCState, or bypassing the default
logic entirely).
This PR adds the original IR type to InputArg/OutputArg and passes it
down to CCAssignFn. It is not actually used anywhere yet, this just does
the mechanical changes to thread through the new argument.
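A hedged sketch of the threaded-through signature (the parameter name and position are illustrative):
```
#include "llvm/CodeGen/CallingConvLower.h"  // CCValAssign, CCState, MVT
#include "llvm/CodeGen/TargetCallingConv.h" // ISD::ArgFlagsTy
#include "llvm/IR/Type.h"
using namespace llvm;

// The original IR type now travels alongside the legalized MVTs, so a
// CCAssignFn can distinguish e.g. fp128 halves from i128 halves.
typedef bool CCAssignFnWithOrigTy(unsigned ValNo, MVT ValVT, MVT LocVT,
                                  CCValAssign::LocInfo LocInfo,
                                  ISD::ArgFlagsTy ArgFlags, Type *OrigTy,
                                  CCState &State);
```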
If the upper 32 bits are demanded, we might have a sext_inreg in
the pattern on the byte shifted by 24. We can also match this case
since packw sign extends from bit 31.
This PR adds hardware-measured latencies for all instructions defined in
Section 12 of the RVV specification: "Vector Fixed-Point Arithmetic
Instructions" to the SpacemiT-X60 scheduling model.
We have been tracking the performance of EVL tail folding in the loop
vectorizer on RISC-V for a while now, and after much hard work from
various contributors we think it should be generally profitable to
enable by default now.
With tail folding there is a 21% improvement on 525.x264_r on SPEC CPU
2017 on the BPI-F3 (-march=rva22u64_v -O3 -flto), as well as a 30%
geomean codesize reduction on SPEC and TSVC, with no significant
regressions detected.
Now that we are early into the LLVM 22.x development cycle it seems like
a good time to enable it to catch any issues. There are still more EVL
related items of work being tracked in #123069, which should continue to
improve performance.
Each section now tracks the index of the first linker-relaxable
fragment, enabling two changes:
* Delete redundant ALIGN relocations before the first linker-relaxable
instruction in a section. The primary example is the offset 0
R_RISCV_ALIGN relocation for a text section aligned by 4.
* For alignments larger than the NOP size after the first
linker-relaxable instruction, ALIGN relocations are now generated, even in
norelax regions. This fixes issue #150159.
The new test llvm/test/MC/RISCV/Relocations/align-after-relax.s
verifies the required ALIGN in a norelax region following
linker-relaxable instructions.
By using a fragment index within the subsection (which is less than or
equal to the section's index), the implementation may generate redundant
ALIGN relocations in lower-numbered subsections before the first
linker-relaxable instruction.
align-option-relax.s demonstrates the ALIGN optimization.
Add an initial `call` to a few tests to prevent the ALIGN optimization.
---
When the alignment exceeds 2, we insert $alignment-2 bytes of NOPs, even
in non-RVC code. This enables non-RVC code following RVC code to handle
a 2-byte adjustment without requiring additional state in MCSection
or AsmParser.
```
.globl _start
_start:
// GNU ld can relax this to 6505 lui a0, 0x1
// LLD hasn't implemented this transformation.
lui a0, %hi(foo)
.option push
.option norelax
.option norvc
// Now we generate R_RISCV_ALIGN with addend 2, even if this is a norvc region.
.balign 4
b0:
.word 0x3a393837
.option pop
foo:
```
Pull Request: https://github.com/llvm/llvm-project/pull/150816
This implements very basic support for RISC-V mapping symbols in
llvm-objdump, sharing the implementation with how Arm/AArch64/CSKY
implement this feature.
This only supports the `$x` (instruction) and `$d` (data) mapping
symbols for RISC-V, and not the version of `$x` which includes an
architecture string suffix.
This PR fixes the issue that caused UB in PR #151472.
The issue was a shl call taking a negative shift amount (posDiff). The
result was never used, but TableGen would perform the calculation
anyway. The fix was to replace the shl call with multiplications by
constants.
Original PR description:
This patch improves the helper classes in the SpacemiT-X60 vector
scheduling model and will be used in follow-up PRs:
There are now two functions to map LMUL to values:
* ConstValueUntilLMULThenDoubleBase: returns BaseValue for LMUL values
before startLMUL, Value for startLMUL, then doubles Value for each
subsequent LMUL. Useful for cases where fractional LMULs have constant
cycles, and integer LMULs double as they increase.
* GetLMULValue: takes an ordered list of LMUL cycles and LMUL and
returns the corresponding cycle. Useful for cases we can't easily cover
with ConstValueUntilLMULThenDoubleBase.
This PR also adds some useful simplified versions of
ConstValueUntilLMULThenDoubleBase, e.g. ConstValueUntilLMULThenDouble
(when BaseValue == Value) or ConstOneUntilMF4ThenDouble (when cycles
start to double after MF2).
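A C++ analogue of the first helper, to make the shape concrete (hedged; the real helpers are TableGen classes):
```
// LMUL index order assumed: MF8=0, MF4=1, MF2=2, M1=3, M2=4, M4=5, M8=6.
// Below StartLMUL the cost is BaseValue; at StartLMUL it is Value, and
// it doubles for each larger LMUL.
unsigned constValueUntilLMULThenDoubleBase(unsigned LMULIdx,
                                           unsigned StartLMUL,
                                           unsigned BaseValue,
                                           unsigned Value) {
  if (LMULIdx < StartLMUL)
    return BaseValue;
  return Value << (LMULIdx - StartLMUL);
}
```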
The information whether a specific argument is vararg or fixed is
currently stored separately from all the other argument information in
ArgFlags. This means that it is not accessible from CCAssign, and
backends have developed all kinds of workarounds for how they can access
it after all.
Move this information to ArgFlags to make it directly available in all
relevant places.
I've opted to invert this and store it as IsVarArg, as I think that both
makes the meaning more obvious and provides for a better default (which
is IsVarArg=false).
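A hedged sketch of what this enables in calling-convention code (the accessor name follows the description above and is illustrative):
```
#include "llvm/CodeGen/TargetCallingConv.h"
using namespace llvm;

// With the flag stored in ArgFlags, a CCAssignFn can branch on it
// directly instead of relying on pre-analysis stashed in CCState.
bool needsVarArgLowering(ISD::ArgFlagsTy ArgFlags) {
  return ArgFlags.isVarArg();
}
```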
For RV32 we don't need the byte shifted by 24 to be zero extended
since the extended bits are shifted out.
For RV64, we don't need the byte shifted by 24 to be zero extended
if the upper 32 bits of the result aren't demanded.
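In C terms, a hedged illustration of why the extension is irrelevant on RV32:
```
#include <cstdint>

// Bits of D above bit 7 land above bit 31 after the shift, so in a
// 32-bit result they are discarded regardless of how D was extended.
uint32_t pack4(uint8_t A, uint8_t B, uint8_t C, uint32_t D) {
  return A | ((uint32_t)B << 8) | ((uint32_t)C << 16) | (D << 24);
}
```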
These are some macrofusions that are used internally at Ventana in a
yet-to-be-upstreamed processor. We figured it would be good to
contribute them ahead of the processor, allowing the community to also
use them in their own processors, while also alleviating our own
downstream upkeep.
The macrofusions being added are, considering load =
lb,lh,lw,ld,lbu,lhu,lwu:
- bfext (slli+srli)
- auipc+load
- lui+load
- add(.uw)+load
- addi+load
- shXadd(.uw)+load, where X=1,2,3
This is similar to an existing pattern from RV32 with the
simplification proposed by #152045. Instead of pack we need to use
packw and we need to know that the upper 32 bits are being ignored since
packw sign extends from bit 31.
The use of allBinOpWUsers prevents tablegen from automatically
reassociating the pattern so we need to do it manually. Tablegen is
still able to commute operands though.
Some processors benefit more from store clustering than load clustering,
and vice-versa, depending on factors that are exclusive to each one
(e.g. macrofusions implemented).
Likewise, certain optimizations benefit more from misched clustering
than from postRA clustering. Macrofusions are again an example: in a
processor with store-pair macrofusions, like the veyron-v1, it is
observed that misched clustering increases the number of macrofusions
more than postRA clustering does. This of course isn't necessarily true
for other processors, but it shows that processors can benefit from more
fine-grained control of clustering mutations, and each one is able to do
it differently.
Add 4 new subtarget features that deprecate the existing
riscv-misched-load-store-clustering and
riscv-postmisched-load-store-clustering options:
- disable-misched-load-clustering and disable-misched-store-clustering:
disable load/store clustering during misched;
- disable-postmisched-load-clustering and
disable-postmisched-store-clustering:
disable load/store clustering during PostRA.
Note that the new subtarget features disable specific stages of the
default clustering settings. The default per se (load and store
clustering for both misched and PostRA) is left untouched.
Disable all clustering but misched-store-clustering for the veyron-v1
processor using the new features.
This pattern previously checked a specific variant of 4 bytes being
packed that is generated by unaligned load expansion.
Our individual PACK patterns don't handle that particular case because a
DAG combine turns (or (or A, (shl B, 8)), (shl (or C, (shl D, 8)), 16))
into (or (or A, (shl B, 8)), (or (shl C, 16), (shl D, 24))). After this,
the outer OR doesn't have a shl operand so we needed a pattern that
looks through 2 layers of OR.
To match this pattern we don't need to look at the (or A, (shl B, 8))
part since that part wasn't affected by the DAG combine and can be
matched to PACKH by itself. It's enough to make sure that part of the
pattern has zeros in the upper 16 bits.
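In C terms, a hedged illustration of the split:
```
#include <cstdint>

// (A | B << 8) is matchable to PACKH on its own; the outer OR only
// needs the guarantee that this half has zeros in its upper 16 bits.
uint32_t combine(uint8_t A, uint8_t B, uint32_t Rest) {
  uint32_t Lo = (uint32_t)A | ((uint32_t)B << 8); // PACKH(A, B)
  return Lo | Rest; // Rest carries (C << 16) | (D << 24)
}
```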
This allows tablegen to automatically generate more permutations of this pattern.
The associative variant expansion is limited to 3 children.