In the vmerge peephole, we currently allow different AVLs for the vmerge
and its true operand.
If vmerge's VL > true's VL, vmerge can "preserve" elements from false
that would otherwise be clobbered with a tail agnostic policy on true.
mask 1 1 1 1 0 0 0 0
true x x x x|. . . . AVL=4
vmerge x x x x f f|. . AVL=6
If we convert this to vmv.v.v we will lose those false elements:
mask 1 1 1 1 0 0 0 0
true x x x x|. . . . AVL=4
vmv.v.v x x x x . .|. . AVL=6
Fix this by checking that vmerge's AVL is <= true's AVL.
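To make the hazard concrete, here's a rough standalone model of the two
instructions (plain C++ standing in for the vector lanes; this is not
the peephole code itself). The lane values, mask and AVLs mirror the
diagram above:
```
#include <array>
#include <cstdio>

// Illustrative 8-lane model: 'x' are lanes written by `true`, 'f' lanes
// come from the `false` operand, '.' marks a lane whose value we cannot
// rely on (e.g. tail-agnostic garbage).
using Vec = std::array<char, 8>;

// vmerge.vvm vd, false, true, mask with the given AVL: body lanes take
// true[i] where the mask is set and false[i] otherwise; tail lanes keep vd.
static Vec vmerge(Vec Vd, const Vec &FalseOp, const Vec &TrueOp,
                  const std::array<bool, 8> &Mask, unsigned AVL) {
  for (unsigned i = 0; i < AVL; ++i)
    Vd[i] = Mask[i] ? TrueOp[i] : FalseOp[i];
  return Vd;
}

// vmv.v.v vd, src with the given AVL: body lanes copy src unconditionally.
static Vec vmv_v_v(Vec Vd, const Vec &Src, unsigned AVL) {
  for (unsigned i = 0; i < AVL; ++i)
    Vd[i] = Src[i];
  return Vd;
}

int main() {
  std::array<bool, 8> Mask = {true, true, true, true,
                              false, false, false, false};
  // `true` was computed with AVL=4 under a tail-agnostic policy, so its
  // lanes 4..7 are unspecified.
  Vec TrueOp = {'x', 'x', 'x', 'x', '.', '.', '.', '.'};
  Vec FalseOp = {'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'};
  Vec Vd = {'.', '.', '.', '.', '.', '.', '.', '.'};

  Vec Merged = vmerge(Vd, FalseOp, TrueOp, Mask, /*AVL=*/6);
  Vec Moved = vmv_v_v(Vd, TrueOp, /*AVL=*/6);

  // Merged[4..5] are 'f', Moved[4..5] are garbage: converting the vmerge
  // to vmv.v.v loses those false elements, so the fold is only safe when
  // vmerge's AVL <= true's AVL.
  printf("vmerge:  %.8s\nvmv.v.v: %.8s\n", Merged.data(), Moved.data());
  return 0;
}
```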
Should fix #149335
These instructions don't have an rs1 field, unlike other instructions
that use RVInstIBase.
Rename the classes to not use Unary since we have historically used that
for a single register operand.
In TargetFrameLowering::determineCalleeSaves, any vector register is
marked as saved if any of its subregisters is clobbered. This is not
correct for vector registers: we only want a vector register to be
marked as saved if all of its subregisters are clobbered.
This patch handles vector callee-saved registers in the target hook.
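A minimal sketch of the intended rule, written against a plain bitmask
rather than the real register-info interfaces (the name below is
illustrative only):
```
#include <bitset>
#include <cstddef>
#include <cstdio>

// Clobbered[i] says whether subregister i of a vector register (e.g. one
// member of a register group) is clobbered. The generic logic marks the
// super-register as saved as soon as any() subregister is clobbered; for
// vector registers we only want it marked when all() of them are.
template <size_t N>
static bool shouldSaveFullVectorReg(const std::bitset<N> &Clobbered) {
  return Clobbered.all();
}

int main() {
  std::bitset<4> OnlyOneClobbered("0001");
  std::bitset<4> AllClobbered("1111");
  printf("one subreg clobbered -> save full reg? %d\n",
         shouldSaveFullVectorReg(OnlyOneClobbered)); // 0
  printf("all subregs clobbered -> save full reg? %d\n",
         shouldSaveFullVectorReg(AllClobbered)); // 1
  return 0;
}
```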
This is the masked.store side to the masked.load support added in
881b3fd.
With this change, we support masked.load and masked.store via the
intrinsic lowering path used primarily with scalable vectors. An
upcoming change will extend the fixed vector (i.e. shuffle vector) paths
in the same manner.
This builds on the whole series of recent API reworks to implement
support for deinterleaveN of masked.load. The goal is to be able to
enable masked interleave groups in the vectorizer once all the codegen
and costing pieces are in place.
I considered including the shuffle path support in this review as well
(since the RISCV target specific stuff should be common), but decided to
separate it into its own review just to focus attention on one thing
at a time.
After the refactoring in #149710, the logic change is trivial.
Motivation for preferring sign-extended 32-bit loads (LW) vs
zero-extended (LWU):
* LW is compressible while LWU is not.
* Helps to minimise the diff vs RV32 (e.g. LWU vs LW)
* Helps to minimise distracting diffs vs GCC. I see this come up
frequently when comparing against GCC-generated code, and in those
cases it's a red herring.
Similar normalisation could be done for LHU and LH, but this is less
well motivated as there is a compressed LHU (and if performing the
change in RISCVOptWInstrs it wouldn't be done for RV32). There is a
compressed LBU but not LB, meaning doing a similar normalisation for
byte-sized loads would actually be a regression in terms of code size.
Load narrowing when allowed by hasAllNBitUsers isn't explored in this
patch.
This changes ~20500 instructions in an RVA22 build of the
llvm-test-suite including SPEC 2017. As part of the review, the option
of doing the change at ISel time was explored but was found to be less
effective.
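For reference, a small standalone program (plain integer types standing
in for the 64-bit GPR; this is not the RISCVOptWInstrs logic) showing
when the two loads agree: the sign- and zero-extended results differ
only when bit 31 of the loaded word is set, and a user that only reads
the low 32 bits sees the same value either way, which is presumably the
situation the W-instruction analysis establishes before rewriting LWU
to LW:
```
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Model LW (sign-extend the loaded 32-bit word into a 64-bit GPR) and
// LWU (zero-extend it).
static int64_t lw(uint32_t Word) { return (int64_t)(int32_t)Word; }
static int64_t lwu(uint32_t Word) { return (int64_t)(uint64_t)Word; }

int main() {
  for (uint32_t Word : {0x00000000u, 0x7fffffffu, 0x80000000u, 0xdeadbeefu}) {
    int64_t S = lw(Word), Z = lwu(Word);
    // The full 64-bit results differ once bit 31 is set, but the low 32
    // bits always agree, so a use that only reads the lower 32 bits
    // cannot tell the two loads apart.
    printf("word=%08x lw=%016llx lwu=%016llx low32-equal=%d\n", Word,
           (unsigned long long)S, (unsigned long long)Z,
           (uint32_t)S == (uint32_t)Z);
  }
  return 0;
}
```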
This refactor was suggested in
<https://github.com/llvm/llvm-project/pull/144703>.
I have checked for unexpected changes by comparing builds of
llvm-test-suite with/without this refactor, including with preferWInst
force enabled.
The previous assert wasn't passing the TSFlags but the opcode, so it
wasn't working.
Fixing it reveals that it was actually triggering, because we're too
strict with viota and vmsxf.m. We already reduce the VL on these
instructions because the result in each element doesn't depend on VL.
However, the result does change if the instruction is masked, so
account for that.
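A rough scalar model of viota.m (my reading of its usual semantics;
vmsxf.m behaves analogously, and the real optimizer code looks nothing
like this) shows both halves of that statement: element i depends only
on source elements below i, never on VL, but a mask changes both which
elements contribute and which elements are written:
```
#include <cstdio>
#include <vector>

// Rough model of viota.m vd, vs2 (optionally masked by v0). Element i of
// the result is the number of set bits of vs2 at positions below i; when
// masked, only active positions contribute and only active positions are
// written (inactive destination elements are left untouched here).
static std::vector<unsigned> viota(std::vector<unsigned> Vd,
                                   const std::vector<bool> &Vs2,
                                   const std::vector<bool> *V0, unsigned VL) {
  unsigned Sum = 0;
  for (unsigned i = 0; i < VL; ++i) {
    bool Active = !V0 || (*V0)[i];
    if (Active)
      Vd[i] = Sum;
    if (Active && Vs2[i])
      ++Sum;
  }
  return Vd;
}

int main() {
  std::vector<bool> Vs2 = {true, false, true, true, false, true};
  std::vector<bool> V0 = {true, true, false, true, true, true};
  std::vector<unsigned> Vd(6, 99);

  // Shrinking VL only drops trailing elements; it never changes earlier
  // ones, so reducing VL on an unmasked viota is fine.
  auto VL6 = viota(Vd, Vs2, nullptr, 6);
  auto VL3 = viota(Vd, Vs2, nullptr, 3);
  printf("unmasked element 2: vl=6 -> %u, vl=3 -> %u\n", VL6[2], VL3[2]);

  // With a mask, element 3 changes because the inactive element 2 no
  // longer contributes to the running count.
  auto Masked = viota(Vd, Vs2, &V0, 6);
  printf("element 3: unmasked -> %u, masked -> %u\n", VL6[3], Masked[3]);
  return 0;
}
```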
This is purely a benefit for reducing unnecessary diffs between RV32 and
RV64, as RVC does have a compressed form of SUBW (so SUB isn't more
compressible). This affects ~57.2k instructions in an rva22u64 build of
llvm-test-suite with SPEC CPU 2017 included.
RISCVOptWInstrs has a NumTransformedToWInstrs statistic, but didn't have
one for the W=>Non-W transform done by stripWSuffixes. It also didn't do
debug printing of the transformation. This patch addresses both issues.
Reviewed as part of <https://github.com/llvm/llvm-project/pull/149071>,
but landing separately.
Previously, two MCAsmBackend hooks were used, with
shouldInsertFixupForCodeAlign calling getWriter().recordRelocation
directly, bypassing generic code.
This patch:
* Introduces MCAsmBackend::relaxAlign to replace the two hooks.
* Tracks padding size using VarContentEnd (content is ignored).
* Moves setLinkerRelaxable from MCObjectStreamer::emitCodeAlignment to the backends.
Pull Request: https://github.com/llvm/llvm-project/pull/149465
We're going to end up repeating the operand extraction four times once
all of the routines have been updated to support both plain load/store
and vp.load/vp.store. I plan to add masked.load/masked.store in the near
future, and we'd need to add that to each of the four cases. Instead,
factor out a single copy of the operand normalization.
Currently, AsmPrinter skips CFI instructions created by a backend if
they are not needed. I'd like to change that so that it always
prints/encodes CFI instructions if a backend created them.
This change should slightly (perhaps negligibly) improve compile time as
post-PEI passes no longer need to skip over these instructions in
no-exceptions no-debug builds, and will make it possible to simplify convoluted
logic in AsmPrinter once other targets stop emitting CFI instructions
when they are not needed (that's my final goal).
The changes in a test seem to be caused by slightly different post-RA
scheduling in the absence of CFI instructions.
This continues in the direction started by commit 4b81dc7. We
essentially merge the handling for VPLoad - currently in
lowerInterleavedVPLoad - into the existing dedicated routine. This
removes the last use of the dedicated lowerInterleavedVPLoad, so we can
delete it.
This isn't quite NFC as the main callback has support for the strided
load optimization whereas the VPLoad specific version didn't. So this
adds the ability to form a strided load for a vp.load deinterleave with
one shuffle used.
If we're checking whether a number is a multiple of a small constant,
we need to be sure the multiply doesn't overflow for the mul logic to
hold. The VL is an unsigned number, so we care about unsigned overflow.
Once we've proven a number is a multiple, we can also use an exact
udiv, as we know we're not discarding any bits.
This fixes what is technically a miscompile with EVL vectorization, but
I doubt we'd ever have seen it in practice since most EVLs are going to
be much less than UINT_MAX.
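A standalone illustration of both points, with a plain uint32_t
standing in for the EVL (this is not the optimizer code): without the
no-overflow fact, "EVL was computed as k * C" does not imply
EVL % C == 0, and once divisibility is established the division back is
exact, i.e. no bits are discarded:
```
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t C = 6; // the small constant EVL should be a multiple of

  // If k * C wraps (unsigned overflow), the wrapped product need not be a
  // multiple of C, so the "mul logic" is unsound without the overflow check.
  uint32_t K = 0x2AAAAAABu;  // 6 * K == 0x1'0000'0002, which wraps in 32 bits
  uint32_t EVL = K * C;      // wraps to 2
  printf("wrapped product = %u, %% C = %u\n", EVL, EVL % C);

  // Once we know there is no overflow, divisibility holds and the udiv is
  // exact: (EVL / C) * C == EVL, i.e. no bits are discarded.
  uint32_t K2 = 12345;
  uint32_t EVL2 = K2 * C;
  printf("EVL2 = %u, EVL2 / C = %u, back = %u\n", EVL2, EVL2 / C,
         (EVL2 / C) * C);
  return 0;
}
```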
This continues in the direction started by commit 4b81dc7. We
essentially merge the handling for VPStore - currently in
lowerInterleavedVPStore, which is shared between shuffle and intrinsic
based interleaves - into the existing dedicated routine.
There are cases where InstCombine / InstSimplify might sink
extractvalue instructions that use a deinterleave intrinsic into
successor blocks, which prevents InterleavedAccess from kicking in
because the current pattern requires the deinterleave intrinsic to be
used by extractvalue.
However, this requirement is a bit too strict, since we could have just
replaced the users of the deinterleave intrinsic with whatever is
generated by the target TLI hooks.
This PR adds support for the vrgather.vi, vrgather.vx, vrgather.vv,
vrgatherei16.vv instructions in the RISC-V VLOptimizer.
To support vrgatherei16.vv I also needed to add support for it in
getOperandLog2EEW.
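A tiny standalone sketch of the rule getOperandLog2EEW needs for
vrgatherei16.vv (the operand numbering and function shape below are
illustrative, not the actual VLOptimizer interface): the index operand
always has EEW=16 regardless of SEW, while the destination and data
source use SEW:
```
#include <cassert>

// Illustrative only: operand 0 = vd, 1 = vs2 (data source), 2 = vs1
// (indices), mirroring vrgatherei16.vv vd, vs2, vs1.
static unsigned operandLog2EEWForVrgatherEI16(unsigned OperandNo,
                                              unsigned Log2SEW) {
  // The indices are always 16-bit elements, whatever SEW is; the data
  // operands use the element width of the instruction itself.
  return OperandNo == 2 ? 4u /* log2(16) */ : Log2SEW;
}

int main() {
  // e32 gather: data operands are 32-bit, the index operand is still 16-bit.
  assert(operandLog2EEWForVrgatherEI16(0, /*Log2SEW=*/5) == 5);
  assert(operandLog2EEWForVrgatherEI16(1, /*Log2SEW=*/5) == 5);
  assert(operandLog2EEWForVrgatherEI16(2, /*Log2SEW=*/5) == 4);
  return 0;
}
```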
The change implements custom lowering of `get_fpmode`, `set_fpmode` and
`reset_fpmode` for the RISCV target. The implementation is aligned with the
functions `fegetmode` and `fesetmode` in GLIBC.
XAndesBFHCvt provides two builtin functions for converting between
float and bf16. Users can use them to convert bf16 values loaded from
memory to float, perform arithmetic operations, then convert them back
to bf16 and store them to memory.
The load/store and move operations for bf16 will be handled in a later
patch.
Rename UnwrapShl->SelectShl. Make it only responsible for matching
a SHL by constant.
Handle the fallback case of reg+reg with no scale outside of SelectShl.
Reorder the check so RHS is checked for shift first. The base pointer
is most likely on the LHS. It's very unlikely both operands are shifts.
This is preparation for adding better costing decisions to this code.
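A toy sketch of the intended structure (a made-up expression node, not
the real SelectionDAG matching code): SelectShl is only responsible for
recognising a shift by constant, the reg+reg fallback lives in the
caller, and the RHS is tried first since the base pointer usually sits
on the LHS:
```
#include <cstdio>
#include <optional>
#include <utility>

// Toy address operand: either a leaf register or (ShlSrc << ShAmt).
struct Node {
  int Reg = -1;                 // leaf register id, -1 if not a leaf
  const Node *ShlSrc = nullptr; // non-null if this node is a shift
  unsigned ShAmt = 0;
};

struct AddrMode {
  int Base;
  int Index;
  unsigned Scale; // log2 scale; 0 means plain reg+reg
};

// SelectShl analogue: only matches a SHL of a register by a constant.
static std::optional<std::pair<int, unsigned>> selectShl(const Node &N) {
  if (N.ShlSrc && N.ShlSrc->Reg >= 0)
    return std::make_pair(N.ShlSrc->Reg, N.ShAmt);
  return std::nullopt;
}

static AddrMode matchAdd(const Node &LHS, const Node &RHS) {
  // Check the RHS for a shift first; the base pointer is most likely on
  // the LHS and it is very unlikely both operands are shifts.
  if (auto S = selectShl(RHS))
    return {LHS.Reg, S->first, S->second};
  if (auto S = selectShl(LHS))
    return {RHS.Reg, S->first, S->second};
  // Fallback, handled outside selectShl: reg+reg with no scale.
  return {LHS.Reg, RHS.Reg, 0};
}

int main() {
  Node Base, Idx, Shifted;
  Base.Reg = 10;
  Idx.Reg = 11;
  Shifted.ShlSrc = &Idx;
  Shifted.ShAmt = 2;
  AddrMode AM = matchAdd(Base, Shifted);
  printf("base=%d index=%d scale=%u\n", AM.Base, AM.Index, AM.Scale);
  return 0;
}
```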
Refactor the fragment representation of `push rax; jmp foo; nop; jmp foo`,
previously encoded as
`MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo)`,
to
```
MCFragment(fixed: push rax, variable: jmp foo)
MCFragment(fixed: nop, variable: jmp foo)
```
Changes:
* Eliminate MCEncodedFragment, moving content and fixup storage to MCFragment.
* The new MCFragment contains a fixed-size content (similar to previous
MCDataFragment) and an optional variable-size tail.
* The variable-size tail supports FT_Relaxable, FT_LEB, FT_Dwarf, and
FT_DwarfFrame, with plans to extend to other fragment types.
dyn_cast/isa should be avoided for the converted fragment subclasses.
* In `setVarFixups`, source fixup offsets are relative to the variable part's start.
Stored fixup (in `FixupStorage`) offsets are relative to the fixed part's start.
A lot of code does `getFragmentOffset(Frag) + Fixup.getOffset()`,
expecting the fixup offset to be relative to the fixed part's start
(see the sketch after this list).
* HexagonAsmBackend::fixupNeedsRelaxationAdvanced needs to know the
associated instruction for a fixup. We have to add a `const MCFragment &` parameter.
* In MCObjectStreamer, extend `absoluteSymbolDiff` to apply to
FT_Relaxable as otherwise there would be many more FT_DwarfFrame
fragments in -g compilations.
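A rough standalone model of the new shape (names borrowed, everything
else simplified; this is not the actual MC classes), including the
fixup-offset convention from the `setVarFixups` bullet above:
```
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One fragment owns a fixed-size content block plus an optional
// variable-size tail, instead of a chain of
// MCDataFragment/MCRelaxableFragment objects.
struct Fixup {
  uint64_t Offset; // stored relative to the start of the *fixed* part
  std::string Target;
};

struct Fragment {
  std::vector<uint8_t> Fixed; // e.g. the encoded "push rax" or "nop"
  std::vector<uint8_t> Var;   // e.g. the relaxable "jmp foo" tail
  std::vector<Fixup> Fixups;

  // Mirrors the setVarFixups convention: callers pass offsets relative to
  // the variable part's start, storage rebases them onto the fixed part's
  // start, so getFragmentOffset(Frag) + Fixup.Offset addresses the right
  // byte.
  void setVarFixups(const std::vector<Fixup> &VarFixups) {
    for (Fixup F : VarFixups) {
      F.Offset += Fixed.size();
      Fixups.push_back(F);
    }
  }
};

int main() {
  Fragment F;
  F.Fixed = {0x50};           // push rax (1 byte)
  F.Var = {0xe9, 0, 0, 0, 0}; // jmp foo (rel32, to be fixed up)
  F.setVarFixups({{1, "foo"}}); // rel32 starts 1 byte into the tail
  printf("stored fixup offset = %llu\n",
         (unsigned long long)F.Fixups[0].Offset); // 2 = fixed(1) + var(1)
  return 0;
}
```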
https://llvm-compile-time-tracker.com/compare.php?from=28e1473e8e523150914e8c7ea50b44fb0d2a8d65&to=778d68ad1d48e7f111ea853dd249912c601bee89&stat=instructions:u
```
stage2-O0-g instructions:u geomean (-0.07%)
stage1-ReleaseLTO-g (link only) max-rss geomean (-0.39%)
```
```
% /t/clang-old -g -c sqlite3.i -w -mllvm -debug-only=mc-dump &| awk '/^[0-9]+/{s[$2]++;tot++} END{print "Total",tot; n=asorti(s, si); for(i=1;i<=n;i++) print si[i],s[si[i]]}'
Total 59675
Align 2215
Data 29700
Dwarf 12044
DwarfCallFrame 4216
Fill 92
LEB 12
Relaxable 11396
% /t/clang-new -g -c sqlite3.i -w -mllvm -debug-only=mc-dump &| awk '/^[0-9]+/{s[$2]++;tot++} END{print "Total",tot; n=asorti(s, si); for(i=1;i<=n;i++) print si[i],s[si[i]]}'
Total 32287
Align 2215
Data 2312
Dwarf 12044
DwarfCallFrame 4216
Fill 92
LEB 12
Relaxable 11396
```
Pull Request: https://github.com/llvm/llvm-project/pull/148544
This patch adds dummy symbols for PLT entries for RISC-V 32-bit and
64-bit targets so llvm-objdump can show the function symbol that
corresponds to each PLT entry.
This essentially merges the handling for VPLoad - currently in
lowerInterleavedVPLoad which is shared between shuffle and intrinsic
based interleaves - into the existing dedicated routine.
My plan is that if we like this factoring, I'll do the same for the
intrinsic store paths, and then remove the excess generality from the
shuffle paths since we don't need to support both modes in the shared
VPLoad/Store callbacks. We can probably even fold the VP versions into
the non-VP shuffle variants in an analogous way.
Follow up on the work from e5bc7e7d, and extend it to the lowering used
for interleave and deinterleave when we can't combine with a nearby
memory operation.
The lower 5 bits of the immediate are not part of the address, unlike
other InstFormatS instructions.
We use InstFormatS in RISCVRegisterInfo::needsFrameBaseReg and
RISCVRegisterInfo::getFrameIndexInstrOffset, which are not aware of
this special encoding. Force the format to InstFormatOther so those
functions will ignore it.
InstFormatS is also used by relocation emission, but I don't believe we
ever emit these instructions with a relocation because of the encoding.
The transformation done in #147349 was incorrect since we were not
passing the input node of the `OR` instruction to the `QC.INSBI`
instruction, leading to the generated instruction doing the wrong
thing. To fix this we first needed to add the output register to
`QC.INSBI` as being both an input and an output.
The code produced after the above fix will need a copy (mv) to preserve
the register input to the OR instruction if it has more than one use,
making the transformation net neutral (`6-byte QC.E.ORI/ORAI` vs
`2-byte C.MV + 4-byte QC.INSBI`). Avoid doing the transformation if
there is more than one use of the input register to the OR instruction.
If a VL is zero then it's known to be less than or equal to every other
VL.
This looks weird on its own since a VL of zero isn't that common. The
test diffs come from a type being split resulting in a VP intrinsic's
EVL being zero.
The motivation for this is to split off part of an upcoming patch I plan
on submitting for RISCVVLOptimizer, which generalizes it to handle
recurrences, and needs to reason about an initial state of demanded VLs
set to zero.
When not in the prologue we do not want to set the FrameSetup flag. By
passing the flag as an argument we can use allocateStack correctly in
those cases.
This fixes the allocation and probe in eliminateCallFramePseudoInstr.
Instead of using dyn_cast, just use isa combined with accessors on the base
VectorType class. Working towards being able to merge code from some of
these routines.