…n in DWARF. (#167666)"
This reverts commit 418ba6e8ae2cde7924388142b8ab90c636d2c21f.
The commit caused an ICE due to hitting unreachable in
llvm/lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp:1307
Fixes#182337
Given the test case:
struct CBase {
virtual void foo();
};
void bar(CBase *Base) {
Base->foo();
}
and using '-emit-call-site-info' with llc, the following DWARF
is produced for the indirect call 'Base->foo()':
1$: DW_TAG_structure_type "CBase"
...
2$: DW_TAG_subprogram "foo"
...
3$: DW_TAG_subprogram "bar"
...
4$: DW_TAG_call_site
...
We add DW_AT_LLVM_virtual_call_origin to existing call-site
information, linking indirect calls to the function-declaration
they correspond to.
4$: DW_TAG_call_site
...
DW_AT_LLVM_virtual_call_origin (2$ "_ZN5CBase3fooEv")
The new attribute DW_AT_LLVM_virtual_call_origin helps to
address the ambiguity to any consumer due to the usage of
DW_AT_call_origin.
The functionality is available to all supported debuggers.
I want to enable the frame pointer when the call frame size is too large
to access emergency spill slots. To do that I need to know the call
frame size early enough to reserve FP.
The code here is copied from AArch64. ARM does the same. I did not check
other targets.
Splitting this off separately because it stops us from unnecessarily
reserving the base pointer in the some RVV tests. That appears to due to
this check
(!hasReservedCallFrame(MF) && (!MFI.isMaxCallFrameSizeComputed() ||
MFI.getMaxCallFrameSize() != 0))) &&
By calculating early !MFI.isMaxCallFrameSizeComputed() is no longer true
and the size is zero.
WADDAU is rd += zext(rs1) + zext(rs2)
If we only have 1 32-bit input can force rs2 to avoid zeroing the upper
part of a register pair to use ADDD.
Unfortunately, WADDAU clobbers rd so it might need a GPRPair copy
if we need the old value of rd. We might need to look into that in
the future. Maybe we could have convertToThreeAddress could turn
it back into ADDD+WADDU or ADDD+LI.
Assisted-by: claude
I think we may want to be able to fold ADDD nodes independent of the MUL
in some cases. For example turning NSRAI into NSRARI.
If we fold ADDD into WMACC we would need to be able to extract it again.
Keep the nodes separate avoids this.
Code change was assisted by AI.
For an i64 shift by a constant < 32 on RV32, we can use NSRLI
with 32-ShAmt to calculate the high half of the result.
For non-constant shifts, we can use SLX and some bit tricks to
avoid branches. I wanted to use the target independent code from
TargetLowering, but it currently produces worse code.
Assisted-by: claude
If the shift amount might be in the range [0, 31], we can use
NSRL/NSRA to shift the i64 value to compute the lower 32 bits of
the result.
If the shift amount is >= 32, the high half of the result is all
zeros or sign bits. Otherwise it is a srl/sra of the high bits.
I've handled the constant case in ReplaceNodeResults but deferred
the non-constant case to lowerShiftRightParts. This function is
not called for constants. This gives the opportunity for DAGCombine to
optimize the SRL_PARTS/SRA_PARTS if the shift amount can be proven
to be >= 32 or < 32.
Sequences were also discussed on the P extension mailing list here
https://lists.riscv.org/g/tech-p-ext/message/861
Assisted-by: claude
Basically https://github.com/llvm/llvm-project/pull/168506 but for
riscv, so to be clear the hard work here is @heiher 's. I figured we may
as well get some extra eyeballs on this from riscv too.
Previously the riscv backend could not handle `musttail` calls with more
arguments than fit in registers, or any explicit `byval` or `sret`
parameters/return values. Those have now been implemented.
This is part of my push to get more LLVM backends to support `byval` and
`sret` parameters so that rust can stabilize guaranteed tail call
support. See also:
- https://github.com/llvm/llvm-project/pull/168956
- https://github.com/rust-lang/rust/issues/148748
---------
Co-authored-by: WANG Rui <wangrui@loongson.cn>
Similar to #180706, the masked off lanes in vp.reverse are poison so can
be replaced with anything. Because of this, we should be able to fold a
masked vp.reverse(vp.load) into a vp.strided.load stride=-1 even when
the mask isn't all ones.
We have combines for vp.reverse(vp.load) -> vp.strided.load stride=-1
and vp.store(vp.reverse) -> vp.strided.store stride=-1.
If the load or store is masked, the mask needs to be also a vp.reverse
with the same EVL. However we also have the requirement that the mask's
vp.reverse is unmasked (has an all-ones mask).
vp.reverse's mask only sets masked off lanes to poison, and doesn't
affect the permutation of elements. So given those lanes are poison, I
believe the combine is valid for any mask, not just all ones.
This is split off from another patch I plan on posting to generalize
those combines to vector.splice+vector.reverse patterns, as part of
#172961
These are 3 variations of the same operation with a different operand
tied to the destination register. We need to pick the one that
minimizes the number of mvs.
To do this we take the approach used by AArch64 to select between
BIT, BIF, and BSL which the same operations. We define a pseudo
with no tied constraint and expand it after register allocation based
on where the destination register ended up. If the destination
register is none of the operands, we'll insert a mv.
I've replaced RISCVISD::MVM with RISCVISD::MERGE and updated the operand
order accordingly. I find the MERGE name easier to read so I've made it
the canonical name.
Ideally we could use commuteInstructionImpl and the
TwoAddressInstructionPass
to select the opcode before register allocation. That only works if
you can commute exactly 2 operands and maybe change the opcode in the MI
representation of any of the forms to get to the either of the other 2
forms.
That is not possible. We'd need to define 3 more pseudoinstructions
with different permutations.
With the current approach it might be possible that we insert a mv
not because all of the operand registers we needed by later
instructions,
but because the register allocator needed to put the result in a
different register. It's possible a different allocation for other
instructions might have avoided the mv.
I wrote the patch based on the AArch64, but the tests were generated
by AI.
When comparing multi-word integers with Zicond, we generate:
(or (czero_eqz (lo1 < lo2), (hi1 == hi2)),
(czero_nez (hi1 < hi2), (hi1 == hi2)))
The czero_nez is redundant because when hi1 == hi2 is true, hi1 < hi2 is
already 0. This patch adds a DAG combine to recognize:
czero_nez (setcc X, Y, CC), (setcc X, Y, eq) -> (setcc X, Y, CC)
when CC is a strict inequality (lt, gt, ult, ugt).
This saves one instruction in 128-bit comparisons on RV64 with Zicond.
Note the czero_nez becomes a czero.eqz in the final assembly because the
seteq is replaced by an xor that produces 0 when the values are equal.
Part of #179584
Assisted-by: claude
Compressing to a single shuffle doesn't remove any information and the backend can better apply specific optimizations to a single shuffle.
Addresses #176218.
---------
Co-authored-by: Luke Lau <luke_lau@igalia.com>
Extend the existing combineADDDToWMACC DAG combine to also match
RISCVISD::WMULSU and produce RISCVISD::WMACCSU. This is similar to
how ADDD+UMUL_LOHI is combined to WMACCU and ADDD+SMUL_LOHI is
combined to WMACC.
This patch was generated by AI, but I reviewed it.
We add pseudos/patterns for `vabs.v` instruction and handle the
lowering in `RISCVTargetLowering::lowerABS`.
Reviewers: topperc, 4vtomat, mshockwave, preames, lukel97, tclin914
Reviewed By: mshockwave
Pull Request: https://github.com/llvm/llvm-project/pull/180142
We directly lower `ISD::ABDS`/`ISD::ABDU` to `Zvabd` instructions.
Note that we only support SEW=8/16 for `vabd.vv`/`vabdu.vv`.
Reviewers: mshockwave, lukel97, topperc, preames, tclin914, 4vtomat
Reviewed By: lukel97, topperc
Pull Request: https://github.com/llvm/llvm-project/pull/180141
Combine the pattern:
ADDD(addlo, addhi, UMUL_LOHI(x, y).0, UMUL_LOHI(x, y).1)
into:
WMACCU(x, y, addlo, addhi)
And similarly for SMUL_LOHI -> WMACC.
This patch was written with AI, but I reviewed it carefully.
We already do all the checks necessary in order to prioritize
MULHU/MULHS/UMUL_LOHI/SMUL_LOHI over MULHSU/WMULSU. We might as
well just emit the nodes instead of letting generic type legalization
redo the checks.
This is slightly different than the default legalization because we
don't have access to ExpandInteger so we have to emit TRUNCATES and
BUILD_PAIR. Not sure if this will result in any differences in practice.
Add custom lowering for INSERT_VECTOR_ELT on P extension vector types
using the MVM instruction.
TODO: Handle <4 x i8> on RV64 which is constructed to extract_vector_elt
+ build_vector instead of insert_vector_elt.
Order the operands so the the low and high part of the rs1 regpair are
first, followed by the low and high part of the rs2 regpair.
Also change the type to use v4i8 for the result so that it's only
shuffling elements not combining elements into a larger elment.
I'm planning to add ADDD and SUBD opcodes that will be defined with the
same operand order allowing RISCVISelDAGToDAG.cpp code to be shared.
There's a good chance we'll want to use these for scalar too.
Drop vector type from SDTypeProfile. Remove PMULHSU since we already
have RISCVISD::MULHSU for scalars in the base ISA.
If the source of an fpto*i doesn't fit in the destination type, the
result is poison. For i1 destinations, this means the result needs to be
0 or 1/-1, so we can just compare the result to 0 directly instead of
truncating.
The VP lowering for fpto*i already does this.
XOR should be OR to match the comment.
Found while reviewing #179622 which deletes this function. I would like
to commit this first so we have a correct baseline for reviewing that
patch.
This is part of the work to remove trivial VP intrinsics.
In the combineVectorSizedSetCCEquality combine, used for the compares
that ExpandMemcmp generates, we currently emit a VP_SETCC. We can just
emit a regular SETCC and let RISCVVLOptimizer take care of reducing the
VL.
This commit enables code generation for RISC-V targeting Mach-O:
- Implement RISCVMachOTargetObjectFile::getNameWithPrefix method to
handle Mach-O symbol naming requirements.
- Use shouldAssumeDSOLocal() in RISCVTargetLowering::lowerGlobalAddress
instead of isDSOLocal() for proper Mach-O semantics in global address
lowering. Note that this is a NFC for RISCV when targeting ELF.
- Add comprehensive tests for various relocation types (direct globals,
GOT-based addressing, static vs PIC models).
- Test function calls, tail calls, and various symbol reference patterns
including addends and subtractions.
This patch is based on code originally written by Tim Northover.
We also get some i32->i64 promotion for CLMULH. The DAGCombiner
change is to prevent an infinite loop from that.
Test file was rewritten to cover all types and split between clmul
and clmulh.
I added a couple masked tests to show that VectorPeephole works.
The test outputs were already large so I didn't want to add more than a couple.
This reverts commit e3156c531da5aa4ec604605ed4e19638879d773c.
We need to resolve a crash on trunk and LLVM 22. Reverting makes it
easier to backport.
Fixes#176637.
I did not replace riscv.clmulh/clmulr since those require a multiple
instruction pattern match. I wanted to ensure that -O0 will select the
correct instructions without relying on combines.
This patch does the minimum to remove RISCVISD::CLMUL*. It does not
remove existing intrinsics.
There's some missed optimizations for i32 CLMULH/CLMULR on RV64, but
those may be generic issues.
I've put the test cases in the existing files so it's more obvious what
the missed optimizations are by comparing within the file.
This prevents getCommonSubClass from finding them before VR/VRNoV0.
Fixes a crash reported post-commit in #171231. getCommonSubClass
returned one of these classes, but it doesn't have the same VTs as
VR/VRNoV0 leading to an assertion failure.
The subregister-undef-early-clobber.mir still ends up finding these
register classes in the InitUndef pass.
For sub-register width vectors (v2i16, v4i8) on RV64 with P-extension,
the type legalizer widens them to legal types, i.e. v4i16, v8i8, before
they're getting unrolled, so they'll be redundant computation for higher
part of register.
The correct way to handle is similar to widening div/rem where there's
undef padded for high part.
stack on: https://github.com/llvm/llvm-project/pull/176093
P extension packed SIMD types are passed in GPRs. For types larger than
XLen (e.g. v8i8 on RV32), they are split and passed via the 2XLen
mechanism, similar to i64 on RV32.
FIXME: Need to figure out the mechanism when P and V are enabled at the
same time.
stack on: https://github.com/llvm/llvm-project/pull/176193
Splits out change from https://github.com/llvm/llvm-project/pull/176015
Changes shouldExpandAtomicRMWInIR to take a constant argument: This is
to allow some other TargetLowering constant-argument functions to call
it. This change touches several backends. An alternative solution
exists, but to me, this seems the "right" way.