On targets without ADDCARRY or ADDE, we need to emit a separate
SETCC to determine carry from the low half to the high half. Usually
we do (setult Lo, LHSLo). If RHSLo is 1 we can instead do (seteq Lo, 0).
This can reduce the live range of LHSLo.
On targets that lack ADDCARRY support we split a wide uaddo into
an ADD and a SETCC that both need to be split.
For (uaddo X, 1) we can observe that when the add overflows the result
will be 0. We can emit (seteq (or Lo, Hi), 0) to detect this.
This improves D142071.
There is an alternative here. We could use either ~(lo(X) & hi(X)) == 0 or
(lo(X) & hi(X)) == -1 before the addition. That would be closer to the
code before D142071.
Reviewed By: liaolucy
Differential Revision: https://reviews.llvm.org/D144614
Lower the two intrinsics introduced in D141924.
These intrinsics can be combined with loads and stores into the much more efficient segmented load and store instructions in a following patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D144092
Adds intrinsics for the following SME2 instructions:
- srshl (single, 2 & 4 vector)
- srshl (multi, 2 & 4 vector)
- urshl (single, 2 & 4 vector)
- urshl (multi, 2 & 4 vector)
NOTE: These intrinsics are still in development and are subject to future changes.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D144118
If more registers are needed for VAddr then the NSA format allows then the
final register can act as a contigous set of remaining addresses. Update
legalizer to pack register for this new format and allow instruction
selection to use NSA encoding when number of addresses exceeds max size.
Also update SIShrinkInstructions to handle partial NSA.
Differential Revision: https://reviews.llvm.org/D144034
The patch ensures last two operands of vp.abs/ctlz/cttz are mask and evl.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144536
This introduces R_RISCV_PLT32, PC-relative data relocation that takes
the 32-bit relative offset to a function or its PLT entry from its
relocation location.
This is needed to support relative vtables on RISCV.
Github PR: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/363
The lld handling of this reloc is D143115.
Differential Revision: https://reviews.llvm.org/D143226
LSR can include Regs of AddRec SCEVs from different loops, which do not combine
well when added in Scalar Evolution. As they should never produce constant
differences so we can just guard against trying to create them.
Fixes#60927
65420c8041f4 introduced an ICE in combineMinNumMaxNum(...) when
combineMinNumMaxNumImpl(...) returns an SDValue(). Make sure to check that a
value is returned before trying to perform an FNEG on it.
GitHub Issue: #60924
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144571
A mistake in the control flow of performMemPairCombine resulted in paired
loads/stores for types that were not supported by the instructions (i8/i16).
These loads/stores could not match the constraints of the patterns defined
in the THead td file and the compiler would throw a 'Cannot select' error.
This is now fixed and two new test functions have been added in xtheadmempair.ll
which would previously crash the compiler. The compiler was additionally tested
with a wide range of benchmarks and no issues were observed.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144559
Currently, raw_buffer_load_{i8,i16} and struct_buffer_load_{i8,i16}
intrinsics are lowered as buffer_load_{u8,u16}. This patch combines
buffer_load_{u8,u16} and sign extension instructions in order to
generate buffer_load_{i8,i16} instructions.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D144313
v_permlane16_b32 and v_permlanex16_b32 should not set abs and neg src
modifiers on any input, but they can set op_sel on src0 or src1 to
represent fi or bc when desired. The ISel patterns were setting
the src_modifier bits to -1, effectively setting abs and neg as well,
whenever it was intended to set op_sel, due to an error in ISel. ISel
should now correctly only set the op_sel bits.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D144519
These checks show optimized instructions if an operand is known to be
(partially) zero.
Change-Id: Ie2f6d0d3ee9d5b279d1f4c1dd0787492e39cc77a
Differential Revision: https://reviews.llvm.org/D140208
D139469 "[AMDGPU] Enable OMod on more VOP3 instructions" caused an
assertion failure when trying to fold into src2 of V_FMAC_F16. It would
temporarily convert the instruction to V_FMA_F16_gfx9 and add an opsel
operand, but if the fold still failed then it would forget to remove the
opsel operand.
Differential Revision: https://reviews.llvm.org/D144558
RISC-V vector instruction has register overlapping constraint for certain
instructions, and will cause illegal instruction trap if violated, we use
early clobber to model this constraint, but it can't prevent register allocator
allocated same or overlapped if the input register is undef value, so convert
IMPLICIT_DEF to temporary pseudo could prevent that happen, it's not best way
to resolve this. Ideally we should model the constraint right, but before we
model the constraint right, it's the approach to prevent that happen.
See also: https://github.com/llvm/llvm-project/issues/50157
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D129735
This partially reverts a regression introduced in 8f25e382c5b1 for
AArch64 targets. In particular, we restore the logic of `(abs (sub nsw
x, y)) -> abds(x, y)` for all targets except X86, which keeps the logic
introduced in 8f25e382c5b1. See also https://reviews.llvm.org/D142288.
Differential Revision: https://reviews.llvm.org/D144379
After change with D144169, the codegen generates redundant instructions
like and and wrap. This fixes it.
Differential Revision: https://reviews.llvm.org/D144360
This change adds a new spv_undef intrinsic which is emitted in place of
aggregate undef operands and later selected to single OpUndef SPIR-V
instruction. The behavior matches that of Khronos SPIR-V Translator and
should support nested aggregates.
Differential Revision: https://reviews.llvm.org/D143107
These tests show inefficient behavior that will be optimized by a
later change.
By using Known Bits Analysis, we can avoid unnecessary multiplications
or additions with 0.
Interleaves generated with vwaddu.vv and vwmaccu.vx were using VLs that
were twice the number of elements actually needed in the vector.
This also pulls the interleaving logic out into its own function so it
can be reused by later patches, and adapts it for scalable vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D144386
Adds a new pass that removes functions
if they use features that are not supported on the current GPU.
This change is aimed at preventing crashes when building code at O0 that
uses idioms such as `if (ISA_VERSION >= N) intrinsic_a(); else intrinsic_b();`
where ISA_VERSION is not constexpr, and intrinsic_a is not selectable
on older targets.
This is a pattern that's used all over the ROCm device libs. The main
motive behind this change is to allow code using ROCm device libs
to be built at O0.
Note: the feature checking logic is done ad-hoc in the pass. There is no other
pass that needs (or will need in the foreseeable future) to do similar
feature-checking logic so I did not see a need to generalize the feature
checking logic yet. It can (and should probably) be generalized later and
moved to a TargetInfo-like class or helper file.
Reviewed By: arsenm, Joe_Nash
Differential Revision: https://reviews.llvm.org/D139000
These tests show inefficient behavior that will be optimized by a
later change.
By using Known Bits Analysis, we can avoid unnecessary multiplications
or additions with 0.
Summary: Currently we lower MEMCPY/MEMMOVE/MEMSET/BZERO to the corresponding libc functions. And the libc functions call the millicode functions on AIX. We can lower these intrinsics directly to save one call layer.
Reviewed By: shchenz
Differential Revision: https://reviews.llvm.org/D143997
Without this patch:
return X < 4 ? 3 : 2;
return X < 9 ? 7 : 6;
are compiled as:
31 c0 xor %eax,%eax
83 ff 04 cmp $0x4,%edi
0f 93 c0 setae %al
83 f0 03 xor $0x3,%eax
31 c0 xor %eax,%eax
83 ff 09 cmp $0x9,%edi
0f 92 c0 setb %al
83 c8 06 or $0x6,%eax
respectively. With this patch, we generate:
31 c0 xor %eax,%eax
83 ff 04 cmp $0x4,%edi
83 d0 02 adc $0x2,%eax
31 c0 xor %eax,%eax
83 ff 04 cmp $0x4,%edi
83 d0 02 adc $0x2,%eax
respectively, saving 3 bytes while reducing the tree height.
This patch recognizes the equivalence of OR and ADD
(if bits do not overlap) and delegates to combineAddOrSubToADCOrSBB
for further processing. The same applies to the equivalence of XOR
and SUB.
Differential Revision: https://reviews.llvm.org/D143838
Add a member function isPPC64ELFv2ABI() to determine what ABI is used on the
64-bit PowerPC big endian operating environment.
Reviewed By: nemanjai, dim, pkubaj
Differential Revision: https://reviews.llvm.org/D144321
The AArch64 backend will use legal BUILDVECTORs for zero vectors or all-ones
vectors, so during selection tablegen patterns get rely on immAllZerosV and
immAllOnesV pattern frags in patterns like vnot. It was not always consistent
though, which this patch attempt to fix by recognizing where constant splat +
insert vector element is used. The main outcome of this will be that full
vector movi v0.2d, #0000000000000000 will be used as opposed to movi d0, #0, as
per https://reviews.llvm.org/D53579. This helps simplify what tablegen will
see, to make pattern matching simpler.
Differential Revision: https://reviews.llvm.org/D144018
Adds intrinsics for the following SME2 instructions (2 & 4 vectors):
- smlall
- smlsll
- umlall
- umlsll
- usmlall
NOTE: These intrinsics are still in development and are subject to future changes.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D143277