This is a partial revert of e947f953370abe8ffc8713b8f3250a3ec39599fe.
It caused a miscompile in downstream testing.
Spoke with Philip offline. We believe the issue is that LSR needs to
make sure the Step of the other AddRec is non-zero. Reverting until
Philip is back from vacation.
We lower overflow arithmetic operations to their M68kISD counterparts, which
produce results of {i16/i32, i8} in which the second result represents the CCR.
When we're certain there won't be an overflow, for instance in 8-bit and
16-bit multiplications, we simply use zero in place of the second result.
This patch replaces an M68kISD::CMOV that takes this kind of zero or
all-ones CCR as its condition value with the corresponding operand value.
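As a rough illustration of the fold described above, here is a simplified model of a conditional move whose condition is a known constant. This is not the actual M68k lowering code: `foldSelect`, its parameters, and the mapping of all-ones to the "true" operand are assumptions made for the sketch, and the real M68kISD::CMOV operand order and condition codes are not modeled.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model: Cond is the CCR value if it is a known constant,
// or std::nullopt if it is only known at run time.
// Returns the operand the select folds to, or std::nullopt if no fold applies.
std::optional<int> foldSelect(std::optional<uint8_t> Cond, int TrueVal,
                              int FalseVal) {
  if (!Cond)
    return std::nullopt; // condition unknown: keep the conditional move
  if (*Cond == 0x00)
    return FalseVal; // constant zero: the condition can never hold
  if (*Cond == 0xFF)
    return TrueVal; // constant all-ones: the condition always holds
  return std::nullopt; // any other constant is left alone in this sketch
}
```

With a constant condition the conditional move disappears entirely, and only one of its two value operands survives.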
M68k only has a 16-bit x 16-bit -> 32-bit variant for multiplications
taking 16-bit operands. We still define two input operands for this
class of instructions, and tie the first operand to the result value.
The problem is that the two operands have different register classes
(DR32 and DR16), so making these instructions commutative produces an
invalid MachineInstr (though the final assembly will still be correct).
The codegen logic for overflow arithmetic (e.g. llvm.uadd.with.overflow)
was a mess; overflow multiplications were not even supported.
This patch cleans up the legalization of overflow arithmetic and adds
support for common variants of overflow multiplications.
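For readers unfamiliar with the overflow intrinsics, the following sketch models in plain C++ what the unsigned 32-bit add and multiply variants compute: a wrapped result plus an overflow bit. The function names are made up for this example, and nothing here reflects how the M68k backend actually legalizes these operations.

```cpp
#include <cstdint>
#include <utility>

// Models the {result, overflow} pair of an unsigned 32-bit overflow add.
std::pair<uint32_t, bool> uaddWithOverflow(uint32_t A, uint32_t B) {
  uint32_t Sum = A + B;  // unsigned arithmetic wraps modulo 2^32
  return {Sum, Sum < A}; // the sum wrapped iff it is smaller than an input
}

// Models the {result, overflow} pair of an unsigned 32-bit overflow multiply.
std::pair<uint32_t, bool> umulWithOverflow(uint32_t A, uint32_t B) {
  uint64_t Wide = static_cast<uint64_t>(A) * B; // exact 64-bit product
  return {static_cast<uint32_t>(Wide), (Wide >> 32) != 0};
}
```

The widening multiply also shows why a 16-bit x 16-bit product can never overflow a 32-bit result: the largest possible product, 0xFFFF * 0xFFFF, is 0xFFFE0001.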
This implements experimental support for the Zimop extension as
specified here:
https://github.com/riscv/riscv-isa-manual/blob/main/src/zimop.adoc.
This change adds only assembly support.
---------
Co-authored-by: ln8-8 <lyut.nersisyan@gmail.com>
Co-authored-by: ln8-8 <73429801+ln8-8@users.noreply.github.com>
The Machine Copy Propagation pass may miss some opportunities to further
remove redundant copy instructions during the ForwardCopyPropagateBlock
procedure. When we clobber a "Def" register, we also need to remove the record
in the copy maps that indicates "Src" defined "Def", to ensure the correct semantics
of the ClobberRegister function. This patch reapplies #70778 and addresses the corner
case bug #73512 specific to the AMDGPU backend. Additionally, it refines the criteria
for removing empty records from the copy maps, thereby enhancing overall safety.
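The bookkeeping described above can be sketched with a toy tracker. This is only an illustration under simplifying assumptions: registers are plain integers, `ToyCopyTracker` and its members are hypothetical names, and the real tracker in MachineCopyPropagation.cpp handles register units, sub-registers, and more.

```cpp
#include <map>
#include <set>

// Toy model of the copy maps; not LLVM's actual data structure.
struct ToyCopyTracker {
  std::map<int, int> DefToSrc;            // Def -> Src for "Def = COPY Src"
  std::map<int, std::set<int>> SrcToDefs; // Src -> every Def copied from it

  void trackCopy(int Def, int Src) {
    clobber(Def); // the copy overwrites whatever Def held before
    DefToSrc[Def] = Src;
    SrcToDefs[Src].insert(Def);
  }

  void clobber(int Reg) {
    // Reg as a "Def": drop its copy and the matching entry under its "Src".
    auto It = DefToSrc.find(Reg);
    if (It != DefToSrc.end()) {
      auto SrcIt = SrcToDefs.find(It->second);
      if (SrcIt != SrcToDefs.end()) {
        SrcIt->second.erase(Reg);
        if (SrcIt->second.empty())
          SrcToDefs.erase(SrcIt); // do not leave an empty record behind
      }
      DefToSrc.erase(It);
    }
    // Reg as a "Src": every copy that read it is no longer valid.
    auto SrcIt = SrcToDefs.find(Reg);
    if (SrcIt != SrcToDefs.end()) {
      for (int Def : SrcIt->second)
        DefToSrc.erase(Def);
      SrcToDefs.erase(SrcIt);
    }
  }
};
```

If clobbering a "Def" only removed the entry from `DefToSrc`, the stale entry under its "Src" would still claim that "Src" defined "Def", which is the kind of inconsistency the patch description refers to.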
For more information, see the code generated for the C++ test case in
"vector.body" after the MCP pass: https://gcc.godbolt.org/z/nK4oMaWv5.
Demonstrate `IMPLICIT_DEF implicit-def ...` can be generated after
coalescing on PPC.
The case is reduced from the failure in #75570. The failure is triggered
after #75271.
By looking at whether a global is large instead of looking at the code
model.
This also fixes references to large data in the small code model.
We now always fold any 32-bit offset into the addressing mode with the
large code model since it uses 64-bit relocations.
Depositing a value into the lowest byte/word is a common code pattern.
This patch improves the code generation for it to avoid redundant AND
and OR operations.
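For illustration, the source-level pattern in question looks roughly like the following. The function names are made up for this example, and which instructions the backend selects for them is not shown here.

```cpp
#include <cstdint>

// Replace the lowest byte of Base with the lowest byte of Val,
// written as the mask-and-merge (AND + OR) sequence.
uint32_t depositLowByte(uint32_t Base, uint32_t Val) {
  return (Base & 0xFFFFFF00u) | (Val & 0x000000FFu);
}

// Same idea for the lowest 16-bit word.
uint32_t depositLowWord(uint32_t Base, uint32_t Val) {
  return (Base & 0xFFFF0000u) | (Val & 0x0000FFFFu);
}
```

On a target with sub-register writes, such a sequence can in principle become a single byte or word move into the low part of the register, which is what makes the explicit AND and OR redundant.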
IR intrinsics were already defined, but no codegen support had been
added.
I extracted this code from our downstream. Some of it may have come from
https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi/ originally.
- Adds a new +pc option to -mbranch-protection that will enable
the use of PC as a diversifier in PAC branch protection code.
- When +pauth-lr is enabled (-march=armv9.5a+pauth-lr) in combination
with -mbranch-protection=pac-ret+pc, the new 9.5-a instructions
(pacibsppc, retaasppc, etc.) are used.
Documentation for the relevant instructions can be found here:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/
Co-authored-by: Lucas Prates <lucas.prates@arm.com>
This presents misleading and confusing output. If you have a function
defined at the beginning of an XCOFF object file, and you have a
function call to an external function, the function call disassembles as
a branch to the local function. That is,
`void f() { f(); g();}`
disassembles as
>00000000 <.f>:
0: 7c 08 02 a6 mflr 0
4: 94 21 ff c0 stwu 1, -64(1)
8: 90 01 00 48 stw 0, 72(1)
c: 4b ff ff f5 bl 0x0 <.f>
10: 4b ff ff f1 bl 0x0 <.f>
With this PR, the second call will display:
`10: 4b ff ff f1 bl 0x0 <.g>`
Using -r can help, but you still get the confusing output:
>10: 4b ff ff f1 bl 0x0 <.f>
00000010: R_RBR .g
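To see why both calls print the same target, it helps to decode the two `bl` words by hand. The sketch below decodes a PowerPC I-form branch (primary opcode 18); `branchTarget` is a name chosen for this example and is not the disassembler's own code.

```cpp
#include <cstdint>

// Decodes the target of a PowerPC I-form branch (b, ba, bl, bla).
// Insn is the 32-bit instruction word, Addr the address of the instruction.
uint32_t branchTarget(uint32_t Insn, uint32_t Addr) {
  // Displacement is LI || 0b00: the low 26 bits with AA and LK cleared.
  int32_t Disp = static_cast<int32_t>(Insn & 0x03FFFFFCu);
  if (Disp & 0x02000000) // sign bit of the 26-bit displacement
    Disp -= 0x04000000;
  bool Absolute = (Insn & 0x2u) != 0; // AA bit
  uint32_t UDisp = static_cast<uint32_t>(Disp);
  return Absolute ? UDisp : Addr + UDisp; // relative targets wrap modulo 2^32
}
```

For the listing above, `branchTarget(0x4bfffff5, 0xc)` and `branchTarget(0x4bfffff1, 0x10)` both evaluate to 0, so the encoded bits alone cannot distinguish the call to `.g` from the call to `.f`; only the R_RBR relocation identifies the real callee.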
[TLI] Pass replace-with-veclib works with Scalable Vectors.
The pass is heavily refactored.
It uses the Masked variant of a TLI method when the Intrinsic operates on Scalable Vectors.
Improve tests for ArmPL and SLEEF Intrinsics:
- Auto-generate test `armpl-intrinsics.ll`, and use active lane mask to have shorter `shufflevector` check lines.
- The update scripts now add `@llvm.compiler.used` instead of using the regex: `@[[LLVM_COMPILER_USED:[a-zA-Z0-9_$"\\.-]+]]`
- Add simplifycfg pass and noalias to ensure tail folding. `noalias` attribute was added only to the `%in.ptr` parameter of the ArmPL Intrinsics.
This adds post-legalizer lowering of G_UNMERGE_VALUES instructions that take a vector and
produce a scalar value for each lane. They are converted to a G_EXTRACT_VECTOR_ELT
for each lane, allowing all the existing tablegen patterns to apply to them.
A couple of tablegen patterns need to be altered to make sure the type of the
constant operand is known, so that the patterns are recognized under global
isel.
Closes #75662
emitPopInst checks a single function exit MBB. If other paths also exit
the function and any of their terminators uses LR implicitly, it is not
safe to clear the Restored bit.
Check all terminators for the function before clearing Restored.
This fixes a mis-compile in outlined-fn-may-clobber-lr-in-caller.ll
where the machine-outliner previously introduced BLs that clobbered LR
which in turn is used by the tail call return.
Alternative to #73553
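The check described above amounts to an "all terminators" test rather than a "this exit block" test. A minimal sketch, assuming a toy representation in which each terminator only records whether it implicitly uses LR (`ToyTerminator` and `canClearRestored` are hypothetical names, not the ARM backend's API):

```cpp
#include <algorithm>
#include <vector>

// Toy stand-in for a terminator instruction of the function.
struct ToyTerminator {
  bool UsesLRImplicitly;
};

// Clearing the Restored bit is only safe if no terminator anywhere in the
// function implicitly uses LR, not merely the one in the block being popped.
bool canClearRestored(const std::vector<ToyTerminator> &AllTerminators) {
  return std::none_of(
      AllTerminators.begin(), AllTerminators.end(),
      [](const ToyTerminator &T) { return T.UsesLRImplicitly; });
}
```

A tail call that returns through LR on a different exit path is the case that a check restricted to one exit block would miss.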
Fix bitcast test, which was splitting apart phis intended to force
bitcasts that survive all the way to selection.
Disable the amdgpu-codegenprepare phi splitting, which defeats the technique
of using a phi to ensure a bitcast reaches all the way to selection. Also
add a variety of bfloat tests. These probably need revisiting to avoid the
cast folding into argument loads. Also round out the set of bfloat bitcast and
ABI tests.
Add codegen tests for more bf16 operations. The promotion of these works
contrary to the comment.