In PowerPC, if a borrow occurs during a subtraction, the carry bit is
zero (unset). The carry bit is set if no borrow occurs.
For ISD::USUBO_CARRY, the nodes produce two results: the normal result
of the addition or subtraction, and a boolean value that is 1 if and
only if there is an outgoing carry or borrow.
Therefore, we need to convert a 1 (which indicates a borrow in
ISD::USUBO_CARRY) to 0 to match PowerPC's definition of borrow.
Similarly, we need to convert a 0 (no borrow in ISD::USUBO_CARRY) to 1
for PowerPC.
To perform this conversion, we use XOR 1 instead of XOR
DAG.getAllOnesConstant(DL, CarryOp.getValueType()).
`
This patch adds the following Dense Math Facility 16-bit half-precision
floating-point calculation instructions: dmxvf16gerx2, dmxvf16gerx2pp,
dmxvf16gerx2pn, dmxvf16gerx2np, dmxvf16gerx2nn, pmdmxvf16gerx2,
pmdmxvf16gerx2pp, pmdmxvf16gerx2pn, pmdmxvf16gerx2np, pmdmxvf16gerx2nn,
along with their corresponding intrinsics and tests.
Add a pre- commit test case for Patch
https://github.com/llvm/llvm-project/pull/111696
Test ppc-vsx-fma-mutate pass work with
-schedule-ppc-vsx-fma-mutation-early not hoist the instruction
`xxspltiw vs2, 1170469888` out the loop.
---------
Co-authored-by: Amy Kwan <amy.kwan1@ibm.com>
This patch adds the following Dense Math Facility bfloat16
floating-point calculation instructions: dmxvbf16gerx2,
dmxvbf16gerx2pp,dmxvbf16gerx2pn, dmxvbf16gerx2np, dmxvbf16gerx2nn,
pmdmxvbf16gerx2, pmdmxvbf16gerx2pp, pmdmxvbf16gerx2pn,
pmdmxvbf16gerx2np, pmdmxvbf16gerx2nn, along with their corresponding
intrinsics and tests.
The PR will fix the issue
https://github.com/llvm/llvm-project/issues/122728
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
[NFC] add a pre-commit test case for patch [Eliminating li of 0 into arg
registers of unused
arguments](https://github.com/llvm/llvm-project/pull/122741)
The test case tests that extend poison are lower to undef and also test
there are redendunt instrution load 0 into argument registers for unused
arguments.
ISD::ADDC, ISD::ADDE, ISD::SUBC and ISD::SUBE are being deprecated,
using ISD::UADDO_CARRY,ISD::USUBO_CARRY instead. Lowering the UADDO,
UADDO_CARRY, USUBO, USUBO_CARRY in the patch.
The llvm.metadata section is not emitted and has special semantics. We
should not merge globals in it, similarly to how we already skip merging
of `llvm.xyz` globals.
Fixes https://github.com/llvm/llvm-project/issues/131394.
The existing pretty printer generates excessive parentheses for
MCBinaryExpr expressions. This update removes unnecessary parentheses
of MCBinaryExpr with +/- operators and MCUnaryExpr.
Since relocatable expressions only use + and -, this change improves
readability in most cases.
Examples:
- (SymA - SymB) + C now prints as SymA - SymB + C.
This updates the output of -fexperimental-relative-c++-abi-vtables for
AArch64 and x86 to `.long _ZN1B3fooEv@PLT-_ZTV1B-8`
- expr + (MCTargetExpr) now prints as expr + MCTargetExpr, with this
change primarily affecting AMDGPUMCExpr.
When generating code for functions that have `__builtin_frame_address`
calls and `noinline` attribute, prologue was not emitted correctly
leading to an assertion failure in PowerPC. The issue was due to
improper insertion of prologue for a function that contain llvm
`__builtin_frame_address`.
Shrink-wrap pass computes the save and restore points of a function.
Default points are the entry and exit points of the function. During
shrink-wrapping the frame-pointer was not honored like the stack pointer
and it was considered as a callee-saved register. This change will treat
the FP similar to SP and will insert the prolog on top the instruction
containing FP.
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
Update to use REG_SEQUENCE when possible.
This patch only update td pattern to utilize REG_SEQUENCE for
INSERT_SUBREG for cases where it does not produce
a nesting of REG_SEQUENCE. This seem to show some improvement in code
gen for `llvm/test/CodeGen/PowerPC/mmaplus-intrinsics.ll`.
Fixes part of https://github.com/llvm/llvm-project/issues/125502
This is NFC patch to capture the insertion of prologue and epilogue by
`shrinkwrap` pass for Powerpc target for functions that contain llvm
`__builtin_frame_address`.
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
This commit adds the following Dense Math Facility integer calculation
instructions: dmxvi8gerx4, dmxvi8gerx4pp, dmxvi8gerx4spp, pmdmxvi8gerx4,
pmdmxvi8gerx4pp, and pmdmxvi8gerx4spp, along with their corresponding
intrinsics and tests.
This is meant as a preparation for PR #130988 "[AMDGPU] Implement IR
expansion for frem instruction" which implements the expansion of
another instruction in this pass. The more general name seems more
appropriate given this change and quite reasonable even without it.
Fixes https://github.com/llvm/llvm-project/issues/127119
Remove `hasSideEffects` from `xxspltib` since there is no special
restriction specified in the ISA that prevent it from being reordered,
move, CSE, or LICM. Removing this restriction will allow `xxspltib` to
be hoisted out of loop bodies.
If a (two-result) node like `FMODF` or `FFREXP` is expanded to a library
call, where said library has the function prototype like: `float(float,
float*)` -- that is it returns a float from the call and via an output
pointer. The first result of the node maps to the value returned by
value and the second result maps to the value returned via the output
pointer.
If only the second result is used after the expansion, we hit an issue
on x87 targets:
```
// Before expansion:
t0, t1 = fmodf x
return t1 // t0 is unused
```
Expanded result:
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = copy_from_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
So far things are alright, but the DAGCombiner optimizes this to:
```
ptr = alloca
ch0 = call modf ptr
// copy_from_reg optimized out
t1, ch1 = ldr ptr, ch0
return t1
```
On most targets this is fine. The optimized out `copy_from_reg` is
unused and is a NOP. However, x87 uses a floating-point stack, and if
the `copy_from_reg` is optimized out it won't emit a pop needed to
remove the unused result.
The prior solution for this was to attach the chain from the
`copy_from_reg` to the root, which did work, however, the root is not
always available (it's set to null during legalize types). So the
alternate solution in this patch is to replace the `copy_from_reg` with
an `X86ISD::POP_FROM_X87_REG` within the X86 call lowering. This node is
the same as `copy_from_reg` except this node makes it explicit that it
may lower to an x87 FPU stack pop. Optimizations should be more cautious
when handling this node than a normal CopyFromReg to avoid removing a
required FPU stack pop.
```
ptr = alloca
ch0 = call modf ptr
t0, ch1 = pop_from_x87_reg, ch0 // t0 unused
t1, ch2 = ldr ptr, ch1
return t1
```
Using this node ensures a required x87 FPU pop is not removed due to the
DAGCombiner.
This is an alternate solution for #127976.
The MI scheduler skips regions containing a single MI during scheduling.
This can prevent targets that perform multi-stage scheduling and move
MIs between regions during some stages to reason correctly about the
entire IR, since some MIs will not be assigned to a region at the
beginning.
This makes the machine scheduler no longer skip single-MI regions. Only
a few unit tests are affected (mainly those which check for the
scheduler's debug output).
Begin extending replaceShuffleOfInsert to handle other forms of scalar insertion into a vector.
I've limited this to targets that have Custom/Legal ISD::INSERT_VECTOR_ELT handling for now - although we can probably always fold this before LegalOperations.
The standard libcalls for half to float and float to half conversion are
__extendhfsf2 and __truncsfhf2. However, LLVM currently uses
__gnu_h2f_ieee and __gnu_f2h_ieee instead. As far as I can tell, these
libcalls are an ARM-ism and only provided by libgcc on that platform.
compiler-rt always provides both libcalls.
Use the standard libcalls by default, and only use the __gnu libcalls on
ARM.
ISD::ADDC, ISD::ADDE, ISD::SUBC and ISD::SUBE are being deprecated,
using ISD::UADDO_CARRY,ISD::USUBO_CARRY instead. Lowering the UADDO,
UADDO_CARRY, USUBO, USUBO_CARRY in the patch.
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.
Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
---------
Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10 year old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.
The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySource would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.
This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.
There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subrgs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.
We need improved subregister handling locally in PeepholeOptimizer,
and other pasess like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
This is needed for architectures that actually use strict pointer
arithmetic instead of integers such as AArch64 with FEAT_CPA (see
https://github.com/llvm/llvm-project/pull/105669) or CHERI. Using an
index as the first operand of pointer arithmetic may result in an
invalid output.
While there are quite a few codegen changes here, these only change the
order of registers in add instructions. One MIPS combine had to be
updated to handle the new node order.
Reviewed By: topperc
Pull Request: https://github.com/llvm/llvm-project/pull/125279