Replace uses of the old load output chain with the new load output
chain. A plain replacement here is fine because the transform verifies
the load is one-use.
Fixes https://github.com/llvm/llvm-project/issues/186549.
For `-mcpu=future`, add patterns to use paired vector instructions
(lxvp/lxvpx/stxvp/stxvpx)
for v256i1 operations instead of splitting into two separate vector
operations.
Assisted by AI.
Remove `_mma` from the following built-ins as they are not related to
MMA:
* __builtin_mma_dmsetdmrz
* __builtin_mma_dmmr
* __builtin_mma_dmxor
* __builtin_mma_build_dmr
* __builtin_mma_disassemble_dmr
AI Assisted.
This commit adds support for lwat/ldat atomic operations with function
code 16 (Compare and Swap Not Equal) via 4 clang builtins:
__builtin_amo_lwat_csne for 32-bit unsigned operations
__builtin_amo_ldat_csne for 64-bit unsigned operations
__builtin_amo_lwat_csne_s for 32-bit signed operations
__builtin_amo_ldat_csne_s for 64-bit signed operations
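As a rough illustration of what function code 16 does, here is a portable model of "Compare and Swap Not Equal", assuming the reading that the store happens only when the memory contents differ from the comparand and that the previous contents are always returned. `csne_model` and its parameter order are hypothetical and are not the signature of the new builtins.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical portable model of "Compare and Swap Not Equal".
// This is NOT the signature of __builtin_amo_lwat_csne; it only mirrors
// the assumed effect: store `value` when the current contents differ
// from `cond`, and always return the contents seen before the operation.
inline uint32_t csne_model(std::atomic<uint32_t> &mem, uint32_t cond,
                           uint32_t value) {
  uint32_t old = mem.load();
  // Stop when the location equals `cond` (no store) or the swap succeeds.
  // On failure compare_exchange_weak reloads `old` with the current value.
  while (old != cond && !mem.compare_exchange_weak(old, value)) {
  }
  return old;
}
```

With memory holding 5, `csne_model(m, 5, 9)` leaves it untouched, while `csne_model(m, 7, 9)` stores 9; both calls return 5.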
Currently, the AIX linker and loader do not provide a mechanism to
implement ifuncs similar to GNU_ifunc on ELF Linux.
On AIX, we will lower `__attribute__((ifunc("resolver")))` to the llvm
`ifunc` as other platforms do. The llvm `ifunc` in turn will get lowered
at late stages of the optimization pipeline to an AIX-specific
implementation. No special linkage or relocations are needed when
generating assembly/object output.
On AIX, a function `foo` has two symbols associated with it: a function
descriptor (`foo`) residing in the `.data` section, and an entry point
(`.foo`) residing in the `.text` section. The first field of the
descriptor is the address of the entry point. Typically, the address
field in the descriptor is initialized once: statically, at load time
(?), or at runtime if runtime linking is enabled.
Here we would like to use the address field in the descriptor to
implement the `ifunc` semantics. Specifically, the ifunc function will
become a stub that jumps to the entry point in the address field. A
constructor function is linked into every linkage module. The
constructor walks an array of `{descriptor, resolver}` pairs, calling
the resolver and saving the result in the address field in the
descriptor (thus setting `foo`'s descriptor to point to the resolved
version early during program runtime).
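The scheme above can be sketched in plain C++. This is only a model: `Descriptor`, `IFuncPair`, `ifunc_table`, `resolve_ifuncs`, and the `foo_*` functions are hypothetical names, the descriptor is reduced to its first field, and the real stub and constructor are generated by the compiler rather than written by hand.

```cpp
// Simplified stand-in for an AIX function descriptor: only the first
// field (the entry-point address) matters for this sketch.
using EntryPoint = int (*)();
struct Descriptor {
  EntryPoint entry;
};

using Resolver = EntryPoint (*)();
struct IFuncPair {
  Descriptor *descriptor;
  Resolver resolver;
};

// Two candidate implementations and the resolver that picks one.
static int foo_generic() { return 1; }
static int foo_fast() { return 2; }
static bool has_fast_path = true; // hypothetical CPU feature check
static EntryPoint foo_resolver() {
  return has_fast_path ? foo_fast : foo_generic;
}

// Descriptor for the ifunc `foo`; its address field starts out unresolved.
static Descriptor foo_desc = {nullptr};
static IFuncPair ifunc_table[] = {{&foo_desc, foo_resolver}};

// Models the per-module constructor: resolve every pair once.
void resolve_ifuncs() {
  for (IFuncPair &p : ifunc_table)
    p.descriptor->entry = p.resolver();
}

// Models the stub: jump to whatever the address field currently holds.
int foo() { return foo_desc.entry(); }
```

After `resolve_ifuncs()` has run, every call to `foo()` lands in `foo_fast` without consulting the resolver again.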
Known limitations:
- Due to bug #161576, which affects the object generation path, you will
need either `-ffunction-sections` or `-fno-integrated-as` to generate a
correct/linkable object file.
- Aliases to ifuncs are not supported. A test case has been added and
marked XFAIL. I plan to address this in a follow-up PR because it is not
important enough, IMHO, for this PR.
- Dead ifuncs in a CU that contains at least one live ifunc will result
in all ifuncs being kept by the linker. The fix for this is common with
a similar problem we have with PGO. PR #159435 is trying to provide a
mechanism that will allow the ifunc and PGO implementations to avoid the
dead code retention at the link step.
- the resolver must return a function that is in the same DSO as the
ifunc; the compiler will try to detect if this condition is violated and
report it, but it cannot detect it in general. To be safe, all candidate
functions (returned by a particular resolver) must either be static or
have hidden/protected visibility. This is so that the ifunc stub doesn't
have to save and restore the TOC register r2. In future work, this case
will be supported and the requirement will be lifted.
---------
Co-authored-by: Wael Yehia <wyehia@ca.ibm.com>
I forgot that you need to clear the upper 32 bits for the carry flag to
work properly on ppc64 or else there will be garbage and possibly
incorrect results.
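A small model of the problem, assuming the 32-bit carry is derived from an add performed in a 64-bit register (the actual lowering may obtain it differently). Both function names are hypothetical.

```cpp
#include <cstdint>

// `a_reg` and `b_reg` stand for full 64-bit registers whose low 32 bits
// hold the 32-bit operands; the upper 32 bits may contain stale data.
constexpr bool carry_without_clearing(uint64_t a_reg, uint64_t b_reg) {
  // Bit 32 of the sum is the 32-bit carry only if bits 32..63 were zero.
  return (((a_reg + b_reg) >> 32) & 1) != 0;
}

constexpr bool carry_with_clearing(uint64_t a_reg, uint64_t b_reg) {
  uint64_t a = a_reg & 0xFFFFFFFFull; // clear the upper 32 bits first
  uint64_t b = b_reg & 0xFFFFFFFFull;
  return (((a + b) >> 32) & 1) != 0;
}
```

With `a_reg = 0x100000001` (garbage in bit 32) and `b_reg = 1`, the uncleared version reports a carry for `1 + 1`, while the cleared version does not.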
Fixes: https://github.com/llvm/llvm-project/issues/179119
I do not have merge permissions.
This change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.
If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
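For downstream code, the migration looks roughly like the sketch below. It is a standalone model in a hypothetical `sketch` namespace: the enumerator values are illustrative, and the real declarations are the ones in LLVM's headers.

```cpp
#include <type_traits>

namespace sketch {
// Illustrative flag values only.
enum class RegState : unsigned {
  Define = 0x2,
  Implicit = 0x4,
  Kill = 0x8,
};

constexpr RegState operator|(RegState a, RegState b) {
  using U = std::underlying_type_t<RegState>;
  return static_cast<RegState>(static_cast<U>(a) | static_cast<U>(b));
}

constexpr RegState operator&(RegState a, RegState b) {
  using U = std::underlying_type_t<RegState>;
  return static_cast<RegState>(static_cast<U>(a) & static_cast<U>(b));
}

// Replaces a bitwise test such as `Flags & RegState::Kill` used as a bool.
constexpr bool hasRegState(RegState value, RegState test) {
  return (value & test) == test;
}

// Replaces `isKill ? RegState::Kill : 0`, which no longer type-checks
// because 0 does not convert to an enum class.
constexpr RegState getKillRegState(bool isKill) {
  return isKill ? RegState::Kill : RegState{};
}
} // namespace sketch
```

The empty initializer `RegState{}` plays the role that a literal `0` played when the flags were `unsigned`.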
The existing ((X+Y+1)>>1) patterns didn't handle overflow correctly, as
the VAVG instructions do.
Remove the old patterns and correctly mark the altivec VAVGS/VAVGU
patterns as matching the ISD::AVGCEIL opcodes - the generic DAG folds
will handle everything else
I've updated the vavg.ll tests to correctly match ISD::AVGCEILS/U patterns
and added the old tests as negative "overflow" patterns that shouldn't
fold to VAVG instructions
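To make the overflow difference concrete, here is a scalar sketch for unsigned 8-bit elements. `avg_narrow` models the old pattern evaluated at the element width, and `avg_ceil_u` models a rounding-up average computed with an extra bit of precision; both names are made up for this example.

```cpp
#include <cstdint>

// Old pattern evaluated at element width: x + y + 1 wraps modulo 256.
constexpr uint8_t avg_narrow(uint8_t x, uint8_t y) {
  uint8_t sum = static_cast<uint8_t>(x + y + 1); // truncating add
  return static_cast<uint8_t>(sum >> 1);
}

// Rounding-up average computed as if with one extra bit, then shifted.
constexpr uint8_t avg_ceil_u(uint8_t x, uint8_t y) {
  return static_cast<uint8_t>((static_cast<unsigned>(x) + y + 1) >> 1);
}
```

For `x = y = 255` the narrow form yields 127 while the widened form yields 255, which is why the two cannot be treated as the same operation.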
Fixes #174718
Support the following BCD format conversion builtins for PowerPC.
- `__builtin_bcdshift` – Shifts a packed decimal value by a specified
number of decimal digits.
- `__builtin_bcdshiftround` – Shifts a packed decimal value by a
specified number of decimal digits, with rounding applied.
- `__builtin_bcdtruncate` – Truncates a packed decimal value to a
specified number of digits.
- `__builtin_bcdunsignedtruncate` – Truncates a packed decimal value and
returns the result as an unsigned packed decimal.
- `__builtin_bcdunsignedshift` – Shifts an unsigned packed decimal value
by a specified number of digits.
> Note: These built-in functions are valid only when all of the following
> conditions are met:
> - `-qarch` is set to utilize POWER9 technology.
> - The `bcd.h` file is included.
## Prototypes
```c
vector unsigned char __builtin_bcdshift(vector unsigned char, int, unsigned char);
vector unsigned char __builtin_bcdshiftround(vector unsigned char, int, unsigned char);
vector unsigned char __builtin_bcdtruncate(vector unsigned char, int, unsigned char);
vector unsigned char __builtin_bcdunsignedtruncate(vector unsigned char, int);
vector unsigned char __builtin_bcdunsignedshift(vector unsigned char, int);
```
---------
This commit adds 4 Clang builtins for PowerPC AMO store operations:
__builtin_amo_stwat for 32-bit unsigned operations
__builtin_amo_stdat for 64-bit unsigned operations
__builtin_amo_stwat_s for 32-bit signed operations
__builtin_amo_stdat_s for 64-bit signed operations
and maps GCC's AMO store functions to these Clang builtins for
compatibility.
This change is to prepare to make RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
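The reserved `0x1` value works because a `bool` argument converts implicitly to `unsigned`. A minimal sketch with plain `unsigned` flags follows; the `flags_sketch` namespace, the constants' values other than `0x1`, and `looksLikeBool` are hypothetical.

```cpp
namespace flags_sketch {
constexpr unsigned Reserved = 0x1; // never assigned to a real flag
constexpr unsigned Define = 0x2;
constexpr unsigned Kill = 0x8;

// `true` converts to 1u, which sets only the reserved bit, so a caller
// that passed a bool where flags were expected can be detected.
constexpr bool looksLikeBool(unsigned flags) {
  return (flags & Reserved) != 0;
}
} // namespace flags_sketch
```

Note that `false` converts to `0` and is indistinguishable from "no flags", which is presumably why the two `addReg` calls passing `false` had to be found and updated by hand.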
This PR relands llvm/llvm-project#176091 (commit
1d616cdca3aba9d22f120888bb6b09b75ca90b92) which was reverted in
llvm/llvm-project#176190 (commit
6309cd8668fc2ae589f156b23f86821f4ce5b7ea).
This commit adds 4 Clang builtins for PowerPC AMO load conditional
increment and decrement operations:
__builtin_amo_lwat_cond for 32-bit unsigned operations
__builtin_amo_ldat_cond for 64-bit unsigned operations
__builtin_amo_lwat_cond_s for 32-bit signed operations
__builtin_amo_ldat_cond_s for 64-bit signed operations
Splits out change from https://github.com/llvm/llvm-project/pull/176015
Changes shouldExpandAtomicRMWInIR to take a constant argument: This is
to allow some other TargetLowering constant-argument functions to call
it. This change touches several backends. An alternative solution
exists, but to me, this seems the "right" way.
Reverts llvm/llvm-project#176091
Reverting because some compilers were erroring on the call to
`Reg.isReg()` (which is not `constexpr`) in a `constexpr` function.
This change is to prepare to make RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
The crash happens because the cast for `Mask =
cast<ShuffleVectorSDNode>(Res)->getMask();` fails for node `t197: v16i8
= vector_shuffle<16,17,18,19,4,5,6,7,8,9,10,11,u,u,u,u> t196, t196`.
However, both `LHS` and `RHS` are the same node, so
`DAG.getCommutedVectorShuffle` doesn't return a `ShuffleVectorSDNode`
and crashes. The fix is to add a check before the cast is performed.
Closes https://github.com/llvm/llvm-project/issues/172265
The existing condition for checking whether or not to expand an frem
instruction in expand-fp is not sufficiently precise.
The expansion on targets other than AMDGPU - which is the only intended
user right now - is only prevented due to the interaction with the
MaxLegalFpConvertBitWidth check. Relying on this is conceptually wrong
and limits the use of the pass for other targets and further expansions
(e.g. merging with the similar ExpandLargeDivRem pass).
Change the expansion criterion to always expand frem of a given type
for targets that use "Expand" as the legalization action for the
underlying scalar type and use this to exit the pass early for targets
which do not require any expansions. This requires changing the
frem legalization action for all targets which do not want frem to
be expanded in this pass from "Expand" to "LibCall".
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Create a mutable aliased fixed stack object for the va_list when any of
the optional arguments are passed in GPRs. Since we need to spill the
GPRs into the parameter save area, the stack object is not immutable,
and since the values will almost certainly be accessed through the IR
value for a va_list, make the stack object aliased as well.
In LangRef, we claim that FMINNUM and FMAXNUM should follow the minNum
and maxNum operators in IEEE754-2008.
PowerPC/VSX provides these operations as the XSMINDP and XSMAXDP
instructions.
For now we use FMINNUM_IEEE and FMAXNUM_IEEE, since the
target-independent expansion code uses them.
In the future, we may replace all FMINNUM_IEEE/FMAXNUM_IEEE with FMINNUM
and FMAXNUM.
---------
Co-authored-by: Your Name <you@example.com>
This commit adds two Clang builtins for PowerPC AMO load operations:
__builtin_amo_lwat for 32-bit unsigned operations
__builtin_amo_ldat for 64-bit unsigned operations
Also adds an amo.h header that maps GCC's AMO functions to these Clang
builtins for compatibility.
PPCISelLowering.cpp:15567:27: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
15567 | CC == ISD::SETEQ && "CC mus be ISD::SETNE or ISD::SETEQ");
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Currently LibcallLoweringInfo is defined inside of TargetLowering,
which is owned by the subtarget. Pass in the subtarget so we can
construct LibcallLoweringInfo with the subtarget. This is a temporary
step that should be revertable in the future, after LibcallLoweringInfo
is moved out of TargetLowering.
This patch improves the codegen for saddo on i32 and i64 in both 32-bit
and 64-bit modes by custom lowering. It implements signed-add overflow
detection using the `(x eqv y) & (sum xor x)` bit-level sequence.
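The identity can be checked with a scalar sketch, where `eqv` is written as `~(x ^ y)` and the overflow flag is the sign bit of the result. `saddo32` is a name invented for this example.

```cpp
#include <cstdint>

// Signed-add overflow: the operands have the same sign (x eqv y has its
// sign bit set) and the sum's sign differs from x (sum xor x has its
// sign bit set).
constexpr bool saddo32(int32_t x, int32_t y) {
  uint32_t ux = static_cast<uint32_t>(x);
  uint32_t uy = static_cast<uint32_t>(y);
  uint32_t sum = ux + uy;     // wrapping add, no undefined behaviour
  uint32_t eqv = ~(ux ^ uy);  // sign bit set when the signs agree
  return ((eqv & (sum ^ ux)) >> 31) != 0;
}
```

For example, `saddo32(INT32_MAX, 1)` and `saddo32(INT32_MIN, -1)` report overflow, while `saddo32(INT32_MAX, -1)` does not because the operand signs differ.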
This allows SDNodes to be validated against their expected type profiles
and reduces the number of changes required to add a new node.
The validation functionality has detected several issues, see
`PPCSelectionDAGInfo::verifyTargetNode()`.
Most of the nodes have a description in `*.td` files and were
successfully "imported". Those that don't have a description are listed
in the enum in `PPCSelectionDAGInfo.td`. These nodes are not validated.
Part of #119709.
Pull Request: https://github.com/llvm/llvm-project/pull/168108
This patch adds a 16-byte load size to
PPCTTIImpl::enableMemCmpExpansion and folds i128 equality/inequality
compares of two loads into a vectorized compare using vcmpequb.p when
Altivec is available.
Rationale:
A scalar i128 SETCC (eq/ne) normally lowers to multiple scalar ops. On
VSX-capable subtargets, we can instead reinterpret the i128 loads as
v16i8 vectors and use the Altivec vcmpequb.p instruction to perform a
full 128-bit equality check in a single vector compare.
Example Result:
This transformation replaces memcmp(a, b, 16) with two vector loads and
one vector compare instruction.
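The kind of source that benefits is an equality-only 16-byte comparison, as in the sketch below. `equal16` is a hypothetical helper; whether the compiler actually emits the vector compare depends on the subtarget and optimization level.

```cpp
#include <cstring>

// Only the result's equality with zero is used, so the comparison can be
// expanded into loads plus a single equality check rather than a call.
inline bool equal16(const void *a, const void *b) {
  return std::memcmp(a, b, 16) == 0;
}
```

A three-way use such as `memcmp(a, b, 16) < 0` is a different pattern, since it needs the ordering of the first differing byte and not just equality.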
Currently it is considered suitable to lower to a bit test for a set of
switch case clusters when the number of unique destinations
(`NumDests`) and the number of total comparisons (`NumCmps`) satisfy:
`(NumDests == 1 && NumCmps >= 3) || (NumDests == 2 && NumCmps >= 5) ||
(NumDests == 3 && NumCmps >= 6)`
However, in some cases on PowerPC, for example when NumDests is 3 and
each destination has exactly 2 comparisons, it is not profitable to
lower the switch to a bit test. This patch adds an option that sets the
minimum value of the largest per-destination comparison count required
to use a bit test in switch lowering.
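The existing criterion, and one possible reading of the new option, can be written as predicates. `suitableForBitTests` restates the condition quoted above; `suitableWithMinimum`, `LargestNumCmps`, and `MinLargest` are hypothetical names, and the way the option combines with the existing check is an assumption.

```cpp
// Existing criterion, as quoted in this description.
constexpr bool suitableForBitTests(unsigned NumDests, unsigned NumCmps) {
  return (NumDests == 1 && NumCmps >= 3) ||
         (NumDests == 2 && NumCmps >= 5) ||
         (NumDests == 3 && NumCmps >= 6);
}

// Assumed shape of the extra check: the destination with the most
// comparisons must have at least `MinLargest` of them.
constexpr bool suitableWithMinimum(unsigned NumDests, unsigned NumCmps,
                                   unsigned LargestNumCmps,
                                   unsigned MinLargest) {
  return suitableForBitTests(NumDests, NumCmps) &&
         LargestNumCmps >= MinLargest;
}
```

With three destinations of two comparisons each, the first predicate accepts the cluster, while the second rejects it once the minimum is set to 3.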
---------
Co-authored-by: Shimin Cui <scui@xlperflep9.rtp.raleigh.ibm.com>
The branches emitted for atomic operations after the store-conditional
are currently not hinted, even though they should be.
According to the Power10 Processor Chip User’s Manual:
> “Without static prediction, if the lock is not acquired in the first
> iteration, the branch history mechanism works to update the prediction
> to predict taken; that is, predict lock acquisition failure and cause
> more lwarx traffic for the next iteration.”
This patch addresses the issue by adding explicit branch hints for
atomic operations after the store-conditional.
Try to remove `UnsafeFPMath` uses in PowerPC backend. These global flags
block some improvements like
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast/80797.
Remove them incrementally.
FP operations that may raise exceptions are replaced by constrained
intrinsics. However, these intrinsics do not support vector types.
### Optimize BUILD_VECTOR having special quadword patterns
This change optimizes `BUILD_VECTOR` operations by using the `lxvkq` or
`xxspltib + vsrq` instructions to inline constants matching specific
128-bit patterns:
- **MSB set pattern**: `0x8000_0000_0000_0000_0000_0000_0000_0000`
- **LSB set pattern**: `0x0000_0000_0000_0000_0000_0000_0000_0001`
### Implementation Details
The `lxvkq` instruction loads special quadword values into VSX
registers:
```asm
lxvkq XT, UIM
# When UIM=16: loads 0x8000_0000_0000_0000_0000_0000_0000_0000
```
The optimization reconstructs the 128-bit register pattern from
`BUILD_VECTOR` operands, accounting for target endianness. For example,
the MSB pattern can be represented as:
- **Big-Endian**: `<i64 -9223372036854775808, i64 0>`
- **Little-Endian**: `<i64 0, i64 -9223372036854775808>`
Both produce the same register value:
`0x8000_0000_0000_0000_0000_0000_0000_0000`
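The endianness handling can be sketched for the `v2i64` case as follows. `U128`, `registerValue`, and `isMSBPattern` are hypothetical names, not the functions used in the patch; the element-to-doubleword mapping is taken from the two examples above.

```cpp
#include <cstdint>

struct U128 {
  uint64_t hi; // most significant 64 bits of the register
  uint64_t lo; // least significant 64 bits of the register
};

// Rebuild the register contents from a <2 x i64> BUILD_VECTOR: element 0
// is the most significant doubleword on big-endian and the least
// significant one on little-endian.
constexpr U128 registerValue(uint64_t elt0, uint64_t elt1,
                             bool isLittleEndian) {
  return isLittleEndian ? U128{elt1, elt0} : U128{elt0, elt1};
}

constexpr bool isMSBPattern(U128 v) {
  return v.hi == 0x8000000000000000ull && v.lo == 0;
}
```

Both `registerValue(0x8000000000000000, 0, false)` and `registerValue(0, 0x8000000000000000, true)` satisfy `isMSBPattern`, matching the two operand orders listed above.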
### MSB Pattern (`0x8000...0000`)
All vector types (`v2i64`, `v4i32`, `v8i16`, `v16i8`) generate:
```asm
lxvkq v2, 16
```
### LSB Pattern (`0x0000...0001`)
All vector types generate:
```asm
xxspltib v2, 255
vsrq v2, v2, v2
```
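Why the two-instruction sequence yields the LSB pattern can be checked with a small model, assuming `vsrq` shifts the quadword right by a 7-bit amount read from its second source register. `Quad`, `shiftRightQuad`, and `lsbPattern` are names made up for this sketch.

```cpp
#include <cstdint>

struct Quad {
  uint64_t hi;
  uint64_t lo;
};

// Logical right shift of a 128-bit value by 0..127 bits.
constexpr Quad shiftRightQuad(Quad v, unsigned sh) {
  if (sh == 0)
    return v;
  if (sh >= 64)
    return Quad{0, v.hi >> (sh - 64)};
  return Quad{v.hi >> sh, (v.lo >> sh) | (v.hi << (64 - sh))};
}

constexpr Quad lsbPattern() {
  Quad ones{~0ull, ~0ull};         // xxspltib v2, 255: every byte is 0xFF
  unsigned sh = ones.lo & 0x7F;    // any 7-bit field of all-ones reads 127
  return shiftRightQuad(ones, sh); // vsrq v2, v2, v2
}
```

Shifting all-ones right by 127 leaves only the lowest bit set, which is the `0x0000...0001` pattern.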
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
This code was already creating HandleSDNodes to handle the case where a
node gets replaced with an equivalent node. However, the code before the
handles are created also performs RAUW operations, which can end up
CSEing and deleting nodes.
Fix this issue by moving the handle creation earlier.
Fixes https://github.com/llvm/llvm-project/issues/160040.
The result type of the vector extend intrinsics generated by the
BUILD_VECTOR lowering code should match how they are actually defined.
Currently the result type is defaulting to the operand type there. This
can conflict with calls to the same intrinsic from other paths.