Update MMA tests to add a run line for `cpu=future` to ensure MMA
functionality is not broken by the newly introduced `wacc` register
classes. Previous commits added the definitions for using the new `wacc`
registers; this one adds testing and fixes a few patterns that were
missing.
Unused loop-invariant loads were not sunk from the preheader to the exit
block, increasing their live range.
This commit moves the sinkUnusedInvariant logic from IndVarSimplify to
LICM and adds functionality to sink unused loads that are not
clobbered by the loop body.
For an insertelt with a dynamic index, the default handling in
DAGTypeLegalizer and LegalizeDAG will reserve a stack slot for the
vector, lower the insertelt to a store, then load the modified vector
back into temporaries. The vector store and load may be legalized into a
sequence of smaller operations depending on the target.
Let V = the vector size and L = the length of a chain of insertelts with
dynamic indices. In the worst case, this chain will lower to O(VL)
operations, which can increase code size dramatically.
Instead, identify such chains, reserve one stack slot for the vector,
and lower all of the insertelts to stores at once. This requires only
O(V + L) operations. This change only affects the default lowering
behavior.
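The difference between the two lowerings can be sketched with a small cost model. This is only an illustration, not the legalizer code: it assumes the worst case where a vector store or load is legalized into V element-sized operations, and the function names are hypothetical.

```python
def lower_naive(vector, inserts):
    """Default lowering: spill, store one element, reload, for every insert."""
    ops = 0
    for index, value in inserts:
        slot = list(vector)      # store the whole vector: V operations (assumed)
        ops += len(vector)
        slot[index] = value      # one scalar store for the inserted element
        ops += 1
        vector = list(slot)      # load the whole vector back: V operations
        ops += len(vector)
    return vector, ops


def lower_batched(vector, inserts):
    """Chain lowering: spill once, store every element, reload once."""
    slot = list(vector)          # single spill: V operations (assumed)
    ops = len(vector)
    for index, value in inserts:
        slot[index] = value      # one scalar store per insertelt
        ops += 1
    return list(slot), ops + len(slot)  # single reload: V operations
```

With V = 4 and a chain of L = 3 inserts, the model gives 27 operations for the default lowering and 11 for the batched one, while both produce the same vector.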
Currently, lowering a set of switch case clusters to a bit test is
considered suitable when the number of unique destinations
(`NumDests`) and the number of total comparisons (`NumCmps`) satisfy:
`(NumDests == 1 && NumCmps >= 3) || (NumDests == 2 && NumCmps >= 5) ||
(NumDests == 3 && NumCmps >= 6)`
However, in some cases on PowerPC, for example when NumDests is 3 and
each destination has exactly 2 comparisons, lowering the switch to a bit
test is not profitable. This adds an option to set the minimum value
that the largest per-destination comparison count must reach before a
bit test is used for switch lowering.
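A minimal sketch of the condition, assuming the new option acts as an extra threshold on the largest per-destination comparison count. The parameter name `min_largest_cmps` is hypothetical and the actual option name and check in the patch may differ.

```python
def is_suitable_for_bit_tests(num_dests, num_cmps):
    """The existing heuristic quoted above."""
    return ((num_dests == 1 and num_cmps >= 3)
            or (num_dests == 2 and num_cmps >= 5)
            or (num_dests == 3 and num_cmps >= 6))


def is_suitable_with_threshold(cmps_per_dest, min_largest_cmps):
    """Existing heuristic plus a (hypothetical) minimum on the largest
    number of comparisons that target any single destination."""
    num_dests = len(cmps_per_dest)
    num_cmps = sum(cmps_per_dest)
    return (is_suitable_for_bit_tests(num_dests, num_cmps)
            and max(cmps_per_dest) >= min_largest_cmps)
```

For three destinations with two comparisons each, the existing heuristic accepts the bit test, while a threshold of 3 rejects it.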
---------
Co-authored-by: Shimin Cui <scui@xlperflep9.rtp.raleigh.ibm.com>
The instruction `tlbie` changed in ISA3.0.
ISA V2.07: `tlbie RB,RS`
ISA V3.0: `tlbie RB,RS,RIC,PRS,R`, with `tlbie RB,RS` aliased to `tlbie
RB,RS,0,0,0`
The branches emitted for atomic operations after the store-conditional
are currently not hinted, even though they should be.
According to the Power10 Processor Chip User’s Manual:
` “Without static prediction, if the lock is not acquired in the first
iteration, the branch history mechanism works to update the prediction
to predict taken; that is, predict lock acquisition failure and cause
more lwarx traffic for the next iteration.”`
This patch addresses the issue by adding explicit branch hints for
atomic operations after the store-conditional.
Try to remove `UnsafeFPMath` uses in PowerPC backend. These global flags
block some improvements like
https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast/80797.
Remove them incrementally.
FP operations that may raise exceptions are replaced by constrained
intrinsics. However, these intrinsics do not support vector types.
The previous [NFC
patch](https://github.com/llvm/llvm-project/pull/160476#top) addressed
only the vector type `v4i32`. This patch continues that work by adding
the remaining 3 vector types that were left out.
This covers the following operations:
- `v2i64`: `A + vector {1, 1}`
- `v8i16`: `A + vector {1, 1, 1, 1, 1, 1, 1, 1}`
- `v16i8`: `A + vector {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}`
---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
### Optimize BUILD_VECTOR having special quadword patterns
This change optimizes `BUILD_VECTOR` operations by using the `lxvkq` or
`xxspltib + vsrq` instructions to inline constants matching specific
128-bit patterns:
- **MSB set pattern**: `0x8000_0000_0000_0000_0000_0000_0000_0000`
- **LSB set pattern**: `0x0000_0000_0000_0000_0000_0000_0000_0001`
### Implementation Details
The `lxvkq` instruction loads special quadword values into VSX
registers:
```asm
lxvkq XT, UIM
# When UIM=16: loads 0x8000_0000_0000_0000_0000_0000_0000_0000
```
The optimization reconstructs the 128-bit register pattern from
`BUILD_VECTOR` operands, accounting for target endianness. For example,
the MSB pattern can be represented as:
- **Big-Endian**: `<i64 -9223372036854775808, i64 0>`
- **Little-Endian**: `<i64 0, i64 -9223372036854775808>`
Both produce the same register value:
`0x8000_0000_0000_0000_0000_0000_0000_0000`
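The endianness handling can be illustrated as follows. This is a sketch of the idea only, not the backend code; `register_value` is a hypothetical helper that assumes element 0 is the most significant element on big-endian targets and the least significant on little-endian targets.

```python
def register_value(elements, elem_bits, little_endian):
    """Rebuild the 128-bit register image from BUILD_VECTOR operands."""
    mask = (1 << elem_bits) - 1
    # Walk the elements from most significant to least significant.
    ordered = reversed(elements) if little_endian else elements
    value = 0
    for element in ordered:
        value = (value << elem_bits) | (element & mask)
    return value


MSB_PATTERN = 1 << 127
big = register_value([-9223372036854775808, 0], 64, little_endian=False)
little = register_value([0, -9223372036854775808], 64, little_endian=True)
```

Both `big` and `little` equal `MSB_PATTERN`, which is why a single `lxvkq` covers both operand orders.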
### MSB Pattern (`0x8000...0000`)
All vector types (`v2i64`, `v4i32`, `v8i16`, `v16i8`) generate:
```asm
lxvkq v2, 16
```
### LSB Pattern (`0x0000...0001`)
All vector types generate:
```asm
xxspltib v2, 255
vsrq v2, v2, v2
```
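Why this two-instruction sequence yields the LSB pattern can be checked with a small model. It assumes `xxspltib` replicates its 8-bit immediate into all 16 bytes and that `vsrq` shifts right by a 7-bit amount taken from its second source; with an all-ones source, any 7-bit field is 127.

```python
MASK128 = (1 << 128) - 1


def xxspltib(byte):
    """Splat one byte across a 128-bit register (modelled as an int)."""
    return int.from_bytes(bytes([byte & 0xFF]) * 16, "big")


def vsrq(src1, src2):
    """Shift src1 right by the 7-bit amount taken from src2 (simplified)."""
    return (src1 & MASK128) >> (src2 & 0x7F)


v2 = xxspltib(255)   # all ones
v2 = vsrq(v2, v2)    # shift amount is 127, leaving only the lowest bit
```

After the sequence `v2` holds 1, which is the `0x0000...0001` pattern.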
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
This NFC patch adds a new function which aids in emitting machine
instructions for floating point vectors. This was previously not
included in the test file as it currently only checks for integer
vectors.
---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
Previously if we had a subregister extract reading from a
full copy, the no-subregister incoming copy would overwrite
the DefSubReg index of the folding context.
There's one ugly rvv regression, but it's a downstream
issue of this; an unnecessary same class reg-to-reg full copy
was avoided.
This NFC patch looks to lock down the instruction generated for the
operation of `A + vector {1, 1, 1, 1}` in which the current code emits
`vspltisw`.
This can be improved by using the 2-cycle instruction `xxleqv` instead
of the current 4-cycle `vspltisw`.
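One way `xxleqv` could serve here, stated as an assumption since this patch only locks down the current output: `xxleqv` of a register with itself produces all ones (that is, -1 in every lane), and subtracting -1 is the same as adding 1 modulo 2^32. The helper names below are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def xxleqv_word(a, b):
    """Bitwise equivalence (XNOR) on one 32-bit lane."""
    return ~(a ^ b) & MASK32


def add_one_via_all_ones(lanes):
    """Compute A + {1, 1, 1, 1} as A - {-1, -1, -1, -1} per lane."""
    all_ones = xxleqv_word(0, 0)  # XNOR of a value with itself: 0xFFFFFFFF
    return [(lane - all_ones) & MASK32 for lane in lanes]
```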
---------
Co-authored-by: himadhith <himadhith.v@ibm.com>
This code was already creating HandleSDNodes to handle the case where a
node gets replaced with an equivalent node. However, the code before the
handles are created also performs RAUW operations, which can end up
CSEing and deleting nodes.
Fix this issue by moving the handle creation earlier.
Fixes https://github.com/llvm/llvm-project/issues/160040.
The result type of the vector extend intrinsics generated by the
BUILD_VECTOR lowering code should match how they are actually defined.
Currently the result type is defaulting to the operand type there. This
can conflict with calls to the same intrinsic from other paths.
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strlen routine is a millicode implementation;
we use millicode for the strlen function instead of a library call to
improve performance.
The PowerPC changes are caused by shifts created by different IR
operations being CSEd now. This allows consecutive loads to be turned
into vectors earlier. This has effects on the ordering of other combines
and legalizations. This leads to some improvements and some regressions.
This was being used for 2 different purposes.
The TargetMachine constructor prepends +64bit based on isPPC64
triples as a mode switch. The same feature name was also explicitly
added to different processors, making it impossible to perform a pure
feature check for whether 64-bit mode is enabled or not. i.e.,
checkFeatures("+64bit") would be true even for ppc32 triples.
The comment in tablegen suggests it's relevant to track which processors
support 64-bit mode independently of whether that's the active compile
target, so replace that with a new feature.
Pre-commit test cases for the exploitation of `xxsel` for ternary
operations. They cover the v4i32, v2i64, v16i8 and v8i16 operand types
for the following patterns:
```
ternary(A, and(B,C), nor(B,C))
ternary(A, B, nor(B,C))
ternary(A, C, nor(B,C))
ternary(A, xor(B,C), nor(B,C))
ternary(A, not(C), nor(B,C))
ternary(A, not(B), nor(B,C))
ternary(A, nand(B,C), nor(B,C))
ternary(A, or(B,C), eqv(B,C))
ternary(A, nor(B,C), eqv(B,C))
ternary(A, not(C), eqv(B,C))
ternary(A, nand(B,C), eqv(B,C))
ternary(A, and(B,C), not(C))
ternary(A, B, not(C))
ternary(A, xor(B,C), not(C))
ternary(A, or(B,C), not(C))
ternary(A, not(B), not(C))
ternary(A, nand(B,C), not(C))
ternary(A, and(B,C), not(B))
ternary(A, xor(B,C), not(B))
ternary(A, or(B,C), not(B))
ternary(A, nand(B,C), not(B))
ternary(A, B, nand(B,C))
ternary(A, C, nand(B,C))
ternary(A, xor(B,C), nand(B,C))
ternary(A, or(B,C), nand(B,C))
ternary(A, eqv(B,C), nand(B,C))
```
Exploitation of `xxeval` for the above patterns to be added as a follow
up.
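For readers unfamiliar with the notation, the patterns listed above can be read as a bitwise select. The sketch below assumes `ternary(A, X, Y)` picks bits of `X` where `A` is set and bits of `Y` elsewhere; registers are modelled as 128-bit Python integers and all function names are illustrative.

```python
MASK128 = (1 << 128) - 1


def nor(b, c):
    return ~(b | c) & MASK128


def nand(b, c):
    return ~(b & c) & MASK128


def eqv(b, c):
    return ~(b ^ c) & MASK128


def ternary(a, x, y):
    """Per-bit select: take x where a is 1, y where a is 0."""
    return ((a & x) | (~a & y)) & MASK128
```

For instance, `ternary(A, and(B,C), nor(B,C))` becomes `ternary(a, b & c, nor(b, c))`.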
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
`f16` is more functional than just a storage type on the platform,
though it does have some codegen issues [1]. To prepare for future
changes, do the following nonfunctional updates to the existing `half`
test:
* Add tests for passing and returning the type directly.
* Add tests showing bitcast behavior, which is currently incorrect but
serves as a baseline.
* Add tests for `fabs` and `copysign` (trivial operations that shouldn't
require libcalls).
* Add invocations for big-endian and for PPC32.
* Rename the test to `half.ll` to reflect its status, which also matches
other backends.
[1]: https://github.com/llvm/llvm-project/issues/97975
This patch enables `-fpatchable-function-entry` on PPC64 little-endian
Linux. It is mutually exclusive with existing XRay instrumentation on
this target.
NFC patch to add the flags `-ppc-asm-full-reg-names
--ppc-vsr-nums-as-vr` to the following test files:
```
llvm/test/CodeGen/PowerPC/recipest.ll
llvm/test/CodeGen/PowerPC/setcc-logic.ll
llvm/test/CodeGen/PowerPC/vector-popcnt-128-ult-ugt.ll
```
Created this PR based on this discussion:
https://github.com/llvm/llvm-project/pull/151971#issuecomment-3234090675
Co-authored-by: himadhith <himadhith.v@ibm.com>
This patch adds branch hints in AtomicExpandImpl::expandAtomicCmpXchg.
For example, PowerPC supports branch hints as follows:
```
loop:
lwarx r6,0,r3 # load and reserve
cmpw r4,r6 # 1st 2 operands equal?
bne- exit # skip if not
stwcx. r5,0,r3 # store new value if still res’ved
bne- loop # loop if lost reservation
exit:
mr r4,r6 # return value from storage
```
`-` hints not taken,
`+` hints taken.
NFC patch to add the flags `-ppc-asm-full-reg-names
--ppc-vsr-nums-as-vr` to the test file
`llvm/test/CodeGen/PowerPC/check-zero-vector.ll`.
Created this PR based on this discussion:
https://github.com/llvm/llvm-project/pull/151971#issuecomment-3234090675
Co-authored-by: himadhith <himadhith.v@ibm.com>
Co-authored-by: Lei Huang <lei@ca.ibm.com>
This change implements patfrag-based pattern matching
that combines consecutive `VSRO (Vector Shift Right Octet)` and `VSR
(Vector Shift Right)` instructions into a single `VSRQ (Vector Shift
Right Quadword)` instruction on Power10+ processors.
Vector right shift operations like `vec_srl(vec_sro(input, byte_shift),
bit_shift)` generate two separate instructions `(VSRO + VSR)` when they
could be optimised into a single `VSRQ` instruction that performs the
equivalent operation.
```
vsr(vsro (input, vsro_byte_shift), vsr_bit_shift) to vsrq(input, vsrq_bit_shift)
where vsrq_bit_shift = (vsro_byte_shift * 8) + vsr_bit_shift
```
Note:
```
vsro : Vector Shift Right by Octet VX-form
- vsro VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bytes specified in bits 121:124 of VSR[VRB+32].
- Bytes shifted out of byte 15 are lost.
- Zeros are supplied to the vacated bytes on the left.
- The result is placed into VSR[VRT+32].
vsr : Vector Shift Right VX-form
- vsr VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bits specified in bits 125:127 of VSR[VRB+32]. 3 bits.
- Bits shifted out of bit 127 are lost.
- Zeros are supplied to the vacated bits on the left.
- The result is placed into VSR[VRT+32], except if, for any byte element in VSR[VRB+32], the low-order 3 bits are not equal to the shift amount, then VSR[VRT+32] is undefined.
vsrq : Vector Shift Right Quadword VX-form
- vsrq VRT,VRA,VRB
- Let src1 be the contents of VSR[VRA+32]. Let src2 be the contents of VSR[VRB+32].
- src1 is shifted right by the number of bits specified in the low-order 7 bits of src2.
- Bits shifted out the least-significant bit are lost.
- Zeros are supplied to the vacated bits on the left.
- The result is placed into VSR[VRT+32].
```
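The shift-amount identity `vsrq_bit_shift = (vsro_byte_shift * 8) + vsr_bit_shift` can be checked with a simplified model. Registers are modelled as 128-bit integers and the shift amounts are passed as plain integers rather than being read from register bit fields, which is a simplification of the instruction definitions quoted above.

```python
MASK128 = (1 << 128) - 1


def vsro(value, byte_shift):
    """Shift right by 0..15 whole bytes (4-bit amount)."""
    return (value & MASK128) >> (8 * (byte_shift & 0xF))


def vsr(value, bit_shift):
    """Shift right by 0..7 bits (3-bit amount)."""
    return (value & MASK128) >> (bit_shift & 0x7)


def vsrq(value, bit_shift):
    """Shift right by 0..127 bits (7-bit amount)."""
    return (value & MASK128) >> (bit_shift & 0x7F)


def combined_shift(byte_shift, bit_shift):
    """Shift amount for the single vsrq replacing vsro followed by vsr."""
    return byte_shift * 8 + bit_shift
```

The largest combined amount is 15 * 8 + 7 = 127, so it always fits in the 7 bits that `vsrq` reads.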
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
Adds support for ternary equivalent operations of the form `ternary(A,
X, B)` and `ternary(A, X, C)` where `X=[and(B,C)| nor(B,C)| eqv(B,C)|
nand(B,C)]`.
The following are the patterns involved and the imm values:
| **Operation** | **Immediate Value** |
|----------------------------|---------------------|
| ternary(A, and(B,C), B) | 49 |
| ternary(A, nor(B,C), B) | 56 |
| ternary(A, eqv(B,C), B) | 57 |
| ternary(A, nand(B,C), B) | 62 |
| | |
| ternary(A, and(B,C), C) | 81 |
| ternary(A, nor(B,C), C) | 88 |
| ternary(A, eqv(B,C), C) | 89 |
| ternary(A, nand(B,C), C) | 94 |
e.g. `xxeval XT, XA, XB, XC, 49`
- performs `XA ? and(XB, XC) : XB` and places the result in `XT`.
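The immediates in the table can be reproduced by treating the 8-bit immediate as a truth table. The sketch assumes the immediate is indexed by the bit triple (A, B, C), with the entry for (0, 0, 0) in the most significant bit; this assumption is consistent with all eight values listed above. The helper names are illustrative.

```python
def xxeval_imm(fn):
    """Build the 8-bit truth table of fn(a, b, c), entry (0,0,0) first."""
    imm = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        imm = (imm << 1) | (fn(a, b, c) & 1)
    return imm


def ternary(a, x, y):
    """Single-bit select: x when a is 1, otherwise y."""
    return x if a else y


OPS = {
    "and": lambda b, c: b & c,
    "nor": lambda b, c: 1 - (b | c),
    "eqv": lambda b, c: 1 - (b ^ c),
    "nand": lambda b, c: 1 - (b & c),
}


def imm_for(op_name, false_operand):
    """Immediate for ternary(A, op(B,C), B) or ternary(A, op(B,C), C)."""
    op = OPS[op_name]
    pick = (lambda b, c: b) if false_operand == "B" else (lambda b, c: c)
    return xxeval_imm(lambda a, b, c: ternary(a, op(b, c), pick(b, c)))
```

For example, `imm_for("and", "B")` evaluates to 49, matching the first row of the table.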
This is the continuation of [[PowerPC] Exploit xxeval instruction for
ternary patterns - ternary(A, X,
and(B,C))](https://github.com/llvm/llvm-project/pull/141733#top).
---------
Co-authored-by: Tony Varghese <tony.varghese@ibm.com>
This patch updates PPCInstrInfo::copyPhysReg to support DMR and WACC
register classes and extends the PPCVSXCopy pass to handle specific WACC
copy patterns.
The scalarized IR was written before improvements to SLP / cost models
ensured that the abs intrinsic was easily vectorizable
(opt -O3: https://zig.godbolt.org/z/39T65vh8M).
Now that it is, we need a more useful llc test.
This pseudo-instruction emits a local `bl` writing LR, so that must be
saved and restored for the function to return to the right place. If
not, we'll return to the inline `.long` that the `bl` stepped over.
This fixes the `SIGILL` seen in rayon-rs/rayon#1268.
Add entries for __stack_chk_guard, __ssp_canary_word, __security_cookie,
and __guard_local. As far as I can tell these are all just different
names for the same shaped functionality on different systems.
These aren't really functions, but special global variable names. They
should probably be treated the same way; all the same contexts that
need to know about emittable function names also need to know about
this. This avoids a special case check in IRSymtab.
This isn't a complete change, there's a lot more cleanup which
should be done. The stack protector configuration system is a
complete mess. There are multiple overlapping controls, used in
3 different places. Some of the target control implementations overlap
with conditions used in the emission points, and some use correlated
but not identical conditions in different contexts.
i.e. useLoadStackGuardNode, getIRStackGuard, getSSPStackGuardCheck and
insertSSPDeclarations are all used in inconsistent ways so I don't know
if I've tracked the intention of the system correctly.
The PowerPC test change is a bug fix on Linux. Previously the manual
conditions were based around !isOSOpenBSD, which is not the condition
where __stack_chk_guard is used. Now getSDagStackGuard returns the
proper global reference, resulting in LOAD_STACK_GUARD getting a
MachineMemOperand which allows scheduling.
When moving fcti results from float registers to normal registers
through memory, even though MPI was adjusted to account for endianness,
FIPtr was always adjusted for big-endian, which caused loads of the
wrong half of the value in little-endian mode.
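The effect can be reproduced outside the compiler. The sketch below stores a 64-bit integer in a byte buffer standing in for the stack slot and loads a 32-bit word from it; the offsets (4 for big-endian, 0 for little-endian) are the usual location of the low word and are an illustration, not code taken from the patch.

```python
import struct


def load_word(value, little_endian, offset):
    """Store a 64-bit value, then load a 32-bit word at the given offset."""
    order = "<" if little_endian else ">"
    slot = struct.pack(order + "q", value)
    (word,) = struct.unpack_from(order + "i", slot, offset)
    return word


def low_word_offset(little_endian):
    """Byte offset of the low 32 bits inside the 8-byte slot."""
    return 0 if little_endian else 4
```

Using the big-endian offset of 4 on a little-endian slot reads the high half instead: for the value 42 it returns 0 rather than 42.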
Support the following BCD format conversion builtins for PowerPC.
- `__builtin_bcdcopysign` – Conversion that returns the decimal value of
the first parameter combined with the sign code of the second parameter.
- `__builtin_bcdsetsign` – Conversion that sets the sign code of the
input parameter in packed decimal format.
> Note: This built-in function is valid only when all of the following
conditions are met:
> - `-qarch` is set to utilize POWER9 technology.
> - The bcd.h file is included.
## Prototypes
```c
vector unsigned char __builtin_bcdcopysign(vector unsigned char, vector unsigned char);
vector unsigned char __builtin_bcdsetsign(vector unsigned char, unsigned char);
```
## Usage Details
`__builtin_bcdsetsign`: Returns the packed decimal value of the first
parameter combined with the sign code.
The sign code is set according to the following rules:
- If the packed decimal value of the first parameter is positive, the
following rules apply:
- If the second parameter is 0, the sign code is set to 0xC.
- If the second parameter is 1, the sign code is set to 0xF.
- If the packed decimal value of the first parameter is negative, the
sign code is set to 0xD.
> Notes:
> - The second parameter can only be 0 or 1.
> - You can determine whether a packed decimal value is positive or
negative as follows:
>   - Packed decimal values with sign codes **0xA, 0xC, 0xE, or 0xF** are
interpreted as positive.
>   - Packed decimal values with sign codes **0xB or 0xD** are interpreted
as negative.
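The sign-code rules for `__builtin_bcdsetsign` can be summarised in a few lines. This models only the sign nibble, not the full packed decimal vector, and `set_sign_code` is a hypothetical helper rather than part of the builtin's implementation.

```python
POSITIVE_CODES = {0xA, 0xC, 0xE, 0xF}
NEGATIVE_CODES = {0xB, 0xD}


def set_sign_code(current_sign, second_param):
    """Return the sign code produced for a value with the given sign."""
    if second_param not in (0, 1):
        raise ValueError("the second parameter can only be 0 or 1")
    if current_sign in NEGATIVE_CODES:
        return 0xD                       # negative values always get 0xD
    if current_sign in POSITIVE_CODES:
        return 0xC if second_param == 0 else 0xF
    raise ValueError("not a valid packed decimal sign code")
```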
---------
Co-authored-by: Aditi-Medhane <aditi.medhane@ibm.com>