The stack slot coloring pass is concerned with optimizing spill
slots. If any change is a pass is made over the function to remove
stack stores that use the same register and stack slot as an
immediately preceding load.
The register check is too simple for constant registers like AArch64
and RISC-V's zero register. This register can be used as the result
of a load if we want to discard the result, but still have the memory
access performed. Like for a volatile or atomic load.
If the code sees a load from the zero register followed by a store
of the zero register at the same stack slot, the pass mistakenly
believes the store isn't needed.
Since the main stack coloring optimization is only concerned with
spill slots, it seems reasonable that RemoveDeadStores should
only be concerned with spills. Since we never generate a reload of
x0, this avoids the issue seen by RISC-V.
Test case concept is adapted from pr30821.mir from X86. That test
had to be updated to mark the stack slot as a spill slot.
Fixes#80052.
In preparation for implementing code generation for more @llvm.ptrauth.* intrinsics, move the expansion of blend(register, small integer) variant of @llvm.ptrauth.blend to the AArch64PointerAuth pass, where most other PAuth-related code generation takes place.
RegisterBank Selection for scalable vector G_ADD and G_SUB by creating
new mappings for different types of vector register banks.
Then implement Instruction Selection for the same operations by choosing
the correct RISC-V vector register class.
GCC has supported a generic constraint "s" for a long time (since at
least 1992), which references a symbol or label with an optional
constant offset. "i" is a superset that also supports a constant
integer.
GCC's RISC-V port also supports a machine-specific constraint "S",
which cannot be used with a preemptible symbol. (We don't bother to
check preemptibility.) In PIC code, an external symbol is preemptible by
default, making "S" less useful if you want to create an artificial
reference for linker garbage collection, or define sections to hold
symbol addresses:
```
void fun();
// error: impossible constraint in ‘asm’ for riscv64-linux-gnu-gcc -fpie/-fpic
void foo() { asm(".reloc ., BFD_RELOC_NONE, %0" :: "S"(fun)); }
// good even if -fpie/-fpic
void foo() { asm(".reloc ., BFD_RELOC_NONE, %0" :: "s"(fun)); }
```
This patch adds support for "s". Modify https://reviews.llvm.org/D105254
("S") to handle multi-depth GEPs (https://reviews.llvm.org/D61560).
Currently machine LICM hoist PC_ADD_REL_OFFSET out of loops, causes
register pressure when function calls are deep in loops. This is a main
cause of sgpr spill for programs containing large number of function
calls in loops.
This patch marks PC_ADD_REL_OFFSET as rematerializable, which eliminates
sgpr spills due to function calls in loops.
This patch utilizes the -maix-small-local-exec-tls option to produce a
faster,
non-TOC-based access sequence for the local-exec TLS model.
Specifically, for
when the offsets from the TLS variable are non-zero.
In particular, this patch produces either a single:
- addi/la with a displacement off of R13 plus a non-zero offset for when
an address is calculated, or
- load or store off of R13 plus a non-zero offset for when an address is
calculated and used for further
access where R13 is the thread pointer, respectively.
In order to produce a single addi or load/store off of the thread
pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when
printing these instructions.
Specifically:
- The non-zero offset is added to the TLS variable address when the
address of the
TLS variable + it's offset is less than 32KB.
- Otherwise, when the address of the TLS variable + its offset is
greater than 32KB, the
non-zero offset (and a multiple of 64KB) is subtracted from the TLS
address.
This handling in the assembly printer is necessary to ensure that the
TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit
within the addi/load/store instructions.
This patch is meant to be a follow-up to
3f46e5453d9310b15d974e876f6132e3cf50c4b1 (where the
optimization occurs for when the offset is zero).
Since https://github.com/ARM-software/acle/pull/276 the ACLE
defines attributes to better describe the use of a given SME state.
Previously the attributes merely described the possibility of it being
'shared' or 'preserved', whereas the new attributes have more semantics
and also describe how the data flows through the program.
For ZT0 we already had to add new LLVM IR attributes:
* aarch64_new_zt0
* aarch64_in_zt0
* aarch64_out_zt0
* aarch64_inout_zt0
* aarch64_preserves_zt0
We have now done the same for ZA, such that we add:
* aarch64_new_za (previously `aarch64_pstate_za_new`)
* aarch64_in_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_out_za (more specific variation of `aarch64_pstate_za_shared`)
* aarch64_inout_za (more specific variation of
`aarch64_pstate_za_shared`)
* aarch64_preserves_za (previously `aarch64_pstate_za_shared,
aarch64_pstate_za_preserved`)
This explicitly removes 'pstate' from the name, because with SME2 and
the new ACLE attributes there is a difference between "sharing ZA"
(sharing
the ZA matrix register with the caller) and "sharing PSTATE.ZA" (sharing
either the ZA or ZT0 register, both part of PSTATE.ZA with the caller).
Summary:
The old `s_memtime` instruction was supported until the GFX10
architecture. Although this instruction has a higher latency than the
new shader counter, it's much more usable as a processor clock as it is
a full 64-bit counter. The new shader counter is only a 20-bit counter,
which makes it difficult to use as a standard cycle counter as it will
overflow in a few milliseconds. This patch suggests preferring
`s_memtime` for this instrinsic if it is still available.
We don't have an AMO instruction for Nand, so with the A extension we
use an LR/SC loop. If we have Zacas we can use a CAS loop instead.
According to the Zacas spec, a CAS loop scales to highly parallel
systems better than LR/SC.
If we can't produce a large enough index vector in i8, we may need to legalize
the shuffle (via scalarization - which in turn gets lowered into stack usage).
This change makes two related changes:
* Deferring legalization until we actually need to generate the vrgather
instruction. With the new recursive structure, this only happens when
doing the fallback for one of the arms.
* Check the actual mask values for something outside of the representable
range.
Both are covered by recently added tests.
Add TableGen patterns to convert more instructions to boolean
expressions:
- **mul -> and/or**: i1 multiply instructions currently cannot be
selected causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics.
Add a new node `AArch64ISD::URSHR_I_PRED`.
`srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` is transformed to
`urshr`, or to `rshrnb` (as before) if the result it truncated.
`uzp1(rshrnb(uunpklo(X),C), rshrnb(uunpkhi(X), C))` is converted to
`urshr(X, C)` (tested by the wide_trunc tests).
Pattern matching code in `canLowerSRLToRoundingShiftForVT` is taken
from prior code in rshrnb. It returns true if the add has NUW or if the
number of bits used in the return value allow us to not care about the
overflow (tested by rshrnb test cases).
This patch adds support for common and local symbols in the TOC for AIX.
Note that we need to update isVirtualSection so as a common symbol in
TOC will have the symbol type XTY_CM and will be initialized when placed
in the TOC so sections with this type are no longer virtual.
---------
Co-authored-by: Zaara Syeda <syzaara@ca.ibm.com>
Currently, the `PPCMergeStringPool` merges the global variable after the
`AsmPrinter` initializer adds the global variables to its symbol list.
This is to move the merging work of `PPCMergeStringPool` to its
initializer, just like what GlobalMerge does, to avoid adding merged
global variables to the `AsmPrinter` symbol lis.
Extra space causes the checks generated by update_mir_test_checks to be
unavailable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
Broadcast of a single float should not be any slower than
loading 32B using vmovaps. So remat it can help reduce
register spill when there is big register pressure.
We try to only use X32 for gnux32 triple tests.
Use no_x86_scrub_mem_shuffle so the test shows updated shuffle intermediate and the +4 offset into the constant pool vector entry
The goal of this PR is to tolerate differences between description of
formal arguments by function metadata (represented by "kernel_arg_type")
and LLVM actual parameter types. A compiler may use "kernel_arg_type" of
function metadata fields to encode detailed type information, whereas
LLVM IR may utilize for an actual parameter a more general type, in
particular, opaque pointer type. This PR proposes to resolve this by a
fallback to LLVM actual parameter types during the lowering of formal
function arguments in cases when the type can't be created by string
content of "kernel_arg_type", i.e., when "kernel_arg_type" contains a
type unknown for the SPIR-V Backend.
An example of the issue manifestation is
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/test/transcoding/KernelArgTypeInOpString.ll,
where a compiler generates for the following kernel function detailed
`kernel_arg_type` info in a form of `!{!"image_kernel_data*", !"myInt",
!"struct struct_name*"}`, and in LLVM IR same arguments are referred to
as `@foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData)`.
Both definitions are correct, and the resulting LLVM IR is correct, but
lowering stage of SPIR-V Backend fails to generate SPIR-V type.
```
typedef int myInt;
typedef struct {
int width;
int height;
} image_kernel_data;
struct struct_name {
int i;
int y;
};
void kernel foo(__global image_kernel_data* in,
__global struct struct_name *outData,
myInt out) {}
```
```
define spir_kernel void @foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData) ... !kernel_arg_type !7 ... {
entry:
ret void
}
...
!7 = !{!"image_kernel_data*", !"myInt", !"struct struct_name*"}
```
The PR changes a contract of `SPIRVType *getArgSPIRVType(...)` in a way
that it may return `nullptr` to signal that the metadata string content
is not recognized, so corresponding comments are added and a couple of
checks for `nullptr` are inserted where appropriate.
PAL uses ELF REL (not RELA) relocations which can only store a 32-bit
addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI
which require the upper 32 bits of a 64-bit address calculation to be
correct. This means that it is not safe to fold an arbitrary offset into
a GlobalAddressSDNode, so stop doing that.
In practice this is mostly a problem for small negative offsets which do
not work as expected because PAL treats the 32-bit addend as unsigned.
This patch merges the logic of `cannotBeOrderedLessThanZeroImpl` into
`computeKnownFPClass` to improve the signbit inference.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This patch introduces a 'COALESCER_BARRIER' which is a pseudo node that
expands to
a 'nop', but which stops the register allocator from coalescing a COPY
node when
its use/def crosses a SMSTART or SMSTOP instruction.
For example:
%0:fpr64 = COPY killed $d0
undef %2.dsub:zpr = COPY %0 // <- Do not coalesce this COPY
ADJCALLSTACKDOWN 0, 0
MSRpstatesvcrImm1 1, 0, csr_aarch64_smstartstop, implicit-def dead $d0
$d0 = COPY killed %0
BL @use_f64, csr_aarch64_aapcs
If the COPY would be coalesced, that would lead to:
$d0 = COPY killed %0
being replaced by:
$d0 = COPY killed %2.dsub
which means the whole ZPR reg would be live upto the call, causing the
MSRpstatesvcrImm1 (smstop) to spill/reload the ZPR register:
str q0, [sp] // 16-byte Folded Spill
smstop sm
ldr z0, [sp] // 16-byte Folded Reload
bl use_f64
which would be incorrect for two reasons:
1. The program may load more data than it has allocated.
2. If there are other SVE objects on the stack, the compiler might use
the
'mul vl' addressing modes to access the spill location.
By disabling the coalescing, we get the desired results:
str d0, [sp, #8] // 8-byte Folded Spill
smstop sm
ldr d0, [sp, #8] // 8-byte Folded Reload
bl use_f64
Similar to #78403, but for scalable `vwadd(u).wv`, given that #76785 is recommited.
### Code
```
define <vscale x 8 x i64> @vwadd_wv_mask_v8i32(<vscale x 8 x i32> %x, <vscale x 8 x i64> %y) {
%mask = icmp slt <vscale x 8 x i32> %x, shufflevector (<vscale x 8 x i32> insertelement (<vscale x 8 x i32> poison, i32 42, i64 0), <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer)
%a = select <vscale x 8 x i1> %mask, <vscale x 8 x i32> %x, <vscale x 8 x i32> zeroinitializer
%sa = sext <vscale x 8 x i32> %a to <vscale x 8 x i64>
%ret = add <vscale x 8 x i64> %sa, %y
ret <vscale x 8 x i64> %ret
}
```
### Before this patch
[Compiler Explorer](https://godbolt.org/z/xsoa5xPrd)
```
vwadd_wv_mask_v8i32:
li a0, 42
vsetvli a1, zero, e32, m4, ta, ma
vmslt.vx v0, v8, a0
vmv.v.i v12, 0
vmerge.vvm v24, v12, v8, v0
vwadd.wv v8, v16, v24
ret
```
### After this patch
```
vwadd_wv_mask_v8i32:
li a0, 42
vsetvli a1, zero, e32, m4, ta, ma
vmslt.vx v0, v8, a0
vsetvli zero, zero, e32, m4, tu, mu
vwadd.wv v16, v16, v8, v0.t
vmv8r.v v8, v16
ret
```
Make the wmma intrinsic type signatures to be canonical. We need
a type signature as long as the type is not fixed. However, when an
argument's type matches a previous argument's type, we do not need the
signature for this argument.
This patch fixes three general cases:
1. add missing signatures
2. remove signatures for matching arguments
3. reorer the signatures -- return type signature should always appear
first
This is a fix for the regression seen in
https://github.com/llvm/llvm-project/pull/79498
> Currently, the way that recomputeLiveIns works is that it will
recompute the livein registers for that MachineBasicBlock but it matters
what order you call recomputeLiveIn which can result in incorrect
register allocations down the line.
Now we do not recompute the entire CFG but we do ensure that the newly
added MBB do reach convergence.
In `CoalesceFeaturesAndStripAtomics`, feature string is converted to FeatureBitset and back to feature string. It will lose information about explicit diasbled features.