It is generally better to allow the target-independent combines to run before creating AArch64-specific nodes (provided they don't make things worse). This moves the generation of BSL nodes to lowering, not a combine, so that intermediate nodes are more likely to be optimized. There is a small change in the constant handling to detect legalized build-vector arguments correctly.
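For reference, the kind of input that ends up as a BSL is a bitwise select; a minimal C sketch (not taken from the patch; the function name is illustrative):
```
#include <stdint.h>

/* Bitwise select: (a & mask) | (b & ~mask). When vectorized for AArch64
   this maps onto the BSL instruction. */
void select_lanes(uint32_t *restrict out, const uint32_t *restrict a,
                  const uint32_t *restrict b, const uint32_t *restrict mask,
                  int n) {
  for (int i = 0; i < n; i++)
    out[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
}
```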
Fixes #149380, but not directly; #151856 contained a direct fix for expanding the pseudos.
This patch reworks how VG is handled around streaming mode changes.
Previously, for functions with streaming mode changes, we would:
- Save the incoming VG in the prologue
- Emit `.cfi_offset vg, <offset>` and `.cfi_restore vg` around streaming
mode changes
Additionally, for locally streaming functions, we would:
- Also save the streaming VG in the prologue
- Emit `.cfi_offset vg, <incoming VG offset>` in the prologue
- Emit `.cfi_offset vg, <streaming VG offset>` and `.cfi_restore vg`
around streaming mode changes
In both cases, this ends up doing more than necessary and would be hard
for an unwinder to parse, as using `.cfi_offset` in this way does not
follow the semantics of the underlying DWARF CFI opcodes.
So the new scheme in this patch is to:
In functions with streaming mode changes (including locally streaming functions):
- Save the incoming VG in the prologue
- Emit `.cfi_offset vg, <offset>` in the prologue (not at streaming mode
changes)
- Emit `.cfi_restore vg` after the saved VG has been deallocated
- This will be in the function epilogue, where VG is always the same as
the entry VG
- Explicitly reference the incoming VG expressions for SVE callee-saves
in functions with streaming mode changes
- Ensure the CFA is not described in terms of VG in functions with
streaming mode changes
A more in-depth discussion of this scheme is available in:
https://gist.github.com/MacDue/b7a5c45d131d2440858165bfc903e97b
But the TL;DR is that, following this scheme, SME unwinding can be implemented with minimal changes to existing unwinders. All an unwinder needs to do is initialize VG to `CNTD` at the start of unwinding; everything else is handled by standard opcodes (which don't need changes to handle VG).
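As a rough illustration (a hedged sketch assuming a compiler with SME support and the ACLE `__arm_streaming` keyword; the function names are made up), a function containing a streaming mode change now saves the incoming VG once in the prologue and drops the CFI rule once in the epilogue:
```
/* Hedged sketch: calling a streaming function from a non-streaming one
   introduces a streaming mode change, so the prologue saves the incoming
   VG (described with .cfi_offset vg, <offset>) and the epilogue emits
   .cfi_restore vg after the save slot has been deallocated. */
void streaming_callee(void) __arm_streaming;

void caller(void) {
  streaming_callee();  /* smstart/smstop around this call change VG */
}
```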
This updates the fp<->int tests to include some store(fptoi) and itofp(load) test cases. It also cuts down on the number of large vector cases that do not test anything new.
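The new cases correspond to patterns of roughly this shape (a minimal C sketch, not the generated tests themselves):
```
/* store(fptosi): convert, then store. */
void fp_to_int(int *restrict out, const float *restrict in, int n) {
  for (int i = 0; i < n; i++)
    out[i] = (int)in[i];
}

/* sitofp(load): load, then convert. */
void int_to_fp(float *restrict out, const int *restrict in, int n) {
  for (int i = 0; i < n; i++)
    out[i] = (float)in[i];
}
```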
Fix incorrect super-register lookup when copying from $wzr on subtargets
that lack zero-cycle zeroing but support 64-bit zero-cycle moves.
When copying from $wzr, we used the wrong register class to lookup the
super-register, causing
$w0 = COPY $wzr
to get expanded as
$x0 = ORRXrr $xzr, undef $noreg, implicit $wzr,
rather than the correct
$x0 = ORRXrr $xzr, undef $xzr, implicit $wzr.
## Short Summary
This patch adds a new pass `aarch64-machine-sme-abi` to handle the ABI
for ZA state (e.g., lazy saves and agnostic ZA functions). This is
currently not enabled by default (but aims to be by LLVM 22). The goal
is for this new pass to more optimally place ZA saves/restores and to
work with exception handling.
## Long Description
This patch reimplements management of ZA state for functions with
private and shared ZA state. Agnostic ZA functions will be handled in a
later patch. For now, this is under the flag `-aarch64-new-sme-abi`,
however, we intend for this to replace the current SelectionDAG
implementation once complete.
The approach taken here is to mark instructions as needing ZA to be in a
specific state ("ACTIVE" or "LOCAL_SAVED"). Machine instructions implicitly
defining or using ZA registers (such as $zt0 or $zab0) require the
"ACTIVE" state. Function calls may need the "LOCAL_SAVED" or "ACTIVE"
state depending on the callee (having shared or private ZA).
We already add ZA register uses/definitions to machine instructions, so
no extra work is needed to mark these.
Calls need to be marked by gluing AArch64ISD::INOUT_ZA_USE or
AArch64ISD::REQUIRES_ZA_SAVE to the CALLSEQ_START.
These markers are then used by the MachineSMEABIPass to find
instructions where there is a transition between required ZA states.
These are the points we need to insert code to set up or restore a ZA
save (or initialize ZA).
To handle control flow between blocks (which may have different ZA state
requirements), we bundle the incoming and outgoing edges of blocks.
Bundles are formed by assigning each block an incoming and outgoing
bundle (initially, all blocks have their own two bundles). Bundles are
then combined by joining the outgoing bundle of a block with the
incoming bundle of all successors.
These bundles are then assigned a ZA state based on the blocks that
participate in the bundle. Blocks whose incoming edges are in a bundle
"vote" for a ZA state that matches the state required at the first
instruction in the block, and likewise, blocks whose outgoing edges are
in a bundle vote for the ZA state that matches the last instruction in
the block. The ZA state with the most votes is used, which aims to
minimize the number of state transitions.
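To make the states concrete, here is a hedged C sketch using the ACLE ZA keywords (assuming `__arm_new("za")` and `__arm_inout("za")` are available; the callee names are illustrative only):
```
void private_za_callee(void);                   /* private ZA */
void shared_za_callee(void) __arm_inout("za");  /* shared ZA  */

void uses_za(void) __arm_new("za") {
  shared_za_callee();   /* requires ZA in the ACTIVE state          */
  private_za_callee();  /* requires LOCAL_SAVED: a lazy save is set
                           up around this call                      */
  shared_za_callee();   /* back to ACTIVE: restore ZA if the callee
                           committed the lazy save                  */
}
```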
An llvm.vector.reduce.fadd(float, <1 x float>) will be translated to
G_VECREDUCE_SEQ_FADD with two scalar operands, which is illegal
according to the verifier. This makes sure we generate a fadd/fmul
instead.
LiveVariables will mark instructions with their implicit subregister
uses. However, it will also mark a register as an implicit use when the
instruction's own definition is a subregister of it, i.e. `$r3 = OP val,
implicit-def $r0_r1_r2_r3, ..., implicit $r2_r3`, even if that register
is otherwise unused; this defines $r3 on the same line it is used.
This change ensures such uses are marked without `implicit`, i.e. `$r3 =
OP val, implicit-def $r0_r1_r2_r3, ..., $r2_r3`.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Reland of #142941
Squashed with fixes for #150004, #149585
This matches gather-like patterns where values are loaded per lane into
NEON registers, and replaces them with loads into two separate
registers, which are then combined with a zip instruction. This
decreases the critical path length and improves memory-level
parallelism.
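The targeted pattern looks roughly like this in C with NEON intrinsics (a hedged sketch; the combine itself operates on the selected machine instructions, not on source like this):
```
#include <arm_neon.h>

/* Four lanes loaded from non-contiguous addresses into one register.
   The combine instead loads pairs into two registers and merges them
   with a zip, shortening the dependency chain. */
float32x4_t gather4(const float *p, int i0, int i1, int i2, int i3) {
  float32x4_t v = vdupq_n_f32(0.0f);
  v = vsetq_lane_f32(p[i0], v, 0);
  v = vsetq_lane_f32(p[i1], v, 1);
  v = vsetq_lane_f32(p[i2], v, 2);
  v = vsetq_lane_f32(p[i3], v, 3);
  return v;
}
```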
rdar://151851094
The patch adds patterns to select the EXT_ZZI_CONSTRUCTIVE pseudo
instead of the EXT_ZZI destructive instruction for vector_splice. This
only works when the two inputs to vector_splice are identical.
Given that registers aren't tied anymore, this gives the register
allocator more freedom and a lot of MOVs get replaced with MOVPRFX.
In some cases, however, we could have just chosen the same input and
output register, but regalloc preferred not to. This means some test
cases now have more instructions: there is now a MOVPRFX where
previously no MOV was needed.
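A splice with identical inputs corresponds to a rotate of a single vector; a hedged ACLE sketch (assumes SVE support and `arm_sve.h`):
```
#include <arm_sve.h>

/* vector_splice with both inputs equal, i.e. rotate v by one element.
   With the constructive pseudo this can be selected as MOVPRFX + EXT
   instead of tying the destination to a source register. */
svfloat32_t rotate_by_one(svfloat32_t v) {
  return svext_f32(v, v, 1);
}
```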
This reverts commit 16314eb7312dab38d721c70f247f2117e9800704 as the test cases
are failing under EXPENSIVE_CHECKS. Scalar vecreduce.fadd is not valid in
GISel.
They use extract shuffles for fixed vectors, and
llvm.vector.splice intrinsics for scalable vectors.
In the previous tests using ld+extract+st, the extract was optimized
away and replaced by a smaller load at the right offset. This meant
we didn't really test the vector_splice ISD node.
It will get expanded into MOVPRFX_ZZ and EXT_ZZI by the
AArch64ExpandPseudo pass. This instruction takes a single Z register as
input, as opposed to the existing destructive EXT_ZZI instruction.
Note this patch only defines the pseudo; it isn't used in any ISel
pattern yet. It will later be used for vector.extract.
TargetFrameIndex shouldn't be used as an operand to a target-independent
node such as a load. This causes ISel issues.
#81635 fixed a similar issue with this code using a TargetConstant
instead of a Constant.
Fixes #142314.
For an <8 x i32> -> <2 x i128> bitcast, which under AArch64 is split into
two halves, the scalar i128 remainder was causing problems, leading to a
crash with invalid vector types. This makes sure they are handled
correctly in fewerElementsBitcast.
Cause:
1. An `implicit_def` inside a bundle does not count as a definition of
the register for the MachineInstr verifier.
2. As a result, the related register is left without a definition,
resulting in `Bad machine code: Using an undefined physical register`
from the MachineInstr verifier.
Fixes https://github.com/llvm/llvm-project/issues/139102
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
`f16` is passed and returned in vector registers on both x86 and AArch64,
the same calling convention as `f32`, so it is a straightforward type to
support. The calling convention support already exists, added as part of
a6065f0fa55a ("Arm64EC entry/exit thunks, consolidated. (#79067)").
Thus, add mangling and remove the error in order to make `half` work.
MSVC does not yet support `_Float16`, so for now this will remain an
LLVM-only extension.
Fixes the `f16` portion of
https://github.com/llvm/llvm-project/issues/94434
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it became clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
Since the `dontcall-*` attributes are checked both by
`FastISel`/`GlobalISel` and by `SelectionDAGBuilder`, and both `FastISel`
and `GlobalISel` bail out for calls on Arm64EC only AFTER doing the
check, we ended up emitting duplicate copies of this error.
This change moves the checking for `dontcall-*` in `FastISel` and
`GlobalISel` to after the call has been successfully lowered.
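For reference, Clang emits `dontcall-error`/`dontcall-warn` for the `error`/`warning` attributes; a minimal C sketch of a call that previously produced the diagnostic twice on Arm64EC (function names are illustrative):
```
__attribute__((error("do not call this"))) void forbidden(void);

void caller(void) {
  forbidden();  /* previously diagnosed twice on Arm64EC */
}
```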
For loops such as this:
```
struct foo {
double a, b;
};
void foo(struct foo *dst, struct foo *src, int n) {
for (int i = 0; i < n; i++) {
dst[i].a += src[i].a * 3.2;
dst[i].b += src[i].b * 3.2;
}
}
```
the complex deinterleaving pass will spot that the deinterleaving
associated with the structured loads cancels out the interleaving
associated with the structured stores. This happens even though
they are not truly "complex" numbers because the pass can handle
symmetric operations too. This is great because it means we can
then perform normal loads and stores instead. However, we can also
do the same for higher interleave factors, e.g. 4:
```
struct foo {
double a, b, c, d;
};
void foo(struct foo *dst, struct foo *src, int n) {
for (int i = 0; i < n; i++) {
dst[i].a += src[i].a * 3.2;
dst[i].b += src[i].b * 3.2;
dst[i].c += src[i].c * 3.2;
dst[i].d += src[i].d * 3.2;
}
}
```
This PR extends the pass to effectively treat such structures as
a set of complex numbers, i.e.
```
struct foo_alt {
std::complex<double> x, y;
};
```
with equivalence between members:
```
foo_alt.x.real == foo.a
foo_alt.x.imag == foo.b
foo_alt.y.real == foo.c
foo_alt.y.imag == foo.d
```
I've written the code to handle sets with arbitrary numbers of
complex values, but since we only support interleave factors
between 2 and 4 I've restricted the sets to 1 or 2 complex
numbers. Also, for now I've restricted support for interleave
factors of 4 to purely symmetric operations only. However, it
could also be extended to handle complex multiplications,
reductions, etc.
Fixes: https://github.com/llvm/llvm-project/issues/144795
We only do conditional streaming mode changes in two cases:
- Around calls in streaming-compatible functions that don't have a
streaming body
- At the entry/exit of streaming-compatible functions with a streaming
body
In both cases, the condition depends on the entry pstate.sm value. Given
this, we don't need to emit calls to __arm_sme_state at every mode
change.
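For example (a hedged sketch assuming the ACLE `__arm_streaming_compatible` keyword; the names are illustrative), both calls below need a conditional mode change, and the condition is the same entry pstate.sm value both times:
```
void non_streaming_callee(void);

void sc_fn(void) __arm_streaming_compatible {
  /* If pstate.sm was 1 on entry, we must smstop before each call and
     smstart after it; if it was 0, no change is needed. */
  non_streaming_callee();
  non_streaming_callee();
}
```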
This patch handles this by placing an "AArch64ISD::ENTRY_PSTATE_SM" node
in the entry block and copying the result to a register. The register is
then used whenever we need to emit a conditional streaming mode change.
The "ENTRY_PSTATE_SM" node expands to a call to "__arm_sme_state" only
if (after SelectionDAG) the function is determined to have
streaming-mode changes.
This has two main advantages:
1. It allows back-to-back conditional smstart/stop pairs to be folded
2. It has the correct behaviour for EH landing pads
- These are entered with pstate.sm = 0, and should switch mode based on
the entry pstate.sm
- Note: This is not fully implemented yet
Many backends are missing either all tests for lrint, or specifically
those for f16, which currently crashes for `softPromoteHalf` targets.
For a number of popular backends, do the following:
* Ensure f16, f32, f64, and f128 are all covered
* Ensure both a 32- and 64-bit target are tested, if relevant
* Add `nounwind` to clean up CFI output
* Add a test covering the above if one did not exist
* Always specify the integer type in intrinsic calls
There are quite a few FIXMEs here, especially for `f16`, but much of
this will be resolved in the near future.
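A hedged C sketch of the per-type coverage (the actual tests are LLVM IR; `_Float16` availability depends on the target, and f128 is covered where `long double` maps to it):
```
#include <math.h>

long      lrint_f16(_Float16 x)     { return lrintf((float)x); }
long      lrint_f32(float x)        { return lrintf(x); }
long      lrint_f64(double x)       { return lrint(x); }
long long llrint_f64(double x)      { return llrint(x); }
long      lrint_f128(long double x) { return lrintl(x); }
```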
In each of the following groups we keep the first test:
- extadd[us]_v8i8_i16, test_vaddl_[us]8, [us]addl8h
- extadd[us]_v4i16_i32, test_vaddl_[us]16, [us]addl4s
- extadd[us]_v2i32_i64, test_vaddl_[us]32, [us]addl2d
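Each group exercises the same widening-add pattern; for the i8 -> i16 case, roughly (a hedged NEON sketch):
```
#include <arm_neon.h>

/* Widening add of two 8 x i8 vectors into an 8 x i16 result (uaddl). */
uint16x8_t extaddu_v8i8_i16(uint8x8_t a, uint8x8_t b) {
  return vaddl_u8(a, b);
}
```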
So far, GlobalISel's G_PTR_ADD combines have ignored MIFlags like nuw, nusw,
and inbounds. That was in many cases unnecessarily conservative and in others
unsound, since reassociations re-used the existing G_PTR_ADD instructions
without invalidating their flags. This patch aims to improve that.
I've checked the transforms in this PR with Alive2 on corresponding middle-end
IR constructs.
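The kind of reassociation in question, written as C pointer arithmetic (a hedged sketch; the actual combines operate on G_PTR_ADD):
```
#include <stddef.h>

/* (p + a) + b can be reassociated to p + (a + b); the combine must not
   simply keep the nuw/nusw/inbounds flags of the original additions on
   the rewritten ones without justification. */
int *reassoc(int *p, size_t a, size_t b) {
  int *q = p + a;  /* first G_PTR_ADD  */
  return q + b;    /* second G_PTR_ADD */
}
```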
A longer-term goal would be to encapsulate the logic that determines which
GEP/ISD::PTRADD/G_PTR_ADD flags can be preserved in which case, since this
occurs in similar forms in the middle end, the SelectionDAG combines, and the
GlobalISel combines here.
For SWDEV-516125.
Now, why would we want to do this?
There are a small number of places where this helps:
1. It helps peephole optimization by requiring less flag checking.
2. It allows expressions such as `x - 0x80000000 < 0` to be folded to a
`cmp` of x against a register holding that value (see the sketch below).
3. We can refine the other passes over time for this.
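A minimal sketch of item 2 (illustrative C, not from the patch):
```
#include <stdint.h>

/* The subtract-then-sign-test can be folded into a single compare of x
   against a register holding 0x80000000. */
int below_bias(uint32_t x) {
  return (int32_t)(x - 0x80000000u) < 0;  /* equivalent to x < 0x80000000u */
}
```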
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
The histogram DAG combine went into an infinite loop of creating the
same histogram node due to an incorrect use of the `refineUniformBase`
and `refineIndexType` APIs.
These APIs take SDValues by reference (SDValue&) and return `true` if
they were "refined" (i.e., set to new values).
Previously, this DAG combine would create the `Ops` array (used to
create the new histogram node) before calling the `refine*` APIs. Since
creating the array copies the SDValues, the refined values were not used
when building the new histogram node.
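A simplified C analogy of the ordering bug (the real code passes `SDValue&`; here a pointer plays that role, and the names are made up):
```
#include <stdbool.h>

/* 'refine' updates *v in place and reports whether it did. Copying v
   into ops[] before calling refine leaves the stale value in ops[],
   which mirrors the bug described above. */
static bool refine(int *v) { *v += 1; return true; }

void build_node(int v) {
  int ops[1] = { v };  /* copy taken too early     */
  refine(&v);          /* v updated, ops[0] is not */
  /* ... the new node would be built from the stale ops[0] ... */
  (void)ops;
}
```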
Reproducer: https://godbolt.org/z/hsGWhTaqY (it will timeout)
This prevents cases where some of the operands match from hitting
verifier errors due to kill flags. These nodes should have been removed
earlier in most cases.
Fixes the direct issue from #149380. #151855 cleans up the codegen.