This fixes issue #114194
The issue happens during the `LiveRangeShrink` pass, which runs early,
before phi elimination. LandingPads, which are lowered to EHLabels, need
to be the first non phi instruction in an EHPad. In case of a phi node
being in front of the EHLabel and a use being after the EHLabel, we
hoist the use in front of the label.
This results in a portion of the landingpad missing due to being hoisted
in front of the label.
Add support for '`llvm.nvvm.flo.[su].*`' intrinsics which correspond to
a PTX `bfind` instruction.
See [PTX ISA 9.7.1.16. Integer Arithmetic Instructions: bfind]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfind)
The '`llvm.nvvm.flo.u`' family of intrinsics identifies the bit position
of the leading one, returning either it's offset from the most or least
significant bit.
The '`llvm.nvvm.flo.s`' family of intrinsics identifies the bit position
of the leading non-sign bit, returning either it's offset from the most
or least significant bit.
In ca409892c5396fa3fbb8ea4dbf53d0e952f36d09, frame indexes started
being treated more like registers, rather than immediates. Update
the commute logic to avoid failing the verifier by moving illegal
SGPR operands in place of a frame index.
Tests taken from Luke's 88147 with minimal changes by me (preames).
The main case of interest here is when mask length is less than source
length (i.e. length is decreasing). We often scalarize these, which
on RISCV can be quite painful.
This reverts commit 8a849a2a567d4e519b246a16936b6e7519936d4b.
It seems I missed a spot when trying to ensure the code in the
instruction selection tests were actually legalized MIR.
`v2i8` is an unsupported type, so we hit the default legalization rules
which perform the bitcast in stack memory and is very inefficient on
GPU.
This adds a custom lowering where we pack `v2i8` into `i16` and from
there use another bitcast node to reach the final desired type. And also
the inverse unpacking `i16` into `v2i8`.
With SEW=64, the vnsrl trick we primary rely on does not work. This
is handled correctly today, but we have fairly minimal testing of the
resulting shuffles which makes it hard to demonstrate value of an
upcoming change.
When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLLDM and
VLSTM instruction.
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
This re-applies #96164 after revert in #102434.
Support the following relocations and assembly operators:
- `R_AARCH64_AUTH_ADR_GOT_PAGE` (`:got_auth:` for `adrp`)
- `R_AARCH64_AUTH_LD64_GOT_LO12_NC` (`:got_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_GOT_ADD_LO12_NC` (`:got_auth_lo12:` for `add`)
`LOADgotAUTH` pseudo-instruction is introduced which is later expanded to
actual instruction sequence like the following.
```
adrp x16, :got_auth:sym
add x16, x16, :got_auth_lo12:sym
ldr x0, [x16]
autia x0, x16
```
If a resign is requested, like below, `LOADgotPAC` pseudo is used, and GOT
load is lowered similarly to `LOADgotAUTH`.
```
@var = global i32 0
define ptr @resign_globalvar() {
ret ptr ptrauth (ptr @var, i32 3, i64 43)
}
```
If FPAC bit is not set and auth instruction is emitted, a check+trap sequence
similar to one used for `AUT` pseudo is emitted to ensure auth success.
Both SelectionDAG and GlobalISel are suppported.
For FastISel, we fall back to SelectionDAG.
Tests starting with 'ptrauth-' have corresponding variants w/o this prefix.
See also specification
https://github.com/ARM-software/abi-aa/blob/main/pauthabielf64/pauthabielf64.rst#appendix-signed-got
Fixes: https://github.com/llvm/llvm-project/issues/71030
Bug only happens in 64bit involving spills. Since we don't know when the
spill will happen, all instructions in the chain used to deduce sign
extension for eliminating 'extsw' will need to be promoted to 64-bit
pseudo instructions.
The following instruction will promoted in PPCMIPeepholes: EXTSH, LHA,
ISEL to EXTSH8, LHA8, ISEL8
Check to see if we are only demanding (shifted) signbits from a SRL node that are also signbits in the source node.
We can't demand any upper zero bits that the SRL will shift in (up to max shift amount), and the lower demanded bits bound must already be all signbits.
Prior to this patch, both `pcaddu18i` and `jirl` were marked as
scheduling boundaries to prevent instruction reordering that would
disrupt their adjacency. However, in certain cases, epilogues were still
being inserted between these two instructions, breaking the required
proximity. This patch ensures that `pcaddu18i` and `jirl` remain
adjacent even in the presence of epilogues, maintaining correct
relocation behavior for tail calls on LoongArch.
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.
Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
Atomic loads are handled differently from the DAG, and have separate opcodes
and explicit control over the extensions, like ordinary loads. Add
new patterns for these.
There's room for cleanup and improvement. d16 cases aren't handled.
Fixes#111645
Many tests had this flag removed because of the G_BITCAST emission
issue. Now that the PR is merged, we can re-enable this additional
check.
2 tests (basic_int_types) just have the TODO removed because they are
not useful for SPIR-V as-is: SPIR-V requires reg2mem/mem2reg to run,
which removes all the body. Integers are used in other spirv tests, and
seems like testing for spirv32/64 and relying on others for the logical
target coverage should be fine.
Signed-off-by: Nathan Gauër <brioche@google.com>
This shares most of its code with the scalar sincos expansion. It allows
expanding vector FSINCOS nodes to a library call from the specified
`-vector-library`. The upside of this is it will mean the vectorizer
only needs to handle the sincos intrinsic, which has no memory effects,
and this can handle lowering the intrinsic to a call that takes output
pointers.
We can use `viota`+`vrgather` to synthesize `vdecompress` and lower
expanding load to `vcpop`+`load`+`vdecompress`.
And if `%mask` is all ones, we can lower expanding load to a normal
unmasked load.
Fixes#101914.
This teaches dagcombiner to fold:
`(asr (add nsw x, y), 1) -> (avgfloors x, y)`
`(lsr (add nuw x, y), 1) -> (avgflooru x, y)`
as well the combine them to a ceil variant:
`(avgfloors (add nsw x, y), 1) -> (avgceils x, y)`
`(avgflooru (add nuw x, y), 1) -> (avgceilu x, y)`
iff valid for the target.
Removes some of the ARM MVE patterns that are now dead code.
It adds the avg opcodes to `IsQRMVEInstruction` as to preserve the
immediate splatting as before.
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
This commit makes the `generic` target to support FP and LSX, as
discussed in #110211. Thereby, it allows 128-bit vector to be enabled by
default in the loongarch64 backend.
In https://reviews.llvm.org/D147608 we added custom lowering for
integers, but inadvertently also marked it as custom for scalable FP
vectors despite not handling it.
This adds handling for floats and marks it as custom lowered for
fixed-length FP vectors too.
Note that this doesn't handle bf16 or f16 vectors that would need
promotion, but these scalar_to_vector nodes seem to be emitted when
expanding them.
This PR implements instruction selection for G_BITCAST on an earlier
stage to avoid MachineVerifier complains on subtle semantics difference
between G_BITCAST and OpBitcast.
We do instruction selections for OpBitcast after IR Translation instead
of calling MIB.buildBitcast() generating the general op code G_BITCAST,
because when MachineVerifier validates G_BITCAST we see a check of a
kind: 'if Source Type is equal to Destination Type then report error
"bitcast must change the type"'. This doesn't take into account the
notion of a typed pointer that is important for SPIR-V where a user may
and should use bitcast between pointers with different pointee types
(https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpBitcast).
It's important for correct lowering in SPIR-V, because interpretation of
the data type is not left to instructions that utilize the pointer, but
encoded by the pointer declaration, and the SPIRV target can and must
handle the declaration and use of pointers that specify the type of data
they point to.
It's not feasible to improve validation of G_BITCAST using just
information provided by low level types of source and destination.
Therefore we don't produce G_BITCAST as the general op code with
semantics different from OpBitcast, but rather lower to OpBitcast
immediately.
See discussion in https://github.com/llvm/llvm-project/pull/110270 for
even more context.
Use TSFlags to distinquish which type of rounding mode it is. We use the same tablegen base classes for vxrm and frm sometimes so its hard to have different types for different instructions.
It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y
equals to 32, the left-hand side is still x (unchanged), however, the
right-hand side will be evaluated to 0. So it is not always correct to
do such transformation.
We may be able to keep the pattern for immediate y while y is within [0,
31]. However, the immediate operands of the sub (32 - y) are easily
folded, and "(x << imm) >> imm" will be lowered to "and x,
(2^(32-imm))-1" anyway. So no bfe matching is needed.