Add support for '`llvm.nvvm.flo.[su].*`' intrinsics which correspond to
a PTX `bfind` instruction.
See [PTX ISA 9.7.1.16. Integer Arithmetic Instructions: bfind]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfind)
The '`llvm.nvvm.flo.u`' family of intrinsics identifies the bit position
of the leading one, returning either it's offset from the most or least
significant bit.
The '`llvm.nvvm.flo.s`' family of intrinsics identifies the bit position
of the leading non-sign bit, returning either it's offset from the most
or least significant bit.
In ca409892c5396fa3fbb8ea4dbf53d0e952f36d09, frame indexes started
being treated more like registers, rather than immediates. Update
the commute logic to avoid failing the verifier by moving illegal
SGPR operands in place of a frame index.
This reverts commit 8a849a2a567d4e519b246a16936b6e7519936d4b.
It seems I missed a spot when trying to ensure the code in the
instruction selection tests were actually legalized MIR.
`v2i8` is an unsupported type, so we hit the default legalization rules
which perform the bitcast in stack memory and is very inefficient on
GPU.
This adds a custom lowering where we pack `v2i8` into `i16` and from
there use another bitcast node to reach the final desired type. And also
the inverse unpacking `i16` into `v2i8`.
This pull request corrects multiple occurrences of the typo "avaliable"
to "available" across the LLVM and Clang codebase. These changes improve
the clarity and accuracy of comments and documentation. Specific
modifications are in the following files:
1. clang-tools-extra/clang-tidy/readability/FunctionCognitiveComplexityCheck.cpp:
Updated comments in readability checks for cognitive complexity.
2. llvm/include/llvm/ExecutionEngine/Orc/ExecutionUtils.h: Corrected
documentation for JITDylib responsibilities.
3. llvm/include/llvm/Target/TargetMacroFusion.td: Fixed descriptions for
FusionPredicate variables.
4. llvm/lib/CodeGen/SafeStack.cpp: Improved comments on DominatorTree
availability.
5. llvm/lib/Target/RISCV/RISCVSchedSiFive7.td: Enhanced resource usage
descriptions for vector units.
6. llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp: Updated invariant
description in shift-detect idiom logic.
7. llvm/test/MC/ARM/mve-fp-registers.s: Amended ARM MVE register
availability notes.
8. mlir/lib/Bytecode/Reader/BytecodeReader.cpp: Adjusted forward
reference descriptions for bytecode reader operations.
These changes have no impact on code functionality, focusing solely on
documentation clarity.
Co-authored-by: wangqiang <wangqiang1@kylinos.cn>
When doing a call from CMSE secure state to non-secure state for
v8-M.main, we use the VLLDM and VLSTM instructions to save, clear and
restore the FP registers around the call. These instructions both check
the CONTROL_S.SFPA bit, and if it is clear (meaning the current contents
of the FP registers are not secret) they execute as no-ops.
This causes a problem when CONTROL_S.SFPA==0 before the call, which
happens if there are no floating-point instructions executed between
entry to secure state and the call. If this is the case, then the VLSTM
instruction will do nothing, leaving the save area in the stack
uninitialised. If the called function returns a value in floating-point
registers, the call sequence includes an instruction to copy the return
value from a floating-point register to a GPR, which must be before the
VLLDM instruction. This copy sets CONTROL_S.SFPA, meaning that the VLLDM
will fully execute, and load the uninitialised stack memory into the FP
registers.
This causes two problems:
* The FP register file is clobbered, including all of the callee-saved
registers, which might contain live values.
* The stack region might contain secret values, which will be leaked to
non-secure state through the floating-point registers if/when we
return to non-secure state.
The fix is to insert a `vmov s0, s0` instruction before the VLSTM
instruction, to ensure that CONTROL_S.SFPA is set for both the VLLDM and
VLSTM instruction.
CVE: https://www.cve.org/cverecord?id=CVE-2024-7883
Security bulletin:
https://developer.arm.com/Arm%20Security%20Center/Cortex-M%20Security%20Extensions%20Vulnerability
This re-applies #96164 after revert in #102434.
Support the following relocations and assembly operators:
- `R_AARCH64_AUTH_ADR_GOT_PAGE` (`:got_auth:` for `adrp`)
- `R_AARCH64_AUTH_LD64_GOT_LO12_NC` (`:got_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_GOT_ADD_LO12_NC` (`:got_auth_lo12:` for `add`)
`LOADgotAUTH` pseudo-instruction is introduced which is later expanded to
actual instruction sequence like the following.
```
adrp x16, :got_auth:sym
add x16, x16, :got_auth_lo12:sym
ldr x0, [x16]
autia x0, x16
```
If a resign is requested, like below, `LOADgotPAC` pseudo is used, and GOT
load is lowered similarly to `LOADgotAUTH`.
```
@var = global i32 0
define ptr @resign_globalvar() {
ret ptr ptrauth (ptr @var, i32 3, i64 43)
}
```
If FPAC bit is not set and auth instruction is emitted, a check+trap sequence
similar to one used for `AUT` pseudo is emitted to ensure auth success.
Both SelectionDAG and GlobalISel are suppported.
For FastISel, we fall back to SelectionDAG.
Tests starting with 'ptrauth-' have corresponding variants w/o this prefix.
See also specification
https://github.com/ARM-software/abi-aa/blob/main/pauthabielf64/pauthabielf64.rst#appendix-signed-got
Fixes: https://github.com/llvm/llvm-project/issues/71030
Bug only happens in 64bit involving spills. Since we don't know when the
spill will happen, all instructions in the chain used to deduce sign
extension for eliminating 'extsw' will need to be promoted to 64-bit
pseudo instructions.
The following instruction will promoted in PPCMIPeepholes: EXTSH, LHA,
ISEL to EXTSH8, LHA8, ISEL8
The vsha2ms.vv from Zvknh[ab] currently supports both SEW=32 and SEW=64.
It might have different performance characteristics depending on the SEW
on some processors. This patch splits these two different SEWs into
their own VPsuedo opcodes and scheduling classes.
This is effectively a NFC change.
The change improves the code in general and, as a side effect, avoids
crashing on an impossible address space casts guarded
by `__isGlobal/__isShared`, which partially fixes
https://github.com/llvm/llvm-project/issues/112760
It's still possible to trigger the issue by using explicit AS casts w/o
AS checks, but LLVM should no longer crash on valid code.
This is #112964 + a small fix for the crash on unintended argument
access which was the root cause to revers the earlier version of the patch.
Prior to this patch, both `pcaddu18i` and `jirl` were marked as
scheduling boundaries to prevent instruction reordering that would
disrupt their adjacency. However, in certain cases, epilogues were still
being inserted between these two instructions, breaking the required
proximity. This patch ensures that `pcaddu18i` and `jirl` remain
adjacent even in the presence of epilogues, maintaining correct
relocation behavior for tail calls on LoongArch.
In #111000 we removed promotion of fadd/fmul reductions for bf16 and f16
without zvfh, and marked the cost as invalid to prevent the vectorizers
from emitting them. However it inadvertently didn't change the cost for
ordered reductions, so this moves the check earlier to fix this.
This also uses BasicTTIImpl instead which now assigns a valid but
expensive cost for fixed-length vectors, which reflects how codegen will
actually scalarize them.
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.
Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
Atomic loads are handled differently from the DAG, and have separate opcodes
and explicit control over the extensions, like ordinary loads. Add
new patterns for these.
There's room for cleanup and improvement. d16 cases aren't handled.
Fixes#111645
We can use `viota`+`vrgather` to synthesize `vdecompress` and lower
expanding load to `vcpop`+`load`+`vdecompress`.
And if `%mask` is all ones, we can lower expanding load to a normal
unmasked load.
Fixes#101914.
This teaches dagcombiner to fold:
`(asr (add nsw x, y), 1) -> (avgfloors x, y)`
`(lsr (add nuw x, y), 1) -> (avgflooru x, y)`
as well the combine them to a ceil variant:
`(avgfloors (add nsw x, y), 1) -> (avgceils x, y)`
`(avgflooru (add nuw x, y), 1) -> (avgceilu x, y)`
iff valid for the target.
Removes some of the ARM MVE patterns that are now dead code.
It adds the avg opcodes to `IsQRMVEInstruction` as to preserve the
immediate splatting as before.
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
This commit makes the `generic` target to support FP and LSX, as
discussed in #110211. Thereby, it allows 128-bit vector to be enabled by
default in the loongarch64 backend.
In https://reviews.llvm.org/D147608 we added custom lowering for
integers, but inadvertently also marked it as custom for scalable FP
vectors despite not handling it.
This adds handling for floats and marks it as custom lowered for
fixed-length FP vectors too.
Note that this doesn't handle bf16 or f16 vectors that would need
promotion, but these scalar_to_vector nodes seem to be emitted when
expanding them.