This fixes a crash when lowering an extract_subvector like:
t0:v1i64 = extract_subvector t1:v2i64, 1
Whilst we never need a vslidedown with M1 on scalable vector types, we might
need to do it for v1i64/v1f64, since the smallest container type for it is
nxv1i64/nxv1f64.
The lowering code is still correct for this case, but the assertion was too
strict. The actual invariant we're relying on is that ContainerSubVecVT's LMUL
<= M1, not < M1. This is why v2i32 was already handled fine: its container type
is nxv1i32, which is MF2.
llvm.dbg.assign intrinsics have 2 {value, expression} pairs; fix hwasan to
update the second expression.
Fixes #76545. This is #78606 rebased and with the addition of DPValue handling.
Note the addition of --try-experimental-debuginfo-iterators in the tests and
some shuffling of code in MemoryTaggingSupport.cpp.
If we have something like G_TRUNC from v2s32 to v2s16, then lowering
this to a concat of two G_TRUNC s32 to s16 followed by G_TRUNC from
v2s16 to v2s8 does not bring us any closer to legality. In fact, the
first part of that is a G_BUILD_VECTOR whose legalization will produce a
new G_TRUNC from v2s32 to v2s16, and both G_TRUNCs will then get
combined to the original, causing a legalization cycle.
Make the lowering condition more precise by requiring that the original
vector is >128 bits, which I believe is the only case where this
specific splitting approach is useful.
Note that this doesn't actually produce a legal result (the alwaysLegal
is a lie, as before), but it will cause a proper globalisel abort
instead of an infinite legalization loop.
Fixes https://github.com/llvm/llvm-project/issues/81244.
This is a revival of #65392. When we lower an extract_subvector, we extract
the subregister that the subvector is contained in first and then do a
vslidedown with LMUL=1. We can currently only do this for scalable vectors
though because the index is scaled by vscale and thus we will know what
subregister the subvector lies in.
For fixed length vectors, the index isn't scaled by vscale and so the
subvector could lie in any arbitrary subregister, so we have to do a
vslidedown with the full LMUL.
The exception to this is when we know the exact VLEN: in which case, we can
still work out the exact subregister and do the LMUL=1 vslidedown on it.
This patch handles this case by scaling the index by 1/vscale before
computing the subregister, and extending the LMUL=1 path to handle fixed
length vectors.
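The index arithmetic for the known-VLEN case can be sketched roughly as
follows. This is a standalone illustration, not the code in the patch: the
`SubRegSlot` struct and `locateSubvector` function are hypothetical names, and
the sketch assumes VLEN is known exactly and is a multiple of the element
width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: which LMUL=1 register of the register group
// holds the start of the subvector, and the element offset left over for
// the LMUL=1 vslidedown.
struct SubRegSlot {
  uint64_t RegIdx;     // index of the LMUL=1 register within the group
  uint64_t ElemOffset; // remaining slide amount inside that register
};

// Assumption: VLenBits is the exact VLEN and a multiple of ElemBits.
inline SubRegSlot locateSubvector(uint64_t ElemIdx, uint64_t ElemBits,
                                  uint64_t VLenBits) {
  assert(ElemBits != 0 && VLenBits % ElemBits == 0);
  uint64_t ElemsPerReg = VLenBits / ElemBits; // elements in one register
  return {ElemIdx / ElemsPerReg, ElemIdx % ElemsPerReg};
}
```

For example, with VLEN=128 and 64-bit elements each register holds two
elements, so element index 6 starts at register 3 with no slide left over.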
Similar to d39b4ce3ce8a3c256e01bdec2b140777a332a633
Using "eabi" or "gnueabi" for aarch64 targets is a common mistake that the
Clang driver warns about. We want to avoid them elsewhere as well. Just
use the common "aarch64" without other triple components.
The replacement should've had BFE() as the arguments for the comparison,
not the source register.
While at it, tighten the patterns a bit, and expand them to cover
variants with immediate arguments. Also change the default lowering of
bfe() to use the unsigned variant, so the value of the upper bits is
predictable.
This PR provides an implementation of the basic codegen infrastructure, such as
TargetFrameLowering, MCInstLower, AsmPrinter, RegisterInfo, InstructionInfo,
TargetLowering, and SelectionDAGISel.
Migrated from https://reviews.llvm.org/D145658
Hint further tail call optimization opportunities when the examined
returned value is the return value of a known intrinsic or library
function, and it appears as the first function argument.
Fixes: https://github.com/llvm/llvm-project/issues/75455.
Summary:
This patch simply states that `__builtin_readcyclecounter` is legal on
NVPTX and makes it return the value from the `clock64` sreg. The timer
intrinsics are marked as having side effects, which is desirable for
timing primitives and required to pattern match the intrinsic DAG.
These generic targets include multiple GPUs and will, in the future,
provide a way to build once and run on multiple GPUs, at the cost of fewer
optimization opportunities.
Note that this is just doing the compiler side of things, device libs and
runtimes/loader/etc. don't know about these targets yet, so none of them
actually work in practice right now. This is just the initial commit to
make LLVM aware of them.
This contains the documentation changes for both this change and #76954
as well.
This optimization tries to optimize bitcasts from `<N x i1>` to iN, but
currently also triggers for `<N x i1>` to `<M x iK>` bitcasts, if custom
lowering has been requested for these for an unrelated reason. Fix this
by explicitly checking that the result type is scalar.
Fixes https://github.com/llvm/llvm-project/issues/81216.
Expand64BitShift was always dropping to generic shift legalization if the shift amount type was larger than i64, even if the constant shift amount was actually very small. I've adjusted the constant bounds checks to work with APInt types so we can always perform the comparison.
This results in the MVE long shift instructions being used more often, and it looks like this is preventing some additional combines from happening. This could be addressed in the future.
This came about while I was trying to extend the DAGTypeLegalizer::ExpandShift* helpers and needed to move to consistently using the legal shift amount types instead of reusing the shift amount type from the original wider shift.
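The idea of comparing a wide constant against a bound at full width can be
sketched as below. This is only an illustration: `Wide128` and `isBelow` are
hypothetical stand-ins for an arbitrary-width constant and its unsigned
comparison, not the APInt interface or the code in Expand64BitShift.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a shift-amount constant whose type is wider
// than 64 bits: Hi holds the upper 64 bits, Lo the lower 64 bits.
struct Wide128 {
  uint64_t Hi;
  uint64_t Lo;
};

// Full-width unsigned comparison: true iff V < Bound. The width of the
// type does not matter, only the value, so a small constant stored in a
// wide type still passes the bounds check.
inline bool isBelow(const Wide128 &V, uint64_t Bound) {
  return V.Hi == 0 && V.Lo < Bound;
}
```

A constant shift amount of 3 held in a 128-bit type passes a bound of 64,
while a value with any of the upper 64 bits set does not.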
This adds support for marking arbitrary general purpose registers -
except for those with special purpose (G0, I6-I7, O6-O7) - as reserved,
as needed by some software like the Linux kernel.
- Enable equivalence between `brcond` and `G_BRCOND`.
- Remove the manual selection of `G_BRCOND` in Mips. Revise test cases.
Reviewers: petar-avramovic, bcardosolopes, arsenm
Reviewed By: arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/81306
This patch sorts stack objects by their alignment value from the largest
to the smallest. If two objects have the same alignment, then they are
sorted by their size from the largest to the smallest. This minimizes
padding and reduces run time stack size.
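The ordering and its effect on padding can be sketched as follows. This is a
simplified standalone model, not the code in the patch: `StackObject`,
`sortStackObjects` and `frameSize` are hypothetical names, and the layout
model assumes objects are placed one after another at increasing offsets
from an aligned base.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a stack object: size and alignment in bytes.
struct StackObject {
  uint64_t Size;
  uint64_t Align;
};

// Sort by alignment, largest first; break ties by size, largest first.
inline void sortStackObjects(std::vector<StackObject> &Objs) {
  std::stable_sort(Objs.begin(), Objs.end(),
                   [](const StackObject &A, const StackObject &B) {
                     if (A.Align != B.Align)
                       return A.Align > B.Align;
                     return A.Size > B.Size;
                   });
}

// Total bytes used when objects are laid out in the given order, each
// rounded up to its own alignment (simplified layout model).
inline uint64_t frameSize(const std::vector<StackObject> &Objs) {
  uint64_t Offset = 0;
  for (const StackObject &O : Objs) {
    assert(O.Align != 0);
    Offset = (Offset + O.Align - 1) / O.Align * O.Align; // add padding
    Offset += O.Size;
  }
  return Offset;
}
```

Under this model, four objects of size and alignment 1, 8, 2 and 4 occupy
24 bytes in that order but only 15 bytes after sorting.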
This patch enables hardware shadow stack with `Zicfiss` and
`mno-forced-sw-shadow-stack`. New feature forced-sw-shadow-stack
disables hardware shadow stack even when `Zicfiss` is enabled.
We apply custom lowering to 64 bit constants where we use the same logic
as in non-global isel: if materializing in registers is too expensive,
we emit a load from the constant pool. Later, during instruction selection,
the constant pool address is generated using `selectAddr`.
The Ampere1B is Ampere's third-generation core implementing a
superscalar, out-of-order microarchitecture with nested virtualization,
speculative side-channel mitigation and architectural support for
defense against ROP/JOP style software attacks.
Ampere1B is an ARMv8.7+ implementation, adding support for the FEAT
WFxT, FEAT CSSC, FEAT PAN3 and FEAT AFP extensions. It also includes all
features of the second-generation Ampere1A, such as the Memory Tagging
Extension and SM3/SM4 cryptography instructions.
Summary:
Some recent support made usage of `__nvvm_reflect` more consistent. We
should expose it as a builtin rather than forcing users to externally
define the function.
Summary:
This test requires at least sm_30 to run, but that is still below the
minimum supported version of sm_52 currently. Just set this to sm_60 so
the tests pass in the future.
Summary:
The previous patch did very simple folding that only worked for directly
used branches. This patch improves this by traversing the use-def chain
to simplify every constant subexpression until it reaches a terminator
we can delete. The support should work for all expected cases now.
32-bit targets perform i64 CTPOP as a v2i64 CTPOP - if we can perform this as an i32 CTPOP by shifting the source bits, then do so to avoid the gpr<->xmm
This also triggers on non-SSE2 capable targets, as can be seen with the minor codegen diffs in ctpop_shifted_mask16
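The underlying observation can be sketched as below. Note the assumption:
the compiler would make this decision at compile time from what is known
about the source bits, whereas this standalone sketch inspects a concrete
runtime value purely for illustration. `popcount32` and `popcount64Via32`
are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Portable 32-bit population count (clears the lowest set bit each step).
inline unsigned popcount32(uint32_t V) {
  unsigned N = 0;
  for (; V != 0; V &= V - 1)
    ++N;
  return N;
}

// If every set bit of X lies inside one 32-bit window, shift that window
// down to bit 0 and count with the 32-bit popcount. Returns false when no
// such window exists, in which case Count is left untouched.
inline bool popcount64Via32(uint64_t X, unsigned &Count) {
  if (X == 0) {
    Count = 0;
    return true;
  }
  unsigned Shift = 0;
  while (((X >> Shift) & 1) == 0) // skip trailing zero bits
    ++Shift;
  uint64_t Shifted = X >> Shift;
  if (Shifted > 0xFFFFFFFFull) // set bits span more than 32 bits
    return false;
  Count = popcount32(static_cast<uint32_t>(Shifted));
  return true;
}
```

A value such as 0x0000FF0000000000 fits a 32-bit window after shifting, so
its eight set bits can be counted without a 64-bit popcount.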
Reverts llvm/llvm-project#79586
This broke the AMDGPU OpenMP Offload buildbot.
The typical error message was that the GPU attempted to read beyond the
largest legal address.
Error message:
AMDGPU fatal error 1: Received error in queue 0x7f8363f22000:
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to
access memory beyond the largest legal address.
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.
[1]
34c8b835b1