By default, `isShuffleMaskLegal` always returns true, which can result
in the expansion of `BUILD_VECTOR` into a `VECTOR_SHUFFLE` node in
certain situations. Subsequently, the `VECTOR_SHUFFLE` node is expanded
again into a `BUILD_VECTOR`, leading to an infinite loop.
To address this, we always return false, allowing the expansion of
`BUILD_VECTOR` through the stack.
So that when mixing small and large text, large text stays out of the
way of the rest of the binary.
This is useful for mixing precompiled small code model object files and
built-from-source large code model binaries so that the the text
sections don't get merged.
The reland fixes an issue where a function in the large code model would reference small data without GOTOFF.
This was incorrectly reverted in 76f78ecc789d58baa3a88b2fe2a57428f07e5362.
When we'd originally added unaligned-scalar-mem and
unaligned-vector-mem, they were separated into two parts under the
theory that some processor might implement one, but not the other. At
the moment, we don't have evidence of such a processor. The C/C++ level
interface, and the clang driver command lines have settled on a single
unaligned flag which indicates both scalar and vector support unaligned.
Given that, let's remove the test matrix complexity for a set of
configurations which don't appear useful.
Given these are internal feature names, I don't think we need to provide
any forward compatibility. Anyone disagree?
Note: The immediate trigger for this patch was finding another case
where the unaligned-vector-mem wasn't being properly serialized to IR
from clang which resulted in problems reproducing assembly from clang's
-emit-llvm feature. Instead of fixing this, I decided getting rid of the
complexity was the better approach.
This reverts commit 4bf8a688956a759b7b6b8d94f42d25c13c7af130.
This commit seems to be breaking the semantics of the
ObjectFile::isSectionText method, which breaks numba/llvmlite bindings.
When we lower calls, the sequence of argument copy-to-reg nodes are
glued to the smstart. In the InstrEmitter, these glued copies are turned
into implicit defs, since the actual call instruction uses those
physregs, resulting in the register allocator adding unnecessary copies
of regs that are preserved anyway.
Split out the parts of aix-cc-abi.ll that requires to be regenerated by
utils/update_mir_test_checks.py into aix-cc-abi-mir.ll, and regenerate
it using the script. Regenerate aix-cc-abi.ll using
utils/update_llc_test_checks.py.
Lower 16xi8 vector stores in NVPTX ISel efficiently using
st.v4.b32 instead of multiple st.v4.u8 along the lines of vector loads
and 8xf16. Similarly, 8xi8 using st.v2.u32.
This patch make riscv-split-regalloc as true by default.
It will not affect the codegen result if it vector register allocation
doesn't exist. If there is the vector register allocation, it may affect
the non-rvv register LiveInterval's segment/weight. It will make the
allocation in a different order.
https://github.com/llvm/llvm-project/issues/70703 pointed out that
cloning LLVM modules could lead to miscompiles when using FatLTO.
This is due to an existing issue when cloning modules with labels (see
#55991 and #47769). Since this can lead to miscompilation, we can avoid
cloning the LLVM modules, which was desirable anyway.
This patch modifies the EmbedBitcodePass to no longer clone the module
or run an input pipeline over it. Further, it make FatLTO always perform
UnifiedLTO, so we can still defer the Thin/Full LTO decision to
link-time. Lastly, it removes dead/obsolete code related to now defunct
options that do not work with the EmbedBitcodePass implementation any
longer.
So that when mixing small and large text, large text stays out of the
way of the rest of the binary.
This is useful for mixing precompiled small code model object files and
built-from-source large code model binaries so that the the text
sections don't get merged.
The reland fixes an issue where a function in the large code model would reference small data without GOTOFF.
These tests are currently failing and their fix is being tracked in
Issue #60133. Marking them as XFAIL for now will get the test suite to a
passing state so we can work on adding a GitHub action to automatically
run these tests on a PR bot to help keep the tree green.
Also removed the no-longer supported -opaque-pointers=0 flag from the
couple tests where it was remaining.
This commit adds a new BPF specific structure attribte
`__attribute__((preserve_static_offset))` and a pass to deal with it.
This attribute may be attached to a struct or union declaration, where
it notifies the compiler that this structure is a "context" structure.
The following limitations apply to context structures:
- runtime environment might patch access to the fields of this type by
updating the field offset;
BPF verifier limits access patterns allowed for certain data
types. E.g. `struct __sk_buff` and `struct bpf_sock_ops`. For these
types only `LD/ST <reg> <static-offset>` memory loads and stores are
allowed.
This is so because offsets of the fields of these structures do not
match real offsets in the running kernel. During BPF program
load/verification loads and stores to the fields of these types are
rewritten so that offsets match real offsets. For this rewrite to
happen static offsets have to be encoded in the instructions.
See `kernel/bpf/verifier.c:convert_ctx_access` function in the Linux
kernel source tree for details.
- runtime environment might disallow access to the field of the type
through modified pointers.
During BPF program verification a tag `PTR_TO_CTX` is tracked for
register values. In case if register with such tag is modified BPF
programs are not allowed to read or write memory using register. See
kernel/bpf/verifier.c:check_mem_access function in the Linux kernel
source tree for details.
Access to the structure fields is translated to IR as a sequence:
- `(load (getelementptr %ptr %offset))` or
- `(store (getelementptr %ptr %offset))`
During instruction selection phase such sequences are translated as a
single load instruction with embedded offset, e.g. `LDW %ptr, %offset`,
which matches access pattern necessary for the restricted
set of types described above (when `%offset` is static).
Multiple optimizer passes might separate these instructions, this
includes:
- SimplifyCFGPass (sinking)
- InstCombine (sinking)
- GVN (hoisting)
The `preserve_static_offset` attribute marks structures for which the
following transformations happen:
- at the early IR processing stage:
- `(load (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.load`;
- `(store (getelementptr ...))` replaced by call to intrinsic
`llvm.bpf.getelementptr.and.store`;
- at the late IR processing stage this modification is undone.
Such handling prevents various optimizer passes from generating
sequences of instructions that would be rejected by BPF verifier.
The __attribute__((preserve_static_offset)) has a priority over
__attribute__((preserve_access_index)). When preserve_access_index
attribute is present preserve access index transformations are not
applied.
This addresses the issue reported by the following thread:
https://lore.kernel.org/bpf/CAA-VZPmxh8o8EBcJ=m-DH4ytcxDFmo0JKsm1p1gf40kS0CE3NQ@mail.gmail.com/T/#m4b9ce2ce73b34f34172328f975235fc6f19841b6
Differential Revision: https://reviews.llvm.org/D133361
This adds code to AArch64 function prologues to protect against stack
clash attacks by probing (writing to) the stack at regular enough
intervals to ensure that the guard page cannot be skipped over.
The patch depends on and maintains the following invariants:
Upon function entry the caller guarantees that it has probed the stack
(e.g. performed a store) at some address [sp, #N], where`0 <= N <=
1024`. This invariant comes from a requirement for compatibility with
GCC. Any address range in the allocated stack, no smaller than
stack-probe-size bytes contains at least one probe At any time the stack
pointer is above or in the guard page Probes are performed in
descreasing address order
The stack-probe-size is a function attribute that can be set by a
platform to correspond to the guard page size.
By default, the stack probe size is 4KiB, which is a safe default as
this is the smallest possible page size for AArch64. Linux uses a 64KiB
guard for AArch64, so this can be overridden by the stack-probe-size
function attribute.
For small frames without a frame pointer (<= 240 bytes), no probes are
needed.
For larger frame sizes, LLVM always stores x29 to the stack. This serves
as an implicit stack probe. Thus, while allocating stack objects the
compiler assumes that the stack has been probed at [sp].
There are multiple probing sequences that can be emitted, depending on
the size of the stack allocation:
A straight-line sequence of subtracts and stores, used when the
allocation size is smaller than 5 guard pages. A loop allocating and
probing one page size per iteration, plus at most a single probe to deal
with the remainder, used when the allocation size is larger but still
known at compile time. A loop which moves the SP down to the target
value held in a register (or a loop, moving a scratch register to the
target value help in SP), used when the allocation size is not known at
compile-time, such as when allocating space for SVE values, or when
over-aligning the stack. This is emitted in AArch64InstrInfo because it
will also be used for dynamic allocas in a future patch. A single probe
where the amount of stack adjustment is unknown, but is known to be less
than or equal to a page size.
---------
Co-authored-by: Oliver Stannard <oliver.stannard@linaro.org>
The base change here is to change getMemOperandWithOffsetWidth to return
a TypeSize Width, which in turn allows areMemAccessesTriviallyDisjoint
to reason about trivially disjoint widths.
…[UN]MERGE_VALUES
When MERGE or UNMERGE s64 on a subtarget that is non-64bit, it must have
the D extension and use FPR in order to be legal.
All other instances of MERGE and UNMERGE that can be made legal should
be narrowed, widend, or replaced by the combiner.
This patch contains two related combines:
1) If we have an scalar vector insert into the result of a
concat_vector,
sink the insert into the operand of the concat.
2) If we have a insert of a scalar binop into a vector binop of the
same opcode and the RHS of both are constant, perform the insert
and then the binop.
The common theme to both is pushing inserts closer to the sources of the
computation graph. The goal is to enable forming vector bin ops from
inserts of scalar binops at the end of another vector.
For RISCV specifically, the concat_vector transform will push inserts to
smaller vectors. This will have the effect of reducing lmul for the
vslides, and usually doesn't require an additional vsetvli since
the source vectors are already working in the narrower VL. I tried
that one as a target independent combine first, and it doesn't appear
profitable on all targets.
This is only one approach to the problem. Another idea would be to
aggressively form build_vectors and subvector inserts from the
individual scalar inserts, and then have a transform which sunk a
subvector_insert down through the concat. The advantage of the alternate
approach is that we expose parallelism in the insert sequence, even if
the source vector isn't a concat_vector. If reviewers are okay with it,
I'd like to start with this approach, and then explore that direction in
a follow up patch.
Add a test to document an existing problem in GCNRegPressure tracker.
The upward tracker does not count the registers used (16 of them) in
movrel instruction (for example V_INDIRECT_REG_WRITE_MOVREL_B32_V16).
The downward tracker counts the registers but reports a mismatch:
%0:L0000000000000C00 isn't found in LIS reported set
Fixes#73787
Not doing so lead to us making use of a register after the call, which
has been clobbered by the call.
Added an MIR test that runs only the pseudo expansion pass.
We were allowing extra immediate arguments, and only bothering to check
if registers were implicit or not.
Also consolidate extra operand checks in verifier, to make this
testable. We had 3 different places checking if you were trying to build
an instruction with more operands than allowed by the definition. We had
an assertion in addOperand, a direct check in the MIRParser to avoid the
assertion, and the machine verifier checks. Remove the assert and parser
check so the verifier can provide a consistent verification experience,
which will also handle instructions modified in place.
NOTE: I'm not sure how many of the corner cases are part of the
documented ABI but that shouldn't matter because my goal is for
`-msve-vector-bits` to have no affect on the way arguments and returns
are processed.