Introduce common infrastructure for runtimes that determines compiler
resource path locations. These variables introduced are:
* RUNTIMES_OUTPUT_RESOURCE_DIR
* RUNTIMES_INSTALL_RESOURCE_PATH
That contain the location for the compiler resource path (typically
`lib/clang/<version>`) in the build tree and the install tree (the
latter relative to CMAKE_INSTALL_PREFIX).
Additionally, define
* RUNTIMES_OUTPUT_RESOURCE_LIB_DIR
* RUNTIMES_INSTALL_RESOURCE_LIB_PATH
as for the location of clang/flang version-locked libraries (typically
`lib${LLVM_LIBDIR_SUFFIX}/<targer-triple>`, but also depends on `APPLE`
and `LLVM_ENABLE_PER_TARGET_RUNTIME_DIR`). This code is moved from
flang-rt and initially becomes its only user.
Refactored out of #171610 as requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2687382481).
Extracted `get_runtimes_target_libdir_common` from compiler-rt as
requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2689565634).
Added TODO comments to all runtimes as requested
[here](https://github.com/llvm/llvm-project/pull/171610#issuecomment-3789598635).
For MTE, we can't use the whole size or we might trigger a segfault.
Therefore, use the exact size when MTE is enabled or the exact usable
size parameter is true.
Also, optimize out the call to getUsableSize and use a simpler
calculation.
getUsableSize returns the actual capacity of the underlying block, which
may be larger than the size originally requested by the user. If the
user writes data into this extra space accessible via getUsableSize and
subsequently calls reallocate, the existing implementation only copies
the original requested number of bytes. This resulted in data loss for
any information stored beyond the requested size but within the usable
bounds.
Target specific flags (Notably `-mimplict=always` for ARM) are not
recognized by the clang assembler unless the target is specified. This
PR passes the value of `CMAKE_C_COMPILER_TARGET` to the assembler so
that target specific flags are recognized.
## Previous behaviour
When configuring builtins for an ARMv7 target:
```
-- Builtin supported architectures: armv7
-- Checking for assembler flag -mimplicit-it=always
-- Checking for assembler flag -mimplicit-it=always - Not accepted
-- Checking for assembler flag -Wa,-mimplicit-it=always
-- Checking for assembler flag -Wa,-mimplicit-it=always - Not accepted
CMake Warning at CMakeLists.txt:462 (message):
Don't know how to set the -mimplicit-it=always flag in this assembler; not
including Arm optimized implementations
```
## New behaviour
```
-- Builtin supported architectures: armv7
-- Checking for assembler flag -mimplicit-it=always
-- Checking for assembler flag -mimplicit-it=always - Accepted
```
## Steps to reproduce:
```
$ cmake -S compiler-rt/lib/builtins -B build \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_C_COMPILER_TARGET=armv7-unknown-linux-gnueabihf \
-DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON
```
Historically, alignment and size weren't taken into account when freeing
allocations since `free` just takes a pointer. With `free_sized` and
`free_aligned_sized`, we can do these size and alignment checks in asan
now. This adds a new report type specifically for these functions.
Checking is hidden behind a new env flag `free_size_mismatch` which is
enabled by default, but downstream users can opt out of it.
The bulk of this PR was generated by gemini but thoroughly reviewed and
edited by me to the best of my ability.
The one new assembly source file, `arm/adddf3.S`, implements both
addition and subtraction via cross-branching after flipping signs, since
both operations must provide substantially the same logic. The new cmake
properties introduced in a prior commit are used to arrange that
including `adddf3.S` supersedes the C versions of both addition and
subtraction, and also informs the test suite that both functions are
available to test.
The Hexagon XRay sled was 5 words (20 bytes) and the patched sequence
clobbered r31 (the link register) via callr without saving it first.
When the trampoline returned, the instrumented function's own allocframe
would then save the wrong return address, causing a crash or misrouted
return.
Expand the sled to 7 words (28 bytes) and wrap the call with
allocframe(#0)/deallocframe so the caller's r31:30 are preserved across
the trampoline call.
Detailed fixes:
- HexagonAsmPrinter: emit 6 nop words after the jump (7 words total)
- xray_hexagon.cpp: patch allocframe(#0) as first word, immext+r7 (func
ID), immext+r6 (trampoline), callr r6, deallocframe; write the first
word last for atomicity
- xray_trampoline_hexagon.S: complete rewrite -- properly load and
dereference the global handler pointer, save/restore r0-r5 and r31, add
stack frame with correct 8-byte alignment, add jumpr r31 to actually
return from trampolines
- xray_interface.cpp: update Hexagon cSledLength from 20 to 28
- Update lit tests for 6-nop sled
Apparently on macOS there's a system header file also called
arm/endian.h, and another system header #includes it with "" rather than
<>, so that this compiler-rt header accidentally shadows it. Worked
around by prefixing "crt" to the name.
No changes are needed except the rename, because the planned functions
that use this header are still under review.
The shmat interceptor calls REAL(shmctl), but shmctl is not intercepted
on all targets (e.g. 32-bit Linux with musl). Guard shmat behind
SANITIZER_INTERCEPT_SHMCTL and use a MSAN_MAYBE_INTERCEPT pattern
consistent with other conditional interceptors.
Add the runtime infrastructure for MemorySanitizer on Hexagon Linux.
Hexagon is 32-bit, so the shadow memory layout uses a compact XOR-based
mapping that fits within the lower 3GB of address space:
0x00000000 - 0x10000000 APP-1 (256MB, program text/data/heap)
0x10000000 - 0x20000000 ALLOCATOR (256MB)
0x20000000 - 0x40000000 SHADOW-1 (512MB, covers APP-1 + ALLOCATOR)
0x40000000 - 0x50000000 APP-2 (256MB, shared libs + stack)
0x60000000 - 0x70000000 SHADOW-2 (256MB, covers APP-2)
0x70000000 - 0x90000000 ORIGIN-1 (512MB)
0xB0000000 - 0xC0000000 ORIGIN-2 (256MB)
MEM_TO_SHADOW uses XOR 0x20000000, and SHADOW_TO_ORIGIN adds 0x50000000.
The dual-APP layout accommodates QEMU user-mode, which places shared
libraries and the stack at 0x40000000.
The allocator uses SizeClassAllocator32 with a 256MB region at
0x10000000, and kMaxAllowedMallocSize is set to 1GB consistent with
other 32-bit targets.
Use start + (end - start) / 2 instead of (start + end) / 2 to compute
the midpoint address. The original expression overflows when start + end
exceeds UPTR_MAX, which happens on 32-bit targets whose memory layout
includes regions above 0x80000000.
On musl, rlimit64 is an alias for rlimit rather than a distinct type
provided by glibc. Add a SANITIZER_MUSL elif branch so that
struct_rlimit64_sz is defined for musl-based Linux targets.
Summary:
People need to be able to build this without a CUDA installation.
Long term we should bump up the minimum version as I'm pretty sure every
architecture before this has been deprecated by NVIDIA.
This is failing in some configurations on AArch64 Linux. Given there are
a lot of follow-up commits that makes this hard to revert, just disable
it for now pending future investigation.
Summary:
Currently, building through runtimes will yield this warning:
```
CMake Warning at compiler-rt/cmake/Modules/CompilerRTUtils.cmake:335 (message):
LLVMTestingSupport not found in LLVM_AVAILABLE_LIBS
Call Stack (most recent call first)
```
This is due to the fact that the builtins target does not go through the
s tandard runtimes patch and sets them as BUILDTREE_ONLY so they do not
show up. These are not used in this case, so just guard the condition to
suppress the warning.
On musl-based systems the dynamic linker does not process
DT_PREINIT_ARRAY, so the .preinit_array entry alone never calls
__xray_init(). Without initialization, the global XRay Flags struct is
zero-initialized and flags()->xray_mode is NULL. When the basic-mode or
FDR-mode static initializers run from .init_array and call
internal_strcmp(flags()->xray_mode, ...), they dereference NULL and
crash.
Fix this by always registering a constructor(0) in addition to the
.preinit_array entry. On glibc where .preinit_array works, __xray_init()
will have already run and the constructor returns immediately (the
function is idempotent). On musl, the constructor ensures __xray_init()
runs before other .init_array entries that depend on XRay flags being
initialized.
As noted in https://github.com/llvm/llvm-project/pull/188406 comments,
the documentation recommends guarding only with
__has_feature(address_sanitizer). This patch updates the test to follow
the same pattern by removing the
__SANITIZER_DISABLE_CONTAINER_OVERFLOW__ checks. Having this macro
defined results in the common_interface_defs.h header defining the
contiguous container functions as no-ops anyway.
This is a followup to https://github.com/llvm/llvm-project/pull/188406.
This allows static asserts to be set in tracing code that might use the
ReleaseToOS values as indexes.
This would have caused a compile failure instead of a runtime crash when
I added the use of a new ReleaseToOS value.
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.
This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so we need
little changes. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.
Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.
NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same
TL;DR, we now do this:
```c
struct { start1, stop1, start2, stop2, start3, stop3, version; } device;
struct host = DtoH(lookup("device"));
counters = DtoH(host.stop - host.start)
version = DtoH(host.version);
```
Currently all of these functions are empty bodies; this means that when
including this header and compiling with
__SANITIZER_DISABLE_CONTAINER_OVERFLOW__ defined, warnings are emitted
about the missing return values for those functions that do return
values.
This patch returns success values for all those check functions with
non-void return types. Note: these were originally added in
https://github.com/llvm/llvm-project/pull/163468.
This commit adds C helper functions `dnan2`, `dnorm2` and `dunder` for
handling the less critical edge cases of double-precision arithmetic,
similar to `fnan2`, `fnorm2` and `funder` that were added in commit
f7e652127772e93.
It also adds a header file that defines some register aliases for
handling double-precision numbers in AArch32 software floating point in
an endianness-independent way, by providing aliases `xh` and `xl` for
the high and low words of the first double-precision function argument,
regardless of which of them is in r0 and which in r1, and similarly `yh`
and `yl` for the second argument in r2/r3.
Tests for Android specific behavior don't really belong here since it is
affected by the config which is not necessarily the same on Android.
There are already tests that the config options and flag options work
properly. Android wrapper tests belong to Android.
#171941 got the builtins tests running under LLVM_ENABLE_RUNTIMES by
testing the builtins as part of the runtimes build.
As a consequence, CMake in `lib/builtins/` is no longer visible when
configuring the tests (but `test/builtins/` is). This means that the
`cmake_dependent_option` from `lib/builtins/` is not accounted for by
the tests, allowing COMPILER_RT_BUILD_CRT to be YES when
COMPILER_RT_HAS_CRT is NO. As a consequence, the CRT tests are running
on platforms where COMPILER_RT_HAS_CRT is false (#176892).
367da15a11/compiler-rt/lib/builtins/CMakeLists.txt (L1106-L1108)
Although the long-term solution could be to split both the builtins (and
their tests) out of compiler-rt into a top-level directory with shared
options, this works around the issue for the moment by checking both
COMPILER_RT_HAS_CRT and COMPILER_RT_BUILD_CRT before enabling the "crt"
feature.
Fixes#176892
Summary:
This PR enables the basic unit tests for builtins to be run on the GPU
architectures. Other targets like profiling are supported, but the
host-device natures will make it more difficult to adequately unit
test. It may be be possible to do basic tests there, to simply verify
that
counters are present and in the proper format for when they are copied
to the host.
Now that the corresponding libcxx change has landed, these tests should
be passing on some platforms.
This patch re-enables them for all platforms, so that we can see which
bots these do not work on and mark them unsupported accordingly.
rdar://167946476
In the builtins library, most functions have a portable C implementation
(e.g. `mulsf3.c`), and platforms might provide an optimized assembler
implementation (e.g. `arm/mulsf3.S`). The cmake script automatically
excludes the C source file corresponding to each assembly source file it
includes. Additionally, each source file name is automatically
translated into a flag that lit tests can query, with a name like
`librt_has_mulsf3`, to indicate that a function is available to be
tested.
In future commits I plan to introduce cases where a single .S file
provides more than one function (so that they can share code easily),
and therefore, must supersede more than one existing source file.
I've introduced the `crt_supersedes` cmake property, which you can set
on a .S file to name a list of .c files that it should supersede. Also,
the `crt_provides` property can be set on any source file to indicate a
list of functions it makes available for testing, in addition to the one
implied by its name.
Add one new flag, dealloc_align_mismatch that turns on/off alignment
checks. Add three new config parameters, one for deallocate type
mismatch (such as abort on new/free if true), one for checking if the
size parameter matches on dealloc and one for checking if the alignment
is correct on a dealloc.
Add extra flags to be passed for to indicate to do an align/size check.
Update report functions to better indicate the errors. Add unit tests
for all of these.
This is based on these upstream cls by jcking:
https://github.com/llvm/llvm-project/pull/147735https://github.com/llvm/llvm-project/pull/146556
Align beg address down instead of up in __asan_region_is_poisoned(), so
the shadow scan includes the first granule. This fixes a false negative
when first granule has an unpoisoned prefix and poisoned suffix.
Add test that covers this scenario.
Summary:
The changes in https://www.github.com/llvm/llvm-project/pull/185552
allowed us to
start building the standard `libclang_rt.profile.a` for GPU targets.
This PR expands this by adding an optimized GPU routine for counter
increment and removing the special-case handling of these functions in
the OpenMP runtime.
Vast majority of these functions are boilerplate, but we should be able
to do more interesting things with this in the future, like value or
memory profiling.
As per PEP-0394[1], there is no real concensus over what binary names
Python has, specifically 'python' could be Python 3, Python 2, or not
exist.
However, everyone has a python3 interpreter and the scripts are all
written for Python 3. Unify the shebangs so that the ~50% of shebangs
that use python now use python3.
[1] https://peps.python.org/pep-0394/