For MTE, we can't use the whole size or we might trigger a segfault.
Therefore, use the exact size when MTE is enabled or the exact usable
size parameter is true.
Also, optimize out the call to getUsableSize and use a simpler
calculation.
getUsableSize returns the actual capacity of the underlying block, which
may be larger than the size originally requested by the user. If the
user writes data into this extra space accessible via getUsableSize and
subsequently calls reallocate, the existing implementation only copies
the originally requested number of bytes. This results in data loss for
any information stored beyond the requested size but within the usable
bounds.
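The copy-length rule can be sketched with a toy allocator. Everything below (`toy_header`, `toy_alloc`, `toy_usable_size`, `toy_realloc`, `toy_free`, the 16-byte rounding) is hypothetical and only illustrates the idea of copying up to the old usable size rather than the old requested size; it is not the allocator's actual code, and it ignores size overflow and strict alignment.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical block header: remembers what the caller asked for and how
 * much room the block really has. */
typedef struct {
  size_t requested; /* bytes the caller asked for */
  size_t usable;    /* bytes the caller may touch, always >= requested */
} toy_header;

/* Assumed size-class rule for this sketch: round up to 16 bytes. */
static size_t toy_round_up(size_t n) { return (n + 15u) & ~(size_t)15u; }

void *toy_alloc(size_t size) {
  size_t usable = toy_round_up(size ? size : 1);
  toy_header *h = malloc(sizeof(toy_header) + usable);
  if (!h) return NULL;
  h->requested = size;
  h->usable = usable;
  return h + 1; /* user memory starts right after the header */
}

size_t toy_usable_size(const void *p) {
  return ((const toy_header *)p - 1)->usable;
}

void toy_free(void *p) {
  if (p) free((toy_header *)p - 1);
}

void *toy_realloc(void *old, size_t new_size) {
  if (!old) return toy_alloc(new_size);
  void *fresh = toy_alloc(new_size);
  if (!fresh) return NULL;
  /* Copy min(old usable size, new size), not min(old requested, new size),
   * so bytes written into the slack space survive the move. */
  size_t old_usable = toy_usable_size(old);
  size_t n = old_usable < new_size ? old_usable : new_size;
  memcpy(fresh, old, n);
  toy_free(old);
  return fresh;
}
```

In this sketch a 5-byte request has 16 usable bytes, and all 16 are carried over when the block grows to 32 bytes.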
Historically, alignment and size weren't taken into account when freeing
allocations since `free` just takes a pointer. With `free_sized` and
`free_aligned_sized`, we can do these size and alignment checks in asan
now. This adds a new report type specifically for these functions.
Checking is hidden behind a new env flag `free_size_mismatch` which is
enabled by default, but downstream users can opt out of it.
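As a rough illustration of the kind of check this enables, the sketch below compares the arguments of a sized free against what was recorded at allocation time. The record layout, the result enum and `check_sized_free` are hypothetical names invented for this example; asan's real bookkeeping and report path differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-allocation record; a real sanitizer keeps this in its
 * own chunk metadata. */
typedef struct {
  size_t size;      /* size passed to the allocation function */
  size_t alignment; /* alignment passed to aligned_alloc, 0 if none */
} alloc_record;

typedef enum {
  FREE_CHECK_OK,
  FREE_CHECK_SIZE_MISMATCH,
  FREE_CHECK_ALIGN_MISMATCH
} free_check_result;

/* Pass has_alignment = false to model free_sized(), which carries no
 * alignment argument; pass true to model free_aligned_sized().
 * checks_enabled stands in for the runtime flag that lets users opt out. */
free_check_result check_sized_free(const alloc_record *rec, size_t size,
                                   bool has_alignment, size_t alignment,
                                   bool checks_enabled) {
  if (!checks_enabled) return FREE_CHECK_OK;
  if (size != rec->size) return FREE_CHECK_SIZE_MISMATCH;
  if (has_alignment && alignment != rec->alignment)
    return FREE_CHECK_ALIGN_MISMATCH;
  return FREE_CHECK_OK;
}
```

With a record of size 64 and alignment 16, a free that passes size 32 yields `FREE_CHECK_SIZE_MISMATCH` unless the check is disabled.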
The bulk of this PR was generated by Gemini but thoroughly reviewed and
edited by me to the best of my ability.
The one new assembly source file, `arm/adddf3.S`, implements both
addition and subtraction via cross-branching after flipping signs, since
both operations must provide substantially the same logic. The new cmake
properties introduced in a prior commit are used to arrange that
including `adddf3.S` supersedes the C versions of both addition and
subtraction, and also informs the test suite that both functions are
available to test.
The Hexagon XRay sled was 5 words (20 bytes) and the patched sequence
clobbered r31 (the link register) via callr without saving it first.
When the trampoline returned, the instrumented function's own allocframe
would then save the wrong return address, causing a crash or misrouted
return.
Expand the sled to 7 words (28 bytes) and wrap the call with
allocframe(#0)/deallocframe so the caller's r31:30 are preserved across
the trampoline call.
Detailed fixes:
- HexagonAsmPrinter: emit 6 nop words after the jump (7 words total)
- xray_hexagon.cpp: patch allocframe(#0) as first word, immext+r7 (func
ID), immext+r6 (trampoline), callr r6, deallocframe; write the first
word last for atomicity
- xray_trampoline_hexagon.S: complete rewrite -- properly load and
dereference the global handler pointer, save/restore r0-r5 and r31, add
stack frame with correct 8-byte alignment, add jumpr r31 to actually
return from trampolines
- xray_interface.cpp: update Hexagon cSledLength from 20 to 28
- Update lit tests for 6-nop sled
Apparently on macOS there's a system header file also called
arm/endian.h, and another system header #includes it with "" rather than
<>, so that this compiler-rt header accidentally shadows it. Worked
around by prefixing "crt" to the name.
No changes are needed except the rename, because the planned functions
that use this header are still under review.
The shmat interceptor calls REAL(shmctl), but shmctl is not intercepted
on all targets (e.g. 32-bit Linux with musl). Guard shmat behind
SANITIZER_INTERCEPT_SHMCTL and use a MSAN_MAYBE_INTERCEPT pattern
consistent with other conditional interceptors.
Add the runtime infrastructure for MemorySanitizer on Hexagon Linux.
Hexagon is 32-bit, so the shadow memory layout uses a compact XOR-based
mapping that fits within the lower 3GB of address space:
0x00000000 - 0x10000000 APP-1 (256MB, program text/data/heap)
0x10000000 - 0x20000000 ALLOCATOR (256MB)
0x20000000 - 0x40000000 SHADOW-1 (512MB, covers APP-1 + ALLOCATOR)
0x40000000 - 0x50000000 APP-2 (256MB, shared libs + stack)
0x60000000 - 0x70000000 SHADOW-2 (256MB, covers APP-2)
0x70000000 - 0x90000000 ORIGIN-1 (512MB)
0xB0000000 - 0xC0000000 ORIGIN-2 (256MB)
MEM_TO_SHADOW uses XOR 0x20000000, and SHADOW_TO_ORIGIN adds 0x50000000.
The dual-APP layout accommodates QEMU user-mode, which places shared
libraries and the stack at 0x40000000.
The allocator uses SizeClassAllocator32 with a 256MB region at
0x10000000, and kMaxAllowedMallocSize is set to 1GB consistent with
other 32-bit targets.
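The address arithmetic described above can be checked with a small sketch. The function and macro names are made up for this example (the runtime expresses the mapping through its own macros), and only the two constants come from the description: XOR with 0x20000000 for shadow, plus 0x50000000 for origin.

```c
#include <stdint.h>

/* Constants taken from the layout description; names are hypothetical. */
#define HEX_SHADOW_XOR UINT32_C(0x20000000)
#define HEX_ORIGIN_ADD UINT32_C(0x50000000)

/* Application address -> shadow address. */
uint32_t hex_mem_to_shadow(uint32_t app) { return app ^ HEX_SHADOW_XOR; }

/* Shadow address -> origin address. */
uint32_t hex_shadow_to_origin(uint32_t shadow) {
  return shadow + HEX_ORIGIN_ADD;
}

/* Application address -> origin address, composing the two steps. */
uint32_t hex_mem_to_origin(uint32_t app) {
  return hex_shadow_to_origin(hex_mem_to_shadow(app));
}
```

Plugging in the region boundaries reproduces the table: APP-1 and ALLOCATOR (0x00000000 to 0x20000000) land in SHADOW-1, APP-2 (0x40000000 to 0x50000000) lands in SHADOW-2, and the two shadow regions map onto ORIGIN-1 and ORIGIN-2.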
Use start + (end - start) / 2 instead of (start + end) / 2 to compute
the midpoint address. The original expression overflows when start + end
exceeds UPTR_MAX, which happens on 32-bit targets whose memory layout
includes regions above 0x80000000.
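A minimal sketch of the difference, using `uint32_t` to stand in for a 32-bit `uptr` so the wraparound is visible on any host. The function names are illustrative only.

```c
#include <stdint.h>

/* Overflow-prone form: start + end can exceed 32 bits and wrap. */
uint32_t midpoint_naive(uint32_t start, uint32_t end) {
  return (start + end) / 2;
}

/* Safe form: end - start always fits as long as start <= end. */
uint32_t midpoint_safe(uint32_t start, uint32_t end) {
  return start + (end - start) / 2;
}
```

For start = 0x80000000 and end = 0xC0000000 the sum wraps to 0x40000000, so the naive form yields 0x20000000, which is outside the range, while the safe form yields 0xA0000000.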
On musl, rlimit64 is an alias for rlimit rather than a distinct type
provided by glibc. Add a SANITIZER_MUSL elif branch so that
struct_rlimit64_sz is defined for musl-based Linux targets.
Summary:
People need to be able to build this without a CUDA installation.
Long term we should bump up the minimum version as I'm pretty sure every
architecture before this has been deprecated by NVIDIA.
This is failing in some configurations on AArch64 Linux. Given there are
a lot of follow-up commits that make this hard to revert, just disable
it for now pending future investigation.
On musl-based systems the dynamic linker does not process
DT_PREINIT_ARRAY, so the .preinit_array entry alone never calls
__xray_init(). Without initialization, the global XRay Flags struct is
zero-initialized and flags()->xray_mode is NULL. When the basic-mode or
FDR-mode static initializers run from .init_array and call
internal_strcmp(flags()->xray_mode, ...), they dereference NULL and
crash.
Fix this by always registering a constructor(0) in addition to the
.preinit_array entry. On glibc where .preinit_array works, __xray_init()
will have already run and the constructor returns immediately (the
function is idempotent). On musl, the constructor ensures __xray_init()
runs before other .init_array entries that depend on XRay flags being
initialized.
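The pattern can be sketched as an idempotent init function that is also registered as a constructor. All names here are hypothetical, the guard is a plain flag rather than whatever synchronization the runtime actually uses, and the constructor is registered without a priority, whereas the change described above uses priority 0. The `.preinit_array` entry is omitted because it is only valid in an ELF main executable.

```c
#include <stdbool.h>
#include <stddef.h>

static bool toy_initialized; /* guard that makes toy_init() idempotent */
static int toy_init_runs;    /* counts how often the real work ran */
static const char *toy_mode; /* stays NULL until toy_init() has run */

void toy_init(void) {
  if (toy_initialized) return; /* already done, e.g. via .preinit_array */
  toy_initialized = true;
  toy_mode = "basic";
  toy_init_runs++;
}

/* Fallback for loaders that skip DT_PREINIT_ARRAY: run the same init
 * function as an ordinary constructor. If an earlier mechanism already
 * called toy_init(), this returns immediately. */
__attribute__((constructor)) static void toy_init_ctor(void) { toy_init(); }

int toy_init_run_count(void) { return toy_init_runs; }
const char *toy_current_mode(void) { return toy_mode; }
```

Calling `toy_init()` again from `main` leaves the run count at 1, and `toy_current_mode()` is non-NULL by the time `main` starts.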
This allows static asserts to be set in tracing code that might use the
ReleaseToOS values as indexes.
This would have caused a compile failure instead of a runtime crash when
I added the use of a new ReleaseToOS value.
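A sketch of the kind of compile-time check this makes possible, written in C for illustration. The enumerator names, the `RELEASE_KIND_COUNT` sentinel and the name table are assumptions made for this example, not the actual definitions.

```c
/* Hypothetical stand-in for the ReleaseToOS values. */
typedef enum {
  RELEASE_NORMAL = 0,
  RELEASE_FORCE,
  RELEASE_FORCE_ALL,
  RELEASE_KIND_COUNT /* keep last: number of real values */
} release_kind;

/* Table indexed by release_kind, as tracing code might have. */
static const char *const release_kind_names[] = {"normal", "force",
                                                 "force_all"};

/* Adding an enumerator without extending the table now fails to compile
 * instead of reading past the end of the array at run time. */
_Static_assert(sizeof(release_kind_names) / sizeof(release_kind_names[0]) ==
                   RELEASE_KIND_COUNT,
               "release_kind_names must cover every release_kind value");

const char *release_kind_name(release_kind kind) {
  if ((unsigned)kind >= (unsigned)RELEASE_KIND_COUNT) return "unknown";
  return release_kind_names[kind];
}
```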
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.
This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so we need
few changes. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.
Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.
NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same
TL;DR, we now do this:
```c
struct { start1, stop1, start2, stop2, start3, stop3, version; } device;
struct host = DtoH(lookup("device"));
counters = DtoH(host.stop - host.start);
version = DtoH(host.version);
```
This commit adds C helper functions `dnan2`, `dnorm2` and `dunder` for
handling the less critical edge cases of double-precision arithmetic,
similar to `fnan2`, `fnorm2` and `funder` that were added in commit
f7e652127772e93.
It also adds a header file that defines some register aliases for
handling double-precision numbers in AArch32 software floating point in
an endianness-independent way, by providing aliases `xh` and `xl` for
the high and low words of the first double-precision function argument,
regardless of which of them is in r0 and which in r1, and similarly `yh`
and `yl` for the second argument in r2/r3.
Tests for Android specific behavior don't really belong here since it is
affected by the config which is not necessarily the same on Android.
There are already tests that the config options and flag options work
properly. Android wrapper tests belong to Android.
In the builtins library, most functions have a portable C implementation
(e.g. `mulsf3.c`), and platforms might provide an optimized assembler
implementation (e.g. `arm/mulsf3.S`). The cmake script automatically
excludes the C source file corresponding to each assembly source file it
includes. Additionally, each source file name is automatically
translated into a flag that lit tests can query, with a name like
`librt_has_mulsf3`, to indicate that a function is available to be
tested.
In future commits I plan to introduce cases where a single .S file
provides more than one function (so that they can share code easily),
and therefore, must supersede more than one existing source file.
I've introduced the `crt_supersedes` cmake property, which you can set
on a .S file to name a list of .c files that it should supersede. Also,
the `crt_provides` property can be set on any source file to indicate a
list of functions it makes available for testing, in addition to the one
implied by its name.
Add one new flag, dealloc_align_mismatch, that turns alignment checks
on or off. Add three new config parameters: one for deallocate type
mismatch (such as abort on new/free if true), one for checking if the
size parameter matches on dealloc, and one for checking if the alignment
is correct on a dealloc.
Add extra flags that are passed along to indicate that an align/size
check should be performed.
Update report functions to better indicate the errors. Add unit tests
for all of these.
This is based on these upstream CLs by jcking:
https://github.com/llvm/llvm-project/pull/147735
https://github.com/llvm/llvm-project/pull/146556
Align the beg address down instead of up in __asan_region_is_poisoned(),
so the shadow scan includes the first granule. This fixes a false
negative when the first granule has an unpoisoned prefix and a poisoned
suffix.
Add a test that covers this scenario.
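The false negative can be reproduced with a toy model. This is a simplified sketch, not the asan implementation: the shadow array, the helper names and the `align_down` switch are invented for the example, and the toy shadow only supports 0 (granule fully addressable) and k in 1..7 (first k bytes addressable). Callers must keep `beg + size` within the 64-byte toy region and pass `size > 0`.

```c
#include <stddef.h>
#include <stdint.h>

enum { GRANULE = 8, NUM_GRANULES = 8 };

/* Toy shadow: one byte per 8-byte granule. */
static uint8_t toy_shadow[NUM_GRANULES];

void toy_set_shadow(size_t granule, uint8_t value) {
  toy_shadow[granule] = value;
}

static int toy_byte_poisoned(size_t addr) {
  uint8_t s = toy_shadow[addr / GRANULE];
  return s != 0 && addr % GRANULE >= s;
}

static int toy_shadow_all_zero(size_t first, size_t end) {
  for (size_t g = first; g < end; g++)
    if (toy_shadow[g] != 0) return 0;
  return 1;
}

/* Returns the first poisoned address in [beg, beg + size), or -1.
 * align_down = 0 models rounding beg up (skips the first granule),
 * align_down = 1 models rounding beg down (includes it). */
long toy_region_poisoned(size_t beg, size_t size, int align_down) {
  size_t end = beg + size;
  size_t aligned_b = align_down ? beg / GRANULE * GRANULE
                                : (beg + GRANULE - 1) / GRANULE * GRANULE;
  size_t aligned_e = end / GRANULE * GRANULE;
  /* Fast path: endpoints look fine and the inner shadow is all zero. */
  if (!toy_byte_poisoned(beg) && !toy_byte_poisoned(end - 1) &&
      (aligned_e <= aligned_b ||
       toy_shadow_all_zero(aligned_b / GRANULE, aligned_e / GRANULE)))
    return -1;
  /* Slow path: scan byte by byte. */
  for (size_t a = beg; a < end; a++)
    if (toy_byte_poisoned(a)) return (long)a;
  return -1;
}
```

With `toy_set_shadow(0, 4)`, the region starting at 2 with size 14 contains poisoned bytes 4..7. Rounding up misses them and reports -1; rounding down reports 4.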
Summary:
The changes in https://www.github.com/llvm/llvm-project/pull/185552
allowed us to start building the standard `libclang_rt.profile.a` for
GPU targets.
This PR expands this by adding an optimized GPU routine for counter
increment and removing the special-case handling of these functions in
the OpenMP runtime.
The vast majority of these functions are boilerplate, but we should be able
to do more interesting things with this in the future, like value or
memory profiling.
As per PEP-0394[1], there is no real consensus over what binary names
Python has, specifically 'python' could be Python 3, Python 2, or not
exist.
However, everyone has a python3 interpreter and the scripts are all
written for Python 3. Unify the shebangs so that the ~50% of shebangs
that use python now use python3.
[1] https://peps.python.org/pep-0394/
__asan_region_is_poisoned() uses an exclusive end address
(end = beg + size) to validate the region [beg, end) and to compute
the aligned inner shadow region. This causes correctness issues
near the upper boundary of a memory range and could trigger address
space overflow on 32-bit targets.
1. Incorrect handling of the last byte of a memory range
The implementation checks AddrIsInMem(end) instead of the last
application byte (end - 1). For regions ending at the last byte
of Low/Mid/HighMem (e.g. __asan_region_is_poisoned(kHighMemEnd, 1)),
this returns end (kHighMemEnd + 1) instead of the original
pointer. This behavior is inconsistent with the function’s
semantics and with __asan_address_is_poisoned().
2. Address space overflow and invalid shadow range
If a region ends at the top of the virtual address space (kHighMemEnd),
e.g. on 32-bit targets, end = beg + size could wrap to 0.
This violated the invariant beg < end and could trigger
the CHECK failure.
Additionally, overflow in RoundUpTo alignment computations
for aligned_b could produce an invalid shadow region spanning
LowShadow to HighShadow across ShadowGap, leading mem_is_zero()
to access unmapped memory and crash.
Fix by switching to an inclusive last byte:
last = beg + size - 1
All checks are now performed on beg and last. The aligned inner
shadow region is also computed from [beg, last]. An additional guard
for aligned_b prevents the mapping to shadow if aligned_b has wrapped
(in this case the aligned inner region is also empty and doesn't
require the shadow scan via mem_is_zero()).
This fixes incorrect return values at memory range ends and
prevents overflow-related crashes on 32-bit targets.
The test is extended to cover these boundary cases.
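The wraparound behind the second issue can be shown in isolation. This sketch uses `uint32_t` to model a 32-bit address space; the function names are hypothetical and only the range-validity part of the check is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exclusive end: beg + size wraps to 0 when the region touches the top
 * of the address space, so the beg < end invariant fails for a region
 * that is actually valid. */
bool range_ok_exclusive(uint32_t beg, uint32_t size) {
  uint32_t end = beg + size;
  return beg < end;
}

/* Inclusive last byte: beg + size - 1 only wraps when the region really
 * does run past the top of the address space. */
bool range_ok_inclusive(uint32_t beg, uint32_t size) {
  if (size == 0) return false;
  uint32_t last = beg + (size - 1);
  return beg <= last;
}
```

For beg = 0xFFFFFFF0 and size = 0x10 the exclusive form computes end = 0 and rejects the region, while the inclusive form computes last = 0xFFFFFFFF and accepts it.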
---------
Co-authored-by: Vitaly Buka <vitalybuka@gmail.com>
Currently, when building the Go race detector (when SANITIZER_GO
is set), SANITIZER_WEAK_IMPORT is a no-op. It is perfectly fine to
define SANITIZER_WEAK_IMPORT for Go just like other cases. That
will tell the Go linker to treat _dyld_get_dyld_header as a weak
import.
Perhaps SANITIZER_WEAK_ATTRIBUTE can also be defined for Go. That
would be a separate patch.
Add the architecture-specific pieces needed for the ASan and UBSan
sanitizer runtimes to build and run on hexagon-unknown-linux-musl.
Without this patch, building sanitizer runtimes for Hexagon Linux fails
with:
sanitizer_linux.cpp: error: member access into incomplete type
'struct stat64'
because musl libc does not provide struct stat64. This patch routes
Hexagon through the statx() syscall path (like LoongArch) to avoid the
stat64 dependency entirely.
Changes:
* asan_mapping.h: Add ASAN_SHADOW_OFFSET_CONST (0x20000000) for Hexagon
with shadow layout documentation.
* sanitizer_linux.cpp: Implement internal_clone() for Hexagon using
inline assembly (trap0 syscall, generic clone argument order: flags,
stack, ptid, ctid, tls). Route Hexagon through the statx() path for stat
operations since musl lacks struct stat64.
* sanitizer_linux.h: Add Hexagon to the internal_clone() declaration
guard.
* sanitizer_stoptheworld_linux_libcdep.cpp: Add Hexagon to the
StopTheWorld architecture guard with register definitions.
* sanitizer_asm.h: Define ASM_TAIL_CALL as 'jump' for Hexagon.
* CMakeLists.txt: Add -fno-emulated-tls for Hexagon targets. Hexagon
Linux uses native TLS via the UGP register; emulated TLS produces broken
sanitizer runtimes with unresolvable __emutls references.
As far as I am aware, AOR is no longer used anywhere within LLVM, as
most of the required code has since been ported elsewhere within the
project.
Removes the entire directory, and updates some now outdated comments.
When the sum of two subnormal values is not also subnormal, we need to
set the exponent to one.
Test case:
#include <stdio.h>
static volatile float x = 0x1.362b4p-127;
static volatile float x2 = 0x1.362b4p-127 * 2;
int
main (void)
{
  printf("x %a x2 %a x + x %a\n", x, x2, x + x);
  return x2 == x + x ? 0 : 1;
}
Signed-off-by: Keith Packard <keithp@keithp.com>
This is similar to #185770 in that it removes an
exception-handling-related symbol from `compiler-rt` in favor of having
definitions elsewhere. The compiler-rt library is linked into all shared
objects, for example, which can result in duplicate definitions of a
symbol where this tag wants to have one unique definition. The intention
behind this commit is to defer the definition of this symbol to
downstream libraries, such as the definition of `longjmp` itself. An
example of this is WebAssembly/wasi-libc#772 where the responsibility of
defining this symbol now lies with wasi-libc.
The `__cpp_exception` symbol is now defined in libunwind instead of
compiler-rt. This is moved for a few reasons, but the primary reason is
that compiler-rt is linked into every shared object, which means
that it's not suitable for define-once symbols such as
`__cpp_exception`. Moving the definition to the user of the symbol,
libunwind itself, guarantees that the symbol is defined
exactly once and only when appropriate. A secondary reason for this
move is that it avoids the need to compile compiler-rt twice, once
with exceptions and once without; instead the same build can be used
for both configurations.
Summary:
As suggested in https://github.com/llvm/llvm-project/pull/177665, we
should build a GPU version of the compiler-rt profile library instead of
writing it in-line in the lowering. This PR does not define anything GPU
specific, it simply re-uses the baremetal handling. Later PRs will
add the GPU specific handling we would want to do to optimize
counter handling on the GPU.
Note that this will require using the cache file, or setting these
options manually for existing users. Hopefully if people are using the
cache file as they should it won't break anything.