Summary:
`Status` is unfortunately heavily overloaded in practice. Things like
X11 define it as a macro. Best to just remove that possibility entirely.
Introduce common infrastructure for runtimes that determines compiler
resource path locations. The variables introduced are:
* RUNTIMES_OUTPUT_RESOURCE_DIR
* RUNTIMES_INSTALL_RESOURCE_PATH
They contain the location of the compiler resource path (typically
`lib/clang/<version>`) in the build tree and the install tree (the
latter relative to CMAKE_INSTALL_PREFIX).
Additionally, define
* RUNTIMES_OUTPUT_RESOURCE_LIB_DIR
* RUNTIMES_INSTALL_RESOURCE_LIB_PATH
for the location of clang/flang version-locked libraries (typically
`lib${LLVM_LIBDIR_SUFFIX}/<target-triple>`, but also depends on `APPLE`
and `LLVM_ENABLE_PER_TARGET_RUNTIME_DIR`). This code is moved from
flang-rt and initially becomes its only user.
Refactored out of #171610 as requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2687382481).
Extracted `get_runtimes_target_libdir_common` from compiler-rt as
requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2689565634).
Added TODO comments to all runtimes as requested
[here](https://github.com/llvm/llvm-project/pull/171610#issuecomment-3789598635).
Add support for non-allocatable module-level CUDA managed variables
using pointer indirection through a companion global in
__nv_managed_data__. The CUDA runtime populates this pointer with the
unified memory address via __cudaRegisterManagedVar and
__cudaInitModule.
- Create a .managed.ptr companion global in the __nv_managed_data__
section and register it with _FortranACUFRegisterManagedVariable
- Call __cudaInitModule once after all variables are registered, only
when non-allocatable managed globals are present, to populate managed
pointers
- Annotate managed globals in gpu.module with nvvm.managed for PTX
.attribute(.managed) generation
- Suppress cuf.data_transfer for assignments to/from non-allocatable
module managed variables, since cudaMemcpy would target the shadow
address rather than the actual unified memory
- Preserve cuf.data_transfer for device_var = managed_var assignments
where explicit transfer is still required
Note: This PR depends on
[#189751](https://github.com/llvm/llvm-project/pull/189751) (MLIR:
nvvm.managed attribute).
Add support for non-allocatable module-level CUDA managed variables
using pointer indirection through a companion global in
__nv_managed_data__. The CUDA runtime populates this pointer with the
unified memory address via __cudaRegisterManagedVar and
__cudaInitModule.
1. Create a .managed.ptr companion global in the __nv_managed_data__
section and register it with _FortranACUFRegisterManagedVariable
(CUFAddConstructor.cpp)
2. Call __cudaInitModule after registration to populate the managed
pointer (registration.cpp)
3. Annotate managed globals in gpu.module with nvvm.managed for PTX
.attribute(.managed) generation (cuda-code-gen.mlir)
4. Suppress cuf.data_transfer for assignments to/from non-allocatable
module managed variables, since cudaMemcpy would target the shadow
address rather than the actual unified memory (tools.h)
5. Preserve cuf.data_transfer for device_var = managed_var assignments
where explicit transfer is still required
Summary:
This PR primarily switches from `LLVM_RUNTIMES_TARGET` to
`LLVM_DEFAULT_TARGET_TRIPLE`. The reason is that the default target
triple is the true cross-compiling architecture we are using, while the
runtimes_target can contain multilib strings like `+debug` or similar.
Additionally, add the proper path handling to the OpenMP / Offload
libraries.
If a project depends on the Flang runtime and on libc++, linking fails
because `std::__libcpp_verbose_abort` is defined in both libraries.
Avoid that duplicate definition by defining `_LIBCPP_VERBOSE_ABORT`
before including any C++ headers and by renaming that symbol in the
Flang runtime to `flang_rt_verbose_abort`.
The function that is modified was originally introduced in D158957 to
solve an undefined symbol error when linking pure-Fortran projects with
the Flang runtime.
Providing a definition for that symbol in the Flang runtime might work
correctly for ELF or Mach-O if that symbol has weak linkage in libc++.
But at least for COFF, this now causes multiple-definition errors for
projects that are linking to the Flang runtime and to libc++.
The linker errors before this change for Windows/MinGW using
Clang+Flang+lld look like this:
```
ld.lld: error: duplicate symbol: std::__1::__libcpp_verbose_abort(char const*, ...)
>>> defined at libflang_rt.runtime.a(io-api-minimal.cpp.obj)
>>> defined at libc++.dll.a(libc++.dll)
```
Allow setting both FLANG_RT_ENABLE_SHARED and FLANG_RT_ENABLE_STATIC to
OFF at the same time.
This is extracted out of #171515 to make that PR a little smaller. By
itself it makes little sense since if not building either the `.a` or
the `.so`, you are not building anything. But with #171515, the module
files are still built, allowing building the module files without the
library. This is mostly intended for GPGPU targets where building the
library is not always needed, but the module files are.
On Darwin, `sys/mman.h` hides `MAP_JIT` and `MAP_ANON(YMOUS)` when
`_POSIX_C_SOURCE` is defined unless `_DARWIN_C_SOURCE` is also defined.
`trampoline.cpp` uses those flags, so this change defines
`_DARWIN_C_SOURCE` before including `<sys/mman.h>` in this file.
Fixes build failure reported in #183108.
Co-authored-by: Sairudra More <moresair@pe31.hpc.amslabs.hpecorp.net>
- Use TEST_F instead of TEST so CrashHandlerFixture::SetUp() is actually
called, registering the custom crash handler for death tests.
- Move putenv/executionEnvironment.Configure calls inside EXPECT_EXIT
blocks so they run in the forked child process, preventing the
NO_STOP_MESSAGE environment variable and configured global state from
leaking into subsequent tests.
- Replace const_cast<char *>("NO_STOP_MESSAGE=1") with a mutable static
char array, to avoid casting away constness of a string literal.
- Update CrashTest's expected pattern to match the output format of the
custom crash handler installed by CrashHandlerFixture, which was
previously never invoked due to the TEST vs TEST_F bug. (Note: there was
a buildbot failure related to this:
https://lab.llvm.org/buildbot/#/builders/130/builds/18413 )
Assisted-by: AI
Change varName parameter from `const char *` to `char *` in
CUFRegisterManagedVariable to match the CUDA runtime API signature of
__cudaRegisterManagedVar, which declares deviceAddress as `char *`.
Add CUFRegisterManagedVariable runtime wrapper in flang-rt that calls
__cudaRegisterManagedVar.
This is preparation for supporting non-allocatable managed variables.
No functional change -- nothing calls this yet.
LZ: processor-dependent (default, flang prints leading zero); LZS:
suppress the optional leading zero before the decimal point; LZP: print
the optional leading zero before the decimal point. Changes span the
source parser, compile-time format validator, runtime format processing,
and runtime output formatting. Includes semantic test (io18.f90) and
documentation updates.
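The effect of the three modes on a fixed-point value can be sketched in C++ as below. This is only an illustration of the observable behavior described above; the enum `LeadingZero` and the function `FormatFixed` are hypothetical names, field-width handling is omitted, and the runtime's actual formatting code is not shown here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical enum mirroring the LZ, LZS and LZP control edit descriptors.
enum class LeadingZero { Processor, Suppress, Print };

// Formats `value` with `digits` fractional digits. snprintf always emits the
// optional zero before the decimal point, which matches both LZP and the
// processor-dependent default described above; LZS removes it.
std::string FormatFixed(double value, int digits, LeadingZero mode) {
  char buffer[64];
  std::snprintf(buffer, sizeof buffer, "%.*f", digits, value);
  std::string text{buffer};
  std::size_t zeroAt = (text[0] == '-') ? 1 : 0;
  // The zero is optional only when it is the sole digit before the point.
  bool hasOptionalZero = text.compare(zeroAt, 2, "0.") == 0;
  if (mode == LeadingZero::Suppress && hasOptionalZero) {
    text.erase(zeroAt, 1);
  }
  return text;
}
```

For example, `FormatFixed(0.5, 2, LeadingZero::Suppress)` yields `.50`, while the other two modes yield `0.50`.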
This is the implementation of part of F2023 new feature US 03.
Extracting tokens from a string, SPLIT intrinsic.
It's section 16.9.196 SPLIT (STRING, SET, POS [, BACK]) of Fortran 2023
Standard.
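As a rough illustration of the semantics, here is a C++ sketch of how POS advances. It reflects my reading of the standard's description (forward: next delimiter after POS, else length + 1; backward: previous delimiter before POS, else 0) and is not the runtime implementation. The function name `Split` is hypothetical, positions are 1-based as in Fortran, and the new POS is returned rather than updated in place.

```cpp
#include <algorithm>
#include <string>

// Returns the new value of POS for SPLIT(STRING, SET, POS [, BACK]).
int Split(const std::string &str, const std::string &set, int pos,
    bool back = false) {
  int length = static_cast<int>(str.size());
  if (!back) {
    // Leftmost delimiter whose position is greater than POS.
    for (int i = std::max(pos + 1, 1); i <= length; ++i) {
      if (set.find(str[i - 1]) != std::string::npos) {
        return i;
      }
    }
    return length + 1; // no delimiter found
  }
  // Rightmost delimiter whose position is less than POS.
  for (int i = std::min(pos - 1, length); i >= 1; --i) {
    if (set.find(str[i - 1]) != std::string::npos) {
      return i;
    }
  }
  return 0; // no delimiter found
}
```

Starting from `pos = 0` and calling `Split` repeatedly on `"a,b,c"` with the set `","` visits positions 2, 4 and then 6, so each token lies between consecutive results.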
It's part of Flang issue
[#178044](https://github.com/llvm/llvm-project/issues/178044). Note that
I work with @kwyatt-ext on this issue. He implemented the other part,
TOKENIZE.
A test will be added into
[llvm-test-suite](https://github.com/llvm/llvm-test-suite) later after
this PR is merged.
NOTE: This is a new pull request, as the prior didn't have labels
properly applied.
If a bad subscript is provided in a namelisted record, the
HandleSubscripts() routine can read off into infinity. This patch
ensures that a read will not go beyond the rank of the expected
variable.
The failure will then be captured in the return status (IOSTAT) of the
READ.
The small test demonstrates the failure before and after the fix.
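The kind of bound check involved can be sketched as below. This is a simplified stand-in, not the code in HandleSubscripts(): the function name `ParseSubscripts` is hypothetical, only unsigned integer subscripts without blanks or triplets are accepted, and a `false` result stands for the error that the caller would report through IOSTAT.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Parses "(s1,s2,...)" and refuses to read more subscripts than `rank`.
bool ParseSubscripts(
    const std::string &text, int rank, std::vector<long> &subscripts) {
  subscripts.clear();
  std::size_t at{0};
  if (text.empty() || text[at++] != '(') {
    return false;
  }
  while (at < text.size()) {
    if (static_cast<int>(subscripts.size()) == rank) {
      return false; // more subscripts than the variable has dimensions
    }
    std::size_t start{at};
    long value{0};
    while (at < text.size() &&
        std::isdigit(static_cast<unsigned char>(text[at]))) {
      value = 10 * value + (text[at++] - '0');
    }
    if (at == start) {
      return false; // no digits where a subscript was expected
    }
    subscripts.push_back(value);
    if (at < text.size() && text[at] == ')') {
      return true;
    }
    if (at >= text.size() || text[at++] != ',') {
      return false;
    }
  }
  return false; // input ended before the closing parenthesis
}
```

The rank comparison at the top of the loop is the point of the sketch: the parser stops as soon as it would exceed the number of dimensions, instead of continuing to consume input.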
---------
Co-authored-by: Kevin Wyatt <kwyatt@hpe.com>
Flang currently lowers internal procedures passed as actual arguments
using LLVM's `llvm.init.trampoline` / `llvm.adjust.trampoline`
intrinsics, which require an executable stack. On modern Linux
toolchains and security-hardened kernels that enforce W^X (Write XOR
Execute), this causes link-time failures (`ld.lld: error: ... requires
an executable stack`) or runtime `SEGV` from NX violations.
This patch introduces a runtime trampoline pool that allocates
trampolines from a dedicated `mmap`'d region instead of the stack. The
pool toggles page permissions between writable (for patching) and
executable (for dispatch), so the stack stays non-executable throughout.
On macOS, MAP_JIT and `pthread_jit_write_protect_np` are used for the
same effect. An i-cache flush (`__builtin___clear_cache` on Linux,
`sys_icache_invalidate` on macOS) is performed after each write→exec
transition.
The feature is gated behind a new driver flag, `-fsafe-trampoline` (off
by default), which threads through the frontend into the
`BoxedProcedurePass`. When enabled, the pass emits calls to
`_FortranATrampolineInit`, `_FortranATrampolineAdjust`, and
`_FortranATrampolineFree` instead of the legacy intrinsics. The legacy
path is completely untouched when the flag is off.
The pool is a singleton with a fixed capacity (default 1024 slots,
overridable via `FLANG_TRAMPOLINE_POOL_SIZE`). Slot size varies by
target (32 bytes on x86-64/AArch64, 48 on PPC64, 64 fallback). Each slot
holds a small architecture-specific stub, currently x86-64 (17 bytes,
using `r10` as the nest/static-chain register) and AArch64 (24 bytes,
using `x15`). The implementation compiles on all architectures but will
crash at runtime with a clear diagnostic if trampoline emission is
actually attempted on an unsupported target. This avoids breaking the
flang-rt build on e.g. RISC-V or PPC64.
Freed slots are poisoned (the callee pointer is overwritten with a
sentinel) and recycled into a freelist, so the pool can sustain
long-running programs that repeatedly create and destroy closures.
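The poison-and-recycle scheme can be sketched with an index-based freelist. Everything in this example is an assumption made for illustration: the names `SlotPool`, `Slot` and `kPoisonedCallee`, the slot layout, and the sentinel value are hypothetical, and locking and executable memory handling are left out entirely.

```cpp
#include <cstdint>

// Hypothetical sentinel written over the callee pointer of a freed slot.
constexpr std::uintptr_t kPoisonedCallee = ~std::uintptr_t{0};

struct Slot {
  std::uintptr_t callee;
  std::uintptr_t staticChain;
  int nextFree; // index of the next free slot, or -1
};

template <int CAPACITY> class SlotPool {
  static_assert(CAPACITY > 0, "pool needs at least one slot");

public:
  SlotPool() : freeHead_{0} {
    // Initially every slot is free and chained to its successor.
    for (int i = 0; i < CAPACITY; ++i) {
      slots_[i] = Slot{kPoisonedCallee, 0, i + 1 < CAPACITY ? i + 1 : -1};
    }
  }
  // Returns a slot index, or -1 when the pool is exhausted.
  int Acquire(std::uintptr_t callee, std::uintptr_t staticChain) {
    if (freeHead_ < 0) {
      return -1;
    }
    int index = freeHead_;
    freeHead_ = slots_[index].nextFree;
    slots_[index] = Slot{callee, staticChain, -1};
    return index;
  }
  // Poisons the slot and pushes it back onto the freelist for reuse.
  void Release(int index) {
    slots_[index] = Slot{kPoisonedCallee, 0, freeHead_};
    freeHead_ = index;
  }
  bool IsPoisoned(int index) const {
    return slots_[index].callee == kPoisonedCallee;
  }

private:
  Slot slots_[CAPACITY];
  int freeHead_;
};
```

With a fixed capacity, recycling released slots is what lets a long-running program keep creating closures without exhausting the pool.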
A few design choices worth calling out:
The runtime avoids all C++ runtime dependencies: no `std::mutex`, no
`operator new`, no function-local statics with hidden guard variables.
Locking is via flang-rt's own `Lock` / `CriticalSection`, memory is via
`AllocateMemoryOrCrash` / `FreeMemory`, and the singleton uses explicit
double-checked locking with a raw pointer. This was done so the
trampoline pool links cleanly in minimal / freestanding flang-rt
configurations.
`_FortranATrampolineFree` calls are inserted immediately before every
`func.return` in the enclosing host function. This is a conservative but
correct strategy. The trampoline handle cannot outlive the host's stack
frame since the closure captures the host's local variables by
reference.
The GNU_STACK note is verified via a dedicated integration test
(`safe-trampoline-gnustack.f90`) that compiles and links a Fortran
program using the runtime path, then inspects the ELF with
`llvm-readelf` to confirm the stack segment is `RW` (not `RWE`).
**Test coverage:**
- `flang/test/Driver/fsafe-trampoline.f90` — flag forwarding (on, off,
default)
- `flang/test/Fir/boxproc-safe-trampoline.fir` — FIR-level FileCheck for
emitted runtime calls
- `flang/test/Lower/safe-trampoline.f90` — end-to-end lowering
- `flang-rt/test/Driver/safe-trampoline-gnustack.f90` — GNU_STACK ELF
verification
Closes #182813
Co-authored-by: Sairudra More <moresair@pe31.hpc.amslabs.hpecorp.net>
Previously the error message was copied, but not padded for cases where
the message was shorter than the passed CMDMSG string. Add the padding
and also change the test case to test padding on all platforms.
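The copy-then-pad behavior can be sketched as below. The helper name `CopyAndPad` is hypothetical and this is not the runtime's own routine; it only shows the blank-fill that a fixed-length Fortran CHARACTER argument such as CMDMSG needs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Copies `message` into a fixed-length Fortran character buffer of
// `toLength` bytes. A shorter message is padded with blanks, a longer one
// is truncated, and no null terminator is written.
void CopyAndPad(char *to, std::size_t toLength, const char *message) {
  std::size_t copied = std::min(toLength, std::strlen(message));
  std::memcpy(to, message, copied);
  std::memset(to + copied, ' ', toLength - copied);
}
```

Without the `memset`, the tail of the caller's buffer would keep whatever characters it held before the call.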
Summary:
This, as far as I am aware, has mostly been superseded by the runtimes
build that's built on top of libc. This build links 30% faster, supports
more functionality, and uses 95% less disk space, so it seems to be the
direction we want to go.
CUDA support remains; removing it is not needed urgently.
Detect the special cmd.exe status code 9009, which indicates a "command
not found" condition. Crash the process if "command not found" is
detected when CMDSTAT was not specified.
If a comment appears immediately after a logical value in a NAMELIST
file, the flang runtime returns IostatGenericError. No error occurs when
a space precedes the exclamation point. Add code to handle a comment
while parsing logical values.
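The case being fixed can be illustrated with a small stand-alone parser. This is a deliberately simplified sketch with a hypothetical name, `ParseNamelistLogical`; the real list-directed input rules are more involved, and the point here is only that `!` must end the value in the same way a blank or separator does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Parses a logical value such as "T", ".true." or "F!comment". Returns
// false when the field does not start with a logical value.
bool ParseNamelistLogical(const std::string &field, bool &value) {
  std::size_t at{0};
  while (at < field.size() && field[at] == ' ') {
    ++at; // skip leading blanks
  }
  if (at < field.size() && field[at] == '.') {
    ++at; // optional leading period
  }
  if (at >= field.size()) {
    return false;
  }
  char first = static_cast<char>(
      std::toupper(static_cast<unsigned char>(field[at])));
  if (first != 'T' && first != 'F') {
    return false;
  }
  ++at;
  // Skip the rest of the word, stopping at a value separator or at '!',
  // which starts a comment even when no blank precedes it.
  while (at < field.size()) {
    char ch = field[at];
    if (ch == ' ' || ch == ',' || ch == '/' || ch == '!') {
      break;
    }
    ++at;
  }
  value = first == 'T';
  return true;
}
```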
Co-authored-by: John Otken <john.otken@hpe.com>
EXECUTE_COMMAND_LINE() without CMDSTAT initiated termination in runtime
if the command returned non-zero status code. For example,
EXECUTE_COMMAND_LINE('false') on Linux would cause "fatal Fortran
runtime error... : Command line execution failed with exit code: 1."
This is too strict: EXECUTE_COMMAND_LINE() successfully called 'false';
it is just that 'false' happened to return a non-zero status code. ifx
and gfortran don't initiate termination in this case. Changed the
EXECUTE_COMMAND_LINE() implementation to behave in a similar fashion.
Also during testing discovered that when the output of the program that
uses EXECUTE_COMMAND_LINE(... WAIT=.false.) is piped to a file, the
resulting file has duplicated output lines. This was because fork()
command also ends up duplicating parent's buffered output to the child.
Added flush of all units and C stdio before calling fork().
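Both points can be sketched for POSIX systems as follows. The function name `RunCommand` is hypothetical and this is not the runtime's implementation: the sketch flushes only C stdio (the runtime also flushes Fortran units) and it waits for the child, whereas the duplicated output was observed in the WAIT=.false. case.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `command` through /bin/sh and returns its exit status, or -1 if the
// command could not be started or did not exit normally. A non-zero exit
// status is returned to the caller, not treated as a fatal error.
int RunCommand(const char *command) {
  // Flush all open output streams first. fork() copies the parent's memory,
  // including unflushed stdio buffers, so both processes would otherwise
  // write the same pending output.
  std::fflush(nullptr);
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    execl("/bin/sh", "sh", "-c", command, static_cast<char *>(nullptr));
    _exit(127); // only reached if exec failed
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```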
The ISO Fortran standard requires that numeric output editing produce
the full word "Infinity", rather than my current "Inf", when the output
field is wide enough to hold it. Comply.
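The width rule can be sketched as below. The thresholds (3 characters for "Inf", 8 for "Infinity", plus one for a minus sign) reflect my reading of the standard and should be checked against it. The function name `FormatInfinity` is hypothetical, the optional plus sign is never produced, and the sketch assumes a positive field width.

```cpp
#include <cstddef>
#include <string>

// Right-justifies "Infinity" or "Inf" in a field of `width` characters, or
// fills the field with asterisks when even "Inf" does not fit.
std::string FormatInfinity(int width, bool negative) {
  std::string sign = negative ? "-" : "";
  int signLength = static_cast<int>(sign.size());
  if (width < 3 + signLength) {
    return std::string(static_cast<std::size_t>(width > 0 ? width : 0), '*');
  }
  std::string body = sign + (width >= 8 + signLength ? "Infinity" : "Inf");
  std::size_t padding = static_cast<std::size_t>(width) - body.size();
  return std::string(padding, ' ') + body;
}
```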
This implements the TOKENIZE intrinsic per the Fortran 2023 Standard.
TOKENIZE is a more complicated addition to the flang intrinsics, as it
is the first subroutine that has multiple unique footprints. Intrinsic
functions have already addressed this challenge, however subroutines and
functions are processed slightly differently and the function code was
not a good 1:1 solution for the subroutines. To solve this the function
code was used as an example to create error buffering within the
intrinsics Process and select the most appropriate error message for a
given subroutine footprint.
A simple FIR compile test was added to show the proper compilation of
each case. A thorough negative path test has also been added, ensuring
that all possible errors are reported as expected.
Testing prior to commit:
= check-flang ==========================================
```
Testing Time: 139.51s
Total Discovered Tests: 4153
Unsupported : 77 (1.85%)
Passed : 4065 (97.88%)
Expectedly Failed: 11 (0.26%)
FLANG Container Test completed 2 minutes (160 s).
Total Time: 2 minutes (160 s)
Completed : Wed Feb 11 04:05:50 PM CST 2026
```
= check-flang-rt ==========================================
```
Testing Time: 1.55s
Total Discovered Tests: 258
Passed: 258 (100.00%)
FLANG Container Test completed 0 minutes (55 s).
Total Time: 0 minutes (56 s)
Completed : Wed Feb 11 04:08:32 PM CST 2026
```
= llvm-test-suite ==========================================
```
Testing Time: 1886.64s
Total Discovered Tests: 6926
Passed: 6926 (100.00%)
CCE SLES Container debug compile completed 31 minutes (1895 s).
CCE SLES Container debug install completed in 0 minutes (0 s).
Total Time: 31 minutes (1895 s)
Completed : Wed Feb 11 05:46:52 PM CST 2026
```
Additionally, (FYI) an executable test has been written and will be
added to the llvm-test-suite under a separate PR.
---------
Co-authored-by: Kevin Wyatt <kwyatt@hpe.com>
Summary:
This enables primarily `stop.cpp` and `descriptor.cpp`. Requires a
little bit of wrangling to get it to compile. Unlike the CUDA build,
this build uses an in-tree libc++ configured for the GPU. This is
configured without thread support, environment, or filesystem, and it is
not POSIX at all. So, no mutexes, pthreads, or get/setenv.
I tested stop, but I don't know if it's actually legal to exit from
OpenMP offloading.
Add specific lowering and an entry point for cudaStreamDestroy. Since we
keep the associated stream for some allocations, we need to reset it when
the stream is destroyed so we don't use it anymore.
This is a follow up on #182635
It was suggested to place `static_assert(std::is_trivially_destructible_v<A>)`
in the `OwningPtr` class. This cannot be done, because there are
non-trivially destructible types used with `OwningPtr` (e.g. lots of types
that inherit from `IoErrorHandler`, which is not trivially destructible).
This patch brings back the destructor call into `OwningPtr::delete_ptr`
just to be on the safe side (though, I do not think we had any memory
leaks even without the destructor call), and removes the cyclic
dependency for the `~ChildIo()` caused by `previous_` member.
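Why the explicit destructor call matters can be sketched with a simplified owning pointer over `malloc`'d storage. The class `OwningPtrSketch` is hypothetical and is not flang-rt's `OwningPtr`; only the shape of `delete_ptr` (destroy first, then free) is meant to mirror the description above.

```cpp
#include <cstdlib>
#include <new>
#include <utility>

template <typename A> class OwningPtrSketch {
public:
  // Constructs an A in raw storage with placement new (no operator new).
  template <typename... Args> static OwningPtrSketch Make(Args &&...args) {
    void *storage = std::malloc(sizeof(A));
    if (!storage) {
      std::abort();
    }
    return OwningPtrSketch{new (storage) A(std::forward<Args>(args)...)};
  }
  OwningPtrSketch(OwningPtrSketch &&that) : ptr_{that.ptr_} {
    that.ptr_ = nullptr;
  }
  OwningPtrSketch(const OwningPtrSketch &) = delete;
  OwningPtrSketch &operator=(const OwningPtrSketch &) = delete;
  ~OwningPtrSketch() { delete_ptr(ptr_); }
  A *get() const { return ptr_; }

private:
  explicit OwningPtrSketch(A *p) : ptr_{p} {}
  static void delete_ptr(A *p) {
    if (p) {
      p->~A(); // required when A is not trivially destructible
      std::free(p); // then release the raw storage
    }
  }
  A *ptr_;
};
```

If `delete_ptr` only freed the storage, any resources owned by members of `A` would never be released for non-trivially destructible types.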
ASYNCHRONOUS="YES" is not permitted for either a parent or child data
transfer statement in ISO Fortran (F'2023 12.6.4.8.3 p19). Not that it
matters much -- we don't support true asynchronous I/O anyway -- but
someday we might, and in the meantime it's nice to be able to pass tests
that check conformance.
Add an environment variable (FORT_NO_EMPTY_ALLOCATION) that, when set to
1, changes the behavior of an ALLOCATE statement so that it will fail on
an empty allocation rather than its default behavior of allocating one
byte.
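The described behavior can be sketched as below. The function names `NoEmptyAllocation` and `AllocateBytes` are hypothetical, and reading the variable on every call is an assumption made for brevity; how and when the runtime actually reads it is not shown in this description.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// True when FORT_NO_EMPTY_ALLOCATION is set to "1" in the environment.
bool NoEmptyAllocation() {
  const char *value = std::getenv("FORT_NO_EMPTY_ALLOCATION");
  return value && std::strcmp(value, "1") == 0;
}

// Returns nullptr on failure, which the caller would report as an
// ALLOCATE error.
void *AllocateBytes(std::size_t bytes) {
  if (bytes == 0) {
    if (NoEmptyAllocation()) {
      return nullptr; // opt-in: an empty allocation fails
    }
    bytes = 1; // default: allocate one byte so the pointer is valid
  }
  return std::malloc(bytes);
}
```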
Summary:
Expands on the previous support to enable formatted output, characters,
and checking basic iostat. We intentionally do not handle cases where
the descriptor is non-null as this is a non-trivial class that cannot
easily be shepherded across the wire.
This fixes the test on macOS. Without this change the SDK sysroot is not
set and so the library path is incorrect and the 'System' library cannot
be found.
Test with https://github.com/llvm/llvm-project/pull/182501 so that the
sysroot variable is correctly set.
Assisted-by: Codex
Summary:
This PR provides the minimal support for Fortran I/O coming from a GPU
in OpenMP offloading. We use the same support the `libc` uses for its
printing through the RPC server. The helper functions `rpc::dispatch`
and `rpc::invoke` help make this mostly automatic.
Because Fortran I/O is not reentrant, the vast majority of complexity
comes from needing to stitch together calls from the GPU until they can
be executed all at once. This is needed not only because of the
limitations of recursive I/O, but without this the output would all be
interleaved because of the GPU's lock-step execution.
As such, the return values from the intermediate functions are
meaningless, all returning true. The final value is correct however. For
cookies we create a context pointer on the server to chain these
together.
Works on both my AMD and NVIDIA GPUs.
```fortran
program hello_gpu
implicit none
!$omp target teams num_teams(1)
!$omp parallel num_threads(2)
! Print strings
print *, "Hello from GPU"
!$omp end parallel
!$omp end target teams
end program hello_gpu
```
```console
> flang hello.f90 -O2 -fopenmp --offload-arch=gfx1030
> ./a.out
Hello from GPU
Hello from GPU
> flang hello.f90 -O2 -fopenmp --offload-arch=sm_89
> ./a.out
Hello from GPU
Hello from GPU
```
The unit test `Reductions.InfSums` defines a test array descriptor with
shape 2x3 (i.e. 6 elements), but only provides values for 2 elements.
The result is access of likely uninitialized memory when accessing the
additional 4 elements. In most cases the additional values get gobbled
up by the infinity, but if it happens to be NaN or the negated infinity,
the result becomes NaN and fails the test.
Fix by reducing the shape of the test array to 2. Fixes the flakiness of
the test on the flang-x86_64-windows buildbot.
Implement `F_C_STRING` to convert a Fortran string to a C
null-terminated string. Documented in F2023 Standard: 18.2.3.9
`F_C_STRING (STRING [, ASIS])`.
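The semantics can be sketched in C++ as below, based on my reading of the standard: trailing blanks are trimmed unless ASIS is true, and a null character is appended. The function name `FCString` is hypothetical, and a `std::string` stands in for a blank-padded Fortran character value.

```cpp
#include <cstddef>
#include <string>

// Returns the C representation of a Fortran string: optionally trimmed of
// trailing blanks, followed by an explicit null character.
std::string FCString(const std::string &fortranString, bool asis = false) {
  std::string result{fortranString};
  if (!asis) {
    std::size_t last = result.find_last_not_of(' ');
    // An all-blank string trims to the empty string.
    result.erase(last == std::string::npos ? 0 : last + 1);
  }
  result.push_back('\0');
  return result;
}
```

Passing `asis = true` keeps the padding, which matters when the trailing blanks are significant to the C callee.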