We have some modifications downstream to compile the flang runtime for
amdgpu using clang OpenMP, some more hacky than others to workaround
(hopefully temporary) compiler issues. The additions here are the
non-hacky alterations.
Main changes:
* Create freestanding versions of memcpy, strlen and memmove, and
replace std:: references with these so that we can default to std:: when
it's available, or our own Flang implementation when it's not. * Wrap
more bits and pieces of the library in declare target wrappers (RT_*
macros). * Fix some warnings that'll pose issues with werror on, in this
case having the namespace infront of variables passed to templates.
Another minor issues that'll likely still pop up depending on the
program you're linking with is that abort will be undefined, it is
perhaps possible to solve it with a freestanding implementation as with
memcpy etc. but we end up with multiple definitions in this case. An
alternative is to create an empty extern "c" version (which can be empty
or forwrd on to the builtin).
Co-author: Dan Palermo Dan.Palermo@amd.com
Rework derived type initialization in the runtime to just initialize the
first element of any array, and then memcpy it to the others, rather
than exercising the per-component paths for each element.
Reword derived type destruction in the runtime to detect and exploit a
fast path for allocatable components whose types themselves don't need
nested destruction.
Small tweaks were made in hot paths exposed by profiling in descriptor
operations and derived type assignment.
Recursion, both direct and indirect, prevents accurate stack size
calculation at link time for GPU device code. Restructure these
recursive (often mutually so) routines in the Fortran runtime with new
implementations based on an iterative work queue with
suspendable/resumable work tickets: Assign, Initialize, initializeClone,
Finalize, and Destroy.
Default derived type I/O is also recursive, but already disabled. It can
be added to this new framework later if the overall approach succeeds.
Note that derived type FINAL subroutine calls, defined assignments, and
defined I/O procedures all perform callbacks into user code, which may
well reenter the runtime library. This kind of recursion is not handled
by this change, although it may be possible to do so in the future using
thread-local work queues.
(Relanding this patch after reverting initial attempt due to some test
failures that needed some time to analyze and fix.)
Fixes https://github.com/llvm/llvm-project/issues/142481.
Recursion, both direct and indirect, prevents accurate stack size
calculation at link time for GPU device code. Restructure these
recursive (often mutually so) routines in the Fortran runtime with new
implementations based on an iterative work queue with
suspendable/resumable work tickets: Assign, Initialize, initializeClone,
Finalize, and Destroy.
Default derived type I/O is also recursive, but already disabled. It can
be added to this new framework later if the overall approach succeeds.
Note that derived type FINAL subroutine calls, defined assignments, and
defined I/O procedures all perform callbacks into user code, which may
well reenter the runtime library. This kind of recursion is not handled
by this change, although it may be possible to do so in the future using
thread-local work queues.
The effects of this restructuring on CPU performance are yet to be
measured.