This commit adds required files into the offload build closure,
which means adding RT_API_ATTRS and other markers.
The implementation does not work for CUDA yet, because of
std::variant,swap,reverse usage. These issues will be resolved
separately (e.g. by using libcudacxx header files).
A file unit is emulated via a temporary buffer that accumulates
the output, which is printed out via std::printf at the end
of the IO statement. This implementation will be used for the offload
devices.